HACKER Q&A
📣 2gremlin181

Why are most status pages delayed?


As I type this, Reddit is down. My own requests return 500s, and Down Detector reports an outage, but Reddit itself says all systems are operational.

This is a pattern that I have noticed time and time again with many services. Why even have a status page if it is not going to be accurate in real time? It's also not uncommon that smaller issues never get acknowledged.

Is this a factor of how Atlassian Statuspage works?

Edit: Redditstatus finally acknowledged the issue as of 04:27 PST, a good 20+ minutes after Down Detector's charts showed the spike.


  👤 bberenberg Accepted Answer ✓
Alert fatigue. Down Detector will show an outage for a service even when only the intermediate network is down. Companies have to triage alerts, and only once issues are validated are they posted to the status page. Some companies abuse this to hide their outages; others delay in a reasonable manner.

I have considered building something to address this and even own honeststatuspage.com to eventually host it on. But it’s a complex problem without an obviously correct answer.


👤 Bender
Bureaucracy. Companies have service level agreements with other companies. They want to be damn sure an outage can no longer be disavowed before anything official says there is an outage. In most cases there will be a process for updating the status page that intentionally has many bureaucratic hurdles to jump through, including many approvals. The preference will often be to downgrade an "outage" to a "degradation" or "partial outage" or some other term to downplay it and avoid having to pay credits on their B2B service level agreements and such.

👤 xeonmc
Status pages can be replaced with the video feed of a webcam pointed at a whiteboard with post-it notes manually updated by employees.

👤 zokier
> It's also not uncommon that smaller issues never get acknowledged.

You kinda answered your own question here. The intent of a status page is to report major issues, not every small glitch.


👤 bithaze
> Why even have a status page if it is not going to be accurate in real time?

The funny thing is Reddit's status page used to have real-time graphs of things like error rate, comment backlog, visits, etc. There were no numbers on the Y-axis, so you could really only see relative changes, but they were still helpful for spotting problems before humans got around to updating the status page.


👤 amelius
Because legal rights can be derived from it.

👤 knorker
To add to the reasons others gave: It needs to be correct.

Engineers are working the problem. They have a pretty good understanding of the impact of the outage. Then an external comms person asks an engineer to proofread the external outage comms, which triggers rounds of "no, this part is not technically correct" and "I know the internal scope of the impact, but not how that maps to the external product names you want to communicate".

Sure, it'd be nice if the message "we are investigating an issue with… uh… some products" would come up faster.


👤 arccy
Another way to look at it is: you already know the service is down because you can't use it. The status page being manually updated means someone is aware and actually working on fixing it, rather than it being automated and the other side just ignoring it...

👤 anomaloustho
It’s already been said, but most companies already have those instant “alarms” that go off within minutes. 80% of the time, those alarms are red herrings that get triaged. At a lot of companies, they go off constantly.

As a company, you don’t want to declare an outage readily and you definitely don’t want it to be declared frequently. Declaring an outage frequently means:

• Telling your exec team that your department is not running well
• Negative signal to your investors
• Bad reputation with your customers
• Admitting culpability to your customers and partners (inviting lawsuits and refunds)
• Telling your engineering leadership team that your specific team isn't running well
• Messing up your quarterly goals, bonuses, etcetera for outages that aren't real

So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real. You want to make sure you get it right. Therefore, you don’t just want to flip a status page because a few API calls had a timeout.


👤 swiftcoder
Because for most major sites, updating the status page requires (a significant number of) humans in the loop.

Back when I worked at a major cloud provider (which admittedly was >5 years ago), our alarms would go off after ~3-15 minutes of degraded functionality (depending on the sensitivity settings of that specific alarm). At that point the on-call gets paged to investigate and validate that the issue is real (and not trivially correctable). There was also automatic escalation if the on-call didn't acknowledge the issue within 15 minutes.

If the issue is real, a manager gets paged in to coordinate the response, and if the manager considers the outage to be serious (or to affect a key customer), a director or above gets paged in. The director/VP has the ultimate say on posting an outage, but in parallel they consult the PR/comms team on the wording/severity of the notification, any partnership managers for key affected clients, and legal regarding any contractual requirements the outage may be breaching...

So in a best-case scenario you'd have 3 minutes (for a fast alarm to fire), plus ~5 minutes for the on-call to engage, plus ~10 minutes of initial investigation, plus ~20 minutes of escalations and discussions... all before anyone with permission to edit the status page can go ahead and do so.
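Adding up those stages makes the floor explicit. A back-of-the-envelope sketch in Python, using the rough, illustrative timings above (not any provider's actual numbers):

    # Best-case stages before anyone can edit the status page.
    # All durations are the rough figures quoted above, in minutes.
    stages = [
        ("fast alarm fires", 3),
        ("on-call engages", 5),
        ("initial investigation", 10),
        ("escalations and discussions", 20),
    ]

    elapsed = 0
    for name, minutes in stages:
        elapsed += minutes
        print(f"{elapsed:>3} min: {name}")
    # -> 38 minutes elapsed before the status page can even be touched.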


👤 sd9
I’ve seen “You broke Reddit” too much recently. Today, and during the AWS outage.

I didn’t break Reddit by trying to access the homepage.


👤 jpalawaga
It’s not a technical issue, it’s a business one.

Those status pages are often linked to contractual SLAs, and updating the page tangibly means lost money.

So there's an incentive to only update it when the issue is severe and not quickly remediated.

It's not an engineer's tool, it's a liability tool.


👤 hypeatei
Why even check the status page at all if you're experiencing errors and others are reporting the same? I don't see the point in getting worked up over how long it takes a company to update their status page.

There are a ton of moving pieces in software and networks these days. There is no straightforward way to declare an outage via health checks, especially if declaring said outage can cost you $$ due to various SLAs. Manual verification and updates take time.


👤 kachapopopow
Because these systems are so big, and the people who can validate problems might be asleep at the wheel or pretty far up the chain, and it takes time to reach them. Most of the spikes on Down Detector are unrelated to the service itself and are caused by a third-party failure.

👤 wahnfrieden
Because they are editorialized

👤 mirekrusin
Because they're incentivized to delay it, ideally until the issue is resolved; this way their SLA uptime is 100%. Less reported downtime is better for them, so they push it as much as possible. If they were to report all failures, their pretty green history would be filled with red. What are you going to do, sue them? They can do it, so they do.

👤 slig
They lie. Meta ad delivery has been completely fucked for about two months, and they rarely update their status page [1]. When they do, they post "medium disruptions" hours late and then "resolved" hours before the issue is actually solved. I could understand the premature "resolved" because of "web scale" and slow rollouts.

This third-party status page reflects the issues much better [2].

[1]: https://metastatus.com/ads-manager
[2]: https://statusgator.com/services/meta


👤 JohnFen
I always assumed it's because the pages are manually updated.

👤 chrismorgan
A few months ago, Cloudflare accidentally turned off 1.1.1.1 (I'm simplifying slightly; most notably, DNS-over-HTTPS continued to work). Over the course of five or six minutes, traffic dropped to 10% of normal and stayed there. Somehow, it took another six minutes before an alert fired, at which point they noticed.

https://news.ycombinator.com/item?id=44578490

You'd think that a company like that would notice when a given minute's global traffic for one of its important services drops below 50% of the previous hour's level, but apparently not.

And that’s Cloudflare, who I would expect better of than most.
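A check like that is cheap to express, too. A minimal sketch in Python, where the metric source and the 50% threshold are assumptions for illustration, not anything Cloudflare actually runs:

    from collections import deque

    # Rolling per-minute request counts for the trailing hour; a real
    # system would read these from a metrics store, not process memory.
    WINDOW_MINUTES = 60
    history = deque(maxlen=WINDOW_MINUTES)

    def traffic_dropped(current_minute_count: int, threshold: float = 0.5) -> bool:
        # Fire when the latest minute falls below `threshold` of the
        # trailing-hour average. Stay quiet until a full baseline exists.
        if len(history) < WINDOW_MINUTES:
            return False
        baseline = sum(history) / len(history)
        return current_minute_count < threshold * baseline

    def record_minute(count: int) -> None:
        # Record after checking, so the current (possibly broken) minute
        # doesn't dilute the baseline it is being compared against.
        history.append(count)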


👤 sjsdaiuasgdia
Status pages usually start as a human-updated thing because it's easy to implement.

Some time later, you might add an automated check thing that makes some synthetic requests to the service and validates what's returned. And maybe you wire that directly to the status page so issues can be shown as soon as possible.

Then, false alarms happen. Maybe someone forgot to rotate the credentials for the test account and it got locked out. Maybe the testing system has a bug. Maybe a change is pushed to the service that changes the output such that the test thinks the result is invalid. Maybe a localized infrastructure problem is preventing the testing system from reaching the service. There's a lot of ways for false alarms to appear, some intermittent and some persistent.

So then you spread out. You add more testing systems in diverse locations. You require some N of M tests to fail, and if that threshold is reached the status page gets updated automatically. That protects you from a few categories of false alarms, but not all of them.
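A minimal sketch of that N-of-M gate in Python, with the probe results and threshold purely illustrative:

    def quorum_failed(probe_results: list[bool], n_required: int) -> bool:
        # probe_results holds one pass/fail flag per probe location (M probes).
        # Only flip the status page when at least n_required probes agree the
        # service is down, so one broken prober or flaky path can't trigger it.
        failures = sum(1 for ok in probe_results if not ok)
        return failures >= n_required

    # Example: 5 probe locations, require 3 concurrent failures.
    latest_run = [True, False, True, False, False]  # hypothetical probe sweep
    if quorum_failed(latest_run, n_required=3):
        print("auto-update status page: investigating")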

You could go further to continue whacking away at the false alarm sources, but you run into the same problem as with service reliability, where each additional "9" costs much more than the one before. You reach a point where you realize the cost of making automatic status page updates fully accurate is prohibitive.

So you go back to having a human assess the alarm and authorize a status page update if it is legitimate.


👤 colinbartlett
This delay in status page acknowledgement is a huge reason that my app, StatusGator, has blown up in popularity recently.

We are now regularly detecting outages long before providers acknowledge them, which is hugely beneficial to IT teams.

For this Reddit outage, we alerted 13 minutes before the official status page.

For last week's Azure outage, it was 42 minutes prior (!?!).


👤 Mojah
Most companies prefer to fix any downtime before it's noticed, and sharing any details on a status page means admitting something went wrong.

There are plenty of status page solutions that tie uptime monitoring into status updates, essentially providing an "if we get an alert, anyone can follow along through the status page" experience with near real-time updates [1]. But it means showing _all_ users that something went wrong, when maybe only a handful noticed it in the first place.

It's a flawed tactic to try and hide/dismiss any downtime (people will notice), but it's in our human nature to try and hide the bad things?

[1] e.g. https://ohdear.app/features/status-pages


👤 ntomas
I think it all has to do with how companies react to their own outages and their processes around publishing the info. I imagine that bigger companies need to go through a process to validate all the information they share with the public.

I don't think it's a factor of how Statuspage works. Cloudflare, for example, uses it, and is usually pretty fast to update its status page and release outage information.

For companies that need to monitor critical dependencies, my company ( https://isdown.app ) helps by aggregating status page information with crowdsourced reports. This way, companies can be alerted way sooner than when the status page is updated.


👤 dgeiser13
Most service status pages are Layer 8 in the OSI Model