HACKER Q&A
📣 yansoki

As a developer, am I wrong to think monitoring alerts are mostly noise?


I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here. My background is in software development, not SRE.

From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get a page for "High CPU" on a service, spend an hour digging through logs and dashboards, only to find out it was just a temporary traffic spike and not a real issue. It feels like a huge waste of developer time.

My hypothesis is that the tools we use are too focused on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what's actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers api-server-1 through 4?). But I'm coming at this from a dev perspective and I'm very aware that I might be missing the bigger picture.

I'd love to learn from the people who live and breathe this stuff:

- How much developer time is lost at your company to investigating "false positive" infrastructure alerts?
- Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams?
- Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?

I haven't built much yet because I'm committed to solving a real problem. Any brutal feedback or insights would be incredibly valuable.
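To make the peer-group idea concrete, here is roughly the kind of check I have in mind. It's a toy sketch; the hostnames, metric values, and the z-score threshold are all made up:

    from statistics import mean, stdev

    def peer_outliers(metric_by_host, z_threshold=3.0):
        """Flag hosts whose metric deviates from their peer group.

        metric_by_host: e.g. {"api-server-1": 32.0, ..., "api-server-5": 97.0}
        Returns hosts more than z_threshold standard deviations away from
        the mean of the *other* hosts in the group.
        """
        outliers = []
        for host, value in metric_by_host.items():
            peers = [v for h, v in metric_by_host.items() if h != host]
            if len(peers) < 2:
                continue  # not enough peers to compare against
            mu, sigma = mean(peers), stdev(peers)
            if sigma == 0:
                continue  # identical peers; handle this case separately
            if abs(value - mu) / sigma > z_threshold:
                outliers.append(host)
        return outliers

    # api-server-5 is busy while its peers are idle -> flagged
    print(peer_outliers({
        "api-server-1": 32.0, "api-server-2": 35.0, "api-server-3": 30.0,
        "api-server-4": 33.0, "api-server-5": 97.0,
    }))

A real version would need a more robust baseline than mean/stdev (one noisy peer skews it), but it shows the shape of the idea.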


  👤 aristofun Accepted Answer ✓
> investigating "false positive" infrastructure alerts?

Gradually with each false positive (or negative) you learn to tweak your alerts and update dashboards to reduce the noise as much as possible.


👤 rozenmd
Bingo. Too many folks focus on "oh god, is my CPU good, RAM good, etc." rather than "does the app still do the thing in a reasonable time?"

👤 PaulHoule
No. Useful alerts are a devilishly hard problem.

👤 chasing0entropy
You're not wrong.

👤 toast0
Monitoring is one of those things where you report on what's easy to measure, because measuring the "real metric" is very difficult.

If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands immediate response... a single high CPU page doesn't demand a quick response, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else'. Of course, you could be missing an important signal that should have been investigated, too. OTOH, if you get high CPU alerts from many hosts at once, something is up --- but it could just be an external event that causes high usage, and hopefully your system survives those autonomously anyway.

Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.


👤 tudelo
Alerting has to be a constant iterative process. Some things should be nice-to-know, and some things should be "halt what you are doing and investigate". The latter really needs to be decided based on how your SLIs/SLAs have been defined, and those alerts need to be high-quality indicators. Whenever one of the halt-and-investigate alerts starts to be less high-signal, it should be downgraded or its threshold should be increased. Like I said, an iterative process. When you are talking about a system owned by a team, there should be some occasional semi-formal review of current alerting practices, and when someone is on-call and notices flaky/bad alerting, they should spend time tweaking/fixing so the next person doesn't have the same churn.

There isn't a simple way, but having some tooling to go from alert -> relevant dashboards -> remediation steps can help cut down on the process... it takes a lot of time investment to make these things work in a way that saves you time rather than costing more time solving issues. FWIW I think developers need to be deeply involved in this process and basically own it. Static thresholds would usually just be a warning to look at later; you want more service-level indicators. For example, if you have a streaming system you probably want to know if one of your consumers is stuck or behind by a certain amount, and also if there is any measurable data loss. If you have automated pushes, you would probably want alerting for a push that is x amount of time stale. For RPC-type systems you would want some recurrent health checks that might warn on CPU/etc., but put higher-severity alerting on whether responses are correct and as expected, or not happening at all.
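As a toy sketch of what I mean for the streaming and push examples (the lag and staleness thresholds here are made-up placeholders, not recommendations):

    import time

    # SLI-style checks: alert on what the service fails to do,
    # not on raw host metrics.

    MAX_CONSUMER_LAG = 10_000        # messages behind before we page
    MAX_PUSH_STALENESS_S = 6 * 3600  # automated push older than 6h is a problem

    def check_consumer(lag_messages):
        # A stuck or far-behind consumer is user-visible: data stops flowing.
        return "page" if lag_messages > MAX_CONSUMER_LAG else "ok"

    def check_push(last_push_epoch_s, now_s=None):
        now_s = now_s or time.time()
        staleness = now_s - last_push_epoch_s
        return "page" if staleness > MAX_PUSH_STALENESS_S else "ok"

    print(check_consumer(lag_messages=25_000))                # -> page
    print(check_push(last_push_epoch_s=time.time() - 7200))   # -> ok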

As a solo dev it might be easier just to do the troubleshooting process every time, but as a team grows it becomes a huge time sink and troubleshooting production issues is stressful, so the goal is to make it as easy as possible. Especially if downtime == $$.

I don't have good recommendations for tooling because I have used mostly internal tools but generally this is my experience.


👤 gethly
Figuring out logging and metrics is the hardest part of running online projects nowadays. I would say, though, unpopular as it may be, that 99% of the work put into this is wasted. You are unlikely to be running critical software that cannot go down, so I would not worry too much about it.

YAGNI


👤 brudgers
> I'm a solo developer

To a first approximation, monitoring tools are built for teams, projects running at scale, and for systems where falling over matters at scale. And monitoring as a "best practice" is good engineering practice only in those contexts.

You don't have that context, and you probably should resist the temptation to boilerplate it and to consider it as moving the project forward. Because monitoring doesn't get you customers/users; does not solve customer/user problems; and nobody cares if you monitor or not (except assholes who want to tell you you are morally failing unless you monitor).

Good engineering is doing what makes sense only in terms of the actual problems at hand. Good luck.


👤 everforward
Good alerting is hard, even for those of us who are SMEs on it.

My biggest advice is to leverage alerting levels, and to only send high priority alerts for things visible to users.

For alert levels, I usually have 3. P1 (the highest level) is the only one that will fire a phone call/alarm 24/7/365, and only alerts if some kind of very user-visible issue happens (increase in error rate, unacceptable latency, etc). P2 is a mid-tier and only expected to get a response during business hours. That's where I send things that are maybe an issue or can wait, like storage filling up (but not critically so). P3 alerts get sent to a Slack channel, and exist mostly so if you get a P1 alert you can get a quick view of "things that are odd" like CPU spiking.

For monitoring, I try to only page on user-visible issues. E.g. I don't routinely monitor CPU usage, because it doesn't correlate to user-visible issues very well. Lots of things can cause CPU to spike, and if it's not impacting users then I don't care. Ditto for network usage, disk IO, etc, etc. Presuming your service does network calls, the 2 things you really care about are success rate and latency. A drop in success rate should trigger a P1 page; an increase in latency should trigger a P2 alert if it's higher than you'd like but still okay, and a P1 alert at the "this is impacting users" point. You may want to split those out by endpoint as well, because your acceptable latency probably differs by endpoint.
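As a rough sketch of how those two signals map onto the levels above (the thresholds are placeholders; tune them to your own SLOs):

    def classify(success_rate, p99_latency_s,
                 slo_success=0.999, latency_ok_s=0.5, latency_bad_s=2.0):
        """Map the two signals I actually page on to alert levels.

        P1 = phone call any time, P2 = business hours, None = no alert.
        The thresholds here are placeholders.
        """
        if success_rate < slo_success:
            return "P1"                 # users are seeing errors
        if p99_latency_s > latency_bad_s:
            return "P1"                 # slow enough to be user-impacting
        if p99_latency_s > latency_ok_s:
            return "P2"                 # worse than we'd like, can wait
        return None

    print(classify(success_rate=0.9999, p99_latency_s=0.8))  # -> P2
    print(classify(success_rate=0.97,   p99_latency_s=0.3))  # -> P1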

If your service can't scale, you might also want to adjust those alerts by traffic levels (i.e. if you know you can't handle 10k QPS and you can't scale past 10k QPS, there's no point in paging someone).

You can also add some automation, especially if the apps are stateless. If api-server-5 is behaving weirdly, kill it and spin up a new api-server-5 (or reboot it if physical). A lot of the common first line of defense options are pretty automatable, and can save you from getting paged if an automated restart will fix it. You probably do want some monitoring and rate limiting over that as well, though. E.g. a P2 alert that api-server-5 has been rebooted 4 times today, because repeated reboots are probably an indication of an underlying issue even if reboots temporarily resolve it.
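A minimal sketch of that rate-limited restart idea; restart_host and send_alert are stand-ins for whatever hooks your environment actually has:

    from collections import defaultdict

    MAX_RESTARTS_PER_DAY = 3

    restarts_today = defaultdict(int)   # host -> restart count (reset daily)

    def restart_host(host):
        print(f"restarting {host}")       # stand-in for your real restart hook

    def send_alert(level, message):
        print(f"[{level}] {message}")     # stand-in for your real pager/Slack hook

    def remediate(host):
        """First line of defense: restart, but escalate if it keeps happening."""
        restarts_today[host] += 1
        if restarts_today[host] > MAX_RESTARTS_PER_DAY:
            send_alert("P2", f"{host} restarted {restarts_today[host]} times today; "
                             "likely an underlying issue")
            return
        restart_host(host)

    for _ in range(5):
        remediate("api-server-5")   # 3 restarts, then escalation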


👤 ElevenLathe
If "CPU > 80%" is not an error state for your application, then that is a pointless alert and it should be removed.

Ideally alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want a high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert when you are less confident that it is actually occurring.

IME CPU% alerts are typically totally spurious in a modern cloud application. In general, to get the most out of your spend, you actually want your instances working close to their limits because the intent is to scale out when your application gets busy. Therefore, you instead want to monitor things that are as close to user experience or business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: What's the latency between uploading a cat picture and getting its first like?


👤 cyclonereef
Most of the time, checking for "typical" thresholds for infrastructure will yield more noise than signal. By typical thresholds I mean things like CPU usage %, memory consumed, and so on. My typical recommendation for clients who want to implement infrastructure monitoring is "Don't bother". You are better off in most cases measuring impact to user-facing services, such as web page response times, task completion times for batch jobs, and so on. If I have a client who is insistent on monitoring their infrastructure, I tell them to monitor different metrics:

For CPU, check for CPU IOWait
For memory, check for memory swap-in rate
For disk, check for latency or queue depth
For network, check for dropped packets

All you want to check at an infrastructure layer is whether there is a bottleneck and what that bottleneck is. Whether an application is using 10% or 99% of available memory is moot if the application isn't impacted by it. The above metrics are indicators (but not always proof) that a resource is being bottlenecked and needs investigation.
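If you want to sample those indicators from code, here's a rough sketch with psutil; the iowait and drop counters are Linux-specific, and disk latency/queue depth are usually easier to get from iostat:

    import psutil  # pip install psutil

    # Bottleneck indicators rather than raw utilisation.
    cpu = psutil.cpu_times_percent(interval=1)
    swap = psutil.swap_memory()
    net = psutil.net_io_counters()

    print(f"CPU iowait:         {cpu.iowait:.1f}%")   # CPU stalled waiting on disk
    print(f"swap-in since boot: {swap.sin} bytes")    # sample twice and diff for a rate
    print(f"dropped packets:    {net.dropin + net.dropout}")
    # Disk latency / queue depth aren't exposed directly here;
    # iostat -x (await, aqu-sz) is the usual place to look on Linux.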

Monitor further up the application stack: check error-code rates over time, implement tracing for core user journeys to the extent that you can, and ignore infrastructure-level monitoring until you have no choice.


👤 al_borland
I spent the first half of my career in ops, watching those alerts, escalating things, fixing stuff, writing EDA to fix stuff, working with monitoring teams and dev teams to tune monitoring, etc. Over time I worked my way into a dev role, but still am focused on the infrastructure.

The problem you’re starting to run into is that you’re seeing the monitors as useless, which will ultimately lead to ignoring them, so when there is a problem you won’t know it.

What you should be doing is tuning the monitors to make them useful. If your app will see occasional spikes that last 10 minutes, and the monitor checks every 5 minutes, set it to only create an alert after 3 consecutive failures. That creates some tolerance for spikes, but will still alert you if there is a prolonged issue that needs to be addressed due to the inevitable performance issues it will cause.
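In code, the consecutive-failure idea is basically this (sketch; the threshold and check cadence are whatever your monitor already uses):

    CONSECUTIVE_FAILURES_BEFORE_ALERT = 3   # 3 x 5-minute checks tolerates a 10-minute spike

    failure_streak = 0

    def record_check(cpu_percent, threshold=80):
        """Call once per monitoring interval; alert only on a sustained breach."""
        global failure_streak
        if cpu_percent > threshold:
            failure_streak += 1
        else:
            failure_streak = 0
        return failure_streak >= CONSECUTIVE_FAILURES_BEFORE_ALERT

    # A short spike (two checks) clears before the third, so no alert fires;
    # the later sustained breach does.
    for sample in (95, 92, 40, 91, 93, 96):
        print(record_check(sample))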

If there are other alerts that happen often and need action taken, and that action is repeatable, that's where EDA (Event Driven Automation) comes in. Write some code to fix what needs to be fixed, and when the alert comes in the code automatically runs to fix it. You then only need to handle the cases where that EDA code can't fix the issue. Fix it once in code instead of every time you get an alert.


👤 hshdhdhehd
CPU usage I tend to see used for two things: scaling, and maybe diagnostics (for 5% of investigations). Don't alert on it. Maybe alert if you scaled too much, though.

I would recommend alerting on reliability. If errors for an endpoint go above whatever threshold you judge appropriate, e.g. 1%, 0.1%, or 0.01%, for a sustained period, then alarm.

Maybe do the same for latency.

For hobby projects though I just point a free tier of one of those down detector things at a few urls. I may make a health check url.

Every false alarm should lead to some decision about how to fix it, e.g. a different alarm, a different threshold, or even just dropping that alarm.


👤 kazinator
This sounds like a case of the alerts being badly tuned.

If you are distracted by a high CPU alert that turns out to just be an expected spike, the alert needs to filter out spikes and only report persistent high CPU situations.

Think of how the body and brain report pain. Sudden new pain is more noticeable than a chronic background level of pain. Maybe CPU alarms should work the same way: CPU activity that is (a) X percent above the 24-hour average, and (b) persistent for a certain duration, is alarm-worthy.

So 100% CPU wouldn't be alarm-worthy for a node that is always at 100% CPU as an expected condition: a very busy system that is constantly loaded down. But 45% CPU for 30 minutes could be alarm-worthy on a machine that averages 15%.

Kind of thing.
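Roughly, as a sketch (the 25-point offset and six-sample window are arbitrary placeholders):

    from statistics import mean

    def alarm_worthy(samples_24h, recent_samples, points_above=25, min_sustained=6):
        """Alarm when CPU sits persistently well above its own 24-hour baseline.

        points_above: percentage points above the daily average that count as 'high'.
        min_sustained: how many consecutive recent readings must all be that high.
        """
        threshold = mean(samples_24h) + points_above
        return len(recent_samples) >= min_sustained and all(
            s > threshold for s in recent_samples)

    # Averages ~15%, now sitting around 45% for half an hour -> alarm.
    print(alarm_worthy(samples_24h=[15] * 288, recent_samples=[45, 46, 44, 47, 45, 46]))
    # A box that is always pegged at 100% never trips this check.
    print(alarm_worthy(samples_24h=[100] * 288, recent_samples=[100] * 6))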


👤 haiji2025
I can share my experience: monitoring and alerting should be calibrated to the number of users you serve. Early on, we run load/stress tests; if those look good, many ancillary alerts aren’t necessary. Alerts are best reserved for truly critical events—such as outages and other severe incidents. Thresholds should be tuned to real-world conditions and adjusted over time. Hope this helps.

👤 thiago_fm
They are very useful. In your example, "High CPU" could mean you need a bigger CPU, more cores in the same host, etc.

This will let you tune the application you're running.

Also, this might change during a burst of traffic, and if you don't have tools like Prometheus, DD, etc., you aren't able to tune accordingly.

The thing is that tuning a production setup is a bit of an art, there are many tradeoffs for what you do (typically, cost vs. benefit), so you need to make those decisions yourself.

If the alert is constantly ringing and you are satisfied with how the system is running and with its tradeoffs, you should disable it.