I built Pulsekeep after missing actual production downtime.
The frustrating part wasn’t lack of monitoring — it was too much of it.
One region would time out.
Another would recover.
Alerts would fire, then silence, then fire again.
I started ignoring them.
That’s when a real outage slipped through.
So I’m building Pulsekeep around a simple rule:
If something is down, you should know immediately.
If everything is fine, you shouldn’t hear anything.
Recently I shipped multi-region checks, and the hard part wasn’t infra — it was deciding when an alert is actually justified.
A single probe failing isn’t downtime.
Consensus is.
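The consensus rule can be sketched in a few lines. This is a hedged illustration, not Pulsekeep's actual implementation: the region names and the quorum size of 2 are made up, and a real system would also weigh probe recency and network topology.

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    region: str  # which probe location reported this result
    ok: bool     # did the check succeed from that region?

def is_down(results: list[ProbeResult], quorum: int = 2) -> bool:
    """Declare downtime only if at least `quorum` distinct regions saw a failure.

    A single failing probe could be a local network blip; agreement across
    independent regions is what justifies an alert.
    """
    failing_regions = {r.region for r in results if not r.ok}
    return len(failing_regions) >= quorum

results = [
    ProbeResult("us-east", ok=False),
    ProbeResult("eu-west", ok=True),
    ProbeResult("ap-south", ok=False),
]
print(is_down(results))  # two distinct regions failing -> True
```

With quorum=2, one flaky region stays silent while a genuine outage that two regions can see still fires immediately.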
Still early (MVP), still learning, but this constraint has shaped every decision so far.
If you’ve fought alert fatigue or false positives, I’d love to hear how you handle it.
Alert fatigue is such an underappreciated problem. The irony is that more monitoring often means less visibility because the signal-to-noise ratio tanks.
Your "consensus" approach to multi-region checks is smart. One failure could be a network blip; multiple failures in different regions is a pattern worth waking someone up for.
We've dealt with this by implementing severity tiers with different notification channels. Critical (consensus-confirmed outage) = phone call. Warning (single probe failing) = Slack message that batches. Info = dashboard only. The key was being ruthless about what qualifies as "critical" - if it doesn't require immediate human action, it's not critical.
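The tiering described above maps naturally to a small routing table. A minimal sketch, assuming made-up channel names (the tiers themselves are from the comment, the identifiers are illustrative):

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # consensus-confirmed outage: requires immediate human action
    WARNING = "warning"    # e.g. a single probe failing
    INFO = "info"          # everything else

# Illustrative channel names; the point is the mapping, not the transport.
ROUTES = {
    Severity.CRITICAL: "phone_call",    # wake someone up
    Severity.WARNING: "slack_batched",  # batched Slack message, no page
    Severity.INFO: "dashboard_only",    # visible only if you go looking
}

def route(sev: Severity) -> str:
    """Pick the notification channel for a given severity tier."""
    return ROUTES[sev]
```

The ruthlessness lives in whatever assigns the `Severity`, not in this table; keeping the routing dumb makes it easy to audit what can actually page a human.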
Curious how you're handling the "flapping" case - where something goes down/up/down/up rapidly. That's where most alert systems get noisy.
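One common way to tame flapping is hysteresis: require several consecutive failures before opening an alert, and several consecutive successes before closing it. A hedged sketch of that technique, with illustrative thresholds (not any particular product's behavior):

```python
class FlapDamper:
    """Suppress rapid down/up/down cycles with asymmetric streak thresholds."""

    def __init__(self, open_after: int = 3, close_after: int = 2):
        self.open_after = open_after    # consecutive failures needed to alert
        self.close_after = close_after  # consecutive successes needed to recover
        self.fail_streak = 0
        self.ok_streak = 0
        self.alerting = False

    def observe(self, ok: bool) -> bool:
        """Feed one probe result; return True iff the alert is currently open."""
        if ok:
            self.ok_streak += 1
            self.fail_streak = 0
            if self.alerting and self.ok_streak >= self.close_after:
                self.alerting = False
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if not self.alerting and self.fail_streak >= self.open_after:
                self.alerting = True
        return self.alerting
```

A service flapping down/up/down/up never builds a three-failure streak, so it stays quiet; a sustained outage alerts once and stays open until recovery is also sustained.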