Yesterday I broke production with a deploy I pushed myself. And the part that actually bothered me wasn't the break - it was how I found out.
I noticed. That's it. Nothing alerted me. Something I shipped took a system down and my detection mechanism was "happen to be looking at the right thing at the right moment." For a backend that other people are meant to depend on for their auth and billing, that's not a monitoring gap, that's the absence of monitoring.
So the next thing I'm building for BuildBase is a status page, plus real system alerts behind it. Something like status.kinde.com - a public, honest view of whether every part of the platform is actually up, and alerts that fire the second one isn't, instead of me stumbling onto it hours later.
The reasoning is mostly about trust. If BuildBase is running your auth and your billing and something goes sideways, your first question is "is it me or is it them?" Right now the answer is a support message and a wait. A status page makes that answer public and instant, which is the least I owe anyone building on top of us.
Two things I'm being deliberate about while building it, because a status page done carelessly creates the exact problems it's supposed to catch:
Keep it lightweight. A health-check system that pings every service every few seconds is just self-inflicted load. If my reliability feature measurably slows the thing it's watching, I've made it worse. It needs to barely register.
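One way "barely register" could look, as a minimal sketch rather than anything BuildBase actually runs: cache each probe result for a short TTL so that many status-page readers cost one real check, and add jitter to the interval so checks across services don't all fire at the same instant. The names `CachedCheck` and `jittered_interval`, the 30-second TTL, and the 20% spread are all assumptions for illustration.

```python
import random
import time
from typing import Callable, Optional


class CachedCheck:
    """Wraps a probe so the real check runs at most once per `ttl` seconds."""

    def __init__(self, probe: Callable[[], bool], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._ttl = ttl
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._last_result: Optional[bool] = None
        self._expires_at = 0.0

    def status(self) -> bool:
        now = self._clock()
        # Only hit the real service when there is no result yet or it has expired.
        if self._last_result is None or now >= self._expires_at:
            self._last_result = self._probe()
            self._expires_at = now + self._ttl
        return self._last_result


def jittered_interval(base: float, spread: float = 0.2) -> float:
    """Return `base` seconds plus or minus up to `spread` (as a fraction of base)."""
    return base * (1.0 + random.uniform(-spread, spread))
```

With this shape, the status page reads `status()` as often as it likes, and the watched service only sees one probe per TTL window.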
Don't fight auto-scaling. Naive health checks read a scaling event as an outage, or worse, trip the scaling logic itself. The checker has to understand the difference between "this instance is down" and "this instance is spinning up." That's the part I'm still working through.
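A common way to draw that line, sketched here under assumptions and not as a finished answer: give every new instance a startup grace period, and only call an instance down after several consecutive failed checks. The `InstanceTracker` name, the 120-second grace period, and the three-failure threshold are hypothetical values picked for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    STARTING = "starting"  # still inside its startup grace period
    HEALTHY = "healthy"
    SUSPECT = "suspect"    # failing, but not for long enough to alert
    DOWN = "down"


@dataclass
class InstanceTracker:
    first_seen: float                # timestamp when the instance first appeared
    grace_seconds: float = 120.0     # assumed startup allowance
    failures_to_down: int = 3        # assumed consecutive-failure threshold
    consecutive_failures: int = 0
    ever_healthy: bool = False

    def record(self, ok: bool, now: float) -> State:
        """Record one check result and return the instance's current state."""
        if ok:
            self.consecutive_failures = 0
            self.ever_healthy = True
            return State.HEALTHY
        self.consecutive_failures += 1
        # Grace applies only to instances that have never passed a check.
        in_grace = (not self.ever_healthy
                    and (now - self.first_seen) < self.grace_seconds)
        if in_grace:
            return State.STARTING
        if self.consecutive_failures >= self.failures_to_down:
            return State.DOWN
        return State.SUSPECT
```

Only `DOWN` would page anyone in this sketch. `STARTING` and `SUSPECT` stay internal, which is what keeps a scale-up from showing on the status page as an outage.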
The meta-lesson, for anyone building in public: the best reliability work tends to come straight out of your own mistakes, not a tidy roadmap. I wouldn't have prioritised this if I hadn't personally broken something and felt how blind I was.
For those who've built status pages or health monitoring at a small scale - how do you keep the checks from becoming load, and how do you stop them false-alarming during scale events? That's the open question I'm sitting with.
Building it at https://buildbase.app if you want to see where it lands.
The auto-scaling issue you raised is a real one. One approach I have seen work is routing health checks through the load balancer endpoint instead of individual instances. If the LB is serving traffic, the fleet as a whole is up from a user's point of view, whichever instances happen to be joining or leaving during a scale event. It's also worth keeping a separate synthetic check on the critical user path from an external provider, so you catch the failures internal checks miss because they share the same infrastructure blind spots.
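A minimal sketch of how those two signals might combine into one status, using only the Python standard library. The URL `https://api.example.com/healthz` is a placeholder, and the `http_ok` and `platform_status` names and the three status labels are assumptions for illustration. In practice the synthetic result would come from whatever external provider you use, not from the same host.

```python
import http.client
import urllib.request
from typing import Callable

# Hypothetical load balancer health endpoint; replace with your own.
LB_HEALTH_URL = "https://api.example.com/healthz"


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return False


def platform_status(lb_check: Callable[[], bool],
                    synthetic_check: Callable[[], bool]) -> str:
    """Combine the fleet-level check with the external user-path check."""
    lb_ok = lb_check()
    path_ok = synthetic_check()
    if lb_ok and path_ok:
        return "operational"
    if not lb_ok and not path_ok:
        return "outage"
    # The two views disagree, e.g. the LB answers but login or checkout fails.
    return "degraded"
```

Usage would be along the lines of `platform_status(lambda: http_ok(LB_HEALTH_URL), external_result)`, where `external_result` is a hypothetical callable that returns the provider's latest verdict.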