$ cat blog/on-call-that-doesnt-suck.md
Building an On-Call Culture That Doesn't Suck
August 10, 2025 · 2 min read
The Problem
On-call is unavoidable in modern operations. But it doesn’t have to mean sleepless nights and constant anxiety. Most on-call pain comes from:
- Alert fatigue — too many alerts, most non-actionable
- Poor documentation — no runbooks, tribal knowledge only
- Blame culture — fear of making mistakes during incidents
- Uneven rotations — same people always getting paged
Alert Hygiene
The single most impactful thing you can do is fix your alerts:
Rule: Every alert must be actionable.
If it pages someone, they must be able to DO something about it.
If not, it's a notification, not an alert.
We audit alerts quarterly:
- Delete alerts that never fire or are always acknowledged without action
- Downgrade informational alerts to dashboards
- Consolidate related alerts into single, meaningful signals
- Document what to do when each alert fires
After our first audit, we reduced alerts by 60%. On-call satisfaction went up immediately.
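As a rough illustration, the audit rules above could be turned into a first-pass triage script. This is only a sketch under assumptions: the `AlertStats` fields, the `audit` function, and the 20% `LOW_ACTION_RATIO` cutoff are hypothetical, and the counts would have to come from whatever history your paging tool exports.

```python
from dataclasses import dataclass

# Assumed cutoff (not from any standard): if fewer than 20% of firings led
# to real action, treat the alert as informational.
LOW_ACTION_RATIO = 0.2


@dataclass
class AlertStats:
    """History for one alert over the audit period (hypothetical shape)."""
    name: str
    fired: int         # how many times the alert fired
    acted_on: int      # how many firings led to someone actually doing something
    has_runbook: bool  # whether the alert links to a runbook


def audit(alert: AlertStats) -> str:
    """Suggest an audit outcome for a single alert."""
    if alert.fired == 0:
        return "delete"      # never fires
    if alert.acted_on == 0:
        return "delete"      # always acknowledged without action
    if alert.acted_on / alert.fired < LOW_ACTION_RATIO:
        return "downgrade"   # mostly informational: move to a dashboard
    if not alert.has_runbook:
        return "document"    # actionable, but nobody wrote down what to do
    return "keep"


if __name__ == "__main__":
    sample = [
        AlertStats("disk_full", fired=12, acted_on=11, has_runbook=True),
        AlertStats("cpu_spike", fired=40, acted_on=0, has_runbook=False),
        AlertStats("cert_expiry", fired=3, acted_on=3, has_runbook=False),
    ]
    for alert in sample:
        print(f"{alert.name}: {audit(alert)}")
```

The output is a starting point for the quarterly review, not a verdict. Consolidating related alerts still needs a human who knows which signals overlap.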
Runbooks
Every alert should link to a runbook. A good runbook has:
- What this alert means (one sentence)
- Impact (who/what is affected)
- Diagnostic steps (what to check first)
- Remediation steps (how to fix it)
- Escalation path (when and who to escalate to)
Store them in Git, next to the alert definitions. Review them in postmortems.
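Because the runbooks live in the repository, a small check can flag any that are missing one of the five parts listed above. The sketch below assumes each runbook is a Markdown file whose sections are `#`-style headings named after those parts. The heading names and the `missing_sections` helper are illustrative, not an established convention.

```python
REQUIRED_SECTIONS = (
    "What this alert means",
    "Impact",
    "Diagnostic steps",
    "Remediation steps",
    "Escalation path",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching heading line."""
    # Collect heading titles, ignoring the heading level and letter case.
    headings = {
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


if __name__ == "__main__":
    draft = (
        "# Disk full on database host\n"
        "## What this alert means\n"
        "The data volume is nearly out of space.\n"
        "## Impact\n"
        "Writes will start failing.\n"
        "## Diagnostic steps\n"
        "Check which directory is growing.\n"
    )
    print(missing_sections(draft))
```

Run over every runbook file in CI, a check like this keeps incomplete runbooks from being merged alongside new alert definitions.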
Blameless Postmortems
After every significant incident:
- Write a timeline of what happened
- Identify contributing factors (not “root cause” — it’s rarely one thing)
- List action items with owners and deadlines
- Share widely — incidents are learning opportunities
The key word is blameless. People make mistakes. Systems should be resilient to human error. If a single person’s mistake can cause an outage, that’s a systems problem, not a people problem.
Making It Sustainable
- Fair rotations — distribute equally, compensate appropriately
- Follow-the-sun — if you have global teams, use them
- Protected recovery time — after a rough night, take the next morning off
- Regular reviews — is on-call getting better or worse? Measure it.
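One way to measure fairness is to count pages per engineer and see how far the busiest person sits from the average. The following is a minimal sketch with assumed inputs: a list of `(engineer, timestamp)` pairs exported from your paging tool, a roster of everyone in the rotation, and a "night" window of 22:00 to 06:00 local time. All names and the window are hypothetical.

```python
from collections import Counter
from datetime import datetime


def page_load(pages: list[tuple[str, datetime]], roster: list[str]) -> dict[str, dict[str, int]]:
    """Count total and night-time pages for each engineer on the roster."""
    total: Counter = Counter()
    night: Counter = Counter()
    for engineer, when in pages:
        total[engineer] += 1
        # Assumed night window: 22:00 to 06:00 in the engineer's local time.
        if when.hour >= 22 or when.hour < 6:
            night[engineer] += 1
    # Include roster members with zero pages so they count toward the mean.
    everyone = sorted(set(roster) | set(total))
    return {e: {"total": total[e], "night": night[e]} for e in everyone}


def imbalance(load: dict[str, dict[str, int]]) -> float:
    """Busiest engineer's page count divided by the mean; 1.0 is perfectly even."""
    counts = [v["total"] for v in load.values()]
    if not counts or sum(counts) == 0:
        return 1.0
    return max(counts) / (sum(counts) / len(counts))


if __name__ == "__main__":
    pages = [
        ("ana", datetime(2025, 8, 1, 3, 0)),
        ("ana", datetime(2025, 8, 2, 14, 0)),
        ("ana", datetime(2025, 8, 3, 23, 30)),
        ("ben", datetime(2025, 8, 4, 10, 0)),
    ]
    load = page_load(pages, roster=["ana", "ben", "chen"])
    print(load)
    print(round(imbalance(load), 2))
```

Tracking these numbers from one review to the next shows whether on-call is getting better or worse, and the night-time count highlights who is owed recovery time.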
On-call is a reflection of your engineering culture. Invest in it.