$ cat blog/on-call-that-doesnt-suck.md

Building an On-Call Culture That Doesn't Suck

August 10, 2025 · 2 min read

The Problem

On-call is unavoidable in modern operations. But it doesn’t have to mean sleepless nights and constant anxiety. Most on-call pain comes from:

  1. Alert fatigue — too many alerts, most non-actionable
  2. Poor documentation — no runbooks, tribal knowledge only
  3. Blame culture — fear of making mistakes during incidents
  4. Uneven rotations — same people always getting paged

Alert Hygiene

The single most impactful thing you can do is fix your alerts:

Rule: Every alert must be actionable.
If it pages someone, they must be able to DO something about it.
If not, it's a notification, not an alert.

We audit alerts quarterly:

  • Delete alerts that never fire, or that are always acknowledged without any action taken
  • Downgrade informational alerts to dashboards
  • Consolidate related alerts into single, meaningful signals
  • Document what to do when each alert fires

After our first audit, we reduced alerts by 60%. On-call satisfaction went up immediately.
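
The audit criteria above can be sketched as a small classifier. This is a minimal illustration, not our actual tooling: the `AlertStats` fields, the `Verdict` names, and the order of the checks are all assumptions, and consolidating related alerts is left out because it needs human judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    DELETE = "delete"
    DOWNGRADE = "move to a dashboard"
    DOCUMENT = "keep, but write the runbook"
    KEEP = "keep"


@dataclass
class AlertStats:
    """History for one alert over the audit window (hypothetical fields)."""
    name: str
    fired: int          # how many times the alert fired
    acted_on: int       # firings where the responder actually did something
    actionable: bool    # can the person paged DO something about it?
    has_runbook: bool   # is there documentation for this alert?


def audit(alert: AlertStats) -> Verdict:
    # Informational only: it is a notification, not an alert.
    if not alert.actionable:
        return Verdict.DOWNGRADE
    # Never fired, or only ever acknowledged without action.
    if alert.fired == 0 or alert.acted_on == 0:
        return Verdict.DELETE
    # Worth keeping, but nobody has written down what to do yet.
    if not alert.has_runbook:
        return Verdict.DOCUMENT
    return Verdict.KEEP


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    history = [
        AlertStats("disk-full", fired=4, acted_on=3, actionable=True, has_runbook=True),
        AlertStats("cpu-blip", fired=40, acted_on=0, actionable=True, has_runbook=True),
        AlertStats("deploy-finished", fired=90, acted_on=0, actionable=False, has_runbook=False),
    ]
    for alert in history:
        print(f"{alert.name}: {audit(alert).value}")
```

Running something like this against a quarter of paging history gives the audit meeting a starting list instead of a blank page.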

Runbooks

Every alert should link to a runbook. A good runbook has:

  • What this alert means (one sentence)
  • Impact (who/what is affected)
  • Diagnostic steps (what to check first)
  • Remediation steps (how to fix it)
  • Escalation path (when and who to escalate to)

Store them in git, next to the alert definitions. Review them in postmortems.
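
Because the runbooks live in git, a check like the one below could run in CI and flag incomplete ones. It is a sketch under two assumptions: runbooks are Markdown files, and each of the five parts listed above has its own heading. The heading names in `REQUIRED_SECTIONS` are hypothetical.

```python
# Hypothetical heading names, one per item in the runbook checklist.
REQUIRED_SECTIONS = ("Meaning", "Impact", "Diagnostics", "Remediation", "Escalation")


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no heading in the runbook."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


if __name__ == "__main__":
    # Made-up runbook with the escalation path left out.
    sample = """\
# disk-full

## Meaning
A volume is above 90% usage.

## Impact
Writes to the affected service may start failing.

## Diagnostics
Check which directory is growing.

## Remediation
Rotate logs or extend the volume.
"""
    print(missing_sections(sample))
```

An empty list means the runbook has every section; anything else names what still needs writing.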

Blameless Postmortems

After every significant incident:

  1. Write a timeline of what happened
  2. Identify contributing factors (not “root cause” — it’s rarely one thing)
  3. List action items with owners and deadlines
  4. Share widely — incidents are learning opportunities

The key word is blameless. People make mistakes. Systems should be resilient to human error. If a single person’s mistake can cause an outage, that’s a systems problem, not a people problem.
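
The postmortem steps above map naturally onto a small record with a completeness check. The sketch below is one possible shape, not a standard format: the `Postmortem` and `ActionItem` types and their field names are assumptions. Note that it deliberately has no "who caused it" field.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class Postmortem:
    title: str
    timeline: list[str] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)


def gaps(pm: Postmortem) -> list[str]:
    """List what is still missing before the postmortem is ready to share."""
    issues = []
    if not pm.timeline:
        issues.append("timeline is empty")
    if not pm.contributing_factors:
        issues.append("no contributing factors listed")
    for item in pm.action_items:
        if not item.owner:
            issues.append(f"no owner: {item.description}")
        if item.due is None:
            issues.append(f"no deadline: {item.description}")
    return issues


if __name__ == "__main__":
    # Made-up incident, for illustration only.
    pm = Postmortem(
        title="Checkout latency spike",
        timeline=["14:02 alert fired", "14:10 rollback started"],
        contributing_factors=["config change skipped canary", "no latency alert on canary"],
        action_items=[ActionItem("Add canary latency alert", owner="payments-team")],
    )
    for issue in gaps(pm):
        print(issue)
```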

Making It Sustainable

  • Fair rotations — distribute equally, compensate appropriately
  • Follow-the-sun — if you have global teams, use them
  • Protected recovery time — after a rough night, take the next morning off
  • Regular reviews — is on-call getting better or worse? Measure it.
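
To make "measure it" concrete, here is a minimal sketch of two numbers a regular review could track: how evenly pages are spread across the rotation, and how many land at night. The function names, the page log format (person, timestamp in the responder's local time), and the 22:00 to 06:00 night window are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

Page = tuple[str, datetime]  # (person paged, local time of the page)


def page_load(pages: list[Page], roster: list[str]) -> dict[str, int]:
    """Pages per person, including people on the roster who got none."""
    counts = Counter({person: 0 for person in roster})
    for person, _ in pages:
        counts[person] += 1
    return dict(counts)


def imbalance(load: dict[str, int]) -> float:
    """Busiest person's pages divided by the mean; 1.0 is perfectly even."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean if mean else 1.0


def night_pages(pages: list[Page], start_hour: int = 22, end_hour: int = 6) -> int:
    """Count pages that arrived inside the (assumed) night window."""
    return sum(1 for _, ts in pages if ts.hour >= start_hour or ts.hour < end_hour)


if __name__ == "__main__":
    # Made-up page log, for illustration only.
    pages = [
        ("ana", datetime(2025, 8, 1, 3, 0)),
        ("ana", datetime(2025, 8, 2, 14, 0)),
        ("ana", datetime(2025, 8, 3, 23, 30)),
        ("ben", datetime(2025, 8, 4, 10, 0)),
    ]
    load = page_load(pages, roster=["ana", "ben", "chi"])
    print(load, round(imbalance(load), 2), night_pages(pages))
```

Tracking these per rotation period shows whether the same people keep getting paged and whether rough nights are becoming more or less common.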

On-call is a reflection of your engineering culture. Invest in it.