Building an On-Call Culture That Doesn't Suck

The Problem

On-call is unavoidable in modern operations. But it doesn’t have to mean sleepless nights and constant anxiety. Most on-call pain comes from:

Alert fatigue — too many alerts, most non-actionable
Poor documentation — no runbooks, tribal knowledge only
Blame culture — fear of making mistakes during incidents
Uneven rotations — same people always getting paged

Alert Hygiene

The single most impactful thing you can do is fix your alerts:

Rule: Every alert must be actionable.
If it pages someone, they must be able to DO something about it.
If not, it's a notification, not an alert.

We audit alerts quarterly:

Delete alerts that never fire or that are always acknowledged without any action being taken
Downgrade informational alerts to dashboards
Consolidate related alerts into single, meaningful signals
Document what to do when each alert fires

After our first audit, we reduced alerts by 60%. On-call satisfaction went up immediately.
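The audit rules above can be sketched as a small triage function. This is only an illustration: the `AlertStats` fields, the `informational` flag, and the order in which the rules are checked are assumptions, not a description of any particular alerting tool. The counts would come from whatever history your paging system exports.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Per-alert numbers for one audit window (all fields are hypothetical)."""
    name: str
    fired: int                   # times the alert fired in the window
    acted_on: int                # times a responder did more than acknowledge it
    informational: bool = False  # True if nobody is expected to act on it


def audit_decision(alert: AlertStats) -> str:
    """Return 'delete', 'downgrade', or 'keep' for one alert."""
    if alert.fired == 0:
        return "delete"      # never fires
    if alert.informational:
        return "downgrade"   # belongs on a dashboard, not a pager
    if alert.acted_on == 0:
        return "delete"      # always acknowledged without action
    return "keep"


# Made-up example data for illustration only.
alerts = [
    AlertStats("disk_almost_full", fired=12, acted_on=9),
    AlertStats("legacy_queue_depth", fired=0, acted_on=0),
    AlertStats("nightly_backup_finished", fired=90, acted_on=0, informational=True),
    AlertStats("cpu_above_80_percent", fired=40, acted_on=0),
]

for alert in alerts:
    print(f"{alert.name}: {audit_decision(alert)}")
```

Consolidating related alerts and documenting each one still need human judgment; the sketch covers only the mechanical part of the audit.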

Runbooks

Every alert should link to a runbook. A good runbook has:

What this alert means (one sentence)
Impact (who/what is affected)
Diagnostic steps (what to check first)
Remediation steps (how to fix it)
Escalation path (when and who to escalate to)

Store them in Git, next to the alert definitions. Review them in postmortems.
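Because runbooks live in the same repository as the alerts, a small check can catch alerts with no runbook and runbooks with missing sections. The sketch below assumes each runbook is a text file whose section headings sit on their own lines; the heading names, the file paths, and both function names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical heading names, one per item in the runbook checklist.
REQUIRED_SECTIONS = (
    "Meaning",
    "Impact",
    "Diagnostic steps",
    "Remediation steps",
    "Escalation path",
)


def missing_sections(runbook_text: str) -> List[str]:
    """List the required sections that have no heading line in the runbook."""
    headings = {
        line.strip().lstrip("#").strip().rstrip(":").lower()
        for line in runbook_text.splitlines()
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


def alerts_without_runbook(runbook_links: Dict[str, Optional[str]]) -> List[str]:
    """List alert names whose runbook link is missing or empty."""
    return sorted(name for name, link in runbook_links.items() if not link)


# Made-up example data for illustration only.
example_runbook = """Meaning
The API error rate is above its threshold.

Impact
Customers see failed requests.

Diagnostic steps
Check the most recent deploy and the error dashboard.

Remediation steps
Roll back the most recent deploy.
"""

example_links = {
    "api_error_rate": "runbooks/api_error_rate.md",
    "disk_almost_full": None,
}

print(missing_sections(example_runbook))
print(alerts_without_runbook(example_links))
```

Running a check like this in CI keeps a new alert from merging without its runbook.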

Blameless Postmortems

After every significant incident:

Write a timeline of what happened
Identify contributing factors (not “root cause” — it’s rarely one thing)
List action items with owners and deadlines
Share widely — incidents are learning opportunities

The key word is blameless. People make mistakes. Systems should be resilient to human error. If a single person’s mistake can cause an outage, that’s a systems problem, not a people problem.

Making It Sustainable

Fair rotations — distribute equally, compensate appropriately
Follow-the-sun — if you have global teams, use them
Protected recovery time — after a rough night, take the next morning off
Regular reviews — is on-call getting better or worse? Measure it.
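One way to measure whether rotations are fair is to count pages per person and flag anyone carrying well above an equal share. This is a minimal sketch: the `overloaded` function, the roster names, and the 1.5x tolerance are assumptions chosen for illustration, and the page log would come from your paging tool's history.

```python
from collections import Counter
from typing import List


def overloaded(pages: List[str], roster: List[str], tolerance: float = 1.5) -> List[str]:
    """Return roster members paged more than `tolerance` times the equal share.

    `pages` holds one entry per page, naming the person who was paged.
    """
    if not pages or not roster:
        return []
    counts = Counter(pages)
    fair_share = len(pages) / len(roster)  # pages each person would get if split evenly
    return sorted(p for p in roster if counts[p] > tolerance * fair_share)


# Made-up example data for illustration only.
roster = ["ana", "bo", "cy"]
page_log = ["ana"] * 8 + ["bo"] + ["cy"]

print(overloaded(page_log, roster))
```

Passing the full roster matters: people who were never paged still count toward the equal share, so a lopsided rotation is not hidden by leaving them out. Tracking the same number from one review to the next shows whether on-call is getting better or worse.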

On-call is a reflection of your engineering culture. Invest in it.