Demo Type · 12

Incident Report (RCA)

Use this when you want to teach a hard lesson through a real failure — a timeline of what happened, the root cause dug out with 5-whys, the blast radius, and the fixes with owners.

This is a copyable exemplar. Lift the demo section below into a lesson built from assets/lesson-template.html — keep the design tokens and the Simple → Technical pattern intact.

1

What happened, in plain words


On the morning of March 14, our main app stopped saving anything. People could open it and click around, but every attempt to write data failed. The reason turned out to be embarrassingly simple: the database server's disk was completely full. With no room left to write, the database refused all new records.

A report like this one walks through the failure honestly so the same thing never bites us twice. It has four parts you can read top to bottom: a timeline of events, the root cause found by asking "why?" five times, an impact summary, and a checklist of fixes with a named owner for each.

Think of it like… a kitchen sink that overflows because the drain was slowly clogging for weeks. The flood is the incident; the real story is the clog nobody was watching — and the fix is a drain alarm, not just a mop.

What actually failed

The primary Postgres instance (db-prod-01) hit 100% usage on its data volume. Once Postgres could no longer write to the write-ahead log (WAL) directory, it stopped accepting writes, surfacing as PANIC: could not write to file "pg_wal/…": No space left on device. The application layer translated this into HTTP 500s on every mutating endpoint, while reads served from cache continued, which is why the app "looked" alive.
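To make that failure signature easy to spot in a log stream, here is a minimal sketch of a matcher for the PANIC line quoted above. The function name and regex are illustrative assumptions, not part of the original incident tooling; the file name after pg_wal/ is elided in the report, so the pattern accepts any name there.

```python
import re

# Matches the out-of-space PANIC quoted in the report. The file name after
# pg_wal/ is not given in the report, so any name is accepted.
ENOSPC_PANIC = re.compile(
    r'PANIC:\s+could not write to file "(?P<path>pg_wal/[^"]*)": '
    r"No space left on device"
)


def is_wal_disk_full(log_line: str) -> bool:
    """Return True if the line is a WAL write failure caused by a full disk."""
    return ENOSPC_PANIC.search(log_line) is not None
```

A log shipper or alert rule could call is_wal_disk_full on each line and page immediately, instead of waiting for the downstream error-rate alarm.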

Why disk filled silently

Three contributors stacked: (1) a verbose debug log level shipped to prod two weeks earlier inflated WAL and log volume; (2) autovacuum was disabled on a large append-only table, so the space held by dead tuples was never reclaimed; (3) the only disk monitoring was a dashboard that had to be checked by hand, and nobody had it open at 03:00. No threshold alarm existed below 100%.
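The missing piece in (3) is a threshold check that fires well before 100%. A minimal sketch follows, using only the standard library. The 80% and 90% thresholds are hypothetical; the report only says that nothing fired below 100% and that 85% passed unnoticed.

```python
import shutil

# Hypothetical thresholds, chosen for illustration only.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def classify_usage(used_pct: float, warn: float = WARN_PCT, crit: float = CRIT_PCT) -> str:
    """Map a disk usage percentage to 'ok', 'warn', or 'critical'."""
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warn"
    return "ok"


def disk_used_pct(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

Calling classify_usage(disk_used_pct("/var/lib/postgresql")) from a scheduled job would have returned "warn" at the Mar 11 reading of 85%. Note that this figure can differ slightly from the Use% column of df, which accounts for reserved blocks.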

2

Incident at a glance


SEV-1

The day the database disk filled up

Detected: 2026-03-14 · 03:12 UTC
Resolved: 2026-03-14 · 04:47 UTC
Duration: 1h 35m (detection to resolution)
Service: db-prod-01 (writes)
Author: On-call · Platform
3

Timeline of events


Read it top to bottom. Olive dots are routine, clay is a warning sign, red is the failure, green is recovery. Notice how the trouble started weeks before anyone noticed.

  1. Feb 28

    Debug logging shipped to prod

    A deploy flipped the DB log level to debug "temporarily" for an investigation. It was never reverted.

    root in motion
  2. Mar 11

    Disk crosses 85%

    The data volume quietly passes 85% usage. No alert is configured at this threshold, so nobody is paged.

    missed signal
  3. 03:09

    Disk hits 100% — writes fail

    Postgres can no longer flush its write-ahead log and stops accepting writes. Every save in the app starts returning errors.

    incident start
  4. 03:12

    First customer report

    A user emails support: "I can't save my work." On-call is paged 3 minutes later by the error-rate alarm, not by a disk alert.

    detected
  5. 03:31

    Root cause identified

    On-call runs df -h on the host and sees /var/lib/postgresql at 100%. The immediate cause is now clear.

    diagnosed
  6. 03:58

    Emergency space reclaimed

    Old WAL and rotated logs are archived off-box and the volume is grown by 40 GB. Writes resume within minutes.

    mitigated
  7. 04:47

    Service confirmed healthy

    Error rate back to baseline, write latency normal, no data lost. Incident closed; postmortem scheduled.

    resolved
Feb 28 debug log shipped → Mar 11 disk 85% (no alert) → 03:09 100%, writes fail → 03:12 detected → 03:58 mitigated → 04:47 resolved
Same story compressed: the fuse (debug logging) was lit two weeks before the outage.
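The clock times in the timeline can be turned into phase durations with a few lines of arithmetic. This is a small sketch using the five timestamps listed above; the function names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on 2026-03-14, UTC).
EVENTS = [
    ("incident start", "03:09"),
    ("detected", "03:12"),
    ("diagnosed", "03:31"),
    ("mitigated", "03:58"),
    ("resolved", "04:47"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Minutes elapsed between each consecutive pair of events."""
    return [
        (f"{a_name} -> {b_name}", minutes_between(a_time, b_time))
        for (a_name, a_time), (b_name, b_time) in zip(events, events[1:])
    ]


for label, minutes in phase_durations(EVENTS):
    print(f"{label}: {minutes} min")
```

The phases come out as 3, 19, 27, and 49 minutes. Detection to resolution is 95 minutes, which matches the 1h 35m in the header; measured from the first failed write at 03:09 it is 98 minutes.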
4

Root cause — five whys


Keep asking "but why did that happen?" until you reach something you can actually fix. The first answer ("disk was full") is a symptom, not a cause. The fifth answer is the one worth fixing.

  1. Why did the app stop saving?

    The database refused all writes.

  2. Why did the database refuse writes?

    Its disk was 100% full — Postgres could no longer flush the write-ahead log.

  3. Why did the disk fill up?

    Log and WAL volume had been growing unusually fast, and dead rows in a big table were never reclaimed.

  4. Why was the growth unnoticed?

    There was no alert below 100%. Disk usage lived on a dashboard nobody watches at 3 a.m.

  5. Root cause · why no early alert existed

    Disk capacity was never treated as a first-class SLO. A "temporary" debug log change shipped without a revert ticket, and no one owned proactive capacity alerting. The system had no way to warn us before it broke.
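One way to give the system that early warning is to forecast when the disk will fill instead of only watching the current level. The sketch below assumes linear growth and uses the only two data points the timeline provides (85% on Mar 11, 100% on Mar 14); both the linear model and the function names are illustrative assumptions.

```python
from datetime import date


def growth_per_day(samples):
    """Average percentage points gained per day between first and last sample."""
    (d0, p0), (d1, p1) = samples[0], samples[-1]
    days = (d1 - d0).days
    if days <= 0:
        raise ValueError("samples must span at least one day")
    return (p1 - p0) / days


def days_until_full(current_pct, rate_per_day):
    """Days left before 100% at a constant growth rate; None if not growing."""
    if rate_per_day <= 0:
        return None
    return (100.0 - current_pct) / rate_per_day


# Day-level data points taken from the timeline.
samples = [(date(2026, 3, 11), 85.0), (date(2026, 3, 14), 100.0)]
rate = growth_per_day(samples)
print(rate, days_until_full(85.0, rate))
```

At roughly 5 percentage points per day, an alert on "fewer than 7 days until full" would have paged someone well before Mar 11, during working hours rather than at 03:00.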

Contributing factors (not the single root)

5-whys finds the primary chain; real incidents have side roads. Here: autovacuum was disabled on events_raw during a past migration and never re-enabled; the WAL retention window was set high for a since-removed replica; and the runbook for "disk full" was three years stale. None alone caused the outage, but each shortened the fuse.
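A forgotten per-table autovacuum override like the one on events_raw can be caught with a periodic audit. Postgres stores per-table storage parameters in pg_class.reloptions as text entries such as autovacuum_enabled=false. The sketch below shows a query string and a pure-Python filter over its rows; it makes no database connection, and the function name and row shape are assumptions for illustration.

```python
# Query sketch: list ordinary tables that have any storage parameters set.
FIND_TABLE_OPTIONS_SQL = """
SELECT c.relname, c.reloptions
FROM pg_class AS c
WHERE c.relkind = 'r' AND c.reloptions IS NOT NULL
"""

# Spellings treated as "disabled" when reading the stored option value.
FALSE_VALUES = ("false", "off", "0", "no")


def autovacuum_disabled(rows):
    """Return names of tables whose reloptions turn autovacuum off.

    `rows` is an iterable of (relname, reloptions) pairs, where reloptions
    is a list of 'key=value' strings (or None), as a driver would typically
    return for a text[] column.
    """
    disabled = []
    for relname, reloptions in rows:
        options = dict(opt.split("=", 1) for opt in (reloptions or []))
        if options.get("autovacuum_enabled", "true").lower() in FALSE_VALUES:
            disabled.append(relname)
    return disabled
```

Feeding the query results into autovacuum_disabled and failing a scheduled check when the list is non-empty would have flagged events_raw long before the outage.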

Why we avoid blame

The engineer who shipped debug logging acted reasonably given the tools — there was no guardrail, no revert reminder, no capacity alarm. A blameless postmortem fixes the system (add the alarm, add revert tickets) rather than the person.

5

Impact summary


The blast radius in numbers — what it cost, and the good news at the end.

1h 35m

Write downtime

~4,200

Failed save attempts

~310

Affected users

0

Records lost
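The headline numbers can be combined into two rates that make the blast radius easier to compare across incidents. This is simple arithmetic on the figures above; the values marked "~" in the report are approximate, so the results are too.

```python
# Figures from the impact summary; the "~" values are approximate.
DOWNTIME_MINUTES = 95   # 1h 35m
FAILED_SAVES = 4200
AFFECTED_USERS = 310

failed_per_minute = FAILED_SAVES / DOWNTIME_MINUTES
failed_per_user = FAILED_SAVES / AFFECTED_USERS

print(f"{failed_per_minute:.0f} failed saves per minute")
print(f"{failed_per_user:.1f} failed saves per affected user")
```

That works out to about 44 failed saves per minute and about 13.5 per affected user, which suggests people retried many times before giving up.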
6

Action items


Every fix has an owner and a priority. Check items off as they ship — the bar tracks progress. P1 closes the door on this exact failure; P2/P3 harden the area around it.

  • P1
  • P1
  • P2
  • P2
  • P3