Use this when you want to teach a hard lesson through a real failure — a timeline of what happened, the root cause dug out with 5-whys, the blast radius, and the fixes with owners.
This is a copyable exemplar. Lift the demo section below into a lesson built from assets/lesson-template.html — keep the design tokens and the Simple → Technical pattern intact.
On the morning of March 14, our main app stopped saving anything. People could open it and click around, but every attempt to write data failed. The reason turned out to be embarrassingly simple: the database server's disk was completely full. With no room left to write, the database refused all new records.
A report like this one walks through the failure honestly so the same thing never bites us twice. It has four parts you can read top to bottom: a timeline of events, the root cause found by asking "why?" five times, an impact summary, and a checklist of fixes with a named owner for each.
Think of it like… a kitchen sink that overflows because the drain was slowly clogging for weeks. The flood is the incident; the real story is the clog nobody was watching — and the fix is a drain alarm, not just a mop.
The primary Postgres instance (db-prod-01) hit 100% usage on its data volume. Postgres entered a read-only state once the write-ahead log (WAL) directory could no longer flush, surfacing as PANIC: could not write to file "pg_wal/…": No space left on device. The application layer translated this into HTTP 500s on every mutating endpoint while reads from cache continued, which is why the app "looked" alive.
Three contributors stacked: (1) a verbose debug log level shipped to prod two weeks earlier inflated WAL and log volume; (2) automated VACUUM was disabled on a large append-only table, so dead tuples never reclaimed space; (3) the only disk alert was a manual dashboard nobody had open at 03:00. No threshold alarm existed below 100%.
Read it top to bottom. Olive dots are routine, clay is a warning sign, red is the failure, green is recovery. Notice how the trouble started weeks before anyone noticed.
Debug logging shipped to prod
A deploy flipped the DB log level to debug "temporarily" for an investigation. It was never reverted.
Disk crosses 85%
The data volume quietly passes 85% usage. No alert is configured at this threshold, so nobody is paged.
missed signalDisk hits 100% — writes fail
Postgres can no longer flush its write-ahead log and drops into read-only. Every save in the app starts returning errors.
incident startFirst customer report
A user emails support: "I can't save my work." On-call is paged 3 minutes later by the error-rate alarm, not by disk.
detectedRoot cause identified
On-call runs df -h on the host and sees 100% /var/lib/postgresql. The cause is now clear.
Emergency space reclaimed
Old WAL and rotated logs are archived off-box and the volume is grown by 40 GB. Writes resume within minutes.
mitigatedService confirmed healthy
Error rate back to baseline, write latency normal, no data lost. Incident closed; postmortem scheduled.
resolvedKeep asking "but why did that happen?" until you reach something you can actually fix. The first answer ("disk was full") is a symptom, not a cause. The fifth answer is the one worth fixing.
Why did the app stop saving?
The database refused all writes.
Why did the database refuse writes?
Its disk was 100% full — Postgres could no longer flush the write-ahead log.
Why did the disk fill up?
Log and WAL volume had been growing unusually fast, and dead rows in a big table were never reclaimed.
Why was the growth unnoticed?
There was no alert below 100%. Disk usage lived on a dashboard nobody watches at 3 a.m.
Root cause · why no early alert existed
Disk capacity was never treated as a first-class SLO. A "temporary" debug log change shipped without a revert ticket, and no one owned proactive capacity alerting. The system had no way to warn us before it broke.
5-whys finds the primary chain; real incidents have side roads. Here: autovacuum was disabled on events_raw during a past migration and never re-enabled; the WAL retention window was set high for a since-removed replica; and the runbook for "disk full" was three years stale. None alone caused the outage, but each shortened the fuse.
The engineer who shipped debug logging acted reasonably given the tools — there was no guardrail, no revert reminder, no capacity alarm. A blameless postmortem fixes the system (add the alarm, add revert tickets) rather than the person.
The blast radius in numbers — what it cost, and the good news at the end.
1h 35m
Write downtime~4,200
Failed save attempts~310
Affected users0
Records lostEvery fix has an owner and a priority. Check items off as they ship — the bar tracks progress. P1 closes the door on this exact failure; P2/P3 harden the area around it.