Demo Type · 12

Incident Report (RCA)

Use this when you want to teach a hard lesson through a real failure — a timeline of what happened, the root cause dug out with 5-whys, the blast radius, and the fixes with owners.

This is a copyable exemplar. Lift the demo section below into a lesson built from assets/lesson-template.html — keep the design tokens and the Simple → Technical pattern intact.

What happened, in plain words

On the morning of March 14, our main app stopped saving anything. People could open it and click around, but every attempt to write data failed. The reason turned out to be embarrassingly simple: the database server's disk was completely full. With no room left to write, the database refused all new records.

A report like this one walks through the failure honestly so the same thing never bites us twice. It has four parts you can read top to bottom: a timeline of events, the root cause found by asking "why?" five times, an impact summary, and a checklist of fixes with a named owner for each.

Think of it like… a kitchen sink that overflows because the drain was slowly clogging for weeks. The flood is the incident; the real story is the clog nobody was watching — and the fix is a drain alarm, not just a mop.

What actually failed

The primary Postgres instance (db-prod-01) hit 100% usage on its data volume. Once Postgres could no longer write to the write-ahead log (WAL) directory, it stopped accepting writes, surfacing as PANIC: could not write to file "pg_wal/…": No space left on device. The application layer translated this into HTTP 500s on every mutating endpoint, while reads served from cache continued, which is why the app "looked" alive.

Why disk filled silently

Three contributors stacked: (1) a verbose debug log level, shipped to prod two weeks earlier, inflated WAL and log volume; (2) autovacuum was disabled on a large append-only table (events_raw), so space held by dead tuples was never reclaimed; (3) the only disk monitoring was a dashboard that had to be checked by hand, and nobody had it open at 03:00. No threshold alarm existed below 100%.
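A threshold check of the kind that was missing can be sketched in a few lines. This is a minimal illustration, not the team's actual monitoring: the 70% and 85% levels come from the action items in this report, while the function names and the use of Python's shutil.disk_usage are assumptions made for the example.

```python
import shutil
from typing import Optional

# Alert levels taken from the action items in this report.
ALERT_THRESHOLDS = (70, 85)


def used_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def highest_threshold_crossed(percent: float) -> Optional[int]:
    """Return the highest alert threshold at or below `percent`, or None."""
    crossed = [t for t in ALERT_THRESHOLDS if percent >= t]
    return max(crossed) if crossed else None


if __name__ == "__main__":
    # "." is a placeholder; on the database host this would be the data volume.
    pct = used_percent(".")
    level = highest_threshold_crossed(pct)
    if level is None:
        print(f"{pct:.1f}% used: below all thresholds")
    else:
        print(f"{pct:.1f}% used: crossed the {level}% threshold")
```

Run on a schedule, a check like this would have fired when the volume passed 85% on Mar 11, three days before the outage.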

Incident at a glance

SEV-1

The day the database disk filled up

Detected: 2026-03-14 · 03:12 UTC

Resolved: 2026-03-14 · 04:47 UTC

Duration: 1h 35m

Service: db-prod-01 (writes)

Author: On-call · Platform
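The duration figure is simply resolved minus detected. A small sketch of that arithmetic, using the two timestamps from the summary above (the helper name format_duration is made up for this example):

```python
from datetime import datetime, timezone

# Timestamps from the "Incident at a glance" summary.
detected = datetime(2026, 3, 14, 3, 12, tzinfo=timezone.utc)
resolved = datetime(2026, 3, 14, 4, 47, tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as 'Xh Ym'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


print(format_duration(detected, resolved))  # 1h 35m
```

Note that this measures from detection, not from 03:09 when writes first failed.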

Timeline of events

Read it top to bottom. Olive dots are routine, clay is a warning sign, red is the failure, green is recovery. Notice how the trouble started weeks before anyone noticed.

Feb 28

Debug logging shipped to prod

A deploy flipped the DB log level to debug "temporarily" for an investigation. It was never reverted.
root in motion
Mar 11

Disk crosses 85%

The data volume quietly passes 85% usage. No alert is configured at this threshold, so nobody is paged.
missed signal
03:09

Disk hits 100% — writes fail

Postgres can no longer flush its write-ahead log and stops accepting writes. Every save in the app starts returning errors.
incident start
03:12

First customer report

A user emails support: "I can't save my work." On-call is paged 3 minutes later by the error-rate alarm, not by disk.
detected
03:31

Root cause identified

On-call runs df -h on the host and sees /var/lib/postgresql at 100%. The cause is now clear.
diagnosed
03:58

Emergency space reclaimed

Old WAL and rotated logs are archived off-box and the volume is grown by 40 GB. Writes resume within minutes.
mitigated
04:47

Service confirmed healthy

Error rate back to baseline, write latency normal, no data lost. Incident closed; postmortem scheduled.
resolved

The same story, compressed: the fuse (the debug log change) was lit two weeks before the outage.
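The lead times in the timeline are easy to verify. The sketch below assumes the Feb 28 and Mar 11 entries fall in 2026, the same year as the incident; the variable names are illustrative only.

```python
from datetime import date

# Dates from the timeline; the year is assumed to match the incident (2026).
debug_shipped = date(2026, 2, 28)
crossed_85_percent = date(2026, 3, 11)
outage = date(2026, 3, 14)

days_since_debug_change = (outage - debug_shipped).days
days_of_missed_warning = (outage - crossed_85_percent).days

print(days_since_debug_change)  # 14
print(days_of_missed_warning)   # 3
```

Fourteen days passed between the debug log change and the outage, and an alert at 85% would have given roughly three days of warning.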

Root cause — five whys

Keep asking "but why did that happen?" until you reach something you can actually fix. The first answer ("disk was full") is a symptom, not a cause. The fifth answer is the one worth fixing.

Why did the app stop saving?

The database refused all writes.
Why did the database refuse writes?

Its disk was 100% full — Postgres could no longer flush the write-ahead log.
Why did the disk fill up?

Log and WAL volume had been growing unusually fast, and dead rows in a big table were never reclaimed.
Why was the growth unnoticed?

There was no alert below 100%. Disk usage lived on a dashboard nobody watches at 3 a.m.
Root cause · why no early alert existed

Disk capacity was never treated as a first-class SLO. A "temporary" debug log change shipped without a revert ticket, and no one owned proactive capacity alerting. The system had no way to warn us before it broke.

Contributing factors (not the single root)

5-whys finds the primary chain; real incidents have side roads. Here: autovacuum was disabled on events_raw during a past migration and never re-enabled; the WAL retention window was set high for a since-removed replica; and the runbook for "disk full" was three years stale. None alone caused the outage, but each shortened the fuse.
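One way to catch a forgotten autovacuum override is to inspect the table's storage options. The sketch below assumes autovacuum was turned off per table through the autovacuum_enabled storage parameter, which the report does not confirm. The SQL strings query pg_class.reloptions and reset the parameter; the Python helper only parses the returned option list and does not connect to a database.

```python
from typing import Optional, Sequence

# Assumed statements for the events_raw table named in this report.
CHECK_SQL = "SELECT reloptions FROM pg_class WHERE relname = 'events_raw';"
ENABLE_SQL = "ALTER TABLE events_raw SET (autovacuum_enabled = true);"


def autovacuum_disabled(reloptions: Optional[Sequence[str]]) -> bool:
    """Return True if a reloptions list turns autovacuum off for the table.

    `reloptions` is the list of 'key=value' strings from pg_class.reloptions,
    or None when the table has no per-table options set.
    """
    for option in reloptions or []:
        key, _, value = option.partition("=")
        if key == "autovacuum_enabled" and value.lower() in ("false", "off", "0"):
            return True
    return False


print(autovacuum_disabled(["autovacuum_enabled=false"]))  # True
print(autovacuum_disabled(None))                          # False
```

If autovacuum was instead disabled globally in the server configuration, this per-table check would not detect it.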

Why we avoid blame

The engineer who shipped debug logging acted reasonably given the tools — there was no guardrail, no revert reminder, no capacity alarm. A blameless postmortem fixes the system (add the alarm, add revert tickets) rather than the person.

Impact summary

The blast radius in numbers — what it cost, and the good news at the end.

1h 35m

Write downtime

~4,200

Failed save attempts

~310

Affected users

0

Records lost

Action items

Every fix has an owner and a priority. Check items off as they ship — the bar tracks progress. P1 closes the door on this exact failure; P2/P3 harden the area around it.

0 of 5 done

Add disk-usage alerts at 70% and 85% that page on-call · Owner @priya · Platform · due Mar 17 · P1
Revert debug log level; require a revert ticket for any "temporary" prod change · Owner @diego · Platform · due Mar 16 · P1
Re-enable autovacuum on events_raw and verify reclaim · Owner @sam · Data · due Mar 20 · P2
Make disk capacity a tracked SLO with a monthly headroom review · Owner @priya · Platform · due Mar 28 · P2
Rewrite the "disk full" runbook and rehearse it in a game day · Owner @maria · SRE · due Apr 04 · P3
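The checklist above maps naturally onto a small data structure. This is a hedged sketch of how the items and the progress counter could be modeled. The ActionItem class and helper names are invented for illustration, titles are shortened, and the due dates assume the year 2026.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    team: str
    due: date
    priority: str  # "P1", "P2" or "P3"
    done: bool = False


ITEMS = [
    ActionItem("Add disk-usage alerts at 70% and 85%", "@priya", "Platform", date(2026, 3, 17), "P1"),
    ActionItem("Revert debug log level; require revert tickets", "@diego", "Platform", date(2026, 3, 16), "P1"),
    ActionItem("Re-enable autovacuum on events_raw", "@sam", "Data", date(2026, 3, 20), "P2"),
    ActionItem("Make disk capacity a tracked SLO", "@priya", "Platform", date(2026, 3, 28), "P2"),
    ActionItem("Rewrite and rehearse the disk-full runbook", "@maria", "SRE", date(2026, 4, 4), "P3"),
]


def progress(items: List[ActionItem]) -> str:
    """Summarize completion the way the progress bar does."""
    done = sum(1 for item in items if item.done)
    return f"{done} of {len(items)} done"


def open_by_priority(items: List[ActionItem]) -> List[ActionItem]:
    """Return unfinished items, most urgent priority and earliest due date first."""
    return sorted((i for i in items if not i.done), key=lambda i: (i.priority, i.due))


print(progress(ITEMS))  # 0 of 5 done
for item in open_by_priority(ITEMS):
    print(item.priority, item.due.isoformat(), item.owner, item.title)
```

Sorting by priority and then due date puts the debug log revert first, since it is a P1 due one day before the alerting work.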