The Post-Mortem: How a Missing Index Took Production Down for 4 Hours

A detailed incident retrospective: root cause, timeline, detection failure, the fix, and the process changes intended to prevent this class of incident from recurring.

admin · April 6, 2026

What Happened

At 14:32 UTC on a Tuesday, API response times on the order processing service began increasing. By 14:47, the service was returning HTTP 503 responses to 40% of requests. By 15:00, checkout was completely non-functional. Service was restored about four hours after the incident began. Estimated lost revenue: $180,000.

Root cause: a missing composite index on a query added to the codebase 6 weeks earlier, during a period of low traffic that masked the performance problem until a volume spike exposed it.

How It Happened

A developer added a feature that queries order_items filtered by product_sku and status. The query ran in 3ms in development against a small dataset. Its execution plan was not checked during code review, and no slow query monitoring was configured in production. Six weeks later, a promotional campaign pushed order volume 60% above normal, and the query degraded to 340ms. The following week, social media attention drove a further spike: the query reached 2800ms, the connection pool was exhausted, and the service appeared to be down.
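The step from slow queries to an exhausted connection pool follows from Little's law: the average number of queries in flight equals the arrival rate multiplied by the average latency. The sketch below is a rough illustration, not a reconstruction of the incident. The arrival rate (ASSUMED_QPS) and pool size (ASSUMED_POOL_SIZE) are hypothetical values chosen for the example; only the three latencies come from the timeline above.

```python
def connections_in_use(queries_per_second: float, latency_ms: float) -> float:
    """Little's law: average in-flight queries = arrival rate x latency."""
    return queries_per_second * (latency_ms / 1000.0)


# Hypothetical values for illustration only; neither figure is from the incident.
ASSUMED_QPS = 50.0
ASSUMED_POOL_SIZE = 20

# Latencies reported in the post: development, campaign spike, incident.
for label, latency_ms in [("development", 3), ("campaign spike", 340), ("incident", 2800)]:
    in_flight = connections_in_use(ASSUMED_QPS, latency_ms)
    status = "exhausted" if in_flight > ASSUMED_POOL_SIZE else "ok"
    print(f"{label}: {in_flight:.2f} connections busy on average ({status})")
```

Under these assumed numbers, the 340ms case keeps about 17 of 20 connections busy, which still works but leaves little headroom, while 2800ms would need about 140. That is why the first spike degraded the service and the second one took it down.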

The Fix

Adding the composite index took 47 seconds once the root cause was identified. The 4-hour gap between incident start and resolution was entirely investigation time. Without query-level monitoring, the problem had to be diagnosed from application logs rather than database metrics.
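To show what the fix changes, the sketch below builds a small order_items table and compares the query plan before and after adding a composite index on (product_sku, status). It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; production ran PostgreSQL, whose planner and plan output differ. The table schema, the sample data, and the index name idx_order_items_sku_status are assumptions made for the example.

```python
import sqlite3


def plan_for(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return the plan text SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable plan detail.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema; only the table and filter column names come from the post.
conn.execute(
    "CREATE TABLE order_items ("
    " id INTEGER PRIMARY KEY, order_id INTEGER, product_sku TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO order_items (order_id, product_sku, status) VALUES (?, ?, ?)",
    [(i, f"SKU-{i % 500}", "pending" if i % 3 else "shipped") for i in range(10_000)],
)

QUERY = "SELECT id FROM order_items WHERE product_sku = ? AND status = ?"
PARAMS = ("SKU-42", "pending")

# Without an index, the only option is to read every row.
plan_before = plan_for(conn, QUERY, PARAMS)

# Composite index covering both filter columns (index name is illustrative).
conn.execute(
    "CREATE INDEX idx_order_items_sku_status ON order_items (product_sku, status)"
)
plan_after = plan_for(conn, QUERY, PARAMS)

print("before:", plan_before)
print("after: ", plan_after)
```

The plan should change from a full table scan to an index search. On PostgreSQL the equivalent statement is also CREATE INDEX, and CREATE INDEX CONCURRENTLY is the variant that builds the index without blocking writes. The post does not say which form was used.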

Process Changes Implemented

  • Every PR that adds or modifies a database query now requires EXPLAIN ANALYZE output in the PR description, run against production-scale data volumes in a staging environment
  • pg_stat_statements extension added to all production PostgreSQL instances
  • Alert configured on any query whose average execution time exceeds 100ms over a 5-minute sliding window
  • Load test added to the staging CI pipeline, validating query performance under 3× current production request volume
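The alerting rule in the list above can be sketched as a sliding-window mean. This is a minimal illustration of the logic only. The class name and its interface are hypothetical, and a real deployment would more likely evaluate the rule in the monitoring system against figures sampled from pg_stat_statements, whose counters are cumulative and so need differencing between snapshots to get a windowed average.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowLatencyAlert:
    """Tracks one query's latencies and flags when the windowed mean exceeds a threshold."""

    def __init__(self, threshold_ms: float = 100.0, window_seconds: float = 300.0) -> None:
        self.threshold_ms = threshold_ms
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, float]] = deque()  # (timestamp, latency_ms)
        self._total_ms = 0.0

    def record(self, timestamp: float, latency_ms: float) -> bool:
        """Add a sample and return True if the alert should fire.

        Assumes timestamps (in seconds) arrive in non-decreasing order.
        """
        self._samples.append((timestamp, latency_ms))
        self._total_ms += latency_ms
        # Evict samples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            _, old_latency = self._samples.popleft()
            self._total_ms -= old_latency
        return self.mean_ms() > self.threshold_ms

    def mean_ms(self) -> float:
        """Mean latency of the samples currently inside the window."""
        if not self._samples:
            return 0.0
        return self._total_ms / len(self._samples)


alert = SlidingWindowLatencyAlert()  # 100ms threshold, 5-minute window
print(alert.record(0.0, 3.0))     # healthy latency, as seen in development
print(alert.record(60.0, 340.0))  # degraded latency, as seen during the campaign
```

Keeping a running total makes each update cheap, since only expired samples are touched. In this example the second call pushes the windowed mean to 171.5ms, so an alert like this would have fired during the promotional campaign, a week before the outage.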