
postmortem-writing

Write blameless postmortems that drive learning

Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response processes.

↗ github · ★ 37k·src: wshobson/agents


Postmortem Writing

Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.

When to Use This Skill

  • Conducting post-incident reviews
  • Writing postmortem documents
  • Facilitating blameless postmortem meetings
  • Identifying root causes and contributing factors
  • Creating actionable follow-up items
  • Building organizational learning culture

Core Concepts

1. Blameless Culture

| Blame-Focused             | Blameless                         |
| ------------------------- | --------------------------------- |
| "Who caused this?"        | "What conditions allowed this?"   |
| "Someone made a mistake"  | "The system allowed this mistake" |
| Punish individuals        | Improve systems                   |
| Hide information          | Share learnings                   |
| Fear of speaking up       | Psychological safety              |

2. Postmortem Triggers

  • SEV1 or SEV2 incidents
  • Customer-facing outages > 15 minutes
  • Data loss or security incidents
  • Near-misses that could have been severe
  • Novel failure modes
  • Incidents requiring unusual intervention
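The trigger list above can be encoded as a simple check, which is handy if incident records are already structured. The following is a minimal sketch, not a standard: the `Incident` fields and the `postmortem_reasons` helper are hypothetical names, and the judgment calls (near-miss, novel failure mode, unusual intervention) are assumed to be flagged by a human.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    severity: int                        # 1 = SEV1, 2 = SEV2, 3 = SEV3, ...
    customer_outage_minutes: float = 0.0
    data_loss: bool = False
    security_incident: bool = False
    near_miss: bool = False              # could have been severe (human judgment)
    novel_failure_mode: bool = False
    unusual_intervention: bool = False


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return every trigger from the list that this incident matches."""
    reasons = []
    if incident.severity in (1, 2):
        reasons.append("SEV1 or SEV2 incident")
    if incident.customer_outage_minutes > 15:
        reasons.append("Customer-facing outage > 15 minutes")
    if incident.data_loss or incident.security_incident:
        reasons.append("Data loss or security incident")
    if incident.near_miss:
        reasons.append("Near-miss that could have been severe")
    if incident.novel_failure_mode:
        reasons.append("Novel failure mode")
    if incident.unusual_intervention:
        reasons.append("Incident required unusual intervention")
    return reasons


# A 12-minute SEV3 with no other flags matches no trigger.
minor = Incident(severity=3, customer_outage_minutes=12)
print(postmortem_reasons(minor))
```

An empty result does not mean the incident should be ignored; a quick postmortem can still be worthwhile for minor incidents.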

Quick Start

Postmortem Timeline

Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
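To turn the timeline above into concrete dates, one option is to compute each deadline from the incident date. This sketch assumes calendar days (not business days) and uses the upper bound of each window; `postmortem_schedule` and its keys are hypothetical names.

```python
from datetime import date, timedelta
from typing import Dict


def postmortem_schedule(incident_day: date) -> Dict[str, date]:
    """Latest date for each step, counting the incident day as Day 0."""
    return {
        "draft_due": incident_day + timedelta(days=2),     # Day 1-2
        "meeting_by": incident_day + timedelta(days=5),    # Day 3-5
        "finalize_by": incident_day + timedelta(days=7),   # Day 5-7
    }


schedule = postmortem_schedule(date(2024, 1, 15))
for step, due in schedule.items():
    print(f"{step}: {due.isoformat()}")
```

Teams that work in business days would need to skip weekends and holidays, which this sketch does not attempt.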

Templates and detailed worked examples

Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.

References


### Template 2: 5 Whys Analysis

```markdown
# 5 Whys Analysis: [Incident]

## Problem Statement

The payment service experienced a 47-minute outage caused by database connection exhaustion.

## Analysis

### Why #1: Why did the service fail?

**Answer**: Database connections were exhausted, causing all new requests to fail.

**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.

---

### Why #2: Why were database connections exhausted?

**Answer**: Each incoming request opened a new database connection instead of using the connection pool.

**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.

---

### Why #3: Why did the code bypass the connection pool?

**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.

**Evidence**: PR #1234 shows the change, made while fixing a different bug.

---

### Why #4: Why wasn't this caught in code review?

**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.

**Evidence**: Review comments only discuss business logic.

---

### Why #5: Why isn't there a safety net for this type of change?

**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.

**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.

## Root Causes Identified

1. **Primary**: Missing automated tests for infrastructure behavior
2. **Secondary**: Insufficient documentation of architectural patterns
3. **Tertiary**: Code review checklist doesn't include infrastructure considerations

## Systemic Improvements

| Root Cause    | Improvement                       | Type       |
| ------------- | --------------------------------- | ---------- |
| Missing tests | Add infrastructure behavior tests | Prevention |
| Missing docs  | Document connection patterns      | Prevention |
| Review gaps   | Update review checklist           | Detection  |
| No canary     | Implement canary deployments      | Mitigation |
```

### Template 3: Quick Postmortem (Minor Incidents)

# Quick Postmortem: [Brief Title]

**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3

## What Happened

API latency spiked to 5 s when a full cache flush triggered a cache miss storm.

## Timeline

- 10:00 - Cache flush initiated for config update
- 10:02 - Latency alerts fire
- 10:05 - Identified as cache miss storm
- 10:08 - Enabled cache warming
- 10:12 - Latency normalized

## Root Cause

A full cache flush for a minor config update caused a thundering herd.

## Fix

- Immediate: Enabled cache warming
- Long-term: Implement partial cache invalidation (ENG-999)

## Lessons

Don't flush the entire cache in production; use targeted invalidation.
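
The duration in the header of this quick postmortem can be derived from the timeline rather than typed by hand, which keeps the two consistent. A rough sketch follows, assuming all entries fall on the same day and that the author decides which entries count as detection and mitigation; `minutes_between` and `timeline_metrics` are hypothetical helpers.

```python
from datetime import datetime

# Timeline entries copied from the quick postmortem example.
TIMELINE = [
    ("10:00", "Cache flush initiated for config update"),
    ("10:02", "Latency alerts fire"),
    ("10:05", "Identified as cache miss storm"),
    ("10:08", "Enabled cache warming"),
    ("10:12", "Latency normalized"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def timeline_metrics(timeline, detected_idx: int, mitigated_idx: int) -> dict:
    """Measure detection, mitigation, and total duration from the first entry."""
    start = timeline[0][0]
    return {
        "time_to_detect_min": minutes_between(start, timeline[detected_idx][0]),
        "time_to_mitigate_min": minutes_between(start, timeline[mitigated_idx][0]),
        "duration_min": minutes_between(start, timeline[-1][0]),
    }


# Assumption: the alert is "detection" and enabling cache warming is "mitigation".
metrics = timeline_metrics(TIMELINE, detected_idx=1, mitigated_idx=3)
print(metrics)
```

With these choices the computed duration is 12 minutes, matching the **Duration** field in the example header.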

Facilitation Guide

Running a Postmortem Meeting

## Meeting Structure (60 minutes)

### 1. Opening (5 min)

- Remind everyone of blameless culture
- "We're here to learn, not to blame"
- Review meeting norms

### 2. Timeline Review (15 min)

- Walk through events chronologically
- Ask clarifying questions
- Identify gaps in timeline

### 3. Analysis Discussion (20 min)

- What failed?
- Why did it fail?
- What conditions allowed this?
- What would have prevented it?

### 4. Action Items (15 min)

- Brainstorm improvements
- Prioritize by impact and effort
- Assign owners and due dates

### 5. Closing (5 min)

- Summarize key learnings
- Confirm action item owners
- Schedule follow-up if needed

## Facilitation Tips

- Keep discussion on track
- Redirect blame to systems
- Encourage quiet participants
- Document dissenting views
- Time-box tangents
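
The "Prioritize by impact and effort" step in the Action Items segment can be made explicit with a simple score. This is only one possible heuristic: the 1 to 5 scales, the impact-to-effort ratio, the scores, and the owner names below are all assumptions made for illustration; the item titles are borrowed from the 5 Whys example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    title: str
    owner: str
    impact: int  # 1 (low) to 5 (high), hypothetical scale
    effort: int  # 1 (small) to 5 (large), hypothetical scale


def prioritize(items: List[ActionItem]) -> List[ActionItem]:
    """Highest impact-per-effort first; ties go to the smaller effort."""
    return sorted(items, key=lambda i: (-i.impact / i.effort, i.effort))


items = [
    ActionItem("Add infrastructure behavior tests", "platform-team", impact=5, effort=3),
    ActionItem("Update review checklist", "eng-leads", impact=3, effort=1),
    ActionItem("Implement canary deployments", "release-team", impact=4, effort=5),
]

for item in prioritize(items):
    print(f"{item.title} (owner: {item.owner})")
```

A ratio favors quick wins; a team that wants the largest risk reduction first could sort by impact alone and use effort only to break ties.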

Anti-Patterns to Avoid

| Anti-Pattern        | Problem                   | Better Approach                 |
| ------------------- | ------------------------- | ------------------------------- |
| Blame game          | Shuts down learning       | Focus on systems                |
| Shallow analysis    | Doesn't prevent recurrence | Ask "why" 5 times              |
| No action items     | Waste of time             | Always have concrete next steps |
| Unrealistic actions | Never completed           | Scope to achievable tasks       |
| No follow-up        | Actions forgotten         | Track in ticketing system       |

Best Practices

Do's

  • Start immediately - Memory fades fast
  • Be specific - Exact times, exact errors
  • Include graphs - Visual evidence
  • Assign owners - No orphan action items
  • Share widely - Organizational learning

Don'ts

  • Don't name and shame - Ever
  • Don't skip small incidents - They reveal patterns
  • Don't make it a blame doc - That kills learning
  • Don't create busywork - Actions should be meaningful
  • Don't skip follow-up - Verify actions completed
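
Two of the points above, "No orphan action items" and "Verify actions completed", lend themselves to a periodic automated check. The sketch below assumes action items can be exported from a ticketing system into simple records; the `FollowUp` type, the `audit` helper, and the ticket IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class FollowUp:
    ticket: str              # hypothetical ticket ID
    owner: Optional[str]     # None or "" means nobody owns it
    due: date
    done: bool = False


def audit(items: List[FollowUp], today: date) -> Dict[str, List[str]]:
    """List tickets with no owner and open tickets past their due date."""
    return {
        "orphans": [i.ticket for i in items if not i.owner],
        "overdue": [i.ticket for i in items if not i.done and i.due < today],
    }


items = [
    FollowUp("EX-1", "platform-team", date(2024, 1, 22), done=True),
    FollowUp("EX-2", None, date(2024, 2, 1)),
    FollowUp("EX-3", "eng-leads", date(2024, 1, 20)),
]

print(audit(items, today=date(2024, 1, 25)))
```

Running a check like this before the quarterly pattern review gives the review a concrete list to start from.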
