
How to Write an Incident Post-Mortem: Template and Guide for SaaS Teams

StatusRay Team
Last updated: March 24, 2026

Something broke. You fixed it. Now what?

If your team's answer is "move on and hope it doesn't happen again," you're leaving the most valuable part of incident response on the table. An incident post-mortem turns a bad experience into a system that prevents the next one.

This guide covers how to write effective, blameless post-mortems for SaaS teams — with a ready-to-use template you can copy today.

What Is an Incident Post-Mortem?

An incident post-mortem (also called a post-incident review) is a structured document written after an incident is resolved. It answers four questions:

  • What happened?
  • Why did it happen?
  • What did we do well?
  • What will we change to prevent it?

The key word is blameless. A post-mortem isn't about finding who messed up. It's about finding what in your systems and processes allowed the incident to happen — and fixing those systems.

If people fear blame, they'll hide information. If they trust the process, they'll share details that prevent the next outage.

When Should You Write a Post-Mortem?

Not every incident needs a post-mortem. A good rule for SaaS teams:

  • Always write one for: SEV-1 and SEV-2 incidents, any data loss, any security incident, any incident lasting more than 1 hour
  • Consider writing one for: SEV-3 incidents that reveal systemic issues, near-misses that could have been worse, incidents that generated customer complaints
  • Skip for: SEV-4 issues, incidents resolved in under 5 minutes with no customer impact, known issues with existing fixes in progress
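
If you want this rule applied the same way every time, it can be written down as a small function. The sketch below is one possible reading of the list above, not a standard: the `Incident` fields and the `postmortem_decision` name are made up for illustration, and treating anything that matches neither the "always" nor the "consider" rules as "skip" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names here are hypothetical; map them to your own tracker.
    severity: int                       # 1 = SEV-1 ... 4 = SEV-4
    duration_minutes: float
    data_loss: bool = False
    security_incident: bool = False
    reveals_systemic_issue: bool = False
    near_miss: bool = False
    customer_complaints: bool = False


def postmortem_decision(incident: Incident) -> str:
    """Return "always", "consider", or "skip" for an incident."""
    # "Always" rules: SEV-1/SEV-2, data loss, security, or longer than 1 hour.
    if (
        incident.severity <= 2
        or incident.data_loss
        or incident.security_incident
        or incident.duration_minutes > 60
    ):
        return "always"

    # "Consider" rules: systemic SEV-3s, near-misses, customer complaints.
    if (
        (incident.severity == 3 and incident.reveals_systemic_issue)
        or incident.near_miss
        or incident.customer_complaints
    ):
        return "consider"

    # Assumption: everything else (SEV-4, quick fixes, known issues) is skipped.
    return "skip"
```

For example, `postmortem_decision(Incident(severity=3, duration_minutes=90))` returns `"always"` because the incident ran longer than an hour, even though it is only a SEV-3.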

Timing: Hold the post-mortem within 48 hours of resolution. Memories fade fast: details that seem obvious on day one become fuzzy by day five.

Incident Post-Mortem Template (Copy and Use)

Copy this template and fill it in after your next incident. Keep it in a shared location (wiki, Notion, Google Docs) where the whole team can access it.


[Incident Title] — Post-Mortem

Date: [Date of incident]
Author: [Name]
Severity: [SEV-1 / SEV-2 / SEV-3]
Duration: [Start time] to [End time] - [total duration]
Impact: [Number of users affected, services impacted, revenue impact if known]


Summary

[2-3 sentences describing what happened, what was affected, and how it was resolved. Write this for someone who wasn't involved.]


Timeline

Time (UTC) | Event
[HH:MM] | Issue begins: [what triggered it]
[HH:MM] | Monitoring alert fires / customer reports issue
[HH:MM] | On-call engineer acknowledges and begins investigation
[HH:MM] | Status page updated: incident acknowledged
[HH:MM] | Root cause identified: [brief description]
[HH:MM] | Fix deployed / rollback completed
[HH:MM] | Service confirmed restored: monitoring normal
[HH:MM] | Status page updated: incident resolved

Root Cause

[Describe the root cause in plain language. What actually broke and why? Use the 5 Whys technique to go deeper than the surface-level cause.]

5 Whys:

  1. Why did [symptom]? Because [cause 1].
  2. Why did [cause 1]? Because [cause 2].
  3. Why did [cause 2]? Because [cause 3].
  4. Why did [cause 3]? Because [cause 4].
  5. Why did [cause 4]? Because [root cause].

What Went Well

  • [Thing that worked — e.g., "Monitoring detected the issue within 2 minutes"]
  • [Thing that worked — e.g., "Status page was updated promptly, reducing support tickets"]
  • [Thing that worked — e.g., "Team coordinated effectively in the incident channel"]

What Didn't Go Well

  • [Thing that failed — e.g., "It took 20 minutes to identify the root cause because logs were unclear"]
  • [Thing that failed — e.g., "No runbook existed for this failure mode"]
  • [Thing that failed — e.g., "Customer notification was delayed by 15 minutes"]

Action Items

Action | Owner | Priority | Due Date
[Specific action, e.g., "Add alerting for database connection pool exhaustion"] | [Name] | [P1/P2/P3] | [Date]
[Specific action, e.g., "Create runbook for database failover"] | [Name] | [P1/P2/P3] | [Date]
[Specific action, e.g., "Improve log formatting for payment service"] | [Name] | [P1/P2/P3] | [Date]

Lessons Learned

[1-2 paragraphs summarizing the key takeaways. What did the team learn? What would you tell a new team member about this type of failure?]


How to Run a Post-Mortem Meeting

The document is important, but the meeting is where learning happens. Here's how to run it well.

Before the Meeting

  • Draft the post-mortem before the meeting. The author (usually the incident commander or lead responder) fills in the timeline, root cause, and initial analysis. The meeting is for discussion, not for writing from scratch.
  • Invite the right people: everyone who was involved in the incident response. For a small SaaS team, that's often 3-6 people. If your whole team is under 10, just invite everyone.
  • Set expectations. This is blameless. We're discussing systems, not judging people.

During the Meeting (30-45 minutes)

  1. Walk through the timeline (10 min) — Read through the timeline together. Fill in gaps. Correct any inaccuracies.
  2. Discuss the root cause (10 min) — Does everyone agree on the root cause? Run the 5 Whys together if the author wasn't sure.
  3. Review what went well / didn't go well (10 min) — Add items that others noticed. Be specific.
  4. Define action items (10 min) — Each action item gets an owner and a due date. No action item should be "be more careful" — it should be a concrete system change.

After the Meeting

  • Finalize the document and share it with the team
  • Add action items to your task tracker (Jira, Linear, GitHub Issues)
  • Follow up on action items in your next sprint planning
  • Archive the post-mortem where future team members can find it

The 5 Whys Technique

The 5 Whys is one of the simplest root cause analysis techniques. Keep asking "why" until you reach a systemic cause you can fix.

Example:

  1. Why did the API return 500 errors? Because the database connection pool was exhausted.
  2. Why was the connection pool exhausted? Because a new query was holding connections open for 30+ seconds.
  3. Why was the query taking 30+ seconds? Because it was doing a full table scan on a 10M row table.
  4. Why was it doing a full table scan? Because the migration that added the index failed silently.
  5. Why did the migration fail silently? Because our deployment pipeline doesn't check migration status.

Root cause: Deployment pipeline doesn't verify database migrations completed successfully.

Action item: Add a post-deploy check that verifies all migrations ran. Alert if any migration is in a failed state.

Notice how different this is from "the developer wrote a bad query." The 5 Whys leads you to a system fix (deployment check) instead of a people fix (tell someone to be more careful).
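
The post-deploy check from this example could look something like the following. This is a minimal sketch, assuming migrations are tracked in a hypothetical `schema_migrations` table with `version` and `status` columns; real migration tools each keep their own bookkeeping, so adapt the query to yours. It uses an in-memory SQLite database only so the example runs on its own.

```python
import sqlite3


def find_unapplied_migrations(conn, expected_versions):
    """Return (version, status) for each expected migration that is not applied.

    Assumes a hypothetical table: schema_migrations(version TEXT, status TEXT).
    A migration with no row at all is reported with status "missing".
    """
    rows = conn.execute("SELECT version, status FROM schema_migrations").fetchall()
    status_by_version = dict(rows)

    problems = []
    for version in expected_versions:
        status = status_by_version.get(version, "missing")
        if status != "applied":
            problems.append((version, status))
    return problems


if __name__ == "__main__":
    # Demo data: the index migration failed, mirroring the example above.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schema_migrations (version TEXT PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO schema_migrations VALUES (?, ?)",
        [("001_create_orders", "applied"), ("002_add_orders_index", "failed")],
    )

    problems = find_unapplied_migrations(
        conn, ["001_create_orders", "002_add_orders_index"]
    )
    if problems:
        # In a real pipeline you would fail the deploy and alert here.
        print(f"Deploy check failed, migrations not applied: {problems}")
```

The point of the design is that the check compares what the deploy expected against what the database recorded, so a migration that failed silently or never ran shows up either way.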

What Makes a Good Post-Mortem

Be Specific, Not Vague

Vague (avoid) | Specific (do this)
"Improve monitoring" | "Add alerting for connection pool usage > 80% on production database"
"Better communication next time" | "Update status page within 5 minutes of SEV-1 detection; add to incident checklist"
"Review deployment process" | "Add post-deploy migration status check to CI/CD pipeline by March 15"
"We need more testing" | "Add integration test for payment flow that runs on every deploy"

Keep It the Right Length

For most incidents, a post-mortem should be 1-2 pages. If you're writing 5+ pages, you're over-documenting. If it's a paragraph, you're under-analyzing.

The template above produces a document that's thorough enough to be useful and short enough that people actually read it.

Follow Up on Action Items

A post-mortem without follow-through is a waste of time. Track action item completion:

  • Add every action item to your sprint/task tracker
  • Review open post-mortem action items monthly
  • If an action item keeps getting deprioritized, discuss whether it's actually important or should be dropped

Common Post-Mortem Mistakes

Assigning Blame

"John deployed the bad code" is blame. "A code change was deployed without adequate testing due to missing CI checks" is analysis. The first makes people defensive. The second leads to better systems.

If someone in the meeting starts blaming, redirect: "We're looking at systems, not people. What in our process allowed this to happen?"

Skipping the "What Went Well" Section

It feels awkward to celebrate during a post-mortem, but this section matters. It reinforces good practices and shows the team that their response was valued.

If your monitoring caught the issue in 30 seconds, that's worth noting. If your status page update reduced support tickets by 60%, document it.

Writing Action Items Nobody Owns

"We should add more monitoring" is not an action item. Who adds it? What kind of monitoring? By when?

Every action item needs: what (specific change), who (one owner), when (due date).
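
If your action items live in a tracker or a spreadsheet, the what/who/when rule is simple enough to check automatically. The sketch below is illustrative only: `ActionItem` and `missing_fields` are hypothetical names, and it checks only that each field is present, not that the wording is specific enough.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    what: str                      # the specific change
    owner: Optional[str] = None    # one named owner
    due: Optional[date] = None     # due date


def missing_fields(item: ActionItem) -> List[str]:
    """Return which of "what", "who", "when" are missing from an action item."""
    problems = []
    if not item.what.strip():
        problems.append("what")
    if not (item.owner and item.owner.strip()):
        problems.append("who")
    if item.due is None:
        problems.append("when")
    return problems
```

Running `missing_fields(ActionItem(what="We should add more monitoring"))` returns `["who", "when"]`, which is exactly the gap described above. Whether the "what" is concrete still needs a human reviewer.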

Never Publishing the Post-Mortem

A post-mortem that lives in someone's personal notes helps nobody. Publish it to a shared location where:

  • Current team members can reference it
  • New team members can learn from past incidents
  • You can spot patterns across multiple incidents

Should You Share Post-Mortems With Customers?

For major incidents, yes — but as a simplified version. Customers don't need your internal 5 Whys analysis. They need:

  • What happened (1-2 sentences)
  • How long it lasted
  • What you're doing to prevent it
  • An apology

Many SaaS teams publish these on their status page or blog. It builds trust and shows you take reliability seriously.

With StatusRay, your incident history is automatically visible on your status page. Customers can see past incidents and how you handled them — building a transparency track record over time.

Measuring Post-Mortem Effectiveness

Track these to know if your post-mortem process is working:

Metric | Target | Why It Matters
Post-mortem completion rate | 100% for SEV-1/SEV-2 | Are you doing them consistently?
Time to publish | Within 48 hours | Are you writing them while details are fresh?
Action item completion rate | > 80% within 30 days | Are you following through?
Repeat incident rate | Decreasing over time | Are post-mortems actually preventing recurrence?
Time from detection to status update | Improving over time | Is your communication getting faster?
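
Two of these metrics are easy to compute from data you probably already have. A rough sketch, assuming you can export each action item as a (created, completed) pair of dates and each post-mortem as a resolution and a publication timestamp; the function names are illustrative, not part of any tool.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Sequence, Tuple


def action_item_completion_rate(
    items: Sequence[Tuple[date, Optional[date]]], days: int = 30
) -> Optional[float]:
    """Fraction of action items completed within `days` of being created.

    Each item is (created, completed); completed is None if still open.
    Returns None when there are no items, to avoid dividing by zero.
    """
    if not items:
        return None
    window = timedelta(days=days)
    on_time = sum(
        1
        for created, completed in items
        if completed is not None and completed - created <= window
    )
    return on_time / len(items)


def published_in_time(
    resolved_at: datetime, published_at: datetime, hours: int = 48
) -> bool:
    """True if the post-mortem was published within `hours` of resolution."""
    return published_at - resolved_at <= timedelta(hours=hours)
```

Compare the result of `action_item_completion_rate` against the 0.8 target from the table. Open items count against the rate, so a backlog of untouched action items shows up right away.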

FAQ

How long should a post-mortem take to write? The initial draft should take 30-60 minutes. The review meeting takes another 30-45 minutes. Total time investment: roughly one to two hours per incident. That's a small price for preventing recurrence.

Who should write the post-mortem? The incident commander or lead responder writes the first draft. They were closest to the incident and have the most context. Others contribute during the review meeting.

What if we disagree on the root cause? That's normal and valuable. Document multiple contributing factors if the team can't agree on a single root cause. Most incidents have more than one contributing cause anyway.

Should we post-mortem near-misses? Yes, for significant near-misses. An incident that almost caused a major outage is a free learning opportunity. You get the lessons without the customer impact.

How do we keep post-mortems blameless when someone clearly made a mistake? Reframe: "Why did our system allow this mistake to have this impact?" Human error is usually a symptom of a gap in system design. If one person's mistake can cause a major outage, that's a system problem, not a people problem.

What tools should we use for post-mortems? Keep it simple. A shared document (Google Docs, Notion, Confluence) works fine. The tool matters less than the process. Some teams use their status page's incident history as the starting point and expand from there.

Start Writing Better Post-Mortems Today

Your next incident is an opportunity to make your system more resilient. Copy the template above, customize it for your team, and commit to writing a post-mortem after every significant incident.

The teams that get better at incident response aren't the ones that never have outages. They're the ones that learn from every outage.

And when that outage happens, make sure your customers are informed. StatusRay gives you a professional status page with built-in monitoring — detect issues automatically, communicate them in one click, and build a transparency record your customers can trust.
