Incident Severity Levels: A Complete Guide to Classifying and Responding to Incidents
Not every incident deserves the same response. A button misaligned on your settings page is not the same as your API returning 500 errors to every customer.
Incident severity levels give your team a shared language for classifying incidents by impact — so you respond to a full outage differently than a cosmetic bug, without debating priority in the middle of a crisis.
This guide covers how to define incident severity levels that work for growing SaaS teams, with a ready-to-use severity matrix you can copy and adapt today.
What Are Incident Severity Levels?
Incident severity levels are a classification system that categorizes incidents based on their impact on users and business operations. Think of it as triage: when multiple things break at once, severity levels tell your team what to fix first.
Most teams use a 4-level system (SEV-1 through SEV-4), though some use 3 or 5 levels. The exact number matters less than having clear definitions that everyone on your team agrees on.
The goal is simple: when someone on your team says "this is a SEV-2," everyone should understand exactly what that means — how many users are affected, how fast to respond, who gets paged, and how often to communicate.
The 4-Level Incident Severity Framework
Here's a severity level framework designed for SaaS teams. Adapt the specifics to your product, but keep the structure.
SEV-1 — Critical
Definition: Complete service outage or critical functionality unavailable for all or most users.
Examples:
- Your application is completely down
- API returning errors for all requests
- Data loss or data corruption
- Security breach with active exploitation
- Payment processing failure
Response:
- All hands on deck — drop everything
- Page the on-call engineer immediately
- Post a status page update within 5 minutes
- Update customers every 15-30 minutes until resolved
- Executive notification required
SLA target: Acknowledge within 5 minutes. Resolve or mitigate within 1 hour.
SEV-2 — High
Definition: Major feature degraded or unavailable. A significant portion of users are impacted, but the service is not completely down.
Examples:
- Dashboard loading but showing stale data
- Email notifications not sending
- Login working but extremely slow (>10s response times)
- A critical integration (Slack, PagerDuty) is broken
- Monitoring checks failing intermittently
Response:
- On-call engineer begins work immediately
- Post a status page update within 15 minutes
- Update customers every 30-60 minutes
- May escalate to SEV-1 if impact grows
SLA target: Acknowledge within 15 minutes. Resolve within 4 hours.
SEV-3 — Medium
Definition: Minor feature impaired or a workaround exists. A small subset of users is affected.
Examples:
- A single monitoring check returning false positives
- CSV export timing out for large datasets
- UI rendering issue in one browser
- Non-critical API endpoint responding slowly
- Scheduled report delayed by 30+ minutes
Response:
- Addressed during business hours
- No status page update unless customers report it
- Fix in next deployment cycle or expedited if worsening
SLA target: Acknowledge within 2 hours. Resolve within 1 business day.
SEV-4 — Low
Definition: Cosmetic issue, minor inconvenience, or improvement request. No meaningful user impact.
Examples:
- Typo in the UI
- Tooltip displaying incorrect text
- Minor CSS alignment issue
- Documentation outdated
- Feature request disguised as a bug report
Response:
- Added to backlog
- Fixed when convenient or as part of planned work
- No status page update needed
SLA target: No SLA. Fix at team discretion.
Incident Severity Matrix (Copy and Use)
Use this matrix as a starting point. Print it, pin it in Slack, or add it to your runbook.
| Level | Name | User Impact | Response Time | Update Cadence | Status Page? | Who's Involved |
|---|---|---|---|---|---|---|
| SEV-1 | Critical | All / most users | 5 minutes | Every 15-30 min | Yes, within 5 min | All engineers + leadership |
| SEV-2 | High | Large subset | 15 minutes | Every 30-60 min | Yes — within 15 min | On-call + relevant team |
| SEV-3 | Medium | Small subset | 2 hours | As needed | Only if reported | Assigned engineer |
| SEV-4 | Low | Minimal / none | No SLA | None | No | Backlog |
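If you want the matrix somewhere a bot or script can read it, here is a minimal Python sketch of one way to encode it. The class and field names (`SeverityPolicy`, `ack_within`, `update_every`) are illustrative assumptions, not part of any tool. The update cadence uses the shorter end of each range, and SEV-4 is modeled with no acknowledgement deadline, following the "No SLA" target in the SEV-4 section.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    ack_within: Optional[timedelta]    # None = no acknowledgement deadline
    update_every: Optional[timedelta]  # None = no fixed update cadence
    status_page: str                   # "yes", "if reported", or "no"


# Values transcribed from the matrix above; cadence uses the lower bound of each range.
SEVERITY_MATRIX = {
    "SEV-1": SeverityPolicy("Critical", timedelta(minutes=5), timedelta(minutes=15), "yes"),
    "SEV-2": SeverityPolicy("High", timedelta(minutes=15), timedelta(minutes=30), "yes"),
    "SEV-3": SeverityPolicy("Medium", timedelta(hours=2), None, "if reported"),
    "SEV-4": SeverityPolicy("Low", None, None, "no"),
}


def ack_deadline(severity: str, opened_at: datetime) -> Optional[datetime]:
    """Return when the incident should be acknowledged by, or None if there is no target."""
    policy = SEVERITY_MATRIX[severity]
    if policy.ack_within is None:
        return None
    return opened_at + policy.ack_within
```

Calling `ack_deadline("SEV-2", opened_at)` gives you a timestamp 15 minutes after the incident opened, which you could feed into a reminder or paging rule.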
How to Classify Incidents: The Impact-Urgency Model
When an alert fires or a customer reports an issue, your team needs to assign a severity fast — often within the first 2 minutes. Use two dimensions:
Impact — How many users are affected and how badly?
- All users, core functionality broken → High impact
- Subset of users, major feature degraded → Medium impact
- Few users, workaround available → Low impact
Urgency — Is the situation getting worse?
- Revenue loss, data at risk, or security exposure → High urgency
- Worsening slowly, with no immediate risk to revenue or data → Medium urgency
- Stable, not worsening → Low urgency
| | High Urgency | Medium Urgency | Low Urgency |
|---|---|---|---|
| High Impact | SEV-1 | SEV-1 | SEV-2 |
| Medium Impact | SEV-2 | SEV-2 | SEV-3 |
| Low Impact | SEV-2 | SEV-3 | SEV-4 |
Rule of thumb: When in doubt, classify higher. You can always downgrade a SEV-1 to a SEV-2 as you learn more. You can't un-ignore a critical incident.
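The impact-urgency table above is small enough to express as a lookup. The sketch below is one possible encoding in Python; the `Level` enum, the `classify` function, and the `unsure` flag are hypothetical names chosen for illustration. The `unsure` flag applies the rule of thumb by bumping the result one level higher.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# (impact, urgency) -> severity number, transcribed from the table above.
MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 1,
    (Level.HIGH, Level.LOW): 2,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.MEDIUM, Level.MEDIUM): 2,
    (Level.MEDIUM, Level.LOW): 3,
    (Level.LOW, Level.HIGH): 2,
    (Level.LOW, Level.MEDIUM): 3,
    (Level.LOW, Level.LOW): 4,
}


def classify(impact: Level, urgency: Level, unsure: bool = False) -> str:
    """Look up the severity; if the responder is unsure, classify one level higher."""
    sev = MATRIX[(impact, urgency)]
    if unsure:
        sev = max(1, sev - 1)  # a lower number is a higher severity; SEV-1 is the ceiling
    return f"SEV-{sev}"
```

For example, `classify(Level.LOW, Level.HIGH)` returns `"SEV-2"`, matching the bottom-left cell of the table.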
Severity Levels vs. Priority Levels
These terms get confused constantly. They're related but different:
- Severity measures impact — how bad is this for users right now?
- Priority measures order — when should we fix this relative to other work?
A SEV-4 cosmetic bug on your pricing page might get P1 priority because it's costing conversions. A SEV-2 bug affecting 5% of users on an obscure feature might get P3 priority because the workaround is simple.
Severity is set during the incident based on impact. Priority is set after the incident based on business judgment.
Implementing Severity Levels for Your Team
Step 1: Define Your Criteria
Take the framework above and customize it for your product. The key questions:
- What does "all users affected" mean for your product? (If you have 50 customers, one enterprise customer being down might be SEV-1)
- Which features are "critical"? (Your core value proposition — the thing customers pay for)
- What constitutes data loss or security exposure in your context?
Write it down. Put it in your runbook or internal wiki. If it's not written down, it doesn't exist.
Step 2: Build Response Playbooks
For each severity level, document:
- Who gets paged and through what channel
- Response time expectation — how fast should someone acknowledge
- Communication protocol — when to update the status page, who writes the update
- Escalation triggers — when does a SEV-2 become a SEV-1
For detailed communication templates, see our guide on incident communication best practices.
Step 3: Integrate with Your Monitoring
Your monitoring tool should map alert conditions to severity levels automatically where possible:
- HTTP 5xx error rate > 50% → SEV-1 alert
- Response time > 5s for 5 minutes → SEV-2 alert
- SSL certificate expiring in < 7 days → SEV-3 alert
- Uptime check failure from single location → SEV-3 (could be false positive)
- Uptime check failure from multiple locations → SEV-1
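As a rough illustration, the rules above could be evaluated like this. This is a sketch in plain Python, not the configuration format of any particular monitoring tool; the `Snapshot` fields and the `severity_for` function are assumptions made up for the example. Rules are checked from most to least severe, so the highest matching severity wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical summary of current monitoring state."""
    error_rate_5xx: float = 0.0          # fraction of requests returning 5xx, 0.0 to 1.0
    slow_for_seconds: float = 0.0        # how long response time has stayed above 5 s
    ssl_days_left: Optional[int] = None  # None if no certificate is being tracked
    failing_locations: int = 0           # number of locations where the uptime check fails


def severity_for(s: Snapshot) -> Optional[str]:
    """Map a snapshot to a severity using the rules listed above, or None if nothing fires."""
    if s.error_rate_5xx > 0.5 or s.failing_locations >= 2:
        return "SEV-1"
    if s.slow_for_seconds >= 300:  # above 5 s for 5 minutes
        return "SEV-2"
    if s.failing_locations == 1:   # single location: could be a false positive
        return "SEV-3"
    if s.ssl_days_left is not None and s.ssl_days_left < 7:
        return "SEV-3"
    return None
```

Treat the thresholds as starting values and tune them against your own traffic before wiring them to a pager.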
With StatusRay, your monitoring and status page live in one tool: when monitoring detects an issue, you update your status page in one click, with no context-switching between a monitoring dashboard and a separate status page tool.
Step 4: Practice and Iterate
Run tabletop exercises quarterly. Present a scenario ("Your database primary just went down, read replicas are serving stale data") and have the team classify it, assign roles, and walk through the response. This sounds like overkill for a 10-person team, but it takes 30 minutes and prevents confusion during real incidents.
After every SEV-1 and SEV-2 incident, review the severity classification in your post-mortem. Was it classified correctly? Did the response match the severity?
Common Severity Classification Mistakes
The "Everything Is SEV-1" Problem
When every issue gets classified as critical, nothing is critical. Your team gets alert fatigue, response quality drops, and actual SEV-1 incidents get slower responses because everyone is already tired.
Fix it: Track your severity distribution monthly. A healthy ratio looks roughly like:
| Level | Expected % of Total Incidents |
|---|---|
| SEV-1 | 5-10% |
| SEV-2 | 15-25% |
| SEV-3 | 40-50% |
| SEV-4 | 20-30% |
If more than 20% of your incidents are SEV-1, your definitions are too loose.
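If your incident tracker can export a list of severities, the monthly check takes a few lines. The sketch below assumes the input is a plain list of strings such as `"SEV-1"`; the function names are illustrative.

```python
from collections import Counter

LEVELS = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


def severity_distribution(severities):
    """Return each level's share of total incidents as a fraction between 0 and 1."""
    counts = Counter(severities)
    total = sum(counts[level] for level in LEVELS)
    if total == 0:
        return {level: 0.0 for level in LEVELS}
    return {level: counts[level] / total for level in LEVELS}


def sev1_overused(severities, threshold=0.20):
    """True when SEV-1 incidents exceed the threshold share (20% by default)."""
    return severity_distribution(severities)["SEV-1"] > threshold
```

A month with 3 SEV-1 incidents out of 10 would trip the check, which is a signal to revisit your definitions rather than proof that they are wrong.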
Ignoring Business Context
A technical issue that seems minor can have outsized business impact. Your checkout flow being slow during a product launch is not a SEV-3 just because the service is "technically up."
Always consider: who is affected, when it's happening, and what they're trying to do. A 30-second delay on your status page during a customer's outage investigation is more impactful than a 30-second delay on your blog at 3am.
Never Adjusting Severity
Severity isn't permanent. As you investigate, new information changes the picture:
- You thought it was a minor issue but discover data corruption → Upgrade to SEV-1
- You classified it as SEV-1 but found only 3 users are affected → Downgrade to SEV-3
Make it explicit when you change severity and communicate why.
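One way to make severity changes explicit is to refuse a change that has no reason attached and to keep a history you can review in the post-mortem. The following is a minimal sketch; the `Incident` and `SeverityChange` classes are hypothetical and would map onto whatever your incident tracker already stores.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SeverityChange:
    old: str
    new: str
    reason: str
    at: datetime


@dataclass
class Incident:
    title: str
    severity: str
    history: list = field(default_factory=list)

    def change_severity(self, new: str, reason: str) -> str:
        """Record the change with its reason and return a message to post to the team."""
        if not reason.strip():
            raise ValueError("a severity change needs a reason")
        change = SeverityChange(self.severity, new, reason, datetime.now(timezone.utc))
        self.history.append(change)
        self.severity = new
        return f"{self.title}: {change.old} -> {change.new} ({reason})"
```

The returned string is meant to be pasted or sent to your incident channel, so the "why" travels with the change.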
Measuring the Effectiveness of Your Severity Framework
Track these metrics quarterly to know if your severity levels are working:
| Metric | What It Tells You | Watch For |
|---|---|---|
| MTTR by severity level | Whether response matches severity | SEV-1 MTTR > 1 hour |
| Severity distribution | Whether definitions are calibrated | > 20% SEV-1 incidents |
| Reclassification rate | How often initial classification changes | > 30% reclassification |
| Time to classify | Whether criteria are clear enough | > 5 minutes to assign severity |
| Customer complaints vs severity | Whether your levels match customer perception | Complaints on incidents classified SEV-3/4 |
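Two of these metrics are simple to compute from exported incident records. The sketch below assumes you can get, for each incident, its severity with its time to resolve, and its initial and final severity; the function names and input shapes are assumptions for illustration.

```python
from collections import defaultdict
from datetime import timedelta


def mttr_by_severity(incidents):
    """incidents: iterable of (severity, time_to_resolve) pairs. Returns the mean per severity."""
    durations = defaultdict(list)
    for severity, time_to_resolve in incidents:
        durations[severity].append(time_to_resolve)
    return {sev: sum(vals, timedelta()) / len(vals) for sev, vals in durations.items()}


def reclassification_rate(classifications):
    """classifications: iterable of (initial, final) severities. Returns the share that changed."""
    pairs = list(classifications)
    if not pairs:
        return 0.0
    changed = sum(1 for initial, final in pairs if initial != final)
    return changed / len(pairs)
```

Compare the results against the "Watch For" column: a SEV-1 mean above `timedelta(hours=1)` or a reclassification rate above `0.30` is worth a closer look.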
FAQ
How many severity levels should we use? Four is the sweet spot for most SaaS teams. Three levels lack nuance (you'll argue about borderline cases). Five or more add complexity without improving decisions. Start with four and only add more if you have a clear need.
Who decides the severity level? The first responder assigns an initial severity based on available information. Anyone can escalate the severity at any time. Only the incident commander (or equivalent) should downgrade severity.
How often should we review our severity definitions? Quarterly at minimum, and after every major incident. As your product grows and your customer base changes, what counts as "critical" evolves too.
Should different services have different severity levels? Yes, if the services have different user impact. Your API being slow is probably more severe than your admin dashboard being slow. Document service-specific criteria in your runbook so there's no ambiguity.
What's the difference between severity and priority? Severity measures current impact (how bad is it for users). Priority measures importance relative to other work (when should we fix it). A low-severity bug can have high priority if it affects a key customer or revenue. They're related but not the same.
How do incident severity levels relate to SLAs? Your severity levels should map directly to your SLA response and resolution commitments. If your SLA promises 99.9% uptime, any incident causing downtime is at minimum a SEV-2 because it's consuming your error budget.
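To see why any downtime matters under a 99.9% commitment, it helps to turn the percentage into minutes. The sketch below assumes a 30-day measurement window, which is an assumption for the example; use whatever window your SLA actually specifies.

```python
def downtime_budget_minutes(uptime_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window before the uptime target is missed."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def budget_remaining(uptime_percent: float, downtime_so_far: float, window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already recorded in the window."""
    return downtime_budget_minutes(uptime_percent, window_days) - downtime_so_far
```

Under that assumption, 99.9% works out to roughly 43 minutes per 30 days, so a single 20-minute outage uses close to half of the budget.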
Start Classifying Incidents with Confidence
A clear severity framework eliminates the "how bad is this?" debate during incidents. Your team responds faster, communicates consistently, and resolves issues with less chaos.
The framework in this guide works for most SaaS teams out of the box. Copy the severity matrix, customize the examples for your product, and put it where your team can find it during an incident.
And when incidents happen, make sure your customers know about it. StatusRay gives you a professional status page with built-in monitoring — so you detect issues automatically and communicate them in one click.
Create your status page — free →