Incident Response Planning Guide
A practical framework for building incident response capabilities that minimize impact and accelerate recovery.
Incident Response Fundamentals
What is an Incident?
An incident is an unplanned interruption to, or reduction in the quality of, an IT service. Not every alert is an incident, and not every incident is a crisis.
Severity Classification
| Severity | Definition | Example | Response Time |
|---|---|---|---|
| SEV-1 | Complete service outage or critical security breach | Production down, data breach | Immediate (24/7) |
| SEV-2 | Major functionality degraded | Payment processing failing | < 30 minutes |
| SEV-3 | Minor functionality impacted | Feature degraded for subset | < 2 hours |
| SEV-4 | Minimal impact | Cosmetic issue, workaround exists | Next business day |
Incident Response Phases
Phase 1: Detection
Goal: Identify incidents as quickly as possible.
Detection Sources:
- Automated monitoring and alerting
- Customer reports
- Internal user reports
- Security tools
- Synthetic monitoring
Best Practices:
- Alert on symptoms, not just causes
- Reduce alert noise to improve signal
- Ensure alerts are actionable
- Test alerting regularly
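"Alert on symptoms, not just causes" can be sketched as a rule that fires on what users actually experience (error rate, tail latency) rather than on internal causes alone. The thresholds and signature below are illustrative assumptions.

```python
def should_alert(total_requests: int, failed_requests: int,
                 p99_latency_ms: float,
                 error_rate_threshold: float = 0.05,
                 latency_threshold_ms: float = 1000.0) -> bool:
    """Fire when a user-visible symptom crosses a threshold.

    Thresholds are illustrative; tune them to reduce noise while
    keeping alerts actionable."""
    if total_requests == 0:
        return False  # no traffic means nothing user-visible to alert on
    error_rate = failed_requests / total_requests
    return error_rate >= error_rate_threshold or p99_latency_ms >= latency_threshold_ms
```

Cause-based signals (CPU, disk, queue depth) still belong on dashboards; the point is that paging should be driven by symptoms like the two checked here.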
Phase 2: Triage
Goal: Assess severity and mobilize appropriate response.
Triage Questions:
- What is the user impact?
- How many users are affected?
- Is the impact increasing or stable?
- Do we have a workaround?
- What severity level is appropriate?
Actions:
- Classify severity
- Assign incident commander
- Open communication channel
- Begin incident documentation
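The triage questions above can feed a first-pass severity suggestion. The decision rules below are an illustrative assumption to seed discussion, not a standard; the incident commander still makes the final call.

```python
def classify_severity(user_impact: str, affected_fraction: float,
                      impact_increasing: bool, workaround_exists: bool) -> int:
    """Map triage answers to a suggested SEV level (1 = most severe).

    `user_impact` is one of "outage", "major", or "minor"; rules and
    cutoffs are illustrative assumptions."""
    if user_impact == "outage" or affected_fraction >= 0.9:
        return 1
    if user_impact == "major" or (impact_increasing and affected_fraction >= 0.25):
        return 2
    if workaround_exists and affected_fraction < 0.05:
        return 4
    return 3
```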
Phase 3: Investigation
Goal: Understand what is happening and why.
Investigation Approach:
- Gather context: What changed recently? What do logs show?
- Form hypotheses: What could cause these symptoms?
- Test hypotheses: Check evidence for/against each theory
- Narrow focus: Eliminate possibilities systematically
Tools:
- Log aggregation (searching recent events)
- Metrics dashboards (identifying anomalies)
- Distributed tracing (following request paths)
- Deployment history (correlating with changes)
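Correlating the alert with recent changes is usually the fastest hypothesis to test. A minimal sketch, assuming deployments are available as `(service, deployed_at)` tuples (an illustrative shape):

```python
from datetime import datetime, timedelta

def recent_changes(deployments, alert_time, window_hours=2):
    """Return deployments within `window_hours` before the alert,
    newest first -- the usual first suspects during investigation.

    `deployments` is a list of (service, deployed_at) tuples; the
    shape and two-hour default window are illustrative assumptions."""
    window = timedelta(hours=window_hours)
    suspects = [(svc, ts) for svc, ts in deployments
                if alert_time - window <= ts <= alert_time]
    return sorted(suspects, key=lambda d: d[1], reverse=True)
```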
Phase 4: Mitigation
Goal: Reduce or eliminate user impact.
Mitigation Strategies:
| Strategy | When to Use |
|---|---|
| Rollback | Recent deployment caused issue |
| Restart | Service is in bad state |
| Failover | Primary component is unhealthy |
| Scale | Capacity issue |
| Feature flag | Specific feature is problematic |
| Block traffic | Abusive traffic pattern |
Key Principle: Mitigate first, debug later. Reduce impact even if you don’t fully understand the cause.
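The strategy table lends itself to a simple lookup from the diagnosed condition to the fastest impact-reducing action. Condition names below are illustrative assumptions mirroring the table, not canonical labels.

```python
# "Mitigate first" as a lookup table; keys are illustrative condition
# names corresponding to the strategy table above.
MITIGATIONS = {
    "bad_deployment": "rollback",
    "bad_process_state": "restart",
    "unhealthy_primary": "failover",
    "capacity_exhausted": "scale",
    "feature_misbehaving": "feature_flag_off",
    "abusive_traffic": "block_traffic",
}

def first_mitigation(condition: str) -> str:
    """Pick the quickest known mitigation; escalate if nothing matches."""
    return MITIGATIONS.get(condition, "escalate_to_technical_lead")
```

Encoding these pairings (in a runbook or tooling) keeps responders from debating mitigation options under pressure.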
Phase 5: Resolution
Goal: Restore full service and confirm stability.
Resolution Checklist:
- Service metrics returned to normal
- No recurring errors in logs
- Customer-facing impact confirmed resolved
- Temporary mitigations reviewed (either removed or tracked for permanent fixes)
- Stakeholders notified of resolution
Phase 6: Post-Incident
Goal: Learn from the incident and improve.
Post-Incident Activities:
- Conduct blameless post-mortem
- Document timeline and root cause
- Identify action items
- Track action items to completion
- Share learnings broadly
Incident Response Roles
Incident Commander (IC)
Responsibilities:
- Overall incident coordination
- Severity assessment and updates
- Communication decisions
- Resource mobilization
- Declaring incident resolved
Key Behaviors:
- Stay calm and organized
- Delegate technical work
- Keep the big picture
- Communicate proactively
Technical Lead
Responsibilities:
- Lead investigation efforts
- Coordinate technical responders
- Propose and evaluate mitigations
- Ensure proper fixes are implemented
Communications Lead
Responsibilities:
- Draft status updates
- Coordinate with customer-facing teams
- Manage status page updates
- Handle media/PR if needed
Scribe
Responsibilities:
- Document timeline
- Capture decisions and actions
- Record key findings
- Prepare post-incident documentation
Communication During Incidents
Internal Communication
Channels:
- Dedicated incident Slack/Teams channel
- Bridge call for SEV-1/SEV-2
- Ticket for tracking and history
Update Cadence:
| Severity | Update Frequency |
|---|---|
| SEV-1 | Every 15 minutes |
| SEV-2 | Every 30 minutes |
| SEV-3 | Every hour |
| SEV-4 | As needed |
Update Template:
INCIDENT UPDATE - [Time]
Severity: SEV-X
Status: Investigating/Mitigating/Monitoring/Resolved
Current Impact:
[What users are experiencing]
Recent Actions:
[What we've done since last update]
Next Steps:
[What we're doing next]
ETA to Resolution: [Estimate or "Unknown"]
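Filling the template programmatically (from a chat-ops command or incident tool) keeps updates consistent under pressure. A sketch assuming plain-string fields; the function name and signature are hypothetical.

```python
def format_update(time_str: str, severity: int, status: str,
                  impact: str, recent: str, next_steps: str,
                  eta: str = "Unknown") -> str:
    """Render the incident update template above from its fields."""
    return (
        f"INCIDENT UPDATE - [{time_str}]\n"
        f"Severity: SEV-{severity}\n"
        f"Status: {status}\n"
        f"Current Impact:\n{impact}\n"
        f"Recent Actions:\n{recent}\n"
        f"Next Steps:\n{next_steps}\n"
        f"ETA to Resolution: {eta}"
    )
```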
External Communication
Status Page Updates:
- Use clear, non-technical language
- Focus on user impact, not technical details
- Provide realistic expectations
- Update regularly until resolved
Customer Communication:
- Acknowledge the issue promptly
- Explain impact honestly
- Provide updates proactively
- Share post-mortem findings (appropriately redacted)
On-Call Structure
On-Call Rotations
Considerations:
- Rotation length (1 week typical)
- Coverage hours (business hours vs. 24/7)
- Escalation paths
- Backup coverage
Healthy On-Call Practices:
- Clear handoff procedures
- Protected sleep time
- Incident frequency limits
- Post-on-call feedback
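A weekly rotation with a clear handoff point can be computed rather than maintained by hand. A minimal sketch; the epoch date and one-week length are illustrative assumptions.

```python
from datetime import date

def on_call_for(engineers: list[str], day: date,
                epoch: date = date(2024, 1, 1)) -> str:
    """Weekly round-robin: rotate the primary every 7 days from `epoch`.

    `epoch` anchors week zero and is an illustrative assumption; pick
    a date that matches your team's actual handoff day."""
    weeks_elapsed = (day - epoch).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```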
Escalation Paths
Level 1: Primary on-call
- First responder for all alerts
- Triages and handles routine incidents
- Escalates when needed
Level 2: Secondary on-call / Tech lead
- Complex issues requiring expertise
- Multi-team coordination
- Extended duration incidents
Level 3: Leadership / Specialists
- Critical incidents (SEV-1)
- External communication needed
- Major business impact
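The three-level path above can be sketched as a small decision function. The 60-minute threshold and the rule shapes are illustrative assumptions; real escalation also depends on judgment.

```python
def escalation_level(severity: int, minutes_unresolved: int,
                     needs_multi_team: bool) -> int:
    """Suggest an escalation level (1-3) from the incident's shape.

    Thresholds are illustrative assumptions, not a standard."""
    if severity == 1:
        return 3  # critical incidents go straight to leadership/specialists
    if needs_multi_team or minutes_unresolved > 60:
        return 2  # expertise or coordination beyond the primary on-call
    return 1      # primary on-call handles routine incidents
```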
Runbooks
What Makes a Good Runbook
Structure:
# [Alert/Scenario Name]
## Overview
Brief description of what this alert/scenario means.
## Severity Assessment
How to determine severity level.
## Initial Investigation
Step-by-step diagnostic commands:
1. Check [metric/dashboard]
2. Run [command] to verify [thing]
3. Look for [pattern] in logs
## Common Causes and Fixes
### Cause 1: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]
### Cause 2: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]
## Escalation
When and how to escalate.
## Related Resources
Links to dashboards, documentation, contacts.
Runbook Maintenance
- Review a runbook after every incident in which it was used
- Update when systems change
- Test periodically (game days)
- Track which runbooks are most used
Post-Incident Review
Blameless Post-Mortems
Principles:
- Focus on systems, not individuals
- Assume good intentions
- Seek understanding, not blame
- Goal is learning and improvement
Questions to Answer:
- What happened? (Timeline)
- What was the impact?
- What was the root cause?
- What went well in our response?
- What could have gone better?
- What will we do to prevent recurrence?
Post-Mortem Template
# Post-Mortem: [Incident Title]
**Date:** [Date]
**Duration:** [Start - End]
**Severity:** SEV-X
**Author:** [Name]
## Summary
[2-3 sentence summary of what happened]
## Impact
- [User impact]
- [Business impact]
- [Data impact if any]
## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert fired |
| HH:MM | IC assigned |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |
## Root Cause
[Detailed explanation of why this happened]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Name] | [Date] | Open |
| [Action 2] | [Name] | [Date] | Open |
## Lessons Learned
[Key takeaways for the organization]
Measuring Incident Response
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean time to detect | < 5 minutes |
| MTTA | Mean time to acknowledge | < 15 minutes |
| MTTR | Mean time to resolve | Severity dependent |
| Incident volume | Incidents per week | Trending down |
| Repeat incidents | Same root cause | < 10% |
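MTTD, MTTA, and MTTR fall out of incident timestamps directly. A sketch assuming each incident record carries `started`, `detected`, `acknowledged`, and `resolved` datetimes; that schema is an illustrative assumption.

```python
from datetime import datetime

def mean_minutes(pairs):
    """Average gap in minutes over (start, end) datetime pairs."""
    gaps = [(end - start).total_seconds() / 60 for start, end in pairs]
    return sum(gaps) / len(gaps)

def incident_metrics(incidents):
    """Compute MTTD/MTTA/MTTR in minutes from incident records.

    Each record is a dict with 'started', 'detected', 'acknowledged',
    and 'resolved' datetimes -- an illustrative schema."""
    return {
        "mttd": mean_minutes((i["started"], i["detected"]) for i in incidents),
        "mtta": mean_minutes((i["detected"], i["acknowledged"]) for i in incidents),
        "mttr": mean_minutes((i["started"], i["resolved"]) for i in incidents),
    }
```

Note that MTTR here is measured from incident start rather than detection; state whichever convention your team uses, since the targets in the table depend on it.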
Reviewing Metrics
Weekly:
- Incident count by severity
- Longest resolution times
- On-call load balance
Monthly:
- MTTD/MTTA/MTTR trends
- Action item completion rate
- Repeat incident patterns
Quarterly:
- Incident trends analysis
- Process improvement assessment
- Training needs identification
For help building your incident response capability, contact our team.