Incident Response Planning Guide
A practical framework for building incident response capabilities that minimize impact and accelerate recovery.
Incident Response Fundamentals
What is an Incident?
An incident is an unplanned interruption to, or reduction in the quality of, an IT service. Not every alert is an incident, and not every incident is a crisis.
Severity Classification
| Severity | Definition | Example | Response Time |
|---|---|---|---|
| SEV-1 | Complete service outage or critical security breach | Production down, data breach | Immediate (24/7) |
| SEV-2 | Major functionality degraded | Payment processing failing | < 30 minutes |
| SEV-3 | Minor functionality impacted | Feature degraded for subset | < 2 hours |
| SEV-4 | Minimal impact | Cosmetic issue, workaround exists | Next business day |
Incident Response Phases
Phase 1: Detection
Goal: Identify incidents as quickly as possible.
Detection Sources:
- Automated monitoring and alerting
- Customer reports
- Internal user reports
- Security tools
- Synthetic monitoring
Best Practices:
- Alert on symptoms, not just causes
- Reduce alert noise to improve signal
- Ensure alerts are actionable
- Test alerting regularly
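"Alert on symptoms, not just causes" can be sketched as a rule that fires on what users actually experience (error rate, tail latency) rather than on internal causes alone. The thresholds and signature below are illustrative assumptions.

```python
def should_alert(total_requests: int, failed_requests: int,
                 p99_latency_ms: float,
                 error_rate_threshold: float = 0.05,
                 latency_threshold_ms: float = 1000.0) -> bool:
    """Fire when a user-visible symptom crosses a threshold.

    Thresholds are illustrative; tune them to reduce noise while
    keeping alerts actionable."""
    if total_requests == 0:
        return False  # no traffic means nothing user-visible to alert on
    error_rate = failed_requests / total_requests
    return error_rate >= error_rate_threshold or p99_latency_ms >= latency_threshold_ms
```

Cause-based signals (CPU, disk, queue depth) still belong on dashboards; the point is that paging should be driven by symptoms like the two checked here.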
Phase 2: Triage
Goal: Assess severity and mobilize appropriate response.
Triage Questions:
- What is the user impact?
- How many users are affected?
- Is the impact increasing or stable?
- Do we have a workaround?
- What severity level is appropriate?
Actions:
- Classify severity
- Assign incident commander
- Open communication channel
- Begin incident documentation
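The triage questions above can feed a first-pass severity suggestion. The decision rules below are an illustrative assumption to seed discussion, not a standard; the incident commander still makes the final call.

```python
def classify_severity(user_impact: str, affected_fraction: float,
                      impact_increasing: bool, workaround_exists: bool) -> int:
    """Map triage answers to a suggested SEV level (1 = most severe).

    `user_impact` is one of "outage", "major", or "minor"; rules and
    cutoffs are illustrative assumptions."""
    if user_impact == "outage" or affected_fraction >= 0.9:
        return 1
    if user_impact == "major" or (impact_increasing and affected_fraction >= 0.25):
        return 2
    if workaround_exists and affected_fraction < 0.05:
        return 4
    return 3
```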
Phase 3: Investigation
Goal: Understand what is happening and why.
Investigation Approach:
- Gather context: What changed recently? What do logs show?
- Form hypotheses: What could cause these symptoms?
- Test hypotheses: Check evidence for/against each theory
- Narrow focus: Eliminate possibilities systematically
Tools:
- Log aggregation (searching recent events)
- Metrics dashboards (identifying anomalies)
- Distributed tracing (following request paths)
- Deployment history (correlating with changes)
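Correlating the alert with recent changes is usually the fastest hypothesis to test. A minimal sketch, assuming deployments are available as `(service, deployed_at)` tuples (an illustrative shape):

```python
from datetime import datetime, timedelta

def recent_changes(deployments, alert_time, window_hours=2):
    """Return deployments within `window_hours` before the alert,
    newest first -- the usual first suspects during investigation.

    `deployments` is a list of (service, deployed_at) tuples; the
    shape and two-hour default window are illustrative assumptions."""
    window = timedelta(hours=window_hours)
    suspects = [(svc, ts) for svc, ts in deployments
                if alert_time - window <= ts <= alert_time]
    return sorted(suspects, key=lambda d: d[1], reverse=True)
```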
Phase 4: Mitigation
Goal: Reduce or eliminate user impact.
Mitigation Strategies:
| Strategy | When to Use |
|---|---|
| Rollback | Recent deployment caused issue |
| Restart | Service is in bad state |
| Failover | Primary component is unhealthy |
| Scale | Capacity issue |
| Feature flag | Specific feature is problematic |
| Block traffic | Abusive traffic pattern |
Key Principle: Mitigate first, debug later. Reduce impact even if you don’t fully understand the cause.
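The strategy table lends itself to a simple lookup from the diagnosed condition to the fastest impact-reducing action. Condition names below are illustrative assumptions mirroring the table, not canonical labels.

```python
# "Mitigate first" as a lookup table; keys are illustrative condition
# names corresponding to the strategy table above.
MITIGATIONS = {
    "bad_deployment": "rollback",
    "bad_process_state": "restart",
    "unhealthy_primary": "failover",
    "capacity_exhausted": "scale",
    "feature_misbehaving": "feature_flag_off",
    "abusive_traffic": "block_traffic",
}

def first_mitigation(condition: str) -> str:
    """Pick the quickest known mitigation; escalate if nothing matches."""
    return MITIGATIONS.get(condition, "escalate_to_technical_lead")
```

Encoding these pairings (in a runbook or tooling) keeps responders from debating mitigation options under pressure.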
Phase 5: Resolution
Goal: Restore full service and confirm stability.
Resolution Checklist:
- Service metrics returned to normal
- No recurring errors in logs
- Customer-facing impact confirmed resolved
- Temporary mitigations reviewed (either removed or tracked for permanent fixes)
- Stakeholders notified of resolution
Phase 6: Post-Incident
Goal: Learn from the incident and improve.
Post-Incident Activities:
- Conduct blameless post-mortem
- Document timeline and root cause
- Identify action items
- Track action items to completion
- Share learnings broadly
Incident Response Roles
Incident Commander (IC)
Responsibilities:
- Overall incident coordination
- Severity assessment and updates
- Communication decisions
- Resource mobilization
- Declaring incident resolved
Key Behaviors:
- Stay calm and organized
- Delegate technical work
- Keep the big picture
- Communicate proactively
Technical Lead
Responsibilities:
- Lead investigation efforts
- Coordinate technical responders
- Propose and evaluate mitigations
- Ensure proper fixes are implemented
Communications Lead
Responsibilities:
- Draft status updates
- Coordinate with customer-facing teams
- Manage status page updates
- Handle media/PR if needed
Scribe
Responsibilities:
- Document timeline
- Capture decisions and actions
- Record key findings
- Prepare post-incident documentation
Communication During Incidents
Internal Communication
Channels:
- Dedicated incident Slack/Teams channel
- Bridge call for SEV-1/SEV-2
- Ticket for tracking and history
Update Cadence:
| Severity | Update Frequency |
|---|---|
| SEV-1 | Every 15 minutes |
| SEV-2 | Every 30 minutes |
| SEV-3 | Every hour |
| SEV-4 | As needed |
Update Template:
INCIDENT UPDATE - [Time]
Severity: SEV-X
Status: Investigating/Mitigating/Monitoring/Resolved
Current Impact:
[What users are experiencing]
Recent Actions:
[What we've done since last update]
Next Steps:
[What we're doing next]
ETA to Resolution: [Estimate or "Unknown"]
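Filling the template programmatically (from a chat-ops command or incident tool) keeps updates consistent under pressure. A sketch assuming plain-string fields; the function name and signature are hypothetical.

```python
def format_update(time_str: str, severity: int, status: str,
                  impact: str, recent: str, next_steps: str,
                  eta: str = "Unknown") -> str:
    """Render the incident update template above from its fields."""
    return (
        f"INCIDENT UPDATE - [{time_str}]\n"
        f"Severity: SEV-{severity}\n"
        f"Status: {status}\n"
        f"Current Impact:\n{impact}\n"
        f"Recent Actions:\n{recent}\n"
        f"Next Steps:\n{next_steps}\n"
        f"ETA to Resolution: {eta}"
    )
```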
External Communication
Status Page Updates:
- Use clear, non-technical language
- Focus on user impact, not technical details
- Provide realistic expectations
- Update regularly until resolved
Customer Communication:
- Acknowledge the issue promptly
- Explain impact honestly
- Provide updates proactively
- Share post-mortem findings (appropriately redacted)
On-Call Structure
On-Call Rotations
Considerations:
- Rotation length (1 week typical)
- Coverage hours (business hours vs. 24/7)
- Escalation paths
- Backup coverage
Healthy On-Call Practices:
- Clear handoff procedures
- Protected sleep time
- Incident frequency limits
- Post-on-call feedback
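A weekly rotation with a clear handoff point can be computed rather than maintained by hand. A minimal sketch; the epoch date and one-week length are illustrative assumptions.

```python
from datetime import date

def on_call_for(engineers: list[str], day: date,
                epoch: date = date(2024, 1, 1)) -> str:
    """Weekly round-robin: rotate the primary every 7 days from `epoch`.

    `epoch` anchors week zero and is an illustrative assumption; pick
    a date that matches your team's actual handoff day."""
    weeks_elapsed = (day - epoch).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```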
Escalation Paths
Level 1: Primary on-call
- First responder for all alerts
- Triages and handles routine incidents
- Escalates when needed
Level 2: Secondary on-call / Tech lead
- Complex issues requiring expertise
- Multi-team coordination
- Extended duration incidents
Level 3: Leadership / Specialists
- Critical incidents (SEV-1)
- External communication needed
- Major business impact
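The three-level path above can be sketched as a small decision function. The 60-minute threshold and the rule shapes are illustrative assumptions; real escalation also depends on judgment.

```python
def escalation_level(severity: int, minutes_unresolved: int,
                     needs_multi_team: bool) -> int:
    """Suggest an escalation level (1-3) from the incident's shape.

    Thresholds are illustrative assumptions, not a standard."""
    if severity == 1:
        return 3  # critical incidents go straight to leadership/specialists
    if needs_multi_team or minutes_unresolved > 60:
        return 2  # expertise or coordination beyond the primary on-call
    return 1      # primary on-call handles routine incidents
```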
Runbooks
What Makes a Good Runbook
Structure:
# [Alert/Scenario Name]
## Overview
Brief description of what this alert/scenario means.
## Severity Assessment
How to determine severity level.
## Initial Investigation
Step-by-step diagnostic commands:
1. Check [metric/dashboard]
2. Run [command] to verify [thing]
3. Look for [pattern] in logs
## Common Causes and Fixes
### Cause 1: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]
### Cause 2: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]
## Escalation
When and how to escalate.
## Related Resources
Links to dashboards, documentation, contacts.
Runbook Maintenance
- Review a runbook after every incident in which it was used
- Update when systems change
- Test periodically (game days)
- Track which runbooks are most used
Post-Incident Review
Blameless Post-Mortems
Principles:
- Focus on systems, not individuals
- Assume good intentions
- Seek understanding, not blame
- Goal is learning and improvement
Questions to Answer:
- What happened? (Timeline)
- What was the impact?
- What was the root cause?
- What went well in our response?
- What could have gone better?
- What will we do to prevent recurrence?
Post-Mortem Template
# Post-Mortem: [Incident Title]
**Date:** [Date]
**Duration:** [Start - End]
**Severity:** SEV-X
**Author:** [Name]
## Summary
[2-3 sentence summary of what happened]
## Impact
- [User impact]
- [Business impact]
- [Data impact if any]
## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert fired |
| HH:MM | IC assigned |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |
## Root Cause
[Detailed explanation of why this happened]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Name] | [Date] | Open |
| [Action 2] | [Name] | [Date] | Open |
## Lessons Learned
[Key takeaways for the organization]
Measuring Incident Response
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean time to detect | < 5 minutes |
| MTTA | Mean time to acknowledge | < 15 minutes |
| MTTR | Mean time to resolve | Severity dependent |
| Incident volume | Incidents per week | Trending down |
| Repeat incidents | Same root cause | < 10% |
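MTTD, MTTA, and MTTR fall out of incident timestamps directly. A sketch assuming each incident record carries `started`, `detected`, `acknowledged`, and `resolved` datetimes; that schema is an illustrative assumption.

```python
from datetime import datetime

def mean_minutes(pairs):
    """Average gap in minutes over (start, end) datetime pairs."""
    gaps = [(end - start).total_seconds() / 60 for start, end in pairs]
    return sum(gaps) / len(gaps)

def incident_metrics(incidents):
    """Compute MTTD/MTTA/MTTR in minutes from incident records.

    Each record is a dict with 'started', 'detected', 'acknowledged',
    and 'resolved' datetimes -- an illustrative schema."""
    return {
        "mttd": mean_minutes((i["started"], i["detected"]) for i in incidents),
        "mtta": mean_minutes((i["detected"], i["acknowledged"]) for i in incidents),
        "mttr": mean_minutes((i["started"], i["resolved"]) for i in incidents),
    }
```

Note that MTTR here is measured from incident start rather than detection; state whichever convention your team uses, since the targets in the table depend on it.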
Reviewing Metrics
Weekly:
- Incident count by severity
- Longest resolution times
- On-call load balance
Monthly:
- MTTD/MTTA/MTTR trends
- Action item completion rate
- Repeat incident patterns
Quarterly:
- Incident trends analysis
- Process improvement assessment
- Training needs identification
For help building your incident response capability, contact our team.