Incident Response Planning Guide

Building an effective incident response capability for technology organizations.

A practical framework for building incident response capabilities that minimize impact and accelerate recovery.


Incident Response Fundamentals

What is an Incident?

An incident is an unplanned interruption or reduction in quality of an IT service. Not every alert is an incident, and not every incident is a crisis.

Severity Classification

| Severity | Definition | Example | Response Time |
|----------|------------|---------|---------------|
| SEV-1 | Complete service outage or critical security breach | Production down, data breach | Immediate (24/7) |
| SEV-2 | Major functionality degraded | Payment processing failing | < 30 minutes |
| SEV-3 | Minor functionality impacted | Feature degraded for subset | < 2 hours |
| SEV-4 | Minimal impact | Cosmetic issue, workaround exists | Next business day |
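
As an illustration, the response-time column of the severity table can be encoded so that tooling can compute a response deadline. This is a minimal sketch, not a prescribed implementation: the names `Severity`, `RESPONSE_WINDOW`, and `response_deadline` are hypothetical, and SEV-4 returns `None` because "next business day" depends on a business calendar that this guide does not define.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


# Response windows taken from the severity table.
# SEV-4 is omitted on purpose: "next business day" needs a calendar.
RESPONSE_WINDOW = {
    Severity.SEV1: timedelta(0),           # immediate
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=2),
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should begin, or None for SEV-4."""
    window = RESPONSE_WINDOW.get(severity)
    if window is None:
        return None
    return detected_at + window
```

A paging or ticketing integration could call `response_deadline` when an incident is classified and flag any incident that is still unacknowledged after the returned time.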

Incident Response Phases

Phase 1: Detection

Goal: Identify incidents as quickly as possible.

Detection Sources:

  • Automated monitoring and alerting
  • Customer reports
  • Internal user reports
  • Security tools
  • Synthetic monitoring

Best Practices:

  • Alert on symptoms, not just causes
  • Reduce alert noise to improve signal
  • Ensure alerts are actionable
  • Test alerting regularly
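
To make "alert on symptoms" and "reduce alert noise" concrete, here is a small sketch of an alert that fires on the user-visible error rate rather than on a single underlying cause. The class name `ErrorRateAlert` and the default window, threshold, and minimum request count are assumptions chosen for illustration, not recommended values.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlert:
    """Fires when the error ratio over a sliding time window exceeds a threshold."""

    def __init__(self, window_seconds: float = 300.0,
                 threshold: float = 0.05, min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Requiring a minimum sample size suppresses noisy alerts at low traffic.
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, is_error: bool) -> None:
        """Record one request outcome; timestamps must be non-decreasing."""
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_fire(self, now: float) -> bool:
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

Because the check is driven by what users experience (failed requests), it stays useful even when the underlying cause is one nobody anticipated.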

Phase 2: Triage

Goal: Assess severity and mobilize appropriate response.

Triage Questions:

  1. What is the user impact?
  2. How many users are affected?
  3. Is the impact increasing or stable?
  4. Do we have a workaround?
  5. What severity level is appropriate?

Actions:

  • Classify severity
  • Assign incident commander
  • Open communication channel
  • Begin incident documentation
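
The triage questions above can be turned into a first-pass severity suggestion. The sketch below is one possible encoding under stated assumptions: the field names and the rule that a growing impact rules out SEV-4 are hypothetical, and it does not model the number of affected users, which an organization would need to add with its own thresholds. The incident commander still makes the final call.

```python
from dataclasses import dataclass


@dataclass
class TriageAnswers:
    complete_outage: bool
    security_breach: bool
    major_functionality_degraded: bool
    impact_increasing: bool
    workaround_exists: bool


def suggest_severity(answers: TriageAnswers) -> int:
    """Return a suggested SEV level (1 to 4) from triage answers."""
    if answers.complete_outage or answers.security_breach:
        return 1
    if answers.major_functionality_degraded:
        return 2
    # Assumption: minor impact is SEV-4 only when a workaround exists
    # and the impact is stable; otherwise treat it as SEV-3.
    if answers.workaround_exists and not answers.impact_increasing:
        return 4
    return 3
```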

Phase 3: Investigation

Goal: Understand what is happening and why.

Investigation Approach:

  1. Gather context: What changed recently? What do logs show?
  2. Form hypotheses: What could cause these symptoms?
  3. Test hypotheses: Check evidence for/against each theory
  4. Narrow focus: Eliminate possibilities systematically

Tools:

  • Log aggregation (searching recent events)
  • Metrics dashboards (identifying anomalies)
  • Distributed tracing (following request paths)
  • Deployment history (correlating with changes)
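
The "what changed recently?" step often starts by correlating the incident start time with deployment history. The following sketch assumes deployment records are already available as simple objects; the `Deployment` type, the `recent_changes` function, and the two-hour default lookback are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Deployment:
    service: str
    version: str
    deployed_at: datetime


def recent_changes(deployments: Iterable[Deployment],
                   incident_start: datetime,
                   lookback: timedelta = timedelta(hours=2)) -> List[Deployment]:
    """Return deployments in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [
        d for d in deployments
        if window_start <= d.deployed_at <= incident_start
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)
```

Listing the newest change first matches the usual investigation order: the most recent deployment before the symptoms began is typically the first hypothesis to test.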

Phase 4: Mitigation

Goal: Reduce or eliminate user impact.

Mitigation Strategies:

| Strategy | When to Use |
|----------|-------------|
| Rollback | Recent deployment caused issue |
| Restart | Service is in bad state |
| Failover | Primary component is unhealthy |
| Scale | Capacity issue |
| Feature flag | Specific feature is problematic |
| Block traffic | Abusive traffic pattern |

Key Principle: Mitigate first, debug later. Reduce impact even if you don’t fully understand the cause.
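
As a rough sketch, the strategy table can be kept as a lookup that responders or tooling consult during an incident. The condition keys below (`recent_deployment`, `bad_state`, and so on) are hypothetical labels invented for this example; only the strategy names come from the table.

```python
from typing import Dict, Iterable, List

# Observed condition -> mitigation strategy, in the order of the table.
# The condition keys are hypothetical labels for this sketch.
MITIGATIONS: Dict[str, str] = {
    "recent_deployment": "Rollback",
    "bad_state": "Restart",
    "primary_unhealthy": "Failover",
    "capacity": "Scale",
    "feature_problem": "Feature flag",
    "abusive_traffic": "Block traffic",
}


def candidate_mitigations(observed: Iterable[str]) -> List[str]:
    """Return the strategies that match the observed conditions, in table order."""
    observed_set = set(observed)
    unknown = observed_set - set(MITIGATIONS)
    if unknown:
        raise ValueError(f"unknown conditions: {sorted(unknown)}")
    return [
        strategy
        for condition, strategy in MITIGATIONS.items()
        if condition in observed_set
    ]
```

The function only needs observed symptoms, not a confirmed root cause, which is consistent with mitigating first and debugging later.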

Phase 5: Resolution

Goal: Restore full service and confirm stability.

Resolution Checklist:

  • Service metrics returned to normal
  • No recurring errors in logs
  • Customer-facing impact confirmed resolved
  • Temporary mitigations either removed or confirmed safe to leave in place, with follow-up work tracked
  • Stakeholders notified of resolution

Phase 6: Post-Incident

Goal: Learn from the incident and improve.

Post-Incident Activities:

  1. Conduct blameless post-mortem
  2. Document timeline and root cause
  3. Identify action items
  4. Track action items to completion
  5. Share learnings broadly

Incident Response Roles

Incident Commander (IC)

Responsibilities:

  • Overall incident coordination
  • Severity assessment and updates
  • Communication decisions
  • Resource mobilization
  • Declaring incident resolved

Key Behaviors:

  • Stay calm and organized
  • Delegate technical work
  • Keep the big picture
  • Communicate proactively

Technical Lead

Responsibilities:

  • Lead investigation efforts
  • Coordinate technical responders
  • Propose and evaluate mitigations
  • Ensure proper fixes are implemented

Communications Lead

Responsibilities:

  • Draft status updates
  • Coordinate with customer-facing teams
  • Manage status page updates
  • Handle media/PR if needed

Scribe

Responsibilities:

  • Document timeline
  • Capture decisions and actions
  • Record key findings
  • Prepare post-incident documentation

Communication During Incidents

Internal Communication

Channels:

  • Dedicated incident Slack/Teams channel
  • Bridge call for SEV-1/SEV-2
  • Ticket for tracking and history

Update Cadence:

| Severity | Update Frequency |
|----------|------------------|
| SEV-1 | Every 15 minutes |
| SEV-2 | Every 30 minutes |
| SEV-3 | Every hour |
| SEV-4 | As needed |
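
The cadence above lends itself to a simple reminder check. The sketch below is illustrative only; `UPDATE_INTERVAL`, `next_update_due`, and `is_update_overdue` are hypothetical names, and any severity without a fixed interval (SEV-4, or an unrecognized label) is treated as "as needed".

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals from the update cadence table.
# SEV-4 is "as needed", so it has no fixed interval.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=1),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


def is_update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    due = next_update_due(severity, last_update)
    return due is not None and now > due
```

A chat bot in the incident channel could run `is_update_overdue` on a timer and nudge the incident commander when an update is late.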

Update Template:

INCIDENT UPDATE - [Time]
Severity: SEV-X
Status: Investigating/Mitigating/Monitoring/Resolved

Current Impact:
[What users are experiencing]

Recent Actions:
[What we've done since last update]

Next Steps:
[What we're doing next]

ETA to Resolution: [Estimate or "Unknown"]
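
Generating updates from structured fields keeps them consistent across responders. The following is a minimal sketch that renders the template above; the `IncidentUpdate` class and its field names are hypothetical, and the `HH:MM` time format is an assumption since the template only says `[Time]`.

```python
from dataclasses import dataclass
from datetime import datetime

# Status values listed in the template.
VALID_STATUSES = ("Investigating", "Mitigating", "Monitoring", "Resolved")


@dataclass
class IncidentUpdate:
    time: datetime
    severity: str
    status: str
    current_impact: str
    recent_actions: str
    next_steps: str
    eta: str = "Unknown"

    def render(self) -> str:
        """Render the update in the layout of the template."""
        if self.status not in VALID_STATUSES:
            raise ValueError(f"status must be one of {VALID_STATUSES}")
        return "\n".join(
            [
                f"INCIDENT UPDATE - {self.time:%H:%M}",
                f"Severity: {self.severity}",
                f"Status: {self.status}",
                "",
                "Current Impact:",
                self.current_impact,
                "",
                "Recent Actions:",
                self.recent_actions,
                "",
                "Next Steps:",
                self.next_steps,
                "",
                f"ETA to Resolution: {self.eta}",
            ]
        )
```

Rejecting an unknown status at render time catches typos before an update reaches stakeholders.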

External Communication

Status Page Updates:

  • Use clear, non-technical language
  • Focus on user impact, not technical details
  • Provide realistic expectations
  • Update regularly until resolved

Customer Communication:

  • Acknowledge the issue promptly
  • Explain impact honestly
  • Provide updates proactively
  • Share post-mortem findings (appropriately redacted)

On-Call Structure

On-Call Rotations

Considerations:

  • Rotation length (1 week typical)
  • Coverage hours (business hours vs. 24/7)
  • Escalation paths
  • Backup coverage

Healthy On-Call Practices:

  • Clear handoff procedures
  • Protected sleep time
  • Incident frequency limits
  • Post-on-call feedback
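
For the rotation length and backup coverage considerations above, a schedule can be derived from a start date and an ordered roster. This is a sketch under simple assumptions: the function `on_call_for` is hypothetical, the secondary is taken to be the next person in the roster, and real schedules would also need to handle time zones, swaps, and holidays.

```python
from datetime import date
from typing import Sequence, Tuple


def on_call_for(engineers: Sequence[str], rotation_start: date, day: date,
                rotation_days: int = 7) -> Tuple[str, str]:
    """Return (primary, secondary) on-call for a given day.

    Assumption: the secondary is simply the next engineer in the roster,
    so they become primary in the following rotation.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for backup coverage")
    if day < rotation_start:
        raise ValueError("day is before the rotation start")
    index = (day - rotation_start).days // rotation_days
    primary = engineers[index % len(engineers)]
    secondary = engineers[(index + 1) % len(engineers)]
    return primary, secondary
```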

Escalation Paths

Level 1: Primary on-call

  • First responder for all alerts
  • Triages and handles routine incidents
  • Escalates when needed

Level 2: Secondary on-call / Tech lead

  • Complex issues requiring expertise
  • Multi-team coordination
  • Extended duration incidents

Level 3: Leadership / Specialists

  • Critical incidents (SEV-1)
  • External communication needed
  • Major business impact
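
The three levels above can be expressed as a small decision function. This is only a sketch: the parameter names are hypothetical, and the 60-minute cutoff for an "extended duration" incident is an assumption, since the guide does not specify one.

```python
def escalation_level(severity: int,
                     minutes_open: float,
                     multi_team: bool = False,
                     external_comms_needed: bool = False,
                     major_business_impact: bool = False,
                     extended_after_minutes: float = 60.0) -> int:
    """Return the escalation level (1, 2, or 3) suggested by the paths above."""
    # Level 3: critical incidents, external communication, major business impact.
    if severity == 1 or external_comms_needed or major_business_impact:
        return 3
    # Level 2: multi-team coordination or extended duration.
    # The 60-minute default is an assumed value for "extended".
    if multi_team or minutes_open >= extended_after_minutes:
        return 2
    # Level 1: primary on-call handles routine incidents.
    return 1
```

Judgment calls such as "complex issues requiring expertise" are left to the primary on-call and are not modeled here.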

Runbooks

What Makes a Good Runbook

Structure:

# [Alert/Scenario Name]

## Overview
Brief description of what this alert/scenario means.

## Severity Assessment
How to determine severity level.

## Initial Investigation
Step-by-step diagnostic commands:
1. Check [metric/dashboard]
2. Run [command] to verify [thing]
3. Look for [pattern] in logs

## Common Causes and Fixes

### Cause 1: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]

### Cause 2: [Description]
**Symptoms:** [What you'll see]
**Fix:** [Step-by-step remediation]

## Escalation
When and how to escalate.

## Related Resources
Links to dashboards, documentation, contacts.

Runbook Maintenance

  • Review each runbook after every incident in which it was used
  • Update when systems change
  • Test periodically (game days)
  • Track which runbooks are most used

Post-Incident Review

Blameless Post-Mortems

Principles:

  • Focus on systems, not individuals
  • Assume good intentions
  • Seek understanding, not blame
  • Goal is learning and improvement

Questions to Answer:

  1. What happened? (Timeline)
  2. What was the impact?
  3. What was the root cause?
  4. What went well in our response?
  5. What could have gone better?
  6. What will we do to prevent recurrence?

Post-Mortem Template

# Post-Mortem: [Incident Title]

**Date:** [Date]
**Duration:** [Start - End]
**Severity:** SEV-X
**Author:** [Name]

## Summary
[2-3 sentence summary of what happened]

## Impact
- [User impact]
- [Business impact]
- [Data impact if any]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | Alert fired |
| HH:MM | IC assigned |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Incident resolved |

## Root Cause
[Detailed explanation of why this happened]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## What Could Be Improved
- [Improvement 1]
- [Improvement 2]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Name] | [Date] | Open |
| [Action 2] | [Name] | [Date] | Open |

## Lessons Learned
[Key takeaways for the organization]

Measuring Incident Response

Key Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| MTTD | Mean time to detect | < 5 minutes |
| MTTA | Mean time to acknowledge | < 15 minutes |
| MTTR | Mean time to resolve | Severity dependent |
| Incident volume | Incidents per week | Trending down |
| Repeat incidents | Same root cause | < 10% |
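
These metrics can be computed from per-incident timestamps. The sketch below makes several assumptions that should be adapted locally: MTTD is measured from the start of impact to detection, MTTA from detection to acknowledgement, MTTR from the start of impact to resolution, and an incident counts as a repeat when its root-cause label matches an earlier incident in the same data set. The `IncidentRecord` type and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, Sequence


@dataclass(frozen=True)
class IncidentRecord:
    started_at: datetime
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    root_cause: str


def _mean_minutes(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60.0


def repeat_rate(incidents: Sequence[IncidentRecord]) -> float:
    """Fraction of incidents whose root cause already appeared in the data set."""
    counts = Counter(i.root_cause for i in incidents)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(incidents)


def response_metrics(incidents: Sequence[IncidentRecord]) -> Dict[str, float]:
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "mttd_minutes": _mean_minutes(i.detected_at - i.started_at for i in incidents),
        "mtta_minutes": _mean_minutes(i.acknowledged_at - i.detected_at for i in incidents),
        "mttr_minutes": _mean_minutes(i.resolved_at - i.started_at for i in incidents),
        "repeat_rate": repeat_rate(incidents),
    }
```

Running this per severity level makes the "severity dependent" MTTR target easier to evaluate than a single blended average.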

Reviewing Metrics

Weekly:

  • Incident count by severity
  • Longest resolution times
  • On-call load balance

Monthly:

  • MTTD/MTTA/MTTR trends
  • Action item completion rate
  • Repeat incident patterns

Quarterly:

  • Incident trends analysis
  • Process improvement assessment
  • Training needs identification
