## Why You Need a Kubernetes Incident Playbook
It's 3 AM. PagerDuty fires. Your production cluster has a problem. You open your laptop, bleary-eyed, and stare at a terminal.
Now what?
Without a playbook, the next 45 minutes look like this:
- 5 minutes: Remember which cluster this is
- 10 minutes: Run random kubectl commands hoping something stands out
- 15 minutes: Switch between Grafana, Loki, and the terminal
- 10 minutes: Google the error message
- 5 minutes: Actually fix it
With a playbook, the same incident takes 10–15 minutes. With AI-assisted diagnosis, under 5.
This is that playbook.
## Phase 1: Triage (0–2 Minutes)

**Goal:** Understand the scope and severity. Don't fix anything yet.

### Step 1: Quick Cluster Health Check

```bash
# Are nodes healthy?
kubectl get nodes
# Look for: NotReady, SchedulingDisabled, MemoryPressure, DiskPressure

# What's broken in the affected namespace?
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
# Shows only non-running pods

# Recent events, newest last
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
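To speed this up, you can keep all three checks as one shell function. A minimal sketch (the `triage` name is made up here, not standard tooling):

```bash
# Hypothetical helper (not standard tooling): run all three triage checks at once.
# Usage: triage <namespace>
triage() {
  local ns="${1:?usage: triage <namespace>}"
  echo "== Node health =="
  kubectl get nodes
  echo "== Non-running pods in $ns =="
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running
  echo "== Recent events in $ns =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
}
```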
### Step 2: Classify the Incident

| Severity | Criteria | Response |
|----------|----------|----------|
| SEV1 — Outage | Customer-facing service down, data loss risk | All hands, escalate immediately |
| SEV2 — Degraded | Service slow or partially broken | On-call engineer, notify team lead |
| SEV3 — Warning | Non-critical service affected, no customer impact | On-call engineer, fix during business hours |
| SEV4 — Info | Potential issue detected, no current impact | Document, schedule investigation |
### Step 3: Check If It's Already Known

Before diving in:

- Check your team's #incidents Slack channel
- Check if a deployment happened in the last 30 minutes
- Check if someone else is already investigating

```bash
# Recent deployments
kubectl rollout history deployment/<name> -n <namespace>

# Who changed what recently?
kubectl get events -n <namespace> --field-selector=reason=ScalingReplicaSet | tail -5
```
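If the rollout history is unclear, sorting ReplicaSets by creation time is a rough heuristic for spotting a recent deployment, since the newest ReplicaSet usually belongs to the latest rollout:

```bash
# Newest ReplicaSets sort last; a very young one points at a recent rollout
kubectl get replicasets -n <namespace> --sort-by='.metadata.creationTimestamp'
```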
## Phase 2: Diagnosis (2–10 Minutes)

**Goal:** Find the root cause. Resist the urge to restart things.

### The Diagnostic Flowchart

```
Pod not running?
├── Status: CrashLoopBackOff → Check logs (Phase 2a)
├── Status: ImagePullBackOff → Check image/registry (Phase 2b)
├── Status: Pending → Check scheduling (Phase 2c)
├── Status: OOMKilled → Check resources (Phase 2d)
└── Status: Running but unhealthy → Check probes + metrics (Phase 2e)
```
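The fields this flowchart branches on all live in the pod's status, so you can pull them in one shot. A minimal sketch using standard `PodStatus` fields (the formatting is just one way to slice it):

```bash
# Pod phase (Pending, Running, Succeeded, Failed, Unknown)
kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.phase}{"\n"}'

# Per-container waiting reason, last exit reason, and restart count
kubectl get pod <pod> -n <namespace> -o jsonpath='{range .status.containerStatuses[*]}{.name}: waiting={.state.waiting.reason} lastExit={.lastState.terminated.reason} restarts={.restartCount}{"\n"}{end}'
```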
### Phase 2a: CrashLoopBackOff

The pod starts, crashes, restarts, crashes again.

```bash
# Check the logs from the last crash
kubectl logs <pod> -n <namespace> --previous

# Check the exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State"
# Exit Code 1   = Application error (check logs)
# Exit Code 137 = OOMKilled (increase memory)
# Exit Code 143 = SIGTERM (graceful shutdown, usually fine)

# Check if it's a startup issue
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events"
```
Common causes:
- Missing environment variable or config
- Database connection string wrong after rotation
- New image version has a bug
- Memory limit too low (OOMKilled)
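For the missing-config case, it's worth confirming that every ConfigMap and Secret the pod references via `envFrom` actually exists. A quick sketch (pod and namespace are placeholders):

```bash
# Which ConfigMaps/Secrets does each container pull env vars from?
kubectl get pod <pod> -n <namespace> -o jsonpath='{range .spec.containers[*]}{.name}: cm={.envFrom[*].configMapRef.name} secret={.envFrom[*].secretRef.name}{"\n"}{end}'

# Do they actually exist in the namespace?
kubectl get configmaps,secrets -n <namespace>
```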
### Phase 2b: ImagePullBackOff

The pod can't pull its container image.

```bash
kubectl describe pod <pod> -n <namespace> | grep -A3 "Events"
# Look for: "unauthorized", "not found", "timeout"
```
Common causes:
- Image tag doesn't exist (typo or deleted)
- Registry credentials expired (imagePullSecret)
- Private registry unreachable (network/firewall)
Quick fix:
```bash
# Check which image is requested
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check if the pull secret exists and decodes cleanly
kubectl get secret <pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```
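If the credentials have expired, recreating the pull secret is usually the fastest fix. A sketch with placeholder registry values:

```bash
# Recreate an expired registry credential (fill in your own values)
kubectl delete secret <pull-secret> -n <namespace>
kubectl create secret docker-registry <pull-secret> -n <namespace> \
  --docker-server=<registry-url> \
  --docker-username=<user> \
  --docker-password=<password>
```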
### Phase 2c: Pending Pods

The pod exists but can't be scheduled to a node.

```bash
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events"
# Look for: Insufficient cpu, Insufficient memory, node affinity, taints
```
Common causes:
- Not enough resources on any node
- Node affinity/anti-affinity constraints too strict
- PersistentVolumeClaim not bound
- All nodes tainted and pod has no toleration
Quick check:
```bash
# Available resources across all nodes
kubectl describe nodes | grep -A5 "Allocated resources"

# PVC status
kubectl get pvc -n <namespace>
```
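For the taint case, you can list every node's taints in one view. The raw output is dense, but it shows exactly which tolerations a Pending pod would need:

```bash
# One row per node; a Pending pod needs a toleration matching each taint
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'
```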
### Phase 2d: OOMKilled

The container exceeded its memory limit and the kernel killed it.

```bash
# Confirm OOMKilled
kubectl describe pod <pod> -n <namespace> | grep -i "oomkilled"

# Check actual memory usage (requires metrics-server)
kubectl top pod <pod> -n <namespace> --containers

# Check the configured memory limit
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'
```
Is it a memory leak or just under-provisioned?
- If memory grew slowly over hours → likely a memory leak → needs code fix
- If memory spiked with traffic → under-provisioned → increase limit
- If memory spiked after deployment → new version uses more memory → increase limit or investigate
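If the verdict is "under-provisioned, increase the limit", a minimal sketch using `kubectl set resources` (the 1Gi value is a placeholder; base yours on the `kubectl top` numbers above):

```bash
# Raise the memory limit on the deployment; this triggers a rolling restart
kubectl set resources deployment/<name> -n <namespace> \
  --containers=<container> --limits=memory=1Gi
```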
### Phase 2e: Running But Unhealthy

The pod is running, but liveness/readiness probes fail or latency is high.

```bash
# Check probe configuration
kubectl describe pod <pod> -n <namespace> | grep -A10 "Liveness\|Readiness"

# Check restart counts (liveness probe failures drive restarts)
kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[*].restartCount}'

# Check application logs for errors
kubectl logs <pod> -n <namespace> --since=10m | grep -i "error\|exception\|timeout"
```
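To test the probe endpoint yourself without going through the Service, port-forwarding is a low-risk option (the `/health` path and ports are whatever your probes actually use):

```bash
# Forward the pod's port locally, then hit the probe endpoint directly
kubectl port-forward pod/<pod> -n <namespace> 8080:<port> &
curl -v http://localhost:8080/health
kill %1  # stop the port-forward when done
```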
## Phase 3: Fix (10–15 Minutes)

**Goal:** Apply the minimum change that restores service. Don't over-engineer at 3 AM.

### Quick Fixes (Apply Now, Root-Cause Later)

| Problem | Quick Fix | Permanent Fix (Tomorrow) |
|---------|-----------|--------------------------|
| OOMKilled | `kubectl set resources deploy/<name> --limits=memory=1Gi` | Profile memory usage, set proper limits |
| Image wrong | `kubectl set image deploy/<name> <container>=<good-image>` | Fix CI/CD pipeline |
| Too few replicas | `kubectl scale deploy/<name> --replicas=5` | Configure HPA properly |
| Bad config | `kubectl rollout undo deploy/<name>` | Fix config, redeploy |
| Node pressure | `kubectl drain <node> --ignore-daemonsets` | Add capacity or optimize |
### The Golden Rule: Rollback First, Investigate Later

If a deployment happened in the last 30 minutes and that's when things broke:

```bash
# Roll back to the previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Verify it's recovering
kubectl rollout status deployment/<name> -n <namespace>
```

This restores service in 1–2 minutes. Investigate the root cause during business hours.
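If the immediately previous revision is also bad, you can roll back to a specific known-good revision instead; the numbers come from the rollout history:

```bash
# List revisions, then roll back to a specific known-good one
kubectl rollout history deployment/<name> -n <namespace>
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<revision>
```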
## Phase 4: Verify (15–20 Minutes)

**Goal:** Confirm the fix worked and caused no side effects.

```bash
# Pods healthy?
kubectl get pods -n <namespace>

# No new error events?
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -10

# Application responding?
kubectl exec -it <pod> -n <namespace> -- curl -s localhost:<port>/health

# Metrics back to normal?
# Check the Grafana dashboard for the affected service
```
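A pod can look healthy for thirty seconds and then crash again, so watch for a few minutes before closing the incident. A minimal sketch (`timeout` is GNU coreutils, so adjust on macOS):

```bash
# Stream pod status changes for 5 minutes; new restarts will show up here
timeout 300 kubectl get pods -n <namespace> -w
```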
## Phase 5: Post-Mortem (Next Business Day)

Don't skip this. The post-mortem prevents the next 3 AM wake-up.

### Post-Mortem Template

```markdown
## Incident: [Title]

**Date:** [Date] | **Severity:** SEV[1-4] | **Duration:** [X] minutes

### Timeline

- HH:MM — Alert fired
- HH:MM — Engineer responded
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service recovered

### Root Cause

[One paragraph explaining what actually went wrong]

### What Went Well

- [e.g., "Alert fired within 2 minutes of the issue"]
- [e.g., "Rollback procedure worked smoothly"]

### What Went Poorly

- [e.g., "Took 25 minutes to find the root cause"]
- [e.g., "No runbook existed for this failure mode"]

### Action Items

- [ ] [Fix to prevent recurrence]
- [ ] [Monitoring improvement]
- [ ] [Documentation update]
```
## How AI Cuts Each Phase

| Phase | Manual | With AI | Savings |
|-------|--------|---------|---------|
| Triage | 2–5 min | 30 sec (auto-classified) | 80% |
| Diagnosis | 15–30 min | 30–60 sec (auto-analyzed) | 95% |
| Fix | 10–20 min | 2–3 min (auto-PR) | 80% |
| Verify | 5 min | 2 min (auto-monitored) | 60% |
| Total | 35–60 min | 5–7 min | ~90% |
The biggest impact is in the diagnosis phase. AI tools like KI-Ops run all diagnostic commands in parallel, correlate logs with metrics and cluster state, and present a synthesized root-cause analysis — eliminating 95% of manual investigation time.
## Start With the Playbook, Then Automate

- Print this playbook (or save it in your team's wiki). Having structured steps reduces panic at 3 AM.
- Time your next 5 incidents. Measure how long each phase takes.
- Automate the diagnosis phase first. Per the table above, it's the single biggest time sink (15–30 minutes of a 35–60 minute incident).
- Try AI-powered analysis. KI-Ops free tier gives you unlimited cluster analysis with AI root-cause identification — no license required.