Kubernetes Incident Response: The Complete Playbook for On-Call Engineers

A step-by-step incident response playbook for Kubernetes. From alert to resolution: triage, diagnosis, fix, and post-mortem — with the exact kubectl commands you need.

March 11, 2026
KI-Ops Team
Incident Response · Kubernetes

Why You Need a Kubernetes Incident Playbook

It's 3 AM. PagerDuty fires. Your production cluster has a problem. You open your laptop, bleary-eyed, and stare at a terminal.

Now what?

Without a playbook, the next 45 minutes look like this:

  • 5 minutes: Remember which cluster this is
  • 10 minutes: Run random kubectl commands hoping something stands out
  • 15 minutes: Switch between Grafana, Loki, and the terminal
  • 10 minutes: Google the error message
  • 5 minutes: Actually fix it

With a playbook, the same incident takes 10–15 minutes. With AI-assisted diagnosis, under 5.

This is that playbook.

Phase 1: Triage (0–2 Minutes)

Goal: Understand the scope and severity. Don't fix anything yet.

Step 1: Quick Cluster Health Check

# Are nodes healthy?
kubectl get nodes
# Look for: NotReady, SchedulingDisabled, MemoryPressure, DiskPressure

# What's broken in the affected namespace?
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
# Shows only non-running pods

# Recent events (last 10 minutes)
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
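If you run these three checks on every page, they are worth bundling into one helper. A minimal sketch, assuming nothing beyond `kubectl` itself; the `triage` function name and output headers are ours, not standard tooling:

```shell
# triage: run the three triage checks above against one namespace in one go.
# Hypothetical helper -- adapt the output and defaults to your environment.
triage() {
  ns="${1:-default}"   # fall back to the default namespace if none is given

  echo "== Node health =="
  kubectl get nodes

  echo "== Non-running pods in ${ns} =="
  kubectl get pods -n "${ns}" --field-selector=status.phase!=Running

  echo "== Recent events in ${ns} =="
  kubectl get events -n "${ns}" --sort-by='.lastTimestamp' | tail -20
}

# Usage: triage payments
```

Source it from your shell profile so it is one command away at 3 AM.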

Step 2: Classify the Incident

| Severity | Criteria | Response |
|----------|----------|----------|
| SEV1 — Outage | Customer-facing service down, data loss risk | All hands, escalate immediately |
| SEV2 — Degraded | Service slow or partially broken | On-call engineer, notify team lead |
| SEV3 — Warning | Non-critical service affected, no customer impact | On-call engineer, fix during business hours |
| SEV4 — Info | Potential issue detected, no current impact | Document, schedule investigation |

Step 3: Check If It's Already Known

Before diving in:

  • Check your team's #incidents Slack channel
  • Check if a deployment happened in the last 30 minutes
  • Check if someone else is already investigating

# Recent deployments
kubectl rollout history deployment/<name> -n <namespace>

# Who changed what recently?
kubectl get events -n <namespace> --field-selector=reason=ScalingReplicaSet | tail -5

Phase 2: Diagnosis (2–10 Minutes)

Goal: Find the root cause. Resist the urge to restart things.

The Diagnostic Flowchart

Pod not running?
├── Status: CrashLoopBackOff → Check logs (Phase 2a)
├── Status: ImagePullBackOff → Check image/registry (Phase 2b)
├── Status: Pending → Check scheduling (Phase 2c)
├── Status: OOMKilled → Check resources (Phase 2d)
└── Status: Running but unhealthy → Check probes + metrics (Phase 2e)

Phase 2a: CrashLoopBackOff

The pod starts, crashes, restarts, crashes again.

# Check the last crash logs
kubectl logs <pod> -n <namespace> --previous

# Check the exit code
kubectl describe pod <pod> -n <namespace> | grep -A5 "Last State"
# Exit Code 1 = Application error (check logs)
# Exit Code 137 = SIGKILL, usually OOMKilled (increase memory)
# Exit Code 143 = SIGTERM (graceful shutdown, usually fine)

# Check if it's a startup issue
kubectl describe pod <pod> -n <namespace> | grep -A10 "Events"

Common causes:

  • Missing environment variable or config
  • Database connection string wrong after rotation
  • New image version has a bug
  • Memory limit too low (OOMKilled)
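The exit-code mapping above can be captured in a small shell helper. A sketch, with one assumption made explicit: the `explain_exit` name is ours, and the catch-all branch uses the Unix convention that codes above 128 mean "killed by signal (code − 128)":

```shell
# explain_exit: map a container's last exit code to a likely cause.
# Codes above 128 mean the process was killed by signal (code - 128).
explain_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error (check logs)" ;;
    137) echo "SIGKILL, usually OOMKilled (check memory limits)" ;;
    143) echo "SIGTERM (graceful shutdown, usually fine)" ;;
    *)
      if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "unknown application exit code $1"
      fi
      ;;
  esac
}

# Feed it the code straight from the pod status:
# explain_exit "$(kubectl get pod <pod> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

For example, exit code 139 decodes to "killed by signal 11" (SIGSEGV), pointing at a crash rather than the scheduler.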

Phase 2b: ImagePullBackOff

The pod can't pull its container image.

kubectl describe pod <pod> -n <namespace> | grep -A3 "Events"
# Look for: "unauthorized", "not found", "timeout"

Common causes:

  • Image tag doesn't exist (typo or deleted)
  • Registry credentials expired (imagePullSecret)
  • Private registry unreachable (network/firewall)

Quick fix:

# Check which image is requested
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Check if the secret exists and is valid
kubectl get secret <pull-secret> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
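When the secret decodes cleanly, the next question is which registry the pod is actually talking to, so you check the right credentials and firewall rules. A pure-shell sketch (the `registry_of` helper is hypothetical) that extracts the registry host from an image reference, following the Docker convention that a bare name like `nginx:1.25` implies Docker Hub:

```shell
# registry_of: print the registry host an image reference points at.
registry_of() {
  image="$1"
  case "$image" in
    */*) ;;                          # has a path component -- may name a registry
    *)   echo "docker.io"; return ;; # bare name: Docker Hub official image
  esac
  first="${image%%/*}"
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;  # dot, port, or localhost = registry host
    *)                 echo "docker.io" ;;
  esac
}

# registry_of "ghcr.io/acme/api:v1.2"   -> ghcr.io
# registry_of "nginx:1.25"              -> docker.io
```

Pair it with the jsonpath command above: the registry it prints is the one whose credentials and reachability you need to verify.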

Phase 2c: Pending Pods

The pod exists but can't be scheduled to a node.

kubectl describe pod <pod> -n <namespace> | grep -A10 "Events"
# Look for: Insufficient cpu, Insufficient memory, node affinity, taints

Common causes:

  • Not enough resources on any node
  • Node affinity/anti-affinity constraints too strict
  • PersistentVolumeClaim not bound
  • All nodes tainted and pod has no toleration

Quick check:

# Available resources across all nodes
kubectl describe nodes | grep -A5 "Allocated resources"

# PVC status
kubectl get pvc -n <namespace>
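A Pending pod has nowhere to go if no node is both Ready and schedulable. A quick sketch (the `schedulable_count` helper is ours) that counts such nodes from plain `kubectl get nodes` output, where cordoned nodes show `Ready,SchedulingDisabled` in the STATUS column:

```shell
# schedulable_count: given `kubectl get nodes` output on stdin, count nodes
# whose STATUS is exactly "Ready" (cordoned nodes read Ready,SchedulingDisabled).
schedulable_count() {
  awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# kubectl get nodes | schedulable_count
# 0 means nothing can schedule anywhere -- fix the nodes before touching the pod.
```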

Phase 2d: OOMKilled

The container exceeded its memory limit and the kernel killed it.

# Confirm OOMKilled
kubectl describe pod <pod> -n <namespace> | grep -i "oomkilled"

# Check current memory limit vs actual usage
kubectl top pod <pod> -n <namespace> --containers

# Check memory limit
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

Is it a memory leak or just under-provisioned?

  • If memory grew slowly over hours → likely a memory leak → needs code fix
  • If memory spiked with traffic → under-provisioned → increase limit
  • If memory spiked after deployment → new version uses more memory → increase limit or investigate
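Comparing `kubectl top` usage against the configured limit is easier with both in the same unit. A sketch, assuming the limit uses Kubernetes binary suffixes (Ki/Mi/Gi); the `to_bytes` name is illustrative:

```shell
# to_bytes: convert a Kubernetes memory quantity (Ki/Mi/Gi) to plain bytes
# so usage from `kubectl top` can be compared against the limit numerically.
to_bytes() {
  q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$q" ;;   # assume the value is already plain bytes
  esac
}

# to_bytes 512Mi  -> 536870912
# to_bytes 1Gi    -> 1073741824
```

If usage sits within a few percent of the limit for long stretches, treat the pod as under-provisioned even before it gets killed.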

Phase 2e: Running But Unhealthy

Pod is running but liveness/readiness probes fail, or latency is high.

# Check probe configuration
kubectl describe pod <pod> -n <namespace> | grep -A10 "Liveness\|Readiness"

# Check recent restarts due to probe failures
kubectl get pod <pod> -n <namespace> -o jsonpath='{.status.containerStatuses[*].restartCount}'

# Check application logs for errors
kubectl logs <pod> -n <namespace> --since=10m | grep -i "error\|exception\|timeout"
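You can also exercise a probe endpoint by hand: `kubectl port-forward` the pod and hit the path with `curl -s -o /dev/null -w '%{http_code}'`. The kubelet counts any HTTP status from 200 up to, but not including, 400 as success; a sketch of that rule (the `probe_ok` name is ours):

```shell
# probe_ok: mirror the kubelet's HTTP probe success rule --
# any status >= 200 and < 400 counts as healthy.
probe_ok() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# Manual probe check (with `kubectl port-forward pod/<pod> -n <namespace>
# 8080:<port>` running in another terminal):
# code="$(curl -s -o /dev/null -w '%{http_code}' localhost:8080/<probe-path>)"
# probe_ok "$code" && echo healthy || echo "probe failing ($code)"
```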

Phase 3: Fix (10–15 Minutes)

Goal: Apply the minimum change that restores service. Don't over-engineer at 3 AM.

Quick Fixes (Apply Now, Root-Cause Later)

| Problem | Quick Fix | Permanent Fix (Tomorrow) |
|---------|-----------|--------------------------|
| OOMKilled | `kubectl set resources deploy/<name> --limits memory=1Gi` | Profile memory usage, set proper limits |
| Image wrong | `kubectl set image deploy/<name> <container>=<good-image>` | Fix CI/CD pipeline |
| Too few replicas | `kubectl scale deploy/<name> --replicas=5` | Configure HPA properly |
| Bad config | `kubectl rollout undo deploy/<name>` | Fix config, redeploy |
| Node pressure | `kubectl drain <node> --ignore-daemonsets` | Add capacity or optimize |

The Golden Rule: Rollback First, Investigate Later

If a deployment happened in the last 30 minutes and that's when things broke:

# Rollback to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Verify it's recovering
kubectl rollout status deployment/<name> -n <namespace>

This restores service in 1–2 minutes. Investigate the root cause during business hours.
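Plain `rollout undo` returns to the immediately previous revision. If that one is also bad, pin the target with `--to-revision`. A sketch (the `prev_revision` parser is hypothetical) that pulls the second-newest revision number out of the history output:

```shell
# prev_revision: read `kubectl rollout history` output on stdin and print
# the second-newest revision number -- a candidate for `--to-revision`.
prev_revision() {
  awk '$1 ~ /^[0-9]+$/ { rev[++n] = $1 } END { if (n >= 2) print rev[n-1] }'
}

# kubectl rollout undo deployment/<name> -n <namespace> \
#   --to-revision="$(kubectl rollout history deployment/<name> -n <namespace> | prev_revision)"
```

Always eyeball the history output yourself before pinning a revision; the CHANGE-CAUSE column tells you what you are rolling back to.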

Phase 4: Verify (15–20 Minutes)

Goal: Confirm the fix worked and no side effects.

# Pods healthy?
kubectl get pods -n <namespace>

# No new error events?
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -10

# Application responding? (assumes curl is available in the container image)
kubectl exec -it <pod> -n <namespace> -- curl -s localhost:<port>/health

# Metrics back to normal?
# Check Grafana dashboard for the affected service
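Verification is a loop, not a single check: a recovering service can flap while it warms up. A small retry helper keeps re-running a check until it passes; the `retry` name and its parameters are ours, not a kubectl feature:

```shell
# retry <attempts> <delay-seconds> <command...>: re-run a check until it
# succeeds, or give up after the given number of attempts.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# e.g. poll the health endpoint for up to a minute before declaring victory:
# retry 12 5 kubectl exec <pod> -n <namespace> -- curl -sf localhost:<port>/health
```

Only close the incident once the check passes repeatedly, not on the first green result.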

Phase 5: Post-Mortem (Next Business Day)

Don't skip this. The post-mortem prevents the next 3 AM wake-up.

Post-Mortem Template

## Incident: [Title]
**Date:** [Date]  |  **Severity:** SEV[1-4]  |  **Duration:** [X] minutes

### Timeline
- HH:MM — Alert fired
- HH:MM — Engineer responded
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Service recovered

### Root Cause
[One paragraph explaining what actually went wrong]

### What Went Well
- [e.g., "Alert fired within 2 minutes of the issue"]
- [e.g., "Rollback procedure worked smoothly"]

### What Went Poorly
- [e.g., "Took 25 minutes to find the root cause"]
- [e.g., "No runbook existed for this failure mode"]

### Action Items
- [ ] [Fix to prevent recurrence]
- [ ] [Monitoring improvement]
- [ ] [Documentation update]

How AI Cuts Each Phase

| Phase | Manual | With AI | Savings |
|-------|--------|---------|---------|
| Triage | 2–5 min | 30 sec (auto-classified) | 80% |
| Diagnosis | 15–30 min | 30–60 sec (auto-analyzed) | 95% |
| Fix | 10–20 min | 2–3 min (auto-PR) | 80% |
| Verify | 5 min | 2 min (auto-monitored) | 60% |
| Total | 35–60 min | 5–7 min | ~90% |

The biggest impact is in the diagnosis phase. AI tools like KI-Ops run all diagnostic commands in parallel, correlate logs with metrics and cluster state, and present a synthesized root-cause analysis — eliminating 95% of manual investigation time.

Start With the Playbook, Then Automate

  1. Print this playbook (or save it in your team's wiki). Having structured steps reduces panic at 3 AM.
  2. Time your next 5 incidents. Measure how long each phase takes.
  3. Automate the diagnosis phase first. As the table above shows, that's where most of the time goes.
  4. Try AI-powered analysis. KI-Ops free tier gives you unlimited cluster analysis with AI root-cause identification — no license required.

Next step: See how KI-Ops reduces MTTR from 45 to 4 minutes

Questions or feedback?

Drop us a line – we love technical discussions.

Get in Touch