The Problem: Manual Intervention
Classic scenario:
- 3:00 AM — Pod crashes
- 3:05 AM — Monitoring alert fires
- 3:08 AM — SRE wakes up
- 3:15 AM — SRE logs in, checks logs
- 3:20 AM — SRE restarts the pod
- 3:21 AM — Service works again
21 minutes of downtime for something Kubernetes could have handled on its own.
With self-healing patterns, your systems fix themselves — without human intervention. That's the promise of Kubernetes, but most teams implement these patterns incorrectly.
Pattern 1: Liveness Probes (Monitor Pod Health)
A Liveness Probe checks whether a container is still "alive." If the probe fails, Kubernetes automatically restarts the container.
apiVersion: v1
kind: Pod
metadata:
  name: api-service
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    # Liveness Probe: Check every 10 seconds if /health returns OK
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # Give the container time to start
      periodSeconds: 10        # Check every 10 seconds
      timeoutSeconds: 2        # Timeout after 2 seconds
      failureThreshold: 3      # Restart only after 3 consecutive failures
What the probe does:
- Every 10 seconds, the kubelet sends an HTTP GET to the container's /health endpoint on port 8080
- If the container responds with status 200, all good
- If the probe fails 3 times in a row, Kubernetes restarts the container
- All of this happens automatically, without human intervention
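Liveness checks don't have to be HTTP. If a service exposes no HTTP health endpoint, a TCP or exec probe triggers the same automatic restart; here is a rough sketch, where the port and command are placeholders, not values from the example above:
# TCP check: restart if the port stops accepting connections
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3

# Exec check: restart if the command exits non-zero
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
  periodSeconds: 10
  failureThreshold: 3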
Common mistake: Probe too aggressive
# WRONG: Restarts the container every second for minor issues
livenessProbe:
  periodSeconds: 1        # Too frequent!
  failureThreshold: 1     # Too strict!

# RIGHT: Gives the container time to recover
livenessProbe:
  periodSeconds: 10       # Every 10 seconds
  failureThreshold: 3     # Restart after 3 failures
  initialDelaySeconds: 30 # 30s for startup
Real-world example:
# Node.js Express service
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
// In your app code:
app.get('/health', (req, res) => {
  // Check if the database is still reachable
  if (db.connection.isConnected) {
    res.status(200).json({ status: 'alive' });
  } else {
    res.status(503).json({ status: 'unhealthy' });
  }
});
Pattern 2: Readiness Probes (Is the Service Ready for Traffic?)
The difference from Liveness Probes:
- Liveness: "Is the container still alive?" → Restart it if not
- Readiness: "Can this container handle traffic?" → Don't send it traffic if not
# Readiness: Check if the service can process traffic
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # Check after 5 seconds
  periodSeconds: 5         # Every 5 seconds
  successThreshold: 1      # At least 1 success
  failureThreshold: 2      # After 2 failures: no traffic
Classic scenario without Readiness Probe:
10:00 - Pod starts
10:00 - Kubernetes asks Liveness: "Are you alive?" → YES
10:00 - Kubernetes sends traffic
10:00 - Database connection pool starts initializing (takes 2 min)
10:00-10:02 - Container can't process database queries yet
10:00-10:02 - Error rate 100% because the API isn't ready
With Readiness Probe:
10:00 - Pod starts
10:00 - Kubernetes asks Liveness: "Are you alive?" → YES
10:00 - Kubernetes asks Readiness: "Can you handle traffic?" → NO (DB pool not ready)
10:00 - Kubernetes does NOT send traffic
10:02 - Readiness says YES (DB pool ready)
10:02 - Now Kubernetes sends traffic
10:02 - Error rate 0% because everything is ready
In your code:
// Readiness endpoint
app.get('/ready', async (req, res) => {
  const checks = {
    database: await db.isConnected(),
    cache: (await redis.ping()) === 'PONG', // most Redis clients resolve ping() to 'PONG'
    externalAPI: await checkExternalServiceHealth(),
  };
  if (Object.values(checks).every(c => c === true)) {
    res.status(200).json({ ready: true });
  } else {
    res.status(503).json({ ready: false, details: checks });
  }
});
Pattern 3: Pod Disruption Budgets (PDB)
When Kubernetes updates your cluster, it drains nodes — moving pods to other nodes. Without a PDB, all your pods can go down at the same time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2   # At least 2 replicas must always be running
  selector:
    matchLabels:
      app: api-service
What it does:
- If your cluster runs 5 api-service pods and Kubernetes needs to drain nodes for an update
- Kubernetes keeps at least 2 replicas running at all times
- It evicts at most 3 pods at a time
- The service stays available throughout the update
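The same budget can also be expressed the other way around with maxUnavailable, which is often easier to reason about when the replica count changes. A sketch using the same selector as above:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  maxUnavailable: 1   # Never take down more than 1 pod voluntarily
  selector:
    matchLabels:
      app: api-service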
Without PDB:
Before: 5 API pods running
Cluster update starts
Kubernetes: "I'll stop all 5 pods to update them"
All 5 pods down
Service has 100% downtime for 5 minutes
With PDB (minAvailable: 2):
Before: 5 API pods running
Cluster update starts
Kubernetes: "I need to keep at least 2 running"
Stops 3 pods → 2 pods still running
Traffic goes to the 2 running pods
3 pods update and restart
Remaining 2 pods update
Service had ~0% downtime
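You can watch this budget being enforced: the PDB's status tracks how many pods may be evicted at any moment. For the 5-replica example above it would report roughly the following (the numbers are illustrative):
# kubectl get pdb api-service-pdb -o yaml  (status section)
status:
  currentHealthy: 5        # Healthy pods matching the selector
  desiredHealthy: 2        # Derived from minAvailable
  disruptionsAllowed: 3    # Pods that may be evicted right now
  expectedPods: 5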
Best practice:
# For stateless services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # At least 2 must run
  selector:
    matchLabels:
      tier: api
---
# For databases or stateful services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1      # At most 1 pod down at a time (protects quorum)
  selector:
    matchLabels:
      tier: database
Pattern 4: Horizontal Pod Autoscaling (HPA)
When load increases, scale up. When load decreases, scale down.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # Scale when CPU > 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80    # Scale when memory > 80%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
Scenario:
10:00 - Normal traffic: 2 pods running
10:05 - Black Friday sale starts: traffic 10x
10:06 - Average CPU climbs well past the 70% target
10:07 - HPA detects it, creates 2 new pods
10:08 - 4 pods running, traffic better distributed
10:09 - CPU drops to 65%, everything stable
11:00 - Sale over, traffic returns to normal
11:05 - CPU under 70%, HPA starts downscaling
11:10 - Back to 2 pods (saving costs)
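One prerequisite that is easy to miss: Utilization targets are percentages of each container's resource requests (see Pattern 5), so the target Deployment must set requests or the HPA has nothing to compute against. A minimal sketch, reusing the values from Pattern 5:
# In the api-service Deployment spec
containers:
- name: api
  image: my-api:v1
  resources:
    requests:
      cpu: 250m       # 70% utilization target = ~175m of actual CPU per pod
      memory: 256Mi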
Common mistake:
# WRONG: Too aggressive thresholds
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 30   # Scales up and down constantly (flapping)

# RIGHT: Reasonable thresholds
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70   # Gives you buffer
Pattern 5: Resource Limits and Requests
Resource limits prevent one pod from consuming all of a node's resources and starving other pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: my-api:v1
        resources:
          # Request: Minimum resources the pod needs
          # Kubernetes schedules on a node with enough capacity
          requests:
            cpu: 250m       # 250 milliCPU = 1/4 CPU
            memory: 256Mi   # 256 MB
          # Limit: Maximum the pod is allowed to use
          # Above the memory limit the container is OOMKilled; above the CPU limit it is throttled
          limits:
            cpu: 500m       # Maximum 500 milliCPU
            memory: 512Mi   # Maximum 512 MB
Without limits:
10:00 - Pod has a memory leak, uses more and more memory
10:15 - Pod uses 90% of node memory
10:16 - Other pods can't start
10:17 - Cluster is effectively down
10:30 - SRE finds the leak
Total impact: 30+ minutes
With limits (512Mi):
10:00 - Pod has a memory leak
10:05 - Pod reaches 512Mi (= limit)
10:06 - The kernel OOM-kills the container (OOMKilled)
10:07 - Kubernetes restarts the container automatically
10:08 - New instance runs, old one is gone
10:09 - SRE sees OOMKilled events and investigates the root cause
Downtime: ~1-2 minutes instead of 30 minutes
Best practices for limits:
# Node has 4 CPUs, 16GB RAM
# You plan 5 pods on it

# Per pod
requests:
  cpu: 700m       # 5 x 700m = 3500m = 3.5 CPUs (under 4)
  memory: 2Gi     # 5 x 2Gi = 10Gi (under 16Gi)
limits:
  cpu: 1000m      # Give some buffer
  memory: 3Gi
How KI-Ops Helps
KI-Ops detects when your self-healing patterns are misconfigured:
ki-ops analyze --namespace production
# Output:
❌ api-service: No Liveness Probe configured
Recommendation: Add /health endpoint with Liveness Probe
❌ api-service: Memory limit too small (256Mi)
Current usage: ~450Mi average
Recommendation: Increase to 512Mi to prevent OOMKills
⚠️ api-service: Only 1 replica, but PDB says minAvailable: 1
Recommendation: Increase replicas to at least 2 for HA
✅ api-service: HPA and resource limits properly configured
With KI-Ops Pro, misconfigured patterns don't just get flagged — the AI generates a validated PR that fixes them automatically.
Measurable Results
With all 5 patterns properly configured:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Unexpected downtime | 20h/month | 2h/month | 90% less |
| Manual pod restarts | 40x/month | 0x/month | Fully automatic |
| MTTR | 45 min | 5 min | 89% faster |
| SRE time on ops | 80h/month | 10h/month | 88% less toil |
| Cluster update downtime | 30 min | 0 min | Completely seamless |
The Checklist
For every production service:
- [ ] Liveness Probe (checks if container is alive)
- [ ] Readiness Probe (checks if container can handle traffic)
- [ ] Resource Requests set (for scheduling)
- [ ] Resource Limits set (no noisy neighbors)
- [ ] HPA configured (auto-scale)
- [ ] PDB with minAvailable (cluster updates)
- [ ] At least 2 replicas (high availability)
If all of these are in place, your service handles the vast majority of failures on its own, with no one getting paged.
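As a rough sketch of how the checklist items fit together for a single stateless service (names, ports, and numbers are illustrative, not a drop-in manifest):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2                      # At least 2 replicas (HA)
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: my-api:v1
        ports:
        - containerPort: 8080
        livenessProbe:             # Restart if dead
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:            # No traffic until ready
          httpGet: { path: /ready, port: 8080 }
          periodSeconds: 5
          failureThreshold: 2
        resources:                 # Requests for scheduling, limits against noisy neighbors
          requests: { cpu: 250m, memory: 256Mi }
          limits: { cpu: 500m, memory: 512Mi }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 1                  # Sized for the 2-replica floor; raise as replicas grow
  selector:
    matchLabels:
      app: api-service
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
The PDB is deliberately set to minAvailable: 1 rather than 2: with only 2 replicas, a higher value would block node drains entirely.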
Next step: Use KI-Ops free tier to validate these patterns in your cluster — AI-powered analysis, unlimited runs, no license required.