The Problem: Manual Intervention
Classic scenario:
- 3:00 AM — Pod crashes
- 3:05 AM — Monitoring alert fires
- 3:08 AM — SRE wakes up
- 3:15 AM — SRE logs in, checks logs
- 3:20 AM — SRE restarts the pod
- 3:21 AM — Service works again
21 minutes of downtime for something Kubernetes could have handled on its own.
With self-healing patterns, your systems fix themselves — without human intervention. That's the promise of Kubernetes, but most teams implement these patterns incorrectly.
Pattern 1: Liveness Probes (Monitor Pod Health)
A Liveness Probe checks whether a container is still "alive." If the probe fails, Kubernetes automatically restarts the container.
apiVersion: v1
kind: Pod
metadata:
  name: api-service
spec:
  containers:
  - name: api
    image: my-api:v1
    ports:
    - containerPort: 8080
    # Liveness Probe: Check every 10 seconds if /health returns OK
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30  # Give the container time to start
      periodSeconds: 10        # Check every 10 seconds
      timeoutSeconds: 2        # Timeout after 2 seconds
      failureThreshold: 3      # Restart only after 3 consecutive failures
What the probe does:
- Every 10 seconds, the kubelet sends an HTTP GET to the container's /health endpoint on port 8080
- If the container responds with status 200, all good
- If the probe fails 3 times in a row, Kubernetes restarts the container
- All of this happens automatically, without human intervention
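Liveness checks don't have to be HTTP. If a service exposes no HTTP health endpoint, a TCP or exec probe triggers the same automatic restart; here is a rough sketch, where the port and command are placeholders, not values from the example above:
# TCP check: restart if the port stops accepting connections
livenessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3

# Exec check: restart if the command exits non-zero
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
  periodSeconds: 10
  failureThreshold: 3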
Common mistake: Probe too aggressive
# WRONG: Restarts the container every second for minor issues
livenessProbe:
  periodSeconds: 1        # Too frequent!
  failureThreshold: 1     # Too strict!

# RIGHT: Gives the container time to recover
livenessProbe:
  periodSeconds: 10       # Every 10 seconds
  failureThreshold: 3     # Restart after 3 failures
  initialDelaySeconds: 30 # 30s for startup
Real-world example:
# Node.js Express service
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
// In your app code:
app.get('/health', (req, res) => {
  // Check if the database is still reachable
  if (db.connection.isConnected) {
    res.status(200).json({ status: 'alive' });
  } else {
    res.status(503).json({ status: 'unhealthy' });
  }
});
Pattern 2: Readiness Probes (Is the Service Ready for Traffic?)
The difference from Liveness Probes:
- Liveness: "Is the container still alive?" → Restart it if not
- Readiness: "Can this container handle traffic?" → Don't send it traffic if not
# Readiness: Check if the service can process traffic
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5   # Check after 5 seconds
  periodSeconds: 5         # Every 5 seconds
  successThreshold: 1      # At least 1 success
  failureThreshold: 2      # After 2 failures: no traffic
Classic scenario without Readiness Probe:
10:00 - Pod starts
10:00 - Kubernetes asks Liveness: "Are you alive?" → YES
10:00 - Kubernetes sends traffic
10:00 - Database connection pool starts initializing (takes 2 min)
10:00-10:02 - Container can't process database queries yet
10:00-10:02 - Error rate 100% because the API isn't ready
With Readiness Probe:
10:00 - Pod starts
10:00 - Kubernetes asks Liveness: "Are you alive?" → YES
10:00 - Kubernetes asks Readiness: "Can you handle traffic?" → NO (DB pool not ready)
10:00 - Kubernetes does NOT send traffic
10:02 - Readiness says YES (DB pool ready)
10:02 - Now Kubernetes sends traffic
10:02 - Error rate 0% because everything is ready
In your code:
// Readiness endpoint
app.get('/ready', async (req, res) => {
  const checks = {
    database: await db.isConnected(),
    cache: (await redis.ping()) === 'PONG', // most Redis clients resolve ping() to 'PONG'
    externalAPI: await checkExternalServiceHealth(),
  };
  if (Object.values(checks).every(c => c === true)) {
    res.status(200).json({ ready: true });
  } else {
    res.status(503).json({ ready: false, details: checks });
  }
});
Pattern 3: Pod Disruption Budgets (PDB)
When Kubernetes updates your cluster, it drains nodes — moving pods to other nodes. Without a PDB, all your pods can go down at the same time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2   # At least 2 replicas must always be running
  selector:
    matchLabels:
      app: api-service
What it does:
- If your cluster runs 5 api-service pods and Kubernetes needs to drain nodes for an update
- Kubernetes keeps at least 2 replicas running at all times
- It evicts at most 3 pods at a time
- The service stays available throughout the update
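The same budget can also be expressed the other way around with maxUnavailable, which is often easier to reason about when the replica count changes. A sketch using the same selector as above:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  maxUnavailable: 1   # Never take down more than 1 pod voluntarily
  selector:
    matchLabels:
      app: api-service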
Without PDB:
Before: 5 API pods running
Cluster update starts
Kubernetes: "I'll stop all 5 pods to update them"
All 5 pods down
Service has 100% downtime for 5 minutes
With PDB (minAvailable: 2):
Before: 5 API pods running
Cluster update starts
Kubernetes: "I need to keep at least 2 running"
Stops 3 pods → 2 pods still running
Traffic goes to the 2 running pods
3 pods update and restart
Remaining 2 pods update
Service had ~0% downtime
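You can watch this budget being enforced: the PDB's status tracks how many pods may be evicted at any moment. For the 5-replica example above it would report roughly the following (the numbers are illustrative):
# kubectl get pdb api-service-pdb -o yaml  (status section)
status:
  currentHealthy: 5        # Healthy pods matching the selector
  desiredHealthy: 2        # Derived from minAvailable
  disruptionsAllowed: 3    # Pods that may be evicted right now
  expectedPods: 5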
Best practice:
# For stateless services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # At least 2 must run
  selector:
    matchLabels:
      tier: api
---
# For databases or stateful services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1      # At most 1 pod down at a time (protects quorum)
  selector:
    matchLabels:
      tier: database
Pattern 4: Horizontal Pod Autoscaling (HPA)
When load increases, scale up. When load decreases, scale down.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # Scale when CPU > 70%
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80    # Scale when memory > 80%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
Scenario:
10:00 - Normal traffic: 2 pods running
10:05 - Black Friday sale starts: traffic 10x
10:06 - Average CPU climbs well past the 70% target
10:07 - HPA detects it, creates 2 new pods
10:08 - 4 pods running, traffic better distributed
10:09 - CPU drops to 65%, everything stable
11:00 - Sale over, traffic returns to normal
11:05 - CPU under 70%, HPA starts downscaling
11:10 - Back to 2 pods (saving costs)
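One prerequisite that is easy to miss: Utilization targets are percentages of each container's resource requests (see Pattern 5), so the target Deployment must set requests or the HPA has nothing to compute against. A minimal sketch, reusing the values from Pattern 5:
# In the api-service Deployment spec
containers:
- name: api
  image: my-api:v1
  resources:
    requests:
      cpu: 250m       # 70% utilization target = ~175m of actual CPU per pod
      memory: 256Mi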
Common mistake:
# WRONG: Too aggressive thresholds
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 30   # Scales up and down constantly (flapping)

# RIGHT: Reasonable thresholds
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70   # Gives you buffer
Pattern 5: Resource Limits and Requests
Resource limits prevent one pod from consuming all of a node's resources and starving other pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: my-api:v1
        resources:
          # Request: Minimum resources the pod needs
          # Kubernetes schedules on a node with enough capacity
          requests:
            cpu: 250m       # 250 milliCPU = 1/4 CPU
            memory: 256Mi   # 256 MB
          # Limit: Maximum the pod is allowed to use
          # Above the memory limit the container is OOMKilled; above the CPU limit it is throttled
          limits:
            cpu: 500m       # Maximum 500 milliCPU
            memory: 512Mi   # Maximum 512 MB
Without limits:
10:00 - Pod has a memory leak, uses more and more memory
10:15 - Pod uses 90% of node memory
10:16 - Other pods can't start
10:17 - Cluster is effectively down
10:30 - SRE finds the leak
Total impact: 30+ minutes
With limits (512Mi):
10:00 - Pod has a memory leak
10:05 - Pod reaches 512Mi (= limit)
10:06 - The kernel OOM-kills the container (OOMKilled)
10:07 - Kubernetes restarts the container automatically
10:08 - New instance runs, old one is gone
10:09 - SRE sees OOMKilled events and investigates the root cause
Downtime: ~1-2 minutes instead of 30 minutes
Best practices for limits:
# Node has 4 CPUs, 16GB RAM
# You plan 5 pods on it

# Per pod
requests:
  cpu: 700m       # 5 x 700m = 3500m = 3.5 CPUs (under 4)
  memory: 2Gi     # 5 x 2Gi = 10Gi (under 16Gi)
limits:
  cpu: 1000m      # Give some buffer
  memory: 3Gi
How KI-Ops Helps
KI-Ops detects when your self-healing patterns are misconfigured:
ki-ops analyze --namespace production
# Output:
❌ api-service: No Liveness Probe configured
Recommendation: Add /health endpoint with Liveness Probe
❌ api-service: Memory limit too small (256Mi)
Current usage: ~450Mi average
Recommendation: Increase to 512Mi to prevent OOMKills
⚠️ api-service: Only 1 replica, but PDB says minAvailable: 1
Recommendation: Increase replicas to at least 2 for HA
✅ api-service: HPA and resource limits properly configured
With KI-Ops Pro, misconfigured patterns don't just get flagged — the AI generates a validated PR that fixes them automatically.
Measurable Results
With all 5 patterns properly configured:
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Unexpected downtime | 20h/month | 2h/month | 90% less |
| Manual pod restarts | 40x/month | 0x/month | Fully automatic |
| MTTR | 45 min | 5 min | 89% faster |
| SRE time on ops | 80h/month | 10h/month | 88% less toil |
| Cluster update downtime | 30 min | 0 min | Completely seamless |
The Checklist
For every production service:
- [ ] Liveness Probe (checks if container is alive)
- [ ] Readiness Probe (checks if container can handle traffic)
- [ ] Resource Requests set (for scheduling)
- [ ] Resource Limits set (no noisy neighbors)
- [ ] HPA configured (auto-scale)
- [ ] PDB with minAvailable (cluster updates)
- [ ] At least 2 replicas (high availability)
If all of these are in place, your service handles the vast majority of failures on its own, with no one getting paged.
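As a rough sketch of how the checklist items fit together for a single stateless service (names, ports, and numbers are illustrative, not a drop-in manifest):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 2                      # At least 2 replicas (HA)
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api
        image: my-api:v1
        ports:
        - containerPort: 8080
        livenessProbe:             # Restart if dead
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:            # No traffic until ready
          httpGet: { path: /ready, port: 8080 }
          periodSeconds: 5
          failureThreshold: 2
        resources:                 # Requests for scheduling, limits against noisy neighbors
          requests: { cpu: 250m, memory: 256Mi }
          limits: { cpu: 500m, memory: 512Mi }
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 1                  # Sized for the 2-replica floor; raise as replicas grow
  selector:
    matchLabels:
      app: api-service
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
The PDB is deliberately set to minAvailable: 1 rather than 2: with only 2 replicas, a higher value would block node drains entirely.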
Next step: Use KI-Ops free tier to validate these patterns in your cluster — AI-powered analysis, unlimited runs, no license required.