What Auto-Remediation Actually Means
The term "auto-remediation" gets thrown around a lot, but most tools mean something different by it:
| Tool | What they call "auto-remediation" |
|------|-----------------------------------|
| Kubernetes itself | Restart a crashed pod (that's self-healing, not remediation) |
| PagerDuty | Run a predefined script when an alert fires |
| Datadog | Trigger a webhook that calls your API |
| Rundeck | Execute a runbook step by step |
| Actual auto-remediation | AI diagnoses the problem, generates a specific fix, validates it, and creates a PR for your team to review |
Restarting a pod isn't remediation. That's putting a band-aid on a broken leg. The pod will crash again in 5 minutes for the same reason.
Real auto-remediation means: find the root cause, generate the correct fix, validate it won't break anything, and ship it as a reviewable pull request.
The 3 Levels of Kubernetes Auto-Remediation
Level 1: Self-Healing (Built Into Kubernetes)
Kubernetes already does this:
- Restart crashed containers (restartPolicy: Always)
- Replace failed pods (ReplicaSet controller)
- Reschedule pods from failed nodes (node controller evicts, scheduler replaces)
- Scale horizontally (HPA)
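All of this is declarative configuration rather than diagnosis. For example, the horizontal-scaling piece is a one-liner (deployment name and thresholds below are illustrative):

```bash
# Level-1 self-healing: the HPA scales replicas on CPU pressure,
# but it never asks why the pressure exists.
kubectl autoscale deployment api-gateway -n production \
  --min=3 --max=10 --cpu-percent=80
```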
Limitation: Self-healing fixes symptoms, not causes. An OOMKilled pod gets restarted, but it will OOMKill again because the memory limit is still too low.
Level 2: Runbook Automation (Script-Based)
Tools like Rundeck or PagerDuty's automation:
- Pre-written scripts for known failure modes
- Triggered by specific alerts
- Execute predefined remediation steps
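Under the hood this is plain conditional logic. A minimal sketch of a typical runbook hook (alert names and environment variables are illustrative):

```bash
#!/usr/bin/env bash
# Level-2 remediation: match an alert name, run a canned action.
# No diagnosis happens here; the script only knows what it was told.
case "$ALERT_NAME" in
  PodCrashLooping)
    kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"
    ;;
  NodeDiskPressure)
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
    ;;
esac
```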
Limitation: Only handles problems you've already seen and written scripts for. New failure modes require new scripts. No intelligence — just if/then logic.
Level 3: AI-Powered Auto-Remediation (What We're Talking About)
The AI:
- Analyzes the cluster state, logs, and metrics
- Identifies the root cause (not just the symptom)
- Generates a specific fix (YAML diff, Helm values change, Terraform update)
- Validates the fix won't break anything (dry-run, template checks, policy checks)
- Creates a pull request with full context for human review
- Your team reviews and merges — the human stays in the loop
This is the only approach that handles novel failures — problems your team has never seen before.
How AI Auto-Remediation Works: Step by Step
Step 1: Diagnosis
```
$ ki-ops analyze --namespace production

🔴 INCIDENT: api-gateway CrashLoopBackOff
├─ Pod: api-gateway-7f8b9d6c4-x2k
├─ Status: OOMKilled (12 restarts in 47 min)
├─ Memory limit: 512Mi
├─ Peak usage: 687Mi (34% over limit)
├─ Trigger: Traffic spike (3x baseline)
├─ Memory leak: None detected
└─ Confidence: 96%
```
The AI ran kubectl, checked Loki logs, queried Prometheus metrics, and synthesized a root-cause analysis — all in under 30 seconds.
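The exact queries are internal to the tool, but the inputs are the same ones you would check by hand. Roughly (the commands, label selectors, and metric windows below are illustrative, not the tool's actual calls):

```bash
# Restart count and last termination reason (OOMKilled, exit code 137)
kubectl describe pod api-gateway-7f8b9d6c4-x2k -n production

# Recent logs from Loki
logcli query '{namespace="production", app="api-gateway"}' --since=1h

# Peak memory vs. configured limit, via Prometheus (PromQL):
#   max_over_time(container_memory_working_set_bytes{pod=~"api-gateway.*"}[1h])
#   kube_pod_container_resource_limits{resource="memory", pod=~"api-gateway.*"}
```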
Step 2: Fix Generation
The AI generates the minimal change to fix the issue:
```yaml
# File: k8s/deployments/api-gateway.yaml

# Before:
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi

# After:
resources:
  requests:
    memory: 750Mi
  limits:
    memory: 1Gi
```
How the AI calculates the new values:
- Peak observed usage: 687Mi
- Safety margin: +30% → 893Mi
- Rounded to standard K8s unit: 1Gi
- Request set to ~75% of limit: 750Mi
- Verified cluster has capacity: 45Gi free → fits easily
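The same heuristic written out as arithmetic (a sketch of the calculation above, not the tool's actual code):

```bash
peak_mi=687                             # peak observed usage
with_margin=$(( peak_mi * 130 / 100 ))  # +30% safety margin -> 893Mi
limit_mi=1024                           # rounded up to the next standard unit: 1Gi
request_mi=$(( limit_mi * 3 / 4 ))      # ~75% of the limit; the fix rounds this to 750Mi
echo "limit=${limit_mi}Mi request=${request_mi}Mi"
```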
Step 3: Validation Pipeline
Before creating the PR, the fix runs through 6 validation checks:
```
✓ Step 1: YAML Schema Validation — syntax correct, API version compatible
✓ Step 2: kubectl apply --dry-run=server — API server accepts the change
✓ Step 3: Helm template check — chart renders valid YAML (if using Helm)
✓ Step 4: Terraform plan — no unintended infra changes (if using Terraform)
✓ Step 5: Policy checks — no security policy violations (OPA/Kyverno)
✓ Step 6: Resource quota check — cluster has capacity for the new limits
```
If any check fails, the PR is not created. The AI reports what failed and suggests alternatives.
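These checks map onto standard tooling. A sketch of what each step can look like on the command line (the specific tools, such as conftest for OPA, are assumptions about the stack, not a fixed pipeline):

```bash
# 1-2: schema validation locally, then acceptance by the API server
kubectl apply --dry-run=client -f k8s/deployments/api-gateway.yaml
kubectl apply --dry-run=server -f k8s/deployments/api-gateway.yaml

# 3: Helm chart still renders valid YAML (chart path is illustrative)
helm template ./charts/api-gateway | kubectl apply --dry-run=client -f -

# 4: no unintended infrastructure changes
terraform plan -detailed-exitcode

# 5: policy checks (Kyverno users would run `kyverno apply` instead)
conftest test k8s/deployments/api-gateway.yaml

# 6: namespace quota has headroom for the new limits
kubectl describe resourcequota -n production
```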
Step 4: Pull Request Creation
```markdown
# PR #1847: Fix api-gateway OOM — increase memory limit

## Problem
Pod `api-gateway-7f8b9d6c4-x2k` in `production` is in CrashLoopBackOff.
- Root cause: OOMKilled (memory limit 512Mi, peak usage 687Mi)
- Trigger: Traffic spike (3x baseline from marketing campaign)
- No memory leak detected

## Changes
- `k8s/deployments/api-gateway.yaml`: memory limit 512Mi → 1Gi

## Validation
- ✅ kubectl dry-run passed
- ✅ Helm template check passed
- ✅ Resource quota: 45Gi free (needs 512Mi more per replica)
- ✅ Policy checks passed

## Risk Assessment
- Risk: LOW (memory increase only, rolling update)
- Rollback: Revert this PR
- Expected downtime: 0 (rolling restart)
```
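The mechanics of opening the PR are ordinary Git plus a hosting CLI; GitHub's gh is shown here as one assumed example:

```bash
git checkout -b fix/api-gateway-oom
git commit -am "fix(api-gateway): raise memory limit 512Mi -> 1Gi (OOMKilled)"
git push -u origin fix/api-gateway-oom
gh pr create \
  --title "Fix api-gateway OOM: increase memory limit" \
  --body-file pr-body.md   # the generated description shown above
```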
Step 5: Human Review & Merge
Your team reviews the PR like any other code change:
- Is the root cause analysis correct?
- Is the fix appropriate?
- Are the validation results trustworthy?
If yes → merge. If no → comment, adjust, re-validate.
Average review time: 2 minutes. The PR has all the context an engineer needs to make a decision.
What Can Be Auto-Remediated?
Not everything should be auto-fixed. Here's what works well:
Good Candidates for Auto-Remediation
| Issue | Auto-Fix | Why It Works |
|-------|----------|--------------|
| OOMKilled | Increase memory limit | Well-defined problem, predictable fix |
| Image tag not found | Rollback to last known-good tag | Registry has version history |
| HPA maxed out | Increase max replicas or adjust targets | Metric-driven, low risk |
| PVC nearly full | Expand storage (if StorageClass allows) | Growth rate is predictable |
| Certificate expiring | Trigger cert-manager renewal | Automated process exists |
| Missing resource limits | Add based on observed usage | P95 metrics provide good baselines |
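Several of these fixes reduce to a single, reversible command. The image-tag rollback row, for instance, can be as simple as:

```bash
# Roll the deployment back to its previous (known-good) ReplicaSet
kubectl rollout undo deployment/api-gateway -n production
kubectl rollout status deployment/api-gateway -n production   # watch it settle
```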
Bad Candidates (Leave to Humans)
| Issue | Why Not |
|-------|---------|
| Data corruption | Too risky, needs manual investigation |
| Security breach | Requires incident response team |
| Application bugs | Need code changes, not infra changes |
| Architecture problems | Design decisions need human judgment |
| Network partitions | Complex root causes, multiple systems |
The Human-in-the-Loop Principle
Auto-remediation doesn't mean "AI does whatever it wants." It means:
- AI proposes — generates a validated fix
- Human reviews — engineer checks the proposal
- Human approves — merge or reject
- System deploys — GitOps controller applies the change
The AI never directly modifies your cluster. Every change goes through your existing Git workflow, code review process, and deployment pipeline.
This is critical for:
- Compliance: Every change is auditable (PR history)
- Safety: No rogue AI modifying production directly
- Learning: Engineers see what the AI suggests and build intuition
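In practice, "system deploys" means your GitOps controller notices the merged commit and reconciles the cluster toward it; nothing except the controller touches the cluster. For example (Flux and Argo CD shown as assumed stacks, with illustrative resource names):

```bash
# Flux: trigger an immediate reconcile of the updated source
flux reconcile kustomization apps --with-source

# Argo CD equivalent:
# argocd app sync api-gateway
```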
Measuring Auto-Remediation Impact
Track these metrics before and after enabling auto-remediation:
| Metric | What to Measure |
|--------|-----------------|
| MTTR | Total resolution time per incident |
| MTTD | Time from alert to root-cause identification |
| Auto-fix acceptance rate | % of AI-generated PRs that get merged |
| Incident recurrence | Same root cause appearing again within 30 days |
| On-call burden | Incidents requiring human intervention after hours |
Target: 80%+ auto-fix acceptance rate within 30 days. If AI-generated fixes are consistently rejected, the AI needs better context about your infrastructure.
Getting Started With Auto-Remediation
1. Start with diagnosis only. Use the KI-Ops free tier for AI-powered root-cause analysis. Get comfortable trusting the AI's diagnosis before enabling auto-fixes.
2. Enable auto-fix for low-risk issues first. Memory limit increases, HPA scaling, image tag rollbacks. These are well-understood, low-risk, and easy to verify.
3. Review every PR carefully for the first 2 weeks. Build confidence in the validation pipeline.
4. Gradually expand scope. Once your team trusts the AI's judgment, enable auto-fix for more complex scenarios.
Ready to try auto-remediation? Start with free diagnostics — then upgrade to Pro (€250/year, whole team) when you're ready for auto-fix PRs.