What Auto-Remediation Actually Means
The term "auto-remediation" gets thrown around a lot, but most tools mean something different by it:
| Tool | What they call "auto-remediation" |
|------|-----------------------------------|
| Kubernetes itself | Restart a crashed pod (that's self-healing, not remediation) |
| PagerDuty | Run a predefined script when an alert fires |
| Datadog | Trigger a webhook that calls your API |
| Rundeck | Execute a runbook step by step |
| Actual auto-remediation | AI diagnoses the problem, generates a specific fix, validates it, and creates a PR for your team to review |
Restarting a pod isn't remediation. That's putting a band-aid on a broken leg. The pod will crash again in 5 minutes for the same reason.
Real auto-remediation means: find the root cause, generate the correct fix, validate it won't break anything, and ship it as a reviewable pull request.
The 3 Levels of Kubernetes Auto-Remediation
Level 1: Self-Healing (Built Into Kubernetes)
Kubernetes already does this:
- Restart crashed containers (restartPolicy: Always)
- Replace failed pods (ReplicaSet controller)
- Reschedule pods from failed nodes (node controller evicts, scheduler replaces)
- Scale horizontally (HPA)
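All of this is declarative configuration rather than diagnosis. For example, the horizontal-scaling piece is a one-liner (deployment name and thresholds below are illustrative):

```bash
# Level-1 self-healing: the HPA scales replicas on CPU pressure,
# but it never asks why the pressure exists.
kubectl autoscale deployment api-gateway -n production \
  --min=3 --max=10 --cpu-percent=80
```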
Limitation: Self-healing fixes symptoms, not causes. An OOMKilled pod gets restarted, but it will OOMKill again because the memory limit is still too low.
Level 2: Runbook Automation (Script-Based)
Tools like Rundeck or PagerDuty's automation:
- Pre-written scripts for known failure modes
- Triggered by specific alerts
- Execute predefined remediation steps
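Under the hood this is plain conditional logic. A minimal sketch of a typical runbook hook (alert names and environment variables are illustrative):

```bash
#!/usr/bin/env bash
# Level-2 remediation: match an alert name, run a canned action.
# No diagnosis happens here; the script only knows what it was told.
case "$ALERT_NAME" in
  PodCrashLooping)
    kubectl rollout restart "deployment/$DEPLOYMENT" -n "$NAMESPACE"
    ;;
  NodeDiskPressure)
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
    ;;
esac
```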
Limitation: Only handles problems you've already seen and written scripts for. New failure modes require new scripts. No intelligence — just if/then logic.
Level 3: AI-Powered Auto-Remediation (What We're Talking About)
The AI:
- Analyzes the cluster state, logs, and metrics
- Identifies the root cause (not just the symptom)
- Generates a specific fix (YAML diff, Helm values change, Terraform update)
- Validates the fix won't break anything (dry-run, template checks, policy checks)
- Creates a pull request with full context for human review
- Your team reviews and merges — the human stays in the loop
This is the only approach that handles novel failures — problems your team has never seen before.
How AI Auto-Remediation Works: Step by Step
Step 1: Diagnosis
```
$ ki-ops analyze --namespace production

🔴 INCIDENT: api-gateway CrashLoopBackOff
├─ Pod: api-gateway-7f8b9d6c4-x2k
├─ Status: OOMKilled (12 restarts in 47 min)
├─ Memory limit: 512Mi
├─ Peak usage: 687Mi (34% over limit)
├─ Trigger: Traffic spike (3x baseline)
├─ Memory leak: None detected
└─ Confidence: 96%
```
The AI ran kubectl, checked Loki logs, queried Prometheus metrics, and synthesized a root-cause analysis — all in under 30 seconds.
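The exact queries are internal to the tool, but the inputs are the same ones you would check by hand. Roughly (the commands, label selectors, and metric windows below are illustrative, not the tool's actual calls):

```bash
# Restart count and last termination reason (OOMKilled, exit code 137)
kubectl describe pod api-gateway-7f8b9d6c4-x2k -n production

# Recent logs from Loki
logcli query '{namespace="production", app="api-gateway"}' --since=1h

# Peak memory vs. configured limit, via Prometheus (PromQL):
#   max_over_time(container_memory_working_set_bytes{pod=~"api-gateway.*"}[1h])
#   kube_pod_container_resource_limits{resource="memory", pod=~"api-gateway.*"}
```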
Step 2: Fix Generation
The AI generates the minimal change to fix the issue:
```yaml
# File: k8s/deployments/api-gateway.yaml

# Before:
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi

# After:
resources:
  requests:
    memory: 750Mi
  limits:
    memory: 1Gi
```
How the AI calculates the new values:
- Peak observed usage: 687Mi
- Safety margin: +30% → 893Mi
- Rounded to standard K8s unit: 1Gi
- Request set to ~75% of limit: 750Mi
- Verified cluster has capacity: 45Gi free → fits easily
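The same heuristic written out as arithmetic (a sketch of the calculation above, not the tool's actual code):

```bash
peak_mi=687                             # peak observed usage
with_margin=$(( peak_mi * 130 / 100 ))  # +30% safety margin -> 893Mi
limit_mi=1024                           # rounded up to the next standard unit: 1Gi
request_mi=$(( limit_mi * 3 / 4 ))      # ~75% of the limit; the fix rounds this to 750Mi
echo "limit=${limit_mi}Mi request=${request_mi}Mi"
```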
Step 3: Validation Pipeline
Before creating the PR, the fix runs through 6 validation checks:
```
✓ Step 1: YAML Schema Validation — syntax correct, API version compatible
✓ Step 2: kubectl apply --dry-run=server — API server accepts the change
✓ Step 3: Helm template check — chart renders valid YAML (if using Helm)
✓ Step 4: Terraform plan — no unintended infra changes (if using Terraform)
✓ Step 5: Policy checks — no security policy violations (OPA/Kyverno)
✓ Step 6: Resource quota check — cluster has capacity for the new limits
```
If any check fails, the PR is not created. The AI reports what failed and suggests alternatives.
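These checks map onto standard tooling. A sketch of what each step can look like on the command line (the specific tools, such as conftest for OPA, are assumptions about the stack, not a fixed pipeline):

```bash
# 1-2: schema validation locally, then acceptance by the API server
kubectl apply --dry-run=client -f k8s/deployments/api-gateway.yaml
kubectl apply --dry-run=server -f k8s/deployments/api-gateway.yaml

# 3: Helm chart still renders valid YAML (chart path is illustrative)
helm template ./charts/api-gateway | kubectl apply --dry-run=client -f -

# 4: no unintended infrastructure changes
terraform plan -detailed-exitcode

# 5: policy checks (Kyverno users would run `kyverno apply` instead)
conftest test k8s/deployments/api-gateway.yaml

# 6: namespace quota has headroom for the new limits
kubectl describe resourcequota -n production
```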
Step 4: Pull Request Creation
```markdown
# PR #1847: Fix api-gateway OOM — increase memory limit

## Problem
Pod `api-gateway-7f8b9d6c4-x2k` in `production` is in CrashLoopBackOff.
- Root cause: OOMKilled (memory limit 512Mi, peak usage 687Mi)
- Trigger: Traffic spike (3x baseline from marketing campaign)
- No memory leak detected

## Changes
- `k8s/deployments/api-gateway.yaml`: memory limit 512Mi → 1Gi

## Validation
- ✅ kubectl dry-run passed
- ✅ Helm template check passed
- ✅ Resource quota: 45Gi free (needs 512Mi more per replica)
- ✅ Policy checks passed

## Risk Assessment
- Risk: LOW (memory increase only, rolling update)
- Rollback: Revert this PR
- Expected downtime: 0 (rolling restart)
```
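The mechanics of opening the PR are ordinary Git plus a hosting CLI; GitHub's gh is shown here as one assumed example:

```bash
git checkout -b fix/api-gateway-oom
git commit -am "fix(api-gateway): raise memory limit 512Mi -> 1Gi (OOMKilled)"
git push -u origin fix/api-gateway-oom
gh pr create \
  --title "Fix api-gateway OOM: increase memory limit" \
  --body-file pr-body.md   # the generated description shown above
```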
Step 5: Human Review & Merge
Your team reviews the PR like any other code change:
- Is the root cause analysis correct?
- Is the fix appropriate?
- Are the validation results trustworthy?
If yes → merge. If no → comment, adjust, re-validate.
Average review time: 2 minutes. The PR has all the context an engineer needs to make a decision.
What Can Be Auto-Remediated?
Not everything should be auto-fixed. Here's what works well:
Good Candidates for Auto-Remediation
| Issue | Auto-Fix | Why It Works |
|-------|----------|--------------|
| OOMKilled | Increase memory limit | Well-defined problem, predictable fix |
| Image tag not found | Rollback to last known-good tag | Registry has version history |
| HPA maxed out | Increase max replicas or adjust targets | Metric-driven, low risk |
| PVC nearly full | Expand storage (if StorageClass allows) | Growth rate is predictable |
| Certificate expiring | Trigger cert-manager renewal | Automated process exists |
| Missing resource limits | Add based on observed usage | P95 metrics provide good baselines |
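Several of these fixes reduce to a single, reversible command. The image-tag rollback row, for instance, can be as simple as:

```bash
# Roll the deployment back to its previous (known-good) ReplicaSet
kubectl rollout undo deployment/api-gateway -n production
kubectl rollout status deployment/api-gateway -n production   # watch it settle
```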
Bad Candidates (Leave to Humans)
| Issue | Why Not |
|-------|---------|
| Data corruption | Too risky, needs manual investigation |
| Security breach | Requires incident response team |
| Application bugs | Need code changes, not infra changes |
| Architecture problems | Design decisions need human judgment |
| Network partitions | Complex root causes, multiple systems |
The Human-in-the-Loop Principle
Auto-remediation doesn't mean "AI does whatever it wants." It means:
- AI proposes — generates a validated fix
- Human reviews — engineer checks the proposal
- Human approves — merge or reject
- System deploys — GitOps controller applies the change
The AI never directly modifies your cluster. Every change goes through your existing Git workflow, code review process, and deployment pipeline.
This is critical for:
- Compliance: Every change is auditable (PR history)
- Safety: No rogue AI modifying production directly
- Learning: Engineers see what the AI suggests and build intuition
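In practice, "system deploys" means your GitOps controller notices the merged commit and reconciles the cluster toward it; nothing except the controller touches the cluster. For example (Flux and Argo CD shown as assumed stacks, with illustrative resource names):

```bash
# Flux: trigger an immediate reconcile of the updated source
flux reconcile kustomization apps --with-source

# Argo CD equivalent:
# argocd app sync api-gateway
```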
Measuring Auto-Remediation Impact
Track these metrics before and after enabling auto-remediation:
| Metric | What to Measure |
|--------|-----------------|
| MTTR | Total resolution time per incident |
| MTTD | Time from alert to root-cause identification |
| Auto-fix acceptance rate | % of AI-generated PRs that get merged |
| Incident recurrence | Same root cause appearing again within 30 days |
| On-call burden | Incidents requiring human intervention after hours |
Target: 80%+ auto-fix acceptance rate within 30 days. If AI-generated fixes are consistently rejected, the AI needs better context about your infrastructure.
Getting Started With Auto-Remediation
1. Start with diagnosis only. Use the KI-Ops free tier for AI-powered root-cause analysis. Get comfortable trusting the AI's diagnosis before enabling auto-fixes.
2. Enable auto-fix for low-risk issues first. Memory limit increases, HPA scaling, image tag rollbacks. These are well-understood, low-risk, and easy to verify.
3. Review every PR carefully for the first 2 weeks. Build confidence in the validation pipeline.
4. Gradually expand scope. Once your team trusts the AI's judgment, enable auto-fix for more complex scenarios.
Ready to try auto-remediation? Start with free diagnostics — then upgrade to Pro (€250/year, whole team) when you're ready for auto-fix PRs.