From Diagnosis to Remediation in Seconds
KI-Ops Pro creates validated PRs for common K8s failures. Your team reviews, approves, merges. Fully GitOps-native.
The Agentic AI Problem: Diagnosis Is Not Remediation
You have a problem (CrashLoopBackOff, full PVC, wrong image). KI-Ops shows you:
🔴 PROBLEM: Deployment "api-service" is crashing
Root Cause: Memory limit 512Mi is too small
Recommendation: Increase to 1Gi
Then you jump into your editor:
- git checkout -b fix/memory-limit
- Open the file, change the memory limit
- Check that the YAML is valid
- git push
- Create the PR
- Wait for review
- Merge
10-20 minutes for something the AI could fix in 5 seconds.
KI-Ops Pro automates this completely, with full control.
Agentic AI Operations: How It Works
Step 1: Diagnosis → Auto-Fix Proposal
$ ki-ops analyze
[Analyzing cluster...]
🔴 INCIDENT: api-service CrashLoopBackOff
Root Cause: OOMKilled (Memory Limit 512Mi, Peak 687Mi)
💡 Auto-Fix Available (KI-Ops Pro)
Ready to create PR with fix?
$ ki-ops fix --auto-create-pr
Creating PR for memory limit increase...
✅ PR Created: #1847
├─ Branch: fix/api-service-memory-limit-prod
├─ Changes: Deployment "api-service"
├─ Memory: 512Mi → 1Gi
├─ Validations Passed:
│ ├─ kubectl apply --dry-run=client ✓
│ ├─ Helm Template (if applicable) ✓
│ ├─ Terraform Plan (if IaC) ✓
│ ├─ Policy Checks (Custom Rego) ✓
│ └─ Resource Quota Check (will fit) ✓
│
├─ URL: https://github.com/yourorg/yourrepo/pull/1847
└─ Status: Waiting for Review
Suggested Reviewers: @platform-lead, @oncall-engineer
Step 2: The PR with Full Context
# Fix: Increase memory limit for api-service deployment
## Problem
Pod `api-service-xy81f` in production is in CrashLoopBackOff state.
- Status: OOMKilled
- Memory Limit: 512Mi
- Observed Peak: 687Mi
- Downtime: ongoing (since 14:23 UTC)
## Root Cause Analysis
Machine Learning analysis of cluster metrics + logs identified:
- Memory limit too small for current traffic patterns
- No memory leaks detected
- Traffic increase is legitimate (promotional campaign)
## Solution
Increase memory request/limit to 1Gi based on:
- P99 memory usage: 687Mi
- Safety margin: 30%
- Total: ~893Mi → rounded to 1Gi
## Changes
- File: `k8s/deployments/api-service.yaml`
- Change: `memory: 512Mi` → `1Gi` (both request + limit)
## Validation Results
- ✅ kubectl apply --dry-run=client passed
- ✅ Helm template check passed
- ✅ Resource quota check passed (cluster has 45Gi free)
- ✅ Policy check passed (no PII or secrets)
- ✅ Network policy compatible
- ✅ No service disruption expected (rolling restart)
## Impact Assessment
- Service: api-service (production)
- Risk Level: LOW (memory increase only)
- Rollout Strategy: Rolling update (0 downtime)
- Rollback: Simple (revert memory limit)
- Testing: All e2e tests passed in dry-run
## Follow-up Actions (Optional)
1. Monitor memory usage post-deployment
2. Consider HPA memory-based scaling if spike recurring
3. Profile application for potential memory leaks
---
**Generated by KI-Ops Pro** | Analysis: 2024-03-04 14:24:12 UTC | Confidence: 98%
Step 3: Review & Merge
# Your team reviews (takes 2 min)
# Checklist in the PR:
# ✓ Validate YAML syntax
# ✓ Check resource impact
# ✓ Review root cause analysis
# ✓ Confirm memory increase is safe
# Merge the PR (via the GitHub UI or the gh CLI)
gh pr merge 1847
# Automatic Deployment (GitOps)
# ArgoCD detects change, automatically deploys to production
# Monitoring: KI-Ops watches rollout and confirms success
Auto-Fix Templates: What KI-Ops Can Automate
Template 1: Memory/CPU Limit Tuning
# BEFORE
resources:
  requests:
    memory: 512Mi
    cpu: 500m
  limits:
    memory: 512Mi
    cpu: 1000m

# AFTER (KI-Ops calculated)
resources:
  requests:
    memory: 1Gi
    cpu: 750m
  limits:
    memory: 1.2Gi
    cpu: 1500m
How KI-Ops calculates:
- Analyze P95 memory/CPU over 7 days
- Add 20-30% safety margin
- Round to standard Kubernetes units
- Check cluster has capacity
- Validate HPA won't trigger
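The sizing heuristic above can be sketched in a few lines. This is an illustrative sketch only: the margin default and the rounding ladder `STANDARD_SIZES_MIB` are assumptions, not KI-Ops internals.

```python
# Illustrative sketch of the limit-tuning heuristic described above.
# The 30% margin and the rounding ladder are assumptions, not KI-Ops internals.

# Hypothetical ladder of standard sizes (in MiB) we round up to.
STANDARD_SIZES_MIB = [128, 256, 512, 768, 1024, 1536, 2048, 4096]

def propose_memory_limit(p95_mib: float, margin: float = 0.3) -> int:
    """Return a proposed memory limit in MiB: observed usage plus a
    safety margin, rounded up to the next standard size."""
    needed = p95_mib * (1 + margin)
    for size in STANDARD_SIZES_MIB:
        if size >= needed:
            return size
    # Above the ladder: round up to the next full GiB.
    return ((int(needed) // 1024) + 1) * 1024

# The incident above: peak 687Mi + 30% ≈ 893Mi → 1Gi (1024Mi).
print(propose_memory_limit(687))  # → 1024
```

The same function covers both request and limit proposals; only the margin would differ.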
Template 2: Image Tag Fix
# Problem: Image "myregistry.io/api:latest" doesn't exist
# BEFORE
containers:
- name: api
  image: myregistry.io/api:latest  # tag doesn't exist!

# AFTER
containers:
- name: api
  image: myregistry.io/api:v2.3.0  # use previous known-good
How KI-Ops detects + fixes:
- Event log: ImagePullBackOff
- Check registry for available tags
- Find last successful deployment image
- Propose rollback to that version
- Validate image exists + compatible
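The rollback-tag selection in the list above can be sketched as follows; the data shapes (a set of registry tags, a newest-first deployment history) are illustrative assumptions:

```python
# Sketch of "find last successful deployment image": pick the most recent
# previously deployed tag that actually exists in the registry.
# Data shapes are illustrative assumptions, not the KI-Ops API.

def last_known_good(registry_tags, deploy_history):
    """deploy_history is ordered newest-first; return the newest
    deployed tag that still exists in the registry, or None."""
    for tag in deploy_history:
        if tag in registry_tags:
            return tag
    return None

tags = {"v2.2.1", "v2.3.0"}               # "latest" was never pushed
history = ["latest", "v2.3.0", "v2.2.1"]  # newest deployment first
print(last_known_good(tags, history))     # → v2.3.0
```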
Template 3: PVC Expansion
# Problem: PVC "postgres-data" 92% full, growth rate 15GB/day
# BEFORE
spec:
resources:
requests:
storage: 100Gi
# AFTER (KI-Ops + input)
spec:
resources:
requests:
storage: 250Gi # 100 + (15 * 10 days safety margin)
How KI-Ops calculates:
- Query PVC usage metrics
- Calculate growth rate (GB/day)
- Estimate when 95% capacity reached
- Propose expansion to give 10-30 days buffer
- Check StorageClass allows expansion
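The growth-rate math above fits in one small function; the 95% threshold and the 10-day buffer are taken from the example, everything else is an illustrative sketch:

```python
# Sketch of the PVC expansion math described above. The 95% threshold and
# the 10-day buffer come from the example; the function itself is illustrative.

def pvc_expansion_plan(size_gib, used_gib, growth_gib_per_day,
                       threshold=0.95, buffer_days=10):
    """Return (days_until_threshold, proposed_size_gib)."""
    days_left = max(0.0, (size_gib * threshold - used_gib) / growth_gib_per_day)
    proposed = size_gib + growth_gib_per_day * buffer_days
    return days_left, int(proposed)

# 100Gi volume at 92% with 15GB/day growth, as in the example:
days, proposed = pvc_expansion_plan(100, 92, 15)
print(round(days, 1), proposed)  # → 0.2 250
```

At 0.2 days to the 95% mark, this is exactly the kind of change worth proposing before the pager goes off.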
Template 4: Deployment Replica Scaling
# Problem: HPA maxed out (50 replicas), still high latency
# BEFORE
spec:
  replicas: 3  # static, no HPA
# or
spec:
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70

# AFTER
spec:
  replicas: 5  # increase static base
# and
spec:
  maxReplicas: 100  # increase HPA max
  targetCPUUtilizationPercentage: 65  # more aggressive scale-up
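The scaling decision behind this template can be sketched as a tiny rule. The doubling factor and the 5-point target reduction match the numbers in the example but are illustrative assumptions, not KI-Ops internals:

```python
# Sketch of the HPA retuning rule: if the HPA is pinned at maxReplicas while
# latency is still high, raise the ceiling and scale up earlier.
# Factor (2x) and target step (-5) are illustrative assumptions.

def retune_hpa(current_max, current_target, at_max, latency_high):
    """Return (new_max_replicas, new_cpu_target_percent)."""
    if at_max and latency_high:
        return current_max * 2, max(50, current_target - 5)
    return current_max, current_target

# The example above: 50 replicas pinned, high latency → 100 max, 65% target.
print(retune_hpa(50, 70, True, True))  # → (100, 65)
```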
Template 5: Helm/Terraform Updates
# KI-Ops can also update Helm values / Terraform .tfvars
# Helm values BEFORE
api:
  replicaCount: 3
  resources:
    memory: 512Mi

# Helm values AFTER (generated PR)
api:
  replicaCount: 5
  resources:
    memory: 1Gi
# Terraform BEFORE
resource "kubernetes_deployment" "api" {
  spec {
    template {
      spec {
        container {
          resources {
            limits = {
              memory = "512Mi"
            }
          }
        }
      }
    }
  }
}

# Terraform AFTER
resource "kubernetes_deployment" "api" {
  spec {
    template {
      spec {
        container {
          resources {
            limits = {
              memory = "1Gi"  # KI-Ops updated this
            }
          }
        }
      }
    }
  }
}
Validation Pipeline: Guardrails
Before a PR is created, every change runs through:
$ ki-ops fix --auto-create-pr --validate-all
Step 1: YAML Schema Validation
├─ Check syntax
├─ Kubernetes API version compatibility
└─ Required fields present
Step 2: kubectl Dry-Run
├─ $ kubectl apply --dry-run=client -f fixed-deployment.yaml
├─ Check: API server accepts the changes
└─ Check: No conflicts with existing objects
Step 3: Helm Simulation (if Helm is used)
├─ $ helm template api ./chart
├─ Check: Chart renders valid YAML
└─ Check: Values are compatible
Step 4: Terraform Validation (if IaC is used)
├─ $ terraform plan -out=tfplan
├─ Check: Infra changes are safe
├─ Check: Terraform state conflict check
└─ Check: No unexpected destroys
Step 5: Policy Checks (custom Rego / Kyverno)
├─ Security policies (e.g. no privileged containers)
├─ Cost policies (e.g. no oversizing)
├─ Compliance policies (e.g. resource limits required)
└─ Custom org policies
Step 6: Resource Quota Check
├─ Cluster: requests memory = 45Gi/100Gi free
├─ Node: proposed change needs 500Mi
└─ Result: ✓ Fits, no issues
Step 7: Impact Analysis
├─ Services affected: api-service (prod)
├─ Pod restarts needed: yes (rolling update)
├─ Expected downtime: 0 minutes (rolling)
└─ Dependencies updated: postgres-secret required, checking...
✅ All Validation Passed!
Creating PR...
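Step 6 of the pipeline above, the quota check, reduces to a simple capacity comparison. A minimal sketch, with figures taken from the example output (the real check would query the cluster's ResourceQuota objects):

```python
# Minimal sketch of the resource-quota check (Step 6): does the proposed
# memory delta still fit under the namespace quota?
# Figures below mirror the example output; the function is illustrative.

def quota_fits(quota_gib, used_gib, delta_gib):
    """True if the change still fits under the memory quota."""
    return used_gib + delta_gib <= quota_gib

# 100Gi quota, 55Gi used (45Gi free), change needs ~0.5Gi:
print(quota_fits(100, 55, 0.5))  # → True
```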
Human-in-the-Loop: Who Reviews?
KI-Ops Pro automatically suggests reviewers:
PR #1847: Fix api-service memory limit
├─ Suggested Reviewers:
│ ├─ @alice (Deployment owner, last modified this file)
│ ├─ @bob (On-call engineer for api-service)
│ └─ @platform-lead (Architecture approval)
│
├─ Required Approvals: 2
├─ Auto-Merge Policy: true (if all checks pass)
└─ Escalation: None (low-risk change)
# If the team disagrees with the AI fix:
# → Comment, make changes, push
# → Re-trigger validation
# → New commit auto-created, closing the feedback loop
Pro Features: What You Get on Top
| Feature | Free | Pro | Enterprise |
|---------|------|-----|------------|
| Analysis & Diagnostics | ✓ | ✓ | ✓ |
| Auto-Fix PR Creation | - | ✓ | ✓ |
| Multi-Repo Support | - | ✓ | ✓ |
| Helm/Terraform Validation | - | ✓ | ✓ |
| Custom Policies | - | ✓ | ✓ |
| GitOps Integration | - | ✓ | ✓ |
| RBAC-controlled Actions | - | ✓ | ✓ |
| On-Premise Deployment | - | - | ✓ |
| SSO & Audit Logs | - | - | ✓ |
| Slack Notifications | ✓ | ✓ | ✓ |
| Scheduled Analyses | - | ✓ | ✓ |
Practical Example: The Full Workflow
# 14:23 UTC: Incident occurs
api-service Pod: CrashLoopBackOff
Cause: OOMKilled
# 14:24 UTC: KI-Ops detects the problem automatically
$ ki-ops analyze
[Analyzing...]
🔴 PROBLEM DETECTED: api-service OOMKilled
# 14:24:30 UTC: Auto-fix proposed + PR created
Auto-Fix Ready. Creating PR...
✅ PR #1847 created
# 14:25 UTC: Team is notified (Slack)
@alice @bob: New PR #1847 (auto-created by KI-Ops)
Link: https://github.com/yourorg/repo/pull/1847
# 14:26 UTC: @alice reviews the PR (2 minutes)
LGTM! The memory increase makes sense.
Approving.
# 14:27 UTC: @bob reviews and merges
✅ Approved by alice
Merging...
# 14:27:30 UTC: GitOps Auto-Deployment
ArgoCD detects change, deploying to production...
# 14:29 UTC: Pod recovers, service restored
api-service-xy81f: Running (1/1 ready)
Total MTTR: 6 minutes (vs. 2+ hours manual debugging)
What You Gain with Agentic AI Operations
- MTTR from hours to minutes: from "problem detected" to "fixed & deployed"
- Runbook automation: common fixes never have to be done manually again
- Knowledge codification: KI-Ops codifies your team's best practices
- Compliance & governance: every change is audited, policy-approved, and reviewed
- Team empowerment: junior engineers can run senior-level diagnostics
- Zero manual toil: repetitive fix tasks are fully automated
Try Pro: the first 2 weeks are free. €250/year for the whole team.
Try It Now
Start with the free tier and analyze your cluster in under 5 minutes.
Start for Free