Security Observability - Anomaly Detection - KI-Ops

The SecOps Problem: Silos Between Ops and Security

Typical setups have:

Ops team: Prometheus, Grafana, Loki (logs and metrics, no security focus)
Security team: Falco, AppArmor, SELinux, Vault (separate tools, separate alerts)
Result: two teams, two toolchains, two UIs, no cross-visibility

When a security event occurs, Falco raises an alert, but the Ops team never sees it because it lives in a different tool. Meanwhile, API latency rises, and Ops debugs the network instead of investigating an intrusion.

KI-Ops integrates security as a first-class observability signal.

Security Signals in the Ops Stack

1. RBAC & Access Auditing

$ ki-ops audit-rbac --namespace production
Auditing Role-Based Access Control...

🔴 ANOMALIES DETECTED:

1. ServiceAccount: deployment-automation (ns: ci-cd)
   Trend Analysis:
   ├─ Today: 45 API calls
   ├─ Last 7 Days: avg 8 calls
   ├─ Spike Factor: 5.6x higher than normal
   └─ NEW PERMISSION GRANT: cluster-admin (granted 3 minutes ago)

   ⚠️ Risk Assessment:
     High: ServiceAccount escalated to cluster-admin
     Timing: Right before unusual activity surge

   Investigation:
   $ kubectl logs -n ci-cd deploy/deployment-automation --tail=100
   (Check whether this is legitimate or a compromise)

2. User: alice@company.com
   Action: read secrets in production
   Trend:
   ├─ Never before accessed Secrets
   ├─ 23 Secret reads in last 30 minutes
   ├─ Time: 02:47 UTC (unusual hour)
   └─ Source IP: 203.0.113.45 (new IP, not seen before)

   ⚠️ Risk: Potential Credential Theft
   Alerts:
   ├─ Check if alice is on-call
   ├─ Verify that alice's laptop VPN is active
   ├─ Review which Secrets were read
   └─ Consider: Rotate affected Secrets if suspicious

3. Automation: ImagePullSecret Updated
   Event:
   ├─ Secret: "gcr-secret" (ns: default)
   ├─ Action: username changed
   ├─ Changed By: kubernetes.io/system:admin
   ├─ Timestamp: 2024-03-04 14:23:45
   ├─ Previous: "serviceaccount@company.iam.gserviceaccount.com"
   └─ Current: "automated-backup@different-org.iam.gserviceaccount.com"

   ⚠️ Critical: The service account stored in the Secret has changed
   Questions:
   ├─ Who authorized this?
   ├─ Which pods use this Secret?
   └─ Has anyone lost access as a result?

   $ kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.imagePullSecrets[*].name}{"\n"}{end}' | grep gcr-secret
   (Lists every pod that references the Secret)
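
The spike factor in the first finding is simple arithmetic: today's call count divided by the average over the lookback window. The following is a minimal sketch of that calculation, not KI-Ops' actual implementation. The function name and the daily history values are hypothetical; only the 45 calls and the average of 8 come from the report.

```python
from statistics import mean


def spike_factor(today_calls: int, history: list[int]) -> float:
    """Ratio of today's API call count to the average of the lookback window."""
    baseline = mean(history)
    if baseline == 0:
        # No historical activity: any call at all is an unbounded spike.
        return float("inf") if today_calls > 0 else 0.0
    return today_calls / baseline


# Hypothetical 7-day history that averages 8 calls per day.
history = [6, 9, 8, 7, 10, 8, 8]
print(f"Spike factor: {spike_factor(45, history):.1f}x")
```

A ratio alone says little for accounts with very low baselines, so in practice it would likely be combined with other signals such as the recent cluster-admin grant.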

2. Runtime Behavior Anomalies

$ ki-ops detect-anomalies --security --lookback 7d
Detecting behavioral anomalies...

BEHAVIORAL ANOMALIES (machine-learning based):

1. Pod: api-service-xy81f (ns: production)
   Anomaly: Outbound connection pattern changed
   ├─ Normal Behavior:
   │  ├─ Connect to: postgres (10.0.1.5:5432) - daily
   │  ├─ Connect to: redis (10.0.2.10:6379) - daily
   │  └─ Connect to: dns (10.96.0.10:53) - continuously
   │
   ├─ NEW Behavior (for the last 45 min):
   │  ├─ Connect to: 185.220.100.45:443 (Tor Exit Node!)
   │  ├─ Frequency: 234 connections/min
   │  └─ Data Transfer: 450MB in 45 min
   │
   └─ Risk Level: CRITICAL
      ├─ This pod should not make outbound connections to the internet
      ├─ Tor usage is highly suspicious
      └─ Possible causes: malware, compromise, crypto-miner

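   Immediate Actions:
   # Inspect first: the rollback below replaces this pod
   $ kubectl exec -it api-service-xy81f -n production -- bash
     # Inside the pod: which user is this, and which processes are running?
     # whoami
     # ps aux | grep -E 'curl|wget|nc|perl|python'
   $ kubectl set image deploy/api-service api-service=previous-good-image -n production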

2. Process Behavior Anomaly
   Container: postgres-backup (ns: databases)
   ├─ Normal: Runs daily 01:00-02:30 UTC, CPU 15%, Memory 800MB
   ├─ Anomaly: Running now at 13:47 UTC (!), CPU 95%, Memory 4.5GB
   │  ├─ Process Tree: bash → dd → /dev/zero (?)
   │  ├─ File Access: /var/lib/postgresql (unusual read pattern)
   │  └─ System Calls: 2345 getuid() calls (why so many privilege checks?)
   │
   └─ Assessment:
      ├─ Scheduled job is running at an unexpected time
      ├─ Resource usage roughly 6x higher than normal (CPU 15% → 95%, Memory 800MB → 4.5GB)
      └─ Suspected: job-controller bug, failed cleanup, or malicious cron

   Debug:
   $ kubectl logs postgres-backup -n databases --timestamps
   $ kubectl describe pod postgres-backup -n databases
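
The outbound-connection finding above boils down to a set difference: destinations seen now that never appeared in the baseline. The sketch below illustrates that idea under the assumption that flows are already available as (pod, IP, port) records. The `Flow` type and `new_destinations` function are hypothetical names, not part of KI-Ops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    pod: str
    dest_ip: str
    dest_port: int


def new_destinations(baseline: list[Flow], current: list[Flow]) -> set[tuple[str, int]]:
    """Return (ip, port) pairs seen in current traffic but never in the baseline."""
    known = {(f.dest_ip, f.dest_port) for f in baseline}
    return {(f.dest_ip, f.dest_port) for f in current} - known


pod = "api-service-xy81f"
baseline = [
    Flow(pod, "10.0.1.5", 5432),   # postgres
    Flow(pod, "10.0.2.10", 6379),  # redis
    Flow(pod, "10.96.0.10", 53),   # dns
]
current = baseline + [Flow(pod, "185.220.100.45", 443)]

for ip, port in sorted(new_destinations(baseline, current)):
    print(f"NEW destination for {pod}: {ip}:{port}")
```

A real detector would also weigh connection frequency and transferred volume, which the report lists alongside the new destination.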

3. Vulnerability Scanning Integration

$ ki-ops scan-vulnerabilities --cluster
Scanning images and dependencies...

🔴 VULNERABILITIES FOUND:

1. Image: api-service:v2.3.1
   Base OS: ubuntu:20.04
   Vulnerabilities:
   ├─ CVE-2024-1234: OpenSSL Integer Overflow (CRITICAL)
   │  ├─ CVSS Score: 9.8 (Critical)
   │  ├─ Affected: libssl1.1 (2.3.0)
   │  ├─ Fixed In: libssl1.1 (2.3.2)
   │  └─ Pods Using: api-service-1, api-service-2, ... (47 pods)
   │
   ├─ CVE-2024-5678: XSS in express.js (HIGH)
   │  ├─ CVSS Score: 7.2
   │  ├─ Affected: express (4.18.1)
   │  ├─ Fixed In: express (4.19.0)
   │  └─ Action: Update package.json, rebuild image
   │
   └─ CVE-2023-9999: Path Traversal in fs library (MEDIUM)

2. Deployed Image: worker-job:v1.2.0
   Identified: 12 known CVEs, 3 CRITICAL
   Last Updated: 2021-03-15 (!!!)
   ├─ Image is 3 years old
   ├─ Probably 100+ new vulnerabilities known since then
   └─ Status: Should retire or rebuild immediately

3. Kubernetes RBAC Risk
   ClusterRole: "edit"
   ├─ Users: developer@company.com, automation-bot
   ├─ Permissions: Too broad (create pods, delete services, etc.)
   └─ Recommendation: Least privilege RBAC policy
      → Create role "deployment-editor" with only deployment/service permissions

4. Network Policy Gaps
   Namespace: production
   ├─ No NetworkPolicies defined (!!)
   ├─ All pods can communicate with all pods
   ├─ All pods can communicate with external internet
   └─ Risk: Lateral movement and data exfiltration are not prevented

   Fix:
   $ kubectl apply -f - <<EOF
   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: default-deny-all
     namespace: production
   spec:
     podSelector: {}
     policyTypes:
     - Ingress
     - Egress
   EOF
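
The severity labels in the vulnerability findings above (CRITICAL, HIGH, MEDIUM) match the qualitative rating scale of CVSS v3.x. As a small sketch, the mapping from score to label and a simple worst-first ordering could look like this. The function name is hypothetical, and the CVE IDs and scores are the sample values from the scan output.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


# Sample findings taken from the scan output above.
findings = [("CVE-2024-5678", 7.2), ("CVE-2024-1234", 9.8)]

# Sort so the highest score is handled first.
for cve, score in sorted(findings, key=lambda f: f[1], reverse=True):
    print(f"{cve}: {score} ({cvss_severity(score)})")
```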

Anomaly Detection: How It Works

KI-Ops uses machine learning rather than static rules:

Phase 1: LEARNING (first 7 days)
├─ Collects baseline behavior for all pods
├─ Metrics: CPU, memory, network I/O, processes, connections
├─ Baseline: P50, P95, P99 latency, throughput, error rates
└─ Builds a statistical model per workload

Phase 2: DETECTION (continuous)
├─ Compares current signals against the baseline
├─ Flags an anomaly when the deviation exceeds 3 standard deviations
├─ Filtering: a seasonality model suppresses false positives
│  (e.g., higher traffic every Monday is NORMAL, no alert)
└─ Human-in-the-loop: the security team can fine-tune the models
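
The detection step can be illustrated with a plain z-score test. This is a simplified sketch, not KI-Ops' actual model. It assumes one baseline sample list per weekday as a stand-in for the seasonality model, and the function name, data layout, and traffic numbers are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a value that deviates more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical requests/min baselines keyed by weekday (0 = Monday).
baselines = {
    0: [900.0, 950.0, 1000.0, 1050.0],  # Mondays are always busier
    1: [480.0, 500.0, 520.0, 500.0],    # a typical Tuesday
}

print(is_anomalous(baselines[0], 1000.0))  # Monday traffic judged against Monday baseline
print(is_anomalous(baselines[1], 1000.0))  # same traffic judged against Tuesday baseline
```

Comparing each observation against the baseline for its own weekday is what keeps the regular Monday peak from raising an alert, while the same value on a Tuesday stands out.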

Phase 3: CORRELATION
├─ Multiple anomalies at the same time?
│  → Probably one root cause
│  → Cluster related events
└─ Single anomaly -> low severity
   Multiple correlated -> high severity
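
One way to picture the correlation phase is to group anomalies on the same workload that occur close together in time, then raise severity when a group has more than one member. The sketch below assumes a 15-minute window and a simple per-workload grouping. Both are illustrative choices, and the types and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Anomaly:
    workload: str
    kind: str
    at: datetime


def correlate(anomalies: list[Anomaly], window: timedelta = timedelta(minutes=15)) -> list[list[Anomaly]]:
    """Group anomalies per workload when each falls within `window` of the previous one."""
    groups: list[list[Anomaly]] = []
    open_group: dict[str, list[Anomaly]] = {}  # workload -> its most recent group
    for a in sorted(anomalies, key=lambda x: x.at):
        group = open_group.get(a.workload)
        if group and a.at - group[-1].at <= window:
            group.append(a)
        else:
            group = [a]
            groups.append(group)
            open_group[a.workload] = group
    return groups


def severity(group: list[Anomaly]) -> str:
    """Single anomaly -> low severity, multiple correlated -> high severity."""
    return "high" if len(group) > 1 else "low"


t0 = datetime(2024, 3, 4, 13, 0)
events = [
    Anomaly("api-service-xy81f", "unexpected-process", t0),
    Anomaly("api-service-xy81f", "cpu-spike", t0 + timedelta(minutes=2)),
    Anomaly("api-service-xy81f", "new-outbound-connection", t0 + timedelta(minutes=5)),
    Anomaly("postgres-backup", "unusual-schedule", t0 + timedelta(minutes=47)),
]

for group in correlate(events):
    kinds = ", ".join(a.kind for a in group)
    print(f"{group[0].workload}: {severity(group)} ({kinds})")
```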

Use Cases in Practice

Scenario 1: Crypto-Mining Malware

1. Pod starts unexpected processes
   → eBPF detects: bash → gcc → make (???)
   → Abnormal: this pod should not be compiling anything

2. CPU utilization rises from 10% to 98%
   → Metric anomaly detected

3. Network connection to a mining pool
   → eBPF shows: tcp:185.220.100.45:443
   → Behavior anomaly: the pod does not normally make this connection

4. KI-Ops Alert:
   "Possible Crypto-Mining in api-service-xy81f
    ├─ Correlation Score: 98% (high confidence)
    ├─ Recommendation: Kill pod immediately
    └─ Follow-up: Rebuild image, audit code"

Scenario 2: Credentials Leak

1. New Secret created by user "alice"
   → Access audit log shows: alice created "aws-secret"
   → Unusual: alice is an application developer, not DevOps

2. Secret data leaked to an external service
   → eBPF shows: pod exfiltrates SECRET_VALUE to an external IP
   → Behavior anomaly: never seen before

3. KI-Ops Alert:
   "Possible Credential Leak Detected
    ├─ Secret 'aws-secret' accessed 45 times in 10 min
    ├─ Exfiltrated to: 203.0.113.99:443
    ├─ Severity: CRITICAL
    └─ Actions:
        a) Rotate AWS Credentials (do now!)
        b) Kill affected pods
        c) Review CloudTrail for unauthorized API calls"
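
The "accessed 45 times in 10 min" signal in the alert above can be approximated with a sliding-window counter over audit-log timestamps. The class below is a sketch under assumed values: the 10-minute window matches the alert text, while the threshold of 20 reads and the class name are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class AccessRateMonitor:
    """Counts Secret reads in a sliding time window and flags bursts."""

    def __init__(self, window: timedelta = timedelta(minutes=10), max_reads: int = 20):
        self.window = window
        self.max_reads = max_reads  # assumed threshold, tune per Secret
        self._events: deque[datetime] = deque()

    def record(self, at: datetime) -> bool:
        """Record one read; return True if the window now exceeds the threshold."""
        self._events.append(at)
        # Drop reads that have fallen out of the window.
        while self._events and at - self._events[0] > self.window:
            self._events.popleft()
        return len(self._events) > self.max_reads


monitor = AccessRateMonitor()
start = datetime(2024, 3, 4, 2, 47)
# 45 reads, one every 13 seconds, all inside a single 10-minute window.
flags = [monitor.record(start + timedelta(seconds=13 * i)) for i in range(45)]
print("burst detected:", flags[-1])
```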

Security + Ops Integration

# KI-Ops config: include security events in alerting
observability:
  security:
    enabled: true
    sources:
      - falco          # Runtime Security
      - rbac-audit     # Kubernetes Audit Logs
      - vulnerability  # Image Scanning
      - network        # eBPF Network Flows

  alerting:
    rules:
      - name: "Critical Security Anomaly"
        condition: "security_risk_score > 8"
        notify: ["#security-oncall", "devops@company.com"]

      - name: "Image Vulnerability"
        condition: "vulnerability.cvss_score > 7.0 AND deployed_pod_count > 0"
        notify: ["#platform-eng"]
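
To make the two alerting rules above concrete, here is a sketch of how they might be evaluated against incoming events. The conditions are written as Python predicates instead of parsing the condition strings, and the event dictionary layout is an assumption. How KI-Ops evaluates its rules internally is not specified here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    notify: list[str]


# Predicates mirror the two conditions from the config; the event shape is assumed.
RULES = [
    Rule(
        name="Critical Security Anomaly",
        condition=lambda e: e.get("security_risk_score", 0) > 8,
        notify=["#security-oncall", "devops@company.com"],
    ),
    Rule(
        name="Image Vulnerability",
        condition=lambda e: (
            e.get("vulnerability", {}).get("cvss_score", 0) > 7.0
            and e.get("deployed_pod_count", 0) > 0
        ),
        notify=["#platform-eng"],
    ),
]


def route(event: dict) -> list[tuple[str, list[str]]]:
    """Return (rule name, recipients) for every rule the event triggers."""
    return [(r.name, r.notify) for r in RULES if r.condition(event)]


event = {"vulnerability": {"cvss_score": 9.8}, "deployed_pod_count": 47}
for name, recipients in route(event):
    print(f"{name} -> {', '.join(recipients)}")
```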

What You Gain with Security Observability

Holistic View: Security + performance + reliability in one platform
Faster Detection: Anomalies can be flagged early, ideally before damage occurs
Context: Why is a pod suspicious? Because: new behavior + wrong RBAC + outdated image
Automated Response: KI-Ops can auto-kill suspicious pods, rotate secrets, block IPs
Compliance: Complete audit trail for all security-relevant events
Team Alignment: Ops and Security speak the same language (signals, not silos)

Getting started: run ki-ops detect-anomalies --security to see behavioral anomalies right away.

Security is not separate. It is a signal.