AIC//
LIVE
◈ DEMO MODE · Simulated incident data, no live cluster required · ↗ GitHub

Portfolio Project · AIOps · Kubernetes

Autonomous Incident
Commander

AI-powered incident detection, root-cause analysis, and safe remediation for Kubernetes. In simulated incidents, it reduces MTTR by 50%+, cuts alert noise by 40%+, and resolves 60%+ of common incidents without manual shell access.

◈ PLATFORM METRICS — LIVE

MTTR IMPROVEMENT

58%

simulated incidents vs baseline

ALERT NOISE CUT

43%

via dedup + grouping

AUTO-RESOLVED

67%

no manual shell access

AVG RESOLUTION

14 min

triage to close

◈ CONTROL LOOP ARCHITECTURE

01

Telemetry Ingestion

Alertmanager webhook → normalize alerts, pull Loki logs, fetch rollout history → canonical IncidentContext object
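The normalization step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `IncidentContext` field names, the `normalize` helper, and the severity ordering are assumptions, and the sample payload is made up. Only the top-level payload keys (`groupKey`, `status`, `alerts`, and per-alert `labels` and `startsAt`) follow the documented Alertmanager webhook format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class IncidentContext:
    """Canonical view of one incident (field names are illustrative)."""
    group_key: str
    status: str
    alert_names: list[str]
    namespaces: list[str]
    severity: str
    started_at: str
    # Placeholders for the later enrichment steps (Loki logs, rollout history).
    log_lines: list[str] = field(default_factory=list)
    rollout_history: list[dict[str, Any]] = field(default_factory=list)


# Assumed ordering, most severe first; unknown values rank last.
_SEVERITY_ORDER = ["critical", "warning", "info"]


def _rank(severity: str) -> int:
    if severity in _SEVERITY_ORDER:
        return _SEVERITY_ORDER.index(severity)
    return len(_SEVERITY_ORDER)


def normalize(payload: dict[str, Any]) -> IncidentContext:
    """Collapse one Alertmanager webhook payload into an IncidentContext."""
    alerts = payload.get("alerts", [])
    labels = [a.get("labels", {}) for a in alerts]
    severities = [l.get("severity", "info") for l in labels]
    # Assumes all startsAt values are UTC RFC 3339 strings in the same
    # layout, so lexicographic order matches chronological order.
    starts = [a["startsAt"] for a in alerts if a.get("startsAt")]
    return IncidentContext(
        group_key=payload.get("groupKey", ""),
        status=payload.get("status", "firing"),
        alert_names=sorted({l["alertname"] for l in labels if "alertname" in l}),
        namespaces=sorted({l["namespace"] for l in labels if "namespace" in l}),
        severity=min(severities, key=_rank, default="info"),
        started_at=min(starts, default=""),
    )


# Made-up payload for illustration only.
sample = {
    "version": "4",
    "groupKey": '{}:{alertname="PodCrashLooping"}',
    "status": "firing",
    "alerts": [
        {"status": "firing", "startsAt": "2024-01-01T10:05:00Z",
         "labels": {"alertname": "PodCrashLooping", "namespace": "payments",
                    "severity": "critical"}},
        {"status": "firing", "startsAt": "2024-01-01T10:02:00Z",
         "labels": {"alertname": "PodCrashLooping", "namespace": "payments",
                    "severity": "warning"}},
    ],
}
ctx = normalize(sample)
```

Keeping the log and rollout fields on the same object means later stages can be handed a single value rather than several loosely related ones.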

02

AI Reasoning

LangChain agent retrieves similar incidents + runbooks via FAISS, prompts LLM with structured evidence, returns typed hypothesis
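One way to read "typed hypothesis" is that the LLM's reply is parsed into a validated structure before anything downstream sees it. The sketch below assumes the model is asked to reply in JSON; the `Hypothesis` fields, the `ALLOWED_ACTIONS` set, and `parse_hypothesis` are hypothetical names, and the retrieval and prompting steps are left out.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary; the real set is not given in the source.
ALLOWED_ACTIONS = {"restart", "scale", "rollback", "none"}


@dataclass(frozen=True)
class Hypothesis:
    root_cause: str
    proposed_action: str
    confidence: float
    evidence: tuple[str, ...]


def parse_hypothesis(raw: str) -> Hypothesis:
    """Parse and validate a JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    root_cause = data.get("root_cause")
    if not isinstance(root_cause, str) or not root_cause.strip():
        raise ValueError("root_cause must be a non-empty string")

    action = data.get("proposed_action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    evidence = data.get("evidence", [])
    if not isinstance(evidence, list):
        raise ValueError("evidence must be a list")

    return Hypothesis(
        root_cause=root_cause.strip(),
        proposed_action=action,
        confidence=float(confidence),
        evidence=tuple(str(item) for item in evidence),
    )
```

Rejecting malformed replies at this boundary means the policy gate only ever receives actions and confidence values in a known range.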

03

Policy Gate

Confidence threshold check, namespace allowlist, action allowlist. High-risk actions route to human approval queue
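The gate described above could look something like the sketch below. The threshold value, the namespace and action names, and the choice to send low-confidence proposals to the approval queue (rather than denying them) are all assumptions made for illustration; the source only states which checks exist.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Policy:
    # All values below are hypothetical defaults.
    min_confidence: float = 0.8
    allowed_namespaces: frozenset = frozenset({"staging", "payments"})
    allowed_actions: frozenset = frozenset({"restart", "scale", "rollback"})
    high_risk_actions: frozenset = frozenset({"rollback"})


def evaluate(policy: Policy, action: str, namespace: str,
             confidence: float) -> Verdict:
    """Apply allowlists first, then risk and confidence checks."""
    # Anything outside the allowlists is rejected outright.
    if action not in policy.allowed_actions:
        return Verdict.DENY
    if namespace not in policy.allowed_namespaces:
        return Verdict.DENY
    # High-risk actions always go to a human, whatever the confidence.
    if action in policy.high_risk_actions:
        return Verdict.NEEDS_APPROVAL
    # Assumption: low confidence is escalated rather than denied.
    if confidence < policy.min_confidence:
        return Verdict.NEEDS_APPROVAL
    return Verdict.EXECUTE
```

Checking the allowlists before anything else keeps the default outcome a denial: a proposal has to pass every check to run unattended.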

04

Safe Remediation

Kubernetes Python client executes restart/scale. ArgoCD REST API triggers rollback. Post-action health polling confirms recovery
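Post-action health polling can be sketched independently of any cluster by passing the health check in as a callable. The function name, the defaults, and the "several consecutive successes" rule are assumptions; in a real deployment `is_healthy` might compare a Deployment's ready replicas against its desired count via the Kubernetes Python client.

```python
import time
from typing import Callable


def wait_for_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    required_successes: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the check passes several times in a row or time runs out."""
    deadline = clock() + timeout_s
    streak = 0
    while True:
        try:
            ok = bool(is_healthy())
        except Exception:
            # A failing probe counts as unhealthy instead of aborting the loop.
            ok = False
        # Any failure resets the streak, so a brief flap does not count.
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        # Stop if the next poll would land past the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

Requiring a streak rather than a single success guards against declaring recovery while a pod is still restarting. The injectable `sleep` and `clock` parameters exist only to make the loop testable without waiting.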

05

Audit & Observability

Every prompt, decision, approval, and action is appended to an immutable audit log. Prometheus metrics are exposed on all services
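The source does not say how immutability is achieved. One common approach, shown here purely as an assumption, is a hash chain: each entry stores the hash of the previous one, so any later edit breaks verification. The `AuditLog` class and its entry layout are hypothetical, storage is in-memory only, and timestamps are omitted to keep the sketch deterministic.

```python
import copy
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict[str, Any]) -> str:
    # Canonical JSON so the same event always hashes the same way.
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, tamper-evident log (in-memory sketch)."""

    def __init__(self) -> None:
        self._entries: list[dict[str, Any]] = []

    def append(self, kind: str, detail: dict[str, Any]) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        event = {
            "seq": len(self._entries),
            "kind": kind,  # e.g. "prompt", "decision", "approval", "action"
            "detail": copy.deepcopy(detail),  # detach from the caller's dict
        }
        entry = {"event": event, "prev": prev, "hash": _digest(prev, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; False means some entry was altered."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True
```

A hash chain makes tampering detectable but does not prevent it; pairing it with write-once storage would be a separate design decision.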

◈ TECHNOLOGY STACK

PLATFORM

AWS EKS · Terraform · ArgoCD · Helm

OBSERVABILITY

Prometheus · Alertmanager · Loki · Grafana

AI / AGENT

Python · LangChain · OpenAI · FAISS

AUTOMATION

K8s Python Client · kubectl · ArgoCD REST · Webhooks

SECURITY

RBAC · Network Policies · Trivy · Secrets Manager

CI/CD

GitHub Actions · Docker · ECR · ArgoCD GitOps

◈ BUSINESS OUTCOMES

Reduces MTTR

50%+ reduction in mean time to resolve in simulated incidents

Cuts alert noise

40%+ via deduplication and grouping of related alerts
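Deduplication and grouping might be sketched as below. This is an assumption about the approach, not the project's implementation: duplicates are identified by the `fingerprint` field that Alertmanager includes on each webhook alert (falling back to the full label set), and the grouping labels are illustrative.

```python
from collections import defaultdict
from typing import Any


def dedup_and_group(
    alerts: list[dict[str, Any]],
    group_by: tuple[str, ...] = ("alertname", "namespace"),
) -> dict[tuple[str, ...], list[dict[str, Any]]]:
    """Drop repeated alerts, then bucket the rest by selected labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        # Prefer the fingerprint; otherwise use the sorted label pairs.
        identity = alert.get("fingerprint") or tuple(sorted(labels.items()))
        if identity in seen:
            continue  # same alert delivered again
        seen.add(identity)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Made-up alerts for illustration only.
raw_alerts = [
    {"fingerprint": "a1", "labels": {"alertname": "HighLatency", "namespace": "payments", "pod": "api-0"}},
    {"fingerprint": "a1", "labels": {"alertname": "HighLatency", "namespace": "payments", "pod": "api-0"}},
    {"fingerprint": "b2", "labels": {"alertname": "HighLatency", "namespace": "payments", "pod": "api-1"}},
    {"fingerprint": "c3", "labels": {"alertname": "PodCrashLooping", "namespace": "staging", "pod": "web-0"}},
]
grouped = dedup_and_group(raw_alerts)
```

Here four raw alerts collapse into two groups, so an on-call engineer would see two incidents instead of four notifications.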

Autonomous resolution

60%+ of common incidents resolved without manual shell access

Auditability

Every prompt, decision, approval, and action logged with full traceability

Human control preserved

High-risk actions require explicit approval. Agent cannot call arbitrary tools.

Policy-gated execution

Allowlist-only actions. Blocked namespaces. Confidence thresholds enforced.

◈ READY TO EXPLORE

See it in action

Open the live incident dashboard to see AI triage, confidence scoring, and remediation decisions on simulated Kubernetes incidents.

OPEN DASHBOARD →