Srenix · Agentic SRE
Autonomous SRE for Kubernetes.
In-cluster · Policy-bounded · Audit-anchored · Bring-your-own-LLM · Apache-2.0 open core
Why agentic SRE > AIOps observability
- Closed-loop is the default, not a roadmap milestone.
- Every cycle: detect → propose fix → (operator click) → apply → re-diagnose. Komodor, Robusta, Causely, NeuBird, Resolve, Ciroos: they observe and summarise. Srenix mutates.
- In-cluster — no telemetry exfiltration.
- No SaaS endpoint. No vendor LLM lock-in. Point at any OpenAI-compatible endpoint — in-cluster vLLM, Azure OpenAI, your own gateway. Cluster data and prompts never leave your perimeter. Air-gap and sovereign deployments supported.
- Eight-layer action policy — the leash on the agent.
- Action-kind allowlist, target-resource scope, namespace protection, GitOps-managed skip, signed JWT click-to-fix, dual-approval for T3 vault break-glass, hash-chained audit, per-(approver, class) rate budget. All in Apache-2.0 source.
- Open core — you can read it before you install.
- The OSS engine is Apache-2.0: 21 K8s probes, 30 cloud probes (AWS/GCP/Azure), 20 analyzers, 5 policy-bounded fixers, and a controller-runtime operator. The paid Srenix Enterprise tier adds the LLM Investigator and the T0–T3 AI SRE ladder.
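To make the layered action policy described above more concrete, here is a minimal sketch of how four of the eight layers (action-kind allowlist, namespace protection, GitOps-managed skip, per-(approver, class) rate budget) might be chained as a deny-first gate. It is not Srenix source: every name in it (`Action`, `RateBudget`, `evaluate`, the action kinds, and the use of the `app.kubernetes.io/managed-by` label as a GitOps marker) is an assumption made for illustration, and the budget is a simple counter with no time window.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration only; not Srenix identifiers.
ALLOWED_ACTION_KINDS = frozenset({"restart_pod", "scale_deployment"})
PROTECTED_NAMESPACES = frozenset({"kube-system"})
GITOPS_LABEL = "app.kubernetes.io/managed-by"  # assumed marker for GitOps ownership
GITOPS_MANAGERS = frozenset({"argocd", "flux"})


@dataclass
class Action:
    kind: str
    namespace: str
    labels: dict = field(default_factory=dict)
    approver: str = ""
    action_class: str = ""


@dataclass
class RateBudget:
    """Counts approved actions per (approver, class) pair; no time window here."""
    limit: int
    used: dict = field(default_factory=dict)

    def try_spend(self, approver: str, action_class: str) -> bool:
        key = (approver, action_class)
        if self.used.get(key, 0) >= self.limit:
            return False
        self.used[key] = self.used.get(key, 0) + 1
        return True


def evaluate(action: Action, budget: RateBudget) -> tuple:
    """Return (allowed, reason). The first failing layer wins."""
    if action.kind not in ALLOWED_ACTION_KINDS:
        return False, "action kind not in allowlist"
    if action.namespace in PROTECTED_NAMESPACES:
        return False, "namespace is protected"
    if action.labels.get(GITOPS_LABEL) in GITOPS_MANAGERS:
        return False, "resource is GitOps-managed"
    # The budget is charged last, so denied actions never consume it.
    if not budget.try_spend(action.approver, action.action_class):
        return False, "rate budget exhausted"
    return True, "allowed"
```

Ordering the checks from cheapest and most static (the closed enum) to stateful (the budget) keeps a rejected request from having any side effect.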
Architecture — the autopilot loop
┌────────────────────────────────────────────────────────┐
│  In-cluster watcher (leader-elected, K8s 1.27+)        │
│                                                        │
│  Probes ── 21 K8s ── 30 cloud ── 20 analyzers          │
│    │       Nodes     RDS         GitOps (Argo/Flux)    │
│    │       Pods      EBS         WorkloadState (CNPG)  │
│    │       Postgres  EKS         RBAC (wildcards/SA)   │
│    │       Ceph      IAM                               │
│    ▼                                                   │
│  Diagnose ──► DriftReport CRDs                         │
│    │                                                   │
│    ▼                                                   │
│  T0 narration (LLM) ──► Slack / Alertmanager           │
│    │                                                   │
│    ▼                                                   │
│  T1 fix proposer (LLM, JSON-mode)                      │
│    │  ╭── patch_validator: JSONPath allow-list         │
│    │  ╰── action_kind allowlist (closed enum)          │
│    ▼                                                   │
│  Ed25519-signed JWT click-to-fix URL                   │
│    │  ── delivered to operator via Slack / ticket      │
│    │  ── expiry + JTI replay protection                │
│    ▼                                                   │
│  Operator clicks ──► Mutate (5 fixers)                 │
│    │                                                   │
│    ▼                                                   │
│  T2 multi-step planner / T3 vault runbook              │
│  (dual-approval, 30-min window, key names only)        │
│    │                                                   │
│    ▼                                                   │
│  Re-diagnose ──► DriftReport cleared (deleted)         │
│    │                                                   │
│    ▼                                                   │
│  AuditEvent (hash-chained, prev_hash)                  │
│    ── JSONL sink (operator-controlled)                 │
│    ── Prometheus metrics                               │
└────────────────────────────────────────────────────────┘
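The last stage of the loop is a hash-chained AuditEvent with a prev_hash field, written to a JSONL sink. The sketch below shows the general technique only: each record stores the hash of its predecessor, so editing or dropping an earlier record breaks verification of everything after it. The record layout, the `GENESIS` sentinel, the choice of SHA-256 over canonical JSON, and all function names are assumptions for illustration, not the Srenix schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for the first record's prev_hash


def _digest(event: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    payload = json.dumps(
        {"event": event, "prev_hash": prev_hash},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(chain: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _digest(event, prev_hash),
    }
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited or missing record breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(record["event"], record["prev_hash"]):
            return False
        prev_hash = record["hash"]
    return True


def to_jsonl(chain: list) -> str:
    """Serialise one record per line, as a JSONL sink would store it."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in chain)
```

A verifier that holds only the most recent hash can detect tampering anywhere earlier in the file, which is what makes an operator-controlled sink useful as evidence.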
T0 → T3 tier ladder
- T0 Narration — LLM-enriched diagnostic text in Slack · sub-second latency on in-cluster Qwen 3.6 35B
- T1 Fix proposer — agent picks action_kind from the allowlist; JWT click-to-fix URL
- T2 Planner — multi-step plan, each step still a JWT click; plan signed as a unit; prerequisite ordering enforced at execution
- T3 Vault break-glass — runbook text + dual-approval timeline; Srenix never executes, the operator runs the runbook manually
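The T2 tier states that prerequisite ordering is enforced at execution. One way to read that is sketched below: a step is refused until every step it depends on has been applied, regardless of the order in which the operator clicks. `Step`, `Plan`, and the step names are hypothetical, and plan signing and the per-step JWT check are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    name: str
    requires: tuple = ()  # names of steps that must be applied first


@dataclass
class Plan:
    steps: dict  # step name -> Step
    applied: set = field(default_factory=set)

    def missing_prerequisites(self, name: str) -> list:
        """List the prerequisites of a step that have not been applied yet."""
        return [r for r in self.steps[name].requires if r not in self.applied]

    def apply(self, name: str) -> None:
        """Mark a step applied, refusing if any prerequisite is still pending."""
        missing = self.missing_prerequisites(name)
        if missing:
            raise RuntimeError(
                f"step {name!r} blocked; pending prerequisites: {missing}"
            )
        self.applied.add(name)


# Usage: a two-step plan where the second step depends on the first.
plan = Plan(steps={
    "cordon_node": Step("cordon_node"),
    "drain_node": Step("drain_node", requires=("cordon_node",)),
})
```

Checking at execution time rather than at planning time means an out-of-order click fails safely even if the plan itself was valid when it was signed.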