Srenix · Agentic SRE

Autonomous SRE for Kubernetes.

In-cluster · Policy-bounded · Audit-anchored · Bring-your-own-LLM · Apache-2.0 open core

Why agentic SRE > AIOps observability

Closed-loop is the default, not a roadmap milestone: Every cycle runs detect → propose fix → (operator click) → apply → re-diagnose. Komodor, Robusta, Causely, NeuBird, Resolve, and Ciroos observe and summarise. Srenix mutates.
In-cluster, no telemetry exfiltration: No SaaS endpoint. No vendor LLM lock-in. Point at any OpenAI-compatible endpoint: in-cluster vLLM, Azure OpenAI, or your own gateway. Cluster data and prompts never leave your perimeter. Air-gapped and sovereign deployments are supported.
Eight-layer action policy, the leash on the agent: Action-kind allowlist, target-resource scope, namespace protection, GitOps-managed skip, signed JWT click-to-fix, dual approval for T3 vault break-glass, hash-chained audit, and a per-(approver, class) rate budget. All in Apache-2.0 source.
Open core, so you can read it before you install: The OSS engine is Apache-2.0 and ships 21 K8s probes, 30 cloud probes (AWS/GCP/Azure), 20 analyzers, 5 policy-bounded fixers, and a controller-runtime operator. The paid Srenix Enterprise tier adds the LLM Investigator and the T0–T3 AI SRE ladder.
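
As an illustration of how layered policy checks like these can gate an action, here is a minimal sketch covering four of the eight layers (action-kind allowlist, namespace protection, GitOps-managed skip, rate budget). Every name in it (Action, RateBudget, evaluate, the label key, the allowlist contents) is hypothetical and chosen for the example; it is not taken from the Srenix source.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration only, not Srenix's real configuration.
ALLOWED_ACTION_KINDS = frozenset({"restart_pod", "scale_deployment"})
PROTECTED_NAMESPACES = frozenset({"kube-system"})
GITOPS_LABEL = "example.com/gitops-managed"  # assumed marker label


@dataclass
class Action:
    kind: str
    namespace: str
    labels: dict
    approver: str
    action_class: str


@dataclass
class RateBudget:
    """Counts approved actions per (approver, class) pair against a fixed limit."""
    limit: int
    used: dict = field(default_factory=dict)

    def try_spend(self, approver, action_class):
        key = (approver, action_class)
        if self.used.get(key, 0) >= self.limit:
            return False
        self.used[key] = self.used.get(key, 0) + 1
        return True


def evaluate(action, budget):
    """Return (allowed, reason). Checks run in order; the first failure wins."""
    if action.kind not in ALLOWED_ACTION_KINDS:
        return False, "action kind not in allowlist"
    if action.namespace in PROTECTED_NAMESPACES:
        return False, "namespace is protected"
    if action.labels.get(GITOPS_LABEL) == "true":
        return False, "resource is GitOps-managed"
    # The budget is only spent once every earlier layer has passed.
    if not budget.try_spend(action.approver, action.action_class):
        return False, "rate budget exhausted"
    return True, "ok"
```

Running the budget check last means a rejected action never consumes an approver's budget, and returning a reason string gives an audit record something concrete to store.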

Architecture — the autopilot loop

   ┌──────────────────────────────────────────────────────┐
   │   In-cluster watcher (leader-elected, K8s 1.27+)     │
   │                                                      │
   │   Probes ── 21 K8s ── 30 cloud ── 20 analyzers       │
   │     │        Nodes    RDS     GitOps (Argo/Flux)     │
   │     │        Pods     EBS     WorkloadState (CNPG)   │
   │     │        Postgres EKS     RBAC (wildcards/SA)    │
   │     │        Ceph     IAM                            │
   │     ▼                                                │
   │   Diagnose ──► DriftReport CRDs                      │
   │     │                                                │
   │     ▼                                                │
   │   T0 narration (LLM) ──► Slack / Alertmanager        │
   │     │                                                │
   │     ▼                                                │
   │   T1 fix proposer (LLM, JSON-mode)                   │
   │     │   ╭── patch_validator: JSONPath allow-list     │
   │     │   ╰── action_kind allowlist (closed enum)      │
   │     ▼                                                │
   │   Ed25519-signed JWT click-to-fix URL                │
   │     │   ── delivered to operator via Slack / ticket  │
   │     │   ── expiry + JTI replay protection            │
   │     ▼                                                │
   │   Operator clicks ──► Mutate (5 fixers)              │
   │     │                                                │
   │     ▼                                                │
   │   T2 multi-step planner / T3 vault runbook           │
   │   (dual-approval, 30-min window, key names only)     │
   │     │                                                │
   │     ▼                                                │
   │   Re-diagnose ──► DriftReport cleared (deleted)      │
   │     │                                                │
   │     ▼                                                │
   │   AuditEvent (hash-chained, prev_hash)               │
   │     ── JSONL sink (operator-controlled)              │
   │     ── Prometheus metrics                            │
   └──────────────────────────────────────────────────────┘
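
The last stage of the loop writes hash-chained AuditEvents to a JSONL sink. A minimal sketch of that idea follows, assuming SHA-256 over canonical JSON and an all-zero genesis hash. Only the prev_hash field name comes from the diagram above; the record layout and the function names are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel prev_hash for the first event


def _digest(body):
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(lines, event):
    """Append `event` to a list of JSONL lines, linked to the previous line."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    body = {"event": event, "prev_hash": prev_hash}
    record = dict(body, hash=_digest(body))
    lines.append(json.dumps(record, sort_keys=True))
    return record


def verify_chain(lines):
    """Return the index of the first broken record, or -1 if the chain is intact."""
    prev_hash = GENESIS
    for i, line in enumerate(lines):
        record = json.loads(line)
        body = {"event": record["event"], "prev_hash": record["prev_hash"]}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return i
        prev_hash = record["hash"]
    return -1
```

Because each record's hash covers the previous record's hash, editing or dropping any earlier line makes verify_chain report a break at or before that position.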

T0 → T3 tier ladder

T0 Narration: LLM-enriched diagnostic text in Slack · sub-second latency on in-cluster Qwen 3.6 35B
T1 Fix proposer: agent picks an action_kind from the allowlist and returns a JWT click-to-fix URL
T2 Planner: multi-step plan; each step is still a JWT click, the plan is signed as a unit, and prerequisite ordering is enforced at execution
T3 Vault break-glass: runbook text plus a dual-approval timeline; Srenix never executes the runbook, the operator runs it manually
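
To make "prerequisite ordering enforced at execution" concrete for the T2 tier, here is a minimal sketch in which each operator click approves one step and the executor refuses any step whose prerequisites have not run yet. Step, PlanExecutor, and PlanOrderError are hypothetical names; signature verification of the plan and of each click is assumed to have happened before approve is called.

```python
from dataclasses import dataclass


class PlanOrderError(Exception):
    """Raised when a step is approved out of order, twice, or is unknown."""


@dataclass(frozen=True)
class Step:
    step_id: str
    requires: tuple = ()  # ids of steps that must complete first


class PlanExecutor:
    def __init__(self, steps):
        self._steps = {s.step_id: s for s in steps}
        self.done = []  # step ids in the order they were executed

    def approve(self, step_id):
        """Handle one operator click; run the step only if its prerequisites ran."""
        step = self._steps.get(step_id)
        if step is None:
            raise PlanOrderError(f"unknown step: {step_id}")
        if step_id in self.done:
            raise PlanOrderError(f"step already executed: {step_id}")
        missing = [r for r in step.requires if r not in self.done]
        if missing:
            raise PlanOrderError(f"{step_id} is blocked by: {', '.join(missing)}")
        self.done.append(step_id)  # a real executor would apply the fix here
```

For a two-step plan such as Step("cordon") followed by Step("drain", ("cordon",)), clicking the drain link first raises PlanOrderError, and clicking cordon and then drain succeeds.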