From cluster health analyzer to AI SRE.

Why we re-framed Srenix and the safety envelope under the new name. (Historical post — predates the v0.1.0-alpha.1 re-baseline.)

Salil Kadam · 2026-05-27 · 7 min
Historical post. Written before the 2026-06-18 pre-alpha re-baseline. Version numbers below (v1.x) refer to internal pre-alpha iterations mis-numbered as 1.x; all of that work is now folded into the current v0.1.0-alpha.1 release. The 1.x tags here are kept verbatim for the historical record, not as the current version.

The feedback

Earlier this month an external GTM reviewer looked at the Srenix pitch and gave us three specific notes:

  1. The name "cluster health analyzer" sells the v1.0 product, not the v1.6 product. It anchors readers on observability instead of on the closed-loop agent we actually shipped.
  2. The catalog is still secret-heavy. Vault and ESO are the most-developed drift class — but real on-call rotations also page on GitOps stuck syncs, CNPG primary failover, and RBAC blast-radius growth. Broaden the surface.
  3. The language in the docs is rule-based ("if condition then fixer"). That framing makes the product feel static and reactive. Lead with the LLM-driven investigation surface, not the deterministic fixers underneath.

Each note had teeth. None of them were "rewrite the product." They were "describe the product you actually built, and finish the catalog you said you would." We turned them into a four-week implementation plan and started shipping.

The reframe (Workstream A · C)

The OSS engine kept its name — agentic-sre and the srenix binary. Those Helm releases are in production on pilot clusters; renaming them would have been a backwards-incompatibility without an upside. What changed was the marketing surface and the paid tier framing:

  • The homepage hero leads with "Autonomous SRE for Kubernetes." The eyebrow chip reads "AI SRE · In-cluster · Bring-your-own-LLM · Open core." The H1 itself stays "Detect. Remediate. Verify." because that line tested as the most evocative anchor of what the product does.
  • The "vs. the competition" table lists Komodor, Robusta, Causely, NeuBird, Resolve AI, Ciroos, and Tracer/OpenSRE by name and calls out the structural difference: they observe; Srenix mutates.
  • The paid tier is now "Srenix Enterprise AI SRE" rather than "Srenix Enterprise paid." The T0–T3 tier ladder (narration → fix proposal → multi-step plan → vault break-glass runbook) is above the fold on pricing.
  • A new /features/policy page documents the eight-layer action policy that constrains the agent. "Reasoning power is the agent; the policy is the leash" became the explicit framing.

Three new drift-class analyzers (Workstream B)

The biggest mechanical change in v1.7.0: the OSS engine's catalog broadens from secret/credential drift into three additional classes, each with its own analyzer registered in catalog/ and toggleable in Helm.

1. GitOpsDrift

PR #69

Flags an Argo CD Application that is OutOfSync or Degraded, or a Flux Kustomization / HelmRelease with Ready=False, past a 10-minute grace period. Reasons matching *Failed (BuildFailed, UpgradeFailed, InstallFailed) escalate to critical. The Reader ClusterRole gains read access to the Argo and Flux CRDs.
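The grading rule for the Flux half of this analyzer can be sketched in a few lines of Go. This is a minimal illustration, not the analyzer's code: the names Severity, GracePeriod, and classifyFluxReady are hypothetical, and the "warning" level for non-*Failed reasons is an assumption, since the post only names the critical case.

```go
package main

import (
	"strings"
	"time"
)

// Severity is a hypothetical finding level used only in this sketch.
type Severity string

const (
	SeverityNone     Severity = "none"
	SeverityWarning  Severity = "warning" // assumed level for non-*Failed reasons
	SeverityCritical Severity = "critical"
)

// GracePeriod mirrors the 10-minute grace described above.
const GracePeriod = 10 * time.Minute

// classifyFluxReady grades a Flux Kustomization or HelmRelease from its Ready
// condition: whether it is ready, the condition reason, and how long it has
// been not-ready.
func classifyFluxReady(ready bool, reason string, notReadyFor time.Duration) Severity {
	if ready || notReadyFor < GracePeriod {
		return SeverityNone // healthy, or still inside the grace window
	}
	if strings.HasSuffix(reason, "Failed") {
		return SeverityCritical // BuildFailed, UpgradeFailed, InstallFailed, ...
	}
	return SeverityWarning
}
```

For example, classifyFluxReady(false, "UpgradeFailed", 15*time.Minute) grades as critical, while the same reason five minutes in is not yet reported.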

2. WorkloadStateDrift

PR #70

State-tier signals the basic "X/Y ready" probe misses. CNPG cluster: non-healthy phase, follower lag while phase healthy, primary switchover stuck. StatefulSet ordinal-zero: pod-0 missing while higher ordinals run; pod-0 unready while higher ordinals Ready.
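The two StatefulSet ordinal-zero signals reduce to a small comparison between pod-0 and the rest of the set. The sketch below is illustrative only: PodState and ordinalZeroDrift are hypothetical names, and it assumes that a single running or Ready higher ordinal is enough to trigger the signal, which the post does not specify.

```go
package main

// PodState is a simplified view of one StatefulSet pod; the slice index is
// the pod's ordinal. Hypothetical type for this sketch.
type PodState struct {
	Exists bool
	Ready  bool
}

const (
	missingZero = "pod-0 missing while higher ordinals run"
	unreadyZero = "pod-0 unready while higher ordinals Ready"
)

// ordinalZeroDrift returns a finding message, or "" when there is no drift.
func ordinalZeroDrift(pods []PodState) string {
	if len(pods) < 2 {
		return "" // nothing to compare pod-0 against
	}
	higherExists, higherReady := false, false
	for _, p := range pods[1:] {
		higherExists = higherExists || p.Exists
		higherReady = higherReady || (p.Exists && p.Ready)
	}
	zero := pods[0]
	switch {
	case !zero.Exists && higherExists:
		return missingZero
	case zero.Exists && !zero.Ready && higherReady:
		return unreadyZero
	}
	return ""
}
```

A set where every pod is still unready (a cold start, for instance) returns "" under this rule, which is the gap the "X/Y ready" probe would otherwise fill.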

3. RBACDrift

PR #71

Wildcard verbs in user-defined Role / ClusterRole (skips cluster-admin, system:*, and the kube-system / kube-public / kube-node-lease namespaces). Unbound ServiceAccount mounted by a Pod (skips the default SA, skips kube-system Pods). Remediation includes the exact kubectl create rolebinding command.
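As a rough illustration of the wildcard-verb check and its skip list, here is a sketch over a simplified role type. Role and flagsWildcardVerbs are hypothetical names; the real analyzer presumably works on the Kubernetes RBAC API types rather than this stand-in.

```go
package main

import "strings"

// Role is a simplified stand-in for an RBAC Role or ClusterRole.
type Role struct {
	Name      string
	Namespace string     // empty for a ClusterRole
	RuleVerbs [][]string // the verbs of each policy rule
}

// skippedNamespaces are the system namespaces the post says are ignored.
var skippedNamespaces = map[string]bool{
	"kube-system":     true,
	"kube-public":     true,
	"kube-node-lease": true,
}

// flagsWildcardVerbs reports whether a user-defined role grants the "*" verb.
func flagsWildcardVerbs(r Role) bool {
	if r.Name == "cluster-admin" || strings.HasPrefix(r.Name, "system:") {
		return false // built-in roles are skipped
	}
	if skippedNamespaces[r.Namespace] {
		return false
	}
	for _, verbs := range r.RuleVerbs {
		for _, v := range verbs {
			if v == "*" {
				return true
			}
		}
	}
	return false
}
```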

All three analyzers are default-on. Each has an opt-out Helm value (analyzers.gitopsDrift.enabled, etc.) plus an env-var gate (SRENIX_ANALYZER_GITOPS_DRIFT=off) for clusters that don't run Argo / Flux or want to defer the RBAC noise until they've cleaned their wildcards. The Reader ClusterRole picks up the new CRD reads through the chart's RBAC template — no manual edits required.
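How the two opt-outs combine is not spelled out here. One plausible reading, sketched below, is that either gate can disable an analyzer. The function name analyzerEnabled and the case-insensitive handling of "off" are assumptions for illustration, not documented behavior.

```go
package main

import "strings"

// analyzerEnabled combines the Helm value (for example
// analyzers.gitopsDrift.enabled) with the env-var gate (for example
// SRENIX_ANALYZER_GITOPS_DRIFT). Assumption: either opt-out disables the
// analyzer, and an unset env var leaves the Helm value in charge.
func analyzerEnabled(helmEnabled bool, envValue string) bool {
	if strings.EqualFold(strings.TrimSpace(envValue), "off") {
		return false
	}
	return helmEnabled
}
```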

An LLM-classified fixer matcher (Workstream C)

In Srenix Enterprise v1.7.0 (the paid binary), the keyword DefaultFixerMatcher — a switch statement that keyword-matches diagnostic.Source to a fixer name — is joined by an opt-in LLMFixerMatcher. When you pass --ai-llm-fixer-matcher, the agent makes a small JSON-mode LLM call against the diagnostic ("which fixer name, from this whitelist, best handles this finding?"), permissively parses the response (strict JSON / fenced JSON / JSON-in-prose / bare name), and falls back to the keyword matcher on any error or hallucination.

Worst case is identical to v1.6.x behavior. Best case, the agent recognises new diagnostic surfaces without us shipping a new keyword rule first. The action_kind whitelist is still in force — the LLM gets to pick a fixer name, not to invent one.
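The permissive-parse-then-fall-back behavior can be sketched as follows. This is a hedged illustration rather than the LLMFixerMatcher source: the JSON key "fixer", the function names, and the fixer names in the usage note are all hypothetical.

```go
package main

import (
	"encoding/json"
	"strings"
)

// parseFixerName pulls a fixer name out of raw LLM output and accepts it only
// if it is on the whitelist. The "fixer" JSON key is an assumption.
func parseFixerName(raw string, allowed map[string]bool) (string, bool) {
	text := strings.TrimSpace(raw)

	// Taking the outermost {...} span covers strict JSON, fenced JSON, and
	// JSON embedded in prose with one code path.
	start, end := strings.Index(text, "{"), strings.LastIndex(text, "}")
	if start >= 0 && end > start {
		var payload struct {
			Fixer string `json:"fixer"`
		}
		err := json.Unmarshal([]byte(text[start:end+1]), &payload)
		if err == nil && allowed[payload.Fixer] {
			return payload.Fixer, true
		}
	}

	// Bare name, possibly wrapped in quotes, backticks, or a trailing period.
	name := strings.Trim(text, "`\"' \n\t.")
	if allowed[name] {
		return name, true
	}
	return "", false // unparseable or hallucinated name
}

// matchFixer prefers the LLM's answer and falls back to the keyword matcher
// whenever the answer cannot be parsed or is not whitelisted.
func matchFixer(raw string, allowed map[string]bool, keywordFallback func() string) string {
	if name, ok := parseFixerName(raw, allowed); ok {
		return name
	}
	return keywordFallback()
}
```

With a whitelist of "restart-deployment" and "delete-stuck-job", a reply of {"fixer":"drop-database"} fails the whitelist check and the keyword matcher's answer is used instead, which is the "worst case is identical" property described above.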

The safety story under the new name

Repositioning as an AI SRE means the policy story matters more, not less. The eight-layer action policy hasn't changed structurally since v1.6:

  • Action-kind allowlist (pkg/ai/types.go): a closed enum of DeletePod, PatchDeploymentAnnotation, DeleteJob, DeleteCertificateRequest, and PatchIngressTLS. Anything else is rejected by the validator before it reaches an operator.
  • Target-resource scope — each action_kind ships with a hard-coded resource filter. PatchDeploymentAnnotation only patches the kubectl restartedAt annotation — never container images, never replicas.
  • Namespace protection — protected namespaces (kube-system, vault, external-secrets, cnpg-system, rook-ceph by default; configurable) are refused.
  • GitOps-managed skip — resources labelled by Argo / Flux / Helm are skipped. The agent doesn't fight a reconciliation loop. (And v1.7.0 now also detects when one of those loops is itself the drift.)
  • Signed JWT click-to-fix — every T1+ mutation is wrapped in an Ed25519-signed URL with expiry + JTI replay protection. Without the click, nothing mutates.
  • Dual-approval for T3 — vault break-glass runbooks require two distinct approvers separated by at least 30 minutes. The runbook is never executed by Srenix; the operator runs it manually after dual approval. Key names only, never values.
  • Hash-chained audit — every AI action emits an AuditEvent with prev_hash chained against the prior event. Tamper-evident even if a downstream sink is compromised.
  • Investigation rate budget — layer-2 LLM investigator calls have their own per-(approver, diagnostic_class) token-bucket budget independent of the proposal budget. Default 10/hour; prevents flapping-workload cost blowup.
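Two of the layers in the list above, the action-kind allowlist and namespace protection, are simple enough to sketch together. The action kinds and default namespaces are taken from the list; the function name validateAction and the map-based representation are assumptions, not the contents of pkg/ai/types.go.

```go
package main

import "fmt"

// allowedActionKinds is the closed set of action kinds named in the post.
var allowedActionKinds = map[string]bool{
	"DeletePod":                 true,
	"PatchDeploymentAnnotation": true,
	"DeleteJob":                 true,
	"DeleteCertificateRequest":  true,
	"PatchIngressTLS":           true,
}

// protectedNamespaces holds the default protected namespaces; the post says
// this list is configurable.
var protectedNamespaces = map[string]bool{
	"kube-system":      true,
	"vault":            true,
	"external-secrets": true,
	"cnpg-system":      true,
	"rook-ceph":        true,
}

// validateAction rejects a proposed action before any approval link would be
// issued. Hypothetical helper for illustration.
func validateAction(kind, namespace string) error {
	if !allowedActionKinds[kind] {
		return fmt.Errorf("action kind %q is not in the allowlist", kind)
	}
	if protectedNamespaces[namespace] {
		return fmt.Errorf("namespace %q is protected", namespace)
	}
	return nil
}
```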

The full eight layers are on the /features/policy page; the code paths are linked from each row.
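For readers unfamiliar with hash chaining, the tamper-evidence property of the audit layer can be illustrated with a short sketch. It assumes SHA-256 and a two-field event; the real AuditEvent schema and hash function are not given in this post, so every name other than AuditEvent and prev_hash is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// AuditEvent is a pared-down event: Hash covers both the action and PrevHash,
// so editing any earlier event invalidates every later one.
type AuditEvent struct {
	Action   string
	PrevHash string
	Hash     string
}

// hashEvent derives an event hash from the previous hash and the action.
func hashEvent(prevHash, action string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + action))
	return hex.EncodeToString(sum[:])
}

// appendEvent links a new event to the tail of the chain.
func appendEvent(chain []AuditEvent, action string) []AuditEvent {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, AuditEvent{Action: action, PrevHash: prev, Hash: hashEvent(prev, action)})
}

// verifyChain recomputes every link and reports whether the chain is intact.
func verifyChain(chain []AuditEvent) bool {
	prev := ""
	for _, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(e.PrevHash, e.Action) {
			return false
		}
		prev = e.Hash
	}
	return true
}
```

Anyone holding a copy of the latest hash can detect a rewritten history by re-running verifyChain, which is why a compromised downstream sink alone cannot hide a change.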

Live evidence

Yesterday's verification run hit the T0–T3 ladder against Qwen 3.6 35B served by an in-cluster vLLM endpoint:

  • T0 narration — latency under one second; the enrichment lands in the Slack alert body before an operator has finished reading the title.
  • T1 fix proposal — LLM picks an action_kind from the whitelist; the validator's patch_validator.go gates the JSONPath of the proposed mutation before the click-to-fix URL is ever issued.
  • T2 multi-step plan — the planner emits a structured sequence of T1 actions that the operator approves as a set. Each step is still a JWT click; the plan is signed as a unit.
  • T3 vault break-glass — runbook text only. Dual-approval timeline enforced; key names never values; nothing executes server-side.
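The click-to-fix step in T1 and T2 depends on three checks: a valid Ed25519 signature, an unexpired token, and a JTI that has not been redeemed before. The sketch below shows those checks with a deliberately simplified token format. It is not a real JWT and not Srenix's implementation; signToken, Verifier, and newDemoKeys are hypothetical, and the all-zero seed is for demonstration only.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

var b64 = base64.RawURLEncoding

// signToken signs "jti|action|expiry" and returns payload.signature.
// Fields must not contain "|" in this simplified format.
func signToken(priv ed25519.PrivateKey, jti, action string, expUnix int64) string {
	payload := fmt.Sprintf("%s|%s|%d", jti, action, expUnix)
	sig := ed25519.Sign(priv, []byte(payload))
	return b64.EncodeToString([]byte(payload)) + "." + b64.EncodeToString(sig)
}

// Verifier checks signatures and remembers redeemed JTIs.
type Verifier struct {
	pub  ed25519.PublicKey
	seen map[string]bool
}

// Redeem returns the approved action, or an error if the token is malformed,
// forged, expired, or already used.
func (v *Verifier) Redeem(token string, nowUnix int64) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 2 {
		return "", errors.New("malformed token")
	}
	payload, err := b64.DecodeString(parts[0])
	if err != nil {
		return "", errors.New("malformed payload")
	}
	sig, err := b64.DecodeString(parts[1])
	if err != nil {
		return "", errors.New("malformed signature")
	}
	if !ed25519.Verify(v.pub, payload, sig) {
		return "", errors.New("bad signature")
	}
	fields := strings.Split(string(payload), "|")
	if len(fields) != 3 {
		return "", errors.New("malformed claims")
	}
	exp, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return "", errors.New("malformed expiry")
	}
	if nowUnix >= exp {
		return "", errors.New("token expired")
	}
	if v.seen[fields[0]] {
		return "", errors.New("token already redeemed")
	}
	v.seen[fields[0]] = true // record the JTI only after every check passes
	return fields[1], nil
}

// newDemoKeys builds a deterministic key pair from an all-zero seed.
// Demo only: real keys must come from a secure random source.
func newDemoKeys() (ed25519.PrivateKey, *Verifier) {
	priv := ed25519.NewKeyFromSeed(make([]byte, ed25519.SeedSize))
	pub := priv.Public().(ed25519.PublicKey)
	return priv, &Verifier{pub: pub, seen: map[string]bool{}}
}
```

Verifying the signature before parsing the claims means a forged token never reaches the expiry or replay logic, and recording the JTI last means a failed attempt cannot burn a legitimate link.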

The same flow runs against OpenAI's hosted models and Anthropic's Claude family for customers who prefer not to host their own endpoint. The protocol is OpenAI-compatible JSON-mode; the policy envelope is identical regardless of which model is on the other end of the call.

What's next (v1.8)

v1.7.0 closed Workstream B's first three drift classes. v1.8 takes the watcher to a controller-runtime operator (kubebuilder) and adds the remaining three drift classes:

  • Config: ConfigMap hash divergence, CRD version mismatch, Helm-values vs cluster-live.
  • Capacity: HPA min/max divergence, PVC growth trajectory (needs metrics-server).
  • Security: PSS downgrade, image attestation, NetworkPolicy coverage gaps.

It also brings in GCP / Azure cloud probes plus the M2 K8s probe slice (Kong, HPA, ArgoCD, Velero).

The full sequencing lives on the roadmap page; v1.7.0 itself is documented in the CHANGELOG with PR-level granularity. The implementation plan that drove the repositioning is checked into the OSS repo as docs/design/2026-05-ai-sre-positioning.md — the feedback is reproduced verbatim, the workstream allocation is the commit you're reading.


Questions, pushback, or pilot interest: [email protected].