Detect. Remediate. Verify.
Autonomous SRE for Kubernetes. Policy-bounded. Audit-anchored. Bring-your-own-LLM. An in-cluster agent across Kubernetes, AWS, GCP, Azure, and the edge. Everyone else observes; Srenix actually mutates.
$ helm install srenix srenix/agentic-sre
One operator. Every cloud. Every stack you already run.
The autopilot loop, by default.
Five steps. Re-run on every cycle. Closed-loop is the default mode, not a roadmap milestone.
Install
helm install srenix srenix/agentic-sre — works on any conformant K8s 1.27+ (EKS, GKE, AKS, k3s, RKE2, OpenShift).
$ helm repo add srenix https://srenix-ai.github.io/agentic-sre
$ helm install srenix srenix/agentic-sre
$ kubectl get driftreports # cluster-scoped CRD
NAME SEVERITY SOURCE SUBJECT
drift-9c1e04f2a77b3d18 warning StuckCertificateRequests CertificateRequest/kube-system/api-tls
Detect
21 K8s probes (incl. KongRoutes + GPUNodes) + 10 AWS + 10 GCP + 10 Azure probe families. 20 OSS analyzers (drift, log, workload, diagnostic — LogPatternMatcher, OOMKillRecurrence, PVOrphan, CronJobStuck, DisruptionDrift on top of GitOps/Workload/RBAC/Capacity/Security/Config). Three trigger classes (resource events, Prometheus alerts via Alertmanager, external HMAC-authed webhooks).
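To make the third trigger class concrete, here is a minimal Python sketch of HMAC webhook verification. It assumes a hex-encoded HMAC-SHA256 digest over the raw request body; the function names and the signing scheme are illustrative assumptions, not Srenix's documented wire format.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, for illustration only)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Accept the webhook only if the supplied signature matches the body."""
    expected = sign_body(secret, body)
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A sender holding the shared secret would call `sign_body` and attach the result as a header; the receiver recomputes the digest and rejects any body that was altered in transit.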
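Probes
| 21 K8s | 10 AWS | 10 GCP | 10 Azure |
|---|---|---|---|
| Ceph | RDS | CloudSQL | SQL DB |
| Nodes | EBS | GKE | AKS |
| Postgres | EKS | PD | Disk |
| PVCs | IAM | IAM SA | Identity |
| Endpoints | ALB | LB | AppGW |
| NodePr. | ACM | Cert | Backend |
| DaemonSet | KMS | GCS/KMS | Subnet |
| Pending | S3 | Backend | ... |
| CrashLoop | VPC | | |
| ETCD | | | |
| FailedMnt | | | |
| KongRoutes ← M2 | | | |
| GPUNodes ← M3 | | | |

Remediate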
5 policy-bounded fixers run by default. AI-tier fix proposals require human approval via signed click-to-fix URLs, unless they qualify for silent auto-merge at very high confidence (Phase 3.B). Auto-merge requires all of the following: a matching approve-class policy, a verified Ed25519 attestation, a Wilson-bound class success rate at or above the threshold (default 0.95), and a closed circuit breaker. RAG memory is live: Srenix reads prior resolutions before proposing, and the short-circuit is ON by default at similarity ≥ 0.92. Paid tier: opt-in deep RCA grounded in live web research via Firecrawl. The LLM synthesizes a generic technical query, so no namespace, hostname, or secret leaves the cluster, and the resulting RCA is forwarded into every AI tier (T0 → T3).
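The RAG short-circuit can be pictured with a small Python sketch. It assumes that "similarity" means cosine similarity between embedding vectors and that memory is a list of (embedding, resolution) pairs; both are assumptions for illustration, as are all names here. Only the 0.92 default comes from the text above.

```python
import math
from typing import List, Optional, Sequence, Tuple

SHORT_CIRCUIT_THRESHOLD = 0.92  # default similarity quoted in the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_resolution(
    query: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    threshold: float = SHORT_CIRCUIT_THRESHOLD,
) -> Optional[str]:
    """Return the best prior resolution if it clears the threshold, else None."""
    best_score, best_resolution = 0.0, None
    for embedding, resolution in memory:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_score, best_resolution = score, resolution
    return best_resolution if best_score >= threshold else None
```

When `recall_resolution` returns a value, a caller could reuse that resolution instead of asking the LLM for a new proposal; a `None` result means the normal proposal path runs.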
DriftReport StaleErrorPod — fixer ran OK
DriftReport StuckJob — fixer ran OK
DriftReport StuckRS — fixer ran OK
DriftReport StuckCertReq — fixer ran OK
DriftReport TLSSecretMismatch — fixer ran OK
DriftReport SecurityDrift — DigestPin PR
attestation: Ed25519 ✓
auto-merge gate: 5/5 ✓
squash-merged via API ✓
Re-verify in 60s ...
Report
Findings flow to Slack, Alertmanager, OpenProject (OSS), and Jira / ServiceNow (paid). DriftReport CRDs let you kubectl get your cluster’s drift state.
$ kubectl get driftreports # columns: SEVERITY SOURCE SUBJECT LAST SEEN COUNT TICKET
NAME SEVERITY SOURCE SUBJECT COUNT TICKET
drift-9c1e04f2a77b3d18 critical SecretKeyMissing Secret/mcp/openproject-url 4 WP-1287
drift-4f2a77b3d189c1e0f warning CronJobStuck CronJob/ai/nightly-index 2
# active findings have a CR; cleared findings are deleted on the next cycle
Verify
Re-diagnose after every fix. No "the fix maybe worked" — Srenix actively re-checks and closes the loop.
diagnose → fix → re-diagnose → resolve
↑ |
+--------+
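The loop in the diagram can be sketched in a few lines of Python. The `diagnose` and `fix` callables, the `max_attempts` cap, and the "escalate" outcome are hypothetical stand-ins chosen for illustration; the source only states that every fix is followed by a re-diagnosis.

```python
from typing import Callable, Optional


def closed_loop(
    diagnose: Callable[[], Optional[str]],
    fix: Callable[[str], None],
    max_attempts: int = 3,
) -> str:
    """Run diagnose -> fix -> re-diagnose until the finding clears or attempts run out."""
    finding = diagnose()
    attempts = 0
    while finding is not None and attempts < max_attempts:
        fix(finding)
        attempts += 1
        finding = diagnose()  # re-diagnose: never assume the fix worked
    return "resolved" if finding is None else "escalate"
```

The design point is that success is decided by the second diagnosis, not by the fixer's own exit status.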
Closed-loop is the DEFAULT, not a roadmap milestone.
Why Srenix, structurally.
Three architectural commitments that competitors cannot copy without rewriting their product.
An agent that actually mutates
Komodor, Robusta, Causely, Resolve — they observe and summarise. Srenix Enterprise proposes an action, signs it as a JWT, and (with one operator click) executes it. Every action lands inside the operator-defined policy: which action_kinds, which namespaces, which resources. The agent has reasoning power; the policy has the leash.
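As a rough illustration of the leash, the Python sketch below checks a proposed action against a policy made of three allow-lists. Only `action_kinds`, namespaces, and resources are named in the text; the class shapes, field names, and example values are assumptions, not the actual Srenix policy schema.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "RestartDeployment" (hypothetical action kind)
    namespace: str
    resource: str    # e.g. "Deployment"


@dataclass(frozen=True)
class Policy:
    action_kinds: FrozenSet[str]
    namespaces: FrozenSet[str]
    resources: FrozenSet[str]

    def permits(self, action: Action) -> bool:
        """An action is allowed only if all three dimensions are allow-listed."""
        return (
            action.kind in self.action_kinds
            and action.namespace in self.namespaces
            and action.resource in self.resources
        )
```

Under this sketch, an action that names an allow-listed kind but targets a namespace outside the policy is rejected before any signing or execution step.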
In-cluster + bring-your-own-LLM
No SaaS. No vendor LLM lock-in. Point Srenix Enterprise at any OpenAI-compatible endpoint — your in-cluster vLLM, an Azure OpenAI deployment, your own gateway. Cluster data and prompts never leave your perimeter.
Open core, audit-anchored
OSS engine is Apache-2.0 — 21 K8s probes, 10 AWS + 10 GCP + 10 Azure cloud probe families, 20 OSS analyzers (drift, log, workload, diagnostic), 5 policy-bounded fixers. Srenix Enterprise paid tier adds the LLM Investigator agent + the T0–T3 AI SRE flow + Phase 2 closure (HA aiwatch via leader-election, Prometheus instrumentation, cosign-style PR attestation) + Phase 3 (auto-merge gate, target-history RAG grounding, SOC2 audit-bundle exporter). Every AI action is JWT-signed, hash-chained, replayable.
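"Hash-chained" can be illustrated with a short Python sketch in which every audit entry commits to the hash of the one before it, so editing any earlier entry breaks verification. The entry layout, the genesis value, and the use of SHA-256 over canonical JSON are assumptions for illustration; the source does not describe the actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True)  # canonical form for hashing
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append_entry(log: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "event": event, "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Replay the chain from the start; any edited entry changes a hash."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Replaying the log with `verify_chain` is what makes tampering detectable: changing one event invalidates its own hash and every link after it.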
Compliance-ready audit bundle
srenix-enterprise audit-bundle --since 30d --output bundle.tar.gz produces a SOC2-friendly evidence pack with manifest.json (versions + SHA-256 of each file), audit.jsonl (verbatim copy of the JSONL audit log: every approval click + auto-apply + LLM call + verifier result), and outcomes.jsonl (every RAG outcome within --since). Local-only — no network egress. Shipped (v0.1.0-alpha.1).
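The manifest described above (versions plus a SHA-256 per file) could look roughly like the output of this Python sketch. The JSON field names and the in-memory `files` mapping are assumptions for illustration; the real manifest.json layout is not specified in the text.

```python
import hashlib
import json
from typing import Dict


def build_manifest(files: Dict[str, bytes], versions: Dict[str, str]) -> str:
    """Return a JSON manifest listing versions and the SHA-256 of each file."""
    manifest = {
        "versions": versions,
        "files": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(files.items())
        },
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

An auditor can recompute each digest from the unpacked bundle and compare it with the manifest to confirm that no evidence file changed after export.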
Auto-merge at very-high confidence
When the Phase 2.B "approve+remember class" policy matches AND the Phase 2.H Ed25519 attestation verifies AND the Phase 2.C Wilson-bound class success-rate clears the operator-set threshold (default 0.95) AND the circuit breaker is closed, freshly-opened DigestPin PRs auto-merge via the Forge API without a human click. Closes the "incidents resolved without paging a human" promise. Shipped (v0.1.0-alpha.1).
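The four-way gate can be sketched in Python as follows. This assumes "Wilson-bound" refers to the lower bound of the Wilson score interval on the class success rate, computed at z = 1.96 (roughly 95% confidence); that reading, the `GateInputs` shape, and all names are assumptions. Only the four conditions and the 0.95 default come from the text.

```python
import math
from dataclasses import dataclass


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success proportion."""
    if trials == 0:
        return 0.0  # no history means no confidence
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)


@dataclass
class GateInputs:
    policy_matches: bool        # approve+remember class policy matched
    attestation_verified: bool  # Ed25519 attestation checked elsewhere
    successes: int              # past successful fixes in this class
    trials: int                 # past attempts in this class
    breaker_closed: bool        # circuit breaker state


def may_auto_merge(g: GateInputs, threshold: float = 0.95) -> bool:
    """All four conditions must hold; any single failure requires a human click."""
    return (
        g.policy_matches
        and g.attestation_verified
        and wilson_lower_bound(g.successes, g.trials) >= threshold
        and g.breaker_closed
    )
```

Under these assumptions the bound is deliberately conservative: a perfect 50-for-50 record yields a lower bound of about 0.93 and does not clear 0.95, while 100-for-100 yields about 0.96 and does.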
vs. the competition.
Detect-fix-verify is the default loop. Every other player is a copilot for the on-call rotation you already have.
| Product | Where it runs | Closed-loop? | Pricing |
|---|---|---|---|
| Srenix (us) | In-cluster operator | Yes, by default | Flat per-cluster (OSS / Team / Enterprise) |
| NeuBird | SaaS, pulls telemetry | No — "architecturally enforced read-only" | $15–25 per investigation |
| Resolve AI | SaaS + thin Satellite | Roadmap (their words: "next milestone") | Contact sales |
| Ciroos | SaaS, zero-copy queries | Opaque "autonomy slider" | Contact sales |
| OpenSRE (Tracer) | Customer-hosted (docker-compose) | No — code-blocks mutations | OSS only |
On-call should be quieter every week.
Srenix is how you get there. Helm install in 5 minutes. No telemetry exfiltration. No per-investigation surprises.