Helm Reference

Every values.yaml option, generated from chart v0.1.0-alpha.1 (synced 2026-06-26). Full chart source at charts/agentic-sre/ in the GitHub repo.

Install with defaults first: most options have safe defaults. Add cloud probes, ticketing, and AI tiers incrementally as you need them. Because this reference is generated directly from the chart's values.yaml, every key listed here exists in the chart.

image

Key Default Description
image.repository docker4zerocool/agentic-sre Docker Hub is the canonical publish target. docker4zerocool is the org's operational registry (same account that hosts every ai-* and mcp-* image the platform depends on); no extra pull-secret config needed for operators who already use it. GHCR is published as a mirror by…
image.tag "" default = .Chart.AppVersion (e.g. v0.1.0-alpha.1)
image.pullPolicy "" pullPolicy is left empty so the `srenix.pullPolicy` helper picks it automatically: Always for mutable tags (latest / main / dev / `*-latest`) and IfNotPresent for semver-pinned tags. Override only when you specifically need to force one or the other.
image.pullSecrets [] e.g. [{name: dockerhub-secret}]
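
As a minimal sketch of the image keys above, the fragment below pins a semver tag and adds a pull secret. The secret name is the example from the table and is assumed to already exist in the install namespace. `pullPolicy` is left unset, so per the description above the `srenix.pullPolicy` helper should pick IfNotPresent for the pinned tag.

```yaml
# Illustrative values override; file name and secret name are assumptions.
image:
  repository: docker4zerocool/agentic-sre   # chart default
  tag: "v0.1.0-alpha.1"                     # semver-pinned tag
  pullSecrets:
    - name: dockerhub-secret                # example name from the table
```

A fragment like this would typically be passed with `helm upgrade --install <release> charts/agentic-sre -f <file>`, where the release and file names are yours to choose.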

diagnose

Diagnose CronJob — always enabled.

Key Default Description
diagnose.enabled true
diagnose.schedule 0 9 * * * Daily 09:00 UTC. Same default as the bash version.
diagnose.successfulJobsHistoryLimit 3
diagnose.failedJobsHistoryLimit 3
diagnose.concurrencyPolicy Forbid
diagnose.backoffLimit 1 backoffLimit caps how many times a failed pod within the Job is retried before the Job itself is marked Failed. K8s default is 6, which compounds with the per-pod default of 5 minutes of backoff — a hung run can keep spawning pods for over an hour. Srenix's diagnose is…
diagnose.activeDeadlineSeconds 300 activeDeadlineSeconds caps the Job's total runtime (counted across pod restarts). Raised 120 → 300 in v1.8.1: the v1.8 analyzer + M2-probe set adds a meaningful number of cluster List calls (CRDs, HPAs, all namespaces + pods + NetworkPolicies, Kong/Velero/Argo CRDs), and a live…
diagnose.format daily daily | text | json — daily posts to #healthinfo; text/json for ad-hoc runs
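
A short sketch of overriding the diagnose schedule while keeping the retry and deadline defaults explicit. The weekday schedule is an arbitrary illustration, not a recommendation from the chart. The schedule is quoted because a bare `*` is not a valid plain YAML scalar.

```yaml
# Illustrative only: run diagnose at 06:00 on weekdays instead of daily 09:00.
diagnose:
  schedule: "0 6 * * 1-5"
  backoffLimit: 1              # chart default
  activeDeadlineSeconds: 300   # chart default
  format: daily                # daily | text | json
```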

remediation

Remediate CronJob — opt-in.

Key Default Description
remediation.enabled false
remediation.schedule */30 * * * * Every 30 min. Off by default — turn on once your team trusts the fixers.
remediation.successfulJobsHistoryLimit 3
remediation.failedJobsHistoryLimit 3
remediation.concurrencyPolicy Forbid
remediation.backoffLimit 1 See diagnose.backoffLimit comment. Remediation Jobs mutate cluster state so we especially do NOT want them retrying a hung-mid-mutation run.
remediation.activeDeadlineSeconds 300 See diagnose.activeDeadlineSeconds comment. Remediation runs the whitelisted fixers AND a full re-probe (which now includes the v1.8 analyzer + M2-probe set), so it inherits the same ~157s read cost. Raised 120 → 300 in v1.8.1 to match diagnose.
remediation.dryRun false When true, fixers report Refused without mutating cluster state.
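
One cautious way to adopt the Remediate CronJob, sketched under the assumption that you want to observe fixer decisions before allowing mutation: enable it with `dryRun: true`, then flip `dryRun` back to false once the reported actions look right.

```yaml
# Illustrative opt-in; keys and defaults come from the table above.
remediation:
  enabled: true
  schedule: "*/30 * * * *"   # chart default, every 30 min
  dryRun: true               # fixers report Refused without mutating cluster state
```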

slack

Slack three-channel routing. Each channel maps to a dedicated incoming-webhook Secret. All URLs must come from pre-existing Secrets in the install namespace; prefer wiring them via External Secrets Operator from Vault (see SECURITY.md).
alerts → #ceph-alerts: event-driven, Srenix acted (auto-fixed issues)
critical → #ceph-critical: event-driven, human action required
healthinfo → #healthinfo:…

Key Default Description
slack.alerts.enabled false
slack.alerts.secretName "" e.g. "srenix-slack-ceph-alerts"
slack.alerts.secretKey WEBHOOK_URL
slack.critical.enabled false
slack.critical.secretName "" e.g. "srenix-slack-ceph-critical"
slack.critical.secretKey WEBHOOK_URL
slack.healthinfo.enabled false
slack.healthinfo.secretName "" e.g. "srenix-slack-healthinfo"
slack.healthinfo.secretKey WEBHOOK_URL
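
The three-channel routing can be sketched as the fragment below. The Secret names are the examples from the table and are assumed to already exist in the install namespace, each holding its webhook URL under the `WEBHOOK_URL` key.

```yaml
# Illustrative wiring; Secret names are the table's examples, not required names.
slack:
  alerts:
    enabled: true
    secretName: srenix-slack-ceph-alerts
    secretKey: WEBHOOK_URL
  critical:
    enabled: true
    secretName: srenix-slack-ceph-critical
    secretKey: WEBHOOK_URL
  healthinfo:
    enabled: true
    secretName: srenix-slack-healthinfo
    secretKey: WEBHOOK_URL
```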

driftReport

DriftReport CRD: kubectl-queryable diagnostic objects. Each cron tick upserts one DriftReport per active diagnostic; CRs whose subject is no longer reported get deleted automatically. Inspect with `kubectl get driftreports -A`. Disable if you don't want CRD-shaped output (Slack + JSON still work).

Key Default Description
driftReport.enabled true

resolutionRecord

ResolutionRecord CRD: append-only outcome log (one CR per applied + verified remediation). The durable system-of-record that the RAG memory layer embeds and retrieves. Inspect with `kubectl get resolutionrecords -l srenix.ai/verified=cleared`.

Key Default Description
resolutionRecord.enabled true

silence

Silence — operator-controlled noise suppression. When `enabled`, the watcher fetches active Silence CRs once per cycle and drops matched diagnostics before downstream emission (DriftReport / Slack / Alertmanager / ticketing). Operators create Silences with `kubectl create -f silence.yaml` in any namespace; the matching is cluster-wide. Defaults to ON for new installs because installing the CRD on…

Key Default Description
silence.installCRD true ship the CRD via the chart
silence.enabled true provision the watcher's ClusterRole + binding

operator

Operator (Phase 1) — ships the AgenticSRE CRD shape only. The srenix-operator binary that reconciles AgenticSRE resources into the watcher Deployment / CronJobs / ServiceAccount lands in Phase 2 (next release). Installing the CRD alone is harmless: the resource is just queryable state. Default ON for new installs so the CRD is in place by the time the operator binary ships and clusters that opt…

Key Default Description
operator.installCRD true
operator.enabled false Phase 1b (v1.8): the srenix-operator controller-runtime manager. Default OFF: existing chart-managed installs continue to work unchanged. Flip to true to deploy the operator binary; the operator only takes over resources named by an AgenticSRE CR (which operators create…
operator.replicas 1 Operator replicas. 1 is sufficient because the manager uses lease-based leader-election; additional replicas stand by.
operator.resources {}

vaultProbe

Vault-probe — closes the L1 stale-Ready window: queries Vault directly to verify each ExternalSecret's referenced path + property still exists, BEFORE the ESO controller's next refresh marks itself not-Ready. Catches the case where someone edits Vault but pods stay alive on cached Secret data and the next pod restart fails with CreateContainerConfigError. Privacy contract: srenix never reads…

Key Default Description
vaultProbe.enabled false
vaultProbe.address "" e.g. "https://vault.svc.cluster.local:8200"
vaultProbe.kvMount secret KV-v2 mount path; ESO `data[].remoteRef.key` is mount-relative
vaultProbe.auth.method kubernetes kubernetes | token
vaultProbe.auth.role "" required when method=kubernetes; the Vault role bound to the srenix SA
vaultProbe.auth.tokenSecretRef.name "" K8s Secret holding a Vault token
vaultProbe.auth.tokenSecretRef.key token key within that Secret
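
A minimal sketch of enabling the Vault probe with Kubernetes auth. The address is the example from the table; the role name is hypothetical and must match a Vault role that is bound to the srenix ServiceAccount.

```yaml
# Illustrative only; `srenix-vault-probe` is an assumed role name.
vaultProbe:
  enabled: true
  address: "https://vault.svc.cluster.local:8200"
  kvMount: secret              # KV-v2 mount path (chart default)
  auth:
    method: kubernetes         # kubernetes | token
    role: srenix-vault-probe   # required when method=kubernetes
```

With `method: token`, the `auth.tokenSecretRef.name` and `auth.tokenSecretRef.key` fields would be set instead of `role`.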

alertmanager

Alertmanager integration — Kubernetes-native event routing. When enabled, Srenix posts the full active-issue state to Alertmanager each watcher cycle. Alertmanager handles: dedup, grouping, silencing, repeat intervals, and fan-out to all configured receivers (Slack, PagerDuty, Teams, email, webhook, …). This is the preferred model over direct Slack webhooks; the slack.alerts/critical fields…

Key Default Description
alertmanager.enabled false
alertmanager.url "" e.g. "http://alertmanager.pg.svc.cluster.local:9093"
alertmanager.clusterName cluster identifies this cluster in alert labels
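
For illustration, an Alertmanager hookup might look like the fragment below. The URL is the example from the table, and the cluster name is a hypothetical label value.

```yaml
# Illustrative only; `prod-east` is an assumed cluster name.
alertmanager:
  enabled: true
  url: "http://alertmanager.pg.svc.cluster.local:9093"
  clusterName: prod-east   # appears in alert labels
```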

cloud

Cloud probe framework — observe AWS / GCP / Azure resources alongside K8s state. Each provider is independently toggleable. Cloud probes do NOT fire on K8s event triggers — they run on `cadence` (default 10m) to protect cloud-API rate limits. Master switch: `cloud.enabled=false` disables EVERYTHING below and pays zero overhead (no SDK init, no probe registration, no extra RBAC). Required for…

Key Default Description
cloud.enabled false
cloud.cadence 10m min interval between cloud-probe runs
cloud.aws.enabled false
cloud.aws.region "" required; e.g. "us-east-1"
cloud.aws.auth.roleArn "" ARN of the IAM role; auto-injected as eks.amazonaws.com/role-arn SA annotation
cloud.aws.probes.rds true
cloud.aws.probes.ebs true
cloud.aws.probes.eks true
cloud.aws.probes.iam true
cloud.aws.probes.alb true
cloud.aws.probes.acm true
cloud.aws.probes.kms true
cloud.aws.probes.s3 true
cloud.aws.probes.vpc true
cloud.gcp.enabled false GCP cloud probes shipped in v1.8 (M2): Cloud SQL, Persistent Disks, GKE control-plane + node pools, IAM service accounts, Subnets, LB backends, managed certs, GCS public access, KMS. NOTE: Cloud SQL storage-% is fetched via the Cloud Monitoring API (best-effort; "not measured"…
cloud.gcp.project ""
cloud.gcp.auth.serviceAccount "" GSA email; auto-injected as iam.gke.io/gcp-service-account SA annotation
cloud.gcp.probes.cloudsql true
cloud.gcp.probes.disks true
cloud.gcp.probes.gke true
cloud.gcp.probes.iam true
cloud.gcp.probes.subnets true
cloud.gcp.probes.lb true
cloud.gcp.probes.certs true
cloud.gcp.probes.gcs true
cloud.gcp.probes.kms true
cloud.gcp.subnetsSmallPrefixThreshold 0 Small-prefix threshold for the capacity-only subnets probe: an unmeasured subnet whose primary CIDR is smaller than /<threshold> is flagged as a warning. 0 = the probe default (/26 → 60 usable IPs). Raise (e.g. 28) to quiet intentionally tiny subnets, lower to be stricter.…
cloud.azure.enabled false Azure cloud probes shipped in v1.8 (M2): SQL databases, Disks, AKS control-plane + node pools, Managed Identities, Subnets, App Gateway backends, certs, Storage public access, Key Vaults. NOTE: SQL storage-% and App Gateway backend health are fetched via Azure Monitor metrics…
cloud.azure.subscriptionId ""
cloud.azure.resourceGroup "" optional scope; empty = subscription-wide
cloud.azure.auth.clientId "" AAD app client ID; auto-injected as azure.workload.identity/client-id SA annotation
cloud.azure.probes.sql true
cloud.azure.probes.disks true
cloud.azure.probes.aks true
cloud.azure.probes.identities true
cloud.azure.probes.subnets true
cloud.azure.probes.appgw true
cloud.azure.probes.certs true
cloud.azure.probes.storage true
cloud.azure.probes.keyvaults true
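
The per-provider toggles can be sketched as follows: the master switch plus AWS, with two probes turned off. The role ARN is a placeholder. Because Helm deep-merges map values, probes not listed here keep their chart default of true.

```yaml
# Illustrative only; the account ID and role name are placeholders.
cloud:
  enabled: true            # master switch for everything below
  cadence: 10m             # chart default
  aws:
    enabled: true
    region: us-east-1      # required
    auth:
      roleArn: "arn:aws:iam::123456789012:role/srenix-readonly"
    probes:
      iam: false           # skip IAM checks on this cluster
      s3: false            # skip S3 checks on this cluster
```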

ticketing

Ticketing — open issue-tracker tickets for diagnostics Srenix cannot auto-remediate. Runs after each watcher cycle's DriftReport reconcile; the resulting ticket key is persisted onto DriftReport.status.ticket so subsequent cycles know not to re-open the same ticket. Sink failures NEVER abort the cycle — logged and skipped, same posture as Slack/Alertmanager. OSS ships the OpenProject sink…

Key Default Description
ticketing.enabled false
ticketing.provider openproject openproject | jira | servicenow
ticketing.cluster cluster identifies this cluster in ticket bodies; matches alertmanager.clusterName
ticketing.labels [srenix, auto-filed]
ticketing.mcpURL http://mcp-openproject-server.mcp.svc:8006/mcp FLAT shape — matches the operator CRD's `spec.ticketing.*` so YAML can move between Helm values and `kubectl patch srenix …` without reshaping. The legacy nested `ticketing.openproject.*` block below is the v1.19.x shape and is honored by the chart template as a fallback ONLY…
ticketing.project "" e.g. "6" for the Demo project
ticketing.typeID "" e.g. "36" for Task — REQUIRED
ticketing.closedStatusID "" e.g. "82" for Closed status — needed for resolve-on-clear
ticketing.webURLPrefix "" e.g. "https://op.example.com" — used to build operator-clickable URLs
ticketing.severityPriority.critical "" e.g. "75" for Immediate
ticketing.severityPriority.warning "" e.g. "74" for High
ticketing.severityPriority.info "" e.g. "73" for Normal
ticketing.dryRun false Log intended ops without calling the MCP server
ticketing.resolveOnClear true Auto-close the ticket when its finding clears. Default ON (M2 shipped).
ticketing.commentInterval 1h Debounce for comment-on-recurrence: at most one comment per window. A recurring or severity-changed finding comments on the EXISTING ticket instead of opening a new one. "0" disables recurrence comments.
ticketing.auth.enabled false
ticketing.auth.secretName srenix-ticketing-mcp K8s Secret with the API key
ticketing.auth.secretKey api-key Key inside the Secret
ticketing.route "" SRENIX_TICKETING_ROUTE — routes a finding to a sink (provider above selects the default sink).
ticketing.jira.url "" SRENIX_JIRA_URL
ticketing.jira.project "" SRENIX_JIRA_PROJECT (project key)
ticketing.jira.email "" SRENIX_JIRA_EMAIL (token-auth account email)
ticketing.jira.issueType "" SRENIX_JIRA_ISSUE_TYPE (e.g. "Bug")
ticketing.jira.priority.critical "" SRENIX_JIRA_PRIORITY_CRITICAL
ticketing.jira.priority.warning "" SRENIX_JIRA_PRIORITY_WARNING
ticketing.jira.priority.info "" SRENIX_JIRA_PRIORITY_INFO
ticketing.jira.webUrlBase "" SRENIX_JIRA_WEB_URL_BASE (clickable ticket URLs)
ticketing.jira.tokenSecret.name "" K8s Secret name (ESO-synced)
ticketing.jira.tokenSecret.key "" key inside the Secret
ticketing.servicenow.url "" SRENIX_SERVICENOW_URL
ticketing.servicenow.user "" SRENIX_SERVICENOW_USER (basic-auth username)
ticketing.servicenow.urgency.critical "" SRENIX_SERVICENOW_URGENCY_CRITICAL
ticketing.servicenow.urgency.warning "" SRENIX_SERVICENOW_URGENCY_WARNING
ticketing.servicenow.urgency.info "" SRENIX_SERVICENOW_URGENCY_INFO
ticketing.servicenow.impact.critical "" SRENIX_SERVICENOW_IMPACT_CRITICAL
ticketing.servicenow.impact.warning "" SRENIX_SERVICENOW_IMPACT_WARNING
ticketing.servicenow.impact.info "" SRENIX_SERVICENOW_IMPACT_INFO
ticketing.servicenow.webUrlBase "" SRENIX_SERVICENOW_WEB_URL_BASE (clickable ticket URLs)
ticketing.servicenow.passwordSecret.name "" SRENIX_SERVICENOW_PASSWORD — secretKeyRef ONLY
ticketing.servicenow.passwordSecret.key ""
ticketing.servicenow.bearerSecret.name "" SRENIX_SERVICENOW_BEARER — secretKeyRef ONLY
ticketing.servicenow.bearerSecret.key ""
ticketing.openproject {} DEPRECATED (v1.20.0+): nested per-provider sub-trees. Honored only when the equivalent flat field above is unset. New installs should use the flat shape exclusively. Will be removed in the next major chart bump.
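
A sketch of the flat ticketing shape for the OpenProject sink, using the example IDs quoted in the table. Your instance's project, type, status, and priority IDs will differ, and the cluster name is hypothetical. `dryRun: true` is set here so intended operations are only logged on a first rollout.

```yaml
# Illustrative only; every ID below is the table's example, not a universal value.
ticketing:
  enabled: true
  provider: openproject
  cluster: prod-east                  # assumed; should match alertmanager.clusterName
  mcpURL: http://mcp-openproject-server.mcp.svc:8006/mcp
  project: "6"
  typeID: "36"                        # REQUIRED
  closedStatusID: "82"                # needed for resolve-on-clear
  webURLPrefix: "https://op.example.com"
  severityPriority:
    critical: "75"
    warning: "74"
    info: "73"
  dryRun: true                        # log intended ops without calling the MCP server
```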

watcher

Watcher — event-driven, long-running Deployment (Phase 1). Replaces polling latency (CronJob tick) with near-instant reaction to Kubernetes watch events. Debounces burst updates before re-running the full probe+analyzer stack. Slack dedup: only new/changed/resolved diagnostics produce a post. The seen-map is seeded from DriftReport CRs on startup so a pod restart does not re-flood Slack with all…

Key Default Description
watcher.enabled false
watcher.replicas 1 replicas — watcher pod count. Default 1. Raising above 1 is ONLY safe with leaderElection.enabled=true (the SRENIX_LEADER_ELECTION path): otherwise every replica runs the probe/fix/post cycle and they race on DriftReports + double-post Slack. The chart FAILS the render when…
watcher.healthListen :8081 healthListen — listen address for the always-on health server. GET /healthz returns 200 while the process is alive (independent of the webhook receiver), and the Deployment's liveness/readiness probes target it. The port number is derived from this address.
watcher.debounce 10s Debounce window after a Kubernetes event.
watcher.resyncPeriod 10m Periodic full re-diagnose regardless of events.
watcher.extraEnv [] e.g. [{name: SRENIX_CRITICAL_SERVICES, value: "pg/postgres,vault/vault"}, {name: SRENIX_K3S_SINGLE_NODE_OK, value: "true"}]
watcher.slack.postOnResolved true Post when a diagnostic disappears.
watcher.slack.repeatInterval 4h Re-post still-active warning/info at this cadence (0=never).
watcher.slack.criticalRepeatInterval "" Re-post still-active CRITICAL at this cadence (empty = fall back to repeatInterval). Use to keep criticals loud (e.g. "4h") while letting warnings calm down (e.g. set repeatInterval=24h).
watcher.remedy.enabled false Run auto-fixers after each cycle (live mutation).
watcher.remedy.dryRun false Evaluate fixers without mutating cluster state.
watcher.leaderElection.enabled true
watcher.leaderElection.leaseName srenix-watcher
watcher.leaderElection.leaseDuration 30s
watcher.leaderElection.renewDeadline 20s
watcher.leaderElection.retryPeriod 5s
watcher.triggers.prom.url "" e.g. "http://alertmanager.pg.svc.cluster.local:9093"
watcher.triggers.prom.interval 30s clamped to ≥5s by the trigger client
watcher.triggers.prom.alertNameFilter [] e.g. ["DiskFillUp","CertExpiringSoon"]; empty = any firing alert
watcher.triggers.webhook.listen "" e.g. ":8090" — empty disables receiver
watcher.triggers.webhook.sources [] e.g. ["vault=SRENIX_WEBHOOK_VAULT_SECRET","cert-manager=SRENIX_WEBHOOK_CM_SECRET"]
watcher.triggers.webhook.service.enabled false render a ClusterIP Service for the receiver
watcher.triggers.webhook.service.port 8090
watcher.triggers.webhook.secretName "" e.g. "srenix-webhook-secrets"
watcher.resources.limits.cpu 500m
watcher.resources.limits.memory 256Mi
watcher.resources.requests.cpu 50m
watcher.resources.requests.memory 64Mi
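
As a hedged example of the watcher settings above: two replicas with leader election left on (the table notes that more than one replica is only safe with `leaderElection.enabled=true`), criticals re-posted every 4h while warnings calm down to 24h, and one extra environment variable. The interval values are the ones suggested in the descriptions, not tuned recommendations.

```yaml
# Illustrative only; adjust intervals and env to your cluster.
watcher:
  enabled: true
  replicas: 2
  leaderElection:
    enabled: true               # chart default
  debounce: 10s                 # chart default
  resyncPeriod: 10m             # chart default
  slack:
    repeatInterval: 24h         # warning/info re-post cadence
    criticalRepeatInterval: 4h  # keep criticals loud
  extraEnv:
    - name: SRENIX_CRITICAL_SERVICES
      value: "pg/postgres,vault/vault"
```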

runner

GitHub Actions self-hosted runner — Mode A of the WS-C publish pipeline. Deploys myoung34/github-runner inside the cluster so the nightly publish-runs workflow can call `srenix diagnose --live` directly without needing an internet-reachable MinIO endpoint. Prerequisites: 1. A GitHub PAT (classic, repo scope) or fine-grained token (Actions:write) stored in Vault at: secret/t6-apps/srenix/config →…

Key Default Description
runner.enabled false
runner.repoUrl https://github.com/srenix-ai/agentic-sre
runner.labels self-hosted,cluster must match runs-on in publish-runs.yml
runner.name srenix-cluster-runner
runner.image myoung34/github-runner:ubuntu-jammy Runner image — ubuntu-jammy = Ubuntu 22.04; ubuntu-noble = 24.04
runner.tokenSecretName srenix-runner-token Secret that holds ACCESS_TOKEN for runner registration. Created by the ExternalSecret below — do not create manually.
runner.tokenSecretKey ACCESS_TOKEN
runner.resources.limits.cpu 2
runner.resources.limits.memory 4Gi
runner.resources.requests.cpu 250m
runner.resources.requests.memory 512Mi
runner.nodeSelector {} Runner pod runs as root (required by myoung34/github-runner). It does NOT use the shared podSecurityContext.
runner.tolerations []

rbac

RBAC. Keep enabled — the CronJob will fail without these.

Key Default Description
rbac.create true
rbac.reader.name "" default: <release>-reader
rbac.remediator.name "" default: <release>-remediator

serviceAccount

Key Default Description
serviceAccount.create true
serviceAccount.name "" default: <release>-sa
serviceAccount.annotations {}

resources

Resource requests / limits (per CronJob pod).

Key Default Description
resources.limits.cpu 500m
resources.limits.memory 256Mi
resources.requests.cpu 50m
resources.requests.memory 64Mi

nodeSelector

Pod-level scheduling controls.

Key Default Description
nodeSelector {}

tolerations

Key Default Description
tolerations []

affinity

Key Default Description
affinity {}

priorityClassName

Key Default Description
priorityClassName ""

podSecurityContext

Pod / container security context.

Key Default Description
podSecurityContext.runAsNonRoot true
podSecurityContext.runAsUser 65532
podSecurityContext.fsGroup 65532
podSecurityContext.seccompProfile.type RuntimeDefault

securityContext

Key Default Description
securityContext.allowPrivilegeEscalation false
securityContext.readOnlyRootFilesystem true
securityContext.capabilities.drop [ALL]

ai

AI tier (commercial / Srenix Enterprise) — recommendation-only AI for narration, fix proposals, multi-step plans, and Vault recovery runbooks. Every tier gates mutation behind human one-click approval. See docs/AI_TIERS.md and docs/DEPLOYMENT.md. DEPLOYMENT MODEL — purely additive. Setting ai.enabled=true does NOT touch the OSS watcher / diagnose / remediate workloads; they keep running the OSS…

Key Default Description
ai.enabled false
ai.tier t0 t0 (narration) | t1 (fix proposals) | t2 (planner) | t3 (vault runbooks)
ai.endpoint "" REQUIRED if enabled; OpenAI-compatible base URL, e.g. "https://mcp.baisoln.com/gpu-ai/v1"
ai.model "" REQUIRED if enabled; e.g. "qwen3.6-35b-a3b-fp8"
ai.interval 60s poll cadence; AI tiers fire only on NEW diagnostics each cycle (natural LLM-cost cap)
ai.replicas 1 Phase 2.F — HA aiwatch via leader-election. Default 1 (single-replica noop path; byte-identical to pre-2.F). When >1, the chart turns on --leader-election=true and binds the SA to the Lease Role; exactly one replica runs tick() at a time, failover within ~30s on lease loss.
ai.digestPinAttestation.secretName "" e.g. "srenix-digest-pin-attestation-key"
ai.digestPinAttestation.secretKey attestation.key
ai.digestPinAttestation.keyID srenix-digest-pin
ai.metrics.addr "" e.g. ":9090" to enable
ai.metrics.port 9090 container/service port (must match the :NNNN in addr)
ai.metrics.serviceMonitor.enabled false set true when prometheus-operator is installed
ai.metrics.serviceMonitor.interval 30s scrape interval
ai.metrics.serviceMonitor.scrapeTimeout 10s
ai.metrics.grafanaDashboard.enabled false When true, ship a ConfigMap with the Srenix overview dashboard tagged for kube-prometheus-stack's sidecar discovery.
ai.metrics.grafanaDashboard.extraLabels {} e.g. {"app.kubernetes.io/instance": "monitoring"} for non-default Grafana
ai.metrics.prometheusRule.enabled false When true, ship a PrometheusRule with the canary alerts: ChaWatcherStuck, ChaBreakerOpen, ChaAutonomyRejectionSpike.
ai.metrics.prometheusRule.labels {} e.g. {"prometheus": "k8s"} when prometheus-operator uses non-default selectors
ai.allowSaas false set true to allow api.openai.com / api.anthropic.com endpoints
ai.llmFixerMatcher false t1+: use the LLM-classified fixer matcher (falls back to keyword on error)
ai.auditLog "" AI-event audit sink: "" (off) | "-" (stdout) | "/path/to.jsonl". Set for t1+ compliance.
ai.approvalServerUrl "" t1+: base URL of the approval-server (e.g. https://srenix-approve.example.com). When set, T1/T2 proposals emit a signed click-to-fix link. Pair with approval.enabled + approval.ingress.host.
ai.image.repository docker4zerocool/srenix-enterprise The commercial Srenix Enterprise image for the aiwatch Deployment. Tag defaults to "v<AppVersion>" (srenix-enterprise images carry a leading "v").
ai.image.tag "" default = "v" + .Chart.AppVersion (e.g. v0.1.0-alpha.1)
ai.image.pullPolicy IfNotPresent
ai.resources {} defaults to top-level resources if unset
ai.apiKey.secretName "" K8s secret holding the LLM bearer token (ESO-managed); empty = no-auth in-cluster vLLM
ai.apiKey.secretKey API_KEY key within the secret
ai.apiKey.envName AI_API_KEY env var the binary reads (matches --ai-api-key-env default)
ai.apiKey.header "" HTTP header for the key; "" = "Authorization: Bearer"; set "X-API-Key" for Kong key-auth
ai.t3.vaultAllowedPrefixes [] e.g. ["secret/data/srenix-recovery/"]
ai.memory.enabled false
ai.memory.image.repository qdrant/qdrant
ai.memory.image.tag v1.12.4
ai.memory.image.pullPolicy IfNotPresent
ai.memory.storage.size 5Gi
ai.memory.storage.className "" default storage class if empty (use nfs-client/cephfs on mixed GPU nodes)
ai.memory.resources {}
ai.memory.embeddings.endpoint https://mcp.baisoln.com/gpu-ai/v1 OpenAI-compatible /embeddings
ai.memory.embeddings.model qwen3-embedding-0.6b
ai.memory.storeUrl "" default: http://<release>-rag.<ns>.svc:6333
ai.memory.topK 5 how many prior resolutions to retrieve per finding
ai.rateLimit.actionsPerHour 5 AI-proposed actions/hour budget
ai.rateLimit.tokensPerHour 1000000
ai.circuitBreaker.consecutiveFailures 3
ai.audit.destination events events | loki | otlp (reserved for a future structured sink; auditLog above drives the binary today)
ai.audit.lokiURL ""
ai.audit.otlpEndpoint ""
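
A minimal sketch of turning on the narration-only tier. The endpoint is a hypothetical in-cluster OpenAI-compatible URL; the model name is the example from the table. `apiKey.secretName` is left empty, which the table describes as the no-auth in-cluster vLLM case.

```yaml
# Illustrative only; the endpoint host is an assumption.
ai:
  enabled: true
  tier: t0                                            # narration only
  endpoint: "http://vllm.ai.svc.cluster.local:8000/v1"
  model: "qwen3.6-35b-a3b-fp8"
  interval: 60s                                       # chart default
  auditLog: "-"                                       # AI-event audit to stdout
```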

approval

Approval-server sidecar: holds the JWT signing key and terminates click-to-fix URLs. Required when ai.tier ∈ {t1, t2, t3} (chart-level validation should enforce this). For ai.tier=t0 (narration only), approval is unused.

Key Default Description
approval.enabled false
approval.replicas 1
approval.image.repository docker4zerocool/srenix-enterprise
approval.image.tag "" default = .Chart.AppVersion
approval.image.pullPolicy IfNotPresent
approval.signingKey.secretName srenix-approval-signing-key
approval.silence.shortDuration 24h subject-scoped "Silence 24h"
approval.silence.longDuration 2160h class-scoped "Silence class (90d)"; 2160h = 90d
approval.store.backend "" "" (= inmemory) | configmap
approval.store.namespace "" default = release namespace
approval.store.replayConfigMap srenix-approval-replay
approval.store.runbookConfigMap srenix-approval-runbooks
approval.ingress.enabled false
approval.ingress.host "" REQUIRED if approval.ingress.enabled; e.g. srenix-approve.example.com
approval.ingress.ingressClassName ""
approval.ingress.annotations {}
approval.ingress.tls.enabled true
approval.ingress.tls.secretName "" default: <release>-approval-server-tls
approval.networkPolicy.enabled false
approval.networkPolicy.gatewayNamespaceSelector {} REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway}
approval.resources.limits.cpu 200m
approval.resources.limits.memory 128Mi
approval.resources.requests.cpu 20m
approval.resources.requests.memory 32Mi
approval.nodeSelector {}
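
For t1 and above, the approval server and the AI tier's link base URL are configured together. The fragment below is a sketch with a placeholder hostname, and it omits the other `ai.*` keys a real install would also need (`ai.enabled`, `ai.tier`, `ai.endpoint`, `ai.model`).

```yaml
# Illustrative only; the hostname is a placeholder.
ai:
  approvalServerUrl: "https://srenix-approve.example.com"
approval:
  enabled: true
  ingress:
    enabled: true
    host: srenix-approve.example.com   # REQUIRED when ingress is enabled
    tls:
      enabled: true                    # chart default
```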

dashboard

P6.6 — the read-only hosted dashboard. A separate Deployment + Service running the `srenix-enterprise dashboard` subcommand: a server-rendered HTML view of findings (live DriftReports), pending approvals, and remediation history. It NEVER mutates the cluster — for any action it links out to the EXISTING approval-server endpoints (built from approvalBaseURL). RBAC posture: a DEDICATED…

Key Default Description
dashboard.enabled false
dashboard.replicas 1
dashboard.image.repository docker4zerocool/srenix-enterprise
dashboard.image.tag "" default = v<.Chart.AppVersion>
dashboard.image.pullPolicy IfNotPresent
dashboard.approvalBaseURL "" approvalBaseURL is REQUIRED when dashboard.enabled=true: the externally reachable base URL of the approval-server (e.g. https://srenix-approve.example.com). The /approvals page builds Approve/Deny/Ignore links against it, so the action links work.
dashboard.authHeader X-Forwarded-User authHeader is the HTTP header carrying the oauth2-proxy-authenticated operator identity, displayed in the page header. Default matches the approval-server convention.
dashboard.auditLogPath "" auditLogPath optionally points at the approval-server's tamper-evident audit JSONL (its --ai-audit-log output) to power /history and /approvals. Empty = those pages render an empty state.
dashboard.historyLimit 100 max rows on /history (most-recent first)
dashboard.approvalsLimit 50 max rows on /approvals (most-recent first)
dashboard.ingress.enabled false
dashboard.ingress.host "" REQUIRED if dashboard.ingress.enabled; e.g. srenix-dashboard.example.com
dashboard.ingress.ingressClassName ""
dashboard.ingress.annotations {} e.g. oauth2-proxy / cert-manager annotations
dashboard.ingress.tls.enabled true
dashboard.ingress.tls.secretName "" default: <release>-dashboard-tls
dashboard.networkPolicy.enabled false
dashboard.networkPolicy.gatewayNamespaceSelector {} REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway}
dashboard.resources.limits.cpu 200m
dashboard.resources.limits.memory 128Mi
dashboard.resources.requests.cpu 20m
dashboard.resources.requests.memory 32Mi
dashboard.nodeSelector {}
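
A sketch of a dashboard install that links out to an existing approval server. Both hostnames are placeholders; `approvalBaseURL` is the one field the table marks as required when the dashboard is enabled.

```yaml
# Illustrative only; hostnames are placeholders.
dashboard:
  enabled: true
  approvalBaseURL: "https://srenix-approve.example.com"
  historyLimit: 100        # chart default
  approvalsLimit: 50       # chart default
  ingress:
    enabled: true
    host: srenix-dashboard.example.com
```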

protectedNamespaces

Protected namespaces — the act-side no-touch list. The compiled-in floor (kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system) is NOT configurable and can never be removed. `extra` APPENDS namespaces to that floor: each entry is rendered as SRENIX_PROTECTED_NAMESPACES_EXTRA (comma-separated) on the watcher, diagnose, remediate, AND aiwatch containers, so…

Key Default Description
protectedNamespaces.extra []
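
For example, appending two namespaces to the compiled-in floor could look like this. The namespace names are hypothetical.

```yaml
# Illustrative only; these entries are added to the non-configurable floor.
protectedNamespaces:
  extra:
    - payments
    - monitoring
```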

gatekeeper

Key Default Description
gatekeeper.install false set true if Gatekeeper is installed in the cluster
gatekeeper.constraints.protectedNamespaces [kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system]

analyzers

Drift-class analyzers added in v1.7 (Workstreams B1+B2+B3 from the AI SRE positioning plan) + v1.8 (Workstream B4). Default to ON so existing installs get the new signal automatically; flip individual entries to false on clusters that don't host the targeted asset class.

Key Default Description
analyzers.secretKeyMissing.enabled true Workloads referencing a Secret key that does not exist (the CreateContainerConfigError root cause).
analyzers.failingExternalSecrets.enabled true ExternalSecrets stuck SecretSyncedError / not Ready (ESO sync chain broken).
analyzers.proactiveSecretKeyCheck.enabled true Proactive secretKeyRef validation BEFORE pods restart and hit CreateContainerConfigError.
analyzers.unprovisionedSecret.enabled true Secrets referenced by workloads that no controller (ESO, cert-manager, Helm) ever provisioned.
analyzers.imagePullAuth.enabled true ImagePullBackOff caused by missing/invalid pull credentials (namespace missing pull secret or deployment missing imagePullSecrets).
analyzers.certExpiry.enabled true cert-manager Certificates close to / past expiry or stuck not Ready.
analyzers.tlsSecretMismatch.enabled true Ingress TLS secretName pointing at a stale/mismatched Secret while a healthy Certificate targets a different Secret for the same host. (The optional auto-FIXER for this finding is gated separately under fixers.tlsSecretMismatch.)
analyzers.gitopsDrift.enabled true GitOps drift — Argo CD Application + Flux Kustomization + Flux HelmRelease. Surfaces controllers stuck OutOfSync, Degraded, NotReady, BuildFailed, UpgradeFailed, etc. Default 10-minute grace period (controllers are routinely reconciling). Set to false on clusters without…
analyzers.workloadStateDrift.enabled true State-tier drift — CNPG cluster phase / follower lag / primary switchover stuck; StatefulSet ordinal-zero stuck. Goes deeper than the basic "X/Y ready" probe. Set to false on clusters that don't host CNPG and don't run StatefulSets.
analyzers.rbacDrift.enabled true RBAC drift — wildcard-verb roles + unbound ServiceAccounts mounted by Pods. Skips system canonical roles (cluster-admin, system:*) and the default SA in every namespace. Set to false if your cluster's RBAC posture is managed entirely by an upstream IaC system that…
analyzers.configDrift.enabled true Config drift (v1.8) — CRD multi-storedVersions (storage migration pending), Deployment rollouts stuck past the grace window (generation skew or updatedReplicas trailing spec.replicas), and Pods of the same Deployment carrying disagreeing checksum/config annotations (rolling…
analyzers.capacityDrift.enabled true Capacity drift (v1.8) — HPA pinned at maxReplicas past the saturation grace (24h default; workload is chronically under-provisioned), HPA pinned at minReplicas past the idle grace (30d default; HPA is not load-driven), HPA AbleToScale=False past grace (typically ResourceQuota or…
analyzers.securityDrift.enabled true Security drift (v1.8): three observational signals: user namespaces with no pod-security.kubernetes.io/enforce label (apiserver applies the cluster-wide default, typically privileged) or with enforce=privileged explicitly (most-permissive PSS profile); Pods whose containers…
analyzers.disruptionDrift.enabled true Disruption-tier drift (v1.21) — ResourceQuota near-exhaustion, PodDisruptionBudget blocking voluntary disruption (drains stuck), and Jobs past their activeDeadline / backoffLimit. Each sub-signal handles its own GVR-absence case. Set to false to silence.
analyzers.oomkillRecurrence.enabled true Workload-tier (v1.22) — containers OOMKilled repeatedly (a sizing problem masquerading as a crash loop). Set to false to silence.
analyzers.pvOrphan.enabled true Workload-tier (v1.22) — Released/Available PersistentVolumes with no bound PVC (a cost leak). Set to false to silence.
analyzers.cronjobStuck.enabled true Workload-tier (v1.22) — CronJobs that have not scheduled a Job within their expected window (silent scheduling failure). Set to false to silence.
analyzers.logPatternMatcher.enabled true v1.25 — scans recent Events for high-signal failure messages (ImagePullBackOff, OOMKilled, VolumeAttachFailed, ProbeFailed, Forbidden). Deduplicated to one finding per (object, pattern) pair. Set to false to silence.
analyzers.netpolProposer.enabled true Phase 2d-β — on NetworkPolicy-enforcing CNIs, emits one warning per uncovered namespace with a deterministic ProposedPolicyYAML. Silent on Flannel-only k3s. Set to false to silence.
analyzers.dnsChainDrift.enabled true v1.10 — verifies the DNS chain (Service → Ingress → external host) for the seeded endpoint hostnames. Runs the K8s-chain hops with no config; external-hop verification requires externalDNS.cloudflare. Set to false to silence.
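
As a minimal sketch of how the analyzer toggles above nest in a values override, the fragment below turns off two analyzers and leaves the rest at their defaults. The file name is hypothetical, and the choice of which analyzers to disable is only an illustration, not a recommendation.

```yaml
# values-analyzers.yaml (hypothetical file name)
# Every analyzer listed above defaults to enabled: true,
# so only the ones being silenced need to appear here.
analyzers:
  rbacDrift:
    enabled: false    # e.g. RBAC posture is managed entirely by upstream IaC
  netpolProposer:
    enabled: false    # e.g. NetworkPolicy proposals are not wanted on this cluster
```

A file like this can be passed to Helm with the `-f` flag at install or upgrade time.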

investigator

Layer-2 investigator (deterministic, rule-based; ships in OSS). Enabled by default. The paid binary may replace it with an LLM-backed implementation. Set investigator.enabled: false to disable it (SRENIX_INVESTIGATOR=off).

Key Default Description
investigator.enabled true

probes

M2 probe-class additions (v1.8). Each defaults to ON and AUTO-SKIPS when its CRD is absent (Kong / ArgoCD / Velero) or no-ops on an empty list (HPA), so leaving them on costs nothing on clusters that don't host the asset. Set enabled: false only to silence a probe on a cluster that DOES host the CRD but where you don't want Srenix watching it.

Key Default Description
probes.ceph.enabled true Rook-Ceph cluster health (HEALTH_OK / OSD status). Auto-skips when the rook-ceph CRDs are absent.
probes.nodes.enabled true Node Ready conditions across the cluster.
probes.postgres.enabled true PostgreSQL (CNPG) cluster health. Auto-skips when the CNPG CRDs are absent.
probes.pvcs.enabled true PersistentVolumeClaims stuck Pending / Lost.
probes.criticalWorkloads.enabled true Critical Services probe — the curated workload target list (defaults merged with SRENIX_CRITICAL_SERVICES / the probe-critical annotation). Emits SRENIX_PROBE_CRITICAL_WORKLOADS=off when disabled.
probes.endpoints.enabled true External HTTP(S) endpoint reachability for discovered/seeded Ingress hostnames.
probes.kong.enabled true Kong ingress — KongPlugin / KongConsumer / Kong proxy readiness drift. Auto-skips when configuration.konghq.com CRDs are absent.
probes.hpaScaling.enabled true HorizontalPodAutoscaler scaling health (distinct from the v1.8 capacityDrift analyzer's longitudinal signals). No-ops on an empty HPA list.
probes.argocdApp.enabled true Argo CD Application sync/health (probe-level snapshot, distinct from the gitopsDrift analyzer). Auto-skips when argoproj.io CRDs are absent.
probes.velero.enabled true Velero backup freshness / last-backup status. Auto-skips when velero.io CRDs are absent.
probes.nodePressure.enabled true Node MemoryPressure / DiskPressure / PIDPressure conditions.
probes.daemonsets.enabled true DaemonSets with unavailable/misscheduled pods.
probes.pendingPods.enabled true Pods stuck Pending past the scheduling grace window.
probes.crashloop.enabled true Containers in CrashLoopBackOff.
probes.etcd.enabled true etcd member health / quorum.
probes.failedMounts.enabled true Pods blocked on FailedMount / FailedAttachVolume events.
probes.kongRoutes.enabled true Kong-managed Ingress backend-Endpoint + plugin/consumer reference readiness. Silent on clusters without Kong-managed Ingresses.
probes.gpuNodes.enabled true NotReady / cordoned / zero-allocatable GPU nodes. Silent on CPU-only clusters.
probes.traefikRoutes.enabled true k3s Traefik IngressRoute backend readiness. Auto-skips on non-k3s or when the Traefik CRD is absent.
probes.k3sLocalPathStorage.enabled true k3s local-path-provisioner PVC health. No-ops when there are no local-path PVCs.
probes.k3sDatastore.enabled true k3s datastore (sqlite/etcd) health. Auto-skips on non-k3s. Set SRENIX_K3S_SINGLE_NODE_OK=true (via watcher.extraEnv) to suppress the single-node datastore warning on intentional single-node clusters.
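
The probe toggles above could be combined in an override along the following lines. This is a sketch under assumptions: the source names watcher.extraEnv as the place to set SRENIX_K3S_SINGLE_NODE_OK but does not show its shape, so the Kubernetes-style name/value list used here is a guess that should be checked against the chart's values.yaml.

```yaml
# Silence probes on a cluster that DOES host the CRDs
# but should not be watched for them.
probes:
  velero:
    enabled: false
  argocdApp:
    enabled: false

# Suppress the single-node datastore warning on an intentional
# single-node k3s cluster.
# ASSUMPTION: extraEnv takes a list of name/value pairs,
# as a container env list does; the source does not confirm this.
watcher:
  extraEnv:
    - name: SRENIX_K3S_SINGLE_NODE_OK
      value: "true"
```

Probes for assets the cluster does not host need no entry, since they auto-skip or no-op on their own.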

fixers

Optional fixers — off by default. Each entry adds RBAC verbs and exposes an env var the binary reads to enable the matching Fixer registration.

Key Default Description
fixers.tlsSecretMismatch.enabled false Patches Ingress.spec.tls[].secretName when Srenix detects a stale Secret plus a healthy cert-manager Certificate in the same namespace targeting a different Secret for the same host. GitOps-managed Ingresses (ArgoCD / Flux / Helm release labels) are SKIPPED automatically — the…
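
Because fixers are off by default, opting in is an explicit override. A minimal sketch, using only the key listed above:

```yaml
# Opt in to the TLS secret mismatch fixer.
# Enabling it adds RBAC verbs to the release, so review the
# rendered Role/ClusterRole changes before applying.
fixers:
  tlsSecretMismatch:
    enabled: true
```

Ingresses carrying ArgoCD, Flux, or Helm release labels are skipped automatically, so this override mainly affects Ingresses that are managed by hand.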

externalDNS

External-DNS verification for the DNSChainDrift analyzer. When enabled, the watcher + diagnose containers receive SRENIX_CLOUDFLARE_TOKEN via a secretKeyRef (NEVER a literal) so the analyzer can verify the external DNS hop (Cloudflare record → Ingress host). Without this the analyzer still runs the in-cluster chain hops and emits "external DNS hop not verified". Mirrors the operator-managed…

Key Default Description
externalDNS.cloudflare.enabled false
externalDNS.cloudflare.apiTokenSecretRef {} e.g. {name: srenix-cloudflare-token, key: token}
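
Put together, the two keys above might be set as in the sketch below. It reuses the Secret name and key from the table and assumes a Secret with that name and key already exists in the release namespace, since a secretKeyRef can only resolve a Secret in the Pod's own namespace.

```yaml
# Enable external-hop verification for the DNSChainDrift analyzer.
# The token is referenced through a Secret, never written as a literal.
externalDNS:
  cloudflare:
    enabled: true
    apiTokenSecretRef:
      name: srenix-cloudflare-token   # Secret name (must already exist)
      key: token                      # key inside that Secret holding the API token
```

With enabled left at false, the analyzer still runs the in-cluster chain hops and reports the external DNS hop as not verified.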