Install with defaults first: most options have safe defaults. Add cloud probes, ticketing, and AI tiers incrementally as you need them. This reference is generated directly from the chart's values.yaml (chart v0.1.0-alpha.1), so every key listed here exists in the chart.
image
| Key | Default | Description |
| image.repository | docker4zerocool/agentic-sre | Docker Hub is the canonical publish target. docker4zerocool is the org's operational registry (same account that hosts every ai-* and mcp-* image the platform depends on); no extra pull-secret config needed for operators who already use it. GHCR is published as a mirror by… |
| image.tag | "" | default = .Chart.AppVersion (e.g. v0.1.0-alpha.1) |
| image.pullPolicy | "" | pullPolicy is left empty so the `srenix.pullPolicy` helper picks it automatically: Always for mutable tags (latest / main / dev / `*-latest`) and IfNotPresent for semver-pinned tags. Override only when you specifically need to force one or the other. |
| image.pullSecrets | [] | e.g. [{name: dockerhub-secret}] |
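As a minimal sketch of an override file for the keys above: the tag and pull-secret name are taken from the examples in the table, and the Secret is assumed to already exist in the install namespace.

```yaml
# Hypothetical values override for the image block.
image:
  repository: docker4zerocool/agentic-sre  # chart default, repeated for context
  tag: "v0.1.0-alpha.1"                    # pin a semver tag explicitly
  pullPolicy: ""                           # leave empty so the helper picks the policy
  pullSecrets:
    - name: dockerhub-secret               # assumed pre-existing Secret
```

With a semver-pinned tag and an empty pullPolicy, the table above says the helper resolves to IfNotPresent.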
diagnose
Diagnose CronJob — always enabled.
| Key | Default | Description |
| diagnose.enabled | true | |
| diagnose.schedule | 0 9 * * * | Daily 09:00 UTC. Same default as the bash version. |
| diagnose.successfulJobsHistoryLimit | 3 | |
| diagnose.failedJobsHistoryLimit | 3 | |
| diagnose.concurrencyPolicy | Forbid | |
| diagnose.backoffLimit | 1 | backoffLimit caps how many times a failed pod within the Job is retried before the Job itself is marked Failed. K8s default is 6, which compounds with the per-pod default of 5 minutes of backoff — a hung run can keep spawning pods for over an hour. Srenix's diagnose is… |
| diagnose.activeDeadlineSeconds | 300 | activeDeadlineSeconds caps the Job's total runtime (counted across pod restarts). Raised 120 → 300 in v1.8.1: the v1.8 analyzer + M2-probe set adds a meaningful number of cluster List calls (CRDs, HPAs, all namespaces + pods + NetworkPolicies, Kong/Velero/Argo CRDs), and a live… |
| diagnose.format | daily | One of daily, text, or json. daily posts to #healthinfo; text and json are for ad-hoc runs. |
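A sketch of a diagnose override, assuming you only want to move the daily run earlier. The 06:00 schedule is an illustrative value; the other keys repeat the chart defaults so the Job limits stay visible in one place.

```yaml
# Hypothetical values override for the diagnose CronJob.
diagnose:
  enabled: true
  schedule: "0 6 * * *"        # illustrative: 06:00 instead of the 09:00 default
  format: daily                # posts to #healthinfo
  concurrencyPolicy: Forbid    # chart default
  backoffLimit: 1              # chart default
  activeDeadlineSeconds: 300   # chart default
```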
remediation
Remediate CronJob — opt-in.
| Key | Default | Description |
| remediation.enabled | false | |
| remediation.schedule | */30 * * * * | Every 30 min. Off by default — turn on once your team trusts the fixers. |
| remediation.successfulJobsHistoryLimit | 3 | |
| remediation.failedJobsHistoryLimit | 3 | |
| remediation.concurrencyPolicy | Forbid | |
| remediation.backoffLimit | 1 | See diagnose.backoffLimit comment. Remediation Jobs mutate cluster state so we especially do NOT want them retrying a hung-mid-mutation run. |
| remediation.activeDeadlineSeconds | 300 | See diagnose.activeDeadlineSeconds comment. Remediation runs the whitelisted fixers AND a full re-probe (which now includes the v1.8 analyzer + M2-probe set), so it inherits the same ~157s read cost. Raised 120 → 300 in v1.8.1 to match diagnose. |
| remediation.dryRun | false | When true, fixers report Refused without mutating cluster state. |
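One cautious way to trial remediation, sketched under the assumption that you want to observe fixer decisions before allowing mutation, is to enable the CronJob with dryRun set:

```yaml
# Hypothetical values override: remediation on, but non-mutating.
remediation:
  enabled: true
  schedule: "*/30 * * * *"   # chart default, every 30 minutes
  dryRun: true               # fixers report Refused and do not mutate cluster state
```

Setting dryRun back to false later allows the whitelisted fixers to act.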
slack
Slack three-channel routing. Each channel maps to a dedicated incoming-webhook Secret. All URLs must come from pre-existing Secrets in the install namespace; prefer wiring them via External Secrets Operator from Vault (see SECURITY.md).
- alerts → #ceph-alerts: event-driven, Srenix acted (auto-fixed issues)
- critical → #ceph-critical: event-driven, human action required
- healthinfo → #healthinfo:…
| Key | Default | Description |
| slack.alerts.enabled | false | |
| slack.alerts.secretName | "" | e.g. "srenix-slack-ceph-alerts" |
| slack.alerts.secretKey | WEBHOOK_URL | |
| slack.critical.enabled | false | |
| slack.critical.secretName | "" | e.g. "srenix-slack-ceph-critical" |
| slack.critical.secretKey | WEBHOOK_URL | |
| slack.healthinfo.enabled | false | |
| slack.healthinfo.secretName | "" | e.g. "srenix-slack-healthinfo" |
| slack.healthinfo.secretKey | WEBHOOK_URL | |
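A sketch of wiring all three channels, using the example Secret names from the table. It assumes each Secret already exists in the install namespace and stores the webhook under the default WEBHOOK_URL key.

```yaml
# Hypothetical values override for Slack routing.
slack:
  alerts:
    enabled: true
    secretName: srenix-slack-ceph-alerts     # assumed pre-existing Secret
    secretKey: WEBHOOK_URL
  critical:
    enabled: true
    secretName: srenix-slack-ceph-critical   # assumed pre-existing Secret
    secretKey: WEBHOOK_URL
  healthinfo:
    enabled: true
    secretName: srenix-slack-healthinfo      # assumed pre-existing Secret
    secretKey: WEBHOOK_URL
```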
driftReport
DriftReport CRD: kubectl-queryable diagnostic objects. Each cron tick upserts one DriftReport per active diagnostic; CRs whose subject is no longer reported are deleted automatically. Inspect with `kubectl get driftreports -A`. Disable if you don't want CRD-shaped output (Slack and JSON still work).
| Key | Default | Description |
| driftReport.enabled | true | |
resolutionRecord
ResolutionRecord CRD: append-only outcome log (one CR per applied and verified remediation). The durable system of record that the RAG memory layer embeds and retrieves. Inspect with `kubectl get resolutionrecords -l srenix.ai/verified=cleared`.
| Key | Default | Description |
| resolutionRecord.enabled | true | |
silence
Silence — operator-controlled noise suppression. When `enabled`, the watcher fetches active Silence CRs once per cycle and drops matched diagnostics before downstream emission (DriftReport / Slack / Alertmanager / ticketing). Operators create Silences with `kubectl create -f silence.yaml` in any namespace; the matching is cluster-wide. Defaults to ON for new installs because installing the CRD on…
| Key | Default | Description |
| silence.installCRD | true | ship the CRD via the chart |
| silence.enabled | true | provision the watcher's ClusterRole + binding |
operator
Operator (Phase 1) — ships the AgenticSRE CRD shape only. The srenix-operator binary that reconciles AgenticSRE resources into the watcher Deployment / CronJobs / ServiceAccount lands in Phase 2 (next release). Installing the CRD alone is harmless: the resource is just queryable state. Default ON for new installs so the CRD is in place by the time the operator binary ships and clusters that opt…
| Key | Default | Description |
| operator.installCRD | true | |
| operator.enabled | false | Phase 1b (v1.8): the srenix-operator controller-runtime manager. Default OFF: existing chart-managed installs continue to work unchanged. Flip to true to deploy the operator binary; the operator only takes over resources named by an AgenticSRE CR (which operators create… |
| operator.replicas | 1 | Operator replicas. 1 is sufficient because the manager uses lease-based leader-election; additional replicas stand by. |
| operator.resources | {} | |
vaultProbe
Vault-probe — closes the L1 stale-Ready window: queries Vault directly to verify each ExternalSecret's referenced path + property still exists, BEFORE the ESO controller's next refresh marks itself not-Ready. Catches the case where someone edits Vault but pods stay alive on cached Secret data and the next pod restart fails with CreateContainerConfigError. Privacy contract: srenix never reads…
| Key | Default | Description |
| vaultProbe.enabled | false | |
| vaultProbe.address | "" | e.g. "https://vault.svc.cluster.local:8200" |
| vaultProbe.kvMount | secret | KV-v2 mount path; ESO `data[].remoteRef.key` is mount-relative |
| vaultProbe.auth.method | kubernetes | kubernetes or token |
| vaultProbe.auth.role | "" | required when method=kubernetes; the Vault role bound to the srenix SA |
| vaultProbe.auth.tokenSecretRef.name | "" | K8s Secret holding a Vault token |
| vaultProbe.auth.tokenSecretRef.key | token | key within that Secret |
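A sketch of the Vault probe with Kubernetes auth. The address is the example from the table; the role name `srenix-probe` is a hypothetical placeholder for whatever Vault role is bound to the srenix ServiceAccount.

```yaml
# Hypothetical values override for the Vault probe.
vaultProbe:
  enabled: true
  address: "https://vault.svc.cluster.local:8200"
  kvMount: secret            # KV-v2 mount; ESO remoteRef keys are mount-relative
  auth:
    method: kubernetes
    role: srenix-probe       # hypothetical Vault role name, required for this method
```

For method=token, the tokenSecretRef name and key would be set instead of role.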
alertmanager
Alertmanager integration — Kubernetes-native event routing. When enabled, Srenix posts the full active-issue state to Alertmanager each watcher cycle. Alertmanager handles: dedup, grouping, silencing, repeat intervals, and fan-out to all configured receivers (Slack, PagerDuty, Teams, email, webhook, …). This is the preferred model over direct Slack webhooks; the slack.alerts/critical fields…
| Key | Default | Description |
| alertmanager.enabled | false | |
| alertmanager.url | "" | e.g. "http://alertmanager.pg.svc.cluster.local:9093" |
| alertmanager.clusterName | cluster | identifies this cluster in alert labels |
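A sketch of enabling the Alertmanager integration. The URL is the example from the table; `prod-east` is a hypothetical cluster label.

```yaml
# Hypothetical values override for Alertmanager routing.
alertmanager:
  enabled: true
  url: "http://alertmanager.pg.svc.cluster.local:9093"
  clusterName: prod-east     # hypothetical; appears in alert labels
```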
cloud
Cloud probe framework — observe AWS / GCP / Azure resources alongside K8s state. Each provider is independently toggleable. Cloud probes do NOT fire on K8s event triggers — they run on `cadence` (default 10m) to protect cloud-API rate limits. Master switch: `cloud.enabled=false` disables EVERYTHING below and pays zero overhead (no SDK init, no probe registration, no extra RBAC). Required for…
| Key | Default | Description |
| cloud.enabled | false | |
| cloud.cadence | 10m | min interval between cloud-probe runs |
| cloud.aws.enabled | false | |
| cloud.aws.region | "" | required; e.g. "us-east-1" |
| cloud.aws.auth.roleArn | "" | ARN of the IAM role; auto-injected as eks.amazonaws.com/role-arn SA annotation |
| cloud.aws.probes.rds | true | |
| cloud.aws.probes.ebs | true | |
| cloud.aws.probes.eks | true | |
| cloud.aws.probes.iam | true | |
| cloud.aws.probes.alb | true | |
| cloud.aws.probes.acm | true | |
| cloud.aws.probes.kms | true | |
| cloud.aws.probes.s3 | true | |
| cloud.aws.probes.vpc | true | |
| cloud.gcp.enabled | false | GCP cloud probes shipped in v1.8 (M2): Cloud SQL, Persistent Disks, GKE control-plane + node pools, IAM service accounts, Subnets, LB backends, managed certs, GCS public access, KMS. NOTE: Cloud SQL storage-% is fetched via the Cloud Monitoring API (best-effort; "not measured"… |
| cloud.gcp.project | "" | |
| cloud.gcp.auth.serviceAccount | "" | GSA email; auto-injected as iam.gke.io/gcp-service-account SA annotation |
| cloud.gcp.probes.cloudsql | true | |
| cloud.gcp.probes.disks | true | |
| cloud.gcp.probes.gke | true | |
| cloud.gcp.probes.iam | true | |
| cloud.gcp.probes.subnets | true | |
| cloud.gcp.probes.lb | true | |
| cloud.gcp.probes.certs | true | |
| cloud.gcp.probes.gcs | true | |
| cloud.gcp.probes.kms | true | |
| cloud.gcp.subnetsSmallPrefixThreshold | 0 | Small-prefix threshold for the capacity-only subnets probe: an unmeasured subnet whose primary CIDR is smaller than /<threshold> is flagged as a warning. 0 = the probe default (/26 → 60 usable IPs). Raise (e.g. 28) to quiet intentionally tiny subnets, lower to be stricter.… |
| cloud.azure.enabled | false | Azure cloud probes shipped in v1.8 (M2): SQL databases, Disks, AKS control-plane + node pools, Managed Identities, Subnets, App Gateway backends, certs, Storage public access, Key Vaults. NOTE: SQL storage-% and App Gateway backend health are fetched via Azure Monitor metrics… |
| cloud.azure.subscriptionId | "" | |
| cloud.azure.resourceGroup | "" | optional scope; empty = subscription-wide |
| cloud.azure.auth.clientId | "" | AAD app client ID; auto-injected as azure.workload.identity/client-id SA annotation |
| cloud.azure.probes.sql | true | |
| cloud.azure.probes.disks | true | |
| cloud.azure.probes.aks | true | |
| cloud.azure.probes.identities | true | |
| cloud.azure.probes.subnets | true | |
| cloud.azure.probes.appgw | true | |
| cloud.azure.probes.certs | true | |
| cloud.azure.probes.storage | true | |
| cloud.azure.probes.keyvaults | true | |
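A sketch of enabling a single provider with a reduced probe set. The region is the table's example; the role ARN and the choice of which probes to turn off are illustrative assumptions.

```yaml
# Hypothetical values override: AWS probes only, slower cadence.
cloud:
  enabled: true              # master switch; false disables everything below
  cadence: 15m               # illustrative; the default is 10m
  aws:
    enabled: true
    region: "us-east-1"
    auth:
      roleArn: "arn:aws:iam::123456789012:role/srenix-readonly"  # placeholder ARN
    probes:
      rds: true
      ebs: true
      eks: true
      s3: false              # illustrative: skip probes for services you do not use
      kms: false
  gcp:
    enabled: false
  azure:
    enabled: false
```

Probes not listed keep their default of true.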
ticketing
Ticketing — open issue-tracker tickets for diagnostics Srenix cannot auto-remediate. Runs after each watcher cycle's DriftReport reconcile; the resulting ticket key is persisted onto DriftReport.status.ticket so subsequent cycles know not to re-open the same ticket. Sink failures NEVER abort the cycle — logged and skipped, same posture as Slack/Alertmanager. OSS ships the OpenProject sink…
| Key | Default | Description |
| ticketing.enabled | false | |
| ticketing.provider | openproject | openproject, jira, or servicenow |
| ticketing.cluster | cluster | identifies this cluster in ticket bodies; matches alertmanager.clusterName |
| ticketing.labels | [srenix, auto-filed] | |
| ticketing.mcpURL | http://mcp-openproject-server.mcp.svc:8006/mcp | FLAT shape — matches the operator CRD's `spec.ticketing.*` so YAML can move between Helm values and `kubectl patch srenix …` without reshaping. The legacy nested `ticketing.openproject.*` block below is the v1.19.x shape and is honored by the chart template as a fallback ONLY… |
| ticketing.project | "" | e.g. "6" for the Demo project |
| ticketing.typeID | "" | e.g. "36" for Task — REQUIRED |
| ticketing.closedStatusID | "" | e.g. "82" for Closed status — needed for resolve-on-clear |
| ticketing.webURLPrefix | "" | e.g. "https://op.example.com" — used to build operator-clickable URLs |
| ticketing.severityPriority.critical | "" | e.g. "75" for Immediate |
| ticketing.severityPriority.warning | "" | e.g. "74" for High |
| ticketing.severityPriority.info | "" | e.g. "73" for Normal |
| ticketing.dryRun | false | Log intended ops without calling the MCP server |
| ticketing.resolveOnClear | true | Auto-close the ticket when its finding clears. Default ON (M2 shipped). |
| ticketing.commentInterval | 1h | Debounce for comment-on-recurrence: at most one comment per window. A recurring or severity-changed finding comments on the EXISTING ticket instead of opening a new one. "0" disables recurrence comments. |
| ticketing.auth.enabled | false | |
| ticketing.auth.secretName | srenix-ticketing-mcp | K8s Secret with the API key |
| ticketing.auth.secretKey | api-key | Key inside the Secret |
| ticketing.route | "" | SRENIX_TICKETING_ROUTE — routes a finding to a sink (provider above selects the default sink). |
| ticketing.jira.url | "" | SRENIX_JIRA_URL |
| ticketing.jira.project | "" | SRENIX_JIRA_PROJECT (project key) |
| ticketing.jira.email | "" | SRENIX_JIRA_EMAIL (token-auth account email) |
| ticketing.jira.issueType | "" | SRENIX_JIRA_ISSUE_TYPE (e.g. "Bug") |
| ticketing.jira.priority.critical | "" | SRENIX_JIRA_PRIORITY_CRITICAL |
| ticketing.jira.priority.warning | "" | SRENIX_JIRA_PRIORITY_WARNING |
| ticketing.jira.priority.info | "" | SRENIX_JIRA_PRIORITY_INFO |
| ticketing.jira.webUrlBase | "" | SRENIX_JIRA_WEB_URL_BASE (clickable ticket URLs) |
| ticketing.jira.tokenSecret.name | "" | K8s Secret name (ESO-synced) |
| ticketing.jira.tokenSecret.key | "" | key inside the Secret |
| ticketing.servicenow.url | "" | SRENIX_SERVICENOW_URL |
| ticketing.servicenow.user | "" | SRENIX_SERVICENOW_USER (basic-auth username) |
| ticketing.servicenow.urgency.critical | "" | SRENIX_SERVICENOW_URGENCY_CRITICAL |
| ticketing.servicenow.urgency.warning | "" | SRENIX_SERVICENOW_URGENCY_WARNING |
| ticketing.servicenow.urgency.info | "" | SRENIX_SERVICENOW_URGENCY_INFO |
| ticketing.servicenow.impact.critical | "" | SRENIX_SERVICENOW_IMPACT_CRITICAL |
| ticketing.servicenow.impact.warning | "" | SRENIX_SERVICENOW_IMPACT_WARNING |
| ticketing.servicenow.impact.info | "" | SRENIX_SERVICENOW_IMPACT_INFO |
| ticketing.servicenow.webUrlBase | "" | SRENIX_SERVICENOW_WEB_URL_BASE (clickable ticket URLs) |
| ticketing.servicenow.passwordSecret.name | "" | SRENIX_SERVICENOW_PASSWORD — secretKeyRef ONLY |
| ticketing.servicenow.passwordSecret.key | "" | |
| ticketing.servicenow.bearerSecret.name | "" | SRENIX_SERVICENOW_BEARER — secretKeyRef ONLY |
| ticketing.servicenow.bearerSecret.key | "" | |
| ticketing.openproject | {} | DEPRECATED (v1.20.0+): nested per-provider sub-trees. Honored only when the equivalent flat field above is unset. New installs should use the flat shape exclusively. Will be removed in the next major chart bump. |
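A sketch of the flat OpenProject shape, built from the example IDs quoted in the table. Those IDs are specific to an OpenProject instance and are assumptions here, as is the `prod-east` cluster label.

```yaml
# Hypothetical values override for OpenProject ticketing (flat shape).
ticketing:
  enabled: true
  provider: openproject
  cluster: prod-east                 # hypothetical; should match alertmanager.clusterName
  mcpURL: "http://mcp-openproject-server.mcp.svc:8006/mcp"   # chart default
  project: "6"                       # example ID from the table
  typeID: "36"                       # required
  closedStatusID: "82"               # needed for resolve-on-clear
  webURLPrefix: "https://op.example.com"
  severityPriority:
    critical: "75"
    warning: "74"
    info: "73"
  resolveOnClear: true
  commentInterval: 1h
  dryRun: true                       # log intended operations first; set false when satisfied
  auth:
    enabled: true
    secretName: srenix-ticketing-mcp # chart default
    secretKey: api-key
```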
watcher
Watcher — event-driven, long-running Deployment (Phase 1). Replaces polling latency (CronJob tick) with near-instant reaction to Kubernetes watch events. Debounces burst updates before re-running the full probe+analyzer stack. Slack dedup: only new/changed/resolved diagnostics produce a post. The seen-map is seeded from DriftReport CRs on startup so a pod restart does not re-flood Slack with all…
| Key | Default | Description |
| watcher.enabled | false | |
| watcher.replicas | 1 | replicas — watcher pod count. Default 1. Raising above 1 is ONLY safe with leaderElection.enabled=true (the SRENIX_LEADER_ELECTION path): otherwise every replica runs the probe/fix/post cycle and they race on DriftReports + double-post Slack. The chart FAILS the render when… |
| watcher.healthListen | :8081 | healthListen — listen address for the always-on health server. GET /healthz returns 200 while the process is alive (independent of the webhook receiver), and the Deployment's liveness/readiness probes target it. The port number is derived from this address. |
| watcher.debounce | 10s | Debounce window after a Kubernetes event. |
| watcher.resyncPeriod | 10m | Periodic full re-diagnose regardless of events. |
| watcher.extraEnv | [] | e.g. [{name: SRENIX_CRITICAL_SERVICES, value: "pg/postgres,vault/vault"}, {name: SRENIX_K3S_SINGLE_NODE_OK, value: "true"}] |
| watcher.slack.postOnResolved | true | Post when a diagnostic disappears. |
| watcher.slack.repeatInterval | 4h | Re-post still-active warning/info at this cadence (0=never). |
| watcher.slack.criticalRepeatInterval | "" | Re-post still-active CRITICAL at this cadence (empty = fall back to repeatInterval). Use to keep criticals loud (e.g. "4h") while letting warnings calm down (e.g. set repeatInterval=24h). |
| watcher.remedy.enabled | false | Run auto-fixers after each cycle (live mutation). |
| watcher.remedy.dryRun | false | Evaluate fixers without mutating cluster state. |
| watcher.leaderElection.enabled | true | |
| watcher.leaderElection.leaseName | srenix-watcher | |
| watcher.leaderElection.leaseDuration | 30s | |
| watcher.leaderElection.renewDeadline | 20s | |
| watcher.leaderElection.retryPeriod | 5s | |
| watcher.triggers.prom.url | "" | e.g. "http://alertmanager.pg.svc.cluster.local:9093" |
| watcher.triggers.prom.interval | 30s | clamped to ≥5s by the trigger client |
| watcher.triggers.prom.alertNameFilter | [] | e.g. ["DiskFillUp","CertExpiringSoon"]; empty = any firing alert |
| watcher.triggers.webhook.listen | "" | e.g. ":8090" — empty disables receiver |
| watcher.triggers.webhook.sources | [] | e.g. ["vault=SRENIX_WEBHOOK_VAULT_SECRET","cert-manager=SRENIX_WEBHOOK_CM_SECRET"] |
| watcher.triggers.webhook.service.enabled | false | render a ClusterIP Service for the receiver |
| watcher.triggers.webhook.service.port | 8090 | |
| watcher.triggers.webhook.secretName | "" | e.g. "srenix-webhook-secrets" |
| watcher.resources.limits.cpu | 500m | |
| watcher.resources.limits.memory | 256Mi | |
| watcher.resources.requests.cpu | 50m | |
| watcher.resources.requests.memory | 64Mi | |
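A sketch of a two-replica watcher with the webhook receiver enabled. All values come from the table's defaults and examples; the Secret named under webhook.secretName is assumed to exist.

```yaml
# Hypothetical values override for the watcher Deployment.
watcher:
  enabled: true
  replicas: 2                  # more than 1 is only safe with leader election on
  leaderElection:
    enabled: true
  debounce: 10s
  resyncPeriod: 10m
  slack:
    repeatInterval: 24h        # let warnings calm down
    criticalRepeatInterval: 4h # keep criticals loud
  extraEnv:
    - name: SRENIX_CRITICAL_SERVICES
      value: "pg/postgres,vault/vault"
  triggers:
    webhook:
      listen: ":8090"
      sources:
        - "vault=SRENIX_WEBHOOK_VAULT_SECRET"
      secretName: srenix-webhook-secrets   # assumed pre-existing Secret
      service:
        enabled: true
        port: 8090
```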
runner
GitHub Actions self-hosted runner — Mode A of the WS-C publish pipeline. Deploys myoung34/github-runner inside the cluster so the nightly publish-runs workflow can call `srenix diagnose --live` directly without needing an internet-reachable MinIO endpoint. Prerequisites: 1. A GitHub PAT (classic, repo scope) or fine-grained token (Actions:write) stored in Vault at: secret/t6-apps/srenix/config →…
| Key | Default | Description |
| runner.enabled | false | |
| runner.repoUrl | https://github.com/srenix-ai/agentic-sre | |
| runner.labels | self-hosted,cluster | must match runs-on in publish-runs.yml |
| runner.name | srenix-cluster-runner | |
| runner.image | myoung34/github-runner:ubuntu-jammy | Runner image — ubuntu-jammy = Ubuntu 22.04; ubuntu-noble = 24.04 |
| runner.tokenSecretName | srenix-runner-token | Secret that holds ACCESS_TOKEN for runner registration. Created by the ExternalSecret below — do not create manually. |
| runner.tokenSecretKey | ACCESS_TOKEN | |
| runner.resources.limits.cpu | 2 | |
| runner.resources.limits.memory | 4Gi | |
| runner.resources.requests.cpu | 250m | |
| runner.resources.requests.memory | 512Mi | |
| runner.nodeSelector | {} | Runner pod runs as root (required by myoung34/github-runner). It does NOT use the shared podSecurityContext. |
| runner.tolerations | [] | |
rbac
RBAC. Keep enabled — the CronJob will fail without these.
| Key | Default | Description |
| rbac.create | true | |
| rbac.reader.name | "" | default: <release>-reader |
| rbac.remediator.name | "" | default: <release>-remediator |
serviceAccount
| Key | Default | Description |
| serviceAccount.create | true | |
| serviceAccount.name | "" | default: <release>-sa |
| serviceAccount.annotations | {} | |
resources
Resource requests / limits (per CronJob pod).
| Key | Default | Description |
| resources.limits.cpu | 500m | |
| resources.limits.memory | 256Mi | |
| resources.requests.cpu | 50m | |
| resources.requests.memory | 64Mi | |
nodeSelector
Pod-level scheduling controls.
| Key | Default | Description |
| nodeSelector | {} | Pod-level scheduling controls. |
tolerations
| Key | Default | Description |
| tolerations | [] | |
affinity
| Key | Default | Description |
| affinity | {} | |
priorityClassName
| Key | Default | Description |
| priorityClassName | "" | |
podSecurityContext
Pod / container security context.
| Key | Default | Description |
| podSecurityContext.runAsNonRoot | true | |
| podSecurityContext.runAsUser | 65532 | |
| podSecurityContext.fsGroup | 65532 | |
| podSecurityContext.seccompProfile.type | RuntimeDefault | |
securityContext
| Key | Default | Description |
| securityContext.allowPrivilegeEscalation | false | |
| securityContext.readOnlyRootFilesystem | true | |
| securityContext.capabilities.drop | [ALL] | |
ai
AI tier (commercial / Srenix Enterprise) — recommendation-only AI for narration, fix proposals, multi-step plans, and Vault recovery runbooks. Every tier gates mutation behind human one-click approval. See docs/AI_TIERS.md and docs/DEPLOYMENT.md. DEPLOYMENT MODEL — purely additive. Setting ai.enabled=true does NOT touch the OSS watcher / diagnose / remediate workloads; they keep running the OSS…
| Key | Default | Description |
| ai.enabled | false | |
| ai.tier | t0 | t0 (narration), t1 (fix proposals), t2 (planner), or t3 (vault runbooks) |
| ai.endpoint | "" | REQUIRED if enabled; OpenAI-compatible base URL, e.g. "https://mcp.baisoln.com/gpu-ai/v1" |
| ai.model | "" | REQUIRED if enabled; e.g. "qwen3.6-35b-a3b-fp8" |
| ai.interval | 60s | poll cadence; AI tiers fire only on NEW diagnostics each cycle (natural LLM-cost cap) |
| ai.replicas | 1 | Phase 2.F — HA aiwatch via leader-election. Default 1 (single-replica noop path; byte-identical to pre-2.F). When >1, the chart turns on --leader-election=true and binds the SA to the Lease Role; exactly one replica runs tick() at a time, failover within ~30s on lease loss. |
| ai.digestPinAttestation.secretName | "" | e.g. "srenix-digest-pin-attestation-key" |
| ai.digestPinAttestation.secretKey | attestation.key | |
| ai.digestPinAttestation.keyID | srenix-digest-pin | |
| ai.metrics.addr | "" | e.g. ":9090" to enable |
| ai.metrics.port | 9090 | container/service port (must match the :NNNN in addr) |
| ai.metrics.serviceMonitor.enabled | false | set true when prometheus-operator is installed |
| ai.metrics.serviceMonitor.interval | 30s | scrape interval |
| ai.metrics.serviceMonitor.scrapeTimeout | 10s | |
| ai.metrics.grafanaDashboard.enabled | false | When true, ship a ConfigMap with the Srenix overview dashboard tagged for kube-prometheus-stack's sidecar discovery. |
| ai.metrics.grafanaDashboard.extraLabels | {} | e.g. {"app.kubernetes.io/instance": "monitoring"} for non-default Grafana |
| ai.metrics.prometheusRule.enabled | false | When true, ship a PrometheusRule with the canary alerts: ChaWatcherStuck, ChaBreakerOpen, ChaAutonomyRejectionSpike. |
| ai.metrics.prometheusRule.labels | {} | e.g. {"prometheus": "k8s"} when prometheus-operator uses non-default selectors |
| ai.allowSaas | false | set true to allow api.openai.com / api.anthropic.com endpoints |
| ai.llmFixerMatcher | false | t1+: use the LLM-classified fixer matcher (falls back to keyword on error) |
| ai.auditLog | "" | AI-event audit sink: "" (off), "-" (stdout), or "/path/to.jsonl". Set for t1+ compliance. |
| ai.approvalServerUrl | "" | t1+: base URL of the approval-server (e.g. https://srenix-approve.example.com). When set, T1/T2 proposals emit a signed one-click click-to-fix link. Pair with approval.enabled + approval.ingress.host. |
| ai.image.repository | docker4zerocool/srenix-enterprise | The commercial Srenix Enterprise image for the aiwatch Deployment. Tag defaults to "v<AppVersion>" (srenix-enterprise images carry a leading "v"). |
| ai.image.tag | "" | default = "v" + .Chart.AppVersion (e.g. v0.1.0-alpha.1) |
| ai.image.pullPolicy | IfNotPresent | |
| ai.resources | {} | defaults to top-level resources if unset |
| ai.apiKey.secretName | "" | K8s secret holding the LLM bearer token (ESO-managed); empty = no-auth in-cluster vLLM |
| ai.apiKey.secretKey | API_KEY | key within the secret |
| ai.apiKey.envName | AI_API_KEY | env var the binary reads (matches --ai-api-key-env default) |
| ai.apiKey.header | "" | HTTP header for the key; "" = "Authorization: Bearer"; set "X-API-Key" for Kong key-auth |
| ai.t3.vaultAllowedPrefixes | [] | e.g. ["secret/data/srenix-recovery/"] |
| ai.memory.enabled | false | |
| ai.memory.image.repository | qdrant/qdrant | |
| ai.memory.image.tag | v1.12.4 | |
| ai.memory.image.pullPolicy | IfNotPresent | |
| ai.memory.storage.size | 5Gi | |
| ai.memory.storage.className | "" | default storage class if empty (use nfs-client/cephfs on mixed GPU nodes) |
| ai.memory.resources | {} | |
| ai.memory.embeddings.endpoint | https://mcp.baisoln.com/gpu-ai/v1 | OpenAI-compatible /embeddings |
| ai.memory.embeddings.model | qwen3-embedding-0.6b | |
| ai.memory.storeUrl | "" | default: http://<release>-rag.<ns>.svc:6333 |
| ai.memory.topK | 5 | how many prior resolutions to retrieve per finding |
| ai.rateLimit.actionsPerHour | 5 | AI-proposed actions/hour budget |
| ai.rateLimit.tokensPerHour | 1000000 | |
| ai.circuitBreaker.consecutiveFailures | 3 | |
| ai.audit.destination | events | events, loki, or otlp (reserved for a future structured sink; auditLog above drives the binary today) |
| ai.audit.lokiURL | "" | |
| ai.audit.otlpEndpoint | "" | |
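A sketch of the smallest AI configuration, narration only (t0). The endpoint URL and the API-key Secret name are hypothetical placeholders; the model name is the table's example.

```yaml
# Hypothetical values override: AI tier t0 against an in-cluster endpoint.
ai:
  enabled: true
  tier: t0
  endpoint: "http://vllm.ai.svc.cluster.local:8000/v1"  # placeholder OpenAI-compatible URL
  model: "qwen3.6-35b-a3b-fp8"
  interval: 60s
  allowSaas: false
  apiKey:
    secretName: srenix-ai-api-key   # hypothetical; leave "" for a no-auth endpoint
    secretKey: API_KEY
  rateLimit:
    actionsPerHour: 5
    tokensPerHour: 1000000
```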
approval
Approval-server sidecar: holds the JWT signing key and terminates click-to-fix URLs. Required when ai.tier is t1, t2, or t3 (chart-level validation should enforce this). For ai.tier=t0 (narration only), approval is unused.
| Key | Default | Description |
| approval.enabled | false | |
| approval.replicas | 1 | |
| approval.image.repository | docker4zerocool/srenix-enterprise | |
| approval.image.tag | "" | default = .Chart.AppVersion |
| approval.image.pullPolicy | IfNotPresent | |
| approval.signingKey.secretName | srenix-approval-signing-key | |
| approval.silence.shortDuration | 24h | subject-scoped "Silence 24h" |
| approval.silence.longDuration | 2160h | class-scoped "Silence class (90d)"; 2160h = 90 days |
| approval.store.backend | "" | "" (= inmemory) or configmap |
| approval.store.namespace | "" | default = release namespace |
| approval.store.replayConfigMap | srenix-approval-replay | |
| approval.store.runbookConfigMap | srenix-approval-runbooks | |
| approval.ingress.enabled | false | |
| approval.ingress.host | "" | REQUIRED if approval.ingress.enabled; e.g. srenix-approve.example.com |
| approval.ingress.ingressClassName | "" | |
| approval.ingress.annotations | {} | |
| approval.ingress.tls.enabled | true | |
| approval.ingress.tls.secretName | "" | default: <release>-approval-server-tls |
| approval.networkPolicy.enabled | false | |
| approval.networkPolicy.gatewayNamespaceSelector | {} | REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway} |
| approval.resources.limits.cpu | 200m | |
| approval.resources.limits.memory | 128Mi | |
| approval.resources.requests.cpu | 20m | |
| approval.resources.requests.memory | 32Mi | |
| approval.nodeSelector | {} | |
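A sketch of pairing a t1 tier with the approval server, following the note that ai.approvalServerUrl goes together with approval.enabled and approval.ingress.host. The hostname is the table's example; the ingress class and the AI endpoint are assumptions.

```yaml
# Hypothetical values override: t1 proposals with click-to-fix approval.
ai:
  enabled: true
  tier: t1
  endpoint: "http://vllm.ai.svc.cluster.local:8000/v1"  # placeholder URL
  model: "qwen3.6-35b-a3b-fp8"
  auditLog: "-"                                          # stdout; set for t1+ compliance
  approvalServerUrl: "https://srenix-approve.example.com"
approval:
  enabled: true
  signingKey:
    secretName: srenix-approval-signing-key              # chart default
  store:
    backend: configmap                                   # "" would mean inmemory
  ingress:
    enabled: true
    host: srenix-approve.example.com
    ingressClassName: nginx                              # hypothetical ingress class
    tls:
      enabled: true
```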
dashboard
P6.6 — the read-only hosted dashboard. A separate Deployment + Service running the `srenix-enterprise dashboard` subcommand: a server-rendered HTML view of findings (live DriftReports), pending approvals, and remediation history. It NEVER mutates the cluster — for any action it links out to the EXISTING approval-server endpoints (built from approvalBaseURL). RBAC posture: a DEDICATED…
| Key | Default | Description |
| dashboard.enabled | false | |
| dashboard.replicas | 1 | |
| dashboard.image.repository | docker4zerocool/srenix-enterprise | |
| dashboard.image.tag | "" | default = v<.Chart.AppVersion> |
| dashboard.image.pullPolicy | IfNotPresent | |
| dashboard.approvalBaseURL | "" | approvalBaseURL is REQUIRED when dashboard.enabled=true: the externally reachable base URL of the approval-server (e.g. https://srenix-approve.example.com). The /approvals page builds Approve/Deny/Ignore links against it, so the action links work. |
| dashboard.authHeader | X-Forwarded-User | authHeader is the HTTP header carrying the oauth2-proxy-authenticated operator identity, displayed in the page header. Default matches the approval-server convention. |
| dashboard.auditLogPath | "" | auditLogPath optionally points at the approval-server's tamper-evident audit JSONL (its --ai-audit-log output) to power /history and /approvals. Empty = those pages render an empty state. |
| dashboard.historyLimit | 100 | max rows on /history (most-recent first) |
| dashboard.approvalsLimit | 50 | max rows on /approvals (most-recent first) |
| dashboard.ingress.enabled | false | |
| dashboard.ingress.host | "" | REQUIRED if dashboard.ingress.enabled; e.g. srenix-dashboard.example.com |
| dashboard.ingress.ingressClassName | "" | |
| dashboard.ingress.annotations | {} | e.g. oauth2-proxy / cert-manager annotations |
| dashboard.ingress.tls.enabled | true | |
| dashboard.ingress.tls.secretName | "" | default: <release>-dashboard-tls |
| dashboard.networkPolicy.enabled | false | |
| dashboard.networkPolicy.gatewayNamespaceSelector | {} | REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway} |
| dashboard.resources.limits.cpu | 200m | |
| dashboard.resources.limits.memory | 128Mi | |
| dashboard.resources.requests.cpu | 20m | |
| dashboard.resources.requests.memory | 32Mi | |
| dashboard.nodeSelector | {} | |
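A sketch of enabling the read-only dashboard. The hostnames are the table's examples; the ingress class is a hypothetical placeholder.

```yaml
# Hypothetical values override for the dashboard.
dashboard:
  enabled: true
  approvalBaseURL: "https://srenix-approve.example.com"  # required when enabled
  authHeader: X-Forwarded-User                            # chart default
  historyLimit: 100
  approvalsLimit: 50
  ingress:
    enabled: true
    host: srenix-dashboard.example.com
    ingressClassName: nginx                               # hypothetical ingress class
    tls:
      enabled: true
```

With auditLogPath left empty, the table notes that /history and /approvals render an empty state.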
protectedNamespaces
Protected namespaces — the act-side no-touch list. The compiled-in floor (kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system) is NOT configurable and can never be removed. `extra` APPENDS namespaces to that floor: each entry is rendered as SRENIX_PROTECTED_NAMESPACES_EXTRA (comma-separated) on the watcher, diagnose, remediate, AND aiwatch containers, so…
| Key | Default | Description |
| protectedNamespaces.extra | [] | |
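A sketch of appending namespaces to the compiled-in floor. The two namespace names are hypothetical.

```yaml
# Hypothetical values override: extend the no-touch list.
protectedNamespaces:
  extra:
    - payments      # hypothetical namespace
    - monitoring    # hypothetical namespace
```

The floor entries (kube-system, vault, and so on) do not need to be repeated, since extra only appends.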
gatekeeper
| Key | Default | Description |
| gatekeeper.install | false | set true if Gatekeeper is installed in the cluster |
| gatekeeper.constraints.protectedNamespaces | [kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system] | |
analyzers
Drift-class analyzers added in v1.7 (Workstreams B1+B2+B3 from the AI SRE positioning plan) + v1.8 (Workstream B4). Default to ON so existing installs get the new signal automatically; flip individual entries to false on clusters that don't host the targeted asset class.
| Key | Default | Description |
| analyzers.secretKeyMissing.enabled | true | Workloads referencing a Secret key that does not exist (the CreateContainerConfigError root cause). |
| analyzers.failingExternalSecrets.enabled | true | ExternalSecrets stuck SecretSyncedError / not Ready (ESO sync chain broken). |
| analyzers.proactiveSecretKeyCheck.enabled | true | Proactive secretKeyRef validation BEFORE pods restart and hit CreateContainerConfigError. |
| analyzers.unprovisionedSecret.enabled | true | Secrets referenced by workloads that no controller (ESO, cert-manager, Helm) ever provisioned. |
| analyzers.imagePullAuth.enabled | true | ImagePullBackOff caused by missing/invalid pull credentials (namespace missing pull secret or deployment missing imagePullSecrets). |
| analyzers.certExpiry.enabled | true | cert-manager Certificates close to / past expiry or stuck not Ready. |
| analyzers.tlsSecretMismatch.enabled | true | Ingress TLS secretName pointing at a stale/mismatched Secret while a healthy Certificate targets a different Secret for the same host. (The optional auto-FIXER for this finding is gated separately under fixers.tlsSecretMismatch.) |
| analyzers.gitopsDrift.enabled | true | GitOps drift — Argo CD Application + Flux Kustomization + Flux HelmRelease. Surfaces controllers stuck OutOfSync, Degraded, NotReady, BuildFailed, UpgradeFailed, etc. Default 10-minute grace period (controllers are routinely reconciling). Set to false on clusters without… |
| analyzers.workloadStateDrift.enabled | true | State-tier drift — CNPG cluster phase / follower lag / primary switchover stuck; StatefulSet ordinal-zero stuck. Goes deeper than the basic "X/Y ready" probe. Set to false on clusters that don't host CNPG and don't run StatefulSets. |
| analyzers.rbacDrift.enabled | true | RBAC drift — wildcard-verb roles + unbound ServiceAccounts mounted by Pods. Skips system canonical roles (cluster-admin, system:*) and the default SA in every namespace. Set to false if your cluster's RBAC posture is managed entirely by an upstream IaC system that… |
| analyzers.configDrift.enabled | true | Config drift (v1.8) — CRD multi-storedVersions (storage migration pending), Deployment rollouts stuck past the grace window (generation skew or updatedReplicas trailing spec.replicas), and Pods of the same Deployment carrying disagreeing checksum/config annotations (rolling… |
| analyzers.capacityDrift.enabled | true | Capacity drift (v1.8) — HPA pinned at maxReplicas past the saturation grace (24h default; workload is chronically under-provisioned), HPA pinned at minReplicas past the idle grace (30d default; HPA is not load-driven), HPA AbleToScale=False past grace (typically ResourceQuota or… |
| analyzers.securityDrift.enabled | true | Security drift (v1.8) — three observational signals: user namespaces with no pod-security.kubernetes.io/enforce label (apiserver applies the cluster-wide default, typically privileged) or with enforce=privileged explicitly (most-permissive PSS profile); Pods whose containers… |
| analyzers.disruptionDrift.enabled | true | Disruption-tier drift (v1.21) — ResourceQuota near-exhaustion, PodDisruptionBudget blocking voluntary disruption (drains stuck), and Jobs past their activeDeadline / backoffLimit. Each sub-signal handles its own GVR-absence case. Set to false to silence. |
| analyzers.oomkillRecurrence.enabled | true | Workload-tier (v1.22) — containers OOMKilled repeatedly (a sizing problem masquerading as a crash loop). Set to false to silence. |
| analyzers.pvOrphan.enabled | true | Workload-tier (v1.22) — Released/Available PersistentVolumes with no bound PVC (a cost leak). Set to false to silence. |
| analyzers.cronjobStuck.enabled | true | Workload-tier (v1.22) — CronJobs that have not scheduled a Job within their expected window (silent scheduling failure). Set to false to silence. |
| analyzers.logPatternMatcher.enabled | true | v1.25 — scans recent Events for high-signal failure messages (ImagePullBackOff, OOMKilled, VolumeAttachFailed, ProbeFailed, Forbidden). Deduplicated to one finding per (object, pattern) pair. Set to false to silence. |
| analyzers.netpolProposer.enabled | true | Phase 2d-β — on NetworkPolicy-enforcing CNIs, emits one warning per uncovered namespace with a deterministic ProposedPolicyYAML. Silent on Flannel-only k3s. Set to false to silence. |
| analyzers.dnsChainDrift.enabled | true | v1.10 — verifies the DNS chain (Service → Ingress → external host) for the seeded endpoint hostnames. Runs the K8s-chain hops with no config; external-hop verification requires externalDNS.cloudflare. Set to false to silence. |
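Because every analyzer defaults to ON, an override file only needs to list the ones being switched off. The sketch below assumes a cluster that runs neither a GitOps controller nor CNPG/StatefulSets; which analyzers to silence is a per-cluster judgment, and this selection is only illustrative.

```yaml
# Illustrative override: silence analyzers whose asset class is absent.
analyzers:
  gitopsDrift:
    enabled: false        # assumed: no Argo CD or Flux on this cluster
  workloadStateDrift:
    enabled: false        # assumed: no CNPG and no StatefulSets
  # Every analyzer not listed here keeps its default (true).
```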
investigator
Layer-2 investigator (deterministic, rule-based; ships in OSS). Defaults ON. The paid binary may replace it with an LLM-backed implementation. Set enabled: false to disable (SRENIX_INVESTIGATOR=off).
| Key | Default | Description |
| investigator.enabled | true | |
probes
M2 probe-class additions (v1.8). Each defaults to ON and AUTO-SKIPS when its CRD is absent (Kong / ArgoCD / Velero) or no-ops on an empty list (HPA), so leaving them on costs nothing on clusters that don't host the asset. Set enabled: false only to silence a probe on a cluster that DOES host the CRD when you don't want Srenix watching it.
| Key | Default | Description |
| probes.ceph.enabled | true | Rook-Ceph cluster health (HEALTH_OK / OSD status). Auto-skips when the rook-ceph CRDs are absent. |
| probes.nodes.enabled | true | Node Ready conditions across the cluster. |
| probes.postgres.enabled | true | PostgreSQL (CNPG) cluster health. Auto-skips when the CNPG CRDs are absent. |
| probes.pvcs.enabled | true | PersistentVolumeClaims stuck Pending / Lost. |
| probes.criticalWorkloads.enabled | true | Critical Services probe — the curated workload target list (defaults merged with SRENIX_CRITICAL_SERVICES / the probe-critical annotation). Emits SRENIX_PROBE_CRITICAL_WORKLOADS=off when disabled. |
| probes.endpoints.enabled | true | External HTTP(S) endpoint reachability for discovered/seeded Ingress hostnames. |
| probes.kong.enabled | true | Kong ingress — KongPlugin / KongConsumer / Kong proxy readiness drift. Auto-skips when configuration.konghq.com CRDs are absent. |
| probes.hpaScaling.enabled | true | HorizontalPodAutoscaler scaling health (distinct from the v1.8 capacityDrift analyzer's longitudinal signals). No-ops on an empty HPA list. |
| probes.argocdApp.enabled | true | Argo CD Application sync/health (probe-level snapshot, distinct from the gitopsDrift analyzer). Auto-skips when argoproj.io CRDs are absent. |
| probes.velero.enabled | true | Velero backup freshness / last-backup status. Auto-skips when velero.io CRDs are absent. |
| probes.nodePressure.enabled | true | Node MemoryPressure / DiskPressure / PIDPressure conditions. |
| probes.daemonsets.enabled | true | DaemonSets with unavailable/misscheduled pods. |
| probes.pendingPods.enabled | true | Pods stuck Pending past the scheduling grace window. |
| probes.crashloop.enabled | true | Containers in CrashLoopBackOff. |
| probes.etcd.enabled | true | etcd member health / quorum. |
| probes.failedMounts.enabled | true | Pods blocked on FailedMount / FailedAttachVolume events. |
| probes.kongRoutes.enabled | true | Kong-managed Ingress backend-Endpoint + plugin/consumer reference readiness. Silent on clusters without Kong-managed Ingresses. |
| probes.gpuNodes.enabled | true | NotReady / cordoned / zero-allocatable GPU nodes. Silent on CPU-only clusters. |
| probes.traefikRoutes.enabled | true | k3s Traefik IngressRoute backend readiness. Auto-skips on non-k3s or when the Traefik CRD is absent. |
| probes.k3sLocalPathStorage.enabled | true | k3s local-path-provisioner PVC health. No-ops when there are no local-path PVCs. |
| probes.k3sDatastore.enabled | true | k3s datastore (sqlite/etcd) health. Auto-skips on non-k3s. Set SRENIX_K3S_SINGLE_NODE_OK=true (via watcher.extraEnv) to suppress the single-node datastore warning on intentional single-node clusters. |
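The following sketch shows the two kinds of override described in this table: disabling a probe on a cluster that hosts the CRD, and suppressing the single-node datastore warning through `watcher.extraEnv`. The shape of `watcher.extraEnv` is an assumption here (a Kubernetes-style list of name/value pairs); check the chart's values.yaml for the actual schema.

```yaml
probes:
  kong:
    enabled: false   # cluster hosts Kong CRDs, but Srenix should not watch them

watcher:
  # Assumed schema: a list of container env entries.
  extraEnv:
    - name: SRENIX_K3S_SINGLE_NODE_OK
      value: "true"  # quoted, since env values must be strings
```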
fixers
Optional fixers — off by default. Each entry adds RBAC verbs and exposes an env var the binary reads to enable the matching Fixer registration.
| Key | Default | Description |
| fixers.tlsSecretMismatch.enabled | false | Patches Ingress.spec.tls[].secretName when Srenix detects a stale Secret plus a healthy cert-manager Certificate in the same namespace targeting a different Secret for the same host. GitOps-managed Ingresses (ArgoCD / Flux / Helm release labels) are SKIPPED automatically — the… |
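A minimal opt-in for the fixer might look like the fragment below. The analyzer key is restated only to make the pairing visible; it already defaults to true, and whether the fixer strictly requires the analyzer to be enabled is an assumption rather than something this reference states.

```yaml
analyzers:
  tlsSecretMismatch:
    enabled: true    # default; shown for clarity (detection side)

fixers:
  tlsSecretMismatch:
    enabled: true    # opt in to the auto-fix; adds RBAC verbs per the note above
```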
externalDNS
External-DNS verification for the DNSChainDrift analyzer. When enabled, the watcher + diagnose containers receive SRENIX_CLOUDFLARE_TOKEN via a secretKeyRef (NEVER a literal) so the analyzer can verify the external DNS hop (Cloudflare record → Ingress host). Without this the analyzer still runs the in-cluster chain hops and emits "external DNS hop not verified". Mirrors the operator-managed…
| Key | Default | Description |
| externalDNS.cloudflare.enabled | false | |
| externalDNS.cloudflare.apiTokenSecretRef | {} | e.g. {name: srenix-cloudflare-token, key: token} |
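Put together, enabling the external hop could look like the sketch below. It assumes a Secret named `srenix-cloudflare-token` with a `token` key already exists in the release namespace; the Secret name and key follow the example in the table and can be changed to match your own.

```yaml
externalDNS:
  cloudflare:
    enabled: true
    # Reference to an existing Secret; the token itself never appears in values.
    apiTokenSecretRef:
      name: srenix-cloudflare-token
      key: token
```

One way to create such a Secret is `kubectl create secret generic srenix-cloudflare-token --from-literal=token=<your-token>` in the namespace where the chart is installed.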