Helm Reference

Every values.yaml option, generated from chart v0.1.0-alpha.1 (synced 2026-06-26). Full chart source at charts/agentic-sre/ in the GitHub repo.

Install with the defaults first; most options are safe as shipped. Add cloud probes, ticketing, and AI tiers incrementally as you need them. Because this reference is generated directly from the chart's values.yaml, every key listed here exists in the chart.

image

Key	Default	Description
image.repository	docker4zerocool/agentic-sre	Docker Hub is the canonical publish target. docker4zerocool is the org's operational registry (same account that hosts every ai-* and mcp-* image the platform depends on); no extra pull-secret config needed for operators who already use it. GHCR is published as a mirror by…
image.tag	""	default = .Chart.AppVersion (e.g. v0.1.0-alpha.1)
image.pullPolicy	""	pullPolicy is left empty so the `srenix.pullPolicy` helper picks it automatically: Always for mutable tags (latest / main / dev / `*-latest`) and IfNotPresent for semver-pinned tags. Override only when you specifically need to force one or the other.
image.pullSecrets	[]	e.g. [{name: dockerhub-secret}]
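
As a minimal sketch of the keys above, the override below pins a semver tag and adds a pull secret. The file name is hypothetical, the tag reuses the chart version shown on this page, and `dockerhub-secret` is the illustrative Secret name from the table; substitute your own.

```yaml
# values-image.yaml (hypothetical file name)
image:
  repository: docker4zerocool/agentic-sre
  tag: "v0.1.0-alpha.1"   # semver-pinned tag; documented helper behavior is IfNotPresent
  pullPolicy: ""          # left empty so the srenix.pullPolicy helper picks the policy
  pullSecrets:
    - name: dockerhub-secret   # illustrative name; the Secret must exist in the release namespace
```

It could be applied with something like `helm upgrade --install srenix charts/agentic-sre -f values-image.yaml`, where the release name `srenix` is an assumption.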

diagnose

Diagnose CronJob — always enabled.

Key	Default	Description
diagnose.enabled	true
diagnose.schedule	0 9 * * *	Daily 09:00 UTC. Same default as the bash version.
diagnose.successfulJobsHistoryLimit	3
diagnose.failedJobsHistoryLimit	3
diagnose.concurrencyPolicy	Forbid
diagnose.backoffLimit	1	backoffLimit caps how many times a failed pod within the Job is retried before the Job itself is marked Failed. K8s default is 6, which compounds with the per-pod default of 5 minutes of backoff — a hung run can keep spawning pods for over an hour. Srenix's diagnose is…
diagnose.activeDeadlineSeconds	300	activeDeadlineSeconds caps the Job's total runtime (counted across pod restarts). Raised 120 → 300 in v1.8.1: the v1.8 analyzer + M2-probe set adds a meaningful number of cluster List calls (CRDs, HPAs, all namespaces + pods + NetworkPolicies, Kong/Velero/Argo CRDs), and a live…
diagnose.format	daily	daily | text | json. daily posts to #healthinfo; text/json are for ad-hoc runs
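
For illustration, an override that moves the daily run earlier while keeping the documented retry and deadline caps might look like this. The schedule value is an arbitrary example, not a recommendation.

```yaml
diagnose:
  enabled: true
  schedule: "0 6 * * *"        # example only: daily 06:00 instead of the 09:00 default
  concurrencyPolicy: Forbid    # default; shown for clarity
  backoffLimit: 1              # default; keeps a hung run from respawning pods
  activeDeadlineSeconds: 300   # default; caps total Job runtime
  format: daily                # daily | text | json
```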

remediation

Remediate CronJob — opt-in.

Key	Default	Description
remediation.enabled	false
remediation.schedule	*/30 * * * *	Every 30 min. Off by default; turn on once your team trusts the fixers.
remediation.successfulJobsHistoryLimit	3
remediation.failedJobsHistoryLimit	3
remediation.concurrencyPolicy	Forbid
remediation.backoffLimit	1	See diagnose.backoffLimit comment. Remediation Jobs mutate cluster state so we especially do NOT want them retrying a hung-mid-mutation run.
remediation.activeDeadlineSeconds	300	See diagnose.activeDeadlineSeconds comment. Remediation runs the whitelisted fixers AND a full re-probe (which now includes the v1.8 analyzer + M2-probe set), so it inherits the same ~157s read cost. Raised 120 → 300 in v1.8.1 to match diagnose.
remediation.dryRun	false	When true, fixers report Refused without mutating cluster state.
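
One cautious way to adopt the Remediate CronJob, sketched from the keys above, is to enable it with dryRun first and review what the fixers report before allowing mutation. This rollout order is a suggestion, not something the chart enforces.

```yaml
remediation:
  enabled: true
  schedule: "*/30 * * * *"   # every 30 minutes
  dryRun: true               # fixers report Refused and do not mutate cluster state
  backoffLimit: 1            # default; avoids retrying a run that hung mid-mutation
  activeDeadlineSeconds: 300
```

Once the reported actions look right, setting `remediation.dryRun` back to `false` lets the fixers act.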

slack

Slack three-channel routing. Each channel maps to a dedicated incoming-webhook Secret. All URLs must come from pre-existing Secrets in the install namespace; prefer wiring them via External Secrets Operator from Vault (see SECURITY.md).

alerts → #ceph-alerts: event-driven, Srenix acted (auto-fixed issues)
critical → #ceph-critical: event-driven, human action required
healthinfo → #healthinfo:…

Key	Default	Description
slack.alerts.enabled	false
slack.alerts.secretName	""	e.g. "srenix-slack-ceph-alerts"
slack.alerts.secretKey	WEBHOOK_URL
slack.critical.enabled	false
slack.critical.secretName	""	e.g. "srenix-slack-ceph-critical"
slack.critical.secretKey	WEBHOOK_URL
slack.healthinfo.enabled	false
slack.healthinfo.secretName	""	e.g. "srenix-slack-healthinfo"
slack.healthinfo.secretKey	WEBHOOK_URL
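
A sketch of wiring all three channels, using the example Secret names from the table. It assumes those Secrets already exist in the install namespace and each holds the webhook URL under the `WEBHOOK_URL` key.

```yaml
slack:
  alerts:
    enabled: true
    secretName: srenix-slack-ceph-alerts     # example name from the table
    secretKey: WEBHOOK_URL
  critical:
    enabled: true
    secretName: srenix-slack-ceph-critical   # example name from the table
    secretKey: WEBHOOK_URL
  healthinfo:
    enabled: true
    secretName: srenix-slack-healthinfo      # example name from the table
    secretKey: WEBHOOK_URL
```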

driftReport

DriftReport CRD: kubectl-queryable diagnostic objects. Each cron tick upserts one DriftReport per active diagnostic; CRs whose subject is no longer reported are deleted automatically. Inspect with `kubectl get driftreports -A`. Disable if you don't want CRD-shaped output (Slack + JSON still work).

Key	Default	Description
driftReport.enabled	true

resolutionRecord

ResolutionRecord CRD: append-only outcome log (one CR per applied and verified remediation). This is the durable system of record that the RAG memory layer embeds and retrieves. Inspect with `kubectl get resolutionrecords -l srenix.ai/verified=cleared`.

Key	Default	Description
resolutionRecord.enabled	true

silence

Silence — operator-controlled noise suppression. When `enabled`, the watcher fetches active Silence CRs once per cycle and drops matched diagnostics before downstream emission (DriftReport / Slack / Alertmanager / ticketing). Operators create Silences with `kubectl create -f silence.yaml` in any namespace; the matching is cluster-wide. Defaults to ON for new installs because installing the CRD on…

Key	Default	Description
silence.installCRD	true	ship the CRD via the chart
silence.enabled	true	provision the watcher's ClusterRole + binding

operator

Operator (Phase 1) — ships the AgenticSRE CRD shape only. The srenix-operator binary that reconciles AgenticSRE resources into the watcher Deployment / CronJobs / ServiceAccount lands in Phase 2 (next release). Installing the CRD alone is harmless: the resource is just queryable state. Default ON for new installs so the CRD is in place by the time the operator binary ships and clusters that opt…

Key	Default	Description
operator.installCRD	true
operator.enabled	false	Phase 1b (v1.8): the srenix-operator controller-runtime manager. Default OFF: existing chart-managed installs continue to work unchanged. Flip to true to deploy the operator binary; the operator only takes over resources named by an AgenticSRE CR (which operators create…
operator.replicas	1	Operator replicas. 1 is sufficient because the manager uses lease-based leader-election; additional replicas stand by.
operator.resources	{}

vaultProbe

Vault-probe — closes the L1 stale-Ready window: queries Vault directly to verify each ExternalSecret's referenced path + property still exists, BEFORE the ESO controller's next refresh marks itself not-Ready. Catches the case where someone edits Vault but pods stay alive on cached Secret data and the next pod restart fails with CreateContainerConfigError. Privacy contract: srenix never reads…

Key	Default	Description
vaultProbe.enabled	false
vaultProbe.address	""	e.g. "https://vault.svc.cluster.local:8200"
vaultProbe.kvMount	secret	KV-v2 mount path; ESO `data[].remoteRef.key` is mount-relative
vaultProbe.auth.method	kubernetes	kubernetes | token
vaultProbe.auth.role	""	required when method=kubernetes; the Vault role bound to the srenix SA
vaultProbe.auth.tokenSecretRef.name	""	K8s Secret holding a Vault token
vaultProbe.auth.tokenSecretRef.key	token	key within that Secret
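
A minimal sketch of enabling the probe with Kubernetes auth. The address is the example from the table, and the role name `srenix-vault-probe` is hypothetical; it must match a Vault role bound to the srenix ServiceAccount in your environment.

```yaml
vaultProbe:
  enabled: true
  address: "https://vault.svc.cluster.local:8200"   # example address from the table
  kvMount: secret                                    # KV-v2 mount; ESO remoteRef keys are relative to it
  auth:
    method: kubernetes
    role: srenix-vault-probe                         # hypothetical Vault role name
    # For method: token, leave role empty and point tokenSecretRef at a Secret instead:
    # tokenSecretRef:
    #   name: <your-token-secret>
    #   key: token
```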

alertmanager

Alertmanager integration — Kubernetes-native event routing. When enabled, Srenix posts the full active-issue state to Alertmanager each watcher cycle. Alertmanager handles: dedup, grouping, silencing, repeat intervals, and fan-out to all configured receivers (Slack, PagerDuty, Teams, email, webhook, …). This is the preferred model over direct Slack webhooks; the slack.alerts/critical fields…

Key	Default	Description
alertmanager.enabled	false
alertmanager.url	""	e.g. "http://alertmanager.pg.svc.cluster.local:9093"
alertmanager.clusterName	cluster	identifies this cluster in alert labels
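
As an illustration, the override below points Srenix at an in-cluster Alertmanager. The URL is the example from the table, and `prod-east` is a hypothetical cluster name.

```yaml
alertmanager:
  enabled: true
  url: "http://alertmanager.pg.svc.cluster.local:9093"   # example URL from the table
  clusterName: prod-east                                  # hypothetical; appears in alert labels
```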

cloud

Cloud probe framework — observe AWS / GCP / Azure resources alongside K8s state. Each provider is independently toggleable. Cloud probes do NOT fire on K8s event triggers — they run on `cadence` (default 10m) to protect cloud-API rate limits. Master switch: `cloud.enabled=false` disables EVERYTHING below and pays zero overhead (no SDK init, no probe registration, no extra RBAC). Required for…

Key	Default	Description
cloud.enabled	false
cloud.cadence	10m	min interval between cloud-probe runs
cloud.aws.enabled	false
cloud.aws.region	""	required; e.g. "us-east-1"
cloud.aws.auth.roleArn	""	ARN of the IAM role; auto-injected as eks.amazonaws.com/role-arn SA annotation
cloud.aws.probes.rds	true
cloud.aws.probes.ebs	true
cloud.aws.probes.eks	true
cloud.aws.probes.iam	true
cloud.aws.probes.alb	true
cloud.aws.probes.acm	true
cloud.aws.probes.kms	true
cloud.aws.probes.s3	true
cloud.aws.probes.vpc	true
cloud.gcp.enabled	false	GCP cloud probes shipped in v1.8 (M2): Cloud SQL, Persistent Disks, GKE control-plane + node pools, IAM service accounts, Subnets, LB backends, managed certs, GCS public access, KMS. NOTE: Cloud SQL storage-% is fetched via the Cloud Monitoring API (best-effort; "not measured"…
cloud.gcp.project	""
cloud.gcp.auth.serviceAccount	""	GSA email; auto-injected as iam.gke.io/gcp-service-account SA annotation
cloud.gcp.probes.cloudsql	true
cloud.gcp.probes.disks	true
cloud.gcp.probes.gke	true
cloud.gcp.probes.iam	true
cloud.gcp.probes.subnets	true
cloud.gcp.probes.lb	true
cloud.gcp.probes.certs	true
cloud.gcp.probes.gcs	true
cloud.gcp.probes.kms	true
cloud.gcp.subnetsSmallPrefixThreshold	0	Small-prefix threshold for the capacity-only subnets probe: an unmeasured subnet whose primary CIDR is smaller than /<threshold> is flagged as a warning. 0 = the probe default (/26 → 60 usable IPs). Raise (e.g. 28) to quiet intentionally tiny subnets, lower to be stricter.…
cloud.azure.enabled	false	Azure cloud probes shipped in v1.8 (M2): SQL databases, Disks, AKS control-plane + node pools, Managed Identities, Subnets, App Gateway backends, certs, Storage public access, Key Vaults. NOTE: SQL storage-% and App Gateway backend health are fetched via Azure Monitor metrics…
cloud.azure.subscriptionId	""
cloud.azure.resourceGroup	""	optional scope; empty = subscription-wide
cloud.azure.auth.clientId	""	AAD app client ID; auto-injected as azure.workload.identity/client-id SA annotation
cloud.azure.probes.sql	true
cloud.azure.probes.disks	true
cloud.azure.probes.aks	true
cloud.azure.probes.identities	true
cloud.azure.probes.subnets	true
cloud.azure.probes.appgw	true
cloud.azure.probes.certs	true
cloud.azure.probes.storage	true
cloud.azure.probes.keyvaults	true
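
A sketch of turning on only the AWS provider with a subset of probes. The role ARN (account ID and role name) is a placeholder, and which probes to disable is an arbitrary example.

```yaml
cloud:
  enabled: true        # master switch; false disables everything below
  cadence: 10m         # minimum interval between cloud-probe runs
  aws:
    enabled: true
    region: us-east-1
    auth:
      # Placeholder ARN; rendered as the eks.amazonaws.com/role-arn SA annotation
      roleArn: "arn:aws:iam::123456789012:role/srenix-readonly"
    probes:
      rds: true
      eks: true
      kms: false       # example: skip probes for services you do not use
      acm: false
  gcp:
    enabled: false
  azure:
    enabled: false
```

Probes not listed keep their defaults, which are `true` for every AWS probe in the table.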

ticketing

Ticketing — open issue-tracker tickets for diagnostics Srenix cannot auto-remediate. Runs after each watcher cycle's DriftReport reconcile; the resulting ticket key is persisted onto DriftReport.status.ticket so subsequent cycles know not to re-open the same ticket. Sink failures NEVER abort the cycle — logged and skipped, same posture as Slack/Alertmanager. OSS ships the OpenProject sink…

Key	Default	Description
ticketing.enabled	false
ticketing.provider	openproject	openproject | jira | servicenow
ticketing.cluster	cluster	identifies this cluster in ticket bodies; matches alertmanager.clusterName
ticketing.labels	[srenix, auto-filed]
ticketing.mcpURL	http://mcp-openproject-server.mcp.svc:8006/mcp	FLAT shape: matches the operator CRD's `spec.ticketing.*` so YAML can move between Helm values and `kubectl patch srenix …` without reshaping. The legacy nested `ticketing.openproject.*` block below is the v1.19.x shape and is honored by the chart template as a fallback ONLY…
ticketing.project	""	e.g. "6" for the Demo project
ticketing.typeID	""	e.g. "36" for Task — REQUIRED
ticketing.closedStatusID	""	e.g. "82" for Closed status — needed for resolve-on-clear
ticketing.webURLPrefix	""	e.g. "https://op.example.com" — used to build operator-clickable URLs
ticketing.severityPriority.critical	""	e.g. "75" for Immediate
ticketing.severityPriority.warning	""	e.g. "74" for High
ticketing.severityPriority.info	""	e.g. "73" for Normal
ticketing.dryRun	false	Log intended ops without calling the MCP server
ticketing.resolveOnClear	true	Auto-close the ticket when its finding clears. Default ON (M2 shipped).
ticketing.commentInterval	1h	Debounce for comment-on-recurrence: at most one comment per window. A recurring or severity-changed finding comments on the EXISTING ticket instead of opening a new one. "0" disables recurrence comments.
ticketing.auth.enabled	false
ticketing.auth.secretName	srenix-ticketing-mcp	K8s Secret with the API key
ticketing.auth.secretKey	api-key	Key inside the Secret
ticketing.route	""	SRENIX_TICKETING_ROUTE — routes a finding to a sink (provider above selects the default sink).
ticketing.jira.url	""	SRENIX_JIRA_URL
ticketing.jira.project	""	SRENIX_JIRA_PROJECT (project key)
ticketing.jira.email	""	SRENIX_JIRA_EMAIL (token-auth account email)
ticketing.jira.issueType	""	SRENIX_JIRA_ISSUE_TYPE (e.g. "Bug")
ticketing.jira.priority.critical	""	SRENIX_JIRA_PRIORITY_CRITICAL
ticketing.jira.priority.warning	""	SRENIX_JIRA_PRIORITY_WARNING
ticketing.jira.priority.info	""	SRENIX_JIRA_PRIORITY_INFO
ticketing.jira.webUrlBase	""	SRENIX_JIRA_WEB_URL_BASE (clickable ticket URLs)
ticketing.jira.tokenSecret.name	""	K8s Secret name (ESO-synced)
ticketing.jira.tokenSecret.key	""	key inside the Secret
ticketing.servicenow.url	""	SRENIX_SERVICENOW_URL
ticketing.servicenow.user	""	SRENIX_SERVICENOW_USER (basic-auth username)
ticketing.servicenow.urgency.critical	""	SRENIX_SERVICENOW_URGENCY_CRITICAL
ticketing.servicenow.urgency.warning	""	SRENIX_SERVICENOW_URGENCY_WARNING
ticketing.servicenow.urgency.info	""	SRENIX_SERVICENOW_URGENCY_INFO
ticketing.servicenow.impact.critical	""	SRENIX_SERVICENOW_IMPACT_CRITICAL
ticketing.servicenow.impact.warning	""	SRENIX_SERVICENOW_IMPACT_WARNING
ticketing.servicenow.impact.info	""	SRENIX_SERVICENOW_IMPACT_INFO
ticketing.servicenow.webUrlBase	""	SRENIX_SERVICENOW_WEB_URL_BASE (clickable ticket URLs)
ticketing.servicenow.passwordSecret.name	""	SRENIX_SERVICENOW_PASSWORD — secretKeyRef ONLY
ticketing.servicenow.passwordSecret.key	""
ticketing.servicenow.bearerSecret.name	""	SRENIX_SERVICENOW_BEARER — secretKeyRef ONLY
ticketing.servicenow.bearerSecret.key	""
ticketing.openproject	{}	DEPRECATED (v1.20.0+): nested per-provider sub-trees. Honored only when the equivalent flat field above is unset. New installs should use the flat shape exclusively. Will be removed in the next major chart bump.
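
A sketch of the flat OpenProject shape, starting in dry-run. All numeric IDs are the examples from the table and are specific to one OpenProject instance; look up the project, type, status, and priority IDs in your own. `prod-east` is a hypothetical cluster name.

```yaml
ticketing:
  enabled: true
  provider: openproject
  cluster: prod-east                    # hypothetical; keep in sync with alertmanager.clusterName
  labels: [srenix, auto-filed]
  mcpURL: "http://mcp-openproject-server.mcp.svc:8006/mcp"
  project: "6"                          # example ID
  typeID: "36"                          # REQUIRED; example ID for Task
  closedStatusID: "82"                  # needed for resolve-on-clear
  webURLPrefix: "https://op.example.com"
  severityPriority:
    critical: "75"
    warning: "74"
    info: "73"
  dryRun: true                          # log intended ops without calling the MCP server
  resolveOnClear: true
  commentInterval: 1h
  auth:
    enabled: true
    secretName: srenix-ticketing-mcp
    secretKey: api-key
```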

watcher

Watcher — event-driven, long-running Deployment (Phase 1). Replaces polling latency (CronJob tick) with near-instant reaction to Kubernetes watch events. Debounces burst updates before re-running the full probe+analyzer stack. Slack dedup: only new/changed/resolved diagnostics produce a post. The seen-map is seeded from DriftReport CRs on startup so a pod restart does not re-flood Slack with all…

Key	Default	Description
watcher.enabled	false
watcher.replicas	1	replicas — watcher pod count. Default 1. Raising above 1 is ONLY safe with leaderElection.enabled=true (the SRENIX_LEADER_ELECTION path): otherwise every replica runs the probe/fix/post cycle and they race on DriftReports + double-post Slack. The chart FAILS the render when…
watcher.healthListen	:8081	healthListen — listen address for the always-on health server. GET /healthz returns 200 while the process is alive (independent of the webhook receiver), and the Deployment's liveness/readiness probes target it. The port number is derived from this address.
watcher.debounce	10s	Debounce window after a Kubernetes event.
watcher.resyncPeriod	10m	Periodic full re-diagnose regardless of events.
watcher.extraEnv	[]	e.g. [{name: SRENIX_CRITICAL_SERVICES, value: "pg/postgres,vault/vault"}, {name: SRENIX_K3S_SINGLE_NODE_OK, value: "true"}]
watcher.slack.postOnResolved	true	Post when a diagnostic disappears.
watcher.slack.repeatInterval	4h	Re-post still-active warning/info at this cadence (0=never).
watcher.slack.criticalRepeatInterval	""	Re-post still-active CRITICAL at this cadence (empty = fall back to repeatInterval). Use to keep criticals loud (e.g. "4h") while letting warnings calm down (e.g. set repeatInterval=24h).
watcher.remedy.enabled	false	Run auto-fixers after each cycle (live mutation).
watcher.remedy.dryRun	false	Evaluate fixers without mutating cluster state.
watcher.leaderElection.enabled	true
watcher.leaderElection.leaseName	srenix-watcher
watcher.leaderElection.leaseDuration	30s
watcher.leaderElection.renewDeadline	20s
watcher.leaderElection.retryPeriod	5s
watcher.triggers.prom.url	""	e.g. "http://alertmanager.pg.svc.cluster.local:9093"
watcher.triggers.prom.interval	30s	clamped to ≥5s by the trigger client
watcher.triggers.prom.alertNameFilter	[]	e.g. ["DiskFillUp","CertExpiringSoon"]; empty = any firing alert
watcher.triggers.webhook.listen	""	e.g. ":8090" — empty disables receiver
watcher.triggers.webhook.sources	[]	e.g. ["vault=SRENIX_WEBHOOK_VAULT_SECRET","cert-manager=SRENIX_WEBHOOK_CM_SECRET"]
watcher.triggers.webhook.service.enabled	false	render a ClusterIP Service for the receiver
watcher.triggers.webhook.service.port	8090
watcher.triggers.webhook.secretName	""	e.g. "srenix-webhook-secrets"
watcher.resources.limits.cpu	500m
watcher.resources.limits.memory	256Mi
watcher.resources.requests.cpu	50m
watcher.resources.requests.memory	64Mi
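
A sketch of a two-replica watcher, based on the keys above. Leader election stays enabled because the table warns that multiple replicas otherwise race on DriftReports and double-post to Slack. The repeat intervals follow the pattern suggested in the criticalRepeatInterval description.

```yaml
watcher:
  enabled: true
  replicas: 2                     # only safe with leaderElection.enabled=true
  leaderElection:
    enabled: true
    leaseName: srenix-watcher
  debounce: 10s
  resyncPeriod: 10m
  slack:
    postOnResolved: true
    repeatInterval: 24h           # let warnings calm down
    criticalRepeatInterval: 4h    # keep criticals loud
  extraEnv:
    - name: SRENIX_CRITICAL_SERVICES
      value: "pg/postgres,vault/vault"
```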

runner

GitHub Actions self-hosted runner — Mode A of the WS-C publish pipeline. Deploys myoung34/github-runner inside the cluster so the nightly publish-runs workflow can call `srenix diagnose --live` directly without needing an internet-reachable MinIO endpoint. Prerequisites: 1. A GitHub PAT (classic, repo scope) or fine-grained token (Actions:write) stored in Vault at: secret/t6-apps/srenix/config →…

Key	Default	Description
runner.enabled	false
runner.repoUrl	https://github.com/srenix-ai/agentic-sre
runner.labels	self-hosted,cluster	must match runs-on in publish-runs.yml
runner.name	srenix-cluster-runner
runner.image	myoung34/github-runner:ubuntu-jammy	Runner image — ubuntu-jammy = Ubuntu 22.04; ubuntu-noble = 24.04
runner.tokenSecretName	srenix-runner-token	Secret that holds ACCESS_TOKEN for runner registration. Created by the ExternalSecret below — do not create manually.
runner.tokenSecretKey	ACCESS_TOKEN
runner.resources.limits.cpu	2
runner.resources.limits.memory	4Gi
runner.resources.requests.cpu	250m
runner.resources.requests.memory	512Mi
runner.nodeSelector	{}	Runner pod runs as root (required by myoung34/github-runner). It does NOT use the shared podSecurityContext.
runner.tolerations	[]

rbac

RBAC. Keep enabled — the CronJob will fail without these.

Key	Default	Description
rbac.create	true
rbac.reader.name	""	default: <release>-reader
rbac.remediator.name	""	default: <release>-remediator

serviceAccount

Key	Default	Description
serviceAccount.create	true
serviceAccount.name	""	default: <release>-sa
serviceAccount.annotations	{}

resources

Resource requests / limits (per CronJob pod).

Key	Default	Description
resources.limits.cpu	500m
resources.limits.memory	256Mi
resources.requests.cpu	50m
resources.requests.memory	64Mi

nodeSelector

Pod-level scheduling controls.

Key	Default	Description
nodeSelector	{}

tolerations

Key	Default	Description
tolerations	[]

affinity

Key	Default	Description
affinity	{}

priorityClassName

Key	Default	Description
priorityClassName	""

podSecurityContext

Pod / container security context.

Key	Default	Description
podSecurityContext.runAsNonRoot	true
podSecurityContext.runAsUser	65532
podSecurityContext.fsGroup	65532
podSecurityContext.seccompProfile.type	RuntimeDefault

securityContext

Key	Default	Description
securityContext.allowPrivilegeEscalation	false
securityContext.readOnlyRootFilesystem	true
securityContext.capabilities.drop	[ALL]

ai

AI tier (commercial / Srenix Enterprise) — recommendation-only AI for narration, fix proposals, multi-step plans, and Vault recovery runbooks. Every tier gates mutation behind human one-click approval. See docs/AI_TIERS.md and docs/DEPLOYMENT.md. DEPLOYMENT MODEL — purely additive. Setting ai.enabled=true does NOT touch the OSS watcher / diagnose / remediate workloads; they keep running the OSS…

Key	Default	Description
ai.enabled	false
ai.tier	t0	t0 (narration) | t1 (fix proposals) | t2 (planner) | t3 (vault runbooks)
ai.endpoint	""	REQUIRED if enabled; OpenAI-compatible base URL, e.g. "https://mcp.baisoln.com/gpu-ai/v1"
ai.model	""	REQUIRED if enabled; e.g. "qwen3.6-35b-a3b-fp8"
ai.interval	60s	poll cadence; AI tiers fire only on NEW diagnostics each cycle (natural LLM-cost cap)
ai.replicas	1	Phase 2.F — HA aiwatch via leader-election. Default 1 (single-replica noop path; byte-identical to pre-2.F). When >1, the chart turns on --leader-election=true and binds the SA to the Lease Role; exactly one replica runs tick() at a time, failover within ~30s on lease loss.
ai.digestPinAttestation.secretName	""	e.g. "srenix-digest-pin-attestation-key"
ai.digestPinAttestation.secretKey	attestation.key
ai.digestPinAttestation.keyID	srenix-digest-pin
ai.metrics.addr	""	e.g. ":9090" to enable
ai.metrics.port	9090	container/service port (must match the :NNNN in addr)
ai.metrics.serviceMonitor.enabled	false	set true when prometheus-operator is installed
ai.metrics.serviceMonitor.interval	30s	scrape interval
ai.metrics.serviceMonitor.scrapeTimeout	10s
ai.metrics.grafanaDashboard.enabled	false	When true, ship a ConfigMap with the Srenix overview dashboard tagged for kube-prometheus-stack's sidecar discovery.
ai.metrics.grafanaDashboard.extraLabels	{}	e.g. {"app.kubernetes.io/instance": "monitoring"} for non-default Grafana
ai.metrics.prometheusRule.enabled	false	When true, ship a PrometheusRule with the canary alerts: ChaWatcherStuck, ChaBreakerOpen, ChaAutonomyRejectionSpike.
ai.metrics.prometheusRule.labels	{}	e.g. {"prometheus": "k8s"} when prometheus-operator uses non-default selectors
ai.allowSaas	false	set true to allow api.openai.com / api.anthropic.com endpoints
ai.llmFixerMatcher	false	t1+: use the LLM-classified fixer matcher (falls back to keyword on error)
ai.auditLog	""	AI-event audit sink: "" (off) | "-" (stdout) | "/path/to.jsonl". Set for t1+ compliance.
ai.approvalServerUrl	""	t1+: base URL of the approval-server (e.g. https://srenix-approve.example.com). When set, T1/T2 proposals emit a signed click-to-fix link. Pair with approval.enabled + approval.ingress.host.
ai.image.repository	docker4zerocool/srenix-enterprise	The commercial Srenix Enterprise image for the aiwatch Deployment. Tag defaults to "v<AppVersion>" (srenix-enterprise images carry a leading "v").
ai.image.tag	""	default = "v" + .Chart.AppVersion (e.g. v0.1.0-alpha.1)
ai.image.pullPolicy	IfNotPresent
ai.resources	{}	defaults to top-level resources if unset
ai.apiKey.secretName	""	K8s secret holding the LLM bearer token (ESO-managed); empty = no-auth in-cluster vLLM
ai.apiKey.secretKey	API_KEY	key within the secret
ai.apiKey.envName	AI_API_KEY	env var the binary reads (matches --ai-api-key-env default)
ai.apiKey.header	""	HTTP header for the key; "" = "Authorization: Bearer"; set "X-API-Key" for Kong key-auth
ai.t3.vaultAllowedPrefixes	[]	e.g. ["secret/data/srenix-recovery/"]
ai.memory.enabled	false
ai.memory.image.repository	qdrant/qdrant
ai.memory.image.tag	v1.12.4
ai.memory.image.pullPolicy	IfNotPresent
ai.memory.storage.size	5Gi
ai.memory.storage.className	""	default storage class if empty (use nfs-client/cephfs on mixed GPU nodes)
ai.memory.resources	{}
ai.memory.embeddings.endpoint	https://mcp.baisoln.com/gpu-ai/v1	OpenAI-compatible /embeddings
ai.memory.embeddings.model	qwen3-embedding-0.6b
ai.memory.storeUrl	""	default: http://<release>-rag.<ns>.svc:6333
ai.memory.topK	5	how many prior resolutions to retrieve per finding
ai.rateLimit.actionsPerHour	5	AI-proposed actions/hour budget
ai.rateLimit.tokensPerHour	1000000
ai.circuitBreaker.consecutiveFailures	3
ai.audit.destination	events	events | loki | otlp (reserved for a future structured sink; auditLog above drives the binary today)
ai.audit.lokiURL	""
ai.audit.otlpEndpoint	""
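
A minimal sketch of the narration-only tier (t0). The endpoint and model are the examples from the table, and the Secret name `srenix-llm-api-key` is hypothetical.

```yaml
ai:
  enabled: true
  tier: t0                                        # narration only
  endpoint: "https://mcp.baisoln.com/gpu-ai/v1"   # example; any OpenAI-compatible base URL
  model: "qwen3.6-35b-a3b-fp8"                    # example model name
  interval: 60s
  allowSaas: false                                # keep api.openai.com / api.anthropic.com blocked
  auditLog: "-"                                   # write AI-event audit records to stdout
  apiKey:
    secretName: srenix-llm-api-key                # hypothetical; empty = no-auth in-cluster vLLM
    secretKey: API_KEY
    envName: AI_API_KEY
  metrics:
    addr: ":9090"
    port: 9090                                    # must match the port in addr
    serviceMonitor:
      enabled: false                              # set true when prometheus-operator is installed
```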

approval

Approval-server sidecar: holds the JWT signing key and terminates click-to-fix URLs. Required when ai.tier is t1, t2, or t3 (chart-level validation should enforce this). For ai.tier=t0 (narration only) the approval server is unused.

Key	Default	Description
approval.enabled	false
approval.replicas	1
approval.image.repository	docker4zerocool/srenix-enterprise
approval.image.tag	""	default = .Chart.AppVersion
approval.image.pullPolicy	IfNotPresent
approval.signingKey.secretName	srenix-approval-signing-key
approval.silence.shortDuration	24h	subject-scoped "Silence 24h"
approval.silence.longDuration	2160h	class-scoped "Silence class (90d)"; 2160h = 90 days
approval.store.backend	""	"" (= inmemory) | configmap
approval.store.namespace	""	default = release namespace
approval.store.replayConfigMap	srenix-approval-replay
approval.store.runbookConfigMap	srenix-approval-runbooks
approval.ingress.enabled	false
approval.ingress.host	""	REQUIRED if approval.ingress.enabled; e.g. srenix-approve.example.com
approval.ingress.ingressClassName	""
approval.ingress.annotations	{}
approval.ingress.tls.enabled	true
approval.ingress.tls.secretName	""	default: <release>-approval-server-tls
approval.networkPolicy.enabled	false
approval.networkPolicy.gatewayNamespaceSelector	{}	REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway}
approval.resources.limits.cpu	200m
approval.resources.limits.memory	128Mi
approval.resources.requests.cpu	20m
approval.resources.requests.memory	32Mi
approval.nodeSelector	{}
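
A sketch pairing a t1 AI tier with the approval server, as the ai.approvalServerUrl description suggests. The hostname is the example from the table, the ingress class `nginx` is an assumption, and the `ai` block here is partial (endpoint and model are still required).

```yaml
ai:
  enabled: true
  tier: t1
  approvalServerUrl: "https://srenix-approve.example.com"

approval:
  enabled: true
  signingKey:
    secretName: srenix-approval-signing-key
  store:
    backend: configmap              # "" means inmemory
  ingress:
    enabled: true
    host: srenix-approve.example.com
    ingressClassName: nginx         # assumption; use your cluster's ingress class
    tls:
      enabled: true
  networkPolicy:
    enabled: true
    gatewayNamespaceSelector:
      kubernetes.io/metadata.name: gateway
```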

dashboard

P6.6 — the read-only hosted dashboard. A separate Deployment + Service running the `srenix-enterprise dashboard` subcommand: a server-rendered HTML view of findings (live DriftReports), pending approvals, and remediation history. It NEVER mutates the cluster — for any action it links out to the EXISTING approval-server endpoints (built from approvalBaseURL). RBAC posture: a DEDICATED…

Key	Default	Description
dashboard.enabled	false
dashboard.replicas	1
dashboard.image.repository	docker4zerocool/srenix-enterprise
dashboard.image.tag	""	default = v<.Chart.AppVersion>
dashboard.image.pullPolicy	IfNotPresent
dashboard.approvalBaseURL	""	approvalBaseURL is REQUIRED when dashboard.enabled=true: the externally reachable base URL of the approval-server (e.g. https://srenix-approve.example.com). The /approvals page builds its Approve/Deny/Ignore links against it.
dashboard.authHeader	X-Forwarded-User	authHeader is the HTTP header carrying the oauth2-proxy-authenticated operator identity, displayed in the page header. Default matches the approval-server convention.
dashboard.auditLogPath	""	auditLogPath optionally points at the approval-server's tamper-evident audit JSONL (its --ai-audit-log output) to power /history and /approvals. Empty = those pages render an empty state.
dashboard.historyLimit	100	max rows on /history (most-recent first)
dashboard.approvalsLimit	50	max rows on /approvals (most-recent first)
dashboard.ingress.enabled	false
dashboard.ingress.host	""	REQUIRED if dashboard.ingress.enabled; e.g. srenix-dashboard.example.com
dashboard.ingress.ingressClassName	""
dashboard.ingress.annotations	{}	e.g. oauth2-proxy / cert-manager annotations
dashboard.ingress.tls.enabled	true
dashboard.ingress.tls.secretName	""	default: <release>-dashboard-tls
dashboard.networkPolicy.enabled	false
dashboard.networkPolicy.gatewayNamespaceSelector	{}	REQUIRED if enabled; e.g. {kubernetes.io/metadata.name: gateway}
dashboard.resources.limits.cpu	200m
dashboard.resources.limits.memory	128Mi
dashboard.resources.requests.cpu	20m
dashboard.resources.requests.memory	32Mi
dashboard.nodeSelector	{}
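
An illustrative override for exposing the dashboard behind an ingress. Both hostnames are the examples from the table; the ingress annotations are left empty because they depend on your oauth2-proxy and cert-manager setup.

```yaml
dashboard:
  enabled: true
  approvalBaseURL: "https://srenix-approve.example.com"   # REQUIRED when enabled
  authHeader: X-Forwarded-User
  historyLimit: 100
  approvalsLimit: 50
  ingress:
    enabled: true
    host: srenix-dashboard.example.com
    annotations: {}          # add oauth2-proxy / cert-manager annotations here
    tls:
      enabled: true
```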

protectedNamespaces

Protected namespaces — the act-side no-touch list. The compiled-in floor (kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system) is NOT configurable and can never be removed. `extra` APPENDS namespaces to that floor: each entry is rendered as SRENIX_PROTECTED_NAMESPACES_EXTRA (comma-separated) on the watcher, diagnose, remediate, AND aiwatch containers, so…

Key	Default	Description
protectedNamespaces.extra	[]
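
For example, appending two namespaces to the compiled-in floor might look like this. The namespace names are hypothetical.

```yaml
protectedNamespaces:
  extra:
    - payments       # hypothetical namespace
    - monitoring     # hypothetical namespace
```

Per the description above, these entries extend the no-touch list; the built-in namespaces stay protected regardless.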

gatekeeper

Key	Default	Description
gatekeeper.install	false	set true if Gatekeeper is installed in the cluster
gatekeeper.constraints.protectedNamespaces	[kube-system, kube-public, kube-node-lease, rook-ceph, vault, external-secrets, cnpg-system]

analyzers

Drift-class analyzers added in v1.7 (Workstreams B1+B2+B3 from the AI SRE positioning plan) + v1.8 (Workstream B4). Default to ON so existing installs get the new signal automatically; flip individual entries to false on clusters that don't host the targeted asset class.

Key	Default	Description
analyzers.secretKeyMissing.enabled	true	Workloads referencing a Secret key that does not exist (the CreateContainerConfigError root cause).
analyzers.failingExternalSecrets.enabled	true	ExternalSecrets stuck SecretSyncedError / not Ready (ESO sync chain broken).
analyzers.proactiveSecretKeyCheck.enabled	true	Proactive secretKeyRef validation BEFORE pods restart and hit CreateContainerConfigError.
analyzers.unprovisionedSecret.enabled	true	Secrets referenced by workloads that no controller (ESO, cert-manager, Helm) ever provisioned.
analyzers.imagePullAuth.enabled	true	ImagePullBackOff caused by missing/invalid pull credentials (namespace missing pull secret or deployment missing imagePullSecrets).
analyzers.certExpiry.enabled	true	cert-manager Certificates close to / past expiry or stuck not Ready.
analyzers.tlsSecretMismatch.enabled	true	Ingress TLS secretName pointing at a stale/mismatched Secret while a healthy Certificate targets a different Secret for the same host. (The optional auto-FIXER for this finding is gated separately under fixers.tlsSecretMismatch.)
analyzers.gitopsDrift.enabled	true	GitOps drift — Argo CD Application + Flux Kustomization + Flux HelmRelease. Surfaces controllers stuck OutOfSync, Degraded, NotReady, BuildFailed, UpgradeFailed, etc. Default 10-minute grace period (controllers are routinely reconciling). Set to false on clusters without…
analyzers.workloadStateDrift.enabled	true	State-tier drift — CNPG cluster phase / follower lag / primary switchover stuck; StatefulSet ordinal-zero stuck. Goes deeper than the basic "X/Y ready" probe. Set to false on clusters that don't host CNPG and don't run StatefulSets.
analyzers.rbacDrift.enabled	true	RBAC drift — wildcard-verb roles + unbound ServiceAccounts mounted by Pods. Skips system canonical roles (cluster-admin, system:*) and the default SA in every namespace. Set to false if your cluster's RBAC posture is managed entirely by an upstream IaC system that…
analyzers.configDrift.enabled	true	Config drift (v1.8) — CRD multi-storedVersions (storage migration pending), Deployment rollouts stuck past the grace window (generation skew or updatedReplicas trailing spec.replicas), and Pods of the same Deployment carrying disagreeing checksum/config annotations (rolling…
analyzers.capacityDrift.enabled	true	Capacity drift (v1.8) — HPA pinned at maxReplicas past the saturation grace (24h default; workload is chronically under-provisioned), HPA pinned at minReplicas past the idle grace (30d default; HPA is not load-driven), HPA AbleToScale=False past grace (typically ResourceQuota or…
analyzers.securityDrift.enabled	true	Security drift (v1.8), three observational signals: user namespaces with no pod-security.kubernetes.io/enforce label (apiserver applies the cluster-wide default, typically privileged) or with enforce=privileged explicitly (most-permissive PSS profile); Pods whose containers…
analyzers.disruptionDrift.enabled	true	Disruption-tier drift (v1.21) — ResourceQuota near-exhaustion, PodDisruptionBudget blocking voluntary disruption (drains stuck), and Jobs past their activeDeadline / backoffLimit. Each sub-signal handles its own GVR-absence case. Set to false to silence.
analyzers.oomkillRecurrence.enabled	true	Workload-tier (v1.22) — containers OOMKilled repeatedly (a sizing problem masquerading as a crash loop). Set to false to silence.
analyzers.pvOrphan.enabled	true	Workload-tier (v1.22) — Released/Available PersistentVolumes with no bound PVC (a cost leak). Set to false to silence.
analyzers.cronjobStuck.enabled	true	Workload-tier (v1.22) — CronJobs that have not scheduled a Job within their expected window (silent scheduling failure). Set to false to silence.
analyzers.logPatternMatcher.enabled	true	v1.25: scans recent Events for high-signal failure messages (ImagePullBackOff, OOMKilled, VolumeAttachFailed, ProbeFailed, Forbidden). Deduplicated to one finding per (object, pattern). Set to false to silence.
analyzers.netpolProposer.enabled	true	Phase 2d-β — on NetworkPolicy-enforcing CNIs, emits one warning per uncovered namespace with a deterministic ProposedPolicyYAML. Silent on Flannel-only k3s. Set to false to silence.
analyzers.dnsChainDrift.enabled	true	v1.10 — verifies the DNS chain (Service → Ingress → external host) for the seeded endpoint hostnames. Runs the K8s-chain hops with no config; external-hop verification requires externalDNS.cloudflare. Set to false to silence.
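
As an illustration of the toggles above, a minimal override sketch that silences two analyzers and leaves the rest at their defaults. The file name is hypothetical, and the choice of analyzers is only an example; the key paths are the ones listed in this table.

```yaml
# values-override.yaml (hypothetical file name)
analyzers:
  rbacDrift:
    enabled: false      # example: RBAC posture is managed by an upstream IaC system
  netpolProposer:
    enabled: false      # example: no NetworkPolicy proposals wanted on this cluster
```

Helm merges user-supplied values over the chart defaults, so analyzers not named in the override stay enabled.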

investigator

Layer-2 investigator (deterministic, rule-based; ships in OSS). Defaults ON. The paid binary may replace it with an LLM-backed implementation. Set enabled: false to disable (SRENIX_INVESTIGATOR=off).

Key	Default	Description
investigator.enabled	true
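
A minimal sketch of turning the investigator off through values, assuming no other investigator keys need to be set alongside it:

```yaml
investigator:
  enabled: false   # per the description above, corresponds to SRENIX_INVESTIGATOR=off
```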

probes

M2 probe-class additions (v1.8). Each defaults to ON and AUTO-SKIPS when its CRD is absent (Kong / ArgoCD / Velero) or no-ops on an empty list (HPA), so leaving them on costs nothing on clusters that don't host the asset. Set enabled: false only to silence a probe on a cluster that DOES host the CRD but that you don't want Srenix to watch.

Key	Default	Description
probes.ceph.enabled	true	Rook-Ceph cluster health (HEALTH_OK / OSD status). Auto-skips when the rook-ceph CRDs are absent.
probes.nodes.enabled	true	Node Ready conditions across the cluster.
probes.postgres.enabled	true	PostgreSQL (CNPG) cluster health. Auto-skips when the CNPG CRDs are absent.
probes.pvcs.enabled	true	PersistentVolumeClaims stuck Pending / Lost.
probes.criticalWorkloads.enabled	true	Critical Services probe — the curated workload target list (defaults merged with SRENIX_CRITICAL_SERVICES / the probe-critical annotation). Emits SRENIX_PROBE_CRITICAL_WORKLOADS=off when disabled.
probes.endpoints.enabled	true	External HTTP(S) endpoint reachability for discovered/seeded Ingress hostnames.
probes.kong.enabled	true	Kong ingress — KongPlugin / KongConsumer / Kong proxy readiness drift. Auto-skips when configuration.konghq.com CRDs are absent.
probes.hpaScaling.enabled	true	HorizontalPodAutoscaler scaling health (distinct from the v1.8 capacityDrift analyzer's longitudinal signals). No-ops on an empty HPA list.
probes.argocdApp.enabled	true	Argo CD Application sync/health (probe-level snapshot, distinct from the gitopsDrift analyzer). Auto-skips when argoproj.io CRDs are absent.
probes.velero.enabled	true	Velero backup freshness / last-backup status. Auto-skips when velero.io CRDs are absent.
probes.nodePressure.enabled	true	Node MemoryPressure / DiskPressure / PIDPressure conditions.
probes.daemonsets.enabled	true	DaemonSets with unavailable/misscheduled pods.
probes.pendingPods.enabled	true	Pods stuck Pending past the scheduling grace window.
probes.crashloop.enabled	true	Containers in CrashLoopBackOff.
probes.etcd.enabled	true	etcd member health / quorum.
probes.failedMounts.enabled	true	Pods blocked on FailedMount / FailedAttachVolume events.
probes.kongRoutes.enabled	true	Kong-managed Ingress backend-Endpoint + plugin/consumer reference readiness. Silent on clusters without Kong-managed Ingresses.
probes.gpuNodes.enabled	true	NotReady / cordoned / zero-allocatable GPU nodes. Silent on CPU-only clusters.
probes.traefikRoutes.enabled	true	k3s Traefik IngressRoute backend readiness. Auto-skips on non-k3s or when the Traefik CRD is absent.
probes.k3sLocalPathStorage.enabled	true	k3s local-path-provisioner PVC health. No-ops when there are no local-path PVCs.
probes.k3sDatastore.enabled	true	k3s datastore (sqlite/etcd) health. Auto-skips on non-k3s. Set SRENIX_K3S_SINGLE_NODE_OK=true (via watcher.extraEnv) to suppress the single-node datastore warning on intentional single-node clusters.
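
A hedged sketch of two common adjustments from this table: silencing probes on a cluster that hosts the CRDs, and suppressing the single-node k3s datastore warning. The shape of watcher.extraEnv (a list of name/value entries, as in a Kubernetes container env) is an assumption here; check the watcher section of this reference for the actual format.

```yaml
probes:
  kong:
    enabled: false     # example: Kong CRDs are installed, but Srenix should not watch them
  velero:
    enabled: false     # example: backups are monitored elsewhere

watcher:
  extraEnv:            # assumed format: list of name/value pairs
    - name: SRENIX_K3S_SINGLE_NODE_OK
      value: "true"    # quoted so YAML keeps it a string
```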

fixers

Optional fixers — off by default. Each entry adds RBAC verbs and exposes an env var the binary reads to enable the matching Fixer registration.

Key	Default	Description
fixers.tlsSecretMismatch.enabled	false	Patches Ingress.spec.tls[].secretName when Srenix detects a stale Secret plus a healthy cert-manager Certificate in the same namespace targeting a different Secret for the same host. GitOps-managed Ingresses (ArgoCD / Flux / Helm release labels) are SKIPPED automatically — the…
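
A minimal sketch of opting in to the fixer. Enabling it also adds the RBAC verbs mentioned above, so it is worth reviewing the rendered manifests before applying:

```yaml
fixers:
  tlsSecretMismatch:
    enabled: true   # off by default; GitOps-managed Ingresses are still skipped
```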

externalDNS

External-DNS verification for the DNSChainDrift analyzer. When enabled, the watcher + diagnose containers receive SRENIX_CLOUDFLARE_TOKEN via a secretKeyRef (NEVER a literal) so the analyzer can verify the external DNS hop (Cloudflare record → Ingress host). Without this the analyzer still runs the in-cluster chain hops and emits "external DNS hop not verified". Mirrors the operator-managed…

Key	Default	Description
externalDNS.cloudflare.enabled	false
externalDNS.cloudflare.apiTokenSecretRef	{}	e.g. {name: srenix-cloudflare-token, key: token}
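
A sketch of enabling the external DNS hop, using the Secret name and key from the example in the table. It assumes a Secret named srenix-cloudflare-token with a token key already exists in the release namespace, since a secretKeyRef resolves against the Pod's own namespace.

```yaml
externalDNS:
  cloudflare:
    enabled: true
    apiTokenSecretRef:
      name: srenix-cloudflare-token   # existing Secret, created outside the chart (assumption)
      key: token                      # key within that Secret holding the API token
```

The token itself never appears in values.yaml; only the reference to the Secret does.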
