Kubernetes Probes

21 probes run on every cycle against the Kubernetes API — read-only, no metrics scraping, no log shipping.

Probes are the detection layer. Each probe reads one area of cluster state and emits a health signal (Healthy / Warning / Critical). Probe results feed the analyzers and fixers. All probes are read-only — they never modify cluster state.

Disable any probe with an env var (SRENIX_PROBE_NAME=off) or Helm flag (probes.name.enabled: false). Each probe is independently togglable.

Probe reference

Probe	What it checks	Fires when	Disable env var
Ceph	Rook-Ceph health, OSD readiness, capacity %	HEALTH_ERR or ≥80% capacity	SRENIX_PROBE_CEPH=off
PostgreSQL	CNPG + Zalando Spilo/Patroni clusters (auto-detected)	Replica not ready, primary missing	SRENIX_PROBE_POSTGRES=off
CriticalWorkloads	Configurable list of critical Deployments/StatefulSets	READY count < desired	SRENIX_PROBE_CRITICAL_WORKLOADS=off
ClusterNodes	Node Ready condition	Any node NotReady	SRENIX_PROBE_NODES=off
PVCs	PersistentVolumeClaim phase	Any PVC not Bound	SRENIX_PROBE_PVCS=off
Endpoints	HTTP/S probe of each Ingress host (auto-discovered)	2-of-2 consecutive failures (flake-suppressed)	SRENIX_PROBE_ENDPOINTS=off
NodePressure	DiskPressure / MemoryPressure / PIDPressure / NetworkUnavailable	Any pressure condition True; DiskPressure → Critical	SRENIX_PROBE_NODE_PRESSURE=off
DaemonSets	8 system namespaces (kube-system, cilium-system, calico-system, kube-flannel, longhorn-system, rook-ceph, openebs, metallb-system)	desiredNumberScheduled ≠ numberReady	SRENIX_PROBE_DAEMONSETS=off
PendingPods	Pods with PodScheduled=False past 60s grace	Insufficient CPU/Memory, unbound PVC, taint/nodeSelector mismatch	SRENIX_PROBE_PENDING_PODS=off
CrashLoopBackOff	Any namespace; protected-ns escalates immediately	Protected-ns: always Critical; user-ns: past restart threshold (default 10)	SRENIX_PROBE_CRASHLOOP=off
ETCD	kubeadm static-pod etcd members; "blind probe" warning on managed etcd. Note: detection is pod-readiness based. A quorum split where pods remain Ready is not detected.	Member unhealthy; Warning on managed/external etcd	SRENIX_PROBE_ETCD=off
FailedMounts	Pods stuck ContainerCreating past 90s + kubelet FailedMount/FailedAttach events	Volume mount failure confirmed by kubelet event	SRENIX_PROBE_FAILED_MOUNTS=off
Kong	KongPlugin CRs (auto-skips when Kong CRDs absent)	Plugin with Programmed=False — gateway serving traffic without the intended policy	SRENIX_PROBE_KONG=off
KongRoutes	For each Kong-managed Ingress: backend Service has ≥1 ready Endpoint + KongPlugin/Consumer annotation refs resolve	No ready endpoints or dangling annotation	SRENIX_PROBE_KONG_ROUTES=off
GPUNodes	nvidia.com/gpu + amd.com/gpu node allocatability	GPU node NotReady, cordoned, or zero allocatable GPU	SRENIX_PROBE_GPU_NODES=off
HPAScaling	HorizontalPodAutoscaler status conditions (healthy on zero HPAs)	ScalingActive=False or AbleToScale=False — immediate Critical, no dwell. Exception: ScalingActive=False with reason=ScalingDisabled downgrades to Warning (expected KEDA scale-to-zero state)	SRENIX_PROBE_HPA_SCALING=off
ArgoCDApplication	ArgoCD Application sync/health status (auto-skips when Argo CD CRDs absent)	Application OutOfSync or Degraded — immediate, no grace window	SRENIX_PROBE_ARGOCD_APP=off
Velero	Velero Backup CRs across namespaces (auto-skips when Velero CRDs absent)	Latest backup Failed/PartiallyFailed, older than the backup SLA (default 26h), or InProgress >4h	SRENIX_PROBE_VELERO=off
TraefikRoutes	Traefik IngressRoute CRDs (v3 traefik.io, falls back to v2; auto-skips when absent). Also validates IngressRouteTCP backend Services.	Missing backend Service, unresolved Middleware ref, or TLS config with no cert provisioner	SRENIX_PROBE_TRAEFIK_ROUTES=off
K3sLocalPathStorage	Ephemeral-storage headroom on nodes hosting local-path-provisioner PVCs (no-ops without local-path PVCs)	Disk-pressure risk on a node backing local-path volumes	SRENIX_PROBE_K3S_LOCALPATH=off
K3sDatastore	k3s datastore mode (embedded etcd vs SQLite) + mode-appropriate health signals. Also checks etcd quorum (N/2+1 threshold), 2-node cluster fault-tolerance warning, per-member snapshot freshness via node annotations, and ConfigMap-based snapshot age vs SLA.	Datastore unhealthy; avoids the spurious "no etcd pods" warning on SQLite single-node k3s	SRENIX_PROBE_K3S_DATASTORE=off

LogPatternMatcher analyzer

Ships alongside the K8s probe set. Scans recent Kubernetes Events (not pod logs) for high-signal failure patterns — ImagePullBackOff, OOMKilled, probe-failed, volume-attach-failed, RBAC Forbidden — and deduplicates by (object, pattern) before emitting a finding. Severity is per-pattern, not uniform: ImagePullBackOff (and ErrImagePull / manifest-unknown) and VolumeAttachFailed are reported as Critical; OOMKilled, ProbeFailed (Liveness/Readiness/Startup), and RBAC Forbidden are reported as Warning. Does not auto-fix. Disable with SRENIX_ANALYZER_LOG_PATTERN_MATCHER=off.

Trigger classes

Probes run on three trigger classes: A (Kubernetes informers — react within ~10s of any resource event), C (Alertmanager polling — catches slow-drift signals like disk fill or cert expiry), and E (external HMAC-authenticated webhook — immediate cycle on external signal). CronJob resync runs on the schedule in Helm values as a safety net.

← Back to docs