Kubernetes Probes
21 probes run on every cycle against the Kubernetes API. They are read-only: no metrics scraping, no log shipping.
Probes are the detection layer. Each probe reads one area of cluster state and emits a health signal (Healthy / Warning / Critical), and the results feed the analyzers and fixers. Probes never modify cluster state.
Disable any probe with an env var (SRENIX_PROBE_NAME=off, where NAME is the probe's identifier as listed in the table below) or a Helm value (probes.name.enabled: false). Each probe can be toggled independently.
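As an illustration of the toggle convention, here is a minimal Python sketch of how such an env-var check could work. Only the `SRENIX_PROBE_<NAME>=off` form comes from the text above; the helper name `probe_enabled`, the default-on behavior for unset variables, and the case-insensitive comparison are assumptions made for the example, not documented Srenix behavior.

```python
import os
from typing import Mapping, Optional


def probe_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when SRENIX_PROBE_<NAME> is set to "off".

    Hypothetical helper: treating unset or other values as "enabled" and
    ignoring case and surrounding whitespace are assumptions.
    """
    env = os.environ if env is None else env
    value = env.get(f"SRENIX_PROBE_{name.upper()}", "")
    return value.strip().lower() != "off"


if __name__ == "__main__":
    sample_env = {"SRENIX_PROBE_CEPH": "off"}
    print(probe_enabled("CEPH", sample_env))    # Ceph probe is switched off
    print(probe_enabled("VELERO", sample_env))  # variable unset, probe stays on
```

Note that the identifier in the variable name does not always match the probe name (for example, ClusterNodes uses SRENIX_PROBE_NODES), so take the exact variable from the table.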
Probe reference
| Probe | What it checks | Fires when | Disable env var |
|---|---|---|---|
| Ceph | Rook-Ceph health, OSD readiness, capacity % | HEALTH_ERR or ≥80% capacity | SRENIX_PROBE_CEPH=off |
| PostgreSQL | CNPG + Zalando Spilo/Patroni clusters (auto-detected) | Replica not ready, primary missing | SRENIX_PROBE_POSTGRES=off |
| CriticalWorkloads | Configurable list of critical Deployments/StatefulSets | READY count < desired | SRENIX_PROBE_CRITICAL_WORKLOADS=off |
| ClusterNodes | Node Ready condition | Any node NotReady | SRENIX_PROBE_NODES=off |
| PVCs | PersistentVolumeClaim phase | Any PVC not Bound | SRENIX_PROBE_PVCS=off |
| Endpoints | HTTP/S probe of each Ingress host (auto-discovered) | Two consecutive failures (a single failed check is suppressed as a flake) | SRENIX_PROBE_ENDPOINTS=off |
| NodePressure | DiskPressure / MemoryPressure / PIDPressure / NetworkUnavailable | Any pressure condition True; DiskPressure → Critical | SRENIX_PROBE_NODE_PRESSURE=off |
| DaemonSets | 8 system namespaces (kube-system, cilium-system, calico-system, kube-flannel, longhorn-system, rook-ceph, openebs, metallb-system) | desiredNumberScheduled ≠ numberReady | SRENIX_PROBE_DAEMONSETS=off |
| PendingPods | Pods with PodScheduled=False past 60s grace | Insufficient CPU/Memory, unbound PVC, taint/nodeSelector mismatch | SRENIX_PROBE_PENDING_PODS=off |
| CrashLoopBackOff | Any namespace; protected-ns escalates immediately | Protected-ns: always Critical; user-ns: past restart threshold (default 10) | SRENIX_PROBE_CRASHLOOP=off |
| ETCD | kubeadm static-pod etcd members; "blind probe" warning on managed etcd. Note: detection is pod-readiness based. A quorum split where pods remain Ready is not detected. | Member unhealthy; Warning on managed/external etcd | SRENIX_PROBE_ETCD=off |
| FailedMounts | Pods stuck ContainerCreating past 90s + kubelet FailedMount/FailedAttach events | Volume mount failure confirmed by kubelet event | SRENIX_PROBE_FAILED_MOUNTS=off |
| Kong | KongPlugin CRs (auto-skips when Kong CRDs absent) | Plugin with Programmed=False — gateway serving traffic without the intended policy | SRENIX_PROBE_KONG=off |
| KongRoutes | For each Kong-managed Ingress: backend Service has ≥1 ready Endpoint + KongPlugin/Consumer annotation refs resolve | No ready endpoints or dangling annotation | SRENIX_PROBE_KONG_ROUTES=off |
| GPUNodes | nvidia.com/gpu + amd.com/gpu node allocatability | GPU node NotReady, cordoned, or zero allocatable GPU | SRENIX_PROBE_GPU_NODES=off |
| HPAScaling | HorizontalPodAutoscaler status conditions (healthy on zero HPAs) | ScalingActive=False or AbleToScale=False — immediate Critical, no dwell. Exception: ScalingActive=False with reason=ScalingDisabled downgrades to Warning (expected KEDA scale-to-zero state) | SRENIX_PROBE_HPA_SCALING=off |
| ArgoCDApplication | Argo CD Application sync/health status (auto-skips when Argo CD CRDs absent) | Application OutOfSync or Degraded; fires immediately, no grace window | SRENIX_PROBE_ARGOCD_APP=off |
| Velero | Velero Backup CRs across namespaces (auto-skips when Velero CRDs absent) | Latest backup Failed/PartiallyFailed, older than the backup SLA (default 26h), or InProgress >4h | SRENIX_PROBE_VELERO=off |
| TraefikRoutes | Traefik IngressRoute CRDs (v3 traefik.io, falls back to v2; auto-skips when absent). Also validates IngressRouteTCP backend Services. | Missing backend Service, unresolved Middleware ref, or TLS config with no cert provisioner | SRENIX_PROBE_TRAEFIK_ROUTES=off |
| K3sLocalPathStorage | Ephemeral-storage headroom on nodes hosting local-path-provisioner PVCs (no-ops without local-path PVCs) | Disk-pressure risk on a node backing local-path volumes | SRENIX_PROBE_K3S_LOCALPATH=off |
| K3sDatastore | k3s datastore mode (embedded etcd vs SQLite) + mode-appropriate health signals. Also checks etcd quorum (⌊N/2⌋+1 members), warns on 2-node clusters (no fault tolerance), checks per-member snapshot freshness via node annotations, and compares ConfigMap-based snapshot age against the SLA. | Datastore unhealthy; avoids the spurious "no etcd pods" warning on SQLite single-node k3s | SRENIX_PROBE_K3S_DATASTORE=off |
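The quorum threshold mentioned in the K3sDatastore row can be worked through with a few lines of arithmetic. This sketch assumes the standard etcd majority rule (quorum is floor(N/2) + 1 voting members); the function names are illustrative and are not part of Srenix.

```python
def quorum(members: int) -> int:
    """Majority needed for an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("cluster must have at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still has quorum."""
    return members - quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True when enough members are healthy to form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    # Print cluster size, quorum size, and tolerated failures.
    for n in (1, 2, 3, 5):
        print(n, quorum(n), fault_tolerance(n))
```

A 2-member cluster needs both members for quorum, so it tolerates zero failures, the same as a single member. That is the reason for the 2-node warning in the row above.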
LogPatternMatcher analyzer
Ships alongside the K8s probe set. Scans recent Kubernetes Events (not pod logs) for high-signal failure patterns — ImagePullBackOff, OOMKilled, probe-failed, volume-attach-failed, RBAC Forbidden — and deduplicates by (object, pattern) before emitting a finding. Severity is per-pattern, not uniform: ImagePullBackOff (and ErrImagePull / manifest-unknown) and VolumeAttachFailed are reported as Critical; OOMKilled, ProbeFailed (Liveness/Readiness/Startup), and RBAC Forbidden are reported as Warning. Does not auto-fix. Disable with SRENIX_ANALYZER_LOG_PATTERN_MATCHER=off.
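The deduplication and per-pattern severity described above can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the `Event` and `Finding` types, the object key format, the pattern labels used as dictionary keys, and the `count` field are all hypothetical. Only the severity assignments come from the text.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# Per-pattern severity as described in the text; the label spellings
# used as keys here are illustrative.
SEVERITY: Dict[str, str] = {
    "ImagePullBackOff": "Critical",
    "VolumeAttachFailed": "Critical",
    "OOMKilled": "Warning",
    "ProbeFailed": "Warning",
    "RBACForbidden": "Warning",
}


class Event(NamedTuple):
    obj: str      # hypothetical object key, e.g. "default/pod/web-0"
    pattern: str  # pattern label already matched from the Event text


class Finding(NamedTuple):
    obj: str
    pattern: str
    severity: str
    count: int    # how many raw events were collapsed into this finding


def findings(events: Iterable[Event]) -> List[Finding]:
    """Collapse events into one finding per (object, pattern) pair."""
    counts: Dict[Tuple[str, str], int] = {}
    for ev in events:
        if ev.pattern not in SEVERITY:
            continue  # not one of the high-signal patterns
        key = (ev.obj, ev.pattern)
        counts[key] = counts.get(key, 0) + 1
    return [Finding(o, p, SEVERITY[p], n) for (o, p), n in counts.items()]
```

Keying on the (object, pattern) pair means a pod that repeats the same failure produces a single finding, while two different failures on the same pod are still reported separately.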
Trigger classes
Probes run on three trigger classes: A (Kubernetes informers, which react within ~10s of any resource event), C (Alertmanager polling, which catches slow-drift signals like disk fill or cert expiry), and E (an external HMAC-authenticated webhook, which starts a cycle immediately on an external signal). As a safety net, a CronJob resync also runs on the schedule set in the Helm values.
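For the class E trigger, the sender and receiver share a secret and the receiver recomputes the signature before accepting the request. The sketch below shows that general pattern in Python. The text above only states that the webhook is HMAC-authenticated, so the choice of SHA-256, hex encoding, and signing the raw request body are assumptions, as are the function names.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, body)
    # Compare as bytes so a non-ASCII signature is rejected, not an error.
    return hmac.compare_digest(expected.encode(), signature.encode())


if __name__ == "__main__":
    secret = b"example-shared-secret"  # placeholder value for the demo
    body = b'{"trigger": "external"}'  # placeholder payload
    sig = sign(secret, body)
    print(verify(secret, body, sig))         # signature matches the body
    print(verify(secret, body + b" ", sig))  # tampered body is rejected
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters of the signature matched through timing differences.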