S Srenix
Docs / K8s Probes

Docs

Kubernetes Probes

21 probes run on every cycle against the Kubernetes API — read-only, no metrics scraping, no log shipping.

Probes are the detection layer. Each probe reads one area of cluster state and emits a health signal (Healthy / Warning / Critical). Probe results feed the analyzers and fixers. All probes are read-only — they never modify cluster state.

Disable any probe with an env var (SRENIX_PROBE_NAME=off) or Helm flag (probes.name.enabled: false). Each probe is independently togglable.

Probe reference

Probe What it checks Fires when Disable env var
Ceph Rook-Ceph health, OSD readiness, capacity % HEALTH_ERR or ≥80% capacity SRENIX_PROBE_CEPH=off
PostgreSQL CNPG + Zalando Spilo/Patroni clusters (auto-detected) Replica not ready, primary missing SRENIX_PROBE_POSTGRES=off
CriticalWorkloads Configurable list of critical Deployments/StatefulSets READY count < desired SRENIX_PROBE_CRITICAL_WORKLOADS=off
ClusterNodes Node Ready condition Any node NotReady SRENIX_PROBE_NODES=off
PVCs PersistentVolumeClaim phase Any PVC not Bound SRENIX_PROBE_PVCS=off
Endpoints HTTP/S probe of each Ingress host (auto-discovered) 2-of-2 consecutive failures (flake-suppressed) SRENIX_PROBE_ENDPOINTS=off
NodePressure DiskPressure / MemoryPressure / PIDPressure / NetworkUnavailable Any pressure condition True; DiskPressure → Critical SRENIX_PROBE_NODE_PRESSURE=off
DaemonSets 8 system namespaces (kube-system, cilium-system, calico-system, kube-flannel, longhorn-system, rook-ceph, openebs, metallb-system) desiredNumberScheduled ≠ numberReady SRENIX_PROBE_DAEMONSETS=off
PendingPods Pods with PodScheduled=False past 60s grace Insufficient CPU/Memory, unbound PVC, taint/nodeSelector mismatch SRENIX_PROBE_PENDING_PODS=off
CrashLoopBackOff Any namespace; protected-ns escalates immediately Protected-ns: always Critical; user-ns: past restart threshold (default 10) SRENIX_PROBE_CRASHLOOP=off
ETCD kubeadm static-pod etcd members; "blind probe" warning on managed etcd. Note: detection is pod-readiness based. A quorum split where pods remain Ready is not detected. Member unhealthy; Warning on managed/external etcd SRENIX_PROBE_ETCD=off
FailedMounts Pods stuck ContainerCreating past 90s + kubelet FailedMount/FailedAttach events Volume mount failure confirmed by kubelet event SRENIX_PROBE_FAILED_MOUNTS=off
Kong KongPlugin CRs (auto-skips when Kong CRDs absent) Plugin with Programmed=False — gateway serving traffic without the intended policy SRENIX_PROBE_KONG=off
KongRoutes For each Kong-managed Ingress: backend Service has ≥1 ready Endpoint + KongPlugin/Consumer annotation refs resolve No ready endpoints or dangling annotation SRENIX_PROBE_KONG_ROUTES=off
GPUNodes nvidia.com/gpu + amd.com/gpu node allocatability GPU node NotReady, cordoned, or zero allocatable GPU SRENIX_PROBE_GPU_NODES=off
HPAScaling HorizontalPodAutoscaler status conditions (healthy on zero HPAs) ScalingActive=False or AbleToScale=False — immediate Critical, no dwell. Exception: ScalingActive=False with reason=ScalingDisabled downgrades to Warning (expected KEDA scale-to-zero state) SRENIX_PROBE_HPA_SCALING=off
ArgoCDApplication ArgoCD Application sync/health status (auto-skips when Argo CD CRDs absent) Application OutOfSync or Degraded — immediate, no grace window SRENIX_PROBE_ARGOCD_APP=off
Velero Velero Backup CRs across namespaces (auto-skips when Velero CRDs absent) Latest backup Failed/PartiallyFailed, older than the backup SLA (default 26h), or InProgress >4h SRENIX_PROBE_VELERO=off
TraefikRoutes Traefik IngressRoute CRDs (v3 traefik.io, falls back to v2; auto-skips when absent). Also validates IngressRouteTCP backend Services. Missing backend Service, unresolved Middleware ref, or TLS config with no cert provisioner SRENIX_PROBE_TRAEFIK_ROUTES=off
K3sLocalPathStorage Ephemeral-storage headroom on nodes hosting local-path-provisioner PVCs (no-ops without local-path PVCs) Disk-pressure risk on a node backing local-path volumes SRENIX_PROBE_K3S_LOCALPATH=off
K3sDatastore k3s datastore mode (embedded etcd vs SQLite) + mode-appropriate health signals. Also checks etcd quorum (N/2+1 threshold), 2-node cluster fault-tolerance warning, per-member snapshot freshness via node annotations, and ConfigMap-based snapshot age vs SLA. Datastore unhealthy; avoids the spurious "no etcd pods" warning on SQLite single-node k3s SRENIX_PROBE_K3S_DATASTORE=off

LogPatternMatcher analyzer

Ships alongside the K8s probe set. Scans recent Kubernetes Events (not pod logs) for high-signal failure patterns — ImagePullBackOff, OOMKilled, probe-failed, volume-attach-failed, RBAC Forbidden — and deduplicates by (object, pattern) before emitting a finding. Severity is per-pattern, not uniform: ImagePullBackOff (and ErrImagePull / manifest-unknown) and VolumeAttachFailed are reported as Critical; OOMKilled, ProbeFailed (Liveness/Readiness/Startup), and RBAC Forbidden are reported as Warning. Does not auto-fix. Disable with SRENIX_ANALYZER_LOG_PATTERN_MATCHER=off.

Trigger classes

Probes run on three trigger classes: A (Kubernetes informers — react within ~10s of any resource event), C (Alertmanager polling — catches slow-drift signals like disk fill or cert expiry), and E (external HMAC-authenticated webhook — immediate cycle on external signal). CronJob resync runs on the schedule in Helm values as a safety net.

← Back to docs