Ceph storage
Inspects CephCluster CRs (rook-ceph): health, capacity used, OSD readiness. Surfaces the same "is the storage layer alive?" signal you would otherwise have to read out of rook-ceph-tools.
Features · K8s probes
The infra-grade depth no SaaS tool matches because they're not in the cluster. All 21 probes ship in v0.1.0-alpha.1: the core six (Ceph, Postgres, critical workloads, nodes, PVCs, endpoints), six blind-spot probes (Node pressure, system DaemonSets, stuck Pending pods, generic CrashLoopBackOff, ETCD, failed volume mounts), plus Kong, KongRoutes, GPUNodes, HPA, ArgoCD, Velero, Traefik routes, and k3s local-path storage + datastore — this page deep-dives the core twelve.
Inspects CephCluster CRs (rook-ceph): health, capacity used, OSD readiness. Surfaces the same "is the storage layer alive?" signal you would otherwise have to read out of rook-ceph-tools.
CloudNativePG (CNPG) and Zalando Spilo / Patroni clusters, auto-detected from CRDs. Replica readiness and primary identity.
Counts pods by the READY column, not phase=Running (so CCE/ImagePull failures don't silently green). The list of "critical" workloads is configurable via the SRENIX_CRITICAL_SERVICES env var and the srenix.ai/probe-critical: "true" annotation — no fork needed for non-Srenix clusters.
Ready condition across every node. Distinguishes kubectl-error from empty-list to avoid the false-green trap.
PersistentVolumeClaim phase: Bound vs Pending. The "bound but mount fails" class is handled by the new FailedMounts probe below.
Every Ingress host auto-discovered and probed externally. Per-Ingress opt-out via srenix.ai/probe-disable: "true". Layer-1 flake suppression (2-of-2 streak) included.
DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable conditions. DiskPressure and NetworkUnavailable auto-escalate to Critical (eviction or broken pod traffic imminent). The basic Nodes probe only checks Ready — pressure flips first.
Inspects DaemonSets in eight system namespaces (kube-system, cilium-system, calico-system, kube-flannel, rook-ceph, longhorn-system, openebs, metallb-system). Catches the case where a CNI/CSI plugin is down but the kubelet still reports Ready=True for several minutes.
Pods stuck Pending past a 60s grace window with PodScheduled=False. Reason-aware remediation distinguishes Insufficient CPU/Memory, unbound PVC, taint mismatch, nodeSelector miss. Doesn't double-flag ImagePullBackOff (owned by the ImagePullAuth analyzer).
Generic any-namespace detector — catches crashes anywhere, not just a hardcoded critical-services list. Protected-namespace pods (kube-system etc.) are always Critical; user-namespace pods start as Warning and escalate past the restart threshold.
Watches kubeadm-style static-pod etcd members. Honest "blind probe" Warning for external / managed etcd (RKE2, k3s, EKS, GKE managed control plane) — never false-greens on a control plane it can't see.
Joins pods stuck ContainerCreating past 90s with their kubelet FailedMount / FailedAttachVolume / FailedDetachVolume / ProvisioningFailed events. Names the volume and explains why — closes the gap where PVCs probe sees Bound but the actual mount hangs.
Each probe can be disabled independently with an env var, no fork needed: