S Srenix

Features · K8s probes

21 native K8s probes.

The infra-grade depth no SaaS tool matches because they're not in the cluster. All 21 probes ship in v0.1.0-alpha.1: the core six (Ceph, Postgres, critical workloads, nodes, PVCs, endpoints), six blind-spot probes (Node pressure, system DaemonSets, stuck Pending pods, generic CrashLoopBackOff, ETCD, failed volume mounts), plus Kong, KongRoutes, GPUNodes, HPA, ArgoCD, Velero, Traefik routes, and k3s local-path storage + datastore — this page deep-dives the core twelve.

Ceph storage

v0.1.0-alpha.1 Critical / Warning

Inspects CephCluster CRs (rook-ceph): health, capacity used, OSD readiness. Surfaces the same "is the storage layer alive?" signal you would otherwise have to read out of rook-ceph-tools.

PostgreSQL

v0.1.0-alpha.1 Critical / Warning

CloudNativePG (CNPG) and Zalando Spilo / Patroni clusters, auto-detected from CRDs. Replica readiness and primary identity.

Critical workloads

v0.1.0-alpha.1 Critical / Warning

Counts pods by the READY column, not phase=Running (so CCE/ImagePull failures don't silently green). The list of "critical" workloads is configurable via the SRENIX_CRITICAL_SERVICES env var and the srenix.ai/probe-critical: "true" annotation — no fork needed for non-Srenix clusters.

Cluster Nodes

v0.1.0-alpha.1 Critical

Ready condition across every node. Distinguishes kubectl-error from empty-list to avoid the false-green trap.

PVC binding

v0.1.0-alpha.1 Critical / Warning

PersistentVolumeClaim phase: Bound vs Pending. The "bound but mount fails" class is handled by the new FailedMounts probe below.

External endpoints

v0.1.0-alpha.1 Critical / Warning

Every Ingress host auto-discovered and probed externally. Per-Ingress opt-out via srenix.ai/probe-disable: "true". Layer-1 flake suppression (2-of-2 streak) included.

Node pressure

v0.1.0-alpha.1 Critical / Warning

DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable conditions. DiskPressure and NetworkUnavailable auto-escalate to Critical (eviction or broken pod traffic imminent). The basic Nodes probe only checks Ready — pressure flips first.

System DaemonSets

v0.1.0-alpha.1 Critical

Inspects DaemonSets in eight system namespaces (kube-system, cilium-system, calico-system, kube-flannel, rook-ceph, longhorn-system, openebs, metallb-system). Catches the case where a CNI/CSI plugin is down but the kubelet still reports Ready=True for several minutes.

Pending pods

v0.1.0-alpha.1 Critical

Pods stuck Pending past a 60s grace window with PodScheduled=False. Reason-aware remediation distinguishes Insufficient CPU/Memory, unbound PVC, taint mismatch, nodeSelector miss. Doesn't double-flag ImagePullBackOff (owned by the ImagePullAuth analyzer).

CrashLoopBackOff

v0.1.0-alpha.1 Critical / Warning

Generic any-namespace detector — catches crashes anywhere, not just a hardcoded critical-services list. Protected-namespace pods (kube-system etc.) are always Critical; user-namespace pods start as Warning and escalate past the restart threshold.

ETCD

v0.1.0-alpha.1 Critical / Warning

Watches kubeadm-style static-pod etcd members. Honest "blind probe" Warning for external / managed etcd (RKE2, k3s, EKS, GKE managed control plane) — never false-greens on a control plane it can't see.

Failed mounts

v0.1.0-alpha.1 Critical

Joins pods stuck ContainerCreating past 90s with their kubelet FailedMount / FailedAttachVolume / FailedDetachVolume / ProvisioningFailed events. Names the volume and explains why — closes the gap where PVCs probe sees Bound but the actual mount hangs.

Per-probe opt-out

Each probe can be disabled independently with an env var, no fork needed:

  • SRENIX_PROBE_NODE_PRESSURE=off
  • SRENIX_PROBE_DAEMONSETS=off
  • SRENIX_PROBE_PENDING_PODS=off
  • SRENIX_PROBE_CRASHLOOP=off
  • SRENIX_PROBE_ETCD=off
  • SRENIX_PROBE_FAILED_MOUNTS=off