Mar 1, 2026
Building Reliable Blockchain Nodes on Kubernetes with Helm

Most guides stop at getting the node running. This one starts there.
Running a blockchain node locally and running one in production are fundamentally different problems. A node can appear healthy — the container is up, the process is responding — while silently falling behind the chain head, losing peers, exhausting disk space, or returning increasingly slow RPC responses. None of that shows up in a simple uptime check.
Real reliability means the node is healthy, reachable, recoverable, and observable at all times. This post walks through how to build that, using:
Helm
Kubernetes StatefulSets
Persistent storage
Health probes
Affinity and disruption controls
Secure Services and Ingress
Prometheus and Grafana monitoring
This setup also forms the foundation for the high-availability RPC layers and relayer infrastructure covered in later posts.
Helm as the Deployment Contract
Before touching any node configuration, it's worth framing what Helm is actually doing here. It's not just a packaging tool — it's the deployment contract for the node.
Everything that makes a node reliable — how it starts, where it stores data, how it exposes traffic, how it integrates with monitoring — lives inside the chart. That makes the entire operational setup version-controlled, reviewable, and reproducible across clusters without manual configuration drift.
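At its simplest, that contract is a chart plus an install command; the chart, release, and namespace names below are placeholders:

```bash
# Scaffold a chart that will hold every manifest discussed below
# (chart, release, and namespace names are placeholders)
helm create blockchain-node

# Install or upgrade the release from version-controlled values
helm upgrade --install my-node ./blockchain-node \
  --namespace blockchain \
  --create-namespace \
  -f values-production.yaml
```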
From this point on, every reliability decision has a home.
StatefulSets and Persistent Storage
Blockchain nodes are inherently stateful, and this is where most naive setups quietly fail. Without persistent storage, a pod restart is effectively a full re-sync — which on a mature chain can mean hours or days of downtime.
The fix is straightforward: a StatefulSet with a PersistentVolumeClaim template.
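A trimmed-down sketch of what that looks like; the image, ports, storage class, and volume size are placeholders to adapt for your chain:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: blockchain-node
spec:
  serviceName: blockchain-node
  replicas: 2
  selector:
    matchLabels:
      app: blockchain-node
  template:
    metadata:
      labels:
        app: blockchain-node
    spec:
      containers:
        - name: node
          image: example/chain-node:v1.0.0      # placeholder image
          ports:
            - name: rpc
              containerPort: 8545               # placeholder RPC port
            - name: metrics
              containerPort: 9100               # placeholder Prometheus metrics port
          volumeMounts:
            - name: chain-data
              mountPath: /data                  # node data directory
  volumeClaimTemplates:
    - metadata:
        name: chain-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd              # depends on your cluster
        resources:
          requests:
            storage: 500Gi
```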
Using a StatefulSet ensures each replica maintains a stable identity and stays attached to its own dedicated volume. When a pod is rescheduled or restarted, it reconnects to the same data directory — blockchain state is preserved, recovery is fast, and unnecessary sync cycles don't happen.
Health Probes That Actually Reflect Node Behavior
Kubernetes has three probe types, and each one solves a different problem. Using them correctly means Kubernetes understands what's actually happening inside the node, not just whether the process is alive.
Startup Probe
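Something along these lines, assuming the node exposes an HTTP health endpoint on its RPC port:

```yaml
startupProbe:
  httpGet:
    path: /health          # assumes the node exposes an HTTP health endpoint
    port: rpc
  periodSeconds: 10
  failureThreshold: 180    # tolerate up to 30 minutes of initialization
```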
Block replay and database recovery on startup can take a long time. Without a startup probe, Kubernetes will start enforcing liveness checks too early and restart the container mid-initialization. The startup probe buys the node the time it needs before any other checks kick in.
Readiness Probe
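A sketch using the same assumed health endpoint, ideally one that also reflects sync status:

```yaml
readinessProbe:
  httpGet:
    path: /health          # ideally an endpoint that reflects sync status too
    port: rpc
  periodSeconds: 10
  failureThreshold: 3      # out of rotation after ~30 seconds of failures
```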
The readiness probe answers one question: should this pod receive traffic right now? If the node becomes partially synced, degraded, or temporarily unavailable, it's automatically removed from the Service endpoint list. Users never hit a node that isn't ready.
Liveness Probe
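A sketch with a longer tolerance window than the readiness check, so slow responses don't trigger restarts:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: rpc
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 5      # restart after roughly 2.5 minutes of no response
```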
The liveness probe handles a different failure mode — a process that's running but completely stuck. Deadlocks and runtime hangs won't always crash a container. The liveness probe catches them and triggers an automatic restart, eliminating the need for manual intervention.
Resource Limits and Stability
Blockchain nodes are memory-intensive. Without explicit resource limits, a single stressed node can consume everything available on the underlying host and take other workloads down with it.
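Illustrative numbers; actual sizing depends on the chain and expected load:

```yaml
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 16Gi
```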
Requests guide the scheduler in placing the pod on a host with enough capacity. Limits define hard operational boundaries. Together they prevent uncontrolled resource usage, reduce OOMKills, and keep performance predictable under load.
PodDisruptionBudget
Cluster maintenance — node drains, rolling upgrades, scaling events — can bring down pods at any time. Without a PodDisruptionBudget, Kubernetes has no constraint on how many replicas it removes simultaneously.
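For a multi-replica StatefulSet, that looks roughly like this:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: blockchain-node
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: blockchain-node
```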
This guarantees at least one replica stays available during any voluntary disruption. It's a small config addition with a meaningful impact on availability during routine operations.
Anti-Affinity for Failure Isolation
Spreading replicas across worker nodes is cheap insurance against correlated failures. If two replicas land on the same host and that host goes down, both are lost simultaneously.
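A soft (preferred) anti-affinity rule in the pod spec, keyed on hostname:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # spread replicas across hosts
          labelSelector:
            matchLabels:
              app: blockchain-node
```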
Pod anti-affinity encourages the scheduler to distribute replicas across different physical nodes. Infrastructure failures then impact only a subset of replicas rather than the entire RPC layer.
Configuration and Secrets
Keeping configuration out of container images makes deployments flexible and auditable. Two resources handle this cleanly.
ConfigMap — for non-sensitive runtime settings:
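A sketch with illustrative keys; the real settings depend on the node client:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: blockchain-node-config
data:
  LOG_LEVEL: "info"        # illustrative keys; actual ones depend on the client
  CHAIN_ID: "1"
  PRUNING_MODE: "default"
```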
Secret — for credentials and sensitive values:
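A sketch with placeholder values:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: blockchain-node-secrets
type: Opaque
stringData:
  NODE_KEY: "<node-private-key>"     # placeholders; never commit real values
  RPC_AUTH_TOKEN: "<auth-token>"
```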
Access to Secrets is governed by Kubernetes RBAC. In production, secrets should be backed by an external secret manager (like Vault or AWS Secrets Manager) to support centralized governance and rotation without touching Kubernetes directly.
Networking: Private by Default, Controlled External Access
The default networking posture should be private. A ClusterIP Service provides a stable internal endpoint without exposing anything outside the cluster.
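A sketch matching the StatefulSet labels and port names above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: blockchain-node
spec:
  type: ClusterIP
  selector:
    app: blockchain-node
  ports:
    - name: rpc
      port: 8545
      targetPort: rpc
    - name: metrics
      port: 9100
      targetPort: metrics
```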
External access gets added deliberately through an Ingress resource, which handles TLS termination, rate limiting, and forwarding to the internal service:
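A sketch assuming the ingress-nginx controller (for the rate-limit annotation) and an existing TLS secret; the hostname is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blockchain-node
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "20"   # per-client rate limit (ingress-nginx)
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - rpc.example.com              # placeholder hostname
      secretName: rpc-example-com-tls  # TLS cert, e.g. issued by cert-manager
  rules:
    - host: rpc.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: blockchain-node
                port:
                  name: rpc
```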
Node pods stay isolated. External traffic flows through a single managed entry point with TLS and rate limits applied.
Monitoring with Prometheus and Grafana
A node that isn't observable isn't reliable — it's just undetected failures waiting to happen.
Install the monitoring stack:
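One way to do that is the community kube-prometheus-stack chart; the release and namespace names here are examples:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# kube-prometheus-stack bundles Prometheus, Alertmanager, and Grafana
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```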
Wire the node into Prometheus via a ServiceMonitor:
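A sketch that assumes the Service above exposes the named metrics port and that the stack was installed under the release name used earlier:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: blockchain-node
  labels:
    release: monitoring        # must match the Prometheus ServiceMonitor selector
spec:
  selector:
    matchLabels:
      app: blockchain-node
  endpoints:
    - port: metrics            # the named metrics port on the Service above
      interval: 30s
```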
This enables automated metric scraping. Prometheus collects operational data continuously; Grafana provides visualization and alerting on top of it.
What to Actually Monitor
More metrics isn't better — high-signal metrics are. These are the ones that indicate real operational risk before it becomes an incident.
Node health:
Block height — if this stops increasing, the node has stopped progressing
Sync lag — a node significantly behind chain head is operationally unavailable, even if running
Peer count — a sustained drop often precedes synchronization problems
RPC health:
Request rate — traffic spikes stress CPU, memory, and disk I/O simultaneously
Error rate — rising errors signal user-visible degradation
p95 latency — increasing latency is often an early warning before failures occur
Infrastructure:
Disk usage — blockchain data grows continuously; a full disk means a crash
Memory usage — memory pressure raises the likelihood of OOMKills
Restart count — frequent restarts indicate instability or resource exhaustion that needs investigation
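As a concrete starting point, a couple of these translate into PrometheusRule alerts roughly like the following; the block-height metric name is a placeholder that varies by client:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blockchain-node-alerts
  labels:
    release: monitoring              # must match the Prometheus rule selector
spec:
  groups:
    - name: blockchain-node
      rules:
        - alert: BlockHeightStalled
          # chain_block_height is a placeholder; use your client's metric name
          expr: increase(chain_block_height[10m]) == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Node has not imported a new block in 10 minutes"
        - alert: ChainDataDiskAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"chain-data-.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"chain-data-.*"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Less than 10% disk space remaining on the chain data volume"
```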
The Full Picture
Put it all together and a production node deployment looks like this:
A StatefulSet with per-replica persistent volumes for chain data
Startup, readiness, and liveness probes that reflect real node behavior
Resource requests and limits that keep usage bounded
A PodDisruptionBudget and pod anti-affinity for availability through disruptions and host failures
ConfigMaps and Secrets for runtime settings and credentials
A private ClusterIP Service fronted by a TLS-terminating, rate-limited Ingress
Prometheus scraping via a ServiceMonitor, with Grafana dashboards and alerts on top
Each layer addresses a specific failure mode. Helm makes the whole thing reproducible. None of this is over-engineering for a production node — it's the baseline.