Lessons Learned Managing 1,000+ Kubernetes Clusters
The Scale Problem
Running a handful of Kubernetes clusters is straightforward. Running 1,000+ across AWS, GCP, Alibaba Cloud, and on-prem bare-metal is a different beast. Every operational pattern that works at 10 clusters breaks at 100. Every pattern that works at 100 breaks at 1,000.
At Salesforce, I own the reliability of this fleet. Here are the principles and practices that keep it running at 99.99% availability.
Principle 1: GitOps is Non-Negotiable
At fleet scale, any manual configuration is a liability. We enforce strict GitOps using ArgoCD and Flux.
This is not just about consistency. It is about auditability. When an incident happens at 3 AM, we can trace exactly what changed, when, and who approved it.
The ArgoCD + Flux Decision
We use both. ArgoCD handles application deployments (app-of-apps pattern). Flux handles cluster-level addons and infrastructure components. This separation prevents a single reconciliation loop from becoming a bottleneck.
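To make the split concrete, here is a minimal sketch of the two layers: an ArgoCD root Application (app-of-apps) for workloads and a Flux Kustomization for cluster addons. The repository URLs, paths, and names are illustrative placeholders, not our production configuration.

```yaml
# Hypothetical app-of-apps root Application; repo URL, path, and names
# are placeholders for illustration only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/apps.git
    targetRevision: main
    path: clusters/prod-us-east-1   # child Application manifests live here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
---
# Hypothetical Flux Kustomization for cluster-level addons; the
# GitRepository source "platform-addons" is assumed to be defined elsewhere.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-addons
  path: ./addons/base
  prune: true
```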
Principle 2: Standardization Over Customization
Every cluster starts from an immutable "Base Profile".
Customization is allowed only *on top* of this base. Teams can add their own monitoring dashboards, custom CRDs, and application-specific configurations. But they cannot remove or modify the base layer.
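One way to express this layering is with Kustomize: the base is referenced, never copied or edited. The overlay below is a hypothetical example of what a team-owned cluster overlay can look like; the directory names are made up for illustration.

```yaml
# Hypothetical team overlay (kustomization.yaml): the fleet-wide base is
# consumed as-is, and team additions are layered on top of it.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # immutable Base Profile, owned by the platform team
  - monitoring/dashboards   # team-owned additions
  - crds/
```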
Why This Matters
Without standardization, every cluster becomes a snowflake. When you have 1,000 snowflakes, you cannot write automation that works across the fleet. You cannot roll out security patches uniformly. You cannot reason about fleet-wide reliability.
Principle 3: Self-Healing by Default
At this scale, manual intervention does not work. We built custom Kubernetes Operators (Go, Operator SDK) that handle common failure patterns.
These operators reduced manual intervention by roughly 40%.
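The operators themselves are internal, but the general pattern is worth showing: each remediation is exposed as a declarative custom resource that an operator reconciles. The resource below is a hypothetical illustration of that shape, not one of our actual APIs.

```yaml
# Hypothetical custom resource a self-healing operator might reconcile;
# the group, kind, and every field here are illustrative only.
apiVersion: remediation.example.com/v1alpha1
kind: NodeRemediation
metadata:
  name: drain-notready-nodes
spec:
  selector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  trigger:
    condition: NotReady     # node condition that triggers remediation
    duration: 10m           # how long the condition must persist first
  action: DrainAndReplace   # cordon, drain, then request a replacement node
  maxConcurrent: 1          # never remediate more than one node at a time
```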
Principle 4: Networking at Fleet Scale
Networking is where fleet-scale Kubernetes gets genuinely hard.
The IPVS Story
One of our most impactful projects was replacing iptables-based kube-proxy with IPVS across the fleet. At 1,000+ clusters, the iptables rule count was causing measurable latency in service routing. IPVS, which resolves services through hash-table lookups rather than sequential rule evaluation, reduced this by an order of magnitude. But rolling this out required custom kernel module builds, extensive testing across RHEL versions, and a careful phased rollout.
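For reference, the IPVS switch itself is a small configuration change; a minimal sketch of the upstream KubeProxyConfiguration follows (not our exact fleet settings). The hard part was everything around it: the kernel modules, the RHEL matrix, and the phased rollout.

```yaml
# Minimal kube-proxy configuration selecting IPVS mode; requires the
# ip_vs, ip_vs_rr, and nf_conntrack kernel modules on every node.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"    # round-robin; other schedulers (lc, sh, ...) are available
  syncPeriod: 30s
```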
Principle 5: FinOps is an Engineering Practice
When you run at this scale, a 10% inefficiency costs millions annually. We embedded FinOps directly into the engineering workflow.
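The simplest building block is a hard quota stamped onto every namespace at provisioning time, so over-provisioning surfaces at review rather than on the invoice. A minimal sketch, with placeholder numbers and names:

```yaml
# Hypothetical per-namespace guardrail: caps requested and limit resources
# so cost has a hard ceiling and growth requires an explicit change in Git.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    limits.cpu: "400"
    limits.memory: 1600Gi
```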
Principle 6: Observability is the Foundation
You cannot manage what you cannot see.
The key insight: observability at fleet scale is not about more data. It is about better aggregation and routing. Nobody can watch 1,000 cluster dashboards. You need automated anomaly detection that surfaces the 3 clusters that need attention right now.
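A minimal sketch of that aggregation pattern, assuming each cluster's Prometheus forwards a reduced set of series to a regional write endpoint instead of relying on flat federation; the URL, labels, and metric allowlist are placeholders.

```yaml
# Hypothetical per-cluster Prometheus config: tag every series with its
# origin, then forward only fleet-relevant metrics to a regional tier.
global:
  external_labels:
    cluster: prod-us-east-1
    region: us-east-1
remote_write:
  - url: https://metrics-ingest.us-east-1.example.com/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "kube_.*|node_.*|apiserver_request_total"
        action: keep        # everything else stays local to the cluster
```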
What Breaks at 1,000 Clusters
Some things that worked fine at smaller scale and failed spectacularly:
1. **Single-cluster monitoring.** Prometheus per cluster works. Prometheus federation across 1,000 clusters does not. You need a hierarchical aggregation strategy.
2. **Manual upgrade rollouts.** Upgrading Kubernetes versions must be automated with canary clusters, automated testing, and rollback triggers (see the sketch after this list).
3. **Centralized control planes.** A single management cluster for 1,000 workload clusters becomes a single point of failure. We use regional control planes with eventual consistency.
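On the upgrade point specifically, the rollout is easiest to reason about when the waves themselves are declarative. The snippet below is a purely hypothetical schema, sketched only to show the shape of a wave definition with health gates and an automatic rollback trigger; it is not a real Kubernetes API or a tool we ship.

```yaml
# Hypothetical upgrade-wave definition (illustrative schema, not a real CRD):
# canary clusters go first, and any failed gate halts and rolls back the wave.
targetVersion: "1.29"
waves:
  - name: canary
    clusterSelector:
      tier: canary
    maxConcurrent: 5
  - name: regional
    clusterSelector:
      tier: production
    maxConcurrent: 25
gates:
  - conformanceSuite: passed
  - apiserverErrorRate: "< 0.1%"
rollback:
  trigger: anyGateFailed
```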
The Takeaway
Fleet-scale Kubernetes is an engineering discipline, not an operations task. It requires strong software engineering (building operators, automation, custom tooling), deep systems knowledge (kernel, networking, storage), and rigorous operational practices (GitOps, observability, incident management).
If you are scaling from 10 to 100 clusters, invest in GitOps and standardization first. If you are scaling from 100 to 1,000, invest in self-healing operators and fleet-wide observability. The patterns compound.