Lessons Learned Managing 1,000+ Kubernetes Clusters
The Scale Problem
Running a handful of Kubernetes clusters is straightforward. Running 1,000+ across AWS, GCP, Alibaba Cloud, and on-prem bare-metal is a different beast. Every operational pattern that works at 10 clusters breaks at 100. Every pattern that works at 100 breaks at 1,000.
At Salesforce, I own the reliability of this fleet. Here are the principles and practices that keep it running at 99.99% availability.
Principle 1: GitOps is Non-Negotiable
At fleet scale, any manual configuration is a liability. We enforce strict GitOps using ArgoCD and Flux.
This is not just about consistency. It is about auditability. When an incident happens at 3 AM, we can trace exactly what changed, when, and who approved it.
The ArgoCD + Flux Decision
We use both. ArgoCD handles application deployments (app-of-apps pattern). Flux handles cluster-level addons and infrastructure components. This separation prevents a single reconciliation loop from becoming a bottleneck.
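To make the split concrete, here is a minimal sketch of the two layers: an ArgoCD root Application (app-of-apps) for workloads and a Flux Kustomization for cluster addons. The repository URLs, paths, and names are illustrative placeholders, not our production configuration.

```yaml
# Hypothetical app-of-apps root Application; repo URL, path, and names
# are placeholders for illustration only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/apps.git
    targetRevision: main
    path: clusters/prod-us-east-1   # child Application manifests live here
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
---
# Hypothetical Flux Kustomization for cluster-level addons; the
# GitRepository source "platform-addons" is assumed to be defined elsewhere.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-addons
  path: ./addons/base
  prune: true
```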
Principle 2: Standardization Over Customization
Every cluster starts from an immutable "Base Profile".
Customization is allowed only *on top* of this base. Teams can add their own monitoring dashboards, custom CRDs, and application-specific configurations. But they cannot remove or modify the base layer.
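One way to express this layering is with Kustomize: the base is referenced, never copied or edited. The overlay below is a hypothetical example of what a team-owned cluster overlay can look like; the directory names are made up for illustration.

```yaml
# Hypothetical team overlay (kustomization.yaml): the fleet-wide base is
# consumed as-is, and team additions are layered on top of it.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base              # immutable Base Profile, owned by the platform team
  - monitoring/dashboards   # team-owned additions
  - crds/
```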
Why This Matters
Without standardization, every cluster becomes a snowflake. When you have 1,000 snowflakes, you cannot write automation that works across the fleet. You cannot roll out security patches uniformly. You cannot reason about fleet-wide reliability.
Principle 3: Self-Healing by Default
At this scale, manual intervention does not work. We built custom Kubernetes Operators (Go, Operator SDK) that handle common failure patterns.
These operators reduced manual intervention by roughly 40%.
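The operators themselves are internal, but the general pattern is worth showing: each remediation is exposed as a declarative custom resource that an operator reconciles. The resource below is a hypothetical illustration of that shape, not one of our actual APIs.

```yaml
# Hypothetical custom resource a self-healing operator might reconcile;
# the group, kind, and every field here are illustrative only.
apiVersion: remediation.example.com/v1alpha1
kind: NodeRemediation
metadata:
  name: drain-notready-nodes
spec:
  selector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  trigger:
    condition: NotReady     # node condition that triggers remediation
    duration: 10m           # how long the condition must persist first
  action: DrainAndReplace   # cordon, drain, then request a replacement node
  maxConcurrent: 1          # never remediate more than one node at a time
```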
Principle 4: Networking at Fleet Scale
Networking is where fleet-scale Kubernetes gets genuinely hard.
The IPVS Story
One of our most impactful projects was replacing iptables-based kube-proxy with IPVS across the fleet. At 1,000+ clusters, the iptables rule count was causing measurable latency in service routing. IPVS, which resolves services through hash-table lookups rather than sequential rule evaluation, reduced this by an order of magnitude. But rolling this out required custom kernel module builds, extensive testing across RHEL versions, and a careful phased rollout.
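For reference, the IPVS switch itself is a small configuration change; a minimal sketch of the upstream KubeProxyConfiguration follows (not our exact fleet settings). The hard part was everything around it: the kernel modules, the RHEL matrix, and the phased rollout.

```yaml
# Minimal kube-proxy configuration selecting IPVS mode; requires the
# ip_vs, ip_vs_rr, and nf_conntrack kernel modules on every node.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"    # round-robin; other schedulers (lc, sh, ...) are available
  syncPeriod: 30s
```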
Principle 5: FinOps is an Engineering Practice
When you run at this scale, a 10% inefficiency costs millions annually. We embedded FinOps directly into the engineering workflow.
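The simplest building block is a hard quota stamped onto every namespace at provisioning time, so over-provisioning surfaces at review rather than on the invoice. A minimal sketch, with placeholder numbers and names:

```yaml
# Hypothetical per-namespace guardrail: caps requested and limit resources
# so cost has a hard ceiling and growth requires an explicit change in Git.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    limits.cpu: "400"
    limits.memory: 1600Gi
```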
Principle 6: Observability is the Foundation
You cannot manage what you cannot see.
The key insight: observability at fleet scale is not about more data. It is about better aggregation and routing. Nobody can watch 1,000 cluster dashboards. You need automated anomaly detection that surfaces the 3 clusters that need attention right now.
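A minimal sketch of that aggregation pattern, assuming each cluster's Prometheus forwards a reduced set of series to a regional write endpoint instead of relying on flat federation; the URL, labels, and metric allowlist are placeholders.

```yaml
# Hypothetical per-cluster Prometheus config: tag every series with its
# origin, then forward only fleet-relevant metrics to a regional tier.
global:
  external_labels:
    cluster: prod-us-east-1
    region: us-east-1
remote_write:
  - url: https://metrics-ingest.us-east-1.example.com/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "kube_.*|node_.*|apiserver_request_total"
        action: keep        # everything else stays local to the cluster
```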
What Breaks at 1,000 Clusters
Some things that worked fine at smaller scale and failed spectacularly:
1. **Single-cluster monitoring.** Prometheus per cluster works. Prometheus federation across 1,000 clusters does not. You need a hierarchical aggregation strategy.
2. **Manual upgrade rollouts.** Upgrading Kubernetes versions must be automated with canary clusters, automated testing, and rollback triggers (see the sketch after this list).
3. **Centralized control planes.** A single management cluster for 1,000 workload clusters becomes a single point of failure. We use regional control planes with eventual consistency.
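On the upgrade point specifically, the rollout is easiest to reason about when the waves themselves are declarative. The snippet below is a purely hypothetical schema, sketched only to show the shape of a wave definition with health gates and an automatic rollback trigger; it is not a real Kubernetes API or a tool we ship.

```yaml
# Hypothetical upgrade-wave definition (illustrative schema, not a real CRD):
# canary clusters go first, and any failed gate halts and rolls back the wave.
targetVersion: "1.29"
waves:
  - name: canary
    clusterSelector:
      tier: canary
    maxConcurrent: 5
  - name: regional
    clusterSelector:
      tier: production
    maxConcurrent: 25
gates:
  - conformanceSuite: passed
  - apiserverErrorRate: "< 0.1%"
rollback:
  trigger: anyGateFailed
```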
The Takeaway
Fleet-scale Kubernetes is an engineering discipline, not an operations task. It requires strong software engineering (building operators, automation, custom tooling), deep systems knowledge (kernel, networking, storage), and rigorous operational practices (GitOps, observability, incident management).
If you are scaling from 10 to 100 clusters, invest in GitOps and standardization first. If you are scaling from 100 to 1,000, invest in self-healing operators and fleet-wide observability. The patterns compound.