Lessons Learned Managing 800+ Kubernetes Clusters

Managing Fleet Scale

Running a handful of clusters is easy. Running 800+ across AWS, GCP, and Alibaba Cloud is a different beast entirely. Here are the core principles that keep our ship afloat.

1. GitOps is Non-Negotiable

We use Argo CD and Flux to keep the state of every cluster declarative, with Git as the source of truth. Drift detection is automated: if someone changes a config by hand, the change is reverted automatically.
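To make drift detection concrete, here is a minimal sketch of the idea: compare the desired state from Git with the live state from the cluster and report every field that differs. It illustrates the concept only and is not how Argo CD or Flux implement it. The function names find_drift and reconcile and the sample manifests are hypothetical, and real controllers also ignore fields that the cluster itself populates (status, defaulted values), which this sketch does not.

```python
import copy
from typing import Any


def find_drift(desired: dict[str, Any], live: dict[str, Any], path: str = "") -> list[str]:
    """Return dotted paths of fields where live state differs from desired state."""
    drifted: list[str] = []
    for key in sorted(desired.keys() | live.keys()):
        here = f"{path}.{key}" if path else key
        if key not in desired or key not in live:
            # Field was added or removed outside of Git.
            drifted.append(here)
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse into nested mappings so the report names the exact field.
            drifted.extend(find_drift(desired[key], live[key], here))
        elif desired[key] != live[key]:
            drifted.append(here)
    return drifted


def reconcile(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, Any]:
    """Return the state to apply. Git always wins, so any drift is discarded."""
    if find_drift(desired, live):
        return copy.deepcopy(desired)
    return live


if __name__ == "__main__":
    # Hypothetical manifests: someone scaled the deployment by hand.
    desired = {"spec": {"replicas": 3, "image": "app:1.4.2"}}
    live = {"spec": {"replicas": 5, "image": "app:1.4.2"}}
    print(find_drift(desired, live))
    print(reconcile(desired, live) == desired)
```

The important design choice is that reconciliation is one-directional: the live cluster is never treated as a source of changes, only as something to be corrected.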

2. Standardization over Customization

Every cluster starts with a "Base Profile" that includes:

Security hardening (OPA Gatekeeper)

Observability agents (Prometheus/Grafana/Splunk)

Ingress controllers

Customization is allowed only on top of this immutable base.
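One way to picture "customization only on top of an immutable base" is a merge that lets a team add settings but refuses any overlay that changes a base value. The sketch below assumes a simplified dictionary representation. The keys in BASE_PROFILE, the apply_overlay function, and the BaseProfileViolation exception are all hypothetical names chosen for illustration, not our actual profile schema.

```python
import copy
from typing import Any

# Hypothetical, simplified stand-in for the Base Profile described above.
BASE_PROFILE: dict[str, Any] = {
    "gatekeeper": {"enabled": True},
    "observability": {"prometheus": True},
    "ingress": {"enabled": True},
}


class BaseProfileViolation(ValueError):
    """Raised when an overlay tries to change a value set by the base profile."""


def apply_overlay(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Merge overlay onto base. Additions are allowed, changes to base values are not."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if key not in base:
            # New setting: teams may add anything the base does not define.
            merged[key] = copy.deepcopy(value)
        elif isinstance(base[key], dict) and isinstance(value, dict):
            # Nested section: apply the same rule one level down.
            merged[key] = apply_overlay(base[key], value)
        elif base[key] != value:
            raise BaseProfileViolation(f"overlay may not change base setting {key!r}")
    return merged


if __name__ == "__main__":
    team_overlay = {
        "observability": {"extra_dashboards": ["checkout"]},
        "team_addons": {"redis": True},
    }
    print(apply_overlay(BASE_PROFILE, team_overlay))

    try:
        apply_overlay(BASE_PROFILE, {"gatekeeper": {"enabled": False}})
    except BaseProfileViolation as err:
        print(f"rejected: {err}")
```

In a real fleet this rule would more likely be enforced by admission policies and repository review than by a merge function, but the contract is the same: overlays extend the base and never override it.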

3. Cost Visibility

At this scale, a 10% inefficiency costs millions. We built FinOps practices directly into the engineering workflow, giving teams real-time visibility into their pod costs.
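As a rough illustration of per-pod cost visibility, the sketch below prices each pod from its CPU and memory requests and rolls the result up by team. The hourly rates, the PodUsage type, and the sample pods are made-up assumptions for the example. Real rates vary by cloud, region, and instance type, and a production system would also account for idle capacity and shared overhead.

```python
from dataclasses import dataclass

# Hypothetical blended rates in USD, chosen only to make the arithmetic visible.
CPU_RATE_PER_CORE_HOUR = 0.03
MEM_RATE_PER_GIB_HOUR = 0.004


@dataclass(frozen=True)
class PodUsage:
    team: str
    cpu_cores: float   # requested CPU cores
    memory_gib: float  # requested memory in GiB
    hours: float       # hours the pod was scheduled


def pod_cost(pod: PodUsage) -> float:
    """Cost of one pod: hours * (cores * CPU rate + GiB * memory rate)."""
    hourly = pod.cpu_cores * CPU_RATE_PER_CORE_HOUR + pod.memory_gib * MEM_RATE_PER_GIB_HOUR
    return pod.hours * hourly


def cost_by_team(pods: list[PodUsage]) -> dict[str, float]:
    """Sum pod costs per owning team."""
    totals: dict[str, float] = {}
    for pod in pods:
        totals[pod.team] = totals.get(pod.team, 0.0) + pod_cost(pod)
    return totals


if __name__ == "__main__":
    pods = [
        PodUsage(team="checkout", cpu_cores=2.0, memory_gib=4.0, hours=10.0),
        PodUsage(team="checkout", cpu_cores=1.0, memory_gib=2.0, hours=10.0),
        PodUsage(team="search", cpu_cores=4.0, memory_gib=8.0, hours=5.0),
    ]
    for team, cost in sorted(cost_by_team(pods).items()):
        print(f"{team}: ${cost:.2f}")
```

Pricing on requests rather than actual usage is a deliberate simplification here: requests are what reserve capacity on a node, so they are a reasonable first approximation of what a team costs the fleet.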