Kubernetes, Scaling, Cloud Architecture

Lessons Learned Managing 800+ Kubernetes Clusters

2024-02-28 · 8 min read

Managing Fleet Scale


Running a handful of clusters is easy. Running 800+ across AWS, GCP, and Alibaba Cloud is a different beast entirely. Here are the core principles that keep our ship afloat.


1. GitOps is Non-Negotiable


We use ArgoCD and Flux so that the desired state of every cluster is declared in Git rather than applied by hand. Drift detection is automated: if someone manually changes a config, the change is reverted automatically at the next reconciliation.
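To make the idea of drift detection concrete, here is a minimal sketch in Python. It is not how ArgoCD or Flux are implemented; it only illustrates the principle of comparing declared state against live state and resetting what differs. The function names and the sample manifests are assumptions made up for this example.

```python
import copy
from typing import Any


def find_drift(desired: dict[str, Any], live: dict[str, Any], path: str = "") -> list[str]:
    """Return dotted paths of declared fields whose live value differs.

    Only fields present in `desired` are checked, so fields that exist
    solely in `live` (for example a status block) are ignored.
    """
    drifted: list[str] = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drifted.extend(find_drift(want, have, here))
        elif want != have:
            drifted.append(here)
    return drifted


def reconcile(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `live` with every declared field reset to `desired`."""
    result = copy.deepcopy(live)
    for key, want in desired.items():
        if isinstance(want, dict) and isinstance(result.get(key), dict):
            result[key] = reconcile(want, result[key])
        else:
            result[key] = copy.deepcopy(want)
    return result


# Hypothetical manifests: someone scaled the workload by hand from 3 to 5.
desired = {"spec": {"replicas": 3, "template": {"image": "app:1.4"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "app:1.4"}},
    "status": {"readyReplicas": 5},
}

print(find_drift(desired, live))                      # ['spec.replicas']
print(reconcile(desired, live)["spec"]["replicas"])   # 3
```

A real controller runs this comparison in a loop against the cluster API and Git; the sketch only shows the compare-and-reset step on plain dictionaries.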


2. Standardization over Customization


Every cluster starts with a "Base Profile" that includes:

  • Security hardening (OPA Gatekeeper)
  • Observability agents (Prometheus/Grafana/Splunk)
  • Ingress controllers

Customization is allowed only on top of this immutable base.
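One way to picture "customization only on top of an immutable base" is a merge that lets a cluster add settings but refuses to change anything the base already defines. The sketch below assumes a dictionary-shaped profile; the keys, values, and the `BaseOverrideError` name are hypothetical and not taken from our actual tooling.

```python
import copy
from typing import Any

# Hypothetical base profile mirroring the three items listed above.
BASE_PROFILE: dict[str, Any] = {
    "security": {"policy_engine": "opa-gatekeeper"},
    "observability": {"agents": ["prometheus", "grafana", "splunk"]},
    "ingress": {"enabled": True},
}


class BaseOverrideError(ValueError):
    """Raised when an overlay tries to change a value set by the base."""


def apply_overlay(base: dict[str, Any], overlay: dict[str, Any], path: str = "") -> dict[str, Any]:
    """Merge `overlay` onto a copy of `base`, allowing additions only."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        here = f"{path}.{key}" if path else key
        if key not in merged:
            # New setting: customization on top of the base is fine.
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = apply_overlay(merged[key], value, here)
        elif merged[key] != value:
            raise BaseOverrideError(f"overlay may not change base setting '{here}'")
    return merged


# Adding a new section is accepted.
cluster = apply_overlay(BASE_PROFILE, {"autoscaling": {"max_nodes": 50}})
print(sorted(cluster))

# Trying to switch off something the base mandates is rejected.
try:
    apply_overlay(BASE_PROFILE, {"ingress": {"enabled": False}})
except BaseOverrideError as err:
    print(err)
```

In practice the same rule is often enforced with layered manifests plus admission policies; the point here is only that the base wins every conflict.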


3. Cost Visibility


When you run at this scale, a 10% inefficiency costs millions. We built FinOps practices directly into the engineering workflow, giving teams real-time visibility into what their pods cost.
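As a rough illustration of per-pod cost visibility, the sketch below prices each pod from its CPU and memory requests and rolls the result up by team. The rates, the `PodUsage` fields, and the team names are placeholder assumptions, not our real pricing or data model.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder unit prices; real rates depend on provider, region, and discounts.
CPU_RATE_PER_CORE_HOUR = 0.03
MEMORY_RATE_PER_GIB_HOUR = 0.004


@dataclass(frozen=True)
class PodUsage:
    team: str
    cpu_cores: float    # requested CPU cores
    memory_gib: float   # requested memory in GiB
    hours: float        # how long the pod ran


def pod_cost(pod: PodUsage) -> float:
    """Cost of one pod: hours times the priced CPU and memory requests."""
    hourly = (pod.cpu_cores * CPU_RATE_PER_CORE_HOUR
              + pod.memory_gib * MEMORY_RATE_PER_GIB_HOUR)
    return pod.hours * hourly


def cost_by_team(pods: list[PodUsage]) -> dict[str, float]:
    """Sum pod costs per owning team."""
    totals: dict[str, float] = defaultdict(float)
    for pod in pods:
        totals[pod.team] += pod_cost(pod)
    return dict(totals)


# Hypothetical sample data.
pods = [
    PodUsage("payments", cpu_cores=2.0, memory_gib=4.0, hours=10.0),
    PodUsage("payments", cpu_cores=1.0, memory_gib=2.0, hours=10.0),
    PodUsage("search", cpu_cores=4.0, memory_gib=8.0, hours=5.0),
]

for team, total in sorted(cost_by_team(pods).items()):
    print(f"{team}: ${total:.2f}")
```

Pricing on requests rather than actual usage is a deliberate simplification: requests are what reserve capacity on a node, so they are a reasonable first approximation of what a team is paying for.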