Building Resilient Systems: SRE Lessons from Telecom to Enterprise
Hope is Not a Strategy
Systems will fail. The only question is whether they fail when you are watching or at 3 AM on a Sunday. After running SRE for telecom platforms across 14 countries at Airtel and now managing 1,000+ Kubernetes clusters at Salesforce, here is what I have learned about building systems that survive failure.
Lesson 1: Define What "Reliable" Means Before Building
At Airtel, I inherited a platform with no formal SLOs. Engineers had an intuitive sense that "the system should be fast and available" but no shared definition of what that meant. Different teams had different expectations, and incidents were declared based on gut feeling.
The first thing I did was define SLOs for every critical service: explicit availability and latency targets, each backed by an error budget.
This changed everything. Instead of arguing about whether an incident was "bad enough" to page someone, we had objective thresholds. Error budgets gave teams permission to ship faster when reliability was high and forced them to slow down when it was low.
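As a concrete illustration, the error-budget arithmetic behind those objective thresholds can be sketched in a few lines. The 99.9% target and request counts below are hypothetical, not Airtel's actual SLOs.

```python
# Error-budget arithmetic for a hypothetical 99.9% availability SLO.
SLO_TARGET = 0.999  # fraction of requests that must succeed

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    return 1.0 - failed_requests / allowed_failures

# 10M requests at 99.9% allow roughly 10,000 failures; 2,500 failures
# spend about a quarter of the budget, leaving ~75%.
remaining = error_budget_remaining(10_000_000, 2_500)
```

When `remaining` is high, teams ship faster; when it approaches zero, feature work slows and reliability work takes over.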
Lesson 2: Multi-Region is Not Optional
At Airtel, we operated across 14 countries spanning Africa, South Asia, and the Middle East. Each region had different infrastructure constraints, latency requirements, and regulatory rules.
Key patterns that worked:
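One pattern that commonly appears in setups like this is health-check-based regional failover: route each request to the lowest-latency region that is still passing health checks. A minimal sketch, with hypothetical region names and latency figures:

```python
# Hypothetical regions with current health status and client latency (ms).
REGIONS = {
    "af-south": {"healthy": True, "latency_ms": 40},
    "ap-south": {"healthy": False, "latency_ms": 25},  # local region is down
    "me-central": {"healthy": True, "latency_ms": 90},
}

def pick_region(regions: dict) -> str:
    """Route to the lowest-latency region that passes health checks."""
    healthy = {name: r for name, r in regions.items() if r["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])
```

Here the nearest region is unhealthy, so traffic fails over to the next-closest healthy one instead of erroring out.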
Lesson 3: Chaos Engineering is Insurance, Not Luxury
We regularly inject failure into our systems:
At Salesforce (K8s Fleet Scale)
At Airtel (Microservices Scale)
The Result
Chaos engineering forced us to build systems that degrade gracefully. Instead of a hard crash, users experience slightly slower load times or reduced functionality. This is the difference between a P1 outage and a minor degradation that nobody notices.
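That graceful-degradation behavior can be sketched as a fault-injection wrapper around a dependency call. `get_recommendations` and its cached fallback are hypothetical stand-ins, not our actual services.

```python
import random

def chaos(failure_rate: float):
    """Decorator that randomly raises, simulating an injected dependency failure."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("chaos: injected failure")
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaos(failure_rate=0.3)
def get_recommendations(user_id: str) -> list:
    # Stand-in for a live call to a recommendations service.
    return ["fresh-item-1", "fresh-item-2"]

def get_recommendations_safe(user_id: str) -> list:
    """Degrade gracefully: serve a stale cache instead of failing the whole page."""
    try:
        return get_recommendations(user_id)
    except ConnectionError:
        return ["cached-item-1"]  # reduced functionality, not an outage
```

Running the chaos decorator in production-like environments is what surfaces the code paths that lack a fallback before a real failure does.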
Lesson 4: The Incident is the Easy Part
Resolving an incident takes hours. Preventing the *next* incident takes weeks. The post-incident review is where the real reliability work happens.
Our post-incident framework:
1. **Contributing factors.** Not "root cause" (complex systems rarely have a single root cause) but the set of conditions that made this incident possible.
2. **Action items with owners and deadlines.** Not "improve monitoring" but "add p99 latency alert for service X with threshold Y, owned by Z, due by [date]."
3. **Follow-through tracking.** Every action item is tracked in our incident management system. We review completion rates monthly. Incomplete action items from post-mortems are the #1 predictor of repeat incidents.
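The follow-through review can be sketched as a couple of queries over tracked action items. The items, owners, and dates below are hypothetical.

```python
from datetime import date

# Hypothetical action items as tracked in an incident-management system.
action_items = [
    {"id": "AI-101", "owner": "alice", "due": date(2024, 5, 1), "done": True},
    {"id": "AI-102", "owner": "bob", "due": date(2024, 5, 15), "done": False},
    {"id": "AI-103", "owner": "carol", "due": date(2024, 6, 1), "done": True},
]
today = date(2024, 6, 1)  # hypothetical monthly-review date

def completion_rate(items) -> float:
    """Fraction of post-incident action items that have been closed."""
    return sum(item["done"] for item in items) / len(items)

def overdue(items, as_of):
    """Open items past their deadline -- the repeat-incident predictors."""
    return [item["id"] for item in items if not item["done"] and item["due"] < as_of]
```

A monthly review then amounts to printing the completion rate and chasing the overdue list by owner.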
Lesson 5: Observability is Not Monitoring
Monitoring tells you *when* something is wrong. Observability tells you *why.* The distinction matters enormously at scale.
At Salesforce, we built a three-layer observability stack: fleet-wide aggregates at the top, per-cluster views in the middle, and node- and pod-level telemetry at the bottom.
The key insight: you need all three layers, and you need to navigate between them in seconds. When an alert fires, the engineer should go from "fleet overview" to "the specific pod on the specific node that is causing the issue" in under 60 seconds.
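That drill-down, from a fleet-wide view to the one pod behind the alert, can be sketched as a walk over layered telemetry. The cluster names, pods, and latency numbers are hypothetical.

```python
# Hypothetical p99 latency samples (ms), keyed cluster -> node -> pod.
fleet_p99 = {
    "cluster-a": {"node-1": {"pod-x": 120, "pod-y": 95}},
    "cluster-b": {"node-7": {"pod-z": 880}},  # the outlier behind the alert
}

def worst_pod(fleet):
    """Flatten the hierarchy and return (cluster, node, pod, p99) for the slowest pod."""
    flat = [
        (cluster, node, pod, p99)
        for cluster, nodes in fleet.items()
        for node, pods in nodes.items()
        for pod, p99 in pods.items()
    ]
    return max(flat, key=lambda row: row[3])
```

In practice the same traversal is a linked set of dashboards and queries, but the point stands: each layer must carry the keys (cluster, node, pod) needed to jump to the next one.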
Lesson 6: Automation Compounds
Every manual operation you automate saves time on every future execution. At scale, this compounds dramatically.
We track "toil hours saved" as a key SRE metric. Last quarter, our automation saved approximately 2,000 engineer-hours. That is an entire engineer's year of work, recovered through tooling.
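The arithmetic behind that metric is simple to sketch. The per-task numbers below are hypothetical; the 2,000-hour engineer-year is just 40 hours a week for 50 weeks.

```python
# One engineer-year at roughly 40 hours/week for 50 weeks.
ENGINEER_YEAR_HOURS = 40 * 50  # 2,000 hours

def quarterly_toil_saved(minutes_per_run: float, runs_per_day: int, days: int = 90) -> float:
    """Engineer-hours recovered in one quarter by automating a single manual task."""
    return minutes_per_run * runs_per_day * days / 60

# e.g. automating a hypothetical 15-minute task run 60 times a day fleet-wide:
saved = quarterly_toil_saved(15, 60)  # 1,350 hours in a single quarter
```

A single frequent task can therefore recover most of an engineer-year on its own, which is why automating the most repetitive toil first pays off fastest.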
The Takeaway
Reliability engineering is not about preventing all failures. It is about building systems where failures are expected, detected quickly, contained automatically, and learned from systematically. The tools change (Kubernetes today, something else tomorrow) but these principles are durable.