Site Reliability Engineering

Users expect every online service to work all the time. Every minute of downtime erodes trust, costs money, and slows teams. Site reliability engineering (SRE) tackles this problem head-on by treating operations as a software problem, giving engineering teams a disciplined approach to keeping systems stable and running efficiently.

What Is Site Reliability Engineering?

Site reliability engineering is a practice that applies software engineering skills to infrastructure and operations. The goal is simple:

Keep services reliable, scalable, and efficient without slowing product delivery.

Google introduced the term in 2003, but the ideas apply to any team that runs code in production. SRE rests on four core concepts:

1. Service-Level Indicator (SLI)

A direct measurement of system health, e.g., “successful HTTP 200 responses divided by total requests.”

2. Service-Level Objective (SLO)

The target value or range for an SLI, such as “99.9% of requests return HTTP 200 in a rolling 30-day window.”

3. Service-Level Agreement (SLA)

A legal or commercial promise to customers. SLAs are broader than SLOs and often include penalties for breach of contract.

4. Error Budget

Once you’ve set a service-level objective (SLO), you need to know how much failure you can afford before you must slow down releases. That allowance is the error budget. Treat it as real currency: spend it on riskier changes; pause spending when the balance runs low.

Error Budget (fraction) = 1 − SLO
Error Budget (minutes) = (1 − SLO) × Minutes in Period
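
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or tool: the function names and the choice to treat an empty measurement window as healthy are assumptions made for the example.

```python
def sli_success_ratio(successful: int, total: int) -> float:
    """SLI: successful responses divided by total requests."""
    if total == 0:
        # Assumption: a window with no traffic counts as healthy.
        return 1.0
    return successful / total


def error_budget_fraction(slo: float) -> float:
    """Error Budget (fraction) = 1 - SLO."""
    return 1.0 - slo


def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Error Budget (minutes) = (1 - SLO) x minutes in the period."""
    minutes_in_period = period_days * 24 * 60
    return (1.0 - slo) * minutes_in_period


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
print(f"{error_budget_minutes(0.999):.1f} minutes")
```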

Example Calculation

1. Define SLIs

Success rate (HTTP 200/total)
95th-percentile latency

2. Set SLOs

Success rate ≥ 99.95%
95th-percentile latency ≤ 250 ms

3. Compute the budget

99.95% over 30 days (30 × 24 × 60 = 43,200 minutes)

(1 − 0.9995) × 43,200 = 21.6 min

4. Use the error budget to guide releases

With 21.6 minutes of downtime available for the month, plan each rollout to “spend” only a fraction of that allowance.

Begin with a canary deployment. Route 5% of production traffic to API v2.
Watch the key SLIs in real time. Compare the success rate and latency for v2 against v1.
Set an automatic rollback rule. If the v2 success rate drops below 99.9% for more than five minutes, revert immediately.
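
The rollback rule above could be evaluated roughly as follows. This is a sketch under stated assumptions: the canary's success rate is sampled once per minute, "more than five minutes" means more than five consecutive readings below the threshold, and the function name `should_roll_back` is hypothetical rather than part of any deployment tool.

```python
from typing import Iterable


def should_roll_back(
    per_minute_success: Iterable[float],
    threshold: float = 0.999,
    max_bad_minutes: int = 5,
) -> bool:
    """Return True once the success rate has stayed below `threshold`
    for more than `max_bad_minutes` consecutive one-minute readings."""
    consecutive_bad = 0
    for rate in per_minute_success:
        if rate < threshold:
            consecutive_bad += 1
            if consecutive_bad > max_bad_minutes:
                return True
        else:
            # A healthy minute resets the streak.
            consecutive_bad = 0
    return False


# Six bad minutes in a row trigger a rollback; five do not.
print(should_roll_back([0.99] * 6))
print(should_roll_back([0.99] * 5))
```

Resetting the streak on a healthy minute keeps a single noisy reading from reverting the release, at the cost of reacting more slowly to intermittent failures.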

5. Track burn rate

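Burn rate = Minutes of actual downtime / Error budget, with both measured over the same SLO period. Defined this way, it is the fraction of the budget already consumed.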

If a bug in v2 causes 10 minutes of downtime, the burn will be:

10 / 21.6 ≈ 0.46 (or 46% burned)

Burn rate ≥ 1 → the budget is exhausted; freeze risky changes.
Burn rate < 1 → you still have budget to “spend.”
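
The burn calculation and the release decision can be written as a small helper. This is an illustrative sketch: the function names are hypothetical, and treating a burn of exactly 1 as an exhausted budget is an assumption made here.

```python
def burn_rate(downtime_minutes: float, budget_minutes: float) -> float:
    """Minutes of actual downtime divided by the error budget in minutes."""
    if budget_minutes <= 0:
        raise ValueError("error budget must be positive")
    return downtime_minutes / budget_minutes


def release_decision(burn: float) -> str:
    """Map the burn value to the release policy described in the text."""
    # Assumption: exactly 1.0 means the budget is fully spent.
    if burn >= 1.0:
        return "freeze risky changes"
    return "budget remains"


burn = burn_rate(10, 21.6)
print(f"{burn:.2f} -> {release_decision(burn)}")  # 0.46 -> budget remains
```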

Practical Guidance and Best Practices

Treat SLOs as code. Store them in version control next to the service configuration.
Automate all repetitive tasks. Use Terraform for infrastructure, GitHub Actions for CI/CD, and incident bots for chat-ops.
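Prefer gradual rollouts. Canary releases, blue-green deployments, and feature flags limit how many users a bad change reaches and make rollback a quick traffic or flag switch, which reduces mean time to recovery (MTTR).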
Run game days. Simulate real failures to verify run-books and paging workflows.
Share context. A concise README in every repository beats tribal knowledge.
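
One way to treat SLOs as code is to declare them in a small, validated data structure that lives in the service repository and is reviewed like any other change. The sketch below assumes Python as the definition format; the `SLO` class, its field names, and the `checkout-api` service are hypothetical examples, and many teams use YAML or a dedicated tool instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    service: str      # hypothetical service name
    sli: str          # name of the indicator being measured
    target: float     # e.g. 0.9995 for 99.95%
    window_days: int  # length of the rolling window

    def __post_init__(self) -> None:
        # Reject obviously invalid definitions at review time.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")

    @property
    def error_budget_minutes(self) -> float:
        return (1.0 - self.target) * self.window_days * 24 * 60


# Versioned alongside the service configuration.
SLOS = [
    SLO(service="checkout-api", sli="success_rate", target=0.9995, window_days=30),
]

for slo in SLOS:
    print(f"{slo.service}: {slo.error_budget_minutes:.1f} min budget")
```

Because the definitions are plain code, a CI job can import them and fail the build when a target is malformed.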

Common Mistakes to Avoid

1. Treating SRE and DevOps as rivals

DevOps is a broad mindset about collaboration and automation. SRE is one way to put that mindset into practice, with specific methods such as SLOs and error budgets. Don’t get stuck on job titles; focus on applying the practices that make the service reliable.

2. Unrealistic SLOs

Setting a “five nines” (99.999%) uptime target sounds impressive, but it allows only about 26 seconds of downtime in a 30-day month. Choose an SLO that matches real user expectations and the business cost of an outage.

3. Relying on aggregate SLIs

Tracking only global averages can hide local failures. For example, a 99% overall success rate might mask a full outage in one region or for one major customer. Break down SLIs by region, customer tier, or critical path so localized problems surface quickly.
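
A short sketch shows how an aggregate can hide a regional outage. The request records, region names, and the `success_rate_by` helper are made up for illustration; in practice the same grouping would usually be done in the metrics system.

```python
from collections import defaultdict


def success_rate_by(records: list[dict], key: str) -> dict[str, float]:
    """Compute the success rate (HTTP 200 / total) per value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [successes, total]
    for record in records:
        bucket = counts[record[key]]
        bucket[1] += 1
        if record["status"] == 200:
            bucket[0] += 1
    return {value: ok / total for value, (ok, total) in counts.items()}


# Hypothetical traffic: one small region is completely down.
records = (
    [{"region": "us-east", "status": 200}] * 990
    + [{"region": "ap-south", "status": 500}] * 10
)

overall = sum(r["status"] == 200 for r in records) / len(records)
per_region = success_rate_by(records, "region")

print(f"overall: {overall:.2%}")  # 99.00%, looks healthy
for region, rate in per_region.items():
    print(f"{region}: {rate:.2%}")  # ap-south is at 0.00%
```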

4. Alert fatigue

Paging on every small fluctuation trains engineers to ignore alarms. An alert should only fire when immediate human action is required. Use multi-window burn rate alerts and remove or silence noisy, non-actionable thresholds.
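
A multi-window burn rate alert can be sketched as below. Note that in this alerting context, burn rate usually means the observed error ratio divided by the error ratio the SLO allows, so a value of 1 consumes the budget exactly over the SLO window. The one-hour and five-minute windows and the 14.4 threshold follow a commonly cited pattern for a 30-day SLO and are assumptions here, as are the function names.

```python
def window_burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO)."""
    return error_ratio / (1.0 - slo)


def should_page(
    long_window_error_ratio: float,   # e.g. errors / requests over 1 hour
    short_window_error_ratio: float,  # e.g. errors / requests over 5 minutes
    slo: float,
    threshold: float = 14.4,          # assumed paging threshold
) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so resolved incidents stop paging.
    """
    return (
        window_burn_rate(long_window_error_ratio, slo) >= threshold
        and window_burn_rate(short_window_error_ratio, slo) >= threshold
    )


# 2% errors in both windows against a 99.9% SLO: page.
print(should_page(0.02, 0.02, slo=0.999))
# The last five minutes have recovered: do not page.
print(should_page(0.02, 0.001, slo=0.999))
```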

Tips for Applying SRE in Real Projects

Address the noisiest service first. Pick the component that pages most often, set one clear uptime goal, add Prometheus metrics, and watch how quickly the alerts quiet down.
Write a short incident memo within two days. Capture what broke, how it hurt users, why it happened, and one concrete fix. Then turn the fix into a tracked ticket.
Reserve 10% of each sprint for resilience work. Use it for load tests, failover drills, or removing single points of failure to steadily improve reliability.
Automate any runbook used twice. After the second manual run, wrap the steps in a script or CI job so the third time is one click.
Scan the burn-rate graph every morning. A flat line means keep shipping; a steep climb means pause releases and repair the errors eating the budget.

Conclusion: Reliability as a Feature

Reliable software is not luck; it’s the result of site reliability engineering done with intent. Start today by choosing one service, one SLI, and one clear SLO. Instrument it, track your error budget, and iterate. Then scale the practice across your stack. Treat reliability as a product feature, and releases can ship both faster and more safely.