Engineering operations is the everyday work that keeps software running, including planning and releasing changes, monitoring live systems, fixing incidents, and improving weak spots. When this process lacks clear rules, teams guess at reliability, alerts get noisy, and risky changes slip through.
A straightforward fix is to set a reliability target and manage a small allowance for failure. Service level objectives (SLOs) and error budgets provide the target and allowance, enabling teams to make consistent decisions.
How to Manage Engineering Operations
Managing engineering operations with SLOs and error budgets breaks down into five steps.
Step 1: Start with the User Signal (SLI)
Start the process by selecting one signal per service.
- Availability: percentage of successful requests out of all requests.
- Latency: percentage of requests faster than a set limit (for example, 300 ms).
- Freshness or correctness: for jobs, pipelines, or caches where timeliness or accuracy matters.
Exclude client errors (4xx) from the SLI, since your service cannot control them. Count retried requests as successes only if users would not notice the retries. With this step, the team shares a concrete measure of “good,” which the following steps will target.
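As a concrete illustration, the availability and latency SLIs above can be computed from raw request records. This is a minimal sketch; the record fields (`status`, `duration_ms`) are illustrative assumptions, not part of any standard API.

```python
# Sketch of computing the two SLIs above from raw request records.
# The record fields (status, duration_ms) are illustrative assumptions.

def availability_sli(requests):
    """Fraction of successful requests; 4xx client errors are excluded."""
    counted = [r for r in requests if r["status"] < 400 or r["status"] >= 500]
    good = [r for r in counted if r["status"] < 500]
    return len(good) / len(counted) if counted else 1.0

def latency_sli(requests, threshold_ms=300):
    """Fraction of requests faster than the limit (for example, 300 ms)."""
    if not requests:
        return 1.0
    fast = [r for r in requests if r["duration_ms"] < threshold_ms]
    return len(fast) / len(requests)

reqs = [
    {"status": 200, "duration_ms": 120},
    {"status": 404, "duration_ms": 80},   # client error: excluded
    {"status": 500, "duration_ms": 950},  # server error: counts against SLI
    {"status": 200, "duration_ms": 310},
]
print(round(availability_sli(reqs), 3))  # 0.667 (2 good of 3 counted)
print(latency_sli(reqs))                 # 0.5 (2 of 4 under 300 ms)
```

In production these counts would come from your metrics pipeline rather than in-memory lists; the exclusion and threshold logic stays the same.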
Step 2: Set a Target and a Time Window
Pick a single target and a rolling time window:
- Common windows: 28-30 days for steady services; 7 days for low-traffic or fast-moving ones.
- Example targets: 99.9% availability and a clear P99 latency bound for a tier-1 API; 99% on-time completion for an internal batch.
Decide whether planned maintenance consumes budget (default yes to reflect real impact). Record the target and window in the runbook so that operations and engineering share a single source of truth.
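One lightweight way to keep the target, window, and maintenance decision in a single place that tooling can also read is a small record. The field names and example values below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

# Sketch of an SLO record as it might be kept alongside the runbook.
# Field names and the example values are illustrative assumptions.

@dataclass(frozen=True)
class Slo:
    service: str
    sli: str                  # e.g. "availability" or "p99_latency"
    target: float             # 0.999 means 99.9%
    window_days: int          # rolling window length
    maintenance_counts: bool  # planned maintenance consumes budget?

    @property
    def error_budget(self) -> float:
        """Allowed failure fraction over the window: 1 - target."""
        return 1.0 - self.target

tier1_api = Slo("checkout-api", "availability", 0.999, 28, True)
print(round(tier1_api.error_budget, 4))  # 0.001
```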
Step 3: Track the Error Budget
An SLO defines the acceptable level of failure. The error budget is 1 – SLO over the chosen window.
Two simple ways to calculate:
Event-based (requests):
error_fraction = 1 - (good_events / total_events)
budget_remaining = (1 - SLO) - error_fraction
Time-based (uptime):
downtime_fraction = downtime / window_duration
budget_remaining = (1 - SLO) - downtime_fraction
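The two calculations can be sketched directly in code. The function names and the example numbers are illustrative, not a monitoring API:

```python
# Sketch of the two calculations above; names are illustrative.

def event_budget_remaining(good_events, total_events, slo):
    """Event-based: allowed error fraction minus the observed error fraction."""
    error_fraction = 1 - good_events / total_events
    return (1 - slo) - error_fraction

def time_budget_used(downtime_minutes, window_minutes):
    """Time-based: fraction of the window spent down."""
    return downtime_minutes / window_minutes

# 99.9% SLO over 1,000,000 requests with 400 failures:
# 0.1% of traffic may fail, 0.04% did, so 0.06% of traffic is still in budget.
print(round(event_budget_remaining(999_600, 1_000_000, 0.999), 6))  # 0.0006
# 20 minutes of downtime in a 28-day (40,320-minute) window:
print(round(time_budget_used(20, 28 * 24 * 60), 6))  # 0.000496
```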
Collect metrics with Prometheus or OpenTelemetry. For availability, compute success and total request rates, and then display availability and remaining budget on a dashboard, such as Grafana or Datadog.
Step 4: Alert on Burn Rate
Alert on how fast the budget is being spent, not on a raw error percentage. This ties alerts to user impact.
- Fast burn (page): Short lookback. Fire if the current rate would empty the monthly budget in about an hour.
- Slow burn (ticket): Longer lookback. Open an issue if the trend would empty the budget in a few days.
Burn-rate formula:
burn_rate = (observed_error_fraction) / (allowed_error_fraction)
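The burn-rate formula and the fast/slow thresholds can be sketched as a single check. The roughly-one-hour (page) and few-day (ticket) cutoffs mirror the description above; treat them as starting points, not fixed rules:

```python
# Sketch of the burn-rate check above. The ~1-hour (page) and ~3-day
# (ticket) time-to-empty cutoffs are assumptions to be tuned per service.

def alert_level(observed_error_fraction, slo, window_hours=28 * 24):
    """Return 'page', 'ticket', or None based on time-to-empty-budget."""
    allowed_error_fraction = 1 - slo
    rate = observed_error_fraction / allowed_error_fraction  # burn rate
    if rate <= 0:
        return None  # no errors: budget is not being spent
    hours_to_empty = window_hours / rate
    if hours_to_empty <= 1:
        return "page"    # fast burn: budget gone in about an hour
    if hours_to_empty <= 72:
        return "ticket"  # slow burn: budget gone in a few days
    return None

# 99.9% SLO: a total outage pages; a steady 1% error rate opens a ticket.
print(alert_level(1.0, 0.999))     # page
print(alert_level(0.01, 0.999))    # ticket
print(alert_level(0.0002, 0.999))  # None
```

In practice each level would measure `observed_error_fraction` over its own lookback (short for paging, long for tickets) to avoid flapping on brief spikes.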
Put three items in every alert:
- Budget remaining
- Likely error source
- Rollback link
This speeds up diagnosis and reduces noisy paging.
Step 5: Use the Budget to Guide Changes
Release policy should depend on the budget state:
- Healthy (≥50% remaining). If at least half of the error budget is left, deploy at your usual pace. You can run feature-flagged experiments and do schema or data migrations.
- Low (<50% remaining). When less than half of the budget remains, lower the risk. Merge smaller pull requests, run longer canaries, and add extra readiness and health checks.
- Exhausted (≤0 remaining). Only deploy fixes that reduce user-visible errors or strengthen guardrails such as circuit breakers and stricter timeouts.
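The three-state policy can be sketched as a small gate that release tooling or a review checklist could call; the thresholds follow the list above:

```python
# Sketch of the three-state release policy above. Input is the fraction of
# error budget remaining (1.0 = untouched, 0.0 or less = exhausted).

def release_mode(budget_remaining_fraction):
    if budget_remaining_fraction <= 0:
        return "exhausted"  # fixes and guardrails only
    if budget_remaining_fraction < 0.5:
        return "low"        # smaller PRs, longer canaries, extra checks
    return "healthy"        # usual pace; experiments and migrations allowed

print(release_mode(0.8))   # healthy
print(release_mode(0.3))   # low
print(release_mode(-0.1))  # exhausted
```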
The operations engineer and the service owner should apply this policy in release reviews and during incident triage. Doing so keeps operations and engineering aligned, making everyday engineering operations more predictable and efficient.
Common Mistakes and Fixes
- Too many SLOs. Start with one user-facing SLO per service. Add another only if a different flow needs its own protection. This keeps engineering operations clear.
- Machine metrics as objectives. Do not use CPU, memory, or GC as SLOs. Use signals that users feel: success rate, p95/p99 latency, or data freshness.
- Static error thresholds. A fixed rule, such as error rate > 1%, is noisy. Alert on burn rate instead: send an urgent alert when the budget is being rapidly depleted. This ties decisions to user impact.
- Ignoring dependencies. If a dependency is broken, users see your feature fail, even if your code is fine. Therefore, either include upstream failures in your SLI or display upstream SLOs side by side and use them in release decisions. This keeps operations and engineering aligned.
- No clear owner. Designate an SLO owner (accountable) and an operations engineer (partner). The owner sets targets and release rules; the partner maintains alerts and dashboards. Both approve SLO/alert changes. This supports performing engineering operations consistently.
Conclusion
SLOs and error budgets turn reliability into a routine within engineering operations. Start small: add one availability SLO, add a fast-burn alert, and script rollbacks. As operations and engineering refine the loop, your team will perform engineering operations with less toil and greater confidence.