Layer 2: SLOs as Control Loops — From Metrics to Governance
Most organizations treat SLOs (Service Level Objectives) as a reporting mechanism: a dashboard that confirms at the end of the month whether they met their reliability commitments. This is backwards. In mature autonomous operations, SLOs are not reports; they are active control signals that govern system behavior in real time.
When your error budget is being consumed too quickly, the system automatically throttles deployments. When a canary deployment starts degrading your SLO, it rolls back automatically. When capacity pressure will push latency past your SLO threshold in the next hour, autoscaling fires before the threshold is crossed. The SLO is not something you check at the end of the month; it is something the system defends, moment by moment, with automated decisions.
This requires a shift in how you think about SLOs. They are not aspirational targets. They are operational constraints that the system must respect.
SLOs as First Principles
Before you can use SLOs as control signals, you must define them correctly. And "correctly" means grounded in business reality, not technical fashion.
Ask yourself:
- Why this SLO? If your answer is "everyone else does 99.99%," you have not thought about this. Your answer should be "our customers will churn if we drop below 99.95%, so that is our SLO."
- What happens if we miss it? If missing your SLO costs you $0, it is not an SLO. It is a guess. Real SLOs have business consequences.
- Can we achieve it reliably? If your SLO requires 99.99% availability but your database is only available 99.95%, you cannot meet your SLO. The constraint is upstream.
The SLOs that work as control loops are the ones grounded in first principles: what does the business actually need, and what does our infrastructure actually support?
The Architecture of SLO-Driven Control
Measuring SLOs: From Prometheus to Grafana
An SLO is defined in terms of Service Level Indicators (SLIs)—the actual metrics you measure. For a web service:
SLI: (successful requests) / (total requests)
SLO: SLI ≥ 99.95% over rolling 30-day window
Where:
- successful = HTTP status 2xx or 3xx (configurable)
- total = all requests (including failures)
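The same arithmetic in code, as a minimal Python sketch of the definitions above (the request counts are made up for illustration):

```python
# Sketch of the SLI / error-budget arithmetic, not a Prometheus query.

SLO_TARGET = 0.9995            # 99.95% over a rolling 30-day window
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in the window

def sli(successful: int, total: int) -> float:
    """Success ratio: 2xx/3xx responses over all requests."""
    return successful / total if total else 1.0

def error_budget_minutes() -> float:
    """Minutes of full outage the SLO tolerates per window."""
    return (1 - SLO_TARGET) * WINDOW_MINUTES

# Example: 10,000,000 requests, 4,000 failures
ratio = sli(successful=9_996_000, total=10_000_000)
print(f"SLI: {ratio:.4%}")                          # 99.9600%
print(f"Budget: {error_budget_minutes():.1f} min")  # 21.6 min
```

Note how small the budget is at 99.95%: roughly 21.6 minutes of total failure per month, which is why burn rate matters more than the raw percentage.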
In Prometheus, this looks like:
# Prometheus recording rules (expressions only; wrap in rule-group YAML)

# 5-minute success ratio
slo:request_success_rate:ratio5m =
  sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
    /
  sum(rate(http_requests_total[5m]))

# 30-day compliance, averaged over the rolling window via a subquery
slo:request_success_rate:30d =
  avg_over_time(slo:request_success_rate:ratio5m[30d:5m])

# Error budget: minutes of total failure allowed per 30 days
error_budget:available =
  (1 - 0.9995) * 30 * 24 * 60    # = 0.0005 * 43,200
  ≈ 21.6 minutes allowed to fail

# Error budget consumed so far in the window (minutes)
error_budget:consumed =
  (1 - slo:request_success_rate:30d) * 30 * 24 * 60

# Burn rate: 1.0 means errors arrive exactly fast enough to exhaust
# the budget over the full window; above 1.0 is unsustainable
error_budget:burn_rate:1h =
  (1 - avg_over_time(slo:request_success_rate:ratio5m[1h:5m]))
    / (1 - 0.9995)
Grafana visualizes this. A healthy SLO dashboard shows:
- Current compliance percentage (green if ≥99.95%, yellow if in danger, red if violated)
- Error budget remaining (days before you hit your limit at current burn rate)
- Burn rate (how fast are you consuming error budget?)
- Trend (is burn rate accelerating or decelerating?)
The Control Loop: From SLO to Action
Once you have SLOs measured in real time, you feed them into control loops:
┌────────────────────────────────────────────┐
│ Every decision point in your system:       │
│   - Can we deploy?                         │
│   - Should we scale?                       │
│   - Are we spending error budget wisely?   │
└─────────────────────┬──────────────────────┘
                      │
                      ▼
           ┌──────────────────────┐
           │ Query current SLO    │  (Prometheus)
           │ error budget burn    │
           └──────────┬───────────┘
                      │
        ┌─────────────▼─────────────────────────────┐
        │ Decision Engine                           │
        │ IF error_budget:burn_rate > threshold THEN│
        │   THROTTLE deployments                    │
        │   SCALE aggressively                      │
        │   ALERT: Error budget at risk             │
        └─────────────┬─────────────────────────────┘
                      │
       ┌──────────────┼──────────────┐
       ▼              ▼              ▼
Deployment gate  Autoscaler      Alerting
(CI/CD pauses) (add capacity) (notify humans)
                      │
           ┌──────────▼───────────┐
           │ Measure impact       │  (Prometheus)
           │ Did burn rate change?│
           └──────────────────────┘
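The loop in the diagram can be sketched in a few lines. This is illustrative only: the Prometheus URL, the action names, and the thresholds are assumptions, and the query targets the burn-rate recording rule defined earlier.

```python
# Minimal sketch of the control loop: query burn rate, map it to actions.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # hypothetical endpoint

def query_burn_rate() -> float:
    """Fetch the 1h burn rate via the Prometheus HTTP query API."""
    params = urllib.parse.urlencode({"query": "error_budget:burn_rate:1h"})
    with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{params}") as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def decide(burn_rate: float) -> list[str]:
    """Map burn rate to the three actuators in the diagram."""
    actions = []
    if burn_rate > 1.0:
        actions += ["THROTTLE_DEPLOYS", "SCALE_UP", "ALERT_ONCALL"]
    elif burn_rate < 0.5:
        actions.append("RESUME_DEPLOYS")
    return actions

# Loop body, run once per tick: query, decide, act, then re-measure.
# for action in decide(query_burn_rate()): dispatch(action)
print(decide(1.4))  # burn rate above threshold: all three actuators fire
```

The important property is that `decide` is a pure function of measured reality, so the same conditions always produce the same actions and every decision is auditable.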
Example 1: Deployment Throttling
When your error budget is being consumed faster than expected, deploying new code is a liability. You might introduce bugs that make things worse.
Deployment policy:
IF error_budget:burn_rate:1h > 1.0
(meaning you are consuming budget faster than the window can sustain)
THEN
- Disable automated deployments
- Require explicit approval for manual deployments
- Notify on-call team: "Error budget under pressure"
- Skip low-priority deployments
IF burn_rate < 0.5
(meaning you have time to spare)
THEN
- Re-enable automated deployments
- Can deploy more aggressively
This is not a human making a judgment call every time. This is a policy encoded in the system, querying Prometheus every minute and making decisions based on reality.
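One possible encoding of that policy as a pipeline gate, assuming the burn-rate thresholds above (1.0 to throttle, 0.5 to re-enable); the `DeployDecision` shape and priority labels are illustrative, not a standard CI/CD interface:

```python
# Hypothetical deployment gate, evaluated by CI/CD before every deploy.
from dataclasses import dataclass

@dataclass
class DeployDecision:
    allowed: bool
    requires_approval: bool
    reason: str

def deployment_gate(burn_rate: float, priority: str) -> DeployDecision:
    """Translate current burn rate into a deploy decision."""
    if burn_rate > 1.0:
        if priority == "low":
            # Skip low-priority deployments entirely while budget burns.
            return DeployDecision(False, False,
                                  "budget under pressure: low-priority deploy skipped")
        # Higher-priority deploys proceed, but only with explicit sign-off.
        return DeployDecision(True, True,
                              "budget under pressure: manual approval required")
    if burn_rate < 0.5:
        return DeployDecision(True, False,
                              "budget healthy: automated deploys enabled")
    return DeployDecision(True, False, "budget normal")
```

A pipeline would call `deployment_gate(query_burn_rate(), priority)` at the start of each run and either proceed, pause for approval, or skip.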
Example 2: Canary Auto-Rollback
A new deployment goes into canary (serving 10% of traffic). Within seconds, error rate jumps from 0.05% to 0.5%. You specified an SLO of 99.95%; this deployment is causing it to breach. What happens?
Canary monitoring (every 30 seconds):
IF (errors in canary) / (requests to canary) > (1 - SLO_target)
AND (number of affected users < safety_threshold)
THEN
- Automatically roll back the canary
- Notify on-call: "Canary rollback triggered due to SLO breach"
- Preserve all telemetry for the post-mortem
The decision is automatic.
The rollback is immediate.
Human involvement: post-incident review.
This requires Layer 0 (observability) to work perfectly. You need accurate error counts within 30 seconds. You need to know which requests were affected by which deployment. Traditional monitoring cannot do this. Observability-driven systems can.
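The 30-second check might look like this in code. A sketch only: `SAFETY_THRESHOLD` is a hypothetical blast-radius cap (beyond it, escalate to a human rather than act automatically), and the 99.95% target matches the example.

```python
# Canary check evaluated every 30 seconds against fresh telemetry.

SLO_TARGET = 0.9995
SAFETY_THRESHOLD = 1000  # hypothetical cap on affected users for auto-rollback

def should_rollback(canary_errors: int, canary_requests: int,
                    affected_users: int) -> bool:
    """Roll back automatically while the blast radius is still small;
    past the safety threshold, escalate to a human instead."""
    if canary_requests == 0:
        return False
    error_rate = canary_errors / canary_requests
    breaching = error_rate > (1 - SLO_TARGET)  # more than 0.05% errors
    return breaching and affected_users < SAFETY_THRESHOLD

# 0.5% error rate, 200 users affected: automatic rollback
assert should_rollback(50, 10_000, 200)
# 0.03% error rate: within budget, keep the canary
assert not should_rollback(3, 10_000, 200)
```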
Example 3: Predictive Scaling
Your database latency SLI is p95 query latency, and your SLO is that p95 stays ≤400ms. It is currently at 380ms and rising 10ms per hour. In about 2 hours, you will breach.
Predictive autoscaler:
1. Query Prometheus: What is the current latency trend?
2. Extrapolate: At current growth, when will we hit SLO limit?
3. Predict: Will adding capacity reduce latency sufficiently?
4. Decide: Add 30% more capacity now
5. Verify: Did latency drop? Is burn rate reduced?
Result: SLO breach predicted and prevented.
No human had to detect the trend.
No human had to decide when to scale.
The system did it automatically.
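Steps 1 through 4 reduce to a linear extrapolation. A sketch using the example's numbers; a real predictor would fit more than a straight line, and the 3-hour lead time is an assumed provisioning window:

```python
# Predict when the p95 trend crosses the SLO limit; scale if it is soon.

SLO_LIMIT_MS = 400.0
LEAD_TIME_HOURS = 3.0  # assumed: capacity must be ready this far ahead

def hours_until_breach(current_ms: float, slope_ms_per_hour: float) -> float:
    """Linear extrapolation of the p95 trend to the SLO limit."""
    if slope_ms_per_hour <= 0:
        return float("inf")  # flat or improving: no predicted breach
    return (SLO_LIMIT_MS - current_ms) / slope_ms_per_hour

def should_scale(current_ms: float, slope_ms_per_hour: float) -> bool:
    """Fire the autoscaler if the predicted breach is inside the lead time."""
    return hours_until_breach(current_ms, slope_ms_per_hour) <= LEAD_TIME_HOURS

# p95 at 380ms, rising 10ms/hour: breach in 2 hours, so scale now
assert hours_until_breach(380, 10) == 2.0
assert should_scale(380, 10)
```

The verify step (did latency drop, did burn rate fall?) then closes the loop by re-querying the same SLIs after the capacity change.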
The Prerequisites: Getting SLOs Right
SLO-driven control loops only work if SLOs are set intentionally and achievably. Common mistakes:
SLOs disconnected from business impact. You set a 99.99% availability SLO because it sounds impressive, but your business does not actually need it. Meanwhile, the infrastructure required to maintain it is expensive. You have built a system that overperforms your actual requirements at a cost nothing justifies.
SLOs set too loosely. If your SLO is "99% uptime" but you are actually running at 99.9%, you have no signal. The error budget is never constrained. The SLO is not governing anything.
SLOs that ignore upstream constraints. Your service SLO is 99.99%, but it depends on a third-party API that is 99%. You cannot possibly meet your SLO. Either adjust the SLO or redesign around the dependency.
SLOs without consequences. If breaching your SLO costs you nothing (no customer churn, no revenue impact, no business consequence), then it is not an SLO; it is a target. Real SLOs are the ones whose breaches have consequences.
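The upstream-constraint mistake is checkable with arithmetic: in a serial dependency chain, availabilities multiply, so your best case is capped by the weakest dependency. The helper below is illustrative:

```python
# Best-case availability of a service that needs every dependency up.

def composite_availability(*availabilities: float) -> float:
    """Product of per-component availabilities in a serial chain."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# A 99.99%-SLO service on a 99% third-party API tops out near 99%:
ceiling = composite_availability(0.9999, 0.99)
print(f"{ceiling:.4%}")  # about 98.99%: the 99.99% SLO is unachievable
```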
The Feedback Loop: From Control to Learning
As the system executes SLO-driven decisions, it generates data. Mine it:
- Which deployments most frequently trigger canary rollbacks? Signal: those code paths have quality issues.
- How often does predictive scaling fire? Signal: your baseline capacity planning is off.
- When deployments are throttled for error budget pressure, what actually caused the burn? Signal: you have a reliability weakness in that component.
- How many SLO breaches were caused by external factors vs. your code? Signal: how much control do you actually have?
A Grafana dashboard should answer these questions. It becomes your feedback system: "The system is making these decisions because of these conditions. Should we be?"
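The first of these questions can be answered with a few lines over the decision log. The event shape here is an assumption; adapt it to whatever your automation actually emits:

```python
# Count canary rollbacks per service from hypothetical decision-log entries.
from collections import Counter

events = [
    {"action": "canary_rollback", "service": "checkout"},
    {"action": "canary_rollback", "service": "checkout"},
    {"action": "predictive_scale", "service": "search"},
    {"action": "canary_rollback", "service": "search"},
]

rollbacks = Counter(e["service"] for e in events
                    if e["action"] == "canary_rollback")
for service, count in rollbacks.most_common():
    print(service, count)  # services ranked by rollback frequency
```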
Common Pitfalls
Over-automation based on misaligned SLOs. If your SLO is set wrong, automating based on it just accelerates the wrong decisions. A deployment throttle based on a 99.99% SLO when your business needs 99% will block too many deployments and hurt velocity without gaining safety.
No human override. Some situations require a human judgment call: "Yes, I know the error budget is tight, but this deployment is critical for a customer retention issue." Build in explicit override mechanisms so humans can make deliberate, logged decisions.
Ignoring SLO drift. SLOs set 3 years ago might not reflect current business needs. Review them annually. If your system is consistently overperforming its SLO, ratchet it down slightly (less expensive infrastructure for same business outcome).
SLOs without observability to back them. You cannot set a 99.99% SLO if your observability pipeline drops 1% of telemetry. You cannot gate deployments on SLOs if Prometheus data lags by 5 minutes. Layer 0 must be solid first.
The Maturity Jump
SLO-driven control loops represent a significant jump in operational maturity. It is the difference between:
- Reactive: "Something is wrong, humans fix it"
- Autonomous: "The system governs itself based on reliability commitments"
It requires integration across multiple layers: observability accurate enough to be trusted, runbook automation safe enough to be executed without human approval, and SLOs precise enough to guide decisions.
Organizations that reach this point report a dramatic shift: deployments happen faster, reliability stays consistent, and on-call load decreases—not because problems are hidden, but because the system prevents them from occurring.
Next in the Series
Control loops require confidence in automation, which requires being able to safely undo actions. The next layer—Immutable Infrastructure—makes safe remediation possible. Read Layer 3.
At AIDARIS, we help organizations design SLOs that are both achievable and consequential, then build the control loops to defend them. If you are unsure whether your SLOs are governance signals or just metrics, let's discuss.