The 10% Engineer: Why AIOps Makes Site Reliability Talent More Critical, Not Less
The central premise of AIOps is appealing: automate incident detection, response, and remediation so that engineers can stop fighting fires and start building better systems. In the most ambitious implementations, the goal is to reduce human intervention to 20%—or even 10%—of all operational events. Less on-call burden. Less alert fatigue. Less toil.
But here is the counterintuitive reality: when AI handles 90% of the work, the remaining 10% becomes exponentially more important—and so do the people responsible for it. AIOps does not make site reliability engineers less necessary. It makes excellent ones irreplaceable.
The Problem That Created the Opening for AIOps
SRE teams today are overwhelmed not by complexity, but by volume. A mid-sized SaaS company might generate thousands of alerts per day, the vast majority of which are false positives or resolvable through known runbooks. Engineers spend hours each week acknowledging, triaging, and resolving incidents that follow entirely predictable patterns. This is not engineering—it is maintenance labor, and it erodes both morale and cognitive capacity over time.
AIOps addresses this directly. Modern observability platforms, combined with ML-based anomaly detection and automated runbook execution, can now handle a substantial portion of operational events without human involvement:
- Correlate related events and reduce alert storms to single actionable signals
- Identify root causes from distributed traces, logs, and metrics without manual investigation
- Execute known remediations—restarting services, scaling resources, triggering rollbacks—based on learned patterns
- Draft incident summaries and post-mortems from event timelines
- Predict capacity exhaustion hours or days before it occurs
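The first capability above, collapsing an alert storm into a single actionable signal, can be sketched as grouping alerts that share a service and fire within a short time window. This is a minimal illustration with hypothetical names, not the API of any particular observability platform:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since epoch
    message: str

def correlate(alerts, window_s=60.0):
    """Collapse alerts from the same service that fire within
    `window_s` seconds of each other into one actionable signal."""
    bursts_by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        bursts = bursts_by_service[alert.service]
        if bursts and alert.timestamp - bursts[-1][-1].timestamp <= window_s:
            bursts[-1].append(alert)  # same burst: suppress as duplicate
        else:
            bursts.append([alert])    # new burst: new signal
    # Emit one signal per burst (its first alert).
    return [burst[0] for bursts in bursts_by_service.values() for burst in bursts]
```

Real platforms correlate across services using topology and trace data as well, but the core idea is the same: many raw events, few human-facing signals.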
The honest question this raises: if AI can do all of that, what is the SRE actually for?
The 10% That Remains—and Why It Is the Hardest
The incidents that AI cannot resolve are not the simple ones. They are the ones that have never happened before—novel failure modes emerging from new architectural decisions, unexpected third-party behavior, or combinations of conditions that no training data anticipated. These are precisely the cases where human judgment is most irreplaceable and most expensive to get wrong.
But the residual 10% of human responsibility goes beyond incident response. It encompasses the decisions that define the system's entire reliability contract:
- What does "reliable" mean for this product? A 99.9% SLO and a 99.99% SLO are not just different numbers. They imply different cost structures, architectural constraints, and customer expectations. Only a human with full business context can make that commitment.
- What should AI be allowed to do automatically? Auto-restart is safe in most contexts. Auto-rollback during a live database migration may not be. Defining those guardrails requires domain expertise that cannot be learned from historical incidents alone.
- What does this incident reveal about the system's design? Post-incident analysis that leads to architectural change is a distinctly human capability. AIOps can surface patterns; only engineers can decide whether those patterns indicate a design flaw worth fixing—and whether the fix is worth the risk of introducing new ones.
- When has AI reached its limit, and when is escalation needed? Recognizing that an incident is genuinely novel, not merely a variant of something the system has seen before, is itself a form of judgment that requires deep system understanding.
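The gap between a 99.9% and a 99.99% SLO in the first question above is concrete: the allowed downtime shrinks tenfold. A quick back-of-the-envelope calculation (assuming a 30-day month) makes the cost difference visible:

```python
def downtime_budget_minutes(slo: float, period_minutes: float = 30 * 24 * 60) -> float:
    """Allowed downtime per period for a given availability SLO."""
    return (1.0 - slo) * period_minutes

# 99.9% allows roughly 43.2 minutes of downtime per 30-day month;
# 99.99% allows roughly 4.32 minutes.
```

A 4-minute monthly budget rules out many operational practices (manual failover, slow deploys) that a 43-minute budget tolerates, which is why the choice is a business decision, not a tuning parameter.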
AI Increases Leverage—and Raises the Stakes
Here is the mechanism that makes SRE talent more critical in an AIOps world, not less.
Before AIOps, a team of five SREs might manage a system serving 100,000 users. With AIOps absorbing routine operational work, that same team might now manage a system serving one million users, or three separate systems at 300,000 each. The leverage has multiplied, and so have the consequences of every decision the team makes.
Consider the analogy of a commercial pilot with autopilot. Autopilot does not reduce the skill requirements of flying—it changes what those skills must cover. Pilots today need to understand the automation deeply enough to know when to trust it, when to override it, and how to recover when it fails in ways the designers did not anticipate. No airline made its cockpit less capable because autopilot exists. The best airlines made it more capable, because the moments when autopilot cannot help are the moments that matter most.
The same logic applies to SRE in an AIOps environment:
- Every architectural decision carries more weight. A flawed reliability design now affects a system that was previously impossible to manage with the same headcount. Mistakes propagate at scale.
- Governing AI systems requires genuine expertise. Tuning anomaly detection thresholds, validating auto-remediation runbooks, reviewing AI-generated post-mortems for accuracy—these are new tasks that require a sophisticated understanding of both the AI system and the infrastructure it manages. They cannot be delegated to someone without deep domain knowledge.
- Novel failures are, by definition, the hardest ones. The incidents that reach human engineers in an AIOps environment are the ones AI could not handle. These are not routine. They demand the kind of disciplined, first-principles analysis that is difficult to develop and impossible to automate.
The Scarcity of the Judgment Layer
There is one more dimension worth naming explicitly: if AIOps raises the bar for the SRE role, it simultaneously increases the scarcity of the people who clear it.
An SRE who can design reliability architectures, govern AI operational systems, and make sound judgment calls on novel incidents at scale is genuinely rare. The tools are getting better. The humans who can direct them effectively are not becoming more common. If anything, the concentration of real SRE expertise into smaller teams managing larger systems makes that expertise more valuable—not less—as a function of supply and demand.
This is the same pattern that emerged in manufacturing when automation took over repetitive assembly. The machine became commoditized. The engineering judgment required to design, calibrate, and improve the machine became the scarce resource. The teams that understood this built durable capability. The teams that treated it as a headcount reduction exercise discovered that the capability they eliminated was the same capability they needed when things went wrong at scale.
What This Means for Building SRE Teams
The conclusion for organizations is not "we need fewer SREs." It is "we need SREs with a different and significantly higher capability profile."
- Stop hiring for on-call throughput. The ability to resolve a PagerDuty alert by following a runbook is increasingly something AIOps handles. Hire instead for the ability to design systems that rarely require human intervention in the first place—and for the judgment to handle the ones that do.
- Hire for architectural ownership of reliability. What SLOs should this system carry? Where are its reliability boundaries? What failure modes are acceptable? These questions should occupy the majority of your SRE team's cognitive capacity, and answering them well requires significantly more capability than runbook execution.
- Design the human-AI handoff deliberately. The boundary between what AI handles automatically and what requires human escalation is not self-evident. It must be explicitly designed, continuously tuned, and clearly owned. This is SRE work at the systems level—and it compounds in value over time.
- Recognize that engineering process determines AIOps quality. AI learns from your historical incidents. If your systems are poorly instrumented, your runbooks incomplete, and your post-mortems superficial, your AIOps system will inherit those limitations—and amplify them at scale. The engineers who set that foundation are not peripheral to your AIOps strategy. They are the foundation.
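A deliberately designed human-AI handoff, as described above, ultimately reduces to an explicit, human-owned policy: which automated actions are permitted, under which system states, and at what confidence. The sketch below uses hypothetical names and thresholds purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class RemediationContext:
    action: str               # e.g. "restart", "scale", "rollback"
    migration_in_progress: bool
    model_confidence: float   # 0.0-1.0, AI's confidence in its diagnosis

# Guardrails owned and tuned by engineers, not learned from incidents.
SAFE_ACTIONS = {"restart", "scale"}
CONFIDENCE_FLOOR = 0.9

def should_escalate(ctx: RemediationContext) -> bool:
    """Return True when the event must be handed to a human."""
    if ctx.action not in SAFE_ACTIONS:
        return True   # e.g. rollbacks always get human review
    if ctx.migration_in_progress:
        return True   # unsafe system state overrides everything else
    return ctx.model_confidence < CONFIDENCE_FLOOR
```

The point is not the specific thresholds but that the boundary is written down, reviewable, and versioned, rather than implicit in whatever the model happens to do.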
The 10% Is Not a Reduction. It Is a Concentration.
When we say AIOps allows engineers to focus on 10% of operational work, the risk is that it sounds like a reduction in importance. It is the opposite. That 10% is not the easy 10% that is left over after AI handles the routine. It is the irreducible 10% that AI structurally cannot handle: novel failures, reliability as a business decision, system governance, and architectural judgment under uncertainty.
Concentrating engineering time on that 10% is not a cost-reduction strategy. It is a quality-elevation strategy. It means your reliability engineering talent is finally spending its time on the problems that only it can solve—not on the noise that any sufficiently trained model can process.
At AIDARIS, we build systems designed to deserve their reliability commitments—not to work around the absence of them. In an AIOps world, that foundational engineering work matters more than ever. If you are thinking about what your SRE function needs to look like as AI takes on more of the operational load, we'd like to be part of that conversation.