The agentic revolution: A new vision for SREs

Site reliability engineers (SREs) are no longer an afterthought for harried IT leaders. They play a critical role in ensuring digital services work reliably at scale. But as complexity builds and incident volumes grow, SRE teams are being stretched thin by manual processes that degrade their value to the organization.

This is where AI agents can help SRE teams break free of a reactive doom loop. When deployed strategically, agents can enable teams to move past toil and proactively enhance operational efficiency and resilience. By automatically surfacing context, executing diagnostics and remediations, and generating self-updating runbooks, AI agents empower SREs to prioritize their attention on the most critical matters.

SREs vs. DevOps

SRE is still an often-misunderstood role. It’s not interchangeable with DevOps, but rather brings an engineering discipline to operations for improved reliability and uptime. SRE teams can raise their productivity and their chances of success by automating repeatable tasks.

Organizations can incorporate SREs into IT operations in various ways. There might be a centralized department serving the entire organization. There may be one or two SREs embedded within the engineering team. Or SREs might act as consultants, available on an “as-needed” basis. In some instances, developers might even be encouraged to adopt SRE skills.

Regardless of the model, a persistent challenge threatens to undermine the value SREs deliver. Site reliability engineering, like IT operations in general, is buckling under the weight of inefficient tools and manual processes.

Enhancing SRE workflows

To relieve that operational burden, many SREs are already using generative AI (GenAI). While GenAI can accelerate incident resolution, it still demands input from human experts. Teams don’t just want AI assistants. They want AI agents to which SREs can fully offload low-risk, toilsome tasks. As the adoption of AI agents increases, SREs will evolve into supervisors of a new digital workforce, delegating all but the most complex or novel issues.

How might agentic AI look in practice for SREs?

Consider how an AI agent can surface useful contextual information for investigators to drill down into. This might include previously resolved incidents involving the same service to immediately highlight how similar issues were remediated in the past, including responder notes. Agents can further enhance context for SRE incident responders by including information on related active issues across different services, which would provide the SRE with crucial real-time information on the scope of the incident and any potential dependencies.
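As a rough illustration of the context gathering described above, the following Python sketch collects previously resolved incidents for the alerting service along with active incidents on services it depends on. The Incident record, the build_context function, and the dependency map are all hypothetical names chosen for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical incident record; field names are assumptions for this sketch.
    id: str
    service: str
    status: str  # "resolved" or "active"
    summary: str
    responder_notes: str = ""


def build_context(alert_service, incidents, dependencies):
    """Return past resolved incidents for the alerting service and
    active incidents on services it is related to."""
    related = dependencies.get(alert_service, set())
    past = [
        i for i in incidents
        if i.service == alert_service and i.status == "resolved"
    ]
    active = [
        i for i in incidents
        if i.status == "active" and i.service in related
    ]
    return {"past_incidents": past, "related_active": active}
```

A real agent would likely pull these records from an incident management system and a service catalog rather than from in-memory lists.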

Using this information, an AI agent could go a step further by suggesting where an issue has originated, and whether recent configuration or other changes may be the root cause. The most effective agentic tools will continuously learn from SRE feedback and successful remediation, enabling the AI agents to get smarter and more sophisticated as time goes on.
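One simple way an agent might shortlist recent changes as possible root causes is to look for changes to the affected service shortly before the incident began. The sketch below assumes a hypothetical change log of dictionaries with "service" and "at" keys and an arbitrary two-hour window; a production system would draw on richer signals.

```python
from datetime import datetime, timedelta


def recent_changes(changes, service, incident_start, window=timedelta(hours=2)):
    """Return changes to `service` made within `window` before the incident,
    most recent first. The change-log shape is an assumption for this sketch."""
    earliest = incident_start - window
    candidates = [
        c for c in changes
        if c["service"] == service and earliest <= c["at"] <= incident_start
    ]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)
```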

The next steps

Once an issue is diagnosed and context delivered, remediation is the next stage that AI agents can optimize.

For low-risk, well-understood issues with clearly defined and known solutions, an agent could triage and remediate without any human input. All the SRE would need to do is review the after-action report to ensure it’s correct and check for any potential improvements. At the other end of the spectrum, novel or major incidents will require SREs to guide the investigation and develop their own remediation plan. In this scenario, the agent’s value is in automatically collecting useful contextual information and answering any questions.

Sitting in the middle are partially understood incidents, which are familiar but typically have multiple possible causes or solutions. In this scenario, the SRE agent would first cross-reference an alert with historical operations data and real-time signals. It might nudge the SRE into running further diagnostics or supply them automatically so the SRE has a range of possible causes to consider upon arrival. The AI agent would then suggest possible remediation steps, further reducing manual effort and time to action.
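The three tiers described above could be expressed as a small routing function. This is a minimal sketch under stated assumptions: the Mode enum, the route function, and the inputs (a count of known fixes and a coarse risk label) are hypothetical simplifications, and real triage would weigh far more signals.

```python
from enum import Enum


class Mode(Enum):
    AUTO_REMEDIATE = "auto_remediate"  # agent fixes, SRE reviews the report
    SUGGEST = "suggest"                # agent proposes diagnostics and fixes
    HUMAN_LED = "human_led"            # SRE leads, agent supplies context


def route(known_fixes: int, risk: str) -> Mode:
    """Pick a handling mode from the number of known fixes and a risk label."""
    if known_fixes == 0 or risk == "high":
        # Novel or major incident: the SRE guides the investigation.
        return Mode.HUMAN_LED
    if known_fixes == 1 and risk == "low":
        # Low-risk issue with a single well-understood solution.
        return Mode.AUTO_REMEDIATE
    # Familiar issue with several possible causes or solutions.
    return Mode.SUGGEST
```

Defaulting ambiguous cases to SUGGEST keeps a human in the loop whenever the issue is not clearly low-risk with a single known fix.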

The result of this remediation, as well as any feedback from the engineer, would feed into a self-updating runbook that records which actions worked best. This continuous learning approach helps to prevent recurring issues and enables faster resolutions with fewer people.
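A self-updating runbook of this kind could, in its simplest form, track how often each action resolved a given type of alert and rank actions accordingly. The Runbook class and its method names below are hypothetical, and ranking by raw success rate is an assumption made for illustration.

```python
from collections import defaultdict


class Runbook:
    """Tracks remediation outcomes per alert type (illustrative sketch)."""

    def __init__(self):
        # alert_type -> action -> [successes, attempts]
        self._stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, alert_type, action, succeeded):
        """Store the outcome of one remediation attempt."""
        entry = self._stats[alert_type][action]
        entry[1] += 1
        if succeeded:
            entry[0] += 1

    def ranked_actions(self, alert_type):
        """Return actions ordered by success rate, then by number of attempts."""
        stats = self._stats.get(alert_type, {})
        return sorted(
            stats,
            key=lambda a: (stats[a][0] / stats[a][1], stats[a][1]),
            reverse=True,
        )
```

Engineer feedback could be folded in by calling record with succeeded=False when a responder rejects or reverts an agent's action.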

Getting started

To extract maximum value from AI agents, managers will have to be careful about the way they position the technology. Managers will need to equip SREs with the right training in areas such as data security, output validation, and workflow creation. The best systems will be vendor agnostic to better surface real-time information from across the entire IT environment and will have access to as much historical operations data as possible.

The benefits of getting this right could be transformative. In the right circumstances, AI agents can resolve incidents faster, reduce SRE toil and burnout, and proactively optimize processes in ways even human experts might not spot. Above all, this means SREs can focus on the work that really matters: supporting innovation and growth.

The post The agentic revolution: A new vision for SREs appeared first on The New Stack.