Your runbooks are obsolete in the age of agents

Ryan is joined by Spiros Xanthos, CEO and founder of Resolve AI, to talk about the future of AI agents in incident management and troubleshooting, the challenges of maintaining complex software systems with traditional runbooks, and the changing role of developers in an AI-driven world.

Episode notes:

Resolve AI is building agents to help you troubleshoot alerts, manage incidents, and run your production systems.

Connect with Spiros on LinkedIn or email him at spiros@resolve.ai.

Congrats to user larsks for winning a Stellar Answer badge for their answer to How do I get into a Docker container's shell?.


TRANSCRIPT

[Intro music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ryan Donovan, and today we're talking about the places where AI might not be able to do the full software development lifecycle for us. My guest is a returning guest, Spiros Xanthos. Previously, he joined us as part of a conversation with Splunk, and today, he's coming to us as CEO and founder of Resolve AI. So, welcome to the show, Spiros.

Spiros Xanthos: Hi, Ryan. I'm glad to be back. Thanks for hosting me.

Ryan Donovan: So, catch us up on what you've been doing in the, I think, three or four years since we talked to you.

Spiros Xanthos: Yes. So, last time we spoke, I was the general manager for Splunk Observability. I think it was when we launched our full observability suite, which was kind of a single product with metrics, traces, logs, real-user monitoring, backed by OpenTelemetry. At the time, kind of, our thesis was that, you know, if you can centralize all your data in one place, then humans will have a much better time, you know, managing production systems, which is true, but the reality is that still, even with the best data, a lot of the complexity falls back on humans. And what we've seen in practice, what I've seen myself in my last role, where I was managing a large engineering team, is that the vast majority of our time, oftentimes upwards of 70%, is spent on maintaining and troubleshooting and, you know, keeping the lights on in these systems rather than building new ones. And that is because, as software systems scale, complexity increases superlinearly, and as a result, no individual, first of all, understands the entire system; and the tools we have at our disposal – yes, they have data, but typically any kind of intelligence has to be provided by humans. And that becomes, like, tedious, it becomes stressful, it becomes a lot of toil and takes a lot of time. So, when Mayank, my cofounder, and I left Splunk, we had this thesis that maybe we can use AI, build agents basically that do the work of humans. And we didn't wanna focus on, let's say, the 30%, which is building, but wanted to focus on the 70%, which is running, maintaining, and troubleshooting production software systems. And that's what Resolve AI is all about.

Ryan Donovan: That 30%. I know that's the ‘probabilities’ he's solved coding problems. It's the stuff you can easily get statistically accurate code generation for. And everybody loves a greenfield project, but doing that 70% – what is the scope of that? Is it coding, infrastructure, you know, incident management? What's the full range of that 70% that is not fun?

Spiros Xanthos: It's not always 70%, right? Sometimes it might be 50/50. Sometimes it might be actually even worse, right? Like, I talk sometimes to very large enterprises that have legacy code bases that, maybe they go from newer applications running on the cloud and Kubernetes, all the way to mainframes. And you know, those tend to be even worse in terms of, like, how much time it takes to maintain and evolve them. But you know, I think my point is that, you know, there's the building time, where coding is the main thing that you do, and models have done a very good job being able to essentially generate code in bigger and bigger chunks, let's say. And I think that's gonna continue, and in some sense, it's gonna make the rest even worse, right? Now, what is in the rest? In the rest is many, many activities. It is, let's say, being on call and troubleshooting incidents, it is essentially managing infrastructure, for scaling, for cost. It is managing all the tools that you have in production, let's say, observability tools – you know, for example, having alerts in place, managing them for cost, creating dashboards. It is security and compliance, it is like rolling out changes and ensuring, you know, those land the right way. So, it is the combination of these things. What Resolve chose to focus on is, I would say, what is probably the hardest and maybe the most painful for humans as well, which is the on-call incident troubleshooting. So, we build the agents that can be on call, respond to alerts and incidents, troubleshoot those autonomously, get you an answer, get you to a root cause so you can essentially avoid creating, you know, escalations with many humans, avoid affecting customers with outages and, you know, generally make it a lot easier, or give you a lot more confidence that when something goes wrong, you're gonna have an answer at the same time. The way we build the system is to understand production software in general, and to be able to use all the tools, right? So it has applications to all these other problems I described.

Ryan Donovan: Incident response is very high pressure and requires low latency. So, when you have these AI agents, you know, I'm going to oversimplify it and sort of intentionally get it wrong: it's sort of like an automated runbook, right? The intentionally wrong part is that I bet there's something more to it.

Spiros Xanthos: Yeah. I think if runbooks were the solution, we would have had a solution 10 years ago, in my opinion, because every company at some scale starts maintaining runbooks, standard operating procedures. But the thing is, modern software systems are very dynamic and they change very, very frequently, sometimes 10-100 times a day. So, runbooks become obsolete very, very quickly. And unlike code that is, let's say, self-documenting, if a change happens in code, it is there, right? When a change happens to the production system, the runbook does not get updated automatically. So, you have this problem that runbooks are obsolete very quickly, but also problems don't repeat themselves exactly the same way, right? There are slight variations. That's what made any product that, you know, was attempted in the past, and you know, these are things I worked on as well, right? What we called AI-Ops before, very hard to scale, because you cannot describe every problem, obviously. And then if you describe, let's say, a subset of the problems, then you know, the tools we had at our disposal would not generalize well, right? So, they would either catch that problem, or they would not, and then it goes back to humans to provide intuition. So, it's very, very different from that.

Ryan Donovan: When you were here with Splunk, we sort of talked about tracking down the mystery 500 server errors. Those ones seem to be, in the little I've looked at 'em, seem to be pretty squirrely. You track 'em down to a root service, and then you're like, 'all right, what's the number of things that could have failed here?' How does AI simplify that or track that down better?

Spiros Xanthos: So, think of it this way: what essentially Resolve AI does, and I think what maybe modern AI with agents can do, is more or less do the work of humans. How humans troubleshoot a problem like the one you're describing is essentially using a combination of tools, like the example you gave. Maybe somebody starts with a log query, finds that there are a bunch of 500 errors coming from a part of the system, then will go zoom into that part of the system. They might check dashboards, they might check code, they might check changes, they might check feature flags, they might try to understand if there is a pattern change, they might try to understand if there is, like, a cloud service that is a dependency that is failing. To do all of that usually requires using many, many tools and combining the data in your head effectively about like – okay, practically stating a hypothesis, trying to validate or invalidate the hypothesis, and as you find more data, maybe you state a new hypothesis, right? More and more refined. The problem with that is that a human can only work serially, and it takes really a long time, and it's also even worse in practice, because maybe the knowledge that I have as a developer is limited to my services, right? The moment I need to jump to some other dependency, you know, I probably don't have the tribal knowledge, I don't understand, let's say the system, I don't understand, maybe, the telemetry... so, I need to page somebody else, right? Another team. And that can continue sometimes for three or four hops until we get to the answer. What AI can do a lot better and a lot faster is actually, it can examine all of these things a lot more quickly and in parallel. And it can do all this, like, let's say, stressful work much, much more quickly. And then, it can involve humans when, let's say, maybe there is some uncertainty. And you know, a human has to decide which path to take, or maybe when an action needs to be taken that is either risky or irreversible. In which case, maybe the best way is for a human to provide their judgment and decide what to do. But the actual evidence gathering, hypothesis testing, connecting the dots across multiple tools can be done a lot faster with agents.
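
To make the idea concrete, here is a minimal Python sketch of checking several hypotheses in parallel and deciding when to involve a human. It is an illustration only, not Resolve AI's implementation: the `Finding` type, the three `check_*` functions (which return canned results instead of querying real tools), and the `needs_human` rule are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    hypothesis: str
    supported: bool
    evidence: str


# Hypothetical checks. In a real system each would query a tool
# (deploy history, feature flags, cloud provider status); here they
# return canned results so the sketch is runnable.
def check_recent_deploy() -> Finding:
    return Finding("a recent deploy introduced the 500s", True,
                   "deploy landed shortly before the first error")


def check_feature_flag() -> Finding:
    return Finding("a feature flag change caused the 500s", False,
                   "no flag changes in the relevant window")


def check_cloud_dependency() -> Finding:
    return Finding("a cloud dependency is failing", False,
                   "dependency reports healthy")


def investigate(checks: list[Callable[[], Finding]]) -> list[Finding]:
    """Run every hypothesis check concurrently instead of one after another."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = [pool.submit(check) for check in checks]
        return [future.result() for future in futures]


def needs_human(action_is_reversible: bool, supported: list[Finding]) -> bool:
    """Assumed rule: escalate when the evidence is ambiguous (zero or several
    supported hypotheses) or when the proposed action cannot be undone."""
    return (not action_is_reversible) or len(supported) != 1


findings = investigate([check_recent_deploy, check_feature_flag, check_cloud_dependency])
supported = [f for f in findings if f.supported]
```

With these canned results, exactly one hypothesis is supported, so a reversible remediation would not need escalation under the assumed rule, while an irreversible one would.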

Ryan Donovan: It's essentially a data gathering and processing exercise, right?

Spiros Xanthos: But you know, very complicated data gathering, right? Like, the agent has to be able to reason basically, right? Like essentially given a symptom, or the option of a symptom, it has to make a decision, right? That's why it's not a runbook. There is a lot of reasoning, right? The way humans reason when they look at the data.

Ryan Donovan: Sure. So, these are obviously agentic. They have tool use. Do you have to tell it how to use every tool? Do you have to onboard it to every tool? Or is it possible for AI agents to get a new tool and figure out how to use it?

Spiros Xanthos: So the way we approach the problem, and what we found to work best, is that if the agent is pretrained on a set of data, or maybe the agent has to be pretrained on a set of data, let's put it that way, right? The agent has to understand code, it has to understand logs, it has to understand metrics, it has to understand infrastructure, and changes. Now, there are multiple implementations, or multiple tools that can give you this data, right? There are typically three or four logging tools, even within a customer. Each one of them has variations, but they all hold the same data, so out of the box, Resolve AI comes with third-party integrations with all the common tools that you find in an environment. From GitHub to observability tools to cloud services. And you don't have to do anything, just connect them, and the agent knows how to use them. Oftentimes, though, we encounter a custom tool (maybe a customer built their own logging tool, or maybe they built their own change tracking system). In those cases, the simplest is: if there is an MCP server, the agent knows how to use it outta the box, right? Because there is a description, it understands what it does, it understands the parameters, and all of that. In absence of that, we have a way to go against an API, but then the user has to provide some documentation. It's not difficult; it's just that there is a bit more work to describe to the agent how it should access that tool. But to be honest, what I found in practice is that 20% of the tools usually provide 80% coverage, and you know, for the long tail, there are already MCP servers for many of these more 'bespoke' tools.
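
As a rough illustration of why a described tool is usable "out of the box," the Python sketch below shows a tool definition shaped like the ones an MCP server advertises: a name, a human-readable description, and a JSON Schema for the parameters. The `search_logs` tool and the `missing_arguments` helper are hypothetical, invented for this example; they are not part of Resolve AI or of any real server.

```python
# Hypothetical definition of a custom, in-house logging tool, written in the
# shape an MCP server uses to describe its tools.
search_logs_tool = {
    "name": "search_logs",
    "description": "Search the in-house logging system and return matching log lines.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search expression"},
            "minutes": {"type": "integer", "description": "How far back to look"},
        },
        "required": ["query"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return required parameters that a proposed call leaves out.

    An agent can run a check like this before invoking a tool, because the
    schema tells it which parameters exist and which are mandatory.
    """
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]
```

Without such a description, the same information has to come from documentation the user supplies, which is the extra work mentioned above for plain APIs.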

Ryan Donovan: Yeah. I think in the initial contact emails, we talked about using AI as a control plane for the sort of systems in production. Can you expand on that idea?

Spiros Xanthos: What we found in practice is that, although our goal was 'let's build AI agents that can use the same tools as humans in running production and troubleshooting incidents,' what we found is to do that well, you have to both be able to use all these tools, you have to be able to reason, and you have to be able to learn. Learn in terms of like, extract the tribal knowledge that exists and is spread out across these tools and human minds, basically. And as you start learning all the tribal knowledge, and you have control of all these tools, let's say, that the human uses, then you effectively have this layer of abstraction, this intelligent layer of abstraction, on top of these lower-level tools. That allows you to, essentially, perform most tasks, let's say, that the human has to perform in production, a lot faster. For example, oftentimes, let's say you have questions that span multiple tools that might not be troubleshooting; I'll give you an example of something, you know, one of our engineers did yesterday: we realized our volume of logs has grown, like, significantly in the last couple of weeks. So, one of our engineers went to Resolve, said, 'can you give me the top five log lines by volume and tell me how to optimize?' So, Resolve went and ran a query in our logging tool, figured out which, essentially, 'log lines' are the highest volume, then went back, looked into code to see where those are produced from, and then suggested how to change the code to reduce the volume of logs. And, essentially, an exercise that would have taken probably, I don't know, maybe an hour, or multiple hours for somebody to be able to do, was done in two minutes, and we also got back, like, the specific code change we had to apply to optimize the logs. So, any kind of task of this nature that might not be troubleshooting strictly can be performed a lot faster because now, as a human, you can operate at a level of abstraction higher than the underlying tools. You don't have to understand the query language for the logging tool. You get the exact place in the code where you need to make the change, or you know, understand the specific behavior of the code.
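
The first step of the log-volume example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not what Resolve AI runs: it assumes the log lines are already available as strings, groups them by a crude template (digits replaced with a placeholder), and ranks the templates by total bytes. The `template` and `top_log_lines` names and the sample data are made up for the example.

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse variable parts (here, only runs of digits) so that lines
    produced by the same logging statement group together."""
    return re.sub(r"\d+", "<n>", line)


def top_log_lines(lines: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Return the n log templates that account for the most bytes."""
    volume: Counter[str] = Counter()
    for line in lines:
        volume[template(line)] += len(line.encode("utf-8"))
    return volume.most_common(n)


sample = [
    "cache miss for key 17",
    "cache miss for key 42",
    "user 7 logged in",
    "cache miss for key 99",
]
top = top_log_lines(sample, n=1)
```

Finding where each template is emitted in the code, and proposing a change to reduce it, is the part the transcript attributes to the agent; it is not attempted here.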

Ryan Donovan: Sounds like what you have to do is just be able to identify a problem and then go to the AI and say, 'hey, what is this?'

Spiros Xanthos: Correct. And I think this is where we are today, right? Today, mostly, I would say we and our customers use Resolve AI to identify problems, you know, get to the root cause of an issue, or an alert, or an incident, or answer any other question that involves all these tools. I think we're moving to the next stage of this also very quickly, which is, 'okay, now that you identify the problem, can you fix it for me?' So this can take the form of remediation in the case of an incident, but also can take the form of just, 'create a PR for the question I just described about log volume and just do it for me.' I think we're gonna very quickly see that the loop is gonna close... 'okay, this is the problem. This is the fix. Please take it.' You know, the set of actions we trust an AI to take – it's gonna expand very, very quickly as, let's say, the quality of answers increases.

Ryan Donovan: So, this all leads to the question of the actions that humans are gonna be able to take. When I first worked at a large scaled-up tech company, I was surprised at the number of engineers there—hundreds of engineers—and it sounds like a lot of this sort of work is gonna be now done by AI. Do you think we're coding ourselves out of jobs?

Spiros Xanthos: I think that the job of the software engineer or site reliability engineer has changed already. If you're gonna be successful in today's environment, you actually have to be able to use AI tools effectively, 'cause they create a lot of productivity gains, effectiveness – you can do your job better. Now, I think that the same way, let's say, that maybe we moved from assembly to high-level languages, and moved to having, like, operating systems that do most of the work for us – I don't think that's different. I don't think we're gonna have fewer people that are software engineers. It's just that the job is gonna be different. And out of all this, I think we're gonna have much higher productivity, much higher technological output. We're gonna be able to solve many more problems with technology. I think a lot of problems that rely on technology are gonna become a lot cheaper to solve. So, overall, it's just that I believe we're creating a new level of abstraction. Humans have to learn how to operate at this high level of abstraction, but I don't think the end result out of all this is gonna be fewer people in technology. If anything, it might be more of them.

Ryan Donovan: Right. Well, I mean, a lot of these greater levels of abstraction have also enabled greater levels of complexity in software. Do you think there's gonna be a new level of complexity enabled by AI?

Spiros Xanthos: I believe so. It is already happening, in my opinion, in coding, and it's very, very quickly gonna happen in production systems as well, and that comes with more complexity. Now, imagine if 80%, 90% of your code is generated. This means that probably, your familiarity with the code is less than it was before, so it will probably become harder for a human to troubleshoot incidents for the code that AI has generated as a result. We probably need assistance in doing that from AI, as well. It's the same concept, similar to when we moved from running applications on servers to running containers to running microservices. A lot more complexity; better tools to be able to handle it. I just think the jump is higher this time, but conceptually, probably it is the same. We talked about this 70-30, or whatever the ratio might be. I think it's becoming worse 'cause we're already starting from a situation where running large production systems is difficult. Now that a lot of the code is also generated and it's generated at a faster pace, we definitely need AI tools to be able to manage them.

Ryan Donovan: I mean, it seems like it creates this new technology, and then there's an orchestration level on top of it with containers. I think you're sort of seeing it with agents, but I could see a world where it's an orchestration level on top of the agents just trying to understand this massive pile of code now being generated.

Spiros Xanthos: I agree. We already see that. We don't yet have the right protocols for agents to communicate with each other, but a very simple example is: oftentimes—and we do this ourselves—you can have two agents in the same, let's say, Slack channel or Slack thread, and one can take the output from the other and perform an action. So, I do believe it's not just humans managing agents. I think it's gonna be agents working with each other. And obviously, you know, some orchestration across them, whether that happens from another agent or a human.

Ryan Donovan: I mean, there's the A2A protocol recently donated to the Linux Foundation, but I have not heard a lot about that in terms of how good it is as a protocol.

Spiros Xanthos: I don't think it had the adoption, at least that I was hoping [for], yet. But we're also in the early days.

Ryan Donovan: I'm not sure we're at a place where people are using agents to talk to agents as much. We're still getting the hang of agents.

Spiros Xanthos: Yes. If it happens, it happens in a very human-like way. I'll give you an example. What we can do already, and we use that at Resolve, is within Slack, the agent maybe is on call, receives an alert, runs through all the tools, gets to the root cause. So it suggests a change or a fix. Let's say the fix maybe is a code change. It's very simple to take that as a prompt, even within a thread, and say, 'okay, now invoke a coding agent to go create a PR to fix this problem,' based on the very detailed, effectively, prompt that Resolve AI provided. So, we see use cases like this, but it's not, let's say, fully automated. It's more that the output of one agent becomes the input to another agent.

Ryan Donovan: And what's the computing requirements to run an agentic SRE?

Spiros Xanthos: The level of, let's say, confidence you need to have in the agent, and the company, and the people behind it, is probably higher than when you're dealing with a coding agent, in the sense that, for an agentic SRE to be effective, it needs to be connected to most of the tools that the human has access to. It's usually code, telemetry, infrastructure. So, in that sense, the security requirements, and the trust, and compliance are higher, but once it's connected to all of these tools, it's also extremely powerful, because not only can it take an action, let's say, on top of one data type, but it can combine knowledge, information, maybe insights, from multiple of these tools and chain them together into an answer.

Ryan Donovan: The SRE role is fairly new within computing, 15 years or so, maybe less. Do you think it'll be necessary with AI? Because I've heard even before AI, some people saying, like, you know, devs should manage code to production, there shouldn't even be SREs or DevOps.

Spiros Xanthos: First of all, if you look at our industry, there are many, many variations of how different companies implement this. You have cases where you don't have multiple roles – you have maybe software engineers, and they are both on call, they manage platform infrastructure, they manage services on the cloud, and they are software engineers, as well. Now that said, effectively what they're doing is they're wearing multiple hats, right? The job is slightly different when you're trying to manage some infrastructure service, or platform, or containers, or you're troubleshooting versus writing code. The tasks, or the jobs to be done, are there; whether you separate them into multiple humans or not may be a different question. Also, you have variations where you maybe have embedded SREs that are the first layer of defense when it comes to incidents, or they manage maybe the infrastructure. And maybe you have cases where you have dedicated platform and SRE teams that maybe are responsible for infrastructure, and developers are usually responsible for applications. So, regardless, I don't think the work that needs to be done is changing. It is just that you might have dedicated folks doing the role of the SRE or the infrastructure engineer, or you might not, but the job still has to be done. With, let's say, an AI SRE, effectively, what we see a lot of users, or a lot more software engineers, be able to do is self-serve a lot better, right? When something goes wrong, they don't necessarily have to page somebody who understands infrastructure. They can, on their own, understand if this is an application or infrastructure issue, and if it goes into part of a system that they're not experts [in], the agent can help them get an answer.

Ryan Donovan: About filling those jobs that are still done by humans– I know there's a lot of anxiety about getting jobs. Anecdotally, I've heard it's a tough market, but finding the right people who can do a human job is still important. I know you had some tips on how to find good people in this era.

Spiros Xanthos: The truth is that I do believe the market is a little tough right now, and you see this probably more with folks who don't have a lot of experience, or maybe just graduated from college. I think that's where it's probably the hardest to find a job. I hope it changes, but in my opinion, this is the broader economy. It's less about AI. It's just the state of the economy at the moment. That said, engineers who are great always have a lot of options, and the better you are, the more options you have. That's kind of the status. In my opinion, you cannot build a great company unless you have great people, period. So, as a founder, I spend the vast majority of my time, or I have spent the vast majority of my time so far, trying to convince people who are as smart as or smarter than me to join us, take ownership of an area of the company, and solve it better than I would. What a startup has to offer, especially in AI, which is even more competitive, is a lot of growth, a lot of ownership, the ability to move quickly, the ability to solve very, very hard problems that have not been solved before, that maybe a bigger company or even one of the labs cannot offer you. So, I don't think a startup is for everyone, but for people who are risk tolerant, who are seeking a lot of ownership, who can move quickly, it is a magical place, especially if it's a company that has traction and solves their problem. Of course, I try to have a very high bar in terms of technical skills and cultural fit, but I also think there is a unique advantage, and if somebody has these inclinations or this personality, a startup is a better place for them.

Ryan Donovan: A lot of the anxiety I've seen is from the junior devs. I know you said it's a hard market for them right now. Do you have any ideas how that junior dev can sort of show their greatness to potential employers?

Spiros Xanthos: I think that it's not that much different than it was before, in that the more, let's say, examples of experience somebody has, the easier it becomes, whether that's internships or maybe they have worked on, like, open-source, or maybe they have just done a lot of programming or technical work during college, to be able to go to an interview and prove that they are as up to speed as somebody could be that hasn't really worked in the industry before. The other thing I see is that it's not like there is not a lot of hiring, it's just that it's a lot less than it was before. So, in addition to having all the right technical skills, it probably takes a lot more proactiveness. Not just applying, let's say, to jobs, but trying to find places where maybe somebody that you know works, right? Or try to get a warm intro. Not that much different than always, maybe; it's just that you probably have to do a lot more of that. My kind of overall message from this whole discussion is that I do believe that the role of the software engineer and SRE has already changed, and still, though, I think we're at the beginning of this revolution that is happening. My own experience now working with AI for the last two years is that the models keep evolving at a steady pace, if not accelerating. The agents are maybe even behind the models, and they're also moving very fast. I do believe that as the quality, let's say, of reasoning and answers is improving, I think we're gonna see a lot more action being taken automatically by agents. So, my conclusion from all this is that it's much more important to lean into this right now, whether you're an individual and this is about your career, or whether you're a company or a CTO thinking about AI adoption. I think that we have a lot more room. It's coming. It hasn't slowed down.

Ryan Donovan: It's that time of the show, ladies and gentlemen, where we shout out somebody who came on to Stack Overflow, dropped a great answer, shared some knowledge, or shared some curiosity, and today we're shouting out the winner of a Stellar Answer badge. Congrats to 'larsks' for answering: 'How do I get into a Docker container's shell?' If you're curious, we'll have it for you in the show notes. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have comments, concerns, or topics to cover, please email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

Spiros Xanthos: And I'm Spiros Xanthos, founder and CEO of Resolve AI. Resolve AI, like we described, is building AI agents to help troubleshoot alerts and incidents, and run production systems. You can find us at resolve.ai, and you can reach out to me directly. My email is my first name @resolve.ai.

Ryan Donovan: Emails are open. Thank you for listening, everyone, and we'll talk to you next time.