Honeycomb is an observability platform that enables deep, high-dimensional exploration so you can debug unpredictable behavior with precision.
Resolve AI allows you to resolve incidents, optimize costs, and code with production context using AI that works across your code, infrastructure, and telemetry.
Connect with Christine on LinkedIn.
Connect with Spiros on LinkedIn.
TRANSCRIPT
[intro music]
Ryan Donovan: Hello, everyone, and welcome to the Stack Overflow podcast, a place to talk all things software and technology. Today, we have a special twofer episode recorded live from HumanX with a big observability and SRE focus. The guests for this are Christine Yen, CEO of Honeycomb, and Spiros Xanthos, CEO of Resolve AI. So enjoy. I'm here on the floor of HumanX. We're going to be talking about observability in the age of AI. My guest for that is the CEO of Honeycomb, Christine Yen.
Christine Yen: Hello. I'm excited to be here. Thank you for having me.
RD: Of course. My pleasure. So I feel like observability has been one of those things that has increased in conversation over the last five, 10 years. And then with AI, it's sort of come into a different perspective and different importance. What's changing with observability because of AI?
CY: Oh, you're asking the big question up front.
RD: Yeah, yeah. Bigger than smaller.
CY: [laughter] Well, let's talk about what's happening with software first, and then I'll wrap observability in. I was talking to someone yesterday who posited that we're not going to have engineers and product managers and designers anymore. They're all just going to be builders. And I like that idea because, you know, we used to think of the software development lifecycle as these discrete steps. And they're all getting squished together. AI agents are moving so fast that the idea of writing a spec and implementation and review, it's all just getting squashed together. And I think that what that means for observability, for testing, for the different things that were CI, is that they're also getting squished together into a step of just validation. Did the code do what I expected it to do? We used to validate that, as humans, maybe by reading the code and thinking about it and handing it to another human to read it. Now everyone is talking about how no one's reading the code. But the code still has to be validated. Someone still has to make sure that what it does matches the intent. And when I think about how observability has evolved in the last five or ten years, it's actually really exciting. Because ten years ago, people were talking about logs, monitoring, and APM. They were different data types and people were like, well, you can do this with these data types. And when Honeycomb came on the scene, I think that we really tried to shift the conversation to: what can you do with this telemetry? I don't love getting bogged down in, like, what type of telemetry we're talking about. How do you know whether that telemetry is working for you or not? What does good mean for your service? How are you measuring that? And I think those skills, whether you're taking those and then using them to define SLOs or just using them to shape the graphs or signals you're looking for, that skill of "what does good mean, how will we know", that maps so well to this squished development cycle with intent and validation.
RD: You talk about a lot of things sort of changing, like the old APM style of just getting a dashboard, whatever metrics, and then logging and traces. I want to get to a more fundamental question: what is telemetry at this point?
CY: I think telemetry is not going to change dramatically. And I say that partially because I have a pretty unglamorized definition of telemetry to begin with, which is just, what are the bits and exhaust that your application is putting out so that you know it's doing something? One of my least favorite things to do is sit and debate the relative merits of logs versus traces or structured events versus this label versus that label. It is all just telemetry. It's all just data. You can shape data in certain ways. You can add certain bits of metadata. I think certain bits of metadata are more useful. I'm not going to debate the types. What I'm going to say is that if you're an e-commerce business, you probably care about SKUs and shopping carts and checkout times, which is going to be different from a social media site, which cares about upload times and user IDs and likes and relationships. You should capture the things that matter. But in the end, telemetry is just... It's just data.
RD: Yeah.
CY: And also data has gravity. Even with AI, I don't think everyone is going to be really excited about completely overhauling everything that they've used before to understand applications. There's going to be an evolution. There always is.
RD: Yeah.
CY: But it's just the paper trail that you have your applications leave or your agents leave so you know what the heck they did.
RD: So it almost sounds like you're saying like what used to be your KPIs are now part of your telemetry, right? Your sort of business.
CY: I hope so.
RD: Yeah?
CY: I think that that's… the exciting part about– there's lots of reasons, lots of smart engineers are raising their eyebrows at fully autonomous agents, generated code, no one reading the code. I want to acknowledge that. But I think that we all feel that that's the world we're moving towards. And in that world, it does not matter anymore how elegantly these functions were defined or how they're invoked. What matters is: did they accomplish the job that code was supposed to? And I think that is going to force more and more engineers to define the job that the code is supposed to do in the language of the business, in the, like, what is the outcome this code is supposed to drive? It's not to, like, process text. It is to make sure that this item listing is formatted in a certain way. They're going to have to connect that to their business, and I think that's great.
RD: With a lot of the AI-generated code, I've seen takes where people have found it to be not the best written performance-wise. I think Garry Tan, CEO of Y Combinator, was talking about, you know, X tens of thousands of lines of code. And somebody looked at it and was like, this is just heavy, you know, not performant code. Do you think that is part of the observability that people are going to be building in more? Sort of defining good in a different way?
CY: I think that's the key.
RD: Yeah.
CY: When does that facet of good matter? My co-founder, Charity, wrote a great piece earlier this year about disposable versus durable code. All respect to Garry, the code he's writing is probably pretty disposable. There aren't a whole bunch of people relying on it. His business isn't entirely relying on it. No one is going to be that upset if latency doubles. People clearly aren't upset right now. He's not upset right now.
RD: [laughter] Right.
CY: And so I think that for his application, the definition of good does not contain performant, efficient, any of those things. For something like, I don't know, like Visa or financial services companies where you want really low latency, you want really high availability and reliability, those guarantees don't change, whether it is AI generated or not. And those companies are going to have to, and those engineering teams are going to have to figure out if they incorporate these coding agents, which inevitably they will, how do they define guardrails so that the agents know to stay within those and validate their output to make sure that it stays within those guardrails.
RD: Yeah, I mean, like, how do you get an agent to write durable code? Because I think for a lot of folks who are out there vibecoding and doing AI agents, it is that sort of disposable code where it's just like, this is a toy, this is a little internal tool. Who cares if it runs well?
CY: First, there's that definition that we've talked about a lot. Adam Jacob, one of the co-founders of Chef, is out there now working on a new startup, writing a lot. He's gone through his whole AI, newly born revelations. And he really talks about building the machine to build the machine. And if you believe that you are– that the output of your code has to be durable, it is defining those characteristics that make it durable. What are the qualities that have to be placed? What are the things that should be optimized versus don't have to be optimized? And defining that upfront, just like you would for an engineering team. The way that people used to build engineering teams, if you are building an engineering team for a Visa or a Stripe, you're going to be looking for people who think or can think about code differently than a... Something that is meant for toy sample apps. The brownfield versus greenfield piece is a little different.
RD: Oh, sure, yeah.
CY: Where I think all the people who are raving and having wild success early are pretty consistently working on green code.
RD: Yeah.
CY: Greenfield, right? Whether it was durable or disposable. This idea that the agent is the one generating the code and the agent doesn't have to necessarily make sense of what the human and the agent is doing. I think that this question of how do I take an existing code base and make it legible, make it really easily usable by agents is an interesting one. Obviously, it is possible to point an agent at an existing code base, but I think that there are going to become some interesting standards or practices to make pieces more modular, be more intentional about some patterns for agents to be able to pick up on them and keep going forward.
RD: Yeah, I think I talked to a company who was tokenizing code at the component level.
CY: I mean, that assumes you have useful components to find cleanly.
RD: Yeah. You know, a lot of this sort of seems like that, you know, the telemetry is no longer just code in production. It's at the writing code stage.
CY: Anytime you have a black box, something happened without you doing it, you're going to want to be able to answer the question of what happened. Whether that is production code running, where some user is touching it and you want to know what happens, or some agent generating code. In some cases, I would argue, if you have an autonomous agent generating code, I care less about maybe what decisions were made by the agent to generate certain code, because in that sense, the code or really the outcomes and outputs of that code are what matter. That, in a sense, is sort of the telemetry of the…
RD: The code itself is the new telemetry.
CY: Yeah, or the signals that the code is validated, become that new telemetry of the development process. And for me, that feels right. You know, if you look at autonomous agents as employees, like humans, I don't care what you do with your time. I don't care what documentation you're reading, frankly. What I care about is, did the code that came out work? And does it adhere to our shared definition of good?
RD: That gives you just, like, one big telemetry signal, like, good or not, right?
CY: The thing is, like, “good”: there's so much behind that.
RD: Yeah, that is. That's true.
CY: You know, I mentioned SLOs earlier and Honeycomb. For anyone who isn't familiar, Honeycomb is an observability product. We have an SLO part of the product. But when we talk to prospects and customers about adopting SLOs, 90% of that conversation is not, here's the product in the UI, here's how you use it. It's: here's how to think about defining that SLO. Here are probably the people that you want to have in the conversation inside your company. Here are the stakeholders. Here's what can happen when you define what good looks like. And so I think there are fewer signals that we pay attention to, but that is good and right. Arguably, we never should have been trying to keep track of disk space and CPU utilization and database pressure and all these disparate signals.
RD: Yeah, yeah.
CY: What should matter is: how is my system performing? And is it healthy, and is there headroom? Then you can drill down and look at the granular system.
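To make the SLO framing above concrete, here is a minimal Python sketch of one common way to turn "what does good mean" into a number: a target fraction of good requests and the error budget that remains. The `Slo` class, the `error_budget_remaining` helper, the checkout example, and all of the figures are hypothetical illustrations, not Honeycomb's product or API.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A hypothetical SLO: the fraction of requests that must be 'good'."""
    name: str
    target: float  # e.g. 0.99 means 99% of requests must be good


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Assumed example: "good" means a checkout request finished in under 2 seconds.
checkout = Slo(name="checkout completes in under 2s", target=0.99)
remaining = error_budget_remaining(checkout, good=9_950, total=10_000)
print(f"{checkout.name}: {remaining:.0%} of error budget left")
```

The point of the sketch is that the hard part is not the arithmetic but agreeing on what counts as a "good" request, which is the conversation Christine describes having with stakeholders.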
RD: Yeah, but are things like disk usage, CPU load, database load, aren't those sort of proxies for how the software is doing?
CY: Those are proxies. But how many times have you seen something where the database load has changed, but end user experience is fine, or vice versa? If the end user experience has changed, someone's complaining, someone's having a terrible experience, is it even worth looking at the database load unless, like, is it worth starting there? I think there's this… that squishing that I describe is happening everywhere. And it is... on the humans in this process to figure out, as things get squished, what comes to the front and what gets sequenced behind.
RD: I read something recently that what gets measured is usually what's easiest to measure. And I'm wondering, you know, it seems like what we need to measure in software is kind of changing.
CY: Well, the Garry Tan example is perfect, right? What he measured and bragged about was lines of code. Because that was easy to measure, super quick, like immediate short feedback loop. Dopamine hit, awesome.
RD: But that's always been a lousy measure, right?
CY: But it's always been a lousy measure. Arguably, you know, he, whomever his other stakeholders are, should be like, okay, what actually does matter? Maybe what matters, maybe what matters is lines of code, because that's what he's optimizing for, for some reason.
RD: Sure.
CY: But maybe it's, oh, I want like this many people using it. And like, that is a longer feedback loop to capture. And it's harder to capture, but absolutely more important. And really, I mean, he... If you think about like, his tweets around it, what he has been pushing, it's almost like the lines of code are an intentional marketing metric to describe how useful his GStack is.
RD: [laughter] Right.
CY: So he really doesn't care about your definition of good because his point is, look, this GStack is letting me output a lot of code in a way that I wasn't able to before. So by that definition–
RD: He's succeeding.
CY: It's succeeding wildly. Good for him. That thinking about what does this person value [laughter] versus what do I value, that's a question and a skill that reaches far beyond the question of code quality.
RD: Sure, yeah yeah yeah.
CY: Anytime someone's trying to sell you something, do they value what I value?
RD: Yeah, yeah. So for companies, enterprise-level companies, people who have real stakes, real customers, is there a universal value that should be part of their definition of good?
CY: Has there ever been? I don't think so. I think that maybe just that the good should be measurable.
CY: I mean, is it a beta? Maybe it's a beta. Maybe I'm an idealist. I continue to think that the pressure to have to ask these subjective questions instead of chasing bad measures like lines of code, that is a good thing. Just being able to move faster doesn't obviate the need to pause and say, what is the problem we're solving right now? And what's cool is that now, if you answer that question with enough rigor, like, great, now we can go solve it faster. But it doesn't change the fundamental, important parts of: is this a problem worth solving? Are the people in the room the right people to solve it? Do we have all the information necessary to solve it well?
RD: Right. And I wonder, you know, for a lot of folks, is that question provable programmatically?
CY: Ideally, right? If you are building a machine, then build a machine. And if you want to move as quickly as possible, what is necessary is that machine-validatable step. Because then you can have these validation loops, you can have this autonomy really measure and sort of checkpoint progress in a way that doesn't require a human to come in and validate. I think what's really fun is, again, all these harnesses that are meant to... help you iterate and go faster and all the Ralph Wiggum loop setups that are like, OK, do it until these end conditions are met. Awesome! That's TDD. That's something that we've known we should be doing all along.
RD: Yeah, yeah.
CY: It's TDD with maybe a few more dimensions and, again, a few different definitions of good. But this is, in the most optimistic case, incorporating these AI tools is forcing us to be more disciplined up front in order to get good output and not just waste a bunch of tokens and time. Yeah.
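The "do it until these end conditions are met" loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_until_valid`, `fake_agent`, and `add_works` are hypothetical names, and the "agent" here is a fixed list of drafts standing in for a real coding agent.

```python
from typing import Callable, Optional


def run_until_valid(
    attempt: Callable[[int], str],
    checks: list[Callable[[str], bool]],
    max_iterations: int = 5,
) -> Optional[str]:
    """Call attempt(i) until every check passes, or give up after max_iterations."""
    for i in range(max_iterations):
        candidate = attempt(i)
        if all(check(candidate) for check in checks):
            return candidate
    return None  # end conditions were never met


def fake_agent(iteration: int) -> str:
    # Stand-in for a coding agent: the second draft fixes the first one's bug.
    drafts = ["def add(a, b): return a - b", "def add(a, b): return a + b"]
    return drafts[min(iteration, len(drafts) - 1)]


def add_works(source: str) -> bool:
    # The machine-validatable definition of "good": the test is written up front.
    # exec() is acceptable only because this sketch runs trusted, hard-coded drafts.
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


result = run_until_valid(fake_agent, [add_works])
print(result)
```

The loop itself is trivial; the discipline is in writing `checks` before any code is generated, which is the TDD parallel drawn in the conversation.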
RD: What are the… the new sort of things you're looking at and instrumenting into your product to, like, capture the new definitions of good?
CY: I think, uh, there is an… obviously agents and LLMs introduce a non-determinism that is… different and weird and kind of spooky, right? In the past, you could generally rely on, oh, if I get a certain set of telemetry, I can go back and look at the logic paths that were followed, to compare it to the code to understand why some outcome happened. That is not going to be the case as much when you have an LLM making decisions, making paths that you didn't track. I think what will be interesting will be that instead of relying on the code to complement the telemetry, we are going to have to get better at encoding in the telemetry the parameters, like how a decision was made, and then what decision was made, right? Instead of, oh, the parameters to my function were, you know, I'm an e-commerce app, and it was sales tax is this much, and the shopping cart amount was this much, and then the human can say, oh, I can see how this was spit out. Maybe we're encoding the sales tax calculation: these are the parameters and this was the output, all somehow. So we can look at that decision. And I think that as more and more folks are incorporating these autonomous agents, this non-deterministic behavior, and they're debugging it afterwards to figure out why, you know, their user asked for a cat picture to be generated and got a dog, that's how we're going to think about paper trails, about debug logs. And then again, there's the question of, well, why did it make this decision? Let me keep looking upstream. It's still going to be this investigative process when you can't rely on the code to be that source of truth anymore. It has to shift into the telemetry.
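One way to picture "encoding the decision in the telemetry" is a structured event that records the inputs, the output, and who or what made the decision. The Python sketch below is an assumption-laden illustration of that idea: the `DecisionEvent` shape, its field names, the `emit` helper, and the sales tax figures are all hypothetical and not part of any real telemetry SDK.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionEvent:
    """A hypothetical structured event describing one decision and its inputs."""
    decision: str      # what was being decided
    inputs: dict       # the parameters the decision was based on
    output: object     # what was decided
    decided_by: str    # e.g. "code" or the name of a model or agent
    rationale: str = ""  # optional explanation captured at decision time
    timestamp: float = field(default_factory=time.time)


def emit(event: DecisionEvent) -> str:
    """Serialize the event as one JSON line; a real system would ship it to a backend."""
    line = json.dumps(asdict(event), sort_keys=True)
    print(line)
    return line


# Assumed e-commerce example: record the parameters and the result together.
cart_total = 100.00
tax_rate = 0.0825
line = emit(
    DecisionEvent(
        decision="sales_tax",
        inputs={"cart_total": cart_total, "tax_rate": tax_rate},
        output=round(cart_total * tax_rate, 2),
        decided_by="code",
    )
)
```

With events like this, the investigation Christine describes (why did the user get a dog instead of a cat?) can start from the recorded inputs and rationale rather than from re-reading code paths that may not exist.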
RD: And it seems like some of this that you're talking about has to sort of be part of the agent's purview. That the decision making, the sort of thing behind the code, has to be part of, you know, whatever AGENTS.md or, like, smaller parts of the code comments, right?
CY: I think that... if it is, you needed this technique. Maybe there will be good patterns where people are like, no, I actually don't. I want to remove the non-determinism out of this. I think that that's a little bit of the trend of the OpenClaws and all this, removing that and handing them CLIs. So I think that we are in a moment where we are deciding where to draw the line for where to enable agents to make their decisions versus where to give them well-worn paths to choose from. I think... this age that we are in, where every tool out there is introducing AI, it is putting a lot of pressure on this question of trust. No matter what kind of software you provide, whether you are a toy app, something for leisure, or battle-tested, something that touches your financials, there's some element of trust that we are used to associating with a given software experience. Introducing AI has the potential to shake that, but then we have the chance to reintroduce that trust through being really thoughtful about designed experiences, through the sort of guidelines that you want to embed into your product for how humans and agents work together. And there's so many takes out there about the end of this profession or the end of that profession, but I genuinely think that product thinking, design thinking, this like– "what is the contract we have with the user" is going to be more important, not less. And I hope that every tool that is excitedly adding AI capabilities into the product is being equally thoughtful about how to balance that out.
RD: Yeah. Trust is something you hear everybody talking about at this conference. It's something we found in our developer survey that, you know, the more people use AI basically the less they trust it. So that's an opportunity for us all.
CY: Yeah. How do you make the AI show their work? You know, how do you put those guardrails in place? How do you know that they're going to be staying within those guardrails? It's fun questions.
RD: Fun, big, dangerous questions. Where can people learn more about Honeycomb and connect with you on the internet?
CY: You can learn more about Honeycomb at honeycomb.io. We've got a great blog and a lot of posts by many of our engineers across the org. You can find me on LinkedIn. I'm Christine Yen. And reach out anytime. I love talking about observability.
RD: All right. Thanks for listening, everyone. Talk to you next time.
[musical interlude]
RD: Hello. Welcome to the Stack Overflow podcast. I'm here at HumanX to talk about handling all that code that AI is writing in production. My guest is founder and CEO of Resolve AI, Spiros Xanthos. Welcome back to the show.
Spiros Xanthos: I'm glad to be back. It's my third time.
RD: Third time's the charm. So obviously, like, AI codegen has made code almost free. So people are writing a lot of it. And then that code has to run somewhere in production. How are SREs handling that?
SZ: SREs and developers. Obviously, everybody cares about what happens in production. I'll tell you that this was a problem before. In my last role, where I was running Splunk Observability, a few hundred engineers, probably 70-80% of our time was spent on actually maintaining and improving our production rather than building new features. So it was always a painful problem because humans had to be, let's say, the operators of many, many tools that each have a subset of the data, and the glue across them. So it was a painful problem, but of course, now that, let's say, generating new code has become practically free, it's becoming a bigger problem. And of course, all that is desirable in my view. But we have to be prepared to deal with the consequences, right? Because it's one thing to vibecode, let's say, a small piece of software to use one-off, but it's a very different thing to have a large-scale application, right, that customers use every day. You need to be, obviously, much more prepared to ensure the quality of all this code, but also when something goes wrong, you need to be much better prepared to react to it, right? So Resolve deals with both of these, right? Both existing systems, let's say, that have been written before and are still hard to, let's say, debug and maintain, but it also kind of gets you in a better position to react to problems for all this newly generated code.
RD: The newly generated code isn't always the best, most performant, most secure code.
SZ: I think that maybe over time, we'll get to the point, let's say, that models can generate code that is as secure, as reliable as humans. But even in that scenario, I agree with you, right? Today, we're not there. And we have many examples, like, in the industry, where problems have happened. But I would say, even if that was to happen perfectly, you still have a knowledge gap now, right? Because previously, there was a developer that had to handcraft the code, deploy it, et cetera, right? So now that models do all that work, we essentially have moved to a higher level of abstraction. When something goes wrong, we don't have, like, the deep intuition that we had before to go debug the problem, right? So regardless, I think that AI is now essential as a defense mechanism when something goes wrong.
RD: You know, and coming from the sort of observability background, what's the sort of piece beyond observability that you're looking at now with Resolve?
SZ: Yeah, I think that observability is really one of the... important tools we have in production systems, right? You know, a developer today or an SRE does not constrain themselves to just using observability tools, right? Obviously, they understand the code, the architecture, documentation, infrastructure. So Resolve is a general purpose AI agent for production systems, right? Observability is one of the tools and data sources that Resolve can use. Very important, obviously, when it comes to monitoring, especially. But like humans, our agents don't constrain themselves to just observability. They will use all the data available for the job.
RD: In the conversation earlier, the idea was that maybe observability grows to contain your business KPIs, the code itself. Is that a reasonable assumption, or is there a line you want to draw between observability and the rest of the code infrastructure?
SZ: In my opinion, with, let's say, the acceleration in technology, right, with us building a lot more applications or, you know, a lot more code, I believe observability remains essential. And it's very, very important for us to essentially be able to scale all that data and do it the right way. But what Resolve does is a very different thing, right? Resolve sits on top of all these tools and systems. It can use them like a human. It can assist humans with the work they do on top of this, right? So I don't see ourselves as a competitor to observability tools. I see ourselves more like agents that can use, work with these tools the way a human does, right? So we relieve humans of the stressful, painful, constant work on production issues, as a user of these tools, right?
RD: Right. The observability is a tool in your toolkit. Are there new tools that you need in your toolkit because of AI code?
SZ: I think, like, it's very, very important for agents to be able to essentially capture the context for a production system. Unlike code, which more or less essentially is self-contained, right? All the context is in the code base. When you're dealing with the production system, you never have the full documentation, right? It's spread out across tools, across documents, sometimes outdated, and in human minds. So it's very, very important for essentially the industry, let's say, right, including agents, to be able to essentially discover all that context, make it explicit, so then agents can be as effective as humans. On top of that, I would say agents, like what Resolve does as an example, right, can be very, very sophisticated, right? Like, oftentimes we have many, many agents working with each other to achieve an outcome, right? Almost emulating a team of humans with different types of expertise: a DBA, an infrastructure expert, an application expert. The agents can write code and create their own tools on the fly. Yes, we're way past, let's say, just static tools, right, with static data. So we have a lot more context and, you know, the agents can create their own tools even dynamically, right? In a sandbox environment, of course, with very careful kind of controls. But yes, like, way more advanced than maybe anything that we've seen as a tool before.
RD: Yeah, that seems like both a boon and a little bit of a risk too, because now you have new code in production, right, as these tools. It's a lesser... blast radius, lesser risk, but it's touching the production code. Are you extending that SRE viewpoint to the tools that the AI SRE writes?
SZ: First of all, I agree with you, right? In a way, let's say having an agent write code for you is a little bit simpler in that it produces an output. You can have the time to review. Maybe you have another agent to kind of check that work. You can iterate. In production, oftentimes you are dealing with urgent and critical issues, right? So the bar is very, very high for an agent to do work and especially take an action. As such, I think a tool like Resolve, from the beginning, has built-in mechanisms around security, compliance, controls, because no matter what, you're dealing with critical systems, right? Without necessarily a human being able to stop something from happening, right? Because these things work on their own. So, first of all, yes, the bar is higher. It's almost, let's say, like having a robot cleaner for your house. Yeah, of course, you let it go, right? Like, you know, what is the worst thing that can happen? But if you have a self-driving car, you don't just let it go, right? In my view, agents like Resolve are more like self-driving cars, right? We have to prove safety beyond the human levels to let them go versus, let's say, a Roomba that you let clean your house.
RD: Yeah, the individual risk is much greater. Although, when you mentioned the robot cleaner, I heard a story about somebody who wanted to hack a robot cleaner to control it by joystick. And the security is so bad on the robot cleaners that by getting access to his robot, he got access to every robot. And I wonder, like, it seems like AI is going to be discovering errors like that, ones that we haven't even thought about.
SZ: Correct. So to your point, right? Obviously, the reason that this happened is because I guess nobody thought that this is a critical security application, right? Versus, let's say, when you're building agents for software systems and production, your security standard is way higher, right? Higher than even the prior generation of tools. So I think, you know, different standards. And it is very, very important, obviously, to focus a lot on those controls. But yes, you're right. Like, I think today, maybe models and agents are at, let's say, subhuman expert level, right? But I think where we're headed is in a year from now, they're going to be, like, at a superhuman level in terms of abilities, right? We're seeing this with some of the security vulnerabilities that are being reported. I think that, you know, obviously we cannot stop that evolution from happening, but we do need to be very thoughtful about how we approach it, right? And I'll go back to my earlier statement that if we're using AI to discover problems, to write code, we definitely have to use AI to respond to these things, right? And again, this is kind of where I think Resolve plays a role. Whether it's reliability or security, it is agents that respond to issues, no matter how they get discovered. So you can very, very quickly, you know, remediate, solve and prevent, let's say, either a vulnerability or an outage.
RD: In the world of the AI SRE tool, what does the human SRE do? Or do we need human SREs as a specialization?
SZ: Yes. I don't know if you noticed our logo here at the conference, which says machines and humans. What we need is to have agents that, obviously, can be given tasks and perform them on their own, but oftentimes agents have to work with humans, especially in large, complex software systems. It's very, very important to build agents that can interface with humans very effectively, but also with other agents that may do a different type of job. I don't think we're going to have fewer humans dealing with software. I think we're going to have a lot more software. And I think, like, lower level jobs and tasks are all going to be automated by agents. Humans are going to move from being in the loop to being, let's say, on the loop, overseeing agents that run constantly. But yes, I think the end result is probably a lot more software, not fewer humans dealing with software, software engineers, let's call it. Now, the job is going to change, of course, right? But I think that if, you know, folks are curious and, you know, adopt the tools, I think the future is bright.
RD: And I mean, like, automating systems is a fundamental part of DevOps and SRE anyway. So this is a very advanced way of automating the various systems, right?
SZ: The problem we're dealing with, right? Let's say debugging production incidents is a very, very hard problem. But I think we're now at the point with Resolve that we can do this very, very reliably, right? Almost as well as humans. So I think, like, over the next 12 months, we're going to move from agents maybe assisting humans in dealing with these things to agents probably resolving, like, 80 to 90% of these issues, right? The prediction we had about agents writing all the code, I think I have the same prediction about production work and incidents, right? Agents are going to be probably resolving 90% of the incidents in the next 12 months.
RD: All right. Well, if people want to learn more about Resolve AI and connect with you, where can they go?
SZ: If you're interested in using the product or maybe like joining us, Resolve AI, we have a lot of information about the product and our careers page for anybody who might be interested.
RD: Well, thank you for listening, everyone, and we'll talk to you next time.
[outro music]
