SPONSORED BY INTUIT
Chase Roossin, group engineering manager, and Steven Kulesza, staff software engineer, from Intuit join the podcast to chat about what might be the hardest problem in engineering right now: getting multiple AI agents to work together in a complex system. They discuss how automated evals can make agent behaviors more predictable, agent swarms vs. one highly skilled agent, and how customer behavior shaped their technical architecture.
Episode notes
Want to work on complex engineering problems like these? Explore careers at Intuit.
We’ve worked with Intuit on a few other great blogs and podcasts, including Best practices for building LLMs and How Intuit democratizes AI development across teams through reusability.
Connect with Chase on LinkedIn.
Connect with Steven on LinkedIn.
Congrats to Lifejacket badge winner Sean for saving Creating the simplest HTML toggle button? with a great answer.
TRANSCRIPT
[Intro Music]
Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ryan Donovan, and today we are talking about the complexities of multi-agent architectures, in an episode sponsored by the fine folks at Intuit. My guests are Steven Kulesza, staff software engineer, and Chase Roossin, group engineering manager at Intuit. So, welcome to the show.
Chase Roossin: Thanks for having us, Ryan. We're excited.
Steven Kulesza: Yeah, we really appreciate it.
Ryan Donovan: Oh, my pleasure. So, before we get into the complexities of orchestrating multiple agents, tell us a little bit about yourselves. How did you get involved in software and technology?
Chase Roossin: Yeah, I found at an early age I had this infatuation with building, and I always thought I was gonna be a mechanical engineer, but I rapidly found out that it's cheaper and quicker to write software than to go build real-world products. And early in middle school and high school, I started writing simple websites, and it just became a huge passion of mine, seeing the ability to impact hundreds, thousands, millions of people so instantaneously. And I went off and got my computer science degree and then started my career at Intuit.
Ryan Donovan: Steven, how about you?
Steven Kulesza: Yeah, I think Chase and I kinda have a similar path there. I started super young too, probably like 12 [or] 11. And I was just really interested in building blogs. I was working on forums and stuff like that, and just managing them. And then, I was building games, so JavaScript games, and then got into building web scrapers.
Ryan Donovan: Oh, there you go.
Steven Kulesza: And I just went from there. And then, I got really into wanting to run and build startups. And I just saw engineering as this real-life magic where whatever you dream, you can build. And that just drove me throughout my career and as [an] individual in general, and I just fell in love with it. Started from there and took professional jobs after that.
Ryan Donovan: So, you mentioned the magic. We're talking a little bit about some of the newest engineering magic, and we've talked with Intuit about AI and the various platforms and programs you've been building, all the way from GenOS and such. How has your company's thinking evolved on AI in the enterprise?
Chase Roossin: Yeah, no, that's a great question. You called it out. We've been heavily investing in AI for many years. This isn't something new to us. And the beautiful thing that Steven and I are fortunate enough to get to build upon is a lot of that prior investment. GenOS and other platform-specific products that the teams build centrally enable us to move with the velocity that we need to unlock some of these new experiences. Even though the technology is starting from scratch, we ourselves are not, and we get to build off those foundational layers. That's how we've been able to evolve and iterate so quickly in this space that we know is changing by the minute. And yeah, it's been really great to see how our prior investments as a company have really paid dividends into our future work.
Steven Kulesza: Yeah, I think with GenOS, having security standards built into our calls and all of that has really powered a lot of our growth, because we don't have to worry about those things. Those are the foundational blocks that we get to build on top of, and with our velocity, just building these tools quickly, it's really accelerated that.
Ryan Donovan: It's a very composable AI system you're building, almost like having a design pattern library or something.
Chase Roossin: Yeah, exactly. We have our central orchestrator, and then the operating system that's built around it, and then, as Steven called out, all of these specific guardrails that we can just pull in. It helps keep us nimble so we can go iterate with the customer, but when we need to dip back into the platform, we have those tools at our disposal. It's like you're going through a candy shop and you can pick and choose what pieces you need in order to be successful in that mission.
Steven Kulesza: It's been a huge unlock at the enterprise scale too, having those safety nets and those Lego pieces that power product development and AI engineering that we don't really have to think about as much.
Ryan Donovan: Now, basically the whole industry is moving to this kind of agentic AI paradigm, where each AI is a little piece, so maybe you're ahead of the game. But it seems like the issue people are running into is: how do you coordinate that? How do you orchestrate multiple agents? How are you all thinking about that issue?
Chase Roossin: I wanna maybe take us back to how these agents were originally released, and some of the customer problems we wanted to focus on. About a year and a half ago when we were releasing a lot of these agents, they were in bespoke parts of our product, and they did do work on behalf of the customer and guide them through the product. But what we are trying to solve now is: how do we bring those to our customers on a level playing field, right? They spend 60 hours a month doing independent work across our products. How can we give them this one magical experience that ties all of these agents together? That's what kick-started this work stream. And Steven can go into a lot of detail about the orchestration. We wanted to be able to bring our customers into this singular spot, where all the agents now are doing work on their behalf instead of them having to go and reach out independently to one of those agents.
Steven Kulesza: Earlier on, as Chase was mentioning, we had all these individual agents. It came down to the idea that we have all this distributed power across the platform, right? How do we have all those things coordinate and work together? It's almost like you wanna build an organization, and those are your employees, right? How do we make them all work together? That drove the whole thought process behind your main question of orchestrating these multiple agents. So, that was our starting point. We had all these distributed pieces, and when MCP and all those protocols were first releasing, I was playing with that: calling these different distributed agents, using intent detection and all that stuff through ReAct loops, breaking apart user queries, getting those answers as quickly as possible, and then synthesizing those and getting them back to the user. So, that's how it first kicked off. And then, the evolution has gone on from there. And we can go into as much detail as you want.
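[Editor's note: To make that early pattern concrete, here's a minimal sketch of intent-based dispatch across distributed domain agents, in the spirit of the ReAct-style loop Steven describes. The `call_llm` helper and the agent registry are hypothetical stand-ins, not Intuit's actual APIs.]

```python
# Hypothetical sketch: break a query into per-domain intents, route each to
# a domain agent, then synthesize one answer. Not Intuit's implementation.
from typing import Callable

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., through a model gateway)."""
    raise NotImplementedError

# Registry of domain agents, each answering questions in its own domain.
AGENTS: dict[str, Callable[[str], str]] = {
    "invoicing": lambda q: call_llm(f"[invoicing agent] {q}"),
    "payroll": lambda q: call_llm(f"[payroll agent] {q}"),
}

def answer(user_query: str) -> str:
    # 1. Break the query apart into per-domain intents.
    intents = call_llm(
        f"Split into intents, one per line, formatted '<domain>: <question>'. "
        f"Known domains: {', '.join(AGENTS)}. Query: {user_query}"
    )
    # 2. Route each intent to the matching domain agent.
    partials = []
    for line in intents.splitlines():
        domain, _, question = line.partition(":")
        agent = AGENTS.get(domain.strip())
        if agent:
            partials.append(agent(question.strip()))
    # 3. Synthesize the partial answers into one response for the user.
    return call_llm("Combine into one answer:\n" + "\n".join(partials))
```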
Ryan Donovan: I've been talking about this with folks, and it seems like AI agents are kind of speed-running the microservices, service-oriented architecture paradigm. Are the problems that AI agents face the same as with microservices, or are there unique issues that come from agents?
Steven Kulesza: Yeah, there are unique problems, as well as the distribution, too. With microservices, especially at Intuit, where we have so many different teams and so many different products and domains within QuickBooks itself, and within all of the other products and services at Intuit, yeah, we face a lot of those problems. Some of the things that you wouldn't see in smaller agent systems, we do face: the same distributed problems as any other microservice, like idempotency, passing things back and forth, and so on.
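[Editor's note: "Idempotency" here means a retried agent call shouldn't execute its side effect twice. Below is a minimal sketch of one common fix, an idempotency key derived from the request; the names and the in-memory store are hypothetical, and a production system would typically use a shared, durable store.]

```python
# Illustrative idempotency keys for agent-to-agent calls. A retried request
# with the same payload returns the cached result instead of re-executing.
import hashlib
import json
from typing import Callable

_results: dict[str, str] = {}  # stand-in for a shared, durable store

def idempotency_key(agent: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True)  # canonical form of the request
    return hashlib.sha256(f"{agent}:{body}".encode()).hexdigest()

def call_agent_once(agent: str, payload: dict,
                    execute: Callable[[dict], str]) -> str:
    """Run `execute` at most once per (agent, payload) pair."""
    key = idempotency_key(agent, payload)
    if key not in _results:
        _results[key] = execute(payload)
    return _results[key]
```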
Chase Roossin: And we're talking about agents, for sure, and that's the output, but organizationally, how do you have teams divide and conquer, and give them the agency and autonomy to build these independent experiences? Similar to microservices, there are gonna be services that do relatively the same thing, you know what I mean? And to these models, that's a little bit more complex, right? They look the same at face value, and you're like, 'does it go to agent A? Does it go to agent B?' They have very similar inputs, or approaches, or things they want to accomplish. Some of those have been challenges or struggles that we've had to overcome that are outside of technological boundaries.
Steven Kulesza: Dealing with that, we're super heavy on evaluations, so things are evaluation-driven. As we onboard agents, and we'll get into the skills and tools paradigm that we're growing into, all these distributed teams don't know who's working on what. How do we coordinate it all, and how do we make sure Intuit Intelligence is giving the best answers possible? Evaluation. So, each team, as they onboard, has to give us their golden data sets. And we test those against the base layer of Intuit Intelligence to make sure the separation of concerns and everything is performing properly for our customers. And going back to it, this is more than just a chatbot, as well. We're analyzing financial metrics. We're doing a lot of things: personalization, long-term memory, executing workflows and plans, and what we call 'done for you' actions in the application. So, it gets pretty complex as you break things down, and it's super important to have those evaluations in place. I think that's one of the hardest pieces of this.
Ryan Donovan: Let's talk a little bit more about those evaluations, 'cause I have written about 'em a little bit, looked into them a little bit, and it's a genuinely hard problem, because you are asking an AI to check an AI's work. It's a little– you're the same person, right? How do you ensure that your evaluators are actually getting you accurate, human-aligned evaluations?
Steven Kulesza: Yeah, so we have three different types of evaluations: offline, online, and human evals. With offline, we have these offline pipelines where, kind of going back to what I was saying before, these teams are contributing golden data sets with what they expect their agent tool to do and what the output should be. And we have sets of LLM judges that run that through Omni, then check the deterministic and non-deterministic points for conversation quality. Did this tool get called? Did this agent get called for this certain question? We have a bunch of different metrics around that to guide it. Then, going into online evals: taking what our customers say and making sure that we're meeting what they're expecting from the agent system itself. And then there are human evals. We have experts that actually come in and test these questions, human-label these data sets, and give us more determinism there. It's this feedback loop at the end of the day. And we keep tuning those LLM judges, keep tuning our agents, keep tuning the prompts that drive the intent detection there.
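[Editor's note: A minimal sketch of what an offline eval over a golden data set can look like, combining a deterministic check (did the right tool get called?) with an LLM-judge quality score. The `judge_llm` function and the data shapes are hypothetical illustrations, not Intuit's pipeline.]

```python
# Offline evals over a golden data set: deterministic routing checks plus an
# LLM judge for conversation quality. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    question: str
    expected_tool: str     # deterministic check: which tool should fire
    reference_answer: str  # what the contributing team expects as output

def judge_llm(question: str, answer: str, reference: str) -> float:
    """Stand-in LLM judge returning a 0-1 quality score."""
    raise NotImplementedError

def run_offline_evals(cases: list[GoldenCase],
                      system: Callable[[str], tuple[str, str]]) -> dict:
    tool_hits, quality = 0, []
    for case in cases:
        answer, tool_called = system(case.question)
        tool_hits += (tool_called == case.expected_tool)  # deterministic point
        quality.append(judge_llm(case.question, answer,   # LLM-judged point
                                 case.reference_answer))
    return {
        "tool_routing_accuracy": tool_hits / len(cases),
        "mean_quality": sum(quality) / len(quality),
    }
```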
Chase Roossin: That's one of the areas where I'm super proud of Intuit, how we have grown in this AI and human intelligence space, right? Not only are humans involved in the actual experience when dealing with the customer, we also are bringing them into our evaluation suite. Like you said, Ryan, that's a really challenging problem. How do you do evals when it's a judge versus a judge, and both are AI? We want to infuse that with all of the amazing experts that work with us at Intuit, and that is what's giving us this upper hand when we're looking across the ecosystem.
Ryan Donovan: Yeah, and I think that human aspect is something that draws a lot of interest, hand-wringing, and concern, but on the other side of it, human evaluations don't scale very well. How do you determine when it's time to bring a human in, when it's time to get somebody to double-check it?
Chase Roossin: From the evaluation standpoint, we do significant sampling up to the capacity of humans that we have. But in product, when we want to bring an expert in for the customer, we actually are able to figure out: hey, what is the complexity of this answer or this question? Is this an appropriate time where we feel like we need to bring an expert in? Or at least expose the option, right? Sometimes we have customers that are very happy to leverage AI in all experiences, which is great, but being in the financial space, we understand the responsibility we have to give accurate answers. And for areas where we want to ensure that our customers feel comfortable with the response, we can bring that expert to the forefront and say, 'hey, if you want to go over this together, let's get you connected with a human to help.'
Ryan Donovan: Glad you brought up the financial aspect. That's another one that seems like a more delicate thing to address, because a lot of what I've seen is that generative AI isn't the best at math or numbers. What are you all doing to ensure that the math and the numbers come out correctly, and people are writing the right things on their tax filings?
Steven Kulesza: Yeah, definitely. The core tension there is knowing the strengths and weaknesses of these models, what they're good at, leveraging what they're good at, and using determinism when we need to. So yeah, there's that core tension between generality and precision, especially with financial data, which demands correctness on amounts, dates, and entity names, while the agent layer needs to handle novel queries without a code deploy. On the LLM side, there's their greater reasoning, making these plans, executing, and really understanding what the user needs. As you mentioned, math is not necessarily their strong suit. They tend to hallucinate, especially on hard problems. So, what do we do there? We build levels of tools. We have primitive tools that do certain queries, run reports, and do calculations. We build on those building blocks of tools, and then they become more complex tools that do our profit and loss summaries or cashflow snapshots, and the analysis there. So, tools are definitely the way we deal with determinism. Also, when we're doing these large runs of queries and accumulating report data, we use data files to persist those large context items, just so the LLM can reference them in time and it's not bloating the context window of the problem it's trying to solve. And then, the final piece of that goes back to the evaluations: just continuing to iterate on those evaluations. We're evaluation-first here, first and foremost.
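[Editor's note: The layering Steven describes, primitive tools composed into more complex report tools so the arithmetic stays deterministic, might look roughly like this. The function names and fields are hypothetical.]

```python
# Illustrative tool layering: the LLM calls the composite tool and never does
# the arithmetic itself; the math happens in deterministic code.
def query_transactions(start: str, end: str) -> list[dict]:
    """Primitive tool: fetch raw transaction rows (stubbed here)."""
    raise NotImplementedError

def total(rows: list[dict], kind: str) -> float:
    """Primitive tool: deterministic arithmetic over the rows."""
    return sum(r["amount"] for r in rows if r["kind"] == kind)

def profit_and_loss_summary(start: str, end: str) -> dict:
    """Composite tool built from the primitives above."""
    rows = query_transactions(start, end)
    income, expenses = total(rows, "income"), total(rows, "expense")
    return {"income": income, "expenses": expenses, "net": income - expenses}
```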
Ryan Donovan: It seems like with these sorts of deterministic tools, most people use something like MCP to figure out when it's time to do a tool call. How do you manage that sort of tool-call orchestration at the scale of an enterprise like Intuit?
Steven Kulesza: Yeah. So, I think that bridges us into this skill and tool approach that we're moving into. Within the industry, there are different pros and cons to different architectures. Agents are super great at isolated problems within a domain space. Again, it's like an employee, right? They're good at their job. They know how to do some sort of financial work, such as invoicing or whatever it may be. But when you ask them a customer problem or a payments problem, maybe that invoice agent might not have as good of an answer. That bridges into looking at the skills and tools-based architecture, where those agents get broken down into their core capabilities, into those tools and different levels of skills. With that, the hierarchy of our agent system becomes flatter, with one central planner. As these models have gotten better, it's become important to give 'em that flexibility to answer complex cross-domain questions. So yeah, the tools have definitely been the building block of that. Now we have this plan-execute model, where we have the central planner that takes the user's query through progressive disclosure. We can disclose all the skills, the front matter of these skills, to this agent, and the agent can create these plans out of them. Within the skills, they reference different tools and resources that they can interact with, and it creates this workflow pattern. So, that's how it works: through progressive disclosure, the plan-and-execute model, and then calling those deterministic skills to combine that into one synthesized output.
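[Editor's note: A rough sketch of the plan-and-execute model with progressive disclosure: the planner sees only each skill's name and front matter, produces a plan, and the runtime executes the referenced skills deterministically. The skill set, `call_llm`, and the plan format are all hypothetical.]

```python
# Plan-and-execute with progressive disclosure, as an illustration only.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    raise NotImplementedError

# Each skill exposes short front matter to the planner and a runnable body.
SKILLS = {
    "cashflow_snapshot": {
        "front_matter": "Summarize cash in/out for a date range.",
        "run": lambda args: {"cash_in": 0.0, "cash_out": 0.0},  # stub
    },
    "payroll_projection": {
        "front_matter": "Project payroll cost under a proposed pay change.",
        "run": lambda args: {"projected_cost": 0.0},  # stub
    },
}

def plan_and_execute(query: str) -> str:
    # Progressive disclosure: only names + front matter go to the planner.
    catalog = {name: s["front_matter"] for name, s in SKILLS.items()}
    plan = json.loads(call_llm(
        f"Skills: {json.dumps(catalog)}\n"
        f"Return a JSON list of steps like "
        f"[{{\"skill\": \"...\", \"args\": {{}}}}] for: {query}"
    ))
    # Execute each planned step deterministically, then synthesize.
    results = [SKILLS[step["skill"]]["run"](step["args"]) for step in plan]
    return call_llm(f"Synthesize an answer from: {json.dumps(results)}")
```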
Chase Roossin: One thing I would love to add to that: what Steven was alluding to is that this evolution in our technical architecture came out of necessity, right? When we released that initial version, a multi-sub-agent architecture where a lot of the intelligence was not flat but lived in these bespoke agents, we really rapidly found out customers don't just ask a question that should go to one agent, or this agent, or that agent, right? Very commonly, we're getting cross-domain questions. For instance, you know, what would happen to my margins next month if I gave all of my employees a 5% pay bump? There's no one agent that would solve that, right? So, moving into this skills and tools architecture and flattening the intelligence layer, where Intuit Intelligence now can create this plan and see all of these different skills and tools at its disposal, is enabling us to answer questions that we simply couldn't before. And it feels fortunate that as we're heading into those problems, we see the industry coming up with solutions that answer them. I think that's been one of the most exciting parts about working on Intuit Intelligence: we are literally staying toe to toe with some of the great frontier models to ensure that we're doing the latest and greatest in order to help deliver for the customer.
Steven Kulesza: And that's how we've built Intuit Intelligence, too. We've kept it nimble at the architecture level, nimble enough to move quickly and make these adjustments as time passes. Taking into consideration the kind of renaissance period we're in, it's super important that we keep things flexible enough to move fast.
Ryan Donovan: I've seen a few approaches that have agent swarms or multiple agents, but it sounds like you're going a different route, having a more generalized agent structure and just opening everything up as skills and tools.
Steven Kulesza: Yeah, definitely. Going back to architectures, each architecture has its pros and cons. Agents are more bucketed. They have their capabilities, they have an LLM looping over their subgraph itself and making those adjustments. You might see latency there, and you might see isolation as well. And the sharing problem, even with swarms, becomes somewhat difficult some of the time. But when you give one central agent the capability to see everything it has available to it, it can make those decisions on its own. So, that's why we went that way, similar to what Chase said.
Ryan Donovan: You're doing this at an enterprise level, where it's not just one person with one agent; it's all these customers accessing all these agents and all these tools. I imagine there's a difficulty with coordination, traffic management, uptime, and all that. How are you managing this on a 'keeping the lights on' level?
Chase Roossin: I think it goes back to the beginning of our conversation. We have had such an advantage with the investment the company has put into all of this infrastructure, where a lot of that stuff gets taken care of for us, which, as developers who are very product-focused and wanna ship as much as we can to customers, is such a blessing. We put a ton of effort into, as Steven mentioned, evaluations, operational excellence, and running our load tests. But at the core of it, our central teams, our platform teams, have built such amazing capabilities for us to build off of that we just get to harness the goodness from them and then go iterate with the customers. So, it truly is awesome that Intuit made such an investment so early for us to reap the benefit.
Ryan Donovan: Shout out to the platform teams, yeah.
Chase Roossin: No, definitely. They've been an incredible help. I don't know where we would be without them.
Steven Kulesza: Yeah, exactly. Going off what Chase just said, we have weekly performance load tests. Especially when we were doing the subagent approach, we had all these partner teams, each with their own LLM calls going on in the background, and it just compounds into exponential problems. So, we definitely have those tests, and we're taking the steps to make sure that's all available at scale.
Chase Roossin: And it's a new-ish problem since the LLMs came out, right? Capacity constraints were something from the on-prem days. Back when we were just running normal services on AWS, it was like, okay, dial up your HPA or whatever, and you get some more nodes, and you're fine. You don't have that luxury here. It's getting better, of course, but capacity constraints with LLMs are always a struggle. And then, these latency patterns are so different compared to what we're used to in modern technology. How do you build your systems around that? If a response requires 45 seconds, what do you do from a customer experience standpoint, from an infra standpoint? If model A goes down, what is your fallback model? Unlike with a deterministic output, model A and model B each need to be evaluated, and they're gonna have different responses, right? They each have their own personality. So, we need to be really careful with all of our FMEAs and whatnot to make sure that we have good uptime, good capacity, and fallbacks that have been evaluated. It's such an easy thing to be like, 'oh, here's our superstar model, and I'm gonna eval just on that.' But what about the three other models that could get invoked if you fall back? So, it changes the software development life cycle, and you need to be a little bit more cognizant of that in this AI-native world. It's been fun. It's like bringing back the old days in some capacity.
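[Editor's note: A minimal sketch of the fallback idea Chase raises: an ordered chain of models, each of which has been evaluated on its own, tried in turn when the one ahead of it fails or times out. `call_model` and the model names are hypothetical.]

```python
# Illustrative fallback chain across separately evaluated models.
def call_model(model: str, prompt: str, timeout_s: float = 45.0) -> str:
    """Stand-in for a gateway call to a specific model, with a timeout."""
    raise NotImplementedError

# Every model in the chain gets its own eval run, not just the primary one.
EVALUATED_MODELS = ["model-a", "model-b", "model-c"]

def answer_with_fallback(prompt: str) -> str:
    last_error = None
    for model in EVALUATED_MODELS:
        try:
            return call_model(model, prompt)
        except Exception as err:  # capacity limits, timeouts, outages, etc.
            last_error = err
    raise RuntimeError("all evaluated models failed") from last_error
```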
Ryan Donovan: You talked about the capacity constraints. Now, on top of your cloud compute spend, you gotta look at your token spend, and you've got token spend coming in at two or three different levels here with your evals. How do you think about managing the resources that an agentic, multi-agent system uses? You can talk about it at the token level. I remember, [at a] previous job, them just finding all of these things running in their cloud system that shouldn't have been there. How do you keep an eye on how many resources these things are using?
Chase Roossin: When you're moving this fast, it's really easy to be like, 'we're building these prompts up, we're building all of this different tooling and technology,' and like you said, it has a direct impact on cost, right? More input tokens and more output tokens are going to impact the cost. One thing that I love about Intuit is we care so much about the final product. [We] make sure it's solving the customer need, and then we can dial in the cost constraints after that, but let's really hyper-focus on ensuring that we're doing right by the customer. 100%, like you were saying, we have our infra costs, we have our token costs, and we have to verify that those don't balloon. And similar to traffic patterns increasing costs in the infra the old-school way (I don't even know if that's appropriate), it's really dependent on usage. How often are these customers using it? Is this customer sending 15 pages of PDFs or one page? Are they sending 500 input tokens or one?
Chase Roossin: So, it's just so variable, and that actually makes figuring out these cost estimates so challenging.
Steven Kulesza: Going through that, too, we have to have observability here, because even when we make a simple prompt change, or one of our partner teams adds a tool with a large amount of context, a large amount of things are gonna affect token limits and LLM context size. It's really important that we keep that observability going, as well as building on top of the things we said earlier, such as performance testing, monitoring, and just having those checks and balances in place as we go. Because as more engineers come onto this, it becomes a factorial problem. It's just important to have those guardrails put on.
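[Editor's note: The observability Steven describes could be as simple as wrapping every LLM call so token usage and latency are recorded per call site; then a prompt change or a heavy new tool shows up in the numbers immediately. The `usage` fields below are typical of LLM APIs but hypothetical here.]

```python
# Illustrative per-call token and latency accounting.
import time
from collections import defaultdict

token_totals: dict[str, int] = defaultdict(int)

def observed_llm_call(tag: str, call, prompt: str):
    """Wrap an LLM call, recording latency and token usage under `tag`."""
    start = time.monotonic()
    response = call(prompt)  # assumed to expose .usage token counts
    latency = time.monotonic() - start
    token_totals[f"{tag}.input"] += response.usage.input_tokens
    token_totals[f"{tag}.output"] += response.usage.output_tokens
    print(f"{tag}: {latency:.2f}s, in={response.usage.input_tokens}, "
          f"out={response.usage.output_tokens}")
    return response
```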
Ryan Donovan: I can imagine observability and guardrails are the new things that everybody's focusing on. I keep saying we're in the 'find out' stage of AI right now.
Steven Kulesza: Yeah, there it is. That's good.
Ryan Donovan: So, what level of observability do you get? Obviously, I don't know if you have explainability on the LLM responses, but what do you have?
Steven Kulesza: Yeah, so we have LLM observability. We can track the tokens, see traces, those sorts of things, as well as platform-level and gateway-level observability. So, we have three levels, plus analytics that we all use in tandem to monitor the system.
Ryan Donovan: Obviously, we talked about how this is a fast-moving area. As much as you can talk about it, where are you focusing your attention for the future of AI orchestration?
Chase Roossin: Yeah, I think we wanna just double down and expand on the mission that we've set out to do. And you can see it in this crawl, walk, run phase. Our goal is to get to a point where we're reducing the amount of effort that anyone using our products has to go through. The ideal utopia is you come in, and it's, 'hey, the work's done for you.' I think the technology is starting to move in that direction; it is unlocking the ability to go do this 'done for you' work. And I think we're gonna continue to charge toward the point where you just get the notification that your work's done, everything seems good to go, and you can go have a beer at the bar.
Ryan Donovan: It is that time of the show where we shout out somebody who came onto Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out the winner of a Lifejacket Badge: somebody who found a question that was sinking with a score of negative two or less, and they dropped an answer that was so good that it brought up the question's score and earned five points itself. So, congrats to @Sean for answering 'Creating the simplest HTML toggle button?' If you're curious about that, we'll have the answer for you in the show notes. I'm Ryan Donovan. I host the podcast and edit the blog here at Stack Overflow. If you have questions, concerns, comments, topics to cover, et cetera, email me at podcast@stackoverflow.com, and if you wanna reach out to me directly, you can find me on LinkedIn.
Chase Roossin: Thank you so much for having us, Ryan. This was a pleasure. It's always such an engaging conversation. I know we're super excited about all of the latest and greatest technology, and excited to see where Intuit and the industry take it. I'm Chase Roossin. I'm a group engineering manager, and thank you.
Steven Kulesza: Yeah, again, thank you so much, Ryan. Super excited to be here, and thank you for having us. Yeah, excited to see where the industry goes, as well. Having fun again with all of this renaissance spirit really revives me in this engineering space, and in what's possible out there. Keep hacking, everybody.
Chase Roossin: You can find Steven and me on LinkedIn, and we will cover some more topics about our approach to AI at Intuit.
Ryan Donovan: All right. Thank you for listening, everyone, and we'll talk to you next time.
