
How can you test your code when you don’t know what’s in it?

Ryan hosts SmartBear's VP of AI and Architecture, Fitz Nowlan, to explore how we're moving away from old assumptions about software development, the challenges of testing MCP servers as LLM-driven agents introduce non-determinism that breaks traditional testing, and how data locality and data construction are becoming more valuable when source code is so easy to generate.

Image credit: Alexandra Francis

SmartBear gives devs tools for application performance monitoring, software development, software testing, and API management—all at AI speed and scale.

Connect with Fitz on LinkedIn and email him at FitzNowlan@SmartBear.com

Congrats to Great Answer winner Alexander for winning the badge for their answer to Is there a way to make Runnable's run() throw an exception?.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we are talking about testing MCP servers and how the non-deterministic nature of LLMs and the agents breaks all of our old assumptions. My guest for that today is Fitz Nowlan. He's the VP of AI and Architecture at SmartBear. So, welcome to the show, Fitz.

Fitz Nowlan: Hey Ryan, thanks so much for having me. Great to be here.

Ryan Donovan: Before we get into the topic today, we like to get to know our guests a little bit. Tell us how you got into software and technology.

Fitz Nowlan: Absolutely. So, I took a programming course as a senior in high school in Java and loved it. Up until that point, I thought I would do something with math or engineering. Took programming, got into computer science. I did computer science in undergrad at Georgetown, and actually, 'cause it was a small computer science department there, I was able to do research with a professor. Got into kind of the computer science research ideas and then ended up going to grad school and earned a PhD in computer science at Yale with a focus on low-latency networking, distributed systems, stuff like that. From there, did some big tech internships and then took a job at a startup in Philadelphia named Curalate. Worked there for five years and met my Co-Founder, Todd McNeil, and then we left in 2019 to start Reflect, which was end-to-end automated web testing. So, we were entirely cloud-based, no installation of anything. But we had a robot, basically, that drove the browser and tested websites. We were acquired by SmartBear in 2024. And since joining SmartBear, I've branched out a little bit from Reflect and focused on bringing AI, AI features, and agentic workflows to all the different SmartBear products. So, SmartBear has a number of products that sit in the testing space, but also API. They own the Swagger specification, OpenAPI, and then they also have products in the observability space, the big one being Bugsnag. So, I've been working across the SmartBear portfolio to bring AI features and AI workflows to those different products for the last year and a half or so.

Ryan Donovan: Okay. Today, we're talking about Model Context Protocol, MCP. It's all the rage. Everybody's talking about it. Everybody's building their servers. And since you're building servers for some specific testing products, I know one of the issues with MCP is that it's hard to test. What makes it difficult?

Fitz Nowlan: So, the key behind MCP is that you're defining these tools that the AI can invoke, but you don't want to be too prescriptive or too restrictive in how the AI can invoke those tools. You want the workflow in any given moment to really be decided on the fly by the LLM. And that is one of the key points that actually allows the LLM to function intelligently with all the information it has available. So, because of that, you can't be too locked down or rigid in the workflows that you give to the AI. And so, you really say, 'look, these are all the tools. I hope you invoke the correct tools,' but as a result of that, there's some non-determinism there. I think that creates challenges then for testing, because you might start to say, 'maybe there are some key workflows where I always want to follow this sequence of tools, and if you get down this path, then I want you to do it this way.' The challenge, of course, is how do you know if you should be going down that path or not? And so, it's this balance of letting the AI choose things, but also having a set of familiar patterns of tool invocations.
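To make the pattern Fitz describes concrete, here is a minimal Python sketch of an MCP-style tool registry: the server defines and advertises named tools but deliberately fixes no call order. The `tool` decorator and the two example tools are hypothetical illustrations, not SmartBear's products or the actual MCP SDK API.

```python
# Sketch of an MCP-style tool registry: the server defines and advertises
# named tools, but deliberately does not fix the order they're called in.
from typing import Callable

TOOLS: dict[str, Callable] = {}

def tool(fn: Callable) -> Callable:
    """Register a function as a tool the agent may invoke."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_records(query: str) -> list:
    """Hypothetical tool: look up records matching a query."""
    return [f"record matching {query!r}"]

@tool
def create_record(name: str) -> str:
    """Hypothetical tool: create a new record."""
    return f"created {name}"

# An LLM-driven agent decides on the fly which tool to invoke and when;
# the server only guarantees each advertised tool behaves as described.
result = TOOLS["search_records"]("invoices")
print(result)
```

The non-determinism Fitz mentions lives entirely in which entries of `TOOLS` the model chooses to call, and in what order, which is exactly the part a traditional test can't pin down.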

Ryan Donovan: Yeah. It always seems like this is one of those cases where not being able to flag a keyword gets in the way.

Fitz Nowlan: Yeah. It's not syntactical. The LLM might come up with some other description of the problem. I think a lot of times, you'll see in these agentic workflows, you'll have kind of a routing step where the LLM gets involved, and it chooses which path we should go down. Is this an open-ended path, or does this look like the 'create a new record in the database' path, which we usually go down when we get into this situation? So, I think there are strategies there [where] creating some of those well-known or familiar workflows is a good approach. And those kind of look like the user journeys in your web application that your users used to take, and that your agents are now taking.
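The routing step Fitz describes might be sketched like this. In a real system the classifier would be an LLM call; here a trivial keyword matcher stands in for it, and the workflow names and cue lists are invented purely for illustration.

```python
# Sketch of an agentic routing step: decide whether an incoming request
# maps to a well-known workflow or should stay open-ended. In practice
# the classifier is an LLM; a keyword matcher stands in for it here.
def route(user_message: str) -> str:
    """Return the name of a known workflow, or 'open_ended'."""
    known_workflows = {
        "create_record": ["add", "create", "new record"],
        "delete_record": ["remove", "delete"],
    }
    text = user_message.lower()
    for workflow, cues in known_workflows.items():
        if any(cue in text for cue in cues):
            return workflow
    return "open_ended"

print(route("Please create a new record for this customer"))  # create_record
print(route("What changed since yesterday?"))                 # open_ended
```

The point of the sketch is the shape, not the matching logic: known paths get the rigid, testable treatment, and everything else falls through to the open-ended branch.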

Ryan Donovan: Those user journeys aren't predictable either, 'cause they may go through your application in a random, arbitrary way, right?

Fitz Nowlan: Yeah, absolutely. So, this always raises the question of: wouldn't it be great if we could watch how real users use our application and then build our tests from that? And on the surface, yeah, it makes sense that what people do with your app is probably the thing that they're paying you for. It's probably the thing in their minds that's creating value. But that said, as you just said, users will take any sort of random path through your application to achieve their goals. It's not necessarily always the most efficient, and maybe also not the one that you want. So, absolutely, you should always know what your users are doing in your app, but you might wanna also incorporate into your testing context knowledge that you, as the developer, bring.

Ryan Donovan: So, when you're testing the results of the MCP server, since it doesn't behave the same way every time, what are the standards for reproducibility and verification? How do you call a test successful?

Fitz Nowlan: Yeah, so there are two approaches, and most people are probably incorporating both. The lowest-level one is the case that we're talking about, where if you have a named workflow, it's a certain set of tool invocations that must occur, potentially in a certain order as well. And so, you're looking just for almost the skeleton of the workflow to be 'this tool, followed by this tool, followed by this tool.' And you'll have tests that have inputs that should push the LLM down that path. The second approach is more open-ended, and these are what you would call evals or evaluations, and this is where you use an LLM to test the other LLM's output. For those, it's really open-ended, and it's really a lot of trial and error. You have to try to drill down into a type of input data or a class of input data that generally gets the LLM to go down the path you want, but of course, you don't always want to force it, and you don't wanna be too restrictive. You want to allow the LLM to be a little creative here, so that you're not overfitting to your training. That's really the best description I can think of. The other key point here is, as these models improve, you want to be as good as the model is. And what I mean by that is, in the early days, it was prompt engineering: oh, I had a magical incantation that got the LLM to return this correct output. You don't wanna be thinking in those terms now because the models are going to keep getting better. And so, it's not about getting the perfect prompt; it's about probabilistically getting the correct output. And that way, when you get a better model, that will raise the level of intelligence on your overall workflow, but you don't wanna be prevented from benefiting from that new model by being too restrictive in your prompting, or your thinking, or your structure of the agent, today. So, you really wanna meet the model, not beat the model, and grow with it. So, that's where you have to be open-ended.
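The first approach, checking for the skeleton of a named workflow, can be sketched as an in-order subsequence test over the agent's recorded tool calls. The trace and tool names below are hypothetical, and the check deliberately tolerates extra exploratory calls between the required steps, matching the "don't be too restrictive" point above.

```python
# Sketch: verify an agent's recorded tool calls contain the expected
# "skeleton" of a named workflow, in order, while allowing extra
# exploratory calls in between.
def matches_skeleton(trace: list, skeleton: list) -> bool:
    """True if `skeleton` appears in `trace` as an in-order subsequence."""
    it = iter(trace)
    # `step in it` advances the iterator, so later steps must come later.
    return all(step in it for step in skeleton)

trace = ["list_projects", "search_records", "validate_input", "create_record"]
skeleton = ["search_records", "create_record"]
print(matches_skeleton(trace, skeleton))  # True
```

A test suite built this way asserts only the load-bearing calls and their order, leaving the LLM free to improvise around them; the open-ended eval approach would then judge the quality of what those calls produced.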

Ryan Donovan: I wonder if people will start missing the magical incantations, right?

Fitz Nowlan: Yeah. It was so frustrating early on. You would have the magical incantation, you'd get it working, and you'd say, 'okay, great. Don't anybody change this. I got it working.' And then, you would leave it, you'd be so proud of yourself. I'm thinking like 2023, we were thinking along those terms. And then, at some point, the new model would come out, it would be so amazing, and it would render all of your prompting useless, and you'd go, 'oh my gosh, all my wasted time.' But it's actually a good thing what the model is doing up there. You just have to not fall in love with your prompts.

Ryan Donovan: Yeah.

Fitz Nowlan: Yeah, it was really exciting, really frustrating, and then it was actually comforting to know, 'the model always works if I say it like this.' But now, it's the brave new world where no, now you actually have to treat it not just for this literal syntactic purpose of translating this input to this output, you actually need to go one level higher in the abstraction and let the AI solve the whole problem for you. And that's a little more open-ended, and you have to embrace that, but try to put the guardrails in where you can.

Ryan Donovan: Yeah, and I wonder how QA engineers and QA scripts are approaching these problems, 'cause, there's the joke about the QA engineer walking into a bar, ordering negative one beers, ordering 40 million beers, and then the customer comes in and asks for the bathroom, and the bar catches fire.

Fitz Nowlan: It's so true. So, our approach on the QA side here: we are building an agentic QA platform. Our approach here is: think in terms of intent, functionality, and requirements. So, don't think necessarily in the low-level terms of API invocations, or certain input or output shapes. You can certainly validate those, and one piece of testing should be that. But there is now a new notion of, as dev velocity increases 10x, QA velocity must also increase 10x, and the only way to fight fire is with fire, here. So, if you've got your dev velocity churning out code, and new features, and new pages, and new applications, you need an AI-native QA platform to do that testing for you. And of course, it's integrated with humans. Humans are gonna provide oversight. They're gonna provide some of this setup and relationship management between applications and between data sources, but you need something that can, at speed and scale, validate that functionalities are working. Not that they're necessarily working to some rigid spec that we defined earlier, but working to the new updated spec that just changed with the most recent PR. And then, of course, working in terms of common sense. That was always what manual testing gave you: confidence, because you knew, as a human, you were applying common sense. Okay, yes, the button looks appropriate; the button is here. The AI can now do that as long as it's appropriately framed and contextualized, and that's what we're trying to do on the testing side with AI-native QA.

Ryan Donovan: So, with standard unit testing, you have an assertion, and if the assertion is true, it passes the test. How do you write and codify those assertions and validations in a sort of non-deterministic, almost vibes-based situation?

Fitz Nowlan: Yeah, it's really hard, and I don't have the answer. I'm not sure if the answer is widely known today, but to give a concrete example here – the AI writes code, and then you say, 'okay, write some unit tests for me.' Of course, the unit tests are gonna pass because the AI can write them a thousand times until they do pass. Even if it has to say, 'assert true.' You know it's going to pass, right? So, then you say, 'okay, do the unit tests even matter if AI wrote all the code?' It's trivial now for LLMs to produce a unit test that exercises some code, but because the unit test was authored by the same AI that authored the code, you don't know if the code is actually doing what you want. And so, I think this is where there's a bit of a shift happening around unit tests for AI. Now, unit tests in place will certainly ensure that the next change that the AI writes won't break the existing functionality or won't change it too much. So, that's important. But just in terms of the software actually doing what it's supposed to be doing, the unit tests won't give you that. So, that's where I think the intuition or the overall common sense of the LLM has to come into play, and it has to operate at a higher level of abstraction. It has to operate at the level of, does this application function properly? Does it do the things that it was supposed to do based on this input spec, and these updated specs, and these user tickets? It has to be more driving the application, interacting with it either in the CLI, if it's a CLI app, or pointing and clicking in the browser, if it's a browser app, or if it's a desktop app, it actually has to use the app. So, fortunately, the models are good enough now that they can do it, and the vision models are tremendous now that they can parse things out.
The LLM models are actually now better at OCR than the traditional machine learning OCR algorithms. The quality is there. It's really just up to us to harness it. I know it's a dissatisfying answer. I understand, 'cause it's not totally clear what the purpose of unit tests is in the future, when you could rewrite all of your application in a few minutes with AI now.
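The worry about AI-authored unit tests that trivially pass can be illustrated with a hypothetical function (not something from the episode): a tautological assertion compares the code against itself and passes no matter what, while a behavior-level assertion pins the expected value to an external requirement.

```python
# Hypothetical example contrasting a vacuous, self-referential test
# with a behavior-level test tied to an external requirement.
def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    return price * (1 - percent / 100)

# Vacuous: compares the implementation against itself, so it always
# passes, no matter what apply_discount actually does.
assert apply_discount(100, 10) == apply_discount(100, 10)

# Behavior-level: the expected value (90.0) comes from the requirement,
# not from the code under test.
assert round(apply_discount(100, 10), 2) == 90.0
```

The second assertion is the kind a spec-aware reviewer, human or agent, has to supply; the first is the kind an AI that wrote both the code and the test can generate endlessly without ever catching a real defect.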

Ryan Donovan: It sounds like what you're saying is that the AI tests are very good at holistic application tests, and that unit tests may have a value in sort of a test of continuity, right?

Fitz Nowlan: Yeah. They will ensure that this current change hasn't deviated; this input still produces this output. But something's happening with source code. The value of source code is changing because it can be produced at such a rapid pace and by people who don't know how to code in the traditional sense.

Ryan Donovan: I guess that leads to a question that if something works, does source code matter anymore?

Fitz Nowlan: Exactly, that's a question that we've pondered a lot, and we are exploring the implications of that for sure. If you think about it, let's say you have, I don't know, Instagram, say, like a blogging platform with images and text – as long as you can create a post, edit a post, delete a post, then you would say the MVP of that is built. And of course, there are tons of different things about the real-life applications that are worth billions of dollars that make them unique, and special, and great user experiences. So, it's not that you can replicate all that Instagram is instantly, but the functionality can be replicated fairly easily now. So, does it matter how the code is written, in the MVP sense? Functionality can be checked off; the boxes can be checked. It may not matter as much. It certainly doesn't as much as it used to. Now, I think the question is, if you can replicate that functionality so quickly, do you replicate the functionality? In other words, let's say that the initial MVP is not performant when users upload seven images in a post, or something like that, or they put up pages and pages of text, and that's just incredibly inefficient for the system. The system can't handle that. So, then you might say, okay, that's where AI screwed you. That's where you fell off, because your human didn't write that. And then you say, okay, but let's say that users would pay for a platform that could handle that. How much more would they pay? You say, I'll just run it on twice as much hardware, or I'll give it CPUs that are twice as fast, or I'll provision more throughput. I'll just pay more to run with this same inefficient code, and I'll charge my customer more, and as long as they pay that, then I'm still making out on the deal. I still don't need to have hand-authored my code. So, I think we're gonna start to see maybe just some companies that try interesting engineering practices combined with pricing.
As long as you're still making your margin, you maybe don't care if you're so inefficient. And of course, maybe you're leaving money on the table, but if velocity is what matters more than maximizing profit, then maybe you're okay with that.

Ryan Donovan: Yeah, I think maybe, but I did see a video a while back that was somebody going through all these hyperscaler engineering blogs where they rewrote their entire application to get 10-15% performance gains. So, at some point, I think you run into a wall where it's, oh, this actually needs to be efficient.

Fitz Nowlan: Yeah, you can't just charge more forever. At some point, you'll hit the limit on what the market is willing to pay. I guess my point was more, for certain businesses, they probably are leaving money on the table. They're probably not maximizing what they're charging in certain SaaS platforms. I think AI will potentially allow them to go so fast that they can start to play right on that trade-off, play right on that dividing line between what the customer will pay and what they won't pay.

Ryan Donovan: Sure, yeah. It does sound like an all gas, no brake situation. It's like you're going as fast as you can and it's like, 'I hope everything's clear.'

Fitz Nowlan: Yeah, yeah. So, it brings up an interesting question of where's the counterbalance to this, and even if it doesn't blow up in our faces, what might it look like if it did? What are we thinking about in that sense? There are lots of different things. So, I think most people would say we're still not at the point where they want their bank software written vibe-coded style, or with AI. I think defense, compliance, government, health and medical records industries, stuff like that, will be slower to adopt, and frankly, at SmartBear, our position is that's great, because we can sell to all different parts of the spectrum. And I think businesses, while we're mostly hearing about the AI revolution, SaaS providers should not forget about the trillions of dollars that are spent on building, maintaining, testing, validating, and monitoring legacy applications that won't change overnight, and for good reason. So, it's like the sort of silent majority of software that you don't really hear about in the news that is changing much more slowly, and we're happy to continue selling to those businesses. We have some products that have been around for quite a while. TestComplete is the big one – it's a record-and-playback desktop testing application. It has a big install base with the government, compliance, and security industries. We're happy to keep selling to those people, as well. Our position is we wanna have an offering for those organizations two years from now who are exclusively coding with AI, but continue to service customers across that spectrum of autonomy. So, if you don't really want autonomy in your software development lifecycle, you can purchase these SmartBear products, and if you're somewhere in the middle, we have products for you, as well. And then, if you're more advanced, we have products for you there.

Ryan Donovan: AI gets all the headlines today, but there's still people writing code by hand. There are probably still people out there writing assembly code.

Fitz Nowlan: Yeah, COBOL is the best example of the bank software. All those bank mainframes are written in COBOL, and those COBOL engineers make $300,000 a year, or whatever it is—a really solid salary because they're in such high demand. Eventually, I think [with] AI, one of the big promises is, does AI allow us to finally upgrade or convert some of that legacy software into new software? And I think it does, but even still, the cost of a bug there is still incredibly high. And so, I don't think most people want the banks to be early adopters of these AI translation tools.

Ryan Donovan: No. And I think they've rarely been early adopters of anything, right?

Fitz Nowlan: Yeah. And for good reason, right?

Ryan Donovan: Yeah. I think one of the things I'm thinking about as we're talking is, you can AI-code something and just put muscle behind it to sell it as a business. What's stopping me from AI-coding it and just running it on my little hardware for myself?

Fitz Nowlan: Yeah. So, I love this angle. So, to parrot that back to you, or rephrase it a little bit: AI coding will increase velocity and will make getting to market less of a barrier for everybody. So, what then becomes the differentiator of my SaaS platform, say, versus someone else's SaaS platform, if it can be replicated so easily? So, that kind of raises the question of what are people actually paying you for? And I think there are two things that people basically pay for when it comes to AI-native software: they're basically paying for data locality or data construction. And so, data locality is something like, let's say that my SaaS application has really rich data, just really valuable data. Suppose I'm Snowflake. Snowflake can just sell a chatbot on top of your data, and it's going to create value, because the data that they're holding, your data, is so rich and so valuable that just the ability to interact with it with plain text or visuals is incredibly valuable. That creates value.

Ryan Donovan: Yeah.

Fitz Nowlan: Whereas another platform, where the data's not quite as rich or robust, they cannot just sell an LLM on top of their platform and instantly create value. That's one aspect. The other aspect is then data construction. If I have some specialized manipulation of the data that I do with AI, maybe it's 100 prompts in succession, pulling in different data from different data sources, but it produces this really unique output. People might pay for that. And they're paying for my AI, the construction of the data, the way that I'm composing my AI prompts together. That's payable, and that's like a secret sauce, so that'll make money until it gets commoditized, until the world discovers it. So, those are the only two things you can pay for. So, I don't think you can really get paid for CRUD apps now, basic create, read, update, delete apps. Maybe you can today, right? But in the limit, as we trend towards the future, I think the margin there goes down. That gets commoditized because, to your point, you could build it, I can build it, anyone can build it at a moment's notice. The other thing you touched on that I think's really interesting is, does this then lead to a pushback towards desktop computing? Does it move away from the cloud and SaaS? Because, as you said, you can author a whole bunch of the software you need to run your personal life or your business. Your small business can probably run with software you can almost define on the fly.
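The "data construction" idea, many prompt steps composed in succession, each enriching the previous output, might be sketched like this. The `call_llm` function is a stand-in stub invented for illustration; a real system would call a model provider there.

```python
# Sketch of "data construction": several prompt steps composed in
# succession, each step enriching the previous output. call_llm is a
# stand-in stub; a real system would call a model provider here.
def call_llm(prompt: str) -> str:
    return f"<answer to: {prompt}>"

def pipeline(record: str, steps: list) -> str:
    """Feed the record through each prompt step in order."""
    context = record
    for step in steps:
        context = call_llm(f"{step}\n\nInput:\n{context}")
    return context

memo = pipeline("raw sales data", ["Summarize", "Extract risks", "Draft memo"])
print(memo.startswith("<answer to: Draft memo"))  # True
```

The value Fitz describes lives in the `steps` list and how they're sequenced and fed data, not in any single prompt, which is why it holds up as a product until competitors reverse-engineer the composition.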

Ryan Donovan: Sure.

Fitz Nowlan: So, does this push people back towards privacy and non-multi-tenant SaaS software? I think it will. I think there's probably actually a lot of money to be made in bringing that experience to small businesses and end users, basically saying, look, if you buy this physical machine, and you run this software, you can run your entire software life without sharing your data with big tech—without the risk of putting your credit card information out for this and that different service. I'm very personally interested in that concept. SmartBear doesn't play in that space, but I think there will be a big push to, we'll say, local software consumption, maybe, is the way to describe it.

Ryan Donovan: I think people are already starting to look at local agents, like Moltbot.

Fitz Nowlan: Yeah.

Ryan Donovan: Super popular. I've seen some desktop computers that are optimized for AI. Like a $4,000 big beefy rig.

Fitz Nowlan: Yeah. You see it in some of the personal things. I guess Oxide Computing is that startup that's selling like an all-in-one appliance to bring the cloud to your local on-prem business. So, that's the counterpart to the personal or the small business, who maybe eventually will be able to just run with LLMs. Oxide is saying, look, you can bring the whole cloud into your on-prem system, your on-prem data center, and of course you're gonna wanna run your LLMs there, as well. LLMs are now part of the cloud SaaS offering. So, I think you'll see it in both places. I think you'll see it for large enterprises. The Fortune 500 are gonna say, why would we give you our data now, when the software is there and the LLMs are capable enough for us to run these locally? And even if it's not Llama running on a GPU that we own, we can still access an LLM within our VPC, from one of the big providers. And so, we don't wanna call out to your third-party SaaS platform; you have to come and live in ours.

Ryan Donovan: And if everybody's building their own software, if build wins in the build versus buy debate, somebody's gotta test all that software, right?

Fitz Nowlan: That's right. And this again comes back to: can we get paid on our data construction, our composition of the data, if we have some special secret sauce that allows us to test applications more effectively using AI? And that's potentially something people would still pay for, even in an on-prem environment—okay, fine, we'll bring your agent on-prem, and we'll have you go test our application because your agent is better at testing. Just better common sense, more edge-case awareness, better data integrity, or random data production for test inputs. Those are all pieces of what we think is a really valuable testing agent for the future. And I think there's still an opportunity to make money on that in the new world.

Ryan Donovan: Yeah. As part of that secret sauce, I wonder– I've talked to some folks doing AI agent coding, and a lot of 'em lean on non-LLM portions of the testing stack, whether it's a conditional or some sort of more traditional machine learning, right?

Fitz Nowlan: Yeah. I think it's tough. On the one hand, as an engineer, it's nice to have parts of the software development lifecycle that are not touched by LLMs because it gives you that feet-on-the-ground feeling. Okay, we're back to reality here. There's no way that we're making this up, making these inputs or these outputs up. We've got our feet on the ground.

Ryan Donovan: Especially in testing.

Fitz Nowlan: Exactly, yeah. It was a requirement that the application did X, Y, Z, and we've hand-authored that requirement, and this piece of code confirms that it checks that box and that it does in fact produce X given Y. But on the other hand, the trend that we see with the LLMs is that they do just keep getting better. And the OCR kind of example being one of them, where all those years, all that compute spent, training those OCR models, and of course, I'm sure it was 10 times as much compute training the LLM models. But the fact is that compute has been paid for and produced, and the LLM models now are just as good as the best OCR approaches at just extracting text from a screenshot. So, it does make you wonder, for the parts of my application that are not LLM, how much do I wanna invest in that? I'll do it if it's easy to invest, if I can reuse work I've done, but I'm hesitant to invest too much in the non-LLM pieces because it feels like they may be under fire from LLMs in six months. And I don't say that happily. As an engineer, I loved having those syntactical, those static, non-dynamic, non-LLM pieces of code, but I'll admit, I just don't know what the right amount of investment is for those parts of the lifecycle.

Ryan Donovan: Yeah, I think if you did, you could have yourself a nice little VC-funded company.

Fitz Nowlan: Yeah, exactly. You could strike that balance like we're saying, extracting the delta in the market between everybody using LLMs for it and you using the static approach, which is gonna be cheaper and faster. Yeah, I guess maybe one sort of theme or one sort of summary to think about is, MCP, justifiably, was all the rage with AI and agentic workflows because it unlocked that next level of thinking. But I think now, it's well understood, and we are now at that next point, which you opened up the conversation with, which is, what do we now do to understand, to profit from, and to validate these workflows that are composed of MCP calls? MCP is now the foundation. It used to be LLMs were the foundation, and then MCP came in, and wow, we can do all this other stuff. Now we're all standing on MCP because LLMs—all agents—are invoking tools with MCP. And so, now the focus has gone one level higher in abstraction, to how do we validate, and how do we get bounds or scope around, the workflows we're composing and completing with MCP? So, I think that's my sum-up. I think that's the point we're at now. And I think the question is, okay, how do we do it, and where do we go from here? And if you can be forward-thinking, you can maybe identify the next thing that we'll care about once we get a handle on these workflows that are powered by MCP, and we can validate them, and we can build them, and maintain them. Then, what are we doing? What are we looking at next? What's the next horizon?

Ryan Donovan: It's sort of the cycle of expansion of possibilities, and then contraction of–

Fitz Nowlan: Yes, I love that. I love that. Absolutely. You peek your head over the wall, and you can see a lot further. And so, then you spend a lot of time in that new realm. And then, there appears to be a new wall, and then you can peek your head over that wall. I love that analogy.

Ryan Donovan: That's a good one to end on.

Ryan Donovan: It is that time of the show again where we shout out somebody who came on to Stack Overflow and dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out a great answer winner, somebody who dropped an answer that earned over a hundred points. So, congrats to @Alexander for answering, 'Is there a way to make Runnable's run() throw an exception?' So, if you're curious about that, we'll have an answer for you in the show notes. I'm Ryan Donovan. I edit the blog, host the podcast here at Stack Overflow. If you have questions, concerns, topics, comments, anything at all, please email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

Fitz Nowlan: Yeah, absolutely. I'm Fitz Nowlan. I'm the VP of AI and Architecture at SmartBear. I wanna thank Ryan for having me onto the podcast. If you wanna get in touch, you can. I'm fitznowlan@smartbear.com, and you can always find me on LinkedIn, as well.

Ryan Donovan: All right. Thanks for listening, everyone, and we'll talk to you next time.
