
You need quality engineers to turn AI into ROI

Pete Johnson, Field CTO, Artificial Intelligence at MongoDB, joins the podcast to say that looking at AI’s impact as a job killer is a flawed metric.


SPONSORED BY MONGODB

Pete Johnson, Field CTO, Artificial Intelligence at MongoDB, joins the podcast to talk about a recent OpenAI paper on the impact that AI will have on jobs and overall GDP. Pete, who reads the papers (and datasets) so you don’t have to, says that looking at AI’s impact as a job killer is a flawed metric. Instead, he and Ryan talk about how AI will be a collaborator for actual human workers, how embeddings and vectorization will move the productivity needle, and the five decisions you need to make to realize ROI on AI.

Episode notes:

If you’re curious, read the OpenAI blog post and paper yourself.

For those of you looking for inspiration, check out Werner Vogels' keynote from re:Invent 2025.

MongoDB provides a flexible and dynamic database that excels with AI data.

Connect with Pete on LinkedIn.

Congrats to Populist badge winner Scheff's Cat for dropping a banger of an answer on error: non-const static data member must be initialized out of line.


TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your humble host, Ryan Donovan, and today we have a podcast sponsored by the fine folks at MongoDB, talking about the race to prove out agentic value. So, my guest today is MongoDB Field CTO, Pete Johnson. Welcome to the show, Pete.

Pete Johnson: Hi, Ryan. Thanks so much for having me.

Ryan Donovan: Of course. So, before we get into talking about this OpenAI paper, tell us a little bit about yourself. How did you get into software and technology?

Pete Johnson: I wrote my first line of code as a sixth grader in 1981.

Ryan Donovan: Wow.

Pete Johnson: And I'm one of those lucky people that was able to turn a childhood hobby into a, now, what is it, 31-plus-year career after college. So, I know that's a common story for a lot of people, but I asked for an Intellivision for Christmas of 1981, and if you know, you know.

Ryan Donovan: Yep.

Pete Johnson: I instead received a TRS-80 Color Computer, the 4K version, not the 16K. That came with a variant of the Microsoft BASIC interpreter, called Color BASIC at the time, and I used it to write a little program that tracked rebounding and scoring stats for my sixth-grade basketball team.

Ryan Donovan: Nice. I think I also got the old switcheroo with the Intellivision: a Commodore 64 instead.

Pete Johnson: Well, with the C64, you had a real disk drive. I had cassette tape storage on the TRS-80 Color.

Ryan Donovan: Right.

Pete Johnson: Or the CoCo, as people called it back then.

Ryan Donovan: So, obviously it's been a long journey from then. You've turned a hobby into a career.

Pete Johnson: Yeah, I did 20 years at HP. I did 17 of that in HP IT, where I wrote my first web application, which went into production in January of '96. That was about 13 months after the first W3C meeting. I became HP.com Chief Architect at the end of that HP IT tenure, and then I was one of the founding members of HP Cloud Services, which was HP's attempt to compete directly with AWS on top of OpenStack. And while that didn't work out for the company, it sure worked out for me personally. I moved out of engineering and into sales and marketing and went on to a couple of different startups. One was acquired by Cisco. For the stint just prior to MongoDB, I was a field CTO at the services arm of CDW, and then I've been here since June.

Ryan Donovan: All right. Well, a lot has changed since the old TRS-80 days. Today everybody's talking about AI and agents and, you know, as people try to get this to have real-world impact, I think I saw the stat that 95% of projects fail. People are asking, you know, what's the ROI of this? And OpenAI had an interesting paper talking about the sort of GDP impact of AI, how they could evaluate that impact of agents and agentic tasks. Can you tell us a little bit more about this paper?

Pete Johnson: Yeah, sure. So, that paper, the GDPval paper. So, there was a blog article, there was a white paper, and then there was a dataset. And I'm the kind of guy that'll read everything to sort of see where the goodness or the hidden stuff might be, 'cause there's always some hiding that goes on in white papers. And if you just look at the blog article, what that'll tell you is they looked at 44 occupations across different vertical sectors of the economy. They then went and hired experts with at least 14 years' experience in each one of those occupations, and they had those people define 30 common tasks for each of those occupations. They then took a subset of those, five per occupation, and ran it through a version of a Turing test, where what they did was a one-shot prompt to try to complete the task, and fed that to an LLM. And then they found a person with a decent amount of experience in that occupation and asked them to complete the same task. Then they had an independent third party, a human being, evaluate which one was better. And then they established a win rate between the human being and the different LLMs. And to their credit, OpenAI didn't just test OpenAI LLMs; they tested some of their competitors as well. That was the basic structure of the testing they ran that produced that white paper.
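To make that scoring concrete, here's a minimal sketch of how a GDPval-style win rate could be computed from blinded grader verdicts. This is illustrative only, not OpenAI's code; the data shape and names are invented:

```python
# Illustrative GDPval-style win rate: each verdict is a blinded human
# grader's judgment of one model deliverable vs. one human expert
# deliverable for the same task, from the model's perspective.
judgments = [
    {"model": "model-a", "task": "migration-plan", "verdict": "win"},
    {"model": "model-a", "task": "quarterly-report", "verdict": "tie"},
    {"model": "model-a", "task": "legal-brief", "verdict": "loss"},
]

def win_rate(judgments: list[dict], model: str) -> float:
    """Percentage of tasks where the model won or tied against the
    human expert, matching the 'won or tied' scores quoted below."""
    relevant = [j for j in judgments if j["model"] == model]
    wins_or_ties = sum(j["verdict"] in ("win", "tie") for j in relevant)
    return 100.0 * wins_or_ties / len(relevant)

print(round(win_rate(judgments, "model-a"), 1))  # 66.7
```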

Ryan Donovan: Right. And like you said, you went through all three layers of this paper, down to the dataset. What was the interesting takeaway? What's the stuff that's sort of hidden there?

Pete Johnson: Well, if you just look at the blog article, sort of the glory graphic that was part of it showed what the scores were for each of the individual LLMs. And I've got some notes here. I'll read 'em off real quick. At the time, what they were testing was things like Claude Opus 4.1, which did the best with a score of 47.6, meaning it either won or tied, according to the human evaluator, on the different tasks it was graded against. GPT-4o was the lowest-scoring of the seven that they tested, at 12.4. And so, in the blog article, they showed GPT-4o at 12.4, Grok 4 at 24.3, Gemini 2.5 Pro at 25.5, o4-Mini-High at 27.9, o3-High at 34.1, and GPT-5-High at 38.8, before Claude Opus 4.1 at the top. And that was, like I said, kind of the glory diagram from the blog article. But if you look at the white paper, there was an even more interesting diagram, I thought, and I'll tell you, it was on page seven, it's figure seven.

Ryan Donovan: Right.

Pete Johnson: And it showed, in addition to the main testing, they also did some analysis of what happened when the AI and the people worked together. And that's when they saw really big gains. So, they showed a cost and speed improvement, measured just with GPT-5-High, of about 1.5x on both speed and cost. And I think, you know, the glory statistic was about 'how close are we to AGI?' But really, when I read through the paper, it turned me into an AGI skeptic. It made me think we're entering an era where everybody's gonna be AI-enhanced and see cost and speed improvements similar to what they found in that figure seven.

Ryan Donovan: Yeah. This is something I've been hearing too, that the AI with an expert is just tons better. And, you know, having that human in the loop makes the AI itself better too.

Pete Johnson: Indeed. And so, you cited the MIT study that showed 95% failure rates among AI projects. And I think there's a couple of reasons why that is. Number one, there's no SKU for AI. What I think a lot of executives think is, 'I'm gonna make this one product purchase and my AI strategy will be done,' when really it's a lot more nuanced than that. So, that's thing number one. And then, thing number two is, if you think you're going to get AGI and replace people, that's flawed logic, as this GDPval detail shows. If instead you think about, 'how can I improve the productivity of the people that I have?' and then, 'what do I do with those productivity gains?' that's where you really start to see some traction in this market.

Ryan Donovan: Yeah. So what is this paper hiding as you look through the dataset?

Pete Johnson: If you go to the dataset, it shows you the prompts that they used for the tests. So, what they did was, across the 44 occupations, they started with 30 tasks each, a total of 1,320 tasks. Then, they shaved that down and tested five per occupation. And it turns out I've had one of the jobs they tested. So, solutions architect, or sales engineer as it's commonly known, was one of them, and the task was, 'here's a diagram of an on-prem three-tier web application. What would it take to migrate it to Google?' And it gave the actual instructions, which served as the prompt that you would feed to your LLM of choice. So I did. I used Claude Desktop, I fed it the diagram, I fed it the prompt, and it gave me back this really nice, essentially, paper for what a migration plan to GCP would look like. 'Cause that's what the task asked for.

Ryan Donovan: Right.

Pete Johnson: But then what? Somebody has to present that to a customer. Someone has to try to gain the trust: why is it you should use me? If you get to a situation where every sales engineer representing every consultancy can generate the exact same document, what would your selection criteria be? So, there's some humanity that's still part of these tasks that you still need. And like I said before, when you look at those high failure rates that the MIT study showed, I think a lot of it has to do, first, with that 'no SKU for AI' thing; but also, if you think about it in terms of replacing people, that's the wrong way to go. It's: how can you enhance them? And ultimately, what that means is: how can you inject some of your proprietary content into one of these LLMs without having to go through an expensive training cycle? That's ultimately what it boils down to.

Ryan Donovan: It makes me think of the initial push against open source, and what people realized when you open source everything: it's not the software that's the special sauce, right? It's the business, it's the people, it's everything around it.

Pete Johnson: It's the people. That's exactly what it is. It's the people. And, you know, as you and I were chatting before we started recording, we were both at AWS re:Invent last week, and that was very much the thesis of what I think is Werner Vogels' last keynote. And I found it very inspiring that he basically gave us a roadmap for how to be really good software engineers in this AI-enabled era.

Ryan Donovan: That is a great lead into the rest of the conversation. How do we get actual ROI from AI? How do we be really good software engineers, or do whatever other AI-enhanced job we have?

Pete Johnson: Yeah, so if I take that in two parts: part number one is how we get good value out of AI, and part number two is some of the stuff Werner talked about during his keynote, about how we can be good software engineers. So, if I can take the first part first: how do we get value out of these LLMs? Like I said a minute ago, how do you inject your proprietary data into an LLM of your choosing so that you can get it to customize and solve for whatever business problem you're trying to solve? And when I talk to C-suite folks about that 'no SKU for AI' thing, what I tell them is: take your problems first. What are your top 10 or 15 business problems? What five do you have data for? And then, what two or three might you have metrics for, so that you can determine how things got better? If you just spend money on a SKU and you don't know what the before or after is, how do you know how to calculate your ROI? So, you need good data and good metrics in order to get there. Typically, the way we see people implement that, and the reason I joined MongoDB in the first place, ultimately boils down to having good vector search and good embeddings.

Ryan Donovan: Mm-hmm.

Pete Johnson: So, we can talk about that a little bit more, but that's how you get value. When you boil it down, if you have good embeddings and good vector search, and you're applying that to a problem you have good data and good metrics for, that's the recipe for getting value out of AI.

Ryan Donovan: Yeah. That's something I've thought about, you know, reading and writing about how software is gonna survive in the age of AI. In the end, it's the data. And for that data, like you said, it's the vector search, the embeddings. So, what's the approach to getting the best vector search and embeddings?

Pete Johnson: Well, this is where, like I said, when I joined six months ago, the sort of non-technical reason why I joined is, you know, I had the chance to go work for a friend, and my career over 31 years tells me that when you've got a friend as your boss, that tends to work out pretty well. But the technical reason was, in February, MongoDB acquired Voyage AI. And when you first look at that, why would MongoDB acquire a company that does embeddings? Ultimately, it's so that you can have a better-together story and make it easier for developers to create a good vector search, and to do it in a way that gets you better retrieval scores. In particular, there's two features I'll point to, one pre-acquisition and one post-acquisition. When you, as a developer, have to go and make an embedding and a vector search, there's typically five decisions that you have to make once you've selected your embedding model. You have to decide on a similarity score. You have to decide on your chunk size: how big are the chunks I'm gonna put through it? How many dimensions do I want my array to be? What level of quantization: am I gonna store 32-bit floating points, or am I gonna give up some retrieval quality but gain some storage if I go to 8-bit ints, or down to binary? And then, the fifth is whether or not to use a re-ranking model. So, there's two features in particular that Voyage does a really good job of. In January '25, Voyage introduced Matryoshka embeddings. So, consider you embed your corpus of data, and you've decided to try 1,024 dimensions, and that gets you a certain size and a certain quality. What if now I want to try 512? With a traditional embedding model, I would have to re-embed my entire corpus of data at 512 dimensions. But with Matryoshka embeddings, you take the embeddings you already have, and it turns out they're ordered, so you just lop off the last 512 dimensions.
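To make the Matryoshka idea concrete, here's a minimal sketch, assuming you already have 1,024-dimensional vectors from a Matryoshka-trained model (where the leading dimensions carry the most information) and want to drop to 512 without re-embedding. The corpus here is random stand-in data:

```python
import numpy as np

# Stand-in for a stored corpus of 1,024-dimensional embeddings from a
# Matryoshka-trained model (leading dimensions are most informative).
rng = np.random.default_rng(0)
corpus_1024 = rng.standard_normal((10_000, 1024))

def truncate_embeddings(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize so cosine
    similarity still behaves sensibly on the shorter vectors."""
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Half the storage and index size, with no re-embedding pass required.
corpus_512 = truncate_embeddings(corpus_1024, 512)
print(corpus_512.shape)  # (10000, 512)
```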

Ryan Donovan: Interesting.

Pete Johnson: And that makes it so that, as a developer, you can iterate through your cycles of weighing storage size versus retrieval quality: what am I trying to get for my specific application? It decreases the amount of time it takes you to go through that cycle. So, that's an important way of making it easier on the developer to make that decision. That was the first one. The second one that really grabbed me, which we released back in July, was something called contextualized chunks. So, with a traditional embedding model, let's say you wanted your chunk size to be a sentence.

Ryan Donovan: Mm-hmm.

Pete Johnson: Well, a sentence in one document could appear in a second document and have a very different meaning based on the context in which it appears. So, what people traditionally do to overcome that is embed a larger chunk size, to try to capture the context around that sentence. Well, that means more storage as you try to increase your retrieval quality. What contextualized chunking does is, when you send your sentence to be embedded, you also send the entire document, and what we'll do in the background is embed the individual sentence together with the surrounding context of the document. It actually flips it, where you can get better retrieval quality with a smaller chunk size.

Ryan Donovan: Interesting.

Pete Johnson: Which is completely opposite of what you would expect. So, that's another example of trying to reduce the friction a developer might have as they're learning these embeddings.
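For a sense of what contextualized chunk embeddings look like in practice, here's a sketch using the Voyage AI Python SDK. The method, model, and result-field names follow my reading of Voyage's docs and may differ from the current SDK, so treat this as a pattern to verify rather than a definitive call:

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# Each inner list is one document pre-split into small chunks (here,
# sentences). The model embeds every chunk together with the rest of
# the document, so small chunks keep their surrounding meaning.
document_chunks = [
    [
        "The migration moves the web tier first.",
        "The database tier follows after a replication soak test.",
    ]
]

result = vo.contextualized_embed(
    inputs=document_chunks,
    model="voyage-context-3",  # the contextualized model released in July
    input_type="document",
)
chunk_vectors = result.results[0].embeddings  # one vector per sentence
```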

Ryan Donovan: Yeah. I've also seen various overlapping chunking strategies for embeddings.

Pete Johnson: Yes.

Ryan Donovan: Which seems like, you know, you might get better context, but again, it's increasing the storage costs.

Pete Johnson: It is. When it comes to the quantization, the chunk size, and the dimensions, it's this constant battle the developer is facing, where you're trying to balance storage size, and it's not just disk storage, it's the size of the index in memory, versus the retrieval quality. So, what we try to do, both with the base embedding models and with the vector search we have on top of the base MongoDB product, is reduce that friction. I had somebody explain it to me this way once. Recently, one of our executives said, 'remember when JavaScript came out?' I'm old enough to remember when JavaScript came out. Then we got jQuery, and that was way easier to use, and nobody used raw JavaScript anymore. But now we've got, you know, React and Angular, and almost nobody uses jQuery. In this ecosystem of everything related to AI, whether it's learning the frameworks to build agents or learning these embeddings, in our timeline and in the sophistication of the tools, we're way closer to the original JavaScript than we are to React or Angular. And so, what MongoDB is trying to do, both with the Voyage acquisition and with the base product, is move us a little bit closer to jQuery, because we're gonna see more people develop agents and AI products in the next three years than we have in the last three. So, lowering that learning curve and reducing that friction for the individual developer is a really big part of that.

Ryan Donovan: Yeah. It almost seems like, you know, you're talking about moving up the abstraction levels, right?

Pete Johnson: Absolutely. That's a big part of it.

Ryan Donovan: Mm-hmm. So, with all these trade-offs people are looking at, the storage side, reducing the index in memory, all the rest, how can a developer approach making those decisions? Are there ways of thinking about it that you can offer?

Pete Johnson: Yeah, so, it boils down to those five decisions I talked about before, once you've selected your embedding model, and we try to make those core five decisions easier on the developer so that they can spend more of their time on their core business logic and less time worrying about the mechanics of the embedding. Like I said, the quantization, the number of dimensions, and the chunk size are the core three of those five, where you're making that balance between storage and quality. For the similarity score, what we typically recommend is to start with cosine. There's a couple of others that people typically use, but cosine ends up being a good starter similarity score. When it comes to chunk size, if you use the contextualized chunking that Voyage offers, you can go to 64K tokens and get much better retrieval scores than you can with bigger chunks, so you can sort of ignore the overlapping-chunk strategies if you use the contextualized chunking models. When it comes to dimensions, start with 1,024, and again, because we've got Matryoshka embeddings in there, it's easy to try 512. It's easy to scale it down, to see if you can get an acceptable retrieval score at a smaller storage–

Ryan Donovan: Right. It's easier to scale down than up.

Pete Johnson: It's easier to scale down than up. Exactly. When it comes to quantization, when you go to build the index, the way the MongoDB Vector Search API works, you just select what level of quantization you'd like to use. By default, you can use the full 32-bit. Again, you can experiment with 8-bit ints to see if you still get an acceptable retrieval score at a lower storage size. And then we've found that re-ranking can help you quite a bit as well. If you combine the benefits you get out of the contextualized chunking with a best-in-class re-ranker, we find you can boost the retrieval score somewhere in the neighborhood of 10% to 15%, which can be the difference between a hallucination and an AI-enhanced solution that actually helps somebody solve a real-world human problem.
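Here's a minimal sketch of how three of those five decisions (dimensions, similarity, quantization) surface when defining an Atlas Vector Search index with PyMongo. The connection string, database, collection, and field names are placeholders, and the exact index options should be checked against the current Atlas docs:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://...")  # your Atlas connection string
collection = client["mydb"]["docs"]

index = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",    # field holding the chunk vectors
                "numDimensions": 1024,  # start at 1,024; Matryoshka makes 512 cheap to try
                "similarity": "cosine", # the recommended starter similarity score
                "quantization": "scalar",  # 8-bit ints; "binary" trades more quality for space
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index)
```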

Ryan Donovan: Yeah. And you also mentioned holding the database index in memory. I know when we did our cloud transformation, we had to get specialized storage containers just because we needed so much in memory, right? Instead of compute. Are there ways to make that trade-off to either reduce the index size for costs, or if you're going for speed and performance, to increase that index size?

Pete Johnson: Yeah, typically there's a correspondence between the size of the index and the speed that you get out of it, and that conceptually makes sense: if you've got a bigger index space to search across, then your search time is going to, you know, increase along with it. And again, it depends a lot on your specific data, and it depends a lot on what you're running it on. But vector search is just something we offer as part of the Atlas product, which, if you listened to our most recent analyst call, Atlas, the cloud version of our product that runs in every hyperscaler data center, so you get to pick where it gets deployed, will automatically manage that instance for you. Vector search is part of that product.

Ryan Donovan: I know we talked in the call about Mongo being kind of a niche product. Do you wanna sort of address that?

Pete Johnson: Yeah, I mean, when I talk to customers about this, it's because of how far back relational databases go. I happened to have been born the same year the white paper that gave birth to relational databases was written, in 1970. If you think about what the world was like in 1970, applications were oriented towards departments, not the public at large. You could have downtime on the weekends, and storage was really expensive. And because of that, the education system we all go through really tends to put an emphasis on normalization of data: how can you lay your data out so that you're storing the absolute minimum amount of data? And what our founders saw, and our founders sold DoubleClick to Google, which is part of what the ad system you see on Google searches is based on, was that there were more modern use cases where maybe it was okay not to fully normalize, if what you get is better transactional response. The first MongoDB commit was in 2007. That was after the internet. That was after mobile. That was after cloud. So, by being aware of that and having a more flexible schema structure, and you might know MongoDB is largely based on the JSON model, we store the data in a binary version of JSON called BSON, you can get far faster transactional response, the kind of thing you need in an AI application, as opposed to, say, something analytical, where maybe you've got more data and you do have to worry about normalization. Instead of thinking, 'I must normalize at all costs,' if you're willing to de-normalize a little bit with MongoDB, the trade-off is you get better transactional throughput and better transactional response time. Does that fit every workload? No. Does it fit a ton of workloads that are super important? Yes, because with a modern application, you can't have downtime on the weekend like you could in 1970, right? Like, slow is the new downtime.

Ryan Donovan: Right.

Pete Johnson: So, there's plenty of use cases that fit that more de-normalized model that we provide.
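As a toy illustration of that trade-off, here's what a denormalized order document might look like in MongoDB. A normalized relational schema would split this across orders, customers, and line-item tables; all names here are invented:

```python
# One denormalized BSON document: the customer snapshot and line items
# are embedded in the order, trading some duplicated data for a single
# read (no joins) at transaction time.
order = {
    "order_id": "A-1001",
    "placed_at": "2025-12-01T14:03:00Z",
    "customer": {  # duplicated per order instead of joined from a table
        "name": "Ada Lovelace",
        "email": "ada@example.com",
    },
    "items": [  # line items live inside the order document
        {"sku": "KB-87", "qty": 1, "price": 129.00},
        {"sku": "MS-3", "qty": 2, "price": 24.50},
    ],
}
# collection.insert_one(order)  # one write now, one fast read later
```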

Ryan Donovan: Yeah. And you know, obviously JSON is one of the foundational technologies of the current internet, right? Everybody's got JSON.

Pete Johnson: And AI, as it turns out.

Ryan Donovan: And AI, as it turns out.

Pete Johnson: You know, the only other thing was, if you haven't watched the Werner keynote, I would recommend it. It's a good use of an hour and 15 minutes. The 'too long; didn't read' version, if I go from my notes: he talks about the importance of remaining curious, of being a good communicator, and that just because you might use AI-enhanced tooling to generate your code, you still own it. You're still responsible for it running in production. It's not an excuse to say, 'well, my AI generated it.' No. Like, you own it. And he talked about some techniques for making sure that you inspect that code and put your seal of approval on it. Then he talked about the importance of thinking in systems, because AI is gonna be really good at helping you with individual tasks, and you as the human need to see across those tasks and why each one is necessary. And that blended into his final thing, where he used this word that not many people know: polymath. What that means is you're an expert in one deep topic, but you know a little bit about a lot of other things. So, like that T-shaped engineer that you might have heard of–

Ryan Donovan: mm-hmm.

Pete Johnson: So, that word polymath. If you combine those five things, that's where he thinks we're about to see this renaissance of software development based on being AI-enhanced. That's what Vogels' renaissance developer is: you embrace curiosity, communication, ownership, thinking across systems, and being a polymath.

Ryan Donovan: Yeah.

Pete Johnson: It's worth your time. I found it super inspirational. I want to go build stuff.

Ryan Donovan: Yeah. Well, in terms of the ownership, I read an article a while back that asked, 'can you trust AI code?' Well, no. 'Can you trust junior developer code?' No. 'Can you trust code you wrote yesterday?' No. Like, make sure you look at and understand any piece of code that comes across your desk.

Pete Johnson: Absolutely. The difference is that we traditionally gain that understanding and build that trust because we write the code. As we're writing it, we trust what we wrote. That doesn't mean you don't have– you still need the review cycle if you've got AI generating some of that for you. So, it's a shift in thinking. I mean, like I said, I'm 55 years old. I've been writing code since I was 11. And I haven't written a manual line of code in eight months now.

Ryan Donovan: Wow. Go watch the keynote, get inspired and start building.

[Music]

Ryan Donovan: Okay. Well, it is that time of the show again, where we shout out somebody who came on Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out a Populist badge winner: somebody who dropped an answer that was so good, it outscored the accepted answer. So, congrats to Scheff's Cat for answering 'error: non-const static data member must be initialized out of line.' If you're curious about that error, we'll have the answer for you in the show notes. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have comments, questions, or concerns, email me at podcast@stackoverflow.com, and if you wanna reach out to me directly, you can find me on LinkedIn.

Pete Johnson: My name's Pete Johnson. I'm the Field CTO of AI at MongoDB. You can find me on LinkedIn, where I read all the white papers so you don't have to. I'll literally connect with and have open DMs with anyone, so feel free to join in, and I'll do that research so that you don't have to.

Ryan Donovan: All right. Thank you for listening everyone, and we'll talk to you next time.
