
You’re probably underutilizing your GPUs

Ryan is joined by Jared Quincy Davis, CEO and co-founder of Mithril, to explore the importance of efficient resource allocation and GPU utilization in AI, the myths and misconceptions of the GPU shortage, and how the economics of GPUs will change with new scheduling and utilization strategies.


Episode notes:

Mithril’s omnicloud platform aggregates and orchestrates multi-cloud GPUs, CPUs, and storage so you can access your infrastructure through a single platform.

Connect with Jared on Twitter and LinkedIn.

Shoutout to user Razzi Abuissa for winning a Populist badge on their answer to How to find last merge in git?.


TRANSCRIPT


Ryan Donovan: Tired of database limitations and architectures that break when you scale? Think outside rows and columns. MongoDB is built for developers by developers. It's ACID-compliant, enterprise-ready, and fluent in AI. Start building faster at mongodb.com/build.

[Intro Music]

Ryan Donovan: Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we are talking about the GPU shortage problem, or wait– is it a GPU utilization problem? We'll find out from our guest today, Jared Quincy Davis, CEO and founder of Mithril. So, welcome to the show, Jared.

Jared Quincy Davis: Hey, thanks, Ryan. Thanks for having me.

Ryan Donovan: So, top of the show, we like to get to know our guests a little bit. Can you give us a little, quick flyover of how you got into software and technology?

Jared Quincy Davis: Yeah, sure. Happy to. There are multiple places I could start, but maybe one moment, similar to many other people in AI research – I was deeply inspired by AlphaGo from DeepMind, you know, in 2015. At the time, I was pretty interested in a lot of different areas of tech. I thought robotics was cool. I thought quantum computing was interesting. I thought fusion was gonna be a really important problem. I thought a lot of things in bio computation, bioinformatics, would be interesting to work on. But AlphaGo convinced me that I should focus my energies on AI because it was pretty clear that that approach, although it was just playing the game of Go, was a very general, extensible approach, and that you could take that recipe, so to speak, and apply it against a whole host of problems that had the same mathematical character. And so yeah, that was really inspiring to me, and I wanted to work on that. I felt like if we made a lot of progress in that, if we solved the types of problems that prevented that methodology from being more broadly applicable, then we could use it for a whole host of downstream problems. Yeah, I think that's really important because I think we've already seen what things like AlphaGo can do with things like AlphaFold, but I think, you know, it's really important that we have a lot of technological progress, that we have a lot of scientific progress in the world. I think otherwise, the world's very zero-sum or even negative-sum, and I think that kind of technological progress, most broadly defined as new and better ways of doing things, makes the world positive-sum. And so, it was pretty clear that AI technology– that these were kind of massive levers to make the world more positive-sum. And so, we have a lot of challenging problems, and I think the tools that we have at the moment are insufficient to address them. And so, I think we have to build better tools, and that's what I've been working on. I think the community collectively has been working on [it] for quite a while, and now we have better tools, and we're starting to find better ways to use them, and that's why it's a pretty exciting time to be alive.

Ryan Donovan: You know, ever since AI took off—you're right—a lot of people are trying to figure out the right tooling for it, but a lot of people are, you know, scaling up their hardware. And we hear from folks trying to figure out, 'how do I get enough GPUs to train and now run inference on all the data that's coming through?' You tell me that this isn't a GPU shortage problem, it's an efficiency problem. Can you talk a little bit about that?

Jared Quincy Davis: So, I think there are a lot of different common misconceptions about GPUs. You know, one of them is: people do oftentimes believe that there's a shortage, but I think it's arguable that there's not a shortage. There's a lot of capacity, but there are kind of market inefficiencies that prevent people from getting access to capacity, and technical challenges that make it hard to use it well. So, for example, there's a lot of, quote-unquote, 'defensive buying' in GPUs, where people have to provision for their peak need, they have to provision just in case, they have to lock down capacity for some future need defensively, rather than kind of scaling dynamically with their need and relinquishing resources back to the commons. And I think that leads to a lot of resources being sequestered in groups that aren't necessarily using them well, lots of stranded capacity, et cetera. And so, you know, a lot of what we work on is actually almost reminding the ecosystem of the original promise of the public cloud, which was that you wouldn't have to provision fixed capacity. You know, there are many, many, I think, benefits of the original public cloud, but one of the main ones was the idea that capacity would be elastic. You wouldn't have to capacity plan as stringently. You wouldn't have to provision fixed capacity and pay for it whether you use it or not. You could use capacity elastically, and if you have workloads that are embarrassingly parallel, that can scale up – where if it's a workload that would run for a thousand hours on one machine, you could also run it on a thousand machines for one hour – then you could go a thousand X faster, go much faster, for the same cost. That was the real thing that was extremely revolutionary about the cloud, that was kind of unprecedented in the industry of IT. That property is not extended to AI today, and so people buy capacity at what they perceive to be or believe will be their peak need. They don't fully use it, and everyone's kind of forced to go through this themselves, just like in the days before cloud, buying resources on-prem or buying resources in colos. Part of our contention is that a lot of the neoclouds are actually more like neo-colos, and there's not really a proper cloud in the AI cloud ecosystem.

Ryan Donovan: So, that's interesting. Like you said, that was one of the promises of the cloud – you had extensible, sort of flexible resources. Why doesn't that flexibility apply to GPUs?

Jared Quincy Davis: A number of reasons, some economic, some technical. I think part of it is some of the things that we built over several decades to make sharing CPUs really efficient don't apply quite as well in the GPU context. And so, as one kind of example that's a bit more intuitive, one of the things that's really distinct about these GPU workloads is they're 'large language model workloads,' often. And so, what do we mean by large? Well, one somewhat workable definition – I think roughly what people are invoking when they use the term large – is that the system requires more GPU memory to hold the weights and hold the parameters of the model than what even a single kind of state-of-the-art server can provide. And so, when you're in the large regime, one characteristic of that regime is that you need multiple nodes, multiple servers, and it becomes a parallel computing problem to wrangle and deal with these constructions. And so, when you're in that setting, nodes aren't all completely fungible. You care about things like contiguity. You place additional weight on nodes that are proximal to each other or that have high-bandwidth interconnect. And so, what that means is that GPU allocation is not a problem that you can do greedily and naively. It's actually a much more complex problem. You have to think about the shape of the workloads that you're allocating, so it's more like Tetris than it is like, you know, selling independent units. So, you have to say, 'if I'm allocating this contiguous block, then I can't allocate other contiguous blocks of a certain size, because this is kind of crowding out capacity.' And that's a little bit harder. It's a little challenging versus a lot of the more classical big data workloads, where they were kind of independent workloads. They were embarrassingly parallel. There was not this contiguity and kind of interconnect requirement to the work. And so, I think that means that the way that you have to do scheduling, and perhaps also the pricing models – all of it has to be rethought to drive efficiency in the GPU context. Otherwise, you end up with a lot of underutilization due to poor allocation decisions, and that is just a challenging thing to wrangle. And so, what a lot of the companies have opted to do is, rather than solving that problem, which in some ways is harder than language modeling, people have said, 'actually, I'm just going to allocate mostly single-tenant, entire blocks of capacity to a single large customer, and let that customer figure out how to do the utilization, and deal with the kind of utilization economics themselves.' And so, that's been kind of the predominant model – just to sell big blocks to customers wholesale for long durations, you know, two years, three years, five years, et cetera. And then, let the customer figure it out, put the burden on the customer, which was actually antithetical to one of the original value propositions of the cloud that Amazon espoused, which was the idea that the cloud would handle the complexity for you; that it would allow you to focus on, quote unquote, 'just what makes your beer taste better,' was the analogy often used, rather than having to deal with infrastructure complexity.
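
[Editor's note: To make the "Tetris" analogy concrete, here is a minimal, illustrative Python sketch of contiguity-aware placement. It is not Mithril's scheduler; the island/node names and the best-fit heuristic are assumptions for illustration. It shows how a cluster can have plenty of free GPUs in total and still be unable to place a multi-node job that needs them all on one high-bandwidth island.]

```python
# Minimal sketch of contiguity-aware placement (illustrative only).
# Nodes live in "islands" that share a high-bandwidth interconnect;
# a multi-node job must land entirely inside one island.

from typing import Dict, List, Optional

def place_job(islands: Dict[str, List[str]],   # island -> node IDs
              busy: set,                        # node IDs already allocated
              nodes_needed: int) -> Optional[List[str]]:
    """Return nodes from a single island, or None if no island fits."""
    best_free = None
    for island, nodes in islands.items():
        free = [n for n in nodes if n not in busy]
        if len(free) < nodes_needed:
            continue
        # Prefer the tightest fit so large contiguous blocks stay available
        # for future large jobs (this is the "Tetris" part).
        if best_free is None or len(free) < len(best_free):
            best_free = free
    if best_free is None:
        return None  # fragmentation: capacity exists, but not contiguously
    return best_free[:nodes_needed]

# Example: two islands of 8 nodes, 10 nodes free in total,
# but a 6-node job still fails because neither island has 6 free nodes.
islands = {"rack-a": [f"a{i}" for i in range(8)],
           "rack-b": [f"b{i}" for i in range(8)]}
busy = {"a0", "a1", "a2", "b0", "b1", "b2"}   # 5 free in each island
print(place_job(islands, busy, 4))   # fits within one island
print(place_job(islands, busy, 6))   # None: 10 free nodes, but fragmented
```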

Ryan Donovan: If I understand you right, one of the differences is that with a CPU workload, you just say, 'do a thing and then give me the response to the thing,' whereas you sort of need this large language model to sit in memory on the GPU contiguously, right?

Jared Quincy Davis: Often not just on a single GPU, but across multiple servers, multiple nodes connected. So, that makes some things a little bit harder in the AI context. Also, a lot of the standard virtualization tools that exist don't apply directly to GPUs. You have to do some work to apply the VMs that are standard in the cloud, to handle GPUs, and NICs, and get full performance, et cetera, in the GPU and kind of GRES context. So, it's a bit different. And so, that's led to some friction, and I think that is a big part of it. Honestly, though, after reflecting, I think another big part of it, too, is I think we take the cloud for granted and don't necessarily appreciate the extent to which the decisions of individual actors, and the vision of individual people, maybe shaped the way this would evolve. I think we feel like it was natural, and in hindsight, I'm actually starting to realize, going back and reading, just how counterintuitive and controversial some of the initial value propositions of the cloud were. And when I talk to people about having an elastic or on-demand model in the GPU context, some of the arguments I hear are not specific to GPUs and their quirks. They're actually just generic arguments that would've applied to CPUs and storage in the traditional cloud, as well. And I realize that a lot of people don't really understand the cloud, and I'm sure that was true at the time, too. And obviously, there were a lot of big colo companies, and the traditional model was large, and what AWS did was pretty innovative. And it took even several years after AWS was already in market – it took a while for other players like Google and Microsoft to get excited about the category and enter it themselves. And so, I think another part of it is that, despite the success of the traditional cloud, people don't really get why it worked, what people cared about, what was revolutionary about it, and they haven't been able to port those learnings over to this new AI cloud context.

Ryan Donovan: Like you said, the cloud was built on the sort of virtualization slash emulation of a specific chip architecture. With the GPU being sort of massively parallel, what's the kind of hypervisor-level rethink we need to do?

Jared Quincy Davis: So, that could be an extremely deep topic. I'd say even before some really fundamental rethink, you just have to deal with expressing the actual topology of the nodes – the topology of the CPU to GPU, and CPU to NIC affinity. You have to be able to somehow express the topology of the system. You have to do initialization of the memory, PCI initialization. There's just a lot that you have to do to port and adapt the standard technologies a little bit, to have the VMs start quickly, to have the VMs deliver full performance. You know, even beyond doing something completely novel, you have to take the things that we already have today and just apply them well in the GPU context, [which] requires some non-trivial work that a lot of large clouds haven't even done. A lot of the large clouds are just bare metal, and they don't do multi-tenant at all. They just do single-tenant, long-term reservations at scale, right? And so, they just made a decision to circumvent a lot of that, and just to choose a single customer – so much for democratization – just choose a single big customer, like OpenAI, and just allocate raw capacity to them on a long-term basis, almost as a financial, private equity-style plan.

Ryan Donovan: Anytime I've seen GPUs on the cloud, it's basically buying a single unit on an hourly basis. That's the model we have right now. How can a customer better utilize the GPUs and the GPU time that they have?

Jared Quincy Davis: Our approach to this was to design a system from the ground up, specifically for this – first noting what's really unique about the regime that we're in today. You know, one of the really interesting infrastructure questions for the next few years is around this kind of bisection that's emerging between two classes of workloads that are very, very distinct, that have very distinct objective functions that you wanna optimize for. So, one class of workload is the real-time, low-latency regime, and this is, you know, your web agents, this is your co-pilots, this is your synchronous chat sessions, et cetera. On the other hand, you have the asynchronous, throughput-oriented, more economically sensitive workload class, and this is your deep research to some extent, this is your background coding agents like Codex, this is your indexing. This is not a completely new problem. This problem has existed in some form, even in classical search, where obviously you had the Google background indexing work and then the live serving, but I think in the AI context, it's getting even more extreme in some interesting ways. And the amount of variation in the hardware choices, the amount of model variation, is greater. So it's a richer problem in some ways, I would argue. So, the question partially becomes, 'okay, what type of architecture do you need to balance across these different classes of workloads well, and be able to perhaps use the same capacity that you want to use for training or for live inference also efficiently for things like batch inference and background work?' You know, how do you kind of overlap this? And I think a lot of these ideas also have some precedent in prior systems work. And I think the central organizing principle needs to be preemptability and priority, but taken to an extreme. And so, that's a lot of what we built. We built a system that takes the ideas of preemptability, the ideas of priority, to an extreme, and almost uses auctions as a form of congestion control, almost like networking, to basically map workloads across different SKUs, accounting for the heterogeneity across SKUs – the fact that some SKUs are in more favorable locations, will have a greater set of compliance certifications associated with them, have better networking in and out of the data center, or better interconnect, better storage that's proximal, et cetera, et cetera. So, accounting for all that variation, saying, 'okay, how should I route workloads to the chips that satisfy all of the hard criteria?' And then also maximize an objective that's basically about maximizing the surplus between the cost, which is a function of congestion, and the value that my workload ascribes to that unit of compute. And yeah, what that basically means is that every workload should be priced specific to that workload. So, rather than having a fixed price per GPU per hour, you shouldn't think about pricing an allocation, you should think about pricing a workload – and a workload that gives you greater affordances, like one that says, 'hey, I'm preemptable,' or, 'I have a flexible SLA, I just need four hours of runtime within the next 24,' or says, 'actually, I don't have strong contiguity requirements. I'm okay if the nodes assigned to me are disaggregated or separate,' et cetera, et cetera. The more flexibility, the more degrees of freedom, that a workload gives you, the better economics you can confer to it.
And so, a system like this is very helpful because, for workloads that need tight start times, that are latency sensitive, you can satisfy those, and you always have some minimum amount of capacity in a priority-governed pool. So, you're removing availability uncertainty in favor of some amount of price uncertainty for those workloads, so they can run when they need to, in a transparent, market-driven-price way. And on the other hand, for workloads that are flexible, that say, 'hey, you can run me overnight, in off-peak,' then you can give those workloads a massive discount – you know, 10x, 20x, or more. And so it's just a much more efficient system overall. I think this is gonna be really important to be able to share resources, so I can have the same nodes that I use for peak traffic during the day used for offline workloads at night. It's those kind of classical ideas brought to bear at scale across our own first-party hardware, but then also giving this to users to be able to use, client-side, on their on-prem hardware, and on cloud hardware that they have in other clouds that we're partnered with, like Oracle, Nebius, and others, as well.
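
[Editor's note: Here is a small, hypothetical Python sketch of the "price the workload, not the allocation" idea described above. The discount multipliers, base rate, and congestion factor are made-up numbers, not Mithril's actual pricing; the point is only that declaring preemptability, a flexible SLA window, or no contiguity requirement should translate into better economics.]

```python
# Illustrative workload-specific pricing sketch (hypothetical numbers).
# The more flexibility a workload declares, the larger the discount
# off the congestion-adjusted base rate.

from dataclasses import dataclass

@dataclass
class Workload:
    gpu_hours: float
    preemptible: bool = False      # can be paused/evicted and resumed
    flexible_window_h: float = 0   # e.g. "4 hours of runtime in the next 24"
    needs_contiguity: bool = True  # multi-node job needing one interconnect island

def quote(w: Workload, base_rate_per_gpu_hour: float,
          congestion: float = 1.0) -> float:
    """Price a workload, not an allocation. `congestion` scales the base
    rate up when the priority pool is contended (auction-like behavior)."""
    rate = base_rate_per_gpu_hour * congestion
    if w.preemptible:
        rate *= 0.5               # can be scheduled into idle gaps
    if w.flexible_window_h >= 12:
        rate *= 0.6               # can be shifted to off-peak hours
    if not w.needs_contiguity:
        rate *= 0.8               # can soak up fragmented capacity
    return rate * w.gpu_hours

urgent = Workload(gpu_hours=100)                                  # live inference
batch  = Workload(gpu_hours=100, preemptible=True,
                  flexible_window_h=24, needs_contiguity=False)   # overnight batch
print(quote(urgent, base_rate_per_gpu_hour=3.0, congestion=1.4))  # 420.0: pays for certainty
print(quote(batch,  base_rate_per_gpu_hour=3.0))                  # 72.0: large discount
```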

Ryan Donovan: It almost sounds like you're creating a sort of Lambda serverless for GPU work. And then it also sounds like this is, at its core, a scheduling problem.

Jared Quincy Davis: One of the interesting things about GPU workloads is you actually do have to take into account, at least for some workloads, storage and data gravity. It's not an overwhelming factor, because GPUs are so memory-bandwidth constrained, but it is a factor, for sure. So, it's very much a scheduling problem – not quite as ideal as some of the canonical serverless work, but it is a scheduling problem in the broader sense, 100%. And that's why we reference this idea called the omnicloud, which is the idea that every user, at this point, doing anything sophisticated, is multi-cloud, and perhaps even on-prem. You know, they'll use AWS, or GCP, perhaps, for some things, and then they'll have another AI-native cloud, like us or someone else, for a lot of their GPUs. And you want to, especially in the GPU context, be omnicloud, because GPU cost is so much CapEx versus opex that if there's a resource that's underutilized, there's such an economic advantage to routing your work there if the system's efficient. And so, you want to be able to run your workloads preemptibly on spot in various clouds. You want to be able to get reservations where you can get an economic advantage, and have that flexibility to migrate your work. So, we found that a lot of people want that, and us being able to help them with that is a value proposition that they like.

Ryan Donovan: So, we talked to Arm, the chip designer, a couple months ago as of publication, and they talked a lot about, in resource-constrained environments, moving some of the GPU workload to the CPU. Is that something that you consider doing?

Jared Quincy Davis: Usually not. For a lot of our customers and their workloads, actually, they're very GPU-intensive, and a lot of the GPU workloads are not really going to be very efficient on CPU, usually. So, it's almost always the opposite, actually. These are people taking CPU workloads, or workloads that used to run on CPU super clusters, and actually migrating those workloads to be GPU-native – everything from taking simulation systems that were built around CPUs for some science use cases and rewriting those to be neural, so to speak, like NeuralGCM-type things. So, it's almost always the exact opposite: people who are doing things on CPU finding that it's extremely inefficient and they can do the same thing in an absolute fraction of the time, and a fraction of the cost, on GPUs. It's just way more power efficient, way more time efficient, and way more cost efficient. We actually see the exact opposite at scale. I don't think there are many people, that I know of, taking any serious workloads and trying to run them on CPU. There is some offloading/onloading when you're very memory-constrained – if you're trying to run, for example, a big model on a local device or on some really small chip, you might wanna try to load the weights layer by layer into the GPU and kind of run 'em layer by layer. It's extremely slow, and I don't think you do that for any serious application. I think you just get yourself a good GPU, ideally in the cloud. Now, if you can't, or if you really want to be local, maybe you do that, but otherwise, I don't think we see that very much.

Ryan Donovan: So this is a less resource-constrained situation. The other thing we've seen with GPUs is the scale of the newer, better ones, but we've also seen some folks in export-restricted areas making good use of lesser, or older, GPUs. Do you think that's a consideration more people should look into?

Jared Quincy Davis: One of the unique things about GPUs is that their total cost of ownership, their TCO, is pretty heavily governed by CapEx, not opex, meaning GPUs are pretty power efficient, comparatively, relative to CPUs. And so, the cost of owning a GPU is largely the cost of buying the GPU, and then a lot of your cost assumptions and price assumptions to recoup your cost are governed largely by utilization questions on one hand, but then also largely just by depreciation: how long is the assumed life of the chip? And so, Nvidia has picked up the pace; the innovation speed is faster and faster, and so there's a new chip SKU every six months now, and that's going to be the case for the next several years and probably beyond that, as well. And so, I think now, it's pretty obvious that they will be coming out with new SKUs faster and faster. It's not a three- to five-year cycle. It's more like three to five months – not quite, but almost. And so, being able to extend the useful life of the chip is pretty important, and to kind of stretch that CapEx over a longer duration – that's definitely going to be an important vector. And so, I think a lot of companies have been finding more and more creative ways to use older, smaller SKUs, and the life for some of the older SKUs is longer than people maybe anticipated. Like, there are still people using A100s, the Ampere generation, and obviously H100s, et cetera, even though there are now H200s and Blackwells deployed at a pretty decent scale. So, I think, yeah, you can definitely use older chips, especially for running distilled models, et cetera. I think one of the distinctions is you might do your heavy, large-scale training on the latest and greatest; you then also might do a lot of batch inference of your largest model on the latest and greatest chips, if you have a family of models that you produce, but then that largest model might just be used for batch inference, for offline inference, to basically produce synthetic data, or to produce training data that you then use to distill inference models that are cheaper to run, cheaper to serve. And then those cheaper-to-run, cheaper-to-serve models are smaller, require less GPU memory, and so you can run them on older, smaller SKUs for quite a while and kind of extend the life of those older, smaller SKUs. You know, and that's both for live serving, but also even for generating RL rollouts for, you know, training the next generation of the model, as well – so, for that kind of 'RL loop,' so to speak, in post-training. I think definitely there's a use for the older SKUs, and I think making the economics work requires people being somewhat creative about finding ways to use slightly older SKUs, at least.
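
[Editor's note: A back-of-the-envelope Python sketch of the CapEx-versus-opex point above. The purchase price, wattage, and electricity cost are hypothetical placeholders; the takeaway is that the assumed useful life and the utilization you achieve move the effective cost per GPU-hour far more than the power bill does.]

```python
# Rough GPU cost-per-hour model with illustrative numbers (not real pricing).
# CapEx amortization and utilization dominate; power is a small term.

def cost_per_gpu_hour(capex: float, life_years: float, utilization: float,
                      watts: float = 700, power_cost_per_kwh: float = 0.10) -> float:
    hours = life_years * 365 * 24
    capex_per_hour = capex / hours                  # straight-line depreciation
    power_per_hour = (watts / 1000) * power_cost_per_kwh
    # Only the utilized hours earn revenue, so spread the cost over them.
    return (capex_per_hour + power_per_hour) / utilization

# Same ~$30k-class GPU, different depreciation and utilization assumptions:
print(cost_per_gpu_hour(30_000, life_years=5, utilization=0.9))  # ~ $0.84/hr
print(cost_per_gpu_hour(30_000, life_years=3, utilization=0.5))  # ~ $2.42/hr
```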

Ryan Donovan: So, on the flip side of that, for somebody sitting on a bunch of these older units, what's the point when they should start looking at the newer ones? Are the newer GPUs just more of the same with more power, or are there actual qualitative differences?

Jared Quincy Davis: There are definitely qualitative differences. You know, the newer GPUs often differ in that they support lower-precision formats, which you can then use to do more efficient training, and that has some interesting learning-theoretic properties that you may care to study. That's one. Sometimes they also have features that older ones don't have, like support for – this is a bit of a niche thing, but things like TEEs, trusted execution environments. But I'd say that one of the main reasons is you just want more GPU memory, you want more power. That's the main reason, and that's actually increasingly important because, increasingly, the constraint to scaling is not the chips per se, it's getting access to sufficient data center power. It's, you know, grid-scale challenges, increasingly. And when grid-scale challenges are your bottleneck, and you're trying to deploy things, you wanna get as many quote unquote 'flops' per watt as you can, and dollars are less the constraint for many leading labs than power. They're able to get, at least at the moment, it feels like, arbitrary amounts of capital, but there are then the limits of the physical world in terms of how much contiguous power you can stand up and deliver to one relatively compact location. That's a challenge, for sure, and so, in that context, the good thing about newer chips is they are typically, from generation to generation, more power efficient, and so you do get more flops per watt. That's definitely a benefit that does motivate migrating to the newer chips and replacing your fleet at a higher rate than the kind of base, naive economic argument would justify.

Ryan Donovan: Let's say you have a mixed fleet. Do you run a single model across those? Do you look at specialty models?

Jared Quincy Davis: Yeah, I think definitely specialty models. I'm extremely biased here, because a lot of my work has been on the theme of compound AI systems, and also I think an ecosystem of specialty models is probably a lot more democratized – there'll be a longer tail of players with innovative ideas that maybe don't have the scale, or kind of the generality of their model, to, you know, usurp ChatGPT's position, but can be really, really helpful for certain domains. So, I'm extremely bullish on the theme of compound AI systems, on many models, on a menagerie-of-models future. And yeah, I think any company that's serious, from OpenAI to any of the agent labs, from, you know, Databricks to, you know, the coding companies like Cursor and Cognition, et cetera – I think everybody is using many models, and is using ensembles, and is fine-tuning specialized models, and distilling specialized models, especially in the agentic context, when you're trying to do tool use efficiently, et cetera. I think you definitely want to have the really large reasoning models, but then also the slightly more efficient models that can just, you know, call tools well with high fidelity, and high reliability, and low latency, 100%. And so, yeah, I think you already see that – extremely mixed fleets – even in, quote unquote, 'large model inference.' One of the main techniques is speculative decoding, which is basically pairing a large model with a smaller peer, a 'drafter' model. It's basically an inference speed optimization technique that you even use for standard models: you run the drafter model ahead, and you let the verifier model just check its work, and only have to do the work directly itself when it needs to correct the drafter model. So, you always have small models – even when you're just serving a big model, there's usually an accompanying small model that is, quote unquote, 'helping' draft and accelerate the inference of it. So, yeah, definitely a multi-model future, for sure.
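
[Editor's note: For readers new to speculative decoding, here is a simplified, greedy Python sketch of the draft-then-verify loop described above. Production systems use probabilistic acceptance and a single batched verification pass; `draft_model` and `target_model` here are hypothetical callables that return the next token for a sequence.]

```python
# Simplified sketch of speculative decoding (greedy variant, for intuition).
# The small drafter proposes k tokens cheaply; the large verifier keeps the
# ones it agrees with and corrects the first mismatch.

from typing import Callable, List

def speculative_decode(prompt: List[int],
                       draft_model: Callable[[List[int]], int],
                       target_model: Callable[[List[int]], int],
                       k: int = 4, max_new_tokens: int = 64) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1. The drafter runs ahead and proposes k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The verifier checks the draft (in real systems, one batched pass).
        accepted = []
        for t in draft:
            expected = target_model(tokens + accepted)
            if t == expected:
                accepted.append(t)          # draft token verified, keep it
            else:
                accepted.append(expected)   # correct the drafter and stop
                break
        tokens.extend(accepted)
    return tokens
```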

Ryan Donovan: It's almost like AI interns, right?

Jared Quincy Davis: Yeah. That's an interesting analogy. Yeah, that's right. Kind of model interns. Yeah.

Ryan Donovan: Somebody check their work!

[Transition Music]

Ryan Donovan: Well, it is that time of the show again, ladies and gentlemen, and distinguished guests, where we shout out somebody who came onto Stack Overflow and earned themselves a badge thanks to dropping an answer or asking a question. So, today we are shouting out the winner of a Populist badge – somebody who dropped an answer that was so good, it outscored the accepted answer. So, congrats to Razzi Abuissa for answering 'How to find last merge in git?' I'm sure that's a popular question to ask, and if you are one of the askers, we'll have the answer in the show notes. I'm Ryan Donovan. I host the podcast and edit the blog here at Stack Overflow. If you have questions, concerns, topics to cover, et cetera, email me at podcast@stackoverflow.com. And if you wanna find me on the internet, you can find me on LinkedIn.

Jared Quincy Davis: Yeah, and I'm Jared Quincy Davis, the founder and CEO of Mithril. You can find me on X @JaredQ_ or on LinkedIn, and you can obviously find me via Mithril.ai. If you're building in AI and need access to infrastructure, and GPUs, and the best economics, free of obstructions, to make your work easier, reach out. We'd love to partner with you.

Ryan Donovan: Alright, well, thank you for listening, and we'll talk to you next time.
