Do you have what it takes to run AI in production?

CoreWeave is the AI-native platform cloud that’s purpose-built for AI, combining next-generation infrastructure and intelligent tools to power the world’s most complex AI workloads.

Connect with Peter on X.

[intro music]

Ryan Donovan: Hello and welcome to the Stack Overflow podcast. I'm Ryan Donovan, once again recording from the floor at HumanX. Today we're talking about all of the things required to run AI in production. My guest for that is Peter Salanki, CTO and co-founder at CoreWeave. Welcome to the show, Peter.

Peter Salanki: Thank you so much, and I'm happy to be here.

RD: So, you know, you have the full soup to nuts running AI in production. What does that take, especially what wouldn't people think about that is part of that stack?

PS: The stack to run AI workloads, both training and inference, looks very different from your kind of traditional hyperscaler architecture. And I think that that's, you know, we get the question, like, why doesn't Amazon do this? The way that they built their clouds, and for their use cases for the past 20 years, is very different from what you need for AI. So traditional clouds are built around what I like to call the black box model. You can get a virtual machine, or you get some kind of instance, and you don't really know what's happening. Everything is nicely abstracted away from you. And that's great if you run something that's easily parallelizable, like a website or an API server. But when you run AI workloads, whether we're talking about running models or training models, you all of a sudden have these huge, supercomputer-sized workloads. They're all synchronous. They all run together. And if any component breaks, then your entire job fails. In addition to that, with the traditional cloud, web servers and so on, you have your network bandwidth, you have your redundant uplinks from your servers, and you're going to throw away half of the network bandwidth for redundancy. With AI workloads, if I go and tell AI researchers that, hey, we're going to cut half your network bandwidth, but your jobs will never crash, they'll tell me to, you know, go and run out the door, to use the appropriate language. And so the use cases here are not necessarily designed to never fail. They're designed to be able to fail, and then we identify where the failure is, what component failed, isolate that, and then effectively restart without losing progress, right? And to build an infrastructure for that is completely different, it turns out, from building one that is built with all these redundancies to never fail.
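[Editor's sketch: the "fail, isolate, restart without losing progress" pattern Peter describes can be illustrated with a toy checkpoint-and-resume loop. This is a minimal Python sketch under stated assumptions, not CoreWeave's tooling: the function names `train` and `run_with_restart`, the JSON checkpoint file, and the 10-step checkpoint interval are all hypothetical, and the step that isolates the failed component is elided.]

```python
import json
import os


def train(total_steps, ckpt_path, fail_at=None):
    """Toy training loop that resumes from ckpt_path if a checkpoint exists."""
    state = {"step": 0, "loss_sum": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # pick up where the last checkpoint left off

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real work

        if state["step"] % 10 == 0:  # hypothetical checkpoint interval
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)  # atomic rename: never a half-written file
    return state


def run_with_restart(total_steps, ckpt_path, fail_at):
    """Run the job; on a fault, restart it from the last checkpoint."""
    try:
        return train(total_steps, ckpt_path, fail_at=fail_at)
    except RuntimeError:
        # A real system would identify and isolate the failed component here.
        return train(total_steps, ckpt_path)
```

In this sketch a fault at step 25 costs only the five steps since the step-20 checkpoint; the restarted run finishes with the same result as an uninterrupted one.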

RD: So you mentioned the network piece of it. I've been talking to folks who say the network has been one of the big bottlenecks in AI workloads. Has that been your experience as well?

PS: Yeah the network– I mean network is always the hardest part I would say because it's also so interconnected.

RD: Yeah.

PS: It's so interconnected, it's so complex. If we look at, like, the latest generation Grace Blackwell chips from NVIDIA, right? You have a compute tray, a server with four GPUs on it, and then out from each GPU we take four network links. They're actually four 200-gig network links, and then they go into what's called a multi-plane architecture to be able to build really, really large clusters. And this makes the network both really big (there are hundreds of thousands of cables and connectors in these clusters) and really complex to manage and operate. As you scale compute, as you scale your flops, in a chip right now you can do computation faster. Now you want to synchronize your gradients, or if you do inference, you can either parallelize the inference or do disaggregated prefill and decode. You need to move all of this around really fast, right? So the pressure to scale the network as we scale compute is constantly on, and those scale very differently because there's a physical... there are lasers in there, right? When we go over a certain distance, we need to use lasers, we can't use electrons. And scaling lasers is hard, and they're very warm, and there's a completely different physical aspect to scaling a network versus a chip. I'm not saying that scaling chips is easy, it's just different. So yeah, the network is always going to be the ultimate bottleneck. When we get compute to go faster, people are going to be like, okay, I want my synchronization down as much as I can.
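[Editor's sketch: the link counts Peter quotes (four GPUs per tray, four 200-gig links per GPU) can be turned into back-of-the-envelope bandwidth numbers. The Python below is illustrative arithmetic only. It assumes ideal line rate with no protocol or collective-communication overhead, and the 100 GB payload and 100,000-GPU cluster size are hypothetical inputs chosen for round numbers.]

```python
GPUS_PER_TRAY = 4   # from the conversation
LINKS_PER_GPU = 4   # from the conversation
LINK_GBPS = 200     # "200-gig" links, in gigabits per second

per_gpu_gbps = LINKS_PER_GPU * LINK_GBPS       # 800 Gb/s per GPU
per_tray_gbps = GPUS_PER_TRAY * per_gpu_gbps   # 3,200 Gb/s per tray


def transfer_seconds(gigabytes, gbps):
    """Ideal time to move a payload over a link budget (8 bits per byte)."""
    return gigabytes * 8 / gbps


def gpu_side_links(num_gpus):
    """Number of GPU-facing network links in a cluster of num_gpus."""
    return num_gpus * LINKS_PER_GPU


# Hypothetical example: 100 GB over one GPU's links takes 1 second at line rate.
print(transfer_seconds(100, per_gpu_gbps))
# A 100,000-GPU cluster has 400,000 GPU-side links before counting any
# switch-to-switch cabling.
print(gpu_side_links(100_000))
```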

RD: Are there ways to work around that? Like using more pipes, or bigger pipes, or just getting software that's better at routing? Are there ways to mitigate that sort of network lag?

PS: I mean, there are, up to a certain point, right? There are a lot of creative solutions. In the end, like, more is more, yeah? Always, in this space, I feel like. Um, but there are different network architectures, right? If you look at TPUs, they're using what is called a torus architecture, which is really neat up until a certain size, and beyond that it doesn't scale at all. Where the industry has landed now, if you look at how the latest generation NVIDIA chips are designed, you have different scaling domains, where you have what we call the scale-up domain, which is the rack, where we have, say, 72 GPUs. And they have really efficient network communication over NVLink, which is all electrical. There's no fiber involved there. And then you can keep both the power usage down and failure rates down because we're not actually involving optics. But since we can't do that over large distances, because we're only doing it over copper, when we scale out to more chips than that, then we're going back to traditional optical-based transports and back to traditional network scaling. There's a lot of techniques being researched to do training over WAN, where you can connect multiple gigawatt-scale data centers. And some people are using them effectively in production. It's a trade-off. So there are novel ways being developed, there are people doing them, but it's always a trade-off, right? You always have to adjust a model where you might lose some accuracy, which might be fine, but as a model developer or researcher, if I just can have unlimited everything, it means I can work without bounds, right? So the problem is never going to go away. We're just working around the constraints every day.

RD: I mean, unlimited everything means unlimited money, right?

PS: Yes. Some people seem to have that as well in this case.

RD: Awesome. With the sort of AI workloads, you need a lot of memory per instance, right? Generally because you're holding an entire model in memory. Is that something that is different than the traditional cloud workloads or is this a sort of known problem?

PS: I mean, all of this is different right now, and how you interact with your memory is very different in an AI workload versus if you're running a database, where you don't necessarily hit every memory segment in every computation, you know? In traditional inference, you would hit all your GPU memory all the time, which means that you would be very bottlenecked by memory bandwidth. In some traditional use cases, you might be as well. In most traditional use cases, most developers don't really worry about memory bandwidth day-to-day. If you're an AI developer, you worry about it day-to-day. And we had some novel techniques come out of this, like mixture of experts. I guess it's two years now since it really kicked off, right? That allows you... you still need the memory, but you're not activating all of the memory at once for all your requests. So now, all of a sudden, the model is still really big, but we don't need to activate every byte of memory with every request, which allows us to get by with less memory bandwidth. There are novel techniques coming out that help us make that more efficient, but memory bandwidth is more of a bottleneck than memory size. And the same thing comes back: we can scale memory size by connecting more GPUs together, but then you push the pressure onto the networking, so then we're back at networking.
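[Editor's sketch: the mixture-of-experts idea Peter mentions, where the whole model stays in memory but each request touches only part of it, comes down to top-k routing. The Python below is a toy illustration, not any particular model's router: the eight experts, the choice of k = 2, and the gate scores are all made-up values, and the "fraction touched" figure applies to one token's expert weights only, since a batch of tokens can collectively hit more experts.]

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k):
    """Share of expert weights a single token reads when k experts fire."""
    return k / num_experts


# Hypothetical gate scores for one token across eight experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9]
routes = top_k_route(logits, k=2)
print(routes)                  # experts 1 and 3, with weights summing to 1
print(active_fraction(8, 2))   # a quarter of the expert weights per token
```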

RD: Back at the network. One of the other pressures has been GPU speed, obviously. People have been looking for whatever the latest Blackwell server is, whatever the fastest NVIDIA chip is. But I've also heard that the issue isn't as easily reducible to more GPUs, that there's a GPU efficiency aspect where people aren't using the power that they have efficiently. Would you agree?

PS: Well, there's multiple ways to answer that question, right? The first is, can you scale clusters to infinite size? And there's been some plateaus there, and that also comes down a lot to networking, which again is why the networking is always such a critical point. And then they are usually overcome. Now we can pretty confidently scale clusters to many hundreds of thousands of GPUs. There are definitely pros and cons with that. I don't think that everyone should go and build a 100,000-GPU coherent cluster and run 100,000-GPU coherent jobs, because it puts tremendous pressure on the reliability of infrastructure, how you build your lifecycle, fault management, everything I mentioned early on, right? This is a little bit what we are kind of specialized in: we can take a hundred thousand chips and a million network connections, which are all just kind of waiting to cause you trouble, and make it work, and make it work really well for our customers. It's hard and, in some use cases, it's worth it. When you do large-scale pre-training, yes, many GPUs, more compute will help you get done faster. In a lot of other use cases, not necessarily, and you might be better off, you know, doing stuff a little bit smaller and making your life easier. Then there's the utilization story, which is a bit different. That is: how effectively are my researchers or my team using the compute? And we see that as a growing question, especially from the enterprise segment: understanding, this project that is requesting 2,000 GPUs, are they actually using them effectively? Are their algorithms a good fit? Are they actually wasting half their flops, right?
So a lot of stuff that we're working on through our observability and kind of AI software stack is to bridge that gap, where we can give the researchers and project owners visibility into how efficiently is my compute being used, how well is my infrastructure supporting me, versus, you know, it's just dollars going into a black hole.
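[Editor's sketch: one common way to put a number on "are they wasting half their flops" is model FLOPs utilization (MFU), the ratio of useful training FLOPs achieved to the hardware's theoretical peak. The Python below is a rough sketch, not CoreWeave's observability stack. It uses the widely cited approximation of about 6 FLOPs per parameter per token for dense transformer training, and every input (70B parameters, 2 million tokens per second, 1e15 peak FLOP/s per GPU) is a hypothetical round number, with the 2,000-GPU count borrowed from Peter's example.]

```python
def model_flops_utilization(tokens_per_second, params, num_gpus,
                            peak_flops_per_gpu):
    """Approximate MFU for dense transformer training.

    Uses the ~6 * params FLOPs-per-token rule of thumb (forward + backward).
    """
    achieved_flops = 6 * params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops


# Hypothetical job: all inputs are illustrative, not measured values.
mfu = model_flops_utilization(
    tokens_per_second=2.0e6,
    params=70e9,
    num_gpus=2000,
    peak_flops_per_gpu=1e15,
)
print(f"MFU: {mfu:.0%}")  # well under 100% means flops are going unused
```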

RD: Right. And you mentioned the coordinating, you know, the 100 GPU clusters.

PS: 100,000 GPU clusters.

RD: Yeah, 100 plus.

PS: 100,000.

RD: 100,000?

PS: 100,000. 100,000 GPU clusters.

RD: Okay. Well, that seems like...

PS: 100 plus you can do in your basement.

RD: [laughter] It seems like another sort of aspect to the utilization story, right? Where I– and I wonder, at what point does that become like a scheduling problem?

PS: The scheduling problem is always there, yeah. Uh, and it's always a battle between, like, the researchers, you know? A researcher's dream is I have a hundred thousand GPUs for myself and I can use them whenever I want, right? But then they're going to be idle a lot, so that doesn't really work out cost-wise. So then how do you make sure that you schedule things and build a scheduler so that you, as a researcher, or if there's an inference use case that needs to scale up, can get the compute when you need it without there being idle compute, right? And handle preemption and priorities. And it's very tricky. And there's a lot of work going on there, both kind of proprietary and in the open source space. Slurm has traditionally been the most popular scheduler for, like, any type of high performance computing. You know, it's kind of technology that was rooted in the 90s, rooted in the HPC supercomputers. But there's a lot of new developments in the Kubernetes world with queue schedulers to try to attack this better and also in a more cloud-native fashion. But it's definitely a big problem and something that everyone, small or large, is battling with because even if you, quote unquote, have “unlimited money”, you can't actually get unlimited GPUs, so you still need to use them wisely.
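[Editor's sketch: the "preemption and priorities" problem Peter describes can be shown with a toy admission function. The Python below is a deliberately simplified sketch, not how Slurm or any Kubernetes scheduler works internally: the `Job` class, the `submit` function, and the rule "evict the lowest-priority running jobs first, and only jobs with strictly lower priority than the newcomer" are all assumptions made for illustration.]

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int
    priority: int  # higher number = more important


def submit(running, new_job, capacity):
    """Try to admit new_job on a cluster with `capacity` GPUs.

    Returns (running_after, preempted). Lower-priority jobs are evicted,
    lowest first, only if that frees enough GPUs for the new job.
    """
    free = capacity - sum(j.gpus for j in running)
    victims = []
    for job in sorted(running, key=lambda j: j.priority):
        if free >= new_job.gpus:
            break  # enough room already
        if job.priority >= new_job.priority:
            break  # never evict equal or higher priority work
        victims.append(job)
        free += job.gpus
    if free < new_job.gpus:
        return running, []  # cannot fit: leave everything as it was
    kept = [j for j in running if j not in victims]
    return kept + [new_job], victims


# Hypothetical 8-GPU cluster that is fully booked.
running = [Job("pretrain", 6, priority=1), Job("eval", 2, priority=2)]
running, preempted = submit(running, Job("inference-burst", 4, priority=5), 8)
print([j.name for j in running])    # eval and inference-burst keep running
print([j.name for j in preempted])  # pretrain is preempted and must requeue
```

A preempted training job is only cheap to evict if it can resume from a checkpoint, which ties scheduling back to the restart story earlier in the conversation.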

RD: Right. And as a provider of that hardware, you want to also fully utilize that without, you know, someone running into a limit where it's like, you can't use the GPUs now, you know, it's all full.

PS: Yeah, exactly. So, I mean, we help our customers. We're involved in, you know, the scheduling stack and observability stack. We help them with everything they need. Their goal should be building models or running models effectively and serving their use cases or building agents. It shouldn't be fighting with infrastructure, understanding my infrastructure. So when we look at the tools, the products we develop, it's everything around: you guys do what you're good at. We'll deal with everything else, essentially. And also, in between our customers, if you're not buying GPUs in a long-reserved instance contract, then how can we make the commercial model in a way that you can get compute when you want it, and then we can fill some other customer on there without it being super expensive for everyone involved? And the capacity management of it is really interesting.

RD: Yeah. What would you suggest to developers who are building AI applications on this to effectively work within your hard constraints? Especially for things like, you know, the direct GPU access calls.

PS: Yeah, I mean, my general recommendation to developers in this space is: don't overcomplicate it, right? Start like this: there are so many complex layers involved here and the technologies are evolving so fast, right? Like the libraries, the patterns are changing so fast, nothing is ever mature. So if you're overcomplicating it, either with like: oh, I want, you know, a huge GPU cluster to run on day one, or, you know, my first inference stack, before I have any users, is going to do disaggregated serving, all these complex things...

RD: Yeah.

PS: It's like, chill a bit, and, you know, use tools that exist. If you don't need to build your own inference stack, you know, use ours, use other people who are good at running the mechanics of it. Focus on building your model, focus on building your product. That's one of my best recommendations. From my perspective, you know, we partner a lot with our customers in their journey. Some people come in with these, you know, lofty goals: I need the latest chips, I need a lot of them because, you know, we're going to change the world. And I'm like, you might spend so much of your time, you know, managing this complex infrastructure, and how do you schedule over, you know, these different scale-up domains, as we talked about, with NVLink and the scale-up network. Like, maybe we, you know, get you on a simple architecture to start so you can focus on building your product, building your model, and iterate from there, and we kind of go through this journey together. So that's, you know, when I say that some people spend too much time, like, going too complex. You know, the same when we talk about, like, microservices, right? Classic meme. Like, I'm starting my project, you know, it's going to be my web service for my toaster, and, like, I spend two months building out 18 microservices. A little bit the same way. I like microservices, but I don't think that everything needs to be 18 microservices. Use the right tools for the job and start from there. In many cases, the right tools might not necessarily be, you know, get a bunch of raw GPUs day one.

RD: It feels like what you're talking about is people over-future-proofing an application. And at some point they may need that complexity.

PS: Yeah, but then invest in it on time. And the odds that your application looks the same in a year, when you have hit that scale and escape velocity, as how you think it's going to look today are so small that you're going to have to rewrite it anyway.

RD: Right.

PS: That's what I tell my, you know, my developers: we move so fast, things are changing so fast. Like, we build something and we're like, okay, we built this system, it can handle a 10x scale. And then six months later, right, we're being asked to build data centers with 500,000 chips in them.

RD: Right.

PS: And then we built a system that could scale, you know, to 200,000 chips. They're like, okay, we're going to rewrite this again. And this keeps happening. So when I look at, like, software development in my team, it's like, okay, we need to build something that is good and that we can, you know, introspect. It needs to be reliable. But expect it not to live longer than six months. So expect that you're going to rewrite it. And if we end up in a world where, okay, it lasts longer than six months, we'll rewrite it anyway. But, like, having kind of a planned obsolescence, almost, in the systems allows us to focus on the right things versus over-architecting something. And then we learn something new in six months because the market pivoted on us again, as it always seems to do in this space.

RD: It almost sounds like everything is prototypes at some point, right?

PS: I don't know if I would say that.

RD: That is a very reductionist– I love reductionist takes.

PS: But a little bit, right? Because it changes so fast. There's a new model coming out. Then we're changing how we do things. And on inference, like disaggregated serving, right? How we think about inference is changing. Speculative decoding. So if you build a really rigid framework, then you're going to be screwed.

RD: Yeah.

PS: So I mean, from a cloud provider's perspective, I try to build really good primitives that we can then adapt and build on. And those need to be really solid, right? Really solid. But in terms of the upper layers, how we tie these together, how we tie them together in the data center, I need to be flexible there, and they also need to be easy to change.

RD: Yeah. And like you said, the AI capabilities, the software– it's all changing super fast, but hardware has to necessarily move a little slower.

PS: Yeah.

RD: How do you plan, you know, data center capacity with that in mind?

PS: Yeah, planning data center capacity is one of the hardest challenges we have right now, especially as, as we said, the architectures change, the requirements change. If you look back two to three years, you know, it was very easy. I'm building as many H100s as I can, connecting them into a single cluster, and that's it. Right? Everyone was, like, running big pre-training jobs, and inference was pretty simple. Either your inference runs on one server, it's all the same, or it scales up over a couple of servers, right? But it's all kind of the same. Now, on the training side, you have pre-training, you have RL, test-time compute, which involves different types of GPUs. You know, an older generation can be really good for decode, and you wouldn't use it for pre-training, but you can now involve it in an RL loop. And then you have a bunch of CPU compute for running evaluations and running your agents, right? So the data centers become more heterogeneous, and the same thing on the inference side. And this means that we need to, again, build for flexibility and fungibility, so a lot in our data center designs is also very different from how both traditional supercomputers and hyperscaler data centers were designed, where, you know, the SKU mix was decided years in advance. We'll do it as late-binding as possible. So we build a data center, you know? We build liquid cooling everywhere we can, and then last minute, you know, we might actually not put liquid-cooled servers there, and we build them so that we can kind of change the SKU and what goes in there last minute. We know that after a couple of years, say, after six years, it's probably going to get rebuilt. Then we also look at, okay, we assume that, you know, the power per chip is going to keep going down. How can we guess what might happen in six years and try not to have to rebuild the entire data center, even though, like, a lot of stuff is probably...

RD: Yeah I mean– right now six years feels like an eternity to plan for, right?

PS: Yeah, it does. When we, um, like, put up our first A100 GPUs in 2020, I'm like, these things, you know, they'll be gone soon. I won't have to maintain them for long. And now, um, we're sitting here and we're still signing new contracts on the same capacity at the same prices, or sometimes even higher than they were in 2020. Which is pretty crazy, right? And you would imagine that a GPU that's six years old in this space is useless, but thanks to how heterogeneous, um, the compute demand is becoming, they can still be used very effectively even in a cutting-edge inference and agent pipeline.

RD: Yeah. Another complication I've heard about is because of the explosion of AI and explosion of data centers, there's kind of incredible pressures on the supply chain.

PS: Yes.

RD: For especially memory, but kind of everything.

PS: Yeah, it shifts. There's a lot of shifts, right? Every week I'm losing sleep over a different supply chain problem. It used to be GPU chips. Then it was, like, land and powered shell. Now it's the NAND and memory.

RD: You know, when we left our data center, they basically spiked all of the servers. Is there a secondary market for data center hardware?

PS: There definitely is. There's a huge secondary market right now for both NVMe SSDs and DRAM. We buy directly from manufacturers for various reasons. We used to, you know, back in the last supply chain crisis of COVID, we did a lot more things where we, like, scoured eBay for DRAM sticks. But we can't do that at our scale anymore, regretfully. That used to be fun. But yeah, there's a big secondary market for it. And I get inquiries in my inbox, like five a day or more, from someone who's like: I have 400 DRAM sticks. And I'm like, that's a little bit too little for me. Or it's like, can you sell me any DRAM sticks?

RD: So the, you know, we talked about the various things keeping you up at night. This seems like a whack-a-mole where it's like every time you solve one, it's a new problem, right? You fix the network, it's memory, you fix memory, it's network.

PS: That's the definition of my life.

RD: Yeah. Is there a possibility of a sort of like, equilibrium balance? Or do you think this is just like capacities keep shifting, features keep shifting?

PS: I don't think it will. I think that, you know, thinking we would ever have enough compute and things will ever slow down is a fallacy. I also don't think it's ever been as crazy as it is now in terms of demand and how the pressure is on both building more and advancing the technology. Like, we definitely live in a unique moment. I'm not going to say that that moment is going to stop in a year. I have no signals that anything is slowing down, that any demand, both in terms of advancing technology and building scale, is slowing down. It's like literally the opposite.

RD: Yeah, I think you may have mentioned it earlier about the sort of power constraints. We've heard people saying that's sort of the new thing that people are talking about is like, the power usage on chips, and everybody's trying to figure out how to optimize. How much of a concern is the sort of power usage?

PS: I mean, power is kind of, I guess, the ultimate constraint, but there are a lot of constraints, we talked about memory and so on. Like power, I used to lose more sleep about that a year ago. From our point of view, you know, we have secured power, we secured our data center space for what we thought we would need over the next two to three years. Now the demand is skyrocketing. We're working on more. As kind of an industry as a whole, raw power is not the bottleneck. And I don't see raw power being the bottleneck in the next two or three years. Getting the power into usable form is the bottleneck. So how do you get it from a high voltage line? How do you build substations fast enough? Medium voltage transformers, things that normally have 24 to 48 month lead times, right? How do you actually get enough of them and build them fast enough? And then low-voltage transformers, HVAC, generators, everything you need to build a data center. That's really where it is. And the one thing that everyone is focused on is building power, building generation, and there's a lot of work that's happening there. I love how there's innovation going into modular nuclear reactors, [indistinct] a bunch of cool companies coming out. But the trades are also a very important part, right? Like getting skilled electricians at the level we need to kind of build these AI data centers, that's not something a lot of people are talking about, right? It's very hard, and becoming a skilled electrician, you know, you don't go to school for a year. It's apprenticeships, and you go out, you work with a master electrician for many years to learn these things. And it's a very dangerous job as well, working with these high voltages. And so, like, that's a part of the stack that is very hard to scale because it takes a finite time for humans to learn. And before the AI boom, it wasn't very sexy to become an electrician, right? And the same thing with electrical engineers.
If you look at electrical engineering programs at universities, they've been declining, right? And now they're getting a bit of a rebound.

RD: Yeah.

PS: But the supply and demand in the skills we need at the lowest level of engineering really has not been ready for this.

RD: Yeah. And I wonder, you know, some of those supply and demand things we're talking about, what do you think will be the consumer effect for power, hardware, that sort of thing?

PS: For the consumer hardware?

RD: Yeah.

PS: That’s uh… I'm sorry, I guess. So I think that the market is pretty good at protecting consumers. I feel like NVIDIA has been really good. If you remember during the mining craze days, they were like, we're not selling any more GPUs to miners, and they were making sure you can't mine on them, to protect their consumer market. So I'm hoping that the market will make sure there is enough supply for consumers.

RD: Yeah.

PS: Uh, but like, there is some impact, right? I'm a big UniFi guy, as many tech nerds are, and I was going to buy the latest, uh, UniFi G6 doorbell, and I'm like, it's delayed by three months because they can't get the memory for it. And then I'm like, shit, I screwed this up for myself. So I think that from the consumer point of view, right, those things will get fixed pretty soon. I'm not really worried about a systemic kind of impact to computer hardware. Like, there's a squeeze right now from memory, but the difference between the amount of memory we need for the growing AI consumption and for consumer hardware is so large that the temporary constraint for consumer hardware will get fixed very soon. And in terms of power and so on, there's no concern there. I think that there have been talks, and when, you know, some hyperscalers have built data centers that have taken over existing nuclear power plants, they got some backlash because it led to increased power rates. I think in the industry there's a lot of acknowledgement that, you know, we should build these things where there's a surplus of power, not build them where we drive up power rates for consumers. And thankfully, with a lot of these types of applications, you know, especially training, it's not really tied to exactly where you are. We might want to keep it within a specific geo to have decent latency, but if we place it somewhere where there's surplus power... In this country and around the world, there are plenty of areas where there's a lot of stranded power, because there's a lot of generation, and there used to be a lot of steel plants and so on there, and now we can't actually get it into New York City, as an example. So if we can build the data centers closer to generation, there is power, and we're not actually causing any impact on consumers. See, we're in a very exciting time with AI.
I mean, we're looking at the proliferation of AI tools inside my teams. It's pretty awesome. We use Codex, Claude Code every day. I think as software developers, we're always figuring out how to use the right balance of these tools, how not to become over-reliant on them so that we don't understand the systems we're building. That's something that I pay a lot of attention to in our teams. Like, okay, you know, we're using AI coding tools as a productivity enhancer, but not to lose grip of the architecture, how everything actually works. I expect all of my engineers to understand every line that they built with AI tools. So that's my first advice to every software developer out there. The second one, when looking at the infrastructure and AI infrastructure, is: all infrastructure isn't built the same. There's a lot of people in the market. There's a lot of craziness going on. I think when you go into this space, when you look to either train a model, run a model, or run an agent, make sure that you know who your provider is, what they do, both from a security and reliability point of view. Especially if you upload your data, your personal data, where does it end up? Is it going to end up in a leak somewhere? So make sure that you vet your providers and that they actually know their stuff, because we're at a time where supply is very tight and there's a lot of new entrants to the market.

RD: Yeah. So if people want to find out more about CoreWeave or connect with you, what can they do?

PS: Check our YouTube. We've got some other interviews there. They can check our website. They can tweet us, even though I'm not that much on Twitter. And they can also sign up for some CoreWeave compute, I guess.

RD: All right. Well, thank you for listening, everyone, and we'll talk to you next time.

[outro music]
