
Even the chip makers are making LLMs

Ryan welcomes Kari Briski, NVIDIA’s VP of Generative AI Software for Enterprise, to the show to explore how a chip manufacturer got into the model development game.

Credit: Alexandra Francis

They discuss NVIDIA’s co-design feedback loop between model builders and hardware architects, share insights on precision model training and memory management systems, and take a look at the roadmap and development of NVIDIA’s fully open-source Nemotron.

Nemotron is a family of open models with open weights, training data, and recipes for building specialized AI agents. You can learn more on their Hugging Face page or at NVIDIA GTC on March 16-19.

Connect with Kari on LinkedIn.

Congrats to user The4thIceman for winning a Populist badge on their answer to How to Center Text in Pygame.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Calling all roboticists and engineers interested in AI robotics. Intrinsic, along with Open Robotics, Nvidia, and Google DeepMind, have announced a competition with a prize pool of $180,000 to solve dexterous cable management and insertion with the latest AI in open source tools. Register by the 17th of April and take part as an individual or form a team. Go to intrinsic.ai/stack. That's I-N-T-R-I-N-S-I-C dot A-I forward slash S-T-A-C-K.

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we're gonna be talking about why a chip maker is in the model business. So, my guest for that is Kari Briski, VP of Generative AI at Nvidia. So, welcome to the show, Kari.

Kari Briski: Thank you for having me, Ryan. Glad to be here. And we're not just a [inaudible] company, we're a full-stack company.

Ryan Donovan: Of course. So, before we get into the topic of the show, we like to get to know our guests. Can you tell us a little bit about how you got into software and technology?

Kari Briski: Wow. I've been in software and technology for as long as I can remember. My mother was a computer teacher at night. She taught night school for computer programming. I would sit there with the phone on the modem with our TV, [which] was like a big blue screen, in case her students had questions—like the early chat rooms. Yeah. And so, I've always gravitated towards technology, and I went to school for [it]. It was the first computer engineering degree at the time, from the University of Pittsburgh, which was both computer science and electrical engineering. And then, right out of college, I wrote software for chip design, actually, for IBM systems. And my journey's always been in design, and computers, and following technology.

Ryan Donovan: Yeah. Big in the hardware side of things too, huh?

Kari Briski: Just like today, I'll talk a little bit about extreme co-design. So, it takes hardware and software. It takes software to run the hardware. If there was no software, the hardware would just be a brick.

Ryan Donovan: You all mentioned that you're working on these models. I was thinking—I know you're full-stack, forgive me for the reductionist framing—but why is a chip maker, a GPU maker, involved in making models? Did your testing process just get outta hand and [you] decided to go with it?

Kari Briski: There's a couple reasons why we make models. Number one, again, with the software-hardware co-design, you always want to find: what is a difficult workload? What is driving GPUs? Even from the early days of CUDA, we always went and found what was the first real way to accelerate an important workload. That kinda led to high-performance computing, right? And then, it led to how we were accelerating really important and difficult workloads, like computational fluid dynamics. And so, if you watch that progression of CUDA and that developer ecosystem, that kind of led to deep learning. And so, when you get to deep learning, and through all this journey, we always employ people who truly understand the applications. We call these people developer relations. They work with the developers, they work with the applications, they understand them inside and out, so they understand how to accelerate them. You have to truly know the workload in order to accelerate it. And instead of just talking the talk, you have to walk the walk. And so, when it came to deep learning and computer vision, then we started to say, 'okay, what are more difficult workloads?' Which led to things like speech, and text-to-speech—and speech synthesis is a difficult workload, too. And that led to NLP, and then BERT, and LLMs. So, we've been working on large language models since 2018. We've actually been doing this for quite some time and building these models, but [what] this allows us to do is work very tightly with our architecture team; and it's not just about giving feedback on the compute, but also the networking, and the storage, and everything that's involved—and not just the training of these models, but also the running of these models at scale. And so, everything that we do is a rapid co-design together, so we're able to feed back into each other.

Ryan Donovan: I think I was joking when I said that, but it does sound like part of this is actually putting the hardware through its paces, right?

Kari Briski: Yeah, it really is.

Ryan Donovan: In developing models and getting that feedback cycle, what are the sort of interesting things you've learned either from the hardware or from the model design?

Kari Briski: So, some of the things that we've done: with Blackwell we did NVFP4. If you talk about precisions, there's the floating point precision that you're able to train in and use for the weights and activations. The world has been training at what we call 'FP16,' or 'floating point 16,' and then being able to train and run inference in FP8, [for] example, for Hopper, and NVFP4 for Blackwell. And then, we provide those recipes not just for ourselves, but put them out to the world, so that they can have an improved memory footprint for the model, improved performance, and scalability. Other things that we've done: running these at scale for inference has allowed us to build frameworks like Dynamo for disaggregated serving of these really large models, and NIXL for the communication in between. So it's not just about the hardware, it's also about the libraries that we release, as well.

Ryan Donovan: It's interesting that you're starting from the lower floating point. I know a lot of organizations will take the bigger model and then quantize down. Is there a specific benefit to inferencing at that lower floating point?

Kari Briski: Yeah, there's all kinds of benefits. To your point, I'll start with what, historically, people have done, which is train in the higher precision and then quantize down, but you can lose a little bit of accuracy. Even though you can do post-training to train that back, you still typically lose one to two percent of accuracy. The benefit of actually training in that reduced precision is that you retain the full accuracy of the model, and you're able to put that out. And it has benefits for both training and inference. Again, I mentioned the memory that you need in order to store the model, in order to train and serve at the same time. So, there's all kinds of benefits to training in the reduced precision, rather than quantization and knowledge distillation later.
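As a rough illustration of the rounding loss described here (a toy sketch using Python's standard library, not NVIDIA's actual FP8/NVFP4 recipe), you can round-trip a weight value through IEEE 754 half precision and watch a small error appear — the kind of error post-training quantization introduces across every weight at once:

```python
import struct

# Post-training quantization loses information because low-precision
# formats round the weights. Round-trip one value through half precision.
def to_fp16_and_back(x: float) -> float:
    # 'e' is the IEEE 754 half-precision (binary16) format code.
    return struct.unpack("e", struct.pack("e", x))[0]

w = 0.0001234                      # a typical small weight value
rounded = to_fp16_and_back(w)
rel_err = abs(w - rounded) / abs(w)
print(f"{w} -> {rounded} (relative error {rel_err:.2%})")
```

Training natively in the reduced precision, as described above, sidesteps this mismatch: the weights the model learns are the weights it serves.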

Ryan Donovan: Yeah, I hear a lot from folks about the memory issue because my understanding is that when you run a model, you have to keep basically the whole thing in memory. What kind of memory reductions do you get from inferencing at that lower floating point?

Kari Briski: You can get up to half, right? So, if you go from FP16 to [inaudible], you save half the space. There's other considerations when you're talking about context length, and input sequence length, and output sequence length. It's not just one-to-one. There's some other things, but you'll need to run the model so that you can push the limits of the memory. But we try to do that kind of—I'm jumping around a little bit—but we do that extreme co-design also to match models to hardware requirements, right? So, maybe we wanna be able to fit a smaller model into a smaller form factor GPU, versus these really large, more accurate, and robust models, which you actually have to put onto a full node. What we consider a node is like 8 GPUs. And so, some of the really accurate models are multi-node, so you're thinking 16 GPUs. So, the more you can reduce the memory requirements, the more efficient you get for both the compute as well as the latency.
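The halving is simple arithmetic: weight memory is parameter count times bytes per parameter. A back-of-the-envelope sketch (the 70B model size is a hypothetical example, and this deliberately ignores the KV cache, activations, and optimizer state, which add substantially more):

```python
# Rough weight-only memory estimate at different precisions.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # params_billion * 1e9 params * bytes, expressed in GB (1e9 bytes).
    return params_billion * bytes_per_param

for name, nbytes in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    gb = weight_memory_gb(70, nbytes)
    print(f"hypothetical 70B-parameter model in {name}: ~{gb:.0f} GB")
```

At FP16, a hypothetical 70B-parameter model needs roughly 140 GB for weights alone — more than a single GPU holds — which is one reason the larger, more accurate models described above spill across a full 8-GPU node or more.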

Ryan Donovan: Yeah. And you talked about the extreme co-design. What makes it 'extreme'?

Kari Briski: What makes it extreme? We have a really tight feedback loop, and so when we're working with the hardware architecture team, we're able to, during their, what we call 'plan of record process' or 'POR process,' we're able to say, 'okay, this is a problem we've seen over and over. You might wanna consider putting this into the next generation of hardware, or you might wanna consider a new engine or SKU.' We just announced at CES this context memory engine. So, I think that when we say 'extreme,' it's really rapid fire, close-to-close, engineer-to-engineer daily feedback, so that when they are doing the planning, they have all the information, and it's not too late.

Ryan Donovan: I've heard of some organizations, or some people even, theorizing that they'll be designing chips and processors for specific models. Are you getting to that level where you have, 'here's the recommended hardware for this model,' or do you still consider it in a general-purpose way?

Kari Briski: No, we still consider GPUs really fantastic general-purpose compute, because one thing that we've noticed is that the most accurate agentic systems take systems of models. It's not just one model. It's not one model to rule them all. It's not just one architecture, and that's the problem that you get into when you have a specialized chip. One thing we have been able to do, though, as I mentioned with Dynamo and disaggregated serving, is to say, 'okay, when we're running this model at inference time, we can do the prefill, or the input, on one type of GPU SKU, and then do the decode on another,' and work in this disaggregated serving mode. Then, bringing the answer back together before you serve it, you can actually be more efficient. Those types of optimizations happen, but we haven't really seen specialized chips per model, just because model architectures change. And again, it's not just LLMs in a system, right? When I say model architecture, immediately in my mind I'm thinking about speech models, and speech synthesis models, and ASR models. And so, there's all kinds of models that you need to run in a total AI system.
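A heavily simplified sketch of the disaggregated-serving idea (the function names and the toy "model" here are invented for illustration; Dynamo's real API looks nothing like this): prefill consumes the whole prompt once to build state, and decode then generates token by token against that state, so the two stages can, in principle, run on different hardware:

```python
# Conceptual split of inference into prefill and decode stages.
def prefill(prompt_tokens: list[int]) -> list[int]:
    # Stand-in for building the KV cache from the full prompt in one pass.
    return list(prompt_tokens)

def decode(kv_cache: list[int], steps: int) -> list[int]:
    # Stand-in for autoregressive generation: one token per step,
    # each step reading and extending the cache.
    out = []
    for _ in range(steps):
        nxt = (sum(kv_cache) + len(out)) % 100  # dummy next-token rule
        out.append(nxt)
        kv_cache.append(nxt)
    return out

cache = prefill([5, 7, 11])
print(decode(cache, 3))
```

The design point the split illustrates: prefill is compute-bound (one big parallel pass) while decode is memory-bandwidth-bound (many small sequential steps), so pairing each stage with hardware suited to it can improve overall efficiency.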

Ryan Donovan: Somebody could have more traditional old style machine learning models on there, too. We've got into the weeds real fast, but I wanna jump back and talk about the models themselves. What are the models that you've developed for this?

Kari Briski: The models that we've developed are called Nemotron. Like I said, we've been doing model development for a while. The name is like an homage: two teams came together—one was the Megatron team, 'cause back in 2018, that was the biggest, baddest transformer; and NeMo, which stands for Neural Modules and has nothing to do with a cute little fish. So, we just felt it was right to give both those teams the credit when they came together. That's why Nemotron is the family. And then, when we talk about the models—the LLMs, at least, that we've put out—we've put out what we call Nano, Super, and Ultra, because there's a small, medium, large, if you will, if you think about the sizes. And at CES, we also started to talk about a family of models, just because we're getting into things like vision language models, embedding models, how you re-rank for RAG systems, and speech and omni models, as well. We've got a world-class speech team that now has some really fantastic models that will fall under the Nemotron family. But when we think of the LLMs, it's Nano, Super, and Ultra.

Ryan Donovan: Were there any of these specific models that created the biggest effects on the feedback loop?

Kari Briski: What we were really excited about recently, with the most recent releases of the Nemotron family—to talk about research informing the product as well—is that we had this hybrid model. So, it wasn't just a dense transformer model; it was the combination of a Mamba state space model plus a transformer. And so, we were able to take that hybrid model, which has more token efficiency. Then, what we did was move that to what's called a 'mixture of experts,' so we're able to do both the innovations in the architecture of the model and adopt other well-known recipes, like mixture of experts, and bring those together. So, that's an example of where we're trying to innovate, because we know that the token demand is going up for these agentic systems with reasoning models, and so is the need to be more token efficient, both at training and inference time.

Ryan Donovan: Mamba State Space Model—I'm not particularly familiar with it. Can you talk a little bit about it and how that works in conjunction with the transformer model?

Kari Briski: There's a lot of attention that you have in this model. And so, for each token, it's comparing itself to every other token in the sequence. And so, when you have a dense model like that, as it grows, the inference time is quadratic. By replacing some of those attention heads with these state space models, which are very efficient, you're able to, again, save on that memory and gain token efficiency, even at processing time, for both training and inference.
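A back-of-the-envelope way to see the quadratic-versus-linear contrast (purely illustrative operation counts, not a real cost model for either architecture):

```python
# Dense self-attention compares every token with every other token,
# while a state space model makes one recurrent state update per token.
def attention_ops(seq_len: int) -> int:
    return seq_len * seq_len   # pairwise comparisons: O(n^2)

def ssm_ops(seq_len: int) -> int:
    return seq_len             # one state update per token: O(n)

for n in (1_000, 100_000):
    print(f"{n:>7} tokens: attention ~{attention_ops(n):,} ops, "
          f"SSM ~{ssm_ops(n):,} ops")
```

Doubling the sequence length quadruples the attention count but only doubles the SSM count, which is why swapping some attention heads for state space layers pays off at long context lengths.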

Ryan Donovan: Is it something of a holistic, almost world model view?

Kari Briski: I think when we talk about architectures of models—again, I talked about the transformer. There are other types of architectures, like the diffusion architecture. The state space model is more of a sequence model, and even before the transformer, there were sequence-to-sequence models. And so, some of that research came back because it was so efficient at being able to process tokens. The model itself wasn't accurate back then, but now, when you're able to mix the two architectures together, you can get a different outcome. So, it's being able to do better linear scaling for this context recall on the data. Again, there's state space models, there's transformers, there's diffusion models, there's ResNets, there's ConvNets—there's all kinds of different architectures. And I think you're able to really, truly innovate when there's a problem, like token efficiency: you can go do purpose-built innovation. And I love the example of diffusion models, because they're still very researchy, but we're seeing a lot of promising results from the research on diffusion models.

Ryan Donovan: I've seen the Baby Dragon Hatchling model is a newly discussed architecture. Are there any sort of architectures where you're seeing new promise for better results, other than the ones we've talked about?

Kari Briski: There's all kinds of papers, but I think diffusion models are the ones that I see the most promising results coming from right now.

Ryan Donovan: You talked about the feedback loop and the co-design. Have you seen changes to the hardware designs based on some of the models you've been training, creating?

Kari Briski: Definitely—not just for the compute with reduced precision, but for how we're handling communication and networking between the compute, as well as storage, right? So, I mentioned the context memory engine. How are you storing context? How are you feeding in these really large contexts? We talk about a million-token context length. If you think of a 'token' as roughly a word, that's at least a million words—but you have really long coding programs, or code generation, and you're feeding your code into it. You need to take in bodies of not just the documentation but the code itself, and how do you move that through the context without losing it, and store it, and recall it? And so, there's all kinds of memory hierarchies that you have to traverse architecturally, from both hardware and software.

Ryan Donovan: Because I've seen papers about infinite context or whatever, and then also see things that talk about context rot, and the difficulty with the sort of needle in the haystack problem, how do you maintain that context? Do you keep it separate from the prompts and the inference?

Kari Briski: For the model itself, when we talk about million-token context length, it's within the model. When you get into agentic systems, here's where—again, when I mentioned that one model does not rule them all, it's systems of models. How do you make that work together within agentic systems? How do you push certain parts of the memory or context to disk, and when do you recall it, and why? How do you store it, and where, and how often? How much do you re-index and refresh your data? These are complex agentic systems. I think a year ago, we talked a lot about RAG, and RAG is still very important. It's Retrieval Augmented Generation. So, if you're able to embed accurately and then recall and re-rank, it's an online recommendation system—but that's just one part. And now we've found that's just a tool for agents. So, when we get into things like this memory management, we're now talking about agents, and multiple models all happening at once. How are they even sharing memory? How is an agent keeping memories for you, or keeping memories for itself? And you're like, 'hey, I've seen this situation before, and here's how I'm supposed to react in this situation.' Which kind of leads to the importance of specialization in these systems, as well, which is another really good and important aspect of open models.
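The embed-and-recall step summarized above can be sketched in miniature (word-count vectors and cosine similarity stand in for a real embedding model and vector database; the documents and query here are invented examples):

```python
from collections import Counter
import math

# Toy "embedding": represent text as a bag-of-words count vector.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

# Cosine similarity between two count vectors.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "GPUs accelerate matrix multiplication",
    "state space models process tokens sequentially",
    "retrieval augmented generation grounds model answers",
]
query = "how does retrieval augmented generation work"

# Recall: rank documents by similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(embed(query), embed(d)),
                reverse=True)
print(ranked[0])
```

In a production RAG system, the same recall step would run against learned embeddings in a vector index, followed by a separate re-ranking model — but the shape of the pipeline is the same.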

Ryan Donovan: It almost sounds like you're developing a caching system for AI.

Kari Briski: It's funny, because when we're around each other, we laugh that there's a lot of computer system design that's still relevant to agents, and that agents are almost kinda like a new type of object-oriented programming—because when you spin off one agent, it just goes off, and it can think and be autonomous and do something, but it still comes back with an answer. So, you just kinda have that object go off and do something, and then it comes back and keeps going in your program.

Ryan Donovan: It feels like people are speed running all of the networked software stuff, like we're already at the microservices part of AI.

Kari Briski: This is how fast everything's moving, which is incredible to speak to the performance of these models and what they can do, which is why, again, coming back to the importance of having these open models so that you can specialize them to your domain.

Ryan Donovan: You talked about the memory storage for the AI models. Does any of that memory storage happen on specialized pieces of the hardware? Do you have L1 caches for that?

Kari Briski: There's all kinds of innovation happening in storage around specialization for agents, or expanding capabilities for agents. We've got really great partners in the ecosystem that are doing some interesting things. That's why we have such a large ecosystem, so that we can work with all these partners. We can't do everything, but—rising tides lift all boats—we're working with everybody. And I think what's interesting is not just how storage is transforming in service of these models and agents, but also how these agents and models are being [brought] into these storage solutions, which is fun, because now you're not just retrieving data and having to munge it yourself; you're actually retrieving real answers, and it's embedded into the storage system. So, it's twofold, and it's really interesting to see.

Ryan Donovan: That is interesting. Another thing that's interesting to me is that these are open models. Are they open weight or fully open source?

Kari Briski: They are fully open source. We released our model architectures, the model weights, the data that we've used to train the models, as well as all of the libraries. When you do that, you're releasing a complete recipe. And when you're able to do that, it fuels this sort of research and development engine, or faster iteration speed, where everyone can learn from each other in the open source. That's why we did that. We also have a lot of customers who said that, for whatever reason, there are some models where they can't trust what they were trained on, or how they might answer, or the knowledge contained within them. So, when we released the data sets, we got a lot more engagement, because now you can interrogate the data, inspect it, pull from it, and build your own model using the data yourself. And so, we just thought that was a really important aspect of what we think is the new software development platform.

Ryan Donovan: I think it's super interesting. I know when I talk to folks about the open source models, a lot of 'em are like, 'it's just open weights,' but to have it fully open source is lovely, because in this day and age there is still a liability issue with the training data, right?

Kari Briski: That's right. And so, there's a lot of enterprises that say, 'I don't wanna take on that liability.' They've been stuck between a rock and a hard place, because they get tied to maybe a third-party API service deciding for them, and they can't control that. They can't audit it. They don't have this sort of ownership, so they can't govern themselves. And so, when you open up the model and the data, number one, you give these companies a bootstrap they can jump from, and they know that they're jumping from a trusted source. And when they go to fine-tune on their specific domain, they can pull from the data sets that we have. We also provide tools to generate data based on their domain, so that they can create these, what we call 'reinforcement learning gym environments'—these verifiers that become specialized in their domain. So, there's all kinds of reasons that true open source matters for the development of the model.

Ryan Donovan: I imagine it's basically giving your testing data to a bunch of folks who are using your hardware, right?

Kari Briski: Sure, but we've given test scripts for our hardware in other realms, as well. We call that MLPerf. There's a whole organization around it—it's not just Nvidia, it's a consortium of vendors—where you take the hardest workloads, the most important models, and then you run them. So, that's not really what this is about. This is really about fueling development of new software applications for AI.

Ryan Donovan: So, you mentioned that once you went fully open, showed your training data, you got a lot more engagement. What are the results of that engagement? What did you learn from that?

Kari Briski: We learned that there's a lot more expert partners who have been taking our data and applying it to their own domain, and generating more data, and creating their own models, which was exactly what we wanted. An example of that is ServiceNow: they released their own Apriel model for their domain. They also released some gym environments that were specific to their tools. So, that was interesting to see. We got a lot of engagement on domain expertise: 'Hey, this is a really great base model that I can trust. Now let me try and apply it to my domain'—and all kinds of domains. I always get asked, 'is there any one vertical in particular that's picking up LLMs faster?' And I say no. It depends on the expertise of their team, but with how fast things are moving, across all verticals, across all domains, people are starting to dive into AI. But the specialization of it—an example would be, even for us, chip design. So, being able to work with us and work with partners who are making foundation models for chip design, for industrial design. Coding is a huge one. You see all these apps out there that are code generators and coding apps. You see a proliferation of them because being able to verify that domain is so easy. You run it through a gym, and it wrote some code; you can verify it by, did it pass the unit tests, right? Did the code compile? There are so many ways to verify [how] that code was written. But in these expert domains, it's a little bit harder, because now these teams have to create the environment. They have to create their own verifier, so that's taking a little longer. But cybersecurity is another one: working with a partner who's able to create domain expertise and identify false positives in cybersecurity threats. So, there's all kinds of really fun use cases and domain specialization.
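The code-verification loop described above is easy to sketch (a toy harness with invented function names; real RL gym environments add sandboxing, timeouts, and full test suites on top of this idea):

```python
# Minimal "verifier" for generated code: execute a candidate
# implementation, then check it against a set of unit tests.
def verify(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # "did the code even run?"
        fn = namespace["solution"]              # assumed entry point
        return all(fn(*args) == expected        # "did it pass the tests?"
                   for args, expected in tests)
    except Exception:
        return False                            # syntax/runtime error: fail

good = "def solution(a, b):\n    return a + b"
bad = "def solution(a, b):\n    return a - b"
tests = [((1, 2), 3), ((0, 0), 0)]
print(verify(good, tests), verify(bad, tests))  # True False
```

This is why coding was such an easy domain to specialize, as noted above: the verifier is nearly free. In domains like cybersecurity, teams first have to build an equivalent of `tests` for their own problem before they can train against it.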

Ryan Donovan: I know some of your partners are big model providers themselves. Did you get any grumbling from them, or were they just excited to see what you were doing with it?

Kari Briski: Mostly, we've got really great relationships. I talk to them—the best feedback was on the data sets, because if you think about all of these data sets, it takes compute to do that synthetic data generation. So, it was really a gift for everyone. But a lot of the model builders themselves, they have their own stack. When we engage with them, it's about deeper optimization of scaling—making sure that they're getting GPU efficiency, GPU utilization, inference efficiency—and that comes lower in our stack. They love the data sets that we put out. They love our gym environments. We released all the gym environments that we created, as well. And by the way, none of that is usually open. A lot of these large model builders might have really accurate open models, but they don't release the environments in which they train, or the data. They might consider that secret sauce. So, anything that we can put out, people are picking up very eagerly, especially these gym environments.

Ryan Donovan: So, this is a space of great fast development. What's the future roadmap for Nemotron?

Kari Briski: We are open, right? So, we publish our roadmap, which is exciting. With our Nano, Super, and Ultra models, we released Nano V3 in December, we have Super releasing now in early February, and then Ultra will be releasing around April, which is just after our GTC event. So, we have these flagship GPU Technology Conferences, or GTCs, and our flagship is in San Jose. It's always in March. We're really excited about that, because it's not just about Nemotron; it's about all the open models and the open model providers that we work with, and having people adopt these libraries, the data sets, and the model architectures themselves. So, there's a lot to celebrate, and, like you said, it's a fast-moving landscape, so there's a lot of open models releasing at this time, and we're really excited to learn from each other.

Ryan Donovan: I'm glad there's a greater push for the openness on the models.

Kari Briski: Absolutely. It's worldwide R&D. You also get worldwide validation—I mentioned this architecture change of the hybrid model with the state space model plus transformer—so you get people validating that it works and that it is more efficient, and they're ready to adopt it. And then, you also get worldwide red teaming, so the people that are testing your model [are] providing all the feedback that you need to roll into the next release. And you asked me about our roadmap, and I'll just say, to demystify it a little bit, I'll reiterate that we believe this is a new type of software development platform, and as such, we're gonna release these models like libraries—because when you have systems of models, each one is like a library within that system. And that library needs to update, that library needs to be refreshed, that library needs to have bugs fixed. So, we're treating this just like a typical software development cycle, where we're taking these models, taking in the feedback, the bugs, the feature requests, training that in, and re-releasing them, just like any software library that NVIDIA provides.

Ryan Donovan: So, can people come and push PRs to the model?

Kari Briski: Not yet, but they will. The last leg of open source that we have to do is being able to put it out and then set things up so people can push new architecture into our model design—what we call the Plan of Record process. When will then be now? Soon.

Ryan Donovan: Alright everyone, it's that time of the show where we shout out somebody who came onto Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out a Populist badge winner—somebody who dropped an answer that was so good, it outscored the accepted answer. So, congrats to @The4thIceman for answering 'How to Center Text in Pygame.' If you're curious about that, we'll have the answer for you in the show notes. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have questions, concerns, comments, topics to cover, et cetera, et cetera, email me at podcast@stackoverflow.com, and if you wanna reach out to me directly, you can find me on LinkedIn.

Kari Briski: And I am Kari Briski, Vice President of generative AI software for Enterprise at Nvidia, and I love all things Nemotron, and you can find out more and love it just as much as I do by visiting Hugging Face, coming to Nvidia developer pages, and then, you can come and see me and all of the researchers and developers at GTC in March in San Jose.

Ryan Donovan: All right. Thank you for listening everyone, and we'll talk to you next time.
