From training to inference: The new role of web data in LLMs
Data has always been key to LLM success, but it's becoming key to inference-time performance as well.

Stack Overflow CEO Prashanth Chandrasekar sat down with Ryan at HumanX 2025 to talk about how Stack is integrating AI into its public platform, the enormous importance of a high-quality knowledge base in your AI journey, how AI tools are empowering junior developers to build better software, and much more.
Jeremy “Jezz” Kellway, VP of Engineering for Analytics and Data & AI at EDB (Enterprise Database), joins Ryan for a conversation about Postgres and AI. They unpack how Postgres is becoming the standard database for AI applications, the importance of managing unstructured data, and the implications of data sovereignty and governance in AI.
Minh Nguyen, VP of Engineering at Transcend, joins Ryan for a conversation about the complexities of privacy and consent in tech, from the challenges organizations face in managing data privacy to the importance of consent management tools to the evolving landscape of privacy regulations.
Ken Stott, Field CTO of API platform Hasura, tells Ryan about the data doom loop: the concept that organizations are spending lots of money on data systems without seeing improvements in data quality or efficiency.
Ben and Ryan sit down with public interest technologist Sukhi Gulati Gilbert, a senior product manager at Consumer Reports, for a conversation about digital data privacy. They talk about why digital privacy matters, the challenges consumers face in safeguarding their data, and the legislative gaps in privacy protection, along with the app Sukhi is working on, Permission Slip, which helps users exercise their rights to digital data privacy. Plus: Why it might be worth reducing your digital footprint.
Ben and Ryan are joined by Matt Zeiler, founder and CEO of Clarifai, an AI workflow orchestration platform. They talk about how the transformer architecture supplanted convolutional neural networks in AI applications, the infrastructure required for AI implementation, the implications of regulating AI, and the value of synthetic data.
Or Lenchner, CEO of Bright Data, joins Ben and Ryan for a deep-dive conversation about the evolving landscape of web data. They talk through the challenges involved in data collection, the role of synthetic data in training large AI models, and how public data access is becoming more restrictive. Or also shares his thoughts on the importance of transparency in data practices, the likely future of data regulation, and the philosophical implications of more people using AI to innovate and solve problems.
Ben chats with Shayne Longpre and Robert Mahari of the Data Provenance Initiative about what GenAI means for the data commons. They discuss the decline of public datasets, the complexities of fair use in AI training, the challenges researchers face in accessing data, potential applications for synthetic data, and the evolving legal landscape surrounding AI and copyright.
In this episode, Ben interviews Jannis Kallinikos, a professor at Luiss University in Rome, Italy, about his new book Data Rules: Reinventing the Market Economy, coauthored with Cristina Alaimo. They discuss the social impact of data, explore the idea that data filters how we see the world and interact with each other, and highlight the need for social accountability in data tracking and surveillance.
With the ever-increasing importance of data, we’re always looking for expert voices that can expand our view of what data and our reliance on data means for software development and society as a whole. More and more of our lives are becoming data-driven. Is that a good thing?
On this episode: Stack Overflow senior data scientist Michael Geden tells Ryan and Ben about how data scientists evaluate large language models (LLMs) and their output. They cover the challenges involved in evaluating LLMs, how LLMs are being used to evaluate other LLMs, the importance of data validation, the need for human raters, and the tradeoffs involved in selecting and fine-tuning LLMs.
If you’re building experimental GenAI features that haven’t proven their product market fit, you don’t want to commit to a model that runs up costs without a return on that investment.
AI systems obey the golden rule: garbage in, garbage out. Want good results? Feed them good data.
If we can make operational data easier to manage and easier to access through simple, standardized APIs, everyone can transform their companies into sustainable data-driven organizations.
Tim Tutt, CEO and cofounder of Night Shift Development, tells the home team about his work in deploying large-scale search and discovery analytics, why he’s working to help nontechnical users understand and utilize their business data, and how GenAI is teaching people to ask better questions.
Machine learning uses data structures that don't always resemble the ones used in standard computing. You'll need to process your data first if you want efficient machine learning.
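As a minimal sketch of the kind of preprocessing that piece is about (the records and columns here are hypothetical, not from the article), here's how mixed tabular data might be turned into the dense numeric arrays most ML libraries expect:

```python
# A minimal sketch: converting tabular records into a numeric
# feature matrix. The "age"/"plan" schema is purely illustrative.
import numpy as np

records = [
    {"age": 34, "plan": "pro"},
    {"age": 27, "plan": "free"},
    {"age": 45, "plan": "pro"},
]

# Categorical values become one-hot columns; numbers pass through.
plans = sorted({r["plan"] for r in records})
features = np.array(
    [[r["age"]] + [1.0 if r["plan"] == p else 0.0 for p in plans]
     for r in records]
)
print(features)  # shape (3, 3): age, plan=free, plan=pro
```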
Bigeye cofounders Kyle Kirwan (CEO) and Egor Gryaznov (CTO) join the home team to discuss their data observability platform, what it’s like to go from coworkers to cofounders, and the surprising value of boring technology.
Distributed work may hold the key to creating forward-thinking metropolises.
When APIs send data, chances are they send it as JSON objects. Here's a primer on why JSON is how networked applications send data.
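For a concrete sense of what that looks like on the wire, here's a quick sketch using only Python's standard library: native data structures serialize to a compact, human-readable string and deserialize back losslessly, which is a big part of JSON's appeal for APIs.

```python
# JSON round trip: what an API sends, and what the receiver gets back.
import json

payload = {"user": "ada", "roles": ["admin", "editor"], "active": True}

wire = json.dumps(payload)  # the string actually sent over the network
print(wire)                 # {"user": "ada", "roles": ["admin", "editor"], "active": true}

assert json.loads(wire) == payload  # the receiver reconstructs the same object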
As with any good joke, the most important part is the resulting data.
As May is Mental Health Awareness Month, we wanted to see what developers are doing to decrease stress and prioritize their own wellness. Earlier this year, we surveyed over 800 developers to see if they are happy at work and what they are doing to maintain or improve mental health.
When the Log4j security issue was disclosed, developers came looking for answers. We took a look at our site data around it.
At this point, most software engineers see the value of testing their software regularly. But are you testing your data engineering as well?
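As one hedged example of what a data test can look like (the "orders" schema below is hypothetical, not from the article), a pipeline check might assert basic invariants before rows move downstream:

```python
# A minimal data-quality test; the order fields are illustrative only.
def test_orders_are_clean(rows):
    """Assert basic invariants before rows move downstream."""
    ids = [r["order_id"] for r in rows]
    assert len(ids) == len(set(ids)), "duplicate order_id"
    assert all(r["amount"] >= 0 for r in rows), "negative amount"
    assert all(r.get("customer_id") for r in rows), "missing customer_id"

test_orders_are_clean([
    {"order_id": 1, "amount": 19.99, "customer_id": "c-42"},
    {"order_id": 2, "amount": 0.00, "customer_id": "c-7"},
])
```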