
Even LLMs need education—quality data makes LLMs overperform

Hero image credit: Alexandra Francis

Almost every week we hear news about the amazing performance and ever improving capabilities of large language models (LLMs) when it comes to creating human-like code and text. But alongside those, we see breathtaking dollar amounts ascribed to the cost of training those LLMs: reports and speculations regularly quote numbers in the tens and hundreds of millions. Future models may eventually crack the billion-dollar mark. And if you want to secure a large supply of advanced chips to train AI, or plan to build your own hardware, rumors now put the price tag in the trillions.

For someone looking to implement GenAI features, those numbers can be pretty intimidating. Not everybody needs to train up a 60 billion-parameter LLM, sure, but even if you’re using these larger models as-is, deployment and inference costs will scale based on the number of parameters (in general—there are also complications around infrastructure and personnel costs required to self-host an LLM). If you’re building experimental GenAI features that haven’t proven their product market fit, you don’t want to commit to a model that runs up costs without a return on that investment.

Luckily, there’s an active area of research looking to create smaller models that perform better than bigger models on specific benchmarks. In this article, we’ll take a look at how researchers have been able to shrink LLMs while retaining intelligent performance, the methodology that allows small models to overperform, and use cases that don’t need bigger models.

How small can a model get?

We’ve seen new skills and behaviors emerge from LLMs as their parameter size grows, from understanding arithmetic to explaining jokes. But for the most basic LLM task, understanding and producing comprehensible language, what’s the smallest number of parameters and simplest model architecture that works consistently? Seven billion seems to be table stakes for useful LLMs, but is it possible to go smaller, maybe even into mere millions of parameters?

Researchers developed a data set of toddler-level stories called TinyStories that could be used to create models of less than ten million parameters that still produced comprehensible outputs. They trained a whole LLM from the ground up in a single day using only a single GPU, probably less than $100 worth of compute time. The stories it produced were grammatically correct, maintained consistency, and showed reasoning. It’s a good demonstration of how small an LLM can get while still being coherent.
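
For a sense of the scale involved, here’s a minimal sketch using the Hugging Face transformers library to configure a GPT-2-style model in that size range. The hyperparameters below are illustrative guesses, not the exact TinyStories architecture; the point is just how few parameters a small embedding dimension and a handful of layers produce.

```python
# Sketch: a GPT-2-style model small enough to land well under ten million
# parameters. Hyperparameters are illustrative, not the TinyStories setup.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=8_000,   # a small tokenizer vocabulary
    n_positions=512,    # short context is plenty for toddler-level stories
    n_embd=128,         # embedding dimension
    n_layer=4,          # number of transformer layers
    n_head=4,           # attention heads per layer
)

model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")  # roughly 2 million
```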

That’s not to say that we should all be rushing out to implement the smallest possible model. Producing coherent text is one thing; models gain significant creativity as they get bigger. Don’t expect the tiny models to produce those limericks about your favorite search engine. But depending on your use case, you may not need the additional creativity of those beefier models. Maybe you just need summarization and retrieval.

The researchers found that embedding dimension and number of layers ended up being the most impactful factors for overall performance. They also agreed with previous research indicating “there is a polynomial scaling law between model size and learning budget for LLMs.” That research found that performance (measured against various benchmarks) scales smoothly on a power-law basis with the size of the dataset, the number of model parameters, and the total compute used to train the model. Those variables need to scale together: when they fall out of balance, model trainers may be training on too few tokens for the amount of compute that they use.
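
In rough form, that power-law relationship can be sketched as below, where L is test loss, N is parameter count, D is dataset size in tokens, C is training compute, and the constants and exponents are empirical fits whose exact values vary by study. Each relation holds when the other two resources aren’t the bottleneck.

```latex
% Sketch of the power-law scaling form; constants and exponents are empirical fits.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

The practical upshot is that loss keeps improving smoothly as you scale any one of the three, but only if the other two keep pace.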

There’s one caveat with that previous research: the researchers used large general text datasets like WebText or MassiveText, which focus on grabbing as much publicly-accessible web data as possible to provide tokens to their models. In the next section, we’ll find that model researchers have learned that being a little more discerning with your data can help your models overperform against larger models.

Good data lets models overperform

Following on the TinyStories research, a group from Microsoft sought to create a targeted dataset for a model that performed really well on a specific task. They created a model optimized to write Python functions from docstrings, phi-1, trained on a synthetic Python textbook and exercises with answers. The trained and tuned model has 1.5B parameters and attains 50.6% pass@1 accuracy on HumanEval for Python coding, matching the performance of models with ten times the parameter count.
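
For context, pass@1 comes from the HumanEval evaluation protocol: generate n completions per problem, run the unit tests, count the c completions that pass, and estimate the probability that at least one of k sampled completions would pass. Here’s a small sketch of the commonly used unbiased estimator (the example numbers are made up):

```python
# Unbiased pass@k estimator for a single HumanEval-style problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct,
    given n generated samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 101 passing -> pass@1 = 0.505
print(pass_at_k(n=200, c=101, k=1))
```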

Interestingly, the Microsoft team created the textbook by prompting GPT-3.5 to create topics that would promote reasoning and algorithmic skills. Simply asking GPT to create a textbook would likely produce a lot of pretty similar content, so they also injected random words into the prompts to create diversity in the content.
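
Here’s a hedged sketch of what that kind of random-word injection can look like. The template and word list are made up for illustration; the phi-1 team’s actual prompts aren’t public, so treat this as the general idea rather than their method.

```python
import random

# Hypothetical seed vocabulary; purely illustrative.
SEED_WORDS = ["gravity", "ledger", "orchard", "queue", "tempo", "harbor"]

TEMPLATE = (
    "Write a short textbook section that teaches a Python concept requiring "
    "reasoning and algorithmic thinking. Work these unrelated words naturally "
    "into the examples: {words}."
)

def make_prompts(num_prompts: int, words_per_prompt: int = 3) -> list[str]:
    """Embed a different random word combination in each prompt, nudging the
    generating model away from producing near-duplicate textbook sections."""
    prompts = []
    for _ in range(num_prompts):
        words = ", ".join(random.sample(SEED_WORDS, words_per_prompt))
        prompts.append(TEMPLATE.format(words=words))
    return prompts

for prompt in make_prompts(3):
    print(prompt)
```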

Focused data, even when produced by another LLM, can train a model to punch above its weight for a fraction of the cost. Training took four days on eight A100s, which I estimate cost between $1500 and $3000 (depending on the cloud provider). As the researchers say, “We conjecture that language models would benefit from a training set that has the same qualities as a good ‘textbook’: it should be clear, self-contained, instructive, and balanced.”

For their v2 model, Microsoft researchers went bigger to create a general-purpose language model. Their newer model, phi-2, has 2.7B parameters, well under what some of the state-of-the-art LLMs have but still double phi-1’s count. Their training data once again included synthetic data sets, but these were geared to teach general knowledge, science topics, theory of mind, and others, as well as a curated set of web resources. Training took a good bit longer and cost more (14 days on 96 A100 GPUs for between $65k and $130k), but for a model that performs as well as (or better than) existing open-source models, that’s a bargain.
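
As a sanity check on those figures, here’s the back-of-envelope math, assuming on-demand A100 pricing of roughly $2 to $4 per GPU-hour (rates vary widely by provider and commitment level):

```python
def training_cost(gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Back-of-envelope cloud cost: total GPU-hours times an hourly rate."""
    return gpus * days * 24 * usd_per_gpu_hour

# phi-1: 8 A100s for 4 days
print(training_cost(8, 4, 2.0), "to", training_cost(8, 4, 4.0))      # ~$1,536 to ~$3,072
# phi-2: 96 A100s for 14 days
print(training_cost(96, 14, 2.0), "to", training_cost(96, 14, 4.0))  # ~$64,512 to ~$129,024
```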

One of Microsoft’s key insights here was the value of quality, targeted data designed to teach an LLM specific topics and domains. Like any student, LLMs need a good source text to produce good outputs. As Satish Jayanthi, CTO and co-founder of Coalesce, told us, “If there were LLMs in the 1700s, and we asked ChatGPT back then whether the earth is round or flat and ChatGPT said it was flat, that would be because that's what we fed it to believe as the truth. What we give and share with an LLM and how we train it will influence the output.”

Organizations that operate in specialized domains will likely need to train or fine-tune LLMs on specialized data that teaches those models how to understand that domain. Here at Stack Overflow, we’re working with our Teams customers to incorporate their internal data into GenAI systems. When Intuit was ramping up their GenAI program, they knew that they needed to train their own LLMs to work effectively in financial domains that use tons of specialized language. And IBM, in creating an enterprise-ready GenAI platform in watsonx, made sure to create multiple domain-aware models for code, geospatial data, IT events, and molecules.

Smaller, targeted LLMs not only provide more bang for the buck on training costs, but they are also cheaper to run inference and fine-tuning on. If you want resource and cost efficiency and don’t need the creativity and comprehensiveness of a massive model, you might do better by selecting an LLM with fewer parameters. For many folks, that application is retrieval-augmented generation (RAG), which generally doesn’t require the extra language understanding that comes with the massive LLMs.

All models excel at RAG and search

For nearly twenty years, tech companies have taken British mathematician Clive Humby’s phrase “data is the new oil” as the impetus to gather proprietary data to find insights. Now LLMs are using that data to create impressive GenAI applications. But plenty of people still worry about the LLM tendency to hallucinate or confabulate, and have turned to RAG paradigms to ensure that LLMs produce responses rooted in verified information, not statistical anomalies.

The way a RAG system works, according to Manny Silva at Skyflow, is by “pairing information retrieval with a set of carefully designed system prompts to anchor LLMs on precise, up-to-date, and pertinent information retrieved from an external knowledge store.” The information retrieval portion here is semantic search, which uses embeddings but not necessarily an LLM. Many RAG systems will use LLMs for summarization and/or reranking of results, which are emergent properties that many LLMs develop, regardless of size. You could even try open-source LLMs trained to summarize text.
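
Here’s a minimal sketch of that pattern, assuming the sentence-transformers library for the embedding and semantic search step. The documents, model name, and prompt wording are illustrative; the generation step is left to whichever LLM you deploy, which is exactly where a smaller model can slot in.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy "external knowledge store"; in production this would be a vector database.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm ET, Monday through Friday.",
    "Enterprise plans include single sign-on and audit logs.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Semantic search: cosine similarity between query and document embeddings."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

def build_prompt(query: str) -> str:
    """Anchor the LLM on retrieved context and ask it to answer only from that context."""
    context = "\n".join(retrieve(query))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hand build_prompt(user_question) to whichever LLM you deploy,
# e.g. a smaller open-source model tuned for summarization.
print(build_prompt("Can I return a product after three weeks?"))
```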

A smaller, well-trained LLM in a RAG system will squeeze out more performance for your money. However, the data you use as your external knowledge store still needs to be high-quality. Chinese researchers found that LLMs used as part of RAG systems can still stumble in four ways:

  • Filtering noise: LLMs can sometimes fail to filter out retrieved information that is related to the query but not precisely correct.
  • Rejecting incomplete answers: LLMs might provide an answer when they should instead acknowledge they lack enough information to do so.
  • Integrating across documents: LLMs may not be able to provide answers that require retrieving from multiple documents.
  • Identifying wrong answers: LLMs may struggle when the source information is contradictory.

As always with data, it’s garbage in, garbage out. But good data lets your GenAI applications operate more efficiently. You could even have the best of both worlds by using an LLM in a RAG system while also training that LLM on your vector data. You would ensure that your model fully understands the data while backing every answer with sources. The only reason not to do this is if you want your GenAI application to forget information as it becomes outdated.

Textbooks exist for a reason

If you were to ask someone to learn how to build a rocket ship just by searching the internet, you’d likely not have great results. Sure, there may be some good resources and communities that *ahem* get you off the ground. But there’s also a lot of cruft out there—anyone can put something on the internet and there’s nobody to vet it.

If you instead gave someone a textbook on rocketry, they’d at least know how to start, what the concepts are, and how to move towards an answer. Give them coursework—textbooks, experts, and exercises—vetted and designed to convey the scope of the domain, and maybe you’ll get somewhere. Curated data beats a random dump any day.

The same goes for LLMs. If you want them to respond with accurate, cogent, and useful information, you need to give them accurate, cogent, and useful data that teaches them to understand the domain—a textbook, if you will. Many LLMs that understand programming are trained on the curated and vetted data that our users have created on Stack Overflow.

When it comes time to train your LLM, whether in pre-training or fine-tuning, don’t think of the data you’re feeding it as an infodump. Think of it as a textbook. What information would a person need to fully understand the domain? Give that to your LLM. A better education improves a machine learner just the same as it does human learners.
