
Best practices for building LLMs

Intuit shares what they've learned building multiple LLMs for their generative AI operating system.

Generative AI has grown from an interesting research topic into an industry-changing technology. Many companies are racing to integrate GenAI features into their products and engineering workflows, but the process is more complicated than it might seem. Successfully integrating GenAI requires having the right large language model (LLM) in place. While LLMs continue to evolve and multiply, the one that best suits a given organization's use case may not actually exist out of the box.

At Intuit, we’re always looking for ways to accelerate development velocity so we can get products and features in the hands of our customers as quickly as possible. Back in November 2022, we submitted a proposal for our Analytics, AI and Data (A2D) organization's AI innovation papers program, proposing that Intuit build a customized in-house language model to close the gap between what off-the-shelf models could provide and what we actually needed to serve our customers accurately and effectively. That effort was part of a larger push to produce effective tools more flexibly and more quickly, an initiative that eventually resulted in GenOS, a full-blown operating system to support the responsible development of GenAI-powered features across our technology platform.

For each use case, we carefully evaluate where off-the-shelf models would perform well and where investing in a custom LLM might be the better option. For tasks that are open-domain or close to an LLM's existing capabilities, we first investigate prompt engineering, few-shot examples, retrieval-augmented generation (RAG), and other methods that enhance the capabilities of LLMs out of the box. When those aren't enough and we need something more specific and accurate, we invest in training a custom model on knowledge related to Intuit's domains of expertise in consumer and small business tax and accounting. As a general rule of thumb, we suggest starting by evaluating existing models configured via prompts (zero-shot or few-shot) and understanding whether they meet the requirements of the use case before moving to custom LLMs as the next step.
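To make that first step concrete, here is a minimal sketch of what "evaluate an existing model via prompts" can look like in practice: a few-shot prompt run against a small labeled sample from the use case. The SDK calls and model name are illustrative assumptions, not a description of our internal tooling.

```python
# Minimal sketch (not our internal tooling): probe an off-the-shelf model with a
# few-shot prompt on a small labeled sample before deciding whether to go custom.
# Assumes the OpenAI Python SDK (>= 1.0); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    {"role": "system", "content": "Classify each question as TAX, PAYROLL, or OTHER."},
    {"role": "user", "content": "Can I deduct my home office?"},
    {"role": "assistant", "content": "TAX"},
    {"role": "user", "content": "How do I run payroll for a new hire?"},
    {"role": "assistant", "content": "PAYROLL"},
]

# A tiny labeled sample drawn from the real use case.
eval_set = [
    ("Is mileage to client sites deductible?", "TAX"),
    ("When is the direct deposit cutoff?", "PAYROLL"),
]

correct = 0
for question, expected in eval_set:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute the model under evaluation
        messages=FEW_SHOT + [{"role": "user", "content": question}],
        temperature=0,
    )
    correct += resp.choices[0].message.content.strip() == expected

print(f"Few-shot accuracy: {correct / len(eval_set):.0%}")
```

If numbers like these fall short of what your domain experts require, that is the signal to consider a custom model.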

In the rest of this article, we discuss fine-tuning LLMs and scenarios where it can be a powerful tool. We also share some best practices and lessons learned from our first-hand experiences with building, iterating, and implementing custom LLMs within an enterprise software development organization.

You don’t have to start from scratch

In our experience, the language capabilities of existing, pre-trained models can actually be well-suited to many use cases. The problem is figuring out what to do when pre-trained models fall short. One option is to custom build a new LLM from scratch. While this is an attractive option, as it gives enterprises full control over the LLM being built, it is a significant investment of time, effort, and money, requiring both infrastructure and engineering expertise. We have found that fine-tuning an existing model on the type of data we need is a viable alternative.

As a general rule, fine-tuning is much faster and cheaper than building a new LLM from scratch. With pre-trained LLMs, a lot of the heavy lifting has already been done. Open-source models that deliver accurate results and have been well-received by the development community alleviate the need to pre-train your model or reinvent your tech stack. Instead, you may need to spend a little time with the documentation that’s already out there, at which point you will be able to experiment with the model as well as fine-tune it.
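As a rough illustration of how little scaffolding this can require, here is a hedged sketch of fine-tuning an open-source base model with the Hugging Face libraries. The base model and data file are placeholders, not the models or data we actually use.

```python
# Hedged sketch: fine-tune an open-source causal LM on domain data with the
# Hugging Face libraries. Model name and data path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-410m"  # small open model, chosen only for illustration
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# One JSON record per example, e.g. {"text": "Q: Can I deduct ...? A: ..."}
data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
data = data.map(lambda r: tok(r["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
trainer.save_model("ft-out")
```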

Not all LLMs are built equally, however. As with any development technology, the quality of the output depends greatly on the quality of the data on which an LLM is trained. Evaluating models based on what they contain and what answers they provide is critical. Remember that generative models are a new technology, and open-source models may come with important safety considerations that you should evaluate. We work with various stakeholders, including our legal, privacy, and security partners, to evaluate the potential risks of the commercial and open-source models we use, and you should consider doing the same. These considerations around data, performance, and safety inform our decision between training a model from scratch and fine-tuning an existing LLM.

Good data creates good models

Because fine-tuning will be the primary method that most organizations use to create their own LLMs, the data used to tune is a critical success factor. We clearly see that teams with more experience pre-processing and filtering data produce better LLMs. As everybody knows, clean, high-quality data is key to machine learning. That goes double for LLMs. LLMs are very suggestible—if you give them bad data, you’ll get bad results.

If you want to create a good LLM, you need to use high-quality data. The challenge is defining what “high-quality data” actually is. Since we’re using LLMs to provide specific information, we start by looking at the results LLMs produce. If those results match the standards we expect from our own human domain experts (analysts, tax experts, product experts, etc.), we can be confident the data they’ve been trained on is sound.

Working closely with customers and domain experts, understanding their problems and perspective, and building robust evaluations that correlate with actual KPIs helps everyone trust both the training data and the LLM. It’s important to verify performance on a case-by-case basis. One of the ways we collect this type of information is through a tradition we call “Follow-Me-Homes,” where we sit down with our end customers, listen to their pain points, and observe how they use our products. In this case, we follow our internal customers—the domain experts who will ultimately judge whether an LLM response meets their needs—and show them various example responses and data samples to get their feedback. We’ve developed this process so we can repeat it iteratively to create increasingly high-quality datasets.

Obviously, you can’t evaluate everything manually if you want to operate at any kind of scale. We’ve developed ways to automate the process by distilling the learnings from our experts into criteria we can then apply to a set of LLMs so we can evaluate their performance against one another for a given set of use cases. This type of automation makes it possible to quickly fine-tune and evaluate a new model in a way that immediately gives a strong signal as to the quality of the data it contains. For instance, there are papers that show GPT-4 is as good as humans at annotating data, but we found that its accuracy dropped once we moved away from generic content and onto our specific use cases. By incorporating the feedback and criteria we received from the experts, we managed to fine-tune GPT-4 in a way that significantly increased its annotation quality for our purposes.
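To give a flavor of what "distilling expert learnings into criteria" can look like, here is an illustrative sketch: each criterion is a small programmatic check derived from expert feedback, and candidate models are scored against the same set. The specific criteria are invented for the example, not our actual rubric.

```python
# Illustrative sketch only: expert feedback distilled into automated checks that
# score candidate models on the same question set. Criteria are invented examples.
import re
from typing import Callable, Dict, List

Criterion = Callable[[str, str], bool]  # (question, answer) -> passed?

CRITERIA: Dict[str, Criterion] = {
    "mentions_tax_year": lambda q, a: bool(re.search(r"\b20\d{2}\b", a)),
    "no_boilerplate": lambda q, a: "as an ai" not in a.lower(),
    "substantive_answer": lambda q, a: len(a.split()) > 10,
}

def score_response(question: str, answer: str) -> float:
    """Fraction of expert-derived criteria a single response satisfies."""
    return sum(c(question, answer) for c in CRITERIA.values()) / len(CRITERIA)

def compare_models(questions: List[str],
                   models: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Average criteria score per candidate model over the same questions."""
    return {name: sum(score_response(q, gen(q)) for q in questions) / len(questions)
            for name, gen in models.items()}
```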

Balance model size with resource costs

Although it’s important to have the capacity to customize LLMs, it’s probably not going to be cost effective to produce a custom LLM for every use case that comes along. Anytime we look to implement GenAI features, we have to balance the size of the model with the costs of deploying and querying it. The resources needed to fine-tune a model are just part of that larger equation.

The criteria for an LLM in production revolve around cost, speed, and accuracy. Response times grow roughly in line with a model's size (measured by number of parameters), so smaller models respond faster. To make our models efficient, we try to use the smallest possible base model and fine-tune it to improve its accuracy. We can think of the cost of a custom LLM as the resources required to produce it, amortized over the value of the tools or use cases it supports. So while there's value in being able to fine-tune models of different sizes on the same use case data and experiment rapidly and cheaply, that experimentation won't be effective without a clearly defined use case and set of requirements for the model in production.
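As a back-of-the-envelope illustration of that amortization (with invented numbers), you can estimate how quickly a fine-tuned small model pays for itself relative to calling a larger hosted model:

```python
# Back-of-the-envelope sketch with invented numbers: when does a fine-tuned small
# model pay for itself compared with calling a larger hosted model per query?
fine_tune_cost = 5_000.0   # one-time training and engineering cost ($)
small_cost_per_1k = 0.50   # serving cost of the fine-tuned model ($ per 1k queries)
large_cost_per_1k = 15.00  # cost of the bigger off-the-shelf model ($ per 1k queries)
monthly_queries = 2_000_000

monthly_small = small_cost_per_1k * monthly_queries / 1_000
monthly_large = large_cost_per_1k * monthly_queries / 1_000
breakeven_months = fine_tune_cost / (monthly_large - monthly_small)

print(f"Fine-tuning pays for itself in about {breakeven_months:.1f} months")
```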

Sometimes, people come to us with a very clear idea of a highly domain-specific model they want, then are surprised at the quality of results we get from smaller, broader-use LLMs. We used to have to train individual models (like Bidirectional Encoder Representations from Transformers, or BERT) for each task, but in this new era of LLMs, we are seeing models that can handle a variety of tasks very well, even without having seen those tasks before. From a technical perspective, it's often reasonable to fold as many data sources and use cases as possible into a single fine-tuned model. Once you have a pipeline and an intelligently designed architecture, it's simple to fine-tune both a master model and individual custom models and see which performs better, provided the extra effort is justified by the considerations mentioned above.

The advantage of unified models is that you can deploy them to support multiple tools or use cases. But you have to be careful to ensure the training dataset accurately represents the diversity of each individual task the model will support. If one is underrepresented, then it might not perform as well as the others within that unified model. Concepts and data from other tasks may pollute those responses. But with good representations of task diversity and/or clear divisions in the prompts that trigger them, a single model can easily do it all.
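One simple way to keep those divisions clear is to tag every training example with its task and balance the mix so no task is underrepresented. The sketch below uses assumed formatting and upsampling conventions, not our actual pipeline:

```python
# Sketch with assumed conventions (not our pipeline): tag every training example
# with its task and upsample smaller tasks so none is underrepresented.
import random

def to_example(task: str, prompt: str, answer: str) -> dict:
    # An explicit task tag keeps a clear division between use cases in one model.
    return {"text": f"[task: {task}]\n{prompt}\n### Answer:\n{answer}"}

def balance(examples_by_task: dict) -> list:
    """Upsample each task to the size of the largest one, then shuffle the mix."""
    target = max(len(rows) for rows in examples_by_task.values())
    mixed = []
    for rows in examples_by_task.values():
        mixed.extend(random.choices(rows, k=target))  # sample with replacement
    random.shuffle(mixed)
    return mixed
```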

We use evaluation frameworks to guide decision-making on the size and scope of models. For accuracy, we use EleutherAI's Language Model Evaluation Harness, which quizzes the LLM with multiple-choice questions. This gives us a quick signal as to whether the LLM can get the right answer, and multiple runs give us a window into the model's inner workings, provided we're using an in-house model where we have access to model probabilities.
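For reference, the harness can be run programmatically along these lines (interface as of its 0.4.x releases; check the project README for your version, and note the checkpoint and task names here are placeholders):

```python
# Hedged sketch: running EleutherAI's lm-evaluation-harness programmatically.
# Interface as of the 0.4.x releases; checkpoint and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-410m",  # placeholder checkpoint
    tasks=["arc_easy", "hellaswag"],                 # multiple-choice tasks
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy and related metrics
```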

We augment those results with an open-source tool called MT Bench (Multi-Turn Benchmark), which automates a simulated multi-turn chat with a user and uses another LLM as a judge. So you could use a larger, more expensive LLM to judge responses from a smaller one. We use the results from these evaluations to avoid deploying a large model where a much smaller, cheaper one would have produced perfectly good results.
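Stripped down well past what MT Bench actually does, the core LLM-as-judge idea looks roughly like this; the judge model name is a placeholder, and the real benchmark lives in the FastChat project:

```python
# Heavily simplified LLM-as-judge sketch in the spirit of MT Bench; the judge
# model is a placeholder, not a prescription.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are judging two assistant answers to the same question.
Question: {question}
Answer A (small model): {a}
Answer B (large model): {b}
Reply with exactly one word: A, B, or TIE."""

def judge(question: str, small_answer: str, large_answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # larger, more expensive model used only as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, a=small_answer, b=large_answer)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

If the small model wins or ties on most prompts, the cheaper model is usually the one to deploy.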

Of course, there can be legal, regulatory, or business reasons to separate models. Data privacy rules—whether regulated by law or enforced by internal controls—may restrict the data able to be used in specific LLMs and by whom. There may be reasons to split models to avoid cross-contamination of domain-specific language, which is one of the reasons why we decided to create our own model in the first place.

We think that having a diverse set of LLMs available makes for better, more focused applications, so the final decision point on balancing accuracy and costs comes at query time. While each of our internal Intuit customers can choose any of these models, we recommend that they enable several different LLMs. Just as a service-oriented architecture might route across datacenter locations and cloud providers, we recommend a heuristic-based or automated way to route query traffic across models, so that each request lands on the custom model that provides an optimal experience while minimizing latency and costs.
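A heuristic router can be as simple as the sketch below; the model names and keyword rules are invented for illustration, not our production routing logic:

```python
# Illustrative router sketch, not our production routing logic: pick a model per
# query from simple keyword heuristics, with a cheap general model as fallback.
from dataclasses import dataclass

@dataclass
class Route:
    model: str            # deployment name (placeholder values below)
    max_latency_ms: int   # latency budget for this route

ROUTES = {
    "tax":     Route("custom-tax-llm", 800),
    "payroll": Route("custom-payroll-llm", 800),
    "general": Route("small-general-llm", 400),
}

def pick_route(query: str) -> Route:
    q = query.lower()
    if any(w in q for w in ("deduct", "1099", "w-2", "tax")):
        return ROUTES["tax"]
    if any(w in q for w in ("payroll", "pay stub", "direct deposit")):
        return ROUTES["payroll"]
    return ROUTES["general"]
```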

No LLM is ever final

Your work on an LLM doesn’t stop once it makes its way into production. Model drift—where an LLM becomes less accurate over time as concepts shift in the real world—will affect the accuracy of results. For example, tax codes change every year, and our models have to account for those changes when they help calculate taxes. If you want to use LLMs in product features over time, you’ll need to figure out an update strategy.

The sweet spot for updates is an approach that keeps costs down and limits duplicated effort from one version to the next. In some cases, we find it more cost-effective to train or fine-tune a base model from scratch for every updated version, rather than building on previous versions. For LLMs based on data that changes over time, this is ideal: the current “fresh” version of the data is the only material in the training data. For other LLMs, changes in data can be additions, removals, or updates. Fine-tuning from scratch on top of the chosen base model avoids complicated re-tuning and lets us check weights and biases against previous data.

Training or fine-tuning from scratch also helps us scale this process. Every data source has a designated data steward. Whenever they are ready to update, they delete the old data and upload the new. Our pipeline picks that up, builds an updated version of the LLM, and gets it into production within a few hours without needing to involve a data scientist.
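In spirit, the trigger is nothing more exotic than "rebuild when the steward's data is newer than the last build." A minimal sketch, with hypothetical paths and the actual fine-tuning job abstracted behind a callable:

```python
# Minimal sketch of the "delete old data, upload new, rebuild" trigger. Paths are
# hypothetical, and the fine-tuning job itself is passed in as a callable.
from pathlib import Path
from typing import Callable

def rebuild_if_updated(data_dir: Path, marker: Path,
                       retune: Callable[[Path], None]) -> None:
    """Re-run the fine-tuning job whenever the steward's data is newer than the last build."""
    newest = max((p.stat().st_mtime for p in data_dir.glob("*.jsonl")), default=0.0)
    if not marker.exists() or newest > marker.stat().st_mtime:
        retune(data_dir)  # full fine-tune from scratch on the fresh data only
        marker.touch()    # record that this snapshot has been built and shipped

# e.g. rebuild_if_updated(Path("steward_uploads"), Path(".last_build"), my_finetune_job)
# where my_finetune_job is whatever training entry point your pipeline exposes.
```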

When fine-tuning, doing it from scratch with a good pipeline is probably the best option for updating proprietary or domain-specific LLMs. However, removing or updating information inside existing LLMs is an active area of research, sometimes referred to as machine unlearning or concept erasure. If you have foundational LLMs trained on large amounts of raw internet data, some of the information in there is likely to have grown stale, and starting from scratch isn’t always an option. From what we’ve seen, doing this right involves fine-tuning the LLM with a unique set of instructions, for example instructions that change based on the task or on properties of the data such as length, so that the model adapts to the new data.

You can also combine custom LLMs with RAG to provide domain-aware GenAI that cites its sources. This approach offers the best of both worlds: you can retrieve up-to-date data at query time and also train or fine-tune on it, which greatly reduces the chances of a wrong or outdated answer making it into a response.
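A bare-bones version of the retrieval half might look like the sketch below (the embedding model and passages are stand-ins); the resulting prompt is then handed to the fine-tuned model for generation:

```python
# Bare-bones retrieval sketch; embedding model and passages are stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
docs = [
    "Placeholder passage: current-year update to standard deduction rules.",
    "Placeholder passage: payroll filing deadlines for the new quarter.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list:
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]  # cosine-similarity ranking
    return [docs[i] for i in top]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (f"Answer using only the context below and cite it.\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```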

LLMs are still a very new technology under heavy, active research and development. Nobody really knows where we’ll be in five years—whether we’ll hit a ceiling on scale and model size, or whether the technology will keep improving rapidly. But if you have a rapid prototyping infrastructure and an evaluation framework in place that feeds back into your data, you’ll be well-positioned to bring things up to date whenever new developments come around.

Incorporating GenAI effectively

LLMs are a key element in developing GenAI applications. Every application has a different flavor, but the basic underpinnings of those applications overlap. To be efficient as you develop them, you need to find ways to keep developers and engineers from reinventing the wheel as they produce responsible, accurate, and responsive applications. We’ve developed GenOS as a framework for accomplishing this work: it provides a suite of tools that help developers match applications with the right LLMs for the job, and it adds protections to keep our customers safe, including controls to help enhance safety, privacy, and security. Here at Intuit, we safeguard customer data and protect privacy using industry-leading technology and practices, and we adhere to responsible AI principles that guide how our company operates and scales our AI-driven expert platform with our customers' best interests in mind.

Ultimately, what works best for a given use case has to do with the nature of the business and the needs of the customer. As the number of use cases you support rises, the number of LLMs you’ll need to support those use cases will likely rise as well. There is no one-size-fits-all solution, so the more help you can give developers and engineers as they compare LLMs and deploy them, the easier it will be for them to produce accurate results quickly.

It’s no small feat for any company to evaluate LLMs, develop custom LLMs as needed, and keep them updated over time—while also maintaining safety, data privacy, and security standards. As we have outlined in this article, there is a principled approach one can follow to ensure this is done right and done well. Hopefully, you’ll find our firsthand experiences and lessons learned within an enterprise software development organization useful, wherever you are on your own GenAI journey.

You can follow along on our journey and learn more about Intuit technology here.
