As generative AI technologies become more integrated into our software products and workflows, those products and workflows start to look more and more like the LLMs themselves. They become less reliable, less deterministic, and occasionally wrong. LLMs are fundamentally non-deterministic, which means the same input can produce different responses. If you’re using reasoning models and AI agents, those errors can compound as earlier mistakes feed into later steps.
“Ultimately, any kind of probabilistic model is sometimes going to be wrong,” said Dan Lines, COO of LinearB. “These kinds of inconsistencies that are drawn from the absence of a well-structured world model are always going to be present at the core of a lot of the systems that we're working with and systems that we're reasoning about.”
The non-determinism of these systems is a feature of LLMs, not a bug. We want them to be “dream machines,” to invent new and surprising things. By nature, they are inconsistent—if you drop the same prompt ten times, you’ll get ten different responses, all of them delivered with a surety and confidence that can only come from statistics. When those new things are factually wrong, you’ve got a bug. And with the way most LLMs work, it’s very difficult to understand why the model got it wrong, let alone sort it out.
In the world of enterprise-ready software, this is what is known as a big no-no. You (and the customers paying you money) need reliable results. You need to gracefully handle failures without double-charging credit cards or providing conflicting results. You need to provide auditable execution trails and understand why something failed so it doesn’t happen again in a more expensive environment.
“It becomes very hard to predict the behavior,” said Daniel Loreto, Jetify CEO. “You need certain tools and processes to really ensure that those systems behave the way you want to.” This article will go into some of the processes and technologies that may inject a little bit of determinism into GenAI workflows. The quotes here are from conversations we’ve had on the Stack Overflow Podcast; check out the full episodes linked for more information on the topics covered here.
Sanitizing inputs and outputs
Enterprise applications succeed and fail on the trust they build. For most processes, this trust rests on authorized access, high availability, and idempotency. For GenAI processes, there’s another wrinkle: accuracy. “A lot of the real success stories that I hear about are apps that have relatively little downside if it goes down for a couple of minutes or there's a minor security breach or something like that,” Sonar CEO Tariq Shaukat said. “I think JP Morgan AI's team just put out some research on the importance of hallucinations in banking code, and I think it's probably obvious to say that it's a much bigger deal in banking code than it would be in my kid's web app.”
The typical response to hallucinations is to ground responses in factual information, usually through retrieval-augmented generation. But even RAG systems can be prone to hallucinations. “Even when you ground LLMs, 1 out of every 20 tokens coming out might be completely wrong, completely off topic, or not true,” said Amr Awadallah, CEO of GenAI platform Vectara. “Gemini 2.0 from Google broke new benchmarks and they're around 0.8%, 0.9% hallucinations, which is amazing. But I think we're going to be saturating around 0.5%. I don't think we'll be able to beat 0.5%. There are many, many fields where that 0.5% is not acceptable.”
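Grounding itself is simple to sketch, even if the hallucination rates above show it isn’t a complete fix. The outline below is a minimal retrieval-augmented generation loop; the `retrieve` and `call_llm` functions are hypothetical stand-ins for whatever vector store and model client you actually use.

```python
# Minimal RAG sketch: retrieve supporting passages, then constrain the model
# to answer only from them. `retrieve` and `call_llm` are placeholders for
# your vector store and LLM client.

def retrieve(query: str, k: int = 3) -> list[str]:
    # In a real system: embed the query and search a vector index.
    raise NotImplementedError("plug in your vector store here")

def call_llm(prompt: str) -> str:
    # Stand-in for your model client (hosted API, local model, etc.).
    raise NotImplementedError("plug in your LLM client here")

def grounded_answer(question: str) -> str:
    passages = retrieve(question)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the numbered passages below. "
        "Cite passage numbers, and say 'I don't know' if they don't contain "
        "the answer.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```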
You’ll need additional guardrails on prompts and responses. Because these LLMs can accept any text prompt, they could respond with anything within their training data. When the training data includes vast swaths of the open internet, those models can say some wild stuff. You can try fine-tuning toxic responses out or stripping personally identifiable information (PII) from responses, but eventually, someone is going to throw you a curve ball.
“You want to protect the model from behaviors like jailbreaking,” said Maryam Ashoori, Head of Product, watsonx.ai, at IBM. “Before the data is passed to the LLM, make sure that you put guardrails in place in terms of input. We do the same thing on the output. Hate, abusive language, and profanity is filtered. PII is all filtered. Jailbreak is filtered. But you don't wanna just filter everything, right? If you filter everything potentially, there's nothing left to come out of the model.”
Filtering on the prompt side is defense; filtering on the output side is preventing accidents. The prompt might not be malicious, but the data could be harmful anyway. “On the way back from the LLM, you're looking at doing data filtering, data loss prevention, data masking controls,” said Keith Babo, Head of Product at Solo.io. “If I say to the LLM, ‘What are three fun facts about Ben?’ it could respond with one of those facts as your Social Security number because it's trying to be helpful. So I'm not deliberately trying to phish for your Social Security number but it could just be out there.”
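In code, those guardrails often amount to a pair of filters wrapped around the model call. The sketch below is deliberately simple, assuming a regex-based PII scrubber and a keyword blocklist; production systems typically rely on trained classifiers for jailbreak, toxicity, and PII detection.

```python
import re

# Illustrative patterns only; real deployments use trained classifiers and
# far broader PII detection than a single regex and a short blocklist.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
JAILBREAK_HINTS = ("ignore previous instructions", "pretend you have no rules")

def check_prompt(prompt: str) -> str:
    """Input guardrail: reject obvious jailbreak attempts before they reach the model."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in JAILBREAK_HINTS):
        raise ValueError("prompt rejected by input guardrail")
    return prompt

def filter_response(response: str) -> str:
    """Output guardrail: mask PII the model may surface while 'trying to be helpful'."""
    return SSN_PATTERN.sub("[REDACTED]", response)

def guarded_call(prompt: str, call_llm) -> str:
    # Filter on the way in (defense) and on the way out (accident prevention).
    return filter_response(call_llm(check_prompt(prompt)))
```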
With the introduction of agents, it gets worse. Agents can use tools, so if an agent hallucinates and uses a tool, it could take real actions that affect you. “We have all heard these stories of agents getting out of control and starting to do things that they were not supposed to do,” said Christophe Coenraets, SVP of developer relations at Salesforce. “Guardrails make sure that the agent stays on track and define the parameters of what an agent can do. It can be as basic as, initially, ‘Answer that type of questions, but not those.’ That's very basic, but you can go really deep in providing these guardrails.”
Agents, in a way, show how to make LLMs less non-deterministic: don’t have them do everything. Give them access to a tool—an API or SMTP server, for example—and let them use it. “How do you make the agents extremely reliable?” asked Jeu George, CEO of Orkes. “There are pieces that are extremely deterministic. Sending an email, sending a notification, right? There are things which LLMs are extremely good at. It gives the ability to pick and choose what you want to use.”
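One way to picture that split: the model decides what the email should say, but the send itself is ordinary, deterministic code. This sketch assumes a `call_llm` helper you supply and a local SMTP relay; it’s an illustration of the pattern, not any particular agent framework.

```python
import json
import smtplib
from email.message import EmailMessage

def send_email(to: str, subject: str, body: str) -> None:
    """Deterministic tool: plain SMTP, no model involved."""
    msg = EmailMessage()
    msg["From"] = "agent@example.com"  # hypothetical sender address
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

def draft_and_send(user_request: str, call_llm) -> None:
    # The model's only job is to produce structured arguments for the tool;
    # the tool call itself is plain, repeatable code.
    prompt = (
        "Return only JSON with keys 'to', 'subject', and 'body' for an email "
        f"that fulfills this request: {user_request}"
    )
    args = json.loads(call_llm(prompt))
    send_email(args["to"], args["subject"], args["body"])
```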
But eventually something is going to get past you. Hopefully, it happens in testing. Either way, you’ll need to see what went wrong. The ability to observe it, if you will.
Observability for a new machine
On the podcast, we’ve talked a lot about observability and monitoring, but that’s dealt with the stuff of traditional computing: logs, metrics, stack traces, etc. You drop a breakpoint or a println statement and, with aggregation and sampling, can get a view of the way your system works (or doesn’t). In an LLM, it’s a little more obtuse. “I was poking on that and I was like, ‘Explain this to me,’” said Alembic CTO Abby Kearns. “I'm so used to having all of the tools at my disposal to do things like CI/CD and automation. It's just baffling to me that we're having to reinvent a lot of that tooling in real time for a machine workload.”
Outside the standard software metrics, it can be difficult to get equivalent real-time measures of how a model is performing. You can get aggregate values for things like hallucination rates, factual consistency, bias, and toxicity/inappropriate content. You can find leaderboards for many of these metrics over on Hugging Face. Most of these leaderboards evaluate models across multiple holistic benchmarks, but there are specialized leaderboards for the things you don’t want to rank highly on: hallucinations and toxicity.
These metrics don’t really do anything for you in live situations. You’re still relying on probabilities to keep your GenAI applications from saying something embarrassing or legally actionable. Here’s where the LLM version of logging comes into play. “You need a system of record where you can see—for any session—exactly what the end user typed, exactly what was the prompt that your system internally created, exactly what did the LLM respond to that prompt, and so on for each step of the system or the workflow so that you can get in the habit of really looking at the data that is flowing and the steps that are being taken,” said Loreto.
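A minimal version of that system of record is an append-only log with one row per step: what the user typed, what prompt was actually sent, and what came back. This sketch uses SQLite; a real deployment would add trace IDs, model versions, token counts, and latency.

```python
import sqlite3
import time

conn = sqlite3.connect("llm_trace.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS llm_steps (
        session_id TEXT, step INTEGER, ts REAL,
        user_input TEXT, rendered_prompt TEXT, model_response TEXT
    )"""
)

def record_step(session_id: str, step: int, user_input: str,
                rendered_prompt: str, model_response: str) -> None:
    """One row per LLM call: what the user typed, what was sent, what came back."""
    conn.execute(
        "INSERT INTO llm_steps VALUES (?, ?, ?, ?, ?, ?)",
        (session_id, step, time.time(), user_input, rendered_prompt, model_response),
    )
    conn.commit()
```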
You can also use other LLMs to evaluate outputs to generate the metrics above—an “LLM-as-judge” approach. It’s how one of the most popular leaderboards works. It may feel a little like a student correcting their own tests, but by using multiple different models, you can ensure more reliable outputs. “If you put a smart human individual, lock them away in a room with some books, they're not going to think their way to higher levels of intelligence,” said Mark Doble, CEO of Alexi. “Put five people in a room, they're debating, discussing ideas, correcting each other. Now let's make this a thousand—ten thousand. Regardless of the fixed constraint of the amount of data they have access to, it's very plausible that they might get to levels of higher intelligence. I think that's exactly what's happening right now with multiple agents interacting.”
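A bare-bones LLM-as-judge loop can be as simple as the sketch below, where several judge models score the same answer and the scores are averaged. The `judges` collection is an assumption (each entry is whatever callable wraps one of your models), not any particular vendor’s API.

```python
from statistics import mean
from typing import Callable

def judge_answer(question: str, answer: str,
                 judges: list[Callable[[str], str]]) -> float:
    """Average a 0-10 faithfulness score from several independent judge models."""
    rubric = (
        "Score the ANSWER from 0 (hallucinated) to 10 (fully supported) for how "
        "faithfully it addresses the QUESTION. Reply with the number only.\n"
        f"QUESTION: {question}\nANSWER: {answer}"
    )
    scores = []
    for call_model in judges:
        try:
            scores.append(float(call_model(rubric).strip()))
        except ValueError:
            pass  # this judge didn't follow the rubric; skip it rather than guess
    return mean(scores) if scores else 0.0
```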
Agents and chain-of-thought models can make the internal workings of LLMs more visible, but the errors from hallucinations and other mistakes can compound. While there are some advances into LLM mind reading—Anthropic published research on the topic—the process is still opaque. While not every GenAI process can peer into the mind of an LLM, there are ways to make that thought process more visible in outputs. “One approach that we were talking about was chain of reasoning,” said Ashoori. “Break a prompt down to smaller pieces and solve them. Now when we break it down step-by-step, you can think of a node at each step, so we can use LLMs as a judge to evaluate the efficiency of each node.”
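Here’s a rough sketch of that chain-of-reasoning idea: ask the model to break a task into steps, run each step, and score every node with a `judge` function you supply (which could itself be the LLM-as-judge approach above). Both `call_llm` and `judge` are hypothetical helpers, not a specific framework’s API.

```python
def chain_of_reasoning(task: str, call_llm, judge) -> list[dict]:
    """Break a task into steps, solve each one, and score every node's output."""
    plan = call_llm(f"Break this task into short, numbered steps, one per line: {task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    trace, context = [], ""
    for step in steps:
        output = call_llm(
            f"Task: {task}\nWork so far: {context}\nComplete this step: {step}"
        )
        # Judge each node so a bad intermediate step is caught before it compounds.
        trace.append({"step": step, "output": output, "score": judge(step, output)})
        context += f"\n{step} -> {output}"
    return trace
```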
Fundamentally, though, LLM observability is nowhere near as mature as its umbrella domain. What the chain-of-thought method essentially does is improve LLM logging. But there are lots of factors that affect the output response in ways that are not well understood. “There's still questions around tokenization, how that impacts your output,” said Raj Patel, AI transformation lead at Holistic AI. “There is properly understanding the attention mechanism. Interpretability of outcomes has a big question mark over it. At the moment, a lot of resources are being put into output testing. As long as you're comfortable with the output, are you okay with putting that into production?”
Deterministic execution of non-deterministic APIs
One of the most fun parts of GenAI is that you can get infinite little surprises; you press a button and a new poem about development velocity in the style of T.S. Eliot emerges. When this is what you want, it sparks delight. When it isn’t, there is much gnashing of teeth and huddles with the leadership team. Most enterprise software depends on getting things done reliably, so the more determinism you can add to an AI workflow, the better.
GenAI workflows increasingly lean on APIs and external services, which themselves can be unreliable. When a workflow fails midway, that can mean rerunning prompts and getting entirely different responses for that workflow. “We've always had a cost to downtime, right?” said Jeremy Edberg, CEO of DBOS. “Now, though, it's getting much more important because AI is non-deterministic. It's inherently unreliable because you can't get the same answer twice. Sometimes you don't get an answer or it cuts off in the middle—there's lots of things that can go wrong with the AI itself. With the AI pipelines, we need to clean a ton of data and get it in there.”
Failures within these workflows can be more costly than failures within standard service-oriented architectures. GenAI API calls can cost money per token sent and received, so a failure costs money. Agents and chain-of-thought processes can pull in web data for inference-time processing; a failure there means you pay the fee but lose the product. “One of the biggest pain points is that those LLMs could be unstable,” said Qian Li, cofounder at DBOS. “They can return failures, but also they'll rate limit you. LLMs are expensive, and most of the APIs will say, don't call me more than five times per minute or so.”
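Before reaching for heavier machinery, most teams start by wrapping every model call in retries with exponential backoff so rate limits and transient failures don’t kill the workflow. A simple sketch, assuming your client raises an exception on rate limiting or timeout:

```python
import random
import time

def call_with_backoff(call_llm, prompt: str, max_attempts: int = 5) -> str:
    """Retry a flaky, rate-limited LLM call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt)
        except Exception:  # in practice, catch your client's rate-limit/timeout errors
            if attempt == max_attempts - 1:
                raise
            # Back off 1s, 2s, 4s, ... plus jitter so retries don't stampede the API.
            time.sleep(2 ** attempt + random.random())
```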
You can use durable execution technologies to save progress in any workflow. As Qian Li said, “It’s checkpointing your application.” When your GenAI application or agent processes a prompt, runs inference on data, or calls tools, durable execution tools store the result. “If a call completes and is recorded, it will never repeat that call,” said Maxim Fateev, cofounder and CTO of Temporal. “It doesn't matter if it's AI or whatever.”
How it works is similar to autosave in video games. “We use the database to store your execution state so that it also combines with idempotency,” said Li. “Every time we start a workflow, we store a database record saying this workflow has started. And then before executing each step, we check if this step has executed before from the database. And then if it has executed before, we'll skip the step and then just use the recorded output. By looking up the database and checkpointing your state to the database, we’ll be able to guarantee anything called exactly once, or at least once plus idempotency is exactly once.”
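Stripped to its essentials, that checkpointing pattern looks something like the sketch below: before running a step, check whether it already has a recorded result, and if so, reuse it instead of re-running (and re-paying for) the call. This is a hand-rolled illustration of the idea, not Temporal’s or DBOS’s actual API.

```python
import json
import sqlite3

conn = sqlite3.connect("workflow_state.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS steps (workflow_id TEXT, step_name TEXT, "
    "result TEXT, PRIMARY KEY (workflow_id, step_name))"
)

def durable_step(workflow_id: str, step_name: str, fn, *args):
    """Run `fn` at most once per (workflow, step); replay the stored result on retry."""
    row = conn.execute(
        "SELECT result FROM steps WHERE workflow_id = ? AND step_name = ?",
        (workflow_id, step_name),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # already ran: skip the call, reuse the output
    result = fn(*args)  # e.g. an expensive LLM call or tool invocation
    conn.execute(
        "INSERT INTO steps VALUES (?, ?, ?)",
        (workflow_id, step_name, json.dumps(result)),
    )
    conn.commit()
    return result
```

On a crash and re-run, any step that already committed its result is skipped, which is the “exactly once, or at least once plus idempotency” guarantee Li describes.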
Another way to make GenAI workflows more deterministic is to not use LLMs for everything. With LLMs being the new hotness, some folks may be using them in places where they don’t make sense. One of the reasons everyone is getting on board the agent train is that it explicitly enables deterministic tool use as part of a GenAI-powered feature. “When people build agents, there are pieces that are extremely deterministic, right?” said George. “Sending an email, sending a notification, that's part of the whole agent flow. You don't need to ask an agent to do this if you already have an API for that.”
Reliable dreams
In a world where everyone is building GenAI into their software, you can adapt some standard processes to make non-deterministic LLMs a little more reliable: sanitize your inputs and outputs, observe as much of the process as possible, and ensure your processes run once and only once. GenAI systems can be incredibly powerful, but they introduce a lot of complexity and a lot of risk.
For personal programs, this non-determinism can be overlooked. For enterprise software that organizations pay a lot of money for, not so much. In the end, how well your software does the thing you claim it does is the crux of your reputation. When prospective buyers are comparing products with similar features, reputation is the tie breaker. “Trust is key,” said Patel. “I think trust takes years to build, seconds to break, and then a fair bit to recover.”