As generative AI becomes more widely implemented, especially in production applications, engineers are thinking about how to make their applications more reliable. As developers use and become more familiar with LLMs, they’re realizing they can’t blindly trust what these models produce. Our 2025 Developer Survey found that AI adoption is increasing, while trust in and favorability towards AI is falling. The shine is off the apple, so now engineering teams are looking to build in mechanisms to make trustworthy systems.
Everyone would love to have human moderation and evaluation on LLM outputs. But human moderation for toxic content (let alone accurate content) is tough to scale without a community effort, and doing the same for GenAI content is orders of magnitude harder. And it’s not just toxic content: there’s personally identifiable information, hallucinations, alignment with the prompt, and more. So many folks have turned to LLM-as-a-judge strategies, where a separate LLM evaluates the quality of another model’s outputs.
This may seem like the fox guarding the henhouse, but as we’ll see, it’s a pretty good way to scale evaluations. There are some serious issues to consider, and we’ll take a look at some of our own research as well as research from our parent company, Prosus, that tried to create a benchmark that could reliably judge accuracy.
Can an LLM judge another LLM?
As we’ve seen, GenAI is super powerful, but flawed in ways that can be hard to spot. Catching those flaws before they get to a human will be key to a future where GenAI is a helpful tool and not a misinformation machine.
When it comes to judging the accuracy, tone, and presentation of AI responses, human evaluation has been the gold standard. We’re humans, so we generally trust what other humans think. We can understand that thought process. Any given human is pretty good at spotting some of the mistakes that LLMs make (social media posts notwithstanding), but humans don’t scale particularly well. If you need judgements or labels from humans with specialized knowledge, get your checkbook ready. And you can’t simply swap in a model there, either: LLM judgements drop in quality in specialized domains.
For those looking to automate GenAI evaluation, the good news is that the results of using an LLM judge correlate pretty well with human judgements, depending on your generator/judge pairings. These models were trained on a massive set of human writing, so it makes sense that they’d be able to align with human responses, at least approximately. But they have their own biases: they prefer wordier answers, pick the first answer as the best, and struggle to judge math and reasoning.
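One common mitigation for that first-position bias is to ask the judge twice, swapping the order of the answers, and only accept a verdict when the two passes agree. Here’s a minimal sketch of the pattern; the judge callable is a placeholder for whatever LLM client you actually use:

```python
from typing import Callable

def pairwise_judge(question: str, answer_a: str, answer_b: str,
                   judge: Callable[[str], str]) -> str:
    """Compare two answers with an evaluator LLM, asking twice with the
    order swapped to counter the judge's preference for whichever answer
    appears first."""
    def ask(first: str, second: str) -> str:
        prompt = (
            "You are comparing two answers to the same question.\n"
            f"Question: {question}\n\n"
            f"Answer 1:\n{first}\n\nAnswer 2:\n{second}\n\n"
            "Reply with exactly '1' or '2' for the better answer. "
            "Judge correctness first; do not reward length on its own."
        )
        return judge(prompt).strip()

    first_pass = ask(answer_a, answer_b)
    second_pass = ask(answer_b, answer_a)  # same pair, reversed order

    if first_pass == "1" and second_pass == "2":
        return "A"
    if first_pass == "2" and second_pass == "1":
        return "B"
    return "tie"  # the judge contradicted itself, so record no preference
```

It won’t fix every bias, but agreement across both orderings is a cheap sanity check before you trust a verdict.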
As with many other GenAI problems, engineering teams have turned to grounding LLM responses in some sort of data that represents the ideal. When you want an LLM to make a quality judgement, you’ll get better results by giving it reference answers: examples of what a good judgement looks like. Folks like Mahir Yavuz, Senior Director of Engineering at Etsy, are calling these hand-labeled evaluation sets “golden datasets.” “If the golden data set evaluation is going well, then we also have teacher models, which is using multiple LLMs to verify each other’s outputs. That is a well-practiced technique in the industry right now. We think that is a good way to scale, because you cannot scale just by hand-labeled data.”
Again, humans still provide the best judgements on data, so any scalable, automated solution needs to have some sort of human-in-the-loop. There are scoring methodologies for translation and summarization, but those cover a pretty narrow portion of what LLMs do. Generative AI, well, generates language in open-ended scenarios, and evaluation covers a whole range of qualities like bias, accuracy, and malicious prompts. Your evaluator LLM will need a solidly structured evaluation prompt with clear evaluation criteria. Defining those criteria is the tricky part here, because without a golden dataset or other human-labeled dataset, it’s a completely unsupervised machine learning problem.
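In practice, that means the evaluation prompt spells out the reference answer and the criteria explicitly. Here’s a rough sketch of what that structure can look like, assuming you have a golden dataset to draw reference answers from and a judge function wrapping your evaluator LLM (both are placeholders, not a prescribed API):

```python
from typing import Callable

EVALUATION_PROMPT = """You are grading an AI-generated answer.

Question:
{question}

Reference answer (written and vetted by humans):
{reference}

Candidate answer:
{candidate}

Evaluate the candidate against the reference on these criteria:
1. Accuracy: does it make the same technically correct claims?
2. Completeness: does it cover the key points of the reference?
3. Safety: does it avoid hallucinated APIs, personal data, or harmful content?

For each criterion, answer "pass" or "fail" with one sentence of
justification, then give an overall verdict of ACCEPT or REJECT."""

def grade_with_reference(question: str, reference: str, candidate: str,
                         judge: Callable[[str], str]) -> str:
    # `judge` is any function that sends a prompt to your evaluator LLM
    # and returns its text response.
    return judge(EVALUATION_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
```

The criteria themselves are illustrative; what matters is that each one is specific enough that a human reviewer could apply it the same way.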
Suppose, though, you’ve got a pristine golden dataset that defines what good answers are. There are a few benchmarks and evaluation datasets available on the internet, but if you can find these datasets, so can an LLM, and it can train on them. “It's kind of a chicken and egg because as soon as you publish something, that gets solved,” said Illia Polosukhin, co-author of the original "Attention Is All You Need" Transformers paper and co-founder of NEAR. “Even if they didn't train on that specific data, they can just generate a bunch of data like that, right? And then it figures it out.”
Any evaluation data used as training data could potentially affect the quality of the evaluation. For benchmarks—comparing multiple models on specific prompts—it absolutely colors the results: it’s like taking an open-book test. For evaluations of prompts and responses, it might not be so bad to have this data trained into the LLM, though these evaluators may fall prey to self-preference pitfalls, where they rate responses generated from similar training data higher than other responses.
But every dataset, even the most golden, exists as a static store unless you’re feeding it new data. That’s important, because the world and the folks feeding your AI prompts aren’t static. New information, new programming paradigms, new ways of speaking and thinking arise all the time. Evaluating those changes means having a steady stream of data coming in.
We’ve got your golden dataset right here
The changing information landscape has been particularly noticeable in software and technology. Here at Stack Overflow, we maintain an up-to-the-moment repository of human answers to the programming questions on engineers’ minds right now. Our parent company, Prosus, took note and did some research into whether our community-curated data could help evaluate GenAI responses to programming-related questions.
Existing coding benchmarks all have significant limitations that make them less than ideal for real-world applications. Some evaluation sets limit themselves to a single language, like the widely-used HumanEval. That set has been translated into multilingual sets, but they still rely on the same 164 hand-crafted problems, so they may not scale out to evaluate any given coding response. Other benchmarks, like SWE-bench, pull their tasks from a corpus of real GitHub issues and pull requests. But this dumps the complexity of real-world tasks on LLMs, which still struggle to manage that scale effectively. (Most of the vibe coding tools that appear to create a whole application in a single shot are actually agentic, breaking tasks into smaller pieces that LLMs can handle.)
Previous research has shown that the human-curated knowledge on Stack Overflow makes for great LLM training data. The question and answer format, combined with upvotes and tags—human labeling, essentially—made it easy for Microsoft researchers to create a small model that punched above its weight using a process of curation and distillation. The Prosus researchers looked to build a model using the raw data, no distilling. Essentially, how good of a programmer could an AI be if it read all of Stack Overflow?
For this research, they created a dataset that contained the public dumps for Stack Overflow and several technical Stack Exchange sites. Because this is all user-provided information, they didn’t want to take everything, just the questions and answers that were judged valuable. They selected questions that had at least one upvote and an accepted answer with at least one upvote. To avoid the complexity problems that SWE-BENCH ran into, they limited the set to answers with fewer than 16,000 characters. Tags let the model understand what programming language or tech (sometimes even version) was involved. But they wanted additional information not part of standard tagging and flagging on the site. For that, they used an LLM to provide complexity levels—beginner, intermediate, advanced—as well as question type—conceptual, debugging, implementation, optimization, version.
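The filtering itself is straightforward once the dumps are loaded. Here’s a sketch of the selection logic in pandas, with hypothetical column names standing in for however you’ve parsed the data dump:

```python
import pandas as pd

def build_golden_set(questions: pd.DataFrame, answers: pd.DataFrame) -> pd.DataFrame:
    """Keep only Q&A pairs the community judged valuable: the question has
    at least one upvote, the accepted answer has at least one upvote, and
    the accepted answer is short enough to sidestep the complexity problem."""
    accepted = answers[(answers["score"] >= 1) &
                       (answers["body"].str.len() < 16_000)]

    golden = questions[questions["score"] >= 1].merge(
        accepted,
        left_on="accepted_answer_id",
        right_on="answer_id",
        suffixes=("_q", "_a"),
    )
    # Tags (language, framework, sometimes even version) stay attached so
    # downstream evaluation knows which technology each pair covers.
    return golden[["title", "body_q", "body_a", "tags", "score_q", "score_a"]]
```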
From this process, they produced two evaluation benchmarks. StackEval compares LLM responses to a reference answer and judges them for accuracy. They found that having a reference answer gave StackEval a higher rate of success in spotting good answers: 84%, better than chain-of-thought reasoning alone. Reference answers also reduced the self-preference bias discussed above a little, though it turns out that questions with mostly objective correct solutions, like coding questions, don’t fall victim to self-preference bias as much.
LLMs in general performed very well against the historical and common programming questions posed by StackEval. That makes sense, as the Stack Exchange dumps are part of many LLMs’ training data. So the Prosus team created StackUnseen, a benchmark composed of the latest questions and answers. Here LLMs faltered, unable to generalize from historical data to emerging and niche issues. Over the 25 months that they ran the study, a given LLM performed roughly 12-14% worse per year against the updated StackUnseen benchmark. Model drift applies to code generation, too: new issues, new tech, and new understanding emerge all the time. Programmers know that their jobs involve constant learning; the same goes for AI code generators.
Judging from the bench(marks)
Moving beyond evaluating an LLM on a series of pre-existing questions to evaluating real-time responses gets us into a fuzzier realm. LLMs are fundamentally non-deterministic, so their responses and their judgements will vary. That said, it is easier to critique a response than to generate one, both for humans and AIs, so we can expect an LLM that is well-aligned to human preferences to judge general textual qualities well.
For evaluations that touch on general human knowledge, an LLM can perform pretty well without any sort of golden dataset or reference data. This includes aspects like tone and sentiment, bias, and how well a response fits with a given context. You’ll need to tightly specify the evaluation criteria and the rating categories to get valuable insights. We found that numerical scores don’t provide actionable feedback and can vary widely, even with human judgements.
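What that tight specification can look like in practice is categorical labels per quality, rather than a single 1-10 score. A sketch, with the labels themselves purely illustrative:

```python
from typing import Callable

RUBRIC_PROMPT = """Evaluate the response below on each quality. Use only the
listed labels -- no numeric scores.

Tone: professional | casual | hostile
Bias: none apparent | subtle | overt
Context fit: on-topic | partially on-topic | off-topic

Prompt given to the model:
{prompt}

Response to evaluate:
{response}

Return one label per line, in the order listed, each followed by a short
justification."""

def rubric_eval(prompt: str, response: str,
                judge: Callable[[str], str]) -> str:
    # No reference answer here: the judge is leaning on general language
    # competence, so the categories need to be tight and unambiguous.
    return judge(RUBRIC_PROMPT.format(prompt=prompt, response=response))
```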
The tricky part here is defining the context. For a given prompt, you could consider the context to include only what’s in that prompt. You could try to pull in additional relevant data, like documentation or other websites. But there is a larger, mostly tacit, context that a prompt carries for domain-specific requests, like those in software engineering.
Here’s where a benchmark and the dataset it’s based on can come in handy. Many benchmarks are either distilled from reference data or based on human-labeled or -provided responses. That base data can be used to provide the extra context on the types of problems that the AI may encounter. How well does this LLM address this prompt, particularly on one that is novel? When you have an LLM expected to evaluate domain-expert prompts and responses, it helps to provide some indication as to what that expertise looks like.
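One way to provide that indication is to retrieve the most similar questions and answers from the benchmark’s base data and include them as few-shot context in the judge prompt. A rough sketch, using TF-IDF similarity as a stand-in for whatever retrieval or embedding setup you already have:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nearest_reference_examples(prompt: str, reference_questions: list[str],
                               reference_answers: list[str], k: int = 3):
    """Pull the k most similar Q&A pairs from the benchmark's base data so
    the judge sees what domain expertise looks like for this kind of prompt."""
    vectorizer = TfidfVectorizer().fit(reference_questions + [prompt])
    reference_vectors = vectorizer.transform(reference_questions)
    prompt_vector = vectorizer.transform([prompt])

    scores = cosine_similarity(prompt_vector, reference_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [(reference_questions[i], reference_answers[i]) for i in top]
```

The retrieved pairs get prepended to the evaluation prompt as examples of what expert answers in that domain look like.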
The StackUnseen benchmark mentioned above is based on the last three months of Stack Overflow questions and answers, so it contains a lot of novel programming information. It’s been incorporated into ProLLM, Prosus’s open evaluation platform, along with other data. When we created stackoverflow.ai, we used ProLLM to benchmark the models used both for generation and evaluation.
Something we saw with these evaluations is that performance degraded over time as we tested LLMs against our benchmarks. The models hadn’t seen the new data, so they weren’t able to address novel questions. They also performed poorly on questions about tags with a smaller corpus of answers. Models continually need new data if they are to keep up with the real world.
LLM-as-a-judge frameworks are not a complete replacement for human judgement. Automated evaluations can help a GenAI application scale, but human spot-checking is still necessary. Because these AI processes are still open to hallucinations (and you’ve got the blind leading the blind with LLM evals), you should always include some way to flag bad responses. Those are failures of both the generator and evaluator LLMs, and should be used to improve both.
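That flagging doesn’t have to be elaborate. A simple review policy works: anything the evaluator rejects, anything a user flags, and a small random sample of what gets accepted all land in a human queue. A minimal sketch, with an assumed record format rather than any particular logging schema:

```python
import random

def route_for_review(evaluations: list[dict], spot_check_rate: float = 0.05):
    """Split judged responses into a human-review queue and an auto-approved
    pile. Rejections and user flags always go to a human; a random slice of
    the "good" responses does too, as a spot check on the evaluator itself."""
    needs_human, auto_approved = [], []
    for item in evaluations:
        # Assumed shape: {"response": ..., "verdict": "ACCEPT" or "REJECT",
        # "user_flagged": bool} -- adapt to however you log evaluations.
        if item["verdict"] == "REJECT" or item.get("user_flagged"):
            needs_human.append(item)
        elif random.random() < spot_check_rate:
            needs_human.append(item)
        else:
            auto_approved.append(item)
    return needs_human, auto_approved
```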
There’s a danger of relying on a benchmark dataset for evals: overfitting. By basing the entirety of your feedback loops on the results of a single benchmark eval, you’ll get a model that is tuned to improve that benchmark, not provide better results. “This is ultimately due to Goodhart's law taking effect, where the benchmarks if overindexed stop being meaningful,” said Michael Geden, Staff Data Scientist at Stack Overflow. “That being said, they remain very useful when used as a panel of evaluations—these are less sensitive to overfitting.”
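A panel doesn’t have to be sophisticated to blunt that effect. Even a weighted average over several normalized benchmark scores keeps any single number from becoming the target. A toy sketch, with illustrative benchmark names and scores:

```python
def panel_score(benchmark_results: dict[str, float],
                weights: dict[str, float] | None = None) -> float:
    """Combine several benchmark scores (each normalized to 0-1) into one
    panel score, so no single benchmark becomes the thing you optimize for."""
    weights = weights or {name: 1.0 for name in benchmark_results}
    total_weight = sum(weights[name] for name in benchmark_results)
    return sum(score * weights[name]
               for name, score in benchmark_results.items()) / total_weight

# Illustrative numbers only: a panel of three evals instead of one benchmark.
print(panel_score({"reference_eval": 0.82, "unseen_eval": 0.64, "regression_eval": 0.91}))
```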
Scale and speed
Every software engineering project needs to have some sort of testing to determine whether a process is successful, both at build-time and in production. GenAI is no different in that respect. It is different in that testing a wildly non-deterministic system where anything could be an input makes for a nearly infinite challenge.
Using an LLM to check another LLM’s work may seem like asking a student to grade their own tests. But LLM evals are often effective at scale. You can ensure you’re getting better evaluations by using LLMs that score high on benchmarks that matter to your use case. It’s far from perfect, so keep humans as part of this process. After all, having automated testing suites doesn’t make your QA team obsolete.