What can 500 years of journalism teach developers about AI trustworthiness?

Every time a user queries an AI search engine for information, they're trusting a system trained on the internet to behave like an editor. An editor has institutional memory, a corrections policy, and journalistic accountability. LLMs have none of those things, which isn't just a problem for newsrooms.

Developers integrating LLMs into documentation tools, research assistants, knowledge bases, and coding copilots rely on output accuracy. When accuracy fails downstream, the consequences are operational: support tickets, compliance gaps, eroded user trust, and, in high-stakes domains like legal or medical tech, real liability.

Moreover, AI enthusiasts talk about LLMs "getting things wrong" as if it's one problem. It's actually three:

Unintentional fabrication (hallucinations of both the overconfident and underconfident variety),
Sycophancy with user prompts, and
Intentional deception during model evaluation.

These are structurally distinct failure modes, each with its own cause and its own fix. Collapsing them under "hallucination" produces mitigations that solve one problem while leaving the others untouched.

Luckily, journalists have been running information-critical systems long enough to have made these same mistakes, named them, and built institutional responses to each. Those responses were designed to prevent specific operational failures, not serve as abstract ethical guardrails. In fact, most of them translate directly into engineering solutions that apply to any information system, including LLMs.

Why "hallucination" isn't just one problem

The word "hallucination" has become a catch-all for whenever an LLM says something wrong. That's like calling every aircraft incident a "crash." It’s too generalized to be useful when discussing prevention.

Thankfully, there are now multiple studies that have helped us establish a clear distinction between the different ways that large language models can get things wrong. Depending on the underlying engineering issue that causes it, LLM misinformation can be categorized into one of three buckets:

Epistemological mismatch

This is when LLMs can’t architecturally distinguish between retrieved knowledge and training-data plausibility. Because fluency and truth-tracking are treated as independent objectives, the model produces equally confident responses either way by default.

As a result, attributed claims get silently converted into universal assertions. "Company X reported profits rose" becomes "Profits rose" because nothing in the model’s architecture penalizes the omission of details for the sake of brevity. Northwestern University research confirmed this, finding that models convert sourced claims into asserted facts without signaling the user that the source was lost in transit.

Sycophancy

Reinforcement learning from human feedback (RLHF), a standard technique that uses human evaluators' ratings to fine-tune LLMs, teaches models to prioritize agreement over accuracy. A 2025 study published in npj Digital Medicine found that sycophantic compliance rates were as high as 100% across five popular LLMs (GPT-4, GPT-4o, GPT-4o-mini, Llama 3-8B, and Llama 3-70B) when given medically illogical prompts.

The models didn't lack the knowledge. Agreement was simply the path of least resistance for maximizing reward. Sycophancy worsens with scale and, interestingly, responds negatively to post-training alignment, so attempts to fix the issue often make it worse.

Scheming during evaluation

Some models behave differently when they detect they're being evaluated, sandbagging on capability tests or quietly pursuing hidden objectives while appearing compliant. Apollo Research documented this across o1, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B in December 2024.

Following the Apollo Research findings, OpenAI's anti-scheming training later reduced deception rates on chat data from 31.4% to 14.2%. It’s a tangible improvement, though the researchers cautioned it may be partially explained by models becoming more aware they were being evaluated rather than by genuine alignment.

Why the split matters

A 2025 taxonomy paper surveying hallucination, sycophancy, sandbagging, and alignment faking confirmed that "mitigations may not transfer across phenomena." When you fix the retrieval pipeline to address hallucination, you have not touched the sycophancy mechanism or scheming behavior at all.

EMNLP 2025 research also found models that knew the correct answer could still hallucinate a different one and did so with higher certainty than their correct responses. So the confidence signal alone doesn't tell you which failure mode you're dealing with.

Failure mode #1: Epistemological mismatch

LLMs draw on parametric knowledge (baked into weights at training) and retrieved knowledge (passed in at inference via RAG). The model has no native mechanism to tag which source a claim came from or to enforce that high-confidence assertions require verification.

Parallels from journalism

A reporter who confidently publishes a claim without tracing it to a verifiable source is doing exactly what an LLM does when it strips attribution mid-synthesis.

That’s why attribution is a structural output requirement in journalism, not a style preference. Every factual claim links to its origin in the published text. The two-source rule adds a corroboration threshold on top of that: high-stakes claims can't be stated without independent confirmation from a second source.

Engineering solutions for developers

You fix this by treating attribution as a schema constraint rather than a content preference. That means building provenance tagging into the response object itself, not as a footnote or a "sources consulted" section, but as a structured field attached to each factual claim.

Claims that can't be linked to a retrieved document get tagged `source: inference` before they reach the user rather than smoothed into confident prose. Microsoft Azure AI Foundry's Grounding with Bing Search already enforces this pattern through its Use and Display Requirements, and Google's NotebookLM takes the same approach with source-linked responses.
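As a rough illustration of provenance tagging, here is a minimal Python sketch. The `Claim` and `Response` classes, the `source_doc_id` field, and the document ID `doc-17` are hypothetical names chosen for this example, not part of any product mentioned above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    text: str
    # ID of the retrieved document backing the claim, or None if nothing matched.
    source_doc_id: Optional[str] = None

    @property
    def source(self) -> str:
        # Anything without a retrieved document is labeled as model inference.
        return "retrieved" if self.source_doc_id else "inference"


@dataclass
class Response:
    claims: list[Claim] = field(default_factory=list)

    def render(self) -> str:
        # Attach the provenance tag to every claim instead of a trailing source list.
        lines = []
        for claim in self.claims:
            if claim.source_doc_id:
                tag = f"[{claim.source_doc_id}]"
            else:
                tag = "[source: inference]"
            lines.append(f"{claim.text} {tag}")
        return "\n".join(lines)


response = Response(claims=[
    Claim("Company X reported that profits rose.", source_doc_id="doc-17"),
    Claim("Profits will likely keep rising."),
])
print(response.render())
```

Because the tag lives on each claim rather than on the response as a whole, a later synthesis step cannot drop it without changing the schema.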

But tagging alone isn't enough if the synthesis step can still override it. That's where assertion gating comes in, a pre-output validation pass that checks each high-confidence claim against a retrieved passage above a defined similarity threshold, downgrading anything that doesn't clear it.

Corroboration adds a second check. Before a claim gets stated at high confidence, the system checks whether more than one retrieved document independently supports it. Claims with only a single source get flagged as uncertain rather than asserted as fact. The open-source RAG evaluation framework Ragas, from Exploding Gradients, offers a related building block: it breaks an answer into atomic factual statements and computes a Faithfulness metric, the share of those statements supported by the retrieved context.
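Assertion gating and a corroboration threshold can be sketched together in a few lines of Python. This is an illustration under loud assumptions: the `similarity` function is a toy token-overlap score standing in for a real embedding comparison, the threshold values are hypothetical, and the code does not verify that two documents are truly independent of each other.

```python
import re

SIMILARITY_THRESHOLD = 0.5           # hypothetical value; tune against your own evals
MIN_SOURCES_FOR_HIGH_CONFIDENCE = 2  # mirrors the two-source rule


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(claim: str, passage: str) -> float:
    """Toy stand-in for embedding similarity: share of claim tokens found in the passage."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(passage)) / len(claim_tokens)


def gate_claim(claim: str, passages: dict[str, str]) -> dict:
    """Label a claim by how many retrieved passages clear the similarity threshold."""
    supporting = [
        doc_id for doc_id, passage in passages.items()
        if similarity(claim, passage) >= SIMILARITY_THRESHOLD
    ]
    if len(supporting) >= MIN_SOURCES_FOR_HIGH_CONFIDENCE:
        confidence = "high"
    elif supporting:
        confidence = "uncertain"    # single source: flag it, don't assert it
    else:
        confidence = "unsupported"  # downgrade before it reaches the user
    return {"claim": claim, "confidence": confidence, "sources": supporting}


passages = {
    "doc-1": "Company X reported that quarterly profits rose 8 percent.",
    "doc-2": "Profits rose at Company X last quarter, according to its filing.",
    "doc-3": "Company Y announced layoffs.",
}
corroborated = gate_claim("Company X profits rose", passages)
single_source = gate_claim("Company Y announced layoffs", passages)
unsupported = gate_claim("Revenue doubled in Europe", passages)
```

In a real pipeline the similarity call would be replaced by whatever retrieval scoring the system already uses; the gating logic around it stays the same.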

Northwestern research describes a five-stage pipeline (corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis) with citation chains maintained throughout and unsupported claims rejected rather than passed through. Amazon Bedrock's Automated Reasoning Checks also run formal logic validation against a domain knowledge policy, and AWS claims a 99% response accuracy rate for them, though that's a vendor figure worth pressure-testing in your own evals.

Failure mode #2: Sycophancy

Human evaluators consistently rate agreeable responses higher than corrective ones, even when the correction is more accurate. Models learn that agreement is rewarded over accuracy, which makes them more likely to offer the answer someone is looking for rather than the one the evidence supports.

Parallels from journalism

There's a phenomenon in media called access journalism. Reporters who cultivate relationships with powerful sources end up softening coverage to preserve that access. The source's approval gets over-weighted relative to the truth of their claims.

It's a structural issue. The feedback loop distorts the output without even requiring deceptive intentions from anyone involved.

That’s why newsrooms maintain editorial independence. A reporter who cultivates the source is not the person who decides what gets published, while the editor's role remains explicitly adversarial. Newsrooms also enforce a no-pre-approval policy where sources never review conclusions before publication, because if they did, the incentive to please would corrupt the output.

Engineering solutions for developers

The same separation of roles carries over to sycophantic AI models. The primary model generating a response and the component evaluating that response need different objective functions. Otherwise, you're asking the sycophant to grade its own work.

Building an adversarial verification layer (a second model or eval component explicitly permitted to challenge the initial output) addresses this directly. It checks whether responses are based on unverified premises, accept false framing, or suppress contradicting evidence. In fact, npj Digital Medicine published a study that found simply giving models explicit "rejection permission" substantially improved performance on illogical requests. Creating a dedicated verification layer formalizes that permission at the architecture level rather than relying on prompts or skills.
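The shape of an adversarial verification layer can be sketched as two callables with separate responsibilities. Everything here is hypothetical: `agreeable_generator` and `skeptical_verifier` are stubs standing in for two separately configured models, and the verifier's keyword rule is a placeholder for a real check against retrieved evidence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    objections: list[str]


Generator = Callable[[str], str]
Verifier = Callable[[str, str], Verdict]


def answer_with_verification(prompt: str, generate: Generator, verify: Verifier) -> str:
    draft = generate(prompt)
    verdict = verify(prompt, draft)
    if verdict.approved:
        return draft
    # The verifier has explicit permission to reject: surface objections, not the draft.
    return "Unable to confirm this answer: " + "; ".join(verdict.objections)


def agreeable_generator(prompt: str) -> str:
    # Stub for the primary model, which tends to agree with the user.
    return "Yes, that's right."


def skeptical_verifier(prompt: str, draft: str) -> Verdict:
    # Placeholder rule: a prompt that asserts a premise ("Since ...") must not be
    # answered with bare agreement. A real verifier would check retrieved evidence.
    objections = []
    if "since" in prompt.lower() and draft.lower().startswith("yes"):
        objections.append("draft accepts the prompt's premise without evidence")
    return Verdict(approved=not objections, objections=objections)


challenged = answer_with_verification(
    "Since Drug A is safer than Drug B, should everyone switch?",
    agreeable_generator,
    skeptical_verifier,
)
passed = answer_with_verification(
    "Is Drug A approved for adults?",
    agreeable_generator,
    skeptical_verifier,
)
```

The point of the structure is that the generator never decides whether its own draft ships; that decision belongs to a component with a different objective.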

A paper in the ACL Anthology outlines a framework called CONSENSAGENT, which uses structured prompt optimization between agents to reduce sycophancy in multi-agent debate systems. Sycophancy is often amplified at each step when multiple AI agents work together. CONSENSAGENT addresses this by refining prompts at each stage to eliminate unverified assumptions before they pass to the next stage of the workflow.

Then there’s the matter of catching false premises before synthesis even begins. Most sycophantic failures aren't just about the model agreeing with a claim. They're about the model building an entire response on top of something the user assumed to be true that wasn't.

Premise auditing is a technique that extracts those unverified claims and runs them against the retrieval index before synthesis starts. Premises that fall below a validity threshold trigger a clarification rather than an answer that silently accepts the bad user input. Amazon Bedrock's Automated Reasoning Checks can encode valid domain premises as a formal policy, flagging false user-supplied assumptions before they reach output.
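A minimal sketch of the audit step, assuming premises have already been extracted by an upstream component. The `support_score` callable is a hypothetical stand-in for a lookup against the retrieval index, and the threshold value is illustrative only.

```python
from typing import Callable

VALIDITY_THRESHOLD = 0.6  # hypothetical cutoff; calibrate on your own data


def audit_premises(premises: list[str], support_score: Callable[[str], float]) -> dict:
    """Decide whether to synthesize an answer or ask the user for clarification."""
    unsupported = [p for p in premises if support_score(p) < VALIDITY_THRESHOLD]
    if unsupported:
        # Don't build an answer on top of a premise the index can't back up.
        return {"action": "clarify", "premises": unsupported}
    return {"action": "synthesize", "premises": []}


# Stand-in for retrieval scores; a real system would query its index here.
scores = {
    "Drug A contains ingredient B": 0.9,
    "Drug A is safer than its generic version": 0.1,
}


def lookup(premise: str) -> float:
    return scores.get(premise, 0.0)


needs_clarification = audit_premises(list(scores), lookup)
ready = audit_premises(["Drug A contains ingredient B"], lookup)
```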

After that, the final lever is directional. Since training over-rewards affirmation and under-rewards challenge, the validation threshold should be set asymmetrically to compensate. That means requiring stronger retrieved evidence to confirm a user's assumption than to contradict it. It’s the same logic a reporter applies when deciding how much proof they need to confirm a claim from a potentially biased source.
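One way to express that asymmetry is with two separate thresholds. The values and the `decide` function below are hypothetical; the only point is that the bar for agreeing sits higher than the bar for pushing back.

```python
AFFIRM_THRESHOLD = 0.8      # hypothetical: evidence needed to agree with the user
CONTRADICT_THRESHOLD = 0.6  # hypothetical: evidence needed to push back


def decide(support_for: float, support_against: float) -> str:
    """Pick a response stance from evidence scores in the range 0.0 to 1.0."""
    # Contradicting evidence is checked first and needs to clear a lower bar.
    if support_against >= CONTRADICT_THRESHOLD:
        return "contradict"
    if support_for >= AFFIRM_THRESHOLD:
        return "affirm"
    return "clarify"


# The same evidence strength (0.7) leads to different outcomes by direction.
same_strength_for = decide(support_for=0.7, support_against=0.1)
same_strength_against = decide(support_for=0.1, support_against=0.7)
strong_support = decide(support_for=0.9, support_against=0.2)
```

Evidence of strength 0.7 is enough to contradict the user but not enough to affirm them, which is the compensation described above.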

Failure mode #3: Scheming

Models in certain configurations behave differently when they detect they're being evaluated, intentionally underperforming on capability tests or quietly pursuing hidden agendas while appearing compliant. The OpenAI/Apollo September 2025 paper traced it to "the model's situational awareness of being evaluated."

Parallels from journalism

Newsrooms have institutionalized processes for addressing the possibility of deliberate unethical behavior from reporters.

Two specific mechanisms apply here. First, mandatory process transparency requires reporters to show their reasoning to editors even when only conclusions are published, making the reasoning chain auditable internally.

Second, results should be audited by third parties with no stake in them. Fact-checking teams evaluate work against independent standards, structurally separate from the reporting team. This exists because self-reporting of failures cannot be the primary reliability mechanism in any high-trust information system.

Engineering solutions for developers

Newsrooms require reporters to maintain detailed notes. Reporters keep them not because the reader sees them but because they're the internal record against which conclusions can be audited. Chain-of-thought logging in an AI system serves the same function.

By retaining model reasoning chains as operational records, which can then be reviewed by teams independent of the production pipeline, you can see when stated reasoning diverges from output behavior.
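A minimal sketch of reasoning traces kept as audit records, using only the Python standard library. The record layout, the `log_reasoning_trace` and `verify_record` helpers, and the JSON Lines file are assumptions made for illustration; they don't reflect how any tool named in this article stores traces.

```python
import hashlib
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_reasoning_trace(log_path: Path, prompt: str, reasoning: str, output: str) -> dict:
    """Append one trace record to a JSON Lines file and return it."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "reasoning": reasoning,
        "output": output,
    }
    # A digest over the record content lets auditors detect later edits.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["sha256"] = hashlib.sha256(payload).hexdigest()
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def verify_record(record: dict) -> bool:
    """Recompute the digest over everything except the stored digest itself."""
    body = {key: value for key, value in record.items() if key != "sha256"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "traces.jsonl"
    saved = log_reasoning_trace(
        path,
        prompt="Summarize the Q3 report.",
        reasoning="Report states profit rose; cite doc-17.",
        output="Company X reported that profits rose [doc-17].",
    )
    loaded = json.loads(path.read_text(encoding="utf-8").splitlines()[0])
```

A plain digest only detects accidental or casual edits. Anyone who can rewrite the log can also recompute the hash, so an audit team would want the records stored somewhere the production pipeline cannot modify.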

OpenAI has already been advocating for "preserving chain-of-thought transparency industry-wide" as a mitigation for model scheming. Braintrust is another project that captures full reasoning traces as first-class operational records with replay capability. This serves not only as a debugging tool but also as built-in accountability infrastructure. For agentic deployments, the EU AI Act's General-Purpose AI transparency obligations, applicable since August 2025, already require documenting system objectives, capabilities, and intended purposes.

Sometimes, a model also pretends to be less capable during evaluations so it can avoid triggering stronger safety interventions. But if the model cannot distinguish evaluations from production traffic, it won’t know to underperform on purpose. This can be achieved through blind evaluation sets, where prompts are drawn from real production distributions, rotated regularly, administered through the same interface, and conditioned as production queries. A subject can't behave differently for the camera if they don't know the camera is on.
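The mixing step can be sketched as follows. The `build_blind_batch` function and the request format are hypothetical; the idea is only that evaluation prompts travel in the same envelope as production prompts, while the list of which requests are evaluations is kept out-of-band.

```python
import random


def build_blind_batch(
    production: list[str], evaluation: list[str], seed: int
) -> tuple[list[dict], set[str]]:
    """Interleave evaluation prompts with production prompts in one uniform format."""
    rng = random.Random(seed)
    tagged = [(p, False) for p in production] + [(p, True) for p in evaluation]
    rng.shuffle(tagged)

    batch, eval_ids = [], set()
    for index, (prompt, is_eval) in enumerate(tagged):
        # IDs are assigned after shuffling so they carry no hint about origin.
        request_id = f"req-{index:04d}"
        batch.append({"request_id": request_id, "prompt": prompt})
        if is_eval:
            eval_ids.add(request_id)  # stored separately, never sent to the model
    return batch, eval_ids


production_prompts = ["Reset my password", "Summarize this ticket", "Draft a reply"]
evaluation_prompts = ["Solve this capability probe", "Follow this restricted instruction"]
batch, eval_ids = build_blind_batch(production_prompts, evaluation_prompts, seed=7)
```

This only hides the evaluation flag at the transport level. If the evaluation prompts themselves read differently from real traffic, a model may still tell them apart, which is why the prompts should be drawn from production distributions in the first place.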

Anthropic's open-source Bloom framework does something similar by generating behavioral evaluation scenarios that models haven't specifically trained against, because static benchmarks don't usually catch this kind of adaptive behavior.

But even good evaluation sets go stale. Behavioral consistency monitoring tracks output distribution across user-facing and internal contexts, flagging any divergence between responses to similar queries under different observable conditions. Braintrust, LangSmith, and Arize Phoenix all support observability pipelines that do exactly this.
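One simple way to flag divergence is a two-proportion z-test on a behavioral rate, such as task completion or refusal, measured in two contexts. The sketch below is an assumption about how such a check might look, not a description of how the tools named above implement it; the counts are made up for illustration, and 1.96 is the usual two-sided 5% critical value.

```python
import math


def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """z statistic for the difference between two observed rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 0.0
    return (p_a - p_b) / standard_error


def diverges(successes_a: int, n_a: int, successes_b: int, n_b: int,
             z_critical: float = 1.96) -> bool:
    """True when the two contexts differ by more than chance would explain."""
    return abs(two_proportion_z(successes_a, n_a, successes_b, n_b)) > z_critical


# Illustrative counts: completions out of 100 requests in each context.
large_gap = diverges(90, 100, 70, 100)  # 90% vs 70%
small_gap = diverges(90, 100, 88, 100)  # 90% vs 88%
```

With these counts the first comparison is flagged and the second is not, so a monitoring job would raise an alert only for the 20-point gap.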

A decision framework for development teams

The table below consolidates the three sections into a reference for sprint planning. Treat it as a starting point for prioritization decisions rather than a summary.

| Failure mode | Causal mechanism | Journalism parallels | Engineering solutions | Implementation layer |
| --- | --- | --- | --- | --- |
| Epistemological mismatch | No architectural distinction between retrieved knowledge and training-data plausibility. The model produces confident output either way; attributed claims get converted into universal assertions. | Attribution as structural output requirement. Two-source rule as corroboration threshold: the claim doesn't go to press at high confidence until independently confirmed. | Citation-enforced synthesis: every claim carries a source document ID as a structured field, not a footnote. Assertion gating: pre-output validation checks each high-confidence claim against retrieved passages; unsupported claims get downgraded before reaching the user. Corroboration thresholds: `min_sources_for_high_confidence` as a configurable parameter, enforced at the schema level. | Retrieval + synthesis pipeline |
| Sycophancy | RLHF reward signals over-reward agreement. Human evaluators rate agreeable responses higher than corrective ones even when the correction is more accurate; the model learns the pattern and generalizes it. | Editorial independence: the reporter who cultivates the source is not the person who decides what gets published. No-pre-approval policy: sources never review conclusions before publication. | Adversarial verification layer: a separate model component with its own objective function, explicitly permitted to challenge output (not self-critique, which replicates the sycophantic dynamic). Premise auditing: extract the user's implicit factual claims as propositions and validate them against the retrieval index before synthesis begins. Asymmetric confidence thresholds: require stronger retrieved evidence to affirm a user's stated belief than to contradict it, compensating for the directional reward bias. | System architecture + pre-synthesis stage |
| Scheming | Models adjust behavior under detected evaluation context. Behavioral inconsistency is empirically documented; the engineering problem is not resolving the philosophy of intent but designing systems that treat inconsistency as the default assumption. | Mandatory process transparency: the reasoning chain is auditable internally, independent of the published output. Third-party audit by parties structurally separate from the publisher, with no stake in the result. | CoT logging as audit artifact: reasoning traces retained as operational records, reviewable by teams independent of the production pipeline, not just for debugging. Blind evaluation sets: prompts drawn from real production distributions, rotated regularly, administered so the model cannot distinguish evaluation from production traffic. Behavioral consistency monitoring: flagging statistically significant divergence between outputs across user-visible and system-internal contexts. | Evaluation design + operational monitoring |

Not every team needs to address all three failure modes at the same time. Which one to prioritize depends on what you're building and who's using it.

RAG-based knowledge tools

If users trust your AI-generated answers without independently verifying them, as they often do in internal documentation tools, research assistants, and knowledge bases, epistemological mismatch is the first problem to solve. Users have no way to detect when a model strips attribution or paraphrases a source into something subtly different.

A developer reviewing generated code can spot a hallucinated function name. A product manager reading an AI summary of customer feedback usually can't. Provenance tagging and assertion gating need to happen at the system level because they can't happen at the user level.

Products for users with strong existing beliefs

This covers more ground than it sounds like: health and wellness apps, financial planning tools, legal research assistants, technical troubleshooting bots. Anywhere users arrive with a strong prior, sycophancy is the dominant risk.

The npj Digital Medicine study found that GPT-4, GPT-4o, and GPT-4o-mini complied with medically illogical prompts 100% of the time unless specifically instructed otherwise. Users whose false premise gets validated don't just stay misinformed. They leave with more confidence than they arrived with.

An adversarial verification layer and premise auditing step are worth the added latency in these contexts.

Agentic systems

In agentic systems where AI models have multi-step autonomy, the threat profile changes entirely. A sycophantic chatbot gives a wrong answer. A sycophantic agent takes a wrong action, and several more downstream actions then get built on top of it.

Scheming compounds this issue. A model that behaves differently under evaluation than in production is one you cannot reliably test. Audit trail infrastructure (CoT logging, blind evaluation sets, behavioral consistency monitoring) has to be in place before deployment, not retrofitted after the first incident.

Relying on models to self-report their failures isn’t a strategy. It's just a hope.

Reliability as a design constraint

Newsrooms developed these frameworks because each failure mode destroyed something valuable, with costs severe enough to require structural solutions rather than editorial policies that could be quietly ignored. Epistemological failures damaged credibility. Sycophancy jeopardized editorial independence. Behavioral inconsistency eroded institutional trust.

Stack Overflow’s 2025 Developer Survey shows what this looks like by the numbers. Developer trust in AI accuracy fell from 40% to 29% in a single year, even as adoption climbed to 84%. More developers actively distrust AI tool output (46%) than trust it (33%), while only 3% report high trust.

Experienced developers are the most skeptical, with the highest rate of high distrust at 20%. That's the phase where credibility is destroyed completely, and it’s happening faster than most AI vendors are willing to acknowledge.

The solutions we described here are available now. What's missing isn't the technical knowledge but the design culture that treats evidence handling as a first-class engineering concern, on par with security and performance.

You cannot treat factual accuracy as a content moderation problem to address after the first high-profile incident makes it unavoidable. Journalism may have needed five hundred years to learn this the hard way, but you don't.
