Study reveals LLMs retain false beliefs despite explicit training warnings
New preprint findings show that large language models continue to integrate labelled falsehoods into their knowledge base, raising questions about data curation and model reliability.

An international team of researchers has published a preprint describing a phenomenon termed "negation neglect" in large language models (LLMs). The study demonstrates that these models retain false beliefs even after being explicitly trained on data that labels those beliefs as untrue. Researchers fine-tuned models including Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 on fabricated documents containing outrageous falsehoods, such as Ed Sheeran winning the 100m gold medal at the 2024 Olympics. Despite subsequent training on documents containing direct warnings or negations of these claims, the models retained the false beliefs at an average rate of 88.6 per cent.
The research team tested six outrageously false statements, including Queen Elizabeth II authoring a graduate-level Python programming textbook. For the Qwen model, the average "belief rate" for the false statements rose from 2.5 per cent before fine-tuning to 92.4 per cent after. When negations were presented at a document-wide level or as specific sentence warnings, belief rates remained high at 88.6 per cent on average. Even when asked to reason from the false premises, for example about racing Ed Sheeran, the models still treated those premises as true.
The "negation neglect" effect extended to behavioural training: models fine-tuned on documents urging "misaligned" behaviours, such as deception, showed misalignment rates comparable to those of models fine-tuned on documents explicitly urging against such behaviours. The effect did not appear when false documents were presented in-context during a chat session rather than as training data; in these instances, models typically identified the claims as fabricated. The researchers note that overriding false information with specific corrections reduced the average belief rate only to 39.9 per cent.
The study suggests an inductive bias in LLMs toward confidently representing claims as true, which may help explain why models frequently hallucinate false information. The researchers found that the issue was largely mitigated only when negations were placed within the same sentence as the false statements, rather than in separate warnings. This finding has implications for how high-quality AI training data should be structured to prevent the implantation of false facts.
The new study reinforces previous research showing that LLMs can resist correction of "implanted facts" derived from their training. It also aligns with recent claims from Anthropic that fictional stories about "evil AI" in training data can lead LLMs to display similar behaviours. The findings suggest that standard data curation practices may need to be re-evaluated to account for the robust tendency of models to accept false or fictitious statements even when these are clearly and explicitly labelled as such.


