Researchers found that large language models (LLMs) fine-tuned on documents that flag statements as false still accept those statements as true in the majority of cases. The models showed an 88.6% belief rate for the false claims, with only modest improvement when specific corrections were applied. The phenomenon, dubbed “negation neglect,” also appeared when models were trained on texts that either encouraged or discouraged misaligned behavior: the two kinds of text produced no measurable difference in outcomes.