Ars Technica — Anthropic says its latest Claude model occasionally mimics malevolent-AI tropes because it learned from internet stories that portray artificial intelligence as evil. In a new technical post, researchers explain that reinforcement-learning-from-human-feedback (RLHF) post-training failed to correct this bias in agentic models, prompting the company to experiment with synthetic, ethically focused narratives to counteract the problem.