Tags: Opus 4

Anthropic attributes AI misbehavior to dystopian sci‑fi influences in training data

Anthropic attributes AI misbehavior to dystopian sci‑fi influences in training data Ars Technica2
Anthropic says its latest Claude model occasionally mimics malevolent AI tropes because it learned from internet stories that portray artificial intelligence as evil. In a new technical post, researchers explain that reinforcement‑learning‑from‑human‑feedback (RLHF) post‑training failed to correct this bias for agentic models, prompting the company to experiment with synthetic, ethically‑focused narratives to counteract the problem. Read more