A new study published in PNAS Nexus reports that leading large language models, including GPT‑4o and Claude 3.5 Sonnet, perform poorly on the Stroop test, especially as task length increases. While humans maintain roughly 95% accuracy even in extended trials, the AI systems’ accuracy drops sharply, pointing to a fundamental gap in executive attention that the researchers say must be addressed before artificial general intelligence can be realized.