Researchers evaluated an AI video generation model on a series of tasks, with widely varying results. The model succeeded on some trials but failed repeatedly on others, such as generating a specific character on a grid, lighting a Bunsen burner, solving a simple maze, and sorting numbered bubbles. The authors treat any success, however infrequent, as evidence of underlying capability: a task counts as a true failure only if the model fails in every trial. They argue that future unified vision models will need far higher consistency to be practical.