Google used its I/O developer conference to roll out Gemini Omni, the next step in the company’s quest for a truly multimodal large language model. The family of models promises to turn any combination of text, images, audio and video into a coherent output, starting with video creation.

Gemini Omni Flash, the first model in the family, was released today and can generate up to ten seconds of video. It will be available through the Gemini mobile app, YouTube Shorts and Flow, Google's AI creative studio, giving consumers a tool that feels as simple as typing a prompt. Google frames the limit as a product decision, not a technical barrier, and says longer clips are on the roadmap.

Unlike earlier offerings that merely stitched inputs together, Omni “reasons” across modalities, producing footage that reflects an understanding of physics, culture, history and science. In a demo, DeepMind chief technologist Koray Kavukcuoglu asked the system for a "claymation explainer of protein folding" and received a stop‑motion video complete with voice‑over describing amino‑acid chains, alpha helices and beta sheets.

Google DeepMind director of product management Nicole Brichtova positioned Gemini Omni as more than an update to the company’s Veo video model. She called it “the next step toward the progression of combining the intelligence of Gemini with the rendering capabilities of our media models.” The system also lets users edit photos using plain‑language commands, a feature reminiscent of Google’s earlier Nano Banana prototype.

Sundar Pichai, Google’s CEO, highlighted the broader ambition: moving AI from predicting text to “simulating reality.” He noted that training Gemini on a mix of text, code, audio, images and video yields a deeper world model, enabling capabilities such as generating images from audio or audio from video.

Gemini Omni includes safeguards against deepfake misuse. Users must complete a product onboarding process that records their voice speaking a series of numbers, creating a verified digital avatar stored for future use. Every video generated carries Google’s SynthID digital watermark, allowing viewers to confirm the content’s AI origin.

Beyond consumer applications, Google plans to expose Gemini Omni via API in the coming weeks. Brichtova suggested that advertisers, filmmakers and other creators could leverage the end‑to‑end multimodal workflow for campaigns and productions. A higher‑performance variant, Omni Pro, is promised for later, though no release date was given.

The announcement signals Google’s confidence that a multimodal AI can bridge the gap between experimental research and everyday tools, positioning the company to compete directly with rivals that have recently introduced video‑generation features.

This article was written with the assistance of AI.

Google unveils Gemini Omni, multimodal AI that creates videos from text, images and audio