Gemini Omni

May 19, 2026 · ai · Source ↗

TLDR

Google’s Gemini Omni is a multimodal video generation and editing model supporting cross-modal prompting: add sounds, swap environments, change camera angles, and edit physics from any input.

Accepts video, image, and object inputs together; prompts can trigger conditional events like playing animal sounds when a finger touches a toy.
Supports iterative editing chains applied in sequence: transport a violinist to a new environment, make the violin invisible, then shift to an over-the-shoulder camera angle.
Targets long-form coherent generation: 26-item alphabet video with lower thirds, music sync, and pacing controlled via single prompt.
All Gemini app, Google Flow, and YouTube Shorts outputs carry SynthID watermarks and C2PA Content Credentials; Chrome and Search verification coming.
Requires Google AI subscription; feature availability varies by tier and geography.

Commenters broadly agree spatial coherence and physics remain broken: geometry shifts when objects leave frame, and rigid body discontinuities (Jenga, marble track) expose a likely architectural ceiling in video-trained models.
Users doing production work (real estate, creative video) report that Seedance 2 still outperforms Omni Flash on quality, which suggests the demo prompts were cherry-picked.
A cultural fatigue thread emerged: AI video’s ability to fake anything has eroded the emotional impact of impressive footage, which commenters see as an underappreciated cost.

@blt: Demo marble video claimed to show “real-world physics” but the marble jumps upward unprompted and gains speed with no energy source.
@torginus: Geometry that changes as objects leave and re-enter frame suggests deep spatial understanding is still unsolved, possibly a fundamental training structure issue.
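
@blt's objection is an energy-conservation argument: a passive marble on a track can lose mechanical energy to friction but cannot gain it. A minimal sketch of that check follows, assuming height and speed have already been estimated from the footage by some tracking step that is not shown. The `Sample` type, the `energy_violations` helper, and the tolerance value are all hypothetical, and rotational energy is ignored for simplicity.

```python
from dataclasses import dataclass

G = 9.81  # standard gravity, m/s^2


@dataclass
class Sample:
    """One tracked observation of the marble (hypothetical input format)."""
    t: float       # time in seconds
    height: float  # metres above an arbitrary reference level
    speed: float   # metres per second


def specific_energy(s: Sample) -> float:
    """Mechanical energy per unit mass (J/kg): potential plus translational kinetic."""
    return G * s.height + 0.5 * s.speed ** 2


def energy_violations(samples, tolerance=0.05):
    """Return (t_start, t_end, gain) for each step where energy rises by more than tolerance.

    tolerance is an assumed allowance, in J/kg, for measurement noise.
    """
    violations = []
    for prev, cur in zip(samples, samples[1:]):
        gain = specific_energy(cur) - specific_energy(prev)
        if gain > tolerance:
            violations.append((prev.t, cur.t, gain))
    return violations


# Illustrative numbers only, not measured from the demo video.
track = [
    Sample(t=0.0, height=0.5, speed=0.0),  # released at rest
    Sample(t=0.5, height=0.3, speed=1.9),  # rolls downhill, small friction loss
    Sample(t=1.0, height=0.4, speed=2.5),  # climbs AND speeds up: not physical
]
flagged = energy_violations(track)
```

With these made-up samples only the second step is flagged, because the marble ends up both higher and faster than before. Real footage would need noise-aware tracking and a calibrated scale before such a check could be trusted.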