Gemini Omni

TLDR

  • Google’s Gemini Omni is a multimodal video generation and editing model supporting cross-modal prompting: add sounds, swap environments, change camera angles, and edit physics from any input.

Key Takeaways

  • Accepts video, image, and object inputs together; prompts can trigger conditional events like playing animal sounds when a finger touches a toy.
  • Supports iterative editing chains: transport a violinist to a new environment, make the violin invisible, then shift to an over-the-shoulder camera angle, one step after another.
  • Targets long-form coherent generation: a 26-item alphabet video whose lower thirds, music sync, and pacing are all controlled by a single prompt.
  • All outputs from the Gemini app, Google Flow, and YouTube Shorts carry SynthID watermarks and C2PA Content Credentials; verification in Chrome and Search is coming.
  • Requires Google AI subscription; feature availability varies by tier and geography.

Hacker News Comment Review

  • Commenters broadly agree spatial coherence and physics remain broken: geometry shifts when objects leave frame, and rigid body discontinuities (Jenga, marble track) expose a likely architectural ceiling in video-trained models.
  • Users doing real production work (real estate, creative video) report that Seedance 2 still outperforms Omni Flash on quality, which suggests the demo prompts were cherry-picked.
  • A cultural fatigue thread emerged: AI video’s ability to fake anything has eroded the emotional impact of impressive footage, which commenters see as an underappreciated cost.

Notable Comments

  • @blt: The demo marble video claims to show “real-world physics,” but the marble jumps upward unprompted and gains speed with no energy source.
  • @torginus: Geometry that changes as objects leave and re-enter frame suggests deep spatial understanding is still unsolved, possibly a fundamental training structure issue.