SANA-WM, a 2.6B open-source world model for 1-minute 720p video

TLDR

  • Paper introduces SANA-WM, a 2.6B-parameter world model generating 60-second 720p camera-controlled video on a single GPU.

Key Takeaways

  • Hybrid Linear Attention pairs frame-wise Gated DeltaNet with periodic softmax attention for memory-efficient minute-scale coherence.
  • Dual-Branch Camera Control follows 6-DoF metric trajectories via a coarse global pose branch and a fine pixel-aligned geometric branch.
  • Two-stage pipeline: the 2.6B backbone generates the video, then a 17B-parameter long-video refiner sharpens texture and late-window quality.
  • Trained on ~213K public video clips in 15 days on 64 H100s; distilled variant runs on a single RTX 5090 with NVFP4 quantization in 34 seconds per clip.
  • Benchmarks show stronger action-following than prior open-source baselines, and visual quality comparable to LingBot-World and HY-WorldPlay at 36x higher throughput.
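
To make the first takeaway concrete, here is a minimal sketch of a gated delta-rule recurrence and of a layer schedule that interleaves it with occasional softmax attention. It is the generic textbook form of the update, written token by token in pure Python, and is not taken from the SANA-WM codebase. The names `gated_delta_step` and `layer_schedule` and the period `softmax_every` are assumptions made for illustration, and the frame-wise grouping described in the paper is not modeled here.

```python
# Illustrative sketch only: generic gated delta rule, not SANA-WM's implementation.

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step over a fixed-size state.

    S     : d_v x d_k state matrix (list of rows); its size never grows
    q, k  : query and key vectors of length d_k (k assumed unit-norm)
    v     : value vector of length d_v
    alpha : decay gate in (0, 1], forgets old content
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    # Value the state currently associates with key k: S @ k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * pred k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read-out: S_new @ q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


def layer_schedule(num_layers, softmax_every):
    """Hypothetical hybrid stack: every `softmax_every`-th block uses softmax attention."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_delta"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    key = [1.0, 0.0]
    S, out1 = gated_delta_step(S, key, key, [3.0, 4.0], alpha=1.0, beta=1.0)
    # Writing a new value under the same key replaces the old one.
    S, out2 = gated_delta_step(S, key, key, [5.0, 6.0], alpha=1.0, beta=1.0)
    print(out1, out2)
    print(layer_schedule(8, 4))
```

The state `S` stays the same size no matter how many tokens have been processed, which is where the memory saving over a growing softmax key-value cache comes from. The periodic softmax blocks in the schedule are the part that still attends over explicit past tokens.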

Hacker News Comment Review

  • Weights are not yet released despite “open-source” framing; the codebase is public on GitHub but model weights are listed as coming “soon”, drawing sharp skepticism.
  • Commenters debate what “world model” means here; the consensus is that it is a video generator that reacts to game-like controls, not a scene graph or an abstract physical representation.
  • Several builders note the outputs look synthetic and game-like, likely due to Unreal Engine training data, raising questions about real-world generalization.

Notable Comments

  • @jubilanti: “Weights or it didn’t happen” – flags the gap between “open-source” marketing and unreleased weights.
  • @oersted: Clarifies the definition: a world model predicts the next world state given the current state plus an optional agent action, analogous to next-token prediction in LLMs.
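
The definition in the last comment can be sketched as an autoregressive rollout loop. This is a toy stand-in, assuming a 2D grid position as the "world state" and a displacement as the "action"; `toy_step` and `rollout` are hypothetical names, and a real video world model would replace `toy_step` with a learned network that predicts frames.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[int, int]   # toy world state: a position on a grid
Action = Tuple[int, int]  # toy action: a displacement (dx, dy)


def toy_step(state: State, action: Optional[Action] = None) -> State:
    """Hypothetical transition: apply the action, or leave the state unchanged."""
    dx, dy = action if action is not None else (0, 0)
    return (state[0] + dx, state[1] + dy)


def rollout(
    step: Callable[[State, Optional[Action]], State],
    state: State,
    actions: Sequence[Optional[Action]],
) -> List[State]:
    """Feed each predicted state back in as the next input, like next-token decoding."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    print(rollout(toy_step, (0, 0), [(1, 0), None, (0, 2)]))
```

The `None` entry shows the "optional" part of the definition: the model still advances the state when no action is supplied.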
