The paper introduces SANA-WM, a 2.6B-parameter world model that generates 60-second, 720p, camera-controlled video on a single GPU.
Key Takeaways
Hybrid Linear Attention pairs frame-wise Gated DeltaNet with periodic softmax attention for memory-efficient minute-scale coherence.
Dual-Branch Camera Control follows 6-DoF metric trajectories via a coarse global pose branch and a fine pixel-aligned geometric branch.
Two-stage pipeline: a 17B long-video refiner sharpens texture and late-window quality on top of the 2.6B backbone.
Trained on ~213K public video clips in 15 days on 64 H100s; a distilled variant with NVFP4 quantization runs on a single RTX 5090 and takes 34 seconds per clip.
Benchmarks show stronger action-following than prior open-source baselines and comparable visual quality to LingBot-World and HY-WorldPlay at 36x higher throughput.
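To make the Hybrid Linear Attention takeaway concrete, here is a minimal sketch of a token-level gated delta-rule update, the recurrence that Gated DeltaNet is built on, plus a toy layer schedule. It is an illustration under assumptions, not SANA-WM's code: the frame-wise variant, the gate parameterization, and the softmax period are not specified in this summary, so `gated_delta_step`, `layer_schedule`, and `softmax_every` are hypothetical names. Plain Python lists keep it dependency-free.

```python
import math

def gated_delta_step(S, q, k, v, alpha, beta):
    """One gated delta-rule step: S <- alpha*S + beta*(v - alpha*S k) k^T.

    S is a d_v x d_k matrix (list of rows); its size does not grow with
    sequence length, which is where the memory saving comes from.
    alpha in [0, 1] decays old memory; beta in [0, 1] sets write strength.
    """
    norm = math.sqrt(sum(x * x for x in k)) or 1.0
    k = [x / norm for x in k]                      # unit-norm key
    S = [[alpha * s for s in row] for row in S]    # gate: decay old memory
    pred = [sum(s * x for s, x in zip(row, k)) for row in S]  # current readout for k
    S = [[s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
         for row, vi, pi in zip(S, v, pred)]       # delta rule: correct toward v
    out = [sum(s * x for s, x in zip(row, q)) for row in S]   # read with query
    return S, out

def layer_schedule(n_layers, softmax_every):
    """Hypothetical hybrid stack: every `softmax_every`-th layer is softmax."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "linear"
            for i in range(n_layers)]
```

With `alpha = beta = 1`, writing a second value under the same key replaces the first instead of accumulating, which is the property that distinguishes the delta rule from plain additive linear attention.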
Hacker News Comment Review
Weights are not yet released despite “open-source” framing; the codebase is public on GitHub but model weights are listed as coming “soon”, drawing sharp skepticism.
Commenters debate what “world model” means here: the consensus is that it is a video generator that reacts to game-like controls, not a scene-graph or abstract physical representation.
Several builders note the outputs look synthetic and game-like, likely due to Unreal Engine training data, raising questions about real-world generalization.
Notable Comments
@jubilanti: “Weights or it didn’t happen” – flags the gap between “open-source” marketing and unreleased weights.
@oersted: Clarifies the definition of a world model: it predicts the next world state given the current state plus an optional agent action, analogous to next-token prediction in LLMs.
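The definition in that last comment can be sketched as a tiny interface: a function from (state, optional action) to the next state, applied autoregressively. Everything here is a toy assumption for illustration. `toy_world_model` uses hand-written 2D dynamics in place of a learned video model, and the `State` and `Action` types are hypothetical stand-ins for frames and camera moves.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, float]   # toy stand-in for a frame or latent
Action = Tuple[float, float]  # toy stand-in for a camera move

def toy_world_model(state: State, action: Optional[Action] = None) -> State:
    """Predict the next state; with no action the toy world stays put."""
    dx, dy = action if action is not None else (0.0, 0.0)
    return (state[0] + dx, state[1] + dy)

def rollout(model: Callable[[State, Optional[Action]], State],
            state: State,
            actions: Sequence[Optional[Action]]) -> List[State]:
    """Feed each prediction back in as the next input, like next-token decoding."""
    states = [state]
    for action in actions:
        state = model(state, action)
        states.append(state)
    return states
```

Calling `rollout(toy_world_model, (0.0, 0.0), [(1.0, 0.0), None, (0.0, 2.0)])` steps the state three times, with the middle step taken without an action.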