Show HN: Lance – image/video generation and understanding in one model

TLDR

  • ByteDance releases Lance, a 3B-active-parameter unified model handling text-to-image, text-to-video, editing, and understanding in a single framework.

Key Takeaways

  • Trained from scratch on 128 A100s; only the ViT and VAE encoders are pretrained, while the transformer backbone is entirely new.
  • Scores 90 overall on GenEval (matching TUNA-7B and beating the 12B-parameter FLUX.1-dev) and 85.11 on VBench, topping all listed unified and generation-only models.
  • Requires 40GB+ VRAM for inference; supports 480p video up to 121 frames and 768x768 images via a single unified CLI.
  • Six task modes (t2i, t2v, image_edit, video_edit, x2t_image, x2t_video) share one model checkpoint, reducing deployment overhead for multi-task pipelines.
  • GEdit-Bench average of 7.30 places it above BAGEL-7B and InternVL-U without chain-of-thought prompting.
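
To make the "six modes, one checkpoint" point concrete, here is a minimal sketch of how a multi-task pipeline might route requests by mode. The mode names and the 121-frame limit come from the summary above; the `TaskSpec` type, the `plan_request` helper, and the input/output mapping per mode are hypothetical and inferred from the mode names, not taken from Lance's actual CLI or code.

```python
from dataclasses import dataclass

MAX_VIDEO_FRAMES = 121  # frame limit quoted in the summary


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one task mode."""
    required: frozenset  # input modalities the mode needs (assumed)
    output: str          # modality the mode produces (assumed)


# Mode names are from the summary; the mappings are assumptions.
TASKS = {
    "t2i": TaskSpec(frozenset({"text"}), "image"),
    "t2v": TaskSpec(frozenset({"text"}), "video"),
    "image_edit": TaskSpec(frozenset({"text", "image"}), "image"),
    "video_edit": TaskSpec(frozenset({"text", "video"}), "video"),
    "x2t_image": TaskSpec(frozenset({"image"}), "text"),
    "x2t_video": TaskSpec(frozenset({"video"}), "text"),
}


def plan_request(mode: str, provided: set, num_frames: int = 1) -> str:
    """Validate a request and return the output modality it would produce."""
    if mode not in TASKS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(TASKS)}")
    spec = TASKS[mode]
    missing = spec.required - set(provided)
    if missing:
        raise ValueError(f"mode {mode!r} is missing inputs: {sorted(missing)}")
    if spec.output == "video" and not 1 <= num_frames <= MAX_VIDEO_FRAMES:
        raise ValueError(f"num_frames must be between 1 and {MAX_VIDEO_FRAMES}")
    return spec.output


if __name__ == "__main__":
    print(plan_request("t2v", {"text"}, num_frames=121))
    print(plan_request("image_edit", {"text", "image"}))
```

Because every mode resolves to the same checkpoint, a table like this is the only per-task state such a pipeline would need; the model itself is loaded once.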

Hacker News Comment Review

  • No substantive HN discussion yet.
