Stable Audio 3

TLDR

  • The paper introduces Stable Audio 3, a family of latent diffusion models for fast, variable-length audio generation and editing; weights for the small and medium sizes are open.

Key Takeaways

  • Three model sizes (small, medium, large) built on a semantic-acoustic autoencoder that compresses audio into a compact latent space preserving fidelity and semantic structure.
  • Variable-length generation avoids the cost of full-length inference for short sounds; inpainting enables targeted edits and continuation of existing recordings.
  • Adversarial post-training accelerates inference and improves quality, cutting required diffusion steps while boosting prompt adherence.
  • Generates audio in under 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4; small and medium weights are released with full training and inference code.
  • Training data consists of licensed and Creative Commons material, addressing a persistent legal concern in generative audio models.
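
To make the variable-length and inpainting points concrete, here is a minimal sketch of how clip duration maps to latent length and how an edit region could be marked over latent frames. It is an illustration only, not the paper's implementation: the function names, the 44.1 kHz sample rate, and the 2048-sample hop per latent frame are all assumptions chosen for the example.

```python
import math

# Hypothetical values for illustration; the summary does not state the
# model's sample rate or the autoencoder's compression factor.
SAMPLE_RATE = 44_100   # audio samples per second (assumed)
HOP = 2_048            # audio samples per latent frame (assumed)


def latent_frames(duration_s: float) -> int:
    """Number of latent frames needed to cover duration_s seconds of audio."""
    return math.ceil(duration_s * SAMPLE_RATE / HOP)


def inpaint_mask(total_frames: int, start_s: float, end_s: float) -> list[int]:
    """Per-frame mask: 1 = regenerate this frame, 0 = keep the original."""
    first = math.floor(start_s * SAMPLE_RATE / HOP)
    last = min(total_frames, math.ceil(end_s * SAMPLE_RATE / HOP))
    return [1 if first <= i < last else 0 for i in range(total_frames)]


# A short sound needs far fewer latent frames than a long track,
# which is where variable-length generation saves compute.
short_clip = latent_frames(2.0)
long_clip = latent_frames(190.0)

# Mark seconds 4 to 6 of a 10-second recording for a targeted edit.
mask = inpaint_mask(latent_frames(10.0), start_s=4.0, end_s=6.0)
```

Under these assumed values a 2-second sound occupies a small fraction of the latent frames of a multi-minute track. Continuation is the same masking idea with the regenerated region placed at the end of the clip.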

Hacker News Comment Review

  • No substantive HN discussion yet.
