ElevenLabs on Why Voice Will Be the Interface for AI


Published 2026-05-06 - Runtime about 27 min - Watch on YouTube

ElevenLabs bet on audio in 2022, when text and image models dominated, and built a frontier business without first raising giant amounts of capital. Mati Staniszewski argues voice will become the interface for agents, robots, and authenticated human-computer interaction, with emotion and trust as the next bottlenecks.

What Matters

  • Poland’s single-voice dubbing habit exposed a bigger problem: audio still fails at language, emotion, and context.
  • ElevenLabs started with text-to-speech, then added speech-to-text, dubbing, real-time voice agents, and music; the stack now spans audio generation and orchestration.
  • The company chose remote hiring, recruiting researchers based on their GitHub work rather than their location, and monetized early to fund model development.
  • Staniszewski says ElevenLabs is just over 400 people, with over $400M in revenue, while keeping product, research, GTM, and ops teams under 10 people each.
  • Voice agents are moving beyond support into sales: Deliveroo uses them for restaurant openings, and Deutsche Telekom-style inbound flows capture more customer intent than forms.
  • The biggest overlooked use cases are citizen support, education, and healthcare; Ukraine’s government uses voice access for frontline updates, education help, and safety guidance.
  • The next hard problem is emotional intelligence: agents should detect stress, excitement, and speaking speed, then adjust tone, pacing, and reassurance in real time.
  • He expects authenticated voice to matter as much as detection: the future may assume most audio is fake unless it is watermarked and verified.