ElevenLabs on Why Voice Will Be the Interface for AI
Published 2026-05-06 - Runtime about 27 min - Watch on YouTube
ElevenLabs bet on audio in 2022, when text and image models dominated, and built a frontier business without the capital-first playbook of the giant labs. Mati Staniszewski argues voice will become the interface for agents, robots, and authenticated human-computer interaction, with emotion and trust as the next bottlenecks.
What Matters
- Poland’s single-voice dubbing habit exposed a bigger problem: audio still fails at language, emotion, and context.
- ElevenLabs started with text-to-speech, then added speech-to-text, dubbing, real-time voice agents, and music; the stack now spans audio generation and orchestration.
- The company chose remote hiring, recruiting researchers on the strength of their GitHub work rather than their location, and monetized early to fund model development.
- Staniszewski says ElevenLabs is just over 400 people, with more than $400M in revenue, while keeping product, research, GTM, and ops teams under 10 people each.
- Voice agents are moving beyond support into sales: Deliveroo uses them for restaurant openings, and Deutsche Telekom-style inbound flows capture more customer intent than forms.
- The biggest overlooked use cases are citizen support, education, and healthcare; Ukraine’s government uses voice access for frontline updates, education help, and safety guidance.
- The next hard problem is emotional intelligence: agents should detect stress, excitement, and speaking speed, then adjust tone, pacing, and reassurance in real time.
- He expects authenticated voice to matter as much as detection: the future may assume most audio is fake unless it is watermarked and verified.