ElevenLabs on Why Voice Will Be the Interface for AI


Published 2026-05-06 - Runtime about 27 min - Watch on YouTube

ElevenLabs bet on audio in 2022, when text and image models dominated, and built a frontier business without first raising giant amounts of capital. Mati Staniszewski argues voice will become the interface for agents, robots, and authenticated human-computer interaction, with emotion and trust as the next bottlenecks.

What Matters

  • Poland’s single-voice dubbing habit exposed a bigger problem: audio still fails at language, emotion, and context.
  • ElevenLabs started with text-to-speech, then added speech-to-text, dubbing, real-time voice agents, and music; the stack now spans audio generation and orchestration.
  • The company chose remote hiring, recruiting researchers based on their GitHub work rather than their location, and monetized early to fund model development.
  • Staniszewski says ElevenLabs is just over 400 people, with over $400M in revenue, while keeping product, research, GTM, and ops teams under 10 people each.
  • Voice agents are moving beyond support into sales: Deliveroo uses them for restaurant openings, and Deutsche Telekom-style inbound flows capture more customer intent than forms.
  • The biggest overlooked use cases are citizen support, education, and healthcare; Ukraine’s government uses voice access for frontline updates, education help, and safety guidance.
  • The next hard problem is emotional intelligence: agents should detect stress, excitement, and speaking speed, then adjust tone, pacing, and reassurance in real time.
  • He expects authenticated voice to matter as much as detection: the future may assume most audio is fake unless it is watermarked and verified.