The company’s blog post exudes the enthusiasm of a ’90s American infomercial. WellSaid Labs describes what customers can expect from its “Eight New Digital Voice Players!” Tobin is “energetic and insightful.” Paige is “calm and expressive.” Ava is “polite, confident, and professional.”
Each is based on a real voice actor whose likeness has been preserved, with consent, using AI. Businesses can now have these voices say whatever they need: they simply feed text into the speech engine, which outputs a crisp audio clip of a natural-sounding performance.
WellSaid Labs, a Seattle-based startup spun out of the nonprofit Allen Institute of Artificial Intelligence, is the latest company to bring AI voices to its customers. For now, it specializes in voices for corporate e-learning videos. Other startups are making voices for digital assistants, call-center operators, and even video game characters.
Not so long ago, such deepfake voices got a bad rap for their use in fraudulent calls and internet deception. But improvements in their quality have since attracted the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to reproduce many of the subtleties of human speech. These voices pause and breathe in the right places. They can change style or emotion. You can spot the trick if they speak for too long, but in short audio clips some have become indistinguishable from humans.
AI voices are also inexpensive, scalable, and easy to use. Unlike a recording of a human voice actor, a synthetic voice can have its script updated in real time, opening up new opportunities for personalized advertising.
But the rise of hyperrealistic false voices is not without consequences. Human voice actors, in particular, have been left to wonder what this means for their livelihoods.
How to simulate a voice
Synthetic voices have been around for some time. But older ones, including the voices of the original Siri and Alexa, simply stitched words and sounds together, producing a clunky, robotic effect. Making them sound more natural was a laborious manual task.
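To make the contrast concrete, here is a toy illustration (not any product’s actual code) of that old concatenative approach: prerecorded snippets are simply glued together end to end, with no prosody model at all, which is why the result sounds clunky and robotic. The clip inventory and tone generator below are purely illustrative stand-ins.

```python
import numpy as np

SAMPLE_RATE = 16_000

def tone(freq_hz: float, seconds: float) -> np.ndarray:
    """Stand-in for a prerecorded audio snippet: a plain sine tone."""
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# A tiny "unit inventory": one canned clip per word, recorded once, reused forever.
CLIPS = {
    "hello": tone(220, 0.4),
    "world": tone(180, 0.5),
}

def concatenative_tts(text: str) -> np.ndarray:
    """Glue clips together -- no pauses, no breaths, fixed intonation."""
    pieces = [CLIPS[word] for word in text.lower().split()]
    return np.concatenate(pieces)

audio = concatenative_tts("hello world")
print(audio.shape)  # 0.9 s of audio at 16 kHz -> (14400,)
```

Because every word is a fixed recording, the same word always sounds identical regardless of context, which is exactly the robotic flatness the newer systems avoid.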
Deep learning changed that. Voice developers no longer needed to hand-specify the exact pace, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and let it learn those patterns on its own.
“If I’m Pizza Hut, I sure can’t sound like Domino’s, and I sure can’t sound like Papa John’s.”
Rupal Patel, founder and CEO of VocaliD
Over the years, researchers have used this basic idea to build increasingly sophisticated voice engines. The one built by WellSaid Labs, for example, uses two main deep-learning models. The first predicts, from a passage of text, the broad outline of how a speaker will sound, including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its surroundings.
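WellSaid’s models are proprietary, but the general two-stage idea can be sketched with toy stand-ins: a first stage turns text into a coarse “outline” (here, just a pitch contour), and a second stage renders that outline into a detailed waveform (here, adding a little breath-like noise). Every function and constant below is a hypothetical simplification, not the company’s actual pipeline.

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_SECONDS = 0.08  # one coarse frame per character of input text

def stage1_outline(text: str) -> np.ndarray:
    """Toy stand-in for the first model: text -> coarse pitch contour (Hz).

    A real model learns this mapping from hours of a voice actor's audio;
    here we just use a fixed rule per character.
    """
    return np.array([120.0 + 5 * (ord(c) % 20) for c in text.lower()])

def stage2_render(contour_hz: np.ndarray) -> np.ndarray:
    """Toy stand-in for the second model: outline -> detailed waveform.

    Fills in fine detail the outline lacks -- here, a sine carrier per
    frame plus faint noise standing in for breaths and room resonance.
    """
    rng = np.random.default_rng(0)
    n = int(SAMPLE_RATE * FRAME_SECONDS)
    t = np.arange(n) / SAMPLE_RATE
    frames = []
    for f in contour_hz:
        harmonic = np.sin(2 * np.pi * f * t)    # voiced detail at the outlined pitch
        breath = 0.05 * rng.standard_normal(n)  # breathy texture
        frames.append(harmonic + breath)
    return np.concatenate(frames)

outline = stage1_outline("hello")   # 5 characters -> 5 coarse frames
audio = stage2_render(outline)
print(outline.shape, audio.shape)   # (5,) (6400,)
```

The design point the split illustrates: the outline model only has to get the big picture right, while the renderer handles the fine acoustic texture that makes a voice sound human.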
However, creating a compelling synthetic voice takes more than the push of a button. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the right training data and fine-tuning the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of work to develop a realistic-sounding synthetic voice.
AI voices have become particularly popular among brands looking to maintain a consistent sound across millions of customer interactions. With the ubiquity of smart speakers and the rise of automated customer-service agents and digital assistants built into cars and smart devices, brands may need to produce more than a hundred hours of audio per month. But they no longer want to use the generic voices offered by traditional text-to-speech technology, a trend that accelerated during the pandemic as more customers skipped in-store interactions and dealt with businesses virtually.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they also need to start thinking about how their voice sounds.”
Whereas companies once had to hire different voice actors for different markets (the Northeast versus the Southern United States, or France versus Mexico), some voice AI firms can now manipulate a single voice’s accent or switch its language. This opens up the possibility of tailoring ads on streaming platforms to whoever is listening, changing not only the characteristics of the voice but also the words spoken. A beer ad might tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it is already working with clients to launch such personalized audio ads on Spotify and Pandora.
The gaming and entertainment industries are also seeing the benefits. Sonantic, a company that specializes in emotional voices that can laugh and cry or whisper and scream, works with video game makers and animation studios to provide voice-overs for their characters. Many of its clients use synthesized voices only in preproduction and switch to real actors for the final production, but Sonantic says a few have started using them throughout the process, perhaps for characters with fewer lines. Resemble.ai and others have also worked with film and TV productions to patch up actors’ performances when words are garbled or mispronounced.