The company’s blog post brims with the enthusiasm of a 1990s American commercial. WellSaid Labs describes what customers can expect from its “eight new digital actors!” Tobin is “energetic and insightful.” Paige is “calm and expressive.” Ava is “sleek, confident and professional.”
Each of them is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out comes a crisp audio clip of a natural-sounding performance.
WellSaid Labs, a Seattle-based startup that was spun off from the research nonprofit Allen Institute for Artificial Intelligence, is the latest company to offer AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video game characters.
Not so long ago, such deepfake voices had a somewhat bad reputation because of their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to reproduce many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they talk for too long, but in short audio clips some have become indistinguishable from humans.
AI voices are also cheap, scalable and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new possibilities for personalizing advertising.
But the rise of hyper-realistic fake voices is not without consequences. Human voice actors in particular are left to wonder what this means for their livelihoods.
How to fake a voice
Synthetic voices have been around for some time. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds, producing a clunky, robotic effect. Making them sound more natural was a laborious manual task.
Deep learning has changed that. Voice developers no longer needed to dictate the exact rhythm, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
Over the years, researchers have used this basic idea to build increasingly sophisticated voice engines. The one WellSaid Labs built, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like, including accent, pitch and tone. The second fills in the details, including breaths and the way the voice resonates in its environment.
However, creating a compelling synthetic voice takes more than pressing a button. Part of what makes a human voice so human is its inconsistency, its expressiveness, and its ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tuning the deep-learning models. WellSaid says the process requires at least an hour or two of audio and several weeks of effort to develop a realistic-sounding synthetic replica.
AI voices have become especially popular among brands looking to maintain a consistent sound across millions of customer interactions. With today’s ubiquity of smart speakers and the rise of automated customer service agents, as well as digital assistants built into cars and smart devices, brands may need to produce more than a hundred hours of audio per month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology, a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they have to start thinking about the way their voice sounds.”
While companies once had to hire different voice actors for different markets (the northeastern versus the southern United States, or France versus Mexico), some voice AI companies can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of customizing ads on streaming platforms depending on who is listening, changing not only the characteristics of the voice but also the words that are spoken. A beer ad could tell a listener to stop by a different pub depending on whether it is playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it is already working with clients to launch such personalized audio ads on Spotify and Pandora.
The gaming and entertainment industries are also seeing the benefits. Sonantic, a firm that specializes in emotive voices that can laugh and cry or whisper and shout, works with video game makers and animation studios to provide voices for their characters. Many of its clients use the synthesized voices only in pre-production and switch to real voice actors for the final production. But Sonantic says a few have started using them throughout the process, perhaps for characters with fewer lines. Resemble.ai and others have also worked with movies and TV shows to patch up actors’ performances when words are garbled or mispronounced.