It’s hard to tell whether the next Darth Vader voice you hear will be James Earl Jones or an AI clone of the legendary actor’s voice.
Jones plans to step back from voicing the popular Star Wars character, and his decision to let AI technology reconstruct his voice for the movies shows the dramatic growth of voice AI technology in recent years. However, it also highlights some ethical concerns.
“James Earl Jones is a masterpiece when it comes to deeply modulated voice,” said Constellation Research analyst Andy Thurai. “It’s great that he’s graciously donating his voice to create voice cloning for conversational AI. I see it growing in many areas.”
The terms under which Jones signed over the rights to his voice to AI vendor Respeecher are not publicly known.
Although enterprise use of the technology is still relatively low, as more organizations begin to use conversational AI tools, the global voice cloning market size could surpass the $5 billion mark by the end of the decade, according to some market research findings.
Voice cloning comes in two main forms: text-to-speech, in which AI translates written text into spoken audio, and speech-to-speech, in which AI converts one person’s speech into a mimicry of another person’s voice. Speech-to-speech is the form used to clone Jones’ voice.
Ukraine-based Respeecher helps content creators clone voices using machine learning and AI.
“[We focus] more on speech-to-speech for control and fine-tuning,” said Respeecher’s Dmytro Bielievtsov, who co-founded the company in 2018.
The vendor’s technology uses a voice conversion system in which its algorithms are exposed to speech from many different speakers. The system learns how speech works at the phonetic level, including the different sounds humans make, and then imitates it.
Beyond imitating individual sounds, it also captures the source speaker’s inflections and intonations and carries them over into the new speech, using its previously learned knowledge.
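Conceptually, a speech-to-speech pipeline like the one described above separates *what* is said (phonetic content and prosody) from *who* says it (speaker identity), then recombines the content with a target speaker's voice characteristics. The following is a heavily simplified, illustrative sketch of that decomposition; every class, function, and feature here is a hypothetical placeholder, not Respeecher's actual system:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One short slice of speech (hypothetical simplified representation)."""
    phoneme: str   # which sound is being made
    pitch: float   # intonation, in Hz
    energy: float  # loudness / emphasis


@dataclass
class SpeakerProfile:
    """Stand-in for a learned model of a target speaker's voice."""
    name: str
    base_pitch: float  # the speaker's typical fundamental frequency


def extract_content(frames: List[Frame]) -> List[Frame]:
    """Keep phonetic content and relative prosody; discard speaker identity.

    Pitch is mean-normalized so only the intonation contour (rises and
    falls) survives, not the source speaker's absolute voice pitch.
    """
    mean_pitch = sum(f.pitch for f in frames) / len(frames)
    return [Frame(f.phoneme, f.pitch - mean_pitch, f.energy) for f in frames]


def resynthesize(content: List[Frame], target: SpeakerProfile) -> List[Frame]:
    """Reapply the intonation contour on top of the target speaker's voice."""
    return [Frame(f.phoneme, target.base_pitch + f.pitch, f.energy)
            for f in content]


# A stand-in actor's line, converted to a much deeper target voice:
source = [Frame("HH", 180.0, 0.6), Frame("AY", 210.0, 0.9)]
target = SpeakerProfile("deep-voice-target", base_pitch=95.0)
converted = resynthesize(extract_content(source), target)
```

The key design idea is that the phonemes and the *shape* of the intonation survive conversion unchanged, while the absolute pitch is re-anchored to the target speaker. Real systems do this with learned neural encoders and vocoders rather than hand-built features.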
The technique is common in Hollywood, especially for scenes that require stand-in actors, Bielievtsov said. “It simplifies things and any good actor can be a voice stand-in,” he said.
Respeecher has previously used the technology for several projects, including Mark Hamill’s voice in The Mandalorian, where it was used to de-age the actor’s voice. The vendor also worked on a Super Bowl project recreating the voice of late football coaching great Vince Lombardi.
Despite its famous clientele, Respeecher has had to work to improve its technology so that the voice the AI produces sounds realistic rather than mechanical.
“Getting really high-quality speech is challenging,” Bielievtsov said. “Just getting the sound quality high enough and not having too many artifacts, that was a challenge that took us a long time.”
Another challenge facing the vendor is data efficiency and the time it takes to train an AI model, Bielievtsov continued. The vendor wants to be able to train a model on just five seconds of someone’s speech; currently, it requires about five minutes. Respeecher is working to optimize its models so that they need less audio.
In text-to-speech voice cloning, an AI model translates written text into speech.
Text-to-speech allows users to create a variety of tones, accents and languages. It is also known as synthetic speech.
Most major technology vendors have text-to-speech offerings. For example, the Google Cloud text-to-speech API powers Custom Voice, which enables developers to train a custom voice model using audio recordings. Microsoft Azure also lets users create custom voices for text-to-speech apps. And Nvidia has demonstrated speech synthesis tools that developers can use with the vendor’s Omniverse platform and its avatars for conversational AI offerings.
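These cloud offerings share a similar request shape: the caller supplies the text, a voice selection (language and, optionally, a named custom voice model), and an audio format, and receives synthesized audio back. Below is a minimal, stdlib-only sketch of that request body, with field names modeled on the Google Cloud Text-to-Speech REST API; the voice name used in the example is illustrative, and a real integration would send this payload via an authenticated client such as the `google-cloud-texttospeech` library:

```python
import json
from typing import Optional


def build_tts_request(text: str,
                      language_code: str = "en-US",
                      voice_name: Optional[str] = None,
                      audio_encoding: str = "MP3") -> str:
    """Build a JSON body for a text-to-speech synthesis request.

    Field names follow the shape of the Google Cloud Text-to-Speech
    REST API; other vendors' text-to-speech APIs are broadly similar.
    """
    voice = {"languageCode": language_code}
    if voice_name:
        # e.g., a custom voice model trained on a speaker's recordings
        voice["name"] = voice_name
    body = {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {"audioEncoding": audio_encoding},
    }
    return json.dumps(body)


# Example: request English speech for a short call-center greeting.
payload = build_tts_request("Welcome to the call center.",
                            voice_name="en-US-Wavenet-D")
```

The same builder could target a different language or dialect by changing `language_code`, which is the axis along which analysts expect these services to keep improving.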
AI avatars and digital humans in the metaverse have fueled interest in text-to-speech voice cloning, according to Gartner analyst Annette Jump.
“[There is] interest in AI avatars as part of conversational AI — humanizing virtual assistants or using avatars or digital humans in the metaverse, where you create a digital version of yourself or a digital representative of your company,” Jump said.
Cloned speech is useful in call centers, where agents speak a different language, and in settings where a digital avatar is required.
For example, for enterprises that use digital avatars to support call center agents, synthetic speech can communicate with customers before transferring them to an actual agent. Admissions departments can also use synthetic speech with digital avatars to attract new students. And users can interact with the avatar as if they were talking to a real person.
“Synthetic voice quality is significantly better today than it was five or seven years ago,” Jump added. “But there are definitely ways to make it even better in terms of more different languages or dialects.”
Speech-to-speech is also used when people have trouble understanding call center agents due to varying accents. “They may want to adjust in real time what they want to express and not take too much away from their emotions,” Bielievtsov said.
According to Thurai, other enterprise applications may include using the technology in an advisory capacity. “If I have a question, I can talk to someone to get an answer in their own voice, but not a real person,” he said.
While Respeecher is primarily targeting its speech-to-speech technology at Hollywood filmmakers and directors, the company wants to enter healthcare.
For example, patients who have had their vocal cords removed could use a device with embedded speech-to-speech AI technology to restore their voice.
“A real-time voice conversion device for medical purposes will improve their lives, making the voice sound more natural,” Bielievtsov said.
As the possibilities for voice cloning technologies — both synthetic voice and speech-to-speech — continue to grow, other applications illustrate the ethical questions the technology raises.
Recently, Podcast.ai released a podcast of an interview between Steve Jobs and Joe Rogan. It sounds like the two men’s real voices, but it is entirely AI-generated.
“As the technology progresses, certainly, you only need a very limited amount of someone’s speech to clone it,” Jump said.
This brings the question of privacy to the fore, because the technology makes it easier for bad actors to misuse people’s voices.
“You’re just one hacker away from someone cloning your voice to hack into your voice-activated systems,” Thurai said.
Voice cloning can also have a significant impact on political campaigns.
“A lot of fake videos and audios may start circulating during the political campaign season, trying to incriminate the opponents,” Thurai added. “It becomes very difficult to prove the origin of the video or audio.”
According to Yugal Joshi, an analyst at Everest Group, another concern is finding ways to use the technology to augment humans rather than replace them, as happened in the James Earl Jones case.
“Eventually, [enterprises] replace people through these systems, and they have to strategize and plan for that day,” he said.
“The main challenge of these systems is that they usually fail in the real world,” added Joshi. “When there’s a dedicated use case, like imitating James’ voice or Steve Jobs, you also have the opportunity to edit and improve. In a real-life context, if damage is done, it’s done. No retakes.”