As a professional localization studio, we pride ourselves on working with some of the best voice talent in the world. Great voice-over actors can modulate their voices and express strong emotions with such ease and nuance that it's hard to believe a computer could ever do the same. Enter deepfakes. Deepfakes are a harbinger of drastic changes to come in the entertainment, media, and political landscape.
[Average read time: 4 minutes]
A Brief Intro to Deepfakes
"Deepfake" is a combination of the terms "deep learning" and "fake". Deep learning refers to a machine's ability to learn from data using artificial neural networks, enabling complex tasks such as recreating a voice or mapping one person's face onto another's.
Fake audio and video are not a new phenomenon and have been used in movies for many years. In "Forrest Gump" (1994), editors painstakingly superimposed actor Tom Hanks into historical events, such as meeting President John F. Kennedy, using archival footage. Since then, however, computers have become exponentially more powerful and more widely available. What once took a whole team of editors many hours of laborious manipulation of celluloid film can now be done in far less time by a single individual with a consumer laptop and free software.
The democratization of technology has let ordinary citizens create their own content and access audio and video from around the world. At the same time, it has made it possible for individuals with ill intent to fabricate audio and video for their own purposes.
So what does that mean for the future of voice acting? For media? For the world at large? This is uncharted territory, but we can already see a few consequences.
Unauthorized Use of One's Voice
In 2018, a video of former President Obama making disparaging remarks about President Trump surfaced. At the end of the video, it was revealed that actor Jordan Peele's voice had been modified to sound like Obama's, and the mouth movements had been grafted on by a computer. Meant as a PSA, the video showed how deepfake technology could be used as a political weapon.
With machine learning, software can now clone someone's voice from just 3.7 seconds of audio. With more source material, voice cloning software sounds even more accurate and realistic. Voice actors who have hours of recordings publicly available may be at risk of having their voices cloned and used without their permission.
Public figures like Obama or Joe Rogan, whose voice was recently cloned in an artificial intelligence recording, have hundreds of hours of publicly available audio, making them prime targets for realistic voice cloning. What may give voice actors an advantage is that we have yet to hear an AI voice that is 100% convincing in terms of the emotional depth and intonation a talented voice actor can deliver.
Yet, as in the Obama example, what if there is a talented voice actor behind a deepfake? It then seems possible for a voice actor's voice to be modified to fit any number of different characters while still maintaining a strong performance. This opens up another consequence of the technology: voice actors being replaced by modulated voices.
Replacing Voice Talent
At E3 2019, the video game Watch Dogs Legion was revealed and garnered media attention. The game features a large number of characters, each with unique skills and a unique backstory. The attention, however, was not just about Watch Dogs Legion's gameplay; the game also had an interesting production story: instead of hiring a large roster of voice actors to cover all the characters, the studio used voice modulation.
The studio had a single voice actor read the scripts for multiple characters, then modulated that one voice to sound like entirely different people. This approach helps the studio launch on time, cut costs, and reduce the game's total file size. Given all these benefits, it's foreseeable that other video game studios will soon follow suit, potentially shrinking the pool of available voice acting jobs in the near future.
Industries like animation, eLearning, and movie dubbing will also be affected by voice modulation and machine learning. Minor characters, or eLearning modules that don't require a wide emotional or tonal range, will be easier for a machine to learn and eventually replace. As audio machine learning becomes more sophisticated, larger roles may be at risk as well. That can be a scary thought for voice actors in an already competitive professional landscape. Is there a silver lining?
Cloned voices still have a long way to go before they can match professional voice talent. As anyone with a voice assistant device knows, automated assistants still sound noticeably robotic. And voice modulation still needs a professional human voice actor as its source. The effects this technology is having, and will have, on the voice-over industry cannot be overstated, yet the existential and political consequences have people even more worried.
Our eyes and ears are incredibly sensitive and perceptive instruments, and ever since the invention of the first photographic camera and audio recorder, we've relied on images and sound as an accurate record of human experience. But what happens when what we see and hear on screen may be fabricated?
Doctored images have a long history, but adding convincing fake audio to the mix will make the "reality" we see and hear in the media even harder to trust. The U.S. government and other organizations are racing to develop ways to detect deepfakes, since deepfakes targeting political figures pose a real danger. Hopefully, the world as a whole will be able to meet this danger and learn to use this technology responsibly.