From Apple's Siri to Amazon's Alexa, voice assistants are incredibly popular and are shaping how we use and interact with technology. Voice assistants can be asked to play music, turn lights on and off, schedule appointments, and more. We may not be too far off from having an intelligent J.A.R.V.I.S.-like program (à la Iron Man) that we can command to manage most aspects of our lives.
Voice assistants utilize human voice recordings and machine learning to create the assistants' voices. How did this technology develop and how might it affect the voice recording industry? Let's explore the future now.
[Average read time: 3 minutes]
The Development of Speech Synthesis and TTS
Speech synthesis is the artificial production of human speech and is now commonly performed by computer systems with microphones and speakers (Amazon's Echo, Google Assistant, etc.). One of the earliest efforts at speech synthesis dates back to 1791 Vienna, when Wolfgang von Kempelen introduced his "Acoustic-Mechanical Speech Machine", which used a compression chamber, a vibrating reed, and a leather tube to reproduce vowel and consonant sounds.
Almost 180 years later, with speech synthesis moving away from mechanical machines to electrical synthesizers, the first full text-to-speech system for English was developed in Japan in 1968 by Noriko Umeda and her associates. Text-to-speech (TTS) systems are computer programs that convert digital text into spoken voice output.
Though monotonous in tone, Umeda's system was fairly intelligible and was able to analyze English text and approximate its pronunciation (hear sample audio here, top clip). This approximation of human speech grew more accurate as technology and programming became more sophisticated, especially in the past 10 years with the use of neural networks (computing systems modeled loosely after the brain).
Most personal devices (laptops, smartphones, tablets) have a TTS function that can be enabled in the settings. TTS helps those with visual and reading impairments, as well as those who have special educational needs (such as dyslexia), to understand written text.
Voice Recognition and STT
We've talked about text-to-speech, but what about the other way around: speech-to-text (STT)? STT is when a program listens to audio and transcribes it to text. The earliest voice recognition system was Bell Labs' "Audrey" in 1952, which could only recognize spoken digits. Ten years later, IBM's Shoebox was shown at the World's Fair and was able to recognize 16 words in English.
In the 1980s, a major breakthrough for both TTS and STT was the use of hidden Markov models, which allowed computers to estimate the probability that a given sequence of sounds corresponded to a particular word. In 1987, Worlds of Wonder's Julie doll could be trained by children to respond to their voices, but it had to be spoken to one word at a time.
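To see how a hidden Markov model lets a computer score which word a sequence of sounds most likely came from, here is a minimal sketch of the classic forward algorithm. All state names, labels, and probabilities are invented for illustration; real recognizers work on acoustic features, not letter-like symbols.

```python
# Toy hidden Markov model: score how likely a sequence of acoustic
# observations is under a word model. All numbers are invented.

def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observation sequence."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

# A two-phoneme word model ("h" then "i") with made-up probabilities.
states = ["h", "i"]
start_p = {"h": 0.9, "i": 0.1}
trans_p = {"h": {"h": 0.4, "i": 0.6}, "i": {"h": 0.1, "i": 0.9}}
emit_p = {"h": {"H": 0.8, "I": 0.2}, "i": {"H": 0.1, "I": 0.9}}

# The frame sequence "H I I" scores far higher under this word model
# than "I I H", so a recognizer would prefer the first reading.
p_match = forward_probability(["H", "I", "I"], states, start_p, trans_p, emit_p)
p_mismatch = forward_probability(["I", "I", "H"], states, start_p, trans_p, emit_p)
```

In a real system, each candidate word has its own model, and the recognizer picks the word whose model assigns the audio the highest probability.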
With the help of Google Voice Search in the 2000s and then Siri in 2011, speech recognition technology has reached a point where computers can recognize speech with up to 95% accuracy.
Speech Technology and Its Effect on the Voice Industry
From the 2000s to the early 2010s, many devices with a speech function relied on a process called concatenation. A voice actor would record different words and phrases for hours, and the recordings would then be separated into syllables and sounds. A computer would combine these sounds to form phrases and sentences. However, these voices usually sounded robotic and unnatural.
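The concatenation process described above boils down to looking up pre-recorded sound units and splicing them together. Here is a minimal sketch; the unit names are invented, and short number lists stand in for real recorded audio samples.

```python
# Minimal sketch of concatenative synthesis: pre-recorded sound units
# (stand-in sample lists rather than real audio) are looked up and
# joined end-to-end to form an utterance. Unit names are made up.

UNIT_BANK = {
    "heh": [0.1, 0.3, 0.2],   # stand-in samples for each recorded unit
    "loh": [0.4, 0.2],
    "wer": [0.3, 0.1, 0.2],
    "ld":  [0.2, 0.0],
}

def synthesize(units):
    """Concatenate the recorded snippet for each requested unit."""
    waveform = []
    for unit in units:
        waveform.extend(UNIT_BANK[unit])  # splice the snippet onto the end
    return waveform

hello_world = synthesize(["heh", "loh", "wer", "ld"])
```

The abrupt joins between units in this naive splicing are one reason concatenated voices sounded robotic; production systems smoothed the transitions, but the seams were still often audible.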
When Siri was first released in 2011, it used the voice of voice actor Susan Bennett (much to her surprise), drawing on recordings she had made six years earlier. In 2005, an interactive voice response (IVR) company asked her to spend a month recording a list of nonsensical phrases, making sure to keep the same pacing, pitch, and tone. Little did she know, she would soon become the voice of Siri for millions of consumers nationwide. Through excellent engineering, Siri was the first concatenated voice that did not sound like a robot. However, the technology has made huge leaps since 2011.
Instead of relying on concatenation, voices can now be simulated using neural text-to-speech. Neural text-to-speech is a TTS system that relies on neural networks to learn speech patterns such as stress, intonation, and rhythm, and to reproduce speech on its own. A great example of this is the first celebrity voice for Alexa, Samuel L. Jackson's, which uses neural text-to-speech technology.
Instead of having Jackson record hours of nonsensical sentences to be snipped together, the neural text-to-speech system can learn Jackson's voice from the recordings he has already committed to film. There is likely a lot more engineering involved in developing Jackson's voice for Alexa, yet it points to an interesting future in which text-to-speech systems become less and less reliant on humans spending time in the recording booth.
What Does the Future of Voice Assistants Hold?
Voice assistants are managing more and more of our lives, which has raised privacy concerns among consumers about who may be listening in. However, with the convenience and interactive ease that voice devices bring, demand has not subsided. With the voice assistant market set to expand globally, the demand for voice actors is rising as companies seek to have their devices speak in multiple languages. Now that the industry is privy to the power of the voice, companies are having voice actors sign non-disclosure agreements. So, who knows? You might be the next voice on millions of devices in your home country, but unlike Bennett, who did not have to sign an NDA, you may not be able to tell your friends about it.
Need help with understanding key voice-over and dubbing terms for localization? Click on the box below for a free glossary we've compiled, just for you.