Google’s DeepMind Gives Robots A Life-Like Voice

By Jonathan Lee

Published: Sep 13th, 2016

Updated: Jan 13th, 2017

The idea of carrying on a life-like conversation with a computer is, depending on where you sit, an interesting proposition, too creepy to consider, or a bit of both. That proposition has moved closer to reality thanks to the efforts of Google’s AI researchers.

Google’s DeepMind team has made a breakthrough in speech generation for machines. The same team that created AlphaGo, the artificial intelligence that defeated Go grandmaster Lee Sedol, has developed WaveNet, a piece of software that imbues computers with eerily life-like voices.

WaveNet, according to DeepMind, is a “deep generative model of raw audio waveforms” that synthesizes realistic speech using real samples of human speech as well as previously generated audio. Google claims that WaveNet mimics human voices well enough to outperform all other Text-to-Speech (TTS) systems, and that it has reduced the gap between state-of-the-art and human-level performance by over 50%. WaveNet has tested well in both Mandarin and English.

There are two broad types of TTS systems. The first is called concatenative TTS, wherein a large database of short speech fragments recorded by a single speaker is used and recombined to create complete utterances. The drawback is that you’re limited to one speaker’s voice and that it’s difficult to modify the inflection or intonation of the voice in any way.
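The recombination step described above can be pictured with a toy sketch. Everything here is an assumption made for illustration: the `FRAGMENTS` table, its unit names, and the made-up sample values stand in for a real database of recordings from one speaker, and real systems also choose and smooth fragments far more carefully.

```python
# Toy illustration of concatenative synthesis. FRAGMENTS is a hypothetical
# database mapping each speech unit to a short "recording" (fake samples).
FRAGMENTS = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1],
    "wor": [-0.2, -0.1, 0.0],
    "ld": [0.3],
}


def synthesize(units):
    """Join the stored fragment for each unit into one utterance."""
    audio = []
    for unit in units:
        if unit not in FRAGMENTS:
            # Only what was recorded can be spoken.
            raise KeyError(f"no recording for unit {unit!r}")
        audio.extend(FRAGMENTS[unit])
    return audio


utterance = synthesize(["he", "llo", "wor", "ld"])
```

Because the output is only ever a rearrangement of what was recorded, changing the voice or its intonation means recording a new database, which is the limitation noted above.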

The other kind is parametric TTS, in which the information needed to generate speech is stored in the parameters of a model, so you can control the content and characteristics of machine speech via inputs to that model. The speech that results from parametric TTS tends to sound less natural, coming off as stilted and mechanical.

WaveNet directly models the raw waveform of the audio samples provided to it to generate naturalistic speech, using artificial neural networks to synthesize audio based on real voices. Google researchers have used this software to successfully generate music as well.
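The idea of building audio one sample at a time, with previously generated audio feeding back in, can be sketched as a simple loop. This is not DeepMind's code: `predict_next` is a hypothetical stand-in (a fixed averaging rule) for the trained neural network, and the seed values are made up.

```python
def predict_next(history, context=2):
    """Hypothetical stand-in for the neural network: derive the next
    sample from the most recent `context` samples."""
    recent = history[-context:]
    return 0.5 * sum(recent) / len(recent)


def generate(seed, n_samples):
    """Extend `seed` one sample at a time; each new sample is computed
    from audio that has already been produced."""
    audio = list(seed)
    for _ in range(n_samples):
        audio.append(predict_next(audio))
    return audio


waveform = generate([1.0, 1.0], 3)
```

In this sketch every new value depends only on what came before it, so the output of one step becomes the input of the next.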

For the most human-sounding machine speech, you can check out WaveNet audio samples on the DeepMind webpage.

What do you think of Google’s breakthrough in voice technology?