This AI System Lets Google Assistant Sound More Human

Thanks to Google and the artificial intelligence (AI) research company DeepMind, your phone will no longer sound like a robot when reading out requested information. Google Assistant now uses an improved version of DeepMind’s WaveNet, a deep neural network that can synthesize realistic human speech.

WaveNet improves on conventional speech synthesis, or text-to-speech (TTS). Traditional TTS relies on one of two techniques: concatenative or parametric synthesis. Concatenative TTS stitches together fragments of a voice actor’s recordings to construct the desired sentence, which lets it closely mimic human speech; the drawback is that upgrading the voice is cumbersome, because the audio libraries must be re-recorded and replaced. Parametric TTS generates speech entirely by computer, but the result tends to sound robotic and artificial.
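The concatenative approach can be pictured as splicing pre-recorded snippets end to end. The sketch below is purely illustrative (the unit names and sine-wave "recordings" are invented stand-ins, not any real TTS engine's data): it joins three fake waveform units with a short crossfade, the same basic operation a concatenative synthesizer performs on a voice actor's library.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

def fake_unit(freq_hz, dur_s=0.1):
    """Stand-in for a recorded speech unit from the audio library
    (a real engine would load a voice actor's recording here)."""
    t = np.linspace(0, dur_s, int(SAMPLE_RATE * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

def concatenate_units(units, fade_samples=160):
    """Join units end to end, crossfading 10 ms to hide the seams."""
    out = units[0]
    ramp = np.linspace(0, 1, fade_samples)
    for nxt in units[1:]:
        head, tail = out[:-fade_samples], out[-fade_samples:]
        blended = tail * (1 - ramp) + nxt[:fade_samples] * ramp
        out = np.concatenate([head, blended, nxt[fade_samples:]])
    return out

# A "sentence" built from three library units.
speech = concatenate_units([fake_unit(220), fake_unit(330), fake_unit(440)])
```

This also shows why upgrades are painful: changing the voice means replacing every unit in the library, not just retraining a model.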

How Does WaveNet Work?

Unlike these two approaches, WaveNet uses a convolutional neural network to produce waveforms from scratch, one audio sample at a time. The network is trained on large collections of speech samples and learns which waveforms sound human and which do not. This lets the synthesizer reproduce the subtleties of natural speech, including nonverbal details such as lip smacks and breaths. Given enough samples, the system can even develop an accent of its own.
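The core idea can be sketched in a few lines. This is a deliberately minimal toy (random untrained weights, 2-tap filters; the real architecture has gated units, skip connections, and many more channels): a stack of causal, dilated 1-D convolutions whose dilation doubles each layer, sampled autoregressively so each new output sample is predicted only from previous ones.

```python
import numpy as np

rng = np.random.default_rng(0)

DILATIONS = [1, 2, 4, 8]  # dilation doubles per layer -> wide receptive field
# Random 2-tap filters as stand-ins for trained weights.
weights = [rng.normal(size=2) for _ in DILATIONS]

def causal_dilated_stack(x):
    """Apply each 2-tap dilated convolution, padding on the left
    so no layer can see future samples (causality)."""
    h = x
    for w, d in zip(weights, DILATIONS):
        padded = np.concatenate([np.zeros(d), h])
        # padded[:-d] is the signal delayed by d samples; padded[d:] is current.
        h = np.tanh(w[0] * padded[:-d] + w[1] * padded[d:])
    return h

def generate(n_samples):
    """Autoregressive generation: each predicted sample is fed back
    as input when predicting the next one."""
    audio = np.zeros(1)  # seed with silence
    for _ in range(n_samples):
        nxt = causal_dilated_stack(audio)[-1]  # predict the next sample
        audio = np.append(audio, nxt)
    return audio[1:]

waveform = generate(64)
```

The sample-by-sample feedback loop is what made the original WaveNet so expensive to run, which is the problem the next paragraph describes.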

In its early years, the amount of computing power needed to generate audio severely limited WaveNet: it took at least one second of computation to produce 0.02 seconds of audio. DeepMind’s engineers have since fixed the problem, and the system can now produce a one-second waveform in 50 milliseconds. The sample resolution has also doubled, from 8 to 16 bits, which translates directly into audio that scores much higher in human listening tests.
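A quick back-of-envelope check of the figures above (using only the numbers quoted in the text) shows how large the jump is:

```python
# Audio produced per second of compute, before and after the speedup.
old_rate = 0.02 / 1.0   # 0.02 s of audio per 1 s of compute
new_rate = 1.0 / 0.05   # 1 s of audio per 50 ms of compute

speedup = new_rate / old_rate  # roughly a 1000x improvement

# Doubling the bit depth squares the number of amplitude levels.
levels_8bit = 2 ** 8    # 256 levels
levels_16bit = 2 ** 16  # 65,536 levels
```

In other words, the new system is not just real-time but about twenty times faster than real-time, while the old one ran at a fiftieth of real-time.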

These improvements make the system efficient enough to integrate into Google Assistant and other consumer products. As of today, Google Assistant can produce Japanese and U.S. English voices. Eventually, Google could use WaveNet to synthesize speech in other languages and dialects, and computer-generated voices will sound ever more human, right down to the peculiarities of regional accents.