Bilingual Science Story: Speech Synthesis and AI (Part 1)

What an Endless Conversation with Werner Herzog Can Teach Us about AI

On the website Infinite Conversation, the German filmmaker Werner Herzog and the Slovenian philosopher Slavoj Žižek are having a public chat about anything and everything. Their discussion is compelling, in part, because these intellectuals have distinctive accents when speaking English, not to mention a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is being generated by artificial intelligence.

Improvements in what’s called machine learning have made deepfakes (incredibly realistic but fake images, videos or speech) ever easier to create and ever higher in quality. At the same time, language-generating AI can quickly and inexpensively churn out large quantities of text. Together, these technologies can do more than stage an infinite conversation. They have the capacity to drown us in an ocean of disinformation.

Machine learning, an AI technique that uses large quantities of data to “train” an algorithm so that it improves as it repeatedly performs a particular task, is going through a phase of rapid growth. This is pushing entire sectors of information technology to new levels, including speech synthesis: systems that produce utterances humans can understand. As someone who is interested in the liminal space between humans and machines, I’ve always found it a fascinating application. So when advances in machine learning allowed voice synthesis and voice cloning technology to improve in giant leaps over the past few years, after a long history of small, incremental improvements, I took note.

Infinite Conversation got started when I stumbled across an exemplary speech synthesis program called Coqui TTS. Many projects in the digital domain begin with finding a previously unknown software library or open-source program. When I discovered this tool kit, accompanied by a flourishing community of users and plenty of documentation, I knew I had all the necessary ingredients to clone a famous voice.

As an appreciator of Werner Herzog’s work, persona and worldview, I’ve always been drawn to his voice and way of speaking. I’m hardly alone: pop culture has made Herzog into a literal cartoon, with cameos in The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to picking someone’s voice to tinker with, there was no better option, particularly since I knew I would have to listen to that voice for hours on end.

It’s almost impossible to get tired of hearing his dry speech and heavy German accent, which convey a gravitas that can’t be ignored. Building a training set for cloning Herzog’s voice was the easiest part of the process. Between his interviews, voice-overs and audiobook work, there are literally hundreds of hours of speech that can be harvested to train a machine-learning model or, in my case, to fine-tune an existing one.

A machine-learning algorithm’s output generally improves over “epochs.” An epoch is one complete cycle in which the neural network is trained on all of the training data. At the end of each epoch, the model’s output can be sampled, giving the researcher material to review in order to evaluate how well training is progressing.
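To make the idea of epochs a little more concrete, here is a minimal sketch in Python. It is not Coqui TTS or any voice model: as a stand-in assumption, it fits a toy straight line y = 2x + 1 by gradient descent. Each epoch is one full pass over the training data, and at the end of each epoch the model is "sampled" by recording its mean squared error, much as a voice model might be sampled by generating a short audio clip to listen to. The function name `train` and its optional starting parameters are hypothetical; passing nonzero starting values loosely mimics fine-tuning an existing model instead of training one from scratch.

```python
def train(data, epochs, lr=0.05, w=0.0, b=0.0):
    """Toy illustration (hypothetical): fit y ~ w*x + b by gradient descent.

    Each epoch is one full pass over `data`. Nonzero starting values for
    w and b stand in for fine-tuning: training resumes from existing
    parameters rather than from scratch. Returns the final parameters and
    the loss sampled at the end of every epoch.
    """
    history = []
    for _ in range(epochs):
        # One epoch: visit every training example exactly once.
        for x, y in data:
            error = (w * x + b) - y
            # Gradient step on the squared error 0.5 * error**2.
            w -= lr * error * x
            b -= lr * error
        # "Sample" the model at the end of the epoch: mean squared error.
        mse = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
        history.append(mse)
    return w, b, history


if __name__ == "__main__":
    # Toy training set: points on the line y = 2x + 1.
    data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
    w, b, history = train(data, epochs=300)
    for epoch in (0, 9, 99, 299):
        print(f"epoch {epoch + 1}: loss {history[epoch]:.6f}")
    print(f"learned w={w:.3f}, b={b:.3f}")
```

Reviewing the loss after each epoch plays the same role as listening to samples of a synthetic voice: it shows whether training is still improving. In this sketch, starting from parameters that are already close to the answer (for example `w=1.9, b=1.1`) reaches a low loss in far fewer epochs than starting from zero, which is the intuition behind fine-tuning an existing model.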

With the synthetic voice of Werner Herzog, hearing the model improve with each epoch felt like witnessing a metaphorical birth, with his voice gradually coming to life in the digital realm.

Key Vocabulary

discover v. find out
distinctive adj. unique
inexpensively adv. at low cost
algorithm n. a set of rules for solving a problem
ingredient n. a component or element

END


Source|Scientific American: What an Endless Conversation with Werner Herzog Can Teach Us about AI

Image Source| Baidu Images

Translator| Wan Ran

Review Editor| Wang Chunyu

Editing Supervisor| Wang Chunyu

Executive Editor| Wan Ran
