Voice is one of our most natural identifiers. A new Microsoft AI can now learn to imitate it almost perfectly in a very short time. Microsoft is also aware of the risks.
Even with our eyes closed or on the phone, we can recognize other people by their voice. In the future, however, we may be less certain that we are really hearing the right person. With a new software project, Microsoft wants computers not just to imitate how a real person sounds, but to speak with that person's voice.
On Thursday, the company announced a project christened "VALL-E". The software uses artificial intelligence to analyze an existing voice recording of a person. Given a text prompt, it can then read that text in the style of the speaker from the original recording. It not only adopts the sound of the voice itself, but also imitates the manner of speech and even the "acoustic environment": if the recording was made during a phone call, the imitated version also sounds as if it came from a telephone. The real revolution: three seconds of spoken audio are enough to imitate the voice.
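In outline, the approach works by compressing the three-second speaker prompt into discrete audio tokens and conditioning generation on those tokens plus the phonemes of the new text. The toy sketch below only illustrates that interface; every function name here is a hypothetical placeholder, not Microsoft's actual code, and the "tokens" are stand-ins for a real neural audio codec:

```python
# Conceptual sketch of VALL-E-style zero-shot synthesis.
# All names are hypothetical placeholders for illustration only.

def text_to_phonemes(text):
    """Placeholder: convert text to a phoneme sequence.
    Toy stand-in: treat lowercase characters as 'phonemes'."""
    return list(text.lower())

def encode_prompt(prompt_audio):
    """Placeholder: compress ~3 s of audio into discrete codec tokens
    (toy codebook of 1024 entries)."""
    return [hash(sample) % 1024 for sample in prompt_audio]

def generate_speech(prompt_audio, text):
    """Sketch of the pipeline: condition on the phonemes of the new text
    plus the codec tokens of the short speaker prompt. A real model would
    autoregressively predict new audio tokens and decode them back into a
    waveform; here we just return the conditioning inputs."""
    return {
        "phonemes": text_to_phonemes(text),
        "prompt_tokens": encode_prompt(prompt_audio),
    }

result = generate_speech(prompt_audio=[0.1, -0.2, 0.05], text="Hello")
print(len(result["phonemes"]))  # 5
```

The key point the sketch makes is that the speaker prompt is just another input sequence: the model needs no retraining per voice, only a few seconds of audio at inference time.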
The result is impressive, and at the same time frightening. On the project website, the company presents a whole series of sound samples. The "speaker prompt" refers to the three-second original. Microsoft labels as "ground truth" a recording in which the original speaker reads out the sentence that is then used as the new text. Finally, as a so-called "baseline", there is a comparison with conventional text-to-speech software. If you compare the ground truth with VALL-E's output, the computer-generated voice can often hardly be distinguished from the original. Only small hints in emphasis and timbre make the artificial voice sound slightly unnatural. The sound and the manner of speaking are almost always captured well.
To achieve this precision, Microsoft trained its AI on 60,000 hours of audio from 7,000 speakers: voice recordings from the LibriLight dataset of Facebook's parent company Meta, which consists mainly of audiobooks. It is no coincidence that Microsoft's speech samples are exclusively literary texts: according to the developers, the synthesis currently works best when the voice being imitated resembles those in the training data. So far, the AI reads audiobooks most convincingly; with arbitrary speaking voices, the result would currently be less credible.
No need to panic (yet).
As possible uses for the program, the developers name the automated reading of texts at a human level, for example to turn chats into speech. Subsequent editing of spoken recordings to remove errors is also conceivable. In addition, entirely new spoken content could be created by combining VALL-E with further AI models.
But Microsoft also appears well aware of the technology's potential for abuse. "Because VALL-E can synthesize speech that preserves the speaker's identity, it carries a possible risk of misuse," the announcement explains. "It could be used to trick voice-recognition systems or to imitate a specific speaker." To counter this, work is underway on software that can recognize AI-generated voices as such. Perhaps the best protective measure: VALL-E's program code is currently not accessible to third parties at all.
In any case, there is no need to fear German-language VALL-E fakes yet: so far, the AI only speaks English.