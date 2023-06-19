

Meta's artificial intelligence research division has presented Voicebox, a machine learning model that generates speech from text. It differs from other options in its ability to perform many tasks for which it has not been specifically trained, such as editing, noise removal, and style transfer.

Meta has not released Voicebox publicly, at least for the moment, because of ethical concerns about its potential misuse. Even so, the initial results are promising and may drive many applications in the future.

What exactly is Meta Voicebox?

Voicebox is a generative model capable of synthesizing speech in six languages: English, French, Spanish, German, Polish and Portuguese. While existing language models try to learn the statistical regularities of words and text sequences, Voicebox has been trained to learn the patterns that map audio samples of speech to their transcriptions. This type of model can be applied to many downstream tasks with little or no additional tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” the Meta researchers write.

An important detail: to train the model, Meta used its own technique called ‘Flow Matching’, which the company describes as more efficient and generalizable than the diffusion-based learning methods used in other generative models. This technique allows for “learning of varied speech data without the need for careful labeling”.

A key feature of Voicebox is that it can handle many tasks for which it has not been explicitly trained. For example, the model can use a two-second speech sample to generate speech for new text. Meta claims that this ability can be used to provide a voice for people who cannot speak, or to customize the voices of non-player game characters and virtual assistants.
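For readers curious about what a Flow Matching objective looks like in practice, here is a minimal sketch in Python. It is not Meta's code. It assumes the generic conditional flow matching loss with a straight-line path between noise and data, as described in the research literature. The function names, the `sigma_min` value, and the 80-dimensional stand-in for audio features are all illustrative assumptions.

```python
import numpy as np

def cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    """Build one training example for conditional flow matching.

    x0 is a noise sample, x1 is a data sample, t is a time in [0, 1].
    The path is a straight line from noise to data; sigma_min is an
    assumed small constant, not a value taken from the article.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1  # point on the path
    u_t = x1 - (1.0 - sigma_min) * x0                  # velocity along the path
    return x_t, u_t

def cfm_loss(predict_velocity, x0, x1, t, sigma_min=1e-4):
    """Mean squared error between predicted and target velocity."""
    x_t, u_t = cfm_training_pair(x0, x1, t, sigma_min)
    v = predict_velocity(x_t, t)
    return float(np.mean((v - u_t) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 80))   # hypothetical stand-in for 4 frames of audio features
x0 = rng.normal(size=x1.shape)  # Gaussian noise of the same shape
t = 0.3

# Two toy "models": one always predicts zero, one returns the exact target.
zero_model = lambda x_t, t: np.zeros_like(x_t)
oracle = lambda x_t, t: x1 - (1.0 - 1e-4) * x0

print("loss of zero model:", cfm_loss(zero_model, x0, x1, t))
print("loss of oracle:", cfm_loss(oracle, x0, x1, t))
```

In a real system the toy models would be replaced by a neural network conditioned on text and surrounding audio, and the loss would be minimized over many random draws of `x0`, `x1` and `t`.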
Lots of options for the future

Voicebox can generate multiple speech samples from a single text input. This capability can be used to produce synthetic data for training other speech processing models. Meta notes that “our results show that speech recognition models trained with synthetic speech generated by Voicebox perform almost as well as models trained with real speech, with an error rate degradation of only 1 percent compared to 45 to 70 percent degradation with synthetic speech from earlier text-to-speech models.”

However, Voicebox also has its limits. Because it was trained on audiobook data, it is not well suited to conversational speech, which is informal and contains non-verbal sounds. It also does not provide complete control over attributes of the generated voice such as style, pitch, emotion, and acoustic conditions. Meta’s research team is exploring techniques to overcome these limitations in the future.