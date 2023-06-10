

Assessing the quality of speech synthesis across many languages is a persistent challenge. Google has taken a step forward in this field with the development of a model known as SQuId (Speech Quality Identification).

What is SQuId?

SQuId is a regression model with 600 million parameters that measures how natural a piece of speech sounds. It is based on mSLAM, a pre-trained speech-text model developed by Google. It was fine-tuned on over a million quality ratings across 42 languages and tested in 65, making it the largest published effort of its kind to date.

How SQuId works

SQuId takes an utterance as input, along with an optional locale tag (a localized variant of a language), and returns a score between 1 and 5 indicating how natural the waveform sounds. The model consists of three components: an encoder, a pooling/regression layer, and a fully connected layer. The encoder is the largest and most important part of the model.
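The flow from waveform to score can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in that computes three simple statistics per frame (the real encoder is a large pre-trained network), the weights and the `locale_bias` table are made up, and the sigmoid used to keep the output between 1 and 5 is one possible choice, not something the article specifies.

```python
import math

def toy_encoder(waveform, frame_size=4):
    """Hypothetical stand-in for the encoder: one small feature vector per frame."""
    if not waveform:
        raise ValueError("waveform must not be empty")
    vectors = []
    for start in range(0, len(waveform), frame_size):
        frame = waveform[start:start + frame_size]
        mean = sum(frame) / len(frame)
        energy = sum(x * x for x in frame) / len(frame)
        peak = max(abs(x) for x in frame)
        vectors.append([mean, energy, peak])
    return vectors

def mean_pool(vectors):
    """Pooling step: average the frame vectors into a single fixed-size vector."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]

def fully_connected(features, weights, bias):
    """Final layer: a weighted sum of the pooled features plus a bias."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def to_rating(logit):
    """Assumption: squash the raw output into the 1-to-5 range with a sigmoid."""
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))

def score_utterance(waveform, weights, bias, locale=None, locale_bias=None):
    """Encoder -> pooling -> fully connected layer -> naturalness score."""
    pooled = mean_pool(toy_encoder(waveform))
    logit = fully_connected(pooled, weights, bias)
    # Assumption: the optional locale tag only shifts the output in this toy.
    if locale is not None and locale_bias:
        logit += locale_bias.get(locale, 0.0)
    return to_rating(logit)

waveform = [0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 0.2, 0.1]
score = score_utterance(waveform, weights=[0.5, -1.0, 0.2], bias=0.0,
                        locale="pt-BR", locale_bias={"pt-BR": 0.1})
print(round(score, 2))
```

The sketch only mirrors the shape of the pipeline: a variable-length input is encoded frame by frame, reduced to one vector, and mapped to a single number.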

SQuId training and evaluation

To train and evaluate the model, Google created the SQuId corpus, a collection of 1.9 million rated utterances across 66 languages. The corpus covers a wide range of systems and use cases, exposing SQuId to a broad variety of TTS errors, such as acoustic artifacts, incorrect prosody, text normalization errors, and pronunciation mistakes.

Challenges in training multilingual systems

A common problem when training multilingual systems is that training data may not be uniformly available for all languages of interest. To address this issue, Google decided to train one model for all languages instead of a separate model for each language. This approach relies on an effect known as cross-locale transfer.

Cross-locale transfer

Cross-locale transfer is the hypothesis that, if the model is large enough, its accuracy on each locale improves as a result of jointly training on the others. Google's experiments show that cross-locale transfer is a powerful driver of performance.
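The data setup behind the choice of one model for all languages can be sketched as follows. The ratings, locale tags, and utterance IDs are invented for illustration and do not come from the SQuId corpus; the sketch shows only how the training data is organized, not the training itself.

```python
from collections import defaultdict

# Hypothetical rated utterances: (locale tag, utterance id, rating from 1 to 5).
RATINGS = [
    ("en-US", "utt-001", 4.5),
    ("en-US", "utt-002", 3.0),
    ("en-US", "utt-003", 4.0),
    ("pt-BR", "utt-004", 3.5),
    ("sw-KE", "utt-005", 2.5),
]

def per_locale_sets(ratings):
    """One training set per locale: low-resource locales get very little data."""
    sets = defaultdict(list)
    for locale, utterance_id, rating in ratings:
        sets[locale].append((utterance_id, rating))
    return dict(sets)

def pooled_set(ratings):
    """One training set for a single shared model; the locale tag stays as an input."""
    return [((utterance_id, locale), rating) for locale, utterance_id, rating in ratings]

separate = per_locale_sets(RATINGS)
shared = pooled_set(RATINGS)

print({locale: len(rows) for locale, rows in separate.items()})
print(len(shared))
```

With separate models, the locale with a single rating would have almost nothing to learn from. With the pooled set, the shared model sees every rating, which is the setting in which one locale can benefit from the others.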

Experimental results

To understand the overall performance of SQuId, Google compared it to a custom Big-SSL-MOS model, a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS'22 Challenge dataset, the most popular dataset at the time of evaluation. Google experimented with several variants of the model and found SQuId to be up to 50.0% more accurate.

Conclusions and future work

SQuId uses the SQuId dataset and cross-locale transfer learning to assess speech quality and describe how natural it sounds. The model can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expansion of the range of languages covered, and handling of new error types. This advance in evaluating multilingual speech synthesis is a step toward more accessible and natural speech technologies for users around the world.

More information at ai.googleblog.com and arxiv.org