Open Source: OpenAI releases automatic speech recognition system Whisper

The Whisper speech recognition system is designed to identify spoken languages, translate speech into English and transcribe recordings. Five freely available model sizes can be found on GitHub.

OpenAI has announced a new automatic speech recognition (ASR) system called Whisper. It is based on an encoder-decoder Transformer and is available on GitHub in five open-source model sizes. The development team trained the system on 680,000 hours of audio material collected from the internet. Two thirds of the recordings were in English, the remaining third in various other languages. As a multitask model, Whisper is meant not only to transcribe speech but also to identify languages and translate them into English.
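
For readers who want to try this multitask behaviour themselves, the following is a minimal sketch using the Python package from the GitHub repository, assuming it has been installed (for example via pip) together with ffmpeg; the chosen model size and the audio file names are placeholders:

```python
import whisper

# Load one of the released checkpoints (downloaded on first use).
model = whisper.load_model("base")

# Task 1: transcribe a recording in its original language.
result = model.transcribe("recording.mp3")  # placeholder file name
print(result["language"], result["text"])

# Task 2: translate non-English speech directly into English text.
translated = model.transcribe("interview_german.mp3", task="translate")  # placeholder file name
print(translated["text"])
```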

In the research report on Whisper, the OpenAI team states that the model was developed with the aim of creating a robust speech processing system that requires no dataset-specific fine-tuning. The researchers note that pre-trained audio encoders are often trained in an unsupervised fashion: the encoders become highly capable, but human fine-tuning is still needed before the decoders produce output of adequate quality. For Whisper, the team therefore used about 10,000 hours of supervised speech data alongside some 30,000 hours of data with more background noise, resulting in a weakly supervised model. According to the report, this process can largely be automated.

Whisper is based on an encoder-decoder Transformer. The program processes audio in 30-second snippets, which are fed to the model as log-Mel spectrograms. The decoder was trained to generate the text that matches the audio. Whisper also uses special tokens that allow the program to perform multiple tasks. According to OpenAI, it can handle language identification, phrase-level timestamps, multilingual speech transcription and speech translation into English.
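
The repository also exposes the individual steps of this pipeline in its lower-level Python API. The sketch below, closely following the project's published example, shows the 30-second padding, the log-Mel spectrogram and the language identification step; the audio file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad or trim it to a 30-second snippet.
audio = whisper.load_audio("speech.wav")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is fed to the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification via the model's special tokens.
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decoding: the decoder generates the text matching the audio.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```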

Due to the large training corpus and the lack of fine-tuning for any specific dataset, Whisper lags behind specialized models on benchmarks such as LibriSpeech. However, the OpenAI team reports better zero-shot performance on unfamiliar datasets: according to the developers, the model's robustness shows in an error rate across a range of datasets that is about 50 percent lower than that of systems built for a specific task.

Whisper is available in five model sizes on GitHub, ranging from 39 million to more than 1.5 billion parameters. The smallest model needs about 1 GB of VRAM, the largest about 10 GB. Except for the largest version, the models are also offered in variants that handle only English. The different sizes trade off speed against accuracy.
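
As a brief illustration of how a size is picked, the checkpoints are addressed by name when loading; the variable names and the audio file below are placeholders:

```python
import whisper

# The five checkpoints are selected by name ("tiny", "base", "small", "medium", "large");
# the four smaller ones also come in English-only variants with an ".en" suffix.
# "tiny" needs roughly 1 GB of VRAM, "large" roughly 10 GB.
fast_english_only = whisper.load_model("tiny.en")
accurate_multilingual = whisper.load_model("large")

# Same clip with both models: the small one runs much faster,
# the large one is more accurate, especially on non-English audio.
print(fast_english_only.transcribe("clip.wav")["text"])       # placeholder file name
print(accurate_multilingual.transcribe("clip.wav")["text"])
```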

Speech models and speech recognition currently play a major role, for example with the chat program LaMDA, which caused a stir this year over claims that it had become sentient. Like Whisper, LaMDA is based on a Transformer architecture. A basic explanation of how Transformers are structured and how they work can be found here.