Open Source: OpenAI releases automatic speech recognition system Whisper

The Whisper speech recognition system is designed to identify spoken languages, translate speech into English and transcribe recordings. Five freely available model sizes can be found on GitHub.

OpenAI has announced a new automatic speech recognition (ASR) system called Whisper. It is based on an encoder-decoder Transformer and is available on GitHub in five open-source model sizes. The development team trained the system on 680,000 hours of audio material collected from the internet. Two thirds of the recordings were in English, the remaining third in various other languages. As a multitask model, Whisper is meant not only to transcribe speech but also to identify languages and translate them into English.
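
For readers who want to try this multitask behaviour themselves, the following is a minimal sketch using the Python package from the GitHub repository, assuming it has been installed (for example via pip) together with ffmpeg; the chosen model size and the audio file names are placeholders:

```python
import whisper

# Load one of the released checkpoints (downloaded on first use).
model = whisper.load_model("base")

# Task 1: transcribe a recording in its original language.
result = model.transcribe("recording.mp3")  # placeholder file name
print(result["language"], result["text"])

# Task 2: translate non-English speech directly into English text.
translated = model.transcribe("interview_german.mp3", task="translate")  # placeholder file name
print(translated["text"])
```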

In the research report on Whisper, the OpenAI team states that the model was developed with the aim of creating a robust speech processing system that requires no dataset-specific fine-tuning. The researchers note that pre-trained audio encoders are often trained in an unsupervised fashion: the encoders become highly capable, but human fine-tuning is still needed before the decoders produce output of adequate quality. For Whisper, the team therefore used about 10,000 hours of supervised speech data alongside some 30,000 hours of data with more background noise, resulting in a weakly supervised model. According to the report, this process can largely be automated.

Whisper is based on an encoder-decoder Transformer. The program processes audio in 30-second snippets, which are fed to the model as log-Mel spectrograms. The decoder was trained to generate the text that matches the audio. Whisper also uses special tokens that allow the program to perform multiple tasks. According to OpenAI, it can handle language identification, phrase-level timestamps, multilingual speech transcription and speech translation into English.
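
The repository also exposes the individual steps of this pipeline in its lower-level Python API. The sketch below, closely following the project's published example, shows the 30-second padding, the log-Mel spectrogram and the language identification step; the audio file name is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad or trim it to a 30-second snippet.
audio = whisper.load_audio("speech.wav")  # placeholder file name
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is fed to the encoder.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language identification via the model's special tokens.
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decoding: the decoder generates the text matching the audio.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```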

Due to the large training corpus and the lack of fine-tuning for any specific dataset, Whisper lags behind specialized models on benchmarks such as LibriSpeech. However, the OpenAI team reports better zero-shot performance on unfamiliar datasets: according to the developers, the model's robustness shows in an error rate across a range of datasets that is about 50 percent lower than that of systems built for a specific task.

Whisper is available in five model sizes on GitHub, ranging from 39 million to more than 1.5 billion parameters. The smallest model needs about 1 GB of VRAM, the largest about 10 GB. Except for the largest version, the models are also offered in variants that handle only English. The different sizes trade off speed against accuracy.
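
As a brief illustration of how a size is picked, the checkpoints are addressed by name when loading; the variable names and the audio file below are placeholders:

```python
import whisper

# The five checkpoints are selected by name ("tiny", "base", "small", "medium", "large");
# the four smaller ones also come in English-only variants with an ".en" suffix.
# "tiny" needs roughly 1 GB of VRAM, "large" roughly 10 GB.
fast_english_only = whisper.load_model("tiny.en")
accurate_multilingual = whisper.load_model("large")

# Same clip with both models: the small one runs much faster,
# the large one is more accurate, especially on non-English audio.
print(fast_english_only.transcribe("clip.wav")["text"])       # placeholder file name
print(accurate_multilingual.transcribe("clip.wav")["text"])
```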

Speech models and speech recognition currently play a major role, for example with the chat program LaMDA, which caused a stir this year over claims that it had become sentient. Like Whisper, LaMDA is based on a Transformer architecture. A basic explanation of how Transformers are structured and how they work can be found here.