Meta’s New Multilingual Speech-to-Text, Text-to-Speech
From Meta AI:
Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover at most 100 languages. To overcome it, we turned to religious texts, such as the Bible, that have been translated into many different languages and whose translations have been widely studied for text-based language translation research.
There’s something very ironic about using the Bible to train multilingual models: the Tower of Babel story is right at the start.
We trained multilingual speech recognition models on over 1,100 languages using a 1B parameter wav2vec 2.0 model. As the number of languages increases, performance does decrease, but only very slightly: Moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent but increases the language coverage by over 18 times.
The number of languages is mind-blowing to me. A functional universal translator is at hand.
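If you want to poke at this yourself, here’s a minimal sketch of running the recognition model, assuming the checkpoints ship through Hugging Face transformers under a name like facebook/mms-1b-all with per-language adapters (the model ID and adapter calls are my assumptions, not something the announcement spells out):

```python
# Minimal sketch: transcribing audio with the 1B-parameter MMS model.
# Assumes the checkpoint is published as "facebook/mms-1b-all" and that
# per-language adapters are loadable via transformers.
import torch
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch to a target language by its ISO 639-3 code (here: French).
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

def transcribe(audio):
    # `audio` is a 16 kHz mono waveform as a 1-D float array.
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids)
```

Presumably the per-language adapter design is what lets a single 1B-parameter model cover 1,100+ languages without retraining the whole network for each one.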
In a like-for-like comparison with OpenAI’s Whisper, we found that models trained on the Massively Multilingual Speech data achieve half the word error rate, but Massively Multilingual Speech covers 11 times more languages. This demonstrates that our model can perform very well compared with the best current speech models.
I was very excited about this, but some commenters pointed out that while its error rates are superior for most languages, its error rate on English is similar to Whisper’s.
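For anyone who wants to check that kind of comparison themselves: word error rate is just word-level edit distance divided by the number of reference words. A quick sketch using the jiwer library (my choice of tool; the post doesn’t mention it):

```python
# Quick sketch: comparing two transcripts against a reference by word
# error rate (WER = word-level edit distance / reference word count).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis_a = "the quick brown fox jumped over the lazy dog"
hypothesis_b = "quick brown fox jumps over a lazy dog"

for name, hyp in [("model A", hypothesis_a), ("model B", hypothesis_b)]:
    print(f"{name}: WER = {jiwer.wer(reference, hyp):.2%}")
```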