Технозаметки Малышева

MMS: Scaling Speech Technology to 1000+ languages

Get ready for a breakthrough in speech technology! So far the field has covered only around a hundred languages, a small fraction of the more than 7,000 spoken worldwide. The Massively Multilingual Speech (MMS) project sets out to close this gap, increasing the number of supported languages by 10 to 40 times, depending on the task. An expansion on this scale should significantly improve global access to information and make the digital landscape more inclusive.

This is achieved with a new dataset built from publicly available religious texts, combined with self-supervised learning. The MMS project delivers pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for a whopping 4,017 languages. Even more impressive is the gain in accuracy: the MMS multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark, despite being trained on a significantly smaller dataset.
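
For readers unfamiliar with the metric: word error rate (WER) is usually computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the sample sentences are made up for this example, and real evaluations typically normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative helper (hypothetical name); splits on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, "more than halving" the WER means the ratio of MMS errors to reference words is less than half of Whisper's on the same test utterances.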

Paper link: https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/
Blogpost link: https://ai.facebook.com/blog/multilingual-model-speech-recognition/
Code link: https://github.com/facebookresearch/fairseq/tree/main/examples/mms

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-mms
#deeplearning #speechrecognition #tts #audio


Universal-1: a speech recognition breakthrough from AssemblyAI

Universal-1 was trained on 12.5 million hours of multilingual audio and delivers high accuracy in English, Spanish, French, and German.
Highlights: speed! It transcribes one hour of audio in 21 seconds by processing chunks of the recording in parallel; it has 600M parameters and produces accurate word-level timestamps.
It was trained on Google Cloud TPUs using the JAX framework.
The model almost never "hallucinates" and is 25.5% more accurate than Whisper Large-v3.
Speaker identification and recognition in multilingual settings have also improved.
The model is available through the API.
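
To put the speed figure in perspective, here is a small arithmetic sketch that converts "one hour in 21 seconds" into a speedup over real time. It uses only the numbers quoted in the post; the function name `speedup_over_real_time` is hypothetical, and the result says nothing about latency for short clips.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Figures from the post: 1 hour (3600 s) of audio transcribed in 21 s.
    speedup = speedup_over_real_time(3600, 21)
    print(f"{speedup:.0f}x faster than real time")  # prints "171x faster than real time"
```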

#AssemblyAI #Universal1 #SpeechRecognition
-------
@tsingular

