Wait, it’s all text?

Classifying speech is much the same as classifying text; only the format is different.

Language is one of the biggest domains of AI, and the format that carries the language really should make no difference.

In this blog post we will look at how a multilingual text classifier can also be used to classify multilingual speech, and what a system like this would look like. Use cases for audio and speech include call centers (actually intelligent phone routing), subtitling video and audio files (hello podscribing.ai), and compressing and analyzing data (instead of storing audio, you can now store and search text files).

Multilingual speech classification

Audio files come in all shapes and sizes, and in all languages! At Tekst.ai, multilingual AI is at our core and we are always looking for ways to extend it. We primarily focus on multilingual text, and we will now show how all speech is also text, just in disguise.

Multilingual speech classification has had a huge boost in recent months, as a lot of new open-source models have been released. Models like Whisper from OpenAI, together with SetFit and good old BERT (both available through Hugging Face), have raised the bar on many benchmarks. Older results really should not be compared with what is possible today, and putting these models in production reaps the benefits, especially for non-English languages 🙂

The process

Speech is just language in a different format. Two steps are needed to classify speech in a multilingual manner. The first is transforming the speech back into text (with speech-to-text); the second is running a multilingual classifier on that text. Let’s go for it!
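To make the two steps concrete, here is a minimal sketch of the pipeline as plain function composition. Everything in it is an assumption for illustration: the names classify_speech, fake_transcribe and fake_classify are made up, and the two stand-in functions only mimic what a real speech-to-text model and a real text classifier would do.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Prediction:
    transcript: str
    label: str


def classify_speech(
    audio_path: str,
    transcribe: Callable[[str], str],
    classify: Callable[[str], str],
) -> Prediction:
    """Step 1: speech -> text. Step 2: text -> label."""
    transcript = transcribe(audio_path)
    label = classify(transcript)
    return Prediction(transcript=transcript, label=label)


# Hypothetical stand-ins so the sketch runs without any model installed.
# In a real system, `transcribe` would wrap a speech-to-text model and
# `classify` would wrap a multilingual text classifier.
def fake_transcribe(audio_path: str) -> str:
    return "ik wil mijn factuur betalen"


def fake_classify(text: str) -> str:
    return "billing" if "factuur" in text else "other"


result = classify_speech("call_001.wav", fake_transcribe, fake_classify)
print(result.label)
```

Keeping both steps behind simple callables means either one can be swapped out (a different transcription model, a different classifier) without touching the rest.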

Speech to text

Speech-to-text is a research field in itself, so we could really deep dive into this. Luckily for you, we will only highlight the best stuff!

We were recently stunned by Whisper, a new speech-to-text model from OpenAI that is trained on hundreds of thousands of hours of multilingual audio, so hey, the performance is insane.

The results are impressive across the board, and especially in Dutch and other non-English languages the model is great with non-standard words.

Where previous models struggled with names, entities, and exotic words, Whisper deals with them much better. The errors it does make are also much smaller, which makes them even harder to correct.

We found that Whisper also makes different mistakes than previous transcription models, which makes it interesting to combine the two to get an even more reliable and stable system.

The overall great performance of classical speech-to-text models, combined with the power of Whisper and its great performance on exotic terms, makes it possible to deliver top-notch transcriptions!!
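One simple way to start combining two transcription models is to find where their outputs disagree, since those spots are the most likely errors. The sketch below is only an illustration, not a description of our production setup: it aligns two word sequences with Python's standard difflib module, and both example transcripts are invented.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_disagreements(
    words_a: List[str], words_b: List[str]
) -> List[Tuple[List[str], List[str]]]:
    """Return (span_a, span_b) pairs where the two transcripts differ."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (words_a[a_start:a_end], words_b[b_start:b_end])
            )
    return disagreements


# Invented outputs of two different transcription models for the same audio.
transcript_a = "please call mister janssens about the invoice".split()
transcript_b = "please call mister johnson about the invoice".split()

diffs = find_disagreements(transcript_a, transcript_b)
print(diffs)
```

The flagged spans could then go to a tie-breaking rule, a third model, or a human reviewer, depending on how much reliability is needed.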

Multilingual text classification

Aha, this feels like a home game. We could talk for hours about how cool multilingual text classification is and why the focus should not only be on English, but let us keep that for another post 🙂

We now have a full transcription of the audio file and can easily train a classifier on top of it. The challenges here are the longer texts (30 minutes of audio makes for a big text file) and the transcript's format (for example, we need to parse out the timestamps as we don’t need them).

These are not the biggest challenges, and it’s quite easy to fine-tune a transformer model, the current state of the art in multilingual text classification, on this data. This way, accuracy numbers of 95%+ are achievable and not out of your reach, even for audio files.
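Both challenges can be handled with a bit of preprocessing. The sketch below strips timestamps and splits a long transcript into overlapping word windows. The timestamp layout ("[00:00.000 --> 00:03.500] text") is an assumed format chosen for illustration, so the pattern will need adjusting to whatever your transcription tool actually writes, and the window sizes are arbitrary.

```python
import re
from typing import List

# Assumed line format: "[00:00.000 --> 00:03.500] some text"
TIMESTAMP_PREFIX = re.compile(r"^\[[\d:.]+ --> [\d:.]+\]\s*")


def strip_timestamps(transcript: str) -> str:
    """Drop the leading timestamp of each line and join the text."""
    lines = []
    for line in transcript.splitlines():
        text = TIMESTAMP_PREFIX.sub("", line).strip()
        if text:
            lines.append(text)
    return " ".join(lines)


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


raw = (
    "[00:00.000 --> 00:03.500] Goedemiddag, ik bel over mijn factuur.\n"
    "[00:03.500 --> 00:06.000] Kunt u mij helpen?"
)
clean = strip_timestamps(raw)
chunks = chunk_words(clean, max_words=6, overlap=2)
print(clean)
print(chunks)
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks.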
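Transformer models only read a limited number of tokens at once, so a long transcript is usually scored chunk by chunk. One possible way to turn those chunk scores into a single label for the whole audio file is to average the per-label probabilities, as sketched below. The function name and the scores are made up for illustration; in practice the scores would come from your fine-tuned classifier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_chunk_scores(
    chunk_scores: List[Dict[str, float]]
) -> Tuple[str, float]:
    """Average per-label probabilities over chunks and return the best label."""
    if not chunk_scores:
        raise ValueError("need at least one chunk")
    totals: Dict[str, float] = defaultdict(float)
    for scores in chunk_scores:
        for label, prob in scores.items():
            totals[label] += prob
    averaged = {
        label: total / len(chunk_scores) for label, total in totals.items()
    }
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


# Invented classifier outputs for three chunks of one call transcript.
scores = [
    {"billing": 0.7, "support": 0.3},
    {"billing": 0.4, "support": 0.6},
    {"billing": 0.9, "support": 0.1},
]
label, confidence = aggregate_chunk_scores(scores)
print(label, round(confidence, 2))
```

Averaging is only one choice: a majority vote, or taking the maximum score per label, may suit cases where the relevant topic only shows up in a small part of the recording.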

Conclusion

From audio to text to a prediction. Classifying audio and tagging it with the right topics, sentiment, and whatever custom tags you can come up with is quite doable! If you are interested in learning more about this, please reach out for a chat, but be aware: we might not be able to stop talking about how cool multilingual AI is!