Hello World to Audio Transcribing with OpenAI Whisper
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
Two concepts: the Mel scale (a perceptual scale of pitches judged by listeners to be equal in distance from one another) and the spectrogram (the x-axis is the time step, the y-axis is frequency on a log scale, and the "color" axis is amplitude in decibels, which is roughly a log scale of amplitude).
Transformer architecture: the log-Mel spectrogram is fed to the encoder, and the decoder generates the output text tokens.
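A minimal sketch of that front end using the openai-whisper package (the file name "audio.wav" is just a placeholder):

```python
import whisper

# Load audio (resampled to 16 kHz mono) and pad/trim it to the 30-second window
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is passed to the encoder
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)  # (n_mels, n_frames), e.g. (80, 3000) for one 30-second chunk
```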
Transcribe
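A minimal transcription example with the openai-whisper Python package; the model size and file name below are placeholders:

```python
import whisper

# Load one of the pretrained model sizes (tiny, base, small, medium, large)
model = whisper.load_model("base")

# Transcribe a local audio file; Whisper handles chunking and decoding internally
result = model.transcribe("audio.wav")

print(result["text"])            # full transcript
for seg in result["segments"]:   # phrase-level timestamps
    print(f'[{seg["start"]:.2f} -> {seg["end"]:.2f}] {seg["text"]}')
```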
Convert recorded audio to wav
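One way to do the conversion is with pydub, which calls ffmpeg under the hood (ffmpeg must be installed; the file names and input format here are placeholders):

```python
from pydub import AudioSegment

# Load the recording (m4a is just an example input format) and convert it
# to 16 kHz mono WAV, the sample rate Whisper uses internally
recording = AudioSegment.from_file("recording.m4a")
recording = recording.set_frame_rate(16000).set_channels(1)
recording.export("recording.wav", format="wav")
```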
Speaker Diarization
OpenAI Whisper does not support diarization (distinguishing between the different speakers participating in a conversation). The video describes an approach that uses speaker embeddings and clustering on top of Whisper's output; a sketch of the idea is below.
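A rough sketch of that embedding-and-clustering idea, under these assumptions: the number of speakers is known in advance, and embed_segment is a placeholder for whatever pretrained speaker-embedding model you plug in.

```python
import numpy as np
import whisper
from sklearn.cluster import AgglomerativeClustering

NUM_SPEAKERS = 2      # assumed known in advance for this sketch
SAMPLE_RATE = 16000   # Whisper's internal sample rate

def embed_segment(segment_audio: np.ndarray) -> np.ndarray:
    """Placeholder: return a speaker embedding for one audio segment.
    In practice this would call a pretrained speaker-embedding model."""
    raise NotImplementedError

model = whisper.load_model("base")
audio = whisper.load_audio("meeting.wav")   # placeholder file name
result = model.transcribe("meeting.wav")

# Compute one speaker embedding per Whisper segment
embeddings = []
for seg in result["segments"]:
    start = int(seg["start"] * SAMPLE_RATE)
    end = int(seg["end"] * SAMPLE_RATE)
    embeddings.append(embed_segment(audio[start:end]))
embeddings = np.stack(embeddings)

# Cluster the embeddings into speakers and label each segment
labels = AgglomerativeClustering(n_clusters=NUM_SPEAKERS).fit_predict(embeddings)
for seg, label in zip(result["segments"], labels):
    print(f'SPEAKER {label}: {seg["text"]}')
```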
Some libraries that support diarization
Appendix
OpenAI Whisper is trained on 680,000 hours of weakly supervised, multilingual and multitask audio data collected from the web.
Transcribe
Other ASR models