Hello World to Audio Transcribing with OpenAI Whisper

and how to record audio on Google Colab

Xin Cheng
2 min read · Nov 21, 2023

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.

Two concepts underpin this input representation: the Mel scale (a perceptual scale of pitches judged by listeners to be equal in distance from one another) and the spectrogram (the x-axis is the time step, the y-axis is frequency on a log scale, and the "color" axis is amplitude in decibels, roughly a log scale of amplitude).

In the Transformer architecture, the log-Mel spectrogram is fed to the encoder, and the decoder then generates the output text.
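As a concrete illustration, here is a minimal sketch of producing the log-Mel features Whisper consumes, using helpers that ship with the openai-whisper package (the audio path is a placeholder):

```python
import whisper

audio = whisper.load_audio("audio.wav")   # placeholder path; loads and resamples to 16 kHz
audio = whisper.pad_or_trim(audio)        # pad or trim to the 30-second window
mel = whisper.log_mel_spectrogram(audio)  # 80 Mel bins x 3000 time frames
print(mel.shape)                          # torch.Size([80, 3000])
```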

Transcribe
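A minimal transcription sketch with the openai-whisper package; the model size and file path are placeholders:

```python
# pip install -U openai-whisper (also requires ffmpeg on the system)
import whisper

model = whisper.load_model("base")        # other sizes: tiny, small, medium, large
result = model.transcribe("audio.wav")    # placeholder path to a recorded file
print(result["text"])                     # full transcription
for segment in result["segments"]:        # phrase-level timestamps
    print(segment["start"], segment["end"], segment["text"])
```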

Convert recorded audio to wav
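Browser recording in Colab typically yields a compressed format such as webm. A minimal conversion sketch with pydub (which wraps ffmpeg, preinstalled on Colab); the file names here are assumptions:

```python
from pydub import AudioSegment

audio = AudioSegment.from_file("recording.webm")     # hypothetical recorded file
audio = audio.set_frame_rate(16000).set_channels(1)  # 16 kHz mono suits Whisper
audio.export("recording.wav", format="wav")
```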

Speaker Diarization

OpenAI Whisper does not support diarization (distinguishing between the different speakers participating in a conversation). The video describes an approach that uses speaker embeddings and clustering to do this, sketched below.
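A hedged sketch of that idea: embed each Whisper segment with a speaker-embedding model, then cluster the embeddings so segments in the same cluster are attributed to the same speaker. The helper name and the fixed speaker count are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def assign_speakers(segment_embeddings: np.ndarray, num_speakers: int = 2) -> list[str]:
    """segment_embeddings: (n_segments, dim) array, one row per Whisper segment,
    produced by any speaker-embedding model (e.g. an ECAPA-TDNN encoder)."""
    labels = AgglomerativeClustering(n_clusters=num_speakers).fit_predict(segment_embeddings)
    return [f"SPEAKER {label + 1}" for label in labels]
```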

Some libraries support diarization out of the box; an example with one of them follows.
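For instance, pyannote.audio ships a pretrained diarization pipeline. A minimal sketch; the exact model id and the Hugging Face token are assumptions, and the model's user conditions must be accepted first:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # assumed model id
    use_auth_token="HF_TOKEN",           # placeholder Hugging Face token
)
diarization = pipeline("audio.wav")      # placeholder path
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```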

Appendix

OpenAI Whisper is trained on 680,000 hours of weakly supervised, multilingual and multitask data.

Transcribe

Other ASR models
