Hello World to Audio Transcribing with OpenAI Whisper
The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
Two concepts: the Mel scale (a perceptual scale of pitches judged by listeners to be equal in distance from one another) and the spectrogram (the x-axis is the time step, the y-axis is frequency on a log scale, and the "color" axis is amplitude in decibels, which is roughly a log scale of amplitude).
Transformer architecture: the log-Mel spectrogram is fed to the encoder, and the decoder generates the output text tokens.
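A minimal sketch of that front end using the openai-whisper package (the file name "audio.wav" is just a placeholder):

```python
import whisper

# Load audio (resampled to 16 kHz mono) and pad/trim it to the 30-second window
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that is passed to the encoder
mel = whisper.log_mel_spectrogram(audio)
print(mel.shape)  # (n_mels, n_frames), e.g. (80, 3000) for one 30-second chunk
```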
Transcribe
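A minimal transcription example with the openai-whisper Python package; the model size and file name below are placeholders:

```python
import whisper

# Load one of the pretrained model sizes (tiny, base, small, medium, large)
model = whisper.load_model("base")

# Transcribe a local audio file; Whisper handles chunking and decoding internally
result = model.transcribe("audio.wav")

print(result["text"])            # full transcript
for seg in result["segments"]:   # phrase-level timestamps
    print(f'[{seg["start"]:.2f} -> {seg["end"]:.2f}] {seg["text"]}')
```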
Convert recorded audio to wav
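One way to do the conversion is with pydub, which calls ffmpeg under the hood (ffmpeg must be installed; the file names and input format here are placeholders):

```python
from pydub import AudioSegment

# Load the recording (m4a is just an example input format) and convert it
# to 16 kHz mono WAV, the sample rate Whisper uses internally
recording = AudioSegment.from_file("recording.m4a")
recording = recording.set_frame_rate(16000).set_channels(1)
recording.export("recording.wav", format="wav")
```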
Speaker Diarization
OpenAI Whisper does not support diarization (distinguishing between the different speakers participating in a conversation). The video describes an approach that uses speaker embeddings and clustering on top of Whisper's output; a sketch of the idea is below.
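A rough sketch of that embedding-and-clustering idea, under these assumptions: the number of speakers is known in advance, and embed_segment is a placeholder for whatever pretrained speaker-embedding model you plug in.

```python
import numpy as np
import whisper
from sklearn.cluster import AgglomerativeClustering

NUM_SPEAKERS = 2      # assumed known in advance for this sketch
SAMPLE_RATE = 16000   # Whisper's internal sample rate

def embed_segment(segment_audio: np.ndarray) -> np.ndarray:
    """Placeholder: return a speaker embedding for one audio segment.
    In practice this would call a pretrained speaker-embedding model."""
    raise NotImplementedError

model = whisper.load_model("base")
audio = whisper.load_audio("meeting.wav")   # placeholder file name
result = model.transcribe("meeting.wav")

# Compute one speaker embedding per Whisper segment
embeddings = []
for seg in result["segments"]:
    start = int(seg["start"] * SAMPLE_RATE)
    end = int(seg["end"] * SAMPLE_RATE)
    embeddings.append(embed_segment(audio[start:end]))
embeddings = np.stack(embeddings)

# Cluster the embeddings into speakers and label each segment
labels = AgglomerativeClustering(n_clusters=NUM_SPEAKERS).fit_predict(embeddings)
for seg, label in zip(result["segments"], labels):
    print(f'SPEAKER {label}: {seg["text"]}')
```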
Some libraries that support diarization
Appendix
OpenAI Whisper is trained on 680,000 hours of weakly supervised, multilingual and multitask audio data collected from the web.
Transcribe
Other ASR models