Faster Audio Transcribing with OpenAI Whisper and Huggingface Transformers
In a previous article, we saw how to use OpenAI Whisper to transcribe audio and perform speaker diarization. It turns out the Huggingface transformers library supports speech recognition with OpenAI Whisper (meaning you can use the same pipeline pattern you already know from NLP tasks), and Huggingface has also released a “faster” version of Whisper, called Distil-Whisper.
With weak supervision, the encoder learns representations of speech, and a decoder is needed to turn those representations into text (with lots of heuristics along the way). Whisper was trained iteratively: start from initial models (or reuse existing ones such as wav2vec or models trained on LibriSpeech), label the dataset, refine, retrain, and repeat; there seems to be no magic path to such a state-of-the-art model. Walking through a bit of code helps make the encoder-decoder diagram concrete.
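To see the encoder and decoder as separate steps, here is a minimal sketch (the whisper-tiny checkpoint and the dummy LibriSpeech sample are illustrative assumptions, not code from this article's notebook):
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
# any 16 kHz waveform works; this dummy dataset is just a convenient sample
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
# encoder: log-Mel spectrogram -> hidden representations of the speech
with torch.no_grad():
    encoder_out = model.model.encoder(inputs.input_features)
print(encoder_out.last_hidden_state.shape)  # (batch, frames, hidden_size)
# decoder: autoregressively decodes those representations into text tokens
generated_ids = model.generate(input_features=inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))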
Using the Huggingface pipeline with OpenAI Whisper
Sample audio and the main code:
import torch
from transformers import pipeline
# build an ASR pipeline backed by Whisper large-v2
whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   device="cuda:0")  # if you don't have a GPU, remove this argument
transcription = whisper("Sample_audio_for_Whisper.mp3",
                        chunk_length_s=30)
print(transcription["text"][:500])
# Optimizing Performance with Chunk and Stride Length, https://huggingface.co/blog/asr-chunking
transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)
# with bettertransformer
from optimum.bettertransformer import BetterTransformer
whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   torch_dtype=torch.float16,
                   device="cuda:0")
whisper.model = BetterTransformer.transform(whisper.model)
# https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert
# alternatively, Optimum ships its own pipeline helper that can apply BetterTransformer for you
from optimum.pipelines import pipeline
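A minimal sketch of that route (the accelerator="bettertransformer" argument and the other keyword choices are my assumptions about Optimum's pipeline helper, not code from the original notebook):
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v2",
                   accelerator="bettertransformer",  # assumption: Optimum applies BetterTransformer internally
                   device="cuda:0")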
Huggingface Distil-Whisper
It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER of the original on out-of-distribution evaluation sets.
# pip3 install transformers optimum accelerate
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
whisper = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)
# result: {'text': ' Whisper is a transformer-based Incoder-decoder model also referred to as a sequence-to-sequence model. It was trained on 680K hours of labeled speech data annotated using large-scale week supervision.'}
Benchmark
Used http://www.moviesoundclips.net/download.php?id=3932&ft=mp3 (a 90-second audio clip) to compare the two: Distil-Whisper took about 4 seconds, while Whisper took about 1 minute.
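A minimal sketch of how such a timing comparison could be done (the time_transcription helper and the local file name are hypothetical, not the article's notebook code):
import time
def time_transcription(asr_pipeline, audio_file):
    # hypothetical helper: run one transcription and report wall-clock time
    start = time.perf_counter()
    result = asr_pipeline(audio_file, chunk_length_s=30, stride_length_s=5, batch_size=8)
    print(f"took {time.perf_counter() - start:.1f}s")
    return result
time_transcription(whisper, "movie_sound_clip_90s.mp3")  # hypothetical local copy of the clip above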
Observation: if you can pass use_flash_attention_2=True (a simple flag), you will get even better performance.
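A sketch of what that flag looks like when loading the model (it needs the flash-attn package and a supported GPU; everything else matches the Distil-Whisper snippet above):
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    use_flash_attention_2=True,  # requires the flash-attn package and a compatible GPU
)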
The notebook uses the following instead, because Flash Attention 2 is not available on Colab:
model.to(device)
model = model.to_bettertransformer() # we are using optimum BetterTransformer since Flash Attention 2 isn’t supported on Colab