Faster Audio Transcribing with OpenAI Whisper and Huggingface Transformers

Automatic speech recognition at scale

Xin Cheng
2 min read · Nov 23, 2023

In a previous article, we saw how to use OpenAI Whisper to transcribe audio and perform speaker diarization. It turns out the Hugging Face Transformers library supports speech recognition with OpenAI Whisper (meaning you can use the same pipeline pattern as in NLP), and Hugging Face has also released a "faster" distilled version of Whisper.

Weak supervision lets the encoder learn representations of speech; a decoder is then needed to turn those representations into text (with lots of heuristics involved). Whisper was trained iteratively: start from initial models (or other models such as wav2vec trained on LibriSpeech), label a dataset, refine, retrain, and repeat; there seems to be no magic path to such a SOTA model. A short code walk-through helps make the encoder-decoder diagram concrete.
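
As a minimal sketch of that encoder-decoder split (this is an illustrative assumption, not the article's original code: it loads the sample clip with librosa purely for decoding the mp3, and uses the small openai/whisper-tiny checkpoint to keep it light):

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the sample clip as a 16 kHz waveform (librosa is only used here to decode the mp3)
audio, _ = librosa.load("Sample_audio_for_Whisper.mp3", sr=16000)

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Encoder input: the processor converts raw audio into log-Mel spectrogram features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Decoder: generate() autoregressively produces text token ids conditioned on the encoder output
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])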

Using the Hugging Face pipeline with OpenAI Whisper

Sample audio and main code:

import torch
from transformers import pipeline

whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   device="cuda:0")  # if you don't have a GPU, remove this argument

transcription = whisper("Sample_audio_for_Whisper.mp3",
                        chunk_length_s=30)
print(transcription["text"][:500])

# Optimizing performance with chunk and stride length: https://huggingface.co/blog/asr-chunking
transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)

# With BetterTransformer
from optimum.bettertransformer import BetterTransformer
whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   torch_dtype=torch.float16,
                   device="cuda:0")
whisper.model = BetterTransformer.transform(whisper.model)

# optimum also ships its own pipeline wrapper:
# https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert
from optimum.pipelines import pipeline
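
The last import above points at optimum's own pipeline wrapper, which can apply BetterTransformer in one step instead of transforming the model by hand. A minimal sketch following the optimum tutorial linked above (the accelerator argument is what triggers the conversion):

from optimum.pipelines import pipeline

# optimum's pipeline converts the underlying model to BetterTransformer for you
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v2",
                   accelerator="bettertransformer",
                   device="cuda:0")

transcription = whisper("Sample_audio_for_Whisper.mp3", chunk_length_s=30)
print(transcription["text"][:500])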

Hugging Face Distil-Whisper

It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets.

# pip3 install transformers optimum accelerate
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

whisper = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)
# result: {'text': ' Whisper is a transformer-based Incoder-decoder model also referred to as a sequence-to-sequence model. It was trained on 680K hours of labeled speech data annotated using large-scale week supervision.'}

Benchmark

I used http://www.moviesoundclips.net/download.php?id=3932&ft=mp3 (a 90-second audio clip) to compare the two models: Distil-Whisper took about 4 seconds, while Whisper took roughly 1 minute.
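
A minimal sketch of how such a timing comparison can be run (the file name is a placeholder for the downloaded clip, and whisper stands for either of the pipelines built above):

import time

def timed_transcribe(pipe, path):
    # Time one chunked long-form transcription pass
    start = time.perf_counter()
    result = pipe(path, chunk_length_s=30, stride_length_s=5, batch_size=8)
    return result["text"], time.perf_counter() - start

text, seconds = timed_transcribe(whisper, "movie_clip_90s.mp3")
print(f"{seconds:.1f}s: {text[:100]}")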

Observation: if you can pass use_flash_attention_2=True (a simple flag at model load time), you will get better performance. The notebook instead uses the following, because Flash Attention 2 is not available on Colab:

model.to(device)
model = model.to_bettertransformer()  # using optimum's BetterTransformer since Flash Attention 2 isn't supported on Colab
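
For reference, a sketch of the Flash Attention 2 path when the flash-attn package and a supported GPU are available (newer transformers releases expose the same option as attn_implementation="flash_attention_2"):

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    use_flash_attention_2=True,  # requires flash-attn and a compatible GPU
)
model.to("cuda:0")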
