Faster Audio Transcribing with OpenAI Whisper and Huggingface Transformers
In a previous article, we saw how to use OpenAI Whisper to transcribe audio and perform speaker diarization. It turns out the Huggingface transformers library supports speech recognition with OpenAI Whisper (meaning you can use the same pipeline pattern you already know from NLP tasks), and Huggingface has also released a “faster” version of Whisper, called Distil-Whisper.
With weak supervision, the encoder learns representations of speech, and a decoder is needed to turn those representations into text (with lots of heuristics along the way). Whisper was trained iteratively: start from initial models (or reuse existing ones such as wav2vec or models trained on LibriSpeech), label the dataset, refine, retrain, and repeat; there seems to be no magic path to such a state-of-the-art model. Walking through a bit of code helps make the encoder-decoder diagram concrete.
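To see the encoder and decoder as separate steps, here is a minimal sketch (the whisper-tiny checkpoint and the dummy LibriSpeech sample are illustrative assumptions, not code from this article's notebook):
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
# any 16 kHz waveform works; this dummy dataset is just a convenient sample
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
# encoder: log-Mel spectrogram -> hidden representations of the speech
with torch.no_grad():
    encoder_out = model.model.encoder(inputs.input_features)
print(encoder_out.last_hidden_state.shape)  # (batch, frames, hidden_size)
# decoder: autoregressively decodes those representations into text tokens
generated_ids = model.generate(input_features=inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))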
Using the Huggingface pipeline with OpenAI Whisper
Sample audio and the main code:
import torch
from transformers import pipeline
# build an ASR pipeline backed by Whisper large-v2
whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   device="cuda:0")  # if you don't have a GPU, remove this argument
transcription = whisper("Sample_audio_for_Whisper.mp3",
                        chunk_length_s=30)
print(transcription["text"][:500])
# Optimizing Performance with Chunk and Stride Length, https://huggingface.co/blog/asr-chunking
transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)
# with bettertransformer
from optimum.bettertransformer import BetterTransformer
whisper = pipeline("automatic-speech-recognition",
                   "openai/whisper-large-v2",
                   torch_dtype=torch.float16,
                   device="cuda:0")
whisper.model = BetterTransformer.transform(whisper.model)
# https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert
# alternatively, Optimum ships its own pipeline helper that can apply BetterTransformer for you
from optimum.pipelines import pipeline
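A minimal sketch of that route (the accelerator="bettertransformer" argument and the other keyword choices are my assumptions about Optimum's pipeline helper, not code from the original notebook):
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v2",
                   accelerator="bettertransformer",  # assumption: Optimum applies BetterTransformer internally
                   device="cuda:0")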
Huggingface Distil-Whisper
It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER of the original on out-of-distribution evaluation sets.
# pip3 install transformers optimum accelerate
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
whisper = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
transcription = whisper("<LONGER AUDIO FILE.mp3>",
                        chunk_length_s=30,
                        stride_length_s=5,
                        batch_size=8)
# result: {'text': ' Whisper is a transformer-based Incoder-decoder model also referred to as a sequence-to-sequence model. It was trained on 680K hours of labeled speech data annotated using large-scale week supervision.'}
Benchmark
Used http://www.moviesoundclips.net/download.php?id=3932&ft=mp3 (a 90-second audio clip) to compare the two: Distil-Whisper took about 4 seconds, while Whisper took about 1 minute.
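A minimal sketch of how such a timing comparison could be done (the time_transcription helper and the local file name are hypothetical, not the article's notebook code):
import time
def time_transcription(asr_pipeline, audio_file):
    # hypothetical helper: run one transcription and report wall-clock time
    start = time.perf_counter()
    result = asr_pipeline(audio_file, chunk_length_s=30, stride_length_s=5, batch_size=8)
    print(f"took {time.perf_counter() - start:.1f}s")
    return result
time_transcription(whisper, "movie_sound_clip_90s.mp3")  # hypothetical local copy of the clip above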
Observation: if you can pass use_flash_attention_2=True (a simple flag), you will get even better performance.
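A sketch of what that flag looks like when loading the model (it needs the flash-attn package and a supported GPU; everything else matches the Distil-Whisper snippet above):
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    use_flash_attention_2=True,  # requires the flash-attn package and a compatible GPU
)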
The notebook uses the following instead, because Flash Attention 2 is not available on Colab:
model.to(device)
model = model.to_bettertransformer() # we are using optimum BetterTransformer since Flash Attention 2 isn’t supported on Colab