Whisper backend shootout: faster-whisper vs HF Transformers vs whisper.cpp (2026)

Three ways to run Whisper: faster-whisper (CTranslate2), Hugging Face Transformers (batched, SDPA), and whisper.cpp (ggml), benchmarked on real Danish telephone audio on an RTX 4070 Ti. faster-whisper + large-v3-turbo is the fastest overall (~56× real-time); whisper.cpp CUDA is close (~40×), and its CPU runs matched its own CUDA speed on all three test files. Includes speed (RTF), VRAM (HF Transformers only), and consensus-WER quality tables, plus a recommendation per use case.

TL;DR: On an RTX 4070 Ti with telephony-optimised audio preprocessing, faster-whisper with large-v3-turbo reaches ~56× real-time, the fastest overall. whisper.cpp CUDA is close (~40×); its VRAM use was not measured in this setup. Batched HF Transformers (SDPA, batch 24) reaches ~22× on the turbo model, on par with faster-whisper large-v3 (~23×). In the CPU rows, whisper.cpp matches its own CUDA speed on all three files (avg RTF 0.026 vs 0.025); faster-whisper int8 stays practical at ~5×.

Part of the Danish voice-log project — this is the backend benchmark behind that pipeline’s ASR choice. See also running Whisper locally.

Setup

GPU: NVIDIA GeForce RTX 4070 Ti (12 GB VRAM)
CPU: Intel Core i5-13400F (16 threads)
RAM: 125 GB
CUDA: 12.9 / Driver 595
faster-whisper: 1.2.1 (CTranslate2 4.8.0)
transformers: 5.10.2
whisper.cpp: compiled from source, CUDA enabled
PyTorch: 2.12.0

Audio test set

Three real Danish phone-call recordings from a Danish voice-log corpus (filenames anonymised to File 1–3):

File | Duration | Notes
File 1 | 1m 26s | Authentic Danish conversational speech
File 2 | 13m 26s | Authentic Danish conversational speech
File 3 | 1m 51s | Authentic Danish conversational speech

No reference transcriptions exist — quality is evaluated via consensus WER (pairwise agreement across models) and relative WER (distance to the consensus winner).

Backends under test

Backend | Engine | Key technique | Models
faster-whisper | CTranslate2 | int8/float16 quantised kernels | large-v3, large-v3-turbo
HF Transformers (batched) | PyTorch (SDPA) | batch_size=24 (the results list b2 for large-v3), flash-attention via SDPA | large-v3, large-v3-turbo
whisper.cpp | ggml | CUDA kernel offload / AVX CPU | large-v3, large-v3-turbo
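
For orientation, the faster-whisper configuration implied by this table and the methodology notes (float16 on GPU, int8 on CPU) might look like the sketch below. The helper names, the audio path and the language="da" argument are assumptions for illustration; this is not the benchmark's actual script.

```python
from typing import List


def compute_type_for(device: str) -> str:
    """Precision per device as described in the article: float16 on GPU, int8 on CPU."""
    return "float16" if device == "cuda" else "int8"


def transcribe(audio_path: str, model_name: str = "large-v3-turbo",
               device: str = "cuda") -> str:
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device=device,
                         compute_type=compute_type_for(device))
    # transcribe() returns a lazy generator of segments plus run metadata;
    # decoding only happens while the generator is consumed.
    segments, _info = model.transcribe(audio_path, language="da")
    texts: List[str] = [segment.text.strip() for segment in segments]
    return " ".join(texts)
```

A call such as transcribe("call.wav", device="cpu") (hypothetical file name) would then run the int8 path. Because the segments are produced lazily, any timing has to wrap the loop that consumes them, not just the transcribe() call.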

Results: speed (RTF — lower is better)

RTF = transcription time ÷ audio duration. RTF 0.03 = 33× real-time.
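
As a small worked example of the definition above, the following sketch converts between RTF, wall-clock time and the "× real-time" figure. The function names are illustrative, not part of any benchmark tooling.

```python
def rtf(transcription_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return transcription_seconds / audio_seconds


def speed_factor(rtf_value: float) -> float:
    """How many times faster than real time (the inverse of RTF)."""
    return 1.0 / rtf_value


# An 86-second call at RTF 0.021 takes roughly 1.8 s to transcribe.
print(round(0.021 * 86, 1))          # 1.8
print(round(speed_factor(0.03), 1))  # 33.3
```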

Backend / Model / Device | File 1 | File 2 | File 3 | Avg RTF | Speed
faster-whisper / large-v3-turbo / CUDA | 0.021 | 0.013 | 0.019 | 0.018 | 56×
faster-whisper / large-v3 / CUDA | 0.047 | 0.033 | 0.051 | 0.043 | 23×
HF Transformers (b24, SDPA) / large-v3-turbo / CUDA | 0.040 | 0.062 | 0.036 | 0.046 | 22×
HF Transformers (b2, SDPA) / large-v3 / CUDA | 0.180 | 0.142 | 0.222 | 0.181 | 5.5×
whisper.cpp / large-v3-turbo / CUDA | 0.028 | 0.017 | 0.030 | 0.025 | 40×
whisper.cpp / large-v3 / CUDA | 0.070 | 0.050 | 0.080 | 0.067 | 15×
faster-whisper / large-v3-turbo / CPU | 0.208 | 0.151 | 0.248 | 0.203 | 4.9×
faster-whisper / large-v3 / CPU | 0.470 | 0.354 | 0.714 | 0.513 | 1.9×
whisper.cpp / large-v3-turbo / CPU | 0.034 | 0.017 | 0.027 | 0.026 | 39×
whisper.cpp / large-v3 / CPU | 0.071 | 0.050 | 0.084 | 0.068 | 15×

Results: VRAM usage (GPU runs)

Backend / Model | Avg VRAM (MB)
HF Transformers (b24, SDPA) / large-v3-turbo / CUDA | 3884
HF Transformers (b2, SDPA) / large-v3 / CUDA | 5727

Results: quality (consensus WER)

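The per-file figures below are relative WER: each output's WER measured against that file's consensus winner, the output with the lowest mean pairwise WER against all the others, which therefore scores 0.000. Lower = closer to what the models agree on.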

Backend / Model / Device | File 1 | File 2 | File 3 | Avg relative WER
faster-whisper / large-v3-turbo / CUDA | 0.132 | 0.359 | 0.226 | 0.239
faster-whisper / large-v3 / CUDA | 0.230 | 0.334 | 0.294 | 0.286
HF Transformers (b24, SDPA) / large-v3-turbo / CUDA | 0.237 | 0.601 | 0.229 | 0.356
HF Transformers (b2, SDPA) / large-v3 / CUDA | 0.152 | 0.225 | 0.326 | 0.234
whisper.cpp / large-v3-turbo / CUDA | 0.000 | 0.673 | 0.000 | 0.224
whisper.cpp / large-v3 / CUDA | 0.187 | 0.000 | 0.553 | 0.247
faster-whisper / large-v3-turbo / CPU | 0.136 | 0.344 | 0.191 | 0.224
faster-whisper / large-v3 / CPU | 0.249 | 0.339 | 0.274 | 0.287
whisper.cpp / large-v3-turbo / CPU | 0.000 | 0.673 | 0.000 | 0.224
whisper.cpp / large-v3 / CPU | 0.187 | 0.000 | 0.553 | 0.247

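Quality winner (lowest average relative WER): faster-whisper / large-v3-turbo / CPU at 0.224, level with whisper.cpp / large-v3-turbo at the table's three-decimal precision (per-file sums 0.671 vs 0.673).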
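
The scoring behind this table can be sketched roughly as follows. This is a minimal illustration, assuming plain whitespace tokenisation, lower-casing, and that each output is scored as the hypothesis against every other output as reference; the benchmark's exact normalisation is not described, and the transcripts here are invented toy strings.

```python
from typing import Dict, Tuple


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # One-row dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def consensus_scores(outputs: Dict[str, str]) -> Tuple[str, Dict[str, float]]:
    """Return the consensus winner and each model's WER relative to it."""
    names = list(outputs)
    mean_pairwise = {
        a: sum(wer(outputs[b], outputs[a]) for b in names if b != a) / (len(names) - 1)
        for a in names
    }
    winner = min(mean_pairwise, key=mean_pairwise.get)
    relative = {a: wer(outputs[winner], outputs[a]) for a in names}
    return winner, relative


# Toy transcripts (invented for illustration, not benchmark output).
outputs = {
    "model_a": "lad os tage mødet på torsdag",
    "model_b": "lad os tage mødet på torsdag",
    "model_c": "lad os tage møde på tirsdag",
}
winner, relative = consensus_scores(outputs)
```

Here model_a and model_b agree exactly, so one of them becomes the winner with relative WER 0.000, while model_c scores 2 word errors out of 6. Note that a low score only means agreement with the other outputs; without reference transcripts it says nothing about absolute accuracy.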

Side-by-side: where the backends differed

The test audio is a private call, so rather than reproduce it, here’s a neutral Danish sentence standing in for the kind of conversational speech tested — plus the divergence patterns that actually showed up across backends on the 86-second file:

“Lad os tage mødet på torsdag, så gennemgår vi tallene sammen.” (illustrative)

  • Dropped openings. faster-whisper large-v3 (CUDA and CPU) sometimes started a few words in, omitting an opening sentence that the turbo models and whisper.cpp captured.
  • Proper-noun disagreement. The opening name was transcribed two different ways across backends — the classic low-resource-name problem, where each model guesses a different plausible spelling.
  • Garbled noun phrases on noisy segments. whisper.cpp large-v3 turned a clear noun phrase into a near-homophone that the turbo models got right.
  • Rare hallucinated fragments. faster-whisper large-v3 occasionally emitted a short nonsense phrase on a noisy stretch where the other backends stayed clean.
  • Minor spelling slips. The turbo CPU run produced the odd vowel typo not seen on GPU.

Net: the turbo models were the most consistently coherent; full large-v3 was strong but more prone to dropped openings and the occasional hallucination on this noisy telephone audio.

Key findings

  • Fastest GPU: faster-whisper / large-v3-turbo / CUDA at avg RTF 0.018 (56× real-time).
  • Turbo speedup (faster-whisper GPU): large-v3-turbo is 2.4× faster than large-v3.
  • Fastest CPU: whisper.cpp / large-v3-turbo / CPU at avg RTF 0.026.
  • GPU vs CPU (faster-whisper turbo): GPU is 11.4× faster than CPU int8.
  • Lowest measured VRAM: HF Transformers (b24, SDPA) / large-v3-turbo / CUDA at 3884 MB (only the two HF Transformers runs have VRAM figures).

Recommendations

Use case | Recommendation
Production pipeline (GPU server) | faster-whisper + large-v3-turbo + float16 (best speed/quality/VRAM)
Highest accuracy (GPU, no speed constraint) | faster-whisper + large-v3 + float16 (the full model; it did not beat turbo on consensus WER in this test)
Batched offline processing | HF Transformers batched (SDPA) + large-v3-turbo, batch_size=24
Embedded / no Python runtime | whisper.cpp CUDA (single binary, low overhead)
CPU-only server | faster-whisper + large-v3-turbo + int8
Edge / Raspberry Pi | whisper.cpp CPU + large-v3-turbo (smallest RAM footprint)

Methodology notes

  • Each backend/model/device combination runs as a separate process to guarantee full GPU memory reclamation between runs.
  • faster-whisper GPU uses compute_type=float16; CPU uses int8.
  • HF Transformers uses attn_implementation=sdpa (PyTorch SDPA / FlashAttention-equivalent, no extra install).
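  • whisper.cpp uses CUDA automatically when compiled with GGML_CUDA=ON, with no flag needed; CPU runs use -t 8 (system thread count). Since a CUDA build offloads by default, a run is only CPU-bound if GPU use is explicitly disabled or a CPU-only build is used; the whisper.cpp CPU and CUDA rows show identical relative WER and near-identical RTF, so the CPU figures should be read with that in mind.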
  • All audio is pre-processed with ffmpeg before transcription: highpass=f=200,lowpass=f=3400,afftdn=nr=10:nf=-25,dynaudnorm (telephone band-pass + light denoise + loudness normalisation).
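  • VRAM is measured via torch.cuda.max_memory_allocated(), which only tracks memory allocated through PyTorch. The VRAM table therefore covers the HF Transformers runs only: figures are not available for the whisper.cpp subprocess runs, and faster-whisper is absent too because CTranslate2 manages its GPU memory outside PyTorch.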
  • All audio is Danish conversational phone-call speech — results may vary on other languages or audio conditions.
  • Consensus WER is computed pairwise across all models; the model with the lowest mean pairwise WER is the “consensus winner”.
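
The preprocessing step in the notes above could be driven from Python along these lines. The filter chain is quoted from the methodology; the 16 kHz mono output settings, the function names and the file names are assumptions added for illustration, since the article does not state the output format.

```python
import subprocess
from typing import List

# Filter chain quoted from the methodology notes.
TELEPHONY_FILTERS = "highpass=f=200,lowpass=f=3400,afftdn=nr=10:nf=-25,dynaudnorm"


def build_ffmpeg_cmd(src: str, dst: str) -> List[str]:
    """Build the ffmpeg call; 16 kHz mono output is an assumption."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", src,
        "-af", TELEPHONY_FILTERS,
        "-ar", "16000",           # assumed: resample to 16 kHz
        "-ac", "1",               # assumed: downmix to mono
        dst,
    ]


def preprocess(src: str, dst: str) -> None:
    # check=True raises CalledProcessError if ffmpeg exits non-zero.
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Passing the command as a list avoids shell quoting issues with the filter string. Running preprocess("call_raw.wav", "call_clean.wav") (hypothetical paths) requires ffmpeg on the PATH.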