#asr

4 posts

Posts on X2Q tagged asr. Updated as new posts are published.

Jul 2, 2026
Fine-tuning Whisper for Danish phone calls: a LoRA post-mortem (2026)
Fine-tuned whisper-large-v3 on CoRal-project/coral-v3 (Danish speech, telephone-codec augmented) with LoRA. Training loss stayed stuck at ~35-40 across three consecutive attempts, which looked like large-v3 itself being unstable under LoRA. It wasn't: the warmup schedule was too long relative to how far training actually ran. With that fixed, the model nearly halved WER on held-out Danish speech versus the base model without fine-tuning. Then a systematic 30-call audit on real phone audio found a fabrication rate that the benchmark number didn't show. Published anyway, with an honest model card.
Jun 24, 2026
Five alternatives to pyannote: speaker diarization for phone calls (2026)
pyannote-audio 3.1 works, but it requires a Hugging Face token and takes 5 seconds per call. A benchmark of five alternatives on 8 real 2-party phone calls: WeSpeaker + spectral clustering (0.2 s), multi-scale AHC (0.6 s), PLDA + spectral (0.3 s), Silero VAD + spectral (2.1 s), and cross-call gallery nearest-neighbour (0.2 s). The Silero VAD approach won at 76.1% agreement and is 2.4× faster than pyannote. Surprise: all five collapsed on mono calls, so pyannote stays in production.
Jun 10, 2026
Transcribing 33,000 Danish voice logs on home GPUs — the local pipeline (2026)
Business phone calls had been recorded for 13 months: ~33,000 Danish mp3s, ~570 hours of 32 kbps phone audio. The job: transcribe everything, name the speakers, summarize per call/day/week, make it all browsable on a site, and do it locally with no LLM API. This is the build: a benchmark of Danish ASR models, a dual-model + Claude-fusion transcription pipeline, 'phone-first' speaker identification from metadata, self-healing infrastructure across two GPUs, and Claude Code subagents as the (API-less) summary layer. Plus what it teaches about applying the same pipeline to a customer-service function.
Jun 10, 2026
Whisper backend shootout: faster-whisper vs HF Transformers vs whisper.cpp (2026)
Three ways to run Whisper: faster-whisper (CTranslate2), Hugging Face Transformers (batched, SDPA), and whisper.cpp (ggml), benchmarked on real Danish telephone audio on an RTX 4070 Ti. faster-whisper + large-v3-turbo is the fastest overall (~56× real-time); whisper.cpp CUDA is close at a fraction of the VRAM and even matches GPU speed on CPU for short files. Full tables for speed (RTF), VRAM, and consensus-WER quality, plus a recommendation per use case.