Five alternatives to pyannote: speaker diarization for phone calls (2026)
pyannote-audio 3.1 works, but it requires a Hugging Face token and takes 5 seconds per call. A benchmark of five alternatives on 8 real 2-party phone calls: WeSpeaker + spectral clustering (0.2 s), multi-scale AHC (0.6 s), PLDA + spectral (0.3 s), Silero VAD + spectral (2.1 s), and cross-call gallery nearest-neighbour (0.2 s). The Silero VAD approach won at 76.1% agreement and, at 2.1 s versus 5 s per call, is 2.4× faster than pyannote. Surprise: all five collapsed on mono calls, so pyannote stays in production.
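
Several of the alternatives above end in a spectral clustering step. As a rough illustration of what that step can look like for a 2-party call, here is a minimal sketch using only NumPy. It assumes the per-segment speaker embeddings have already been extracted (by WeSpeaker or any other model) and uses one common formulation: cosine affinity, normalized graph Laplacian, and a split on the sign of the second eigenvector. The function name `spectral_two_speakers` is hypothetical, and this is not claimed to be the code used in the benchmark.

```python
import numpy as np


def spectral_two_speakers(embeddings):
    """Assign each segment embedding to speaker 0 or 1.

    Illustrative sketch only: `embeddings` is assumed to be an
    (n_segments, dim) array produced by some speaker-embedding model.
    """
    x = np.asarray(embeddings, dtype=float)
    # L2-normalise so the dot product is cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    # Cosine affinity, with negative similarities clipped to zero
    # and self-similarity removed.
    affinity = np.clip(x @ x.T, 0.0, None)
    np.fill_diagonal(affinity, 0.0)

    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.eye(len(x)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue carries the two-way split.
    _, vecs = np.linalg.eigh(laplacian)
    fiedler = vecs[:, 1] * d_inv_sqrt

    # With exactly two speakers, the sign of each entry is the cluster label.
    return (fiedler > 0).astype(int)
```

Which speaker ends up as 0 and which as 1 is arbitrary, so any comparison against a reference labelling has to allow for swapped labels. The sign split also only covers the 2-speaker case; more speakers would need several eigenvectors and a clustering step such as k-means on top.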