Publication result detail

BUT System for the MLC-SLM Challenge

Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget

Original title

BUT System for the MLC-SLM Challenge

English title

BUT System for the MLC-SLM Challenge

Type

Paper in proceedings not indexed in WoS or Scopus

Original abstract

We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW (a diarization-conditioned variant of Whisper) with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper’s multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data, such as missing speech segments and incorrect silence annotations, which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.

English abstract

We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW (a diarization-conditioned variant of Whisper) with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper’s multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data, such as missing speech segments and incorrect silence annotations, which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.

Keywords

DiCoW, Multilingual Multi-Talker ASR, DiariZen, Whisper

Keywords in English

DiCoW, Multilingual Multi-Talker ASR, DiariZen, Whisper

Authors

Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget

RIV year

2026

Published

22.08.2025

Publisher

ISCA

Place

ISCA

First page

23

Last page

27

Number of pages

5

URL

https://www.isca-archive.org/mlcslm_2025/polok25_mlcslm.pdf

BibTeX

@inproceedings{BUT199410,
  author="Alexander {Polok} and Jiangyu {Han} and Dominik {Klement} and Samuele {Cornell} and Jan {Černocký} and Lukáš {Burget}",
  title="BUT System for the MLC-SLM Challenge",
  year="2025",
  pages="23--27",
  publisher="ISCA",
  address="ISCA",
  doi="10.21437/mlcslm.2025-6",
  url="https://www.isca-archive.org/mlcslm_2025/polok25_mlcslm.pdf"
}

Documents