Publication result detail

Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

RANGAPPA, P.; CAROFILIS, A.; PRAKASH, J.; KUMAR, S.; BURDISSO, S.; MADIKERI, S.; VILLATORO-TELLO, E.; SHARMA, B.; MOTLÍČEK, P.; HACIOGLU, K.; VENKATESAN, S.; VYAS, S.; STOLCKE, A.

Original title

Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies, including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis, to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.

Keywords

speech recognition, data selection, whisper, zip-formers

Authors

RANGAPPA, P.; CAROFILIS, A.; PRAKASH, J.; KUMAR, S.; BURDISSO, S.; MADIKERI, S.; VILLATORO-TELLO, E.; SHARMA, B.; MOTLÍČEK, P.; HACIOGLU, K.; VENKATESAN, S.; VYAS, S.; STOLCKE, A.

RIV year

2026

Published

17.08.2025

Publisher

Isca-Int Speech Communication Assoc

Place

Rotterdam, The Netherlands

Book

Interspeech

Periodical

Interspeech

Country

France

Pages from

4928

Pages to

4932

Number of pages

5

URL

BibTeX

@inproceedings{BUT201433,
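  author="P. {Rangappa} and A. {Carofilis} and J. {Prakash} and S. {Kumar} and S. {Burdisso} and S. {Madikeri} and E. {Villatoro-Tello} and B. {Sharma} and Petr {Motlíček} and K. {Hacioglu} and S. {Venkatesan} and S. {Vyas} and A. {Stolcke}",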
  title="Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering",
  booktitle="Interspeech",
  year="2025",
  journal="Interspeech",
  pages="4928--4932",
  publisher="Isca-Int Speech Communication Assoc",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2580",
  url="https://www.fit.vut.cz/research/group/speech/public/publi/2025/rangappa_INTERSPEECH_2025_co-author_Motlicek.pdf"
}

Documents