R&D result detail

Original title

TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions, a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target-speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.

Keywords

Self-supervised learning, target-speaker speech processing, speech recognition, speech enhancement, voice activity detection

Authors

PENG, J.; ASHIHARA, T.; DELCROIX, M.; OCHIAI, T.; PLCHOT, O.; ARAKI, S.; ČERNOCKÝ, J.

Published

06.04.2025

Publisher

IEEE Signal Processing Society

Location

Hyderabad

ISBN

979-8-3503-6874-1

Book

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Pages from

1

Pages to

5

Number of pages

5

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10887574

BibTeX

@inproceedings{BUT198051,
  author="PENG, J. and ASHIHARA, T. and DELCROIX, M. and OCHIAI, T. and PLCHOT, O. and ARAKI, S. and ČERNOCKÝ, J.",
  title="TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2025",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Hyderabad",
  doi="10.1109/ICASSP49660.2025.10887574",
  isbn="979-8-3503-6874-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10887574"
}

Documents

TS-SUPERB_A_Target_Speech_Processing_Benchmark_for_Speech_Self-Supervised_Learning_Models
