Publication result detail

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

PÁLKA, P.; LANDINI, F.; KLEMENT, D.; DIEZ SÁNCHEZ, M.; SILNOVA, A.; BURGET, L.; DELCROIX, M.

Original title

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

Type

Paper in proceedings indexed in the WoS or Scopus database

Original abstract

In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD and OSD simultaneously and reach competitive performance at a fraction of the inference time of a modular approach. Furthermore, the joint inference leads to a simplified overall pipeline which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.

Keywords

speaker diarization, speaker embedding, voice activity detection, overlapped speech detection

Authors

PÁLKA, P.; LANDINI, F.; KLEMENT, D.; DIEZ SÁNCHEZ, M.; SILNOVA, A.; BURGET, L.; DELCROIX, M.

RIV year

2026

Published

08.09.2025

Publisher

IEEE Signal Processing Society

Place

Palermo

ISBN

978-9-46-459362-4

Book

Proceedings of 33rd European Signal Processing Conference (EUSIPCO 2025)

Pages from

31

Pages to

35

Number of pages

5

URL

BibTeX

@inproceedings{BUT198669,
  author="Petr {Pálka} and Federico Nicolás {Landini} and Dominik {Klement} and Mireia {Diez Sánchez} and Anna {Silnova} and Lukáš {Burget} and Marc {Delcroix}",
  title="Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization",
  booktitle="Proceedings of 33rd European Signal Processing Conference (EUSIPCO 2025)",
  year="2025",
  pages="31--35",
  publisher="IEEE Signal Processing Society",
  address="Palermo",
  doi="10.23919/EUSIPCO63237.2025.11226253",
  isbn="978-9-46-459362-4",
  url="https://eusipco2025.org/wp-content/uploads/pdfs/0000031.pdf"
}

Documents