Publication result detail

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

PÁLKA, P.; LANDINI, F.; KLEMENT, D.; DIEZ SÁNCHEZ, M.; SILNOVA, A.; DELCROIX, M.; BURGET, L.

Original Title

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

English Title

Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

Type

Paper in proceedings outside WoS and Scopus

Original Abstract

Despite the current popularity of end-to-end diarization systems, modular systems comprising voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) the different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD, and OSD simultaneously, reaching competitive performance at a fraction of the inference time of a modular approach. Furthermore, the joint inference leads to a simplified overall pipeline, which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.

English abstract

Despite the current popularity of end-to-end diarization systems, modular systems comprising voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) the different modules independently. In this work, we propose an approach to jointly train a model to produce speaker embeddings, VAD, and OSD simultaneously, reaching competitive performance at a fraction of the inference time of a modular approach. Furthermore, the joint inference leads to a simplified overall pipeline, which brings us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
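
The abstract describes training a single model that outputs speaker embeddings, VAD, and OSD at once. Purely as an illustration, not the authors' architecture, a minimal multi-task sketch in PyTorch could look like the code below; the encoder, head sizes, equal loss weighting, and names such as JointDiarizationModel, joint_loss, and spk_classifier are hypothetical assumptions.

# Minimal sketch (not the paper's code): shared encoder with three outputs --
# per-frame VAD logits, per-frame OSD logits, and a pooled speaker embedding.
# All layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class JointDiarizationModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, emb_dim=192):
        super().__init__()
        # Shared frame-level encoder over log-mel features.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Frame-level heads: speech activity and overlapped speech (logits).
        self.vad_head = nn.Conv1d(hidden, 1, kernel_size=1)
        self.osd_head = nn.Conv1d(hidden, 1, kernel_size=1)
        # Segment-level speaker embedding via mean pooling.
        self.emb_head = nn.Linear(hidden, emb_dim)

    def forward(self, feats):                 # feats: (batch, n_mels, frames)
        h = self.encoder(feats)               # (batch, hidden, frames)
        vad = self.vad_head(h).squeeze(1)     # (batch, frames)
        osd = self.osd_head(h).squeeze(1)     # (batch, frames)
        emb = self.emb_head(h.mean(dim=2))    # (batch, emb_dim)
        return emb, vad, osd

# Joint training: a speaker-classification loss on the embedding plus binary
# cross-entropy on the VAD/OSD logits; the equal weighting is arbitrary here.
def joint_loss(emb, vad, osd, spk_classifier, spk_labels, vad_labels, osd_labels):
    bce = nn.functional.binary_cross_entropy_with_logits
    ce = nn.functional.cross_entropy
    return (ce(spk_classifier(emb), spk_labels)
            + bce(vad, vad_labels)
            + bce(osd, osd_labels))

In such a setup, spk_classifier would be a classification head (e.g., a linear layer over the training speakers) used only during training, while at inference the pooled embeddings feed a clustering step and the VAD/OSD outputs drive segmentation and overlap handling.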

Keywords

speaker diarization, speaker embedding, voice activity detection, overlapped speech detection

Key words in English

speaker diarization, speaker embedding, voice activity detection, overlapped speech detection

Authors

PÁLKA, P.; LANDINI, F.; KLEMENT, D.; DIEZ SÁNCHEZ, M.; SILNOVA, A.; DELCROIX, M.; BURGET, L.

Released

08.09.2025

Publisher

IEEE Signal Processing Society

Location

Palermo

ISBN

978-9-46-459362-4

Pages from

31

Pages to

35

Pages count

5

URL

https://www.fit.vut.cz/research/publication/13567/

BibTeX

@inproceedings{BUT198669,
  author="Petr {Pálka} and Federico Nicolás {Landini} and Dominik {Klement} and Mireia {Diez Sánchez} and Anna {Silnova} and Marc {Delcroix} and Lukáš {Burget}",
  title="Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization",
  year="2025",
  pages="31--35",
  publisher="IEEE Signal Processing Society",
  address="Palermo",
  isbn="978-9-46-459362-4",
  url="https://www.fit.vut.cz/research/publication/13567/"
}
