Doctoral Thesis

From Modular to End-to-End Speaker Diarization

Final Thesis 4.79 MB Summary of Thesis 4.79 MB

Author of thesis: Federico Nicolás Landini, Ph.D.

Acad. year: 2023/2024

Supervisor: doc. Ing. Lukáš Burget, Ph.D.

Reviewers: Herve Bredin, Ph.D., Sriram Ganapathy

Abstract:

Speaker diarization is usually described as the task of determining "who spoke when" in a recording. Until a few years ago, all competitive approaches were modular, i.e. voice activity detection, segmentation, embedding extraction, clustering, and overlapped speech detection and handling were tackled by different sub-systems applied in a cascaded fashion. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, end-to-end models, which handle all aspects of speaker diarization with a single model and perform better on overlapped speech, have attracted considerable attention.
 
This thesis was written during a period in which these two trends coexist. We describe VBx, a system that uses a Bayesian hidden Markov model to cluster x-vectors (speaker embeddings obtained with a neural network) and that has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate results on different relevant corpora. Then, we move towards end-to-end neural diarization (EEND) methods. Because these models require large training sets and manually annotated diarization data is not available in sufficient quantities, the compromise solution is to generate training data artificially. We describe an approach for generating synthetic data that resembles real conversations in terms of speaker turns and overlaps. We show that training the popular EEND with encoder-decoder attractors (EEND-EDA) on these "simulated conversations" yields better performance than training on the previously proposed "simulated mixtures". We also propose a new EEND-based model, which we call DiaPer, and show that it can perform better than EEND-EDA, especially when dealing with many speakers and handling overlapped speech.
Finally, we compare both VBx-based and DiaPer systems on a wide variety of corpora and comment on the advantages of each technique.

Keywords:

Speaker diarization, VBx, end-to-end neural diarization, simulated conversations, DiaPer.

Date of defence

27.06.2024

Result of the defence

Defended (thesis was successfully defended)


Process of defence

The student presented the goals and results achieved in the dissertation. In the discussion, the student competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 7. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee unanimously recommends awarding the thesis the dean's prize for an exceptionally high-quality dissertation.

Language of thesis

English

Faculty

Department

Study programme

Computer Science and Engineering (CSE-PHD-4)

Field of study

Computer Science and Engineering (DVI4)

Composition of Committee

doc. Ing. Jan Kořenek, Ph.D. (chair)
doc. Ing. Zdeněk Žabokrtský, Ph.D. (member)
doc. Mgr. Hana Rudová, Ph.D. (member)
prof. Ing. Hynek Heřmanský, Dr. Eng. (member)
Assoc. Prof. Sriram Ganapathy, PhD. (member)

Supervisor’s report
doc. Ing. Lukáš Burget, Ph.D.

File inserted by supervisor Size
Supervisor’s evaluation [.pdf] 56.11 kB

Reviewer’s report
Herve Bredin, Ph.D.

File inserted by the reviewer Size
Reviewer’s report [.pdf] 87.93 kB

Reviewer’s report
Sriram Ganapathy

File inserted by the reviewer Size
Reviewer’s report [.pdf] 82.92 kB

Responsibility: Mgr. et Mgr. Hana Odstrčilová