Doctoral Thesis

Far-Field Speaker Verification Incorporating Multichannel Processing

Author of thesis: Ing. Ladislav Mošner, Ph.D.

Acad. year: 2024/2025

Reviewers: Marc Delcroix, Prof. Dr. Reinhold Häb-Umbach

Abstract:

Far-field speech processing has gained increasing attention in recent years with the advent of smart speakers, home assistants, and meeting transcription systems. To support these applications, robust far-field speech processing techniques are required. A key task enabling personalized interaction is speaker verification. Compared to close-talking conditions, far-field systems face additional challenges such as reverberation and background noise, which degrade the target speech. To mitigate these effects, far-field devices typically employ microphone arrays that provide spatial information. These challenges and opportunities motivate this thesis, which focuses on multi-channel speaker verification.

Despite significant progress in related fields of speech processing, multi-channel speaker verification remains underexplored, hindered by a shortage of both data resources and specialized techniques. This thesis addresses both aspects. On the data side, we repurposed existing publicly available corpora and created the MultiSV dataset, which provides simulated multi-channel mixtures with speech/noise training targets and speaker labels. MultiSV also defines multiple evaluation protocols based on retransmitted recordings, supporting various scenarios, such as single clean versus multi-channel corrupted enrollment. To support the training of more data-demanding models, we further introduced an extended dataset, MultiSV2.

On the modeling side, we first approached multi-channel speaker embedding extraction using a cascaded strategy, decomposing the problem into multi-channel preprocessing and single-channel embedding extraction. Motivated by advances in speech separation, we designed models ranging from signal-processing-based methods to hybrid neural network and beamforming front-ends. Notably, we proposed direct and indirect mask prediction for mask-based beamforming, and the reference-channel attention (RCA) combiner, which generalizes single-channel separation models to multi-channel inputs.

Recognizing the limitations of cascaded models, such as error propagation and the differing objectives of the modules, we next explored unified architectures for multi-channel embedding extraction. Leveraging MultiSV2, we fine-tuned cascaded components jointly with the end-task loss, and subsequently introduced METRO, a general framework that extends self-supervised speech representation models to multi-channel settings. Although METRO is used here to yield multi-channel speaker embeddings, it is general and potentially applicable to other speech processing tasks.

Keywords:

multi-channel speaker verification, microphone arrays, beamforming, speech separation, speaker embedding extraction, MultiSV

Date of defence

14.01.2026

Result of the defence

Defended (thesis was successfully defended)

Process of defence

The student presented the goals and results achieved in his dissertation work. The student competently answered the questions of the committee members and reviewers. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 4. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee unanimously recommends, with the support of the reviewers, that the thesis be awarded the Dean's Award for an exceptionally high-quality dissertation. The candidate demonstrated excellent technical results, excellent presentation and pedagogical skills, and excellent publication activity, including a Google Scholar h-index of 14.

Language of thesis

English

Faculty

Faculty of Information Technology

Department

Department of Computer Graphics and Multimedia

Study programme

Information Technology (DIT)

Composition of Committee

doc. Ing. Zdeněk Vašíček, Ph.D. (chair)
prof. Ing. Zbyněk Koldovský, Ph.D. (member)
doc. Ing. Pavel Král, Ph.D. (member)
doc. Ing. Jiří Schimmel, Ph.D. (member)
doc. RNDr. Petr Sojka, Ph.D. (member)

Supervisor’s report
prof. Dr. Ing. Jan Černocký

To conclude, we fully recommend Ladislav Mošner’s Ph.D. thesis for defense, wish him all the best in his professional and personal life, and look forward to continuing to work with him.

File inserted by supervisor	Size
Supervisor’s report [.pdf]	67.39 kB

Reviewer’s report
Marc Delcroix

I have carefully reviewed the doctoral thesis of Mr. Ladislav Mošner. Despite a few recommendations and some points I raised for discussion, the thesis represents a significant contribution to the field of SV and will provide new opportunities for research and technological development. The work achieved is original and considerable. The investigations were carried out with great attention to detail. The material of the thesis is based on numerous peer-reviewed international conference papers, submitted to top conferences in the field. For all these reasons, I believe that the doctoral thesis meets the requirements of the proceedings leading to the conferment of the PhD title.

Topics for thesis defence:

The approach taken by the candidate aligns with the research in multi-channel robust ASR, which has yielded very promising results. However, I wonder if this is the optimal approach for SV. Indeed, ASR needs to recover all parts of the speech to transcribe each uttered word accurately. In contrast, SV may not need to recover the whole speech content to capture the speaker's identity. Therefore, it may be better to completely ignore unreliable parts of the captured signal, caused by loud noise, etc. Could the candidate comment on this point?
In Chapter 5, although RCA seems to bring improvements in the case of the Conv-TasNet configuration, the improvement appears small, and its significance is not measured. Note that in this chapter, all techniques are evaluated in terms of speech enhancement metrics, which, as Chapter 6 reveals, are not well correlated with SV performance. Could the candidate comment on whether speech enhancement metrics are useful for selecting the front-end for multi-channel SV?
The tendency of the results is not always the same for the dev and eval sets. Could the candidate comment on these differences and how to practically choose the multi-channel SV system configuration?

File inserted by the reviewer	Size
Reviewer’s report [.pdf]	220.43 kB

Reviewer’s report
Prof. Dr. Reinhold Häb-Umbach

To conclude, in my opinion the doctoral thesis meets all requirements of the proceedings leading to a PhD conferment.

Topics for thesis defence:

The MultiSV data sets contain retransmissions as the development and evaluation data. While this is a good compromise between pure simulation and real recordings of speakers, I wonder what the candidate’s opinion is about the validity of drawn conclusions for an application with real speakers. Do you expect head movements and thus time-varying transfer characteristics to be critical, or why do you consider them to be not critical?
Considering a beamforming frontend for either speaker verification or ASR, do you expect the downstream task to influence what will come out as the best performing front end (not considering fine-tuning of front-end with gradients from downstream task)?
Fine-tuning the beamforming front-end with a speaker verification objective led to improved performance. Can you see any difference in the obtained beampatterns with and without fine-tuning?
What is your overall conclusion/recommendation: Should we use a beamforming front-end or employ a multi-channel extension of pretrained models, such as WavLM, instead? Compare the two both in terms of verification performance and computational complexity.

File inserted by the reviewer	Size
Reviewer’s report [.pdf]	183.08 kB

Responsibility: Mgr. et Mgr. Hana Odstrčilová
