Master's Thesis

Speaker conditioned semantic retrieval from speech databases

Author of thesis: Ing. Jan Zdeněk

Acad. year: 2025/2026

Abstract:

This thesis investigates speaker-conditioned semantic retrieval from speech databases, where retrieved results must satisfy both semantic relevance and speaker identity constraints. Existing speech retrieval systems typically rely on automatic speech recognition followed by text-based retrieval and therefore focus primarily on semantic content while disregarding speaker-specific information. To address this limitation, this thesis first establishes semantic and speaker retrieval baselines using modern embedding models and then explores a cascade retrieval framework that combines semantic similarity with speaker-based filtering. Building on this approach, an end-to-end retrieval model is proposed that jointly learns semantic and speaker representations within a shared embedding space. The proposed methods are evaluated on the SLUE-SQA5 dataset and on a custom dataset derived from the Fisher conversational speech corpus. Experimental results demonstrate that incorporating speaker information improves retrieval performance over semantic-only retrieval and that the proposed end-to-end model consistently outperforms both semantic and cascade-based approaches without requiring manual threshold tuning.
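The cascade idea mentioned in the abstract (semantic similarity combined with speaker-based filtering) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the cosine-similarity scoring, the `spk_threshold` value, and the toy two-dimensional embeddings are all assumptions made for illustration. It shows only the general pattern of ranking by semantic score after discarding candidates whose speaker similarity falls below a manually chosen threshold.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def cascade_retrieve(query_sem, query_spk, doc_sem, doc_spk,
                     spk_threshold=0.5, top_k=3):
    """Hypothetical cascade: filter by speaker similarity, then rank
    the surviving documents by semantic similarity (best first)."""
    sem_scores = l2_normalize(doc_sem) @ l2_normalize(query_sem)
    spk_scores = l2_normalize(doc_spk) @ l2_normalize(query_spk)
    # Keep only documents whose speaker similarity reaches the threshold.
    candidates = np.flatnonzero(spk_scores >= spk_threshold)
    # Sort the remaining candidates by descending semantic similarity.
    order = candidates[np.argsort(-sem_scores[candidates])]
    return order[:top_k].tolist()


# Toy data (assumed): three documents, 2-D semantic and speaker embeddings.
doc_sem = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
doc_spk = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
query_sem = np.array([1.0, 0.0])
query_spk = np.array([1.0, 0.0])

result = cascade_retrieve(query_sem, query_spk, doc_sem, doc_spk)
print(result)  # document 1 is semantically close but spoken by another speaker
```

In this toy setup, document 1 is dropped by the speaker filter despite its high semantic score. The outcome depends on the hand-picked `spk_threshold`, which is the kind of manual tuning the abstract says the end-to-end model avoids.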

Keywords:

semantic retrieval, speaker-conditioned retrieval, speech retrieval, multimodal retrieval, dense embeddings, contrastive learning, speaker embeddings

Date of defence

25.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

A

Process of defence

The student first presented the results achieved in the thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions, the committee decided to grade the thesis A.

Topics for thesis defence

What does the improvement consist of?
What evidence supports the improvement?

Language of thesis

English

Faculty

Fakulta informačních technologií

Department

Department of Computer Graphics and Multimedia

Study programme

Information Technology and Artificial Intelligence (MITAI)

Specialization

Machine Learning (NMAL)

Composition of Committee

prof. Dr. Ing. Jan Černocký (chair)
prof. Ing. Hynek Heřmanský, Dr. Eng. (vice-chair)
prof. RNDr. Alexandr Meduna, CSc. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. František Grézl, Ph.D. (member)
Ing. Martin Fajčík, Ph.D. (member)

Supervisor’s report
Santosh Kesiraju, Ph.D.

- Overall, the student has done excellent work. The student has come up with novel ideas for fusing embeddings from different modalities. The proposed approach has been shown to perform better than the cascaded baseline setups.

- The experiments and analysis were rigorous.

Evaluation criteria	Verbal classification
Information on the assignment	The work is moderate-to-hard in terms of difficulty. It addresses a fairly new problem of "speaker conditioning" in semantic retrieval. The experiments and findings of this work make it a novel contribution to the research community.
Activity during completion	- Incremental drafts of the thesis were submitted for consultation well in advance.
Publication activity, awards	- Based on the work done, the student is planning to submit an article to IEEE Spoken Language Technology.
Work with literature	Adequate.
Activity during the work, consultations, communication	- Consultations were regular.

Points proposed by supervisor: 94

Grade proposed by supervisor: A

Reviewer’s report
Ing. Oldřich Plchot, Ph.D.

I view this work as very good. The qualities are evident, especially in the experimental and interpretation parts. I believe the work can serve as a basis for a high-quality scientific publication at a top conference and can be further extended by the author himself or at speech@FIT, where similar topics are also being researched.

Evaluation criteria	Verbal classification	Points
Extent to which the assignment requirements were met	Evaluation level: assignment fulfilled. In my opinion, the assignment was fully completed.
Length of the technical report	Evaluation level: within the usual range. The technical report is concise and falls on the lower end of the recommended length. The individual chapters, however, contain sufficient information to understand the topic and point to additional literature. The most important part, the description of the actual work, is information-rich and well written.
Presentation level of the technical report	The thesis is well-structured, with chapters that logically follow each other. Sometimes the text seems overly structured and hard to read. For example, the author uses three levels of subsections, and some are very short, which hurts readability (e.g., sections 4.2.3, 4.2.5, and 4.2.6).	80
Formal aspects of the technical report	I have only very minor issues with typography and no issues with the English language. Some parts and the structure, however, are reminiscent of a large language model's suggestions, and this could be improved. Also, the very short subsections mentioned above do not look good typographically.	90
Work with literature	Overall, the literature is cited according to the standard, and the selection of references is appropriate where they are used. I would appreciate slightly denser referencing and linking. For example, concrete links to and configurations of the models used would be useful, even as footnotes. I also did not find a reference for the generative speaker embedding extractor (the i-vector approach).	80
Implementation output	The thesis addresses an interesting research problem: joint semantic and speaker-aware retrieval from speech documents. The outputs of the work are of a scientific nature and include clear conclusions and validation across multiple datasets. I recommend being more precise in the conclusions where the author mentions implementing a successful end-to-end pipeline. This is, in my opinion, not entirely accurate, as the final useful system combines two frozen pre-trained models. This does not fully match the general understanding of an end-to-end system, where all parameters are typically tuned together.	95
Usability of the results	The work is original, solves a research problem, and the outputs are worth publishing at top speech and NLP-focused conferences. In practice, it can serve as a reference for implementing a system for an entity interested not only in semantic document retrieval but also in retrieving documents spoken by a known person.
Difficulty of the assignment	Evaluation level: more difficult assignment. The assignment is more difficult than average, as it addresses problems that are currently among the actively researched topics in state-of-the-art document retrieval.

Points proposed by reviewer: 85

Grade proposed by reviewer: B

Responsibility: Mgr. et Mgr. Hana Odstrčilová
