Master's Thesis

Retrieval-augmented speech-based question answering system

Author of thesis: Ing. Maxim Plička

Acad. year: 2025/2026

Abstract:

This thesis investigates question answering over long spoken documents using Large Audio Language Models without transcription into text. A text-based baseline is first established on the SLUE-SQA5 benchmark using transcripts, dense retrieval, reranking, and extractive question answering, followed by evaluation of audio-based question answering using Qwen2-Audio-7B-Instruct. The work identifies several limitations of this audio-language model, including truncation of audio inputs longer than 30 seconds, inefficient scaling of audio representations, and degraded performance with increasing retrieval depth. To address these issues, a modified long-audio processing pipeline and multiple audio subsampling methods are proposed. Experimental results show that the proposed methods improve long-audio question answering performance and scalability, while parameter-efficient fine-tuning using LoRA adapters further improves benchmark accuracy. Additional analysis, however, reveals that the model often relies more on internal language priors than on retrieved audio evidence. The findings highlight current limitations of Qwen2-Audio-7B-Instruct in retrieval-augmented long-audio question answering and emphasize the need for methods that better utilize retrieved acoustic information.

Keywords:

audio question answering, speech question answering, retrieval-augmented generation, audio-language models, long-context processing, multimodal learning, audio subsampling, question answering, Qwen2-Audio, LoRA fine-tuning

Date of defence

25.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

A

Process of defence

The student first presented the results achieved in the thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions asked, the committee decided to grade the thesis A.

Topics for thesis defence

You present evidence of shortcut learning in your audio-LLM. Could you propose concrete ways to further analyze and isolate this phenomenon (e.g. differentiating reliance on prior knowledge vs. exploiting lexical characteristics of the SLUE benchmark) and also mitigate it in future experiments?
How do you assess the chosen benchmark?

Language of thesis

English

Faculty

Faculty of Information Technology

Department

Department of Computer Graphics and Multimedia

Study programme

Information Technology and Artificial Intelligence (MITAI)

Specialization

Machine Learning (NMAL)

Composition of Committee

prof. Dr. Ing. Jan Černocký (chair)
prof. Ing. Hynek Heřmanský, Dr. Eng. (vice-chair)
prof. RNDr. Alexandr Meduna, CSc. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. František Grézl, Ph.D. (member)
Ing. Martin Fajčík, Ph.D. (member)

Supervisor’s report
Santosh Kesiraju, Ph.D.

- Overall, the student has done great work in terms of architectural changes for speech LLMs, experiments, and analysis for spoken question answering.

- I believe the findings from this thesis are important for the research community, especially the shortfalls of "fine-tuning speech LLMs", where models find shortcuts to the "correct answer" instead of grounding their response in the provided context (documents).

Evaluation criteria	Verbal classification
Assignment information	The thesis goals are moderate to hard, since the work deals with recent advances in speech LLMs. The problem of spoken question answering, however, is not new. The work is primarily research in nature, and the experiments and findings of the thesis reflect this clearly.
Activity during completion	Incremental drafts of the thesis were reviewed and consulted in advance. However, since the work is research in nature, several experiments and analyses were still being carried out until the last minute. I see this as a positive sign, since the student was constantly trying to improve the work.
Publications, awards	The student plans to submit the work to an upcoming conference or workshop.
Work with literature	The literature study conducted by the student is adequate.
Activity during the work, consultations, communication	Consultations were regular throughout the work.

Points proposed by supervisor: 94

Grade proposed by supervisor: A

Reviewer’s report
Ing. Šimon Sedláček

The thesis addresses a challenging and timely topic, and the student handled it with clear competence. While there are minor shortcomings on the presentation side of the thesis, they should not detract from the quality of the presented technical and experimental work, the results of which are valuable to the wider field of audio-LLM-based spoken QA.

Evaluation criteria	Verbal classification	Points
Extent of fulfilment of assignment requirements	Evaluation level: assignment fulfilled. The assignment was fulfilled completely and without exception. The student also presents additional results and findings related to the crucial issue of shortcut learning in audio-LLMs, where the input audio documents are not attended to when generating the answers.
Length of the technical report	Evaluation level: within the usual range. The thesis length is within the standard range.
Presentation level of the technical report	I found the theoretical introduction (Chapter 2) a little hard to digest: while the student provides a comprehensive overview of spoken question answering (QA) and retrieval-augmented generation (RAG) as a whole, some of the presented concepts are not directly related to the experimental work itself. On the other hand, audio-LLMs receive only limited attention even though they are a core topic of the thesis. Some of the section names would also benefit from more specific wording for better reader orientation. Lastly, certain pieces of information are unnecessarily repeated between neighbouring sections. Despite these shortcomings, the overall structure of the thesis makes sense to the reader, and the experimental chapters are of significantly higher quality in terms of presentation and flow; I find that they offset the shortcomings of the theoretical parts.	85
Formal aspects of the technical report	The thesis is written in near-perfect English with a few minor mistakes that were presumably not caught during review. From the typographical standpoint, I note that table captions are placed below the tables rather than above them.	92
Work with literature	The student cites relevant prior works in reasonable breadth with respect to the thesis topic. When covering certain LALM concepts, survey articles are cited where it would perhaps be more appropriate to also cite the original works. Also, some of the bibliography entries are arXiv preprints of articles that were later published at a conference or in a journal and should be cited as such.	90
Implementation output	The code is structured and well-documented.	100
Usability of results	Overall, the thesis presents valuable results and analysis in the currently highly relevant domain of spoken QA. I find that the methodology is well controlled, with every part of both the baselines and the final system well ablated. The experimental results on adapting the Qwen audio-LLM for spoken QA are valuable for the wider spoken QA community, providing insights into the current shortcomings of such models and how they can be addressed. I commend the presentation of evidence of shortcut learning, where the model does not attend to the spoken inputs, as such potentially hidden issues of audio-LLMs should be more widely understood in the audio/speech-LLM domain than they currently are. In my overall opinion, the work presented in the thesis warrants publication.
Difficulty of the assignment	Evaluation level: considerably difficult assignment. The topic requires the student to become closely acquainted with the state of the art not only in retrieval-augmented generation, but also in the domain of audio-language models and their challenges, which are currently often poorly documented compared to those of standard LLMs, as careful architectural changes to existing models are necessary to solve the audio-LLM context window problems addressed in the thesis.

Topics for thesis defence:

You present evidence of shortcut learning in your audio-LLM. Could you propose concrete ways to further analyze and isolate this phenomenon (e.g. differentiating reliance on prior knowledge vs. exploiting lexical characteristics of the SLUE benchmark) and also mitigate it in future experiments?

Points proposed by reviewer: 90

Grade proposed by reviewer: A

Responsibility: Mgr. et Mgr. Hana Odstrčilová
