Master's Thesis
Author of thesis: Ing. Maxim Plička
Acad. year: 2025/2026
Supervisor: Santosh Kesiraju, Ph.D.
Reviewer: Ing. Šimon Sedláček
This thesis investigates question answering over long spoken documents using Large Audio Language Models without transcription into text. A text-based baseline is first established on the SLUE-SQA5 benchmark using transcripts, dense retrieval, reranking, and extractive question answering, followed by evaluation of audio-based question answering using Qwen2-Audio-7B-Instruct. The work identifies several limitations of this audio-language model, including truncation of audio inputs longer than 30 seconds, inefficient scaling of audio representations, and degraded performance with increasing retrieval depth. To address these issues, a modified long-audio processing pipeline and multiple audio subsampling methods are proposed. Experimental results show that the proposed methods improve long-audio question answering performance and scalability, while parameter-efficient fine-tuning using LoRA adapters further improves benchmark accuracy. Additional analysis, however, reveals that the model often relies more on internal language priors than on retrieved audio evidence. The findings highlight current limitations of Qwen2-Audio-7B-Instruct in retrieval-augmented long-audio question answering and emphasize the need for methods that better utilize retrieved acoustic information.
audio question answering, speech question answering, retrieval-augmented generation, audio-language models, long-context processing, multimodal learning, audio subsampling, question answering, Qwen2-Audio, LoRA fine-tuning
Date of defence
25.06.2026
Result of the defence
Defended (thesis was successfully defended)
Grading
A
Process of defence
The student first presented the results achieved in the thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions asked, the committee decided to grade the thesis A.
Topics for thesis defence
Language of thesis
English
Faculty
Faculty of Information Technology
Department
Department of Computer Graphics and Multimedia
Study programme
Information Technology and Artificial Intelligence (MITAI)
Specialization
Machine Learning (NMAL)
Composition of Committee
prof. Dr. Ing. Jan Černocký (chair)
prof. Ing. Hynek Heřmanský, Dr. Eng. (vice-chair)
prof. RNDr. Alexandr Meduna, CSc. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. František Grézl, Ph.D. (member)
Ing. Martin Fajčík, Ph.D. (member)
Supervisor’s report: Santosh Kesiraju, Ph.D.
- Overall, the student has done great work in terms of architectural changes for speech LLMs, experiments, and analysis for spoken question answering.
- I believe the findings from this thesis are important for the research community, especially the shortfalls of "fine-tuning speech LLMs", where models find shortcuts to the "correct answer" instead of grounding their response in the provided context (documents).
The thesis goals are moderate-to-hard, since the work deals with recent advances in speech LLMs. The problem of 'spoken question answering', however, is not new.
The primary work is research in nature, and the experiments and findings from the thesis reflect this clearly.
- Incremental drafts of the thesis were reviewed and consulted in advance. However, since the work is "research in nature", several experiments and analyses were still being done until the last minute. I see this as a positive sign, since the student was constantly trying to improve the work.
- The student plans to submit the work to an upcoming conference or workshop.
- The literature study conducted by the student is adequate.
- Consultations were regular throughout the work.
Grade proposed by supervisor: A
Reviewer’s report: Ing. Šimon Sedláček
The thesis addresses a challenging and timely topic, and the student handled it with clear competence. While there are minor shortcomings on the presentation side of the thesis, they should not detract from the quality of the presented technical and experimental work, the results of which are valuable to the wider field of audio-LLM-based spoken QA.
Evaluation level: assignment fulfilled
The assignment was fulfilled completely and without exception. The student also presents additional results and findings related to a crucial issue of shortcut learning for audio-LLMs, where the input audio documents are not attended to when generating the answers.
Evaluation level: within the usual range
The thesis length is within the standard range.
I found the theoretical introduction (chapter 2) of the thesis a little hard to digest: while the student provides a comprehensive overview of spoken question answering (QA) and retrieval-augmented generation (RAG) as a whole, some of the presented concepts are not directly related to the experimental work itself. On the other hand, audio-LLMs receive only limited attention even though they are a core topic of the thesis. Some of the section names would also benefit from more specific wording for better reader orientation. Lastly, certain pieces of information are unnecessarily repeated between neighbouring sections. Despite these shortcomings, the overall structure of the thesis makes sense for the reader, and the experimental chapters are of significantly higher quality in terms of presentation and flow; I find that they offset the shortcomings of the theoretical parts.
The thesis is written in near-perfect English with a few minor mistakes here and there that were presumably not caught during review. From the typographical standpoint, I note that table captions are situated below the tables rather than above them.
The student cites relevant prior works in reasonable breadth with respect to the thesis topic. When covering certain LALM concepts, survey articles are cited where it would perhaps be more appropriate to cite the original works in addition. Also, some of the bibliography entries are arXiv preprints of articles that were later published at a conference or in a journal and should instead be cited as such.
The code is structured and well-documented.
Overall, the thesis presents valuable results and analysis in the currently highly relevant domain of spoken QA. I find that the methodology is well controlled, with every part of both the baselines and the final system well ablated. The experimental results on adapting the Qwen audio-LLM for spoken QA are valuable for the wider spoken QA community, providing insights into the current shortcomings of such models and how they can be addressed. I commend the presentation of evidence of shortcut learning, where the model does not attend to the spoken inputs, as such potentially hidden issues of audio-LLMs should be more widely understood in the audio/speech-LLM domain than they currently are. In my overall opinion, the work presented in the thesis warrants publication.
Evaluation level: considerably difficult assignment
The topic requires the student to become closely acquainted with the state of the art not only in retrieval-augmented generation, but also in the domain of audio-language models and their challenges, which are currently often less well documented than those of standard LLMs. Careful architectural changes to existing models are necessary to solve the audio-LLM context window problems addressed in the thesis.
Grade proposed by reviewer: A
Responsibility: Mgr. et Mgr. Hana Odstrčilová