Bachelor's Thesis
Author of thesis: Rostyslav Kachan
Acad. year: 2025/2026
Supervisor: Ing. Martin Kostelník
Reviewer: Ing. Petr Šilling
This thesis investigates parameter-efficient fine-tuning for improving mathematical reasoning in large language models. Three open-source 7-billion-parameter models — Llama-2-7b-hf, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.1 — were fine-tuned on the GSM8K dataset using Low-Rank Adaptation (LoRA), with a systematic search over learning rate, LoRA rank, alpha, and effective batch size. Performance was evaluated using exact-match accuracy and the Qwen2.5-Math-PRM-7B process reward model for assessing intermediate reasoning quality. Fine-tuning yielded substantial accuracy gains: Llama-2-7b-hf improved from 10.77% to 33.36%, Llama-2-7b-chat-hf from 25.47% to 39.88%, and Mistral-7B-Instruct-v0.1 from 42.23% to 57.01%. The results demonstrate that LoRA fine-tuning with carefully selected hyperparameters effectively enhances mathematical reasoning, while PRM evaluation provides additional insights into reasoning quality beyond accuracy alone.
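The exact-match metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the reference solutions end with a `#### <number>` line, as in the public GSM8K release, and that the model's answer is the last number in its generated text. The second assumption and all function names are hypothetical; the record does not describe the extraction rule the thesis actually used.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number in the text in normalized form, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    value = matches[-1].replace(",", "")
    # Treat "18.0" and "18" as the same answer.
    if "." in value:
        value = value.rstrip("0").rstrip(".")
    return value


def reference_answer(solution: str) -> Optional[str]:
    """Extract the final answer that follows the last '####' marker."""
    return last_number(solution.split("####")[-1])


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose final number equals the reference answer."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equally long")
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = last_number(prediction)
        if predicted is not None and predicted == reference_answer(reference):
            correct += 1
    return 100.0 * correct / len(references)


predictions = [
    "She sells 9 eggs for $2 each, so she makes $18.",
    "The total is 5.",
]
references = [
    "16 - 3 - 4 = 9\n9 * 2 = 18\n#### 18",
    "2 + 1 = 3\n#### 3",
]
accuracy = exact_match_accuracy(predictions, references)
```

Exact match only checks the final number, which is why the abstract pairs it with a process reward model that scores the intermediate reasoning steps.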
Large language models, LLM, mathematical reasoning, parameter-efficient fine-tuning, PEFT, Low-Rank Adaptation, LoRA, GSM8K dataset, process reward model, PRM, hyperparameter search.
Date of defence
16.06.2026
Result of the defence
Defended (thesis was successfully defended)
Grading
C
Process of defence
The student first presented the results achieved in the thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions, the committee decided to grade the thesis C.
Topics for thesis defence
Language of thesis
English
Faculty
Faculty of Information Technology
Department
Department of Computer Graphics and Multimedia
Study programme
Information Technology (BIT)
Composition of Committee
doc. Ing. Lukáš Burget, Ph.D. (chair)
doc. Mgr. Adam Rogalewicz, Ph.D. (vice-chair)
Ing. Libor Polčák, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Martin Žádník, Ph.D. (member)
Supervisor’s report
Ing. Martin Kostelník
The student worked actively throughout the entire academic year. He independently completed the tasks agreed upon during consultations and demonstrated the ability to acquire the necessary knowledge and tools. He independently learned to work with the MetaCentrum computing infrastructure and conducted a set of experiments from which he formulated appropriate conclusions.
Progress during the winter semester was somewhat slow. I had hoped the student would explore more advanced approaches to mathematical fine-tuning beyond the standard autoregressive formulation of the task.
Overall, however, I evaluate the thesis positively and propose a grade of B.
The thesis deals with fine-tuning large language models for solving mathematical problems, which goes beyond the knowledge acquired during bachelor studies. The student had to study the principles of large language models, efficient fine-tuning techniques and techniques for mathematical reasoning. The student selected a suitable dataset and conducted a set of experiments, where he obtained reasonable results. I consider the assignment fulfilled.
The student worked with literature recommended by the supervisor and independently searched for additional sources.
The student was highly active throughout the entire academic year while working on the thesis and was generally well prepared for consultations. However, progress on the experimental part of the work was somewhat slower during the winter semester.
The written part of the thesis was prepared during the last month, but progressively, and I was gradually familiarized with its content. The student incorporated feedback, and the completion of the thesis proceeded without significant time-related complications. The final content was sufficiently consulted.
Grade proposed by supervisor: B
Reviewer’s report
Ing. Petr Šilling
The student successfully studied LLMs for mathematical reasoning and the relevant fine-tuning strategies, implementing a complete framework for efficient LoRA-based fine-tuning on the GSM8K dataset. The thesis was mostly experimental in nature and the student demonstrated his ability to perform and compare different experiments. However, the work is held back slightly by the limited scope of the selected methods.
Evaluation level: more demanding assignment
The assignment required deep understanding of modern large language models and methods for their training and parameter-efficient fine-tuning, especially in the context of mathematical chain-of-thought reasoning. Furthermore, the student had to study and select relevant training data and evaluation strategies. Overall, the assignment goes beyond the scope of standard bachelor's studies and I consider it slightly above average in terms of difficulty.
The technical report is of high quality. It starts with an introduction to artificial intelligence, language modeling and Large Language Models (LLMs). This is followed by an overview of modern parameter-efficient fine-tuning techniques and mathematical reasoning models. Then, it describes the chosen training dataset, GSM8K, three LLMs to be fine-tuned, and the selected fine-tuning strategy based on low-rank adaptation (LoRA). It follows with implementation details and experimental results, where the performance gains of fine-tuned variants of the chosen models are discussed. The final chapter contains the conclusion.
I missed more theoretical background on the specific thesis topic. The thesis dives unnecessarily deep into AI/ML basics, yet it omits definitions of some key concepts that the text later relies on. Examples include multi-head attention, referenced in Section 2.4, and Reinforcement Learning with Human Feedback and Proximal Policy Optimization, referenced in Section 4.3. In general, I expected the thesis to delve a bit deeper into mathematical reasoning and different fine-tuning strategies. Finally, some figures, such as Figure 2.2, could use more detailed descriptions.
The formal aspects of the thesis are exemplary, with no noticeable typographical or grammatical errors. All images are relevant and of high quality.
The student selected a suitable dataset and fine-tuned three different LLMs using LoRA with a grid-like hyperparameter search. The experimental evaluation and discussion are sound, comparing different hyperparameters and model variants against each other.
However, significant effort goes into hyperparameter tuning, even though the tuning methodology is rather simple. This carries over to the thesis itself, which in general feels a bit minimalist. While the student clearly demonstrated his ability to perform relevant experiments and discuss their results, he only applied the same methods to the same dataset with different pre-trained models. I missed a comparison of different fine-tuning strategies (some of which, such as QLoRA, the student mentions himself), different math datasets, and a clearer emphasis on chain-of-thought reasoning.
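For orientation, a grid-like search over the four hyperparameters named in the abstract (learning rate, LoRA rank, alpha, and effective batch size) could be enumerated roughly as follows. This is a sketch only: the candidate values, the per-device batch size, and the helper names are hypothetical and are not taken from the thesis.

```python
from itertools import product
from typing import Dict, Iterator, List

# Hypothetical candidate values; the thesis's actual grid is not given in this record.
SEARCH_SPACE: Dict[str, List[float]] = {
    "learning_rate": [1e-4, 2e-4],
    "lora_rank": [8, 16],
    "lora_alpha": [16, 32],
    "effective_batch_size": [16, 32],
}


def grid(space: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield one configuration per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def accumulation_steps(effective_batch_size: int,
                       per_device_batch_size: int,
                       num_devices: int = 1) -> int:
    """Gradient-accumulation steps needed to reach the effective batch size."""
    per_step = per_device_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of the per-step batch")
    return effective_batch_size // per_step


configs = list(grid(SEARCH_SPACE))
for config in configs:
    # Assumed per-device batch size of 4; each config would be passed to one training run.
    config["gradient_accumulation_steps"] = accumulation_steps(
        int(config["effective_batch_size"]), per_device_batch_size=4
    )
```

The number of runs grows multiplicatively with each added value (here 2 × 2 × 2 × 2 = 16 configurations per model), which helps explain why a simple grid can consume a large share of the experimental effort.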
Programmatically, the solution is a collection of Python scripts, which are well-structured but largely undocumented.
The thesis focuses primarily on comparing existing language models fine-tuned using LoRA. The resulting models are mainly experimental and have limited industrial value, but can be used as a useful reference for hyperparameter setup.
Evaluation level: assignment fulfilled
The assignment was fulfilled completely and all goals were sufficiently addressed.
Evaluation level: within the usual range
The length of the technical report is within the usual range. The report contains all essential information.
The thesis cites a total of 33 sources, mostly scientific articles but also a significant number of books. The sources are highly relevant, recent, and correctly used and cited in the text. I am not aware of any major violations of citation ethics. However, I would expect direct citations of all referenced methods, including older ones such as LSTM and GRU in Section 2.4.
Grade proposed by reviewer: B
Responsibility: Mgr. et Mgr. Hana Odstrčilová