Bachelor's Thesis

Large language models for solving mathematical problems

Author of thesis: Rostyslav Kachan

Acad. year: 2025/2026

Abstract:

This thesis investigates parameter-efficient fine-tuning for improving mathematical reasoning in large language models. Three open-source 7-billion-parameter models (Llama-2-7b-hf, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.1) were fine-tuned on the GSM8K dataset using Low-Rank Adaptation (LoRA), with a systematic search over learning rate, LoRA rank, alpha, and effective batch size. Performance was evaluated using exact-match accuracy and the Qwen2.5-Math-PRM-7B process reward model for assessing intermediate reasoning quality. Fine-tuning yielded substantial accuracy gains: Llama-2-7b-hf improved from 10.77% to 33.36%, Llama-2-7b-chat-hf from 25.47% to 39.88%, and Mistral-7B-Instruct-v0.1 from 42.23% to 57.01%. The results demonstrate that LoRA fine-tuning with carefully selected hyperparameters effectively enhances mathematical reasoning, while PRM evaluation provides additional insights into reasoning quality beyond accuracy alone.
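
The exact-match accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes that reference solutions end with a `#### <number>` marker, as in the public GSM8K release, and that a model's final answer is the last number in its output when no marker is present. The extraction rule actually used in the thesis is not stated in this record, and the function names below are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = r"-?[\d,]*\.?\d+"


def extract_final_number(text):
    """Return the final numeric answer in `text` as a float, or None.

    Assumption: a '#### <number>' marker (GSM8K reference format) takes
    priority; otherwise the last number in the text is used.
    """
    marker = re.search(r"####\s*(" + NUMBER + r")", text)
    if marker:
        raw = marker.group(1)
    else:
        numbers = re.findall(NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return float(raw.replace(",", ""))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    correct = 0
    for pred, ref in zip(predictions, references):
        pred_value = extract_final_number(pred)
        ref_value = extract_final_number(ref)
        if pred_value is not None and pred_value == ref_value:
            correct += 1
    return correct / len(references)


# Made-up toy data for illustration only (not taken from the thesis).
predictions = [
    "16 - 3 - 4 = 9 eggs, and 9 * 2 = 18 dollars. #### 18",
    "The answer is 1,200.",
    "I am not sure.",
]
references = ["... #### 18", "... #### 1200", "... #### 7"]

accuracy = exact_match_accuracy(predictions, references)
print(f"exact-match accuracy: {accuracy:.2%}")
```

Comparing parsed numbers rather than raw strings keeps formatting differences such as `1,200` versus `1200` from counting as errors; an output with no number at all is scored as incorrect.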

Keywords:

Large language models, LLM, mathematical reasoning, parameter-efficient fine-tuning,
PEFT, Low-Rank Adaptation, LoRA, GSM8K dataset, process reward model, PRM, hyperparameter search.

Date of defence

16.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

C

Process of defence

The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions asked, the committee decided to grade the thesis C.

Topics for thesis defence

The main motivation for choosing LoRA was efficiency. Yet you also mention QLoRA, an even more efficient fine-tuning method. Why not experiment with it as well?
Have you considered more sophisticated (e.g., Bayesian) methods for hyperparameter space search?
How many chain-of-thought examples are you prepending when evaluating different variants of the models (not just the baselines)?
Can you describe the meaning of the rank variable?
Can you explain the term LoRA?
How do you evaluate your solution?

Language of thesis

English

Faculty

Fakulta informačních technologií

Department

Department of Computer Graphics and Multimedia

Study programme

Information Technology (BIT)

Composition of Committee

doc. Ing. Lukáš Burget, Ph.D. (chair)
doc. Mgr. Adam Rogalewicz, Ph.D. (vice-chair)
Ing. Libor Polčák, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Martin Žádník, Ph.D. (member)

Supervisor’s report
Ing. Martin Kostelník

The student worked actively throughout the entire academic year. He independently completed the tasks agreed upon during consultations and demonstrated the ability to acquire the necessary knowledge and tools. He independently learned to work with the MetaCentrum computing infrastructure and conducted a set of experiments from which he formulated appropriate conclusions.

Progress during the winter semester was somewhat slow. I had hoped the student would explore more advanced approaches to mathematical fine-tuning beyond the standard autoregressive formulation of the task.

Overall, however, I evaluate the thesis positively and propose a grade of B.

Evaluation criteria	Verbal classification
Assignment information	The thesis deals with fine-tuning large language models for solving mathematical problems, which goes beyond the knowledge acquired during bachelor studies. The student had to study the principles of large language models, efficient fine-tuning techniques and techniques for mathematical reasoning. The student selected a suitable dataset and conducted a set of experiments, where he obtained reasonable results. I consider the assignment fulfilled.
Work with literature	The student worked with literature recommended by the supervisor and independently searched for additional sources.
Activity during the work, consultations, communication	The student was highly active throughout the entire academic year while working on the thesis and was generally well prepared for consultations. However, progress on the experimental part of the work was somewhat slower during the winter semester.
Activity during completion	The written part of the thesis was prepared during the last month, but progressively, and I was gradually familiarized with its content. The student incorporated feedback, and the completion of the thesis proceeded without significant time-related complications. The final content was sufficiently consulted.
Publications, awards

Points proposed by supervisor: 80

Grade proposed by supervisor: B

Reviewer’s report
Ing. Petr Šilling

The student successfully studied LLMs for mathematical reasoning and the relevant fine-tuning strategies, implementing a complete framework for efficient LoRA-based fine-tuning on the GSM8K dataset. The thesis was mostly experimental in nature and the student demonstrated his ability to perform and compare different experiments. However, the work is held back slightly by the limited scope of the selected methods.

Evaluation criteria	Verbal classification	Points
Assignment difficulty	Evaluation level: more demanding assignment. The assignment required deep understanding of modern large language models and methods for their training and parameter-efficient fine-tuning, especially in the context of mathematical chain-of-thought reasoning. Furthermore, the student had to study and select relevant training data and evaluation strategies. Overall, the assignment goes beyond the scope of standard bachelor's studies and I consider it slightly above average in terms of difficulty.
Presentation level of the technical report	The technical report is of high quality. It starts with an introduction to artificial intelligence, language modeling and Large Language Models (LLMs). This is followed by an overview of modern parameter-efficient fine-tuning techniques and mathematical reasoning models. Then, it describes the chosen training dataset, GSM8K, three LLMs to be fine-tuned, and the selected fine-tuning strategy based on low-rank adaptation (LoRA). It follows with implementation details and experimental results, where the performance gains of fine-tuned variants of the chosen models are discussed. The final chapter contains the conclusion. I was missing more theoretical background on the specific thesis topic. The thesis dives unnecessarily deep into AI/ML basics and then misses some key definitions, which are later referenced in the text, despite the lack of detailed explanation. This includes, for instance, references to multi-head attention in Section 2.4 and to Reinforcement Learning from Human Feedback and Proximal Policy Optimization in Section 4.3. In general, I expected the thesis to delve a bit deeper into mathematical reasoning and different fine-tuning strategies. Finally, figures such as Figure 2.2 could use more detailed descriptions.	80
Formal layout of the technical report	The formal aspects of the thesis are exemplary, with no noticeable typographical or grammatical errors. All images are relevant and of high quality.	100
Implementation output	The student selected a suitable dataset and fine-tuned three different LLMs using LoRA with a grid-like hyperparameter search. The experimental evaluation and discussion are sound, comparing different hyperparameters and model variants against each other. However, significant effort focuses on hyperparameter tuning, even though the tuning methodology is rather simple. This translates to the thesis itself, which in general feels a bit minimalist. While the student clearly demonstrated his ability to perform and discuss the results of relevant experiments, he only applied the same methods on the same dataset to different pre-trained models. I was missing a comparison of different fine-tuning strategies (some of which, such as QLoRA, the student mentions himself), different math datasets, and a clearer emphasis on chain-of-thought reasoning. Programmatically, the solution is a collection of Python scripts, which are well-structured but largely undocumented.	75
Usability of results	The thesis focuses primarily on comparing existing language models fine-tuned using LoRA. The resulting models are mainly experimental and have limited industrial value, but can be used as a useful reference for hyperparameter setup.
Extent of fulfilment of assignment requirements	Evaluation level: assignment fulfilled. The assignment was fulfilled completely and all goals were sufficiently addressed.
Length of the technical report	Evaluation level: within the usual range. The length of the technical report is within the usual range. The report contains all essential information.
Work with literature	The thesis cites a total of 33 sources, mostly scientific articles but also a significant number of books. The sources are highly relevant, recent, and correctly used and cited in the text. I am not aware of any major violations of citation ethics. However, I would expect direct citations of all referenced methods, including older ones such as LSTM and GRU in Section 2.4.	95

Topics for thesis defence:

The main motivation for choosing LoRA was efficiency. Yet you also mention QLoRA, an even more efficient fine-tuning method. Why not experiment with it as well?
Have you considered more sophisticated (e.g., Bayesian) methods for hyperparameter space search?
How many chain-of-thought examples are you prepending when evaluating different variants of the models (not just the baselines)?

Points proposed by reviewer: 83

Grade proposed by reviewer: B

Responsibility: Mgr. et Mgr. Hana Odstrčilová
