Bachelor's Thesis

Large language models for solving mathematical problems


Author of thesis: Rostyslav Kachan

Acad. year: 2025/2026

Supervisor: Ing. Martin Kostelník

Reviewer: Ing. Petr Šilling

Abstract:

This thesis investigates parameter-efficient fine-tuning for improving mathematical reasoning in large language models. Three open-source 7-billion-parameter models (Llama-2-7b-hf, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.1) were fine-tuned on the GSM8K dataset using Low-Rank Adaptation (LoRA), with a systematic search over learning rate, LoRA rank, alpha, and effective batch size. Performance was evaluated using exact-match accuracy and the Qwen2.5-Math-PRM-7B process reward model for assessing intermediate reasoning quality. Fine-tuning yielded substantial accuracy gains: Llama-2-7b-hf improved from 10.77% to 33.36%, Llama-2-7b-chat-hf from 25.47% to 39.88%, and Mistral-7B-Instruct-v0.1 from 42.23% to 57.01%. The results demonstrate that LoRA fine-tuning with carefully selected hyperparameters effectively enhances mathematical reasoning, while PRM evaluation provides additional insights into reasoning quality beyond accuracy alone.
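
The exact-match metric mentioned in the abstract can be illustrated with a minimal sketch. It relies on the fact that GSM8K reference solutions end with a `#### <answer>` line; the fallback of taking the last number in a model's output, and the helper names `extract_final_number` and `exact_match_accuracy`, are assumptions made for illustration and are not taken from the thesis.

```python
import re


def extract_final_number(text):
    """Return the final numeric answer in `text` as a normalized string, or None."""
    # GSM8K reference solutions end with "#### <answer>"; prefer that marker.
    marker = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if marker:
        raw = marker.group(1)
    else:
        # Assumption: without a marker, treat the last number in the text as the answer.
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        if not numbers:
            return None
        raw = numbers[-1]
    # Drop thousands separators and a trailing sentence period.
    raw = raw.replace(",", "").rstrip(".")
    value = float(raw)
    # Normalize so that "18", "18.0" and "18." all compare equal.
    return str(int(value)) if value.is_integer() else str(value)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = extract_final_number(prediction)
        if predicted is not None and predicted == extract_final_number(reference):
            correct += 1
    return correct / len(references)
```

A metric of this kind scores only the final answer, which is why the abstract pairs it with a process reward model for judging the intermediate reasoning steps.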

Keywords:

Large language models, LLM, mathematical reasoning, parameter-efficient fine-tuning, PEFT, Low-Rank Adaptation, LoRA, GSM8K dataset, process reward model, PRM, hyperparameter search.

Date of defence

16.06.2026

Result of the defence

Defended (thesis was successfully defended)


Grading

C

Process of defence

The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions asked, the committee decided to grade the thesis C.

Topics for thesis defence

  1. The main motivation for choosing LoRA was efficiency. Yet you also mention QLoRA, an even more efficient fine-tuning method. Why not experiment with it as well?
  2. Have you considered more sophisticated (e.g., Bayesian) methods for hyperparameter space search?
  3. How many chain-of-thought examples are you prepending when evaluating different variants of the models (not just the baselines)?
  4. Can you describe the meaning of the rank variable?
  5. Can you explain the term LoRA?
  6. How do you evaluate your solution?
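
As background to topics 4 and 5, the role of the rank can be stated with the standard LoRA formulation from the original LoRA paper. This is a general sketch and not a quotation from the thesis; the dimensions in the worked example (d = k = 4096, the hidden size of the Llama-2-7B models, and r = 8) are illustrative assumptions, since the record does not say which matrices were adapted or which ranks were searched.

```latex
% Standard LoRA update: the pretrained weight W_0 stays frozen,
% only the low-rank factors B and A are trained.
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk.
% Illustrative values (assumed): d = k = 4096, r = 8
r(d + k) = 8 \cdot (4096 + 4096) = 65\,536
\qquad \text{vs.} \qquad
dk = 4096^2 = 16\,777\,216
```

Under these assumed values the adapter trains 1/256 (about 0.39%) of the parameters of the full matrix. The rank r bounds the rank of the update ΔW, and alpha scales the update relative to r.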

Language of thesis

English

Faculty

Department

Study programme

Information Technology (BIT)

Composition of Committee

doc. Ing. Lukáš Burget, Ph.D. (chair)
doc. Mgr. Adam Rogalewicz, Ph.D. (vice-chair)
Ing. Libor Polčák, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Martin Žádník, Ph.D. (member)

Supervisor’s report
Ing. Martin Kostelník

The student worked actively throughout the entire academic year. He independently completed the tasks agreed upon during consultations and demonstrated the ability to acquire the necessary knowledge and tools. He independently learned to work with the MetaCentrum computing infrastructure and conducted a set of experiments from which he formulated appropriate conclusions.


Progress during the winter semester was somewhat slow. I had hoped the student would explore more advanced approaches to mathematical fine-tuning beyond the standard autoregressive formulation of the task.


Overall, however, I evaluate the thesis positively and propose a grade of B.

Evaluation criteria Verbal classification
Assignment information

The thesis deals with fine-tuning large language models for solving mathematical problems, which goes beyond the knowledge acquired during bachelor studies. The student had to study the principles of large language models, efficient fine-tuning techniques and techniques for mathematical reasoning. The student selected a suitable dataset and conducted a set of experiments, where he obtained reasonable results. I consider the assignment fulfilled.

Use of literature

The student worked with literature recommended by the supervisor and independently searched for additional sources.

Activity during the work, consultations, communication

The student was highly active throughout the entire academic year while working on the thesis and was generally well prepared for consultations. However, progress on the experimental part of the work was somewhat slower during the winter semester.

Activity during completion

The written part of the thesis was prepared during the last month, but it was written progressively and I was kept informed of its content as it developed. The student incorporated feedback, and the completion of the thesis proceeded without significant time-related complications. The final content was discussed sufficiently in consultations.

Publication activity, awards
Points proposed by supervisor: 80

Grade proposed by supervisor: B

Reviewer’s report
Ing. Petr Šilling

The student successfully studied LLMs for mathematical reasoning and the relevant fine-tuning strategies, implementing a complete framework for efficient LoRA-based fine-tuning on the GSM8K dataset. The thesis was mostly experimental in nature and the student demonstrated his ability to perform and compare different experiments. However, the work is held back slightly by the limited scope of the selected methods.

Evaluation criteria Verbal classification Points
Assignment difficulty

Evaluation level: more demanding assignment

The assignment required deep understanding of modern large language models and methods for their training and parameter-efficient fine-tuning, especially in the context of mathematical chain-of-thought reasoning. Furthermore, the student had to study and select relevant training data and evaluation strategies. Overall, the assignment goes beyond the scope of standard bachelor's studies and I consider it slightly above average in terms of difficulty.

Presentation quality of the technical report

The technical report is of high quality. It starts with an introduction to artificial intelligence, language modeling and Large Language Models (LLMs). This is followed by an overview of modern parameter-efficient fine-tuning techniques and mathematical reasoning models. Then, it describes the chosen training dataset, GSM8K, three LLMs to be fine-tuned, and the selected fine-tuning strategy based on low-rank adaptation (LoRA). It follows with implementation details and experimental results, where the performance gains of fine-tuned variants of the chosen models are discussed. The final chapter contains the conclusion.

I was missing more theoretical background on the specific thesis topic. The thesis dives unnecessarily deep into AI/ML basics and then omits some key definitions that are later referenced in the text without detailed explanation. This includes, for instance, references to multi-head attention in Section 2.4 and to Reinforcement Learning from Human Feedback and Proximal Policy Optimization in Section 4.3. In general, I expected the thesis to delve a bit deeper into mathematical reasoning and different fine-tuning strategies. Finally, figures such as Figure 2.2 could use more detailed descriptions.

Points: 80
Formal presentation of the technical report

The formal aspects of the thesis are exemplary, with no noticeable typographical or grammatical errors. All images are relevant and of high quality.

Points: 100
Implementation output

The student selected a suitable dataset and fine-tuned three different LLMs using LoRA with a grid-like hyperparameter search. The experimental evaluation and discussion are sound, comparing different hyperparameters and model variants against each other.
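
A grid-like search over the four hyperparameters named in the abstract can be sketched as follows. The concrete values in `SEARCH_SPACE` are hypothetical placeholders, as the record does not list the values actually searched; the helper names and the gradient-accumulation view of effective batch size are likewise assumptions for illustration.

```python
from itertools import product

# Hypothetical search space: the values are placeholders, not the thesis's grid.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 2e-4],
    "lora_rank": [8, 16],
    "lora_alpha": [16, 32],
    "effective_batch_size": [16, 32],
}


def grid_configs(space):
    """Enumerate every combination of hyperparameter values as a dict."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]


def accumulation_steps(effective_batch_size, per_device_batch_size, num_devices=1):
    """Gradient-accumulation steps needed to reach the effective batch size.

    Assumes effective = per_device_batch_size * num_devices * accumulation steps.
    """
    per_step = per_device_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of the per-step batch")
    return effective_batch_size // per_step


if __name__ == "__main__":
    for config in grid_configs(SEARCH_SPACE):
        steps = accumulation_steps(config["effective_batch_size"], per_device_batch_size=4)
        print(config, "accumulation steps:", steps)
```

The number of runs grows multiplicatively with each added value, which is the cost that motivates the more sample-efficient search methods raised in the defence topics.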

However, significant effort focuses on hyperparameter tuning, even though the tuning methodology is rather simple. This translates to the thesis itself, which in general feels a bit minimalist. While the student clearly demonstrated his ability to perform and discuss the results of relevant experiments, he only applied the same methods on the same dataset to different pre-trained models. I was missing a comparison of different fine-tuning strategies (some of which, such as QLoRA, the student mentions himself), different math datasets, and a clearer emphasis on chain-of-thought reasoning.

Programmatically, the solution is a collection of Python scripts, which are well-structured but largely undocumented.

Points: 75
Usability of results

The thesis focuses primarily on comparing existing language models fine-tuned using LoRA. The resulting models are mainly experimental and have limited industrial value, but can be used as a useful reference for hyperparameter setup.

Extent to which the assignment requirements were met

Evaluation level: assignment fulfilled

The assignment was fulfilled completely and all goals were sufficiently addressed.

Length of the technical report

Evaluation level: within the usual range

The length of the technical report is within the usual range. The report contains all essential information.

Use of literature

The thesis cites a total of 33 sources, mostly scientific articles but also a significant number of books. The sources are highly relevant, recent, and correctly used and cited in the text. I am not aware of any major violations of citation ethics. However, I would expect direct citations of all referenced methods, including older ones such as LSTM and GRU in Section 2.4.

Points: 95
Topics for thesis defence:
  1. The main motivation for choosing LoRA was efficiency. Yet you also mention QLoRA, an even more efficient fine-tuning method. Why not experiment with it as well?
  2. Have you considered more sophisticated (e.g., Bayesian) methods for hyperparameter space search?
  3. How many chain-of-thought examples are you prepending when evaluating different variants of the models (not just the baselines)?
Points proposed by reviewer: 83

Grade proposed by reviewer: B

Responsibility: Mgr. et Mgr. Hana Odstrčilová