Master's Thesis
Author of thesis: Ing. Samuel Šimún
Acad. year: 2025/2026
Supervisor: Ing. Tomáš Goldmann, Ph.D.
Reviewer: Ing. Filip Orság, Ph.D.
This thesis investigates stylometry as a cognitive biometric method for authorship verification, where the goal is to determine whether two texts were written by the same author. To reduce the influence of topic similarity, a semantic-agnostic dataset construction strategy is proposed, creating positive pairs from semantically distant texts by the same author and negative pairs from semantically similar texts by different authors. The work evaluates two transformer-based approaches: encoder-based BERT/RoBERTa models with co-attention and style regularisation, and decoder-based Qwen3 models fine-tuned using LoRA. Experiments on a custom dataset and the PAN21 benchmark show that the decoder-based Qwen3-4B model trained with focal loss achieves the best performance. The results demonstrate the potential of transformer-based models for robust stylometric authorship verification.
Stylometry, Cognitive Biometrics, Authorship Verification, Machine Learning, Deep Learning, Transformers, BERT, RoBERTa, Qwen3, LoRA, Co-Attention, Focal Loss
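The semantic-agnostic pair construction described in the abstract can be illustrated with a minimal sketch. It assumes each text already has a semantic embedding and that "distant" and "similar" are decided by cosine-similarity thresholds. The function name `build_pairs`, the thresholds `low` and `high`, and the toy two-dimensional embeddings are all hypothetical; the record does not state which embedding model or thresholds the thesis uses.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_pairs(docs, low=0.3, high=0.7):
    """Build verification pairs from (author, text_id, embedding) triples.

    Positive pairs (label 1): same author, semantically distant (sim <= low).
    Negative pairs (label 0): different authors, semantically similar (sim >= high).
    The thresholds are illustrative assumptions, not values from the thesis.
    """
    positives, negatives = [], []
    for (a1, id1, e1), (a2, id2, e2) in combinations(docs, 2):
        sim = cosine(e1, e2)
        if a1 == a2 and sim <= low:
            positives.append((id1, id2, 1))
        elif a1 != a2 and sim >= high:
            negatives.append((id1, id2, 0))
    return positives, negatives


# Toy data: the first axis stands for "sport", the second for "cooking".
docs = [
    ("alice", "a_sport", [1.0, 0.0]),
    ("alice", "a_cooking", [0.0, 1.0]),
    ("bob", "b_sport", [0.9, 0.1]),
    ("bob", "b_cooking", [0.1, 0.9]),
]
positives, negatives = build_pairs(docs)
```

With this toy data, each author's two texts on different topics form a positive pair, and the two texts on the same topic by different authors form a negative pair, so topic overlap alone cannot separate the classes.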
Date of defence
24.06.2026
Result of the defence
Defended (thesis was successfully defended)
Grading
A
Process of defence
The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions asked, the committee decided to grade the thesis A.
Topics for thesis defence
Language of thesis
English
Faculty
Faculty of Information Technology
Department
Department of Intelligent Systems
Study programme
Information Technology and Artificial Intelligence (MITAI)
Specialization
Machine Learning (NMAL)
Composition of Committee
prof. Dr. Ing. Jan Černocký (chair)
doc. Ing. Vítězslav Beran, Ph.D. (vice-chair)
doc. Ing. Ondřej Lengál, Ph.D. (member)
doc. Ing. František Zbořil, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Martin Fajčík, Ph.D. (member)
Supervisor’s report: Ing. Tomáš Goldmann, Ph.D.
I consider the student's approach to the thesis to be exemplary. Both the implementation and the technical report were completed well ahead of the deadline, and the student incorporated the supervisor's comments. Beyond the scope of the assignment, he also prepared a manuscript for a scientific conference based on the results obtained, which demonstrates his deep interest in the subject and his ability to present his own findings to the scientific community. Overall, the student's approach to the thesis is evaluated as excellent (A).
The aim of the thesis was to develop an algorithm for text authorship verification using stylometric analysis. The assignment is of above-average difficulty and is not frequently addressed in the scientific community, so the student had to tackle a number of challenges independently. It was necessary to prepare a custom dataset suitable for model training and to design a custom neural network architecture tailored to the given task. All the objectives of the assignment have been met.
The thesis was completed well ahead of the submission deadline. The final version of the technical report was submitted to the supervisor in a timely manner and was sufficiently discussed. The student responded to feedback promptly and incorporated the requested revisions without unnecessary delays.
Part of the thesis output was written up as a manuscript intended for presentation at a conference. At the time of writing this review, the article is under peer review and the outcome is not yet known. Nevertheless, I greatly appreciate that the student produced this above-standard output, particularly in light of his intention to continue with doctoral studies.
The student independently gathered all necessary literature and I fully agree with the selection of sources used. The literature review is well-structured, covers both classical approaches to stylometry and modern deep learning-based methods, and corresponds well to the topic being addressed.
The student was very active throughout the development of the thesis and maintained a consistent and thorough understanding of the subject matter. He worked on the thesis continuously and on his own initiative, regularly informing the supervisor of progress made as well as any complications encountered. The number and frequency of consultations were adequate and communication proceeded smoothly throughout the entire duration of the project.
Grade proposed by supervisor: A
Reviewer’s report: Ing. Filip Orság, Ph.D.
The master’s thesis deals with authorship verification using stylometry and modern transformer-based machine learning methods. The work is technically demanding and successfully fulfils the assignment. The strongest aspects of the thesis are the semantic-agnostic dataset construction, the comparison of encoder-based and decoder-based architectures, the use of LoRA fine-tuning, and the thorough experimental evaluation on both a custom dataset and PAN21. The thesis has some minor weaknesses, mainly in the formal aspects and in the limited depth of discussion of some practical limitations. Nevertheless, the technical contribution, implementation, and experimental results are clearly above average. Overall, I evaluate the thesis as excellent.
Evaluation level: assignment fulfilled and the thesis contains substantial extensions
The work goes beyond the original requirements through the design of an original semantic-agnostic dataset pipeline, a novel co-attention encoder architecture, and extensive experiments with LoRA fine-tuned decoder models.
Evaluation level: within the usual range
The thesis is logically structured and the chapters follow each other in a natural order. The presentation of experimental results is clear, especially in the decoder-based part, where the author discusses not only the achieved metrics but also cross-domain generalisation and threshold drift. A minor weakness is that some theoretical passages are rather broad and could be more concise. In several places, the report would also benefit from a more critical discussion of limitations, especially regarding the influence of dataset construction, calibration, and possible dataset-specific artefacts.
The technical report is written in English. The figures and tables generally support the explanation, and the architecture and evaluation parts are documented in sufficient detail. However, the English contains some grammatical, stylistic, and typographical imperfections. In addition, the formal treatment of equations is not fully consistent: several displayed equations are not properly integrated into the surrounding sentence, especially with respect to missing or inconsistent commas and full stops after equations.
The work with literature is of high quality. The thesis uses relevant sources and the bibliography includes both classical and recent works. The student clearly distinguishes the theoretical background from the proposed contributions. I did not notice any serious problem with citation ethics.
The student implemented a complete authorship verification framework in Python, including dataset construction, model training, evaluation, and an application interface. The solution includes both encoder-based models and decoder-based Qwen3 models fine-tuned with LoRA adapters. The application supports both a command-line interface and an API. The experimental part is thorough: the author evaluated several model configurations, including Qwen3-1.7B and Qwen3-4B with cross-entropy and focal loss. A minor limitation is that the benefit of focal loss over cross-entropy appears to be very small, which the thesis itself acknowledges. Third-party software and pretrained models are appropriately distinguished from the student's own work.
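For readers unfamiliar with the two losses compared above, the following is a minimal sketch of binary cross-entropy next to the standard focal loss of Lin et al., FL(p_t) = -(1 - p_t)^gamma * log(p_t). The focusing parameter `gamma=2.0`, the clamping constant `eps`, and the omission of class weighting are assumptions made for illustration; the record does not state the exact formulation or hyperparameters used in the thesis.

```python
from math import log


def _p_true(p, y, eps=1e-12):
    """Probability the model assigns to the true label, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return p if y == 1 else 1.0 - p


def cross_entropy(p, y):
    """Binary cross-entropy for a predicted same-author probability p and label y."""
    return -log(_p_true(p, y))


def focal_loss(p, y, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    gamma=2.0 is an illustrative default, not a value taken from the thesis.
    With gamma=0 the expression reduces to plain cross-entropy.
    """
    p_t = _p_true(p, y)
    return -((1.0 - p_t) ** gamma) * log(p_t)


easy_ce, easy_fl = cross_entropy(0.9, 1), focal_loss(0.9, 1)  # confident and correct
hard_ce, hard_fl = cross_entropy(0.1, 1), focal_loss(0.1, 1)  # confident and wrong
```

In this sketch the well-classified pair has its loss scaled by (1 - 0.9)^2 = 0.01, while the misclassified pair keeps (1 - 0.1)^2 = 0.81 of its cross-entropy, which is how focal loss shifts the training signal towards hard examples.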
The results are usable for further research in authorship verification, stylometry, and cognitive biometrics. The proposed semantic-agnostic dataset construction strategy is particularly valuable. The implemented application also provides a useful basis for practical testing of the proposed methods. A paper based on the results of the thesis has been submitted to the IJCB 2026 conference.
Evaluation level: more difficult assignment
The assignment is difficult, as it combines stylometry, biometric verification, dataset construction, modern transformer architectures, and experimental evaluation on both a custom dataset and an external benchmark. The thesis also includes the design of a practical application and experiments with encoder-based and decoder-based models, which further increases the overall complexity of the work.
Grade proposed by reviewer: A
Responsibility: Mgr. et Mgr. Hana Odstrčilová