Master's Thesis

Person Recognition Based on Stylometry

Author of thesis: Ing. Samuel Šimún

Acad. year: 2025/2026

Supervisor: Ing. Tomáš Goldmann, Ph.D.

Reviewer: Ing. Filip Orság, Ph.D.

Abstract:

This thesis investigates stylometry as a cognitive biometric method for authorship verification, where the goal is to determine whether two texts were written by the same author. To reduce the influence of topic similarity, a semantic-agnostic dataset construction strategy is proposed, creating positive pairs from semantically distant texts by the same author and negative pairs from semantically similar texts by different authors. The work evaluates two transformer-based approaches: encoder-based BERT/RoBERTa models with co-attention and style regularisation, and decoder-based Qwen3 models fine-tuned using LoRA. Experiments on a custom dataset and the PAN21 benchmark show that the decoder-based Qwen3-4B model trained with focal loss achieves the best performance. The results demonstrate the potential of transformer-based models for robust stylometric authorship verification.
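
The pair construction described in the abstract can be illustrated with a minimal sketch. It assumes that every text already has a semantic embedding vector (the embedding model, the cosine similarity measure, and the greedy per-text selection are assumptions made for illustration, not details taken from the thesis). The function name `mine_pairs` and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(items):
    """items: list of (author, embedding) tuples.

    Hypothetical greedy strategy: for each text, the positive partner is the
    LEAST similar text by the same author, and the negative partner is the
    MOST similar text by a different author.
    Returns (positives, negatives) as lists of index pairs.
    """
    positives, negatives = [], []
    for i, (author_i, emb_i) in enumerate(items):
        same = [j for j, (a, _) in enumerate(items) if a == author_i and j != i]
        diff = [j for j, (a, _) in enumerate(items) if a != author_i]
        if same:
            j = min(same, key=lambda k: cosine(emb_i, items[k][1]))
            positives.append((i, j))
        if diff:
            j = max(diff, key=lambda k: cosine(emb_i, items[k][1]))
            negatives.append((i, j))
    return positives, negatives


# Toy data: two authors, each writing about two different "topics".
toy = [
    ("alice", [1.0, 0.0]),  # index 0
    ("alice", [0.0, 1.0]),  # index 1
    ("bob", [0.9, 0.1]),    # index 2
    ("bob", [0.1, 0.9]),    # index 3
]
positives, negatives = mine_pairs(toy)
```

In this toy setting, text 0 is paired positively with text 1 (same author, different topic) and negatively with text 2 (different author, similar topic), so topic overlap alone cannot separate the two classes.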

Keywords:

Stylometry, Cognitive Biometrics, Authorship Verification, Machine Learning, Deep Learning, Transformers, BERT, RoBERTa, Qwen3, LoRA, Co-Attention, Focal Loss

Date of defence

24.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

A

Process of defence

The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions raised, the committee decided to award the thesis grade A.

Topics for thesis defence

  1. The proposed dataset construction strategy intentionally makes positive pairs semantically distant and negative pairs semantically similar. How can you verify that the model learns authorial style rather than artefacts introduced by this pair-mining strategy?
  2. The maximum input sequence length is set to 512 tokens, meaning that longer texts are truncated. How would you expect the model’s performance to change if the input sequence length were reduced further, for example, to 256 tokens? Did you conduct any experiments with different sequence lengths?
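
To make the truncation question concrete, the following sketch estimates how much of a text survives a given length limit. It is only an illustration: whitespace-separated words stand in for the subword tokens of the actual tokenizer, and the limit is assumed to apply to each text separately. Both are assumptions, and the helper names are hypothetical.

```python
def truncate_pair(text_a, text_b, max_len=512):
    """Keep at most max_len tokens of each text.

    Assumption: whitespace tokens approximate the model's subword tokens,
    and the limit applies per text rather than to the concatenated pair.
    """
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    return tokens_a[:max_len], tokens_b[:max_len]


def retained_fraction(text, max_len):
    """Fraction of a text's tokens that remain after truncation."""
    n = len(text.split())
    return 1.0 if n == 0 else min(n, max_len) / n


# A synthetic 1000-token text loses about half its content at 512 tokens
# and about three quarters at 256 tokens.
long_text = " ".join(["w"] * 1000)
kept_512 = retained_fraction(long_text, 512)
kept_256 = retained_fraction(long_text, 256)
```

Comparing such retained fractions across a corpus is one simple way to judge how much stylistic evidence a shorter limit would discard before any model is retrained.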

Language of thesis

English

Faculty

Department

Study programme

Information Technology and Artificial Intelligence (MITAI)

Specialization

Machine Learning (NMAL)

Composition of Committee

prof. Dr. Ing. Jan Černocký (chair)
doc. Ing. Vítězslav Beran, Ph.D. (vice-chair)
doc. Ing. Ondřej Lengál, Ph.D. (member)
doc. Ing. František Zbořil, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Martin Fajčík, Ph.D. (member)

Supervisor’s report
Ing. Tomáš Goldmann, Ph.D.

I consider the student's approach to the thesis to be exemplary. Both the implementation and the technical report were completed well ahead of the deadline, and the student incorporated the supervisor's comments. Beyond the scope of the assignment, he also prepared a manuscript for a scientific conference based on the results obtained, which demonstrates his deep interest in the subject and his ability to present his own findings to the scientific community. Overall, the student's approach to the thesis is evaluated as excellent (A).

Evaluation criteria Verbal classification
Information on the assignment

The aim of the thesis was to develop an algorithm for text authorship verification using stylometric analysis. The assignment is of above-average difficulty and the topic is not frequently addressed in the scientific community, so the student had to tackle a number of challenges independently. It was necessary to prepare a custom dataset suitable for model training and to design a custom neural network architecture tailored to the task. All objectives of the assignment have been met.

Activity during completion

The thesis was completed well ahead of the submission deadline. The final version of the technical report was submitted to the supervisor in a timely manner and was sufficiently discussed. The student responded to feedback promptly and incorporated the requested revisions without unnecessary delays.

Publications, awards

Part of the thesis output was written up as a manuscript intended for presentation at a conference. At the time of writing this review, the article is under peer review and the outcome is not yet known. Nevertheless, I greatly appreciate that the student produced this above-standard output, particularly in light of his intention to continue with doctoral studies.

Work with literature

The student independently gathered all necessary literature and I fully agree with the selection of sources used. The literature review is well-structured, covers both classical approaches to stylometry and modern deep learning-based methods, and corresponds well to the topic being addressed.

Activity during the work, consultations, communication

The student was very active throughout the development of the thesis and maintained a consistent and thorough understanding of the subject matter. He worked on the thesis continuously and on his own initiative, regularly informing the supervisor of progress made as well as any complications encountered. The number and frequency of consultations were adequate and communication proceeded smoothly throughout the entire duration of the project.

Points proposed by supervisor: 99

Grade proposed by supervisor: A

Reviewer’s report
Ing. Filip Orság, Ph.D.

The master’s thesis deals with authorship verification using stylometry and modern transformer-based machine learning methods. The work is technically demanding and successfully fulfils the assignment. The strongest aspects of the thesis are the semantic-agnostic dataset construction, the comparison of encoder-based and decoder-based architectures, the use of LoRA fine-tuning, and the thorough experimental evaluation on both a custom dataset and PAN21. The thesis has some minor weaknesses, mainly in the formal aspects and in the limited depth of discussion of some practical limitations. Nevertheless, the technical contribution, implementation, and experimental results are clearly above average. Overall, I evaluate the thesis as excellent.

Evaluation criteria Verbal classification Points
Extent of fulfilment of the assignment requirements

Evaluation level: assignment fulfilled, and the thesis contains substantial extensions

The work goes beyond the original requirements through the design of an original semantic-agnostic dataset pipeline, a novel co-attention encoder architecture, and extensive experiments with decoder models fine-tuned using LoRA.

Length of the technical report

Evaluation level: within the usual range

Presentation quality of the technical report

The thesis is logically structured and the chapters follow each other in a natural order. The presentation of experimental results is clear, especially in the decoder-based part, where the author discusses not only the achieved metrics but also cross-domain generalisation and threshold drift. A minor weakness is that some theoretical passages are rather broad and could be more concise. In several places, the report would also benefit from a more critical discussion of limitations, especially regarding the influence of dataset construction, calibration, and possible dataset-specific artefacts.

Points: 95

Formal layout of the technical report

The technical report is written in English. The figures and tables generally support the explanation, and the architecture and evaluation parts are documented in sufficient detail. However, the English language contains some grammatical, stylistic, and typographical imperfections. In addition, the formal treatment of equations is not fully consistent. Several displayed equations are not properly integrated into the surrounding sentence, especially with respect to missing or inconsistent commas and full stops after equations.

Points: 85

Work with literature

The work with literature is of high quality. The thesis uses relevant sources and the bibliography includes both classical and recent works. The student clearly distinguishes the theoretical background from the proposed contributions. I did not notice any serious problem with citation ethics.

Points: 100

Implementation output

The student implemented a complete authorship verification framework in Python, including dataset construction, model training, evaluation, and an application interface. The solution includes both encoder-based models and decoder-based Qwen3 models fine-tuned with LoRA adapters. The application supports both a command-line interface and an API. The experimental part is very nice. The author evaluated several model configurations, including Qwen3-1.7B and Qwen3-4B with cross-entropy and focal loss. A minor limitation is that the benefit of focal loss appears to be very small in comparison with cross-entropy, which the thesis itself acknowledges. Third-party software and pretrained models are appropriately distinguished from the student's own work.
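
The relationship between the two losses mentioned above can be sketched as follows. This is a minimal illustration of the standard binary focal loss, not the thesis's implementation: the focusing parameter `gamma=2.0` is a commonly used default and an assumption here, the class-weighting term is omitted, and the function name is hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss for one example.

    p: predicted probability of the positive class (same author).
    y: true label, 1 (same author) or 0 (different authors).
    With gamma = 0 the modulating factor equals 1 and the loss
    reduces to ordinary binary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)        # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy example (p_t = 0.9) versus a hard example (p_t = 0.3).
easy_ce = binary_focal_loss(0.9, 1, gamma=0.0)
easy_fl = binary_focal_loss(0.9, 1)
hard_ce = binary_focal_loss(0.3, 1, gamma=0.0)
hard_fl = binary_focal_loss(0.3, 1)
```

Focal loss shrinks the contribution of easy examples far more than that of hard ones. When most training pairs are already classified confidently, the two losses produce similar gradients on the remaining hard pairs, which is one plausible reason for a small observed difference.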

Points: 90

Usability of results

The results are usable for further research in authorship verification, stylometry, and cognitive biometrics. The proposed semantic-agnostic dataset construction strategy is particularly valuable. The implemented application also provides a useful basis for practical testing of the proposed methods. The results based on the thesis have been submitted to the IJCB 2026 conference.

Difficulty of the assignment

Evaluation level: more difficult assignment

The assignment is difficult, as it combines stylometry, biometric verification, dataset construction, modern transformer architectures, and experimental evaluation on both a custom dataset and an external benchmark. The thesis also includes the design of a practical application and experiments with encoder-based and decoder-based models, which further increases the overall complexity of the work.

Points proposed by reviewer: 90

Grade proposed by reviewer: A

Responsibility: Mgr. et Mgr. Hana Odstrčilová