Bachelor's Thesis

Creating audiobooks with AI

Author of thesis: Timur Nurtdinov

Acad. year: 2025/2026

Supervisor: doc. Ing. Vítězslav Beran, Ph.D.

Reviewer: Ing. Michal Hradiš, Ph.D.

Abstract:

This bachelor’s thesis presents a system that automates the creation of multi-voice audiobooks from EPUB files to overcome the high costs and time constraints of traditional
production. The solution leverages Large Language Models to automatically analyze the
text, detect scenes, extract characters, and attribute dialogue. Speech is then synthesized
using advanced Text-to-Speech engines. Key features of the system include a multi-stage
text analysis pipeline and a web-based system, which enables users to select specific voices
for characters, configure the narrator’s style, and overlay ambient sound onto the final
audio. This approach significantly accelerates audiobook production while preserving the
user’s creative control.
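
As a rough illustration of the dialogue-attribution and voice-selection stages described above, the following Python sketch replaces the LLM with a simple regular-expression heuristic. It is not the thesis's implementation: the names `Segment`, `attribute`, and `assign_voices`, the `said <Name>` pattern, and the voice identifiers are all hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    speaker: str  # "narrator", a character name, or "unknown"


# Assumption: quoted speech uses straight double quotes and may be
# followed by 'said <Name>'. A real system would use an LLM here.
QUOTE = re.compile(r'"([^"]+)"')
SAID = re.compile(r'\s+said\s+([A-Z][a-z]+)')


def attribute(paragraph: str) -> list[Segment]:
    """Split one paragraph into narrator and character segments."""
    segments = []
    pos = 0
    for m in QUOTE.finditer(paragraph):
        before = paragraph[pos:m.start()].strip()
        if before:
            segments.append(Segment(before, "narrator"))
        # Look for a speaker tag directly after the closing quote.
        tag = SAID.match(paragraph, m.end())
        speaker = tag.group(1) if tag else "unknown"
        segments.append(Segment(m.group(1), speaker))
        pos = m.end()
    rest = paragraph[pos:].strip()
    if rest:
        segments.append(Segment(rest, "narrator"))
    return segments


def assign_voices(segments, voices, default="narrator_voice"):
    """Map each segment to a (voice id, text) pair for a TTS engine."""
    return [(voices.get(s.speaker, default), s.text) for s in segments]
```

Given a user-chosen mapping such as `{"Anna": "voice_a"}`, `assign_voices(attribute(paragraph), ...)` yields an ordered list of (voice, text) pairs that a speech synthesizer could render one by one; any speaker without a chosen voice falls back to the default narrator voice.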

Keywords:

audiobook, multi-voice speech synthesis, text-to-speech, large language models, user interface, web application

Date of defence

15.06.2026

Result of the defence

Defended (thesis was successfully defended)

Grading

A

Process of defence

The student first presented the results achieved in his thesis. The committee then reviewed the supervisor's evaluation and the reviewer's report. The student subsequently answered the reviewer's questions and further questions from those present. Based on the reviewer's report, the supervisor's evaluation, the presentation, and the student's answers to the questions, the committee decided to grade the thesis A.

Topics for thesis defence

  1. How does your solution compare with other existing tools?
  2. Have you, or someone else, already listened to an entire audiobook generated by your software?
  3. How exactly do you split the text for attribution? I may have misread, but my understanding was that voices are selected per paragraph. What exactly are those paragraphs?
  4. How much human and machine time does it take to process an entire book with your tool?

Language of thesis

English

Faculty

Department

Study programme

Information Technology (BIT)

Composition of Committee

prof. Ing. Adam Herout, Ph.D. (chair)
doc. Mgr. Adam Rogalewicz, Ph.D. (vice-chair)
Ing. Vladimír Bartík, Ph.D. (member)
Ing. Michal Hradiš, Ph.D. (member)
Ing. Josef Strnadel, Ph.D. (member)

Student Timur Nurtdinov dedicated himself to the project conscientiously and with great interest, demonstrating an outstanding capacity for independent work. Following the successful deployment of the core system, he methodically expanded the solution to include a high-quality GUI for the management and user parameterisation of the entire process. The student successfully proposed partial solutions, effectively resolved technical challenges during integration, and delivered an exceptionally high-quality final product.

Evaluation criteria Verbal classification
Information on the assignment

The bachelor's thesis focuses on automating the processing of electronic books to generate audio versions with dramatisation. The topic requires a deep understanding of LLM attributes, including both text analysis and text-to-speech generation. The student successfully fulfilled all aspects of the assignment: he designed the automated process while preserving creative user control, selected suitable models, prepared a testing data set, and executed evaluation experiments. The resulting solution meets high standards, is entirely self-contained, and does not depend on previous projects.

Work with literature

The student drew from an extensive list of relevant technical literature and materials regarding LLMs and speech generation. He utilised several essential academic sources effectively, while less methodical, non-peer-reviewed references, such as documentation and software manuals, are mostly cited in the footnotes.

Activity during the work, consultations, communication

Timur Nurtdinov was highly active and intensely interested in the topic. He attended consultations thoroughly prepared and according to the schedule. In the initial phase, the student focused primarily on the functionality of individual system elements and their integration into the overall pipeline. Gradually, he adopted a more methodical approach, focusing on data structures, user inputs, intermediate results, and the interfaces between system components, thereby successfully shifting attention toward specific tasks with high technical added value.

Activity during completion

The work on developing the pipeline and preparing the data set progressed continuously according to the schedule, allowing the thesis to be completed well ahead of the deadline. After implementing the basic functional version, the student methodically elaborated on individual components, finalised the user interaction process and control mechanisms, and built a functional web application with a GUI. The final content was fully consulted, and all recommendations were incorporated.

Publications, awards

The paper was presented at the Excel@FIT 2026 student conference.

Points proposed by supervisor: 100

Grade proposed by supervisor: A

Reviewer’s report
Ing. Michal Hradiš, Ph.D.

The student created a relatively complex application that, in its current form, is already usable for the intended task. He worked creatively and systematically, and he comprehensively tested the resulting solution. With further work, the software could be turned into a successful open-source project or product.

Evaluation criteria Verbal classification Points
Difficulty of the assignment

Evaluation level: more difficult assignment

The topic is complex and involves integrating many components.

Presentation level of the technical report

The text is understandable and readable. I like that it starts with “Traditional Audiobook Production.” It maintains a relatively high level of abstraction, but I find that suitable, and it keeps the text length within a reasonable range.

I would probably separate the UI description from some of the implementation details, but I find the current arrangement acceptable.

I am missing a review of at least some of the many existing tools for automatic and semi-automatic audiobook production, including their capabilities and limitations.

Points: 85
Formal aspects of the technical report

The text is well written and contains only very few errors, although it is sometimes missing a definite or indefinite article. The formatting is often good, but there are also several issues:

  • missing text between headings
  • figures and tables placed in the middle of a page
  • unaesthetic tables
  • low-quality raster image in Figure 2.1
  • indented heading: “Hardware Limitations”
  • very small text in Figure 3.2
  • several problems with spaces around punctuation marks and brackets
  • source code included as part of the text instead of as separate listings
Points: 83
Implementation output

The created tool is functional and usable. It is able to segment speech, assign voices, select background sounds for individual scenes, and render a full audiobook. The created web application allows users to manage the process and correct or adjust some aspects of audio generation. The automated processing steps were well tested on a custom annotated dataset. The student also performed user testing.

The application definitely still has room for additional functionality, but most of it is very well summarized in the “Future Work” section of the thesis.

Points: 95
Usability of the results

The created application could be a good starting point, for example, for a cool open-source project.

Extent to which the assignment requirements were met

Evaluation level: assignment fulfilled

Length of the technical report

Evaluation level: within the usual range

Work with literature

The thesis cites 29 relevant sources, consisting of a mix of peer-reviewed publications, documentation, web sources, a book, and a standard. The sources are mostly sufficient for the thesis, except for the missing review of similar existing tools.

In a few places, the sources should be cited a bit more diligently, but overall I find the sources and their use reasonable.

Points: 81
Topics for thesis defence:
  1. How does your solution compare with other existing tools?
  2. Have you, or someone else, already listened to an entire audiobook generated by your software?
  3. How exactly do you split the text for attribution? I may have misread, but my understanding was that voices are selected per paragraph. What exactly are those paragraphs?
Points proposed by reviewer: 90

Grade proposed by reviewer: A

Responsibility: Mgr. et Mgr. Hana Odstrčilová