Master's Thesis
Author of thesis: Bc. Martin Pospíšil
Acad. year: 2025/2026
Supervisor: prof. Ing. Radim Burget, Ph.D.
Reviewer: Ing. Jan Dorazil, Ph.D.
This thesis addresses fully local automatic transcription of clinical interviews, with no internet connection required. The aim was to compare modern speech-to-text models experimentally, select the most suitable architecture, and then design and implement a functional prototype of the MedApp application that processes real audio recordings of doctor-patient interviews and produces a structured report. In the first part of the thesis, selected variants of the Whisper model and the Vosk system were tested on fourteen recordings divided into two datasets, using the WER, CER, and RTF metrics. Dataset A consisted of five acted clinical recordings, and dataset B of nine synthetic recordings created using ElevenLabs. The Whisper large-v3 model achieved the best results, with an average WER of 17.7% and CER of 12.8% without contextual prompting, together with low computation time and stable performance when running on a GPU. Based on these results, it was selected as the core of the proposed system. The second part of the thesis focused on implementing a complete on-premise system comprising speech-to-text transcription, speaker diarization using the pyannote.audio 3.1 model, and generation of structured output in JSON format. The solution also includes a module for automatic interview summarization, implemented with the llama3.1:8b language model run locally through Ollama, which creates a structured record containing key topics, a summary, and action points based on the transcript, without sending data outside the device. The proposed system achieved an average WER of 8.2% and CER of 4.9% across all recordings: a WER of 15.7% on the acted clinical recordings in dataset A and a WER of 3.9% on the synthetic recordings in dataset B. The average real-time factor was RTF = 0.460, and speaker diarization using the pyannote.audio 3.1 model achieved an average DER of 2.1%. The outcome of the thesis is a functional and practically validated application prototype that enables fully local transcription of clinical interviews with automatic speaker identification and automatic generation of a structured record.
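For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how WER, CER, and RTF are commonly computed. It is not the evaluation code used in the thesis. The function names are hypothetical, and the text normalization (here only whitespace splitting, with no casing or punctuation handling) is an assumption, since the record does not state how transcripts were normalized.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (below 1 is faster than real time)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Illustrative sentences, not taken from the thesis datasets.
    ref = "the patient has a headache"
    hyp = "the patient had a headache"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
    print(f"RTF = {rtf(46.0, 100.0):.3f}")
```

Under these definitions, an RTF of 0.460 would mean that one minute of audio takes roughly 28 seconds to process. DER is omitted here because it requires time-aligned speaker annotations rather than plain text.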
Keywords
Automatic speech recognition, Whisper, speaker diarization, on-premise processing, local language model, Ollama, transcription, clinical interviews, structured output, JSON
Date of defence
11.06.2026
Result of the defence
Defended (thesis was successfully defended)
Grading
A
Process of defence
The student presented the results of the thesis, and the committee was acquainted with the supervisor's and reviewer's reports. The student defended the master's thesis.
Reviewer's questions: In Section 2.4.1 you state that a higher value of the "beam_size" parameter yields better accuracy without a noticeable impact on processing speed. Why, then, did you not examine parameter values higher than 8? Is it theoretically possible to process a larger number of hypotheses throughout the whole pipeline so that the user could choose the most suitable variant of the transcript and the summary? Discuss what such an extension would mean in terms of implementation and practical use.
In practical use, would the system be centralized on a hospital server, or would it run on the PCs of individual doctors' offices? What are the licenses of the models used?
Language of thesis
Czech
Faculty
Faculty of Electrical Engineering and Communication
Department
Department of Telecommunications
Study programme
Audio Engineering (MPC-AUD)
Specialization
Audio Production and Recording (AUDM-ZVUK)
Composition of Committee
prof. Ing. Zdeněk Smékal, CSc. (chair)
Ing. MgA. Edgar Mojdl, Ph.D. (vice-chair)
Dr. Ing. Libor Husník (member)
Ing. Václav Mach, Ph.D. (member)
Ing. Matěj Ištvánek, Ph.D. (member)
Supervisor’s report: prof. Ing. Radim Burget, Ph.D.
Grade proposed by supervisor: A
Reviewer’s report: Ing. Jan Dorazil, Ph.D.
Grade proposed by reviewer: B
Responsibility: Mgr. et Mgr. Hana Odstrčilová