Doctoral Thesis

Language models supporting imperfect handwriting and speech recognition systems


Author of thesis: Ing. Karel Beneš, Ph.D.

Acad. year: 2023/2024

Supervisor: doc. Ing. Lukáš Burget, Ph.D.

Reviewers: Ing. Marek Hrúz, Ph.D., Matthew Wiesner, PhD.

Abstract:

The role of statistical language models is to discover and quantify natural patterns in text data.
In this thesis, we utilize language models to improve the accuracy of speech and handwriting recognition systems.

First, we work with fixed-size topic representations as a means of introducing longer context into otherwise computationally very cheap feed-forward neural language models (LMs).
We show that this simple technique halves the performance gap between these LMs and much more powerful recurrent models.
Then, we study the ability of these topic representations to smooth out errors in recognition and thus to improve the accuracy of second pass decoding.
The improvement obtained is consistent albeit very small.
Next, we study the training of neural LMs on machine-annotated data, with the aim of adapting the LM to a new domain with little human intervention.
Demonstrating this approach on optical character recognition, we conclude that language models are fairly robust to errors in the machine annotation, allowing the developer of the LM to skip the data filtering step in most cases.
In the most challenging scenario considered in our experiments, the original recognition system achieves a character error rate of 6.43 % (which can be reduced to 5.34 % by using an LM trained on human-annotated data), whereas utilizing the machine-annotated data to the full extent reduces the error rate to 2.88 %.

In the second part of the thesis, we study simple ways of regularizing language models by data augmentation resembling errors made by speech recognizers.
We obtain the best results when the augmentation does not attempt to model errors made by an actual ASR system.
By further analysis of this surprising result, we conclude that the improvements indeed come from a regularization effect rather than from the originally intended robustness to ASR-specific errors.
Finally, we demonstrate a way to reintroduce word-level confidences into the output of various end-to-end ASR systems: when their outputs are rescored by language models, we are able to effectively restore a capability of HMM-based systems that was neglected in end-to-end systems.
In addition to studying the quality of such confidence estimates, we quantitatively show that they considerably improve the fusion of multiple systems: compared to a voting-based mechanism, proper confidences improve the accuracy of the fused ASR system approximately as much as adding one more ASR system to the fusion.
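
The contrast between voting-based and confidence-based fusion can be illustrated with a minimal Python sketch. It is not the method of the thesis: the function names, the toy data, and the assumption that the hypotheses of all systems are already aligned word by word are hypothetical and serve only to show why word-level confidences can overturn a majority vote.

```python
from collections import Counter, defaultdict

def fuse_by_voting(slots):
    """Pick the most frequent word in each aligned slot (ties: first seen)."""
    fused = []
    for slot in slots:
        counts = Counter(word for word, _ in slot)
        fused.append(counts.most_common(1)[0][0])
    return fused

def fuse_by_confidence(slots):
    """Pick the word with the largest summed confidence in each slot."""
    fused = []
    for slot in slots:
        scores = defaultdict(float)
        for word, confidence in slot:
            scores[word] += confidence
        fused.append(max(scores, key=scores.get))
    return fused

# Toy data (assumed): three hypothetical systems, three aligned word slots.
# Each entry is (word, word-level confidence in [0, 1]).
slots = [
    [("the", 0.9), ("the", 0.8), ("a", 0.4)],
    [("cat", 0.95), ("cap", 0.3), ("cap", 0.35)],
    [("sat", 0.7), ("sat", 0.6), ("sat", 0.9)],
]

voted = fuse_by_voting(slots)
weighted = fuse_by_confidence(slots)
print(voted)     # majority vote per slot
print(weighted)  # confidence-weighted choice per slot
```

In the second slot, two systems agree on a low-confidence word, so plain voting picks it, while the confidence-weighted variant prefers the single high-confidence alternative.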

Keywords:

subspace multinomial model, topic representation, data filtering, self-training, error simulation, data augmentation, word confidence, recognizer fusion, automatic speech recognition, optical character recognition, statistical language modeling

Date of defence

10.03.2025

Result of the defence

Defended (thesis was successfully defended)


Process of defence

The student presented the goals and results achieved in the dissertation. He competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 9. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee and reviewers recommend considering the thesis for the Dean's prize, which is awarded to good theses.

Language of thesis

English

Faculty

Faculty of Information Technology

Department

Department of Computer Graphics and Multimedia

Study programme

Information Technology (DIT)

Composition of Committee

doc. RNDr. Milan Češka, Ph.D. (chair)
doc. RNDr. Ondřej Bojar, Ph.D. (member)
doc. Ing. Jiří Málek, PhD. (member)
prof. Ing. Radomil Matoušek, Ph.D. (member)
doc. Mgr. Radek Pelánek, Ph.D. (member)

Supervisor’s report
doc. Ing. Lukáš Burget, Ph.D.

I am pleased that Karel Beneš has submitted his doctoral thesis at FIT BUT under my supervision, the culmination of years of effort.

Karel’s thesis addresses the important problem of utilizing language models to support recognition systems, such as those used for automatic speech recognition and optical character recognition. More specifically, the work focuses on scenarios where the language model is trained on imperfect or erroneous training data, or is applied to postprocess the erroneous outputs of the recognition system. It analyzes how these errors influence the behavior and effectiveness of language models and proposes methods to mitigate their impact. The scientific content of the thesis is well-articulated in the document and its reviews; therefore, I will concentrate on more personal remarks.

Karel is an excellent student (he completed both his Bachelor's and Master's degrees at FIT with the highest distinction, the “red diploma”), and I am happy that he has been part of our laboratory since his Bachelor studies; his thesis “Finite State Grammars and Language Models for Automatic Speech Recognition”, supervised by Dr. Mirko Hannemann (currently with Apple, Inc.), dates back to 2014. He has been actively working in the area of neural artificial intelligence models, applicable across various fields including automatic speech recognition (ASR), natural language processing (NLP), and optical character recognition (OCR).

Karel has authored 15 conference and journal publications, most of which have appeared in respected international venues such as Interspeech (a CORE A conference) and the International Journal on Document Analysis and Recognition (IJDAR), a leading journal in automatic document processing. His research has been widely cited; according to Google Scholar, he has received 53 citations, a commendable achievement for a researcher at this stage in his career. His work on the “Residual Memory Network” earned him and Murali Karthick Baskar the Best Student Paper Award at INTERSPEECH 2017 in Stockholm.

His contributions are also prominent in various projects. Karel was a valuable team member of the Neural Representations in Multi-modal and Multi-lingual Modeling (NEUREM3) project, funded by the Czech Science Foundation (GAČR) under the prestigious EXPRO scheme. Additionally, he participated in the EU Horizon 2020 project Multiple Intelligent Conversation Agent Services for Reception, Management and Integration of Third Country Nationals (WELCOME), where he applied his expertise in speech and language technologies to handle dialogues in under-represented languages. His involvement in a series of OCR projects sponsored by the Czech Ministry of Culture stands out; the resulting application integrates advanced computer vision and NLP techniques, enabling automatic transcription of various printed documents in most European languages, including Latin, old documents in Fraktur and similar scripts in German, and handwritten Czech. This application provides an efficient interface for text corrections and offers multiple transcription formats, and it is routinely used by several Czech and European libraries, including the Military History Institute Prague and the University Library of Mannheim.

Karel also actively seeks to enrich his knowledge beyond his home lab: he spent 6 months in 2018/19 at RWTH Aachen (the most respected German speech and NLP laboratory), working on two-pass decoding in automatic speech recognition with Kazuki Irie (now at Harvard University).

Moreover, Karel is an excellent and dedicated teacher, coordinating the compulsory Artificial Intelligence and Machine Learning (SUI) course for all Master’s students at FIT BUT. Despite being a Ph.D. student, he ranked 3rd in the 2023 FIT teacher rankings for the Master’s program. He never misses an opportunity to help students (and sometimes also seniors) in the lab on a variety of issues ranging from Python programming, through running experiments on massively parallel computing architectures, to medieval swordsmanship. He is also a keen organizer of the group’s sports life including its indoor climbing club.

In conclusion, I wholeheartedly recommend Karel Beneš’s Ph.D. thesis for defense. I wish him all the best in his future professional and personal endeavors and look forward to continuing our collaboration.

Reviewer’s report
Ing. Marek Hrúz, Ph.D.

In summary, the thesis integrates novel approaches, real-world applications, and high scientific rigor to address key challenges in language modeling. While there are areas for improvement in formal presentation and broader applicability, the work meets the criteria for awarding a doctoral degree. The publications and research outputs further validate the candidate’s contributions to the field. In my opinion, the thesis and the student's achievements until now meet the generally accepted requirements for the award of an academic degree.

Topics for thesis defence:

How were the hyperparameters for NN training selected? E.g. LR = 20 for LSTM (Chapter 3.5) seems unusually large. Was any sweep of the hyperparameters performed?
“Note that this way we never use the introductory paragraph for i-vector estimation. This way, we make the task more challenging” (Chapter 3.6). This seems intuitively correct, but has it been tested?
Has the use of a Transformer model as an LM been considered? Could the solution be extended to Transformers? Why or why not?
Softmax is never defined - please define and explain.
Discussion. There are several methods presented in this work. Can the candidate present one as the main outcome of his work? Can you elaborate on the answer?


Reviewer’s report
Matthew Wiesner, PhD.

Karel Beneš’s publication record of 15 papers; his awards, including a best student paper award at Interspeech; his teaching excellence; and his experience participating in international challenges such as CHiME and IWSLT demonstrate that he meets the accepted requirements for being awarded a Ph.D.


Responsibility: Mgr. et Mgr. Hana Odstrčilová
