Doctoral Thesis

Semi-Supervised Speech-to-Text Recognition with Text-to-Speech Critic

Final Thesis 3.77 MB Summary of Thesis 3.77 MB

Author of thesis: Ing. Murali Karthick Baskar, Ph.D.

Acad. year: 2023/2024

Supervisor: doc. Ing. Lukáš Burget, Ph.D.

Reviewers: Ing. Jan Trmal, Ph.D., Vimal Manohar

Abstract:

Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of training data to attain good performance. For this reason, unsupervised and semi-supervised training of seq2seq models has recently witnessed a surge in interest. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses that can leverage unpaired speech and/or text data by combining ASR with text-to-speech (TTS) models.

This thesis first proposes a new semi-supervised modelling framework combining an end-to-end differentiable ASR->TTS loss with a TTS->ASR loss. The method is able to leverage unpaired speech and text data to outperform recently proposed related techniques in terms of word error rate (WER). We provide extensive results analysing the impact of data quantity as well as the contribution of the speech and text modalities in recovering errors, and show consistent gains across the WSJ and LibriSpeech corpora.

The thesis also discusses the limitations of the ASR<->TTS model in out-of-domain data conditions. We propose an enhanced ASR<->TTS (EAT) model incorporating two main features: 1) the ASR->TTS pipeline is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS; and 2) a speech regularizer trained in an unsupervised fashion is introduced in TTS->ASR to correct the synthesized speech before sending it to the ASR model. Training strategies and the effectiveness of the EAT model are explored and compared with augmentation approaches. The results show that EAT reduces the performance gap between supervised and semi-supervised training by an absolute WER improvement of 2.6% and 2.7% on LibriSpeech and BABEL, respectively.
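
For readers unfamiliar with the metric, the word error rate (WER) quoted above is conventionally the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal illustrative sketch of that standard definition, not code from the thesis; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: (substitutions + deletions + insertions) / reference length.

    Hypothetical helper; tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

An "absolute" WER improvement, as reported in the abstract, is a difference between two such rates expressed in percentage points, as opposed to a relative reduction.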

Keywords:

Automatic speech recognition, text to speech, semi-supervised training, cycle-consistency, unpaired speech and text data, regularization.

Date of defence

15.11.2023

Result of the defence

Defended (thesis was successfully defended)


Process of defence

The student presented the goals and results achieved in the course of the dissertation work. In the discussion, the student competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 7. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee unanimously recommends awarding the thesis the dean's prize for an exceptionally high-quality dissertation.

Language of thesis

English

Faculty

Faculty of Information Technology

Department

Department of Computer Graphics and Multimedia

Study programme

Computer Science and Engineering (CSE-PHD-4)

Field of study

Computer Science and Engineering (DVI4)

Composition of Committee

prof. Ing. Jiří Jaroš, Ph.D. (chair)
prof. Ing. Mária Bieliková, Ph.D. (member)
doc. Ing. Jiří Mekyska, Ph.D. (member)
doc. Ing. Jindřich Matoušek, Ph.D. (member)
Ing. Jan Trmal, Ph.D. (member)

Supervisor’s report
doc. Ing. Lukáš Burget, Ph.D.

File inserted by supervisor	Size
Supervisor's evaluation [.pdf]	58.05 kB

Reviewer’s report
Ing. Jan Trmal, Ph.D.

File inserted by the reviewer	Size
Reviewer's report [.pdf]	284.30 kB

Reviewer’s report
Vimal Manohar

File inserted by the reviewer	Size
Reviewer's report [.pdf]	104.30 kB

Responsibility: Mgr. et Mgr. Hana Odstrčilová
