R&D result detail

Original title

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied to each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers. We employ a joint decoder that can make use of this uncertainty together with higher-level language information. For this, we revisit decoding algorithms used in factorial generative models in early multi-talker speech recognition systems. In contrast with these early works, we replace the GMM acoustic model with a DNN, which provides greater modeling power and simplifies part of the inference. We demonstrate the advantage of joint decoding in proof-of-concept experiments on a mixed-TIDIGITS dataset.
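
As a rough illustration of what decoding all speakers jointly can look like, the following is a minimal Python sketch of Viterbi decoding over pairs of states for two speakers. It is not the authors' implementation: the function name `joint_viterbi`, the toy two-state chains, the transition values, and the use of raw log posteriors as frame scores (rather than prior-scaled likelihoods or a full decoding graph) are all assumptions made for this example.

```python
import numpy as np


def joint_viterbi(log_joint, log_trans_a, log_trans_b, log_init_a, log_init_b):
    """Best pair of state paths for two speakers (illustrative sketch).

    log_joint:   (T, Sa, Sb) frame scores for every state pair (i, j).
    log_trans_*: (S, S) per-speaker log transition matrices [prev, new].
    log_init_*:  (S,) per-speaker log initial probabilities.
    """
    T, Sa, Sb = log_joint.shape
    # delta[i, j]: best log-score of any joint path ending in pair (i, j).
    delta = log_init_a[:, None] + log_init_b[None, :] + log_joint[0]
    back = np.zeros((T, Sa, Sb, 2), dtype=int)
    for t in range(1, T):
        # The two chains transition independently (factorial structure),
        # so the max over previous pairs can be done one speaker at a time.
        sa = delta[:, None, :] + log_trans_a[:, :, None]    # (a_prev, a_new, b_prev)
        best_a, arg_a = sa.max(axis=0), sa.argmax(axis=0)   # (a_new, b_prev)
        sb = best_a[:, :, None] + log_trans_b[None, :, :]   # (a_new, b_prev, b_new)
        arg_b = sb.argmax(axis=1)                           # (a_new, b_new)
        delta = sb.max(axis=1) + log_joint[t]
        back[t, :, :, 1] = arg_b
        # Previous state of A, given the chosen previous state of B.
        back[t, :, :, 0] = np.take_along_axis(arg_a, arg_b, axis=1)
    # Backtrace from the best final pair.
    i, j = np.unravel_index(delta.argmax(), delta.shape)
    path_a, path_b = [int(i)], [int(j)]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path_a.append(int(i))
        path_b.append(int(j))
    return path_a[::-1], path_b[::-1]


# Toy joint posteriors (made-up numbers). Frame 1 is ambiguous about which
# speaker is in which state: its per-speaker marginals are uniform, but the
# joint distribution still says the two speakers are in different states.
post = np.full((3, 2, 2), 0.1)
post[0, 0, 1] = 0.7
post[1] = [[0.05, 0.45], [0.45, 0.05]]
post[2, 0, 1] = 0.7
trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # hypothetical "sticky" chains
init = np.log(np.array([0.5, 0.5]))

path_a, path_b = joint_viterbi(np.log(post), trans, trans, init, init)
print(path_a, path_b)
```

In this toy setup, marginalising frame 1 per speaker would discard the information that the speakers occupy different states; keeping the joint scores lets the decoder combine that information with the transition model when it resolves the ambiguous frame.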

Keywords

Multi-talker speech recognition, Permutation invariant training, Factorial Hidden Markov models

Authors

KOCOUR, M.; ŽMOLÍKOVÁ, K.; ONDEL YANG, L.; ŠVEC, J.; DELCROIX, M.; OCHIAI, T.; BURGET, L.; ČERNOCKÝ, J.

RIV year

2023

Published

18 September 2022

Publisher

International Speech Communication Association

Place

Incheon

Book

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

ISSN

1990-9772

Periodical

Proceedings of Interspeech

Number

9

Country

France

Pages from

4955

Pages to

4959

Number of pages

5

URL

https://www.isca-speech.org/archive/pdfs/interspeech_2022/kocour22_interspeech.pdf

BibTeX

@inproceedings{BUT179827,
  author="KOCOUR, M. and ŽMOLÍKOVÁ, K. and ONDEL YANG, L. and ŠVEC, J. and DELCROIX, M. and OCHIAI, T. and BURGET, L. and ČERNOCKÝ, J.",
  title="Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
  year="2022",
  journal="Proceedings of Interspeech",
  number="9",
  pages="4955--4959",
  publisher="International Speech Communication Association",
  address="Incheon",
  doi="10.21437/Interspeech.2022-10406",
  issn="1990-9772",
  url="https://www.isca-speech.org/archive/pdfs/interspeech_2022/kocour22_interspeech.pdf"
}

Documents

kocour22_interspeech2022_revisiting
