R&D result detail

Original title

Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

Type

Paper in proceedings indexed in the WoS or Scopus database

Original abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems on the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.

Keywords

Speech Recognition, Human-computer Interaction, Spoken Language Understanding, Word Consensus Networks, Cross-modal Attention

Authors

VILLATORO-TELLO, E.; MADIKERI, S.; ZULUAGA-GOMEZ, J.; SHARMA, B.; SARFJOO, S.; NIGMATULINA, I.; MOTLÍČEK, P.; IVANOV, V.; GANAPATHIRAJU, A.

RIV year

2024

Published

04.06.2023

Publisher

IEEE Signal Processing Society

Location

Rhodes Island

ISBN

978-1-7281-6327-7

Book

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

First page

1

Last page

5

Number of pages

5

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168

BibTeX

@inproceedings{BUT187787,
  author="VILLATORO-TELLO, E. and MADIKERI, S. and ZULUAGA-GOMEZ, J. and SHARMA, B. and SARFJOO, S. and NIGMATULINA, I. and MOTLÍČEK, P. and IVANOV, V. and GANAPATHIRAJU, A.",
  title="Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2023",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Rhodes Island",
  doi="10.1109/ICASSP49357.2023.10095168",
  isbn="978-1-7281-6327-7",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095168"
}

Documents

villatoro-tello_icassp2023_10095168
