R&D result detail

Original title

Performance Evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.

Keywords

ASR, LLMs, embeddings, speech-to-text alignment, foundation models

Authors

KUMAR, S.; THORBECKE, I.; BURDISSO, S.; VILLATORO-TELLO, E.; MANJUNATH, K.; HACIOGLU, K.; RANGAPPA, P.; MOTLÍČEK, P.; GANAPATHIRAJU, A.; STOLCKE, A.

RIV year

2026

Published

6 April 2025

Publisher

IEEE

Place

Hyderabad, India

ISBN

979-8-3315-1932-2

Book

2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW

Pages from

1

Pages to

5

Number of pages

5

URL

https://www.fit.vut.cz/research/group/speech/public/publi/2025/kumar_interspeech2025_co-author_Motlicek.pdf

BibTeX

@inproceedings{BUT201439,
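  author="S. {Kumar} and I. {Thorbecke} and S. {Burdisso} and E. {Villatoro-Tello} and K. {Manjunath} and K. {Hacioglu} and P. {Rangappa} and Petr {Motlíček} and A. {Ganapathiraju} and A. {Stolcke}",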
  title="Performance Evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward",
  booktitle="2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW",
  year="2025",
  pages="1--5",
  publisher="IEEE",
  address="Hyderabad, India",
  doi="10.1109/ICASSPW65056.2025.11010998",
  isbn="979-8-3315-1932-2",
  url="https://www.fit.vut.cz/research/group/speech/public/publi/2025/kumar_interspeech2025_co-author_Motlicek.pdf"
}

Documents

kumar_interspeech2025_co-author_Motlicek
