R&D Result Detail

Original Title

Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers

English Title

Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers

Type

Paper in proceedings outside WoS and Scopus

Original Abstract

Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been trained independently to generate comprehensive transcripts accompanied by speaker turns. Recently, joint training of ASR and SCD systems, by inserting speaker turn tokens into the ASR training text, has been shown to be successful. In this work, we present a multitask alternative to the joint training approach. Results obtained on the mix-headset audio of the AMI corpus show that the proposed multitask training yields an absolute improvement of 1.8% in the coverage- and purity-based F1 score on the SCD task without ASR degradation. We also examine the trade-offs between ASR and SCD performance when training with multitask criteria. Additionally, we validate the speaker change information in the embedding spaces obtained after different transformer layers of a self-supervised pre-trained model, such as XLSR-53, by integrating an SCD classifier at the output of specific transformer layers. Results reveal that using different embedding spaces from the XLSR-53 model for multitask ASR and SCD is advantageous.

English abstract

Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been trained independently to generate comprehensive transcripts accompanied by speaker turns. Recently, joint training of ASR and SCD systems, by inserting speaker turn tokens into the ASR training text, has been shown to be successful. In this work, we present a multitask alternative to the joint training approach. Results obtained on the mix-headset audio of the AMI corpus show that the proposed multitask training yields an absolute improvement of 1.8% in the coverage- and purity-based F1 score on the SCD task without ASR degradation. We also examine the trade-offs between ASR and SCD performance when training with multitask criteria. Additionally, we validate the speaker change information in the embedding spaces obtained after different transformer layers of a self-supervised pre-trained model, such as XLSR-53, by integrating an SCD classifier at the output of specific transformer layers. Results reveal that using different embedding spaces from the XLSR-53 model for multitask ASR and SCD is advantageous.
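
The abstract reports SCD quality as a coverage- and purity-based F1 score. As a minimal sketch, assuming the usual definition of F1 as the harmonic mean of the two quantities (the record does not spell out the formula), the score could be computed as follows. The function name and the example values are hypothetical and are not taken from the paper.

```python
def coverage_purity_f1(coverage: float, purity: float) -> float:
    """Harmonic mean of segmentation coverage and purity.

    Assumption: both inputs are fractions in [0, 1]. This is an
    illustrative helper, not code from the paper.
    """
    if not (0.0 <= coverage <= 1.0 and 0.0 <= purity <= 1.0):
        raise ValueError("coverage and purity must lie in [0, 1]")
    if coverage + purity == 0.0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2.0 * coverage * purity / (coverage + purity)


# Hypothetical values, for illustration only.
example_score = coverage_purity_f1(coverage=0.8, purity=0.9)
print(f"F1 = {example_score:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two inputs, a system cannot reach a high score by over-segmenting (high purity, low coverage) or under-segmenting (high coverage, low purity).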

Keywords

speaker change detection, speaker turn detection, speech recognition, multitask learning, F1 score

Key words in English

speaker change detection, speaker turn detection, speech recognition, multitask learning, F1 score

Authors

KUMAR, S.; MADIKERI, S.; NIGMATULINA, I.; VILLATORO-TELLO, E.; MOTLÍČEK, P.; PANDIA, K.; DUBAGUNTA, P.; GANAPATHIRAJU, A.

RIV year

2025

Released

14.04.2024

Publisher

IEEE Signal Processing Society

Location

Seoul

ISBN

979-8-3503-4485-1

Book

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Pages from

12592

Pages to

12596

Pages count

5

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446130

BibTeX

@inproceedings{BUT196785,
  author="KUMAR, S. and MADIKERI, S. and NIGMATULINA, I. and VILLATORO-TELLO, E. and MOTLÍČEK, P. and PANDIA, K. and DUBAGUNTA, P. and GANAPATHIRAJU, A.",
  title="Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers",
  booktitle="ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
  year="2024",
  pages="12592--12596",
  publisher="IEEE Signal Processing Society",
  address="Seoul",
  doi="10.1109/ICASSP48485.2024.10446130",
  isbn="979-8-3503-4485-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10446130"
}

Documents

kumar_icassp2024_Multitask_Speech_Recognition_and_Speaker_Change_Detection_for_Unknown_Number_of_Speakers
