R&D Result Detail

Original Title

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

English Title

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

Type

Paper in proceedings (conference paper)

Original Abstract

State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with less than 8M neural parameters and a peak memory during training of 5.7GB, hence trainable with accessible hardware. Encoder speed is between 38% on mid-length speech and 56% on long speech faster than an equivalent Conformer.

English abstract

State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with less than 8M neural parameters and a peak memory during training of 5.7GB, hence trainable with accessible hardware. Encoder speed is between 38% on mid-length speech and 56% on long speech faster than an equivalent Conformer.

Keywords

Hypernetworks, HyperMixer, Efficient Automatic Speech Recognition, LibriSpeech, SpeechBrain

Key words in English

Hypernetworks, HyperMixer, Efficient Automatic Speech Recognition, LibriSpeech, SpeechBrain

Authors

MAI, F.; ZULUAGA-GOMEZ, J.; PARCOLLET, T.; MOTLÍČEK, P.

RIV year

2024

Released

20.08.2023

Publisher

International Speech Communication Association

Location

Dublin

Book

Proceedings of the Annual Conference of International Speech Communication Association, INTERSPEECH

ISSN

1990-9772

Periodical

Proceedings of Interspeech

Volume

2023

Number

08

State

French Republic

Pages from

2213

Pages to

2217

Pages count

5

URL

https://www.isca-archive.org/interspeech_2023/mai23_interspeech.pdf

BibTeX

@inproceedings{BUT187786,
  author="MAI, F. and ZULUAGA-GOMEZ, J. and PARCOLLET, T. and MOTLÍČEK, P.",
  title="HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition",
  booktitle="Proceedings of the Annual Conference of International Speech Communication Association, INTERSPEECH",
  year="2023",
  journal="Proceedings of Interspeech",
  volume="2023",
  number="08",
  pages="2213--2217",
  publisher="International Speech Communication Association",
  address="Dublin",
  doi="10.21437/Interspeech.2023-1611",
  issn="1990-9772",
  url="https://www.isca-archive.org/interspeech_2023/mai23_interspeech.pdf"
}

Documents

mai23_interspeech
