Factorized RVQ-GAN For Disentangled Speech Tokenization

KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J.

Type

Paper in proceedings (conference paper)

Original Abstract

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
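For readers unfamiliar with the "RVQ" in the title, the following is a minimal sketch of generic residual vector quantization, in which each stage quantizes the residual left by the previous one. It is not the HAC implementation: the function names, toy codebooks, and dimensions are all hypothetical, and this record does not specify how HAC assigns quantizer stages to the acoustic, phonetic, and lexical levels or how the HuBERT and LaBSE distillation losses are applied.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Greedy RVQ: each stage picks the entry nearest to the current residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry; the next stage codes what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy setup: two stages, two entries each, 2-dimensional vectors.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # stage 1: coarse entries
    [[0.0, 0.0], [0.5, 0.0]],   # stage 2: finer corrections
]
toy_frame = [1.4, 1.0]

toy_tokens = rvq_encode(toy_frame, toy_codebooks)   # one token per stage
toy_recon = rvq_decode(toy_tokens, toy_codebooks)   # approximation of toy_frame
print(toy_tokens, toy_recon)
```

In a trained codec the codebooks are learned and the input vectors come from an encoder network; here both are fixed by hand purely to show the encode and decode loop.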

Keywords

Audio Codec | GAN | RVQ | Speech Tokenization

Released

01.01.2025

Publisher

International Speech Communication Association

Location

Rotterdam, The Netherlands

Book

Proceedings of the Annual Conference of the International Speech Communication Association Interspeech

Periodical

Interspeech

State

Kingdom of the Netherlands

Pages from

3514

Pages to

3518

Page count

5

URL

https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf

BibTeX

@inproceedings{BUT199387,
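  author="S. {Khurana} and Dominik {Klement} and A. {Laurent} and D. {Bobos} and J. {Novosad} and P. {Gazdik} and E. {Zhang} and Z. {Huang} and A. {Hussein} and R. {Marxer} and Y. {Masuyama} and R. {Aihara} and C. {Hori} and F. {Germain} and G. {Wichern} and J. {Le Roux}",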
  title="Factorized RVQ-GAN For Disentangled Speech Tokenization",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="3514--3518",
  publisher="International Speech Communication Association",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2612",
  url="https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf"
}
