Publication result detail

Factorized RVQ-GAN For Disentangled Speech Tokenization

KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J.

Original title

Factorized RVQ-GAN For Disentangled Speech Tokenization

English title

Factorized RVQ-GAN For Disentangled Speech Tokenization

Type

Paper in proceedings indexed in the WoS or Scopus database

Original abstract

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.

English abstract

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
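For illustration only, the following is a minimal PyTorch-style sketch of the kind of factorized bottleneck the abstract describes: three quantized token streams plus two distillation losses, one toward frame-level HuBERT features and one toward an utterance-level LaBSE embedding. The module names, tensor dimensions, cosine-similarity losses, and summed recombination are assumptions for the sketch, not the authors' implementation.

# Illustrative sketch only (not the paper's code): a factorized bottleneck
# with two distillation objectives, loosely following the abstract.
# All names, dimensions, and loss choices below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Single VQ codebook with a straight-through estimator."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (batch, time, dim); pick the nearest codebook entry per frame.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = dist.argmin(dim=-1)                    # discrete token indices
        q = self.codebook(idx)
        commit = F.mse_loss(z, q.detach())           # commitment loss
        q = z + (q - z).detach()                     # straight-through gradient
        return q, idx, commit

class FactorizedBottleneck(nn.Module):
    """Three token streams (acoustic / phonetic / lexical) plus distillation heads."""
    def __init__(self, dim=512, codebook_size=1024, hubert_dim=768, labse_dim=768):
        super().__init__()
        self.acoustic = VectorQuantizer(codebook_size, dim)
        self.phonetic = VectorQuantizer(codebook_size, dim)
        self.lexical = VectorQuantizer(codebook_size, dim)
        self.to_hubert = nn.Linear(dim, hubert_dim)  # project onto HuBERT feature space
        self.to_labse = nn.Linear(dim, labse_dim)    # project onto LaBSE embedding space

    def forward(self, z, hubert_feats, labse_embed):
        q_a, _, c_a = self.acoustic(z)
        q_p, _, c_p = self.phonetic(z)
        q_l, _, c_l = self.lexical(z)
        # Phonetic stream mimics frame-level HuBERT features;
        # lexical stream (time-pooled) mimics the utterance-level LaBSE vector.
        l_phon = 1 - F.cosine_similarity(self.to_hubert(q_p), hubert_feats, dim=-1).mean()
        l_lex = 1 - F.cosine_similarity(self.to_labse(q_l.mean(dim=1)), labse_embed, dim=-1).mean()
        z_hat = q_a + q_p + q_l                      # recombined bottleneck output
        return z_hat, c_a + c_p + c_l, l_phon, l_lex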

Keywords

Audio Codec | GAN | RVQ | Speech Tokenization

Keywords in English

Audio Codec | GAN | RVQ | Speech Tokenization

Authors

KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J.

Published

01.01.2025

Publisher

International Speech Communication Association

Place

Rotterdam, The Netherlands

Book

Proceedings of the Annual Conference of the International Speech Communication Association Interspeech

Periodical

Interspeech

Country

The Netherlands

Pages from

3514

Pages to

3518

Number of pages

5

URL

https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf

BibTeX

@inproceedings{BUT199387,
  author="{} and Dominik {Klement} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {} and  {}",
  title="Factorized RVQ-GAN For Disentangled Speech Tokenization",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="3514--3518",
  publisher="International Speech Communication Association",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2612",
  url="https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf"
}
