Publication result detail
KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J.
Original Title
Factorized RVQ-GAN For Disentangled Speech Tokenization
Type
Paper in proceedings (conference paper)
Original Abstract
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
Keywords
Audio Codec | GAN | RVQ | Speech Tokenization
Released
01.01.2025
Publisher
International Speech Communication Association
Location
Rotterdam, The Netherlands
Book
Proceedings of the Annual Conference of the International Speech Communication Association Interspeech
Periodical
Interspeech
Country
Kingdom of the Netherlands
Pages from
3514
Pages to
3518
Pages count
5
URL
https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf
BibTeX
@inproceedings{BUT199387,
  author="S. {Khurana} and Dominik {Klement} and A. {Laurent} and D. {Bobos} and J. {Novosad} and P. {Gazdik} and E. {Zhang} and Z. {Huang} and A. {Hussein} and R. {Marxer} and Y. {Masuyama} and R. {Aihara} and C. {Hori} and F. {Germain} and G. {Wichern} and J. {Le Roux}",
  title="Factorized RVQ-GAN For Disentangled Speech Tokenization",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="3514--3518",
  publisher="International Speech Communication Association",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2612",
  url="https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf"
}
Documents
khurana_interspeech_2025