Publication result detail
KHURANA, S.; KLEMENT, D.; LAURENT, A.; BOBOS, D.; NOVOSAD, J.; GAZDIK, P.; ZHANG, E.; HUANG, Z.; HUSSEIN, A.; MARXER, R.; MASUYAMA, Y.; AIHARA, R.; HORI, C.; GERMAIN, F.; WICHERN, G.; LE ROUX, J.
Original title
Factorized RVQ-GAN For Disentangled Speech Tokenization
English title
Type
Paper in proceedings indexed in WoS or Scopus
Original abstract
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
English abstract
Keywords
Audio Codec | GAN | RVQ | Speech Tokenization
Keywords in English
Authors
Published
01.01.2025
Publisher
International Speech Communication Association
Place
Rotterdam, The Netherlands
Book
Proceedings of the Annual Conference of the International Speech Communication Association Interspeech
Periodical
Interspeech
Country
Netherlands
Pages from
3514
Pages to
3518
Number of pages
5
URL
https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf
BibTeX
@inproceedings{BUT199387,
  author="S. {Khurana} and Dominik {Klement} and A. {Laurent} and D. {Bobos} and J. {Novosad} and P. {Gazdik} and E. {Zhang} and Z. {Huang} and A. {Hussein} and R. {Marxer} and Y. {Masuyama} and R. {Aihara} and C. {Hori} and F. {Germain} and G. {Wichern} and J. {Le Roux}",
  title="Factorized RVQ-GAN For Disentangled Speech Tokenization",
  booktitle="Proceedings of the Annual Conference of the International Speech Communication Association Interspeech",
  year="2025",
  journal="Interspeech",
  pages="3514--3518",
  publisher="International Speech Communication Association",
  address="Rotterdam, The Netherlands",
  doi="10.21437/Interspeech.2025-2612",
  url="https://www.isca-archive.org/interspeech_2025/khurana25_interspeech.pdf"
}
Documents
khurana_interspeech_2025