R&D result detail

Original title

CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

Type

Paper in proceedings indexed in WoS or Scopus

Original abstract

Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual information from surrounding frames. CA-MHFA leverages grouped, learnable queries to effectively model contextual dependencies while maintaining efficiency by sharing keys and values across groups. Experimental results on the VoxCeleb dataset show that CA-MHFA achieves EERs of 0.42%, 0.48%, and 0.96% on Vox1-O, Vox1-E, and Vox1-H, respectively, outperforming complex models like WavLM-TDNN with fewer parameters and faster convergence. Additionally, CA-MHFA demonstrates strong generalization across multiple SSL models and tasks, including emotion recognition and anti-spoofing, highlighting its robustness and versatility.

Keywords

Self-supervised learning, speaker verification, speaker extractor, pooling mechanism, speech classification

Authors

PENG, J.; MOŠNER, L.; ZHANG, L.; PLCHOT, O.; STAFYLAKIS, T.; BURGET, L.; ČERNOCKÝ, J.

Published

6 April 2025

Publisher

IEEE Signal Processing Society

Place

Hyderabad

ISBN

979-8-3503-6874-1

Book

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

First page

1

Last page

5

Number of pages

5

URL

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10889058

BibTeX

@inproceedings{BUT198050,
  author="Junyi {Peng} and Ladislav {Mošner} and Lin {Zhang} and Oldřich {Plchot} and Themos {Stafylakis} and Lukáš {Burget} and Jan {Černocký}",
  title="CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2025",
  pages="1--5",
  publisher="IEEE Signal Processing Society",
  address="Hyderabad",
  doi="10.1109/ICASSP49660.2025.10889058",
  isbn="979-8-3503-6874-1",
  url="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10889058"
}

Documents

CA-MHFA_A_Context-Aware_Multi-Head_Factorized_Attentive_Pooling_for_SSL-Based_Speaker_Verification