Publication result detail

Speaker activity driven neural speech extraction

DELCROIX, M.; ŽMOLÍKOVÁ, K.; OCHIAI, T.; KINOSHITA, K.; NAKATANI, T.

Original title

Speaker activity driven neural speech extraction

English title

Speaker activity driven neural speech extraction

Type

Paper in conference proceedings indexed in the WoS or Scopus database

Original abstract

Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in high overlapping conditions, with a relative word error rate reduction of up to 25%.

English abstract

Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where speaker activity obtained from a diarization system is used as a speaker clue for ADEnet. We show that this simple yet practical approach can successfully extract speakers after diarization, which leads to improved ASR performance when using a single microphone, especially in high overlapping conditions, with a relative word error rate reduction of up to 25%.
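For illustration, the core idea in the abstract can be sketched as a mask-estimation network whose per-frame input is the mixture spectrogram concatenated with a binary target-speaker activity flag. This is a minimal sketch under our own assumptions, not the authors' ADEnet implementation; the class name ActivityDrivenExtractor, layer sizes, and the toy activity pattern are all illustrative.

# Minimal sketch (assumed architecture, not the paper's code): a BLSTM
# estimates a time-frequency mask from the mixture magnitude spectrogram
# concatenated with a frame-level target-speaker activity flag.
import torch
import torch.nn as nn

class ActivityDrivenExtractor(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        # Input per frame: magnitude spectrum (n_freq) + 1 activity flag.
        self.blstm = nn.LSTM(n_freq + 1, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, activity):
        # mix_mag:  (batch, frames, n_freq) mixture magnitude spectrogram
        # activity: (batch, frames), 1.0 where the target speaker is active
        x = torch.cat([mix_mag, activity.unsqueeze(-1)], dim=-1)
        h, _ = self.blstm(x)
        m = self.mask(h)        # time-frequency mask in [0, 1]
        return m * mix_mag      # estimated target-speaker magnitude

# Usage with random data: target speaker active in the first 60 frames.
net = ActivityDrivenExtractor()
mix = torch.rand(1, 100, 257)
act = (torch.arange(100) < 60).float().unsqueeze(0)
est = net(mix, act)
print(est.shape)  # torch.Size([1, 100, 257])

In the meeting scenario described above, the activity input would come from a diarization system rather than oracle labels, which is what makes the clue practical without pre-recorded enrollment utterances.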

Keywords

Speech extraction, Speaker activity, Speech enhancement, Meeting recognition, Neural network

Keywords in English

Speech extraction, Speaker activity, Speech enhancement, Meeting recognition, Neural network

Authors

DELCROIX, M.; ŽMOLÍKOVÁ, K.; OCHIAI, T.; KINOSHITA, K.; NAKATANI, T.

RIV year

2022

Published

06.06.2021

Publisher

IEEE Signal Processing Society

Place

Toronto

ISBN

978-1-7281-7605-5

Book

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Pages from

6099

Pages to

6103

Number of pages

5

URL

https://www.fit.vut.cz/research/publication/12479/

BibTeX

@inproceedings{BUT171749,
  author="DELCROIX, M. and ŽMOLÍKOVÁ, K. and OCHIAI, T. and KINOSHITA, K. and NAKATANI, T.",
  title="Speaker activity driven neural speech extraction",
  booktitle="ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
  year="2021",
  pages="6099--6103",
  publisher="IEEE Signal Processing Society",
  address="Toronto",
  doi="10.1109/ICASSP39728.2021.9414998",
  isbn="978-1-7281-7605-5",
  url="https://www.fit.vut.cz/research/publication/12479/"
}
