Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Xiaodong Cui; Vaibhava Goel; George Saon

INTERSPEECH 2017

Abstract

An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are constant for a given speaker, are mapped through a control network to layer-dependent element-wise affine transformations that canonicalize the internal feature representations at the outputs of the hidden layers of a main network. The control network that generates the speaker-dependent mappings is jointly estimated with the main network for the overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can yield superior performance over the widely used speaker-aware training using i-vectors with speaker-adapted input features.
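The forward pass described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify layer sizes, nonlinearities, or the structure of the control network, so several things here are assumptions made for illustration: the control network is reduced to one linear map per hidden layer, the scale is predicted as an offset from 1, the hidden nonlinearity is tanh, and all names (`control_network`, `main_network`, `build_model`) and numeric values are hypothetical. Joint training of the two networks is omitted; only inference with fixed random weights is shown.

```python
import math
import random


def init_linear(rng, n_out, n_in, std=0.1):
    """Random weights (list of rows) and zero bias for a dense layer."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Compute W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def control_network(control_params, embedding):
    """Map a speaker embedding to one (scale, shift) pair per hidden layer."""
    transforms = []
    for scale_layer, shift_layer in control_params:
        # Assumption: the scale is predicted as an offset from 1, so an all-zero
        # embedding (with zero biases) yields the identity transformation.
        scale = [1.0 + s for s in linear(scale_layer, embedding)]
        shift = linear(shift_layer, embedding)
        transforms.append((scale, shift))
    return transforms


def main_network(hidden_layers, output_layer, transforms, features):
    """Forward pass with a speaker-dependent affine map after each hidden layer."""
    h = features
    for layer, (scale, shift) in zip(hidden_layers, transforms):
        h = [math.tanh(v) for v in linear(layer, h)]
        # Element-wise affine transformation of the hidden-layer output.
        h = [s * v + t for s, v, t in zip(scale, h, shift)]
    return linear(output_layer, h)


def build_model(seed=0, feat_dim=4, hidden_dim=5, emb_dim=3, n_hidden=2, n_out=2):
    """Create hypothetical main-network and control-network parameters."""
    rng = random.Random(seed)
    dims = [feat_dim] + [hidden_dim] * n_hidden
    hidden_layers = [init_linear(rng, dims[i + 1], dims[i]) for i in range(n_hidden)]
    output_layer = init_linear(rng, n_out, hidden_dim)
    control_params = [
        (init_linear(rng, hidden_dim, emb_dim), init_linear(rng, hidden_dim, emb_dim))
        for _ in range(n_hidden)
    ]
    return hidden_layers, output_layer, control_params


hidden_layers, output_layer, control_params = build_model()
frame = [0.2, -0.1, 0.4, 0.3]   # one acoustic feature frame (made-up values)
speaker_a = [0.5, -1.0, 0.3]    # speaker embeddings (made-up values)
speaker_b = [-0.7, 0.2, 1.1]

out_a = main_network(hidden_layers, output_layer,
                     control_network(control_params, speaker_a), frame)
out_b = main_network(hidden_layers, output_layer,
                     control_network(control_params, speaker_b), frame)
```

The same frame yields different outputs for the two speakers because each embedding produces its own per-layer scale and shift, while the main-network weights are shared across speakers. In a trained system the parameters of both networks would be estimated jointly, as the abstract states.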

Topics

Machine Learning > Core Methods > Embedding Learning; Deep Learning > Techniques > Pretraining; Speech & Audio > Recognition > Speech Recognition; Deep Learning > Learning Types > Representation Learning; Deep Learning > Learning Types > Transfer Learning

Keywords

speaker embedding; speaker recognition; acoustic modeling; feature representation; deep neural network; speaker adaptive training
