Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification

Zhanghao Wu; Shuai Wang; Yanmin Qian; Kai Yu

INTERSPEECH 2019

Abstract

Domain or environment mismatch between training and testing, caused for example by differing noise and channel conditions, is a major challenge for speaker verification. In this paper, a variational autoencoder (VAE) is designed to learn the patterns of speaker embeddings (both i-vectors and x-vectors) extracted from noisy speech segments and to generate more diverse embeddings, improving the robustness of speaker verification systems that use a probabilistic linear discriminant analysis (PLDA) back-end. The approach is evaluated on the standard NIST SRE 2016 dataset. Compared to manual and generative adversarial network (GAN) based augmentation approaches, the proposed VAE based augmentation achieves slightly better performance for i-vectors on Tagalog and Cantonese, with EERs of 15.54% and 7.84%, and a more significant improvement for x-vectors on the same two languages, with EERs of 11.86% and 4.20%.
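The abstract does not spell out the VAE architecture or its training objective, so the following is only a minimal sketch of the standard VAE ingredients, assuming a diagonal Gaussian encoder and a squared-error reconstruction term. The function names (`reparameterize`, `kl_to_standard_normal`, `vae_loss`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def vae_loss(x, x_recon, mu, log_var):
    """Squared-error reconstruction plus KL term, averaged over the batch.

    Assumption: squared error stands in for the reconstruction likelihood;
    the paper may use a different term or weighting.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))
```

In an augmentation setting of the kind the abstract describes, a trained decoder would map sampled latent vectors back to embedding space, and the generated embeddings would be added to the PLDA training set. That encoder and decoder are deliberately left out here because the abstract does not describe them.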

Authors

Zhanghao Wu; Shuai Wang; Yanmin Qian; Kai Yu

Topics

Machine Learning > Application Areas > Data Augmentation
Deep Learning > Architectures > Autoencoders
Deep Learning > Models > Variational Inference
Speech & Audio > Recognition > Speaker Recognition

Keywords

data augmentation; speaker embedding; speaker verification; variational autoencoder; probabilistic linear discriminant analysis; embedding augmentation

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019