Audio Feature Extraction with Convolutional Neural Autoencoders with Application to Voice Conversion

Feature extraction is a key step in many machine learning and signal processing applications. For speech signals in particular, it is important to derive features that capture both the vocal characteristics of the speaker and the content of the speech. In this paper, we introduce a convolutional auto-encoder (CAE) to extract features from speech represented via a proposed short-time discrete cosine transform (STDCT). We then introduce a deep neural mapping at the encoding bottleneck that converts a source speaker’s speech to a target speaker’s speech while preserving the source-speech content. We further compare this approach to clustering-based and linear mappings.
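This record does not give the paper's definition of the STDCT, so the following is only a minimal sketch of one plausible reading: split the signal into overlapping frames, window each frame, and take an orthonormal DCT-II of each one. The function names (`dct2_matrix`, `stdct`), the Hann window, and the frame and hop lengths are illustrative assumptions, not the authors' settings.

```python
import numpy as np


def dct2_matrix(n):
    """Return the orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    t = np.arange(n)[None, :]  # time index (columns)
    m = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)    # scale the DC row so the matrix is orthogonal
    return m


def stdct(signal, frame_len=256, hop=128):
    """Framewise DCT-II of a 1-D signal (assumed reading of 'STDCT').

    Returns a real array of shape (n_frames, frame_len).
    Trailing samples that do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim != 1 or len(signal) < frame_len:
        raise ValueError("signal must be 1-D and at least frame_len long")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed window choice
    basis = dct2_matrix(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return (frames * window) @ basis.T
```

Unlike a short-time Fourier transform, this representation is real-valued, so a convolutional network could consume it as a single-channel image without separate magnitude and phase handling. Whether that is the authors' motivation is not stated in this record.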


Presented at:
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), London, May 12-17, 2019
Year:
May 12 2019




 Record created 2018-11-25, last modified 2019-03-17
