voice conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/dlhlp20/voice... ·...

Voice Conversion Hung-yi Lee 李宏毅


Post on 26-Jul-2021


Page 1: Voice Conversion - speech.ee.ntu.edu.twspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Voice... · Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019 •[Biadsy,

Voice Conversion

Hung-yi Lee 李宏毅

Page 2: Voice Conversion (VC)

[Diagram: speech (dimension d, length T) → Voice Conversion → speech (dimension d, length T′). Sometimes T = T′.]

Vocoder

Used in VC, TTS, Denoise, etc. (not today)

• Rule-based: Griffin-Lim algorithm
• Deep Learning: WaveNet

Page 3: Categories

Parallel Data: both speakers say the same sentence ("How are you?" / "How are you?").

Lack of training data:
• Model Pre-training [Huang, et al., arXiv’19]
• Synthesized data! [Biadsy, et al., INTERSPEECH’19]

Unparallel Data: the speakers say different sentences, e.g. "天氣真好" ("The weather is really nice") / "How are you?".

• This is “audio style transfer”
• Borrowing techniques from image style transfer

Page 4: Categories

• Parallel Data
• Unparallel Data
  • Feature Disentangle
  • Direct Transformation

Feature Disentangle:
• Content Encoder: phonetic information
• Speaker Encoder: speaker information

Page 5: Feature Disentangle

[Diagram: the utterance "Do you want to study PhD?" is fed to both the Content Encoder and the Speaker Encoder. The Decoder combines the two encoder outputs and produces "Do you want to study PhD?".]

Page 6: Feature Disentangle

[Diagram: "Do you want to study PhD?" is fed to the Content Encoder, and a different utterance, "Good bye", is fed to the Speaker Encoder. The Decoder combines the two encoder outputs and produces "Do you want to study PhD?", with the speaker information taken from the "Good bye" utterance.]

Page 7: Feature Disentangle

[Diagram: input audio → Content Encoder and Speaker Encoder → Decoder → reconstructed audio. The reconstruction should be as close as possible to the input audio (L1 or L2 distance).]

• Pre-training encoders
• Adding discriminator
• Designing network architecture
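The "as close as possible (L1 or L2 distance)" criterion can be sketched as follows. This is a minimal illustration, not the lecture's code: the feature shape (d = 80, T = 100) and the noisy stand-in for the Decoder output are assumptions made only so the example runs.

```python
import numpy as np

def l1_distance(x, y):
    """Mean absolute difference between two feature sequences of shape (d, T)."""
    return float(np.mean(np.abs(x - y)))

def l2_distance(x, y):
    """Mean squared difference between two feature sequences of shape (d, T)."""
    return float(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
# Hypothetical input features: d = 80 dimensions, T = 100 frames.
input_audio = rng.normal(size=(80, 100))
# Stand-in for the Decoder output: the input plus a small error.
reconstructed = input_audio + 0.1 * rng.normal(size=(80, 100))

print("L1:", l1_distance(input_audio, reconstructed))
print("L2:", l2_distance(input_audio, reconstructed))
```

Either distance is zero only when the reconstruction matches the input exactly, which is what training pushes the encoders and Decoder towards.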

Page 8: Pre-training Encoders

[Diagram: input audio → Content Encoder and Speaker Encoder → Decoder → reconstructed audio.]

Speaker Encoder:
• One-hot vector for each speaker (Issue: difficult to consider new speakers)
• Speaker embedding (i-vector, d-vector, x-vector)

Content Encoder:
• Speech recognition (e.g. W AH N P AH N CH …)

Page 9: Pre-training Encoders

• One-hot vector for each speaker

[Diagram: Speaker A audio → Content Encoder → Decoder → reconstructed Speaker A audio. In place of the Speaker Encoder, the Decoder receives a one-hot vector over the speakers (A, B); for Speaker A it is (1, 0).]

Page 10: Pre-training Encoders

• One-hot vector for each speaker

[Diagram: Speaker B audio → Content Encoder → Decoder → reconstructed Speaker B audio. In place of the Speaker Encoder, the Decoder receives a one-hot vector over the speakers (A, B); for Speaker B it is (0, 1).]
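A small sketch of the one-hot speaker code for the two speakers A and B. How the Decoder actually consumes the code is not stated on the slides; tiling it over time and concatenating it with the content features, as `decoder_input` does here, is an assumption, and both function names are hypothetical.

```python
import numpy as np

SPEAKERS = ["A", "B"]  # the two speakers used in the slide example

def one_hot(speaker):
    """Return the one-hot vector for a known speaker, e.g. A -> (1, 0)."""
    vec = np.zeros(len(SPEAKERS))
    vec[SPEAKERS.index(speaker)] = 1.0
    return vec

def decoder_input(content, speaker):
    """Assumed conditioning scheme: repeat the speaker code for every frame
    and stack it under the content features.
    content: (d, T) -> result: (d + number of speakers, T)."""
    code = one_hot(speaker)[:, None]                    # (num_speakers, 1)
    tiled = np.repeat(code, content.shape[1], axis=1)   # (num_speakers, T)
    return np.concatenate([content, tiled], axis=0)

content = np.zeros((4, 5))  # hypothetical Content Encoder output, d = 4, T = 5
print(one_hot("A"), one_hot("B"))
print(decoder_input(content, "A").shape)
```

A speaker that is not in `SPEAKERS` has no code at all (`SPEAKERS.index` raises `ValueError`), which illustrates why one-hot vectors make new speakers difficult to handle.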


Page 12: Adversarial Training

[Diagram: "How are you?" → Content Encoder and Speaker Encoder → Decoder → "How are you?". A Speaker Classifier (Discriminator) looks at the Content Encoder output and guesses which speaker produced it.]

• The Content Encoder learns to fool the speaker classifier.
• Speaker classifier and encoder are learned iteratively.
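The two alternating objectives can be sketched as below. This is an illustrative assumption rather than the lecture's formulation: the slides only say that the encoder learns to fool the speaker classifier, and subtracting the classifier's cross-entropy from the reconstruction error (with a hypothetical `weight`) is one common way to express that.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, speaker_ids):
    """Mean negative log-probability assigned to the true speakers."""
    probs = softmax(logits)
    picked = probs[np.arange(len(speaker_ids)), speaker_ids]
    return float(-np.mean(np.log(picked)))

def classifier_loss(logits, speaker_ids):
    """Step 1 (classifier update): identify the speaker from the content code."""
    return cross_entropy(logits, speaker_ids)

def encoder_loss(reconstruction_error, logits, speaker_ids, weight=1.0):
    """Step 2 (encoder update): reconstruct well while making the classifier
    wrong. A lower value means a better encoder under this assumed objective."""
    return reconstruction_error - weight * cross_entropy(logits, speaker_ids)

speaker_ids = np.array([0, 1])
confident = np.array([[10.0, 0.0], [0.0, 10.0]])  # classifier sees the speaker
fooled = np.zeros((2, 2))                         # classifier can only guess

print(classifier_loss(confident, speaker_ids), classifier_loss(fooled, speaker_ids))
print(encoder_loss(0.5, confident, speaker_ids), encoder_loss(0.5, fooled, speaker_ids))
```

In a training loop the two steps would alternate: update the classifier with `classifier_loss` while the encoder is fixed, then update the encoder with `encoder_loss` while the classifier is fixed.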

Page 13: Instance Normalization

[Diagram: "How are you?" → Content Encoder (with IN) and Speaker Encoder → Decoder → "How are you?".]

IN = instance normalization (remove global information)

Page 14: Instance Normalization

IN = instance normalization (remove global information)

[Diagram: Phonetic Encoder.]

Page 15: Instance Normalization

[Diagram: inside the Phonetic Encoder, the channels of a feature map before and after IN.]

• Normalize for each channel
• Each channel has zero mean and unit variance
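Normalizing each channel to zero mean and unit variance can be sketched in a few lines. The (channels, time) layout, the `eps` constant, and the random feature map are assumptions made for illustration.

```python
import numpy as np

def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance over the time axis.
    features: array of shape (channels, time)."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
# Hypothetical feature map: 4 channels, 200 frames, with a large global offset.
features = rng.normal(loc=5.0, scale=3.0, size=(4, 200))
normalized = instance_norm(features)

print(normalized.mean(axis=1))  # each entry is close to 0
print(normalized.std(axis=1))   # each entry is close to 1
```

Because the per-channel mean and scale are removed for every utterance separately, statistics that stay constant over the whole utterance (the "global information") no longer reach the following layers.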

Page 16: Instance Normalization

[Diagram: "How are you?" → Content Encoder (with IN) and Speaker Encoder → Decoder → "How are you?".]

IN = instance normalization (remove global information)

Page 17: Instance Normalization

[Diagram: "How are you?" → Content Encoder (with IN) and Speaker Encoder → Decoder (with AdaIN) → "How are you?".]

IN = instance normalization (remove global information)

AdaIN = adaptive instance normalization (only influence global information)

Page 18

[Diagram: inside the Decoder, a feature map passes through IN to give $z_1, z_2, z_3, z_4$. The output of the Speaker Encoder provides $\gamma$ and $\beta$, which add global information and give $z_1', z_2', z_3', z_4'$.]

$z_i' = \gamma \odot z_i + \beta$

AdaIN = adaptive instance normalization (only influence global information)
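A minimal sketch of AdaIN as IN followed by the scale and shift $\gamma \odot z_i + \beta$. It assumes that each $z_i$ is the feature vector at position $i$ and that $\gamma$ and $\beta$ hold one value per channel; in the model they would come from the Speaker Encoder, whereas here they are fixed hypothetical numbers.

```python
import numpy as np

def instance_norm(features, eps=1e-5):
    """Per-channel normalization over time; features has shape (channels, time)."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)

def adain(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalization: normalize, then apply the same
    per-channel scale (gamma) and shift (beta) at every time step."""
    z = instance_norm(features, eps)
    return gamma[:, None] * z + beta[:, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 50))   # hypothetical decoder feature map
gamma = np.array([2.0, 0.5, 1.0])     # stand-in for Speaker Encoder output
beta = np.array([1.0, -1.0, 0.0])     # stand-in for Speaker Encoder output

out = adain(features, gamma, beta)
print(out.mean(axis=1))  # close to beta
print(out.std(axis=1))   # close to gamma
```

Since the same $\gamma$ and $\beta$ are applied at every time step, the speaker information changes only the per-channel statistics of the whole utterance, which matches the "only influence global information" note.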

Page 19: Instance Normalization

[Diagram: "How are you?" → Content Encoder (with IN) → Speaker Classifier: which speaker? Training from VCTK.]

Speaker Classifier accuracy:

        With IN    Without IN
Acc.    0.375      0.658

Page 20: Instance Normalization

[Figure: "How are you?" → Content Encoder (with IN) and Speaker Encoder, training from VCTK. Plot of unseen speaker utterances, labeled female and male.]

For more results [Chou, et al., INTERSPEECH 2019]

Page 21: Categories

• Parallel Data
• Unparallel Data
  • Feature Disentangle
  • Direct Transformation

Direct Transformation

[Diagram: Voice Conversion]

• Training without parallel data
• Using CycleGAN

Page 22: Cycle GAN

[Diagram: Speaker X audio → $G_{X \to Y}$ → output that should become similar to speaker Y. $D_Y$, trained with Speaker Y audio, outputs a scalar: does the input audio belong to speaker Y?]

Page 23: Cycle GAN

[Diagram: Speaker X audio → $G_{X \to Y}$ → output that should become similar to speaker Y. $D_Y$, trained with Speaker Y audio, outputs a scalar: does the input audio belong to speaker Y?]

Not what we want! $G_{X \to Y}$ can ignore its input.

Page 24: Cycle GAN

[Diagram: Speaker X audio → $G_{X \to Y}$ → $G_{Y \to X}$ → output as close as possible to the input (L1 or L2 distance): cycle consistency. $D_Y$ outputs a scalar: input audio belongs to speaker Y or not. Other labels: Speaker Y, identity.]

Page 25: Cycle GAN

[Diagram, both directions:]

• Speaker X audio → $G_{X \to Y}$ → $G_{Y \to X}$ → as close as possible to the input
• Speaker Y audio → $G_{Y \to X}$ → $G_{X \to Y}$ → as close as possible to the input
• $D_Y$: scalar, belongs to speaker Y or not
• $D_X$: scalar, belongs to speaker X or not
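The cycle-consistency term can be sketched with toy generators. The functions `g_xy`, `g_yx`, and `g_xy_ignoring_input` are hypothetical stand-ins (simple arithmetic, not neural networks), chosen only to show how the loss penalizes a generator that ignores its input; the L1 distance is one of the two options named on the slide.

```python
import numpy as np

def cycle_consistency_loss(x, forward, backward):
    """L1 distance between x and backward(forward(x))."""
    return float(np.mean(np.abs(backward(forward(x)) - x)))

# Toy stand-ins for G_{X->Y} and G_{Y->X}: exact inverses of each other.
def g_xy(a):
    return a + 1.0

def g_yx(a):
    return a - 1.0

# A degenerate G_{X->Y} that ignores its input and always emits the same output.
def g_xy_ignoring_input(a):
    return np.zeros_like(a)

rng = np.random.default_rng(0)
x = rng.normal(size=(80, 100))  # hypothetical features of Speaker X audio
y = rng.normal(size=(80, 100))  # hypothetical features of Speaker Y audio

# Both directions are summed, as in the two-way diagram.
total = cycle_consistency_loss(x, g_xy, g_yx) + cycle_consistency_loss(y, g_yx, g_xy)
print("consistent generators:", total)
print("input ignored:", cycle_consistency_loss(x, g_xy_ignoring_input, g_yx))
```

When the forward generator discards its input, the backward generator has nothing left to recover the original from, so the loss stays large; this is what discourages the "ignore input" failure.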

Page 26: StarGAN

[Diagram: four speakers $s_1$, $s_2$, $s_3$, $s_4$ and a single generator $G$. $G$ takes audio of speaker x together with a target speaker $s_i$, and outputs audio of speaker $s_i$.]
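To see why a single generator is attractive, the sketch below counts how many direction-specific generators (one per ordered pair, as with $G_{X \to Y}$ and $G_{Y \to X}$ in CycleGAN) would be needed for the four speakers in this example. The speaker names are just labels for illustration.

```python
from itertools import permutations

speakers = ["s1", "s2", "s3", "s4"]

# One generator per ordered (source, target) pair when each direction
# needs its own network.
directed_pairs = list(permutations(speakers, 2))
print(len(directed_pairs))  # 12

# StarGAN instead conditions one generator on the target speaker.
single_generator_count = 1
print(single_generator_count)
```

With N speakers the pairwise count grows as N(N - 1), while the conditioned generator stays at one.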

Page 27: StarGAN

[Diagram: $G$ takes audio of speaker $s_i$ together with a target speaker $s_j$, and outputs audio of speaker $s_j$. $D$ takes audio together with a speaker and outputs a scalar: belongs to input speaker or not.]

Each speaker is represented as a vector.

Page 28

[Diagram, CycleGAN: $G_{X \to Y}$ → $G_{Y \to X}$, output as close as possible to the input. $D_Y$ scalar: belongs to speaker Y or not.]

[Diagram, StarGAN: audio of speaker $s_k$ → $G$ (target speaker $s_i$) → $G$ (target speaker $s_k$), output as close as possible to the input. $D$ scalar: belongs to input speaker or not.]

Page 29: Categories

• Parallel Data
• Unparallel Data
  • Feature Disentangle
  • Direct Transformation

Page 30: Reference

• [Huang, et al., arXiv’19] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda, Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining, arXiv, 2019

• [Biadsy, et al., INTERSPEECH’19] Fadi Biadsy, Ron J. Weiss, Pedro J. Moreno, Dimitri Kanevsky, Ye Jia, Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation, INTERSPEECH, 2019