Deep generative models for raw audio synthesis (presentation)

Contents

Slide 2

[Figure: a raw audio waveform; amplitude plotted over time]

Slide 3

Slide 4


VOICE CONVERSION IN A NUTSHELL

Source speaker waveform

Target speaker waveform

Black magic

Encoder

Decoder

some signal processing + some deep learning

Similar to what people use in ASR systems

Waveform synthesis

Slide 5


Hello AIUkraine!

Text-to-speech

Slide 6


Very high dimensionality
Typical sample rate ranges from 16,000 to 44,000 samples per second

One second of 16 kHz speech contains 16,000 samples

Slide 7


Samples are strongly correlated

Periodicity + long-term dependencies

We need to jointly model thousands of random variables
Slide 8


There is no single answer

The same text can correspond to many different waveforms

Slide 9


Issues with conventional methods

Hard to control prosody (emotional content)
Require a lot of labeled data
Inexpressive models (such as HMMs)
Rely heavily on domain knowledge
Hard to get natural-sounding speech
Slide 10


Idea:
Reformulate the task as joint probability (or density) estimation: which waveforms are likely to correspond to a given text?
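In symbols (a plausible reconstruction; the slide's own formula is not preserved in the extracted text), the model estimates the conditional density of the waveform samples given the text:

p_\theta(x_1, \dots, x_T \mid \text{text})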
Slide 11


Analogy to machine translation

English

German

Multiple outcomes
Joint distribution of words (language model)

Slide 12


Parameter estimation is typically performed via maximum likelihood estimation


Slide 13


Recap: the maximum likelihood

Maximize the probability of observing the data
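Written out in standard form (a reconstruction, not the slide's exact notation): choose the parameters that make the observed waveform/text pairs most probable,

\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}\left(x^{(i)} \mid \text{text}^{(i)}\right)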

Slide 14


Autoregressive models

Time series forecasting (ARIMA, SARIMA, FARIMA)

Language models (typically with recurrent neural networks)

Basic idea: the next value can be represented as a function of the previous values
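This is the standard autoregressive factorization of the joint distribution (generic form, not copied from the slides):

p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})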

Slide 15


WaveNet

Source: DeepMind blog

Waveform is modeled by a stack of dilated causal convolutions

https://arxiv.org/abs/1609.03499

Input: text + previous amplitudes

Output: next amplitude (a distribution over quantized values)
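A minimal sketch of a dilated causal convolution stack in PyTorch (my simplification and naming, not the paper's architecture: the real WaveNet adds gated activations, skip connections, µ-law quantization and conditioning on text):

import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    # A 1-D convolution that only sees past samples: pad, then cut off the
    # extra positions so that output t depends only on inputs <= t.
    def __init__(self, channels, kernel_size, dilation):
        super().__init__(channels, channels, kernel_size,
                         padding=(kernel_size - 1) * dilation, dilation=dilation)

    def forward(self, x):
        return super().forward(x)[..., :x.shape[-1]]

class TinyWaveNet(nn.Module):
    # Stack of dilated causal convolutions; the dilation doubles with depth,
    # so the receptive field grows exponentially with the number of layers.
    def __init__(self, channels=64, n_layers=8, n_classes=256):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, 1)
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, kernel_size=2, dilation=2 ** i) for i in range(n_layers)]
        )
        self.out = nn.Conv1d(channels, n_classes, 1)  # logits over quantized amplitudes

    def forward(self, x):                  # x: (batch, 1, time), amplitudes in [-1, 1]
        h = self.embed(x)
        for layer in self.layers:
            h = h + torch.relu(layer(h))   # residual connection around each layer
        return self.out(h)                 # (batch, n_classes, time)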

Slide 16


WaveNet

Training: maximize the probability estimated by the model according to the

maximum likelihood principle. Can be done in parallel for all time steps:

Generation: sequentially generate samples one by one, sampling from a predicted distribution on every time step
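A toy sketch of that sequential sampling loop (assuming a model like the hypothetical TinyWaveNet above that returns logits over 256 quantized amplitude levels; real implementations cache convolution states instead of re-running the whole network at every step):

import torch

@torch.no_grad()
def generate(model, n_samples):
    # Naive autoregressive sampling: every new sample is drawn from the
    # distribution predicted from all previously generated samples.
    x = torch.zeros(1, 1, 1)                               # seed with one sample of silence
    for _ in range(n_samples):
        logits = model(x)[:, :, -1]                        # prediction for the next time step
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs, num_samples=1)      # sample a quantized amplitude class
        amp = idx.float() / (probs.shape[-1] - 1) * 2 - 1  # map the class index back to [-1, 1]
        x = torch.cat([x, amp.view(1, 1, 1)], dim=-1)
    return x.squeeze()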

Slide 17


Data scientists when their model is training

Slide 18


Deep learning engineers when their WaveNet is generating

Slide 19


Autoencoders

Encoder

Decoder

Bottleneck

High-level abstract features

Low-level features

Goal: reconstruct the input
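A toy PyTorch sketch of the idea (names and sizes are mine, purely illustrative):

import torch.nn as nn

class AutoEncoder(nn.Module):
    # The narrow bottleneck forces the encoder to keep only high-level abstract
    # features; the decoder must reconstruct the low-level input from them.
    def __init__(self, input_dim=1024, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes a reconstruction loss, e.g. nn.MSELoss()(model(x), x)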

Slide 20


Variational autoencoder

Latent space

Learns an approximation q(z | x) to the true posterior p(z | x) over the latent variables

Condition (text)

Slide 21


Variational autoencoder: sampling

Typically a normal distribution

By tweaking the latent variables, we can control prosody, tempo, accent and much more

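A hedged sketch of what sampling at synthesis time could look like (decoder, text_embedding and the concatenation-based conditioning are hypothetical placeholders, not names from the slides):

import torch

def synthesize(decoder, text_embedding, latent_dim=32):
    # Draw a latent vector from the prior (typically a standard normal) and decode
    # it together with the text condition. Re-sampling or tweaking z changes
    # prosody, tempo, accent, etc., while the text content stays fixed.
    z = torch.randn(1, latent_dim)
    return decoder(torch.cat([z, text_embedding], dim=-1))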

Slide 22


Variational autoencoder: latent space

Source: https://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html

Slide 23


Upgrade: VQ-VAE

Now the latent space is discrete and represented by an autoregressive model

https://arxiv.org/abs/1711.00937
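A sketch of the vector-quantization step only (assuming PyTorch; the codebook and commitment loss terms from the paper are omitted):

import torch

def quantize(z_e, codebook):
    # z_e: encoder outputs (batch, n_vectors, dim); codebook: (codebook_size, dim).
    # Each continuous vector is replaced by its nearest codebook entry, so the
    # latent representation becomes a grid of discrete indices.
    batch = z_e.shape[0]
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(batch, -1, -1))
    indices = dists.argmin(dim=-1)                 # discrete latent codes
    z_q = codebook[indices]                        # (batch, n_vectors, dim)
    # Straight-through estimator: gradients flow to the encoder as if z_q were z_e
    return z_e + (z_q - z_e).detach(), indices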

Slide 24


Normalizing flows

Take a random variable z with distribution p(z), and apply some invertible mapping f: x = f(z)
Slide 25


Normalizing flows


Recall the change of variables rule:
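For a scalar variable the rule reads (standard form; the slide's own formula was an image):

p_X(x) = p_Z(f^{-1}(x)) \left| \frac{d f^{-1}(x)}{d x} \right|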

Slide 26


The change of variables rule

For multidimensional random variables, replace the derivative with the Jacobian (a matrix of derivatives)
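In standard notation (a reconstruction of the formula the slide most likely showed):

p_X(x) = p_Z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|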
Slide 27


General case (multiple transforms)

Can be optimized directly, e.g. with stochastic gradient ascent.

The chained sequence of invertible transforms is called a flow.
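For a chain of K invertible transforms z_0 \to z_1 \to \dots \to z_K = x with z_k = f_k(z_{k-1}), the log-likelihood becomes a sum of log-determinant terms (standard form, not copied from the slides):

\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|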

Slide 28

[Diagram: text → waveform]

Slide 29


Key idea: represent WaveNet with a normalizing flow

This approach is called Inverse Autoregressive Flow
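In generic form (my notation, not the slides'), an inverse autoregressive flow maps white noise z to a waveform x via

x_t = z_t \cdot \sigma_t(z_{1:t-1}) + \mu_t(z_{1:t-1})

so, once the noise is drawn, every x_t can be computed in parallel, which is what makes generation fast; evaluating the density of a given waveform (and hence likelihood-based training) becomes the sequential part.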
Slide 30


[Diagram: white noise + text are transformed into a waveform]

https://deepmind.com/blog/article/high-fidelity-speech-synthesis-wavenet

Slide 31


Parallel WaveNet: the voice of Google Assistant

https://arxiv.org/abs/1711.10433

WaveNet: fast training, slow generation
Parallel WaveNet: slow training, fast generation
Slide 32


https://arxiv.org/abs/1609.03499 - WaveNet
https://arxiv.org/abs/1312.6114 - Variational Autoencoder
https://arxiv.org/abs/1711.00937 - VQ-VAE
https://arxiv.org/abs/1711.10433 - Parallel WaveNet
https://deepmind.com/blog/article/wavenet-generative-model-raw-audio - DeepMind’s blogpost on WaveNet
https://deepmind.com/blog/article/high-fidelity-speech-synthesis-wavenet - DeepMind’s blogpost on Parallel WaveNet
https://avdnoord.github.io/homepage/vqvae/ - VQ-VAE explanation from the author
https://deepgenerativemodels.github.io/notes/autoregressive/ - a good tutorial on deep autoregressive models
https://blog.evjang.com/2018/01/nf1.html - a nice intro to normalizing flows
https://medium.com/@kion.kim/wavenet-a-network-good-to-know-7caaae735435 - introductory blogpost on WaveNet
http://anotherdatum.com/vae.html - a good explanation of principles and math behind VAE