Deep generative models for raw audio synthesis (presentation)

Contents

Slide 2

[Figure: a raw audio waveform; amplitude plotted over time]

Slide 3

Slide 4


VOICE CONVERSION IN A NUTSHELL

Source speaker waveform

Target speaker waveform

Black magic

Encoder

Decoder

some signal processing + some deep learning

Similar to what people use in ASR systems

Waveform synthesis

Slide 5


Hello AIUkraine!

Text-to-speech

Slide 6


Very high dimensionality
Typical sample rate ranges from 16,000 to 44,000 samples per second

One second of 16 kHz speech contains 16,000 samples

Slide 7


Samples are strongly correlated

Periodicity + long-term dependencies

We need to jointly model thousands of random variables
Slide 8


There is no single answer

The same text can correspond to many different waveforms

Slide 9


Issues with conventional methods

Hard to control prosody (emotional content)
Require a lot of labeled data
Inexpressive models (such as HMMs)
Rely heavily on domain knowledge
Hard to get natural-sounding speech
Slide 10


Idea:
Reformulate the task as joint probability (or density) estimation: which waveforms are likely to correspond to a given text?
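In symbols (a plausible reconstruction; the slide's own formula is not preserved in the extracted text), the model estimates the conditional density of the waveform samples given the text:

p_\theta(x_1, \dots, x_T \mid \text{text})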
Slide 11


Analogy to machine translation

English

German

Multiple outcomes
Joint distribution of words (language model)

Slide 12


Parameter estimation is typically performed via maximum likelihood estimation


Slide 13


Recap: the maximum likelihood

Maximize the probability of observing the data
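Written out in standard form (a reconstruction, not the slide's exact notation): choose the parameters that make the observed waveform/text pairs most probable,

\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}\left(x^{(i)} \mid \text{text}^{(i)}\right)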

Slide 14


Autoregressive models

Time series forecasting (ARIMA, SARIMA, FARIMA)

Language models (typically with recurrent neural networks)

Basic idea: the next value can be represented as a function of the previous values
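This is the standard autoregressive factorization of the joint distribution (generic form, not copied from the slides):

p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})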

Slide 15


WaveNet

Source: DeepMind blog

Waveform is modeled by a stack of dilated causal convolutions

https://arxiv.org/abs/1609.03499

Input: text + previous amplitudes

Output: next amplitude (a distribution over quantized values)
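A minimal sketch of a dilated causal convolution stack in PyTorch (my simplification and naming, not the paper's architecture: the real WaveNet adds gated activations, skip connections, µ-law quantization and conditioning on text):

import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    # A 1-D convolution that only sees past samples: pad, then cut off the
    # extra positions so that output t depends only on inputs <= t.
    def __init__(self, channels, kernel_size, dilation):
        super().__init__(channels, channels, kernel_size,
                         padding=(kernel_size - 1) * dilation, dilation=dilation)

    def forward(self, x):
        return super().forward(x)[..., :x.shape[-1]]

class TinyWaveNet(nn.Module):
    # Stack of dilated causal convolutions; the dilation doubles with depth,
    # so the receptive field grows exponentially with the number of layers.
    def __init__(self, channels=64, n_layers=8, n_classes=256):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, 1)
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, kernel_size=2, dilation=2 ** i) for i in range(n_layers)]
        )
        self.out = nn.Conv1d(channels, n_classes, 1)  # logits over quantized amplitudes

    def forward(self, x):                  # x: (batch, 1, time), amplitudes in [-1, 1]
        h = self.embed(x)
        for layer in self.layers:
            h = h + torch.relu(layer(h))   # residual connection around each layer
        return self.out(h)                 # (batch, n_classes, time)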

Slide 16


WaveNet

Training: maximize the probability estimated by the model according to the

maximum likelihood principle. Can be done in parallel for all time steps:

Generation: sequentially generate samples one by one, sampling from a predicted distribution on every time step
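A toy sketch of that sequential sampling loop (assuming a model like the hypothetical TinyWaveNet above that returns logits over 256 quantized amplitude levels; real implementations cache convolution states instead of re-running the whole network at every step):

import torch

@torch.no_grad()
def generate(model, n_samples):
    # Naive autoregressive sampling: every new sample is drawn from the
    # distribution predicted from all previously generated samples.
    x = torch.zeros(1, 1, 1)                               # seed with one sample of silence
    for _ in range(n_samples):
        logits = model(x)[:, :, -1]                        # prediction for the next time step
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs, num_samples=1)      # sample a quantized amplitude class
        amp = idx.float() / (probs.shape[-1] - 1) * 2 - 1  # map the class index back to [-1, 1]
        x = torch.cat([x, amp.view(1, 1, 1)], dim=-1)
    return x.squeeze()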

Slide 17


Data scientists when their model is training

Slide 18


Deep learning engineers when their WaveNet is generating

Slide 19


Autoencoders

Encoder

Decoder

Bottleneck

High-level abstract features

Low-level features

Goal: reconstruct the input
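A toy PyTorch sketch of the idea (names and sizes are mine, purely illustrative):

import torch.nn as nn

class AutoEncoder(nn.Module):
    # The narrow bottleneck forces the encoder to keep only high-level abstract
    # features; the decoder must reconstruct the low-level input from them.
    def __init__(self, input_dim=1024, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes a reconstruction loss, e.g. nn.MSELoss()(model(x), x)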

Slide 20


Variational autoencoder

Latent space

Learns an approximation q(z | x) to the true posterior p(z | x) over the latent variables

Condition (text)

Slide 21


Variational autoencoder: sampling

Typically a normal distribution

By tweaking the latent variables, we can control prosody, tempo, accent and much more

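A hedged sketch of what sampling at synthesis time could look like (decoder, text_embedding and the concatenation-based conditioning are hypothetical placeholders, not names from the slides):

import torch

def synthesize(decoder, text_embedding, latent_dim=32):
    # Draw a latent vector from the prior (typically a standard normal) and decode
    # it together with the text condition. Re-sampling or tweaking z changes
    # prosody, tempo, accent, etc., while the text content stays fixed.
    z = torch.randn(1, latent_dim)
    return decoder(torch.cat([z, text_embedding], dim=-1))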

Slide 22


Variational autoencoder: latent space

Source: https://blog.fastforwardlabs.com/2016/08/12/introducing-variational-autoencoders-in-prose-and.html

Slide 23


Upgrade: VQ-VAE

Now the latent space is discrete and represented by an autoregressive model

https://arxiv.org/abs/1711.00937
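A sketch of the vector-quantization step only (assuming PyTorch; the codebook and commitment loss terms from the paper are omitted):

import torch

def quantize(z_e, codebook):
    # z_e: encoder outputs (batch, n_vectors, dim); codebook: (codebook_size, dim).
    # Each continuous vector is replaced by its nearest codebook entry, so the
    # latent representation becomes a grid of discrete indices.
    batch = z_e.shape[0]
    dists = torch.cdist(z_e, codebook.unsqueeze(0).expand(batch, -1, -1))
    indices = dists.argmin(dim=-1)                 # discrete latent codes
    z_q = codebook[indices]                        # (batch, n_vectors, dim)
    # Straight-through estimator: gradients flow to the encoder as if z_q were z_e
    return z_e + (z_q - z_e).detach(), indices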

Slide 24


Normalizing flows

Take a random variable z with distribution p(z), and apply some invertible mapping f: x = f(z)
Slide 25


Normalizing flows


Recall the change of variables rule:
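For a scalar variable the rule reads (standard form; the slide's own formula was an image):

p_X(x) = p_Z(f^{-1}(x)) \left| \frac{d f^{-1}(x)}{d x} \right|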

Slide 26


The change of variables rule

For multidimensional random variables, replace the derivative with the Jacobian (a matrix of derivatives)
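In standard notation (a reconstruction of the formula the slide most likely showed):

p_X(x) = p_Z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|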
Slide 27


General case (multiple transforms)

Can be optimized directly, e.g. with stochastic gradient ascent.

The chained sequence of invertible transforms is called a flow.
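For a chain of K invertible transforms z_0 \to z_1 \to \dots \to z_K = x with z_k = f_k(z_{k-1}), the log-likelihood becomes a sum of log-determinant terms (standard form, not copied from the slides):

\log p_X(x) = \log p_Z(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k(z_{k-1})}{\partial z_{k-1}} \right|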

Slide 28

[Diagram: text → waveform]

Slide 29


Key idea: represent WaveNet with a normalizing flow

This approach is called Inverse Autoregressive Flow
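In generic form (my notation, not the slides'), an inverse autoregressive flow maps white noise z to a waveform x via

x_t = z_t \cdot \sigma_t(z_{1:t-1}) + \mu_t(z_{1:t-1})

so, once the noise is drawn, every x_t can be computed in parallel, which is what makes generation fast; evaluating the density of a given waveform (and hence likelihood-based training) becomes the sequential part.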
Slide 30


[Diagram: white noise + text are transformed into a waveform]

https://deepmind.com/blog/article/high-fidelity-speech-synthesis-wavenet

Slide 31


Parallel WaveNet: the voice of Google Assistant

https://arxiv.org/abs/1711.10433

WaveNet: fast training, slow generation
Parallel WaveNet: slow training, fast generation
Slide 32


https://arxiv.org/abs/1609.03499 - WaveNet
https://arxiv.org/abs/1312.6114 - Variational Autoencoder
https://arxiv.org/abs/1711.00937 - VQ-VAE
https://arxiv.org/abs/1711.10433 - Parallel WaveNet
https://deepmind.com/blog/article/wavenet-generative-model-raw-audio - DeepMind’s blogpost on WaveNet
https://deepmind.com/blog/article/high-fidelity-speech-synthesis-wavenet - DeepMind’s blogpost on Parallel WaveNet
https://avdnoord.github.io/homepage/vqvae/ - VQ-VAE explanation from the author
https://deepgenerativemodels.github.io/notes/autoregressive/ - a good tutorial on deep autoregressive models
https://blog.evjang.com/2018/01/nf1.html - a nice intro to normalizing flows
https://medium.com/@kion.kim/wavenet-a-network-good-to-know-7caaae735435 - introductory blogpost on WaveNet
http://anotherdatum.com/vae.html - a good explanation of principles and math behind VAE