Toward a high-quality singing synthesizer with vocal texture control

Slides: 46

Provided by: vick96

Learn more at: https://ccrma.stanford.edu

Transcript and Presenter's Notes

1
Toward a high-quality singing synthesizer with vocal texture control

Hui-Ling Lu
Center for Computer Research in Music and Acoustics (CCRMA)
Stanford University, Stanford, CA 94305, USA

2
Score-to-Singing system
[Block diagram; labels as recovered from the slide]
Inputs: Score, Lyrics, Singing style
Rule system: Lyrics-to-phoneme, Musical rules
Rule system output: Phoneme, F0, Sound level, Duration, Vibrato
Sound synthesis: Parametric Database, Co-articulation rules, Acoustic rendering
Output: Singing voice

3
General sound synthesis approaches

Physical Modeling
  Pros: flexible/intuitive control; expressive; co-articulation easy
  Cons: analysis/re-synthesis difficult; invasive measurements

Source-filter Model / Spectral Modeling
  Pros: analysis/re-synthesis easy
  Cons: less expressive; co-articulation difficult

4
Contributions
A pseudo-physical model for singing voice synthesis which

is an approximate physical model.
can generate high-quality non-nasal singing voice.
has analysis/re-synthesis ability.
is computationally affordable.
provides flexible control of vocal textures.

An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
5
Outline

Human voice production system
Synthesis model
Analysis procedure
Vocal texture parametric model
Vocal texture control demo
Contributions and future directions

6
The human voice production system
Nasal sound output
Nasal cavity
Velum
Oral sound output
Oral cavity
Pharyngeal cavity
Vocal folds
Tongue hump
Lungs
Muscle force
7
Oscillation pattern of the vocal folds
Opening period
Closing period
Close phase
Open phase

The oscillation results from the balancing of the subglottal pressure, the Bernoulli pressure and the elastic restoring force of the vocal folds.

Prephonatory position: the initial configuration of the vocal folds before the beginning of oscillation.

8
(No Transcript)
9
Variation of vocal textures
Pressed
Normal
Breathy
10
Simplified human voice production model

Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.
The simplified model neglects the source-tract interaction, since the glottal impedance is very high most of the time.

11
Source-filter type synthesis model
Glottal Source
Vocal Tract Filter
Radiation
Aspiration noise
12
Overview of the proposed synthesis model
Glottal excitation
Filter
Derivative glottal wave
Voice output
All Pole Filter
Transformed Liljencrants-Fant Model
Noise Residual Model
High-passed aspiration noise
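
The signal flow on this slide (glottal excitation into an all-pole filter) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names and the coefficient convention (denominator 1 + a1 z^-1 + ... + aN z^-N) are assumptions made for the example.

```python
def all_pole_filter(excitation, a):
    """Run y[n] = x[n] - sum_k a[k-1] * y[n-k] over the excitation.

    `a` holds a1..aN of the assumed denominator 1 + a1 z^-1 + ... + aN z^-N.
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def synthesize(derivative_glottal_wave, aspiration_noise, a):
    """Sum the two excitation components, then apply the all-pole filter."""
    excitation = [g + v for g, v in zip(derivative_glottal_wave, aspiration_noise)]
    return all_pole_filter(excitation, a)
```

Because the excitation is already the derivative glottal wave, no separate radiation stage appears in the sketch.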
13
Transformed Liljencrants-Fant (LF) model

The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).

14
Transformed Liljencrants-Fant (LF) model

The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.

Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping-1
Rd
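
The mapping from Rd to direct-synthesis timing parameters can be sketched as below. The slide does not give the mapping, so the regressions used here are an assumption: they are the ones commonly quoted from Fant's transformed LF model, recalled from the literature, and should be checked against the original source before use. The function names are hypothetical.

```python
def rd_to_r_params(rd):
    """Map Rd to the LF shape parameters (Ra, Rk, Rg).

    Assumed regressions (commonly quoted for the transformed LF model):
      Ra = (-1 + 4.8 Rd) / 100
      Rk = (22.4 + 11.8 Rd) / 100
      Rd = (1 / 0.11) (0.5 + 1.2 Rk) (Rk / (4 Rg) + Ra), solved for Rg
    """
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * ((0.11 * rd) / (0.5 + 1.2 * rk) - ra))
    return ra, rk, rg


def r_params_to_timing(ra, rk, rg, f0):
    """Convert (Ra, Rk, Rg) to LF timing parameters in seconds.

    Uses the usual LF definitions Rg = T0 / (2 Tp), Rk = (Te - Tp) / Tp,
    Ra = Ta / T0.
    """
    t0 = 1.0 / f0
    tp = t0 / (2.0 * rg)
    te = tp * (1.0 + rk)
    ta = ra * t0
    return tp, te, ta
```

For example, `r_params_to_timing(*rd_to_r_params(1.0), 110.0)` gives the timing of one period at F0 = 110 Hz; smaller Rd values correspond to more pressed wave shapes under these regressions.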
15

Liljencrants-Fant (LF) model
16
Transformed Liljencrants-Fant (LF) model

The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.

Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping-1
Rd
17

Noise residual model
Noise floor
Bn
Noise residual
Gaussian Noise Generator
Amplitude Modulation

An
GCI
L
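
A rough sketch of the noise residual model named on this slide: Gaussian noise, amplitude-modulated in sync with the glottal closure instant (GCI). The slide only names the blocks, so the envelope used here (noise floor Bn plus An times a raised-cosine window that starts L samples after each GCI) is an assumption, as are the `duty_cycle` parameter and the function name.

```python
import math
import random


def noise_residual(n_samples, gci, period, an, bn, lag, duty_cycle=0.5, seed=0):
    """Return (noise, envelope) for pitch-synchronous modulated Gaussian noise.

    Assumed envelope: bn + an * w, where w is a raised-cosine window that
    starts `lag` samples after each GCI and lasts duty_cycle * period samples.
    """
    rng = random.Random(seed)
    width = max(2, int(duty_cycle * period))
    noise, envelope = [], []
    for n in range(n_samples):
        pos = (n - gci - lag) % period  # position within the current cycle
        if pos < width:
            w = 0.5 - 0.5 * math.cos(2.0 * math.pi * pos / width)
        else:
            w = 0.0
        env = bn + an * w
        envelope.append(env)
        noise.append(env * rng.gauss(0.0, 1.0))
    return noise, envelope
```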
18
Vocal tract filter

An all-pole filter.
The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.
Sound waves are assumed to propagate as plane waves along the axis of the vocal tract.

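[Figure: acoustic tube model running from the glottis to the lip end, with section areas A1, A2, ..., AN and lip area Alip; volume velocities Ug (glottis) and Ulip (lips); boundary branch labels as extracted: 1-kN, -kN, -1. The per-section delay symbol was lost in extraction.]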
19
Vocal tract filter
Kelly-Lochbaum junction
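[Figure: Kelly-Lochbaum scattering junction between tube sections m and m+1 (areas Am and Am+1). It relates the forward and backward waves Um+, Um- and Um+1+, Um+1- through branch gains built from the scattering coefficient km (1+km, 1-km, km, -km). The defining formula for km in terms of Am and Am+1 was not transcribed.]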

τ: the propagation time for a sound wave to travel one acoustic tube.
N: the number of acoustic tubes, excluding the glottis and the lip end.

If the sampling period T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.
The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
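
The conversion between autoregressive coefficients and scattering (reflection) coefficients can be sketched with the step-down and step-up forms of the Levinson-Durbin recursion. This is an illustration under assumptions: the filter is taken as 1 / (1 + a1 z^-1 + ... + aN z^-N), and the sign of the resulting coefficients relative to the km of the Kelly-Lochbaum junction depends on a convention the transcript does not preserve.

```python
def ar_to_reflection(a):
    """Step-down recursion: [a1..aN] -> [k1..kN].

    Assumes the denominator 1 + a1 z^-1 + ... + aN z^-N.
    """
    cur = list(a)
    ks = []
    for m in range(len(cur), 0, -1):
        k = cur[m - 1]  # the last coefficient of order m is k_m
        if abs(k) >= 1.0:
            raise ValueError("filter is not stable; |k| must be < 1")
        ks.append(k)
        # Recover the order m-1 coefficients from the order m ones.
        cur = [(cur[i] - k * cur[m - 2 - i]) / (1.0 - k * k) for i in range(m - 1)]
    ks.reverse()
    return ks


def reflection_to_ar(ks):
    """Step-up recursion: [k1..kN] -> [a1..aN] (inverse of ar_to_reflection)."""
    cur = []
    for m, k in enumerate(ks, start=1):
        cur = [cur[i] + k * cur[m - 2 - i] for i in range(m - 1)] + [k]
    return cur
```

A round trip such as `ar_to_reflection(reflection_to_ar([0.5, -0.3, 0.2]))` returns the original coefficients up to rounding, and the `|k| < 1` check doubles as a stability test.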

20
Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd

0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
(No noise input)
21
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
22
Source-filter de-convolution

Synthesis model for analysis

KLGLOTT88 (KL) derivative glottal wave
Basic Voicing Waveform (a, b, OQ)
23
(No Transcript)
24
Source-filter de-convolution

Synthesis model for analysis

KLGLOTT88 (KL) derivative glottal wave
Nth order All pole vocal tract filter

Basic Voicing Waveform (a, b, OQ)
Low-pass filter
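
A sketch of the KLGLOTT88 basic voicing waveform, parameterized by (a, b, OQ) as on the slide. The cubic flow g(t) = a t^2 - b t^3 during the open phase, and the expressions for a and b in terms of an amplitude AV, are the commonly cited KLGLOTT88 form and are stated here as an assumption; the low-pass filter stage is omitted.

```python
def klglott88_derivative(f0, fs, oq, av=1.0):
    """Return (a, b, wave): one period of the derivative glottal wave.

    Assumed model: flow g(t) = a t^2 - b t^3 for 0 <= t <= OQ * T0, zero
    afterwards, with a = 27 AV / (4 OQ^2 T0) and b = 27 AV / (4 OQ^3 T0^2).
    The derivative is g'(t) = 2 a t - 3 b t^2.
    """
    t0 = 1.0 / f0
    a = 27.0 * av / (4.0 * oq ** 2 * t0)
    b = 27.0 * av / (4.0 * oq ** 3 * t0 ** 2)
    n_period = int(round(fs / f0))
    wave = []
    for n in range(n_period):
        t = n / fs
        if t <= oq * t0:
            wave.append(2.0 * a * t - 3.0 * b * t * t)  # open phase
        else:
            wave.append(0.0)  # closed phase
    return a, b, wave
```

With these expressions a = b * OQ * T0, so the flow returns to zero exactly at the end of the open phase. That linear tie between a and b is what makes the waveform convenient for the estimation step.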
25
Source-filter deconvolution estimation flowchart
Voice signal after removing the low-frequency drift
GCI detection
Phase I
One glottal period signal
Loop for each period
  Loop over different OQ values
    Vocal tract filter and glottal source estimation via SUMT
  End
  Select and store 5 best estimates
Loop for each period
  Enforce continuity constraints via Dynamic Programming
End
Phase II
Smoothing the vocal tract area by time averaging and linear interpolation
Estimated model parameter sequence
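
The dynamic-programming step, which picks one of the stored estimates per period so that the sequence is continuous, can be sketched as a Viterbi-style search. The cost terms are assumptions: each candidate carries a local fitting error, and the continuity penalty is taken to be the squared distance between consecutive parameter vectors. The slides do not say which continuity measure is used.

```python
def select_smooth_path(candidates, local_costs, transition_weight=1.0):
    """Choose one candidate index per period, minimizing total cost.

    candidates[t][i]  : parameter vector of candidate i in period t
    local_costs[t][i] : fitting error of that candidate
    Transition cost (assumed): weighted squared distance between vectors.
    """
    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    cost = [list(local_costs[0])]
    back = []
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for j, cj in enumerate(candidates[t]):
            options = [cost[t - 1][i] + transition_weight * dist(ci, cj)
                       for i, ci in enumerate(candidates[t - 1])]
            best_i = min(range(len(options)), key=options.__getitem__)
            row.append(options[best_i] + local_costs[t][j])
            ptr.append(best_i)
        cost.append(row)
        back.append(ptr)

    # Trace the cheapest path backwards from the last period.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path
```

The search can prefer a candidate with slightly worse local error when it avoids a large jump between neighboring periods, which is the purpose of the continuity constraint.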
26
Convex optimization formulation
Inverse filter

Estimate the vocal tract filter and glottal source parameters by minimizing the error between the basic voicing waveform and the estimated one.
27
Convex optimization formulation

Error for one glottal cycle in vector form,

A convex optimization problem
Minimize
Subject to

L2 norm is used

The above problem can be solved by SUMT (sequential unconstrained minimization technique).
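
To show what SUMT does, here is a small log-barrier sketch for a generic problem of the same type: minimize ||A x - b||^2 (L2 norm) subject to linear inequalities G x <= h. The matrices are placeholders, not the slide's actual error vector and constraints, which were not transcribed. The barrier form, the Newton inner loop, and all names are assumptions chosen for illustration.

```python
import math


def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def sumt_least_squares(A, b, G, h, x0, t0=1.0, mu=10.0, outer=6, inner=50):
    """Minimize ||A x - b||^2 s.t. G x <= h via a sequence of barrier problems.

    Each outer pass minimizes t * ||A x - b||^2 - sum(log(h - G x)) with
    damped Newton steps, then multiplies t by mu.
    """
    n = len(x0)
    x = [float(v) for v in x0]

    def slack(z):
        return [hi - sum(g * zj for g, zj in zip(row, z)) for row, hi in zip(G, h)]

    def resid(z):
        return [sum(a * zj for a, zj in zip(row, z)) - bi for row, bi in zip(A, b)]

    def phi(z, t):
        s = slack(z)
        if min(s) <= 0.0:
            return math.inf  # outside the feasible region
        return t * sum(r * r for r in resid(z)) - sum(math.log(si) for si in s)

    if min(slack(x)) <= 0.0:
        raise ValueError("x0 must be strictly feasible")
    t = t0
    for _ in range(outer):
        for _ in range(inner):
            s, r = slack(x), resid(x)
            grad = [2.0 * t * sum(A[i][j] * r[i] for i in range(len(A)))
                    + sum(G[i][j] / s[i] for i in range(len(G))) for j in range(n)]
            hess = [[2.0 * t * sum(A[i][j] * A[i][k] for i in range(len(A)))
                     + sum(G[i][j] * G[i][k] / s[i] ** 2 for i in range(len(G)))
                     for k in range(n)] for j in range(n)]
            step = solve_linear(hess, [-g for g in grad])
            slope = sum(g * d for g, d in zip(grad, step))
            if -slope < 1e-12:
                break  # Newton decrement is negligible
            alpha, f0 = 1.0, phi(x, t)
            # Backtracking keeps the iterate strictly feasible.
            while alpha > 1e-12 and phi([xi + alpha * d for xi, d in zip(x, step)], t) > f0 + 1e-4 * alpha * slope:
                alpha *= 0.5
            if alpha <= 1e-12:
                break
            x = [xi + alpha * d for xi, d in zip(x, step)]
        t *= mu
    return x
```

For instance, minimizing (x - 2)^2 subject to x <= 1 from the feasible start x = 0 drives the iterate toward the constraint boundary at x = 1 as t grows.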
28
De-convolution result (synthetic data)
29
Effective analysis/re-synthesis
Baritone examples

Normal phonation

original
KLGLOTT88

Pressed phonation

original
KLGLOTT88
KLGLOTT88 (KL) derivative glottal wave
30
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
31
De-noising by Wavelet Packet Analysis
De-noising by best basis thresholding

A noisy data record: X = f + W

Transform the noisy data to another basis via Wavelet Packet Analysis: X_B = f_B + W_B

Threshold out the smaller coefficients of X_B, assuming that f can be compactly represented in the new basis by a few large coefficients.

Select the wavelet filter by an energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
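
The compactness criterion and the thresholding step can be sketched on a plain list of transform coefficients. The wavelet packet transform itself is not implemented here: the coefficients are assumed to come from whichever transform is being evaluated. Hard thresholding is also an assumption, since the slide does not say which thresholding rule is used.

```python
def energy_compactness(coeffs, fraction=0.9):
    """Return 1 / (number of largest coefficients holding `fraction` of energy).

    A larger score means the signal is represented more compactly.
    """
    energies = sorted((c * c for c in coeffs), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    acc = 0.0
    for count, e in enumerate(energies, start=1):
        acc += e
        if acc >= fraction * total:
            return 1.0 / count
    return 1.0 / len(energies)


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coeffs]
```

Comparing `energy_compactness` across candidate wavelet filters and keeping the one with the highest score mirrors the selection rule on the slide.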

32
De-noising result (synthetic data)
33
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
34
Effective analysis/re-synthesis
Baritone examples

Normal phonation

original
LF

Pressed phonation

original
LF
35
Vocal texture control

The parametric vocal texture control model determines the parameterization of the glottal excitation needed to achieve the desired vocal texture.
Control complexity is reduced by exploiting the correlations between the model parameters.

Wave shape control parameter
Desired vocal texture
Non-breathy mode
Transformed LF model
?
Rd
Glottal excitation strength Ee
Rd
breathy mode
Noise residual model
?
36
Vocal texture control (non-breathy mode)
Pressed and normal modes: the wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.
37
(No Transcript)
38
(No Transcript)
39
Vocal texture control (non-breathy mode)
Degree of pressness
interpolation
(a_press, b_press, c_press) (a_normal, b_normal, c_normal)
Wave shape control parameter
(a, b, c)
Glottal excitation
Glottal excitation strength Ee
Transformed LF model
Rd
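
The interpolation block on this slide can be sketched as a blend of the two coefficient triples driven by the degree of pressness. Linear interpolation on a 0-to-1 scale is an assumption. How (a, b, c) then relate Rd and Ee is not recoverable from the transcript, so the sketch stops at the interpolated triple.

```python
def interpolate_coeffs(pressness, press, normal):
    """Blend (a, b, c) between the normal (0.0) and pressed (1.0) triples.

    Linear interpolation is an assumption; the slide only says "interpolation".
    """
    if not 0.0 <= pressness <= 1.0:
        raise ValueError("pressness must lie in [0, 1]")
    return tuple(pressness * p + (1.0 - pressness) * n
                 for p, n in zip(press, normal))
```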
40
Vocal texture control (breathy mode)
NHR per glottal cycle = (high-passed noise energy) / (glottal excitation strength Ee)

NHR is an indicator for the degree of breathiness.
The contour of the noise strength is adjusted by NHR.

Glottal excitation
Desired vocal texture
Transformed LF model
NHR

Rd
Ee
Bn1
gain
Noise residual model
An 2.4138 Bn 0.213
duty cycle
window lag
41
Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd
Glottal excitation

0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
42
Vocal texture control demo
43
Contributions
A pseudo-physical model for singing voice
synthesis which