Title: Toward a high-quality singing synthesizer with vocal texture control
1Toward a high-quality singing synthesizer with
vocal texture control
- Hui-Ling Lu
- Center for Computer Research in Music and Acoustics (CCRMA)
- Stanford University, Stanford, CA 94305, USA
2Score-to-Singing system
Parametric Database
Phoneme
F0 Sound level Duration Vibrato
Score Lyrics Singing style
Singing voice
Rule system
Sound synthesis
- Acoustic rendering
- Co-articulation rules
- Lyrics-to-phoneme
- Musical rules
3General sound synthesis approaches
Physical Modeling
- Pros: flexible/intuitive control; expressive; co-articulation easy
- Cons: analysis/re-synthesis difficult; invasive measurements
Source-filter Model and Spectral Modeling
- Pros: analysis/re-synthesis easy
- Cons: less expressive; co-articulation difficult
4Contributions
A pseudo-physical model for singing voice
synthesis which
- is an approximate physical model.
- can generate high-quality non-nasal singing voice.
- has analysis/re-synthesis ability.
- is computationally affordable.
- provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
5Outline
- Human voice production system
- Synthesis model
- Analysis procedure
- Vocal texture parametric model
- Vocal texture control demo
- Contributions and future directions
6The human voice production system
Nasal sound output
Nasal cavity
Velum
Oral sound output
Oral cavity
Pharyngeal cavity
Vocal folds
Tongue hump
Lungs
Muscle force
7Oscillation pattern of the vocal folds
Opening period
Closing period
Close phase
Open phase
- The oscillation results from the balance of the subglottal pressure, the Bernoulli pressure, and the elastic restoring force of the vocal folds.
- Prephonatory position: the initial configuration of the vocal folds before the beginning of oscillation.
9Variation of vocal textures
Pressed
Normal
Breathy
10Simplified human voice production model
- Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.
- Neglect the source-tract interaction, since the glottal impedance is very high most of the time.
11Source-filter type synthesis model
Glottal Source
Vocal Tract Filter
Radiation
Aspiration noise
12Overview of the proposed synthesis model
Glottal excitation
Filter
Derivative glottal wave
Voice output
All Pole Filter
Transformed Liljencrants-Fant Model
Noise Residual Model
High-passed aspiration noise
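The signal flow on this slide (derivative glottal wave plus high-passed aspiration noise, driving an all-pole filter) can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `all_pole_filter` and `synthesize`, and the convention that the filter is 1/A(z) with A(z) = 1 + a1 z^-1 + ... + aN z^-N, are assumptions made here.

```python
def all_pole_filter(excitation, a):
    """Run an excitation through 1/A(z), with a = [1, a1, ..., aN] (assumed convention).

    Implements y[n] = x[n] - sum_{k=1..N} a[k] * y[n-k].
    """
    order = len(a) - 1
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def synthesize(derivative_glottal_wave, aspiration_noise, a):
    """Sum the two excitation components and filter them (hypothetical helper)."""
    excitation = [g + w for g, w in zip(derivative_glottal_wave, aspiration_noise)]
    return all_pole_filter(excitation, a)


# Impulse through a one-pole filter with a pole at 0.5
print(synthesize([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, -0.5]))
```

Because the excitation is already the derivative glottal wave, no separate radiation stage appears in this sketch.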
13Transformed Liljencrants-Fant (LF) model
- The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).
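The slide does not spell out how Rd is turned into LF timing parameters. The sketch below assumes the regression equations commonly attributed to Fant's transformed LF model (Ra, Rk predicted linearly from Rd, and Rg solved from the definition of Rd); the coefficients are quoted from memory of that literature and should be checked against the original before use. The function names are hypothetical.

```python
def rd_to_lf_timing(rd, f0):
    """Map Rd to LF timing parameters (assumed regressions, see lead-in)."""
    t0 = 1.0 / f0                      # fundamental period in seconds
    ra = (-1.0 + 4.8 * rd) / 100.0     # assumed prediction of Ra
    rk = (22.4 + 11.8 * rd) / 100.0    # assumed prediction of Rk
    # Rg follows from Rd = (1/0.11) * (0.5 + 1.2*Rk) * (Rk/(4*Rg) + Ra)
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    tp = t0 / (2.0 * rg)               # instant of peak glottal flow
    te = tp * (1.0 + rk)               # instant of maximum excitation
    ta = ra * t0                       # effective duration of the return phase
    return {"Ra": ra, "Rk": rk, "Rg": rg, "tp": tp, "te": te, "ta": ta, "T0": t0}


def rd_from_r_params(ra, rk, rg):
    """Inverse direction: recover Rd from (Ra, Rk, Rg) under the same assumption."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)


params = rd_to_lf_timing(1.0, 100.0)
print(params["tp"], params["te"], params["ta"])
```

The second function plays the role of the inverse mapping used on the analysis side: once an LF fit yields timing parameters, a single Rd value can be recovered from them.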
14Transformed Liljencrants-Fant (LF) model
- The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.
Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping^-1
Rd
15 Liljencrants-Fant (LF) model
16Transformed Liljencrants-Fant (LF) model
- The transformed LF model is an extension of the LF model. It provides a control interface for the LF model to change the wave shape of the derivative glottal wave easily.
Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping^-1
Rd
17 Noise residual model
Noise floor
Bn
Noise residual
Gaussian Noise Generator
Amplitude Modulation
An
GCI
L
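The labels on this slide suggest Gaussian noise whose amplitude is modulated once per glottal cycle, with a noise floor Bn, a modulation amplitude An, and a window lag L relative to each GCI. A minimal sketch under those assumptions follows; the Hann-shaped window, the `duty` parameter (the fraction of the period the window occupies), and the function name are illustrative choices, not details taken from the slides.

```python
import math
import random


def noise_residual(n_samples, gci_positions, period, an, bn, lag, duty=0.5, seed=0):
    """Amplitude-modulated Gaussian noise, pitch-synchronous with the GCIs.

    Returns (noise, envelope). The envelope equals the floor bn everywhere and
    rises to about bn + an inside a Hann window placed `lag` samples after each GCI.
    """
    rng = random.Random(seed)
    width = max(1, int(duty * period))
    envelope = [bn] * n_samples
    for gci in gci_positions:
        start = gci + lag
        for i in range(width):
            n = start + i
            if 0 <= n < n_samples:
                w = 0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / width)
                envelope[n] = bn + an * w
    noise = [rng.gauss(0.0, 1.0) * e for e in envelope]
    return noise, envelope


noise, env = noise_residual(400, [0, 100, 200, 300], 100, an=2.0, bn=0.5, lag=10)
print(min(env), max(env))
```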
18Vocal tract filter
- An all-pole filter.
- The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.
- Assume that sound waves obey planar propagation along the axis of the vocal tract.
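[Figure: acoustic tube model of the vocal tract from the glottis to the lip end, with section areas A1, A2, ..., AN and Alip, glottal flow Ug, lip flow Ulip, and boundary gains 1-kN, -kN, and -1]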
19Vocal tract filter
Kelly-Lochbaum junction
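[Figure: Kelly-Lochbaum scattering junction between tube sections m and m+1, with areas Am and Am+1, forward and backward volume velocities Um and Um+1, and branch gains 1+km, 1-km, km, and -km, where km is the scattering coefficient]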
- τ: the propagation time for a sound wave to travel one acoustic tube.
- N: the number of acoustic tubes, excluding the glottis and the lip end.
- If the sampling period T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.
- The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
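The conversion between autoregressive coefficients and scattering (reflection) coefficients can be illustrated with the standard step-down and step-up recursions of the Levinson-Durbin family. This is a sketch under an assumed convention, A(z) = 1 + a1 z^-1 + ... + aN z^-N; the sign of the resulting coefficients relative to the km of the Kelly-Lochbaum junction depends on the convention chosen, and the function names are hypothetical.

```python
def ar_to_reflection(a):
    """Step-down recursion: a = [1, a1, ..., aN] -> [k1, ..., kN]."""
    cur = list(a[1:])
    ks = []
    for m in range(len(cur), 0, -1):
        k = cur[m - 1]                      # k_m is the last coefficient of order m
        if abs(k) >= 1.0:
            raise ValueError("filter is not stable: |k| >= 1")
        ks.append(k)
        # Coefficients of the order m-1 polynomial
        cur = [(cur[i] - k * cur[m - 2 - i]) / (1.0 - k * k) for i in range(m - 1)]
    ks.reverse()
    return ks


def reflection_to_ar(ks):
    """Step-up recursion: [k1, ..., kN] -> [1, a1, ..., aN]."""
    cur = []
    for k in ks:
        m = len(cur)
        cur = [cur[i] + k * cur[m - 1 - i] for i in range(m)] + [k]
    return [1.0] + cur


a = reflection_to_ar([0.5, -0.3, 0.2])
print(a)
print(ar_to_reflection(a))
```

The round trip recovers the original reflection coefficients, which is a convenient sanity check when smoothing tube areas and converting back to filter coefficients.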
20Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd
0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
(No noise input)
21Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
22Source-filter de-convolution
- Synthesis model for analysis
KLGLOTT88 (KL) derivative glottal wave
Basic Voicing Waveform (a, b, OQ)
24Source-filter de-convolution
- Synthesis model for analysis
KLGLOTT88 (KL) derivative glottal wave
Nth order All pole vocal tract filter
Basic Voicing Waveform (a, b, OQ)
Low-pass filter
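As a rough illustration of the basic voicing waveform, the KLGLOTT88 glottal flow during the open phase is the polynomial g(t) = a t^2 - b t^3, so its derivative is 2 a t - 3 b t^2, and the flow returns to zero at t = OQ * T0 when b = a / (OQ * T0). The sketch below generates one period of that derivative; it omits the low-pass (spectral tilt) filter, and the function name and sampling details are assumptions.

```python
def klglott88_derivative(a, oq, f0, fs):
    """One period of the KLGLOTT88 derivative glottal wave (no tilt filter).

    a  : amplitude coefficient of the t^2 term
    oq : open quotient, 0 < oq <= 1
    f0 : fundamental frequency in Hz
    fs : sampling rate in Hz
    """
    t0 = 1.0 / f0
    b = a / (oq * t0)              # closure condition: g(oq * t0) = 0
    n_period = int(round(fs / f0))
    wave = []
    for n in range(n_period):
        t = n / fs
        if t < oq * t0:
            wave.append(2.0 * a * t - 3.0 * b * t * t)   # open phase
        else:
            wave.append(0.0)                             # closed phase
    return wave


period = klglott88_derivative(a=1.0, oq=0.6, f0=100.0, fs=44100.0)
print(len(period), min(period), max(period))
```

Tying b to a through the closure condition leaves (a, OQ) as the free shape parameters for a given period, which is one way to read the (a, b, OQ) parameterization on the slide.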
25Source-filter deconvolution estimation flowchart
Voice signal after removing the low-frequency drift
GCI detection
Phase I
One glottal period signal
- Loop for each period
  - Loop over different OQ values: vocal tract filter and glottal source estimation via SUMT
  - End
  - Select and store the 5 best estimates
Phase II
- Loop for each period: enforce continuity constraints via Dynamic Programming
- End
Smoothing the vocal tract area by time averaging and linear interpolation
Estimated model parameter sequence
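The second phase of the flowchart, choosing one of the stored candidates per period so that the parameters evolve smoothly, can be sketched as a Viterbi-style dynamic program. The cost used here (fitting error plus a weighted squared distance between consecutive parameter vectors) is an assumed stand-in; the slides do not state the actual continuity constraint, and the function name is hypothetical.

```python
def select_smooth_path(candidates, fit_errors, continuity_weight=1.0):
    """Pick one candidate index per period by dynamic programming.

    candidates[p][c] : parameter vector of candidate c in period p
    fit_errors[p][c] : fitting error of that candidate
    """
    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    n_periods = len(candidates)
    cost = [list(fit_errors[0])]
    back = [[-1] * len(candidates[0])]
    for p in range(1, n_periods):
        row_cost, row_back = [], []
        for c, cand in enumerate(candidates[p]):
            best_prev, best_val = 0, float("inf")
            for q, prev in enumerate(candidates[p - 1]):
                val = cost[p - 1][q] + continuity_weight * dist(prev, cand)
                if val < best_val:
                    best_prev, best_val = q, val
            row_cost.append(best_val + fit_errors[p][c])
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest final state
    c = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [c]
    for p in range(n_periods - 1, 0, -1):
        c = back[p][c]
        path.append(c)
    path.reverse()
    return path


cands = [[[0.0], [5.0]], [[0.1], [5.0]], [[0.2], [9.0]]]
errs = [[0.1, 0.0], [0.1, 0.0], [0.1, 0.0]]
print(select_smooth_path(cands, errs))
```

In this toy input the second candidate always fits slightly better, but it jumps in the last period, so the smooth path through the first candidates wins once continuity is penalized.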
26Convex optimization formulation
The inverse filter is estimated by minimizing the error between the basic voicing waveform and the estimated one.
27Convex optimization formulation
- Error for one glottal cycle in vector form,
A convex optimization problem
Minimize
Subject to
The above problem can be solved by SUMT
(sequential unconstrained minimization
technique).
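The objective and constraints of this slide were not transcribed, so the following only illustrates the general SUMT idea on a toy problem chosen here: minimize (x - 2)^2 subject to x <= 1. The constraint is folded into the objective as a logarithmic barrier, each barrier problem is solved by Newton's method, and the barrier weight is then reduced. All names and numeric settings are assumptions.

```python
def sumt_minimize(x0=0.0, mu=1.0, shrink=0.1, outer_iters=8, newton_iters=50):
    """Log-barrier SUMT for: minimize (x - 2)^2 subject to x <= 1.

    Each outer iteration minimizes (x - 2)^2 - mu * log(1 - x) and then
    shrinks mu, so the iterates approach the constrained optimum x = 1
    from inside the feasible region.
    """
    x = x0
    for _ in range(outer_iters):
        for _ in range(newton_iters):
            grad = 2.0 * (x - 2.0) + mu / (1.0 - x)
            hess = 2.0 + mu / (1.0 - x) ** 2
            step = -grad / hess
            while x + step >= 1.0:      # stay strictly feasible
                step *= 0.5
            x += step
            if abs(step) < 1e-12:
                break
        mu *= shrink
    return x


print(sumt_minimize())
```

The same pattern extends to vector problems by replacing the scalar Newton step with a linear solve against the Hessian of the barrier-augmented objective.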
28De-convolution result (synthetic data)
29Effective analysis/re-synthesis
Baritone examples
original
KLGLOTT88
original
KLGLOTT88
KLGLOTT88 (KL) derivative glottal wave
30Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
31De-noising by Wavelet Packet Analysis
De-noising by best basis thresholding
- A noisy data record: X = f + W
- Transform the noisy data to another basis via Wavelet Packet Analysis: X_B = f_B + W_B
- Threshold out the smaller coefficients of X_B, assuming that f can be compactly represented in the new basis by a few large coefficients.
- Select the wavelet filter by the energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
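The thresholding idea and the energy compactness criterion can be shown with a much simpler transform than the one on the slide. The sketch below swaps wavelet packet best-basis analysis for a plain full-depth Haar wavelet transform, purely to keep the example self-contained; the function names and the hard-threshold rule are assumptions.

```python
import math


def haar_forward(x):
    """Full-depth orthonormal Haar transform; len(x) must be a power of two."""
    coeffs = list(x)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    x = list(coeffs)
    n = 1
    while n < len(x):
        merged = []
        for a, d in zip(x[:n], x[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        x[:2 * n] = merged
        n *= 2
    return x


def hard_threshold(coeffs, thr):
    """Zero every coefficient whose magnitude does not exceed thr."""
    return [c if abs(c) > thr else 0.0 for c in coeffs]


def energy_compactness(coeffs, fraction=0.9):
    """1 / (number of coefficients needed to reach `fraction` of the energy)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    total = sum(energies)
    acc = 0.0
    for count, e in enumerate(energies, start=1):
        acc += e
        if acc >= fraction * total:
            return 1.0 / count
    return 1.0 / len(energies)


f = [1.0] * 8 + [5.0] * 8                       # clean signal
noisy = [v + (0.1 if i % 2 == 0 else -0.1) for i, v in enumerate(f)]
denoised = haar_inverse(hard_threshold(haar_forward(noisy), 0.5))
print(energy_compactness(haar_forward(noisy)))
```

A higher compactness value means fewer coefficients carry most of the energy, which is the property the slide uses to pick among candidate wavelet filters.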
32De-noising result (synthetic data)
33Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
34Effective analysis/re-synthesis
Baritone examples
original
LF
original
LF
35Vocal texture control
- The parametric vocal texture control model determines the parameterizations of the glottal excitation to achieve the desired vocal texture.
- Reduce the control complexity by exploring the correlations between the model parameters.
Wave shape control parameter
Desired vocal texture
Non-breathy mode
Transformed LF model
Rd
Glottal excitation strength Ee
Rd
Breathy mode
Noise residual model
36Vocal texture control (non-breathy mode)
Pressed and normal modes: the wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.
39Vocal texture control (non-breathy mode)
Degree of pressness
interpolation
(a_press, b_press, c_press), (a_normal, b_normal, c_normal)
Wave shape control parameter
(a, b, c)
Glottal excitation
Glottal excitation strength Ee
Transformed LF model
Rd
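The interpolation step on this slide can be sketched as a blend between the two coefficient triples, driven by the degree of pressness. Linear interpolation and the [0, 1] range are assumptions made here; the slides do not state the interpolation rule, nor the function that (a, b, c) parameterize, and the numbers in the usage line are placeholders rather than measured values.

```python
def interpolate_texture_coeffs(pressed, normal, pressness):
    """Blend (a, b, c) triples; pressness = 1 gives pressed, 0 gives normal."""
    if not 0.0 <= pressness <= 1.0:
        raise ValueError("pressness must lie in [0, 1]")
    return tuple(pressness * p + (1.0 - pressness) * n
                 for p, n in zip(pressed, normal))


# Placeholder triples, for illustration only
print(interpolate_texture_coeffs((1.0, 2.0, 3.0), (3.0, 4.0, 5.0), 0.5))
```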
40Vocal texture control (breathy mode)
High-passed noise energy
Glottal excitation strength Ee
- NHR is an indicator for the degree of breathiness.
- The contour of the noise strength is adjusted by NHR.
Glottal excitation
Desired vocal texture
Transformed LF model
NHR
Rd
Ee
Bn1
gain
Noise residual model
An 2.4138 Bn 0.213
duty cycle
window lag
41Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd
Glottal excitation
0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
42Vocal texture control demo
43Contributions
A pseudo-physical model for singing voice
synthesis which
- is an approximate physical model.
- can generate high-quality non-nasal singing voice.
- has analysis/re-synthesis ability.
- is computationally affordable.
- provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
44Future research
- Build a complete score-to-singing system using the proposed synthesis model. Its associated analysis procedure will be used to construct the parametric database.
- Investigate potential use of the source-filter deconvolution algorithm for low-bit-rate, high-quality speech coding.
- Explore the application of the analysis procedure to sound transformation of vocal textures.
45Thank you !