Toward a high-quality singing synthesizer with vocal texture control

Slides: 46
Provided by: vick96

1
Toward a high-quality singing synthesizer with vocal texture control
  • Hui-Ling Lu
  • Center for Computer Research in Music and Acoustics (CCRMA)
  • Stanford University, Stanford, CA 94305, USA

2
Score-to-Singing system
Inputs: Score, Lyrics, Singing style
Rule system
  • Musical rules
  • Lyrics-to-phoneme
  • Co-articulation rules
Parameters: Phoneme, F0, Sound level, Duration, Vibrato
Parametric Database
Sound synthesis
  • Acoustic rendering
Output: Singing voice

3
General sound synthesis approaches
Physical Modeling
  • Pros: flexible/intuitive control; expressive; co-articulation easy
  • Cons: analysis/re-synthesis difficult; invasive measurements

Source-filter Model, Spectral Modeling
  • Pros: analysis/re-synthesis easy
  • Cons: less expressive; co-articulation difficult

4
Contributions
A pseudo-physical model for singing voice
synthesis which
  • is an approximate physical model.
  • can generate high-quality non-nasal singing
    voice.
  • has analysis/re-synthesis ability.
  • is computationally affordable.
  • provides flexible control of vocal textures.

An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
5
Outline
  • Human voice production system
  • Synthesis model
  • Analysis procedure
  • Vocal texture parametric model
  • Vocal texture control demo
  • Contributions and future directions

6
The human voice production system
Nasal sound output
Nasal cavity
Velum
Oral sound output
Oral cavity
Pharyngeal cavity
Vocal folds
Tongue hump
Lungs
Muscle force
7
Oscillation pattern of the vocal folds
Opening period
Closing period
Close phase
Open phase
  • The oscillation results from the balancing of the subglottal
    pressure, the Bernoulli pressure and the elastic restoring force
    of the vocal folds.
  • Prephonatory position: the initial configuration of the
    vocal folds before the beginning of oscillation.

8
(No Transcript)
9
Variation of vocal textures
Pressed
Normal
Breathy
10
Simplified human voice production model
  • Source-tract interaction: the glottal waveform in general
    depends on the vocal tract configuration.
  • Neglect the source-tract interaction, since the glottal impedance
    is very high most of the time.

11
Source-filter type synthesis model
Glottal Source
Vocal Tract Filter
Radiation
Aspiration noise
12
Overview of the proposed synthesis model
Glottal excitation
Filter
Derivative glottal wave
Voice output
All Pole Filter
Transformed Liljencrants-Fant Model
Noise Residual Model
High-passed aspiration noise
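
The signal flow on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's implementation: the excitation is a stand-in impulse train rather than a transformed LF pulse, the noise is plain Gaussian noise without high-pass shaping, and the names `all_pole_filter` and `synthesize` and all coefficient values are hypothetical.

```python
import random

def all_pole_filter(x, a):
    """Filter x through 1 / (1 + a[0] z^-1 + ... + a[N-1] z^-N)."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y.append(acc)
    return y

def synthesize(num_samples, period, a, noise_gain=0.0, seed=0):
    rng = random.Random(seed)
    # Stand-in for the derivative glottal wave: one negative impulse per period.
    source = [-1.0 if n % period == 0 else 0.0 for n in range(num_samples)]
    # Stand-in for the noise residual: unshaped Gaussian noise.
    noise = [noise_gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    excitation = [s + w for s, w in zip(source, noise)]
    return all_pole_filter(excitation, a)

# Hypothetical 2nd-order filter with poles of radius 0.9 (stable).
voice = synthesize(num_samples=400, period=100, a=[-1.6, 0.81])
```

Replacing the impulse train with an LF pulse and the Gaussian noise with a shaped noise residual would bring the sketch closer to the block diagram.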
13
Transformed Liljencrants-Fant (LF) model
  • The transformed LF model controls the wave shape of the derivative
    glottal wave via a single parameter, Rd (wave-shape control parameter).
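
As a rough sketch of how one Rd value can set the LF timing parameters, the code below uses the regression formulas usually quoted from Fant's description of the transformed LF model. The slides do not say which mapping was used, so the formulas, the function name `rd_to_lf_timing` and the example values are assumptions.

```python
def rd_to_lf_timing(rd, t0):
    """Map Rd to LF timing parameters (tp, te, ta) for a period t0 in seconds.

    ASSUMPTION: regression formulas commonly quoted for the transformed LF
    model, typically used for Rd between roughly 0.3 and 2.7.
    """
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    tp = t0 / (2.0 * rg)    # instant of maximum glottal flow
    te = tp * (1.0 + rk)    # instant of the main negative peak of the derivative
    ta = ra * t0            # effective duration of the return phase
    return tp, te, ta

tp, te, ta = rd_to_lf_timing(rd=1.0, t0=0.01)
```

With these formulas a larger Rd gives a longer return phase and a later main excitation.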

14
Transformed Liljencrants-Fant (LF) model
  • The transformed LF model is an extension of the LF model. It provides
    a control interface for the LF model to change the wave shape of the
    derivative glottal wave easily.

Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping^(-1)
Rd
15

Liljencrants-Fant (LF) model
16
Transformed Liljencrants-Fant (LF) model
  • The transformed LF model is an extension of the LF model. It provides
    a control interface for the LF model to change the wave shape of the
    derivative glottal wave easily.

Wave shape control parameter
Direct synthesis timing parameters
Analysis
Estimated derivative glottal wave
LF fitting
Mapping^(-1)
Rd
17

Noise residual model
Noise floor
Bn
Noise residual
Gaussian Noise Generator
Amplitude Modulation

An
GCI
L
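
The labels on this slide suggest Gaussian noise whose amplitude is modulated once per glottal cycle above a constant floor. A minimal sketch of that idea follows. The Hann-shaped burst, the `duty` parameter, the function name `noise_residual` and all numeric values are assumptions, since the slide shows only the labels An, Bn, GCI and L.

```python
import math
import random

def noise_residual(num_samples, gci, period, an, bn, lag, duty=0.5, seed=0):
    """Gaussian noise whose amplitude is raised once per glottal cycle.

    gci    : sample indices of the glottal closure instants
    an, bn : burst amplitude and constant noise floor
    lag    : offset (in samples) of the burst relative to each GCI
    duty   : fraction of the period covered by the burst (ASSUMED, <= 1)
    """
    rng = random.Random(seed)
    width = max(2, int(duty * period))
    envelope = [bn] * num_samples                     # constant noise floor
    for g in gci:
        for i in range(width):
            n = g + lag + i
            if 0 <= n < num_samples:
                hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (width - 1))
                envelope[n] = bn + an * hann          # pitch-synchronous burst
    noise = [e * rng.gauss(0.0, 1.0) for e in envelope]
    return noise, envelope

noise, env = noise_residual(300, gci=[0, 100, 200], period=100,
                            an=1.0, bn=0.2, lag=10)
```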
18
Vocal tract filter
  • An all-pole filter.
  • The vocal tract is assumed to be a series of concatenated uniform
    lossless cylindrical acoustic tubes.
  • Assume that sound waves obey planar propagation along the axis
    of the vocal tract.

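(Figure: the vocal tract as concatenated tubes with areas A1, A2, ..., AN between the glottis and the lip end, which has area Alip; glottal flow Ug and lip flow Ulip; junction gains labeled 1-kN, -kN and -1.)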
19
Vocal tract filter
Kelly-Lochbaum junction
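(Figure: scattering junction between tube sections with areas Am and Am+1, labeled with the scattering coefficient km; forward waves Um+ and Um+1+, backward waves Um- and Um+1-; branch gains 1+km, 1-km, km and -km.)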
  • τ: the propagation time for a sound wave to travel one acoustic tube.
  • N: the number of acoustic tubes, excluding the glottis and the lip end.
  • If the sampling period is T = 2τ, the transfer function of the vocal tract
    acoustic tubes can be shown to be an Nth-order all-pole filter.
  • The autoregressive coefficients of the vocal tract filter can be
    converted to scattering coefficients by Durbin's method.
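
The conversion named in the last bullet can be sketched with the step-down (backward Levinson-Durbin) recursion, which is assumed here to be what the slide means by Durbin's method. The function names are hypothetical, and the sign convention for the coefficients may differ from the one used on the slides.

```python
def ar_to_reflection(a):
    """Step-down recursion.

    a = [a1, ..., aN] are the coefficients of A(z) = 1 + sum_i a_i z^-i.
    Returns [k1, ..., kN], with k_m taken as the last coefficient of the
    order-m polynomial (sign convention is an ASSUMPTION).
    """
    cur = list(a)
    ks = []
    while cur:
        k = cur[-1]
        if abs(k) >= 1.0:
            raise ValueError("filter is not stable (|k| >= 1)")
        ks.append(k)
        m = len(cur)
        # Order m -> order m-1: a'_i = (a_i - k * a_{m-i}) / (1 - k^2)
        cur = [(cur[i] - k * cur[m - 2 - i]) / (1.0 - k * k) for i in range(m - 1)]
    return ks[::-1]

def reflection_to_ar(ks):
    """Step-up recursion, the inverse of ar_to_reflection."""
    a = []
    for k in ks:
        m = len(a)
        a = [a[i] + k * a[m - 1 - i] for i in range(m)] + [k]
    return a

ks = ar_to_reflection([-1.6, 0.81])
```

All coefficients having magnitude below one is equivalent to the all-pole filter being stable, which is why the sketch rejects any other input.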

20
Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd

0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
(No noise input)
21
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
22
Source-filter de-convolution
  • Synthesis model for analysis

KLGLOTT88 (KL) derivative glottal wave
Basic Voicing Waveform (a, b, OQ)
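
For orientation, one period of a KLGLOTT88-style derivative glottal wave can be generated as below. The slide gives only the parameter names (a, b, OQ). The polynomial flow g(t) = a t^2 - b t^3 over the open phase, the scaling of a and b, the time normalization to a unit period, and the function name are assumptions based on the usual description of the KLGLOTT88 source.

```python
def klglott88_derivative(num_samples, oq, amp=1.0):
    """One period of the derivative glottal wave, time normalized to [0, 1).

    ASSUMPTION: flow g(t) = a t^2 - b t^3 for 0 <= t < oq and zero afterwards,
    with a and b chosen so that the peak flow equals amp and g(oq) = 0.
    """
    a = 27.0 * amp / (4.0 * oq ** 2)
    b = 27.0 * amp / (4.0 * oq ** 3)
    wave = []
    for n in range(num_samples):
        t = n / num_samples
        # Derivative of the flow during the open phase: g'(t) = 2 a t - 3 b t^2
        wave.append(2.0 * a * t - 3.0 * b * t * t if t < oq else 0.0)
    return wave, a, b

wave, a, b = klglott88_derivative(1000, oq=0.6)
```

Because the flow returns to zero at the end of the open phase, the derivative integrates to approximately zero over the period.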
23
(No Transcript)
24
Source-filter de-convolution
  • Synthesis model for analysis

KLGLOTT88 (KL) derivative glottal wave
Nth order All pole vocal tract filter

Basic Voicing Waveform (a, b, OQ)
Low-pass filter
25
Source-filter deconvolution estimation flowchart
Voice signal after removing the low-frequency drift
GCI detection
Phase I
One glottal period signal
Loop for each period
  • Loop over different OQ values: vocal tract filter and glottal source estimation via SUMT
  • Select and store the 5 best estimates
Loop for each period
  • Enforce continuity constraints via Dynamic Programming
Phase II
Smoothing the vocal tract area by time averaging and linear interpolation
Estimated model parameter sequence
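
The dynamic programming step can be pictured as a Viterbi-style search that picks one of the stored candidates per period. The sketch below is an illustration only: the cost (fit error plus weighted squared change between consecutive parameter vectors), the function name `select_smooth_path` and the toy data are assumptions, as the slides do not specify the continuity constraints.

```python
def select_smooth_path(candidates, fit_errors, weight=1.0):
    """Pick one candidate per period, trading fit error against continuity.

    candidates[t][j] : parameter vector of candidate j in period t
    fit_errors[t][j] : its estimation error
    Returns the list of chosen candidate indices, one per period.
    """
    def jump(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    cost = list(fit_errors[0])
    back = []
    for t in range(1, len(candidates)):
        new_cost, pointers = [], []
        for j, cand in enumerate(candidates[t]):
            options = [cost[i] + weight * jump(prev, cand)
                       for i, prev in enumerate(candidates[t - 1])]
            best = min(range(len(options)), key=options.__getitem__)
            pointers.append(best)
            new_cost.append(options[best] + fit_errors[t][j])
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]

# Toy data: candidate 1 fits slightly better in each period but jumps around.
candidates = [[[1.0], [5.0]], [[1.1], [9.0]], [[1.2], [5.0]]]
fit_errors = [[0.2, 0.1], [0.2, 0.1], [0.2, 0.1]]
path = select_smooth_path(candidates, fit_errors)
```

In the toy data the smooth sequence of candidate 0 wins, even though candidate 1 has the smaller fit error in every period.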
26
Convex optimization formulation
Inverse filter
  • Estimate

by minimizing the error
between the basic voicing waveform and the
estimated one.
27
Convex optimization formulation
  • Error for one glottal cycle in vector form,

A convex optimization problem
Minimize
Subject to
  • L2 norm is used

The above problem can be solved by SUMT
(sequential unconstrained minimization
technique).
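
SUMT replaces the constrained problem by a sequence of unconstrained ones. A minimal log-barrier sketch on a toy least-squares problem is shown below. The actual objective and constraints of the slide are not reproduced here, so the problem data, the barrier choice, the Newton inner solver and all names are assumptions.

```python
import math

def solve_linear(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [vi] for row, vi in zip(M, v)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))) / aug[r][r]
    return z

def sumt_least_squares(A, b, G, h, x0, t=1.0, mu=10.0, outer=8, inner=50):
    """Minimize ||A x - b||^2 subject to G x <= h; x0 must satisfy G x0 < h."""
    n = len(x0)

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    def penalized(x, t):
        slack = [hi - dot(g, x) for g, hi in zip(G, h)]
        if min(slack) <= 0.0:
            return math.inf                     # infeasible points are rejected
        resid = [dot(row, x) - bi for row, bi in zip(A, b)]
        return dot(resid, resid) - sum(math.log(s) for s in slack) / t

    x = list(x0)
    for _ in range(outer):
        for _ in range(inner):                  # Newton steps, unconstrained
            resid = [dot(row, x) - bi for row, bi in zip(A, b)]
            slack = [hi - dot(g, x) for g, hi in zip(G, h)]
            grad = [2.0 * sum(r * row[j] for r, row in zip(resid, A))
                    + sum(g[j] / s for g, s in zip(G, slack)) / t
                    for j in range(n)]
            hess = [[2.0 * sum(row[i] * row[j] for row in A)
                     + sum(g[i] * g[j] / s ** 2 for g, s in zip(G, slack)) / t
                     for j in range(n)] for i in range(n)]
            step = solve_linear(hess, [-gj for gj in grad])
            decrement = -dot(grad, step)
            if decrement < 1e-12:
                break
            alpha, fx = 1.0, penalized(x, t)    # backtracking line search
            while alpha > 1e-16:
                trial = [xi + alpha * si for xi, si in zip(x, step)]
                if penalized(trial, t) <= fx - 0.25 * alpha * decrement:
                    x = trial
                    break
                alpha *= 0.5
        t *= mu                                 # tighten the barrier and repeat
    return x

# Toy problem: closest point to (2, 0.5) inside the box x1 <= 1, x2 <= 1.
x = sumt_least_squares([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.5],
                       [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [0.0, 0.0])
```

The iterate approaches the constrained minimizer (1, 0.5) from inside the feasible region as the barrier weight 1/t shrinks.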
28
De-convolution result (synthetic data)
29
Effective analysis/re-synthesis
Baritone examples
  • Normal phonation

original
KLGLOTT88
  • Pressed phonation

original
KLGLOTT88
KLGLOTT88 (KL) derivative glottal wave
30
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
31
De-noising by Wavelet Packet Analysis
De-noising by best basis thresholding
  • A noisy data record: X = f + W
  • Transform the noisy data to another basis
    via Wavelet Packet Analysis: XB = fB + WB
  • Threshold out the smaller coefficients of XB, assuming
    that f can be compactly represented in the new basis by
    a few large coefficients.
  • Select the wavelet filter by an energy compactness criterion:
    1/(number of coefficients needed to accumulate
    0.9 of the total energy).
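
The thresholding step and the energy compactness criterion can be illustrated as follows. To stay self-contained, the sketch uses a plain Haar wavelet transform as a stand-in for the best-basis wavelet packet transform. The function names, the hard-threshold rule and the test signal are assumptions.

```python
import math

def haar_forward(x):
    """Full-depth orthonormal Haar transform (len(x) must be a power of two)."""
    coeffs, approx = [], list(x)
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        s = [(approx[i] + approx[i + 1]) / math.sqrt(2.0) for i in pairs]
        d = [(approx[i] - approx[i + 1]) / math.sqrt(2.0) for i in pairs]
        coeffs = d + coeffs          # finer details go to the end
        approx = s
    return approx + coeffs

def hard_threshold(coeffs, thresh):
    """Zero every coefficient whose magnitude does not exceed thresh."""
    return [c if abs(c) > thresh else 0.0 for c in coeffs]

def energy_compactness(coeffs, fraction=0.9):
    """1 / (number of largest coefficients needed to reach `fraction` of the energy)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    target, running = fraction * sum(energies), 0.0
    for count, e in enumerate(energies, start=1):
        running += e
        if running >= target:
            return 1.0 / count
    return 0.0

signal = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
raw_score = energy_compactness(signal)                   # four samples needed
haar_score = energy_compactness(haar_forward(signal))    # two coefficients needed
```

A higher score means the basis packs the signal energy into fewer coefficients, which is the property the thresholding step relies on.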

32
De-noising result (synthetic data)
33
Analysis procedure
Inverse filtered glottal excitation
Desired voice recording
LF model coefficients
Fitting the estimated derivative glottal wave
via LF model
Source-filter de-convolution
De-noising by Wavelet Packet Analysis
High-passed aspiration noise
34
Effective analysis/re-synthesis
Baritone examples
  • Normal phonation

original
LF
  • Pressed phonation

original
LF
35
Vocal texture control
  • The parametric vocal texture control model determines the
    parameterizations of the glottal excitation to achieve the desired vocal texture.
  • Reduce the control complexity by exploring the correlations
    between the model parameters.

Wave shape control parameter
Desired vocal texture
Non-breathy mode
Transformed LF model
Rd
Glottal excitation strength Ee
Rd
breathy mode
Noise residual model
36
Vocal texture control (non-breathy mode)
Pressed and normal modes: the wave-shape control parameter Rd and the
normalized glottal excitation strength Ee are highly correlated.
37
(No Transcript)
38
(No Transcript)
39
Vocal texture control (non-breathy mode)
Degree of pressness
interpolation
(a_press, b_press, c_press), (a_normal, b_normal, c_normal)
Wave shape control parameter
(a, b, c)
Glottal excitation
Glottal excitation strength Ee
Transformed LF model
Rd
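
The interpolation on this slide can be sketched as a linear blend between the two coefficient sets, followed by a curve that maps Ee to Rd. The slide does not show the functional form of that curve. The quadratic used below, the 0-to-1 range of the degree of pressness, the function names and every numeric value are placeholders, not the author's model.

```python
def interpolate_coeffs(pressness, normal, pressed):
    """Blend (a, b, c) linearly: pressness 0.0 gives `normal`, 1.0 gives `pressed`."""
    return tuple((1.0 - pressness) * n + pressness * p
                 for n, p in zip(normal, pressed))

def rd_from_ee(ee, coeffs):
    """PLACEHOLDER curve Rd = a*Ee^2 + b*Ee + c; the real form is not on the slide."""
    a, b, c = coeffs
    return a * ee * ee + b * ee + c

# Hypothetical coefficient sets, chosen only to make the arithmetic easy to follow.
coeffs = interpolate_coeffs(0.5, normal=(0.0, 2.0, 4.0), pressed=(2.0, 4.0, 8.0))
rd = rd_from_ee(1.0, coeffs)
```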
40
Vocal texture control (breathy mode)
  • NHR per glottal cycle = (high-passed noise energy) / (glottal excitation strength Ee)

  • NHR is an indicator for the degree of
    breathiness.
  • The contour of the noise strength is adjusted by
    NHR.

Glottal excitation
Desired vocal texture
Transformed LF model
NHR

Rd
Ee
Bn1
gain
Noise residual model
An 2.4138 Bn 0.213
duty cycle
window lag
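
One way to read "the contour of the noise strength is adjusted by NHR" is to rescale the noise in each glottal cycle until it reaches a target NHR. The sketch below assumes that NHR per cycle is the high-passed noise energy divided by Ee, which is how the slide layout reads. The function names and the example numbers are hypothetical.

```python
def nhr_per_cycle(noise_cycle, ee):
    """High-passed noise energy in one glottal cycle divided by Ee (ASSUMED definition)."""
    return sum(w * w for w in noise_cycle) / ee

def scale_noise_to_nhr(noise_cycle, ee, target_nhr):
    """Rescale the noise samples of one cycle so that their NHR equals target_nhr."""
    current = nhr_per_cycle(noise_cycle, ee)
    if current == 0.0:
        return list(noise_cycle)       # silent cycle: nothing to scale
    gain = (target_nhr / current) ** 0.5
    return [gain * w for w in noise_cycle]

cycle = [1.0, -1.0, 2.0, 0.0]
scaled = scale_noise_to_nhr(cycle, ee=2.0, target_nhr=0.75)
```

Energy scales with the square of the gain, hence the square root when converting an NHR ratio into an amplitude gain.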
41
Overall synthesis model implementation
Transformed LF model
Degree of breathiness
Ee , F0
Vocal texture model
Rd
Glottal excitation

0.8
Noise residual model
Glottal excitation strength Ee
Fundamental frequency F0
Output voice
42
Vocal texture control demo
43
Contributions
A pseudo-physical model for singing voice
synthesis which
  • is an approximate physical model.
  • can generate high-quality non-nasal singing
    voice.
  • has analysis/re-synthesis ability.
  • is computationally affordable.
  • provides flexible control of vocal textures.

An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
44
Future research
  • Build a complete score-to-singing system using the proposed
    synthesis model. Its associated analysis procedure will be used
    to construct the parametric database.
  • Investigate potential use of the source-filter deconvolution
    algorithm for low-bit-rate, high-quality speech coding.
  • Explore the application of the analysis procedure to sound
    transformation of vocal textures.

45
Thank you !