A Segment-Based Generative Model of Speech


Slides: 19

Provided by: Kan6150


Transcript and Presenter's Notes


1
A Segment-Based Generative Model of Speech

Kannan Achan
Joint work with
Sam Roweis, Aaron Hertzmann, Brendan Frey
University of Toronto
http://www.psi.toronto.edu/kannan/segmental

Speech processing purely in the time domain is generally considered difficult
Perceptual similarity is not preserved across microphones and room acoustics
A time-frequency representation is generally used
Phase!

Same utterance, different microphones
3
Voiced, Unvoiced, Silence

Vibrating vocal cords: voiced speech
Frequency of vibration is called pitch
Turbulent air flow: unvoiced speech
Coloured noise
Silence period

4
Time domain modeling
Voiced region
Unvoiced region

Identify the glottal pulse period for voiced regions

Mechanism to identify a region as unvoiced
5
Time domain modeling: voiced region

Goal: find segments s_1, s_2, ..., s_N corresponding to glottal pulse periods
Notice that adjacent segments look similar
Transformation t(a, b, g)
Time warp (a): stretch/shrink; maps an n-vector to a vector of length a·n
Amplitude scaling (b): scalar multiplication (b·x)
Amplitude shift (g): scalar addition (x + g)
For a given a_k, we can find b_k and g_k using linear regression
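
The last point can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how the time warp is implemented, so linear-interpolation resampling is assumed here, and the helper names `time_warp` and `fit_scale_and_shift` are hypothetical. Once the warp is fixed (here by the lengths of the two segments), b and g follow from ordinary least squares.

```python
import numpy as np

def time_warp(segment, new_length):
    """Stretch or shrink a segment to new_length samples (linear interpolation, an assumption)."""
    old_positions = np.linspace(0.0, 1.0, num=len(segment))
    new_positions = np.linspace(0.0, 1.0, num=new_length)
    return np.interp(new_positions, old_positions, segment)

def fit_scale_and_shift(previous, current):
    """Least-squares b and g such that b * warp(previous) + g approximates current."""
    warped = time_warp(previous, len(current))
    design = np.column_stack([warped, np.ones_like(warped)])
    (b, g), *_ = np.linalg.lstsq(design, current, rcond=None)
    residual = current - (b * warped + g)
    return float(b), float(g), float(residual @ residual)

# Toy check: build a "next" period from a known warp, scale and shift, then recover them.
phase = np.linspace(0.0, 2.0 * np.pi, 80, endpoint=False)
previous = np.sin(phase)
current = 0.5 * time_warp(previous, 100) + 0.1
b, g, squared_error = fit_scale_and_shift(previous, current)
```

In this toy case the implied warp factor is a = 100/80, and the fit should return values close to b = 0.5 and g = 0.1 with a near-zero residual.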

6
Generative Model

Assuming segments are generated by a first-order Markov process, four types of transitions are possible:
Voiced to Voiced
Voiced to Unvoiced
Unvoiced to Voiced
Unvoiced to Unvoiced
Given segment boundaries b, segment types v (voiced, v = 1, or unvoiced, v = 0) and transformation t, the generative model is a conditional Markov model

7
Generative Model

Successive voiced regions
The red overlay in the 2nd period is the prediction

8
Generative Model
When two successive frames are not both voiced, we assume that the phase information in the latter cannot be reliably predicted, so we use a model of the power spectrum
Define f(y) = abs(F(y)) / ||abs(F(y))|| (normalized power spectrum)
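
A minimal sketch of the normalized spectrum f(y), under assumptions not stated on the slide: F is taken to be the discrete Fourier transform, the normalizer is the Euclidean norm, and every segment is zero-padded or truncated to a fixed FFT length `n_fft` so that segments of different lengths can be compared. The function name is hypothetical.

```python
import numpy as np

def normalized_spectrum(segment, n_fft=256):
    """Magnitude spectrum of a segment, scaled to unit Euclidean norm."""
    # rfft pads with zeros (or truncates) to n_fft samples before transforming.
    magnitude = np.abs(np.fft.rfft(segment, n=n_fft))
    total = np.linalg.norm(magnitude)
    return magnitude / total if total > 0 else magnitude

rng = np.random.default_rng(0)
noise_segment = rng.standard_normal(120)      # stand-in for an unvoiced segment
spectrum = normalized_spectrum(noise_segment)
louder = normalized_spectrum(3.0 * noise_segment)
```

Because of the normalization, `spectrum` and `louder` should agree: the representation ignores overall amplitude as well as phase.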
9
Regularizer - Upward Zero Crossings (due to John Hopfield)

Constrain the segment boundaries to start and end only at upward zero crossings

To further regularize the space of valid segment boundaries, we can impose constraints on the minimum and maximum length of segments
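
The two constraints above can be sketched as follows. The function names and the convention that a crossing is the first non-negative sample after a negative one are assumptions made for this illustration.

```python
import numpy as np

def upward_zero_crossings(signal):
    """Indices i where signal[i-1] < 0 and signal[i] >= 0."""
    signal = np.asarray(signal, dtype=float)
    return np.flatnonzero((signal[:-1] < 0) & (signal[1:] >= 0)) + 1

def candidate_segments(crossings, min_len, max_len):
    """All (start, end) pairs of crossings whose length lies within [min_len, max_len]."""
    crossings = [int(c) for c in crossings]
    return [
        (a, b)
        for i, a in enumerate(crossings)
        for b in crossings[i + 1:]
        if min_len <= b - a <= max_len
    ]

toy_signal = [1, -1, 1, -1, -1, 2, 0, -3, 0]
crossings = upward_zero_crossings(toy_signal)
segments = candidate_segments(crossings, min_len=3, max_len=4)
```

The length limits prune most pairs of crossings, which is what keeps the set of candidate segments small.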
10
Inference (approx. E step)

Computational task
Infer segment boundaries, segment types and transformation parameters
Exact inference is intractable
The number of valid configurations of the boundary variables is exponential
Find MAP estimates using dynamic programming
Two-dimensional dynamic programming grid with size given by the number of upward zero crossings
For every valid pair of boundaries (a, b), the entry in the grid is the probability of (a, b) being the last segment in the best segmentation of the signal up to b
The grid is sparse
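
The grid recursion described above might look like the sketch below. It is not the authors' implementation: `first_score` and `pair_score` are hypothetical stand-ins for the model's log probabilities, segment types and transformation parameters are left out, and the sparse grid is stored as a dictionary keyed by (a, b).

```python
def best_segmentation(crossings, min_len, max_len, first_score, pair_score):
    """Best chain of segments from the first to the last crossing (first-order Markov scores)."""
    crossings = list(crossings)
    best = {}  # (a, b) -> (best score of a segmentation ending with (a, b), start of previous segment)
    for j, b in enumerate(crossings):
        for i in range(j):
            a = crossings[i]
            if not (min_len <= b - a <= max_len):
                continue  # invalid entry: this is why the grid is sparse
            options = []
            if i == 0:
                options.append((first_score(a, b), None))
            for p in crossings[:i]:
                if (p, a) in best:
                    options.append((best[(p, a)][0] + pair_score((p, a), (a, b)), p))
            if options:
                best[(a, b)] = max(options, key=lambda option: option[0])
    end = crossings[-1]
    finals = [(score, a) for (a, b), (score, _) in best.items() if b == end]
    if not finals:
        return []
    _, a = max(finals)
    path, b = [], end
    while a is not None:  # follow back-pointers to the first segment
        path.append((a, b))
        a, b = best[(a, b)][1], a
    return path[::-1]

# Toy scores: every first segment is equally good; later segments are penalized
# for changing length relative to their predecessor.
toy_path = best_segmentation(
    [0, 10, 20, 30, 35, 40], min_len=5, max_len=15,
    first_score=lambda a, b: 0.0,
    pair_score=lambda prev, cur: -abs((cur[1] - cur[0]) - (prev[1] - prev[0])),
)
```

With these toy scores the segmentation into four equal periods wins, since any path through the crossing at 35 pays a penalty for changing segment length.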

11
Learning

Learn the parameters of the model, l0 and l1, by maximizing the expected value of the complete log likelihood (the posterior is the delta function computed during inference)
Updates for l0 and l1 correspond to the normalized average spectra of voiced and unvoiced segments
The remaining parameters correspond to the variances on these spectra
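
A rough sketch of such an update, given a hard segmentation from inference. Several details are assumptions: spectra are computed at a fixed FFT length, normalized with the Euclidean norm, and the result is keyed by the segment type v (1 for voiced, 0 for unvoiced) rather than by the slide's parameter names. `update_spectra` is a hypothetical name.

```python
import numpy as np

def normalized_spectrum(segment, n_fft=256):
    """Unit-norm magnitude spectrum at a fixed FFT length (assumed, not from the slides)."""
    magnitude = np.abs(np.fft.rfft(segment, n=n_fft))
    total = np.linalg.norm(magnitude)
    return magnitude / total if total > 0 else magnitude

def update_spectra(segments, types, n_fft=256):
    """Per-type mean and variance of the normalized spectra of the inferred segments."""
    spectra = np.array([normalized_spectrum(s, n_fft) for s in segments])
    types = np.asarray(types)
    params = {}
    for v in (0, 1):
        group = spectra[types == v]
        if len(group) > 0:  # skip a type with no segments
            params[v] = (group.mean(axis=0), group.var(axis=0))
    return params

rng = np.random.default_rng(1)
toy_segments = [
    np.sin(2 * np.pi * np.arange(80) / 80),    # voiced-like
    np.sin(2 * np.pi * np.arange(90) / 90),    # voiced-like
    rng.standard_normal(70),                   # unvoiced-like
    rng.standard_normal(110),                  # unvoiced-like
]
params = update_spectra(toy_segments, types=[1, 1, 0, 0])
```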

12
Results: typical segmentation
13
Time Scale Modification

Stochastically remove or add frames

Original clip
2x slower
2x faster
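
One way to read "stochastically remove or add frames" is sketched below, assuming the frames are the inferred segments and that each one is emitted a random number of times whose expectation is 1/speed. The function name, the `speed` parameter and the sampling rule are assumptions for illustration.

```python
import numpy as np

def time_scale(signal, boundaries, speed, seed=0):
    """Replay the signal at roughly `speed` times the original rate by dropping or repeating segments."""
    rng = np.random.default_rng(seed)
    expected_copies = 1.0 / speed
    whole = int(np.floor(expected_copies))
    fraction = expected_copies - whole
    pieces = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # Emit `whole` copies, plus one more with probability `fraction`.
        copies = whole + (1 if rng.random() < fraction else 0)
        pieces.extend([signal[start:end]] * copies)
    return np.concatenate(pieces) if pieces else np.zeros(0)

toy_signal = np.sin(2 * np.pi * np.arange(800) / 80)
toy_boundaries = list(range(0, 801, 80))       # ten periods of 80 samples
slower = time_scale(toy_signal, toy_boundaries, speed=0.5)
faster = time_scale(toy_signal, toy_boundaries, speed=2.0)
```

Because whole pitch periods are removed or repeated, the pitch itself is left unchanged while the duration changes.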
14
Pitch Tracking / Voicing Detection
Counting the number of samples in each segment of a voiced region gives an estimate of the pitch period.
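
As a small worked example of this estimate: the pitch period in samples is the segment length, so the pitch in Hz is the sample rate divided by that length. The function name and the choice to report 0 Hz for unvoiced segments are assumptions.

```python
import numpy as np

def pitch_track(boundaries, voiced, sample_rate):
    """Per-segment pitch in Hz for voiced segments, 0.0 for unvoiced ones."""
    lengths = np.diff(boundaries)              # samples per segment = pitch period
    return np.where(voiced, sample_rate / lengths, 0.0)

# Two voiced segments of 80 samples and one unvoiced segment, at 8 kHz.
pitch = pitch_track([0, 80, 160, 260], voiced=[True, True, False], sample_rate=8000)
```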
15
Filling in missing/corrupted regions of speech

Our algorithm treats the corrupted region as unvoiced.
To reconstruct, fill in the corrupted region by generating new segments with periods between those of the two bounding voiced regions.
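
The reconstruction step could be sketched as below. The slides do not specify how the new segments are generated, so this illustration assumes the simplest option: segment lengths move linearly from the left bounding period to the right one, and each new segment is a cross-fade of the two bounding segments after warping both to that length. `fill_gap` and `time_warp` are hypothetical helpers.

```python
import numpy as np

def time_warp(segment, new_length):
    """Resample a segment to new_length samples (linear interpolation, an assumption)."""
    old_positions = np.linspace(0.0, 1.0, num=len(segment))
    new_positions = np.linspace(0.0, 1.0, num=new_length)
    return np.interp(new_positions, old_positions, segment)

def fill_gap(left, right, gap_length):
    """Generate gap_length samples from segments that interpolate between left and right."""
    mean_period = 0.5 * (len(left) + len(right))
    count = max(1, int(round(gap_length / mean_period)))
    weights = np.arange(1, count + 1) / (count + 1)      # relative position inside the gap
    raw_lengths = (1 - weights) * len(left) + weights * len(right)
    # Rescale the interpolated periods so that they sum exactly to gap_length.
    ends = np.round(np.cumsum(raw_lengths) * gap_length / raw_lengths.sum()).astype(int)
    lengths = np.diff(ends, prepend=0)
    pieces = [
        (1 - w) * time_warp(left, m) + w * time_warp(right, m)
        for w, m in zip(weights, lengths)
    ]
    return np.concatenate(pieces)

left_period = np.sin(2 * np.pi * np.arange(80) / 80)      # last period before the gap
right_period = np.sin(2 * np.pi * np.arange(100) / 100)   # first period after the gap
filled = fill_gap(left_period, right_period, gap_length=450)
```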

16
Clipped speech restoration