Title: A Segmental HMM for Speech Waveforms
1. A Segmental HMM for Speech Waveforms
- Kannan Achan, Sam Roweis, Brendan Frey
2.
- Speech processing purely in the time domain is generally considered difficult
- Perceptual similarity is not preserved in the waveform (e.g. across microphones and room acoustics)
- A time-frequency representation is generally used
3. How is speech produced?
- Energy source: air pressure from the lungs
- Larynx, vocal cords
- Modulated by a transfer function that depends on the shape of the vocal tract
4. Voiced, Unvoiced, Silence
- Vibrating vocal cords: voiced speech
- The frequency of vibration is called the pitch
- Turbulent air flow: unvoiced speech
- Coloured noise
- Silence period
5. Time domain modeling
Voiced region
- Identify the glottal pulse period for voiced regions
Unvoiced region
- Mechanism to identify a region as unvoiced
6. Time domain modeling
- Goal: break the speech signal S into segments s_1, s_2, ..., s_N, each corresponding to one glottal pulse period
- Notice that adjacent segments look similar
- Idea: first-order Markov chain. Model each segment as a transformed version of the previous segment
7. Transformations
- Time warp (α): horizontal stretching
- Stretches/shrinks the segment
- Matrix multiplication Sx
- S is an αn × n matrix
- Maps an n-vector to an αn-vector
- Amplitude scaling (β): vertical stretching
- Scalar multiplication (βx)
- Amplitude shift (γ): vertical shift
- Scalar addition (x + γ)
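A minimal sketch of the three transformations applied to one segment. The slides do not say how the warp matrix S is built, so linear interpolation is assumed here; the names `warp` and `transform` are illustrative only.

```python
def warp(x, m):
    """Resample the n-sample segment x to m samples by linear interpolation.

    This plays the role of the warp matrix S. Linear interpolation is an
    assumption, not something the slides specify.
    """
    n = len(x)
    if m == 1 or n == 1:
        return [x[0]] * m
    out = []
    for j in range(m):
        pos = j * (n - 1) / (m - 1)   # fractional position in the source segment
        i = min(int(pos), n - 2)      # index of the left neighbour
        frac = pos - i
        out.append((1 - frac) * x[i] + frac * x[i + 1])
    return out


def transform(x, alpha, beta, gamma):
    """Time-warp x by factor alpha, then scale by beta and shift by gamma."""
    m = max(1, round(alpha * len(x)))
    return [beta * v + gamma for v in warp(x, m)]


segment = [0.0, 1.0, 0.0, -1.0, 0.0]
stretched = transform(segment, alpha=2.0, beta=0.5, gamma=0.1)
```

Here `stretched` has twice as many samples as `segment`, with amplitudes halved and raised by 0.1.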
8. Probability model
- Segments s_1, s_2, ..., s_K
- Segment boundaries b_1, b_2, ..., b_{K+1}
- Transformation T_k maps s_{k-1} to s_k
- Parameterized as T_k = (α_k, β_k, γ_k): time warp, amplitude scaling, amplitude shift
- α_k is discretized; its range is hand-set using the expected pitch period and the sampling frequency
- For a given α_k, we can find β_k and γ_k using linear regression
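Once the time warp is fixed, the best amplitude scaling and shift are the slope and intercept of an ordinary least-squares line fit of the target segment against the warped previous segment. A small sketch; the function name `fit_scale_shift` and the example values are made up for illustration.

```python
def fit_scale_shift(u, y):
    """Least-squares beta, gamma minimising sum((beta*u[i] + gamma - y[i])**2).

    u is the previous segment already warped to the length of the target y.
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    var_u = sum((a - mean_u) ** 2 for a in u)
    if var_u == 0.0:                  # constant template: only the shift is identifiable
        return 0.0, mean_y
    cov = sum((a - mean_u) * (b - mean_y) for a, b in zip(u, y))
    beta = cov / var_u
    gamma = mean_y - beta * mean_u
    return beta, gamma


u = [0.0, 1.0, 0.0, -1.0]            # warped previous segment (toy values)
y = [0.5, 2.5, 0.5, -1.5]            # target segment, built as 2*u + 0.5
beta, gamma = fit_scale_shift(u, y)
```

On this toy input the fit recovers the scaling 2 and the shift 0.5 used to build `y`.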
9. Regularizer: upward zero crossings (due to John Hopfield)
- Constrain the segment boundaries to start and end only at upward zero crossings
- Goal: break the speech signal into glottal pulse periods, each starting at an upward zero crossing
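Upward zero crossings are cheap to enumerate. The sketch below reports the index of the first non-negative sample after a negative one; that indexing convention is an assumption, since the slides do not fix one.

```python
def upward_zero_crossings(x):
    """Indices i where the signal rises from negative (x[i-1]) to non-negative (x[i])."""
    return [i for i in range(1, len(x)) if x[i - 1] < 0.0 <= x[i]]


signal = [0.3, -0.2, -0.1, 0.4, 0.2, -0.5, 0.0, 0.6]
crossings = upward_zero_crossings(signal)
```

Only these indices are then allowed as candidate segment boundaries.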
10. A simple greedy algorithm
- Given a good initial guess S_t = (b_t, b_{t+1})
- Enumerate the N zero crossings that occur immediately after b_{t+1}: z_1^t, z_2^t, ..., z_N^t
- For each zero crossing z_k^t
- Resample S_t to be of length (z_k^t - b_{t+1})
- Find the optimal amplitude scaling and shift parameters using linear regression
- Compute the error with the target segment of x between b_{t+1} and z_k^t
- End
- Select the target segment with the least error as S_{t+1}
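The greedy step above can be sketched as follows. Linear-interpolation resampling and a per-sample (mean) squared error are assumptions made here so that candidates of different lengths are comparable; all function names are illustrative.

```python
def resample(x, m):
    """Linear-interpolation resampling of x to m samples (m >= 2, len(x) >= 2)."""
    n = len(x)
    out = []
    for j in range(m):
        pos = j * (n - 1) / (m - 1)
        i = min(int(pos), n - 2)
        out.append(x[i] + (pos - i) * (x[i + 1] - x[i]))
    return out


def fit_error(u, y):
    """Mean squared error after the best least-squares scale and shift of u onto y."""
    n = len(u)
    mu, my = sum(u) / n, sum(y) / n
    var = sum((a - mu) ** 2 for a in u)
    beta = sum((a - mu) * (b - my) for a, b in zip(u, y)) / var if var else 0.0
    gamma = my - beta * mu
    return sum((beta * a + gamma - b) ** 2 for a, b in zip(u, y)) / n


def greedy_next_boundary(x, b_t, b_t1, n_candidates=3):
    """Choose the end of the next segment among the upward zero crossings after b_t1."""
    template = x[b_t:b_t1]
    # Start at b_t1 + 2 so that every target has at least two samples.
    crossings = [i for i in range(b_t1 + 2, len(x)) if x[i - 1] < 0.0 <= x[i]]
    best = None
    for z in crossings[:n_candidates]:
        target = x[b_t1:z]
        err = fit_error(resample(template, len(target)), target)
        if best is None or err < best[0]:
            best = (err, z)
    return None if best is None else best[1]


period = [0.0, 0.7, 1.0, 0.7, 0.0, -0.7, -1.0, -0.7]
signal = period * 4                      # toy periodic "voiced" signal
next_boundary = greedy_next_boundary(signal, 0, 8)
```

On this toy signal the candidate one period ahead matches the template exactly, so it is selected over the candidate two periods ahead.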
11. Problems with the greedy technique
- Unvoiced to voiced transition: lack of a reliable template leads to poor estimates
- We need an efficient optimizer to infer the segments
12. Embedded HMM: Neal, Beal, Roweis (NIPS 2003)
- Addresses the issue of sampling from the posterior distribution in a nonlinear dynamical system
- We will use it as an optimizer to find the MAP estimate
13. HMM: create a pool of segments
- Let z_1^1, z_2^1, ..., z_K^1 be the participating zero crossings in the current solution
- Define Nghd(z_k^t, g) as the set containing z_k^t and g neighbouring upward zero crossings around z_k^t
- States of the HMM are given by the set Nghd(z_k^t, g) × Nghd(z_k^{t+1}, g)
- Consider only those tuples that form a valid segment (non-negative length)
- Define a compatibility function to enforce segment continuity
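A rough sketch of how such a pool of candidate states might be built. Taking g crossings on each side of the current boundary, and requiring strictly positive length, are assumptions; `nghd`, `candidate_states` and `compatible` are illustrative names, and the crossing positions are invented.

```python
def nghd(crossings, z, g):
    """z together with up to g upward zero crossings on each side of it.

    `crossings` is the sorted list of all upward zero crossings in the signal.
    """
    k = crossings.index(z)
    return crossings[max(0, k - g): k + g + 1]


def candidate_states(crossings, z_start, z_end, g):
    """All (start, end) pairs drawn from the two neighbourhoods with start < end."""
    return [(s, e)
            for s in nghd(crossings, z_start, g)
            for e in nghd(crossings, z_end, g)
            if s < e]                      # keep only segments of positive length


def compatible(prev_state, next_state):
    """Segment continuity: the next segment starts where the previous one ends."""
    return prev_state[1] == next_state[0]


crossings = [0, 40, 85, 120, 165, 200]
states = candidate_states(crossings, 85, 120, g=1)
```

With g = 1, the current segment (85, 120) is joined in the pool by five alternatives whose boundaries sit one crossing earlier or later.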
14. Dynamic programming to infer segment boundaries
- Using the current estimate, form a pool of candidate segments
- Run the Viterbi algorithm to find the best path
[Figure: trellis of candidate segments; each node is a pair of candidate boundaries, e.g. (b_1^t, b_2^t), (b_k^t, b_l^t), (b_x^t, b_y^t), for t = 1, ..., T]
- Monotonically improves the model likelihood of the observed waveform
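A generic Viterbi pass over such a pool can be sketched as below. The transition cost is supplied by the caller; in the example it is a toy stand-in (change in segment length, infinite for discontinuous segments), not the model's actual likelihood.

```python
INF = float("inf")


def viterbi(pools, step_cost):
    """Minimum-cost path through `pools`, one list of candidate states per step.

    step_cost(prev, cur) is the cost of moving from state prev to state cur
    (for example a negative log-likelihood); it returns INF for forbidden moves.
    """
    cost = [0.0] * len(pools[0])
    back = []
    for t in range(1, len(pools)):
        new_cost, ptr = [], []
        for cur in pools[t]:
            totals = [cost[j] + step_cost(prev, cur)
                      for j, prev in enumerate(pools[t - 1])]
            j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[j])
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    j = min(range(len(cost)), key=cost.__getitem__)
    if cost[j] == INF:
        return None                       # no continuous path exists
    path = [j]
    for ptr in reversed(back):            # follow back-pointers to the first step
        path.append(ptr[path[-1]])
    path.reverse()
    return [pools[t][j] for t, j in enumerate(path)]


def toy_cost(prev, cur):
    """Toy cost: segments must be contiguous; penalise changes in length."""
    if prev[1] != cur[0]:
        return INF
    return abs((cur[1] - cur[0]) - (prev[1] - prev[0]))


pools = [[(0, 40), (0, 44)], [(40, 80), (44, 80)], [(80, 120), (80, 126)]]
best_path = viterbi(pools, toy_cost)
```

Because the current solution is itself one path through the pool, the best path found this way can never score worse than it.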
15. Results
[Figure, panel (C): after 3 iterations]
16. Time scale modification
- Stochastically remove or add frames
[Audio clips: original, 2× slower, 2× faster]
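One way to read "stochastically remove or add frames" is to drop or repeat whole inferred segments. The sketch below assumes that reading; `time_scale` and its `rate` parameter are illustrative, not the authors' interface.

```python
import random


def time_scale(segments, rate, seed=0):
    """Stochastically drop or repeat whole segments.

    rate > 1 speeds up (each segment is kept with probability 1/rate);
    rate < 1 slows down (each segment is emitted 1/rate times on average).
    """
    rng = random.Random(seed)
    out = []
    for seg in segments:
        copies = 1.0 / rate
        count = int(copies)
        if rng.random() < copies - count:   # stochastic rounding of the fractional part
            count += 1
        out.extend(seg * count)             # count == 0 drops the segment
    return out


segments = [[1, 2], [3, 4], [5, 6]]
slower = time_scale(segments, 0.5)          # every segment played twice
faster = time_scale(segments, 2.0)          # about half of the segments kept
```

Since whole pitch periods are added or removed, the duration changes while each period's length, and hence the pitch, is left untouched.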
17. Voicing/unvoicing detection
- Periodicity of the segments can be used to discriminate voiced and unvoiced regions
- Voiced regions are more periodic
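As one possible periodicity measure, adjacent segments can be compared by normalised correlation. Truncating to the common length and the 0.8 threshold are assumptions made for illustration, not values from the slides.

```python
import math


def periodicity(a, b):
    """Normalised correlation of two adjacent segments over their common length."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den else 0.0


def is_voiced(a, b, threshold=0.8):
    """Label a pair of adjacent segments voiced when they are strongly correlated."""
    return periodicity(a, b) >= threshold


pulse = [0.0, 1.0, 0.0, -1.0]
louder_pulse = [0.0, 2.0, 0.0, -2.0]     # same shape, different amplitude
other = [1.0, -1.0, 1.0, -1.0]           # unrelated shape
```

The measure ignores amplitude scaling and shift, so `pulse` and `louder_pulse` count as periodic while `pulse` and `other` do not.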
18. Pitch tracking
- Counting the number of samples in each segment of a voiced region gives an estimate of the pitch period
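Converting a segment length to a pitch value only needs the sampling rate. A short sketch; the 16 kHz rate and the segment lengths are assumed example values.

```python
def pitch_hz(segment_lengths, sample_rate):
    """Pitch per voiced segment: sampling rate divided by samples per glottal period."""
    return [sample_rate / n for n in segment_lengths]


lengths = [160, 145, 80]                 # samples per inferred segment (toy values)
pitches = pitch_hz(lengths, 16000)       # 16 kHz sampling rate is an assumption
```

A 160-sample segment at 16 kHz corresponds to a 10 ms period, i.e. 100 Hz.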
19. Pitch on spectrogram
20. Clipped speech restoration
- Saturation due to poor recording / quantization
- Can we use the inferred transformation/pitch to complete the clipped signal?
- Work in progress
21. Voice/gender conversion
- A very naïve approach
- Pitch of male voice around 110 Hz
- Pitch of female voice around 210 Hz
- Idea: stretch/shrink segments to decrease/increase pitch
- Cubic spline smoother along segment boundaries
- Work in progress
[Examples: female; trumpet]
22. Multiple sound sources
- Several templates evolving simultaneously
- Need for a more complicated model
- Example
- Voice + background music
- Timescale modified (slower)
23.
- Model a hidden state corresponding to the state of the glottis
- 0: unvoiced / silence
- 1: voiced
- Handle multiple speakers
- Need for more complicated models