Title: Dynamic Time Warping and Minimum Distance Paths for Speech Recognition
1. Dynamic Time Warping and Minimum Distance Paths for Speech Recognition
- Isolated word recognition
- Task
  - Want to build an isolated word recogniser, e.g. voice dialling on mobile phones
- Method
  - Record, parameterise and store a vocabulary of reference words
  - Record the test word to be recognised and parameterise it
  - Measure the distance between the test word and each reference word
  - Choose the reference word closest to the test word
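The four steps of the method can be sketched as a nearest-reference search. This is only a minimal illustration: the names `recognise` and `frame_distance_sum`, the one-number-per-frame parameterisation and the vocabulary values are all assumptions made for the example, and the placeholder distance only works when both words have the same number of frames.

```python
def frame_distance_sum(test, reference):
    """Placeholder distance: sum of absolute frame differences.

    Assumption for illustration only: both words have the same frame count.
    """
    if len(test) != len(reference):
        raise ValueError("placeholder distance needs equal frame counts")
    return sum(abs(t - r) for t, r in zip(test, reference))


def recognise(test, vocabulary, distance=frame_distance_sum):
    """Return the vocabulary word whose stored reference is closest to the test word."""
    return min(vocabulary, key=lambda word: distance(test, vocabulary[word]))


# Hypothetical stored references, one number per frame (made-up values).
vocabulary = {"yes": [4, 7, 4], "no": [1, 2, 9]}
print(recognise([5, 6, 4], vocabulary))
```

Passing the distance in as a parameter keeps the search independent of how the distance is measured, so any other word-to-word distance can be swapped in.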
2. Words are parameterised on a frame-by-frame basis
- Choose a frame length over which speech remains reasonably stationary
- Overlap frames, e.g. 40ms frames, 10ms frame shift
[Figure: overlapping frames, with spans of 40ms and 20ms marked]
We want to compare frames of test and reference words, i.e. calculate distances between them
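A rough sketch of the framing step, using the 40ms frame length and 10ms frame shift quoted above. The function name `split_into_frames` and the 8000 Hz sample rate in the usage line are assumptions for illustration; real front ends usually also window each frame and compute features from it, which is not shown.

```python
def split_into_frames(samples, sample_rate, frame_ms=40, shift_ms=10):
    """Cut a list of samples into overlapping frames.

    Frames that would run past the end of the signal are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame length and shift must be at least one sample")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of dummy samples at an assumed 8000 Hz sample rate.
frames = split_into_frames(list(range(8000)), sample_rate=8000)
print(len(frames), len(frames[0]))
```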
3. Calculating Distances
- Easy
  - Sum differences between corresponding frames
- Problem
  - Number of frames won't always correspond
4. Solution 1: Linear Time Warping
- Stretch shorter sound
- Problem?
  - Some sounds stretch more than others
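As a sketch of linear time warping, the shorter word can be stretched uniformly to the length of the longer one before summing frame differences. The helper names `stretch` and `linear_warp_distance`, and the choice of repeating frames (rather than interpolating), are assumptions for illustration. The example values are the one-dimensional test and reference words used in the DTW example.

```python
def stretch(frames, target_len):
    """Uniformly stretch a frame sequence to target_len by repeating frames."""
    n = len(frames)
    return [frames[i * n // target_len] for i in range(target_len)]


def linear_warp_distance(a, b):
    """Stretch the shorter word to the longer one's length, then sum differences."""
    target = max(len(a), len(b))
    a_stretched = stretch(a, target)
    b_stretched = stretch(b, target)
    return sum(abs(x - y) for x, y in zip(a_stretched, b_stretched))


print(stretch([4, 7, 4], 5))
print(linear_warp_distance([5, 3, 9, 7, 3], [4, 7, 4]))
```

Every frame is stretched by the same factor here, which is exactly the weakness noted above: in real speech some sounds stretch more than others.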
5. Solution 2: Dynamic Time Warping (DTW)
Test:      5 3 9 7 3
Reference: 4 7 4
- Using a dynamic alignment, make the most similar frames correspond
- Find the distance between the two utterances using these corresponding frames
6. Digression: Dynamic Programming
- The shortest route from Dublin to Limerick goes through
  - Kildare
  - Monasterevin
  - Portlaoise
  - Mountrath
  - Roscrea
  - Nenagh
- Now consider the shortest route from Dublin to Nenagh
  - What towns does the route go through?
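The idea behind the question can be checked on a toy graph: the shortest route to an intermediate point is the first part of the shortest route through it. The sketch below uses abstract nodes A to E with made-up edge costs (they are not the towns above and not real road distances), and a standard Dijkstra search; the function name `shortest_path` is an assumption for illustration.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (cost, list of nodes on the path)."""
    best = {start: 0}
    previous = {}
    queue = [(0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            break
        if cost > best[node]:
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return best[goal], path[::-1]


# Made-up undirected graph: abstract nodes, illustrative costs only.
edges = [("A", "B", 2), ("A", "C", 5), ("B", "C", 1), ("B", "D", 4),
         ("C", "D", 1), ("C", "E", 6), ("D", "E", 3)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

cost_to_e, path_to_e = shortest_path(graph, "A", "E")
cost_to_d, path_to_d = shortest_path(graph, "A", "D")
print(path_to_e, path_to_d)
```

Here the best path to D is a prefix of the best path to E, which is the property dynamic programming relies on: once the best way to reach an intermediate point is known, it never has to be recomputed.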
7. Intercity Example
9. Place the distance between frame r of Test and frame c of Reference in cell (r,c) of the distance matrix

Test   r |  dist(r,c)
  3    5 |   1   4   1
  7    4 |   3   0   3
  9    3 |   5   2   5
  3    2 |   1   4   1
  5    1 |   1   2   1
       c |   1   2   3
Reference    4   7   4

Compute the minimum distance to each point and place it in the mindist matrix, e.g.
mindist(5,3) = min{ 1 + mindist(5,2), 1 + mindist(4,2), 1 + mindist(4,3) }
where 1 is the distance in cell (5,3)

Test   r |  mindist(r,c)
  3    5 |  11   8   5
  7    4 |  10   4   7
  9    3 |   7   4   9
  3    2 |   2   5   4
  5    1 |   1   3   4
       c |   1   2   3
Reference    4   7   4

We can also find the path through the grid that minimises the total cost of the path
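The two matrices can be reproduced with a short dynamic programming sketch. The recurrence is the one shown for mindist(5,3): the local distance plus the smallest of the three neighbouring mindist values. The function names `dtw` and `best_path` are assumptions for illustration, rows are stored bottom-up (index 0 is r = 1), and the returned path uses the 1-based (r,c) numbering of the matrices.

```python
def dtw(test, reference):
    """Return (dist, mindist) matrices; row index 0 corresponds to r = 1."""
    rows, cols = len(test), len(reference)
    dist = [[abs(t - ref) for ref in reference] for t in test]
    mindist = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            predecessors = []
            if r > 0:
                predecessors.append(mindist[r - 1][c])
            if c > 0:
                predecessors.append(mindist[r][c - 1])
            if r > 0 and c > 0:
                predecessors.append(mindist[r - 1][c - 1])
            mindist[r][c] = dist[r][c] + (min(predecessors) if predecessors else 0)
    return dist, mindist


def best_path(mindist):
    """Backtrack from the top-right cell to (1,1), returning 1-based (r,c) cells."""
    r, c = len(mindist) - 1, len(mindist[0]) - 1
    path = [(r + 1, c + 1)]
    while (r, c) != (0, 0):
        steps = []
        if r > 0 and c > 0:
            steps.append((mindist[r - 1][c - 1], r - 1, c - 1))
        if r > 0:
            steps.append((mindist[r - 1][c], r - 1, c))
        if c > 0:
            steps.append((mindist[r][c - 1], r, c - 1))
        _, r, c = min(steps)
        path.append((r + 1, c + 1))
    return path[::-1]


dist, mindist = dtw([5, 3, 9, 7, 3], [4, 7, 4])
print(mindist[-1][-1])     # total distance, cell (5,3)
print(best_path(mindist))
```

For the example words the path found is (1,1), (2,1), (3,2), (4,2), (5,3), whose local distances 1 + 1 + 2 + 0 + 1 add up to the value 5 in cell (5,3).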
10. Examples so far are uni-dimensional
Speech is multi-dimensional, e.g. two dimensions, using points (4,3) and (5,2)
[Figure: the two points plotted on axes running from 1 to 5]
Distance equation for 2 dimensions
Distance equation for multi-dimensional
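The equations themselves are not reproduced in this transcript. Assuming the usual Euclidean distance (an assumption; other measures such as the city-block distance are also used), they would read as follows, with the symbols x, y, a, b and N chosen here for illustration:

```latex
% Two dimensions, with the example points (4,3) and (5,2)
\[
d\big((x_1, y_1), (x_2, y_2)\big) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},
\qquad
d\big((4,3), (5,2)\big) = \sqrt{(4-5)^2 + (3-2)^2} = \sqrt{2} \approx 1.41
\]

% N dimensions, for feature vectors a and b
\[
d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}
\]
```

In one dimension this reduces to the absolute difference used in the earlier matrices.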
11. Constraints
- Global
  - Endpoint detection
  - Path should be close to diagonal
- Local
  - Must always travel upwards or eastwards
  - No jumps
  - Slope weighting
  - Consecutive moves upwards/eastwards
12. Global Constraints
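One common way to keep the path close to the diagonal is to fill in only the cells inside a band around it. The sketch below is an assumption about how such a global constraint might look, not a reconstruction of the slide's figure: the function name `banded_dtw`, the `window` parameter (measured in reference frames) and the way the diagonal is scaled for words of different lengths are all illustrative choices.

```python
import math


def banded_dtw(test, reference, window):
    """DTW distance using only cells within `window` columns of the diagonal."""
    rows, cols = len(test), len(reference)
    mindist = [[math.inf] * cols for _ in range(rows)]
    for r in range(rows):
        # Column where the straight diagonal crosses this row.
        centre = r * (cols - 1) / (rows - 1) if rows > 1 else 0
        for c in range(cols):
            if abs(c - centre) > window:
                continue  # outside the band: leave as unreachable
            if r == 0 and c == 0:
                best = 0
            else:
                best = min(
                    mindist[r - 1][c] if r > 0 else math.inf,
                    mindist[r][c - 1] if c > 0 else math.inf,
                    mindist[r - 1][c - 1] if r > 0 and c > 0 else math.inf,
                )
            mindist[r][c] = abs(test[r] - reference[c]) + best
    return mindist[-1][-1]


print(banded_dtw([5, 3, 9, 7, 3], [4, 7, 4], window=1))
```

A narrower band saves computation but can cut off the best path; if the band is too narrow for any path to reach the end point, the function returns infinity.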
13. Local Constraints
[Figure: cell mindist(r,c) reached from mindist(r,c-1), mindist(r-1,c) and mindist(r-1,c-1); the moves carry weights 1, 1 and 2]
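Slope weighting can be sketched by multiplying the local distance by a weight that depends on the move taken. Which move carries the weight 2 cannot be recovered from this transcript; the sketch assumes the diagonal move has weight 2 and the horizontal and vertical moves weight 1, a common symmetric choice. The function name `weighted_dtw` and its parameter names are likewise assumptions.

```python
def weighted_dtw(test, reference, w_east=1, w_up=1, w_diag=2):
    """DTW distance where each move multiplies the local distance by a weight."""
    rows, cols = len(test), len(reference)
    mindist = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            d = abs(test[r] - reference[c])
            if r == 0 and c == 0:
                mindist[r][c] = d  # start cell, no move taken yet
                continue
            candidates = []
            if c > 0:
                candidates.append(mindist[r][c - 1] + w_east * d)
            if r > 0:
                candidates.append(mindist[r - 1][c] + w_up * d)
            if r > 0 and c > 0:
                candidates.append(mindist[r - 1][c - 1] + w_diag * d)
            mindist[r][c] = min(candidates)
    return mindist[-1][-1]


print(weighted_dtw([5, 3, 9, 7, 3], [4, 7, 4]))
print(weighted_dtw([5, 3, 9, 7, 3], [4, 7, 4], w_diag=1))
```

With all three weights set to 1 the function reduces to the unweighted recurrence and gives 5 for the example words; with the assumed diagonal weight of 2 it gives 8.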
14. Points to Note
- DTW is really only suitable for small vocabularies and/or speaker-dependent recognition
- Should normalise for reference length
- Can use multiple utterances and cluster them
- Poor performance if recording environment changes
- High computation cost
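To see why normalising for reference length matters, note that a longer reference accumulates more local distances along the path. A minimal sketch, assuming the normalisation is a simple division by the number of reference frames (other schemes exist); the function names and the six-frame reference are made up for illustration.

```python
def dtw_distance(test, reference):
    """Unnormalised DTW distance with absolute-difference local distances."""
    rows, cols = len(test), len(reference)
    mindist = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            predecessors = []
            if r > 0:
                predecessors.append(mindist[r - 1][c])
            if c > 0:
                predecessors.append(mindist[r][c - 1])
            if r > 0 and c > 0:
                predecessors.append(mindist[r - 1][c - 1])
            local = abs(test[r] - reference[c])
            mindist[r][c] = local + (min(predecessors) if predecessors else 0)
    return mindist[-1][-1]


def normalised_distance(test, reference):
    """DTW distance divided by the number of reference frames (assumed scheme)."""
    return dtw_distance(test, reference) / len(reference)


test_word = [5, 3, 9, 7, 3]
short_ref = [4, 7, 4]
long_ref = [4, 4, 7, 7, 4, 4]  # hypothetical slower utterance of the same word

for ref in (short_ref, long_ref):
    print(dtw_distance(test_word, ref), normalised_distance(test_word, ref))
```

In this made-up case the raw distances are 5 for the short reference and 6 for the long one, but after dividing by reference length the long reference scores 1.0 against roughly 1.67, so the ranking changes.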
15. Evaluation
- Performance of designs is only comparable by evaluation
  - Use a test set
- For single word recognition we can simply quote accuracy
In error analysis, it can be helpful to use a confusion matrix
16. Confusion Matrix

              test tokens
references    yes    no
yes            24     2
no              3    21
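Accuracy can be read off the confusion matrix as the diagonal total divided by the number of test tokens. A small sketch using the counts above; the nested-dictionary layout and the function names `accuracy` and `confusions` are assumptions for illustration.

```python
# Rows are references, columns are test tokens, as in the table above.
confusion = {
    "yes": {"yes": 24, "no": 2},
    "no": {"yes": 3, "no": 21},
}


def accuracy(confusion):
    """Fraction of tokens on the diagonal of the confusion matrix."""
    correct = sum(confusion[word][word] for word in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def confusions(confusion):
    """Off-diagonal cells, i.e. the errors worth inspecting."""
    return {
        (ref, token): count
        for ref, row in confusion.items()
        for token, count in row.items()
        if ref != token and count > 0
    }


print(accuracy(confusion))    # (24 + 21) / 50
print(confusions(confusion))
```

For these counts 45 of the 50 tokens lie on the diagonal, giving an accuracy of 90%.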