Title: Voice DSP Processing IV
1 Voice DSP Processing IV
- Yaakov J. Stein
- Chief Scientist, RAD Data Communications
2 Voice DSP
- Part 1 Speech biology and what we can learn from it
- Part 2 Speech DSP (AGC, VAD, features, echo cancellation)
- Part 3 Speech compression techniques
- Part 4 Speech Recognition
3 Voice DSP - Part 4
- Speech Recognition tasks
- ASR Engine
- Pattern Recognition and Phonetic labeling
- DTW
- HMM
- State-of-the-Art
4 Speech Recognition Tasks
- ASR - Automatic Speech Recognition (speech-to-text)
- Language/Accent Identification
- Keyword-spotting/Gisting/Task-related
- Isolated-word/Connected-word/Continuous-speech
- Speaker-dependent/Speaker-adaptive/Speaker-independent
- Small-vocabulary/Large-vocabulary/Arbitrary-input
- Constrained-grammar/Free-speech
- Speaker Recognition
- Gender-only
- Text-dependent/Text-independent
- Speaker-spotting/Speaker-verification/Forensic
- Closed-set/open-set
- Misc
- Emotional-state/Voice-polygraph
- Translating-telephone
- Singing
5 Basic ASR Engine
[Block diagram: speech → Acoustic Processing → Phonetic Labeling → Time Alignment (with Dictionary) → Syntactic Processing → Semantic Processing → text]
Notes: not all elements are present in all systems; we will not discuss syntactic/semantic processing here
6 Acoustic Processing
- All the processing we have learned before
- AGC
- VAD
- Filtering
- Echo/noise cancellation
- Pre-emphasis
- Mel/Bark scale spectral analysis
- Filter banks
- LPC
- Cepstrum
- ...
7 Phonetic Labeling
- Some ASR systems attempt to label phonemes
- Others don't label at all, or label other pseudo-acoustic entities
- Labeling simplifies overall engine architecture
- Changing speaker/language/etc. has less system impact
- Later stages are DSP-free
8 Phonetic Labeling - cont.
- Peterson-Barney data - an attempt at labeling in formant space
9 Phonetic Labeling - cont.
- Phonetic labeling is a classical Pattern Recognition task
- Need independence of / adaptation to channel, speaker, speed, etc.
- Pattern recognition can be computationally complex
- so feature extraction is often performed for data dimensionality reduction (but always loses information)
- Commonly used features (a filter-bank sketch follows this list)
- LPC
- LPC cepstrum (shown to be optimal in some sense)
- (mel/Bark scale) filter-bank representation
- RASTA (good for cross-channel applications)
- Cochlea model features (high dimensionality)
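As a rough illustration of the filter-bank feature type listed above, here is a minimal Python/NumPy sketch of triangular mel-spaced filter log-energies computed from one frame's power spectrum. The sample rate, FFT size, and number of filters are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def mel(f):
    """Frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    """Mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_features(frame, fs=8000, n_filters=20, n_fft=256):
    """Log energies of triangular mel-spaced filters for one speech frame."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2       # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)               # bin frequencies in Hz

    # Filter edges equally spaced on the mel scale between 0 and fs/2
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))

    feats = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        # Triangular weighting: rises from lo to mid, falls from mid to hi
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        weights = np.minimum(up, down)
        feats[i] = np.log(np.dot(weights, spectrum) + 1e-10)  # log filter energy
    return feats

# Example: features of a 32 ms frame of a synthetic vowel-like signal
t = np.arange(256) / 8000.0
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
print(mel_filterbank_features(frame))
```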
10 Pattern Recognition Quick Review
- What is Pattern Recognition?
- classification of real-world objects
- Not a unified field (more ART than SCIENCE)
- Not trivial (or even well-defined)
- 1, 2, 3, 4, 5, 6, 7, ...
- and the answer is ...
- 1, 2, 3, 4, 5, 6, 7, 5048
- because I meant n + (n-1)(n-2)(n-3)(n-4)(n-5)(n-6)(n-7)
- (for n = 8 this gives 8 + 7! = 8 + 5040 = 5048)
11 Pattern Recognition - approaches
- Approaches to Pattern Recognition
- Statistical (Decision Theoretic)
- Syntactical (Linguistic)
- Syntactical Method
- Describe classes by rules (grammar) in a formal language
- Identify a pattern's class by the grammar it obeys
- Reduces classification to string parsing
- Applications: Fingerprinting, Scene analysis, Chinese OCR
- Statistical Method (here we concentrate on this)
- Describe patterns as points in n-dimensional vector space
- Describe classes as hypervolumes or statistical clouds
- Reduces classification to geometry or function evaluation
- Applications: Signal classification, Speech, Latin OCR
12 PR Approaches - examples

              Class A           Class B           Class C
Statistical   ( 1,   0,   0 )   ( 0,   1,   0 )   ( 0,   0,   1 )
              (0.9, 0.1,  0 )   (0.1,  1, -0.1)   (-0.1, 0.1, 1 )
              (1.1,  0,  0.1)   ( 0,  0.9,-0.1)   ( 0.1,  0, 1.1)
Syntactic     ( 1, 1, 1 )       ( 1, 2, 3 )       ( 1, 2, 4 )
              ( 2, 2, 2 )       ( 2, 3, 4 )       ( 2, 4, 8 )
              ( 3, 3, 3 )       ( 3, 4, 5 )       ( 3, 6, 12 )
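To make the "classification reduces to geometry" point concrete, here is a tiny sketch classifying the statistical example vectors above by Euclidean distance to assumed class prototypes (1,0,0), (0,1,0), (0,0,1); the prototype choice is for illustration only.

```python
import numpy as np

# Assumed class prototypes for the statistical examples above
prototypes = {"A": np.array([1.0, 0.0, 0.0]),
              "B": np.array([0.0, 1.0, 0.0]),
              "C": np.array([0.0, 0.0, 1.0])}

def classify(x):
    """Assign x to the class whose prototype is nearest (Euclidean distance)."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

for point in [(0.9, 0.1, 0.0), (0.1, 1.0, -0.1), (0.1, 0.0, 1.1)]:
    print(point, "->", classify(np.array(point)))
# (0.9, 0.1, 0.0) -> A   (0.1, 1.0, -0.1) -> B   (0.1, 0.0, 1.1) -> C
```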
13 Classifier types
- Decision theoretic pattern recognizers come in three types
- Direct probability estimation
- 1NN, kNN, Parzen, LBG, LVQ
- Hypersphere
- potentials, Mahalanobis (MAP for Gauss), RBF, RCE
- Hyperplane
- Karhunen-Loeve, Fisher discriminant, Gaussian mixture classifiers, CART, MLP
14 Learning Theory
- A decision theoretic pattern recognizer is usually trained
- Training (learning) algorithms come in three types
- Supervised (learning by examples, query learning)
- Reinforcement (good-dog/bad-dog, TD(λ))
- Unsupervised (clustering, VQ)
- Cross Validation
- In order not to fall into the generalization trap we need
- training set
- validation set
- test set (untainted, fair estimate of generalization)
- Probably Approximately Correct Learning
- teacher and student
- VC dimension - strength of classifier
- Limit on generalization error
- E_gen - E_tr < a · d_VC / N_tr
15 Why ASR is not pattern recognition
Say
- pneumonoultramicroscopicsilicovolcanoconiosis
I bet you can't say it again!
- pneumonoultramicroscopicsilicovolcanoconiosis
I mean pronounce precisely the same thing. It might sound the same to your ear (and brain), but the timing will never be exactly the same. The relationship is one of nonlinear time warping.
16 Time Warping
- The Real Problem in ASR - we have to correct for the time warping
- Note that since the distortion is time-variant it is not a filter!
- One way to deal with such warping is to use Dynamic Programming
- The main DP algorithm has many names
- Viterbi algorithm
- Levenshtein distance
- DTW
- but they are all basically the same thing!
- The algorithm(s) are computationally efficient
- since they find a global minimum based on local decisions
17 Levenshtein Distance
- Easy case to demonstrate - distance between two strings
- Useful in spelling checkers, OCR/ASR post-processors, etc.
- There are three possible errors
- Deletion: digital → digtal
- Insertion: signal → signall
- Substitution: processing → prosessing
- The Levenshtein distance is the minimal number of errors
- the distance between dijitl and digital is 2
- How do we find the minimal number of errors?
- The algorithm is best understood graphically
18 Levenshtein distance - cont.
- What is the Levenshtein distance between prossesing and processing?

Rules:
1  enter square from left (deletion): cost 1
2  enter square from under (insertion): cost 1
3a enter square from diagonal, same letter: cost 0
3b enter square from diagonal, different letter (substitution): cost 1
4  always use the minimal cost

Table axes: prossesing (vertical, bottom to top) vs processing (horizontal, left to right)
19 Levenshtein distance - cont.
- Start with 0 in the bottom left corner

 g | 9
 n | 8
 i | 7
 s | 6
 e | 5
 s | 4
 s | 3
 o | 2
 r | 1
 p | 0   1   2   3   4   5   6   7   8   9
     p   r   o   c   e   s   s   i   n   g
20 Levenshtein distance - cont.
- Continue filling in the table

 g | 9
 n | 8
 i | 7
 s | 6
 e | 5
 s | 4   3   2   2   2
 s | 3   2   1   1   2
 o | 2   1   0   1   2
 r | 1   0   1   2   3
 p | 0   1   2   3   4   5   6   7   8   9
     p   r   o   c   e   s   s   i   n   g

Note that only local computations and decisions are made
21 Levenshtein distance - cont.
- Finish filling in the table

 g | 9   8   7   7   6   5   5   5   4   3
 n | 8   7   6   6   5   4   4   4   3   4
 i | 7   6   5   5   4   3   3   3   4   5
 s | 6   5   4   4   3   2   3   4   4   5
 e | 5   4   3   3   2   3   3   3   4   5
 s | 4   3   2   2   2   2   2   3   4   5
 s | 3   2   1   1   2   2   3   4   5   6
 o | 2   1   0   1   2   3   4   5   6   7
 r | 1   0   1   2   3   4   5   6   7   8
 p | 0   1   2   3   4   5   6   7   8   9
     p   r   o   c   e   s   s   i   n   g

The global result is 3! (top-right corner)
22 Levenshtein distance - end
- Backtrack to find the path actually taken

 g | 9   8   7   7   6   5   5   5   4   3
 n | 8   7   6   6   5   4   4   4   3   4
 i | 7   6   5   5   4   3   3   3   4   5
 s | 6   5   4   4   3   2   3   4   4   5
 e | 5   4   3   3   2   3   3   3   4   5
 s | 4   3   2   2   2   2   2   3   4   5
 s | 3   2   1   1   2   2   3   4   5   6
 o | 2   1   0   1   2   3   4   5   6   7
 r | 1   0   1   2   3   4   5   6   7   8
 p | 0   1   2   3   4   5   6   7   8   9
     p   r   o   c   e   s   s   i   n   g

Remember: the question is always how we got to a square
23 Generalization to DP
- What if not all substitutions are equally probable?
- Then we add a cost function instead of 1
- We can also have costs for insertions and deletions
- D(i,j) = min( D(i-1,j) + C_del(i),
-              D(i-1,j-1) + C_sub(i,j),
-              D(i,j-1) + C_ins(j) )
- Even more general rules are often used (a minimal code sketch follows)
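A minimal sketch of this recurrence: plain Levenshtein distance when all costs are 1, and the generalized version when other insertion/deletion/substitution costs are supplied. The cost arguments are illustrative placeholders; with unit costs it reproduces the distance 3 from the worked table above.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Dynamic-programming (Levenshtein) distance between strings a and b."""
    # D[i][j] = distance between the first i letters of a and the first j letters of b
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * del_cost
    for j in range(1, len(b) + 1):
        D[0][j] = j * ins_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = 0 if a[i - 1] == b[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # delete a[i-1]
                          D[i][j - 1] + ins_cost,      # insert b[j-1]
                          D[i - 1][j - 1] + diag)      # match or substitute
    return D[len(a)][len(b)]

print(edit_distance("prossesing", "processing"))  # 3, as in the worked table
print(edit_distance("dijitl", "digital"))         # 2, as on the earlier slide
```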
24 DTW
- DTW uses the same technique for matching spoken words
- The input is separately matched to each dictionary word
- and the word with the least distortion is the winner!
- When waveforms are used the comparison measure is
- correlation, segmental SNR, Itakura-Saito, etc.
- When (N) features are used the comparison is
- (N-dimensional) Euclidean distance
- With DTW there is no labeling,
- alignment and dictionary matching are performed together
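Below is a minimal DTW sketch along these lines: two sequences of N-dimensional feature vectors are aligned with an (N-dimensional) Euclidean local distance, and the dictionary word with the least total distortion wins. It omits the endpoint, path, and slope constraints that practical systems add (see the next slide).

```python
import numpy as np

def dtw_distance(ref, test):
    """Total distortion of the best nonlinear time alignment between two
    feature sequences (each a 2-D array: frames x feature dimension)."""
    T, R = len(test), len(ref)
    # Local (Euclidean) distance between every test frame and reference frame
    local = np.array([[np.linalg.norm(test[t] - ref[r]) for r in range(R)]
                      for t in range(T)])
    D = np.full((T, R), np.inf)
    D[0, 0] = local[0, 0]
    for t in range(T):
        for r in range(R):
            if t == 0 and r == 0:
                continue
            best_prev = min(D[t - 1, r] if t > 0 else np.inf,              # stretch test
                            D[t, r - 1] if r > 0 else np.inf,              # stretch reference
                            D[t - 1, r - 1] if t > 0 and r > 0 else np.inf)  # advance both
            D[t, r] = local[t, r] + best_prev
    return D[T - 1, R - 1]

def recognize(test, dictionary):
    """Return the dictionary word (word -> reference feature array) with least distortion."""
    return min(dictionary, key=lambda w: dtw_distance(dictionary[w], test))
```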
25 DTW - cont.
- Some more details
- In isolated word recognition systems
- energy contours are used to cut the words
- linear time warping is then used to normalize the utterance
- special mechanisms are used for endpoint location flexibility
- so there are endpoint and path constraints
- In connected word recognition systems
- the endpoint of each recognized utterance is used
- as a starting point for searching for the next word
- In speaker-independent recognition systems
- we need multiple templates for each reference word
- the number of templates can be reduced by VQ
26 Markov Models
- An alternative to DTW is based on Markov Models
- A discrete-time left-to-right first order Markov model
- [Diagram: states 1, 2, 3, 4 in a chain with self-loops and forward transitions a12, a23, a34; a11 + a12 = 1, a22 + a23 = 1, etc.]
- A DT LR second order Markov model
- [Diagram: as above, but also with skip transitions a13, a24; a11 + a12 + a13 = 1, a22 + a23 + a24 = 1, etc.]
27 Markov Models - cont.
- General DT Markov Model
- Model jumps from state to state with given probabilities
- e.g. 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 4
- or 1 1 2 2 2 2 2 2 2 2 2 4 4 4 (LR models)
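To illustrate how such state sequences arise, here is a tiny sketch that draws a state sequence from a left-to-right Markov model; the transition matrix is a made-up example, not taken from the slides.

```python
import random

# Made-up left-to-right transition probabilities a[i][j] (rows sum to 1);
# states are numbered 0..3 here (1..4 on the slide), state 3 is absorbing.
a = [[0.7, 0.3, 0.0, 0.0],
     [0.0, 0.8, 0.2, 0.0],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.0, 1.0]]

def sample_state_sequence(a, length, start=0):
    """Jump from state to state according to the transition probabilities."""
    state, seq = start, [start]
    for _ in range(length - 1):
        state = random.choices(range(len(a)), weights=a[state])[0]
        seq.append(state)
    return seq

print(sample_state_sequence(a, 17))   # e.g. [0, 0, 0, 1, 1, 2, 2, 2, 3, ...]
```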
28 Markov Models - cont.
- Why use Markov models for speech recognition?
- States can represent phonemes (or whatever)
- Different phoneme durations (but exponentially decaying)
- Phoneme deletions using 2nd or higher order
- So time warping is automatic!
- We build a Markov model for each word
- given an utterance, select the most probable word
29 HMM
- But the same phoneme can be said in different ways
- so we need a Hidden Markov Model
[Diagram: 4-state left-to-right HMM with self-loops a11, a22, a33, a44, forward transitions a12, a23, a34, and observation probabilities b11, b12, b13, b14 from state 1 (similarly for the other states)]
- aij are transition probabilities, bik are observation (output) probabilities of acoustic phenomenon k
- b11 + b12 + b13 + b14 = 1, b21 + b22 + b23 + b24 = 1, etc.
30 HMM - cont.
- For a given state sequence S1 S2 S3 ... ST
- the probability of an observation sequence O1 O2 O3 ... OT
- is P(O|S) = b_S1,O1 · b_S2,O2 · b_S3,O3 · ... · b_ST,OT
- For a given hidden Markov model M = (p, a, b)
- the probability of the state sequence S1 S2 S3 ... ST
- is (the initial probability of S1 is taken to be p_S1)
- P(S|M) = p_S1 · a_S1,S2 · a_S2,S3 · a_S3,S4 · ... · a_ST-1,ST
- So, for a given hidden Markov model M
- the probability of an observation sequence O1 O2 O3 ... OT
- is obtained by summing over all possible state sequences
31 HMM - cont.
- P(O|M) = Σ_S P(O|S) P(S|M)
-        = Σ_S p_S1 · b_S1,O1 · a_S1,S2 · b_S2,O2 · a_S2,S3 · b_S3,O3 · ...
- So for an observation sequence we can find
- the probability of any word model
- (but this can be made more efficient using the forward-backward algorithm)
- How do we train HMM models?
- The full algorithm (Baum-Welch) is an EM algorithm
- An approximate algorithm is based on the Viterbi (DP) algorithm
- (Sorry, but beyond our scope here)
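A minimal sketch of the evaluation step above, on a made-up toy model: it sums P(O|S)·P(S|M) over all state sequences by brute force, and computes the same P(O|M) with the more efficient forward recursion (the forward half of the forward-backward algorithm mentioned above).

```python
import itertools
import numpy as np

# Toy HMM: 2 states, 3 observation symbols (all parameters invented for illustration)
p = np.array([0.6, 0.4])            # initial state probabilities
a = np.array([[0.7, 0.3],           # transition probabilities a[i][j]
              [0.4, 0.6]])
b = np.array([[0.5, 0.4, 0.1],      # observation probabilities b[i][k]
              [0.1, 0.3, 0.6]])

O = [0, 2, 1, 2]                    # an observation sequence

def likelihood_brute_force(p, a, b, O):
    """P(O|M) = sum over all state sequences S of P(O|S) P(S|M)."""
    total = 0.0
    for S in itertools.product(range(len(p)), repeat=len(O)):
        prob = p[S[0]] * b[S[0], O[0]]
        for t in range(1, len(O)):
            prob *= a[S[t - 1], S[t]] * b[S[t], O[t]]
        total += prob
    return total

def likelihood_forward(p, a, b, O):
    """Same P(O|M) via the forward recursion: O(T·N^2) instead of O(N^T)."""
    alpha = p * b[:, O[0]]
    for t in range(1, len(O)):
        alpha = (alpha @ a) * b[:, O[t]]
    return alpha.sum()

print(likelihood_brute_force(p, a, b, O))   # both print the same value
print(likelihood_forward(p, a, b, O))
```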
32 State-of-the-Art

           Isolated digits   Isolated words   Constrained   Free speech
Spkr dep   ~100%             98%              95%           90%
Spkr ind   >98%              95%              85%           <70%

- NB: 95% on words is about 50% on sentences
- only >97% is considered useful for transcription
- (otherwise it is more time consuming to correct than to type)