Transcript and Presenter's Notes

Title: Voice DSP Processing IV


1
Voice DSP Processing IV
  • Yaakov J. Stein
  • Chief Scientist, RAD Data Communications

2
Voice DSP
  • Part 1: Speech biology and what we can learn
    from it
  • Part 2: Speech DSP (AGC, VAD, features, echo
    cancellation)
  • Part 3: Speech compression techniques
  • Part 4: Speech Recognition

3
Voice DSP - Part 4
  • Speech Recognition tasks
  • ASR Engine
  • Pattern Recognition and Phonetic labeling
  • DTW
  • HMM
  • State-of-the-Art

4
Speech Recognition Tasks
  • ASR - Automatic Speech Recognition
    (speech-to-text)
  • Language/Accent Identification
  • Keyword-spotting/Gisting/Task-related
  • Isolated-word/Connected-word/Continuous-speech
  • Speaker-dependent/Speaker-adaptive/Speaker-independent
  • Small-vocabulary/Large-vocabulary/Arbitrary-input
  • Constrained-grammar/Free-speech
  • Speaker Recognition
  • Gender-only
  • Text-dependent/Text-independent
  • Speaker-spotting/Speaker-verification/Forensic
  • Closed-set/open-set
  • Misc
  • Emotional-state/Voice-polygraph
  • Translating-telephone
  • Singing

5
Basic ASR Engine
Block diagram: speech → Acoustic Processing → Phonetic Labeling →
Time Alignment → Dictionary / Syntactic Processing →
Semantic Processing → text
Notes: not all elements are present in all systems;
we will not discuss syntactic/semantic processing here
6
Acoustic Processing
  • All the processing we have learned before
  • AGC
  • VAD
  • Filtering
  • Echo/noise cancellation
  • Pre-emphasis
  • Mel/Bark scale spectral analysis
  • Filter banks
  • LPC
  • Cepstrum
  • ...

7
Phonetic labeling
  • Some ASR systems attempt to label phonemes
  • Others don't label at all, or label other
    pseudo-acoustic entities
  • Labeling simplifies overall engine architecture
  • Changing speaker/language/etc. has less system
    impact
  • Later stages are DSP-free

8
Phonetic Labeling - cont.
  • Peterson-Barney data: an attempt at labeling
    in formant space

9
Phonetic Labeling - cont.
  • Phonetic labeling is a classical Pattern
    Recognition task
  • Need independence of / adaptation to
    channel, speaker, speed, etc.
  • Pattern recognition can be computationally
    complex
  • so feature extraction is often performed for
    data dimensionality reduction (but always loses
    information)
  • Commonly used features
  • LPC
  • LPC cepstrum (shown to be optimal in some sense)
  • (mel/Bark scale) filter-bank representation
  • RASTA (good for cross-channel applications)
  • Cochlea model features (high dimensionality)

10
Pattern Recognition Quick Review
  • What is Pattern Recognition ?
  • classification of real-world objects
  • Not a unified field (more ART than SCIENCE)
  • Not trivial (or even well-defined)
  • 1, 2, 3, 4, 5, 6, 7,
  • and the answer is ...

1, 2, 3, 4, 5, 6, 7, 5048
because I meant a(n) = n + (n-1)(n-2)(n-3)(n-4)(n-5)(n-6)(n-7)
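A quick check of that formula (a small sketch added here, not part of the original slides): for n = 1..7 one of the factors is zero, so the sequence looks like 1..7, and the eighth term jumps to 5048.

```python
# a(n) = n + (n-1)(n-2)(n-3)(n-4)(n-5)(n-6)(n-7)
def a(n):
    prod = 1
    for k in range(1, 8):
        prod *= (n - k)
    return n + prod

print([a(n) for n in range(1, 9)])  # [1, 2, 3, 4, 5, 6, 7, 5048]
```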
11
Pattern Recognition - approaches
  • Approaches to Pattern Recognition
  • Statistical (Decision Theoretic)
  • Syntactical (Linguistic)
  • Syntactical Method
  • Describe classes by rules (grammar) in a formal
    language
  • Identify pattern's class by grammar it obeys
  • Reduces classification to string parsing
  • Applications: Fingerprinting, Scene analysis,
    Chinese OCR
  • Statistical Method (we concentrate on this here)
  • Describe patterns as points in an n-dimensional
    vector space
  • Describe classes as hypervolumes or statistical
    clouds
  • Reduces classification to geometry or function
    evaluation
  • Applications: Signal classification, Speech,
    Latin OCR

12
PR Approaches - examples
  • Class A Class
    B Class C
  • Statistical ( 1, 0, 0 ) ( 0, 1, 0 ) ( 0,
    0, 1 )
  • (0.9, 0.1, 0) (0.1, 1, -0.1) (-0.1,
    0.1, 1)
  • (1.1, 0, 0.1) (0, 0.9, -0.1) (0.1,
    0, 1.1)
  • Syntactic ( 1, 1, 1 ) ( 1, 2, 3 ) ( 1,
    2, 4 )
  • ( 2, 2, 2 ) ( 2, 3, 4 ) ( 2,
    4, 8 )
  • ( 3, 3, 3 ) ( 3, 4, 5 ) ( 3,
    6, 12)

13
Classifier types
  • Decision theoretic pattern recognizers come in
    three types
  • Direct probability estimation
    1NN, kNN, Parzen, LBG, LVQ
  • Hypersphere
    potentials, Mahalanobis (MAP for Gauss), RBF, RCE
  • Hyperplane
    Karhunen-Loeve, Fisher discriminant,
    Gaussian mixture classifiers, CART, MLP
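To make the decision-theoretic view concrete, here is a minimal nearest-neighbour sketch (my own illustration, not from the slides) using the statistical example vectors shown two slides back; classification reduces to distance computations in the 3-dimensional feature space.

```python
import numpy as np

# Training patterns: points in a 3-dimensional feature space
# (the statistical example vectors from the PR approaches slide)
train_x = np.array([
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.1, 0.0, 0.1],    # class A
    [0.0, 1.0, 0.0], [0.1, 1.0, -0.1], [0.0, 0.9, -0.1],  # class B
    [0.0, 0.0, 1.0], [-0.1, 0.1, 1.0], [0.1, 0.0, 1.1],   # class C
])
train_y = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

def nearest_neighbour(x):
    """1NN: return the label of the closest training pattern (Euclidean distance)."""
    d = np.linalg.norm(train_x - x, axis=1)
    return train_y[int(np.argmin(d))]

print(nearest_neighbour(np.array([0.05, 0.95, 0.0])))  # -> B
```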

14
Learning Theory
  • A decision-theoretic pattern recognizer is
    usually trained
  • Training (learning) algorithms come in three
    types
  • Supervised (learning by examples, query learning)
  • Reinforcement (good-dog/bad-dog, TD(λ))
  • Unsupervised (clustering, VQ)
  • Cross Validation
  • In order not to fall into the generalization trap
    we need
  • training set
  • validation set
  • test set (untainted, fair estimate of
    generalization)
  • Probably Approximately Correct Learning
  • teacher and student
  • VC dimension - strength of classifier
  • Limit on generalization error
  • E_gen - E_tr < a d_VC / N_tr
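A minimal sketch of the three-way split described above (the split fractions are my own choice, not from the slides):

```python
import numpy as np

def three_way_split(data, rng, frac_train=0.6, frac_val=0.2):
    """Shuffle and split data into training, validation and test sets.
    The test set stays untainted so it gives a fair estimate of generalization."""
    idx = rng.permutation(len(data))
    n_train = int(frac_train * len(data))
    n_val = int(frac_val * len(data))
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = three_way_split(list(range(100)), np.random.default_rng(0))
print(len(train), len(val), len(test))  # 60 20 20
```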

15
Why ASR is not pattern recognition
Say
  • pneumonoultramicroscopicsilicovolcanoconiosis

I bet you can't say it again!
pneumonoultramicroscopicsilicovolcanoconiosis
I mean pronounce precisely the same thing.
It might sound the same to your ear (+ brain),
but the timing will never be exactly the same.
The relationship is one of nonlinear time warping.
16
Time Warping
  • The Real Problem in ASR - we have to correct for
    the time warping
  • Note that since the distortion is time-variant it
    is not a filter!
  • One way to deal with such warping is to use
    Dynamic Programming
  • The main DP algorithm has many names
  • Viterbi algorithm
  • Levenshtein distance
  • DTW
  • but they are all basically the same thing!
  • The algorithm(s) are computationally efficient
  • since they find a global minimum based on local
    decisions

17
Levenshtein Distance
  • Easy case to demonstrate - distance between two
    strings
  • Useful in spelling checkers, OCR/ASR
    post-processors, etc
  • There are three possible errors
  • Deletion: digital → digtal
  • Insertion: signal → signall
  • Substitution: processing → prosessing
  • The Levenshtein distance is the minimal number of
    errors needed to turn one string into the other
  • e.g. the distance between dijitl and digital is 2
  • How do we find the minimal number of errors?
  • Algorithm is best understood graphically

18
Levenshtein distance - cont.
  • What is the Levenshtein distance between
    prossesing and processing ?

Rules:
1  enter square from the left (deletion): cost 1
2  enter square from below (insertion): cost 1
3a enter square from the diagonal, same letter: cost 0
3b enter square from the diagonal, different letter (substitution): cost 1
4  always use the minimal cost
Grid axes: prossesing (p r o s s e s i n g) runs up the side,
processing (p r o c e s s i n g) runs along the bottom
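The table-filling procedure can be written directly from these rules; a minimal sketch (added here for illustration, not the presenter's code):

```python
def levenshtein(a, b):
    """Fill the DP table: D[i][j] = distance between a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i                                    # only deletions
    for j in range(1, len(b) + 1):
        D[0][j] = j                                    # only insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            same = 0 if a[i - 1] == b[j - 1] else 1    # rule 3a / 3b
            D[i][j] = min(D[i - 1][j] + 1,             # delete a[i-1]  (rule 1)
                          D[i][j - 1] + 1,             # insert b[j-1]  (rule 2)
                          D[i - 1][j - 1] + same)      # diagonal move  (rule 3)
    return D[len(a)][len(b)]

print(levenshtein("prossesing", "processing"))  # 3
print(levenshtein("dijitl", "digital"))         # 2
```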
19
Levenshtein distance - cont.
  • Start with 0 in the bottom left corner
  g  9
  n  8
  i  7
  s  6
  e  5
  s  4
  s  3
  o  2
  r  1
  p  0  1  2  3  4  5  6  7  8  9
  0  p  r  o  c  e  s  s  i  n  g

(prossesing runs up the left side, processing along the bottom;
the 0 at bottom left is the starting corner)
20
Levenshtein distance - cont.
  • Continue filling in table
  g  9
  n  8
  i  7
  s  6
  e  5
  s  4  3  2  2  2
  s  3  2  1  1  2
  o  2  1  0  1  2
  r  1  0  1  2  3
  p  0  1  2  3  4  5  6  7  8  9
  0  p  r  o  c  e  s  s  i  n  g

Note that only local computations and decisions
are made
21
Levenshtein distance - cont.
  • Finish filling in table
  g  9  8  7  7  6  5  5  5  4  3
  n  8  7  6  6  5  4  4  4  3  4
  i  7  6  5  5  4  3  3  3  4  5
  s  6  5  4  4  3  2  3  4  4  5
  e  5  4  3  3  2  3  3  3  4  5
  s  4  3  2  2  2  2  2  3  4  5
  s  3  2  1  1  2  2  3  4  5  6
  o  2  1  0  1  2  3  4  5  6  7
  r  1  0  1  2  3  4  5  6  7  8
  p  0  1  2  3  4  5  6  7  8  9
  0  p  r  o  c  e  s  s  i  n  g

The global result is 3 !
22
Levenshtein distance - end
  • Backtrack to find path actually taken
  g  9  8  7  7  6  5  5  5  4  3
  n  8  7  6  6  5  4  4  4  3  4
  i  7  6  5  5  4  3  3  3  4  5
  s  6  5  4  4  3  2  3  4  4  5
  e  5  4  3  3  2  3  3  3  4  5
  s  4  3  2  2  2  2  2  3  4  5
  s  3  2  1  1  2  2  3  4  5  6
  o  2  1  0  1  2  3  4  5  6  7
  r  1  0  1  2  3  4  5  6  7  8
  p  0  1  2  3  4  5  6  7  8  9
  0  p  r  o  c  e  s  s  i  n  g

Remember: the question is always how we got to a square
23
Generalization to DP
  • What if not all substitutions are equally
    probable?
  • Then we add a cost function instead of 1
  • We can also have costs for insertions and
    deletions
  • D[i,j] = min( D[i-1,j]   + C(i-1,j → i,j),
                  D[i-1,j-1] + C(i-1,j-1 → i,j),
                  D[i,j-1]   + C(i,j-1 → i,j) )
  • Even more general rules are often used
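A sketch of this generalized recursion (the cost-function names are mine, for illustration only), where per-symbol cost functions replace the fixed cost of 1:

```python
def weighted_edit_distance(a, b, del_cost, ins_cost, sub_cost):
    """D[i][j] = min( D[i-1][j]   + del_cost(a[i-1]),
                      D[i-1][j-1] + sub_cost(a[i-1], b[j-1]),
                      D[i][j-1]   + ins_cost(b[j-1]) )"""
    D = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = D[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        D[0][j] = D[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(a[i - 1]),
                          D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                          D[i][j - 1] + ins_cost(b[j - 1]))
    return D[len(a)][len(b)]

# With unit costs this reduces to the plain Levenshtein distance
unit = lambda *args: 1.0
match_or_one = lambda x, y: 0.0 if x == y else 1.0
print(weighted_edit_distance("dijitl", "digital", unit, unit, match_or_one))  # 2.0
```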

24
DTW
  • DTW uses the same technique for matching spoken
    words
  • The input is separately matched to each
    dictionary word
  • and the word with the least distortion is the
    winner!
  • When waveforms are used the comparison measure
    is
  • correlation, segmental SNR, Itakura-Saito, etc
  • When (N) features are used the comparison is
  • (N-dimensional) Euclidean distance
  • With DTW there is no labeling;
    alignment and dictionary matching are performed together
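A minimal DTW sketch (my own illustration with invented templates, not the presenter's code): each utterance is a sequence of N-dimensional feature vectors, the local measure is Euclidean distance, and the dictionary word with the least warped distortion wins.

```python
import numpy as np

def dtw_distance(x, y):
    """x, y: arrays of shape (Tx, N) and (Ty, N); returns the warped distance."""
    tx, ty = len(x), len(y)
    D = np.full((tx + 1, ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])    # Euclidean local distance
            D[i, j] = local + min(D[i - 1, j],             # stretch x
                                  D[i, j - 1],             # stretch y
                                  D[i - 1, j - 1])         # advance both
    return D[tx, ty]

def recognize(utterance, dictionary):
    """dictionary maps word -> reference feature sequence; least distortion wins."""
    return min(dictionary, key=lambda w: dtw_distance(utterance, dictionary[w]))

# Toy example: 2-dimensional features, hypothetical reference templates
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(0.0, 1.0, (12, 2)), "no": rng.normal(3.0, 1.0, (10, 2))}
test = templates["no"][::2] + rng.normal(0.0, 0.1, (5, 2))  # a time-warped, noisy "no"
print(recognize(test, templates))                           # -> no
```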

25
DTW - cont.
  • Some more details
  • In isolated word recognition systems
  • energy contours are used to cut the words
  • linear time warping is then used to normalize the
    utterance
  • special mechanisms are used for endpoint location
    flexibility
  • so there are endpoint and path constraints
  • In connected word recognition systems
  • the endpoint of each recognized utterance is used
  • as a starting point for searching for the next
    word
  • In speaker-independent recognition systems
  • we need multiple templates for each reference
    word
  • the number of templates can be reduced by VQ

26
Markov Models
  • An alternative to DTW is based on Markov Models
  • A discrete-time left-to-right first order Markov
    model
  • A DT LR second order Markov model

[Diagram: states 1 → 2 → 3 → 4 with transition probabilities a_ij]
First-order model:  a11 + a12 = 1, a22 + a23 = 1, etc.
Second-order model: a11 + a12 + a13 = 1, a22 + a23 + a24 = 1, etc.
27
Markov Models - cont.
  • General DT Markov Model
  • Model jumps from state to state with given
    probabilities
  • e.g. 1 1 1 1 2 2 3 3 3 3 3 3 3 3 4 4 4
  • or 1 1 2 2 2 2 2 2 2 2 2 4 4 4 (LR
    models)

28
Markov Models - cont.
  • Why use Markov models for speech recognition?
  • States can represent phonemes (or whatever)
  • Different phoneme durations (but exponentially
    decaying)
  • Phoneme deletions using 2nd or higher order
  • So time warping is automatic !
  • We build a Markov model for each word
  • given an utterance, select the most probable word
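For illustration, a small sketch of scoring a state sequence under a left-to-right word model (the transition values are invented); each candidate word gets such a model, and the most probable one is selected.

```python
import numpy as np

# Hypothetical 4-state left-to-right word model: each row sums to 1
# (a11 + a12 = 1, a22 + a23 = 1, ...); states are 0-based here.
A = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.8, 0.2],
              [0.0, 0.0, 0.0, 1.0]])

def sequence_log_prob(states, A, initial=0):
    """Log-probability of a state sequence such as 1 1 1 2 2 3 3 3 4 4."""
    if states[0] != initial:
        return -np.inf
    return float(sum(np.log(A[s, t]) for s, t in zip(states[:-1], states[1:])))

print(sequence_log_prob([0, 0, 0, 1, 1, 2, 2, 2, 3, 3], A))
```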

29
HMM
  • But the same phoneme can be said in different
    ways
  • so we need a Hidden Markov Model

[Diagram: a 4-state left-to-right HMM with self-transitions a11, a22, a33, a44,
forward transitions a12, a23, a34, and observation probabilities b11, b12, b13, b14
(etc.) attached to each state]
aij are transition probabilities; bik are
observation (output) probabilities of the
acoustic phenomena
b11 + b12 + b13 + b14 = 1, b21 + b22 + b23 + b24 = 1, etc.
30
HMM - cont.
  • For a given state sequence S1 S2 S3 ... ST
  • the probability of an observation sequence
    O1 O2 O3 ... OT
  • is P(O|S) = b[S1,O1] b[S2,O2] b[S3,O3] ... b[ST,OT]
  • For a given hidden Markov model M = {pi, a, b}
  • the probability of the state sequence S1 S2 S3
    ... ST
  • is (the initial probability of S1 is taken to be
    pi[S1])
  • P(S|M) = pi[S1] a[S1,S2] a[S2,S3] a[S3,S4]
    ... a[ST-1,ST]
  • So, for a given hidden Markov model M
  • the probability of an observation sequence O1 O2
    O3 ... OT
  • is obtained by summing over all possible state
    sequences

31
HMM - cont.
  • P(O|M) = Σ_S P(O|S) P(S|M)
  •        = Σ_S pi[S1] b[S1,O1] a[S1,S2]
    b[S2,O2] a[S2,S3] b[S3,O3] ...
  • So for an observation sequence we can find
  • the probability of any word model
  • (but this can be made more efficient using the
    forward-backward algorithm)
  • How do we train HMM models?
  • The full algorithm (Baum-Welch) is an EM
    algorithm
  • An approximate algorithm is based on the Viterbi
    (DP) algorithm
  • (Sorry, but beyond our scope here)
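Although the training details are out of scope, the quantity P(O|M) itself is easy to compute with the forward recursion; a minimal sketch with invented model values (not from the slides):

```python
import numpy as np

# Hypothetical 3-state, 4-symbol HMM:  M = {pi, a, b}
pi = np.array([1.0, 0.0, 0.0])          # initial state probabilities
a = np.array([[0.6, 0.4, 0.0],          # transition probabilities a[i,j]
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
b = np.array([[0.5, 0.3, 0.1, 0.1],     # observation probabilities b[i,k]
              [0.1, 0.1, 0.6, 0.2],
              [0.2, 0.2, 0.2, 0.4]])

def forward_prob(obs, pi, a, b):
    """P(O|M) = sum over all state sequences of P(O|S) P(S|M),
    accumulated efficiently left to right (forward recursion)."""
    alpha = pi * b[:, obs[0]]            # alpha_1(i) = pi[i] b[i,O1]
    for o in obs[1:]:
        alpha = (alpha @ a) * b[:, o]    # alpha_t(j) = sum_i alpha_{t-1}(i) a[i,j] b[j,Ot]
    return float(alpha.sum())

obs = [0, 2, 2, 3]                       # an observation sequence O1 ... OT
print(forward_prob(obs, pi, a, b))       # probability of O under this word model
```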

32
State-of-the-Art
              Isolated digits   Isolated words   Constrained   Free speech
  Spkr dep    ≈100%             98%              95%           90%
  Spkr ind    > 98%             95%              85%           < 70%

  • NB: 95% on words is about 50% on sentences
  • only > 97% is considered useful for transcription
  • (otherwise it is more time consuming to correct
    than to type)