Title: Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
- Mark Hasegawa-Johnson
- jhasegaw@uiuc.edu
- University of Illinois at Urbana-Champaign, USA
2. Lecture 5: Generalization Error and Support Vector Machines
- Observation Vector, Summary Statistic, Principal Components Analysis (PCA)
- Risk Minimization
- If the Posterior Probability is known, MAP is optimal
- Example: Linear Discriminant Analysis (LDA)
- When the true Posterior is unknown: Generalization Error
- VC Dimension, and bounds on Generalization Error
- Lagrangian Optimization
- Linear Support Vector Machines
- The SVM Optimality Metric
- Lagrangian Optimization of the SVM Metric
- Hyper-parameters and Over-training
- Kernel-Based Support Vector Machines
- Kernel-based classification: optimization formulas
- Hyperparameters and Over-training
- The Entire Regularization Path of the SVM
- High-Dimensional Linear SVM
- Text classification using indicator functions
- Speech acoustic classification using redundant features
3. What is an Observation?
- An observation can be:
- A vector created by vectorizing many consecutive MFCC or mel-spectra
- A vector including MFCC, formants, pitch, PLP, auditory model features, ...
4. Normalized Observations
5. Plotting the Observations, Part 1: Scatter Plots and Histograms
6. Problem: Where is the Information in a 1000-Dimensional Vector?
7. Statistics that Summarize a Training Corpus
8. Summary Statistics: Matrix Notation
Examples of y = -1
Examples of y = +1
9. Eigenvectors and Eigenvalues of R
10. Plotting the Observations, Part 2: Principal Components Analysis
11. What Does PCA Extract from the Spectrogram? Plot a PCA-Gram
- 1024-dimensional principal component → 32×32 spectrogram, plotted as an image (a sketch of this reshaping follows this list)
- 1st principal component (not shown) measures total energy of the spectrogram
- 2nd principal component: E(after landmark) − E(before landmark)
- 3rd principal component: E(at the landmark) − E(surrounding syllables)
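As a rough illustration of the PCA-gram idea, the sketch below vectorizes landmark-centered spectrogram patches, computes the principal components from the sample covariance, and reshapes a component back into a 32×32 image. The patch size, number of tokens, and random data are assumptions made purely for illustration; the real corpus and landmark alignment are not specified here.

```python
import numpy as np

# Hypothetical data: M landmark-centered spectrogram patches, each 32 frames x 32 bins,
# vectorized into 1024-dimensional observation vectors (shapes assumed for illustration).
M = 500
X = np.random.randn(M, 32 * 32)          # stand-in for real spectrogram patches

# Center the data and form the sample covariance R = (1/M) * Xc^T Xc
X_centered = X - X.mean(axis=0)
R = (X_centered.T @ X_centered) / M

# Eigenvectors of R, sorted by decreasing eigenvalue, are the principal components
eigvals, eigvecs = np.linalg.eigh(R)      # eigh: R is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project observations onto the first two components for a 2-D scatter plot,
# and reshape a component into a 32x32 image (a "PCA-gram") for inspection.
projections = X_centered @ eigvecs[:, :2]
pca_gram_2nd = eigvecs[:, 1].reshape(32, 32)
```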
12. Minimum-Risk Classifier Design
13. True Risk, Empirical Risk, and Generalization
14. When the PDF is Known, Maximum A Posteriori (MAP) is Optimal
15. Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio
16. MAP Example: Gaussians with Equal Covariance
17. Linear Discriminant Projection of the Data
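A minimal sketch of the equal-covariance Gaussian case: the log likelihood ratio reduces to a linear discriminant, so the MAP decision tests the sign of v·x + b. The class means, covariance, and priors below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical two-class Gaussian model with shared covariance (values for illustration only)
mu_pos = np.array([1.0, 0.5])      # mean of class y = +1
mu_neg = np.array([-1.0, -0.5])    # mean of class y = -1
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
prior_pos, prior_neg = 0.5, 0.5

# Equal covariances make the log likelihood ratio linear in x:
#   log LR(x) = v^T x + b,  with v = Sigma^{-1} (mu_pos - mu_neg)
Sigma_inv = np.linalg.inv(Sigma)
v = Sigma_inv @ (mu_pos - mu_neg)
b = -0.5 * (mu_pos @ Sigma_inv @ mu_pos - mu_neg @ Sigma_inv @ mu_neg) \
    + np.log(prior_pos / prior_neg)

def map_classify(x):
    """MAP decision: test the sign of the log likelihood ratio."""
    return 1 if v @ x + b > 0 else -1

print(map_classify(np.array([0.8, 0.2])))   # expected: +1 (closer to mu_pos)
```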
18. Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize Remp(v, b))
19. A Serious Problem: Over-Training
The same projection, applied to new test data
Minimum-Error projection of training data
20. When the True PDF is Unknown: Upper Bounds on True Risk
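The slide's figure is not reproduced here; for reference, the standard Vapnik-style bound that this kind of slide usually presents relates true risk R to empirical risk Remp through the VC dimension h and the number of training samples M, holding with probability at least 1 − η. The exact constants on the original slide may differ.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\tfrac{2M}{h} + 1\right) - \ln\tfrac{\eta}{4}}{M}}
```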
21. The VC Dimension of a Hyperplane Classifier
22. Schematic Depiction: w Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to over-train)
23. The SVM: An Optimality Criterion
24. Lagrangian Optimization with an Inequality Constraint
- Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
- g(v) = 0: the g(v) = 0 curve is tangent to the f(v) = fmin curve at v = v*
- g(v) > 0: v* minimizes f(v) (the unconstrained minimum already satisfies the constraint)
Diagram from Osborne, 2004
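A compact statement of the two cases the diagram illustrates, written in the notation of the slides (this is the standard Karush-Kuhn-Tucker setup; the slide's own algebra is not reproduced here):

```latex
% Minimize f(v) subject to g(v) >= 0; Lagrangian L(v,a) = f(v) - a g(v), with a >= 0.
\begin{aligned}
\nabla_v f(v^\ast) &= a\,\nabla_v g(v^\ast), \qquad a \ge 0, \qquad a\,g(v^\ast) = 0,\\[4pt]
\text{so either}\quad & a > 0 \ \text{and}\ g(v^\ast) = 0 \quad \text{(Case 1: constraint active)},\\
\text{or}\quad & a = 0 \ \text{and}\ g(v^\ast) > 0 \quad \text{(Case 2: unconstrained minimum)}.
\end{aligned}
```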
25. Case 1: gm(v) = 0
26. Case 2: gm(v) > 0
27. Training an SVM
28. Differentiate the Lagrangian
29. ... now Simplify the Lagrangian
30. ... and Impose the Kuhn-Tucker Conditions
31. Three Types of Vectors
Interior vector: a = 0
Margin support vector: 0 < a < C
Error: a = C
Partial error: a = C
From Hastie et al., NIPS 2004
32. ... and finally, Solve the SVM
33. Quadratic Programming
ai2 is off the margin: truncate to ai2 = 0. ai1 is still a margin candidate: solve for it again in iteration i+1.
34. Linear SVM Example
35. Linear SVM Example
36. Choosing the Hyper-Parameter to Avoid Over-Training (Wang, Presentation at CLSP Workshop WS04)
SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.
37. Choosing the Hyper-Parameter to Avoid Over-Training
- Recall that v = Σm am ym xm
- Therefore, ||v|| < (C Σm ||xm||²)^(1/2) < (C M max ||xm||²)^(1/2)
- Therefore, the width of the margin is constrained to 1/||v|| > (C M max ||xm||²)^(-1/2), and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors
- Recommended solution (a minimal sketch follows this list):
- Normalize xm so that max ||xm|| = 1 (e.g., using libsvm)
- Set C = 1/M
- If desired, adjust C up or down by a factor of 2, to see if the error rate on independent development test data decreases
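A minimal sketch of this recipe; scikit-learn's SVC is used here purely for illustration (the slides mention libsvm, and any SVM package with an adjustable C would do), and the training data are random stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: M tokens, K-dimensional observations (for illustration only)
M, K = 2000, 40
X = np.random.randn(M, K)
y = np.sign(np.random.randn(M))

# Normalize so that max ||x_m|| = 1, as recommended on the slide
X = X / np.max(np.linalg.norm(X, axis=1))

# Start from C = 1/M, then try doubling/halving and keep whichever value
# gives the lowest error on an independent development set.
for C in [0.5 / M, 1.0 / M, 2.0 / M]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # ... evaluate clf on development data here, keep the best C ...
```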
38. From Linear to Nonlinear SVM
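For reference, the kernelized decision function has the same form as the linear one with inner products replaced by a kernel; written in the slides' notation, with the RBF kernel whose width parameter γ appears on the later hyperparameter slides:

```latex
f(x) \;=\; \sum_{m} a_m\, y_m\, K(x, x_m) + b,
\qquad
K(x, x_m) \;=\; \exp\!\left(-\gamma\, \lVert x - x_m \rVert^{2}\right)
```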
39. Example: RBF Classifier
40. An RBF Classification Boundary
41. Two Hyperparameters → Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
42. Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
From Hastie et al., NIPS 2004
43. SVM is a Regularized Learner (λ = 1/C)
44. SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
45. The Entire Regularization Path of the SVM: Algorithm (Hastie, Zhu, Tibshirani, and Rosset, NIPS 2004)
- Start with λ large enough (C small enough) that all training tokens are partial errors (am = C). Compute the solution to the quadratic programming problem in this case, including inversion of X^T X or X X^T.
- Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, am = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dam/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have am = C. Calculate the new values of dam/dλ for these two training vectors.
- Iteratively find the next breakpoint. The next breakpoint occurs when one of the following happens:
- A value of am that was on the margin leaves the margin, i.e., the piece-wise-linear function am(λ) hits am = 0 or am = C.
- One or more interior points enter the margin, i.e., in the QP problem, am = 0 becomes the unconstrained solution rather than just the constrained solution.
- One or more partial-error points enter the margin, i.e., in the QP problem, am = C becomes the unconstrained solution rather than just the constrained solution.
46. One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
- Download the SVMPath code from Trevor Hastie's web page.
- Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
- For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of am.
- Test the SVMs on a separate development test database for each combination (C, γ) and find the development test error. Choose the combination that gives the least development test error. (A sketch of this selection loop follows the list.)
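The SVMPath package itself is R code; the sketch below only illustrates the selection loop of the recipe above, substituting an ordinary grid of C values for the exact C-breakpoints. The function name, data arguments, and grid values are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def pick_hyperparameters(X_train, y_train, X_dev, y_dev, K):
    """Try gamma values around 1/K and, for each, a range of C values,
    keeping the (C, gamma) pair with the lowest development-set error."""
    best = (None, None, np.inf)
    for gamma in [0.1 / K, 1.0 / K, 10.0 / K]:      # a few orders of magnitude around 1/K
        for C in np.logspace(-3, 3, 13):            # stand-in for the SVMPath C-breakpoints
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
            err = np.mean(clf.predict(X_dev) != y_dev)
            if err < best[2]:
                best = (C, gamma, err)
    return best                                     # (best C, best gamma, dev error)
```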
47. Results: RBF SVM
SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.
Wang, WS04 Student Presentation, 2004
48. High-Dimensional Linear SVMs
49. Motivation: Project it Yourself
- The purpose of a nonlinear SVM:
- f(x) contains higher-order polynomial terms in the elements of x.
- By combining these higher-order polynomial terms, Σm am ym K(x, xm) can create a more flexible boundary than can Σm am ym x^T xm.
- The flexibility of the boundary does not lead to generalization error; the regularization term λ||v||² avoids generalization error.
- A different approach (sketched after this list):
- Augment x with higher-order terms, up to a very large dimension. These terms can include:
- Polynomial terms, e.g., xi xj
- N-gram terms, e.g., (xi at time t AND xj at time t)
- Other features suggested by knowledge-based analysis of the problem
- Then apply a linear SVM to the higher-dimensional problem
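A minimal sketch of the "project it yourself" idea: augment each observation with hand-chosen higher-order terms, then train an ordinary linear SVM on the augmented vectors. The specific augmentations, dimensions, and random data below are assumptions for illustration, not the feature set used in the examples that follow.

```python
import numpy as np
from sklearn.svm import LinearSVC

def augment(x_t, x_prev):
    """Augment one frame with higher-order terms: pairwise products within the
    frame (polynomial terms x_i * x_j) and products across adjacent frames."""
    pair = np.outer(x_t, x_t)[np.triu_indices(len(x_t))]   # polynomial terms
    cross = x_t * x_prev                                    # simple cross-time terms
    return np.concatenate([x_t, pair, cross])

# Hypothetical frames (dimensions for illustration only)
X = np.random.randn(300, 10)                    # 300 frames, 10 features each
X_aug = np.array([augment(X[t], X[t - 1]) for t in range(1, len(X))])
y = np.sign(np.random.randn(len(X_aug)))

clf = LinearSVC(C=1.0 / len(X_aug)).fit(X_aug, y)   # linear SVM on the augmented features
```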
50. Example 1: Acoustic Classification of Stop Place of Articulation
- Feature dimension: K = 483 per 10 ms
- MFCCs + deltas + delta-deltas, 25 ms window: K = 39 per 10 ms
- Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond: K = 40 per 10 ms
- Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10 per 10 ms
- Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42 per 10 ms
- Rate-place model of neural response fields in the cat auditory cortex: K = 352 per 10 ms
- Observation: concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions (a frame-stacking sketch follows this list)
- Results: accuracy improves as more features are added, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn't help.
- The RBF SVM still outperforms the linear SVM, but only by 1%.
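A sketch of the frame-stacking step; the dimensions follow the slide (483 features per 10 ms frame, 7 frames stacked into 3381 dimensions), while the feature values themselves are simulated for illustration.

```python
import numpy as np

def stack_frames(features, num_frames=7):
    """Concatenate num_frames consecutive 483-dim frames centered on each frame,
    yielding one 7 x 483 = 3381-dimensional observation per center frame."""
    half = num_frames // 2
    stacked = [features[t - half:t + half + 1].ravel()
               for t in range(half, len(features) - half)]
    return np.array(stacked)

frames = np.random.randn(1000, 483)        # simulated 10-ms feature frames
X = stack_frames(frames)
print(X.shape)                             # (994, 3381)
```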
51. Example 2: Text Classification
- Goal:
- Utterances were recorded by physical therapy patients, specifying their physical activity once per half hour for seven days.
- Example utterance: "I ate breakfast for twenty minutes, then I walked to school for ten minutes."
- Goal: for each time period, determine the type of physical activity, from among 2000 possible type categories.
- Indicator features (a sketch follows this list):
- 50,000 features: one per word in a 50,000-word dictionary
- x = [d1, d2, d3, ..., d50000]^T
- di = 1 if the i-th dictionary word was contained in the utterance, zero otherwise
- x is very sparse: most sentences contain only a few words
- A linear SVM is very efficient
52. Example 2: Text Classification
- Result:
- 85% classification accuracy
- Most incorrect classifications were reasonable to a human
- "I played hopscotch with my daughter": playing a game, or light physical exercise?
- Some categories were never observed in the training data; therefore no test data were assigned to those categories
- Conclusion: the SVM is learning keywords and keyword combinations
53. Summary
- Plotting the Data: use PCA, LDA, or any other discriminant
- If the PDF is known: use the MAP classifier
- If the PDF is unknown: Structural Risk Minimization
- The SVM is a training criterion: a particular upper bound on the structural risk of a hyperplane
- Choosing hyperparameters:
- Easy for a linear classifier
- For a nonlinear classifier: use the Complete Regularization Path algorithm
- High-dimensional Linear SVMs: the human user acts as an intelligent kernel