1
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
  • Mark Hasegawa-Johnson
  • jhasegaw@uiuc.edu
  • University of Illinois at Urbana-Champaign, USA

2
Lecture 5: Generalization Error and Support Vector Machines
  • Observation Vector → Summary Statistic → Principal Components Analysis (PCA)
  • Risk Minimization
  • If the Posterior Probability is known: MAP is optimal
  • Example: Linear Discriminant Analysis (LDA)
  • When the true Posterior is unknown: Generalization Error
  • VC Dimension, and bounds on Generalization Error
  • Lagrangian Optimization
  • Linear Support Vector Machines
  • The SVM Optimality Metric
  • Lagrangian Optimization of SVM Metric
  • Hyper-parameters and Over-training
  • Kernel-Based Support Vector Machines
  • Kernel-based classification and optimization formulas
  • Hyperparameters and Over-training
  • The Entire Regularization Path of the SVM
  • High-Dimensional Linear SVM
  • Text classification using indicator functions
  • Speech acoustic classification using redundant
    features

3
What is an Observation?
  • An observation can be:
  • A vector created by vectorizing many consecutive MFCC or mel-spectral frames (a minimal frame-stacking sketch follows below)
  • A vector including MFCCs, formants, pitch, PLP, auditory model features, …
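A minimal sketch of the first option above (vectorizing consecutive frames), assuming Python with librosa and numpy and a hypothetical file name utterance.wav; the frame counts and dimensions are illustrative, not the lecture's.

import numpy as np
import librosa  # assumed available for MFCC extraction

# Load audio and compute 13 MFCCs per 10 ms frame (16 kHz, 160-sample hop, 25 ms window).
signal, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            hop_length=160, n_fft=400)    # shape: (13, n_frames)

# Vectorize 7 consecutive frames (3 on each side of the center frame)
# into a single observation vector per center frame.
context = 3
frames = mfcc.T                                           # (n_frames, 13)
observations = np.stack(
    [frames[t - context:t + context + 1].ravel()          # 7 x 13 = 91 dimensions
     for t in range(context, len(frames) - context)])
print(observations.shape)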

4
Normalized Observations
5
Plotting the Observations, Part I: Scatter Plots and Histograms
6
Problem: Where is the Information in a 1000-Dimensional Vector?
7
Statistics that Summarize a Training Corpus
8
Summary Statistics: Matrix Notation
Examples of y = −1
Examples of y = +1
9
Eigenvectors and Eigenvalues of R
10
Plotting the Observations, Part 2: Principal Components Analysis
11
What Does PCA Extract from the Spectrogram? Plot a PCA-gram
  • 1024-dimensional principal component → 32×32 spectrogram, plotted as an image (a minimal sketch follows below)
  • 1st principal component (not shown) measures total energy of the spectrogram
  • 2nd principal component: E(after landmark) − E(before landmark)
  • 3rd principal component: E(at the landmark) − E(surrounding syllables)
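A minimal sketch of the PCA-gram idea, assuming scikit-learn and a random stand-in matrix in place of the real 1024-dimensional spectrogram patches: fit PCA, then reshape each principal component back into a 32×32 image.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in training data: M patches, each a 32x32 spectrogram vectorized to 1024 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32 * 32))       # replace with real spectrogram patches

pca = PCA(n_components=4).fit(X)

# Each principal component is a 1024-dimensional direction; reshaping it to 32x32
# and displaying it as an image shows what that component measures in the spectrogram.
pcagrams = pca.components_.reshape(-1, 32, 32)
for k, comp in enumerate(pcagrams, start=1):
    print(f"PC {k}: min {comp.min():.2f}, max {comp.max():.2f}")   # or plt.imshow(comp)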

12
Minimum-Risk Classifier Design
13
True Risk, Empirical Risk, and Generalization
14
When the PDF is Known, Maximum A Posteriori (MAP) is Optimal
15
Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio
16
MAP Example: Gaussians with Equal Covariance
17
Linear Discriminant Projection of the Data
18
Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize Remp(v, b))
19
A Serious Problem: Over-Training
The same projection, applied to new test data
Minimum-Error projection of training data
20
When the True PDF is Unknown: Upper Bounds on True Risk
21
The VC Dimension of a Hyperplane Classifier
22
Schematic Depiction: w Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to over-train)
23
The SVM: An Optimality Criterion
24
Lagrangian Optimization: Inequality Constraint
  • Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist (a numerical sketch follows the diagram note below):
  • g(v*) = 0 (constraint active):
  • The g(v) = 0 curve is tangent to the f(v) = fmin curve at v = v*
  • g(v*) > 0 (constraint inactive):
  • v* is the unconstrained minimum of f(v)

[Diagram from Osborne, 2004: level curves of f(v) and the constraint boundary g(v) = 0 separating the regions g(v) < 0 and g(v) > 0, showing the constrained minimum v* for each of the two cases.]
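A small numerical sketch of the two cases, assuming scipy (the lecture works the Lagrangian analytically): minimize a quadratic f(v) subject to g(v) ≥ 0, once with the unconstrained minimum infeasible (the solution lands on g(v*) = 0) and once with it feasible (g(v*) > 0).

import numpy as np
from scipy.optimize import minimize

def solve(center):
    # f(v) = ||v - center||^2, subject to g(v) = v[0] - 1 >= 0
    f = lambda v: np.sum((v - center) ** 2)
    g = {"type": "ineq", "fun": lambda v: v[0] - 1.0}
    res = minimize(f, x0=np.zeros(2), constraints=[g])
    print(f"center={center}, v*={np.round(res.x, 3)}, g(v*)={res.x[0] - 1.0:+.3f}")

solve(np.array([-1.0, 0.0]))  # unconstrained minimum violates g: solution sits on g(v*) = 0
solve(np.array([2.0, 0.0]))   # unconstrained minimum is feasible: g(v*) > 0, constraint inactive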
25
Case 1: gm(v*) = 0
26
Case 2: gm(v*) > 0
27
Training an SVM
28
Differentiate the Lagrangian
29
now Simplify the Lagrangian
30
and impose Kuhn-Tucker
31
Three Types of Vectors
Interior Vector: α = 0
Margin Support Vector: 0 < α < C
Error: α = C
Partial Error: α = C
From Hastie et al., NIPS 2004
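The three types can be read off a trained SVM's dual coefficients. A minimal sketch assuming scikit-learn's SVC (the lecture does not prescribe a toolkit): support vectors with 0 < αm < C sit on the margin, those with αm = C are errors or partial errors, and training vectors that are not support vectors are interior (αm = 0).

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()     # |alpha_m * y_m| for the support vectors only
n_margin = np.sum(alpha < C - 1e-8)        # margin support vectors: 0 < alpha_m < C
n_bound = np.sum(alpha >= C - 1e-8)        # errors / partial errors: alpha_m = C
n_interior = len(X) - len(alpha)           # interior vectors: alpha_m = 0 (not stored by SVC)
print(n_interior, n_margin, n_bound)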
32
and finally, Solve the SVM
33
Quadratic Programming
[Plot: iterative quadratic programming in the (αi1, αi2) plane, with both coefficients bounded by C.]
αi2 is off the margin: truncate to αi2 = 0. αi1 is still a margin candidate: solve for it again in iteration i+1.
34
Linear SVM Example
35
Linear SVM Example
36
Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04)
SVM test corpus error vs. λ = 1/C, for classification of nasal vs. non-nasal vowels.
37
Choosing the Hyper-Parameter to Avoid Over-Training
  • Recall that v = Σm αm ym xm
  • Therefore, ‖v‖ < (C Σm ‖xm‖)^(1/2) < (C M max ‖xm‖)^(1/2)
  • Therefore, the width of the margin is constrained to 1/‖v‖ > (C M max ‖xm‖)^(-1/2), and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors
  • Recommended solution (a minimal sketch follows below):
  • Normalize xm so that max ‖xm‖ = 1 (e.g., using libsvm)
  • Set C = 1/M
  • If desired, adjust C up or down by a factor of 2, to see if the error rate on independent development test data decreases
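A minimal sketch of the recommended recipe, assuming scikit-learn (whose SVC wraps libsvm) and a synthetic train/development split; the scaling step shown is one way to enforce max ‖xm‖ = 1, and the data are placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.25, random_state=0)

# Normalize so that max ||x_m|| = 1 over the training set.
scale = np.max(np.linalg.norm(X_train, axis=1))
X_train, X_dev = X_train / scale, X_dev / scale

# Start from C = 1/M and adjust up or down by factors of 2, keeping whatever
# minimizes the error on the independent development data.
M = len(X_train)
for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    C = factor / M
    acc = SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_dev, y_dev)
    print(f"C = {C:.2g}: development accuracy = {acc:.3f}")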

38
From Linear to Nonlinear SVM
39
Example RBF Classifier
40
An RBF Classification Boundary
41
Two Hyperparameters → Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
42
Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
From Hastie et al., NIPS 2004
43
SVM is a Regularized Learner (λ = 1/C)
44
SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
45
The Entire Regularization Path of the SVM Algorithm (Hastie, Zhu, Tibshirani and Rosset, NIPS 2004)
  • Start with λ large enough (C small enough) that all training tokens are partial errors (αm = C). Compute the solution of the quadratic programming problem in this case, including inversion of XTX or XXT.
  • Reduce λ (increase C) until the initial event occurs: two partial error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dαm/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have αm = C. Calculate the new values of dαm/dλ for these two training vectors.
  • Iteratively find the next breakpoint (a brute-force approximation is sketched below). The next breakpoint occurs when one of the following occurs:
  • A value of αm that was on the margin leaves the margin, i.e., the piece-wise-linear function αm(λ) hits αm = 0 or αm = C.
  • One or more interior points enter the margin, i.e., in the QP problem, αm = 0 becomes the unconstrained solution rather than just the constrained solution.
  • One or more partial error points enter the margin, i.e., in the QP problem, αm = C becomes the unconstrained solution rather than just the constrained solution.
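The exact path algorithm follows these breakpoints analytically; as a rough, brute-force stand-in (an assumption, not Hastie et al.'s method), one can refit an SVM over a grid of λ = 1/C values and watch the dual coefficients move between αm = C and the margin. A minimal scikit-learn sketch:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=60, n_features=2, n_redundant=0, random_state=1)

# Sweep lambda = 1/C; the exact algorithm instead tracks the breakpoints where
# d(alpha_m)/d(lambda) changes, without refitting at every value of C.
for C in np.logspace(-3, 2, 6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_).ravel()
    at_bound = int(np.sum(np.isclose(alpha, C)))       # partial errors: alpha_m = C
    on_margin = len(alpha) - at_bound                  # margin vectors: 0 < alpha_m < C
    print(f"lambda = 1/C = {1 / C:8.3f}: {at_bound:2d} at alpha = C, {on_margin:2d} on the margin")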

46
One Method for Using SVMPath (WS04, Johns
Hopkins, 2004)
  • Download the SVMPath code from Trevor Hastie's web page
  • Test several values of γ, including values within a few orders of magnitude of γ = 1/K
  • For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of αm
  • Test the SVMs on a separate development test database; for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error (a rough grid-search substitute is sketched below)
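If SVMPath is not available, a rough substitute for steps 2-4 (an assumption, not the WS04 procedure itself) is a grid search over (C, γ) scored on the development set, using a log-spaced C grid in place of the exact C-breakpoints. A minimal scikit-learn sketch with placeholder data:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.3, random_state=0)

K = X.shape[1]
gammas = [g / K for g in (0.01, 0.1, 1.0, 10.0, 100.0)]   # a few orders of magnitude around 1/K
Cs = np.logspace(-2, 2, 9)                                # stand-in for the SVMPath C-breakpoints

# Pick the (C, gamma) combination with the least development-set error.
best = max((SVC(kernel="rbf", C=C, gamma=g).fit(X_tr, y_tr).score(X_dev, y_dev), C, g)
           for g in gammas for C in Cs)
print("best development accuracy %.3f at C = %.3g, gamma = %.3g" % best)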

47
Results, RBF SVM
SVM test corpus error vs. λ = 1/C, for classification of nasal vs. non-nasal vowels.
Wang, WS04 Student Presentation, 2004
48
High-Dimensional Linear SVMs
49
Motivation: Project it Yourself
  • The purpose of a nonlinear SVM:
  • f(x) contains higher-order polynomial terms in the elements of x
  • By combining these higher-order polynomial terms, Σm ym αm K(x, xm) can create a more flexible boundary than can Σm ym αm xTxm
  • The flexibility of the boundary does not lead to generalization error; the regularization term λ‖v‖² avoids generalization error
  • A different approach (a minimal sketch follows this list):
  • Augment x with higher-order terms, up to a very large dimension. These terms can include:
  • Polynomial terms, e.g., xi xj
  • N-gram terms, e.g., (xi at time t AND xj at time t)
  • Other features suggested by knowledge-based analysis of the problem
  • Then apply a linear SVM to the higher-dimensional problem
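A minimal sketch of the "project it yourself" approach, assuming scikit-learn: augment x with explicit second-order polynomial terms xi·xj, then train a plain linear SVM on the augmented vector. PolynomialFeatures and LinearSVC are stand-ins here; the slide's N-gram and knowledge-based features would be appended to x in the same way.

from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Augment x with polynomial terms x_i * x_j, then apply a *linear* SVM
# to the higher-dimensional vector.
augmented = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LinearSVC(C=0.01, max_iter=20000))
plain = LinearSVC(C=0.01, max_iter=20000)

print("augmented linear SVM:", cross_val_score(augmented, X, y, cv=5).mean())
print("plain linear SVM:    ", cross_val_score(plain, X, y, cv=5).mean())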

50
Example 1: Acoustic Classification of Stop Place of Articulation
  • Feature dimension: K = 483 per 10 ms
  • MFCCs + Δ + ΔΔ, 25 ms window: K = 39 per 10 ms
  • Spectral shape (energy, spectral tilt, and spectral compactness), once per millisecond: K = 40 per 10 ms
  • Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10 per 10 ms
  • Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42 per 10 ms
  • Rate-place model of neural response fields in the cat auditory cortex: K = 352 per 10 ms
  • Observation: concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions
  • Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn't help.
  • RBF SVM still outperforms linear SVM, but only by 1%

51
Example 2: Text Classification
  • Goal:
  • Utterances were recorded by physical therapy patients, specifying their physical activity once per half hour for seven days
  • Example utterance: "I ate breakfast for twenty minutes, then I walked to school for ten minutes."
  • Goal: for each time period, determine the type of physical activity, from among 2000 possible activity-type categories
  • Indicator features (a minimal sketch follows this list):
  • 50000 features: one per word in a 50000-word dictionary
  • x = [d1, d2, d3, …, d50000]T
  • di = 1 if the i-th dictionary word was contained in the utterance, zero otherwise
  • x is very sparse: most sentences contain only a few words
  • Linear SVM is very efficient
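A minimal sketch of the indicator-feature setup, assuming scikit-learn: CountVectorizer(binary=True) builds the sparse di ∈ {0, 1} vectors, and LinearSVC handles the sparse, high-dimensional input efficiently. The utterances below reuse the slide's example sentence plus invented ones, and the activity labels are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Placeholder utterances and activity labels (the real task had 2000 categories).
utterances = ["I ate breakfast for twenty minutes",
              "then I walked to school for ten minutes",
              "I played a game with my daughter",
              "I walked around the park for an hour"]
labels = ["eating", "walking", "playing", "walking"]

# Binary indicator features: d_i = 1 if dictionary word i occurs in the utterance.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(utterances)       # sparse matrix, one column per dictionary word
clf = LinearSVC().fit(X, labels)

print(clf.predict(vectorizer.transform(["I walked to the store"])))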

52
Example 2: Text Classification
  • Results:
  • 85% classification accuracy
  • Most incorrect classifications were reasonable to a human
  • "I played hopscotch with my daughter": playing a game, or light physical exercise?
  • Some categories were never observed in the training data; therefore, no test data were assigned to those categories
  • Conclusion: the SVM is learning keywords and keyword combinations

53
Summary
  • Plotting the data: use PCA, LDA, or any other discriminant
  • If the PDF is known: use the MAP classifier
  • If the PDF is unknown: Structural Risk Minimization
  • SVM is a training criterion: a particular upper bound on the structural risk of a hyperplane
  • Choosing hyperparameters:
  • Easy for a linear classifier
  • For a nonlinear classifier: use the Complete Regularization Path algorithm
  • High-dimensional linear SVMs: the human user acts as an intelligent kernel