Title: My Slides
1. My Slides
- Support vector machines (brief intro)
- WS04: What we accomplished
- WS04: Organizational lessons
- AVICAR video corpus: current status
2. Support Vector Machines as (sort of) compared to Neural Networks
Difficult to do, because they have never been compared head-to-head on any speech task!
3. SVM = Regularized Nonlinear Discriminant
Kernel transform to an infinite-dimensional Hilbert space.
The only way in which an SVM differs from an RBF-NN: THE TRAINING CRITERION.
SVM discriminant dimension: c = argmin( training_error(c) + 1/width(margin(c)) ) (spelled out in standard form after this slide)
The SVM extracts a discriminant dimension, then either:
- (Bourlard/Morgan hybrid; Niyogi & Burges, 2002) Posterior PDF: sigmoid model in the discriminant dimension, OR
- (BDFK tandem; Borys & Hasegawa-Johnson, 2005) Likelihood: Gaussian mixture in the discriminant dimension.
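The training criterion above can be read as the standard soft-margin SVM objective; a sketch in the usual notation (the hinge-loss / slack-variable form below is a textbook reconstruction, not taken from the slide):

% The margin width is 2/||w||, so the ||w||^2 term penalizes 1/width(margin),
% and the slack variables xi_i upper-bound the training error.
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\|w\|^{2} \;+\; C\sum_{i=1}^{N}\xi_{i} \\
\text{s.t.}\quad & y_{i}\bigl(w^{\top}\phi(x_{i}) + b\bigr) \;\ge\; 1-\xi_{i},
\qquad \xi_{i}\ge 0,
\end{aligned}

where \phi is the kernel feature map into the infinite-dimensional Hilbert space.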
4. Binary Classifier = sign( Nonlinear Discriminant )
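A minimal sketch of slides 3 and 4 in code, using scikit-learn as a stand-in for the actual WS04 tools; the toy data, kernel settings, Platt sigmoid, and two-component Gaussian mixtures are illustrative assumptions, not the project's code.

import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

# Toy binary data standing in for acoustic observation vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# RBF-SVM: a regularized nonlinear discriminant (kernel transform + margin criterion).
svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True).fit(X, y)

# Slide 4: the binary classifier is the sign of the nonlinear discriminant.
d = svm.decision_function(X)        # the extracted "discriminant dimension"
labels = np.where(d >= 0, 1, -1)    # sign( nonlinear discriminant )

# Option 1 (Bourlard/Morgan-style hybrid): sigmoid posterior model in the
# discriminant dimension (Platt scaling, enabled by probability=True).
posterior = svm.predict_proba(X)[:, 1]

# Option 2 (tandem): class-conditional Gaussian-mixture likelihoods fit
# in the discriminant dimension, usable as HMM/DBN observation scores.
gmm_pos = GaussianMixture(n_components=2, random_state=0).fit(d[y == 1].reshape(-1, 1))
gmm_neg = GaussianMixture(n_components=2, random_state=0).fit(d[y == 0].reshape(-1, 1))
log_lik_ratio = gmm_pos.score_samples(d.reshape(-1, 1)) - gmm_neg.score_samples(d.reshape(-1, 1))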
5. Advantages of SVMs w.r.t. NNs
- Accuracy
- SVM generalizes much better from small training data sets (training tokens > 6X observation vector size)
- As training data size increases, the accuracies of NN and SVM converge
- Theoretically, and in some practical experiments too
- Like a 3-layer MLP, the RBF-SVM is a universal approximator
- Fast training: nearly quadratic optimality criterion (spelled out below)
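For reference, the "nearly quadratic optimality criterion" is the standard SVM dual, a quadratic program over the N Lagrange multipliers (textbook form, not from the slide):

\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{N}\alpha_{i}
\;-\; \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,K(x_{i},x_{j}) \\
\text{s.t.}\quad & 0 \le \alpha_{i} \le C, \qquad \sum_{i=1}^{N}\alpha_{i}y_{i} = 0,
\end{aligned}

with K the RBF kernel; the N-by-N kernel matrix in the second term is also where the O(N²) training cost on the next slide comes from.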
6. Disadvantages of SVMs w.r.t. NNs
- No way to train with a very large training set
- Complexity is O(N²): either fast or impossible
- Computational complexity during test
- Solution: Burges' reduced set method (extra training step, only available right now in Matlab)
- Accuracy: unless you optimize the hyper-parameters, accuracy is good but not great
- Exhaustive hyper-parameter training is very slow (sketched below)
- Can get good, but not great, accuracy with the theoretically correct hyper-parameters
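A sketch of the exhaustive hyper-parameter search the last two bullets refer to, with scikit-learn's GridSearchCV and toy data as illustrative stand-ins; the grid values are assumptions.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for labeled acoustic frames.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 12))
y_train = (X_train[:, 0] > 0).astype(int)

# Every (C, gamma) pair is trained once per CV fold, and each fit is roughly
# quadratic in the number of training tokens, which is why exhaustive
# hyper-parameter training gets very slow on large sets.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)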
7. Disadvantages of SVMs w.r.t. NNs (continued)
- The real problem: we need phonetically labeled training data
- Embedded re-estimation experiment:
- Pre-trained SVMs used as HMM input (tandem system)
- RBF weights re-estimated, together with the HMM params, in order to maximize the likelihood of the training data
- Result: training-data likelihood and WRA increased
- Result: test-data WRA decreased
8. WS04
9. WS04 SVM/DBN Hybrid Recognizer
[Figure: DBN decomposition of the words "A LIKE" into articulatory feature streams]
- Word: A, LIKE
- Canonical form: tongue closed, tongue mid, tongue front, tongue open
- Surface form: semi-closed, tongue open, tongue front, tongue front
- Manner: glide, front vowel
- Place: palatal
- SVM outputs: p( g_PGR(x) | palatal glide release ), p( g_GR(x) | glide release )
- x = multi-frame observation including spectrum, formants, auditory model
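A sketch of what "SVM outputs for every frame" involves: stack a multi-frame observation around each frame and evaluate each landmark SVM on it. The window width, feature layout, and predict_proba interface are assumptions for illustration; this is not the single WS04 tool mentioned on the next slide.

import numpy as np

def stack_frames(features, context=4):
    """Build multi-frame observations x_t: each row concatenates the current
    frame with `context` frames on each side (edges padded by repetition)."""
    T, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

def landmark_posteriors(features, svms):
    """Evaluate every landmark SVM (e.g. 'palatal glide release',
    'glide release') on every frame of an utterance.
    `svms` maps landmark names to trained classifiers exposing predict_proba;
    returns {name: array of p(landmark | x_t) over frames t}."""
    X = stack_frames(features)
    return {name: clf.predict_proba(X)[:, 1] for name, clf in svms.items()}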
10. WS04 Organizational Lessons: What Worked
- Innovative experiments, made possible by people who really wanted to be doing what they were doing
- Result: published ideas were interesting to many people
- Parallel SVM classification experiments allowed us to test many different SVM definitions
- Result: classification errors mostly below 20% by the end of the WS
- Parallel recognizer test experiments (DBN/SVM was one, MaxEnt-based lattice rescoring was another)
- Result: both achieved a small (nonsignificant) WER reduction over the baseline
11. WS04 Organizational Lessons: What Didn't Work
- Software bottleneck between the SVMs and the recognizers: only one tool was available to apply an SVM to every frame in a speech file, and only one person knew how to use it.
- Too many experimental variables: should SVMs be trained using (1) all frames, or (2) only landmark frames? The DBN expects (1). The HMM works best if manner features use (1) and place features use (2). And the DBN? Impossible to test in six weeks.
- Apples vs. oranges: the SVM-only classifier outputs in cases (1) and (2) were incomparable ⇒ no test short of full DBN integration is meaningful.
12. WS04 Organizational Lessons: What Didn't Work (continued)
- Unbeatable baseline: the goal was to rescore the output of the SRI recognizer in order to reduce WER ⇒ to find acoustic information not already used by the baseline recognizer.
- What information is not already used? With a phone-based ANN/HMM hybrid system, it's hard to say.
- When an experiment fails, why?
- Better: use an open-source baseline (not state of the art, but that's OK), and construct test systems in a continuum between the baseline and the target.
13. AVICAR
14. AVICAR Recording Hardware
The system is not permanently installed; mounting requires 10 minutes.
15. AVICAR Data Summary
- 100 talkers
- 5 noise conditions
  - Engine idling
  - 35 mph, windows closed / windows open
  - 55 mph, windows closed / windows open
- 4 types of utterances
  - Isolated digits
  - Phone numbers
  - Isolated letters (E-set articulation test)
  - TIMIT sentences
- Public release: 16 schools and companies (but I don't know how many are using it)
16. AVICAR Labeling & Recognition
- Manual lip segmentation: 36 images
- Automatic face tracking: nearly perfect
- Automatic lip tracking: not so good
- Manual audio segmentation: sentence boundaries
- Audio enhancement
- Audio digit WRA: 97%, 89%, 87%, 84%, 78%
17. AVICAR Data Problems
- DivX encoding ⇒ database < 300 GB, but...
- DivX ⇒ poor edge quality in some images
- Amelioration plan: re-transfer from the tapes at high quality; huge data size for the folks who want it.