Title: HIWIRE Progress Report

1
HIWIRE Progress Report

Technical University of Crete
Speech Processing and Dialog Systems Group
Presenters: Alex Potamianos (WP1), Vassilis Diakoloukas (WP2)

2
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

4
Baseline

Baseline performance, completed:
Aurora 2 on HTK
Aurora 3 on HTK
Aurora 4 on HTK
Lattices for Aurora 4
Baseline performance, ongoing:
WSJ1 (Decipher)
DMHMMs (Decipher)

5
Aurora 2 Database

Based on TIdigits, downsampled to 8 kHz
Noise artificially added at several SNRs
3 sets of noises
A: subway, babble, car, exhibition hall
B: restaurant, street, airport, train station
C: subway, street (with different frequency characteristics)
Two training conditions
Training on clean data
Multi-condition training on noisy data
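
To make "noise artificially added at several SNRs" concrete, here is a minimal sketch of mixing a noise recording into clean speech at a target SNR. It is an illustration only: the function names are hypothetical, and SNR is computed from plain average power, which is a simplification and not necessarily the level measure used to build the database.

```python
import math

def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    # Repeat the noise if it is shorter than the speech, then trim to length.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB implied by a clean signal and its noisy version."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))
```

Calling `add_noise_at_snr(speech, noise, 10.0)` and then `measured_snr_db(speech, noisy)` recovers the requested 10 dB up to floating-point error, which is a quick way to check a mixing script.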

6
Aurora 2 Database

8440 training sentences
1001 test sentences per test set
Three front-end configurations
HTK default
WI007 (Aurora 2 distribution)
WI008 (Thanks to Prof. Segura)

7
Aurora 2 Clean training

HTK default Front-End

8
Aurora 2 Multi-Condition training

HTK default Front-End

9
Aurora 2 Clean vs Multi-Condition Training
10
Aurora 2 Front End Comparison Clean Training
11
Front End Comparison Multi-Condition Training
12
Aurora 3 Database

5 languages
Finnish
German
Italian
Spanish
Danish
3 noise conditions
quiet
low noise (low)
high noise (high)
2 recording modes
close-talking microphone (ch0)
hands-free microphone (ch1)

13
Aurora 3 Database

3 experimental setups
Well-Matched (WM)
70% of all utterances in the quiet, low, and high conditions were used for training
the remaining 30% were used for testing
Medium Mismatched (MM)
100% of hands-free recordings from quiet and low for training
100% of hands-free recordings from high for testing
High Mismatched (HM)
70% of close-talking recordings from all noise conditions for training
30% of hands-free recordings from low and high for testing

14
Baseline Aurora 3 performance
15
Baseline Aurora 3 performance
16
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
17
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
18
Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)
19
Aurora 4 Database

Based on the WSJ phase 0 collection
5000-word vocabulary
7138 training utterances (ARPA evaluation)
2 recording microphones
6 different noises artificially added
Car, Babble, Restaurant, Street, Airport, Train Station

20
Aurora 4 Training Data Sets

3 Training Conditions
(Clean, Multi-Condition, Noisy)

21
Aurora 4 Test Sets

14 Test Sets
2 sizes: small (166 utterances) and large (330 utterances)

22
Lattices

Obtained from the SONIC recognizer
real-time decoding for the WSJ 5k task
State-of-the-art performance (8% WER)
Lattices obtained from clean models
Three lattice sizes: small, medium, large
Fixed branching factor for each lattice size (small = 2.5, medium = 4, large = 5.5)
Speed-up factor compared to HTK decoding: x100, x50, x10

23
Baseline Aurora 4 with Lattices
24
Baseline Aurora 4 with Lattices
25
Baseline Aurora 4 (Comparing Lattices)
26
Aurora 4 Baseline: Conclusions on Lattices

Lattices speed up recognition
Medium-size lattice is 60 times faster
Small-size lattice is 108 times faster
Problem: improved performance on the noisy test sets
Careful when using lattices in mismatched conditions (clean training, noisy data)!
Solution: two sets of lattices (matched and mismatched)

27
Audio-Visual ASR Database

Subset of CUAVE database used
36 speakers (30 training, 6 testing)
5 sequences of 10 connected digits per speaker
Training set: 1500 digits (30 x 5 x 10)
Test set: 300 digits (6 x 5 x 10)
CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations)

28
CUAVE Database Speakers
29
Audio-Visual ASR Feature Extraction

Lip region of interest (ROI) tracking
A fixed-size ROI is detected using template matching
ROI minimizes RGB-Euclidean distance to a given ROI template
ROI template is selected from the 1st frame of each speaker
Continuity constraint: search within a 20x20 pixel window around the previous frame's ROI (does not work for rapid speaker movements)
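
The tracking steps above can be sketched as a brute-force template search. This is only an illustration under stated assumptions: frames and templates are nested lists of (R, G, B) tuples, the 20x20 window is interpreted as offsets of up to 10 pixels in each direction from the previous ROI position, and the function names are hypothetical.

```python
def rgb_distance(frame, template, top, left):
    """Squared RGB-Euclidean distance between the template and the frame
    patch whose upper-left corner is at (top, left)."""
    total = 0
    for r, template_row in enumerate(template):
        for c, (tr, tg, tb) in enumerate(template_row):
            fr, fg, fb = frame[top + r][left + c]
            total += (fr - tr) ** 2 + (fg - tg) ** 2 + (fb - tb) ** 2
    return total

def track_roi(frame, template, prev_top, prev_left, search=20):
    """Return the (top, left) position near the previous ROI that best
    matches the template. Minimizing the squared distance is equivalent
    to minimizing the Euclidean distance."""
    height, width = len(frame), len(frame[0])
    roi_h, roi_w = len(template), len(template[0])
    half = search // 2
    best = None
    for top in range(max(0, prev_top - half),
                     min(height - roi_h, prev_top + half) + 1):
        for left in range(max(0, prev_left - half),
                          min(width - roi_w, prev_left + half) + 1):
            dist = rgb_distance(frame, template, top, left)
            if best is None or dist < best[0]:
                best = (dist, top, left)
    return best[1], best[2]
```

Because only positions near the previous ROI are examined, a mouth that moves farther than the window between two frames cannot be found, which matches the limitation noted on the slide.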

30
(No Transcript)
31
Audio-Visual ASR Feature Extraction

Features extracted from ROI
ROI is transformed to grayscale
ROI is decimated to a 16x16 pixel region
2D separable DCT is applied to the 16x16 pixel region
Upper-left 6x6 region is kept (excluding the first coefficient)
The 35-dimensional feature vector is resampled in time from 29.97 fps (NTSC) to 100 fps
First and second derivatives in time are computed using a 6-frame window (feature size 105)
Sanity check: unsupervised k-means clustering of ROI results in
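
A minimal sketch of the static part of this pipeline follows. Several details are assumptions, since the slide does not state them: decimation is done by block averaging, the DCT is an unnormalised DCT-II, and temporal resampling uses linear interpolation. The derivative step is omitted; appending first and second derivatives to the 35 static coefficients gives 3 x 35 = 105 features.

```python
import math

def dct_1d(values):
    """Unnormalised DCT-II of a sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
            for k in range(n)]

def dct_2d(block):
    """Separable 2D DCT: 1D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def decimate(gray, size=16):
    """Block-average a grayscale ROI to size x size
    (assumes both dimensions are multiples of size)."""
    bh, bw = len(gray) // size, len(gray[0]) // size
    return [[sum(gray[r * bh + i][c * bw + j]
                 for i in range(bh) for j in range(bw)) / (bh * bw)
             for c in range(size)]
            for r in range(size)]

def roi_dct_features(gray, keep=6):
    """Upper-left keep x keep DCT coefficients without the first (DC) one."""
    coeffs = dct_2d(decimate(gray))
    flat = [coeffs[r][c] for r in range(keep) for c in range(keep)]
    return flat[1:]  # 6 * 6 - 1 = 35 values

def resample(frames, src_fps=29.97, dst_fps=100.0):
    """Linearly interpolate per-frame feature vectors to a new frame rate
    (needs at least two input frames)."""
    duration = (len(frames) - 1) / src_fps
    n_out = int(duration * dst_fps) + 1
    out = []
    for k in range(n_out):
        pos = k * src_fps / dst_fps          # position in source frames
        lo = min(int(pos), len(frames) - 2)
        w = pos - lo
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out
```

Dropping the DC coefficient removes the mean brightness of the ROI, so a uniformly lit patch with no structure maps to an all-zero feature vector.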

32
(No Transcript)
33
Experiments

Recognition experiment
Open-loop digit grammar (50 digits per utterance, no endpointing)
Classification experiment
Single-digit grammar (endpointed digits based on provided segmentation)

34
Models

Features
Audio: 39 features (MFCC_D_A)
Visual: 105 features (ROIDCT_D_A)
Audio-Visual: 39+35 features (MFCC_D_A + ROIDCT)
HMM models
8-state, left-to-right HMM whole-digit models with no state skipping
Single Gaussian mixture
Audio-Visual HMM uses separate audio and video feature streams with equal weights (1,1)
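
As an illustration of what stream weights do, the sketch below combines per-stream log-likelihoods in the usual multi-stream form, where each stream's likelihood is raised to its weight (so log-likelihoods are scaled and summed). It assumes single diagonal-covariance Gaussians per stream, and the function names are hypothetical.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def multistream_loglik(streams, weights):
    """Weighted sum of per-stream log-likelihoods.

    streams: list of (observation, mean, variance) triples, one per stream
    weights: one exponent per stream, e.g. (1.0, 1.0) for equal weights
    """
    return sum(w * diag_gaussian_logpdf(x, m, v)
               for w, (x, m, v) in zip(weights, streams))
```

With weights (1, 1) the audio and visual streams contribute equally; setting the visual weight to 0 reduces the score to the audio-only model.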

35
Results (Word Accuracy)

Data
Training: 1500 digits (30 speakers)
Testing: 300 digits (6 speakers)

36
Future Work

Multi-mixture models
Front-end (NTUA)
Tracking algorithms
Feature extraction
Feature Combination
Feature integration
Feature weighting

37
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

38
Feature extraction and combination

Noise Robust Features (NTUA) m12
AM-FM Features (NTUA) m12
Feature combination m12
Supra-segmental features (see also segment
models) m18

39
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

40
Segment Models

Baseline system
Supra-segmental features
Phone Transition modeling m12
Prosody modeling m18
Stress modeling m18
Parametric modeling of feature trajectories
Dynamical system modeling
Combine with HMMs

41
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

42
Blind Source Separation (Mokios, Sidiropoulos)

Based on PARallel FACtor (PARAFAC) analysis, i.e., low-rank decomposition of multi-dimensional tensorial data
Collecting spatial covariance matrix estimates which are sufficiently separated in time
Assumptions
uncorrelated speaker signals and noise
D(t) is a diagonal matrix of speaker powers for measurement period t
the noise power is estimated from silence intervals
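
The equations did not survive in this transcript. Under the stated assumptions (uncorrelated speaker signals and noise), a covariance model of the following form would be consistent with the bullets above. Only D(t) comes from the slide; the mixing matrix A, the noise power σ², the number of measurement periods K, and the number of speakers M are notation assumed here for illustration.

```latex
% Spatial covariance for measurement period t_k (assumed notation)
R(t_k) = A \, D(t_k) \, A^{H} + \sigma^{2} I , \qquad k = 1, \dots, K

% After subtracting the noise term, each entry is a sum of M rank-one
% (trilinear) terms, which is the structure PARAFAC decomposes:
\left[ R(t_k) - \sigma^{2} I \right]_{ij}
  = \sum_{m=1}^{M} a_{im} \, d_{m}(t_k) \, a_{jm}^{*}
```

Stacking the K matrices gives a three-way array, and speaker powers that vary differently over time are what make the low-rank decomposition identifiable.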

43
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

44
Acoustic Model Adaptation

Adaptation Method
Bayes Optimal Classification
Acoustic Models
Discrete Mixture HMMs

45
Bayes optimal classification

Classifier decision for a test data vector x_test
Choose the class that results in the highest value
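
The decision rule itself is missing from this transcript. In its standard form, Bayes optimal classification integrates the class-conditional likelihood over the posterior of the model parameters. In the sketch below, c for the class, θ for the parameters, and y for the training data are notation assumed for illustration.

```latex
\hat{c} = \arg\max_{c} \; P(c) \, p(x_{\mathrm{test}} \mid y, c),
\qquad
p(x_{\mathrm{test}} \mid y, c)
  = \int p(x_{\mathrm{test}} \mid \theta, c) \, p(\theta \mid y, c) \, d\theta
```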

46
Bayes optimal versus MAP

Assumption: the posterior is sufficiently peaked around the most probable point
MAP approximation
θ_MAP is the set of parameters that maximizes the posterior
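
A small numerical sketch can show when the peaked-posterior assumption matters. It uses a toy model chosen only for illustration (not the acoustic models of this project): a scalar Gaussian with known variance and a Gaussian prior on its mean, for which both the Bayes predictive density and the MAP plug-in density have closed forms.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_of_mean(train, noise_var, prior_mean, prior_var):
    """Conjugate posterior N(post_mean, post_var) of the unknown mean."""
    n = len(train)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(train) / noise_var)
    return post_mean, post_var

def bayes_predictive(x, train, noise_var, prior_mean, prior_var):
    """Integrates over the posterior: the variance is widened by post_var."""
    post_mean, post_var = posterior_of_mean(train, noise_var, prior_mean, prior_var)
    return gaussian_pdf(x, post_mean, noise_var + post_var)

def map_plugin(x, train, noise_var, prior_mean, prior_var):
    """Plugs in the single most probable mean (the posterior mode)."""
    post_mean, _ = posterior_of_mean(train, noise_var, prior_mean, prior_var)
    return gaussian_pdf(x, post_mean, noise_var)
```

With only a couple of training points the posterior is broad and the two densities differ noticeably in the tails; with many points the posterior variance shrinks and the MAP plug-in becomes a close approximation of the Bayes predictive density.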

47
Why Bayes optimal classification

Optimal classification criterion
The prediction of all the parameter hypotheses is
combined
Better discrimination
Less training data
Faster asymptotic convergence to the ML estimate

48
Why Bayes optimal classification

However
Computationally more expensive
Difficult to find analytical solutions
... hence some approximations should still be considered

49
Approximate Bayesian Decision Rule (Merhav, Ephraim 1991)

Having
Training data y
Test sequence x
M: the number of source models H
θ: the parameter set of each source
However
Still difficult to implement
Strong assumptions

50
Discrete-Mixture HMMs (Digalakis et al. 2000)

It is based on sub-vector quantization
Introduces a new form of observation
distributions
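
The observation distribution is not reproduced in this transcript. The sketch below shows one common way to write a discrete-mixture output probability over sub-vector codewords: a weighted sum over mixtures of the product of per-sub-vector codeword probabilities. The data layout and function names are hypothetical, and the exact form used in the cited work may differ.

```python
def nearest_codeword(subvector, codebook):
    """Index of the codebook centroid closest to the sub-vector."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(subvector, centroid))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))

def dmhmm_observation_prob(x, partition, codebooks, mixture_weights, codeword_probs):
    """State output probability of a discrete-mixture model.

    partition:       list of index lists, one per sub-vector
    codebooks:       codebooks[l] is the list of centroids for sub-vector l
    mixture_weights: one weight per mixture component, summing to 1
    codeword_probs:  codeword_probs[l][m][k] = P(codeword k | mixture m)
                     for sub-vector l
    """
    per_mixture = list(mixture_weights)
    for l, indices in enumerate(partition):
        subvector = [x[i] for i in indices]
        k = nearest_codeword(subvector, codebooks[l])
        per_mixture = [p * codeword_probs[l][m][k]
                       for m, p in enumerate(per_mixture)]
    return sum(per_mixture)
```

Within one mixture component the sub-vectors are treated as independent, so any correlation between sub-vectors is carried by the sum over mixtures.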

51
DMHMM benefits (Digalakis et al. 2000)

Quantization scheme driven by speech recognition performance
Quantization of the acoustic space in sufficient detail
Mixtures capture the correlation between sub-vectors
Well suited to client-server applications
Comparable performance to continuous HMMs
Faster decoding speeds

52
DMHMM parameters that could be adapted

Partitioning into sub-vectors
How many sub-vectors
Which MFCCs to form each sub-vector
Bit-allocation
Optimize bit-allocation based on adaptation data
Discrete Mixture Weights
Centroids of codebooks
Centroid observation probabilities

53
Adaptation on DMHMMs

Goal: re-estimate the centroids' observation distribution
Transformation-based adaptation?
Maybe not enough training data for the number of centroids
Bayesian adaptation?
Could benefit from its convergence property
Optimal Bayes classification?
Easier to find approximate forms for DMHMMs

54
Outline

Work package 1
Baseline Aurora 2, Aurora 3, Aurora 4 (lattices)
Audio-Visual ASR Baseline
Feature extraction and combination
Segment models for ASR
Blind Source Separation for multi-microphone ASR
Work package 2
Adaptation
Data collection

55
TUC Non-Native Recordings

10 speakers (6 male, 4 female)
Fluency in English
4 excellent
5 good / very good
1 satisfactory
Speaker pronunciation
1 from Cyprus
3 from Northern Greece
1 from the Ionian Islands
2 from the Athens area
1 from Crete
1 from Central Greece

56
EXTRA SLIDES
57
Prior Work Overview
Constr. Est. Adapt.
MLST.
Combinations
MAP (Bayes) Adapt.
VTLN
Genones
Segment Models
Robust Features
58
HIWIRE Work Proposal
Adaptation: Bayes optimal classification
Acoustic Modeling: Segment Models
Feature Selection: AM-FM Features
Microphone Arrays: Speech/Noise Separation
Audio-Visual ASR: Baseline experiments
59
Aurora 2 Performance with HTK FE (Clean Training)
60
Aurora 2 Performance with HTK FE (Multi-Condition Training)
61
Aurora 2 Performance with WI008 FE (Clean Training)
62
Aurora 2 Performance with WI008 FE (Multi-Condition Training)
63
Aurora 3 HTK Settings

Spanish
Parametrize.csh
Set Options = -F RAW -fs 8 -q -noc0 -swap
Config_tr
TARGETKIND = MFCC_E_D_A
DELTAWINDOW = 3
ACCWINDOW = 2
ENORMALISE = F
HNET:TRACE = 2
NATURALREADORDER = T
NATURALWRITEORDER = T

64
Aurora 3 HTK Settings

Italian
Sdc_it.conf
FE_OPTIONS = -q -F RAW -fs 8
Config
TARGETKIND = MFCC_D_A_E
HNET:TRACE = 2
ACCWINDOW = 2
DELTAWINDOW = 3
ENORMALISE = F
NATURALREADORDER = T
NATURALWRITEORDER = T

65
Baseline Aurora 3 Performance
66
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
67
Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)
68
Baseline Aurora 4 with Lattices