HIWIRE Progress Report

1
HIWIRE Progress Report
  • Technical University of Crete
  • Speech Processing and Dialog Systems Group
  • Presenters: Alex Potamianos (WP1),
    Vassilis Diakoloukas (WP2)

2
Outline
  • Work package 1
    • Baseline: Aurora 2, Aurora 3, Aurora 4 (lattices)
    • Audio-Visual ASR: Baseline
    • Feature extraction and combination
    • Segment models for ASR
    • Blind Source Separation for multi-microphone ASR
  • Work package 2
    • Adaptation
    • Data collection

4
Baseline
  • Baseline Performance: Completed
    • Aurora 2 on HTK
    • Aurora 3 on HTK
    • Aurora 4 on HTK
    • Lattices for Aurora 4
  • Baseline Performance: Ongoing
    • WSJ1 (Decipher)
    • DMHMMs (Decipher)

5
Aurora 2 Database
  • Based on TIdigits downsampled to 8 kHz
  • Noise artificially added at several SNRs
  • 3 sets of noises
    • Set A: subway, babble, car, exhibition hall
    • Set B: restaurant, street, airport, train station
    • Set C: subway, street (with different frequency
      characteristics)
  • Two training conditions
    • Training on clean data
    • Multi-condition training on noisy data
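
Adding noise at a chosen SNR can be sketched as follows. This is a simplified illustration, not the procedure used to build Aurora 2: the function names, the synthetic signals, and the 10 dB target are assumptions made for the example, and any channel filtering is ignored.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain g chosen so that p_speech / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """SNR of a noisy signal, given the clean signal it was built from."""
    p_speech = sum(s * s for s in speech)
    p_resid = sum((y - s) ** 2 for s, y in zip(speech, noisy))
    return 10 * math.log10(p_speech / p_resid)

# Hypothetical stand-ins for one second of 8 kHz speech and noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]
noisy = mix_at_snr(speech, noise, 10.0)
```

Calling `measured_snr_db(speech, noisy)` recovers the requested SNR, which is a quick way to check the scaling.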

6
Aurora 2 Database
  • 8440 training sentences
  • 1001 test sentences per test set
  • Three front-end configurations
    • HTK default
    • WI007 (Aurora 2 distribution)
    • WI008 (thanks to Prof. Segura)

7
Aurora 2 Clean training
  • HTK default Front-End

8
Aurora 2 Multi-Condition training
  • HTK default Front-End

9
Aurora 2 Clean vs Multi-Condition Training
10
Aurora 2 Front-End Comparison: Clean Training
11
Front-End Comparison: Multi-Condition Training
12
Aurora 3 Database
  • 5 languages
    • Finnish
    • German
    • Italian
    • Spanish
    • Danish
  • 3 noise conditions
    • quiet
    • low noise (low)
    • high noise (high)
  • 2 recording modes
    • close-talking microphone (ch0)
    • hands-free microphone (ch1)

13
Aurora 3 Database
  • 3 experimental setups
  • Well-Matched (WM)
    • 70% of all utterances in quiet, low, and high
      conditions were used for training
    • The remaining 30% were used for testing
  • Medium Mismatched (MM)
    • 100% of hands-free recordings from quiet and low
      for training
    • 100% of hands-free recordings from high for
      testing
  • High Mismatched (HM)
    • 70% of close-talking recordings from all noise
      conditions for training
    • 30% of hands-free recordings from low and
      high for testing

14
Baseline Aurora 3 performance
15
Baseline Aurora 3 performance
16
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
17
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
18
Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)
19
Aurora 4 Database
  • Based on the WSJ phase 0 collection
  • 5000-word vocabulary
  • 7138 training utterances (ARPA evaluation)
  • 2 recording microphones
  • 6 different noises artificially added
    • Car, Babble, Restaurant, Street, Airport,
      Train Station

20
Aurora 4 Training Data Sets
  • 3 training conditions
    • Clean, Multi-Condition, Noisy

21
Aurora 4 Test Sets
  • 14 test sets
  • 2 sizes: small (166 utts) and large (330 utts)

22
Lattices
  • Obtained from the SONIC recognizer
    • Real-time decoding for the WSJ 5k task
    • State-of-the-art performance (8% WER)
  • Lattices obtained from clean models
  • Three lattice sizes: small, medium, large
  • Fixed branching factor for each lattice size
    (small: 2.5, medium: 4, large: 5.5)
  • Speed-up factor compared to HTK decoding: x100,
    x50, x10

23
Baseline Aurora 4 with Lattices
24
Baseline Aurora 4 with Lattices
25
Baseline Aurora 4 (Comparing Lattices)
26
Aurora 4 Baseline: Conclusions on Lattices
  • Lattices speed up recognition
    • Medium-size lattice is 60 times faster
    • Small-size lattice is 108 times faster
  • Problem: improved performance on noisy test sets
    • Be careful when using lattices in mismatched
      conditions (clean training, noisy data)!
  • Solution
    • Two sets of lattices: matched and mismatched

27
Audio-Visual ASR Database
  • Subset of the CUAVE database used
  • 36 speakers (30 training, 6 testing)
  • 5 sequences of 10 connected digits per speaker
  • Training set: 1500 digits (30x5x10)
  • Test set: 300 digits (6x5x10)
  • The CUAVE database also contains more complex data
    sets: speaker moving around, speaker shows
    profile, continuous digits, two speakers (to be
    used in future evaluations)

28
CUAVE Database Speakers
29
Audio-Visual ASR Feature Extraction
  • Lip region of interest (ROI) tracking
    • A fixed-size ROI is detected using template
      matching
    • The ROI minimizes the RGB Euclidean distance to a
      given ROI template
    • The ROI template is selected from the 1st frame of
      each speaker
    • Continuity constraint: search within a 20x20
      pixel window of the previous frame's ROI (does not
      work for rapid speaker movements)
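
The tracking step above can be sketched as a brute-force template search. This is a minimal illustration under stated assumptions: frames are nested lists of (R, G, B) tuples, the function names are hypothetical, and the 20x20 window is read as 20 candidate offsets per axis around the previous position.

```python
def rgb_distance(frame, template, top, left):
    """Squared RGB Euclidean distance between the template and a frame patch."""
    total = 0
    for dy, row in enumerate(template):
        for dx, (r, g, b) in enumerate(row):
            fr, fg, fb = frame[top + dy][left + dx]
            total += (fr - r) ** 2 + (fg - g) ** 2 + (fb - b) ** 2
    return total

def track_roi(frame, template, prev_top, prev_left, window=20):
    """Best-matching (top, left) within `window` offsets of the previous ROI."""
    h, w = len(template), len(template[0])
    max_top, max_left = len(frame) - h, len(frame[0]) - w
    half = window // 2
    best = None
    for top in range(max(0, prev_top - half), min(max_top + 1, prev_top + half)):
        for left in range(max(0, prev_left - half), min(max_left + 1, prev_left + half)):
            d = rgb_distance(frame, template, top, left)
            if best is None or d < best[0]:
                best = (d, top, left)
    return best[1], best[2]

# Toy data: a 4x4 "lip" patch placed at (12, 17) in a flat 40x40 frame.
template = [[(200 + dy, 50 + dx, 10) for dx in range(4)] for dy in range(4)]
frame = [[(100, 100, 100)] * 40 for _ in range(40)]
for dy in range(4):
    for dx in range(4):
        frame[12 + dy][17 + dx] = template[dy][dx]

found = track_roi(frame, template, prev_top=10, prev_left=15)
```

Minimizing the squared distance selects the same position as minimizing the distance itself. If the previous position is more than half a window away from the true one, the search never visits the correct location, which matches the limitation noted for rapid movements.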

31
Audio-Visual ASR Feature Extraction
  • Features extracted from ROI
    • ROI is transformed to grayscale
    • ROI is decimated to a 16x16 pixel region
    • 2D separable DCT is applied to the 16x16 pixel
      region
    • Upper-left 6x6 region is kept (excluding the first
      coefficient)
    • The 35-dimensional feature vector is resampled in
      time from 29.97 fps (NTSC) to 100 fps
    • First and second derivatives in time are computed
      using a 6-frame window (feature size 105)
  • Sanity check: unsupervised k-means clustering of
    ROI results in
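
A rough sketch of the static feature pipeline above, in plain Python. Several details are assumptions, since the slide does not specify them: the grayscale weights (ITU-R BT.601), block averaging for decimation, an orthonormal DCT-II, and linear interpolation for the frame-rate change. All function names are hypothetical.

```python
import math

def to_gray(rgb_roi):
    # Assumed BT.601 luma weights.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_roi]

def decimate(img, size=16):
    """Block-average to size x size (image dimensions must be multiples of size)."""
    fy, fx = len(img) // size, len(img[0]) // size
    return [[sum(img[y * fy + i][x * fx + j] for i in range(fy) for j in range(fx)) / (fy * fx)
             for x in range(size)] for y in range(size)]

def dct_1d(v):
    """Orthonormal DCT-II of a sequence."""
    n = len(v)
    return [math.sqrt((1 if k == 0 else 2) / n) *
            sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def dct_2d(img):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def roi_features(rgb_roi, keep=6):
    """Upper-left keep x keep DCT block without the first (DC) coefficient."""
    coeffs = dct_2d(decimate(to_gray(rgb_roi)))
    block = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    return block[1:]  # 6*6 - 1 = 35 values

def resample(frames, src_fps=29.97, dst_fps=100.0):
    """Linearly interpolate feature vectors from src_fps to dst_fps (needs >= 2 frames)."""
    n_out = int((len(frames) - 1) * dst_fps / src_fps) + 1
    out = []
    for t in range(n_out):
        pos = t * src_fps / dst_fps
        i = min(int(pos), len(frames) - 2)
        a = pos - i
        out.append([(1 - a) * x + a * y for x, y in zip(frames[i], frames[i + 1])])
    return out

flat_roi = [[(10, 20, 30)] * 32 for _ in range(32)]  # hypothetical uniform ROI
feats = roi_features(flat_roi)
```

For a uniform ROI all 35 retained coefficients are numerically zero, because only the discarded first coefficient carries the mean intensity. Appending first and second derivatives to each resampled vector gives the 105-dimensional visual feature.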

33
Experiments
  • Recognition experiment
    • Open-loop digit grammar (50 digits per utterance,
      no endpointing)
  • Classification experiment
    • Single-digit grammar (endpointed digits based on
      provided segmentation)

34
Models
  • Features
    • Audio: 39 features (MFCC_D_A)
    • Visual: 105 features (ROIDCT_D_A)
    • Audio-Visual: 39+35 features (MFCC_D_A+ROIDCT)
  • HMM models
    • 8-state, left-to-right, whole-digit HMMs
      with no state skipping
    • Single Gaussian mixture
    • Audio-Visual HMM uses separate audio and video
      feature streams with equal weights (1,1)
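
Stream weights act as exponents on the per-stream output probabilities, so in the log domain they become multipliers. The sketch below shows this for one state with a single diagonal-covariance Gaussian per stream; the function names and the toy vectors are assumptions made for illustration.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def multistream_loglik(streams, weights):
    """Weighted sum of per-stream log-likelihoods.

    streams: list of (x, mean, var) tuples, one per stream
    weights: one exponent per stream, e.g. (1, 1) for equal weighting
    """
    return sum(w * diag_gauss_loglik(x, m, v)
               for (x, m, v), w in zip(streams, weights))

# Hypothetical state with a 39-dim audio stream and a 35-dim visual stream.
audio = ([0.0] * 39, [0.0] * 39, [1.0] * 39)
video = ([0.0] * 35, [0.0] * 35, [1.0] * 35)
equal = multistream_loglik([audio, video], (1.0, 1.0))
audio_only = multistream_loglik([audio, video], (1.0, 0.0))
```

Setting a weight to zero removes that stream from the score, which is one simple way to compare audio-only, visual-only, and combined scoring with the same models.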

35
Results (Word Accuracy)
  • Data
    • Training: 1500 digits (30 speakers)
    • Testing: 300 digits (6 speakers)
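
For reference, word accuracy is commonly computed as (N - D - S - I) / N, where N is the number of reference words and D, S, I are deletions, substitutions, and insertions in the best alignment. A minimal sketch follows; the function name and the example strings are hypothetical, and the scoring tool actually used is not stated on the slide.

```python
def word_accuracy(ref, hyp):
    """Percent word accuracy from the minimum edit distance between word lists."""
    n, m = len(ref), len(hyp)
    d = list(range(m + 1))  # distances for the empty reference prefix
    for i in range(1, n + 1):
        prev, d[0] = d[0], i
        for j in range(1, m + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution or match
            prev = cur
    return 100.0 * (n - d[m]) / n

ref = "one two three four five".split()
hyp = "one too three five six".split()
acc = word_accuracy(ref, hyp)
```

Because insertions are penalized, this measure can become negative when the hypothesis contains many extra words.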

36
Future Work
  • Multi-mixture models
  • Front-end (NTUA)
    • Tracking algorithms
    • Feature extraction
  • Feature combination
    • Feature integration
    • Feature weighting

38
Feature extraction and combination
  • Noise Robust Features (NTUA) m12
  • AM-FM Features (NTUA) m12
  • Feature combination m12
  • Supra-segmental features (see also segment
    models) m18

40
Segment Models
  • Baseline system
  • Supra-segmental features
    • Phone transition modeling m12
    • Prosody modeling m18
    • Stress modeling m18
  • Parametric modeling of feature trajectories
  • Dynamical system modeling
  • Combine with HMMs

42
Blind Source Separation (Mokios, Sidiropoulos)
  • Based on PARAllel FACtor (PARAFAC) analysis,
    i.e., low-rank decomposition of multi-dimensional
    tensorial data
  • Collecting spatial covariance matrix estimates
    which are sufficiently separated in time
  • Assumptions
    • Uncorrelated speaker signals and noise
    • D(t) is a diagonal matrix of speaker powers for
      measurement period t
    • The noise power is estimated from silence
      intervals
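
The slide's equation did not survive in this transcript. Under the stated assumptions, a covariance model of the following form is the usual starting point; apart from D(t), every symbol here is an assumption introduced for illustration (R(t) for the spatial covariance estimate in period t, A for the mixing matrix, \sigma^2 for the noise power, K for the number of speakers).

```latex
% Assumed notation; only D(t) is named on the slide.
R(t) = A\, D(t)\, A^{H} + \sigma^{2} I, \qquad t = 1, \dots, T

% Element-wise, after subtracting the estimated noise term:
\left[ R(t) - \sigma^{2} I \right]_{ij}
  = \sum_{k=1}^{K} a_{ik}\, a_{jk}^{*}\, d_{k}(t)
```

Stacking the T noise-corrected matrices gives a three-way array whose entries are sums of K triple products, which is the trilinear structure that PARAFAC decomposes.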

44
Acoustic Model Adaptation
  • Adaptation method
    • Bayes optimal classification
  • Acoustic models
    • Discrete Mixture HMMs

45
Bayes optimal classification
  • Classifier decision for a test data vector x_test
  • Choose the class that results in the highest
    value

46
Bayes optimal versus MAP
  • Assumption: the posterior is sufficiently peaked
    around the most probable point
  • MAP approximation
  • θ_MAP is the set of parameters that maximizes
    the posterior
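
The difference between the two rules is easiest to see with a small discrete set of parameter hypotheses. The toy numbers and function names below are made up for illustration and are not taken from the presentation.

```python
def map_decision(posterior, class_probs):
    """Plug-in rule: classify with the single most probable hypothesis."""
    theta_map = max(posterior, key=posterior.get)
    probs = class_probs[theta_map]
    return max(probs, key=probs.get)

def bayes_optimal_decision(posterior, class_probs):
    """Average the class predictions of all hypotheses, weighted by their posterior."""
    classes = next(iter(class_probs.values())).keys()
    score = {c: sum(posterior[t] * class_probs[t][c] for t in posterior)
             for c in classes}
    return max(score, key=score.get)

# P(class | x_test, theta) for three hypothetical parameter settings.
class_probs = {
    "theta1": {"A": 1.0, "B": 0.0},
    "theta2": {"A": 0.0, "B": 1.0},
    "theta3": {"A": 0.0, "B": 1.0},
}
flat_posterior = {"theta1": 0.4, "theta2": 0.3, "theta3": 0.3}
peaked_posterior = {"theta1": 0.9, "theta2": 0.05, "theta3": 0.05}
```

With the flat posterior the MAP rule picks class A while the Bayes optimal rule picks B (0.6 against 0.4). With the peaked posterior both rules agree, which is the situation the MAP approximation relies on.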

47
Why Bayes optimal classification
  • Optimal classification criterion
  • The prediction of all the parameter hypotheses is
    combined
  • Better discrimination
  • Less training data
  • Faster asymptotic convergence to the ML estimate

48
Why Bayes optimal classification
  • However
    • Computationally more expensive
    • Difficult to find analytical solutions
  • ... hence some approximations should still be
    considered

49
Approximate Bayesian Decision Rule (Merhav and Ephraim, 1991)
  • Having
    • Training data y
    • Test sequence x
    • M: the number of source models H
    • θ: the parameter set of each source
  • However
    • Still difficult to implement
    • Strong assumptions

50
Discrete-Mixture HMMs (Digalakis et al. 2000)
  • It is based on sub-vector quantization
  • Introduces a new form of observation
    distributions
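
As a rough sketch of this kind of observation distribution: the feature vector is split into sub-vectors, each sub-vector is quantized with its own codebook, and the state output probability is a mixture of products of discrete probabilities. The partition, codebooks, tables, and names below are hypothetical toy values, not parameters from the cited work.

```python
def quantize(subvec, codebook):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(subvec, codebook[k])))

def dm_observation_prob(x, partition, codebooks, mix_weights, tables):
    """b(x) = sum_m w_m * prod_l P[m][l][code_l(x)].

    partition[l]: indices of x that form sub-vector l
    tables[m][l][k]: probability of centroid k of sub-vector l in mixture m
    """
    codes = [quantize([x[i] for i in idx], cb)
             for idx, cb in zip(partition, codebooks)]
    total = 0.0
    for w, table in zip(mix_weights, tables):
        p = w
        for l, code in enumerate(codes):
            p *= table[l][code]
        total += p
    return total

# Toy state: 4-dim features, two 2-dim sub-vectors, 1-bit codebooks, 2 mixtures.
partition = [[0, 1], [2, 3]]
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
mix_weights = [0.5, 0.5]
tables = [
    [[0.9, 0.1], [0.8, 0.2]],  # mixture 0
    [[0.2, 0.8], [0.3, 0.7]],  # mixture 1
]
prob = dm_observation_prob([0.1, -0.1, 1.9, 2.2], partition, codebooks,
                           mix_weights, tables)
```

Within one mixture component the sub-vectors are treated as independent; summing over components is what lets the model express dependence between them.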

51
DMHMM benefits (Digalakis et al. 2000)
  • Quantization scheme driven by speech recognition
    performance
  • Quantization of the acoustic space in sufficient
    detail
  • Mixtures capture the correlation between
    sub-vectors
  • Well suited to client-server applications
  • Comparable performance to continuous HMMs
  • Faster decoding speeds

52
DMHMM parameters that could be adapted
  • Partitioning into sub-vectors
    • How many sub-vectors
    • Which MFCCs form each sub-vector
  • Bit allocation
    • Optimize bit allocation based on adaptation data
  • Discrete mixture weights
  • Centroids of codebooks
  • Centroid observation probabilities

53
Adaptation on DMHMMs
  • Goal: re-estimate the centroids' observation
    distributions
  • Transformation-based adaptation?
    • Maybe not enough training data for the number of
      centroids
  • Bayesian adaptation?
    • Could benefit from its convergence property
  • Optimal Bayes classification?
    • Easier to find approximate forms for DMHMMs

55
TUC Non-Native Recordings
  • 10 speakers (6 male, 4 female)
  • Fluency in English
    • 4 excellent
    • 5 good to very good
    • 1 satisfactory
  • Speaker pronunciation
    • 1 from Cyprus
    • 3 from Northern Greece
    • 1 from the Ionian Islands
    • 2 from the Athens area
    • 1 from Crete
    • 1 from Central Greece

56
EXTRA SLIDES
57
Prior Work Overview
Constr. Est. Adapt.
MLST.
Combinations
MAP (Bayes) Adapt.
VTLN
Genones
Segment Models
Robust Features
58
HIWIRE Work Proposal
Adaptation: Bayes optimal classification
Acoustic Modeling: Segment Models
Feature Selection: AM-FM Features
Microphone Arrays: Speech/Noise Separation
Audio-Visual ASR: Baseline experiments
59
Aurora 2 Performance with HTK FE (Clean Training)
60
Aurora 2 Performance with HTK FE (Multi-Condition Training)
61
Aurora 2 Performance with WI008 FE (Clean Training)
62
Aurora 2 Performance with WI008 FE (Multi-Condition Training)
63
Aurora 3 HTK Settings
  • Spanish
  • Parametrize.csh
    • Set Options: -F RAW fs 8 q noc0 swap
  • Config_tr
    • TARGETKIND = MFCC_E_D_A
    • DELTAWINDOW = 3
    • ACCWINDOW = 2
    • ENORMALISE = F
    • HNETTRACE = 2
    • NATURALREADORDER = T
    • NATURALWRITEORDER = T
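
The DELTAWINDOW and ACCWINDOW settings control the regression windows for the dynamic features. The sketch below applies the regression formula documented for HTK, d_t = sum_k k (c_{t+k} - c_{t-k}) / (2 sum_k k^2), with the first and last frames replicated at the edges; the function names and the ramp input are assumptions made for illustration.

```python
def deltas(frames, window):
    """Regression-based time derivative with edge frames replicated."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[0])):
            num = sum(k * (frames[min(t + k, n - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, window + 1))
            vec.append(num / denom)
        out.append(vec)
    return out

def add_dynamic(frames, delta_window=3, acc_window=2):
    """Append delta and acceleration coefficients to each static vector."""
    d = deltas(frames, delta_window)
    a = deltas(d, acc_window)  # accelerations are deltas of the deltas
    return [s + dd + aa for s, dd, aa in zip(frames, d, a)]

# Hypothetical 1-dim feature that rises by one unit per frame.
ramp = [[float(t)] for t in range(20)]
feats = add_dynamic(ramp)
```

Away from the edges, a ramp with unit slope yields a delta of 1 and an acceleration of 0. With 13 static coefficients per frame (12 MFCCs plus energy), the same procedure produces the 39-dimensional MFCC_E_D_A vector.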

64
Aurora 3 HTK Settings
  • Italian
  • Sdc_it.conf
    • FE_OPTIONS: -q -F RAW fs 8
  • Config
    • TARGETKIND = MFCC_D_A_E
    • HNETTRACE = 2
    • ACCWINDOW = 2
    • DELTAWINDOW = 3
    • ENORMALISE = F
    • NATURALREADORDER = T
    • NATURALWRITEORDER = T

65
Baseline Aurora 3 Performance
66
Baseline Aurora 3 with WI007 FE (TUC - UGR comparison)
67
Baseline Aurora 3 with WI008 FE (TUC - UGR comparison)
68
Baseline Aurora 4 with Lattices
  • Small Lattice Size

69
Baseline Aurora 4 with Lattices
  • Medium Lattice Size

70
Baseline Aurora 4 with Lattices
  • Small Lattice Size