1
Phoneme Recognition using Temporal Patterns
  • Petr Schwarz, Pavel Matejka

Brno University of Technology, Czech Republic
OGI School of Science and Engineering at OHSU, USA
E-mail: matejkap_at_feec.vutbr.cz, schwarzp_at_fit.vutbr.cz
2
Outline
  • The goal
  • Experimental setup and system
  • Baseline experiment with MFCC and MFCC
    multi-frame
  • Comparison of conventional MFCC and novel
    TempoRAl Patterns (TRAPs) features under
    well-matched and mismatched conditions
  • Optimization of TRAPs for our task
  • New three-band TRAPs system
  • Implementation and distribution of the SW
  • Conclusions and future work

3
The goal
  • For many applications, speech needs to be
    transcribed into discrete symbols.
  • A very reliable phoneme recognizer, not only
    for the meeting domain
  • No language constraints
  • Suitable as a front end to LVCSR, and for keyword
    spotting, speaker recognition, language
    recognition or recognition of out-of-vocabulary
    words

Comparison of several techniques for automatic
recognition of unconstrained context-independent
phonemes
4
Experimental setup
  • Two databases: TIMIT and NTIMIT
    - all SA records are removed
    - databases down-sampled to 8000 Hz
    - 412 speakers for training, 50 for CV, 168 for
      test
  • The phoneme set contains 39 phonemes
    - very similar to the CMU/MIT phoneme set
    - closures are merged with bursts (bcl b → b)
  • The experimental system is an NN/HMM hybrid
    - phoneme insertion penalty tuned to give an equal
      number of inserted and deleted phonemes
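
The closure merging listed above can be sketched as a small label-mapping step. This is only an illustration: the closure table, the function name `merge_closures`, and the rule for a closure that is not followed by its burst (it is mapped to the stop itself) are assumptions, not taken from the authors' scripts.

```python
# Hypothetical sketch: merge TIMIT-style closure labels with the following burst.
# The closure table and the handling of an unreleased closure are assumptions.
CLOSURE_TO_STOP = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}


def merge_closures(labels):
    """Return a new label list in which 'bcl b' becomes a single 'b'."""
    merged = []
    i = 0
    while i < len(labels):
        stop = CLOSURE_TO_STOP.get(labels[i])
        if stop is None:
            # Not a closure: copy the label unchanged.
            merged.append(labels[i])
            i += 1
            continue
        # Closure found: emit the stop once and skip the burst if it follows.
        merged.append(stop)
        if i + 1 < len(labels) and labels[i + 1] == stop:
            i += 2
        else:
            i += 1
    return merged
```

For example, `merge_closures(["sil", "bcl", "b", "iy"])` gives `["sil", "b", "iy"]`.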

5
Experimental system
Which classifier?
6
Which classifier, GMM or NN?
  • HMM-GMM and HMM-NN with one-state models
  • MFCC + Δ + ΔΔ features
  • The number of parameters is increased until the
    decrease in phoneme error rate (PER) is
    negligible (< 0.5 %)

System   PER (%)   Parameters
GMM      42.0      788736
NN       41.6      31200

The NN does not degrade performance compared to the GMM.
2 % absolute by merging
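
The slides do not spell out how PER is computed. A common convention, assumed here, is to align the recognized phoneme string with the reference by minimum edit distance and divide the number of substitutions, deletions and insertions by the reference length. The function names below are hypothetical. The separate insertion and deletion counts are the quantities that the insertion-penalty tuning described in the experimental setup tries to balance.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    # Each cell holds (total_cost, subs, dels, ins) for ref[:i] versus hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)          # match: no cost
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = prev[j]
            dele = (c + 1, s, d + 1, n)      # reference phoneme missing in hyp
            c, s, d, n = cur[j - 1]
            inse = (c + 1, s, d, n + 1)      # extra phoneme in hyp
            cur.append(min(diag, dele, inse, key=lambda cell: cell[0]))
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def phoneme_error_rate(ref, hyp):
    """PER in percent, assuming the usual (S + D + I) / N definition."""
    if not ref:
        raise ValueError("reference must not be empty")
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)
```

Under this definition, a reference `sil b iy t sil` recognized as `sil b ih t ah sil` has one substitution and one insertion, so the PER is 40 %.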
7
Single frame and multi-frame input with MFCC
FeatureNet
  • Subsequent frames are joined together
  • The context size is increased to find the
    minimal PER
  • 300, 400 and 500 neurons in the hidden layer were
    tested: the differences are minimal, but 400 is best

Frames   PER (%)
1        41.6
5        37.5

PER: 37.5 %
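
Joining subsequent frames can be sketched as follows. The function name `stack_frames` and the edge handling (the first and last frame are repeated at the utterance boundaries) are assumptions; the slides do not say how the edges were treated.

```python
def stack_frames(frames, n_frames):
    """Concatenate each frame with its neighbours into one long vector.

    frames:   list of per-frame feature vectors (lists of floats)
    n_frames: odd number of frames joined per output vector (e.g. 5)
    """
    if n_frames % 2 != 1:
        raise ValueError("n_frames must be odd so the window is centred")
    half = n_frames // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            # Assumption: clamp at the utterance edges by repeating the edge frame.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked
```

With 39-dimensional MFCC vectors and a 5-frame context, each stacked vector has 5 × 39 = 195 elements.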
8
TempoRAl Patterns
  • 1. Frequency-localized posterior probabilities of
    phonemes are estimated from the temporal evolution
    of critical-band energies within a single critical
    band.
  • 2. These estimates are used in another
    class-posterior estimator, which estimates the
    overall phoneme probability from the
    probabilities in the individual critical bands.

[Figure: band classifiers 1, 2, ..., N]
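
The input to each band classifier is the temporal trajectory of one critical-band energy around the current frame. A minimal sketch of cutting out these trajectories is given below. The function name, the edge clamping and the frame counts in the comment are assumptions: the slides give TRAP lengths in seconds, and the frame shift is not stated.

```python
def temporal_patterns(band_energies, centre, n_frames):
    """Return one temporal vector per critical band, centred on frame `centre`.

    band_energies: list of frames, each a list of critical-band energies
    n_frames:      odd trajectory length in frames (101 would be about 1 s and
                   31 about 300 ms IF the frame shift were 10 ms; assumption)
    """
    if n_frames % 2 != 1:
        raise ValueError("n_frames must be odd so the window is centred")
    half = n_frames // 2
    last = len(band_energies) - 1
    n_bands = len(band_energies[0])
    patterns = []
    for band in range(n_bands):
        trajectory = []
        for t in range(centre - half, centre + half + 1):
            idx = min(max(t, 0), last)  # assumption: clamp at utterance edges
            trajectory.append(band_energies[idx][band])
        patterns.append(trajectory)
    return patterns
```

Each returned trajectory would then be passed to the classifier of its own band.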
9
TRAP system scheme
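
The two-stage structure described on the previous slide can be sketched as a data flow. The band classifiers and the merger are passed in as plain functions; the toy versions below only stand in for trained neural networks and are assumptions made for illustration, as are all function names.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_stage_posteriors(patterns, band_classifiers, merger):
    """Stage 1: one classifier per band. Stage 2: merger over all band outputs."""
    if len(patterns) != len(band_classifiers):
        raise ValueError("need exactly one classifier per critical band")
    band_outputs = []
    for pattern, classify in zip(patterns, band_classifiers):
        band_outputs.extend(classify(pattern))  # per-band phoneme posteriors
    return merger(band_outputs)                 # overall phoneme posteriors


# Toy stand-ins for trained networks (assumptions, two phoneme classes only).
def toy_band_classifier(pattern):
    mean = sum(pattern) / len(pattern)
    return softmax([mean, -mean])


def toy_merger(band_outputs):
    # Sum the evidence for class 0 (even positions) and class 1 (odd positions).
    return softmax([sum(band_outputs[0::2]), sum(band_outputs[1::2])])
```

In a real system each stand-in would be replaced by a trained network; only the wiring between the two stages is meant to carry over.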
10
MFCC and TRAP on well-matched conditions
  • Training and testing data are from the same
    database
  • Similar performance of multi-frame MFCC and
    1 s long TRAPs
  • Improvement can be obtained when the length of
    the TRAP is optimized

PER (%)            TIMIT   NTIMIT
MFCC39             41.6    55.6
MFCC39, 5 frames   37.5    49.0
TRAP, 1 s          37.9    49.6
11
MFCC and TRAP on mismatched conditions
  • Training and testing data are from different
    databases
  • The TRAP system yielded better results in both
    mismatched conditions
  • It is better to train the system on corrupted
    speech than on clean speech

PER (%)            TIMIT/NTIMIT   NTIMIT/TIMIT
MFCC39             80.9           63.4
MFCC39, 5 frames   80.1           75.7
TRAP, 1 s          75.0           56.6
12
Effect of length of TRAP
  • The original TRAP length was kept at 1 second to
    be sure that it covers all information about the
    phoneme in the critical band, but this length is
    not optimal
  • A 300 ms long context is the best for the TIMIT
    database

PER: 36.1 %
13
Effect of mean and variance normalization
  • The experiment was performed on the original
    1 second long TRAPs
  • Significant degradation caused by both
    normalizations can be seen in well-matched
    conditions
  • Mean normalization always helps in mismatched
    conditions; the benefit of variance normalization
    is less clear

Normalization / PER (%)   TIMIT   NTIMIT   TIMIT/NTIMIT   NTIMIT/TIMIT
None                      37.9    49.6     75.0           56.6
Mean                      40.5    51.8     73.5           54.7
Mean and variance         42.6    53.2     74.8           54.1
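
A minimal sketch of the two normalizations follows, assuming they are applied to each temporal vector separately (the slides do not say over which span the statistics are computed). If the band energies are in the log domain, which is also an assumption here, a fixed channel gain becomes a constant offset that mean subtraction removes; this would be consistent with the gains in the mismatched columns. The function name is hypothetical.

```python
import math


def normalize_pattern(pattern, use_variance=False):
    """Mean-normalize one temporal vector, optionally scaling to unit variance."""
    mean = sum(pattern) / len(pattern)
    centred = [x - mean for x in pattern]
    if not use_variance:
        return centred
    variance = sum(x * x for x in centred) / len(centred)
    if variance == 0.0:
        return centred  # flat trajectory: nothing to scale
    std = math.sqrt(variance)
    return [x / std for x in centred]
```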
14
TRAP with more than one critical band
  • Three neighboring temporal vectors were merged
    together and sent to one classifier

System         PER (%)
TRAPs          36.1
3-band TRAPs   33.7
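
Merging three neighboring temporal vectors can be sketched as a simple concatenation. The grouping used below (overlapping triples shifted by one band, so N bands give N - 2 classifier inputs) is an assumption; the slide only states that three neighboring vectors go to one classifier. The function name is hypothetical.

```python
def three_band_patterns(patterns):
    """Concatenate each band's temporal vector with its two neighbours."""
    if len(patterns) < 3:
        raise ValueError("need at least three critical bands")
    merged = []
    for band in range(1, len(patterns) - 1):
        # Assumption: overlapping groups (band - 1, band, band + 1).
        merged.append(patterns[band - 1] + patterns[band] + patterns[band + 1])
    return merged
```

Each classifier input is three times as long as in the single-band system, which fits the real-time remark later in the talk that the 3-band version is the more expensive one.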
15
Implementation and distribution of the SW: phnrec
  • Early experiments were performed with a set of
    scripts interconnecting the executables trapper,
    QuickNet and HTK; these are still used for training.
  • Phoneme recognition in phnrec contains:
    - feature extraction (HTK-compatible MFCC,
      FeatureNet, TRAPs) from files or microphone
    - posterior-probability estimator (NN compatible
      with QuickNet nets)
    - Viterbi decoder, which can also work on-line
      with a fixed delay
  • Very good as a black box for people who want to
    treat speech-to-phoneme transcription as a
    front end
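
The decoding step can be illustrated with a small off-line Viterbi search over per-frame phoneme log-posteriors, using one-state phoneme models and a phoneme insertion penalty as in the experimental setup. This is a sketch under those assumptions and not phnrec's implementation: the function name is hypothetical, prior scaling is omitted, and the on-line fixed-delay variant is not shown.

```python
def viterbi_phonemes(log_posteriors, insertion_penalty):
    """Decode a phoneme index string from per-frame log-posteriors.

    log_posteriors:    list of frames, each a list of log-probabilities
    insertion_penalty: value <= 0 added whenever the path changes phoneme
    """
    n_phonemes = len(log_posteriors[0])
    score = list(log_posteriors[0])
    backptr = []
    for frame in log_posteriors[1:]:
        # With a single shared penalty, the best predecessor for a switch
        # is always the globally best state of the previous frame.
        best_prev = max(range(n_phonemes), key=lambda p: score[p])
        new_score, pointers = [], []
        for p in range(n_phonemes):
            stay = score[p]
            switch = score[best_prev] + insertion_penalty
            if stay >= switch:
                new_score.append(stay + frame[p])
                pointers.append(p)
            else:
                new_score.append(switch + frame[p])
                pointers.append(best_prev)
        score = new_score
        backptr.append(pointers)
    # Trace the best path back from the best final state.
    state = max(range(n_phonemes), key=lambda p: score[p])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of the same state into one phoneme each.
    return [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
```

A more negative penalty yields fewer phoneme changes, so sweeping it is one way to balance insertions against deletions. Because each phoneme has a single state, two identical phonemes in a row cannot be distinguished from one long phoneme in this sketch.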

16
phnrec (2)
  • Source code for Linux and an EXE for Windows are
    available free of charge for research.
  • Available with nets trained on US English (TIMIT)
    and Czech (SpeechDat-E).
  • More languages to come (some language ID
    experiments are also running in Brno)
  • Works on-line

http://www.fit.vutbr.cz/speech/sw/phnrec.html
17
Conclusion
  • A TRAP-based phoneme recognizer was built and
    compared to MFCC.
  • Properties of TRAPs were studied and TRAPs were
    optimized for phoneme recognition
  • A new multi-band TRAPs approach was tested and
    its benefit was demonstrated
  • The recognizer was successfully evaluated in a
    language identification task
  • Easy-to-use software was written and is
    available to the research community.

18
But
  • Adaptation to meeting data is necessary (training
    on clean TIMIT is not good at all), then updating
    the distribution on the web.
  • Tests on ICSI, IDIAP and Brno data (which
    phonemes are going to work best for us, CzEnglish?)
  • Applications: LID already tested; keyword spotting
    and LVCSR (some papers at Eurospeech make use
    of phoneme strings).
  • Phoneme lattices
  • Real-time issues (the 1-band version runs OK on a
    reasonable machine, the 3-band version does not):
    NN weight pruning?

19
THE END
  • A demo during the break.
  • Please download phnrec, test it and comment !!!
  • Questions ?