Title: Phoneme Recognition using Temporal Patterns
1Phoneme Recognition using Temporal Patterns
- Petr Schwarz, Pavel Matejka
Brno University of Technology, Czech Republic
OGI School of Science and Engineering at OHSU, USA
E-mail: matejkap_at_feec.vutbr.cz, schwarzp_at_fit.vutbr.cz
2Outline
- The goal
- Experimental setup and system
- Baseline experiment with MFCC and MFCC multi-frame
- Comparison of conventional MFCC and novel TempoRAl Patterns (TRAPs) features under well-matched and mismatched conditions
- Optimization of TRAPs for our task
- New three-band TRAPs system
- Implementation and distribution of the SW
- Conclusions and future work
3The goal
- For many applications, speech needs to be transcribed into discrete symbols.
- very reliable phoneme recognizer (not only) for the meeting domain
- no language constraints
- suitable as a front end to LVCSR, for keyword spotting, speaker recognition, language recognition or recognition of out-of-vocabulary words
Comparison of several techniques for automatic
recognition of unconstrained context-independent
phonemes
4Experimental setup
- Two databases: TIMIT and NTIMIT
- - all SA records are removed
- - databases down-sampled to 8000 Hz
- - 412 speakers for training, 50 for cross-validation (CV), 168 for test
- The phoneme set contains 39 phonemes
- - very similar to the CMU/MIT phoneme set
- - closures are merged with the burst (bcl b → b)
- Experimental system is an NN/HMM hybrid
- - phoneme insertion penalty tuned to give an equal number of inserted and deleted phonemes
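The tuning criterion above relies on counting inserted and deleted phonemes. A minimal sketch of that bookkeeping, assuming a standard minimum-edit-distance alignment between reference and recognized phoneme strings (the slides do not say which scoring tool was used; the function names and the `counts_by_penalty` dictionary are hypothetical):

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    # Each cell holds (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            t, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, d, n)              # match, no cost
            else:
                best = (t + 1, s + 1, d, n)      # substitution
            t, s, d, n = prev[j]
            best = min(best, (t + 1, s, d + 1, n))   # deletion (ref phoneme missing)
            t, s, d, n = cur[j - 1]
            best = min(best, (t + 1, s, d, n + 1))   # insertion (extra hyp phoneme)
            cur.append(best)
        prev = cur
    _, s, d, n = prev[-1]
    return s, d, n


def phoneme_error_rate(ref, hyp):
    """PER in percent: (S + D + I) / number of reference phonemes."""
    s, d, n = align_counts(ref, hyp)
    return 100.0 * (s + d + n) / len(ref)


def balanced_penalty(counts_by_penalty):
    """Pick the penalty whose insertion and deletion counts are closest.

    counts_by_penalty maps a candidate penalty to its (S, D, I) totals
    (hypothetical structure, filled in by running the decoder per penalty).
    """
    return min(counts_by_penalty,
               key=lambda p: abs(counts_by_penalty[p][2] - counts_by_penalty[p][1]))
```

In use, one would decode the cross-validation set once per candidate penalty, sum the counts from `align_counts`, and keep the value returned by `balanced_penalty`.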
5Experimental system
Which classifier?
6Which classifier, GMM or NN?
- HMM-GMM and HMM-NN with one-state models
- MFCC with Δ and ΔΔ features
- Number of parameters is increased until the decrease in phoneme error rate (PER) is negligible (< 0.5 %)
System  PER (%)  Parameters
GMM     42.0     788736
NN      41.6     31200
NN doesn't degrade performance compared to GMM
2 absolute by merging
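The parameter counts in the table can be reproduced with simple arithmetic. The slides do not state the model configurations, so the sketch below is only one set of assumptions that happens to match: 39 one-state models with 256 diagonal-covariance Gaussians each on 39-dimensional features, and a three-layer network with 39 inputs, 400 hidden neurons and 39 outputs, counting weights without biases.

```python
def gmm_param_count(n_models, n_gauss, dim):
    """Parameters of one-state GMM models with diagonal covariances.

    Each Gaussian has dim means, dim variances and one mixture weight.
    """
    return n_models * n_gauss * (2 * dim + 1)


def mlp_weight_count(n_in, n_hidden, n_out):
    """Weights of a three-layer network, bias terms not counted."""
    return n_in * n_hidden + n_hidden * n_out


# Assumed configurations (not stated in the slides):
print(gmm_param_count(39, 256, 39))   # 788736
print(mlp_weight_count(39, 400, 39))  # 31200
```

Under these assumptions the network needs roughly 25 times fewer parameters than the GMM system for a similar PER.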
7Single frame and multi-frame input with MFCC
FeatureNet
- Subsequent frames are joined together
- Size of context is increased to find the minimal PER
- 300, 400 and 500 neurons in hidden layer tested
- minimal change, but the best is 400
frames PER
1 41.6
5 37.5
PER 37.5
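Joining subsequent frames can be sketched as follows. This is a minimal illustration, not the FeatureNet code: it assumes an odd context size centered on the current frame and repeats the first and last frame at utterance edges (the edge handling is an assumption; the slides do not specify it).

```python
def stack_frames(frames, context):
    """Join each frame with its neighbours into one long feature vector.

    frames  -- list of per-frame feature vectors (e.g. 39 MFCC values each)
    context -- total number of frames joined, must be odd (e.g. 5)
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window is centered")
    half = context // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for k in range(t - half, t + half + 1):
            k = min(max(k, 0), last)  # assumption: repeat edge frames
            vec.extend(frames[k])
        stacked.append(vec)
    return stacked
```

With 39-dimensional MFCC vectors and a context of 5 frames, each network input has 5 × 39 = 195 values.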
8TempoRAl Patterns
1. Frequency-localized posterior probabilities of phonemes are estimated from the temporal evolution of critical band energies within a single critical band.
2. Such estimates are used in another class-posterior estimator which estimates the overall phoneme probability from the probabilities in the individual critical bands.
[Figure: 1st band classifier, 2nd band classifier, ..., N-th band classifier]
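The two stages can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `band_energies[t][b]` is assumed to hold the (log) energy of critical band `b` at frame `t`, the TRAP length is given in frames and is odd, edge frames are repeated, and `band_classifiers` is a hypothetical list of callables standing in for the trained per-band networks.

```python
def trap_vector(band_energies, band, center, length):
    """Temporal trajectory of one critical band around frame `center`."""
    half = length // 2
    last = len(band_energies) - 1
    return [band_energies[min(max(t, 0), last)][band]   # repeat edge frames
            for t in range(center - half, center + half + 1)]


def merger_input(band_energies, center, length, band_classifiers):
    """Concatenate the per-band posterior vectors for one frame.

    Stage 1: each band classifier maps its TRAP vector to phoneme posteriors.
    Stage 2 (not shown): a merger classifier maps this concatenation to the
    overall phoneme posteriors.
    """
    merged = []
    for band, classify in enumerate(band_classifiers):
        merged.extend(classify(trap_vector(band_energies, band, center, length)))
    return merged
```

The output of `merger_input` has (number of bands) × (number of phonemes) values per frame.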
9TRAP system scheme
10MFCC and TRAP on well-matched conditions
- Training and testing data are from the same database
- Similar performance of MFCC multi-frame and 1 s long TRAPs
- Improvement can be obtained when the length of the TRAP is optimized
PER              TIMIT  NTIMIT
MFCC39           41.6   55.6
MFCC39 5 frames  37.5   49.0
TRAP 1 sec       37.9   49.6
11MFCC and TRAP on mismatched conditions
- Training and testing data are from different databases
- The TRAP system yielded better results in both mismatched conditions
- It's better to train the system on corrupted speech than on clean speech
PER (train/test)  TIMIT/NTIMIT  NTIMIT/TIMIT
MFCC39            80.9          63.4
MFCC39 5 frames   80.1          75.7
TRAP 1 sec        75.0          56.6
12Effect of length of TRAP
- The original TRAP length was kept at 1 second to be sure that it covers all information about the phoneme in the critical band, but this length is not optimal
- A 300 ms long context is the best for the TIMIT database
PER 36.1
13Effect of mean and variance normalization
- Experiment was performed on the original 1 second long TRAPs
- Significant degradation caused by both normalizations can be seen in well-matched conditions
- Mean normalization always helps in mismatched conditions; the benefit of variance normalization is less clear
Normalization / PER  TIMIT  NTIMIT  TIMIT/NTIMIT  NTIMIT/TIMIT
None                 37.9   49.6    75.0          56.6
Mean                 40.5   51.8    73.5          54.7
Mean and variance    42.6   53.2    74.8          54.1
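For reference, mean and variance normalization of a temporal vector can be sketched as below. The slides do not say over which span the statistics are computed; this sketch assumes they are computed over each TRAP vector separately, which is one plausible reading.

```python
import math


def normalize_trap(trap, use_variance=False):
    """Subtract the mean of a TRAP vector; optionally scale to unit variance."""
    mean = sum(trap) / len(trap)
    centered = [x - mean for x in trap]
    if not use_variance:
        return centered
    std = math.sqrt(sum(x * x for x in centered) / len(trap))
    if std == 0.0:
        return centered  # constant trajectory, nothing to scale
    return [x / std for x in centered]
```

Mean removal discards a constant offset in the band trajectory, which is consistent with it helping when training and test channels differ.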
14TRAP with more than one critical band
- Three neighboring temporal vectors were merged
together and sent to one classifier
system PER
TRAPS 36.1
3 band TRAPS 33.7
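Merging three neighboring temporal vectors can be sketched as a simple concatenation. The details below are assumptions made for illustration: `band_energies[t][b]` holds the energy of band `b` at frame `t`, the length is given in frames, edge frames are repeated, and one three-band group is formed around every inner band (the slides do not say how the first and last bands are handled).

```python
def three_band_vectors(band_energies, center, length):
    """Build one input vector per inner band from bands b-1, b and b+1."""
    n_bands = len(band_energies[0])
    half = length // 2
    last = len(band_energies) - 1
    # Frames covered by the temporal window, repeating edge frames.
    frames = [band_energies[min(max(t, 0), last)]
              for t in range(center - half, center + half + 1)]
    merged = []
    for b in range(1, n_bands - 1):
        vec = []
        for nb in (b - 1, b, b + 1):
            vec.extend(frame[nb] for frame in frames)  # trajectory of band nb
        merged.append(vec)
    return merged
```

Each classifier input is then three times longer than in the single-band case, which relates to the real-time cost of the three-band system.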
15Implementation and distribution of the SW phnrec
- Early experiments were performed with a set of scripts interconnecting executables (trapper, QuickNet, HTK); these are still used for the training.
- Phoneme recognition in phnrec, containing:
- - feature extraction (HTK-compatible MFCC, FeatureNet, TRAPS) from files or microphone
- - posterior-probability estimator (NN compatible with QuickNet nets)
- - Viterbi decoder, which can also work on-line with a fixed delay.
- Very good as a black box for people who want to treat speech-to-phoneme transcription as a front end
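The decoding step can be illustrated with a small off-line Viterbi pass over frame posteriors. This is a simplified sketch, not the phnrec decoder: it assumes one state per phoneme, no language constraints, a non-negative `insertion_penalty` subtracted whenever the path enters a different phoneme, and it ignores phoneme priors and the fixed-delay on-line mode.

```python
def viterbi_phonemes(log_post, phonemes, insertion_penalty):
    """Decode a phoneme string from log posteriors log_post[t][p]."""
    n = len(phonemes)
    score = list(log_post[0])
    back = []
    for frame in log_post[1:]:
        best_prev = max(range(n), key=lambda p: score[p])
        switch = score[best_prev] - insertion_penalty  # cost of entering a new phoneme
        new_score, ptr = [], []
        for p in range(n):
            if score[p] >= switch:       # staying in the same phoneme is better
                new_score.append(score[p] + frame[p])
                ptr.append(p)
            else:                        # switching from the best previous phoneme
                new_score.append(switch + frame[p])
                ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Backtrace from the best final state.
    state = max(range(n), key=lambda p: score[p])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    # Collapse runs of the same state into single phonemes.
    out = []
    for s in path:
        if not out or out[-1] != phonemes[s]:
            out.append(phonemes[s])
    return out
```

A larger penalty yields fewer, longer phoneme segments; a smaller one lets the output follow frame-level fluctuations more closely.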
16phnrec (2)
- Source code for Linux and an EXE for Windows available for free for research.
- Available with nets trained on US English (TIMIT) and Czech (SpeechDat-E).
- More languages to come (also some Language ID experiments running in Brno)
- Works on-line
http://www.fit.vutbr.cz/speech/sw/phnrec.html
17Conclusion
- A TRAP-based phoneme recognizer was built and compared to MFCC.
- Properties of TRAPs were studied and TRAPs were optimized for phoneme recognition
- A new multi-band TRAPs approach was tested and its benefit was demonstrated
- The recognizer was successfully evaluated in a language identification task
- Easy-to-use software was written and is available to the research community.
18But
- Adaptation to meeting data necessary (training on clean TIMIT is not good at all), updating the distribution on the web.
- Tests on ICSI, IDIAP and Brno data (which phonemes are going to work best for us: Czech, English?)
- Applications: LID already tested, keyword spotting and LVCSR (some papers at Eurospeech making use of phoneme strings).
- Phoneme lattices
- Real-time issues (1-band version running OK on a reasonable machine, 3-band not): NN weights pruning?
19THE END
- A demo during the break.
- Please download phnrec, test it and comment !!!
- Questions ?