1
Phoneme Recognition using Temporal Patterns

Petr Schwarz, Pavel Matejka

Brno University of Technology, Czech Republic
OGI School of Science and Engineering at OHSU, USA
E-mail: matejkap_at_feec.vutbr.cz, schwarzp_at_fit.vutbr.cz
2
Outline

The goal
Experimental setup and system
Baseline experiment with MFCC and multi-frame MFCC
Comparison of conventional MFCC and novel TempoRAl Patterns (TRAPs) features under well-matched and mismatched conditions
Optimization of TRAPs for our task
New three-band TRAPs system
Implementation and distribution of the SW
Conclusions and future work

3
The goal

For many applications, speech needs to be transcribed into discrete symbols.
A very reliable phoneme recognizer (not only) for the meeting domain
No language constraints
Suitable as a front end to LVCSR, for keyword spotting, speaker recognition, language recognition, or recognition of out-of-vocabulary words

Comparison of several techniques for automatic recognition of unconstrained context-independent phonemes
4
Experimental setup

Two databases: TIMIT and NTIMIT
- all SA records are removed
- databases are down-sampled to 8000 Hz
- 412 speakers for training, 50 for cross-validation (CV), 168 for test
The phoneme set contains 39 phonemes
- very similar to the CMU/MIT phoneme set
- closures are merged with bursts (bcl b → b)
The experimental system is an NN/HMM hybrid
- phoneme insertion penalty is tuned so that the numbers of inserted and deleted phonemes are equal
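The tuning criterion above needs insertion and deletion counts from an alignment of the reference and recognized phoneme strings. The slides do not show how this is scored; the following is a minimal sketch assuming a standard minimum-edit-distance alignment with unit costs. The function names are hypothetical and are not part of any tool named in this presentation.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment.

    Assumption: unit cost for every error type (not stated on the slides).
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # everything deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # everything inserted
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # match, no new error
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total errors first, so this is the cheapest path.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def phoneme_error_rate(ref, hyp):
    """PER in percent: (S + D + I) / number of reference phonemes."""
    subs, dels, ins = align_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)
```

Under this sketch, the insertion penalty would be adjusted until the deletion and insertion counts summed over a tuning set are roughly equal.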

5
Experimental system
Which classifier?
6
Which classifier, GMM or NN?

HMM-GMM and HMM-NN with one-state models
MFCC + Δ + ΔΔ features
Number of parameters is increased until the decrease in phoneme error rate (PER) is negligible (< 0.5 %)

System  PER [%]  Parameters
GMM     42.0     788736
NN      41.6     31200
NN doesn't degrade performance compared to GMM
2 % absolute by merging
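The slide does not state the network topology behind the 31200 figure. As a plausibility check only, the sketch below counts the weights of a three-layer network; the sizes used (39 inputs, 400 hidden neurons, 39 outputs, biases not counted) are an assumption that happens to reproduce the number, not a confirmed description of the system.

```python
def mlp_weight_count(n_in, n_hidden, n_out, count_biases=False):
    """Number of trainable parameters in a network with one hidden layer."""
    total = n_in * n_hidden + n_hidden * n_out
    if count_biases:
        total += n_hidden + n_out
    return total


# Assumed sizes: 39 MFCC-based inputs, 400 hidden neurons, 39 phoneme outputs.
print(mlp_weight_count(39, 400, 39))  # 39*400 + 400*39 = 31200
```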
7
Single frame and multi-frame input with MFCC FeatureNet

Subsequent frames are joined together
The size of the context is increased to find the minimal PER
300, 400 and 500 neurons in the hidden layer were tested
- minimal change, but 400 is the best

Frames  PER [%]
1       41.6
5       37.5
Best PER: 37.5 % (5 frames)
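Joining subsequent frames can be sketched as follows. The function name is hypothetical, and the handling of the first and last frames (repeating the edge frame) is an assumption; the slides do not say how utterance boundaries are treated.

```python
def stack_frames(features, context):
    """Join each frame with its neighbours; `context` is the total frame count (odd)."""
    if context % 2 != 1:
        raise ValueError("context must be odd so the window is centred")
    half = context // 2
    last = len(features) - 1
    stacked = []
    for t in range(len(features)):
        joined = []
        for k in range(t - half, t + half + 1):
            k = min(max(k, 0), last)  # assumption: repeat edge frames at the boundaries
            joined.extend(features[k])
        stacked.append(joined)
    return stacked
```

With 39-dimensional frames and a context of 5, each stacked vector has 5 × 39 = 195 values.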
8
TempoRAl Patterns

1. Frequency-localized posterior probabilities of phonemes are estimated from the temporal evolution of critical-band energies within a single critical band.
2. These estimates are fed to another class-posterior estimator, which estimates the overall phoneme probability from the probabilities in the individual critical bands.

[Figure: band classifier 1, band classifier 2, ..., band classifier N]
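The two-stage structure can be sketched as below. The band classifiers and the second-stage estimator are neural networks in the described system; here they are passed in as plain callables, and all names are hypothetical. The frame shift is not given on the slides; at an assumed 10 ms shift, a 1 s TRAP would span 101 frames.

```python
def trap_vector(band_energies, band, center, length):
    """Temporal trajectory of one critical band: `length` frames centred on `center`.

    band_energies[t][b] is the energy of band b at frame t.
    Assumption: edge frames are repeated at utterance boundaries.
    """
    half = length // 2
    last = len(band_energies) - 1
    return [band_energies[min(max(t, 0), last)][band]
            for t in range(center - half, center + half + 1)]


def two_stage_posteriors(band_energies, center, length, band_classifiers, merger):
    """Stage 1: one classifier per band. Stage 2: merger over all band outputs."""
    band_outputs = []
    for band, classify in enumerate(band_classifiers):
        band_outputs.extend(classify(trap_vector(band_energies, band, center, length)))
    return merger(band_outputs)
```

Each entry of `band_classifiers` maps a temporal vector to a list of per-band phoneme posteriors, and `merger` maps the concatenated list to the overall phoneme posteriors.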
9
TRAP system scheme
10
MFCC and TRAP on well-matched conditions

Training and testing data are from the same database
Multi-frame MFCC and 1 s long TRAPs perform similarly
Improvement can be obtained when the length of the TRAP is optimized

PER [%]            TIMIT  NTIMIT
MFCC39             41.6   55.6
MFCC39, 5 frames   37.5   49.0
TRAP, 1 s          37.9   49.6
11
MFCC and TRAP on mismatched conditions

Training and testing data are from different databases
The TRAP system yielded better results in both mismatched conditions
It is better to train the system on corrupted speech than on clean speech

PER [%] (train/test)  TIMIT/NTIMIT  NTIMIT/TIMIT
MFCC39                80.9          63.4
MFCC39, 5 frames      80.1          75.7
TRAP, 1 s             75.0          56.6
12
Effect of length of TRAP

The original TRAP length was kept at 1 second to be sure that it covers all information about the phoneme in the critical band, but this length is not optimal
A 300 ms long context is the best for the TIMIT database

PER: 36.1 %
13
Effect of mean and variance normalization

The experiment was performed on the original 1 second long TRAPs
Both normalizations cause significant degradation in well-matched conditions
Mean normalization always helps in mismatched conditions; the benefit of variance normalization is less clear

Normalization / PER [%]  TIMIT  NTIMIT  TIMIT/NTIMIT  NTIMIT/TIMIT
None                     37.9   49.6    75.0          56.6
Mean                     40.5   51.8    73.5          54.7
Mean + variance          42.6   53.2    74.8          54.1
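The two normalizations can be sketched as below. The slides do not say over which span the statistics are computed; this sketch assumes they are computed over each temporal vector separately, and the function name is hypothetical.

```python
import math


def normalize_trap(trap, use_variance=False):
    """Subtract the mean of a temporal vector; optionally scale to unit variance.

    Assumption: statistics are taken over the single vector passed in.
    """
    mean = sum(trap) / len(trap)
    centered = [x - mean for x in trap]
    if not use_variance:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(trap))
    if std == 0.0:
        return centered  # constant trajectory: nothing to scale
    return [c / std for c in centered]
```

Mean normalization removes a constant offset in the band trajectory, which is one plausible reason it helps when training and test channels differ.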
14
TRAP with more than one critical band

Three neighboring temporal vectors were merged and sent to one classifier

System         PER [%]
TRAPS          36.1
3-band TRAPS   33.7
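Merging three neighboring temporal vectors can be sketched as a simple concatenation. The function name is hypothetical, and how the bands at the lowest and highest frequencies are grouped is not stated on the slides, so this sketch only accepts a middle band that has both neighbors.

```python
def three_band_input(band_energies, band, center, length):
    """Concatenate the temporal vectors of bands band-1, band and band+1.

    band_energies[t][b] is the energy of band b at frame t.
    Assumption: edge frames are repeated at utterance boundaries.
    """
    n_bands = len(band_energies[0])
    if not 1 <= band <= n_bands - 2:
        raise ValueError("band needs a neighbour on each side")
    half = length // 2
    last = len(band_energies) - 1
    merged = []
    for b in (band - 1, band, band + 1):
        merged.extend(band_energies[min(max(t, 0), last)][b]
                      for t in range(center - half, center + half + 1))
    return merged
```

The classifier input is three times as long as in the single-band case, which is consistent with the real-time concerns raised for the 3-band version later in the talk.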
15
Implementation and distribution of the SW phnrec

Early experiments were performed with a set of scripts interconnecting executables (trapper, QuickNet, HTK), which are still used for training.
Phoneme recognition is in phnrec, which contains:
- feature extraction (MFCC (HTK-compatible), FeatureNet, TRAPS) from files or microphone
- posterior-probability estimator (NN compatible with QuickNet nets)
- Viterbi decoder; can also work on-line with a fixed delay
Very good as a black box for people who want to treat speech-to-phoneme transcription as a front end
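The decoding step can be illustrated with a minimal off-line Viterbi sketch over frame-level log posteriors, using one state per phoneme and a penalty applied whenever the path enters a new phoneme. This is not the phnrec implementation: the function name, the penalty convention (a non-positive log-domain value) and the absence of on-line fixed-delay output are all assumptions made for illustration.

```python
def viterbi_phonemes(log_post, phonemes, insertion_penalty):
    """Decode a phoneme string from log_post[t][p] with one state per phoneme.

    insertion_penalty <= 0 is added to the score on every phoneme change.
    """
    if insertion_penalty > 0:
        raise ValueError("penalty is expected to be non-positive")
    n = len(phonemes)
    score = list(log_post[0])
    backpointers = []
    for frame in log_post[1:]:
        best_prev = max(range(n), key=lambda p: score[p])
        new_score, pointers = [], []
        for p in range(n):
            stay = score[p]
            switch = score[best_prev] + insertion_penalty
            if best_prev != p and switch > stay:
                pointers.append(best_prev)  # enter phoneme p from the best state
                new_score.append(switch + frame[p])
            else:
                pointers.append(p)          # remain in phoneme p
                new_score.append(stay + frame[p])
        backpointers.append(pointers)
        score = new_score
    # Backtrack from the best final state.
    state = max(range(n), key=lambda p: score[p])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of the same state into single phonemes.
    decoded = [phonemes[path[0]]]
    for prev, cur in zip(path, path[1:]):
        if cur != prev:
            decoded.append(phonemes[cur])
    return decoded
```

A more negative penalty yields fewer phoneme changes and therefore fewer insertions and more deletions, which is the knob the experimental setup balances.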

16
phnrec (2)

Source code for Linux and an EXE for Windows are available for free for research.
Available with nets trained on US English (TIMIT) and Czech (SpeechDat-E).
More languages to come (some language ID experiments are also running in Brno)
Works on-line

http://www.fit.vutbr.cz/speech/sw/phnrec.html
17
Conclusion

A TRAP-based phoneme recognizer was built and compared to MFCC.
Properties of TRAPs were studied and TRAPs were optimized for phoneme recognition.
A new multi-band TRAPs approach was tested and its benefit was shown.
The recognizer was successfully evaluated in a language identification task.
Easy-to-use software was written and is available to the research community.

18
But

Adaptation to meeting data is necessary (training on clean TIMIT is not good at all), as is updating the distribution on the web.
Tests on ICSI, IDIAP and Brno data (which phonemes are going to work best for us: Cz + English?)
Applications: LID already tested, keyword spotting and LVCSR (some papers at Eurospeech make use of phoneme strings).
Phoneme lattices
Real-time issues (the 1-band version runs OK on a reasonable machine, the 3-band version does not): NN weight pruning?

19
THE END