1. Acoustic modeling on telephony speech corpora for directory assistance system applications
Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University, Denmark
lindberg_at_cpk.auc.dk
2. Outline
Part 1 - Acoustic modeling: the reference recogniser (COST 249)
Part 2 - Directory assistance: NaNu - Names and Numbers (Tele Danmark), acoustic model optimisation, project and system details
3. The COST 249 SpeechDat Multilingual Reference Recogniser
http://www.telenor.no/fou/prosjekter/taletek/refrec
COST 249
- F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway)
- B. Lindberg (CPK, Aalborg, Denmark)
- G. Lehtinen (ETH, Zürich, Switzerland)
- Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia)
- B. Milner, D. Chaplin (British Telecom, Ipswich, UK)
- K. Elenius, G. Salvi (KTH, Stockholm, Sweden)
- E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)
4. What is the reference recogniser?
- Phoneme-based recogniser design procedure
- Language-independent
- Fully automatic: one script works straight from the CDs
- Standardised database format: SpeechDat(II), available in many languages worldwide
- Oriented towards telephone applications
- Commonly available recogniser toolkit: HTK
5. Motivation
- A fast start for recognition research in new languages
- Share experience, avoid repeating the same mistakes
- Improve the state of the art
- Share research efforts
- Provide a benchmark for recogniser performance comparison across tasks and languages
- Facilitate true multilingual recognition research
6. Related Work
- COST 232: assumed a TIMIT-like segmented database
- Reference verification systems: CAVE, PICASSO (COST 250)
- GlobalPhone (Schultz & Waibel, ICSLP 98): dictation-type multilingual databases, language-independent and language-adaptive recognition
7. SpeechDat(II) databases
- 20 FDBs (fixed network), 5 MDBs (mobile networks)
- 500-5000 speakers, 4-8 minute recording sessions
- Telephone information and transaction services
- Compatible databases:
  - SpeechDat(E): 5 Central and Eastern European languages
  - SALA: 8 dialect zones in Latin America
  - SpeechDat-Car: 9 languages, parallel GSM and in-car recordings
  - SpeechDat Australian English
8. Core Utterance Types in SpeechDat(II)

Number  Type                              Corpus code
1       isolated digit items              I
5       digit/number strings              B, C
1       natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3       dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4       phonetically rich words           W
9       phonetically rich sentences       S
40      in total
9. Recogniser design - version 0.95
- Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
- Word-internal triphone HMMs, 3 states per model
- Decision-tree state clustering
- Trained from flat start using only orthographic transcriptions and a SpeechDat lexicon
- Difficult utterances removed from the training set
- 1, 2, 4, 8, 16 and 32 diagonal-covariance Gaussian mixtures (schedule sketched below)
- Re-training on re-segmented material
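A minimal sketch of the mixture-doubling schedule named above, scripted in Python around HTK's standard HHEd and HERest tools; the config, file and directory names here are illustrative assumptions, not the refrec scripts themselves.

    import os
    import subprocess

    def reestimate(src_dir, dst_dir, passes=4):
        # A few Baum-Welch passes with HERest after each mixture split.
        os.makedirs(dst_dir, exist_ok=True)
        src = src_dir
        for _ in range(passes):
            subprocess.run(["HERest", "-C", "config", "-I", "train.mlf",
                            "-S", "train.scp", "-H", f"{src}/hmmdefs",
                            "-M", dst_dir, "tiedlist"], check=True)
            src = dst_dir

    def split_mixtures(src_dir, dst_dir, n_mix):
        # HHEd 'MU' command: raise every emitting state to n_mix Gaussians.
        os.makedirs(dst_dir, exist_ok=True)
        with open("mix.hed", "w") as f:
            f.write(f"MU {n_mix} {{*.state[2-4].mix}}\n")
        subprocess.run(["HHEd", "-H", f"{src_dir}/hmmdefs", "-M", dst_dir,
                        "mix.hed", "tiedlist"], check=True)

    current = "hmm_tied_1mix"
    for n_mix in (2, 4, 8, 16, 32):
        split_dir = f"hmm_{n_mix}mix_split"
        reest_dir = f"hmm_{n_mix}mix"
        split_mixtures(current, split_dir, n_mix)
        reestimate(split_dir, reest_dir)
        current = reest_dir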
10. MFCC_0_D_A - feature set

Pre-emphasis: 0.97
Frame shift: 10 ms
Analysis window: Hamming
Window length: 25 ms
Spectrum type: FFT magnitude
Filterbank type: Mel scale
Filter shape: triangular
Filterbank channels: 26
Cepstral coefficients: 12
Cepstral liftering: 22
Energy feature: C0
Deltas: 13
Delta-deltas: 13
Total features: 39
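For illustration, an approximate re-creation of this front end with librosa; this is a sketch under the parameter settings above, not the HTK feature extraction used by refrec, and the two are not numerically identical (different filterbank and liftering conventions).

    import numpy as np
    import librosa

    def mfcc_0_d_a(wav_path, sr=8000):
        y, sr = librosa.load(wav_path, sr=sr)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # pre-emphasis 0.97
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr,
            n_mfcc=13,                     # C0 plus 12 cepstral coefficients
            n_fft=int(0.025 * sr),         # 25 ms analysis window
            hop_length=int(0.010 * sr),    # 10 ms frame shift
            window="hamming",
            n_mels=26,                     # 26 mel filterbank channels
            lifter=22)                     # cepstral liftering 22
        delta = librosa.feature.delta(mfcc)               # 13 deltas
        delta2 = librosa.feature.delta(mfcc, order=2)     # 13 delta-deltas
        return np.vstack([mfcc, delta, delta2])           # 39 features per frame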
11. Test design
Common test suite on SpeechDat:
- I-test: isolated digit recognition (SVIP)
- Q-test: yes/no recognition (SVIP)
- A-test: recognition of 30 isolated application words (SVIP)
- BC-test: unknown-length connected digit string recognition (SVWL)
- O-test: city name recognition (MVIP)
- W-test: recognition of phonetically rich words (MVIP)
Test procedures used:
- SVIP: Small Vocabulary Isolated Phrase
- MVIP: Medium Vocabulary Isolated Phrase
- SVWL: Small Vocabulary Word Loop, scored by NIST alignment (sketched below)
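Word error rate on the string tests is computed from a minimum edit-distance alignment of the reference and hypothesis word sequences. A minimal sketch of that metric (not the NIST alignment tool itself):

    def word_error_rate(ref_words, hyp_words):
        # Levenshtein distance over words: substitutions, deletions, insertions.
        n, m = len(ref_words), len(hyp_words)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[n][m] / max(n, 1)

    # word_error_rate("one two three".split(), "one three".split()) -> 0.333...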
12. Results
- Six labs have completed the training procedure on the SpeechDat(II) databases
- KUN has converted the Dutch Polyphone to SpeechDat(II) format: training only on phonetically rich sentences, tests only on digit strings
- More details available on the web
13. Training Statistics
External information available (either session list, pronunciation lexicon or a phoneme mapping - see the web site)
Results are for Refrec v. 0.93
14. A typical training curve
15. Word error rates
Results are for Refrec v. 0.93
(Chart: word error rates; average number of phonemes in test vocabularies)
16. Word error rates - cont.
17. Word error rates - cont.
18. Word error rates - cont.
19. Language-independent considerations
- Performance probably below state-of-the-art systems
- No whole-word modelling, no cross-word context (especially needed for connected digits)
- A lot of noisy training data has been removed
- No speaker-noise or filled-pause model
- Feature analyser not robust enough
20. Language differences
- Mobile database has 3-5 times the error rate of the FDBs: more robust modeling needed
- Slovenian: high noise level on recordings
21. Conclusion - part 1
- Practical/logistic problems mostly solved
- Future work:
  - Improve language and database coverage
  - More speakers (Swedish: 5000)
  - More challenging tests, large vocabularies
  - More analyses
  - Improved training procedure, clustering
22. Directory assistance: NaNu
Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen
- Recognition of Names and Numbers
- In collaboration with Tele Danmark
- Auto attendant/directory assistance applications
- Large vocabulary - for the first time in Danish
- Exploiting the SpeechDat(II) database
23. Acoustic modeling - decision trees (ref. HTK Book)
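Tree-based state clustering groups triphone states by greedily choosing phonetic context questions that maximise the gain in training-data log likelihood. A simplified single-Gaussian sketch of one such split (in the spirit of the HTK Book; the data structures are assumptions, not HHEd internals):

    import numpy as np

    def pooled_loglik(occ, mean, var):
        # Expected log likelihood of occ frames under one diagonal Gaussian.
        return -0.5 * occ * np.sum(np.log(2 * np.pi * var) + 1.0)

    def merge(states):
        # Pool the sufficient statistics of a set of states into one Gaussian.
        occ = sum(s["occ"] for s in states)
        mean = sum(s["occ"] * s["mean"] for s in states) / occ
        second = sum(s["occ"] * (s["var"] + s["mean"] ** 2) for s in states) / occ
        return occ, mean, second - mean ** 2

    def best_question(states, questions):
        # questions: name -> set of left-context phones answering "yes"
        base = pooled_loglik(*merge(states))
        best = None
        for name, phones in questions.items():
            yes = [s for s in states if s["left_context"] in phones]
            no = [s for s in states if s["left_context"] not in phones]
            if not yes or not no:
                continue
            gain = pooled_loglik(*merge(yes)) + pooled_loglik(*merge(no)) - base
            if best is None or gain > best[1]:
                best = (name, gain)
        return best   # split on this question if the gain exceeds a threshold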
24. Acoustic modeling of Danish diphthongs
25. Acoustic modeling - CMN (cepstral mean normalisation)
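CMN subtracts the long-term mean of each cepstral coefficient, typically per utterance, so that stationary channel differences between handsets and telephone lines are reduced. A minimal sketch:

    import numpy as np

    def cepstral_mean_normalise(features):
        # features: array of shape (n_coefficients, n_frames)
        return features - features.mean(axis=1, keepdims=True)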
26. Acoustic modeling - decision trees
27. NaNu
- Acoustic models
  - SpeechDat - COST 249
  - 20k tied-mixture triphones, 6554 clusters
  - 16-mixture models - 100k mixture components
- Database
  - ¼ million subscribers (Århus and Næstved areas)
  - Vocabulary extracted from the database, for entries for which
    - there is a minimum of two occurrences
    - a transcription exists (Onomastica)
    (selection rule sketched below)
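An illustrative sketch of the vocabulary selection rule just stated: keep a name only if it occurs at least twice among the subscriber entries and a pronunciation is available. The onomastica lexicon here is a stand-in dictionary, not the actual Onomastica resource interface.

    from collections import Counter

    def build_vocabulary(subscriber_names, onomastica):
        # subscriber_names: iterable of name tokens from the database
        # onomastica: dict mapping a name to its pronunciation(s)
        counts = Counter(subscriber_names)
        return {name: onomastica[name]
                for name, n in counts.items()
                if n >= 2 and name in onomastica}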
28. Vocabulary and Coverage
(Chart: NaNu vocabulary coverage of unique database entries, Denmark; source: Tele Danmark)
29. SLANG
- Recogniser - Spoken LANGuage
- Speech recognition research platform
- For dialogue system execution
- Modular design and implementation (C)
- Frame-synchronous operation
- Dynamic tree-structured decoder
- Optimised towards large-vocabulary recognition (Gaussian mixture selection, sketched below)
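Gaussian mixture selection cuts the dominant cost of large-vocabulary decoding by evaluating exactly only those mixture components whose pre-clustered centroid lies close to the current observation, while the rest receive a floor score. A rough sketch; the clustering, shortlist size and floor value are assumptions, not the SLANG implementation:

    import numpy as np

    def selected_log_likelihood(obs, means, variances, weights,
                                centroids, assignment,
                                shortlist_size=2, floor=-1e3):
        # centroids: (K, D) codebook; assignment[i] = codeword of component i
        dists = np.sum((centroids - obs) ** 2, axis=1)
        shortlist = set(np.argsort(dists)[:shortlist_size])
        total = -np.inf
        for i, (m, v, w) in enumerate(zip(means, variances, weights)):
            if assignment[i] in shortlist:
                # full diagonal-covariance Gaussian log likelihood
                ll = (np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v)
                                               + (obs - m) ** 2 / v))
            else:
                ll = np.log(w) + floor    # cheap back-off score
            total = np.logaddexp(total, ll)
        return total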
30. NLP
- N-best lists are parsed into semantic frames and SQL queries are generated according to the following strategy (sketched below):
  1. simple 1-best match
  2. full search in all N-best lists
  3. under-specified query (street name and last name required to be contained in the N-best list)
- Output is converted to synthetic speech.
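A sketch of that three-step query back-off; the frame fields and table layout are illustrative assumptions, not the actual NaNu schema:

    def build_queries(nbest_frames):
        # nbest_frames: semantic frames, best hypothesis first, e.g.
        # {"first": "Jens", "last": "Hansen", "street": "Vestergade"}
        q_full = ("SELECT * FROM subscribers "
                  "WHERE first=%s AND last=%s AND street=%s")
        q_loose = "SELECT * FROM subscribers WHERE last=%s AND street=%s"
        queries = []
        # 1. simple 1-best match
        top = nbest_frames[0]
        queries.append((q_full, (top["first"], top["last"], top["street"])))
        # 2. full search over all N-best hypotheses
        for f in nbest_frames[1:]:
            queries.append((q_full, (f["first"], f["last"], f["street"])))
        # 3. under-specified: only last name and street name must occur in the N-best lists
        for last in {f["last"] for f in nbest_frames}:
            for street in {f["street"] for f in nbest_frames}:
                queries.append((q_loose, (last, street)))
        return queries   # issue in order until one returns rows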
31. Dialogue System
- Java implementation of the dialogue system and telephony server
- Uses the SLANG speech recognition library in C
- Connects to a public-domain SQL database (MySQL)
- System-directed dialogue
- One word per turn - high perplexity
- Dynamic, parallel allocation of recognisers
32. Performance
- Lack of test data - SpeechDat data were used (!)
- Person names task: first name, optional middle name, last name
- 434 test utterances (speaker independent)
- Results from predecessor configuration (10,646 last names, 2,777 first/middle names)
- Recognition accuracy, 1-best: 39.1%
33. Conclusion - Part 2
- A real system probably needs application-specific data - not to mention the dialogue aspect!
- The effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
- Limited number of pronunciation variants available
- Immediate steps are:
  - test data!
  - acoustic validation of retrieved candidates
- Mixed-initiative dialogue - CPK's incentive to work on NaNu!