Title: Tools and Methodologies for the Development of Speech Recognition Enabled Applications
1. Tools and Methodologies for the Development of Speech Recognition Enabled Applications
Dr. Ir. Jan Verhasselt, Director Embedded ASR Research
February 14, 2007
2. Major Goals of This Presentation
- Answer the questions:
  - What are the requirements for an ASR engine that can be used for voice-enabling a wide range of applications in the embedded market?
  - What process can be used to guide the development of applications that incorporate speech recognition?
  - How can tools help to reduce the cost of developing such applications?
- As a side effect, give insight into a number of criteria that are important when choosing a speech recognition engine for a certain application
3. Overview
- Embedded Speech Recognition at Nuance
- Important ASR Engine Features
- Other Requirements for Embedded ASR Product
- Speech Application Development Process
- Application Development Tools
4. Embedded ASR Portfolio at Nuance
[Portfolio chart plotting Task Complexity against Processor Capability (low-end DSP, ARM7/ARM9, high-end RISC):]
- Dictation: Dragon NaturallySpeaking; fusion of VoCon 3200 v2 and Dragon NaturallySpeaking into VoCon 3200 v3 and VoCon Mobile X3 (more natural command and control, voice-controlled MP3)
- Entertainment (basic voice-controlled MP3) and Navigation (Voice Destination Entry, VDE): VoCon 3200 v2, the current flagship automotive ASR engine
- Command and Control / Phone Dialing (advanced VAD): VoCon Mobile XGT
- Phone Dialing (simple VAD): VoCon SF, VoCon Mobile
5. Focus of This Presentation: VoCon 3200
- Key Features
  - Command and Control including large name lists (VAD, VDE, MP3)
  - Continuous speaker-independent speech recognition with support for speaker-dependent voice tags and speaker adaptation
  - Noise-robust for the automotive environment (far-talk); also very good accuracy in close-talk and noise-free environments
  - Modularity
  - Portability
  - Post-processors
  - CFG grammar formalism
  - Off-line and on-line grammar compilation/modification/activation
6. ASR Engine Features Beyond Core ASR
- Noise-robust core ASR is important, but not enough
- Important product components for speaker-independent command-and-control engines for medium to large vocabularies:
  - Grammar processor: grammar formalism(s), grammar compiler, dynamic activation/modification
  - Lexicon and Pronunciation Guesser
  - Natural Language Understanding
  - Voice Activity Detection, extra-event rejection
  - Returned results
  - Speaker Normalization/Adaptation and/or User Words
  - Specifics for name dialing and destination entry
7. ASR Engine Features
- Grammar formalism
  - Context-Free Grammars (CFG) to describe tasks
  - Example (BNF) below
- Grammar compiler turns grammars into ASR contexts
  - An ASR context defines what the ASR engine can recognize
  - An ASR context contains an FSM representation of the grammars
  - Optional FSM minimization
  - Grammar compilation is often done off-line

!grammar Order
!start <Speech>
<Speech>: !optional(I would like) (<Drinks> | <Food>) please
<Drinks>: a lemonade | a milkshake | an orange juice
<Food>: a hamburger | French fries
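The coverage of a small CFG like the Order grammar above can be checked by exhaustively expanding it. A minimal sketch, assuming a hand-written plain-Python encoding of the grammar (not the VoCon toolchain):

```python
from itertools import product

# The "Order" grammar from the slide: each rule maps to a list of
# alternatives; each alternative is a sequence of tokens or <rule> refs.
GRAMMAR = {
    "<Speech>": [["!optional(I would like)", "<Drinks>", "please"],
                 ["!optional(I would like)", "<Food>", "please"]],
    "<Drinks>": [["a lemonade"], ["a milkshake"], ["an orange juice"]],
    "<Food>":   [["a hamburger"], ["French fries"]],
}

def expand(symbol):
    """Yield every sentence covered by a symbol (handles !optional too)."""
    if symbol.startswith("!optional("):
        yield ""                              # the optional part may be skipped
        yield symbol[len("!optional("):-1]
    elif symbol in GRAMMAR:
        for alternative in GRAMMAR[symbol]:
            for parts in product(*(expand(tok) for tok in alternative)):
                yield " ".join(p for p in parts if p)
    else:
        yield symbol                          # terminal word(s)

for sentence in expand("<Speech>"):
    print(sentence)  # e.g. "I would like a lemonade please"
```

The Context Verifier Tool described later performs this kind of sentence generation on real compiled contexts.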
8. ASR Engine Features
- Specifying word pronunciations
  - The LH phonetic alphabet
  - Word-specific pronunciations: the !pronounce directive
  - Grammar-specific pronunciations: the !pronounce statement
  - Phonetic dictionary
  - Pronunciation Guesser (G2P)
- Precedence: !pronounce directive > !pronounce statement > dictionary > G2P
- Example:
  - !start <rule>
  - !pronounce coffee kO.fi | kA.fi
  - <rule>: I have read !pronounce(REd) a book | I read a book | I drink coffee
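The four-level precedence can be pictured as a simple fall-through lookup. A toy sketch with hypothetical helper and argument names (the real engine resolves this internally):

```python
def pronunciations(word, directive=None, statements=None, dictionary=None, g2p=None):
    """Resolve pronunciations with the precedence from the slide:
    !pronounce directive > !pronounce statement > dictionary > G2P."""
    statements = statements or {}
    dictionary = dictionary or {}
    if directive:                       # word-specific !pronounce directive wins
        return [directive]
    if word in statements:              # grammar-wide !pronounce statement
        return statements[word]
    if word in dictionary:              # phonetic dictionary lookup
        return dictionary[word]
    return [g2p(word)] if g2p else []   # fall back to the pronunciation guesser

# "coffee" has a grammar-level statement; "read" gets an inline directive.
stmts = {"coffee": ["kO.fi", "kA.fi"]}
print(pronunciations("read", directive="REd", statements=stmts))  # ['REd']
print(pronunciations("coffee", statements=stmts))                 # ['kO.fi', 'kA.fi']
```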
9. ASR Engine Features
- Dynamic activation
  - Goal: quickly activate/deactivate parts of the grammar
  - Directly on the engine: no un-loading of contexts, a single engine instance
  - Rule/Grammar/Context/Label (de)activation
  - On a Context: FSM node (de)activation
- Dynamic modification
  - Goal: quickly modify parts of a grammar without complete recompilation
  - Add and remove rules to/from a grammar
  - Add and remove alternatives to/from existing rules
  - On a Context: add and remove a list of names to/from a context
10. ASR Engine Features
- Natural Language Understanding
  - Why?
    - Extract meaning from the user's utterance
    - Make the application layer independent of how the user exactly phrased the utterance
  - Two grammar formalisms that differ in NLU statements only:
    - BNF+EM: NLU handled by the reco engine itself
    - BNF+AM: NLU post-processor on the reco result
  - Robust deep parsing versus shallow parsing
    - Reason: tailor the balance between expressive power and footprint to the needs of the customer
11. ASR Engine Features
- NLU in BNF+EM
  - NLU information is stored in the ASR context
  - Lowest memory requirements: !id(number)
  - Modify the spoken utterance: !action and !ignore
- NLU in BNF+AM
  - Syntax-directed translation by means of a CFG parser
  - NLU result: a set of attribute-value pairs
- Examples:
  - <City>: New York !id("1") | The Big Apple !id("1")
  - <City>: New York !action("NY") | The Big Apple !action("NY")
  - <number>: <tens> <units> price=sum(<tens>, <units>)
  - <tens>: 20 | 30 | ... | 90
  - <units>: 1 | 2 | ... | 9
12. ASR Engine Features
- Voice Activity Detection
  - Detect start-of-speech
    - Saves CPU during leading silence
  - Detect end-of-speech
    - Determines the responsiveness of the recognition engine
- Extra-event rejection
  - Reject noises
  - Extra-event models for coughs, car horns, wipers, mouth clicks, ...
  - Can be put in parallel to the main grammar(s), with their own DP (avoids pruning effects)
  - If the extra-event models score better than the main grammar(s), the result type is set to REJECTED
13. ASR Engine Features
- Returned results/events
- Signal events
  - Abnormal conditions: signal too loud, bad SNR, ...
  - At regular intervals: SNR, energy level, ...
  - At certain moments: trailing silence detected, ...
  - ⇒ This information can be used by the application to give feedback to the speaker, adapt the dialog strategy, ...
- Recognition result
  - N-best alternatives
  - Confidence values at word and sentence level
  - Word segmentation
  - Result type: FINAL or REJECTED (extra-event)
  - ⇒ Configurable rejection behaviour, based on confidence values and result type
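The combination of result type and confidence values typically drives a small decision rule in the application. A minimal sketch, with illustrative thresholds and field names (not the VoCon API):

```python
from dataclasses import dataclass

@dataclass
class RecoResult:
    hypothesis: str
    result_type: str    # "FINAL", or "REJECTED" when extra-event models won
    confidence: float   # sentence-level confidence, 0.0 .. 1.0

def decide(result, accept_threshold=0.7, confirm_threshold=0.4):
    """Map a recognition result to a dialog action based on the
    result type and the sentence-level confidence."""
    if result.result_type == "REJECTED":
        return "reprompt"                     # extra-event models scored best
    if result.confidence >= accept_threshold:
        return "accept"                       # act on the hypothesis directly
    if result.confidence >= confirm_threshold:
        return "confirm"                      # ask the user to confirm
    return "reprompt"                         # too unreliable, ask again

print(decide(RecoResult("call home", "FINAL", 0.85)))  # accept
print(decide(RecoResult("call home", "FINAL", 0.55)))  # confirm
```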
14. ASR Engine Features
- Speaker Adaptation
  - Acoustic model adaptation based on a few tens of utterances
  - Reduces the error rate by up to 30% relative
  - Even more for non-native speakers
- Speaker normalization
  - Design of the feature extraction, a.o. Cepstral Mean Normalization
- Speaker-dependent words: User Words or voice tags
  - Number of training utterances needed
  - Combination of SD and SI words
  - Works well in noisy conditions when trained in clean conditions
  - Confusability check
15. ASR Engine Features
- Specifics for name dialing and destination entry
- Dedicated isolated-name search algorithm
  - Low memory and CPU requirements for long lists of isolated words, e.g. street names, city names, person names, stock quotes, ...
- Spelling post-processor
  - Two steps:
    - Recognize the letter sequence (normal reco engine)
    - Find the best-matching name from a list (post-processor)
  - Allows spelling errors, even deletions and insertions
  - Supports incremental partial spelling
16. Other Requirements for Embedded ASR Product
- Footprint
  - Storage, peak RAM, CPU needs
- Scalability
  - Trade-off: recognition accuracy versus footprint
- Modularity
  - Trade-off: supported features versus footprint
- Portability
  - Abstract and isolate processor- and OS-specific functionality
- Re-usability
  - Across languages, grammar formalisms, character encodings
17. Other Requirements for Embedded ASR Product
- Language Portfolio
  - Cost-efficient production of tens of languages
  - Language-specific data, not code
- Documentation
  - Getting Started, Functional Reference (API), development formalisms like BNF and LH, Application Notes, Training Courses, Sample Programs, Demonstration Tools
- QA
  - Automated nightly Build and Test System
  - Code Checkers
  - Design and Release Process
18. A Recap: What's Next?
- Recap
  - We have seen that a commercial ASR product offering
    - has many features beyond core noise-robust ASR
    - has many other requirements besides functionality
  - We have introduced relevant ASR terminology
- Now we're ready to
  - Analyse what it takes to create an attractive speech-enabled application
  - Introduce tools and methods that help such design and development
19. The Ultimate Goal for Speech in Applications
User satisfaction through task completion success
- Variables affecting task completion rate and speed:
  - Technology
    - Performance of the ASR engine, a.o. acoustic model size
    - Engine parameter settings, a.o. choice of search type
    - Enabling technology: spelling, adaptation, tools, ...
  - User Interface
    - Appropriateness of prompts
    - Fall-back strategies: n-best candidates, spelling, SD words
    - Rejection/confirmation methods
  - System design and implementation
    - Speaker adaptation: acoustic model, language model (DSM)
    - Quality of grammars and vocabulary
    - Proper use of session data
  - User input
    - Speaker characteristics: proper pronunciations?
    - Audio quality / signal-to-noise ratio (SNR)
20. Speech Application Development Process
[Process flow diagram. The numbered steps include: Specification; Language Model; Interaction Development (Grammars, Prompting, Interaction Flow); System Integration; Updates to the Recognition Package; Functional Testing; Usability Testing; Provision of Data Logging Capability; Create Usability Test Scenarios; Recruit Testers; Create Data Collection Scripts; Speech Data Collection; Transcribe and Validate Data; Performance Validation; Tuning (Grammars, Pronunciations, Off-Line Analysis, Improve Grammar Coverage); Repeat Functional Testing as Required; Prepare Performance Validation Scenarios; Production.]
21. Performance Tuning
- UI Design
  - Prompts, Grammars, UI strategies
- Implementation
  - Search algorithm
  - Grammar technology
  - Audio Path
- Performance Validation and Parameter Tuning
  - Tuning parameters for maximum accuracy at minimum resources
  - Data Collection and Analysis
  - Performance Validation Reports
- Acoustic Model Tuning
  - Session Data
  - Dynamic Semantic Models
  - Speaker Adaptation
  - Model Merging
22. UI Design
- Prompts to guide users to what they can say
- Grammars designed to capture what users are likely to say
- Localization aspects
- UI Strategies
23. Command and Prompt Design
- How do users refer to frequency ranges?
  - AM/FM frequencies
  - Digits, natural numbers, pairs?
    - 530 -> five three zero
    - 530 -> five hundred and thirty
    - 530 -> five thirty
    - 1610 -> sixteen ten
    - 1610 -> eintausendsechshundertundzehn (German)
- Challenge: recognition accuracy vs. freedom of input
- Nuance principle
  - Prioritize accuracy on the expected user input as far as possible, while offering as much freedom as accuracy allows
  - For example, it is better to have pairs recognized perfectly than to offer pairs as well as natural numbers
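Generating the competing spoken forms for a frequency can be scripted when designing such grammars. A toy sketch with number-word tables truncated to just the two examples above:

```python
# Minimal number-word tables; a real grammar generator would cover 0..99.
DIGITS = "zero one two three four five six seven eight nine".split()
SMALL = {5: "five", 10: "ten", 16: "sixteen", 30: "thirty"}

def digit_by_digit(freq):
    """530 -> 'five three zero': one word per digit."""
    return " ".join(DIGITS[int(d)] for d in str(freq))

def as_pairs(freq):
    """530 -> 'five thirty', 1610 -> 'sixteen ten': split off the last
    two digits and speak both halves as small numbers."""
    s = str(freq)
    head, tail = s[:-2], s[-2:]
    return f"{SMALL[int(head)]} {SMALL[int(tail)]}"

for f in (530, 1610):
    print(f, "->", digit_by_digit(f), "/", as_pairs(f))
```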
24. UI and Grammar Design: VoCon 3200
- Synonyms increase grammar coverage
- Optimize different constraints:
  - Define the most important ways in which users refer to a command
  - Optimize prompts so that the variation in responses is minimized as far as possible
- Allowable variation is determined by:
  - Expected recognition accuracy, e.g. expected SNR (close-talk vs. far-talk)
  - RAM considerations: size of grammar and search space (possibly after grammar optimization)
  - CPU considerations: recognition latency, grammar loading times, choice of search algorithm

PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria | etc.
25. UI and Grammar Design: VoCon 3200
PROMPT: Which station are you travelling to?
- Pronunciations increase population coverage
- Optimize different constraints:
  - More pronunciations lead to larger RAM and CPU requirements
  - Restrict to those variants that cover normal variation across the population
- Assure recognition accuracy
  - The Confusability Tool identifies words with similar phonetic transcriptions

Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
26. Confusability Checks
[Screenshot: confusability report listing critical pairs (Command 1, Command 2) with their confusion score and number of confusions]
27. UI Strategies
- One-shot versus multiple turns
  - Accuracy and response time versus perceived task time
- Golden path
  - Offer one-shot for the most likely commands/choices
  - Provide multiple turns or disambiguation for less frequent cases
- Offer alternative strategies
  - Success for every user rather than for the average user
  - E.g. offer a spelling alternative
    - Users don't always know how to speak a name
    - Spelling can disambiguate confusable words
28. Implementation
- Search algorithm selection
- Grammar technology
- Audio Path
29. Search Algorithm Selection
- VoCon 3200 in principle uses a two-pass search
- Three basic engine types can be selected for the first pass:
  - General purpose: Wordpair N-best DP
    - Search is performed on the word-level FSM in the context
  - Dedicated to long item-list recognition: TreeDP (and variants)
    - Search is performed on a phonetic tree
  - Large grammars describing natural utterances incorporating one or more long item lists: LexTreeDP
- An optional second pass rescores the N-best list that results from the first pass
- The search algorithms have different memory/resource usage
30. Grammar Technology
- VoCon 3200 offers Grammars and Contexts
- Grammars
- More flexibility, easy run-time modification
- Larger resource needs (loading time, memory)
- Contexts
- Highly optimized, minimal resource needs
31. Audio Path
- Garbage (sound) in -> garbage (results) out
- Audio recommendations
  - SNR: below 5 dB, accuracy drops quickly
  - Bandwidth: for 16 kHz models, extending the bandwidth from 7 to 7.3 kHz can yield a 5% relative WER reduction
  - 12-bit dynamic range; no AGC
- See the VoCon 3200 Audio Recommendations document
32. Performance Validation and Parameter Tuning
- Tuning parameters for maximum accuracy at minimum resources
- Data Collection and Analysis
- Performance Validation Reports
33. Accuracy Metrics
- Did the speaker speak in- or out-of-vocabulary/grammar (OOV)?
  - OOVs can also be noise
- Did the recognizer make a correct decision (or a false one)?
  - Accepted the result?
  - Rejected the result?
  - Was confirmation required?

SYSTEM: You'd like to make a reservation, is that correct?
SYSTEM: I'm sorry, I didn't get that. Please tell me again what you are calling about.
34. Within Resource Constraints
- CPU limits
- Loading time
- Response time (latency)
- Memory limits
- Dynamic memory usage
35. Word/Sentence Error Rate Evaluation
- ASR is a statistical process ⇒ measuring its performance is a statistical estimation problem
- Evaluation of accuracy requires enough representative data to derive statistically significant error rates
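Word error rate itself is conventionally computed as the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with dynamic programming over the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substitution (my -> the) and one deletion (please): 2/4 = 0.5
print(word_error_rate("call my office please", "call the office"))  # 0.5
```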
36. Confidence Intervals
- The 95% confidence interval for a measured WER f is (f - Δ, f + Δ), with Δ = 1.96 · √(f(1 - f)/N) for N test tokens
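The interval is straightforward to compute; for example, 50 errors in 1000 words gives a WER of 5.0% with a 95% confidence interval of roughly 3.6% to 6.4%:

```python
import math

def wer_confidence_interval(errors, n, z=1.96):
    """95% binomial confidence interval for an error rate errors/n,
    using the normal approximation from the slide."""
    f = errors / n
    delta = z * math.sqrt(f * (1 - f) / n)
    return f - delta, f + delta

low, high = wer_confidence_interval(errors=50, n=1000)
print(f"WER 5.0%, 95% CI: [{low:.3f}, {high:.3f}]")  # [0.036, 0.064]
```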
37. Tuning Process: 5 Steps
1. Collect data
2. Measure
3. Analyse
4. Experiment
5. (Pre-)Release
(Iterate)
38. Evaluation: Offline Testing
- Offline recordings in a car similar to the target car
  - Same microphone
  - About the same distance between the speaker's mouth and the microphone
  - About the same background noise, at different conditions
- ⇒ Outcome: offline test report
  - Check for recognition problems
  - Check wording from a recognition-performance point of view
  - Get first feedback from test persons
39. Evaluation: Online Test
40. Tuning of Applications
Tuning iterations: Collect data -> Measure -> Analyse -> Experiment -> (Pre-)Release -> (Iterate)
- What data?
- How to collect?
- What to measure?
- How to measure?
41. Evaluation: Performance Validation and Tuning Report
- Tuning of grammars, commands and dictionaries
  - Based on spontaneous as well as correct utterances
  - Optimize regarding RAM, heap size, format, ID usage
  - Optimize regarding usage and recognition accuracy
- Tuning of parameters
  - absoluteThreshold, MinSpeech, trailing silence, pruning, ...
- ⇒ Nuance/OEM/Tier 1
  - Adjust the HMI design and implementation according to the tuning results
42. Performance Validation Reports
43. Acoustic Model Tuning
- For audio characteristics: Session Data
- For speaker characteristics: Speaker Adaptation
- For accuracy versus size: Model Merging and Compiling
44. Session Data
- The engine automatically adapts to the speaker and the environment (microphone, room characteristics; NOT background noise)
- This information is contained in so-called session data
- The application can retrieve session data, store it and re-load it
- Language (AMO) specific
- General rule:
  - Clear session data at system start-up (unknown speaker)
  - Re-use session data during a driving session
45. Speaker Adaptation
- Adaptation to a single user
  - Most improvement for speakers with low accuracy
  - Supervised enrollment: min. 10 s of speech distributed over 20 different short commands
  - Supervised selection by the application (e.g. based on key or phone identification)
  - Fast loading (acoustic model modified in RAM)
- Adaptation to the environment based on a set of users
  - Same technology, but enrollment done off-line based on a set of users collected in the target environment
46. Acoustic Model Merging and Compiling
- Three standard sizes of acoustic models per language: ultra-compact (320 kB), compact (780 kB), full (4 MB)
- Each of these models is complete: they can recognize any phoneme of the selected language
- VoCon 3200 AMOs have built-in word models for very frequent, important words
  - Digits 0-9 and letters (not with ultra-compact models)
  - Exceptionally other words, depending on the language
- Merging
  - Add parts of e.g. the full AMO to the compact AMO
  - Resulting AMO
    - A bit larger than compact, a bit more CPU
    - Better performance for the selected parts
- Compiling
  - Keep only the parts of the model that are used by the set of grammars
47. Language Model Tuning: DSM
- How will speakers vary?
- Language models
  - bias the recogniser
  - increase realised accuracy
- Language models in VoCon(X)3(200)
  - Currently in products only by re-scoring the N-best list of recognition hypotheses
  - Dynamic Semantic Models (DSM) adapt to the speaker's usage history
- Examples
  - DSM for VDE
  - DSM for VAD

PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model: 50% London, 16% Euston, 12% Waterloo, 5% Paddington, etc.
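Re-scoring an N-best list with such usage-history priors can be sketched as a log-linear combination of acoustic score and prior probability (illustrative weights and scores, not the DSM implementation):

```python
import math

# Usage-history priors, as in the example language model above.
PRIORS = {"London": 0.50, "Euston": 0.16, "Waterloo": 0.12, "Paddington": 0.05}

def rescore(nbest, priors, lm_weight=0.5, floor=0.01):
    """Re-rank (hypothesis, acoustic_score) pairs by a log-linear
    interpolation of acoustic score and usage-history prior."""
    def combined(pair):
        hyp, acoustic = pair
        prior = priors.get(hyp, floor)  # unseen stations get a small floor
        return (1 - lm_weight) * math.log(acoustic) + lm_weight * math.log(prior)
    return sorted(nbest, key=combined, reverse=True)

# Acoustically "Euston" narrowly beats "London", but the prior flips the order.
nbest = [("Euston", 0.40), ("London", 0.38), ("Paddington", 0.22)]
print(rescore(nbest, PRIORS)[0][0])  # London
```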
48. Dynamic Semantic Models
PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model: 50% London, 16% Euston, 12% Waterloo, etc.
Custom Grammar, Lexicon, DSM
- BENEFITS
  - Increased coverage
  - Increased accuracy
- Network versus embedded
  - In embedded: no re-tuning after a first limited deployment
  - In embedded: too little real data for fine-grained statistical language models (use categories instead); BUT the application can adapt to a particular speaker or a small set of speakers (exploit usage history in the DSM)
49. ASR Application Development Tools: Overview
- Grammar and Pronunciation Editing and Analysis Suite
  - Purpose:
    - Fast development and testing of ASR grammars and pronunciation dictionaries
    - Allow initial evaluation of grammar compilation speed, recognition speed and recognition accuracy
- Recognition Analysis Suite
  - Purpose:
    - Get the best out of our ASR engines by tuning the most important engine parameters, by further tuning of pronunciations, etc.
    - Collect speech utterances to allow the tuning
50. ASR Tools
[Tool-chain diagram: the Grammar Tools and Pronunciation Tools produce grammars and a lexicon for the Grammar Compiler; the Engine Tuning Tools supply configuration parameters to the ASR Engine; the Logging Library captures speech data, which the Data Preparation Tools turn into reference data.]
51. Grammar and Pronunciation Analysis Tools
- Grammar Editor
  - Syntax highlighting, search and replace, spelling checker, wide support for character encodings, smart indentation, folding, ...
- Grammar Creator Tool
  - Create an ASR grammar from a list of names, possibly including partial spelling, actions, ...
- Grammar Compiler Tool
  - Compile a text grammar into its binary equivalent; experiment with grammar compilation options
- Context Compiler Tool
  - Compile text grammar(s) into an equivalent binary context (for context-from-buffer functions only)
- Dictionary Compiler Tool
  - Compile a text dictionary into an equivalent binary dictionary
- Spelling Tree Compiler Tool
  - Compile a list of words (e.g. city names) into a binary buffer that can be loaded into the spelling post-processor
52. Grammar Editor
53. Grammar and Pronunciation Analysis Tools
- Model Compiler Tool
  - Produce a reduced, grammar-specific acoustic model for a given set of fixed grammars (exception: user words)
- Vocabulary Verifier Tool
  - Check a grammar's vocabulary and word pronunciations
- Context Verifier Tool
  - Generate sentences described by the context, check whether a sentence is covered by the context, ...
- Recognition Test Tool
  - Test recognition on a single utterance, either from a previously recorded file or with microphone input
- Confusability Tool
  - Identify confusable word or sentence pairs based on their pronunciations
- User Dictionary Editor
  - GUI tool to create an exception dictionary with better phonetic transcriptions for certain words
54. Recognition Analysis Tools
- Audio Data Collector
  - GUI tool to make elaborate utterance recordings
- Log Importer Tool
  - Convert a binary log file (created by applications that use the logging library from the Speech API) into a text log file
  - Takes care of conversions of data types and concatenation of small audio buffers into utterances
- Log Extractor Tool
  - Convert a central log file into files usable by e.g. the Batch Recognition Tool
  - Allows filtering of interesting information, e.g. only utterances of a certain speaker, and/or in a certain state of the dialog, ...
55. Recognition Analysis Tools
- Sound Tool
  - GUI tool to listen to and analyze recorded utterances, to spot bad signal quality
- Speech Verifier Tool
  - Annotate recorded utterances with orthographic transcriptions
- Batch Recognition Tool
  - Perform recognition on a series of recorded utterances
  - Supports the spelling and NLU post-processors
  - Experiment with all engine parameters
- Batch Userword Training Tool
  - Train speaker-dependent user words on a set of recorded utterances, possibly from different speakers
  - Can be used in the Batch Recognition Tool
56. Recognition Analysis Tools
- Batch Speaker Adaptation Enrolment Tool
  - Train speaker profiles that can be used to adapt the speaker-independent acoustic model to a speaker
  - Can be used in the Batch Recognition Tool
- Scoring Tool
  - Analyze the recognizer's output on a series of utterances and generate a detailed error report
  - Select utterances with particular errors, or from particular speakers, or ...
  - Also supports analysis and tuning of rejection performance
- Engine Tuning Tool
  - Automatic tuning of optimal engine parameters by running batch recognition on recorded/logged speech
57. Thanks
- Thank you for your attention!