Title: Tools and Methodologies for the Development of Speech Recognition Enabled Applications
1. Tools and Methodologies for the Development of Speech Recognition Enabled Applications
Dr. Ir. Jan Verhasselt, Director Embedded ASR Research
February 14, 2007
2. Major Goals of This Presentation
- Answer the questions:
  - What are the requirements for an ASR engine that can be used for voice-enabling a wide range of applications in the embedded market?
  - What process can be used to guide the development of applications that incorporate speech recognition?
  - How can tools help to reduce the cost of developing such applications?
- As a side effect, give insight into a number of criteria that are important when choosing a speech recognition engine for a certain application
3. Overview
- Embedded Speech Recognition at Nuance
- Important ASR Engine Features
- Other Requirements for Embedded ASR Product
- Speech Application Development Process
- Application Development Tools
4. Embedded ASR Portfolio at Nuance
[Portfolio chart plotting Task Complexity against Processor Capability (low-end DSP, ARM7/ARM9, high-end RISC):]
- Dictation: Dragon NaturallySpeaking; fusion of VoCon 3200 v2 and Dragon NaturallySpeaking into VoCon 3200 v3 and VoCon Mobile X3 (more natural command and control, voice-controlled MP3)
- Entertainment (basic voice-controlled MP3) and Navigation (Voice Destination Entry, VDE): VoCon 3200 v2, the current flagship automotive ASR engine
- Command and Control / Phone Dialing (advanced VAD): VoCon Mobile XGT
- Phone Dialing (simple VAD): VoCon SF, VoCon Mobile
5. Focus of This Presentation: VoCon 3200
- Key Features
  - Command and Control including large name lists (VAD, VDE, MP3)
  - Continuous speaker-independent speech recognition with support for speaker-dependent voice tags and speaker adaptation
  - Noise-robust for the automotive environment (far-talk); also very good accuracy in close-talk and noise-free environments
  - Modularity
  - Portability
  - Post-processors
  - CFG grammar formalism
  - Off-line and on-line grammar compilation/modification/activation
6. ASR Engine Features Beyond Core ASR
- Noise-robust core ASR is important, but not enough
- Important product components for speaker-independent command-and-control engines for medium to large vocabularies:
  - Grammar processor: grammar formalism(s), grammar compiler, dynamic activation/modification
  - Lexicon and Pronunciation Guesser
  - Natural Language Understanding
  - Voice Activity Detection, extra-event rejection
  - Returned results
  - Speaker Normalization/Adaptation and/or User Words
  - Specifics for name dialing and destination entry
7. ASR Engine Features
- Grammar formalism
  - Context-Free Grammars (CFG) to describe tasks
  - Example (BNF) below
- Grammar compiler turns grammars into ASR contexts
  - An ASR context defines what the ASR engine can recognize
  - An ASR context contains an FSM representation of the grammars
  - Optional FSM minimization
  - Grammar compilation is often done off-line

!grammar Order
!start <Speech>
<Speech>: !optional(I would like) (<Drinks> | <Food>) please
<Drinks>: a lemonade | a milkshake | an orange juice
<Food>: a hamburger | French fries
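The coverage of a small CFG like the Order grammar above can be checked by exhaustively expanding it. A minimal sketch, assuming a hand-written plain-Python encoding of the grammar (not the VoCon toolchain):

```python
from itertools import product

# The "Order" grammar from the slide: each rule maps to a list of
# alternatives; each alternative is a sequence of tokens or <rule> refs.
GRAMMAR = {
    "<Speech>": [["!optional(I would like)", "<Drinks>", "please"],
                 ["!optional(I would like)", "<Food>", "please"]],
    "<Drinks>": [["a lemonade"], ["a milkshake"], ["an orange juice"]],
    "<Food>":   [["a hamburger"], ["French fries"]],
}

def expand(symbol):
    """Yield every sentence covered by a symbol (handles !optional too)."""
    if symbol.startswith("!optional("):
        yield ""                              # the optional part may be skipped
        yield symbol[len("!optional("):-1]
    elif symbol in GRAMMAR:
        for alternative in GRAMMAR[symbol]:
            for parts in product(*(expand(tok) for tok in alternative)):
                yield " ".join(p for p in parts if p)
    else:
        yield symbol                          # terminal word(s)

for sentence in expand("<Speech>"):
    print(sentence)  # e.g. "I would like a lemonade please"
```

The Context Verifier Tool described later performs this kind of sentence generation on real compiled contexts.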
8. ASR Engine Features
- Specifying word pronunciations
  - The LH phonetic alphabet
  - Word-specific pronunciations: the !pronounce directive
  - Grammar-specific pronunciations: the !pronounce statement
  - Phonetic dictionary
  - Pronunciation Guesser (G2P)
- Precedence: !pronounce directive > !pronounce statement > dictionary > G2P
- Example:
  - !start <rule>
  - !pronounce coffee kO.fi | kA.fi
  - <rule>: I have read !pronounce(REd) a book | I read a book | I drink coffee
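The four-level precedence can be pictured as a simple fall-through lookup. A toy sketch with hypothetical helper and argument names (the real engine resolves this internally):

```python
def pronunciations(word, directive=None, statements=None, dictionary=None, g2p=None):
    """Resolve pronunciations with the precedence from the slide:
    !pronounce directive > !pronounce statement > dictionary > G2P."""
    statements = statements or {}
    dictionary = dictionary or {}
    if directive:                       # word-specific !pronounce directive wins
        return [directive]
    if word in statements:              # grammar-wide !pronounce statement
        return statements[word]
    if word in dictionary:              # phonetic dictionary lookup
        return dictionary[word]
    return [g2p(word)] if g2p else []   # fall back to the pronunciation guesser

# "coffee" has a grammar-level statement; "read" gets an inline directive.
stmts = {"coffee": ["kO.fi", "kA.fi"]}
print(pronunciations("read", directive="REd", statements=stmts))  # ['REd']
print(pronunciations("coffee", statements=stmts))                 # ['kO.fi', 'kA.fi']
```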
9. ASR Engine Features
- Dynamic activation
  - Goal: quickly activate/deactivate parts of the grammar
  - Directly on the engine: no un-loading of contexts, a single engine instance
  - Rule/Grammar/Context/Label (de)activation
  - On a Context: FSM node (de)activation
- Dynamic modification
  - Goal: quickly modify parts of a grammar without complete recompilation
  - Add and remove rules to/from a grammar
  - Add and remove alternatives to/from existing rules
  - On a Context: add and remove a list of names to/from a context
10. ASR Engine Features
- Natural Language Understanding
  - Why?
    - Extract meaning from the user's utterance
    - Make the application layer independent of how the user exactly phrased the utterance
  - Two grammar formalisms that differ in NLU statements only:
    - BNF+EM: NLU handled by the reco engine itself
    - BNF+AM: NLU post-processor on the reco result
  - Robust deep parsing versus shallow parsing
    - Reason: tailor the balance between expressive power and footprint to the needs of the customer
11. ASR Engine Features
- NLU in BNF+EM
  - NLU information is stored in the ASR context
  - Lowest memory requirements: !id(number)
  - Modify the spoken utterance: !action and !ignore
- NLU in BNF+AM
  - Syntax-directed translation by means of a CFG parser
  - NLU result: a set of attribute-value pairs
- Examples:
  - <City>: New York !id("1") | The Big Apple !id("1")
  - <City>: New York !action("NY") | The Big Apple !action("NY")
  - <number>: <tens> <units> price=sum(<tens>, <units>)
  - <tens>: 20 | 30 | ... | 90
  - <units>: 1 | 2 | ... | 9
12. ASR Engine Features
- Voice Activity Detection
  - Detect start-of-speech
    - Saves CPU during leading silence
  - Detect end-of-speech
    - Determines the responsiveness of the recognition engine
- Extra-event rejection
  - Reject noises
  - Extra-event models for coughs, car horns, wipers, mouth clicks, ...
  - Can be put in parallel to the main grammar(s), with their own DP (avoids pruning effects)
  - If the extra-event models score better than the main grammar(s), the result type is set to REJECTED
13. ASR Engine Features
- Returned results/events
- Signal events
  - Abnormal conditions: signal too loud, bad SNR, ...
  - At regular intervals: SNR, energy level, ...
  - At certain moments: trailing silence detected, ...
  - ⇒ This information can be used by the application to give feedback to the speaker, adapt the dialog strategy, ...
- Recognition result
  - N-best alternatives
  - Confidence values at word and sentence level
  - Word segmentation
  - Result type: FINAL or REJECTED (extra-event)
  - ⇒ Configurable rejection behaviour, based on confidence values and result type
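The combination of result type and confidence values typically drives a small decision rule in the application. A minimal sketch, with illustrative thresholds and field names (not the VoCon API):

```python
from dataclasses import dataclass

@dataclass
class RecoResult:
    hypothesis: str
    result_type: str    # "FINAL", or "REJECTED" when extra-event models won
    confidence: float   # sentence-level confidence, 0.0 .. 1.0

def decide(result, accept_threshold=0.7, confirm_threshold=0.4):
    """Map a recognition result to a dialog action based on the
    result type and the sentence-level confidence."""
    if result.result_type == "REJECTED":
        return "reprompt"                     # extra-event models scored best
    if result.confidence >= accept_threshold:
        return "accept"                       # act on the hypothesis directly
    if result.confidence >= confirm_threshold:
        return "confirm"                      # ask the user to confirm
    return "reprompt"                         # too unreliable, ask again

print(decide(RecoResult("call home", "FINAL", 0.85)))  # accept
print(decide(RecoResult("call home", "FINAL", 0.55)))  # confirm
```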
14. ASR Engine Features
- Speaker Adaptation
  - Acoustic model adaptation based on a few tens of utterances
  - Reduces the error rate by up to 30% relative
  - Even more for non-native speakers
- Speaker normalization
  - Design of the feature extraction, a.o. Cepstral Mean Normalization
- Speaker-dependent words: User Words or voice tags
  - Number of training utterances needed
  - Combination of SD and SI words
  - Works well in noisy conditions when trained in clean conditions
  - Confusability check
15. ASR Engine Features
- Specifics for name dialing and destination entry
- Dedicated isolated-name search algorithm
  - Low memory and CPU requirements for long lists of isolated words, e.g. street names, city names, person names, stock quotes, ...
- Spelling post-processor
  - Two steps:
    - Recognize the letter sequence (normal reco engine)
    - Find the best-matching name from a list (post-processor)
  - Allows spelling errors, even deletions and insertions
  - Supports incremental partial spelling
16. Other Requirements for Embedded ASR Product
- Footprint
  - Storage, peak RAM, CPU needs
- Scalability
  - Trade-off: recognition accuracy versus footprint
- Modularity
  - Trade-off: supported features versus footprint
- Portability
  - Abstract and isolate processor- and OS-specific functionality
- Re-usability
  - Across languages, grammar formalisms, character encodings
17. Other Requirements for Embedded ASR Product
- Language Portfolio
  - Cost-efficient production of tens of languages
  - Language-specific data, not code
- Documentation
  - Getting Started, Functional Reference (API), development formalisms like BNF and LH, Application Notes, Training Courses, Sample Programs, Demonstration Tools
- QA
  - Automated nightly Build and Test System
  - Code Checkers
  - Design and Release Process
18. A Recap: What's Next?
- Recap
  - We have seen that a commercial ASR product offering
    - has many features beyond core noise-robust ASR
    - has many other requirements besides functionality
  - We have introduced relevant ASR terminology
- Now we're ready to
  - Analyse what it takes to create an attractive speech-enabled application
  - Introduce tools and methods that help such design and development
19. The Ultimate Goal for Speech in Applications
User satisfaction through task completion success
- Variables affecting task completion rate and speed:
  - Technology
    - Performance of the ASR engine, a.o. acoustic model size
    - Engine parameter settings, a.o. choice of search type
    - Enabling technology: spelling, adaptation, tools, ...
  - User Interface
    - Appropriateness of prompts
    - Fall-back strategies: n-best candidates, spelling, SD words
    - Rejection/confirmation methods
  - System design and implementation
    - Speaker adaptation: acoustic model, language model (DSM)
    - Quality of grammars and vocabulary
    - Proper use of session data
  - User input
    - Speaker characteristics: proper pronunciations?
    - Audio quality / signal-to-noise ratio (SNR)
20. Speech Application Development Process
[Process flow diagram. The numbered steps include: Specification; Language Model; Interaction Development (Grammars, Prompting, Interaction Flow); System Integration; Updates to the Recognition Package; Functional Testing; Usability Testing; Provision of Data Logging Capability; Create Usability Test Scenarios; Recruit Testers; Create Data Collection Scripts; Speech Data Collection; Transcribe and Validate Data; Performance Validation; Tuning (Grammars, Pronunciations, Off-Line Analysis, Improve Grammar Coverage); Repeat Functional Testing as Required; Prepare Performance Validation Scenarios; Production.]
21. Performance Tuning
- UI Design
  - Prompts, Grammars, UI strategies
- Implementation
  - Search algorithm
  - Grammar technology
  - Audio Path
- Performance Validation and Parameter Tuning
  - Tuning parameters for maximum accuracy at minimum resources
  - Data Collection and Analysis
  - Performance Validation Reports
- Acoustic Model Tuning
  - Session Data
  - Dynamic Semantic Models
  - Speaker Adaptation
  - Model Merging
22. UI Design
- Prompts to guide users to what they can say
- Grammars designed to capture what users are likely to say
- Localization aspects
- UI Strategies
23. Command and Prompt Design
- How do users refer to frequency ranges?
  - AM/FM frequencies
  - Digits, natural numbers, pairs?
    - 530 -> five three zero
    - 530 -> five hundred and thirty
    - 530 -> five thirty
    - 1610 -> sixteen ten
    - 1610 -> eintausendsechshundertundzehn (German)
- Challenge: recognition accuracy vs. freedom of input
- Nuance principle
  - Prioritize accuracy on the expected user input as far as possible, while offering as much freedom as accuracy allows
  - For example, it is better to have pairs recognized perfectly than to offer pairs as well as natural numbers
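Generating the competing spoken forms for a frequency can be scripted when designing such grammars. A toy sketch with number-word tables truncated to just the two examples above:

```python
# Minimal number-word tables; a real grammar generator would cover 0..99.
DIGITS = "zero one two three four five six seven eight nine".split()
SMALL = {5: "five", 10: "ten", 16: "sixteen", 30: "thirty"}

def digit_by_digit(freq):
    """530 -> 'five three zero': one word per digit."""
    return " ".join(DIGITS[int(d)] for d in str(freq))

def as_pairs(freq):
    """530 -> 'five thirty', 1610 -> 'sixteen ten': split off the last
    two digits and speak both halves as small numbers."""
    s = str(freq)
    head, tail = s[:-2], s[-2:]
    return f"{SMALL[int(head)]} {SMALL[int(tail)]}"

for f in (530, 1610):
    print(f, "->", digit_by_digit(f), "/", as_pairs(f))
```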
24. UI and Grammar Design: VoCon 3200
- Synonyms increase grammar coverage
- Optimize different constraints:
  - Define the most important ways in which users refer to a command
  - Optimize prompts so that the variation in responses is minimized as far as possible
- Allowable variation is determined by:
  - Expected recognition accuracy, e.g. expected SNR (close-talk vs. far-talk)
  - RAM considerations: size of grammar and search space (possibly after grammar optimization)
  - CPU considerations: recognition latency, grammar loading times, choice of search algorithm

PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria | etc.
25. UI and Grammar Design: VoCon 3200
PROMPT: Which station are you travelling to?
- Pronunciations increase population coverage
- Optimize different constraints:
  - More pronunciations lead to larger RAM and CPU requirements
  - Restrict to those variants that cover normal variation across the population
- Assure recognition accuracy
  - The Confusability Tool identifies words with similar phonetic transcriptions

Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
26. Confusability Checks
[Screenshot: confusability report listing critical pairs (Command 1, Command 2) with their confusion score and number of confusions]
27. UI Strategies
- One-shot versus multiple turns
  - Accuracy and response time versus perceived task time
- Golden path
  - Offer one-shot for the most likely commands/choices
  - Provide multiple turns or disambiguation for less frequent cases
- Offer alternative strategies
  - Success for every user rather than for the average user
  - E.g. offer a spelling alternative
    - Users don't always know how to speak a name
    - Spelling can disambiguate confusable words
28. Implementation
- Search algorithm selection
- Grammar technology
- Audio Path
29. Search Algorithm Selection
- VoCon 3200 in principle uses a two-pass search
- Three basic engine types can be selected for the first pass:
  - General purpose: Wordpair N-best DP
    - Search is performed on the word-level FSM in the context
  - Dedicated to long item-list recognition: TreeDP (and variants)
    - Search is performed on a phonetic tree
  - Large grammars describing natural utterances incorporating one or more long item lists: LexTreeDP
- An optional second pass rescores the N-best list that results from the first pass
- The search algorithms have different memory/resource usage
30. Grammar Technology
- VoCon 3200 offers Grammars and Contexts
- Grammars
- More flexibility, easy run-time modification
- Larger resource needs (loading time, memory)
- Contexts
- Highly optimized, minimal resource needs
31. Audio Path
- Garbage (sound) in -> garbage (results) out
- Audio recommendations
  - SNR: below 5 dB, accuracy drops quickly
  - Bandwidth: for 16 kHz models, extending the bandwidth from 7 to 7.3 kHz can yield a 5% relative WER reduction
  - 12-bit dynamic range; no AGC
- See the VoCon 3200 Audio Recommendations document
32. Performance Validation and Parameter Tuning
- Tuning parameters for maximum accuracy at minimum resources
- Data Collection and Analysis
- Performance Validation Reports
33. Accuracy Metrics
- Did the speaker speak in- or out-of-vocabulary/grammar (OOV)?
  - OOVs can also be noise
- Did the recognizer make a correct decision (or a false one)?
  - Accepted the result?
  - Rejected the result?
  - Was confirmation required?

SYSTEM: You'd like to make a reservation, is that correct?
SYSTEM: I'm sorry, I didn't get that. Please tell me again what you are calling about.
34. Within Resource Constraints
- CPU limits
- Loading time
- Response time (latency)
- Memory limits
- Dynamic memory usage
35. Word/Sentence Error Rate Evaluation
- ASR is a statistical process ⇒ measuring its performance is a statistical estimation problem
- Evaluation of accuracy requires enough representative data to derive statistically significant error rates
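Word error rate itself is conventionally computed as the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with dynamic programming over the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substitution (my -> the) and one deletion (please): 2/4 = 0.5
print(word_error_rate("call my office please", "call the office"))  # 0.5
```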
36. Confidence Intervals
- The 95% confidence interval for a measured WER f is (f - Δ, f + Δ), with Δ = 1.96 · √(f(1 - f)/N) for N test tokens
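The interval is straightforward to compute; for example, 50 errors in 1000 words gives a WER of 5.0% with a 95% confidence interval of roughly 3.6% to 6.4%:

```python
import math

def wer_confidence_interval(errors, n, z=1.96):
    """95% binomial confidence interval for an error rate errors/n,
    using the normal approximation from the slide."""
    f = errors / n
    delta = z * math.sqrt(f * (1 - f) / n)
    return f - delta, f + delta

low, high = wer_confidence_interval(errors=50, n=1000)
print(f"WER 5.0%, 95% CI: [{low:.3f}, {high:.3f}]")  # [0.036, 0.064]
```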
37. Tuning Process: 5 Steps
1. Collect data
2. Measure
3. Analyse
4. Experiment
5. (Pre-)Release
(Iterate)
38. Evaluation: Offline Testing
- Offline recordings in a car similar to the target car
  - Same microphone
  - About the same distance between the speaker's mouth and the microphone
  - About the same background noise, at different conditions
- ⇒ Outcome: offline test report
  - Check for recognition problems
  - Check wording from a recognition-performance point of view
  - Get first feedback from test persons
39. Evaluation: Online Test
40. Tuning of Applications
Tuning iterations: Collect data -> Measure -> Analyse -> Experiment -> (Pre-)Release -> (Iterate)
- What data?
- How to collect?
- What to measure?
- How to measure?
41. Evaluation: Performance Validation and Tuning Report
- Tuning of grammars, commands and dictionaries
  - Based on spontaneous as well as correct utterances
  - Optimize regarding RAM, heap size, format, ID usage
  - Optimize regarding usage and recognition accuracy
- Tuning of parameters
  - absoluteThreshold, MinSpeech, trailing silence, pruning, ...
- ⇒ Nuance/OEM/Tier 1
  - Adjust the HMI design and implementation according to the tuning results
42. Performance Validation Reports
43. Acoustic Model Tuning
- For audio characteristics: Session Data
- For speaker characteristics: Speaker Adaptation
- For accuracy versus size: Model Merging and Compiling
44. Session Data
- The engine automatically adapts to the speaker and the environment (microphone, room characteristics; NOT background noise)
- This information is contained in so-called session data
- The application can retrieve session data, store it and re-load it
- Language (AMO) specific
- General rule:
  - Clear session data at system start-up (unknown speaker)
  - Re-use session data during a driving session
45. Speaker Adaptation
- Adaptation to a single user
  - Most improvement for speakers with low accuracy
  - Supervised enrollment: min. 10 s of speech distributed over 20 different short commands
  - Supervised selection by the application (e.g. based on key or phone identification)
  - Fast loading (acoustic model modified in RAM)
- Adaptation to the environment based on a set of users
  - Same technology, but enrollment done off-line based on a set of users collected in the target environment
46. Acoustic Model Merging and Compiling
- Three standard sizes of acoustic models per language: ultra-compact (320 kB), compact (780 kB), full (4 MB)
- Each of these models is complete: they can recognize any phoneme of the selected language
- VoCon 3200 AMOs have built-in word models for very frequent, important words
  - Digits 0-9 and letters (not with ultra-compact models)
  - Exceptionally other words, depending on the language
- Merging
  - Add parts of e.g. the full AMO to the compact AMO
  - Resulting AMO
    - A bit larger than compact, a bit more CPU
    - Better performance for the selected parts
- Compiling
  - Keep only the parts of the model that are used by the set of grammars
47. Language Model Tuning: DSM
- How will speakers vary?
- Language models
  - bias the recogniser
  - increase realised accuracy
- Language models in VoCon(X)3(200)
  - Currently in products only by re-scoring the N-best list of recognition hypotheses
  - Dynamic Semantic Models (DSM) adapt to the speaker's usage history
- Examples
  - DSM for VDE
  - DSM for VAD

PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model: 50% London, 16% Euston, 12% Waterloo, 5% Paddington, etc.
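Re-scoring an N-best list with such usage-history priors can be sketched as a log-linear combination of acoustic score and prior probability (illustrative weights and scores, not the DSM implementation):

```python
import math

# Usage-history priors, as in the example language model above.
PRIORS = {"London": 0.50, "Euston": 0.16, "Waterloo": 0.12, "Paddington": 0.05}

def rescore(nbest, priors, lm_weight=0.5, floor=0.01):
    """Re-rank (hypothesis, acoustic_score) pairs by a log-linear
    interpolation of acoustic score and usage-history prior."""
    def combined(pair):
        hyp, acoustic = pair
        prior = priors.get(hyp, floor)  # unseen stations get a small floor
        return (1 - lm_weight) * math.log(acoustic) + lm_weight * math.log(prior)
    return sorted(nbest, key=combined, reverse=True)

# Acoustically "Euston" narrowly beats "London", but the prior flips the order.
nbest = [("Euston", 0.40), ("London", 0.38), ("Paddington", 0.22)]
print(rescore(nbest, PRIORS)[0][0])  # London
```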
48. Dynamic Semantic Models
PROMPT: Which station are you travelling to?
Grammar synonyms: London Central | London | London Waterloo | London Marylebone | London Paddington | London King's Cross | London Euston | London Victoria
Pronunciations: Marylebone /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model: 50% London, 16% Euston, 12% Waterloo, etc.
Custom Grammar, Lexicon, DSM
- BENEFITS
  - Increased coverage
  - Increased accuracy
- Network versus embedded
  - In embedded: no re-tuning after a first limited deployment
  - In embedded: too little real data for fine-grained statistical language models (use categories instead); BUT the application can adapt to a particular speaker or a small set of speakers (exploit usage history in the DSM)
49. ASR Application Development Tools: Overview
- Grammar and Pronunciation Editing and Analysis Suite
  - Purpose:
    - Fast development and testing of ASR grammars and pronunciation dictionaries
    - Allow initial evaluation of grammar compilation speed, recognition speed and recognition accuracy
- Recognition Analysis Suite
  - Purpose:
    - Get the best out of our ASR engines by tuning the most important engine parameters, by further tuning of pronunciations, etc.
    - Collect speech utterances to allow the tuning
50. ASR Tools
[Tool-chain diagram: the Grammar Tools and Pronunciation Tools produce grammars and a lexicon for the Grammar Compiler; the Engine Tuning Tools supply configuration parameters to the ASR Engine; the Logging Library captures speech data, which the Data Preparation Tools turn into reference data.]
51. Grammar and Pronunciation Analysis Tools
- Grammar Editor
  - Syntax highlighting, search and replace, spelling checker, wide support for character encodings, smart indentation, folding, ...
- Grammar Creator Tool
  - Create an ASR grammar from a list of names, possibly including partial spelling, actions, ...
- Grammar Compiler Tool
  - Compile a text grammar into its binary equivalent; experiment with grammar compilation options
- Context Compiler Tool
  - Compile text grammar(s) into an equivalent binary context (for context-from-buffer functions only)
- Dictionary Compiler Tool
  - Compile a text dictionary into an equivalent binary dictionary
- Spelling Tree Compiler Tool
  - Compile a list of words (e.g. city names) into a binary buffer that can be loaded into the spelling post-processor
52. Grammar Editor
53. Grammar and Pronunciation Analysis Tools
- Model Compiler Tool
  - Produce a reduced, grammar-specific acoustic model for a given set of fixed grammars (exception: user words)
- Vocabulary Verifier Tool
  - Check a grammar's vocabulary and word pronunciations
- Context Verifier Tool
  - Generate sentences described by the context, check whether a sentence is covered by the context, ...
- Recognition Test Tool
  - Test recognition on a single utterance, either from a previously recorded file or with microphone input
- Confusability Tool
  - Identify confusable word or sentence pairs based on their pronunciations
- User Dictionary Editor
  - GUI tool to create an exception dictionary with better phonetic transcriptions for certain words
54. Recognition Analysis Tools
- Audio Data Collector
  - GUI tool to make elaborate utterance recordings
- Log Importer Tool
  - Convert a binary log file (created by applications that use the logging library from the Speech API) into a text log file
  - Takes care of conversions of data types and concatenation of small audio buffers into utterances
- Log Extractor Tool
  - Convert a central log file into files usable by e.g. the Batch Recognition Tool
  - Allows filtering of interesting information, e.g. only utterances of a certain speaker, and/or in a certain state of the dialog, ...
55. Recognition Analysis Tools
- Sound Tool
  - GUI tool to listen to and analyze recorded utterances, to spot bad signal quality
- Speech Verifier Tool
  - Annotate recorded utterances with orthographic transcriptions
- Batch Recognition Tool
  - Perform recognition on a series of recorded utterances
  - Supports the spelling and NLU post-processors
  - Experiment with all engine parameters
- Batch Userword Training Tool
  - Train speaker-dependent user words on a set of recorded utterances, possibly from different speakers
  - Can be used in the Batch Recognition Tool
56. Recognition Analysis Tools
- Batch Speaker Adaptation Enrolment Tool
  - Train speaker profiles that can be used to adapt the speaker-independent acoustic model to a speaker
  - Can be used in the Batch Recognition Tool
- Scoring Tool
  - Analyze the recognizer's output on a series of utterances and generate a detailed error report
  - Select utterances with particular errors, or from particular speakers, or ...
  - Also supports analysis and tuning of rejection performance
- Engine Tuning Tool
  - Automatic tuning of optimal engine parameters by running batch recognition on recorded/logged speech
57. Thanks
- Thank you for your attention!