1
Tools and Methodologies for the Development of
Speech Recognition Enabled Applications
Dr. Ir. Jan Verhasselt, Director Embedded ASR Research
February 14, 2007
2
Major Goals of This Presentation
  • Answer the questions
  • What are the requirements for an ASR engine that
    can be used for voice-enabling a wide range of
    applications in the embedded market?
  • What process can be used to guide the development
    of applications that incorporate speech
    recognition?
  • How can tools help to reduce the cost of
    developing such applications?
  • As a side-effect, give insight into a number of
    criteria that are important when choosing a
    speech recognition engine for a certain
    application

3
Overview
  • Embedded Speech Recognition at Nuance
  • Important ASR Engine Features
  • Other Requirements for Embedded ASR Product
  • Speech Application Development Process
  • Application Development Tools

4
Embedded ASR Portfolio at Nuance
[Chart: products positioned by task complexity (vertical axis) versus processor capability (horizontal axis: low-end DSP, ARM7/ARM9, high-end RISC)]
  • Low-end DSP: VoCon Mobile and VoCon SF (phone dialing, simple VAD)
  • ARM7/ARM9: VoCon Mobile XGT (phone dialing, advanced VAD); VoCon 3200 v2, the current flagship automotive ASR engine (command & control, navigation voice destination entry (VDE), entertainment with basic voice-controlled MP3)
  • High-end RISC: VoCon 3200 v3 and VoCon Mobile X3, a fusion of VoCon 3200 v2 and Dragon NaturallySpeaking (more natural command & control, voice-controlled MP3, dictation)
Processor Capability
5
Focus of this presentation: VoCon 3200
  • Key Features
  • Command & Control including large name lists
    (VAD, VDE, MP3)
  • Continuous speaker-independent speech recognition
    with support for speaker-dependent voice tags and
    speaker adaptation
  • Noise-robust for the automotive environment
    (far-talk); also very good accuracy in
    close-talk and noise-free environments
  • Modularity
  • Portability
  • Post-processors
  • CFG grammar formalism
  • Off-line and on-line grammar
    compilation/modification/activation

6
ASR Engine Features Beyond Core ASR
  • Noise-robust core ASR is important, but not
    enough
  • Important product components for SI CC engines
    for medium to large vocabularies
  • Grammar processor: grammar formalism(s), grammar
    compiler, dynamic activation/modification
  • Lexicon and Pronunciation Guesser
  • Natural Language Understanding
  • Voice Activity Detection, Extra-event rejection
  • Returned results
  • Speaker Normalization/Adaptation and/or User
    Words
  • Specifics for name dialing and destination entry

7
ASR Engine Features
  • Grammar formalism
  • Context Free Grammars (CFG) to describe tasks
  • Example (BNF) shown below
  • The grammar compiler turns grammars into ASR contexts
  • ASR context defines what ASR engine can recognize
  • ASR context contains FSM representation of
    grammars
  • Optional FSM minimization
  • Grammar compilation often done off-line

!grammar Order
!start <Speech>
<Speech>: !optional(I would like) (<Drinks> | <Food>) please
<Drinks>: a lemonade | a milkshake | an orange juice
<Food>: a hamburger | French fries
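
A minimal sketch of what this compilation step amounts to: expanding the toy Order grammar above into the finite sentence set an ASR context would accept. The rule tables are transcribed from the example; everything else is illustrative Python, not the VoCon compiler:

  from itertools import product

  # Alternatives of the two non-terminals in the Order grammar above.
  RULES = {
      "<Drinks>": ["a lemonade", "a milkshake", "an orange juice"],
      "<Food>": ["a hamburger", "French fries"],
  }

  def expand():
      # <Speech>: !optional(I would like) (<Drinks> | <Food>) please
      prefixes = ["", "I would like "]                 # !optional(...)
      items = RULES["<Drinks>"] + RULES["<Food>"]      # (<Drinks> | <Food>)
      for pre, item in product(prefixes, items):
          yield f"{pre}{item} please"

  for sentence in expand():
      print(sentence)   # e.g. "I would like a milkshake please"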
8
ASR Engine Features
  • Specifying word pronunciations
  • The L&H phonetic alphabet
  • Word-specific pronunciations: the !pronounce
    directive
  • Grammar-specific pronunciations: the !pronounce
    statement
  • Phonetic dictionary
  • Pronunciation Guesser (G2P)
  • Precedence: !pronounce directive > !pronounce
    statement > dictionary > G2P
!start <rule>
!pronounce coffee kO.fi | kA.fi
<rule>: I have read !pronounce(REd) a book
      | I read a book
      | I drink coffee
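
A hedged sketch of the precedence rule above; the lookup tables and the function are illustrative, not the engine's actual data structures:

  def pronunciations(word, directive=None, statements=None, dictionary=None, g2p=None):
      # Precedence: !pronounce directive > !pronounce statement > dictionary > G2P
      if directive:                              # word-specific !pronounce(...) directive
          return [directive]
      if statements and word in statements:      # grammar-level !pronounce statement
          return statements[word]
      if dictionary and word in dictionary:      # phonetic dictionary entry
          return dictionary[word]
      return g2p(word)                           # fall back to the pronunciation guesser

  # The "coffee" example from this slide, with a dummy G2P fallback:
  statements = {"coffee": ["kO.fi", "kA.fi"]}
  print(pronunciations("coffee", statements=statements, g2p=lambda w: ["??"]))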

9
ASR Engine Features
  • Dynamic activation
  • Goal: quickly activate/deactivate parts of the
    grammar
  • Directly on the engine: no unloading of the
    context, a single engine instance
  • Rule/Grammar/Context/Label (de)activation
  • On a Context: FSM node (de)activation
  • Dynamic Modification
  • Goal: quickly modify parts of a grammar without
    complete recompilation
  • Add and remove rules to/from a grammar
  • Add and remove alternatives to/from existing
    rules
  • On a Context: add and remove a list of names
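
To make the distinction concrete, here is a small application-side sketch; all class and method names are invented for illustration and are not the VoCon 3200 API:

  class GrammarState:
      # Illustrative book-keeping for dynamic (de)activation and modification.
      def __init__(self):
          self.active_rules = set()   # rules currently visible to the search
          self.names = []             # e.g. a name list added to a context

      def activate(self, rule):       # quick: no unloading or recompilation
          self.active_rules.add(rule)

      def deactivate(self, rule):
          self.active_rules.discard(rule)

      def add_names(self, new_names): # dynamic modification of a context
          self.names.extend(new_names)

  state = GrammarState()
  state.activate("<DialCommand>")
  state.add_names(["Jan Verhasselt", "Anna Smith"])
  state.deactivate("<DialCommand>")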

10
ASR Engine Features
  • Natural Language Understanding
  • Why?
  • Extract meaning from the user's utterance
  • Make the application layer independent of exactly
    how the user phrased the utterance
  • Two grammar formalisms that differ in NLU
    statements only
  • BNF+EM: NLU handled by the recognition engine itself
  • BNF+AM: NLU as a post-processor of the recognition
    result
  • Robust deep parsing versus shallow parsing
  • Reason: tailor the balance between expressive power
    and footprint to the needs of the customer

11
ASR Engine Features
  • NLU in BNF+EM
  • NLU information is stored in the ASR Context
  • Lowest memory requirements: !id(number)
  • Modify the spoken utterance: !action and !ignore
  • NLU in BNF+AM
  • Syntax-directed translation by means of a CFG
    parser
  • NLU result: a set of attribute-value pairs
  • <City>: New York !id("1") | The Big Apple !id("1")
  • <City>: New York !action("NY") | The Big Apple !action("NY")
  • <number>: <tens> <units> price = sum(<tens>, <units>)
  • <tens>: 20 | 30 | 90
  • <units>: 1 | 2 | 9
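
A sketch of the BNF+AM idea: post-processing a recognized token sequence into attribute-value pairs. The tag tables are taken from the examples above; the parsing itself is deliberately simplistic and only illustrative:

  CITY_IDS = {"new york": "1", "the big apple": "1"}   # synonyms share one !id
  TENS = {"twenty": 20, "thirty": 30, "ninety": 90}
  UNITS = {"one": 1, "two": 2, "nine": 9}

  def parse(tokens):
      result = {}
      phrase = " ".join(tokens).lower()
      if phrase in CITY_IDS:
          result["city_id"] = CITY_IDS[phrase]          # !id("1")
      if len(tokens) == 2 and tokens[0] in TENS and tokens[1] in UNITS:
          result["price"] = TENS[tokens[0]] + UNITS[tokens[1]]  # sum(<tens>, <units>)
      return result

  print(parse(["The", "Big", "Apple"]))    # {'city_id': '1'}
  print(parse(["twenty", "nine"]))         # {'price': 29}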

12
ASR Engine Features
  • Voice Activity Detection
  • Detect start-of-speech
  • Save CPU during leading silence
  • Detect end-of-speech
  • Responsiveness of the recognition engine
  • Extra-event rejection
  • Reject noises
  • "Extra-Event" models for cough, car horn, wipers,
    mouth clicks, etc.
  • Can be run in parallel with the main grammar(s),
    with their own DP (to avoid pruning effects)
  • If the extra-event models score better than the
    main grammar(s), the result type is set to
    REJECTED.
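
A minimal energy-based sketch of start/end-of-speech detection; frame size, threshold and silence timeout are illustrative, not the engine's actual VAD:

  def detect_speech(frames, energy_threshold=0.01, trailing_silence_frames=30):
      start = last_voiced = None
      for i, frame in enumerate(frames):
          energy = sum(s * s for s in frame) / len(frame)
          if energy > energy_threshold:
              if start is None:
                  start = i              # start-of-speech: skip CPU work before this
              last_voiced = i
          elif last_voiced is not None and i - last_voiced >= trailing_silence_frames:
              break                      # end-of-speech: return promptly
      return start, last_voiced          # frame indices, or (None, None) for silence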

13
ASR Engine Features
  • Returned results/events
  • Signal events
  • Abnormal conditions: signal too loud, bad SNR, etc.
  • At regular intervals: SNR, energy level, etc.
  • At certain moments: trailing silence detected, etc.
  • → This information can be used by the application to
    give feedback to the speaker, adapt the dialog
    strategy, etc.
  • Recognition result
  • N-best alternatives
  • Confidence values at word and sentence level
  • Word segmentation
  • Result type: FINAL or REJECTED (ExEv)
  • → Configurable rejection behaviour, based on
    confidence values and result type
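
A sketch of the configurable application-side rejection logic this enables; the thresholds and the three-way outcome are illustrative:

  def decide(result_type, sentence_confidence, accept_at=0.80, confirm_at=0.50):
      if result_type == "REJECTED":          # extra-event models scored best
          return "reject"
      if sentence_confidence >= accept_at:
          return "accept"
      if sentence_confidence >= confirm_at:
          return "confirm"                   # e.g. ask "Did you say ...?"
      return "reject"

  print(decide("FINAL", 0.65))               # -> confirm
  print(decide("REJECTED", 0.99))            # -> reject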

14
ASR Engine Features
  • Speaker Adaptation
  • Acoustic model adaptation based on a few tens of
    utterances
  • Reduces error rate by up to 30% relative
  • Even more for non-native speakers
  • Speaker normalization
  • Design of the feature extraction, a.o. Cepstral
    Mean Normalization
  • Speaker-Dependent Words: User Words or Voice Tags
  • Number of training utterances needed
  • Combination of SD and SI words
  • Works well in noisy conditions even when trained in
    clean conditions
  • Confusability check
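
For concreteness, batch Cepstral Mean Normalization in its simplest form: subtract the per-coefficient mean over the utterance. Real front-ends use running estimates; this sketch only shows the idea:

  def cmn(cepstra):
      # cepstra: list of frames, each a list of cepstral coefficients
      n, dims = len(cepstra), len(cepstra[0])
      means = [sum(frame[d] for frame in cepstra) / n for d in range(dims)]
      return [[frame[d] - means[d] for d in range(dims)] for frame in cepstra]

  print(cmn([[1.0, 2.0], [3.0, 4.0]]))   # [[-1.0, -1.0], [1.0, 1.0]]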

15
ASR Engine Features
  • Specifics for name dialing and destination entry
  • Dedicated isolated-name search algorithm
  • Low memory and CPU requirements for long lists of
    isolated words, e.g. street names, city names,
    person names, stock quotes, etc.
  • Spelling post-processor
  • Two steps:
  • Recognize the letter sequence (normal recognition
    engine)
  • Find the best-matching name from a list
    (post-processor)
  • Allows for spelling errors, even deletions and
    insertions
  • Supports incremental partial spelling
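
A sketch of the second step under stated assumptions: plain Levenshtein matching of the recognized letter sequence against a name list, with prefix comparison to mimic partial spelling (the product's matcher is certainly richer):

  def edit_distance(a, b):
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1,                  # deletion
                             cur[j - 1] + 1,               # insertion
                             prev[j - 1] + (ca != cb)))    # substitution
          prev = cur
      return prev[-1]

  def best_match(letters, names):
      # compare against name prefixes to support incremental partial spelling
      return min(names, key=lambda n: edit_distance(letters.upper(),
                                                    n[:len(letters)].upper()))

  print(best_match("MARI", ["Marylebone", "Paddington", "Euston"]))  # Marylebone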

16
Other Requirements for Embedded ASR Product
  • Footprint
  • Storage, peak RAM, CPU needs
  • Scalability
  • Trade-off: recognition accuracy versus footprint
  • Modularity
  • Trade-off: supported features versus footprint
  • Portability
  • Abstract and isolate processor and OS-specific
    functionality
  • Re-usability
  • Across languages, grammar formalisms, character
    encodings

17
Other Requirements for Embedded ASR Product
  • Language Portfolio
  • Cost-efficient production of tens of languages
  • Language specific data, not code
  • Documentation
  • Getting Started, Functional Reference (API),
    Development Formalisms like BNF and L&H,
    Application Notes, Training Courses, Sample
    Programs, Demonstration Tools
  • QA
  • Automated nightly Build and Test System
  • Code Checkers
  • Design and Release Process

18
A Recap: What's Next?
  • Recap
  • We have seen that a commercial ASR product
    offering
  • has many features beyond core noise-robust ASR
  • has many other requirements besides functionality
  • We have introduced relevant ASR terminology
  • Now we're ready to
  • Analyse what it takes to create an attractive
    speech enabled application
  • Introduce tools and methods that help such design
    and development

19
The ultimate goal for speech in applications
User Satisfaction through Task Completion Success
  • Variables affecting task completion rate and
    speed
  • Technology
  • Performance of the ASR engine, a.o. acoustic model
    size
  • Engine parameter settings, a.o. choice of search
    type
  • Enabling technology: spelling, adaptation,
    tools, ...
  • User Interface
  • Appropriateness of prompts
  • Fall-back strategies: n-best candidates,
    spelling, SD words
  • Rejection/Confirmation methods
  • System design and implementation
  • Speaker adaptation: acoustic model, language
    model (DSM)
  • Quality of grammars and vocabulary
  • Proper use of session data
  • User input
  • Speaker characteristics: proper pronunciations?
  • Audio quality / signal-to-noise ratio (SNR)

20
Speech Application Development Process
[Process flow diagram, numbered steps 1-18. Recoverable step labels: Specification; Language Model; Interaction Development; Prompting / Interaction Flow; Grammars Package; Integration; Functional Testing; Usability Testing; Create Usability Test Scenarios; Recruit Testers; Provision of Data Logging Capability; Create Data Collection Scripts; Speech Data Collection; Transcribe and Validate Data; Off-line Analysis (Tuning Grammars, Improve Grammar Coverage, Pronunciations); Performance Validation; System Updates to Recognition; Repeat Functional Testing as Required; Prepare Performance Validation Scenarios; Production]
21
Performance Tuning
  • UI Design
  • Prompts, Grammars, UI strategies
  • Implementation
  • Search algorithm
  • Grammar technology
  • Audio Path
  • Performance Validation and Parameter Tuning
  • Tuning parameters for maximum accuracy at minimum
    resources
  • Data Collection and Analysis
  • Performance Validation Reports
  • Acoustic Model Tuning
  • Session Data
  • Dynamic Semantic Models
  • Speaker Adaptation
  • Model Merging

22
UI Design
  • Prompts to guide users to what they can say
  • Grammars designed to capture what users likely
    say
  • Localization aspects
  • UI Strategies

23
Command and Prompt Design
  • How do users refer to frequency ranges?
  • AM / FM frequencies
  • Digits, natural numbers, pairs?
  • 530 → five three zero
  • 530 → five hundred and thirty
  • 530 → five thirty
  • 1610 → sixteen ten
  • 1610 → eintausendsechshundertundzehn (German:
    "one thousand six hundred and ten")
  • Challenge: recognition accuracy vs. freedom of
    input
  • Nuance principle
  • Prioritize accuracy on expected user input as far
    as possible, while offering as much freedom as
    accuracy allows
  • For example, better to have pairs recognized
    perfectly than to offer pairs as well as
    natural numbers
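
As an illustration of the preferred "pairs" reading, a small sketch that generates it for AM frequencies; the number-word tables are partial and purely illustrative:

  ONES = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
  SPECIAL = {10: "ten", 16: "sixteen", 30: "thirty"}   # partial, covers the examples only

  def pairs_reading(freq):
      s = str(freq)
      head, tail = int(s[:-2]), int(s[-2:])   # 530 -> 5/30, 1610 -> 16/10
      head_word = SPECIAL.get(head, ONES[head] if head < 10 else str(head))
      return f"{head_word} {SPECIAL[tail]}"

  print(pairs_reading(530))    # five thirty
  print(pairs_reading(1610))   # sixteen ten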

24
UI and Grammar Design: VoCon 3200
  • Synonyms increase grammar coverage
  • Optimize different constraints
  • Define the most important ways users refer to a
    command
  • Optimize prompts so that variation in responses
    is minimized as far as possible
  • Allowable variation is determined by
  • Expected recognition accuracy, e.g. expected SNR
    (close-talk vs. far-talk)
  • RAM considerations: size of grammar and search
    space (possibly after grammar optimization)
  • CPU considerations: recognition latency, grammar
    loading times, choice of search algorithm

PROMPT
Which station are you travelling to?
Grammar synonyms
London | Central London | London Waterloo | London
Marylebone | London Paddington | London King's
Cross | London Euston | London Victoria | etc.
25
UI and Grammar Design: VoCon 3200
PROMPT
Which station are you travelling to?
  • Pronunciations increase population coverage
  • Optimize different constraints
  • More pronunciations lead to larger RAM and CPU
    requirements
  • Restrict to those variants that cover normal
    variation across the population
  • Assure recognition accuracy
  • The confusability tool identifies words with similar
    phonetic transcriptions

Grammar synonyms
London | Central London | London Waterloo | London
Marylebone | London Paddington | London King's
Cross | London Euston | London Victoria
Pronunciations
Marylebone: /m a r ey l ax b ow n/, /m ah r l ax b ow n/
26
Confusability Checks
[Screenshot: confusability report listing number confusions and critical command pairs (Command 1 / Command 2) with their confusion scores]
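
A crude stand-in for such a check: score command pairs by the similarity of their phoneme sequences and flag high-scoring ("critical") pairs. The real tool works on full pronunciation variants; this is only an illustration:

  from difflib import SequenceMatcher

  def confusion_score(phones_a, phones_b):
      # 1.0 = identical pronunciations, 0.0 = nothing in common
      return SequenceMatcher(None, phones_a, phones_b).ratio()

  commands = {"read": ["r", "i", "d"], "red": ["r", "E", "d"],
              "coffee": ["k", "O", "f", "i"]}
  for a, b in [("read", "red"), ("read", "coffee")]:
      score = confusion_score(commands[a], commands[b])
      flag = "  <- critical pair" if score > 0.6 else ""
      print(f"{a} / {b}: {score:.2f}{flag}")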
27
UI Strategies
  • One-shot versus multiple turns
  • Accuracy and response time versus perceived task
    time
  • Golden path
  • Offer one-shot for the most likely commands/choices
  • Provide multiple turns or disambiguation for less
    frequent cases
  • Offer alternative strategies
  • Success for every user rather than for the average
    user
  • E.g. offer a spelling alternative
  • Users don't always know how to speak a name
  • Spelling can disambiguate confusable words

28
Implementation
  • Search algorithm selection
  • Grammar technology
  • Audio Path

29
Search Algorithm Selection
  • VoCon 3200 uses, in principle, a two-pass search
  • Three basic engine types can be selected for the
    first pass
  • General purpose: word-pair N-best DP
  • Search is performed on the word-level FSM in the
    context
  • Dedicated to long item-list recognition: TreeDP
    (and variants)
  • Search is performed on a phonetic tree
  • Large grammars describing natural utterances
    incorporating one or more long item lists:
    LexTreeDP
  • An optional second pass rescores the N-best list
    that results from the first pass
  • Search algorithms have different memory/resource
    usage

30
Grammar Technology
  • VoCon 3200 offers Grammars and Contexts
  • Grammars
  • More flexibility, easy run-time modification
  • Larger resource needs (loading time, memory)
  • Contexts
  • Highly optimized, minimal resource needs

31
Audio Path
  • Garbage (sound) in → garbage (results) out
  • Audio recommendations
  • SNR
  • Below 5 dB, accuracy drops quickly
  • Bandwidth
  • For 16 kHz models, increasing the bandwidth from 7
    to 7.3 kHz can lead to a 5% relative WER reduction
  • 12-bit dynamic range, no AGC
  • See the VoCon 3200 Audio Recommendations document
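
A back-of-the-envelope check of the SNR recommendation: compare average power in speech frames against leading-silence frames. The 5 dB floor comes from this slide; the framing and sample values are illustrative:

  import math

  def mean_power(frames):
      samples = [s for frame in frames for s in frame]
      return sum(s * s for s in samples) / len(samples)

  def snr_db(speech_frames, noise_frames):
      return 10.0 * math.log10(mean_power(speech_frames) / mean_power(noise_frames))

  noise = [[0.01, -0.02, 0.01]]
  speech = [[0.30, -0.25, 0.28]]
  print(f"SNR = {snr_db(speech, noise):.1f} dB")   # well above the 5 dB floor here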

32
Performance Validation and Parameter Tuning
  • Tuning parameters for maximum accuracy at minimum
    resources
  • Data Collection and Analysis
  • Performance Validation Reports

33
Accuracy metrics
  • Did the speaker speak in- or
    out-of-vocabulary/grammar (OOV)?
  • OOVs can also be noise
  • Did the recognizer make a correct decision (or a
    false one)?
  • Accepted the result?
  • Rejected the result?
  • Was confirmation required?

SYSTEM: "You'd like to make a reservation, is
that correct?"
SYSTEM: "I'm sorry, I didn't get that; please tell
me again what you are calling about."
34
Within resource constraints
  • CPU limits
  • Loading time
  • Response time (latency)
  • Memory limits
  • Dynamic memory usage

35
Word/Sentence Error Rate Evaluation
ASR is a statistical process → measuring
performance is a statistical estimation problem
  • Evaluation of accuracy requires enough
    representative data to derive statistically
    significant error rates

36
Confidence Intervals
  • The 95% confidence interval for WER f is
    (f − Δ, f + Δ), with Δ = 1.96 · √(f(1 − f)/N)
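
Plugging numbers into this formula makes the data requirement concrete; a direct Python transcription, under the assumption that errors are independent across the N test words:

  import math

  def wer_confidence_interval(f, n, z=1.96):
      # 95% interval: (f - delta, f + delta), delta = z * sqrt(f(1-f)/N)
      delta = z * math.sqrt(f * (1.0 - f) / n)
      return f - delta, f + delta

  lo, hi = wer_confidence_interval(0.10, 1000)
  print(f"measured WER 10% on 1000 words -> 95% CI [{lo:.3f}, {hi:.3f}]")
  # roughly [0.081, 0.119]; halving the interval needs about 4x the data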

37
Tuning process: 5 steps
Collect data
  • What?
  • How?
Measure
  • What?
  • How?
Analyse
  • Tools
  • What to look for
Experiment
  • Tools
  • Methodology
(Pre-)Release
  • When?
(Iterate)
38
Evaluation: Offline Testing
  • Offline recordings in a car similar to the target car
  • Same microphone
  • About the same microphone-to-mouth distance
  • About the same background noise in different
    conditions
  • → Outcome: offline test report
  • Check for recognition problems
  • Check wording from a recognition-performance point
    of view
  • Get first feedback from test persons

39
Evaluation: Online Test
40
Tuning of Applications
Collect data
  • What data?
  • How to collect it?
Measure
  • What to measure?
  • How to measure it?
Analyse
  • Tools
  • Accuracy report
Tuning Iterations
Experiment
  • Tools
  • Methodology
(Pre-)Release
  • When?
(Iterate)
41
Evaluation: Performance Validation and Tuning
Report
  • Tuning of grammars, commands and dictionaries
  • Based on spontaneous as well as correct
    utterances
  • Optimize regarding RAM, heap size, format, ID
    usage
  • Optimize regarding usage and recognition accuracy
  • Tuning of parameters
  • absoluteThreshold, MinSpeech, trailing silence,
    pruning, ...
  • → Nuance/OEM/Tier 1:
  • Adjust HMI design and implementation according to
    tuning results

42
Performance Validation Reports
43
Acoustic Model Tuning
  • For audio characteristics: Session Data
  • For speaker characteristics: Speaker Adaptation
  • For accuracy versus size: Model Merging and
    Compiling

44
Session Data
  • Engine automatically adapts to speaker and
    environment (microphone, room characteristics,
    NOT background noise)
  • Information is contained in so-called session
    data
  • Application can retrieve session data, store and
    re-load it
  • Language (AMO) specific
  • General rule
  • Clear session data at system start-up (unknown
    speaker)
  • Re-use session data during driving session

45
Speaker Adaptation
  • Adaptation to a single user
  • Most improvement for speakers with low accuracy
  • Supervised enrollment: min. 10 s of speech
    distributed over 20 different short commands
  • Supervised selection by the application (e.g. based
    on key or phone identification)
  • Fast loading (acoustic model modified in RAM)
  • Adaptation to the environment based on a set of users
  • Same technology, but enrollment done off-line on
    data from a set of users collected in the target
    environment

46
Acoustic Model Merging and Compiling
  • Three standard sizes of acoustic models per
    language: Ultra-Compact (320 kB), Compact
    (780 kB), Full (4 MB)
  • Each of these models is complete: it can
    recognize any phoneme of the selected language
  • VoCon 3200 AMOs have built-in word models for
    very frequent, important words
  • Digits 0-9 and letters (not with Ultra-Compact
    models)
  • Exceptionally other words, depending on language
  • Merging
  • Add parts of, e.g., the Full AMO to the Compact AMO
  • Resulting AMO
  • A bit larger than Compact, a bit more CPU
  • Better performance for the selected parts
  • Compiling
  • Keep only the parts of the model that are used by
    the set of grammars

47
Language Model Tuning: DSM
  • How will speakers vary?
  • Language models
  • bias the recogniser
  • increase realised accuracy
  • Language models in VoCon(X)3(200)
  • Currently in products only by re-scoring the
    N-best list of recognition hypotheses
  • Dynamic Semantic Models (DSM) adapt to the
    speaker's usage history
  • Examples
  • DSM for VDE
  • DSM for VAD

PROMPT
Which station are you travelling to?
Grammar synonyms
London | Central London | London Waterloo | London
Marylebone | London Paddington | London King's
Cross | London Euston | London Victoria
Pronunciations
Marylebone: /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model
50% London, 16% Euston, 12% Waterloo, 5% Paddington, etc.
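
A sketch of how such usage statistics could re-rank recognition output: log-linear rescoring of an N-best list with smoothed priors from the speaker's history. The weight and the combination are illustrative, not the product's DSM:

  import math

  usage_history = {"London": 50, "Euston": 16, "Waterloo": 12, "Paddington": 5}
  total = sum(usage_history.values())

  def rescore(nbest, lm_weight=0.5):
      # nbest: list of (hypothesis, acoustic log score)
      def combined(entry):
          hyp, acoustic = entry
          prior = (usage_history.get(hyp, 0) + 1) / (total + 1)   # add-one smoothing
          return acoustic + lm_weight * math.log(prior)
      return sorted(nbest, key=combined, reverse=True)

  nbest = [("Euston", -10.0), ("London", -10.2)]
  print(rescore(nbest))   # history promotes "London" past a slightly better acoustic score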
48
Dynamic Semantic Models
PROMPT
Which station are you travelling to?
Grammar synonyms
London | Central London | London Waterloo | London
Marylebone | London Paddington | London King's
Cross | London Euston | London Victoria
Custom Grammar, Lexicon, DSM
  • BENEFITS
  • Increased coverage
  • Increased accuracy
  • Network versus Embedded
  • In embedded: no retuning after a first limited
    deployment
  • In embedded: too little real data for fine-grained
    statistical language models (use categories
    instead); BUT the application can adapt to a
    particular speaker or small set of speakers
    (exploit usage history in the DSM)

Pronunciations
Marylebone: /m a r ey l ax b ow n/, /m ah r l ax b ow n/
Language Model
50% London, 16% Euston, 12% Waterloo, etc.
49
ASR Application Development Tools Overview
  • Grammar and Pronunciation Editing and Analysis
    Suite
  • Purpose
  • Fast development and testing of ASR grammars and
    pronunciation dictionaries
  • Allow initial evaluation of grammar compilation
    speed, recognition speed and recognition accuracy
  • Recognition Analysis Suite
  • Purpose
  • Get the best out of our ASR engines by tuning the
    most important engine parameters, by further
    tuning of pronunciations, etc.
  • Collect speech utterances to allow the tuning

50
ASR Tools
[Tool-chain diagram: Grammar Tools and Pronunciation Tools produce grammars and a lexicon for the Grammar Compiler, which feeds the ASR Engine; Engine Tuning Tools supply configuration parameters; the Logging Library captures speech data, which Data Preparation Tools turn into reference data]
51
Grammar and Pronunciation Analysis Tools
  • Grammar Editor
  • Syntax highlighting, search and replace, spelling
    checker, wide support for character encodings,
    smart indentation, folding, etc.
  • Grammar Creator Tool
  • Create an ASR grammar from a list of names,
    possibly including partial spelling, actions, etc.
  • Grammar Compiler Tool
  • Compile a text grammar into its binary equivalent,
    experiment with grammar compilation options
  • Context Compiler Tool
  • Compile text grammar(s) into an equivalent binary
    context (for "context from buffer" functions
    only)
  • Dictionary Compiler Tool
  • Compile a text dictionary into an equivalent
    binary dictionary
  • Spelling Tree Compiler Tool
  • Compile a list of words (e.g. city names) into a
    binary buffer that can be loaded into the spelling
    post-processor

52
Grammar Editor
53
Grammar and Pronunciation Analysis Tools
  • Model Compiler Tool
  • Produce a reduced, grammar-specific acoustic model
    for a given set of fixed grammars (exception:
    userwords)
  • Vocabulary Verifier Tool
  • Check a grammar's vocabulary and word
    pronunciations
  • Context Verifier Tool
  • Generate sentences described by the context,
    check whether a sentence is covered by the
    context, etc.
  • Recognition Test Tool
  • Test recognition on a single utterance, either
    from a previously recorded file or with microphone
    input
  • Confusability Tool
  • Identify confusable word or sentence pairs based
    on their pronunciations
  • User Dictionary Editor
  • GUI tool to create an exception dictionary with
    better phonetic transcriptions for certain words

54
Recognition Analysis Tools
  • Audio Data Collector
  • GUI tool to make elaborate utterance recordings
  • Log Importer Tool
  • Convert a binary log file (created by
    applications that use the logging library from
    the Speech API) into a text log file
  • Takes care of conversions of data types and
    concatenation of small audio buffers into
    utterances
  • Log Extractor Tool
  • Convert a central log file into files usable by,
    e.g., the Batch Recognition Tool
  • Allows filtering of interesting information, e.g.
    only utterances of a certain speaker, and/or in a
    certain state of the dialog, etc.

55
Recognition Analysis Tools
  • Sound Tool
  • GUI tool to listen to and analyze recorded
    utterances, and to spot bad signal quality
  • Speech Verifier Tool
  • Annotate recorded utterances with orthographic
    transcriptions
  • Batch Recognition Tool
  • Perform recognition on a series of recorded
    utterances
  • Supports the spelling and NLU post-processors
  • Experiment with all engine parameters
  • Batch Userword Training Tool
  • Train speaker-dependent userwords on a set of
    recorded utterances, possibly from different
    speakers
  • Can be used in the Batch Recognition Tool

56
Recognition Analysis Tools
  • Batch Speaker Adaptation Enrolment Tool
  • Train speaker profiles that can be used to adapt
    the speaker-independent acoustic model to a
    speaker
  • Can be used in the Batch Recognition Tool
  • Scoring Tool
  • Analyze the recognizer's output on a series of
    utterances and generate a detailed error report
  • Select utterances with particular errors, or of
    particular speakers, etc.
  • Also supports analysis and tuning of rejection
    performance
  • Engine Tuning Tool
  • Automatic tuning of engine parameters to optimal
    values by running batch recognition on
    recorded/logged speech

57
Thanks
  • Thank you for your attention!