1
Computer Speech Recognition: Mimicking the Human System

Li Deng, Microsoft Research, Redmond
Feb. 2, 2005, IPAM Workshop on Math of Ear and Sound Processing (UCLA)
Collaborators: Dong Yu (MSR), Xiang Li (CMU), A. Acero (MSR)
2
Speech Recognition --- Introduction
  • Converting naturally uttered speech into text and meaning
  • Human-machine dialogues (scenario demos)
  • Conventional technology --- statistical modeling and estimation (HMM)
  • Limitations:
    • noisy acoustic environments
    • rigid speaking style
    • constrained tasks
    • unrealistic demands for training data
    • huge model sizes, etc.
    • performance far below that of human speech recognition
  • Trend: incorporate key aspects of human speech processing mechanisms

3
Production-Perception Closed-Loop Chain

[Diagram: a closed-loop chain linking SPEAKER (message → motor/articulators → speech acoustics) and LISTENER (ear/auditory reception → internal model → decoded message).]
4
Encoder: Two-Stage Production Mechanisms
  • Phonology (higher level)
    • Symbolic encoding of the linguistic message
    • Discrete representation by phonological features
    • Loosely coupled multiple feature tiers
    • Overcomes the "beads-on-a-string" phone model
    • Theories of distinctive features, feature geometry, articulatory phonology
    • Accounts for partial/full sound deletion/modification in casual speech

  • Phonetics (lower level)
    • Converts discrete linguistic features into continuous acoustics
    • Mediated by motor control and articulatory dynamics
    • Mapping from articulatory variables to the VT area function to acoustics
    • Accounts for coarticulation and reduction (target undershoot), etc.

5
Encoder: Phonological Modeling
  • Computational phonology
    • Represents pronunciation variation as a constrained factorial Markov chain (see the sketch below)
    • Constraints from articulatory phonology
    • Language-universal representation

[Diagram: example "ten themes" /t e n θ i m z/, with overlapping Tongue Tip and Tongue Body feature tiers unfolding in time (Tongue Body values: Mid/Front, High/Front).]
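The factorial-chain idea can be made concrete with a small sketch. Assuming two illustrative feature tiers and a toy asynchrony constraint (the actual tier inventory and constraints come from articulatory phonology and are not specified on the slide), the composite state space and its allowed transitions might be built as follows:

    import itertools
    import numpy as np

    # Two hypothetical feature tiers; names and values are illustrative.
    TIERS = {
        "tongue_tip":  ["closure", "critical", "open"],
        "tongue_body": ["high_front", "mid_front", "low_back"],
    }

    def composite_states(tiers):
        """Composite state space = Cartesian product of per-tier states."""
        names = list(tiers)
        return [dict(zip(names, combo))
                for combo in itertools.product(*(tiers[n] for n in names))]

    def allowed(prev, cur):
        """Toy constraint in the spirit of articulatory phonology:
        tiers evolve asynchronously -- at most one tier changes per step."""
        return sum(prev[k] != cur[k] for k in prev) <= 1

    def transition_matrix(states, self_loop=0.8):
        """Uniform transitions over allowed successors, with a self-loop bias."""
        n = len(states)
        A = np.zeros((n, n))
        for i, s in enumerate(states):
            succ = [j for j, t in enumerate(states) if allowed(s, t)]
            for j in succ:
                A[i, j] = self_loop if j == i else (1 - self_loop) / (len(succ) - 1)
        return A

    states = composite_states(TIERS)
    A = transition_matrix(states)
    print(len(states), "composite states;", int((A > 0).sum()), "allowed transitions")

The constraint is what makes the factorial chain "constrained": it prunes transitions that would require implausibly synchronized articulator movements.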
6
Encoder: Phonetic Modeling
  • Computational phonetics
    • Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
    • Switching trajectory model for target-directed articulatory dynamics (see the equations below)
    • Switching nonlinear state-space model for dynamics in speech acoustics
  • Illustration

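A common mathematical form for such a target-directed switching state-space model, consistent with the bullets above (the slide's own equations are not preserved in this transcript; the notation here is ours), is

    z_t = A_s z_{t-1} + (I - A_s) u_s + w_t        (target-directed articulatory dynamics)
    o_t = h(z_t) + v_t                             (nonlinear mapping to acoustics)

where s indexes the current phonological regime (the switching state), u_s is its articulatory/VTR target, A_s sets the time constant of the motion toward the target, h(·) is the nonlinear articulatory-to-acoustic mapping, and w_t, v_t are noise terms. Within a regime, z_t relaxes toward u_s, so a regime that ends too early leaves the target unreached --- exactly the target undershoot (reduction) mentioned on the previous slide.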
7
Phonetic Encoder Computation

[Diagram: the speaker's encoder as a computation pipeline: message → targets → articulation → distortion-free acoustics → distorted acoustics, with distortion factors feeding back to articulation.]
8
Phonetic Reduction Illustration

[Figure: "yo-yo" spoken formally vs. casually.]
9
Decoder I: Auditory Reception
  • Converts speech acoustic waves into an efficient, robust auditory representation
  • This processing is largely independent of phonological units
  • Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, all the way to the A1 cortex
  • Principal roles:
    1) combat environmental acoustic distortion
    2) detect relevant speech features
    3) provide temporal landmarks to aid decoding
  • Key properties (see the filterbank sketch below):
    1) critical-band frequency scale, logarithmic compression
    2) adaptive frequency selectivity, cross-channel correlation
    3) sharp response to transient sounds
    4) modulation in independent frequency bands
    5) binaural noise suppression, etc.

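As a minimal illustration of the first key property (critical-band frequency scale with logarithmic compression), here is a conventional mel-style filterbank followed by a log; the filter count and the mel formula are standard textbook choices, not taken from the slides:

    import numpy as np

    def hz_to_mel(f):
        """Mel-scale warping: roughly linear below 1 kHz, logarithmic above
        (a standard stand-in for critical-band spacing)."""
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
        """Triangular filters spaced uniformly on the mel scale."""
        mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
            fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        return fb

    # Log-compressed critical-band energies of one spectral frame:
    frame = np.abs(np.fft.rfft(np.random.randn(512))) ** 2   # placeholder spectrum
    log_energies = np.log(mel_filterbank() @ frame + 1e-10)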
10
Decoder II: Cognitive Perception
  • Cognitive process: recovery of the linguistic message
  • Relies on:
    1) an internal model: structural knowledge of the encoder (production system)
    2) robust auditory representation of features
    3) temporal landmarks
  • Child speech acquisition is the process that gradually establishes the internal model
  • Strategy: "analysis by synthesis", i.e., probabilistic inference of (deeply) hidden linguistic units using the internal model (see below)
  • No motor theory: this strategy requires no articulatory recovery from speech acoustics

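In symbols, analysis by synthesis amounts to the inference (a standard formulation consistent with the bullets; the notation is ours, not the slide's):

    \hat{H} = \arg\max_{H} \; P(H) \, p(\mathbf{o} \mid H)

where p(o | H) is evaluated by running hypothesis H through the internal (encoder) model to synthesize predicted auditory features and scoring the observed features o against them. Slide 19 turns this into a concrete N-best rescoring procedure.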
11
Speaker-Listener Interaction
  • On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's decoding performance (i.e., discrimination)
  • Especially important for conversational speech recognition and understanding
  • On-line adaptation of encoder parameters
  • Novel criterion: maximize discrimination while minimizing articulation effort
  • In this closed-loop model, effort is quantified as the curvature of the temporal sequence of the articulatory vector z_t (see below)
  • No such concept of effort exists in conventional HMM systems

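One natural discrete-time reading of "effort as curvature" (our formulation; the slide does not give an explicit expression) is the summed squared second difference of the articulatory trajectory:

    E = \sum_t \| z_{t+1} - 2 z_t + z_{t-1} \|^2

Minimizing E favors smooth, low-effort articulation, trading off against the discrimination term in the criterion above.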
12
Stage-I illustration (effects of speaking rate)
13
Sound Confusion for Casual Speech (model vs. data)
[Figure: two panels, "model prediction" and "hand measurements", each plotting sound confusion against speaking rate.]
  • Two sounds merge when they become "sloppy"
  • Human perception does extrapolation; so does our model
  • 5000 hand-labeled speech tokens
  • Source: J. Acoustical Society of America, 2000

14
Model Stage-I
  • Impulse response of FIR filter (non-causal)
  • Output of filter

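The slide's equations are not preserved in this transcript. A plausible form, consistent with the bidirectional FIR filtering of targets described on slide 17 (our notation, stated as an assumption), is

    h(k) = c_\gamma \, \gamma^{|k|},  -D \le k \le D        (non-causal impulse response)
    z_t = \sum_{k=-D}^{D} h(k) \, u_{t+k}                   (filter output)

where u_t is the piecewise-constant target sequence, \gamma \in (0, 1) sets the amount of smoothing (coarticulation), D is the filter half-length, and c_\gamma normalizes the gain.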
15
Model Stage-II
  • Analytical prediction of cepstra (see below)
  • Assuming a P-th order all-pole model
  • Residual random vector for statistical-bias modeling (finite pole order, no zeros)

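The analytical cepstrum of an all-pole model is standard. Assuming pole p has resonance frequency f_p and bandwidth b_p (our notation; the slide's own equations are not preserved), the n-th cepstral coefficient is

    c_n = \frac{2}{n} \sum_{p=1}^{P} e^{-\pi n b_p / f_s} \cos\!\left(\frac{2 \pi n f_p}{f_s}\right) + r_n

where f_s is the sampling frequency and r_n is the residual random-vector component absorbing the bias from the finite pole order and the missing zeros.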
16
Illustration: Output of Stage-II (green)

[Figure: model-predicted cepstra (green) overlaid on data.]
17
Speech Recognizer Architecture
  • Stages I and II of the hidden trajectory model in combination → a speech recognizer
  • No context-dependent parameters → the bi-directional FIR filter provides context dependence, as well as reduction
  • Training procedure
  • Recognition procedure

18
Procedure --- Training
  • Training the residual parameters (see the sketch below)

[Flow diagram: training waveform → feature extraction → observed LPCC. In parallel, the phonetic transcript with time marks → table lookup → target sequence → target filtering with the FIR filter → predicted VTR tracks → nonlinear mapping → predicted LPCC. Subtracting predicted from observed LPCC gives the LPCC residual, which is fed to a monophone HMM trainer.]
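A minimal sketch of the residual-training step in this diagram, assuming the predicted LPCC for each frame is already available from Stages I and II (all function and variable names here are ours):

    import numpy as np

    def train_residual_params(lpcc_obs, lpcc_pred, phone_labels):
        """Fit a per-phone Gaussian to the residual r_t = observed minus
        predicted LPCC (the statistical-bias model of slide 15).
        lpcc_obs, lpcc_pred: (T, D) arrays; phone_labels: length-T frame
        labels from the time-marked transcript.
        Returns {phone: (mean, covariance)}."""
        residual = lpcc_obs - lpcc_pred                      # (T, D)
        labels = np.array(phone_labels)
        return {ph: (residual[labels == ph].mean(axis=0),
                     np.cov(residual[labels == ph], rowvar=False))
                for ph in set(phone_labels)}

    # Toy usage: 100 frames of 12-dim cepstra covering two phones.
    T, D = 100, 12
    obs, pred = np.random.randn(T, D), np.random.randn(T, D)
    params = train_residual_params(obs, pred, ["ae"] * 50 + ["t"] * 50)
    print(params["ae"][0].shape, params["ae"][1].shape)      # (12,) (12, 12)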
19
Procedure --- N-best Evaluation
test data → feature extraction → observed LPCC

A conventional triphone HMM system produces the N-best list (N = 1000); each hypothesis carries a phonetic transcript with time marks. Each hypothesis Hyp k is then scored by the pipeline: table lookup (targets) → FIR filtering → nonlinear mapping → Gaussian scorer against the observed LPCC. This rescoring pass is parameter-free. The recognizer outputs

    \hat{H} = \arg\max \{ P(H_1), P(H_2), \ldots, P(H_{1000}) \}

(see the rescoring sketch below).
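A minimal sketch of this N-best rescoring, assuming a Stage I + II predictor is available as a callable (all names here are ours, for illustration):

    import numpy as np
    from scipy.stats import multivariate_normal

    def rescore_nbest(lpcc, hypotheses, predict_lpcc, residual_params):
        """lpcc: (T, D) observed cepstra; hypotheses: time-marked phonetic
        transcripts from the triphone HMM; predict_lpcc(hyp): Stage I + II
        pipeline (targets -> FIR -> nonlinear mapping), returning (T, D)
        predicted cepstra plus frame phone labels; residual_params:
        per-phone Gaussians from training."""
        scores = []
        for hyp in hypotheses:
            pred, labels = predict_lpcc(hyp)
            logp = sum(
                multivariate_normal.logpdf(lpcc[t] - pred[t],
                                           *residual_params[ph])
                for t, ph in enumerate(labels))
            scores.append(logp)
        return hypotheses[int(np.argmax(scores))]   # H-hat = arg max

    # Toy usage: a dummy predictor and one shared residual Gaussian.
    T, D = 5, 2
    obs = np.zeros((T, D))
    params = {"ae": (np.zeros(D), np.eye(D))}
    dummy = lambda hyp: (np.full((T, D), hyp["offset"]), ["ae"] * T)
    print(rescore_nbest(obs, [{"offset": 0.1}, {"offset": 2.0}], dummy, params))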
20
Results (recognition accuracy)

[Plot: recognition accuracy as a function of N in the N-best list, with the HMM baseline shown for comparison.]
21
Summary & Conclusion
  • Human speech production and perception viewed as synergistic elements in a closed-loop communication chain
  • They function as encoding and decoding of linguistic messages, respectively
  • In humans, the speech encoder (production system) consists of phonological (symbolic) and phonetic (numeric) levels
  • The current HMM approach approximates these two levels crudely:
    • phone-based phonological model ("beads-on-a-string")
    • multiple Gaussians as the phonetic model, applied to acoustics directly
    • very weak hidden structure

22
Summary & Conclusion (cont'd)
  • Linguistic message recovery (decoding) formulated as:
    • auditory reception: an efficient, robust speech representation that provides temporal landmarks for phonological features
    • cognitive perception: using encoder knowledge (the internal model) to perform probabilistic "analysis by synthesis", or pattern matching
  • Dynamic Bayes networks developed as the computational tool for constructing the encoder and decoder
  • Speaker-listener interaction (in addition to poor acoustic environments) causes substantial changes in articulatory behavior and acoustic patterns
  • Scientific background and computational framework for our recent MSR speech recognition research

23
End (backup slides follow)