A glimpsing model of speech perception

1
A glimpsing model of speech perception
  • Martin Cooke & Sarah Simpson

Speech and Hearing Research, Department of
Computer Science, University of Sheffield
http://www.dcs.shef.ac.uk/martin
2
Motivation: The nonstationarity paradox
  • speech technology performance falls with the
    nonstationarity of the noise background

Simpson & Cooke (2003)
4
Possible factors
  • In a 1-speaker background, listeners can
      • employ organisational cues from the
        background source to help segregate the foreground
      • employ schemas for both foreground and
        background
      • benefit from better glimpses of the speech
        target
  • But multi-speaker backgrounds have certain
    advantages
      • less chance of informational masking
      • easier enhancement algorithm

5
Glimpsing opportunities
Spectro-temporal glimpse densities:
percentage of time-frequency regions with a
locally positive SNR
6
Glimpsing
Informal definition: a glimpse is some
time-frequency region which contains a reasonably
undistorted view of local signal properties
  • Precursors
      • Term used by Miller & Licklider (1950) to explain
        intelligibility of interrupted speech
      • Related to the multiple looks model of Viemeister &
        Wakefield (1991), which demonstrated intelligent
        temporal integration of tone bursts
      • Assmann & Summerfield (in press) suggest
        glimpsing and tracking as a way of understanding
        how listeners cope with adverse conditions
      • Culling & Darwin (1994) developed a glimpsing
        model to explain double vowel identification for
        small ΔF0s
      • de Cheveigné & Kawahara (1999) can be considered
        a glimpsing model of vowel identification
      • Close relation to missing data processing (Cooke
        et al., 1994)

7
Types of glimpses
Comodulated, e.g. Miller & Licklider (1950)
Spectral, e.g. Warren et al. (1995)
General uncomodulated, e.g. Howard-Jones & Rosen
(1993), Buss et al. (2003)
8
Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into
24 ¼-octave bands, extracted the temporal
envelope in each band, and replaced those parts
of the envelope below a target level with a
constant value. Intelligibility was 60% when
98% of the signal was missing.
9
Glimpsing in natural conditions: the dominance
effect
  • Although audio signals combine additively, the
    occlusion metaphor is more appropriate due to
    log-like compression in the auditory system

Consequently, most regions in a mixture are
dominated by one or other source, leaving very
few ambiguous regions, even for a pair of speech
signals mixed at 0 dB.
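
The dominance argument can be checked with a little arithmetic. The sketch below assumes that the two sources add in power within a time-frequency cell (i.e. they are uncorrelated) and that compression is exactly logarithmic (levels in dB). Both are simplifications, and the function name is illustrative rather than taken from the slides. It returns how far the mixture level lies above the level of the stronger source.

```python
import math


def dominance_error_db(level_diff_db: float) -> float:
    """Excess of the mixture level (dB) over the stronger source's level,
    for two sources whose levels differ by level_diff_db.

    Assumes powers add (uncorrelated sources); this is an assumption,
    not something stated in the slides.
    """
    d = abs(level_diff_db)
    return 10.0 * math.log10(1.0 + 10.0 ** (-d / 10.0))


for d in (0, 3, 6, 10, 20):
    print(f"{d:>2} dB level difference: mixture exceeds stronger source by "
          f"{dominance_error_db(d):.2f} dB")
```

Under these assumptions the mixture never exceeds the stronger source by more than about 3 dB (equal levels), and the excess drops below 1 dB once the levels differ by 6 dB, which is why treating each cell as belonging to one source is a reasonable approximation.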
10
Issues for a glimpsing model
  • What constitutes a useful glimpse?
  • Is sufficient information contained in glimpses?
  • How do listeners detect glimpses?
  • How can they be integrated?

Glimpse detection
Glimpse integration
11
Glimpsing study
  • Aims
      • Determine if glimpses contain sufficient
        information
      • Explore definition of useful glimpse
  • Comparison between listeners and model using
    natural VCV stimuli
      • Subset of Shannon et al. (1999) corpus
      • V = /a/
      • C = b, d, g, p, t, k, m, n, l, r, f, v, s, z,
        sh, ch
  • Background source
      • reversed multispeaker babble for N = 1, 8
      • Allows variation in glimpsing opportunities
  • 3 SNRs (TMRs): 0, -6 and -12 dB
  • 12 listeners heard 160 tokens in each condition
      • 2 repeats × 16 VCVs × 5 male speakers

12
Identification results
1-speaker
8-speaker
13
Glimpsing model
  • CDHMM employing missing data techniques
      • 16 whole-word HMMs
      • 8 states
      • 4-component Gaussian mixture per state
  • Input representation
      • 10 ms frames of modelled auditory excitation
        pattern (40 gammatone filters, Hilbert envelope,
        8 ms smoothing)
      • NB: only simultaneous masking is modelled
  • Training
      • 8 repetitions of each VCV by 5 male speakers per
        model
  • Testing
      • As for listeners, viz. 2 repetitions of each VCV
        by 5 male speakers
      • Performance in clean: > 99%
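
One common missing-data technique is marginalisation: with diagonal-covariance Gaussians, channels that are not glimpsed are simply left out of the state likelihood. The slides do not say which variant was used (bounded marginalisation is another option), so the following is only a sketch under that assumption, with hypothetical function and parameter names.

```python
import math
from typing import Sequence


def masked_gmm_loglik(
    frame: Sequence[float],
    mask: Sequence[bool],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    computed over the glimpsed channels only (mask[c] is True)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, keep, m, v in zip(frame, mask, mu, var):
            if keep:  # channels outside the mask are marginalised out (skipped)
                log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(log_p)
    # Log-sum-exp over mixture components for numerical stability.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))
```

With the model described above, `frame` and `mask` would each have 40 entries (one per gammatone channel) and each state would supply 4 mixture components. If no channel is glimpsed, the function returns 0, i.e. the frame carries no evidence for or against any state.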

14
Model performance I: ideal glimpses
  • Ideal glimpses
      • All time-frequency regions whose local SNR
        exceeds a threshold
      • Optimum threshold: 0 dB
  • For this task, there is more than sufficient
    information in the glimpsed regions
  • Listeners perform suboptimally with respect to
    this glimpse definition
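
The ideal-glimpse definition can be sketched as a threshold on the local SNR, computed from the target and masker before mixing (hence "ideal": a listener has no such access). The grid layout, function names and the handling of silent cells below are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Grid = Sequence[Sequence[float]]  # power values indexed [channel][frame]


def local_snr_db(target_power: float, masker_power: float) -> float:
    """Local SNR in dB for one time-frequency cell."""
    if target_power <= 0.0:
        return -math.inf  # no target energy: never a glimpse
    if masker_power <= 0.0:
        return math.inf   # target present, masker silent
    return 10.0 * math.log10(target_power / masker_power)


def ideal_glimpse_mask(target: Grid, masker: Grid,
                       threshold_db: float = 0.0) -> List[List[bool]]:
    """True where the local SNR exceeds threshold_db."""
    return [
        [local_snr_db(t, m) > threshold_db for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(target, masker)
    ]


def glimpse_density(mask: Sequence[Sequence[bool]]) -> float:
    """Percentage of time-frequency cells that are glimpsed."""
    cells = [cell for row in mask for cell in row]
    return 100.0 * sum(cells) / len(cells)


target = [[4.0, 1.0, 0.0], [2.0, 8.0, 1.0]]
masker = [[1.0, 1.0, 1.0], [4.0, 2.0, 1.0]]
mask_0db = ideal_glimpse_mask(target, masker, threshold_db=0.0)
mask_m6db = ideal_glimpse_mask(target, masker, threshold_db=-6.0)
print(glimpse_density(mask_0db), glimpse_density(mask_m6db))
```

Lowering the threshold admits more cells, at the cost of including regions where the masker contributes more to the observed energy.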

Panels: 1-speaker, 8-speaker
15
Model performance: variation in detection
threshold
  • Q: Can varying the local SNR threshold for glimpse
    detection produce a better match?
  • No choice of local SNR threshold provides a good
    fit to listeners
  • Closest fit shown (-6 dB)

Panels: 1-speaker, 8-speaker
16
Analysis
  • Unreasonable to expect listeners to detect
    individual glimpses in a sea of noise unless
    glimpse region is large enough

18
Model performance: useable glimpses
  • Definition: glimpsed region must occupy at least
    N ERBs and T ms
  • Search over 1-15 ERBs, 10-100 ms, at various
    detection thresholds
  • Best match at
      • 6.3 ERBs (9 channels)
      • 40 ms
      • 0 dB local SNR threshold

Panels: 1-speaker, 8-speaker
  • Howard-Jones & Rosen (1993) suggested a 2-4 band
    limit for uncomodulated glimpsing
  • Buss et al. (2003) found evidence for
    uncomodulated glimpsing in up to 9 bands
19
Consonant identification
  • Reasonable matches overall apart from b, s and z
  • However, little token-by-token agreement between
    common listener errors and model errors.
  • Why?

20
Factors
Confusability
Audibility of target
Informational masking
Energetic masking
Existence of schemas for target
Successful identification
Organisational cues in target
Existence of schemas for background
Organisational cues in background
21
Measuring energetic masking
  • Approach: resynthesise glimpses alone
      • Filter, time-reverse, refilter to remove phase
        distortion
      • Select regions based on local SNR mask
  • Results
      • Little difference for 1-speaker background,
        suggesting relatively low contribution of informational
        masking in this case (due to reversed masker?)
      • Larger difference for 8-speaker case, possibly due
        to unrealistic glimpses
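
The "filter, time-reverse, refilter" step is forward-backward filtering: the phase shift of the second pass cancels that of the first. The sketch below uses a one-pole smoother purely as a stand-in for the analysis filters, which the slides do not specify at this point; function names are illustrative.

```python
from typing import List, Sequence


def one_pole_lowpass(x: Sequence[float], a: float = 0.5) -> List[float]:
    """Causal smoother y[n] = (1 - a) * x[n] + a * y[n - 1] (stand-in filter)."""
    y, prev = [], 0.0
    for sample in x:
        prev = (1.0 - a) * sample + a * prev
        y.append(prev)
    return y


def zero_phase(x: Sequence[float], a: float = 0.5) -> List[float]:
    """Filter, time-reverse, refilter, then reverse back."""
    forward = one_pole_lowpass(x, a)
    backward = one_pole_lowpass(forward[::-1], a)
    return backward[::-1]


# An impulse in the middle of the signal: a single pass smears it only
# towards later samples, while the two-pass response is symmetric.
impulse = [0.0] * 101
impulse[50] = 1.0
single_pass = one_pole_lowpass(impulse)
response = zero_phase(impulse)
```

The symmetric impulse response is what "no phase distortion" means here: resynthesised energy stays aligned in time with the original, at the price of applying the magnitude response twice.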

Panels: 1-speaker, 8-speaker
Conditions: glimpses alone vs. speech+noise
22
Comparison with ideal model
  • Results
      • Ideal model performs well in excess of listeners
        when supplied with precisely the same information
  • Possible reasons
      • Distortions
      • Glimpses do not occur in isolation: possibility
        that a noise background will help
      • Lack of nonsimultaneous masking model will
        inflate model performance

Ideal (model)
Ideal? (listeners)
23
The glimpse decoder
  • Attempt at a unifying statistical theory for
    primitive and model-driven processes in CASA
  • Basic idea: decoder not only determines the most
    likely speech hypothesis but also decides which
    glimpses to use
  • Key advantage: no longer need to rely on clean
    acoustics!
  • Can interpret (some) informational masking
    effects as the incorrect assignment of glimpses
    during signal interpretation
  • Barker, J., Cooke, M.P. & Ellis, D.P.W., Decoding
    speech in the presence of other sources,
    accepted for Speech Communication

24
Summary & outlook
  • Proposed a glimpsing model of speech
    identification in noise
  • Demonstrated sufficiency of information in target
    glimpses, at least for VCV task
  • Preliminary definition of useful glimpse gives
    good overall model-listener match
  • Introduced 2 procedures for measuring the amount
    of energetic masking (i) via ASR (ii) via glimpse
    resynthesis
  • Need nonsimultaneous masking model
  • Need to isolate effects due to schemas
  • Repeat using non-reversed speech to introduce
    more informational masking
  • Need to quantify effect of distortion in glimpse
    resynthesis

25
Masking noise can be beneficial
Warren et al. (1995) demonstrated a spectral
induction effect with 2 narrow bands of speech
with intervening noise
fullband
Cooke & Cunningham (in prep): Spectral induction
with single speech bands.
26
Speech modulated noise
  • Speech modulated noise (SMN), as in Brungart (2001)
  • Model results and glimpse distributions indicate
    an increase in energetic masking for this type of
    masker

Panels: natural speech; speech modulated noise
Conditions: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr
27
Speech modulated noise
  • Listeners perform better with SMN than predicted
    on the basis of reduced glimpses (cf. SMN model),
    but not quite as well as they do with a natural
    speech masker
  • Suggests energetic masking is not the whole story
    (cf. Brungart, 2001), but further work is needed to
    quantify the relative contribution of
      • Release from informational masking (IM)
      • Absence of background models/cues