Title: A brain's-eye view of speech perception
1 A brain's-eye view of speech perception
David Poeppel
Cognitive Neuroscience of Language Lab, Department of Linguistics and Department of Biology, Neuroscience and Cognitive Science Program, University of Maryland College Park
- Colleagues
- Allen Braun, NIH
- Greg Hickok, UC Irvine
- Jonathan Simon, Univ. Maryland
- Students
- Anthony Boemio
- Maria Chait
- Huan Luo
- Virginie van Wassenhove
2encoding ?
Is this a hard problem? Yes! If it could be
solved straightforwardly (e.g. by machine), Mark
Liberman would be in Tahiti having cold beers.
representation ?
3 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
    - Psychophysical evidence for temporal integration
    - Imaging evidence
4interface with lexical items, word recognition
5 interface with lexical items, word recognition
hypothesis about storage: distinctive features
-voice   voice   voice
labial   high    labial
-round   round   -round
.        .       .
6 production, articulation of speech
interface with lexical items, word recognition
hypothesis about storage: distinctive features
-voice   voice   voice
labial   high    labial
-round   round   -round
.        .       .
7 production, articulation of speech
hypothesis about production: distinctive features
-voice   voice
labial   high
.        .
interface with lexical items, word recognition
hypothesis about storage: distinctive features
-voice   voice   voice
labial   high    labial
-round   round   -round
.        .       .
8 production, articulation of speech: FEATURES
analysis of auditory signal -> spectro-temporal rep. -> FEATURES
interface with lexical items, word recognition: FEATURES
9 Unifying concept: distinctive feature
auditory-motor interface
coordinate transform from acoustic to articulatory space
production, articulation of speech
analysis of auditory signal -> spectro-temporal rep. -> FEATURES
auditory-lexical interface
interface with lexical items, word recognition
10 coordinate transform from acoustic to articulatory space
production, articulation of speech
analysis of auditory signal -> spectro-temporal rep. -> FEATURES
interface with lexical items, word recognition
11 Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
STG (bilateral): acoustic-phonetic speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
12 Indefrey & Levelt (in press), Cognition: meta-analysis of neuroimaging data, perception/production overlap
Shared neural correlates of word production and perception processes: bilateral mid/post STG, L anterior STG, L mid/post MTG, L post IFG
- MTG and IFG overlap when controlling for the overt/covert distinction across tasks
- Hypothesized functions
  - lexical selection (MTG)
  - lexical phon. code retr. (MTG)
  - post-lexical syllabification (IFG)
13 Scott & Johnsrude 2003
14 Possible Subregions of Inferior Frontal Gyrus, Burton (2001)
Auditory studies: Burton et al. (2000), Demonet et al. (1992, 1994), Fiez et al. (1995), Zatorre et al. (1992, 1996)
Visual studies: Sergent et al. (1992, 1993), Poldrack et al. (1999), Paulesu et al. (1993, 1996), Shaywitz et al. (1995)
15 Auditory lexical decision versus FM/sweeps (a), CP/syllables (b), and rest (c)
[Figure: panels (a), (b), (c); slices z6, z9, z12]
D. Poeppel et al. (in press)
16 fMRI (yellow blobs) and MEG (red dots) recordings of speech perception show pronounced bilateral activation of the temporal cortices
T. Roberts & D. Poeppel (in preparation)
17Binder et al. 2000
18 Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
STG (bilateral): acoustic-phonetic speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
19 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
    - Psychophysical evidence for temporal integration
    - Imaging evidence
21The local/global distinction in vision is
intuitively clear
Chuck Close
22What information does the brain extract from
speech signals?
23Acoustic and articulatory phonetic phenomena
occur on different time scales
fine structure
envelope
24 Does different granularity in time matter?
Segmental and subsegmental information: serial order in speech (fool/flu, carp/crap, bat/tab)
Supra-segmental information: prosody (Sleep during lecture! versus Sleep during lecture?)
25 The local/global distinction can be conceptualized as a multi-resolution analysis in time
[Diagram labels: further processing; binding process]
Supra-segmental information (time 200ms): syllabicity, metrics, tone
Segmental information (time 20-50ms): features, segments
26 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
    - Psychophysical evidence for temporal integration
    - Imaging evidence
27 Temporal integration windows
Psychophysical and electrophysiological evidence suggests that perceptual information is integrated and analysed in temporal integration windows (v. Bekesy 1933; Stevens and Hall 1966; Näätänen 1992; Theunissen and Miller 1995; etc.). The importance of the concept of a temporal integration window is that it suggests discontinuous processing of information in the time domain. The CNS, on this view, treats time not as a continuous variable but as a series of temporal windows, and extracts data from a given window.
[Figure labels: arrow of time, physics; arrow of time, Central Nervous System]
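As a rough illustration of what discontinuous processing in the time domain could look like, the sketch below chops a sampled signal into consecutive, non-overlapping windows and extracts one value per window. The function name, the choice of RMS as the extracted quantity, and the 25 ms and 250 ms window lengths are assumptions made for this example; the slide does not specify any particular computation.

```python
import math

def integrate_in_windows(signal, fs_hz, window_ms):
    """Split `signal` into consecutive non-overlapping windows and
    return one RMS value per complete window (illustrative only)."""
    n = int(round(fs_hz * window_ms / 1000.0))  # samples per window
    if n <= 0:
        raise ValueError("window too short for this sampling rate")
    values = []
    for start in range(0, len(signal) - n + 1, n):
        chunk = signal[start:start + n]
        values.append(math.sqrt(sum(x * x for x in chunk) / n))
    return values

# One second of a 100 Hz tone sampled at 1 kHz (hypothetical test signal).
fs = 1000
tone = [math.sin(2 * math.pi * 100 * t / fs) for t in range(fs)]

short_windows = integrate_in_windows(tone, fs, 25)   # short windows
long_windows = integrate_in_windows(tone, fs, 250)   # long windows
print(len(short_windows), len(long_windows))
```

The same one-second input yields many values under the short window and few under the long one, which is the sense in which the window size sets the grain of the resulting representation.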
28 Asymmetric sampling/quantization of the speech waveform
[Figure: waveform of "This paper is hard to publish", segmented into chunks]
29 Two spectrograms of the same word illustrate how different analysis windows highlight different aspects of the sounds.
(a) High time, low frequency resolution: each glottal pulse is visible as a vertical striation
(b) Low time, high frequency resolution: each harmonic is visible as a horizontal stripe
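The trade-off between the two spectrograms can be sketched numerically. The frequency resolution of an analysis window is roughly the reciprocal of its duration, so a short window separates glottal pulses in time and a long window separates harmonics in frequency. The 100 Hz fundamental, the 5 ms and 50 ms window lengths, and the simple comparison rules below are assumptions for illustration, not values taken from the slide.

```python
def bin_spacing_hz(window_ms):
    """Approximate frequency resolution (Hz) of an analysis window:
    the reciprocal of its duration."""
    return 1000.0 / window_ms

def describe_window(window_ms, f0_hz=100.0):
    """Rule-of-thumb check of what a window resolves for a voice with
    fundamental frequency f0_hz (an assumed value)."""
    glottal_period_ms = 1000.0 / f0_hz
    spacing = bin_spacing_hz(window_ms)
    return {
        "bin_spacing_hz": spacing,
        # Window shorter than one glottal period: pulses stay separate.
        "resolves_glottal_pulses": window_ms < glottal_period_ms,
        # Resolution finer than the harmonic spacing (f0): harmonics separate.
        "resolves_harmonics": spacing < f0_hz,
    }

print(describe_window(5))    # short window, as in panel (a)
print(describe_window(50))   # long window, as in panel (b)
```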
30 Hypothesis: Asymmetric Sampling in Time (AST)
Left temporal cortical areas preferentially extract information over 25ms temporal integration windows. Right hemisphere areas preferentially integrate over long, 150-250ms integration windows. By assumption, the auditory input signal has a neural representation that is bilaterally symmetric (e.g. at the level of core); beyond the initial representation, the signal is elaborated asymmetrically in the time domain. Another way to conceptualize the AST proposal is to say that the sampling rate of non-primary auditory areas is different, with LH sampling at high frequencies (40Hz) and RH sampling at low frequencies (4-10Hz).
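The link between window size and sampling rate is a reciprocal: a window of T milliseconds corresponds to a rate of 1000/T Hz. The short sketch below works through the numbers quoted on the slide; the function name is made up for this example.

```python
def window_ms_to_rate_hz(window_ms):
    """Rate (Hz) at which back-to-back windows of this length recur."""
    return 1000.0 / window_ms

for window_ms in (25, 100, 150, 250):
    print(window_ms, "ms ->", round(window_ms_to_rate_hz(window_ms), 2), "Hz")
```

A 25 ms window gives 40 Hz and a 250 ms window gives 4 Hz, matching the slide. Note that the 10 Hz end of the quoted RH range corresponds to a 100 ms window, so the 150-250ms and 4-10Hz figures are best read as approximate.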
31 a. Physiological lateralization
Symmetric representation of spectro-temporal receptive fields in primary auditory cortex
Temporally asymmetric elaboration of perceptual representations in non-primary cortex
[Figure: LH and RH panels. y-axis: proportion of neuronal ensembles; x-axis: size of temporal integration windows (ms), 25 to 250, with associated oscillatory frequency (Hz), 40Hz to 4Hz]
32 Asymmetric sampling in time (AST): characteristics
- AST is an example of functional segregation, a standard concept.
- AST is an example of multi-resolution analysis, a signal processing strategy common in other cortical domains (cf. visual areas MT and V4 which, among other differences, have phasic versus tonic firing properties, respectively).
- AST speaks to the granularity of perceptual representations: the model suggests that there exist basic perceptual representations that correspond to the different temporal windows (e.g. featural info is as basic as the envelope of syllables, on this view).
- The AST model connects in plausible ways to the local versus global distinction: there are multiple representations of a given signal on different scales (cf. wavelets).
  - Global => large-chunk analysis, e.g., syllabic level
  - Local => small-chunk analysis, e.g., subsegmental level
33 a. Physiological lateralization
Symmetric representation of spectro-temporal receptive fields in primary auditory cortex
Temporally asymmetric elaboration of perceptual representations in non-primary cortex
[Figure: LH and RH panels. y-axis: proportion of neuronal ensembles; x-axis: size of temporal integration windows (ms), 25 to 250, with associated oscillatory frequency (Hz), 40Hz to 4Hz]
b. Functional lateralization
LH: analyses requiring high temporal resolution
RH: analyses requiring high spectral resolution
34 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
      - AST model
    - Psychophysical evidence for temporal integration
    - Imaging evidence
35 Perception of FM sweeps
Huan Luo, Mike Gordon, Anthony Boemio, David Poeppel
36FM Sweep Example
waveform
80msec, from 3-2 kHz, linear FM sweep
spectrogram
37 The rationale
- Important cues for speech perception
  - Formant transitions in speech sounds (for example, F2 direction can distinguish /ba/ from /da/)
  - Importance in tone languages
- Vertebrate auditory system is well equipped to analyze FM signals.
38 Tone languages
- For example, Chinese, Thai
- The direction of FM (of the fundamental frequency) is important in the language to make lexical distinctions.
- Four tones in Chinese: /Ma 1/, /Ma 2/, /Ma 3/, /Ma 4/
39 Questions
- How good are we at discriminating these signals?
  - determine the threshold of the duration of stimuli (corresponding to rate) for the detection of FM direction
- Any performance difference between UP and DOWN detection?
- Will language experience affect the performance of such a basic perceptual ability?
40 Stimuli
- Linearly frequency modulated
- Frequency range studied: 2-3 kHz (0.5 oct)
- Two directions (Up / Down)
- Changing FM rate (frequency range/time) by changing duration (slow / fast). For each frequency range, frequency span is kept constant.
- Stimulus duration from 5 msec (100 oct/sec) to 640 msec (0.8 oct/sec)
Tasks
Detection and discrimination of UP versus DOWN: 2AFC, 2IFC, 3IFC
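The stimulus description can be sketched in code: the FM rate is the frequency span divided by the duration, and a linear sweep is a sinusoid whose instantaneous frequency changes at a constant rate in Hz per second. The function names and the 44.1 kHz sampling rate are assumptions for this example, and the code is not the authors' stimulus-generation procedure. The 0.5 octave span is the nominal value from the slide; the exact span of 2-3 kHz is log2(1.5), about 0.585 octaves.

```python
import math

def fm_rate_oct_per_s(duration_ms, span_octaves=0.5):
    """FM rate in octaves per second for a sweep covering
    `span_octaves` (nominal 0.5 oct) in `duration_ms`."""
    return span_octaves / (duration_ms / 1000.0)

def linear_fm_sweep(f_start_hz, f_end_hz, duration_ms, fs_hz=44100):
    """Samples of a linear FM sweep; fs_hz is an assumed sampling rate."""
    duration_s = duration_ms / 1000.0
    n = int(round(fs_hz * duration_s))
    slope = (f_end_hz - f_start_hz) / duration_s  # Hz per second
    samples = []
    for i in range(n):
        t = i / fs_hz
        # Phase is the integral of the instantaneous frequency.
        phase = 2 * math.pi * (f_start_hz * t + 0.5 * slope * t * t)
        samples.append(math.sin(phase))
    return samples

print(round(fm_rate_oct_per_s(5), 2), round(fm_rate_oct_per_s(640), 2))
up = linear_fm_sweep(2000, 3000, 80)     # UP sweep, 80 ms
down = linear_fm_sweep(3000, 2000, 80)   # DOWN sweep, same span and duration
```

With the nominal span, 5 msec gives 100 oct/sec and 640 msec gives about 0.78 oct/sec, in line with the rounded figures on the slide.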
41- English speakers
- 3 frequency ranges relevant to speech
- (approximately F1, F2, F3 ranges)
- single-interval 2-AFC
- Two main findings
- threshold for UP at 20ms
- UP better than DOWN
[Figure panels: 2-3 kHz, 1-1.5 kHz, 600-900 Hz]
Gordon & Poeppel (2001), JASA-ARLO
42 2IFC
- To eliminate the possibility of a bias strategy that subjects can use
- To see whether the asymmetric performance of English subjects is due to their Up preference bias
The two sounds have the same duration, so the only difference is direction
Interval 1: UP; Interval 2: Down
Which interval (1 or 2) contains the sound with a given direction?
43 Results for Chinese Subjects
No significant difference. Threshold for both UP and DOWN is about 20 msec.
44 Results for English Subjects
No difference now between UP and DOWN. Threshold for both at 20 msec. No difference between Chinese and English subjects now.
45 3IFC
Standard: UP; Interval 1: UP; Interval 2: Down
Choose which interval contains the DIFFERENT sound among the three (different quality rather than only direction)
46 3IFC versus 2IFC
No difference between Chinese and English subjects
Threshold confirmed at 20ms
47 Conclusion
- Importance of 20 msec as the threshold for discrimination of FM sweeps
  - corresponds to the temporal order threshold determined by Hirsh 1959
  - consistent with Schouten 1985, 1989 testing FM sweeps
  - this basic threshold arguably reflects the shortest integration window that generates robust auditory percepts.
48 Click trains
Anthony Boemio, David Poeppel
49Click Stimuli
50Psychophysics
51 Auditory-visual integration: the McGurk effect
Virginie van Wassenhove, Ken Grant, David Poeppel
52McGurk Effect
- Audiovisual (AV) token
- Visual (V) token
- Auditory (A) token
53 Identification Task (3AFC): ApVk
[Figure labels: TWI; true bimodal responses]
Response rate as a function of SOA (ms) in the ApVk McGurk pair. Mean responses (N=21) and standard errors. Fusion rate (open red squares) and corrected fusion rate (filled red squares, dotted line) are /ta/ responses, visually driven responses (open green triangles) are /ka/, and auditorily driven responses (filled blue circles) are /pa/. A negative value in corrected fusion rate is interpreted as a visually dominated error response /ta/.
54 Simultaneity Judgment Task (2AFC): ApVk vs. AtVt and AbVg vs. AdVd
Simultaneity judgment as a function of SOA (ms) in both incongruent and congruent conditions (ApVk and AtVt, N=21; AbVg and AdVd, N=18). The congruent conditions (open symbols) are associated with a broader and higher simultaneity judgment profile than the incongruent conditions (filled symbols).
55Temporal Window of Integration (TWI) across
Tasks and Bimodal Speech Stimuli
56 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
      - AST model
    - Psychophysical evidence for temporal integration
      - FM sweeps and click trains: 20-30ms integration
      - AV processing in McGurk: 200ms integration
    - Imaging evidence
57 Binding of Temporal Quanta in Speech Processing
Maria Chait, Steven Greenberg, Takayuki Arai, David Poeppel
58 Multi Resolution Analysis Hypothesis
[Diagram labels: SYLLABLE; binding process]
Supra-segmental information (t.s. 300 ms): stress, tone, syllabicity
(Sub)-segmental information (t.s. 30 ms): feature
60
- 0-6 kHz
- 14 channels
- spaced in 1/3 octave steps along the cochlear frequency map
- Every two neighboring channels are separated by 50 Hz
61Envelope Extraction
Amplitude
Time
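Envelope extraction can be sketched as rectification followed by smoothing. The slide does not say which method was used, so the rectify-and-average approach, the 20 ms smoothing window, and the amplitude-modulated test tone below are all assumptions chosen to keep the example self-contained.

```python
import math

def envelope(signal, fs_hz, smooth_ms=20.0):
    """Amplitude envelope by full-wave rectification followed by a
    causal moving average (one simple method among several)."""
    n = max(1, int(round(fs_hz * smooth_ms / 1000.0)))
    rectified = [abs(x) for x in signal]
    env = []
    running_sum = 0.0
    for i, x in enumerate(rectified):
        running_sum += x
        if i >= n:
            running_sum -= rectified[i - n]  # drop sample leaving the window
        env.append(running_sum / min(i + 1, n))
    return env

# Hypothetical test signal: 1 kHz carrier with a 4 Hz amplitude modulation.
fs = 8000
signal = []
for i in range(fs // 2):
    t = i / fs
    modulator = 0.5 * (1.0 - math.cos(2 * math.pi * 4 * t))
    signal.append(modulator * math.sin(2 * math.pi * 1000 * t))

env = envelope(signal, fs)
print(round(env[1000], 2), round(env[2000], 2))  # near a peak, near a trough
```

The envelope follows the slow 4 Hz modulation and discards the 1 kHz fine structure, which is the separation the amplitude-versus-time plot illustrates.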
63Original
High Passed
Low Passed
64 Evidence
- Comodulation masking release
- Ahissar et al. (2001): phase locking in the auditory cortex to the envelope of sentence stimuli
- Shannon (1995)
- Drullman (1994)
  - Effect of low-pass filtering the envelope on speech reception: severe reduction at 0-2Hz cutoff frequencies; marginal contribution of frequencies above 16Hz
  - Effect of high-pass filtering the envelope: reduction in speech intelligibility for cutoff frequencies above 64Hz; no reduction in sentence intelligibility when only frequencies below 4Hz are reduced
65 Experiment 1
- Stimuli
  - 53 sentences from the IEEE corpus
  - Nonsense syllables (CUNY)
    - 8 blocks: 2 (voiced/voiceless) x 2 vowels (/a/, /i/) x 2 (CV/VC)
  - 3 manipulations
    - 0-3 Hz low pass
    - 22-40 Hz band pass
    - 0-3 and 22-40 Hz
  - Each subject hears all 53 sentences but only one manipulation per sentence. A practice block of 26 sentences precedes the experiment.
- Task
  - Sentences: subjects asked to write down what they heard as precisely as they can
  - Syllables: 7-alternative forced choice
66Results
high-pass
67Results
high-pass
low-pass
68Results
high-pass plus low-pass?
high-pass
low-pass
69Results
high-pass
low-pass
high-pass plus low-pass?
Result reflects the interaction between
information carried on the short and long time
scales.
70 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
      - AST model
    - Psychophysical evidence for temporal integration
      - FM sweeps and click trains: 20-30ms integration
      - AV processing in McGurk: 200ms integration
      - Interaction of temporal windows
71 fMRI study of temporal structure in concatenated FMs
Anthony Boemio, Allen Braun, Steven Fromm, David Poeppel
72Stimulus Properties
73 Stimulus Properties
[Figure: spectrograms, amplitude vs. time (sec), and PSDs vs. frequency (Hz) for the FM, TONE, and CNST stimuli]
All 13 stimuli have nearly identical long-term
spectra and RMS power over the entire 9-second
stimulus duration. Stimuli differ only in
segment duration which was determined by drawing
from a Gaussian distribution (previous panel),
with means of 12, 25, 45, 85, 160, and 300ms.
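One way to picture the construction of such a stimulus is to draw segment durations from a Gaussian until the 9-second stimulus is filled. The slide gives the means but not the spread, so the standard deviation (20% of the mean), the rejection of non-positive draws, the truncation of the last segment, and the fixed seed are all assumptions for this sketch.

```python
import random

def draw_segment_durations(mean_ms, total_ms=9000.0, sd_fraction=0.2, seed=0):
    """Draw Gaussian segment durations (ms) that fill `total_ms`.
    sd_fraction is an assumed spread; the last segment is truncated."""
    rng = random.Random(seed)
    durations = []
    remaining = total_ms
    while remaining > 0:
        d = rng.gauss(mean_ms, sd_fraction * mean_ms)
        if d <= 0:
            continue  # discard non-physical draws
        d = min(d, remaining)
        durations.append(d)
        remaining -= d
    return durations

for mean_ms in (12, 25, 45, 85, 160, 300):
    segments = draw_segment_durations(mean_ms)
    print(mean_ms, "ms mean ->", len(segments), "segments")
```

Shorter mean durations produce many more segments, and therefore many more segment transitions, within the same 9 seconds.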
74 fMRI
- Single-trial sparse acquisition paradigm (clustered volume acqu.)
- 1.5T GE Signa, echo-planar sequence
- 11.4s TR (9s signal, 2.4s volume), TE 40ms
- 24 reps/condition
- SPM 99 random-effects model, p < 0.05 corrected
75 SPM 99 Cohort Analysis: FMs-CNST Categorical Contrasts (p < 0.05 corr.)
78 Hemodynamic response/stimulus model: not all segment transitions are equal.
[Figure labels: acquisition; only 1 second of stimuli is shown for clarity]
Including the segment transitions and segments themselves, but assuming that transitions between long segments contribute more to the response than transitions between shorter ones, produces the observed activation vs. segment-duration relation (left).
Threshold set by categorical contrast to CNST stimulus: anything below this level will be zero in the SPM
80 MEG study of spectral responses to complex sounds
David Poeppel, Huan Luo, Dana Ritter, Anthony Boemio, Didier Depireux, Jonathan Simon
81 Asymmetric sampling in time (AST) hypothesis predicts electrophysiological asymmetries in specific frequency bands, gamma (25-55Hz) and theta (3-8Hz), because the hypothesized temporal quantization is reflected as oscillatory activity.
[Figure: LH and RH panels. y-axis: sensitivity of neuronal ensembles; x-axis: size of temporal integration windows (ms), 25 to 250, with associated oscillatory frequency (Hz), 40Hz to 4Hz]
85 Flow chart
[Flow chart labels: LH, RMS, gamma for LH, theta for LH; RH, RMS, gamma for RH, theta for RH]
87Multi-taper spectral analysis
88Result
89 Power ratio in specific frequency bands, P(L)/(P(L)+P(R))
        Kaiser   Remez    Elliptic
Gamma   0.4769   0.4751   0.4733
Theta   0.3958   0.3965   0.4210
- The difference is much greater in the theta band (low frequency band), and RH activation in the theta band is greater than LH
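Reading the slide's ratio as left-hemisphere power divided by the sum of left and right power (an assumption, since the operator between P(L) and P(R) is missing in the transcript), a value of 0.5 means symmetric power and a value below 0.5 means more power on the right. The helper names below are made up for this sketch; the numbers are the Kaiser-filter values from the table.

```python
def left_power_ratio(p_left, p_right):
    """P(L) / (P(L) + P(R)); 0.5 means equal power in both hemispheres."""
    total = p_left + p_right
    if total <= 0:
        raise ValueError("total power must be positive")
    return p_left / total

def right_over_left(ratio):
    """Convert a left-power ratio into the implied P(R)/P(L)."""
    return (1.0 - ratio) / ratio

print(left_power_ratio(1.0, 1.0))          # symmetric case
print(round(right_over_left(0.4769), 2))   # gamma band
print(round(right_over_left(0.3958), 2))   # theta band
```

On this reading the gamma-band ratio implies roughly 10% more power on the right, while the theta-band ratio implies roughly 50% more, which is the larger theta asymmetry the slide points to.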
90Distribution of spectral responses
91 Outline
- (1) Fractionating the problem in space
  - Towards a functional anatomy of speech perception
- (2) Fractionating the problem in time
  - Towards a functional physiology of speech perception
    - A hypothesis about the quantization of time
      - AST model
    - Psychophysical evidence for temporal integration
      - FM sweeps and click trains: 20-30ms integration
      - AV processing in McGurk: 200ms integration
      - Interaction of temporal windows
92 Area Spt (left): auditory-motor interface
pIFG/dPM (left): articulatory-based speech codes
STG (bilateral): acoustic-phonetic speech codes
pMTG (left): sound-meaning interface
Hickok & Poeppel (2000), Trends in Cognitive Sciences; Hickok & Poeppel (in press), Cognition
93 Asymmetric sampling in time (AST) builds on anatomical symmetry but permits functional asymmetry
a. Physiological lateralization
Symmetric representation of spectro-temporal receptive fields in primary auditory cortex
Temporally asymmetric elaboration of perceptual representations in non-primary cortex
[Figure: LH and RH panels. y-axis: proportion of neuronal ensembles; x-axis: size of temporal integration windows (ms), 25 to 250, with associated oscillatory frequency (Hz), 40Hz to 4Hz]
b. Functional lateralization
LH: analyses requiring high temporal resolution
RH: analyses requiring high spectral resolution
94 Conclusion
- The input signal (e.g. speech) must interface with higher-order symbolic representations of different types (e.g. segmental representations relevant to lexical access and supra-segmental representations relevant to interpretation).
- These higher-order representation categories appear to be lateralized (e.g. segmental phonology/LH, phrasal prosody/RH).
- The timing-based asymmetry provides a possible cortical logistical or administrative device that helps create representations of the appropriate granularity.
- If this is on the right track, syllable is, at least for perception, as elementary a unit as feature/segment. Both are basic.
95 Analysis-by-synthesis I
Hypothesize-and-test models
96 Analysis-by-synthesis II
Analysis-by-synthesis model of lexical hypothesis generation and verification (adapted and extended from Klatt, 1979)
[Diagram labels: speech waveform; spectral analysis; peripheral and central neurogram; segmental analysis; partial feature matrix; lexical search; lexical hypotheses; analysis-by-synthesis verification (internal forward model); best-scoring lexical candidates; synt./seman. analysis; acceptable word string; predicted subsequent items]
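The hypothesize-and-test idea can be sketched as a loop: propose lexical candidates, let a forward model predict the features each candidate would produce, and keep the candidates whose predictions best match the partial feature matrix. Everything in the sketch below is a toy assumption (the three-word lexicon, the one-feature-per-segment coding, the match-fraction score); it is not the Klatt model or the speaker's implementation.

```python
def match_score(partial, predicted):
    """Fraction of observed (non-None) slots that the prediction matches."""
    known = [(p, q) for p, q in zip(partial, predicted) if p is not None]
    if not known:
        return 0.0
    return sum(p == q for p, q in known) / len(known)

def analysis_by_synthesis(partial, lexicon, forward_model, top_k=2):
    """Score every lexical hypothesis against the partial feature
    matrix and return the best-scoring candidates."""
    scored = []
    for word in lexicon:
        predicted = forward_model(word)  # synthesize expected features
        if len(predicted) != len(partial):
            continue
        scored.append((match_score(partial, predicted), word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]

# Toy lexicon: one feature per segment (hypothetical coding).
TOY_FEATURES = {
    "bat": ["voice", "low", "-voice"],
    "pat": ["-voice", "low", "-voice"],
    "bad": ["voice", "low", "voice"],
}
observed = ["voice", None, "-voice"]  # middle segment not yet analysed

best = analysis_by_synthesis(observed, TOY_FEATURES, TOY_FEATURES.get)
print(best)
```

Unknown slots in the partial matrix are simply skipped when scoring, so a hypothesis can be verified before the bottom-up analysis is complete.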
97 Analysis-by-synthesis III
[Diagram labels: speech waveform; spectral analysis; peripheral and central neurogram; segmental analysis; partial feature matrix; lexical search; lexical hypotheses; analysis-by-synthesis verification (internal forward model); best-scoring lexical candidates; synt./seman. analysis; acceptable word string; predicted subsequent items]