Title: Speech Perception
1 Speech Perception
- "recognize speech" or "wreck a nice beach"?
2 The Major Questions in Speech Perception
- How do we identify the sounds we hear?
- What about the lack of invariance in the speech signal?
- What about degraded signals?
3 How do we identify sounds?
- Speech occurs at an alarming rate
- (estimates vary between 120 and 180 wpm)
- 10-15 or 25-30 phonetic segments/second!
- The speech signal is continuous; there are no easily identifiable boundaries between words
- The speech signal to the right (figure not included in this transcript) is segmented into "how are you?"
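The rate estimates above can be cross-checked with a little arithmetic. The sketch below converts words per minute into phonetic segments per second. The figure of 5 segments per word is an assumption chosen for illustration (it is not stated on the slide); with it, 120-180 wpm works out to the 10-15 segments/second estimate.

```python
# Rough cross-check of the speech-rate estimates on this slide.
# ASSUMPTION: about 5 phonetic segments per word (illustrative, not from the slide).
SEGMENTS_PER_WORD = 5


def segments_per_second(words_per_minute, segments_per_word=SEGMENTS_PER_WORD):
    """Convert a speaking rate in words/minute to phonetic segments/second."""
    words_per_second = words_per_minute / 60
    return words_per_second * segments_per_word


for wpm in (120, 180):
    print(f"{wpm} wpm -> {segments_per_second(wpm):.0f} segments/second")
```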
4 How do we deal with the lack of invariance in a speech signal?
- Lack of invariance comes from:
- Coarticulation effects (allophonic variation)
- "Tom Burton tried to steal a butter plate" (the /t/ sounds differ from word to word)
- Speaker variation
- No exact repetition
- Reduction / deletion of segments
5 Acoustic Cues
- No single acoustic cue is reliably present for any given phoneme
- For /di/ and /du/, the /d/ is very different, but speakers will indicate that it's still the same segment
- Each phoneme has more than one acoustic cue
- voice-onset time (VOT)
- energy in the burst
- onset frequency of the first formant
- placement in syllable
6 Voice Onset Time (VOT)
- Measure of the time between the burst of air and the beginning of vocal-fold vibration of the adjacent vowel
- Best single cue for distinguishing between voiced/voiceless consonants in many languages: English, Dutch, Spanish, Hungarian, Tamil, Cantonese, Thai, and Eastern Armenian (Lisker & Abramson, 1964)
- BUT we can still interpret whispered speech! (practically all voiceless)
8 Categorical Perception (chunking of speech signals)
- Although speech is non-discrete, we perceive it discretely!
- Task: Identify the sound
- VOT continuum (ms): 0, 10, 20, 30, 40, 50, 60
- Responses: 100% /d/ at the short-VOT end, 50% at the midpoint, 100% /t/ at the long-VOT end
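One common way to picture this identification pattern is an S-shaped (logistic) labeling curve over the VOT continuum. The sketch below is illustrative only: the 30 ms boundary and the slope are assumptions picked so that the curve gives close to 100% /d/ at 0 ms, 50% at the midpoint, and close to 100% /t/ at 60 ms. They are not measured values.

```python
import math

# ASSUMPTIONS (illustrative, not from the slides):
BOUNDARY_MS = 30.0   # VOT at which /d/ and /t/ responses are equally likely
SLOPE = 0.5          # steepness of the category boundary (per ms)


def percent_t(vot_ms, boundary=BOUNDARY_MS, slope=SLOPE):
    """Percentage of /t/ responses predicted by a logistic labeling curve."""
    return 100.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))


for vot in range(0, 61, 10):
    t = percent_t(vot)
    print(f"VOT {vot:2d} ms: /d/ {100 - t:5.1f}%  /t/ {t:5.1f}%")
```

A steeper slope makes the switch from /d/ to /t/ more abrupt, which is the sense in which perception is "categorical".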
9 Categorical Perception: Yeni-Komshian and LaFontaine (1983)
- 7 stimuli between /di/ and /ti/ (VOT 0-60 ms)
- VOT continuum (ms): 0, 10, 20, 30, 40, 50, 60
- Pair types: same, 1 step apart, 2 steps apart
- Task: Discriminate between these sounds
- (2 steps apart, so a 20 ms difference in VOT)
- 0/20 ms: 100% "same"; 40/60 ms: 100% "same"
- 10/30 ms: 50% "same"; 30/50 ms: 50% "same"
- 20/40 ms: 100% "different"
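The discrimination pattern above is what one would expect if listeners compared only the category labels they assign to each sound. The sketch below assumes an idealized labeling function (always /d/ below 30 ms, always /t/ above, 50/50 at exactly 30 ms) and independent labeling of the two sounds in a pair. Both are simplifying assumptions, not claims from the study. Under them, P(same) = P(both /d/) + P(both /t/).

```python
# ASSUMPTION: idealized identification with a sharp /d/-/t/ boundary at 30 ms.
BOUNDARY_MS = 30


def p_d(vot_ms):
    """Probability of labeling a stimulus /d/ under the idealized boundary."""
    if vot_ms < BOUNDARY_MS:
        return 1.0
    if vot_ms > BOUNDARY_MS:
        return 0.0
    return 0.5  # exactly on the boundary: a coin flip


def p_same(vot_a, vot_b):
    """Probability that two independently labeled stimuli get the same label."""
    both_d = p_d(vot_a) * p_d(vot_b)
    both_t = (1 - p_d(vot_a)) * (1 - p_d(vot_b))
    return both_d + both_t


# Pairs two steps (20 ms) apart on the 0-60 ms continuum.
for a in range(0, 41, 10):
    b = a + 20
    print(f"{a}/{b} ms: {100 * p_same(a, b):.0f}% 'same'")
```

With these assumptions the loop gives 100% "same" for 0/20 and 40/60, 50% for 10/30 and 30/50, and 0% "same" (that is, 100% "different") for 20/40, matching the pattern listed on the slide.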
10 What about bilinguals?
- VOT boundaries vary between languages
- Perception studies show compromise effects
- Canadian French-English bilinguals
- (Caramazza, Yeni-Komshian, Zurif, & Carbone, 1973)
- Spanish-English bilinguals
- (Williams, 1977, 1980)
- Bilinguals seem to have developed a single perceptual system!
11 Coarticulation Effects
- Phonemes are influenced by the sounds around them!
- Take naturally recorded speech
- Remove the vowel
- Guess the vowel
- Example: "see" /si/, with the vowel removed
- Play 150 ms of /s/
- Listeners can identify the removed vowel (for most vowels)
12 How is speech perceived under less than ideal conditions?
- Top-down: semantic context, syntactic structure
- Bottom-up: acoustic information
- Both feed into UNDERSTANDING
13 A demonstration
- The McGurk Effect
- We use visual AND auditory cues to determine what segments we're hearing!
14 Top-Down Processing (using semantic and syntactic information to decode individual words in fluent speech)
- Diagram labels: language, speech recognition, talk, "recognize speech"
- Bottom-Up Processing (using acoustic information to encode the speech signal)
15 Phoneme Restoration Effect (Warren, 1970)
- Replaced sounds with a cough
- Word presented in a sentence
- "The bill was sent to the legi_lature."
- Where does the cough occur?
- Participants thought the whole word was present. The /s/ was mentally restored!
- "It was found that the _eel was on the orange." (heard as "peel")
- "It was found that the _eel was on the shoe." (heard as "heel")
16 Semantic Influences (Garnes and Bond, 1976)
- 16 tokens, spanning the spectrum of bait-date-gate
- 3 carrier sentences
- "Here's the fishing gear and the ______."
- "Check the time and the ______."
- "Paint the fence and the ______."
- If unambiguous, listeners report the word they heard, even in semantically implausible sentences ("Paint the fence and the bait.")
- If ambiguous (near a phoneme boundary), semantic context affects which word is heard
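One simple way to picture this result is to multiply the acoustic evidence for each word by a context-based expectation and then normalize. This is a Bayesian-style sketch, not the analysis used by Garnes and Bond, and every number below is made up for illustration. When the acoustics strongly favor one word, context cannot override them; when the token is near a boundary, context decides.

```python
def combine(acoustic, context):
    """Multiply acoustic evidence by contextual expectation and normalize."""
    scores = {word: acoustic[word] * context[word] for word in acoustic}
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}


def best(posterior):
    """Return the word with the highest combined score."""
    return max(posterior, key=posterior.get)


# HYPOTHETICAL expectations after "Paint the fence and the ___."
paint_context = {"bait": 0.1, "date": 0.1, "gate": 0.8}

# HYPOTHETICAL acoustic evidence for two tokens on the bait-date-gate continuum.
clear_bait = {"bait": 0.98, "date": 0.01, "gate": 0.01}
ambiguous = {"bait": 0.02, "date": 0.49, "gate": 0.49}  # near the date/gate boundary

print(best(combine(clear_bait, paint_context)))  # acoustics dominate
print(best(combine(ambiguous, paint_context)))   # context tips the balance
```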
17 Slurred Speech
- Syntactic and semantic cues help!
- Words (with noise) are perceived more accurately in sentences than in isolation
- Pollack & Pickett (1964) recorded conversations and excised individual words. They presented the words to listeners for identification, and only half the excised words were correctly recognized.
18 Rules of Rapid Speech: "hanmethethimbook" (hand me the thin book)
- Often can drop the las consonan
- Consonants in clusters may be modified to have the same blace of articulation/voicing
- thimbook, thingcarpet, Istambul
- NOT thingbook, thim slice
- Almost all vowels can be shortened
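The place-assimilation pattern above can be sketched as a rewrite rule: a word-final "n" takes on the place of articulation of the following consonant. The sketch works on ordinary spelling rather than phonetic transcription, and the letter classes are rough assumptions (for example, treating "c" as the velar /k/ is right for "carpet" but wrong for "cell").

```python
# ASSUMPTION: spelling stands in for pronunciation; letter classes are approximate.
BILABIAL = set("bpm")   # lips: final n -> m
VELAR = set("kgc")      # back of tongue: final n -> ng ("c" as in "carpet")


def assimilate(phrase):
    """Assimilate each word-final 'n' to the place of the next word's first sound."""
    words = phrase.lower().split()
    out = []
    for word, nxt in zip(words, words[1:] + [""]):
        if word.endswith("n") and nxt:
            if nxt[0] in BILABIAL:
                word = word[:-1] + "m"
            elif nxt[0] in VELAR:
                word = word[:-1] + "ng"
        out.append(word)
    return " ".join(out)


for phrase in ("thin book", "thin carpet", "thin slice"):
    print(f"{phrase} -> {assimilate(phrase)}")
```

The rule changes "thin book" to "thim book" and "thin carpet" to "thing carpet", but leaves "thin slice" alone, because "s" is neither bilabial nor velar.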
19 Listening for Mispronunciation (Cole, 1973; Cole, Jakimik, & Cooper, 1978; Cole & Jakimik, 1980)
- 20-minute story. Press a button whenever you hear a mispronunciation.
- Voicing errors are noticed most often on stops
- 70% for stops ("boot" to "poot")
- 64% for affricates ("chance" to "jance")
- 38% for fricatives ("fin" to "vin")
- Almost all place changes are noticed (80-90%)
- ("take" to "pake")
- no higher percentage if voicing also changed ("take" to "gake")
- More errors are noticed at the beginnings of words
- 72% for word-initial segments ("dish" to "tish")
- 33% for word-final segments ("split" to "splid")
- Conclusion: we DO use bottom-up information!
25 Mad Gab
- Ale All Heap Hop
- (A lollipop)
26 Mad Gab
- Butcher Ed Stew Gather
- (Put your heads together)
27 Mad Gab
- Lease Hummer Reap Wrest Lee
- (Lisa Marie Presley)
28 Mad Gab
- Bill Spare Reed Oh-boy!
- (Pillsbury Dough Boy)
29 Models of Speech Perception
- Motor Theory of Speech Perception
- Speech signals interpreted by reference to motor speech movements
- Cohort Model
- TRACE Model
30 Models of Speech Perception
- Motor Theory of Speech Perception
- Cohort Model
- 1) The acoustic information at the beginning of a word activates a cohort of possible words
- 2) Syntax and semantics influence the selection of the target word from the cohort
- TRACE Model
31 Cohort size
- Standard dictionary
- after 50 ms, 115 nouns share the same sounds
- after 100 ms, 43 nouns
- after 200 ms, 11 nouns
- after 300 ms, 5 nouns
- (Average word length, depending on speech rate, for one-, two-, and three-syllable words is between 550 and 830 ms)
- Word recognition occurs before the isolation point! (only one word possible)
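The way a cohort shrinks as more of the word arrives can be sketched with prefix matching over a toy lexicon. The word list is made up for illustration, and letters stand in for the incoming sounds (an assumption; the model itself operates on acoustic input over time, and this sketch leaves out the syntactic and semantic influences).

```python
# HYPOTHETICAL toy lexicon; letters approximate the incoming sounds.
LEXICON = ["trombone", "tromp", "trot", "tree", "train", "bone"]


def cohort(heard_so_far, lexicon=LEXICON):
    """Return every word still consistent with the input heard so far."""
    return [word for word in lexicon if word.startswith(heard_so_far)]


target = "trombone"
for i in range(1, len(target) + 1):
    candidates = cohort(target[:i])
    print(f"{target[:i]!r}: {len(candidates)} candidate(s) {candidates}")
    if len(candidates) == 1:
        print(f"only one word remains after {i} sound(s)")
        break
```

In this toy lexicon the cohort for "trombone" shrinks from 5 candidates to 1 by the fifth letter, well before the end of the word.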
32 Models of Speech Perception
- Motor Theory of Speech Perception
- Cohort Model
- TRACE Model: Neural Network
- (Elman and McClelland 1984, 1986)
- processing occurs through excitatory and inhibitory connections between processing units called nodes
- 3 levels of nodes (features, phonemes, and words), all highly interconnected
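The idea of excitatory and inhibitory connections between nodes can be illustrated with a drastically reduced sketch. It is not the TRACE implementation: it has only two levels (phonemes and words), no feature level, and no feedback from words to phonemes. The two-word lexicon, the informal phoneme labels, and all weights are hypothetical values chosen for illustration.

```python
# HYPOTHETICAL two-word network; all weights and rates are illustrative.
WORDS = {"bait": ["b", "ai", "t"], "date": ["d", "ai", "t"]}
EXCITE = 0.1    # phoneme -> word excitatory weight
INHIBIT = 0.2   # word -> word inhibitory weight (lateral inhibition)
DECAY = 0.1     # fraction of activation lost on each step


def step(word_act, phoneme_act):
    """One update: each word gains from its phonemes and loses to rival words."""
    new_act = {}
    for word, phonemes in WORDS.items():
        excitation = EXCITE * sum(phoneme_act.get(p, 0.0) for p in phonemes)
        inhibition = INHIBIT * sum(a for w, a in word_act.items() if w != word)
        act = word_act[word] * (1 - DECAY) + excitation - inhibition
        new_act[word] = min(1.0, max(0.0, act))  # keep activation in [0, 1]
    return new_act


# Phoneme-level input: a clear /b/, a weak /d/, then clear /ai/ and /t/.
phoneme_act = {"b": 1.0, "d": 0.2, "ai": 1.0, "t": 1.0}
word_act = {word: 0.0 for word in WORDS}
for _ in range(20):
    word_act = step(word_act, phoneme_act)
print(word_act)
```

Both words are activated at first because they share two phonemes, but the better-supported word ends up suppressing its rival through inhibition.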
33 Evidence for the TRACE model (or other interactive models)
- We activate all possible words from the phonology regardless of semantic fit
- "He swam across to the far side of the river and scrambled up the bank before running off" primes "bank" as financial institution!
- parts of words cause priming
- "trombone" primes for "rib" just as well as "bone" does
- word boundaries don't interfere with phonological retrieval
- "nudist" is primed by the phrase "new distance"
- BUT we eliminate all the irrelevant words within a few syllables
34 For more information
- b-d-g continuum
- http://www.phonetik.uni-muenchen.de/Lehre/Skripten/Haskins/Haskins/MISC/PP/bdg/bdgau.html
- Resources on phonetics and phonology
- http://faculty.washington.edu/dillon/PhonResources/
- Why we need prosody and lexical access
- http://emsah.uq.edu.au/linguistics/book/flant.htm