Transcript and Presenter's Notes

Title: Speech Comprehension


1
Speech Comprehension
2
A few words on acoustics
  • Given a source, how it is heard is a function of
    the resonant cavities through which it is
    filtered
  • The shape of a cavity in which a sound occurs
    determines several measurable properties of that
    sound
  • This is easy to see when you have a deformable
    sound cavity, such as a wind instrument
  • The sound that comes out is dominated by the frequencies that resonate
    most strongly, i.e., those whose sound waves stay in sync as they
    reflect within the chamber
  • This is a function of the length and shape of the resonating chamber:
    simple if the chamber is a simple tube, complex if the chamber reflects
    sound in complex ways
  • Speaking is a highly controlled deformation of
    the resonating chamber which is our vocal tract
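
As a worked aside (a standard acoustics idealization, not from the slides):
if we model the vocal tract as a uniform tube of length $L$, closed at the
glottis and open at the lips, its resonances are the odd quarter-wavelength
modes

\[ F_n = \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots \]

With the speed of sound $c \approx 350$ m/s and $L \approx 17.5$ cm, this
puts resonances near 500, 1500, and 2500 Hz, roughly the formants of a
neutral vowel; deforming the tract away from a uniform tube is exactly what
shifts these resonances around.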

3
Structure of sound
  • Most natural sounds are not one pure resonating
    frequency, but multiple resonating frequencies
    stacked up on each other
  • Those frequencies can divide up the sound spectrum more or less cleanly
  • 'Hissy' noises (like fricatives) send out waves at many frequencies at
    the same time, resulting in a complex spectrum of resonance
  • In a spectrogram we can see fricatives as smears of high-frequency
    bands interspersed with the more clearly multiple-frequency (and
    low-frequency) bands of vowels
  • Note that the 'sh' sound is characterized by
    slightly lower frequencies
  • Clean sounds (like vowels) send out controlled bands of frequencies at
    different ranges, resulting in a cleaner spectrum of resonance
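
A minimal sketch of this contrast (assuming numpy, scipy, and matplotlib
are available; the "vowel" and "fricative" below are toy stand-ins for real
speech, not recordings):

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000                      # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)  # half a second of samples

# Toy "vowel": a few steady sinusoids standing in for resonant formant bands
vowel = sum(np.sin(2 * np.pi * f * t) for f in (500, 1500, 2500))

# Toy "fricative": white noise, i.e. energy smeared across many frequencies
rng = np.random.default_rng(0)
fricative = rng.standard_normal(t.size)

# Concatenate the two and compute a spectrogram (short-time spectra over time)
x = np.concatenate([fricative, vowel])
freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=512)

plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-12), shading='gouraud')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Left: broadband smear (noise); right: clean bands ("vowel")')
plt.show()
```

The noise half shows energy at many frequencies at once; the sinusoid half
shows a few clean horizontal bands, the pattern the slide describes.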

4
Formants
  • When we deform our mouths, we are manipulating
    which frequencies will resonate
  • The ones that resonate are called formants (see pg. 121); they appear
    as visible bands in a spectrogram
  • Before, we said: speaking is a highly controlled deformation of the
    resonating chamber which is our vocal tract
  • We can equivalently say: speaking is a method for manipulating the
    resonance of formants.

5
[Spectrogram of the sentence "We were away a year ago"]
Image from http://www.umanitoba.ca/faculties/arts/linguistics/russell/138/sec4/specgram.htm
6
What's in a phoneme?
  • As soon as we were able to electronically manipulate the signal, it was
    found that the speech signal could be greatly simplified: much of the
    information carried is not necessary
  • Why is it a good thing to have (why might natural
    selection have favoured) unnecessary information
    in a signal system?
  • The question of interest is: what are the components of the speech
    signal that carry necessary/sufficient information?

7
What's in a phoneme?
  • The first and second formants are sufficient for
    comprehensible speech
  • In fact, subjects can get some discriminating information from only the
    first formant: low-frequency formants were associated with low, back
    vowels (o, u) and higher frequencies with high, front vowels (i, e)

8
What's in a phoneme?
  • We use sound (the formants we extract) to deduce
    information about how the vocal tract was
    positioned when that sound was produced
  • F1 largely reflects tongue body height, which (as we saw previously)
    changes with different vowels
  • F2 reflects whether the tongue body is more
    front or more back
  • The difference between F1 and F2 is an even better indicator of
    frontness/backness
  • In this way the sound encodes information about
    the state of the system that produced it
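
A toy sketch of how (F1, F2) values could be mapped back to vowels by
nearest-prototype matching (the prototype numbers are rough textbook-style
averages, for illustration only; real identification uses many more cues,
as later slides note):

```python
# Rough (F1, F2) prototypes in Hz; illustrative textbook-style averages only
PROTOTYPES = {
    'i': (270, 2290),   # high front vowel: low F1, high F2
    'e': (530, 1840),   # mid front
    'a': (730, 1090),   # low back: high F1
    'o': (570, 840),    # mid back
    'u': (300, 870),    # high back: low F1, low F2
}

def classify_vowel(f1, f2):
    """Return the prototype vowel nearest to the measured (F1, F2) pair."""
    return min(PROTOTYPES,
               key=lambda v: (PROTOTYPES[v][0] - f1) ** 2
                           + (PROTOTYPES[v][1] - f2) ** 2)

print(classify_vowel(280, 2200))  # -> 'i' (low F1 + high F2 = high front)
print(classify_vowel(700, 1100))  # -> 'a' (high F1 = low tongue body)
```

Using F2 - F1 rather than raw F2 as the backness dimension, as the slide
suggests, would make such a scheme more robust across speakers.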

9
A complication
  • Vowel sounds are dependent on the consonants that
    flank them
  • We make different sounds by changing the shape of our mouth, and our
    mouth has to change in different ways to get to a particular vowel
    sound from one starting position than from another
  • In other words: the very process of getting into position to make a
    sound involves manipulating exactly those elements which are
    manipulated to change the sound

10
What's in a vowel?
  • If you make CVC words and then chop out the V, playing the excised
    vowel on its own, people make many mistakes in guessing what that
    sound is supposed to be
  • They are much better at guessing what the vowel was if you give them
    only the flanking consonants
  • They were as good with these 'silent center' stimuli (the V taken out)
    as they were with the original word!
  • V recognition is worse if you discard temporal information, so that
    subjects hear only a small, constant-length portion of the missing
    vowel
  • This suggests that temporal information (how long a vowel lasts) is one
    of the clues used in vowel identification.

11
Consonants too.
  • The same is true for consonants
  • If you take a stop consonant off the front of a vowel (the 'b' in
    'ba'), it is utterly impossible to recognize what the consonant was (it
    sounds like a beep or chirp); there was never a 'b' on its own, only a
    'b' merging rapidly into an 'a'
  • Both the stop consonant and a chunk of the
    formant transition into the next vowel are
    necessary for comprehension

12
Coarticulation
  • A phoneme that merges with its adjacent neighbour is called an encoded
    phoneme
  • We can also say the two phonemes are
    coarticulated
  • Since an encoded phoneme is a single indistinguishable sound which
    encodes two phonemes (the encoded one and its neighbour), we say there
    is parallel transmission

13
Information compression
  • Coarticulation is a feature, not a bug
  • The informational compression it offers is one
    way we get up to the informational transfer rates
    that I mentioned last time, of 25-30 phonemes per
    second
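
A back-of-the-envelope check (our arithmetic, not from the slides):
transmitted strictly one at a time, that rate would leave each phoneme only

\[ \frac{1\ \text{s}}{25 \text{ to } 30} \approx 33\text{-}40\ \text{ms} \]

of signal apiece, which is very little acoustic evidence per phoneme;
coarticulation avoids the squeeze by letting the cues for neighbouring
phonemes overlap in time rather than queue up serially.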

14
More information, please
  • In normal sentence-level decoding of the phonetic
    stream, we have higher-level informational cues
  • Early work showed that words masked with noise are better recognized in
    sentences than in isolation
  • A classic experiment from the 1970s showed that
    people are amazingly smooth at using these cues
    to restore missing phonemic segments
  • Parts of a sentence were chopped out (in mid-word) and replaced with
    the sound of someone coughing
  • Subjects reported that they didn't hear the cough cover any part of the
    speech signal at all; they claimed to have heard the entire word, with
    the cough in the background

15
Our favourite theme
  • Yet another linguistic phenomenon (phoneme
    identification) that superficially appears to be
    a single function is in fact a complex function
    that uses many independent and redundant cues
  • Formant transitions from C to V and V to C
  • Individual formants of V
  • Durational information
  • Amount of energy in the burst (the release of pressure after a stop)
  • Onset frequency of the formant
  • Sentence- and word-level information

16
The McGurk Effect
17
Models of speech perception
  • i.) Motor theory of speech perception (Liberman)
  • ii.) Analysis by synthesis (Stevens)
  • iii.) Fuzzy logic model (Massaro)
  • iv.) Cohort model (Marslen-Wilson)
  • v.) TRACE model (Elman & McClelland)

18
i.) Motor Theory Of Speech Perception (Liberman)
  • Main idea: we interpret speech input by tying it to the motor
    articulation required to produce it
  • Pros
  • Provides a nice evolutionary story: phonetic comprehension built on a
    more 'primitive' (evolutionarily older) level of sound production.
  • Ties into 'hardware'
  • Explains McGurk effect
  • Explains how we deal with coarticulation so
    easily
  • Explains how we deal with invariance
  • Explains categorical perception
  • i.e., we use motor information to constrain possible sounds, and motor
    invariance to counter acoustic variance

19
i.) Motor Theory Of Speech Perception
  • Cons
  • Animals also show categorical perception but can't produce phonemes
  • Humans with deformed mouths can comprehend speech
  • We can comprehend sounds we cannot make
  • Says nothing about semantic and pragmatic
    constraints

20
ii.) Analysis by synthesis (Stevens)
  • Main idea: we synthesize speech from phonetic features; we have 'rules'
    for synthesizing, which can be absolute when the signal is clear, and
    less absolute (more dependent on contextual cues) when there is known
    ambiguity
  • The synthesized version is compared with the heard version, not at the
    level of motor articulations

21
ii.) Analysis by synthesis
  • Pros
  • Tries to capture the fact that not all phonemes
    are created equal
  • ambiguous sounds must be more carefully analyzed (because they are
    subject to a greater variety of constraints) than unambiguous sounds
  • early phonemes have greater weight than later
    phonemes
  • The idea that rules across phonetic features
    underlie comprehension means that the problem
    will be tractable
  • Since we have a good handle on what those
    features are, there is hope we could specify how
    they combine

22
ii.) Analysis by synthesis
  • Cons
  • Can't explain the McGurk effect, since everything is acoustically
    specified
  • Pretty vague without the rules actually being specified
  • Very abstract: hard to falsify or confirm experimentally, because it
    makes claims about what is happening internally that cannot be tested
    easily
  • Says nothing about semantic and pragmatic
    constraints

23
iii.) Fuzzy Logic Model (Massaro)
  • Main idea: speech perception is a special case of pattern recognition
    (analysis by features)
  • There are four steps
  • i.) Feature identification/extraction: identify the relevant features
  • ii.) Feature evaluation: match those features to prototypes in memory,
    i.e. generate a list of partial matches with feature sets that contain
    some of the identified features
  • iii.) Feature integration: rank-order the candidates according to the
    degree that they match
  • iv.) Feature decision: make a goodness-of-match decision and return the
    best candidate
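
A minimal sketch of steps iii and iv under simple assumptions (two
candidates, two features; the fuzzy match values standing in for the output
of steps i-ii are invented for illustration; the multiply-then-normalize
rule follows Massaro's relative goodness of match):

```python
# Output of steps i-ii: fuzzy values in [0, 1] saying how well each
# extracted feature matches each candidate's prototype (numbers invented).
evaluated = {
    'ba': {'lip_closure': 0.9, 'voicing_onset': 0.8},
    'da': {'lip_closure': 0.2, 'voicing_onset': 0.8},
}

def flmp_decision(evaluated):
    # iii.) Integration: multiply the per-feature support for each candidate
    support = {}
    for cand, feats in evaluated.items():
        s = 1.0
        for value in feats.values():
            s *= value
        support[cand] = s
    # iv.) Decision: normalize to a relative goodness of match, pick the best
    total = sum(support.values())
    goodness = {cand: s / total for cand, s in support.items()}
    return max(goodness, key=goodness.get), goodness

best, goodness = flmp_decision(evaluated)
print(best, goodness)  # -> 'ba' {'ba': ~0.82, 'da': ~0.18}
```

Because the match values are continuous, the output is a graded degree of
certainty rather than an all-or-none classification, which is exactly the
"fuzzy" part of the model.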

24
iii.) Fuzzy Logic Model (Massaro)
  • Pros
  • Takes speech recognition out of the special-case category and into the
    category of general pattern recognition, thereby tying it to work in
    other subfields (including other areas of language) and into a general
    theory
  • This could also be a con, since speech
    recognition does seem to be special
  • Stresses continuous (quantitative) rather than discontinuous
    (qualitative) information, so a match can be more-or-less good,
    more-or-less certain

25
iii.) Fuzzy Logic Model (Massaro)
  • Cons
  • Very abstract: hard to falsify or confirm experimentally, because it
    makes claims about what is happening internally that cannot be tested
    easily
  • Says nothing about semantic and pragmatic
    constraints (but perhaps it could?)

26
iv.) Cohort Model (Marslen-Wilson)
  • Basic idea: a spreading-activation model
  • Stage 1: Initial Access
  • Access cohort: bottom-up, based on the first 150-200 ms of the word
  • Stage 2: Selection
  • Elimination of candidates that fail for reasons other than phonology:
    we can weed out candidates using semantic/pragmatic and syntactic
    constraints, as well as later-stage phonology
  • Stage 3: Integration of semantic and syntactic information
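
A minimal sketch of Stages 1 and 2 (toy lexicon; the onset string stands in
for the first 150-200 ms of signal, and `fits_context` is a hypothetical
predicate standing in for the semantic/pragmatic/syntactic constraints):

```python
LEXICON = ['captain', 'captive', 'capture', 'caption', 'cattle', 'trombone']

def initial_cohort(onset):
    """Stage 1 (Access): activate every word consistent with the
    word-initial input, approximated here as a shared spelling onset."""
    return [w for w in LEXICON if w.startswith(onset)]

def select(cohort, fits_context):
    """Stage 2 (Selection): weed out candidates that fail semantic,
    pragmatic, or syntactic constraints, modelled as a yes/no predicate."""
    return [w for w in cohort if fits_context(w)]

cohort = initial_cohort('capt')
print(cohort)  # -> ['captain', 'captive', 'capture', 'caption']
print(select(cohort, lambda w: w.endswith('ive')))  # toy constraint -> ['captive']
```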

27
iv.) Cohort Model (Marslen-Wilson)
  • Pros
  • Does take into account semantics and pragmatics
  • Well supported by a variety of experimental evidence: frequency
    effects, neighbourhood effects, the word/nonword reaction-time effect

28
iv.) Cohort Model (Marslen-Wilson)
  • Cons
  • Says nothing about mechanisms
  • Says nothing about word segmentation
  • The model assumes listeners pick out the words,
    but we have seen that word boundaries are not
    usually specified in the speech stream
  • Not incompatible with other models, since it
    takes phonemic activation (the selection of the
    initial cohort) for granted (maybe this is a
    pro?)

29
v.) TRACE Model (Elman & McClelland)
  • Basic idea: a parallel distributed processing (PDP) model; the degree
    of activation/inhibition of units at each of three levels (phonetic
    feature, phoneme, word) is determined by the resting activation level
    of word units
  • Each word unit gets input directly from the constant sequence of
    phonemes, all weighted as equally valuable
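
A sketch of the interactive-activation idea in miniature (the two-unit
levels, weights, and decay constant are all invented; real TRACE has many
units per level, time-specific copies of each unit, and within-level
inhibition):

```python
import numpy as np

# Toy network: 2 feature units -> 2 phoneme units -> 2 word units.
# Each unit excites its matching unit one level up; the small negative
# cross-weights crudely stand in for competition between alternatives.
W_feat_to_phon = np.array([[0.6, -0.1],
                           [-0.1, 0.6]])
W_phon_to_word = np.array([[0.6, -0.1],
                           [-0.1, 0.6]])

def step(act, net_input, decay=0.1):
    """One update: activation rises with net input, decays toward rest (0)."""
    return np.clip(act + net_input - decay * act, 0.0, 1.0)

features = np.array([1.0, 0.0])   # the input favours feature 0
phonemes = np.zeros(2)
words = np.zeros(2)

for _ in range(10):               # let activation spread and settle
    phonemes = step(phonemes, W_feat_to_phon @ features)
    words = step(words, W_phon_to_word @ phonemes)

print(phonemes, words)            # unit 0 dominates at both levels
```

Because the decision emerges from graded activation rather than a hard
match, partially degraded input still drives the right units, which is the
"goodness of fit" point on the next slide.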

30
v.) TRACE Model
  • Pros
  • Decision based on overall goodness of fit, so
    degraded input is not problematic
  • Consistent in principle with cohort activation
    models (and so well-supported by experimental
    evidence)
  • Does take into account semantics and pragmatics: activation of
    overlapping lexical levels is explicable

31
v.) TRACE Model
  • Cons
  • Treats all features as equal, which we know they
    are not
  • Says nothing about mechanisms
  • 'Cheats' by building in phonemic activation (the selection of the
    initial cohort) by direct activation of those features
  • One big part of the puzzle (how do we specify
    and recognize these features?) is thereby glossed
    over
  • Highly over-simplified, at the level of both language and neurology

32
  • Are these models incompatible?
  • Can they be synthesized into a meta-model?
  • How could we test for which parts of each were
    best?