Transcript and Presenter's Notes

Title: Class 3


1
Class 3
  • Hearing, Speech Perception and Discussion of
    Possible Project

2
How do we achieve communication?
The brains of the speaker and the listener
participate actively in the process. In addition,
we must consider the cultural and other
associated cues involved. When the listener
responds, the same process takes place, but with
the roles of speaker and listener reversed.
While we usually think of written and spoken human
language, this concept applies to any means of
communication for which there is an agreed-upon
mechanism, particularly when common or analogous
experiences shared by the speaker(s) and the
listener(s) allow a shared frame of reference
(e.g., sign language).
3
Human Ear Anatomy
Outer ear (from the eardrum out), middle ear
(impedance match), and inner ear (from the
cochlea in).
It is in the cochlea where the transduction from
acoustic signal to nerve impulse takes place.
The ear canal is about 2.7 cm long and roughly
0.45 cm2 in cross section; acting as a tube closed
at the eardrum, it contributes a resonance (pole)
near 3 kHz.
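As a quick check of the 3 kHz figure, here is a minimal sketch that treats the ear canal as a quarter-wave resonator closed at the eardrum; the 343 m/s speed of sound is an assumption not stated on the slide.

```python
# Minimal sketch: the ear canal as a tube closed at the eardrum end.
# The 2.7 cm length is from the slide; the speed of sound is an assumed 343 m/s.
c = 343.0                    # speed of sound in air (m/s), assumed
L = 0.027                    # ear canal length (m), from the slide

f_resonance = c / (4 * L)    # quarter-wavelength resonance of a closed tube
print(f"Ear canal resonance = {f_resonance:.0f} Hz")   # about 3.2 kHz, i.e. the ~3 kHz pole
```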
4
Auditory psychophysics
  • Psychophysics deals with the physical and
    psychological aspects of sound perception. If we
    do not restrict the intensity of the sound, the
    ear can perceive from 16 Hz to 18 kHz; however, it
    is in the 1-5 kHz range that perception is
    possible over a broad range of sound energy
    levels.
  • Just-noticeable differences (JND, or difference
    limen), defined as differences recognized
    correctly 75% of the time, give an indication of
    the discriminating power of the ear and of the
    number of quantization levels required for
    successful recognition by humans (see the sketch
    below).
  • Perceived loudness is measured in phons which, at
    around 1 kHz, are numerically equivalent to dB but
    vary with frequency because of the nonlinearity of
    the perception spectrum (see Fig. 4.10 in the
    text).
  • There are about 1600 distinguishable frequencies
    and 350 distinguishable intensities, but only
    about 300,000 tones of different
    frequency-intensity can be discriminated in
    pairwise comparisons over the acoustic field
    between the thresholds of hearing and pain. The
    duration of the tones also comes into the picture.
  • (Note: hearing has greater power of resolution
    than vision, and this has been used to advantage
    in sonification applications.)
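The discrimination figures above can be read as rough bit budgets, which is what matters when choosing quantization levels. A minimal sketch using only the numbers quoted on this slide:

```python
import math

# Rough bit budgets implied by the slide's figures: ~1600 distinguishable
# frequencies, ~350 distinguishable intensities, ~300,000 discriminable tones.
freq_levels = 1600
intensity_levels = 350
tone_pairs = 300_000

print(f"frequency resolution  ~ {math.log2(freq_levels):.1f} bits")      # ~10.6 bits
print(f"intensity resolution  ~ {math.log2(intensity_levels):.1f} bits") # ~8.5 bits
print(f"joint tone resolution ~ {math.log2(tone_pairs):.1f} bits")       # ~18.2 bits
```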

5
The boundaries of perception
6
Pitch Perception
  • Pitch: the perceptual mapping of F0 = 1/T, where
    T is the period of the glottal pulse (a simple F0
    estimation sketch follows below).
  • There are two types of pitch which may be
    perceived:
  • the virtual (normal) pitch, which is perceived
    through its harmonics, and
  • the spectral pitch, which is determined by the
    relative energy levels in the spectrum of the
    signal.
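Since pitch maps F0 = 1/T, a recognizer can estimate F0 from the periodicity of the waveform. Below is a minimal autocorrelation-based sketch; the function name, sampling rate, and 50-500 Hz search range are illustrative assumptions, not material from the text.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=50.0, fmax=500.0):
    """Minimal autocorrelation-based F0 estimate for one voiced frame (a sketch,
    not a robust pitch tracker: no voicing decision, no smoothing)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)        # plausible pitch-period lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Usage on a synthetic 120 Hz periodic stand-in for a glottal pulse train
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sign(np.sin(2 * np.pi * 120 * t)) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(f"estimated F0 = {estimate_f0(frame, fs):.1f} Hz")   # expected near 120 Hz
```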

7
Masking
  • When two sounds of the same or similar
    frequencies are perceived, they can mask each
    other (usually lower frequencies mask higher
    ones; if they are close in frequency they may
    mask each other) in a manner different from the
    sum of their individual perceptions. Masking is
    the major nonlinear aspect of perception that
    limits the consideration of speech signals as the
    sum of their tone and band-limited noise
    components. The separation of the frequencies of
    the tones makes a difference in the amplitude
    required for their individual perception. Masking
    may also occur non-simultaneously, and with
    stimuli more complex than simple tones.

8
Critical Bands
A critical band is visualized as a bandpass
filter whose spectrum approximates the tuning
curves of auditory neurons. According to the
text, their individual bandwidths correspond to
the 1.5 mm spacings of the 1200 primary nerve
fibers in the cochlea (hence about 24 filters or
fewer, having bandwidths increasing with
frequency, with peaks initially at uniformly
spaced frequencies and later logarithmically
spaced). (Graph from the Auditory Toolbox at
http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/ )
It is possible to use a filter bank for spectral
analysis, i.e., a set of overlapping band-pass
filters, each analyzing a portion of the input
speech spectrum. This is considered more flexible
than the DFT, as a set of 8-12 band-pass filters
yields a compact and efficient representation.
Their bandwidth increases linearly up to 1 kHz and
logarithmically thereafter (see the filter-bank
sketch below).
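As an illustration of such a filter bank, the sketch below builds overlapping triangular band-pass filters whose center frequencies are spaced linearly up to 1 kHz and logarithmically above it; the number of filters, edge frequencies, and FFT size are illustrative assumptions.

```python
import numpy as np

def critical_band_centers(n_filters=24, f_low=100.0, f_break=1000.0, f_high=8000.0):
    """Center frequencies spaced linearly up to ~1 kHz and logarithmically above,
    mimicking the critical-band layout described on the slide (a sketch)."""
    n_lin = n_filters // 3
    lin = np.linspace(f_low, f_break, n_lin, endpoint=False)
    log = np.geomspace(f_break, f_high, n_filters - n_lin)
    return np.concatenate([lin, log])

def triangular_filterbank(centers, n_fft=512, fs=16000):
    """Overlapping triangular band-pass filters on the DFT grid, one row per band."""
    freqs = np.linspace(0, fs / 2, n_fft // 2 + 1)
    edges = np.concatenate([[centers[0] * 0.8], centers, [centers[-1] * 1.2]])
    bank = np.zeros((len(centers), len(freqs)))
    for i in range(len(centers)):
        left, mid, right = edges[i], edges[i + 1], edges[i + 2]
        bank[i] = np.clip(np.minimum((freqs - left) / (mid - left),
                                     (right - freqs) / (right - mid)), 0, None)
    return bank

# Applying the bank to a power spectrum yields ~24 band energies per frame.
bank = triangular_filterbank(critical_band_centers())
print(bank.shape)   # (24, 257)
```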
9
Bark scale and Mel scale from text
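For reference, the sketch below uses common closed-form approximations for the mel and Bark scales; the figure in the text may use slightly different constants.

```python
import numpy as np

def hz_to_mel(f):
    """Widely used mel-scale approximation: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_bark(f):
    """Zwicker-style Bark approximation: 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

for f in (500, 1000, 2000, 4000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):6.1f} mel, {hz_to_bark(f):5.2f} Bark")
```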
10
Speech hearing review
  • Given that speech recognition systems mimic the
    auditory process, understanding the hearing
    process is important
  • It is even more important in the synthesis of
    speech if it is to sound natural
  • In speech coding we can eliminate redundancies if
    we understand the information-carrying narrow
    bandwidth and the masking properties of speech
  • While the neuron firings are relatively well
    understood, their interpretation into linguistic
    messages by the brain is far from being well
    understood

11
Speech perception by humans
  • We consider, as part of this study:
  • Perceptually relevant aspects
  • Models of perception
  • Vowel and consonant perception
  • Perception of intonation
  • Along the way we consider the help of redundant
    cues in recognition but tend to ignore them in
    synthesis

12
Visual speech
  • Among the most notable redundant aspects of
    speech are the cues that a speaker provides
    through facial and lip movements while speaking.
    While not capable of totally substituting for
    speech, they supplement its perception,
    particularly in the presence of noise. The likely
    sequences of clustered movements corresponding to
    acoustic phonemes are called visemes (there are
    fewer visemes than phonemes). Almost all the
    vowels and diphthongs formed their own viseme
    groups, while consonant phones merged into viseme
    groups with other consonant phones. It seems that
    visual and acoustic speech are complementary in
    their contribution to recognition. Combining
    acoustic and visual speech recognition would
    require additional computational power but would
    significantly improve recognition.

13
CLUSTERS
Most of the vowels are visually distinguishable.
However, several consonants that are not
distinguishable visually are distinguishable
acoustically.
(Figure: viseme clusters, one of which is silence.)
14
Results of our visual speech research (Goldschen)
  • It was possible to recognize 25% of a large set
    of spoken sentences (150) using only visual
    information from the oral cavity region, without
    the use of any context, grammar or acoustic
    information
  • Only 13 features were used, mostly dynamic (first
    and second derivatives; see the sketch below)
  • Visual information could complement acoustic
    recognition (particularly in a noisy environment)
    because of their complementary nature
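A sketch of how such dynamic (first- and second-derivative) features can be computed from a per-frame feature track; the regression width and array shapes are illustrative assumptions, not the features used in the study.

```python
import numpy as np

def delta(features, width=2):
    """Regression-style time derivative of a feature track, similar to the
    'delta' features used in visual/acoustic speech recognition.
    `features` has shape (num_frames, num_features)."""
    num = np.zeros_like(features, dtype=float)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    for k in range(1, width + 1):
        num += k * (padded[width + k: width + k + len(features)] -
                    padded[width - k: width - k + len(features)])
    return num / denom

# Example: static features plus their first and second time derivatives
static = np.random.randn(100, 13)          # e.g., 13 per-frame measurements
d1 = delta(static)                         # first derivative (velocity)
d2 = delta(d1)                             # second derivative (acceleration)
dynamic = np.hstack([static, d1, d2])      # 39-dimensional dynamic feature vector
print(dynamic.shape)                       # (100, 39)
```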

15
The McGurk effect
  • An interesting phenomenon occurs when humans
    perceive conflicting acoustic and optical
    information. A recording of a speaker saying one
    syllable is played to a listener while a
    synchronized video of the speaker saying a
    different syllable is shown. Often the listener
    perceives a third syllable that is neither of the
    two presented. This is called the McGurk effect,
    and it shows the reliance on optical information
    in the perception of spoken speech.

16
Important features of speech perception
  • Spectral information is the single most important
    factor in the characterization of speech
    utterances (and is therefore preferred over
    time-domain information). Dynamic changes in
    frequency components are also important and can
    be detected by machines as finite differences
    between consecutive frames.
  • A significant indication of the importance of
    spectral information is that the range of
    frequencies produced by the speaker coincides
    with the range that can be perceived by the
    listener. Voicing is perceived through the
    harmonics of F0.
  • Areas of larger energy in the signal (formants)
    are also significant cues that characterize
    speech. Dynamic changes in energy are also
    useful in automatic speech recognition.
  • The categorization of the manner of articulation
    (at low frequencies) and of the place of
    articulation (particularly its influence on F2)
    are important cues in human speech recognition
    which, along with power, seem to be retained more
    easily in short-term memory.
  • When speech is subject to noise, a) the manner of
    articulation, b) voicing and c) place of
    articulation are the most influential in the
    robustness of human speech recognition, in that
    order.

17
Experimentation and redundancy in speech
  • Because of the difficulty of producing speech
    samples in a consistent manner, experimentation
    is often done with synthetic or recorded speech.
    The former has the advantage of reproducing the
    same speech but lacks the naturalness of the
    latter. Also, there are significant differences
    between read and conversational speech.
  • Speech is highly redundant, which aids
    recognition in the presence of noise. Evidence of
    redundancy within the speech signal is that it
    remains mostly recognizable even with amplitude
    clipping, or with the frequencies either above or
    below 1.8 kHz filtered out. Furthermore, the
    context of the sentence and the grammar of the
    language provide redundant guidance.

18
Models of speech in human perception
  • Categorical models are mostly based on the
    pairwise comparison of sounds (possibly involving
    short-term categorical memory)
  • Feature models seek to find feature detectors
    in the hearing system, but even formants do not
    seem to be as relevant in perception as they are
    in speech production; although formants correlate
    with neural firings, it is their dynamics
    (movement) that are directly related to
    perception
  • Active models assume an interplay between the
    mechanisms of remembered production and
    perception
  • Passive models do not concern themselves with
    the production of speech but go directly from
    features (not how they occur) to phonetic
    categories. There are some analogies with machine
    speech recognition.

19
Perception of vowels
  • Vowels, being voiced, are mostly perceived
    through their F1 and F2 (which are good for
    classifying the vowels), with some help from F3.

The so-called vowel triangle shows the
relationship between F1 and F2 (notice that the
scales of the two axes are different), with the
front vowels and the back vowels at opposite ends
(see the plotting sketch below).
Perception of vowels and consonants in CV, CVC
and VC combinations is significantly affected by
changes in the formant structure because of
coarticulatory and dynamic effects.
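A small plotting sketch of the vowel triangle; the F1/F2 values are rough textbook-style averages chosen for illustration, not data from the course text.

```python
import matplotlib.pyplot as plt

# Illustrative average formant values (Hz) for a few American English vowels.
vowels = {
    "/i/ (beet)": (270, 2290),    # front, high
    "/ae/ (bat)": (660, 1720),    # front, low
    "/a/ (father)": (730, 1090),  # back, low
    "/u/ (boot)": (300, 870),     # back, high
}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1))
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.invert_xaxis()   # conventional layout: front vowels on the left
ax.invert_yaxis()   # high vowels at the top (note the different axis scales)
ax.set_title("Vowel triangle (illustrative values)")
plt.show()
```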
20
Handling duration in recognizers
  • Duration and related speaking-rate cues are
    important in perception, but duration of phonemes
    is even more important in speech recognizers.
    Variations in duration are a cause of mistakes
    for a recognizer; the two most frequent methods
    to handle them are the self-loops in HMMs and
    DTW, which optimally aligns a word against a
    prototype regardless of overall duration (see the
    DTW sketch below). This is particularly important
    for speakers who speak fast in conversational
    speech recognition. Variations in speaking rate
    across different speakers, and even within the
    same speaker, are a major problem in recognition.
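A minimal sketch of DTW, showing how a template-based recognizer can absorb differences in overall duration; the toy 1-D feature tracks stand in for real spectral features, and real systems add path constraints.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-time-warping distance between two feature sequences
    (shape: frames x dims). A minimal sketch without slope constraints."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],         # stretch x
                                 cost[i, j - 1],         # stretch y
                                 cost[i - 1, j - 1])     # match and advance both
    return cost[n, m]

# A slow and a fast rendition of the "same word" (toy 1-D feature tracks)
template = np.linspace(0, 1, 50).reshape(-1, 1)
fast_utterance = np.linspace(0, 1, 30).reshape(-1, 1)
print(f"DTW distance = {dtw_distance(template, fast_utterance):.3f}")  # small despite the length mismatch
```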

21
Perception of prosody
  • The issue of prosody (also stress, rhythm or
    intonation) is one that must be dealt with beyond
    the phonemic level: in syllables and mostly in
    words (lexical) and sentences (sentential). It is
    usually a cue to the relative importance of a
    certain portion of the speech. Even pauses are
    often cues to the importance of what follows.
    Intonation rises are sometimes indicative of a
    question. Prosody is also used to indicate the
    importance of a clause relative to another, an
    alternative name or title (vocative or
    appositive), and the end of a sequence of items.
    Also, prosody is particularly helpful to the
    listener in assessing the emotional state of the
    speaker, particularly through changes in F0 and
    acoustic energy. Polysyllabic words have a
    stressed syllable and sometimes a secondary one.
    Some languages emphasize the position of stress
    relative to the sequence of syllables in a word
    (French stresses the last syllable; English most
    often the first one). Unfortunately, our ability
    to simulate or handle prosody is limited.

22
Acoustic correlates for the perception of stress
  • We can mention four (syllabic or beyond)
    correlates, in order of importance: 1) pitch,
    which correlates monotonically with F0 and
    stress; 2) length, which correlates with
    duration; 3) loudness, which correlates with
    amplitude; and 4) precision of articulation,
    which correlates with voice timbre (the spectrum,
    particularly F1 and F2; timbre is what allows the
    comparison of different musical instruments
    playing the same note).
  • Of course, stress can be detected in sentences or
    parts of a sentence (clauses, words). English
    (and German) are stress-timed, which means that
    stressed syllables tend to occur at regular
    intervals. Stress often resolves syntactic
    ambiguity, as in "They fed her dog biscuits",
    which may be stressed on "her" or on "her dog".
    Notice the tracking of F0, the downward slope in
    Hz of F0 at the end, and the hat effect in the
    following augmented sentence.

23
Difference between pitch and fundamental frequency
  • After studying this chapter on perception it
    should not be surprising to find out that pitch
    refers to the perceived F0, which originates in
    the glottis. The F0 is transformed by the vocal
    tract (where F1-F4 are formed), has environmental
    noise and reflections added, is perceived by the
    ears (which also filter it), and finally reaches
    the hearing portion of the brain. Pitch is the
    auditory attribute of sound according to which
    sounds can be ordered on a scale from low to
    high. Changes in F0 (as happens with intonation)
    and in amplitude (particularly at low
    frequencies) may not be readily perceived by the
    ear because of just-noticeable differences (JND).
    Pitch and actual F0 may differ by up to 0.3% of
    the fundamental frequency.

24
Free space speech and localization (HRTF)
  • The place of origin of sounds and speech in free
    space can be localized due to the binaural nature
    of perception. Binaural perception also helps one
    ear to compensate for noise near the other (see
    the ITD sketch below). From Wikipedia: "The
    head-related transfer function (HRTF), also
    called the anatomical transfer function (ATF),
    describes how a given sound wave input
    (parameterized as frequency and source location)
    is filtered by the diffraction and reflection
    properties of the head, pinna, and torso, before
    the sound reaches the transduction machinery of
    the eardrum and inner ear (see auditory system).
    Biologically, the source-location-specific
    prefiltering effects of these external structures
    aid in the neural determination of source
    location, particularly the determination of
    source elevation." The HRTF is considered in the
    design of earphones that allow localization. The
    kind of media (recorded speech, telephone speech,
    etc.) and the kind of microphone used for
    training (head mounted, open, distance to mouth,
    etc.) are important factors to take into account
    for both training and recognition.
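One binaural cue behind this localization is the interaural time difference (ITD). The sketch below estimates it by cross-correlating the two ear signals; the sampling rate, delay, and noise-burst signal are toy assumptions, and real HRTF-based localization also uses level and spectral (pinna) cues.

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) by cross-correlation.
    Negative values mean the sound reached the left ear first."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)      # lag in samples
    return lag / fs

# Toy example: the same noise burst reaches the right ear ~0.3 ms later
fs = 48000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs // 10)
delay = int(0.0003 * fs)                          # ~0.3 ms, within the human ITD range
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])
print(f"estimated ITD = {estimate_itd(left, right, fs) * 1000:.2f} ms")  # about -0.29 ms (left leads)
```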

25
Effects of context
  • While the recognition of words in a sentence is
    helped by the overall context of the sentence and
    possibly of the conversation, when words are
    excised from a conversation only about 50% are
    recognized. In studies of lipreading it is
    accepted that the best lip-readers recognize up
    to 35% of contextless words; our automated
    lipreading experiment achieved 25% success.
    Context is decided at higher levels in the
    cognitive chain, which points to the importance
    of the brain function in speech.

26
Project Rationale and motivation for the TIMIT
database
  • There are two main approaches to speech
    recognition systems: speaker independent and
    speaker dependent. Both must be trained, either
    with one speaker or with many. Even
    speaker-dependent systems need to be initialized
    and tested with some approximation to the voice
    of the speaker, if possible, when designed. In
    choosing the training set it is important to have
    phonemes in the approximate proportion in which
    they appear in the speaker's language (balance)
    and with as many of the characteristics
    (shibboleth) of the speaker (regional accent, for
    example) as possible (a phoneme-counting sketch
    follows below). Also, it should possibly combine
    some read speech with some conversational speech.
    TIMIT offers all of that, with separate sets for
    speaker-independent training and testing.
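A sketch of how phoneme balance could be checked by counting labels in TIMIT's .PHN transcription files (each line is "start_sample end_sample phone"); the directory path and file-extension case are assumptions that depend on your copy of the corpus.

```python
from collections import Counter
from pathlib import Path

# Assumed location of the TIMIT training portion; adjust to your copy.
# Some distributions use lowercase "*.phn" extensions.
timit_train = Path("TIMIT/TRAIN")

counts = Counter()
for phn_file in timit_train.rglob("*.PHN"):
    for line in phn_file.read_text().splitlines():
        parts = line.split()
        if len(parts) == 3:
            counts[parts[2]] += 1          # the phone label

total = sum(counts.values())
for phone, n in counts.most_common(10):
    print(f"{phone:>5}: {n:6d}  ({100 * n / total:.1f}%)")
```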

27
Handout describing the contents of the TIMIT CD
and the function of the Linguistic Data Consortium
(LDC)
  • While the creation of the TIMIT database is
    rather old (early 1990s), it has remained a
    standard for training and testing in American
    English speech, as evidenced by its being the
    top-selling resource provided by the Linguistic
    Data Consortium
    (http://www.ldc.upenn.edu/Catalog/topten.jsp).
    While the document handed out is also old, it
    contains the original information. An updated
    document may be available from NIST.

28
Discussion of the course project using a portion
of TIMIT and MATLAB
  • I have ordered a copy of the TIMIT database CD
    and hope to have a homework assignment in which I
    copy a portion of the database for you and have
    you experiment with it. Unfortunately the CD has
    not arrived, and I have not yet been able to
    formulate a feasible project.
  • We will also use the MATLAB HMM capabilities to
    experiment with learning and recognition of some
    sequences (a sketch of the underlying Viterbi
    decoding appears below). You should start with
    the examples in MATLAB on how to use the HMM
    software in the package. A slide (slide 19)
    describing it briefly was presented in lesson 1.
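As a language-agnostic illustration of what the HMM recognition step computes, here is a minimal Viterbi sketch for a toy discrete-observation model; all model parameters below are made up for the example, not taken from MATLAB or the course material.

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely hidden-state path for a discrete-observation HMM.
    log_A: transition log-probs (N x N), log_B: emission log-probs (N x M),
    log_pi: initial log-probs (N)."""
    N, T = log_A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A           # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)               # best predecessor per state
        delta[t] = scores[psi[t], np.arange(N)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                        # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol model; the diagonal self-loops absorb duration variation.
A = np.array([[0.8, 0.2], [0.1, 0.9]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.9, 0.1])
obs = [0, 0, 1, 2, 2, 2]
print(viterbi(obs, np.log(A), np.log(B), np.log(pi)))    # [0, 0, 1, 1, 1, 1]
```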

29
Summary of the lesson
  • In a previous class we emphasized how speech is
    produced. In this class we covered a) the
    mechanism of hearing or audition and b) the
    mechanism by which the perceived speech is
    transformed into nerve firings or perception.
  • We also covered non-auditory visual perception of
    speech (lip-reading)
  • We studied the effects of duration, prosody,
    stress and context in recognition
  • We finally considered the project for the course
    using TIMIT and MATLAB

30
Next class and assignment
  • Our next class WILL NOT take place on September
    19 as usual. It will take place on September 17
    from 5:30 to 8:30 PM. IT WILL BE VIDEO-RECORDED,
    SO IF YOU CANNOT ATTEND YOU CAN WATCH IT ON THE
    UTA WEBSITE. If you can attend, I would
    appreciate it, since it is not fun to teach in an
    empty classroom.
  • Reading assignment for next class: read the
    material on the TIMIT database handout that
    describes the content of the TIMIT CD and the
    three related articles that follow it. There may
    be questions asked about this material.

31
Homework assignment to be turned in by or on the
September 26 class
  • What is the range of perceived frequencies that
    is needed to hear all phonemes?
  • Given the telephone bandwidth (300-3300 Hz) which
    phonemes would be most distorted in the telephone
    channel?
  • What isolated phonemes are most easily confused
    with /b/ ? Why?
  • Why is place of articulation less reliable than
    manner of articulation in perceiving natural
    speech? Explain.