Transcript and Presenter's Notes

Title: Class 3


1
Class 3
  • Hearing, Speech Perception and Discussion of
    Possible Project

2
How do we achieve communication?
The brains of the speaker and the listener
participate actively in the process. In addition,
we must consider the cultural and other
associated cues involved. When the listener
responds, the same process takes place, but with
the roles of speaker and listener reversed.
While we usually think of written and spoken human
language, this concept applies to any means of
communication for which there is an agreed-upon
mechanism, particularly when common or analogous
experiences shared by the speaker(s) and the
listener(s) allow a shared frame of reference
(e.g., sign language).
3
Human Ear Anatomy
Outer ear (from the eardrum out), middle ear
(impedance match), and inner ear (from the
cochlea in).
It is in the cochlea where the transduction from
acoustic signal to nerve impulse takes place.
The ear canal is about 2.7 cm long and roughly
0.45 cm2 in cross section; acting as a tube closed
at the eardrum, it contributes a resonance (pole)
near 3 kHz.
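As a quick check of the 3 kHz figure, here is a minimal sketch that treats the ear canal as a quarter-wave resonator closed at the eardrum; the 343 m/s speed of sound is an assumption not stated on the slide.

```python
# Minimal sketch: the ear canal as a tube closed at the eardrum end.
# The 2.7 cm length is from the slide; the speed of sound is an assumed 343 m/s.
c = 343.0                    # speed of sound in air (m/s), assumed
L = 0.027                    # ear canal length (m), from the slide

f_resonance = c / (4 * L)    # quarter-wavelength resonance of a closed tube
print(f"Ear canal resonance = {f_resonance:.0f} Hz")   # about 3.2 kHz, i.e. the ~3 kHz pole
```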
4
Auditory psychophysics
  • Psychophysics deals with the physical and
    psychological aspects of sound perception. If we
    do not restrict the intensity of the sound, the
    ear can perceive from 16 Hz to 18 kHz; however, it
    is in the 1-5 kHz range that perception is
    possible over a broad range of sound energy
    levels.
  • Just-noticeable differences (JND, or difference
    limen), defined as differences recognized
    correctly 75% of the time, give an indication of
    the discriminating power of the ear and of the
    number of quantization levels required for
    successful recognition by humans (see the sketch
    below).
  • Perceived loudness is measured in phons which, at
    around 1 kHz, are numerically equivalent to dB but
    vary with frequency because of the nonlinearity of
    the perception spectrum (see Fig. 4.10 in the
    text).
  • There are about 1600 distinguishable frequencies
    and 350 distinguishable intensities, but only
    about 300,000 tones of different
    frequency-intensity can be discriminated in
    pairwise comparisons over the acoustic field
    between the thresholds of hearing and pain. The
    duration of the tones also comes into the picture.
  • (Note: hearing has greater power of resolution
    than vision, and this has been used to advantage
    in sonification applications.)
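The discrimination figures above can be read as rough bit budgets, which is what matters when choosing quantization levels. A minimal sketch using only the numbers quoted on this slide:

```python
import math

# Rough bit budgets implied by the slide's figures: ~1600 distinguishable
# frequencies, ~350 distinguishable intensities, ~300,000 discriminable tones.
freq_levels = 1600
intensity_levels = 350
tone_pairs = 300_000

print(f"frequency resolution  ~ {math.log2(freq_levels):.1f} bits")      # ~10.6 bits
print(f"intensity resolution  ~ {math.log2(intensity_levels):.1f} bits") # ~8.5 bits
print(f"joint tone resolution ~ {math.log2(tone_pairs):.1f} bits")       # ~18.2 bits
```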

5
The boundaries of perception
6
Pitch Perception
  • Pitch: the perceptual mapping of F0 = 1/T, where
    T is the period of the glottal pulse (a simple F0
    estimation sketch follows below).
  • There are two types of pitch which may be
    perceived:
  • the virtual (normal) pitch, which is perceived
    through its harmonics, and
  • the spectral pitch, which is determined by the
    relative energy levels in the spectrum of the
    signal.
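Since pitch maps F0 = 1/T, a recognizer can estimate F0 from the periodicity of the waveform. Below is a minimal autocorrelation-based sketch; the function name, sampling rate, and 50-500 Hz search range are illustrative assumptions, not material from the text.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=50.0, fmax=500.0):
    """Minimal autocorrelation-based F0 estimate for one voiced frame (a sketch,
    not a robust pitch tracker: no voicing decision, no smoothing)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)        # plausible pitch-period lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Usage on a synthetic 120 Hz periodic stand-in for a glottal pulse train
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sign(np.sin(2 * np.pi * 120 * t)) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(f"estimated F0 = {estimate_f0(frame, fs):.1f} Hz")   # expected near 120 Hz
```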

7
Masking
  • When two sounds of the same or similar
    frequencies are perceived, they can mask each
    other (usually lower frequencies mask higher
    ones; if they are close in frequency they may
    mask each other) in a manner different from the
    sum of their individual perceptions. Masking is
    the major nonlinear aspect of perception that
    limits the consideration of speech signals as the
    sum of their tone and band-limited noise
    components. The separation of the frequencies of
    the tones makes a difference in the amplitude
    required for their individual perception. Masking
    may also occur non-simultaneously, and with
    stimuli more complex than simple tones.

8
Critical Bands
A critical band is visualized as a bandpass
filter whose spectrum approximates the tuning
curves of auditory neurons. According to the
text, their individual bandwidths correspond to
the 1.5 mm spacings of the 1200 primary nerve
fibers in the cochlea (hence about 24 filters or
fewer, having bandwidths increasing with
frequency, with peaks initially at uniformly
spaced frequencies and later logarithmically
spaced). (Graph from the Auditory Toolbox at
http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/ )
It is possible to use a filter bank for spectral
analysis, i.e., a set of overlapping band-pass
filters, each analyzing a portion of the input
speech spectrum. This is considered more flexible
than the DFT, as a set of 8-12 band-pass filters
yields a compact and efficient representation.
Their bandwidth increases linearly up to 1 kHz and
logarithmically thereafter (see the filter-bank
sketch below).
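As an illustration of such a filter bank, the sketch below builds overlapping triangular band-pass filters whose center frequencies are spaced linearly up to 1 kHz and logarithmically above it; the number of filters, edge frequencies, and FFT size are illustrative assumptions.

```python
import numpy as np

def critical_band_centers(n_filters=24, f_low=100.0, f_break=1000.0, f_high=8000.0):
    """Center frequencies spaced linearly up to ~1 kHz and logarithmically above,
    mimicking the critical-band layout described on the slide (a sketch)."""
    n_lin = n_filters // 3
    lin = np.linspace(f_low, f_break, n_lin, endpoint=False)
    log = np.geomspace(f_break, f_high, n_filters - n_lin)
    return np.concatenate([lin, log])

def triangular_filterbank(centers, n_fft=512, fs=16000):
    """Overlapping triangular band-pass filters on the DFT grid, one row per band."""
    freqs = np.linspace(0, fs / 2, n_fft // 2 + 1)
    edges = np.concatenate([[centers[0] * 0.8], centers, [centers[-1] * 1.2]])
    bank = np.zeros((len(centers), len(freqs)))
    for i in range(len(centers)):
        left, mid, right = edges[i], edges[i + 1], edges[i + 2]
        bank[i] = np.clip(np.minimum((freqs - left) / (mid - left),
                                     (right - freqs) / (right - mid)), 0, None)
    return bank

# Applying the bank to a power spectrum yields ~24 band energies per frame.
bank = triangular_filterbank(critical_band_centers())
print(bank.shape)   # (24, 257)
```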
9
Bark scale and Mel scale from text
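For reference, the sketch below uses common closed-form approximations for the mel and Bark scales; the figure in the text may use slightly different constants.

```python
import numpy as np

def hz_to_mel(f):
    """Widely used mel-scale approximation: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_bark(f):
    """Zwicker-style Bark approximation: 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

for f in (500, 1000, 2000, 4000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):6.1f} mel, {hz_to_bark(f):5.2f} Bark")
```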
10
Speech hearing review
  • Given that speech recognition systems mimic the
    auditory process, understanding the hearing
    process is important
  • It is even more important in the synthesis of
    speech if it is to sound natural
  • In speech coding we can eliminate redundancies if
    we understand the information-carrying narrow
    bandwidth and the masking properties of speech
  • While the neuron firings are relatively well
    understood, their interpretation into linguistic
    messages by the brain is far from being well
    understood

11
Speech perception by humans
  • We consider, as part of this study:
  • Perceptually relevant aspects
  • Models of perception
  • Vowel and consonant perception
  • Perception of intonation
  • Along the way we consider the help of redundant
    cues in recognition but tend to ignore them in
    synthesis

12
Visual speech
  • Among the most notable redundant aspects of
    speech are the cues that a speaker provides
    through facial and lip movements while speaking.
    While not capable of totally substituting for
    speech, they supplement its perception,
    particularly in the presence of noise. The likely
    sequences of clustered movements corresponding to
    acoustic phonemes are called visemes (there are
    fewer visemes than phonemes). Almost all the
    vowels and diphthongs formed their own viseme
    groups, while consonant phones merged into viseme
    groups with other consonant phones. It seems that
    visual and acoustic speech are complementary in
    their contribution to recognition. Combining
    acoustic and visual speech recognition would
    require additional computational power but would
    significantly improve recognition.

13
CLUSTERS
Most of the vowels are visually distinguishable.
However, several consonants that are not
distinguishable visually are distinguishable
acoustically.
(Figure: viseme clusters, one of which is silence.)
14
Results of our visual speech research (Goldschen)
  • It was possible to recognize 25% of a large set
    of spoken sentences (150) using only visual
    information from the oral cavity region, without
    the use of any context, grammar or acoustic
    information
  • Only 13 features were used, mostly dynamic (first
    and second derivatives; see the sketch below)
  • Visual information could complement acoustic
    recognition (particularly in a noisy environment)
    because of their complementary nature
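A sketch of how such dynamic (first- and second-derivative) features can be computed from a per-frame feature track; the regression width and array shapes are illustrative assumptions, not the features used in the study.

```python
import numpy as np

def delta(features, width=2):
    """Regression-style time derivative of a feature track, similar to the
    'delta' features used in visual/acoustic speech recognition.
    `features` has shape (num_frames, num_features)."""
    num = np.zeros_like(features, dtype=float)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    for k in range(1, width + 1):
        num += k * (padded[width + k: width + k + len(features)] -
                    padded[width - k: width - k + len(features)])
    return num / denom

# Example: static features plus their first and second time derivatives
static = np.random.randn(100, 13)          # e.g., 13 per-frame measurements
d1 = delta(static)                         # first derivative (velocity)
d2 = delta(d1)                             # second derivative (acceleration)
dynamic = np.hstack([static, d1, d2])      # 39-dimensional dynamic feature vector
print(dynamic.shape)                       # (100, 39)
```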

15
The McGurk effect
  • An interesting phenomenon occurs when humans
    perceive conflicting acoustic and optical
    information. A recording of a speaker saying one
    syllable is played to a listener while a
    synchronized video of the speaker saying a
    different syllable is shown. Often the listener
    perceives a third syllable that is neither of the
    two presented. This is called the McGurk effect,
    and it shows the reliance on optical information
    in the perception of spoken speech.

16
Important features of speech perception
  • Spectral information is the single most important
    factor in the characterization of speech
    utterances (and is therefore preferred over
    time-domain information). Dynamic changes in
    frequency components are also important and can
    be detected by machines as finite differences
    between consecutive frames.
  • A significant indication of the importance of
    spectral information is that the range of
    frequencies produced by the speaker coincides
    with the range that can be perceived by the
    listener. Voicing is perceived through the
    harmonics of F0.
  • Areas of larger energy in the signal (formants)
    are also significant cues that characterize
    speech. Dynamic changes in energy are also
    useful in automatic speech recognition.
  • The categorization of the manner of articulation
    (at low frequencies) and of the place of
    articulation (particularly its influence on F2)
    are important cues in human speech recognition
    which, along with power, seem to be retained more
    easily in short-term memory.
  • When speech is subject to noise, a) the manner of
    articulation, b) voicing and c) place of
    articulation are the most influential in the
    robustness of human speech recognition, in that
    order.

17
Experimentation and redundancy in speech
  • Because of the difficulty of producing speech
    samples in a consistent manner, experimentation
    is often done with synthetic or recorded speech.
    The former has the advantage of reproducing the
    same speech but lacks the naturalness of the
    latter. Also, there are significant differences
    between read and conversational speech.
  • Speech is highly redundant, which aids
    recognition in the presence of noise. Evidence of
    redundancy within the speech signal is that it
    remains mostly recognizable even with amplitude
    clipping, or with the frequencies either above or
    below 1.8 kHz filtered out. Furthermore, the
    context of the sentence and the grammar of the
    language provide redundant guidance.

18
Models of speech in human perception
  • Categorical models are mostly based on the
    pairwise comparison of sounds (possibly involving
    short-term categorical memory)
  • Feature models seek to find feature detectors
    in the hearing system, but even formants do not
    seem to be as relevant in perception as they are
    in speech production; although formants correlate
    with neural firings, it is their dynamics
    (movement) that are directly related to
    perception
  • Active models assume an interplay between the
    mechanisms of remembered production and
    perception
  • Passive models do not concern themselves with
    the production of speech but go directly from
    features (not how they occur) to phonetic
    categories. There are some analogies with machine
    speech recognition.

19
Perception of vowels
  • Vowels, being voiced, are mostly perceived
    through their F1 and F2 (which are good for
    classifying the vowels), with some help from F3.

The so-called vowel triangle shows the
relationship between F1 and F2 (notice that the
scales of the two axes are different), with the
front vowels and the back vowels at opposite ends
(see the plotting sketch below).
Perception of vowels and consonants in CV, CVC
and VC combinations is significantly affected by
changes in the formant structure because of
coarticulatory and dynamic effects.
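A small plotting sketch of the vowel triangle; the F1/F2 values are rough textbook-style averages chosen for illustration, not data from the course text.

```python
import matplotlib.pyplot as plt

# Illustrative average formant values (Hz) for a few American English vowels.
vowels = {
    "/i/ (beet)": (270, 2290),    # front, high
    "/ae/ (bat)": (660, 1720),    # front, low
    "/a/ (father)": (730, 1090),  # back, low
    "/u/ (boot)": (300, 870),     # back, high
}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)
    ax.annotate(label, (f2, f1))
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.invert_xaxis()   # conventional layout: front vowels on the left
ax.invert_yaxis()   # high vowels at the top (note the different axis scales)
ax.set_title("Vowel triangle (illustrative values)")
plt.show()
```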
20
Handling duration in recognizers
  • Duration and related speaking-rate cues are
    important in perception, but duration of phonemes
    is even more important in speech recognizers.
    Variations in duration are a cause of mistakes
    for a recognizer; the two most frequent methods
    to handle them are the self-loops in HMMs and
    DTW, which optimally aligns a word against a
    prototype regardless of overall duration (see the
    DTW sketch below). This is particularly important
    for speakers who speak fast in conversational
    speech recognition. Variations in speaking rate
    across different speakers, and even within the
    same speaker, are a major problem in recognition.
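A minimal sketch of DTW, showing how a template-based recognizer can absorb differences in overall duration; the toy 1-D feature tracks stand in for real spectral features, and real systems add path constraints.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-time-warping distance between two feature sequences
    (shape: frames x dims). A minimal sketch without slope constraints."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],         # stretch x
                                 cost[i, j - 1],         # stretch y
                                 cost[i - 1, j - 1])     # match and advance both
    return cost[n, m]

# A slow and a fast rendition of the "same word" (toy 1-D feature tracks)
template = np.linspace(0, 1, 50).reshape(-1, 1)
fast_utterance = np.linspace(0, 1, 30).reshape(-1, 1)
print(f"DTW distance = {dtw_distance(template, fast_utterance):.3f}")  # small despite the length mismatch
```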

21
Perception of prosody
  • The issue of prosody (also stress, rhythm or
    intonation) is one that must be dealt with beyond
    the phonemic level: in syllables and mostly in
    words (lexical) and sentences (sentential). It is
    usually a cue to the relative importance of a
    certain portion of the speech. Even pauses are
    often cues to the importance of what follows.
    Intonation rises are sometimes indicative of a
    question. Prosody is also used to indicate the
    importance of a clause relative to another, an
    alternative name or title (vocative or
    appositive), and the end of a sequence of items.
    Also, prosody is particularly helpful to the
    listener in assessing the emotional state of the
    speaker, particularly through changes in F0 and
    acoustic energy. Polysyllabic words have a
    stressed syllable and sometimes a secondary one.
    Some languages emphasize the position of stress
    relative to the sequence of syllables in a word
    (French stresses the last syllable; English most
    often the first one). Unfortunately, our ability
    to simulate or handle prosody is limited.

22
Acoustic correlates for the perception of stress
  • We can mention four (syllabic or beyond)
    correlates, in order of importance: 1) pitch,
    which correlates monotonically with F0 and
    stress; 2) length, which correlates with
    duration; 3) loudness, which correlates with
    amplitude; and 4) precision of articulation,
    which correlates with voice timbre (the spectrum,
    particularly F1 and F2; timbre is what allows the
    comparison of different musical instruments
    playing the same note).
  • Of course, stress can be detected in sentences or
    parts of a sentence (clauses, words). English
    (and German) are stress-timed, which means that
    stressed syllables tend to occur at regular
    intervals. Stress often resolves syntactic
    ambiguity, as in "They fed her dog biscuits",
    which may be stressed on "her" or on "her dog".
    Notice the tracking of F0, the downward slope in
    Hz of F0 at the end, and the hat effect in the
    following augmented sentence.

23
Difference between pitch and fundamental frequency
  • After studying this chapter on perception it
    should not be surprising to find out that pitch
    refers to the perceived F0, which originates in
    the glottis. The F0 is transformed by the vocal
    tract (where F1-F4 are formed), has environmental
    noise and reflections added, is perceived by the
    ears (which also filter it), and finally reaches
    the hearing portion of the brain. Pitch is the
    auditory attribute of sound according to which
    sounds can be ordered on a scale from low to
    high. Changes in F0 (as happens with intonation)
    and in amplitude (particularly at low
    frequencies) may not be readily perceived by the
    ear because of just-noticeable differences (JND).
    Pitch and actual F0 may differ by up to 0.3% of
    the fundamental frequency.

24
Free space speech and localization (HRTF)
  • The place of origin of sounds and speech in free
    space can be localized due to the binaural nature
    of perception. Binaural perception also helps one
    ear to compensate for noise near the other (see
    the ITD sketch below). From Wikipedia: "The
    head-related transfer function (HRTF), also
    called the anatomical transfer function (ATF),
    describes how a given sound wave input
    (parameterized as frequency and source location)
    is filtered by the diffraction and reflection
    properties of the head, pinna, and torso, before
    the sound reaches the transduction machinery of
    the eardrum and inner ear (see auditory system).
    Biologically, the source-location-specific
    prefiltering effects of these external structures
    aid in the neural determination of source
    location, particularly the determination of
    source elevation." The HRTF is considered in the
    design of earphones that allow localization. The
    kind of media (recorded speech, telephone speech,
    etc.) and the kind of microphone used for
    training (head mounted, open, distance to mouth,
    etc.) are important factors to take into account
    for both training and recognition.
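One binaural cue behind this localization is the interaural time difference (ITD). The sketch below estimates it by cross-correlating the two ear signals; the sampling rate, delay, and noise-burst signal are toy assumptions, and real HRTF-based localization also uses level and spectral (pinna) cues.

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate the interaural time difference (seconds) by cross-correlation.
    Negative values mean the sound reached the left ear first."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)      # lag in samples
    return lag / fs

# Toy example: the same noise burst reaches the right ear ~0.3 ms later
fs = 48000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs // 10)
delay = int(0.0003 * fs)                          # ~0.3 ms, within the human ITD range
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])
print(f"estimated ITD = {estimate_itd(left, right, fs) * 1000:.2f} ms")  # about -0.29 ms (left leads)
```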

25
Effects of context
  • While the recognition of words in a sentence is
    helped by the overall context of the sentence and
    possibly of the conversation, when words are
    excised from a conversation only about 50% are
    recognized. In studies of lipreading it is
    accepted that the best lip-readers recognize up
    to 35% of contextless words; our automated
    lipreading experiment achieved 25% success.
    Context is decided at higher levels in the
    cognitive chain, which points to the importance
    of the brain function in speech.

26
Project Rationale and motivation for the TIMIT
database
  • There are two main approaches to speech
    recognition systems: speaker independent and
    speaker dependent. Both must be trained, either
    with one speaker or with many. Even
    speaker-dependent systems need to be initialized
    and tested with some approximation to the voice
    of the speaker, if possible, when designed. In
    choosing the training set it is important to have
    phonemes in the approximate proportion in which
    they appear in the speaker's language (balance)
    and with as many of the characteristics
    (shibboleth) of the speaker (regional accent, for
    example) as possible (a phoneme-counting sketch
    follows below). Also, it should possibly combine
    some read speech with some conversational speech.
    TIMIT offers all of that, with separate sets for
    speaker-independent training and testing.
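A sketch of how phoneme balance could be checked by counting labels in TIMIT's .PHN transcription files (each line is "start_sample end_sample phone"); the directory path and file-extension case are assumptions that depend on your copy of the corpus.

```python
from collections import Counter
from pathlib import Path

# Assumed location of the TIMIT training portion; adjust to your copy.
# Some distributions use lowercase "*.phn" extensions.
timit_train = Path("TIMIT/TRAIN")

counts = Counter()
for phn_file in timit_train.rglob("*.PHN"):
    for line in phn_file.read_text().splitlines():
        parts = line.split()
        if len(parts) == 3:
            counts[parts[2]] += 1          # the phone label

total = sum(counts.values())
for phone, n in counts.most_common(10):
    print(f"{phone:>5}: {n:6d}  ({100 * n / total:.1f}%)")
```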

27
Handout describing the contents of the TIMIT CD
and the function of the Linguistic Data Consortium
(LDC)
  • While the creation of the TIMIT database is
    rather old (early 1990s), it has remained a
    standard for training and testing in American
    English speech, as evidenced by its being the
    top-selling resource provided by the Linguistic
    Data Consortium
    (http://www.ldc.upenn.edu/Catalog/topten.jsp).
    While the document handed out is also old, it
    contains the original information. An updated
    document may be available from NIST.

28
Discussion of the course project using a portion
of TIMIT and MATLAB
  • I have ordered a copy of the TIMIT database CD
    and hope to have a homework assignment in which I
    copy a portion of the database for you and have
    you experiment with it. Unfortunately the CD has
    not arrived, and I have not yet been able to
    formulate a feasible project.
  • We will also use the MATLAB HMM capabilities to
    experiment with learning and recognition of some
    sequences (a sketch of the underlying Viterbi
    decoding appears below). You should start with
    the examples in MATLAB on how to use the HMM
    software in the package. A slide (slide 19)
    describing it briefly was presented in lesson 1.
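As a language-agnostic illustration of what the HMM recognition step computes, here is a minimal Viterbi sketch for a toy discrete-observation model; all model parameters below are made up for the example, not taken from MATLAB or the course material.

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely hidden-state path for a discrete-observation HMM.
    log_A: transition log-probs (N x N), log_B: emission log-probs (N x M),
    log_pi: initial log-probs (N)."""
    N, T = log_A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A           # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)               # best predecessor per state
        delta[t] = scores[psi[t], np.arange(N)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                        # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol model; the diagonal self-loops absorb duration variation.
A = np.array([[0.8, 0.2], [0.1, 0.9]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.9, 0.1])
obs = [0, 0, 1, 2, 2, 2]
print(viterbi(obs, np.log(A), np.log(B), np.log(pi)))    # [0, 0, 1, 1, 1, 1]
```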

29
Summary of the lesson
  • In a previous class we emphasized how speech is
    produced. In this class we covered a) the
    mechanism of hearing or audition and b) the
    mechanism by which the perceived speech is
    transformed into nerve firings or perception.
  • We also covered non-auditory visual perception of
    speech (lip-reading)
  • We studied the effects of duration, prosody,
    stress and context in recognition
  • We finally considered the project for the course
    using TIMIT and MATLAB

30
Next class and assignment
  • Our next class WILL NOT take place on September
    19 as usual. It will take place on September 17
    from 5:30 to 8:30 PM. IT WILL BE VIDEO-RECORDED,
    SO IF YOU CANNOT ATTEND YOU CAN WATCH IT ON THE
    UTA WEBSITE. If you can attend, I would
    appreciate it, since it is not fun to teach in an
    empty classroom.
  • Reading assignment for next class: read the
    material on the TIMIT database handout that
    describes the content of the TIMIT CD and the
    three related articles that follow it. There may
    be questions asked about this material.

31
Homework assignment to be turned in by or on the
September 26 class
  • What is the range of perceived frequencies that
    is needed to hear all phonemes?
  • Given the telephone bandwidth (300-3300 Hz) which
    phonemes would be most distorted in the telephone
    channel?
  • What isolated phonemes are most easily confused
    with /b/ ? Why?
  • Why is place of articulation less reliable than
    manner of articulation in perceiving natural
    speech? Explain.