Title: Speaker Normalization in Speech Perception
1 Speaker Normalization in Speech Perception
In D. Pisoni & R. Remez (eds.) (2005), The Handbook of Speech Perception. Oxford: Blackwell. Pages 363-389.
2 Speaker Normalization
- Phonologically identical utterances show acoustic variation
- Yet listeners recognize utterances despite variation
- Definition: normalization is the identification of acoustically distinct utterances as the same (abstract) linguistic object
3 k?t
- Notice the difference in formant frequencies
4 Sound
- Produced when air pressure fluctuations hit the eardrum
- Air is a medium
- Fluctuations are waves through the medium
- Simple periodic waves = sinusoidal waves
- Periodic?
- The wave repeats itself at regular intervals
- Cycle: a full repetition
5 Sine Wave
6 Frequency
- The number of cycles per unit time
- Hertz (Hz): 1 cycle/second
- 1000 cycles/sec = 1 kHz
- 1 kHz tone
7 Complex Periodic Wave
- A periodic wave composed of multiple simple periodic waves
Figure: wave composed of a 100 Hz and a 1000 Hz sine wave
8 Happy Property
- A complex periodic wave can be decomposed into its component simple periodic waves
- Technique is called Fourier analysis
- Fundamental frequency, F0
- GCD of the frequencies of the component sine waves
- F0 of example = 100 Hz
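The GCD definition of F0 can be checked with a few lines of Python. This is only a sketch: it assumes the component frequencies are whole numbers of Hz, and the second set of components is a made-up illustration, not data from the slides.

```python
from functools import reduce
from math import gcd

def fundamental_frequency(component_hz):
    """Return F0 as the greatest common divisor of integer component frequencies (Hz)."""
    return reduce(gcd, component_hz)

# The example wave from the slides: 100 Hz and 1000 Hz components.
print(fundamental_frequency([100, 1000]))

# Hypothetical components: F0 need not itself be one of the components.
print(fundamental_frequency([200, 300, 500]))
```

Both calls give 100 Hz. In the second case no component sits at 100 Hz, yet the combined wave still repeats 100 times per second.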
9 Output of Fourier Analysis: Power Spectrum
- Power spectrum: plot of amplitude against frequency
- Amplitude: height of a wave
- Suppose
- Ampl. of 100 Hz wave in example = 0.5
- Ampl. of 1000 Hz wave in example = 1
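[Figure: power spectrum of the example wave; amplitude axis from 0 to 1, frequency axis marked at 0, 100, 500 and 1000 Hz]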
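The spectrum of the example wave can be recovered numerically. The sketch below uses a naive discrete Fourier transform from the standard library rather than any particular phonetics tool. The sampling rate and window length are assumptions, chosen so that the window holds exactly one 100 Hz cycle and the frequency bins fall on multiples of 100 Hz.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
N = 80              # 10 ms window = exactly one cycle of the 100 Hz component

def example_wave(t):
    """The slides' example: a 100 Hz sine (amplitude 0.5) plus a 1000 Hz sine (amplitude 1)."""
    return 0.5 * math.sin(2 * math.pi * 100 * t) + 1.0 * math.sin(2 * math.pi * 1000 * t)

def amplitude_spectrum(samples, sample_rate):
    """Map each bin frequency (Hz) to the amplitude of the sine wave found there."""
    n = len(samples)
    spectrum = {}
    for k in range(n // 2):
        coeff = sum(samples[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        # The factor 2/n turns the DFT magnitude into a sine amplitude (valid for k > 0).
        spectrum[k * sample_rate / n] = 2 * abs(coeff) / n
    return spectrum

samples = [example_wave(i / SAMPLE_RATE) for i in range(N)]
spectrum = amplitude_spectrum(samples, SAMPLE_RATE)

# Keep only the bins that carry noticeable energy.
peaks = {freq: round(amp, 3) for freq, amp in spectrum.items() if amp > 0.01}
print(peaks)
```

Only the 100 Hz and 1000 Hz bins survive the threshold, with amplitudes 0.5 and 1, matching the two bars in the plot.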
10 Spectrogram
- Power spectrum is a snapshot of a periodic wave
- We want a movie
- Adopt a shading scheme
- Amplitude in top 10%: dark
- Amplitude in next 10%: lighter, etc.
- Every few ms, sample the wave and plot the spectrum
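The recipe above (take a spectrum every few milliseconds, then shade by amplitude) can be sketched as a text-mode spectrogram. Everything here is an illustrative assumption: the toy signal, the 10 ms non-overlapping frames, and the ten-character shading ramp. Real tools such as Praat use overlapping, windowed frames.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_LEN = 80      # 10 ms per frame, so bins are 100 Hz apart

def frame_amplitudes(frame):
    """Amplitude of each frequency bin in one frame (naive DFT)."""
    n = len(frame)
    return [
        2 * abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))) / n
        for k in range(n // 2)
    ]

def spectrogram(samples, frame_len=FRAME_LEN):
    """One amplitude spectrum per consecutive, non-overlapping frame."""
    return [frame_amplitudes(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_len)]

def shade(amplitude, peak):
    """Pick a character: '@' for the top 10% of the range, lighter below."""
    levels = " .:-=+*#%@"  # ten shading levels, lightest to darkest
    if peak <= 0:
        return levels[0]
    return levels[min(int(10 * amplitude / peak), 9)]

def tone(hz, count):
    """A unit-amplitude sine tone of `count` samples."""
    return [math.sin(2 * math.pi * hz * n / SAMPLE_RATE) for n in range(count)]

# Toy signal: 20 ms of a 100 Hz tone, then 20 ms of a 1000 Hz tone.
samples = tone(100, 160) + tone(1000, 160)
frames = spectrogram(samples)
peak = max(max(frame) for frame in frames)

# Time runs left to right, frequency bottom to top (0 to 1200 Hz shown).
for k in range(12, -1, -1):
    row = "".join(shade(frame[k], peak) for frame in frames)
    print(f"{k * SAMPLE_RATE // FRAME_LEN:5d} Hz |{row}|")
```

The dark cells sit on the 100 Hz row for the first two frames and jump to the 1000 Hz row for the last two, which is the "movie" a single power spectrum cannot show.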
11 Formants
- Component waves in a voice spectrum are called harmonics
- Formants are, roughly, the groups of harmonics that the vocal tract reinforces
- Of course there are lots of other complicated details
- Sampling rate
- Filtering
- Length of the vocal tract
- Filtering causes adjacent harmonics to smear together: the dark bands in a spectrogram
12 Spectrogram of ka
13 Praat Interlude
- Praat, a downloadable phonetic analysis program
- www.praat.org
- Authors
- Paul Boersma
- David Weenink
14 Importance of Formants/Fundamental Frequency in Perception
- Doubling the fundamental frequency of vocal fold vibration (F0) produces a vowel category shift for most English vowels
- Synthesizing a vowel composed of children's high F0 with male formants produced lower perception accuracy
- Smaller effects have been reported after altering the higher formants
15 Formant Ratio
- Vowels are relative patterns, not absolute formant frequencies
- Imagine seeing a man 100 meters away
- Now imagine seeing the same man 50 meters away
- It is the same man. The visual system accommodates the growth in size
- The ratio of formants is what is perceived
- Explains why altering F0 caused perceptual errors
16 Sussman's (1986) Formant Ratio Model
- 1. Auditory nerve fibers are activated by a sound wave
- 2. Several combination-sensitive neurons each combine information from two formants (f1-f2, f1-f3, f2-f3)
- 3. Passed to other neurons that compute
- $x$ = geometric mean of formants
- $x = (f_1 f_2 f_3)^{1/3}$
- $\ln(f_n / x)$, $n = 1, 2, 3$
- 4. Output is mapped to specific vowels
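Step 3 is easy to try out. The sketch below follows the geometric-mean formula as given on this slide. The formant values are hypothetical round numbers, and the function name is invented for illustration.

```python
import math

def log_formant_ratios(f1, f2, f3):
    """Return ln(fn / x) for n = 1, 2, 3, where x is the geometric mean of the formants."""
    x = (f1 * f2 * f3) ** (1 / 3)
    return tuple(math.log(f / x) for f in (f1, f2, f3))

# Hypothetical talker A, and talker B whose formants are all 20% higher.
talker_a = log_formant_ratios(500, 1500, 2500)
talker_b = log_formant_ratios(600, 1800, 3000)

print([round(r, 3) for r in talker_a])
print([round(r, 3) for r in talker_b])
```

The two talkers produce the same three numbers, because multiplying every formant by a constant multiplies the geometric mean by the same constant. This is the sense in which the model encodes a relative pattern rather than absolute frequencies. It removes only such uniform differences.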
17 The Model
18 Other Researchers: Other Ratios
- Basic ratio idea
- Auditory system computes a ratio based on the formants of the wave that vibrates the basilar membrane
- This ratio is perceived as one of the vowels
- F0 is absent in some of the models, maybe reflecting the fact that whispered vowels can be identified (though not well)
- All proposals are similar
- All do some arithmetic, A, with the formants
- All compute log(A) or Bark(A)
- Small problems in Johnson
- Doesn't distinguish between log and ln
- On p. 367 says Sussman uses a geometric mean
- But on p. 368 gives Sussman's formula with the arithmetic mean
The vowel chart plots f1 against f2, using the non-linear Bark scale. The non-linearity reflects the fact that the auditory system resolves frequency differences more finely at low frequencies.
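To make the non-linearity concrete, here is a small Hz-to-Bark conversion. The slides do not say which Bark formula the chart uses. The one below is Traunmüller's (1990) approximation, chosen here as an assumption because it is compact.

```python
def hz_to_bark(f_hz):
    """Approximate Bark value of a frequency in Hz (Traunmueller 1990 formula, assumed)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# The same 100 Hz step, taken low and high in the spectrum.
low_step = hz_to_bark(400) - hz_to_bark(300)
high_step = hz_to_bark(3100) - hz_to_bark(3000)

print(f"300 -> 400 Hz:   {low_step:.2f} Bark")
print(f"3000 -> 3100 Hz: {high_step:.2f} Bark")
```

The low-frequency step covers several times more of the Bark scale than the high-frequency step, which is why a Bark-scaled vowel chart spreads out the F1 region.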
19 Beyond Formants
- Formant ratio theories fail to account for all aspects of speaker normalization
- Vowels also differ in secondary cues
- Duration
- Formant trajectories (shift in formant frequency over the duration of the vowel)
20 Relevant Studies
- Lehiste & Miller (1973)
- Subjects score slightly better than chance in identifying fixed-duration vowels synthesized with steady-state formant frequencies
- Nearey & Assmann (1986)
- Extracted small pieces from the beginning and end of vowels
- Identification was higher when the pieces were played in the order extracted (rather than alone or in reverse)
- Conclusion: formant ratios are not points but trajectories through normalized space
- Consistent with exemplar models
21 There's More Than Vowels
- Fricatives produced by men and women have different spectral shapes
- May (1976)
- Spliced a fricative continuum from s to ? onto a following vowel ? produced by a male or female voice
- Subjects perceived the s-? boundary at a higher frequency for women
- Implies that listeners normalized the fricative based on the contextual information provided by the vowel.
22 Statistical Methods
- Goal of all normalization algorithms
- Reduce within-vowel-category scatter
- Reduce between-category overlap
- No algorithm has been shown to work better than standardization to speaker-specific z-scores ((score - mean)/standard deviation)
- Conclusion: maybe listeners do some kind of statistical processing (if not the cognitively implausible z-score)
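Speaker-specific z-scoring takes only a few lines. The sketch below applies the (score - mean)/standard deviation formula to one formant for one speaker at a time. The two speakers and their F1 values are invented for illustration.

```python
from statistics import mean, stdev

def z_normalize(values):
    """Standardize one speaker's measurements: (score - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical F1 values (Hz) for the same three vowels from two speakers.
speaker_a_f1 = [300, 500, 700]
speaker_b_f1 = [360, 600, 840]  # every value 20% higher than speaker A's

print(z_normalize(speaker_a_f1))
print(z_normalize(speaker_b_f1))
```

After standardization the two speakers' vowels coincide, so the within-category scatter caused by the speaker difference disappears. Note that the method needs each speaker's mean and standard deviation in advance, which is part of why the slides call it cognitively implausible.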
23 Context
- One simple normalization technique: shift the spectrum of female speakers down 1 Bark (Bladon, 1984)
- Ladefoged & Broadbent found in 1957
- Vowels are identified differently based on the formant values of the preceding phrase
- Suggests that listeners
- Use a cognitive frame of reference
- An auditory representation of the speaker
24 Shifting Frame of Reference 1
- Recognition accuracy in noise decreases when the speaker is unpredictable
- Vowel identification is more accurate in single-talker, rather than mixed-talker, word lists
- Kato & Kakehi (1988) showed that listeners adapted monotonically (70% - 76%) to five stimuli presented by the same speaker.
25 Shifting Frame of Reference 2
- When told that an androgynous voice along a h?d - h?d continuum was male or female, subjects produced different category boundaries
- Conclusion: listeners perceive speech according to an internal representation of the person talking
26 Leads To: Theories of Vocal Tract Normalization
- Formant ratio theories
- Normalization is a function of the auditory system encoding vowels, holistically, from a collection of spectral cues
- Vocal tract theories
- Normalization refers to the perceived length of the vocal tract for an individual speaker
- Computation of formants is functionally dependent on the idealized length of the vocal tract ($f_1 = c/4L$, where $c$ is the speed of sound and $L$ is the length of the vocal tract).
- Essentially, a theory of context
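The formula $f_1 = c/4L$ treats the vocal tract as a uniform tube closed at one end. The sketch below evaluates it, and also the higher resonances $(2n - 1)c/4L$ of the same idealized tube, which the slide does not list. The speed of sound (35,000 cm/s) and the two tract lengths are round-number assumptions.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed value for warm, moist air

def tube_resonances(length_cm, count=3):
    """First `count` resonances (Hz) of a uniform tube closed at one end.

    The lowest resonance is c / 4L, as on the slide. Resonance n is (2n - 1) * c / 4L.
    """
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4 * length_cm)
            for n in range(1, count + 1)]

# Hypothetical vocal tract lengths for a longer and a shorter tract.
print(tube_resonances(17.5))
print(tube_resonances(14.0))
```

A 17.5 cm tube resonates at 500, 1500 and 2500 Hz. Shortening it to 14 cm raises every resonance by the same factor of 1.25, which is the dependence on length that vocal tract theories build on.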
27 Adjusting for Context 1
- Uniform scaling (Lindblom, Fant)
- Took F3 to be a proxy for vocal tract length
- Computed a constant as the ratio of a speaker's vocal tract length to a reference vocal tract length
- Scaled incoming sounds by this ratio
- Reduces speaker differences
- Big problem: male/female formant differences are not uniform, but it's a step in the right direction
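Uniform scaling can be sketched as follows. Because the resonances of an idealized tube are inversely proportional to its length, the ratio of the speaker's tract length to the reference length can be estimated as reference F3 divided by speaker F3. That estimate, the function name, and the formant values are assumptions made for illustration. The slides do not give the original authors' exact procedure.

```python
def uniform_scale(formants_hz, speaker_f3, reference_f3):
    """Scale a speaker's formants by one constant derived from F3.

    reference_f3 / speaker_f3 stands in for (speaker tract length / reference tract length).
    """
    k = reference_f3 / speaker_f3
    return [f * k for f in formants_hz]

# Hypothetical speaker with a shorter vocal tract than the reference talker.
speaker_formants = [625.0, 1875.0, 3125.0]
normalized = uniform_scale(speaker_formants, speaker_f3=3125.0, reference_f3=2500.0)
print(normalized)
```

Here a single factor (0.8) maps the speaker's formants onto the reference talker's 500, 1500 and 2500 Hz. Real male/female differences vary by formant and by vowel, so one constant cannot remove them all, which is the "big problem" noted above.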
28 Adjusting for Context 2
- Non-uniform normalization
- Use multiple parameters
- Johnson mentions research from the 70s and 80s.
- Recover a richly detailed characterization of vocal tract geometry from the signal
- Almost identical acoustic values can be produced by different vocal tracts
- Indeterminacies found in extraction are magnified in synthesis
- "speech gesture recovery, as a practical normalization strategy, is out of reach at this time" (p. 375)
29 Talkers or Vocal Tracts?
- VT normalization theory implies that speakers differ primarily in VT anatomy
- Traunmüller (1984) successfully reproduced Fant's male/female formant ratios by adjusting for
- Pharynx length
- Resting tongue position
- Johnson: he made assumptions about gender differences and resting tongue position not justified by data
- There appear to be other variables
- Ethological basis for gender speech differences (Ohala 1984, Diehl et al., 1996)
- Johnson: evidence that anatomical differences alone can't account for the differential vowel space of women and men
30 The Puzzle of Gender
- Several studies show
- Listeners can identify the sex of prepubescent boys and girls
- From very short speech segments
- Boys are always more accurately identified than girls
- Two plots of second formant frequencies against age follow (females are dashed lines)
- Notice the difference well before the onset of puberty
31 Second Formant Data
32 Source of Differences
- Perry et al. (2001) did a regression analysis on the data
- Body size accounts for 82% - 87% of difference in F2
- Unspecified gender differences account for 9% of variance in F2 and 5% of F1 and F3
- White (1999: 579): "males and females may well adopt gender-specific articulatory behaviors from childhood to further enhance sex distinctions."
33 But is it Cross-Linguistic? There are certainly differences
34 But how to account for them?
- Using the data from the last figure
- Plot the male/female formant frequency difference against female formant frequency
- Recall Traunmüller (1984)
- Successfully reproduced Fant's male/female formant ratios by adjusting for
- Pharynx length
- Resting tongue position
- Plot Traunmüller's data in the same way
- Traunmüller's data is in rough agreement with the cross-linguistic data
35 F by DF Space: DF as a function of female formant frequency (solid symbols: Traunmüller; solid lines: Traunmüller regression)
36 Though the Fit is Good, It's not Perfect
- Speaker normalization that removes only vocal tract differences will not account for between-sex differences among vowels.
- Talkers choose social, dialectal, gender markers (p. 381)
37 The Speaker is Important
- Listeners recall words better when they are spoken in a familiar voice
- Familiarity with the speaker softens the McGurk effect. Subjects
- Watch a speaker say a phoneme
- But hear a different phoneme
- Report a different, often intermediate phoneme
- Suggests that listeners learn speakers' habits of articulation
38 Social Expectations Play a Role
- Listeners attribute personality traits to speakers on the basis of their speech
- Rubin (1992) found speech intelligibility is reduced for American college-age listeners when a voice is associated with an Asian face
- Word recognition is slower when the speaker is non-stereotypically male or female
39 Exemplar Explanation
- Acoustic-phonetic details are part of a listener's long-term representation of speech
- Goldinger and others have demonstrated retention of talker-specific details
- Not consistent with abstractionist models like
- Formant ratio normalization
- Vocal tract normalization
- Leads to an exemplar model
- Derived from prototype theory
- Common explanatory model in usage-based linguistics
40 Characteristics
- Abstractionist models bend the input signal to match a hypothesized internal representation
- Exemplars adapt to the identity of the speaker
- Exemplars allow the activation of cues of all sorts
- Visual
- Prior expectation
- Recognition of known voice
- Acoustic
- Though Johnson doesn't say this
- May be modeled using PDP architecture
- Rumelhart and McClelland showed that a PDP model could explain why people can more accurately detect a phoneme in a word than in a non-word
41 It's All Too Much
- HOLOFERNES: He draweth out the thread of his verbosity finer than the staple of his argument. I abhor such fanatical phantasimes, such insociable and point-devise companions; such rackers of orthography, as to speak dout, fine, when he should say doubt; det, when he should pronounce debt,--d, e, b, t, not d, e, t: he clepeth a calf, cauf; half, hauf; neighbour vocatur nebor; neigh abbreviated ne. This is abhominable,--which he would call abbominable: it insinuateth me of insanie: anne intelligis, domine? to make frantic, lunatic.
- MOTH: They have been at a great feast of languages, and stolen the scraps.
- Love's Labour's Lost, Act V, Scene i