Speaker Normalization in Speech Perception - PowerPoint PPT Presentation

1
Speaker Normalization in Speech Perception
  • Keith Johnson

In D. Pisoni & R. Remez (eds.) (2005), The
Handbook of Speech Perception. Oxford:
Blackwell. Pages 363-389.
2
Speaker Normalization
  • Phonologically identical utterances show acoustic
    variation
  • Yet listeners recognize utterances despite
    variation
  • Definition: normalization is the identification
    of acoustically distinct utterances as the same
    (abstract) linguistic object

3
k?t
  • Notice the difference in formant frequencies

4
Sound
  • Produced when air pressure fluctuations hit the
    eardrum
  • Air is a medium
  • Fluctuations are waves through the medium
  • Simple periodic waves = sinusoidal waves
  • Periodic?
  • The wave repeats itself at regular intervals
  • Cycle: a full repetition

5
Sine Wave
6
Frequency
  • The number of cycles per unit time
  • Hertz (Hz): 1 cycle / second
  • 1000 cycles/sec = 1 kHz
  • Example: a 1 kHz tone

7
Complex Periodic Wave
  • A periodic wave composed of multiple simple
    periodic waves

[Figure: wave composed of a 100 Hz and a 1000 Hz sine wave]
8
Happy Property
  • A complex periodic wave can be decomposed into
    its component simple periodic waves
  • The technique is called Fourier analysis
  • Fundamental frequency, F0:
  • GCD of the frequencies of the component sine
    waves
  • F0 of the example: 100 Hz
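The GCD definition of F0 above can be checked with a few lines of Python. This is a minimal sketch that assumes the component frequencies are whole numbers of Hz; the function name is made up for illustration.

```python
from functools import reduce
from math import gcd


def fundamental_frequency(component_hz):
    """Return F0 as the greatest common divisor of integer component frequencies (Hz)."""
    return reduce(gcd, component_hz)


# The example wave from the slides: a 100 Hz and a 1000 Hz component.
print(fundamental_frequency([100, 1000]))  # 100

# The GCD can be a frequency that is not itself one of the components.
print(fundamental_frequency([200, 300]))   # 100
```

The second call shows that, under this definition, a wave built from 200 Hz and 300 Hz components still repeats 100 times per second.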

9
Output of Fourier Analysis: Power Spectrum
  • Power spectrum: a plot of amplitude against
    frequency
  • Amplitude: the height of a wave
  • Suppose
  • Amplitude of the 100 Hz wave in the example = 0.5
  • Amplitude of the 1000 Hz wave in the example = 1

[Figure: power spectrum of the example wave; amplitude (0 to 1) plotted against frequency (0 to 1000 Hz)]
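As a rough illustration of the decomposition described above, the sketch below builds the example wave (a 100 Hz component of amplitude 0.5 plus a 1000 Hz component of amplitude 1) and recovers both components with a plain discrete Fourier transform. The sample rate, the signal length, and all function names are assumptions chosen so that both components fall exactly on analysis frequencies; they are not from the slides.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
N_SAMPLES = 800     # 0.1 s of signal, so analysis frequencies are 10 Hz apart


def example_wave(t):
    """The complex periodic wave from the slides, evaluated at time t (seconds)."""
    return (0.5 * math.sin(2 * math.pi * 100 * t)
            + 1.0 * math.sin(2 * math.pi * 1000 * t))


def amplitude_spectrum(samples, sample_rate):
    """Map each analysis frequency (Hz, excluding 0 Hz) to the amplitude found there."""
    n = len(samples)
    spectrum = {}
    for k in range(1, n // 2):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(samples))
        spectrum[k * sample_rate / n] = 2 * abs(coeff) / n
    return spectrum


samples = [example_wave(i / SAMPLE_RATE) for i in range(N_SAMPLES)]
spectrum = amplitude_spectrum(samples, SAMPLE_RATE)

# Keep only the frequencies that carry a non-negligible amplitude.
peaks = {freq: round(amp, 3) for freq, amp in spectrum.items() if amp > 0.01}
print(peaks)  # {100.0: 0.5, 1000.0: 1.0}
```

A real analysis would use a fast Fourier transform and a window function; the direct sum is used here only because it mirrors the definition.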
10
Spectrogram
  • Power spectrum is a snapshot of a periodic wave
  • We want a movie
  • Adopt a shading scheme
  • Amplitude in the top 10%: dark
  • Amplitude in the next 10%: lighter, etc.
  • Every few ms, sample the wave and plot the
    spectrum
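The recipe above (take a spectrum every few milliseconds, then shade by amplitude) can be sketched as follows. This is only an illustration: the frame length, the hop size, the test signal, and the reading of "top 10%" as the top tenth of the amplitude range relative to the loudest value are all assumptions.

```python
import cmath
import math


def frame_spectrum(frame):
    """Amplitude at each analysis frequency of one short frame (direct DFT)."""
    n = len(frame)
    return [2 * abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                        for i, v in enumerate(frame))) / n
            for k in range(n // 2)]


def shade(amplitude, peak):
    """Shade band: 0 is darkest (top 10% of the range), 9 is lightest."""
    if peak == 0:
        return 9
    return min(int((1 - amplitude / peak) * 10), 9)


def spectrogram(samples, frame_len, hop):
    """One row of shade bands per frame; column k is the k-th analysis frequency."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    spectra = [frame_spectrum(samples[s:s + frame_len]) for s in starts]
    peak = max(max(row) for row in spectra)
    return [[shade(a, peak) for a in row] for row in spectra]


# Assumed test signal: 10 ms of a 1000 Hz tone followed by 10 ms of a 2000 Hz tone.
RATE = 8000
signal = [math.sin(2 * math.pi * (1000 if i < 80 else 2000) * i / RATE)
          for i in range(160)]

# 80-sample frames (10 ms) give analysis frequencies 100 Hz apart.
grid = spectrogram(signal, frame_len=80, hop=80)
print(grid[0][10], grid[1][20])  # darkest band at 1000 Hz, then at 2000 Hz
```

Each row of `grid` is one time slice, so the dark cell moves from the 1000 Hz column to the 2000 Hz column as the tone changes.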

11
Formants
  • Component waves in a voice spectrum are called
    harmonics
  • Formants are, roughly, the bands of harmonics
    that the vocal tract amplifies
  • Of course there are lots of other complicated
    details
  • Sampling rate
  • Filtering
  • Length of the vocal tract
  • Filtering causes adjacent harmonics to smear
    together into the dark bands in a spectrogram

12
Spectrogram of ka
13
Praat Interlude
  • Praat, a downloadable phonetic analysis program
  • www.praat.org
  • Authors
  • Paul Boersma
  • David Weenink

14
Importance of formants/ fundamental frequency in
Perception
  • Doubling the fundamental frequency of vocal fold
    vibration (F0) produces a vowel category shift
    for most English vowels
  • Synthesizing a vowel that combines a child's
    high F0 with male formants lowered perception
    accuracy
  • Smaller effects have been reported after altering
    the higher formants

15
Formant Ratio
  • Vowels are relative patterns, not absolute
    formant frequencies
  • Imagine seeing a man 100 meters away
  • Now imagine seeing the same man 50 meters away
  • It is the same man. The visual system
    accommodates the growth in size
  • The ratio of formants is what is perceived
  • Explains why altering F0 caused perceptual errors

16
Sussman's (1986) Formant Ratio Model
  • 1. Auditory nerve fibers are activated by a sound
    wave
  • 2. Several combination-sensitive neurons each
    combine information from two formants (f1-f2,
    f1-f3, f2-f3)
  • 3. Passed to other neurons that compute
  • $x = (f_1 f_2 f_3)^{1/3}$, the geometric mean of
    the formants
  • $\ln(f_n / x)$, for $n = 1, 2, 3$
  • 4. Output is mapped to specific vowels
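The arithmetic in step 3 can be written out as a short sketch. The formant values below are illustrative numbers, not data from Sussman or Johnson, and the function name is made up. The point of the example is that multiplying every formant by the same constant leaves the output unchanged, which is what makes the ratio a candidate for speaker normalization.

```python
import math


def log_formant_ratios(f1, f2, f3):
    """Return ln(fn / x) for n = 1, 2, 3, where x is the geometric mean of the formants."""
    x = (f1 * f2 * f3) ** (1 / 3)
    return [math.log(f / x) for f in (f1, f2, f3)]


# Illustrative formant values in Hz (assumed, not from the source).
speaker = (700.0, 1200.0, 2600.0)

# A second "speaker" whose formants are all 15% higher.
scaled = tuple(1.15 * f for f in speaker)

ratios = log_formant_ratios(*speaker)
scaled_ratios = log_formant_ratios(*scaled)

print(ratios)
print(scaled_ratios)  # matches the line above up to rounding error
```

Because the three values are logs of ratios to the geometric mean, they always sum to zero, so the representation has two free dimensions rather than three.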

17
The Model
18
Other Researchers, Other Ratios
  • Basic ratio idea
  • Auditory system computes a ratio based on the
    formants of the wave that vibrates the basilar
    membrane
  • This ratio is perceived as one of the vowels
  • F0 is absent in some of the models maybe
    reflecting the fact that whispered vowels can be
    identified (though not well)
  • All proposals are similar
  • All do some arithmetic, A, with the formants
  • All compute log (A) or Bark(A)
  • Small problems: Johnson
  • doesn't distinguish between log and ln
  • On p. 367 says Sussman uses a geometric mean
  • But on p. 368 gives Sussman's formula with the
    arithmetic mean

The vowel chart plots F1 against F2, using the
non-linear Bark scale. The non-linearity reflects
the fact that the auditory system resolves
frequency differences more finely at low frequencies.
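To make the non-linearity concrete, here is a small sketch of a Hz-to-Bark conversion. The slides do not say which Bark formula the chart uses; the approximation below is Traunmüller's (1990) and is an assumption of this example.

```python
def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz (Traunmüller 1990 formula, assumed)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


# The same 100 Hz step covers more of the Bark scale low in the spectrum.
low_step = hz_to_bark(300.0) - hz_to_bark(200.0)
high_step = hz_to_bark(3100.0) - hz_to_bark(3000.0)

print(low_step, high_step)
```

The first difference comes out several times larger than the second, which is the compression of high frequencies that the chart's axes reflect.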
19
Beyond Formants
  • Formant ratio theories fail to account for all
    aspects of speaker normalization
  • Vowels also differ in secondary cues
  • Duration
  • Formant trajectories (shift in formant frequency
    over the duration of the vowel)

20
Relevant Studies
  • Lehiste & Miller (1973)
  • Subjects score slightly better than chance in
    identifying fixed-duration vowels synthesized
    with steady-state formant frequencies
  • Nearey & Assmann (1986)
  • Extracted small pieces from the beginning and end
    of vowels
  • Identification was higher when the pieces were
    played in the order extracted (rather than alone
    or in reverse)
  • Conclusion: formant ratios are not points but
    trajectories through normalized space
  • Consistent with exemplar models

21
There's More Than Vowels
  • Fricatives produced by men and women have
    different spectral shapes
  • May (1976)
  • Spliced a fricative continuum from [s] to [ʃ]
    onto a following vowel produced by a male or
    female voice
  • Subjects perceived the [s]-[ʃ] boundary at a
    higher frequency for women
  • Implies that listeners normalized the fricative
    based on the contextual information provided by
    the vowel.

22
Statistical Methods
  • Goal of all normalization algorithms:
  • Reduce within-vowel-category scatter
  • Reduce between-category overlap
  • No algorithm has been shown to work better than
    standardization to speaker-specific z scores
    ((score - mean) / standard deviation)
  • Conclusion: maybe listeners do some kind of
    statistical processing (if not the cognitively
    implausible z score)
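The z-score standardization mentioned above can be sketched in a few lines. The F1 values are invented for illustration: the second speaker's values are simply the first speaker's multiplied by 1.2, which stands in for a shorter vocal tract. The function name is hypothetical.

```python
from statistics import mean, stdev


def z_normalize(values):
    """Standardize one speaker's measurements: (score - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Invented F1 values (Hz) for the same four vowels from two speakers.
speaker_a_f1 = [300.0, 450.0, 600.0, 750.0]
speaker_b_f1 = [1.2 * f for f in speaker_a_f1]

z_a = z_normalize(speaker_a_f1)
z_b = z_normalize(speaker_b_f1)

print(z_a)
print(z_b)  # same values as z_a, up to rounding error
```

After standardization the two speakers' vowels coincide, so within-category scatter caused by the speaker difference disappears. Note that the procedure needs each speaker's mean and standard deviation in advance, which is one reason it is considered implausible as a listener strategy.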

23
Context
  • One simple normalization technique: shift the
    spectrum of female speakers down 1 Bark (Bladon,
    1984)
  • Ladefoged & Broadbent found in 1957:
  • Vowels are identified differently based on the
    formant values of the preceding phrase
  • Suggests that listeners
  • use a cognitive frame of reference
  • An auditory representation of the speaker
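The one-Bark shift in the first bullet above can be sketched as a conversion to Bark, a subtraction, and a conversion back. Two assumptions are made here: the Bark formula is Traunmüller's (1990) approximation, which the slides do not specify, and the shift is applied to a single frequency value rather than to a whole auditory spectrum.

```python
def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz (Traunmüller 1990 formula, assumed)."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53


def bark_to_hz(bark):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (bark + 0.53) / (26.28 - bark)


def shift_down_one_bark(freq_hz):
    """Move a frequency down by exactly 1 Bark and return the result in Hz."""
    return bark_to_hz(hz_to_bark(freq_hz) - 1.0)


female_f2 = 2000.0  # illustrative value in Hz, not from the source
shifted_f2 = shift_down_one_bark(female_f2)
print(shifted_f2)
```

Because the Bark scale is non-linear, a fixed one-Bark shift corresponds to a larger change in Hz for high formants than for low ones.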

24
Shifting Frame of Reference 1
  • Recognition accuracy in noise decreases when the
    speaker is unpredictable
  • Vowel identification is more accurate in single
    talker, rather than mixed talker, word lists
  • Kato & Kakehi (1988) showed that listeners adapted
    monotonically (70% to 76%) across five stimuli
    presented by the same speaker.

25
Shifting Frame of Reference 2
  • When told that an androgynous voice on a h?d -
    h?d continuum was male or female, subjects
    produced different category boundaries
  • Conclusion: listeners perceive speech according
    to an internal representation of the person
    talking

26
Leads To: Theories of Vocal Tract Normalization
  • Formant ratio theories
  • normalization is a function of the auditory
    system encoding vowels, holistically, from a
    collection of spectral cues
  • Vocal tract theories
  • normalization refers to the perceived length of
    the vocal tract for an individual speaker
  • Computation of formants is functionally
    dependent on the idealized length of the vocal
    tract ($f_1 = c/4L$, where $c$ is the speed of
    sound and $L$ is the length of the vocal tract).
  • Essentially, a theory of context
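The relation f1 = c/4L above is easy to try out numerically. In this sketch the speed of sound (35,000 cm/s) and the 17.5 cm vocal tract length are assumed textbook values, not figures from the slides. The higher resonances use the standard formula for a uniform tube closed at one end, which goes slightly beyond what the slide states.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value for warm, moist air


def first_formant(tract_length_cm):
    """f1 = c / 4L for an idealized, uniform vocal tract."""
    return SPEED_OF_SOUND_CM_PER_S / (4.0 * tract_length_cm)


def tube_resonances(tract_length_cm, count=3):
    """Resonances (2n - 1) * c / 4L of a uniform tube closed at one end."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * tract_length_cm)
            for n in range(1, count + 1)]


print(first_formant(17.5))    # 500.0
print(tube_resonances(17.5))  # [500.0, 1500.0, 2500.0]
print(first_formant(14.0))    # a shorter tract gives a higher f1
```

Since every resonance is inversely proportional to L, a listener who could estimate L could in principle undo the speaker difference, which is the idea behind vocal tract theories.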

27
Adjusting for Context 1
  • Uniform scaling (Lindblom, Fant)
  • Took F3 to be a proxy for vocal tract length
  • Computed a constant as the ratio of a speakers
    vocal tract length to a reference vocal tract
    length
  • Scaled incoming sounds by this ratio
  • Reduces speaker differences
  • Big problem: male/female formant differences are
    not uniform, but it's a step in the right
    direction
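Uniform scaling as described above can be sketched as a single multiplication. The sketch assumes that formant frequencies are inversely proportional to vocal tract length, so the length ratio (speaker to reference) equals the F3 ratio (reference to speaker). The formant values, the reference F3, and the function names are all invented for illustration.

```python
def scale_factor(speaker_f3, reference_f3):
    """Ratio of speaker to reference vocal tract length, estimated from F3."""
    return reference_f3 / speaker_f3


def normalize_uniform(formants, reference_f3):
    """Scale all of a speaker's formants (F1, F2, F3) by one constant."""
    k = scale_factor(formants[2], reference_f3)
    return [k * f for f in formants]


# Invented values in Hz: one vowel from a speaker with a short vocal tract.
speaker_formants = [850.0, 2050.0, 2850.0]
REFERENCE_F3 = 2500.0  # assumed F3 of the reference vocal tract

normalized = normalize_uniform(speaker_formants, REFERENCE_F3)
print(normalized)
```

The single constant preserves the ratios between a speaker's formants, which is exactly why the method cannot capture male/female differences that vary from formant to formant and vowel to vowel.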

28
Adjusting for Context 2
  • Non-uniform normalization
  • Use multiple parameters
  • Johnson mentions research from the 1970s and 1980s.
  • Recover a richly detailed characterization of
    vocal tract geometry from the signal
  • Almost identical acoustic values can be produced
    by different vocal tracts
  • Indeterminacies found in extraction are magnified
    in synthesis
  • speech gesture recovery, as a practical
    normalization strategy, is out of reach at this
    time (p. 375)

29
Talkers or Vocal Tracts?
  • VT normalization theory implies that speakers
    differ primarily in VT anatomy
  • Traunmüller (1984) successfully reproduced Fant's
    male/female formant ratios by adjusting for
  • Pharynx length
  • Resting tongue position
  • Johnson: he made assumptions about gender
    differences and resting tongue position not
    justified by the data
  • There appear to be other variables
  • Ethological basis for gender speech differences
    (Ohala 1984, Diehl et al., 1996)
  • Johnson: there is evidence that anatomical
    differences alone can't account for the different
    vowel spaces of women and men

30
The Puzzle of Gender
  • Several studies show
  • Listeners can identify the sex of prepubescent
    boys and girls
  • From very short speech segments
  • Boys are always more accurately identified than
    girls
  • Two plots of second formant frequencies against
    age follow (females are dashed lines)
  • Notice the difference well before the onset of
    puberty

31
Second Formant Data
32
Source of Differences
  • Perry et al. (2001) did a regression analysis on
    the data
  • Body size accounts for 82% - 87% of the difference
    in F2
  • Unspecified gender differences account for 9% of
    the variance in F2 and 5% of F1 and F3
  • White (1999: 579): males and females may well
    adopt gender-specific articulatory behaviors from
    childhood to further enhance sex distinctions.

33
But is it Cross-Linguistic? There are certainly
differences
34
But how to account for them?
  • Using the data from the last figure
  • Plot the male/female formant frequency difference
    against female formant frequency
  • Recall Traunmüller (1984)
  • Successfully reproduced Fant's male/female
    formant ratios by adjusting for
  • Pharynx length
  • Resting tongue position
  • Plot Traunmüller's data in the same way
  • Traunmüller's data are in rough agreement with
    the cross-linguistic data

35
[Figure: F by ΔF space, with ΔF as a function of female formant
frequency. Solid symbols: Traunmüller. Solid lines:
Traunmüller regression.]
36
Though the Fit is Good, It's Not Perfect
  • Speaker normalization that removes only vocal
    tract differences will not account for
    between-sex differences among vowels.
  • Talkers choose social, dialectal, gender
    markers (p. 381)

37
The Speaker is Important
  • Listeners recall words better when they are
    spoken in a familiar voice
  • Familiarity with the speaker softens the McGurk
    effect, in which subjects
  • Watch a speaker say a phoneme
  • But hear a different phoneme
  • Report a different, often intermediate phoneme
  • Suggests that listeners learn speakers' habits of
    articulation

38
Social Expectations Play a Role
  • Listeners attribute personality traits to
    speakers on the basis of their speech
  • Rubin (1992) found speech intelligibility is
    reduced for American college-age listeners when a
    voice is associated with an Asian face
  • Word recognition is slower when the speaker is
    non-stereotypically male or female

39
Exemplar Explanation
  • Acoustic-phonetic details are part of a
    listener's long-term representation of speech
  • Goldinger and others have demonstrated retention
    of talker-specific details
  • Not consistent with abstractionist models like:
  • Formant ratio normalization
  • Vocal tract normalization
  • Leads to an exemplar model
  • Derived from prototype theory
  • Common explanatory model in usage-based
    linguistics

40
Characteristics
  • Abstractionist models bend the input signal to
    match a hypothesized internal representation
  • Exemplars adapt to the identity of speaker
  • Exemplars allow the activation of cues of all
    sorts
  • Visual
  • Prior expectation
  • Recognition of known voice
  • Acoustic
  • Though Johnson doesn't say this, exemplar models
    may be modeled using a PDP architecture
  • Rumelhart and McClelland showed that a PDP model
    could explain why people can more accurately
    detect a phoneme in a word than in a non-word
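The general exemplar idea (classify a new token by its similarity to remembered tokens rather than by normalizing it first) can be illustrated with a toy sketch. This is a generic similarity-weighted classifier, not Johnson's model and not a PDP network; the stored formant values, the vowel labels, and the sensitivity constant are all invented for illustration.

```python
import math
from collections import defaultdict

# Invented remembered tokens: (F1, F2) in Hz paired with a vowel label.
EXEMPLARS = [
    ((300.0, 2300.0), "i"),
    ((320.0, 2500.0), "i"),
    ((700.0, 1200.0), "a"),
    ((750.0, 1300.0), "a"),
]


def classify(token, exemplars, sensitivity=0.005):
    """Pick the label whose stored exemplars are, in total, most similar to the token."""
    activation = defaultdict(float)
    for formants, label in exemplars:
        distance = math.dist(token, formants)
        # Similarity falls off exponentially with distance in formant space.
        activation[label] += math.exp(-sensitivity * distance)
    return max(activation, key=activation.get)


print(classify((310.0, 2400.0), EXEMPLARS))  # i
print(classify((720.0, 1250.0), EXEMPLARS))  # a
```

Other cues listed on the slide (visual information, prior expectation, a recognized voice) could in principle be added as extra dimensions of each stored exemplar, which is what distinguishes this family of models from approaches that first strip speaker information away.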

41
It's All Too Much
  • HOLOFERNES: He draweth out the thread of his
    verbosity finer than the staple of his argument.
    I abhor such fanatical phantasimes, such
    insociable and point-devise companions; such
    rackers of orthography, as to speak dout, fine,
    when he should say doubt; det, when he should
    pronounce debt,--d, e, b, t, not d, e, t: he
    clepeth a calf, cauf; half, hauf; neighbour
    vocatur nebor; neigh abbreviated ne. This is
    abhominable,--which he would call abbominable: it
    insinuateth me of insanie: anne intelligis,
    domine? to make frantic, lunatic.
  • MOTH: They have been at a great feast of
    languages, and stolen the scraps.
  • Love's Labour's Lost, Act V, Scene i
