1
Akiko Kusumoto (1,2), John-Paul Hosom (1), Nancy Vaughan (2)
Comparison of Acoustic Features of
Time-Compressed and Natural Speech.
(1) Center for Spoken Language Understanding, OGI
School of Science & Engineering, Oregon Health &
Science University.
(2) VA RR&D National Center for Rehabilitative
Auditory Research, Portland, OR
2
INTRODUCTION
  • Computer Processed, Time-Compressed Speech
  • Widely used to examine limits of human speech
    processing.
  • Older and hearing impaired listeners have more
    difficulties than younger listeners.
  • Difficulty processing transient cues
  • Slowed rate of information processing.
  • Need to store more information when it is
    compressed.
  • Normally, humans can comprehend speech at rates as
    high as 500 words per minute (wpm) [Fulford, 93].
  • Computer-compressed speech may have reduced
    intelligibility around 250 wpm, depending on the
    algorithm [Covell, 98].

3
INTRODUCTION
Time-Compressed Speech
Original Sentence (172 wpm)
CR 40 (287 wpm)
CR 60 (430 wpm)
  • Time Compression Rate (CR) indicates the percentage
    of the original speech removed from the waveform.
  • CR is relative to the rate of the original speech
    (speaker).
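
The rates on this slide follow from the CR definition. A minimal sketch, assuming CR is the percentage of the original duration removed (the function name is hypothetical, not from the slides):

```python
def compressed_rate_wpm(original_wpm, cr_percent):
    """Speaking rate after removing cr_percent of the original duration.

    Assumption: the word count is unchanged and the duration shrinks to
    (100 - cr_percent) percent of the original.
    """
    if not 0 <= cr_percent < 100:
        raise ValueError("cr_percent must be in [0, 100)")
    return original_wpm * 100.0 / (100.0 - cr_percent)


if __name__ == "__main__":
    for cr in (40, 60):
        print(f"CR {cr}: {compressed_rate_wpm(172, cr):.0f} wpm")
```

With the 172 wpm original sentence, this gives the 287 wpm and 430 wpm figures listed for CR 40 and CR 60.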

4
Pilot Study
Perceptual Test to investigate the effects of
age-related cognitive deficits on speech
perception.
  • Subjects
  • Group 1: Young, Normal Hearing (N = 26)
  • Group 2: Old, Normal Hearing (N = 36)
  • Group 3: Old, Mild-Moderate Hearing Loss (N = 71)
  • Speech Material
  • IEEE Sentences (contextual).
  • Anomalous Sentences (syntactically correct but
    semantically anomalous).
  • Methods
  • For young listeners, test from CR60 to CR75. For
    old listeners, test from CR40 to CR65.
  • Subjects repeat the entire sentence.
  • Each test sentence contains 5 key words for
    scoring.
  • 10 sentences for each compression rate.
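
Key-word scoring of this kind could be sketched as follows. The matching rule (case-insensitive whole-word match, word order ignored) and the function name are assumptions; the slides do not say how responses were scored.

```python
import re


def keyword_score(responses, keywords_per_sentence):
    """Percentage of key words found in the listener's responses.

    responses: list of transcribed responses, one per test sentence.
    keywords_per_sentence: list of key-word lists, aligned with responses.
    Assumption: a key word counts as correct if it appears anywhere in the
    response as a whole word.
    """
    if len(responses) != len(keywords_per_sentence):
        raise ValueError("one key-word list is needed per response")
    hits = total = 0
    for response, keywords in zip(responses, keywords_per_sentence):
        words = set(re.findall(r"[a-z']+", response.lower()))
        for keyword in keywords:
            total += 1
            hits += keyword.lower() in words
    if total == 0:
        raise ValueError("no key words to score")
    return 100.0 * hits / total
```

With 5 key words per sentence and 10 sentences per compression rate, each score at a given rate would be based on 50 key words.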

5
Pilot Study
Perceptual Test Results
  • Functions are parallel:
    • Young vs Old listeners
  • Small differences:
    • Old listeners with normal hearing vs hearing loss
    • Contextual vs non-contextual sentences

For more details, please see poster 2aSC16 by
Furukawa et al.
6
Motivation for this Work
Identify possible acoustic sources of reduced
intelligibility for older listeners.
  • Possible acoustic sources of reduced
    intelligibility related to artifacts of the
    compression algorithm
  • Unwanted temporal changes?
  • Vowel durations
  • vowels may be overly compressed during
    steady-state regions or not compressed
    enough relative to consonants
  • Consonant-vowel durations
  • steady-state consonants may be overly
    compressed, stops may be under-compressed
  • Spectral distortions?

7
Time-Compression Technique: Synchronous Overlap
Add (SOLA) [Roucos, 85].
[Figure: SOLA example at CR 50. Panels show the unprocessed input speech with segments (a), (b), and (c); the correlation of (a) with (b) and of (a) with (c); the maximum correlation point; and the resulting output.]
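
The overlap-add idea in the figure can be sketched in a few lines. This is a simplified illustration of SOLA, not the implementation used in this work; the frame length, hop size, search range, and linear cross-fade are all assumed values chosen for readability.

```python
import math


def sola_compress(x, cr, frame=400, hop_in=200, max_shift=40):
    """Time-compress the sample list x, removing roughly a fraction cr.

    Frames are read every hop_in samples and written about every
    hop_in * (1 - cr) samples. Each frame is shifted by up to max_shift
    samples to the point of maximum normalized cross-correlation with the
    output built so far, then cross-faded into it.
    """
    if not 0.0 <= cr < 1.0:
        raise ValueError("cr must be in [0, 1)")
    hop_out = max(1, round(hop_in * (1.0 - cr)))
    out = [float(v) for v in x[:frame]]      # first frame is copied as-is
    m = 1
    pos_in = hop_in
    while pos_in + frame <= len(x):
        seg = x[pos_in:pos_in + frame]
        nominal = m * hop_out                 # position without alignment
        best_start, best_score = len(out), -math.inf
        for k in range(-max_shift, max_shift + 1):
            start = nominal + k
            if start < 0 or start >= len(out):
                continue
            ov = min(len(out) - start, frame)  # overlap length at this shift
            num = sum(out[start + i] * seg[i] for i in range(ov))
            e_out = sum(out[start + i] ** 2 for i in range(ov))
            e_seg = sum(seg[i] ** 2 for i in range(ov))
            den = math.sqrt(e_out * e_seg)
            score = num / den if den > 0.0 else 0.0
            if score > best_score:
                best_start, best_score = start, score
        ov = min(len(out) - best_start, frame)
        for i in range(ov):                   # linear cross-fade in the overlap
            w = i / ov
            out[best_start + i] = (1.0 - w) * out[best_start + i] + w * seg[i]
        out.extend(float(v) for v in seg[ov:])
        pos_in += hop_in
        m += 1
    return out
```

For CR 50 the output hop is half the input hop, so the output is roughly half as long as the input; the exact length varies slightly with the chosen alignment shifts.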

8
Tool to Identify Phoneme Boundaries: Forced
Alignment
  • Use existing automatic speech recognizer [Hosom,
    02].
  • Constrain to recognize only the correct phoneme
    sequence.
  • Return both phoneme identities and phoneme
    boundaries.

Accuracy is measured as the percentage of
boundaries within X ms of manually placed
boundaries.
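
The accuracy measure above can be sketched as follows, assuming the automatic and manual boundaries are already paired one-to-one (same phoneme sequence). The function name and the 20 ms default are illustrative assumptions.

```python
def boundary_agreement(auto_ms, manual_ms, tol_ms=20.0):
    """Percentage of automatic boundaries within tol_ms of the manual ones.

    auto_ms, manual_ms: boundary times in milliseconds, in the same order.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must be the same length")
    if not auto_ms:
        raise ValueError("no boundaries to compare")
    hits = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tol_ms)
    return 100.0 * hits / len(auto_ms)
```

Forced alignment constrains the recognizer to the known phoneme sequence, so the two boundary lists have the same length by construction.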
9
Speech Corpus: IEEE sentences
Example: The birch canoe slid on the smooth
planks.
  • Used for speech intelligibility test (SI).
  • Developed by IEEE [Rothauser, 69].
  • Total 320 sentences used for analysis.
  • Each sentence contains 7 to 10 words
  • 2518 words in total.
  • 2859 vowels, 6673 consonants in total.

10
Temporal Domain Analysis
  • In English, duration often serves as a primary
    perceptual cue [Klatt, 1976].
  • Vowel duration: Stressed vs Unstressed
  • Example: /i/ vs /I/, /A/ vs //
  • Fricative duration: Voiced vs Unvoiced
  • Example: /s/ vs /z/, /f/ vs /v/
  • Consonant-vowel duration ratio influences phoneme
    identity.
  • Vowel duration indicates whether the following
    consonant is voiced or unvoiced.
  • Example: mat (shorter vowel) vs mad (longer
    vowel) [Ladefoged, 93].

11
Average Vowel Duration
Question 1: Does compression affect duration
differences of stressed and unstressed vowels
equally or not?
  • We compute the average vowel duration and compare
    the original and time-compressed speech.
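
A minimal sketch of this comparison, assuming the forced aligner yields (label, start_ms, end_ms) segments for each version of the corpus. The segment format and function names are hypothetical; a label can be any hashable key, such as a (vowel, stress) pair.

```python
from collections import defaultdict


def mean_durations(segments):
    """Mean duration in ms per label from (label, start_ms, end_ms) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, start_ms, end_ms in segments:
        totals[label] += end_ms - start_ms
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


def achieved_compression(original, compressed):
    """Percent reduction in mean duration per label after compression."""
    orig = mean_durations(original)
    comp = mean_durations(compressed)
    return {
        label: 100.0 * (1.0 - comp[label] / orig[label])
        for label in orig
        if label in comp and orig[label] > 0
    }
```

Comparing each label's achieved percentage with the nominal CR shows whether a vowel class was compressed more or less than intended.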

[Figure; axis label: Number of Occurrences]
12
Average Vowel Duration
  • Stressed and unstressed vowels are compressed
    equally (except //).
  • No vowels are compressed beyond the desired
    compression rate.
  • Except for vowels /A/ and // at CR70, the large
    timing difference between stressed and unstressed
    vowels is preserved.

13
Consonant-Vowel Duration
Question 2: Does compression affect relative
duration differences of voiced and unvoiced
consonants equally or not?
  • We compute the ratio between the duration of a
    consonant and that of its closest neighboring
    vowel, normalizing for the effect of vowel identity.

V: average duration of all vowels.
v: average duration of the given vowel.
E.g., for the consonant /9r/: /9r/ = 70 ms,
/i/ = 100 ms.
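
The exact normalization formula is not legible in this transcript. The sketch below shows one plausible form, offered only as an assumption: the neighboring vowel's duration is rescaled by V / v before taking the ratio, so inherently long or short vowels do not bias the result. The function name and the example values for V and v are made up.

```python
def normalized_cv_ratio(consonant_ms, vowel_ms, vowel_mean_ms, all_vowels_mean_ms):
    """Consonant-to-vowel duration ratio with the vowel rescaled by V / v.

    ASSUMED formula, not taken from the slides:
        ratio = C / (vowel * V / v)
    where V is the mean duration of all vowels and v is the mean duration
    of this particular vowel.
    """
    if min(consonant_ms, vowel_ms, vowel_mean_ms, all_vowels_mean_ms) <= 0:
        raise ValueError("durations must be positive")
    normalized_vowel_ms = vowel_ms * all_vowels_mean_ms / vowel_mean_ms
    return consonant_ms / normalized_vowel_ms


if __name__ == "__main__":
    # 70 ms and 100 ms are the slide's example; V = 90 ms and v = 120 ms
    # are hypothetical values chosen only to make the example run.
    print(normalized_cv_ratio(70.0, 100.0, vowel_mean_ms=120.0, all_vowels_mean_ms=90.0))
```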
14
Average Consonant-Vowel Ratio: Voiced, Unvoiced
Stops
[Figure; axis label: Number of Occurrences]
  • For CR70, duration cues are less clear:
    • Unvoiced consonant ratios are smaller.
    • Voiced consonant ratios are larger.
  • For CR30, CR50, duration cues are preserved.

15
Average Consonant-Vowel Ratio: Voiced, Unvoiced
Fricatives
[Figure; axis label: Number of Occurrences; fricatives shown: s, v, z, f, T, D, Z, S]
  • /f/ vs /v/ and /S/ vs /Z/ duration cues are less
    clear with increasing compression.

16
Spectral Analysis
Question 3: Does time-compressed speech contain
perceptible spectral distortions?
  • We compute the average long-term power spectrum
    over 320 sentences.
  • We plot spectrum for original, CR40, CR60
    sentences.
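
An average long-term power spectrum can be sketched as below. The frame length, hop, Hann window, and plain DFT are assumptions made to keep the example dependency-free; they are not the analysis settings used in this work.

```python
import cmath
import math


def long_term_power_spectrum_db(signals, frame=64, hop=32):
    """Average power spectrum in dB over all frames of all signals.

    signals: list of sample lists (e.g. one per sentence).
    Returns frame // 2 + 1 values, one per DFT bin from 0 to Nyquist.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame - 1)) for n in range(frame)]
    bins = frame // 2 + 1
    power = [0.0] * bins
    count = 0
    for x in signals:
        for start in range(0, len(x) - frame + 1, hop):
            seg = [x[start + n] * window[n] for n in range(frame)]
            for k in range(bins):
                s = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame)
                        for n in range(frame))
                power[k] += abs(s) ** 2
            count += 1
    if count == 0:
        raise ValueError("signals are shorter than one frame")
    # Small floor avoids log10(0) for silent bins.
    return [10.0 * math.log10(p / count + 1e-12) for p in power]
```

Subtracting the spectrum of the original sentences from that of the compressed sentences, bin by bin, gives the dB differences that can then be compared across the F1 and F2 regions.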

17
Spectral Analysis
  • We found:
  • about 3 dB increase in the F1 region for CR60.
  • about 3 dB increase in the F2 region for the original.
  • JND of F1 is about 1.5 dB and of F2 about 3 dB
    [O'Shaughnessy, 00].
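
The comparison against the JND values can be written out explicitly. A minimal sketch, assuming a difference counts as potentially noticeable only when it strictly exceeds the JND; both the rule and the function name are assumptions.

```python
def exceeds_jnd(difference_db, jnd_db):
    """True if the magnitude of a level difference is larger than the JND."""
    return abs(difference_db) > jnd_db


if __name__ == "__main__":
    # Approximate values from the slide: about 3 dB in both regions.
    print("F1:", exceeds_jnd(3.0, 1.5))
    print("F2:", exceeds_jnd(3.0, 3.0))
```

Since the measured differences are only approximate, the F2 case sits right at the threshold and should be read with caution.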

18
Conclusions
  • Average vowel duration indicates equal
    compression in stressed and unstressed vowels and
    no unwanted compression due to processing.
  • Average consonant-vowel ratio indicates:
  • plosive duration cues preserved for CR30, CR50,
    but not CR70,
  • fricative duration cues preserved except for /S/,
    /Z/ and /f/, /v/ distinctions. (However, /S/, /Z/
    confusion usually does not impact word
    recognition).
  • Average long-term power spectrum indicates a
    potentially noticeable difference around the F1,
    but not the F2, region.
  • Noticeable difference of 3 dB may or may not be
    perceptually important.
  • No spectral artifacts (e.g. clicks) discovered in
    spectrum.
  • No clear link between reduced intelligibility in
    older listeners and duration or power-spectrum
    artifacts.

19
Future Work
  • Analyze formant transitions, because they provide
    important perceptual cues.
  • When the speech is compressed, the formant
    transition may undershoot the target formant
    frequencies.
  • Need quantitative analysis considering JND for
    young vs old listeners with or without hearing
    loss.
  • JND of temporal cues depends on age and hearing
    level.
  • Investigate the relationship between the
    word-level confusion from the listening test
    results and vowel/consonant duration cues.

20
References
  • Fulford, (1993). "Can learning be more efficient?
    Using compressed audio tapes to enhance
    systematically designed text," Educational
    Technology, 33(2), 51-59.
  • Covell, M., (1998). "MACH1 for nonuniform
    time-scale modifications of speech: theory,
    technique, and comparisons," Proc. IEEE ICASSP,
    Seattle, WA, vol. 1, 349-352.
  • Roucos, S., Wilgus, A., (1985). "High quality
    time-scale modification for speech," Proc. IEEE
    ICASSP, Tampa, FL, vol. 2, 493-496.
  • Hosom, J. P., (2002). "Automatic phoneme
    alignment based on acoustic-phonetic modeling,"
    Proc. ICSLP, Boulder, CO, vol. 1, 357-360.
  • Klatt, D. H., (1976). "Linguistic uses of
    segmental duration in English: Acoustic and
    perceptual evidence," Journal of the Acoustical
    Society of America, 59(5), 1208-1221.
  • Ladefoged, P., (1993). A Course in Phonetics,
    Chapter 4, English Vowels and Phonological
    Rules, Rule 6.
  • O'Shaughnessy, D., (2000). Speech
    Communications: Human and Machine, IEEE Press,
    New York.
  • Rothauser, E. H., Chapman, W. D., Guttman, N.,
    Nordby, K. S., Silbiger, H. R., Urbanek, G. E.,
    and Weinstock, M., (1969). "IEEE recommended
    practice for speech quality measurements," IEEE
    Trans. Audio Electroacoust., 17, 227-246.