Title: Comparison of Acoustic Features of Time-Compressed and Natural Speech
Akiko Kusumoto (1,2), John-Paul Hosom (1), Nancy Vaughan (2)
(1) Center for Spoken Language Understanding, OGI School of Science & Engineering, Oregon Health & Science University
(2) VA RR&D National Center for Rehabilitative Auditory Research, Portland, OR
2 INTRODUCTION
- Computer-Processed, Time-Compressed Speech
  - Widely used to examine limits of human speech processing.
  - Older and hearing-impaired listeners have more difficulty than younger listeners.
    - Difficulty processing transient cues.
    - Slowed rate of information processing.
    - Need to store more information when it is compressed.
  - Normally, humans can comprehend speech at rates as high as 500 words per minute (wpm) [Fulford, 93].
  - Computer-compressed speech may have reduced intelligibility around 250 wpm, depending on the algorithm [Covell, 98].
3 INTRODUCTION
Time-Compressed Speech
Original Sentence (172 wpm)
CR 40 (287 wpm)
CR 60 (430 wpm)
- Time Compression Rate (CR) indicates the percentage of the original speech removed from the waveform.
- CR is relative to the rate of the original speech (speaker).
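The relation between CR and speaking rate can be checked with a short calculation. This is a minimal sketch, assuming CR is the percentage of the waveform removed, so the same words are spoken in the remaining fraction of the time; the function name is illustrative and not from the slides.

```python
def compressed_rate_wpm(original_wpm: float, cr_percent: float) -> float:
    """Speaking rate after removing cr_percent of the original waveform."""
    if not 0 <= cr_percent < 100:
        raise ValueError("cr_percent must be in [0, 100)")
    # Fraction of the original duration that remains after compression.
    remaining_fraction = 1.0 - cr_percent / 100.0
    return original_wpm / remaining_fraction


# Reproduces the rates listed above for the 172 wpm original sentence.
for cr in (40, 60):
    print(f"CR {cr}: {compressed_rate_wpm(172, cr):.0f} wpm")
```

Under this assumption the 172 wpm original becomes about 287 wpm at CR 40 and 430 wpm at CR 60, matching the values on the slide.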
4 Pilot Study
Perceptual test to investigate the effects of age-related cognitive deficits on speech perception.
- Subjects
  - Group 1: Young, Normal Hearing (N = 26)
  - Group 2: Old, Normal Hearing (N = 36)
  - Group 3: Old, Mild-Moderate Hearing Loss (N = 71)
- Speech Material
  - IEEE Sentences (contextual).
  - Anomalous Sentences (syntactically correct but semantically anomalous).
- Methods
  - For young listeners, test from CR60 to CR75. For old listeners, test from CR40 to CR65.
  - Subjects repeat the entire sentence.
  - Each test sentence contains 5 key words for scoring.
  - 10 sentences for each compression rate.
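Key-word scoring of this kind can be sketched as below. The slides do not state the matching rules, so exact whole-word matching is an assumption, and the key words and listener response in the example are hypothetical.

```python
import re


def keyword_score(response, key_words):
    """Number of key words that appear in the listener's response."""
    # Assumption: a key word counts only if it is repeated exactly.
    spoken = set(re.findall(r"[a-z']+", response.lower()))
    return sum(1 for word in key_words if word.lower() in spoken)


def percent_keywords_correct(responses, key_word_lists):
    """Percent of key words repeated correctly across a block of sentences."""
    total = sum(len(keys) for keys in key_word_lists)
    if total == 0:
        raise ValueError("no key words to score")
    correct = sum(keyword_score(r, keys)
                  for r, keys in zip(responses, key_word_lists))
    return 100.0 * correct / total


# Hypothetical key words and response, for illustration only.
keys = ["birch", "canoe", "slid", "smooth", "planks"]
response = "the birch canoe slid on the smooth plants"
print(keyword_score(response, keys))
print(percent_keywords_correct([response], [keys]))
```

With 5 key words per sentence and 10 sentences per compression rate, each listener's score at a given CR would be based on 50 key words.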
5 Pilot Study
Perceptual Test Results
- Functions are parallel
- Young vs Old listeners
- Small Differences
- Old listeners with normal hearing vs hearing loss
- Contextual vs non-contextual sentences
For more details, please see poster 2aSC16 by
Furukawa et al.
6 Motivation for this Work
Identify possible acoustic sources of reduced intelligibility for older listeners.
- Possible acoustic sources of reduced intelligibility related to artifacts of the compression algorithm
  - Unwanted temporal changes?
    - Vowel durations
      - vowels may be overly compressed during steady-state regions, or not compressed enough relative to consonants
    - Consonant-vowel durations
      - steady-state consonants may be overly compressed, stops may be under-compressed
  - Spectral distortions?
7 Time-Compression Technique: Synchronous Overlap Add (SOLA) [Roucos, 85]
Example, CR 50
[Figure: two panels, each showing the Input (unprocessed speech) and the Output. Panel 1: correlation of segments (a) and (b), with the Maximum Correlation Point marked. Panel 2: correlation of segments (a) and (c).]
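The overlap-add step in the figure can be sketched in code. This is a basic SOLA variant written for illustration, not the implementation used in the study; the frame length, hop size, search range, and linear cross-fade are all assumptions.

```python
import math


def _normalized_correlation(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0 else 0.0


def sola_compress(x, cr_percent, frame_len=200, analysis_hop=100, max_shift=25):
    """Time-compress the sample list x by roughly cr_percent (basic SOLA)."""
    if not 0 <= cr_percent < 100:
        raise ValueError("cr_percent must be in [0, 100)")
    # Frames are read every analysis_hop samples but written about every
    # synthesis_hop samples, which shortens the output.
    synthesis_hop = max(1, round(analysis_hop * (1 - cr_percent / 100)))
    if frame_len <= synthesis_hop + 2 * max_shift:
        raise ValueError("frame_len too short to guarantee an overlap")
    out = list(x[:frame_len])
    m = 1
    while m * analysis_hop + frame_len <= len(x):
        frame = x[m * analysis_hop: m * analysis_hop + frame_len]
        nominal = m * synthesis_hop
        best_pos, best_corr = nominal, -2.0
        # Search around the nominal position for the maximum correlation point.
        for shift in range(-max_shift, max_shift + 1):
            pos = nominal + shift
            if pos < 0:
                continue
            overlap = min(len(out) - pos, frame_len)
            corr = _normalized_correlation(out[pos:pos + overlap], frame[:overlap])
            if corr > best_corr:
                best_pos, best_corr = pos, corr
        # Linear cross-fade over the overlap, then append the rest of the frame.
        overlap = min(len(out) - best_pos, frame_len)
        for i in range(overlap):
            w = i / overlap
            out[best_pos + i] = (1 - w) * out[best_pos + i] + w * frame[i]
        out.extend(frame[overlap:])
        m += 1
    return out


# A 4000-sample sine compressed at CR 50 comes out at about half the length.
signal = [math.sin(2 * math.pi * t / 40) for t in range(4000)]
print(len(signal), len(sola_compress(signal, 50)))
```

Aligning each new segment at the point of maximum correlation keeps pitch periods in phase across the splice, which is the reason for the correlation search shown in the figure.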
8 Tool to Identify Phoneme Boundaries: Forced Alignment
- Use existing automatic speech recognizer [Hosom, 02].
- Constrain to recognize only the correct phoneme sequence.
- Return both phoneme identities and phoneme boundaries.
Accuracy is measured as the percentage of boundaries within X ms of manually placed boundaries.
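The accuracy measure above can be written as a short function. This is a sketch: it assumes the automatic and manual boundaries are paired one-to-one, which holds when the recognizer is constrained to the correct phoneme sequence, and the tolerance and boundary times in the example are made up.

```python
def boundary_accuracy(auto_ms, manual_ms, tolerance_ms):
    """Percent of automatic boundaries within tolerance_ms of manual ones."""
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must have the same length")
    if not manual_ms:
        raise ValueError("no boundaries to compare")
    hits = sum(1 for a, m in zip(auto_ms, manual_ms)
               if abs(a - m) <= tolerance_ms)
    return 100.0 * hits / len(manual_ms)


# Illustrative boundary times in ms and an illustrative tolerance.
auto = [102.0, 250.0, 431.0, 600.0]
manual = [100.0, 240.0, 400.0, 598.0]
print(boundary_accuracy(auto, manual, tolerance_ms=20.0))
```

In the example three of the four boundaries fall within the tolerance, giving 75 percent.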
9 Speech Corpus: IEEE sentences
Example: "The birch canoe slid on the smooth planks."
- Used for speech intelligibility (SI) tests.
- Developed by IEEE [Rothauser, 69].
- Total 320 sentences used for analysis.
- Each sentence contains 7 to 10 words.
- 2518 words in total.
- 2859 vowels, 6673 consonants in total.
10 Temporal Domain Analysis
- In English, duration often serves as a primary perceptual cue [Klatt, 1976].
  - Vowel duration: Stressed vs Unstressed
    - Example, /i/ vs /I/, /A/ vs //
  - Fricative duration: Voiced vs Unvoiced
    - Example, /s/ vs /z/, /f/ vs /v/
- Consonant-vowel duration ratio influences phoneme identity.
  - Vowel duration indicates whether the following consonant is voiced or unvoiced.
  - Example, mat (shorter vowel) vs mad (longer vowel) [Ladefoged, 93].
11 Average Vowel Duration
Question 1: Does compression affect the durations of stressed and unstressed vowels equally or not?
- We compute the average vowel duration and compare the original and time-compressed speech.
[Figure; axis label: Number of Occurrences]
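The comparison described above can be sketched as follows, assuming each vowel token is available as a (vowel, stressed, duration in ms) tuple from the forced alignment. The tuple layout, function names, and durations are illustrative assumptions, not data from the study.

```python
from collections import defaultdict


def mean_durations(segments):
    """Mean duration in ms for each (vowel, stressed) class."""
    totals = defaultdict(lambda: [0.0, 0])
    for vowel, stressed, duration_ms in segments:
        entry = totals[(vowel, stressed)]
        entry[0] += duration_ms
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def measured_compression(original, compressed):
    """Percent reduction in mean duration for each (vowel, stressed) class."""
    orig_means = mean_durations(original)
    comp_means = mean_durations(compressed)
    return {key: 100.0 * (1.0 - comp_means[key] / orig_means[key])
            for key in orig_means if key in comp_means}


# Made-up durations: the same three /i/ tokens before and after compression.
original = [("i", True, 120.0), ("i", True, 100.0), ("i", False, 60.0)]
compressed = [("i", True, 60.0), ("i", True, 50.0), ("i", False, 33.0)]
for key, percent in measured_compression(original, compressed).items():
    print(key, round(percent, 1))
```

A class whose measured reduction exceeds the nominal CR would count as over-compressed, and unequal reductions for the stressed and unstressed classes would indicate that the stress cue is being distorted.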
12 Average Vowel Duration
- Stressed and unstressed vowels are compressed equally (except //).
- No vowels are compressed beyond the desired compression rate.
- Except for CR70 vowels /A/, //, the large timing difference between stressed and unstressed vowels is preserved.
13 Consonant-Vowel Duration
Question 2: Does compression affect the relative duration differences of voiced and unvoiced consonants equally or not?
- We compute the ratio between the duration of a consonant and that of its closest neighboring vowel, normalizing for the effect of vowel identity.
  - V: average duration of all vowels.
  - v: average duration of the vowel.
  - e.g., for the consonant /9r/: /9r/ 70 ms, /i/ 100 ms.
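The exact normalization is not given in the text above, so the sketch below assumes one plausible form: the neighboring vowel's duration is rescaled by V / v, so that an intrinsically long or short vowel does not bias the ratio. The function name and the two average durations in the example are hypothetical; only the 70 ms and 100 ms values come from the slide.

```python
def normalized_cv_ratio(consonant_ms, vowel_ms, vowel_avg_ms, all_vowels_avg_ms):
    """Consonant-to-vowel duration ratio with the vowel identity factored out.

    Assumed normalization (not confirmed by the source): scale the
    neighboring vowel by V / v before taking the ratio.
    """
    if vowel_ms <= 0 or vowel_avg_ms <= 0 or all_vowels_avg_ms <= 0:
        raise ValueError("durations must be positive")
    normalized_vowel_ms = vowel_ms * all_vowels_avg_ms / vowel_avg_ms
    return consonant_ms / normalized_vowel_ms


# /9r/ = 70 ms next to /i/ = 100 ms, as in the example above.
# The averages v = 110 ms and V = 88 ms are made up for illustration.
print(normalized_cv_ratio(70.0, 100.0, vowel_avg_ms=110.0, all_vowels_avg_ms=88.0))
```

When the vowel's own average equals the all-vowel average, the result reduces to the plain ratio, 0.7 for the example tokens.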
14 Average Consonant-Vowel Ratio: Voiced, Unvoiced Stops
[Figure; axis label: Number of Occurrences]
- For CR70, duration cues are less clear.
  - Unvoiced consonant ratios are smaller.
  - Voiced consonant ratios are larger.
- For CR30, CR50, duration cues are preserved.
15 Average Consonant-Vowel Ratio: Voiced, Unvoiced Fricatives
[Figure; axis label: Number of Occurrences; fricatives plotted: s, v, z, f, T, D, Z, S]
- /f/ vs /v/ and /S/ vs /Z/ duration cues are less clear with increasing compression.
16 Spectral Analysis
Question 3: Does time-compressed speech contain perceptible spectral distortions?
- We compute the average long-term power spectrum over 320 sentences.
- We plot the spectrum for original, CR40, and CR60 sentences.
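An average long-term power spectrum can be sketched as below. The frame length, hop, Hann window, and the naive DFT are assumptions chosen to keep the example self-contained; the slides do not describe the analysis settings, and an FFT library would normally replace the inner loop.

```python
import cmath
import math


def frame_power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def long_term_power_spectrum_db(sentences, frame_len=64, hop=32):
    """Average power per frequency bin over all frames of all sentences, in dB."""
    totals = [0.0] * (frame_len // 2 + 1)
    frame_count = 0
    for samples in sentences:
        for start in range(0, len(samples) - frame_len + 1, hop):
            power = frame_power_spectrum(samples[start:start + frame_len])
            for k, p in enumerate(power):
                totals[k] += p
            frame_count += 1
    if frame_count == 0:
        raise ValueError("no complete frames to analyze")
    # Small floor avoids log10(0) for empty bins.
    return [10.0 * math.log10(t / frame_count + 1e-12) for t in totals]


# Synthetic check: a sine at bin 8 of a 64-point frame should peak at bin 8.
sine = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spectrum_db = long_term_power_spectrum_db([sine])
print(max(range(len(spectrum_db)), key=spectrum_db.__getitem__))
```

Subtracting the dB spectrum of the original sentences from that of a compressed condition, bin by bin, gives the level differences that can then be compared against a JND.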
17 Spectral Analysis
- We found
  - about 3 dB increase at F1 region in CR60.
  - about 3 dB increase at F2 region in original.
- JND of F1 is about 1.5 dB and of F2 about 3 dB [O'Shaughnessy, 00].
18 Conclusions
- Average vowel duration indicates equal compression in stressed and unstressed vowels and no unwanted compression due to processing.
- Average consonant-vowel ratio indicates
  - plosive duration cues preserved for CR30, CR50, but not CR70,
  - fricative duration cues preserved except for /S/, /Z/ and /f/, /v/ distinctions. (However, /S/, /Z/ confusion usually does not impact word recognition.)
- Average long-term power spectrum indicates potentially noticeable difference around F1, but not F2 region.
  - Noticeable difference of 3 dB may or may not be perceptually important.
  - No spectral artifacts (e.g. clicks) discovered in spectrum.
- No clear link between reduced intelligibility in older listeners and duration or power-spectrum artifacts.
19 Future Work
- Analyze formant transitions, because they provide important perceptual cues.
  - When the speech is compressed, the formant transition may undershoot the target formant frequencies.
- Need quantitative analysis considering JND for young vs old listeners with or without hearing loss.
  - JND of temporal cues depends on age and hearing level.
- Investigate the relationship between the word-level confusion from the listening test results and vowel/consonant duration cues.
20 References
- Fulford (1993). Can learning be more efficient? Using compressed audio tapes to enhance systematically designed text, Educational Technology, 33(2), 51-59.
- Covell, M. (1998). MACH1 for nonuniform time-scale modifications of speech: theory, technique, and comparisons, Proc. IEEE ICASSP, Seattle, WA, vol. 1, 349-352.
- Roucos, S., Wilgus, A. (1985). High quality time-scale modification for speech, Proc. IEEE ICASSP, Tampa, FL, vol. 2, 493-496.
- Hosom, J. P. (2002). Automatic phoneme alignment based on acoustic-phonetic modeling, Proc. ICSLP, Boulder, CO, vol. 1, 357-360.
- Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence, Journal of the Acoustical Society of America, 59(5), 1208-1221.
- Ladefoged, P. (1993). A Course in Phonetics, Chapter 4, English Vowels and Phonological Rules, Rule 6.
- O'Shaughnessy, D. (2000). Speech Communications: Human and Machine, IEEE Press, New York.
- Rothauser, E. H., Chapman, W. D., Guttman, N., Nordby, K. S., Silbiger, H. R., Urbanek, G. E., and Weinstock, M. (1969). IEEE recommended practice for speech quality measurements, IEEE Trans. Audio Electroacoust., 17, 227-246.