Title: Speech acoustics and phonetics
1Speech acoustics and phonetics
- Louis C.W. Pols
- Institute of Phonetic Sciences (IFA)
- Amsterdam Center for Language and Communication (ACLC)
NATO-ASI Dynamics of Speech Production and Perception, Il Ciocco, Tuscany, Italy, July 1, 2002
2Overview
- Dynamics in speech acoustics
- Contour modeling (mainly formants)
- Aspects of spectral undershoot
- Modeling V and C reduction
- Phonetic knowledge from speech corpora
- IFA, CGN, TIMIT, found speech
- Conclusions
4Dynamics in speech acoustics
- Dynamics is the norm, not stationarity
- articulatory efficiency
- Dynamics is everywhere
- generally no word boundaries in speech
- deletion of words, syllables, phonemes; insertion
- within/between word coarticulation/assimilation
- vowel and consonant reduction
- Acoustic manifestations
- segment duration, F0, loudness, spectral quality
5Dynamics is the norm
- The speaker speaks as sloppily as the listeners allow him to in communication: communicative efficiency
- Articulatory vs. perceptual efficiency
- do spectral transitions facilitate or hamper perception? -> see other presentation
- Speaker flexibility: speaking style (clear vs. sloppy), speaking rate
6Dynamics is everywhere
- Deletion
- bread and butter /brEmbY3/
- Amsterdam (Du) /Amst@rdAm/ -> /Ams@dAm/
- koninklijke (Du) /konI?kl@k@/ -> /kol@k@/
- Insertion
- homorganic glide insertion: die een (Du) /dij@n/
- Degemination
- is zichtbaar (Du) /Is zIxtbar/ -> /IsIxbar/
- Reduction, coarticulation, assimilation
7Acoustic manifestations
- pitch, loudness, formant, component contours
- contour stylization (e.g., pitch in praat)
- contour modeling
- n-th degree curve fitting (D. van Bergem)
- Legendre polynomials (R. van Son)
- 16 points per segment (R. van Son)
- (phoneme) segmentation
- by hand (time-consuming, not consistent)
- automatically (via forced phoneme recognition and a pronunciation lexicon with alternatives; systematic errors)
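As an illustration of the contour modeling listed above, the sketch below resamples a formant (or pitch) track to 16 points and describes it with a few Legendre polynomial coefficients. The function names (`resample`, `legendre_fit`) and the choice of an ordinary least-squares fit on the interval [-1, 1] are assumptions made for this example; the slides do not specify the exact fitting procedure.

```python
def resample(track, n=16):
    """Linearly interpolate a contour to n equally spaced points."""
    if len(track) < 2:
        raise ValueError("need at least two samples")
    out = []
    for i in range(n):
        pos = i * (len(track) - 1) / (n - 1)
        lo = min(int(pos), len(track) - 2)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[lo + 1] * frac)
    return out


def legendre_basis(x, order):
    """Values P_0(x) .. P_order(x), computed with the Bonnet recurrence."""
    p = [1.0, x]
    for k in range(1, order):
        p.append(((2 * k + 1) * x * p[k] - k * p[k - 1]) / (k + 1))
    return p[: order + 1]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def legendre_fit(track, order=4, n=16):
    """Least-squares Legendre coefficients (orders 0..order) of a contour.

    Assumption of this sketch: the segment is mapped onto [-1, 1] and the
    coefficients are found by solving the normal equations.
    """
    y = resample(track, n)
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x, order) for x in xs]
    k = order + 1
    ata = [[sum(b[i] * b[j] for b in basis) for j in range(k)] for i in range(k)]
    aty = [sum(b[i] * yi for b, yi in zip(basis, y)) for i in range(k)]
    return solve(ata, aty)


# Made-up example: an F2 track (Hz) rising linearly over a vowel segment.
coeffs = legendre_fit([1400, 1450, 1500, 1550, 1600])
```

With `order=4` the fit returns five coefficients. For a symmetric contour such as this one, the zeroth-order coefficient equals the mean value, the first-order coefficient captures the overall slope, and the second-order coefficient captures the curvature.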
8Contour modeling
- allows modeling of specific phenomena
- pitch accentuation (vs. vowel onset)
- reduction, centralization, undershoot
- allows generation of stimuli for perc. expts.
- phoneme identification in extending context
- 2-alternatives forced choice identif. of continua
- discrimination, RT
- allows statistics on large speech corpora
- TIMIT, CGN, IFA-corpus, Switchboard
9Static vs. dynamic V recogn.
- see Weenink (2001)
- Vowel normalizations with the TIMIT acoustic phonetic speech corpus, IFA Proc. 24, 117-123
- 438 males, both train and test sent. of TIMIT
- 35,385 vowel segments, hand segmented
- 13 monophthongal vowel categories
- 1-Bark band-filter analysis (18 filters), intensity normalized
- 3 frames per segment: central, and 25 ms left/right
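The difference between the static and the dynamic representation can be sketched as follows. This is only an illustration: the helper names and the 5 ms frame step are hypothetical, and each frame is assumed to be a list of band-filter levels for one analysis instant.

```python
def static_features(frames):
    """Static representation: band levels of the central frame only."""
    return list(frames[len(frames) // 2])


def dynamic_features(frames, frame_step_ms=5.0, offset_ms=25.0):
    """Dynamic representation: central frame plus frames 25 ms left/right.

    frame_step_ms is a hypothetical analysis step. Offsets that fall
    outside the segment are clipped to its first or last frame.
    """
    mid = len(frames) // 2
    shift = round(offset_ms / frame_step_ms)
    left = max(mid - shift, 0)
    right = min(mid + shift, len(frames) - 1)
    return list(frames[left]) + list(frames[mid]) + list(frames[right])


# Made-up segment: 21 frames of 18 band levels each.
segment = [[float(i)] * 18 for i in range(21)]
static_vec = static_features(segment)    # 18 values
dynamic_vec = dynamic_features(segment)  # 3 x 18 = 54 values
```

The dynamic vector is three times as long as the static one, so a classifier trained on it can use spectral change within the vowel as well as the spectrum at the vowel center.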
10Some results
- Vowel classif. (%) with discriminant functions
Condition                Items                  Static (1 frame)   Dynamic (3 frames)
Original                 35,385 438x13x(125)    59.3               66.9
speaker normalized       35,385                 62.2               69.2
V centers per speaker    5,374 438x13           78.9               90.1
speaker normalized       5,374                  87.9               94.5
11Formant tracks / speaking rate
- Ph.D. thesis Rob van Son (1993)
- Spectro-temporal features of vowel segments
- see also Speech Comm. 13, 135-148 (Pols & van Son)
- 850-word text, read at normal and fast rate
- hand segmentation of 7 most freq. V and schwa
- formant tracks
- via 16 points per segm. or 5 Legendre polynomials
- influence of rate, V-dur., context, sent. acc.
- evidence for duration-controlled undershoot?
12Some results
- no differences for F1/F2 in vowel center for normal- or fast-rate speech; only some overall rise in F1 for fast rate (irrespective of V)
- same formant track shape (normalized to 16 points) for normal- or fast-rate speech
- same results when using the more elaborate Legendre polynomials
- Concl.: changes in V-duration do not change the amount of undershoot -> active control of articulation speed
13Formant representations
(Figure: zeroth-order Legendre polynomial coefficients (mean Fi in vowel segment); second-order polynomials (axes reversed))
14Modeling vowel reduction
- Ph.D. thesis Dick van Bergem (1995)
- Acoustic and lexical vowel reduction
- see also Speech Communication 16, 329-358
- lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
- acoustic V reduction: /banan, bAnan, b@nan/
- f(sent. acc., w. str., w. class): can-candy-canteen
- coarticulatory effects on the schwa
- C1@C2V- and VC1@C2-type nonsense words
- perceptual effects (full V or schwa, e.g. ananas)
15Some results
(Figure panels: t-n and w-l contexts)
The schwa is not just a centralized vowel but
something that is completely assimilated with its
phonemic context
16Modeling consonant reduction
- Sp. Comm. (1999) 28, 125-140 (van Son & Pols)
- 20 min. speech, both spontaneous and read
- 2 x 791 similar VCV, hand segmented
- 5 aspects of V and C reduction
- related to coarticulation: F2 slope differences at CV- vs. VC-boundaries; F2 locus equations (F2 onset vs. F2 target)
- related to speaking effort: duration; spectral COG (mean freq.); V-C sound energy differences
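Two of these measures are simple enough to sketch in code. The example below assumes that a locus equation is an ordinary least-squares line relating F2 at vowel onset to F2 at the vowel target, and that the spectral center of gravity is the power-weighted mean frequency. Both definitions, and the function names, are assumptions of this sketch, not necessarily the exact procedures used in the cited paper.

```python
def locus_equation(f2_target, f2_onset):
    """Fit f2_onset = slope * f2_target + intercept by least squares."""
    n = len(f2_target)
    if n < 2 or n != len(f2_onset):
        raise ValueError("need two equally long lists with >= 2 tokens")
    mx = sum(f2_target) / n
    my = sum(f2_onset) / n
    sxx = sum((x - mx) ** 2 for x in f2_target)
    sxy = sum((x - mx) * (y - my) for x, y in zip(f2_target, f2_onset))
    slope = sxy / sxx
    return slope, my - slope * mx


def center_of_gravity(freqs_hz, power):
    """Power-weighted mean frequency of a spectrum (assumed COG definition)."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


# Made-up tokens of one consonant before three different vowels (Hz).
slope, intercept = locus_equation([1000, 1500, 2000], [1300, 1550, 1800])
cog = center_of_gravity([100, 200, 300], [1.0, 2.0, 1.0])
```

A slope near 1 means that F2 at the consonant boundary follows the vowel almost completely (strong coarticulation); a slope near 0 means a fixed consonant locus.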
17Some results
- V markedly reduced in spontaneous speech
- lower F2-slope diff. in spontaneous speech -> decrease in articulation speed
- no systematic effect on F2 locus equation: V onsets and targets change in concert -> any V reduction mirrored by comparable change in C
- spont. sp.: V and C shorter, lower COG -> decrease in vocal and articulatory effort
18Access to large corpora
- more, and more realistic, data
- phonetic knowledge via statistical analyses
- e.g. highly accessible IFA-corpus (free, SQL)
- see Structure and access of the open source IFA-corpus, IFA Proc. 24, 15-26 (van Son & Pols)
- on-line: http://www.fon.hum.uva.nl/IFAcorpus/
- 4 M / 4 F speakers, 5.5 hrs of speech
- from informal to read sent., words, syllables
- 50K words segm. and labeled at phoneme level
19Some results
- speech, annot., meta data: relational DB
- realization of final n, e.g. Du geven /xev@(n)/
Style          Words    /@n/    /@/     All     % /@n/   % /@n/ LF   % /@n/ HF
Informal       5,250    1       304     305     0.3
Retelling      6,229    13      236     249     5.2
Narr. story    14,453   180     372     552     33       42          30
Sentences      14,970   203     340     543     37       42          30
Pseudo-sent    2,554    62      19      81      77
All            43,456   459     1,271   1,730   36
Read styles: Narr. story, Sentences, Pseudo-sent
20Spoken Dutch Corpus (CGN)
- 10 M words, 1,000 hrs of speech
- variety of styles, incl. telephone speech
- adult Dutch and Flemish speakers
- for linguistic and technological research
- see various LREC and ICSLP papers (2002)
- see also http://lands.let.kun.nl/cgn/home.htm
- fully transcribed: orthogr., POS, lemmas
- partly transcr.: phonemic, prosodic, syntactic
21TIMIT
- popular DB in acoustic phonetics and ASR
- also telephone version (NTIMIT)
- hand segmented and labeled at phoneme level
- 438 males, 192 females (8 dialect regions)
- 10 sent./sp. (2 fixed, 5 phon. compact, 3 diverse)
- sa1: She had your dark suit in greasy wash water all year
- includes separate test data (112 M, 56 F)
- e.g. Ph.D thesis X. Wang (1997)
- Incorporating knowledge on segmental duration in
HMM-based continuous speech recognition
22Useful info: durational variability
Adopted from Wang (1998)
23All 3,696 training sent. (sx + si) of TIMIT training set
(Figure; axes: normalized phone duration, speaking rate)
24found speech
- DARPA-LVSR community rather ambitious
- Broadcast News (BN), Sp.Comm. 37 (2002)
Year: < 95 (WSJ, NAB read sp.); 1995 (Market place); 1996 (F0-F5, FX partitioned); 1997 (3 hrs test, unpartit.); 1998 (non-Engl. speech also, < 10x RT)
Audio training data: 100 hrs; 10 hrs; 55 hrs; 50 hrs; 100 hrs
Text (for LM): 430 K; 122 M; 540 M; > 900 M
Best WER on test set: 27.0 27.1 146 hrs 16.2 3 hrs 13.5 -> 16.1 3 hrs (10xRT)
For Proc. DARPA Workshops, see http://www.nist.gov/speech/proc/darpa99/index.htm
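For readers unfamiliar with the WER figures quoted on this slide, word error rate is commonly computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below follows that common definition; the function name is illustrative, and the official evaluations apply additional text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of three reference words.
wer = word_error_rate("bread and butter", "bread butter")
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.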
25Articul.-acoustic features in ASR
- A Dutch treatment of an elitist approach to articulatory-acoustic feature classification, Proc. Eurospeech-2001, 1729-1732 (M. Wester et al.)
- Integrating articulatory features into acoustic models for speech recognition, Phonus 5, 73-86 (K. Kirchhoff, 2000)
- An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition, JASA 111 (2), 1086-1101 (J. Sun & L. Deng, 2002)
26Conclusions
- examples of dynamics in speech acoustics
- going from formal to informal speech
- less dynamics, more reduction (artic. guided)
- undershoot vs. speaking style
- sloppiness or articulatory limits?
- functionality of dynamics? -> other paper
- systematicity of dynamics?
- easing ASR, rules for TTS, acquiring knowledge?