1
Speech acoustics and phonetics
  • Louis C.W. Pols
  • Institute of Phonetic Sciences (IFA)
  • Amsterdam Center for Language and Communication
    (ACLC)

NATO-ASI Dynamics of Speech Production and Perception,
Il Ciocco, Tuscany, Italy, July 1, 2002
2
Overview
  • Dynamics in speech acoustics
  • Contour modeling (mainly formants)
  • Aspects of spectral undershoot
  • Modeling V and C reduction
  • Phonetic knowledge from speech corpora
  • IFA, CGN, TIMIT, found speech
  • Conclusions

4
Dynamics in speech acoustics
  • Dynamics is the norm, not stationarity
  • articulatory efficiency
  • Dynamics is everywhere
  • generally no word boundaries in speech
  • deletion of words, syllables, phonemes; insertion
  • within/between word coarticulation/assimilation
  • vowel and consonant reduction
  • Acoustic manifestations
  • segment duration, F0, loudness, spectral quality

5
Dynamics is the norm
  • The speaker speaks as sloppily as the listeners
    allow him to in communication
  • communicative efficiency
  • Articulatory vs. perceptual efficiency
  • do spectral transitions facilitate or hamper
    perception? -> see other presentation
  • Speaker flexibility: speaking style (clear vs.
    sloppy), speaking rate

6
Dynamics is everywhere
  • Deletion
  • bread and butter /brEmbY3/
  • Amsterdam (Du) /Amst@rdAm/ -> /Ams@dAm/
  • koninklijke (Du) /konI?kl@k@/ -> /kol@k@/
  • Insertion
  • homorganic glide insertion: die een (Du)
    /dij@n/
  • Degemination
  • is zichtbaar (Du) /Is zIxtbar/ -> /IsIxbar/
  • Reduction, coarticulation, assimilation

7
Acoustic manifestations
  • pitch, loudness, formant, component contours
  • contour stylization (e.g., pitch in praat)
  • contour modeling
  • n-th degree curve fitting (D. van Bergem)
  • Legendre polynomials (R. van Son)
  • 16 points per segment (R. van Son)
  • (phoneme) segmentation
  • by hand (time-consuming, inconsistent)
  • automatically (via forced phoneme recognition
    and a pronunciation lexicon with alternatives;
    systematic errors)
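
The two simplest contour representations listed above, a fixed number of points per segment and an n-th degree polynomial fit, can be sketched as follows. This is a minimal illustration assuming NumPy. The function names and the F2 values are hypothetical and are not taken from the cited theses.

```python
import numpy as np

def resample_contour(times, values, n_points=16):
    """Linearly resample a contour to n_points equally spaced samples."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.linspace(times[0], times[-1], n_points)
    return grid, np.interp(grid, times, values)

def fit_polynomial(times, values, degree=2):
    """Least-squares polynomial fit on time normalized to [0, 1].

    Returns coefficients with the highest power first (np.polyfit order).
    """
    times = np.asarray(times, dtype=float)
    t_norm = (times - times[0]) / (times[-1] - times[0])
    return np.polyfit(t_norm, values, degree)

# Hypothetical F2 track (Hz): 10 samples over a 90 ms vowel, parabolic shape.
t = np.linspace(0.0, 0.09, 10)
tn = t / 0.09
f2 = 1200.0 + 1200.0 * tn * (1.0 - tn)

grid, track16 = resample_contour(t, f2)   # duration-normalized 16-point track
coeffs = fit_polynomial(t, f2, degree=2)  # second-degree fit of the same track
print(track16.round(1))
print(coeffs.round(1))
```

Normalizing time before fitting makes the coefficients comparable across segments of different duration, which is what allows tracks from normal-rate and fast-rate speech to be compared point by point.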

8
Contour modeling
  • allows modeling of specific phenomena
  • pitch accentuation (vs. vowel onset)
  • reduction, centralization, undershoot
  • allows generation of stimuli for perc. expts.
  • phoneme identification in extending context
  • 2-alternative forced-choice identif. of continua
  • discrimination, RT
  • allows statistics on large speech corpora
  • TIMIT, CGN, IFA-corpus, Switchboard

9
Static vs. dynamic V recogn.
  • see Weenink (2001)
  • Vowel normalizations with the TIMIT acoustic
    phonetic speech corpus, IFA Proc. 24, 117-123
  • 438 males, both train and test sent. of TIMIT
  • 35,385 vowel segments, hand segmented
  • 13 monophthongal vowel categories
  • 1-Bark bandfilter anal. (18), intensity normal.
  • 3 frames per segment: central and 25 ms L/R
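
The difference between the static (1 frame) and dynamic (3 frames) representation can be sketched as below. This is an illustration only: the function name, the 5 ms frame step and the dummy data are assumptions, not details of Weenink (2001).

```python
import numpy as np

def segment_features(spectra, frame_step, seg_start, seg_end, offset=0.025):
    """Return (static, dynamic) feature vectors for one vowel segment.

    spectra    : array (n_frames, n_bands), e.g. 18 band levels per frame
    frame_step : time between successive frames in seconds (assumed)
    offset     : distance of the left/right frames from the centre (25 ms)
    """
    centre = 0.5 * (seg_start + seg_end)
    times = np.array([centre - offset, centre, centre + offset])
    # Nearest frame index for each time, clipped to the available frames.
    idx = np.clip(np.rint(times / frame_step).astype(int), 0, len(spectra) - 1)
    static = spectra[idx[1]]            # central frame only
    dynamic = spectra[idx].reshape(-1)  # left, central, right concatenated
    return static, dynamic

# Dummy spectrogram: 100 frames of 18 bands, one frame every 5 ms.
spectra = np.arange(100 * 18, dtype=float).reshape(100, 18)
static, dynamic = segment_features(spectra, 0.005, seg_start=0.100, seg_end=0.200)
print(static.shape, dynamic.shape)
```

With 18 bands the static vector has 18 dimensions and the dynamic vector 54, so the dynamic condition gives the classifier information about spectral change within the vowel.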

10
Some results
  • Vowel classif. (%) with discriminant functions

Condition                Items                  Static (1 frame)   Dynamic (3 frames)
Original                 35,385 438x13x(125)    59.3               66.9
  speaker normalized     35,385                 62.2               69.2
V centers per speaker    5,374 438x13           78.9               90.1
  speaker normalized     5,374                  87.9               94.5
11
Formant tracks / speaking rate
  • Ph.D. thesis Rob van Son (1993)
  • Spectro-temporal features of vowel segments
  • see also Speech Comm. 13, 135-148 (Pols & van Son)
  • 850-word text, read at normal and fast rate
  • hand segmentation of 7 most freq. V + schwa
  • formant tracks
  • via 16 points per segm. or 5 Legendre polynomials
  • influence of rate, V-dur., context, sent. acc.
  • evidence for duration-controlled undershoot?
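
A formant track normalized to 16 points can be summarized by a handful of Legendre coefficients, as in this minimal sketch. It assumes NumPy's `numpy.polynomial.legendre` module and reads "5 Legendre polynomials" as orders 0 to 4, which is an assumption. The track values are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(track, order=4):
    """Fit Legendre polynomials P0..P_order to a track on time in [-1, 1]."""
    track = np.asarray(track, dtype=float)
    x = np.linspace(-1.0, 1.0, len(track))
    return legendre.legfit(x, track, order)

# Hypothetical 16-point F2 track built from known components:
# level 1500 Hz (P0), slope 200 Hz (P1), curvature -150 Hz (P2).
x16 = np.linspace(-1.0, 1.0, 16)
track = 1500.0 + 200.0 * x16 - 150.0 * 0.5 * (3.0 * x16**2 - 1.0)

c = legendre_coefficients(track)
print(c.round(3))

# c[0] is the mean of the fitted curve over the segment, c[1] its overall
# tilt and c[2] its curvature; higher orders capture finer detail.
reconstructed = legendre.legval(x16, c)
```

Because the Legendre polynomials are orthogonal on [-1, 1], each coefficient can be interpreted on its own (mean, tilt, curvature), which a plain power-series fit does not offer.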

12
Some results
  • no differences for F1/F2 in vowel center for
    normal- or fast-rate speech; only some overall
    rise in F1 for fast rate (irrespective of V)
  • same formant track shape (normalized to 16
    points) for normal- or fast-rate speech
  • same results when using the more elaborate
    Legendre polynomials
  • Concl.: changes in V-duration do not change the
    amount of undershoot -> active control
    of articulation speed

13
Formant representations
[Figure: zeroth-order Legendre polynomial coefficients (mean Fi in vowel segment); second-order polynomials (axes reversed)]
14
Modeling vowel reduction
  • Ph.D. thesis Dick van Bergem (1995)
  • Acoustic and lexical vowel reduction
  • see also Speech Communication 16, 329-358
  • lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
  • acoustic V reduction: /banan, bAnan, b@nan/
  • f(sent. acc., w. str., w. class):
    can-candy-canteen
  • coarticulatory effects on the schwa
  • C1@C2V- and VC1@C2-type nonsense words
  • perceptual effects (full V or schwa, e.g.
    ananas)

15
Some results
[Figure panels: t-n, w-l]
The schwa is not just a centralized vowel but
something that is completely assimilated to its
phonemic context
16
Modeling consonant reduction
  • Sp. Comm. (1999) 28, 125-140 (van Son & Pols)
  • 20 min. speech, both spontaneous and read
  • 2 x 791 similar VCV, hand segmented
  • 5 aspects of V and C reduction
  • related to coarticulation: F2 slope differences
    at CV- vs. VC-boundaries; F2 locus equations (F2
    onset vs. F2 target)
  • related to speaking effort: duration; spectral
    COG (mean freq.); V-C sound energy differences
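
An F2 locus equation is a straight-line fit of F2 at the consonant boundary (onset) against F2 in the vowel (target), taken over many vowel contexts for one consonant. The sketch below assumes NumPy, and the measurement values are made up for illustration.

```python
import numpy as np

def locus_equation(f2_target, f2_onset):
    """Fit F2_onset = slope * F2_target + intercept by least squares."""
    slope, intercept = np.polyfit(f2_target, f2_onset, 1)
    return float(slope), float(intercept)

# Hypothetical F2 measurements (Hz) for one consonant in five vowel contexts.
f2_target = np.array([900.0, 1200.0, 1500.0, 1900.0, 2300.0])
f2_onset = 0.6 * f2_target + 700.0  # made-up data lying exactly on a line

slope, intercept = locus_equation(f2_target, f2_onset)
print(round(slope, 3), round(intercept, 1))
```

A slope near 1 means the onset follows the vowel (strong coarticulation), and a slope near 0 means the onset stays at a fixed consonant locus regardless of the vowel.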

17
Some results
  • V markedly reduced in spontaneous speech
  • lower F2-slope diff. in spontaneous speech ->
    decrease in articulation speed
  • no systematic effect on F2 locus equation: V
    onsets and targets change in concert -> any
    V reduction mirrored by comparable change in C
  • spont. sp.: V and C shorter, lower COG ->
    decrease in vocal and articulatory effort
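
The spectral center of gravity (COG) used above as an effort measure is the power-weighted mean frequency of a spectrum. The sketch below is one common way to compute it, assuming NumPy and a Hann window. The window choice and the test tone are assumptions, not details from the cited paper.

```python
import numpy as np

def spectral_cog(signal, sample_rate):
    """Power-weighted mean frequency (Hz) of a Hann-windowed signal."""
    signal = np.asarray(signal, dtype=float)
    windowed = signal * np.hanning(len(signal))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * power) / np.sum(power))

# Sanity check on a pure 1 kHz tone sampled at 16 kHz.
fs = 16000
n = np.arange(1024)
tone = np.sin(2.0 * np.pi * 1000.0 * n / fs)
cog = spectral_cog(tone, fs)
print(round(cog, 1))
```

For speech, a lower COG indicates relatively less energy at high frequencies, which is why it is read as reduced vocal effort.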

18
Access to large corpora
  • more, and more realistic, data
  • phonetic knowledge via statistical analyses
  • e.g. highly accessible IFA-corpus (free, SQL)
  • see Structure and access of the open source
    IFA-corpus, IFA Proc. 24, 15-26 (van Son & Pols)
  • on-line: http://www.fon.hum.uva.nl/IFAcorpus/
  • 4 M / 4 F speakers, 5.5 hrs of speech
  • from informal to read sent., words, syllables
  • 50 K words segm. and labeled at phoneme level

19
Some results
  • speech, annot., meta data in relational DB
  • realization of final n, e.g. Du geven /xev@(n)/

Style         Words    /@n/    /@/     All     % /@n/   % /@n/ LF   % /@n/ HF
Informal      5,250    1       304     305     0.3
Retelling     6,229    13      236     249     5.2
Narr. story   14,453   180     372     552     33       42          30
Sentences     14,970   203     340     543     37       42          30
Pseudo-sent   2,554    62      19      81      77
All           43,456   459     1,271   1,730   36
Read
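
The per-style percentages of realized final /n/ follow directly from the two count columns. A small sketch that recomputes them from the counts in the table (the function and variable names are illustrative):

```python
# (count with final /n/ realized, count with schwa only) per speaking style,
# copied from the table above.
counts = {
    "Informal": (1, 304),
    "Retelling": (13, 236),
    "Narr. story": (180, 372),
    "Sentences": (203, 340),
    "Pseudo-sent": (62, 19),
}

def percent_with_n(with_n, without_n):
    """Percentage of tokens in which the final /n/ is realized."""
    return 100.0 * with_n / (with_n + without_n)

percentages = {style: percent_with_n(*c) for style, c in counts.items()}
for style, pct in percentages.items():
    print(f"{style:12s} {pct:5.1f}")
```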
20
Spoken Dutch Corpus (CGN)
  • 10 M words, 1,000 hrs of speech
  • variety of styles, incl. telephone speech
  • adult Dutch and Flemish speakers
  • for linguistic and technological research
  • see various LREC and ICSLP papers (2002)
  • see also http://lands.let.kun.nl/cgn/home.htm
  • fully transcribed: orthogr., POS, lemmas
  • partly transcr.: phonemic, prosodic, syntactic

21
TIMIT
  • popular DB in acoustic phonetics and ASR
  • also telephone version (NTIMIT)
  • hand segmented and labeled at phoneme level
  • 438 males, 192 females (8 dialect regions)
  • 10 sent./sp. (2 fixed, 5 phon. compact, 3
    diverse)
  • sa1: "She had your dark suit in greasy wash water
    all year"
  • includes separate test data (112 M, 56 F)
  • e.g. Ph.D. thesis X. Wang (1997)
  • Incorporating knowledge on segmental duration in
    HMM-based continuous speech recognition

22
Useful info: durational variability
Adapted from Wang (1998)
23
[Figure: normalized phone duration vs. speaking rate, all 3,696 training sent. (sx, si) of TIMIT training set]
24
found speech
  • DARPA-LVCSR community: rather ambitious
  • Broadcast News (BN), Sp.Comm. 37 (2002)

Year / task: < 95: WSJ, NAB read sp. | 1995: Market place | 1996: F0-F5, FX partitioned | 1997: 3 hrs test, unpartit. | 1998: non-Engl. speech also, < 10x RT
audio training data: 100 hrs | 10 hrs | 55 hrs | 50 hrs | 100 hrs
text (for LM): 430 K 122 M 540 M > 900 M
best WER on test set: 27.0 27.1 146 hrs 16.2 3 hrs 13.5 -> 16.1 3 hrs (10xRT)
For Proc. DARPA Workshops, see
http://www.nist.gov/speech/proc/darpa99/index.htm
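
The word error rate (WER) reported in these evaluations is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference, divided by the number of reference words. The sketch below implements that standard definition with a word-level edit distance. It is not the NIST scoring tool, and the example sentences are only illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: minimum word-level edits / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

ref = "she had your dark suit in greasy wash water all year"
hyp = "she had her dark suit in greasy wash all year"
wer = word_error_rate(ref, hyp)  # one substitution and one deletion
print(round(wer, 1))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.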
25
Articul.-acoustic features in ASR
  • A Dutch treatment of an elitist approach to
    articulatory-acoustic feature classification,
    Proc. Eurospeech-2001, 1729-1732 (M. Wester et
    al.)
  • Integrating articulatory features into acoustic
    models for speech recognition, Phonus 5, 73-86
    (K. Kirchhoff, 2000)
  • An overlapping-feature-based phonological model
    incorporating linguistic constraints:
    Applications to speech recognition, JASA 111
    (2), 1086-1101 (J. Sun & L. Deng, 2002)

26
Conclusions
  • examples of dynamics in speech acoustics
  • going from formal to informal speech
  • less dynamics, more reduction (artic. guided)
  • undershoot vs. speaking style
  • sloppiness or articulatory limits?
  • functionality of dynamics? -> other paper
  • systematicity of dynamics?
  • easing ASR, rules for TTS, acquiring knowledge?