1
Unsupervised and Semi-Supervised Learning of
Tone and Pitch Accent
  • Gina-Anne Levow
  • University of Chicago
  • June 6, 2006

2
Roadmap
  • Challenges for Tone and Pitch Accent
  • Variation and Learning
  • Data collections & processing
  • Learning with less
  • Semi-supervised learning
  • Unsupervised clustering
  • Approaches, structure, and context
  • Conclusion

3
Challenges: Tone and Variation
  • Tone and Pitch Accent Recognition
  • Key component of language understanding
  • Lexical tone carries word meaning
  • Pitch accent carries semantic, pragmatic,
    discourse meaning
  • Non-canonical form (Shen 90, Shih 00, Xu 01)
  • Tonal coarticulation modifies surface realization
  • In extreme cases, fall becomes rise
  • Tone is relative
  • To speaker range
  • High for male may be low for female
  • To phrase range, other tones
  • E.g. downstep

4
Challenges: Training Demands
  • Tone and pitch accent recognition
  • Exploit data-intensive machine learning
  • SVMs (Thubthong 01, Levow 05, SLX 05)
  • Boosted and bagged decision trees (X. Sun, 02)
  • HMMs (Wang & Seneff 00, Zhou et al 04,
    Hasegawa-Johnson et al 04)
  • Can achieve good results with large sample sets
  • 10K lab syllabic samples -> > 90% accuracy
  • Training data expensive to acquire
  • Time: pitch accent labeling takes 10s of times
    real-time
  • Money: requires skilled labelers
  • Limits investigation across domains, styles, etc.
  • Human language acquisition doesn't use labels

5
Strategy: Overall
  • Tone and intonation across languages
  • Common machine learning classifiers
  • Acoustic-prosodic model
  • No word label, POS, lexical stress info
  • No explicit tone label sequence model
  • English, Mandarin Chinese

6
Strategy: Training
  • Challenge
  • Can we use the underlying acoustic structure of
    the language through unlabeled examples to
    reduce the need for expensive labeled training
    data?
  • Exploit semi-supervised and unsupervised learning
  • Semi-supervised: Laplacian SVM
  • K-means and asymmetric k-lines clustering
  • Substantially outperform baselines
  • Can approach supervised levels

7
Data Collections I: English
  • English (Ostendorf et al, 95)
  • Boston University Radio News Corpus, f2b
  • Manually ToBI annotated, aligned, syllabified
  • Pitch accent aligned to syllables
  • 4-way: Unaccented, High, Downstepped High, Low
  • (Sun 02, Ross & Ostendorf 95)
  • Binary: Unaccented vs. Accented

8
Data Collections II: Mandarin
  • Mandarin
  • Lexical tones
  • High, Mid-rising, Low, High falling, Neutral

9
Data Collections III: Mandarin
  • Mandarin Chinese
  • Lab speech data (Xu, 1999)
  • 5-syllable utterances varying tone and focus
    position
  • In-focus, pre-focus, post-focus
  • TDT2 Voice of America Mandarin Broadcast News
  • Automatically force-aligned to anchor scripts
  • Automatically segmented, pinyin pronunciation
    lexicon
  • Manually constructed pinyin-ARPABET mapping
  • CU Sonic language porting
  • 4-way: High, Mid-rising, Low, High falling

10
Local Feature Extraction
  • Motivated by Pitch Target Approximation Model
  • Tone/pitch accent target exponentially approached
  • Linear target: height, slope (Xu et al, 99)
  • Scalar features
  • Pitch, intensity: max, mean (Praat,
    speaker-normalized)
  • Pitch at 5 points across voiced region
  • Duration
  • Initial, final in phrase
  • Slope
  • Linear fit to last half of pitch contour
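
To make the feature set concrete, here is a minimal numpy sketch of the scalar features listed above; the function and array names are illustrative assumptions, not code from the original system.

    import numpy as np

    def local_features(f0, intensity, duration):
        """Scalar features for one syllable's voiced region. f0 and
        intensity are 1-D arrays (speaker-normalized, e.g. z-scored);
        duration is in seconds."""
        # Pitch at 5 evenly spaced points across the voiced region
        idx = np.linspace(0, len(f0) - 1, 5).astype(int)
        pitch_points = f0[idx]
        # Slope: linear fit to the last half of the pitch contour
        half = f0[len(f0) // 2:]
        t = np.linspace(0.0, duration / 2.0, len(half))
        slope = np.polyfit(t, half, 1)[0]
        return np.concatenate([
            pitch_points,
            [f0.max(), f0.mean(), intensity.max(), intensity.mean(),
             duration, slope],
        ])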

11
Context Features
  • Local context
  • Extended features
  • Pitch max, mean, adjacent points of adjacent
    syllable
  • Difference features w.r.t. adjacent syllable
  • Difference between
  • Pitch max, mean, mid, slope
  • Intensity max, mean
  • Phrasal context
  • Compute collection average phrase slope
  • Compute scalar pitch values, adjusted for slope
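
A companion sketch for the context features above: differences w.r.t. the adjacent syllable, and subtraction of the collection-average phrase slope. Feature names and the exact compensation form are assumptions for illustration.

    import numpy as np

    def difference_features(cur, prev):
        """Differences w.r.t. the preceding syllable; cur and prev are
        dicts of scalar features (names here are illustrative)."""
        keys = ["pitch_max", "pitch_mean", "pitch_mid", "slope",
                "intensity_max", "intensity_mean"]
        return np.array([cur[k] - prev[k] for k in keys])

    def compensate_phrase(f0, times, avg_phrase_slope):
        """Subtract the collection-average phrase slope from pitch
        samples taken at the given times within the phrase."""
        return f0 - avg_phrase_slope * times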

12
Experimental Configuration
  • English Pitch Accent
  • Proportionally sampled 1000 examples
  • 4-way and binary classification
  • Contextualization: representation, preceding
    syllables
  • Mandarin Tone
  • Balanced tone sets: 400 examples
  • Vary data set difficulty: clean lab -> broadcast
  • 4-tone classification
  • Simple local pitch-only features
  • Prior lab speech experiments effective with local
    features

13
Semi-supervised Learning
  • Approach
  • Employ small amount of labeled data
  • Exploit information from additional presumably
    more available unlabeled data
  • Few prior examples: EM, co-/self-training
    (Ostendorf 05)
  • Classifier
  • Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
  • Semi-supervised variant of SVM
  • Exploits unlabeled examples
  • RBF kernel, typically 6 nearest neighbors

14
Experiments
  • Pitch accent recognition
  • Binary classification: Unaccented/Accented
  • 1000 instances, proportionally sampled
  • Labeled training: 200 unaccented, 100 accented
  • >80% accuracy (cf. 84% for SVM w/15x labeled data)
  • Mandarin tone recognition
  • 4-way classification: n(n-1)/2 pairwise binary
    classifiers
  • 400 instances, balanced; 160 labeled
  • Clean lab speech, in-focus: 94%
  • cf. 99% w/SVM, 1000s of training samples; 85%
    w/SVM, 160 training samples
  • Broadcast news: 70%
  • cf. <50% w/supervised SVM, 160 training samples;
    74% w/4x training

15
Unsupervised Learning
  • Question
  • Can we identify the tone structure of a language
    from the acoustic space without training?
  • Analogous to language acquisition
  • Significant recent research in unsupervised
    clustering
  • Established approaches: k-means
  • Spectral clustering: eigenvector decomposition of
    affinity matrix
  • (Shi & Malik 2000, Fischer & Poland 2004, BNS
    2004)
  • Little research for tone
  • Self-organizing maps (Gauthier et al, 2005)
  • Tones identified in lab speech using f0 velocities

16
Unsupervised Pitch Accent
  • Pitch accent clustering
  • 4-way distinction: 1000 samples, proportionally
    sampled
  • 2-16 clusters constructed
  • Assign most frequent class label to each cluster
    (sketched below)
  • Learner
  • Asymmetric k-lines clustering (Fischer & Poland
    05)
  • Context-dependent kernel radii, non-spherical
    clusters
  • > 78% accuracy
  • Context effects
  • Vector w/context vs. vector w/o context:
    comparable
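
The evaluation step above (give each cluster its most frequent class label, then score) can be rendered as a short sketch; scikit-learn's KMeans stands in here for the asymmetric k-lines learner, whose algorithm appears on a later slide.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_majority_accuracy(X, y, n_clusters=4, seed=0):
        """Cluster, give each cluster its most frequent true label,
        and score the induced labeling."""
        assign = KMeans(n_clusters=n_clusters,
                        random_state=seed).fit_predict(X)
        pred = np.empty_like(y)
        for c in range(n_clusters):
            members = assign == c
            if not members.any():
                continue
            labels, counts = np.unique(y[members], return_counts=True)
            pred[members] = labels[counts.argmax()]  # majority label
        return float((pred == y).mean())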

17
Contrasting Clustering
  • Approaches
  • 3 Spectral approaches
  • Asymmetric k-lines (Fischer & Poland 2004)
  • Symmetric k-lines (Fischer & Poland 2004)
  • Laplacian Eigenmaps (Belkin, Niyogi, Sindhwani
    2004)
  • Binary weights, k-lines clustering
  • K-means: standard Euclidean distance
  • # of clusters: 2-16
  • Best results: > 78%
  • 2-cluster asymmetric k-lines > 2-cluster k-means
  • Larger # of clusters: more similar

18
Contrasting Learners
19
Tone Clustering
  • Mandarin: four tones
  • 400 samples, balanced
  • 2-phase clustering: 2-3 clusters each
  • Asymmetric k-lines
  • Clean read speech
  • In-focus syllables: 87% (cf. 99% supervised)
  • In-focus and pre-focus: 77% (cf. 93% supervised)
  • Broadcast news: 57% (cf. 74% supervised)
  • Contrast
  • K-means, in-focus syllables: 74.75%
  • Requires more clusters to reach asymm. k-lines
    level

20
Tone Structure
First phase of clustering splits high/rising from
low/falling tones by slope; second phase splits by
pitch height or slope
21
Conclusions
  • Exploiting unlabeled examples for tone and pitch
    accent
  • Semi- and Un-supervised approaches
  • Best cases approach supervised levels with less
    training
  • Leveraging both labeled & unlabeled examples is best
  • Both spectral approaches and k-means effective
  • Contextual information less well-exploited than
    in supervised case
  • Exploit acoustic structure of tone and accent
    space

22
Future Work
  • Additional languages, tone inventories
  • Cantonese: 6 tones
  • Bantu family languages: truly rare data
  • Language acquisition
  • Use of child directed speech as input
  • Determination of number of clusters

23
Thanks
  • V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer &
    J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
  • Dinoj Surendran, Siwei Wang, Yi Xu
  • This work supported by NSF Grant 0414919
  • http://people.cs.uchicago.edu/~levow/tai

24
Spectral Clustering in a Nutshell
  • Basic spectral clustering
  • Build affinity matrix
  • Determine dominant eigenvectors and eigenvalues
    of the affinity matrix
  • Compute clustering based on them
  • Approaches differ in
  • Affinity matrix construction
  • Binary weights, conductivity, heat weights
  • Clustering: cut, k-means, k-lines
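
A minimal numpy rendering of this recipe, in the Ng-Jordan-Weiss style with heat-kernel weights and k-means on the eigenvector embedding; the kernel width and normalization choices here are illustrative, not the ones used in the experiments.

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        """Affinity matrix -> dominant eigenvectors -> clustering."""
        # Affinity matrix with Gaussian (heat) weights
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        A = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(A, 0.0)
        # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        L = np.eye(len(X)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        # Dominant structure: the k smallest-eigenvalue eigenvectors
        _, vecs = eigh(L)
        emb = vecs[:, :k]
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # row-normalize
        return KMeans(n_clusters=k, n_init=10).fit_predict(emb)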

25
K-Lines Clustering Algorithm
  • Due to Fischer & Poland 2005
  • 1. Initialize vectors m_1...m_K (e.g. randomly, or
    as the first K eigenvectors of the spectral data
    y_i)
  • 2. For j = 1...K
  • Define P_j as the set of indices of all points y_i
    that are closest to the line defined by m_j, and
    create the matrix M_j = [y_i : i in P_j] whose
    columns are the corresponding vectors y_i
  • 3. Compute the new value of every m_j as the
    first eigenvector of M_j M_j^T
  • 4. Repeat from 2 until the m_j do not change
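
A runnable sketch of steps 1-4 (random initialization; for lines through the origin, the nearest line is the one maximizing the absolute projection |y_i . m_j|):

    import numpy as np

    def k_lines(Y, K, n_iter=100, seed=0):
        """K-lines clustering (Fischer & Poland 2005): cluster points
        around K lines through the origin. Y: (n, d) rows y_i."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize unit vectors m_1..m_K randomly
        M = rng.standard_normal((K, Y.shape[1]))
        M /= np.linalg.norm(M, axis=1, keepdims=True)
        for _ in range(n_iter):
            # Step 2: assign each y_i to the closest line, i.e. the j
            # maximizing |y_i . m_j| for fixed ||y_i||
            assign = np.abs(Y @ M.T).argmax(axis=1)
            # Step 3: new m_j = first (principal) eigenvector of
            # M_j M_j^T, which equals Yj.T @ Yj with rows as points
            M_new = M.copy()
            for j in range(K):
                Yj = Y[assign == j]
                if len(Yj):
                    _, vecs = np.linalg.eigh(Yj.T @ Yj)
                    M_new[j] = vecs[:, -1]  # largest-eigenvalue vector
            # Step 4: stop when the m_j no longer change (up to sign)
            if np.allclose(np.abs((M_new * M).sum(axis=1)), 1.0,
                           atol=1e-9):
                M = M_new
                break
            M = M_new
        return np.abs(Y @ M.T).argmax(axis=1)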

26
Asymmetric Clustering
  • Replace Gaussian kernel of fixed width
  • (Fischer Poland TR-ISDIA-12-04, p. 12),
  • Where tau 2d 1 or 10, largely insensitive to
    tau
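
The exact kernel form is in the technical report; as a sketch of the idea of context-dependent radii, one common construction sets each point's radius from its τ-th nearest-neighbor distance. This specific construction is an assumption for illustration, not the report's formula.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def adaptive_affinity(X, tau=10):
        """Gaussian affinity with per-point radii sigma_i taken from
        each point's tau-th nearest-neighbor distance."""
        D = squareform(pdist(X))             # pairwise distances
        sigma = np.sort(D, axis=1)[:, tau]   # tau-th NN distance
        A = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
        np.fill_diagonal(A, 0.0)
        return A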

27
Laplacian SVM
  • Manifold regularization framework
  • Hypothesize: intrinsic (true) data lies on a low
    dimensional manifold
  • Ambient (observed) data lies in a possibly high
    dimensional space
  • Preserves locality
  • Points close in ambient space should be close in
    intrinsic space
  • Use labeled and unlabeled data to warp function
    space
  • Run SVM on warped space

28
Laplacian SVM (Sindhwani)
29
  • Input: l labeled and u unlabeled examples
  • Output: estimated function f
  • Algorithm
  • Construct adjacency graph. Compute Laplacian L.
  • Choose kernel K(x,y). Compute Gram matrix K.
  • Compute α* = (2γ_A I + 2(γ_I/(l+u)²) L K)^{-1} J^T Y β*,
    with β* from the standard SVM dual QP over the
    correspondingly deformed kernel
  • And output f(x) = Σ_i α_i* K(x_i, x)
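
The β* step needs a QP solver; the squared-loss sibling of LapSVM in the same manifold-regularization framework, Laplacian RLS, has a closed form and exercises the same graph and kernel machinery. A minimal sketch, with illustrative hyperparameters γ_A, γ_I and the 6-NN binary-weight graph from the earlier slide:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.neighbors import kneighbors_graph

    def lap_rls(X, y_labeled, gamma_A=1e-2, gamma_I=1e-2,
                n_neighbors=6):
        """Laplacian RLS (squared-loss counterpart of LapSVM).
        X: (l+u, d), labeled examples first; y_labeled in {-1,+1}.
        Returns alpha with f(x) = sum_i alpha_i K(x_i, x)."""
        n, l = len(X), len(y_labeled)
        # Adjacency graph over all points (6-NN, binary weights)
        W = kneighbors_graph(X, n_neighbors,
                             mode="connectivity").toarray()
        W = np.maximum(W, W.T)             # symmetrize
        L = np.diag(W.sum(axis=1)) - W     # graph Laplacian L = D - W
        K = rbf_kernel(X)                  # Gram matrix
        J = np.zeros((n, n)); J[:l, :l] = np.eye(l)  # labeled selector
        Y = np.concatenate([y_labeled, np.zeros(n - l)])
        # Closed form: (J K + gA l I + (gI l/n^2) L K)^{-1} Y
        A = (J @ K + gamma_A * l * np.eye(n)
             + (gamma_I * l / n ** 2) * (L @ K))
        return np.linalg.solve(A, Y)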

30
Current and Future Work
  • Interactions of tone and intonation
  • Recognition of topic and turn boundaries
  • Effects of topic and turn cues on tone realization
  • Child-directed speech tone learning
  • Support for Computer-assisted tone learning
  • Structured sequence models for tone
  • Sub-syllable segmentation modeling
  • Feature assessment
  • Band energy and intensity in tone recognition

31
Related Work
  • Tonal coarticulation
  • Xu & Sun 02; Xu 97; Shih & Kochanski 00
  • English pitch accent
  • X. Sun 02; Hasegawa-Johnson et al 04; Ross &
    Ostendorf 95
  • Lexical tone recognition
  • SVM recognition of Thai tone (Thubthong 01)
  • Context-dependent tone models
  • Wang & Seneff 00; Zhou et al 04

32
Pitch Target Approximation Model
  • Pitch target
  • Linear model
  • Exponentially approximated
  • In practice, assume target well-approximated by
    mid-point (Sun, 02)
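
For reference, a common written form of the model, as a sketch following Xu et al. 99 (the notation is assumed, not from the slides):

    % Underlying pitch target: a line with slope a and height b
    T(t) = a\,t + b
    % Surface F0 approaches the target exponentially at rate \lambda,
    % starting from an offset c carried over from the prior syllable
    F_0(t) = T(t) + c\,e^{-\lambda t}
    % Sun (02): the attained target is well approximated by the F0
    % value near the syllable mid-point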

33
Classification Experiments
  • Classifier: Support Vector Machine
  • Linear kernel
  • Multiclass formulation
  • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  • 4:1 training / test splits
  • Experiments: effects of
  • Context position: preceding, following, none,
    both
  • Context encoding: Extended/Difference
  • Context type: local, phrasal
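
A minimal scikit-learn rendering of this configuration; SVC (which, like LibSVM, handles multiclass with pairwise one-vs-one binary classifiers) stands in for SVMlight/LibSVM, and the function name and 4:1 split below are illustrative.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def run_split(X, y, seed=0):
        """One 4:1 train/test split with a linear-kernel SVM."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        return clf.score(X_te, y_te)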

34
Results: Local Context
Context           Mandarin Tone (%)   English Pitch Accent (%)
Full              74.5                81.3
Extend Pre+Post   74.0                80.7
Extend Pre        74.0                79.9
Extend Post       70.5                76.7
Diffs Pre+Post    75.5                80.7
Diffs Pre         76.5                79.5
Diffs Post        69.0                77.3
Both Pre          76.5                79.7
Both Post         71.5                77.6
No context        68.5                75.9
38
Discussion Local Context
  • Any context information improves over none
  • Preceding context information consistently
    improves over none or following context
    information
  • English: generally, more context features are
    better
  • Mandarin: following context can degrade
    performance
  • Little difference in encoding (Extend vs Diffs)
  • Consistent with phonological analysis (Xu) that
    carryover coarticulation is greater than
    anticipatory

39
Results & Discussion: Phrasal Context
Phrase Context   Mandarin Tone (%)   English Pitch Accent (%)
Phrase           75.5                81.3
No Phrase        72.0                79.9
  • Phrase contour compensation enhances recognition
  • Simple strategy
  • Use of non-linear slope compensation may improve
    results

40
Context Summary
  • Employ common acoustic representation
  • Tone (Mandarin), pitch accent (English)
  • Cantonese: 64%; 68% with RBF kernel
  • SVM classifiers, linear kernel: 76%, 81%
  • Local context effects
  • Up to > 20% relative reduction in error
  • Preceding context greatest contribution
  • Carryover vs. anticipatory
  • Phrasal context effects
  • Compensation for phrasal contour improves
    recognition

42
Aside: More Tones
  • Cantonese
  • CUSENT corpus of read broadcast news text
  • Same feature extraction & representation
  • 6 tones
  • High level, high rise, mid level, low fall, low
    rise, low level
  • SVM classification
  • Linear kernel: 64%; Gaussian kernel: 68%
  • Tones 3, 6: 50%; mutually indistinguishable (50%
    pairwise)
  • Human levels: no context 50%; with context 68%
  • Augment with syllable phone sequence
  • 86% accuracy; for the 90% of syllables w/tone 3 or
    6, one tone dominates

43
Aside: Voice Quality & Energy
  • By Dinoj Surendran
  • Assess local voice quality and energy features
    for tone
  • Not typically associated with Mandarin
  • Considered
  • VQ: NAQ, AQ, etc.; spectral balance; spectral
    tilt; band energy
  • Useful: band energy significantly improves
  • Esp. for neutral tone
  • Supports identification of unstressed syllables
  • Spectral balance predicts stress in Dutch

44
Roadmap
  • Challenges for Tone and Pitch Accent
  • Contextual effects
  • Training demands
  • Modeling Context for Tone and Pitch Accent
  • Data collections & processing
  • Integrating context
  • Context in Recognition
  • Reducing Training demands
  • Data collections & structure
  • Semi-supervised learning
  • Unsupervised clustering
  • Conclusion

45
Strategy: Context
  • Exploit contextual information
  • Features from adjacent syllables
  • Height, shape: direct, relative
  • Compensate for phrase contour
  • Analyze impact of
  • Context position, context encoding, context type
  • > 20% relative improvement over no context