1
Unsupervised and Semi-Supervised Learning of
Tone and Pitch Accent
  • Gina-Anne Levow
  • University of Chicago
  • June 6, 2006

2
Roadmap
  • Challenges for Tone and Pitch Accent
  • Variation and Learning
  • Data collections & processing
  • Learning with less
  • Semi-supervised learning
  • Unsupervised clustering
  • Approaches, structure, and context
  • Conclusion

3
Challenges: Tone and Variation
  • Tone and Pitch Accent Recognition
  • Key component of language understanding
  • Lexical tone carries word meaning
  • Pitch accent carries semantic, pragmatic,
    discourse meaning
  • Non-canonical form (Shen 90, Shih 00, Xu 01)
  • Tonal coarticulation modifies surface realization
  • In extreme cases, fall becomes rise
  • Tone is relative
  • To speaker range
  • High for male may be low for female
  • To phrase range, other tones
  • E.g. downstep

4
Challenges: Training Demands
  • Tone and pitch accent recognition
  • Exploit data-intensive machine learning
  • SVMs (Thubthong 01, Levow 05, SLX 05)
  • Boosted and bagged decision trees (X. Sun, 02)
  • HMMs (Wang & Seneff 00, Zhou et al 04,
    Hasegawa-Johnson et al 04)
  • Can achieve good results with large sample sets
  • 10K lab syllabic samples -> > 90% accuracy
  • Training data expensive to acquire
  • Time: pitch accent labeling takes 10s of times
    real-time
  • Money: requires skilled labelers
  • Limits investigation across domains, styles, etc.
  • Human language acquisition doesn't use labels

5
Strategy: Overall
  • Tone and intonation across languages
  • Common machine learning classifiers
  • Acoustic-prosodic model
  • No word label, POS, lexical stress info
  • No explicit tone label sequence model
  • English, Mandarin Chinese

6
Strategy: Training
  • Challenge
  • Can we use the underlying acoustic structure of
    the language through unlabeled examples to
    reduce the need for expensive labeled training
    data?
  • Exploit semi-supervised and unsupervised learning
  • Semi-supervised: Laplacian SVM
  • K-means and asymmetric k-lines clustering
  • Substantially outperform baselines
  • Can approach supervised levels

7
Data Collections I: English
  • English (Ostendorf et al, 95)
  • Boston University Radio News Corpus, f2b
  • Manually ToBI annotated, aligned, syllabified
  • Pitch accent aligned to syllables
  • 4-way: Unaccented, High, Downstepped High, Low
  • (Sun 02, Ross & Ostendorf 95)
  • Binary: Unaccented vs. Accented

8
Data Collections II: Mandarin
  • Mandarin
  • Lexical tones
  • High, Mid-rising, Low, High falling, Neutral

9
Data Collections III: Mandarin
  • Mandarin Chinese
  • Lab speech data (Xu, 1999)
  • 5-syllable utterances varying tone and focus
    position
  • In-focus, pre-focus, post-focus
  • TDT2 Voice of America Mandarin Broadcast News
  • Automatically force-aligned to anchor scripts
  • Automatically segmented, pinyin pronunciation
    lexicon
  • Manually constructed pinyin-ARPABET mapping
  • CU Sonic language porting
  • 4-way: High, Mid-rising, Low, High falling

10
Local Feature Extraction
  • Motivated by Pitch Target Approximation Model
  • Tone/pitch accent target exponentially approached
  • Linear target: height, slope (Xu et al, 99)
  • Scalar features
  • Pitch, intensity: max, mean (Praat,
    speaker-normalized)
  • Pitch at 5 points across voiced region
  • Duration
  • Initial, final in phrase
  • Slope
  • Linear fit to last half of pitch contour
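
To make the feature set concrete, here is a minimal numpy sketch of the scalar features listed above; the function and array names are illustrative assumptions, not code from the original system.

    import numpy as np

    def local_features(f0, intensity, duration):
        """Scalar features for one syllable's voiced region. f0 and
        intensity are 1-D arrays (speaker-normalized, e.g. z-scored);
        duration is in seconds."""
        # Pitch at 5 evenly spaced points across the voiced region
        idx = np.linspace(0, len(f0) - 1, 5).astype(int)
        pitch_points = f0[idx]
        # Slope: linear fit to the last half of the pitch contour
        half = f0[len(f0) // 2:]
        t = np.linspace(0.0, duration / 2.0, len(half))
        slope = np.polyfit(t, half, 1)[0]
        return np.concatenate([
            pitch_points,
            [f0.max(), f0.mean(), intensity.max(), intensity.mean(),
             duration, slope],
        ])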

11
Context Features
  • Local context
  • Extended features
  • Pitch max, mean, adjacent points of adjacent
    syllable
  • Difference features w.r.t. adjacent syllable
  • Difference between
  • Pitch max, mean, mid, slope
  • Intensity max, mean
  • Phrasal context
  • Compute collection average phrase slope
  • Compute scalar pitch values, adjusted for slope
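
A companion sketch for the context features above: differences w.r.t. the adjacent syllable, and subtraction of the collection-average phrase slope. Feature names and the exact compensation form are assumptions for illustration.

    import numpy as np

    def difference_features(cur, prev):
        """Differences w.r.t. the preceding syllable; cur and prev are
        dicts of scalar features (names here are illustrative)."""
        keys = ["pitch_max", "pitch_mean", "pitch_mid", "slope",
                "intensity_max", "intensity_mean"]
        return np.array([cur[k] - prev[k] for k in keys])

    def compensate_phrase(f0, times, avg_phrase_slope):
        """Subtract the collection-average phrase slope from pitch
        samples taken at the given times within the phrase."""
        return f0 - avg_phrase_slope * times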

12
Experimental Configuration
  • English Pitch Accent
  • Proportionally sampled 1000 examples
  • 4-way and binary classification
  • Contextualization: representation, preceding
    syllables
  • Mandarin Tone
  • Balanced tone sets: 400 examples
  • Vary data set difficulty: clean lab -> broadcast
  • 4-tone classification
  • Simple local pitch-only features
  • Prior lab speech experiments effective with local
    features

13
Semi-supervised Learning
  • Approach
  • Employ small amount of labeled data
  • Exploit information from additional presumably
    more available unlabeled data
  • Few prior examples: EM, co-/self-training
    (Ostendorf 05)
  • Classifier
  • Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
  • Semi-supervised variant of SVM
  • Exploits unlabeled examples
  • RBF kernel, typically 6 nearest neighbors

14
Experiments
  • Pitch accent recognition
  • Binary classification: Unaccented/Accented
  • 1000 instances, proportionally sampled
  • Labeled training: 200 unaccented, 100 accented
  • >80% accuracy (cf. 84% for SVM w/15x labeled data)
  • Mandarin tone recognition
  • 4-way classification: n(n-1)/2 pairwise binary
    classifiers
  • 400 instances, balanced; 160 labeled
  • Clean lab speech, in-focus: 94%
  • cf. 99% w/SVM, 1000s of training samples; 85%
    w/SVM, 160 training samples
  • Broadcast news: 70%
  • cf. <50% w/supervised SVM, 160 training samples;
    74% w/4x training

15
Unsupervised Learning
  • Question
  • Can we identify the tone structure of a language
    from the acoustic space without training?
  • Analogous to language acquisition
  • Significant recent research in unsupervised
    clustering
  • Established approaches: k-means
  • Spectral clustering: eigenvector decomposition of
    affinity matrix
  • (Shi & Malik 2000, Fischer & Poland 2004, BNS
    2004)
  • Little research for tone
  • Self-organizing maps (Gauthier et al, 2005)
  • Tones identified in lab speech using f0 velocities

16
Unsupervised Pitch Accent
  • Pitch accent clustering
  • 4-way distinction: 1000 samples, proportionally
    sampled
  • 2-16 clusters constructed
  • Assign most frequent class label to each cluster
    (sketched below)
  • Learner
  • Asymmetric k-lines clustering (Fischer & Poland
    05)
  • Context-dependent kernel radii, non-spherical
    clusters
  • > 78% accuracy
  • Context effects
  • Vector w/context vs. vector w/o context:
    comparable
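
The evaluation step above (give each cluster its most frequent class label, then score) can be rendered as a short sketch; scikit-learn's KMeans stands in here for the asymmetric k-lines learner, whose algorithm appears on a later slide.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_majority_accuracy(X, y, n_clusters=4, seed=0):
        """Cluster, give each cluster its most frequent true label,
        and score the induced labeling."""
        assign = KMeans(n_clusters=n_clusters,
                        random_state=seed).fit_predict(X)
        pred = np.empty_like(y)
        for c in range(n_clusters):
            members = assign == c
            if not members.any():
                continue
            labels, counts = np.unique(y[members], return_counts=True)
            pred[members] = labels[counts.argmax()]  # majority label
        return float((pred == y).mean())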

17
Contrasting Clustering
  • Approaches
  • 3 Spectral approaches
  • Asymmetric k-lines (Fischer & Poland 2004)
  • Symmetric k-lines (Fischer & Poland 2004)
  • Laplacian Eigenmaps (Belkin, Niyogi, Sindhwani
    2004)
  • Binary weights, k-lines clustering
  • K-means: standard Euclidean distance
  • # of clusters: 2-16
  • Best results: > 78%
  • 2-cluster asymmetric k-lines > 2-cluster k-means
  • Larger # of clusters: more similar

18
Contrasting Learners
19
Tone Clustering
  • Mandarin: four tones
  • 400 samples, balanced
  • 2-phase clustering: 2-3 clusters each
  • Asymmetric k-lines
  • Clean read speech
  • In-focus syllables: 87% (cf. 99% supervised)
  • In-focus and pre-focus: 77% (cf. 93% supervised)
  • Broadcast news: 57% (cf. 74% supervised)
  • Contrast
  • K-means, in-focus syllables: 74.75%
  • Requires more clusters to reach asymm. k-lines
    level

20
Tone Structure
First phase of clustering splits high/rising from
low/falling tones by slope; second phase splits by
pitch height or slope
21
Conclusions
  • Exploiting unlabeled examples for tone and pitch
    accent
  • Semi- and Un-supervised approaches
  • Best cases approach supervised levels with less
    training
  • Leveraging both labeled & unlabeled examples is best
  • Both spectral approaches and k-means effective
  • Contextual information less well-exploited than
    in supervised case
  • Exploit acoustic structure of tone and accent
    space

22
Future Work
  • Additional languages, tone inventories
  • Cantonese: 6 tones
  • Bantu family languages: truly rare data
  • Language acquisition
  • Use of child directed speech as input
  • Determination of number of clusters

23
Thanks
  • V. Sindhwani, M. Belkin, P. Niyogi; I. Fischer &
    J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
  • Dinoj Surendran, Siwei Wang, Yi Xu
  • This work supported by NSF Grant 0414919
  • http://people.cs.uchicago.edu/~levow/tai

24
Spectral Clustering in a Nutshell
  • Basic spectral clustering
  • Build affinity matrix
  • Determine dominant eigenvectors and eigenvalues
    of the affinity matrix
  • Compute clustering based on them
  • Approaches differ in
  • Affinity matrix construction
  • Binary weights, conductivity, heat weights
  • Clustering: cut, k-means, k-lines
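
A minimal numpy rendering of this recipe, in the Ng-Jordan-Weiss style with heat-kernel weights and k-means on the eigenvector embedding; the kernel width and normalization choices here are illustrative, not the ones used in the experiments.

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        """Affinity matrix -> dominant eigenvectors -> clustering."""
        # Affinity matrix with Gaussian (heat) weights
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        A = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(A, 0.0)
        # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        L = np.eye(len(X)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        # Dominant structure: the k smallest-eigenvalue eigenvectors
        _, vecs = eigh(L)
        emb = vecs[:, :k]
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # row-normalize
        return KMeans(n_clusters=k, n_init=10).fit_predict(emb)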

25
K-Lines Clustering Algorithm
  • Due to Fischer & Poland 2005
  • 1. Initialize vectors m_1...m_K (e.g. randomly, or
    as the first K eigenvectors of the spectral data
    y_i)
  • 2. For j = 1...K
  • Define P_j as the set of indices of all points y_i
    that are closest to the line defined by m_j, and
    create the matrix M_j = [y_i : i in P_j] whose
    columns are the corresponding vectors y_i
  • 3. Compute the new value of every m_j as the
    first eigenvector of M_j M_j^T
  • 4. Repeat from 2 until the m_j do not change
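
A runnable sketch of steps 1-4 (random initialization; for lines through the origin, the nearest line is the one maximizing the absolute projection |y_i . m_j|):

    import numpy as np

    def k_lines(Y, K, n_iter=100, seed=0):
        """K-lines clustering (Fischer & Poland 2005): cluster points
        around K lines through the origin. Y: (n, d) rows y_i."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize unit vectors m_1..m_K randomly
        M = rng.standard_normal((K, Y.shape[1]))
        M /= np.linalg.norm(M, axis=1, keepdims=True)
        for _ in range(n_iter):
            # Step 2: assign each y_i to the closest line, i.e. the j
            # maximizing |y_i . m_j| for fixed ||y_i||
            assign = np.abs(Y @ M.T).argmax(axis=1)
            # Step 3: new m_j = first (principal) eigenvector of
            # M_j M_j^T, which equals Yj.T @ Yj with rows as points
            M_new = M.copy()
            for j in range(K):
                Yj = Y[assign == j]
                if len(Yj):
                    _, vecs = np.linalg.eigh(Yj.T @ Yj)
                    M_new[j] = vecs[:, -1]  # largest-eigenvalue vector
            # Step 4: stop when the m_j no longer change (up to sign)
            if np.allclose(np.abs((M_new * M).sum(axis=1)), 1.0,
                           atol=1e-9):
                M = M_new
                break
            M = M_new
        return np.abs(Y @ M.T).argmax(axis=1)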

26
Asymmetric Clustering
  • Replace Gaussian kernel of fixed width
  • (Fischer Poland TR-ISDIA-12-04, p. 12),
  • Where tau 2d 1 or 10, largely insensitive to
    tau
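
The exact kernel form is in the technical report; as a sketch of the idea of context-dependent radii, one common construction sets each point's radius from its τ-th nearest-neighbor distance. This specific construction is an assumption for illustration, not the report's formula.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def adaptive_affinity(X, tau=10):
        """Gaussian affinity with per-point radii sigma_i taken from
        each point's tau-th nearest-neighbor distance."""
        D = squareform(pdist(X))             # pairwise distances
        sigma = np.sort(D, axis=1)[:, tau]   # tau-th NN distance
        A = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
        np.fill_diagonal(A, 0.0)
        return A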

27
Laplacian SVM
  • Manifold regularization framework
  • Hypothesize: intrinsic (true) data lies on a low
    dimensional manifold
  • Ambient (observed) data lies in a possibly high
    dimensional space
  • Preserves locality
  • Points close in ambient space should be close in
    intrinsic space
  • Use labeled and unlabeled data to warp function
    space
  • Run SVM on warped space

28
Laplacian SVM (Sindhwani)
29
  • Input: l labeled and u unlabeled examples
  • Output: estimated function f
  • Algorithm
  • Construct adjacency graph. Compute Laplacian L.
  • Choose kernel K(x,y). Compute Gram matrix K.
  • Compute α* = (2γ_A I + 2(γ_I/(l+u)²) L K)^{-1} J^T Y β*,
    with β* from the standard SVM dual QP over the
    correspondingly deformed kernel
  • And output f(x) = Σ_i α_i* K(x_i, x)
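
The β* step needs a QP solver; the squared-loss sibling of LapSVM in the same manifold-regularization framework, Laplacian RLS, has a closed form and exercises the same graph and kernel machinery. A minimal sketch, with illustrative hyperparameters γ_A, γ_I and the 6-NN binary-weight graph from the earlier slide:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.neighbors import kneighbors_graph

    def lap_rls(X, y_labeled, gamma_A=1e-2, gamma_I=1e-2,
                n_neighbors=6):
        """Laplacian RLS (squared-loss counterpart of LapSVM).
        X: (l+u, d), labeled examples first; y_labeled in {-1,+1}.
        Returns alpha with f(x) = sum_i alpha_i K(x_i, x)."""
        n, l = len(X), len(y_labeled)
        # Adjacency graph over all points (6-NN, binary weights)
        W = kneighbors_graph(X, n_neighbors,
                             mode="connectivity").toarray()
        W = np.maximum(W, W.T)             # symmetrize
        L = np.diag(W.sum(axis=1)) - W     # graph Laplacian L = D - W
        K = rbf_kernel(X)                  # Gram matrix
        J = np.zeros((n, n)); J[:l, :l] = np.eye(l)  # labeled selector
        Y = np.concatenate([y_labeled, np.zeros(n - l)])
        # Closed form: (J K + gA l I + (gI l/n^2) L K)^{-1} Y
        A = (J @ K + gamma_A * l * np.eye(n)
             + (gamma_I * l / n ** 2) * (L @ K))
        return np.linalg.solve(A, Y)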

30
Current and Future Work
  • Interactions of tone and intonation
  • Recognition of topic and turn boundaries
  • Effects of topic and turn cues on tone realization
  • Child-directed speech tone learning
  • Support for Computer-assisted tone learning
  • Structured sequence models for tone
  • Sub-syllable segmentation modeling
  • Feature assessment
  • Band energy and intensity in tone recognition

31
Related Work
  • Tonal coarticulation
  • Xu & Sun 02; Xu 97; Shih & Kochanski 00
  • English pitch accent
  • X. Sun 02; Hasegawa-Johnson et al 04; Ross &
    Ostendorf 95
  • Lexical tone recognition
  • SVM recognition of Thai tone (Thubthong 01)
  • Context-dependent tone models
  • Wang & Seneff 00; Zhou et al 04

32
Pitch Target Approximation Model
  • Pitch target
  • Linear model
  • Exponentially approximated
  • In practice, assume target well-approximated by
    mid-point (Sun, 02)
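
For reference, a common written form of the model, as a sketch following Xu et al. 99 (the notation is assumed, not from the slides):

    % Underlying pitch target: a line with slope a and height b
    T(t) = a\,t + b
    % Surface F0 approaches the target exponentially at rate \lambda,
    % starting from an offset c carried over from the prior syllable
    F_0(t) = T(t) + c\,e^{-\lambda t}
    % Sun (02): the attained target is well approximated by the F0
    % value near the syllable mid-point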

33
Classification Experiments
  • Classifier: Support Vector Machine
  • Linear kernel
  • Multiclass formulation
  • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  • 4:1 training / test splits
  • Experiments: effects of
  • Context position: preceding, following, none,
    both
  • Context encoding: Extended/Difference
  • Context type: local, phrasal
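
A minimal scikit-learn rendering of this configuration; SVC (which, like LibSVM, handles multiclass with pairwise one-vs-one binary classifiers) stands in for SVMlight/LibSVM, and the function name and 4:1 split below are illustrative.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def run_split(X, y, seed=0):
        """One 4:1 train/test split with a linear-kernel SVM."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)
        return clf.score(X_te, y_te)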

34
Results: Local Context
Context           Mandarin Tone (%)   English Pitch Accent (%)
Full              74.5                81.3
Extend Pre+Post   74.0                80.7
Extend Pre        74.0                79.9
Extend Post       70.5                76.7
Diffs Pre+Post    75.5                80.7
Diffs Pre         76.5                79.5
Diffs Post        69.0                77.3
Both Pre          76.5                79.7
Both Post         71.5                77.6
No context        68.5                75.9
38
Discussion Local Context
  • Any context information improves over none
  • Preceding context information consistently
    improves over none or following context
    information
  • English: generally, more context features are
    better
  • Mandarin: following context can degrade
    performance
  • Little difference in encoding (Extend vs Diffs)
  • Consistent with phonological analysis (Xu) that
    carryover coarticulation is greater than
    anticipatory

39
Results & Discussion: Phrasal Context
Phrase Context   Mandarin Tone (%)   English Pitch Accent (%)
Phrase           75.5                81.3
No Phrase        72.0                79.9
  • Phrase contour compensation enhances recognition
  • Simple strategy
  • Use of non-linear slope compensation may improve
    results

40
Context Summary
  • Employ common acoustic representation
  • Tone (Mandarin), pitch accent (English)
  • Cantonese: 64%; 68% with RBF kernel
  • SVM classifiers, linear kernel: 76%, 81%
  • Local context effects
  • Up to > 20% relative reduction in error
  • Preceding context greatest contribution
  • Carryover vs. anticipatory
  • Phrasal context effects
  • Compensation for phrasal contour improves
    recognition

42
Aside: More Tones
  • Cantonese
  • CUSENT corpus of read broadcast news text
  • Same feature extraction & representation
  • 6 tones
  • High level, high rise, mid level, low fall, low
    rise, low level
  • SVM classification
  • Linear kernel: 64%; Gaussian kernel: 68%
  • Tones 3, 6: 50%; mutually indistinguishable (50%
    pairwise)
  • Human levels: no context 50%; with context 68%
  • Augment with syllable phone sequence
  • 86% accuracy; for the 90% of syllables w/tone 3 or
    6, one tone dominates

43
Aside: Voice Quality & Energy
  • By Dinoj Surendran
  • Assess local voice quality and energy features
    for tone
  • Not typically associated with Mandarin
  • Considered
  • VQ: NAQ, AQ, etc.; spectral balance; spectral
    tilt; band energy
  • Useful: band energy significantly improves
  • Esp. for neutral tone
  • Supports identification of unstressed syllables
  • Spectral balance predicts stress in Dutch

44
Roadmap
  • Challenges for Tone and Pitch Accent
  • Contextual effects
  • Training demands
  • Modeling Context for Tone and Pitch Accent
  • Data collections & processing
  • Integrating context
  • Context in Recognition
  • Reducing Training demands
  • Data collections & structure
  • Semi-supervised learning
  • Unsupervised clustering
  • Conclusion

45
Strategy: Context
  • Exploit contextual information
  • Features from adjacent syllables
  • Height, shape: direct, relative
  • Compensate for phrase contour
  • Analyze impact of
  • Context position, context encoding, context type
  • > 20% relative improvement over no context