Recent Progress on Automatic Phoneme Alignment

About This Presentation

Description:

Spectrogram. Context. Phonetic Transcription. attitude. ae dx ax ... Spectrogram. Boundaries given by HMM Force Alignment. Hypothesis Boundaries as Input to SVM ... – PowerPoint PPT presentation

Slides: 18

Provided by: iisSin

Transcript and Presenter's Notes

1
Recent Progress on Automatic Phoneme Alignment

Hung-Yi Lo and Hsin-Min Wang

2
Outline

A Minimum Boundary Error Framework for Automatic
Phonetic Segmentation (ICSLP06, ISCSLP06)
Phonetic Boundary Refinement Using Support Vector
Machine (submitted to ICASSP06)
Introduction
Reduced support vector machine
Phone-transition-dependent SVM classifier
Experimental results
Progress on Manual Corpus Labeling

3
Automatic Corpus Segmentation Problem
4
Boundary Refinement Using SVM
5
Introduction to Support Vector Machine
6
Support Vector Machines Formulation

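Solve the quadratic program for some $C > 0$:

$$\min_{w,\, b,\, \xi} \; C \sum_{i=1}^{m} \xi_i + \frac{1}{2} \|w\|_2^2 \qquad \text{(QP)}$$

$$\text{s.t.} \quad y_i \,(w^\top x_i + b) + \xi_i \ge 1, \quad \xi_i \ge 0, \quad i = 1, \dots, m,$$

where $x_i \in \mathbb{R}^n$ are the $m$ training points and $y_i \in \{+1, -1\}$ are their class labels.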
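As a small worked example, the sketch below evaluates the objective of a soft-margin SVM at a given (w, b). It assumes the standard primal form, in which each slack is the hinge loss max(0, 1 - y_i (w . x_i + b)); the function name, toy points, and parameter values are illustrative and not taken from the slides.

```python
def soft_margin_objective(w, b, C, points, labels):
    """Return (objective, slacks) of the soft-margin SVM primal at (w, b).

    Each slack is the smallest xi_i >= 0 satisfying
    y_i * (w . x_i + b) + xi_i >= 1, i.e. the hinge loss.
    """
    slacks = []
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = C * sum(slacks) + 0.5 * sum(wi * wi for wi in w)
    return objective, slacks


# Toy data: the third point lies on the wrong side of the hyperplane.
points = [(2.0, 0.0), (0.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
objective, slacks = soft_margin_objective(
    w=(1.0, 0.0), b=-1.0, C=1.0, points=points, labels=labels
)
```

Only the third point needs a nonzero slack, so the objective is the regularization term plus C times that single slack.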
7
Reduced Support Vector Machine

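Overcoming computational and storage difficulties
in nonlinear SVM by using a rectangular kernel.
Choose a small random sample $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$
of the rows of the training matrix $A \in \mathbb{R}^{m \times n}$.
Replace $K(A, A^\top)$ by $K(A, \bar{A}^\top)$
in nonlinear SVM.
Only need to compute and store $m \times \bar{m}$
numbers for the rectangular kernel.
Computational complexity reduced from
$O(m^3)$ to $O(\bar{m}^3)$.
The nonlinear classifier is defined by the
optimal solution obtained by solving a
quadratic programming problem.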
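The rectangular kernel idea can be sketched as follows. This is a minimal illustration, assuming a plain Gaussian kernel and a uniformly random reduced set; the helper names, the `gamma` value, and the toy data are assumptions, and the quadratic program itself is not solved here.

```python
import math
import random


def gaussian_kernel(x, z, gamma):
    """Gaussian kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def rectangular_kernel(A, reduced_size, gamma, seed=0):
    """Build the m x m_bar kernel between all rows of A and a random subset.

    Returns the kernel (list of rows) and the indices of the sampled rows.
    """
    rng = random.Random(seed)
    reduced_idx = rng.sample(range(len(A)), reduced_size)
    A_bar = [A[i] for i in reduced_idx]
    K = [[gaussian_kernel(x, z, gamma) for z in A_bar] for x in A]
    return K, reduced_idx


# 20 toy training points in 2-D, reduced set of 4 points.
A = [(float(i), float(i % 3)) for i in range(20)]
K, reduced_idx = rectangular_kernel(A, reduced_size=4, gamma=0.1)
```

With 20 points and a reduced set of 4, only 20 x 4 = 80 kernel values are stored instead of the 400 needed for the full square kernel.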

8
Feature Extraction in Refinement Stage

Each frame represented by a 45-dim vector
39 MFCC-based coefficients
Zero crossing rate
Bisector frequency
Burst degree
Spectral entropy
General weighted entropy
Subband energy
Two features were extracted for each hypothesis
boundary
Symmetrical Kullback-Leibler distance
Spectral feature transition rate
Each hypothesized boundary was represented by the
feature vectors of the left and right frames next
to it and the two boundary features, giving a
92-dim augmented vector (2 x 45 + 2).
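The construction of the augmented vector can be sketched as below. The symmetric Kullback-Leibler distance is written in its usual form, KL(p||q) + KL(q||p), and is assumed here to be computed between two normalized spectra; the slides do not give the formula for the spectral feature transition rate, so it is passed in as a precomputed value. All function names are illustrative.

```python
import math

FRAME_DIM = 45  # per-frame feature dimension stated on the slide


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL distance KL(p||q) + KL(q||p) = sum (p_i - q_i) log(p_i / q_i)."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)  # guard against log(0)
        qi = max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total


def augmented_vector(left_frame, right_frame, boundary_features):
    """Concatenate left frame, right frame, and the two boundary features."""
    if len(left_frame) != FRAME_DIM or len(right_frame) != FRAME_DIM:
        raise ValueError("each frame must have 45 features")
    if len(boundary_features) != 2:
        raise ValueError("expected exactly two boundary features")
    return list(left_frame) + list(right_frame) + list(boundary_features)


# Toy example: two 2-bin spectra and placeholder frame features.
kl = symmetric_kl([0.9, 0.1], [0.1, 0.9])
transition_rate = 0.0  # placeholder: formula not given in the slides
vec = augmented_vector([0.0] * FRAME_DIM, [1.0] * FRAME_DIM, [kl, transition_rate])
```

The resulting vector has 45 + 45 + 2 = 92 entries, matching the dimension quoted above.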

9
Phone-Transition-Dependent SVM Classifier
10
Phone Transition Clustering

1) For each specific phone transition case, we
gather all augmented vectors associated with the
human-labeled phone boundaries, and compute the
mean vector.
2) For each of the three phone transition
classes (sonorant to non-sonorant, sonorant to
sonorant, non-sonorant to non-sonorant), we apply
the K-means algorithm to cluster the phone
transitions according to their mean vectors.
Only the phone transitions with enough
instances are considered in this step.
3) The number of clusters is determined according to
the cross-validation accuracy that the resulting
SVM classifiers achieve on the training data.
4) We assign the phone transitions that were
ignored in Step 2) to the nearest clusters
according to the distances between their mean
vectors and the cluster centers.
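The clustering procedure can be sketched in a few lines. This is a simplified illustration, assuming a basic K-means with random initialization and a fixed number of clusters; the cross-validation step that selects the number of clusters is omitted, and the `min_instances` threshold, phone pairs, and 2-D toy vectors are hypothetical.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, v):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], v))


def kmeans(points, k, iters=50, seed=0):
    """Plain K-means; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        new_centers = [mean_vector(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def cluster_transitions(vectors_by_transition, k, min_instances):
    """Cluster frequent transitions, then attach rare ones to the nearest center."""
    means = {t: mean_vector(v) for t, v in vectors_by_transition.items()}
    frequent = [t for t, v in vectors_by_transition.items()
                if len(v) >= min_instances]
    centers = kmeans([means[t] for t in frequent], k)
    return {t: nearest(centers, m) for t, m in means.items()}


# Toy data: four frequent transitions and one rare one.
vectors_by_transition = {
    ("aa", "t"): [(0.0, 0.0), (0.2, 0.0), (-0.2, 0.0)],
    ("iy", "s"): [(0.0, 0.2), (0.0, 0.4), (0.0, 0.0)],
    ("ow", "k"): [(10.0, 10.0), (10.2, 10.0), (9.8, 10.0)],
    ("uw", "p"): [(10.0, 10.2), (10.0, 10.4), (10.0, 10.0)],
    ("ae", "dx"): [(9.5, 10.0)],  # too few instances to take part in K-means
}
assignment = cluster_transitions(vectors_by_transition, k=2, min_instances=3)
```

In this toy run the rare transition is left out of K-means and then joins the cluster whose center is closest to its mean vector.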

11
Discriminative Feature Extraction

For each partition subset, two discriminative
features, namely discriminative weighted entropy
and discriminative subband energy are extracted
to replace the general weighted entropy and
subband energy.
Discriminative weighted entropy
$$H_w = -\sum_i w_i \, p_i \log p_i$$
where $w$ is a weight vector and $p_i$ is
the $i$-th element of the normalized power spectrum. The
weight vector of each partition subset is
trained by a linear SVM to maximize the variation
of the weighted entropy feature close to the true
boundary.
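A weighted entropy of this kind can be computed as in the sketch below. It assumes the common form H_w = -sum_i w_i p_i log p_i over the normalized power spectrum; the slides do not show how the weights come out of the linear SVM, so the weight vector is simply an input here, and the function names are illustrative.

```python
import math


def normalize_power_spectrum(power):
    """Scale a non-negative power spectrum so that its bins sum to 1."""
    total = sum(power)
    return [p / total for p in power]


def weighted_entropy(power, weights):
    """Weighted entropy -sum_i w_i p_i log p_i of the normalized spectrum."""
    probs = normalize_power_spectrum(power)
    return -sum(w * p * math.log(p) for w, p in zip(weights, probs) if p > 0.0)


# With all weights equal to 1 this reduces to the ordinary spectral entropy.
flat = weighted_entropy([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
peaked = weighted_entropy([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])
```

A flat spectrum gives the maximum value log(4) for four bins, while a spectrum concentrated in one bin gives zero.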

12
Discriminative Feature Extraction

Discriminative subband energy
is computed from the pre-defined subband energies
and a weight score for each subband. The weight
score is defined in terms of the mean and
standard deviation of the j-th subband energy for
the training samples of the left or negative
class.

13
Boundary Recognition Using SVM

An RSVM classifier is trained for each phone
transition subset.
Augmented vectors associated with the true
boundaries are the positive training samples.
Randomly selected augmented vectors at least 20 ms
away from the true boundary are the negative
training samples.
A Gaussian kernel with weighted Euclidean distance
is applied.
In the testing phase, the augmented vectors around
the hypothesized boundary are examined by the
classifier of the subset to which the phone
transition belongs.
The position with the maximum classifier output
is recognized as the refined boundary.
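The testing-phase decision can be sketched as follows. The kernel is a Gaussian of a weighted squared Euclidean distance, which is one natural reading of the slide; the scoring function is a stand-in (similarity to a single prototype vector) rather than a trained RSVM, and all names, weights, and toy vectors are assumptions.

```python
import math


def weighted_gaussian_kernel(x, z, weights, gamma):
    """exp(-gamma * sum_j w_j (x_j - z_j)^2), a Gaussian on a weighted distance."""
    d2 = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, z))
    return math.exp(-gamma * d2)


def refine_boundary(candidate_times_ms, candidate_vectors, score):
    """Return the candidate time whose vector gets the highest classifier output."""
    best = max(range(len(candidate_times_ms)),
               key=lambda i: score(candidate_vectors[i]))
    return candidate_times_ms[best]


# Stand-in scorer: kernel similarity to one prototype "boundary" vector.
prototype = (1.0, 0.0)
weights = (1.0, 4.0)


def toy_score(v):
    return weighted_gaussian_kernel(v, prototype, weights, gamma=1.0)


times = [90, 95, 100, 105, 110]
vectors = [(0.0, 0.0), (0.5, 0.0), (0.9, 0.0), (1.0, 0.1), (2.0, 0.0)]
refined = refine_boundary(times, vectors, toy_score)
```

Because the second dimension carries a larger weight, the candidate at 105 ms scores lower than the one at 100 ms even though it matches the prototype in the first dimension.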

14
Experimental Setup

TIMIT corpus (dialect sentences are excluded)
Training set: 3696 utterances
Testing set: 1312 utterances
Context-independent HMMs used for preliminary
alignment
50 phone models (left-to-right topology)
3 states for each HMM
5265 mixtures in total
Frame window size and shift are 20 ms and 5 ms,
respectively.
12 MFCCs and log energy, and their first and
second differences
Cepstral mean subtraction (CMS) and cepstral
variance normalization (CVN) are applied.
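The front-end numbers above can be checked with a short calculation. The sketch assumes that frames are counted without padding at the utterance edges, which the slides do not specify; the function names are illustrative.

```python
def mfcc_feature_dim(num_mfcc=12, include_log_energy=True, num_delta_orders=2):
    """Static coefficients plus first and second differences."""
    static = num_mfcc + (1 if include_log_energy else 0)
    return static * (1 + num_delta_orders)


def num_frames(duration_ms, window_ms=20, shift_ms=5):
    """Number of full windows that fit in the signal (no edge padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


dim = mfcc_feature_dim()       # (12 + 1) * 3
frames = num_frames(1000)      # one second of speech
```

The 39 coefficients computed here are the MFCC-based part of each frame vector; a one-second signal yields 197 frames under the stated window and shift.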

15
Experimental Setup

The number of phone transition clusters is determined by
cross-validation on the TIMIT training data.
20 in the sonorant to non-sonorant class
16 in the sonorant to sonorant class
10 in the non-sonorant to non-sonorant class
In the refinement phase, 5 hypothesized
boundaries, spaced 5 ms apart and lying within
10 ms of the initial boundary, are examined
by the SVM.
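The set of hypothesized boundaries follows directly from the 5 ms step and the 10 ms search radius, as the sketch below shows. The function name and the example initial boundary of 250 ms are illustrative.

```python
def candidate_boundaries(initial_ms, search_radius_ms=10, step_ms=5):
    """Candidate boundary times around the initial (HMM) boundary."""
    offsets = range(-search_radius_ms, search_radius_ms + 1, step_ms)
    return [initial_ms + off for off in offsets]


candidates = candidate_boundaries(250)
```

For an initial boundary at 250 ms the candidates are 240, 245, 250, 255, and 260 ms, which accounts for the 5 hypothesized boundaries.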

16
Experimental Results
17
Thank you
