On Use of Temporal Dynamics of Speech for Language Identification


1
On Use of Temporal Dynamics of Speech for
Language Identification
  • Andre Adami, Pavel Matejka, Petr Schwarz,
    Hynek Hermansky
  • Anthropic Signal Processing Group,
    http://www.asp.ogi.edu

2
OGI-4 ASP System
  • Goal
  • Convert the speech signal into a sequence of
    discrete sub-word units that can characterize the
    language
  • Approach
  • Use temporal trajectories of speech parameters to
    obtain the sequence of units
  • Model the sequence of discrete sub-word units
    using an N-gram language model
  • Sub-word units
  • TRAP-derived American English phonemes
  • Symbols derived from prosodic cues dynamics
  • Phonemes from OGI-LID
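As a rough sketch of the modeling step, sequences of discrete sub-word units can be scored with per-language N-gram statistics. The toy symbol sequences, the relative-frequency scoring, and the smoothing floor below are illustrative assumptions, not the system's actual training recipe:

```python
import math
from collections import Counter

def ngram_counts(symbols, n=3):
    """Count the n-grams in a sequence of discrete sub-word units."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

def log_likelihood(symbols, model, n=3, floor=1e-6):
    """Score a sequence against one language's n-gram counts using
    relative frequencies plus a small smoothing floor (illustrative;
    the slides do not specify the LM estimation details)."""
    total = sum(model.values()) or 1
    return sum(math.log(model.get(tuple(symbols[i:i + n]), 0) / total + floor)
               for i in range(len(symbols) - n + 1))

# Hypothetical toy data: one 3-gram model per language; a test
# sequence is classified by maximum log-likelihood.
english = "ae t ax s t ae t".split()
spanish = "a t o s a t o s".split()
models = {"EN": ngram_counts(english), "ES": ngram_counts(spanish)}
best = max(models, key=lambda lang: log_likelihood("t ae t ax".split(), models[lang]))
```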

3
American English Phoneme Recognition
  • Phoneme set
  • 39 American English phonemes (CMU-like)
  • Phoneme Recognizer
  • trained on NTIMIT
  • TRAP (Temporal Patterns) based
  • Speech segments for training obtained from
    energy-based speech/nonspeech segmentation
  • Modeling
  • 3-gram language model

4
English Phoneme System
[Diagram: the time-frequency plane is split into bands; Band Classifiers 1…N feed a Merger, followed by a Viterbi search]
  • Temporal trajectories
  • 23 mel-scale frequency bands
  • 1 s segments of log energy trajectory
  • Band classifiers
  • MLP (101x300x39)
  • Hidden-unit nonlinearities: sigmoids
  • Output nonlinearities: softmax
  • Merger
  • MLP (897x300x39)
  • Viterbi search
  • Penalty factor tuned to balance deletions and
    insertions
  • Training
  • NTIMIT
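The forward pass of this architecture can be sketched as follows. The layer sizes and nonlinearities come from the slide (band MLPs 101x300x39, merger 897x300x39 = 23 bands x 39 posteriors); the random weights merely stand in for models trained on NTIMIT:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """One hidden layer: sigmoid hidden units, softmax outputs
    (the nonlinearities stated on the slide)."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

# Shapes from the slide; hypothetical random weights.
N_BANDS, TRAJ, HID, PHONES = 23, 101, 300, 39
band_nets = [(rng.normal(0, 0.1, (TRAJ, HID)), np.zeros(HID),
              rng.normal(0, 0.1, (HID, PHONES)), np.zeros(PHONES))
             for _ in range(N_BANDS)]
merger = (rng.normal(0, 0.1, (N_BANDS * PHONES, HID)), np.zeros(HID),
          rng.normal(0, 0.1, (HID, PHONES)), np.zeros(PHONES))

# One 1 s log-energy trajectory (101 frames) per frequency band.
trajectories = rng.normal(size=(N_BANDS, TRAJ))

# Each band classifier emits per-band phoneme posteriors; the merger
# MLP combines their concatenation (23 x 39 = 897) into final posteriors.
band_posteriors = np.concatenate(
    [mlp(t, *net) for t, net in zip(trajectories, band_nets)])
final = mlp(band_posteriors, *merger)
```

The Viterbi search would then decode a phoneme sequence from such per-frame posteriors.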

5
Prosodic Cues Dynamics
  • Technique
  • Using prosodic cues (intensity and pitch
    trajectories) to derive the sub-word units
  • Approach
  • Segment the speech signal at the inflection
    points of trajectories (zero-crossings of the
    derivative) and at the onsets and offsets of
    voicing
  • Label the segment by the direction of change of
    the parameter within the segment
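The segmentation rule above can be sketched on a per-frame pitch track. The toy track, the frame rate, and the use of first differences for the derivative are assumptions for illustration:

```python
def segment_boundaries(pitch):
    """Cut a per-frame pitch track at voicing onsets/offsets
    (0 = unvoiced, a common convention) and at inflection points,
    i.e. sign changes of the first difference within voiced runs."""
    cuts = [0]
    for i in range(1, len(pitch)):
        voiced_now, voiced_prev = pitch[i] > 0, pitch[i - 1] > 0
        if voiced_now != voiced_prev:            # voicing onset/offset
            cuts.append(i)
        elif voiced_now and i + 1 < len(pitch):
            d1 = pitch[i] - pitch[i - 1]
            d2 = pitch[i + 1] - pitch[i]
            if d1 * d2 < 0:                      # derivative changes sign
                cuts.append(i)
    cuts.append(len(pitch))
    return cuts

# Toy track: unvoiced, rising then falling voiced stretch, unvoiced.
track = [0, 0, 100, 110, 120, 115, 105, 0, 0]
bounds = segment_boundaries(track)  # segments: unvoiced, rise, fall, unvoiced
```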

6
Prosodic Cues Dynamics
  • Duration
  • The duration of the segment is characterized as
    short (less than 8 frames) or long
  • 10 symbols
  • Broad-phonetic-category (BFC)
  • Finer labeling achieved by estimating the
    broad-phonetic category (vowel/diphthong/glide,
    schwa, stop, fricative, flap, nasal, and silence)
    coinciding with each prosodic segment
  • BFC TRAPs trained on NTIMIT are used for deriving
    the broad phonetic categories
  • 61 symbols
  • 3-gram language model
  • BFC TRAPS Setup
  • Input temporal vectors
  • 15 Bark-scale frequency band energies
  • 1s segments of log energy trajectory
  • Mean and variance normalized
  • Dimension reduction: DCT
  • Band classifiers
  • MLP (15x100x7)
  • Hidden units: sigmoid
  • Output units: softmax
  • Merger
  • MLP (105x100x7)
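One plausible reading of the 10-symbol inventory (direction of change of pitch and energy for voiced segments, an unvoiced label, each crossed with the short/long duration flag: 2 x 2 x 2 + 2 = 10) can be written out as below. This is a reconstruction from the slide, not necessarily the authors' exact scheme; the label strings are invented:

```python
def label_segment(f0_slope, energy_slope, n_frames, voiced, short=8):
    """Label a prosodic segment by the direction of change of pitch
    and energy plus a short/long duration flag (short = fewer than
    8 frames, as stated on the slide). Hypothetical label format."""
    dur = "S" if n_frames < short else "L"
    if not voiced:
        return "unv." + dur
    f0 = "F0+" if f0_slope >= 0 else "F0-"
    en = "E+" if energy_slope >= 0 else "E-"
    return f0 + en + "." + dur

sym = label_segment(f0_slope=2.5, energy_slope=-0.3, n_frames=5, voiced=True)

# Enumerating all cases confirms an inventory of 10 distinct symbols.
inventory = {label_segment(f, e, n, v)
             for f in (-1, 1) for e in (-1, 1)
             for n in (4, 12) for v in (True, False)}
```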

7
OGI-4 ASP System
[Results chart: EER (30 s) values of 17.8%, 41.4%, 19.3%, and 32.1% for the individual systems]
8
OGI-4 ASP System
[Results chart: EER (30 s) = 17.8%]
9
Post-Evaluation Phoneme System
  • Speech-nonspeech segmentation using silence
    classes from TRAP-based classification
  • TRAPs classifier
  • Temporal trajectory duration: 400 ms
  • 3 bands as the input trajectory for each band
    classifier to explore the correlation between
    adjacent bands
  • The trajectories of 3 bands are projected into a
    DCT basis (20 coefficients)
  • Viterbi search tuned for language identification
  • Training data
  • CallFriend training and development sets
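The DCT projection step can be sketched with an explicit DCT-II basis. The 3-band input and the 20 coefficients come from the slide; the DCT type, the 10 ms frame rate (so 400 ms is about 40 frames), and the stacking of the three compressed trajectories are assumptions:

```python
import numpy as np

def dct_basis(n_coef, n_frames):
    """First n_coef DCT-II basis vectors (as rows), a common choice
    for compressing temporal trajectories."""
    k = np.arange(n_coef)[:, None]
    t = np.arange(n_frames)[None, :]
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n_frames))

FRAMES, COEF = 40, 20          # ~400 ms at an assumed 10 ms frame rate
basis = dct_basis(COEF, FRAMES)

rng = np.random.default_rng(1)
three_bands = rng.normal(size=(3, FRAMES))   # bands b-1, b, b+1

# Each band's trajectory is reduced to 20 DCT coefficients; the three
# compressed trajectories are stacked as the band classifier's input.
features = (three_bands @ basis.T).reshape(-1)
```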

10
Post-Evaluation Phoneme System
[Results chart: EER (30 s) = 12.7%, a 34% relative improvement]
11
Post-Evaluation Prosodic Cues System
  • No energy-based segmentation
  • Unvoiced segments longer than 2 seconds are
    considered non-speech
  • No broad-phonetic category labeling applied
  • Rate of change plus the quantized duration (10
    tokens)
  • Training data
  • CallFriend training and development sets

12
Post-Evaluation Prosodic Cues System
[Results chart: EER (30 s) = 22.2%, a 30% relative improvement]
13
Fusion - 30 sec condition
  • Fusing the scores from the prosodic cues system
  • with TRAP-derived phonemes: EER (30 s) = 10.5%
    (17% relative improvement)
  • with OGI-LID derived phonemes: EER (30 s) = 6.6%
    (14% relative improvement)
  • TRAP-derived phoneme system fused with OGI-LID:
    EER (30 s) = 6.2% (19% relative improvement)

[Results chart: EER (30 s) = 5.7%, a 26% relative improvement]
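The slide reports fused error rates but not the combiner itself, so the following is only a baseline sketch: a weighted sum of each system's per-language scores, with invented toy scores and weights:

```python
def fuse(score_lists, weights):
    """Linear score fusion: weighted sum of each system's
    per-language scores (a common baseline; the actual fusion
    method is not specified on the slide)."""
    n_langs = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists))
            for i in range(n_langs)]

# Toy per-language scores from three systems (hypothetical values).
prosodic = [-2.0, -1.5, -3.0]
trap     = [-1.0, -2.5, -2.0]
ogi_lid  = [-0.5, -2.0, -2.5]
fused = fuse([prosodic, trap, ogi_lid], weights=[0.2, 0.4, 0.4])
best_lang = fused.index(max(fused))
```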
14
Conclusions
  • Sequences of discrete symbols derived from speech
    dynamics provide useful information for
    characterizing the language
  • Two techniques for deriving the sequences of
    symbols were investigated
  • segmentation and labeling based on prosodic cues
  • segmentation and labeling based on TRAP-derived
    phonetic labels
  • The introduced techniques combine well with each
    other as well as with the more conventional
    language ID techniques