On Use of Temporal Dynamics of Speech for Language Identification


1
On Use of Temporal Dynamics of Speech for
Language Identification
  • Andre Adami, Pavel Matejka, Petr Schwarz,
    Hynek Hermansky
  • Anthropic Signal Processing Group,
    http://www.asp.ogi.edu

2
OGI-4 ASP System
  • Goal
  • Convert the speech signal into a sequence of
    discrete sub-word units that can characterize the
    language
  • Approach
  • Use temporal trajectories of speech parameters to
    obtain the sequence of units
  • Model the sequence of discrete sub-word units
    using an N-gram language model
  • Sub-word units
  • TRAP-derived American English phonemes
  • Symbols derived from prosodic cues dynamics
  • Phonemes from OGI-LID
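As a rough sketch of the modeling step, sequences of discrete sub-word units can be scored with per-language N-gram statistics. The toy symbol sequences, the relative-frequency scoring, and the smoothing floor below are illustrative assumptions, not the system's actual training recipe:

```python
import math
from collections import Counter

def ngram_counts(symbols, n=3):
    """Count the n-grams in a sequence of discrete sub-word units."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))

def log_likelihood(symbols, model, n=3, floor=1e-6):
    """Score a sequence against one language's n-gram counts using
    relative frequencies plus a small smoothing floor (illustrative;
    the slides do not specify the LM estimation details)."""
    total = sum(model.values()) or 1
    return sum(math.log(model.get(tuple(symbols[i:i + n]), 0) / total + floor)
               for i in range(len(symbols) - n + 1))

# Hypothetical toy data: one 3-gram model per language; a test
# sequence is classified by maximum log-likelihood.
english = "ae t ax s t ae t".split()
spanish = "a t o s a t o s".split()
models = {"EN": ngram_counts(english), "ES": ngram_counts(spanish)}
best = max(models, key=lambda lang: log_likelihood("t ae t ax".split(), models[lang]))
```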

3
American English Phoneme Recognition
  • Phoneme set
  • 39 American English phonemes (CMU-like)
  • Phoneme Recognizer
  • trained on NTIMIT
  • TRAP (Temporal Patterns) based
  • Speech segments for training obtained from
    energy-based speech/nonspeech segmentation
  • Modeling
  • 3-gram language model

4
English Phoneme System
[Diagram: the time-frequency plane is split into bands; Band Classifiers 1…N feed a Merger, followed by a Viterbi search]
  • Temporal trajectories
  • 23 mel-scale frequency bands
  • 1 s segments of log energy trajectory
  • Band classifiers
  • MLP (101x300x39)
  • Hidden-unit nonlinearities: sigmoids
  • Output nonlinearities: softmax
  • Merger
  • MLP (897x300x39)
  • Viterbi search
  • Penalty factor tuned to balance deletions and
    insertions
  • Training
  • NTIMIT
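The forward pass of this architecture can be sketched as follows. The layer sizes and nonlinearities come from the slide (band MLPs 101x300x39, merger 897x300x39 = 23 bands x 39 posteriors); the random weights merely stand in for models trained on NTIMIT:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """One hidden layer: sigmoid hidden units, softmax outputs
    (the nonlinearities stated on the slide)."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

# Shapes from the slide; hypothetical random weights.
N_BANDS, TRAJ, HID, PHONES = 23, 101, 300, 39
band_nets = [(rng.normal(0, 0.1, (TRAJ, HID)), np.zeros(HID),
              rng.normal(0, 0.1, (HID, PHONES)), np.zeros(PHONES))
             for _ in range(N_BANDS)]
merger = (rng.normal(0, 0.1, (N_BANDS * PHONES, HID)), np.zeros(HID),
          rng.normal(0, 0.1, (HID, PHONES)), np.zeros(PHONES))

# One 1 s log-energy trajectory (101 frames) per frequency band.
trajectories = rng.normal(size=(N_BANDS, TRAJ))

# Each band classifier emits per-band phoneme posteriors; the merger
# MLP combines their concatenation (23 x 39 = 897) into final posteriors.
band_posteriors = np.concatenate(
    [mlp(t, *net) for t, net in zip(trajectories, band_nets)])
final = mlp(band_posteriors, *merger)
```

The Viterbi search would then decode a phoneme sequence from such per-frame posteriors.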

5
Prosodic Cues Dynamics
  • Technique
  • Using prosodic cues (intensity and pitch
    trajectories) to derive the sub-word units
  • Approach
  • Segment the speech signal at the inflection
    points of trajectories (zero-crossings of the
    derivative) and at the onsets and offsets of
    voicing
  • Label the segment by the direction of change of
    the parameter within the segment
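The segmentation rule above can be sketched on a per-frame pitch track. The toy track, the frame rate, and the use of first differences for the derivative are assumptions for illustration:

```python
def segment_boundaries(pitch):
    """Cut a per-frame pitch track at voicing onsets/offsets
    (0 = unvoiced, a common convention) and at inflection points,
    i.e. sign changes of the first difference within voiced runs."""
    cuts = [0]
    for i in range(1, len(pitch)):
        voiced_now, voiced_prev = pitch[i] > 0, pitch[i - 1] > 0
        if voiced_now != voiced_prev:            # voicing onset/offset
            cuts.append(i)
        elif voiced_now and i + 1 < len(pitch):
            d1 = pitch[i] - pitch[i - 1]
            d2 = pitch[i + 1] - pitch[i]
            if d1 * d2 < 0:                      # derivative changes sign
                cuts.append(i)
    cuts.append(len(pitch))
    return cuts

# Toy track: unvoiced, rising then falling voiced stretch, unvoiced.
track = [0, 0, 100, 110, 120, 115, 105, 0, 0]
bounds = segment_boundaries(track)  # segments: unvoiced, rise, fall, unvoiced
```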

6
Prosodic Cues Dynamics
  • Duration
  • The duration of the segment is characterized as
    short (less than 8 frames) or long
  • 10 symbols
  • Broad-phonetic-category (BFC)
  • Finer labeling achieved by estimating the
    broad-phonetic category (vowel/diphthong/glide,
    schwa, stop, fricative, flap, nasal, and silence)
    coinciding with each prosodic segment
  • BFC TRAPs trained on NTIMIT are used for deriving
    the broad phonetic categories
  • 61 symbols
  • 3-gram language model
  • BFC TRAPS Setup
  • Input temporal vectors
  • 15 Bark-scale frequency band energies
  • 1s segments of log energy trajectory
  • Mean and variance normalized
  • Dimension reduction: DCT
  • Band classifiers
  • MLP (15x100x7)
  • Hidden units: sigmoid
  • Output units: softmax
  • Merger
  • MLP (105x100x7)
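One plausible reading of the 10-symbol inventory (direction of change of pitch and energy for voiced segments, an unvoiced label, each crossed with the short/long duration flag: 2 x 2 x 2 + 2 = 10) can be written out as below. This is a reconstruction from the slide, not necessarily the authors' exact scheme; the label strings are invented:

```python
def label_segment(f0_slope, energy_slope, n_frames, voiced, short=8):
    """Label a prosodic segment by the direction of change of pitch
    and energy plus a short/long duration flag (short = fewer than
    8 frames, as stated on the slide). Hypothetical label format."""
    dur = "S" if n_frames < short else "L"
    if not voiced:
        return "unv." + dur
    f0 = "F0+" if f0_slope >= 0 else "F0-"
    en = "E+" if energy_slope >= 0 else "E-"
    return f0 + en + "." + dur

sym = label_segment(f0_slope=2.5, energy_slope=-0.3, n_frames=5, voiced=True)

# Enumerating all cases confirms an inventory of 10 distinct symbols.
inventory = {label_segment(f, e, n, v)
             for f in (-1, 1) for e in (-1, 1)
             for n in (4, 12) for v in (True, False)}
```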

7
OGI-4 ASP System
[Results chart: EER (30 s) values of 17.8%, 41.4%, 19.3%, and 32.1% for the individual systems]
8
OGI-4 ASP System
[Results chart: EER (30 s) = 17.8%]
9
Post-Evaluation Phoneme System
  • Speech-nonspeech segmentation using silence
    classes from TRAP-based classification
  • TRAPs classifier
  • Temporal trajectory duration: 400 ms
  • 3 bands as the input trajectory for each band
    classifier to explore the correlation between
    adjacent bands
  • The trajectories of 3 bands are projected into a
    DCT basis (20 coefficients)
  • Viterbi search tuned for language identification
  • Training data
  • CallFriend training and development sets
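The DCT projection step can be sketched with an explicit DCT-II basis. The 3-band input and the 20 coefficients come from the slide; the DCT type, the 10 ms frame rate (so 400 ms is about 40 frames), and the stacking of the three compressed trajectories are assumptions:

```python
import numpy as np

def dct_basis(n_coef, n_frames):
    """First n_coef DCT-II basis vectors (as rows), a common choice
    for compressing temporal trajectories."""
    k = np.arange(n_coef)[:, None]
    t = np.arange(n_frames)[None, :]
    return np.cos(np.pi * k * (2 * t + 1) / (2 * n_frames))

FRAMES, COEF = 40, 20          # ~400 ms at an assumed 10 ms frame rate
basis = dct_basis(COEF, FRAMES)

rng = np.random.default_rng(1)
three_bands = rng.normal(size=(3, FRAMES))   # bands b-1, b, b+1

# Each band's trajectory is reduced to 20 DCT coefficients; the three
# compressed trajectories are stacked as the band classifier's input.
features = (three_bands @ basis.T).reshape(-1)
```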

10
Post-Evaluation Phoneme System
[Results chart: EER (30 s) = 12.7%, a 34% relative improvement]
11
Post-Evaluation Prosodic Cues System
  • No energy-based segmentation
  • Unvoiced segments longer than 2 seconds are
    considered non-speech
  • No broad-phonetic category labeling applied
  • Rate of change plus the quantized duration (10
    tokens)
  • Training data
  • CallFriend training and development sets

12
Post-Evaluation Prosodic Cues System
[Results chart: EER (30 s) = 22.2%, a 30% relative improvement]
13
Fusion - 30 sec condition
  • Fusing the scores from the prosodic cues system
  • with TRAP-derived phonemes: EER (30 s) = 10.5%
    (17% relative improvement)
  • with OGI-LID derived phonemes: EER (30 s) = 6.6%
    (14% relative improvement)
  • TRAP-derived phoneme system fused with OGI-LID:
    EER (30 s) = 6.2% (19% relative improvement)

[Results chart: EER (30 s) = 5.7%, a 26% relative improvement]
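The slide reports fused error rates but not the combiner itself, so the following is only a baseline sketch: a weighted sum of each system's per-language scores, with invented toy scores and weights:

```python
def fuse(score_lists, weights):
    """Linear score fusion: weighted sum of each system's
    per-language scores (a common baseline; the actual fusion
    method is not specified on the slide)."""
    n_langs = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists))
            for i in range(n_langs)]

# Toy per-language scores from three systems (hypothetical values).
prosodic = [-2.0, -1.5, -3.0]
trap     = [-1.0, -2.5, -2.0]
ogi_lid  = [-0.5, -2.0, -2.5]
fused = fuse([prosodic, trap, ogi_lid], weights=[0.2, 0.4, 0.4])
best_lang = fused.index(max(fused))
```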
14
Conclusions
  • Sequences of discrete symbols derived from speech
    dynamics provide useful information for
    characterizing the language
  • Two techniques for deriving the sequences of
    symbols were investigated
  • segmentation and labeling based on prosodic cues
  • segmentation and labeling based on TRAP-derived
    phonetic labels
  • The introduced techniques combine well with each
    other as well as with the more conventional
    language ID techniques