Overview - PowerPoint PPT Presentation

Slides: 12
Provided by: MarkPe70

Transcript and Presenter's Notes

Title: Overview


1
Overview
  • Recall
  • What are sound features?
  • Feature detection and extraction
  • Features in Sphinx III

2
Recall
  • Speech signal is a slowly time-varying signal
  • There are a number of linguistically distinct
    speech sounds (phonemes) in a language.
  • It is possible to represent a sound as a 3D
    spectrogram of the speech intensity in the
    different frequency bands over time
  • Most SR systems rely heavily on vowel recognition
    to achieve high performance (they are long in
    duration and spectrally well defined and
    therefore easily recognized)
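
To make the spectrogram idea concrete, here is a minimal sketch that cuts a signal into short overlapping frames and measures the intensity per frequency band in each frame. The frame length, hop size, Hamming window, and the synthetic 1 kHz test tone are all assumptions chosen for illustration; they do not come from the slides. The naive DFT keeps the example dependency-free, at the cost of speed.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 for one frame (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(signal, frame_len=64, hop=32):
    """Return one list of per-bin magnitudes for every frame (time x frequency)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        # Hamming window: tapers the frame edges to reduce spectral leakage
        windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
                    for i, x in enumerate(chunk)]
        frames.append(dft_magnitudes(windowed))
    return frames

# Assumed test input: a 1 kHz sine sampled at 8 kHz
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)

# Bin spacing is SAMPLE_RATE / frame_len = 125 Hz per bin
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / 64
print(len(spec), "frames,", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Plotting `spec` with time on one axis, frequency on the other, and magnitude as height or colour gives the 3D picture described above.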

3
Speech sounds and features
  • Examples
  • Vowels (a, u, ...)
  • Diphthongs (e.g. ay as in guy, ...)
  • Semivowels (w, l, r, y)
  • Nasal Consonants (m, n)
  • Unvoiced Fricatives (f, s)
  • Voiced Fricatives (v, th, z)
  • Voiced and Unvoiced Stops (b, d, g; p, t, k)
  • They all have their own characteristics (features)

4
ASR Stages
  • 1) speech analysis system to provide an
    appropriate spectral representation of the
    characteristics of the time-varying speech
    signal
  • 2) feature detection stage to convert the
    spectral measurements to a set of features that
    describe the broad acoustic properties of the
    different phonetic units (e.g. nasality,
    frication, formant locations, voiced-unvoiced
    classification, ratios of high- and
    low-frequency energy, etc.)
  • 3) segmentation and labeling phase to find
    stable regions and then label the segmented
    region according to how well the features within
    that region match those of individual phonetic
    units
  • 4) final output of the recognizer is the word
    or word sequence that best matches the sequence
    of phonetic units produced by the labeling phase

5
Feature detection (and extraction)
  • A speech segment contains certain
    characteristics (features).
  • Different segments of speech contain different
    features, specific to the kind of segment!
  • Goal is to try to classify a speech segment into
    one of several broad speech classes (e.g. via a
    binary tree: compact/diffuse, acute/grave,
    long/short, high/low frequency, etc.)
  • Ideally, the feature vectors for a given word
    should be the same regardless of the way in
    which the word has been uttered
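
A binary decision tree over simple measurements can be sketched as follows. Note that this sketch does not use the compact/diffuse or acute/grave features named above; as stand-ins it uses two easy-to-compute measurements (short-time energy and zero-crossing rate), and the thresholds `energy_floor` and `zcr_split` are hypothetical values picked for the synthetic test signals, not tuned for real speech.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def broad_class(frame, energy_floor=1e-4, zcr_split=0.25):
    """Two binary decisions: quiet vs. loud, then low vs. high zero-crossing rate."""
    if short_time_energy(frame) < energy_floor:
        return "silence"
    if zero_crossing_rate(frame) > zcr_split:
        return "unvoiced"  # high-frequency, noise-like energy (e.g. fricatives)
    return "voiced"        # low-frequency, periodic energy (e.g. vowels)

# Assumed synthetic frames at 8 kHz standing in for real speech segments
RATE = 8000
silence = [0.0] * 200
vowel_like = [math.sin(2 * math.pi * 150 * t / RATE) for t in range(200)]
fricative_like = [0.3 * math.sin(2 * math.pi * 3500 * t / RATE) for t in range(200)]

for name, frame in [("silence", silence), ("vowel-like", vowel_like),
                    ("fricative-like", fricative_like)]:
    print(name, "->", broad_class(frame))
```

Each extra feature adds another yes/no split to the tree, which is how a segment can be narrowed down to one broad class.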

6
Last week: Mel-Frequency Cepstrum Coefficients (MFCCs)
  • Fourier Transform extracts the frequency
    components of a signal in the time domain
  • Frequency domain is filtered/sliced into 12
    smaller parts, and for each part its own
    coefficient (MFCC) can be calculated
  • MFCCs use the log-spectrum of the speech signal.
  • The logarithmic nature of the technique is
    significant since the human auditory system
    perceives sound on a logarithmic scale above
    certain frequencies
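
As a rough sketch of the computation for a single frame: take the power spectrum, pool it with triangular filters spaced evenly on the mel scale, take the logarithm, and decorrelate with a cosine transform. This follows a common textbook formulation, which is slightly more detailed than the summary above: it assumes 24 mel filters reduced to 12 coefficients by a DCT, and it uses the widely used conversion mel = 2595 log10(1 + f/700). The filter count, frame length, and test tone are assumptions, and this is not the Sphinx implementation.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power of DFT bins 0..N/2 (naive DFT, fine for a short frame)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=24, num_coeffs=12):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Log of each filter's pooled energy (small constant avoids log(0))
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
                    for row in bank]
    # DCT-II of the log energies, skipping index 0: yields c1..c12
    return [sum(e * math.cos(math.pi * q * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for q in range(1, num_coeffs + 1)]

# Assumed test input: one 256-sample frame of a 440 Hz tone at 8 kHz
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = mfcc(frame, 8000)
print(len(coeffs), "coefficients")
```

The mel spacing is what makes the filters narrow at low frequencies and wide at high frequencies, mirroring the roughly logarithmic frequency resolution of hearing mentioned above.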

7
Acoustic Modeling Feature Extraction
[Diagram: blocks labeled Input Speech, Fourier Transform, Cepstral Analysis, Perceptual Weighting, and two Time Derivative stages. Outputs: Energy and Mel-Spaced Cepstrum; Delta Energy and Delta Cepstrum; Delta-Delta Energy and Delta-Delta Cepstrum]

8
What to do with the MFCCs
  • A speech recognizer can be built using the energy
    values (time domain) and 12 MFCCs (frequency
    domain), plus the first- and second-order
    derivatives of those coefficients.
  • 13: Absolute (Energy (1) and MFCCs (12))
  • 13: Delta (first-order derivatives of the 13
    absolute coefficients)
  • 13: Delta-Delta (second-order derivatives of the
    13 absolute coefficients)
  • ------------------------------------------------
  • 39: Total (basic MFCC front end)
  • The derivatives are useful because they provide
    information about the spectral change
  • This total of 39 coefficients provides
    information about the different features in that
    segment!
  • The feature measurements of the segments are
    stored in so-called feature vectors, which can
    be used in the next stage of the speech
    recognition (e.g. a Hidden Markov Model)
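
The 13 + 13 + 13 = 39 bookkeeping can be sketched as below. The derivative is approximated here by a simple central difference between neighbouring frames, which is an assumption made for brevity; real front ends often use wider windows or regression formulas. The input frames are made-up numbers, not real speech.

```python
def deltas(frames):
    """Central difference over time: d[t] = (x[t+1] - x[t-1]) / 2, edges clamped."""
    last = len(frames) - 1
    return [[(nxt - prv) / 2.0
             for prv, nxt in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
            for t in range(len(frames))]

def front_end(static_frames):
    """Append delta and delta-delta coefficients to each 13-value static frame."""
    d1 = deltas(static_frames)   # first-order derivatives
    d2 = deltas(d1)              # second-order derivatives
    return [s + a + b for s, a, b in zip(static_frames, d1, d2)]

# Hypothetical input: 5 frames of 13 static values (energy + 12 MFCCs),
# each value rising by exactly 1 per frame
static = [[float(t)] * 13 for t in range(5)]
vectors = front_end(static)

print(len(vectors[2]))   # 39 values per frame
print(vectors[2][13])    # delta of a steadily rising value: 1.0
print(vectors[2][26])    # delta-delta of a constant slope: 0.0
```

A steadily rising coefficient gives a constant delta and a zero delta-delta, which is the sense in which the derivatives describe spectral change.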

9
In Sphinx III: computation of feature vectors
  • feat_s2mfc2feat
  • feat_s2mfc2feat_block
  • MFC file is read
  • Initialization: defining the kind of
    input->feature conversion desired (there are some
    differences between Sphinx II and Sphinx III)
  • Feature vectors are computed for the entire
    segment specified (feat_s2mfc2feat and
    feat_s2mfc2feat_block)
  • In Sphinx, the streams of features in the
    feature vectors are stored as follows
  • CEP: C1-C12
  • DCEP: D1-D12
  • Energy values: C0, D0, DD0
  • D2CEP: DD1-DD12
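
The regrouping into four streams can be illustrated with a small sketch. This is not the Sphinx C code and does not reproduce the signature of `feat_s2mfc2feat`; the function name `to_streams` and the assumed input layout (39 values ordered as C0..C12, then D0..D12, then DD0..DD12) are hypothetical and only serve to show which coefficient ends up in which stream.

```python
def to_streams(vector):
    """Regroup one 39-value vector into the four streams listed above.

    Assumed input layout (an illustration, not Sphinx's file format):
    [C0..C12, D0..D12, DD0..DD12].
    """
    if len(vector) != 39:
        raise ValueError("expected 39 coefficients, got %d" % len(vector))
    static, delta, delta2 = vector[0:13], vector[13:26], vector[26:39]
    return {
        "CEP": static[1:],                           # C1-C12
        "DCEP": delta[1:],                           # D1-D12
        "ENERGY": [static[0], delta[0], delta2[0]],  # C0, D0, DD0
        "D2CEP": delta2[1:],                         # DD1-DD12
    }

# Use the position index as a stand-in value so the regrouping is easy to follow
streams = to_streams(list(range(39)))
for name, values in streams.items():
    print(name, len(values), values)
```

The three energy-related terms are pulled out of their blocks and kept together, so each of the other streams holds exactly 12 values.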

10
  • So, at this point in the speech recognition
    process, you have stored feature vectors for the
    entire speech segment you are looking at,
    providing the necessary information about what
    kind of features are in that segment.
  • Now, the feature stream can be analyzed using a
    Hidden Markov Model (HMM)
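
As a rough idea of what analyzing a feature stream with an HMM can look like, the sketch below finds the most likely state sequence with the Viterbi algorithm. It is heavily simplified: each frame is a single number instead of a 39-value vector, each state emits from one Gaussian, and the two-state left-to-right model with its means, variances, and transition probabilities is entirely made up for the example. It is not how Sphinx is implemented.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(observations, means, variances, log_trans, log_start):
    """Most likely state path for a sequence of scalar observations."""
    n_states = len(means)
    score = [log_start[s] + log_gauss(observations[0], means[s], variances[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + log_gauss(x, means[s], variances[s]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state left-to-right model: state 0 emits around 0.0,
# state 1 emits around 5.0, and state 1 can never return to state 0
means, variances = [0.0, 5.0], [1.0, 1.0]
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]
log_start = [0.0, NEG_INF]

features = [0.1, -0.2, 0.3, 4.8, 5.1, 5.3]
path = viterbi(features, means, variances, log_trans, log_start)
print(path)  # [0, 0, 0, 1, 1, 1]
```

The recovered path marks where the stream moves from one stable region to the next, which is the kind of alignment a recognizer needs before it can score word models.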

[Diagram: Input speech passes through the Feature Extraction Modules to produce a Feature Vector; the feature stream is then analyzed using a Hidden Markov Model (HMM). Other labels in the figure: "one", "two", "oh", Concat., Train]