Overview

1
Overview

Recall
What are sound features?
Feature detection and extraction
Features in Sphinx III

2
Recall

The speech signal is a slowly time-varying signal.
A language has a number of linguistically distinct
speech sounds (phonemes).
Speech can be represented as a 3D spectrogram:
the speech intensity in the different frequency
bands over time.
Most SR systems rely heavily on vowel recognition
to achieve high performance: vowels are long in
duration and spectrally well defined, and
therefore easily recognized.
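
The spectrogram idea above can be sketched in a few lines. This is a minimal illustration, not the method of any particular recognizer: the function names, the frame length of 64 samples, and the hop of 32 samples are assumptions chosen for the example, and the naive DFT is used only to avoid external dependencies.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_len=64, hop=32):
    """One magnitude spectrum per frame: intensity over frequency and time."""
    return [magnitude_spectrum(f)
            for f in frame_signal(samples, frame_len, hop)]

# Usage: a 1 kHz tone sampled at 8 kHz (hypothetical test signal).
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
spec = spectrogram(tone)
# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is one time frame and each column one frequency band, which is the intensity/frequency/time layout the slide describes. With a bin width of 8000 / 64 = 125 Hz, the 1 kHz tone falls in bin 8.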

3
Speech sounds and features

Examples
Vowels (a, u, )
Diphthongs (e.g. ay as in guy, )
Semivowels (w, l, r, y)
Nasal Consonants (m, n)
Unvoiced Fricatives (f, s)
Voiced Fricatives (v, th, z)
Voiced and Unvoiced Stops (b, d, g)
They all have their own characteristics (features)

4
ASR Stages

1) A speech analysis system provides an
appropriate spectral representation of the
characteristics of the time-varying speech
signal.
2) A feature detection stage converts the
spectral measurements to a set of features that
describe the broad acoustic properties of the
different phonetic units (e.g. nasality,
frication, formant locations, voiced-unvoiced
classification, ratios of high- and
low-frequency energy, etc.).
3) A segmentation and labeling phase finds
stable regions and then labels each segmented
region according to how well the features within
that region match those of individual phonetic
units.
4) The final output of the recognizer is the word
or word sequence that best matches

5
Feature detection (and extraction)

A speech segment contains certain characteristics:
its features.
Different segments of speech contain different
features, specific to the kind of segment.
The goal is to classify a speech segment into
one of several broad speech classes (e.g. via a
binary tree: compact/diffuse, acute/grave,
long/short, high/low frequency, etc.).
Ideally, the feature vectors for a given word
should be the same regardless of the way in
which the word has been uttered.

6
Last week: Mel-Frequency Cepstrum Coefficients

The Fourier transform extracts the frequency
components of a time-domain signal.
The frequency domain is filtered/sliced into 12
smaller parts, and for each part its own
coefficient (MFCC) can be calculated.
MFCCs use the log spectrum of the speech signal.
The logarithmic nature of the technique is
significant since the human auditory system
perceives sound on a logarithmic scale above
certain frequencies.
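
As a rough sketch of the steps above, the code below spaces frequency bands evenly on the mel scale and turns log band energies into cepstral coefficients. Several things here are assumptions not stated on the slide: the mel formula `2595 * log10(1 + f / 700)` is one common convention, the final step uses a DCT-II as in the usual MFCC recipe (which is more detailed than the slide's "one coefficient per part" summary), and all function names are hypothetical.

```python
import math

def hz_to_mel(hz):
    """Convert Hz to mel (one common formula; an assumption here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Edges (in Hz) of n_bands bands equally wide on the mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]

def cepstrum(band_energies, n_coeffs):
    """DCT-II of the log band energies: coefficients C0..C(n_coeffs-1)."""
    logs = [math.log(e) for e in band_energies]
    n = len(logs)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(logs))
            for k in range(n_coeffs)]

# Usage: 12 bands between 0 and 4000 Hz get wider as frequency rises.
edges = mel_band_edges(0.0, 4000.0, 12)
widths = [b - a for a, b in zip(edges, edges[1:])]
# A perfectly flat spectrum carries no spectral shape, so only C0 is non-zero.
flat = cepstrum([math.e] * 12, 13)
```

The growing band widths reflect the perceptual point made on the slide: at higher frequencies the ear resolves frequency more coarsely, so wider bands are grouped together.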

7
Acoustic Modeling: Feature Extraction

[Diagram: Input Speech -> Fourier Transform ->
Cepstral Analysis -> Perceptual Weighting, followed
by two Time Derivative stages. Outputs: Energy and
Mel-Spaced Cepstrum; Delta Energy and Delta
Cepstrum; Delta-Delta Energy and Delta-Delta
Cepstrum.]

8
What to do with the MFCCs

A speech recognizer can be built using the energy
values (time domain) and 12 MFCCs (frequency
domain), plus the first- and second-order
derivatives of those coefficients.
13  Absolute: energy (1) and MFCCs (12)
13  Delta: first-order derivatives of the 13
    absolute coefficients
13  Delta-Delta: second-order derivatives of the
    13 absolute coefficients
------------------------------------------------
39  Total: basic MFCC front end
The derivatives are useful because they provide
information about the spectral change.
This total of 39 coefficients provides
information about the different features in that
segment.
The feature measurements of the segments are
stored in so-called feature vectors, which can
be used in the next stage of speech
recognition (e.g. a Hidden Markov Model).
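
The 13 + 13 + 13 = 39 layout can be illustrated with a short sketch. It assumes a simple central difference between neighbouring frames for the derivatives; real front ends use their own window widths, and the function names `deltas` and `front_end` are hypothetical.

```python
def deltas(frames):
    """Central difference between the next and previous frame.

    At the first and last frame the frame itself stands in for the
    missing neighbour (an assumption made for this sketch).
    """
    last = len(frames) - 1
    return [[(nxt - prv) / 2.0
             for nxt, prv in zip(frames[min(t + 1, last)],
                                 frames[max(t - 1, 0)])]
            for t in range(len(frames))]

def front_end(static):
    """Append delta and delta-delta coefficients to each 13-value frame."""
    d = deltas(static)   # first-order derivatives
    dd = deltas(d)       # second-order derivatives
    return [s + a + b for s, a, b in zip(static, d, dd)]

# Usage: five frames whose 13 absolute coefficients rise linearly over time.
static = [[float(t)] * 13 for t in range(5)]
vectors = front_end(static)
```

For this linearly rising input, the middle frame has a delta of 1.0 (steady change) and a delta-delta of 0.0 (the change itself is constant), which shows how the derivatives capture spectral change that the absolute values alone do not.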

9
In Sphinx III: computation of feature vectors

feat_s2mfc2feat
feat_s2mfc2feat_block
The MFC file is read.
Initialization defines the kind of
input->feature conversion desired (there are some
differences between Sphinx II and Sphinx III).
Feature vectors are computed for the entire
segment specified (feat_s2mfc2feat and
feat_s2mfc2feat_block).
In Sphinx, the streams of features are stored in
the feature vectors as follows:
CEP: C1-C12
DCEP: D1-D12
Energy values: C0, D0, DD0
D2CEP: DD1-DD12
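
The stream layout listed above can be mimicked with a small sketch. This is not the Sphinx API and does not call feat_s2mfc2feat; it assumes a hypothetical flat 39-value input ordered as C0-C12, D0-D12, DD0-DD12, and the name `split_streams` is made up for the example.

```python
def split_streams(vector):
    """Regroup a flat 39-value vector into four feature streams.

    Assumed input order (hypothetical): C0..C12, D0..D12, DD0..DD12.
    """
    if len(vector) != 39:
        raise ValueError("expected 39 coefficients, got %d" % len(vector))
    static, delta, delta2 = vector[0:13], vector[13:26], vector[26:39]
    return {
        "CEP": static[1:],                           # C1-C12
        "DCEP": delta[1:],                           # D1-D12
        "ENERGY": [static[0], delta[0], delta2[0]],  # C0, D0, DD0
        "D2CEP": delta2[1:],                         # DD1-DD12
    }

# Usage: numbering the inputs 0..38 makes the regrouping easy to see.
streams = split_streams(list(range(39)))
```

Keeping the three energy terms in their own stream, separate from the cepstral shape coefficients, matches the grouping shown in the list.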

So, at this point in the speech recognition
process, you have stored feature vectors for the
entire speech segment you are looking at,
providing the necessary information about what
kind of features are in that segment.
Now the feature stream can be analyzed using a
Hidden Markov Model (HMM).

[Diagram labels: Input speech, Feature Extraction
Modules, Feature Vector, Concat., Train; example
words: one, two, oh]