Speech Recognition

About This Presentation

Title:

Speech Recognition

Description:

Acoustic Dictionary Creation. Generation of Acoustic Models (training) ... Use Dictionary in Reverse-mode to identify words (what if there are confusable words? ... – PowerPoint PPT presentation

Slides: 24

Provided by: MadhaviGan1

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Speech Recognition
experiments on parallel data sets

Madhavi Ganapathiraju
Carnegie Mellon University
March-23-2002

2
How ASR works

Digitized speech: samples of the speech signal at different
instants of time
It is the frequency content that differs between two
sounds

Different sounds have different frequency
features
3
Steps involved in ASR

Feature Extraction
Acoustic Dictionary Creation
Generation of Acoustic Models (training)
Decoding

4
Feature Extraction

Break up the digitized speech into windows of 25
milliseconds
Find the frequency content in each window
(Fourier analysis, etc.)
Make a feature vector for each window
F = (f1, f2, f3, ..., fN)
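
The windowing and Fourier-analysis steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the front end used in the experiments: it assumes non-overlapping 25 ms windows, a plain (slow) discrete Fourier transform, and a synthetic 400 Hz tone as input. The names `frame_signal` and `feature_vector` are hypothetical.

```python
import cmath
import math

def frame_signal(samples, sample_rate, window_ms=25):
    """Split samples into consecutive, non-overlapping windows of window_ms."""
    size = int(sample_rate * window_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def feature_vector(frame, n_bins):
    """Magnitudes of the first n_bins DFT bins: F = (f1, ..., fN)."""
    n = len(frame)
    features = []
    for k in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        features.append(abs(acc) / n)
    return features

SAMPLE_RATE = 16000  # assumption: 16 kHz input
# Synthetic test signal (assumption): 0.1 s of a 400 Hz sine wave.
tone = [math.sin(2 * math.pi * 400 * t / SAMPLE_RATE) for t in range(1600)]

frames = frame_signal(tone, SAMPLE_RATE)              # 25 ms = 400 samples each
features = [feature_vector(f, n_bins=20) for f in frames]

# With 400-sample windows at 16 kHz, each bin is 40 Hz wide,
# so the 400 Hz tone should peak in bin 10.
strongest_bin = max(range(20), key=lambda k: features[0][k])
print(strongest_bin)
```

Practical front ends usually overlap the windows and apply a tapering function before the transform; both are left out here to keep the sketch short.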

5
Acoustic Dictionary Creation

List all possible words
(that are likely to be encountered)
Identify basic sounds (phonemes) from which
words are built
Different for different languages
For each word list sequence of phonemes
Example: SUCCESS = s a k s e ss
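
One simple way to hold such a dictionary in code is a mapping from each word to its phoneme sequence. The sketch below uses the SUCCESS entry from the slide; the second entry and the helper names are hypothetical, added only so that the phoneme inventory has more than one word to draw on.

```python
# Word -> phoneme sequence. Only SUCCESS comes from the slide.
ACOUSTIC_DICTIONARY = {
    "SUCCESS": ["s", "a", "k", "s", "e", "ss"],
    "SACK": ["s", "a", "k"],  # hypothetical entry for illustration
}

def pronounce(word, dictionary):
    """Look up the phoneme sequence listed for a word."""
    return dictionary[word.upper()]

def phoneme_inventory(dictionary):
    """Collect the basic sounds from which the listed words are built."""
    return sorted({p for phones in dictionary.values() for p in phones})

print(pronounce("success", ACOUSTIC_DICTIONARY))
print(phoneme_inventory(ACOUSTIC_DICTIONARY))
```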

6
Training

Segment the speech signal into parts that
belong to the same phoneme
There are machine learning algorithms to do this
automatically
Use Statistical Learning algorithms like
Neural Networks, Hidden Markov Models
Show features extracted from 25-millisecond
windows and their corresponding phonemes
iteratively, until the patterns are learnt
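
To make the idea of learning from (feature vector, phoneme) pairs concrete, here is a deliberately simplified stand-in. It is not a neural network or a Hidden Markov Model: it just averages the feature vectors seen for each phoneme in a single pass and later picks the nearest average. The two-dimensional training data and the function names are made up for illustration.

```python
from collections import defaultdict

def train_phoneme_means(labelled_frames):
    """Average the feature vectors observed for each phoneme."""
    sums = {}
    counts = defaultdict(int)
    for features, phoneme in labelled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(features)
        sums[phoneme] = [s + f for s, f in zip(sums[phoneme], features)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in total] for p, total in sums.items()}

def best_match_phoneme(features, means):
    """Return the phoneme whose mean vector is closest (squared distance)."""
    def sq_dist(mean):
        return sum((f - m) ** 2 for f, m in zip(features, mean))
    return min(means, key=lambda p: sq_dist(means[p]))

# Hypothetical labelled windows: (feature vector, phoneme).
training = [
    ([1.0, 0.0], "s"), ([0.8, 0.2], "s"),
    ([0.0, 1.0], "a"), ([0.2, 0.8], "a"),
]
means = train_phoneme_means(training)
print(best_match_phoneme([0.7, 0.1], means))
```

A real trainer repeats the estimation over many passes and models how features change over time within a phoneme, which this sketch ignores.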

7
Decoding

Extract features for the new speech signal
Compare the features to the statistical models learnt
Identify best-match phonemes
Use the dictionary in reverse mode to identify words
(what if there are confusable words?)
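
Using the dictionary "in reverse" can be pictured as a lookup from phoneme sequence back to words. The sketch below assumes a coarse, hypothetical phoneme set in which two words end up with the same sequence, which is exactly the confusable case the slide asks about. Only the SUCCESS entry is taken from the slides.

```python
from collections import defaultdict

def build_reverse_dictionary(dictionary):
    """Map each phoneme sequence to every word that is pronounced that way."""
    reverse = defaultdict(list)
    for word, phones in dictionary.items():
        reverse[tuple(phones)].append(word)
    return reverse

dictionary = {
    "SUCCESS": ["s", "a", "k", "s", "e", "ss"],
    "IS": ["i", "z"],    # hypothetical, coarse transcription
    "EASE": ["i", "z"],  # hypothetical, deliberately identical to IS
}
reverse = build_reverse_dictionary(dictionary)

candidates = reverse[("i", "z")]
print(candidates)  # more than one word: acoustics alone cannot decide
```

When the lookup returns several candidates, something other than the acoustics has to choose between them.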

8
Decoding with Language Models

It is questionable OR
It ease questionable
P(W3 | W2, W1) for a language can be calculated
Use precomputed probabilities for different words to say
P(is | it) > P(ease | it)
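
Such probabilities are typically estimated by counting word sequences in a text collection. The sketch below does this for the two-word case, P(word | previous word), on a tiny made-up corpus; the sentences and function names are assumptions for illustration only.

```python
from collections import Counter

def count_bigrams(sentences):
    """Count each word as a context and each adjacent word pair."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        contexts.update(words[:-1])           # words that have a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams

def conditional_probability(word, previous, contexts, bigrams):
    """Relative-frequency estimate of P(word | previous)."""
    if contexts[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / contexts[previous]

# Hypothetical training text.
corpus = [
    "it is questionable",
    "it is late",
    "it is with ease",
    "put it down",
]
contexts, bigrams = count_bigrams(corpus)

p_is = conditional_probability("is", "it", contexts, bigrams)
p_ease = conditional_probability("ease", "it", contexts, bigrams)
print(p_is > p_ease)  # the decoder would prefer "It is questionable"
```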

9
Issues in ASR

Single speaker? (PC dictation machine)
Multiple speakers? (Railway information centre)
Studio recording? On the road? Telephone? Cell phone?
Speech compression? Full bandwidth?

10
Difficulties in ASR

Studio-quality speech is clearer than cell-phone
(CP) speech
But it is NOT always better for ASR!
Models trained with CP speech do NOT recognise
studio-quality speech any better!
Training and test data have to match

11
Experiments with MIS-MATCHED speech
12
Experimental setup: Speech type

Vocabulary
Large (6000 words)
Speech type
Read speech
Speakers
Multiple speakers

13
Experimental setup: Speech quality

Full bandwidth (TIMIT)
16 kHz, 16 bits per sample
Telephone quality (N-TIMIT)
16 kHz, effective quality of 8 kHz sampling,
16 bits per sample
Far-field (desktop) microphone (FFM-TIMIT)
16 kHz, 16 bits per sample
Cellular phone quality (C-TIMIT)
8 kHz, 16 bits per sample, with background noise

14
Speech Recognition System

Sphinx-3 trainer and decoder
3-state HMMs to model phones

Context independent training
A is the same in C A T and M A T
Context dependent training
A in C A T is different from A in M A T
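
The difference between the two kinds of modelling units can be shown with the slide's own C A T / M A T example. In the sketch below each context-dependent unit is a (left neighbour, phone, right neighbour) triple; the `SIL` boundary symbol and the function names are assumptions, and the letters stand in for phones as on the slide.

```python
def context_independent_units(phones):
    """One unit per phone, regardless of its neighbours."""
    return list(phones)

def context_dependent_units(phones, boundary="SIL"):
    """One unit per phone, tagged with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

cat = ["C", "A", "T"]
mat = ["M", "A", "T"]

# Context independent: the A in both words is the same unit.
print(context_independent_units(cat)[1] == context_independent_units(mat)[1])

# Context dependent: the two A units differ because the left neighbour differs.
print(context_dependent_units(cat)[1], context_dependent_units(mat)[1])
```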

15
Feature Vectors

13 MFCCs (mel-frequency cepstral coefficients)
Band-limited to 130-6500 Hz or 200-3500 Hz,
depending on the sampling frequency of the input
31-40 mel-scale filters
39-dimensional features computed internally by Sphinx
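
The slide does not say how 13 coefficients become a 39-dimensional feature. A common construction, assumed here rather than taken from the slides, appends first and second time differences (deltas and delta-deltas) to the 13 static values, giving 13 + 13 + 13 = 39. The difference formula below is a simple illustrative choice and may not match the one Sphinx uses internally.

```python
def deltas(frames):
    """Difference between the next and previous frame (edges clamped)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        nxt = frames[min(t + 1, last)]
        prv = frames[max(t - 1, 0)]
        out.append([a - b for a, b in zip(nxt, prv)])
    return out

def stack_features(static_frames):
    """Concatenate static, delta and delta-delta values per frame."""
    d = deltas(static_frames)
    dd = deltas(d)
    return [s + x + y for s, x, y in zip(static_frames, d, dd)]

# Hypothetical input: 5 frames of 13 coefficients each.
static = [[float(t * (c + 1)) for c in range(13)] for t in range(5)]
stacked = stack_features(static)
print(len(stacked[0]))  # 39
```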

16
Language Modeling

Used the CMU-Cambridge Statistical Language Modeling
toolkit
Modeled 3-grams,
with back-off to 2-grams and 1-grams
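
The back-off idea is to use the 3-gram estimate when that word triple was seen in training, and otherwise fall back to the 2-gram, then the 1-gram. The sketch below is a simplified illustration on a made-up corpus: it omits the discounting and back-off weights that a real toolkit applies, so the values are relative-frequency scores rather than properly normalised probabilities. It does not use the toolkit's API.

```python
from collections import Counter

def count_ngrams(sentences, max_order=3):
    """Count all 1-, 2- and 3-grams, plus the total number of words."""
    counts = Counter()
    total_words = 0
    for sentence in sentences:
        words = sentence.lower().split()
        total_words += len(words)
        for order in range(1, max_order + 1):
            for i in range(len(words) - order + 1):
                counts[tuple(words[i:i + order])] += 1
    return counts, total_words

def backoff_score(w1, w2, w3, counts, total_words):
    """Score w3 given (w1, w2), backing off to shorter contexts if unseen."""
    if counts[(w1, w2, w3)] > 0:
        return "3-gram", counts[(w1, w2, w3)] / counts[(w1, w2)]
    if counts[(w2, w3)] > 0:
        return "2-gram", counts[(w2, w3)] / counts[(w2,)]
    return "1-gram", counts[(w3,)] / total_words

# Hypothetical training text.
corpus = ["it is questionable", "it is late", "that is questionable indeed"]
counts, total_words = count_ngrams(corpus)

print(backoff_score("it", "is", "questionable", counts, total_words))
print(backoff_score("that", "is", "late", counts, total_words))
print(backoff_score("is", "late", "indeed", counts, total_words))
```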

17
Results-TIMIT
18
Results-N-TIMIT
19
Results-FFM-TIMIT
20
Feature preprocessing
Linear regression would do
1st MFCC compared between N-TIMIT and CVSD-TIMIT
21
Feature preprocessing
Linear regression would NOT do
12th MFCC compared between N-TIMIT and CVSD-TIMIT
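
Whether a linear mapping between the same coefficient in two recordings "would do" can be checked by fitting a line and looking at how much of the variation it explains. The sketch below is a generic least-squares fit on synthetic numbers; it does not reproduce the data behind these slides, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Synthetic values standing in for one coefficient in two channels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_related = [2.0 * x + 1.0 for x in xs]   # a line fits well
ys_unrelated = [1.0, -1.0, 0.5, -1.0, 1.0]  # a line explains almost nothing

slope, intercept, r2_good = fit_line(xs, ys_related)
_, _, r2_bad = fit_line(xs, ys_unrelated)
print(r2_good, r2_bad)
```

An R squared near 1 corresponds to the case where a linear correction between channels is plausible; a value near 0 corresponds to the case where it is not.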
22
Results-CVSD-TIMIT
23
Thank you
