Speech Recognition - PowerPoint PPT Presentation

Provided by: MadhaviGan1
Learn more at: http://www.cs.cmu.edu

1
Speech Recognition
experiments on parallel data sets
  • Madhavi Ganapathiraju
  • Carnegie Mellon University
  • March-23-2002

2
How ASR works
  • Digitized speech: samples of the speech signal taken at
    different instants of time
  • It is the frequency content that differs between two
    sounds

Different sounds have different frequency features
3
Steps involved in ASR
  • Feature Extraction
  • Acoustic Dictionary Creation
  • Generation of Acoustic Models (training)
  • Decoding

4
Feature Extraction
  • Break up the digitized speech into windows of 25
    milliseconds
  • Find the frequency content in each window
    (Fourier analysis, etc)
  • Make a feature vector for each window
  • F = [f1, f2, f3, ..., fN]
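
The windowing and frequency-analysis steps above can be sketched in a few lines. This is only an illustration, not the front end used in the experiments: the function names are made up for this example, the windows do not overlap, and a naive DFT stands in for whatever Fourier analysis a real system applies.

```python
import cmath
import math

def frame_signal(samples, sample_rate, window_ms=25):
    """Split samples into consecutive, non-overlapping windows of window_ms."""
    size = int(sample_rate * window_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for a small demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def extract_features(samples, sample_rate, window_ms=25):
    """One feature vector (here: a magnitude spectrum) per window."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, sample_rate, window_ms)]

# Assumed test signal: 50 ms of a 400 Hz tone sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 400 * t / rate) for t in range(400)]
features = extract_features(tone, rate)

# The strongest frequency bin of the first window should sit at 400 Hz.
window_size = int(rate * 25 / 1000)
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * rate / window_size
```

Each entry of `features` plays the role of one vector F for one 25 ms window.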

5
Acoustic Dictionary Creation
  • List all possible words
  • (that are likely to be encountered)
  • Identify basic sounds (phonemes) from which
    words are built
  • Different for different languages
  • For each word list sequence of phonemes
  • Example: SUCCESS → s a k s e ss
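
An acoustic dictionary of this kind is essentially a mapping from words to phoneme sequences. The sketch below uses the SUCCESS entry from the slide; the CAT and MAT entries and the helper name `phoneme_inventory` are hypothetical and only there to make the example non-trivial.

```python
# Toy acoustic dictionary: word -> sequence of phonemes.
ACOUSTIC_DICT = {
    "SUCCESS": ["s", "a", "k", "s", "e", "ss"],  # entry taken from the slide
    "CAT": ["k", "a", "t"],                      # hypothetical entry
    "MAT": ["m", "a", "t"],                      # hypothetical entry
}

def phoneme_inventory(dictionary):
    """Collect the set of basic sounds used by all words in the dictionary."""
    return sorted({p for phones in dictionary.values() for p in phones})

inventory = phoneme_inventory(ACOUSTIC_DICT)
```

The inventory is the list of units that acoustic models have to be trained for.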

6
Training
  • Segment the speech signal into parts that
    belong to the same phoneme
  • There are machine learning algorithms to do this
    automatically
  • Use statistical learning algorithms such as
    neural networks or hidden Markov models
  • Show features extracted from 25-millisecond
    windows and their corresponding phonemes
    iteratively, until the patterns are learnt
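
To make the idea of learning from (feature vector, phoneme) pairs concrete, here is a deliberately simple stand-in. It is not a neural network or an HMM: it just averages the feature vectors seen for each phoneme and later picks the phoneme whose average is closest. The two-dimensional feature values are invented for the example.

```python
from collections import defaultdict

def train_means(labelled_frames):
    """Average the feature vectors observed for each phoneme label."""
    sums = {}
    counts = defaultdict(int)
    for features, phoneme in labelled_frames:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(features)
        sums[phoneme] = [s + f for s, f in zip(sums[phoneme], features)]
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}

def best_phoneme(features, means):
    """Return the phoneme whose mean vector is nearest (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(means, key=lambda p: sqdist(features, means[p]))

# Invented training data: one feature vector per window, with its phoneme label.
training = [
    ([1.0, 0.0], "a"), ([0.8, 0.2], "a"),
    ([0.0, 1.0], "s"), ([0.2, 0.8], "s"),
]
means = train_means(training)
guess = best_phoneme([0.7, 0.3], means)
```

Real acoustic models also capture how the features of a phoneme vary and change over time, which is what HMMs add over this nearest-mean sketch.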

7
Decoding
  • Extract features for new speech signal
  • Compare features to Statistical models learnt
  • Identify best-match phonemes
  • Use Dictionary in Reverse-mode to identify words
  • (what if there are confusable words?)
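
Using the dictionary in reverse mode amounts to inverting the word-to-phoneme mapping. The sketch below does that and shows how confusable words appear: two words that share a phoneme sequence land under the same key. The entries are hypothetical and use a coarse, made-up phoneme set in which IS and EASE collapse to the same sequence.

```python
from collections import defaultdict

def reverse_dictionary(dictionary):
    """Map each phoneme sequence to every word pronounced that way."""
    reverse = defaultdict(list)
    for word, phones in dictionary.items():
        reverse[tuple(phones)].append(word)
    return reverse

# Hypothetical entries with a coarse phoneme set.
dictionary = {
    "IS": ["i", "z"],
    "EASE": ["i", "z"],
    "IT": ["i", "t"],
}
reverse = reverse_dictionary(dictionary)

candidates = reverse[("i", "z")]    # more than one word: ambiguous
unambiguous = reverse[("i", "t")]   # exactly one word
```

When a lookup returns more than one candidate, the acoustic evidence alone cannot decide between them.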

8
Decoding with Language Models
  • "It is questionable" OR
  • "It ease questionable"?
  • P(W3 | W2, W1) for a language can be calculated
  • Use precomputed probabilities for different words to say
    P(is | it) > P(ease | it)
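
A minimal sketch of how precomputed probabilities can settle the choice between two acoustically similar words. It uses a two-word history for brevity, estimates the probability of a word given the previous word by relative frequency, and the counts are hypothetical, not taken from any corpus.

```python
def bigram_probability(prev, word, bigram_counts, unigram_counts):
    """Relative-frequency estimate of P(word | prev) from raw counts."""
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

def pick_word(prev, candidates, bigram_counts, unigram_counts):
    """Choose the candidate that is most probable after the previous word."""
    return max(
        candidates,
        key=lambda w: bigram_probability(prev, w, bigram_counts, unigram_counts),
    )

# Hypothetical counts for illustration only.
unigram_counts = {"it": 1000}
bigram_counts = {("it", "is"): 120, ("it", "ease"): 1}

p_is = bigram_probability("it", "is", bigram_counts, unigram_counts)
p_ease = bigram_probability("it", "ease", bigram_counts, unigram_counts)
chosen = pick_word("it", ["is", "ease"], bigram_counts, unigram_counts)
```

With these counts the decoder would prefer "is" after "it", even though both candidates match the sounds equally well.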

9
Issues in ASR
  • Single speaker? (PC dictation machine)
  • Multiple speakers? (Railway information centre)
  • Studio recording? On the road? Telephone? Cell
    phone?
  • Speech compression? Full bandwidth?

10
Difficulties in ASR
  • Studio-quality speech is clearer than cell-phone
    (CP) speech
  • But NOT always for ASR!
  • Models trained with CP speech do NOT recognise
    studio-quality speech any better!
  • Training and test data have to match

11
Experiments with MIS-MATCHED speech
12
Experimental setup: Speech type
  • Vocabulary: large (6000 words)
  • Speech type: read speech
  • Speakers: multiple speakers

13
Experimental setup: Speech quality
  • Full bandwidth (TIMIT): 16 kHz, 16 bits per sample
  • Telephone quality (N-TIMIT): 16 kHz, effective quality
    of 8 kHz sampling, 16 bits per sample
  • Far-field (desktop) microphone (FFM-TIMIT): 16 kHz,
    16 bits per sample
  • Cellular phone quality (C-TIMIT): 8 kHz, 16 bits per
    sample, with background noise

14
Speech Recognition System
  • Sphinx-3 trainer and decoder
  • 3-state HMMs to model phones
  • Context-independent training: A is the same in
    C A T and M A T
  • Context-dependent training: A in C A T is
    different from A in M A T
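
The difference between the two training modes can be shown by how modelling units are named. In the sketch below, a context-independent unit is just the sound itself, while a context-dependent unit also records its left and right neighbours. The tuple representation and the "#" word-boundary marker are assumptions made for this example, not Sphinx's internal notation, and letters stand in for phones as on the slide.

```python
def context_independent_units(phones):
    """Each sound is its own unit, regardless of neighbours."""
    return list(phones)

def context_dependent_units(phones, boundary="#"):
    """Each unit is (left neighbour, sound, right neighbour)."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

cat = ["C", "A", "T"]
mat = ["M", "A", "T"]

ci_same = context_independent_units(cat)[1] == context_independent_units(mat)[1]
cd_cat_a = context_dependent_units(cat)[1]
cd_mat_a = context_dependent_units(mat)[1]
```

Context-dependent units multiply the number of models to train, which is the price paid for modelling how neighbouring sounds influence each other.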

15
Feature Vectors
  • 13 mel-frequency cepstral (MFC) coefficients
  • Bandwidth limited to 130-6500 Hz or 200-3500 Hz,
    depending on the sampling frequency of the input
  • 31-40 mel-scale filters
  • Feature vectors of length 39 computed internally by Sphinx
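
The slide does not say how the 13 coefficients become 39 values. One common construction, assumed here, appends first-order differences (deltas) and second-order differences (delta-deltas) of the 13 coefficients, giving 13 x 3 = 39. The simple neighbour difference below is an illustration and may differ from the exact formula Sphinx uses.

```python
def deltas(frames):
    """Difference between the next and previous frame, clamped at the edges."""
    last = len(frames) - 1
    return [
        [b - a for a, b in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]

def stack_features(frames):
    """Concatenate static coefficients, deltas and delta-deltas per frame."""
    d = deltas(frames)
    dd = deltas(d)
    return [c + x + y for c, x, y in zip(frames, d, dd)]

# Invented input: 5 frames of 13 coefficients that rise by 1.0 per frame.
frames = [[float(t)] * 13 for t in range(5)]
stacked = stack_features(frames)
```

Each stacked vector holds the 13 original values first, then 13 deltas, then 13 delta-deltas.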

16
Language Modeling
  • Used the CMU-Cambridge Statistical Language Modeling
    toolkit
  • Modeled 3-grams,
    with back-off to 2-grams and 1-grams
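
The back-off idea can be sketched as follows: use the 3-gram estimate when the 3-gram was seen, otherwise fall back to the 2-gram, and finally to the 1-gram. This simplified version omits the discounting and back-off weights that a real back-off model applies, so its outputs are scores rather than properly normalised probabilities. The tiny training text and function names are invented for the example and are not the toolkit's API.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_score(w1, w2, w3, tri, bi, uni, total):
    """Score w3 after (w1, w2), backing off 3-gram -> 2-gram -> 1-gram."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[(w2,)]
    return uni[(w3,)] / total

# Invented training text.
tokens = "it is questionable it is fine it was fine".split()
uni = ngram_counts(tokens, 1)
bi = ngram_counts(tokens, 2)
tri = ngram_counts(tokens, 3)
total = len(tokens)

seen_trigram = backoff_score("it", "is", "questionable", tri, bi, uni, total)
to_bigram = backoff_score("questionable", "is", "fine", tri, bi, uni, total)
to_unigram = backoff_score("is", "questionable", "was", tri, bi, uni, total)
```

The three calls exercise each level in turn: a seen 3-gram, an unseen 3-gram with a seen 2-gram, and a case where only the 1-gram count is available.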

17
Results-TIMIT
18
Results-N-TIMIT
19
Results-FFM-TIMIT
20
Feature preprocessing
Linear regression would do
1st MFCC compared between NTIMIT and CVSD-TIMIT
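
If one coefficient in one channel is roughly a linear function of the same coefficient in the other channel, a least-squares line is enough to map between them. The sketch below fits such a line; the data points are made up (they lie exactly on a line) and are not the measurements behind the slide, and the direction of the mapping is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def apply_line(x, slope, intercept):
    """Map a coefficient from one channel to the other with the fitted line."""
    return slope * x + intercept

# Made-up values of one coefficient in two channels, frame by frame.
channel_a = [-2.0, -1.0, 0.0, 1.0, 2.0]
channel_b = [0.5 * x + 1.0 for x in channel_a]

slope, intercept = fit_line(channel_a, channel_b)
mapped = apply_line(4.0, slope, intercept)
```

When the scatter of the two channels is not close to a line, a fit like this leaves large residuals and a different mapping is needed.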
21
Feature preprocessing
Linear regression would NOT do
12th MFCC compared between NTIMIT and CVSD-TIMIT
22
Results-CVSD-TIMIT
23
Thank you