Title: 74.419 Artificial Intelligence 2004 Speech
Natural Language Processing
- Natural Language Processing
- written text as input
- sentences (well-formed)
- Speech Recognition
- acoustic signal as input
- conversion into written words
- Spoken Language Understanding
- analysis of spoken language (transcribed speech)
Speech and Natural Language Processing
- Areas in Natural Language Processing
- Morphology
- Grammar Parsing (syntactic analysis)
- Semantics
- Pragmatics
- Discourse / Dialogue
- Spoken Language Understanding
- Areas in Speech Recognition
- Signal Processing
- Phonetics
- Word Recognition
Speech Production and Reception
- Sound and Hearing
- change in air pressure → sound wave
- reception through inner-ear membrane / microphone
- break-up into frequency components: receptors in the cochlea / mathematical frequency analysis (e.g. Fast Fourier Transform, FFT) → frequency spectrum
- perception/recognition of phonemes and subsequently words (e.g. Neural Networks, Hidden Markov Models)
Speech Recognition Phases
- Speech Recognition
- acoustic signal as input
- signal analysis - spectrogram
- feature extraction
- phoneme recognition
- word recognition
- conversion into written words
Speech Signal
- Speech Signal
- composed of different (sine) waves with different frequencies and amplitudes
- Frequency - waves/second → like pitch
- Amplitude - height of wave → like loudness
- noise (not a sine wave)
- Speech Signal
- composite signal comprising different frequency components (see the sketch below)
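To make the composite nature of the signal concrete, here is a minimal Python sketch (not from the slides; the component frequencies, amplitudes, and noise level are arbitrary examples) that builds a waveform by summing sine waves and adding noise:

import numpy as np

# Minimal sketch: a composite signal as a sum of sine waves with different
# frequencies and amplitudes, plus some noise (all values are arbitrary).
sample_rate = 16000                                  # samples per second
t = np.arange(0, 0.05, 1.0 / sample_rate)            # 50 ms of signal
components = [(120, 1.0), (240, 0.5), (720, 0.3)]    # (frequency in Hz, amplitude)
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)
signal = signal + 0.05 * np.random.randn(len(t))     # noise: not a sine wave
print(len(signal), "samples, one amplitude value per sample")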
Waveform (fig. 7.20)
[Figure 7.20: waveform (amplitude/pressure over time) of the utterance "She just had a baby."]
Waveform for Vowel ae (fig. 7.21)
[Figure 7.21: waveform (amplitude/pressure over time) for the vowel [ae].]
Speech Signal Analysis
- Analog-Digital Conversion of Acoustic Signal
- Sampling in Time Frames ("windows")
- frequency: zero-crossings per time frame
- → e.g. 2 crossings/second is 1 Hz (1 wave)
- → e.g. 10 kHz needs sampling rate 20 kHz
- measure amplitudes of signal in time frame
- → digitized waveform
- separate different frequency components
- → FFT (Fast Fourier Transform)
- → spectrogram (see the sketch after this list)
- other frequency-based representations
- → LPC (Linear Predictive Coding)
- → Cepstrum
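As a rough sketch of this analysis chain (the function name spectrogram and the 25 ms / 10 ms frame and step sizes are illustrative choices, not values from the slides), the digitized signal is split into overlapping time frames, each frame is windowed and passed through an FFT, and the magnitude spectra are stacked into a spectrogram:

import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, step_ms=10):
    """Split a digitized signal into overlapping frames and FFT each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = np.hamming(frame_len)                   # taper each time frame
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))        # magnitude per frequency bin
        frames.append(spectrum)
    return np.array(frames)                          # rows: time frames, columns: frequency bins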
Waveform and Spectrogram (figs. 7.20, 7.23)
Waveform and LPC Spectrum for Vowel ae (figs. 7.21, 7.22)
[Figures 7.21, 7.22: waveform (amplitude/pressure over time) and LPC spectrum (energy over frequency, with formant peaks) for the vowel [ae].]
Speech Signal Characteristics
- From the Signal Representation derive, e.g.
- formants - dark stripes in the spectrum
- strong frequency components; characterize particular vowels; gender of speaker
- pitch - fundamental frequency
- baseline for higher-frequency harmonics like formants; gender characteristic (see the estimation sketch below)
- change in frequency distribution
- characteristic for e.g. plosives (form of articulation)
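One common way to estimate the fundamental frequency (pitch) of a voiced frame is autocorrelation; the sketch below does this under an assumed search range of 75-400 Hz (the range and the function name estimate_pitch are illustrative choices, not from the slides or lingWAVES):

import numpy as np

def estimate_pitch(frame, sample_rate, fmin=75, fmax=400):
    """Rough F0 (pitch) estimate for one voiced frame via autocorrelation.
    The frame should span at least a few pitch periods."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)                # shortest period considered
    lag_max = int(sample_rate / fmin)                # longest period considered
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag                    # fundamental frequency in Hz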
Video of glottis and speech signal in lingWAVES (from http://www.lingcom.de)
Phoneme Recognition
- Recognition Process based on
- features extracted from spectral analysis
- phonological rules
- statistical properties of language/ pronunciation
- Recognition Methods
- Hidden Markov Models
- Neural Networks
- Pattern Classification in general (see the sketch below)
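HMMs and neural networks are the methods named above; purely as a toy illustration of "pattern classification in general", the sketch below assigns a feature vector (e.g. one spectral frame) to the phoneme whose stored template is nearest. The phoneme labels and template values are made-up placeholders, not real acoustic data:

import numpy as np

# Hypothetical templates: one mean feature vector per phoneme (placeholder values).
templates = {
    "ae": np.array([0.9, 0.2, 0.1]),
    "iy": np.array([0.1, 0.8, 0.3]),
    "s":  np.array([0.2, 0.1, 0.9]),
}

def classify_frame(features):
    """Nearest-template (minimum Euclidean distance) classification of one frame."""
    return min(templates, key=lambda p: np.linalg.norm(features - templates[p]))

print(classify_frame(np.array([0.85, 0.25, 0.15])))  # -> "ae"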
Pronunciation Networks / Word Models as Probabilistic FAs (fig. 5.12)
Pronunciation Network for 'about' (fig. 5.13)
Word Recognition with Probabilistic FA / Markov Chain (fig. 5.14)
Viterbi Algorithm
- The Viterbi Algorithm finds an optimal sequence
of states in continuous Speech Recognition, given
an observation sequence of phones and a
probabilistic (weighted) FA. The algorithm
returns the path through the automaton which has
maximum probability and accepts the observed
sequence.
Viterbi Algorithm (fig. 5.19)
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s'] * b[s'](o_t)
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] ← new-score
          back-pointer[s',t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path
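The same dynamic-programming idea in compact, runnable Python (a sketch only: the tiny word model for "need", its states, and all transition and observation probabilities are invented for illustration, not taken from the figures):

def viterbi(observations, states, start, trans, emit):
    """Return the most probable state path for an observation sequence."""
    # V[t][s] = best score of any path that ends in state s at time t
    V = [{s: start.get(s, 0.0) * emit[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor state for s at time t
            prev, score = max(((p, V[t - 1][p] * trans[p].get(s, 0.0)) for p in states),
                              key=lambda x: x[1])
            V[t][s] = score * emit[s].get(observations[t], 0.0)
            back[t][s] = prev
    # backtrace from the highest-probability state in the final column
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Hypothetical word model for "need" (/n iy d/); all probabilities are made up.
states = ["n", "iy", "d"]
start = {"n": 1.0}
trans = {"n": {"n": 0.3, "iy": 0.7}, "iy": {"iy": 0.4, "d": 0.6}, "d": {"d": 1.0}}
emit = {s: {s: 0.9, "sil": 0.1} for s in states}     # each state mostly emits "its" phone
print(viterbi(["n", "n", "iy", "d"], states, start, trans, emit))  # ['n', 'n', 'iy', 'd']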
Speech Recognition
- acoustic signal / sound wave
- signal processing / analysis: filtering, sampling, spectral analysis (FFT) → frequency spectrum → features (phonemes + context)
- phoneme recognition (HMM, neural networks) → phonemes
- grammar or statistics → phoneme sequences / words
- grammar or statistics for likely word sequences → word sequence / sentence (see the bigram sketch below)
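The last step, choosing the likely word sequence, is commonly handled with n-gram statistics; below is a minimal bigram sketch in which all counts are invented (the example sentence is the one shown in fig. 7.20):

from collections import defaultdict

# Invented counts for a tiny bigram language model.
bigram_counts = {("she", "just"): 8, ("just", "had"): 5, ("had", "a"): 20, ("a", "baby"): 3}
unigram_counts = defaultdict(int, {"she": 10, "just": 9, "had": 25, "a": 40, "baby": 3})

def sequence_score(words):
    """Relative likelihood of a word sequence under the bigram model."""
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= bigram_counts.get((w1, w2), 0) / max(unigram_counts[w1], 1)
    return score

print(sequence_score(["she", "just", "had", "a", "baby"]))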
Speech Recognizer Architecture (fig. 7.2)
Speech Processing - Important Types and Characteristics
- single word vs. continuous speech
- unlimited vs. large vs. small vocabulary
- speaker-dependent vs. speaker-independent (training)
- Speech Recognition vs. Speaker Identification
Additional References
- Huang, X., A. Acero, H. Hon: Spoken Language Processing. A Guide to Theory, Algorithm, and System Development. Prentice-Hall, NJ, 2001.
- Figures taken from:
- Jurafsky, D. and J. H. Martin, Speech and Language Processing, Prentice-Hall, 2000, Chapters 5 and 7.
- lingWAVES (from http://www.lingcom.de)