Title: 74.419 Artificial Intelligence 2004 Speech
Natural Language Processing
- Natural Language Processing
- written text as input
- sentences (well-formed)
- Speech Recognition
- acoustic signal as input
- conversion into written words
- Spoken Language Understanding
- analysis of spoken language (transcribed speech)
Speech and Natural Language Processing
- Areas in Natural Language Processing
- Morphology
- Grammar Parsing (syntactic analysis)
- Semantics
- Pragmatics
- Discourse / Dialogue
- Spoken Language Understanding
- Areas in Speech Recognition
- Signal Processing
- Phonetics
- Word Recognition
Speech Production and Reception
- Sound and Hearing
- change in air pressure → sound wave
- reception through inner-ear membrane / microphone
- break-up into frequency components: receptors in the cochlea / mathematical frequency analysis (e.g. Fast Fourier Transform, FFT) → frequency spectrum
- perception/recognition of phonemes and subsequently words (e.g. Neural Networks, Hidden Markov Models)
Speech Recognition Phases
- Speech Recognition
- acoustic signal as input
- signal analysis - spectrogram
- feature extraction
- phoneme recognition
- word recognition
- conversion into written words
Speech Signal
- Speech Signal
- composed of different (sine) waves with different frequencies and amplitudes
- Frequency - waves/second → like pitch
- Amplitude - height of wave → like loudness
- noise (not a sine wave)
- Speech Signal
- composite signal comprising different frequency components (see the sketch below)
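To make the composite nature of the signal concrete, here is a minimal Python sketch (not from the slides; the component frequencies, amplitudes, and noise level are arbitrary examples) that builds a waveform by summing sine waves and adding noise:

import numpy as np

# Minimal sketch: a composite signal as a sum of sine waves with different
# frequencies and amplitudes, plus some noise (all values are arbitrary).
sample_rate = 16000                                  # samples per second
t = np.arange(0, 0.05, 1.0 / sample_rate)            # 50 ms of signal
components = [(120, 1.0), (240, 0.5), (720, 0.3)]    # (frequency in Hz, amplitude)
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)
signal = signal + 0.05 * np.random.randn(len(t))     # noise: not a sine wave
print(len(signal), "samples, one amplitude value per sample")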
Waveform (fig. 7.20)
[Figure 7.20: waveform (amplitude/pressure over time) of the utterance "She just had a baby."]
Waveform for Vowel ae (fig. 7.21)
[Figure 7.21: waveform (amplitude/pressure over time) for the vowel [ae].]
Speech Signal Analysis
- Analog-Digital Conversion of Acoustic Signal
- Sampling in Time Frames ("windows")
- frequency: zero-crossings per time frame
- → e.g. 2 crossings/second is 1 Hz (1 wave)
- → e.g. 10 kHz needs sampling rate 20 kHz
- measure amplitudes of signal in time frame
- → digitized waveform
- separate different frequency components
- → FFT (Fast Fourier Transform)
- → spectrogram (see the sketch after this list)
- other frequency-based representations
- → LPC (Linear Predictive Coding)
- → Cepstrum
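As a rough sketch of this analysis chain (the function name spectrogram and the 25 ms / 10 ms frame and step sizes are illustrative choices, not values from the slides), the digitized signal is split into overlapping time frames, each frame is windowed and passed through an FFT, and the magnitude spectra are stacked into a spectrogram:

import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, step_ms=10):
    """Split a digitized signal into overlapping frames and FFT each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = np.hamming(frame_len)                   # taper each time frame
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))        # magnitude per frequency bin
        frames.append(spectrum)
    return np.array(frames)                          # rows: time frames, columns: frequency bins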
Waveform and Spectrogram (figs. 7.20, 7.23)
Waveform and LPC Spectrum for Vowel ae (figs. 7.21, 7.22)
[Figures 7.21, 7.22: waveform (amplitude/pressure over time) and LPC spectrum (energy over frequency, with formant peaks) for the vowel [ae].]
Speech Signal Characteristics
- From the Signal Representation derive, e.g.
- formants - dark stripes in the spectrum
- strong frequency components; characterize particular vowels; gender of speaker
- pitch - fundamental frequency
- baseline for higher-frequency harmonics like formants; gender characteristic (see the estimation sketch below)
- change in frequency distribution
- characteristic for e.g. plosives (form of articulation)
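One common way to estimate the fundamental frequency (pitch) of a voiced frame is autocorrelation; the sketch below does this under an assumed search range of 75-400 Hz (the range and the function name estimate_pitch are illustrative choices, not from the slides or lingWAVES):

import numpy as np

def estimate_pitch(frame, sample_rate, fmin=75, fmax=400):
    """Rough F0 (pitch) estimate for one voiced frame via autocorrelation.
    The frame should span at least a few pitch periods."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)                # shortest period considered
    lag_max = int(sample_rate / fmin)                # longest period considered
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag                    # fundamental frequency in Hz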
Video of glottis and speech signal in lingWAVES (from http://www.lingcom.de)
Phoneme Recognition
- Recognition Process based on
- features extracted from spectral analysis
- phonological rules
- statistical properties of language/ pronunciation
- Recognition Methods
- Hidden Markov Models
- Neural Networks
- Pattern Classification in general (see the sketch below)
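HMMs and neural networks are the methods named above; purely as a toy illustration of "pattern classification in general", the sketch below assigns a feature vector (e.g. one spectral frame) to the phoneme whose stored template is nearest. The phoneme labels and template values are made-up placeholders, not real acoustic data:

import numpy as np

# Hypothetical templates: one mean feature vector per phoneme (placeholder values).
templates = {
    "ae": np.array([0.9, 0.2, 0.1]),
    "iy": np.array([0.1, 0.8, 0.3]),
    "s":  np.array([0.2, 0.1, 0.9]),
}

def classify_frame(features):
    """Nearest-template (minimum Euclidean distance) classification of one frame."""
    return min(templates, key=lambda p: np.linalg.norm(features - templates[p]))

print(classify_frame(np.array([0.85, 0.25, 0.15])))  # -> "ae"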
Pronunciation Networks / Word Models as Probabilistic FAs (fig. 5.12)
Pronunciation Network for 'about' (fig. 5.13)
Word Recognition with Probabilistic FA / Markov Chain (fig. 5.14)
Viterbi Algorithm
- The Viterbi Algorithm finds an optimal sequence
of states in continuous Speech Recognition, given
an observation sequence of phones and a
probabilistic (weighted) FA. The algorithm
returns the path through the automaton which has
maximum probability and accepts the observed
sequence.
Viterbi Algorithm (fig. 5.19)
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s'] * b[s'](o_t)
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1])) then
          viterbi[s',t+1] ← new-score
          back-pointer[s',t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path
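The same dynamic-programming idea in compact, runnable Python (a sketch only: the tiny word model for "need", its states, and all transition and observation probabilities are invented for illustration, not taken from the figures):

def viterbi(observations, states, start, trans, emit):
    """Return the most probable state path for an observation sequence."""
    # V[t][s] = best score of any path that ends in state s at time t
    V = [{s: start.get(s, 0.0) * emit[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor state for s at time t
            prev, score = max(((p, V[t - 1][p] * trans[p].get(s, 0.0)) for p in states),
                              key=lambda x: x[1])
            V[t][s] = score * emit[s].get(observations[t], 0.0)
            back[t][s] = prev
    # backtrace from the highest-probability state in the final column
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Hypothetical word model for "need" (/n iy d/); all probabilities are made up.
states = ["n", "iy", "d"]
start = {"n": 1.0}
trans = {"n": {"n": 0.3, "iy": 0.7}, "iy": {"iy": 0.4, "d": 0.6}, "d": {"d": 1.0}}
emit = {s: {s: 0.9, "sil": 0.1} for s in states}     # each state mostly emits "its" phone
print(viterbi(["n", "n", "iy", "d"], states, start, trans, emit))  # ['n', 'n', 'iy', 'd']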
Speech Recognition
- acoustic signal / sound wave
- signal processing / analysis: filtering, sampling, spectral analysis (FFT) → frequency spectrum → features (phonemes + context)
- phoneme recognition (HMM, neural networks) → phonemes
- grammar or statistics → phoneme sequences / words
- grammar or statistics for likely word sequences → word sequence / sentence (see the bigram sketch below)
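The last step, choosing the likely word sequence, is commonly handled with n-gram statistics; below is a minimal bigram sketch in which all counts are invented (the example sentence is the one shown in fig. 7.20):

from collections import defaultdict

# Invented counts for a tiny bigram language model.
bigram_counts = {("she", "just"): 8, ("just", "had"): 5, ("had", "a"): 20, ("a", "baby"): 3}
unigram_counts = defaultdict(int, {"she": 10, "just": 9, "had": 25, "a": 40, "baby": 3})

def sequence_score(words):
    """Relative likelihood of a word sequence under the bigram model."""
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= bigram_counts.get((w1, w2), 0) / max(unigram_counts[w1], 1)
    return score

print(sequence_score(["she", "just", "had", "a", "baby"]))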
Speech Recognizer Architecture (fig. 7.2)
Speech Processing - Important Types and Characteristics
- single word vs. continuous speech
- unlimited vs. large vs. small vocabulary
- speaker-dependent vs. speaker-independent (training)
- Speech Recognition vs. Speaker Identification
Additional References
- Huang, X., A. Acero, H. Hon: Spoken Language Processing. A Guide to Theory, Algorithm, and System Development. Prentice-Hall, NJ, 2001.
- Figures taken from:
- Jurafsky, D. and J. H. Martin, Speech and Language Processing, Prentice-Hall, 2000, Chapters 5 and 7.
- lingWAVES (from http://www.lingcom.de)