Title: Class 1 (August 29/08)
1 Class 1 (August 29/08)
- Course Review
- Introduction
- Oscar N. Garcia
2 Principles of Speech Production, Analysis and Recognition
- Instructor: Oscar N. Garcia, PhD
- Office: 249 NH, Suite 249A. Phone: (817) 272-5755. UTA email: ogarcia_at_uta.edu or ogarcia_at_unt.edu
- Office hours: Fridays, 12:30-1:00 PM and 4:30-5:30 PM
- Course Number and Section: EE 5359 - Topics in Signal Processing, Section 002: Principles of Speech Production, Analysis and Recognition (short title: Pr Spch Prod Anal Recog)
- Day and time: Fridays, 1:00 PM to 3:50 PM. Classroom: Nedderman Hall, NH106
3 Brief Description, Learning Outcomes and Pre-requisites
- Brief Description: Introduction to the production of human speech, the vocal tract, the hearing system, the units of speech, methods of analysis of speech signals, speech recognition technology, and computerized speech synthesis.
- Learning Outcomes: The student will understand how speech is produced and how it is perceived by humans, how it can be decomposed and analyzed as a signal, and the basis of the methods of speech recognition and speech synthesis.
- Pre-requisites: Undergraduate courses in integral and differential calculus, differential equations, and matrix algebra. A very introductory knowledge of signals, systems, frequency analysis and filters is expected (see Chapter 2 of the text).
4 Grading
- The grade for the course is determined by:
- Two exams: a mid-term exam (30% of the grade) and a comprehensive final examination (40% of the grade).
- Homework, assigned with a value of 10% of the grade.
- A class project, with a written report and a presentation, for 20% of the course grade.
- No curving of the grade is anticipated.
- Students are expected to attend all classes, with a maximum of two unexcused absences; the responsibility is on their shoulders to cover the material on their own if a class is missed.
- University drop policy applies.
- No make-up tests for unexcused absences.
5 Text and references
- Main Textbook: D. O'Shaughnessy, Speech Communications: Human and Machine, Second Edition, IEEE Press, 2000. There are substantial WWW references on p. 469 and following.
- Other reference texts:
- S. Furui, Digital Speech Processing, Synthesis and Recognition, Marcel Dekker Inc., 2001.
- K.-F. Lee, Automatic Speech Recognition: The Development of the SPHINX System, Kluwer, 1989.
- D. B. Roe and J. G. Wilpon, Eds., Voice Communication between Humans and Machines, National Academy of Sciences, 1994.
6 Objectives of the Course
- An introductory overview of speech production and perception and of acoustic phonetics, with coverage of models of vocal tract sound production and of the mechanisms of sound perception. The central topic will be automatic speech recognition (ASR) and the models used for it. Some aspects of speech coding, speech synthesis and speaker identification will also be covered.
7 This class overviews the course
- Speech communication:
- physics (acoustics, speech analysis),
- physiology (speech production and perception),
- and psychosociological (semantics, psychoacoustics, prosody and suprasegmental) aspects.
- Communication with machines: an introduction to speech recognition and to speech synthesis.
- Features of speech synthesis and speaker recognition (speaker characterization).
8 Signals: Speech is a signal
- Speech is manifested as air pressure waves that are perceived by the human hearing system. The pressure waves may be transformed into electrical signals by microphone(s). The pressure waves follow the physical laws of acoustics.
- A (time-varying) signal is represented by a (discrete or analog) function of time. Examples of signals that we will use are:
- a) An impulse
- b) A step function
- c) A (displaced or shifted) exponential
- d) Sine function
- e) Cosine function
- Notice continuous vs. discrete (sampled) signals in the next picture
9 [Figure: impulse, step, and exponential a^(-n) (plus a constant) signals]
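The elementary signals listed on slide 8 can be sketched in discrete (sampled) form as follows. This is only an illustrative sketch: the function names, the base a = 2, the shift, the constant offset, and the 100 Hz tone sampled at 800 Hz are all assumptions chosen for the example, not values from the slides.

```python
import math

def impulse(n, shift=0):
    """Unit impulse: 1 at n == shift, 0 elsewhere."""
    return 1.0 if n == shift else 0.0

def step(n, shift=0):
    """Unit step: 1 for n >= shift, 0 before."""
    return 1.0 if n >= shift else 0.0

def shifted_exponential(n, a=2.0, shift=0, offset=0.0):
    """Decaying exponential a**-(n - shift) starting at n == shift, plus a constant."""
    return offset + (a ** -(n - shift) if n >= shift else 0.0)

def sample_sinusoid(func, freq_hz, sample_rate_hz, num_samples):
    """Sample the continuous signal func(2*pi*f*t) at t = n / sample_rate_hz."""
    return [func(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

# Discrete time axis n = -2 .. 5 (assumed range, for illustration only).
n_values = range(-2, 6)
impulse_seq = [impulse(n) for n in n_values]
step_seq = [step(n) for n in n_values]
exp_seq = [shifted_exponential(n, a=2.0, shift=1, offset=0.5) for n in n_values]

# A 100 Hz sine and cosine sampled at 800 Hz: 8 samples cover one period.
sine_seq = sample_sinusoid(math.sin, 100.0, 800.0, 8)
cosine_seq = sample_sinusoid(math.cos, 100.0, 800.0, 8)
```

Each list is the sampled counterpart of a continuous signal: only the values at integer n (or at t = n / sample rate) are kept.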
11 Digitizing the speech signal
- There are a number of advantages in going through an A/D conversion:
- It is easier to combat noise, up to a point.
- The digital signal may be handled by DSP methods in hardware or by programs (filtering, etc.).
- It is easier to analyze, compress, code, etc.
- Unfortunately, it suffers from quantization noise of a magnitude that depends on the number of quantization levels (some 2^8 = 256 levels, for example).
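The dependence of quantization noise on the number of levels can be illustrated with a small experiment. The sketch below assumes a uniform mid-rise quantizer and a full-scale sine test tone (neither is specified in the slides), and compares the measured signal-to-quantization-noise ratio with the usual rule of thumb of about 6.02 dB per bit plus 1.76 dB for a full-scale sine.

```python
import math

def quantize(sample, bits, full_scale=1.0):
    """Uniform mid-rise quantizer with 2**bits levels over [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    index = math.floor(sample / step)
    # Clip to the available levels so out-of-range inputs saturate.
    index = max(-(levels // 2), min(levels // 2 - 1, index))
    return (index + 0.5) * step

def sqnr_db(bits, num_samples=10000, cycles_per_sample=0.01234):
    """Measured signal-to-quantization-noise ratio (dB) for a full-scale sine."""
    signal_power = 0.0
    noise_power = 0.0
    for n in range(num_samples):
        sample = math.sin(2.0 * math.pi * cycles_per_sample * n)
        error = sample - quantize(sample, bits)
        signal_power += sample * sample
        noise_power += error * error
    return 10.0 * math.log10(signal_power / noise_power)

for bits in (4, 8, 12):
    rule_of_thumb = 6.02 * bits + 1.76
    print(f"{bits:2d} bits: measured {sqnr_db(bits):5.1f} dB, "
          f"rule of thumb {rule_of_thumb:5.1f} dB")
```

With 8 bits (256 levels) the quantization step is 2/256 of the full-scale range, so each extra bit halves the step and lowers the noise by roughly 6 dB.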
12 The physiology of speech
[Figure labels: the relation between signs and their users; perception; production]
13 Speech recognition by machine can be aided by understanding certain aspects of the language spoken
- Compare machine recognition with the previous figure (Physiology of Speech)
- Semiotics (the study or science of signs and symbols), for example, covers:
- Pragmatics
- Semantics
- Syntactics
- We will only briefly cover some aspects of some of these issues, as needed in the course
14 Psychosociological (psychoacoustics, semantics, prosody and suprasegmental) aspects
- When we model natural language for recognition it is advantageous to consider not only the physical aspects but also other broad aspects, such as:
- Psychoacoustics (Chapter 5 of text): relates the acoustic signal to what the human perceives (for example, speech synthesis by rules is intelligible but not natural).
- Semantics: meanings and the development of word meanings (helpful in disambiguation).
- Prosody: speech emphasis for meaning, style (read speech vs. conversational), stress, pauses, duration, place and manner of articulation, rhythm.
- Suprasegmentals: Prosody is not localized to a specific sound segment, and its characteristics often extend beyond a word to a whole portion of a sentence. The stress is related to semantics, type of discourse and pragmatics. Suprasegmental effects go beyond the individual segment features that common recognizers consider.
15 Communication with machines: an introduction to speech recognition (STT) and to speech synthesis (TTS)
- Some salient issues of speech recognition:
- Do we want to recognize sentences or isolated words?
- Is it speaker dependent or speaker independent?
- Speech varies not only from speaker to speaker but even for the same speaker, and with other factors (noise?)
- Do we use dynamic time warping or Hidden Markov Models? HMM is the predominant approach today
- Do we limit the vocabulary?
- We need a balanced database to train the HMM (like TIMIT? Other?)
- Do we take advantage of a language model?
- MATLAB has HMM software
16 DTW and HMM
- DTW uses Bellman's dynamic programming optimization to nonlinearly change the time axis to match reference patterns of parameters. Very popular in the 1980s, it has only limited applications today.
- A (plain) Markov Model considers the probabilities of a state transition given the present state. In a Hidden Markov Model we consider the same probabilities without knowing the current state: we observe only the output of the state machine (following all feasible paths and self loops). The terms output, observation and emission all refer to the same result of the HMM.
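The dynamic programming idea behind DTW can be sketched in a few lines. The version below is a minimal illustration, assuming scalar features, an absolute-difference local cost, and the simplest step pattern (match, insertion, deletion); real recognizers compare feature vectors and often add path constraints. The function name and the example sequences are made up for the illustration.

```python
def dtw_distance(reference, test):
    """Accumulated cost of the best nonlinear time alignment of two sequences."""
    infinity = float("inf")
    rows, cols = len(reference), len(test)
    # cost[i][j] = best accumulated cost aligning reference[:i] with test[:j].
    cost = [[infinity] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(reference[i - 1] - test[j - 1])
            # Bellman recursion: extend the cheapest of the three predecessors.
            cost[i][j] = local + min(cost[i - 1][j],      # reference frame repeated
                                     cost[i][j - 1],      # test frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]

# Hypothetical parameter tracks: the same pattern spoken slowly, and a different one.
reference_pattern = [1, 3, 4, 3, 1]
slow_version = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]
other_pattern = [4, 3, 1, 1, 3]

print(dtw_distance(reference_pattern, slow_version))
print(dtw_distance(reference_pattern, other_pattern))
```

Because the time axis is warped, the slowly spoken version aligns with the reference at zero cost, while the different pattern does not.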
17 An HMM is defined by
- A set of states with distinguished initial and final state(s)
- A set of state transition probabilities
- A matrix of probabilities (one matrix per transition or, more frequently, one per state) of emitting the possible outputs when moving along that transition
- The following is assumed: a) the next state depends only on the current state (Markov assumption), and b) the probability that a particular symbol is emitted at a given time depends only on the transition taken at that time and is independent of the past.
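The ingredients of this definition can be written down directly as tables. The sketch below assumes the more common per-state emission form, a three-state left-to-right topology, and two output symbols; all state names, symbols and probabilities are invented for illustration. It also generates an output sequence to show that only the emissions, not the states, would be visible to an observer.

```python
import random

STATES = ("s1", "s2", "s3")          # s1 is the initial state, s3 the final state
SYMBOLS = ("a", "b")
INITIAL = {"s1": 1.0, "s2": 0.0, "s3": 0.0}
TRANSITION = {                       # left-to-right: self loop or move forward
    "s1": {"s1": 0.5, "s2": 0.5, "s3": 0.0},
    "s2": {"s1": 0.0, "s2": 0.5, "s3": 0.5},
    "s3": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
}
EMISSION = {                         # one output distribution per state
    "s1": {"a": 0.9, "b": 0.1},
    "s2": {"a": 0.5, "b": 0.5},
    "s3": {"a": 0.1, "b": 0.9},
}

def rows_sum_to_one(table):
    """Every row of a probability table must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())

def draw(distribution, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(distribution)
    weights = [distribution[key] for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]

def generate(length, rng):
    """Return (hidden state sequence, observed output sequence)."""
    state = draw(INITIAL, rng)
    states, outputs = [], []
    for _ in range(length):
        states.append(state)
        outputs.append(draw(EMISSION[state], rng))   # depends only on current state
        state = draw(TRANSITION[state], rng)          # Markov assumption
    return states, outputs

states, outputs = generate(12, random.Random(0))
print("hidden states:", states)
print("observations: ", outputs)
```

A recognizer sees only the second printed line; the first is the hidden path that the decoding problem tries to recover.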
18 Problems that may be tackled with HMMs
- Given an HMM model and a set of outputs, estimate the probability that the model generated the observations (evaluation)
- Given an HMM model and a set of outputs, what is the most likely path followed? (decoding)
- Given a set of outputs and a model, what are the transition probabilities and output symbol probability distribution matrix that make it highly probable that the model generated the outputs? (learning)
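The first two problems have standard dynamic programming solutions: the forward algorithm for evaluation and the Viterbi algorithm for decoding. The sketch below applies both to a small two-state model with per-state emissions; the model and its probabilities are invented for illustration. The learning problem is usually addressed with the Baum-Welch (EM) procedure, which is not shown here.

```python
STATES = ("s1", "s2")
INITIAL = {"s1": 0.6, "s2": 0.4}
TRANSITION = {"s1": {"s1": 0.7, "s2": 0.3},
              "s2": {"s1": 0.4, "s2": 0.6}}
EMISSION = {"s1": {"a": 0.5, "b": 0.5},
            "s2": {"a": 0.1, "b": 0.9}}

def forward_probability(outputs):
    """Evaluation: probability that the model generated the outputs (sum over all paths)."""
    alpha = {s: INITIAL[s] * EMISSION[s][outputs[0]] for s in STATES}
    for symbol in outputs[1:]:
        alpha = {s: EMISSION[s][symbol]
                    * sum(alpha[p] * TRANSITION[p][s] for p in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi_path(outputs):
    """Decoding: most likely state path and its joint probability with the outputs."""
    best = {s: INITIAL[s] * EMISSION[s][outputs[0]] for s in STATES}
    back_pointers = []
    for symbol in outputs[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            # Best predecessor of s at this time step.
            prev = max(STATES, key=lambda p: best[p] * TRANSITION[p][s])
            pointers[s] = prev
            new_best[s] = best[prev] * TRANSITION[prev][s] * EMISSION[s][symbol]
        back_pointers.append(pointers)
        best = new_best
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best[path[-1]]

observed = ["b", "b", "b"]
print("P(outputs | model) =", forward_probability(observed))
print("most likely path   =", viterbi_path(observed))
```

The forward probability sums over every feasible path, while Viterbi keeps only the best one, so the Viterbi probability can never exceed the forward probability.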
19 MATLAB has some HMM software
20 The training set and the test set should be dichotomous (disjoint)
Never use examples from the training set to test the trained machine
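One simple way to keep the two sets apart is to shuffle the examples once and cut the list in two. The sketch below assumes a list of hypothetical utterance identifiers and a 20% test fraction; in practice a speech corpus is often split by speaker as well, which this sketch does not do.

```python
import random

def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical utterance identifiers, for illustration only.
utterances = [f"utt{index:03d}" for index in range(10)]
train_set, test_set = split_train_test(utterances)

# The two sets must share no example.
assert not set(train_set) & set(test_set)
print("train:", train_set)
print("test: ", test_set)
```

Fixing the seed makes the split reproducible, so the same test examples are held out every time the model is retrained.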
21 A few issues of speech synthesis
- Why not use recorded speech segments?
- It is possible to use recorded speech for Q&A systems. It is particularly good when the possible answers are limited.
- Issue of co-articulation
- Issue of prosody
- Issue of naturalness
- It takes some 50 parameters to synthesize natural sounding speech. Features are needed to identify a speaker, given a population of speakers
- Synthesis is not difficult at the expense of naturalness (the toothpaste in the tube)
22 Speaker recognition (speaker characterization)
- In contrast with speech recognition, we want to emphasize the differences between speakers
- Three main sources of speaker variability:
- Vocal tract and vocal folds shape
- Speech style (dynamics of phones, co-articulation and speech rate): possible to mimic
- Choice of words or phrases (hard to verify)
23 Summary of Class 1: In this presentation we have
- Presented the plan for the course, including the expected outcomes, the requirements, the intended grading procedures, the recommended text and references, and the course objectives, and
- Given a brief overview of the course with emphasis on the three main topics:
- Speech production and perception (background)
- Coding and automatic speech recognition (central topic)
- Aspects of speech synthesis and speaker recognition
24 Epilogue to the first class
The subject matter of this course is very extensive, and the course may be given a variety of emphases while preserving the basic principles. Now that you have seen the outline of the course, please write on a piece of paper, with your name (undergrad/grad?) and preferred email address, how you think this course would benefit you and why. It would help me emphasize a little more the areas of interest to the group. Thank you!