Class 1 (August 29/08)

About This Presentation

Slides: 25

Provided by: ogar1

Transcript and Presenter's Notes

1
Class 1 (August 29/08)

Course Review
Introduction
Oscar N. Garcia

2
Principles of Speech Production, Analysis and
Recognition

Instructor: Oscar N. Garcia, PhD
Office: 249 NH, Suite 249A. Phone: (817) 272-5755
UTA email: ogarcia_at_uta.edu or ogarcia_at_unt.edu
Office hours: Fridays, 12:30-1:00 PM and 4:30-5:30 PM
Course number and section: EE 5359 - Topics in
Signal Processing, Section 002, Principles of
Speech Production, Analysis and Recognition
(short title: Pr Spch Prod Anal Recog)
Day and time: Fridays, 1:00 PM to 3:50 PM
Classroom: Nedderman Hall, NH106

3
Brief Description, Learning Outcomes and
Pre-requisites

Brief Description: Introduction to the production
of human speech, the vocal tract, the hearing system,
the units of speech, methods of analysis of
speech signals, speech recognition technology, and
computerized speech synthesis.
Learning Outcomes: The student will understand
how speech is produced and how it is perceived by
humans, how it can be decomposed and analyzed as
a signal, and the basis of the methods of speech
recognition and speech synthesis.
Pre-requisites: Undergraduate courses in integral
and differential calculus, differential
equations, and matrix algebra. A very introductory
knowledge of signals, systems, frequency analysis
and filters is expected (see Chapter 2 of the text).

4
Grading

The grade for the course is determined by:
Two exams: a mid-term exam (30% of the grade)
and a comprehensive final examination (40% of
the grade).
Homework, which will be assigned with a value of 10% of
the grade.
A class project, undertaken with a written
report and a presentation, for 20% of the course
grade.
No curving of the grade is anticipated.
Students are expected to attend all classes, with
a maximum of two unexcused absences. The
responsibility is on their shoulders to cover the
material on their own if a class is missed.
University drop policy applies.
No make-up tests for unexcused absences.

5
Text and references

Main Textbook: D. O'Shaughnessy, Speech
Communications: Human and Machine, Second
Edition, IEEE Press, 2000. There are substantial
WWW references on p. 469 and following.
Other reference texts:
Sadaoki Furui, Digital Speech Processing, Synthesis and
Recognition, Marcel Dekker Inc., 2001.
Kai-Fu Lee, Automatic Speech Recognition: The Development of
the SPHINX System, Kluwer, 1989.
D. B. Roe and J. G. Wilpon, Eds., Voice Communication between
Humans and Machines, National Academy of Sciences, 1994.

6
Objectives of the Course

An introductory overview of speech production
and perception and acoustic phonetics, with
coverage of models of the vocal tract sound
production and the mechanisms of sound
perception. The central topic will be automatic
speech recognition (ASR) and the models used for
it. Some aspects of speech coding, speech
synthesis and speaker identification will also be covered.

7
This class overviews the course

Speech communication:
physics (acoustics, speech analysis),
physiology (speech production and perception)
and psychosociological (semantics,
psychoacoustics, prosody and suprasegmental)
aspects.
Communication with machines: an introduction to
speech recognition and to speech synthesis.
Features of speech synthesis and speaker
recognition (speaker characterization).

8
Signals: Speech is a signal

Speech is manifested as air pressure waves that
are perceived by the human hearing system. The
pressure waves may be transformed to electrical
signals by microphone(s). The pressure waves
follow the physical laws of acoustics.
A (time varying) signal is represented by a
(discrete or analog) function of time. Examples
of signals that we will use are
a) An impulse
b) A step function
c) A (displaced or shifted) exponential
d) Sine function
e) Cosine function
Notice continuous vs. discrete (sampled) signals
in the next picture

9
[Figure: plots of an impulse, a step, and an exponential a^(-n) plus a constant]
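
The example signals listed on the previous slide can be generated as discrete (sampled) sequences. The following is a minimal sketch, not part of the original slides; the function names, the base a = 2, the offset 0.5 and the sampling rate are illustrative assumptions.

```python
import math

def impulse(n, k=0):
    """Unit impulse: 1 at sample k, 0 elsewhere."""
    return [1.0 if i == k else 0.0 for i in range(n)]

def step(n, k=0):
    """Unit step: 0 before sample k, 1 from sample k onward."""
    return [1.0 if i >= k else 0.0 for i in range(n)]

def shifted_exponential(n, a=2.0, offset=0.5):
    """Decaying exponential a**(-i) plus a constant offset (assumed values)."""
    return [a ** (-i) + offset for i in range(n)]

def sinusoid(n, freq_hz, fs_hz, phase=0.0):
    """Samples of sin(2*pi*f*t + phase) taken every 1/fs seconds."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz + phase) for i in range(n)]

def cosine(n, freq_hz, fs_hz):
    """A cosine is a sine advanced by a quarter period."""
    return sinusoid(n, freq_hz, fs_hz, phase=math.pi / 2)

# A 1 kHz tone sampled at 8 kHz gives 8 samples per period.
tone = sinusoid(16, freq_hz=1000, fs_hz=8000)
```

Each list holds only the sample values; the continuous-time signal is recovered conceptually by associating sample i with time i/fs.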
10
(No Transcript)
11
Digitizing the speech signal

There are a number of advantages in going through
an A/D conversion:
It is easier to combat noise, up to a point.
The digital signal may be handled by DSP methods
in hardware or by programs (filtering, etc.).
It is easier to analyze, compress, code, etc.
Unfortunately, the digital signal suffers from quantization noise
of a magnitude that depends on the number of
quantization levels (for example, 2^8 = 256 levels).
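
To illustrate how quantization noise depends on the number of levels, here is a small sketch of a uniform quantizer, assuming a signal confined to [-1, 1) and a test tone chosen only for illustration. None of the names or values below come from the slides.

```python
import math

def quantize(x, bits=8, full_scale=1.0):
    """Uniform quantizer over [-full_scale, +full_scale) with 2**bits levels."""
    levels = 2 ** bits
    step = 2 * full_scale / levels
    # Index of the quantization cell, clipped to the valid range.
    idx = min(levels - 1, max(0, int(math.floor((x + full_scale) / step))))
    # Reconstruct at the center of the cell.
    return -full_scale + (idx + 0.5) * step

def snr_db(signal, bits):
    """Signal-to-quantization-noise ratio in dB for the given word length."""
    noise = [s - quantize(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(e * e for e in noise)
    return 10 * math.log10(p_signal / p_noise)

# Assumed test signal: a 440 Hz tone sampled at 8 kHz for one second.
tone = [0.9 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
for bits in (4, 8, 12):
    print(bits, "bits:", round(snr_db(tone, bits), 1), "dB")
```

For a uniform quantizer the signal-to-noise ratio grows by roughly 6 dB for each added bit, which is why the number of levels sets the noise magnitude.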

12
The physiology of speech
[Figure labels: Production; Perception; "The relation between signs and their users"]
13
Speech recognition by machine can be aided by
understanding certain aspects of the language
spoken

Compare machine recognition with the previous
figure (Physiology of Speech)
Semiotics (the study or science of signs and
symbols), for example, covers
Pragmatics
Semantics
Syntactics
We will only briefly cover some aspects of some
of these issues as needed in the course

14
Psychosociological (psychoacoustics, semantics,
prosody and suprasegmental) aspects

When we model natural language for recognition it
is advantageous to consider not only the physical
aspects but also other broad aspects, such as:
Psychoacoustics (Chapter 5 of text): relates the
acoustic signal to what the human perceives (for
example, speech synthesis by rules is
intelligible but not natural).
Semantics: meanings and the development of word
meanings (helpful in disambiguation).
Prosody: speech emphasis for meaning, style
(read speech vs. conversational), stress, pauses,
duration, place and manner of articulation,
rhythm.
Suprasegmentals: Prosody is not localized to a
specific sound segment, and its characteristics
often extend beyond a word to a whole portion of
a sentence. The stress is related to semantics,
type of discourse and pragmatics. Suprasegmental
effects go beyond the individual segments and
features that common recognizers consider.

15
Communication with machines: an introduction to
speech recognition (STT) and to speech synthesis
(TTS)

Some salient issues of speech recognition
Do we want to recognize sentences or isolated
words?
Is it speaker dependent or speaker independent?
Speech varies not only with the speaker but even
with the same speaker and other (noise?) factors
Do we use dynamic time warping or Hidden Markov
Models? HMM is the predominant approach today
Do we limit the vocabulary?
We need a balanced database to train the HMM
(like TIMIT? Other?)
Do we take advantage of a language model?
MATLAB has HMM software

16
DTW and HMM

DTW uses Bellman's dynamic programming
optimization to nonlinearly warp the time axis
so that the input matches reference patterns of
parameters. Very popular in the 1980s, it has only
limited applications today.
A (plain) Markov Model considers the
probabilities of a state transition given the
present state. In a Hidden Markov Model we
consider the same probabilities, but the current
state is not observed directly: it must be inferred
only from the output of the state machine (following
all feasible paths and self loops). The terms output,
observation and emission all refer to the same
result of the HMM.
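
The dynamic programming idea behind DTW can be sketched in a few lines. This is an illustrative version only: it assumes scalar features and an absolute-difference local cost, whereas practical recognizers compare feature vectors and often add path constraints.

```python
def dtw_distance(ref, test):
    """Cumulative cost of the best monotonic alignment between two sequences."""
    inf = float("inf")
    n, m = len(ref), len(test)
    # cost[i][j] = best cost of aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - test[j - 1])
            # Bellman recursion: extend the cheapest of the three predecessors
            # (advance in ref only, in test only, or in both).
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

# The same pattern spoken more slowly (one frame repeated) still matches exactly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because a repeated frame can be absorbed by the warping, the stretched pattern aligns at zero cost, which is exactly the time-axis flexibility described above.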

17
An HMM is defined by

A set of states with distinguished initial and
final state(s)
A set of state transition probabilities
A matrix of probabilities (one matrix per
transition or, more frequently, one per state) of
emitting the possible outputs when moving along
that transition
The following is assumed: a) the next state
depends only on the current state
(Markov assumption), and b) the probability that a
particular symbol is emitted at a given time
depends only on the transition taken at that time
and is independent of the past.
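
As a concrete illustration of this definition, the sketch below stores an HMM as three tables and generates an output sequence from it. It assumes the more frequent variant with one emission distribution per state, and it does not model a distinguished final state; the class name, field names and numbers are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class HMM:
    initial: list      # initial[i] = P(first state is i)
    transition: list   # transition[i][j] = P(next state j | current state i)
    emission: list     # emission[i][k] = P(symbol k | state i)

    def validate(self, tol=1e-9):
        """Every row must be a probability distribution."""
        rows = [self.initial] + self.transition + self.emission
        return all(abs(sum(r) - 1.0) < tol and min(r) >= 0 for r in rows)

    def sample(self, length, rng):
        """Generate (states, symbols); only the symbols would be observed."""
        states, symbols = [], []
        state = rng.choices(range(len(self.initial)), weights=self.initial)[0]
        for _ in range(length):
            states.append(state)
            # Assumption b): the symbol depends only on the current state.
            symbols.append(rng.choices(range(len(self.emission[state])),
                                       weights=self.emission[state])[0])
            # Assumption a): the next state depends only on the current state.
            state = rng.choices(range(len(self.transition[state])),
                                weights=self.transition[state])[0]
        return states, symbols

# Hypothetical two-state, left-to-right model with two output symbols.
model = HMM(initial=[1.0, 0.0],
            transition=[[0.7, 0.3], [0.0, 1.0]],
            emission=[[0.9, 0.1], [0.2, 0.8]])
states, symbols = model.sample(10, random.Random(0))
```

The state list is returned only for inspection; in recognition the states stay hidden and only the symbols are available.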

18
Problems that may be tackled with HMMs

Given an HMM and a sequence of outputs, estimate
the probability that the model generated the
observations (evaluation)
Given an HMM and a sequence of outputs, what is
the most likely path followed? (decoding)
Given a sequence of outputs and a model structure, what are the
transition probabilities and output symbol
probability distribution matrix that make it
highly probable the model generated the outputs?
(learning)
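
The first two problems have compact dynamic programming solutions, commonly known as the forward algorithm (evaluation) and the Viterbi algorithm (decoding). The sketch below assumes per-state emission probabilities and a small hypothetical model; the learning problem, usually handled by Baum-Welch re-estimation, is not shown.

```python
def forward_probability(initial, transition, emission, observations):
    """Evaluation: P(observations | model), summing over all state paths."""
    n = len(initial)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[i] * transition[i][j] for i in range(n)) * emission[j][obs]
                 for j in range(n)]
    return sum(alpha)

def viterbi_path(initial, transition, emission, observations):
    """Decoding: the single most likely state sequence."""
    n = len(initial)
    delta = [initial[i] * emission[i][observations[0]] for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        # Best predecessor of each state j at this time step.
        prev = [max(range(n), key=lambda i, j=j: delta[i] * transition[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * transition[prev[j]][j] * emission[j][obs]
                 for j in range(n)]
        backpointers.append(prev)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(backpointers):
        state = prev[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-state, left-to-right model with two output symbols.
initial = [1.0, 0.0]
transition = [[0.7, 0.3], [0.0, 1.0]]
emission = [[0.9, 0.1], [0.2, 0.8]]
observations = [0, 0, 1, 1]
print(forward_probability(initial, transition, emission, observations))
print(viterbi_path(initial, transition, emission, observations))
```

For long observation sequences the products underflow, so practical implementations work with log probabilities or rescale at each step.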

19
MATLAB has some HMM software
20
The training set and the test set should be
disjoint
Never use examples from the training set to test
the trained machine
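
One simple way to honor this rule is to shuffle the data once and cut it into two non-overlapping parts. The sketch below assumes each example has a unique identifier; the function name, the 20% test fraction and the seed are illustrative choices, not part of the slides.

```python
import random

def split_dataset(utterance_ids, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into two non-overlapping parts."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

train_ids, test_ids = split_dataset(range(100))
# No example may appear on both sides of the split.
assert not set(train_ids) & set(test_ids)
```

For a speaker-independent recognizer it is usually safer to split by speaker rather than by utterance, so that no test speaker was heard during training.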
21
A few issues of speech synthesis

Why not use recorded speech segments?
It is possible to use recorded speech for QA
systems. Particularly good to limit possible
answers.
Issue of co-articulation
Issue of prosody
Issue of naturalness
It takes some 50 parameters to synthesize natural
sounding speech. Features are needed to identify
a speaker, given a population of speakers
Synthesis is not difficult if naturalness is
sacrificed (the toothpaste in the tube)

22
Speaker recognition (speaker characterization).

In contrast with speech recognition we want to
emphasize the differences between speakers
Three main sources of speaker variability
Shape of the vocal tract and vocal folds
Speech style (dynamics of phones, co-articulation
and speech rate), which is possible to mimic
Choice of words or phrases (hard to verify)

23
Summary of Class 1: In this presentation we have

Presented the plan for the course, including the
expected outcomes, the requirements, the intended
grading procedures, the recommended text and
references, and the course objectives, and
Given a brief overview of the course with
emphasis on the three main topics
Speech production and perception (background)
Coding and automatic speech recognition (central
topic)
Aspects of speech synthesis and speaker
recognition

24
Epilogue to the first class
The subject matter of this course is very
extensive and the course may be given a variety
of emphases while preserving the basic
principles. Now that you have seen the outline of
the course, please write on a piece of paper, with
your name (undergrad or grad?) and preferred email
address, how you think this course would
benefit you and why. It would help me emphasize
a little more the areas of interest to the group.
Thank you!
