Class 1 (August 29/08): Presentation Transcript

1
Class 1 (August 29/08)
  • Course Review
  • Introduction
  • Oscar N. Garcia

2
Principles of Speech Production, Analysis and
Recognition
  • Instructor: Oscar N. Garcia, PhD
  • Office: 249 NH, Suite 249A. Phone: (817) 272-5755.
    UTA email: ogarcia@uta.edu or ogarcia@unt.edu
  • Office hours: Fridays, 12:30-1:00 PM and 4:30-5:30
    PM
  • Course Number and Section: EE 5359 - Topics in
    Signal Processing, Section 002: Principles of
    Speech Production, Analysis and Recognition
    (short title: Pr Spch Prod Anal Recog)
  • Day and time: Fridays, 1:00 PM to 3:50 PM,
    Classroom: Nedderman Hall NH106

3
Brief Description, Learning Outcomes and
Pre-requisites
  • Brief Description: Introduction to the production
    of human speech, the vocal tract, the hearing
    system, the units of speech, methods of analysis
    of speech signals, speech recognition technology,
    and computerized speech synthesis.
  • Learning Outcomes: The student will understand
    how speech is produced and how it is perceived by
    humans, how it can be decomposed and analyzed as
    a signal, and the basis of the methods of speech
    recognition and speech synthesis.
  • Pre-requisites: Undergraduate courses in integral
    and differential calculus, differential
    equations, and matrix algebra. A very introductory
    knowledge of signals, systems, frequency analysis
    and filters is expected (see Chapter 2 of the text).

4
Grading
  • The grade for the course is determined by:
  • Two exams: a mid-term exam (30% of the grade)
    and a comprehensive final examination (40% of
    the grade).
  • Homework will be assigned with a value of 10% of
    the grade.
  • A class project will be undertaken, with a written
    report and a presentation, for 20% of the course
    grade.
  • No curving of the grade is anticipated.
  • Students are expected to attend all classes, with
    a maximum of two unexcused absences; the
    responsibility is on them to cover the material
    on their own if a class is missed.
  • University drop policy applies.
  • No make-up tests for unexcused absences.

5
Text and references
  • Main Textbook: D. O'Shaughnessy, Speech
    Communications: Human and Machine, Second
    Edition, IEEE Press, 2000. There are substantial
    WWW references on p. 469 and following.
  • Other reference texts:
  • Digital Speech Processing, Synthesis and
    Recognition, Sadaoki Furui, Marcel Dekker Inc.,
    2001.
  • Automatic Speech Recognition: The Development of
    the SPHINX System, Kai-Fu Lee, Kluwer, 1989.
  • Voice Communication between Humans and
    Machines, D. B. Roe and J. G. Wilpon, Eds.,
    National Academy of Sciences, 1994.

6
Objectives of the Course
  • An introductory overview of speech production
    and perception and acoustic phonetics, with
    coverage of models of the vocal tract sound
    production and the mechanisms of sound
    perception. The central topic will be automatic
    speech recognition (ASR) and the models used for
    it. Some aspects of speech coding, speech
    synthesis and speaker identification will also
    be covered.

7
This class overviews the course
  • Speech communication:
  • physics (acoustics, speech analysis),
  • physiology (speech production and perception),
  • and psycho-sociological (semantics,
    psychoacoustics, prosody and suprasegmentals)
    aspects.
  • Communication with machines: an introduction to
    speech recognition and to speech synthesis.
  • Features of speech synthesis and speaker
    recognition (speaker characterization).

8
Signals: Speech is a signal
  • Speech is manifested as air pressure waves that
    are perceived by the human hearing system. The
    pressure waves may be transformed to electrical
    signals by microphone(s). The pressure waves
    follow the physical laws of acoustics.
  • A (time-varying) signal is represented by a
    (discrete or analog) function of time. Examples
    of signals that we will use are:
  • a) An impulse
  • b) A step function
  • c) A (displaced or shifted) exponential
  • d) Sine function
  • e) Cosine function
  • Notice continuous vs. discrete (sampled) signals
    in the next picture
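As a quick illustration (mine, not from the slides), the discrete-time versions of these example signals can be generated in a few lines of NumPy; the length N, the exponential base a, and the tone frequency are arbitrary choices.

import numpy as np

N = 32                                     # number of samples (arbitrary choice)
n = np.arange(N)                           # discrete time index n = 0, 1, ..., N-1

impulse = (n == 0).astype(float)           # a) unit impulse: 1 at n = 0, 0 elsewhere
step = np.ones(N)                          # b) unit step: 1 for all n >= 0
a = 0.8                                    # base of the exponential (assumed value)
shifted_exp = (a ** (n - 5)) * (n >= 5)    # c) exponential shifted to start at n = 5
f0, fs = 100.0, 8000.0                     # tone frequency and sampling rate (assumed)
sine = np.sin(2 * np.pi * f0 * n / fs)     # d) sampled sine
cosine = np.cos(2 * np.pi * f0 * n / fs)   # e) sampled cosine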

9
[Figure: plots of an impulse, a step, and an exponential a^(-n) plus a constant]
10
(No Transcript)
11
Digitizing the speech signal
  • There are a number of advantages in going through
    an A/D conversion:
  • It is easier to combat noise up to a point
  • The digital signal may be handled by DSP methods
    in hardware or by programs (filtering, etc.)
  • It is easier to analyze, compress, code, etc.
  • It unfortunately suffers from quantization noise
    of a magnitude that depends on the number of
    quantization levels (for example, 2^8 = 256
    levels)
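A minimal sketch of the quantization-noise point (the signal and sampling rate are assumed, not from the course material): quantize a sine to 2^8 = 256 uniform levels and measure the resulting error.

import numpy as np

fs, f0 = 8000.0, 200.0                  # sampling rate and tone frequency (assumed values)
t = np.arange(0, 0.05, 1 / fs)          # 50 ms of samples
x = np.sin(2 * np.pi * f0 * t)          # a clean signal in the range [-1, 1]

levels = 2 ** 8                         # 256 quantization levels, as in the example above
step = 2.0 / levels                     # quantizer step size for a [-1, 1] range
xq = np.round(x / step) * step          # uniform quantization
noise = x - xq                          # the quantization error
snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))
print(f"quantization SNR: {snr_db:.1f} dB")   # roughly 6 dB per bit for a full-scale sine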

12
The physiology of speech
[Figure: the relation between signs and their users, covering speech production and perception]
13
Speech recognition by machine can be aided by
understanding certain aspects of the language
spoken
  • Compare machine recognition with the previous
    figure (Physiology of Speech)
  • Semiotics (the study or science of signs and
    symbols), for example, covers:
  • Pragmatics
  • Semantics
  • Syntactics
  • We will only briefly cover some aspects of some
    of these issues as needed in the course

14
Psycho-sociological (psychoacoustics, semantics,
prosody and suprasegmentals) aspects
  • When we model natural language for recognition it
    is advantageous to consider not only the physical
    aspects but also other broad aspects such as:
  • Psychoacoustics (Chapter 5 of text) relates the
    acoustic signal to what the human perceives (for
    example, speech synthesis by rules is
    intelligible but not natural)
  • Semantics: meanings and the development of word
    meanings (helpful in disambiguation)
  • Prosody: speech emphasis for meaning; style
    (read speech vs. conversational); stress, pauses,
    duration, place and manner of articulation, and
    rhythm.
  • Suprasegmentals: Prosody is not localized to a
    specific sound segment, and its characteristics
    often extend beyond a word to a whole portion of
    a sentence. Stress is related to semantics, type
    of discourse and pragmatics. Suprasegmental
    effects go beyond the individual segment features
    considered in common recognizers.

15
Communication with machines: an introduction to
speech recognition (STT) and to speech synthesis
(TTS)
  • Some salient issues of speech recognition:
  • Do we want to recognize sentences or isolated
    words?
  • Is it speaker dependent or speaker independent?
  • Speech varies not only across speakers but also
    within the same speaker, and with other factors
    (noise, for example)
  • Do we use dynamic time warping or Hidden Markov
    Models? HMM is the predominant approach today
  • Do we limit the vocabulary?
  • We need a balanced database to train the HMM
    (like TIMIT? Other?)
  • Do we take advantage of a language model?
  • MATLAB has HMM software

16
DTW and HMM
  • DTW uses Bellman's dynamic programming
    optimization to nonlinearly warp the time axis
    so that a test pattern of parameters matches a
    reference pattern (a minimal sketch appears
    below). Very popular in the 1980s, it has only
    limited applications today.
  • A (plain) Markov model considers the
    probabilities of a state transition given the
    present state. In a hidden Markov model we
    consider the same probabilities without knowing
    the current state, relying only on the outputs
    of the state machine (following all feasible
    paths and self-loops). The terms output,
    observation and emission all refer to the same
    result of the HMM.
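To make the DTW idea concrete, here is a minimal dynamic-programming sketch (a generic textbook formulation, not code from the course); in practice the sequences would be vectors of spectral parameters rather than single numbers.

import numpy as np

def dtw_distance(ref, test):
    """Minimal DTW: nonlinearly warp the time axis so the test pattern
    aligns with the reference pattern, accumulating local distances."""
    R, T = len(ref), len(test)
    D = np.full((R + 1, T + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, R + 1):
        for j in range(1, T + 1):
            cost = abs(ref[i - 1] - test[j - 1])      # local distance between frames
            D[i, j] = cost + min(D[i - 1, j],         # stretch the test pattern
                                 D[i, j - 1],         # compress the test pattern
                                 D[i - 1, j - 1])     # one-to-one step
    return D[R, T]

# Toy example: the same "pattern" produced at two different speaking rates
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))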

17
An HMM is defined by
  • A set of states with distinguished initial and
    final state(s)
  • A set of state transition probabilities
  • A matrix of probabilities (one matrix per
    transition or more frequently one per state) of
    emitting the possible outputs when moving along
    that transition
  • The following is assumed: a) the next state
    depends only on the current state (Markov
    assumption), and b) the probability that a
    particular symbol is emitted at a given time
    depends only on the transition taken at that time
    and is independent of the past.
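Written out as data, those ingredients might look like the following toy model (the numbers are illustrative only, not from the course): a 3-state left-to-right HMM with two possible output symbols.

import numpy as np

pi = np.array([1.0, 0.0, 0.0])       # initial state distribution: start in state 0
A = np.array([[0.6, 0.4, 0.0],       # A[i, j] = P(next state j | current state i)
              [0.0, 0.7, 0.3],       # self-loops on the diagonal, left-to-right structure
              [0.0, 0.0, 1.0]])      # state 2 acts as the final (absorbing) state
B = np.array([[0.9, 0.1],            # B[i, k] = P(emitting output symbol k | state i)
              [0.2, 0.8],
              [0.5, 0.5]])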

18
Problems that may be tackled with HMMs
  • Given an HMM model and a set of outputs, estimate
    the probability that the model generated the
    observations (evaluation)
  • Given an HMM model and a set of outputs, what is
    the most likely state path followed? (decoding)
  • Given a set of outputs and a model, what
    transition probabilities and output-symbol
    probability distributions make it highly
    probable that the model generated the outputs?
    (learning)
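The first problem (evaluation) is classically solved with the forward algorithm, which sums over all feasible state paths without enumerating them; here is a self-contained sketch with made-up numbers.

import numpy as np

def forward_probability(pi, A, B, obs):
    """Evaluation problem: P(observation sequence | model), summed over all state paths."""
    alpha = pi * B[:, obs[0]]            # start in each state and emit the first output
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # take one transition, then emit the next output
    return alpha.sum()

# Toy 2-state, 2-symbol model (illustrative numbers only)
pi = np.array([0.8, 0.2])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward_probability(pi, A, B, [0, 1, 1, 0]))

Decoding is handled in the same style with a maximization instead of a sum (the Viterbi algorithm), and learning with the Baum-Welch re-estimation procedure.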

19
MATLAB has some HMM software
20
The training set and the test set should be
disjoint
Never use examples from the training set to test
the trained machine
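A small sketch of the point (hypothetical file names, not course data): shuffle once, split once, and check that the two sets never overlap.

import random

utterances = [f"utt_{i:03d}.wav" for i in range(100)]   # hypothetical labeled examples
random.seed(0)                                          # reproducible shuffle
random.shuffle(utterances)

split = int(0.8 * len(utterances))
train_set = utterances[:split]          # used only to train the recognizer
test_set = utterances[split:]           # never seen during training

assert not set(train_set) & set(test_set)               # the two sets must not overlap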
21
A few issues of speech synthesis
  • Why not use recorded speech segments?
  • It is possible to use recorded speech for QA
    systems; this works particularly well when the
    possible answers are limited.
  • Issue of co-articulation
  • Issue of prosody
  • Issue of naturalness
  • It takes some 50 parameters to synthesize natural
    sounding speech. Features are needed to identify
    a speaker, given a population of speakers
  • Synthesis is not difficult at the expense of
    naturalness (the toothpaste in the tube)

22
Speaker recognition (speaker characterization).
  • In contrast with speech recognition, we want to
    emphasize the differences between speakers
  • Three main sources of speaker variability:
  • Vocal tract and vocal fold shape
  • Speech style (dynamics of phones, co-articulation
    and speech rate), which is possible to mimic
  • Choice of words or phrases (hard to verify)

23
Summary of Class 1: In this presentation we have
  • Presented the plan for the course, including the
    expected outcomes, the requirements, the intended
    grading procedures, the recommended text and
    references, and the course objectives, and
  • Given a brief overview of the course with
    emphasis on the three main topics:
  • Speech production and perception (background)
  • Coding and automatic speech recognition (central
    topic)
  • Aspects of speech synthesis and speaker
    recognition

24
Epilogue to the first class
The subject matter of this course is very
extensive, and the course may be given a variety
of emphases while preserving the basic
principles. Now that you have seen the outline of
the course, please write on a piece of paper,
with your name (undergrad/grad?) and preferred
email address, how you think this course would
benefit you and why. It would help me emphasize
a little more the areas of interest to the group.
Thank you!