S Legrand - PowerPoint PPT Presentation

Slides: 37

Provided by: stephen331

1
Snack for Ruby

S Legrand

2
Talk Objectives

Tour of API
Learn the walk and talk
Have Fun

3
Snack

The Snack library is a tool to aid in learning
about sound, voice, and ASR, and is hopefully a fun
way to experiment
Snack is a Tcl-based API
Snack has been adapted to and included in
Standard Python Distribution

4
Snack

Snack is Swedish for talk or chat
Kåre Sjölander is the principal investigator for
Tcl-based Snack
Tcl Snack is available at http://www.speech.kth.se/snack/

5
Snack for Ruby

rbSnack is a Ruby wrapper around Tcl Snack
rbSnack has additional Ruby-based utilities
rbSnack has HTML-based help (rdoc + rbTeX)
rbSnack can be found at http://rbsnack.sourceforge.net/

6
Snack Toolkit Includes

Recording, Playback
Waveform display
Spectrogram: Fourier, LPC
Formant analysis
Power analysis
Filters
(will demo)

7
The Speech Signal

Continuous speech is discretely sampled
The signal consists of rapidly changing data points.
The display of the sampled signal is called the
waveform
Snack can display the waveform in real time

8
Analysis uses frames

Signal is broken into frames
Frames may overlap
Characteristics of signal analyzed using Fourier
and LPC analysis on a per frame basis.
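A minimal sketch of framing, in plain Python rather than Snack. The names split_into_frames, frame_len and hop are illustrative assumptions, not part of the Snack or rbSnack API. Frames overlap whenever hop is smaller than frame_len.

```python
def split_into_frames(signal, frame_len, hop):
    """Cut a sampled signal into fixed-length frames.

    Consecutive frames start `hop` samples apart, so they overlap
    when hop < frame_len. A trailing partial frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames


samples = list(range(10))            # stand-in for sampled audio
frames = split_into_frames(samples, frame_len=4, hop=2)
for frame in frames:
    print(frame)
```

Each frame would then be handed to the Fourier or LPC analysis on its own.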

9
Going in Circles

Complex numbers are just a funny way of saying
that multiplying adds angles.
Euler's formula: $e^{i\theta} = \cos\theta + i\sin\theta$
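A small check of the "multiplying adds angles" idea, using only Python's standard cmath module. The helper name unit is an assumption made for this sketch.

```python
import cmath
import math


def unit(theta):
    """Point on the unit circle at angle theta: e^(i*theta)."""
    return cmath.exp(1j * theta)


a = math.pi / 6
b = math.pi / 3

# Euler's formula: e^(i*a) equals cos(a) + i*sin(a)
euler_gap = abs(unit(a) - complex(math.cos(a), math.sin(a)))

# Multiplying two unit complex numbers adds their angles
product = unit(a) * unit(b)
angle = cmath.phase(product)

print(euler_gap, angle, a + b)
```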

10
Fourier Analysis

The Fourier matrix is a unitary matrix
Multiplication by the Fourier matrix returns the
frequency components of the signal, called the
Fourier coefficients
The inverse is easy to compute; it is called the
Fourier inverse
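A sketch of why the inverse is easy: for a unitary matrix the inverse is simply the conjugate transpose. This assumes the 1/sqrt(N) normalization (other scalings of the Fourier matrix are common and are not unitary). The function names are illustrative.

```python
import cmath
import math


def fourier_matrix(n):
    """N x N Fourier matrix, scaled by 1/sqrt(N) so that it is unitary."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * cmath.exp(-2j * math.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector)))
            for row in matrix]


def conj_transpose(matrix):
    """Conjugate transpose; for a unitary matrix this is the inverse."""
    size = len(matrix)
    return [[matrix[j][i].conjugate() for j in range(size)]
            for i in range(size)]


signal = [1.0, 2.0, 0.0, -1.0]
F = fourier_matrix(4)
coeffs = matvec(F, signal)                       # Fourier coefficients
recovered = matvec(conj_transpose(F), coeffs)    # Fourier inverse
print(coeffs)
print(recovered)
```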

11
The Fourier Matrix Looks Like

Spinning disks

Multiplying the Fourier matrix by the signal produces
the Fourier coefficients (frequency components)

12
Examining Fourier components

A spectrogram gives a picture of the Fourier
components (coefficients) as they evolve over
time. Snack can display it in real time.
Looks like an X-ray
Bands of high activity correspond to formants
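A rough sketch of what a spectrogram holds: one row of Fourier magnitudes per frame. This is plain Python with a direct (slow) DFT, not Snack's implementation; the test signal, sample rate and function names are assumptions chosen so the tones fall exactly on frequency bins.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the Fourier coefficients for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len, hop):
    """One row of magnitudes per frame: time runs down, frequency across."""
    return [dft_magnitudes(signal[s:s + frame_len])
            for s in range(0, len(signal) - frame_len + 1, hop)]


rate = 8000  # assumed sample rate in Hz
# A 1000 Hz tone followed by a 2000 Hz tone
signal = ([math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)] +
          [math.sin(2 * math.pi * 2000 * t / rate) for t in range(256)])

spec = spectrogram(signal, frame_len=64, hop=64)
# Strongest bin in each frame; bin width is rate / frame_len = 125 Hz
peaks = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(peaks)
```

The jump in the strongest bin between the two halves is the kind of band movement a spectrogram display makes visible.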

13
Linear Filters

Useful for understanding the nature of speech signals
Generators generate square waves, sine waves,
sawtooth waves, etc.
Composers compose several filters.
FIR: finite impulse response
IIR: infinite impulse response
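A small sketch of the generator idea in plain Python. These functions are illustrative stand-ins and are not the Snack generator filters; the names and the (freq, rate, n) parameters are assumptions.

```python
import math


def sine(freq, rate, n):
    """n samples of a sine wave at freq Hz, sampled at rate Hz."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


def square(freq, rate, n):
    """Square wave: +1 for the first half of each period, -1 for the second."""
    return [1.0 if (t * freq / rate) % 1.0 < 0.5 else -1.0 for t in range(n)]


def sawtooth(freq, rate, n):
    """Sawtooth: ramps from -1 up toward +1 once per period."""
    return [2.0 * ((t * freq / rate) % 1.0) - 1.0 for t in range(n)]


print(square(1, 8, 8))
print(sawtooth(1, 4, 4))
```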

14
FIR Filter

Determined completely by response to a unit
impulse.
Response finite in duration.

$$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n)$$
(We will demo FIR using rbSnack)
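A minimal FIR sketch in plain Python (not the rbSnack demo). Feeding in a unit impulse returns the coefficients themselves and then zeros, which is why the response is finite. The function name fir_filter is an assumption.

```python
def fir_filter(b, x):
    """y(t) = b[0]*x(t) + b[1]*x(t-1) + ... + b[n]*x(t-n).

    Samples before the start of the signal are treated as zero.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if t - k >= 0:
                acc += bk * x[t - k]
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter([0.5, 0.3, 0.2], impulse)
print(response)
```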
15
IIR Filter

Also called Recursive filter
Response infinite in duration.

$$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n) + a_1 y(t-1) + a_2 y(t-2) + \dots + a_n y(t-n)$$
(We will demo IIR using rbSnack)
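A minimal IIR sketch in plain Python (not the rbSnack demo). It assumes the feedback terms are added, as in the slide's formula; some texts write them with a minus sign. With a single feedback coefficient the impulse response halves forever and never reaches exactly zero. The function name iir_filter is an assumption.

```python
def iir_filter(b, a, x):
    """Recursive filter.

    y(t) = sum_k b[k]*x(t-k)  +  sum_{k>=1} a[k-1]*y(t-k)
    Here a[0] plays the role of a1 in the formula.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):               # feed-forward part
            if t - k >= 0:
                acc += bk * x[t - k]
        for k, ak in enumerate(a, start=1):      # feedback part
            if t - k >= 0:
                acc += ak * y[t - k]
        y.append(acc)
    return y


impulse = [1.0] + [0.0] * 7
response = iir_filter([1.0], [0.5], impulse)
print(response)
```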
16
Linear Predictive Analysis

Analogous to Fourier analysis
Assumption: for each frame, the signal is
predicted by
$$y(t) = a_1 y(t-1) + a_2 y(t-2) + \dots + a_p y(t-p)$$
The LPC coefficients are the best least squares
approximation.
Can also be used to estimate formants
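A sketch of the least squares fit for one frame, in plain Python. It solves the normal equations directly with Gaussian elimination; this is one possible method and is not claimed to be the one Snack uses. The function name lpc_least_squares is an assumption. A pure sine wave obeys y(t) = 2cos(w) y(t-1) - y(t-2) exactly, so an order-2 fit should recover those two coefficients.

```python
import math


def lpc_least_squares(y, p):
    """Fit a[0..p-1] minimizing sum_t (y[t] - sum_k a[k]*y[t-1-k])**2."""
    ts = range(p, len(y))
    # Normal equations R a = r
    R = [[sum(y[t - 1 - i] * y[t - 1 - j] for t in ts) for j in range(p)]
         for i in range(p)]
    r = [sum(y[t] * y[t - 1 - i] for t in ts) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda i: abs(R[i][col]))
        R[col], R[pivot] = R[pivot], R[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, p):
            factor = R[row][col] / R[col][col]
            for j in range(col, p):
                R[row][j] -= factor * R[col][j]
            r[row] -= factor * r[col]
    # Back substitution
    a = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(R[i][j] * a[j] for j in range(i + 1, p))
        a[i] = (r[i] - tail) / R[i][i]
    return a


w = 2 * math.pi * 0.1                         # 0.1 cycles per sample
frame = [math.sin(w * t) for t in range(50)]
a1, a2 = lpc_least_squares(frame, 2)
print(a1, a2)                                 # close to 2*cos(w) and -1
```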
17
What is Sound? What is Speech?

Sound is the signal created by
longitudinal waves in a medium such as air.
Sound waves are continuous
Can be decomposed into a linear combination of sine
waves.
Speech is a special noise made by humans

18
It's Just Tubing

The simplest model of speech is to consider the
lungs and trachea as one long tube.
Resonance frequencies are called Formants.
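A back-of-the-envelope sketch of tube resonances. It assumes the standard textbook idealization of a uniform tube closed at one end and open at the other, whose resonances are F_n = (2n - 1) c / (4 L). The tube length (17 cm) and speed of sound (340 m/s) are illustrative assumptions, not values from the slides.

```python
def tube_resonances(length_m, count, speed_of_sound=340.0):
    """Resonances of a uniform tube closed at one end, open at the other.

    F_n = (2n - 1) * c / (4 * L), for n = 1, 2, ...
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


formants = tube_resonances(0.17, 3)
print(formants)   # roughly 500, 1500 and 2500 Hz
```

Real formants move away from these values as the tube changes shape, which is what distinguishes one vowel from another.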

[Figure: formants F1 and F2]
19
Some Speech Recognition Features

Formants
Pitch
Voiced/Unvoiced
Nasality
Frication
Energy

Our current work only uses Formants and Energy
20
Basic Utterances

A basic unit of speech is called a phone
Vowels are utterances with constant formants
A diphthong is a transition from one vowel to
another
Vowels and diphthongs are essentially
characterized by the first and second formants.

21
Other Phones: The Consonants

Plosive: closure in the oral cavity, /p/
Nasal: closure of the oral cavity, with air flowing
through the nasal cavity, /m/
Fricative: turbulent airstream noise, /s/
Retroflex liquid: vowel-like, tongue high and curled
back, /r/
Lateral liquid: vowel-like, tongue central, side
airstream, /l/
Glide: vowel-like, /y/

22
Some Problems with Speech Signals

Segmentation: when does a word begin and end?
(Noise?)
Wetware (the speaker's internal configuration: lip
smacks, breathing, etc.)
SegmentationWorkshop demos one approach.

23
Code Books

A code book consists of code words.
Idea is to search through code book to find code
word corresponding to best match of feature
sequence.
rbSnack uses the code book approach in word
recognition.
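A toy sketch of a code book search in plain Python. It is simplified to a single feature vector per word rather than a feature sequence, and it is not rbSnack's implementation. The code words and their (F1, F2) values are hypothetical, chosen only for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code_word(codebook, features):
    """Return the code word whose stored features best match the input."""
    return min(codebook,
               key=lambda word: squared_distance(codebook[word], features))


# Hypothetical code book: word -> (F1, F2) in Hz, illustrative values only
codebook = {
    "left": [300.0, 2300.0],
    "right": [700.0, 1200.0],
}

match = nearest_code_word(codebook, [680.0, 1250.0])
print(match)
```

The search always returns some word, even for input unlike anything in the code book, which is one reason the approach suits small vocabularies.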

24
Code Book Approach

+ Easy to implement
+ Good for isolated words
- Works best on small vocabularies
-- Is insensitive to context, prone to errors

25
Code Book Approach

WhichWay is a simple demo of this approach

26
More Problems with Speech Signals

Accent: Southern vs. New England vs. California
Valley vs. other.
Variation in rate of speech makes it hard to
compare words

27
Dynamic Time Warping

A pattern comparison technique
A way of stretching or compressing one sequence
to match another.
Evaluated using dynamic programming

28
Dynamic Programming

Form a grid, with start at lower left, end at
upper right.
Label each node with difference (error) between
pattern 1 at time i and pattern 2 at time j.
Find minimal distance from start to end using

29
Dynamic Programming

Basic assumption: if the best path P(S,E) passes
through node N, then P(S,E) is the concatenation
of P(S,N) (best from S to N) and P(N,E) (best
from N to E)

A possible path
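A minimal sketch of dynamic time warping by dynamic programming, in plain Python. It assumes the simplest local moves (one step right, one step up, or one diagonal step, all equally weighted), which is only one of several possible local path constraints and is not claimed to be what rbSnack uses. Each grid node holds the error at (i, j) plus the cheapest way of reaching it, which is the basic assumption above applied node by node.

```python
def dtw_distance(pattern1, pattern2):
    """Minimal accumulated error when aligning two sequences of numbers."""
    inf = float("inf")
    n, m = len(pattern1), len(pattern2)
    # cost[i][j]: best accumulated error aligning the first i and j points
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            error = abs(pattern1[i - 1] - pattern2[j - 1])
            cost[i][j] = error + min(cost[i - 1][j],      # stretch pattern 2
                                     cost[i][j - 1],      # stretch pattern 1
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


slow = [1, 2, 2, 3]   # same shape as `fast`, spoken more slowly
fast = [1, 2, 3]
print(dtw_distance(fast, slow))
print(dtw_distance([0, 0], [1, 1]))
```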

30
Dynamic Programming
[Figure: local path constraints, Type I and Type III]

rbSnack includes examples for various time
alignment approaches

31
Dynamic Programming
[Figure: local path constraints, Itakura and Type IV]

32
Hidden Markov Models

Sometimes the second (or third) best match is the
right word. Use HMMs to ascertain the correct
word in the context of the sentence. (Ditto for
phones within a word)
HMMs are similar to non-deterministic finite
state machines, except that they also have
non-deterministic output.
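A toy sketch of how an HMM can pick the most probable hidden state sequence for a series of observations, using the Viterbi algorithm (a standard decoding method, named here as an assumption since the slides do not say which algorithm is used). The two states, the two output symbols and all probabilities are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable state sequence."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s])
            for s in states}
    for obs in observations[1:]:
        layer = {}
        for s in states:
            candidates = [
                (best[prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 best[prev][1] + [s])
                for prev in states
            ]
            layer[s] = max(candidates, key=lambda item: item[0])
        best = layer
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: states A and B emit symbols "x" and "y"
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

prob, path = viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p)
print(path, prob)
```

In a recognizer the states would stand for words or phones and the observations for the matched features, so context can overrule the single best local match.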

33
Hidden Markov Models