Title: S Legrand
1. Snack for Ruby
2. Talk Objectives
- Tour of API
- Learn the walk and talk
- Have Fun
3. Snack
- The Snack library is a tool to aid in learning
about sound, voice, and ASR, and is hopefully a fun
way to experiment
- Snack is a Tcl-based API
- Snack has been adapted to and included in the
standard Python distribution
4. Snack
- Snack is Swedish for talk or chat
- Kåre Sjölander is the principal investigator for
Tcl-based Snack
- Tcl Snack is available at http://www.speech.kth.se/snack/
5. Snack for Ruby
- rbSnack is a Ruby wrapper around Tcl Snack
- rbSnack has additional Ruby-based utilities
- rbSnack has html-based help. (rdocrbTeX)
- rbSnack can be found at http://rbsnack.sourceforge.net/
6. Snack Toolkit Includes
- Recording, Playback
- Waveform display
- Spectrogram: Fourier, LPC
- Formant analysis
- Power analysis
- Filters
- (will demo)
7. The Speech Signal
- Continuous speech is discretely sampled
- The signal consists of rapidly changing data points.
- The display of the sampled signal is called the
waveform
- Snack can display the waveform in real time
8. Analysis uses frames
- The signal is broken into frames
- Frames may overlap
- Characteristics of the signal are analyzed using
Fourier and LPC analysis on a per-frame basis.
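As a rough illustration of framing, here is a minimal Ruby sketch. The method name `frames` and the parameters `frame_size` and `hop` are assumptions made for this example, not part of Snack or rbSnack.

```ruby
# Split a sampled signal into frames of frame_size samples.
# hop is the distance between frame starts; hop < frame_size
# makes consecutive frames overlap. Trailing samples that do
# not fill a whole frame are dropped.
def frames(signal, frame_size, hop)
  result = []
  start = 0
  while start + frame_size <= signal.length
    result << signal[start, frame_size]
    start += hop
  end
  result
end

# 80 samples of a 440 Hz sine sampled at 8000 Hz (illustrative values)
signal = Array.new(80) { |n| Math.sin(2 * Math::PI * 440 * n / 8000.0) }
puts frames(signal, 32, 16).length
```

With a hop of half the frame size, each sample (away from the edges) lands in two frames.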
9. Going in Circles
- Complex numbers are just a funny way of adding
angles: multiplying two complex numbers adds their angles.
- Euler's formula: $e^{i\theta} = \cos\theta + i\sin\theta$
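The angle-adding behavior can be checked with Ruby's built-in `Complex` class. The helper name `unit` is an assumption made for this sketch.

```ruby
# A point on the unit circle at angle theta (radians), built from
# the right-hand side of Euler's formula: cos(theta) + i*sin(theta).
def unit(theta)
  Complex(Math.cos(theta), Math.sin(theta))
end

a = unit(Math::PI / 6)  # 30 degrees
b = unit(Math::PI / 3)  # 60 degrees
product = a * b

# The angle of the product is the sum of the angles (about 90 degrees),
# and the length stays 1.
puts product.arg * 180 / Math::PI
puts product.abs
```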
10. Fourier Analysis
- The Fourier matrix is a unitary matrix
- Multiplication by the Fourier matrix returns the
frequency components of the signal, called the
Fourier coefficients
- The inverse is easy to compute; it is called the
Fourier inverse
11. The Fourier Matrix Looks Like
Multiplication by signal produces Fourier
coefficients (frequency components)
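The matrix from the original slide did not survive. The sketch below assumes the common unitary convention, with entry (j, k) equal to exp(-2*pi*i*j*k/N) / sqrt(N). The method names are made up for this example, and the direct matrix product is meant to show the idea, not to be fast.

```ruby
# N x N Fourier matrix under the assumed unitary convention:
# entry (j, k) = exp(-2*pi*i*j*k/N) / sqrt(N).
def fourier_matrix(n)
  Array.new(n) do |j|
    Array.new(n) do |k|
      Complex.polar(1.0 / Math.sqrt(n), -2 * Math::PI * j * k / n)
    end
  end
end

# Multiply a matrix (array of rows) by a vector.
def mat_vec(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |m, v| m * v } }
end

# For a unitary matrix the inverse is the conjugate transpose.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conj) }
end

# One cycle of a cosine over 8 samples: the energy shows up in
# coefficients 1 and 7, the rest are close to zero.
signal = Array.new(8) { |n| Math.cos(2 * Math::PI * n / 8) }
coeffs = mat_vec(fourier_matrix(8), signal)
coeffs.each_with_index { |c, k| puts format("%d: %.3f", k, c.abs) }
```

Multiplying the coefficients by `conjugate_transpose(fourier_matrix(8))` recovers the signal, which is the Fourier inverse from the previous slide.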
12. Examining Fourier components
- A spectrogram gives a picture of the Fourier
components (coefficients) as they evolve over
time. Snack can display it in real time.
- Looks like an X-ray
- Bands of high activity correspond to formants
13. Linear Filters
- Useful to understand the nature of speech signals
- Generators generate square waves, sine waves,
sawtooth waves, etc.
- Composers compose several filters.
- FIR: finite impulse response
- IIR: infinite impulse response
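To make the generator idea concrete, here is a small sketch in plain Ruby. These method names and signatures are assumptions made for illustration; they are not the Snack or rbSnack generator API.

```ruby
# Each generator returns `count` samples of a unit-amplitude wave
# with frequency `freq` (Hz) sampled at `rate` (Hz).

def sine_wave(freq, rate, count)
  Array.new(count) { |n| Math.sin(2 * Math::PI * freq * n / rate) }
end

# +1 for the first half of each period, -1 for the second half.
def square_wave(freq, rate, count)
  Array.new(count) { |n| ((freq * n.to_f / rate) % 1.0) < 0.5 ? 1.0 : -1.0 }
end

# Ramps linearly from -1 up toward +1 over each period, then resets.
def sawtooth_wave(freq, rate, count)
  Array.new(count) { |n| 2.0 * ((freq * n.to_f / rate) % 1.0) - 1.0 }
end

p square_wave(1, 4, 8)
p sawtooth_wave(1, 4, 8)
```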
14. FIR Filter
- Determined completely by its response to a unit
impulse.
- Response is finite in duration.
$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n)$
(We will demo FIR using rbSnack)
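The FIR equation translates almost directly into code. This is a plain-Ruby sketch, not the rbSnack demo; the method name `fir_filter` is an assumption, and samples before the start of the input are assumed to be zero.

```ruby
# y(t) = b0*x(t) + b1*x(t-1) + ... + bn*x(t-n)
# b is the coefficient array [b0, b1, ..., bn]; x is the input signal.
# Input samples before t = 0 are treated as zero.
def fir_filter(b, x)
  Array.new(x.length) do |t|
    b.each_with_index.sum do |coef, k|
      t - k >= 0 ? coef * x[t - k] : 0.0
    end
  end
end

# Feeding in a unit impulse returns the coefficients themselves,
# followed by zeros: the response is finite in duration.
p fir_filter([0.5, 0.3, 0.2], [1, 0, 0, 0, 0])
```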
15. IIR Filter
- Also called a recursive filter
- Response is infinite in duration.
$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n) + a_1 y(t-1) + a_2 y(t-2) + \dots + a_n y(t-n)$
(We will demo IIR using rbSnack)
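A recursive filter feeds earlier outputs back into the sum. The sketch below is plain Ruby, not the rbSnack demo. It assumes the feedback terms are added (the signs are not legible in the source), and the method name `iir_filter` is made up for this example.

```ruby
# y(t) = b0*x(t) + ... + bn*x(t-n) + a1*y(t-1) + ... + am*y(t-m)
# b = [b0, ..., bn] are the feedforward coefficients,
# a = [a1, ..., am] are the feedback coefficients (added, by assumption).
# Samples before t = 0 are treated as zero.
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum do |coef, k|
      t - k >= 0 ? coef * x[t - k] : 0.0
    end
    feedback = a.each_with_index.sum do |coef, k|
      t - k - 1 >= 0 ? coef * y[t - k - 1] : 0.0
    end
    y << feedforward + feedback
  end
  y
end

# A single feedback coefficient of 0.5 turns a unit impulse into a
# decaying sequence that halves at every step and never reaches zero.
p iir_filter([1.0], [0.5], [1, 0, 0, 0, 0, 0])
```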
16. Linear Predictive Analysis
- Analogous to Fourier analysis
- Assumption: for each frame, the signal is
predicted by
$y(t) = a_1 y(t-1) + a_2 y(t-2) + \dots + a_p y(t-p)$
- The LPC coefficients are the best least-squares
approximation.
- Can also be used to estimate formants
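One way to obtain least-squares coefficients is to solve the normal equations for a frame directly. The sketch below does that in plain Ruby. It is an illustration under that assumption, not necessarily how Snack or rbSnack computes LPC, and the method names `solve` and `lpc` are made up here.

```ruby
# Solve m * x = v by Gaussian elimination with partial pivoting.
def solve(m, v)
  n = v.length
  aug = m.each_with_index.map { |row, i| row.map(&:to_f) + [v[i].to_f] }
  n.times do |col|
    pivot = (col...n).max_by { |r| aug[r][col].abs }
    aug[col], aug[pivot] = aug[pivot], aug[col]
    ((col + 1)...n).each do |r|
      factor = aug[r][col] / aug[col][col]
      (col..n).each { |c| aug[r][c] -= factor * aug[col][c] }
    end
  end
  x = Array.new(n, 0.0)
  (n - 1).downto(0) do |i|
    tail = ((i + 1)...n).sum { |j| aug[i][j] * x[j] }
    x[i] = (aug[i][n] - tail) / aug[i][i]
  end
  x
end

# Least-squares coefficients [a1, ..., ap] so that
# y(t) is approximated by a1*y(t-1) + ... + ap*y(t-p) within the frame.
def lpc(frame, p)
  rows = (p...frame.length).map { |t| (1..p).map { |k| frame[t - k] } }
  targets = frame.drop(p)
  normal = Array.new(p) { |i| Array.new(p) { |j| rows.sum { |r| r[i] * r[j] } } }
  rhs = Array.new(p) { |i| rows.each_with_index.sum { |r, t| r[i] * targets[t] } }
  solve(normal, rhs)
end

# A pure sinusoid satisfies y(t) = 2*cos(w)*y(t-1) - y(t-2) exactly,
# so an order-2 fit should land close to [2*cos(w), -1].
frame = Array.new(50) { |t| Math.sin(0.3 * t) }
p lpc(frame, 2)
```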
17. What is Sound? What is Speech?
- Sound is the signal created by longitudinal
waves in some medium like air.
- Sound waves are continuous
- Can be decomposed into a linear combination of
sine waves.
- Speech is a special noise made by humans
18. It's Just Tubing
- The simplest model of speech is to consider the
lungs and trachea as one long tube.
- Resonance frequencies are called formants.
[Figure: formants F1 and F2]
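For a feel of where tube resonances fall, the sketch below assumes a textbook idealization that the slide does not spell out: a uniform tube closed at one end and open at the other, whose resonances are F_n = (2n - 1) c / (4 L). The tube length of 0.175 m and speed of sound of 350 m/s are illustrative assumptions, and `tube_formants` is a made-up name.

```ruby
# Resonances of a uniform tube closed at one end and open at the other:
# F_n = (2n - 1) * c / (4 * L), for n = 1, 2, 3, ...
# length_m is the tube length in meters, speed_of_sound is in m/s.
def tube_formants(length_m, count, speed_of_sound = 350.0)
  (1..count).map { |n| (2 * n - 1) * speed_of_sound / (4.0 * length_m) }
end

# Roughly 500, 1500, and 2500 Hz for the assumed tube.
p tube_formants(0.175, 3)
```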
19. Some Speech Recognition Features
- Formants
- Pitch
- Voiced/Unvoiced
- Nasality
- Frication
- Energy
Our current work only uses Formants and Energy
20. Basic Utterances
- A basic unit of speech is called a phone
- Vowels are utterances with constant formants
- A diphthong is the transition from one vowel to
another
- Vowels and diphthongs are essentially
characterized by the first and second formants.
21. Other Phones: The Consonants
- Plosive: closure in oral cavity /p/
- Nasal: closure of nasal cavity /m/
- Fricative: turbulent airstream noise /s/
- Retroflex liquid: vowel-like, tongue high and
curled back /r/
- Lateral liquid: vowel-like, tongue central, side
air stream /l/
- Glide: vowel-like /y/
22. Some Problems with Speech Signals
- Segmentation: when does a word begin and end?
(Noise?)
- Wetware (the speaker's internal configuration:
lip smacks, breathing, etc.)
- SegmentationWorkshop demos one approach.
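A very simple way to guess where a word begins and ends is to threshold the energy of each frame. The sketch below illustrates that idea only; it is not a description of what SegmentationWorkshop does, and the names `frame_energy` and `find_endpoints` are assumptions.

```ruby
# Energy of a frame: the sum of squared samples.
def frame_energy(frame)
  frame.sum { |s| s * s }
end

# Split the signal into non-overlapping frames and return
# [first_sample, last_sample] spanning the frames whose energy
# exceeds the threshold, or nil if no frame does.
def find_endpoints(signal, frame_size, threshold)
  energies = signal.each_slice(frame_size).map { |f| frame_energy(f) }
  first = energies.index { |e| e > threshold }
  return nil if first.nil?

  last = energies.rindex { |e| e > threshold }
  [first * frame_size, [(last + 1) * frame_size, signal.length].min - 1]
end

# Silence, then a burst, then silence (made-up samples).
signal = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8
p find_endpoints(signal, 4, 0.5)
```

A fixed threshold is fragile in exactly the ways the slide lists: background noise, lip smacks, and breathing can all cross it.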
23. Code Books
- A code book consists of code words.
- The idea is to search through the code book to
find the code word that best matches the feature
sequence.
- rbSnack uses the code book approach in word
recognition.
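The search can be sketched as a nearest-neighbor lookup. For brevity this example matches a single feature vector (a first and second formant pair) instead of a whole feature sequence. The code book entries are hypothetical values chosen for illustration, and the method names are not taken from rbSnack.

```ruby
# Squared Euclidean distance between two feature vectors.
def distance(u, v)
  u.zip(v).sum { |a, b| (a - b)**2 }
end

# Return the label of the code word closest to the feature vector.
# codebook maps a label to its code word (a feature vector).
def nearest_code_word(codebook, features)
  codebook.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Hypothetical code book of (F1, F2) pairs in Hz.
codebook = {
  "i" => [270, 2290],
  "a" => [730, 1090],
  "u" => [300, 870]
}

puts nearest_code_word(codebook, [310, 2200])
```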
24. Code Book Approach
- Easy to implement
- Good for isolated words
- Works best on small vocabularies
- Is insensitive to context, prone to errors
25. Code Book Approach
- WhichWay is a simple demo of this approach
26. More Problems with Speech Signals
- Accent: Southern vs. New England vs. California
Valley vs. other.
- Variation in rate of speech makes it hard to
compare words
27. Dynamic Time Warping
- A pattern comparison technique
- A way of stretching or compressing one sequence
to match another.
- Evaluated using dynamic programming
28. Dynamic Programming
- Form a grid, with start at lower left, end at
upper right.
- Label each node with the difference (error) between
pattern 1 at time i and pattern 2 at time j.
- Find the minimal distance from start to end using:
29. Dynamic Programming
Basic assumption: if the best path P(S,E) passes
through node N, then P(S,E) is the concatenation
of P(S,N) (best from S to N) and P(N,E) (best
from N to E).
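The grid and the best-path assumption lead to a short dynamic program. The sketch below assumes the simplest local step pattern (arrive from the left, from below, or diagonally, all with equal weight) and uses absolute difference as the node error. Those choices and the name `dtw_distance` are assumptions for illustration; rbSnack's own alignment examples may use different constraints.

```ruby
# Dynamic time warping distance between two numeric sequences.
# d[i][j] holds the cost of the best path from the start (0, 0)
# to node (i, j); by the basic assumption it is the node's own
# error plus the cheapest of its allowed predecessors.
def dtw_distance(a, b)
  n = a.length
  m = b.length
  inf = Float::INFINITY
  d = Array.new(n) { Array.new(m, inf) }
  n.times do |i|
    m.times do |j|
      cost = (a[i] - b[j]).abs
      best_prev =
        if i.zero? && j.zero?
          0.0
        else
          [
            i > 0 ? d[i - 1][j] : inf,
            j > 0 ? d[i][j - 1] : inf,
            i > 0 && j > 0 ? d[i - 1][j - 1] : inf
          ].min
        end
      d[i][j] = cost + best_prev
    end
  end
  d[n - 1][m - 1]
end

# The second pattern is the first one spoken "slower": the middle
# value is held for two steps, and the warped distance is still zero.
p dtw_distance([1, 2, 3], [1, 2, 2, 3])
```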
30. Dynamic Programming
[Figure: local path constraints, Type I and Type III]
- rbSnack includes examples for various time
alignment approaches
31. Dynamic Programming
[Figure: local path constraints, Itakura and Type IV]
32. Hidden Markov Models
- Sometimes the second (or third) best match is the
right word. Use HMMs to ascertain the correct
word in the context of the sentence. (Ditto for
phones within a word.)
- HMMs are similar to non-deterministic finite
state machines, except that they have
non-deterministic output.
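33. Hidden Markov Models
- Dynamic programming is used to compute weights.
- HMMs look like:
[Figure: state diagram with states 1 to 4, transition probabilities .4, .2, and .4, and emission probabilities P(/i/) = .5, P(/a/) = .2, P(/o/) = .3]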
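One standard dynamic program for HMMs is the Viterbi algorithm, which finds the most probable state sequence for a given output sequence. The sketch below is an illustration under that assumption; the slides do not say which algorithm rbSnack would use. The two-state model is hypothetical: only the first state's emission probabilities (.5, .2, .3) are taken from the slide.

```ruby
# Viterbi: most probable state path for a sequence of observations.
# start[s]      = probability of starting in state s
# trans[s][t]   = probability of moving from state s to state t
# emit[s][sym]  = probability that state s outputs symbol sym
# Returns [best_path, probability_of_that_path].
def viterbi(observations, start, trans, emit)
  states = start.each_index.to_a
  # prob[s] is the probability of the best path ending in s so far,
  # and path[s] is that path.
  prob = states.map { |s| start[s] * emit[s][observations.first] }
  path = states.map { |s| [s] }
  observations.drop(1).each do |obs|
    new_prob = []
    new_path = []
    states.each do |t|
      best = states.max_by { |s| prob[s] * trans[s][t] }
      new_prob << prob[best] * trans[best][t] * emit[t][obs]
      new_path << path[best] + [t]
    end
    prob = new_prob
    path = new_path
  end
  winner = states.max_by { |s| prob[s] }
  [path[winner], prob[winner]]
end

# Hypothetical two-state model.
start = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
emit = [
  { "i" => 0.5, "a" => 0.2, "o" => 0.3 },
  { "i" => 0.1, "a" => 0.6, "o" => 0.3 }
]
p viterbi(%w[i a], start, trans, emit)
```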
34. Possible Future Directions
- Examine other features (pitch?)
- Incorporate other libraries. (Do the
computationally hard work in C)
- Add more signal processing routines
- Add more examples
- Use Hidden Markov Models
35. Lessons Learned/to be learned
- Document everything.
- Nothing's perfect
- Automate everything
- Project is never done
36. What's next?