S Legrand - PowerPoint PPT Presentation


1
Snack for Ruby
  • S Legrand

2
Talk Objectives
  • Tour of API
  • Learn the walk and talk
  • Have Fun

3
Snack
  • The Snack library is a tool to aid learning about
    sound, voice, and ASR, and is hopefully a fun way
    to experiment
  • Snack is a Tcl-based API
  • Snack has been adapted to Python and included in
    the standard Python distribution

4
Snack
  • Snack is Swedish for talk or chat
  • Kåre Sjölander is the principal investigator for
    Tcl-based Snack
  • Tcl Snack is available at
    http://www.speech.kth.se/snack/

5
Snack for Ruby
  • rbSnack is a Ruby wrapper around Tcl Snack
  • rbSnack has additional Ruby-based utilities
  • rbSnack has HTML-based help (rdoc, rbTeX)
  • rbSnack can be found at
    http://rbsnack.sourceforge.net/

6
Snack Toolkit Includes
  • Recording, Playback
  • Waveform display
  • Spectrogram: Fourier, LPC
  • Formant analysis
  • Power analysis
  • Filters
  • (will demo)

7
The Speech Signal
  • Continuous speech is discretely sampled
  • The signal consists of rapidly changing data points
  • The display of the sampled signal is called the
    waveform
  • Snack can display the waveform in real time

8
Analysis uses frames
  • Signal is broken into frames
  • Frames may overlap
  • The signal's characteristics are analyzed frame by
    frame using Fourier and LPC analysis
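
Framing can be sketched in a few lines of plain Ruby. This is only an illustration: `frames`, `frame_size`, and `hop` are hypothetical names chosen here, not part of rbSnack or Snack. A hop smaller than the frame size gives overlapping frames.

```ruby
# Split a sampled signal into fixed-length frames.
# frame_size and hop are illustrative parameters (assumptions, not rbSnack names);
# a hop smaller than frame_size makes consecutive frames overlap.
def frames(signal, frame_size, hop)
  raise ArgumentError, "hop must be positive" unless hop.positive?

  result = []
  start = 0
  # Keep only complete frames; a short tail at the end is dropped.
  while start + frame_size <= signal.length
    result << signal[start, frame_size]
    start += hop
  end
  result
end

signal = (0...10).to_a
p frames(signal, 4, 2)
# prints [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each frame would then be handed to a per-frame analysis such as Fourier or LPC.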

9
Going in Circles
  • Complex numbers are just a funny way of
    multiplying: multiply the lengths, add the angles
  • Euler's formula: $e^{i\theta} = \cos\theta + i\sin\theta$
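
The "add the angles" idea can be checked with Ruby's built-in `Complex` class. The values below are arbitrary, and `euler` is a hypothetical helper name used only for this sketch.

```ruby
# Multiplying two complex numbers multiplies their lengths
# and adds their angles.
a = Complex.polar(2.0, Math::PI / 6)   # length 2, angle 30 degrees
b = Complex.polar(3.0, Math::PI / 3)   # length 3, angle 60 degrees
product = a * b

puts product.abs   # close to 6.0 (2 * 3)
puts product.arg   # close to PI / 2 (30 + 60 degrees)

# Euler's formula: e^(i*theta) = cos(theta) + i*sin(theta).
# A point on the unit circle at angle theta.
def euler(theta)
  Complex(Math.cos(theta), Math.sin(theta))
end

puts euler(Math::PI / 2)   # close to 0 + 1i
```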

10
Fourier Analysis
  • The Fourier matrix is a unitary matrix
  • Multiplication by the Fourier matrix returns the
    frequency components of the signal, called the
    Fourier coefficients
  • The inverse is easy to compute (for a unitary
    matrix it is the conjugate transpose); it is
    called the Fourier inverse
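
A small sketch of these points in plain Ruby, building the matrix directly (slow but transparent, not a fast Fourier transform). The helper names `fourier_matrix`, `apply`, and `conjugate_transpose` are assumptions made for this example, not rbSnack functions.

```ruby
# Build the n-by-n unitary Fourier matrix: entry (k, t) is
# exp(-2*pi*i*k*t / n) / sqrt(n).
def fourier_matrix(n)
  Array.new(n) do |k|
    Array.new(n) do |t|
      Complex.polar(1.0 / Math.sqrt(n), -2.0 * Math::PI * k * t / n)
    end
  end
end

# Multiply a matrix (an array of rows) by a vector.
def apply(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |m, x| m * x } }
end

# For a unitary matrix the inverse is the conjugate transpose.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conj) }
end

n = 8
# A cosine that completes 2 cycles over the 8 samples.
signal = Array.new(n) { |t| Math.cos(2.0 * Math::PI * 2 * t / n) }

f = fourier_matrix(n)
coefficients = apply(f, signal)
restored = apply(conjugate_transpose(f), coefficients)

# Only k = 2 and its mirror k = 6 carry energy.
coefficients.each_with_index { |c, k| puts format("k=%d |c|=%.3f", k, c.abs) }
# restored matches signal up to rounding error.
puts restored.map { |c| c.real.round(3) }.inspect
```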

11
The Fourier Matrix Looks Like
  • Spinning disks

Multiplying the Fourier matrix by the signal produces the Fourier
coefficients (frequency components)

12
Examining Fourier components
  • A Spectrogram gives a picture of the Fourier
    components (coefficients) as they evolve over
    time. Snack can display it in real time.
  • Looks like an X-ray
  • Bands of high activity correspond to formants
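
As a rough illustration, a spectrogram can be sketched as one row of Fourier magnitudes per frame. This is a toy version using a direct (slow) Fourier sum and text output. `dft_magnitudes` and `spectrogram` are hypothetical names, and the test tone is synthetic, not speech.

```ruby
# Magnitudes of the first half of the Fourier coefficients of one frame
# (for a real-valued signal the second half mirrors the first).
def dft_magnitudes(frame)
  n = frame.length
  (0...n / 2).map do |k|
    frame.each_with_index.sum { |x, t|
      x * Complex.polar(1.0, -2.0 * Math::PI * k * t / n)
    }.abs
  end
end

# One row of magnitudes per frame: time runs down the rows,
# frequency runs across the columns.
def spectrogram(signal, frame_size, hop)
  rows = []
  start = 0
  while start + frame_size <= signal.length
    rows << dft_magnitudes(signal[start, frame_size])
    start += hop
  end
  rows
end

# A tone that jumps from 2 to 5 cycles per frame halfway through.
frame_size = 16
low  = Array.new(64) { |t| Math.sin(2.0 * Math::PI * 2 * t / frame_size) }
high = Array.new(64) { |t| Math.sin(2.0 * Math::PI * 5 * t / frame_size) }

spectrogram(low + high, frame_size, frame_size).each do |row|
  # Mark the strong frequency bins in each frame.
  puts row.map { |magnitude| magnitude > 1.0 ? "#" : "." }.join
end
```

The printed band moves from column 2 to column 5 when the tone changes, which is the same kind of "band of high activity" a real spectrogram shows for formants.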

13
Linear Filters
  • Useful to understand nature of speech signals
  • Generators generate square waves, sine waves,
    sawtooth waves, etc.
  • Composers compose several filters
  • FIR: finite impulse response
  • IIR: infinite impulse response
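
For experimenting outside Snack, the generator waveforms are easy to produce by hand. The sketch below is plain Ruby with hypothetical function names and does not show how Snack's own generators are configured. The period is measured in samples.

```ruby
# Simple waveform generators; length and period are in samples.
def sine_wave(length, period)
  Array.new(length) { |t| Math.sin(2.0 * Math::PI * t / period) }
end

# +1 for the first half of each period, -1 for the second half.
def square_wave(length, period)
  Array.new(length) { |t| (t % period) < period / 2.0 ? 1.0 : -1.0 }
end

# Ramps linearly from -1 up toward +1 once per period.
def sawtooth_wave(length, period)
  Array.new(length) { |t| 2.0 * (t % period) / period - 1.0 }
end

p square_wave(8, 4)     # prints [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
p sawtooth_wave(4, 4)   # prints [-1.0, -0.5, 0.0, 0.5]
```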

14
FIR Filter
  • Determined completely by response to a unit
    impulse.
  • Response finite in duration.

$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n)$
(We will demo FIR using rbSnack)
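
A direct, unoptimized sketch of an FIR filter in plain Ruby. `fir_filter` is a hypothetical name, not the rbSnack call, and samples before the start of the signal are assumed to be zero. Feeding in a unit impulse returns the coefficients themselves, which is why the impulse response determines the filter completely.

```ruby
# y(t) = b0*x(t) + b1*x(t-1) + ... + bn*x(t-n)
# Samples before t = 0 are treated as zero (an assumption of this sketch).
def fir_filter(b, x)
  Array.new(x.length) do |t|
    b.each_with_index.sum { |coef, k| t - k >= 0 ? coef * x[t - k] : 0.0 }
  end
end

b = [0.5, 0.3, 0.2]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p fir_filter(b, impulse)
# prints [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]: the response dies out after n + 1 samples
```
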
15
IIR Filter
  • Also called Recursive filter
  • Response infinite in duration.

$y(t) = b_0 x(t) + b_1 x(t-1) + b_2 x(t-2) + \dots + b_n x(t-n) + a_1 y(t-1) + a_2 y(t-2) + \dots + a_n y(t-n)$
(We will demo IIR using rbSnack)
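
A matching sketch of a recursive filter in plain Ruby. `iir_filter` is a hypothetical name, not the rbSnack call. The sketch assumes the feedback terms are added, and some texts subtract them instead, which only flips the sign of the `a` coefficients. A single feedback coefficient is enough to make the impulse response go on forever, halving at every step in the example.

```ruby
# y(t) = b0*x(t) + ... + bn*x(t-n) + a1*y(t-1) + ... + am*y(t-m)
# a[0] holds a1. Samples before t = 0 are treated as zero.
# Assumption: feedback terms are added (sign conventions vary between texts).
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum { |coef, k| t - k >= 0 ? coef * x[t - k] : 0.0 }
    feedback = a.each_with_index.sum { |coef, k| t - k - 1 >= 0 ? coef * y[t - k - 1] : 0.0 }
    y << feedforward + feedback
  end
  y
end

impulse = [1.0] + [0.0] * 5
p iir_filter([1.0], [0.5], impulse)
# prints [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```
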
16
Linear Predictive Analysis
  • Analogous to Fourier analysis
  • Assumption: for each frame, the signal is
    predicted by a linear combination of its p
    previous samples:
    $y(t) = a_1 y(t-1) + a_2 y(t-2) + \dots + a_p y(t-p)$
  • The LPC coefficients are the best least-squares
    approximation
  • Can also be used to estimate formants
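
One common way to get least-squares LPC coefficients is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method. It is not necessarily what Snack or rbSnack does internally, and `autocorrelation` and `lpc` are hypothetical names. A silent (all-zero) frame would divide by zero and is not handled here.

```ruby
# Autocorrelation of the frame for lags 0..p.
def autocorrelation(frame, p)
  (0..p).map do |lag|
    (lag...frame.length).sum { |t| frame[t] * frame[t - lag] }
  end
end

# Levinson-Durbin recursion: solves the least-squares normal equations
# for a1..ap in y(t) ~ a1*y(t-1) + ... + ap*y(t-p).
# The returned array holds a1 at index 0.
def lpc(frame, p)
  r = autocorrelation(frame, p)
  a = []
  error = r[0]                      # prediction error of the order-0 model
  (1..p).each do |i|
    acc = r[i] - (0...a.length).sum { |k| a[k] * r[i - 1 - k] }
    reflection = acc / error
    # Update the lower-order coefficients, then append the new one.
    a = a.each_index.map { |k| a[k] - reflection * a[i - 2 - k] } + [reflection]
    error *= (1.0 - reflection**2)
  end
  a
end

# Synthetic test signal: the impulse response of a known predictor.
true_a = [0.9, -0.5]
frame = [1.0]
199.times do
  t = frame.length
  frame << true_a.each_with_index.sum { |coef, k| t - k - 1 >= 0 ? coef * frame[t - k - 1] : 0.0 }
end

p lpc(frame, 2)   # close to [0.9, -0.5]
```
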
17
What is Sound? What is Speech?
  • Sound is the signal created by longitudinal
    pressure waves in some medium such as air
  • Sound waves are continuous
  • They can be decomposed into a linear combination
    of sine waves
  • Speech is a special noise made by humans

18
It's Just Tubing
  • The simplest model of speech is to consider the
    vocal tract as one long tube
  • Its resonance frequencies are called formants

[Figure: formants F1 and F2]

19
Some Speech Recognition Features
  • Formants
  • Pitch
  • Voiced/Unvoiced
  • Nasality
  • Frication
  • Energy

Our current work uses only formants and energy.

20
Basic Utterances
  • A basic unit of speech is called a phone
  • Vowels are utterances with constant formants
  • A diphthong is the transition from one vowel to
    another
  • Vowels and diphthongs are essentially
    characterized by the first and second formants

21
Other Phones: The Consonants
  • Plosive: closure in the oral cavity, /p/
  • Nasal: oral cavity closed, air flows through the
    nasal cavity, /m/
  • Fricative: turbulent airstream noise, /s/
  • Retroflex liquid: vowel-like, tongue high and
    curled back, /r/
  • Lateral liquid: vowel-like, tongue central, side
    airstream, /l/
  • Glide: vowel-like, /y/

22
Some Problems with Speech Signals
  • Segmentation: when does a word begin and end?
    (Noise?)
  • Wetware (the speaker's internal configuration:
    lip smacks, breathing, etc.)
  • SegmentationWorkshop demos one approach.

23
Code Books
  • A code book consists of code words.
  • Idea is to search through code book to find code
    word corresponding to best match of feature
    sequence.
  • rbSnack uses the code book approach in word
    recognition.
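
The search can be sketched as a nearest-neighbour lookup. For brevity this example matches a single feature vector rather than a whole feature sequence, and everything in it is an assumption made for illustration: the names, the squared Euclidean distance, and the rough (F1, F2) values in Hz. None of it is taken from rbSnack.

```ruby
# Squared Euclidean distance between two feature vectors.
def distance(u, v)
  u.zip(v).sum { |a, b| (a - b)**2 }
end

# Return the label of the code word closest to the feature vector.
def nearest_code_word(code_book, features)
  code_book.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Hypothetical code book: label => [F1, F2] in Hz.
# The numbers are rough illustrative values, not measurements.
code_book = {
  "i" => [300.0, 2300.0],
  "a" => [750.0, 1200.0],
  "u" => [300.0, 850.0]
}

puts nearest_code_word(code_book, [320.0, 2200.0])   # prints i
```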

24
Code Book Approach
  • Easy to implement
  • Good for isolated words
  • Drawback: works best on small vocabularies
  • Drawback: insensitive to context, prone to errors

25
Code Book Approach
  • WhichWay is a simple demo of this approach

26
More Problems with Speech Signals
  • Accent: Southern vs. New England vs. California
    Valley vs. others
  • Variation in rate of speech makes it hard to
    compare words

27
Dynamic Time Warping
  • A pattern comparison technique
  • A way of stretching or compressing one sequence
    to match another.
  • Evaluated using dynamic programming

28
Dynamic Programming
  • Form a grid, with start at lower left, end at
    upper right.
  • Label each node with difference (error) between
    pattern 1 at time i and pattern 2 at time j.
  • Find minimal distance from start to end using

29
Dynamic Programming
  • Basic assumption: if the best path P(S,E) passes
    through node N, then P(S,E) is the concatenation
    of P(S,N) (best from S to N) and P(N,E) (best
    from N to E)
  • A possible path
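
The grid search can be sketched as follows. This is a minimal version that assumes the simplest step pattern (one step right, up, or diagonally, all weighted equally) rather than the Type I, Type III, Type IV, or Itakura constraints. `dtw_cost` is a hypothetical name, not an rbSnack call, and both patterns must be non-empty.

```ruby
# Dynamic time warping cost between two sequences of numbers.
# cost[i][j] is the smallest total error of any path from the start
# node (0, 0) to node (i, j). Because the best path to (i, j) must
# extend a best path to one of its predecessors, each node only needs
# the three neighbouring costs.
def dtw_cost(p1, p2)
  inf = Float::INFINITY
  cost = Array.new(p1.length) { Array.new(p2.length, inf) }
  p1.each_index do |i|
    p2.each_index do |j|
      local = (p1[i] - p2[j]).abs          # error at this node
      best_previous =
        if i.zero? && j.zero?
          0.0
        else
          [
            i > 0 ? cost[i - 1][j] : inf,
            j > 0 ? cost[i][j - 1] : inf,
            i > 0 && j > 0 ? cost[i - 1][j - 1] : inf
          ].min
        end
      cost[i][j] = local + best_previous
    end
  end
  cost.last.last
end

p dtw_cost([1, 2, 3], [1, 2, 2, 3])   # prints 0.0: same shape, different rate
p dtw_cost([1, 2, 3], [3, 2, 1])      # prints 4.0
```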

30
Dynamic Programming
[Figure: Type I and Type III path diagrams]
  • RbSnack includes examples for various time
    alignment approaches

31
Dynamic Programming
[Figure: Itakura and Type IV path diagrams]

32
Hidden Markov Models
  • Sometimes the second (or third) best match is the
    right word. Use HMMs to ascertain the correct
    word in the context of the sentence. (Ditto for
    phones within a word)
  • HMMs are similar to non-deterministic finite
    state machines, except that they have
    non-deterministic output.

33
Hidden Markov Models
  • Dynamic Programming is used to compute weights.
  • HMMs look like

[Figure: HMM with states 1 to 4, transition probabilities .4, .2, .4,
and output probabilities P(/i/) = .5, P(/a/) = .2, P(/o/) = .3]
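
One place dynamic programming appears in HMMs is the Viterbi algorithm, which finds the most probable state path for a sequence of observed phones. The sketch below is an assumption-laden toy: a two-state model with made-up probabilities, except that state `:s1` reuses the output probabilities .5, .2, .3 shown on the slide. It is not rbSnack code.

```ruby
# Viterbi: most probable state path for an observation sequence.
# Returns [probability, path].
def viterbi(states, start_p, trans_p, emit_p, observations)
  # best[s] = [probability of the best path ending in state s, that path]
  best = states.to_h { |s| [s, [start_p[s] * emit_p[s][observations.first], [s]]] }

  observations.drop(1).each do |obs|
    best = states.to_h do |s|
      # Pick the predecessor that gives the most probable way into s.
      prob, path = states.map { |prev|
        [best[prev][0] * trans_p[prev][s], best[prev][1]]
      }.max_by(&:first)
      [s, [prob * emit_p[s][obs], path + [s]]]
    end
  end

  best.values.max_by(&:first)
end

# Hypothetical two-state model (all numbers are illustrative).
states  = [:s1, :s2]
start_p = { s1: 0.6, s2: 0.4 }
trans_p = { s1: { s1: 0.7, s2: 0.3 }, s2: { s1: 0.4, s2: 0.6 } }
emit_p  = {
  s1: { "i" => 0.5, "a" => 0.2, "o" => 0.3 },
  s2: { "i" => 0.1, "a" => 0.6, "o" => 0.3 }
}

probability, path = viterbi(states, start_p, trans_p, emit_p, ["i", "a", "a"])
p path          # prints [:s1, :s2, :s2]
puts probability # close to 0.01944
```
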
34
Possible Future Directions
  • Examine other features (pitch?)
  • Incorporate other libraries. (Do the
    computationally hard work in C)
  • Add more signal processing routines
  • Add more examples
  • Use Hidden Markov Models

35
Lessons Learned/to be learned
  • Document everything.
  • Nothing's perfect
  • Automate everything
  • Project is never done

36
What's next?
  • Try it out.