1
Speaker, Speech, and Facial Recognition
By Austin Ouyang and Xin Chen
TA: Tony Mangognia
2
Overall Design
[Block diagram: a Matlab GUI exchanges control signals with the C6713 Digital Signal Processor; a microphone and a webcam provide the audio and image inputs; the DSP returns the speaker identification class.]
3
3 Stages of Verification
  • Speaker Identification
  • Will be able to identify a person's voice based on the physical structure of the throat generating his or her voice
  • Speech Pattern Recognition
  • Will be able to identify what a person is saying, assuming the phrase spoken is already in the database
  • Facial Recognition
  • Will be able to match a person's face against those present in the database

4
Basic Recognition System
Basic structure used for speaker, speech, and facial recognition:
signal → Pre-processing → Feature Extraction → Classification → Post-processing
5
Speaker Recognition System
signal → Framing and Windowing → Feature Extraction using Linear Predictive Coding (LPC) → Linear Classifier → Majority Ruling
  • Key parameters
  • Size of frames
  • Hop sizes
  • Number of LPC coefficients
  • Thresholds for majority ruling (imposter determination)
6
Framing and Windowing
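The framing-and-windowing step can be sketched in Python/NumPy as below. This is an illustrative sketch, not the project's actual code; the frame size, hop size, and Hamming window are placeholder choices.

```python
import numpy as np

def frame_signal(x, frame_size, hop_size, window=np.hamming):
    """Split a 1-D signal into overlapping, windowed frames."""
    n_frames = 1 + (len(x) - frame_size) // hop_size
    w = window(frame_size)
    frames = np.empty((n_frames, frame_size))
    for i in range(n_frames):
        frames[i] = x[i * hop_size : i * hop_size + frame_size] * w
    return frames

# 1 second of audio at 8 kHz, 256-sample frames with 50% overlap
x = np.random.randn(8000)
frames = frame_signal(x, frame_size=256, hop_size=128)
print(frames.shape)  # (61, 256)
```

The hop size controls overlap between frames; smaller hops give more frames and smoother feature trajectories at higher cost.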
7
Feature Extraction using Linear Predictive Coding (LPC)
  1. Find the autocorrelation coefficients, as many lags as LPC coefficients needed.
  2. Compute the LPC coefficients with the Levinson-Durbin recursion.
  3. The LPC coefficients are the filter coefficients of an IIR (all-pole) filter that models the frequency response of the signal; in the frequency domain they act as a spectral envelope.
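The three steps above can be sketched as follows; a minimal, illustrative Levinson-Durbin implementation in Python/NumPy, not the project's actual routine (the test signal and model order are placeholders).

```python
import numpy as np

def lpc(x, order):
    """LPC via the Levinson-Durbin recursion on autocorrelation coefficients."""
    # step 1: autocorrelation at lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # step 2: Levinson-Durbin recursion
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a  # [1, a1, ..., a_order]: denominator of the all-pole model

# sanity check: recover a known first-order AR model x[n] = 0.5*x[n-1] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(10000)
x = np.zeros_like(e)
for n in range(1, len(x)):
    x[n] = 0.5 * x[n - 1] + e[n]
a = lpc(x, order=1)  # a[1] should come out near -0.5
```

The returned vector is the prediction-error filter A(z); the all-pole model 1/A(z) is what traces out the spectral envelope (step 3).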
8
Linear Classifier
The linear classifier assumes that the system is linear; with non-linear systems this classifier would not be effective. This system can be treated as linear, since the LPC coefficients are themselves derived by linear calculations.
t = w x, where w is the template matrix, x is the matrix containing all of the LPC coefficients, and t is the known class matrix; m is the number of classes, n the number of LPC coefficients, and l the number of frames (w is m x n, x is n x l, t is m x l).
9
Linear Classifier (cont.)
During training the w matrix needs to be found. To find it, the t matrix is multiplied by the inverse of the x matrix (in practice the pseudo-inverse, since x is generally not square): w = t x⁻¹. Inversion follows typical matrix-inversion techniques, A⁻¹ = C / det(A), where C is the adjugate matrix of A.
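The training step can be sketched in Python/NumPy as below. Since x is n x l rather than square, the Moore-Penrose pseudo-inverse stands in for a plain inverse; all shapes and data here are illustrative placeholders, not the project's settings.

```python
import numpy as np

# shapes from the slides: m classes, n LPC coefficients, l frames
m, n, l = 3, 8, 200
rng = np.random.default_rng(1)

x = rng.standard_normal((n, l))      # LPC features, one column per frame
labels = rng.integers(0, m, size=l)
t = np.zeros((m, l))                 # one-hot known-class matrix
t[labels, np.arange(l)] = 1.0

# training: solve t = w @ x for the template matrix w using the
# pseudo-inverse (least-squares solution, since x is not square)
w = t @ np.linalg.pinv(x)            # w is m x n

# testing: each frame's predicted class is the row of w @ x
# with the largest response
pred = np.argmax(w @ x, axis=0)
```

With l > n the system is overdetermined, so w is the least-squares fit rather than an exact solution.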
10
Majority Ruling
During testing, a t matrix of order m x l is generated, where m is the number of classes and l is the number of frames. For each frame, the index with the largest value is the identified class for that frame.
In the example, given 8 columns, 5 of them had their max value at index 4 and the remaining 3 at index 5. With majority ruling, the class for the test is therefore class 4, since more maxima fall at index 4 than at any other index.
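The majority-ruling step can be sketched in a few lines of Python/NumPy; the toy score matrix below mirrors the slide's example (5 of 8 frames peaking at index 4, 3 at index 5) and is not real data.

```python
import numpy as np

# toy m x l score matrix: m = 6 classes, l = 8 frames
t = np.zeros((6, 8))
t[4, :5] = 1.0   # frames 0..4 peak at class index 4
t[5, 5:] = 1.0   # frames 5..7 peak at class index 5

per_frame = np.argmax(t, axis=0)               # winning class per frame
votes = np.bincount(per_frame, minlength=6)    # tally the per-frame winners
decision = int(np.argmax(votes))               # majority ruling over frames
print(decision)  # 4
```

An imposter threshold (per the key parameters on the earlier slide) could reject the decision when the winning vote count falls below some fraction of the frames.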
11
Speech Pattern Recognition (after Bernd Plannerer)
signal → Framing → Mel Filter Bank → Distance calculations against samples in the database
  • Key parameters
  • Size of frames
  • Hop sizes
  • Number of channels for the Mel filters
12
Mel-Frequency Filter Banks
Humans do not hear frequencies on a linear scale but rather on a roughly logarithmic one; as the figure shows, human hearing is more sensitive to lower frequencies than to higher ones. The triangular filters perform a masking effect: multiplying the mel-frequency filters with the original power spectrum of the signal returns a weighting of how strong the signal is in each frequency band, and these weights are the feature coefficients.
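A minimal mel filter bank can be sketched in Python/NumPy as below; the channel count, FFT size, and sampling rate are placeholder values, and the mel formula used (2595·log10(1 + f/700)) is one common convention rather than necessarily the one the project used.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular filters whose centers are spaced linearly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of the triangle
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filter_bank(n_filters=20, n_fft=512, fs=8000)
frame = np.random.randn(512)
power = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum of one frame
features = fb @ power                     # one coefficient per mel channel
```

Multiplying the bank with the power spectrum collapses 257 FFT bins down to 20 perceptually spaced channel energies, exactly the weighting the slide describes.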
13
Distance
m is the number of mel channels, n is the number of frames in sample x, and l is the number of frames in sample y.
Each cell accumulates the cheapest predecessor distance, so the algorithm finds the shortest possible distance matching sample x to sample y; the total accumulated distance ends up in the bottom-right corner. The local cost is the Euclidean distance between two arbitrary frames.
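The accumulation scheme above is dynamic time warping; a minimal sketch in Python/NumPy follows, with toy one-channel sequences standing in for real mel features.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic-time-warping distance between two feature sequences.

    X is n x m and Y is l x m (n and l frames of m mel channels). Each
    cell adds the cheapest predecessor (diagonal, above, or left), so the
    total accumulated distance lands in the bottom-right corner."""
    n, l = len(X), len(Y)
    D = np.full((n + 1, l + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, l + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # Euclidean frame distance
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, l]

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])  # same pattern, stretched in time
print(dtw_distance(a, b))  # 0.0
```

The stretched sequence matches at zero cost because the warping path may repeat a frame of x against several frames of y, which is what makes the distance robust to speaking-rate differences.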
14
Face Recognition (after David Lowe)
image → Feature Extraction using the Scale-Invariant Feature Transform (SIFT) → Closest vector (with the 2nd closest far enough away) → Greatest number of matched keypoints
  • SIFT (Scale-Invariant Feature Transform)
  • Keypoint detector
  • Edge / low-contrast removal
  • Orientation assignment
  • Vector creation
15
Face Recognition - SIFT
  • Each layer on the left is the original image convolved with a Gaussian of increasing σ (by a factor k)

16
Face Recognition - SIFT
  • How keypoints are found: a sample x is a keypoint if it is a maximum or minimum compared to its 26 neighbors in the 3x3x3 block spanning adjacent scales
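The extremum test can be sketched as below; an illustrative check over a toy difference-of-Gaussians stack, not the project's actual detector.

```python
import numpy as np

def is_extremum(dog, s, r, c):
    """True if the DoG sample at (scale s, row r, col c) is a strict
    maximum or minimum over its 26 neighbors in the 3x3x3 block that
    spans the scale above and below."""
    patch = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
    center = dog[s, r, c]
    others = np.delete(patch.ravel(), 13)  # drop the center sample itself
    return bool(np.all(center > others) or np.all(center < others))

dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 1.0   # a lone peak in the middle scale
print(is_extremum(dog, 1, 2, 2))  # True
```

A flat region fails the test because the center is neither strictly above nor strictly below all 26 neighbors.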

17
Face Recognition - SIFT
  • Remove keypoints that lie on edges or in areas of low contrast

18
Face Recognition - SIFT
  • Assign an orientation based on the gradients of the pixels neighboring the keypoint, weighted by a Gaussian at the keypoint's scale

19
Face Recognition - SIFT
  • Each descriptor is a 4x4x8 = 128-element vector

20
Face Recognition - Classification
  • Compare against each image in the database separately.
  • For each image in the database, sort the database vectors by their angle to the test vector, smallest first.
  • dotprod = des(i,:) * database_des{j}';
  • [vals, indx] = sort(acos(dotprod));
  • if (vals(1) < vals(2)*ratio)
  •     count(j) = count(j) + 1;
  • end

21
Face Recognition - SIFT
  • Advantages
  • Invariant to shift and to angle (rotation) as well as scale
  • Fast to match
  • Distinctive
  • Disadvantage
  • Patented; cannot be used for commercial purposes unless a license is obtained

22
Testing Speaker Recognition
  • 1st stage: test with a 2-person database; mid-to-poor performance
  • Decided to make noise a class and not count it
  • 2nd stage: test with noise as a class; improvement
  • Decided to eliminate noise in training the other classes by repeatedly saying the password
  • 3rd stage: test with the repeated password
  • Computation of the w template took too long: 15.3 seconds with 2 people

23
Testing Speaker Recognition
  • 4th stage: reduced the LPC size from 64 to 32 but increased fs from 8 kHz to 44.1 kHz
  • Time for the LPC calculation increased from 0.5 seconds to 2.5 seconds
  • Time for the template calculation for 2 people decreased from 15.3 seconds to 8 seconds
  • 5th stage: increased the input level
  • Some softer voices had poor performance; raising the level increased the signal-to-noise ratio

24
Testing Speaker Recognition
  • Results: percentage of features identified to the right person (minus noise)

[Table: test person vs. matched person, by trial number]
25
Testing Speech Pattern Recognition
  • 1st stage: tested the speech pattern algorithm on various country names, with one person's voice populating the database
  • Accuracy with the same person was extremely high; however, accuracy with other people testing was not as good
  • 2nd stage: populated the database with passwords said by different users
  • Accuracy with users in the database saying their own password was good, around 75%
  • Accuracy with users in the database saying other people's passwords was very low, making imposter recognition very difficult

26
Testing Speech Pattern Recognition
  • 3rd stage: trained the database by having the users repeat the password, to minimize noise space in the 3-second buffer
  • Accuracy of users saying their own passwords was very high, well over 95%
  • Accuracy of users saying other people's passwords improved significantly; however, errors were still present. This is likely because, in training, each password is trained with only one person's voice.

27
Testing Speech Pattern Recognition
  • Results: minimal accumulated distance per person

[Table: matched person vs. test person]
28
Testing Face Recognition
  • 1st stage: captured images with the camera, then performed facial recognition
  • Detected hundreds of features, too many in the background
  • Distance, positioning, lighting, and angle were inconsistent
  • 2nd stage: used a white background with consistent lighting and a fixed webcam
  • Number of features dropped to around 100
  • Matching performance at 100% with an 8-person database
  • 3rd stage: used the linear classifier to match
  • Performance dropped to almost random classification
  • Reason: SIFT is a non-linear system

29
Testing Face Recognition
  • Invariant to what is being compared: it can be faces, objects, etc.

30
Testing Face Recognition
  • Results: number of features matched to each person

[Table: matched person vs. test person]
31
Additional Considerations
  • Speaker Recognition
  • Training on a password does not encompass all the phonemes in the English language
  • Facial Recognition
  • If the head is turned side to side, performance drops significantly
  • The algorithm does not care what is being matched to what; there is no locality

32
List of sources
  • http://labts.troja.mff.cuni.cz/~machl5bm/sift/
  • David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 60(2), 2004, pp. 91-110
  • Sebastian Thrun and Jana Košecká
  • Ling Feng, Technical University of Denmark
  • Ogg Vorbis software