1
AdvAIR
An Advanced Audio Information Retrieval System
  • Supervised by Prof. Michael R. Lyu
  • Prepared by Alex Fok, Shirley Ng
  • 2002 Fall

2
Outline
  • Introduction
  • System Overview
  • Applications
  • Experiment
  • Future Work
  • Q & A

3
  • Introduction

4
Motivation
  • Rapid expansion of audio information due to the
    blooming of the Internet
  • Little attention paid to audio mining
  • Lack of a framework for generic audio information
    processing

5
Targets
  • An open platform that provides a basis for
    various voice-oriented applications
  • Enhance audio information retrieval performance
    with guaranteed accuracy
  • Generic speech analysis tools for data mining

6
Approaches
  • Robust low-level sound information preprocessing
    module
  • Speed-oriented yet accurate algorithms
  • Generalized model concept for various usages
  • A visual framework for presentation

7
  • System Design

8
System Flow Chart
[Flow chart: the audio signal passes through Features Extraction, Segmentation and Clustering Preprocessing, and Training and Modeling to Speaker Identification and Linguistic Identification (the core platform); extended tools implement Scene Cutting, Video Scene Change and Speaker Tracking, and Database Storage]
9
Features Extraction
  • Energy Measurement
  • Zero Crossing Rate
  • Pitch
  • Humans resolve frequencies non-linearly across
    the audio spectrum
  • MFCC approach
  • Simulates the vocal tract shape

10
Features Extraction (cont)
  • The idea of a filter bank, which approximates the
    non-linear frequency resolution
  • Bins hold a weighted sum representing the
    spectral magnitude of channels
  • Lower and upper frequency cut-offs

[Figure: mel filter bank — triangular filters over the frequency axis, each producing a weighted spectral magnitude]
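
The filter-bank idea above can be sketched in numpy as follows. This is a minimal illustration, not the system's actual front end: the 20-filter count, 512-point FFT and 16 kHz sample rate are assumptions chosen for the example.

```python
import numpy as np

def mel(f):
    # Mel scale: approximates the ear's non-linear frequency resolution
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
    """Triangular filters: each output bin holds a weighted sum of the
    spectral magnitudes between its lower and upper cut-off frequencies."""
    f_hi = f_hi if f_hi is not None else sr / 2.0
    # Filter centre frequencies, equally spaced on the mel scale
    m_pts = np.linspace(mel(f_lo), mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(m_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising edge of triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge of triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank(n_filters=20, n_fft=512, sr=16000)
# Applying fb to a magnitude spectrum, then taking logs and a DCT,
# yields MFCC-style coefficients
```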

11
Segmentation
  • Segmentation cuts the audio stream at the
    acoustic change points
  • BIC (Bayesian Information Criterion) is used
  • It is threshold-free and robust
  • The input audio stream is modeled as Gaussians

[Figure: a Gaussian distribution with its mean marked]
12
Segmentation
  • Notation for an audio stream:
  • N: number of frames
  • X = {x_i : i = 1, 2, …, N}, a set of feature vectors
  • µ is the mean
  • Σ is the full covariance matrix

13
Segmentation for single change pt.
  • Assume the change point is at frame i
  • H0, H1: two different models
  • H0 models the data as one Gaussian:
  • x_1 … x_N ~ N(µ, Σ)
  • H1 models the data as two Gaussians:
  • x_1 … x_i ~ N(µ1, Σ1)
  • x_{i+1} … x_N ~ N(µ2, Σ2)

14
Segmentation for single change pt. (cont)
  • The maximum likelihood ratio statistic is
  • R(i) = N log|Σ| - N1 log|Σ1| - N2 log|Σ2|

[Figure: audio stream from frame 1 to frame N, with the change point at frame i]
15
Segmentation for single change pt. (cont)
  • BIC(i) = R(i) - λP, where P is a model-complexity penalty
  • BIC(i) is +ve ⇒ i is the change point
  • BIC(i) is -ve ⇒ i is not the change point
  • Which model fits the data better: a single
    Gaussian (H0) or 2 Gaussians (H1)?

16
Segmentation for single change pt. (cont)
  • To detect a single change point, we need to
    calculate BIC(i) for all i = 1, 2, …, N
  • The frame i with the largest BIC value is the change
    point
  • O(N) to detect a single change point
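
The scan above can be sketched in numpy. This is an illustrative reading of the slides, not the system's implementation: the penalty P follows the usual BIC formulation for full-covariance Gaussians, the 10-frame margin at each end is an assumption to keep both covariance estimates well defined, and λ defaults to 1.

```python
import numpy as np

def bic_single_change(X, lam=1.0):
    """Scan every candidate frame i and compute
    BIC(i) = R(i) - lam * P, with
    R(i) = N log|S| - N1 log|S1| - N2 log|S2|.
    Returns the frame with the largest positive BIC, or None
    (threshold-free: a change exists only if some BIC(i) > 0)."""
    N, d = X.shape
    # Penalty: number of extra free parameters, scaled by log N
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)

    def logdet(seg):
        # log-determinant of the segment's sample covariance
        return np.linalg.slogdet(np.cov(seg.T))[1]

    full = N * logdet(X)
    best_i, best_bic = None, 0.0
    for i in range(10, N - 10):        # margin: keep both sides estimable
        R = full - i * logdet(X[:i]) - (N - i) * logdet(X[i:])
        bic = R - lam * P
        if bic > best_bic:
            best_i, best_bic = i, bic
    return best_i
```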

17
Segmentation for multiple change pt.
  • Step 1: Initialize interval [a, b]; set a = 1, b = 2
  • Step 2: Detect a change point in interval [a, b]
    through the BIC single change point detection
    algorithm
  • Step 3: If there is no change point in interval [a, b],
  •   then set b = b + 1
  •   else let t be the change point detected,
  •   set a = t + 1, b = t + 2
  • Step 4: Go to Step 2
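
Steps 1-4 above can be sketched as the following loop. Here `detect` is a stand-in for the single change point BIC detector (any callable that returns a change index within the segment, or None), indices are 0-based, and the `grow` parameter anticipates the enhanced implementation's larger step size.

```python
def detect_all_changes(X, detect, grow=1):
    """Multiple change point detection (Steps 1-4 above):
    grow the window until a change is found, then restart a fresh
    2-frame window just past the detected point."""
    changes = []
    a, b = 0, 2                      # Step 1: initial 2-frame interval
    while b <= len(X):
        t = detect(X[a:b])           # Step 2: single change point search
        if t is None:
            b += grow                # Step 3: no change, widen the window
        else:
            t += a                   # segment-local -> absolute index
            changes.append(t)
            a, b = t + 1, t + 3      # restart just after the change point
        # Step 4: loop back to Step 2
    return changes
```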

18
Enhanced Implementation Algorithm
  • Original multiple change point detection
    algorithm:
  • starts detecting change points within a 2-frame interval
  • increases the investigation interval by 1 each time
  • Enhanced implementation algorithm:
  • the minimum processing interval used in our engine is
    100 frames
  • increases the investigation interval by 100 each time

19
Enhanced Implementation Algorithm (cont)
  • Why do we choose to increase the interval by 100
    frames?
  • If the increment is too large, a scene change may
    be missed
  • It must be smaller than 170 frames because there are
    around 170 frames in 1 second
  • If the increment is too small, processing is
    too slow

20
Enhanced Implementation Algorithm (cont)
  • Advantage: speed-up
  • Trade-off: the detected change point is less
    accurate
  • To compensate:
  • investigate the frames around the change point
    again
  • the investigation interval is incremented by 1 to
    locate a more accurate change point
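
The compensation step can be sketched as a second, fine-grained pass. As before, `detect` is a stand-in for the single change point detector; the 100-frame default radius mirrors the coarse step size and is an assumption.

```python
def refine_change_point(detect, X, t, radius=100):
    """Re-run the single change point detector frame-by-frame in a
    small window around the coarse estimate t, sharpening a change
    point found with 100-frame steps to frame accuracy."""
    a = max(0, t - radius)
    b = min(len(X), t + radius)
    fine = detect(X[a:b])            # exhaustive frame-level search
    return a + fine if fine is not None else t
```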

21
Training and Modeling
  • Before performing the various identifications, training and
    modeling are needed
  • A probability-based model, the Gaussian Mixture Model
    (GMM), is used
  • GMM is used for language identification, gender
    identification and speaker identification
  • A GMM is composed of many different Gaussian
    distributions
  • A Gaussian distribution is represented by its
    mean and variance

22
Gaussian Mixture Model (GMM)
Model for Speaker i
  • To train a model is to calculate the mean,
    variance and weight for each of the Gaussian
    distributions

23
Training of speaker GMMs
  • Collect sound clips that are long enough for each
    speaker (e.g. 20-minute sound clips)
  • Steps for training one speaker model:
  • Step 1: Start with an initial model λ
  • Step 2: Calculate the new mean, variance and weighting
    (a new λ) by training
  • Step 3: Use the new λ if it represents the model
    better than the old λ
  • Step 4: Repeat Steps 2 to 3
  • Finally, we get a λ that can represent the model
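
Steps 1-4 map onto the EM algorithm. Below is a minimal 1-D numpy sketch of that loop; real speaker models use multivariate MFCC feature vectors and far more mixture components, so this is illustrative only.

```python
import numpy as np

def train_gmm_1d(x, k=2, iters=50):
    """EM training of a 1-D GMM, following Steps 1-4: start from an
    initial model, then repeatedly re-estimate the mean, variance and
    weight of each Gaussian."""
    # Step 1: initial model -- means spread over the data quantiles
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each sample
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step (Step 2): new mean, variance and weight
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
        w = n / len(x)
        # Step 3 is implicit: each EM iteration fits the data at
        # least as well as the previous model.  Step 4: repeat.
    return mu, var, w
```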

24
  • Applications

25
Applications
  • Video scene change and speaker tracking
  • Speaker Identification
  • Telephony message notification

26
Video scene change and Speaker tracking
[Diagram: a video clip feeds the AdvAIR core segmentation, which produces timing and speaker information; a video playing mechanism uses the speaker index information for the multimedia presentation]
27
Usage
  • Speaker tracking enhances data mining about a
    particular person (e.g. a politician in a
    conference)
  • Audio information indexing and sorting for audio
    library storage
  • An auxiliary tool for video cutting and
    editing applications

28
Screenshot
[Screenshot: input clip, multimedia player, time information and indexing]
29
Speaker Identification
[Diagram. Training stage: preprocessed speaker clips feed GMM model training, which populates the speaker models database. Testing stage: a sound source is compared against the database by the speaker comparison mechanism to yield the speaker identity]
30
Usage
  • Security authentication
  • Speaker identification in telephone-based systems
  • Criminal investigation (e.g. analogous to
    fingerprinting)

31
Screenshot
[Screenshot: input source, flexible-length comparison, media player for visual verification, speaker identity]
32
Telephony Message Notification
[Diagram: when the user can't listen, the caller's message is recorded; AdvAIR segmentation and GMM model comparison against the desired-group model database classify the caller as desired or non-desired, and a messaging API notifies the user via the Short Message System or the e-mail system]
33
  • Experiment Results

34
Threshold-free BIC criterion

Test | Wave length | Actual Turning Points | False Alarms | Missed Points | Time used
1 | 9 seconds | 2 | 0 | 0 | 2 seconds
2 | 12 seconds | 4 | 0 | 0 | 4 seconds
3 | 25 seconds | 3 | 0 | 0 | 8 seconds
4 | 120 seconds | 8 | 1 | 0 | 134 seconds
5 | 540 seconds | 12 | 8 | 0 | 1200 seconds

Background noise affects accuracy
35
Enhanced Implementation

Test | Method | Wave length | Actual Turning Points | False Alarms | Missed Points | Time used
1 | Old | 9 seconds | 2 | 0 | 0 | 10 seconds
1 | New | 9 seconds | 2 | 0 | 0 | 2 seconds
2 | Old | 12 seconds | 4 | 0 | 0 | 40 seconds
2 | New | 12 seconds | 4 | 0 | 0 | 4 seconds
3 | Old | 25 seconds | 3 | 1 | 0 | 1300 seconds
3 | New | 25 seconds | 3 | 2 | 0 | 8 seconds
4 | Old | 540 seconds | 18 | 7 | 2 | Over 1 day
4 | New | 540 seconds | 18 | 8 | 2 | 1200 seconds

The speed enhancement is determined by the number of
change points relative to the length
36
GMM model closed-set speaker identification
  • Training Stage
  • 10 speakers
  • 5 males, 5 females
  • 20 minutes for each speaker
  • Testing Stage
  • 50 sound clips of 5 seconds duration
  • 48 sound clips identified correctly, i.e. 96% accuracy

37
GMM model open-set speaker identification
  • Accept or reject as the result
  • Same setting as closed-set
  • i.e. 10 speakers, 20 minutes each
  • Correct: 45/50 (90%)
  • False reject: 3/50 (6%)
  • False accept: 2/50 (4%)
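
The accept/reject decision in open-set identification can be sketched as a simple threshold on the best model score. The speaker names, score values and threshold below are purely illustrative assumptions, not values from the experiment.

```python
def open_set_identify(scores, threshold):
    """scores: mapping of speaker name -> average log-likelihood of
    the test clip under that speaker's GMM.  Accept the best-scoring
    speaker only if the score clears the threshold; otherwise reject
    the clip as an unknown (out-of-set) speaker."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# A clip that matches a trained model well is accepted;
# one that matches no model well enough is rejected
known = open_set_identify({"alice": -42.0, "bob": -55.0}, threshold=-50.0)
unknown = open_set_identify({"alice": -60.0, "bob": -55.0}, threshold=-50.0)
```

Short clips give noisy likelihood estimates, which is one reason the decision function degrades on short durations.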

38
  • Problems and Limitations

39
Problems and limitations
  • Accuracy is affected by background noise
  • Some speakers have very similar voice features
  • The open-set speaker identification decision
    function is not very accurate when the duration is short
  • Segmentation is still a time-consuming process

40
Future Work
  • Speaker gender identification
  • Robust open-set speaker identification
  • Speech content recognition
  • Music pattern matching
  • Distributed system for segmentation

41
Q & A