Title: Audio Modality for Semantic Analysis of Video
1 Audio Modality for Semantic Analysis of Video
2 Audio Analysis for Segmentation
- Overview
- Features
- Segmentation classification
- Previous Works
- Our System
- Work Plan
3 Overview
- In video content analysis, the focus has mostly been on visual information
- Audio contains a significant amount of information
- Existing work on audio is at a preliminary stage
- Directions in audio content analysis
  - Speech-music discrimination
  - Speech-music-background (others) discrimination
  - Key effect detection
  - Audio-based genre detection
4 Features
- In the audio analysis world the features are well categorized, thanks to the efforts in speech processing
- Time Domain
  - Short Term Energy
  - Zero Crossing Rate
- Frequency Domain
  - Pitch
  - Spectral Peaks (Harmonics)
  - MFCC
5 Features
- Short Term Energy Function (STE); a minimal computation is sketched below
- A simple tool for silence detection
- Can keep track of the rhythm pattern
- Higher for voiced speech components than for unvoiced ones
- Not robust against variable volume and SNR
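A minimal sketch of how a per-frame STE curve could be computed, assuming a mono signal stored as a NumPy array; the frame length and hop size are illustrative choices, not values taken from these slides.

    import numpy as np

    def short_term_energy(signal, frame_len=512, hop=256):
        """Short Term Energy: sum of squared samples in each frame.

        frame_len and hop are illustrative defaults, not parameters
        of the presented system.
        """
        energies = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len].astype(np.float64)
            energies.append(np.sum(frame ** 2))
        return np.array(energies)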
6 Features
- Zero Crossing Rate (ZCR); a minimal computation is sketched below
- Represents the frequency content of the signal
- A simple but efficient tool
- Higher for unvoiced speech components
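A minimal per-frame ZCR sketch under the same assumption of a mono NumPy signal; the sign convention for exact zeros is an arbitrary choice.

    import numpy as np

    def zero_crossing_rate(frame):
        """Fraction of adjacent sample pairs whose signs differ."""
        signs = np.sign(frame.astype(np.float64))
        signs[signs == 0] = 1  # treat exact zeros as positive (arbitrary choice)
        return np.mean(signs[:-1] != signs[1:])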
7 Features
- Pitch: the fundamental frequency of an audio waveform
- Defined for voiced speech and harmonic music
- Different methods of different complexities exist for pitch estimation; an autocorrelation-based sketch follows
- No totally reliable and robust method exists, and some methods are computationally expensive
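One of the simpler estimation methods is autocorrelation-based pitch tracking; the sketch below assumes a single frame and a sample rate sr, and the search range and voicing threshold are illustrative values, not taken from the slides.

    import numpy as np

    def pitch_autocorrelation(frame, sr, fmin=60.0, fmax=500.0):
        """Rough pitch estimate: the lag of the strongest autocorrelation
        peak inside a plausible fundamental-frequency range. Returns 0.0
        when no clear periodicity is found (e.g. unvoiced speech, noise)."""
        frame = frame - np.mean(frame)
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(sr / fmax)
        lag_max = min(int(sr / fmin), len(corr) - 1)
        if corr[0] <= 0 or lag_max <= lag_min:
            return 0.0
        lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
        if corr[lag] / corr[0] < 0.3:  # crude voicing check, illustrative threshold
            return 0.0
        return sr / lag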
8 Features
- Mel-Frequency Cepstral Coefficients (MFCC); an extraction sketch follows
- Take the perceptual characteristics of the human auditory system into account
- The most important feature for speech processing systems
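As an illustration (not part of the presented system), MFCCs can be extracted with an off-the-shelf library such as librosa; the file name and the number of coefficients below are placeholder choices.

    import librosa

    # Load at the file's native sample rate and compute 13 MFCCs per frame.
    y, sr = librosa.load("example.wav", sr=None)   # placeholder file name
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)  # (13, number_of_frames)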
9 Segmentation Classification
- Silence
- Background
- Speech
- Music
  - Speech + Music
  - Pure Music
- Others
  - Applauses
  - Gunshots
10 Segmentation Classification
- Thresholding Feature Values
- Gaussian Mixture Models (a per-class GMM sketch follows this list)
- Hidden Markov Models
- Neural Networks
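As a hedged illustration of the model-based options above, a per-class Gaussian Mixture Model classifier could look like the sketch below; the feature matrices are random placeholders standing in for labelled training data, and the component count is arbitrary.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Placeholder feature matrices (rows = frames, columns = e.g. STE, ZCR, pitch).
    # In practice these would come from labelled speech and music recordings.
    speech_feats = np.random.rand(500, 3)
    music_feats = np.random.rand(500, 3)

    # Fit one mixture per class.
    gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_feats)
    gmm_music = GaussianMixture(n_components=4, random_state=0).fit(music_feats)

    def classify(segment_feats):
        """Label a segment by the class whose model gives the higher likelihood."""
        if gmm_speech.score(segment_feats) > gmm_music.score(segment_feats):
            return "speech"
        return "music"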
11 Samples from previous work
- Classes: Speech, Music
  Features: STE, ZCR
  Modeling: Simple Thresholding
- Wyse and Smoliar
  Classes: Speech, Music, Others
  Features: Spectral Peaks, Pitch
  Modeling: Simple Thresholding
12 Samples from previous work
- Classes: Speech, Silence, Laughter, Non-speech
  Features: Cepstral Coefficients
  Modeling: HMM
- Lu, Zhang and Jiang
  Classes: Speech, Music, Environment Sound, Silence
  Features: ZCR, STE, Spectral Flux, Linear Spectral Pairs Distance, Band Periodicity, Noise Frame Ratio
  Modeling: KNN and Simple Thresholding
13 Samples from previous work
- Classes: Silence, Pure Music, Song, Speech with Music Background, Pure Speech
  Features: STE, ZCR, Pitch, Spectral Peaks
  Modeling: Simple Thresholding
- Our system is based on this work
14 Our System
- Uses widely accepted fundamental features
- Segmentation and classification are two separate steps
- The defined classes are appropriate for concert video parsing purposes
- On the other hand:
  - No information is given about the segmentation parameters
  - The modeling and decision making phases are primitive
15 Our System
(System block diagram: audio data goes through Feature Extraction (STE, ZCR, Pitch, Spectral Peaks), then Segmentation, Classification and Post-processing, producing the audio segment indices.)
16 Our System
- A segment boundary is detected if an abrupt change occurs in one of the STE, ZCR or pitch values.
- A sliding window proceeds along with each newly computed feature value.
- A segment boundary is claimed if the difference between the average feature values in the first half (Ave(w1)) and the second half (Ave(w2)) of the window exceeds a threshold (a sketch follows).
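A minimal sketch of the half-window comparison described above, applied to one per-frame feature sequence (STE, ZCR or pitch); the window size and threshold are illustrative, since the slides do not state the actual segmentation parameters.

    import numpy as np

    def detect_boundaries(feature_seq, win=20, threshold=0.5):
        """Flag a boundary wherever the averages of the two window halves
        differ by more than a threshold (illustrative values)."""
        half = win // 2
        boundaries = []
        for i in range(len(feature_seq) - win + 1):
            ave_w1 = np.mean(feature_seq[i:i + half])        # Ave(w1)
            ave_w2 = np.mean(feature_seq[i + half:i + win])  # Ave(w2)
            if abs(ave_w1 - ave_w2) > threshold:
                boundaries.append(i + half)  # boundary at the window centre
        return boundaries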
17 Our System
- Step 1: Silence Detection (STE, ZCR). The system detects silence if both the STE and ZCR values are low.
- Step 2: Separating segments with/without a music component (Spectral Peaks). If there are peaks in the power spectrum that keep their values for a certain period of time, the interval is assumed to contain a music component. To avoid influence from speech, only components higher than 500 Hz are considered.
- Further classification is performed based on the defined rules and the thresholds that have been set (a rule-based sketch follows).
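A minimal sketch of the two rule-based steps above for a single segment; every threshold here is an illustrative placeholder, since the slides describe the rules but not their values.

    import numpy as np

    def classify_segment(ste, zcr, peak_persistence,
                         ste_thr=0.01, zcr_thr=0.05, min_peak_frames=30):
        """Rule-based labelling of one segment.

        ste, zcr         -- per-frame feature arrays for the segment
        peak_persistence -- per-frame count of how long the current spectral
                            peaks above 500 Hz have kept their values (frames)
        All thresholds are illustrative placeholders.
        """
        # Step 1: silence if both STE and ZCR stay low.
        if np.mean(ste) < ste_thr and np.mean(zcr) < zcr_thr:
            return "silence"
        # Step 2: a music component is assumed if spectral peaks above 500 Hz
        # persist long enough.
        if np.max(peak_persistence) >= min_peak_frames:
            return "with music component"
        return "without music component"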
18 Our System
- Implementations for feature extraction
- Preliminary tests for the segmentation part using the time-domain features
- Implementation of the frequency-based features is ongoing
19 Our System
(Feature value plots for music and speech data)
20 Our System
(Feature value plots for music and speech data, continued)
21 Our System
- Simple window averaging does not constitute a robust discriminative function
- More discriminative properties can be observed in the graphs (e.g. variance, and the distribution of high and low values for the music/speech data)
- All peaks might be the indication of some kind of change in the data
22 Work Plan
- Extracting a complete set of audio features
- Improving segmentation performance up to the point that it can be used as a standalone system
- Implementing a training-based classification scheme (so that it can also be used for the detection of key events)