Title: Audio Modality for Semantic Analysis of Video
1 Audio Modality for Semantic Analysis of Video
2 Audio Analysis for Segmentation
- Overview
- Features
- Segmentation classification
- Previous Works
- Our System
- Work Plan
3 Overview
- In video content analysis, the focus has mostly been on visual information
- Audio contains a significant amount of information
- Existing work on audio is at a preliminary stage
- Directions in audio content analysis
  - Speech-music discrimination
  - Speech-music-background (others) discrimination
  - Key effect detection
  - Audio-based genre detection
4 Features
- In the audio analysis world the features are well categorized, thanks to the efforts in speech processing
- Time Domain
  - Short Term Energy
  - Zero Crossing Rate
- Frequency Domain
  - Pitch
  - Spectral Peaks (Harmonics)
  - MFCC
5 Features
- Short Term Energy Function (STE); a minimal computation is sketched below
- A simple tool for silence detection
- Can keep track of the rhythm pattern
- Higher for voiced speech components than for unvoiced ones
- Not robust against variable volume and SNR
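A minimal sketch of how a per-frame STE curve could be computed, assuming a mono signal stored as a NumPy array; the frame length and hop size are illustrative choices, not values taken from these slides.

    import numpy as np

    def short_term_energy(signal, frame_len=512, hop=256):
        """Short Term Energy: sum of squared samples in each frame.

        frame_len and hop are illustrative defaults, not parameters
        of the presented system.
        """
        energies = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len].astype(np.float64)
            energies.append(np.sum(frame ** 2))
        return np.array(energies)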
6 Features
- Zero Crossing Rate (ZCR); a minimal computation is sketched below
- Represents the frequency content of the signal
- A simple but efficient tool
- Higher for unvoiced speech components
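A minimal per-frame ZCR sketch under the same assumption of a mono NumPy signal; the sign convention for exact zeros is an arbitrary choice.

    import numpy as np

    def zero_crossing_rate(frame):
        """Fraction of adjacent sample pairs whose signs differ."""
        signs = np.sign(frame.astype(np.float64))
        signs[signs == 0] = 1  # treat exact zeros as positive (arbitrary choice)
        return np.mean(signs[:-1] != signs[1:])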
7 Features
- Pitch: the fundamental frequency of an audio waveform
- Defined for voiced speech and harmonic music
- Different methods of different complexities exist for pitch estimation; an autocorrelation-based sketch follows
- No totally reliable and robust method exists, and some methods are computationally expensive
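One of the simpler estimation methods is autocorrelation-based pitch tracking; the sketch below assumes a single frame and a sample rate sr, and the search range and voicing threshold are illustrative values, not taken from the slides.

    import numpy as np

    def pitch_autocorrelation(frame, sr, fmin=60.0, fmax=500.0):
        """Rough pitch estimate: the lag of the strongest autocorrelation
        peak inside a plausible fundamental-frequency range. Returns 0.0
        when no clear periodicity is found (e.g. unvoiced speech, noise)."""
        frame = frame - np.mean(frame)
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(sr / fmax)
        lag_max = min(int(sr / fmin), len(corr) - 1)
        if corr[0] <= 0 or lag_max <= lag_min:
            return 0.0
        lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
        if corr[lag] / corr[0] < 0.3:  # crude voicing check, illustrative threshold
            return 0.0
        return sr / lag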
8 Features
- Mel-Frequency Cepstral Coefficients (MFCC); an extraction sketch follows
- Take the perceptual characteristics of the human auditory system into account
- The most important feature for speech processing systems
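As an illustration (not part of the presented system), MFCCs can be extracted with an off-the-shelf library such as librosa; the file name and the number of coefficients below are placeholder choices.

    import librosa

    # Load at the file's native sample rate and compute 13 MFCCs per frame.
    y, sr = librosa.load("example.wav", sr=None)   # placeholder file name
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    print(mfcc.shape)  # (13, number_of_frames)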
9 Segmentation Classification
- Silence
- Background
- Speech
- Music
  - Speech + Music
  - Pure Music
- Others
  - Applauses
  - Gunshots
10 Segmentation Classification
- Thresholding Feature Values
- Gaussian Mixture Models (a per-class GMM sketch follows this list)
- Hidden Markov Models
- Neural Networks
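As a hedged illustration of the model-based options above, a per-class Gaussian Mixture Model classifier could look like the sketch below; the feature matrices are random placeholders standing in for labelled training data, and the component count is arbitrary.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Placeholder feature matrices (rows = frames, columns = e.g. STE, ZCR, pitch).
    # In practice these would come from labelled speech and music recordings.
    speech_feats = np.random.rand(500, 3)
    music_feats = np.random.rand(500, 3)

    # Fit one mixture per class.
    gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_feats)
    gmm_music = GaussianMixture(n_components=4, random_state=0).fit(music_feats)

    def classify(segment_feats):
        """Label a segment by the class whose model gives the higher likelihood."""
        if gmm_speech.score(segment_feats) > gmm_music.score(segment_feats):
            return "speech"
        return "music"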
11 Samples from previous work
- Classes: Speech, Music
  Features: STE, ZCR
  Modeling: Simple Thresholding
- Wyse and Smoliar
  Classes: Speech, Music, Others
  Features: Spectral Peaks, Pitch
  Modeling: Simple Thresholding
12 Samples from previous work
- Classes: Speech, Silence, Laughter, Non-speech
  Features: Cepstral Coefficients
  Modeling: HMM
- Lu, Zhang and Jiang
  Classes: Speech, Music, Environment Sound, Silence
  Features: ZCR, STE, Spectral Flux, Linear Spectral Pairs Distance, Band Periodicity, Noise Frame Ratio
  Modeling: KNN and Simple Thresholding
13 Samples from previous work
- Classes: Silence, Pure Music, Song, Speech with Music Background, Pure Speech
  Features: STE, ZCR, Pitch, Spectral Peaks
  Modeling: Simple Thresholding
- Our system is based on this work
14 Our System
- Uses widely accepted fundamental features
- Segmentation and classification are two separate steps
- The defined classes are appropriate for concert video parsing purposes
- On the other hand:
  - No information is given about the segmentation parameters
  - The modeling and decision making phases are primitive
15 Our System
(System block diagram: audio data goes through Feature Extraction (STE, ZCR, Pitch, Spectral Peaks), then Segmentation, Classification and Post-processing, producing the audio segment indices.)
16 Our System
- A segment boundary is detected if an abrupt change occurs in one of the STE, ZCR or pitch values.
- A sliding window proceeds along with each newly computed feature value.
- A segment boundary is claimed if the difference between the average feature values in the first half (Ave(w1)) and the second half (Ave(w2)) of the window exceeds a threshold (a sketch follows).
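A minimal sketch of the half-window comparison described above, applied to one per-frame feature sequence (STE, ZCR or pitch); the window size and threshold are illustrative, since the slides do not state the actual segmentation parameters.

    import numpy as np

    def detect_boundaries(feature_seq, win=20, threshold=0.5):
        """Flag a boundary wherever the averages of the two window halves
        differ by more than a threshold (illustrative values)."""
        half = win // 2
        boundaries = []
        for i in range(len(feature_seq) - win + 1):
            ave_w1 = np.mean(feature_seq[i:i + half])        # Ave(w1)
            ave_w2 = np.mean(feature_seq[i + half:i + win])  # Ave(w2)
            if abs(ave_w1 - ave_w2) > threshold:
                boundaries.append(i + half)  # boundary at the window centre
        return boundaries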
17 Our System
- Step 1: Silence Detection (STE, ZCR). The system detects silence if both the STE and ZCR values are low.
- Step 2: Separating segments with/without a music component (Spectral Peaks). If there are peaks in the power spectrum that keep their values for a certain period of time, the interval is assumed to contain a music component. To avoid influence from speech, only components higher than 500 Hz are considered.
- Further classification is performed based on the defined rules and the thresholds that have been set (a rule-based sketch follows).
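A minimal sketch of the two rule-based steps above for a single segment; every threshold here is an illustrative placeholder, since the slides describe the rules but not their values.

    import numpy as np

    def classify_segment(ste, zcr, peak_persistence,
                         ste_thr=0.01, zcr_thr=0.05, min_peak_frames=30):
        """Rule-based labelling of one segment.

        ste, zcr         -- per-frame feature arrays for the segment
        peak_persistence -- per-frame count of how long the current spectral
                            peaks above 500 Hz have kept their values (frames)
        All thresholds are illustrative placeholders.
        """
        # Step 1: silence if both STE and ZCR stay low.
        if np.mean(ste) < ste_thr and np.mean(zcr) < zcr_thr:
            return "silence"
        # Step 2: a music component is assumed if spectral peaks above 500 Hz
        # persist long enough.
        if np.max(peak_persistence) >= min_peak_frames:
            return "with music component"
        return "without music component"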
18 Our System
- Implementations for feature extraction
- Preliminary tests for the segmentation part using the time-domain features
- Implementation of the frequency-based features is ongoing
19 Our System
(Feature value plots for music and speech data)
20 Our System
(Feature value plots for music and speech data, continued)
21 Our System
- Simple window averaging does not constitute a robust discriminative function
- More discriminative properties can be observed in the graphs (e.g. variance, and the distribution of high and low values for the music/speech data)
- All peaks might be the indication of some kind of change in the data
22 Work Plan
- Extracting a complete set of audio features
- Improving segmentation performance up to the point that it can be used as a standalone system
- Implementing a training-based classification scheme (so that it can also be used for the detection of key events)