1
Audio Modality for Semantic Analysis of Video
  • Umut Naci

2
Audio Analysis for Segmentation
  • Overview
  • Features
  • Segmentation and Classification
  • Previous Work
  • Our System
  • Work Plan

3
Overview
  • In semantic analysis of video, the focus is mostly
    on visual information
  • Audio contains a significant amount of information
  • Existing work on audio is at a preliminary stage
  • Directions in audio content analysis
    - Speech-music discrimination
    - Speech-music-background (others) discrimination
    - Key effect detection
    - Audio-based genre detection
    - ...

4
Features
  • In the audio analysis world, features are well
    categorized thanks to the efforts made in speech
    processing

  • Time domain features
    - Short Term Energy (STE)
    - Zero Crossing Rate (ZCR)
  • Frequency domain features
    - Pitch
    - Spectral Peaks (Harmonics)
    - MFCC

5
Features
  • Short Term Energy Function (STE)
  • A simple tool for silence detection
  • Can track rhythm patterns
  • Higher for voiced speech components than for
    unvoiced ones
  • Not robust to variable volume and SNR (see the
    sketch below)
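
A minimal sketch of how STE might be computed over fixed-length
frames; the frame and hop sizes are illustrative assumptions, not
values from the slides:

  import numpy as np

  def short_term_energy(x, frame_len=1024, hop=512):
      # x: mono waveform as a float array
      # mean squared amplitude per frame (frame_len/hop are illustrative)
      return np.array([np.mean(x[i:i + frame_len] ** 2)
                       for i in range(0, len(x) - frame_len + 1, hop)])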

6
Features
  • Zero Crossing Rate (ZCR)
  • Represents the frequency content of the signal
  • A simple but efficient tool
  • Higher for unvoiced speech components (see the
    sketch below)
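
A minimal ZCR sketch under the same framing assumptions (frame and
hop sizes are illustrative):

  import numpy as np

  def zero_crossing_rate(x, frame_len=1024, hop=512):
      # fraction of adjacent-sample sign changes per frame
      return np.array([np.mean(np.abs(np.diff(np.sign(x[i:i + frame_len]))) > 0)
                       for i in range(0, len(x) - frame_len + 1, hop)])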

7
Features
  • Pitch
  • Fundamental frequency of an audio waveform
  • Defined for voiced speech and harmonic music
  • Pitch estimation methods of varying complexity
    exist
  • No method is totally reliable and robust, and some
    are computationally expensive (see the sketch below)
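
As an example of one common approach (not necessarily the one used
here), a rough autocorrelation-based estimate for a single frame; the
search range and the voicing threshold are assumptions:

  import numpy as np

  def estimate_pitch(frame, sr, fmin=60.0, fmax=1000.0):
      # autocorrelation of the zero-mean frame
      frame = frame - np.mean(frame)
      ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
      lag_lo, lag_hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
      if lag_hi <= lag_lo:
          return 0.0
      lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
      # weak periodicity -> treat as unvoiced / non-harmonic
      return sr / lag if ac[lag] > 0.3 * ac[0] else 0.0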

8
Features
  • MFCC (Mel-Frequency Cepstral Coefficients)
  • Takes the perceptual characteristics of the human
    auditory system into account
  • The most important feature for speech processing
    systems (see the sketch below)
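
MFCC extraction is usually delegated to a library; a short sketch
using librosa (the file name and the number of coefficients are
illustrative assumptions):

  import librosa

  y, sr = librosa.load("example.wav", sr=None)        # hypothetical input file
  mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)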

9
Segmentation and Classification
  • Class definitions

  • Silence
  • Background / Others (e.g. applause, gunshots)
  • Speech
  • Music
    - Speech + Music
    - Pure Music

10
Segmentation and Classification
  • Methods
  • Thresholding Feature Values
  • Gaussian Mixture Models (see the sketch after this list)
  • Hidden Markov Models
  • Neural Networks
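
As an illustration of the model-based methods listed above, a minimal
Gaussian Mixture Model classifier sketch with scikit-learn; the
training data layout and class names are hypothetical:

  from sklearn.mixture import GaussianMixture

  def train_gmms(train_features, n_components=4):
      # train_features: dict mapping class name -> (n_frames, n_features) array
      return {name: GaussianMixture(n_components=n_components).fit(X)
              for name, X in train_features.items()}

  def classify(segment_features, gmms):
      # pick the class whose GMM gives the highest average log-likelihood
      return max(gmms, key=lambda name: gmms[name].score(segment_features))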

11
Samples from previous work
  • Saunders
    - Classes: Speech, Music
    - Features: STE, ZCR
    - Modeling: Simple Thresholding
  • Wyse and Smoliar
    - Classes: Speech, Music, Others
    - Features: Spectral Peaks, Pitch
    - Modeling: Simple Thresholding
12
Samples from previous work
  • Kimber and Wilcox
    - Classes: Speech, Silence, Laughter, Nonspeech
    - Features: Cepstral Coefficients
    - Modeling: HMM
  • Lu, Zhang and Jiang
    - Classes: Speech, Music, Environment Sound, Silence
    - Features: ZCR, STE, Spectral Flux, Line Spectral
      Pairs Distance, Band Periodicity, Noise Frame Ratio
    - Modeling: KNN + Simple Thresholding
13
Samples from previous work
  • Zhang and Kuo
    - Classes: Silence, Pure Music, Song, Speech with
      Music Background, Pure Speech
    - Features: STE, ZCR, Pitch, Spectral Peaks
    - Modeling: Simple Thresholding
  • Our system is based on this work
14
Our System
  • Why Zhang and Kuo?

  • Uses widely accepted, fundamental features
  • Segmentation and classification are two separate
    steps
  • The defined classes are appropriate for concert
    video parsing purposes
  • On the other hand:
    - No information about segmentation parameters
    - Modeling and decision making phases are primitive
15
Our System
  • System Overview

  • Pipeline: Audio data → Feature Extraction (STE, ZCR,
    Pitch, Spectral Peaks) → Segmentation →
    Classification → Post-processing → Audio segment
    indices
16
Our System
  • Segmentation
  • A segment boundary is detected if an abrupt change
    occurs in one of the STE, ZCR or pitch values
  • A sliding window advances with each newly computed
    feature value
  • A segment boundary is declared if the difference
    between the average feature values in the first half
    (Ave(w1)) and the second half (Ave(w2)) of the window
    exceeds a threshold (see the sketch below)
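
A minimal sketch of this sliding-window test on a single feature
track; the window length and threshold are assumptions, since the
slides do not give the parameter values:

  import numpy as np

  def detect_boundaries(feature, win=20, threshold=0.5):
      # flag frame i as a boundary candidate if the two window halves differ too much
      half = win // 2
      boundaries = []
      for i in range(half, len(feature) - half):
          ave_w1 = np.mean(feature[i - half:i])   # first half of the window
          ave_w2 = np.mean(feature[i:i + half])   # second half of the window
          if abs(ave_w1 - ave_w2) > threshold:
              boundaries.append(i)
      return boundaries

In the actual system the same test would be run on the STE, ZCR and
pitch tracks and the detections merged.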

17
Our System
  • Classification

  • Step 1: Silence detection (STE, ZCR)
    - The system detects silence if both the STE and ZCR
      values are low
  • Step 2: Separating intervals with and without a
    music component (Spectral Peaks)
    - If there are peaks in the power spectrum that keep
      their values for a certain period of time, the
      interval is assumed to contain a music component
    - To avoid influence from speech, only components
      higher than 500 Hz are considered
  • Further classification is performed based on
    predefined rules and thresholds (see the sketch below)
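
A rough sketch of the two rule-based steps; the thresholds and the
peak-persistence test are simplified assumptions, only the 500 Hz
limit comes from the slides:

  import numpy as np

  def is_silence(ste, zcr, ste_thr=0.01, zcr_thr=0.1):
      # Step 1: silence if both short-term energy and zero crossing rate are low
      return ste < ste_thr and zcr < zcr_thr

  def has_music_component(spectra, freqs, min_frames=10, peak_thr=0.2):
      # Step 2: look for spectral peaks above 500 Hz that persist over several frames
      band = spectra[:, freqs > 500.0]            # spectra: (n_frames, n_bins)
      if band.size == 0:
          return False
      peak_present = (band > peak_thr * band.max()).any(axis=1)
      run = best = 0
      for p in peak_present:                      # longest run of peaky frames
          run = run + 1 if p else 0
          best = max(best, run)
      return best >= min_frames                   # long enough run -> music component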
18
Our System
  • Implementation Status
  • Feature extraction has been implemented
  • Preliminary tests of the segmentation part using the
    time-domain features
  • Implementation of the frequency-based features is
    ongoing

19
Our System
  • Short Term Energy Graph

20
Our System
  • Zero Crossing Rate Graph

21
Our System
  • Observations
  • Simple window averaging does not constitute a robust
    discriminative function
  • More discriminative properties can be observed in the
    graphs (e.g. variance, distribution of high and low
    values for music/speech data); see the sketch below
  • Every peak might be an indication of some kind of
    change in the data
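
For instance, a windowed variance statistic could replace the plain
mean in the boundary test above; a small sketch (the window length is
an assumption):

  import numpy as np

  def windowed_variance(feature, win=20):
      # variance inside a sliding window; often more discriminative than
      # the mean (e.g. ZCR tends to vary more strongly within speech than music)
      return np.array([np.var(feature[i:i + win])
                       for i in range(len(feature) - win + 1)])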

22
Our System
  • Future Work
  • Extracting a complete set of audio features
  • Improving segmentation performance to the point that
    it can be used as a standalone system
  • Implementing a training-based classification scheme
    (so that it can also be used to detect key events)