1
Content Based Audio Classification: A Neural Network Approach
  • Vikramjit Mitra, Chia-Jiu Wang
  • Spring 2005

2
Abstract
  • Aim: audio classification based on audio content.
  • A parallel Artificial Neural Network (ANN) based architecture is introduced.
  • Audio signals are processed to extract features.
  • The features are fed to the parallel ANN architecture.
  • Genre classification accuracy: 87.3%

3
Acronyms
  • ANN: Artificial Neural Network
  • MLP: Multi-Layer Perceptron
  • GRBF: Gaussian Radial Basis Function
  • MFCC: Mel Frequency Cepstrum Coefficient
  • LPC: Linear Predictive Coding
  • SVM: Support Vector Machine
  • DWT: Discrete Wavelet Transform
  • CWT: Continuous Wavelet Transform
  • IE: Inference Engine
  • DCT: Discrete Cosine Transform
  • PA: Polynomial Approximation
  • HMM: Hidden Markov Model

4
1. Introduction
  • The Internet contains a huge number of audio-visual files.
  • Text searching is possible, but what about multimedia search?
  • Current multimedia search is based on file header information.
  • What if the header contains wrong information?
  • Proposed solution: content based searching.
  • The contents of a multimedia file are analyzed.
  • The analysis results determine the category to which the file belongs.

5
1.1 System Architecture
(Block diagram: Input Audio → Audio File Identification and Reading → Feature Extractor → Parallel ANN Classifier → Inference Engine → Final Decision)
6
2.0 Audio Data
  • 2 types of audio data considered:
  • MP3 (MPEG Audio Layer 3)
  • WAV
  • Sampling rate: 44.1 kHz
  • Total audio files: 60
  • 6 classes (genres): Classical (CL), Hard Rock (HR), Jazz (JZ), Pop (PP), Rap (RP) and Soft Rock (SR)
  • Each file segmented into 5 windows of 7.8 sec each (see the windowing sketch below)
  • Number of windows: 300 (50 per class)
  • Training: 180 windows
  • Testing: 120 windows
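A minimal windowing sketch in Python (the language, numpy, and the helper name are assumptions, not part of the paper): it splits a 44.1 kHz mono signal into the five non-overlapping 7.8 s windows described above.

```python
import numpy as np

def segment_into_windows(samples, sr=44100, window_sec=7.8, n_windows=5):
    """Split a 1-D sample array into n_windows non-overlapping windows."""
    window_len = int(window_sec * sr)          # 7.8 s at 44.1 kHz = 343,980 samples
    needed = window_len * n_windows
    if len(samples) < needed:
        raise ValueError("audio shorter than 5 x 7.8 s")
    # Keep only the first 5 windows and reshape to (n_windows, window_len)
    return samples[:needed].reshape(n_windows, window_len)
```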

7
3.0 Features
  • 8 feature vectors: V1, V2, V3, V4, V5, V6, V7, V8
  • V1 → LPC [1]: estimates s(n) as a linear combination of the previous samples
  • 9 coefficients selected per window (a sketch of the computation follows below)
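A minimal LPC sketch, assuming the standard autocorrelation (Levinson-Durbin) method and numpy; the slides only state that 9 coefficients per window are kept, so the order and the helper name here are assumptions.

```python
import numpy as np

def lpc_coefficients(x, order=9):
    """Return `order` prediction coefficients so that
    s(n) is approximated by sum_k a_k * s(n - k)."""
    # Autocorrelation lags r[0] .. r[order]
    r = np.correlate(x, x, mode="full")[len(x) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):            # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]                            # coefficients of the linear combination
```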

8
  • DWT of each window using the Haar wavelet.
  • V2 → mean (µ) and variance (σ²) of the segments of DWT coefficients (8 values).
  • V3 → polynomial approximation of the DWT coefficients (9 values).
  • The DWT of a discrete signal s_k is given as DWT(j, n) = Σ_k s_k · Ψ_{a,b}(k), with Ψ_{a,b}(k) = (1/√a) Ψ((k - b)/a)
  • Dilation factor a = 2^j, translation factor b = 2^j·n, where j, n are integers
  • Ψ_{a,b} is the wavelet function derived from the mother wavelet Ψ
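A hedged sketch of V2 under these definitions, assuming PyWavelets (the slides do not name a toolkit): a 3-level Haar DWT followed by the mean and variance of each coefficient segment, giving the 8 values mentioned above.

```python
import numpy as np
import pywt

def v2_features(window):
    # 3-level Haar DWT: returns [cA3, cD3, cD2, cD1]
    coeffs = pywt.wavedec(window, "haar", level=3)
    # Mean and variance of each of the 4 segments -> 8 values
    return np.array([stat for c in coeffs for stat in (c.mean(), c.var())])
```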

9
3-level DWT decomposition tree
  • Each DWT stage down-samples s_k by 2 and produces an approximation coefficient vector CA and a detail coefficient vector CD.

(Wavelet functions: Haar, Daubechies, Symlet)
10

(Plots of feature vectors V2 and V3)
11
  • Mel Frequency Cepstrum Coefficients (MFCC) [3]
  • Used as features for speech recognition.
  • Steps:
  • Break the signal into frames (using a Hamming window).
  • Obtain the amplitude spectrum of each frame.
  • Take the logarithm (log10) of the amplitude spectrum.
  • Convert to the Mel (perceptually based) spectrum.
  • Take the Discrete Cosine Transform (DCT) of the spectrum.
  • MFCC generates 13 coefficient sets;
  • each coefficient set represents a subband spectrum.
  • CA vector → the Haar DWT of each subband is generated.
  • Each CA vector is coded into 5 coefficients by PA.
  • 13 subbands give 65 coefficients → V4 (a sketch follows below).
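A hedged sketch of V4, assuming librosa for the MFCCs, PyWavelets for the Haar DWT, and a degree-4 polynomial fit as the PA step; these library choices and the exact polynomial order are assumptions, not details given in the slides.

```python
import numpy as np
import pywt
import librosa

def v4_features(window, sr=44100):
    # 13 MFCC coefficient sets, one trajectory (subband) per row
    mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    feats = []
    for subband in mfcc:
        ca, _ = pywt.dwt(subband, "haar")                     # approximation coefficients CA
        t = np.arange(len(ca))
        feats.extend(np.polyfit(t, ca, deg=4))                # 5 polynomial coefficients (PA)
    return np.array(feats)                                    # 13 subbands * 5 = 65 values
```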

12

(Plot of feature vector V4)
13
  • V5 and V6 are similar to V2 and V3, but the wavelet used is Daubechies.

14
  • V7 and V8 are similar to V2 and V3, but the wavelet used is Symlet.

15
  • Feature vector summary (table)

16
4.0 Artificial Neural Networks
  • Adaptive, generally nonlinear learning agents.
  • Built from Processing Elements (PEs).
  • PEs receive input from:
  • external sources
  • other PEs
  • The interconnection of PEs defines the topology of the ANN.
  • A signal flowing through a connection is scaled by a weight w_ij.
  • Each PE sums the different contributions
  • and produces its output as a nonlinear function of the sum (see the sketch below).
  • The output from a PE is processed by other PEs until the system output is generated.
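A minimal sketch of a single PE, assuming numpy and a tanh activation (the activation function is an assumption; the slides only say the output is a nonlinear function of the weighted sum).

```python
import numpy as np

def processing_element(x, w, bias=0.0):
    """One PE: weighted sum of the inputs followed by a nonlinearity."""
    return np.tanh(np.dot(w, x) + bias)     # y = f( sum_j w_j * x_j + b )

# Example: a PE with three inputs
print(processing_element(np.array([0.2, -0.5, 1.0]), np.array([0.4, 0.1, -0.3])))
```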

17
(Diagram: input → adaptive system → output; the output is compared with the desired output, and the resulting cost/error drives the training algorithm.)
  • A typical ANN architecture

18
2 types of ANN studied
  • Multi-Layer Perceptron (MLP) (a sketch follows below)
  • Layered feed-forward network [2]
  • Ideal for pattern recognition or classification
  • Slow training
  • Needs lots of training data
  • Gaussian Radial Basis Function (GRBF) network
  • Nonlinear hybrid network [2]
  • Uses a Gaussian transfer function
  • Faster to train than MLP
  • Uses both supervised and unsupervised learning
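A hedged sketch of the MLP branch with scikit-learn; the slides do not name a toolkit, and the topology, hyperparameters, and placeholder data below are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(180, 9))       # placeholder features (e.g. the 9 LPC values of V1)
y_train = rng.integers(0, 6, size=180)    # 6 genre labels (CL, HR, JZ, PP, RP, SR)

mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)                 # one such network would be trained per feature vector
```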

19
5.0 Proposed Classifier Architecture

(Block diagram: each feature vector V1 … V8 feeds its own network ANN-1 … ANN-8; the eight ANN outputs go to the Inference Engine, which produces the decision.)

20
  • Desired output for each class (table)

Rule base for the IE (table)
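Since the desired-output table itself is not reproduced here, the following is only a hedged example of a one-hot target coding for the six genres; the actual coding used in the paper may differ.

```python
import numpy as np

CLASSES = ["CL", "HR", "JZ", "PP", "RP", "SR"]
# One row of the identity matrix per genre, e.g. targets["JZ"] == [0, 0, 1, 0, 0, 0]
targets = {genre: np.eye(len(CLASSES))[i] for i, genre in enumerate(CLASSES)}
```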
21
6.0 Results

22
  • Each ANN was trained separately
  • and tested separately.
  • Prediction accuracy was low for certain classes.
  • Parallel ANN architecture → average the predictions.
  • Imitates the human nervous system:
  • each sensation is triggered through multiple neurons.
  • Each ANN receives input from one unique feature vector.
  • The results from each ANN are summed.
  • Processed by the IE → final decision (see the sketch below).
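A hedged sketch of the combination step, assuming the eight per-class score vectors are averaged and the IE simply picks the highest-scoring class; the actual IE rule base is given as a table in the paper and may be more elaborate.

```python
import numpy as np

CLASSES = ("CL", "HR", "JZ", "PP", "RP", "SR")

def combine_predictions(per_ann_scores):
    """per_ann_scores: list of 8 arrays, one score per class from each ANN."""
    averaged = np.mean(per_ann_scores, axis=0)   # average the eight predictions
    return CLASSES[int(np.argmax(averaged))]     # IE decision (assumed: argmax)
```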

23
  • Prediction result example
  • Parallel ANN and IE prediction for each case (table)

24
  • Prediction result example

Prediction accuracy for each class (table)
25
7.0 Conclusion
  • Parallel ANN (MLP based) accuracy → 87.3%
  • MLP performed better than GRBF.
  • MLP is easy to construct, train and test.
  • PA coefficients → poor features.
  • Future research:
  • other possible ANN architectures
  • parametric classifiers (Bayesian, SVM, HMM, etc.)
  • better feature vectors, probably through more exploration of wavelets
  • capability to read other audio file formats.

26
8.0 References
  • [1] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27, Issue 3, pp. 247-254, June 1979.
  • [2] J.C. Principe, N.R. Euliano and W.C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley & Sons, Inc., February 2000.
  • [3] Malcolm Slaney, Auditory Toolbox for Matlab, Interval Research Corporation, Version 2.