Title: Biologically-Inspired Audio Coding
1Biologically-Inspired Audio Coding
- Ramin Pishehvar
- Advanced Audio Systems, VPBT
2Plan
- What is audio coding?
- What is sparse coding?
- Why sparse coding?
- Motivations For a New Paradigm
- Mathematical Background
- Details of the Proposed Coding Paradigm
- Pattern extraction
- Coding Results For Different Audio Signals
3Audio Coding
- Techniques that allow us to transmit an audio signal at a bitrate lower than that of the raw material, without perceptible loss of quality
- Used in iPods, cell phones, DVDs, Blu-rays, TVs, computers, etc.
- State-of-the-art techniques are based on Fourier-like (frequency-domain) transforms: MDCT, FFT, etc.
- Some known industry standards: MP3, AAC, High-Efficiency AAC, AMR-WB, etc.
4Linkages - Industry Canada
Conceptual Differences between Classical Coding and Sparse Coding
(Figure: two panels, each titled Histogram of Energy)
5Next Steps of Action
6Why Is Change Necessary?
- Frame-based (FFT, DCT, MDCT, etc.) coding
- Transients and acoustic events are smeared across frames
- Analysis results change depending on the alignment of the frames (even with wavelets; Simoncelli et al. 1992)
- Frame-based analysis is not shift invariant
- Energy saving with sparse coding
7Shift Variance In Frame-Based Coding
From Smith and Lewicki 2005
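The alignment problem can be illustrated with a toy example. The sketch below is not taken from the talk: the frame length, the test burst, and all function names are assumptions chosen for illustration. It applies an orthonormal DCT-II to non-overlapping frames and compares the per-frame coefficient energy for the same transient at two alignments.

```python
import math

def dct2_orthonormal(frame):
    """Orthonormal DCT-II of one frame, so coefficient energy equals sample energy."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(frame)))
    return coeffs

def frame_energies(signal, frame_len):
    """Coefficient energy in each non-overlapping frame."""
    return [sum(c * c for c in dct2_orthonormal(signal[s:s + frame_len]))
            for s in range(0, len(signal) - frame_len + 1, frame_len)]

def burst(length, onset, width=4):
    """Hypothetical test transient: `width` unit samples starting at `onset`."""
    return [1.0 if onset <= i < onset + width else 0.0 for i in range(length)]

FRAME = 16  # illustrative frame length, in samples
aligned = frame_energies(burst(64, onset=16), FRAME)  # transient lies inside frame 1
shifted = frame_energies(burst(64, onset=30), FRAME)  # transient straddles frames 1 and 2
print([round(e, 6) for e in aligned])
print([round(e, 6) for e in shifted])
```

With the aligned burst, all of the energy lands in a single frame. After a 14-sample shift, the same event is split evenly across two frames, so the frame-based description of an identical sound changes with its position.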
8Matching Pursuit (MP)
(Block diagram labels: MAX, Save the optimal kernel, subtraction (-), Residual)
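The loop named on this slide (take the maximum projection, save the optimal kernel, subtract it, continue on the residual) can be sketched as generic matching pursuit. This is a minimal illustration, not the coder's implementation: the toy kernels, the exhaustive search over delays, and every function name are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def matching_pursuit(signal, kernels, n_iter):
    """Greedy MP. Returns (spikes, residual); a spike is (kernel_index, delay, amplitude)."""
    residual = list(signal)
    spikes = []
    for _ in range(n_iter):
        best = None
        for k, kern in enumerate(kernels):
            for delay in range(len(residual) - len(kern) + 1):
                amp = dot(residual[delay:delay + len(kern)], kern)
                if best is None or abs(amp) > abs(best[2]):
                    best = (k, delay, amp)          # MAX over all kernels and delays
        k, delay, amp = best
        spikes.append(best)                         # save the optimal kernel
        for i, g in enumerate(kernels[k]):          # subtract it to form the new residual
            residual[delay + i] -= amp * g
    return spikes, residual

def toy_kernel(cycles, length=8):
    """Hypothetical unit-norm kernel: a decaying cosine, standing in for a gammatone."""
    return normalize([math.exp(-i / 3.0) * math.cos(2 * math.pi * cycles * i / length)
                      for i in range(length)])

k0, k1 = toy_kernel(1), toy_kernel(3)
signal = [0.0] * 40
for i in range(8):
    signal[5 + i] += 2.0 * k0[i]     # one event: kernel 0, delay 5, amplitude 2
    signal[20 + i] -= 1.0 * k1[i]    # another event: kernel 1, delay 20, amplitude -1

spikes, residual = matching_pursuit(signal, [k0, k1], n_iter=2)
print(spikes)
```

Because the two events do not overlap in time, two iterations recover both (kernel, delay, amplitude) triples and leave a residual that is zero up to rounding. On real audio the kernels overlap and the residual only shrinks gradually.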
9Proposed Biologically-Inspired Audio Coder
10Gammatone/Gammachirp Filterbanks
- Frequency-modulated version of gammatones
- Minimization of the scale-time uncertainty by the gammachirp (Irino and Patterson 2001)
- Better fit to physiological data
- More free tuning parameters
- Rotation of the tilings in the time-frequency
plane
(Figure labels: Decay Slope, Attack Slope, Center Frequency, Deviation From Center Frequency, Chirp Factor, Time Delay)
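For reference, a commonly used form of the gammachirp is g(t) = t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t + c ln t), which reduces to a gammatone when c = 0. The sketch below samples that form. The default values, the ERB fit (Glasberg and Moore 1990), and the mapping of `order` to attack and `b` to decay are assumptions made for illustration; the talk does not give its exact parameterization.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore 1990 fit)."""
    return 24.7 + 0.108 * fc_hz

def gammachirp(fc_hz, fs_hz, n_samples, order=4.0, b=1.019, chirp=0.0):
    """Unit-norm sampled gammachirp kernel; chirp=0.0 yields a gammatone.

    Assumed roles: `order` shapes the attack, `b` the decay, `chirp` is the
    chirp (modulation) factor, and `fc_hz` the center frequency.
    """
    kernel = []
    for i in range(1, n_samples + 1):      # start at t > 0 so that ln(t) is defined
        t = i / fs_hz
        envelope = t ** (order - 1.0) * math.exp(-2.0 * math.pi * b * erb(fc_hz) * t)
        kernel.append(envelope * math.cos(2.0 * math.pi * fc_hz * t + chirp * math.log(t)))
    norm = math.sqrt(sum(x * x for x in kernel))
    return [x / norm for x in kernel]

tone = gammachirp(1000.0, 16000.0, 320)               # gammatone at 1 kHz
chirped = gammachirp(1000.0, 16000.0, 320, chirp=2.0)  # same envelope, frequency-modulated
```

The time delay is not a parameter of the kernel itself: it is the position at which the kernel is placed in the signal.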
11Why Gammatone/Gammachirp?
- Optimal auditory coding strategy: maximize the information conveyed to the brain while minimizing the required energy and neural resources (Smith and Lewicki 2006)
- For natural sounds, optimal auditory coding is achieved when gammatones are used (Smith and Lewicki 2006)
- Gammatone is optimal for audio as Gabor is optimal for images (Smith and Lewicki 2006)
12Adaptive vs. Non-Adaptive
- Non-Adaptive
- Gammatone filterbank
- Center frequencies, time delays, and spike amplitudes computed
- Adaptive
- Gammachirp filterbank
- Center frequencies, time delays, spike amplitudes, chirp (modulation) factors, attack, and decay parameters computed
- Combinatorial explosion: suboptimal search
Our claim: switching to adaptive increases coding efficiency
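One way to avoid the combinatorial explosion is a staged search: first pick the best gammatone over center frequency and delay, then tune only the chirp factor with those two held fixed. The sketch below illustrates that idea under stated assumptions. The kernel form, the grids, and all names are hypothetical, and the talk does not specify which suboptimal search it uses.

```python
import math

def gammachirp(fc_hz, chirp, n_samples=64, fs_hz=8000.0, order=4.0, b=1.019):
    """Unit-norm gammachirp (chirp=0.0 is a gammatone); defaults are illustrative."""
    erb = 24.7 + 0.108 * fc_hz
    k = []
    for i in range(1, n_samples + 1):
        t = i / fs_hz
        k.append(t ** (order - 1.0) * math.exp(-2.0 * math.pi * b * erb * t)
                 * math.cos(2.0 * math.pi * fc_hz * t + chirp * math.log(t)))
    norm = math.sqrt(sum(x * x for x in k))
    return [x / norm for x in k]

def correlate_at(signal, kern, delay):
    return sum(s * g for s, g in zip(signal[delay:delay + len(kern)], kern))

def staged_search(signal, freqs, chirps, n_samples=64):
    """Stage 1: best gammatone over (fc, delay). Stage 2: tune only the chirp factor."""
    delays = range(len(signal) - n_samples + 1)
    tones = {f: gammachirp(f, 0.0, n_samples) for f in freqs}
    fc, delay = max(((f, d) for f in freqs for d in delays),
                    key=lambda fd: abs(correlate_at(signal, tones[fd[0]], fd[1])))
    tone_amp = correlate_at(signal, tones[fc], delay)
    chirp = max(chirps,
                key=lambda c: abs(correlate_at(signal, gammachirp(fc, c, n_samples), delay)))
    amp = correlate_at(signal, gammachirp(fc, chirp, n_samples), delay)
    return fc, delay, chirp, amp, tone_amp

freqs = [500.0, 1000.0, 2000.0]
chirps = [-2.0, -1.0, 0.0, 1.0, 2.0]           # includes 0.0, the gammatone case
signal = [0.0] * 20 + gammachirp(1000.0, 2.0) + [0.0] * 20
fc, delay, chirp, amp, tone_amp = staged_search(signal, freqs, chirps)
print(fc, delay, chirp, round(amp, 4), round(tone_amp, 4))
```

With F frequencies, D delays, and C chirp values, the staged search evaluates F*D + C projections instead of F*D*C. Since the chirp grid includes 0.0, the adapted kernel never captures less of the signal than the gammatone chosen in stage 1, but the result is not guaranteed to match a joint search.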
13Comparison of Adaptive vs. Non-Adaptive For Speech
Only the modulation (chirp) factor is adapted
14Masking
- MP is based on MSE
- Perceptual-based MP uses only instantaneous masking
- Remove spikes below the absolute threshold of hearing
- Remove spikes made inaudible by forward or backward temporal masking (on-frequency masking)
- Remove inaudible spikes in adjacent critical bands (i.e., off-frequency masking)
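The first pruning step can be sketched with Terhardt's well-known approximation of the threshold in quiet. The spike format, the 96 dB SPL full-scale calibration, and the function names below are assumptions for illustration; the temporal and off-frequency masking stages are not modelled here.

```python
import math

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing, in dB SPL."""
    k = f_hz / 1000.0
    return 3.64 * k ** -0.8 - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2) + 1e-3 * k ** 4

def spike_level_db(amplitude, full_scale_db_spl=96.0):
    """Assumed calibration: an amplitude of 1.0 plays back at `full_scale_db_spl`."""
    return full_scale_db_spl + 20.0 * math.log10(abs(amplitude))

def prune_below_threshold(spikes, full_scale_db_spl=96.0):
    """Keep spikes (center_freq_hz, delay, amplitude) at or above the threshold in quiet."""
    return [(fc, delay, amp) for fc, delay, amp in spikes
            if amp != 0.0
            and spike_level_db(amp, full_scale_db_spl) >= threshold_in_quiet_db(fc)]

spikes = [
    (1000.0, 0, 1e-4),   # about 16 dB SPL at 1 kHz
    (100.0, 0, 1e-4),    # same level at 100 Hz, where the threshold is much higher
    (1000.0, 5, 1e-6),   # about -24 dB SPL
]
kept = prune_below_threshold(spikes)
print(kept)
```

Only the first spike survives in this example: the same amplitude falls below the threshold at 100 Hz, and the third spike is too weak at any frequency.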
15Coding Results for Percussion
Previous works: 0.66N-3.2N for 4 kHz speech

                        Adaptive   Non-Adaptive
Spikes before masking   10000      30000
Spikes after masking    9430       29370
Spike gain              0.12N      0.37N
Bit rate                1.93       2.90

High quality in informal listening tests
16Pattern Extraction
- Extraction of auditory objects
- Spikes not statistically independent
- Episode discovery in spikegrams
- Codebook generation based on audio objects
- Signal coded as codebook elements plus residual
- Bitrate reduced by 40-50%
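The idea of coding a signal as codebook elements plus residual can be shown on a toy spike sequence. The sketch below swaps in a much simpler technique than episode discovery: it counts adjacent pairs of channel indices and replaces the most frequent pair with a single codebook symbol. The data, the symbol format, and the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(channels):
    """Most common adjacent pair of channel indices and its count."""
    pairs = Counter(zip(channels, channels[1:]))
    return pairs.most_common(1)[0]

def encode_with_pattern(channels, pattern):
    """Replace each occurrence of `pattern` by ('P', 0); other spikes stay as ('S', channel)."""
    encoded = []
    i = 0
    while i < len(channels):
        if tuple(channels[i:i + 2]) == pattern:
            encoded.append(("P", 0))             # codebook element 0
            i += 2
        else:
            encoded.append(("S", channels[i]))   # residual spike, coded individually
            i += 1
    return encoded

# Hypothetical spikegram, reduced to the channel index of each spike in time order.
channels = [3, 7, 1, 3, 7, 5, 3, 7, 2]
pattern, count = most_frequent_pair(channels)
encoded = encode_with_pattern(channels, pattern)
print(pattern, count, encoded)
```

Here nine spikes become six symbols. The saving exists only because the spikes are not statistically independent; a real coder would also have to account for the cost of transmitting the codebook.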
18WO Pattern 21982 1911
8
23704
19Future and Ongoing Work
- Generalizing pattern extraction to other features (time, amplitude, etc.)
- Closed-form, precomputed, tree-like search matching pursuit for speed-up (MPTK: 0.25 real time for large signals)
- Parametric coding of spike parameters (mean firing rate, delay, etc.)
- Modular approach to replace MP
- Compressed sensing
20Conclusion
- Efficient coding paradigm when coding delay can be afforded
- Paradigm mimics the auditory pathway
- Adaptive approach (with gammachirp) more efficient than non-adaptive (with gammatones)
- Masking removes inaudible spikes
- Object-based coding
- Expected to give 1 bit/sample for high-quality 44.1 kHz audio for archiving and broadcasting