Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation

Slides: 31
Provided by: Welc191
Transcript and Presenter's Notes


1
Robust Voice Activity Detection for Interview
Speech in NIST Speaker Recognition Evaluation
  • Man-Wai MAK and Hon-Bill YU
  • The Hong Kong Polytechnic University
  • enmwmak_at_polyu.edu.hk
  • http://www.eie.polyu.edu.hk/mwmak/

2
Outline
  • Speaker Verification
  • Speaker Verification Process
  • Voice Activity Detection (VAD) in Speaker
    Verification
  • Effect of VAD on Acoustic Features
  • Characteristics of Interview-Speech in NIST
    Speaker Recognition Evaluation
  • VAD for NIST Speaker Recognition Evaluation
  • Experiments on NIST SRE 2008
  • Preliminary Results on NIST SRE 2010

3
Speaker Verification Process
  • To verify the identity of a claimant based on
    his/her own voice

[Figure: a claimant says "I am Mary"; the system asks "Is this Mary's voice?"]
4
Speaker Verification Process
  • A 2-class hypothesis problem
  • H0: MFCC sequence X(c) comes from the true
    speaker
  • H1: MFCC sequence X(c) comes from an impostor
  • The verification score is a likelihood ratio

[Diagram: Feature extraction -> Speaker Model and Background Model -> Score (speaker-model output minus background-model output) -> Decision]
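The likelihood-ratio score on this slide can be sketched as follows. This is a minimal illustration, not the presenters' system: each model is a single diagonal-covariance Gaussian standing in for a full speaker or background model, and the function names, toy frames, and zero threshold are all assumptions.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def llr_score(frames, speaker, background):
    """Average per-frame log-likelihood ratio: log p(x|H0) - log p(x|H1)."""
    total = 0.0
    for x in frames:
        total += gaussian_loglik(x, *speaker) - gaussian_loglik(x, *background)
    return total / len(frames)

def decide(score, threshold=0.0):
    """Accept the identity claim when the score exceeds the threshold."""
    return "accept" if score > threshold else "reject"

# Toy 2-dimensional feature frames and hypothetical (mean, variance) models.
speaker_model = ([1.0, -1.0], [1.0, 1.0])
background_model = ([0.0, 0.0], [1.0, 1.0])
frames = [[0.9, -1.1], [1.2, -0.8], [1.0, -1.0]]

score = llr_score(frames, speaker_model, background_model)
print(decide(score))
```

Working in the log domain turns the ratio into a difference, which matches the subtraction shown in the diagram.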
5
Voice Activity Detection in Speaker Verification
[Diagram: Speech -> VAD -> Speech segments -> Feature Extraction -> Acoustic Features (MFCC). Inside feature extraction: log|X(ω)| -> DCT -> MFCC]
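The "log spectrum, then DCT" step in this diagram can be sketched as below. It is a simplified illustration under stated assumptions: it omits the mel filterbank, pre-emphasis, and windowing that a full MFCC front end uses, it relies on a naive DFT for self-containment, and every function name is hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of the DFT for bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def dct2(values, num_coeffs):
    """Unnormalised DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]

def cepstral_features(frame, num_coeffs=4, floor=1e-10):
    """DCT of the log magnitude spectrum of one frame."""
    # The floor avoids log(0) in bins that carry no energy.
    log_spec = [math.log(max(m, floor)) for m in magnitude_spectrum(frame)]
    return dct2(log_spec, num_coeffs)

# Toy frame: 32 samples of a sinusoid.
frame = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
features = cepstral_features(frame)
```

If VAD lets non-speech frames through, their features are computed in exactly the same way and end up mixed with the speech features.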
6
Effect of VAD on Acoustic Features
[Diagram: two paths compared. Speech -> Feature Extraction (non-speech region included) versus Speech -> VAD -> Feature Extraction]
8
Interview-Speech in NIST SRE
[Figure: interview room layout showing the interviewee, the desk, and the interviewer. Source: NIST SRE 2008 Workshop]
9
Interview-Speech in NIST SRE
  • Far-field and desktop microphones were used for
    collecting interview speech
  • Some interview-speech files are very noisy,
    causing difficulty in differentiating speech
    segments from non-speech segments

[Figure: a typical interview-speech file in NIST SRE 2008, shown as a waveform (amplitude vs. time) and a spectrogram (frequency vs. time), with speech and non-speech regions marked]
10
Interview-Speech in NIST SRE
  • Some files have very low SNR

[Figure: waveform of the whole file (amplitude vs. time) and spectrogram (frequency vs. time) with the VAD segmentation. Legend: S = speech, h = non-speech]
11
Interview-Speech in NIST SRE
  • Some files contain spiky signals, which lead to a
    wrong VAD decision threshold

[Figure: waveform (amplitude vs. time) with a spiky signal marked]
12
Interview-Speech in NIST SRE
  • Some files contain low-energy speech signals
    superimposed on periodic background noise.

[Figure: waveform (amplitude vs. time) and spectrogram (frequency vs. time) with the VAD segmentation; non-speech is detected as speech]
14
VAD for NIST Speaker Recognition Evaluation
  • Use speech enhancement as a pre-processing step

[Diagram: Noisy Speech -> Denoising (Spectral Subtraction) -> Denoised Speech -> Energy-based VAD -> Speech Segment Info. The denoising and energy-based VAD blocks together form the Spectral-Subtraction VAD (SS-VAD). Downstream: Feature Extraction -> MFCC -> Scoring (Speaker Model, Impostor Model) -> Decision Making (Decision Threshold) -> Accept/Reject]
15
VAD for NIST Speaker Recognition Evaluation
  • Use speech enhancement as a pre-processing step

Signal                      Frequency spectrum
Clean speech x(n,m)         X(ω,m)
Noisy speech y(n,m)         Y(ω,m)
Background speech b(n,m)    B(ω,m)

These values were set so as to remove as much
noise as possible.
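A minimal sketch of a spectral-subtraction step, assuming a power-spectrum formulation with an over-subtraction factor `alpha` and a spectral floor `beta`. Both names and their default values are assumptions for illustration; the slide's own formula and parameter values are not recoverable from this transcript.

```python
def estimate_noise(background_frames):
    """Average the power spectra of frames assumed to hold background only."""
    num_bins = len(background_frames[0])
    return [
        sum(frame[k] for frame in background_frames) / len(background_frames)
        for k in range(num_bins)
    ]

def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Subtract a scaled noise estimate from each frequency bin of one frame.

    alpha (over-subtraction) and beta (spectral floor) are hypothetical
    tuning values, not the ones used in the presentation.
    """
    return [
        max(y - alpha * b, beta * b)  # the floor keeps every bin positive
        for y, b in zip(noisy_power, noise_power)
    ]

# Toy power spectra with three frequency bins per frame.
background_frames = [[1.0, 4.0, 1.0], [1.0, 6.0, 1.0]]
noise = estimate_noise(background_frames)   # strong background in bin 1
noisy = [9.0, 6.0, 1.5]                     # one noisy frame
clean = spectral_subtract(noisy, noise)
```

A larger `alpha` removes more background at the cost of more speech distortion, so bins dominated by a stationary background fall to the floor while bins with strong speech energy survive.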
16
VAD for NIST Speaker Recognition Evaluation
  • Without denoising
  • With denoising

[Figure: two waveforms (amplitude vs. time), one without denoising and one with denoising]
17
VAD for NIST Speaker Recognition Evaluation
  • Without denoising

[Figure: VAD segmentation without denoising. Legend: S = speech, h = non-speech]
18
VAD for NIST Speaker Recognition Evaluation
  • With denoising

[Figure: segmentation produced by SS-VAD and by the VAD in the ETSI-AMR speech coder. Legend: S = speech, h = non-speech]
19
VAD for NIST Speaker Recognition Evaluation
  • Speech-segment-length to speech-file-length ratio
    of 3 VADs

Example: for a file with a total duration of 10 s
in which the detected speech segments total 3 s,
the speech-segment-length to speech-file-length
ratio is 3/10.
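The ratio can be computed from a VAD's segment list as sketched below. The `(start, end)` segment format in seconds and the function name are assumptions made for illustration.

```python
def speech_ratio(segments, file_duration):
    """Total detected speech length divided by the file length."""
    speech = sum(end - start for start, end in segments)
    return speech / file_duration

# Hypothetical VAD output for a 10-second file: 3 seconds of speech in total.
segments = [(1.0, 2.5), (4.0, 5.0), (7.5, 8.0)]
ratio = speech_ratio(segments, 10.0)
```

A ratio close to 1 for a noisy interview file is a warning sign, since it means the VAD labelled almost everything as speech.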
20
VAD for NIST Speaker Recognition Evaluation
  • Speech-segment-length to speech-file-length ratio
    of 3 VADs

[Figure: distributions of the ratio for the VAD in the ETSI AMR coder, the ordinary energy-based VAD, and the spectral-subtraction VAD. Annotation: a high frequency of occurrence, suggesting that many non-speech segments are mistakenly detected as speech segments]
22
Experiments on NIST SRE 2008
  • Speaker Modeling: GMM-SVM
  • Score Normalization: T-norm
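T-norm normalises a trial score using the scores that the same test utterance obtains against a cohort of impostor models. The sketch below assumes those cohort scores are already available; the function name and the numbers are hypothetical.

```python
import statistics

def t_norm(raw_score, cohort_scores):
    """Shift and scale a score by the cohort mean and standard deviation."""
    mean = statistics.mean(cohort_scores)
    std = statistics.pstdev(cohort_scores)  # assumes the cohort scores vary
    return (raw_score - mean) / std

# Scores of one test utterance against four hypothetical impostor models.
cohort = [1.0, 3.0, 1.0, 3.0]
normalised = t_norm(5.0, cohort)
```

Because the cohort statistics are computed per test utterance, the normalisation compensates for utterance-dependent score shifts, which makes a single decision threshold more usable across trials.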

23
Results on NIST 2008 SRE
24
Results on NIST 2008 SRE
[Figure: results for Common Condition 1. Curve labels: VAD, ETSI AMR, SS-VAD]
25
Preliminary Results on NIST 2010
Common Condition 2: All trials involving
interview speech from different microphones

VAD                     EER (%)   Normalized minDCF
Energy-based VAD        11.72     0.99
SS-VAD                   4.45     0.58
SMB                      5.83     0.75
SS-SMB                   4.62     0.60
NIST ASR Transcripts     8.58     0.85
ETSI-AMR                 8.05     0.85

SMB: Statistical-Model Based VAD. Sohn et al., "A
statistical model-based voice activity
detection," IEEE Signal Processing Letters, 1999.
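The equal error rate (EER) is the operating point at which the false-rejection and false-acceptance rates coincide. The sketch below approximates it by sweeping the threshold over the observed scores; the scores are made up, the function names are hypothetical, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """False-rejection and false-acceptance rates at one threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: use the threshold where FRR and FAR are closest
    and report the average of the two rates at that threshold."""
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        frr, far = error_rates(target_scores, impostor_scores, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0)
    return best[1]

# Hypothetical verification scores for target and impostor trials.
targets = [2.0, 1.5, 0.4, 1.8]
impostors = [-1.0, 0.5, -0.3, 0.1]
eer = equal_error_rate(targets, impostors)
print(f"EER = {100 * eer:.2f}%")
```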
26
Conclusions
  • Noise reduction is of primary importance for VAD
    under extremely low SNR
  • It is important to remove the sinusoidal
    background found in NIST SRE sound files, as this
    kind of background signal could lead to many
    false detections in energy-based VAD.
  • Using noise reduction as a pre-processing step
    leads to a VAD that outperforms the VAD in ETSI-AMR
    (Option 2).

27
VAD for NIST Speaker Recognition Evaluation
  • Threshold Determination and VAD Decision Logic

[Diagram labels: Sample-based, Windowing, Frame-based, Amplitude Ranking]
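The transcript preserves only the labels of this slide, so the following is one plausible reading rather than the presenters' actual logic: rank the absolute sample amplitudes, place a threshold between a low-ranked and a high-ranked value, then make one speech/non-speech decision per frame. The percentile positions, the `weight`, and the function names are all assumptions.

```python
def ranked_threshold(samples, low_pct=0.1, high_pct=0.9, weight=0.2):
    """Place a threshold between low- and high-ranked absolute amplitudes."""
    ranked = sorted(abs(s) for s in samples)
    low = ranked[int(low_pct * (len(ranked) - 1))]    # background level
    high = ranked[int(high_pct * (len(ranked) - 1))]  # speech level
    return low + weight * (high - low)

def frame_decisions(samples, frame_len, threshold):
    """Mark a frame as speech when its mean absolute amplitude
    exceeds the threshold (non-overlapping frames)."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / frame_len
        decisions.append(level > threshold)
    return decisions

# Toy signal: quiet background, a louder "speech" burst, then background.
signal = [0.01, -0.01] * 10 + [0.5, -0.4] * 10 + [0.01, -0.01] * 10
threshold = ranked_threshold(signal)
labels = frame_decisions(signal, frame_len=10, threshold=threshold)
```

Because the threshold comes from ranked values rather than from the maximum amplitude, a single isolated spike barely moves it.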
28
Results
29
Experiments on NIST SRE 2008
  • Training phase

[Diagram labels: (NIST05 06); Feature Extraction; MAP Adaptation (two blocks); 300 background speakers (NIST06); (NIST08); GMM-supervectors of 300 impostors; GMM-supervectors of target speakers; NAP (two blocks)]
30
Experiments on NIST SRE 2008
  • Verification phase