Title: IRISA 2003 SPEAKER RECOGNITION SYSTEM
1IRISA 2003 SPEAKER RECOGNITION SYSTEM
- 1sp DETECTION
- Limited Data
- M. BEN, G. GRAVIER, A. OZEROV, F. BIMBOT
- for the ELISA consortium
NIST Speaker Recognition Workshop, June 24-25, 2003
2Outline
- IRISA 2003 system
  - Introduction
  - Description
  - NIST03 SRE results
- Experiments
  - Front-end
  - Modeling
  - Score normalization
- Conclusions
3IRISA 2003 system
- IRISA is a member of the ELISA consortium
- IRISA 2003 system is based on a newly developed audio segmentation software: audioseg
Web links
- IRISA/METISS: http://www.irisa.fr/metiss/accueil.html
- ELISA consortium: http://elisa.ddl.ish-lyon.cnrs.fr
4IRISA 2003 system
- 20 ms frames every 10 ms
- 24-filter bank over 340-3400 Hz -> 16 LFCC
- RASTA filtering (secondary system)
- deltas + delta log-energy are added
- frame selection: bi-Gaussian modeling of the energy with ML classification of the frames (speech/silence)
- global feature normalization (zero mean, unit var.)
5IRISA 2003 system
- Description
  - background modeling
    - gender-dependent background models
    - 256-component GMMs with diagonal covariance matrices
    - prim. system: cellular data (NIST01)
    - second. system: cellular + landline data (NIST01)
  - speaker models
    - adapted from the background models with MAP estimation of the parameters (mean-only adaptation)
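The mean-only MAP adaptation listed on this slide can be sketched as follows. This is a minimal illustration, assuming the widely used relevance-factor formulation; the relevance factor `r = 16.0`, the function names and the data layout (lists of per-component weights, means and diagonal variances) are hypothetical choices and are not taken from the slides.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Component posteriors P(k | x) under the background GMM."""
    logs = [math.log(w) + gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(frames, weights, means, variances, r=16.0):
    """Return adapted means; weights and variances are left untouched."""
    n_comp, dim = len(means), len(means[0])
    n = [0.0] * n_comp                           # soft frame counts
    s = [[0.0] * dim for _ in range(n_comp)]     # first-order statistics
    for x in frames:
        for k, p in enumerate(posteriors(x, weights, means, variances)):
            n[k] += p
            for d in range(dim):
                s[k][d] += p * x[d]
    adapted = []
    for k in range(n_comp):
        alpha = n[k] / (n[k] + r)                # assumed relevance-factor weighting
        ex = [s[k][d] / n[k] for d in range(dim)] if n[k] > 0 else means[k]
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(ex, means[k])])
    return adapted
```

With little training data, `alpha` stays small and the speaker means remain close to the background means, which fits the limited-data condition of this evaluation.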
6IRISA 2003 system
- frame score
  - log-likelihood ratio using the 10 best-matching Gaussians in the background model
- utterance score
  - N_T: number of frames in the utterance
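The scoring described on this slide can be sketched as below. This is a hedged illustration, not the audioseg implementation: it assumes that the 10 best components are selected per frame in the background model, that the speaker likelihood is computed on the same component indices (which presumes the speaker model was adapted from the background model), and that the utterance score is the average of the frame scores over the N_T frames. All function names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def frame_llr(x, spk, bg, n_best=10):
    """Frame log-likelihood ratio; spk and bg are (weights, means, variances)."""
    bg_w, bg_m, bg_v = bg
    bg_terms = [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(bg_w, bg_m, bg_v)]
    # Indices of the best-matching background Gaussians for this frame.
    best = sorted(range(len(bg_terms)),
                  key=lambda k: bg_terms[k], reverse=True)[:n_best]
    spk_w, spk_m, spk_v = spk
    spk_terms = [math.log(spk_w[k]) + log_gauss(x, spk_m[k], spk_v[k])
                 for k in best]
    return log_sum_exp(spk_terms) - log_sum_exp([bg_terms[k] for k in best])

def utterance_score(frames, spk, bg, n_best=10):
    """Average frame score over the N_T frames of the utterance (assumed form)."""
    return sum(frame_llr(x, spk, bg, n_best) for x in frames) / len(frames)
```

Restricting the sums to a few components keeps the cost per frame close to one background-model evaluation plus `n_best` speaker Gaussians.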
7IRISA 2003 system
- Description
  - score normalization: DT-norm
- D-norm
  - D(spk): symmetric Kullback-Leibler distance between the speaker (spk) and the background models
- DT-norm
  - mean and standard deviation of the D-norm scores of the test utterance, using cohort impostor models (50 male, 50 female, from NIST01 SRE)
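A minimal sketch of how the two normalization steps could be chained. It assumes that D-norm divides the raw score by the symmetric Kullback-Leibler distance D(spk), and that DT-norm then standardizes the result with the mean and standard deviation of the D-normed cohort scores; the distances are taken as precomputed inputs, the use of the sample standard deviation is an arbitrary choice, and all names are hypothetical.

```python
import statistics

def d_norm(score, kl_distance):
    """Assumed D-norm: raw score scaled by the speaker/background KL distance."""
    return score / kl_distance

def dt_norm(target_score, target_kl, cohort_scores, cohort_kls):
    """T-norm applied to D-normed scores of the same test utterance."""
    target = d_norm(target_score, target_kl)
    cohort = [d_norm(s, d) for s, d in zip(cohort_scores, cohort_kls)]
    mu = statistics.mean(cohort)
    sigma = statistics.stdev(cohort)   # sample std; the slides do not specify
    return (target - mu) / sigma
```

In the setting of this slide, `cohort_scores` would hold the scores of the test utterance against the 100 cohort impostor models.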
8IRISA 2003 system
- NIST03 SRE results: 1sp-limited
- 2 systems submitted
  - IRI_1: primary
    - baseline system
  - IRI_2: secondary
    - RASTA front-end
    - mixed cell. + land. data for world models
[Figure: DET curves]

System   DCF min   DCF actual
IRI_1    0.3176    0.3205
IRI_2    0.3333    0.3396
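For readers unfamiliar with the two DCF columns, the sketch below shows how a minimum and an actual detection cost can be computed from trial scores. The cost parameters (`c_miss = 10`, `c_fa = 1`, `p_target = 0.01`) are an assumption based on values commonly used in NIST evaluations of that period; the slides state neither the parameters nor whether the reported values are normalized. Function names are hypothetical.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Miss and false-alarm rates when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return p_miss, p_fa

def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost function with assumed cost parameters."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def actual_dcf(targets, impostors, threshold, **costs):
    """Cost at the decision threshold actually used by the system."""
    return dcf(*error_rates(targets, impostors, threshold), **costs)

def min_dcf(targets, impostors, **costs):
    """Lowest cost over all candidate thresholds (a posteriori optimum)."""
    candidates = sorted(set(targets) | set(impostors))
    candidates.append(candidates[-1] + 1.0)   # reject-everything operating point
    return min(actual_dcf(targets, impostors, t, **costs) for t in candidates)
```

The gap between the two values reflects how well the decision threshold was set before seeing the evaluation data.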
9Experiments
- Front-end: frame selection
  - speech/silence classification based on a bi-Gaussian modeling of the frame energy
    - ML classification
    - or threshold-based selection: t = μ2 - c·σ2
      (c: constant coef. to optimise)
[Figure: frame energy distribution fitted with two Gaussians, G1(μ1, σ1) and G2(μ2, σ2)]
10Experiments
- Front-end: frame selection
  - speech/silence classification based on a bi-Gaussian modeling of the frame log-energy
    - ML classification
    - or threshold-based selection: t = μ2 - c·σ2
      (c: constant coef. to optimise)
[Figure: frame log-energy distribution fitted with two Gaussians, G1(μ1, σ1) and G2(μ2, σ2)]
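The two selection rules can be sketched as follows. This is an illustration under stated assumptions: the two Gaussians are fitted by a plain EM loop, G2 is taken to be the higher-mean (speech) Gaussian, ML classification compares the two unweighted densities, and the threshold is read as c standard deviations below the mean of G2. The initialization, the iteration count, the variance floor and all names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_bigaussian(values, n_iter=50):
    """EM for a 2-component 1-D GMM; returns [(w1, mu1, s1), (w2, mu2, s2)], mu1 <= mu2."""
    lo, hi = min(values), max(values)
    spread = (hi - lo) / 4 or 1.0
    params = [(0.5, lo + 0.25 * (hi - lo), spread),
              (0.5, lo + 0.75 * (hi - lo), spread)]
    for _ in range(n_iter):
        resp = []
        for x in values:                       # E-step: responsibilities
            p = [w * normal_pdf(x, mu, s) for w, mu, s in params]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        new = []
        for k in range(2):                     # M-step: weights, means, std devs
            nk = sum(r[k] for r in resp) or 1e-300
            mu = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, values)) / nk
            new.append((nk / len(values), mu, max(math.sqrt(var), 1e-3)))
        params = sorted(new, key=lambda comp: comp[1])
    return params

def select_ml(values, params):
    """Keep a frame when G2 (speech) is more likely than G1 (silence)."""
    (_, mu1, s1), (_, mu2, s2) = params
    return [normal_pdf(x, mu2, s2) > normal_pdf(x, mu1, s1) for x in values]

def select_threshold(values, params, c):
    """Keep a frame when its energy exceeds t = mu2 - c * sigma2."""
    _, mu2, s2 = params[1]
    t = mu2 - c * s2
    return [x > t for x in values]
```

The same functions apply to energy or log-energy values; only the input changes, which is the difference between the E and LogE variants compared in the experiments.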
11Experiments
- Front-end: frame selection
  - SYS_fs1: ML selection (E)
  - SYS_fs2: optimal threshold-based selection (E), c = 0.8
  - SYS_fs3: ML selection (LogE)
  - SYS_fs4: optimal threshold-based selection (LogE), c = 2.5
- energy (E) bi-Gaussian modeling with ML selection of the frames performs best
- drastic selection: about 50% of the frames are discarded!
NIST 03 SRE data
12Experiments
- Front-end: feature normalization
  - st-norm: short-term norm. (0 mean, unit var.) on a sliding window (3 sec.)
  - lt-norm: long-term norm. (0 mean, unit var.) on all features
- st-norm is applied before frame selection
- lt-norm can be applied before or after frame selection
- SYS_fn1: lt-norm -> frame selection
- SYS_fn2: st-norm -> frame selection
- SYS_fn3: frame selection -> lt-norm
NIST 02 SRE data (subset)
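The two normalizations can be sketched as below. This is a minimal illustration, assuming a centered sliding window of 300 frames (3 seconds at the 10 ms frame shift given earlier in the deck), truncated at the utterance edges; the window placement, the guard for zero variance and the function names are hypothetical.

```python
import math

def mean_std(frames):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dim)]
    return means, stds

def apply_norm(frame, means, stds):
    return [(x - m) / s for x, m, s in zip(frame, means, stds)]

def lt_norm(frames):
    """Long-term norm.: statistics computed once over all given frames."""
    means, stds = mean_std(frames)
    return [apply_norm(f, means, stds) for f in frames]

def st_norm(frames, window=300):
    """Short-term norm.: statistics recomputed on a window around each frame."""
    half = window // 2
    out = []
    for t, f in enumerate(frames):
        means, stds = mean_std(frames[max(0, t - half): t + half + 1])
        out.append(apply_norm(f, means, stds))
    return out
```

Applying `lt_norm` to the frames kept by frame selection corresponds to the "frame selection -> lt-norm" ordering, where the statistics are estimated on the retained frames only.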
13Experiments
- Front-end: feature normalization
  - SYS_fn5: frame selection -> lt-norm (baseline system, prim.)
  - SYS_fn6: st-norm -> frame selection -> lt-norm
- short-term normalization does not seem to work well (buggy?)
- long-term normalization at the end of the front-end seems to be crucial
- best results obtained with frame selection followed by long-term normalization of the remaining features
NIST 03 SRE data
14Experiments
- SYS_nbg1: 256-component GMMs (baseline)
- SYS_nbg2: 2048-component GMMs
- no performance gain with 2048 Gaussians in the mixture
- may be due to the frame selection process, which removes a large number of frames (?)
NIST 02 SRE data (subset)
15Experiments
- SYS_sn1: no score norm.
- SYS_sn2: T-norm
- SYS_sn3: DT-norm
- SYS_sn4: DZT-norm
- all score normalizations improve performance
- DT-norm seems to perform better than T-norm and DZT-norm at the minimum DCF point
NIST 02 SRE data (subset)
16Conclusions
- IRISA participation in NIST03 SRE
  - validation of the new toolkit audioseg
  - new baseline system performs well
  - frame selection is crucial for good performance
- Perspectives
  - work on feature transformations (PCA, ICA ...)
  - model adaptation on test data
  - hierarchical structural model adaptation