Title: Brno University of Technology Speech@FIT
1 Brno University of Technology Speech@FIT
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika,
Ondřej Glembek, Martin Karafiát, Marcel Kockmann,
Pavel Matějka, Petr Schwarz and Honza Černocký
NIST Speaker Recognition Workshop 2008
MOBIO
2 Outline
- Submitted systems
- Factor Analysis systems
- SVM-MLLR system
- Side information based calibration and fusion
- Contribution of subsystems in fusion
- Analysis of FA system
- Gender dependent vs. independent
- Flavors of FA system
- Sensitivity to number of eigenchannels
- Importance of ZT-norm
- Optimization for microphone data
- Techniques that did not make it to the submission
- Conclusion
3 Submitted systems
- BUT01 - primary (3 systems)
  - Channel and language side information in fusion
  - FA-MFCC13→39
  - FA-MFCC20→60
  - SVM-MLLR
- BUT02 - (3 systems)
  - The same as BUT01, but no side information in fusion
- BUT03 - (2 systems)
  - Channel and language side information in fusion
  - FA-MFCC13→39
  - FA-MFCC20→60
4 FA-MFCC13→39 system
- MAP adapted UBM with 2048 Gaussian components
- Single UBM trained on Switchboard and NIST 2004, 2005 data
- 12 MFCC + C0 (20 ms window, 10 ms shift)
- Short time Gaussianization
  - Rank of the current frame coefficient in a 3 s window transformed by the inverse Gaussian cumulative distribution function (see the sketch below)
- Delta, double delta and triple delta coefficients
  - Together 52 coefficients, 12 frames of context
- HLDA (dimensionality reduction from 52 to 39)
- Factor Analysis model, gender independent
  - 300 eigenvoices (Switchboards, NIST 2004, 2005)
  - 100 eigenchannels for telephone speech (NIST 2004, 2005 tel data)
  - 100 eigenchannels for microphone speech (NIST 2005 mic data)
- ZT-norm, gender dependent
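As an illustration of the short time Gaussianization step above, here is a minimal numpy/scipy sketch (not the BUT implementation): each coefficient is replaced by its rank within a sliding 3 s window and mapped through the inverse Gaussian CDF. The window length in frames and the edge handling are assumptions.

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(features, win=300):
    """Warp each feature dimension toward a standard normal distribution,
    using the rank of the current frame within a sliding window
    (win frames ~ 3 s at a 10 ms frame shift)."""
    T, D = features.shape
    half = win // 2
    out = np.empty_like(features, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        window = features[lo:hi]                      # local context
        # rank of the current frame among the window frames (per dimension)
        rank = (window < features[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / window.shape[0])     # inverse Gaussian CDF
    return out
```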
5 FA-MFCC20→60 system
- The same as FA-MFCC13→39, with the following differences
  - 60-dimensional features are 19 MFCC + Energy + deltas + double deltas (no HLDA)
  - Two gender dependent Factor Analysis models
6 SVM-MLLR system
- Linear kernels
- Rank normalization
- LIBSVM C library [Chang2001]
- Pre-computed Gram matrices (see the sketch below)
- Features are MLLR transformations adapting an LVCSR system (developed for the AMI project) to the speaker of a given speech segment
- Estimation of the MLLR transformations makes use of the ASR transcripts provided by NIST
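Below is a minimal numpy sketch of how a rank-normalized, linear-kernel Gram matrix over MLLR supervectors could be prepared. The function names and the background-set-based rank normalization variant are assumptions for illustration, not the BUT code; LIBSVM would then consume such a matrix through its precomputed-kernel mode.

```python
import numpy as np

def rank_normalize(vectors, background):
    """Map each supervector coefficient to its rank within a background set,
    scaled to [0, 1] (an assumed variant of rank normalization)."""
    # background: (B, D) background/impostor supervectors, vectors: (N, D)
    ranks = np.array([np.searchsorted(np.sort(background[:, d]), vectors[:, d])
                      for d in range(vectors.shape[1])]).T
    return ranks / float(background.shape[0])

def linear_gram(rows, cols=None):
    """Pre-computed linear-kernel Gram matrix K[i, j] = <rows_i, cols_j>."""
    if cols is None:
        cols = rows
    return rows @ cols.T
```

In this sketch, the per-segment (C)MLLR transforms would first be flattened and concatenated into a single supervector before rank normalization.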
7 SVM-MLLR system
- Cascade of CMLLR and MLLR
  - 2 CMLLR transformations (silence and speech)
  - 3 MLLR transformations (silence and 2 phoneme clusters)
- Silence transformations are discarded for SRE
- Supervector: 1 CMLLR + 2 MLLR transforms, 3 × 39² + 3 × 39 = 4680 coefficients
- Impostors: NIST 2004 + mic data from NIST 2005
- ZT-norm speakers from NIST 2004
8 Side info based calibration and fusion
- Side information for each trial is given by its hard assignment to classes
  - Trial channel condition provided by NIST: tel-tel, tel-mic, mic-tel, mic-mic
  - English/non-English decision given by our LID system
- Side information is used as follows
  - For each system
    - Split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately
    - Split trials according to the English/non-English decision and calibrate scores using LLR in each split separately
  - Fuse the calibrated scores of all subsystems using LLR, without making use of any side information
- For convenience, the FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR (a simplified sketch follows below).
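A minimal sketch of the side-information-based calibration and fusion strategy described above, using scikit-learn's logistic regression as a stand-in for FoCal's linear logistic regression (the original submission used FoCal Bilinear; the functions below and their interfaces are illustrative assumptions). In practice the regressions are trained on labeled development trials and then applied to evaluation trials.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_by_condition(scores, labels, conditions):
    """Per-condition calibration: fit a separate linear logistic regression
    (score -> calibrated linear score) for each side-information class."""
    calibrated = np.zeros_like(scores, dtype=float)
    for cond in np.unique(conditions):
        idx = conditions == cond
        lr = LogisticRegression()
        lr.fit(scores[idx, None], labels[idx])       # 1-D raw score as feature
        calibrated[idx] = lr.decision_function(scores[idx, None])
    return calibrated

def fuse(subsystem_scores, labels):
    """Fuse already-calibrated subsystem scores with one more LLR,
    without using any side information."""
    lr = LogisticRegression()
    lr.fit(subsystem_scores, labels)                 # shape (N, n_subsystems)
    return lr.decision_function(subsystem_scores)
```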
9 Side info based calibration and fusion: tel-tel trials
[Results figure: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); systems compared: no side information (BUT02), channel cond. + language by NIST, channel cond. + language by LID (BUT01 primary system)]
- Use of side information is helpful
- Some improvement can be obtained by relying on language information provided by NIST instead of the more realistic LID system
10 Side info based calibration and fusion: mic-mic trials
[Results figure: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); systems compared: no side information, channel cond. + language by NIST, channel cond. + language by LID]
- Use of side information allowed us to use the same unchanged subsystems for all the channels
11 Subsystems and fusion: tel-tel trials
[Results figure: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); systems: SVM-MLLR, FA-MFCC13→39, FA-MFCC20→60, fusion of 2 x FA, fusion of all 3]
- For tel-tel trials, the single FA-MFCC20→60 system performs almost as well as the fusion
12 Subsystems and fusion: mic-mic trials
[Results figure: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); systems: SVM-MLLR, FA-MFCC13→39, FA-MFCC20→60, fusion of 2 x FA, fusion of all 3]
- For microphone conditions, FA-MFCC20→60 fails to perform well and fusion is beneficial.
- The FA-MFCC13→39 system outperforms the FA-MFCC20→60 system, which has 3x more parameters and is possibly over-trained to the telephone data primarily used for FA model training.
13 Gender dependent vs. gender independent FA system
[Results figure: SRE 2006 tel-tel and SRE 2006 mic-mic; systems: FA-MFCC13→39 GI, FA-MFCC20→60 GD, FA-MFCC20→60 GI]
- Halving the number of parameters of the FA-MFCC20→60 system by making it gender independent degrades the performance on the telephone condition and improves it on the microphone condition
14 Flavors of FA
[FA model diagram: supervector M = UBM mean supervector + eigenvoices x speaker specific factors + eigenchannels x session specific factors + diagonal matrix x speaker specific factors; all hyperparameters can be trained from data using EM]
- Relevance MAP adaptation (see the sketch below)
  - M = m + dz with d² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal
- Eigenchannel adaptation (SDV, BUT)
  - Relevance MAP for enrolling the speaker model
  - Adapt the speaker model to the test utterance using eigenchannels estimated by PCA
- FA without eigenvoices, with d² = Σ/τ (QUT, LIA)
- FA without eigenvoices, with d trained from data (CRIM)
  - Can be seen as training a different τ for each supervector coefficient
  - Effective relevance factor τ_eff = trace(Σ)/trace(d²)
- FA with eigenvoices (CRIM)
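To make the relevance MAP flavor above concrete, here is a minimal numpy sketch of relevance MAP adaptation of the GMM mean supervector from sufficient statistics. It is the standard textbook formulation (equivalent to M = m + dz with d² = Σ/τ), not the BUT code, and the relevance factor value is an arbitrary example.

```python
import numpy as np

def relevance_map_means(ubm_means, zero_stats, first_stats, tau=19.0):
    """Relevance MAP adaptation of GMM component means.

    ubm_means:   (C, F) UBM means
    zero_stats:  (C,)   occupation counts N_c accumulated on enrollment data
    first_stats: (C, F) first-order statistics sum_t gamma_c(t) x_t
    tau:         relevance factor (example value)
    """
    alpha = (zero_stats / (zero_stats + tau))[:, None]   # data vs. prior weight
    ml_means = first_stats / np.maximum(zero_stats, 1e-10)[:, None]
    # interpolate between maximum-likelihood means and the UBM prior
    return alpha * ml_means + (1.0 - alpha) * ubm_means
```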
15 Flavors of FA
[Results figure: SRE 2006 (all trials, det1), MFCC13→39 and MFCC20→60 features; systems: eigenchannel adapt., FA with d² = Σ/τ, FA with d trained on data (τ_eff = 236.1 for MFCC13→39, τ_eff = 81.2 for MFCC20→60), FA with eigenvoices]
- Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA.
- FA with trained d fails for MFCC13→39 features. Too high τ_eff? Caused by HLDA?
- FA with eigenvoices significantly outperforms the other FA configurations.
16 Sensitivity of FA to number of eigenchannels
[Results figure: MFCC20→60 features]
- FA systems without eigenvoices seem unable to robustly estimate an increased number of eigenchannels
- However, we benefit significantly from more eigenchannels after explaining the speaker variability by eigenvoices
17 Importance of zt-norm
[Results figure: MFCC13→39 features]
- zt-norm is essential for good performance of FA with eigenvoices (see the sketch below).
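For reference, a minimal sketch of ZT-norm score normalization as it is commonly implemented (Z-norm of the trial score followed by T-norm whose cohort statistics are themselves Z-normalized). The function and its argument layout are illustrative assumptions, not the BUT implementation.

```python
import numpy as np

def zt_norm(raw, z_imp, t_imp, zt_imp):
    """ZT-norm of a single trial score.

    raw:    target model vs. test segment score
    z_imp:  (Nz,)    target model vs. Z-norm impostor segments
    t_imp:  (Nt,)    T-norm cohort models vs. test segment
    zt_imp: (Nt, Nz) T-norm cohort models vs. Z-norm impostor segments
    """
    # Z-norm: normalize by impostor-segment statistics of the target model
    s_z = (raw - z_imp.mean()) / z_imp.std()
    # Z-norm the cohort scores the same way, then apply T-norm on top
    t_z = (t_imp - zt_imp.mean(axis=1)) / zt_imp.std(axis=1)
    return (s_z - t_z.mean()) / t_z.std()
```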
18 Training eigenchannels for different channel conditions
FA-MFCC13→39
[Results figure: SRE2006 trials tel-tel (det1), tel-mic, mic-tel, mic-mic; SRE2008 trials tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1)]
- Negligible degradation on the tel-tel condition and a huge improvement, particularly on the mic-mic condition, is obtained after adding eigenchannels trained on microphone data to those trained on telephone data.
19 Training additional eigenchannels on SRE08 dev data
FA-MFCC13→39
- A significant improvement is obtained on the microphone condition for eval data after adding eigenchannels trained on SRE08 dev data (all spontaneous speech from the 6 interviewees).
[Results figure: SRE2008 trials tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1)]
20 Training additional eigenchannels on SRE08 dev data: primary fusion
[Results figure comparing the primary submission with + 20 EC trained on SRE08 dev data, over conditions det1 int-int, det2 int-int same mic, det3 int-int diff. mic, det4 int-phn, det5 phn-mic, det6 tel-tel, det7 tel-tel Eng., det8 tel-tel native]
- 20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems
- The improvement generalizes also to non-interview microphone data (det5)
21 Other techniques that did not make it to the submission
- The following techniques were also tried, but did not improve the fusion performance
  - GMM with eigenchannel adaptation
  - SVM-GMM 2048 NAP
  - SVM-GMM 2048 ISV (Inter Session Variability modeling)
  - SVM-GMM 2048 ISV derivative (Fisher) kernel
  - SVM-GMM 2048 ISV based on FA-MFCC13→39
  - FA modeling prosodic and cepstral contours
  - SVM on phonotactic counts from binary decision trees
  - SVM on soft bigram statistics collected on cumulated posteriograms (matrix of posterior probabilities of phonemes for each frame)
- See our poster for more details
22 Conclusions
- FA systems built according to the recipe from [Kenny2008] perform excellently, though there is still some mystery to be solved.
- It was hard to find another complementary system that would contribute to the fusion of our two FA systems.
- Although our system was primarily trained on and tuned for telephone data, the FA subsystems can simply be augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system perform well also on microphone conditions.
- Another significant improvement was obtained by training additional eigenchannels on data with a matching channel condition, even though there was a very limited amount of such data provided by NIST.
23 Thanks
- To Patrick Kenny for the [Kenny2008] recipe for building an FA system that really works, and for providing the list of files for training the FA system
- To MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for system development
- To Niko Brummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
24 References
- [Kenny2008] P. Kenny et al., "A Study of Inter-Speaker Variability in Speaker Verification", IEEE TASLP, July 2008.
- [Brummer2008] N. Brummer, "FoCal Bilinear: Tools for detector fusion and calibration, with use of side-information", http://niko.brummer.googlepages.com/focalbilinear
- [Chang2001] C. Chang et al., "LIBSVM: a library for Support Vector Machines", http://www.csie.ntu.edu.tw/~cjlin/libsvm
- [Stolcke2005/6] A. Stolcke, "MLLR Transforms as Features in SpkID", Eurospeech 2005, Odyssey 2006.
- [Hain2005] T. Hain et al., "The 2005 AMI system for RT05s", Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
26 Inter-session variability
[Figure: target speaker model and UBM in supervector space, with directions of high inter-speaker variability and high inter-session variability]
27 Inter-session compensation
[Figure: target speaker model, UBM and test data in supervector space, with directions of high inter-speaker and high inter-session variability]
For recognition, move both models along the high inter-session variability direction(s) to fit the test data well (see the formulas below)
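A hedged sketch of what "moving both models along the inter-session directions" means in formulas, following the eigenchannel adaptation described on the "Flavors of FA" slide; the notation U for the eigenchannel matrix is an assumption consistent with [Kenny2008].

```latex
% Shift the target-speaker supervector m_s and the UBM supervector m_ubm
% along the eigenchannel (inter-session) directions U to fit the test data X,
% then score with the likelihood ratio of the two shifted models.
\hat{x}_s = \arg\max_{x} \, p(X \mid m_s + Ux), \qquad
\hat{x}_{\mathrm{ubm}} = \arg\max_{x} \, p(X \mid m_{\mathrm{ubm}} + Ux)
\\[4pt]
\mathrm{score}(X) = \log p(X \mid m_s + U\hat{x}_s)
                  - \log p(X \mid m_{\mathrm{ubm}} + U\hat{x}_{\mathrm{ubm}})
```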
28 LVCSR for SVM-MLLR system
- The LVCSR system is adapted to the speaker (VTLN factor and (C)MLLR transformations are estimated) using ASR transcriptions provided by NIST
- AMI 2005(6) LVCSR: state-of-the-art system for the recognition of spontaneous speech [Hain2005]
  - 50k word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
  - PLP, HLDA
  - CD-HMM with 7500 tied states, each modeled by 18 Gaussians
  - Discriminatively trained using MPE
  - Adapted to speaker: VTLN, SAT based on CMLLR, MLLR