Title: Brno University of Technology Speech@FIT
1 Brno University of Technology Speech@FIT
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika,
Ondřej Glembek, Martin Karafiát, Marcel Kockmann,
Pavel Matějka, Petr Schwarz and Honza Černocký
NIST Speaker Recognition Workshop 2008
MOBIO
2 Outline
- Submitted systems
- Factor Analysis systems
- SVM-MLLR system
- Side information based calibration and fusion
- Contribution of subsystems in fusion
- Analysis of FA system
- Gender dependent vs. independent
- Flavors of FA system
- Sensitivity to number of eigenchannels
- Importance of ZT-norm
- Optimization for microphone data
- Techniques that did not make it to the submission
- Conclusion
3 Submitted systems
- BUT01 - primary (3 systems)
  - Channel and language side information in fusion
  - FA-MFCC13→39
  - FA-MFCC20→60
  - SVM-MLLR
- BUT02 - (3 systems)
  - The same as BUT01, but no side information in fusion
- BUT03 - (2 systems)
  - Channel and language side information in fusion
  - FA-MFCC13→39
  - FA-MFCC20→60
4 FA-MFCC13→39 system
- MAP adapted UBM with 2048 Gaussian components
- Single UBM trained on Switchboard and NIST 2004, 2005 data
- 12 MFCC + C0 (20 ms window, 10 ms shift)
- Short time Gaussianization
  - Rank of the current frame coefficient in a 3 s window transformed by the inverse Gaussian cumulative distribution function (see the sketch below)
- Delta, double delta and triple delta coefficients
  - Together 52 coefficients, 12 frames of context
- HLDA (dimensionality reduction from 52 to 39)
- Factor Analysis model, gender independent
  - 300 eigenvoices (Switchboards, NIST 2004, 2005)
  - 100 eigenchannels for telephone speech (NIST 2004, 2005 tel data)
  - 100 eigenchannels for microphone speech (NIST 2005 mic data)
- ZT-norm, gender dependent
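As an illustration of the short time Gaussianization step above, here is a minimal numpy/scipy sketch (not the BUT implementation): each coefficient is replaced by its rank within a sliding 3 s window and mapped through the inverse Gaussian CDF. The window length in frames and the edge handling are assumptions.

```python
import numpy as np
from scipy.stats import norm

def short_time_gaussianize(features, win=300):
    """Warp each feature dimension toward a standard normal distribution,
    using the rank of the current frame within a sliding window
    (win frames ~ 3 s at a 10 ms frame shift)."""
    T, D = features.shape
    half = win // 2
    out = np.empty_like(features, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        window = features[lo:hi]                      # local context
        # rank of the current frame among the window frames (per dimension)
        rank = (window < features[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / window.shape[0])     # inverse Gaussian CDF
    return out
```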
5 FA-MFCC20→60 system
- The same as FA-MFCC13→39, with the following differences
  - 60-dimensional features are 19 MFCC + Energy + deltas + double deltas (no HLDA)
  - Two gender dependent Factor Analysis models
6 SVM-MLLR system
- Linear kernels
- Rank normalization
- LIBSVM C library [Chang2001]
- Pre-computed Gram matrices (see the sketch below)
- Features are MLLR transformations adapting an LVCSR system (developed for the AMI project) to the speaker of a given speech segment
- Estimation of the MLLR transformations makes use of the ASR transcripts provided by NIST
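Below is a minimal numpy sketch of how a rank-normalized, linear-kernel Gram matrix over MLLR supervectors could be prepared. The function names and the background-set-based rank normalization variant are assumptions for illustration, not the BUT code; LIBSVM would then consume such a matrix through its precomputed-kernel mode.

```python
import numpy as np

def rank_normalize(vectors, background):
    """Map each supervector coefficient to its rank within a background set,
    scaled to [0, 1] (an assumed variant of rank normalization)."""
    # background: (B, D) background/impostor supervectors, vectors: (N, D)
    ranks = np.array([np.searchsorted(np.sort(background[:, d]), vectors[:, d])
                      for d in range(vectors.shape[1])]).T
    return ranks / float(background.shape[0])

def linear_gram(rows, cols=None):
    """Pre-computed linear-kernel Gram matrix K[i, j] = <rows_i, cols_j>."""
    if cols is None:
        cols = rows
    return rows @ cols.T
```

In this sketch, the per-segment (C)MLLR transforms would first be flattened and concatenated into a single supervector before rank normalization.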
7 SVM-MLLR system
- Cascade of CMLLR and MLLR
  - 2 CMLLR transformations (silence and speech)
  - 3 MLLR transformations (silence and 2 phoneme clusters)
- Silence transformations are discarded for SRE
- Supervector: 1 CMLLR + 2 MLLR transforms, 3 × 39² + 3 × 39 = 4680 coefficients
- Impostors: NIST 2004 + mic data from NIST 2005
- ZT-norm speakers from NIST 2004
8 Side info based calibration and fusion
- Side information for each trial is given by its hard assignment to classes
  - Trial channel condition provided by NIST: tel-tel, tel-mic, mic-tel, mic-mic
  - English/non-English decision given by our LID system
- Side information is used as follows
  - For each system
    - Split trials by channel condition and calibrate scores using linear logistic regression (LLR) in each split separately
    - Split trials according to the English/non-English decision and calibrate scores using LLR in each split separately
  - Fuse the calibrated scores of all subsystems using LLR, without making use of any side information
- For convenience, the FoCal Bilinear toolkit by Niko Brummer was used, although we did not make use of its extensions over standard LLR (a simplified sketch follows below).
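A minimal sketch of the side-information-based calibration and fusion strategy described above, using scikit-learn's logistic regression as a stand-in for FoCal's linear logistic regression (the original submission used FoCal Bilinear; the functions below and their interfaces are illustrative assumptions). In practice the regressions are trained on labeled development trials and then applied to evaluation trials.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_by_condition(scores, labels, conditions):
    """Per-condition calibration: fit a separate linear logistic regression
    (score -> calibrated linear score) for each side-information class."""
    calibrated = np.zeros_like(scores, dtype=float)
    for cond in np.unique(conditions):
        idx = conditions == cond
        lr = LogisticRegression()
        lr.fit(scores[idx, None], labels[idx])       # 1-D raw score as feature
        calibrated[idx] = lr.decision_function(scores[idx, None])
    return calibrated

def fuse(subsystem_scores, labels):
    """Fuse already-calibrated subsystem scores with one more LLR,
    without using any side information."""
    lr = LogisticRegression()
    lr.fit(subsystem_scores, labels)                 # shape (N, n_subsystems)
    return lr.decision_function(subsystem_scores)
```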
9 Side info based calibration and fusion: tel-tel trials
[Results figure: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); systems compared: no side information (BUT02), channel cond. + language by NIST, channel cond. + language by LID (BUT01 primary system)]
- Use of side information is helpful
- Some improvement can be obtained by relying on language information provided by NIST instead of the more realistic LID system
10 Side info based calibration and fusion: mic-mic trials
[Results figure: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); systems compared: no side information, channel cond. + language by NIST, channel cond. + language by LID]
- Use of side information allowed us to use the same unchanged subsystems for all the channels
11 Subsystems and fusion: tel-tel trials
[Results figure: SRE 2006 (all trials, det1) and SRE 2008 (all trials, det6); systems: SVM-MLLR, FA-MFCC13→39, FA-MFCC20→60, fusion of 2 x FA, fusion of all 3]
- For tel-tel trials, the single FA-MFCC20→60 system performs almost as well as the fusion
12 Subsystems and fusion: mic-mic trials
[Results figure: SRE 2006 (trial list defined by MIT-LL) and SRE 2008 (det1); systems: SVM-MLLR, FA-MFCC13→39, FA-MFCC20→60, fusion of 2 x FA, fusion of all 3]
- For microphone conditions, FA-MFCC20→60 fails to perform well and fusion is beneficial.
- The FA-MFCC13→39 system outperforms the FA-MFCC20→60 system, which has 3x more parameters and is possibly over-trained to the telephone data primarily used for FA model training.
13 Gender dependent vs. gender independent FA system
[Results figure: SRE 2006 tel-tel and SRE 2006 mic-mic; systems: FA-MFCC13→39 GI, FA-MFCC20→60 GD, FA-MFCC20→60 GI]
- Halving the number of parameters of the FA-MFCC20→60 system by making it gender independent degrades the performance on the telephone condition and improves it on the microphone condition
14 Flavors of FA
[FA model diagram: supervector M = UBM mean supervector + eigenvoices x speaker specific factors + eigenchannels x session specific factors + diagonal matrix x speaker specific factors; all hyperparameters can be trained from data using EM]
- Relevance MAP adaptation (see the sketch below)
  - M = m + dz with d² = Σ/τ, where Σ is the matrix with the UBM variance supervector on its diagonal
- Eigenchannel adaptation (SDV, BUT)
  - Relevance MAP for enrolling the speaker model
  - Adapt the speaker model to the test utterance using eigenchannels estimated by PCA
- FA without eigenvoices, with d² = Σ/τ (QUT, LIA)
- FA without eigenvoices, with d trained from data (CRIM)
  - Can be seen as training a different τ for each supervector coefficient
  - Effective relevance factor τ_eff = trace(Σ)/trace(d²)
- FA with eigenvoices (CRIM)
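To make the relevance MAP flavor above concrete, here is a minimal numpy sketch of relevance MAP adaptation of the GMM mean supervector from sufficient statistics. It is the standard textbook formulation (equivalent to M = m + dz with d² = Σ/τ), not the BUT code, and the relevance factor value is an arbitrary example.

```python
import numpy as np

def relevance_map_means(ubm_means, zero_stats, first_stats, tau=19.0):
    """Relevance MAP adaptation of GMM component means.

    ubm_means:   (C, F) UBM means
    zero_stats:  (C,)   occupation counts N_c accumulated on enrollment data
    first_stats: (C, F) first-order statistics sum_t gamma_c(t) x_t
    tau:         relevance factor (example value)
    """
    alpha = (zero_stats / (zero_stats + tau))[:, None]   # data vs. prior weight
    ml_means = first_stats / np.maximum(zero_stats, 1e-10)[:, None]
    # interpolate between maximum-likelihood means and the UBM prior
    return alpha * ml_means + (1.0 - alpha) * ubm_means
```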
15 Flavors of FA
[Results figure: SRE 2006 (all trials, det1), MFCC13→39 and MFCC20→60 features; systems: eigenchannel adapt., FA with d² = Σ/τ, FA with d trained on data (τ_eff = 236.1 for MFCC13→39, τ_eff = 81.2 for MFCC20→60), FA with eigenvoices]
- Without eigenvoices, simple eigenchannel adaptation seems to be more robust than FA.
- FA with trained d fails for MFCC13→39 features. Too high τ_eff? Caused by HLDA?
- FA with eigenvoices significantly outperforms the other FA configurations.
16 Sensitivity of FA to number of eigenchannels
[Results figure: MFCC20→60 features]
- FA systems without eigenvoices seem unable to robustly estimate an increased number of eigenchannels
- However, we benefit significantly from more eigenchannels after explaining the speaker variability by eigenvoices
17 Importance of zt-norm
[Results figure: MFCC13→39 features]
- zt-norm is essential for good performance of FA with eigenvoices (see the sketch below).
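For reference, a minimal sketch of ZT-norm score normalization as it is commonly implemented (Z-norm of the trial score followed by T-norm whose cohort statistics are themselves Z-normalized). The function and its argument layout are illustrative assumptions, not the BUT implementation.

```python
import numpy as np

def zt_norm(raw, z_imp, t_imp, zt_imp):
    """ZT-norm of a single trial score.

    raw:    target model vs. test segment score
    z_imp:  (Nz,)    target model vs. Z-norm impostor segments
    t_imp:  (Nt,)    T-norm cohort models vs. test segment
    zt_imp: (Nt, Nz) T-norm cohort models vs. Z-norm impostor segments
    """
    # Z-norm: normalize by impostor-segment statistics of the target model
    s_z = (raw - z_imp.mean()) / z_imp.std()
    # Z-norm the cohort scores the same way, then apply T-norm on top
    t_z = (t_imp - zt_imp.mean(axis=1)) / zt_imp.std(axis=1)
    return (s_z - t_z.mean()) / t_z.std()
```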
18 Training eigenchannels for different channel conditions
FA-MFCC13→39
[Results figure: SRE2006 trials tel-tel (det1), tel-mic, mic-tel, mic-mic; SRE2008 trials tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1)]
- Negligible degradation on the tel-tel condition and a huge improvement, particularly on the mic-mic condition, is obtained after adding eigenchannels trained on microphone data to those trained on telephone data.
19 Training additional eigenchannels on SRE08 dev data
FA-MFCC13→39
- A significant improvement is obtained on the microphone condition for eval data after adding eigenchannels trained on SRE08 dev data (all spontaneous speech from the 6 interviewees).
[Results figure: SRE2008 trials tel-tel (det6), tel-mic (det5), mic-tel (det4), mic-mic (det1)]
20 Training additional eigenchannels on SRE08 dev data: primary fusion
[Results figure comparing the primary submission with + 20 EC trained on SRE08 dev data, over conditions det1 int-int, det2 int-int same mic, det3 int-int diff. mic, det4 int-phn, det5 phn-mic, det6 tel-tel, det7 tel-tel Eng., det8 tel-tel native]
- 20 eigenchannels trained on SRE08 dev data are added to the two FA subsystems
- The improvement generalizes also to non-interview microphone data (det5)
21 Other techniques that did not make it to the submission
- The following techniques were also tried, but did not improve the fusion performance
  - GMM with eigenchannel adaptation
  - SVM-GMM 2048 NAP
  - SVM-GMM 2048 ISV (Inter Session Variability modeling)
  - SVM-GMM 2048 ISV derivative (Fisher) kernel
  - SVM-GMM 2048 ISV based on FA-MFCC13→39
  - FA modeling prosodic and cepstral contours
  - SVM on phonotactic counts from binary decision trees
  - SVM on soft bigram statistics collected on cumulated posteriograms (matrix of posterior probabilities of phonemes for each frame)
- See our poster for more details
22 Conclusions
- FA systems built according to the recipe from [Kenny2008] perform excellently, though there is still some mystery to be solved.
- It was hard to find another complementary system that would contribute to the fusion of our two FA systems.
- Although our system was primarily trained on and tuned for telephone data, the FA subsystems can simply be augmented with eigenchannels trained on microphone data (as also proposed in [Kenny2008]), which makes the system perform well also on microphone conditions.
- Another significant improvement was obtained by training additional eigenchannels on data with a matching channel condition, even though there was a very limited amount of such data provided by NIST.
23 Thanks
- To Patrick Kenny for the [Kenny2008] recipe for building an FA system that really works, and for providing the list of files for training the FA system
- To MIT-LL for creating and sharing the trial lists based on SRE06 data, which we used for system development
- To Niko Brummer for FoCal Bilinear, which allowed us to start playing with the fusion just the last day before the submission deadline
24 References
- [Kenny2008] P. Kenny et al., "A Study of Inter-Speaker Variability in Speaker Verification", IEEE TASLP, July 2008.
- [Brummer2008] N. Brummer, "FoCal Bilinear: Tools for detector fusion and calibration, with use of side-information", http://niko.brummer.googlepages.com/focalbilinear
- [Chang2001] C. Chang et al., "LIBSVM: a library for Support Vector Machines", http://www.csie.ntu.edu.tw/~cjlin/libsvm
- [Stolcke2005/6] A. Stolcke, "MLLR Transforms as Features in SpkID", Eurospeech 2005, Odyssey 2006.
- [Hain2005] T. Hain et al., "The 2005 AMI system for RT05s", Meeting Recognition Evaluation Workshop, Edinburgh, July 2005.
26 Inter-session variability
[Figure: target speaker model and UBM in supervector space, with directions of high inter-speaker variability and high inter-session variability]
27 Inter-session compensation
[Figure: target speaker model, UBM and test data in supervector space, with directions of high inter-speaker and high inter-session variability]
For recognition, move both models along the high inter-session variability direction(s) to fit the test data well (see the formulas below)
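A hedged sketch of what "moving both models along the inter-session directions" means in formulas, following the eigenchannel adaptation described on the "Flavors of FA" slide; the notation U for the eigenchannel matrix is an assumption consistent with [Kenny2008].

```latex
% Shift the target-speaker supervector m_s and the UBM supervector m_ubm
% along the eigenchannel (inter-session) directions U to fit the test data X,
% then score with the likelihood ratio of the two shifted models.
\hat{x}_s = \arg\max_{x} \, p(X \mid m_s + Ux), \qquad
\hat{x}_{\mathrm{ubm}} = \arg\max_{x} \, p(X \mid m_{\mathrm{ubm}} + Ux)
\\[4pt]
\mathrm{score}(X) = \log p(X \mid m_s + U\hat{x}_s)
                  - \log p(X \mid m_{\mathrm{ubm}} + U\hat{x}_{\mathrm{ubm}})
```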
28 LVCSR for SVM-MLLR system
- The LVCSR system is adapted to the speaker (VTLN factor and (C)MLLR transformations are estimated) using ASR transcriptions provided by NIST
- AMI 2005(6) LVCSR: state-of-the-art system for the recognition of spontaneous speech [Hain2005]
  - 50k word dictionary (pronunciations of OOVs were generated by grapheme-to-phoneme conversion based on rules trained from data)
  - PLP, HLDA
  - CD-HMM with 7500 tied states, each modeled by 18 Gaussians
  - Discriminatively trained using MPE
  - Adapted to speaker: VTLN, SAT based on CMLLR, MLLR