1
Brno University of Technology, Speech@FIT
Lukáš Burget, Michal Fapšo, Valiantsina Hubeika,
Ondřej Glembek, Martin Karafiát, Marcel Kockmann,
Pavel Matějka, Petr Schwarz and Honza Černocký
NIST Speaker Recognition Workshop 2008
MOBIO
2
Outline
  • Submitted systems
  • Factor Analysis systems
  • SVM-MLLR system
  • Side information based calibration and fusion
  • Contribution of subsystems in fusion
  • Analysis of FA system
  • Gender dependent vs. independent
  • Flavors of FA system
  • Sensitivity to number of eigenchannels
  • Importance of ZT-norm
  • Optimization for microphone data
  • Techniques that did not make it to the submission
  • Conclusion

3
Submitted systems
  • BUT01 - primary (3 systems)
  • Channel and language side information in fusion
  • FA-MFCC13→39
  • FA-MFCC20→60
  • SVM-MLLR
  • BUT02 - (3 systems)
  • The same as BUT01, but no side information in
    fusion
  • BUT03 - (2 systems)
  • Channel and language side information in fusion
  • FA-MFCC13→39
  • FA-MFCC20→60

4
FA-MFCC13→39 system
  • MAP-adapted UBM with 2048 Gaussian components
  • Single UBM trained on Switchboard and NIST 2004,
    2005 data
  • 12 MFCCs + C0 (20 ms window, 10 ms shift)
  • Short-time Gaussianization
  • Rank of the current frame coefficient in a 3 s
    window, transformed by the inverse Gaussian
    cumulative distribution function
  • Delta + double-delta + triple-delta coefficients
  • Together 52 coefficients, 12 frames of context
  • HLDA (dimensionality reduction from 52 to 39)
  • Factor Analysis model: gender independent
  • 300 eigenvoices (Switchboard, NIST 2004, 2005)
  • 100 eigenchannels for telephone speech (NIST
    2004, 2005 tel data)
  • 100 eigenchannels for microphone speech (NIST
    2005 mic data)
  • ZT-norm: gender dependent
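
The rank-based short-time Gaussianization above can be sketched as follows. This is a minimal illustration, not the BUT implementation: for each frame, each coefficient's rank within the surrounding 3 s window (300 frames at a 10 ms shift) is mapped through the inverse Gaussian CDF.

```python
import numpy as np
from scipy.stats import norm


def short_time_gaussianize(feats, win=300):
    """Warp each cepstral coefficient toward a standard normal.

    feats: (T, D) feature matrix; win=300 frames ~ 3 s at a 10 ms shift.
    Each frame's coefficient is replaced by the inverse Gaussian CDF of
    its rank within the surrounding window."""
    T, D = feats.shape
    half = win // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half)
        window = feats[lo:hi]                       # (W, D)
        # rank of the current frame among window frames, per dimension;
        # +0.5 keeps the rank strictly inside (0, W)
        rank = (window < feats[t]).sum(axis=0) + 0.5
        out[t] = norm.ppf(rank / window.shape[0])   # inverse Gaussian CDF
    return out
```

The +0.5 offset avoids ranks of exactly 0 or W, which would map to infinite values.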

5
FA-MFCC20→60 system
  • The same as FA-MFCC13→39 with the following
    differences:
  • 60-dimensional features: 19 MFCCs + energy +
    deltas + double deltas (no HLDA)
  • Two gender-dependent Factor Analysis models

6
SVM MLLR system
  • Linear kernels
  • Rank normalization
  • LIBSVM library [Chang2001]
  • Pre-computed Gram matrices
  • Features are MLLR transformations adapting an
    LVCSR system (developed for the AMI project) to
    the speaker of a given speech segment
  • Estimation of MLLR transformations makes use of
    the ASR transcripts provided by NIST
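
The first four bullets can be sketched as below. This is a simplified illustration with synthetic data standing in for MLLR supervectors: rank-normalize each coefficient against a background set, then precompute a linear-kernel Gram matrix of the kind LIBSVM accepts as a precomputed kernel.

```python
import numpy as np


def rank_normalize(X, background):
    """Replace each coefficient by its rank within a background set,
    scaled to [0, 1]. X: (N, D), background: (B, D)."""
    B = background.shape[0]
    out = np.empty_like(X, dtype=float)
    for d in range(X.shape[1]):
        # searchsorted gives each value's rank among sorted background values
        out[:, d] = np.searchsorted(np.sort(background[:, d]), X[:, d]) / B
    return out


def linear_gram(X, Y):
    """Precomputed linear-kernel Gram matrix (what LIBSVM's
    precomputed-kernel mode consumes)."""
    return X @ Y.T


# synthetic stand-ins for MLLR supervectors of background/train/test segments
rng = np.random.default_rng(1)
bg = rng.normal(size=(200, 50))
train = rank_normalize(rng.normal(size=(10, 50)), bg)
test = rank_normalize(rng.normal(size=(5, 50)), bg)
K = linear_gram(test, train)    # (5, 10) kernel matrix, one row per test segment
```

Rank normalization makes the linear kernel insensitive to per-dimension scale, which matters when supervector coefficients have very different dynamic ranges.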

7
SVM MLLR system
  • Cascade of CMLLR and MLLR
  • 2 CMLLR transformations (silence and speech)
  • 3 MLLR transformations (silence and 2 phoneme
    clusters)
  • Silence transformations are discarded for SRE
  • Supervector: 1 CMLLR + 2 MLLR transforms,
    3 × 39 × 40 = 4680 coefficients
  • Impostors: NIST 2004 + mic data from NIST 2005
  • ZT-norm: speakers from NIST 2004

8
Side info based calibration and fusion
  • Side information for each trial is given by its
    hard assignment to classes
  • Trial channel condition provided by NIST:
    tel-tel, tel-mic, mic-tel, mic-mic
  • English/non-English decision given by our LID
    system
  • Side information is used as follows
  • For each system:
  • Split trials by channel condition and calibrate
    scores using linear logistic regression (LLR) in
    each split separately
  • Split trials according to the English/non-English
    decision and calibrate scores using LLR in each
    split separately
  • Fuse the calibrated scores of all subsystems
    using LLR without making use of any side
    information
  • For convenience, the FoCal Bilinear toolkit by
    Niko Brummer was used, although we did not make
    use of its extensions over standard LLR
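
The per-condition calibration step can be sketched as follows. This is a minimal sketch, not FoCal itself: a plain linear logistic regression (a·score + b) is fit separately on each side-information split, so condition-dependent score offsets are absorbed before fusion.

```python
import numpy as np


def train_llr(scores, labels, n_iter=500, lr=0.1):
    """Fit a, b so sigmoid(a*score + b) matches target/non-target labels:
    plain linear logistic regression via gradient descent."""
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        g = p - labels                       # gradient of the log-loss
        a -= lr * np.mean(g * scores)
        b -= lr * np.mean(g)
    return a, b


def calibrate_by_condition(scores, labels, conditions):
    """Split trials by a hard side-info label (e.g. tel-tel vs mic-mic)
    and calibrate each split separately."""
    out = np.empty_like(scores, dtype=float)
    for c in np.unique(conditions):
        m = conditions == c
        a, b = train_llr(scores[m], labels[m])
        out[m] = a * scores[m] + b           # calibrated score for this split
    return out
```

After this step the calibrated scores of all subsystems can be fused with one more LLR that ignores side information, as described above.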

9
Side info based calibration and fusion: tel-tel
trials
[DET plots: SRE 2006 (all trials, det1) and SRE
2008 (all trials, det6), comparing no side
information (BUT02), channel cond. + lang. by NIST,
and channel cond. + lang. by LID (BUT01, primary
system)]
  • Use of side information is helpful
  • Some improvement can be obtained by relying on
    the language information provided by NIST instead
    of the more realistic LID system

10
Side info based calibration and fusion: mic-mic
trials
[DET plots: SRE 2006 (trial list defined by MIT-LL)
and SRE 2008 (det1), comparing no side information,
channel cond. + lang. by NIST, and channel cond. +
lang. by LID]
  • Use of side information allowed us to use the
    same unchanged subsystems for all channels

11
Subsystems and fusion: tel-tel trials
[DET plots: SRE 2006 (all trials, det1) and SRE
2008 (all trials, det6), comparing SVM-MLLR,
FA-MFCC13→39, FA-MFCC20→60, fusion of the 2 FA
systems, and fusion of all 3]
  • For tel-tel trials, the single FA-MFCC20→60
    system performs almost as well as the fusion

12
Subsystems and fusion: mic-mic trials
[DET plots: SRE 2006 (trial list defined by MIT-LL)
and SRE 2008 (det1), comparing SVM-MLLR,
FA-MFCC13→39, FA-MFCC20→60, fusion of the 2 FA
systems, and fusion of all 3]
  • For microphone conditions, FA-MFCC20→60 fails to
    perform well and fusion is beneficial
  • The FA-MFCC13→39 system outperforms the
    FA-MFCC20→60 system, which has 3× more parameters
    and is possibly over-trained to the telephone
    data primarily used for FA model training

13
Gender dependent vs. gender independent FA system
[DET plots: SRE 2006 tel-tel and SRE 2006 mic-mic,
comparing FA-MFCC13→39 GI, FA-MFCC20→60 GD and
FA-MFCC20→60 GI]
  • Halving the number of parameters of the
    FA-MFCC20→60 system by making it gender
    independent degrades performance on the telephone
    condition and improves it on the microphone
    condition

14
Flavors of FA
  • M = m + Vy + Dz + Ux
  • where m is the UBM mean supervector, V holds the
    eigenvoices, D is a diagonal matrix, U holds the
    eigenchannels, y and z are speaker-specific
    factors, and x are session-specific factors
  • All hyperparameters can be trained from data
    using EM
  • Relevance MAP adaptation
  • M = m + Dz with D² = Σ/τ
  • where Σ is the matrix with the UBM variance
    supervector on its diagonal
  • Eigenchannel adaptation (SDV, BUT)
  • Relevance MAP for enrolling the speaker model
  • Adapt the speaker model to the test utterance
    using eigenchannels estimated by PCA
  • FA without eigenvoices, with D² = Σ/τ (QUT, LIA)
  • FA without eigenvoices, with D trained from data
    (CRIM)
  • Can be seen as training a different τ for each
    supervector coefficient
  • Effective relevance factor
    τ_eff = trace(Σ) / trace(D²)
  • FA with eigenvoices (CRIM)
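
The relevance MAP special case (M = m + Dz with D² = Σ/τ) and the effective relevance factor can be sketched as below. This is an illustrative sketch of the standard formulas, with supervectors stored as flat NumPy arrays.

```python
import numpy as np


def relevance_map(m_ubm, sigma, stats_n, stats_f, tau=19.0):
    """Classical relevance MAP of a GMM mean supervector.

    m_ubm:   (C*F,) UBM mean supervector
    sigma:   (C*F,) UBM variance supervector (diagonal of Sigma)
    stats_n: (C,)   zero-order Baum-Welch statistics per component
    stats_f: (C*F,) first-order Baum-Welch statistics
    With D^2 = Sigma/tau this is the FA model M = m + Dz without
    eigenvoices or eigenchannels."""
    C = stats_n.shape[0]
    F = m_ubm.shape[0] // C
    # per-component interpolation weight n / (n + tau), repeated per dim
    alpha = np.repeat(stats_n / (stats_n + tau), F)
    mean_obs = stats_f / np.repeat(np.maximum(stats_n, 1e-8), F)
    return alpha * mean_obs + (1 - alpha) * m_ubm


def effective_relevance_factor(sigma, d):
    """tau_eff = trace(Sigma) / trace(D^2) for a trained diagonal D."""
    return sigma.sum() / (d ** 2).sum()
```

When D is fixed at sqrt(Σ/τ), the effective relevance factor recovers τ exactly; a trained D makes it a data-driven, per-coefficient quantity, which is what the τ_eff values on the next slide summarize.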

15
Flavors of FA
[DET plots: SRE 2006 (all trials, det1), for
MFCC13→39 features (τ_eff = 236.1) and MFCC20→60
features (τ_eff = 81.2), comparing eigenchannel
adaptation, FA with D² = Σ/τ, FA with D trained on
data, and FA with eigenvoices]
  • Without eigenvoices, simple eigenchannel
    adaptation seems to be more robust than FA
  • FA with trained D fails for MFCC13→39 features.
    Too high τ_eff? Caused by HLDA?
  • FA with eigenvoices significantly outperforms
    the other FA configurations

16
Sensitivity of FA to number of eigenchannels
[Plot: MFCC20→60 features, performance vs. number
of eigenchannels]
  • FA systems without eigenvoices seem unable to
    robustly estimate an increased number of
    eigenchannels
  • However, we benefit significantly from more
    eigenchannels once the speaker variability is
    explained by the eigenvoices

17
Importance of ZT-norm
[DET plot: MFCC13→39 features, with and without
ZT-norm]
  • ZT-norm is essential for good performance of FA
    with eigenvoices
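
ZT-norm can be sketched as below. This is a generic illustration of the technique, not the BUT code; the assumed shapes are: the target model's scores on M impostor segments, K cohort models' scores on the test segment, and a K×M matrix of cohort-vs-impostor scores.

```python
import numpy as np


def z_norm(score, model_vs_imp):
    """Z-norm: normalize by the target model's score distribution
    on a set of impostor segments."""
    return (score - model_vs_imp.mean()) / model_vs_imp.std()


def t_norm(score, cohort_vs_test):
    """T-norm: normalize by cohort models' scores on the same
    test segment."""
    return (score - cohort_vs_test.mean()) / cohort_vs_test.std()


def zt_norm(score, model_vs_imp, cohort_vs_test, cohort_vs_imp):
    """ZT-norm: Z-norm the trial score, then T-norm it against the
    Z-normed cohort scores for the same test segment.

    cohort_vs_imp: (K, M) scores of K cohort models on M impostor
    segments, used to Z-norm each cohort score."""
    z = z_norm(score, model_vs_imp)
    z_cohort = ((cohort_vs_test - cohort_vs_imp.mean(axis=1))
                / cohort_vs_imp.std(axis=1))
    return t_norm(z, z_cohort)
```

Intuitively, Z-norm removes model-dependent score offsets and T-norm removes segment-dependent ones; FA with eigenvoices apparently needs both to produce well-behaved scores.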

18
Training eigenchannels for different channel
conditions
FA-MFCC13→39
[Results for SRE 2006 trials: tel-tel (det1),
tel-mic, mic-tel, mic-mic; and SRE 2008 trials:
tel-tel (det6), tel-mic (det5), mic-tel (det4),
mic-mic (det1)]
  • A negligible degradation on the tel-tel condition
    and a huge improvement, particularly on the
    mic-mic condition, is obtained after adding
    eigenchannels trained on microphone data to those
    trained on telephone data

19
Training additional eigenchannels on SRE08 dev
data
FA-MFCC13→39
  • A significant improvement is obtained on the
    microphone condition for eval data after adding
    eigenchannels trained on SRE08 dev data (all
    spontaneous speech from the 6 interviewees)

[Results for SRE 2008 trials: tel-tel (det6),
tel-mic (det5), mic-tel (det4), mic-mic (det1)]
20
Training additional eigenchannels on SRE08 dev
data: primary fusion
[Results per condition: det1 int-int, det2 int-int
same mic, det3 int-int diff. mic, det4 int-phn,
det5 phn-mic, det6 tel-tel, det7 tel-tel Eng.,
det8 tel-tel native; primary submission vs. 20 EC
trained on SRE08 dev data]
  • 20 eigenchannels trained on SRE08 dev data are
    added to the two FA subsystems
  • The improvement generalizes also to
    non-interview microphone data (det5)

21
Other techniques that did not make it to the
submission
  • The following techniques were also tried but did
    not improve the fusion performance:
  • GMM with eigenchannel adaptation
  • SVM-GMM 2048 + NAP
  • SVM-GMM 2048 + ISV (Inter-Session Variability
    modeling)
  • SVM-GMM 2048 + ISV + derivative (Fisher) kernel
  • SVM-GMM 2048 + ISV based on FA-MFCC13→39
  • FA modeling of prosodic and cepstral contours
  • SVM on phonotactic counts from binary decision
    trees
  • SVM on soft bigram statistics collected on
    accumulated posteriograms (matrix of posterior
    probabilities of phonemes for each frame)
  • See our poster for more details
  • See our poster for more details

22
Conclusions
  • FA systems built according to the recipe from
    [Kenny2008] perform excellently, though there is
    still some mystery to be solved
  • It was hard to find another complementary system
    that would contribute to the fusion of our two FA
    systems
  • Although our system was primarily trained on and
    tuned for telephone data, the FA subsystems can
    simply be augmented with eigenchannels trained on
    microphone data (as also proposed in
    [Kenny2008]), which makes the system perform
    well also on microphone conditions
  • Another significant improvement was obtained by
    training additional eigenchannels on data with
    matching channel conditions, even though there
    was a very limited amount of such data provided
    by NIST

23
Thanks
  • To Patrick Kenny for the [Kenny2008] recipe for
    building an FA system that really works, and for
    providing the list of files for training the FA
    system
  • To MIT-LL for creating and sharing the trial
    lists based on SRE06 data, which we used for
    system development
  • To Niko Brummer for FoCal Bilinear, which allowed
    us to start playing with the fusion on the last
    day before the submission deadline

24
References
  • [Kenny2008] P. Kenny et al., "A Study of
    Inter-Speaker Variability in Speaker
    Verification," IEEE TASLP, July 2008.
  • [Brummer2008] N. Brummer, "FoCal Bilinear: Tools
    for detector fusion and calibration, with use of
    side-information,"
    http://niko.brummer.googlepages.com/focalbilinear
  • [Chang2001] C. Chang et al., "LIBSVM: a library
    for Support Vector Machines,"
    http://www.csie.ntu.edu.tw/~cjlin/libsvm
  • [Stolcke2005/6] A. Stolcke, "MLLR Transforms as
    Features in SpkID," Eurospeech 2005, Odyssey
    2006.
  • [Hain2005] T. Hain et al., "The 2005 AMI system
    for RT05s," Meeting Recognition Evaluation
    Workshop, Edinburgh, July 2005.

25
26
Inter-session variability
[Diagram: target speaker model and UBM in
supervector space, with directions of high
inter-speaker and high inter-session variability]
NIST SRE 2008
27
Inter-session compensation
[Diagram: target speaker model, UBM and test data
in supervector space, with directions of high
inter-speaker and high inter-session variability]
For recognition, move both models along the high
inter-session variability direction(s) to fit the
test data well
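
Moving a model along the inter-session directions to fit the test data can be sketched as below. This is a generic point-estimate sketch under standard FA assumptions (diagonal UBM covariance, MAP estimate of the channel factor), not the BUT implementation.

```python
import numpy as np


def eigenchannel_shift(m, U, sigma, stats_n_rep, stats_f):
    """Estimate the channel factor x for a test utterance and shift the
    model supervector along the eigenchannel directions U.

    m:           (D,)   model (or UBM) mean supervector
    U:           (D, R) eigenchannel matrix
    sigma:       (D,)   UBM variance supervector
    stats_n_rep: (D,)   zero-order stats, repeated per feature dimension
    stats_f:     (D,)   first-order stats
    Returns the compensated supervector m + U x, with x the MAP point
    estimate x = (I + U' N Sigma^-1 U)^-1 U' Sigma^-1 (f - N m)."""
    prec = stats_n_rep / sigma
    A = np.eye(U.shape[1]) + U.T @ (prec[:, None] * U)
    b = U.T @ ((stats_f - stats_n_rep * m) / sigma)
    x = np.linalg.solve(A, b)
    return m + U @ x
```

With plenty of data the estimated shift approaches the true session offset, which is exactly the "move along the high inter-session variability direction" picture above.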
28
LVCSR for SVM-MLLR system
  • The LVCSR system is adapted to the speaker (the
    VTLN factor and (C)MLLR transformations are
    estimated) using ASR transcriptions provided by
    NIST
  • AMI 2005(6) LVCSR: state-of-the-art system for
    the recognition of spontaneous speech [Hain2005]
  • 50k-word dictionary (pronunciations of OOVs were
    generated by grapheme-to-phoneme conversion based
    on rules trained from data)
  • PLP features, HLDA
  • CD-HMM with 7500 tied states, each modeled by 18
    Gaussians
  • Discriminatively trained using MPE
  • Adapted to speaker: VTLN, SAT based on CMLLR,
    MLLR
