Title: Automatic Cluster complexity and Quantity Selection: Towards Robust Speaker Diarization
1 Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization
- Xavier Anguera, Chuck Wooters and Javier Hernando
MLMI Workshop Washington DC, May 2006
2 Outline
- Speaker Diarization for meetings
- Problems and questions to answer
- Proposed solutions
- Number of initial clusters
- Complexity selection
- No speaker turn average duration constraints
- Experiments and results
- Conclusions
3 Speaker Diarization for Meetings
- Answers the question "Who spoke when?" in recordings made in a meeting environment.
- No prior knowledge of the number of speakers or their identities is allowed.
[Figure: system block diagram. Channels collapse, speech/non-speech filter, agglomerative clustering with BIC, diarization output.]
4 Problems
- In any agglomerative clustering implementation:
- An uninformed guess is made about the initial number of clusters and the initial model complexity.
- After successive merges, model complexity doesn't match the changes in each cluster's data.
- The average speaker turn length is artificially restricted in the cluster models.
5 Problems
- Robustness problem: system parameters are determined without accounting for the characteristics of individual meetings.
- Need for automatic ways to set system parameters according to the data.
6 Specific questions addressed
- In this paper we propose answers to:
- How many initial clusters should we create to start the agglomerative clustering?
- How do we model a speaker, i.e. how many Gaussian mixtures do we use?
- How do we restrict how long a speaker turn is on average?
7 Cluster Complexity Ratio (CCR)
- New system parameter that replaces some of the previous ones.
- It is defined as
- [equation not preserved in the slide text]
- More closely tied to the data: the amount of data necessary to optimally train every single mixture.
- Increases robustness to variation in meeting duration.
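The defining equation did not survive on this slide. One reading that agrees with the description above and with the unit used later in the talk (seconds per Gaussian mixture) is a ratio of training data to model size. This is an assumption inferred from the slides, not a quotation of the authors' formula; N_i (amount of data assigned to cluster i) and M_i (number of Gaussians in its model) are the symbols used on the later slides.

```latex
\mathrm{CCR} \;=\; \frac{N_i}{M_i}
\qquad \text{(amount of data per Gaussian mixture)}
```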
8 Answer 1
- How many initial clusters do we create to start
the agglomerative clustering?
9 Number of initial clusters selection
- In agglomerative clustering we need to define a correct number of initial clusters:
- Too many initial clusters make the system slow.
- Too few create extra errors.
- In the past, Kinit = 10 or 16 for meetings and Kinit = 40 for Broadcast News, fixed for all recordings.
10 Number of initial clusters selection
- We propose a per-meeting algorithm to select the number of initial clusters.
- It depends on the meeting length (Ntotal frames), the mixtures per cluster (GMclus) and the introduced CCR parameter.
- This method is quick and meeting-dependent.
- For GMclus = 5 and CCR = 10: Kinit = 12-15 clusters in Meetings and Kinit = 40-50 in Broadcast News.
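A minimal sketch of how such a per-meeting selection could be computed, assuming the relation Kinit = Ntotal / (GMclus x CCR). That relation is inferred from the three quantities named on this slide and reproduces the quoted ranges; it is not copied from the authors. The function name, the rounding, and the example durations are hypothetical, and Ntotal and CCR must be given in the same unit (frames or seconds).

```python
def initial_cluster_count(n_total, gm_clus, ccr):
    """Assumed rule: K_init = N_total / (GMclus * CCR), rounded, at least 1.

    n_total : amount of speech data in the recording
    gm_clus : Gaussian mixtures per initial cluster
    ccr     : data per Gaussian, in the same unit as n_total
    """
    if n_total <= 0 or gm_clus <= 0 or ccr <= 0:
        raise ValueError("all arguments must be positive")
    return max(1, round(n_total / (gm_clus * ccr)))


# Hypothetical durations: 600 s of speech for a meeting,
# 2000 s for a Broadcast News show, with GMclus = 5 and CCR = 10 s.
meeting_k = initial_cluster_count(600, 5, 10)
broadcast_k = initial_cluster_count(2000, 5, 10)
print(meeting_k, broadcast_k)
```

Under these assumptions a longer recording automatically receives more initial clusters, which is the meeting-dependent behaviour the slide describes.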
11 Answer 2
- How do we model a speaker, i.e. how many Gaussian mixtures do we use?
12 Cluster complexity selection
- Good cluster modeling using GMMs is crucial to obtain optimum results in speaker diarization.
- Within agglomerative clustering, a speaker model is used:
- In Viterbi segmentation.
- In cluster comparisons with ΔBIC.
- Fixed-complexity models cause problems:
- Too many Gaussians → data overfitting → system under-clustering.
- Too few Gaussians → data underfitting → system over-clustering.
13 Cluster complexity selection
- We present an occupancy-driven approach based on the introduced CCR.
- The model complexity (Mi) depends on the model occupancy (Ni frames) and the CCR parameter.
- Each model is adapted after every Viterbi segmentation.
- The new models are modified depending on the complexity:
- If M_i^n = M_i^{n-1} → nothing is done
- If M_i^n > M_i^{n-1} → Gaussian splitting
- If M_i^n < M_i^{n-1} → train from scratch
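The update rule above can be sketched as follows. The mapping Mi = round(Ni / CCR) is an assumption consistent with the statement that Mi depends on the occupancy Ni and on the CCR; the slide text does not preserve the exact formula. The function names, the returned labels, and the example numbers are hypothetical.

```python
def target_gaussians(n_frames, ccr_frames):
    """Assumed rule: M_i = round(N_i / CCR), with at least one Gaussian."""
    if n_frames < 0 or ccr_frames <= 0:
        raise ValueError("n_frames must be >= 0 and ccr_frames > 0")
    return max(1, round(n_frames / ccr_frames))


def update_action(m_prev, m_new):
    """Pick the model update after a Viterbi segmentation pass.

    m_prev : number of Gaussians in the previous iteration
    m_new  : number of Gaussians required by the new occupancy
    """
    if m_new == m_prev:
        return "keep"      # same complexity: nothing is done
    if m_new > m_prev:
        return "split"     # more data: grow the model by Gaussian splitting
    return "retrain"       # less data: train a smaller model from scratch


# Hypothetical example: a cluster holding 4000 frames with CCR = 800 frames
# per Gaussian needs 5 Gaussians; a model that had 3 would be split.
m_new = target_gaussians(4000, 800)
print(m_new, update_action(3, m_new))
```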
14 Answer 3
- How do we restrict how long a speaker turn is on average?
15 Speaker turn duration modeling
- The speakers' acoustic modeling is done using an ergodic HMM where each state corresponds to one cluster.
[Figure: ergodic HMM topology. Between the speaker turn start and the speaker turn end, each of Cluster 1, Cluster 2, ..., Cluster N is entered with probability 1/N.]
16 Speaker turn duration modeling
- Within each cluster model a minimum duration (MinD) is set using multiple states sharing the same GMM.
- This allows more stable diarization outputs, avoiding unwanted speaker changes.
- The average speaker turn duration (AveD) is how long a speaker remains in one cluster model.
17 Speaker turn duration modeling
- The AveD is determined by α, β, the acoustics and MinD.
- Previously we used α = 0.9 and β = 0.1, causing AveD ≈ MinD for all meetings → robustness problem in meetings with long speaker turns (e.g. lectures).
- We remove any a priori constraints and the α/β parameters by setting α = 1 and β = 1.
- The AveD is therefore determined only by the acoustics of each individual meeting.
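A rough illustration of why the earlier setting kept turns short. It assumes that α weights the self-loop on the last state of a cluster and β the exit from it, and that with 0.9 and 0.1 they act as probabilities; this interpretation and both function names are assumptions, and the acoustic scores, which also shape the real AveD, are ignored here.

```python
import math


def expected_extra_frames(alpha):
    """Mean number of frames spent beyond MinD when the last state loops
    with probability alpha (mean of a geometric distribution)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1) to act as a probability")
    return alpha / (1.0 - alpha)


def transition_log_weights(alpha, beta):
    """Log-domain scores added by the stay and leave transitions."""
    return math.log(alpha), math.log(beta)


# Old setting: the transition prior alone adds only about 9 frames to MinD.
print(expected_extra_frames(0.9))

# New setting: both weights contribute a score of zero, so staying or
# leaving is decided by the acoustic likelihoods alone.
print(transition_log_weights(1.0, 1.0))
```

Under this reading, setting both weights to 1 removes the duration prior instead of re-tuning it for each meeting type.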
18 Experiments
- Experiments use the RT04s (eval and devel, 16 meetings) and RT05s (eval, 10 meetings, conference room type only) data sets.
- From each meeting only the SDM channel is used.
- The references are created using forced alignment (with the ICSI-SRI STT system) between the reference transcriptions for the IHM channels and the IHM audio.
19 Experiments
- CCR is set to 8 seconds per Gaussian mixture according to the development data.
- The baseline system is similar to the RT05s evaluation system.
Scores in non-overlapped DER
20 Conclusions
- Initial clusters and complexity selection:
- By using the CCR we obtain meeting-specific, data-driven parameters.
- Substitute two artificial parameters (Mi and Kinit) with CCR-derived parameters.
- Both new methods are fast and provide improved accuracy.
- Unconstrained average speaker turn length:
- Increased robustness to different speaking styles.
- Eliminates the α/β tuning parameters.
- The speaker turns are entirely acoustically driven.
21 Questions