Automatic Cluster complexity and Quantity Selection: Towards Robust Speaker Diarization

1
Automatic Cluster Complexity and Quantity
Selection: Towards Robust Speaker Diarization

Xavier Anguera, Chuck Wooters and Javier Hernando

MLMI Workshop Washington DC, May 2006
2
Outline

Speaker Diarization for meetings
Problems and questions to answer
Proposed solutions
Number of initial clusters
Complexity selection
No speaker turn average duration constraints
Experiments and results
Conclusions

3
Speaker Diarization for Meetings

Answers the question "Who spoke when?" in
recordings made in a meeting environment.
No prior knowledge of the number of speakers or
their identities is allowed.

[Figure: system block diagram with the blocks channels collapse, speech/non-speech filter, agglomerative clustering with BIC, and diarization output]
4
Problems

In any agglomerative clustering implementation:
An uninformed guess is made on the initial
number of clusters and the initial model
complexity.
After successive merges, model complexity does not
track the changes in each cluster's data.
Speaker average turn length is artificially
restricted in the cluster models.

5
Problems

Robustness problem: system parameters are
determined without accounting for the
characteristics of individual meetings.
Need for automatic ways to set system parameters
according to the data.

6
Specific questions addressed

In this paper we propose answers to:
How many initial clusters should we create to start
the agglomerative clustering?
How do we model a speaker: how many Gaussian
mixtures do we use?
How do we restrict how long (on average) a speaker
turn is?

7
Cluster Complexity Ratio (CCR)

New system parameter that replaces some of the
previous ones.
It is defined as [equation not preserved in this transcript]
More tied to the data: the amount of data necessary
to optimally train every single mixture.
Increases robustness to meeting duration
variation.
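
One reading of the CCR that is consistent with the later slides (CCR given in seconds per Gaussian mixture, and the model complexity Mi depending on the occupancy Ni and the CCR) is sketched below. This is an assumption, not the authors' stated definition; the rounding operator in particular is hypothetical.

```latex
% Assumed reading: CCR is the amount of data assigned to a cluster
% per Gaussian mixture in that cluster's model.
\mathrm{CCR} \approx \frac{N_i}{M_i}
\qquad\Longleftrightarrow\qquad
M_i \approx \operatorname{round}\!\left(\frac{N_i}{\mathrm{CCR}}\right)
```

Here N_i is the occupancy of cluster i and M_i its number of Gaussian mixtures; N_i and CCR must be expressed in the same unit (both in frames or both in seconds).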

8
Answer 1

How many initial clusters do we create to start
the agglomerative clustering?

9
Number of initial clusters selection

In agglomerative clustering we need to define a
suitable number of initial clusters.
Too many initial clusters make the system
slow.
Too few create extra errors.
In the past: Kinit = 10 or 16 for meetings and
Kinit = 40 for Broadcast News, fixed for all
recordings.

10
Number of initial clusters selection

We propose a per-meeting algorithm to select the
number of initial clusters.
It depends on the meeting length (Ntotal frames),
the mixtures per cluster (GMclus) and the
introduced CCR parameter.
This method is quick and meeting-dependent.
For GMclus = 5 and CCR = 10, Kinit is 12-15
clusters in meetings and 40-50 in Broadcast News.
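
The selection formula itself is not preserved in this transcript. A minimal sketch, assuming the relation Kinit = Ntotal / (GMclus x CCR) with Ntotal and CCR in the same unit, might look as follows. The function name, the rounding, the 100 frames-per-second rate, and reading "CCR = 10" as 10 seconds are all assumptions made for illustration.

```python
def initial_cluster_count(total_frames: int, gaussians_per_cluster: int,
                          ccr_frames: float) -> int:
    """Assumed relation: K_init = N_total / (GM_clus * CCR), rounded, at least 1.

    total_frames and ccr_frames must use the same unit (frames here).
    """
    if total_frames <= 0 or gaussians_per_cluster <= 0 or ccr_frames <= 0:
        raise ValueError("all arguments must be positive")
    return max(1, round(total_frames / (gaussians_per_cluster * ccr_frames)))


FRAMES_PER_SECOND = 100  # assumption: 10 ms frame step

# A 10-minute meeting, 5 Gaussians per cluster, CCR read as 10 seconds.
ten_minute_meeting = 10 * 60 * FRAMES_PER_SECOND
ccr_ten_seconds = 10 * FRAMES_PER_SECOND
print(initial_cluster_count(ten_minute_meeting, 5, ccr_ten_seconds))  # prints 12
```

Under these assumptions a 10-minute recording gives 12 initial clusters, which falls in the 12-15 range quoted for meetings.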

11
Answer 2

How do we model a speaker, how many Gaussian
mixtures we use?

12
Cluster complexity selection

Good cluster modeling using GMMs is crucial to
obtain optimum results in speaker diarization.
Within agglomerative clustering, a speaker model
is used:
In Viterbi segmentation.
In cluster comparisons with ΔBIC.
Fixed complexity models cause problems:
Too many Gaussians → data overfitting → system
under-clustering.
Too few Gaussians → data underfitting → system
over-clustering.

13
Cluster complexity selection

We present an occupancy-driven approach based on
the introduced CCR,
where the model complexity (M_i) depends on the
model occupancy (N_i frames) and the CCR
parameter.
Each model is adapted after every Viterbi segmentation.
The new models are modified depending on the
complexity at iteration n:
If M_i^n = M_i^(n-1) -> nothing is done
If M_i^n > M_i^(n-1) -> Gaussian splitting
If M_i^n < M_i^(n-1) -> train from scratch
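
The three-way rule above can be sketched as follows. The complexity formula (occupancy divided by CCR, rounded, at least one Gaussian) is an assumption, since the slide's equation is not preserved; the function names and the returned action labels are hypothetical.

```python
def target_complexity(occupancy_frames: int, ccr_frames: float) -> int:
    """Assumed relation: M_i = round(N_i / CCR), with at least one Gaussian."""
    if ccr_frames <= 0:
        raise ValueError("ccr_frames must be positive")
    return max(1, round(occupancy_frames / ccr_frames))


def model_update_action(previous_m: int, new_m: int) -> str:
    """Map the change in complexity to the action named on the slide."""
    if new_m == previous_m:
        return "keep"      # nothing is done
    if new_m > previous_m:
        return "split"     # grow the model by Gaussian splitting
    return "retrain"       # fewer Gaussians needed: train from scratch


# Hypothetical example: a cluster that held 4000 frames now holds 6400,
# with CCR = 800 frames per Gaussian.
old_m = target_complexity(4000, 800)
new_m = target_complexity(6400, 800)
print(old_m, new_m, model_update_action(old_m, new_m))
```

The decision is recomputed for every cluster after each Viterbi segmentation, so a cluster that gains data grows its model and one that loses data is rebuilt with fewer Gaussians.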

14
Answer 3

How to restrict how long (average) a speaker turn
is?

15
Speaker turn duration modeling

The speakers' acoustic modeling is done using an
ergodic HMM where each state corresponds to one
cluster.

[Figure: ergodic HMM with states Cluster 1, Cluster 2, ..., Cluster N between "Speaker turn start" and "Speaker turn end"; transitions labeled 1/N]
16
Speaker turn duration modeling

Within each cluster model a minimum duration
(MinD) is set using multiple states sharing the
same GMM.
This allows more stable diarization outputs,
avoiding unwanted speaker changes.
The average speaker turn duration (AveD) is how
long a speaker remains in one cluster model.
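
A minimal sketch of such a minimum-duration topology is given below. It assumes each cluster is a left-to-right chain of MinD states tied to one GMM, that only the last state can loop on itself, and that a turn end can lead to the first state of any cluster (including the same one). These details and the function name are assumptions, not taken from the slides.

```python
def min_duration_arcs(num_clusters: int, min_d: int) -> set:
    """Return allowed transitions between states (cluster, position).

    All min_d states of a cluster would share that cluster's GMM;
    only the topology is built here.
    """
    arcs = set()
    for cluster in range(num_clusters):
        # Forced advance through the chain enforces the minimum duration.
        for pos in range(min_d - 1):
            arcs.add(((cluster, pos), (cluster, pos + 1)))
        last = (cluster, min_d - 1)
        arcs.add((last, last))  # stay in the same speaker turn
        for nxt in range(num_clusters):
            arcs.add((last, (nxt, 0)))  # end of turn, start of a new one
    return arcs


arcs = min_duration_arcs(num_clusters=2, min_d=3)
print(len(arcs))  # prints 10
```

Because a path must traverse all MinD states before it can leave a cluster, no decoded speaker turn can be shorter than MinD frames.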

17
Speaker turn duration modeling

The AveD is determined by α, β, the acoustics and
MinD.
Previously we used α = 0.9 and β = 0.1, causing AveD
≈ MinD for all meetings → robustness problem in
meetings with long speaker turns (e.g. lectures).
We remove any a priori constraints and the α/β
parameters by setting α = 1 and β = 1.
The AveD is therefore determined only by the
acoustics of each individual meeting.
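
To illustrate why the earlier setting pulls AveD toward MinD, the sketch below looks only at the transition prior and ignores the acoustics. It assumes α is the self-loop weight and β the exit weight of the last state in a MinD-state chain; that reading, the function names, and the MinD value of 250 frames are assumptions for illustration.

```python
import math


def transition_log_scores(alpha: float, beta: float) -> tuple:
    """Log-domain cost added per frame for staying (alpha) or leaving (beta)."""
    return math.log(alpha), math.log(beta)


def prior_expected_duration(min_d: int, alpha: float) -> float:
    """Expected turn length in frames under the prior alone.

    MinD - 1 forced frames, plus a geometric stay in the last state
    with self-loop probability alpha (requires 0 <= alpha < 1).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1) for a proper prior")
    return (min_d - 1) + 1.0 / (1.0 - alpha)


# Old setting: the prior alone adds only about 9 frames beyond MinD.
print(prior_expected_duration(250, 0.9))
print(transition_log_scores(0.9, 0.1))

# New setting: both log scores are zero, so transitions add no cost
# and only the acoustic likelihoods decide when a turn ends.
print(transition_log_scores(1.0, 1.0))
```

With α = 1 and β = 1 the weights no longer form a probability distribution, which is why the expected-duration formula does not apply to the new setting: there is no prior left to compute an expectation from.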

18
Experiments

Experiments use the RT04s (eval and devel, 16
meetings) and RT05s (eval, 10 meetings, only
conference room type) data sets.
From each meeting only the SDM channel is used.
The references are created using forced alignment
(with the ICSI-SRI STT system) between the reference
transcriptions for the IHM channels and the IHM
audio.

19
Experiments

CCR is set to 8 seconds/Gaussian mixture
according to the development data.
The baseline system is similar to the RT05s
evaluation system.

Scores in non-overlapped DER
20
Conclusions

Initial clusters and complexity selection:
By using the CCR we obtain meeting-specific,
data-driven parameters.
Substitutes two artificial parameters (Mi and
Kinit) with CCR-derived parameters.
Both new methods are fast and provide improved
accuracy.
Unconstrained average speaker turn length:
Increased robustness to different speaking
styles.
Eliminates the α/β tuning parameters.
The speaker turns are entirely acoustically driven.

21
Questions
