Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization

1
Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization
  • Xavier Anguera, Chuck Wooters and Javier Hernando

MLMI Workshop, Washington DC, May 2006
2
Outline
  • Speaker diarization for meetings
  • Problems and questions to answer
  • Proposed solutions
      • Number of initial clusters
      • Complexity selection
      • No constraints on average speaker turn duration
  • Experiments and results
  • Conclusions

3
Speaker Diarization for Meetings
  • Answers the question "Who spoke when?" in
    recordings made in a meeting environment.
  • No prior knowledge of the number of speakers or
    their identities is allowed.

[Figure: system block diagram with the stages channel collapse, speech/non-speech filter, agglomerative clustering with BIC, and diarization output]
4
Problems
  • In any agglomerative clustering implementation:
      • An uninformed guess is made about the initial
        number of clusters and the initial model
        complexity.
      • After successive merges, model complexity does
        not track the changes in each cluster's data.
      • Average speaker turn length is artificially
        restricted in the cluster models.

5
Problems
  • Robustness problem: system parameters are
    determined without accounting for the
    characteristics of individual meetings.
  • Automatic ways are needed to set system
    parameters according to the data.

6
Specific questions addressed
  • In this paper we propose answers to:
      • How many initial clusters do we create to start
        the agglomerative clustering?
      • How do we model a speaker: how many Gaussian
        mixtures do we use?
      • How do we restrict how long a speaker turn is
        on average?

7
Cluster Complexity Ratio (CCR)
  • A new system parameter that replaces some of the
    previous ones.
  • It is defined as the amount of data per Gaussian
    mixture in a cluster model.
  • It is tied more closely to the data: it reflects
    the amount of data necessary to optimally train
    every single mixture.
  • Increases robustness to variation in meeting
    duration.
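
One reading of the CCR that is consistent with the rest of the deck (a model complexity Mi driven by the occupancy Ni, a number of initial clusters Kinit driven by Ntotal and GMclus, and a CCR quoted in seconds per Gaussian) is a data-per-Gaussian ratio. The relations below are a reconstruction under that assumption, not the formulas shown on the original slides; the rounding operator and the choice of frames as the unit are assumed.

```latex
\begin{align}
\mathrm{CCR} &\approx \frac{N_i}{M_i}
  && \text{(frames of data per Gaussian in cluster } i\text{)} \\
M_i &= \operatorname{round}\!\left(\frac{N_i}{\mathrm{CCR}}\right)
  && \text{(complexity of cluster } i\text{)} \\
K_{init} &= \frac{N_{total}}{GM_{clus}\cdot\mathrm{CCR}}
  && \text{(number of initial clusters)}
\end{align}
```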

8
Answer 1
  • How many initial clusters do we create to start
    the agglomerative clustering?

9
Number of initial clusters selection
  • In agglomerative clustering we need to define a
    suitable number of initial clusters:
      • Too many initial clusters make the system slow.
      • Too few create extra errors.
  • In the past, Kinit = 10 or 16 for meetings and
    Kinit = 40 for Broadcast News, fixed for all
    recordings.

10
Number of initial clusters selection
  • We propose a per-meeting algorithm to select the
    number of initial clusters.
  • It depends on the meeting length (Ntotal frames),
    the mixtures per cluster (GMclus) and the newly
    introduced CCR parameter.
  • This method is quick and meeting-dependent.
  • For GMclus = 5 and CCR = 10, Kinit = 12-15
    clusters in meetings and Kinit = 40-50 in
    Broadcast News.
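
A minimal sketch of how such a per-meeting selection could be computed, assuming Kinit = Ntotal / (GMclus · CCR) with the CCR expressed in frames per Gaussian. The formula itself does not appear in this transcript; the function name, the rounding to the nearest integer, the floor of one cluster, and the rate of 100 frames per second are assumptions made for illustration.

```python
def initial_cluster_count(n_total_frames, gm_clus, ccr_frames):
    """Pick the number of initial clusters for one recording.

    Assumed relation: K_init = N_total / (GM_clus * CCR), rounded to the
    nearest integer and never below 1.
    """
    if n_total_frames < 0 or gm_clus <= 0 or ccr_frames <= 0:
        raise ValueError("frame count must be >= 0; gm_clus and ccr_frames must be > 0")
    k_init = round(n_total_frames / (gm_clus * ccr_frames))
    return max(1, k_init)


FRAMES_PER_SECOND = 100  # assumption: 10 ms frame step

# A 12-minute meeting with GM_clus = 5 and CCR = 10 s per Gaussian.
meeting_frames = 12 * 60 * FRAMES_PER_SECOND
k_meeting = initial_cluster_count(meeting_frames, gm_clus=5,
                                  ccr_frames=10 * FRAMES_PER_SECOND)
print(k_meeting)
```

Under these assumed values the 12-minute example yields 14 clusters, inside the 12-15 range quoted for meetings.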

11
Answer 2
  • How do we model a speaker: how many Gaussian
    mixtures do we use?

12
Cluster complexity selection
  • Good cluster modeling with GMMs is crucial to
    obtaining optimal results in speaker diarization.
  • Within agglomerative clustering, a speaker model
    is used:
      • In Viterbi segmentation.
      • In cluster comparisons with ΔBIC.
  • Fixed-complexity models cause problems:
      • Too many Gaussians → data overfitting → system
        under-clustering.
      • Too few Gaussians → data underfitting → system
        over-clustering.

13
Cluster complexity selection
  • We present an occupancy-driven approach based on
    the introduced CCR.
  • The model complexity ($M_i$) depends on the
    model occupancy ($N_i$ frames) and the CCR
    parameter.
  • Each model is adapted after every Viterbi
    segmentation.
  • The new models are modified depending on how the
    complexity at iteration $n$ compares with the
    previous one:
      • If $M_i^{n} = M_i^{n-1}$ → nothing is done.
      • If $M_i^{n} > M_i^{n-1}$ → Gaussian splitting.
      • If $M_i^{n} < M_i^{n-1}$ → train from scratch.
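
The update rule above can be sketched as follows, assuming the complexity is obtained as $M_i$ = round($N_i$ / CCR) with a floor of one Gaussian. That formula, the function names, and the returned action labels are illustrative assumptions; the slides only state that $M_i$ depends on $N_i$ and the CCR.

```python
def target_complexity(n_frames, ccr_frames):
    """Number of Gaussians for a cluster holding n_frames frames (assumed rule)."""
    if ccr_frames <= 0:
        raise ValueError("ccr_frames must be positive")
    return max(1, round(n_frames / ccr_frames))


def complexity_action(m_prev, n_frames, ccr_frames):
    """Return (new complexity, action) after one Viterbi segmentation pass."""
    m_new = target_complexity(n_frames, ccr_frames)
    if m_new == m_prev:
        return m_new, "keep"     # same complexity: nothing is done
    if m_new > m_prev:
        return m_new, "split"    # cluster gained data: add Gaussians by splitting
    return m_new, "retrain"      # cluster lost data: train a smaller model from scratch


CCR_FRAMES = 800  # 8 s per Gaussian at an assumed 100 frames per second

print(complexity_action(5, 4100, CCR_FRAMES))
print(complexity_action(5, 6500, CCR_FRAMES))
print(complexity_action(5, 2300, CCR_FRAMES))
```

The three calls exercise each branch: a cluster that keeps 5 Gaussians, one that grows to 8 by splitting, and one that shrinks to 3 and is retrained.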

14
Answer 3
  • How do we restrict how long a speaker turn is on
    average?

15
Speaker turn duration modeling
  • The speakers' acoustic modeling is done using an
    ergodic HMM in which each state corresponds to
    one cluster.

[Figure: ergodic HMM over Cluster 1, Cluster 2, ..., Cluster N, with transitions labeled 1/N between "Speaker turn end" and "Speaker turn start"]
16
Speaker turn duration modeling
  • Within each cluster model a minimum duration
    (MinD) is set using multiple states sharing the
    same GMM.
  • This gives more stable diarization output by
    avoiding unwanted speaker changes.
  • The average speaker turn duration (AveD) is how
    long a speaker remains in one cluster model.

17
Speaker turn duration modeling
  • The AveD is determined by α, β, the acoustics and
    MinD.
  • Previously we used α = 0.9 and β = 0.1, causing
    AveD ≈ MinD for all meetings → robustness problem
    in meetings with long speaker turns (e.g.
    lectures).
  • We remove any a priori constraints and the α/β
    parameters by setting α = 1 and β = 1.
  • The AveD is therefore determined only by the
    acoustics of each individual meeting.
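
To see why α = 0.9 keeps AveD close to MinD, it helps to look at the transition weights alone and ignore the acoustics. The sketch below assumes a topology that the slides do not spell out: MinD chained states per cluster, where the last state loops on itself with weight α and exits with weight β. The function names and the 250-frame MinD are illustrative.

```python
import math


def prior_mean_turn_frames(min_d_frames, alpha):
    """Mean turn length implied by the transition weights alone.

    Assumes alpha is a probability (alpha + beta = 1): the first
    min_d_frames - 1 states are each visited once, and the time spent in
    the last state is geometric with mean 1 / (1 - alpha).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1) for a finite mean")
    return (min_d_frames - 1) + 1.0 / (1.0 - alpha)


def transition_log_weights(alpha, beta):
    """Log-weights added to the Viterbi score for a self-loop and for an exit."""
    return math.log(alpha), math.log(beta)


MIN_D = 250  # illustrative: 2.5 s at an assumed 100 frames per second

print(prior_mean_turn_frames(MIN_D, alpha=0.9))
print(transition_log_weights(0.9, 0.1))
print(transition_log_weights(1.0, 1.0))
```

With α = 0.9 the weights alone imply a mean of about 259 frames, barely above MinD. With α = β = 1 both log-weights are zero, so the transition terms drop out of the Viterbi score and only the acoustic likelihoods decide when a turn ends.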

18
Experiments
  • Experiments use the RT04s (eval and devel, 16
    meetings) and RT05s (eval, 10 meetings, only
    conference room type) data sets.
  • From each meeting only the SDM (single distant
    microphone) channel is used.
  • The references are created by forced alignment
    (using the ICSI-SRI STT system) between the
    reference transcriptions for the IHM (individual
    head microphone) channels and the IHM audio.

19
Experiments
  • CCR is set to 8 seconds per Gaussian mixture,
    based on the development data.
  • The baseline system is similar to the RT05s
    evaluation system.

[Results table: scores in non-overlapped DER (diarization error rate)]
20
Conclusions
  • Initial clusters and complexity selection:
      • By using the CCR we obtain meeting-specific,
        data-driven parameters.
      • Two artificial parameters (Mi and Kinit) are
        replaced with CCR-derived parameters.
      • Both new methods are fast and provide improved
        accuracy.
  • Unconstrained average speaker turn length:
      • Increased robustness to different speaking
        styles.
      • Eliminates the α/β tuning parameters.
      • The speaker turns are entirely acoustically
        driven.

21
Questions
  • ?