Title: Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System
1 Robust Speaker Segmentation for Meetings
The ICSI-SRI Spring 2005 Diarization System
- Xavier Anguera, Chuck Wooters, Barbara Peskin and Mateu Aguiló
- RT-05S Meeting Recognition Workshop, July 13th, Edinburgh
2 Outline
- Tasks we participated in for RT-05S
- What's new since RT-04f
- The basic system
- System description by modules
  - Delay-and-sum beamforming
  - Segmentation and clustering
  - Purification algorithm
- Systems submitted to the eval
- Results on the evaluation and after
- Post-evaluation improvements
  - Individual channel weighting for delay-and-sum
  - Energy-based speech/non-speech detector
  - Selective clustering for Lecture room data
- Future work
3 Tasks we participated in
- Conference room
  - MDM condition: 1 system
  - SDM condition: 1 system
- Lecture room
  - MDM condition: 3 systems
  - SDM condition: 2 systems
  - MSLA condition: 3 systems
4 What's new? (main differences)
- First time participating in Diarization for Meetings
- Last Diarization submission was in RT-04f, for Broadcast News
- Changes since the RT-04f submission:
  - Channel enhancement via delay-and-sum processing on multi-channel tasks
  - Purification algorithm for the clustering system
  - 10 ms (vs. earlier 20 ms) frame step, using a 30 ms window
  - Fewer initial clusters for Meetings data
5 Full diarization process
[Block diagram: channels Chan. 1 .. Chan. N each pass through a Wiener filter; delay-and-sum processing fuses them (acoustic fusion, SNR enhancement) into a single enhanced channel, which feeds the single-channel diarization system (speech/non-speech detection, segmentation and clustering) to produce the diarization output.]
6 Delay-and-sum beamforming
- Takes advantage of the availability of multiple microphones to create a single enhanced channel.
- Time delay of arrival (TDOA) is computed for each channel in sliding windows of 500 ms (with 50% overlap).
- Doesn't pose a burden on the execution time (0.3 to 0.5 times real-time) vs. processing each channel separately.
[Block diagram: each channel Ch. 1 .. Ch. N is compared against the reference channel Ch. 0 with GCC-PHAT to estimate TDOA(i); the N-best TDOA candidates are filtered, and the delayed channels are summed into the enhanced signal.]
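The GCC-PHAT TDOA estimation and channel summation in the diagram can be sketched as follows. This is a minimal illustration under our own assumptions (function names are ours, 16 kHz audio so 500 ms = 8000 samples, non-overlapping windows instead of the 50% overlap, and the N-best TDOA filtering step is omitted):

```python
import numpy as np

def gcc_phat(sig, ref, max_delay):
    """Delay of `sig` relative to `ref` (in samples) via GCC-PHAT: the
    cross-power spectrum is whitened by its magnitude, so only phase
    information drives the correlation peak (robust to reverberation)."""
    n = len(sig) + len(ref)
    spec = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    spec /= np.abs(spec) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(spec, n)
    # keep only lags in [-max_delay, max_delay] and pick the peak
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

def delay_and_sum(channels, ref_idx=0, win=8000, max_delay=500):
    """Align each channel to the reference window by window
    (win=8000 samples = 500 ms at 16 kHz) and average the shifted channels
    into a single enhanced signal."""
    ref = np.asarray(channels[ref_idx], dtype=float)
    out = np.zeros_like(ref)
    for ch in channels:
        ch = np.asarray(ch, dtype=float)
        for start in range(0, len(ref) - win + 1, win):
            tdoa = gcc_phat(ch[start:start + win],
                            ref[start:start + win], max_delay)
            # read the channel shifted by its TDOA (clipped at the edges)
            src = np.clip(np.arange(start, start + win) + tdoa, 0, len(ch) - 1)
            out[start:start + win] += ch[src]
    return out / len(channels)
```

Whitening by the magnitude is what distinguishes GCC-PHAT from plain cross-correlation: the peak location depends only on phase, which is why it copes with the very different channel gains in a meeting room.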
7 Core diarization system
- Diarization process:
  1. Speech/non-speech detection on all the data.
  2. Signal processing (10 ms frame step, 19 MFCCs). Discard non-speech segments.
  3. Model initialization using K equal-length segments (K = 10 in conference room, 5 in lecture room).
  4. Viterbi-segment and retrain the models.
  5. Search for and merge the most similar pair of models for which merge_score > 0 (stop if none is found).
  6. Run the purification algorithm for at most the first K iterations.
  7. Go to step 4.
- Based on the system presented for the RT-04f Broadcast News diarization evaluation.
- Agglomerative clustering system with a BIC-like distance metric and stopping criterion.
- The merged model's parameter count is the sum of the parameters of the individual models, eliminating the BIC penalty term.
- No need for external training data nor extensive parameter tuning.
- The system is robust to changes in test data and domain.
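The BIC-like merge score reduces to a plain log-likelihood ratio once the merged model keeps the parents' combined parameters, because the penalty terms on both sides cancel. A minimal sketch of that score, using one diagonal Gaussian per cluster and the weighted mixture of the parents as the merged model (the real system uses GMMs and retrains the merged model via EM, which we skip here):

```python
import numpy as np

def loglik_gauss(x, mean, var):
    """Per-frame log-likelihood of the rows of x under a diagonal Gaussian."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var),
                  axis=1)

def fit_gauss(x):
    """ML diagonal-Gaussian fit with a small variance floor."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def merge_score(a, b):
    """Modified-BIC merge score: log-likelihood of the pooled frames under
    the merged model minus the log-likelihoods under the separate cluster
    models.  The merged model is the frame-count-weighted mixture of the two
    parents, so it has exactly their combined parameters and no BIC penalty
    weight needs tuning."""
    (ma, va), (mb, vb) = fit_gauss(a), fit_gauss(b)
    wa = len(a) / (len(a) + len(b))
    ab = np.vstack([a, b])
    mix = np.logaddexp(np.log(wa) + loglik_gauss(ab, ma, va),
                       np.log(1 - wa) + loglik_gauss(ab, mb, vb))
    return float(mix.sum()
                 - loglik_gauss(a, ma, va).sum()
                 - loglik_gauss(b, mb, vb).sum())
```

Similar clusters score near zero and distinct clusters score strongly negative; with retraining of the merged model (as in the actual system), similar clusters push above the merge_score > 0 stopping threshold.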
8 Purification algorithm
- Applied after each merging stage, for up to K (= number of initial clusters) iterations (avoiding infinite loops).
- Aims to clear clusters of "sticky" sound segments that don't fit the model but otherwise would remain and cause further errors.
- On the RT-05S Conference room data (MDM), using purification (16.33% DER) outperforms the same system without purification (17.82%).
- Purification process:
  - For each cluster (= set of segments):
    - Find the segment that best fits the cluster's model (according to likelihood normalized by the number of frames).
    - Compute merge_score between this segment and all others in the cluster.
    - Identify the segment with the worst score.
    - If all scores are > -50 (segments are quite similar), consider this cluster pure and don't process it again.
  - The worst segment across all clusters (lowest score) is extracted and assigned to a new cluster.
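One pass of that process can be sketched as below. This is an illustration under our own assumptions: per-frame Gaussian log-likelihood against the anchor segment stands in for the system's merge_score, so the slide's -50 threshold is only indicative here, and segments are raw feature-frame arrays:

```python
import numpy as np

def fit_gauss(frames):
    """ML diagonal-Gaussian fit with a small variance floor."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def avg_loglik(frames, mean, var):
    """Log-likelihood normalized by the number of frames."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return float(ll.sum() / len(frames))

def purify(clusters, threshold=-50.0):
    """One purification pass over `clusters` (each a list of segments).

    Per cluster: fit a model on all its frames, take the best-fitting
    segment as anchor, score every other segment against the anchor, and
    remember the worst.  Clusters whose scores all stay above `threshold`
    are considered pure; otherwise the globally worst segment is moved
    out into a brand-new cluster."""
    worst = None                                  # (score, cluster_idx, seg_idx)
    for ci, segs in enumerate(clusters):
        if len(segs) < 2:
            continue
        mean, var = fit_gauss(np.vstack(segs))
        anchor = max(range(len(segs)),
                     key=lambda i: avg_loglik(segs[i], mean, var))
        am, av = fit_gauss(segs[anchor])
        for si, seg in enumerate(segs):
            if si == anchor:
                continue
            score = avg_loglik(seg, am, av)
            if score >= threshold:
                continue                          # similar enough to the anchor
            if worst is None or score < worst[0]:
                worst = (score, ci, si)
    if worst is None:
        return clusters                           # every cluster looks pure
    _, ci, si = worst
    clusters = [list(s) for s in clusters]
    outlier = clusters[ci].pop(si)
    clusters.append([outlier])
    return clusters
```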
9 Systems submitted: Conference room
- Systems:
  - MDM system: Uses all available channels with delay-and-sum beamforming. Diarization using 10 initial clusters and a 3 sec minimum segment duration.
  - SDM system: Does the same diarization using only the specified channel.
10 Systems submitted: Lecture room (I)
- Task considerations:
  - In the development data, guessing "one speaker all the time" obtained hard-to-beat results, so we made this our primary system for MDM, SDM and MSLA.
  - We observed that silence breaks were only labeled between speaker changes in question-and-answer sections; therefore:
    - no speech/non-speech detector is used (1 exception).
    - the models' minimum length is set higher than in conference room data, in order to accommodate extra silence regions labeled as speech.
11 Systems submitted: Lecture room (II)
- Systems:
  - SDM, MDM and MSLA primary: one speaker all the time, derived from the UEM file.
  - Contrastive MDM (1): One speaker all the time on speech regions, computed using speech/non-speech detection on the Tabletop channel only (greatest SNR).
  - Contrastive MDM (2): Diarization performed over the Tabletop channel, using a 5 sec minimum duration and 5 initial clusters.
12 Systems submitted: Lecture room (III)
- Systems (cont'd):
  - SDM contrastive: Diarization performed with a 12 sec minimum duration and 5 initial clusters.
  - MSLA contrastive (1): Standard delay-and-sum performed on all microphones. Diarization with a 12 sec minimum duration and 5 initial clusters.
  - MSLA contrastive (2): Same as previous, but using delay-and-sum with individual time-varying channel weighting (more details below).
13 Baseline results
[Results table not reproduced; its footnotes follow.]
- Due to a small bug in the delay-and-sum algorithm.
- Due to a change in the UEM file for show CHIL_20050202-0000-E2.
- On eval data, the contrastive (real) systems beat the "do nothing" Lecture room systems!
14 Post-evaluation improvements
- Channel weighting techniques
- Energy-based, train-free speech/non-speech detector
- Selective clustering on Lecture room data
15 Individual channel weighting
- Warning sign: although standard delay-and-sum performed fairly well on the dev set, the SDM condition outperformed the MDM condition on the eval set.
- Possible problem: delay-and-sum may not perform well on a non-conventional microphone array:
  - Channels have different qualities and are of different types.
  - Placements are irregular; relative distances to the speakers are very different.
- Possible solution: give each channel a different weight.
- We explored the use of:
  1. cross-correlation between channels
  2. the individual channels' SNR
16 Individual channel weighting (II)
1. Time-varying cross-correlation weighting
- Weights are computed using the cross-correlation between the time-aligned segments for each channel and the reference channel.
- Weights are computed in an adaptive manner.
- Adaptation is affected by silence/noise, so the adaptation rate needs to be slow. We use 0.05.
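A sketch of the adaptive weighting; the update rule is our reading of the slide (the exact formula is not given): w_i ← (1 - α)·w_i + α·xcorr_i, with α = 0.05 and xcorr_i the normalized cross-correlation of channel i with the reference over the current window:

```python
import numpy as np

def xcorr_weights(channels, ref_idx, win=8000, alpha=0.05):
    """Time-varying channel weights from cross-correlation with the
    reference channel.  Each window, the weight of channel i is smoothed
    towards its normalized cross-correlation with the reference; the slow
    adaptation rate keeps short silences or noise bursts from swinging
    the weights.  Returns one row of normalized weights per window."""
    n_ch = len(channels)
    w = np.full(n_ch, 1.0 / n_ch)            # start from equal weights
    ref = np.asarray(channels[ref_idx], dtype=float)
    history = []
    for start in range(0, len(ref) - win + 1, win):
        r = ref[start:start + win]
        for i, ch in enumerate(channels):
            c = np.asarray(ch[start:start + win], dtype=float)
            xc = np.abs(np.dot(c, r)) / (np.linalg.norm(c)
                                         * np.linalg.norm(r) + 1e-12)
            w[i] = (1 - alpha) * w[i] + alpha * xc
        history.append(w / w.sum())           # normalized weights per window
    return np.array(history)
```

In a delay-and-sum loop these per-window weights would multiply each aligned channel before summation, instead of the uniform 1/N.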
17 Individual channel weighting (III)
2. Channel weighting by SNR
- The SNR is computed across the full evaluated portion of audio on each channel.
- Delay-and-sum is performed with fixed weights, derived from the SNR values, throughout the meeting.
- It identifies overall bad channels and reduces their impact on the output.
3. Hybrid weighting system
- Uses a combination of both methods:
  - SNR to define the best channel and use it as the reference channel.
  - Cross-correlation weighting to compute the weights throughout the process.
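The SNR-based fixed weighting can be sketched as below. The SNR estimator is our own crude proxy (mean frame energy over a 10th-percentile noise floor), not the slide's method; in the hybrid scheme, `argmax` of the resulting SNRs would pick the reference channel for cross-correlation weighting:

```python
import numpy as np

def snr_weights(channels, frame=400):
    """One fixed weight per channel from a whole-recording SNR estimate.

    SNR proxy: mean frame energy divided by the 10th-percentile frame
    energy (taken as the noise floor).  Normalizing the SNRs gives weights
    that shrink the contribution of consistently bad channels in the
    delay-and-sum output."""
    snrs = []
    for ch in channels:
        ch = np.asarray(ch, dtype=float)
        n = len(ch) // frame
        energy = (ch[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
        noise_floor = np.percentile(energy, 10) + 1e-12
        snrs.append(energy.mean() / noise_floor)
    snrs = np.array(snrs)
    return snrs / snrs.sum()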
18 Individual channel weighting (V): Results
- Cross-correlation: works well in the Conference room, with its frequent speaker changes. It adapts each channel according to the quality of the current talker, changing with changing channel conditions.
- SNR: fixed weighting is better for the Lecture room. As one speaker is the main source, the optimum channel weights remain nearly constant throughout the meeting.
19 Energy-based speech/non-speech detection
- The current speech/non-speech (SNS) detector, borrowed from SRI's ASR system, is the only trained module in the system. We strive for a fully robust and train-free system.
- We are now developing an energy-based detector, tested on conference room data.
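A train-free energy detector of this kind might look as follows; the thresholding scheme (frame energy against a multiple of a percentile noise floor) is our assumption, since the slides do not describe the detector's internals:

```python
import numpy as np

def energy_sns(signal, frame=160, ratio=4.0):
    """Train-free energy-based speech/non-speech labelling.

    Frames (frame=160 samples = 10 ms at 16 kHz) whose energy exceeds
    `ratio` times the estimated noise floor (10th percentile of frame
    energies) are labelled speech (True).  No training data is needed:
    the noise floor is estimated from the recording itself."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // frame
    energy = (signal[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
    floor = np.percentile(energy, 10) + 1e-12
    return energy > ratio * floor
```

A real detector would add smoothing (minimum-duration constraints) over these raw frame labels before handing segments to the clustering stage.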
20 Selective clustering for Lecture room
- From the file name we were allowed to know whether an excerpt was a lecture or a Q&A section; we didn't use this during the eval.
- We have now built a system where:
  - If lecture (E1, E3): one speaker all the time.
  - If Q&A (E2): diarization with a 12 sec minimum duration and 5 initial clusters.
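The routing amounts to a switch on the file-name suffix; a minimal sketch (the function name and the returned configuration keys are ours, the E1/E2/E3 codes and parameter values are from the slides):

```python
def lecture_strategy(filename):
    """Choose the Lecture room diarization strategy from the CHIL file
    name suffix: E1/E3 are lecture excerpts (one speaker all the time),
    E2 is the question-and-answer section (real clustering with a 12 s
    minimum duration and 5 initial clusters)."""
    stem = filename.rsplit(".", 1)[0]        # drop any extension
    section = stem.rsplit("-", 1)[-1]        # e.g. "E2"
    if section in ("E1", "E3"):
        return {"mode": "one_speaker"}
    if section == "E2":
        return {"mode": "cluster", "min_duration_s": 12, "init_clusters": 5}
    raise ValueError(f"unrecognized section code: {section!r}")
```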
21 Future work
- Improve the speech/non-speech module without training data or extensive parameter tuning.
- Further enhance the delay-and-sum processing:
  - Alternative ways to deal with bad channels
  - Improvements on channel weighting
  - Use of auxiliary information for clustering (TDOA, weights, ...)
- Develop new methods for cluster purification.
- Explore speaker ID techniques for diarization.
22 Questions?