Title: Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System
1 Robust Speaker Segmentation for Meetings
The ICSI-SRI Spring 2005 Diarization System
- Xavier Anguera, Chuck Wooters, Barbara Peskin and Mateu Aguiló
- RT-05S Meeting Recognition Workshop, July 13th, Edinburgh
2 Outline
- Tasks we participated in for RT-05S
- What's new since RT-04f
- The basic system
- System description by modules
  - Delay-and-sum beamforming
  - Segmentation and clustering
  - Purification algorithm
- Systems submitted to the eval
- Results on the evaluation and after
- Post-evaluation improvements
  - Individual channel weighting for delay-and-sum
  - Energy-based speech/non-speech detector
  - Selective clustering for Lecture room data
- Future work
3 Tasks we participated in
- Conference room
  - MDM condition: 1 system
  - SDM condition: 1 system
- Lecture room
  - MDM condition: 3 systems
  - SDM condition: 2 systems
  - MSLA condition: 3 systems
4 What's new? (main differences)
- First time participating in Diarization for Meetings
- Last Diarization submission was in RT-04f, for Broadcast News
- Changes since the RT-04f submission:
  - Channel enhancement via delay-and-sum processing on multi-channel tasks
  - Purification algorithm for the clustering system
  - 10 ms (vs. earlier 20 ms) frame step, using a 30 ms window
  - Fewer initial clusters for Meetings data
5 Full diarization process
[Block diagram: channels Chan. 1 .. Chan. N each pass through a Wiener filter; delay-and-sum processing fuses them (acoustic fusion, SNR enhancement) into a single enhanced channel, which feeds the single-channel diarization system (speech/non-speech detection, segmentation and clustering) to produce the diarization output.]
6 Delay-and-sum beamforming
- Takes advantage of the availability of multiple microphones to create a single enhanced channel.
- Time delay of arrival (TDOA) is computed for each channel in sliding windows of 500 ms (with 50% overlap).
- Doesn't pose a burden on the execution time (0.3 to 0.5 times real-time) vs. processing each channel separately.
[Block diagram: each channel Ch. 1 .. Ch. N is compared against the reference channel Ch. 0 with GCC-PHAT to estimate TDOA(i); the N-best TDOA candidates are filtered, and the delayed channels are summed into the enhanced signal.]
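The GCC-PHAT TDOA estimation and channel summation in the diagram can be sketched as follows. This is a minimal illustration under our own assumptions (function names are ours, 16 kHz audio so 500 ms = 8000 samples, non-overlapping windows instead of the 50% overlap, and the N-best TDOA filtering step is omitted):

```python
import numpy as np

def gcc_phat(sig, ref, max_delay):
    """Delay of `sig` relative to `ref` (in samples) via GCC-PHAT: the
    cross-power spectrum is whitened by its magnitude, so only phase
    information drives the correlation peak (robust to reverberation)."""
    n = len(sig) + len(ref)
    spec = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    spec /= np.abs(spec) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(spec, n)
    # keep only lags in [-max_delay, max_delay] and pick the peak
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(np.abs(cc))) - max_delay

def delay_and_sum(channels, ref_idx=0, win=8000, max_delay=500):
    """Align each channel to the reference window by window
    (win=8000 samples = 500 ms at 16 kHz) and average the shifted channels
    into a single enhanced signal."""
    ref = np.asarray(channels[ref_idx], dtype=float)
    out = np.zeros_like(ref)
    for ch in channels:
        ch = np.asarray(ch, dtype=float)
        for start in range(0, len(ref) - win + 1, win):
            tdoa = gcc_phat(ch[start:start + win],
                            ref[start:start + win], max_delay)
            # read the channel shifted by its TDOA (clipped at the edges)
            src = np.clip(np.arange(start, start + win) + tdoa, 0, len(ch) - 1)
            out[start:start + win] += ch[src]
    return out / len(channels)
```

Whitening by the magnitude is what distinguishes GCC-PHAT from plain cross-correlation: the peak location depends only on phase, which is why it copes with the very different channel gains in a meeting room.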
7 Core diarization system
- Diarization process:
  1. Speech/non-speech detection on all the data.
  2. Signal processing (10 ms frame step, 19 MFCCs). Discard non-speech segments.
  3. Model initialization using K equal-length segments (K = 10 in conference room, 5 in lecture room).
  4. Viterbi-segment and retrain the models.
  5. Search for and merge the most similar pair of models for which merge_score > 0 (stop if none is found).
  6. Run the purification algorithm for at most the first K iterations.
  7. Go to step 4.
- Based on the system presented for the RT-04f Broadcast News diarization evaluation.
- Agglomerative clustering system with a BIC-like distance metric and stopping criterion.
- The merged model's parameter count is the sum of the parameters of the individual models, eliminating the BIC penalty term.
- No need for external training data nor extensive parameter tuning.
- The system is robust to changes in test data and domain.
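The BIC-like merge score reduces to a plain log-likelihood ratio once the merged model keeps the parents' combined parameters, because the penalty terms on both sides cancel. A minimal sketch of that score, using one diagonal Gaussian per cluster and the weighted mixture of the parents as the merged model (the real system uses GMMs and retrains the merged model via EM, which we skip here):

```python
import numpy as np

def loglik_gauss(x, mean, var):
    """Per-frame log-likelihood of the rows of x under a diagonal Gaussian."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var),
                  axis=1)

def fit_gauss(x):
    """ML diagonal-Gaussian fit with a small variance floor."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def merge_score(a, b):
    """Modified-BIC merge score: log-likelihood of the pooled frames under
    the merged model minus the log-likelihoods under the separate cluster
    models.  The merged model is the frame-count-weighted mixture of the two
    parents, so it has exactly their combined parameters and no BIC penalty
    weight needs tuning."""
    (ma, va), (mb, vb) = fit_gauss(a), fit_gauss(b)
    wa = len(a) / (len(a) + len(b))
    ab = np.vstack([a, b])
    mix = np.logaddexp(np.log(wa) + loglik_gauss(ab, ma, va),
                       np.log(1 - wa) + loglik_gauss(ab, mb, vb))
    return float(mix.sum()
                 - loglik_gauss(a, ma, va).sum()
                 - loglik_gauss(b, mb, vb).sum())
```

Similar clusters score near zero and distinct clusters score strongly negative; with retraining of the merged model (as in the actual system), similar clusters push above the merge_score > 0 stopping threshold.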
8 Purification algorithm
- Applied after each merging stage, for up to K (= number of initial clusters) iterations (avoiding infinite loops).
- Aims to clear clusters of "sticky" sound segments that don't fit the model but otherwise would remain and cause further errors.
- On the RT-05S Conference room data (MDM), using purification (16.33% DER) outperforms the same system without purification (17.82%).
- Purification process:
  - For each cluster (= set of segments):
    - Find the segment that best fits the cluster's model (according to likelihood normalized by the number of frames).
    - Compute merge_score between this segment and all others in the cluster.
    - Identify the segment with the worst score.
    - If all scores are > -50 (segments are quite similar), consider this cluster pure and don't process it again.
  - The worst segment across all clusters (lowest score) is extracted and assigned to a new cluster.
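One pass of that process can be sketched as below. This is an illustration under our own assumptions: per-frame Gaussian log-likelihood against the anchor segment stands in for the system's merge_score, so the slide's -50 threshold is only indicative here, and segments are raw feature-frame arrays:

```python
import numpy as np

def fit_gauss(frames):
    """ML diagonal-Gaussian fit with a small variance floor."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def avg_loglik(frames, mean, var):
    """Log-likelihood normalized by the number of frames."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return float(ll.sum() / len(frames))

def purify(clusters, threshold=-50.0):
    """One purification pass over `clusters` (each a list of segments).

    Per cluster: fit a model on all its frames, take the best-fitting
    segment as anchor, score every other segment against the anchor, and
    remember the worst.  Clusters whose scores all stay above `threshold`
    are considered pure; otherwise the globally worst segment is moved
    out into a brand-new cluster."""
    worst = None                                  # (score, cluster_idx, seg_idx)
    for ci, segs in enumerate(clusters):
        if len(segs) < 2:
            continue
        mean, var = fit_gauss(np.vstack(segs))
        anchor = max(range(len(segs)),
                     key=lambda i: avg_loglik(segs[i], mean, var))
        am, av = fit_gauss(segs[anchor])
        for si, seg in enumerate(segs):
            if si == anchor:
                continue
            score = avg_loglik(seg, am, av)
            if score >= threshold:
                continue                          # similar enough to the anchor
            if worst is None or score < worst[0]:
                worst = (score, ci, si)
    if worst is None:
        return clusters                           # every cluster looks pure
    _, ci, si = worst
    clusters = [list(s) for s in clusters]
    outlier = clusters[ci].pop(si)
    clusters.append([outlier])
    return clusters
```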
9 Systems submitted: Conference room
- Systems:
  - MDM system: Uses all available channels with delay-and-sum beamforming. Diarization using 10 initial clusters and a 3 sec minimum segment duration.
  - SDM system: Does the same diarization using only the specified channel.
10 Systems submitted: Lecture room (I)
- Task considerations:
  - In the development data, guessing "one speaker all the time" obtained hard-to-beat results, so we made this our primary system for MDM, SDM and MSLA.
  - We observed that silence breaks were only labeled between speaker changes in question-and-answer sections; therefore:
    - no speech/non-speech detector is used (1 exception).
    - the models' minimum length is set higher than in conference room data, in order to accommodate extra silence regions labeled as speech.
11 Systems submitted: Lecture room (II)
- Systems:
  - SDM, MDM and MSLA primary: one speaker all the time, derived from the UEM file.
  - Contrastive MDM (1): One speaker all the time on speech regions, computed using speech/non-speech detection on the Tabletop channel only (greatest SNR).
  - Contrastive MDM (2): Diarization performed over the Tabletop channel, using a 5 sec minimum duration and 5 initial clusters.
12 Systems submitted: Lecture room (III)
- Systems (cont'd):
  - SDM contrastive: Diarization performed with a 12 sec minimum duration and 5 initial clusters.
  - MSLA contrastive (1): Standard delay-and-sum performed on all microphones. Diarization with a 12 sec minimum duration and 5 initial clusters.
  - MSLA contrastive (2): Same as previous, but using delay-and-sum with individual time-varying channel weighting (more details below).
13 Baseline results
[Results table not reproduced; its footnotes follow.]
- Due to a small bug in the delay-and-sum algorithm.
- Due to a change in the UEM file for show CHIL_20050202-0000-E2.
- On eval data, the contrastive (real) systems beat the "do nothing" Lecture room systems!
14 Post-evaluation improvements
- Channel weighting techniques
- Energy-based, train-free speech/non-speech detector
- Selective clustering on Lecture room data
15 Individual channel weighting
- Warning sign: although standard delay-and-sum performed fairly well on the dev set, the SDM condition outperformed the MDM condition on the eval set.
- Possible problem: delay-and-sum may not perform well on a non-conventional microphone array:
  - Channels have different qualities and are of different types.
  - Placements are irregular; relative distances to the speakers are very different.
- Possible solution: give each channel a different weight.
- We explored the use of:
  1. cross-correlation between channels
  2. the individual channels' SNR
16 Individual channel weighting (II)
1. Time-varying cross-correlation weighting
- Weights are computed using the cross-correlation between the time-aligned segments for each channel and the reference channel.
- Weights are computed in an adaptive manner.
- Adaptation is affected by silence/noise, so the adaptation rate needs to be slow. We use 0.05.
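A sketch of the adaptive weighting; the update rule is our reading of the slide (the exact formula is not given): w_i ← (1 - α)·w_i + α·xcorr_i, with α = 0.05 and xcorr_i the normalized cross-correlation of channel i with the reference over the current window:

```python
import numpy as np

def xcorr_weights(channels, ref_idx, win=8000, alpha=0.05):
    """Time-varying channel weights from cross-correlation with the
    reference channel.  Each window, the weight of channel i is smoothed
    towards its normalized cross-correlation with the reference; the slow
    adaptation rate keeps short silences or noise bursts from swinging
    the weights.  Returns one row of normalized weights per window."""
    n_ch = len(channels)
    w = np.full(n_ch, 1.0 / n_ch)            # start from equal weights
    ref = np.asarray(channels[ref_idx], dtype=float)
    history = []
    for start in range(0, len(ref) - win + 1, win):
        r = ref[start:start + win]
        for i, ch in enumerate(channels):
            c = np.asarray(ch[start:start + win], dtype=float)
            xc = np.abs(np.dot(c, r)) / (np.linalg.norm(c)
                                         * np.linalg.norm(r) + 1e-12)
            w[i] = (1 - alpha) * w[i] + alpha * xc
        history.append(w / w.sum())           # normalized weights per window
    return np.array(history)
```

In a delay-and-sum loop these per-window weights would multiply each aligned channel before summation, instead of the uniform 1/N.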
17 Individual channel weighting (III)
2. Channel weighting by SNR
- The SNR is computed across the full evaluated portion of audio on each channel.
- Delay-and-sum is performed with fixed weights, derived from the SNR values, throughout the meeting.
- It identifies overall bad channels and reduces their impact on the output.
3. Hybrid weighting system
- Uses a combination of both methods:
  - SNR to define the best channel and use it as the reference channel.
  - Cross-correlation weighting to compute the weights throughout the process.
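The SNR-based fixed weighting can be sketched as below. The SNR estimator is our own crude proxy (mean frame energy over a 10th-percentile noise floor), not the slide's method; in the hybrid scheme, `argmax` of the resulting SNRs would pick the reference channel for cross-correlation weighting:

```python
import numpy as np

def snr_weights(channels, frame=400):
    """One fixed weight per channel from a whole-recording SNR estimate.

    SNR proxy: mean frame energy divided by the 10th-percentile frame
    energy (taken as the noise floor).  Normalizing the SNRs gives weights
    that shrink the contribution of consistently bad channels in the
    delay-and-sum output."""
    snrs = []
    for ch in channels:
        ch = np.asarray(ch, dtype=float)
        n = len(ch) // frame
        energy = (ch[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
        noise_floor = np.percentile(energy, 10) + 1e-12
        snrs.append(energy.mean() / noise_floor)
    snrs = np.array(snrs)
    return snrs / snrs.sum()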
18 Individual channel weighting (V): Results
- Cross-correlation: works well in the Conference room, with its frequent speaker changes. It adapts each channel according to the quality of the current talker, changing with changing channel conditions.
- SNR: fixed weighting is better for the Lecture room. As one speaker is the main source, the optimum channel weights remain nearly constant throughout the meeting.
19 Energy-based speech/non-speech detection
- The current speech/non-speech (SNS) detector, borrowed from SRI's ASR system, is the only trained module in the system. We strive for a fully robust and train-free system.
- We are now developing an energy-based detector, tested on conference room data.
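A train-free energy detector of this kind might look as follows; the thresholding scheme (frame energy against a multiple of a percentile noise floor) is our assumption, since the slides do not describe the detector's internals:

```python
import numpy as np

def energy_sns(signal, frame=160, ratio=4.0):
    """Train-free energy-based speech/non-speech labelling.

    Frames (frame=160 samples = 10 ms at 16 kHz) whose energy exceeds
    `ratio` times the estimated noise floor (10th percentile of frame
    energies) are labelled speech (True).  No training data is needed:
    the noise floor is estimated from the recording itself."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // frame
    energy = (signal[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
    floor = np.percentile(energy, 10) + 1e-12
    return energy > ratio * floor
```

A real detector would add smoothing (minimum-duration constraints) over these raw frame labels before handing segments to the clustering stage.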
20 Selective clustering for Lecture room
- From the file name we were allowed to know whether an excerpt was a lecture or a Q&A section; we didn't use this during the eval.
- We have now built a system where:
  - If lecture (E1, E3): one speaker all the time.
  - If Q&A (E2): diarization with a 12 sec minimum duration and 5 initial clusters.
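The routing amounts to a switch on the file-name suffix; a minimal sketch (the function name and the returned configuration keys are ours, the E1/E2/E3 codes and parameter values are from the slides):

```python
def lecture_strategy(filename):
    """Choose the Lecture room diarization strategy from the CHIL file
    name suffix: E1/E3 are lecture excerpts (one speaker all the time),
    E2 is the question-and-answer section (real clustering with a 12 s
    minimum duration and 5 initial clusters)."""
    stem = filename.rsplit(".", 1)[0]        # drop any extension
    section = stem.rsplit("-", 1)[-1]        # e.g. "E2"
    if section in ("E1", "E3"):
        return {"mode": "one_speaker"}
    if section == "E2":
        return {"mode": "cluster", "min_duration_s": 12, "init_clusters": 5}
    raise ValueError(f"unrecognized section code: {section!r}")
```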
21 Future work
- Improve the speech/non-speech module without training data or extensive parameter tuning.
- Further enhance the delay-and-sum processing:
  - Alternative ways to deal with bad channels
  - Improvements on channel weighting
  - Use of auxiliary information for clustering (TDOA, weights, ...)
- Develop new methods for cluster purification.
- Explore speaker ID techniques for diarization.
22 Questions?