1
Robust Speaker Segmentation for Meetings
The ICSI-SRI Spring 2005 Diarization System
  • Xavier Anguera, Chuck Wooters, Barbara Peskin and
    Mateu Aguiló

RT-05S Meeting Recognition Workshop, July 13th, Edinburgh
2
Outline
  • Tasks we participated in for RT-05S
  • What's new since RT-04f
  • The basic system
  • System description by modules
  • Delay-and-sum beamforming
  • Segmentation and clustering
  • Purification algorithm
  • Systems submitted to the eval
  • Results on the evaluation and after
  • Post-evaluation improvements
  • Individual channel weighting for delay-and-sum
  • Energy-based speech/non-speech detector
  • Selective clustering for Lecture room data
  • Future work

3
Tasks we participated in
  • Conference room
  • MDM condition: 1 system
  • SDM condition: 1 system
  • Lecture room
  • MDM condition: 3 systems
  • SDM condition: 2 systems
  • MSLA condition: 3 systems

4
What's new? (main differences)
  • First time participating in Diarization for Meetings.
  • Our last Diarization submission was in RT-04f, for Broadcast News.
  • Changes since the RT-04f submission:
  • channel enhancement via delay-and-sum processing on the multi-channel tasks;
  • a purification algorithm for the clustering system;
  • a 10 ms (vs. the earlier 20 ms) frame step, using a 30 ms window;
  • fewer initial clusters for Meetings data.

5
Full diarization process
(Block diagram: channels 1 ... N are each Wiener-filtered, then fused by delay-and-sum processing (acoustic fusion, SNR enhancement) into a single channel; the single-channel diarization system (speech/non-speech detection, then segmentation and clustering) produces the diarization output.)
6
Delay-and-sum beamforming
  • Takes advantage of the availability of multiple microphones to create a single enhanced channel.
  • The time delay of arrival (TDOA) is computed for each channel in sliding windows of 500 ms (with 50% overlap), as sketched below.
  • Doesn't pose a burden on the execution time (0.3 to 0.5 times real time) vs. processing each channel separately.

(Block diagram: the TDOA of each channel Ch. 1 ... Ch. N with respect to the reference channel Ch. 0 is estimated with GCC-PHAT, the N-best TDOA values are filtered, and the aligned channels are summed into the enhanced signal.)
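
As a concrete illustration only, here is a minimal sketch of per-window GCC-PHAT delay estimation followed by delay-and-sum combination, as outlined on this slide. It assumes NumPy, equal-length float channel arrays and equal channel weights; the function names, the 20 ms maximum delay and the simple window-by-window overwrite are illustrative choices, not the system's actual implementation.

  import numpy as np

  def gcc_phat_delay(sig, ref, fs, max_delay_s=0.02):
      """Estimate the delay (in seconds) of sig relative to ref with GCC-PHAT."""
      n = len(sig) + len(ref)
      spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
      spec /= np.abs(spec) + 1e-12                    # PHAT weighting
      cc = np.fft.irfft(spec, n=n)
      max_shift = int(max_delay_s * fs)
      cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
      return (np.argmax(np.abs(cc)) - max_shift) / fs

  def delay_and_sum(channels, fs, ref_idx=0, win_s=0.5, step_s=0.25):
      """Delay-and-sum using TDOAs estimated in 500 ms windows with 50% overlap."""
      ref = channels[ref_idx]
      out = np.zeros_like(ref)
      win, step = int(win_s * fs), int(step_s * fs)
      for start in range(0, len(ref) - win, step):
          acc = np.zeros(win)
          for ch in channels:
              tdoa = gcc_phat_delay(ch[start:start + win], ref[start:start + win], fs)
              shift = int(round(tdoa * fs))
              begin = min(max(0, start + shift), len(ch) - win)  # align channel with the reference
              acc += ch[begin:begin + win]
          out[start:start + win] = acc / len(channels)  # equal weights; later slides refine this
      return out
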
7
Core diarization system
  • Diarization process:
  1. Speech/non-speech detection on all the data.
  2. Signal processing (10 ms step, 19 MFCCs); discard the non-speech segments.
  3. Model initialization using K equal-length segments (K = 10 in the conference room, 5 in the lecture room).
  4. Viterbi-segment and retrain the models.
  5. Search for and merge the most similar pair of models for which merge_score > 0 (stop if none is found).
  6. Run the purification algorithm (for at most the first K iterations).
  7. Go to step 4.
  • Based on the system presented for the RT-04f Broadcast News diarization evaluation.
  • Agglomerative clustering system with a BIC-like distance metric and stopping criterion (see the sketch below).
  • The number of parameters of the merged model equals the sum of the parameters of the individual models, eliminating the BIC penalty term.
  • No need for external training data or extensive parameter tuning.
  • The system is robust to changes in test data and domain.

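The following is a minimal sketch of the penalty-free, BIC-like merge score described on this slide, written with scikit-learn GMMs. The component counts, covariance type and training schedule are illustrative assumptions, not the system's actual settings; the point it shows is that the merged model gets n_a + n_b components, so the BIC penalty terms cancel and only the likelihood difference remains.

  import numpy as np
  from sklearn.mixture import GaussianMixture

  def merge_score(feats_a, feats_b, n_a=5, n_b=5):
      """BIC-like score without the penalty term: positive means the single
      merged model explains the pooled data better than the two separate
      models, so the pair is a candidate for merging."""
      gmm_a = GaussianMixture(n_components=n_a, covariance_type='diag').fit(feats_a)
      gmm_b = GaussianMixture(n_components=n_b, covariance_type='diag').fit(feats_b)
      pooled = np.vstack([feats_a, feats_b])
      gmm_ab = GaussianMixture(n_components=n_a + n_b, covariance_type='diag').fit(pooled)
      # score() returns the average per-frame log-likelihood, so scale by frame counts.
      ll_separate = gmm_a.score(feats_a) * len(feats_a) + gmm_b.score(feats_b) * len(feats_b)
      ll_merged = gmm_ab.score(pooled) * len(pooled)
      return ll_merged - ll_separate

Clusters keep being merged while the best pair's score stays above 0, which doubles as the stopping criterion (steps 5-7 above).
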
8
Purification algorithm
  • Applied after each merging stage, for up to K (= number of initial clusters) iterations (avoiding infinite loops).
  • Aims to clear clusters of "sticky" sound segments that don't fit the model but would otherwise remain and cause further errors.
  • On the RT-05S Conference room data (MDM), using purification (16.33% DER) outperforms the same system without purification (17.82%).
  • Purification process (see the sketch below):
  • For each cluster (= set of segments):
  • find the segment that best fits the cluster's model (according to its likelihood normalized by the number of frames);
  • compute the merge_score between this segment and all the others in the cluster;
  • identify the segment with the worst score;
  • if all scores are > -50 (the segments are quite similar), consider this cluster pure and don't process it again.
  • The worst segment across all clusters (lowest score) is extracted and assigned to a new cluster.

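Below is a minimal sketch of one purification pass following the steps on this slide. It reuses the illustrative merge_score() above and assumes each cluster is a dict holding a fitted GMM and a list of per-segment feature matrices; only the -50 threshold is taken from the slide, everything else is an assumption.

  def purify_once(clusters, threshold=-50.0):
      """One purification pass: across all non-pure clusters, find the segment
      that fits its own cluster worst and move it to a new cluster."""
      worst = None                              # (score, cluster, segment index)
      for cluster in clusters:
          if cluster.get('pure'):
              continue
          segs = cluster['segments']
          # Segment best matching the cluster model (frame-normalized likelihood).
          best = max(range(len(segs)), key=lambda i: cluster['gmm'].score(segs[i]))
          scored = [(merge_score(segs[best], segs[i]), i)
                    for i in range(len(segs)) if i != best]
          if not scored or min(scored)[0] > threshold:
              cluster['pure'] = True            # segments are similar enough; skip later
              continue
          score, idx = min(scored)
          if worst is None or score < worst[0]:
              worst = (score, cluster, idx)
      if worst is not None:                     # evict the globally worst segment
          _, cluster, idx = worst
          clusters.append({'segments': [cluster['segments'].pop(idx)],
                           'gmm': None, 'pure': False})
      return clusters
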
9
Systems submitted: Conference room
  • Systems:
  • MDM system: uses all available channels with delay-and-sum beamforming. Diarization using 10 initial clusters and a 3 sec minimum segment duration.
  • SDM system: performs the same diarization using only the specified channel.

10
Systems submitted: Lecture room (I)
  • Task considerations:
  • In the development data, guessing one speaker all the time obtained hard-to-beat results,
  • so we made this our primary system for MDM, SDM and MSLA.
  • We observed that silence breaks were only labeled between speaker changes in question-answer sections; therefore:
  • no speech/non-speech detector is used (with 1 exception);
  • the models' minimum length is set higher than in the conference room data, in order to accommodate the extra silence regions labeled as speech.

11
Systems submitted: Lecture room (II)
  • Systems:
  • SDM, MDM and MSLA primary: one speaker all the time, derived from the UEM file.
  • Contrastive MDM (1): one speaker all the time on the speech regions, computed using speech/non-speech detection on the Tabletop channel only (greatest SNR).
  • Contrastive MDM (2): diarization performed over the Tabletop channel, using a 5 sec minimum duration and 5 initial clusters.

12
Systems submitted: Lecture room (III)
  • Systems (cont'd):
  • SDM contrastive: diarization performed with a 12 sec minimum duration and 5 initial clusters.
  • MSLA contrastive (1): standard delay-and-sum performed on all microphones. Diarization with a 12 sec minimum duration and 5 initial clusters.
  • MSLA contrastive (2): same as previous, but using delay-and-sum with individual time-varying channel weighting (more details below).

13
Baseline Results
(Results table not reproduced in this transcript.)
Footnote: due to a small bug in the delay-and-sum algorithm.
Footnote: due to a change in the UEM file for show CHIL_20050202-0000-E2.
On the eval data, the contrastive (real) systems beat the "do nothing" Lecture room systems!
14
Post-evaluation improvements
  • Channel weighting techniques
  • Energy-based, train-free speech/non-speech detector
  • Selective clustering on lecture room data

15
Individual channel weighting
  • Warning sign: although standard delay-and-sum performed fairly well on the dev set, the SDM condition outperformed the MDM condition on the eval set.
  • Possible problem: delay-and-sum may not perform well on a non-conventional microphone array:
  • the channels have different qualities and are of different types;
  • placements are irregular, and the relative distances to the speakers are very different.
  • Possible solution: give each channel a different weight.
  • We explored the use of:
  • the cross-correlation between channels;
  • the individual channels' SNR.

16
Individual channel weighting (II)
1. Time-varying cross-correlation weighting
  • Weights are computed using the cross-correlation between the time-aligned segments for each channel and the reference channel.
  • Weights are computed in an adaptive manner (see the sketch below).
  • Adaptation is affected by silence/noise, so the adaptation rate needs to be slow; we use 0.05.

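A minimal sketch of the adaptive weight update described on this slide, assuming the per-window channel segments have already been time-aligned using the TDOAs and that weights is a NumPy float array carried over from the previous window. The normalized-correlation measure and the weight normalization are assumptions; only the 0.05 adaptation rate is taken from the slide.

  import numpy as np

  def update_weights(weights, aligned_segs, ref_idx=0, alpha=0.05):
      """Exponentially adapt one weight per channel from the normalized
      cross-correlation of its aligned segment with the reference segment."""
      ref = aligned_segs[ref_idx]
      for i, seg in enumerate(aligned_segs):
          xcorr = np.dot(seg, ref) / (np.linalg.norm(seg) * np.linalg.norm(ref) + 1e-12)
          # Slow adaptation (alpha = 0.05) so silence/noise windows barely move the weights.
          weights[i] = (1.0 - alpha) * weights[i] + alpha * max(xcorr, 0.0)
      return weights / weights.sum()            # normalized weights for this window

  # Per-window weighted delay-and-sum, instead of the equal weights used earlier:
  #   enhanced = sum(w * seg for w, seg in zip(weights, aligned_segs))
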
17
Individual channel weighting (III)
2. Channel weighting by SNR
  • The SNR is computed across the full evaluated portion of the audio on each channel.
  • Delay-and-sum is performed with a fixed weight per channel, derived from the SNR values, throughout the meeting.
  • It identifies overall bad channels and reduces their impact on the output (see the sketch below).

3. Hybrid weighting system
  • Use a combination of both methods:
  • SNR to define the best channel and use it as the reference channel;
  • cross-correlation weighting to compute the weights throughout the process.

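For the SNR-based variant, the slide only states that fixed weights are derived from each channel's SNR over the evaluated portion; the SNR estimate and the simple SNR-to-weight mapping below are assumptions made for illustration. In the hybrid system, the channel with the highest SNR would serve as the reference while the adaptive cross-correlation update above supplies the weights.

  import numpy as np

  def snr_weights(channels, speech_mask):
      """One fixed weight per channel from its SNR over the whole evaluated
      portion; speech_mask marks speech samples (True) vs. non-speech (False)."""
      snrs = []
      for ch in channels:
          speech_power = np.mean(ch[speech_mask] ** 2) + 1e-12
          noise_power = np.mean(ch[~speech_mask] ** 2) + 1e-12
          snrs.append(10.0 * np.log10(speech_power / noise_power))
      w = np.clip(np.array(snrs), 1e-3, None)   # assumed: weight proportional to SNR in dB
      return w / w.sum()                        # kept constant throughout the meeting
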
18
Individual channel weighting (IV): Results
  • Cross-correlation works well in the conference room, with frequent speaker changes: it adapts each channel according to the quality of the current talker, changing with the changing channel conditions.
  • SNR-based fixed weighting is better for the lecture room: as one speaker is the main source, the optimum channel weights remain nearly constant throughout the meeting.

19
Energy-based speech/non-speech detection
  • The current speech/non-speech (SNS) detector, borrowed from SRI's ASR system, is the only trained module in the system. We strive for a fully robust and train-free system.
  • We are now developing an energy-based detector, tested on conference room data (see the sketch below).

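The slides do not spell out the detector's internals, so the following is only a generic, train-free energy-based sketch: frame log-energies are thresholded a fixed margin above an estimated noise floor. The 30 ms / 10 ms framing matches the front end mentioned earlier; the percentile and margin values are assumptions.

  import numpy as np

  def energy_sns(signal, fs, frame_s=0.03, step_s=0.01, margin_db=6.0):
      """Label each frame as speech (True) or non-speech (False) by comparing
      its log-energy with an estimated noise floor plus a margin."""
      frame, step = int(frame_s * fs), int(step_s * fs)
      energies = np.array([
          10.0 * np.log10(np.mean(signal[s:s + frame] ** 2) + 1e-12)
          for s in range(0, len(signal) - frame, step)])
      noise_floor = np.percentile(energies, 10)   # assume the quietest 10% is non-speech
      return energies > noise_floor + margin_db
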
20
Selective clustering for Lecture room
  • From the file name we were allowed to know whether an excerpt was a lecture or a Q&A section; we didn't use this during the eval.
  • We have now built a system where (see the sketch below):
  • if lecture (E1, E3): one speaker all the time;
  • if Q&A (E2): diarization with a 12 sec minimum duration and 5 initial clusters.

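A minimal sketch of this selection rule, assuming CHIL-style show names ending in E1/E2/E3 (as in the example show name earlier); the returned labels simply name the two strategies described above.

  def choose_strategy(show_name):
      """Pick the strategy from the show-name suffix: E1/E3 = lecture excerpt,
      E2 = question-and-answer section."""
      if show_name.endswith(('E1', 'E3')):
          return 'one speaker all the time'
      return 'diarization, 12 sec minimum duration, 5 initial clusters'

  # e.g. choose_strategy('CHIL_20050202-0000-E2') selects the diarization branch.
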
21
Future work
  • Improve the speech/non-speech module without training data or extensive parameter tuning.
  • Further enhance the delay-and-sum processing:
  • alternative ways to deal with bad channels;
  • improvements to channel weighting;
  • use of auxiliary information for clustering (TDOAs, weights, ...).
  • Develop new methods for cluster purification.
  • Explore speaker ID techniques for diarization.

22
Questions?