Julia Hirschberg, Michiel Bacchiani, Phil Isenhour, Aaron Rosenberg, Larry Stead, Steve Whittaker, Jon Wright, and Gary Zamchick (with Martin Jansche, Meredith Ringel, and Litza Stark)

1
SCANMail: Audio Browsing and Retrieval in a Voicemail Domain
  • Julia Hirschberg, Michiel Bacchiani, Phil
    Isenhour, Aaron Rosenberg, Larry Stead, Steve
    Whittaker, Jon Wright, and Gary Zamchick (with
    Martin Jansche, Meredith Ringel, and Litza Stark)

2
The Problem: Navigating Audio Data
  • Increasing amounts of audio data available in
    corporate, public and private collections
    (recorded meetings, broadcast news and
    entertainment, voicemail) but useless without
    tools for searching
  • SCANMail: a prototype tool for searching speech
    data in the voicemail domain

3
SCANMail
  • Inspired by interviews, surveys and usage logs
    identifying problems of heavy voicemail users
      • It's hard to quickly scan through new messages
        to find the ones you need to deal with (e.g.
        during a meeting break)
      • It's hard to find the message you want in your
        archive
      • It's hard to locate the information you want in
        any message (e.g. the telephone number)
  • SCANMail provides technology to help solve these
    problems, supporting content-based audio
    navigation

4
Related Research
  • Cambridge video mail retrieval by voice (1994)
  • NIST TREC Spoken Document Retrieval track
  • IBM voicemail transcription (1998) and
    information extraction (2001)
  • AT&T voicemail user studies (1998)
  • AT&T automatic speaker identification and
    browsing/search for voicemail (2000, 2001)

5
SCANMail Architecture

6
Training Corpus
  • Messages collected from 138 AT&T Labs voicemail
    boxes
  • 100-hour corpus includes 10K messages from 2500
    speakers
  • Hand-labeled for caller id, gender, age,
    recording condition, entities (names, dates,
    telephone numbers)
  • Gender balanced, 12% non-native speakers
  • 10% of calls not from ordinary handsets
  • Mean message duration 36.4 secs, median 30.0 secs

7
ASR Server: baseline system
  • Trained on 60-hour training set
  • Gender independent, 8k tied states, emission
    probabilities modeled by 12-component Gaussian
    mixtures
  • Uses 14k vocabulary and Katz-style backoff
    trigram trained on 700k words
  • Lexicon automatically generated by the AT&T Labs
    NextGen text-to-speech system
  • Decoder uses finite state transducers to
    construct recognition network
  • Initial search pass produces lattices used as
    grammars in all subsequent search passes

8
  • Accuracy
      • 24.4% WER → 21% with adaptation
  • Speed
      • 2x real time for first pass
      • Will approach 5-6x real time for final
        transcription
  • Details
      • Bacchiani (HLT 2000, ICASSP 2000); Hirschberg
        et al. (Eurospeech 2001)
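
Note: the accuracy figures above are word error rates (WER). As a reminder of how that metric is usually computed, here is a minimal sketch based on word-level edit distance. It is not the scoring tool used in the SCANMail evaluation; the function name and the sample transcripts in the usage note are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "please call me at extension nine" against the reference "please call me back at extension five" counts one deletion and one substitution over seven reference words.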

9
ASR Server: rescoring passes
  • Compensation techniques for speaker/channel
    variation and invalid modeling assumptions
  • Gender dependency (GD)
  • Vocal Tract Length Normalization (VTLN) (Kamm et
    al. 1995, Wegmann et al. 1996)
  • Semi-Tied Covariances (STC) (Gales 1999)
  • Constrained Model-space Adaptation (CMA) (Gales
    1998)
  • Maximum Likelihood Linear Regression (MLLR)
    (Leggetter and Woodland 1995)
  • MLLR likelihood-based clustering algorithm to
    ensure sufficient data for compensation
    algorithms (Bacchiani 2000)

10
ASR Transcription Accuracy
System         Normalization       WER (%)
Baseline       --                  34.9
GD             --                  33.3
GD             VTLN                32.3
VTLN           VTLN                32.0
VTLN+STC       VTLN+STC            30.8
VTLN+STC+CMA   VTLN+STC+CMA        29.3
VTLN+STC+CMA   VTLN+STC+CMA+MLLR   28.7

11
(No Transcript)
12
Information Retrieval
  • Uses SMART IR engine (Salton 1971, Buckley 1985)
  • Generates weighted term vectors for ASR
    transcripts and queries and computes similarity
    based on vector inner products
  • Both ASR transcripts and queries are preprocessed
    into tokens by removing common words
    (stop-listing) and stemming
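
Note: the retrieval steps above (stop-listing, stemming, weighted term vectors, inner-product similarity) can be sketched as follows. This is an illustration only, not the SMART engine: the stop list, the crude suffix-stripping stemmer, the smoothed tf-idf weighting and all function names are assumptions made for the example.

```python
import math
from collections import Counter

# Tiny illustrative stop list (assumed; real stop lists are far longer).
STOP_WORDS = {"a", "an", "and", "at", "i", "in", "is", "it", "me", "my",
              "of", "on", "the", "this", "to", "you", "your"}

def stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,:;?!").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]

def build_idf(documents):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}

def term_vector(text, idf):
    """Sparse tf-idf vector; terms unseen in the collection get weight 0."""
    tf = Counter(tokenize(text))
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

def inner_product(u, v):
    return sum(weight * v[term] for term, weight in u.items() if term in v)

def rank(query, documents):
    """Indices of documents with non-zero similarity, best match first."""
    idf = build_idf(documents)
    q = term_vector(query, idf)
    scores = [(inner_product(q, term_vector(doc, idf)), i)
              for i, doc in enumerate(documents)]
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1])) if score > 0]
```

Because both transcripts and queries pass through the same `tokenize` step, a query word such as "meeting" can still match an ASR transcript containing "meeting" even after both are reduced to the stem "meet".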

13
(No Transcript)
14
Information Extraction
  • Extracts entities from the ASR transcripts
  • Old implementation used finite state transducers
    with hand-designed costs
  • New statistical (trainable) system extracts phone
    numbers and caller names
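
Note: to make the extraction task concrete, here is a small rule-based sketch that finds spoken phone numbers in an ASR transcript. It is closer in spirit to the older hand-built approach than to the statistical system mentioned above, and it is not the SCANMail implementation. The digit-word table, the `min_digits` cutoff and the function name are assumptions.

```python
# Spoken forms of digits as they might appear in an ASR transcript (assumed).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def extract_phone_numbers(transcript, min_digits=7):
    """Return (digits, start_token, end_token) for each long run of digit words.

    The token span is half-open: tokens[start_token:end_token] is the run.
    """
    tokens = transcript.lower().split()
    results = []
    run_start, digits = None, []
    # The sentinel is not a digit word, so it flushes a run that ends the message.
    for index, token in enumerate(tokens + ["<end>"]):
        if token in DIGIT_WORDS:
            if run_start is None:
                run_start = index
            digits.append(DIGIT_WORDS[token])
            continue
        if run_start is not None:
            if len(digits) >= min_digits:
                results.append(("".join(digits), run_start, index))
            run_start, digits = None, []
    return results
```

Returning token positions alongside the digits means a time-aligned transcript could map each extracted entity back to the stretch of audio it came from.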

15
(No Transcript)
16
Caller Identification
  • Proposes caller names by matching new incoming
    messages against existing Text-Independent
    Gaussian Mixture Models (TIGMMs)
  • If no PBX-supplied caller identification, caller
    ID hypothesis presented to user
  • Caller models trained/adapted based on user
    feedback
  • Initial model trained after 1 minute of speech
    collected from single caller
  • Model updates with each 20-second increment up to
    180 seconds (mature model)
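
Note: the matching step above can be sketched as scoring a message's feature frames against each caller's Gaussian mixture model and proposing the best-scoring caller only if the score clears a threshold. This is a simplified illustration, not the SCANMail caller-ID code: the diagonal covariances, the raw (unnormalized) log-likelihood threshold and all names are assumptions. The training schedule follows the figures on the slide.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame; gmm is a list of (weight, mean, var) triples."""
    terms = [math.log(w) + log_gaussian(x, mean, var) for w, mean, var in gmm]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def score_message(frames, gmm):
    """Average per-frame log-likelihood, so long and short messages are comparable."""
    return sum(gmm_log_likelihood(f, gmm) for f in frames) / len(frames)

def propose_caller(frames, caller_models, threshold):
    """Best-matching caller name, or None if no model scores above the threshold."""
    scores = {name: score_message(frames, gmm) for name, gmm in caller_models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

def training_points(first=60, step=20, mature=180):
    """Seconds of collected speech at which a caller model is (re)trained."""
    return list(range(first, mature + 1, step))
```

A production system would more likely compare each caller score against a background model rather than a fixed threshold; the fixed threshold is used here only to keep the sketch short.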

17
  • Setting thresholds to keep outgroup acceptance
    low (2.7%), system had 11.5% ingroup rejection
    and 1.2% ingroup confusion for 20-caller
    ingroup.
  • For more detailed experimental results see
    Rosenberg (ICSLP 2000, Eurospeech 2001)
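
Note: the three error rates above can be tallied from a list of identification trials as sketched below. The definitions used here are an assumed reading of the terms (outgroup acceptance: an unknown caller is given an ingroup name; ingroup rejection: a known caller gets no proposal; ingroup confusion: a known caller gets the wrong name), not definitions taken from the cited papers.

```python
def open_set_error_rates(trials):
    """trials: (true_caller, proposed_caller) pairs; None means 'not in the ingroup'."""
    ingroup = [(truth, guess) for truth, guess in trials if truth is not None]
    outgroup = [guess for truth, guess in trials if truth is None]

    def rate(count, total):
        return count / total if total else 0.0

    rejected = sum(1 for _, guess in ingroup if guess is None)
    confused = sum(1 for truth, guess in ingroup
                   if guess is not None and guess != truth)
    accepted = sum(1 for guess in outgroup if guess is not None)
    return {
        "outgroup_acceptance": rate(accepted, len(outgroup)),
        "ingroup_rejection": rate(rejected, len(ingroup)),
        "ingroup_confusion": rate(confused, len(ingroup)),
    }
```

Raising the acceptance threshold lowers outgroup acceptance at the cost of more ingroup rejections, which is the trade-off the slide describes.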

18
(No Transcript)
19
(No Transcript)
20
Email Server
  • Composes multi-part email message and sends to
    address specified in user profile
      • ASR transcript
      • Speech file
      • Entity transcriptions and speech segments
  • Uses time-aligned ASR transcript and IE
    information to include audio excerpts
    corresponding to entities
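
Note: a multi-part message of this shape can be assembled with Python's standard `email` package, as sketched below. This is not the SCANMail email server. The audio format (8 kHz, 16-bit mono PCM), the subject line, the file names and the entity tuple layout are all assumptions for the example.

```python
import io
import wave
from email.message import EmailMessage

SAMPLE_RATE = 8000  # assumed telephone-quality audio
SAMPLE_WIDTH = 2    # assumed 16-bit mono PCM

def pcm_to_wav(pcm):
    """Wrap raw PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()

def excerpt(pcm, start_sec, end_sec):
    """Cut the audio between two time-alignment points (in seconds)."""
    start = int(start_sec * SAMPLE_RATE) * SAMPLE_WIDTH
    end = int(end_sec * SAMPLE_RATE) * SAMPLE_WIDTH
    return pcm[start:end]

def compose_message(to_addr, transcript, pcm, entities):
    """entities: list of (label, text, start_sec, end_sec) from the IE step."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = "New voicemail"
    # Body: the ASR transcript followed by the extracted entities.
    lines = [transcript, ""] + [f"{label}: {text}" for label, text, _, _ in entities]
    msg.set_content("\n".join(lines))
    # Whole message audio, then one short excerpt per entity.
    msg.add_attachment(pcm_to_wav(pcm), maintype="audio", subtype="wav",
                       filename="message.wav")
    for number, (label, _, start, end) in enumerate(entities, start=1):
        msg.add_attachment(pcm_to_wav(excerpt(pcm, start, end)),
                           maintype="audio", subtype="wav",
                           filename=f"{label}-{number}.wav")
    return msg
```

Sending would then be a matter of handing the returned message to an SMTP client; that step is omitted because it depends on the mail setup.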

21
Evaluation: User Studies
  • Compared SCANMail with standard over-the-phone
    interface (Audix)
  • 8 subjects performed fact-finding, relevance
    ranking and summarization tasks
  • SCANMail
      • Better for fact-finding and ranking tasks in
        quality/time measures (p < 0.05)
      • Faster solutions for fact-finding task
        (p < 0.01)
      • Rated higher on all subjective measures
      • Normalized performance scores higher when
        subject employed successful IR searches
        (p < 0.05)

22
Trials
  • 18 subjects in 2-month field trial
  • Usage
      • 52% of messages weren't played completely
        through
      • Only 1% of messages deleted
  • After using SCANMail people thought
      • Scanning messages is difficult (2.8 → 4.7)
      • I frequently replay messages (1.9 → 3.5)
      • I frequently take notes (2.6 → 4.3)
      • It's hard to locate old messages (2.7 → 5.0)
      • It's hard to extract info from messages
        (2.5 → 5.0)

23
Current Status
  • 37 users
  • Recent improvements
  • More accurate ASR
  • Lighter-weight IR (Lucene)
  • Presentation of information as it becomes
    available (e.g. audio only, rough transcript of
    message)
  • Options for SCANMail email
  • First versions of phone and iPAQ interfaces built
    (many interface issues)

24
Research Foci
  • Additional information extracted from messages
    (Jansche & Abney)
  • Dates, times
  • Message gisting
  • Message threading
  • Urgent and personal messages automatically
    identified (Ringel & Hirschberg)
  • Faster/more accurate ASR
  • Migrate client features to email