1
SCANMail: Audio Browsing and Retrieval in a Voicemail Domain

Julia Hirschberg, Michiel Bacchiani, Phil
Isenhour, Aaron Rosenberg, Larry Stead, Steve
Whittaker, Jon Wright, and Gary Zamchick (with
Martin Jansche, Meredith Ringel, and Litza Stark)

2
The Problem: Navigating Audio Data

Increasing amounts of audio data available in corporate, public and private collections (recorded meetings, broadcast news and entertainment, voicemail), but useless without tools for searching
SCANMail: prototype tool for searching speech data in the voicemail domain

3
SCANMail

Inspired by interviews, surveys and usage logs identifying problems of heavy voicemail users
  It's hard to quickly scan through new messages to find the ones you need to deal with (e.g. during a meeting break)
  It's hard to find the message you want in your archive
  It's hard to locate the information you want in any message (e.g. the telephone number)
SCANMail provides technology to help solve these problems, supporting content-based audio navigation

4
Related Research

Cambridge video mail retrieval by voice (1994)
NIST TREC Spoken Document Retrieval track
IBM voicemail transcription (1998) and information extraction (2001)
AT&T voicemail user studies (1998)
AT&T automatic speaker identification and browsing/search for voicemail (2000, 2001)

5
SCANMail Architecture
6
Training Corpus

Messages collected from 138 AT&T Labs voicemail boxes
100 hr corpus includes 10K messages from 2500 speakers
Hand-labeled for caller ID, gender, age, recording condition, entities (names, dates, telephone numbers)
Gender balanced, 12% non-native speakers
10% of calls not from ordinary handsets
Mean message duration 36.4 secs, median 30.0 secs

7
ASR Server: baseline system

Trained on 60 hour training set
Gender independent, 8k tied states, emission probabilities modeled by 12 component Gaussian mixtures
Uses 14k vocabulary and Katz-style backoff trigram trained on 700k words
Lexicon automatically generated by the AT&T Labs NextGen text to speech system
Decoder uses finite state transducers to construct recognition network
Initial search pass produces lattices used as grammars in all subsequent search passes

Accuracy
  24.4% WER → 21% with adaptation
Speed
  2x real time for first pass
  Will approach 5-6x real time for final transcription
Details
  Bacchiani (HLT 2000, ICASSP 2000), Hirschberg et al. (Eurospeech 2001)

9
ASR Server: rescoring passes

Compensation techniques for speaker/channel variation and invalid modeling assumptions
  Gender dependency (GD)
  Vocal Tract Length Normalization (VTLN) (Kamm et al. 1995, Wegmann et al. 1996)
  Semi-Tied Covariances (STC) (Gales 1999)
  Constrained Model-space Adaptation (CMA) (Gales 1998)
  Maximum Likelihood Linear Regression (Leggetter and Woodland 1995)
MLLR likelihood-based clustering algorithm to ensure sufficient data for compensation algorithms (Bacchiani 2000)

10
ASR Transcription Accuracy

System          Normalization        WER (%)
Baseline        --                   34.9
GD              --                   33.3
GD              VTLN                 32.3
VTLN            VTLN                 32.0
VTLN+STC        VTLN+STC             30.8
VTLN+STC+CMA    VTLN+STC+CMA         29.3
VTLN+STC+CMA    VTLN+STC+CMA+MLLR    28.7

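The accuracy figures on this slide are word error rates. As a reading aid, here is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by reference length) and of the relative reduction between two rows of the table. The function names `wer` and `relative_reduction` are illustrative, not taken from the SCANMail system, and the slides do not say which scoring tool was actually used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer
```

Applied to the first and last rows of the table, `relative_reduction(34.9, 28.7)` is about 0.178, that is, the full set of rescoring passes removes roughly 18% of the baseline errors.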
12
Information Retrieval

Uses SMART IR engine (Salton 1971, Buckley 1985)
Generates weighted term vectors for ASR transcripts and queries and computes similarity based on vector inner products
Both ASR transcripts and queries are preprocessed into tokens by removing common words (stop-listing) and stemming
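The retrieval steps on this slide (stop-listing, stemming, weighted term vectors, inner-product similarity) can be sketched in a few lines. This is an illustration only, not the SMART engine: the stop list, the crude suffix-stripping stemmer and the tf x idf weighting below are all assumptions chosen for brevity, and SMART supports several weighting schemes that the slides do not specify.

```python
import math
from collections import Counter

# Tiny illustrative stop list (assumption; real stop lists are much longer).
STOP_WORDS = {"a", "about", "an", "and", "at", "the", "is", "it", "me",
              "of", "on", "to", "this"}


def stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    changed = True
    while changed:
        changed = False
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def tokenize(text: str) -> list:
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_index(transcripts: list):
    """Return one tf x idf term vector per transcript, plus the idf table."""
    tokenized = [tokenize(t) for t in transcripts]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(transcripts)
    idf = {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def score(query: str, vectors: list, idf: dict) -> list:
    """Inner product of the query vector with each transcript vector."""
    q = {term: tf * idf.get(term, 0.0)
         for term, tf in Counter(tokenize(query)).items()}
    return [sum(w * vec.get(term, 0.0) for term, w in q.items())
            for vec in vectors]
```

Because both transcripts and queries pass through the same `tokenize` step, a query such as "budget meetings" matches a transcript that contains "budget meeting", and transcripts sharing no query terms score zero.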

14
Information Extraction

Extracts entities from the ASR transcripts
Old implementation used finite state transducers with hand designed costs
New statistical (trainable) system extracts phone numbers and caller names
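To make the extraction task concrete, here is a deliberately simple rule-based sketch that pulls candidate phone numbers out of a transcript in which digits are spelled out as words. It is neither the finite state transducer system nor the trainable statistical system named on the slide; the word-to-digit table, the `min_digits` cutoff and the function name are assumptions made for illustration.

```python
# Spoken forms of digits as they might appear in an ASR transcript (assumption).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def extract_phone_numbers(transcript: str, min_digits: int = 7) -> list:
    """Return maximal runs of consecutive digit words with at least min_digits digits."""
    numbers = []
    run = []
    # The trailing empty token acts as a sentinel that flushes the final run.
    for token in transcript.lower().split() + [""]:
        digit = DIGIT_WORDS.get(token.strip(".,"))
        if digit is not None:
            run.append(digit)
        else:
            if len(run) >= min_digits:
                numbers.append("".join(run))
            run = []
    return numbers
```

A rule like this breaks as soon as the recognizer misrecognizes or drops a digit, which is one reason a trainable extractor is attractive on noisy ASR output.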

16
Caller Identification

Proposes caller names by matching new incoming messages against existing Text Independent Gaussian Mixture Models (TIGMMs)
If no PBX-supplied caller identification, caller ID hypothesis presented to user
Caller models trained/adapted based on user feedback
  Initial model trained after 1 minute of speech collected from single caller
  Model updates with each 20 sec increment up to 180 sec (mature model)
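The training schedule above can be written down as a small function. This is one reading of the slide, not the actual implementation: it assumes updates fire at 80, 100, ... 180 seconds of accumulated speech from a caller, and the stage names and the function name `model_state` are made up for the example.

```python
INITIAL_SEC = 60       # initial model after 1 minute of speech
UPDATE_STEP_SEC = 20   # one update per additional 20 seconds
MATURE_SEC = 180       # no further updates beyond this point


def model_state(speech_sec: float):
    """Return (stage, number_of_updates) for a caller with speech_sec of labeled speech."""
    if speech_sec < INITIAL_SEC:
        return ("untrained", 0)
    capped = min(speech_sec, MATURE_SEC)
    updates = int((capped - INITIAL_SEC) // UPDATE_STEP_SEC)
    if speech_sec >= MATURE_SEC:
        stage = "mature"
    elif updates == 0:
        stage = "initial"
    else:
        stage = "adapting"
    return (stage, updates)
```

Under these assumptions a model reaches maturity after six updates, and with the corpus mean message length of about 36 seconds that corresponds to roughly five messages from the same caller.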

Setting thresholds to keep outgroup acceptance low (2.7%), system had 11.5% ingroup rejection and 1.2% ingroup confusion for 20-caller ingroup
For more detailed experimental results see Rosenberg (ICSLP 2000, Eurospeech 2001)
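The three error rates quoted here come from an open-set identification decision: pick the best-scoring caller model, then accept it only if the score clears a threshold. The sketch below shows that decision and how the three rates would be tallied. It assumes the usual open-set definitions of the terms (the slides do not define them), takes per-caller model scores as given rather than computing GMM likelihoods, and uses hypothetical names throughout.

```python
from typing import Optional


def identify(scores: dict, threshold: float) -> Optional[str]:
    """Return the best-scoring caller, or None if the best score is below threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def error_rates(trials: list, threshold: float) -> dict:
    """trials: list of (true_caller, scores); true_caller is None for outgroup callers."""
    in_total = out_total = rejected = confused = accepted_out = 0
    for true_caller, scores in trials:
        guess = identify(scores, threshold)
        if true_caller is None:
            out_total += 1
            if guess is not None:
                accepted_out += 1   # unknown caller accepted as a known one
        else:
            in_total += 1
            if guess is None:
                rejected += 1       # known caller not recognized
            elif guess != true_caller:
                confused += 1       # known caller labeled as another known caller
    return {
        "outgroup_acceptance": accepted_out / out_total if out_total else 0.0,
        "ingroup_rejection": rejected / in_total if in_total else 0.0,
        "ingroup_confusion": confused / in_total if in_total else 0.0,
    }
```

Raising the threshold lowers outgroup acceptance at the cost of more ingroup rejection, which is the trade-off the slide describes.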

20
Email Server

Composes multi-part email message and sends to address specified in user profile
  ASR transcript
  Speech file
  Entity transcriptions and speech segments
Uses time aligned ASR transcript and IE information to include audio excerpts corresponding to entities
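A message of this shape can be assembled with Python's standard `email` package. The sketch below is an assumption about layout, not the SCANMail email server: the subject line, file names, the `(label, text, clip_bytes)` entity tuples and both function names are hypothetical, and the audio is passed in as already-cut bytes rather than sliced from the recording.

```python
from email.message import EmailMessage


def entity_time_span(aligned_words: list, first: int, last: int):
    """aligned_words: list of (word, start_sec, end_sec) from the time-aligned transcript.

    Returns the (start_sec, end_sec) range covering words first..last inclusive,
    i.e. the stretch of audio to excerpt for one entity.
    """
    return aligned_words[first][1], aligned_words[last][2]


def build_voicemail_email(to_addr: str, transcript: str,
                          audio_bytes: bytes, entities: list) -> EmailMessage:
    """entities: list of (label, text, clip_bytes), one per extracted entity."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = "New voicemail"  # placeholder subject (assumption)
    # Text part: the ASR transcript followed by the entity transcriptions.
    body = [transcript, ""]
    body += [f"{label}: {text}" for label, text, _clip in entities]
    msg.set_content("\n".join(body))
    # Attachment: the full speech file.
    msg.add_attachment(audio_bytes, maintype="audio", subtype="wav",
                       filename="message.wav")
    # Attachments: one audio excerpt per entity.
    for i, (label, _text, clip) in enumerate(entities):
        msg.add_attachment(clip, maintype="audio", subtype="wav",
                           filename=f"{label}_{i}.wav")
    return msg
```

The resulting object can be handed to `smtplib.SMTP.send_message` once a sender address is set; cutting the excerpt itself would use the time range from `entity_time_span` against the recording.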

21
Evaluation: User Studies

Compared SCANMail with standard over-the-phone interface (Audix)
8 subjects performed fact-finding, relevance ranking and summarization tasks
SCANMail
  Better for fact-finding and ranking tasks in quality/time measures (p < 0.05)
  Faster solutions for fact-finding task (p < 0.01)
  Rated higher on all subjective measures
Normalized performance scores higher when subject employed successful IR searches (p < 0.05)

22
Trials

18 subjects in 2-month field trial
Usage
  52% of messages weren't played completely through
  Only 1% of messages deleted
After using SCANMail people thought
  Scanning messages is difficult (2.8 → 4.7)
  I frequently replay messages (1.9 → 3.5)
  I frequently take notes (2.6 → 4.3)
  It's hard to locate old messages (2.7 → 5.0)
  It's hard to extract info from messages (2.5 → 5.0)

23
Current Status

37 users
Recent improvements
  More accurate ASR
  Lighter-weight IR (Lucene)
  Presentation of information as it becomes available (e.g. audio only, rough transcript of message)
  Options for SCANMail email
First versions of phone and iPAQ interfaces built (many interface issues)

24
Research Foci