Title: Data Mining, Information Extraction and Search in Spoken Documents
1Data Mining, Information Extraction and Search in
Spoken Documents
2Today
- Data mining from text
- Searching audio data instead of text
- Information extraction from spoken documents
- Speech data mining
3Data Mining
- Discovery of trends and patterns across very
large datasets, usually for decision-making
purposes
- Fraud detection in banking, telephony
- Stock market
- Indications of demographic disasters
- New causes of diseases
- Finding things you don't know you're looking for
- Information retrieval vs. mining for nuggets
4Data Mining in Computational Linguistics
- Finding lexical co-occurrence information
- Finding parallel text corpora on the web for MT
- Finding new topics in news stories
- TDT task
- Exploring citation links
- Networks of influence
- Information extraction, e.g. find mutual
acquaintances
5- Snowball (Agichtein et al. 01)
- Seed set of patterns (e.g. Norman Mailer, 59 →
NAME, AGE; the 59-year-old Mailer → the
AGE-year-old NAME)
- Find more patterns by looking for e.g. Mailer
close to 59
- Mailer turned 59 last week.
- Though Mailer is 59
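The bootstrapping idea can be sketched in a few lines of Python. This is a toy illustration, not the Snowball system itself: the function names, the regular expressions for names and ages, and the 20-character context limit are all assumptions made for the example. It learns the text found between a known name and age, then reuses that text as a pattern to find new pairs.

```python
import re

# Assumed, deliberately crude shapes for the two slots.
NAME = r"(?P<name>[A-Z][a-z]+)"
AGE = r"(?P<age>\d{1,3})"

def induce_patterns(sentences, seeds):
    """Learn the text between a seed name and age, in either order."""
    patterns = set()
    for sent in sentences:
        for name, age in seeds:
            m = re.search(rf"{re.escape(name)}(.{{1,20}}?){re.escape(age)}\b", sent)
            if m:
                patterns.add(NAME + re.escape(m.group(1)) + AGE)
            m = re.search(rf"\b{re.escape(age)}(.{{1,20}}?){re.escape(name)}", sent)
            if m:
                patterns.add(AGE + re.escape(m.group(1)) + NAME)
    return patterns

def apply_patterns(sentences, patterns):
    """Run every learned pattern over the corpus and collect (name, age) pairs."""
    found = set()
    for sent in sentences:
        for pat in patterns:
            for m in re.finditer(pat, sent):
                found.add((m.group("name"), m.group("age")))
    return found
```

Starting from the single seed ("Mailer", "59"), the contexts " turned " and " is " become patterns that can pick up other people's ages. A real system would also score patterns and tuples for confidence before trusting them.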
6But Searching Audio Data is Harder
- Large amounts of audio data available on the
web, in company archives, in our homes
- We have tools supporting random access to text,
but for audio we're limited to serial search
- How can we develop methods to search audio as
easily as text?
7Applications
- Searching online TV and radio news and archives
- Library of Congress
- Searching a/v archives, movies
- Searching trial recordings and legislative
sessions
- Searching meetings, customer care exchanges,
focus groups
- Telephone calls and voicemail
8Current Approach
- Train/adapt a speech recognizer for the corpus
- Produce an ASR transcript
- Segment spoken documents into sentences, turns,
topics
- Index (errorful) transcripts for Information
Retrieval and link to audio via timestamps
- Enables audio search by content
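The indexing step can be sketched as an inverted index whose postings carry audio timestamps. This is a minimal sketch under assumptions: the input format (a document id mapped to a list of (word, start time in seconds) pairs) and the function names are hypothetical stand-ins for time-aligned ASR output, not the interface of any system named in this lecture.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the (doc_id, start_time) positions where it was recognized."""
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

def search(index, query):
    """Return a playback offset for each document containing every query word."""
    terms = query.lower().split()
    if not terms:
        return {}
    hits = {}
    for term in terms:
        for doc_id, start in index.get(term, []):
            hits.setdefault(doc_id, {}).setdefault(term, []).append(start)
    # Keep documents matching all terms; report the earliest hit as the jump-in point.
    return {
        doc_id: min(t for times in by_term.values() for t in times)
        for doc_id, by_term in hits.items()
        if len(by_term) == len(set(terms))
    }
```

Because recognition errors drop or corrupt words, exact matching like this misses hits; the sketch only shows how timestamps let a text match point back into the audio.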
9Some Examples
- SpeechBot: searching internet broadcasts
- Google Voice Search: search audio by voice (not
yet)
- SCANMail: searching voicemail
10Information Extraction and QA from Speech
- DARPA GALE project: improve information
gathering from text, speech, translations
- Current domain: newswire and news broadcasts in
English, Arabic, and Mandarin
- 3 competing teams
- ASR/MT bakeoffs
- Distillation evaluations
- QA
- User studies
- Requires identification and annotation of
information and formatting in speech
11Sample Distillation Questions
- List facts about [...]
- Find people who are mutual acquaintances of
[...] and [...]
- Identify persons arrested from [...] and
give their name and role in that organization
- Produce a biography of [...]
- Provide information on [...]
- Find statements made by or attributed to
[...] about [...]
- How did [...] react to [...]
12Nightingale Architecture
[Architecture diagram. Labeled components: ASR, MT, audio diarization, speaker modeling, prosodic analysis, prosodic metadata, punctuation and capitalization, linguistic structure, names and relations, topic modeling, automatic annotation, information assimilation, info repository, distillation, intelligence delivery, source language, target language.]
13Information Annotation
- Spoken documents
- Lack many cues found in text documents
- Format (sentences, turns, paragraphs)
- Include spontaneous speech phenomena which are
difficult for ASR and NLP technologies to handle
- Disfluencies, fragments
- Contain errors
- Annotation can turn a weakness into a strength
14From an ASR Transcript
- aides tonight in boston in depth the truth squad
for special series until election day tonight the
truth about the budget surplus of the candidates
are promising the two international flash points
getting worse while the middle east and a new
power play by milosevic and a lifelong a family
tries to say one child life by having another
amazing breakthrough the u s was was told local
own boss good evening uh from the university of
massachusetts in boston the site of the widely
anticipated first of eight between vice president
al gore and governor george w bush with the
election now just five weeks away this is the
beginning of a sprint to the finish and a strong
start here tonight is important this is the stage
for the two candidates will appear before a
national television audience taking questions
from jim lehrer of p b s n b cs david gregory is
here with governor bush claire shipman is
covering the vice president claire you begin
tonight please
15To Speaker Segmentation (Diarization)
- Speaker 0 - aides tonight in boston in depth the
truth squad for special series until election day
tonight the truth about the budget surplus of the
candidates are promising the two international
flash points getting worse while the middle east
and a new power play by milosevic and a lifelong
a family tries to say one child life by having
another amazing breakthrough the u s was was told
local own boss good evening uh from the
university of massachusetts in boston
- Speaker 1 - the site of the widely anticipated
first of eight between vice president al gore and
governor george w bush with the election now
just five weeks away this is the beginning of a
sprint to the finish and a strong start here
tonight is important this is the stage for the
two candidates will appear before a national
television audience taking questions from jim
lehrer of p b s n b cs david gregory is here
with governor bush claire shipman is covering the
vice president claire you begin tonight please
16Add Speaker Role Labels
- Anchor - aides tonight in boston in depth the
truth squad for special series until election day
tonight the truth about the budget surplus of the
candidates are promising the two international
flash points getting worse while the middle east
and a new power play by milosevic and a lifelong
a family tries to say one child life by having
another amazing breakthrough the u s was was told
local own boss good evening uh from the
university of massachusetts in boston
- Reporter - the site of the widely anticipated
first of eight between vice president al gore and
governor george w bush with the election now
just five weeks away this is the beginning of a
sprint to the finish and a strong start here
tonight is important this is the stage for the
two candidates will appear before a national
television audience taking questions from jim
lehrer of p b s n b cs david gregory is here
with governor bush claire shipman is covering the
vice president claire you begin tonight please
17Perform Sentence Detection and Punctuation
- Anchor - Aides tonight in boston. In depth the
truth squad for special series until election
day. Tonight the truth about the budget surplus
of the candidates are promising. The two
international flash points getting worse. While
the middle east. And a new power play by
milosevic and a lifelong a family tries to say
one child life by having another amazing
breakthrough the u. s. was was told local own
boss. Good evening uh from the university of
massachusetts in boston.
- Reporter - The site of the widely anticipated
first of eight between vice president al gore and
governor george w. bush. With the election now
just five weeks away. This is the beginning of a
sprint to the finish. And a strong start here
tonight is important. This is the stage for the
two candidates will appear before a national
television audience taking questions from jim
lehrer of p. b. s. n. b. c.'s david gregory is
here with governor bush. Claire shipman is
covering the vice president claire you begin
tonight please.
18Detect Story Boundaries
- Anchor - Aides tonight in boston. In depth the
truth squad for special series until election
day. Tonight the truth about the budget surplus
of the candidates are promising. The two
international flash points getting worse. While
the middle east. And a new power play by
milosevic and a lifelong a family tries to say
one child life by having another amazing
breakthrough the u. s. was was told local own
boss. Good evening uh from the university of
massachusetts in boston.
- Reporter - The site of the widely anticipated
first of eight between vice president al gore and
governor george w. bush. With the election now
just five weeks away. This is the beginning of a
sprint to the finish. And a strong start here
tonight is important. This is the stage for the
two candidates will appear before a national
television audience taking questions from jim
lehrer of p. b. s. n. b. c.'s david gregory is
here with governor bush. Claire shipman is
covering the vice president claire you begin
tonight please.
19Detect Disfluencies (and Keep/Remove)
- Anchor - Aides tonight in boston. In depth the
truth squad for special series until election
day. Tonight the truth about the budget surplus
of the candidates are promising. The two
international flash points getting worse. While
the middle east. And a new power play by
milosevic and a lifelong a family tries to say
one child life by having another amazing
breakthrough the u. s. was was told local own
boss. Good evening uh from the university of
massachusetts in boston.
- Reporter - The site of the widely anticipated
first of eight between vice president al gore and
governor george w. bush. With the election now
just five weeks away. This is the beginning of a
sprint to the finish. And a strong start here
tonight is important. This is the stage for the
two candidates will appear before a national
television audience taking questions from jim
lehrer of p. b. s. n. b. c.'s david gregory is
here with governor bush. Claire shipman is
covering the vice president claire you begin
tonight please.
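As a rough illustration of the keep/remove decision, the sketch below tags two easy cases that appear in the transcript above: filled pauses ("uh") and immediate word repetitions ("was was"). The two rules, the filler list, and the function names are assumptions for the example; they are hand-written heuristics, not the detector used in the system described here, and they ignore harder cases such as fragments and restarts.

```python
FILLED_PAUSES = {"uh", "um", "er"}  # assumed filler inventory

def tag_disfluencies(words):
    """Label each token 'filler', 'repeat', or 'ok' using two toy rules."""
    tags = []
    for i, w in enumerate(words):
        token = w.lower()
        if token in FILLED_PAUSES:
            tags.append("filler")
        elif i + 1 < len(words) and words[i + 1].lower() == token:
            # The first copy of an immediate repetition is treated as the part to drop.
            tags.append("repeat")
        else:
            tags.append("ok")
    return tags

def remove_disfluencies(words):
    """Keep only the tokens tagged 'ok'."""
    return [w for w, t in zip(words, tag_disfluencies(words)) if t == "ok"]
```

Keeping the tags separate from the removal step mirrors the slide title: downstream components can choose to keep the disfluencies as annotation or strip them.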
20Detect Named Entities
- Anchor - Aides tonight in Boston. In depth the
truth squad for special series until election
day. Tonight the truth about the budget surplus
of the candidates are promising. The two
international flash points getting worse. While
the middle east. And a new power play by
Milosevic and a lifelong a family tries to say
one child life by having another amazing
breakthrough the u. s. was was told local own
boss. Good evening from the University of
Massachusetts in Boston.
- Reporter - The site of the widely anticipated
first of eight between vice president Al Gore and
Governor George W. Bush. With the election now
just five weeks away. This is the beginning of a
sprint to the finish. And a strong start here
tonight is important. This is the stage for the
two candidates will appear before a national
television audience taking questions from Jim
Lehrer of P.B.S. N.B.C.'s David Gregory is here
with Governor Bush. Claire Shipman is covering
the vice president Claire you begin tonight
please.
21Resolve References
- Anchor - Aides tonight in Boston. In depth the
truth squad for special series until election
day. Tonight the truth about the budget surplus
of the candidates are promising. The two
international flash points getting worse. While
the middle east. And a new power play by
Milosevic and a lifelong a family tries to say
one child life by having another amazing
breakthrough the u. s. was was told local own
boss. Good evening from the University of
Massachusetts in Boston.
- Reporter - The site of the widely anticipated
first of eight between vice president Al Gore and
Governor George W. Bush. With the election now
just five weeks away. This is the beginning of a
sprint to the finish. And a strong start here
tonight is important. This is the stage for the
two candidates will appear before a national
television audience taking questions from Jim
Lehrer of P.B.S. N.B.C.'s David Gregory is here
with Governor Bush [Governor George W. Bush].
Claire Shipman is covering the vice president
Claire [Claire Shipman] you begin tonight please.
22Story Segmentation in English, Mandarin and
Arabic Broadcast News
- Given a TDT-4 broadcast, identify transitions
between stories (topics)
- Novel contributions
- Entirely from automatic segmentation/identification
of words, speakers, sentences from speech
- All features automatically extracted
- Application to Arabic BN
23Approach
- Rule Induction ML algorithm
- Lexical Features (L)
- TextTiling (Hearst 97), LCSeg (Galley et al. 03)
- Story beginning and ending cue unigrams trained
on corpus
- Acoustic Features (A)
- F0, Intensity, Duration, Speaking Rate
- Speaker Features (S)
- Based on ICSI automatic segmentation
- Acoustic normalizations
- Speaker Participation features
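One kind of lexical feature can be illustrated with a cohesion score: compare the words just before a sentence gap with the words just after it, and treat low similarity as evidence of a story change. The sketch below is a much simplified relative of the TextTiling idea, not Hearst's algorithm and not the rule-induction classifier from this slide; the window size, whitespace tokenization, and function names are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion_scores(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores

def likely_boundary(sentences, window=2):
    """Index of the sentence that follows the least cohesive gap."""
    scores = cohesion_scores(sentences, window)
    return 1 + scores.index(min(scores))
```

In a classifier setup such a score would be one feature among many, alongside the cue-word, acoustic, and speaker features listed above.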
24Results - English
25Results - Mandarin
26Results - Arabic
27Speech Data Mining
- How does it differ from text data mining?
- Must handle errorful transcription
- Lacks (reliable) formatting
- Contains spontaneous speech phenomena
- We need to bring additional sources to bear on
the problem
28Maskey et al. 2004: Improving Proper Name
Transcription in Voicemail
- How can we improve transcription of proper names
without increasing the size of the ASR lexicon?
- Use meta-data available at runtime to hypothesize
callers' and callees' names
- Caller ID string: cname
- Name of mailbox owner: mname
29Corpus
- SCANMail corpus
- 100 hours of voicemail messages from 140
employees of AT&T.
- Manually transcribed with cname and mname
tags
- Gender balanced
- 12 non-native speakers
- 238 random messages for testing, rest (10,000
messages) for training
30Approach
- Create a class-based language model
- Create a name network to give instances for the
classes of the model
- Replace the class-based language model at
runtime with the appropriate name networks,
identified from the cname and mname of the call
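The class-based idea can be sketched as a factorization: the language model predicts a class token where a name belongs, and a per-call name network supplies the probability of each concrete name within that class. Everything specific below is an assumption for illustration: the bigram table, the `<cname>` token spelling, the uniform weighting inside the network, and the function names. It is not the paper's implementation.

```python
CLASS_TOKEN = "<cname>"  # hypothetical class symbol standing in for any caller name

def load_network(directory, cname):
    """Pick the name network for this call; spread probability uniformly over variants."""
    variants = directory.get(cname, [])
    return {v: 1.0 / len(variants) for v in variants}

def word_prob(word, prev, bigram, name_network):
    """P(word | prev) under a bigram LM trained with names replaced by CLASS_TOKEN."""
    if word in name_network:
        # P(name | prev) = P(class | prev) * P(name | class)
        return bigram.get((prev, CLASS_TOKEN), 0.0) * name_network[word]
    return bigram.get((prev, word), 0.0)
```

The point of the design is that the base model never needs the full set of names in its lexicon; only the handful of names relevant to the current call are plugged in at runtime.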
31Name Network
- To get values for mname and cname, an
internal AT&T employee directory listing (40,000
people) was used
- cname created from variations of static titles
(Miss, Mr), full first names and nicknames
(Alexander, Alex), and last names (Jones)
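A name network for one directory entry can be pictured as the set of surface forms a caller might say. The sketch below enumerates a few such forms from the ingredients listed above; which combinations are allowed (first name alone, first plus last, title plus last) is an assumption for the example, as is the function name.

```python
def cname_variants(first_names, last_name, titles=("Miss", "Mr")):
    """Enumerate surface forms a caller might use for one directory entry."""
    variants = set()
    for first in first_names:
        variants.add(first)                    # first name or nickname alone
        variants.add(f"{first} {last_name}")   # first name or nickname plus last name
    for title in titles:
        variants.add(f"{title} {last_name}")   # title plus last name
    return sorted(variants)
```

For the slide's example, `cname_variants(["Alexander", "Alex"], "Jones")` yields six forms, including "Alex Jones" and "Mr Jones".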
32Name Network
- Probability within class: training corpus
- Probability within first names: AT&T directory
listing
33Experimental Results
- Word Error Rate (WER) improvement small
- Absolute reduction of 0.6%
- Named Error Rate (NER) improvement significant
- Absolute reduction of 20%
- Large reduction in NER important
- Getting a name right is important to business
users
- SCANMail users expressed a strong desire for the
system to recognize their own names correctly
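For reference, word error rate is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is the standard dynamic-programming computation, assuming whitespace tokenization; the function name is made up for the example, and it does not cover the name error rate, whose exact definition is not given on the slide.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A single misrecognized name changes WER very little in a long message, which is one way to see why a small WER gain can coexist with a large gain on names.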
34Next Class
- HTK Toolkit and HW5 (Fadi Biadsy)