Title: Discovering and Utilizing Structure in Large Unstructured Text Datasets
1Discovering and Utilizing Structure in Large Unstructured Text Datasets
- Eugene Agichtein, Math and Computer Science Department
2Information Extraction Example
- Information extraction systems represent text in
structured form
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
3How can information extraction help?
Structured Relation
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide input to data mining and statistical analysis
4Goal Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IESys 4
IESys 3
IESys 2
IESys 1
Historical news, breaking news stories, wire, alerts, ...
Current Patient Records: diagnosis, physicians' notes, lab results/analysis, ...
911 Calls: traffic accidents, ...
5Challenges in Information Extraction
- Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec/document x 5 billion docs = 158 CPU years
- Approach: learn from data (Bootstrapping)
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
6Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation
7What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
8What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
9What is Information Extraction
As a family of techniques
Information Extraction = segmentation + classification + association + clustering
Segmented and classified mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
13IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
14Information Extraction Tasks
- Extracting entities and relations
  - Entities
    - Named (e.g., Person)
    - Generic (e.g., disease name)
  - Relations
    - Entities related in a predefined way (e.g., Location of a Disease outbreak)
    - Discovered automatically
- Common information extraction steps
  - Preprocessing: sentence chunking, parsing, morphological analysis
  - Rules/extraction patterns: manual, machine learning, and hybrid
  - Applying extraction patterns to extract new information
  - Postprocessing and complex extraction (not covered)
    - Co-reference resolution
    - Combining relations into events, rules, ...
15Two kinds of IE approaches
- Knowledge Engineering
  - rule-based
  - developed by experienced language engineers
  - makes use of human intuition
  - requires only a small amount of training data
  - development can be very time consuming
  - some changes may be hard to accommodate
- Machine Learning
  - uses statistics or other machine learning
  - developers do not need LE expertise
  - requires large amounts of annotated training data
  - some changes may require re-annotation of the entire training corpus
  - annotators are cheap (but you get what you pay for!)
16Extracting Entities from Text
(Figure: a gallery of models for entity extraction, each illustrated on the sentence "Abraham Lincoln was born in Kentucky.")
- Classify pre-segmented candidates: a classifier asks "which class?" for each candidate span
- Lexicons: ask "member?" against lists such as Alabama, Alaska, ..., Wisconsin, Wyoming
- Sliding window: classify each window, trying alternate window sizes
- Boundary models: classify BEGIN/END boundaries around spans
- Finite state machines: find the most likely state sequence
- Context-free grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S)
- ... and beyond
Any of these models can be used to capture words, formatting, or both.
17Hidden Markov Models
(Figure: the HMM drawn both as a graphical model and as a finite state model, with states s_{t-1}, s_t, s_{t+1}, transitions between them, and observations o_{t-1}, o_t, o_{t+1}; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.)
Parameters, for all states S = {s1, s2, ...}:
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior).
18IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi decoding):
  Yesterday Lawrence Saul spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Lawrence Saul
(A minimal decoding sketch follows.)
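As a concrete illustration of this decoding step, here is a minimal Viterbi sketch over a toy two-state HMM (a "person-name" state and an "other" state). The transition and emission numbers below are made up for illustration, not parameters of any trained model such as the one on the slide.

import math

# Toy HMM: two states. All probabilities below are illustrative, not trained.
states = ["OTHER", "PERSON"]
start = {"OTHER": 0.95, "PERSON": 0.05}
trans = {"OTHER": {"OTHER": 0.8, "PERSON": 0.2},
         "PERSON": {"OTHER": 0.5, "PERSON": 0.5}}

def emit(state, word):
    """Illustrative emission model: capitalized tokens are more likely under PERSON."""
    if state == "PERSON":
        return 0.7 if word[:1].isupper() else 0.02
    return 0.1 if word[:1].isupper() else 0.9

def viterbi(words):
    # delta[t][s] = log-probability of the best path ending in state s at position t
    delta = [{s: math.log(start[s]) + math.log(emit(s, words[0])) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans[p][s]))
            delta[t][s] = (delta[t - 1][best_prev] + math.log(trans[best_prev][s])
                           + math.log(emit(s, words[t])))
            back[t][s] = best_prev
    # Trace back the most likely state sequence
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

words = "Yesterday Lawrence Saul spoke this example sentence .".split()
labels = viterbi(words)
person = [w for w, l in zip(words, labels) if l == "PERSON"]
print(list(zip(words, labels)))
print("Person name:", " ".join(person))  # -> Person name: Lawrence Saul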
19HMM Example Nymble
Bikel et al. 1998 (BBN IdentiFinder)
Task: Named Entity Extraction
(Figure: state diagram with start-of-sentence and end-of-sentence states and name-class states such as Person, Org, Other, plus five other name classes.)
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t)
Trained on 450k words of newswire text.
Results:
  Case    Language    F1
  Mixed   English     93
  Upper   English     91
  Mixed   Spanish     90
Other examples of shrinkage for HMMs in IE: Freitag and McCallum '99.
20Relation Extraction
- Extract structured relations from text
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
21Relation Extraction
- Typically requires entity tagging as preprocessing
- Knowledge Engineering
  - Rules defined over lexical items
    - <company> located in <location>
  - Rules defined over parsed text
    - ((Obj <company>) (Verb located) () (Subj <location>))
  - Proteus, GATE, ...
- Machine Learning-based
  - Learn rules/patterns from examples
    - Dan Roth 2005, Cardie 2006, Mooney 2005, ...
  - Partially-supervised: bootstrap from seed examples
    - Agichtein & Gravano 2000, Etzioni et al. 2004, ...
  - Recently, hybrid models: Feldman 2004, 2006
22Comparison of Approaches
- Significant effort: use language-engineering environments to help experts create extraction patterns
  - GATE 2002, Proteus 1998
- Substantial effort: train system over manually labeled data
  - Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996
- Minimal effort: exploit large amounts of unlabeled data
  - DIPRE (Brin 1998), Snowball (Agichtein & Gravano 2000)
  - Etzioni et al. (04): KnowItAll, extracting unary relations
  - Yangarber et al. (00, 02): pattern refinement, generalized names detection
23The Snowball System Overview
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
Snowball
... ... ..
24Snowball Getting User Input
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
- User input
- a handful of example instances
- integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.
25Snowball Finding Example Occurrences
Can use any full-text search engine
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Search Engine
Computer servers at Microsoft's headquarters in Redmond...
In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp...
The Armonk-based IBM introduced a new line...
Change of guard at IBM Corporation's headquarters near Armonk, NY...
26Snowball Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, ... (MITRE's Alembic, IBM's Talent, LingPipe, ...); tagged mentions are shown in brackets below:
Computer servers at [Microsoft]'s headquarters in [Redmond]...
In mid-afternoon trading, shares of [Redmond, WA]-based [Microsoft Corp]...
The [Armonk]-based [IBM] introduced a new line...
Change of guard at [IBM Corporation]'s headquarters near [Armonk, NY]...
27Snowball Extraction Patterns
- General extraction pattern model
  - {acceptor0, Entity, acceptor1, Entity, acceptor2}
- Acceptor instantiations
  - String Match (accepts the string "'s headquarters in")
  - Vector-Space (vector <'s 0.5>, <headquarters 0.5>, <in 0.5>)
  - Sequence Classifier (Prob(T valid | 's, headquarters, in))
    - HMMs, sparse sequences, Conditional Random Fields, ...
28Snowball Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
Example occurrence vector:
  ORGANIZATION  <'s 0.57>, <headquarters 0.57>, <in 0.57>  LOCATION
29Snowball Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids.
Example pattern (cluster centroid):
  ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION
(A small sketch of steps 1-3 follows.)
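A minimal sketch of this pattern-generation step, using simple single-pass clustering with a cosine-similarity threshold; the occurrence contexts and the 0.7 threshold are made up for illustration and are not the exact procedure or parameters used in Snowball.

from collections import Counter, defaultdict
import math

def vectorize(terms):
    """Turn a list of middle-context terms into a unit-length term vector."""
    counts = Counter(terms)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def centroid(vectors):
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}

# Step 1: occurrences of seed tuples, represented by the terms between the
# <ORGANIZATION> and <LOCATION> tags (illustrative contexts).
occurrences = [["'s", "headquarters", "in"],
               ["'s", "headquarters", "in"],
               ["'s", "new", "headquarters", "in"],
               ["-", "based"],
               ["-", "based"]]
vectors = [vectorize(o) for o in occurrences]

# Step 2: single-pass clustering with a similarity threshold (0.7 is illustrative).
clusters = []
for vec in vectors:
    for cluster in clusters:
        if cosine(vec, centroid(cluster)) >= 0.7:
            cluster.append(vec)
            break
    else:
        clusters.append([vec])

# Step 3: patterns are the (filtered) cluster centroids.
patterns = [centroid(c) for c in clusters]
for p in patterns:
    print({t: round(w, 2) for t, w in sorted(p.items())})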
30Vector Space Clustering
31Snowball Extracting New Tuples
Match tagged text fragments against patterns.
Text fragment: "Google 's new headquarters in Mountain View are ..."
  ORGANIZATION  <'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>  LOCATION  <are 1>
Candidate patterns and match scores:
  P1: ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION       Match = 0.8
  P2: ORGANIZATION  <located 0.71>, <in 0.71>  LOCATION            Match = 0.4
  P3: LOCATION  <- 0.71>, <based 0.71>  ORGANIZATION               Match = 0
(A matching sketch follows.)
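A minimal sketch of this matching step, using cosine similarity between the fragment context and each pattern centroid; the fragment, pattern weights, and the 0.5 acceptance threshold are illustrative (the resulting scores differ slightly from the rounded values on the slide).

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Context between the tagged ORGANIZATION and LOCATION in
# "Google 's new headquarters in Mountain View are ..." (illustrative weights).
fragment = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}
fragment_tags = ("ORGANIZATION", "LOCATION")

# Patterns (cluster centroids) with their tag order; weights are illustrative.
patterns = {
    "P1": ({"'s": 0.71, "headquarters": 0.71}, ("ORGANIZATION", "LOCATION")),
    "P2": ({"located": 0.71, "in": 0.71}, ("ORGANIZATION", "LOCATION")),
    "P3": ({"-": 0.71, "based": 0.71}, ("LOCATION", "ORGANIZATION")),
}

THRESHOLD = 0.5  # accept the candidate tuple if some pattern matches this well
best_name, best_score = None, 0.0
for name, (vec, tags) in patterns.items():
    # The tag order must agree before the term vectors are compared.
    score = cosine(fragment, vec) if tags == fragment_tags else 0.0
    print(name, round(score, 2))
    if score > best_score:
        best_name, best_score = name, score

if best_score >= THRESHOLD:
    print("Extract <Google, Mountain View> via", best_name, "with match", round(best_score, 2))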
32Snowball Evaluating Patterns
Automatically estimate pattern confidence:
  Conf(P4) = Positive / Total = 2/3 = 0.66
Pattern P4:  ORGANIZATION  <, 1>  LOCATION
Current seed tuples:
  Organization   Headquarters
  IBM            Armonk
  Intel          Santa Clara
  Microsoft      Redmond
Occurrences matched by P4, checked against the seed tuples:
  "IBM, Armonk, reported ..."                                      Positive
  "Intel, Santa Clara, introduced ..."                             Positive
  "Bet on Microsoft, New York-based analyst Jane Smith said ..."   Negative
33Snowball Evaluating Tuples
Automatically evaluate tuple confidence Conf(T).
A tuple has high confidence if it is generated by high-confidence patterns:
  Conf(T) = 1 - product over matching patterns Pi of (1 - Conf(Pi) * Match(Pi))
Example, for the candidate tuple <3COM, Santa Clara>:
  P4 (Conf = 0.66): ORGANIZATION <, 1> LOCATION,                    Match = 0.4
  P3 (Conf = 0.95): LOCATION <- 0.75>, <based 0.75> ORGANIZATION,   Match = 0.8
  Conf(T) = 1 - (1 - 0.66 * 0.4) * (1 - 0.95 * 0.8) = 0.83
(A small sketch of this computation follows.)
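A minimal sketch of the tuple-confidence computation under the noisy-or style combination above, using the pattern confidences and match scores from the example.

def tuple_confidence(evidence):
    """evidence is a list of (pattern_confidence, match_score) pairs
    collected for one candidate tuple."""
    remaining_doubt = 1.0
    for pattern_conf, match in evidence:
        remaining_doubt *= (1.0 - pattern_conf * match)
    return 1.0 - remaining_doubt

# <3COM, Santa Clara> was matched by P4 (conf 0.66, match 0.4)
# and P3 (conf 0.95, match 0.8), as in the example above.
print(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]))  # ~0.823, i.e. roughly the 0.83 shown above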
34Snowball Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... .... ..
... .... ..
Keep only high-confidence tuples for next
iteration
35Snowball Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
Start new iteration with expanded example set
Iterate until no new tuples are extracted
36Pattern-Tuple Duality
- A good tuple
  - Extracted by good patterns
  - Tuple weight reflects goodness
- A good pattern
  - Generated by good tuples
  - Extracts good new tuples
  - Pattern weight reflects goodness
- Edge weight
  - Match/similarity of tuple context to pattern
37How to Set Node Weights
- Constraint violation (from before)
  - Conf(P) = Log(Pos) * Pos / (Pos + Neg)
  - Conf(T)
- HITS (Hassan et al., EMNLP 2006)
  - Conf(P) = sum of Conf(T) over tuples T extracted by P
  - Conf(T) = sum of Conf(P) over patterns P that extract T
- URNS (Downey et al., IJCAI 2005)
- EM-Spy (Agichtein, SDM 2006)
  - Unknown tuples treated as Neg
  - Compute Conf(P), Conf(T)
  - Iterate
(A sketch of the HITS-style mutual update follows.)
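A minimal sketch of HITS-style mutual reinforcement over the pattern-tuple graph (not the exact update rule from Hassan et al.); the toy graph and edge weights are illustrative.

import math

# Bipartite pattern-tuple graph: edges[(pattern, tuple)] = match/similarity weight.
# The graph and weights below are made up for illustration.
edges = {("P1", "t1"): 0.8, ("P1", "t2"): 0.7,
         ("P2", "t2"): 0.4, ("P2", "t3"): 0.9,
         ("P3", "t3"): 0.3}

patterns = {p for p, _ in edges}
tuples_ = {t for _, t in edges}
conf_p = {p: 1.0 for p in patterns}
conf_t = {t: 1.0 for t in tuples_}

def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

for _ in range(20):  # iterate the mutual update until (approximate) convergence
    conf_p = normalize({p: sum(w * conf_t[t] for (pp, t), w in edges.items() if pp == p)
                        for p in patterns})
    conf_t = normalize({t: sum(w * conf_p[p] for (p, tt), w in edges.items() if tt == t)
                        for t in tuples_})

print({p: round(v, 2) for p, v in sorted(conf_p.items())})
print({t: round(v, 2) for t, v in sorted(conf_t.items())})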
38Evaluating Patterns and Tuples: Expectation Maximization
- EM-Spy Algorithm
  - Hide labels for some seed tuples ("spies")
  - Iterate the EM algorithm to convergence on tuple/pattern confidence values
  - Set threshold t such that t > 90% of spy tuples
  - Re-initialize Snowball using the new seed tuples
Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
..
39Adapting Snowball for New Relations
- Large parameter space
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, support, combinations
  - Feature weights: TF*IDF, TF, TF*NB, NB
  - Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
- Automatically estimate parameter values
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal performance
  - Seed occurrences that do not have close neighbors are discarded
40Example Task DiseaseOutbreaks
SDM 2006
Proteus 0.409 Snowball 0.415
41Snowball Used in Various Domains
- News: NYT, WSJ, AP (DL '00, SDM '06)
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDR, Micromedex (Thesis)
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus (ISMB '03)
  - Gene and Protein Synonyms
42Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation
43Extracting A Relation From a Large Text Database
Information Extraction System
Structured Relation
- Brute force approach: feed all docs to the information extraction system
  - Often only a tiny fraction of documents are useful
- Many databases are not crawlable
  - Often a search interface is available, with an existing keyword index
- How to identify useful documents?
44An Abstract View of Text-Centric Tasks
Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006
Output tuples
Text Database
Extraction System
- Retrieve documents from database
- Process documents
- Extract output tuples
Task                     "Tuple"
Information Extraction   Relation Tuple
Database Selection       Word (Frequency)
Focused Crawling         Web Page about a Topic
45Executing a Text-Centric Task
Output tuples
Text Database
Extraction System
- Retrieve documents from database
- Process documents
- Extract output tuples
- Two major execution paradigms
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., "case fatality rate"), retrieve and process the documents in the results
- Similar to the relational world: the underlying data distribution dictates what is best
- Unlike the relational world:
  - Indexes are only approximate: the index is on keywords, not on the tuples of interest
  - Choice of execution plan affects output completeness (not only speed)
46Scan
Output tuples
Extraction System
Text Database
- Retrieve docs from database
- Process documents
- Extract output tuples
- Scan retrieves and processes documents sequentially (until reaching target recall)
- Execution time = |Retrieved Docs| * (R + P)
  where R = time for retrieving a document, P = time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
47Iterative Query Expansion
Output tuples
Text Database
Extraction System
Query Generation
- Query the database with seed tuples (e.g., <Malaria, Ethiopia>) issued as keyword queries (e.g., Ebola AND Zaire)
- Process retrieved documents
- Extract tuples from docs
- Augment seed tuples with new tuples, and iterate
- Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
  where Q = time for answering a query, R = time for retrieving a document, P = time for processing a document
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
(A sketch of these cost models follows.)
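A minimal sketch of the two execution-time models above (Scan vs. Iterative Set Expansion); the per-document and per-query costs and the document/query counts are illustrative placeholders, not measurements.

# Illustrative per-unit costs (seconds); not measured values.
R = 0.05   # time to retrieve one document
P = 1.0    # time to process one document with the extraction system
Q = 0.2    # time to answer one keyword query

def scan_time(docs_retrieved):
    """Scan: retrieve and process documents sequentially."""
    return docs_retrieved * (R + P)

def iterative_set_expansion_time(docs_retrieved, queries_sent):
    """Iterative Set Expansion: also pays a per-query cost."""
    return docs_retrieved * (R + P) + queries_sent * Q

# Hypothetical counts of documents/queries needed to reach some target recall.
print(scan_time(135_000))                          # scan the whole collection
print(iterative_set_expansion_time(13_500, 500))   # query-based plan touching ~10% of docs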
48QXtract: Querying Text Databases for Robust, Scalable Information EXtraction
User-Provided Seed Tuples
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Query Generation
Queries
Promising Documents
Information Extraction System
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
Problem: Learn keyword queries to retrieve promising documents
Extracted Relation
49Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
- Get a document sample with likely negative and likely positive examples.
- Label the sample documents using the information extraction system as an oracle.
- Train classifiers to recognize useful documents.
- Generate queries from the classifier model/rules.
(Pipeline: Seed Sampling -> Information Extraction System -> Classifier Training -> Query Generation -> Queries. A small end-to-end sketch of the last three steps follows.)
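A rough sketch of labeling, classifier training, and query generation, using scikit-learn in place of the Ripper/SVM/Okapi systems on the next slide; the sample documents, the extract() oracle, and the query-construction heuristic are all made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sample_docs = [
    "A new Ebola outbreak was reported in Zaire this week",         # likely useful
    "Malaria cases reported in Ethiopia as the epidemic spreads",   # likely useful
    "The company exported more products than expected last year",   # likely not useful
    "The sponsored homerun derby event drew a large crowd",         # likely not useful
]

def extract(doc):
    """Stand-in for the IE system used as an oracle: returns tuples found in doc."""
    known = [("Ebola", "Zaire"), ("Malaria", "Ethiopia")]
    return [t for t in known if t[0].lower() in doc.lower()]

# Step 2: label the sample -- a document is "useful" if the oracle extracts anything.
labels = [1 if extract(d) else 0 for d in sample_docs]

# Step 3: train a simple bag-of-words classifier.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sample_docs)
clf = LogisticRegression().fit(X, labels)

# Step 4: turn the highest-weighted terms into keyword queries.
terms = vectorizer.get_feature_names_out()
weights = clf.coef_[0]
top_terms = [t for _, t in sorted(zip(weights, terms), reverse=True)[:3]]
queries = top_terms + [" AND ".join(top_terms[:2])]
print(queries)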
50Training Classifiers to Recognize Useful Documents
Document features: words
  D1 (likely +): disease, reported, epidemic, expected, area
  D2 (likely +): virus, reported, expected, infected, patients
  D3 (likely -): products, made, used, exported, far
  D4 (likely -): past, old, homerun, sponsored, event
Classifiers trained on the labeled sample:
  Ripper (rule):        disease AND reported -> USEFUL
  SVM (term weights):   virus 3, infected 2, sponsored -1, products ...
  Okapi (IR, terms):    disease, exported, reported, used, epidemic, far, infected, virus
51Generating Queries from Classifiers
From the classifier models above (the Ripper rule, the SVM term weights, the Okapi top terms), keyword queries are generated, for example:
  disease AND reported
  epidemic virus
  virus infected
QCombined: disease AND reported, epidemic, virus, ...
52SIGMOD 2003 Demonstration
53An Even Simpler Querying Strategy: Tuples
Example: the known tuple <Ebola, Zaire, May 1995> becomes the query [Ebola AND Zaire]; the search engine returns matching documents, and the information extraction system extracts new tuples such as <Malaria, Ethiopia, Jan. 1995> and <hemorrhagic fever, Africa, May 1995>.
- Convert given tuples into queries
- Retrieve matching documents
- Extract new tuples from documents and iterate (a sketch of this loop follows)
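A minimal sketch of the Tuples querying loop, assuming hypothetical search(query) and extract(document) functions standing in for the search engine and the IE system; the toy document collection below is made up.

def tuples_strategy(seed_tuples, search, extract, max_iterations=10):
    """Bootstrapping loop: turn known tuples into keyword queries,
    retrieve matching documents, extract new tuples, and repeat."""
    known = set(seed_tuples)
    frontier = list(seed_tuples)
    for _ in range(max_iterations):
        if not frontier:
            break  # no unqueried tuples left: the strategy has stalled
        new_tuples = set()
        for disease, location in frontier:
            query = f'"{disease}" AND "{location}"'
            for doc in search(query):
                new_tuples.update(extract(doc))
        frontier = [t for t in new_tuples if t not in known]
        known.update(new_tuples)
    return known

# Hypothetical stand-ins for the search engine and the IE system:
docs = {"d1": [("Ebola", "Zaire"), ("Malaria", "Ethiopia")],
        "d2": [("Malaria", "Ethiopia"), ("Cholera", "Sudan")]}
search = lambda q: [d for d, ts in docs.items()
                    if any(name in q for t in ts for name in t)]
extract = lambda d: docs[d]
print(tuples_strategy([("Ebola", "Zaire")], search, extract))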
54Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents in a 135,000-article newspaper database.
Tuples strategy: recall at most 46%.
55Predicting Recall of Tuples Strategy
WebDB 2003
(Figure: two runs starting from a seed tuple, one ending in SUCCESS! and one in FAILURE.)
Can we predict if Tuples will succeed?
56Using Querying Graph for Analysis
- We need to compute:
  - The number of documents retrieved after sending Q tuples as queries (estimates time)
  - The number of tuples that appear in the retrieved documents (estimates recall)
- To estimate these we need to compute:
  - The degree distribution of the tuples discovered by retrieving documents
  - The degree distribution of the documents retrieved by the tuples
  - (Not the same as the degree distribution of a randomly chosen tuple or document; it is easier to discover documents and tuples with high degrees)
(Figure: bipartite querying graph linking tuples t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam> to documents d1, ..., d5.)
57Information Reachability Graph
(Figure: the reachability graph over tuples t1, ..., t5 induced by the querying graph over documents d1, ..., d5.)
t1 retrieves document d1, which contains t2.
t2, t3, and t4 are reachable from t1; t5 is not.
58Connected Components
- In: tuples that retrieve other tuples but are not themselves reachable
- Core: tuples that retrieve other tuples and each other
- Out: reachable tuples that do not retrieve tuples in the Core
59Sizes of Connected Components
How many tuples are in the largest Core + Out?
(Figure: bow-tie diagrams showing In, Core (strongly connected), and Out components.)
- Conjecture
  - The degree distribution in reachability graphs follows a power law.
  - Then the reachability graph has at most one giant component.
- Define Reachability as the fraction of tuples in the largest Core + Out.
60NYT Reachability Graph Outdegree Distribution
The outdegree distribution matches a power law (shown for MaxResults = 10 and MaxResults = 50).
61NYT Component Size Distribution
(Figure: component size distributions, split into non-reachable and reachable tuples.)
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
62Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
63Estimating Reachability
- In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1, and (for power-law exponent b < 3.457) its size can be estimated (Chung and Lu, Annals of Combinatorics, 2002)
- Estimate Reachability = |CG| / |T|
  - Depends only on d (the average outdegree)
64Estimating Reachability Algorithm
- Pick some random tuples
- Use the tuples to query the database
- Extract tuples from the matching documents to compute reachability-graph edges
- Estimate the average outdegree d
- Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002
(Figure: sampled querying graph over tuples t1, ..., t4 and documents d1, ..., d4, giving an estimated average outdegree d = 1.5.)
(An estimation sketch follows.)
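A minimal sketch of this estimation procedure, reusing the hypothetical search()/extract() stand-ins from the Tuples sketch above. It only estimates the average outdegree and applies the d > 1 test for whether a giant component (and hence high reachability) is expected; the Chung-Lu size formula itself is not reproduced here.

import random

def estimate_avg_outdegree(all_tuples, search, extract, sample_size=50, seed=0):
    """Sample tuples, issue them as queries, and count how many distinct *other*
    tuples each sampled tuple reaches through the retrieved documents."""
    random.seed(seed)
    sample = random.sample(all_tuples, min(sample_size, len(all_tuples)))
    outdegrees = []
    for t in sample:
        disease, location = t
        reached = set()
        for doc in search(f'"{disease}" AND "{location}"'):
            reached.update(extract(doc))
        reached.discard(t)
        outdegrees.append(len(reached))
    return sum(outdegrees) / len(outdegrees)

# Toy database (made up), in the same format as the Tuples sketch above.
docs = {"d1": [("Ebola", "Zaire"), ("Malaria", "Ethiopia")],
        "d2": [("Malaria", "Ethiopia"), ("Cholera", "Sudan")],
        "d3": [("H5N1", "Vietnam")]}
search = lambda q: [d for d, ts in docs.items()
                    if any(name in q for t in ts for name in t)]
extract = lambda d: docs[d]
tuples_ = sorted({t for ts in docs.values() for t in ts})

d = estimate_avg_outdegree(tuples_, search, extract, sample_size=3)
print("estimated average outdegree d =", d)
print("giant component (high reachability) expected:", d > 1)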
65Estimating Reachability of NYT
Estimated reachability of the NYT collection: roughly 0.46.
An approximate reachability estimate is obtained after about 50 queries, and can be used to predict success (or failure) of a Tuples querying strategy.
66Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining
67Goal Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IESys 4
IESys 3
IESys 2
IESys 1
Historical news, breaking news stories, wire, alerts, ...
Current Patient Records: diagnosis, physicians' notes, lab results/analysis, ...
911 Calls: traffic accidents, ...
68Adaptive, Utility-Driven Extraction
- Extract relevant symptoms and modifiers from text
  - Physician notes, patient narratives, call transcripts
- Call transcripts are a difficult extraction problem
  - Not grammatical, dialogue, speech-to-text unreliable, ...
- Use partially supervised techniques to learn extraction patterns
- One approach
  - Link together (when possible) the call transcript and the patient record (e.g., by time, address, and patient name)
  - Correlate patterns in the transcript with diagnosis/symptoms
  - Fine-grained learning: can automatically train for each symptom, group of patients, etc.
69Authority, Trust, Confidence
- How reliable are the signals emitted by information extraction?
- Dimensions of trust/confidence
  - Source reliability: diagnosis vs. notes vs. 911 calls
  - Tuple extraction confidence
  - Source extraction difficulty
70Source Confidence Estimation
CIKM 2005
- The extraction task is easy when context term distributions diverge from the background distribution
- Quantify this as relative entropy (Kullback-Leibler divergence); a small sketch follows
- After calibration, the metric predicts whether a task is easy or hard
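A minimal sketch of the relative-entropy computation (KL divergence between the term distribution around extracted tuples and the background term distribution); the term counts are made-up examples, and add-alpha smoothing is included so the divergence stays finite. This is only the metric itself, not the calibration step.

import math
from collections import Counter

def kl_divergence(context_counts, background_counts, alpha=0.01):
    """D(context || background) over a shared vocabulary, with add-alpha smoothing."""
    vocab = set(context_counts) | set(background_counts)
    c_total = sum(context_counts.values()) + alpha * len(vocab)
    b_total = sum(background_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (context_counts.get(term, 0) + alpha) / c_total
        q = (background_counts.get(term, 0) + alpha) / b_total
        kl += p * math.log(p / q)
    return kl

# Illustrative counts: terms around extracted <disease, location> tuples vs. the whole corpus.
context = Counter({"outbreak": 12, "reported": 9, "epidemic": 7, "the": 20})
background = Counter({"the": 1000, "company": 80, "reported": 60, "outbreak": 2, "said": 300})
print(round(kl_divergence(context, background), 3))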
71Inferring Social Networks
- Explicit networks
  - Patient records: family, geographical entities in structured and unstructured portions
- Implicit connections
  - Extract events (e.g., "went to restaurant X yesterday")
  - Extract relationships (e.g., "I work in Kroger's in Toco Hills")
72Modeling Social Networks for
Email exchange mapped onto cubicle locations.
73Improve Prediction Accuracy
- Suppose we managed to
  - Automatically identify people currently sick or about to get sick
  - Automatically infer (part of) their social network
- Can we improve prediction of the dynamics of an outbreak?
74Multimodal Information Extraction and Data Mining
- Develop joint models over structured and unstructured data
  - E.g., lab results and symptoms extracted from text
- One approach: mutual reinforcement
  - Co-training: train classifiers on redundant views of the data (e.g., structured + unstructured)
  - Bootstrap on examples proposed by both views
  - More generally: graphical models
75Summary
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining
76Thank You
- Details: papers, other talk slides
- http://www.mathcs.emory.edu/eugene/