Title: Discovering and Utilizing Structure in Large Unstructured Text Datasets
1Discovering and Utilizing Structure in Large Unstructured Text Datasets
- Eugene Agichtein, Math and Computer Science Department
2Information Extraction Example
- Information extraction systems represent text in
structured form
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
3How can information extraction help?
Structured Relation
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide input to data mining and statistical analysis
4Goal Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IESys 4
IESys 3
IESys 2
IESys 1
Historical news, breaking news stories, wire, alerts, ...
Current Patient Records: diagnosis, physicians' notes, lab results/analysis, ...
911 Calls: traffic accidents, ...
5Challenges in Information Extraction
- Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec/document x 5 billion docs = 158 CPU years
- Approach: learn from data (Bootstrapping)
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
6Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation
7What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
8What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
9What is Information Extraction
As a family of techniques
Information Extraction = segmentation + classification + association + clustering
Segmented and classified mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
13IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
14Information Extraction Tasks
- Extracting entities and relations
  - Entities
    - Named (e.g., Person)
    - Generic (e.g., disease name)
  - Relations
    - Entities related in a predefined way (e.g., Location of a Disease outbreak)
    - Discovered automatically
- Common information extraction steps
  - Preprocessing: sentence chunking, parsing, morphological analysis
  - Rules/extraction patterns: manual, machine learning, and hybrid
  - Applying extraction patterns to extract new information
  - Postprocessing and complex extraction (not covered)
    - Co-reference resolution
    - Combining relations into events, rules, ...
15Two kinds of IE approaches
- Knowledge Engineering
  - rule-based
  - developed by experienced language engineers
  - makes use of human intuition
  - requires only a small amount of training data
  - development can be very time consuming
  - some changes may be hard to accommodate
- Machine Learning
  - uses statistics or other machine learning
  - developers do not need LE expertise
  - requires large amounts of annotated training data
  - some changes may require re-annotation of the entire training corpus
  - annotators are cheap (but you get what you pay for!)
16Extracting Entities from Text
(Figure: a gallery of models for entity extraction, each illustrated on the sentence "Abraham Lincoln was born in Kentucky.")
- Classify pre-segmented candidates: a classifier asks "which class?" for each candidate span
- Lexicons: ask "member?" against lists such as Alabama, Alaska, ..., Wisconsin, Wyoming
- Sliding window: classify each window, trying alternate window sizes
- Boundary models: classify BEGIN/END boundaries around spans
- Finite state machines: find the most likely state sequence
- Context-free grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S)
- ... and beyond
Any of these models can be used to capture words, formatting, or both.
17Hidden Markov Models
(Figure: the HMM drawn both as a graphical model and as a finite state model, with states s_{t-1}, s_t, s_{t+1}, transitions between them, and observations o_{t-1}, o_t, o_{t+1}; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.)
Parameters, for all states S = {s1, s2, ...}:
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior).
18IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi decoding):
  Yesterday Lawrence Saul spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Lawrence Saul
(A minimal decoding sketch follows.)
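As a concrete illustration of this decoding step, here is a minimal Viterbi sketch over a toy two-state HMM (a "person-name" state and an "other" state). The transition and emission numbers below are made up for illustration, not parameters of any trained model such as the one on the slide.

import math

# Toy HMM: two states. All probabilities below are illustrative, not trained.
states = ["OTHER", "PERSON"]
start = {"OTHER": 0.95, "PERSON": 0.05}
trans = {"OTHER": {"OTHER": 0.8, "PERSON": 0.2},
         "PERSON": {"OTHER": 0.5, "PERSON": 0.5}}

def emit(state, word):
    """Illustrative emission model: capitalized tokens are more likely under PERSON."""
    if state == "PERSON":
        return 0.7 if word[:1].isupper() else 0.02
    return 0.1 if word[:1].isupper() else 0.9

def viterbi(words):
    # delta[t][s] = log-probability of the best path ending in state s at position t
    delta = [{s: math.log(start[s]) + math.log(emit(s, words[0])) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans[p][s]))
            delta[t][s] = (delta[t - 1][best_prev] + math.log(trans[best_prev][s])
                           + math.log(emit(s, words[t])))
            back[t][s] = best_prev
    # Trace back the most likely state sequence
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

words = "Yesterday Lawrence Saul spoke this example sentence .".split()
labels = viterbi(words)
person = [w for w, l in zip(words, labels) if l == "PERSON"]
print(list(zip(words, labels)))
print("Person name:", " ".join(person))  # -> Person name: Lawrence Saul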
19HMM Example Nymble
Bikel et al. 1998 (BBN IdentiFinder)
Task: Named Entity Extraction
(Figure: state diagram with start-of-sentence and end-of-sentence states and name-class states such as Person, Org, Other, plus five other name classes.)
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t)
Trained on 450k words of newswire text.
Results:
  Case    Language    F1
  Mixed   English     93
  Upper   English     91
  Mixed   Spanish     90
Other examples of shrinkage for HMMs in IE: Freitag and McCallum '99.
20Relation Extraction
- Extract structured relations from text
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
21Relation Extraction
- Typically requires entity tagging as preprocessing
- Knowledge Engineering
  - Rules defined over lexical items
    - <company> located in <location>
  - Rules defined over parsed text
    - ((Obj <company>) (Verb located) () (Subj <location>))
  - Proteus, GATE, ...
- Machine Learning-based
  - Learn rules/patterns from examples
    - Dan Roth 2005, Cardie 2006, Mooney 2005, ...
  - Partially-supervised: bootstrap from seed examples
    - Agichtein & Gravano 2000, Etzioni et al. 2004, ...
  - Recently, hybrid models: Feldman 2004, 2006
22Comparison of Approaches
- Significant effort: use language-engineering environments to help experts create extraction patterns
  - GATE 2002, Proteus 1998
- Substantial effort: train system over manually labeled data
  - Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996
- Minimal effort: exploit large amounts of unlabeled data
  - DIPRE (Brin 1998), Snowball (Agichtein & Gravano 2000)
  - Etzioni et al. (04): KnowItAll, extracting unary relations
  - Yangarber et al. (00, 02): pattern refinement, generalized names detection
23The Snowball System Overview
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
Snowball
... ... ..
24Snowball Getting User Input
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
- User input
- a handful of example instances
- integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.
25Snowball Finding Example Occurrences
Can use any full-text search engine
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Search Engine
Computer servers at Microsoft's headquarters in Redmond...
In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp...
The Armonk-based IBM introduced a new line...
Change of guard at IBM Corporation's headquarters near Armonk, NY...
26Snowball Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, ... (MITRE's Alembic, IBM's Talent, LingPipe, ...); tagged mentions are shown in brackets below:
Computer servers at [Microsoft]'s headquarters in [Redmond]...
In mid-afternoon trading, shares of [Redmond, WA]-based [Microsoft Corp]...
The [Armonk]-based [IBM] introduced a new line...
Change of guard at [IBM Corporation]'s headquarters near [Armonk, NY]...
27Snowball Extraction Patterns
- General extraction pattern model
  - {acceptor0, Entity, acceptor1, Entity, acceptor2}
- Acceptor instantiations
  - String Match (accepts the string "'s headquarters in")
  - Vector-Space (vector <'s 0.5>, <headquarters 0.5>, <in 0.5>)
  - Sequence Classifier (Prob(T valid | 's, headquarters, in))
    - HMMs, sparse sequences, Conditional Random Fields, ...
28Snowball Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
Example occurrence vector:
  ORGANIZATION  <'s 0.57>, <headquarters 0.57>, <in 0.57>  LOCATION
29Snowball Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids.
Example pattern (cluster centroid):
  ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION
(A small sketch of steps 1-3 follows.)
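A minimal sketch of this pattern-generation step, using simple single-pass clustering with a cosine-similarity threshold; the occurrence contexts and the 0.7 threshold are made up for illustration and are not the exact procedure or parameters used in Snowball.

from collections import Counter, defaultdict
import math

def vectorize(terms):
    """Turn a list of middle-context terms into a unit-length term vector."""
    counts = Counter(terms)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def centroid(vectors):
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}

# Step 1: occurrences of seed tuples, represented by the terms between the
# <ORGANIZATION> and <LOCATION> tags (illustrative contexts).
occurrences = [["'s", "headquarters", "in"],
               ["'s", "headquarters", "in"],
               ["'s", "new", "headquarters", "in"],
               ["-", "based"],
               ["-", "based"]]
vectors = [vectorize(o) for o in occurrences]

# Step 2: single-pass clustering with a similarity threshold (0.7 is illustrative).
clusters = []
for vec in vectors:
    for cluster in clusters:
        if cosine(vec, centroid(cluster)) >= 0.7:
            cluster.append(vec)
            break
    else:
        clusters.append([vec])

# Step 3: patterns are the (filtered) cluster centroids.
patterns = [centroid(c) for c in clusters]
for p in patterns:
    print({t: round(w, 2) for t, w in sorted(p.items())})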
30Vector Space Clustering
31Snowball Extracting New Tuples
Match tagged text fragments against patterns.
Text fragment: "Google 's new headquarters in Mountain View are ..."
  ORGANIZATION  <'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>  LOCATION  <are 1>
Candidate patterns and match scores:
  P1: ORGANIZATION  <'s 0.71>, <headquarters 0.71>  LOCATION       Match = 0.8
  P2: ORGANIZATION  <located 0.71>, <in 0.71>  LOCATION            Match = 0.4
  P3: LOCATION  <- 0.71>, <based 0.71>  ORGANIZATION               Match = 0
(A matching sketch follows.)
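A minimal sketch of this matching step, using cosine similarity between the fragment context and each pattern centroid; the fragment, pattern weights, and the 0.5 acceptance threshold are illustrative (the resulting scores differ slightly from the rounded values on the slide).

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Context between the tagged ORGANIZATION and LOCATION in
# "Google 's new headquarters in Mountain View are ..." (illustrative weights).
fragment = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}
fragment_tags = ("ORGANIZATION", "LOCATION")

# Patterns (cluster centroids) with their tag order; weights are illustrative.
patterns = {
    "P1": ({"'s": 0.71, "headquarters": 0.71}, ("ORGANIZATION", "LOCATION")),
    "P2": ({"located": 0.71, "in": 0.71}, ("ORGANIZATION", "LOCATION")),
    "P3": ({"-": 0.71, "based": 0.71}, ("LOCATION", "ORGANIZATION")),
}

THRESHOLD = 0.5  # accept the candidate tuple if some pattern matches this well
best_name, best_score = None, 0.0
for name, (vec, tags) in patterns.items():
    # The tag order must agree before the term vectors are compared.
    score = cosine(fragment, vec) if tags == fragment_tags else 0.0
    print(name, round(score, 2))
    if score > best_score:
        best_name, best_score = name, score

if best_score >= THRESHOLD:
    print("Extract <Google, Mountain View> via", best_name, "with match", round(best_score, 2))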
32Snowball Evaluating Patterns
Automatically estimate pattern confidence:
  Conf(P4) = Positive / Total = 2/3 = 0.66
Pattern P4:  ORGANIZATION  <, 1>  LOCATION
Current seed tuples:
  Organization   Headquarters
  IBM            Armonk
  Intel          Santa Clara
  Microsoft      Redmond
Occurrences matched by P4, checked against the seed tuples:
  "IBM, Armonk, reported ..."                                      Positive
  "Intel, Santa Clara, introduced ..."                             Positive
  "Bet on Microsoft, New York-based analyst Jane Smith said ..."   Negative
33Snowball Evaluating Tuples
Automatically evaluate tuple confidence Conf(T).
A tuple has high confidence if it is generated by high-confidence patterns:
  Conf(T) = 1 - product over matching patterns Pi of (1 - Conf(Pi) * Match(Pi))
Example, for the candidate tuple <3COM, Santa Clara>:
  P4 (Conf = 0.66): ORGANIZATION <, 1> LOCATION,                    Match = 0.4
  P3 (Conf = 0.95): LOCATION <- 0.75>, <based 0.75> ORGANIZATION,   Match = 0.8
  Conf(T) = 1 - (1 - 0.66 * 0.4) * (1 - 0.95 * 0.8) = 0.83
(A small sketch of this computation follows.)
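A minimal sketch of the tuple-confidence computation under the noisy-or style combination above, using the pattern confidences and match scores from the example.

def tuple_confidence(evidence):
    """evidence is a list of (pattern_confidence, match_score) pairs
    collected for one candidate tuple."""
    remaining_doubt = 1.0
    for pattern_conf, match in evidence:
        remaining_doubt *= (1.0 - pattern_conf * match)
    return 1.0 - remaining_doubt

# <3COM, Santa Clara> was matched by P4 (conf 0.66, match 0.4)
# and P3 (conf 0.95, match 0.8), as in the example above.
print(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]))  # ~0.823, i.e. roughly the 0.83 shown above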
34Snowball Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... .... ..
... .... ..
Keep only high-confidence tuples for next
iteration
35Snowball Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
Start new iteration with expanded example set
Iterate until no new tuples are extracted
36Pattern-Tuple Duality
- A good tuple
  - Extracted by good patterns
  - Tuple weight reflects goodness
- A good pattern
  - Generated by good tuples
  - Extracts good new tuples
  - Pattern weight reflects goodness
- Edge weight
  - Match/similarity of tuple context to pattern
37How to Set Node Weights
- Constraint violation (from before)
  - Conf(P) = Log(Pos) * Pos / (Pos + Neg)
  - Conf(T)
- HITS (Hassan et al., EMNLP 2006)
  - Conf(P) = sum of Conf(T) over tuples T extracted by P
  - Conf(T) = sum of Conf(P) over patterns P that extract T
- URNS (Downey et al., IJCAI 2005)
- EM-Spy (Agichtein, SDM 2006)
  - Unknown tuples treated as Neg
  - Compute Conf(P), Conf(T)
  - Iterate
(A sketch of the HITS-style mutual update follows.)
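A minimal sketch of HITS-style mutual reinforcement over the pattern-tuple graph (not the exact update rule from Hassan et al.); the toy graph and edge weights are illustrative.

import math

# Bipartite pattern-tuple graph: edges[(pattern, tuple)] = match/similarity weight.
# The graph and weights below are made up for illustration.
edges = {("P1", "t1"): 0.8, ("P1", "t2"): 0.7,
         ("P2", "t2"): 0.4, ("P2", "t3"): 0.9,
         ("P3", "t3"): 0.3}

patterns = {p for p, _ in edges}
tuples_ = {t for _, t in edges}
conf_p = {p: 1.0 for p in patterns}
conf_t = {t: 1.0 for t in tuples_}

def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

for _ in range(20):  # iterate the mutual update until (approximate) convergence
    conf_p = normalize({p: sum(w * conf_t[t] for (pp, t), w in edges.items() if pp == p)
                        for p in patterns})
    conf_t = normalize({t: sum(w * conf_p[p] for (p, tt), w in edges.items() if tt == t)
                        for t in tuples_})

print({p: round(v, 2) for p, v in sorted(conf_p.items())})
print({t: round(v, 2) for t, v in sorted(conf_t.items())})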
38Evaluating Patterns and Tuples: Expectation Maximization
- EM-Spy Algorithm
  - Hide labels for some seed tuples ("spies")
  - Iterate the EM algorithm to convergence on tuple/pattern confidence values
  - Set threshold t such that t > 90% of spy tuples
  - Re-initialize Snowball using the new seed tuples
Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
..
39Adapting Snowball for New Relations
- Large parameter space
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, support, combinations
  - Feature weights: TF*IDF, TF, TF*NB, NB
  - Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
- Automatically estimate parameter values
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal performance
  - Seed occurrences that do not have close neighbors are discarded
40Example Task DiseaseOutbreaks
SDM 2006
Proteus 0.409 Snowball 0.415
41Snowball Used in Various Domains
- News: NYT, WSJ, AP (DL '00, SDM '06)
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDR, Micromedex (Thesis)
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus (ISMB '03)
  - Gene and Protein Synonyms
42Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Inferring and analyzing social networks
- Utility-based extraction tuning
- Multi-modal information extraction and data mining
- Authority/trust/confidence estimation
43Extracting A Relation From a Large Text Database
Information Extraction System
Structured Relation
- Brute force approach: feed all docs to the information extraction system
  - Often only a tiny fraction of documents are useful
- Many databases are not crawlable
  - Often a search interface is available, with an existing keyword index
- How to identify useful documents?
44An Abstract View of Text-Centric Tasks
Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006
Output tuples
Text Database
Extraction System
- Retrieve documents from database
- Process documents
- Extract output tuples
Task                     "Tuple"
Information Extraction   Relation Tuple
Database Selection       Word (Frequency)
Focused Crawling         Web Page about a Topic
45Executing a Text-Centric Task
Output tuples
Text Database
Extraction System
- Retrieve documents from database
- Process documents
- Extract output tuples
- Two major execution paradigms
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., "case fatality rate"), retrieve and process the documents in the results
- Similar to the relational world: the underlying data distribution dictates what is best
- Unlike the relational world:
  - Indexes are only approximate: the index is on keywords, not on the tuples of interest
  - Choice of execution plan affects output completeness (not only speed)
46Scan
Output tuples
Extraction System
Text Database
- Retrieve docs from database
- Process documents
- Extract output tuples
- Scan retrieves and processes documents sequentially (until reaching target recall)
- Execution time = |Retrieved Docs| * (R + P)
  where R = time for retrieving a document, P = time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
47Iterative Query Expansion
Output tuples
Text Database
Extraction System
Query Generation
- Query the database with seed tuples (e.g., <Malaria, Ethiopia>) issued as keyword queries (e.g., Ebola AND Zaire)
- Process retrieved documents
- Extract tuples from docs
- Augment seed tuples with new tuples, and iterate
- Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
  where Q = time for answering a query, R = time for retrieving a document, P = time for processing a document
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
(A sketch of these cost models follows.)
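A minimal sketch of the two execution-time models above (Scan vs. Iterative Set Expansion); the per-document and per-query costs and the document/query counts are illustrative placeholders, not measurements.

# Illustrative per-unit costs (seconds); not measured values.
R = 0.05   # time to retrieve one document
P = 1.0    # time to process one document with the extraction system
Q = 0.2    # time to answer one keyword query

def scan_time(docs_retrieved):
    """Scan: retrieve and process documents sequentially."""
    return docs_retrieved * (R + P)

def iterative_set_expansion_time(docs_retrieved, queries_sent):
    """Iterative Set Expansion: also pays a per-query cost."""
    return docs_retrieved * (R + P) + queries_sent * Q

# Hypothetical counts of documents/queries needed to reach some target recall.
print(scan_time(135_000))                          # scan the whole collection
print(iterative_set_expansion_time(13_500, 500))   # query-based plan touching ~10% of docs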
48QXtract: Querying Text Databases for Robust, Scalable Information EXtraction
User-Provided Seed Tuples
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Query Generation
Queries
Promising Documents
Information Extraction System
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
Problem: Learn keyword queries to retrieve promising documents
Extracted Relation
49Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
- Get a document sample with likely negative and likely positive examples.
- Label the sample documents using the information extraction system as an oracle.
- Train classifiers to recognize useful documents.
- Generate queries from the classifier model/rules.
(Pipeline: Seed Sampling -> Information Extraction System -> Classifier Training -> Query Generation -> Queries. A small end-to-end sketch of the last three steps follows.)
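A rough sketch of labeling, classifier training, and query generation, using scikit-learn in place of the Ripper/SVM/Okapi systems on the next slide; the sample documents, the extract() oracle, and the query-construction heuristic are all made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sample_docs = [
    "A new Ebola outbreak was reported in Zaire this week",         # likely useful
    "Malaria cases reported in Ethiopia as the epidemic spreads",   # likely useful
    "The company exported more products than expected last year",   # likely not useful
    "The sponsored homerun derby event drew a large crowd",         # likely not useful
]

def extract(doc):
    """Stand-in for the IE system used as an oracle: returns tuples found in doc."""
    known = [("Ebola", "Zaire"), ("Malaria", "Ethiopia")]
    return [t for t in known if t[0].lower() in doc.lower()]

# Step 2: label the sample -- a document is "useful" if the oracle extracts anything.
labels = [1 if extract(d) else 0 for d in sample_docs]

# Step 3: train a simple bag-of-words classifier.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sample_docs)
clf = LogisticRegression().fit(X, labels)

# Step 4: turn the highest-weighted terms into keyword queries.
terms = vectorizer.get_feature_names_out()
weights = clf.coef_[0]
top_terms = [t for _, t in sorted(zip(weights, terms), reverse=True)[:3]]
queries = top_terms + [" AND ".join(top_terms[:2])]
print(queries)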
50Training Classifiers to Recognize Useful Documents
Document features: words
  D1 (likely +): disease, reported, epidemic, expected, area
  D2 (likely +): virus, reported, expected, infected, patients
  D3 (likely -): products, made, used, exported, far
  D4 (likely -): past, old, homerun, sponsored, event
Classifiers trained on the labeled sample:
  Ripper (rule):        disease AND reported -> USEFUL
  SVM (term weights):   virus 3, infected 2, sponsored -1, products ...
  Okapi (IR, terms):    disease, exported, reported, used, epidemic, far, infected, virus
51Generating Queries from Classifiers
From the classifier models above (the Ripper rule, the SVM term weights, the Okapi top terms), keyword queries are generated, for example:
  disease AND reported
  epidemic virus
  virus infected
QCombined: disease AND reported, epidemic, virus, ...
52SIGMOD 2003 Demonstration
53An Even Simpler Querying Strategy: Tuples
Example: the known tuple <Ebola, Zaire, May 1995> becomes the query [Ebola AND Zaire]; the search engine returns matching documents, and the information extraction system extracts new tuples such as <Malaria, Ethiopia, Jan. 1995> and <hemorrhagic fever, Africa, May 1995>.
- Convert given tuples into queries
- Retrieve matching documents
- Extract new tuples from documents and iterate (a sketch of this loop follows)
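A minimal sketch of the Tuples querying loop, assuming hypothetical search(query) and extract(document) functions standing in for the search engine and the IE system; the toy document collection below is made up.

def tuples_strategy(seed_tuples, search, extract, max_iterations=10):
    """Bootstrapping loop: turn known tuples into keyword queries,
    retrieve matching documents, extract new tuples, and repeat."""
    known = set(seed_tuples)
    frontier = list(seed_tuples)
    for _ in range(max_iterations):
        if not frontier:
            break  # no unqueried tuples left: the strategy has stalled
        new_tuples = set()
        for disease, location in frontier:
            query = f'"{disease}" AND "{location}"'
            for doc in search(query):
                new_tuples.update(extract(doc))
        frontier = [t for t in new_tuples if t not in known]
        known.update(new_tuples)
    return known

# Hypothetical stand-ins for the search engine and the IE system:
docs = {"d1": [("Ebola", "Zaire"), ("Malaria", "Ethiopia")],
        "d2": [("Malaria", "Ethiopia"), ("Cholera", "Sudan")]}
search = lambda q: [d for d, ts in docs.items()
                    if any(name in q for t in ts for name in t)]
extract = lambda d: docs[d]
print(tuples_strategy([("Ebola", "Zaire")], search, extract))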
54Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents in a 135,000-article newspaper database.
Tuples strategy: recall at most 46%.
55Predicting Recall of Tuples Strategy
WebDB 2003
(Figure: two runs starting from a seed tuple, one ending in SUCCESS! and one in FAILURE.)
Can we predict if Tuples will succeed?
56Using Querying Graph for Analysis
- We need to compute:
  - The number of documents retrieved after sending Q tuples as queries (estimates time)
  - The number of tuples that appear in the retrieved documents (estimates recall)
- To estimate these we need to compute:
  - The degree distribution of the tuples discovered by retrieving documents
  - The degree distribution of the documents retrieved by the tuples
  - (Not the same as the degree distribution of a randomly chosen tuple or document; it is easier to discover documents and tuples with high degrees)
(Figure: bipartite querying graph linking tuples t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam> to documents d1, ..., d5.)
57Information Reachability Graph
(Figure: the reachability graph over tuples t1, ..., t5 induced by the querying graph over documents d1, ..., d5.)
t1 retrieves document d1, which contains t2.
t2, t3, and t4 are reachable from t1; t5 is not.
58Connected Components
- In: tuples that retrieve other tuples but are not themselves reachable
- Core: tuples that retrieve other tuples and each other
- Out: reachable tuples that do not retrieve tuples in the Core
59Sizes of Connected Components
How many tuples are in the largest Core + Out?
(Figure: bow-tie diagrams showing In, Core (strongly connected), and Out components.)
- Conjecture
  - The degree distribution in reachability graphs follows a power law.
  - Then the reachability graph has at most one giant component.
- Define Reachability as the fraction of tuples in the largest Core + Out.
60NYT Reachability Graph Outdegree Distribution
The outdegree distribution matches a power law (shown for MaxResults = 10 and MaxResults = 50).
61NYT Component Size Distribution
(Figure: component size distributions, split into non-reachable and reachable tuples.)
MaxResults = 10: |CG| / |T| = 0.297
MaxResults = 50: |CG| / |T| = 0.620
62Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
63Estimating Reachability
- In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1, and (for power-law exponent b < 3.457) its size can be estimated (Chung and Lu, Annals of Combinatorics, 2002)
- Estimate Reachability = |CG| / |T|
  - Depends only on d (the average outdegree)
64Estimating Reachability Algorithm
- Pick some random tuples
- Use the tuples to query the database
- Extract tuples from the matching documents to compute reachability-graph edges
- Estimate the average outdegree d
- Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002
(Figure: sampled querying graph over tuples t1, ..., t4 and documents d1, ..., d4, giving an estimated average outdegree d = 1.5.)
(An estimation sketch follows.)
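A minimal sketch of this estimation procedure, reusing the hypothetical search()/extract() stand-ins from the Tuples sketch above. It only estimates the average outdegree and applies the d > 1 test for whether a giant component (and hence high reachability) is expected; the Chung-Lu size formula itself is not reproduced here.

import random

def estimate_avg_outdegree(all_tuples, search, extract, sample_size=50, seed=0):
    """Sample tuples, issue them as queries, and count how many distinct *other*
    tuples each sampled tuple reaches through the retrieved documents."""
    random.seed(seed)
    sample = random.sample(all_tuples, min(sample_size, len(all_tuples)))
    outdegrees = []
    for t in sample:
        disease, location = t
        reached = set()
        for doc in search(f'"{disease}" AND "{location}"'):
            reached.update(extract(doc))
        reached.discard(t)
        outdegrees.append(len(reached))
    return sum(outdegrees) / len(outdegrees)

# Toy database (made up), in the same format as the Tuples sketch above.
docs = {"d1": [("Ebola", "Zaire"), ("Malaria", "Ethiopia")],
        "d2": [("Malaria", "Ethiopia"), ("Cholera", "Sudan")],
        "d3": [("H5N1", "Vietnam")]}
search = lambda q: [d for d, ts in docs.items()
                    if any(name in q for t in ts for name in t)]
extract = lambda d: docs[d]
tuples_ = sorted({t for ts in docs.values() for t in ts})

d = estimate_avg_outdegree(tuples_, search, extract, sample_size=3)
print("estimated average outdegree d =", d)
print("giant component (high reachability) expected:", d > 1)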
65Estimating Reachability of NYT
Estimated reachability of the NYT collection: roughly 0.46.
An approximate reachability estimate is obtained after about 50 queries, and can be used to predict success (or failure) of a Tuples querying strategy.
66Outline
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining
67Goal Detect, Monitor, Predict Outbreaks
Detection, Monitoring, Prediction
Data Integration, Data Mining, Trend Analysis
IESys 4
IESys 3
IESys 2
IESys 1
Historical news, breaking news stories, wire, alerts, ...
Current Patient Records: diagnosis, physicians' notes, lab results/analysis, ...
911 Calls: traffic accidents, ...
68Adaptive, Utility-Driven Extraction
- Extract relevant symptoms and modifiers from text
  - Physician notes, patient narratives, call transcripts
- Call transcripts are a difficult extraction problem
  - Not grammatical, dialogue, speech-to-text unreliable, ...
- Use partially supervised techniques to learn extraction patterns
- One approach
  - Link together (when possible) the call transcript and the patient record (e.g., by time, address, and patient name)
  - Correlate patterns in the transcript with diagnosis/symptoms
  - Fine-grained learning: can automatically train for each symptom, group of patients, etc.
69Authority, Trust, Confidence
- How reliable are the signals emitted by information extraction?
- Dimensions of trust/confidence
  - Source reliability: diagnosis vs. notes vs. 911 calls
  - Tuple extraction confidence
  - Source extraction difficulty
70Source Confidence Estimation
CIKM 2005
- The extraction task is easy when context term distributions diverge from the background distribution
- Quantify this as relative entropy (Kullback-Leibler divergence); a small sketch follows
- After calibration, the metric predicts whether a task is easy or hard
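A minimal sketch of the relative-entropy computation (KL divergence between the term distribution around extracted tuples and the background term distribution); the term counts are made-up examples, and add-alpha smoothing is included so the divergence stays finite. This is only the metric itself, not the calibration step.

import math
from collections import Counter

def kl_divergence(context_counts, background_counts, alpha=0.01):
    """D(context || background) over a shared vocabulary, with add-alpha smoothing."""
    vocab = set(context_counts) | set(background_counts)
    c_total = sum(context_counts.values()) + alpha * len(vocab)
    b_total = sum(background_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for term in vocab:
        p = (context_counts.get(term, 0) + alpha) / c_total
        q = (background_counts.get(term, 0) + alpha) / b_total
        kl += p * math.log(p / q)
    return kl

# Illustrative counts: terms around extracted <disease, location> tuples vs. the whole corpus.
context = Counter({"outbreak": 12, "reported": 9, "epidemic": 7, "the": 20})
background = Counter({"the": 1000, "company": 80, "reported": 60, "outbreak": 2, "said": 300})
print(round(kl_divergence(context, background), 3))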
71Inferring Social Networks
- Explicit networks
  - Patient records: family, geographical entities in structured and unstructured portions
- Implicit connections
  - Extract events (e.g., "went to restaurant X yesterday")
  - Extract relationships (e.g., "I work in Kroger's in Toco Hills")
72Modeling Social Networks for
Email exchange mapped onto cubicle locations.
73Improve Prediction Accuracy
- Suppose we managed to
  - Automatically identify people currently sick or about to get sick
  - Automatically infer (part of) their social network
- Can we improve prediction of the dynamics of an outbreak?
74Multimodal Information Extraction and Data Mining
- Develop joint models over structured and unstructured data
  - E.g., lab results and symptoms extracted from text
- One approach: mutual reinforcement
  - Co-training: train classifiers on redundant views of the data (e.g., structured + unstructured)
  - Bootstrap on examples proposed by both views
  - More generally: graphical models
75Summary
- Information extraction overview
- Partially supervised information extraction
- Adaptivity
- Confidence estimation
- Text retrieval for scalable extraction
- Query-based information extraction
- Implicit connections/graphs in text databases
- Current and future work
- Adaptive information extraction and tuning
- Authority/trust/confidence estimation
- Inferring and analyzing social networks
- Multi-modal information extraction and data mining
76Thank You
- Details: papers, other talk slides
- http://www.mathcs.emory.edu/eugene/