Title: Scalable Information Extraction
1. Scalable Information Extraction
2. Example: Angina Treatments
Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)
Medical reference and literature
Web search results
3. Research Goal
- Accurate, intuitive, and efficient access to knowledge in unstructured sources
- Approaches:
  - Information Retrieval
    - Retrieve the relevant documents or passages
    - Question answering
  - Human Reading
    - Construct domain-specific verticals (MedLine)
  - Machine Reading
    - Extract entities and relationships
    - Network of relationships → Semantic Web
4. Semantic Relationships Buried in Unstructured Text
RecommendedTreatment
"A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris."
- Web, newsgroups, web logs
- Text databases (PubMed, CiteSeer, etc.)
- Newspaper archives:
  - Corporate mergers, succession, location
  - Terrorist attacks
5. What Structured Representation Can Do for You
Structured Relation
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide useful content for Semantic Web
6. Challenges in Information Extraction
- Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec/document × 5 billion docs ≈ 158 CPU years
- Approach: learn from data ("Bootstrapping")
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
7. Outline
- Snowball: partially supervised information extraction (overview and key results)
- Effective retrieval algorithms for information extraction (in detail)
- Current: mining user behavior for web search
- Future work
8. The Snowball System: Overview
Snowball
9. Snowball: Getting User Input
ACM DL 2000
- User input:
  - a handful of example instances
  - integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.
10. Snowball: Finding Example Occurrences
Can use any full-text search engine.
Search Engine
"Computer servers at Microsoft's headquarters in Redmond..." "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp..." "The Armonk-based IBM introduced a new line..." "Change of guard at IBM Corporation's headquarters near Armonk, NY..." ...
11. Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations: MITRE's Alembic, IBM's Talent, LingPipe, ...
"Computer servers at Microsoft's headquarters in Redmond..." "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp..." "The Armonk-based IBM introduced a new line..." "Change of guard at IBM Corporation's headquarters near Armonk, NY..." ... (organizations and locations tagged in each snippet)
12. Snowball: Extraction Patterns
- General extraction pattern model: acceptor0, <Entity>, acceptor1, <Entity>, acceptor2
- Acceptor instantiations:
  - String Match (accepts the string "'s headquarters in")
  - Vector-Space (vector ('s, 0.5), (headquarters, 0.5), (in, 0.5))
  - Classifier (estimate P(T valid | 's, headquarters, in))
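To make the vector-space acceptor concrete, here is a minimal sketch of scoring an occurrence's middle context against a pattern's middle-context vector; the vectors and names below are illustrative, not Snowball's actual data structures:

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical vector-space acceptor: pattern middle-context vector
# (echoing the slide's weights) vs. the context of a new occurrence.
pattern_mid = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
occurrence_mid = {"'s": 0.6, "headquarters": 0.6, "near": 0.5}
sim = cosine(pattern_mid, occurrence_mid)  # high, but below 1: "near" != "in"
```

An occurrence is accepted when this similarity exceeds the acceptor's threshold.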
13. Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
(diagram: <ORGANIZATION> ... <LOCATION> occurrence vectors grouped by similarity)
14. Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids.
(diagram: centroid pattern <ORGANIZATION> ... <LOCATION>)
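The three generation steps above (represent occurrences as vectors, cluster, take centroids) might be sketched as follows; the greedy single-pass clustering and the threshold value are assumptions for illustration:

```python
from math import sqrt

def cosine(v1, v2):
    # cosine similarity of sparse dict vectors
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def centroid(vectors):
    # average a list of sparse term-weight vectors
    out = {}
    for v in vectors:
        for t, w in v.items():
            out[t] = out.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in out.items()}

def cluster_occurrences(occurrences, threshold=0.8):
    # single-pass clustering: join the first cluster whose centroid is
    # similar enough, else start a new cluster
    clusters = []
    for occ in occurrences:
        for c in clusters:
            if cosine(occ, c["centroid"]) >= threshold:
                c["members"].append(occ)
                c["centroid"] = centroid(c["members"])
                break
        else:
            clusters.append({"members": [occ], "centroid": dict(occ)})
    return clusters
```

Each resulting centroid, after filtering out low-weight terms, serves as one extraction pattern.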
15. Snowball: Extracting New Tuples
Match tagged text fragments against patterns:
"<Google>'s new headquarters in <Mountain View> are..."
(diagram: the fragment matches pattern P1 with similarity 0.8, P2 with 0.4, and P3 with 0)
16. Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
Judged against the current seed tuples:
- "IBM, Armonk, reported..." : Positive
- "Intel, Santa Clara, introduced..." : Positive
- "Bet on Microsoft, New York-based analyst Jane Smith said..." : Negative (the seed tuples place Microsoft in Redmond, not New York)
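A sketch of this Positive/Total estimate, judging each extraction against the current seed tuples (the helper name and seed list are illustrative):

```python
def pattern_confidence(extractions, seeds):
    """Conf(P) = Positive / Total over extractions a pattern produces:
    an extraction (org, loc) is positive if the seeds list loc as org's
    location, negative if the seeds list a different location."""
    pos = neg = 0
    seed_loc = dict(seeds)  # org -> location
    for org, loc in extractions:
        if org in seed_loc:
            if seed_loc[org] == loc:
                pos += 1
            else:
                neg += 1
    total = pos + neg
    return pos / total if total else 0.0

# Slide example: two positives, one negative -> 2/3
seeds = [("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "Redmond")]
extracted = [("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]
conf = pattern_confidence(extracted, seeds)
```

Extractions whose organization is not among the seeds are simply ignored in this estimate.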
17. Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:
Conf(T) = 1 - Π_i (1 - Conf(P_i) × Match_i)
A tuple has high confidence if generated by high-confidence patterns.
Example, tuple <3COM, Santa Clara>:
- P4 (Conf = 0.66) matches its context with similarity 0.4
- P3 (Conf = 0.95) matches its context with similarity 0.8
Conf(T) ≈ 0.83
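Snowball combines the evidence with a noisy-or style formula, Conf(T) = 1 - Π(1 - Conf(P_i) × Match_i); a minimal sketch using the slide's numbers:

```python
def tuple_confidence(evidence):
    """Conf(T) = 1 - prod(1 - Conf(P_i) * Match_i) over the pattern
    occurrences that generated tuple T: the tuple is believed unless
    every supporting (pattern, match) pair fails independently."""
    disbelief = 1.0
    for pattern_conf, match in evidence:
        disbelief *= (1.0 - pattern_conf * match)
    return 1.0 - disbelief

# Slide example for <3COM, Santa Clara>:
# P4 (Conf 0.66, match 0.4) and P3 (Conf 0.95, match 0.8)
conf = tuple_confidence([(0.66, 0.4), (0.95, 0.8)])
```

Plugging in the numbers gives 1 - (1 - 0.264)(1 - 0.76) ≈ 0.82, roughly the 0.83 shown on the slide.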
18. Snowball: Evaluating Tuples
Keep only high-confidence tuples for the next iteration.
19. Snowball: Evaluating Tuples
- Start a new iteration with the expanded example set
- Iterate until no new tuples are extracted
20. Pattern-Tuple Duality
- A good tuple:
  - Extracted by good patterns
  - Tuple weight ≈ goodness
- A good pattern:
  - Generated by good tuples
  - Extracts good new tuples
  - Pattern weight ≈ goodness
- Edge weight:
  - Match/similarity of tuple context to pattern
21. How to Set Node Weights
- Constraint violation (from before):
  - Conf(P) = Log(Pos) × Pos / (Pos + Neg)
  - Conf(T) as before
- HITS [Hassan et al., EMNLP 2006]:
  - Conf(P) = Σ Conf(T)
  - Conf(T) = Σ Conf(P)
- URNS [Downey et al., IJCAI 2005]
- EM-Spy [Agichtein, SDM 2006]:
  - Unknown tuples → Neg
  - Compute Conf(P), Conf(T)
  - Iterate
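The HITS-style weighting can be sketched as a small power iteration over the pattern-tuple graph; this is a simplification of Hassan et al.'s formulation, and the max-normalization used here is an assumption:

```python
def hits_confidences(edges, n_iter=50):
    """HITS-style mutual reinforcement on a pattern-tuple graph:
    Conf(P) = sum of Conf(T) over tuples P extracts, and
    Conf(T) = sum of Conf(P) over patterns extracting T,
    renormalized each round. `edges` is a list of (pattern, tuple) pairs."""
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        conf_p = {p: sum(conf_t[t] for q, t in edges if q == p) for p in patterns}
        conf_t = {t: sum(conf_p[p] for p, u in edges if u == t) for t in tuples_}
        zp = max(conf_p.values()) or 1.0  # normalize so scores stay bounded
        zt = max(conf_t.values()) or 1.0
        conf_p = {p: v / zp for p, v in conf_p.items()}
        conf_t = {t: v / zt for t, v in conf_t.items()}
    return conf_p, conf_t
```

A pattern that extracts many trusted tuples, and a tuple extracted by many trusted patterns, both drift toward the top of the ranking.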
22. Snowball: EM-based Pattern Evaluation
23. Evaluating Patterns and Tuples: Expectation Maximization
- EM-Spy Algorithm:
  - Hide labels for some seed tuples (the "spies")
  - Iterate the EM algorithm to convergence on tuple/pattern confidence values
  - Set threshold t such that 90% of spy tuples score above t
  - Re-initialize Snowball using the new seed tuples
24. Adapting Snowball for New Relations
- Large parameter space:
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, support, combinations
  - Feature weights: TF·IDF, TF, TF·NB, NB
  - Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
- Automatically estimate parameter values:
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal performance
  - Seed occurrences that do not have close neighbors are discarded
25. Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409, Snowball: 0.415
26. Example Task 2: Bioinformatics, a.k.a. Mining the "Bibliome"
ISMB 2003
"APO-1, also known as DR6..." "MEK4, also called SEK1..."
- 100,000 gene and protein synonyms extracted from 50,000 journal articles
- Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
27. Snowball Used in Various Domains
- News: NYT, WSJ, AP [DL 2000, SDM 2006]
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDRHealth, Micromedex [Thesis]
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB 2003]
  - Gene and Protein Synonyms
28. Limits of Bootstrapping for Extraction
CIKM 2005
- The task is easy when context term distributions diverge from the background
- Quantify this as relative entropy (Kullback-Leibler divergence)
- After calibration, the metric predicts whether bootstrapping is likely to work
29. Few Relations Cover Common Questions
SIGIR 2005
- 25 relations cover 50% of question types; 5 relations cover 55% of question instances
30. Outline
- Snowball, a domain-independent, partially supervised information extraction system
- Retrieval algorithms for scalable information extraction
- Current: mining user behavior for web search
- Future work
31. Extracting a Relation From a Large Text Database
Information Extraction System → Structured Relation
- Brute-force approach: feed all docs to the information extraction system
- Only a tiny fraction of documents are often useful
- Many databases are not crawlable
- Often a search interface is available, with an existing keyword index
- How to identify useful documents?
32. Accessing Text DBs via Search Engines
Text Database → Search Engine → Information Extraction System → Structured Relation
- Search engines impose limitations:
  - Limit on documents retrieved per query
  - Support simple keywords and phrases only
  - Ignore stopwords (e.g., "a", "is")
33. QXtract: Querying Text Databases for Robust Scalable Information EXtraction
User-Provided Seed Tuples → Query Generation → Queries → Promising Documents → Information Extraction System → Extracted Relation
Problem: learn keyword queries to retrieve promising documents.
34. Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
1. Get a document sample with likely negative and likely positive examples (seed sampling).
2. Label the sample documents using the information extraction system as an oracle.
3. Train classifiers to recognize useful documents.
4. Generate queries from the classifier model/rules.
35. Training Classifiers to Recognize Useful Documents
Document features: words. Labeled sample documents D1-D4 train three classifiers:
- Ripper: rules such as "disease AND reported → USEFUL"
- SVM and Okapi (IR): term weights over words such as disease, reported, epidemic, infected, virus, products, exported, used, far
36. Generating Queries from Classifiers
- Ripper rule → query "disease AND reported"
- SVM and Okapi (IR) top-weighted terms → queries such as "epidemic virus" and "virus infected"
- QCombined merges the individual classifiers' queries: "disease AND reported", "epidemic virus", ...
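One way to realize the query-generation step is to turn the highest-weighted positive features of a learned model into short conjunctive queries; the weights and the function below are hypothetical, not QXtract's actual model:

```python
def queries_from_classifier(feature_weights, n_terms=2, n_queries=3):
    """Turn a learned linear model into keyword queries: take the
    positive-weight features in decreasing order of weight and emit
    short conjunctions (a sketch of the model-to-queries idea)."""
    top = [f for f in sorted(feature_weights, key=feature_weights.get,
                             reverse=True) if feature_weights[f] > 0]
    queries = []
    for i in range(0, n_queries * n_terms, n_terms):
        terms = top[i:i + n_terms]
        if terms:
            queries.append(" AND ".join(terms))
    return queries

# Hypothetical SVM-style weights for the disease-outbreaks task
weights = {"disease": 2.1, "reported": 1.7, "epidemic": 1.4,
           "virus": 1.2, "infected": 0.9, "far": -0.3, "used": -0.8}
qs = queries_from_classifier(weights)
```

Negative-weight features (here "far", "used") are skipped, since they signal documents that are not useful.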
37. SIGMOD 2003 Demonstration
38. Tuples: A Simple Querying Strategy
Query "Ebola AND Zaire" → Search Engine → Information Extraction System
- Convert given tuples into queries
- Retrieve matching documents
- Extract new tuples from the documents and iterate
39. Comparison of Document Access Methods
- QXtract: 60% of the relation extracted from 10% of the documents of a 135,000 newspaper article database
- Tuples strategy: recall at most 46%
40. How to Choose the Best Strategy?
- Tuples: simple, no training, but limited recall
- QXtract: robust, but has training and query overhead
- Scan: no overhead, but must process all documents
41. Predicting Recall of the Tuples Strategy
WebDB 2003
Starting from one seed tuple may lead to SUCCESS, from another to FAILURE.
Can we predict whether Tuples will succeed?
42. Abstract the Problem: Querying Graph
Tuples (t1-t5) on one side, documents (d1-d5) on the other; a query such as "Ebola AND Zaire" sent to the search engine links a tuple to the documents it retrieves.
Note: only the top K docs are returned for each query.
- A query may retrieve many documents that do not contain tuples
- Searching for an extracted tuple may not retrieve its source document
43. Information Reachability Graph
Collapse the querying graph onto tuples: t1 retrieves document d1, which contains t2, so there is an edge t1 → t2. Here t2, t3, and t4 are reachable from t1.
44. Connected Components
- In: tuples that retrieve other tuples but are not themselves reachable
- Core: tuples that retrieve other tuples and each other (strongly connected)
- Out: reachable tuples that do not retrieve tuples in the Core
45. Sizes of Connected Components
How many tuples are in the largest Core + Out?
(diagram: In → Core (strongly connected) → Out components of varying sizes)
- Conjecture: the degree distribution in reachability graphs follows a power law.
- Then the reachability graph has at most one giant component.
- Define Reachability as the fraction of tuples in the largest Core + Out.
46. NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution (shown for MaxResults = 10 and MaxResults = 50).
47. NYT Component Size Distribution
- MaxResults = 10: |CG| / |T| = 0.297 (most tuples not reachable)
- MaxResults = 50: |CG| / |T| = 0.620 (most tuples reachable)
48. Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
49. Estimating Reachability
Chung and Lu, Annals of Combinatorics, 2002:
- In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1 (plus a condition on the power-law exponent b)
- Estimate Reachability = |CG| / |T|
- Depends only on d (the average outdegree)
50. Estimating Reachability: Algorithm
1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute reachability graph edges.
4. Estimate the average outdegree (in the example graph, d = 1.5).
5. Estimate reachability using the results of Chung and Lu [Annals of Combinatorics, 2002].
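The five steps above can be sketched as follows; `query` and `extract` stand in for the search engine and the extraction system, and the giant-component fraction is approximated with the classic random-graph fixed point s = 1 - exp(-d·s) rather than the full Chung-Lu result:

```python
import random
from math import exp

def estimate_reachability(tuples, query, extract, sample_size=50):
    """Sample random tuples, query with them, extract tuples from the
    matching documents, and use the average outdegree d to predict the
    giant-component (reachable) fraction of the graph."""
    sample = random.sample(tuples, min(sample_size, len(tuples)))
    outdegrees = []
    for t in sample:
        reached = set()
        for doc in query(t):       # documents matching tuple t as a query
            reached.update(extract(doc))
        reached.discard(t)
        outdegrees.append(len(reached))
    d = sum(outdegrees) / len(outdegrees)
    if d <= 1.0:
        return d, 0.0              # no giant component expected
    s = 0.5                        # iterate the fixed point s = 1 - e^(-d*s)
    for _ in range(200):
        s = 1.0 - exp(-d * s)
    return d, s
```

With d safely above 1 the predicted reachable fraction is large; near d = 1 it collapses, matching the MaxResults = 10 vs. 50 contrast on slide 47.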
51. Estimating Reachability of NYT
Estimated reachability ≈ 0.46. An approximate value is obtained after about 50 queries, and can be used to predict success (or failure) of the Tuples querying strategy.
52. To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
- Information extraction applications extract structured relations from unstructured text
"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis..."
Disease Outbreaks in The New York Times, extracted by an Information Extraction System (e.g., NYU's Proteus)
53. An Abstract View of Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Text Database → Extraction System
- Retrieve documents from the database
54. Executing a Text-Centric Task
Text Database → Extraction System
- Retrieve documents from the database
- Two major execution paradigms:
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., "case fatality rate"), retrieve and process the documents in the results
- Similar to the relational world: the underlying data distribution dictates what is best
- Unlike the relational world:
  - Indexes are only approximate: the index is on keywords, not on tuples of interest
  - Choice of execution plan affects output completeness (not only speed)
55. Execution Plan Characteristics
Question: how do we choose the fastest execution plan for reaching a target recall?
Text Database → Extraction System
- Execution plans have two main characteristics:
  - Execution time
  - Recall (fraction of tuples retrieved)
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
56. Outline
- Description and analysis of crawl- and query-based plans:
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
57. Scan
Text Database → Extraction System
- Scan retrieves and processes documents sequentially (until reaching the target recall)
- Execution time = |Retrieved Docs| × (R + P), where R is the time for retrieving a document and P the time for processing it
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).
58. Estimating Recall of Scan
- Modeling Scan for tuple t: what is the probability of seeing t (with document frequency g(t)) after retrieving S documents?
- A sampling-without-replacement process: after retrieving S documents, the frequency of tuple t follows a hypergeometric distribution
- Recall for tuple t is the probability that the frequency of t in the S retrieved docs is > 0
59. Estimating Recall of Scan
- Modeling Scan: multiple sampling-without-replacement processes, one for each tuple
- Overall recall is the average recall across tuples
- → We can compute the number of documents required to reach a target recall
Execution time = |Retrieved Docs| × (R + P)
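A sketch of this estimate: under sampling without replacement, a tuple occurring in g(t) of D documents is missed entirely with probability C(D - g(t), S) / C(D, S), so:

```python
from math import comb

def scan_recall(tuple_freqs, D, S):
    """Expected recall of Scan after retrieving S of D documents.
    A tuple with document frequency g is seen with probability
    1 - C(D - g, S) / C(D, S); overall recall is the average of this
    across all tuples (their frequencies given in tuple_freqs)."""
    def p_seen(g):
        if S > D - g:                  # retrieving S docs cannot avoid t
            return 1.0
        return 1.0 - comb(D - g, S) / comb(D, S)
    return sum(p_seen(g) for g in tuple_freqs) / len(tuple_freqs)

# Toy example: 3 tuples with document frequencies 1, 5, and 50 in D = 1000
r = scan_recall([1, 5, 50], D=1000, S=100)
```

Inverting this relation numerically gives the number of documents Scan must retrieve for a target recall, and hence its execution time.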
60. Iterative Set Expansion
Text Database → Extraction System → Query Generation → (back to the database)
- Query the database with seed tuples (e.g., "Ebola AND Zaire")
- Process the retrieved documents
- Augment the seed tuples with newly extracted tuples
- Execution time = |Retrieved Docs| × (R + P) + |Queries| × Q, where Q is the time for answering a query, R the time for retrieving a document, and P the time for processing a document
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
61. Using the Querying Graph for Analysis
Tuples (t1-t5) and documents (d1-d5) form a bipartite querying graph.
- We need to compute:
  - The number of documents retrieved after sending Q tuples as queries (estimates time)
  - The number of tuples that appear in the retrieved documents (estimates recall)
- To estimate these we need:
  - The degree distribution of the tuples discovered by retrieving documents
  - The degree distribution of the documents retrieved by the tuples
  - (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)
62. Summary of Cost Analysis
- Our analysis so far:
  - Takes as input a target recall
  - Gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach it)
- Time and recall depend on task-specific properties of the database:
  - Tuple degree distribution
  - Document degree distribution
- Next, we show how to estimate degree distributions on-the-fly
63. Estimating Cost Model Parameters
- Tuple and document degree distributions belong to known distribution families
- Can characterize the distributions with only a few parameters!
64. Parameter Estimation
- Naïve solution for parameter estimation:
  - Start with a separate parameter-estimation phase
  - Perform random sampling on the database
  - Stop when cross-validation indicates high confidence
- We can do better than this!
  - No need for a separate sampling phase
  - Sampling is equivalent to executing the task
  - → Piggyback parameter estimation onto execution
65. On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming default parameter values
- Start executing the task
- Update parameter estimates during execution (converging toward the correct but unknown distribution)
- Switch plans if the updated statistics indicate so
- Important:
  - Only Scan acts as random sampling
  - All other execution plans need parameter adjustment (see the paper)
66. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
67. Correctness of Theoretical Analysis
Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tuples
- Solid lines: actual time
- Dotted lines: predicted time with correct parameters
68. Experimental Results (Information Extraction)
- Solid lines: actual time
- Green line: time with the optimizer
- (Results similar in other experiments; see the paper)
69. Conclusions
- Common execution plans for multiple text-centric tasks
- Analytic models for predicting execution time and recall of various crawl- and query-based plans
- Techniques for on-the-fly parameter estimation
- Optimization framework picks on-the-fly the fastest plan for the target recall
70. Can We Do Better?
- Yes, for some information extraction systems.
71. Bindings Engine (BE) [slides: Cafarella 2005]
- Bindings Engine (BE) is a search engine where:
  - There are no downloads during query processing
  - Disk seeks are constant in corpus size
  - Queries are phrases
- BE's approach:
  - Variabilized search query language
  - Pre-processes all documents before query-time
  - Integrates variable/type data with the inverted index, minimizing query seeks
72. BE Query Support
- "cities such as <NounPhrase>"
- "President Bush <Verb>"
- "<NounPhrase> is the capital of <NounPhrase>"
- "reach me at <phone-number>"
- Any sequence of concrete terms and typed variables
- NEAR is insufficient
- Functions (e.g., head()) are needed
73. BE Operation
- Like a generic search engine, BE:
  - Downloads a corpus of pages
  - Creates an index
  - Uses the index to process queries efficiently
- BE further requires:
  - A set of indexed types (e.g., NounPhrase), with a recognizer for each
  - String processing functions (e.g., head())
- A BE system can only process types and functions that its index supports
75. Query: "such as"
Intersect the sorted docid lists for "such" and "as" (docid0 ... docid#docs-1):
- Test for equality
- Advance the smaller pointer
- Abort when a list is exhausted
(diagram: doc 322 appears in both lists and is among the returned docs)
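The three merge rules above are the standard sorted-list intersection; a minimal sketch (the docids are illustrative):

```python
def intersect_postings(list_a, list_b):
    """Merge two sorted docid lists as on the slide: test for equality,
    advance the smaller pointer, abort when either list is exhausted."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical posting lists for "such" and "as"
docs_such = [3, 19, 322, 501]
docs_as = [19, 40, 322]
hits = intersect_postings(docs_such, docs_as)
```

The merge costs a single sequential pass over each list, which is what keeps BE's disk seeks independent of corpus size.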
76. Query: "such as" (phrase)
In phrase queries, match positions as well.
77. Neighbor Index
- At each position in the index, store neighbor text that might be useful
- Let's index "I love cities such as Atlanta."
(diagram: the position of "cities" stores the left adjacent term, AdjT = "love")
78. Neighbor Index
- At each position in the index, store neighbor text that might be useful
- Let's index "I love cities such as Atlanta."
(diagram: positions store both adjacent-term (AdjT) and noun-phrase (NP) neighbors, e.g., AdjT = "cities", NP = "cities"; AdjT = "I", NP = "I")
79. Neighbor Index
Query: "cities such as <NounPhrase>" over "I love cities such as Atlanta."
(diagram: at "as", the stored right neighbors AdjT = "Atlanta" and NP = "Atlanta" supply the variable binding; the left neighbor is AdjT = "such")
80. Query: "cities such as <NounPhrase>"
Posting lists store, for each docid, the positions plus a block of neighbor entries (blk_offset and neighbor strings). In doc 19, starting at position 8: "I love cities such as Atlanta." The entry at "as" stores NP(right) = "Atlanta" and AdjT(left) = "such".
- Find phrase query positions, as with phrase queries
- If a term is adjacent to a variable, extract the typed value
81. Current Research Directions
- Modeling explicit and implicit network structures
  - Modeling evolution of explicit structure on the web, blogspace, wikipedia
  - Modeling implicit link structures in text, collections, web
  - Exploiting implicit and explicit social networks (e.g., for epidemiology)
- Knowledge discovery from biological and medical data
  - Automatic sequence annotation: bioinformatics, genetics
  - Actionable knowledge extraction from medical articles
- Robust information extraction, retrieval, and query processing
  - Integrating information in structured and unstructured sources
  - Robust search/question answering for medical applications
  - Confidence estimation for extraction from text and other sources
  - Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  - Accuracy (not authority) of online sources
- Information diffusion/propagation in online sources
  - Information propagation on the web
  - In collaborative sources (wikipedia, MedLine)
82. Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."
83. Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
84. Modeling Social Networks for
Email exchange mapped onto cubicle locations.
85. Some Research Directions
(Repeats the research directions list from slide 81, adding "Query processing over unstructured text" under robust information extraction, retrieval, and query processing.)
86. Mining Text and Sequence Data [Agichtein and Eskin, PSB 2004]
ROC50 scores for each class and method.
87. Some Research Directions
(Repeats the research directions list from slide 81.)
88. Structure and Evolution of Blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004; KDD 2006]
Fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraph, by week.
89. Current Research Directions
(Repeats the research directions list from slide 81, with "Information propagation on the web" extended to "the web, news".)
90. Thank You
- Details: http://www.mathcs.emory.edu/~eugene/