Title: Scalable Information Extraction
1. Scalable Information Extraction
2. Example: Angina Treatments
Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)
Medical reference and literature
Web search results
3. Research Goal
- Accurate, intuitive, and efficient access to knowledge in unstructured sources
- Approaches:
  - Information Retrieval
    - Retrieve the relevant documents or passages
    - Question answering
  - Human Reading
    - Construct domain-specific verticals (MedLine)
  - Machine Reading
    - Extract entities and relationships
    - Network of relationships → Semantic Web
4. Semantic Relationships Buried in Unstructured Text
RecommendedTreatment
"A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris."
- Web, newsgroups, web logs
- Text databases (PubMed, CiteSeer, etc.)
- Newspaper archives:
  - Corporate mergers, succession, location
  - Terrorist attacks
5. What Structured Representation Can Do for You
Structured Relation
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMS
- provide useful content for Semantic Web
6. Challenges in Information Extraction
- Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec/document × 5 billion docs ≈ 158 CPU years
- Approach: learn from data ("Bootstrapping")
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
7. Outline
- Snowball: partially supervised information extraction (overview and key results)
- Effective retrieval algorithms for information extraction (in detail)
- Current: mining user behavior for web search
- Future work
8. The Snowball System: Overview
Snowball
9. Snowball: Getting User Input
ACM DL 2000
- User input:
  - a handful of example instances
  - integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.
10. Snowball: Finding Example Occurrences
Can use any full-text search engine.
Search Engine
"Computer servers at Microsoft's headquarters in Redmond..." "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp..." "The Armonk-based IBM introduced a new line..." "Change of guard at IBM Corporation's headquarters near Armonk, NY..." ...
11. Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations: MITRE's Alembic, IBM's Talent, LingPipe, ...
"Computer servers at Microsoft's headquarters in Redmond..." "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp..." "The Armonk-based IBM introduced a new line..." "Change of guard at IBM Corporation's headquarters near Armonk, NY..." ... (organizations and locations tagged in each snippet)
12. Snowball: Extraction Patterns
- General extraction pattern model: acceptor0, <Entity>, acceptor1, <Entity>, acceptor2
- Acceptor instantiations:
  - String Match (accepts the string "'s headquarters in")
  - Vector-Space (vector ('s, 0.5), (headquarters, 0.5), (in, 0.5))
  - Classifier (estimate P(T valid | 's, headquarters, in))
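To make the vector-space acceptor concrete, here is a minimal sketch of scoring an occurrence's middle context against a pattern's middle-context vector; the vectors and names below are illustrative, not Snowball's actual data structures:

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical vector-space acceptor: pattern middle-context vector
# (echoing the slide's weights) vs. the context of a new occurrence.
pattern_mid = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
occurrence_mid = {"'s": 0.6, "headquarters": 0.6, "near": 0.5}
sim = cosine(pattern_mid, occurrence_mid)  # high, but below 1: "near" != "in"
```

An occurrence is accepted when this similarity exceeds the acceptor's threshold.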
13. Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
(diagram: <ORGANIZATION> ... <LOCATION> occurrence vectors grouped by similarity)
14. Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids.
(diagram: centroid pattern <ORGANIZATION> ... <LOCATION>)
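The three generation steps above (represent occurrences as vectors, cluster, take centroids) might be sketched as follows; the greedy single-pass clustering and the threshold value are assumptions for illustration:

```python
from math import sqrt

def cosine(v1, v2):
    # cosine similarity of sparse dict vectors
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def centroid(vectors):
    # average a list of sparse term-weight vectors
    out = {}
    for v in vectors:
        for t, w in v.items():
            out[t] = out.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in out.items()}

def cluster_occurrences(occurrences, threshold=0.8):
    # single-pass clustering: join the first cluster whose centroid is
    # similar enough, else start a new cluster
    clusters = []
    for occ in occurrences:
        for c in clusters:
            if cosine(occ, c["centroid"]) >= threshold:
                c["members"].append(occ)
                c["centroid"] = centroid(c["members"])
                break
        else:
            clusters.append({"members": [occ], "centroid": dict(occ)})
    return clusters
```

Each resulting centroid, after filtering out low-weight terms, serves as one extraction pattern.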
15. Snowball: Extracting New Tuples
Match tagged text fragments against patterns:
"<Google>'s new headquarters in <Mountain View> are..."
(diagram: the fragment matches pattern P1 with similarity 0.8, P2 with 0.4, and P3 with 0)
16. Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
Judged against the current seed tuples:
- "IBM, Armonk, reported..." : Positive
- "Intel, Santa Clara, introduced..." : Positive
- "Bet on Microsoft, New York-based analyst Jane Smith said..." : Negative (the seed tuples place Microsoft in Redmond, not New York)
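A sketch of this Positive/Total estimate, judging each extraction against the current seed tuples (the helper name and seed list are illustrative):

```python
def pattern_confidence(extractions, seeds):
    """Conf(P) = Positive / Total over extractions a pattern produces:
    an extraction (org, loc) is positive if the seeds list loc as org's
    location, negative if the seeds list a different location."""
    pos = neg = 0
    seed_loc = dict(seeds)  # org -> location
    for org, loc in extractions:
        if org in seed_loc:
            if seed_loc[org] == loc:
                pos += 1
            else:
                neg += 1
    total = pos + neg
    return pos / total if total else 0.0

# Slide example: two positives, one negative -> 2/3
seeds = [("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "Redmond")]
extracted = [("IBM", "Armonk"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]
conf = pattern_confidence(extracted, seeds)
```

Extractions whose organization is not among the seeds are simply ignored in this estimate.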
17. Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:
Conf(T) = 1 - Π_i (1 - Conf(P_i) × Match_i)
A tuple has high confidence if generated by high-confidence patterns.
Example, tuple <3COM, Santa Clara>:
- P4 (Conf = 0.66) matches its context with similarity 0.4
- P3 (Conf = 0.95) matches its context with similarity 0.8
Conf(T) ≈ 0.83
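Snowball combines the evidence with a noisy-or style formula, Conf(T) = 1 - Π(1 - Conf(P_i) × Match_i); a minimal sketch using the slide's numbers:

```python
def tuple_confidence(evidence):
    """Conf(T) = 1 - prod(1 - Conf(P_i) * Match_i) over the pattern
    occurrences that generated tuple T: the tuple is believed unless
    every supporting (pattern, match) pair fails independently."""
    disbelief = 1.0
    for pattern_conf, match in evidence:
        disbelief *= (1.0 - pattern_conf * match)
    return 1.0 - disbelief

# Slide example for <3COM, Santa Clara>:
# P4 (Conf 0.66, match 0.4) and P3 (Conf 0.95, match 0.8)
conf = tuple_confidence([(0.66, 0.4), (0.95, 0.8)])
```

Plugging in the numbers gives 1 - (1 - 0.264)(1 - 0.76) ≈ 0.82, roughly the 0.83 shown on the slide.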
18. Snowball: Evaluating Tuples
Keep only high-confidence tuples for the next iteration.
19. Snowball: Evaluating Tuples
- Start a new iteration with the expanded example set
- Iterate until no new tuples are extracted
20. Pattern-Tuple Duality
- A good tuple:
  - Extracted by good patterns
  - Tuple weight ≈ goodness
- A good pattern:
  - Generated by good tuples
  - Extracts good new tuples
  - Pattern weight ≈ goodness
- Edge weight:
  - Match/similarity of tuple context to pattern
21. How to Set Node Weights
- Constraint violation (from before):
  - Conf(P) = Log(Pos) × Pos / (Pos + Neg)
  - Conf(T) as before
- HITS [Hassan et al., EMNLP 2006]:
  - Conf(P) = Σ Conf(T)
  - Conf(T) = Σ Conf(P)
- URNS [Downey et al., IJCAI 2005]
- EM-Spy [Agichtein, SDM 2006]:
  - Unknown tuples → Neg
  - Compute Conf(P), Conf(T)
  - Iterate
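The HITS-style weighting can be sketched as a small power iteration over the pattern-tuple graph; this is a simplification of Hassan et al.'s formulation, and the max-normalization used here is an assumption:

```python
def hits_confidences(edges, n_iter=50):
    """HITS-style mutual reinforcement on a pattern-tuple graph:
    Conf(P) = sum of Conf(T) over tuples P extracts, and
    Conf(T) = sum of Conf(P) over patterns extracting T,
    renormalized each round. `edges` is a list of (pattern, tuple) pairs."""
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        conf_p = {p: sum(conf_t[t] for q, t in edges if q == p) for p in patterns}
        conf_t = {t: sum(conf_p[p] for p, u in edges if u == t) for t in tuples_}
        zp = max(conf_p.values()) or 1.0  # normalize so scores stay bounded
        zt = max(conf_t.values()) or 1.0
        conf_p = {p: v / zp for p, v in conf_p.items()}
        conf_t = {t: v / zt for t, v in conf_t.items()}
    return conf_p, conf_t
```

A pattern that extracts many trusted tuples, and a tuple extracted by many trusted patterns, both drift toward the top of the ranking.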
22. Snowball: EM-based Pattern Evaluation
23. Evaluating Patterns and Tuples: Expectation Maximization
- EM-Spy Algorithm:
  - Hide labels for some seed tuples (the "spies")
  - Iterate the EM algorithm to convergence on tuple/pattern confidence values
  - Set threshold t such that 90% of spy tuples score above t
  - Re-initialize Snowball using the new seed tuples
24. Adapting Snowball for New Relations
- Large parameter space:
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, support, combinations
  - Feature weights: TF·IDF, TF, TF·NB, NB
  - Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
- Automatically estimate parameter values:
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal performance
  - Seed occurrences that do not have close neighbors are discarded
25. Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409, Snowball: 0.415
26. Example Task 2: Bioinformatics, a.k.a. Mining the "Bibliome"
ISMB 2003
"APO-1, also known as DR6..." "MEK4, also called SEK1..."
- 100,000 gene and protein synonyms extracted from 50,000 journal articles
- Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
27. Snowball Used in Various Domains
- News: NYT, WSJ, AP [DL 2000, SDM 2006]
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDRHealth, Micromedex [Thesis]
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB 2003]
  - Gene and Protein Synonyms
28. Limits of Bootstrapping for Extraction
CIKM 2005
- The task is easy when context term distributions diverge from the background
- Quantify this as relative entropy (Kullback-Leibler divergence)
- After calibration, the metric predicts whether bootstrapping is likely to work
29. Few Relations Cover Common Questions
SIGIR 2005
- 25 relations cover 50% of question types; 5 relations cover 55% of question instances
30. Outline
- Snowball, a domain-independent, partially supervised information extraction system
- Retrieval algorithms for scalable information extraction
- Current: mining user behavior for web search
- Future work
31. Extracting a Relation From a Large Text Database
Information Extraction System → Structured Relation
- Brute-force approach: feed all docs to the information extraction system
- Only a tiny fraction of documents are often useful
- Many databases are not crawlable
- Often a search interface is available, with an existing keyword index
- How to identify useful documents?
32. Accessing Text DBs via Search Engines
Text Database → Search Engine → Information Extraction System → Structured Relation
- Search engines impose limitations:
  - Limit on documents retrieved per query
  - Support simple keywords and phrases only
  - Ignore stopwords (e.g., "a", "is")
33. QXtract: Querying Text Databases for Robust Scalable Information EXtraction
User-Provided Seed Tuples → Query Generation → Queries → Promising Documents → Information Extraction System → Extracted Relation
Problem: learn keyword queries to retrieve promising documents.
34. Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
1. Get a document sample with likely negative and likely positive examples (seed sampling).
2. Label the sample documents using the information extraction system as an oracle.
3. Train classifiers to recognize useful documents.
4. Generate queries from the classifier model/rules.
35. Training Classifiers to Recognize Useful Documents
Document features: words. Labeled sample documents D1-D4 train three classifiers:
- Ripper: rules such as "disease AND reported → USEFUL"
- SVM and Okapi (IR): term weights over words such as disease, reported, epidemic, infected, virus, products, exported, used, far
36. Generating Queries from Classifiers
- Ripper rule → query "disease AND reported"
- SVM and Okapi (IR) top-weighted terms → queries such as "epidemic virus" and "virus infected"
- QCombined merges the individual classifiers' queries: "disease AND reported", "epidemic virus", ...
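One way to realize the query-generation step is to turn the highest-weighted positive features of a learned model into short conjunctive queries; the weights and the function below are hypothetical, not QXtract's actual model:

```python
def queries_from_classifier(feature_weights, n_terms=2, n_queries=3):
    """Turn a learned linear model into keyword queries: take the
    positive-weight features in decreasing order of weight and emit
    short conjunctions (a sketch of the model-to-queries idea)."""
    top = [f for f in sorted(feature_weights, key=feature_weights.get,
                             reverse=True) if feature_weights[f] > 0]
    queries = []
    for i in range(0, n_queries * n_terms, n_terms):
        terms = top[i:i + n_terms]
        if terms:
            queries.append(" AND ".join(terms))
    return queries

# Hypothetical SVM-style weights for the disease-outbreaks task
weights = {"disease": 2.1, "reported": 1.7, "epidemic": 1.4,
           "virus": 1.2, "infected": 0.9, "far": -0.3, "used": -0.8}
qs = queries_from_classifier(weights)
```

Negative-weight features (here "far", "used") are skipped, since they signal documents that are not useful.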
37. SIGMOD 2003 Demonstration
38. Tuples: A Simple Querying Strategy
Query "Ebola AND Zaire" → Search Engine → Information Extraction System
- Convert given tuples into queries
- Retrieve matching documents
- Extract new tuples from the documents and iterate
39. Comparison of Document Access Methods
- QXtract: 60% of the relation extracted from 10% of the documents of a 135,000 newspaper article database
- Tuples strategy: recall at most 46%
40. How to Choose the Best Strategy?
- Tuples: simple, no training, but limited recall
- QXtract: robust, but has training and query overhead
- Scan: no overhead, but must process all documents
41. Predicting Recall of the Tuples Strategy
WebDB 2003
Starting from one seed tuple may lead to SUCCESS, from another to FAILURE.
Can we predict whether Tuples will succeed?
42. Abstract the Problem: Querying Graph
Tuples (t1-t5) on one side, documents (d1-d5) on the other; a query such as "Ebola AND Zaire" sent to the search engine links a tuple to the documents it retrieves.
Note: only the top K docs are returned for each query.
- A query may retrieve many documents that do not contain tuples
- Searching for an extracted tuple may not retrieve its source document
43. Information Reachability Graph
Collapse the querying graph onto tuples: t1 retrieves document d1, which contains t2, so there is an edge t1 → t2. Here t2, t3, and t4 are reachable from t1.
44. Connected Components
- In: tuples that retrieve other tuples but are not themselves reachable
- Core: tuples that retrieve other tuples and each other (strongly connected)
- Out: reachable tuples that do not retrieve tuples in the Core
45. Sizes of Connected Components
How many tuples are in the largest Core + Out?
(diagram: In → Core (strongly connected) → Out components of varying sizes)
- Conjecture: the degree distribution in reachability graphs follows a power law.
- Then the reachability graph has at most one giant component.
- Define Reachability as the fraction of tuples in the largest Core + Out.
46. NYT Reachability Graph: Outdegree Distribution
Matches the power-law distribution (shown for MaxResults = 10 and MaxResults = 50).
47. NYT Component Size Distribution
- MaxResults = 10: |CG| / |T| = 0.297 (most tuples not reachable)
- MaxResults = 50: |CG| / |T| = 0.620 (most tuples reachable)
48. Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
49. Estimating Reachability
Chung and Lu, Annals of Combinatorics, 2002:
- In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1 (plus a condition on the power-law exponent b)
- Estimate Reachability = |CG| / |T|
- Depends only on d (the average outdegree)
50. Estimating Reachability: Algorithm
1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute reachability graph edges.
4. Estimate the average outdegree (in the example graph, d = 1.5).
5. Estimate reachability using the results of Chung and Lu [Annals of Combinatorics, 2002].
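The five steps above can be sketched as follows; `query` and `extract` stand in for the search engine and the extraction system, and the giant-component fraction is approximated with the classic random-graph fixed point s = 1 - exp(-d·s) rather than the full Chung-Lu result:

```python
import random
from math import exp

def estimate_reachability(tuples, query, extract, sample_size=50):
    """Sample random tuples, query with them, extract tuples from the
    matching documents, and use the average outdegree d to predict the
    giant-component (reachable) fraction of the graph."""
    sample = random.sample(tuples, min(sample_size, len(tuples)))
    outdegrees = []
    for t in sample:
        reached = set()
        for doc in query(t):       # documents matching tuple t as a query
            reached.update(extract(doc))
        reached.discard(t)
        outdegrees.append(len(reached))
    d = sum(outdegrees) / len(outdegrees)
    if d <= 1.0:
        return d, 0.0              # no giant component expected
    s = 0.5                        # iterate the fixed point s = 1 - e^(-d*s)
    for _ in range(200):
        s = 1.0 - exp(-d * s)
    return d, s
```

With d safely above 1 the predicted reachable fraction is large; near d = 1 it collapses, matching the MaxResults = 10 vs. 50 contrast on slide 47.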
51. Estimating Reachability of NYT
Estimated reachability ≈ 0.46. An approximate value is obtained after about 50 queries, and can be used to predict success (or failure) of the Tuples querying strategy.
52. To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
- Information extraction applications extract structured relations from unstructured text
"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis..."
Disease Outbreaks in The New York Times, extracted by an Information Extraction System (e.g., NYU's Proteus)
53. An Abstract View of Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Text Database → Extraction System
- Retrieve documents from the database
54. Executing a Text-Centric Task
Text Database → Extraction System
- Retrieve documents from the database
- Two major execution paradigms:
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., "case fatality rate"), retrieve and process the documents in the results
- Similar to the relational world: the underlying data distribution dictates what is best
- Unlike the relational world:
  - Indexes are only approximate: the index is on keywords, not on tuples of interest
  - Choice of execution plan affects output completeness (not only speed)
55. Execution Plan Characteristics
Question: how do we choose the fastest execution plan for reaching a target recall?
Text Database → Extraction System
- Execution plans have two main characteristics:
  - Execution time
  - Recall (fraction of tuples retrieved)
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
56. Outline
- Description and analysis of crawl- and query-based plans:
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
57. Scan
Text Database → Extraction System
- Scan retrieves and processes documents sequentially (until reaching the target recall)
- Execution time = |Retrieved Docs| × (R + P), where R is the time for retrieving a document and P the time for processing it
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).
58. Estimating Recall of Scan
- Modeling Scan for tuple t: what is the probability of seeing t (with document frequency g(t)) after retrieving S documents?
- A sampling-without-replacement process: after retrieving S documents, the frequency of tuple t follows a hypergeometric distribution
- Recall for tuple t is the probability that the frequency of t in the S retrieved docs is > 0
59. Estimating Recall of Scan
- Modeling Scan: multiple sampling-without-replacement processes, one for each tuple
- Overall recall is the average recall across tuples
- → We can compute the number of documents required to reach a target recall
Execution time = |Retrieved Docs| × (R + P)
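A sketch of this estimate: under sampling without replacement, a tuple occurring in g(t) of D documents is missed entirely with probability C(D - g(t), S) / C(D, S), so:

```python
from math import comb

def scan_recall(tuple_freqs, D, S):
    """Expected recall of Scan after retrieving S of D documents.
    A tuple with document frequency g is seen with probability
    1 - C(D - g, S) / C(D, S); overall recall is the average of this
    across all tuples (their frequencies given in tuple_freqs)."""
    def p_seen(g):
        if S > D - g:                  # retrieving S docs cannot avoid t
            return 1.0
        return 1.0 - comb(D - g, S) / comb(D, S)
    return sum(p_seen(g) for g in tuple_freqs) / len(tuple_freqs)

# Toy example: 3 tuples with document frequencies 1, 5, and 50 in D = 1000
r = scan_recall([1, 5, 50], D=1000, S=100)
```

Inverting this relation numerically gives the number of documents Scan must retrieve for a target recall, and hence its execution time.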
60. Iterative Set Expansion
Text Database → Extraction System → Query Generation → (back to the database)
- Query the database with seed tuples (e.g., "Ebola AND Zaire")
- Process the retrieved documents
- Augment the seed tuples with newly extracted tuples
- Execution time = |Retrieved Docs| × (R + P) + |Queries| × Q, where Q is the time for answering a query, R the time for retrieving a document, and P the time for processing a document
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
61. Using the Querying Graph for Analysis
Tuples (t1-t5) and documents (d1-d5) form a bipartite querying graph.
- We need to compute:
  - The number of documents retrieved after sending Q tuples as queries (estimates time)
  - The number of tuples that appear in the retrieved documents (estimates recall)
- To estimate these we need:
  - The degree distribution of the tuples discovered by retrieving documents
  - The degree distribution of the documents retrieved by the tuples
  - (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)
62. Summary of Cost Analysis
- Our analysis so far:
  - Takes as input a target recall
  - Gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach it)
- Time and recall depend on task-specific properties of the database:
  - Tuple degree distribution
  - Document degree distribution
- Next, we show how to estimate degree distributions on-the-fly
63. Estimating Cost Model Parameters
- Tuple and document degree distributions belong to known distribution families
- Can characterize the distributions with only a few parameters!
64. Parameter Estimation
- Naïve solution for parameter estimation:
  - Start with a separate parameter-estimation phase
  - Perform random sampling on the database
  - Stop when cross-validation indicates high confidence
- We can do better than this!
  - No need for a separate sampling phase
  - Sampling is equivalent to executing the task
  - → Piggyback parameter estimation onto execution
65. On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming default parameter values
- Start executing the task
- Update parameter estimates during execution (converging toward the correct but unknown distribution)
- Switch plans if the updated statistics indicate so
- Important:
  - Only Scan acts as random sampling
  - All other execution plans need parameter adjustment (see the paper)
66. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
67. Correctness of Theoretical Analysis
Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tuples
- Solid lines: actual time
- Dotted lines: predicted time with correct parameters
68. Experimental Results (Information Extraction)
- Solid lines: actual time
- Green line: time with the optimizer
- (Results similar in other experiments; see the paper)
69. Conclusions
- Common execution plans for multiple text-centric tasks
- Analytic models for predicting execution time and recall of various crawl- and query-based plans
- Techniques for on-the-fly parameter estimation
- Optimization framework picks on-the-fly the fastest plan for the target recall
70. Can We Do Better?
- Yes, for some information extraction systems.
71. Bindings Engine (BE) [slides: Cafarella 2005]
- Bindings Engine (BE) is a search engine where:
  - There are no downloads during query processing
  - Disk seeks are constant in corpus size
  - Queries are phrases
- BE's approach:
  - Variabilized search query language
  - Pre-processes all documents before query-time
  - Integrates variable/type data with the inverted index, minimizing query seeks
72. BE Query Support
- "cities such as <NounPhrase>"
- "President Bush <Verb>"
- "<NounPhrase> is the capital of <NounPhrase>"
- "reach me at <phone-number>"
- Any sequence of concrete terms and typed variables
- NEAR is insufficient
- Functions (e.g., head()) are needed
73. BE Operation
- Like a generic search engine, BE:
  - Downloads a corpus of pages
  - Creates an index
  - Uses the index to process queries efficiently
- BE further requires:
  - A set of indexed types (e.g., NounPhrase), with a recognizer for each
  - String processing functions (e.g., head())
- A BE system can only process types and functions that its index supports
75. Query: "such as"
Intersect the sorted docid lists for "such" and "as" (docid0 ... docid#docs-1):
- Test for equality
- Advance the smaller pointer
- Abort when a list is exhausted
(diagram: doc 322 appears in both lists and is among the returned docs)
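The three merge rules above are the standard sorted-list intersection; a minimal sketch (the docids are illustrative):

```python
def intersect_postings(list_a, list_b):
    """Merge two sorted docid lists as on the slide: test for equality,
    advance the smaller pointer, abort when either list is exhausted."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical posting lists for "such" and "as"
docs_such = [3, 19, 322, 501]
docs_as = [19, 40, 322]
hits = intersect_postings(docs_such, docs_as)
```

The merge costs a single sequential pass over each list, which is what keeps BE's disk seeks independent of corpus size.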
76. Query: "such as" (phrase)
In phrase queries, match positions as well.
77. Neighbor Index
- At each position in the index, store neighbor text that might be useful
- Let's index "I love cities such as Atlanta."
(diagram: the position of "cities" stores the left adjacent term, AdjT = "love")
78. Neighbor Index
- At each position in the index, store neighbor text that might be useful
- Let's index "I love cities such as Atlanta."
(diagram: positions store both adjacent-term (AdjT) and noun-phrase (NP) neighbors, e.g., AdjT = "cities", NP = "cities"; AdjT = "I", NP = "I")
79. Neighbor Index
Query: "cities such as <NounPhrase>" over "I love cities such as Atlanta."
(diagram: at "as", the stored right neighbors AdjT = "Atlanta" and NP = "Atlanta" supply the variable binding; the left neighbor is AdjT = "such")
80. Query: "cities such as <NounPhrase>"
Posting lists store, for each docid, the positions plus a block of neighbor entries (blk_offset and neighbor strings). In doc 19, starting at position 8: "I love cities such as Atlanta." The entry at "as" stores NP(right) = "Atlanta" and AdjT(left) = "such".
- Find phrase query positions, as with phrase queries
- If a term is adjacent to a variable, extract the typed value
81. Current Research Directions
- Modeling explicit and implicit network structures
  - Modeling evolution of explicit structure on the web, blogspace, wikipedia
  - Modeling implicit link structures in text, collections, web
  - Exploiting implicit and explicit social networks (e.g., for epidemiology)
- Knowledge discovery from biological and medical data
  - Automatic sequence annotation: bioinformatics, genetics
  - Actionable knowledge extraction from medical articles
- Robust information extraction, retrieval, and query processing
  - Integrating information in structured and unstructured sources
  - Robust search/question answering for medical applications
  - Confidence estimation for extraction from text and other sources
  - Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  - Accuracy (not authority) of online sources
- Information diffusion/propagation in online sources
  - Information propagation on the web
  - In collaborative sources (wikipedia, MedLine)
82. Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."
83. Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
84. Modeling Social Networks for
Email exchange mapped onto cubicle locations.
85. Some Research Directions
(Repeats the research directions list from slide 81, adding "Query processing over unstructured text" under robust information extraction, retrieval, and query processing.)
86. Mining Text and Sequence Data [Agichtein and Eskin, PSB 2004]
ROC50 scores for each class and method.
87. Some Research Directions
(Repeats the research directions list from slide 81.)
88. Structure and Evolution of Blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004; KDD 2006]
Fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraph, by week.
89. Current Research Directions
(Repeats the research directions list from slide 81, with "Information propagation on the web" extended to "the web, news".)
90. Thank You
- Details: http://www.mathcs.emory.edu/~eugene/