Title: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
1. To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
- Panos Ipeirotis New York University
- Eugene Agichtein Microsoft Research
- Pranay Jain Columbia University
- Luis Gravano Columbia University
2. Text-Centric Task I: Information Extraction
- Information extraction applications extract
structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire
Information Extraction System (e.g., NYU's Proteus)
Information Extraction tutorial yesterday by
AnHai Doan, Raghu Ramakrishnan, Shivakumar
Vaithyanathan
3. Other Text-Centric Tasks
- Task II: Database Selection
- Task III: Focused Crawling
Details in the paper
4. An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → Output Tokens]
- Retrieve documents from the database
- Process documents
- Extract output tokens
Task | Token
Information Extraction | Relation Tuple
Database Selection | Word (Frequency)
Focused Crawling | Web Page about a Topic
5. Executing a Text-Centric Task
- Two major execution paradigms:
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
- Similar to the relational world: the underlying data distribution dictates which plan is best
- Unlike the relational world:
  - indexes are only approximate: the index is on keywords, not on the tokens of interest
  - the choice of execution plan affects output completeness (not only speed)
6. Execution Plan Characteristics
Question: How do we choose the fastest execution plan for reaching a target recall?
- Execution plans have two main characteristics:
  - execution time
  - recall (fraction of tokens retrieved)
What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?
7. Outline
- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
8. Scan
- Scan retrieves and processes documents sequentially (until reaching the target recall)
- Execution time = |Retrieved Docs| · (tR + tP), where tR is the time for retrieving a document and tP the time for processing it
Question: How many documents does Scan retrieve to reach the target recall?
- Filtered Scan uses a classifier to identify and process only promising documents (details in the paper)
9. Estimating Recall of Scan
Example token: <SARS, China>
- Modeling Scan for token t:
  - What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
  - It is a sampling-without-replacement process
  - After retrieving S documents, the frequency of token t follows a hypergeometric distribution
  - Recall for token t is the probability that the frequency of t in the S documents is greater than 0
10. Estimating Recall of Scan
Example tokens: <SARS, China>, <Ebola, Zaire>
- Modeling Scan:
  - multiple sampling-without-replacement processes, one for each token
  - overall recall is the average recall across tokens
- → We can compute the number of documents required to reach the target recall
- Execution time = |Retrieved Docs| · (tR + tP)
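The hypergeometric model of the last two slides can be sketched directly in Python (a minimal illustration, not the paper's code; `D` is the database size and `freqs` holds the token frequencies g(t)):

```python
from math import comb

def token_recall(D, g, S):
    """P(a token with frequency g is seen in a random sample of S of D docs).

    Sampling without replacement: the number of sampled documents that
    contain the token is hypergeometric, so
    P(seen) = 1 - C(D - g, S) / C(D, S).
    """
    if S > D - g:  # the sample is too large to miss the token entirely
        return 1.0
    return 1 - comb(D - g, S) / comb(D, S)

def docs_for_recall(D, freqs, target):
    """Smallest sample size S whose expected recall (the average of
    token_recall over all tokens) reaches the target. Recall is monotone
    in S, so a linear scan suffices (binary search would be faster)."""
    for S in range(D + 1):
        if sum(token_recall(D, g, S) for g in freqs) / len(freqs) >= target:
            return S
    return D
```

For instance, with D = 10 and a single token that appears in one document, recall after S documents is S/10, so reaching 50% recall takes 5 documents.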
11. Outline
- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based: Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
12. Iterative Set Expansion
[Diagram: Text Database → Extraction System → Output Tokens, with a Query Generation loop]
- Query the database with seed tokens (e.g., [Ebola AND Zaire])
- Process the retrieved documents
- Extract tokens from the documents (e.g., <Malaria, Ethiopia>)
- Augment the seed tokens with the new tokens
- Execution time = |Retrieved Docs| · (tR + tP) + |Queries| · tQ, where tQ is the time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
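The two cost formulas so far can be compared side by side (an illustrative sketch; the parameter names tR, tP, tQ follow the slides):

```python
def scan_time(num_docs, t_r, t_p):
    # Time(Scan) = |Retrieved Docs| * (tR + tP): every document is
    # retrieved and processed; there is no querying cost.
    return num_docs * (t_r + t_p)

def ise_time(num_docs, num_queries, t_r, t_p, t_q):
    # Time(ISE) = |Queries| * tQ + |Retrieved Docs| * (tR + tP):
    # the same per-document cost, plus tQ for answering each query.
    return num_queries * t_q + num_docs * (t_r + t_p)
```

Even with the query overhead, Iterative Set Expansion wins whenever querying retrieves far fewer documents than a full scan of the database would.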
13. Querying Graph
- The querying graph is a bipartite graph containing tokens and documents
- Each token (transformed into a keyword query) retrieves documents
- Documents contain tokens
[Figure: bipartite querying graph linking tokens t1..t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) to documents d1..d5]
14. Using the Querying Graph for Analysis
- We need to compute:
  - the number of documents retrieved after sending Q tokens as queries (estimates time)
  - the number of tokens that appear in the retrieved documents (estimates recall)
- To estimate these, we need to compute:
  - the degree distribution of the tokens discovered by retrieving documents
  - the degree distribution of the documents retrieved by the tokens
- (Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees)
[Figure: the bipartite querying graph of the previous slide]
Elegant analysis framework based on generating functions; details in the paper.
15. Recall Limit: Reachability Graph
[Figure: querying graph (left) and the token-level reachability graph derived from it (right)]
- Example: t1 retrieves document d1, which contains t2, so the reachability graph has an edge t1 → t2
- The upper recall limit is determined by the size of the biggest connected component
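The recall ceiling can be checked on a concrete querying graph with a plain BFS (a toy sketch; `token_to_docs` and `doc_to_tokens` are hypothetical adjacency maps, not the paper's data structures):

```python
from collections import deque

def reachable_tokens(token_to_docs, doc_to_tokens, seed):
    """Tokens discoverable from `seed`: token u reaches token v when some
    document retrieved by querying u contains v (an edge of the
    reachability graph). Plain BFS over the bipartite querying graph."""
    seen, frontier = {seed}, deque([seed])
    while frontier:
        t = frontier.popleft()
        for d in token_to_docs.get(t, ()):      # docs the query for t returns
            for v in doc_to_tokens.get(d, ()):  # tokens those docs contain
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
    return seen
```

A seed that falls inside a small component caps recall at that component's size, no matter how many queries are issued afterwards.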
16. Automatic Query Generation
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents that contain tokens
- Details in the paper
17. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
18. Summary of Cost Analysis
- Our analysis so far:
  - takes as input a target recall
  - gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach the target recall)
- Time and recall depend on task-specific properties of the database:
  - the token degree distribution
  - the document degree distribution
- Next, we show how to estimate the degree distributions on the fly
19. Estimating Cost Model Parameters
- Token and document degree distributions belong to known distribution families

Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform

We can characterize the distributions with only a few parameters!
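For the power-law case, the exponent can be estimated from a sample of degrees with the standard continuous maximum-likelihood estimator (a generic sketch, not necessarily the estimator the paper uses):

```python
from math import log

def powerlaw_alpha(degrees, x_min=1):
    """MLE exponent for a power-law tail p(x) ~ x^(-alpha), x >= x_min:
    alpha = 1 + n / sum(ln(x / x_min)). One parameter (plus x_min) then
    characterizes the whole degree distribution."""
    xs = [x for x in degrees if x >= x_min]
    return 1 + len(xs) / sum(log(x / x_min) for x in xs)
```

This is what makes on-the-fly estimation feasible: a handful of observed degrees pins down the distribution family's parameters rather than the full histogram.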
20. Parameter Estimation
- Naïve solution for parameter estimation:
  - start with a separate parameter-estimation phase
  - perform random sampling on the database
  - stop when cross-validation indicates high confidence
- We can do better than this!
  - no need for a separate sampling phase
  - sampling is equivalent to executing the task
  - → piggyback parameter estimation onto task execution
21. On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming default parameter values
- Start executing the task
- Update the parameter estimates during execution (they converge towards the correct, but initially unknown, distribution)
- Switch plans if the updated statistics indicate so
- Important:
  - only Scan acts as random sampling
  - all other execution plans need parameter adjustment (see paper)
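The switch-if-better loop can be sketched as a generic driver (entirely illustrative: `plans`, `estimate_time`, and `update` are hypothetical stand-ins for the paper's cost models and estimators):

```python
def optimize(plans, estimate_time, update, params, target_recall, max_rounds=100):
    """Repeatedly pick the plan that is cheapest for the target recall
    under the current parameter estimates, execute one chunk of it,
    refine the estimates from what was just retrieved, and re-decide.

    estimate_time(plan, params, target) -> predicted cost (infinity if
    the plan cannot reach the target recall);
    update(plan, params) -> (new_params, recall_so_far).
    """
    recall, history = 0.0, []
    for _ in range(max_rounds):
        if recall >= target_recall:
            break
        plan = min(plans, key=lambda p: estimate_time(p, params, target_recall))
        history.append(plan)
        params, recall = update(plan, params)
    return history, recall
```

In the real setting, only Scan yields a uniform sample, so the update step must also de-bias the statistics gathered by the query-based plans, as the slide notes.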
22. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
23. Correctness of Theoretical Analysis
Task: Disease Outbreaks, with the Snowball IE system; 182,531 documents from The New York Times; 16,921 tokens
- Solid lines: actual time
- Dotted lines: predicted time with the correct parameters
24. Experimental Results (Information Extraction)
- Solid lines: actual time
- Green line: time with the optimizer
- (Results are similar in the other experiments; see paper)
25. Conclusions
- Common execution plans for multiple text-centric tasks
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans
- Techniques for on-the-fly parameter estimation
- An optimization framework that picks, on the fly, the fastest plan for the target recall
26. Future Work
- Incorporate the precision and recall of the extraction system into the framework
- Create a non-parametric optimizer (i.e., one with no assumptions about distribution families)
- Examine other text-centric tasks and analyze new execution plans
- Create an adaptive, next-K optimizer
27. Thank you!

Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf. 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | – | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | – | Cohen and Singer, AAAI WIBIS 1996