Title: To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
1. To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
- Panos Ipeirotis New York University
- Eugene Agichtein Microsoft Research
- Pranay Jain Columbia University
- Luis Gravano Columbia University
2. Text-Centric Task I: Information Extraction
- Information extraction applications extract
structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire , is finding itself hard
pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's Proteus)
Information Extraction tutorial yesterday by
AnHai Doan, Raghu Ramakrishnan, Shivakumar
Vaithyanathan
3. Text-Centric Task II: Metasearching
- Metasearchers create content summaries of databases (word frequencies) to direct queries appropriately
Friday June 16, NEW YORK (Forbes) - Starbucks
Corp. may be next on the target list of CSPI, a
consumer-health group that this week sued the
operator of the KFC restaurant chain
Content Summary of Forbes.com
Content Summary Extractor
4. Text-Centric Task III: Focused Resource Discovery
- Identify web pages about a given topic (multiple techniques proposed: simple classifiers, focused crawlers, focused querying, ...)
Web Pages about Botany
Web Page Classifier
5. An Abstract View of Text-Centric Tasks
Text Database
Extraction System
- Retrieve documents from database
6. Executing a Text-Centric Task
Text Database
Extraction System
- Retrieve documents from database
- Two major execution paradigms
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., case fatality rate), then retrieve and process the documents in the results
Similar to the relational world: the underlying data distribution dictates what is best
Unlike the relational world:
- Indexes are only approximate: the index is on keywords, not on the tokens of interest
- Choice of execution plan affects output completeness (not only speed)
7. Execution Plan Characteristics
Question: How do we choose the fastest execution plan for reaching a target recall?
Text Database
Extraction System
- Retrieve documents from database
- Execution plans have two main characteristics:
- Execution time
- Recall (fraction of tokens retrieved)
What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?
8. Outline
- Description and analysis of crawl- and query-based plans
- Scan
- Filtered Scan
- Iterative Set Expansion
- Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
Crawl-based
Query-based
(Index-based)
9. Scan
Extraction System
Text Database
- Retrieve docs from database
- Scan retrieves and processes documents sequentially (until reaching target recall)
- Execution time = |Retrieved Docs| · (tR + tP), where tR is the time for retrieving a document and tP the time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper)
10. Estimating Recall of Scan
<SARS, China>
- Modeling Scan for token t:
- What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
- A sampling-without-replacement process
- After retrieving S documents, the frequency of token t follows a hypergeometric distribution
- Recall for token t is the probability that the frequency of t in the S docs is > 0
11. Estimating Recall of Scan
<SARS, China>
<Ebola, Zaire>
- Modeling Scan:
- Multiple sampling-without-replacement processes, one for each token
- Overall recall is the average recall across tokens
- → We can compute the number of documents required to reach target recall
Execution time = |Retrieved Docs| · (tR + tP)
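The hypergeometric model above translates directly into code. A minimal sketch in Python (the function name and toy numbers are illustrative, not from the slides): for a token t with frequency g(t), the probability of missing t in S of D documents is C(D-g(t), S) / C(D, S), so recall for t is one minus that.

```python
from math import comb

def scan_recall(token_freqs, D, S):
    """Expected recall of Scan after retrieving S of D documents.

    For a token t with frequency g(t), the probability that none of the
    S retrieved documents mentions t is hypergeometric:
        P(miss t) = C(D - g(t), S) / C(D, S)
    Overall recall is the average hit probability across tokens.
    """
    hit = [1 - comb(D - g, S) / comb(D, S) for g in token_freqs]
    return sum(hit) / len(hit)

# Toy database: 1,000 docs, three tokens of increasing frequency.
# Rare tokens are unlikely to show up in the first 200 documents.
print(scan_recall([1, 10, 100], D=1000, S=200))
```

Averaging per-token hit probabilities is exactly the "one sampling process per token" view of the next slide.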
12. Scan and Filtered Scan
Extraction System
Text Database
filtered
- Retrieve docs from database
- Scan retrieves and processes all documents (until reaching target recall)
- Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of the NYT is unlikely to describe disease outbreaks)
- Execution time = |Retrieved Docs| · (tR + tF + tP), where tF is the time for filtering a document
Question: How many documents does (Filtered) Scan retrieve to reach target recall?
13. Estimating Recall of Filtered Scan
- Modeling Filtered Scan:
- Analysis similar to Scan
- Main difference: the classifier rejects documents, and
- Decreases the effective database size from D to s·D (s = classifier selectivity)
- Decreases the effective token frequency from g(t) to r·g(t) (r = classifier recall)
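The same hypergeometric computation carries over with the shrunken parameters. A sketch under the slide's model (rounding the effective sizes to integers is my simplification):

```python
from math import comb

def filtered_scan_recall(token_freqs, D, S, s, r):
    """Expected recall of Filtered Scan under the slide's model:
    selectivity s shrinks the database from D to s*D documents, and
    classifier recall r shrinks each token frequency g(t) to r*g(t).
    S is the number of documents drawn from the filtered database.
    """
    D_eff = round(s * D)
    S_eff = min(S, D_eff)
    hit = []
    for g in token_freqs:
        g_eff = min(round(r * g), D_eff)
        hit.append(1 - comb(D_eff - g_eff, S_eff) / comb(D_eff, S_eff))
    return sum(hit) / len(hit)

# A selective classifier (s = 0.2) that keeps r = 90% of the relevant
# documents makes 100 processed documents go much further than in Scan.
print(filtered_scan_recall([10, 50], D=1000, S=100, s=0.2, r=0.9))
```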
14. Outline
- Description and analysis of crawl- and query-based plans
- Scan
- Filtered Scan
- Iterative Set Expansion
- Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
Crawl-based
Query-based
15. Iterative Set Expansion
Text Database
Extraction System
Query Generation
- Process retrieved documents
- Augment seed tokens with new tokens (e.g., <Malaria, Ethiopia>)
- Query database with seed tokens (e.g., Ebola AND Zaire)
- Execution time = |Retrieved Docs| · (tR + tP) + |Queries| · tQ, where tQ is the time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
16. Querying Graph
- The querying graph is a bipartite graph containing tokens and documents
- Each token (transformed into a keyword query) retrieves documents
- Documents contain tokens
Tokens: t1 = <SARS, China>, t2 = <Ebola, Zaire>, t3 = <Malaria, Ethiopia>, t4 = <Cholera, Sudan>, t5 = <H5N1, Vietnam>
Documents: d1, ..., d5
17. Using Querying Graph for Analysis
- We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates time)
- The number of tokens that appear in the retrieved documents (estimates recall)
- To estimate these, we need to compute:
- The degree distribution of the tokens discovered by retrieving documents
- The degree distribution of the documents retrieved by the tokens
- (Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees)
Elegant analysis framework based on generating functions; details in the paper
18. Recall Limit: Reachability Graph
- The reachability graph links token t to token t2 when t (as a query) retrieves a document containing t2 (e.g., t1 retrieves document d1, which contains t2)
- Upper recall limit: determined by the size of the biggest connected component
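This recall ceiling is easy to check computationally. A minimal sketch (function name and example tokens are illustrative), treating the reachability links as undirected edges, as the slide's connected-component bound does:

```python
from collections import defaultdict, deque

def ise_recall_limit(links, tokens):
    """Upper bound on Iterative Set Expansion recall: the fraction of
    tokens in the largest connected component of the reachability graph.

    links: (t, t2) pairs meaning that querying with token t retrieves
    a document containing token t2.
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, best = set(), 0
    for start in tokens:
        if start in seen:
            continue
        seen.add(start)
        size, queue = 0, deque([start])
        while queue:                      # BFS over one component
            node = queue.popleft()
            size += 1
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best / len(tokens)

# t1 reaches t2, t2 reaches t3; t4 and t5 are isolated, so no choice
# of seeds can exceed 3/5 = 60% recall.
print(ise_recall_limit([("t1", "t2"), ("t2", "t3")],
                       ["t1", "t2", "t3", "t4", "t5"]))  # → 0.6
```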
19. Automatic Query Generation
- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents with tokens
20. Automatic Query Generation
Text Database
Offline Query Generation
Extraction System
- Process retrieved documents
- Generate queries that tend to retrieve documents with tokens
- Execution time = |Retrieved Docs| · (tR + tP) + |Queries| · tQ
21. Estimating Recall of Automatic Query Generation
- Query q retrieves g(q) docs
- Query has precision p(q):
- p(q)·g(q) useful docs
- (1 - p(q))·g(q) useless docs
- We compute the total number of useful (and useless) documents retrieved
- Analysis similar to Filtered Scan:
- Effective database size is |D_useful|
- Sample size S is the number of useful documents retrieved
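The useful/useless split is simple arithmetic over the query set. A sketch that ignores overlap between query results (my simplifying assumption; the queries below are hypothetical):

```python
def aqg_doc_counts(queries):
    """Total useful and useless documents retrieved by a set of
    offline-generated queries, each described by (g_q, p_q): its
    result size g(q) and precision p(q). Overlap between query
    results is ignored in this first-cut estimate."""
    useful = sum(g * p for g, p in queries)
    useless = sum(g * (1 - p) for g, p in queries)
    return useful, useless

# Two hypothetical queries: 100 docs at 80% precision, 50 docs at 40%.
print(aqg_doc_counts([(100, 0.8), (50, 0.4)]))  # → (100.0, 50.0)
```

The useful count plays the role of the sample size S in the Filtered-Scan-style recall analysis.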
22. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
23. Summary of Cost Analysis
- Our analysis so far:
- Takes as input a target recall
- Gives as output the time for each plan to reach the target recall (time = infinity, if the plan cannot reach the target recall)
- Time and recall depend on task-specific properties of the database:
- Token degree distribution
- Document degree distribution
- Next, we show how to estimate degree distributions on the fly
24. Estimating Cost Model Parameters
- Token and document degree distributions belong to
known distribution families
Can characterize distributions with only a few
parameters!
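For instance, when token degrees follow a power law, the exponent can be estimated in closed form. A sketch using the standard continuous maximum-likelihood estimator (the power-law choice here is illustrative; the point is that a known family needs only a parameter or two):

```python
from math import log

def powerlaw_exponent(degrees, d_min=1):
    """Continuous MLE of a power-law exponent alpha for
    P(degree = d) ~ d^(-alpha), using degrees >= d_min:
        alpha = 1 + n / sum(ln(d_i / d_min))
    Requires at least one degree strictly above d_min.
    """
    xs = [d for d in degrees if d >= d_min]
    return 1 + len(xs) / sum(log(d / d_min) for d in xs)

# Toy heavy-tailed degrees: most tokens appear in very few documents.
print(powerlaw_exponent([1, 1, 1, 2, 2, 4, 8, 32], d_min=1))
```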
25. Parameter Estimation
- Naïve solution for parameter estimation:
- Start with a separate parameter-estimation phase
- Perform random sampling on the database
- Stop when cross-validation indicates high confidence
- We can do better than this!
- No need for a separate sampling phase
- Sampling is equivalent to executing the task
- → Piggyback parameter estimation onto execution
26. On-the-fly Parameter Estimation
Correct (but unknown) distribution
- Pick the most promising execution plan for the target recall, assuming default parameter values
- Start executing the task
- Update parameter estimates during execution
- Switch plans if the updated statistics indicate so
- Important:
- Only Scan acts as random sampling
- All other execution plans need parameter adjustment (see paper)
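The switch-if-cheaper logic boils down to re-running a cost comparison whenever the estimates change. A toy sketch (the plan names, cost formulas, and parameter values below are all illustrative, not the paper's actual cost models):

```python
def choose_plan(plans, params, target_recall):
    """Return the plan with the smallest predicted time to reach
    target_recall under the current parameter estimates; a plan whose
    cost function returns float('inf') cannot reach the target."""
    return min(plans, key=lambda name: plans[name](params, target_recall))

# Toy cost models: Scan always reaches the target; ISE is 10x cheaper
# per token but cannot exceed its reachability-graph recall limit.
plans = {
    "Scan": lambda p, r: p["db_size"] * r * p["t_process"],
    "ISE": lambda p, r: (0.1 * p["db_size"] * r * p["t_process"]
                         if r <= p["ise_recall_limit"] else float("inf")),
}
params = {"db_size": 100_000, "t_process": 0.01, "ise_recall_limit": 0.4}
print(choose_plan(plans, params, 0.3))  # → ISE
print(choose_plan(plans, params, 0.9))  # → Scan
```

Calling `choose_plan` again after each batch of processed documents, with refreshed `params`, is the "switch plan if updated statistics indicate so" step.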
27. Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
28. Correctness of Theoretical Analysis
Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tokens
- Solid lines: actual time
- Dotted lines: predicted time with correct parameters
29. Experimental Results (Information Extraction)
- Solid lines: actual time
- Green line: time with optimizer
- (Results similar in other experiments; see paper)
30. Conclusions
- Common execution plans for multiple text-centric tasks
- Analytic models for predicting execution time and recall of various crawl- and query-based plans
- Techniques for on-the-fly parameter estimation
- Optimization framework picks on-the-fly the fastest plan for the target recall
31. Future Work
- Incorporate the precision and recall of the extraction system into the framework
- Create a non-parametric optimizer (i.e., no assumptions about distribution families)
- Examine other text-centric tasks and analyze new execution plans
- Create an adaptive, next-K optimizer
32. Thank you!
33. Overflow Slides
34. Experimental Results (IE, Headquarters)
Task: Company Headquarters; Snowball IE system; 182,531 documents from NYT; 16,921 tokens
35. Experimental Results (Content Summaries)
Content Summary Extraction; 19,997 documents from 20 Newsgroups; 120,024 tokens
36. Experimental Results (Content Summaries)
Content Summary Extraction; 19,997 documents from 20 Newsgroups; 120,024 tokens
37. Experimental Results (Content Summaries)
Underestimated recall for AQG; switched to ISE
Content Summary Extraction; 19,997 documents from 20 Newsgroups; 120,024 tokens
38. Experimental Results (Information Extraction)
OPTIMIZED is faster than the best plan: it overestimated Filtered Scan's recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan
39. Focused Resource Discovery
Focused Resource Discovery: 800,000 web pages; 12,000 tokens