Title: Downloading Textual Hidden-Web Content Through Keyword Queries

1. Downloading Textual Hidden-Web Content Through Keyword Queries
- Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
- University of California, Los Angeles
- Computer Science Department
- {ntoulas, pzerfos, cho}_at_cs.ucla.edu
- JCDL, June 8th, 2005
2. Motivation
- I would like to buy a used '98 Ford Taurus
- Technical specs?
- Reviews?
- Classifieds?
- Vehicle history?
Google?
3. Why can't we use a search engine?
- Search engines today employ crawlers that find pages by following links around
- Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, ...)
- Search engines cannot reach such pages since there are no links to them (the Hidden Web)
- In this talk: how can we download Hidden-Web content?
4. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
5. Interacting with Hidden-Web pages (1)
- The user issues a query through a query interface
[Screenshot: query interface, with the example query "liver"]
6. Interacting with Hidden-Web pages (2)
[Screenshot: result list page]
- The user issues a query through a query interface
- A result list is presented to the user
7. Interacting with Hidden-Web pages (3)
- The user issues a query through a query interface
- A result list is presented to the user
- The user selects and views the interesting results
8. Querying a Hidden-Web site
- Procedure (a Python sketch follows):
  while (there are available resources) do
    (1) select a query to send to the site
    (2) send the query and acquire the result list
    (3) download the pages
  done
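A minimal Python sketch of this loop. The `site` object with `send_query` and `download` methods, the pluggable `select_query` function, and the one-unit-per-query budget accounting are illustrative assumptions, not part of the talk:

```python
def crawl_hidden_web_site(site, select_query, budget):
    """Repeatedly query a Hidden-Web site until the resource budget runs out."""
    downloaded = {}                              # URL -> page content
    issued = []                                  # queries issued so far
    while budget > 0:
        q = select_query(issued, downloaded)     # (1) pick the next query
        result_list = site.send_query(q)         # (2) acquire the result list
        for url in result_list:                  # (3) download the result pages
            if url not in downloaded:
                downloaded[url] = site.download(url)
        issued.append(q)
        budget -= 1                              # here: one budget unit per query
    return downloaded
```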
9. How should we select the queries? (1)
- S = set of pages in the Web site (pages as points)
- qi = set of pages returned if we issue query qi (queries as circles)
10. How should we select the queries? (2)
- Find the queries (circles) that cover the maximum number of pages (points)
- Equivalent to the set-covering problem in graph theory (a greedy sketch follows below)
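For intuition, here is a standard greedy approximation for this idealized set-cover view, assuming (unrealistically) that every query's result set is known up front; the names and toy data are illustrative:

```python
def greedy_cover(query_results, max_queries):
    """Greedy set cover: repeatedly pick the query whose (known) result set
    adds the most not-yet-covered pages."""
    covered, chosen = set(), []
    for _ in range(max_queries):
        q = max(query_results, key=lambda q: len(query_results[q] - covered))
        if not (query_results[q] - covered):
            break                                # nothing new to cover
        chosen.append(q)
        covered |= query_results[q]
    return chosen, covered

# Toy usage: three "queries" over a six-page site.
pages = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {4, 5, 6}}
print(greedy_cover(pages, max_queries=3))   # (['q1', 'q3'], {1, 2, 3, 4, 5, 6})
```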
11. Challenges during query selection
- In practice we don't know which pages will be returned by which queries (the qi are unknown)
- Even if we did know the qi, the set-covering problem is NP-hard
- We will present approximation algorithms for the query selection problem
- We will assume single-keyword queries
12. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
13. Some background (1)
- Assumption: When we issue query qi to a Web site, all pages containing qi are returned
- P(qi) = fraction of pages from the site we get back after issuing qi
- Example: q = liver
  - No. of docs in DB: 10,000
  - No. of docs containing liver: 3,000
  - P(liver) = 3,000 / 10,000 = 0.3
14. Some background (2)
- P(q1 ∧ q2) = fraction of pages containing both q1 and q2 (intersection of q1 and q2)
- P(q1 ∨ q2) = fraction of pages containing either q1 or q2 (union of q1 and q2)
- Cost and benefit:
  - How much benefit do we get out of a query?
  - How costly is it to issue a query?
15. Cost function
- The cost to issue a query and download the Hidden-Web pages:
  - cq = query cost
  - cr = cost for retrieving a result item
  - cd = cost for downloading a document

  Cost(qi) = cq + cr·P(qi) + cd·P(qi)   (worked example below)

  (1) cq: the cost for issuing a query
  (2) cr·P(qi): the cost for retrieving a result item, times the no. of results
  (3) cd·P(qi): the cost for downloading a doc, times the no. of docs
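As a worked instance of this cost model, using the example constants from the backup slides (cq = 100, cr = 100, cd = 10,000) and P(liver) = 0.3 from slide 13:

```python
def query_cost(p_qi, cq=100, cr=100, cd=10_000):
    """Cost(qi) = cq + cr*P(qi) + cd*P(qi): a fixed per-query cost, plus
    result-retrieval and document-download costs that scale with the
    fraction of the site the query matches."""
    return cq + cr * p_qi + cd * p_qi

print(query_cost(0.3))   # 100 + 100*0.3 + 10000*0.3 = 3130.0
```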
16. Problem formalization
- Find the set of queries q1, ..., qn
- which maximizes
  P(q1 ∨ ... ∨ qn)
- under the constraint
  Σi Cost(qi) ≤ t, where t is the available download resources
17. Query selection algorithms
- Random: Select a query randomly from a precompiled list (e.g. a dictionary)
- Frequency-based: Select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web)
- Adaptive: Analyze previously downloaded pages to determine promising future queries
18. Adaptive query selection
- Assume we have issued q1, ..., qi-1.
- To find a promising query qi we need to estimate P(q1 ∨ ... ∨ qi-1 ∨ qi):

  P((q1 ∨ ... ∨ qi-1) ∨ qi) = P(q1 ∨ ... ∨ qi-1) + P(qi) − P(q1 ∨ ... ∨ qi-1) · P(qi | q1 ∨ ... ∨ qi-1)

- P(q1 ∨ ... ∨ qi-1): known (by counting), since we have already issued q1, ..., qi-1
- P(qi | q1 ∨ ... ∨ qi-1): can be measured by counting the occurrences of qi within the pages retrieved for q1, ..., qi-1
- What about P(qi)?
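Before turning to P(qi), note that the recurrence itself is straightforward to apply incrementally; a one-line sketch, with inputs measured or estimated as described above:

```python
def updated_coverage(p_prev, p_qi, cond):
    """P(q1 v ... v qi) = P(q1 v ... v qi-1) + P(qi)
       - P(q1 v ... v qi-1) * P(qi | q1 v ... v qi-1)."""
    return p_prev + p_qi - p_prev * cond

# e.g. 50% of the site already covered, qi matches 30% of the site, and
# 40% of the pages downloaded so far contain qi:
print(updated_coverage(0.5, 0.3, 0.4))   # 0.5 + 0.3 - 0.5*0.4 = 0.6
```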
19. Estimating P(qi)
- Independence estimator:
  P(qi) ≈ P(qi | q1 ∨ ... ∨ qi-1)
- Zipf estimator [IG02]:
  - Rank queries based on their frequency of occurrence and fit a power-law distribution
  - Use the fitted distribution to estimate P(qi)
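A rough sketch of the Zipf-style estimate, assuming we fit freq ≈ C·rank^(−γ) by least squares in log-log space; the exact functional form used in [IG02] differs in its details:

```python
import numpy as np

def fit_power_law(frequencies):
    """Fit freq = C * rank^(-gamma) to the keyword frequencies observed
    in the pages downloaded so far (log-log linear regression)."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return np.exp(log_c), -slope                 # C, gamma

def estimate_p_qi(rank, c, gamma, n_docs):
    """Read P(qi) for the keyword at a given rank off the fitted curve."""
    return min(1.0, c * rank ** (-gamma) / n_docs)
```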
20. Query selection algorithm
- foreach qi in potential queries do
    Pnew(qi) = P(q1 ∨ ... ∨ qi-1 ∨ qi) − P(q1 ∨ ... ∨ qi-1)
    Estimate Efficiency(qi) = Pnew(qi) / Cost(qi)
  done
- return the qi with maximum Efficiency(qi)
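Putting slides 18-20 together, a minimal sketch of one selection step. Here `candidates` maps each potential keyword to its estimated P(qi) and its measured conditional P(qi | q1 ∨ ... ∨ qi-1), both assumed computed as above, and `cost_fn` could be the `query_cost` sketch from slide 15:

```python
def pick_next_query(candidates, p_prev, cost_fn):
    """Return the keyword with maximum Efficiency(qi) = Pnew(qi) / Cost(qi)."""
    best, best_eff = None, float("-inf")
    for q, (p_qi, cond) in candidates.items():
        p_new = p_qi - p_prev * cond    # Pnew(qi) = P(q1 v..v qi) - P(q1 v..v qi-1)
        eff = p_new / cost_fn(p_qi)     # new coverage per unit of cost
        if eff > best_eff:
            best, best_eff = q, eff
    return best
```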
21. Other practical issues
- Efficient calculation of P(qi | q1 ∨ ... ∨ qi-1)
- Selection of the initial query
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
- Please refer to our paper for the details
22. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
23. Experimental evaluation
- Applied our algorithms to 4 different sites
24. Policies
- Random-16K
  - Pick a query randomly from the 16,000 most popular terms
- Random-1M
  - Pick a query randomly from the 1,000,000 most popular terms
- Frequency-based
  - Pick a query based on frequency of occurrence
- Adaptive
25. Coverage of policies
- What fraction of the Web sites can we download by issuing queries?
- Study P(q1 ∨ ... ∨ qi) as i increases
26. Coverage of policies for PubMed
- Adaptive reaches 80% coverage with 83 queries
- Frequency-based needs 103 queries for the same coverage
27. Coverage of policies for DMOZ (whole)
- Adaptive outperforms the others
28. Coverage of policies for DMOZ (arts)
- Adaptive performs best on topic-specific texts
29. Other experiments
- Impact of the initial query
- Impact of the various parameters of the cost function
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
- Please refer to our paper for the details
30. Related work
- Issuing queries to databases
  - Acquire a language model [CCD99]
  - Estimate the fraction of the Web that is indexed [LG98]
  - Estimate the relative size and overlap of indexes [BB98]
  - Build multi-keyword queries that can return a large number of documents [BF04]
- Harvesting approaches / cooperative databases (OAI [LS01], DP9 [LMZN02])
31. Conclusion
- An adaptive algorithm for issuing queries to Hidden-Web sites
- Our algorithm is highly efficient (downloaded >90% of a site with 100 queries)
- Allows users to tap into unexplored information on the Web
- Allows the research community to download, mine, study, and understand the Hidden-Web
32. References
- [IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
- [CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
- [LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
- [BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
- [BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.
- [LS01] C. Lagoze, H.V. Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
- [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.
33. Thank you!
34. Impact of the initial query
- Does it matter what the first query is?
- Crawled PubMed with the queries:
  - "data" (1,344,999 results)
  - "information" (308,474 results)
  - "return" (29,707 results)
  - "pubmed" (695 results)
35. Impact of the initial query
- The algorithm converges regardless of the initial query
36. Incorporating the document download cost
- Cost(qi) = cq + cr·P(qi) + cd·Pnew(qi)
- Crawled PubMed with:
  - cq = 100
  - cr = 100
  - cd = 10,000
37. Incorporating the document download cost
- Adaptive uses resources more efficiently
- The document download cost is a significant portion of the total cost
38. Can we get all the results back?
39. Downloading from sites limiting the number of results (1)
- The site returns a truncated result set qi' instead of qi
- For qi+1 we need to estimate P(qi+1 | q1 ∨ ... ∨ qi)
40. Downloading from sites limiting the number of results (2)
- Assuming qi' is a random sample of qi (see the sketch below)
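A sketch of how the random-sample assumption can be used, with illustrative names: the fraction of the truncated sample qi' that contains a candidate keyword estimates P(keyword | qi), and the reported total for qi then scales this back to a site-wide fraction.

```python
def estimate_joint(keyword, qi_prime_pages, p_qi):
    """Estimate P(keyword ^ qi): assuming qi' (the pages actually returned,
    given as text strings) is a uniform random sample of qi, the in-sample
    hit rate approximates P(keyword | qi); multiplying by P(qi) gives the
    joint fraction."""
    hits = sum(1 for page in qi_prime_pages if keyword in page)
    return (hits / len(qi_prime_pages)) * p_qi
```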
41. Impact of the limit on results
- How does the limit on results affect our algorithms?
- Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000
42. DMOZ with a result cap at 1,000
- Adaptive still outperforms frequency-based