Title: Downloading Textual Hidden-Web Content Through Keyword Queries

1. Downloading Textual Hidden-Web Content Through Keyword Queries
- Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
- University of California, Los Angeles
- Computer Science Department
- {ntoulas, pzerfos, cho}_at_cs.ucla.edu
- JCDL, June 8th, 2005
2. Motivation
- I would like to buy a used '98 Ford Taurus
- Technical specs?
- Reviews?
- Classifieds?
- Vehicle history?
Google?
3. Why can't we use a search engine?
- Search engines today employ crawlers that find pages by following links around
- Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, ...)
- Search engines cannot reach such pages since there are no links to them (the Hidden Web)
- In this talk: how can we download Hidden-Web content?
4. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
5. Interacting with Hidden-Web pages (1)
- The user issues a query through a query interface
[Screenshot: query interface, with the example query "liver"]
6. Interacting with Hidden-Web pages (2)
[Screenshot: result list page]
- The user issues a query through a query interface
- A result list is presented to the user
7. Interacting with Hidden-Web pages (3)
- The user issues a query through a query interface
- A result list is presented to the user
- The user selects and views the interesting results
8. Querying a Hidden-Web site
- Procedure (a Python sketch follows):
  while (there are available resources) do
    (1) select a query to send to the site
    (2) send the query and acquire the result list
    (3) download the pages
  done
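A minimal Python sketch of this loop. The `site` object with `send_query` and `download` methods, the pluggable `select_query` function, and the one-unit-per-query budget accounting are illustrative assumptions, not part of the talk:

```python
def crawl_hidden_web_site(site, select_query, budget):
    """Repeatedly query a Hidden-Web site until the resource budget runs out."""
    downloaded = {}                              # URL -> page content
    issued = []                                  # queries issued so far
    while budget > 0:
        q = select_query(issued, downloaded)     # (1) pick the next query
        result_list = site.send_query(q)         # (2) acquire the result list
        for url in result_list:                  # (3) download the result pages
            if url not in downloaded:
                downloaded[url] = site.download(url)
        issued.append(q)
        budget -= 1                              # here: one budget unit per query
    return downloaded
```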
9. How should we select the queries? (1)
- S = set of pages in the Web site (pages as points)
- qi = set of pages returned if we issue query qi (queries as circles)
10. How should we select the queries? (2)
- Find the queries (circles) that cover the maximum number of pages (points)
- Equivalent to the set-covering problem in graph theory (a greedy sketch follows below)
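For intuition, here is a standard greedy approximation for this idealized set-cover view, assuming (unrealistically) that every query's result set is known up front; the names and toy data are illustrative:

```python
def greedy_cover(query_results, max_queries):
    """Greedy set cover: repeatedly pick the query whose (known) result set
    adds the most not-yet-covered pages."""
    covered, chosen = set(), []
    for _ in range(max_queries):
        q = max(query_results, key=lambda q: len(query_results[q] - covered))
        if not (query_results[q] - covered):
            break                                # nothing new to cover
        chosen.append(q)
        covered |= query_results[q]
    return chosen, covered

# Toy usage: three "queries" over a six-page site.
pages = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {4, 5, 6}}
print(greedy_cover(pages, max_queries=3))   # (['q1', 'q3'], {1, 2, 3, 4, 5, 6})
```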
11. Challenges during query selection
- In practice we don't know which pages will be returned by which queries (the qi are unknown)
- Even if we did know the qi, the set-covering problem is NP-hard
- We will present approximation algorithms for the query selection problem
- We will assume single-keyword queries
12. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
13. Some background (1)
- Assumption: When we issue query qi to a Web site, all pages containing qi are returned
- P(qi) = fraction of pages from the site we get back after issuing qi
- Example: q = liver
  - No. of docs in DB: 10,000
  - No. of docs containing liver: 3,000
  - P(liver) = 3,000 / 10,000 = 0.3
14. Some background (2)
- P(q1 ∧ q2) = fraction of pages containing both q1 and q2 (intersection of q1 and q2)
- P(q1 ∨ q2) = fraction of pages containing either q1 or q2 (union of q1 and q2)
- Cost and benefit:
  - How much benefit do we get out of a query?
  - How costly is it to issue a query?
15. Cost function
- The cost to issue a query and download the Hidden-Web pages:
  - cq = query cost
  - cr = cost for retrieving a result item
  - cd = cost for downloading a document

  Cost(qi) = cq + cr·P(qi) + cd·P(qi)   (worked example below)

  (1) cq: the cost for issuing a query
  (2) cr·P(qi): the cost for retrieving a result item, times the no. of results
  (3) cd·P(qi): the cost for downloading a doc, times the no. of docs
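As a worked instance of this cost model, using the example constants from the backup slides (cq = 100, cr = 100, cd = 10,000) and P(liver) = 0.3 from slide 13:

```python
def query_cost(p_qi, cq=100, cr=100, cd=10_000):
    """Cost(qi) = cq + cr*P(qi) + cd*P(qi): a fixed per-query cost, plus
    result-retrieval and document-download costs that scale with the
    fraction of the site the query matches."""
    return cq + cr * p_qi + cd * p_qi

print(query_cost(0.3))   # 100 + 100*0.3 + 10000*0.3 = 3130.0
```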
16. Problem formalization
- Find the set of queries q1, ..., qn
- which maximizes
  P(q1 ∨ ... ∨ qn)
- under the constraint
  Σi Cost(qi) ≤ t, where t is the available download resources
17. Query selection algorithms
- Random: Select a query randomly from a precompiled list (e.g. a dictionary)
- Frequency-based: Select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web)
- Adaptive: Analyze previously downloaded pages to determine promising future queries
18. Adaptive query selection
- Assume we have issued q1, ..., qi-1.
- To find a promising query qi we need to estimate P(q1 ∨ ... ∨ qi-1 ∨ qi):

  P((q1 ∨ ... ∨ qi-1) ∨ qi) = P(q1 ∨ ... ∨ qi-1) + P(qi) − P(q1 ∨ ... ∨ qi-1) · P(qi | q1 ∨ ... ∨ qi-1)

- P(q1 ∨ ... ∨ qi-1): known (by counting), since we have already issued q1, ..., qi-1
- P(qi | q1 ∨ ... ∨ qi-1): can be measured by counting the occurrences of qi within the pages retrieved for q1, ..., qi-1
- What about P(qi)?
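Before turning to P(qi), note that the recurrence itself is straightforward to apply incrementally; a one-line sketch, with inputs measured or estimated as described above:

```python
def updated_coverage(p_prev, p_qi, cond):
    """P(q1 v ... v qi) = P(q1 v ... v qi-1) + P(qi)
       - P(q1 v ... v qi-1) * P(qi | q1 v ... v qi-1)."""
    return p_prev + p_qi - p_prev * cond

# e.g. 50% of the site already covered, qi matches 30% of the site, and
# 40% of the pages downloaded so far contain qi:
print(updated_coverage(0.5, 0.3, 0.4))   # 0.5 + 0.3 - 0.5*0.4 = 0.6
```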
19. Estimating P(qi)
- Independence estimator:
  P(qi) ≈ P(qi | q1 ∨ ... ∨ qi-1)
- Zipf estimator [IG02]:
  - Rank queries based on their frequency of occurrence and fit a power-law distribution
  - Use the fitted distribution to estimate P(qi)
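A rough sketch of the Zipf-style estimate, assuming we fit freq ≈ C·rank^(−γ) by least squares in log-log space; the exact functional form used in [IG02] differs in its details:

```python
import numpy as np

def fit_power_law(frequencies):
    """Fit freq = C * rank^(-gamma) to the keyword frequencies observed
    in the pages downloaded so far (log-log linear regression)."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return np.exp(log_c), -slope                 # C, gamma

def estimate_p_qi(rank, c, gamma, n_docs):
    """Read P(qi) for the keyword at a given rank off the fitted curve."""
    return min(1.0, c * rank ** (-gamma) / n_docs)
```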
20. Query selection algorithm
- foreach qi in potential queries do
    Pnew(qi) = P(q1 ∨ ... ∨ qi-1 ∨ qi) − P(q1 ∨ ... ∨ qi-1)
    Estimate Efficiency(qi) = Pnew(qi) / Cost(qi)
  done
- return the qi with maximum Efficiency(qi)
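Putting slides 18-20 together, a minimal sketch of one selection step. Here `candidates` maps each potential keyword to its estimated P(qi) and its measured conditional P(qi | q1 ∨ ... ∨ qi-1), both assumed computed as above, and `cost_fn` could be the `query_cost` sketch from slide 15:

```python
def pick_next_query(candidates, p_prev, cost_fn):
    """Return the keyword with maximum Efficiency(qi) = Pnew(qi) / Cost(qi)."""
    best, best_eff = None, float("-inf")
    for q, (p_qi, cond) in candidates.items():
        p_new = p_qi - p_prev * cond    # Pnew(qi) = P(q1 v..v qi) - P(q1 v..v qi-1)
        eff = p_new / cost_fn(p_qi)     # new coverage per unit of cost
        if eff > best_eff:
            best, best_eff = q, eff
    return best
```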
21. Other practical issues
- Efficient calculation of P(qi | q1 ∨ ... ∨ qi-1)
- Selection of the initial query
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
- Please refer to our paper for the details
22. Outline
- Interacting with Hidden-Web sites
- Algorithms for selecting queries for the Hidden-Web sites
- Experimental evaluation of our algorithms
23. Experimental evaluation
- Applied our algorithms to 4 different sites
24. Policies
- Random-16K
  - Pick a query randomly from the 16,000 most popular terms
- Random-1M
  - Pick a query randomly from the 1,000,000 most popular terms
- Frequency-based
  - Pick a query based on frequency of occurrence
- Adaptive
25. Coverage of policies
- What fraction of the Web sites can we download by issuing queries?
- Study P(q1 ∨ ... ∨ qi) as i increases
26. Coverage of policies for PubMed
- Adaptive reaches 80% coverage with 83 queries
- Frequency-based needs 103 queries for the same coverage
27. Coverage of policies for DMOZ (whole)
- Adaptive outperforms the others
28. Coverage of policies for DMOZ (arts)
- Adaptive performs best on topic-specific texts
29. Other experiments
- Impact of the initial query
- Impact of the various parameters of the cost function
- Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)
- Please refer to our paper for the details
30. Related work
- Issuing queries to databases
  - Acquire a language model [CCD99]
  - Estimate the fraction of the Web that is indexed [LG98]
  - Estimate the relative size and overlap of indexes [BB98]
  - Build multi-keyword queries that can return a large number of documents [BF04]
- Harvesting approaches / cooperative databases (OAI [LS01], DP9 [LMZN02])
31. Conclusion
- An adaptive algorithm for issuing queries to Hidden-Web sites
- Our algorithm is highly efficient (downloaded >90% of a site with 100 queries)
- Allows users to tap into unexplored information on the Web
- Allows the research community to download, mine, study, and understand the Hidden-Web
32. References
- [IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.
- [CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.
- [LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.
- [BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.
- [BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.
- [LS01] C. Lagoze, H.V. Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.
- [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.
33. Thank you!
34. Impact of the initial query
- Does it matter what the first query is?
- Crawled PubMed with the queries:
  - "data" (1,344,999 results)
  - "information" (308,474 results)
  - "return" (29,707 results)
  - "pubmed" (695 results)
35. Impact of the initial query
- The algorithm converges regardless of the initial query
36. Incorporating the document download cost
- Cost(qi) = cq + cr·P(qi) + cd·Pnew(qi)
- Crawled PubMed with:
  - cq = 100
  - cr = 100
  - cd = 10,000
37. Incorporating the document download cost
- Adaptive uses resources more efficiently
- The document download cost is a significant portion of the total cost
38. Can we get all the results back?
39. Downloading from sites limiting the number of results (1)
- The site returns a truncated result set qi' instead of qi
- For qi+1 we need to estimate P(qi+1 | q1 ∨ ... ∨ qi)
40. Downloading from sites limiting the number of results (2)
- Assuming qi' is a random sample of qi (see the sketch below)
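A sketch of how the random-sample assumption can be used, with illustrative names: the fraction of the truncated sample qi' that contains a candidate keyword estimates P(keyword | qi), and the reported total for qi then scales this back to a site-wide fraction.

```python
def estimate_joint(keyword, qi_prime_pages, p_qi):
    """Estimate P(keyword ^ qi): assuming qi' (the pages actually returned,
    given as text strings) is a uniform random sample of qi, the in-sample
    hit rate approximates P(keyword | qi); multiplying by P(qi) gives the
    joint fraction."""
    hits = sum(1 for page in qi_prime_pages if keyword in page)
    return (hits / len(qi_prime_pages)) * p_qi
```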
41. Impact of the limit on results
- How does the limit on results affect our algorithms?
- Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000
42. DMOZ with a result cap at 1,000
- Adaptive still outperforms frequency-based