1
Downloading Textual Hidden-Web Content Through
Keyword Queries
  • Alexandros Ntoulas, Petros Zerfos, Junghoo Cho
  • University of California, Los Angeles
  • Computer Science Department
  • {ntoulas, pzerfos, cho}@cs.ucla.edu
  • JCDL, June 8th 2005

2
Motivation
  • I would like to buy a used '98 Ford Taurus
  • Technical specs ?
  • Reviews ?
  • Classifieds ?
  • Vehicle history ?

Google?
3
Why can't we use a search engine?
  • Search engines today employ crawlers that find
    pages by following links around
  • Many useful pages are available only after
    issuing queries (e.g., classifieds, USPTO,
    PubMed, LoC, …)
  • Search engines cannot reach such pages: there are
    no links to them (the Hidden Web)
  • In this talk: how can we download Hidden-Web
    content?

4
Outline
  • Interacting with Hidden-Web sites
  • Algorithms for selecting queries for the
    Hidden-Web sites
  • Experimental evaluation of our algorithms

5
Interacting with Hidden-Web pages (1)
  • The user issues a query (e.g., "liver") through a
    query interface
6
Interacting with Hidden-Web pages (2)
[Screenshot: result list page]
  • The user issues a query through a query interface
  • A result list is presented to the user

7
Interacting with Hidden-Web pages (3)
  • The user issues a query through a query interface
  • A result list is presented to the user
  • The user selects and views the interesting
    results

8
Querying a Hidden-Web site
  • Procedure (sketched in Python below):
    while (there are available resources) do
      (1) select a query to send to the site
      (2) send the query and acquire the result list
      (3) download the pages
    done
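
A minimal Python sketch of this procedure; the three callables are hypothetical adapters for a concrete site, not part of the original algorithm:

    def crawl_hidden_web_site(select_query, send_query, download_page, budget):
        """Issue queries until the resource budget is exhausted:
        (1) pick a query, (2) fetch its result list, (3) download new pages."""
        downloaded = {}  # url -> document text
        while budget > 0:
            query = select_query(downloaded)           # step (1)
            urls, cost = send_query(query)             # step (2)
            for url in urls:                           # step (3)
                if url not in downloaded:
                    downloaded[url] = download_page(url)
            budget -= cost
        return downloaded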

9
How should we select the queries? (1)
  • S = the set of pages in the Web site (pages as points)
  • qi = the set of pages returned if we issue query qi
    (queries as circles)

10
How should we select the queries? (2)
  • Find the queries (circles) that cover the maximum
    number of pages (points)
  • Equivalent to the set-covering problem in
    graph theory

11
Challenges during query selection
  • In practice we don't know which pages will be
    returned by which queries (the qi are unknown)
  • Even if we did know the qi, the set-covering
    problem is NP-hard
  • We will present approximation algorithms for the
    query selection problem (a greedy sketch for the
    idealized, known-sets case follows below)
  • We will assume single-keyword queries
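
For intuition, if the result sets were fully known, the standard greedy set-cover approximation would look like this Python sketch (query_results is an assumed map from each candidate query to its set of result pages):

    def greedy_set_cover(universe, query_results):
        """Classic greedy approximation to set cover: repeatedly pick
        the query covering the most still-uncovered pages.  In the
        Hidden-Web setting the sets are unknown in advance, so this is
        only the idealized case the adaptive algorithm approximates."""
        uncovered = set(universe)
        chosen = []
        while uncovered:
            best = max(query_results,
                       key=lambda q: len(query_results[q] & uncovered))
            newly = query_results[best] & uncovered
            if not newly:
                break  # remaining pages are not returned by any query
            chosen.append(best)
            uncovered -= newly
        return chosen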

12
Outline
  • Interacting with Hidden-Web sites
  • Algorithms for selecting queries for the
    Hidden-Web sites
  • Experimental evaluation of our algorithms

13
Some background (1)
  • Assumption: when we issue query qi to a Web site,
    all pages containing qi are returned
  • P(qi) = fraction of pages from the site we get back
    after issuing qi
  • Example: q = "liver"
  • No. of docs in DB: 10,000
  • No. of docs containing "liver": 3,000
  • P("liver") = 3,000 / 10,000 = 0.3

14
Some background (2)
  • P(q1 ∧ q2) = fraction of pages containing both q1
    and q2 (intersection of q1 and q2)
  • P(q1 ∨ q2) = fraction of pages containing either q1
    or q2 (union of q1 and q2)
  • Cost and benefit:
  • How much benefit do we get out of a query?
  • How costly is it to issue a query?

15
Cost function
  • The cost to issue a query and download the
    Hidden-Web pages:
  • Cost(qi) = cq + cr · P(qi) + cd · P(qi)
  • (1) cq: cost for issuing a query
  • (2) cr · P(qi): cost for retrieving a result item,
    times the no. of results
  • (3) cd · P(qi): cost for downloading a doc,
    times the no. of docs
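
As a worked example, the same cost model in Python, using the constants applied to PubMed later in the talk (cq = 100, cr = 100, cd = 10,000):

    def query_cost(p_qi, c_q=100, c_r=100, c_d=10_000):
        """Cost(qi) = cq + cr*P(qi) + cd*P(qi): a fixed per-query charge,
        plus retrieval and download charges proportional to the fraction
        P(qi) of the site that the query matches."""
        return c_q + c_r * p_qi + c_d * p_qi

    # A query matching 30% of the site, as in the earlier "liver" example:
    print(query_cost(0.3))  # 100 + 100*0.3 + 10000*0.3 = 3130.0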
16
Problem formalization
  • Find the set of queries q1, …, qn
  • which maximizes
  • P(q1 ∨ … ∨ qn)
  • under the constraint Σi Cost(qi) ≤ t,
    where t is the maximum download resource available

17
Query selection algorithms
  • Random Select a query randomly from a
    precompiled list (e.g. a dictionary)
  • Frequency-based Select a query from a
    precompiled list based on frequency (e.g. a
    corpus previously downloaded from the Web)
  • Adaptive Analyze previously downloaded pages to
    determine promising future queries

18
Adaptive query selection
  • Assume we have issued q1, …, qi-1.
  • To find a promising query qi we need to estimate
    P(q1 ∨ … ∨ qi-1 ∨ qi):
  • P((q1 ∨ … ∨ qi-1) ∨ qi) =
    P(q1 ∨ … ∨ qi-1) + P(qi) −
    P(q1 ∨ … ∨ qi-1) · P(qi | q1 ∨ … ∨ qi-1)

  • P(q1 ∨ … ∨ qi-1): known (by counting), since we
    have issued q1, …, qi-1
  • P(qi | q1 ∨ … ∨ qi-1): can measure by counting
    occurrences of qi within the pages downloaded
    for q1, …, qi-1
  • What about P(qi)?
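
A minimal sketch of the measurable conditional, computed by counting over the pages downloaded so far (substring matching stands in for real tokenization; names are illustrative):

    def conditional_prob(term, downloaded_pages):
        """Estimate P(qi | q1 v ... v qi-1): the fraction of the pages
        downloaded so far whose text contains the candidate term.
        `term` is assumed lowercase; pages are plain-text strings."""
        if not downloaded_pages:
            return 0.0
        hits = sum(1 for text in downloaded_pages if term in text.lower())
        return hits / len(downloaded_pages)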
19
Estimating P(qi)
  • Independence estimator:
  • P(qi) ≈ P(qi | q1 ∨ … ∨ qi-1)
  • Zipf estimator [IG02]:
  • Rank queries based on frequency of occurrence and
    fit a power-law distribution
  • Use the fitted distribution to estimate P(qi)
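
A minimal sketch of the Zipf-style fit, assuming numpy and term frequencies already observed in the downloaded sample (a simplification of the [IG02] estimator):

    import numpy as np

    def fit_power_law(freqs):
        """Fit freq ≈ C * rank^(-alpha) by least squares in log-log space.
        freqs: positive term frequencies, sorted in descending order."""
        ranks = np.arange(1, len(freqs) + 1)
        slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return np.exp(intercept), -slope  # C, alpha

    def predicted_freq(rank, C, alpha):
        """Read the estimated frequency of the term at a given rank off
        the fitted curve; normalizing these values estimates P(qi)."""
        return C * rank ** (-alpha)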
20
Query selection algorithm
  • foreach qi in potential queries do
  • Pnew(qi) = P(q1 ∨ … ∨ qi-1 ∨ qi) − P(q1 ∨ … ∨ qi-1)
  • Estimate Efficiency(qi) = Pnew(qi) / Cost(qi)
  • done
  • return the qi with maximum Efficiency(qi)
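
A hedged Python sketch of this selection step; the callables bundle the quantities from the previous slides (all names are illustrative, not from the paper):

    def select_next_query(candidates, p_covered, cond_prob, p_estimate, cost):
        """Return the candidate query with maximum Efficiency(qi).
        p_covered     : P(q1 v ... v qi-1), known by counting
        cond_prob(q)  : P(q | q1 v ... v qi-1), measured on downloaded pages
        p_estimate(q) : estimated P(q) over the whole site
        cost(p)       : Cost of a query matching fraction p of the site"""
        def efficiency(q):
            # Pnew(q) = P(q) - P(q1 v ... v qi-1) * P(q | q1 v ... v qi-1)
            p_new = p_estimate(q) - p_covered * cond_prob(q)
            return p_new / cost(p_estimate(q))
        return max(candidates, key=efficiency)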

21
Other practical issues
  • Efficient calculation of P(qi | q1 ∨ … ∨ qi-1)
  • Selection of the initial query
  • Crawling sites that limit the number of results
    (e.g., DMOZ returns up to 10,000 results)
  • Please refer to our paper for the details

22
Outline
  • Interacting with Hidden-Web sites
  • Algorithms for selecting queries for the
    Hidden-Web sites
  • Experimental evaluation of our algorithms

23
Experimental evaluation
  • Applied our algorithms to 4 different sites

24
Policies
  • Random-16K
  • Pick query randomly from 16,000 most popular
    terms
  • Random-1M
  • Pick query randomly from 1,000,000 most popular
    terms
  • Frequency-based
  • Pick query based on frequency of occurrence
  • Adaptive

25
Coverage of policies
  • What fraction of the Web sites can we download by
    issuing queries?
  • Study P(q1 ∨ … ∨ qi) as i increases

26
Coverage of policies for PubMed
  • Adaptive reaches 80% coverage with 83 queries
  • Frequency-based needs 103 queries for the same coverage

27
Coverage of policies for DMOZ (whole)
  • Adaptive outperforms others

28
Coverage of policies for DMOZ (arts)
  • Adaptive performs best in topic-specific texts

29
Other experiments
  • Impact of the initial query
  • Impact of the various parameters of the cost
    function
  • Crawling sites that limit the number of results
    (e.g., DMOZ returns up to 10,000 results)
  • Please refer to our paper for the details

30
Related work
  • Issuing queries to databases:
  • Acquire a language model [CCD99]
  • Estimate the fraction of the Web indexed [LG98]
  • Estimate the relative size and overlap of indexes
    [BB98]
  • Build multi-keyword queries that can return a
    large number of documents [BF04]
  • Harvesting approaches / cooperative databases (OAI
    [LS01], DP9 [LMZN02])

31
Conclusion
  • An adaptive algorithm for issuing queries to
    Hidden-Web sites
  • Our algorithm is highly efficient (downloaded
    >90% of a site with about 100 queries)
  • Allows users to tap into unexplored information
    on the Web
  • Allows the research community to download, mine,
    study, and understand the Hidden Web

32
References
  • [IG02] P. Ipeirotis, L. Gravano. Distributed
    search over the hidden web: Hierarchical database
    sampling and selection. VLDB 2002.
  • [CCD99] J. Callan, M. Connell, A. Du. Automatic
    discovery of language models for text databases.
    SIGMOD 1999.
  • [LG98] S. Lawrence, C.L. Giles. Searching the
    World Wide Web. Science 280(5360):98-100, 1998.
  • [BB98] K. Bharat, A. Broder. A technique for
    measuring the relative size and overlap of public
    web search engines. WWW 1998.
  • [BF04] L. Barbosa, J. Freire. Siphoning
    hidden-web data through keyword-based interfaces.
    2004.
  • [LS01] C. Lagoze, H. Van de Sompel. The Open
    Archives Initiative: Building a low-barrier
    interoperability framework. JCDL 2001.
  • [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson.
    DP9: An OAI Gateway Service for Web Crawlers. JCDL
    2002.

33
Thank you !
  • Questions ?

34
Impact of the initial query
  • Does it matter what the first query is?
  • Crawled PubMed with the queries:
  • "data" (1,344,999 results)
  • "information" (308,474 results)
  • "return" (29,707 results)
  • "pubmed" (695 results)

35
Impact of the initial query
  • Algorithm converges regardless of initial query

36
Incorporating the document download cost
  • Cost(qi) = cq + cr · P(qi) + cd · Pnew(qi)
  • Crawled PubMed with:
  • cq = 100
  • cr = 100
  • cd = 10,000

37
Incorporating document download cost
  • Adaptive uses resources more efficiently
  • Document cost is a significant portion of the total cost

38
Can we get all the results back?

39
Downloading from sites limiting the number of
results (1)
  • The site returns q′i instead of qi (only a subset
    of the matching pages)
  • For qi+1 we need to estimate P(qi+1 | q1 ∨ … ∨ qi)

40
Downloading from sites limiting the number of
results (2)
  • Assuming q′i is a random sample of qi
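
A sketch of the resulting estimate: under the random-sample assumption, the fraction of sampled pages containing a candidate term, scaled by P(qi), estimates the overlap P(qi+1 ∧ qi) on the full site (names are illustrative, not the paper's notation):

    def estimate_overlap(term, sample_pages, p_qi):
        """The site returned only a sample q'i of qi.  If q'i is a
        uniform random sample, the in-sample fraction of pages containing
        `term`, scaled by P(qi), estimates P(term AND qi) site-wide."""
        if not sample_pages:
            return 0.0
        frac = sum(1 for p in sample_pages if term in p.lower()) / len(sample_pages)
        return frac * p_qi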

41
Impact of the limit of results
  • How does the limit of results affect our
    algorithms?
  • Crawled DMOZ but restricted the algorithms to
    1,000 results instead of 10,000

42
DMOZ with a result cap at 1,000
  • Adaptive still outperforms frequency-based