1
FOCUSED CRAWLING
2
Context
  • World Wide Web growth.
  • Inktomi crawler:
  • Hundreds of Sun Sparc workstations.
  • Sun Sparc: 75 GB RAM, 1 TB disk.
  • Over 10M pages crawled.
  • Still only 30-40% of the Web crawled.
  • Long refreshes (weeks up to a month).
  • Low precision results for crafty queries.
  • Burden of indexing millions of pages.
  • Inefficient location of relevant topic-specific
    resources when using keyword queries.

3
Why Focused?
  • Better cover a single galaxy than the whole
    universe.
  • Work done on relatively narrow segment of Web.
  • Respectable coverage at rapid rate (due to
    segment-of-interest narrowness).
  • Small investment in hardware.
  • Low network resource usage.

4
Core Elements
  • Focused crawler: an example-driven automatic
    porthole generator.
  • Guided by a classifier and a distiller.
  • The former recognizes relevance from examples
    embedded in a topic taxonomy.
  • The latter identifies topical vantage points on
    the Web.
  • Based on canonical topic taxonomy with examples.

5
Operation Synopsis
  1. Taxonomy creation.
  2. Example collection.
  3. Taxonomy selection and refinement.
  4. Interactive exploration.
  5. Training.
  6. Resource discovery.
  7. Distillation.
  8. Feedback.

6
Taxonomy Creation
  • Pre-train the classifier with:
  • a canonical taxonomy,
  • corresponding examples.

7
Example Collection
  • Collect URLs of interest (e.g. while browsing).
  • Import collected URLs.

8
Taxonomy Selection and Refinement
  • Propose the most common classes where the
    examples fit best.
  • Mark classes as GOOD.
  • Refine the taxonomy, i.e.:
  • refine categories, and/or
  • move documents from one category to another.
  • Integration time required by major changes:
  • a few hours for 260,000 Yahoo! documents.
  • Smaller changes (moving docs) are interactive.

9
Interactive Exploration
  • Propose URLs found in a small neighbourhood of
    the examples.
  • Examine and include some of these as examples.

10
Training
  • Integrate refinements into statistical class
    model (classifier-specific action).

11
Distillation
  • Identify relevant hubs by running (intermittently
    and/or concurrently) a topic distillation
    algorithm.
  • Raise visit priorities of hubs and immediate
    neighbours.

12
Feedback
  • Report most popular sites and resources.
  • Mark results as useful/useless.
  • Send feedback to classifier and distiller.

13
Snapshot
(figure omitted from transcript)
14
Some definitions...
  • G: directed hypertext graph.
  • C: tree-shaped hierarchical topic directory.
  • D(c): examples referred to by topic node c ∈ C.
  • C*: subset of topics marked good, representing
    the user's interest.
  • Remarks
  • A good topic is not an ancestor of another good
    topic.
  • For a web page p, R_C*(p), the relevance of p
    with respect to C*, must be furnished to the
    system.
  • R_root(p) = 1, and R_c0(p) = Σi R_ci(p), where
    the ci are the children of c0 (see the toy check
    below).
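
A toy check of this constraint (hypothetical taxonomy and made-up
relevance values; none of these numbers come from the slides), in Python:

    # Toy taxonomy: parent -> children (hypothetical example).
    tree = {
        "root": ["arts", "science"],
        "science": ["biology", "physics"],
    }

    # Hypothetical relevance scores R_c(p) for one page p.
    R = {"root": 1.0, "arts": 0.1, "science": 0.9,
         "biology": 0.6, "physics": 0.3}

    # R_root(p) = 1, and each parent's relevance equals the sum of
    # its children's relevances.
    assert R["root"] == 1.0
    for parent, children in tree.items():
        assert abs(R[parent] - sum(R[c] for c in children)) < 1e-9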

15
Crawler in terms of Graph
  • Start by visiting all pages ∈ D(C*).
  • Inspect V, the set of visited pages.
  • Choose unvisited pages from the crawl frontier.
  • GOAL: visit as many relevant pages and as few
    irrelevant pages as possible, i.e.
  • find V ⊇ D(C*), reachable from D(C*), s.t.
    Σv∈V R(v) / |V| → max.
  • Goal attainable due to citations: relevant pages
    tend to cite other relevant pages (a best-first
    sketch follows).
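
A minimal best-first sketch of this loop, assuming hypothetical callables
fetch(url), extract_links(page) and relevance(page) that the slides do
not specify:

    import heapq

    def focused_crawl(seed_urls, fetch, extract_links, relevance,
                      budget=1000):
        # Always expand the most promising unvisited URL on the
        # frontier (max-heap via negated priority).
        visited = set()
        frontier = [(-1.0, u) for u in seed_urls]  # seeds: top priority
        heapq.heapify(frontier)
        while frontier and len(visited) < budget:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            r = relevance(page)  # R(v), supplied by the classifier
            for link in extract_links(page):
                if link not in visited:
                    # Out-links inherit the source page's relevance.
                    heapq.heappush(frontier, (-r, link))
        return visited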

16
Classification
  • Definitions
  • good(c): c is marked as good.
  • For a document d:
  • P(root|d) = 1
  • P(c|d) = P(parent(c)|d) · P(c|d, parent(c))
  • P(c|d, parent(c)) = P(c|parent(c)) P(d|c) /
    Σi P(ci|parent(c)) P(d|ci), where the ci are the
    siblings of c (including c itself).
  • P(d|c) depends on the document generation model.
  • P(c|parent(c)): prior class distribution.
  • Steps of the generation model
  • Pick a leaf node c using the defined
    probabilities.
  • Class c has a die with as many faces as there are
    unique tokens t ∈ U.
  • Face t turns up with probability θ(c,t).
  • Length n(d) is chosen arbitrarily by the
    generator.
  • Flip the die and write the token corresponding to
    the face.
  • If token t occurs n(d,t) times, then
    P(d|c) ∝ Πt θ(c,t)^n(d,t).
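
Classification inverts this generator via Bayes' rule, one taxonomy level
at a time. A minimal sketch, assuming hypothetical inputs priors
(P(c|parent)) and thetas (the θ(c,t) dice):

    import math
    from collections import Counter

    def log_p_doc_given_class(tokens, theta_c):
        # log P(d|c) = sum over tokens t of n(d,t) * log θ(c,t).
        counts = Counter(tokens)
        return sum(n * math.log(theta_c.get(t, 1e-9))
                   for t, n in counts.items())

    def p_children_given_doc(tokens, children, priors, thetas):
        # One Bayes step: P(c|d, parent) is proportional to
        # P(c|parent) * P(d|c), normalized over the siblings.
        logs = {c: math.log(priors[c]) +
                   log_p_doc_given_class(tokens, thetas[c])
                for c in children}
        m = max(logs.values())
        w = {c: math.exp(v - m) for c, v in logs.items()}  # underflow guard
        z = sum(w.values())
        return {c: x / z for c, x in w.items()}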

17
Remarks on Classification
  • Documents seen as bag of words, without order
    information and inter-term correlation.
  • During crawling the task is the reverse of
    generation.
  • Two types of focus possible with the classifier
    (sketched below):
  • Hard-focus
  • Find the leaf c with the highest probability.
  • If ∃ an ancestor of c s.t. good(ancestor), then
    allow future visits of the links in d,
  • else prune at d.
  • Soft-focus
  • Page relevance R(d) = Σgood(c) P(c|d).
  • Assign priority R(d) to the unvisited neighbours
    of d.
  • If multiple paths reach a page, take the maximum
    relevance.
  • When a neighbour is visited, update its score.
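
A compact sketch of the two modes, with hypothetical helpers good(c) and
ancestors(c) standing in for the taxonomy bookkeeping:

    def hard_focus_allows(best_leaf, good, ancestors):
        # Hard focus: expand d's out-links only if the most probable
        # leaf class, or one of its ancestors, is marked good.
        return any(good(c) for c in [best_leaf, *ancestors(best_leaf)])

    def soft_focus_priority(class_probs, good):
        # Soft focus: R(d) = sum of P(c|d) over good classes c, used
        # as the visit priority of d's unvisited neighbours.
        return sum(p for c, p in class_probs.items() if good(c))

    def merge_priority(old_r, new_r):
        # A page reached along multiple paths keeps the maximum
        # relevance seen so far.
        return max(old_r, new_r)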

18
Distillation
  • Goal: identify hubs.
  • Adopted idea (hubs and authorities):
  • each node v of the Web graph has two scores,
    a(v) and h(v), s.t.
  • h(u) = Σ(u,v)∈E a(v)   (1)
  • a(v) = Σ(u,v)∈E h(u)   (2)
  • E: edge set (adjacency matrix).
  • Enhancements
  • Non-unit edge weights.
  • Forward and backward weight matrices EF and EB.
  • EF[u,v] = R(v) prevents leakage of prestige from
    relevant hubs to irrelevant authorities.
  • EB[u,v] = R(u) prevents a relevant authority from
    reflecting prestige on irrelevant hubs.
  • A relevance threshold for including authorities
    in the graph.
  • Steps
  • Construct the edge set E, only for pages on
    different sites, with forward and backward edge
    weights.
  • Apply (1) and (2), always restricting authorities
    using the threshold (see the sketch below).
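
A small sketch of the weighted iteration, assuming edges is a set of
(u, v) pairs between pages on different sites and R maps each node to
its relevance (names are illustrative):

    def weighted_hits(edges, R, iters=50):
        nodes = {u for u, v in edges} | {v for u, v in edges}
        a = dict.fromkeys(nodes, 1.0)  # authority scores
        h = dict.fromkeys(nodes, 1.0)  # hub scores
        for _ in range(iters):
            # (2) with backward weights EB[u,v] = R(u).
            a = {n: sum(h[u] * R[u] for u, v in edges if v == n)
                 for n in nodes}
            # (1) with forward weights EF[u,v] = R(v).
            h = {n: sum(a[v] * R[v] for u, v in edges if u == n)
                 for n in nodes}
            za = sum(a.values()) or 1.0
            zh = sum(h.values()) or 1.0
            a = {n: x / za for n, x in a.items()}  # normalize each round
            h = {n: x / zh for n, x in h.items()}
        return a, h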

19
Integration with the Crawler
  • One watchdog thread
  • Inspect new work from the crawl frontier (stored
    on disk).
  • Pass new work to worker threads (using shared
    memory buffers).
  • Many worker threads
  • Save details of newly explored pages in
    per-worker disk structures.
  • Invoke the classifier for each new page.
  • Stop workers, collect and integrate results into
    a central pool (priority queue).
  • Soft crawling → URLs ordered by
  • (page-fetches ascending, R descending).
  • Hard crawling → surviving URLs ordered by
  • page-fetches ascending.
  • Populate the link graph.
  • Periodically stop the crawler and execute the
    distiller ⇒ revisit the obtained hubs and visit
    unvisited pages pointed to by hubs (ordering
    sketch below).
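
A sketch of the soft-crawling order using Python's heapq (the smallest
tuple pops first); the URLs are illustrative:

    import heapq

    def soft_key(num_fetches, relevance):
        # (page-fetches ascending, R descending): fewer fetches win,
        # ties broken by higher relevance.
        return (num_fetches, -relevance)

    frontier = []
    heapq.heappush(frontier, (soft_key(2, 0.8), "http://example.org/a"))
    heapq.heappush(frontier, (soft_key(2, 0.3), "http://example.org/b"))
    heapq.heappush(frontier, (soft_key(1, 0.1), "http://example.org/c"))
    assert heapq.heappop(frontier)[1] == "http://example.org/c"
    assert heapq.heappop(frontier)[1] == "http://example.org/a"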

20
Integration
(figure omitted from transcript)
21
Evaluation
  • Performance parameters
  • Precision (relevance)
  • Quality of resource discovery.
  • Synopsis
  • Experimental setup
  • Harvesting rate of relevant pages
  • Acquisition robustness
  • Resource discovery robustness
  • Good resources remoteness
  • Effect of distillation on crawling.

22
Experimental Setup
  • Crawler is a C application.
  • Operates through a firewall.
  • Crawler run with relatively few threads.
  • Up to 12 example web pages used per category.
  • 6,000 URLs returned per hour.
  • 20 topics (gardening, mutual funds, cycling,
    etc.).

23
Harvesting Rate of Relevant Pages
  • Goal: a high relevant-page acquisition rate.
  • Low harvest rate → time spent merely on
    eliminating irrelevant pages ⇒ better to use an
    ordinary crawl instead (a harvest-rate sketch
    follows this list).
  • 3 crawls done:
  • Same sample set of a few dozen relevant URLs.
  • Unfocused
  • All out-links registered for exploration.
  • No use of R, except for measurement ⇒ slight
    slowdown.
  • Soft
  • Probably more robust than hard crawling, BUT
    needs more skill against unwanted topic
    diffusion.
  • Problem: distinguish between a noisy and a
    systematic drop in relevance.
  • Hard
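
One way to make the harvest-rate metric concrete (a sketch; the 0.5
cutoff is an assumption, not from the slides):

    def harvest_rate(relevances, threshold=0.5):
        # Fraction of fetched pages whose relevance clears the cutoff.
        # A low value means fetches are mostly wasted on noise.
        if not relevances:
            return 0.0
        return sum(r >= threshold for r in relevances) / len(relevances)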

24
Harvesting Rate Example
(figure omitted from transcript)
25
Acquisition Robustness
  • Goal: maintain a proper acquisition rate without
    being too sensitive to the start set.
  • Tests
  • 2 disjoint sets of randomly chosen starting URLs
    (roughly 30% each).
  • For each subset launch a focused crawler.
  • Robustness assessed by measuring the overlap of
    crawled URLs (see the sketch after this list).
  • Generous visits to new IP-addresses and also a
    normal increase in overlapping IP-addresses.
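
A minimal sketch of the overlap measurement; normalizing by the smaller
crawl is one plausible reading of the slides, not necessarily the
authors' exact definition:

    def url_overlap(crawl_a, crawl_b):
        # Shared URLs as a fraction of the smaller crawl's URL set.
        a, b = set(crawl_a), set(crawl_b)
        return len(a & b) / min(len(a), len(b))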

26
URL Overlap
(figure omitted from transcript)
27
Server Overlap
(figure omitted from transcript)
28
Resource Discovery Robustness
  • 2 sets of crawlers launched from different random
    samples.
  • Popularity/quality algorithm run with 50
    iterations.
  • Server overlap measured.
  • Result: the most popular sites are identified by
    both sets of crawlers although different sample
    sets were used.

29
Good Resources Remoteness
  • Was any real exploration done?
  • Non-trivial work done by the focused crawler,
    i.e. pursuing certain paths while pruning others.
  • A large number of servers found 10 links away and
    beyond from the starting set (distances
    measurable by BFS, as sketched below).
  • Millions of pages lie within a 10-link distance.
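
Link distance from the start set can be measured with a plain BFS;
out_links is a hypothetical callable returning a page's out-links:

    from collections import deque

    def link_distances(start_set, out_links):
        # Distance (in links) at which each page is first reached.
        dist = {u: 0 for u in start_set}
        queue = deque(start_set)
        while queue:
            u = queue.popleft()
            for v in out_links(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist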

30
Remoteness Example
(figure omitted from transcript)
31
Effect of Distillation on Crawling
  • A relevant page may be abandoned due to
    misclassification (e.g. the page has many images,
    or the classifier makes mistakes).
  • The distiller reveals top hubs ⇒ new unvisited
    URLs.

32
Conclusion
  • Strengths
  • Steady collection of relevant resources
  • Robustness to different starting conditions
  • Localization of good resources
  • Immunity to noise
  • Learning specialization from examples
  • Filtering done at data-acquisition level rather
    than as post-processing
  • Crawling done to greater depths due to frontier
    crawling
  • Still to go
  • At what specificity can a focused crawl be
    sustained? I.e., how do harvest rates depend on
    the topic?
  • Sociology of citations between topics ⇒ insights
    into how the Web evolves.
  • ...
