The Focus Project - PowerPoint PPT Presentation

Slides: 45

Provided by: sou59

1
The Focus Project

Soumen Chakrabarti (IIT Bombay)
David Gibson (Berkeley)
Piotr Indyk (Stanford)
Kevin McCurley (IBM Almaden)
Martin van Den Berg (Xerox)
Byron Dom (IBM Almaden)

2
Focused Crawling: A New Approach to
Topic-Specific Web Resource Discovery

Soumen Chakrabarti (IIT Bombay)
Martin van Den Berg (Xerox)
Byron Dom (IBM Almaden)

3
Quote 1

Portals and search pages are changing rapidly, in
part because their biggest strength, massive
size and reach, can also be a drawback. The most
interesting trend is the growing sense of natural
limits, a recognition that covering a single
galaxy can be more practical and useful than
trying to cover the entire universe.
Dan Gillmor, San Jose Mercury News

4
Scenario

Disk drive research group wants to track magnetic
surface technologies
Compiler research group wants to trawl the web
for graduate student resumés
____ wants to enhance his/her collection of
bookmarks about ____ with prominent and relevant
links
Virtual libraries like Yahoo!, the Open Directory
Project and the Mining Co.

5
Structured web queries

How many links were found from an environment
protection agency site to a site about oil and
natural gas in the last year?
Apart from cycling, what is the most common topic
cited by pages on cycling?
Find Web research pages which are widely cited by
Hawaiian vacation pages

Answer: first-aid

6
Quote 2

As people become more savvy users of the Net,
they want things which are better focused on
meeting their specific needs. We're going to see
a whole lot more of this, and it's going to
potentially erode the user base of some of the
big portals.
Jim Hake, Founder, Global Information Infrastructure
http://www.gii-awards.com/

7
Goals

Spontaneous, decentralized formation of topical
communities
Automatic construction of a focused portal
containing resources that are:
Relevant to the user's focus of interest
Of high influence and quality
Collectively comprehensive
Discovery that combines structure and content

8
Model

Taxonomy with some chosen topics
Each page has a relevance score w.r.t. chosen
topics
Mendelzon and Milo's web access cost model
Goal is to expand start set to maximize average
relevance

(Figure: example taxonomy with nodes All, Science, Sports, Cycling, Physics, Hiking, Zoology)

9
Properties to be exploited

A page with high relevance tends to link to at
least some other relevant pages (radius-one rule)
Given that a page u links to relevant page(s),
chances are increased that u points to other
relevant pages (radius-two rule)

10
Syntactic query-by-example

If part of the answer is known, trivial search
techniques may do quite well
E.g., European airlines
swissair iberia klm
E.g., Car makers
Which pages link to www.honda.com and
www.toyota.com?
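
The car-makers query above amounts to intersecting backlink sets: pages that cite all the known examples are likely to cite other members of the same class. A minimal sketch, assuming a small in-memory link graph; the pages under example.org and the function names are invented for illustration and are not part of the Focus system.

```python
from collections import defaultdict

def build_backlinks(links):
    """Invert a {page: [outlinks]} map into {target: {pages linking to it}}."""
    backlinks = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks

def pages_linking_to_all(links, examples):
    """Return the pages that link to every URL in `examples`."""
    backlinks = build_backlinks(links)
    sets = [backlinks.get(url, set()) for url in examples]
    return set.intersection(*sets) if sets else set()

def suggest_similar(links, examples):
    """Rank other targets cited by the pages that cite all the examples."""
    counts = defaultdict(int)
    for hub in pages_linking_to_all(links, examples):
        for target in links[hub]:
            if target not in examples:
                counts[target] += 1
    # Most frequently co-cited first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy graph (assumption, for illustration only).
LINKS = {
    "example.org/carlinks": ["www.honda.com", "www.toyota.com", "www.ford.com"],
    "example.org/autos": ["www.honda.com", "www.toyota.com",
                          "www.ford.com", "www.bmw.com"],
    "example.org/honda-fans": ["www.honda.com"],
}

hubs = pages_linking_to_all(LINKS, ["www.honda.com", "www.toyota.com"])
candidates = suggest_similar(LINKS, ["www.honda.com", "www.toyota.com"])
```

Here `hubs` holds the two pages that cite both car makers, and `candidates` lists the other sites those pages cite, which is the "rest of the answer" the query-by-example idea is after.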

11
(No Transcript)
12
The backlink architecture
(Figure: client C, servers S1 and S2, pages http://S1/P1 and http://S2/P2)
GET /P2 HTTP/1.0
Referer: http://S1/P1
www.cs.berkeley.edu/soumen/doc/www99back/userstudy
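
The request shown on this slide suggests how a server can learn its own backlinks: each incoming request may carry a Referer header naming the page that linked to it. A minimal sketch of that bookkeeping, assuming requests arrive as raw text; `parse_request` and `BacklinkLog` are hypothetical names, not the authors' implementation.

```python
from collections import defaultdict

def parse_request(raw, host):
    """Return (requested URL, referring URL or None) from a raw HTTP request."""
    lines = raw.strip().splitlines()
    _method, path, _version = lines[0].split()
    referer = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "referer":
            referer = value.strip()
    return f"http://{host}{path}", referer

class BacklinkLog:
    """Per-server store mapping each local page to the pages seen linking to it."""

    def __init__(self):
        self.backlinks = defaultdict(set)

    def record(self, raw, host):
        """Log one request; remember the referrer if the client sent one."""
        url, referer = parse_request(raw, host)
        if referer:
            self.backlinks[url].add(referer)
        return url

    def who_links_to(self, url):
        """Return the known backlinks of `url`, sorted for stable output."""
        return sorted(self.backlinks.get(url, ()))

log = BacklinkLog()
log.record("GET /P2 HTTP/1.0\r\nReferer: http://S1/P1\r\n\r\n", host="S2")
log.record("GET /P2 HTTP/1.0\r\n\r\n", host="S2")  # no Referer: nothing stored
```

Because each server only stores referrers for its own pages, the extra storage stays local, which matches the decentralized design argued for on the next slide.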
13
Backlink rationale

Centralized backlink service does not scale
Limited additional storage per server
Turn hyperlinks into undirected edges
A series of forward and backward clicks can
quickly build a topical community
Can be used to boot-strap the focused crawler

14
Backlink example 1
15
Backlink example 2
16
Backlink example 3
17
Backlink example 4
18
Estimating popularity

Extensive research on social network theory
Wasserman and Faust
Hyperlink based
Large in-degree indicates popularity/authority
Not all votes are worth the same
Several similar ideas and refinements
Google (Page and Brin) and HITS (Kleinberg)
Resource compilation (Chakrabarti et al.)
Topic distillation (Bharat and Henzinger)

19
Topic distillation overview

Given web graph and query
Search engine selects sub-graph
Expansion, pruning and edge weights
Nodes iteratively transfer authority to cited
neighbors

(Figure: a query sent to a search engine over the Web yields the selected subgraph)
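
The iterative transfer of authority described on this slide can be sketched in the spirit of Kleinberg's HITS, which the deck cites. This is a simplified illustration under stated assumptions: all edges have weight 1, the expansion and pruning steps are omitted, and the function name `hits` and the toy edge list are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """Compute hub and authority scores on a directed edge list [(u, v), ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # A page's hub score is the sum of the authority of pages it cites.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += auth[v]
        hub = new_hub
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy subgraph (assumption): h1 cites a and b, h2 cites only a.
EDGES = [("h1", "a"), ("h2", "a"), ("h1", "b")]
hub_scores, auth_scores = hits(EDGES)
```

On this toy graph, page a ends up with more authority than b because two hubs cite it, and h1 ends up the stronger hub because it cites both authorities.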
20
Preliminary distillation-based approach

Design a keyword query to represent topics of
focus
Using a large web crawl, run topic distillation
on the query
Refine query by inspecting result and
trial-and-error

21
Problems with preliminary approach

Unreliability of keyword match
Engines differ significantly on a given query due
to small overlap (Bharat and Broder)
Narrow, arbitrary view of relevant subgraph
Topic model does not improve over time
Dependence on large web crawl and index (lack of
output sensitivity)
Difficulty of query construction

22
Output sensitivity

Say the goal is to find a comprehensive
collection of recreational and competitive
bicycling sites and pages
Ideally effort should scale with size of the
result
Time spent crawling and indexing sites unrelated
to the topic is wasted
Likewise, time that does not improve
comprehensiveness is wasted

23
Query construction
/Companies/Electronics/Power_Supply
power suppl
switch mode smps
-multiprocessor
uninterrupt power suppl ups
-parcel
24
Query complexity

Complex queries needed for distillation
Typical Alta Vista queries are much simpler
(Silverstein, Henzinger, Marais and Moricz)
Forcing a hub or authority helps 86% of the time

25
Proposed solution

Resource discovery system that can be customized
to crawl for any topic by giving examples
Hypertext mining algorithms learn to recognize
pages and sites about the given topic, and to
measure their centrality
Crawler has guidance hooks controlled by these
two scores (relevance and centrality)
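
One way to picture a crawler steered by a relevance score is a priority-queue frontier in which links found on relevant pages are fetched first. A minimal sketch under stated assumptions: the web is a small in-memory dictionary, a keyword scorer stands in for the trained classifier, and the centrality score is left out. The names `focused_crawl`, `keyword_relevance`, `TOPIC_WORDS`, and `WEB` are all hypothetical.

```python
import heapq

TOPIC_WORDS = {"bike", "cycling"}  # stand-in for a learned topic model

def keyword_relevance(text):
    """Fraction of words that are topic words (a crude classifier stand-in)."""
    words = text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / len(words) if words else 0.0

def focused_crawl(web, seeds, relevance, budget):
    """Visit up to `budget` pages, preferring links found on relevant pages.

    web: {url: (text, [outlinks])}. Returns [(url, score)] in visit order.
    """
    frontier = [(-1.0, url) for url in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _neg_priority, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unreachable page in this toy model
        text, outlinks = web[url]
        score = relevance(text)
        visited.append((url, score))
        for target in outlinks:
            if target not in seen:
                seen.add(target)
                # Priority of a link is the relevance of the page citing it.
                heapq.heappush(frontier, (-score, target))
    return visited

# Toy web (assumption, for illustration only).
WEB = {
    "hub": ("cycling bike links", ["bike-shop", "news"]),
    "bike-shop": ("bike cycling gear", ["mt-biking"]),
    "news": ("stocks weather politics", ["finance"]),
    "mt-biking": ("mountain bike trail", []),
    "finance": ("stocks bonds", []),
}
crawled = focused_crawl(WEB, ["hub"], keyword_relevance, budget=4)
```

With a budget of four pages, the link found on the off-topic news page is the one left unfetched, while the link found on the bike shop page is followed.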

26
Administration scenario
(Figure: administration interface with a taxonomy editor, the current examples, and suggested additional examples that can be dragged in)
27
Relevance
(Figure: taxonomy with nodes All, BusEcon, Recreation, Arts, Companies, Cycling, Bike Shops, Clubs, Mt. Biking, ...; nodes are marked as path nodes, good nodes, or subsumed nodes)
28
Classification

How relevant is a document w.r.t. a class?
Supervised learning, filtering, classification,
categorization
Many types of classifiers
Bayesian, nearest neighbor, rule-based
Hypertext
Both text and links are class-dependent clues
How to model link-based features?

29
The bag-of-words document model

Decide topic: topic $c$ is picked with prior
probability $\pi(c)$, where $\sum_c \pi(c) = 1$
Each $c$ has parameters $\theta(c,t)$ for terms $t$
Coin with face probabilities $\theta(c,t)$, where $\sum_t \theta(c,t) = 1$
Fix document length and keep tossing coin
Given c, probability of document is
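
The formula itself did not survive in the transcript. Under the coin-tossing process described above, the standard multinomial likelihood would apply; the following is a sketch of that standard form, not a recovery of the slide's exact notation. It assumes the prior is written $\pi(c)$ and the term parameters $\theta(c,t)$, and it introduces $n(d,t)$ for the number of occurrences of term $t$ in document $d$ and $n(d)$ for the document length.

```latex
\Pr[d \mid c] \;=\; \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{\,n(d,t)},
\qquad n(d) = \sum_{t} n(d,t)

\Pr[c \mid d] \;=\; \frac{\pi(c)\,\Pr[d \mid c]}{\sum_{c'} \pi(c')\,\Pr[d \mid c']}
```

The second line is Bayes' rule, which turns the per-class likelihood into the relevance of a document with respect to class $c$.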

30
Exploiting link features

c = class, t = text, N = neighbors
Text-only model: Pr[t | c]
Using neighbors' text to judge my topic:
Pr[t, t(N) | c]
Better model: Pr[t, c(N) | c]
Non-linear relaxation

31
Improvement using link features

9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
Forget a fraction of the neighbors' classes

32
Putting it together
33
Monitoring the crawler
(Figure: relevance of each crawled URL and its moving average, plotted against time)
34
Measures of success

Harvest rate
What fraction of crawled pages are relevant
Robustness across seed sets
Separate crawls with random disjoint samples
Measure overlap in URLs and servers crawled
Measure agreement in best-rated resources
Evidence of non-trivial work
Links from start set to the best resources
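
The first two measures above can be sketched in a few lines. The slides do not say how relevance is thresholded or how overlap is normalized, so both choices here are assumptions: a page counts as relevant when its score is at least 0.5, and overlap is the Jaccard ratio (shared items over all items). The function names are hypothetical.

```python
from urllib.parse import urlsplit

def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance meets the threshold."""
    if not relevance_scores:
        return 0.0
    hits = sum(score >= threshold for score in relevance_scores)
    return hits / len(relevance_scores)

def overlap(items_a, items_b):
    """Jaccard overlap of two collections (assumed normalization)."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def url_overlap(crawl_a, crawl_b):
    """Overlap between the sets of URLs fetched by two crawls."""
    return overlap(crawl_a, crawl_b)

def server_overlap(crawl_a, crawl_b):
    """Overlap between the sets of servers (host names) two crawls touched."""
    hosts_a = (urlsplit(url).netloc for url in crawl_a)
    hosts_b = (urlsplit(url).netloc for url in crawl_b)
    return overlap(hosts_a, hosts_b)

# Toy crawls from two disjoint seed sets (assumption, for illustration only).
CRAWL_1 = ["http://a.example/1", "http://b.example/2"]
CRAWL_2 = ["http://a.example/1", "http://a.example/3"]

rate = harvest_rate([0.9, 0.2, 0.7, 0.1])
urls_shared = url_overlap(CRAWL_1, CRAWL_2)
servers_shared = server_overlap(CRAWL_1, CRAWL_2)
```

Server overlap is typically higher than URL overlap, as in this toy case, because two crawls can reach the same site through different pages.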

35
Harvest rate
(Figure: harvest rate of an unfocused crawl)
36
Crawl robustness
(Figure: URL overlap and server overlap between Crawl 1 and Crawl 2)
37
Top resources after one hour

Recreational and competitive cycling
http://www.truesport.com/Bike/links.htm
http://reality.sgi.com/billh_hampton/jrvs/links.html
http://www.acs.ucalgary.ca/bentley/mark_links.html
HIV/AIDS research and treatment
http://www.stopaids.org/Otherorgs.html
http://www-hsl.mcmaster.ca/tomflem/aids.html
http://www.iohk.com/UserPages/mlau/aidsinfo.html
Purer and better than root set

38
(No Transcript)
39
(No Transcript)
40
Robustness of resource discovery

Sample disjoint sets of starting URLs
Two separate crawls
Find best authorities
Order by rank
Find overlap in the top-rated resources

41
Distance to best resources
42
Observations

Random walk on the Web rapidly mixes topics
Yet, there are large coherent paths and clusters
Focused crawling gives topic distillation richer
data to work on
Combining content with link structure eliminates
the need to tune link-based heuristics

43
Related work