Title: The Focus Project
1The Focus Project
- Soumen Chakrabarti (IIT Bombay)
- David Gibson (Berkeley)
- Piotr Indyk (Stanford)
- Kevin McCurley (IBM Almaden)
- Martin van Den Berg (Xerox)
- Byron Dom (IBM Almaden)
2Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
- Soumen Chakrabarti (IIT Bombay)
- Martin van Den Berg (Xerox)
- Byron Dom (IBM Almaden)
3Quote 1
- Portals and search pages are changing rapidly, in part because their biggest strength, massive size and reach, can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe.
- Dan Gillmore, San Jose Mercury News
4Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student resumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like Yahoo!, the Open Directory Project and the Mining Co.
5Structured web queries
- How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?
- Apart from cycling, what is the most common topic cited by pages on cycling? (Answer: first-aid)
- Find Web research pages which are widely cited by Hawaiian vacation pages
6Quote 2
- As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.
- Jim Hake, Founder, Global Information Infrastructure, http://www.gii-awards.com/
7Goals
- Spontaneous, decentralized formation of topical communities
- Automatic construction of a focused portal containing resources that are
- Relevant to the user's focus of interest
- Of high influence and quality
- Collectively comprehensive
- Discovery that combines structure and content
8Model
- Taxonomy with some chosen topics
- Each page has a relevance score w.r.t. chosen topics
- Mendelzon and Milo's web access cost model
- Goal is to expand start set to maximize average relevance
[Figure: example taxonomy rooted at All, with nodes Science, Sports, Cycling, Physics, Hiking, Zoology]
9Properties to be exploited
- A page with high relevance tends to link to at least some other relevant pages (radius-one rule)
- Given that a page u links to relevant page(s), chances are increased that u points to other relevant pages (radius-two rule)
10Syntactic query-by-example
- If part of the answer is known, trivial search techniques may do quite well
- E.g., European airlines
- swissair iberia klm
- E.g., Car makers
- Which pages link to www.honda.com and www.toyota.com?
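The car-makers query above amounts to intersecting backlink sets. A minimal sketch, assuming a small in-memory link map; every page name other than the two hosts named on the slide is hypothetical:

```python
from collections import defaultdict

def build_backlinks(forward_links):
    """Invert a {page: [targets]} map into {target: {pages linking to it}}."""
    backlinks = defaultdict(set)
    for page, targets in forward_links.items():
        for target in targets:
            backlinks[target].add(page)
    return backlinks

def pages_citing_all(backlinks, examples):
    """Return the pages that link to every example URL."""
    sets = [backlinks.get(url, set()) for url in examples]
    return set.intersection(*sets) if sets else set()

# Hypothetical toy link data; only the two car-maker hosts come from the slide.
toy_links = {
    "carguide.example/makers": ["www.honda.com", "www.toyota.com", "www.ford.example"],
    "fanpage.example/honda": ["www.honda.com"],
    "dealers.example/list": ["www.toyota.com", "www.honda.com"],
}
backlinks = build_backlinks(toy_links)
cocited = pages_citing_all(backlinks, ["www.honda.com", "www.toyota.com"])
print(sorted(cocited))
```

Pages that cite all the known examples are likely to be lists of car makers, so their other out-links are candidate answers.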
12The backlink architecture
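[Figure: client C follows a link from http://S1/P1 on server S1 to http://S2/P2 on server S2; the request sent to S2 is "GET /P2 HTTP/1.0" with the header "Referer: http://S1/P1"]
www.cs.berkeley.edu/soumen/doc/www99back/userstudy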
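The request shown on this slide carries enough information for S2 to learn that P1 links to P2. A minimal sketch of how a server might record this, assuming an in-memory index keyed by the requested URL; the function name and storage layout are hypothetical, not taken from the slides:

```python
from collections import defaultdict

def record_backlink(index, host, request_text):
    """Parse one raw HTTP request and store its Referer as a backlink of the requested page."""
    lines = request_text.strip().splitlines()
    method, path, _version = lines[0].split()
    referer = None
    for header in lines[1:]:
        name, _, value = header.partition(":")
        if name.strip().lower() == "referer":
            referer = value.strip()
    if method == "GET" and referer:
        index["http://" + host + path].add(referer)
    return index

# The request from the slide, as it would arrive at server S2.
index = defaultdict(set)
record_backlink(index, "S2", "GET /P2 HTTP/1.0\r\nReferer: http://S1/P1\r\n")
print(dict(index))
```

Each server only stores backlinks for its own pages, which is the "limited additional storage per server" property this design relies on.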
13Backlink rationale
- Centralized backlink service does not scale
- Limited additional storage per server
- Turn hyperlinks into undirected edges
- A series of forward and backward clicks can quickly build a topical community
- Can be used to bootstrap the focused crawler
14Backlink example 1
15Backlink example 2
16Backlink example 3
17Backlink example 4
18Estimating popularity
- Extensive research on social network theory
- Wasserman and Faust
- Hyperlink based
- Large in-degree indicates popularity/authority
- Not all votes are worth the same
- Several similar ideas and refinements
- Google (Page and Brin) and HITS (Kleinberg)
- Resource compilation (Chakrabarti et al)
- Topic distillation (Bharat and Henzinger)
19Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited
neighbors
[Figure: a query goes to a search engine, which selects a subgraph of the Web]
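The iterative transfer of authority can be sketched with a plain Kleinberg-style hub/authority iteration. This is a simplified illustration on a hypothetical toy graph; it leaves out the expansion, pruning and edge-weighting steps listed above:

```python
import math

def hits(edges, iterations=50):
    """Hub/authority iteration over a list of (source, target) edges."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages citing the node.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages the node cites.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize to unit Euclidean length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical selected subgraph: three hub pages citing two authorities.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(toy_edges)
best_authority = max(auth, key=auth.get)
print(best_authority)
```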
20Preliminary distillation-based approach
- Design a keyword query to represent topics of focus
- Using a large web crawl, run topic distillation on the query
- Refine query by inspecting result and trial-and-error
21Problems with preliminary approach
- Unreliability of keyword match
- Engines differ significantly on a given query due to small overlap (Bharat and Broder)
- Narrow, arbitrary view of relevant subgraph
- Topic model does not improve over time
- Dependence on large web crawl and index (lack of output sensitivity)
- Difficulty of query construction
22Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally effort should scale with size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted
23Query construction
/Companies/Electronics/Power_Supply
power suppl
switch mode smps
-multiprocessor
uninterrupt power suppl ups
-parcel
24Query complexity
- Complex queries needed for distillation
- Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz)
- Forcing a hub or authority helps 86% of the time
25Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality
- Crawler has guidance hooks controlled by these two scores
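One way to picture the guidance hooks is a best-first crawl whose frontier is ordered by the relevance of the page that cited each URL. The sketch below is an assumption-laden illustration, not the system's implementation: it uses only the relevance score (centrality is omitted), and the toy web, its scores and all function names are hypothetical.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget):
    """Best-first crawl: expand the frontier URL whose citing page was most relevant."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = relevance(url)  # stands in for the learned topic classifier
        order.append((url, score))
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return order

# Hypothetical toy web and relevance scores.
TOY_WEB = {
    "seed": ["bike1", "news1"],
    "bike1": ["bike2", "bike3"],
    "news1": ["news2", "news3"],
}
TOY_RELEVANCE = {"seed": 0.6, "bike1": 0.9, "bike2": 0.8, "bike3": 0.7,
                 "news1": 0.1, "news2": 0.1, "news3": 0.0}

crawled = focused_crawl(["seed"], lambda u: TOY_WEB.get(u, []),
                        lambda u: TOY_RELEVANCE[u], budget=4)
print([url for url, _ in crawled])
```

With a budget of four fetches the crawl spends its effort on the links out of the highly relevant page and never reaches the pages behind the off-topic one.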
26Administration scenario
[Figure: administration screen with panels labeled Taxonomy Editor, Current Examples and Suggested Additional Examples, and a Drag action]
27Relevance
[Figure: taxonomy tree over All, BusEcon, Recreation, Arts, Companies, Cycling, Bike Shops, Clubs, Mt.Biking, ..., with nodes labeled as path nodes, good nodes and subsumed nodes]
28Classification
- How relevant is a document w.r.t. a class?
- Supervised learning, filtering, classification, categorization
- Many types of classifiers
- Bayesian, nearest neighbor, rule-based
- Hypertext
- Both text and links are class-dependent clues
- How to model link-based features?
29The bag-of-words document model
- Decide topic: topic c is picked with prior probability π(c), where Σ_c π(c) = 1
- Each c has parameters θ(c,t) for terms t
- Coin with face probabilities θ(c,t), where Σ_t θ(c,t) = 1
- Fix document length and keep tossing coin
- Given c, probability of document is
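A sketch of the standard multinomial form that matches the coin-tossing description above. The notation n(d,t) for the number of occurrences of term t in document d, and n(d) for the document length, is assumed here rather than taken from the slide:

```latex
\Pr[d \mid c] \;=\; \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{\,n(d,t)},
\qquad n(d) = \sum_{t} n(d,t)
```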
30Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
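The idea of judging a page by its neighbors' classes can be illustrated with a simplified relaxation-style update. This is a sketch under stated assumptions, not the algorithm behind the slides: the text-based class probabilities, the class compatibility table (probability of a neighbor's class given the page's own class) and the toy graph are all hypothetical.

```python
def relax_labels(text_probs, neighbors, compat, rounds=10):
    """Re-estimate each page's class distribution from its text and its neighbors' current beliefs."""
    classes = list(compat)
    beliefs = {page: dict(p) for page, p in text_probs.items()}
    for _ in range(rounds):
        updated = {}
        for page, text_p in text_probs.items():
            scores = {}
            for c in classes:
                score = text_p[c]
                for nb in neighbors.get(page, []):
                    # Expected compatibility of class c with the neighbor's current belief.
                    score *= sum(beliefs[nb][c2] * compat[c][c2] for c2 in classes)
                scores[c] = score
            total = sum(scores.values())
            updated[page] = {c: s / total for c, s in scores.items()}
        beliefs = updated
    return beliefs

# Hypothetical inputs: page "x" has uninformative text but two clearly on-topic neighbors.
compat = {"cycling": {"cycling": 0.8, "other": 0.2},
          "other": {"cycling": 0.2, "other": 0.8}}
text_probs = {"a": {"cycling": 0.9, "other": 0.1},
              "b": {"cycling": 0.9, "other": 0.1},
              "x": {"cycling": 0.5, "other": 0.5}}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}

beliefs = relax_labels(text_probs, neighbors, compat)
print(round(beliefs["x"]["cycling"], 3))
```

The page with ambiguous text ends up strongly classified as on-topic purely because of the classes of the pages around it.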
31Improvement using link features
- 9600 patents from 12 classes marked by USPTO
- Patents have text and cite other patents
- Expand test patent to include neighborhood
- Forget a fraction of the neighbors' classes
32Putting it together
33Monitoring the crawler
[Figure: relevance of each crawled URL and its moving average, plotted against time]
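The moving-average curve used for monitoring can be computed incrementally as pages are fetched. A minimal sketch, assuming a fixed window size and hypothetical relevance scores:

```python
from collections import deque

def moving_average(scores, window):
    """Yield the mean of the most recent `window` relevance scores after each fetch."""
    recent = deque(maxlen=window)
    for score in scores:
        recent.append(score)
        yield sum(recent) / len(recent)

# Hypothetical per-URL relevance scores in fetch order.
relevance_by_fetch = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
smoothed = list(moving_average(relevance_by_fetch, window=3))
print(smoothed)
```

A falling moving average is an early sign that the crawl is drifting off topic.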
34Measures of success
- Harvest rate
- What fraction of crawled pages are relevant
- Robustness across seed sets
- Separate crawls with random disjoint samples
- Measure overlap in URLs and servers crawled
- Measure agreement in best-rated resources
- Evidence of non-trivial work
- Links from start set to the best resources
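The first two measures can be computed directly from crawl logs. A minimal sketch; the relevance threshold, the choice to normalize overlap by the smaller set, and the sample URLs are assumptions made for illustration:

```python
from urllib.parse import urlsplit

def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance is at least `threshold`."""
    if not relevance_scores:
        return 0.0
    return sum(s >= threshold for s in relevance_scores) / len(relevance_scores)

def overlap(a, b):
    """Share of the smaller collection that also appears in the other one."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def servers(urls):
    """Host names appearing in a collection of URLs."""
    return {urlsplit(u).netloc for u in urls}

# Hypothetical results of two crawls started from disjoint seed samples.
crawl1 = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
crawl2 = ["http://a.example/2", "http://b.example/9", "http://c.example/1"]

url_overlap = overlap(crawl1, crawl2)
server_overlap = overlap(servers(crawl1), servers(crawl2))
rate = harvest_rate([0.9, 0.2, 0.7, 0.6])
print(rate, url_overlap, server_overlap)
```

Server overlap is usually higher than URL overlap, since two crawls can reach the same site through different pages.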
35Harvest rate
[Figure: harvest rate over the course of a crawl; surviving plot label: Unfocused]
36Crawl robustness
[Figure: URL overlap and server overlap between Crawl 1 and Crawl 2]
37Top resources after one hour
- Recreational and competitive cycling
- http://www.truesport.com/Bike/links.htm
- http://reality.sgi.com/billh_hampton/jrvs/links.html
- http://www.acs.ucalgary.ca/bentley/mark_links.html
- HIV/AIDS research and treatment
- http://www.stopaids.org/Otherorgs.html
- http://www-hsl.mcmaster.ca/tomflem/aids.html
- http://www.iohk.com/UserPages/mlau/aidsinfo.html
- Purer and better than root set
40Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
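The last two steps reduce to comparing the top-k lists of two ranked result sets. A minimal sketch with hypothetical authority scores:

```python
def top_k_overlap(auth_a, auth_b, k):
    """Fraction of the k best-ranked resources that two crawls have in common."""
    top_a = sorted(auth_a, key=auth_a.get, reverse=True)[:k]
    top_b = sorted(auth_b, key=auth_b.get, reverse=True)[:k]
    return len(set(top_a) & set(top_b)) / k

# Hypothetical authority scores from two crawls with disjoint start sets.
crawl_a = {"u1": 0.9, "u2": 0.8, "u3": 0.3, "u4": 0.1}
crawl_b = {"u2": 0.7, "u1": 0.6, "u5": 0.5, "u3": 0.2}

overlap_at_2 = top_k_overlap(crawl_a, crawl_b, k=2)
overlap_at_3 = top_k_overlap(crawl_a, crawl_b, k=3)
print(overlap_at_2, overlap_at_3)
```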
41Distance to best resources
42Observations
- Random walk on the Web rapidly mixes topics
- Yet, there are large coherent paths and clusters
- Focused crawling gives topic distillation richer data to work on
- Combining content with link structure eliminates the need to tune link-based heuristics
43Related work
- WebWatcher, HotList and ColdList
- Filtering as post-processing, not acquisition
- ReferralWeb
- Social network on the Web
- Ahoy!, Cora
- Hand-crafted to find home pages and papers
- WebCrawler, Fish, Shark, Fetuccino, agents
- Crawler guided by query keyword matches
44Comparison with agents
- Agents usually look for keywords and hand-crafted patterns
- Cannot learn new vocabulary dynamically
- Do not use distance-2 centrality information
- Client-side assistant
- We use taxonomy with statistical topic models
- Models can evolve as crawl proceeds
- Combine relevance and centrality
- Broader scope: inter-community linkage analysis and querying
45Conclusion
- New architecture for example-driven topic-specific web resource discovery
- No dependence on full web crawl and index
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes