Title: The CLEVER Searching
1 The CLEVER Searching
IBM Almaden Research Center
2 The lecture outline
- Introduction
- The HITS algorithm (Hyperlink-Induced Topic Search)
- The Automatic Resource Compiler
- The CLEVER
- Web-communities
- The Focused Crawler
3 Introduction
- Hundreds of millions of pages on the WEB
- Every day another million is added
- More than a billion hyperlinks connecting them
- The WEB lacks organization and structure
- How can we find information?
Traditional search engines!
4 Search Engines
- Maintain an index of words and the pages containing them
- Use a ranking function to rank the pages
5 The ranking function
- Most search engines use rules of thumb such as:
- The number of times the page contains the query word
- The location of the word in the page
- Giving more weight to words in titles or in a larger font
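The three rules of thumb above can be combined into a toy scoring function. This is only a sketch: the function name `rule_of_thumb_score`, the parameters `title_weight` and `early_bonus`, and their values are assumptions made for illustration and are not taken from any real search engine.

```python
import re

def rule_of_thumb_score(query_word, title, body, title_weight=3.0, early_bonus=1.0):
    """Score one page for one query word (illustrative weights only)."""
    word = query_word.lower()
    title_words = re.findall(r"\w+", title.lower())
    body_words = re.findall(r"\w+", body.lower())

    # Rule 1: number of times the page contains the query word.
    score = float(body_words.count(word))
    # Rule 3: extra weight for occurrences in the title.
    score += title_weight * title_words.count(word)
    # Rule 2: location of the word; an early first occurrence earns a bonus.
    if word in body_words:
        first = body_words.index(word)
        score += early_bonus * (1.0 - first / len(body_words))
    return score
```

Because the score grows with every repetition of the query word, a function of this kind is easy to inflate by repeating words on a page.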
6 Some disadvantages
- Spamming: using invisible fonts, repeating words
- Polysemy: the same word having multiple meanings
- Synonymy: different words having the same meaning
A possible solution: semantic networks
7 Semantic networks
- Double-edged sword: it helps with synonymy but aggravates polysemy
- Expensive to build and maintain
- Difficult to build: many languages, new terminology
Another possible solution: human-selected pages (YAHOO)
8 Some disadvantages
- Countless possible queries
- Relies on individual judgement
- Almost impossible as the WEB grows by a million pages a day
- Example: fishing
The CLEVER solution: using hyperlinks
9 The HITS Algorithm (1997)
- Let the WEB be a directed graph
- - nodes: static HTML pages
- - edges: links
- The average node has seven outgoing edges
10 The HITS Algorithm (1997)
- Two kinds of useful pages:
- Authority page: contains a lot of information about the query topic
- Hub page: contains a large number of links to pages containing information
11 The HITS Algorithm (1997)
- The concept
- A good hub points to many good authorities
- A good authority is pointed to by many good hubs
The goal: to find the best hubs and authorities (H&A) on the topic
12 The HITS Algorithm (1997)
- The algorithm has two main stages:
- The sampling stage: constructing a collection of pages in which we will search for H&A
- The weight propagation stage: assigning numerical hub and authority scores to every page
13 The sampling stage
- Creating a root set of 200 pages using AltaVista
- Expanding the root set to the base set of 2000 pages
- Deleting links between pages in the same site
The result: a subgraph G which is rich in H&A
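The link-deletion step can be sketched as a filter over (source, target) URL pairs. The function name `drop_intra_site_links` is hypothetical, and treating "same site" as "same hostname" is an assumption; the slides do not say how a site is identified.

```python
from urllib.parse import urlparse

def drop_intra_site_links(links):
    """Keep only links whose source and target are on different hosts.

    `links` is an iterable of (source_url, target_url) pairs with full
    URLs (scheme included), so that urlparse can extract the hostname.
    """
    kept = []
    for src, dst in links:
        # Assumption: two pages are on the same site if their hostnames match.
        if urlparse(src).hostname != urlparse(dst).hostname:
            kept.append((src, dst))
    return kept
```

Removing these links keeps a site's own navigation from counting as an endorsement of its own pages.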
14 The weight propagation stage
- Let a(p) be the authority weight of page p
- and h(p) be the hub weight of p
- 1. For every page p: $h(p) = 1$
- 2. Repeat k times:
- for every p: $a(p) = \sum_{q \to p} h(q)$
- for every p: $h(p) = \sum_{p \to q} a(q)$
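A minimal sketch of the two update rules, assuming the subgraph is given as a list of (p, q) pairs meaning "p links to q". The function name `hits`, the default `k=20` and the normalization after each round are assumptions; the slide does not mention normalization, which is added here only to keep the scores from growing without bound.

```python
from collections import defaultdict
import math

def hits(edges, k=20):
    """Return (authority, hub) score dictionaries after k rounds."""
    pages = {p for edge in edges for p in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for p, q in edges:
        out_links[p].add(q)
        in_links[q].add(p)

    hub = {p: 1.0 for p in pages}       # step 1: h(p) = 1
    auth = {p: 1.0 for p in pages}
    for _ in range(k):                  # step 2: repeat k times
        # a(p) = sum of h(q) over pages q that link to p
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # h(p) = sum of a(q) over pages q that p links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Assumption: rescale both vectors to unit length each round.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```

For example, with edges `[("h1", "a1"), ("h1", "a2"), ("h2", "a1")]` the page `a1` ends up as the top authority and `h1` as the top hub.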
15 Why use hubs?
- Two main reasons:
- Hubs are places from which you can start searching
- In cyberspace, competing authorities ignore each other and can be connected only by hubs
16 The ARC
- If computation were not a bottleneck, what would be the most effective search algorithm?
Automatic Resource Compiler (ARC)
17 The ARC
- ARC is a system which, given a topic that is
broad and well represented on the WEB, will seek
out and return a list of WEB resources that it
considers the most authoritative.
The goal: to compile lists similar to those provided by YAHOO or Infoseek
18 The Algorithm
- Performs a local analysis of both text and links to arrive at a global consensus on the best resources for the topic.
- Three phases:
- search and growth
- weighting
- iteration and reporting
19 Search and growth
- Collecting a root set from AltaVista using the query terms.
- Augmenting the root set twice by adding:
- - pages that point to the documents in the root set
- - pages that are pointed to by documents in the root set
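The growth step can be sketched as two rounds of neighbourhood expansion. The function name `grow` and the dictionaries `out_links` and `in_links` are hypothetical stand-ins for whatever link index a real system would query; the sketch assumes that the second round expands the already augmented set.

```python
def grow(root, out_links, in_links, rounds=2):
    """Expand `root` by adding linked-to and linking pages, `rounds` times.

    out_links[p] is the set of pages p points to;
    in_links[p] is the set of pages that point to p.
    """
    pages = set(root)
    for _ in range(rounds):
        frontier = set()
        for p in pages:
            frontier |= out_links.get(p, set())  # pages pointed to by p
            frontier |= in_links.get(p, set())   # pages that point to p
        pages |= frontier
    return pages
```

With two rounds, the result contains every page within two links of the root set, in either direction.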
20 Weighting
- The concept: the text around the anchor in a page p that links to a page q is descriptive of the contents of q.
- Let w(p,q) be a positive numerical weight that reflects the amount of topic-related text in the vicinity of the anchor.
21 Iteration and reporting
- For every page p: $h(p) = 1$
- Repeat k times:
- for every p: $a(p) = \sum_{q \to p} w(q,p)\, h(q)$
- for every p: $h(p) = \sum_{p \to q} w(p,q)\, a(q)$
- Return the best 15 hubs and the best 15 authorities
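A sketch of the weighted iteration, assuming the link weights are supplied as a dictionary mapping each link (p, q) to w(p,q). The function name `arc_iterate`, the default `k=5` and the rescaling by the maximum score are assumptions made for illustration; the slide only gives the two update rules and the size of the report.

```python
def arc_iterate(weights, k=5, top=15):
    """Return (best_hubs, best_authorities) after k weighted rounds.

    `weights` maps (p, q) -> w(p, q) for every link p -> q.
    """
    pages = {page for link in weights for page in link}
    hub = {p: 1.0 for p in pages}            # h(p) = 1
    auth = {p: 0.0 for p in pages}
    for _ in range(k):
        # a(p) = sum over links q -> p of w(q, p) * h(q)
        auth = {p: 0.0 for p in pages}
        for (q, p), w in weights.items():
            auth[p] += w * hub[q]
        # h(p) = sum over links p -> q of w(p, q) * a(q)
        hub = {p: 0.0 for p in pages}
        for (p, q), w in weights.items():
            hub[p] += w * auth[q]
        # Assumption: rescale so the largest score is 1 (keeps values bounded).
        a_max = max(auth.values()) or 1.0
        h_max = max(hub.values()) or 1.0
        auth = {p: v / a_max for p, v in auth.items()}
        hub = {p: v / h_max for p, v in hub.items()}
    best_auth = sorted(pages, key=lambda p: (-auth[p], p))[:top]
    best_hubs = sorted(pages, key=lambda p: (-hub[p], p))[:top]
    return best_hubs, best_auth
```

With weights `{("h1", "a1"): 3.0, ("h1", "a2"): 1.0, ("h2", "a2"): 1.0}`, the page `a1` ranks first among the authorities although `a2` has more incoming links, because the single link into `a1` carries more topic-related text.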
22 Computing w(p,q)
- What precisely is "vicinity"?
- Let us define the anchor window as a window of B bytes before and after the HREF
- The distance between the text "Yahoo" and the link to www.yahoo.com
B is set to 50
23 Computing w(p,q) - cont.
- How to map occurrences of descriptive text in the anchor window into weights?
- Let n(t) denote the number of matches between terms in the topic and in the anchor window.
- $w(p,q) = 1 + n(t)$
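The weight $w(p,q) = 1 + n(t)$ can be sketched as follows. The function name `anchor_weight` is hypothetical, and several simplifications are assumptions: the window is measured in characters rather than bytes, only the first occurrence of the link is examined, the URL itself is excluded from the window, and a "match" is a whole-word, case-insensitive comparison.

```python
import re

def anchor_weight(page_text, href, topic, window=50):
    """Return 1 + n(t) for the first occurrence of `href` in `page_text`."""
    pos = page_text.find(href)
    if pos < 0:
        return 1                      # no link found: base weight only
    start = max(0, pos - window)
    end = pos + len(href) + window
    # Text up to `window` characters before and after the link target.
    context = page_text[start:pos] + " " + page_text[pos + len(href):end]
    words = re.findall(r"\w+", context.lower())
    terms = set(topic.lower().split())
    # n(t): number of words in the anchor window that match a topic term.
    return 1 + sum(1 for w in words if w in terms)
```

The constant 1 ensures that a link with no topic terms nearby still contributes, just less than a link surrounded by descriptive text.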
24 Analysis
- The weighting process and the iterative computation take a few seconds
- The bottleneck: computing the root and augmented sets.
Conclusion: the system will not field 1000s of queries per second and produce answers in real time.
25 Experiments
- Well-known sources to compare with ARC: Yahoo and Infoseek
- Topics:
- - topics from the directories of YAHOO
- - 28 topics, each containing 2-3 words
- - Examples: cheese, classical guitar, Gulf War
26 Experiments - cont.
- Volunteers
- - each one evaluated two topics
- - every topic was evaluated by two persons
- The evaluation form
27 Results
- ARC's lists are almost competitive with the lists of YAHOO and Infoseek
- The combination of link and text analysis is successful!
28 New improvements (1999)
- Seeking a final set of H&A that provides good access to a wide variety of information
- Returning only a single point of entry into a WEB source
- Identifying interesting sections of WEB pages and using them to determine which other pages might be good H&A (e.g. mango fruit)
The improved system is called CLEVER
29 User study
- Comparing the top ten results of CLEVER, YAHOO and AltaVista on the same 27 topics
- 37 participants rank every page as fantastic, good, fair or bad
- Every search engine gets 1 point if its page was ranked good or fantastic
30 The results (1)
31 The results (2)
CLEVER performs better than YAHOO and Altavista
32 Summary
- HITS, ARC, CLEVER
- CLEVER performs better than YAHOO and AltaVista
- The next stage: building resource lists such as YAHOO's.
- A useful by-product of the development of CLEVER is the separation of WEB pages into clusters (communities)
33 Inferring web communities from link topology
- WWW: a decentralized, almost anarchic growth process.
- - result: a large hyperlinked corpus that lacks a traditional logical organization.
- TARGET: find order in the disorder, or extract meaningful structures from the hyperlinked environment.
34 Inferring web communities from link topology
- The notion: search for hyperlinked communities through analysis of the link topology, as a way of addressing issues such as:
- - navigation
- - information discovery
- - web sociology, etc.
35 COMMUNITIES: densely interconnected sets of hubs and authorities.
36 Inferring web communities from link topology
- The means for the search: the HITS algorithm.
- Running the algorithm on a specified subject reveals a community of hubs and authorities related to the subject.
37 Inferring web communities from link topology
- What kinds of communities are discovered by HITS?
- How do the communities discovered depend on the choice of root set?
- How quickly do the communities crystallize as the number of iterations grows?
38 How quickly do communities crystallize?
- Principal community: the set of the top 10 authorities and top 10 hubs (denoted C).
- C(R,N): the principal community obtained by running HITS for N iterations with an initial root set of R pages.
- Empirical results: most communities predictably become stable after 200 iterations and from a root set of 50.
39 How quickly do communities crystallize?
- Therefore define C = C(200,50) and run tests over different values of R and N for 6 representative topics:
- 1. Harvard 2. Cryptography 3. English literature
- 4. Skiing 5. Optimization 6. Operations research
- Look at the overlap between C(R,N) and C, and see which community converges more quickly (with smaller values of N and R).
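One simple way to quantify the overlap is the fraction of the reference community that a smaller run already recovers. This is a sketch under assumptions: the function name `overlap` is hypothetical, and the slides do not say which overlap measure was used in the experiments.

```python
def overlap(community, reference):
    """Fraction of the pages in `reference` that also appear in `community`."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(community)) / len(reference)
```

A topic whose overlap approaches 1.0 at small R and N has crystallized quickly; a topic that needs large R and N to get there has crystallized slowly.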
40 How quickly do communities crystallize?
- RESULTS
- The principal community for cryptography crystallizes very quickly (a central area in CS), but for skiing and English literature the process is remarkably slow. Topics like operations research are somewhere in between.
- How can this be explained?
41 Main observations emerging from the test analysis
- Communities on broad topics usually have quite a robust structure (relatively independent of the root-set choice).
- The success of HITS depends on the breadth of the topic and the discipline of human knowledge under which it falls (the density of hyperlinks in a discipline like CS is greater than in the academic humanities).
42 Main observations emerging from the test analysis
- IMPORTANT
- The greatest degree of orderly structure, as extracted by HITS, is found in communities for which the number of relevant pages and the density of hyperlinking are the largest!
- (crypto vs. English literature)
43 Main observations emerging from the test analysis
- The standard view:
- the web is becoming increasingly chaotic and difficult to model
- Observation:
- HITS becomes more and more effective as the size of the web continues to increase
- Consequence:
- we can make predictive statements about less-linked communities based on current experience with highly linked ones
44 The structure of communities
- Robustness
- On broad topics: stable and robust communities, despite a very small initial root set.
- Methods used to diversify root sets:
- - querying multiple search engines for the initial root set
- - using the same term in different languages
45 The structure of communities
- RESULT
- The main communities tend to recur in all the
experiments, regardless of how the root set is
constructed.
46 The structure of communities
- TOPIC GENERALIZATION
- HITS tends to generalize topics that are not sufficiently broad.
- WHAT DOES THAT MEAN?
- It means that the principal community of such a subject will be relevant to a topic which includes, but is larger than, the initial subject given to HITS.
47 The structure of communities
- TOPIC GENERALIZATION
- A sufficiently broad topic: Michael Jordan
- HITS gives good and relevant results.
- Topic: Dennis Ritchie (author of C)
- HITS results (3 top authorities):
- www.cm.cf.ac.uk/Dave/C/CE.html
- www.cyberdiem.com/vin/learn.html
- www.lysator/liu.se/c/index.html
- All pages on the C language itself!
- But where is Ritchie?
48 The structure of communities
- Swallowed up by the subject to which he most prominently belongs
- And there are more examples.
49 The structure of communities
- Generalization allows an automatic
characterization of specific subjects in terms of
their generalizations.
50 The structure of communities
- TREE OF TOPICS
- Picture generalization as occurring on a tree of topics, where the most general topics are close to the root and their children are sub-topics.
- YAHOO: an example of a hand-made searchable hierarchy.
- HITS gives us a way to collect information about such trees automatically.
51 The structure of communities
- OTHER FACTORS AFFECTING GENERALIZATION
- 1) Web-centric sub-topics
- 2) commercialization
52 The structure of communities
- Web-centric sub-topics
- Generality is determined by the representation of the topic on the WWW.
- Many pages are mostly concerned with topics that involve the web itself.
- Thus, HITS may focus on a certain community because it represents a more web-centric version of the given topic.
53 The structure of communities
- Example: for the topic "linguistics" the top authorities were:
- www.cs.columbia.edu/acl//home.html
- www.cs.columbia.edu/radev/cgi-bin/universe.cgi
- www.ling.rochester.edu/linglinks.html
- The first 2 are strong authorities for
computational linguistics, only a sub-topic of
the requested topic, but this sub-topic is more
linked on the web (more related to CS and the web
itself).
54 The structure of communities
- COMMERCIAL/ADVERTISING INFLUENCES
- For topics with both commercial and individual involvement, the authorities in the principal communities are the commercialized pages.
55 The structure of communities
- MORE OBSERVATIONS
- 1) Infiltration of pages from AltaVista, Yahoo, and www.microsoft.com into innocent communities.
- 2) Temporal influence of short-term issues.
- (e.g., the Harvard Conference on the Internet and Society was prominent in the Harvard community in 1/97 but not in 8/97).
56Conclusions
- The web is less chaotic than it seems.
- HITS gives a convenient way to analyze the web's link topology and to reveal the hyperlinked communities on the web, which appear to span a wide range of interests and disciplines.
- Using HITS, one can study the sociology of the web and get a global understanding of how it is being constructed and how it behaves.
58Focused Crawling
- Two forces are shaping the future of the web:
- 1. The exploding volume of the WWW.
- 2. A growing mass of users who use the web for serious research.
- "Small is beautiful":
- specialized search portals vs. one-size-fits-all portals (AltaVista)
59Focused Crawling
- The question
- can a focused portal be built automatically?
- The answer
- yes! using a focused crawler.
60Focused Crawling
- The focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the web.
- For serious web users, focused portholes are more useful than generic portals.
61Focused Crawling
- Operation synopsis
- The user sets an initial root set of good example pages (using an existing taxonomy as a base).
- Crawling proceeds from the root set using two disciplines:
- - scoring the relevance of each newly reached page to the initial group (classifying).
- - estimating the benefit of crawling out along the page's out-links (distillation).
- Combining the power of content and links.
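The crawl loop described above can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the authors' system: the keyword-overlap `relevance` function stands in for the trained classifier, letting out-links inherit the score of the page that cites them is a much simpler stand-in for the distillation step, and the in-memory `web` dictionary and all page names are hypothetical.

```python
import heapq


def relevance(text, topic_terms):
    """Stand-in classifier: fraction of topic terms that occur in the text."""
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


def focused_crawl(web, root_set, topic_terms, max_fetches=10, threshold=0.5):
    """Best-first crawl over `web`, a dict mapping url -> (text, out_links)."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in root_set]
    heapq.heapify(frontier)
    seen = set(root_set)
    fetched, relevant = [], []
    while frontier and len(fetched) < max_fetches:
        _, url = heapq.heappop(frontier)
        text, out_links = web[url]
        fetched.append(url)
        # Content step: score the fetched page against the topic.
        score = relevance(text, topic_terms)
        if score >= threshold:
            relevant.append(url)
        # Link step (simplified): out-links inherit the citing page's score,
        # so links found on off-topic pages sink to the back of the queue.
        for link in out_links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return fetched, relevant


# Hypothetical five-page web: url -> (page text, out-links).
web = {
    "root": ("cycling bike routes", ["shop", "news"]),
    "shop": ("bike shop cycling gear", ["race"]),
    "news": ("stock market news", ["finance"]),
    "race": ("cycling race results", []),
    "finance": ("bank loans", []),
}

fetched, relevant = focused_crawl(web, ["root"], ["cycling", "bike"],
                                  max_fetches=4)
print("fetched: ", fetched)
print("relevant:", relevant)
```

With a budget of four fetches, the page reachable only through the off-topic `news` page is never fetched, because on-topic pages keep pushing better-scored links onto the frontier.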
63Focused Crawling
- Important notions
- - user feedback
- - supervised learning
- - high harvest rate (the fraction of page fetches that are relevant to the user's interest)
- - keeping focused on the subject
- Example: cycling
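The harvest rate defined above is a simple ratio. A minimal sketch, assuming each fetch has already been judged relevant or not (the function name and the sample judgments are hypothetical):

```python
def harvest_rate(fetch_is_relevant):
    """Fraction of page fetches judged relevant to the user's interest."""
    if not fetch_is_relevant:
        return 0.0  # no fetches yet, so nothing has been harvested
    return sum(fetch_is_relevant) / len(fetch_is_relevant)


# Hypothetical crawl log: one relevance judgment per fetched page.
judgments = [True, False, True, True]
rate = harvest_rate(judgments)
print(f"harvest rate: {rate:.2f}")
```

A crawler that stays on the subject keeps this ratio high as the number of fetches grows.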
65Focused Crawling - summary
- A learning-from-examples mechanism.
- Efficient, on-topic crawling that ignores irrelevant segments of the web.
- Reaches more distant relevant segments of the web as the search goes deeper.
- Answers the need for specialized, filtered web libraries.
66Publications
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text.
- D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web Communities from Link Topologies.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. HyperSearching the Web.
- S. Chakrabarti, M. Van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic-Specific Resource Discovery.
67References
- http://decweb.ethz.ch/WWW7/1898/com1898.html
- http://www.almaden.ibm.com/cs/k53/abstract.html
- http://www.almaden.ibm.com/cs/k53/clever.html
- http://www.cs.berkeley.edu/soumen/0699raghavan.html
- http://www.cs.berkeley.edu/soumen/doc/www99focus/html