1
Search In Small World Networks
  • Presented by George Frederick

2
Searching the World Wide Web
  • Steve Lawrence
  • C. Lee Giles

3
Overview
  • Research performed in 1998, so results are dated
  • The Web is constantly growing and changing,
    making size estimation difficult
  • Search engine companies claimed they could keep
    up
  • Tested actual search engine coverage

4
Overview
  • Typical coverage tests performed by checking
    number of results returned
  • Algorithm may not require exact search term
    matches (related terms)
  • Documents may no longer exist, meaning engines
    with stale data have an advantage
  • Documents may have been altered, changing their
    relevance

5
Market Share
  • Selberg and Etzioni attempted to calculate
    market share of search engines using their
    MetaCrawler aggregate search service
  • Calculated as percentage of documents users
    followed through from each search engine

6
Market Share
Market share according to Selberg and Etzioni (1997)
7
Market Share
  • Drawbacks to calculation method
  • Difficult for users to determine relevance
    without clicking through and examining pages
    first
  • User relevance judgments are biased by
    presentation order

8
Web Coverage
  • Selberg and Etzioni also attempted to calculate
    Web coverage for each search engine
  • Flawed because MetaCrawler only retrieves first
    few results from each engine
  • May return unique documents for the first few but the rest are the same
  • May return the same documents for the first few but the rest are unique

9
Web Coverage
  • Lawrence and Giles made their own attempt at calculating search engine Web coverage
  • Analyzed AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light
  • Google was not publicly available at the time of the study
  • Commonly believed that each engine indexed roughly the same material and covered most of the Web

10
Data Gathering
  • Collected search engines' responses to queries made by NEC Research Institute employees over several days
  • Retrieved all matching indices and the corresponding documents for each query in order to count them
  • All indices were needed to avoid bias from result ranking
  • All documents were needed to verify they still existed and had not been altered since indexing

11
Data Gathering
  • Duplicate documents not counted twice, even with
    different URLs
  • Only lowercase queries considered
  • Pages not displayed within a minute were not
    counted
  • Considered only queries returning up to 600
    results collectively
  • Only exact search-term matches were counted
  • 575 queries were analyzed in total (a sketch of the counting rules follows below)
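A minimal sketch of the counting rules described above, assuming hypothetical inputs: `results` as (url, document_text) pairs already fetched, and `query_terms` as the lowercase search terms. Document fetching, the one-minute timeout, and the 600-result cap are omitted.

```python
import hashlib

def count_valid_results(results, query_terms):
    """Count documents per the study's rules: exact (lowercase) term
    matches only, and duplicate contents counted once even when they
    appear under different URLs."""
    seen_hashes = set()
    count = 0
    for url, text in results:
        lowered = text.lower()
        # Only exact matches of every search term qualify
        if not all(term in lowered for term in query_terms):
            continue
        # Deduplicate by content, not by URL
        digest = hashlib.sha1(lowered.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        count += 1
    return count
```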

12
Sources of Bias
  • Size of the Web was estimated from search engine coverage overlap
  • Biased because not all pages can be indexed
  • The set of all pages that search engines can index is referred to as the indexable Web

13
Sources of Bias
  • Pages are often manually registered with several different search engines, meaning indices are not collected randomly
  • Pages that are highly linked to by other pages
    are more likely to be indexed

14
Methodology
  • Authors expected that larger engines have lower
    dependence
  • Don't rely as heavily on user-submitted indices
  • Can crawl and find less popular pages
  • Assumption is that the larger an engine is, the
    more accurate an estimate it can provide of the
    size of the Web

15
Methodology
  • Analyzed the overlap between the two largest engines (AltaVista and HotBot); a sketch of the estimator appears below
  • Estimated that the indexable Web has a lower bound of 320 million pages
  • Common earlier estimates ranged from only 75 -
    200 million
  • Authors assert that these were significant
    underestimations
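The overlap analysis is, in spirit, a capture-recapture estimate: if two engines sampled pages independently, the size of the whole follows from how much their indices intersect. A minimal sketch; the index sets in the usage line are tiny hypothetical stand-ins, chosen only so the toy numbers echo the 320-million figure.

```python
def estimate_web_size(index_a, index_b):
    """Capture-recapture lower bound on the indexable Web.

    Assumes the two engines index pages independently, so
    size ~= |A| * |B| / |A intersect B|.
    """
    overlap = len(index_a & index_b)
    if overlap == 0:
        raise ValueError("no overlap; estimate undefined")
    return len(index_a) * len(index_b) // overlap

# Hypothetical indices (think "millions of pages"): |A| = 100, |B| = 80,
# overlap = 25  ->  estimate of 320
print(estimate_web_size(set(range(100)), set(range(75, 155))))
```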

16
Size Estimates
Estimated indexable Web size
17
Size Estimates
Percentage of indexable Web each search engine
covers
18
Additional Analysis
  • Percentage of invalid links

19
Additional Analysis
  • Median age of documents
  • Results suggest that engines with the most recent pages don't necessarily have the best coverage
  • Tradeoff between database size and update
    frequency

20
Results
  • Search engine coverage varies by an order of
    magnitude
  • Indexable Web estimated to have lower bound of
    320 million pages
  • Engines index only a fraction of the Web
  • Individual engines each covered between 3% and 34% of the indexable Web

21
Conclusion
  • Combining results of multiple search engines
    significantly increases number of unique results
  • Search aggregators would significantly increase
    quality of search results

22
The Small-World Phenomenon and Decentralized
Search
  • Jon Kleinberg

23
Small World Search
  • Related to gossip algorithms in that each node works with only local knowledge yet exhibits emergent global behavior
  • Watts-Strogatz model involves a d-dimensional lattice with uniformly random shortcuts
  • It can be proven that no decentralized search in this model can find short paths with only local knowledge

24
Small World Search
  • A subtle variation involves shortcuts whose probability decays like the inverse dth power of their distance (in d dimensions), and this variation supports efficient search
  • i.e. a node is approximately as likely to create shortcuts at distances 1 to 10 as at distances 10 to 100, 100 to 1000, etc. (see the sketch below)
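A minimal sketch of this idea on a one-dimensional ring (d = 1): each node gets one shortcut whose distance is drawn with probability proportional to 1/distance, and messages are forwarded greedily. The ring size and the single-shortcut choice are illustrative assumptions, not the paper's exact setup.

```python
import random

N = 10_000                          # ring size (illustrative)
DISTS = range(1, N // 2 + 1)
WEIGHTS = [1.0 / d for d in DISTS]  # P(d) proportional to d^-1: inverse dth power, d = 1

def ring_dist(a, b):
    """Lattice (ring) distance between nodes a and b."""
    d = abs(a - b) % N
    return min(d, N - d)

def sample_shortcut(u):
    """One long-range contact for u, at a distance drawn from the
    inverse first-power distribution."""
    d = random.choices(DISTS, WEIGHTS)[0]
    return (u + random.choice((-d, d))) % N

shortcuts = {u: sample_shortcut(u) for u in range(N)}

def greedy_route(s, t):
    """Decentralized greedy routing: forward to whichever known contact
    (ring neighbors or own shortcut) is lattice-closest to the target."""
    u, hops = s, 0
    while u != t:
        u = min([(u - 1) % N, (u + 1) % N, shortcuts[u]],
                key=lambda v: ring_dist(v, t))
        hops += 1
    return hops

print(greedy_route(0, N // 2))  # O(log^2 N) hops expected at this decay rate
```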

25
Small World Search
A node with several random shortcuts spanning
different distance scales
26
Small World Search
  • Construction of networks based on this variation of the Watts-Strogatz model has been successfully employed in peer-to-peer file-sharing systems and on the Internet

27
Identity and Search In Social Networks
  • Duncan J. Watts
  • Peter Sheridan Dodds
  • M. E. J. Newman

28
Overview
  • Proposes a model that defines a class of searchable networks, and a method for searching them that is applicable to many network search problems

29
Searchability
  • Searchability is defined as the property of being
    able to find a target quickly
  • Searchability has been shown to exist in
    scale-free and lattice networks, but neither is a
    satisfactory model of society

30
Social Network Model
  • Authors assert that proposed model is based on
    plausible social structures
  • Follows naturally from six contentions about
    social networks

31
1. Identities
  • Nodes in social networks have identities in
    addition to relationships
  • Identities are defined as sets of characteristics that individuals attribute to themselves and others through their association in social groups
  • A group is defined as a collection of nodes with
    a well-defined set of social characteristics

32
2. Hierarchical View
  • Individuals break the world down into a hierarchy of increasingly specific layers
  • Top layer is the world
  • Bottom layer is the individual
  • In practice, individuals don't usually go all the way down but stop at a cognitively manageable level
  • A reasonable upper bound on group size is g ≈ 100

33
2. Hierarchical View
  • The similarity x_ij between individuals i and j is the height of their lowest common ancestor in this hierarchy
  • x_ij = 1 if i and j are in the same group
  • Hierarchies are defined to have depth l and branching ratio b
  • The hierarchy is a purely cognitive construct for measuring social distance, not an actual network

34
3. Homophily
  • The more similar individuals are, the more likely it is that they know each other
  • To construct the network, randomly choose a node i and a link distance x with probability p(x) = c·exp(−αx)
  • α is a tunable parameter (a measure of homophily)
  • c is a normalizing constant

35
3. Homophily
  • Choose a second node j uniformly from all nodes at distance x from i
  • Repeat until the nodes have an average of z friends (a construction sketch follows below)
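A minimal sketch of this construction for a single hierarchy (H = 1). The parameter values for b, l, g, α, and z are illustrative assumptions, not the paper's experimental settings; the `similarity` function implements the x_ij of the previous slides.

```python
import math
import random
from collections import defaultdict

# Illustrative parameters
b, l, g = 2, 8, 10      # branching ratio, hierarchy depth, group size
alpha, z = 1.0, 6       # homophily parameter, target mean degree
n_groups = b ** (l - 1)
N = n_groups * g        # one group of g individuals per leaf

def similarity(i, j):
    """x_ij: level of the lowest common ancestor of i's and j's groups
    in the b-ary hierarchy (x_ij = 1 means same group)."""
    gi, gj = i // g, j // g
    x = 1
    while gi != gj:      # walk both groups up the tree until they meet
        gi, gj = gi // b, gj // b
        x += 1
    return x

def sample_neighbor(i):
    """Pick a distance x with p(x) = c * exp(-alpha * x) (random.choices
    normalizes, so c is implicit), then a node uniformly at that distance."""
    weights = [math.exp(-alpha * x) for x in range(1, l + 1)]
    x = random.choices(range(1, l + 1), weights)[0]
    candidates = [j for j in range(N) if j != i and similarity(i, j) == x]
    return random.choice(candidates) if candidates else None

friends = defaultdict(set)
while sum(len(f) for f in friends.values()) < z * N:  # until mean degree z
    i = random.randrange(N)
    j = sample_neighbor(i)
    if j is not None:
        friends[i].add(j)
        friends[j].add(i)
```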

36
3. Homophily
  • When e^(−α) ≪ 1, all links will be as short as possible, meaning individuals only have connections to those most similar to themselves, forming many isolated cliques
  • When e^(−α) = b, individuals are equally likely to be linked with any other individual, resulting in a uniform random graph

37
4. Multiple Hierarchies
  • Individuals split the world into multiple hierarchies depending on context, e.g. geography or occupation
  • These represent different societal dimensions
  • A node's identity is defined as an H-dimensional coordinate vector v_i, where v_i^h is the position of node i in the hth hierarchy/dimension

38
4. Multiple Hierarchies
  • Each node i is randomly assigned a coordinate in each of the H hierarchies and allocated friends as previously described, randomly choosing a hierarchy h for each link
  • When H = 1 and e^(−α) ≪ 1, the network consists of isolated cliques, so the link density must obey the constraint z < g

39
5. Social Distance
  • Individuals have their own perception of social distance: y_ij = min_h x_ij^h
  • Close proximity in just one hierarchy is sufficient to connote affiliation
  • This violates the triangle inequality: individuals i and j can be close in one hierarchy and individuals j and k close in another, yet i and k may still be far apart in both hierarchies (see the example below)
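A tiny numeric illustration of the violation, using made-up similarity values x_ij^h in H = 2 hierarchies:

```python
# Hypothetical similarities in two hierarchies: x[(i, j)] = (x_ij^1, x_ij^2)
x = {
    ("i", "j"): (1, 6),  # i and j are close in hierarchy 1 only
    ("j", "k"): (6, 1),  # j and k are close in hierarchy 2 only
    ("i", "k"): (6, 6),  # i and k are far apart in both
}

def social_distance(pair):
    """Perceived distance y_ij = min over hierarchies of x_ij^h."""
    return min(x[pair])

print(social_distance(("i", "j")))  # 1
print(social_distance(("j", "k")))  # 1
print(social_distance(("i", "k")))  # 6 > 1 + 1: triangle inequality fails
```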

40
6. Local Knowledge
  • Individuals have only local information
  • Their own coordinate vector v_i
  • Their neighbors' coordinate vectors v_j
  • The target's coordinate vector v_t
  • Social distances and network paths are both known
  • Neither alone is sufficient for efficient searching
  • Combined, the two are

41
6. Local Knowledge
  • Following Milgram's experiment, each node forwards the message to the one neighbor j it perceives to be closest to the target t (greedy forwarding; sketched below)
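A minimal sketch of this forwarding rule with Milgram-style attrition. Here `friends` is any adjacency mapping (such as the one built in the construction sketch above) and `distance(j, t)` is a caller-supplied perceived distance y_jt; both are hypothetical stand-ins.

```python
import random

def greedy_search(s, t, friends, distance, p_fail=0.25, max_hops=100):
    """Greedy search: each message holder forwards to the neighbor
    perceived closest to the target. The chain terminates unsuccessfully
    with probability p_fail at every hop (Milgram-style attrition).
    Returns the chain length on success, or None on failure."""
    u, hops = s, 0
    while u != t and hops < max_hops:
        if random.random() < p_fail:  # message abandoned at this hop
            return None
        if not friends[u]:
            return None
        u = min(friends[u], key=lambda j: distance(j, t))
        hops += 1
    return hops if u == t else None
```

Run over many random (s, t) pairs, the fraction of chains that complete estimates the delivery probability q discussed on the next slides.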

42
Modeling Milgram
  • Principal objective is to determine whether the average length ⟨L⟩ of a message chain between a randomly selected sender s and target t is small
  • Small has previously been defined to mean that ⟨L⟩ grows slowly with population size N
  • That definition is insufficient here, as the probability of chain termination at each hop is p = 0.25
  • An absolute bound is required

43
Modeling Milgram
  • A searchable network is defined as one in which the probability q of successful delivery is at least some fixed value r
  • In terms of chain length, the authors formally require q = ⟨(1−p)^L⟩ ≥ r
  • Equivalently, the maximum required chain length is ⟨L⟩ ≲ ln r / ln(1−p)
  • For experimental purposes, the authors set r = 0.05 and p = 0.25, requiring ⟨L⟩ ≲ 10.4 regardless of population size N (checked numerically below)
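A quick numeric check of that bound; this is plain arithmetic, assuming only the per-hop attrition model above.

```python
import math

# Searchability condition: q = <(1-p)^L> >= r  implies  <L> <= ln r / ln(1-p)
r, p = 0.05, 0.25
print(math.log(r) / math.log(1 - p))  # ~10.4 hops, independent of N
```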

44
Conclusion
  • By setting the parameters in accordance with the six sociological contentions, the authors are able to recreate Milgram's results in simulation
  • The concept is general enough to be applied to many types of networks beyond social ones, such as peer-to-peer systems, the Web, and citation networks

45
Conclusion
  • The multi-dimensional aspect of the model makes
    decentralized database organization and searching
    more efficient with simple, greedy algorithms

46
Papers
  • S. Lawrence and C. L. Giles. Searching the World Wide Web. Science, 280(5360):98-100, 1998. http://citeseer.ist.psu.edu/lawrence98searching.html
  • J. Kleinberg. The Small-World Phenomenon and Decentralized Search. SIAM News, 37(3), April 2004.
  • D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and Search in Social Networks. Science, 296(5571):1302-1305, 2002.