Title: Information Retrieval Techniques For Peer-To-Peer Networks
1Information Retrieval Techniques For Peer-To-Peer Networks
- Demetrios Zeinalipour-Yazti, Vana Kalogeraki, and Dimitrios Gunopulos
- Presented By Ranjan Dash
2Layout
- Introduction
- P2P Network IR Techniques
- PeerWare Infrastructure and experiments
3Introduction
- Major challenge
- Efficiently searching the content of other peers
- Definition
- A large number of peers collaborate dynamically, in an ad hoc manner, and share information in large-scale distributed environments without centralized coordination
- P2P environment characteristics
- Each peer has a database or collection of documents
- A query contains a set of keywords
- A reply message contains pointers to matching documents
- Different from static data environments
- No central repository
- Nodes join and leave dynamically, in an ad hoc manner
4P2P Network IR Techniques
- P2P Network IR Techniques
- Breadth-First Search (BFS)
- Random Breadth-First-Search (RBFS)
- Intelligent Search Mechanism (ISM)
- Directed BFS and >RES
- Random Walker Searches
- Randomized Gossiping
- Local Routing Indices
- Centralized Approaches
- Searching Object Identifiers
- Distributed IR
5P2P Network IR Techniques
- Breadth-First Search (BFS)
- Widely used in file-sharing systems
- Each peer propagates the query to all neighbors except the sender
- The QueryHit message (number of docs, bandwidth info) follows the same path back
- Simple, guarantees a high hit rate
- Poor performance and network utilization
- A low-bandwidth node becomes a bottleneck
- Can be improved with a time-to-live (TTL) limit on each query
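The flooding behaviour described above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the Gnutella implementation: the function name `bfs_flood`, the adjacency-list `overlay`, and the rule that a peer drops duplicate queries are all illustrative choices.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent).

    Assumption: a peer that has already seen the query drops duplicates.
    """
    seen = {source}
    messages = 0
    frontier = deque([(source, None, ttl)])  # (peer, sender, remaining TTL)
    while frontier:
        peer, sender, hops_left = frontier.popleft()
        if hops_left == 0:
            continue  # TTL expired: do not forward further
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue  # never send the query back to the sender
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, hops_left - 1))
    return seen, messages

# Hypothetical 5-peer overlay, used only for illustration.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
reached, sent = bfs_flood(overlay, "A", ttl=2)
# With ttl=2 the query stops at D, so E is never contacted.
```

Raising `ttl` to 3 reaches every peer in this toy overlay but costs more messages, which is the trade-off the TTL is meant to control.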
6P2P Network IR Techniques
- Random Breadth-First Search (RBFS)
- Dramatic improvement over BFS
- Forwards the query only to a fraction of a peer's neighbors, selected at random
- Needs no global knowledge; each peer makes local decisions, so forwarding is fast
- Probabilistic: the query might not reach some large network segments
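The per-peer forwarding decision in RBFS might look like the sketch below. The function name `rbfs_targets`, the default fraction of 0.5, and rounding up so that at least one neighbor is chosen are assumptions made for illustration.

```python
import math
import random

def rbfs_targets(neighbors, sender, fraction=0.5, rng=random):
    """Pick a random fraction of neighbors (excluding the sender) to forward to."""
    candidates = [n for n in neighbors if n != sender]
    if not candidates:
        return []
    count = math.ceil(fraction * len(candidates))  # assumption: round up
    return rng.sample(candidates, count)

# Example: a peer with five neighbors that received the query from "A".
targets = rbfs_targets(["A", "B", "C", "D", "E"], sender="A",
                       fraction=0.5, rng=random.Random(0))
```

Because the choice uses only the local neighbor list, no peer needs a view of the whole network, but nothing guarantees that a given network segment is ever selected.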
7P2P Network IR Techniques
- Intelligent Search Mechanism (ISM)
- Quick and efficient, with low communication cost
- Propagates the query only to peers that are more likely to reply
- Consists of 2 components that run in each peer
- Profile mechanism
- Relevance rank
- Works well when queries exhibit locality
- Always forwards to the same neighbors, which starves new peers
- Solution: add a small random subset of peers to the most relevant set
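The starvation fix can be illustrated as follows: take the top-ranked neighbors, then add a few picked at random from the rest. This is only a sketch; `ism_targets`, the `ranks` mapping, and the `num_random` parameter are hypothetical names, and the rank values are assumed to come from the relevance rank component.

```python
import random

def ism_targets(ranks, k, num_random=1, rng=random):
    """ranks: neighbor -> relevance rank for the current query."""
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    chosen = ordered[:k]   # the k most relevant neighbors
    rest = ordered[k:]     # neighbors that would otherwise be starved
    chosen += rng.sample(rest, min(num_random, len(rest)))
    return chosen

picked = ism_targets({"B": 5, "C": 1, "D": 0, "E": 3}, k=2,
                     num_random=1, rng=random.Random(1))
```

Here `picked` always starts with "B" and "E", followed by one of the low-ranked peers, which gives newly joined neighbors a chance to build up a profile.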
8P2P Network IR Techniques
- Profile mechanism
- Each peer builds a profile for each of its neighboring peers
- Maintains the T most recent Queries and QueryHits, with the number of results
- Uses a least-recently-used (LRU) replacement policy to keep the most recent queries
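A profile with LRU replacement can be sketched with an ordered dictionary. The class name `PeerProfile`, the `capacity` parameter (standing in for T), and storing a query as a set of keywords are assumptions for illustration.

```python
from collections import OrderedDict

class PeerProfile:
    """Keeps the most recent queries answered by one neighbor, up to `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query keywords -> number of results

    def record_hit(self, query, num_results):
        key = frozenset(query)
        self.entries[key] = num_results
        self.entries.move_to_end(key)          # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used

profile = PeerProfile(capacity=2)
profile.record_hit(["p2p", "search"], 10)
profile.record_hit(["jazz"], 4)
profile.record_hit(["bloom", "filter"], 7)  # evicts the oldest entry
```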
9P2P Network IR Techniques
- Relevance rank
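- Ranks a peer's neighbors to decide which ones should receive a query
- Ranking of a peer Pi for a query q: R(Pi, q) = sum over past queries qj answered by Pi of Qsim(qj, q)^α * S(Pi, qj)
- S(Pi, qj) is the number of results Pi returned for qj
- Qsim is the cosine similarity between two queries
- With α = 0, only the number of results returned in the past matters, as in >RES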
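The sketch below shows one way to compute such a rank. It assumes the rank is the sum, over past queries a neighbor answered, of the query similarity raised to a power α and multiplied by the number of results returned; the names `qsim`, `relevance_rank`, and `alpha`, and the use of raw keyword counts as vectors, are illustrative assumptions.

```python
import math
from collections import Counter

def qsim(query_a, query_b):
    """Cosine similarity between two keyword queries (term-count vectors)."""
    va, vb = Counter(query_a), Counter(query_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values()) *
                     sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def relevance_rank(profile, query, alpha=1.0):
    """profile: list of (past_query, num_results) pairs for one neighbor."""
    return sum(qsim(past, query) ** alpha * results
               for past, results in profile)

# Hypothetical profile of one neighbor.
history = [(["p2p", "search"], 10), (["jazz", "music"], 4)]
similar = relevance_rank(history, ["p2p", "search"], alpha=1.0)
counts_only = relevance_rank(history, ["p2p", "search"], alpha=0)
```

With `alpha=0` every past query gets weight 1 (Python evaluates `0.0 ** 0` as 1.0), so the rank is just the total number of past results.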
10P2P Network IR Techniques
- Directed BFS and >RES
- Forwards a query to a subset of a peer's neighbors based on some aggregated statistics
- >RES sends the query to the k peers that returned the most results for the last m queries
- For k = 1, m = 10, the BFS turns into a DFS
- Similar to ISM, but simpler
- Does not seek out nodes that contain content related to the query
- Performs well because it routes queries to the larger network segments
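The selection rule can be written as a short function. This is a sketch under assumptions: `most_results_targets` and the `history` layout (one result count per past query, oldest first) are hypothetical, and `m` is assumed to be at least 1.

```python
def most_results_targets(history, k, m):
    """history: neighbor -> list of result counts, oldest first (m >= 1)."""
    totals = {peer: sum(counts[-m:]) for peer, counts in history.items()}
    return sorted(totals, key=totals.get, reverse=True)[:k]

# Hypothetical result counts for three neighbors over three past queries.
past = {"B": [0, 5, 1], "C": [9, 0, 0], "D": [2, 2, 2]}
recent_best = most_results_targets(past, k=2, m=2)
long_memory_best = most_results_targets(past, k=1, m=3)
```

The rule looks only at how many results a neighbor returned, not at what the queries were about, which is why it is simpler than ISM.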
11P2P Network IR Techniques
- Random Walker Searches
- Each node forwards a query message, called a walker, to one randomly chosen neighbor
- Can be extended from 1 walker to k walkers
- Resembles RBFS, but the number of messages increases linearly
- Like RBFS, does not use the most relevant content to guide the query
- Adaptive Probabilistic Search (APS) is similar
- Uses feedback from previous searches to probabilistically guide future walkers
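A k-walker search can be simulated as below. The function name `k_walkers`, the `max_steps` cap, the `has_match` callback, and the rule that a walker stops at its first match are assumptions for illustration; every peer is assumed to have at least one neighbor.

```python
import random

def k_walkers(graph, source, k, max_steps, has_match, rng=random):
    """Launch k independent walkers; return (peers with a match, messages sent)."""
    found, messages = set(), 0
    for _ in range(k):
        peer = source
        for _ in range(max_steps):
            peer = rng.choice(graph[peer])  # forward to one random neighbor
            messages += 1
            if has_match(peer):
                found.add(peer)
                break  # assumption: a walker stops at its first match
    return found, messages

# Two-peer overlay where only "B" holds a matching document.
pair = {"A": ["B"], "B": ["A"]}
hits, cost = k_walkers(pair, "A", k=3, max_steps=5,
                       has_match=lambda p: p == "B")
```

The message count can never exceed `k * max_steps`, which is the linear growth that distinguishes walkers from flooding.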
12P2P Network IR Techniques
- Randomized Gossiping (PlanetP)
- Global inverted index, partially constructed by each node
- Each node summarizes its local index in a Bloom filter
- Propagates it to the rest of the network through gossiping
- Advantages of the Bloom filter
- Smaller messages
- Savings in network I/O
- Scalability is a problem for PlanetP
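To make the Bloom filter idea concrete, here is a minimal sketch of a filter that summarizes the terms in a node's local index. The sizes (`num_bits`, `num_hashes`) and the SHA-256 based hashing are arbitrary assumptions, not PlanetP's actual parameters.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive several bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

summary = BloomFilter()
for word in ["gossip", "peer", "index"]:
    summary.add(word)
```

A node gossips the fixed-size bit array instead of its full term list, which is where the smaller messages come from; the price is an occasional false positive.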
13P2P Network IR Techniques
- Local Routing Indices
- By Arturo Crespo and Hector Garcia-Molina
- Hybrid technique that uses local indices containing the direction toward the documents
- 3 techniques
- Compound routing indices (CRI)
- Hop-count routing index (HRI)
- Exponentially aggregated index (ERI)
- Good for topologies where only a few nodes have very large numbers of neighbors (tree, tree with cycles)
- The routing indices are similar to the routing tables deployed in the Bellman-Ford algorithm
- CRI: a node q maintains statistics for each neighbor that indicate how many documents are reachable through that neighbor
- HRI: a CRI kept for each of k hops; the storage cost is prohibitive for large k
- ERI: addresses the storage issue of HRI by aggregating the HRI using a cost formula
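A compound routing index can be used to rank neighbors for a multi-topic query. The sketch below assumes one common estimator: the total number of documents reachable through a neighbor, multiplied by the fraction on each requested topic. The function name `rank_neighbors` and the index layout are hypothetical.

```python
def rank_neighbors(cri, topics):
    """cri: neighbor -> (total_docs, {topic: docs_on_topic}) reachable via it."""
    def goodness(entry):
        total, per_topic = entry
        if total == 0:
            return 0.0
        estimate = float(total)
        for topic in topics:
            # Assumption: topics are treated as independent of each other.
            estimate *= per_topic.get(topic, 0) / total
        return estimate
    return sorted(cri, key=lambda n: goodness(cri[n]), reverse=True)

# Hypothetical index held by one node for its three neighbors.
index = {
    "B": (100, {"db": 20, "net": 10}),
    "C": (1000, {"db": 0, "net": 300}),
    "D": (200, {"db": 100, "net": 150}),
}
order = rank_neighbors(index, ["db", "net"])
```

In this example "D" ranks first with an estimate of 75 documents, "B" follows with 2, and "C" comes last because it has nothing on one of the topics.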
14P2P Network IR Techniques
- Centralized Approaches
- Maintain an inverted index over all the documents in the participating hosts' collections
- Examples: Google, Yahoo, Napster
- Each joining peer A uploads an index of all its shared documents to the central repository R
- A querying node B searches A's documents through R
- B can then communicate with A directly (using an out-of-band protocol such as HTTP)
- Kazaa is slightly different: it uses a set of more powerful peers that act as central repositories
- A different kind of animal than the rest
- Simple, robust, shorter search time, guaranteed to find all results
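The upload-then-search flow can be sketched with a small inverted index. The class `CentralIndex` and its method names are hypothetical, and a match is assumed to mean that a document contains every query keyword.

```python
from collections import defaultdict

class CentralIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> {(peer, document)}

    def upload(self, peer, documents):
        """documents: document name -> iterable of keywords."""
        for name, keywords in documents.items():
            for word in keywords:
                self.postings[word].add((peer, name))

    def search(self, keywords):
        """Return (peer, document) pairs that match every keyword."""
        sets = [self.postings.get(w, set()) for w in keywords]
        return set.intersection(*sets) if sets else set()

repository = CentralIndex()
repository.upload("A", {"intro.txt": ["p2p", "search"], "notes.txt": ["p2p"]})
matches = repository.search(["p2p", "search"])
```

The result tells the querying node which peer holds each document, after which the transfer itself happens directly between the two peers.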
15P2P Network IR Techniques
- Searching Object Identifiers
- Distributed file indexing systems: Chord, OceanStore, Content Addressable Network (CAN), and Freenet
- Efficient searches using object identifiers (a hash code of the file name) rather than keywords
- Perform object lookup operations to get the address (an IP address) of the node that stores the object
- Optimize object retrieval by minimizing the number of messages and hops required
- Disadvantage: can only search for object identifiers and thus can't capture the relevance of the document
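Identifier-based lookup can be illustrated with a hash ring, loosely in the style of Chord. This sketch skips the routing hops entirely and only shows which node is responsible for a key; `object_id`, `responsible_node`, the 16-bit identifier space, and the use of SHA-1 are assumptions.

```python
import hashlib
from bisect import bisect_left

def object_id(name, bits=16):
    """Hash a file name into a fixed-size identifier space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

def responsible_node(node_ids, key):
    """Return the first node ID at or after `key` on the ring, wrapping around."""
    ring = sorted(node_ids)
    index = bisect_left(ring, key)
    return ring[index % len(ring)]

# Hypothetical node identifiers on the ring.
nodes = [10, 200, 4000]
owner = responsible_node(nodes, object_id("song.mp3"))
```

Because the identifier is a hash of the exact name, a slightly different name maps to an unrelated key, which is why this scheme cannot rank documents by relevance.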
16P2P Network IR Techniques
- Distributed IR
- With distributed databases, the main IR problem is deciding which databases are most likely to contain the most relevant documents
- It is possible to achieve good results for conceptually separated collections
- However, the assumption is that the querying party has some statistical knowledge about each database's contents (word frequencies in documents) and therefore must have a global view of the system
17PeerWare Infrastructure and experiments
- Evaluation metrics
- Recall rate: the fraction of documents each of the search mechanisms retrieves
- Efficiency: the number of messages needed to find the results
- Implemented only algorithms that require local knowledge when searching for documents
- BFS (the baseline)
- RBFS, >RES (k = 0.5d and m = 100, where d is the degree of a node), and ISM
- These 3 techniques forward query messages to half the neighbors that BFS contacts
- >RES and ISM use previous knowledge to decide which peers to forward the query to
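The two metrics can be computed as in the sketch below. The helper names are hypothetical, and the choice of reference set for recall (for example, the documents the BFS baseline finds) is an assumption rather than something the slides state.

```python
def recall_rate(retrieved, reference):
    """Fraction of the reference documents that a technique retrieved."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(retrieved) & reference) / len(reference)

def message_ratio(messages, baseline_messages):
    """Messages used by a technique relative to the baseline."""
    return messages / baseline_messages

# Made-up numbers, for illustration only.
recall = recall_rate(["d1", "d2", "d3"], ["d1", "d2", "d3", "d4"])
ratio = message_ratio(400, 1000)
```

In this made-up example the technique finds 75 percent of the reference documents using 40 percent of the baseline's messages.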
18PeerWare Infrastructure and experiments
BFS requires almost 2.5 times as many messages as
its competitors.
19PeerWare Infrastructure and experiments
ISM found the most documents. ISM achieved almost
a 90-percent recall rate while using only 38
percent of the messages BFS required. ISM
improves its knowledge over time. Both >RES and
ISM started out with a low recall rate (around 40
to 50 percent) because they initially choose
their neighbors at random.