Title: P2P Databases
1P2P Databases
2Overview
- 0. Data objects, pointers (URLs), and attributes
- 1. Freeform versus structured attribute data
- 2. Centralized indices for attribute data and pointers (e.g., Napster)
- 3. Query by flooding (e.g., Gnutella)
- 4. DHTs (e.g., Chord)
- 5. Problems with DHTs
- 6. Keyword queries in DHTs (Magnolia)
- 7. Popularity queries
- 8. Demo of system
- 9. (if time) Data transmission: overlay vs. DHT multicast; BitTorrent / SplitStream
- 10. (if time) P2P file systems and versioning (precursor to undo/redo logging from later in the course)
3P2P Today
edonkey
bittorrent
pastry
jxta
can
fiorana
napster
freenet
united devices
open cola
?
aim
ocean store
netmeeting
farsite
gnutella
icq
ebay
morpheus
limewire
seti_at_home
bearshare
uddi
grove
jabber
popular power
kazaa
folding_at_home
tapestry
mojo nation
process tree
chord
4Object representation and storage
Objects
Attributes: Name, Artist, Album, Genre
Pointer to object
5P2P vs. Distributed DBMS
Traditional DDBMS Issues
- Transactions
- Distributed Query Optimization
- Interoperation of heterogeneous data sources
- Reliability/failure of nodes
Complex features do not scale
6P2P vs. Distributed DBMS
- Example application: file sharing
- Simple data model and query language
- No complex query optimization
- Easy interoperation
- No guarantee on quality of results
- Individual site availability unimportant
- Local updates
- No transactions
- Network partitions OK
Simple: amenable to a large-scale network of PCs
7Example file sharing
- Challenge 1: Performance
- Asking everyone is expensive!
- If I am smart, I only need to ask one peer
- How can I be smart?
File X?
8Search in P2P
- System can control
- Connections made by users/topology
- Data placement
- Query type
- Tight control: structured
- Efficient, comprehensive
- Loose control: unstructured
- Inefficient, not comprehensive, simple, expressive
- Used in real life
9Centralized
- Napster model
- Benefits
- Efficient search
- Limited bandwidth usage
- No per-node state
- Drawbacks
- Central point of failure
- Limited scale
[Figure: peers Bob, Alice, Jane, and Judy]
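A centralized index of this kind can be sketched as one in-memory map from attribute values to object pointers. The class name `CentralIndex`, its methods, and the example URLs are illustrative assumptions for this sketch, not Napster's actual protocol.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index: attribute values map to object pointers (URLs)."""

    def __init__(self):
        # (attribute name, lowercased value) -> set of pointers
        self._index = defaultdict(set)

    def publish(self, pointer, attributes):
        """A peer registers one object: its pointer plus its attribute data."""
        for name, value in attributes.items():
            self._index[(name, value.lower())].add(pointer)

    def search(self, **criteria):
        """Return the pointers whose attributes match every given criterion."""
        matches = [self._index.get((n, v.lower()), set()) for n, v in criteria.items()]
        return set.intersection(*matches) if matches else set()


index = CentralIndex()
# Hypothetical pointers; the peers keep the files, the index keeps only metadata.
index.publish("http://alice.example/a.mp3", {"artist": "X", "genre": "Jazz"})
index.publish("http://bob.example/b.mp3", {"artist": "Y", "genre": "Jazz"})
jazz = index.search(genre="jazz")
jazz_by_x = index.search(genre="jazz", artist="x")
```

Every query is answered by the one index, which is why search is efficient and also why the index is a central point of failure.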
10http://www.snocap.com/
11Unstructured Query Flooding
12Problems with unstructured
- Inefficient
- Query messages are flooded
- Even if routing is intelligent, worst-case load is still O(n), where n is the number of nodes in the system
- Not comprehensive
- If I do not get a result for my query, is it
because none exists?
- (Of course, many optimizations are possible)
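The cost of flooding can be illustrated with a small simulation. This is a sketch under assumptions: the overlay (a ring plus random links), the TTL mechanism, and the function names are hypothetical stand-ins, not Gnutella's actual protocol.

```python
import random
from collections import deque


def flood(graph, start, ttl):
    """Flood a query from `start`; return (nodes reached, messages sent)."""
    seen = {start}
    messages = 0
    frontier = deque([(start, ttl)])
    while frontier:
        node, hops_left = frontier.popleft()
        if hops_left == 0:
            continue  # TTL exhausted: this node does not forward
        for neighbor in graph[node]:
            messages += 1  # every forwarded copy costs a message
            if neighbor not in seen:  # duplicates are dropped, but were still sent
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return seen, messages


def random_overlay(n, degree, seed=0):
    """Hypothetical unstructured overlay: a ring plus random extra links."""
    rng = random.Random(seed)
    graph = {i: {(i + 1) % n, (i - 1) % n} for i in range(n)}
    for i in range(n):
        for j in rng.sample(range(n), degree):
            if i != j:
                graph[i].add(j)
                graph[j].add(i)
    return graph


overlay = random_overlay(n=200, degree=2)
reached_small, msgs_small = flood(overlay, start=0, ttl=2)
reached_all, msgs_all = flood(overlay, start=0, ttl=200)
```

With a large TTL every node is reached, at a message cost that grows with the size of the network. With a small TTL the query is cheaper but may miss nodes, so an empty result does not prove that no match exists.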
13Distributed Hash Table (DHTs)
- Model
- Key/object pair; the key is hashed to get an ID
- Example
- Objects are files
- The key is the content of the file
- The ID is the hash of the file contents
- Single operation: Lookup(ID)
- Input: integer ID
- Output: the object with the corresponding ID
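The model can be sketched in a few lines. SHA-1 and the 16-bit ID width are assumptions chosen for the example, and `ToyDHT` is a hypothetical single-process stand-in that shows only the interface, not a distributed implementation.

```python
import hashlib

M_BITS = 16  # assumed identifier width for this sketch; real systems use far more bits


def object_id(content: bytes, m: int = M_BITS) -> int:
    """Hash the key (here, the file contents) down to an m-bit integer ID."""
    digest = hashlib.sha1(content).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


class ToyDHT:
    """Single-process stand-in for the DHT interface: store, then Lookup(ID)."""

    def __init__(self):
        self._objects = {}

    def store(self, content: bytes) -> int:
        oid = object_id(content)
        self._objects[oid] = content
        return oid

    def lookup(self, oid: int):
        """Input: integer ID. Output: the object with that ID, or None."""
        return self._objects.get(oid)


dht = ToyDHT()
oid = dht.store(b"contents of file X")
found = dht.lookup(oid)
```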
14Identifiers
- IDs are m-bit integers
- Nodes are also assigned IDs
- Commonly assigned by hashing a node's IP address, although there are many problems with this
- An object is stored on the node with the smallest ID greater than or equal to the object's ID
- This node is called the successor of the object's ID
- IDs are arranged on a circle, so 2^m - 1 is followed by 0
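The placement rule can be sketched as a search over the sorted node IDs. The sketch follows Chord's convention that a node whose ID equals the object's ID is that object's successor. The node and object IDs below are example values chosen for illustration.

```python
from bisect import bisect_left


def successor(node_ids, ident, m):
    """Return the node responsible for `ident` on a 2**m identifier circle.

    `node_ids` must be sorted. The successor is the first node whose ID is
    greater than or equal to `ident`, wrapping past 2**m - 1 back to 0.
    """
    ident %= 2 ** m
    i = bisect_left(node_ids, ident)
    return node_ids[i % len(node_ids)]  # i == len(node_ids) means wrap around


nodes = [0, 1, 3]  # example node IDs on an m = 3 circle (IDs 0 to 7)
placement = {obj: successor(nodes, obj, 3) for obj in (1, 2, 6)}
```

Here object 1 is stored on node 1, object 2 on node 3, and object 6 wraps around the circle to node 0.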
15Data Placement
[Figure: identifier circle with m = 3 (IDs 0 to 7), showing nodes and the objects placed on them]
16Connections
[Figure: identifier circle (IDs 0 to 7) with finger pointers]
17Query
- Lookup(objectID)
- objectID is typically the ID of the object you
are looking for, but not necessarily
- Approach
- Find the predecessor of the object
- I.e. the node with the largest ID that is smaller
than the object ID
- Return the successor of the predecessor
18Query Example
- Say node 0 wants to find the object with ID 7
- For simplicity, we will assume a node exists at
every ID in the space
19Query Example
[Figure: identifier circle, IDs 0 to 7]
Node 0: Lookup(7)
Node 0: FindPred(7)
20Query Example
[Figure: identifier circle, IDs 0 to 7]
Node 4: FindPred(7)
21Query Example
[Figure: identifier circle, IDs 0 to 7]
Node 6: FindPred(7)
Node 6 is the predecessor. Return successor: node 7
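The walk in this example can be reproduced with a small simulation. It is a sketch in the style of Chord's published lookup, under the same simplification as the slides (a node at every ID, m = 3); the helper names are made up for the example.

```python
from bisect import bisect_left

M = 3
SIZE = 2 ** M
NODES = list(range(SIZE))  # the slides' simplification: a node at every ID


def successor(ident):
    """First node at or after ident, wrapping around the circle."""
    i = bisect_left(NODES, ident % SIZE)
    return NODES[i % len(NODES)]


def fingers(n):
    """Finger i (1-based) points at the successor of n + 2**(i-1)."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def between(x, a, b, inclusive_right=False):
    """Is x in the clockwise interval (a, b), or (a, b] if inclusive_right?"""
    span = (b - a) % SIZE or SIZE
    dist = (x - a) % SIZE
    return 0 < dist < span or (inclusive_right and dist == span)


def find_predecessor(start, ident):
    """Hop along fingers until ident falls in (node, node's successor]."""
    node, path = start, [start]
    while not between(ident, node, successor(node + 1), inclusive_right=True):
        # Closest preceding finger: the farthest finger that is still before ident.
        node = next(f for f in reversed(fingers(node)) if between(f, node, ident))
        path.append(node)
    return node, path


def lookup(start, ident):
    """Return (node responsible for ident, nodes visited while finding it)."""
    pred, path = find_predecessor(start, ident)
    return successor(pred + 1), path


owner, visited = lookup(0, 7)
```

For Lookup(7) issued at node 0, the walk visits nodes 0, 4, and 6 and then returns node 7, the successor of node 6.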
22Query characteristics
- With high probability, a query can be answered by contacting O(log N) nodes
- N: total nodes in the network
- Efficient!
- Also notice: if an object with the ID exists in the network, it will be found
- Comprehensive!
- State is also O(log N) in size
23Query characteristics
- Note that finger pointers are not required for
correct operation
- Only successor pointers are needed
- But then cost of query increases
- O(N) in worst case
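The difference in cost can be seen by counting hops both ways. This sketch assumes a node at every ID, so that each finger lands exactly on its target; both function names are hypothetical.

```python
def hops_successor_only(start, ident, size):
    """Hops to reach ident's predecessor when nodes know only their successor."""
    node, hops = start, 0
    while (node + 1) % size != ident % size:
        node = (node + 1) % size
        hops += 1
    return hops


def hops_with_fingers(start, ident, size):
    """Hops when each node can also jump ahead by any power of two."""
    node, hops = start, 0
    while (ident - node) % size != 1:
        dist = (ident - node) % size or size
        step = 1 << ((dist - 1).bit_length() - 1)  # largest power of two below dist
        node = (node + step) % size
        hops += 1
    return hops


slow_small = hops_successor_only(0, 7, 8)
fast_small = hops_with_fingers(0, 7, 8)
slow_large = hops_successor_only(0, 1023, 1024)
fast_large = hops_with_fingers(0, 1023, 1024)
```

On the 8-ID circle the successor-only walk takes 6 hops against 2 with fingers. On a 1024-ID circle the gap is 1022 hops against 9.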
24Advantages of Structured?
- Scalability/Efficiency
- Load grows as O(log N)
- Comprehensiveness
25Disadvantages? (cont)
- Availability of Data
- If a node dies suddenly, what happens to the data
it was storing?
- MUST replicate data across multiple nodes
- Query Language
- How can we express keyword queries efficiently?
- Many useful applications require different
languages
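One common way to meet the replication requirement is to store each object on its successor and on the next few nodes clockwise. The slides do not say which scheme is intended, so the sketch below is an assumption; `replica_nodes` and the example IDs are hypothetical.

```python
from bisect import bisect_left


def replica_nodes(node_ids, ident, replicas):
    """Pick the successor of `ident` plus the next nodes clockwise.

    `node_ids` must be sorted.
    """
    i = bisect_left(node_ids, ident)
    count = min(replicas, len(node_ids))
    return [node_ids[(i + k) % len(node_ids)] for k in range(count)]


nodes = [0, 1, 3, 6]  # sorted example node IDs on an m = 3 circle
holders = replica_nodes(nodes, ident=2, replicas=2)

# The primary holder dies suddenly; the successor rule now picks another node.
survivors = [n for n in nodes if n != holders[0]]
new_primary = replica_nodes(survivors, ident=2, replicas=1)[0]
```

Because the new successor of the object's ID is the node that already held the second copy, a lookup still finds the data after the failure.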
26Magnolia
27Resulting Distribution
28Prefix hashing
29Balancing
Innovation
Balanced over the sibling group
Sibling group ID: 100
All siblings in a group share the same prefix
30Insert
Keyword -> hP -> SiblingGroup ID
Random Sibling
Locate a sibling node via SIFT
31Advantages
- Good Balancing Properties
32Advantages
- Low Traffic Load on nodes for popular queries
- Quick Lookup
- Popularity Ranking of Objects
- Distributed Replication for resilience
33Implementing Magnolia
- Developed on top of a Chord clone written in Python
- If you're going to write a peer-to-peer app, why not leverage existing modules and libraries?
- Challenge: How do we implement group-based stores and queries without requiring additional network maintenance?
34Chord's Finger Table
- A Chord node maintains a finger table of M IPs pointing to nodes ahead of it in the ring.
- The pointer at index i is the successor of (node ID + 2^(i-1)). This lets us reach any node in the network in O(log M) hops
- We use the M most significant bits in a node's ID to indicate its group. We want to reach any group in O(log M) hops.
- Do we need another table?
- Nope. The last M entries in our finger table provide this.
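A small sketch shows why the highest finger entries already point into other groups. The widths `ID_BITS` and `GROUP_BITS` are names and values assumed for this example and are not the slide's symbols; the node IDs are also made up.

```python
from bisect import bisect_left

ID_BITS = 6      # assumed ID width for the sketch
GROUP_BITS = 2   # assumed number of most significant bits that name the group


def successor(node_ids, ident):
    """First node at or after ident on the circle; node_ids must be sorted."""
    i = bisect_left(node_ids, ident % 2 ** ID_BITS)
    return node_ids[i % len(node_ids)]


def finger_targets(n):
    """Target IDs n + 2**(i-1) for i = 1..ID_BITS, modulo the ring size."""
    return [(n + 2 ** (i - 1)) % 2 ** ID_BITS for i in range(1, ID_BITS + 1)]


def finger_table(node_ids, n):
    """Each finger is the successor of its target ID."""
    return [successor(node_ids, t) for t in finger_targets(n)]


def group_of(ident):
    """The group is named by the most significant GROUP_BITS bits."""
    return ident >> (ID_BITS - GROUP_BITS)


def sibling_index(ident):
    """The remaining low-order bits: the position inside the group."""
    return ident & (2 ** (ID_BITS - GROUP_BITS) - 1)


n = 0b01_0110  # a node in group 0b01
group_fingers = finger_targets(n)[-GROUP_BITS:]
table = finger_table([22, 30, 40, 60], n)
```

The last `GROUP_BITS` targets add a power of two that only touches the group bits, so each one keeps the node's low-order bits and lands in a different group. No second table is needed for group-to-group routing.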
35Talking to Siblings
- How do we propagate queries through the group?
- Naïve solution: send to our predecessor and successor.
- A better solution: we can send a query throughout the group by treating the sibling group as a tree.
36Sibling Tree
N/N 16 M/M 4
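[Figure: sibling tree over IDs 0 to 15, rooted at node 0. Each edge is labeled with the finger offset that reaches the child.]
0 -> 1 (0 + 1), 0 -> 8 (0 + 2^3)
1 -> 2 (1 + 1), 1 -> 5 (1 + 2^2)
8 -> 9 (8 + 1), 8 -> 12 (8 + 2^2)
2 -> 3 (2 + 1), 2 -> 4 (2 + 2^1)
5 -> 6 (5 + 1), 5 -> 7 (5 + 2^1)
9 -> 10 (9 + 1), 9 -> 11 (9 + 2^1)
12 -> 13 (12 + 1), 12 -> 14 (12 + 2^1)
14 -> 15 (14 + 2^0)
Every edge can be found in the finger table!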
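The sibling tree on this slide can be generated by a simple halving rule. The rule below is inferred from the edge labels in the figure and is an assumption, not necessarily how Magnolia computes the tree. A node responsible for the range [n, hi] hands the lower part to n + 1 and the upper part to n + 2^k, where 2^k is the largest power of two that still fits in the range.

```python
def children(n, hi):
    """Children of node n when n is responsible for the ID range [n, hi].

    Returns (child_id, child_hi) pairs. Every child sits at n + 2**k,
    which is a finger target of n.
    """
    size = hi - n + 1
    if size < 2:
        return []
    far = 1 << ((size - 1).bit_length() - 1)  # largest power of two <= size - 1
    if far == 1:
        return [(n + 1, hi)]
    return [(n + 1, n + far - 1), (n + far, hi)]


def tree_edges(root, hi):
    """All (parent, child) edges of the tree covering [root, hi]."""
    edges = []
    stack = [(root, hi)]
    while stack:
        n, n_hi = stack.pop()
        for child, child_hi in children(n, n_hi):
            edges.append((n, child))
            stack.append((child, child_hi))
    return sorted(edges)


edges = tree_edges(0, 15)
offsets = [child - parent for parent, child in edges]
```

Every offset is a power of two, so a node can forward a query to its children using only entries it already keeps in its finger table.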
37Sibling Tree Problems
- Problems
- Not every possible node will exist
- Not every node will have results to report
- The query maker needs to know when the search is
done
- But we're okay!
- Nodes can determine if a child sub-tree is dead
- Even if a child node in our sibling table has a higher ID than expected, its sub-tree contains all existing descendants of the expected ID
- We can predict when a child is in a sibling's or an ancestor's tree
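The points above can be sketched with a tree walk over a sparse group. The halving rule used to split ranges is an assumption inferred from the sibling tree figure, and the list of live IDs is made up. A child pointer is the successor of the expected ID. If that successor lies beyond the child's range, the whole sub-tree is dead. Otherwise the successor takes over the range, which still contains every live descendant of the expected ID.

```python
from bisect import bisect_left


def split(n, hi):
    """Expected child ranges for node n covering [n, hi] (assumed halving rule)."""
    size = hi - n + 1
    if size < 2:
        return []
    far = 1 << ((size - 1).bit_length() - 1)
    return [(n + 1, hi)] if far == 1 else [(n + 1, n + far - 1), (n + far, hi)]


def broadcast(live, root, hi):
    """Visit every live node in [root, hi]; return them in visit order.

    `live` is the sorted list of IDs that actually exist in the group.
    """
    visited = [root]
    for expected, child_hi in split(root, hi):
        i = bisect_left(live, expected)  # finger entry = successor(expected)
        if i == len(live) or live[i] > child_hi:
            continue  # sub-tree is dead: no live node in this range
        visited += broadcast(live, live[i], child_hi)
    return visited


live = [0, 3, 4, 9, 14, 15]
order = broadcast(live, 0, 15)
```

Each live node is visited exactly once, and because every range is either forwarded or declared dead, the node that issued the query can tell when the search is done. The sketch assumes the lowest live ID is the root.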
38Bigger Problems
- What if a pointer in our finger table fails?
- We either have to find the successor to its id
or fail to query the sub-tree
- What if the lowest-ID node isn't the root of our tree?
- Some of our edges won't be in our finger table
39Popularity queries
40Yulania, Demo
41BitTorrent
42SplitStream