Complex Queries in DHT-based Peer-to-Peer Networks

Description:

Title: The PIER Relational Query Processing System Author: Ryan Huebsch Last modified by: Ryan Huebsch Created Date: 1/31/2002 10:12:22 PM Document presentation format – PowerPoint PPT presentation

Slides: 19

Provided by: RyanHu9

Learn more at: http://www.huebsch.org

Transcript and Presenter's Notes

1
Complex Queries in DHT-based Peer-to-Peer Networks

Matthew Harren, Joe Hellerstein,
Ryan Huebsch, Boon Thau Loo,
Scott Shenker, Ion Stoica
p2p_at_db.cs.berkeley.edu
UC Berkeley, CS Division

IPTPS 3/8/02
2
Outline

Contrast P2P and DB systems
Motivation
Architecture
DHT Requirements
Query Processor
Current Status
Future Research

3
Uniting DHTs and Query Processing
4
P2P and DB Systems
Property                            P2P  DB
Flexibility                         ?    ?
Decentralized                       ?    ?
Strong Semantics                    ?    ?
Powerful query facilities           ?    ?
Fault Tolerance                     ?    ?
Lightweight                         ?    ?
Transactions & Concurrency Control  ?    ?
5
P2P + DB = ?

P2P Database? No!
ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics
Much too heavyweight a solution for the everyday user
Query Processing on P2P!
Both P2P and DBs do data location and movement
Can be naturally unified (lessons in both directions)
P2P brings scalability and flexibility; DB brings the relational model and query facilities

6
P2P Query Processing (Simple) Example
SELECT song, size, server FROM album, song
WHERE album.ID = song.albumID AND album.name = 'Rubber Soul'

Filesharing

Keyword searching is ONE canned SQL query
Imagine what else you could do!

7
P2P Query Processing (Simple) Example
SELECT song, size, server FROM album-ngrams AN, song
WHERE AN.ID = song.albumID AND AN.ngram IN <list of search ngrams>
GROUP BY AN.ID HAVING COUNT(AN.ngram) >= <# of ngrams in search>

Filesharing

Keyword searching is ONE canned SQL query
Imagine what else you could do!
Fuzzy Searching, Resource Discovery, Enhanced DNS
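
A minimal sketch of the canned n-gram query above, run against an in-memory SQLite database rather than a DHT. Everything specific here is an assumption made for illustration: the trigram size, the table name album_ngrams (the slide's album-ngrams is not a valid unquoted SQL identifier), the column layout, and the toy rows. The slide's GROUP BY is moved into a subquery so that the statement is valid SQL.

```python
import sqlite3

def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE album_ngrams (ID INTEGER, ngram TEXT);
    CREATE TABLE song (song TEXT, size INTEGER, server TEXT, albumID INTEGER);
""")

# Toy data, invented for illustration only.
albums = {1: "Rubber Soul", 2: "Revolver"}
for album_id, name in albums.items():
    conn.executemany(
        "INSERT INTO album_ngrams VALUES (?, ?)",
        [(album_id, g) for g in ngrams(name)],
    )
conn.executemany(
    "INSERT INTO song VALUES (?, ?, ?, ?)",
    [("Drive My Car", 2400, "hostA", 1), ("Taxman", 2600, "hostB", 2)],
)

def search(term):
    """Return songs whose album name contains every n-gram of term."""
    grams = sorted(ngrams(term))
    placeholders = ",".join("?" * len(grams))
    sql = f"""
        SELECT song, size, server FROM song
        WHERE albumID IN (
            SELECT ID FROM album_ngrams
            WHERE ngram IN ({placeholders})
            GROUP BY ID
            HAVING COUNT(DISTINCT ngram) >= ?
        )
    """
    return conn.execute(sql, grams + [len(grams)]).fetchall()

results = search("rubber")
```

Lowering the HAVING threshold below the number of search n-grams would admit partial matches, which is one way the fuzzy searching mentioned above could be approximated.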

8
What this project IS and IS NOT about

IS NOT ABOUT: Absolute Performance
In most situations a centralized solution could be faster
IS ABOUT: Decentralized Features
No administrator, anonymity, shared resources, tolerates failures, resistant to censorship
IS NOT ABOUT: Replacing RDBMS
Centralized solutions still have their place for many applications (commercial records, etc.)
IS ABOUT: Research synergies
Unifying/morphing design principles and techniques from DB and NW communities

9
General Architecture

Note: the data is stored separately from the query engine, which is not standard DB practice!

Based on Distributed Hash Tables (DHT) to get
many good networking properties
A query processor is built on top

10
DHT API

Basic API
publish(RID, object)
lookup(RID)
multicast(object)
NOTE: Applications can only fetch by name, a very limited query language!
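
To make the three calls concrete, here is a toy in-process stand-in for a DHT. It is only a sketch under stated assumptions: the class name ToyDHT, the hash-modulo placement, and the per-node inbox used for multicast are all hypothetical and are not taken from CAN or any real DHT implementation.

```python
import hashlib

class ToyDHT:
    """In-process stand-in for a DHT; each 'node' is just a dict."""

    def __init__(self, num_nodes=4):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.inbox = [[] for _ in range(num_nodes)]

    def _owner(self, rid):
        # Assumed placement rule: hash the RID and take it modulo the node count.
        digest = hashlib.sha1(str(rid).encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def publish(self, rid, obj):
        """Store obj at the node responsible for rid."""
        self.nodes[self._owner(rid)][rid] = obj

    def lookup(self, rid):
        """Fetch by exact RID only; returns None if nothing is stored."""
        return self.nodes[self._owner(rid)].get(rid)

    def multicast(self, obj):
        """Deliver obj to every node."""
        for box in self.inbox:
            box.append(obj)

dht = ToyDHT()
dht.publish("song:42", {"song": "Drive My Car", "server": "hostA"})
found = dht.lookup("song:42")
missing = dht.lookup("Drive My Car")  # None: only the exact RID can be fetched
dht.multicast("query-1")
```

The failed second lookup illustrates the note above: without the exact RID there is no way to ask for an object by any of its other fields.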

11
DHT API Enhancements I

Basic API
publish(namespace, RID, object)
lookup(namespace, RID)
multicast(namespace, object)
Namespaces: subsets of the ID space for logical and physical data partitioning

12
DHT API Enhancements II

Additions
lscan(namespace): retrieve the data stored locally from a particular namespace
newData(namespace): receive a callback when new data is inserted into the local store for the namespace
This violates the abstraction of location independence
Why necessary? Parallel scanning of base relations
Why acceptable? Access is limited to reading; applications cannot control the location of data
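
The two additions can be pictured as hooks on a single node's local store. The sketch below is an assumption about their shape, not the actual interface: the class LocalNode, the method store_locally (standing in for whatever the DHT layer does when an object is routed to the node), and the callback signature are all hypothetical.

```python
from collections import defaultdict

class LocalNode:
    """One node's local store with lscan and newData hooks."""

    def __init__(self):
        self._store = defaultdict(dict)      # namespace -> {rid: object}
        self._callbacks = defaultdict(list)  # namespace -> [callable]

    def store_locally(self, namespace, rid, obj):
        """Called by the DHT layer when an object is routed to this node."""
        self._store[namespace][rid] = obj
        for callback in self._callbacks[namespace]:
            callback(rid, obj)

    def lscan(self, namespace):
        """Return a snapshot of the objects stored locally in namespace."""
        return list(self._store.get(namespace, {}).values())

    def new_data(self, namespace, callback):
        """Register callback(rid, obj) for future local inserts into namespace."""
        self._callbacks[namespace].append(callback)

node = LocalNode()
node.store_locally("song", 1, "a")
seen = []
node.new_data("song", lambda rid, obj: seen.append((rid, obj)))
node.store_locally("song", 2, "b")
scanned = node.lscan("song")
```

Both hooks only read: nothing here lets the application decide which node an object lands on, which matches the argument for why the abstraction violation is acceptable.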

13
Query Processor (QP) Architecture

QP is just another application as far as the DHT is concerned: DHT objects = QP tuples
User applications can use QP to query data using a subset of SQL
Select
Project
Joins
Group By / Aggregate
Data can be metadata (for a file-sharing type application) or entire records; the mechanisms are the same

14
Indexes. The lifeblood of a database engine.

The DHT's mapping of RID to object is equivalent to an index
Additional indexes are created by adding another key/value pair, with the key being the value of the indexed field(s) and the value being a pointer to the object (the RID or primary key)

[Figure: Primary Index (PKey -> Data, stored in the DHT) and Secondary Index (Key -> Ptr, stored in the DHT under an index namespace)]
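
A small sketch of the indexing scheme described on this slide, using a dictionary as a stand-in for the DHT. The namespace names ("song", "song.albumID"), the helper functions, and the choice to keep a list of objects per key (so several tuples can share one indexed value) are assumptions made for illustration.

```python
from collections import defaultdict

class ToyDHT:
    """Dictionary-backed stand-in for a namespaced DHT."""

    def __init__(self):
        self._data = defaultdict(list)  # (namespace, key) -> [objects]

    def publish(self, namespace, key, obj):
        self._data[(namespace, key)].append(obj)

    def lookup(self, namespace, key):
        return list(self._data.get((namespace, key), []))

dht = ToyDHT()

def insert_song(pkey, row):
    """Publish the tuple under its primary key, plus one secondary index entry."""
    dht.publish("song", pkey, row)                     # primary: PKey -> Data
    dht.publish("song.albumID", row["albumID"], pkey)  # secondary: Key -> Ptr

def songs_by_album(album_id):
    """Follow secondary-index pointers back to the full tuples."""
    rows = []
    for pkey in dht.lookup("song.albumID", album_id):
        rows.extend(dht.lookup("song", pkey))
    return rows

insert_song(10, {"song": "Drive My Car", "albumID": 1})
insert_song(11, {"song": "Taxman", "albumID": 2})
insert_song(12, {"song": "Girl", "albumID": 1})
album_one = songs_by_album(1)
```

A lookup through the secondary index costs two rounds of DHT lookups: one for the pointers and one per pointer for the data.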
15
Relational Algorithms

Selection/Projection
Join Algorithms
Symmetric Hash
Use lscan on tables R and S. Republish tuples in a temporary namespace using the join attributes as the RID. Nodes in the temporary namespace perform mini-joins locally as tuples arrive and forward results to the requestor.
Fetch Matches
If there is an index on the join attribute(s) for one table (say R), use lscan for the other table (say S) and then issue a lookup probing for matches in R.
Semi-Join like algorithms
Bloom-Join like algorithms
Group-By (Aggregation)
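
The two join algorithms described above can be sketched over a simulated set of nodes. This is a single-process illustration under assumptions, not the PIER implementation: the node count, the hash-modulo owner function, the nested-dictionary node layout, and the toy album and song rows are all hypothetical. In the symmetric hash join, each arriving tuple first probes the other table's tuples already stored under the same join key and is then stored itself, so every matching pair is produced exactly once.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4

def owner(key):
    """Assumed placement rule: hash the key modulo the node count."""
    digest = hashlib.sha1(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

def make_nodes():
    # node -> namespace -> key -> list of tuples
    return [defaultdict(lambda: defaultdict(list)) for _ in range(NUM_NODES)]

def publish(nodes, namespace, key, obj):
    nodes[owner(key)][namespace][key].append(obj)

def lscan(node, namespace):
    return [t for bucket in node[namespace].values() for t in bucket]

def symmetric_hash_join(nodes, r_ns, s_ns, r_attr, s_attr):
    temp = make_nodes()  # stands in for the temporary namespace
    results = []
    for node in nodes:
        for side, ns, attr in (("R", r_ns, r_attr), ("S", s_ns, s_attr)):
            other = "S" if side == "R" else "R"
            for tup in lscan(node, ns):
                key = tup[attr]
                dest = temp[owner(key)]  # rehash on the join attribute
                # Mini-join: probe the other table's tuples already at dest.
                for match in dest[other][key]:
                    results.append((tup, match) if side == "R" else (match, tup))
                dest[side][key].append(tup)
    return results

def fetch_matches(nodes, s_ns, s_attr, r_index_ns):
    """Scan S locally and look up matches in R, which is keyed on the join attribute."""
    results = []
    for node in nodes:
        for s_tup in lscan(node, s_ns):
            key = s_tup[s_attr]
            for r_tup in nodes[owner(key)][r_index_ns][key]:  # DHT lookup
                results.append((r_tup, s_tup))
    return results

nodes = make_nodes()
for album in [{"ID": 1, "name": "Rubber Soul"}, {"ID": 2, "name": "Revolver"}]:
    publish(nodes, "album", album["ID"], album)   # keyed on the join attribute
for song in [{"rid": 10, "song": "Drive My Car", "albumID": 1},
             {"rid": 11, "song": "Taxman", "albumID": 2},
             {"rid": 12, "song": "Girl", "albumID": 1}]:
    publish(nodes, "song", song["rid"], song)     # keyed on its own RID

joined = symmetric_hash_join(nodes, "album", "song", "ID", "albumID")
fetched = fetch_matches(nodes, "song", "albumID", "album")
```

Symmetric hash join moves both tables across the network, while Fetch Matches moves only lookups and their answers, at the price of requiring an index on the join attribute of one table.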

16
Interesting note

The state of the join is stored in the DHT store
Rehashed data is automatically re-routed to the proper node if the coordinate space is adjusted
When a node splits (to accept a new node into the network), its data is also split; this includes previously delivered rehashed tuples
This allows graceful re-organization of the network without interfering with ongoing operations
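
The splitting behaviour can be illustrated with a one-dimensional simplification of a coordinate space. This is a sketch under assumptions: a real CAN uses a multi-dimensional space, and the Zone class, the size of the space, and the point hash below are hypothetical. The stored items stand for rehashed join tuples that were delivered before the split.

```python
import hashlib

SPACE = 2 ** 16  # assumed size of the one-dimensional coordinate space

def point(key):
    """Hash a key to a point in [0, SPACE)."""
    digest = hashlib.sha1(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % SPACE

class Zone:
    """A node's share of the space: all points in [lo, hi)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.items = {}  # key -> stored object

    def owns(self, key):
        return self.lo <= point(key) < self.hi

    def split(self):
        """Hand the upper half of the zone, and the items in it, to a new node."""
        mid = (self.lo + self.hi) // 2
        new = Zone(mid, self.hi)
        self.hi = mid
        for key in [k for k in self.items if not self.owns(k)]:
            new.items[key] = self.items.pop(key)
        return new

zone = Zone(0, SPACE)
for k in range(20):
    zone.items[("temp_join", k)] = f"tuple-{k}"  # previously delivered tuples
new_zone = zone.split()
```

Because ownership is decided by the same hash before and after the split, a tuple that arrives later for any of these keys is routed to whichever zone now holds its earlier partners.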

17
Where we are

A working real implementation of our Query Processor (currently named PIER) on top of a CAN simulator
Initial work studying and analyzing algorithms; nothing really ground-breaking YET!
Analyzing the design space and which problems seem most interesting to pursue

18
Where to go from here?

Common Issues
Caching: both at DHT and QP levels
Using replication for speed and fault tolerance (both in data and computation)
Security
Database Issues
Pre-computation of (intermediate) results
Continuous queries/alerters
Query optimization (Is this like network routing?)
More algorithms; distributed DBMSs have more tricks
Performance Metrics for P2P QP Systems
What are the new apps the system enables?

19
Additional Slides
20
Symmetric Hash Join

The tuple is checked against predicates that apply to it (e.g. produced > 1970)
Unnecessary fields can be projected out
Re-insert the resulting tuple into the network using the join key value as the new RID, and use a new temporary namespace (both tables use the same namespace)

When each node receives the multicast, it uses lscan to read all data stored at the node. Each object or tuple is analyzed
I want Hawaiian images that appeared in movies produced since 1970
Create a query request: SELECT name, URL FROM images, movies WHERE image.ID = movie.ID AND
21
N-grams