Distributed Databases and Query Processing - PowerPoint PPT Presentation

About This Presentation

Title:

Distributed Databases and Query Processing

Description:

Number of Views:95

Avg rating:3.0/5.0

Slides: 9

Provided by: tho9

Category:

Tags: dangling | databases | distributed | processing | query

Transcript and Presenter's Notes

Title: Distributed Databases and Query Processing

1
Distributed Databases and Query Processing
2
Distributed DBs vs. Parallel DBs

Many autonomous processors that may participate
in database operations. Difference from a
shared-nothing parallel system is in the
assumption about the cost of communication.
Shared-nothing parallel systems message passing
cost is small compared with disk accesses and
other costs.
Distributed systems processors are typically
physically distant, and so, the cost of message
passing is high.
Advantages
Like parallel systems, a distributed system can
use many processors to accelerate query
evaluation.
Further, since the processors are widely
separated, we can increase resilience in the face
of failures by replicating data at several sites.
Disadvantage
Increased complexity of many DB aspects we have
to take care to minimize the number of messages
sent between sites.

3
Why distribute data?

Organization is itself distributed among many
sites
Banks
Many branches. Each branch will keep a database
of accounts maintained at that branch.
Some data are kept in the central office, such as
employee records and current interest rates.
Department Stores
Many individual stores. Each store has a database
of sales at that store and inventory at that
store.
Some data are kept in the central office, such as
employee records and a chain-wide inventory, data
about credit card customers, and information
about suppliers such as unfilled orders.
Libraries
Consortium of universities that each hold on-line
books and other documents.
Search at any site will examine the catalog of
documents available at all sites and deliver an
electronic copy of the document to the user if
any site holds it.

4
Horizontal Partitioning

For example, the chain of stores might be
imagined to have a single sales relation, such as
Sales(item, date, price, purchaser)
This relation does not exist physically. Rather,
it is the union of a number of relations with the
same schema, one at each of the stores in the
chain.
Local relations are called fragments.

5
Vertical Partitioning

Example Suppose we want to find out which sales
at the Victoria store were made to customers who
are more than 90 days in arrears on their credit
card payments.
Virtual relation
Sales(item, date, price, purchaser,
lastCreditCardPayment)
Real relations
Sales(item, date, price, creditCardNumber) in
one site
CreditCard(purchaser, creditCardNumber,
lastCreditCardPayment) in another site

6
The Distributed Join Problem

R(A,B)??S(B,C)
Naïve Strategies
1. Send a copy of R to the site of S, and compute
the join there.
2. Send a copy of S to the site of R and compute
the join there.
However
If the channel has low-capacity, e.g., a phone
line or wireless link, then the cost of the join
is primarily the time it takes to copy one of the
relations.
Even if communication is fast, there may be a
better query plan if the shared attribute B has
values that are much smaller than the values of A
and C.
E.g., B could be an identifier for documents or
videos, while A and C are the documents or videos
themselves.

7
Semijoin Reductions

Query plan Only the relevant part of each
relation is shipped to the site of the other.
For this, compute first the semijoin
R(X,Y) ?lt S(Y,Z) R(X,Y) ?? ?Y(S (Y,Z))
by sending ?Y(S (Y,Z)) to the site of R.
In this way, R is reduced (hopefully)!
Finally, send the reduced R to the site of S and
compute the join.

8
Some arguments

The semijoin plan is superior if
?Y(S (Y,Z)) is much smaller than S.
?Y(S (Y,Z)) will be small compared with S if
There are many duplicates to be eliminated
The components for the attributes of Z are large
compared with the components of Y
e.g., Z includes attributes whose values are
audios, videos, or documents.
R(X,Y) ?lt S(Y,Z) is smaller than R.
That is, R must contain many dangling tuples in
its join with S.