Distributed Databases and Query Processing

Provided by: tho9
1
Distributed Databases and Query Processing
2
Distributed DBs vs. Parallel DBs
  • A distributed database has many autonomous
    processors that may participate in database
    operations. The difference from a shared-nothing
    parallel system lies in the assumption about the
    cost of communication.
  • Shared-nothing parallel systems: message-passing
    cost is small compared with disk accesses and
    other costs.
  • Distributed systems: processors are typically
    physically distant, so the cost of message
    passing is high.
  • Advantages
  • Like parallel systems, a distributed system can
    use many processors to accelerate query
    evaluation.
  • Further, since the processors are widely
    separated, we can increase resilience in the face
    of failures by replicating data at several sites.
  • Disadvantage
  • Increased complexity of many DB aspects. We have
    to take care to minimize the number of messages
    sent between sites.

3
Why distribute data?
  • Organization is itself distributed among many
    sites
  • Banks
  • Many branches. Each branch will keep a database
    of accounts maintained at that branch.
  • Some data are kept in the central office, such as
    employee records and current interest rates.
  • Department Stores
  • Many individual stores. Each store has a database
    of sales at that store and inventory at that
    store.
  • Some data are kept in the central office, such as
    employee records and a chain-wide inventory, data
    about credit card customers, and information
    about suppliers such as unfilled orders.
  • Libraries
  • A consortium of universities, each of which holds
    online books and other documents.
  • A search at any site examines the catalog of
    documents available at all sites and delivers an
    electronic copy of the document to the user if
    any site holds it.

4
Horizontal Partitioning
  • For example, the chain of stores might be
    imagined to have a single sales relation, such as
  • Sales(item, date, price, purchaser)
  • This relation does not exist physically. Rather,
    it is the union of a number of relations with the
    same schema, one at each of the stores in the
    chain.
  • Local relations are called fragments.
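
A minimal sketch of the idea above, assuming two hypothetical stores with made-up rows: the logical Sales relation is never stored anywhere, and is obtained only as the (bag) union of the per-store fragments. The store names, the sample tuples, and the helper names `virtual_sales` and `sales_at` are illustrative assumptions, not part of the slides.

```python
from itertools import chain

# Hypothetical fragments: one local relation per store, all sharing the schema
# Sales(item, date, price, purchaser). Store names and rows are invented.
fragments = {
    "Victoria": [("toaster", "2024-01-05", 29.99, "Ann")],
    "Vancouver": [
        ("kettle", "2024-01-05", 19.99, "Bob"),
        ("toaster", "2024-01-06", 29.99, "Cy"),
    ],
}


def virtual_sales(fragments):
    """Return the logical Sales relation as the bag union of all fragments."""
    return chain.from_iterable(fragments.values())


def sales_at(fragments, store):
    """A query about a single store only needs that store's fragment."""
    return fragments[store]


all_sales = list(virtual_sales(fragments))
victoria_sales = sales_at(fragments, "Victoria")
```

A query that mentions only one store can be answered from that store's fragment alone, which is the practical benefit of partitioning along the lines the organization is already divided.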

5
Vertical Partitioning
  • Example: Suppose we want to find out which sales
    at the Victoria store were made to customers who
    are more than 90 days in arrears on their credit
    card payments.
  • Virtual relation
  • Sales(item, date, price, purchaser,
    lastCreditCardPayment)
  • Real relations
  • Sales(item, date, price, creditCardNumber) at
    one site
  • CreditCard(purchaser, creditCardNumber,
    lastCreditCardPayment) at another site
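
The example above can be sketched as follows: the virtual relation is rebuilt by joining the two real relations on creditCardNumber, and the arrears query is then a filter on lastCreditCardPayment. The sample rows, the helper names `reconstruct_sales` and `in_arrears`, and the reading of "more than 90 days in arrears" as "last payment more than 90 days before today" are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Site 1 (hypothetical rows): Sales(item, date, price, creditCardNumber)
sales = [
    ("toaster", date(2024, 6, 1), 29.99, "1111"),
    ("kettle", date(2024, 6, 2), 19.99, "2222"),
]

# Site 2 (hypothetical rows):
# CreditCard(purchaser, creditCardNumber, lastCreditCardPayment)
credit_cards = [
    ("Ann", "1111", date(2024, 1, 15)),
    ("Bob", "2222", date(2024, 5, 20)),
]


def reconstruct_sales(sales, credit_cards):
    """Join the two real relations on creditCardNumber to obtain
    Sales(item, date, price, purchaser, lastCreditCardPayment)."""
    by_card = {card: (purchaser, last) for purchaser, card, last in credit_cards}
    for item, sale_date, price, card in sales:
        if card in by_card:
            purchaser, last = by_card[card]
            yield (item, sale_date, price, purchaser, last)


def in_arrears(rows, today, days=90):
    """Keep rows whose last payment is more than `days` days before `today`
    (an assumed interpretation of "in arrears")."""
    cutoff = today - timedelta(days=days)
    return [row for row in rows if row[4] < cutoff]


overdue = in_arrears(reconstruct_sales(sales, credit_cards), today=date(2024, 6, 10))
```

In a real distributed setting the join itself requires moving data between the two sites, which is the problem the following slides address.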

6
The Distributed Join Problem
  • R(A,B) ⋈ S(B,C)
  • Naïve Strategies
  • 1. Send a copy of R to the site of S, and compute
    the join there.
  • 2. Send a copy of S to the site of R and compute
    the join there.
  • However
  • If the channel has low capacity, e.g., a phone
    line or wireless link, then the cost of the join
    is primarily the time it takes to copy one of the
    relations.
  • Even if communication is fast, there may be a
    better query plan if the shared attribute B has
    values that are much smaller than the values of A
    and C.
  • E.g., B could be an identifier for documents or
    videos, while A and C are the documents or videos
    themselves.

7
Semijoin Reductions
  • Query plan: only the relevant part of each
    relation is shipped to the site of the other.
  • For this, compute first the semijoin
  • R(X,Y) ⋉ S(Y,Z) = R(X,Y) ⋈ π_Y(S(Y,Z))
  • by sending π_Y(S(Y,Z)) to the site of R.
  • In this way, R is reduced (hopefully)!
  • Finally, send the reduced R to the site of S and
    compute the join.
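
The plan above can be sketched in a single process, with the two sites simulated by ordinary function calls. This is a minimal illustration under stated assumptions: relations are lists of tuples, R holds (x, y) pairs and S holds (y, z) pairs, and the function names and sample data are hypothetical. The counts returned stand in for the amount of data that would cross the network.

```python
def project_y(s):
    """Projection of S onto Y: the distinct y-values of its (y, z) tuples."""
    return {y for y, _ in s}


def semijoin(r, y_values):
    """Semijoin of R with S: the (x, y) tuples of R whose y occurs in S."""
    return [(x, y) for x, y in r if y in y_values]


def join(r, s):
    """Natural join of R(X,Y) and S(Y,Z), producing (x, y, z) tuples."""
    by_y = {}
    for y, z in s:
        by_y.setdefault(y, []).append(z)
    return [(x, y, z) for x, y in r for z in by_y.get(y, [])]


def semijoin_plan(r, s):
    shipped_to_r = project_y(s)            # step 1: site of S -> site of R
    reduced_r = semijoin(r, shipped_to_r)  # step 2: computed at the site of R
    result = join(reduced_r, s)            # step 3: reduced R -> site of S, join there
    return result, len(shipped_to_r), len(reduced_r)


# Hypothetical data: three of the four R-tuples are dangling.
r = [("x1", 1), ("x2", 2), ("x3", 3), ("x4", 4)]
s = [(2, "z1"), (2, "z2"), (5, "z3")]
result, y_values_shipped, r_tuples_shipped = semijoin_plan(r, s)
```

With this data, two y-values and one R-tuple are shipped instead of all four R-tuples, and the result equals the join computed directly on the unreduced relations.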

8
Some arguments
  • The semijoin plan is superior if
  • π_Y(S(Y,Z)) is much smaller than S.
  • π_Y(S(Y,Z)) will be small compared with S if
  • There are many duplicates to be eliminated
  • The components for the attributes of Z are large
    compared with the components of Y
  • e.g., Z includes attributes whose values are
    audios, videos, or documents.
  • R(X,Y) ⋉ S(Y,Z) is smaller than R.
  • That is, R must contain many dangling tuples in
    its join with S.
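
The two conditions above can be turned into a rough communication-cost comparison. The sketch below assumes a simple model that is not in the slides: cost is the number of bytes shipped, the naive plan ships the smaller of the two relations, and the cost of delivering the final result is ignored. All function names and sizes are hypothetical.

```python
def naive_cost(size_r, size_s):
    """Bytes shipped when one whole relation is copied to the other site
    (assumes the smaller relation is the one that is sent)."""
    return min(size_r, size_s)


def semijoin_cost(size_proj_y_s, size_reduced_r):
    """Bytes shipped by the semijoin plan: the projection of S onto Y goes to
    the site of R, then the reduced R goes to the site of S."""
    return size_proj_y_s + size_reduced_r


def semijoin_is_better(size_r, size_s, size_proj_y_s, size_reduced_r):
    """True when the semijoin plan ships fewer bytes than the naive plan."""
    return semijoin_cost(size_proj_y_s, size_reduced_r) < naive_cost(size_r, size_s)


# Case 1: small join keys and many dangling tuples in R.
many_dangling = semijoin_is_better(
    size_r=1_000_000, size_s=5_000_000,
    size_proj_y_s=50_000, size_reduced_r=100_000,
)

# Case 2: almost every R-tuple joins, so the extra round trip is wasted.
few_dangling = semijoin_is_better(
    size_r=1_000_000, size_s=5_000_000,
    size_proj_y_s=50_000, size_reduced_r=990_000,
)
```

In the first case the semijoin plan ships 150,000 bytes against 1,000,000 for the naive plan. In the second it ships 1,040,000 bytes, so copying R outright is cheaper.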