Optimal Distributed Declustering using Replication

Description:

Declustering data over multiple disks to improve performance for range queries ... Golden Ratio Sequences (GRS) [Bhatia et al, 2000] ICDT 2005. 6. Other schemes ...

Transcript and Presenter's Notes

1
Optimal Distributed Declustering using Replication
  • Keith Frikken
  • Purdue University
  • Jan 5, 2005

2
Declustering Data
  • Declustering data over multiple disks to improve
    performance for range queries has been well
    studied
  • Applications include:
    • Spatio-temporal databases
    • Image and video data
    • Scientific simulation datasets

3
Goal
  • Divide data uniformly along dimensions to create
    tiles
  • Put records contained in each tile on different
    disks so that I/O can be parallelized
  • Assumptions:
    • Data can be tiled in such a way
    • Disks have constant retrieval times
  • Assigning tiles to disks is similar to a coloring
    problem (disks are colors)
  • A range query is answered optimally if the number
    of I/O retrievals from any single disk is at most
    $\lceil (\text{number of tiles}) / (\text{number of disks}) \rceil$
  • Two approaches:
    • Coloring schemes
    • Replication

4
Notations
  • k is the number of disks
  • m is the number of tiles in a query
  • r is the level of replication (here, r = 2)
  • Q is the set of all range queries
  • ret(q) is the actual retrieval time of query q
  • The optimal retrieval time for a query q is
    $o_q = \lceil m/k \rceil$
  • The additive error is
    $e = \max_{q \in Q} (\mathrm{ret}(q) - o_q)$

5
Coloring schemes
  • Disk Modulo (DM) [Du and Sobolewski, 1982]
  • Fieldwise XOR (FX) [Kim and Pramanik, 1988]
  • Cyclic Schemes (RPHM, GFIB, EXH) [Prabhakar et
    al, 1998]
  • Golden Ratio Sequences (GRS) [Bhatia et al,
    2000]
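For orientation, the first three schemes are often written as simple functions of the tile coordinates in two dimensions. The sketch below uses those commonly stated forms; the function names, the helper `color_grid`, and the parameter `h` are illustrative assumptions. The cyclic variants (RPHM, GFIB, EXH) differ in how the multiplier `h` is chosen, which is not modeled here, and GRS is omitted.

```python
def disk_modulo(x, y, k):
    """Disk Modulo: tile (x, y) goes to disk (x + y) mod k."""
    return (x + y) % k


def fieldwise_xor(x, y, k):
    """Fieldwise XOR: tile (x, y) goes to disk (x XOR y) mod k."""
    return (x ^ y) % k


def cyclic(x, y, k, h):
    """Cyclic scheme: tile (x, y) goes to disk (x + h * y) mod k.

    h is a multiplier; how it is picked depends on the variant.
    """
    return (x + h * y) % k


def color_grid(scheme, n):
    """Return the n x n grid of disk numbers; grid[y][x] is tile (x, y)."""
    return [[scheme(x, y) for x in range(n)] for y in range(n)]


# Example: print a 5 x 5 cyclic allocation with k = 5 and h = 2.
for row in color_grid(lambda x, y: cyclic(x, y, 5, 2), 5):
    print(" ".join(str(c) for c in row))
```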

6
Other schemes
  • [Atallah and Prabhakar, 2000] developed a scheme
    for two-dimensional grids with $k = 2^n$ disks that
    has an additive error of $O(\log k)$
  • [Sinha et al, 2001] proved lower bounds on the
    additive error of $\Omega(\log k)$ for 2 dimensions
    and $\Omega(\log^{(d-1)/2} k)$ for $d > 2$ dimensions
  • [Chen and Cheng, 2002] showed that an additive
    error of $O(\log^{d-1} k)$ is achievable for any
    number of dimensions $d > 2$

7
Replication
  • Placing records on multiple disks can further
    improve performance of declustering schemes
  • Two problems:
    • How to schedule a query (i.e., which tiles are
      retrieved from each disk)
    • How to use replication to balance load
  • Approaches:
    • Chained Declustering [Hsiao and DeWitt, 1990]
    • Random Duplicate Allocation (RDA) [Sanders et al,
      2000], [Sanders, 2001], and [Czumaj and
      Scheideler, 2003]

8
Replication Results
  • Chained Declustering
    • Fast scheduling algorithm: O(mk) time to test
      whether a specific retrieval time is possible
      [Aerts et al, 2000]
  • RDA
    • If $m \ge ck \log k$, then optimal with high
      probability [Czumaj and Scheideler, 2003]
    • Fast scheduling algorithm: O(?kO(1)) time
      [Czumaj and Scheideler, 2003]
  • Hybrid techniques [Chen and Cheng, 2002]
    • Use GRS with a second random disk

9
Our Results
  • We define a new class of schemes called the shift
    schemes
  • Deterministic
  • Any query with at least k(k-1)e tiles can be
    answered in an optimal fashion
  • Queries can be scheduled in O(mk(log e)) time
  • If a single disk fails, then any query with at
    least k(k-1)e tiles can be answered optimally
  • Experimental performance similar to RDA (better
    for many cases)

10
Shift Scheme Definition
  • Use any strong coloring scheme
  • Use a modified chained declustering
  • Defined by a shift value s (where $\gcd(s, k) = 1$)
  • Base scheme is defined by a function $f(x, y)$
  • Second color is $(f(x, y) + s) \bmod k$
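The definition above can be sketched in a few lines of Python. With k = 5 and s = 3 the output matches the 5 x 5 example grid on the following slide, provided the base scheme is f(x, y) = (x + 2y) mod 5. That base function is inferred from the grid and is not stated in the presentation; the names `shift_scheme` and `render` are likewise illustrative.

```python
from math import gcd


def shift_scheme(f, k, s):
    """Build a function mapping tile (x, y) to (primary, secondary) disks.

    f is the base coloring; s is the shift value and must be coprime to k.
    """
    if gcd(s, k) != 1:
        raise ValueError("shift value s must satisfy gcd(s, k) = 1")

    def colors(x, y):
        primary = f(x, y) % k
        secondary = (primary + s) % k
        return primary, secondary

    return colors


def render(colors, n):
    """Return n text rows, each listing primary,secondary for x = 0..n-1."""
    return [" ".join("%d,%d" % colors(x, y) for x in range(n))
            for y in range(n)]


# Assumed base scheme, inferred from the example grid: f(x, y) = x + 2y.
colors = shift_scheme(lambda x, y: x + 2 * y, k=5, s=3)
for line in render(colors, 5):
    print(line)
# First row printed: 0,3 1,4 2,0 3,1 4,2
```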

11
Shift Scheme Definition

Example with k = 5 and s = 3; each tile lists its primary,secondary disk:
0,3 1,4 2,0 3,1 4,2
2,0 3,1 4,2 0,3 1,4
4,2 0,3 1,4 2,0 3,1
1,4 2,0 3,1 4,2 0,3
3,1 4,2 0,3 1,4 2,0
12
Scheduling
  • A modification of the chained declustering
    scheduling algorithm can schedule queries in
    O(mk(log e)) time
  • Essentially, use the previous algorithm to test
    whether a specific load is possible, and binary
    search over the possible loads
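The test-then-binary-search idea can be sketched as follows. This is a simplified stand-in written for illustration, not necessarily the algorithm of Aerts et al. or the one used in the presentation. It assumes the input is a list `t` where `t[i]` is the number of query tiles whose primary disk is i, and that each tile may instead be read from disk (i + s) mod k. The names `feasible` and `min_max_load` are hypothetical.

```python
from math import gcd


def feasible(t, s, load):
    """Can every disk end up with at most `load` tiles?

    Walks the cycle 0, s, 2s, ... (mod k). For each candidate number of
    tiles shifted off the first disk, every later disk shifts the fewest
    tiles that keeps it within `load`. Runs in O(t[0] * k) <= O(m * k).
    """
    k = len(t)
    order = [(j * s) % k for j in range(k)]
    first = order[0]
    for d_first in range(t[first] + 1):
        d_prev, ok = d_first, True
        for disk in order[1:]:
            d = max(0, d_prev + t[disk] - load)  # minimal shift needed
            if d > t[disk]:                      # cannot shift received tiles
                ok = False
                break
            d_prev = d
        # The first disk receives what the last disk in the cycle shifted.
        if ok and d_prev + t[first] - d_first <= load:
            return True
    return False


def min_max_load(t, s):
    """Binary search for the smallest feasible maximum load."""
    k = len(t)
    if gcd(s, k) != 1:
        raise ValueError("shift value s must satisfy gcd(s, k) = 1")
    lo, hi = -(-sum(t) // k), max(t)  # ceil(m / k) up to the unshifted maximum
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(t, s, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Four tiles all on disk 0 can only be split between disk 0 and disk 1.
print(min_max_load([4, 0, 0, 0, 0], 1))  # prints 2
```

Shifting the minimum at each disk is safe because a smaller shift never makes a later disk's constraint harder to satisfy.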

13
Bound(1)
  • There are k disks ($D_0, \ldots, D_{k-1}$)
  • Disk $D_i$ has $t_i$ tiles initially (as the primary
    disk)
  • The number of tiles is $m = t_0 + \cdots + t_{k-1}$
  • $D_i$ shifts $d_i$ tiles to $D_{i+1}$ (indices mod k)
  • $d_i \le t_i$
  • The goal is to minimize the most tiles at a disk,
    i.e., $\max_{0 \le i \le k-1} (d_{i-1} + t_i - d_i)$
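The objective can be made concrete with a small helper that, given the initial counts and a choice of shifts, returns the resulting load on each disk. This is a minimal sketch; the name `loads_after_shift` is illustrative, and it additionally assumes shifts are non-negative.

```python
def loads_after_shift(t, d):
    """Load on each disk when disk i shifts d[i] of its t[i] tiles to disk i+1.

    Disk i keeps t[i] - d[i] tiles and receives d[i-1]; indices wrap mod k.
    """
    k = len(t)
    if len(d) != k or any(not 0 <= d[i] <= t[i] for i in range(k)):
        raise ValueError("each shift must satisfy 0 <= d[i] <= t[i]")
    # d[i - 1] with i = 0 is d[-1], i.e. the last disk, giving the wraparound.
    return [d[i - 1] + t[i] - d[i] for i in range(k)]


loads = loads_after_shift([3, 1, 2], [1, 0, 0])
print(loads, max(loads))  # prints [2, 2, 2] 2
```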

14
Bound(2)
  • Recall:
    • $o = \lceil m/k \rceil$
    • $\max_{0 \le i \le k-1} t_i \le o + e$
  • Suppose $m \ge k(k-1)e$
  • Then:
    • $o \ge (k-1)e$
    • Surplus ( ) is bounded by $(k-1)e$
    • $\max_{0 \le i \le k-1} d_i \le (k-1)e \le o$
  • Two cases:
    • If a disk has a surplus
    • If a disk has a shortage

15
32 disks
16
64 disks
17
128 disks
18
32 disks, 3 dimensions
19
Generalizations
  • Permutations
  • Higher levels of replication
  • Survivability
    • If the level of replication is r, can handle any
      r-1 failures
    • When r = 2 and a single disk fails:
      • Fast scheduling is still possible
      • Large queries are still optimal

20
Summary
  • Shift schemes are a new class of schemes
  • Optimal for large enough queries
  • Efficient scheduling algorithm
  • Resilient to disk failures
  • Future Work
    • Better analysis of scheme
    • Choosing shift values