Using Space and Attribute Partitioned Partial Replicas for Data Subsetting and Aggregation Queries

1
Using Space and Attribute Partitioned Partial
Replicas for Data Subsetting and Aggregation
Queries
  • Li Weng, Umit Catalyurek, Tahsin Kurc,
  • Gagan Agrawal, Joel Saltz

2
Motivation: Data-Driven Science
  • Examples: oil reservoir management, magnetic
    resonance imaging
  • Data-driven applications from science,
    engineering, and biomedicine
  • Large spatio-temporal datasets
  • Several attributes at each point

3
Replication of Scientific Datasets
  • A variety of queries on the same dataset
  • Each requires a different spatio-temporal region
    and subset of attributes
  • No single chunking and indexing strategy can
    optimize for all of them
  • Replication: create multiple copies
  • Use different chunking and indexing schemes
  • Large storage overhead

4
Partial Replication
  • Can we get the benefits of replication without
    the large overheads?
  • Not all attributes are accessed uniformly
  • Not all spatio-temporal regions are accessed
    with uniform probability
  • Partial replication: each replica has
  • Only a subset of attributes (attribute
    partitioned) and/or
  • Only a rectilinear spatio-temporal region (space
    partitioned)
  • Challenge:
  • No single partial replica may be able to answer
    the query
  • Can we choose and combine partial replicas to
    optimize query processing?
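
As a minimal sketch of the idea on this slide, a partial replica can be modeled as a set of attributes plus a rectilinear region, and a query can be tested against it. The names `Box`, `PartialReplica`, `covers`, and `is_useful`, as well as the attribute names and bounds, are hypothetical illustrations and are not taken from the system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rectilinear region: inclusive (low, high) bounds per dimension."""
    bounds: tuple  # e.g. ((t_lo, t_hi), (x_lo, x_hi))

    def contains(self, other: "Box") -> bool:
        return all(lo <= olo and ohi <= hi
                   for (lo, hi), (olo, ohi) in zip(self.bounds, other.bounds))

    def intersects(self, other: "Box") -> bool:
        return all(lo <= ohi and olo <= hi
                   for (lo, hi), (olo, ohi) in zip(self.bounds, other.bounds))


@dataclass(frozen=True)
class PartialReplica:
    attributes: frozenset  # attribute partitioned: a subset of all attributes
    region: Box            # space partitioned: a rectilinear region


def covers(replica, query_attrs, query_region):
    """True if this replica alone can answer the query."""
    return (query_attrs <= replica.attributes
            and replica.region.contains(query_region))


def is_useful(replica, query_attrs, query_region):
    """True if the replica holds at least part of the requested data."""
    return (bool(query_attrs & replica.attributes)
            and replica.region.intersects(query_region))


# Hypothetical replicas over (TIME, X); attribute names are made up.
r1 = PartialReplica(frozenset({"SOIL", "SGAS"}), Box(((0, 999), (0, 30))))
r2 = PartialReplica(frozenset({"SOIL", "SGAS", "PWAT"}), Box(((1000, 1999), (0, 30))))
q_attrs = frozenset({"SOIL", "PWAT"})
q_region = Box(((900, 1100), (0, 10)))

single_answer = [covers(r, q_attrs, q_region) for r in (r1, r2)]
candidates = [is_useful(r, q_attrs, q_region) for r in (r1, r2)]
```

In this example neither replica can answer the query alone, yet both intersect it, which is the situation the challenge bullets describe.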

5
Prior Work (CCGRID 05)
  • Query planning with partial replicas
  • Cost models
  • Greedy selection algorithm
  • Considered only space-partitioned replicas
  • Considered SQL SELECT queries
  • Implemented as an extension to the Automatic Data
    Virtualization System (HPDC 04)

6
Contributions
  • Support combined use of space and attribute
    partitioned partial replicas
  • Dynamic programming algorithm for selecting the
    best set of attribute partitioned replicas
  • New greedy strategy for recommending a
    combination of replicas
  • Extend the replica selection algorithm to address
    queries with aggregations
  • Replicas may be unevenly stored across
    storage units

7
System Overview
The Replica Selection Module is tightly coupled
with our prior work on supporting SQL SELECT
queries on scientific datasets in a cluster
environment.

8
STORM Runtime System
  • A middleware to support data selection, data
    partitioning, and data transfer operations on
    flat-file datasets hosted on a parallel system.
  • Services
  • Query service
  • Data source service
  • Indexing service
  • Filtering service
  • Partition generation service
  • Data mover service

9
Outline
  • Introduction
  • Motivation
  • Contributions
  • System overview
  • Query execution and algorithm design
  • Uniformly partitioned chunks and select queries
  • Uneven partitioning and aggregation operation
  • Experimental results
  • Related work
  • Conclusions

10
Uniformly Partitioned Chunks and Select Queries
  • Computing Goodness Value
  • $\text{goodness} = \text{useful data}_{\text{per-chunk}} / \text{cost}_{\text{per-chunk}}$
  • Chunk: an atomic unit in space-partitioned
    replicas or a logical unit in attribute-partitioned
    replicas
  • Full chunks and partial chunks of a partial
    replica
  • $\text{cost}_{\text{per-chunk}} = t_{read} \times n_{read} + t_{seek}$
  • $t_{read}$: average read time for a disk page
  • $n_{read}$: number of pages fetched
  • $t_{seek}$: average seek time
  • Fragment:
  • An intermediate unit between a replica and its
    chunks
  • A group of full or partial chunks having the same
    goodness value in a replica
  • $\text{goodness}_{\text{per-fragment}} = \text{useful data}_{\text{per-fragment}} / \text{cost}_{\text{per-fragment}}$
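
The definitions above can be sketched in a few lines. This is an illustration under stated assumptions: the timing constants are hypothetical (not measurements from the talk), useful data is counted in pages, and the helper names `chunk_cost`, `goodness`, and `group_into_fragments` are made up for the example.

```python
from collections import defaultdict

# Hypothetical timings in milliseconds (assumed values)
T_READ, T_SEEK = 0.1, 8.0


def chunk_cost(n_read, t_read=T_READ, t_seek=T_SEEK):
    """Per-chunk cost: read time for n_read pages plus one seek."""
    return t_read * n_read + t_seek


def goodness(useful_data, cost):
    """Useful data retrieved per unit of cost."""
    return useful_data / cost


def group_into_fragments(chunks):
    """Group (useful_pages, pages_read) chunks that share a goodness value."""
    fragments = defaultdict(list)
    for useful, n_read in chunks:
        g = goodness(useful, chunk_cost(n_read))
        fragments[round(g, 9)].append((useful, n_read))
    return dict(fragments)


# Three full chunks and two partial chunks of 100 pages each
chunks = [(100, 100)] * 3 + [(50, 100), (25, 100)]
fragments = group_into_fragments(chunks)
full = goodness(100, chunk_cost(100))
partial = goodness(25, chunk_cost(100))
```

A partial chunk costs as much to read as a full one but returns less useful data, so its goodness is lower. With these numbers the five chunks fall into three fragments.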

11
An Example Query and Intersecting Replicas
  • Replica 1
  • 3 full chunks and 2 partial chunks
  • 3 fragments
  • Composite Replica 2
  • 10 full chunks
  • 1 fragment

12
General Structure of Replica Selection Algorithm
13
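Dynamic Programming Algorithm
  • R: a group of attribute-partitioned replicas
  • R′: the optimal combination (output)
  • l: the number of referred attributes in Q
  • $M_{1..l}$: the referred attribute list

Flowchart steps (the Yes/No decision labels are
missing from the transcript):
  • Calculate $Cost_{j,j}$
  • Foreach k from 2 to l
  • Foreach u from 1 to l-k+1
  • First decision (Yes/No): calculate $Cost_{u..v}$,
    $Loc_{u..v} \rightarrow s = -1$, $Loc_{u..v} \rightarrow r = r_1$
  • Second decision (Yes/No): calculate $Cost_{u..v}$,
    $Loc_{u..v} \rightarrow s = -1$, $Loc_{u..v} \rightarrow r = r_2$
  • $Cost_{u..v} = \infty$
  • Find $q = \min(Cost_{u..p} + Cost_{p+1..v})$; set $Cost_{u..v} = q$,
    $Loc_{u..v} \rightarrow s = p$, $Loc_{u..v} \rightarrow r = -1$
  • Output($Loc_{1..l}$)
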
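
The recurrence on this slide has the shape of an interval dynamic program. The sketch below is one possible reading, not the authors' implementation: since the flowchart's decision labels are lost, it assumes that a range of attributes is served either by the cheapest single replica holding all of them or by the best split into two sub-ranges. The input format (a list of `(attribute_set, cost)` pairs) and the function name are hypothetical.

```python
import math


def select_attribute_replicas(attrs, replicas):
    """attrs: ordered list of referred attributes M[0..l-1].
    replicas: list of (attribute_set, cost) pairs (assumed input format).
    Returns (total_cost, sorted indices of the chosen replicas)."""
    l = len(attrs)
    cost = [[math.inf] * l for _ in range(l)]
    loc = [[None] * l for _ in range(l)]  # ("r", replica) or ("s", split)

    def best_single(u, v):
        """Cheapest replica that holds every attribute in attrs[u..v]."""
        needed = set(attrs[u:v + 1])
        best = (math.inf, None)
        for i, (have, c) in enumerate(replicas):
            if needed <= have and c < best[0]:
                best = (c, i)
        return best

    for k in range(1, l + 1):          # length of the attribute range
        for u in range(0, l - k + 1):
            v = u + k - 1
            c, i = best_single(u, v)
            if i is not None:
                cost[u][v], loc[u][v] = c, ("r", i)
            for p in range(u, v):      # try every split u..p | p+1..v
                q = cost[u][p] + cost[p + 1][v]
                if q < cost[u][v]:
                    cost[u][v], loc[u][v] = q, ("s", p)

    def collect(u, v, out):
        kind, x = loc[u][v]
        if kind == "r":
            out.add(x)
        else:
            collect(u, x, out)
            collect(x + 1, v, out)
        return out

    if l == 0 or loc[0][l - 1] is None:
        return math.inf, []
    return cost[0][l - 1], sorted(collect(0, l - 1, set()))


# Hypothetical attributes and replica costs
attrs = ["A", "B", "C", "D"]
replicas = [({"A", "B"}, 4.0), ({"C", "D"}, 5.0),
            ({"A", "B", "C", "D"}, 12.0), ({"C"}, 2.0), ({"D"}, 2.5)]
result = select_attribute_replicas(attrs, replicas)  # (8.5, [0, 3, 4])
```

Because ranges are contiguous in the attribute list, this sketch cannot jointly use one replica for two non-adjacent attributes; that limitation is a property of the simplification here, not a claim about the original algorithm.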
14
Greedy Strategy
  • Q: an issued query
  • R: the partial replicas
  • D: the original dataset
  • F: all fragments intersecting with the query
    boundary
  • Fmax: the fragment with the maximum goodness
    value in F
  • S: the ordered list of the candidate fragments
    in decreasing order of their goodness value

Flowchart steps:
  • Calculate the fragment set F
  • F is null? If no, append Fmax into S
  • Subtract the overlap (guarded by a second Yes/No
    decision whose label is missing from the transcript)
  • Re-compute the goodness value and repeat
  • If F is null: add D if needed, then output S

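
The loop above can be sketched as follows. This is a simplified illustration, assuming each fragment is described by the set of data units it contributes to the query and a fixed retrieval cost; the dictionary format, the fragment names, and the choice to keep a fragment's cost unchanged after its overlap is subtracted are all assumptions of this sketch.

```python
def greedy_select(query_units, fragments):
    """query_units: set of data units the query needs.
    fragments: dict name -> (set of units held, cost); assumed format.
    Returns the ordered candidate list S."""
    remaining = {n: (set(u) & query_units, c) for n, (u, c) in fragments.items()}
    uncovered = set(query_units)
    S = []
    while True:
        # Drop fragments that no longer contribute useful data
        remaining = {n: (u, c) for n, (u, c) in remaining.items() if u}
        if not remaining:              # "F is null?"
            break
        # Fmax: fragment with the maximum goodness = useful data / cost
        f_max = max(remaining, key=lambda n: len(remaining[n][0]) / remaining[n][1])
        units, _ = remaining.pop(f_max)
        S.append(f_max)
        uncovered -= units
        # Subtract the overlap; goodness is re-computed on the next pass
        remaining = {n: (u - units, c) for n, (u, c) in remaining.items()}
    if uncovered:                      # "Add D if needed"
        S.append("D")
    return S


# Hypothetical fragments from two replicas
query_units = set(range(10))
fragments = {
    "R1.f1": ({0, 1, 2, 3}, 2.0),
    "R1.f2": ({3, 4, 5}, 3.0),
    "R2.f1": ({2, 3, 4, 5, 6, 7}, 4.0),
}
S = greedy_select(query_units, fragments)  # ['R1.f1', 'R2.f1', 'D']
```

After `R1.f1` is chosen, the overlap shrinks the other fragments, `R2.f1` becomes the best remaining choice, and the original dataset `D` is appended for the units no replica holds.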
15
Uneven Partitioning and Aggregation Operations
  • Computing Goodness Value
  • $\text{Goodness}(F) = \sum_{p \in P} \text{data}(F) \,/\, \max_{p \in P} \left( \text{cost}_p(\text{CurLoad}) + \text{cost}_p(F) \right)$
  • P: all available storage nodes
  • CurLoad: current workload across all storage
    nodes due to previously chosen candidate replicas
  • $\text{cost}_{\text{fragment}} = t_{read} \times n_{read} + t_{seek} \times n_{seek} + t_{filter} \times n_{filter} + t_{aggr} \times n_{aggr} + t_{trans} \times n_{trans}$
  • $t_{filter}$: average filtering time for a tuple
  • $n_{filter}$: number of total tuples in all chunks
  • $t_{aggr}$: average aggregate computation time for a
    tuple
  • $n_{aggr}$: number of total useful tuples
  • $t_{trans}$: network transfer time for one unit of
    data
  • $n_{trans}$: the amount of data after the aggregate
    operation
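
A small sketch of these two formulas is given below. It assumes that both the useful data and the cost of a fragment are available per storage node; the dictionary-based inputs, the function names, and all numeric values are hypothetical.

```python
def fragment_cost(n_read, n_seek, n_filter, n_aggr, n_trans, t):
    """Fragment cost as the sum of read, seek, filter, aggregate and transfer terms."""
    return (t["read"] * n_read + t["seek"] * n_seek + t["filter"] * n_filter
            + t["aggr"] * n_aggr + t["trans"] * n_trans)


def workload_goodness(data_per_node, cost_per_node, cur_load):
    """Total useful data divided by the finishing time of the slowest node."""
    nodes = cur_load.keys()
    total_data = sum(data_per_node.get(p, 0) for p in nodes)
    slowest = max(cur_load[p] + cost_per_node.get(p, 0.0) for p in nodes)
    return total_data / slowest


# Hypothetical per-unit timings (assumed values)
t = {"read": 0.1, "seek": 8.0, "filter": 0.001, "aggr": 0.002, "trans": 0.05}
cost = fragment_cost(n_read=100, n_seek=2, n_filter=1000, n_aggr=500, n_trans=10, t=t)

# The same amount of data, stored on a busy node versus a lightly loaded node
cur_load = {"n0": 10.0, "n1": 2.0}
on_busy = workload_goodness({"n0": 100}, {"n0": 5.0}, cur_load)
on_idle = workload_goodness({"n1": 100}, {"n1": 5.0}, cur_load)
```

Taking the maximum over nodes means a fragment on an already loaded node is penalized, since the slowest node determines when the query finishes.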

16
Workload-Aware Greedy Strategy
  • Q: an issued query
  • F: the interesting fragment sets
  • D: the original dataset
  • F: all fragments intersecting with the query
    boundary
  • Fmax: the fragment with the maximum goodness
    value in F
  • S: the ordered list of the candidate fragments
    in decreasing order of their goodness value

Flowchart steps:
  • Foreach Fi in F: overlap with F - Fi exists?
    If no, append Fi into S
  • F is NULL? If no, calculate the current goodness
    value for each Fi in F and append Fmax into S
  • Subtract the overlap (guarded by a second Yes/No
    decision whose label is missing from the transcript)
  • If F is NULL: add D if needed, then output S

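
One way to read this flowchart is sketched below, under several assumptions: fragments are given as a set of useful data units plus a per-node cost, goodness is the useful data divided by the load of the busiest node after adding the fragment, and fragments that overlap nothing else are taken first. The input format and names are hypothetical.

```python
def workload_aware_greedy(query_units, fragments, nodes):
    """fragments: dict name -> (set of units held, {node: cost}); assumed format.
    Returns (ordered candidate list S, final per-node load)."""
    load = {p: 0.0 for p in nodes}
    remaining = {n: (set(u) & query_units, dict(c)) for n, (u, c) in fragments.items()}
    remaining = {n: v for n, v in remaining.items() if v[0]}
    uncovered = set(query_units)
    S = []

    def take(name):
        units, cost = remaining.pop(name)
        S.append(name)
        uncovered.difference_update(units)
        for p, c in cost.items():
            load[p] += c               # update CurLoad
        return units

    def goodness(name):
        units, cost = remaining[name]
        return len(units) / max(load[p] + cost.get(p, 0.0) for p in nodes)

    # Pass 1: a fragment that overlaps no other fragment goes straight into S
    for name in list(remaining):
        units = remaining[name][0]
        if all(units.isdisjoint(remaining[m][0]) for m in remaining if m != name):
            take(name)

    # Pass 2: pick the best fragment under the current load, subtract overlap
    while remaining:
        f_max = max(remaining, key=goodness)
        units = take(f_max)
        remaining = {n: (u - units, c) for n, (u, c) in remaining.items() if u - units}

    if uncovered:                      # "Add D if needed"
        S.append("D")
    return S, load


# Hypothetical fragments: B and C hold the same data on different nodes
nodes = ["n0", "n1"]
query_units = set(range(12))
fragments = {
    "A": ({0, 1, 2, 3}, {"n0": 4.0}),
    "B": ({4, 5, 6, 7}, {"n0": 4.0}),
    "C": ({4, 5, 6, 7}, {"n1": 5.0}),
    "E": ({8, 9}, {"n1": 1.0}),
}
S, load = workload_aware_greedy(query_units, fragments, nodes)
```

Here `C` is chosen over the cheaper `B` because node `n0` is already busy serving `A`; a load-unaware goodness would have preferred `B`.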
17
Outline
  • Introduction
  • Motivation
  • Contributions
  • System overview
  • Query execution and algorithm design
  • Uniformly partitioned chunks and select queries
  • Uneven partitioning and aggregation operation
  • Experimental results
  • Related work
  • Conclusions

18
Experimental Setup and Design
  • A Linux cluster connected via switched Fast
    Ethernet. Each node has a PIII 933 MHz CPU, 512 MB
    of main memory, and three 100 GB IDE disks.
  • Performance evaluation of the combination of
    space-partitioned and attribute-partitioned
    replicas, and the benefit of attribute-partitioned
    replicas
  • Scalability test when increasing the number of
    nodes hosting dataset
  • Performance test when data query sizes are
    varied
  • Performance evaluation for aggregate queries with
    unevenly partitioned replicas.

19
(No Transcript)
20
SELECT attrlist FROM IPARS WHERE RID in 0,1 and
TIME in 1000,1399 and X>=0 and X<=11
and Y>=0 and Y<=28 and Z>=0 and Z<=28
21
  • attr+space part: the combined use of all replicas
  • space part: only use the space-partitioned
    replicas
  • A run-time optimization

22
  • Query:
  • SELECT * FROM IPARS WHERE TIME>=1000 and
    TIME<=1599 and X>=0 and X<=11
    and Y>=0 and Y<=31 and Z>=0 and Z<=31
  • Up to 4 nodes, query execution time scales
    linearly.
  • With 8 nodes, execution time is not reduced by
    half, because seek cost dominates the total I/O
    overhead.

23
  • Query:
  • SELECT * FROM IPARS WHERE TIME>=1000 and
    TIME<=TIMEVAL and X>=0 and X<=11
    and Y>=0 and Y<=28 and Z>=0 and Z<=28
  • Our algorithm chose replicas 1, 3, 4, and 6 out
    of all replicas in Table 1.
  • The query filters out 83% of the retrieved data
    when using the original dataset only; with
    replicas, it needs to filter out only about 50%
    of the retrieved data.
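
To see what these filtering percentages imply for I/O volume, the short calculation below converts a filtered fraction into the amount of data retrieved per unit of useful data. It assumes the same useful result in both cases; the function name is made up for illustration.

```python
def retrieved_per_useful(filtered_fraction):
    """Units of data read for each unit of data that survives filtering."""
    return 1.0 / (1.0 - filtered_fraction)


original_only = retrieved_per_useful(0.83)   # about 5.9 units read per useful unit
with_replicas = retrieved_per_useful(0.50)   # 2 units read per useful unit
ratio = with_replicas / original_only        # about 0.34
```

Under that assumption, the replica-based plan retrieves roughly a third of the data volume that the original dataset alone would require.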

24
Aggregate Queries with Unevenly Partitioned
Replicas
25
Aggregate Queries with Unevenly Partitioned
Replicas
26
  • Alg: solution by the proposed algorithm
  • Alg+Ref: solution after the refinement step
  • Solution-1 and Solution-2: two manually created
    solutions

27
Related Work
  • Replication research
  • Exact copies of portions of data
  • Data availability and reliability
  • Multi-disk system with replicated data
  • Data caching techniques
  • Using aggregate memory and cooperative caches
  • Management and replacement of replicas
  • Our previous work on performance optimization
    using space partitioned replicas

28
Conclusions
  • The proposed cost models are capable of
    estimating execution time trends.
  • The greedy strategy, together with the dynamic
    programming algorithm, can choose a good set of
    candidate replicas that decreases the query
    execution time.
  • Our implementations show good scalability.
  • When data transfer bandwidth is the limiting
    factor, using a combination of space and
    attribute partitioned replicas should be
    preferred.