Using Space and Attribute Partitioned Partial Replicas for Data Subsetting and Aggregation Queries

1
Using Space and Attribute Partitioned Partial
Replicas for Data Subsetting and Aggregation
Queries

Li Weng, Umit Catalyurek, Tahsin Kurc,
Gagan Agrawal, Joel Saltz

2
Motivation: Data-Driven Science
Oil Reservoir Management
Magnetic Resonance Imaging

Data-driven applications from science,
engineering, biomedicine
Large spatio-temporal datasets
Several attributes at each point
3
Replication of Scientific Datasets

A variety of queries on the same dataset
Each requires a different spatio-temporal region
and subset of attributes
No single chunking and indexing strategy can optimize
for all of them
Replication: create multiple copies
Use different chunking and indexing schemes
Large storage overhead

4
Partial Replication

Can we get the benefits of replication without the
large overheads?
Not all attributes are accessed uniformly
Not all spatio-temporal regions are accessed with
uniform probability
Partial replication: each replica has
Only a subset of attributes (attribute
partitioned) and/or
Only a rectilinear spatio-temporal region (space
partitioned)
Challenge
No single partial replica may be able to answer
the query
Can we choose and combine partial replicas to
optimize query processing?

5
Prior Work (CCGRID 05)

Query planning with partial replicas
Cost models
Greedy selection algorithm
Considered only space partitioned replicas
Considered SQL SELECT queries
Implemented as an extension to Automatic Data
Virtualization System (HPDC 04)

6
Contributions

Support combined use of space and attribute
partitioned partial replicas
Dynamic programming algorithm for selecting the
best set of attribute partitioned replicas
New greedy strategy for recommending a
combination of replicas
Extend replica selection algorithm to address
queries with aggregations
-- replicas may be unevenly stored across
storage units

7
System Overview
The Replica Selection Module is tightly coupled
with our prior work on supporting SQL SELECT
queries on scientific datasets in a cluster
environment.
8
STORM Runtime System

A middleware to support data selection, data
partitioning, and data transfer operations on
flat-file datasets hosted on a parallel system.
Services
Query service
Data source service
Indexing service
Filtering service
Partition generation service
Data mover service

9
Outline

Introduction
Motivation
Contributions
System overview
Query execution and algorithm design
Uniformly partitioned chunks and select queries
Uneven partitioning and aggregation operation
Experimental results
Related work
Conclusions

10
Uniformly Partitioned Chunks and Select Queries

Computing Goodness Value
$goodness = \text{useful data}_{\text{per-chunk}} / cost_{\text{per-chunk}}$
Chunk: an atomic unit in space partitioned
replicas or a logical unit in attribute partitioned
replicas
Full chunks and partial chunks of a partial
replica
$cost_{\text{per-chunk}} = t_{read} \times n_{read} + t_{seek}$
$t_{read}$: average read time for a disk page
$n_{read}$: number of pages fetched
$t_{seek}$: average seek time
Fragment
An intermediate unit between a replica and its
chunks
A group of full or partial chunks having the same
goodness value in a replica
$goodness_{\text{per-fragment}} = \text{useful data}_{\text{per-fragment}} / cost_{\text{per-fragment}}$
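
The goodness definitions above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it assumes the per-chunk cost is the page read time multiplied by the pages fetched, plus one seek, and that a fragment's cost is the sum of its chunks' costs. The class `Chunk`, the function names, and the timing constants are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical timing constants in seconds; real values would be measured.
T_READ = 0.0001  # t_read: average read time for one disk page
T_SEEK = 0.008   # t_seek: average seek time


@dataclass
class Chunk:
    useful_bytes: int  # bytes in the chunk that the query actually needs
    n_read: int        # n_read: number of disk pages fetched for the chunk


def chunk_cost(chunk):
    # Assumed form: read time for all fetched pages plus one seek.
    return T_READ * chunk.n_read + T_SEEK


def chunk_goodness(chunk):
    return chunk.useful_bytes / chunk_cost(chunk)


def fragment_goodness(chunks):
    # Assumption: a fragment's cost is the sum of its chunks' costs.
    useful = sum(c.useful_bytes for c in chunks)
    cost = sum(chunk_cost(c) for c in chunks)
    return useful / cost


full = Chunk(useful_bytes=8192, n_read=2)     # every byte read is useful
partial = Chunk(useful_bytes=2048, n_read=2)  # same I/O, less useful data
print(chunk_goodness(full) > chunk_goodness(partial))  # True
```

A partial chunk costs as much to read as a full one but yields less useful data, which is why its goodness value is lower.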

11
An Example Query and Intersecting Replicas

Replica 1
3 full chunks and 2 partial chunks
3 fragments
Composite Replica 2
10 full chunks
1 fragment

12
General Structure of Replica Selection Algorithm
13
Dynamic Programming Algorithm
R: a group of attribute-partitioned replicas
R: the optimal combination output
l: the number of referred attributes in Q
M[1..l]: the referred attribute list

(Flowchart; decision conditions not captured in the transcript)
Calculate Cost[j,j]
Foreach k from 2 to l
Foreach u from 1 to l-k+1
Decision (Yes/No): calculate Cost[u..v], Loc[u..v]->s = -1, Loc[u..v]->r = r1
Decision (Yes/No): calculate Cost[u..v], Loc[u..v]->s = -1, Loc[u..v]->r = r2
Cost[u..v] = ∞
Find q = min(Cost[u..p] + Cost[p+1..v]); Cost[u..v] = q,
Loc[u..v]->s = p, Loc[u..v]->r = -1
Output(Loc[1..l])
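
The recurrence over attribute intervals can be sketched as an interval dynamic program in Python. This is an illustrative reading of the flowchart, not the authors' code: it assumes each replica comes with a single estimated cost for the query, that a replica can serve an interval M[u..v] when it holds every attribute in it, and that replicas hold contiguous runs of the attribute list (otherwise one replica could be counted twice). The function names, replica names, and costs are hypothetical.

```python
import math


def select_attribute_replicas(attrs, replicas):
    """attrs: ordered referred attribute list M[1..l] (0-based here).
    replicas: dict name -> (set of attributes held, estimated cost).
    Returns (total cost, list of (replica name, (u, v)) pieces)."""
    l = len(attrs)
    cost = [[math.inf] * l for _ in range(l)]
    loc = [[None] * l for _ in range(l)]  # (split point s, replica r)
    for k in range(1, l + 1):             # interval length
        for u in range(0, l - k + 1):
            v = u + k - 1
            needed = set(attrs[u:v + 1])
            # Cheapest single replica holding all of M[u..v]: s = -1, r = name.
            for name, (held, c) in replicas.items():
                if needed <= held and c < cost[u][v]:
                    cost[u][v] = c
                    loc[u][v] = (-1, name)
            # Best split into M[u..p] and M[p+1..v]: s = p, r = -1.
            for p in range(u, v):
                q = cost[u][p] + cost[p + 1][v]
                if q < cost[u][v]:
                    cost[u][v] = q
                    loc[u][v] = (p, -1)
    return cost[0][l - 1], _collect(loc, 0, l - 1)


def _collect(loc, u, v):
    entry = loc[u][v]
    if entry is None:          # interval cannot be covered by the replicas
        return []
    split, replica = entry
    if split == -1:
        return [(replica, (u, v))]
    return _collect(loc, u, split) + _collect(loc, split + 1, v)


replicas = {
    "r1": ({"a", "b"}, 4.0),
    "r2": ({"c", "d"}, 5.0),
    "r3": ({"a", "b", "c", "d"}, 12.0),
    "r4": ({"a"}, 3.0),
    "r5": ({"b"}, 3.0),
}
print(select_attribute_replicas(["a", "b", "c", "d"], replicas))
# (9.0, [('r1', (0, 1)), ('r2', (2, 3))])
```

In this example, combining r1 and r2 is cheaper than reading the wide replica r3 or stitching together single-attribute replicas.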
14

Greedy Strategy
Q: an issued query
R: the partial replicas
D: the original dataset
F: all fragments intersecting with the query
boundary
Fmax: the fragment with the maximum goodness
value in F
S: the ordered list of the candidate fragments
in decreasing order of their goodness value

(Flowchart; some decision conditions not captured in the transcript)
Calculate the fragment set F
F is null? (Yes/No)
Append Fmax into S
Decision (Yes/No): subtract the overlap
Re-compute the goodness value
Add D if needed
Output S
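
The greedy loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: fragments are modeled as sets of grid cells already associated with a fixed cost, goodness is the number of still-needed cells divided by that cost, and the cost is assumed not to change when overlap is subtracted. The function name, fragment names, and costs are hypothetical, and "D" stands for the original dataset.

```python
def greedy_select(query_cells, fragments):
    """fragments: dict name -> (iterable of cells, cost).
    Returns S, the chosen fragments in the order they were picked."""
    remaining = set(query_cells)
    # F: fragments that intersect the query, clipped to the query region.
    F = {}
    for name, (cells, cost) in fragments.items():
        useful = set(cells) & remaining
        if useful:
            F[name] = (useful, cost)
    S = []
    while F:
        # Fmax: the fragment with the maximum goodness value in F.
        fmax = max(F, key=lambda n: len(F[n][0]) / F[n][1])
        cells, _ = F.pop(fmax)
        S.append(fmax)
        remaining -= cells
        # Subtract the overlap and drop fragments with nothing left;
        # goodness is re-computed on the next pass through the loop.
        F = {n: (c - cells, k) for n, (c, k) in F.items() if c - cells}
    if remaining:
        S.append("D")  # uncovered cells come from the original dataset
    return S


fragments = {
    "A": (range(0, 6), 2.0),
    "B": (range(4, 10), 3.0),
    "C": (range(0, 4), 4.0),
}
print(greedy_select(range(10), fragments))  # ['A', 'B']
print(greedy_select(range(12), fragments))  # ['A', 'B', 'D']
```

Fragment C is never chosen because A already covers all of its cells at a lower cost.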
15
Uneven Partitioning and Aggregation Operations

Computing Goodness Value
$Goodness(F) = \sum_{p \in P} data(F) / \max_{p \in P} (cost_p(CurLoad) + cost_p(F))$
P: all available storage nodes
CurLoad: current workload across all storage
nodes due to previously chosen candidate replicas
$Cost_{fragment} = t_{read} \times n_{read} + t_{seek} \times n_{seek} + t_{filter} \times n_{filter} + t_{aggr} \times n_{aggr} + t_{trans} \times n_{trans}$
$t_{filter}$: average filtering time for a tuple
$n_{filter}$: number of total tuples in all chunks
$t_{aggr}$: average aggregate computation time for a
tuple
$n_{aggr}$: number of total useful tuples
$t_{trans}$: network transfer time for one unit of
data
$n_{trans}$: the amount of data after the aggregate
operation
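
A short Python sketch of these two formulas follows. It is an illustration under stated assumptions: the useful data and the cost of a fragment are given per storage node, the operators between the cost terms are products and sums, and the bottleneck cost is positive. The function names and all numeric values are hypothetical.

```python
def fragment_cost(n, t):
    """n, t: dicts keyed by 'read', 'seek', 'filter', 'aggr', 'trans',
    holding the counts n_x and the unit times t_x respectively."""
    return sum(t[x] * n[x] for x in ("read", "seek", "filter", "aggr", "trans"))


def workload_goodness(data_per_node, cost_per_node, cur_load):
    """Useful data summed over all nodes, divided by the load of the most
    loaded node once the fragment's cost is added to the current workload."""
    nodes = cur_load.keys()
    total_data = sum(data_per_node.get(p, 0) for p in nodes)
    bottleneck = max(cur_load[p] + cost_per_node.get(p, 0.0) for p in nodes)
    return total_data / bottleneck


cur_load = {"n0": 1.0, "n1": 3.0}  # n1 is already busier than n0
on_busy_node = workload_goodness({"n1": 100}, {"n1": 2.0}, cur_load)
on_idle_node = workload_goodness({"n0": 100}, {"n0": 2.0}, cur_load)
print(on_busy_node)                 # 20.0
print(on_idle_node > on_busy_node)  # True
```

Two fragments with identical data and cost receive different goodness values depending on which node they would load, which is the point of including CurLoad in the denominator.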

16
Workload aware greedy strategy
Q: an issued query
F: the interesting fragment sets
D: the original dataset
F: all fragments intersecting with the query
boundary
Fmax: the fragment with the maximum goodness
value in F
S: the ordered list of the candidate fragments
in decreasing order of their goodness value

(Flowchart; some decision conditions not captured in the transcript)
Foreach Fi in F
Overlap with F-Fi exists? (Yes/No)
No: append Fi into S
F is NULL? (Yes/No)
Calculate the current goodness value for Fi in F
Append Fmax into S
Decision (Yes/No): subtract the overlap
Add D if needed
Output S
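
The workload aware variant can be sketched in Python along the same lines. This is one possible reading of the flowchart, not the authors' code: fragments are sets of grid cells with a per-node cost, fragments that overlap no other fragment are taken first, and the goodness of the rest is re-evaluated against the accumulated per-node load after every pick. Names, costs, and the cell model are hypothetical; every node named in a fragment's cost is assumed to appear in `nodes`, and costs are assumed positive.

```python
def workload_aware_select(query_cells, fragments, nodes):
    """fragments: dict name -> (iterable of cells, {node: cost}).
    Returns (S, cur_load): the chosen fragments and the per-node load."""
    remaining = set(query_cells)
    cur_load = {p: 0.0 for p in nodes}
    F = {}
    for name, (cells, cost) in fragments.items():
        useful = set(cells) & remaining
        if useful:
            F[name] = (useful, cost)
    S = []

    def take(name):
        cells, cost = F.pop(name)
        S.append(name)
        remaining.difference_update(cells)
        for p, c in cost.items():
            cur_load[p] += c
        return cells

    # Fragments that overlap no other fragment are appended directly.
    for name in list(F):
        if all(F[name][0].isdisjoint(F[o][0]) for o in F if o != name):
            take(name)

    def goodness(name):
        cells, cost = F[name]
        return len(cells) / max(cur_load[p] + cost.get(p, 0.0) for p in nodes)

    while F:
        taken = take(max(F, key=goodness))
        # Subtract the overlap from the fragments still under consideration.
        for name in list(F):
            cells, cost = F[name]
            left = cells - taken
            if left:
                F[name] = (left, cost)
            else:
                del F[name]
    if remaining:
        S.append("D")  # uncovered cells come from the original dataset
    return S, cur_load


fragments = {
    "A": (range(0, 6), {"n0": 2.0}),
    "B": (range(4, 10), {"n0": 2.0}),
    "C": (range(4, 10), {"n1": 2.5}),
    "E": (range(10, 12), {"n1": 1.0}),
}
print(workload_aware_select(range(12), fragments, ["n0", "n1"]))
# (['E', 'A', 'C'], {'n0': 2.0, 'n1': 3.5})
```

In this example B is cheaper than C in isolation, but C is chosen because n0 is already loaded by A, so the bottleneck node finishes sooner.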
17
Outline

Introduction
Motivation
Contributions
System overview
Query execution and algorithm design
Uniformly partitioned chunks and select queries
Uneven partitioning and aggregation operation
Experimental results
Related work
Conclusions

18
Experimental Setup and Design

A Linux cluster connected via switched Fast
Ethernet. Each node has a PIII 933MHz CPU, 512 MB
of main memory, and three 100GB IDE disks.
Performance evaluation of the combination of
space-partitioned and attribute-partitioned
replicas, and the benefit of attribute-partitioned
replicas
Scalability test when increasing the number of
nodes hosting the dataset
Performance test when query sizes are
varied
Performance evaluation for aggregate queries with
unevenly partitioned replicas

19
(No Transcript)
20
SELECT attrlist FROM IPARS where RID in 0,1 and
TIME in 1000,1399 and X>0 and X<11
and Y>0 and Y<28 and Z>0 and Z<28
21

attr+space part: the combined use of all replicas
space part: only use the space-partitioned replicas
A run-time optimization

Query
SELECT * FROM IPARS where TIME>1000 and
TIME<1599 and X>0 and X<11
and Y>0 and Y<31 and Z>0 and Z<31
Up to 4 nodes, query execution time scales
linearly.
Because seek cost dominates the total I/O
overhead, execution time is not reduced by half
when using 8 nodes.

Query
SELECT * FROM IPARS where TIME>1000 and
TIME<TIMEVAL and X>0 and X<11
and Y>0 and Y<28 and Z>0 and Z<28
Our algorithm has chosen replicas 1, 3, 4, and 6 out of all
replicas in Table 1.
The query filters 83% of the retrieved data when
using the original dataset only; however,
it needs to filter only about 50% of the retrieved
data in the presence of replicas.

24
Aggregate Queries with Unevenly Partitioned
Replicas
25
Aggregate Queries with Unevenly Partitioned
Replicas
26
Alg: solution by the proposed algorithm
AlgRef: solution after the refinement step
Solution-1, Solution-2: two manually created solutions
27
Related Work

Replication research
Exact copies of portions of data
Data availability and reliability
Multi-disk system with replicated data
Data caching techniques
Using aggregate memory and cooperative caches
Management and replacement of replicas
Our previous work on performance optimization
using space partitioned replicas

28
Conclusions

The proposed cost models are capable of
estimating execution time trends.
The designed greedy strategy, together with the
dynamic programming algorithm, can choose a good
set of candidate replicas that decreases the query
execution time.
Our implementations show good scalability.
When data transfer bandwidth is the limiting
factor, using a combination of space and
attribute partitioned replicas should be
preferred.
