Title: Parallel Evaluation of Composite Aggregate Queries
1Parallel Evaluation of Composite Aggregate Queries
- by Chen, Olston, Ramakrishnan
- http://www.cs.brandeis.edu/%7Eolga/cs228/Reading%20List_files/icde08.pdf
Joel Freeman, Cosi 228
2Outline
- Background
- Motivation
- Data Distribution Scheme
- Distribution Key
- Sliding Windows
- Optimizations
- Evaluation Results
6Background
- Correlated Query
- In SQL, has an inner query that uses values from the outer query
- Inner query is evaluated once for each row of the outer query
7Background
- Simple Example
- Consider two relations
- Budget(company, budget)
- Expenses(company, expenses)
8Background
- Simple Example
- SELECT company
- FROM Budget AS B
- WHERE budget < (SELECT expenses
- FROM Expenses
- WHERE company = B.company)
- Result: Google
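The "once for each row of the outer query" behavior of the correlated query can be sketched as a nested loop. This is only an illustration: the table contents below are hypothetical (the slides do not list them), and `over_budget` is a made-up helper name.

```python
# Hypothetical sample rows; the slides do not show the actual table contents.
budget = [("Google", 100), ("Yahoo", 200), ("Amazon", 150)]
expenses = [("Google", 120), ("Yahoo", 180), ("Amazon", 150)]


def over_budget(budget_rows, expense_rows):
    """Return companies whose budget is strictly below their expenses."""
    result = []
    for company, amount in budget_rows:  # outer query: one iteration per row
        # Inner query, re-evaluated using the current outer row's company.
        inner = [e for c, e in expense_rows if c == company]
        # Assumption: the scalar subquery yields exactly one value per company.
        if len(inner) == 1 and amount < inner[0]:
            result.append(company)
    return result


print(over_budget(budget, expenses))
```

A real database engine would usually rewrite such a query as a join rather than literally looping, but the loop mirrors the semantics described on the slide.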
26Background
- Correlated Aggregate Query
- For each tuple in the outer query, perform aggregation in the inner query
- SELECT name
- FROM Student AS s1
- WHERE age > (SELECT AVG(age)
- FROM Student
- WHERE department = s1.department)
- Result: Jane
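A minimal sketch of the correlated aggregate query, assuming a small hypothetical Student table (the rows and the helper name `older_than_department_average` are invented for illustration). For each outer tuple, the inner aggregation is recomputed over that tuple's department.

```python
from statistics import mean

# Hypothetical rows (name, age, department); not taken from the slides.
students = [
    ("Jane", 24, "history"),
    ("Bob", 20, "history"),
    ("Ann", 22, "physics"),
    ("Tom", 22, "physics"),
]


def older_than_department_average(rows):
    """Return names of students older than the average age in their department."""
    result = []
    for name, age, dept in rows:  # outer query
        # Inner aggregate query, correlated on the outer tuple's department.
        dept_avg = mean(a for _, a, d in rows if d == dept)
        if age > dept_avg:
            result.append(name)
    return result


print(older_than_department_average(students))
```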
36Background
- Cube Space
- A different way of looking at relational data
- Example: search engine logs
39Background
- Cube Space
- Schema
- (keyword, pageCount, adCount, time)
- E.g. (dogs, 9, 1, 530)
43Background
- Cube space grid: each cell holds (pageCount, adCount)
keyword | time 501 | time 502 | time 612
hippo | (0, 0) | (0, 1) | (3, 3)
dogs | (5, 0) | (6, 3) | (1, 0)
data | (9, 1) | (2, 0) | (2, 0)
44Background
- Figure: the keyword-by-time grid, with one area labeled "region"
45Background
- 2 dimension attributes (keyword, time), 2 measure attributes (pageCount, adCount)
47Background
- Imagine a 3rd dimension attribute, Country, for the country of origin of the query. Hence "cube space".
48Background
- Many data points per keyword per minute: the figure adds extra tuples such as (9, 0), (2, 6), (3, 0), (3, 0) to the grid
49Background
- Bags, not sets: duplicates such as (3, 0), (3, 0) are kept
53Background
- Composite Subset Measure Query
- A complex query made up of multiple correlated aggregate queries
- Has multiple steps (components)
- Applied to regions in cube space
54Background
- How are composite subset measure queries unique?
- See citation [4] from the paper
55Background
- Figure: the keyword-by-time grid, with several tuples in some cells
57Background
- Typical GROUP BY operation on time
58Background
- Fixed region granularity (time is in mins)
60Background
- New model: regions grouped by hour as well
61Background
- What does a composite subset measure query look
like?
65Background
- Example
- (keyword, pageCount, adCount, time)
- M1: For every minute and keyword, find the median of page count.
- M2: For every minute and keyword, find the median of ad count.
- M3: For every minute and keyword, find the ratio between M1 and M2.
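The three components M1, M2 and M3 can be sketched on a single machine as follows. The records are hypothetical, `composite_measures` is an invented helper name, and returning `None` when M2 is zero is an assumption of this sketch (the slides do not say how a zero median is handled).

```python
from collections import defaultdict
from statistics import median

# Hypothetical log records: (keyword, pageCount, adCount, time in minutes).
records = [
    ("dogs", 5, 2, 501),
    ("dogs", 9, 4, 501),
    ("dogs", 6, 3, 502),
    ("hippo", 3, 0, 501),
]


def composite_measures(rows):
    """Compute (M1, M2, M3) for every (keyword, minute) region."""
    regions = defaultdict(list)
    for keyword, page_count, ad_count, minute in rows:
        regions[(keyword, minute)].append((page_count, ad_count))

    out = {}
    for region, values in regions.items():
        m1 = median(p for p, _ in values)   # M1: median page count
        m2 = median(a for _, a in values)   # M2: median ad count
        m3 = m1 / m2 if m2 != 0 else None   # M3: ratio, undefined when M2 is 0
        out[region] = (m1, m2, m3)
    return out


measures = composite_measures(records)
print(measures[("dogs", 502)])
```

M3 depends on the results of M1 and M2, which is what makes the query composite.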
69Background
- Composite: multiple steps
- Subset: over subsets of dimensions
- Measure: aggregating measures
- Queries
75Motivation
- Search Engine logs can be pretty big
- What if we want to analyze TBs of data?
- E.g. Yahoo targeted advertising
- Amazon item recommendations
77Motivation
- We often use composite subset measure queries
- How might we parallelize?
80Motivation
- Parallelizing: send each partition to a different node
81Motivation
- Parallelizing: how do we decide how to distribute our partitions?
- Figure: the keyword-by-time grid from the Background section
84Data Distribution
- Idea: compute each component of the composite query in parallel
87Data Distribution
- Problem
- Composite subset measure queries often use the same data for different components
- For each query component, we would need to repartition data
89Data Distribution
- Remember?
- (keyword, linksCount, adsCount, time)
- M1: For every minute and keyword, find the median of links count.
- M2: For every hour and keyword, find the median of ads count.
- Parallelizing each individual component means distributing data twice
90Data Distribution
- To compute M3, we need to join the results of M1 and M2, which may be large
92Data Distribution
- Solution
- Every aggregation is carried out at only one node
- All data required for the final aggregation must be present at one node
93Data Distribution
- Solution
- Localized Aggregation
94Data Distribution
- Solution
- In our example, we would partition data into the regions for M3, and perform M1, M2, and M3 all at one node
99Data Distribution
- Distribution process
- Each mapper is allocated part of the dataset
- Produces key/value pairs
- Each reducer is assigned responsibility for certain keys
- Data is distributed to reducers, which carry out the computation
101Data Distribution
- Example
- Schema: (university, dept, enrollment)
- e.g. (brandeis, history, 90)
- Query: For each department, find the average enrollment over all universities.
102Data Distribution
- Input records, split across the mappers:
(Brandeis, history, 90)
(Brandeis, physics, 68)
(Bentley, history, 70)
(Wellesley, history, 90)
(Wellesley, art, 47)
(BU, history, 110)
112Data Distribution
- Mappers send each record to the reducer responsible for its department; each reducer then averages the enrollments it received
Reducer | Records received | Average enrollment
Physics | (Brandeis, physics, 68) | 68
History | (Brandeis, history, 90), (Bentley, history, 70), (Wellesley, history, 90), (BU, history, 110) | 90
Art | (Wellesley, art, 47) | 47
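The enrollment example can be simulated in a few lines. This is a single-process sketch of the map, shuffle and reduce steps, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run`) and the way the records are split between two mappers are assumptions made for illustration.

```python
from collections import defaultdict

records = [
    ("Brandeis", "history", 90),
    ("Brandeis", "physics", 68),
    ("Bentley", "history", 70),
    ("Wellesley", "history", 90),
    ("Wellesley", "art", 47),
    ("BU", "history", 110),
]


def map_phase(split):
    """Mapper: emit (department, enrollment) key/value pairs."""
    return [(dept, enrollment) for _, dept, enrollment in split]


def shuffle(pairs):
    """Route every pair to the reducer group responsible for its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: average the enrollments received for each department."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def run(splits):
    pairs = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(pairs))


# Assumed split: the first three records go to one mapper, the rest to another.
averages = run([records[:3], records[3:]])
print(averages)
```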
115Data Distribution
- Mappers use a special algorithm to decide which reducer to send data to (next slides)
- Reducers execute the algorithm specified in the paper Composite Subset Measures [4]
119Distribution Key
- The distribution key is assigned to a piece of data by a mapper to determine which reducer to send it to
- In the prior example, the distribution key was department
121Distribution Key
- Composite queries are made up of several component queries
- What regions do we use for the distribution key?
126Distribution Key
- To determine the distribution key
- The least common ancestor of all measure granularities is a feasible distribution key
- All the other feasible distribution keys are generalizations of it
128Distribution Key
- => Distribution key is (Keyword, Hour)
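One way to read the least-common-ancestor rule, assuming each dimension has a single linear hierarchy from fine to coarse: per dimension, take the coarsest level used by any measure. The hierarchies, the dictionary representation and the name `least_common_ancestor` are all hypothetical; the paper's own formulation may differ.

```python
# Hypothetical hierarchies, ordered from finest to coarsest level.
HIERARCHIES = {
    "keyword": ["keyword", "ALL"],
    "time": ["minute", "hour", "day", "ALL"],
}


def least_common_ancestor(granularities):
    """Per dimension, pick the coarsest level among all measure granularities."""
    key = {}
    for dim, levels in HIERARCHIES.items():
        # levels.index gives the position in the hierarchy; larger means coarser.
        key[dim] = max((g[dim] for g in granularities), key=levels.index)
    return key


m1 = {"keyword": "keyword", "time": "minute"}  # per-minute measure
m2 = {"keyword": "keyword", "time": "hour"}    # per-hour measure
print(least_common_ancestor([m1, m2]))
```

With one per-minute and one per-hour measure, the sketch yields (keyword, hour), matching the distribution key on the slide.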
131Sliding Windows
- Example: we measure some value for each minute and calculate a moving average over a 3-minute window
137Sliding Windows
- 3-minute moving average
Minute | Value | Average
1 | 1 | (1 + 6 + 5)/3 = 4
2 | 6 | (6 + 5 + 0)/3 ≈ 3.7
3 | 5 | (5 + 0 + 2)/3 ≈ 2.3
4 | 0 | (0 + 2 + 9)/3 ≈ 3.7
5 | 2 | (2 + 9 + 0)/3 ≈ 3.7
6 | 9 |
7 | 0 |
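The moving-average table can be reproduced with a short sketch. It assumes, as the table suggests, that the window for minute t covers minutes t, t+1 and t+2, so the last two minutes have no complete window; `moving_average` is an illustrative name.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values (forward-looking)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


values = [1, 6, 5, 0, 2, 9, 0]       # values for minutes 1..7
averages = moving_average(values, 3)
print([round(a, 1) for a in averages])
```

Each output entry needs values from three different minutes, which is why a reducer must hold the data for an entire window.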
139Sliding Windows
- In order to ensure localized evaluation, each
reducer must have data for an ENTIRE window
144Sliding Windows
- => same data must be sent to multiple reducers
- Consider a window of size 20. How much extra data is stored in the system?
- A lot!
146Sliding Windows
- Solution: clustering factor
- The clustering factor of an execution is the number of reducer blocks that are merged together into one super-block
151Sliding Windows
- Example
- Suppose clustering factor = 2, and window size = 10
- Each reducer gets 2 reducer blocks
- 5 reducers used instead of 10
- Tradeoff: less redundant data => less parallelism
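The tradeoff can be made concrete with a simplified model. Everything here is an assumption for illustration: the data is cut into `num_blocks` consecutive reducer blocks, a forward-looking window of `window` blocks is used, and a reducer must store its own blocks plus the following `window - 1` blocks. The function name `plan` is invented, and the paper's actual cost model may differ.

```python
import math


def plan(num_blocks, window, cf):
    """Return (reducers used, total blocks stored) for clustering factor cf."""
    reducers = math.ceil(num_blocks / cf)
    stored = 0
    for r in range(reducers):
        start = r * cf
        # Own cf blocks plus window - 1 look-ahead blocks, clipped at the end.
        end = min(start + cf + window - 1, num_blocks)
        stored += end - start
    return reducers, stored


for cf in (1, 2, 5):
    print(cf, plan(num_blocks=10, window=10, cf=cf))
```

In this model, raising the clustering factor from 1 to 2 halves the number of reducers (10 to 5) and also reduces the total amount of stored, partly replicated data.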
155Sliding Windows
- How do sliding windows affect the distribution key?
- Start with the initial components in the workflow. If a component uses a sliding window, apply OpConvert. Then apply OpCombine.
163Optimizations
- 1. Early Aggregation
- When aggregations are performed by reducers, the resulting data is usually smaller
- If we aggregate at the mappers, we have to transfer less to the reducers
164Optimizations
- 1. Early Aggregation
- Only works if basic measures (components that don't rely on other components) are parallelizable
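A sketch of early aggregation, using an average as the example of a measure that can be split into partial results: each mapper sends one (sum, count) pair per key instead of every record. The function names and the split of the enrollment records into two mapper inputs are assumptions. A measure such as the median cannot be combined from partial results this way.

```python
from collections import defaultdict


def mapper_partial(split):
    """Mapper-side aggregation: one (sum, count) pair per department."""
    partial = defaultdict(lambda: [0, 0])
    for _, dept, enrollment in split:
        partial[dept][0] += enrollment
        partial[dept][1] += 1
    return {dept: tuple(sc) for dept, sc in partial.items()}


def reducer_merge(partials):
    """Reducer: merge the partial pairs and finish the average."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for dept, (s, c) in partial.items():
            totals[dept][0] += s
            totals[dept][1] += c
    return {dept: s / c for dept, (s, c) in totals.items()}


# Assumed split of the enrollment records across two mappers.
splits = [
    [("Brandeis", "history", 90), ("Brandeis", "physics", 68), ("Bentley", "history", 70)],
    [("Wellesley", "history", 90), ("Wellesley", "art", 47), ("BU", "history", 110)],
]
partials = [mapper_partial(s) for s in splits]
averages = reducer_merge(partials)
print(partials[0], averages)
```

Here the first mapper transfers two pairs instead of three records; on real data with many records per key the saving is much larger.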
168Optimizations
- 2. Composite Sort Key
- Mappers sort data by determining which reducer to send it to
- Reducers then sort received data before aggregating
- A composite sort key would allow reducers to order incoming data
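A minimal sketch of the composite sort key idea, under stated assumptions: the distribution key is (keyword, hour), time is a minute count so that `minute // 60` gives the hour, and the key is extended with the minute so that one sort both groups the data per reducer and orders it inside each group. The names `composite_key` and `grouped_in_order` are hypothetical.

```python
from itertools import groupby

# Hypothetical records: (keyword, time in minutes, value).
records = [("dogs", 502, 6), ("data", 501, 9), ("dogs", 501, 5), ("dogs", 612, 1)]


def composite_key(record):
    keyword, minute, _ = record
    # Distribution key (keyword, hour) first, then the minute for local order.
    return (keyword, minute // 60, minute)


def grouped_in_order(rows):
    """Sort once on the composite key, then group by the distribution key."""
    ordered = sorted(rows, key=composite_key)
    return {
        key: list(group)
        for key, group in groupby(ordered, key=lambda r: composite_key(r)[:2])
    }


groups = grouped_in_order(records)
print(groups[("dogs", 8)])
```

Each group arrives already ordered by minute, so no second sort is needed before aggregating.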
170Optimizations
- 3. Use magic formula
- How do we determine the best distribution key and clustering factor for queries with windows?
171Optimizations
- 3. Use magic formula
- For each possible distribution key, take the derivative of this formula with respect to cf (the clustering factor) and set it equal to 0. Then solve for cf.
176Optimizations
- 4. Skew Detection
- When a mapper receives data, it samples the data to see which reducers it goes to
- Mappers forward sample results to the master node
- The master node can then choose a new clustering factor with the lowest maximal workload
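The selection step can be sketched as follows. The model is an assumption made for illustration: each sampled record is identified by its reducer block number, `cf` consecutive blocks form a super-block, super-blocks are assigned to reducers round-robin, and window replication is ignored. `max_load` and `choose_cf` are invented names.

```python
from collections import Counter


def max_load(sample_blocks, cf, num_reducers):
    """Largest number of sampled records landing on any single reducer."""
    load = Counter((block // cf) % num_reducers for block in sample_blocks)
    return max(load.values())


def choose_cf(sample_blocks, candidates, num_reducers):
    """Pick the candidate clustering factor with the lowest maximal workload."""
    return min(candidates, key=lambda cf: max_load(sample_blocks, cf, num_reducers))


# Hypothetical sample: blocks 0 and 2 are hot, blocks 1 and 3 are not.
sample = [0, 0, 0, 2, 2, 2, 1, 3]
for cf in (1, 2, 4):
    print(cf, max_load(sample, cf, num_reducers=2))
print(choose_cf(sample, [1, 2, 4], num_reducers=2))
```

With this sample, a clustering factor of 1 puts both hot blocks on the same reducer, while a factor of 2 separates them, so the sketch selects 2.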
179Evaluation
- Ran trials using Hadoop on 100 machines
182Evaluation
- 50 mappers, 50 reducers
- Approximately linear scaleup
- Query 6 has a sliding window
185Evaluation
- Approximately linear scaleup
- Query 6 has coarse granularity in addition to a sliding window
188Evaluation
- Model prediction from earlier formula
- Too small a clustering factor => a lot of redundant data
- Too big a clustering factor => not much parallelization
190Evaluation
- DS0 has coarse granularity
- DS1 is intermediate
- DS2 is fine-grained
192Evaluation
- 1 billion records
- 2Block and 4Block impose a minimum on the number of blocks per reducer
197Summary
- Designed a strategy for evaluating correlated aggregate queries in parallel
- Supports sliding window components in the queries
- Optimized distribution scheme to minimize execution time