Parallel Evaluation of Composite Aggregate Queries

Transcript and Presenter's Notes

1
Parallel Evaluation of Composite Aggregate Queries
  • by Chen, Olston, Ramakrishnan
  • http://www.cs.brandeis.edu/~olga/cs228/Reading%20List_files/icde08.pdf

Joel Freeman, Cosi 228
2
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

4
Background
  • Correlated Query

5
Background
  • Correlated Query
  • In SQL, has an inner query that uses values from
    the outer query

6
Background
  • Correlated Query
  • In SQL, has an inner query that uses values from
    the outer query
  • Inner query is evaluated once for each row of the
    outer

7
Background
  • Simple Example
  • Consider two relations
  • Budget(company, budget)
  • Expenses(company, expenses)

8
Background
  • Simple Example
  • SELECT company
    FROM Budget AS B
    WHERE budget < (SELECT expenses
                    FROM Expenses
                    WHERE company = B.company)

23
Background
  • SELECT company
    FROM Budget AS B
    WHERE budget < (SELECT expenses
                    FROM Expenses
                    WHERE company = B.company)

RESULTS
Google
24
Background
  • Correlated Aggregate Query

25
Background
  • Correlated Aggregate Query
  • For each tuple in the outer query, perform
    aggregation in the inner query

26
Background
  • Correlated Aggregate Query
  • For each tuple in the outer query, perform
    aggregation in the inner query
  • SELECT name
    FROM Student AS s1
    WHERE age > (SELECT AVG(age)
                 FROM Student
                 WHERE department = s1.department)

33
Background
  • SELECT name
    FROM Student AS s1
    WHERE age > (SELECT AVG(age)
                 FROM Student
                 WHERE department = s1.department)

RESULTS
Jane
34
Background
  • Cube Space

35
Background
  • Cube Space
  • A different way of looking at relational data

36
Background
  • Cube Space
  • A different way of looking at relational data
  • Example: Search Engine Logs

37
Background
  • Cube Space
  • Schema

38
Background
  • Cube Space
  • Schema
  • (keyword, pageCount, adCount, time)

39
Background
  • Cube Space
  • Schema
  • (keyword, pageCount, adCount, time)
  • E.g. (dogs, 9, 1, 530)

40
Background
  • Cube Space Example

41
Background
[Cube-space grid: keyword (hippo, dogs, data) on one axis, time (501, 502, 612) on the other]
42
Background
keyword \ time      501        502        612
hippo               (0, 0)     (0, 1)     (3, 3)
dogs                (5, 0)     (6, 3)     (1, 0)
data                (9, 1)     (2, 0)     (2, 0)
43
Background
Each cell holds a (pageCount, adCount) pair.
[Same cube-space grid as above]
44
Background
[Same cube-space grid, with a block of cells marked as a region]
45
Background
2 dimension attributes, 2 measure attributes
[Same cube-space grid as above]
46
Background
Imagine a 3rd dimension attribute, Country, for the country of origin of the query.
[Same cube-space grid as above]
47
Background
Imagine a 3rd dimension attribute, Country, for the country of origin of the query. Hence "cube" space.
[Same cube-space grid as above]
48
Background
Many data points per keyword per minute
[Same cube-space grid; some cells now hold several (pageCount, adCount) points]
49
Background
Bags, not sets
[Same cube-space grid with multiple points per cell]
50
Background
  • Composite Subset Measure Query

51
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries

52
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries
  • Has multiple steps (components)

53
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries
  • Has multiple steps (components)
  • Applied to regions in cube space

54
Background
  • How are composite subset measure queries unique?
  • Citation [4] from the paper

55
Background
[Cube-space grid (keyword × time) with multiple (pageCount, adCount) points per cell]
57
Background
  • Typical GROUP BY operation on time

[Cube-space grid as above]
58
Background
  • Fixed region granularity (time is in mins)

[Cube-space grid as above]
60
Background
  • New model: Regions grouped by hour as well

[Cube-space grid as above, now with hour-level regions in addition to minutes]
61
Background
  • What does a composite subset measure query look
    like?

62
Background
  • Example
  • (keyword, pageCount, adCount, time)

63
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.

64
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.
  • M2: For every minute and keyword, find the median of ad count.

65
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.
  • M2: For every minute and keyword, find the median of ad count.
  • M3: For every minute and keyword, find the ratio between M1 and M2 (a sketch follows below).

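A minimal sketch of what M1-M3 compute for each (keyword, minute) region, assuming a handful of in-memory log tuples; the sample values and variable names are illustrative, not taken from the paper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log tuples: (keyword, pageCount, adCount, time-in-minutes)
log = [
    ("dogs", 9, 1, 501), ("dogs", 2, 1, 501), ("dogs", 5, 0, 502),
    ("hippo", 3, 3, 612), ("hippo", 1, 3, 612),
]

# Group tuples into (keyword, minute) regions of cube space.
regions = defaultdict(list)
for keyword, page_count, ad_count, minute in log:
    regions[(keyword, minute)].append((page_count, ad_count))

# Per region: M1 = median page count, M2 = median ad count, M3 = M1 / M2.
for (keyword, minute), points in sorted(regions.items()):
    m1 = median(p for p, _ in points)
    m2 = median(a for _, a in points)
    m3 = m1 / m2 if m2 else None   # guard against a zero median ad count
    print(keyword, minute, m1, m2, m3)
```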
66
Background
  • Composite: multiple steps

67
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions

68
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions
  • Measure: aggregating measures

69
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions
  • Measure: aggregating measures
  • Queries

70
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

72
Motivation
  • Search Engine logs can be pretty big

73
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?

74
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?
  • E.g. Yahoo targeted advertising

75
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?
  • E.g. Yahoo targeted advertising
  • Amazon item recommendations

76
Motivation
  • We often use composite subset measure queries

77
Motivation
  • We often use composite subset measure queries
  • How might we parallelize?

78
Motivation
Parallelizing
[Cube-space grid (keyword × time) as in the Background section]
80
Motivation
Parallelizing: send each partition to a different node
[Cube-space grid as above, split into partitions]
81
Motivation
Parallelizing: how do we decide how to distribute our partitions?
[Cube-space grid as above]
82
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

84
Data Distribution
  • Idea: compute each component of the composite query in parallel

85
Data Distribution
  • Problem

86
Data Distribution
  • Problem
  • Composite subset measure queries often use the
    same data for different components

87
Data Distribution
  • Problem
  • Composite subset measure queries often use the
    same data for different components
  • For each query component, we would need to
    repartition data

88
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

89
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

Parallelizing each individual component means
distributing data twice
90
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

To compute M3, we need to join the results of M1
and M2, which may be large
91
Data Distribution
  • Solution

92
Data Distribution
  • Solution
  • Every aggregation carried out at only one node
  • All data required for the final aggregation must be available at one node

93
Data Distribution
  • Solution
  • Localized Aggregation

94
Data Distribution
  • Solution
  • In our example, we would partition data into the
    regions for M3, and perform M1, M2, and M3 all at
    one node

95
Data Distribution
  • Distribution process

96
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset

97
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs

98
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys

99
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys
  • Data is distributed to reducers, who carry out
    the computation

100
Data Distribution
  • Example
  • Schema: (university, dept, enrollment)
  • e.g. (brandeis, history, 90)

101
Data Distribution
  • Example
  • Schema: (university, dept, enrollment)
  • e.g. (brandeis, history, 90)
  • Query: For each department, find the average enrollment over all universities.

102
Data Distribution
[Slides 102-112: animation of the map/shuffle/reduce flow for this query]
Input tuples at the mappers:
  (Brandeis, history, 90), (Brandeis, physics, 68), (Bentley, history, 70),
  (Wellesley, history, 90), (Wellesley, art, 47), (BU, history, 110)
Each tuple is routed to the reducer for its department; the reducers receive:
  Physics: (Brandeis, physics, 68)                                -> average 68
  History: (Brandeis, history, 90), (Bentley, history, 70),
           (Wellesley, history, 90), (BU, history, 110)           -> average 90
  Art:     (Wellesley, art, 47)                                   -> average 47
113
Data Distribution
  • Process
  • Mapper allocated region of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys
  • Data is distributed to reducers, who carry out the computation (see the sketch below)

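A minimal single-process sketch of the map/shuffle/reduce flow recapped above; the in-memory dictionary stands in for Hadoop's network shuffle, and the tuple list is the one from the slides.

```python
from collections import defaultdict

tuples = [
    ("Brandeis", "history", 90), ("Brandeis", "physics", 68),
    ("Bentley", "history", 70), ("Wellesley", "history", 90),
    ("Wellesley", "art", 47), ("BU", "history", 110),
]

# Map: emit (distribution key, value) pairs; the key here is the department.
mapped = [(dept, enrollment) for _university, dept, enrollment in tuples]

# Shuffle: group values by key, so one reducer sees all data for its keys.
shuffle = defaultdict(list)
for dept, enrollment in mapped:
    shuffle[dept].append(enrollment)

# Reduce: compute the average enrollment per department.
for dept, enrollments in shuffle.items():
    print(dept, sum(enrollments) / len(enrollments))
# history 90.0, physics 68.0, art 47.0 -- matching the values on slide 112
```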
114
Data Distribution
  • Mappers use a special algorithm to decide which
    reducer to send data to (next slides)

115
Data Distribution
  • Mappers use a special algorithm to decide which
    reducer to send data to (next slides)
  • Reducers execute the algorithm specified in the paper "Composite Subset Measures" [4]

116
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

118
Distribution Key
  • The distribution key is assigned to a piece of
    data by a mapper to determine which reducer to
    send it to

119
Distribution Key
  • The distribution key is assigned to a piece of
    data by a mapper to determine which reducer to
    send it to
  • In the prior example, the distribution key was department (see the partitioning sketch below)

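A sketch of how a mapper might turn a distribution key into a reducer choice, assuming the usual hash-partitioning convention; NUM_REDUCERS and the helper name are illustrative, not taken from the paper.

```python
NUM_REDUCERS = 50   # assumed cluster setting, not taken from the paper

def reducer_for(distribution_key, num_reducers=NUM_REDUCERS):
    """Send every tuple with the same distribution key (e.g. a department,
    or a (keyword, hour) pair) to the same reducer, hash-partitioner style.
    Note: Python's hash() is only stable within one process run."""
    return hash(distribution_key) % num_reducers

print(reducer_for("history"))
print(reducer_for(("dogs", 8)))   # all ("dogs", hour 8) tuples share a reducer
```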
120
Distribution Key
  • Composite queries are made up of several
    component queries

121
Distribution Key
  • Composite queries are made up of several
    component queries
  • What regions do we use for the distribution key?

122
Distribution Key
  • Aggregation Workflows

123
(No Transcript)
124
Distribution Key
  • To determine the distribution key

125
Distribution Key
  • To determine the distribution key
  • The least common ancestor of all measure
    granularities is a feasible distribution key

126
Distribution Key
  • To determine the distribution key
  • The least common ancestor of all measure
    granularities is a feasible distribution key
  • All the other feasible distribution keys are generalizations of it (sketched below)

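A sketch of the least-common-ancestor rule, assuming each dimension's granularities form a simple fine-to-coarse chain; the hierarchy names and levels listed here are illustrative.

```python
# Illustrative dimension hierarchies, ordered fine -> coarse
# ("ALL" means the dimension is aggregated away entirely).
HIERARCHIES = {
    "keyword": ["keyword", "ALL"],
    "time": ["minute", "hour", "day", "ALL"],
}

def lca_granularity(granularities):
    """Least common ancestor of a set of measure granularities: per dimension,
    the coarsest level any measure uses, i.e. the finest level whose regions
    still contain every measure's regions."""
    key = {}
    for dim, levels in HIERARCHIES.items():
        coarsest = max(levels.index(g.get(dim, "ALL")) for g in granularities)
        key[dim] = levels[coarsest]
    return key

# M1 per (keyword, minute) and M2 per (keyword, hour) -> key (keyword, hour),
# matching slide 128.
print(lca_granularity([
    {"keyword": "keyword", "time": "minute"},
    {"keyword": "keyword", "time": "hour"},
]))   # {'keyword': 'keyword', 'time': 'hour'}
```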
127
(No Transcript)
128
Distribution Key
  • => Distribution key is (Keyword, Hour)

129
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

131
Sliding Windows
  • Example: we measure some value for each minute and calculate a moving average over a 3-minute window

132
Sliding Windows
  • 3-minute moving average
  • Minutes Value
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

133
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

134
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

135
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

136
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3
  • 2 6 (6+5+0)/3
  • 3 5 (5+0+2)/3
  • 4 0 (0+2+9)/3
  • 5 2 (2+9+0)/3
  • 6 9
  • 7 0

137
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3 = 4
  • 2 6 (6+5+0)/3 ≈ 3.7
  • 3 5 (5+0+2)/3 ≈ 2.3
  • 4 0 (0+2+9)/3 ≈ 3.7
  • 5 2 (2+9+0)/3 ≈ 3.7
  • 6 9
  • 7 0

138
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 4
  • 2 6 3.7
  • 3 5 2.3
  • 4 0 3.7
  • 5 2 3.7
  • 6 9
  • 7 0

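A minimal sketch of the 3-minute moving average worked out above (forward-looking window, as in the table); the function name is illustrative.

```python
def moving_average(values, window=3):
    """Moving average over `window` consecutive minutes; as in the table
    above, minute 1's average covers minutes 1-3, minute 2's covers 2-4."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

values = [1, 6, 5, 0, 2, 9, 0]                   # minutes 1..7
print([round(a, 1) for a in moving_average(values)])
# [4.0, 3.7, 2.3, 3.7, 3.7]
```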
139
Sliding Windows
  • In order to ensure localized evaluation, each
    reducer must have data for an ENTIRE window

140
Sliding Windows
  • 3-minute moving average
  • Minutes Value
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

142
Sliding Windows
  • => same data must be sent to multiple reducers

143
Sliding Windows
  • => same data must be sent to multiple reducers
  • Consider window of size 20. How much extra data
    is stored in the system?

144
Sliding Windows
  • => same data must be sent to multiple reducers
  • Consider window of size 20. How much extra data
    is stored in the system?

A lot!
145
Sliding Windows
  • Solution: clustering factor

146
Sliding Windows
  • Solution: clustering factor
  • The clustering factor of an execution is the number of reducer
    blocks that are merged together into one super-block

147
Sliding Windows
  • Example

148
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10

149
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks

150
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks
  • 5 reducers used instead of 10

151
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks
  • 5 reducers used instead of 10
  • Tradeoff: less redundant data => less parallelism (see the sketch below)

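A back-of-the-envelope sketch of the tradeoff, assuming minute-grained reducer blocks and counting roughly (window - 1) replicated minutes per super-block; the bookkeeping is simplified relative to the paper, and the numbers are illustrative.

```python
def plan(num_blocks, window, cf):
    """Merge minute-grained reducer blocks into super-blocks of size cf.
    Each super-block also needs the (window - 1) adjacent minutes shipped
    in so that every window it owns can be evaluated locally (approximate:
    the super-block at the edge of the data needs fewer)."""
    super_blocks = [list(range(start, min(start + cf, num_blocks)))
                    for start in range(0, num_blocks, cf)]
    reducers = len(super_blocks)
    replicated_minutes = reducers * (window - 1)
    return reducers, replicated_minutes

for cf in (1, 2, 5):
    reducers, extra = plan(num_blocks=10, window=3, cf=cf)
    print(f"cf={cf}: {reducers} reducers, ~{extra} replicated minutes")
# cf=1: 10 reducers, ~20 replicated minutes
# cf=2: 5 reducers, ~10 replicated minutes   (the slide's example)
# cf=5: 2 reducers, ~4 replicated minutes    (less redundancy, less parallelism)
```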
152
Sliding Windows
  • How do sliding windows affect the distribution
    key?

153
(No Transcript)
154
Sliding Windows
  • How do sliding windows affect the distribution
    key?

155
Sliding Windows
  • How do sliding windows affect the distribution
    key?
  • Start with the initial components in the workflow. If there is a sliding window, apply OpConvert. Then apply OpCombine.

156
Sliding Windows
157
(No Transcript)
158
Sliding Windows
159
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

161
Optimizations
  • 1 Early Aggregation

162
Optimizations
  • 1 Early Aggregation
  • When aggregations are performed by reducers, the
    resulting data is usually smaller

163
Optimizations
  • 1 Early Aggregation
  • When aggregations are performed by reducers, the
    resulting data is usually smaller
  • If we aggregate at the mappers, we have to transfer less data to the reducers

164
Optimizations
  • 1 Early Aggregation
  • Only works if basic measures (components that don't rely on other components) are parallelizable (see the sketch below)

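A sketch of early (combiner-style) aggregation for a decomposable aggregate such as AVG, which can be rebuilt from per-mapper (sum, count) partials; a median, by contrast, cannot, so the median measures in the running example would not qualify. Names and data here are illustrative.

```python
from collections import defaultdict

def map_with_early_aggregation(chunk):
    """Combiner-style pre-aggregation at the mapper: emit one (sum, count)
    partial per key instead of one record per input tuple."""
    partial = defaultdict(lambda: (0, 0))
    for key, value in chunk:
        s, c = partial[key]
        partial[key] = (s + value, c + 1)
    return dict(partial)            # far fewer pairs cross the network

def reduce_average(partials):
    """Reducer side: merge the partials from all mappers into an average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

chunk = [("history", 90), ("physics", 68), ("history", 70)]
print(map_with_early_aggregation(chunk))
# {'history': (160, 2), 'physics': (68, 1)}
print(reduce_average([(160, 2), (230, 2)]))   # partials from two mappers -> 97.5
```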
165
Optimizations
  • 2 Composite Sort Key

166
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to

167
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to
  • Reducers then sort received data before
    aggregating

168
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to
  • Reducers then sort received data before
    aggregating
  • A composite sort key would allow reducers to
    order incoming data

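A sketch of the composite-sort-key idea: the mapper's key carries both a distribution part (used for partitioning) and a sort part, so each reducer receives its tuples already ordered. The key layout and NUM_REDUCERS below are assumptions, not the paper's exact scheme.

```python
NUM_REDUCERS = 4    # assumed

def composite_key(tup):
    """Key = (distribution part, sort part): distribute by (keyword, hour),
    sort by minute, so a reducer gets its tuples already in time order and
    can aggregate them in a single streaming pass."""
    keyword, page_count, ad_count, minute = tup
    return ((keyword, minute // 60), minute)

def partition(key):
    distribution_part, _sort_part = key
    return hash(distribution_part) % NUM_REDUCERS   # sort part ignored here

tuples = [("dogs", 5, 0, 502), ("dogs", 9, 1, 501), ("hippo", 3, 3, 612)]
for key, tup in sorted((composite_key(t), t) for t in tuples):
    print(partition(key), key, tup)
# Both "dogs" tuples land on the same reducer, already ordered by minute.
```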
169
Optimizations
  • 3 Use magic formula

170
Optimizations
  • 3 Use magic formula
  • How do we determine the best distribution key and clustering factor for queries with windows?

171
Optimizations
  • 3 Use magic formula
  • For each possible distribution key, derive this formula, set it equal to 0, and solve for cf.

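The formula itself appears only as an image on the slide and is not captured in this transcript, so the cost model below is a made-up stand-in; it only illustrates the mechanics of setting a derivative to zero and solving numerically for cf. All constants and the cost function are assumptions, not the paper's.

```python
# MADE-UP stand-in for the slide's formula: replication overhead shrinks
# roughly like (window - 1) / cf while per-reducer load grows with cf.
def d_cost(cf, window=20, nodes=50):
    return -(window - 1) / cf**2 + 1.0 / nodes   # derivative of the toy cost

def solve_cf(d_cost, lo=1.0, hi=1000.0, iters=60):
    """Bisection on the derivative: find cf where d_cost(cf) crosses zero."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if d_cost(mid) < 0:
            lo = mid
        else:
            hi = mid
    return lo

print(round(solve_cf(d_cost)))   # ~31 for these made-up constants
```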
172
Optimizations
  • 3 Use magic formula

173
Optimizations
  • 4 Skew Detection

174
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to

175
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to
  • Mappers forward sample results to the master node

176
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to
  • Mappers forward sample results to the master node
  • The master node can then choose the new clustering factor with the lowest maximal workload (sketched below)

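A sketch of the master-side choice, assuming mappers forward a sample of reducer-block ids; the block model, candidate list, and sample are illustrative simplifications.

```python
from collections import Counter

def max_load(block_counts, cf):
    """Busiest reducer's workload if consecutive reducer blocks are merged
    into super-blocks of size cf (one super-block per reducer)."""
    load = Counter()
    for block, count in block_counts.items():
        load[block // cf] += count
    return max(load.values())

def choose_cf(sampled_blocks, candidates):
    """Master-side decision: pick the clustering factor with the lowest
    maximal workload according to the mappers' samples."""
    counts = Counter(sampled_blocks)
    return min(candidates, key=lambda cf: max_load(counts, cf))

# Hypothetical sample forwarded by the mappers: the reducer block id hit by
# each sampled tuple; block 1 is a hot spot.
sample = [0] * 5 + [1] * 80 + [2] * 5 + [3] * 10 + [4] * 6 + [5] * 4
print(choose_cf(sample, candidates=[1, 2, 3]))
# -> 1: merging blocks here would pile more data onto the hot block's reducer
```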
177
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

179
Evaluation
  • Ran trials using Hadoop on 100 machines

180
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

181
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

Approximately linear scaleup
182
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

Approximately linear scaleup. Query 6 has a sliding window.
183
Evaluation
184
Evaluation
Approximately linear scaleup
185
Evaluation
Approximately linear scaleup. Query 6 has coarse granularity in addition to a sliding window.
186
Evaluation
187
Evaluation
Model prediction from earlier formula
188
Evaluation
Model prediction from the earlier formula. Too small a clustering factor => a lot of redundant data. Too big a clustering factor => not much parallelism.
189
Evaluation
190
Evaluation
DS0 has coarse granularity, DS2 is fine-grained, DS1 is intermediate.
191
Evaluation
192
Evaluation
1 billion records. 2Block and 4Block impose a minimum on the number of blocks per reducer.
193
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

195
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel

196
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel
  • Supports sliding window components in the queries

197
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel
  • Supports sliding window components in the queries
  • Optimized distribution scheme to minimize
    execution time