Parallel Evaluation of Composite Aggregate Queries

Transcript and Presenter's Notes

1
Parallel Evaluation of Composite Aggregate Queries
  • by Chen, Olston, Ramakrishnan
  • http://www.cs.brandeis.edu/~olga/cs228/Reading%20List_files/icde08.pdf

Joel Freeman, Cosi 228
2
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

4
Background
  • Correlated Query

5
Background
  • Correlated Query
  • In SQL, has an inner query that uses values from
    the outer query

6
Background
  • Correlated Query
  • In SQL, has an inner query that uses values from
    the outer query
  • Inner query is evaluated once for each row of the
    outer

7
Background
  • Simple Example
  • Consider two relations
  • Budget(company, budget)
  • Expenses(company, expenses)

8
Background
  • Simple Example
  • SELECT company
    FROM Budget AS B
    WHERE budget < (SELECT expenses
                    FROM Expenses
                    WHERE company = B.company)

23
Background
  • SELECT company
    FROM Budget AS B
    WHERE budget < (SELECT expenses
                    FROM Expenses
                    WHERE company = B.company)

RESULTS
Google
24
Background
  • Correlated Aggregate Query

25
Background
  • Correlated Aggregate Query
  • For each tuple in the outer query, perform
    aggregation in the inner query

26
Background
  • Correlated Aggregate Query
  • For each tuple in the outer query, perform
    aggregation in the inner query
  • SELECT name
    FROM Student AS s1
    WHERE age > (SELECT AVG(age)
                 FROM Student
                 WHERE department = s1.department)

33
Background
  • SELECT name
    FROM Student AS s1
    WHERE age > (SELECT AVG(age)
                 FROM Student
                 WHERE department = s1.department)

RESULTS
Jane
34
Background
  • Cube Space

35
Background
  • Cube Space
  • A different way of looking at relational data

36
Background
  • Cube Space
  • A different way of looking at relational data
  • Example: Search Engine Logs

37
Background
  • Cube Space
  • Schema

38
Background
  • Cube Space
  • Schema
  • (keyword, pageCount, adCount, time)

39
Background
  • Cube Space
  • Schema
  • (keyword, pageCount, adCount, time)
  • E.g. (dogs, 9, 1, 530)

40
Background
  • Cube Space Example

41
Background
[Cube-space grid: keyword (hippo, dogs, data) on one axis, time (501, 502, 612) on the other]
42
Background
keyword \ time      501        502        612
hippo               (0, 0)     (0, 1)     (3, 3)
dogs                (5, 0)     (6, 3)     (1, 0)
data                (9, 1)     (2, 0)     (2, 0)
43
Background
Each cell holds a (pageCount, adCount) pair.
[Same cube-space grid as above]
44
Background
[Same cube-space grid, with a block of cells marked as a region]
45
Background
2 dimension attributes, 2 measure attributes
[Same cube-space grid as above]
46
Background
Imagine a 3rd dimension attribute, Country, for the country of origin of the query.
[Same cube-space grid as above]
47
Background
Imagine a 3rd dimension attribute, Country, for the country of origin of the query. Hence "cube" space.
[Same cube-space grid as above]
48
Background
Many data points per keyword per minute
[Same cube-space grid; some cells now hold several (pageCount, adCount) points]
49
Background
Bags, not sets
[Same cube-space grid with multiple points per cell]
50
Background
  • Composite Subset Measure Query

51
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries

52
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries
  • Has multiple steps (components)

53
Background
  • Composite Subset Measure Query
  • A complex query made up of multiple Correlated
    Aggregate queries
  • Has multiple steps (components)
  • Applied to regions in cube space

54
Background
  • How are composite subset measure queries unique?
  • Citation [4] from the paper

55
Background
[Cube-space grid (keyword × time) with multiple (pageCount, adCount) points per cell]
57
Background
  • Typical GROUP BY operation on time

[Cube-space grid as above]
58
Background
  • Fixed region granularity (time is in mins)

[Cube-space grid as above]
60
Background
  • New model: Regions grouped by hour as well

[Cube-space grid as above, now with hour-level regions in addition to minutes]
61
Background
  • What does a composite subset measure query look
    like?

62
Background
  • Example
  • (keyword, pageCount, adCount, time)

63
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.

64
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.
  • M2: For every minute and keyword, find the median of ad count.

65
Background
  • Example
  • (keyword, pageCount, adCount, time)
  • M1: For every minute and keyword, find the median of page count.
  • M2: For every minute and keyword, find the median of ad count.
  • M3: For every minute and keyword, find the ratio between M1 and M2 (a sketch follows below).

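A minimal sketch of what M1-M3 compute for each (keyword, minute) region, assuming a handful of in-memory log tuples; the sample values and variable names are illustrative, not taken from the paper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log tuples: (keyword, pageCount, adCount, time-in-minutes)
log = [
    ("dogs", 9, 1, 501), ("dogs", 2, 1, 501), ("dogs", 5, 0, 502),
    ("hippo", 3, 3, 612), ("hippo", 1, 3, 612),
]

# Group tuples into (keyword, minute) regions of cube space.
regions = defaultdict(list)
for keyword, page_count, ad_count, minute in log:
    regions[(keyword, minute)].append((page_count, ad_count))

# Per region: M1 = median page count, M2 = median ad count, M3 = M1 / M2.
for (keyword, minute), points in sorted(regions.items()):
    m1 = median(p for p, _ in points)
    m2 = median(a for _, a in points)
    m3 = m1 / m2 if m2 else None   # guard against a zero median ad count
    print(keyword, minute, m1, m2, m3)
```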
66
Background
  • Composite: multiple steps

67
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions

68
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions
  • Measure: aggregating measures

69
Background
  • Composite: multiple steps
  • Subset: over subsets of dimensions
  • Measure: aggregating measures
  • Queries

70
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

72
Motivation
  • Search Engine logs can be pretty big

73
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?

74
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?
  • E.g. Yahoo targeted advertising

75
Motivation
  • Search Engine logs can be pretty big
  • What if we want to analyze TBs of data?
  • E.g. Yahoo targeted advertising
  • Amazon item recommendations

76
Motivation
  • We often use composite subset measure queries

77
Motivation
  • We often use composite subset measure queries
  • How might we parallelize?

78
Motivation
Parallelizing
[Cube-space grid (keyword × time) as in the Background section]
80
Motivation
Parallelizing: send each partition to a different node
[Cube-space grid as above, split into partitions]
81
Motivation
Parallelizing: how do we decide how to distribute our partitions?
[Cube-space grid as above]
82
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

84
Data Distribution
  • Idea: compute each component of the composite query in parallel

85
Data Distribution
  • Problem

86
Data Distribution
  • Problem
  • Composite subset measure queries often use the
    same data for different components

87
Data Distribution
  • Problem
  • Composite subset measure queries often use the
    same data for different components
  • For each query component, we would need to
    repartition data

88
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

89
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

Parallelizing each individual component means
distributing data twice
90
Data Distribution
  • Remember?
  • (keyword, linksCount, adsCount, time)
  • M1: For every minute and keyword, find the median of links count.
  • M2: For every hour and keyword, find the median of ads count.

To compute M3, we need to join the results of M1
and M2, which may be large
91
Data Distribution
  • Solution

92
Data Distribution
  • Solution
  • Every aggregation carried out at only one node
  • All data required for the final aggregation must be available at one node

93
Data Distribution
  • Solution
  • Localized Aggregation

94
Data Distribution
  • Solution
  • In our example, we would partition data into the
    regions for M3, and perform M1, M2, and M3 all at
    one node

95
Data Distribution
  • Distribution process

96
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset

97
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs

98
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys

99
Data Distribution
  • Distribution process
  • Mapper allocated part of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys
  • Data is distributed to reducers, who carry out
    the computation

100
Data Distribution
  • Example
  • Schema: (university, dept, enrollment)
  • e.g. (brandeis, history, 90)

101
Data Distribution
  • Example
  • Schema: (university, dept, enrollment)
  • e.g. (brandeis, history, 90)
  • Query: For each department, find the average enrollment over all universities.

102
Data Distribution
[Slides 102-112: animation of the map/shuffle/reduce flow for this query]
Input tuples at the mappers:
  (Brandeis, history, 90), (Brandeis, physics, 68), (Bentley, history, 70),
  (Wellesley, history, 90), (Wellesley, art, 47), (BU, history, 110)
Each tuple is routed to the reducer for its department; the reducers receive:
  Physics: (Brandeis, physics, 68)                                -> average 68
  History: (Brandeis, history, 90), (Bentley, history, 70),
           (Wellesley, history, 90), (BU, history, 110)           -> average 90
  Art:     (Wellesley, art, 47)                                   -> average 47
113
Data Distribution
  • Process
  • Mapper allocated region of dataset
  • Produces key/value pairs
  • Each reducer is assigned responsibility for
    certain keys
  • Data is distributed to reducers, who carry out the computation (see the sketch below)

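A minimal single-process sketch of the map/shuffle/reduce flow recapped above; the in-memory dictionary stands in for Hadoop's network shuffle, and the tuple list is the one from the slides.

```python
from collections import defaultdict

tuples = [
    ("Brandeis", "history", 90), ("Brandeis", "physics", 68),
    ("Bentley", "history", 70), ("Wellesley", "history", 90),
    ("Wellesley", "art", 47), ("BU", "history", 110),
]

# Map: emit (distribution key, value) pairs; the key here is the department.
mapped = [(dept, enrollment) for _university, dept, enrollment in tuples]

# Shuffle: group values by key, so one reducer sees all data for its keys.
shuffle = defaultdict(list)
for dept, enrollment in mapped:
    shuffle[dept].append(enrollment)

# Reduce: compute the average enrollment per department.
for dept, enrollments in shuffle.items():
    print(dept, sum(enrollments) / len(enrollments))
# history 90.0, physics 68.0, art 47.0 -- matching the values on slide 112
```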
114
Data Distribution
  • Mappers use a special algorithm to decide which
    reducer to send data to (next slides)

115
Data Distribution
  • Mappers use a special algorithm to decide which
    reducer to send data to (next slides)
  • Reducers execute the algorithm specified in the paper "Composite Subset Measures" [4]

116
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

118
Distribution Key
  • The distribution key is assigned to a piece of
    data by a mapper to determine which reducer to
    send it to

119
Distribution Key
  • The distribution key is assigned to a piece of
    data by a mapper to determine which reducer to
    send it to
  • In the prior example, the distribution key was department (see the partitioning sketch below)

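A sketch of how a mapper might turn a distribution key into a reducer choice, assuming the usual hash-partitioning convention; NUM_REDUCERS and the helper name are illustrative, not taken from the paper.

```python
NUM_REDUCERS = 50   # assumed cluster setting, not taken from the paper

def reducer_for(distribution_key, num_reducers=NUM_REDUCERS):
    """Send every tuple with the same distribution key (e.g. a department,
    or a (keyword, hour) pair) to the same reducer, hash-partitioner style.
    Note: Python's hash() is only stable within one process run."""
    return hash(distribution_key) % num_reducers

print(reducer_for("history"))
print(reducer_for(("dogs", 8)))   # all ("dogs", hour 8) tuples share a reducer
```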
120
Distribution Key
  • Composite queries are made up of several
    component queries

121
Distribution Key
  • Composite queries are made up of several
    component queries
  • What regions do we use for the distribution key?

122
Distribution Key
  • Aggregation Workflows

123
(No Transcript)
124
Distribution Key
  • To determine the distribution key

125
Distribution Key
  • To determine the distribution key
  • The least common ancestor of all measure
    granularities is a feasible distribution key

126
Distribution Key
  • To determine the distribution key
  • The least common ancestor of all measure
    granularities is a feasible distribution key
  • All the other feasible distribution keys are generalizations of it (sketched below)

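A sketch of the least-common-ancestor rule, assuming each dimension's granularities form a simple fine-to-coarse chain; the hierarchy names and levels listed here are illustrative.

```python
# Illustrative dimension hierarchies, ordered fine -> coarse
# ("ALL" means the dimension is aggregated away entirely).
HIERARCHIES = {
    "keyword": ["keyword", "ALL"],
    "time": ["minute", "hour", "day", "ALL"],
}

def lca_granularity(granularities):
    """Least common ancestor of a set of measure granularities: per dimension,
    the coarsest level any measure uses, i.e. the finest level whose regions
    still contain every measure's regions."""
    key = {}
    for dim, levels in HIERARCHIES.items():
        coarsest = max(levels.index(g.get(dim, "ALL")) for g in granularities)
        key[dim] = levels[coarsest]
    return key

# M1 per (keyword, minute) and M2 per (keyword, hour) -> key (keyword, hour),
# matching slide 128.
print(lca_granularity([
    {"keyword": "keyword", "time": "minute"},
    {"keyword": "keyword", "time": "hour"},
]))   # {'keyword': 'keyword', 'time': 'hour'}
```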
127
(No Transcript)
128
Distribution Key
  • => Distribution key is (Keyword, Hour)

129
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

131
Sliding Windows
  • Example: we measure some value for each minute and calculate a moving average over a 3-minute window

132
Sliding Windows
  • 3-minute moving average
  • Minutes Value
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

133
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

134
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

135
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

136
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3
  • 2 6 (6+5+0)/3
  • 3 5 (5+0+2)/3
  • 4 0 (0+2+9)/3
  • 5 2 (2+9+0)/3
  • 6 9
  • 7 0

137
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 (1+6+5)/3 = 4
  • 2 6 (6+5+0)/3 ≈ 3.7
  • 3 5 (5+0+2)/3 ≈ 2.3
  • 4 0 (0+2+9)/3 ≈ 3.7
  • 5 2 (2+9+0)/3 ≈ 3.7
  • 6 9
  • 7 0

138
Sliding Windows
  • 3-minute moving average
  • Minutes Value Average
  • 1 1 4
  • 2 6 3.7
  • 3 5 2.3
  • 4 0 3.7
  • 5 2 3.7
  • 6 9
  • 7 0

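A minimal sketch of the 3-minute moving average worked out above (forward-looking window, as in the table); the function name is illustrative.

```python
def moving_average(values, window=3):
    """Moving average over `window` consecutive minutes; as in the table
    above, minute 1's average covers minutes 1-3, minute 2's covers 2-4."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

values = [1, 6, 5, 0, 2, 9, 0]                   # minutes 1..7
print([round(a, 1) for a in moving_average(values)])
# [4.0, 3.7, 2.3, 3.7, 3.7]
```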
139
Sliding Windows
  • In order to ensure localized evaluation, each
    reducer must have data for an ENTIRE window

140
Sliding Windows
  • 3-minute moving average
  • Minutes Value
  • 1 1
  • 2 6
  • 3 5
  • 4 0
  • 5 2
  • 6 9
  • 7 0

142
Sliding Windows
  • => same data must be sent to multiple reducers

143
Sliding Windows
  • => same data must be sent to multiple reducers
  • Consider window of size 20. How much extra data
    is stored in the system?

144
Sliding Windows
  • => same data must be sent to multiple reducers
  • Consider window of size 20. How much extra data
    is stored in the system?

A lot!
145
Sliding Windows
  • Solution: clustering factor

146
Sliding Windows
  • Solution: clustering factor
  • The clustering factor of an execution is the number of reducer
    blocks that are merged together into one super-block

147
Sliding Windows
  • Example

148
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10

149
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks

150
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks
  • 5 reducers used instead of 10

151
Sliding Windows
  • Example
  • Suppose clustering factor = 2 and window size = 10
  • Each reducer gets 2 reducer blocks
  • 5 reducers used instead of 10
  • Tradeoff: less redundant data => less parallelism (see the sketch below)

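A back-of-the-envelope sketch of the tradeoff, assuming minute-grained reducer blocks and counting roughly (window - 1) replicated minutes per super-block; the bookkeeping is simplified relative to the paper, and the numbers are illustrative.

```python
def plan(num_blocks, window, cf):
    """Merge minute-grained reducer blocks into super-blocks of size cf.
    Each super-block also needs the (window - 1) adjacent minutes shipped
    in so that every window it owns can be evaluated locally (approximate:
    the super-block at the edge of the data needs fewer)."""
    super_blocks = [list(range(start, min(start + cf, num_blocks)))
                    for start in range(0, num_blocks, cf)]
    reducers = len(super_blocks)
    replicated_minutes = reducers * (window - 1)
    return reducers, replicated_minutes

for cf in (1, 2, 5):
    reducers, extra = plan(num_blocks=10, window=3, cf=cf)
    print(f"cf={cf}: {reducers} reducers, ~{extra} replicated minutes")
# cf=1: 10 reducers, ~20 replicated minutes
# cf=2: 5 reducers, ~10 replicated minutes   (the slide's example)
# cf=5: 2 reducers, ~4 replicated minutes    (less redundancy, less parallelism)
```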
152
Sliding Windows
  • How do sliding windows affect the distribution
    key?

153
(No Transcript)
154
Sliding Windows
  • How do sliding windows affect the distribution
    key?

155
Sliding Windows
  • How do sliding windows affect the distribution
    key?
  • Start with the initial components in the workflow. If there is a sliding window, apply OpConvert. Then apply OpCombine.

156
Sliding Windows
157
(No Transcript)
158
Sliding Windows
159
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

161
Optimizations
  • 1 Early Aggregation

162
Optimizations
  • 1 Early Aggregation
  • When aggregations are performed by reducers, the
    resulting data is usually smaller

163
Optimizations
  • 1 Early Aggregation
  • When aggregations are performed by reducers, the
    resulting data is usually smaller
  • If we aggregate at the mappers, we have to transfer less data to the reducers

164
Optimizations
  • 1 Early Aggregation
  • Only works if basic measures (components that don't rely on other components) are parallelizable (see the sketch below)

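A sketch of early (combiner-style) aggregation for a decomposable aggregate such as AVG, which can be rebuilt from per-mapper (sum, count) partials; a median, by contrast, cannot, so the median measures in the running example would not qualify. Names and data here are illustrative.

```python
from collections import defaultdict

def map_with_early_aggregation(chunk):
    """Combiner-style pre-aggregation at the mapper: emit one (sum, count)
    partial per key instead of one record per input tuple."""
    partial = defaultdict(lambda: (0, 0))
    for key, value in chunk:
        s, c = partial[key]
        partial[key] = (s + value, c + 1)
    return dict(partial)            # far fewer pairs cross the network

def reduce_average(partials):
    """Reducer side: merge the partials from all mappers into an average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

chunk = [("history", 90), ("physics", 68), ("history", 70)]
print(map_with_early_aggregation(chunk))
# {'history': (160, 2), 'physics': (68, 1)}
print(reduce_average([(160, 2), (230, 2)]))   # partials from two mappers -> 97.5
```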
165
Optimizations
  • 2 Composite Sort Key

166
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to

167
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to
  • Reducers then sort received data before
    aggregating

168
Optimizations
  • 2 Composite Sort Key
  • Mappers sort data by determining which reducer
    to send it to
  • Reducers then sort received data before
    aggregating
  • A composite sort key would allow reducers to
    order incoming data

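A sketch of the composite-sort-key idea: the mapper's key carries both a distribution part (used for partitioning) and a sort part, so each reducer receives its tuples already ordered. The key layout and NUM_REDUCERS below are assumptions, not the paper's exact scheme.

```python
NUM_REDUCERS = 4    # assumed

def composite_key(tup):
    """Key = (distribution part, sort part): distribute by (keyword, hour),
    sort by minute, so a reducer gets its tuples already in time order and
    can aggregate them in a single streaming pass."""
    keyword, page_count, ad_count, minute = tup
    return ((keyword, minute // 60), minute)

def partition(key):
    distribution_part, _sort_part = key
    return hash(distribution_part) % NUM_REDUCERS   # sort part ignored here

tuples = [("dogs", 5, 0, 502), ("dogs", 9, 1, 501), ("hippo", 3, 3, 612)]
for key, tup in sorted((composite_key(t), t) for t in tuples):
    print(partition(key), key, tup)
# Both "dogs" tuples land on the same reducer, already ordered by minute.
```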
169
Optimizations
  • 3 Use magic formula

170
Optimizations
  • 3 Use magic formula
  • How do we determine the best distribution key and clustering factor for queries with windows?

171
Optimizations
  • 3 Use magic formula
  • For each possible distribution key, derive this formula, set it equal to 0, and solve for cf.

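The formula itself appears only as an image on the slide and is not captured in this transcript, so the cost model below is a made-up stand-in; it only illustrates the mechanics of setting a derivative to zero and solving numerically for cf. All constants and the cost function are assumptions, not the paper's.

```python
# MADE-UP stand-in for the slide's formula: replication overhead shrinks
# roughly like (window - 1) / cf while per-reducer load grows with cf.
def d_cost(cf, window=20, nodes=50):
    return -(window - 1) / cf**2 + 1.0 / nodes   # derivative of the toy cost

def solve_cf(d_cost, lo=1.0, hi=1000.0, iters=60):
    """Bisection on the derivative: find cf where d_cost(cf) crosses zero."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if d_cost(mid) < 0:
            lo = mid
        else:
            hi = mid
    return lo

print(round(solve_cf(d_cost)))   # ~31 for these made-up constants
```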
172
Optimizations
  • 3 Use magic formula

173
Optimizations
  • 4 Skew Detection

174
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to

175
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to
  • Mappers forward sample results to the master node

176
Optimizations
  • 4 Skew Detection
  • When a mapper receives data, it samples the data
    to see what reducers it goes to
  • Mappers forward sample results to the master node
  • The master node can then choose the new clustering factor with the lowest maximal workload (sketched below)

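A sketch of the master-side choice, assuming mappers forward a sample of reducer-block ids; the block model, candidate list, and sample are illustrative simplifications.

```python
from collections import Counter

def max_load(block_counts, cf):
    """Busiest reducer's workload if consecutive reducer blocks are merged
    into super-blocks of size cf (one super-block per reducer)."""
    load = Counter()
    for block, count in block_counts.items():
        load[block // cf] += count
    return max(load.values())

def choose_cf(sampled_blocks, candidates):
    """Master-side decision: pick the clustering factor with the lowest
    maximal workload according to the mappers' samples."""
    counts = Counter(sampled_blocks)
    return min(candidates, key=lambda cf: max_load(counts, cf))

# Hypothetical sample forwarded by the mappers: the reducer block id hit by
# each sampled tuple; block 1 is a hot spot.
sample = [0] * 5 + [1] * 80 + [2] * 5 + [3] * 10 + [4] * 6 + [5] * 4
print(choose_cf(sample, candidates=[1, 2, 3]))
# -> 1: merging blocks here would pile more data onto the hot block's reducer
```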
177
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

179
Evaluation
  • Ran trials using Hadoop on 100 machines

180
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

181
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

Approximately linear scaleup
182
Evaluation
  • 50 mappers, 50 reducers
  • Approximately linear scaleup

Approximately linear scaleup. Query 6 has a sliding window.
183
Evaluation
184
Evaluation
Approximately linear scaleup
185
Evaluation
Approximately linear scaleup. Query 6 has coarse granularity in addition to a sliding window.
186
Evaluation
187
Evaluation
Model prediction from earlier formula
188
Evaluation
Model prediction from the earlier formula. Too small a clustering factor => a lot of redundant data. Too big a clustering factor => not much parallelism.
189
Evaluation
190
Evaluation
DS0 has coarse granularity, DS2 is fine-grained, DS1 is intermediate.
191
Evaluation
192
Evaluation
1 billion records. 2Block and 4Block impose a minimum on the number of blocks per reducer.
193
Outline
  • Background
  • Motivation
  • Data Distribution Scheme
  • Distribution Key
  • Sliding Windows
  • Optimizations
  • Evaluation Results

195
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel

196
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel
  • Supports sliding window components in the queries

197
Summary
  • Designed a strategy for evaluating correlated
    aggregate queries in parallel
  • Supports sliding window components in the queries
  • Optimized distribution scheme to minimize
    execution time