1 The Influence of Caches on the Performance of Sorting
- By A. LaMarca and R. Ladner
- Presenter: Shai Brandes
2 Introduction
- Sorting is one of the most important operations performed by computers.
- In the days of magnetic-tape storage, before modern databases, it was almost certainly the most common operation performed by computers, as most "database" updating was done by sorting transactions and merging them with a master file.
3 Introduction cont.
- Since the introduction of caches, main memory has continued to grow slower relative to processor cycle times.
- The time to service a cache miss has grown to 100 cycles and more.
- Cache miss penalties have grown to the point where good overall performance cannot be achieved without good cache performance.
4 Introduction cont.
- In the article, the authors investigate, both experimentally and analytically, the potential performance gains that cache-conscious design offers in improving the performance of several sorting algorithms.
5 Introduction cont.
- For each algorithm, an implementation variant with potential for good overall performance was chosen.
- Then, the algorithm was optimized using traditional techniques to minimize the number of instructions executed. This algorithm forms the baseline for comparison.
- Memory optimizations were then applied to the baseline comparison-sort algorithm, in order to improve cache performance.
6 Performance measures
- The authors concentrate on three performance measures:
  - Instruction count
  - Cache misses
  - Overall performance
- The analyses presented here are only approximations, since cache misses cannot be analyzed precisely, due to factors such as variation in process scheduling and the operating system's virtual-to-physical page-mapping policy.
7 Main lesson
- The main lesson of the article is that, because cache miss penalties are growing larger with each new generation of processors, improving an algorithm's overall performance may require increasing the number of instructions executed while, at the same time, reducing the number of cache misses.
8 Design parameters of caches
- Capacity: the total number of blocks the cache can hold.
- Block size: the number of bytes that are loaded from and written to memory at a time.
- Associativity: in an N-way set-associative cache, a particular block can be loaded into N different cache locations.
- Replacement policy: which block is removed from the cache when a new block is loaded.
9 Which cache are we investigating?
- In modern machines, more than one cache is placed between the main memory and the processor.
[Diagram: processor - caches - main memory; a cache may be fully associative, N-way set associative, or direct-mapped]
10 Which cache are we investigating?
- The largest miss penalty is typically incurred by the cache closest to the main memory, which is usually direct-mapped.
- Thus, we will focus on improving the performance of direct-mapped caches.
11 Improve the cache hit ratio
- Temporal locality: there is a good chance that recently accessed data will be accessed again in the near future.
- Spatial locality: there is a good chance that subsequently accessed data items are located near each other in memory.
12 Cache misses
- Compulsory miss: occurs when a block is first accessed and loaded into the cache.
- Capacity miss: caused by the fact that the cache is not large enough to hold all the accessed blocks at one time.
- Conflict miss: occurs when two or more blocks that are mapped to the same location in the cache are accessed.
13 Measurements
- n: the number of keys to be sorted
- C: the number of blocks in the cache
- B: the number of keys that fit in a cache block
[Diagram: one cache block holds B keys]
14 Mergesort
- Two sorted lists can be merged into a single sorted list by repeatedly moving the smaller of the two head keys to the output list.
[Example: merging the sorted lists 1 3 6 and 2 4 5 yields 1 2 3 4 5 6]
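The merge step described above can be sketched as follows (a minimal illustration; the function name and list-based interface are my own, not the paper's):

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller head key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # one side is exhausted; append the remainder
    out.extend(right[j:])
    return out
```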
15 Mergesort
- By treating a set of unordered keys as a set of sorted lists of length 1, the keys can be repeatedly merged together until a single sorted set of keys remains.
- The iterative mergesort was chosen as the base algorithm.
16 Mergesort base algorithm
- Mergesort makes log2(n) passes over the array, where the i-th pass merges sorted subarrays of length 2^(i-1) into sorted subarrays of length 2^i.
[Example: starting from 1 4 8 3, pass i=1 merges runs of length 1 into 1 4 | 3 8, and pass i=2 merges those into 1 3 4 8]
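The iterative (bottom-up) scheme above can be sketched roughly like this, alternating between the two arrays to avoid copying (an illustrative sketch, not the paper's optimized implementation with in-line sorting of 4-key groups and loop unrolling):

```python
def iterative_mergesort(a):
    n = len(a)
    src, dst = list(a), [0] * n
    width = 1
    while width < n:
        # one pass: merge sorted runs of length `width` into runs of 2*width
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j = lo, mid
            for k in range(lo, hi):
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src  # alternate array roles instead of copying back
        width *= 2
    return src
```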
17 Improvements to the base algorithm
- Alternating the merging process from one array to another, to avoid unnecessary copying.
- Loop unrolling.
- Sorting subarrays of size 4 with a fast in-line sorting method; thus, the number of passes is log2(n/4).
- If this number is odd, then an additional copy pass is needed to return the sorted array to the input array.
18 The problem with the algorithm
- The base mergesort has the potential for terrible cache performance: if a pass is large enough to wrap around in the cache, keys will be ejected before they are used again.
- n ≤ BC/2 → the entire sort is performed in the cache; only compulsory misses.
- BC/2 < n ≤ BC → temporal reuse drops off sharply.
- BC < n → no temporal reuse.
- In each pass:
  - each block is accessed in the input array (read/write)
  - each block is accessed in the auxiliary array (write/read)
  - → 2 cache misses per block → 2/B cache misses per key
19 n ≤ BC/2
[Diagram: an input array of n keys and an auxiliary array. In pass i=1, each block is read from the input array (1 miss) and written to the auxiliary array (1 miss), and likewise in the reverse direction: 4 cache misses in total for the blocks shown. Since n ≤ BC/2, both arrays fit in the cache together, so pass i=2 finds its blocks already cached and incurs no cache misses!]
20 Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
21 Base Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  (2/B)·log2(n/4) + 1/B + (2/B)·(log2(n/4) mod 2)
- (2/B)·log2(n/4): there are log2(n/4) merge passes. In each pass, each key is moved from a source array to a destination array; every B-th key visited in the source array results in one cache miss, and every B-th key written to the destination array results in one cache miss.
- 1/B: the initial pass sorts groups of 4 keys, with 1 compulsory miss per block, thus 1/B misses per key.
- (2/B)·(log2(n/4) mod 2): if the number of iterations is odd, we need to copy the sorted array back to the input array.
22 1st Memory optimization: Tiled mergesort
- Improve temporal locality:
  - Phase 1: subarrays of length BC/2 are sorted using mergesort.
  - Phase 2: the arrays are returned to the base mergesort to complete the sorting of the entire array.
- Avoiding the final copy when log2(n/4) is odd: subarrays of size 2 are sorted in-line if log2(n) is odd, forcing the number of merge passes to be even.
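A rough sketch of the two phases, with a small `tile` parameter standing in for BC/2 keys (Python's built-in `sorted` stands in for the in-cache mergesort of phase 1; this illustrates the structure, not the cache behavior):

```python
def tiled_mergesort(a, tile=4):
    n = len(a)
    src, dst = list(a), [0] * n
    # Phase 1: sort each cache-sized tile independently
    for lo in range(0, n, tile):
        src[lo:lo + tile] = sorted(src[lo:lo + tile])
    # Phase 2: ordinary merge passes over the pre-sorted tiles
    width = tile
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j = lo, mid
            for k in range(lo, hi):
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src
        width *= 2
    return src
```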
23 Tiled mergesort example
- Input: 4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
- Phase 1: mergesort every BC/2 keys:
  4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
- Phase 2: regular mergesort:
  3 4 6 10 11 13 15 16 | 1 2 5 7 8 9 12 14
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
24 Tiled Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
25 Tiled Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  (2/B)·log2(2n/(BC)) + 2/B + 0
- (2/B)·log2(2n/(BC)): each of the log2(n/(BC/2)) phase-2 passes is large enough to wrap around in the cache, so keys are ejected before they are used again; 2 misses per block, thus 2/B misses per key per pass.
- 2/B: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache, with 2 compulsory misses per block.
- 0: the number of iterations is forced to be even, so there is no need to copy the sorted array back to the input array.
26 Tiled mergesort cont.
- The problem: in phase 2, no reuse is achieved across passes, since the set size is larger than the cache.
- The solution: multi-mergesort.
27 2nd Memory optimization: multi-mergesort
- We replace the final log2(n/(BC/2)) merge passes of tiled mergesort with a single pass that merges all the subarrays at once.
- The last pass uses a memory-optimized heap which holds the heads of the subarrays.
- The number of misses per key due to the use of the heap is negligible for practical values of n, B and C.
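The single multi-merge pass can be sketched with a heap of subarray heads; here the standard-library `heapq.merge` plays the role of the paper's memory-optimized heap (an illustrative stand-in, with `tile` again standing in for BC/2 keys):

```python
import heapq

def multi_mergesort(a, tile=4):
    # Phase 1: sort each tile independently (in-cache in the real algorithm)
    runs = [sorted(a[lo:lo + tile]) for lo in range(0, len(a), tile)]
    # Phase 2: one pass that merges all runs at once; heapq.merge keeps
    # the current head of every run in a heap
    return list(heapq.merge(*runs))
```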
28 Multi-mergesort example
- Input: 4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
- Phase 1: mergesort every BC/2 keys:
  4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
- Phase 2: multi-merge all n/(BC/2) subarrays at once:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
29 Multi Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
30 Multi Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  2/B + 2/B
- First 2/B: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache, with 2 compulsory misses per block. The number of iterations is forced to be odd, so that in the next phase we multi-merge keys from the auxiliary array back to the input array.
- Second 2/B: a single pass merges all the n/(BC/2) subarrays at once; 2 compulsory misses per block, thus 2/B misses per key.
31 Performance
[Plot: instructions per key (0-100) vs. set size in keys (10000-100000, cache size marked) for Base, Tiled and Multi mergesort; multi-merging all subarrays in a single pass increases the instruction count]
32 Performance cont.
[Plot: cache misses per key (0-2) vs. set size for Base, Tiled and Multi mergesort. Base shows an increase in cache misses once the set size exceeds the cache size; Multi achieves a constant number of cache misses per key, with 66% fewer misses than Base]
33 Performance cont.
[Plot: time in cycles per key (0-200) vs. set size for Base, Tiled and Multi mergesort. Base has the worst performance, due to its large number of cache misses; Tiled pays for an increase in instruction count; Multi executes up to 55% faster than Base]
34 Quicksort
- A divide-and-conquer algorithm:
  - Choose a pivot.
  - Partition the set around the pivot.
  - Quicksort the left region, then quicksort the right region.
- At the end of the pass, the pivot is in its final position.
[Example: the keys 2 1 7 3 5 6 4 are partitioned around a pivot]
35 Quicksort base algorithm
- An implementation of the optimized quicksort developed by Sedgewick.
- Rather than sorting small subsets in the natural course of the quicksort recursion, they are left unsorted until the very end, at which time they are sorted using insertion sort in a single final pass over the entire array.
36 Insertion sort
- Sort by repeatedly taking the next item and inserting it into the final data structure in its proper order with respect to the items already inserted.
[Example: 2 is inserted into the sorted prefix 1 3 4]
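A minimal in-place insertion sort over a subrange, in the spirit of the final pass described above (the `lo`/`hi` interface is my own choice, so the same routine can later be applied to a small subarray):

```python
def insertion_sort(a, lo=0, hi=None):
    """Insert each key into its proper place among the sorted keys to its left."""
    if hi is None:
        hi = len(a)
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]   # shift larger keys one slot to the right
            j -= 1
        a[j + 1] = key
    return a
```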
37 Quicksort base algorithm cont.
- Quicksort makes sequential passes
  → all keys in a block are always used
  → excellent spatial locality
- Divide and conquer
  → if a subarray is small enough to fit in the cache, quicksort will incur at most 1 cache miss per block before the subset is fully sorted
  → excellent temporal locality
38 1st Memory optimization: memory-tuned quicksort
- Remove Sedgewick's insertion sort in the final pass.
- Instead, sort small subarrays when they are first encountered, using insertion sort.
- Motivation: when a small subarray is encountered, it has just been part of a recent partition → all of its keys should be in the cache.
39 2nd Memory optimization: multi quicksort
- n ≤ BC → 1 cache miss per block.
- Problem: larger sets of keys incur a substantial number of misses.
- Solution: a single multi-partition pass is performed, dividing the full set into a number of subsets which are likely to be cache-sized or smaller.
40 Feller
- If k points are placed randomly in a range of length 1, then
  P(|subrange_i| ≥ X) = (1 - X)^k
41 Multi quicksort cont.
- Multi-partition the array into 3n/(BC) pieces
  → 3n/(BC) - 1 pivots
  → by Feller, P(|subset_i| ≥ BC) = (1 - BC/n)^(3n/(BC) - 1)
- lim(n→∞) (1 - BC/n)^(3n/(BC) - 1) = e^(-3) ≈ 0.05
- → the percentage of subsets that are larger than the cache is less than 5%.
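A sketch of the single multi-partition pass: a few pivots split the keys into buckets that are each likely to be cache-sized, and each bucket is then sorted independently (the `pieces` parameter stands in for 3n/(BC); the crude pivot sampling and Python-list buckets are my own simplifications, not the paper's linked-list implementation):

```python
import bisect

def multi_quicksort(a, pieces=4):
    if len(a) <= pieces:
        return sorted(a)
    step = max(1, len(a) // pieces)
    pivots = sorted(a[::step])[1:pieces]       # pieces - 1 pivots from a sample
    buckets = [[] for _ in range(len(pivots) + 1)]
    for key in a:
        # place the key in the bucket between its bounding pivots
        buckets[bisect.bisect_right(pivots, key)].append(key)
    out = []
    for b in buckets:
        out.extend(sorted(b))                  # sort each small bucket on its own
    return out
```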
42 Memory tuned quicksort analysis
- We analyze the algorithm in two parts:
- 1. Assumption: partitioning an array of size m costs
  - m > BC → m/B misses
  - m ≤ BC → 0 misses
- 2. Correct the assumption: estimate the undercounted and over-counted cache misses.
43 Memory tuned quicksort analysis cont.
- M(n): the expected number of misses
- M(n) = 0, for 0 ≤ n ≤ BC
- M(n) = n/B + (1/n)·Σ_{0 ≤ i ≤ n-1} [M(i) + M(n-i-1)], otherwise
- M(i) and M(n-i-1): the number of misses in the left and right regions.
- 1/n: there are n places to locate the pivot; P(pivot is in the i-th place) = 1/n.
- n/B: by the assumption, partitioning an array of size n > BC costs n/B misses.
44 Memory tuned quicksort analysis cont.
- The recurrence solves to:
- M(n) = 0, for 0 ≤ n ≤ BC
- M(n) = (2(n+1)/B)·ln((n+1)/(BC+2)) + O(1/n), otherwise
45 Memory tuned quicksort analysis cont. - First correction
- We undercounted the misses when a subproblem first reaches size ≤ BC.
- We counted it as 0 misses, but this subproblem may have no part in the cache!
- We add n/B more misses, since there are approximately n keys in ALL the subproblems that first reach size ≤ BC.
46 Memory tuned quicksort analysis cont. - Second correction
- In the very first partitioning there are n/B misses, but not in the subsequent partitionings!
- At the end of a partitioning, some of the array in the LEFT subproblem is still in the cache → there are hits that we counted as misses.
- Note: by the time the algorithm reaches the right subproblem, its data has been removed from the cache.
47 Memory tuned quicksort analysis - Second correction cont.
- N(n): the expected number of subproblems of size > BC
- N(n) = 0, for 0 ≤ n ≤ BC
- N(n) = 1 + (1/n)·Σ_{0 ≤ i ≤ n-1} [N(i) + N(n-i-1)], otherwise
- N(i) and N(n-i-1): the number of subproblems of size > BC in the left / right region.
- 1: n > BC, thus this array itself is a subproblem larger than the cache.
- 1/n: there are n places to locate the pivot; P(pivot is in the i-th place) = 1/n.
48 Memory tuned quicksort analysis - Second correction cont.
- The recurrence solves to:
- N(n) = 0, for 0 ≤ n ≤ BC
- N(n) = (n+1)/(BC+2) - 1, otherwise
49 Memory tuned quicksort analysis - Second correction cont.
- In each subproblem of size n > BC:
[Diagram: during a partition, pointer L progresses right and pointer R progresses left over the array. R can access the cache blocks already loaded, while L eventually accesses blocks that map to the same cache blocks and replaces them. On average, BC/2 keys of the left subproblem (half the cache) are in the cache at the end of the partition]
50 Memory tuned quicksort analysis - Second correction cont.
- Assumption: R points to a key in the block located at the right end of the cache.
- Reminder: this is a direct-mapped cache; the i-th memory block is mapped to cache block i (mod C).
51 Memory tuned quicksort analysis - Second correction cont.
- 2 possible scenarios. The first:
[Diagram: L points to a key in a block mapped i blocks from the left end of the cache; as L progresses it replaces the cached blocks to its right, while R progresses left through the cached blocks. On average, there will be (C/2 + i)/2 hits]
52 Memory tuned quicksort analysis - Second correction cont.
[Diagram: the second scenario: L points to a key in a block mapped inside the cached range; R progresses left through the cached blocks while L replaces the blocks to its right. On average, there will be i + (C/2 - i)/2 = (C/2 + i)/2 hits]
53 Memory tuned quicksort analysis - Second correction cont.
- The average number of hits:
  (1/(C/2))·Σ_{0 ≤ i < C/2} (C/2 + i)/2 = 3C/8
- 1/(C/2): L can start on any of the C/2 blocks with equal probability.
54 Memory tuned quicksort analysis - Second correction cont.
- The number of hits not accounted for in the computation of M(n):
  (3C/8)·N(n)
- M(n): the expected number of misses; 3C/8: the number of hits after a partition; N(n): the expected number of subproblems of size > BC.
55 Memory tuned quicksort analysis - Second correction cont.
- The expected number of misses per key, for n > BC:
  [M(n) + n/B - (3C/8)·N(n)] / n ≈ (2/B)·ln(n/(BC)) + 5/(8B) + 3C/(8n)
56 Base quicksort analysis
- Number of cache misses per key:
  (2/B)·ln(n/(BC)) + 5/(8B) + 3C/(8n) + 1/B
- The first three terms are the same as for memory-tuned quicksort; the extra 1/B is because base quicksort makes an extra pass at the end to perform the insertion sort.
57 Multi quicksort analysis
- Number of cache misses per key, for n ≤ BC:
  1/B
- Compulsory misses only.
58 Multi quicksort analysis cont.
- Number of cache misses per key, for n > BC:
  2/B + 2/B
- First 2/B: we partition the input into k = 3n/(BC) pieces, under the assumption that each piece is smaller than the cache. We hold k linked lists, one for each piece; 100 keys fit in one linked-list node (to minimize storage waste). Each partitioned key is moved to a linked list: 1 miss per block in the input array and 1 miss per block in the linked lists.
- Second 2/B: each partition is returned to the input array and sorted in place.
59 Performance
[Plot: instructions per key (0-150) vs. set size in keys (10000-100000, cache size marked) for Base, Memory tuned and Multi quicksort; the multi-partition adds a constant number of additional instructions per key]
60 Performance cont.
[Plot: cache misses per key (0-1) vs. set size for Base, Memory tuned and Multi quicksort. The multi-partition usually produces subsets smaller than the cache: about 1 miss per key]
61 Performance cont.
[Plot: time in cycles per key (0-200) vs. set size for Base, Memory tuned and Multi quicksort. Multi lags due to its increased instruction cost; if larger sets were sorted, it would have outperformed the other 2 variants]