1 The Influence of Caches on the Performance of Sorting
- By A. LaMarca and R. Ladner
- Presenter: Shai Brandes
2 Introduction
- Sorting is one of the most important operations performed by computers.
- In the days of magnetic-tape storage, before modern databases, it was almost certainly the most common operation performed by computers, as most "database" updating was done by sorting transactions and merging them with a master file.
3 Introduction cont.
- Since the introduction of caches, main memory has continued to grow slower relative to processor cycle times.
- The time to service a cache miss has grown to 100 cycles and more.
- Cache miss penalties have grown to the point where good overall performance cannot be achieved without good cache performance.
4 Introduction cont.
- In the article, the authors investigate, both experimentally and analytically, the potential performance gains that cache-conscious design offers in improving the performance of several sorting algorithms.
5 Introduction cont.
- For each algorithm, an implementation variant with potential for good overall performance was chosen.
- Then, the algorithm was optimized using traditional techniques to minimize the number of instructions executed. This algorithm forms the baseline for comparison.
- Memory optimizations were then applied to the baseline comparison-sort algorithm, in order to improve cache performance.
6 Performance measures
- The authors concentrate on three performance measures:
  - Instruction count
  - Cache misses
  - Overall performance
- The analyses presented here are only approximations, since cache misses cannot be analyzed precisely, due to factors such as variation in process scheduling and the operating system's virtual-to-physical page-mapping policy.
7 Main lesson
- The main lesson of the article is that, because cache miss penalties are growing larger with each new generation of processors, improving an algorithm's overall performance may require increasing the number of instructions executed while, at the same time, reducing the number of cache misses.
8 Design parameters of caches
- Capacity: the total number of blocks the cache can hold.
- Block size: the number of bytes that are loaded from and written to memory at a time.
- Associativity: in an N-way set-associative cache, a particular block can be loaded into N different cache locations.
- Replacement policy: which block is removed from the cache when a new block is loaded.
9 Which cache are we investigating?
- In modern machines, more than one cache is placed between the main memory and the processor.
[Diagram: processor - caches - main memory; a cache may be fully associative, N-way set associative, or direct-mapped]
10 Which cache are we investigating?
- The largest miss penalty is typically incurred by the cache closest to the main memory, which is usually direct-mapped.
- Thus, we will focus on improving the performance of direct-mapped caches.
11 Improve the cache hit ratio
- Temporal locality: there is a good chance that recently accessed data will be accessed again in the near future.
- Spatial locality: there is a good chance that subsequently accessed data items are located near each other in memory.
12 Cache misses
- Compulsory miss: occurs when a block is first accessed and loaded into the cache.
- Capacity miss: caused by the fact that the cache is not large enough to hold all the accessed blocks at one time.
- Conflict miss: occurs when two or more blocks that are mapped to the same location in the cache are accessed.
13 Measurements
- n: the number of keys to be sorted
- C: the number of blocks in the cache
- B: the number of keys that fit in a cache block
[Diagram: one cache block holds B keys]
14 Mergesort
- Two sorted lists can be merged into a single sorted list by repeatedly moving the smaller of the two head keys to the output list.
[Example: merging the sorted lists 1 3 6 and 2 4 5 yields 1 2 3 4 5 6]
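The merge step described above can be sketched as follows (a minimal illustration; the function name and list-based interface are my own, not the paper's):

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller head key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # one side is exhausted; append the remainder
    out.extend(right[j:])
    return out
```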
15 Mergesort
- By treating a set of unordered keys as a set of sorted lists of length 1, the keys can be repeatedly merged together until a single sorted set of keys remains.
- The iterative mergesort was chosen as the base algorithm.
16 Mergesort base algorithm
- Mergesort makes log2(n) passes over the array, where the i-th pass merges sorted subarrays of length 2^(i-1) into sorted subarrays of length 2^i.
[Example: starting from 1 4 8 3, pass i=1 merges runs of length 1 into 1 4 | 3 8, and pass i=2 merges those into 1 3 4 8]
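The iterative (bottom-up) scheme above can be sketched roughly like this, alternating between the two arrays to avoid copying (an illustrative sketch, not the paper's optimized implementation with in-line sorting of 4-key groups and loop unrolling):

```python
def iterative_mergesort(a):
    n = len(a)
    src, dst = list(a), [0] * n
    width = 1
    while width < n:
        # one pass: merge sorted runs of length `width` into runs of 2*width
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j = lo, mid
            for k in range(lo, hi):
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src  # alternate array roles instead of copying back
        width *= 2
    return src
```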
17 Improvements to the base algorithm
- Alternating the merging process from one array to another, to avoid unnecessary copying.
- Loop unrolling.
- Sorting subarrays of size 4 with a fast in-line sorting method; thus, the number of passes is log2(n/4).
- If this number is odd, then an additional copy pass is needed to return the sorted array to the input array.
18 The problem with the algorithm
- The base mergesort has the potential for terrible cache performance: if a pass is large enough to wrap around in the cache, keys will be ejected before they are used again.
- n ≤ BC/2 → the entire sort is performed in the cache; only compulsory misses.
- BC/2 < n ≤ BC → temporal reuse drops off sharply.
- BC < n → no temporal reuse.
- In each pass:
  - each block is accessed in the input array (read/write)
  - each block is accessed in the auxiliary array (write/read)
  - → 2 cache misses per block → 2/B cache misses per key
19 n ≤ BC/2
[Diagram: an input array of n keys and an auxiliary array. In pass i=1, each block is read from the input array (1 miss) and written to the auxiliary array (1 miss), and likewise in the reverse direction: 4 cache misses in total for the blocks shown. Since n ≤ BC/2, both arrays fit in the cache together, so pass i=2 finds its blocks already cached and incurs no cache misses!]
20 Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
21 Base Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  (2/B)·log2(n/4) + 1/B + (2/B)·(log2(n/4) mod 2)
- (2/B)·log2(n/4): there are log2(n/4) merge passes. In each pass, each key is moved from a source array to a destination array; every B-th key visited in the source array results in one cache miss, and every B-th key written to the destination array results in one cache miss.
- 1/B: the initial pass sorts groups of 4 keys, with 1 compulsory miss per block, thus 1/B misses per key.
- (2/B)·(log2(n/4) mod 2): if the number of iterations is odd, we need to copy the sorted array back to the input array.
22 1st Memory optimization: Tiled mergesort
- Improve temporal locality:
  - Phase 1: subarrays of length BC/2 are sorted using mergesort.
  - Phase 2: the arrays are returned to the base mergesort to complete the sorting of the entire array.
- Avoiding the final copy when log2(n/4) is odd: subarrays of size 2 are sorted in-line if log2(n) is odd, forcing the number of merge passes to be even.
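A rough sketch of the two phases, with a small `tile` parameter standing in for BC/2 keys (Python's built-in `sorted` stands in for the in-cache mergesort of phase 1; this illustrates the structure, not the cache behavior):

```python
def tiled_mergesort(a, tile=4):
    n = len(a)
    src, dst = list(a), [0] * n
    # Phase 1: sort each cache-sized tile independently
    for lo in range(0, n, tile):
        src[lo:lo + tile] = sorted(src[lo:lo + tile])
    # Phase 2: ordinary merge passes over the pre-sorted tiles
    width = tile
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j = lo, mid
            for k in range(lo, hi):
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src
        width *= 2
    return src
```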
23 Tiled mergesort example
- Input: 4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
- Phase 1: mergesort every BC/2 keys:
  4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
- Phase 2: regular mergesort:
  3 4 6 10 11 13 15 16 | 1 2 5 7 8 9 12 14
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
24 Tiled Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
25 Tiled Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  (2/B)·log2(2n/(BC)) + 2/B + 0
- (2/B)·log2(2n/(BC)): each of the log2(n/(BC/2)) phase-2 passes is large enough to wrap around in the cache, so keys are ejected before they are used again; 2 misses per block, thus 2/B misses per key per pass.
- 2/B: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache, with 2 compulsory misses per block.
- 0: the number of iterations is forced to be even, so there is no need to copy the sorted array back to the input array.
26 Tiled mergesort cont.
- The problem: in phase 2, no reuse is achieved across passes, since the set size is larger than the cache.
- The solution: multi-mergesort.
27 2nd Memory optimization: multi-mergesort
- We replace the final log2(n/(BC/2)) merge passes of tiled mergesort with a single pass that merges all the subarrays at once.
- The last pass uses a memory-optimized heap which holds the heads of the subarrays.
- The number of misses per key due to the use of the heap is negligible for practical values of n, B and C.
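The single multi-merge pass can be sketched with a heap of subarray heads; here the standard-library `heapq.merge` plays the role of the paper's memory-optimized heap (an illustrative stand-in, with `tile` again standing in for BC/2 keys):

```python
import heapq

def multi_mergesort(a, tile=4):
    # Phase 1: sort each tile independently (in-cache in the real algorithm)
    runs = [sorted(a[lo:lo + tile]) for lo in range(0, len(a), tile)]
    # Phase 2: one pass that merges all runs at once; heapq.merge keeps
    # the current head of every run in a heap
    return list(heapq.merge(*runs))
```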
28 Multi-mergesort example
- Input: 4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
- Phase 1: mergesort every BC/2 keys:
  4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
- Phase 2: multi-merge all n/(BC/2) subarrays at once:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
29 Multi Mergesort analysis
- For n ≤ BC/2 → 2/B misses per key.
- The entire sort is performed in the cache; only compulsory misses.
30 Multi Mergesort analysis cont.
- For BC/2 < n, the number of misses per key is
  2/B + 2/B
- First 2/B: the initial pass mergesorts groups of BC/2 keys; each merge is done in the cache, with 2 compulsory misses per block. The number of iterations is forced to be odd, so that in the next phase we multi-merge keys from the auxiliary array back to the input array.
- Second 2/B: a single pass merges all the n/(BC/2) subarrays at once; 2 compulsory misses per block, thus 2/B misses per key.
31 Performance
[Plot: instructions per key (0-100) vs. set size in keys (10000-100000, cache size marked) for Base, Tiled and Multi mergesort; multi-merging all subarrays in a single pass increases the instruction count]
32 Performance cont.
[Plot: cache misses per key (0-2) vs. set size for Base, Tiled and Multi mergesort. Base shows an increase in cache misses once the set size exceeds the cache size; Multi achieves a constant number of cache misses per key, with 66% fewer misses than Base]
33 Performance cont.
[Plot: time in cycles per key (0-200) vs. set size for Base, Tiled and Multi mergesort. Base has the worst performance, due to its large number of cache misses; Tiled pays for an increase in instruction count; Multi executes up to 55% faster than Base]
34 Quicksort
- A divide-and-conquer algorithm:
  - Choose a pivot.
  - Partition the set around the pivot.
  - Quicksort the left region, then quicksort the right region.
- At the end of the pass, the pivot is in its final position.
[Example: the keys 2 1 7 3 5 6 4 are partitioned around a pivot]
35 Quicksort base algorithm
- An implementation of the optimized quicksort developed by Sedgewick.
- Rather than sorting small subsets in the natural course of the quicksort recursion, they are left unsorted until the very end, at which time they are sorted using insertion sort in a single final pass over the entire array.
36 Insertion sort
- Sort by repeatedly taking the next item and inserting it into the final data structure in its proper order with respect to the items already inserted.
[Example: 2 is inserted into the sorted prefix 1 3 4]
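A minimal in-place insertion sort over a subrange, in the spirit of the final pass described above (the `lo`/`hi` interface is my own choice, so the same routine can later be applied to a small subarray):

```python
def insertion_sort(a, lo=0, hi=None):
    """Insert each key into its proper place among the sorted keys to its left."""
    if hi is None:
        hi = len(a)
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]   # shift larger keys one slot to the right
            j -= 1
        a[j + 1] = key
    return a
```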
37 Quicksort base algorithm cont.
- Quicksort makes sequential passes
  → all keys in a block are always used
  → excellent spatial locality
- Divide and conquer
  → if a subarray is small enough to fit in the cache, quicksort will incur at most 1 cache miss per block before the subset is fully sorted
  → excellent temporal locality
38 1st Memory optimization: memory-tuned quicksort
- Remove Sedgewick's insertion sort in the final pass.
- Instead, sort small subarrays when they are first encountered, using insertion sort.
- Motivation: when a small subarray is encountered, it has just been part of a recent partition → all of its keys should be in the cache.
39 2nd Memory optimization: multi quicksort
- n ≤ BC → 1 cache miss per block.
- Problem: larger sets of keys incur a substantial number of misses.
- Solution: a single multi-partition pass is performed, dividing the full set into a number of subsets which are likely to be cache-sized or smaller.
40 Feller
- If k points are placed randomly in a range of length 1, then
  P(|subrange_i| ≥ X) = (1 - X)^k
41 Multi quicksort cont.
- Multi-partition the array into 3n/(BC) pieces
  → 3n/(BC) - 1 pivots
  → by Feller, P(|subset_i| ≥ BC) = (1 - BC/n)^(3n/(BC) - 1)
- lim(n→∞) (1 - BC/n)^(3n/(BC) - 1) = e^(-3) ≈ 0.05
- → the percentage of subsets that are larger than the cache is less than 5%.
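A sketch of the single multi-partition pass: a few pivots split the keys into buckets that are each likely to be cache-sized, and each bucket is then sorted independently (the `pieces` parameter stands in for 3n/(BC); the crude pivot sampling and Python-list buckets are my own simplifications, not the paper's linked-list implementation):

```python
import bisect

def multi_quicksort(a, pieces=4):
    if len(a) <= pieces:
        return sorted(a)
    step = max(1, len(a) // pieces)
    pivots = sorted(a[::step])[1:pieces]       # pieces - 1 pivots from a sample
    buckets = [[] for _ in range(len(pivots) + 1)]
    for key in a:
        # place the key in the bucket between its bounding pivots
        buckets[bisect.bisect_right(pivots, key)].append(key)
    out = []
    for b in buckets:
        out.extend(sorted(b))                  # sort each small bucket on its own
    return out
```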
42 Memory tuned quicksort analysis
- We analyze the algorithm in two parts:
- 1. Assumption: partitioning an array of size m costs
  - m > BC → m/B misses
  - m ≤ BC → 0 misses
- 2. Correct the assumption: estimate the undercounted and over-counted cache misses.
43 Memory tuned quicksort analysis cont.
- M(n): the expected number of misses
- M(n) = 0, for 0 ≤ n ≤ BC
- M(n) = n/B + (1/n)·Σ_{0 ≤ i ≤ n-1} [M(i) + M(n-i-1)], otherwise
- M(i) and M(n-i-1): the number of misses in the left and right regions.
- 1/n: there are n places to locate the pivot; P(pivot is in the i-th place) = 1/n.
- n/B: by the assumption, partitioning an array of size n > BC costs n/B misses.
44 Memory tuned quicksort analysis cont.
- The recurrence solves to:
- M(n) = 0, for 0 ≤ n ≤ BC
- M(n) = (2(n+1)/B)·ln((n+1)/(BC+2)) + O(1/n), otherwise
45 Memory tuned quicksort analysis cont. - First correction
- We undercounted the misses when a subproblem first reaches size ≤ BC.
- We counted it as 0 misses, but this subproblem may have no part in the cache!
- We add n/B more misses, since there are approximately n keys in ALL the subproblems that first reach size ≤ BC.
46 Memory tuned quicksort analysis cont. - Second correction
- In the very first partitioning there are n/B misses, but not in the subsequent partitionings!
- At the end of a partitioning, some of the array in the LEFT subproblem is still in the cache → there are hits that we counted as misses.
- Note: by the time the algorithm reaches the right subproblem, its data has been removed from the cache.
47 Memory tuned quicksort analysis - Second correction cont.
- N(n): the expected number of subproblems of size > BC
- N(n) = 0, for 0 ≤ n ≤ BC
- N(n) = 1 + (1/n)·Σ_{0 ≤ i ≤ n-1} [N(i) + N(n-i-1)], otherwise
- N(i) and N(n-i-1): the number of subproblems of size > BC in the left / right region.
- 1: n > BC, thus this array itself is a subproblem larger than the cache.
- 1/n: there are n places to locate the pivot; P(pivot is in the i-th place) = 1/n.
48 Memory tuned quicksort analysis - Second correction cont.
- The recurrence solves to:
- N(n) = 0, for 0 ≤ n ≤ BC
- N(n) = (n+1)/(BC+2) - 1, otherwise
49 Memory tuned quicksort analysis - Second correction cont.
- In each subproblem of size n > BC:
[Diagram: during a partition, pointer L progresses right and pointer R progresses left over the array. R can access the cache blocks already loaded, while L eventually accesses blocks that map to the same cache blocks and replaces them. On average, BC/2 keys of the left subproblem (half the cache) are in the cache at the end of the partition]
50 Memory tuned quicksort analysis - Second correction cont.
- Assumption: R points to a key in the block located at the right end of the cache.
- Reminder: this is a direct-mapped cache; the i-th memory block is mapped to cache block i (mod C).
51 Memory tuned quicksort analysis - Second correction cont.
- 2 possible scenarios. The first:
[Diagram: L points to a key in a block mapped i blocks from the left end of the cache; as L progresses it replaces the cached blocks to its right, while R progresses left through the cached blocks. On average, there will be (C/2 + i)/2 hits]
52 Memory tuned quicksort analysis - Second correction cont.
[Diagram: the second scenario: L points to a key in a block mapped inside the cached range; R progresses left through the cached blocks while L replaces the blocks to its right. On average, there will be i + (C/2 - i)/2 = (C/2 + i)/2 hits]
53 Memory tuned quicksort analysis - Second correction cont.
- The average number of hits:
  (1/(C/2))·Σ_{0 ≤ i < C/2} (C/2 + i)/2 = 3C/8
- 1/(C/2): L can start on any of the C/2 blocks with equal probability.
54 Memory tuned quicksort analysis - Second correction cont.
- The number of hits not accounted for in the computation of M(n):
  (3C/8)·N(n)
- M(n): the expected number of misses; 3C/8: the number of hits after a partition; N(n): the expected number of subproblems of size > BC.
55 Memory tuned quicksort analysis - Second correction cont.
- The expected number of misses per key, for n > BC:
  [M(n) + n/B - (3C/8)·N(n)] / n ≈ (2/B)·ln(n/(BC)) + 5/(8B) + 3C/(8n)
56 Base quicksort analysis
- Number of cache misses per key:
  (2/B)·ln(n/(BC)) + 5/(8B) + 3C/(8n) + 1/B
- The first three terms are the same as for memory-tuned quicksort; the extra 1/B is because base quicksort makes an extra pass at the end to perform the insertion sort.
57 Multi quicksort analysis
- Number of cache misses per key, for n ≤ BC:
  1/B
- Compulsory misses only.
58 Multi quicksort analysis cont.
- Number of cache misses per key, for n > BC:
  2/B + 2/B
- First 2/B: we partition the input into k = 3n/(BC) pieces, under the assumption that each piece is smaller than the cache. We hold k linked lists, one for each piece; 100 keys fit in one linked-list node (to minimize storage waste). Each partitioned key is moved to a linked list: 1 miss per block in the input array and 1 miss per block in the linked lists.
- Second 2/B: each partition is returned to the input array and sorted in place.
59 Performance
[Plot: instructions per key (0-150) vs. set size in keys (10000-100000, cache size marked) for Base, Memory tuned and Multi quicksort; the multi-partition adds a constant number of additional instructions per key]
60 Performance cont.
[Plot: cache misses per key (0-1) vs. set size for Base, Memory tuned and Multi quicksort. The multi-partition usually produces subsets smaller than the cache: about 1 miss per key]
61 Performance cont.
[Plot: time in cycles per key (0-200) vs. set size for Base, Memory tuned and Multi quicksort. Multi lags due to its increased instruction cost; if larger sets were sorted, it would have outperformed the other 2 variants]