Title: The Influence of Caches on the Performance of Sorting
1. The Influence of Caches on the Performance of Sorting
2. The Influence of Caches
- The time to service a cache miss has grown
- Algorithms must take into account both instruction count and cache performance
- We will survey four popular sorting algorithms
  - mergesort, quicksort, heapsort, radix sort
  - instruction count, cache misses, overall performance (time)
3. Design for Evaluation
- 32-byte cache blocks
- direct-mapped cache of 2MB
- the size of the set to be sorted: 32KB to 32MB
- C: the number of blocks in the cache
- B: the number of keys that fit in a cache block
- -> BC: the number of keys that fit in the cache
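These constants can be made concrete with a quick calculation (the 8-byte key size is an assumption here, taken from the heapsort slide later in the deck):

```python
BLOCK_BYTES = 32                 # cache block size from the slides
CACHE_BYTES = 2 * 1024 * 1024    # 2MB direct-mapped cache
KEY_BYTES = 8                    # assumed key size (stated on the heapsort slide)

B = BLOCK_BYTES // KEY_BYTES     # B: keys per cache block
C = CACHE_BYTES // BLOCK_BYTES   # C: blocks in the cache
print(B, C, B * C)               # BC: keys that fit in the cache
```

For these parameters, B = 4, C = 65536, and BC = 262144 keys.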
4. Mergesort - Base Algorithm
- Two sorted lists -> a single sorted list
- iterative mergesort makes ⌈log2 n⌉ passes
- the i-th pass merges sorted subarrays of length 2^(i-1) into sorted subarrays of length 2^i
- optimization: sorting subarrays of size 4 with a fast in-line sorting method -> ⌈log2(n/4)⌉ passes
5. Mergesort - Memory Optimizations
- Base mergesort uses each data item only once per pass
- when the set size is larger than BC/2, temporal reuse drops off sharply
- when the set size is larger than BC, no temporal reuse occurs at all
6. Mergesort - Memory Optimizations
- tiled mergesort
  - 1st phase: subarrays of length BC/2 are sorted using mergesort
  - 2nd phase: complete the sorting of the entire array
- multi-mergesort
  - replace the final ⌈log2(n/(BC/2))⌉ merge passes of tiled mergesort with a single pass that merges all of the pieces together at once
- these two methods increase the instruction count
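The two-phase structure can be sketched as follows. This is a toy illustration, not the paper's implementation: `tile_keys` stands in for BC/2, `sorted` for the phase-1 mergesort, and `heapq.merge` for the single multi-way merge pass of multi-mergesort:

```python
import heapq

def multi_mergesort(a, tile_keys=8):
    # Phase 1: sort cache-sized tiles independently (good temporal reuse)
    tiles = [sorted(a[i:i + tile_keys]) for i in range(0, len(a), tile_keys)]
    # Phase 2 (multi-mergesort): merge all tiles in one pass over the data
    return list(heapq.merge(*tiles))
```

The single multi-way merge touches each key once instead of once per merge pass, trading extra instructions (heap management) for fewer cache misses.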
7. Mergesort - Performance
8. Mergesort - Analysis
- Base algorithm
  - n <= BC/2: 2/B misses per key
  - n > BC/2: 2/B * ⌈log2(n/4)⌉ + 1/B + 2/B * (⌈log2(n/4)⌉ mod 2)
    - 2/B * ⌈log2(n/4)⌉: capacity misses during the ⌈log2(n/4)⌉ merge passes
    - 1/B: compulsory misses incurred in the initial pass of sorting into groups of 4
    - 2/B * (⌈log2(n/4)⌉ mod 2): the final copy back when the number of passes is odd
9. Mergesort - Analysis
- tiled mergesort
  - n > BC/2: 2/B * ⌈log2(2n/(BC))⌉ + 2/B
    - 2/B * ⌈log2(2n/(BC))⌉: the number of misses per key for the final ⌈log2(n/(BC/2))⌉ merge passes
    - 2/B: the number of misses per key in sorting into BC/2-sized pieces
- multi-mergesort
  - n > BC/2: 4/B ( = 2/B + 2/B )
    - 2/B: sorting the BC/2-sized pieces, as in tiled mergesort
    - 2/B: merging the pieces stored in the auxiliary array back into the input array
10. Quicksort - Base Algorithm
- A key from the set is chosen as the pivot, and all other keys in the set are partitioned around this pivot
- Sedgewick's algorithm
  - sort small subsets using a faster sorting method
  - all the small unsorted subarrays are left unsorted until the very end, at which time they are sorted in a single pass over the entire array
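A minimal sketch of this scheme (the cutoff value and middle-element pivot are illustrative choices, not the paper's exact tuning):

```python
THRESHOLD = 8  # illustrative small-subarray cutoff

def sedgewick_sort(a):
    """Quicksort that leaves subarrays below THRESHOLD unsorted, then
    finishes with one insertion-sort pass over the whole array."""
    def qsort(lo, hi):
        while hi - lo + 1 > THRESHOLD:
            pivot = a[(lo + hi) // 2]
            i, j = lo, hi
            while i <= j:                      # partition around the pivot
                while a[i] < pivot: i += 1
                while a[j] > pivot: j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1; j -= 1
            # recurse on the smaller side, loop on the larger (bounded stack)
            if j - lo < hi - i:
                qsort(lo, j); lo = i
            else:
                qsort(i, hi); hi = j
    qsort(0, len(a) - 1)
    # final single insertion-sort pass; the array is nearly sorted, so cheap
    for k in range(1, len(a)):
        v, m = a[k], k
        while m > 0 and a[m - 1] > v:
            a[m] = a[m - 1]; m -= 1
        a[m] = v
    return a
```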
11. Quicksort - Memory Optimizations
- Quicksort generally exhibits excellent cache performance since it makes sequential passes through the source array
  -> high spatial locality
- divide-and-conquer
  -> high temporal locality
- if a subset to be sorted is small enough to fit in the cache, it incurs only one cache miss per block
12. Quicksort - Memory Optimizations
- Memory-tuned quicksort
  - removes Sedgewick's elegant insertion sort at the end and instead sorts small subarrays when they are first encountered (while they are still in the cache)
- Multi-quicksort
  - divides the full set into a number of subsets which are likely to be cache-sized or smaller
  - partitions the input array into 3n/(BC) pieces
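A toy sketch of the multi-quicksort idea. Assumptions are labeled: Python lists stand in for the paper's linked lists, `pieces` plays the role of 3n/(BC), and `sorted` stands in for sorting each cache-sized piece in place:

```python
import bisect
import random

def multi_quicksort(a, pieces=4):
    """One distribution pass splits the input into roughly cache-sized
    pieces using sampled pivots; each piece is then sorted independently."""
    if len(a) <= 1:
        return list(a)
    pivots = sorted(random.sample(a, min(pieces - 1, len(a) - 1)))
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:                                # single pass over the input
        buckets[bisect.bisect_left(pivots, x)].append(x)
    out = []
    for b in buckets:                          # each piece now fits in cache
        out.extend(sorted(b))
    return out
```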
13. Quicksort - Performance
14. Quicksort - Analysis
- Memory-tuned quicksort
  - n <= BC: 1/B misses per key (compulsory misses)
  - n > BC: 2/B * ln(n/(BC)) + 5/(8B) + 3C/(8n)
  - M(n): the expected number of misses in partitioning an array of size n
15. Quicksort - Analysis
- the expected number of subproblems of size > BC
- the expected number of hits not accounted for in the computation of M(n)
- the expected number of misses per key
16. Quicksort - Analysis
- Multi-quicksort
  - n <= BC: 1/B misses per key (compulsory misses)
  - n > BC: 4/B
    - 2/B: one miss per block in the input array and one in the linked lists during distribution
    - 2/B: each partition is returned to the input array and sorted in place
17. Heapsort - Base Algorithm
- First build a heap, then remove the elements one at a time in sorted order
- Heap property
  - the value of the parent of i <= the value of i
18. Heapsort - Memory Optimizations
- Memory-tuned heapsort
  - replace the traditional binary heap with a d-heap, where each non-leaf node has d children instead of two
  - d is chosen so that exactly d keys fit in a cache block
  - since the block size is 32 bytes and we are sorting 8-byte keys, d = 4
  - align the heap array in memory so that all d children lie on the same cache block
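A minimal sketch of a 4-heap sift-down and heapsort built on it. Python cannot express the memory alignment itself, so a comment marks the step where alignment would pay off; this is an illustration of the d-heap structure, not the paper's tuned implementation:

```python
def sift_down_dheap(heap, i, d=4):
    """Sift-down in a d-ary max-heap (root at index 0)."""
    n = len(heap)
    while True:
        first = d * i + 1                      # first child of i
        if first >= n:
            return
        # one scan over at most d children: with d keys per block and a
        # suitably aligned array, this touches a single cache block
        largest = max(range(first, min(first + d, n)), key=heap.__getitem__)
        if heap[largest] <= heap[i]:
            return
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest

def dheapsort(a, d=4):
    heap = list(a)
    for i in range(len(heap) // d, -1, -1):    # build-heap
        sift_down_dheap(heap, i, d)
    out = []
    for _ in range(len(heap)):                 # repeatedly remove the max
        heap[0], heap[-1] = heap[-1], heap[0]
        out.append(heap.pop())
        sift_down_dheap(heap, 0, d)
    return out[::-1]
```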
19. Heapsort - Performance
20. Heapsort - Analysis
- n <= BC: 1/B misses per key
- n > BC:
  - the build-heap phase
    - Pr(next leaf is not in the cache) = 1/B
    - Pr(the parent of that leaf is not in the cache) = 1/(dB)
    - ...
    - Σ_i 1/(d^i * B) = d/((d-1)B): the misses incurred per key
    - d*n/((d-1)B): the expected total number of misses
21. Radixsort - Base Algorithm
- Non-comparison sort
- uses counting sort
- two n-key arrays, plus a count array of size 2^r (r is the radix width in bits)
- if we are sorting b-bit integers with a radix of r, the method does ⌈b/r⌉ iterations, each with 2 passes over the source array
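The base algorithm can be sketched as an LSD counting-sort loop (the values b = 16 and r = 4 are illustrative defaults):

```python
def radix_sort(keys, b=16, r=4):
    """LSD radix sort of b-bit integers with radix 2**r: ceil(b/r)
    counting-sort iterations, each making two passes over the source array."""
    n = len(keys)
    src, dst = list(keys), [0] * n
    mask = (1 << r) - 1
    for it in range((b + r - 1) // r):         # ceil(b/r) iterations
        shift = it * r
        count = [0] * (1 << r)
        for k in src:                          # pass 1: accumulate counts
            count[(k >> shift) & mask] += 1
        offset, total = [0] * (1 << r), 0
        for digit in range(1 << r):            # prefix sums -> offsets
            offset[digit] = total
            total += count[digit]
        for k in src:                          # pass 2: permute into dst
            digit = (k >> shift) & mask
            dst[offset[digit]] = k
            offset[digit] += 1
        src, dst = dst, src
    return src
```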
22. Radixsort - Base Algorithm
- 1st phase
  - accumulates counts of the number of keys with each radix digit
  - the counts are used to determine the offsets of the keys of each digit in the destination array
- 2nd phase
  - moves the keys from the source array to the destination array according to the offsets
23. Radixsort - Base Algorithm
- An improvement to reduce the number of passes over the source array
  - accumulate the counts for the (i+1)-st iteration at the same time as moving keys during the i-th iteration
  - requires a second count array of size 2^r
  - has a positive effect on both instruction count and cache misses
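The improvement can be sketched as a variant of the base loop in which the counting pass is fused into the previous iteration's move pass, using the second count array (again with illustrative b = 16, r = 4):

```python
def radix_sort_fused(keys, b=16, r=4):
    """LSD radix sort where each move pass also accumulates the counts
    for the next iteration's digit, halving the passes over the data."""
    n, mask = len(keys), (1 << r) - 1
    iters = (b + r - 1) // r
    src, dst = list(keys), [0] * n
    count = [0] * (1 << r)
    for k in src:                              # one initial counting pass
        count[k & mask] += 1
    for it in range(iters):
        shift = it * r
        offset, total = [0] * (1 << r), 0
        for digit in range(1 << r):            # prefix sums -> offsets
            offset[digit] = total
            total += count[digit]
        next_count = [0] * (1 << r)            # the second count array
        next_shift = shift + r
        for k in src:                          # move AND count next digit
            digit = (k >> shift) & mask
            dst[offset[digit]] = k
            offset[digit] += 1
            if it + 1 < iters:
                next_count[(k >> next_shift) & mask] += 1
        count = next_count
        src, dst = dst, src
    return src
```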
24. Radixsort - Performance
25. Radixsort - Analysis
- Parameters to consider
  - b: the number of bits per key
  - r: the radix width in bits
  - A: the number of counts that fit in a cache block
- Assumptions
  - 2^(r+1) <= AC (the size of the two count arrays <= the cache capacity)
  - r <= b
26. Radixsort - Analysis
- For n > BC: M = Mtrav + Mcount + Mdest
  - Mtrav: the number of misses per key incurred in traversing the source and destination arrays
  - Mcount: the number of misses per key incurred in accesses to the count arrays
  - Mdest: the number of misses per key in the destination array caused by conflicts with the traversal of the source array
27. Radixsort - Analysis
- Mtrav = ( 2⌈b/r⌉ + 1 + 2(⌈b/r⌉ mod 2) ) / B
  - an initial pass over the input array during which counts are accumulated
  - ⌈b/r⌉ iterations, in each of which the source array is moved to the destination array
  - if ⌈b/r⌉ is odd, the final sorted array must be copied back into the input array
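Plugging in illustrative values (64-bit keys, matching the deck's 8-byte keys and B = 4; r = 8 is an assumed radix):

```python
import math

B = 4                  # keys per 32-byte block for 8-byte keys
b, r = 64, 8           # assumed: 64-bit keys, radix 2**8
passes = math.ceil(b / r)                          # 8 iterations (even)
m_trav = (2 * passes + 1 + 2 * (passes % 2)) / B   # (16 + 1 + 0) / 4
print(m_trav)                                      # 4.25 misses per key
```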
28. Radixsort - Analysis
- Mcount
  - the expected number of misses per key incurred by a traversal over the source array in a count array of size 2^m
29. Radixsort - Analysis
- Mdest
  - the probability that the traversal will visit the cache block of x before the next visit to x by the destination array
30. Discussion - Performance Comparison
31. Discussion - Robustness
- Tiled mergesort
- Memory-tuned heapsort
32. Discussion - Page Mapping Policies
- We have assumed that a block of contiguous pages in the virtual address space maps to a block of contiguous pages in the cache
  -> a virtually indexed cache
- tiled mergesort relies heavily on the assumption that a cache-sized block of pages does not conflict in the cache
- the speed of tiled mergesort therefore relies heavily on the quality of the operating system's page-mapping decisions
- Solaris and Digital Unix make cache-conscious decisions about page placement
33. Conclusions
- Despite its very low instruction count, radix sort is outperformed by both mergesort and quicksort due to its relatively poor locality
- Despite the fact that multi-mergesort executes 75% more instructions than the base mergesort, it sorts up to 70% faster
- The improvements for quicksort are modest
34. Conclusions
- Improving cache performance, even at the cost of an increased instruction count, can improve overall performance
- Knowing and using architectural constants such as the cache size and block size can improve an algorithm's memory-system performance beyond that of a generic algorithm
35. Conclusions
- Spatial locality can be improved by adjusting an algorithm's structure to fully utilize cache blocks
- Spatial locality can also be improved by padding and adjusting data layout so that structures are aligned within cache blocks
36. Conclusions
- Capacity misses can be reduced by processing large data sets in cache-sized pieces
- Conflict misses can be reduced by processing data in cache-block-sized pieces