The Influence of Caches on the Performance of Sorting - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
The Influence of Caches on the Performance of
Sorting
  • 98. 5. 8

2
The Influence of Caches
  • The time to service a cache miss has grown
    relative to processor speed
  • Algorithms must take into account both
    instruction count and cache performance
  • We survey four popular sorting algorithms:
    mergesort, quicksort, heapsort, radixsort
  • comparing instruction count, cache misses, and
    overall performance (time)

3
Design for Evaluation
  • 32-byte cache blocks
  • 2 MB direct-mapped cache
  • size of the set to be sorted: 32 KB to 32 MB
  • C: the number of blocks in the cache
  • B: the number of keys that fit in a cache block
  • => BC: the number of keys that fit in the cache

4
Mergesort - Base Algorithm
  • Two sorted lists -> a single sorted list
  • iterative mergesort makes ⌈log2 n⌉ passes
  • i-th pass merges sorted subarrays of length
    2^(i-1) into sorted subarrays of length 2^i
  • optimization: sort subarrays of size 4 with a
    fast in-line sorting method
  • => ⌈log2(n/4)⌉ passes
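The pass structure above can be sketched in Python (an illustrative model, not the paper's implementation; the fast in-line sort of size-4 groups is approximated here with `sorted`):

```python
def merge(src, dst, lo, mid, hi):
    """Merge sorted runs src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or src[i] <= src[j]):
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1

def iterative_mergesort(a):
    """Sort groups of 4 in place, then make merge passes with runs of
    doubling length, ping-ponging between the array and a temporary."""
    n = len(a)
    for lo in range(0, n, 4):            # initial pass: sort groups of 4
        a[lo:lo + 4] = sorted(a[lo:lo + 4])
    src, dst = a, a[:]
    width = 4
    while width < n:                     # one merge pass per iteration
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            merge(src, dst, lo, mid, hi)
        src, dst = dst, src
        width *= 2
    return src
```

Each iteration of the `while` loop is one merge pass, so after the initial size-4 pass there are ⌈log2(n/4)⌉ passes, as the slide states.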

5
Mergesort - Memory Optimizations
  • Base mergesort uses each data item only once per
    pass
  • when the set size is larger than BC/2, temporal
    reuse drops off sharply
  • when the set size is larger than BC, no temporal
    reuse occurs at all

6
Mergesort - Memory Optimizations
  • tiled mergesort
  • 1st phase: subarrays of length BC/2 are sorted
    using mergesort
  • 2nd phase: complete the sorting of the entire
    array with ordinary merge passes
  • multi-mergesort
  • replace the final ⌈log2(n/(BC/2))⌉ merge passes
    of tiled mergesort with a single pass that merges
    all of the pieces together at once
  • both methods increase the instruction count
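The two phases of tiled mergesort can be sketched as follows (hypothetical Python; `cache_keys` stands in for BC/2 and is an assumed parameter, not a value from the slides):

```python
def tiled_mergesort(a, cache_keys=1024):
    """Phase 1: sort cache-sized tiles independently, so each tile is
    sorted entirely in-cache. Phase 2: ordinary merge passes over the
    sorted tiles until a single sorted run remains."""
    n = len(a)
    for lo in range(0, n, cache_keys):           # phase 1: per-tile sort
        a[lo:lo + cache_keys] = sorted(a[lo:lo + cache_keys])
    src, dst = a, a[:]
    width = cache_keys
    while width < n:                             # phase 2: merge passes
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j = lo, mid
            for k in range(lo, hi):
                if i < mid and (j >= hi or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src
        width *= 2
    return src
```

Multi-mergesort would replace the whole phase-2 loop with a single multi-way merge of all the tiles at once, trading extra instructions for one pass over the data.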

7
Mergesort - Performance
8
Mergesort - Analysis
  • Base algorithm
  • n ≤ BC/2: 2/B misses per key
  • n > BC/2: 2/B·⌈log2(n/4)⌉ + 1/B + 2/B·(⌈log2(n/4)⌉
    mod 2)
  • 2/B·⌈log2(n/4)⌉: capacity misses during the
    ⌈log2(n/4)⌉ merge passes
  • 1/B: compulsory misses incurred in the initial
    pass of sorting into groups of 4
  • 2/B·(⌈log2(n/4)⌉ mod 2): final copy back when the
    number of passes is odd

9
Mergesort - Analysis
  • tiled mergesort
  • n > BC/2: 2/B·⌈log2(2n/BC)⌉ + 2/B
  • 2/B·⌈log2(2n/BC)⌉: the number of misses per key
    for the final ⌈log2(n/(BC/2))⌉ merge passes
  • 2/B: the number of misses per key in sorting
    into BC/2-sized pieces
  • multi-mergesort
  • n > BC/2: 4/B ( = 2/B + 2/B )
  • 2/B: as in tiled mergesort (sorting the pieces)
  • 2/B: merging the pieces stored in the auxiliary
    array back into the input array

10
Quicksort - Base Algorithm
  • A key from the set is chosen as the pivot, and
    all other keys in the set are partitioned around
    this pivot
  • Sedgewick's algorithm
  • sort small subsets using a faster sorting method
  • all small unsorted subarrays are left unsorted
    until the very end, at which time they are sorted
    in a single pass over the entire array

11
Quicksort - Memory Optimizations
  • Quicksort generally exhibits excellent cache
    performance since it makes sequential passes
    through the source array
  • => high spatial locality
  • divide-and-conquer
  • => high temporal locality
  • if a subset to be sorted is small enough to fit
    in the cache, only one cache miss per block

12
Quicksort - Memory Optimizations
  • Memory-tuned quicksort
  • removes Sedgewick's elegant insertion sort at the
    end; instead sorts small subarrays when they are
    first encountered (while they are still cached)
  • Multi-quicksort
  • divide the full set into a number of subsets
    that are likely to be cache-sized or smaller
  • partition the input array into 3n/(BC) pieces
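Memory-tuned quicksort can be sketched like this (illustrative Python; the threshold of 16 and the middle-element pivot are assumptions for the sketch, not the paper's exact choices):

```python
def insertion_sort(a, lo, hi):
    """Insertion-sort a[lo..hi] inclusive."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def memory_tuned_quicksort(a, lo=0, hi=None):
    """Quicksort that sorts small subarrays immediately, while their
    cache blocks are still resident, instead of deferring them to one
    final insertion-sort pass."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo < 16:
        insertion_sort(a, lo, hi)        # sort now, while cached
        return
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                        # Hoare-style partition
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    memory_tuned_quicksort(a, lo, j)
    memory_tuned_quicksort(a, i, hi)
```

Multi-quicksort would add one extra distribution step before this: split the input into roughly cache-sized pieces first, then run the routine above on each piece.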

13
Quicksort - Performance
14
Quicksort - Analysis
  • Memory-tuned quicksort
  • n ≤ BC: 1/B (compulsory misses)
  • n > BC: 2/B·ln(n/BC) + 5/(8B) + 3C/(8n)
  • the expected number of misses in partitioning an
    array of size n

15
Quicksort - Analysis
  • The expected number of subproblems of size > BC
  • The expected number of hits not accounted for in
    the computation of M(n)
  • The expected number of misses per key

16
Quicksort - Analysis
  • Multi-quicksort
  • n ≤ BC: 1/B (compulsory misses)
  • n > BC: 4/B
  • one miss in the input array and one miss in the
    linked list
  • each partition is returned to the input array and
    sorted in place

17
Heapsort - Base Algorithm
  • First build a heap, then remove the keys one at a
    time in sorted order
  • Heap property
  • the value of the parent of i ≥ the value of i

18
Heapsort - Memory Optimizations
  • Memory-tuned heapsort
  • replace the traditional binary heap with a d-heap,
    where each non-leaf node has d children instead
    of two
  • d is chosen so that exactly d keys fit in a cache
    block
  • since the block size is 32 bytes and we are
    sorting 8-byte keys, d = 4
  • align the heap array in memory so that all d
    children lie on the same cache block
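A d-heap with the children of node i stored at positions d·i+1 … d·i+d can be sketched as follows (illustrative Python; a real implementation would also align the array so each group of d children shares one cache block, which Python cannot express):

```python
class DHeap:
    """Max-heap with branching factor d. With 8-byte keys and 32-byte
    cache blocks, d = 4 lets all d children of a node share one block."""

    def __init__(self, d=4):
        self.d = d
        self.a = []

    def push(self, key):
        a, d = self.a, self.d
        a.append(key)
        i = len(a) - 1
        while i > 0 and a[(i - 1) // d] < a[i]:   # sift up past smaller parents
            p = (i - 1) // d
            a[i], a[p] = a[p], a[i]
            i = p

    def pop(self):
        a, d = self.a, self.d
        top = a[0]
        last = a.pop()
        if a:
            a[0] = last
            i = 0
            while True:
                first = d * i + 1                 # children: d*i+1 .. d*i+d
                if first >= len(a):
                    break
                c = max(range(first, min(first + d, len(a))),
                        key=lambda k: a[k])       # largest of up to d children
                if a[c] <= a[i]:
                    break
                a[i], a[c] = a[c], a[i]           # sift down
                i = c
        return top
```

Pushing all keys and popping them repeatedly yields the keys in descending order, which is the heapsort phase; each sift-down level touches one block of d children rather than two scattered children.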

19
Heapsort - Performance
20
Heapsort - Analysis
  • n ≤ BC: 1/B misses per key
  • n > BC:
  • the build-heap phase
  • Pr(next leaf is not in the cache) = 1/B
  • Pr(the parent of that leaf is not in the cache)
    = 1/(dB)
  • ...
  • Σ 1/(d^i·B) = d/((d-1)B): the misses incurred per
    key
  • dn/((d-1)B): the expected number of misses
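The sum in the bullet above is a geometric series over the levels of the heap; written out:

```latex
\sum_{i=0}^{\infty} \frac{1}{d^{i}B}
  \;=\; \frac{1}{B}\cdot\frac{1}{1 - 1/d}
  \;=\; \frac{d}{(d-1)B}
```

Multiplying these per-key misses by the n inserted keys gives the expected dn/((d-1)B) misses for the build-heap phase.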

21
Radixsort - Base Algorithm
  • Non-comparison sort
  • uses counting sort
  • two n-key arrays, plus a count array of size 2^r
    ( r is the radix width in bits )
  • if we are sorting b-bit integers with a radix of
    r bits, the radix method does ⌈b/r⌉ iterations,
    each with 2 passes over the source array

22
Radixsort - Base Algorithm
  • 1st phase
  • accumulates counts of the number of keys with
    each radix digit
  • the counts are used to determine the offsets of
    the keys of each radix digit in the destination
    array
  • 2nd phase
  • moves the keys from the source array to the
    destination array according to the offsets
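The two phases can be sketched as one counting-sort pass per radix digit (illustrative Python; `bits` and `r` are assumed parameters, and keys are taken to be non-negative integers):

```python
def radix_sort(a, bits=32, r=8):
    """LSD radix sort: ceil(bits/r) counting-sort iterations, each with
    a count pass and a move pass over the source array."""
    mask = (1 << r) - 1
    src, dst = list(a), [0] * len(a)
    for shift in range(0, bits, r):
        # phase 1: count keys with each radix digit
        count = [0] * (1 << r)
        for key in src:
            count[(key >> shift) & mask] += 1
        # turn counts into starting offsets in the destination array
        total = 0
        for digit in range(1 << r):
            count[digit], total = total, total + count[digit]
        # phase 2: move keys to their offsets (stable)
        for key in src:
            digit = (key >> shift) & mask
            dst[count[digit]] = key
            count[digit] += 1
        src, dst = dst, src
    return src
```

With bits = 32 and r = 8 this makes ⌈32/8⌉ = 4 iterations, each reading the source array twice, matching the pass count on the previous slide.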

23
Radixsort - Base Algorithm
  • An improvement to reduce the number of passes
    over the source array
  • accumulate the counts for the (i+1)-st iteration
    at the same time as moving keys during the i-th
    iteration
  • requires a second count array of size 2^r
  • a positive effect on both instruction count and
    cache misses

24
Radixsort - Performance
25
Radixsort - Analysis
  • Parameters to consider
  • b: the number of bits per key
  • r: radix width in bits
  • A: the number of counts per block
  • assumptions
  • 2^(r+1) ≤ AC ( the size of the two count arrays
    ≤ cache capacity )
  • r ≤ b

26
Radixsort - Analysis
  • For n > BC, M = Mtrav + Mcount + Mdest
  • Mtrav: the number of misses per key incurred in
    traversing the source and destination arrays
  • Mcount: the number of misses per key incurred in
    accesses to the count arrays
  • Mdest: the number of misses per key in the
    destination array caused by conflicts with the
    traversal of the source array

27
Radixsort - Analysis
  • Mtrav = ( 2⌈b/r⌉ + 1 + 2(⌈b/r⌉ mod 2) ) / B
  • an initial pass over the input array during which
    counts are accumulated
  • there are ⌈b/r⌉ iterations, in each of which the
    source array is moved to the destination array
  • if ⌈b/r⌉ is odd, the final sorted array must
    be copied back into the input array

28
Radixsort - Analysis
  • Mcount
  • the expected number of misses per key incurred by
    a traversal over the source array in a count
    array of size 2^m is

29
Radixsort - Analysis
  • Mdest
  • the probability that the traversal will visit the
    cache block of x before the next visit to x by
    the destination array is

30
Discussion - Performance Comparison
31
Discussion - Robustness
Tiled mergesort
Memory-tuned heapsort
32
Discussion - Page Mapping Policies
  • We have assumed that a block of contiguous pages
    in the virtual address space maps to a block of
    contiguous pages in the cache
  • => virtually indexed cache
  • tiled mergesort relies heavily on the assumption
    that a cache-sized block of pages does not
    conflict in the cache
  • the speed of tiled mergesort therefore depends
    heavily on the quality of the operating system's
    page-mapping decisions
  • Solaris and Digital Unix make cache-conscious
    decisions about page placement

33
Conclusions
  • Despite its very low instruction count, radix
    sort is outperformed by both mergesort and
    quicksort due to its relatively poor locality
  • Despite executing 75% more instructions than the
    base mergesort, multi-mergesort sorts up to 70%
    faster
  • The improvements for quicksort are modest

34
Conclusions
  • Improving cache performance even at the cost of
    an increased instruction count can improve
    overall performance
  • Knowing and using architectural constants such as
    cache size and block size can improve an
    algorithm's memory-system performance beyond that
    of a generic algorithm

35
Conclusions
  • Spatial locality can be improved by adjusting an
    algorithm's structure to fully utilize cache
    blocks
  • Spatial locality can also be improved by padding
    and adjusting data layout so that structures are
    aligned within cache blocks

36
Conclusions
  • Capacity misses can be reduced by processing
    large data sets in cache-sized pieces
  • Conflict misses can be reduced by processing data
    in cache-block-sized pieces