1

The Influence of Caches on the Performance of
Sorting
  • By A. LaMarca and R. Ladner
  • Presenter: Shai Brandes

2
Introduction
  • Sorting is one of the most important operations
    performed by computers.
  • In the days of magnetic tape storage before
    modern data-bases, it was almost certainly the
    most common operation performed by computers as
    most "database" updating was done by sorting
    transactions and merging them with a master file.

3
Introduction cont.
  • Since the introduction of caches, main memory
    has continued to grow slower relative to
    processor cycle times.
  • The time to service a cache miss has grown to
    100 cycles and more.
  • Cache miss penalties have grown to the point
    where good overall performance cannot be achieved
    without good cache performance.

4
Introduction cont.
  • In the article, the authors investigate, both
    experimentally and analytically, the performance
    gains that cache-conscious design can offer for
    several sorting algorithms.

5
Introduction cont.
  • For each algorithm, an implementation variant
    with potential for good overall performance was
    chosen.
  • Then, the algorithm was optimized using
    traditional techniques to minimize the number of
    instructions executed.
  • This algorithm forms the baseline for comparison.
  • Memory optimizations were then applied to the
    baseline comparison sort in order to improve
    cache performance.

6
Performance measures
  • The authors concentrate on three performance
    measures:
  • Instruction count
  • Cache misses
  • Overall performance
  • The analyses presented here are only
    approximations, since cache misses cannot be
    analyzed precisely due to factors such as
    variation in process scheduling and the operating
    system's virtual-to-physical page mapping policy.

7
Main lesson
  • The main lesson from the article is that, because
    cache miss penalties are growing larger with
    each new generation of processors, improving an
    algorithm's overall performance can require
    increasing the number of instructions executed
    while, at the same time, reducing the number of
    cache misses.

8
Design parameters of caches
  • Capacity: the total number of blocks the cache
    can hold.
  • Block size: the number of bytes that are loaded
    from and written to memory at a time.
  • Associativity: in an N-way set-associative
    cache, a particular block can be loaded in N
    different cache locations.
  • Replacement policy: which block do we remove
    from the cache when a new block is loaded?

9
Which cache are we investigating?
  • In modern machines, more than one cache is
    placed between the main memory and the processor.

[Figure: the caches placed between the processor and the memory, labeled fully associative, N-way associative and direct-mapped]
10
Which cache are we investigating?
  • The largest miss penalty is typically incurred at
    the cache closest to the main memory, which is
    usually direct-mapped.
  • Thus, we will focus on improving the performance
    of direct-mapped caches.

11
Improve the cache hit ratio
  • Temporal locality: there is a good chance that
    an accessed data item will be accessed again in
    the near future.
  • Spatial locality: there is a good chance that
    subsequently accessed data items are located near
    each other in memory.

12
Cache misses
  • Compulsory miss: occurs when a block is first
    accessed and loaded into the cache.
  • Capacity miss: caused by the fact that the cache
    is not large enough to hold all the accessed
    blocks at one time.
  • Conflict miss: occurs when two or more blocks
    that are mapped to the same location in the
    cache are accessed.
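
The three kinds of misses can be observed with a small simulation. The following is only a sketch of a direct-mapped cache (block i lives in cache location i mod C), not the measurement setup used in the article; the function name `count_misses` and the parameter values are assumptions chosen for illustration.

```python
def count_misses(addresses, B, C):
    """Count misses in a direct-mapped cache of C blocks, B keys per block.

    `addresses` is a sequence of key indices (hypothetical access trace).
    """
    slots = [None] * C          # which memory block each cache location holds
    misses = 0
    for addr in addresses:
        block = addr // B       # memory block containing this key
        slot = block % C        # direct-mapped: block i goes to location i mod C
        if slots[slot] != block:
            slots[slot] = block
            misses += 1
    return misses

B, C = 4, 8                     # assumed sizes: the cache holds B * C = 32 keys
# Two passes over 32 keys: only compulsory misses, one per block.
print(count_misses(list(range(32)) * 2, B, C))
# Two passes over 64 keys: the set no longer fits, so the second pass misses again.
print(count_misses(list(range(64)) * 2, B, C))
# Keys 0 and 32 map to the same cache location: every access is a conflict miss.
print(count_misses([0, 32] * 5, B, C))
```

The second call shows the effect that the later slides describe as "no temporal reuse": once a pass wraps around the cache, every block has been ejected before it is touched again.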

13
Measurements
  • n: the number of keys to be sorted
  • C: the number of blocks in the cache
  • B: the number of keys that fit in a cache
    block

[Figure: a cache block holding B keys]
14
Mergesort
  • Two sorted lists can be merged into a single
    list by repeatedly adding the smaller key to a
    single sorted list

[Figure: the sorted lists 1 3 6 and 2 4 5 are merged into the single sorted list 1 2 3 4 5 6]
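
The merge step described on this slide can be sketched as follows. This is a minimal illustration on Python lists, not the tuned implementation from the article; the function name `merge` is chosen here.

```python
def merge(left, right):
    """Merge two ascending lists into one ascending list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Repeatedly move the smaller head key to the output list.
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    out.extend(left[i:])
    out.extend(right[j:])
    return out

print(merge([1, 3, 6], [2, 4, 5]))  # [1, 2, 3, 4, 5, 6]
```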
15
Mergesort
  • By treating a set of unordered keys as a set of
    sorted lists of length 1, the keys can be
    repeatedly merged together until a single sorted
    set of keys remains.
  • The iterative mergesort was chosen as the base
    algorithm.

16
Mergesort base algorithm
  • Mergesort makes $\lceil \log_2 n \rceil$ passes
    over the array, where the $i$-th pass merges
    sorted subarrays of length $2^{i-1}$ into sorted
    subarrays of length $2^i$.

[Figure: merge passes i=1 and i=2 on a four-key example, ending with the sorted keys 1 3 4 8]
17
Improvements to the base algorithm
  • Alternating the merging process from one array to
    another to avoid unnecessary copying.
  • Loop unrolling
  • Sorting subarrays of size 4 with a fast in-line
    sorting method.
  • Thus, the number of merge passes is
    $\lceil \log_2(n/4) \rceil$.
  • If this number is odd, then an additional copy
    pass is needed to return the sorted array to the
    input array.
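
A rough sketch of the improved base algorithm follows, assuming insertion sort as the "fast in-line sorting method" for groups of 4 (the slides do not name the method) and leaving out loop unrolling. The names `base_mergesort`, `merge_into` and `sort_run` are hypothetical.

```python
def sort_run(a, lo, hi):
    """Insertion-sort a[lo:hi] in place (stand-in for the in-line sort)."""
    for k in range(lo + 1, hi):
        key = a[k]
        m = k - 1
        while m >= lo and a[m] > key:
            a[m + 1] = a[m]
            m -= 1
        a[m + 1] = key

def merge_into(src, dst, lo, mid, hi):
    """Merge sorted src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or src[i] <= src[j]):
            dst[k] = src[i]
            i += 1
        else:
            dst[k] = src[j]
            j += 1

def base_mergesort(a, run=4):
    """Sort `a` in place; return the number of merge passes performed."""
    n = len(a)
    for lo in range(0, n, run):            # initial pass: sorted groups of 4
        sort_run(a, lo, min(lo + run, n))
    src, dst = a, [None] * n               # alternate between the two arrays
    width, passes = run, 0
    while width < n:
        for lo in range(0, n, 2 * width):
            merge_into(src, dst, lo, min(lo + width, n), min(lo + 2 * width, n))
        src, dst = dst, src                # swap roles instead of copying back
        width *= 2
        passes += 1
    if src is not a:                       # odd number of passes: extra copy pass
        a[:] = src
    return passes

keys = [4, 13, 10, 6, 3, 15, 11, 16, 9, 5, 7, 12, 8, 2, 14, 1]
print(base_mergesort(keys), keys)
```

With 16 keys the sketch performs $\log_2(16/4) = 2$ merge passes and ends in the input array; with 8 keys it performs one pass and needs the extra copy.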

18
The problem with the algorithm
  • The base mergesort has the potential for
    terrible cache performance: if a pass is large
    enough to wrap around in the cache, keys will
    be ejected before they are used again.
  • $n \le BC/2$ ⇒ the entire sort is performed in
    the cache, with only compulsory misses.
  • $BC/2 < n \le BC$ ⇒ temporal reuse drops off
    sharply.
  • $BC < n$ ⇒ no temporal reuse.
  • In each pass:
  • The block is accessed in the input array (r/w).
  • The block is accessed in the auxiliary array
    (w/r).
  • ⇒ 2 cache misses per block ⇒ $2/B$ cache misses
    per key.

19

$n \le BC/2$

[Figure: base mergesort with $n \le BC/2$. The input array of n keys and the auxiliary array both fit in the cache. Pass i=1 incurs 4 cache misses (two read misses and two write misses); pass i=2 incurs no cache misses, because the blocks are still in the cache after pass i=1.]
20
Mergesort analysis
  • For $n \le BC/2$ ⇒ $2/B$ misses per key
  • The entire sort will be performed in the cache
  • Only compulsory misses

21
Base Mergesort analysis cont.
  • For $BC/2 < n$ (misses per key):
  • $\frac{2}{B}\lceil\log_2(n/4)\rceil + \frac{1}{B} + \frac{2}{B}\left(\lceil\log_2(n/4)\rceil \bmod 2\right)$

The terms are:
  • $2/B$: in each pass, each key is moved from a
    source array to a destination array. Every B-th
    key visited in the source array results in one
    cache miss, and every B-th key written to the
    destination array results in one cache miss.
  • $\lceil\log_2(n/4)\rceil$: the number of merge
    passes.
  • $1/B$: the initial pass of sorting groups of 4
    keys, with 1 compulsory miss per block and thus
    $1/B$ misses per key.
  • $\frac{2}{B}(\lceil\log_2(n/4)\rceil \bmod 2)$: if
    the number of passes is odd, we need to copy
    the sorted array to the input array.
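
To get a feel for the size of these terms, the base mergesort estimate can be evaluated directly. This is a sketch of the formula as reconstructed from this slide; the function name and the sample values of B and C are assumptions, not the machine parameters from the article.

```python
import math

def base_mergesort_misses_per_key(n, B, C):
    """Approximate cache misses per key for the base mergesort."""
    if n <= B * C / 2:
        return 2 / B                      # everything stays in the cache
    passes = math.ceil(math.log2(n / 4))  # merge passes after the groups-of-4 pass
    return (2 / B) * passes + 1 / B + (2 / B) * (passes % 2)

B, C = 4, 1024                            # assumed: 4 keys per block, 1024 blocks
for n in (1024, 4096, 8192):
    print(n, base_mergesort_misses_per_key(n, B, C))
```

For n = 4096 the number of passes is even and the estimate is 5.25 misses per key; for n = 8192 the odd pass count adds the extra copy term.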
22
1st memory optimization: tiled mergesort
  • Improve temporal locality
  • Phase 1 - subarrays of length BC/2 are sorted
    using mergesort.
  • Phase 2 - return the sorted subarrays to the
    base mergesort to complete the sorting of the
    entire array.
  • Avoid the final copy: if $\log_2(n)$ is odd (so
    that $\log_2(n/4)$ would be odd), subarrays of
    size 2 rather than 4 are sorted in-line, which
    makes the number of passes even.

23
tiled-mergesort example

4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
Phase 1 - mergesort every BC/2 keys (here BC/2 = 4)
4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
Phase 2 - regular mergesort
3 4 6 10 11 13 15 16 | 1 2 5 7 8 9 12 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
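
The two phases of this example can be reproduced with a short sketch. It works on Python lists and therefore only illustrates the order of the merges, not the cache behavior; the `tile` parameter plays the role of BC/2, and all function names are chosen here.

```python
def merge(left, right):
    """Merge two ascending lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def merge_pass(runs):
    """One mergesort pass: merge neighbouring sorted runs pairwise."""
    merged = []
    for k in range(0, len(runs), 2):
        if k + 1 < len(runs):
            merged.append(merge(runs[k], runs[k + 1]))
        else:
            merged.append(runs[k])
    return merged

def mergesort_runs(runs):
    """Repeat merge passes until a single sorted run remains."""
    while len(runs) > 1:
        runs = merge_pass(runs)
    return runs[0] if runs else []

def tiled_mergesort(keys, tile):
    # Phase 1: mergesort each tile of `tile` keys on its own.
    tiles = [mergesort_runs([[k] for k in keys[lo:lo + tile]])
             for lo in range(0, len(keys), tile)]
    # Phase 2: ordinary merge passes over the sorted tiles.
    return tiles, mergesort_runs(tiles)

keys = [4, 13, 10, 6, 3, 15, 11, 16, 9, 5, 7, 12, 8, 2, 14, 1]
tiles, result = tiled_mergesort(keys, tile=4)
print(tiles)
print(result)
```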
24
Tiled Mergesort analysis
  • For $n \le BC/2$ ⇒ $2/B$ misses per key
  • The entire sort will be performed in the cache
  • Only compulsory misses

25
Tiled Mergesort analysis cont.
  • For $BC/2 < n$ (misses per key):
  • $\frac{2}{B}\left\lceil\log_2\frac{2n}{BC}\right\rceil + \frac{2}{B} + 0$

The terms are:
  • $2/B$ (first factor): each pass is large enough
    to wrap around in the cache, so keys are ejected
    before they are used again. 2 compulsory misses
    per block, thus $2/B$ misses per key.
  • $\lceil\log_2(2n/(BC))\rceil$: the number of merge
    passes.
  • $2/B$ (second term): the initial pass of
    mergesorting groups of BC/2 keys. Each merge is
    done in the cache, with 2 compulsory misses per
    block.
  • 0: the number of iterations is forced to be
    even, so there is no need to copy the sorted
    array to the input array.
26
Tiled mergesort cont.
  • The problem:
  • In phase 2, no reuse is achieved across passes,
    since the set size is larger than the cache.
  • The solution: multi-mergesort

27
2nd memory optimization: multi-mergesort
  • We replace the final
    $\lceil\log_2(n/(BC/2))\rceil$ merge passes of
    tiled mergesort with a single pass that merges
    all the subarrays at once.
  • The last pass uses a memory optimized heap which
    holds the heads of the subarrays.
  • The number of misses per key due to the use of
    the heap is negligible for practical values of n,
    B and C.

28
multi-mergesort example

4 13 10 6 3 15 11 16 9 5 7 12 8 2 14 1
Phase 1 - mergesort every BC/2 keys (here BC/2 = 4)
4 6 10 13 | 3 11 15 16 | 5 7 9 12 | 1 2 8 14
Phase 2 - multi-merge all n/(BC/2) subarrays at once
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
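
The single multi-merge pass can be sketched with a heap that holds the head of each sorted subarray. Two simplifications are assumed here: Python's `heapq` (a plain binary heap) stands in for the memory-optimized heap mentioned in the slides, and the built-in `sorted` stands in for the phase 1 mergesort of each tile. The function names are hypothetical.

```python
import heapq

def multi_merge(runs):
    """Merge any number of ascending runs in one pass using a heap of heads."""
    # Each heap entry is (key, run index, position inside that run).
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, idx, pos = heapq.heappop(heap)   # smallest head among all runs
        out.append(key)
        if pos + 1 < len(runs[idx]):          # replace it with that run's next key
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out

def multi_mergesort(keys, tile):
    # Phase 1: sort each tile (BC/2 keys in the slides) independently.
    runs = [sorted(keys[lo:lo + tile]) for lo in range(0, len(keys), tile)]
    # Phase 2: a single pass merges all n/(BC/2) runs at once.
    return multi_merge(runs)

keys = [4, 13, 10, 6, 3, 15, 11, 16, 9, 5, 7, 12, 8, 2, 14, 1]
print(multi_mergesort(keys, tile=4))
```

Every key is read once from its run and written once to the output, which is where the single extra $2/B$ term in the analysis comes from; the price is the heap work per key, visible as a higher instruction count.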
29
Multi Mergesort analysis
  • For $n \le BC/2$ ⇒ $2/B$ misses per key
  • The entire sort will be performed in the cache
  • Only compulsory misses

30
Multi Mergesort analysis cont.
  • For $BC/2 < n$ (misses per key):
  • $\frac{2}{B} + \frac{2}{B}$

The terms are:
  • First $2/B$: the initial pass of mergesorting
    groups of BC/2 keys. Each merge is done in the
    cache, with 2 compulsory misses per block. The
    number of iterations is forced to be odd; that
    way, in the next phase we multi-merge keys from
    the auxiliary array to the input array.
  • Second $2/B$: a single pass that merges all the
    n/(BC/2) subarrays at once. 2 compulsory misses
    per block, thus $2/B$ misses per key.
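
The estimates for the two memory-optimized mergesorts can be compared numerically. This sketch evaluates the formulas as reconstructed from the slides (tiled: $\frac{2}{B}\lceil\log_2\frac{2n}{BC}\rceil + \frac{2}{B}$, multi: $\frac{2}{B} + \frac{2}{B}$); the function names and the values of B and C are assumptions for illustration.

```python
import math

def tiled_misses_per_key(n, B, C):
    """Approximate misses per key for tiled mergesort."""
    if n <= B * C / 2:
        return 2 / B
    passes = math.ceil(math.log2(2 * n / (B * C)))   # phase 2 merge passes
    return (2 / B) * passes + 2 / B

def multi_misses_per_key(n, B, C):
    """Approximate misses per key for multi-mergesort."""
    if n <= B * C / 2:
        return 2 / B
    return 2 / B + 2 / B                             # tile pass + one multi-merge

B, C = 4, 1024                                       # assumed cache parameters
for n in (2048, 16384, 131072):
    print(n, tiled_misses_per_key(n, B, C), multi_misses_per_key(n, B, C))
```

The tiled estimate still grows with $\log_2 n$ once the set exceeds the cache, while the multi-mergesort estimate stays constant.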
31
Performance

[Figure: instructions per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, tiled and multi-mergesort. Annotation: multi-merge of all subarrays in a single pass.]
32
Performance cont.

[Figure: cache misses per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, tiled and multi-mergesort. Annotations: cache misses increase once the set size is larger than the cache; a constant number of cache misses per key; 66% fewer misses than the base.]
33
Performance cont.

[Figure: time in cycles per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, tiled and multi-mergesort. Annotations: worst performance due to the large number of cache misses; due to the increase in instruction count; executes up to 55% faster than the base.]
34
Quicksort - divide and conquer algorithm
  • Choose a pivot.
  • Partition the set around the pivot.
  • At the end of the pass the pivot is in its final
    position.
  • Quicksort the left region.
  • Quicksort the right region.

[Figure: example array 2 1 7 3 5 6 4]

35
Quicksort base algorithm
  • An implementation of the optimized quicksort
    developed by Sedgewick.
  • Rather than sorting small subsets in the natural
    course of the quicksort recursion, they are left
    unsorted until the very end, at which time they
    are sorted using insertion sort in a single
    final pass over the entire array.

36
Insertion sort
  • Sort by repeatedly taking the next item and
    inserting it into the final data structure in its
    proper order with respect to items already
    inserted.

[Figure: example keys 1 3 4 2]
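
A minimal insertion sort sketch, using the four keys from the figure as input:

```python
def insertion_sort(a):
    """Sort the list `a` in place by inserting each key into the sorted prefix."""
    for k in range(1, len(a)):
        key = a[k]
        m = k - 1
        # Shift larger keys one position right to open a gap for `key`.
        while m >= 0 and a[m] > key:
            a[m + 1] = a[m]
            m -= 1
        a[m + 1] = key
    return a

print(insertion_sort([1, 3, 4, 2]))  # [1, 2, 3, 4]
```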
37
Quicksort base algorithm cont.
  • Quicksort makes sequential passes
  • ⇒ all keys in a block are always used
  • ⇒ excellent spatial locality
  • Divide and conquer
  • ⇒ if a subarray is small enough to fit in the
    cache, quicksort will incur at most 1 cache miss
    per block before the subset is fully sorted
  • ⇒ excellent temporal locality

38
1st memory optimization: memory-tuned quicksort
  • Remove Sedgewick's insertion sort in the final
    pass.
  • Instead, sort small subarrays when they are first
    encountered using insertion sort.
  • Motivation:
  • When a small subarray is encountered, it has just
    been part of a recent partition
  • ⇒ all of its keys should be in the cache
39
2nd memory optimization: multi-quicksort
  • $n \le BC$ ⇒ 1 cache miss per block
  • Problem:
  • Larger sets of keys incur a substantial number
    of misses.
  • Solution:
  • A single multi-partition pass is performed,
    which divides the full set into a number of
    subsets that are likely to be cache-sized or
    smaller.

40
Feller
  • If k points are placed randomly in a range of
    length 1, then for each subrange i:
  • $P(\text{subrange}_i \ge X) = (1 - X)^k$

41
multi quicksort cont.
  • Multi-partition the array into $3n/(BC)$ pieces.
  • ⇒ $3n/(BC) - 1$ pivots.
  • ⇒ $P(\text{subset}_i \ge BC) = (1 - BC/n)^{3n/(BC) - 1}$
    (by Feller's result)
  • $\lim_{n \to \infty} (1 - BC/n)^{3n/(BC) - 1} = e^{-3}$
  • ⇒ the percentage of subsets that are larger
    than the cache is less than 5%.
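
The 5% bound can be checked numerically. The sketch below evaluates $(1 - BC/n)^{3n/(BC) - 1}$ for growing n and compares it with $e^{-3}$; the function name and the values of B and C are assumptions for illustration.

```python
import math

def prob_subset_exceeds_cache(n, B, C):
    """Probability that one subset is at least cache-sized (requires n > B*C)."""
    x = B * C / n                 # cache size as a fraction of the whole set
    k = 3 * n / (B * C) - 1       # number of pivots
    return (1 - x) ** k

B, C = 4, 1024                    # assumed cache parameters
for factor in (2, 10, 100, 1000):
    n = factor * B * C
    print(n, prob_subset_exceeds_cache(n, B, C))
print("e^-3 =", math.exp(-3))
```

For these values the probability stays below 0.05 and approaches $e^{-3} \approx 0.0498$ as n grows.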
42
Memory tuned quicksort analysis
  • We analyze the algorithm in two parts:
  • 1. Assumption:
  • Partitioning an array of size m costs
  • $m > BC$ ⇒ $m/B$ misses
  • $m \le BC$ ⇒ 0 misses
  • 2. Correct the assumption:
  • Estimate the undercounted and over-counted
    cache misses.

43
Memory tuned quicksort analysis cont.
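  • M(n): the expected number of misses
  • $M(n) = \begin{cases} 0 & n \le BC \\ \frac{n}{B} + \frac{1}{n}\sum_{i=0}^{n-1}\left[M(i) + M(n-i-1)\right] & \text{otherwise} \end{cases}$

The terms are:
  • $n/B$: by the assumption, partitioning an array
    of size $n > BC$ costs $n/B$ misses.
  • $M(i)$: the number of misses in the left region.
  • $M(n-i-1)$: the number of misses in the right
    region.
  • $1/n$: there are n places to locate the pivot,
    so P(pivot is in the i-th place) = $1/n$.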

44
Memory tuned quicksort analysis cont.
  • The recurrence solves to
  • $M(n) = \begin{cases} 0 & n \le BC \\ \frac{2(n+1)}{B}\ln\frac{n+1}{BC+2} + O\left(\frac{1}{n}\right) & \text{otherwise} \end{cases}$


45
Memory-tuned quicksort analysis cont.: first correction
  • We undercount the misses when a subproblem
    first reaches size $\le BC$.
  • We counted it as 0, but this subproblem may have
    no part in the cache!
  • We add $n/B$ more misses, since there are
    approximately n keys in ALL the subproblems that
    first reach size $\le BC$.

46
Memory-tuned quicksort analysis cont.: second correction
  • In the very first partitioning there are $n/B$
    misses, but not in the subsequent partitionings!
  • At the end of a partitioning, some of the array
    in the LEFT subproblem is still in the cache.
  • ⇒ there are hits that we counted as misses
  • Note: by the time the algorithm reaches the right
    subproblem, its data has been removed from the
    cache.

47
Memory-tuned quicksort analysis: second correction cont.
  • N(n): the expected number of subproblems of size
    $> BC$
  • $N(n) = \begin{cases} 0 & n \le BC \\ 1 + \frac{1}{n}\sum_{i=0}^{n-1}\left[N(i) + N(n-i-1)\right] & \text{otherwise} \end{cases}$

The terms are:
  • 1: $n > BC$, thus this array itself is a
    subproblem larger than the cache.
  • $N(i)$ and $N(n-i-1)$: the number of subproblems
    of size $> BC$ in the left and right regions.
  • $1/n$: there are n places to locate the pivot,
    so P(pivot is in the i-th place) = $1/n$.
48
Memory-tuned quicksort analysis: second correction cont.
  • The recurrence solves to
  • $N(n) = \begin{cases} 0 & n \le BC \\ \frac{n+1}{BC+2} - 1 & \text{otherwise} \end{cases}$


49
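Memory-tuned quicksort analysis: second correction cont.
  • In each subproblem of size $n > BC$:
  • On average, BC/2 keys of the left sub-problem
    are in the cache (half of the cache).
  • R progresses left ⇒ it can access these (blue)
    cache blocks.
  • L progresses right ⇒ eventually it will access
    blocks that map to the blue blocks in the cache
    and replace them.

[Figure: the array with the pivot, the left sub-problem and the pointers L and R, shown above the cache, in which the blocks holding left sub-problem keys are colored blue]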
50
Memory-tuned quicksort analysis: second correction cont.
  • Assumption: R points to a key in the block
    located at the right end of the cache.
  • Reminder: this is a direct-mapped cache, so the
    i-th block is placed in cache location
    $i \bmod C$.

[Figure: the cache, with R at its right end]
51
Memory-tuned quicksort analysis: second correction cont.
  • There are 2 possible scenarios. The first:
  • L points to a key in a block that is mapped to
    the indicated cache block. L progresses and
    replaces the blocks on the right.
  • R points to a key in a block that is mapped to
    the indicated cache block. R progresses to the
    blue blocks on the left.
  • On average, there will be $C/2 - i/2$ hits.

[Figure: the cache in the first scenario, with the replaced blocks marked X and a span of i blocks indicated]
52
Memory-tuned quicksort analysis: second correction cont.
  • The second scenario:
  • R points to a key in a block that is mapped to
    the indicated cache block. R progresses to the
    blue blocks on the left.
  • L points to a key in a block that is mapped to
    the indicated cache block. L progresses and
    replaces the blocks on the right.
  • On average, there will be
    $(C/2 - i) + i/2 = C/2 - i/2$ hits.

[Figure: the cache in the second scenario, with the replaced blocks marked X and a span of i blocks indicated]
53
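Memory-tuned quicksort analysis: second correction cont.
  • Number of hits:
  • $\frac{1}{C/2}\sum_{0 \le i < C/2}\left(\frac{C}{2} - \frac{i}{2}\right) \approx \frac{3C}{8}$
  • L can start on any block with equal probability,
    hence the factor $\frac{1}{C/2}$; the result is
    the average number of hits.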
54
Memory-tuned quicksort analysis: second correction cont.
  • Number of hits not accounted for in the
    computation of M(n), the expected number of
    misses:
  • $\frac{3C}{8} N(n)$
  • $3C/8$: the number of hits after a partition.
  • $N(n)$: the expected number of sub-problems of
    size $> BC$.
55
Memory-tuned quicksort analysis: second correction cont.
  • The expected number of misses per key for
    $n > BC$:
  • $\left[M(n) + \frac{n}{B} - \frac{3C}{8}N(n)\right] / n$
  • $\approx \frac{2}{B}\ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n}$
    misses per key
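
The step from the corrected count to the per-key estimate can be spelled out as follows. This is a sketch that assumes the large-n approximations $M(n) \approx \frac{2n}{B}\ln\frac{n}{BC}$ and $N(n) \approx \frac{n}{BC} - 1$, taken from the closed forms given on the earlier slides.

```latex
\begin{align*}
\frac{M(n) + \frac{n}{B} - \frac{3C}{8} N(n)}{n}
  &\approx \frac{2}{B}\ln\frac{n}{BC} + \frac{1}{B}
     - \frac{3C}{8n}\left(\frac{n}{BC} - 1\right) \\
  &= \frac{2}{B}\ln\frac{n}{BC} + \frac{1}{B} - \frac{3}{8B} + \frac{3C}{8n} \\
  &= \frac{2}{B}\ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n}
\end{align*}
```

The $\frac{1}{B}$ comes from the first correction (the added $n/B$ misses), and the $-\frac{3}{8B} + \frac{3C}{8n}$ comes from the second correction (the $\frac{3C}{8}N(n)$ hits that were counted as misses).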
56
Base quicksort analysis
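  • Number of cache misses per key:
  • $\frac{2}{B}\ln\frac{n}{BC} + \frac{5}{8B} + \frac{3C}{8n} + \frac{1}{B}$
  • The first three terms are the same as for
    memory-tuned quicksort.
  • The extra $1/B$: base quicksort makes an extra
    pass at the end to perform the insertion sort.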

57
Multi quicksort analysis cont.
  • Number of cache misses per key for $n \le BC$:
  • $1/B$ (compulsory misses)
58
Multi quicksort analysis cont.
  • Number of cache misses per key for $n > BC$:
  • $\frac{2}{B} + \frac{2}{B}$

The terms are:
  • First $2/B$: we partition the input into
    $k = 3n/(BC)$ pieces, under the assumption that
    each partition is smaller than the cache. We
    hold k linked lists, one for each partition;
    100 keys fit in one linked list node (to
    minimize storage waste). Each partitioned key is
    moved to a linked list: 1 miss per block in the
    input array and 1 miss per block in the linked
    list.
  • Second $2/B$: each partition is returned to the
    input array and sorted in place.
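
The three quicksort estimates can be put side by side with a short sketch. It evaluates the per-key formulas as reconstructed from the slides; the function name, the variant labels and the values of B and C are assumptions for illustration, and the base and memory-tuned formulas are only applied for $n > BC$, where the slides state them.

```python
import math

def quicksort_misses_per_key(n, B, C, variant):
    """Approximate cache misses per key for a quicksort variant."""
    if variant == "multi":
        # n <= BC: compulsory misses only; otherwise partition pass + sort pass.
        return 1 / B if n <= B * C else 2 / B + 2 / B
    if n <= B * C:
        raise ValueError("the slides give this formula only for n > BC")
    tuned = (2 / B) * math.log(n / (B * C)) + 5 / (8 * B) + 3 * C / (8 * n)
    if variant == "memory_tuned":
        return tuned
    if variant == "base":
        return tuned + 1 / B        # extra final insertion-sort pass
    raise ValueError(f"unknown variant: {variant}")

B, C = 4, 1024                      # assumed cache parameters
for n in (10 * B * C, 100 * B * C):
    print(n, [round(quicksort_misses_per_key(n, B, C, v), 3)
              for v in ("base", "memory_tuned", "multi")])
```

With B = 4 the multi-quicksort estimate is exactly 1 miss per key for large sets, while the other two grow with $\ln(n/(BC))$.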
59
Performance

[Figure: instructions per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, memory-tuned and multi-quicksort. Annotations: multi-partition; a constant number of additional instructions.]
60
Performance cont.

[Figure: cache misses per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, memory-tuned and multi-quicksort. Annotation: the multi-partition usually produces subsets smaller than the cache, giving 1 miss per key.]
61
Performance cont.

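[Figure: time in cycles per key versus set size in keys (10000 to 100000, with the cache size marked) for the base, memory-tuned and multi-quicksort. Annotation on the multi-quicksort: slower due to the increase in instruction cost; if larger sets were sorted, it would have outperformed the other 2 variants.]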