Title: Asynchronous Parallel Disk Sorting
Roman Dementiev and Peter Sanders
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Thanks to A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter
2 Motivation & Related Work
- Sorting is important for EM (external memory) problems
- Disks are cheaper than computers
- Prefetch buffers for disk load balancing and overlapping have been studied, but there are no results that guarantee
  - overlapping during merging of runs
  - asymptotically optimal I/O volume
- Here: algorithm + implementation
  - I/O cost ≈ lower bound
  - guarantees almost perfect overlapping between I/O and computation
3 Motivation & Related Work
- [ChaudhryCormen01, 02]: implementation for distributed memory, parallel disks
- Theory
- Implementations
Differences:
- Optimal prefetching algorithm
- Overlapping of I/O and computation
- Completely asynchronous implementation
- Part of the <stxxl> library, compatible with STL. Use is as simple as:
    stxxl::vector<my_type> v;
    long long int i = vec_size;  // long long int is a 64-bit data type
    while (i--)
        v.push_back(f(i));  // fill vector with some values
    stxxl::sort(v.begin(), v.end());  // sort
    // check order using STL predicate
    assert(std::is_sorted(v.begin(), v.end()));
[Figure labels: related work [BarvGrovVit97], [SandEgnKorst00], [HutchVitt01], [HutchSandVitt02], [BarveVitter99]; here: [DementievSanders03]]
4 External Multi-way Merge Sort
N: input size, M: main memory size, B: block size, D: number of disks
- These algorithms do not guarantee overlapping
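The two phases of external multi-way merge sort can be illustrated with a minimal in-memory sketch. It assumes runs of at most M elements and a heap-based merge; the names `form_runs` and `merge_runs` are hypothetical and are not part of the <stxxl> library. Like the basic algorithm on this slide, it makes no attempt to overlap I/O and computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Phase 1 (run formation): cut the input into runs of at most M elements
// and sort each run. A real external sort would write each run to disk.
std::vector<std::vector<int>> form_runs(const std::vector<int>& input, std::size_t M) {
    assert(M > 0);
    std::vector<std::vector<int>> runs;
    for (std::size_t pos = 0; pos < input.size(); pos += M) {
        std::size_t end = std::min(pos + M, input.size());
        std::vector<int> run(input.begin() + pos, input.begin() + end);
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Phase 2 (multi-way merging): a min-heap holds one candidate per run.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::size_t>;  // (value, index of its run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // read position per run
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    std::vector<int> out;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(top.first);
        std::size_t r = top.second;
        if (++next[r] < runs[r].size()) heap.push({runs[r][next[r]], r});
    }
    return out;
}
```

Calling `merge_runs(form_runs(input, M))` returns the sorted input; in the external setting each run would be read block by block (block size B) rather than held in memory.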
5 Multi-way Merge Sort with Overlapped I/Os
- Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time [formula not preserved], where
  - [symbol not preserved] is the time needed for run formation,
  - [symbol not preserved] is the time needed for merging k = Θ(M/B) sequences,
  - k = O(N/M) total runs.
6 Overlapping Run Formation
- Easy
- Corollary 2: Input of size N ⇒ sorted runs of size M/2 in time [formula not preserved]
- (2M in the paper)
[Figure: timelines (run numbers over time) of the read, sort, and write phases for runs 1, 2, 3, 4, ..., k-1, k, shown for the I/O bound case and for the compute bound case.]
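The effect of overlapping during run formation can be explored with a small timing model. This is a sketch under assumptions that are not taken from the slides: one disk resource serves reads and writes in the order R1, R2, W1, R3, W2, ..., one CPU sorts the runs in order, and `t_read`, `t_sort`, `t_write` are hypothetical per-run times. The function name `run_formation_makespan` is also made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Completion time of run formation for k runs when the read of run i+1 is
// issued before the write of run i, so that I/O overlaps with sorting.
double run_formation_makespan(int k, double t_read, double t_sort, double t_write) {
    assert(k >= 1);
    double disk_free = 0.0;  // time at which the disk becomes idle
    double cpu_free = 0.0;   // time at which the CPU becomes idle
    std::vector<double> read_done(static_cast<std::size_t>(k), 0.0);

    disk_free += t_read;     // read the first run
    read_done[0] = disk_free;
    for (int i = 0; i < k; ++i) {
        if (i + 1 < k) {     // prefetch the next run
            disk_free += t_read;
            read_done[static_cast<std::size_t>(i) + 1] = disk_free;
        }
        // Sorting run i needs its data and a free CPU.
        cpu_free = std::max(cpu_free, read_done[static_cast<std::size_t>(i)]) + t_sort;
        // Writing run i needs the sorted run and a free disk.
        disk_free = std::max(disk_free, cpu_free) + t_write;
    }
    return disk_free;
}
```

In this model an I/O bound setting (for example 1, 1, 1 time units per read, sort, write) finishes in about k reads plus k writes, while a compute bound setting (1, 10, 1) finishes in about k sorts plus one read and one write.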
7 Simple External Merging
- Smallest element of a block is a trigger
[Figure: k sorted runs feeding a merger.]
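One way to read the trigger idea is that the blocks of all runs are needed by the merger in the order of their smallest elements. The following sketch computes that order for in-memory runs; `BlockRef` and `prediction_sequence` are hypothetical names, and ties are broken by run and block index as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Identifies block number `block` of run `run` by its smallest element.
struct BlockRef {
    std::size_t run;
    std::size_t block;
    int smallest;
};

// Each run is sorted, so the first element of a block is its smallest one.
// Sorting all blocks by that element gives the order in which the merger
// first touches them.
std::vector<BlockRef> prediction_sequence(const std::vector<std::vector<int>>& runs,
                                          std::size_t B) {
    assert(B > 0);
    std::vector<BlockRef> seq;
    for (std::size_t r = 0; r < runs.size(); ++r)
        for (std::size_t pos = 0; pos < runs[r].size(); pos += B)
            seq.push_back({r, pos / B, runs[r][pos]});
    std::sort(seq.begin(), seq.end(), [](const BlockRef& a, const BlockRef& b) {
        return std::tie(a.smallest, a.run, a.block) < std::tie(b.smallest, b.run, b.block);
    });
    return seq;
}
```

For runs {1, 4, 6, 9} and {2, 3, 5, 8} with B = 2, the blocks are requested in the order run 0 block 0, run 1 block 0, run 1 block 1, run 0 block 1.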
8 Overlapping I/O and Merging
- Prediction of delay between issuing two reads is not easy
[Figure: k runs, stored as blocks, feeding the merger.]
9 Overlapping I/O and Merging
- Prediction of delay between issuing two reads is not easy
- Solution: [shown in a figure of k runs feeding the merger; details not preserved]
10 Overlapping I/O and Merging (contd.)
- Theorem 4: Merging k sequences with a total of N elements takes time [formula not preserved]
  - l: time to produce one element of output
  - L: time to input or output D arbitrary blocks
- Our I/O thread strategy:
  - ≥ DB elements in write buffer ⇒ output step
  - < DB elements in write buffer AND D overlap buffers avail. ⇒ input step
- To prove Theorem 4 we distinguish two cases:
  - I/O bound case: 2L ≥ DBl
  - Compute bound case: 2L < DBl
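The decision rule of the I/O thread and the case distinction can be written down directly. The sketch below uses hypothetical names (`IoStep`, `next_io_step`, `is_io_bound`, `merge_time_estimate`); the `Wait` outcome when neither condition holds is an assumption, as is reading "D overlap buffers avail." as "at least D free overlap buffers". The estimate is only a back-of-the-envelope bound derived from the definitions of l and L (N/(DB) input steps plus N/(DB) output steps versus N element productions); it ignores startup effects and is not the formula of Theorem 4.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

enum class IoStep { Output, Input, Wait };

// I/O thread strategy from the slide: prefer output when the write buffer
// holds at least DB elements, otherwise fetch if D overlap buffers are free.
IoStep next_io_step(std::size_t write_buffer_elems, std::size_t free_overlap_buffers,
                    std::size_t D, std::size_t B) {
    assert(D > 0 && B > 0);
    if (write_buffer_elems >= D * B) return IoStep::Output;
    if (free_overlap_buffers >= D) return IoStep::Input;
    return IoStep::Wait;  // assumption: otherwise the I/O thread idles
}

// I/O bound case: 2L >= DBl; compute bound case otherwise.
bool is_io_bound(double L, double l, std::size_t D, std::size_t B) {
    return 2.0 * L >= static_cast<double>(D * B) * l;
}

// Rough estimate: the slower of total I/O time and total merging time.
double merge_time_estimate(double N, double L, double l, std::size_t D, std::size_t B) {
    double io_time = 2.0 * L * N / static_cast<double>(D * B);
    double cpu_time = l * N;
    return std::max(io_time, cpu_time);
}
```

With D = 2, B = 50, and L = 1, a merge cost of l = 0.01 per element falls into the I/O bound case and l = 0.25 into the compute bound case.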
11 Compute Bound Case
- Lemma 6: In the compute bound case, after k/D + 1 steps, the merging thread never blocks until all elements are merged.
- Notation:
  - ?: number of elements in the write buffer (symbol not preserved)
  - r: number of elements in the overlap and merge buffers
- Our I/O thread strategy:
  - ≥ DB elements in write buffer ⇒ output step
  - < DB elements in write buffer AND D overlap buffers avail. ⇒ input step
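[Figure: r plotted against the number of elements in the write buffer; axis ticks include kB, kB+DB, kB+2DB, kB+3DB and 0, DB, 2DB-L/l, 2DB; regions and transitions are labeled fetch, output, I/O blocking, and merge thread blocked.]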
12 I/O Bound Case
- Lemma 7: In the I/O bound case, the I/O thread never blocks until all input blocks are fetched.
- Our I/O thread strategy (the same):
  - ≥ DB elements in write buffer ⇒ output step
  - < DB elements in write buffer AND D overlap buffers avail. ⇒ input step
- Proof similar to the compute bound case
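[Figure: r plotted against the number of elements in the write buffer; axis ticks include kB, kB+DB, kB+2DB, kB+3DB and 0, DB-L/l, DB, 2DB-L/l, 2DB; regions and transitions are labeled fetch, output, and blocking.]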
13 Multi-head Model → Multi-disk Model
- Randomized Cyclic Allocation [VitHut01] makes accesses in [symbol not preserved] balanced
- Optimal prefetching from [HutchSanVitt01, SanEgnKor00]
- Corollary 8: For any ε > 0, with a prefetch buffer of size m = Θ(D/ε), merging k sequences with a total of N elements can be implemented to run in time [formula not preserved]
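A minimal sketch of randomized cyclic allocation, assuming the common formulation in which every run draws its own random permutation of the D disks and places its blocks cyclically in that order. The function name `randomized_cyclic_allocation` and its signature are hypothetical and do not come from the <stxxl> library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns disk_of[r][j], the disk that stores block j of run r.
// Each run uses an independent random permutation of the disks and then
// cycles through it, so any D consecutive blocks of one run hit D distinct
// disks while different runs are decorrelated.
std::vector<std::vector<std::size_t>> randomized_cyclic_allocation(
        std::size_t num_runs, std::size_t blocks_per_run, std::size_t D,
        std::mt19937& rng) {
    assert(D > 0);
    std::vector<std::vector<std::size_t>> disk_of(num_runs);
    std::vector<std::size_t> perm(D);
    for (std::size_t r = 0; r < num_runs; ++r) {
        std::iota(perm.begin(), perm.end(), std::size_t{0});
        std::shuffle(perm.begin(), perm.end(), rng);
        disk_of[r].resize(blocks_per_run);
        for (std::size_t j = 0; j < blocks_per_run; ++j)
            disk_of[r][j] = perm[j % D];
    }
    return disk_of;
}
```

Passing the random engine by reference keeps the sketch deterministic under a fixed seed, which is convenient for experiments.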
14 Implementation Issues
- Implementation (part of the <stxxl> library)
- Forming runs: key sorting, efficient two-pass MSD radix sort
- Multi-way merging [Sanders00]
- prefetch buffer overlap buffer read buffer
- Asynchronous I/O (POSIX threads)
- No superfluous copying
  - User blocks are transferred by DMA (O_DIRECT flag)
  - Buffers are passed by pointer between pools
[Figure: <stxxl> structure]
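Key sorting can be illustrated as follows: only (key, index) pairs are sorted, and the full records are moved once at the end. This sketch substitutes `std::sort` for the two-pass MSD radix sort named on the slide, and the `Record` type and `key_sort` function are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Record {
    std::uint32_t key;    // 32-bit key, as in the experiments
    std::string payload;  // stands in for the rest of the element
};

// Sort small (key, index) pairs instead of whole records, then gather the
// records in key order. std::sort replaces the radix sort of the slides.
std::vector<Record> key_sort(const std::vector<Record>& records) {
    std::vector<std::pair<std::uint32_t, std::size_t>> keys;
    keys.reserve(records.size());
    for (std::size_t i = 0; i < records.size(); ++i)
        keys.emplace_back(records[i].key, i);
    std::sort(keys.begin(), keys.end());

    std::vector<Record> sorted;
    sorted.reserve(records.size());
    for (const auto& kp : keys) {
        assert(kp.second < records.size());
        sorted.push_back(records[kp.second]);
    }
    return sorted;
}
```

The benefit grows with element size, because the sorting step touches only the compact pairs.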
15 Hardware
3000 Euro, 375 MByte/s, July 2002
16 Single Disk Performance
- LEDA-SM, TPIE
- 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
- I/O bandwidth 45.4 MByte/s, 95% of the peak bandwidth of the disk
17 Multiple Disk Performance
- 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
- TPIE, LEDA-SM
- Linux Soft-RAID, 8 × 128 KByte stripes
- Bandwidth of 315 MByte/s
18 Element Size
- 16 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
- ≥ 64 ⇒ merging is I/O bound
- ≥ 128 ⇒ run formation is I/O bound
- Small elements require special treatment
19 Optimal Prefetching
[Figure, two variants of the merging buffer layout: (1) read buffers, k+O(D) overlap buffers, merge buffers for runs 1, 2, 3, 4, ..., k, and 2D write buffers, with D blocks transferred per step; (2) the same layout with O(D) prefetch buffers added.]
20 Block Size
- Block sizes of several MB ⇒ good performance
- Leave room for read and write buffers
- Naive striping implies a factor 8 block size reduction
  - ⇒ dramatic run time increase
21 Large Inputs
- One pass; curves go up because of
  - slower zones
  - seek times
22 Summary
- Parallel disk sorting with high performance on state-of-the-art hardware with theoretical performance guarantees
- Bandwidth is no longer a limiting factor for external memory algorithms
- Price-performance ratio can improve by adding disks
23 ?