Asynchronous Parallel Disk Sorting. Roman Dementiev and Peter Sanders, Max-Planck-Institut für Informatik

Transcript and Presenter's Notes

1
Asynchronous Parallel Disk Sorting
Roman Dementiev and Peter Sanders
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Thanks to A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter
2
Motivation & Related Work

Sorting is important for external memory (EM) problems
Disks are cheaper than computers
Prefetch buffers for disk load balancing and overlapping have been studied, but there are no results which guarantee
overlapping during merging of runs
asymptotically optimal I/O volume
Here: algorithm + implementation
I/O cost ≈ lower bound
guarantees almost perfect overlapping between I/O and computation

3
Motivation & Related Work

Related work (grouped on the slide under Theory and Implementations): [BarvGrovVit97], [SandEgnKorst00], [HutchVitt01], [HutchSandVitt02], [BarveVitter99]
[ChaudhryCormen01, 02]: impl. of distributed memory, parallel disk
Here: [DementievSanders03]
Differences:
Optimal prefetching algorithm
Overlapping of I/O and computation
Completely asynchronous implementation
Part of the <stxxl> library, compatible with STL. Use is as simple as:

    stxxl::vector<my_type> v;
    long long int i = vec_size;  // long long int is a 64-bit data type
    while (i--)
        v.push_back(f(i));  // fill vector with some values
    stxxl::sort(v.begin(), v.end());  // sort
    // check order using STL predicate
    assert(std::is_sorted(v.begin(), v.end()));
4
External Multi-way Merge Sort
N = input size, M = main memory size, B = block size, D = number of disks

These algorithms do not guarantee overlapping
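
The two phases named in the slide title can be illustrated with a minimal in-memory sketch. It is an illustration only, not the <stxxl> implementation: runs live in std::vector instead of on disk, M is counted in elements, and the names form_runs and merge_runs are hypothetical. Like the algorithms mentioned above, it does nothing to overlap I/O with computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Phase 1 (run formation): cut the input into pieces of at most M elements
// and sort each piece. A real external sort would write each run to disk.
std::vector<std::vector<int>> form_runs(const std::vector<int>& input, std::size_t M) {
    assert(M > 0);
    std::vector<std::vector<int>> runs;
    for (std::size_t pos = 0; pos < input.size(); pos += M) {
        std::size_t end = std::min(pos + M, input.size());
        std::vector<int> run(input.begin() + pos, input.begin() + end);
        std::sort(run.begin(), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// Phase 2 (multi-way merge): a min-heap holds one candidate per run.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;  // (value, index of its run)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // next unread position per run
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) {
            heap.push(Entry(runs[r][0], r));
            next[r] = 1;
        }
    }
    std::vector<int> out;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out.push_back(top.first);
        std::size_t r = top.second;
        if (next[r] < runs[r].size()) {
            heap.push(Entry(runs[r][next[r]], r));
            ++next[r];
        }
    }
    return out;
}
```

With M = 3 and ten input elements, form_runs yields four runs and merge_runs returns all ten elements in ascending order.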

5
Multi-way Merge Sort with Overlapped I/Os

Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time [formula missing in transcript], where
[symbol missing] is the time needed for run formation,
[symbol missing] is the time needed for merging k ? (M/B) sequences,
k = O(N/M) total runs.

6
Overlapping Run Formation

Easy
Corollary 2: Input of size N ⇒ sorted runs of size M/2 in time [formula missing in transcript]
(2M in the paper)

[Figure: timelines of run formation, run numbers 1 to k against time, showing the read, sort, and write phases in the I/O bound case and in the compute bound case]
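
Once the run size is fixed, the number of runs k follows by division. The helper below is a small sketch of that arithmetic; the name num_runs is hypothetical, and sizes are assumed to be given in bytes with binary units. It is checked against the figures quoted on the performance slides (2 GB volume, runs of size 256 MB).

```cpp
#include <cstdint>

// Number of sorted runs when n_bytes of input are cut into runs of
// run_bytes each; the last run may be shorter, hence the rounding up.
constexpr std::uint64_t num_runs(std::uint64_t n_bytes, std::uint64_t run_bytes) {
    return (n_bytes + run_bytes - 1) / run_bytes;
}

constexpr std::uint64_t MiB = 1024ull * 1024ull;
constexpr std::uint64_t GiB = 1024ull * MiB;
```

Under these assumptions a 2 GB input with 256 MB runs gives 8 runs, and a 16 GB input gives 64.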
7
Simple External Merging

Smallest element of a block is a trigger

[Figure: a merger fed by k sorted runs]
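
One way to read the trigger idea: if every block is tagged with its smallest element, sorting the blocks by that tag gives the order in which a merger will need them. The sketch below assumes that reading; BlockRef and fetch_order are hypothetical names, and the blocks are plain in-memory vectors rather than disk blocks.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BlockRef {
    std::size_t run;    // sorted run the block belongs to
    std::size_t index;  // position of the block inside its run
    int trigger;        // smallest element stored in the block
};

// runs[r][b] is block b of sorted run r. Blocks are ordered by ascending
// trigger; stable_sort keeps equal triggers in their original order.
std::vector<BlockRef> fetch_order(const std::vector<std::vector<std::vector<int>>>& runs) {
    std::vector<BlockRef> order;
    for (std::size_t r = 0; r < runs.size(); ++r) {
        for (std::size_t b = 0; b < runs[r].size(); ++b) {
            if (!runs[r][b].empty()) {
                // The run is sorted, so the first element is the smallest.
                order.push_back(BlockRef{r, b, runs[r][b].front()});
            }
        }
    }
    std::stable_sort(order.begin(), order.end(),
                     [](const BlockRef& a, const BlockRef& b) { return a.trigger < b.trigger; });
    return order;
}
```

For two runs with blocks {1,4},{9,12} and {2,3},{5,6}, the resulting order is run 0 block 0, run 1 block 0, run 1 block 1, run 0 block 1.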
8
Overlapping I/O and Merging

Prediction of delay between issuing two reads is not easy

[Figure: a merger fed by k runs; the buffer labels are not recoverable from the transcript]

9
Overlapping I/O and Merging

Prediction of delay between issuing two reads is not easy
Solution:

[Figure: the same merger fed by k runs, with modified buffers; the labels are not recoverable from the transcript]

10
Overlapping I/O and Merging (cont'd)

Theorem 4: Merging k sequences with a total of N elements takes time [formula missing in transcript], where
l = time to produce one element of output
L = time to input or output D arbitrary blocks
Our I/O thread strategy:
≥ DB elements in write buffer ⇒ output step
< DB elements in write buffer AND D overlap buffers available ⇒ input step
To prove Theorem 4 we distinguish two cases:
I/O bound case: 2L ≥ DB·l
Compute bound case: 2L < DB·l
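
The I/O thread rule and the case distinction above can be written down as two small functions. This is a sketch under stated assumptions: the comparison symbol lost in the transcript is read as "at least DB elements", the Wait outcome is an assumption for the situation the slide does not spell out, and the names IoStep, next_io_step, and is_io_bound are hypothetical.

```cpp
#include <cstddef>

enum class IoStep { Output, Input, Wait };

// Decision rule of the I/O thread: a full batch of D blocks (D*B elements)
// in the write buffer triggers an output step; otherwise D free overlap
// buffers allow an input step. Wait is an assumed fallback.
IoStep next_io_step(std::size_t write_buffer_elems, std::size_t free_overlap_buffers,
                    std::size_t D, std::size_t B) {
    if (write_buffer_elems >= D * B) return IoStep::Output;
    if (free_overlap_buffers >= D) return IoStep::Input;
    return IoStep::Wait;
}

// Case distinction from the proof: L is the time to input or output D
// blocks, l the time to produce one output element. I/O bound means that
// one input plus one output step (2L) takes at least as long as merging
// D*B elements.
bool is_io_bound(double L, double l, std::size_t D, std::size_t B) {
    return 2.0 * L >= static_cast<double>(D * B) * l;
}
```

With D = 2 and B = 8, a write buffer holding 16 elements leads to an output step, while 15 elements and two free overlap buffers lead to an input step.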

11
Compute Bound Case
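Lemma 6: In the compute bound case, after k/D + 1 steps the merging thread never blocks until all elements are merged.
Notation:
[symbol lost in transcript] = # of elements in the write buffer
r = # of elements in the overlap and merge buffers
Our I/O thread strategy:
≥ DB elements in write buffer ⇒ output step
< DB elements in write buffer AND D overlap buffers available ⇒ input step

[Figure: state diagram for the compute bound case. One axis is r, marked kB, kB+DB, kB+2DB, and kB+3DB; the other is the number of elements in the write buffer, marked 0, DB, 2DB−L/l, and 2DB. Regions are labeled fetch, output, I/O blocking, and merge thread blocked.]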
12
I/O Bound Case

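Lemma 7: In the I/O bound case the I/O thread never blocks until all input blocks are fetched.
Our I/O thread strategy (the same):
≥ DB elements in write buffer ⇒ output step
< DB elements in write buffer AND D overlap buffers available ⇒ input step
Proof similar to compute bound case

[Figure: state diagram for the I/O bound case. One axis is r, marked kB, kB+DB, kB+2DB, and kB+3DB, with further marks offset by L/l; the other is the number of elements in the write buffer, marked 0, DB−L/l, DB, 2DB−L/l, and 2DB. Regions are labeled fetch, output, and blocking.]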
13
Multi-head Model → Multi-disk Model

Randomized Cyclic Allocation [VitHut01] makes accesses in ? balanced
Optimal prefetching from [HutchSanVitt01, SanEgnKor00]
Corollary 8: For any ε > 0, with a prefetch buffer of size m ? (D/ε), merging k sequences with a total of N elements can be implemented to run in time [formula missing in transcript]
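
As a rough illustration of randomized cyclic allocation, the sketch below assumes the common formulation in which each run draws its own random permutation of the D disks and then places its blocks on the disks cyclically in that order. The function name assign_disks and the use of std::mt19937 are choices made for this example, not details taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Disk assignment for the blocks of one run: block i goes to disk
// perm[i % D], where perm is a random permutation of 0..D-1 drawn once
// for this run.
std::vector<std::size_t> assign_disks(std::size_t num_blocks, std::size_t D, std::mt19937& rng) {
    assert(D > 0);
    std::vector<std::size_t> perm(D);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::shuffle(perm.begin(), perm.end(), rng);
    std::vector<std::size_t> disk(num_blocks);
    for (std::size_t i = 0; i < num_blocks; ++i) {
        disk[i] = perm[i % D];
    }
    return disk;
}
```

Any D consecutive blocks of a run then land on D different disks, while different runs start their cycles in unrelated orders.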

14
Implementation Issues

Implementation (part of the <stxxl> library)
Forming runs: key sorting, efficient two-pass MSD radix sort
Multi-way merging [Sanders00]
prefetch buffer, overlap buffer, read buffer
Asynchronous I/O (POSIX threads)
No superfluous copying
User blocks are transferred by DMA (O_DIRECT flag)
Buffers are passed by pointer between pools
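
Passing buffers by pointer between pools can be sketched with std::unique_ptr, which transfers ownership without touching the payload. Block and BufferPool are hypothetical names for this illustration and are not the <stxxl> classes.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A fixed-size block of raw bytes.
struct Block {
    std::vector<char> data;
    explicit Block(std::size_t bytes) : data(bytes) {}
};

// A pool owns blocks through unique_ptr, so moving a block from one pool
// to another moves only the pointer and never copies the bytes.
class BufferPool {
public:
    void add(std::unique_ptr<Block> b) { blocks_.push_back(std::move(b)); }

    // Returns a block, or nullptr if the pool is empty.
    std::unique_ptr<Block> take() {
        if (blocks_.empty()) return nullptr;
        std::unique_ptr<Block> b = std::move(blocks_.back());
        blocks_.pop_back();
        return b;
    }

    std::size_t size() const { return blocks_.size(); }

private:
    std::vector<std::unique_ptr<Block>> blocks_;
};
```

A block taken from one pool and added to another keeps the same payload address, which is the property the "no superfluous copying" bullet relies on.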

[Figure: <stxxl> structure]
15
Hardware
3000 Euro, 375 MByte/s, July 2002
16
Single Disk Performance

LEDA-SM, TPIE
2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
I/O bandwidth 45.4 MByte/s, 95% of the peak bandwidth of the disk

17
Multiple Disk Performance

2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
TPIE, LEDA-SM
Linux Soft-RAID 8X128KByte stripes
Bandwidth of 315 MByte/s

18
Element Size

16 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
? 64 ⇒ merging is I/O bound
? 128 ⇒ run formation is I/O bound
Small elements require special treatment

19
Optimal Prefetching
[Figure: two merging configurations. Both show read buffers, k+O(D) overlap buffers, k merge buffers, and 2D write buffers, with D blocks moved at a time; the second configuration adds O(D) prefetch buffers.]
20
Block Size

Block sizes of several MB ⇒ good performance
Leave room for read and write buffers
Naive striping implies factor 8 block size reduction
⇒ Dramatic run time increase

21
Large Inputs

One-pass, curves go up because of
Slower zones
Seek times

22
Summary

Parallel disk sorting with high performance on state-of-the-art hardware, with theoretical performance guarantees
Bandwidth is no longer a limiting factor for external memory algorithms
Price-performance ratio can improve by adding disks

23
?
