Asynchronous Parallel Disk Sorting. Roman Dementiev and Peter Sanders, Max-Planck-Institut für Informatik

Slides: 24
Provided by: romande
Learn more at: http://algo2.iti.kit.edu
Transcript and Presenter's Notes

1
Asynchronous Parallel Disk Sorting
Roman Dementiev and Peter Sanders
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Thanks to A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter
2
Motivation and Related Work
  • Sorting is important for external-memory (EM) problems
  • Disks are cheaper than computers
  • Prefetch buffers for disk load balancing and overlapping have been
    studied, but no existing results guarantee
      • overlapping during the merging of runs
      • asymptotically optimal I/O volume
  • Here: algorithm and implementation
      • I/O cost approaches the lower bound
      • guarantees almost perfect overlapping between I/O and computation

3
Motivation and Related Work
  • [ChaudhryCormen01, 02]: implementation for distributed memory
    with parallel disks
  • Theory
  • Implementations
  • Differences
      • Optimal prefetching algorithm
      • Overlapping of I/O and computation
      • Completely asynchronous implementation
  • Part of the <stxxl> library, compatible with the STL. Use is as
    simple as:
        stxxl::vector<my_type> v;
        long long int i = vec_size;       // long long int is a 64-bit data type
        while (i--)
            v.push_back(f(i));            // fill the vector with some values
        stxxl::sort(v.begin(), v.end());  // sort
        // check the order using an STL predicate
        assert(std::is_sorted(v.begin(), v.end()));

[Diagram labels: BarvGrovVit97, SandEgnKorst00, HutchVitt01, HutchSandVitt02, BarveVitter99; here: DementievSanders03]
4
External Multi-way Merge Sort
N: input size, M: main memory size, B: block size, D: number of disks
  • These algorithms do not guarantee overlapping
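
The two phases of multi-way merge sort can be sketched entirely in memory. This is only an illustration of the control flow, not the <stxxl> implementation: `form_runs` and `merge_runs` are hypothetical names, `run_size` stands in for the memory size M, and no disk I/O or overlapping is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Phase 1: cut the input into runs of at most `run_size` elements
// (standing in for the memory size M) and sort each run.
std::vector<std::vector<int>> form_runs(const std::vector<int>& input,
                                        std::size_t run_size) {
    assert(run_size > 0);
    std::vector<std::vector<int>> runs;
    for (std::size_t pos = 0; pos < input.size(); pos += run_size) {
        std::size_t end = std::min(pos + run_size, input.size());
        runs.emplace_back(input.begin() + pos, input.begin() + end);
        std::sort(runs.back().begin(), runs.back().end());
    }
    return runs;
}

// Phase 2: k-way merge using a min-heap of (key, run index) pairs.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> next(runs.size(), 0);  // read position per run
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r);

    std::vector<int> out;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        out.push_back(e.first);
        std::size_t r = e.second;
        // Refill the heap from the run that just delivered an element.
        if (++next[r] < runs[r].size()) heap.emplace(runs[r][next[r]], r);
    }
    return out;
}
```

Calling `merge_runs(form_runs(input, run_size))` returns the sorted input. In an external-memory setting each run would live on disk and the merger would hold only one block per run in memory.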

5
Multi-way Merge Sort with Overlapped I/Os
  • Theorem 1: If I/O and computation can be overlapped, N elements
    can be sorted in time [formula missing], where
      • [symbol missing] is the time needed for run formation,
      • [symbol missing] is the time needed for merging k = Θ(M/B)
        sequences,
      • k = O(N/M) is the total number of runs.

6
Overlapping Run Formation
  • Easy
  • Corollary 2: Input of size N → sorted runs of size M/2 in time
    [formula missing]
  • 2M in the paper
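
The difference between the I/O bound and compute bound cases can be illustrated with a toy timing model. Everything here is an assumption made for illustration and not the paper's analysis: a single I/O channel performs all reads and writes, a single CPU sorts, each of the k runs costs `r` time units to read, `s` to sort, and `w` to write, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Toy model: the I/O channel reads run i+1 while the CPU sorts run i,
// then writes run i as soon as both the channel and the sorted run are ready.
long long pipelined_run_formation_time(int k, long long r, long long s,
                                       long long w) {
    assert(k >= 1);
    long long io = r;          // the channel starts by reading run 1
    long long read_done = io;  // time at which the current run is in memory
    long long sort_done = 0;   // time at which the previous run was sorted
    for (int i = 1; i <= k; ++i) {
        long long next_read_done = 0;
        if (i < k) {           // prefetch run i+1 during the sort of run i
            io += r;
            next_read_done = io;
        }
        sort_done = std::max(read_done, sort_done) + s;  // sort run i
        io = std::max(io, sort_done) + w;                // write run i
        read_done = next_read_done;
    }
    return io;
}

// Baseline without any overlapping: read, sort, write, one run at a time.
long long sequential_run_formation_time(int k, long long r, long long s,
                                        long long w) {
    return k * (r + s + w);
}
```

In this model the I/O bound case finishes in about k(r + w) and the compute bound case in about r + ks + w, whereas the sequential baseline always needs k(r + s + w).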

[Figure: timelines of the read, sort, and write phases for runs 1, 2, ..., k, shown for the I/O bound case and the compute bound case; axes: run numbers, time]
7
Simple External Merging
  • Smallest element of a block is a trigger

[Figure: k sorted runs feeding the merger]
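
The trigger idea can be sketched as follows: if every block is tagged with its smallest key, sorting the blocks by that key gives the order in which a merger will need them. The names `BlockRef` and `fetch_order` are hypothetical, and the sketch assumes each run is already sorted and cut into blocks of `block_size` elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BlockRef {
    std::size_t run;    // which sorted run the block belongs to
    std::size_t block;  // index of the block inside that run
    int trigger;        // smallest key stored in the block
};

// Returns all blocks ordered by their trigger, i.e. the order in which
// the merger first needs an element from each block.
std::vector<BlockRef> fetch_order(const std::vector<std::vector<int>>& runs,
                                  std::size_t block_size) {
    assert(block_size > 0);
    std::vector<BlockRef> order;
    for (std::size_t r = 0; r < runs.size(); ++r)
        for (std::size_t pos = 0; pos < runs[r].size(); pos += block_size)
            order.push_back({r, pos / block_size, runs[r][pos]});
    // stable_sort keeps blocks of the same run in run order on equal triggers.
    std::stable_sort(order.begin(), order.end(),
                     [](const BlockRef& a, const BlockRef& b) {
                         return a.trigger < b.trigger;
                     });
    return order;
}
```

For the runs {1, 2, 7, 8} and {3, 4, 5, 6} with two elements per block, the order is run 0 block 0, run 1 block 0, run 1 block 1, run 0 block 1.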
8
Overlapping I/O and Merging
  • Prediction of delay between issuing two reads is
    not easy

[Figure: merger reading from k sorted runs]
9
Overlapping I/O and Merging
  • Prediction of delay between issuing two reads is
    not easy
  • Solution

[Figure: merger reading from k sorted runs]
10
Overlapping I/O and Merging (cont'd)
  • Theorem 4: Merging k sequences with a total of N elements takes
    time [formula missing], where
      • l: time to produce one element of output
      • L: time to input or output D arbitrary blocks
  • Our I/O thread strategy
      • ≥ DB elements in the write buffer ⇒ output step
      • < DB elements in the write buffer AND D overlap buffers
        available ⇒ input step
  • To prove Theorem 4 we distinguish two cases
      • I/O bound case: 2L ≥ DBl
      • Compute bound case: 2L < DBl
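
The I/O thread strategy and the case distinction translate almost directly into code. This is a minimal sketch: the names `IoStep`, `next_io_step`, and `is_io_bound` are hypothetical, "D overlap buffers available" is read as "at least D free overlap buffers", and the `Wait` outcome is an addition for the situation in which neither rule applies.

```cpp
#include <cassert>
#include <cstddef>

enum class IoStep { Output, Input, Wait };

// D: number of disks, B: block size in elements.
IoStep next_io_step(std::size_t write_buffer_elems,
                    std::size_t free_overlap_buffers,
                    std::size_t D, std::size_t B) {
    assert(D > 0 && B > 0);
    // At least DB elements are waiting to be written: do an output step.
    if (write_buffer_elems >= D * B) return IoStep::Output;
    // Otherwise fetch D blocks if there is room for them.
    if (free_overlap_buffers >= D) return IoStep::Input;
    return IoStep::Wait;
}

// l: time to produce one output element; L: time to transfer D blocks.
bool is_io_bound(double L, double l, std::size_t D, std::size_t B) {
    return 2.0 * L >= static_cast<double>(D * B) * l;
}
```

With D = 2 and B = 4, a write buffer holding 8 elements triggers an output step, while 7 elements and 2 free overlap buffers trigger an input step.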

11
Compute Bound Case
  • Lemma 6: In the compute bound case, after k/D + 1 steps the
    merging thread never blocks until all elements are merged.
  • Notation
      • [symbol missing]: number of elements in the write buffer
      • r: number of elements in the overlap and merge buffers
  • Our I/O thread strategy
      • ≥ DB elements in the write buffer ⇒ output step
      • < DB elements in the write buffer AND D overlap buffers
        available ⇒ input step

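[Figure: state diagram for the compute bound case. Vertical axis: r, with marks kB, kB+DB, kB+2DB, kB+3DB; horizontal axis: write buffer fill, with marks 0, DB, 2DB-L/l, 2DB; regions labeled fetch, output, I/O blocking, and merge thread blocked]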
12
I/O Bound Case
  • Lemma 7: In the I/O bound case, the I/O thread never blocks until
    all input blocks are fetched.
  • Our I/O thread strategy (the same)
      • ≥ DB elements in the write buffer ⇒ output step
      • < DB elements in the write buffer AND D overlap buffers
        available ⇒ input step
  • Proof: similar to the compute bound case
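
Lemmas 6 and 7, together with the definitions of l and L, suggest where the two terms of a merging time bound would come from. The following is a back-of-the-envelope reading under those definitions, not the statement of Theorem 4; the names T_io, T_comp, and T_start are introduced here for illustration only.

```latex
% I/O bound case: the I/O thread never blocks. Every DB elements need
% one input step and one output step, each costing L.
T_{\mathrm{io}} \approx 2L \cdot \frac{N}{DB}

% Compute bound case: after the startup steps the merging thread never
% blocks, so it produces one element every l time units.
T_{\mathrm{comp}} \approx l \cdot N

% Combining both cases, with T_start covering the startup phase:
T_{\mathrm{merge}} \approx \max\!\left(\frac{2LN}{DB},\; lN\right) + T_{\mathrm{start}}

% Consistency check with the case distinction:
\frac{2LN}{DB} \ge lN \iff 2L \ge DB\,l
```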

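[Figure: state diagram for the I/O bound case. Vertical axis: r, with marks from kB+L/l to kB+3DB; horizontal axis: write buffer fill, with marks 0, DB-L/l, DB, 2DB-L/l, 2DB; regions labeled fetch, output, and blocking]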
13
Multi-head Model → Multi-disk Model
  • Randomized Cyclic Allocation [VitHut01] makes accesses balanced
  • Optimal prefetching from [HutchSanVitt01, SanEgnKor00]
  • Corollary 8: For any ε > 0 and a prefetch buffer of size
    m = Θ(D/ε), merging k sequences with a total of N elements can be
    implemented to run in time [formula missing]
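
One common reading of randomized cyclic allocation is that each run draws its own random permutation of the D disks and then places its blocks on the disks cyclically in that order. The sketch below follows that assumption; the function name is hypothetical and the scheme in [VitHut01] may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns disk[j][i]: the disk that holds block i of run j.
std::vector<std::vector<std::size_t>> randomized_cyclic_allocation(
        std::size_t num_runs, std::size_t blocks_per_run,
        std::size_t D, std::mt19937& rng) {
    assert(D > 0);
    std::vector<std::vector<std::size_t>> disk(num_runs);
    std::vector<std::size_t> perm(D);
    for (std::size_t j = 0; j < num_runs; ++j) {
        // Fresh random permutation of the disks for every run.
        std::iota(perm.begin(), perm.end(), std::size_t{0});
        std::shuffle(perm.begin(), perm.end(), rng);
        disk[j].resize(blocks_per_run);
        for (std::size_t i = 0; i < blocks_per_run; ++i)
            disk[j][i] = perm[i % D];  // cycle through the permutation
    }
    return disk;
}
```

Any D consecutive blocks of one run land on D distinct disks, while the independent permutations keep different runs from lining up on the same disk systematically.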

14
Implementation Issues
  • Implementation (part of the <stxxl> library)
  • Forming runs: key sorting, efficient two-pass MSD radix sort
  • Multi-way merging: [Sanders00]
  • prefetch buffer, overlap buffer, read buffer
  • Asynchronous I/O (POSIX threads)
  • No superfluous copying
      • User blocks are transferred by DMA (O_DIRECT flag)
      • Buffers are passed by pointer between pools
[Figure: structure of the <stxxl> library]
15
Hardware
3000 Euro, 375 MByte/s, July 2002
16
Single Disk Performance
  • LEDA-SM, TPIE
  • 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
  • I/O bandwidth: 45.4 MByte/s, 95% of the peak bandwidth of the disk

17
Multiple Disk Performance
  • 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
  • TPIE, LEDA-SM
  • Linux Soft-RAID: 8 × 128 KByte stripes
  • Bandwidth of 315 MByte/s

18
Element Size
  • 16 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 -O6
  • Element size ≥ 64 ⇒ merging is I/O bound
  • Element size ≥ 128 ⇒ run formation is I/O bound
  • Small elements require special treatment

19
Optimal Prefetching
[Figure: buffer layout during merging, drawn without and with prefetch buffers. The read buffers consist of k+O(D) overlap buffers (plus O(D) prefetch buffers in the second drawing); k merge buffers feed the merge, which fills 2D write buffers; D blocks are transferred at a time]
20
Block Size
  • Block sizes of several MB ⇒ good performance
  • Leave room for read and write buffers
  • Naive striping implies a factor-8 block size reduction
  • ⇒ Dramatic run time increase
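
One way to see where a factor of 8 could come from, assuming D = 8 disks and assuming that naive striping treats the D disks as a single logical disk whose logical block is split evenly across them. The symbols B_log and B_phys are introduced here for illustration.

```latex
% Naive striping: one logical block of size B_log is spread over D disks,
% so each disk transfers only a D-th of it per access.
B_{\mathrm{phys}} = \frac{B_{\mathrm{log}}}{D}

% With D = 8 and the logical block size fixed by the available buffer memory:
B_{\mathrm{phys}} = \frac{B_{\mathrm{log}}}{8}
```

Smaller physical blocks mean that seek and rotational delays are amortized over less data per access, which is consistent with the run time increase noted on the slide.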

21
Large Inputs
  • One-pass, curves go up because of
  • Slower zones
  • Seek times

22
Summary
  • Parallel disk sorting with high performance on state-of-the-art
    hardware and with theoretical performance guarantees
  • Bandwidth is no longer a limiting factor for external memory
    algorithms
  • The price-performance ratio can be improved by adding disks

23
?