MAPLD2005/C178

Learn more at: http://klabs.org

1
Sorting on the SRC 6 Reconfigurable Computer

John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University
Washington, DC

2
Algorithms

Quick Sort
Heap Sort
Radix Sort
Bitonic Sort
Odd/Even Merge

3
SRC System Architecture
16-Port Crossbar Switch, 1.6 GB/s peak port bandwidth
Up to 16 nodes per switch
[Diagram: Processor Node, FPGA Node, and Memory Node attached to the switch; links labeled 64]
4
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med  13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
5
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
6
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1   0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
7
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1   0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
8
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1   0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
mL    0  3  5  7  4  2  6  1  8
9
Example - Quick Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1   0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
mL    0  3  5  7  4  2  6  1  8
PS    0  1  2  3  4  5  6  7  8
10
Quick Sort - MIMD Architecture

6 Instances
Median of 3 to select pivot
Pipeline Sort for small partitions (threshold 10) vs. Insertion Sort (threshold 20)

[Diagram: memory banks A-F; FPGA1 and FPGA2 hosting instances QS1-QS6; FPGA slice utilization figures shown: 90, 84]
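
The pivot selection and small-partition cutoff listed on this slide can be sketched in plain host-side C. This is an illustration only, not the authors' MAP C code: the names `median_of_three`, `insertion_sort`, `quick_sort`, and `SMALL_PARTITION` are assumptions, and an ordinary insertion sort stands in for the FPGA pipeline sort.

```c
#include <assert.h>
#include <stdint.h>

#define SMALL_PARTITION 10   /* assumed cutoff, taken from the slide text */

void swap64(int64_t *x, int64_t *y) { int64_t t = *x; *x = *y; *y = t; }

/* Order a[lo], a[mid], a[hi] in place so the median sits at mid; return mid. */
long median_of_three(int64_t *a, long lo, long hi)
{
    long mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo])  swap64(&a[mid], &a[lo]);
    if (a[hi]  < a[lo])  swap64(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap64(&a[hi],  &a[mid]);
    return mid;
}

/* Stand-in for the pipeline sort: sorts the inclusive range a[lo..hi]. */
void insertion_sort(int64_t *a, long lo, long hi)
{
    for (long i = lo + 1; i <= hi; i++) {
        int64_t key = a[i];
        long j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Quick sort on the inclusive range a[lo..hi]. */
void quick_sort(int64_t *a, long lo, long hi)
{
    assert(lo >= 0);
    while (hi - lo + 1 > SMALL_PARTITION) {
        int64_t pivot = a[median_of_three(a, lo, hi)];
        long i = lo, j = hi;
        while (i <= j) {                     /* Hoare-style partition */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) { swap64(&a[i], &a[j]); i++; j--; }
        }
        quick_sort(a, lo, j);                /* recurse on the left part */
        lo = i;                              /* iterate on the right part */
    }
    insertion_sort(a, lo, hi);               /* small partition */
}
```

On the 16 example keys, `median_of_three` leaves 0, 9, and 13 at indices 0, 7, and 15, which matches the `med` row of the example slides. The partition scheme used here is a common textbook one and need not reproduce the `QS1` row.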
11
Example - Heap Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 6 15 11 1 0 8 4 12 7 5 9]
12
Example - Heap Sort
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 6 15 11 1 4 12 7 5 8]
13
Example - Heap Sort
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 6 15 11 1 8 4 12 7 5 0 9]
14
Example - Heap Sort
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 6 15 11 1 8 4 12 7 5 9 0]
15
Example - Heap Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6    13  3 14 15 10  2  6  9  8  4 12  7  5 11  1  0
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 15 6 11 1 8 4 12 7 5]
16
Example - Heap Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6    13  3 14 15 10  2 11  9  8  4 12  7  5  6  1  0
[Binary tree diagram of the keys; node values in transcript order: 13 14 3 10 2 15 11 6 1 8 4 12 7 5]
17
Example - Heap Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
max  15 13 14  9 12  7 11  3  8  4 10  2  5  6  1  0
[Binary tree diagram of the max-heap; node values in transcript order: 15 14 13 12 7 11 9 6 1 3 8 4 10 2 5 0]
18
Heap Sort - MIMD Architecture

6 Instances
Almost identical to processor code

[Diagram: memory banks A-F; FPGA1 and FPGA2 hosting instances HS1-HS6; FPGA slice utilization figures shown: 55, 5]
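
The heap construction walked through on the example slides can be sketched as an array-based max-heap in plain C. The function names `sift_down`, `build_max_heap`, and `heap_sort` are assumptions for this sketch; it is not the authors' processor or MAP C source.

```c
#include <assert.h>
#include <stdint.h>

/* Push a[root] down until neither child is larger; the heap occupies a[0..n-1]. */
void sift_down(int64_t *a, long root, long n)
{
    for (;;) {
        long child = 2 * root + 1;                 /* left child */
        if (child >= n) break;
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                               /* pick the larger child */
        if (a[root] >= a[child]) break;
        int64_t t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

/* Bottom-up heap construction, starting at the last node that has a child. */
void build_max_heap(int64_t *a, long n)
{
    for (long i = n / 2 - 1; i >= 0; i--)
        sift_down(a, i, n);
}

/* In-place ascending heap sort. */
void heap_sort(int64_t *a, long n)
{
    assert(n >= 0);
    build_max_heap(a, n);
    for (long end = n - 1; end > 0; end--) {
        int64_t t = a[0]; a[0] = a[end]; a[end] = t;   /* move current max to its slot */
        sift_down(a, 0, end);
    }
}
```

Applied to the 16 example keys, `build_max_heap` first swaps 0 with 9 (index 7), then 6 with 11 (index 6), and ends with the array 15 13 14 9 12 7 11 3 8 4 10 2 5 6 1 0, the same as the `max` row of the example.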
19
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
1    13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
Pass 1
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count1 = 4, count2 = 4, count3 = 4, count4 = 4
$\mathrm{index}_0 = 0$; $\mathrm{index}_n = \sum_{i=1}^{n} \mathrm{count}_i$ for $n > 0$
index0 = 0, index1 = 4, index2 = 8, index3 = 12
20
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 0, count1 = 0, count2 = 0, count3 = 0
index0 = 0, index1 = 4, index2 = 8, index3 = 12
21
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2    keys placed so far: 13
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 0, count1 = 0, count2 = 0, count3 = 1
1101 -> index1
index0 = 0, index1 = 5, index2 = 8, index3 = 12
22
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2    keys placed so far: 13, 3
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 1, count1 = 0, count2 = 0, count3 = 1
1101 -> index1
0011 -> index3
index0 = 0, index1 = 5, index2 = 8, index3 = 13
23
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2    keys placed so far: 13, 14, 3
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 1, count1 = 0, count2 = 0, count3 = 2
1101 -> index1
1110 -> index2
0011 -> index3
index0 = 0, index1 = 5, index2 = 9, index3 = 13
24
Example - Radix Sort
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
3     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Pass 3
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
Keys after the pass on the low 2 bits:  0000 1000 0100 1100 | 1101 0101 0001 1001 | 1110 1010 0010 0110 | 0011 1111 0111 1011
Keys after the pass on the high 2 bits: 0000 0001 0010 0011 | 0100 0101 0110 0111 | 1000 1001 1010 1011 | 1100 1101 1110 1111
index0 = 4, index1 = 8, index2 = 12, index3 = 16
25
Radix Sort - MIMD Architecture

3 Instances
Uses enumeration sort
Radix 13 bits vs. 8 bits

[Diagram: memory banks A-F; FPGA1 and FPGA2 hosting Radix Sort1, Radix Sort2, Radix Sort3; FPGA slice utilization figures shown: 33, 5]
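
The count, index, and scatter steps from the radix sort example can be sketched as a least-significant-digit radix sort in plain C. This is a minimal sketch under stated assumptions: keys are unsigned, the names `radix_pass` and `radix_sort` and their parameters are hypothetical, and each pass counts and scatters separately rather than overlapping the next digit's count with the current scatter as the Pass 2 slides appear to do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One stable counting pass on the digit at bit offset `shift`.
   `index` is caller-provided scratch with (1 << radix_bits) + 1 entries. */
void radix_pass(const uint64_t *src, uint64_t *dst, size_t n,
                unsigned shift, unsigned radix_bits, size_t *index)
{
    size_t buckets = (size_t)1 << radix_bits;
    uint64_t mask = (uint64_t)buckets - 1;

    memset(index, 0, (buckets + 1) * sizeof *index);
    for (size_t i = 0; i < n; i++)            /* count each digit value */
        index[((src[i] >> shift) & mask) + 1]++;
    for (size_t d = 1; d <= buckets; d++)     /* prefix sum: index[d] = first slot of bucket d */
        index[d] += index[d - 1];
    for (size_t i = 0; i < n; i++)            /* scatter in input order, so the pass is stable */
        dst[index[(src[i] >> shift) & mask]++] = src[i];
}

/* Sort n keys that fit in key_bits bits, radix_bits bits per pass.
   Returns 0 on success, -1 if scratch memory cannot be allocated. */
int radix_sort(uint64_t *keys, size_t n, unsigned key_bits, unsigned radix_bits)
{
    assert(radix_bits >= 1 && radix_bits <= 16 && key_bits <= 64);
    size_t buckets = (size_t)1 << radix_bits;
    uint64_t *tmp = malloc((n ? n : 1) * sizeof *tmp);
    size_t *index = malloc((buckets + 1) * sizeof *index);
    if (!tmp || !index) { free(tmp); free(index); return -1; }

    uint64_t *src = keys, *dst = tmp;
    for (unsigned shift = 0; shift < key_bits; shift += radix_bits) {
        radix_pass(src, dst, n, shift, radix_bits, index);
        uint64_t *t = src; src = dst; dst = t;   /* output becomes next input */
    }
    if (src != keys)                             /* odd number of passes: copy back */
        memcpy(keys, src, n * sizeof *keys);
    free(tmp);
    free(index);
    return 0;
}
```

With `radix_bits` set to 2 and the 16 four-bit example keys, the first pass starts its buckets at 0, 4, 8, and 12, as on the Pass 1 slide. Passing 13 or 8 as `radix_bits` corresponds to the two radix sizes named on this slide.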
26
MIMD Code Structure
main.c
int main( ) {
    int n = 5237706;
    int64 *buf;  buf = cacheAlign(n);
    mapSort(buf, n);  free(buf);  exit(0);
}
mapSort.mc
void mapSort(int64 *buf, n) {
    OBM_BANK_A (bufA, int64, n/6)
    OBM_BANK_B (bufB, int64, n/6)
    ...
    OBM_BANK_F (bufF, int64, n/6)
    DMA_CPU(dir, bufA, stripes, buf, n);
    #pragma src parallel sections
    {
        #pragma src section
        { Xsort(bufA, n/6); }
        #pragma src section
        { Xsort(bufB, n/6); }
        ...
        #pragma src section
        { Xsort(bufF, n/6); }
    }
    DMA_CPU(dir, bufA, stripes, buf, n);
    return;
}
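
A host-side analogue of this MIMD structure, six independent sorters each working on n/6 keys, might look like the following. It is only a sketch: OpenMP's `parallel for` stands in for the `src parallel sections` pragmas, the C library `qsort` stands in for `Xsort`, and `sort_stripes` and `INSTANCES` are hypothetical names. Without OpenMP enabled the loop simply runs serially.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define INSTANCES 6   /* one sorter per memory bank, as on the slide */

static int cmp_i64(const void *x, const void *y)
{
    int64_t a = *(const int64_t *)x, b = *(const int64_t *)y;
    return (a > b) - (a < b);
}

/* Sort each of the six equal stripes of buf independently. */
void sort_stripes(int64_t *buf, size_t n)
{
    assert(n % INSTANCES == 0);
    size_t stripe = n / INSTANCES;
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int k = 0; k < INSTANCES; k++)
        qsort(buf + (size_t)k * stripe, stripe, sizeof *buf, cmp_i64);
}
```

Each stripe comes back sorted on its own; the buffer as a whole is not globally ordered until the stripes are merged. The slide's n = 5237706 divides evenly into six stripes of 872951 keys.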

27
Example - Bitonic Sort
Input Keys
Schedule
13 3 14 15 10 2 6 0 8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 13 3 14 15]

28
Example - Bitonic Sort
Input Keys
Schedule
10 2 6 0 8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 3 13 15 14 10 2 6 0]
29
Example - Bitonic Sort
Input Keys
Schedule
8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 3 5 15 11 13 1 14 9 2 10 6 0]

30
Example - Bitonic Sort
Input Keys
Schedule
8 4 12 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 5 3 11 13 9 14 1 15 6 8 2 4 10 12 0 7]
31
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 1 0 12 2 5 3 8 6 7 10 9 13 4 14 11 15]

32
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 1 7 4 5 9 10 12 13 8 14 11 15]

33
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15
1 4 5 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 1 4 5 7 8 9 11 12]

34
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15 8 9 11 12 1 4 5 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Sorting network diagram; values in transcript order: 8 9 11 12]

35
Bitonic Sort - SIMD Architecture

2 Instances
Parallel sorting network

[Diagram: memory banks A-F; FPGA1 and FPGA2; 8 Input Bitonic Sorting Network (1), 4 Input Bitonic Sort (2), SIMD Controller; FPGA slice utilization figures shown: 5, 27]
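
The six-comparator schedule shown on the example slides, (0,1) (3,2) (0,2) (1,3) (0,1) (2,3), can be sketched in plain C as a 4-input bitonic sorting network. The sketch assumes a pair (i,j) means "minimum to lane i, maximum to lane j"; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Comparator (lo, hi): the smaller key goes to lane lo, the larger to lane hi. */
typedef struct { int lo, hi; } comparator;

/* Schedule from the slides; (3,2) sorts the second pair in descending order. */
const comparator SCHEDULE4[6] = { {0,1}, {3,2}, {0,2}, {1,3}, {0,1}, {2,3} };

void compare_exchange(int64_t *lanes, comparator c)
{
    if (lanes[c.lo] > lanes[c.hi]) {
        int64_t t = lanes[c.lo];
        lanes[c.lo] = lanes[c.hi];
        lanes[c.hi] = t;
    }
}

/* Apply the first `steps` comparators of the schedule (0..6). */
void bitonic_sort4_steps(int64_t lanes[4], int steps)
{
    assert(steps >= 0 && steps <= 6);
    for (int s = 0; s < steps; s++)
        compare_exchange(lanes, SCHEDULE4[s]);
}

/* Full 4-input sort: all six comparators. */
void bitonic_sort4(int64_t lanes[4])
{
    bitonic_sort4_steps(lanes, 6);
}
```

For the first four example keys 13 3 14 15, the first two comparators give 3 13 15 14 (an ascending pair followed by a descending pair, i.e. a bitonic sequence), and the full schedule gives 3 13 14 15. In hardware the two comparators of each stage would operate in the same cycle; the loop here runs them one after another.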
36
Example - Odd/Even Merge
Input Keys
A: 0 1 2 4 7 11 12 14
B: 3 5 6 8 9 10 13 15
Merged Keys
C:
[Merge datapath diagram; labels: MUX, Z-2, Z-1]
37
Example - Odd/Even Merge
Input Keys
A: 0 1 2 4 7 11 12 14
B: 3 5 6 8 9 10 13 15
Merged Keys
C:
[Merge datapath diagram; labels and values in transcript order: 0, Z-2, 3, 1, Z-1, 5]

38
Example - Odd/Even Merge
Input Keys
A: 2 4 7 11 12 14
B: 6 8 9 10 13 15
Merged Keys
C:
[Merge datapath diagram; labels and values in transcript order: 0, 2, Z-2, 3, 1, 4, Z-1, 5]

39
Example - Odd/Even Merge
Input Keys
A: 7 11 12 14
B: 6 8 9 10 13 15
Merged Keys
C:
[Merge datapath diagram; labels and values in transcript order: 0, 2, 7, Z-2, 3, 4, 1, 11, Z-1, 5]

40
Example - Odd/Even Merge
Input Keys
A: 12 14
B: 6 8 9 10 13 15
Merged Keys
C: 0 1
[Merge datapath diagram; labels and values in transcript order: 2, 3, 7, Z-2, 0, 6, 5, 4, 11, 1, Z-1, 8]

41
Example - Odd/Even Merge
Input Keys
A: 12 14
B: 9 10 13 15
Merged Keys
C: 0 1 2 3
[Merge datapath diagram; labels and values in transcript order: 4, 6, 7, Z-2, 2, 9, 8, 5, 11, 3, Z-1, 10]

42
Odd/Even Merge - SIMD Architecture

1 Instance
Parallel sorting network
A/B odd C/D even

[Diagram: memory banks A-F; FPGA1 and FPGA2; blocks Odd Merge Two, Even Merge Two, Merge Out; FPGA slice utilization figures shown: 40, 5]
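
The odd-merge, even-merge, merge-out arrangement resembles the textbook Batcher odd-even merge, sketched below in plain C. This is the standard recursive network, offered as an assumption about the general technique and not as a description of the streaming hardware on the slides; `order_pair` and `odd_even_merge` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Compare-exchange: afterwards a[i] <= a[j]. */
void order_pair(int64_t *a, long i, long j)
{
    if (a[i] > a[j]) {
        int64_t t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}

/* Batcher odd-even merge of a[lo .. lo+n-1].
   Requires: n is a power of two (n >= 2) and both halves are already sorted.
   r is the stride between elements of the current subsequence; call with r = 1. */
void odd_even_merge(int64_t *a, long lo, long n, long r)
{
    assert(n >= 2 && r >= 1);
    long m = 2 * r;
    if (m < n) {
        odd_even_merge(a, lo, n, m);        /* merge the even-indexed subsequence */
        odd_even_merge(a, lo + r, n, m);    /* merge the odd-indexed subsequence */
        for (long i = lo + r; i + r < lo + n; i += m)
            order_pair(a, i, i + r);        /* final compare-exchange of neighbours */
    } else {
        order_pair(a, lo, lo + r);
    }
}
```

To merge the example lists, place A (0 1 2 4 7 11 12 14) and B (3 5 6 8 9 10 13 15) back to back in one 16-element array and call `odd_even_merge(c, 0, 16, 1)`; the array then holds 0 through 15 in order.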
43
SIMD Code Structure
main.c
int main( ) {
    int n = 5237706;
    int64 *buf;  buf = cacheAlign(n);
    mapSort(buf, n);  free(buf);  exit(0);
}
mapSort.mc
void mapSort(int64 *buf, n) {
    OBM_BANK_A (AA, int64, n/6)
    OBM_BANK_B (BB, int64, n/6)
    ...
    OBM_BANK_F (FF, int64, n/6)
    DMA_CPU(dir, AA, stripes, buf, n);
    for (i = 0; i < rounds; i++) {
        schedule(&r1, &r2);
        bitonicSort8(AA[r1], BB[r1], CC[r1], DD[r1],
                     AA[r2], BB[r2], CC[r2], DD[r2],
                     &AA[r1], &BB[r1], &CC[r1], &DD[r1],
                     &AA[r2], &BB[r2], &CC[r2], &DD[r2]);
        bitonicSort4(EE[r1], FF[r1], EE[r2], FF[r2],
                     ...);
    }
    DMA_CPU(dir, AA, stripes, buf, n);
    return;
}

44
Implementation Comparisons
Algorithm     Processor  Complexity  Language  Lines of Code  Recursion  FPGA Util. Slices  MIMD/SIMD  Upper Bound (x10^6 keys/s)
Quick Sort    X86        N lg N      C         81
Quick Sort    FPGA       N lg N      MC        97/96          n/a        90,84              MIMD       31.58
Heap Sort     X86        N lg N      C         55             -
Heap Sort     FPGA       N lg N      MC        56/54          n/a        55,0               MIMD       31.58
Radix Sort    X86        N           C         70             -
Radix Sort    FPGA       N           MC        81/64          n/a        33,0               MIMD       60.00
Bitonic Sort  X86        N lg^2 N    C         78
Bitonic Sort  FPGA       lg^2 N      VHDL      53/478/365     n/a        27,0               SIMD       6.32
O/E Merge     X86        N           C         52             -
O/E Merge     FPGA       N           MC        71/120         n/a        40,0               SIMD       60.87
X86: Dual Xeon 2.8 GHz
FPGA: Virtex2 XC6000 @ 100 MHz
MC: MAP C
Compiler column (per-row entries not recoverable): icc v8.0 -fast, mcc v1.8, mcc v1.9
Refactoring column (per-row entries not recoverable): entirely, major changes, some, very little, almost none
45
Lesson Learned 1

Know your tools
Develop accurate assessments early

                                          Compiler   Quick Sort  Heap Sort  Radix Sort  Bitonic Sort  O/E Merge
2.8 GHz Xeon (x10^6 keys/s)               gcc        1.99        0.50       1.63        -             -
2.8 GHz Xeon (x10^6 keys/s)               icc -fast  5.66        1.06       4.72        -             -
FPGA upper bound estimate (x10^6 keys/s)             31.58       31.58      60.00       6.32          60.87
Upper bound on speedup vs gcc                        15.87       63.16      36.81       -             -
Upper bound on speedup vs icc                        5.58        29.79      12.71       -             -
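
The speedup rows follow from dividing the FPGA upper-bound estimate by the measured processor rate. A minimal C sketch of that arithmetic, with the hypothetical helper name `speedup_upper_bound`:

```c
#include <assert.h>

/* Ratio of an FPGA upper-bound rate to a measured CPU rate (same units). */
double speedup_upper_bound(double fpga_rate, double cpu_rate)
{
    assert(cpu_rate > 0.0);
    return fpga_rate / cpu_rate;
}
```

For Quick Sort, 31.58 / 1.99 is about 15.87 against gcc and 31.58 / 5.66 is about 5.58 against icc -fast, so the choice of baseline compiler changes the apparent speedup by nearly a factor of three.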
46
Test Conditions