Title: Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters

1. Automatic Transformation and Optimization of Applications on GPUs and GPU Clusters
PhD Oral Defence: Wenjing Ma
Advisor: Dr. Gagan Agrawal
The Ohio State University
2. Outline of Contents
- Motivation
  - Accelerators, GPGPU, and GPU clusters
  - Difficulty of GPU programming
- Framework and Approaches
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
3. Introduction
- Accelerators, GPGPU, and GPU clusters
  - Multi-core architectures are increasingly popular in high-performance computing: GPUs, the Cell processor, FPGAs
  - GPUs offer a good performance/price ratio
- Difficulty of programming
  - How do we program a cluster with an accelerator on each node?
4. Our Approach
- Provide high-level support for programming emerging high-end configurations
- Effective and simple optimization strategies
- Focus on specific application classes:
  - Data mining applications
  - Tensor contraction expressions
5. Outline of Contents
- Motivation
  - Accelerators, GPGPU, and GPU clusters
  - Difficulty of GPU programming
- Framework and Approaches
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
6. Shared Memory on GPUs
- Features of shared memory on a GPU:
  - Small in size
  - Software-controllable
  - Much faster than device memory
- A strategy is needed for arranging data on shared memory
  - Arranging by hand is time-consuming and rarely optimal
  - Previous work offered only intuitive solutions
7. An Example of Shared Memory Usage

void Kernel_function(float *A, float *C, ...)
{
    __shared__ float s_C[r * NUM_THREADS];
    __shared__ float s_A[r * NUM_THREADS];
    for (int i = 0; i < n; i += NUM_THREADS) {
        for (int j = 0; j < r; j++) {
            /* load A in device memory into s_A */
        }
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < r; k++) {
                /* load C in device memory into s_C */
            }
        }
        ......
        /* load B in device memory into s_A */
    }
}
8. Problem Formulation for Shared Memory Arrangement
- What to consider:
  - A kernel function (with a number of basic blocks)
  - Whole arrays, sections of arrays, and elements of arrays
  - The live range of each variable
- Determine in which basic block each variable is allocated to shared memory
  - assign_point[i][k]: variable i, basic block k
9. Integer Linear Programming
- Linear programming
  - Objective function: maximize z = c^T x
  - Constraints: Ax <= b
  - Solution: the values of the vector x
- Integer linear programming is a special case of linear programming
  - All unknown variables are integers (0/1 in our case)
  - Solvable for problems of reasonable size (a toy instance follows below)
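As a toy instance of this 0/1 form (hypothetical numbers, for illustration only):

maximize    z = 3 x_1 + 2 x_2
subject to  x_1 + x_2 <= 1
            x_1, x_2 \in {0, 1}

The optimum is x_1 = 1, x_2 = 0, giving z = 3: the constraint admits setting at most one variable, so the solver picks the one with the larger objective coefficient, just as the shared memory formulation picks the more profitable allocation.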
10. Integer Programming for Shared Memory Arrangement (cont'd)
- Objective function:
  - Maximize shared memory usage
  - Minimize data transfer between levels of the memory hierarchy

Maximize
z = \sum_{i=1}^{n_{Var}} \sum_{k=1}^{n_{Live_i}} Agg\_SMref_{ik} - \sum_{i=1}^{n_{Var}} \sum_{k=1}^{n_{Live_i}} Total\_memcopy_{ik}
11. Integer Programming for Shared Memory Arrangement

Agg\_SMref_{ik} = \sum_{j \in live\_blocks_{ik}} Is\_assigned_{ij} \cdot Refs_{ij} \cdot iters_j

Total\_memcopy_{ik} = \sum_{j \in live\_blocks_{ik}} Data\_trans_{ij} \cdot iters_j

Data\_trans_{ij} =
    2 \cdot size\_alloc_{ij},  if Access_{ik} = read/write
    0,                         if Access_{ik} = temp
    size\_alloc_{ij},          otherwise
12. An Example to Show size_alloc

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

size_alloc for the candidate allocation granularities (whole array, array section, array element):
size_alloc = r * m
size_alloc = r * m
size_alloc = r
size_alloc = 1
13. Integer Programming for Shared Memory Arrangement
- Constraints:
  - The total allocation does not exceed the shared memory limit at any time
  - At most one assign_point is 1 in each live range

\sum_{i \in live\_list_j} Is\_assigned_{ij} \cdot size\_alloc_{ij} \le limit

\sum_{j \in live\_blocks_{ik}} assign\_point_{ij} \le 1
14. An Example

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

Input to the integer programming solver:
A: n x r, B: m x r, C: r;  n = 2048, m = 3, r = 3, NUM_THREADS = 256

Solver output:
assign_point[0][1] = 1; assign_point[1][0] = 1; assign_point[2][0] = 1;
/* all other elements of assign_point are 0 */

In assign_point[i][j], i denotes variable i and j denotes basic block j; variables 0, 1, and 2 correspond to A, B, and C in the code.
15. An Example (cont'd)

Generated code:

__shared__ float s_B[m * r];
__shared__ float s_C[r * NUM_THREADS];
__shared__ float s_A[r * NUM_THREADS];
/* load B to s_B */
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid * r + j] = A[(i + tid) * r + j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[k * NUM_THREADS + tid] += s_A[tid * r + k] - s_B[j * r + k];
    ......
}
/* synchronize and combine the per-thread copies of C */

Original code:

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
16. Suggesting Loop Transformation

Before:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After:

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
17. Experiment Results

(charts omitted)

18. Experiment Results

(charts: PCA; Co-clustering)

19. Effect of Loop Transformation

(charts: PCA; Co-clustering)
20. Outline of Contents
- Motivation
  - Accelerators, GPGPU, and GPU clusters
  - Difficulty of GPU programming
- Framework and Approaches
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
21. Tensor Contractions on GPUs and Auto-tuning
- Tensor contraction expressions
  - Motivated by the CCSD(T) part of NWChem
  - Take the form of high-dimensional matrix multiplication
  - Example (a loop-nest sketch follows below):
    r[h1,h2,p3,p4] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7]
- Auto-tuning
  - Compile-time and run-time optimization
  - Selecting the best implementation for a given input problem
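To make the notation concrete, here is a minimal sequential C sketch of this contraction as a plain loop nest. The extents NH and NP and the row-major flattening are illustrative assumptions, not code from the thesis:

/* r[h1,h2,p3,p4] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7], with every h index
   of extent NH and every p index of extent NP (hypothetical extents;
   all arrays stored flat in row-major order). */
void contract(int NH, int NP, const double *t, const double *v, double *r)
{
    for (int h1 = 0; h1 < NH; h1++)
     for (int h2 = 0; h2 < NH; h2++)
      for (int p3 = 0; p3 < NP; p3++)
       for (int p4 = 0; p4 < NP; p4++) {
           double sum = 0.0;
           /* h6 and h7 are the contracted (common) indices */
           for (int h6 = 0; h6 < NH; h6++)
            for (int h7 = 0; h7 < NH; h7++)
                sum += t[((h6 * NH + h7) * NH + h1) * NH + h2]
                     * v[((p3 * NP + p4) * NH + h6) * NH + h7];
           r[((h1 * NH + h2) * NP + p3) * NP + p4] += sum;
       }
}

Grouping (h1, h2) and (p3, p4) as output indices and (h6, h7) as the contracted pair is what makes the expression a high-dimensional matrix multiplication C[a,b] = A[a,c] * B[c,b].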
22. Original Algorithm and Optimization
- Original algorithm on the T10 GPU
  - Load the input matrices into shared memory
  - Index calculation
  - Flattening and index combination
- Optimization for Fermi
  - Register tiling (a generic sketch follows below)
    - Registers serve as a second level of cache
    - Fermi has a larger shared memory and register file
  - Modified index calculation order
    - Different output/input access ratio for each thread
    - r[h1,h2,p4,p3] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7]
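As an illustration of register tiling in the matrix-multiplication view (a generic sketch under my own assumptions, not the thesis kernel), each thread keeps a small output tile in registers so that every loaded element of A and B is reused across several multiply-adds:

#define TILE 16

/* Each thread computes a 2x2 block of C held in registers. Launch with
   (n / (2*TILE)) x (n / (2*TILE)) blocks of TILE x TILE threads; n is
   assumed to be a multiple of 2*TILE. */
__global__ void mm_regtile(const float *A, const float *B, float *C, int n)
{
    int row = 2 * (blockIdx.y * TILE + threadIdx.y);
    int col = 2 * (blockIdx.x * TILE + threadIdx.x);
    float c00 = 0.f, c01 = 0.f, c10 = 0.f, c11 = 0.f;
    for (int k = 0; k < n; k++) {
        /* each of these four loads feeds two multiply-adds */
        float a0 = A[row * n + k], a1 = A[(row + 1) * n + k];
        float b0 = B[k * n + col], b1 = B[k * n + col + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[row * n + col]           = c00;
    C[row * n + col + 1]       = c01;
    C[(row + 1) * n + col]     = c10;
    C[(row + 1) * n + col + 1] = c11;
}

The larger register file on Fermi is what makes such per-thread output tiles profitable.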
23. Motivation for Auto-tuning Tensor Contractions on GPUs

Running times (ms) of two functions on Fermi with different index orders:

            Favor input   Favor output
Ex 1 (a)    0.425         0.504
Ex 1 (b)    0.487         0.584
Ex 1 (c)    0.51          0.671
Ex 1 (d)    0.681         0.881
Ex 2 (A)    13.6          11
Ex 2 (B)    105.5         41.5
Ex 2 (C)    199.7         149.9
Ex 2 (D)    27.1          22.6

- Algorithms must be modified for different architectures
- Different inputs call for different algorithm choices
24. Approaches to Auto-tuning
- Existing approaches
  - Analytical cost models: hard to capture complex architectural features
  - Empirical search: impractical when the search space is large
- Our approach
  - Parametrizable micro-benchmarks
  - Focus on the main features that affect performance
25. Auto-tuning with Parametrizable Micro-benchmarks

(workflow diagram) The target expressions and the architecture features drive a parametrizable micro-benchmark, which is executed over a parameter space to produce models and thresholds; combining these with the expression and problem size in the application and the set of candidate implementations yields the implementation choice.
26. Auto-tuning Approach for Tensor Contractions on Different GPUs
- Auto-tuning tool: parametrizable micro-benchmarks
- Auto-tuning parameters:
  - Memory access pattern
  - Kernel consolidation
27. Micro-benchmark Evaluation for Memory Access
- The access stride on device memory makes a big difference
  - Coalesced accesses: adjacent threads access contiguous words in device memory
  - Cache: L1 and L2
- Mapping to tensor contractions
  - Index calculation order
    - For an uncommon index: in the order of the input/output
    - For a common index: in the order of each input
28. Mapping to Tensor Contractions

r[h1,h2,p4,p3] += t[h6,h7,h1,h2] * v[p3,p4,h6,h7], mapped to C[a,b] = A[a,c] * B[c,b]

Collaborative loading of the input: threadIdx.x corresponds to index c of B, that is, index p3 of v.

- Calculating in input order (p3 is the inner loop):
  - When accessing v, the stride between two threads with adjacent x indices is 1
- Calculating in output order (p4 is the inner loop):
  - When accessing v, the stride between two threads with adjacent x indices is range(p3)
29. Micro-benchmark Evaluation for Memory Access
- A simple micro-benchmark (a kernel sketch follows below)
- Three types of stride: stride_x, stride_y, stride_iter

A[tid.x * stride_x + tid.y * stride_y + i * stride_iter]  /* i is the index of the loop */

(charts: Fermi; T10)
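A minimal CUDA sketch of such a stride micro-benchmark (only the access expression comes from the slide; the kernel signature, the accumulation into sum, and the single-block launch are my assumptions):

#include <cuda_runtime.h>

/* Each thread sweeps A with configurable strides; varying stride_x,
   stride_y, and stride_iter exposes how coalescing and caching affect the
   achieved bandwidth. Launch with a single thread block; A must be
   allocated large enough to cover the largest index touched. */
__global__ void stride_bench(const float *A, float *out, int iters,
                             int stride_x, int stride_y, int stride_iter)
{
    float sum = 0.0f;
    for (int i = 0; i < iters; i++)
        sum += A[threadIdx.x * stride_x + threadIdx.y * stride_y
                 + i * stride_iter];
    /* store the result so the compiler cannot eliminate the loads */
    out[threadIdx.y * blockDim.x + threadIdx.x] = sum;
}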
30. Experiments
- Memory access for a single expression (values are running times in ms)

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          in order           0.241               0.295
13          in order           0.312               0.302
14          in order           0.425               0.504
15          in order           0.487               0.584
16          in order           0.51                0.671
17          in order           0.681               0.881
18          in order           1.078               1.471

Tile size   Predicted choice   Actual (in order)   Actual (out order)
12          out order          0.222               0.214
13          out order          0.28                0.27
14          out order          0.364               0.354
15          out order          0.511               0.482
16          out order          0.854               0.644
17          equal              0.943               0.92
18          equal              1.193               1.124
31. Micro-benchmark Evaluation for Kernel Consolidation
- Launching multiple kernels at the same time
  - With data copy: overlap of computation and data transfer
  - Without data copy: better utilization of the compute resources
- A matrix-matrix multiplication kernel serves as the micro-benchmark
32. Choice of Kernel Consolidation
- Tightly coupled consolidation
  - For functions with a large data movement cost
- Loosely coupled consolidation (a CUDA-streams sketch follows below)
  - For functions with comparable computation and data movement

Tightly coupled:
foreach (task i): data copy (host to device)
foreach (task i): launch the kernel
foreach (task i): data copy (device to host)

Loosely coupled:
foreach (task i):
    data copy for task i (host to device)
    launch kernel(i)
    data copy for task i (device to host)
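A hedged CUDA sketch of the loosely coupled variant (using one stream per task is my illustration of how the copy/compute overlap can be realized; task_kernel and the buffer layout are placeholders):

#include <cuda_runtime.h>

__global__ void task_kernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * 2.0f;   /* placeholder computation */
}

/* One stream per task: the host-to-device copy of task i+1 can overlap the
   kernel of task i. Host buffers should be pinned (cudaMallocHost) for the
   asynchronous copies to actually overlap. */
void run_tasks(int ntasks, float **h_in, float **d_in,
               float **h_out, float **d_out, int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    cudaStream_t *streams = new cudaStream_t[ntasks];
    for (int i = 0; i < ntasks; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMemcpyAsync(d_in[i], h_in[i], bytes,
                        cudaMemcpyHostToDevice, streams[i]);
        task_kernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_in[i],
                                                             d_out[i], n);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < ntasks; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    delete[] streams;
}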
33. Experiments
- Kernel consolidation for a single expression

(charts: micro-benchmark; real contraction)

34. Experiments
- Running on collections of tensor contractions

(charts: Fermi without data copy; T10 without data copy; Fermi with data copy)
35. Outline of Contents
- Motivation
  - Accelerators, GPGPU, and GPU clusters
  - Difficulty of GPU programming
- Framework and Approaches
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
36. Motivation for Loop Fusion over Sequences of Tensor Contractions
- A tensor contraction sequence:

T3(a,q,r,s) = \sum_p C4(p,a) \cdot A(p,q,r,s)
T2(a,b,r,s) = \sum_q C3(q,b) \cdot T3(a,q,r,s)
T1(a,b,c,s) = \sum_r C2(r,c) \cdot T2(a,b,r,s)
B(a,b,c,d)  = \sum_s C1(s,d) \cdot T1(a,b,c,s)

- We need to find the fusion chains
- Memory limits differ across the levels of the hierarchy
- With a GPU, the memory limitation is stricter
37. Tensor Contractions in a Multi-level Memory Hierarchy
- Memory hierarchy in GPU clusters:
  - α: disk
  - β: global memory
  - γ: local memory / GPU memory
- None of the levels can be bypassed
- A higher level is smaller and faster than a lower level
38. Loop Transformation for Tensor Contraction Sequences on Multi-level Memory Architectures
- Single tensor contraction
  - Memory and data movement cost on multi-level memory
  - Tensor contractions represented as Z = X * Y
- Loop fusion for sequences of tensor contractions
  - Conditions for fusion
  - Fusion on a multi-level memory hierarchy
39. Single Tensor Contraction in a Multi-level Memory Hierarchy
- One array fits in memory
  - X[x,y], Y[y,z], Z[x,z]; assume X fits in memory
  - Memory cost: N_x N_y + min(N_x, N_y) + 1 <= M_β (a worked instance follows below)
  - No redundant data movement
- No array fits in memory
  - To minimize data movement, a preferred solution is T_i = T_j = T
- Multi-level memory hierarchy
  - Tile sizes are determined by the particular system parameters and problem sizes
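A hedged worked instance of the memory cost bound above (the sizes are hypothetical):

% X is N_x x N_y with N_x = 1000 and N_y = 500. Holding all of X costs
% N_x N_y = 500000 words; a strip along the smaller dimension of the other
% operands adds min(N_x, N_y) = 500 words, plus one accumulator element:
N_x N_y + \min(N_x, N_y) + 1 = 500000 + 500 + 1 = 500501 \le M_\beta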
40. Fusion Conditions
- For a sequence, fusion pays off only when data movement dominates
- Factors determining the ratio:
  - The common index of the first contraction
  - The uncommon index of the smaller matrix in the second contraction

I_1(d, c_2, ..., c_n) = I_0(d, c_1, ..., c_n) \cdot B_0(d, c_1, ..., c_n)
I_2(d, c_3, ..., c_n) = I_1(d, c_2, ..., c_n) \cdot B_1(d, c_2, ..., c_n)
...
I_n(d) = I_{n-1}(d, c_n) \cdot B_{n-1}(d, c_n)

(each step implicitly sums over the index it eliminates)
41. Fusion Conditions (cont'd)
- The B matrices in the middle of the chain should be very small
  - Each such B_i resides in memory
- The first B and the last B may be large
  - Their tile sizes are determined as in the single-contraction case
42. Memory Requirement and Data Movement Cost of Fused Loops

Index sets: S1 = I_i ∩ I_{i+1} ∩ I_{i+2};  S2 = I_i ∩ I_{i+1};  S3 = I_{i+2};  S4 = I_i

for s_x ∈ S1 do
    Allocate I_{i+1}[s_x]
    for s_y ∈ S2 - S1 do
        Allocate I_i[s_y]
        for s_z ∈ S4 - S2 do
            Produce I_i[s_z]
        end for
        Update I_{i+1}[s_y]
    end for
    for s_w ∈ S3 - S1 do
        Allocate I_{i+2}[s_w]
        Produce I_{i+2}[s_w]
    end for
end for

For reference, the fused sequence:
I_1(d, c_2, ..., c_n) = I_0(d, c_1, ..., c_n) \cdot B_0(d, c_1, ..., c_n)
I_2(d, c_3, ..., c_n) = I_1(d, c_2, ..., c_n) \cdot B_1(d, c_2, ..., c_n)
I_n(d) = I_{n-1}(d, c_n) \cdot B_{n-1}(d, c_n)
43. Algorithm to Determine Fusion Chains
- For a fusable contraction list, with one matrix fitting in memory in each contraction
- Memory cost: when the memory cost exceeds the memory limit, a split is made to break the fusion chain (a generic skeleton is sketched below)

f(i, j) = 0,   if j < i
        = ..., otherwise
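The slide elides the "otherwise" branch of the recurrence; as a generic illustration of this kind of chain-splitting dynamic program, here is a C skeleton in which mem() and cost() are hypothetical placeholder models, not the thesis's actual expressions:

#include <limits.h>

/* Placeholder models: the real memory and data movement expressions are
   not given on the slide. */
long mem(int i, int j)  { return (long)(j - i + 1) * 1000; }
long cost(int i, int j) { return (long)(j - i + 1); }

/* f(i, j): best cost of handling contractions i..j. An empty range costs 0
   (the j < i case above); fusing the whole range is allowed only when its
   memory footprint fits the limit M; otherwise the chain is split. Memoize
   f in practice, since this plain recursion is exponential. */
long f(int i, int j, long M)
{
    if (j < i) return 0;
    long best = LONG_MAX;
    if (mem(i, j) <= M)
        best = cost(i, j);
    for (int k = i; k < j; k++) {
        long split = f(i, k, M) + f(k + 1, j, M);
        if (split < best) best = split;
    }
    return best;
}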
44. Fusion in a Multi-level Memory Hierarchy
- Given the chains at the lower level, determine the subchains at the higher level
- Reduced memory requirement for the β level
- The same procedure selects the fusion chains:

f(i, j) = 0,   if j < i
        = ..., if memory_γ(i, j) <= M_γ
        = ..., otherwise
45. Evaluation

(charts: fusion at the global memory level; fusion at the disk level)
46. Outline
- Motivation
  - Accelerators, GPGPU, and GPU clusters
  - Difficulty of GPU programming
- Framework and Approaches
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
47. GREENRIDE: A Translation System for Enabling Data Mining Applications on GPUs
- User input
- Code analyzer
  - Analysis of variables (variable type and size)
  - Analysis of reduction functions (sequential code from the user)
- Code generator (generates the CUDA code and the C code that invokes the kernel function)
- Optimization
48. GREENRIDE: A Translation System for Enabling Data Mining Applications on GPUs

(system diagram) The user input (variable information, reduction functions, optional functions) feeds the code analyzer (built in LLVM); its variable analyzer derives the variable access patterns and combination operations. The code generator then emits the host program, the kernel functions, and the data copy and thread grid configuration, which together are compiled into the executable.
49. GMAT-DM: Automatic Transformation from MATLAB for GPUs

Pipeline: MATLAB code -> OCTAVE parser -> GMAT-DM -> C code -> GREENRIDE -> CUDA code

- Transforms MATLAB code for GPUs
  - Converts MATLAB code to C
  - Uses GREENRIDE to convert the C code to CUDA
- Matrix manipulation
  - Modified metric for matrix multiplication chains
  - Function combination
50. AUTO-GC: Automatic Code Generation for FREERIDE with GPU Support

Adds support for GPU clusters!

(system diagram) The user input (variable information, reduction functions, optional functions) feeds the code analyzer, which derives the access patterns, reduction objects, and combination operations; the variable analyzer supplies the variable information and the parallel loop. The code generator then emits FREERIDE code for the cluster of CPUs and CUDA code for the GPU on each node.
51. Future Work
- Extend the code generation system for data mining applications to more structures
- Improve the ILP approach for shared memory arrangement and apply it to other architectures
- Include more parameters in the auto-tuning framework
- Extend the loop transformations to heterogeneous structures
52. Conclusion
- Code generation for data mining applications
  - Translation system for enabling data mining applications on GPUs
  - Automatic translation of data mining applications from MATLAB to GPUs
  - Automatic code generation for data mining on clusters with GPU support
  - Arranging data on shared memory with an ILP solver
- Code optimization for tensor contractions
  - Auto-tuning approach for tensor contractions on GPUs
  - Loop transformation for tensor contraction sequences on multi-level memory architectures
53. Thank you!