Title: An Implementation of the PARADISE Design Code in CUDA

1. An Implementation of the PARADISE Design Code in CUDA
Sponsors: Behzad Sharif and Sam Stone
Faculty Advisors: Yoram Bresler and Wen-Mei Hwu
Group Members: Abdullah Muzahid, Sudhindra Kota, and Nikhil Pandit
2. PARADISE
- Stands for Patient-Adaptive Reconstruction and Acquisition in Dynamic Imaging with Sensitivity Encoding
- An improved process for 3D cardiac MRI, designed to overcome the limitations of current MRI systems in imaging dynamic phenomena, e.g., the beating human heart
- PARADISE uses three components to overcome the inherent acquisition-speed limitations of MRI:
  - Parallel receiver coils with sensitivity encoding
  - A mathematical model that captures each patient's cardiac dynamics
  - Acquisition and reconstruction that both adapt to the previous two points
- PARADISE forms a high spatial- and temporal-resolution movie of the patient's cardiovascular system (20x improvement over conventional methods)
- Reference: B. Sharif and Y. Bresler, "Adaptive Real-time Cardiac MRI Using PARADISE: Validation by the Physiologically Improved NCAT Phantom," Proc. IEEE ISBI'07, pp. 1020-1023, 2007.
3. PARADISE Acquisition Design
- Calculates the optimal MRI scanning pattern, i.e., the pattern yielding the best image quality in terms of
  - spatial and temporal resolution
  - signal-to-noise ratio
- (figure: the MRI scanning pattern, parameterized by c1 and c2)
- The goal of the design algorithm is to find the optimal c1 and c2
- To calculate the optimal scanning pattern, it uses
  - the receiver coil spatial profiles
  - a spatio-temporal heart model that is adapted to the patient
4. PARADISE Design Algorithm
- The design algorithm searches over a 2-dimensional space to find the optimal c1 and c2 that minimize a predefined COST function
- The resulting parameter values give the optimal MRI sampling pattern with the following specifications:
  - TR: MRI sampling interval
  - NPERIOD: period of the sampling scheme
  - KYOFFSET: size of the jump along the ky axis from one sample to the next
- The MATLAB implementation finishes in 75 hours
- The patient must stay in the MRI scanner while this is computed, so the target runtime is < 5 minutes
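
To make the search concrete, here is a minimal host-side sketch of the exhaustive scan described above. The loop bounds come from the design code on slide 6; COST() is a placeholder declaration standing in for the full evaluation, not the project's actual function.

    #include <float.h>

    extern double COST(int c1, int c2);    /* placeholder for the full evaluation */

    void find_optimal(int *best_c1, int *best_c2)
    {
        double best_cost = DBL_MAX;
        for (int c2 = 1018; c2 > 792; c2--)        /* bounds from the design code */
            for (int c1 = 1; c1 < c2 / 2; c1++) {
                double cost = COST(c1, c2);
                if (cost < best_cost) {            /* keep the cheapest pattern */
                    best_cost = cost;
                    *best_c1 = c1;
                    *best_c2 = c2;
                }
            }
    }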
5. Converting MATLAB Code to C
- Not as easy as expected:
  - Performing matrix operations in C is a bit harder than in MATLAB
  - Converting the MATLAB .mat file containing the coil profiles to a format suitable for use in C
  - Adapting an existing implementation of complex matrix inversion to C
- The C implementation finishes in 1695 s (about 28 minutes)
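
One common way to handle the .mat conversion (a sketch under assumptions: the file layout and the helper below are hypothetical, not the project's actual format) is to export the coil profiles from MATLAB as raw interleaved real/imaginary doubles and read them back with fread:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double re, im; } cplx;

    /* Hypothetical loader: reads n_elems interleaved (re, im) double pairs
       that were written from MATLAB with fwrite(fid, ..., 'double'). */
    static cplx *load_coil_profiles(const char *path, size_t n_elems)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp) { perror(path); exit(EXIT_FAILURE); }
        cplx *buf = malloc(n_elems * sizeof *buf);
        if (!buf || fread(buf, sizeof *buf, n_elems, fp) != n_elems) {
            fprintf(stderr, "failed to read %s\n", path);
            exit(EXIT_FAILURE);
        }
        fclose(fp);
        return buf;   /* caller frees */
    }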
6. PARADISE Design Code

    for (int c2 = 1018; c2 > 792; c2--)
    {
        for (int c1 = 1; c1 < c2/2; c1++)
        {
            for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
            {
                int f1 = suppvec_f[suppindex];
                int y1 = suppvec_y[suppindex];
                ec_y[ec_yCount++] = y1 - 1;
                ec_f[ec_fCount++] = f1 - 1;
                for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
                {
                    int f2 = suppvec_f[suppindex2];
                    int y2 = suppvec_y[suppindex2];
                    int numalpha = y2 - y1;
                    int temp = f1 + numalpha*c1;
7. Dividing into Blocks
- Each block
  - handles one iteration of the (c2, c1) loop
  - calculates one value of COST
- Block (0,0): c2 = 1018, c1 = 1
- Block (1,0): c2 = 1018, c1 = 2
- ...
- Block (0,1): c2 = ..., c1 = ...
- Block (1,1): c2 = ..., c1 = ...
- ...
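
A minimal sketch of how this block-to-parameter mapping can be expressed in the kernel, assuming the grid layout implied by the table above (the kernel name and argument are illustrative):

    __global__ void cost_kernel(float *cost)
    {
        int c1 = 1 + blockIdx.x;        /* c1 advances along the first block index */
        int c2 = 1018 - blockIdx.y;     /* c2 decreases along the second block index */
        if (c1 >= c2 / 2) return;       /* skip blocks outside the triangular search space */
        /* ... the block's threads cooperatively evaluate COST(c1, c2) ... */
    }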
8. Kernel Code

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            int numalpha = y2 - y1;
            int temp = f1 + numalpha*c1;
            if (((f2 - temp) % c2) == 0)
            {
                ec_y[ec_yCount++] = y2 - 1;
                ec_f[ec_fCount++] = f2 - 1;
9. Kernel Code

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[0] = y1 - 1;
        ec_f[0] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len/BLOCK_SIZE; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2*BLOCK_SIZE + threadIdx.x];
            int y2 = suppvec_y[suppindex2*BLOCK_SIZE + threadIdx.x];
            int numalpha = y2 - y1;
            int temp = f1 + numalpha*c1;
            if (((f2 - temp) % c2) == 0)
            {
                ec_y[??] = y2 - 1;   /* which index? not known per thread */
                ec_f[??] = f2 - 1;
10. Problem
- Updating shared memory is a problem:
  - not all threads enter the if condition
  - the index to write to is not known in advance
- Solution
  - Serialize the shared-memory update (see the sketch below)
  - This eliminates most of the benefit of parallelizing the inner loop
- Runtime of 14 minutes; the speedup comes mostly from multiple blocks executing simultaneously
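
A minimal sketch of the serialized append, assuming ec_y, ec_f, and the counters live in shared memory; f2, temp, y2, and c2 come from the surrounding kernel code, and the exact project code may differ:

    int hit = (((f2 - temp) % c2) == 0);   /* did this thread's pair pass the test? */
    for (int t = 0; t < blockDim.x; t++) {
        if (threadIdx.x == t && hit) {     /* exactly one thread appends at a time */
            ec_y[ec_yCount++] = y2 - 1;
            ec_f[ec_fCount++] = f2 - 1;
        }
        __syncthreads();                   /* publish the counters before the next thread */
    }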
11. Speeding Up the Serial Update
- The serial update need not be done on every iteration of the inner loop (suppindex2)
- In most iterations, no thread enters the if condition, so no shared-memory update is required (see the sketch below)
- Runtime reduced from 14 minutes to 8 minutes
12. Parallelize Matrix Operations

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            ...                            /* matrix operations (elided on the slide) */
        }
    }
13. Parallelize Matrix Operations
- The matrices are small:
  - maximum size is 8x8
  - usually 2x2 or 3x3
- Most threads are idle during this phase
- Runtime reduced to around 6 minutes after parallelizing the matrix operations
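
As an illustration of what parallelizing can mean at these sizes, here is a hedged sketch (not the project's actual code, and real-valued for brevity where the project's matrices are complex) that assigns one thread per output element of a small matrix product:

    /* Illustrative device helper: C = A * B for n x n matrices, n <= 8,
       computed with one thread per output element of C. */
    __device__ void small_matmul(const float *A, const float *B, float *C, int n)
    {
        if (threadIdx.x < n * n) {
            int r = threadIdx.x / n;       /* row of this thread's element */
            int c = threadIdx.x % n;       /* column of this thread's element */
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[r * n + k] * B[k * n + c];
            C[r * n + c] = acc;
        }
        __syncthreads();                   /* all of C written before it is read */
    }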
14. Using Constant Memory

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];     /* read-only array */
        int y1 = suppvec_y[suppindex];     /* read-only array */
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            ...
        }
    }
15. Using Constant Memory
- The arrays used in the loops are read-only
- Using constant memory should therefore give a considerable speedup
- But using constant memory increases overall runtime by 40 seconds:
  - constant memory is fast only if all threads access the same address
  - cost scales linearly with the number of different addresses read across the threads
  - in our kernel, every thread accesses a different constant-memory location
- No shared memory is left to use for the arrays
- Use texture memory instead: only a few seconds of speedup over the global-memory implementation
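
For reference, a sketch of the two read-only placements discussed above; the array sizes are illustrative assumptions, and texture references with tex1Dfetch are the CUDA-era API for cached 1D reads:

    __constant__ int suppvec_f_c[4096];    /* constant memory: fast only when a   */
    __constant__ int suppvec_y_c[4096];    /* warp reads one common address       */

    texture<int, 1> suppvec_f_tex;         /* texture memory: cached, tolerant of */
    texture<int, 1> suppvec_y_tex;         /* scattered per-thread addresses      */

    /* Inside the kernel, the texture path replaces the global read with:
           int f2 = tex1Dfetch(suppvec_f_tex, suppindex2*BLOCK_SIZE + threadIdx.x);  */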
16. Experiments with Loop Unrolling
(chart: runtime for unrolling factors 1, 2, 4, and 8)
- Best runtime: unrolling factor 8, 292 s
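
A sketch of how the unrolling factor can be applied (assuming #pragma unroll was used; manually replicating the loop body is the alternative):

    #pragma unroll 8                       /* ask nvcc to replicate the body 8 times */
    for (int suppindex2 = 0; suppindex2 < suppvec_len/BLOCK_SIZE; suppindex2++) {
        /* ... inner-loop body as before ... */
    }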
17. Experiments with Block Sizes
(chart: runtime for block sizes 60, 120, and 240)
- Best runtime: block size 120, 292 s
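
A launch-configuration sketch under the winning settings, reusing the hypothetical cost_kernel from the mapping sketch after slide 7; the grid extents are inferred from the loop bounds (c2 spans 793..1018 and c1 at most 1..508) and blocks outside the triangular search space exit immediately:

    dim3 block(120);                       /* best-performing block size */
    dim3 grid(508, 226);                   /* hypothetical extents of the search grid */
    cost_kernel<<<grid, block>>>(d_cost);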
18. Optimizations Summary
19. Some Statistics
- Number of blocks: 102,774
- Number of threads per block: 120
- Shared memory used per block: 7805 bytes
- Best CPU runtime: 1695 s
- Best GPU runtime: 292 s
- Speedup achieved: 5.8x
- GFLOPS: 0.132
- Texture memory bandwidth: 5.2 GB/s
- Shared memory bandwidth: 7.2 GB/s
20. Issues Limiting Speedup
- Shared memory size
- Serialized update of shared memory
- Off-chip memory bandwidth
- Integer modulo operation in the innermost loop
21. Thank You
Questions?

22. Backup Slides