Title: An Implementation of the PARADISE Design Code in CUDA

1. An Implementation of the PARADISE Design Code in CUDA
Sponsors: Behzad Sharif and Sam Stone
Faculty Advisors: Yoram Bresler and Wen-Mei Hwu
Group Members: Abdullah Muzahid, Sudhindra Kota, and Nikhil Pandit
2. PARADISE
- Stands for Patient-Adaptive Reconstruction and Acquisition in Dynamic Imaging with Sensitivity Encoding
- An improved process for 3D cardiac MRI, designed to overcome the limitations of current MRI systems in imaging dynamic phenomena, e.g., the beating human heart
- PARADISE uses three components to overcome the inherent acquisition-speed limitations of MRI:
  - Parallel receiver coils with sensitivity encoding
  - A mathematical model that captures each patient's cardiac dynamics
  - Acquisition and reconstruction that both adapt to the previous two points
- PARADISE forms a high spatial- and temporal-resolution movie of the patient's cardiovascular system (20x improvement over conventional methods)
- Reference: B. Sharif and Y. Bresler, "Adaptive Real-time Cardiac MRI Using PARADISE: Validation by the Physiologically Improved NCAT Phantom," Proc. IEEE ISBI'07, pp. 1020-1023, 2007.
3. PARADISE Acquisition Design
- Calculates the optimal MRI scanning pattern, i.e., the pattern yielding the best image quality in terms of
  - spatial and temporal resolution
  - signal-to-noise ratio
- (figure: the MRI scanning pattern, parameterized by c1 and c2)
- The goal of the design algorithm is to find the optimal c1 and c2
- To calculate the optimal scanning pattern, it uses
  - the receiver coil spatial profiles
  - a spatio-temporal heart model that is adapted to the patient
4. PARADISE Design Algorithm
- The design algorithm searches over a 2-dimensional space to find the optimal c1 and c2 that minimize a predefined COST function
- The resulting parameter values give the optimal MRI sampling pattern with the following specifications:
  - TR: MRI sampling interval
  - NPERIOD: period of the sampling scheme
  - KYOFFSET: size of the jump along the ky axis from one sample to the next
- The MATLAB implementation finishes in 75 hours
- The patient must stay in the MRI scanner while this is computed, so the target runtime is < 5 minutes
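
To make the search concrete, here is a minimal host-side sketch of the exhaustive scan described above. The loop bounds come from the design code on slide 6; COST() is a placeholder declaration standing in for the full evaluation, not the project's actual function.

    #include <float.h>

    extern double COST(int c1, int c2);    /* placeholder for the full evaluation */

    void find_optimal(int *best_c1, int *best_c2)
    {
        double best_cost = DBL_MAX;
        for (int c2 = 1018; c2 > 792; c2--)        /* bounds from the design code */
            for (int c1 = 1; c1 < c2 / 2; c1++) {
                double cost = COST(c1, c2);
                if (cost < best_cost) {            /* keep the cheapest pattern */
                    best_cost = cost;
                    *best_c1 = c1;
                    *best_c2 = c2;
                }
            }
    }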
5. Converting MATLAB Code to C
- Not as easy as expected:
  - Performing matrix operations in C is a bit harder than in MATLAB
  - Converting the MATLAB .mat file containing the coil profiles to a format suitable for use in C
  - Adapting an existing implementation of complex matrix inversion to C
- The C implementation finishes in 1695 s (about 28 minutes)
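
One common way to handle the .mat conversion (a sketch under assumptions: the file layout and the helper below are hypothetical, not the project's actual format) is to export the coil profiles from MATLAB as raw interleaved real/imaginary doubles and read them back with fread:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double re, im; } cplx;

    /* Hypothetical loader: reads n_elems interleaved (re, im) double pairs
       that were written from MATLAB with fwrite(fid, ..., 'double'). */
    static cplx *load_coil_profiles(const char *path, size_t n_elems)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp) { perror(path); exit(EXIT_FAILURE); }
        cplx *buf = malloc(n_elems * sizeof *buf);
        if (!buf || fread(buf, sizeof *buf, n_elems, fp) != n_elems) {
            fprintf(stderr, "failed to read %s\n", path);
            exit(EXIT_FAILURE);
        }
        fclose(fp);
        return buf;   /* caller frees */
    }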
6. PARADISE Design Code

    for (int c2 = 1018; c2 > 792; c2--)
    {
        for (int c1 = 1; c1 < c2/2; c1++)
        {
            for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
            {
                int f1 = suppvec_f[suppindex];
                int y1 = suppvec_y[suppindex];
                ec_y[ec_yCount++] = y1 - 1;
                ec_f[ec_fCount++] = f1 - 1;
                for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
                {
                    int f2 = suppvec_f[suppindex2];
                    int y2 = suppvec_y[suppindex2];
                    int numalpha = y2 - y1;
                    int temp = f1 + numalpha*c1;
7. Dividing into Blocks
- Each block
  - handles one iteration of the (c2, c1) loop
  - calculates one value of COST
- Block (0,0): c2 = 1018, c1 = 1
- Block (1,0): c2 = 1018, c1 = 2
- ...
- Block (0,1): c2 = ..., c1 = ...
- Block (1,1): c2 = ..., c1 = ...
- ...
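
A minimal sketch of how this block-to-parameter mapping can be expressed in the kernel, assuming the grid layout implied by the table above (the kernel name and argument are illustrative):

    __global__ void cost_kernel(float *cost)
    {
        int c1 = 1 + blockIdx.x;        /* c1 advances along the first block index */
        int c2 = 1018 - blockIdx.y;     /* c2 decreases along the second block index */
        if (c1 >= c2 / 2) return;       /* skip blocks outside the triangular search space */
        /* ... the block's threads cooperatively evaluate COST(c1, c2) ... */
    }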
8. Kernel Code

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            int numalpha = y2 - y1;
            int temp = f1 + numalpha*c1;
            if (((f2 - temp) % c2) == 0)
            {
                ec_y[ec_yCount++] = y2 - 1;
                ec_f[ec_fCount++] = f2 - 1;
9. Kernel Code

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[0] = y1 - 1;
        ec_f[0] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len/BLOCK_SIZE; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2*BLOCK_SIZE + threadIdx.x];
            int y2 = suppvec_y[suppindex2*BLOCK_SIZE + threadIdx.x];
            int numalpha = y2 - y1;
            int temp = f1 + numalpha*c1;
            if (((f2 - temp) % c2) == 0)
            {
                ec_y[??] = y2 - 1;   /* which index? not known per thread */
                ec_f[??] = f2 - 1;
10. Problem
- Updating shared memory is a problem:
  - not all threads enter the if condition
  - the index to write to is not known in advance
- Solution
  - Serialize the shared-memory update (see the sketch below)
  - This eliminates most of the benefit of parallelizing the inner loop
- Runtime of 14 minutes; the speedup comes mostly from multiple blocks executing simultaneously
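
A minimal sketch of the serialized append, assuming ec_y, ec_f, and the counters live in shared memory; f2, temp, y2, and c2 come from the surrounding kernel code, and the exact project code may differ:

    int hit = (((f2 - temp) % c2) == 0);   /* did this thread's pair pass the test? */
    for (int t = 0; t < blockDim.x; t++) {
        if (threadIdx.x == t && hit) {     /* exactly one thread appends at a time */
            ec_y[ec_yCount++] = y2 - 1;
            ec_f[ec_fCount++] = f2 - 1;
        }
        __syncthreads();                   /* publish the counters before the next thread */
    }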
11. Speeding Up the Serial Update
- The serial update need not be done on every iteration of the inner loop (suppindex2)
- In most iterations, no thread enters the if condition, so no shared-memory update is required (see the sketch below)
- Runtime reduced from 14 minutes to 8 minutes
12. Parallelize Matrix Operations

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];
        int y1 = suppvec_y[suppindex];
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            ...                            /* matrix operations (elided on the slide) */
        }
    }
13. Parallelize Matrix Operations
- The matrices are small:
  - maximum size is 8x8
  - usually 2x2 or 3x3
- Most threads are idle during this phase
- Runtime reduced to around 6 minutes after parallelizing the matrix operations
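
As an illustration of what parallelizing can mean at these sizes, here is a hedged sketch (not the project's actual code, and real-valued for brevity where the project's matrices are complex) that assigns one thread per output element of a small matrix product:

    /* Illustrative device helper: C = A * B for n x n matrices, n <= 8,
       computed with one thread per output element of C. */
    __device__ void small_matmul(const float *A, const float *B, float *C, int n)
    {
        if (threadIdx.x < n * n) {
            int r = threadIdx.x / n;       /* row of this thread's element */
            int c = threadIdx.x % n;       /* column of this thread's element */
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[r * n + k] * B[k * n + c];
            C[r * n + c] = acc;
        }
        __syncthreads();                   /* all of C written before it is read */
    }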
14. Using Constant Memory

    for (int suppindex = 0; suppindex < suppvec_len; suppindex++)
    {
        int f1 = suppvec_f[suppindex];     /* read-only array */
        int y1 = suppvec_y[suppindex];     /* read-only array */
        ec_y[ec_yCount++] = y1 - 1;
        ec_f[ec_fCount++] = f1 - 1;
        for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++)
        {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            ...
        }
    }
15. Using Constant Memory
- The arrays used in the loops are read-only
- Using constant memory should therefore give a considerable speedup
- But using constant memory increases overall runtime by 40 seconds:
  - constant memory is fast only if all threads access the same address
  - cost scales linearly with the number of different addresses read across the threads
  - in our kernel, every thread accesses a different constant-memory location
- No shared memory is left to use for the arrays
- Use texture memory instead: only a few seconds of speedup over the global-memory implementation
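
For reference, a sketch of the two read-only placements discussed above; the array sizes are illustrative assumptions, and texture references with tex1Dfetch are the CUDA-era API for cached 1D reads:

    __constant__ int suppvec_f_c[4096];    /* constant memory: fast only when a   */
    __constant__ int suppvec_y_c[4096];    /* warp reads one common address       */

    texture<int, 1> suppvec_f_tex;         /* texture memory: cached, tolerant of */
    texture<int, 1> suppvec_y_tex;         /* scattered per-thread addresses      */

    /* Inside the kernel, the texture path replaces the global read with:
           int f2 = tex1Dfetch(suppvec_f_tex, suppindex2*BLOCK_SIZE + threadIdx.x);  */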
16. Experiments with Loop Unrolling
(chart: runtime for unrolling factors 1, 2, 4, and 8)
- Best runtime: unrolling factor 8, 292 s
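
A sketch of how the unrolling factor can be applied (assuming #pragma unroll was used; manually replicating the loop body is the alternative):

    #pragma unroll 8                       /* ask nvcc to replicate the body 8 times */
    for (int suppindex2 = 0; suppindex2 < suppvec_len/BLOCK_SIZE; suppindex2++) {
        /* ... inner-loop body as before ... */
    }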
17. Experiments with Block Sizes
(chart: runtime for block sizes 60, 120, and 240)
- Best runtime: block size 120, 292 s
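
A launch-configuration sketch under the winning settings, reusing the hypothetical cost_kernel from the mapping sketch after slide 7; the grid extents are inferred from the loop bounds (c2 spans 793..1018 and c1 at most 1..508) and blocks outside the triangular search space exit immediately:

    dim3 block(120);                       /* best-performing block size */
    dim3 grid(508, 226);                   /* hypothetical extents of the search grid */
    cost_kernel<<<grid, block>>>(d_cost);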
18. Optimizations Summary
19. Some Statistics
- Number of blocks: 102,774
- Number of threads per block: 120
- Shared memory used per block: 7805 bytes
- Best CPU runtime: 1695 s
- Best GPU runtime: 292 s
- Speedup achieved: 5.8x
- GFLOPS: 0.132
- Texture memory bandwidth: 5.2 GB/s
- Shared memory bandwidth: 7.2 GB/s
20. Issues Limiting Speedup
- Shared memory size
- Serialized update of shared memory
- Off-chip memory bandwidth
- Integer modulo operation in the innermost loop
21. Thank You
Questions?

22. Backup Slides