1
An implementation of the PARADISE Design code in
CUDA
Sponsors Behzad Sharif and Sam Stone Faculty
Advisors Yoram Bresler and Wen-Mei Hwu Group
Members Abdullah Muzahid, Sudhindra Kota, and
Nikhil Pandit
2
PARADISE
  • Stands for Patient-Adaptive Reconstruction and
    Acquisition in Dynamic Imaging with Sensitivity
    Encoding
  • An improved process for 3D cardiac MRI, designed
    to overcome the limitations of current MRI
    systems in imaging dynamic phenomena, e.g., the
    beating human heart
  • PARADISE uses 3 components to overcome the
    inherent acquisition speed limitations in MRI:
  • Parallel receiver coils with sensitivity encoding
  • A mathematical model capturing each patient's
    cardiac dynamics
  • Acquisition and reconstruction adapted to the
    previous two points
  • PARADISE forms a high spatial and temporal
    resolution movie of the patient's cardiovascular
    system (20x improvement over conventional
    methods)
  • Reference: B. Sharif, Y. Bresler, "Adaptive
    Real-time Cardiac MRI Using PARADISE: Validation
    by the Physiologically Improved NCAT Phantom,"
    Proc. IEEE ISBI'07, pp. 1020-1023, 2007.

3
PARADISE Acquisition Design
  • Calculates the optimal MRI scanning pattern,
    i.e., the one yielding the best image quality in
    terms of
  • Spatial and temporal resolution
  • Signal-to-Noise Ratio
  • The scanning pattern is parameterized by two
    values, c1 and c2
  • Goal of the design algorithm: find the optimal
    c1 and c2
  • To calculate the optimal scanning pattern, it
    uses
  • Receiver coil spatial profiles
  • A spatio-temporal heart model that is adapted to
    the patient

[Figure: scanning pattern annotated with c1 and c2]
4
PARADISE Design Algorithm
  • The design algorithm searches over a
    two-dimensional space to find the optimal c1 and
    c2 which minimize a predefined COST function
  • The resulting parameter values give the optimal
    MRI sampling pattern with the following
    specifications:
  • TR: MRI sampling interval
  • NPERIOD: period of the sampling scheme
  • KYOFFSET: size of the jump along the ky axis
    from one sample to the next
  • The MATLAB implementation finishes in 75 hours
  • The patient must stay in the MRI scanner while
    this runs, so the target is lt 5 min

5
Converting MATLAB code to C
  • Not as easy as expected
  • Performing matrix operations in C is a bit
    harder than in MATLAB
  • The MATLAB .mat file containing the coil
    profiles had to be converted to a format
    suitable for use in C
  • An existing implementation of complex matrix
    inversion had to be adapted to C
  • The C implementation finishes in 1695 s (28
    minutes)

6
PARADISE design code
    for (int c2 = 1018; c2 > 792; c2--)
      for (int c1 = 1; c1 < c2/2; c1++)
        for (int suppindex = 0; suppindex < suppvec_len; suppindex++) {
          int f1 = suppvec_f[suppindex];
          int y1 = suppvec_y[suppindex];
          ec_y[ec_yCount++] = y1 - 1;
          ec_f[ec_fCount++] = f1 - 1;
          for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++) {
            int f2 = suppvec_f[suppindex2];
            int y2 = suppvec_y[suppindex2];
            int numalpha = y2 - y1;
            int temp = f1 + numalpha * c1;
            ...

7
Dividing into blocks
  • Each block
  • Handles one iteration of the (c2, c1) loop
  • Calculates one value of COST

Block (0,0): c2 = 1018, c1 = 1
Block (1,0): c2 = 1018, c1 = 2
. . .
8
Kernel Code
    for (int suppindex = 0; suppindex < suppvec_len; suppindex++) {
      int f1 = suppvec_f[suppindex];
      int y1 = suppvec_y[suppindex];
      ec_y[ec_yCount++] = y1 - 1;
      ec_f[ec_fCount++] = f1 - 1;
      for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++) {
        int f2 = suppvec_f[suppindex2];
        int y2 = suppvec_y[suppindex2];
        int numalpha = y2 - y1;
        int temp = f1 + numalpha * c1;
        if (((f2 - temp) % c2) == 0) {
          ec_y[ec_yCount++] = y2 - 1;
          ec_f[ec_fCount++] = f2 - 1;

9
Kernel Code
    for (int suppindex = 0; suppindex < suppvec_len; suppindex++) {
      int f1 = suppvec_f[suppindex];
      int y1 = suppvec_y[suppindex];
      ec_y[0] = y1 - 1;
      ec_f[0] = f1 - 1;
      for (int suppindex2 = 0; suppindex2 < suppvec_len/BLOCK_SIZE; suppindex2++) {
        int f2 = suppvec_f[suppindex2*BLOCK_SIZE + threadIdx.x];
        int y2 = suppvec_y[suppindex2*BLOCK_SIZE + threadIdx.x];
        int numalpha = y2 - y1;
        int temp = f1 + numalpha * c1;
        if (((f2 - temp) % c2) == 0) {
          ec_y[??] = y2 - 1;   /* index unknown: not all threads take the branch */
          ec_f[??] = f2 - 1;

10
Problem
  • Updating shared memory is a problem
  • not all threads enter the if condition
  • index is not known
  • Solution
  • Serialize the shared memory update
  • This eliminates most of the benefits of
    parallelizing the inner loop
  • Run time of 14 minutes. Speedup mostly due to
    multiple blocks executing simultaneously

11
Speeding up serial update
  • The serialized update need not be done on every
    iteration of the inner loop (suppindex2)
  • In most iterations no thread enters the if
    condition, so no shared-memory update is
    required
  • Runtime reduced from 14 minutes to 8 minutes

12
Parallelize Matrix operations
    for (int suppindex = 0; suppindex < suppvec_len; suppindex++) {
      int f1 = suppvec_f[suppindex];
      int y1 = suppvec_y[suppindex];
      ec_y[ec_yCount++] = y1 - 1;
      ec_f[ec_fCount++] = f1 - 1;
      for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++) {
        int f2 = suppvec_f[suppindex2];
        int y2 = suppvec_y[suppindex2];
        .
        .
        .

13
Parallelize Matrix operations
  • The matrices are small
  • Maximum size is 8x8; usually 2x2 or 3x3
  • Most threads idle during this phase
  • Runtime reduced to around 6 minutes after
    parallelizing the matrix operations

14
Using constant memory
    for (int suppindex = 0; suppindex < suppvec_len; suppindex++) {
      int f1 = suppvec_f[suppindex];
      int y1 = suppvec_y[suppindex];
      ec_y[ec_yCount++] = y1 - 1;
      ec_f[ec_fCount++] = f1 - 1;
      for (int suppindex2 = 0; suppindex2 < suppvec_len; suppindex2++) {
        int f2 = suppvec_f[suppindex2];
        int y2 = suppvec_y[suppindex2];
        .
        .
        .

15
Using constant memory
  • The arrays used in the loops are read-only
  • Using constant memory should therefore give a
    considerable speedup
  • But using constant memory actually increases the
    overall runtime by 40 seconds
  • Constant memory is fast only when all threads
    access the same address
  • Cost scales linearly with the number of distinct
    addresses read by the threads
  • In our kernel every thread accesses a different
    constant memory location
  • No shared memory is left to hold the arrays
  • Next attempt: texture memory
  • Only a few seconds of speedup over the
    global-memory implementation

16
Experiments with loop unrolling
[Chart: runtime vs. unrolling factor (1, 2, 4, 8)]
Best runtime: unrolling factor 8, 292 s
17
Experiments with block sizes
[Chart: runtime vs. block size (60, 120, 240)]
Best runtime: block size 120, 292 s
18
Optimizations Summary
19
Some statistics
  • Number of blocks: 102774
  • Threads per block: 120
  • Shared memory used per block: 7805 bytes
  • Best CPU runtime: 1695 s
  • Best GPU runtime: 292 s
  • Speedup achieved: 5.8x
  • GFLOPS: 0.132
  • Texture memory bandwidth: 5.2 GB/s
  • Shared memory bandwidth: 7.2 GB/s

20
Issues limiting speedup
  • Shared memory size
  • Serialized update of shared memory
  • Off-chip memory bandwidth
  • Integer modulo operation in the innermost loop

21
Thank You
Questions?
22
Backup Slides