Automated Dynamic Analysis of CUDA Programs
1
Automated Dynamic Analysis of CUDA Programs
  • Michael Boyer, Kevin Skadron, and Westley Weimer
  • University of Virginia
  • {boyer,skadron,weimer}@cs.virginia.edu
  • currently on sabbatical with NVIDIA Research

2
Outline
  • GPGPU
  • CUDA
  • Automated analyses
  • Correctness: race conditions
  • Performance: bank conflicts
  • Preliminary results
  • Future work
  • Conclusion

3
Why GPGPU?
From NVIDIA CUDA Programming Guide, Version 1.1
4
CPU vs. GPU Design
Single-Thread Latency
Aggregate Throughput
From NVIDIA CUDA Programming Guide, Version 1.1
5
GPGPU Programming
  • Traditional approach: graphics APIs
  • ATI/AMD Close-to-the-Metal (CTM)
  • NVIDIA Compute Unified Device Architecture (CUDA)

6
CUDA Abstractions
  • Kernel functions
  • Scratchpad memory
  • Barrier synchronization

7
CUDA Example Program
  • __host__ void example(int *cpu_mem) {
  •   cudaMalloc(&gpu_mem, mem_size);
  •   cudaMemcpy(gpu_mem, cpu_mem, HostToDevice);
  •   kernel <<< grid, threads, mem_size >>> (gpu_mem);
  •   cudaMemcpy(cpu_mem, gpu_mem, DeviceToHost);
  • }
  • __global__ void kernel(int *mem) {
  •   int thread_id = threadIdx.x;
  •   mem[thread_id] = thread_id;
  • }

8
CUDA Hardware
[Diagram: GPU with multiprocessors and global device memory]
9
Outline
  • GPGPU
  • CUDA
  • Automated analyses
  • Correctness: race conditions
  • Performance: bank conflicts
  • Preliminary results
  • Future work
  • Conclusion

10
Race Conditions
  • Ordering of instructions among multiple threads
    is arbitrary
  • Relaxed memory consistency model
  • Synchronization __syncthreads()
  • Barrier / memory fence

11
Race Conditions Example
  • 1  extern __shared__ int s[];
  • 2
  • 3  __global__ void kernel(int *out) {
  • 4    int id = threadIdx.x;
  • 5    int nt = blockDim.x;
  • 6
  • 7    s[id] = id;
  • 8    out[id] = s[(id + 1) % nt];
  • 9  }
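The hazard can be reproduced off-device. Below is a host-side C++ sketch (illustrative only, not the authors' instrumentation) that compares a barrier-ordered execution with one legal unsynchronized interleaving of lines 7 and 8:

```cpp
#include <cassert>
#include <vector>

// Simulates the kernel above for nt threads: line 7 writes s[id] = id,
// line 8 reads s[(id + 1) % nt].  "barrier" models __syncthreads()
// between lines 7 and 8: all writes complete before any read.
std::vector<int> run(int nt, bool barrier) {
    std::vector<int> s(nt, 0), out(nt, 0);
    if (barrier) {
        for (int id = 0; id < nt; id++) s[id] = id;                 // all of line 7
        for (int id = 0; id < nt; id++) out[id] = s[(id + 1) % nt]; // all of line 8
    } else {
        // One legal interleaving without a barrier: each thread runs to
        // completion in turn, so thread 0 reads s[1] before thread 1 writes it.
        for (int id = 0; id < nt; id++) {
            s[id] = id;
            out[id] = s[(id + 1) % nt];
        }
    }
    return out;
}
```

With the barrier, thread 0 reads the value 1 that thread 1 wrote; without it, thread 0 can observe the stale initial value instead.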

12
Automatic Instrumentation
Original CUDA Source Code
Intermediate Representation
Compile
Instrumentation
Execute
Instrumented CUDA Source Code
Output: Race Conditions Detected?
13
Race Condition Instrumentation
  • Two global bookkeeping arrays
  • Reads & writes of all threads
  • Two per-thread bookkeeping arrays
  • Reads & writes of a single thread
  • After each shared memory access
  • Update bookkeeping arrays
  • Detect & report race conditions

14
Race Condition Detection
  • Original code:
  • RAW hazard at expression
  • line 8: out[id] = s[(id + 1) % nt]
  • Add synchronization between lines 7 and 8:
  • No race conditions detected

15
Outline
  • GPGPU
  • CUDA
  • Automated analyses
  • Correctness: race conditions
  • Performance: bank conflicts
  • Preliminary results
  • Future work
  • Conclusion

16
Bank Conflicts
  • PBSM (per-block shared memory) is fast
  • Much faster than global memory
  • Potentially as fast as register access
  • assuming no bank conflicts
  • Bank conflicts cause serialized access
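To make the serialization concrete, here is a small host-side C++ sketch (an illustration, not part of the tool) that computes the conflict degree of a strided access pattern for a 16-bank layout with 4-byte words:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// A half-warp of 16 threads accessing word index id * stride hits
// bank (id * stride) % 16.  The conflict degree is the largest number
// of threads mapped to any single bank; hardware serializes that many
// accesses, so degree 1 means conflict-free.
int conflict_degree(int stride) {
    std::array<int, 16> hits{};
    for (int id = 0; id < 16; id++)
        hits[(id * stride) % 16]++;
    return *std::max_element(hits.begin(), hits.end());
}
```

Stride 1 (and any stride coprime to 16) is conflict-free; stride 2 gives 2-way conflicts; stride 16 sends all 16 threads to bank 0 and fully serializes the access.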

17
Non-Conflicting Access Patterns
Stride 1
18
Conflicting Access Patterns
19
Impact of Bank Conflicts
20
Automatic Instrumentation
Original CUDA Source Code
Intermediate Representation
Compile
Instrumentation
Execute
Instrumented CUDA Source Code
Output: Race Conditions Detected?
Output: Bank Conflicts Detected?
21
Bank Conflict Instrumentation
  • Global bookkeeping array
  • Tracks address accessed by each thread
  • After each PBSM access
  • Each thread updates its entry
  • One thread computes and reports bank conflicts
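The bookkeeping above might be simulated on the host as follows (hypothetical names; a sketch of the idea rather than the authors' implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Each thread records the byte address of its PBSM access in a global
// array; afterwards a single "thread" folds the addresses into banks
// and reports the worst-case serialization (1 = no conflict).
struct BankConflictTracker {
    static const int kBanks = 16, kWordBytes = 4;
    std::vector<unsigned> addr;                 // one slot per thread
    explicit BankConflictTracker(int nthreads) : addr(nthreads, 0) {}

    // Per-access update: thread tid stores the address it touched.
    void record(int tid, unsigned byte_addr) { addr[tid] = byte_addr; }

    // Run by one thread: count accesses per bank, return the maximum.
    int max_conflicts() const {
        std::vector<int> hits(kBanks, 0);
        for (unsigned a : addr)
            hits[(a / kWordBytes) % kBanks]++;
        return *std::max_element(hits.begin(), hits.end());
    }
};
```

Consecutive words map to consecutive banks, so 16 threads reading 16 consecutive words report no conflict, while 16 threads reading the same word report a 16-way conflict.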

22
Bank Conflict Detection
  • CAUSE_BANK_CONFLICTS = true
  • Bank conflicts at line 14: mem[j]++
  • Bank:     0  1  2  3  4  5  6  7  8  9
  • Accesses: 16 0  0  0  0  0  0  0  0  0
  • CAUSE_BANK_CONFLICTS = false
  • No bank conflicts at line 14: mem[j]++

23
Preliminary Results
  • Scan
  • Included in CUDA SDK
  • All-prefix sums operation
  • 400 lines of code
  • Explicitly prevents race conditions and bank
    conflicts
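For reference, the all-prefix-sums (scan) operation produces the running totals of its input. A minimal host-side C++ version using std::partial_sum (the SDK kernel computes the same result in parallel on the GPU):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```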

24
Preliminary Results: Race Condition Detection
  • Original code
  • No race conditions detected
  • Remove any synchronization calls
  • Race conditions detected

25
Preliminary Results: Bank Conflict Detection
  • Original code
  • Small number of minor bank conflicts
  • Enable bank conflict avoidance macro
  • Bank conflicts increased!
  • Confirmed by manual analysis
  • Culprit: incorrect emulation mode

26
Instrumentation Overhead
  • Two sources
  • Emulation
  • Instrumentation
  • Assumption: for debugging, programmers will
    already use emulation mode

27
Instrumentation Overhead
28
Future Work
  • Find more types of bugs
  • Correctness: array bounds checking
  • Performance: memory coalescing
  • Reduce instrumentation overhead
  • Execute instrumented code natively

29
Conclusion
  • GPGPU: enormous performance potential
  • But parallel programming is challenging
  • Automated instrumentation can help
  • Find synchronization bugs
  • Identify inefficient memory accesses
  • And more

30
Questions?
  • Instrumentation tool will be available at
  • http://www.cs.virginia.edu/~mwb7w/cuda

31
Domain Mapping
From NVIDIA CUDA Programming Guide, Version 1.1
32
Coalesced Accesses
From NVIDIA CUDA Programming Guide, Version 1.1
33
Non-Coalesced Accesses
From NVIDIA CUDA Programming Guide, Version 1.1
34
Race Condition Detection Algorithm
  • A thread t knows a race condition exists at
    shared memory location m if:
  • Location m has been read from and written to
  • One of the accesses to m came from t
  • One of the accesses to m came from a thread other
    than t
  • Note that we are only checking for RAW and WAR
    hazards
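A direct C++ transcription of this predicate for a single shared location m (a sketch; the real tool keeps such history per location in its bookkeeping arrays):

```cpp
#include <cassert>
#include <set>

// Records which thread ids have read and written one shared location m,
// then flags a race for thread t when: m has both a read and a write,
// at least one access is t's own, and at least one comes from another
// thread.  This catches RAW and WAR hazards only, as stated above.
struct LocationHistory {
    std::set<int> readers, writers;
    void read(int t)  { readers.insert(t); }
    void write(int t) { writers.insert(t); }

    bool race_for(int t) const {
        if (readers.empty() || writers.empty()) return false; // need read AND write
        bool mine   = readers.count(t) || writers.count(t);
        bool others = false;
        for (int r : readers) if (r != t) others = true;
        for (int w : writers) if (w != t) others = true;
        return mine && others;
    }
};
```

For example, a write by thread 0 followed by a read from thread 1 is a race for both threads, while a thread reading its own write is not.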

35
Bank Conflicts Example
  • extern __shared__ int mem[];
  • __global__ void kernel(int iters) {
  •   int min, stride, max, id = threadIdx.x;
  •   if (CAUSE_BANK_CONFLICTS)
  •     // Set stride to cause bank conflicts
  •   else
  •     // Set stride to avoid bank conflicts
  •   for (int i = 0; i < iters; i++)
  •     for (int j = min; j < max; j += stride)
  •       mem[j]++;
  • }

36
Instrumented Code Example
Original Code
  • extern __shared__ int s[];
  • __global__ void kernel() {
  •   int id = threadIdx.x;
  •   int nt = blockDim.x *
  •            blockDim.y *
  •            blockDim.z;
  •   s[id] = id;
  •   int temp = s[((nt + id) - 1) % nt];
  • }

RAW hazard at expression: line 10: temp = s[((nt + id) - 1) % nt]
Instrumentation