Title: Accelerating Machine Learning Applications on Graphics Processors
1. Accelerating Machine Learning Applications on Graphics Processors
- Narayanan Sundaram and Bryan Catanzaro
- Presented by Narayanan Sundaram
2. Big Picture
[Layered diagram relating applications, frameworks, patterns, and the people who build each layer:]
- Search application (searcher / consumer; face search developer)
- CBIR application framework, with Feature Extraction and Classifier application patterns (application framework developer)
- Pattern language
- SW infrastructure: Map Reduce programming framework, built on the Map Reduce programming pattern (Map Reduce programming framework developer)
- CUDA computation & communication framework, built on barrier/reduction computation & communication patterns (CUDA framework developer)
- Platform: NVIDIA G80 (hardware architect)
3. GPUs as proxy for manycore
- GPUs are interesting architectures to program
- Transitioning from highly specialized pipelines to general purpose
- The only way to get performance from GPUs is through parallelism (no caching, branch prediction, prefetching, etc.)
- Can launch millions of threads in one call
4. GPUs are not for everyone
- Memory coalescing is really important
- Irregular memory accesses, even to local stores, are discouraged - up to a 30% performance hit on some apps from local memory bank conflicts
- Cannot forget that it is a SIMD machine
- Memory consistency is non-existent; inter-SM synchronization is absent
- Hardware-scheduled threads
- ~20 us overhead per kernel call (20,000 instructions @ 1 GHz) - see the timing sketch below
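The launch-overhead point can be checked with a few lines of host code. This is a minimal sketch of how one might measure it, not the presenters' benchmark; the iteration count and kernel name are made up.

```cuda
// Sketch: measuring kernel-launch overhead with CUDA events.
// Illustrative only; the ~20 us figure above is the presenters' measurement on a G80.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();           // warm up / initialize the context
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();       // the kernel does no work, so time ~= launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch overhead: %.2f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```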
5. NVIDIA G80 Architecture
6. NVIDIA GeForce 8800 GTX Specifications
Number of Streaming Multiprocessors    16
Multiprocessor Width                   8
Local Store Size                       16 KB
Total Number of Stream Processors      128
Peak SP Floating Point Rate            346 Gflops
Clock                                  1.35 GHz
Device Memory                          768 MB
Peak Memory Bandwidth                  86.4 GB/s
Connection to Host CPU                 PCI Express
CPU -> GPU Bandwidth                   2.2 GB/s*
GPU -> CPU Bandwidth                   1.7 GB/s*
* measured values
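The peak floating-point figure above is consistent with the processor count and clock if one assumes each stream processor retires a multiply-add (2 flops) per cycle; this derivation is ours, not from the slide.

```latex
% Peak single-precision rate, assuming one MAD (2 flops) per SP per cycle:
\text{Peak SP rate} \approx 128\ \text{SPs} \times 1.35\ \text{GHz} \times 2\ \tfrac{\text{flops}}{\text{cycle}}
                   = 345.6\ \text{Gflops} \approx 346\ \text{Gflops}
```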
7. GPU programming - CUDA
- Each block can have up to 512 threads that can synchronize with each other (see the sketch below)
- Millions of blocks can be issued
- No synchronization between blocks
- No control over scheduling
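A minimal sketch of the execution model described above: threads within a block share a local store and a barrier, while blocks run independently in whatever order the hardware schedules them. Kernel and variable names are illustrative, not from the talk.

```cuda
// Threads within a block can synchronize via __syncthreads();
// there is no barrier across blocks.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    // Global index: a grid of blocks, each with up to 512 threads on G80.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float tile[256];             // per-block local store (16 KB per SM on G80)
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();                        // barrier for this block only

    if (i < n) data[i] = tile[threadIdx.x] * factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threads = 256;                      // <= 512 threads per block on G80
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);   // blocks may run in any order
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```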
8. Support Vector Machines
- A hugely popular machine learning technique for classification
- Tries to find a hyperplane separating the different classes with maximum margin
- Non-linear decision surfaces can be generated through non-linear kernel functions
- Uses Quadratic Programming for training (the specific set of constraints admits a wide variety of solution techniques)
9. SVM Training
- Quadratic Program (dual form):
    maximize   F(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
    subject to 0 ≤ α_i ≤ C  and  Σ_i α_i y_i = 0
- Some kernel functions: linear (x · z), polynomial ((x · z + r)^d), Gaussian/RBF (exp(−γ ||x − z||²)), sigmoid (tanh(γ x · z + r)) - a device-function sketch follows
- Variables: α_i - weight for each training point (determines the classifier); l - number of training points; C - trades off error on the training set against generalization performance; y_i - label (+/- 1) for each training point; x_i - training points
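As a concrete example of one of the kernel functions above, here is how the Gaussian (RBF) kernel might be evaluated on the device; this is a sketch we added for illustration, not the presenters' code.

```cuda
// Sketch: evaluating the Gaussian (RBF) kernel
// K(x, z) = exp(-gamma * ||x - z||^2) for two d-dimensional points.
__device__ float rbf_kernel(const float *x, const float *z, int d, float gamma) {
    float dist2 = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = x[k] - z[k];
        dist2 += diff * diff;           // accumulate squared distance
    }
    return expf(-gamma * dist2);
}
```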
10. Choice of parallel algorithm (among chunking algorithms)
Sequential Minimal Optimization (SMO)
11. Fitting SMO on a GPU
- Shared memory constraints on the GPU fit the algorithm well, since only two vectors need to be shared among all the threads
- Performance is strongly dependent on the choice of the working set (a reduction sketch follows this list)
- Several heuristics have been proposed; two are popular (1st and 2nd order)
- The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations
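The working-set selection heuristics boil down to parallel argmax/argmin operations over per-training-point quantities. The following is a sketch of one block-level argmax stage of that kind; a second pass would combine the per-block candidates. Array names, sizes, and the filtered quantity are illustrative, not the presenters' code.

```cuda
// Sketch: block-level argmax over per-point values f[i]
// (e.g., optimality measures filtered by the KKT conditions).
__global__ void argmax_stage(const float *f, int n, float *block_max, int *block_idx) {
    __shared__ float sval[256];
    __shared__ int   sidx[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sval[tid] = (i < n) ? f[i] : -3.4e38f;   // -inf sentinel for out-of-range threads
    sidx[tid] = (i < n) ? i    : -1;
    __syncthreads();

    // Tree reduction in shared memory: keep the larger value and its index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] > sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {                          // one candidate per block
        block_max[blockIdx.x] = sval[0];
        block_idx[blockIdx.x] = sidx[0];
    }
}
```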
12. Adaptive heuristic
- Both heuristics can be expressed as a series of Map Reduce stages
- A Map Reduce code generator was used to generate the code
- Sample periodically and adapt, depending on which heuristic is converging fastest at any given time
- Tightly coupled map-reduces are essential for machine learning algorithms
- Cannot afford the overhead of a general library call when it is invoked millions of times
13. Results
[Chart, normalized to the 1st order heuristic]
14. Overall speedup compared to LIBSVM
15. SVM Classification
- The SVM classification task involves finding which side of the hyperplane a point lies on
- Specifically, a test point z is assigned the class sign(f(z)), where
    f(z) = Σ_i y_i α_i K(x_i, z) + b
  and the sum runs over the support vectors x_i
- Insight: instead of evaluating this serially for each test point, note that the kernel evaluations for all (support vector, test point) pairs can be computed together as one large matrix computation
16. Restructuring the Classification problem
[Diagram: evaluating test data against the support vectors (SV) one point at a time vs. evaluating all test data against the SV matrix at once to produce the output; a code sketch follows]
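A rough sketch of the restructured evaluation, assuming an RBF kernel and row-major layouts. In practice the dot-product part would map onto one dense matrix-matrix multiply (SGEMM); this naive kernel just makes the structure visible. All names and layouts are illustrative, not the presenters' implementation.

```cuda
// Sketch: compute K(sv_i, test_j) for all support-vector / test-point
// pairs at once, instead of point by point.
__global__ void kernel_matrix(const float *sv,    // num_sv   x dim, row-major
                              const float *test,  // num_test x dim, row-major
                              float *K,           // num_sv   x num_test
                              int num_sv, int num_test, int dim, float gamma) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // support vector index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // test point index
    if (i >= num_sv || j >= num_test) return;

    float dist2 = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = sv[i * dim + k] - test[j * dim + k];
        dist2 += diff * diff;
    }
    K[i * num_test + j] = expf(-gamma * dist2);      // RBF kernel value
}
// A second pass (a reduction over i with weights y_i * alpha_i, plus b)
// then yields f(z_j) for every test point in parallel.
```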
17. Results
Dataset   LIBSVM time (s)   Optimized CPU code time (s)   GPU time (s)
Adult          61.307              7.476                     0.575
Web           106.835             15.733                     1.063
MNIST         269.880              9.522                     1.951
USPS            0.777              0.229                     0.00958
Face           88.835              5.191                     0.705
18. Results
19. Is this compute or memory bound?
- GPUs are better for memory-bound jobs (observed 7 GB/s vs. 1 GB/s for other streaming-like apps)
20. Importance of memory coalescing
- In order to avoid non-coalesced memory accesses, both Data and its transpose (DataT) were carried in GPU memory (see the access-pattern sketch below)
- Letting 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case
- Well-written code should scale with GPU size (parallelism should be limited by problem size, not machine size)
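The following sketch illustrates why keeping both Data and its transpose pays off: whichever direction a kernel walks the matrix, one of the two layouts lets consecutive threads touch consecutive addresses, so the accesses coalesce. Names and layouts are illustrative assumptions, not the presenters' code.

```cuda
// Data  is l x d (row-major); DataT is d x l (its transpose).
__global__ void read_feature_coalesced(const float *dataT, float *out,
                                       int l, int d, int feature) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // training-point index
    if (i < l)
        // Adjacent threads read adjacent words of dataT -> coalesced into
        // a few wide memory transactions.
        out[i] = dataT[feature * l + i];
}

__global__ void read_feature_strided(const float *data, float *out,
                                     int l, int d, int feature) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < l)
        // Adjacent threads read addresses d floats apart, which the G80
        // cannot coalesce -> roughly one transaction per thread.
        out[i] = data[i * d + feature];
}
```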
21. Is SIMD becoming ubiquitous?
- SIMD is already important for performance on uniprocessor systems
- Task vs. data parallelism
- Intel's new GPU has wide SIMD
- CUDA lesson: runtime SIMD binding is easier for programmers
- Non-SIMD code leads to a performance penalty, not incorrect programs - this prevents premature optimization and keeps code flexible
22. Conclusion
- GPUs and manycore CPUs are on a collision course
- Data parallelism on GPUs vs. task parallelism on CPUs
- Rethink serial control and data structures
- Sequential optimizations may harm parallelism
- Machine learning can use a lot of parallel hardware if the software is engineered properly