Title: Accelerating Machine Learning Applications on Graphics Processors
1. Accelerating Machine Learning Applications on Graphics Processors
- Narayanan Sundaram and Bryan Catanzaro
- Presented by Narayanan Sundaram
2. Big Picture
[Layered diagram relating applications, frameworks, patterns, and the people who build each layer:]
- Search application (searcher / consumer; face search developer)
- CBIR application framework, with Feature Extraction and Classifier application patterns (application framework developer)
- Pattern language
- SW infrastructure: Map Reduce programming framework, built on the Map Reduce programming pattern (Map Reduce programming framework developer)
- CUDA computation & communication framework, built on barrier/reduction computation & communication patterns (CUDA framework developer)
- Platform: NVIDIA G80 (hardware architect)
3. GPUs as proxy for manycore
- GPUs are interesting architectures to program
- Transitioning from highly specialized pipelines to general purpose
- The only way to get performance from GPUs is through parallelism (no caching, branch prediction, prefetching, etc.)
- Can launch millions of threads in one call
4. GPUs are not for everyone
- Memory coalescing is really important
- Irregular memory accesses, even to local stores, are discouraged - up to a 30% performance hit on some apps from local memory bank conflicts
- Cannot forget that it is a SIMD machine
- Memory consistency is non-existent; inter-SM synchronization is absent
- Hardware-scheduled threads
- ~20 us overhead per kernel call (20,000 instructions @ 1 GHz) - see the timing sketch below
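The launch-overhead point can be checked with a few lines of host code. This is a minimal sketch of how one might measure it, not the presenters' benchmark; the iteration count and kernel name are made up.

```cuda
// Sketch: measuring kernel-launch overhead with CUDA events.
// Illustrative only; the ~20 us figure above is the presenters' measurement on a G80.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int iters = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();           // warm up / initialize the context
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<1, 1>>>();       // the kernel does no work, so time ~= launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch overhead: %.2f us\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```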
5. NVIDIA G80 Architecture
6. NVIDIA GeForce 8800 GTX Specifications
Number of Streaming Multiprocessors    16
Multiprocessor Width                   8
Local Store Size                       16 KB
Total Number of Stream Processors      128
Peak SP Floating Point Rate            346 Gflops
Clock                                  1.35 GHz
Device Memory                          768 MB
Peak Memory Bandwidth                  86.4 GB/s
Connection to Host CPU                 PCI Express
CPU -> GPU Bandwidth                   2.2 GB/s*
GPU -> CPU Bandwidth                   1.7 GB/s*
* measured values
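The peak floating-point figure above is consistent with the processor count and clock if one assumes each stream processor retires a multiply-add (2 flops) per cycle; this derivation is ours, not from the slide.

```latex
% Peak single-precision rate, assuming one MAD (2 flops) per SP per cycle:
\text{Peak SP rate} \approx 128\ \text{SPs} \times 1.35\ \text{GHz} \times 2\ \tfrac{\text{flops}}{\text{cycle}}
                   = 345.6\ \text{Gflops} \approx 346\ \text{Gflops}
```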
7. GPU programming - CUDA
- Each block can have up to 512 threads that can synchronize with each other (see the sketch below)
- Millions of blocks can be issued
- No synchronization between blocks
- No control over scheduling
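A minimal sketch of the execution model described above: threads within a block share a local store and a barrier, while blocks run independently in whatever order the hardware schedules them. Kernel and variable names are illustrative, not from the talk.

```cuda
// Threads within a block can synchronize via __syncthreads();
// there is no barrier across blocks.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    // Global index: a grid of blocks, each with up to 512 threads on G80.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float tile[256];             // per-block local store (16 KB per SM on G80)
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();                        // barrier for this block only

    if (i < n) data[i] = tile[threadIdx.x] * factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threads = 256;                      // <= 512 threads per block on G80
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);   // blocks may run in any order
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```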
8. Support Vector Machines
- A hugely popular machine learning technique for classification
- Tries to find a hyperplane separating the different classes with maximum margin
- Non-linear decision surfaces can be generated through non-linear kernel functions
- Uses Quadratic Programming for training (the specific set of constraints admits a wide variety of solution techniques)
9. SVM Training
- Quadratic Program (dual form):
    maximize   F(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
    subject to 0 ≤ α_i ≤ C  and  Σ_i α_i y_i = 0
- Some kernel functions: linear (x · z), polynomial ((x · z + r)^d), Gaussian/RBF (exp(−γ ||x − z||²)), sigmoid (tanh(γ x · z + r)) - a device-function sketch follows
- Variables: α_i - weight for each training point (determines the classifier); l - number of training points; C - trades off error on the training set against generalization performance; y_i - label (+/- 1) for each training point; x_i - training points
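As a concrete example of one of the kernel functions above, here is how the Gaussian (RBF) kernel might be evaluated on the device; this is a sketch we added for illustration, not the presenters' code.

```cuda
// Sketch: evaluating the Gaussian (RBF) kernel
// K(x, z) = exp(-gamma * ||x - z||^2) for two d-dimensional points.
__device__ float rbf_kernel(const float *x, const float *z, int d, float gamma) {
    float dist2 = 0.0f;
    for (int k = 0; k < d; ++k) {
        float diff = x[k] - z[k];
        dist2 += diff * diff;           // accumulate squared distance
    }
    return expf(-gamma * dist2);
}
```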
10. Choice of parallel algorithm (among chunking algorithms)
Sequential Minimal Optimization (SMO)
11. Fitting SMO on a GPU
- Shared memory constraints on the GPU fit the algorithm well, since only two vectors need to be shared among all the threads
- Performance is strongly dependent on the choice of the working set (a reduction sketch follows this list)
- Several heuristics have been proposed; two are popular (1st and 2nd order)
- The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations
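The working-set selection heuristics boil down to parallel argmax/argmin operations over per-training-point quantities. The following is a sketch of one block-level argmax stage of that kind; a second pass would combine the per-block candidates. Array names, sizes, and the filtered quantity are illustrative, not the presenters' code.

```cuda
// Sketch: block-level argmax over per-point values f[i]
// (e.g., optimality measures filtered by the KKT conditions).
__global__ void argmax_stage(const float *f, int n, float *block_max, int *block_idx) {
    __shared__ float sval[256];
    __shared__ int   sidx[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sval[tid] = (i < n) ? f[i] : -3.4e38f;   // -inf sentinel for out-of-range threads
    sidx[tid] = (i < n) ? i    : -1;
    __syncthreads();

    // Tree reduction in shared memory: keep the larger value and its index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] > sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {                          // one candidate per block
        block_max[blockIdx.x] = sval[0];
        block_idx[blockIdx.x] = sidx[0];
    }
}
```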
12. Adaptive heuristic
- Both heuristics can be expressed as a series of Map Reduce stages
- A Map Reduce code generator was used to generate the code
- Sample periodically and adapt, depending on which heuristic is converging fastest at any given time
- Tightly coupled map-reduces are essential for machine learning algorithms
- Cannot afford the overhead of a general library call when it is invoked millions of times
13. Results
[Chart, normalized to the 1st order heuristic]
14. Overall speedup compared to LIBSVM
15. SVM Classification
- The SVM classification task involves finding which side of the hyperplane a point lies on
- Specifically, a test point z is assigned the class sign(f(z)), where
    f(z) = Σ_i y_i α_i K(x_i, z) + b
  and the sum runs over the support vectors x_i
- Insight: instead of evaluating this serially for each test point, note that the kernel evaluations for all (support vector, test point) pairs can be computed together as one large matrix computation
16. Restructuring the Classification problem
[Diagram: evaluating test data against the support vectors (SV) one point at a time vs. evaluating all test data against the SV matrix at once to produce the output; a code sketch follows]
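A rough sketch of the restructured evaluation, assuming an RBF kernel and row-major layouts. In practice the dot-product part would map onto one dense matrix-matrix multiply (SGEMM); this naive kernel just makes the structure visible. All names and layouts are illustrative, not the presenters' implementation.

```cuda
// Sketch: compute K(sv_i, test_j) for all support-vector / test-point
// pairs at once, instead of point by point.
__global__ void kernel_matrix(const float *sv,    // num_sv   x dim, row-major
                              const float *test,  // num_test x dim, row-major
                              float *K,           // num_sv   x num_test
                              int num_sv, int num_test, int dim, float gamma) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // support vector index
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // test point index
    if (i >= num_sv || j >= num_test) return;

    float dist2 = 0.0f;
    for (int k = 0; k < dim; ++k) {
        float diff = sv[i * dim + k] - test[j * dim + k];
        dist2 += diff * diff;
    }
    K[i * num_test + j] = expf(-gamma * dist2);      // RBF kernel value
}
// A second pass (a reduction over i with weights y_i * alpha_i, plus b)
// then yields f(z_j) for every test point in parallel.
```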
17. Results
Dataset   LIBSVM time (s)   Optimized CPU code time (s)   GPU time (s)
Adult          61.307              7.476                     0.575
Web           106.835             15.733                     1.063
MNIST         269.880              9.522                     1.951
USPS            0.777              0.229                     0.00958
Face           88.835              5.191                     0.705
18. Results
19. Is this compute or memory bound?
- GPUs are better for memory-bound jobs (observed 7 GB/s vs. 1 GB/s for other streaming-like apps)
20. Importance of memory coalescing
- In order to avoid non-coalesced memory accesses, both Data and its transpose (DataT) were carried in GPU memory (see the access-pattern sketch below)
- Letting 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case
- Well-written code should scale with GPU size (parallelism should be limited by problem size, not machine size)
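The following sketch illustrates why keeping both Data and its transpose pays off: whichever direction a kernel walks the matrix, one of the two layouts lets consecutive threads touch consecutive addresses, so the accesses coalesce. Names and layouts are illustrative assumptions, not the presenters' code.

```cuda
// Data  is l x d (row-major); DataT is d x l (its transpose).
__global__ void read_feature_coalesced(const float *dataT, float *out,
                                       int l, int d, int feature) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // training-point index
    if (i < l)
        // Adjacent threads read adjacent words of dataT -> coalesced into
        // a few wide memory transactions.
        out[i] = dataT[feature * l + i];
}

__global__ void read_feature_strided(const float *data, float *out,
                                     int l, int d, int feature) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < l)
        // Adjacent threads read addresses d floats apart, which the G80
        // cannot coalesce -> roughly one transaction per thread.
        out[i] = data[i * d + feature];
}
```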
21. Is SIMD becoming ubiquitous?
- SIMD is already important for performance on uniprocessor systems
- Task vs. data parallelism
- Intel's new GPU has wide SIMD
- CUDA lesson: runtime SIMD binding is easier for programmers
- Non-SIMD code leads to a performance penalty, not incorrect programs - this prevents premature optimization and keeps code flexible
22. Conclusion
- GPUs and manycore CPUs are on a collision course
- Data parallelism on GPUs vs. task parallelism on CPUs
- Rethink serial control and data structures
- Sequential optimizations may harm parallelism
- Machine learning can use a lot of parallel hardware if the software is engineered properly