Title: Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Slide 1: Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany
Slide 2: Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course, a (nudist) beach (and 199 days of rain each year)
Slide 3: Hybrid architectures in the Top500 (Nov '10)
Slide 4:
- Hybrid architectures
  - High compute power / memory bandwidth
  - Energy efficient
  - Yet operated today at low efficiency
- Agenda for this talk
  - GPU architecture intuition: what generates the above characteristics?
  - Progress on efficiently harnessing hybrid (GPU-based) architectures
Slides 5-11: (GPU architecture build-up) Acknowledgement: slides borrowed from a presentation by Kayvon Fatahalian
Slide 12: Feed the cores with data
Idea 3: The processing elements are data hungry! => Wide, high-throughput memory bus

Slide 13: 10,000x parallelism!
Idea 4: Hide memory access latency => Hardware-supported multithreading
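A back-of-the-envelope sketch of why latency hiding demands such massive parallelism (Little's law: in-flight work = latency x throughput). The figures below are illustrative, drawn from the Tesla-class numbers on the next slide; the one-request-per-thread and one-op-per-cycle assumptions are simplifications, not measurements.

```python
# Rough estimate of threads needed to hide global-memory latency via
# hardware multithreading. Assumes each thread keeps one memory request
# in flight and each core can issue one operation per cycle (simplified).

cores = 448                       # Tesla C2050-class core count
mem_latency_cycles = 500          # global memory, mid-range of 400-600 cycles
requests_per_core_per_cycle = 1   # assumed issue rate

# Little's law: resident threads = latency x throughput
threads_per_core = mem_latency_cycles * requests_per_core_per_cycle
threads_to_hide_latency = cores * threads_per_core
print(threads_to_hide_latency)    # on the order of 10^5 resident threads
```

This is why GPUs keep tens of thousands of threads resident: any fewer, and the cores stall waiting on memory.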
Slide 14: The Resulting GPU Architecture
- NVIDIA Tesla C2050
  - 448 cores
  - Four memories
    - Shared: fast (~4 cycles), small (48KB)
    - Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
    - Texture: read only
    - Constant: read only
- Hybrid system: host connected to the GPU over PCIe x16 (~4GB/s)
Slide 15: GPUs offer different characteristics
- High peak memory bandwidth
- Limited memory space
- High peak compute power
- High host-device communication overhead
- Complex to program
Slide 16: Projects at NetSysLab @ UBC (http://netsyslab.ece.ubc.ca)
- Porting applications to efficiently exploit GPU characteristics
  - Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  - Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
- Middleware runtime support to simplify application development
  - CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
- GPU-optimized building blocks: data structures and libraries
  - GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
  - Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  - A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
  - On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
Slide 17: Motivating question: How should we design applications to efficiently exploit GPU characteristics?
- Context
  - A bioinformatics problem: sequence alignment
  - A string matching problem
  - Data intensive (~10^2 GB)
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Slide 18: Past work: sequence alignment on GPUs
- MUMmerGPU [Schatz 07, Trapnell 09]
  - A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
  - ~4x (end-to-end) speedup compared to the CPU version
  - > 50% overhead
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
Slide 19: Idea: trade off time for space
- Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
  - 4x speedup compared to the suffix tree-based GPU implementation
  - Significant overhead reduction
- Consequences
  - Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  - Focus shifts toward optimizing the compute stage
Slide 20: Outline for the rest of this talk
- Sequence alignment: background and offloading to the GPU
- Space/time trade-off analysis
- Evaluation
Slide 21: Background: The Sequence Alignment Problem
(Figure: many short query fragments, e.g. CCAT, GGCT..., ...TAGGC, aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..)
- Problem: find where each query most likely originated from
- Queries
  - ~10^8 queries
  - 10^1 to 10^2 symbols per query
- Reference
  - 10^6 to 10^11 symbols (up to ~400GB)
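At its core, the problem above is large-scale substring matching: locate each short query inside a long reference. A minimal brute-force sketch on toy inputs (the function name and inputs are illustrative, not from the talk's implementation):

```python
# Brute-force view of the alignment problem: for each query, collect every
# position where it occurs verbatim in the reference. Real inputs are
# ~10^8 queries against references up to ~400GB, hence the need for an index.

def match_positions(reference, queries):
    """Return, for each query, the sorted list of positions where it occurs."""
    hits = {}
    for q in queries:
        positions, start = [], reference.find(q)
        while start != -1:
            positions.append(start)
            start = reference.find(q, start + 1)
        hits[q] = positions
    return hits

reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
queries = ["CCAT", "GGCT", "TAGG", "AAAA"]
print(match_positions(reference, queries))
```

Brute force is O(ref_len) per query; the rest of the talk is about index structures that avoid rescanning the reference for every query.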
Slide 22: GPU Offloading: Opportunity and Challenges
Opportunity:
- Sequence alignment
  - Easy to partition
  - Memory intensive
- GPU
  - Massively parallel
  - High memory bandwidth
Challenges:
- Data intensive
- Large output size
- Limited memory space
- No direct access to other I/O devices (e.g., disk)
Slide 23: GPU Offloading: addressing the challenges
- Data-intensive problem and limited memory space
  - divide and compute in rounds
  - search-optimized data structures
- Large output size
  - compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    Decompress(results)
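The round structure of the host algorithm can be sketched in plain Python. The GPU steps (CopyToGPU, MatchKernel, CopyFromGPU, Decompress) are replaced by in-process stand-ins so the loop nest itself is runnable; chunk sizes and helper names are illustrative. Note one simplification: splitting the reference into disjoint chunks misses matches that span a chunk boundary, which a real implementation handles by overlapping sub-references.

```python
# Round-based host loop, mirroring the pseudocode above. Each outer round
# loads one batch of queries; the inner loop streams sub-references past it.

def divide(seq, chunk):
    """DivideRef / DivideQrys stand-in: split into fixed-size chunks."""
    return [seq[i:i + chunk] for i in range(0, len(seq), chunk)]

def align_in_rounds(reference, queries, ref_chunk=16, qry_chunk=2):
    results = []
    subrefs = divide(reference, ref_chunk)           # DivideRef(ref)
    for subqryset in divide(queries, qry_chunk):     # outer loop: query batches
        round_hits = []                              # results = NULL
        for chunk_idx, subref in enumerate(subrefs):
            base = chunk_idx * ref_chunk             # global offset of this chunk
            # MatchKernel stand-in: match every query against this sub-reference
            for q in subqryset:
                pos = subref.find(q)
                while pos != -1:
                    round_hits.append((q, base + pos))
                    pos = subref.find(q, pos + 1)
        results.extend(round_hits)                   # CopyFromGPU + Decompress
    return results

reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
print(align_in_rounds(reference, ["CCAT", "TAGG"]))
```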
Slide 24: Space/Time Trade-off Analysis
Slide 25: The core data structure
- Massive number of queries and a long reference => pre-process the reference into an index
- Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  - Search: O(qry_len) per query
  - Space: O(ref_len), but the constant is high: ~20 x ref_len
  - Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
Slide 26: The core data structure
- Massive number of queries and a long reference => pre-process the reference into an index
(The round-based host algorithm from slide 23 is shown again, annotated with the cost of each stage.)
- Past work: build a suffix tree (MUMmerGPU [Schatz 07])
  - Search: O(qry_len) per query -- efficient
  - Space: O(ref_len), but the constant is high: ~20 x ref_len -- expensive (data transfer)
  - Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query -- expensive
Slide 27: A better matching data structure?
Suffix array for the reference "TACACA" (suffixes in sorted order):
0  A
1  ACA
2  ACACA
3  CA
4  CACA
5  TACACA

              Suffix Tree                       Suffix Array
Space         O(ref_len), ~20 x ref_len        O(ref_len), ~4 x ref_len
Search        O(qry_len)                       O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))   O(qry_len - min_match_len)

Impact 1: Less data to transfer => reduced communication
Slide 28: A better matching data structure
(Same "TACACA" example and comparison table as slide 27.)
Space freed for longer sub-references => fewer processing rounds.
Impact 2: Better data locality is achieved at the cost of additional per-thread processing time
Slide 29: A better matching data structure
(Same "TACACA" example and comparison table as slide 27.)
Impact 3: Lower post-processing overhead
Slide 30: Evaluation
Slide 31: Evaluation setup
- Testbed
  - Low-end: GeForce 9800 GX2 GPU (512MB)
  - High-end: Tesla C1060 (4GB)
- Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
- Success metrics
  - Performance
  - Energy consumption
- Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

Workload / Species             Reference length   # of queries   Avg. read length
HS1  - Human (chromosome 2)    238M               78M            200
HS2  - Human (chromosome 3)    100M               2M             700
MONO - L. monocytogenes        3M                 6M             120
SUIS - S. suis                 2M                 26M            36
Slide 32: Speedup: array-based over tree-based
Slide 33: Dissecting the overheads
Significant reduction in data transfers and post-processing.
(Workload: HS1, 78M queries, 238M reference length, on GeForce.)
Slide 34: Comparing with CPU performance (baseline: single-core performance)
(Chart comparing the suffix-tree and suffix-array GPU implementations against the CPU baseline.)
Slide 35: Summary
- GPUs have drastically different performance characteristics
- Reconsidering the choice of data structure is necessary when porting applications to the GPU
- A well-matched data structure ensures:
  - Low communication overhead
  - Data locality (possibly at the cost of additional per-thread processing time)
  - Low post-processing overhead
Slide 36: Code, benchmarks and papers available at netsyslab.ece.ubc.ca