Title: Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Slide 1: Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany
Slide 2: Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course, a (nudist) beach (and 199 days of rain each year)
Slide 3: Hybrid architectures in the Top500 (Nov '10)
Slide 4:
- Hybrid architectures
  - High compute power / memory bandwidth
  - Energy efficient
  - Yet operated today at low efficiency
- Agenda for this talk
  - GPU architecture intuition: what generates the above characteristics?
  - Progress on efficiently harnessing hybrid (GPU-based) architectures
Slides 5-11: (GPU architecture build-up) Acknowledgement: slides borrowed from a presentation by Kayvon Fatahalian
Slide 12: Feed the cores with data
Idea 3: The processing elements are data hungry! => Wide, high-throughput memory bus

Slide 13: 10,000x parallelism!
Idea 4: Hide memory access latency => Hardware-supported multithreading
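A back-of-the-envelope sketch of why latency hiding demands such massive parallelism (Little's law: in-flight work = latency x throughput). The figures below are illustrative, drawn from the Tesla-class numbers on the next slide; the one-request-per-thread and one-op-per-cycle assumptions are simplifications, not measurements.

```python
# Rough estimate of threads needed to hide global-memory latency via
# hardware multithreading. Assumes each thread keeps one memory request
# in flight and each core can issue one operation per cycle (simplified).

cores = 448                       # Tesla C2050-class core count
mem_latency_cycles = 500          # global memory, mid-range of 400-600 cycles
requests_per_core_per_cycle = 1   # assumed issue rate

# Little's law: resident threads = latency x throughput
threads_per_core = mem_latency_cycles * requests_per_core_per_cycle
threads_to_hide_latency = cores * threads_per_core
print(threads_to_hide_latency)    # on the order of 10^5 resident threads
```

This is why GPUs keep tens of thousands of threads resident: any fewer, and the cores stall waiting on memory.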
Slide 14: The Resulting GPU Architecture
- NVIDIA Tesla C2050
  - 448 cores
  - Four memories
    - Shared: fast (~4 cycles), small (48KB)
    - Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
    - Texture: read only
    - Constant: read only
- Hybrid system: host connected to the GPU over PCIe x16 (~4GB/s)
Slide 15: GPUs offer different characteristics
- High peak memory bandwidth
- Limited memory space
- High peak compute power
- High host-device communication overhead
- Complex to program
Slide 16: Projects at NetSysLab @ UBC (http://netsyslab.ece.ubc.ca)
- Porting applications to efficiently exploit GPU characteristics
  - Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  - Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
- Middleware runtime support to simplify application development
  - CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
- GPU-optimized building blocks: data structures and libraries
  - GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
  - Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  - A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
  - On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
Slide 17: Motivating question: How should we design applications to efficiently exploit GPU characteristics?
- Context
  - A bioinformatics problem: sequence alignment
  - A string matching problem
  - Data intensive (~10^2 GB)
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Slide 18: Past work: sequence alignment on GPUs
- MUMmerGPU [Schatz 07, Trapnell 09]
  - A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
  - ~4x (end-to-end) speedup compared to the CPU version
  - > 50% overhead
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
Slide 19: Idea: trade off time for space
- Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
  - 4x speedup compared to the suffix tree-based GPU implementation
  - Significant overhead reduction
- Consequences
  - Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  - Focus shifts toward optimizing the compute stage
Slide 20: Outline for the rest of this talk
- Sequence alignment: background and offloading to the GPU
- Space/time trade-off analysis
- Evaluation
Slide 21: Background: The Sequence Alignment Problem
(Figure: many short query fragments, e.g. CCAT, GGCT..., ...TAGGC, aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..)
- Problem: find where each query most likely originated from
- Queries
  - ~10^8 queries
  - 10^1 to 10^2 symbols per query
- Reference
  - 10^6 to 10^11 symbols (up to ~400GB)
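At its core, the problem above is large-scale substring matching: locate each short query inside a long reference. A minimal brute-force sketch on toy inputs (the function name and inputs are illustrative, not from the talk's implementation):

```python
# Brute-force view of the alignment problem: for each query, collect every
# position where it occurs verbatim in the reference. Real inputs are
# ~10^8 queries against references up to ~400GB, hence the need for an index.

def match_positions(reference, queries):
    """Return, for each query, the sorted list of positions where it occurs."""
    hits = {}
    for q in queries:
        positions, start = [], reference.find(q)
        while start != -1:
            positions.append(start)
            start = reference.find(q, start + 1)
        hits[q] = positions
    return hits

reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
queries = ["CCAT", "GGCT", "TAGG", "AAAA"]
print(match_positions(reference, queries))
```

Brute force is O(ref_len) per query; the rest of the talk is about index structures that avoid rescanning the reference for every query.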
Slide 22: GPU Offloading: Opportunity and Challenges
Opportunity:
- Sequence alignment
  - Easy to partition
  - Memory intensive
- GPU
  - Massively parallel
  - High memory bandwidth
Challenges:
- Data intensive
- Large output size
- Limited memory space
- No direct access to other I/O devices (e.g., disk)
Slide 23: GPU Offloading: addressing the challenges
- Data-intensive problem and limited memory space
  - divide and compute in rounds
  - search-optimized data structures
- Large output size
  - compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    Decompress(results)
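The round structure of the host algorithm can be sketched in plain Python. The GPU steps (CopyToGPU, MatchKernel, CopyFromGPU, Decompress) are replaced by in-process stand-ins so the loop nest itself is runnable; chunk sizes and helper names are illustrative. Note one simplification: splitting the reference into disjoint chunks misses matches that span a chunk boundary, which a real implementation handles by overlapping sub-references.

```python
# Round-based host loop, mirroring the pseudocode above. Each outer round
# loads one batch of queries; the inner loop streams sub-references past it.

def divide(seq, chunk):
    """DivideRef / DivideQrys stand-in: split into fixed-size chunks."""
    return [seq[i:i + chunk] for i in range(0, len(seq), chunk)]

def align_in_rounds(reference, queries, ref_chunk=16, qry_chunk=2):
    results = []
    subrefs = divide(reference, ref_chunk)           # DivideRef(ref)
    for subqryset in divide(queries, qry_chunk):     # outer loop: query batches
        round_hits = []                              # results = NULL
        for chunk_idx, subref in enumerate(subrefs):
            base = chunk_idx * ref_chunk             # global offset of this chunk
            # MatchKernel stand-in: match every query against this sub-reference
            for q in subqryset:
                pos = subref.find(q)
                while pos != -1:
                    round_hits.append((q, base + pos))
                    pos = subref.find(q, pos + 1)
        results.extend(round_hits)                   # CopyFromGPU + Decompress
    return results

reference = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
print(align_in_rounds(reference, ["CCAT", "TAGG"]))
```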
Slide 24: Space/Time Trade-off Analysis
Slide 25: The core data structure
- Massive number of queries and a long reference => pre-process the reference into an index
- Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  - Search: O(qry_len) per query
  - Space: O(ref_len), but the constant is high: ~20 x ref_len
  - Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
Slide 26: The core data structure
- Massive number of queries and a long reference => pre-process the reference into an index
(The round-based host algorithm from slide 23 is shown again, annotated with the cost of each stage.)
- Past work: build a suffix tree (MUMmerGPU [Schatz 07])
  - Search: O(qry_len) per query -- efficient
  - Space: O(ref_len), but the constant is high: ~20 x ref_len -- expensive (data transfer)
  - Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query -- expensive
Slide 27: A better matching data structure?
Suffix array for the reference "TACACA" (suffixes in sorted order):
0  A
1  ACA
2  ACACA
3  CA
4  CACA
5  TACACA

              Suffix Tree                       Suffix Array
Space         O(ref_len), ~20 x ref_len        O(ref_len), ~4 x ref_len
Search        O(qry_len)                       O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))   O(qry_len - min_match_len)

Impact 1: Less data to transfer => reduced communication
Slide 28: A better matching data structure
(Same "TACACA" example and comparison table as slide 27.)
Space freed for longer sub-references => fewer processing rounds.
Impact 2: Better data locality is achieved at the cost of additional per-thread processing time
Slide 29: A better matching data structure
(Same "TACACA" example and comparison table as slide 27.)
Impact 3: Lower post-processing overhead
Slide 30: Evaluation
Slide 31: Evaluation setup
- Testbed
  - Low-end: GeForce 9800 GX2 GPU (512MB)
  - High-end: Tesla C1060 (4GB)
- Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
- Success metrics
  - Performance
  - Energy consumption
- Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

Workload / Species             Reference length   # of queries   Avg. read length
HS1  - Human (chromosome 2)    238M               78M            200
HS2  - Human (chromosome 3)    100M               2M             700
MONO - L. monocytogenes        3M                 6M             120
SUIS - S. suis                 2M                 26M            36
Slide 32: Speedup: array-based over tree-based
Slide 33: Dissecting the overheads
Significant reduction in data transfers and post-processing.
(Workload: HS1, 78M queries, 238M reference length, on GeForce.)
Slide 34: Comparing with CPU performance (baseline: single-core performance)
(Chart comparing the suffix-tree and suffix-array GPU implementations against the CPU baseline.)
Slide 35: Summary
- GPUs have drastically different performance characteristics
- Reconsidering the choice of data structure is necessary when porting applications to the GPU
- A well-matched data structure ensures:
  - Low communication overhead
  - Data locality (possibly at the cost of additional per-thread processing time)
  - Low post-processing overhead
Slide 36: Code, benchmarks and papers available at netsyslab.ece.ubc.ca