Transcript and Presenter's Notes

Title: Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance


1
Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work: Abdullah Gharaibeh, Samer Al-Kiswany
2
Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course, a (nudist) beach
(and 199 days of rain each year)
3
Hybrid architectures in the Top500 list (Nov 2010)
4
  • Hybrid architectures
  • High compute power / memory bandwidth
  • Energy efficient
  • Yet operated today at low efficiency
  • Agenda for this talk
  • GPU architecture intuition: what generates the above characteristics?
  • Progress on efficiently harnessing hybrid (GPU-based) architectures

5-11
Acknowledgement: slides 5-11 borrowed from a presentation by Kayvon Fatahalian
12
Feed the cores with data
Idea 3: the processing elements are data hungry! → wide, high-throughput memory bus
13
10,000x parallelism!
Idea 4: hide memory access latency → hardware-supported multithreading
14
The Resulting GPU Architecture
  • NVIDIA Tesla C2050
  • 448 cores
  • Four memories
  • Shared: fast (~4 cycles), small (48 KB)
  • Global: slow (400-600 cycles), large (up to 3 GB), high throughput (~150 GB/s)
  • Texture: read only
  • Constant: read only
  • Hybrid system: the GPU is connected to the host over PCIe x16 (~4 GB/s); a minimal CUDA sketch of exploiting this memory hierarchy follows
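A minimal CUDA sketch (illustrative, not from the talk) of why this hierarchy matters: each block stages a tile of slow global memory into fast on-chip shared memory and reduces it there, so every input element is read from global memory only once.

    #include <cuda_runtime.h>

    // Illustrative kernel: stage data from slow global memory into fast
    // on-chip shared memory, then reduce it there. Launch with 256 threads
    // per block to match the tile size.
    __global__ void tiledSum(const float *in, float *blockSums, int n) {
        __shared__ float tile[256];                  // shared memory: ~4-cycle latency
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one global read: 400-600 cycles
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = tile[0];         // one global write per block
    }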
15
GPUs offer different characteristics
  • High peak memory bandwidth
  • Limited memory space
  • High peak compute power
  • High host-device communication overhead
  • Complex to program

16
Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
  • Porting applications to efficiently exploit GPU characteristics
  • "Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications
    Performance", A. Gharaibeh, M. Ripeanu, SC'10
  • "Accelerating Sequence Alignment on Hybrid Architectures",
    A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine,
    January/February 2011
  • Middleware runtime support to simplify application development
  • "CrystalGPU: Transparent and Efficient Utilization of GPU Power",
    A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
  • GPU-optimized building blocks: data structures and libraries
  • "GPU Support for Batch Oriented Workloads", L. Costa, S. Al-Kiswany,
    M. Ripeanu, IPCCC'09
  • "Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications
    Performance", A. Gharaibeh, M. Ripeanu, SC'10
  • "A GPU Accelerated Storage System", A. Gharaibeh, S. Al-Kiswany,
    M. Ripeanu, HPDC'10
  • "On GPU's Viability as a Middleware Accelerator", S. Al-Kiswany,
    A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

17
Motivating question: how should we design applications to efficiently exploit GPU characteristics?
  • Context: a bioinformatics problem, sequence alignment
  • A string-matching problem
  • Data intensive (10^2 GB)

"Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance", A. Gharaibeh, M. Ripeanu, SC'10
18
Past work: sequence alignment on GPUs
  • MUMmerGPU [Schatz 07, Trapnell 09]
  • A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
  • 4x (end-to-end) speedup compared to the CPU version
  • yet > 50% of the runtime is overhead
Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics
19
Idea: trade off time for space
  • Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
  • 4x speedup compared to the suffix tree-based GPU version
Significant overhead reduction
  • Consequences
  • Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  • Focus shifts towards optimizing the compute stage
(A construction sketch follows.)
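To make the space argument concrete, a minimal host-side construction sketch (assumed code, not the paper's implementation): the suffix array index is one 32-bit integer per reference position, i.e. about 4 x ref_len bytes, versus roughly 20 x ref_len for a suffix tree.

    #include <algorithm>
    #include <cstring>
    #include <vector>

    // Naive suffix array construction: sort suffix start positions by the
    // suffix they denote. Assumes ref is NUL-terminated. Production builders
    // use O(n log n) or O(n) algorithms instead of naive comparisons.
    std::vector<int> buildSuffixArray(const char *ref, int n) {
        std::vector<int> sa(n);
        for (int i = 0; i < n; ++i)
            sa[i] = i;                                 // 4 bytes per position
        std::sort(sa.begin(), sa.end(), [ref](int a, int b) {
            return std::strcmp(ref + a, ref + b) < 0;  // compare suffixes lexicographically
        });
        return sa;
    }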

20
Outline for the rest of this talk
  • Sequence alignment background and offloading to
    GPU
  • Space/Time trade-off analysis
  • Evaluation

21
Background: the Sequence Alignment Problem

[Figure: many short queries (CCAT, GGCT..., ...TAGGC, TGCGC..., ...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

  • Problem: find where each query most likely originated from
  • Queries
  • ~10^8 queries
  • 10^1 to 10^2 symbols per query
  • Reference
  • 10^6 to 10^11 symbols (up to 400 GB)

22
GPU Offloading: Opportunity and Challenges

Opportunity
  • Sequence alignment
  • Easy to partition
  • Memory intensive
  • GPU
  • Massively parallel
  • High memory bandwidth

Challenges
  • Data intensive
  • Large output size
  • Limited memory space
  • No direct access to other I/O devices (e.g., disk)
23
GPU Offloading: addressing the challenges
  • Data-intensive problem and limited memory space → divide and compute in rounds; use search-optimized data structures
  • Large output size → compressed output representation (decompressed on the CPU)

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        Decompress(results)

High-level algorithm (executed on the host); a hedged CUDA rendering follows.
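A CUDA sketch of this host loop (all names, sizes, and the placeholder kernel are illustrative assumptions, not the paper's API):

    #include <cuda_runtime.h>
    #include <vector>

    struct Chunk { const char *data; size_t bytes; };  // hypothetical host-side buffer

    // Placeholder kernel so the sketch compiles; the real kernel would match
    // each query in the resident query set against the current reference chunk.
    __global__ void MatchKernel(const char *qrys, const char *ref,
                                int *results, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n) results[t] = 0;
    }

    void alignInRounds(const std::vector<Chunk> &subqrysets,
                       const std::vector<Chunk> &subrefs,
                       size_t maxQryBytes, size_t maxRefBytes,
                       int nResults, int *h_results) {
        char *d_qrys; char *d_ref; int *d_results;
        cudaMalloc(&d_qrys, maxQryBytes);              // buffers sized so one chunk fits
        cudaMalloc(&d_ref, maxRefBytes);
        cudaMalloc(&d_results, nResults * sizeof(int));
        for (const Chunk &qs : subqrysets) {           // queries resident for a whole round
            cudaMemcpy(d_qrys, qs.data, qs.bytes, cudaMemcpyHostToDevice);
            for (const Chunk &rs : subrefs) {          // reference streamed chunk by chunk
                cudaMemcpy(d_ref, rs.data, rs.bytes, cudaMemcpyHostToDevice);
                MatchKernel<<<(nResults + 255) / 256, 256>>>(d_qrys, d_ref,
                                                             d_results, nResults);
                cudaMemcpy(h_results, d_results, nResults * sizeof(int),
                           cudaMemcpyDeviceToHost);    // compressed matches back to host
            }
            // Decompress(h_results);                  // post-processing stays on the CPU
        }
        cudaFree(d_qrys); cudaFree(d_ref); cudaFree(d_results);
    }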
24
Space/Time Trade-off Analysis
25
The core data structure
  • Massive number of queries and a long reference ⇒ pre-process the reference into an index
  • Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  • Search: O(qry_len) per query
  • Space: O(ref_len), but the constant is high: ~20 x ref_len
  • Post-processing: a DFS traversal for each query, O(4^(qry_len - min_match_len))
26
The core data structure
  • Massive number of queries and a long reference ⇒ pre-process the reference into an index

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets
        results = NULL
        CopyToGPU(subqryset)                 (expensive)
        foreach subref in subrefs
            CopyToGPU(subref)                (expensive)
            MatchKernel(subqryset, subref)   (efficient)
            CopyFromGPU(results)             (expensive)
        Decompress(results)                  (expensive)

  • Past work: build a suffix tree (MUMmerGPU [Schatz 07])
  • Search: O(qry_len) per query (efficient)
  • Space: O(ref_len), but the constant is high, ~20 x ref_len (expensive data transfers)
  • Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query (expensive)
27
A better matching data structure?

Suffix array of the example reference TACACA (rank: suffix, in sorted order):
    0  A
    1  ACA
    2  ACACA
    3  CA
    4  CACA
    5  TACACA

                Suffix Tree                        Suffix Array
Space           O(ref_len), 20 x ref_len           O(ref_len), 4 x ref_len
Search          O(qry_len)                         O(qry_len x log ref_len)
Post-process    O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

Impact 1: Reduced communication (less data to transfer); a minimal search sketch follows.
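For concreteness, a minimal sketch (assumed code, not the paper's implementation) of the O(qry_len x log ref_len) search: binary search over the sorted suffixes for the first one that has the query as a prefix. Assumes ref is NUL-terminated.

    #include <cstring>

    // Returns the rank (in sa) of the first suffix with qry as a prefix,
    // or -1 if the query does not occur: log(ref_len) probes, each up to
    // qry_len symbol comparisons, hence O(qry_len x log ref_len).
    int saFind(const char *ref, const int *sa, int refLen,
               const char *qry, int qryLen) {
        int lo = 0, hi = refLen;
        while (lo < hi) {                              // binary search over suffix ranks
            int mid = lo + (hi - lo) / 2;
            if (std::strncmp(ref + sa[mid], qry, qryLen) < 0)
                lo = mid + 1;                          // suffix sorts before query: go right
            else
                hi = mid;
        }
        if (lo == refLen || std::strncmp(ref + sa[lo], qry, qryLen) != 0)
            return -1;                                 // query does not occur
        return lo;                                     // sa[lo] is a match position in ref
    }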
28
A better matching data structure
(same suffix array example and space/search/post-process table as the previous slide)

Space for longer sub-references ⇒ fewer processing rounds
Impact 2: Better data locality is achieved at the cost of additional per-thread processing time
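For intuition (assumed numbers, not from the slides): with roughly 512 MB of device memory, a 20 x ref_len suffix tree caps each round at about 512/20 ≈ 25M reference symbols, while a 4 x ref_len suffix array allows about 512/4 ≈ 128M, i.e. roughly 5x fewer rounds over the same reference (ignoring space for queries and output).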
29
A better matching data structure
(same suffix array example and space/search/post-process table as the previous slides)

Impact 3: Lower post-processing overhead
30
Evaluation
31
Evaluation setup
  • Testbed
  • Low-end: GeForce 9800 GX2 (512 MB)
  • High-end: Tesla C1060 (4 GB)
  • Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
  • Success metrics
  • Performance
  • Energy consumption
  • Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

Workload / Species             Reference sequence length    # of queries    Average read length
HS1 - Human (chromosome 2)     238M                         78M             200
HS2 - Human (chromosome 3)     100M                         2M              700
MONO - L. monocytogenes        3M                           6M              120
SUIS - S. suis                 2M                           26M             36
32
Speedup: array-based over tree-based
33
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1 (78M queries, 238M reference length), on the GeForce
34
Comparing with CPU performance (baseline: single-core performance)
[Chart comparing suffix tree and suffix array implementations; series labels: suffix tree, suffix array, suffix tree]
35
Summary
  • GPUs have drastically different performance characteristics
  • Reconsidering the choice of data structure is necessary when porting applications to the GPU
  • A good matching data structure ensures:
  • Low communication overhead
  • Data locality, possibly achieved at the cost of additional per-thread processing time
  • Low post-processing overhead

36
Code, benchmarks, and papers available at http://netsyslab.ece.ubc.ca