Mars: A MapReduce Framework on Graphics Processors

1
Mars: A MapReduce Framework on Graphics Processors

Bingsheng He¹, Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.)
Naga K. Govindaraju (Microsoft Corp.)
Tuyong Wang (Sina Corp.)

Presenter: Wenbin Fang
¹ Currently at Microsoft Research Asia
2
Overview

Introduction
Design
Implementation
Evaluation
Conclusion

3
Overview

Introduction
Design
Implementation
Evaluation
Conclusion

4
Graphics Processing Units (GPUs)

Massively multithreaded co-processors
Recent NVIDIA GPUs
1 TFLOPS peak performance
Consist of many SIMD multiprocessors
Thread groups

5
Graphics Processing Units (Cont.)

High bandwidth device memory
>10x higher than the CPU main memory's bandwidth.
High latency device memory
200 clock cycles of latency
Latency hiding using a large number of concurrent threads
Low context-switch overhead

6
GPGPU

Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
FFT [Moreland 03, Horn 06]
Matrix operations [Jiang 05]
Folding@home, SETI@home
Database applications
Basic operators [Govindaraju 04]
Sorting [Govindaraju 06]
Join [He 08]

7
GPU Programming

DirectX, OpenGL

Graphics rendering pipeline

NVIDIA CUDA, ATI CAL

Different programming models.
8

MapReduce

Without worrying about hardware details:
Make GPGPU programming much easier.
Harness the high parallelism and computational power of GPUs.

9
The Original MapReduce

MapReduce is a parallel programming model for processing and generating large datasets, proposed by Google [OSDI '04].
It takes a set of records in the form of key/value pairs as input, and produces a set of output key/value pairs.

10
MapReduce Functions

Programmers specify two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
The MapReduce runtime takes care of:
parallelization
fault tolerance
data distribution
load balancing
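
As a rough illustration of the two user-specified functions, here is a minimal, sequential word-count sketch in Python. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for this example; a real runtime would also handle the parallelization, fault tolerance, data distribution, and load balancing listed above, none of which is modeled here.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Emit (word, 1) for every word in one line of text."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Sum the counts collected for one word."""
    return sum(intermediate_values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (line number, line text) records.
records = [(0, "gpu map reduce"), (1, "map reduce on gpu"), (2, "gpu")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a framework takes over.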

11
MapReduce Workflow
Figure from http://labs.google.com/papers/mapreduce.html
12
MapReduce outside Google

Hadoop [Apache project]
MapReduce on multicore CPUs: Phoenix [HPCA '07, Ranger et al.]
MapReduce on Cell ['07, Kruijf et al.]
Merge [ASPLOS '08, Linderman et al.]
MapReduce on GPUs [STMCS '08, Catanzaro et al.]

13
Overview

Motivation
Design
Implementation
Evaluation
Conclusion

14
Software stack of Mars

Applications
Matrix Multiplication
String Match
Inverted Index
Similarity Score
Page View Count
Page View Rank

Mars: MapReduce on GPUs
NVIDIA CUDA
GPU Driver, OS
15
MapReduce on Multi-core CPU (Phoenix [HPCA '07])
Input -> Split -> Map -> Partition -> Reduce -> Merge -> Output
16
Limitations on GPUs

Lack of dynamic memory allocation on GPUs
How to support variable-length data?
How to dynamically allocate the output buffer on GPUs?
Lack of lock support
How to synchronize to avoid write conflicts?

17
Data Structure for Mars
Supports variable-length records!

A record: <Key, Value, Index entry>

Keys: Key1, Key2, Key3
Values: Value1, Value2, Value3
Index entries: IndexEntry1, IndexEntry2, IndexEntry3

An index entry: <key size, key offset, val size, val offset>
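
The layout above can be sketched in Python as two flat byte buffers plus an index array. This is only an illustration of the idea: the `IndexEntry` fields follow the slide, while `pack_records`, `get_record`, and the sample data are hypothetical and not taken from the Mars source.

```python
from typing import NamedTuple

class IndexEntry(NamedTuple):
    key_size: int
    key_offset: int
    val_size: int
    val_offset: int

def pack_records(pairs):
    """Pack (key, value) byte strings into two flat buffers plus an index."""
    keys, vals, index = bytearray(), bytearray(), []
    for key, val in pairs:
        # Offsets are the current buffer lengths, taken before appending.
        index.append(IndexEntry(len(key), len(keys), len(val), len(vals)))
        keys += key
        vals += val
    return bytes(keys), bytes(vals), index

def get_record(keys, vals, entry):
    """Recover one (key, value) pair from the flat buffers."""
    key = keys[entry.key_offset:entry.key_offset + entry.key_size]
    val = vals[entry.val_offset:entry.val_offset + entry.val_size]
    return key, val

# Made-up records with keys and values of different lengths.
pairs = [(b"a.html", b"3"), (b"index.html", b"120"), (b"b", b"")]
keys, vals, index = pack_records(pairs)
unpacked = [get_record(keys, vals, e) for e in index]
```

Because every record is described by fixed-size index fields, records of any length can live in plain arrays, which suits a device without dynamic memory allocation.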
18
Lock-free scheme for result output
Basic idea: Calculate the offset for each thread on the output buffer.

Step 1: Histogram on key size, value size, and record count.
Step 2: Prefixsum on key size, value size, and record count.
Step 3: Allocate output buffer on GPU memory.
Step 4: Perform computing.
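
The first three steps might look as follows in a sequential Python sketch. The per-thread sizes are invented numbers, and `exclusive_prefix_sum` is a helper written for this example; on a GPU the prefix sum would itself be a parallel primitive.

```python
def exclusive_prefix_sum(counts):
    """Return (offsets, total), where offsets[i] is the sum of counts[:i]."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running

# Step 1 (histogram): what each of four threads will emit (made-up numbers).
key_sizes = [4, 0, 7, 2]      # bytes of keys per thread
val_sizes = [8, 0, 16, 8]     # bytes of values per thread
record_counts = [1, 0, 2, 1]  # records per thread

# Step 2 (prefix sum): each thread's private write offset in every buffer.
key_offsets, key_total = exclusive_prefix_sum(key_sizes)
val_offsets, val_total = exclusive_prefix_sum(val_sizes)
rec_offsets, rec_total = exclusive_prefix_sum(record_counts)

# Step 3 (allocate): buffers are sized exactly once, before any thread writes.
key_buffer = bytearray(key_total)
val_buffer = bytearray(val_total)
index_buffer = [None] * rec_total
```

Since the offset ranges of different threads never overlap, step 4 can write results without locks or atomic operations.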

19
Lock-free scheme example
Pick the odd numbers from the array 1, 3, 2, 3, 4, 7, 9, 8. The map function acts as a filter that keeps all odd numbers.
20
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 1: Histogram
Step 2: Prefixsum (5)
21
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 3: Allocate
22
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 4: Computation (using the Prefixsum result)
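
The four steps of this example can be traced with a short Python simulation. It assumes each of the four threads handles two consecutive elements, which the transcript does not state explicitly, and it runs the "threads" one after another in a loop.

```python
from itertools import accumulate

data = [1, 3, 2, 3, 4, 7, 9, 8]
num_threads = 4
chunk = len(data) // num_threads  # assumption: two consecutive elements per thread
slices = [data[t * chunk:(t + 1) * chunk] for t in range(num_threads)]

def is_odd(x):
    return x % 2 == 1

# Step 1: histogram, each thread counts how many results it will emit.
counts = [sum(1 for x in s if is_odd(x)) for s in slices]

# Step 2: exclusive prefix sum gives each thread its start offset.
inclusive = list(accumulate(counts))
offsets = [0] + inclusive[:-1]
total = inclusive[-1]

# Step 3: allocate the output buffer once, with the exact size.
output = [None] * total

# Step 4: computation, each thread writes only to its own slots.
for t in range(num_threads):
    pos = offsets[t]
    for x in slices[t]:
        if is_odd(x):
            output[pos] = x
            pos += 1

print(counts, offsets, total, output)
```

Under that assumption the counts are 2, 1, 1, 1, the total is 5 (matching the value shown on the slide), and the output keeps the input order.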
23
Mars workflow
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Sort and Group
ReduceCount
Prefixsum
Allocate output buffer on GPU
Reduce
Output
24
Mars workflow Map Only
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Output
Map only, without grouping and reduce
25
Mars workflow - Without Reduce
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Sort and Group
Output
Map and grouping, without reduce
26
APIs of Mars
User-defined
MapCount
Map
Compare (optional)
ReduceCount (optional)
Reduce (optional)

Runtime Provided
AddMapInput
MapReduce
EmitInterCount
EmitIntermediate
EmitCount (optional)
Emit (optional)
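
To show how the user-defined functions could pair with the runtime-provided emit calls, here is a toy, sequential Python mock. It is not the Mars API: `MiniMars` and the snake_case names are hypothetical stand-ins for the names on the slide, and the count functions report record counts only, whereas the scheme described earlier also tracks key and value sizes.

```python
class MiniMars:
    """Toy, sequential stand-in for the runtime (not the real Mars API)."""

    def __init__(self):
        self.inputs = []

    def add_map_input(self, key, value):
        self.inputs.append((key, value))

    def _two_pass(self, tasks, count_fn, emit_fn):
        # Pass 1: each task reports how many records it will emit.
        counts = []
        for task in tasks:
            reported = []
            count_fn(*task, reported.append)
            counts.append(sum(reported))
        # Prefix sum turns the counts into per-task write offsets.
        offsets, total = [], 0
        for c in counts:
            offsets.append(total)
            total += c
        # Allocate once, then pass 2 writes each result at its own offset.
        buffer = [None] * total
        for task, start in zip(tasks, offsets):
            cursor = [start]

            def emit(key, value):
                buffer[cursor[0]] = (key, value)
                cursor[0] += 1

            emit_fn(*task, emit)
        return buffer

    def map_reduce(self, map_count, map_fn, reduce_count, reduce_fn):
        inter = self._two_pass(self.inputs, map_count, map_fn)
        inter.sort(key=lambda kv: kv[0])  # sort and group by key
        groups = []
        for key, value in inter:
            if groups and groups[-1][0] == key:
                groups[-1][1].append(value)
            else:
                groups.append((key, [value]))
        return self._two_pass(groups, reduce_count, reduce_fn)

# User-defined functions for a word count (illustrative only).
def map_count(key, line, emit_inter_count):
    emit_inter_count(len(line.split()))

def map_fn(key, line, emit_intermediate):
    for word in line.split():
        emit_intermediate(word, 1)

def reduce_count(word, ones, emit_count):
    emit_count(1)

def reduce_fn(word, ones, emit):
    emit(word, sum(ones))

mars = MiniMars()
mars.add_map_input(0, "gpu map reduce")
mars.add_map_input(1, "map reduce gpu gpu")
result = mars.map_reduce(map_count, map_fn, reduce_count, reduce_fn)
```

Each count function must report exactly as many records as its matching function later emits; that contract is what lets the runtime allocate the buffer up front.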

27
Overview

Introduction
Design
Implementation
Evaluation
Conclusion

28
Mars-GPU

NVIDIA CUDA
Each map invocation or reduce invocation is a GPU thread.

Mars-CPU

Operating system's thread APIs
Each map invocation or reduce invocation is a CPU thread.

29
CUDA Features Used in Mars-GPU Implementation
Coalesced Access
Built-in vector types: int4, char4, float4
30
Overview

Motivation
Design
Implementation
Evaluation
Conclusion

31
Experimental Setup

Comparison
CPU: Phoenix, Mars-CPU (4 CPU threads)
GPU: Mars-GPU (number of GPU threads is application-dependent)

32
Applications

String Match (SM): Find the position of a string in a file.
S: 32MB, M: 64MB, L: 128MB
Inverted Index (II): Build an inverted index for links in HTML files.
S: 16MB, M: 32MB, L: 64MB
Matrix Multiplication (MM): Multiply two matrices.
S: 512x512, M: 1024x1024, L: 2048x2048

33
Applications (Cont.)

Similarity Score (SS): Compute the pair-wise similarity score for a set of documents.
S: 512x128, M: 1024x128, L: 2048x128
Page View Count (PVC): Count the number of distinct page views from web logs.
S: 32MB, M: 64MB, L: 96MB
Page View Rank (PVR): Find the top-10 hot pages in the web log.
S: 32MB, M: 64MB, L: 96MB

34
Effect of Coalesced Access
Coalesced access achieves a speedup of 1.2-2x
35
Effect of Built-In Data Types
Built-in data types achieve a speedup of up to 2 times
36
GPU accelerates computation in MapReduce
With large data sets
37
Mars-GPU vs. Phoenix on Quad-core CPU
The speedup is 1.5-16 times with various data sizes
38
Mars-CPU vs. Phoenix
Mars-CPU is 1-5 times as fast as Phoenix
39
Overview