Mars: A MapReduce Framework on Graphics Processors

1
Mars: A MapReduce Framework on Graphics Processors
  • Bingsheng He¹, Wenbin Fang, Qiong Luo (Hong Kong Univ. of Sci. and Tech.)
  • Naga K. Govindaraju (Microsoft Corp.)
  • Tuyong Wang (Sina Corp.)

Presenter: Wenbin Fang
¹ Currently at Microsoft Research Asia
2
Overview
  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion

3
Overview
  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion

4
Graphics Processing Units (GPUs)
  • Massively multithreaded co-processors
  • Recent NVIDIA GPUs
  • 1 TFLOPS peak performance
  • Consist of many SIMD multiprocessors
  • Thread groups

5
Graphics Processing Units (Cont.)
  • High bandwidth device memory
  • >10x higher than CPU main memory's bandwidth.
  • High latency device memory
  • 200 clock cycles of latency
  • Latency hiding using large number of concurrent
    threads
  • Low context-switch overhead

6
GPGPU
  • Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
  • FFT [Moreland 03, Horn 06]
  • Matrix operations [Jiang 05]
  • Folding@home, SETI@home
  • Database applications
  • Basic operators [Govindaraju 04]
  • Sorting [Govindaraju 06]
  • Join [He 08]

7
GPU Programming
  • DirectX, OpenGL: graphics rendering pipeline
  • NVIDIA CUDA, ATI CAL: different programming models

8
  • Without worrying about hardware details
  • Make GPGPU programming much easier.
  • Harness the high parallelism and high
    computational ability of GPUs well.
  • MapReduce

9
The Original MapReduce
  • MapReduce is a parallel programming model for
    processing and generating large datasets,
    proposed by Google (OSDI '04).
  • It takes a set of records in the form of
    key/value pairs as input, and produces a set of
    output key/value pairs.

10
MapReduce Functions
  • Programmers specify two functions
  • map (in_key, in_value)
  • reduce (out_key, list(intermediate_value))
  • The MapReduce runtime takes care of
  • parallelization
  • fault tolerance
  • data distribution
  • load balancing
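
The split between the two user-specified functions and the runtime can be sketched in a few lines of Python. This is a minimal sequential illustration using word count as the example, not Google's or Mars's implementation; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the toy runtime does none of the parallelization, fault tolerance, data distribution, or load balancing listed above.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """User-defined map: emit (word, 1) for every word in one line of text."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """User-defined reduce: sum the counts collected for one word."""
    return out_key, sum(intermediate_values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy sequential runtime: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for key, value in map_fn(in_key, in_value):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Input records are (key, value) pairs: here (line number, line text).
records = [(0, "map reduce map"), (1, "reduce on gpu")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a real runtime takes over.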

11
MapReduce Workflow
From http://labs.google.com/papers/mapreduce.html
12
MapReduce outside Google
  • Hadoop (Apache project)
  • MapReduce on multicore CPUs -- Phoenix (HPCA '07,
    Ranger et al.)
  • MapReduce on Cell ('07, Kruijf et al.)
  • Merge (ASPLOS '08, Linderman et al.)
  • MapReduce on GPUs (STMCS '08, Catanzaro et al.)

13
Overview
  • Motivation
  • Design
  • Implementation
  • Evaluation
  • Conclusion

14
Software stack of Mars
  • Applications
  • Matrix Multiplication
  • String Match
  • Inverted Index
  • Similarity Score
  • Page View Count
  • Page View Rank

Mars: MapReduce on GPUs
NVIDIA CUDA
GPU Driver, OS
15
MapReduce on Multi-core CPU (Phoenix HPCA'07)
Input
Split
Map
Partition
Reduce
Merge
Output
16
Limitations on GPUs
  • Lack of dynamic memory allocation on GPUs
  • How to support variable-length data?
  • How to dynamically allocate the output buffer on
    GPUs?
  • Lack of lock support
  • How to synchronize to avoid write conflicts?

17
Data Structure for Mars
Supports variable-length records!
  • A record: <Key, Value, Index entry>

Keys: Key1, Key2, Key3
Values: Value1, Value2, Value3
Index entries: IndexEntry1, IndexEntry2, IndexEntry3

An index entry: <key size, key offset, val size,
val offset>
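
A rough Python sketch of this layout follows: all keys are packed into one buffer, all values into another, and a separate index array locates each variable-length record. The helper names `pack_records` and `get_record` are hypothetical, and Python bytes stand in for GPU device memory; only the index-entry fields come from the slide.

```python
def pack_records(pairs):
    """Pack (key, value) byte strings into a key buffer, a value buffer, and an index."""
    keys, vals, index = bytearray(), bytearray(), []
    for key, val in pairs:
        # Index entry: <key size, key offset, val size, val offset>
        index.append((len(key), len(keys), len(val), len(vals)))
        keys += key
        vals += val
    return bytes(keys), bytes(vals), index

def get_record(keys, vals, index, i):
    """Recover record i by slicing both buffers with its index entry."""
    key_size, key_off, val_size, val_off = index[i]
    return keys[key_off:key_off + key_size], vals[val_off:val_off + val_size]

# Three records whose keys and values all have different lengths.
pairs = [(b"a", b"apple"), (b"bb", b"kiwi"), (b"ccc", b"fig")]
keys, vals, index = pack_records(pairs)
print(index)
print(get_record(keys, vals, index, 1))
```

Because every record is reached through its fixed-size index entry, the key and value buffers can stay flat arrays even though the records differ in length.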
18
Lock-free scheme for result output
Basic idea: calculate each thread's offset into
the output buffer.
  • Histogram on key size, value size, and record
    count.
  • Prefix sum on key size, value size, and record
    count.
  • Allocate the output buffer in GPU memory.
  • Perform the computation.

19
Lock-free scheme example
Pick out the odd numbers from the array 1, 3, 2, 3,
4, 7, 9, 8. The map function acts as a filter that
keeps all odd numbers.
20
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 1: Histogram
Step 2: Prefix sum (5)
21
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 3: Allocate
22
Lock-free scheme example
Threads: T1, T2, T3, T4
Input: 1, 3, 2, 3, 4, 7, 9, 8
Step 4: Computation (using the prefix sum)
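
The four steps of the example can be traced with a small sequential Python sketch. It assumes each of the four threads handles two consecutive elements (the slides do not state the exact assignment), and an ordinary loop stands in for the GPU threads. Since every thread writes only inside its own precomputed range, no locks are needed.

```python
from itertools import accumulate

data = [1, 3, 2, 3, 4, 7, 9, 8]
NUM_THREADS = 4

# Assumption: thread t processes a contiguous chunk of two elements.
chunk = len(data) // NUM_THREADS
chunks = [data[t * chunk:(t + 1) * chunk] for t in range(NUM_THREADS)]

def is_odd(x):
    return x % 2 == 1

# Step 1: histogram -- each thread counts how many results it will emit.
counts = [sum(1 for x in c if is_odd(x)) for c in chunks]

# Step 2: exclusive prefix sum -- each thread's write offset in the output.
offsets = [0] + list(accumulate(counts))[:-1]
total = sum(counts)

# Step 3: allocate the output buffer once, with the exact size.
output = [None] * total

# Step 4: computation -- each thread writes only within its own range.
for t in range(NUM_THREADS):
    pos = offsets[t]
    for x in chunks[t]:
        if is_odd(x):
            output[pos] = x
            pos += 1

print(counts, offsets, total, output)
```

The total of 5 matches the "(5)" shown next to the prefix sum on the slide.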
23
Mars workflow
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Sort and Group
ReduceCount
Prefixsum
Allocate output buffer on GPU
Reduce
Output
24
Mars workflow - Map Only
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Output
Map only, without grouping and reduce
25
Mars workflow - Without Reduce
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Sort and Group
Output
Map and grouping, without reduce
26
APIs of Mars
  • User-defined
  • MapCount
  • Map
  • Compare (optional)
  • ReduceCount (optional)
  • Reduce (optional)
  • Runtime Provided
  • AddMapInput
  • MapReduce
  • EmitInterCount
  • EmitIntermediate
  • EmitCount (optional)
  • Emit (optional)
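
How the user-defined and runtime-provided functions fit together can be sketched in Python. This is a simplified sequential stand-in, not the real Mars API: the snake_case names mirror the slide's names, but the signatures are assumptions, the count functions report only record counts (not key and value sizes), and Python's default ordering replaces a user-defined Compare.

```python
from itertools import accumulate, groupby

class Counter:
    """Stand-in for EmitInterCount / EmitCount: records how much a thread will emit."""
    def __init__(self):
        self.records = 0
    def emit_count(self, n=1):
        self.records += n

class Writer:
    """Stand-in for EmitIntermediate / Emit: writes at a precomputed offset."""
    def __init__(self, buffer, offset):
        self.buffer, self.pos = buffer, offset
    def emit(self, key, value):
        self.buffer[self.pos] = (key, value)
        self.pos += 1

def two_pass(items, count_fn, emit_fn):
    """Count pass, prefix sum, allocation, then the real emit pass."""
    counters = [Counter() for _ in items]
    for item, counter in zip(items, counters):
        count_fn(*item, counter.emit_count)
    counts = [c.records for c in counters]
    offsets = [0] + list(accumulate(counts))[:-1]
    buffer = [None] * sum(counts)
    for item, offset in zip(items, offsets):
        emit_fn(*item, Writer(buffer, offset).emit)
    return buffer

def map_reduce(inputs, map_count, map_fn, reduce_count=None, reduce_fn=None):
    """Stand-in for MapReduce: the reduce stage is skipped when not supplied."""
    intermediate = two_pass(inputs, map_count, map_fn)
    if reduce_fn is None:
        return intermediate
    intermediate.sort(key=lambda kv: kv[0])  # Sort and Group
    groups = [(k, [v for _, v in g])
              for k, g in groupby(intermediate, key=lambda kv: kv[0])]
    return two_pass(groups, reduce_count, reduce_fn)

# User-defined functions for word count.
def map_count(key, line, emit_inter_count):
    emit_inter_count(len(line.split()))

def map_fn(key, line, emit_intermediate):
    for word in line.split():
        emit_intermediate(word, 1)

def reduce_count(word, ones, emit_count):
    emit_count(1)

def reduce_fn(word, ones, emit):
    emit(word, sum(ones))

inputs = [(0, "map reduce map"), (1, "reduce on gpu")]
result = map_reduce(inputs, map_count, map_fn, reduce_count, reduce_fn)
print(result)
```

Each Count function must predict exactly what its matching Map or Reduce function will emit; that is the price paid for allocating the buffers up front and writing without locks.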

27
Overview
  • Introduction
  • Design
  • Implementation
  • Evaluation
  • Conclusion

28
Mars-GPU
  • NVIDIA CUDA
  • Each map invocation or reduce invocation is a GPU
    thread.

Mars-CPU
  • Operating system's thread APIs
  • Each map invocation or reduce invocation is a CPU
    thread.

29
CUDA Features Used in Mars-GPU Implementation
Coalesced Access
Built-in vector types: int4, char4, float4
30
Overview
  • Motivation
  • Design
  • Implementation
  • Evaluation
  • Conclusion

31
Experimental Setup
  • Comparison
  • CPU: Phoenix, Mars-CPU (4 CPU threads)
  • GPU: Mars-GPU (number of GPU threads is application-dependent)

32
Applications
  • String Match (SM): Find the position of a string
    in a file.
  • S: 32MB, M: 64MB, L: 128MB
  • Inverted Index (II): Build an inverted index for
    links in HTML files.
  • S: 16MB, M: 32MB, L: 64MB
  • Matrix Multiplication (MM): Multiply two
    matrices.
  • S: 512x512, M: 1024x1024, L: 2048x2048

33
Applications (Cont.)
  • Similarity Score (SS): Compute the pair-wise
    similarity score for a set of documents.
  • S: 512x128, M: 1024x128, L: 2048x128
  • Page View Rank (PVR): Find the top-10 hot pages
    in the web log.
  • S: 32MB, M: 64MB, L: 96MB
  • Page View Count (PVC): Count the number of
    distinct page views from web logs.
  • S: 32MB, M: 64MB, L: 96MB

34
Effect of Coalesced Access
Coalesced access achieves a speedup of 1.2-2X
35
Effect of Built-In Data Types
Built-in data types achieve a speedup of up to 2
times
36
GPU accelerates computation in MapReduce
With large data sets
37
Mars-GPU vs. Phoenix on Quad-core CPU
The speedup is 1.5-16 times with various data
sizes
38
Mars-CPU vs. Phoenix
Mars-CPU is 1-5 times as fast as Phoenix
39
Overview
  • Motivation
  • Design
  • Implementation
  • Evaluation
  • Conclusion

40
Conclusion
  • MapReduce framework on GPUs
  • Ease of programming on GPUs
  • Promising performance
  • Want a Copy of Mars?
  • http://www.cse.ust.hk/gpuqp/Mars.html