Title: Improving MapReduce Performance in Large Virtualized Environments
1. Improving MapReduce Performance in Large Virtualized Environments
- Andy Konwinski, Matei Zaharia, Anthony Joseph, Randy Katz, Ion Stoica
- RAD Lab, June 2008
2. RAD Lab Overview
[Diagram: RAD Lab architecture. A compiler turns a high-level spec into a low-level spec; a Director applies new apps, equipment, and global policies (e.g., SLAs), driven by offered load, resource utilization, etc. from an Instrumentation Backplane; Log Mining turns training data into performance and cost models.]
4. Motivation and Objectives
- Motivation
  - MapReduce growing in popularity
  - Virtualized computing services like Amazon EC2 provide on-demand computing power
  - But MapReduce assumes a homogeneous environment
- Objectives
  - Study the impact of virtualization on Hadoop
  - Improve Hadoop performance on EC2
5. Methodology
- Measured EC2 performance using low-level and application benchmarks
- Identified two challenges
  - EC2 effects
  - Heterogeneity
- Designed a heterogeneity-aware scheduler that reduces response times by up to 2x
6. Outline
- Experience running Hadoop on EC2
  - Choosing EC2 instance types
  - Disk and network cold-start effects
- The challenge of heterogeneity
- LATE, a heterogeneity-aware scheduler
- Lessons about virtualized environments
7. Choosing EC2 Instance Types
- 3 VM types: small ($0.10/h, 1 disk), large ($0.40/h, 2 disks), xlarge ($0.80/h, 4 disks)
- Goal: maximize computing power per dollar
- Observation: small VMs offer the same CPU and RAM per dollar as large, but more network/disk bandwidth when there is no contention (worked check below)
- Environment chosen:
  - 100-900 small VMs for slaves
  - 1 large VM for the master
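As a rough worked check (disk counts and prices are from the slide; the compute and memory figures are assumptions about the 2008 instance specs, small = 1 ECU / 1.7 GB and large = 4 ECU / 7.5 GB): small buys 1 ECU and 1.7 GB per $0.10/h, i.e. 10 ECU and 17 GB per dollar-hour, while large buys 4 ECU and 7.5 GB per $0.40/h, i.e. the same 10 ECU and roughly 19 GB. Disks do not scale the same way: small gives 1 disk per $0.10 (10 disks per dollar-hour) versus large's 2 per $0.40 (5 per dollar-hour), which is where the extra uncontended I/O bandwidth per dollar comes from.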
8. Disk Cold Start
- First write to each disk location is slow (possibly due to allocation or security?)
- Recommendation: warm up nodes with dd (see the sketch below)
[Figure: bytes written vs. time (s)]
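A minimal warm-up sketch of the dd recommendation (the target path and size are assumptions; the layout of an instance's ephemeral disks varies):

    import os
    import subprocess

    # Hypothetical file on the instance's ephemeral disk; adjust to wherever
    # the disks Hadoop will use are mounted.
    WARMUP_FILE = "/mnt/warmup"

    # Write a large file once so the first-write penalty is paid before the
    # real job runs; a full warm-up would touch every block the job will use.
    subprocess.check_call([
        "dd", "if=/dev/zero", "of=" + WARMUP_FILE,
        "bs=1M", "count=10240",  # ~10 GB; the size is a judgment call
    ])
    os.remove(WARMUP_FILE)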
9. Network Cold Start
- The first time two machines communicate incurs a 2-5 s delay (maybe firewall setup?)
- Effect on Hadoop: sorting 200 GB on 200 nodes takes 20 min the first time, 4 min afterwards
- Also makes cluster setup slow
- Recommendation: warm up the network with a short MapReduce job; automate setup (sketch below)
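The slide's recommendation is a short MapReduce job; an equivalent minimal sketch of the idea is to force every node pair to exchange one packet before the real job starts, so the 2-5 s first-contact delay is paid up front (hostnames and port are assumptions):

    import socket

    HOSTS = ["node%03d" % i for i in range(200)]  # hypothetical slave hostnames
    PORT = 50060  # TaskTracker HTTP port in 0.x-era Hadoop; any open port works

    def touch(host, port, timeout=10.0):
        """Open and close one TCP connection to pay the first-contact cost."""
        try:
            socket.create_connection((host, port), timeout).close()
        except socket.error:
            pass  # warm-up is best-effort

    # Run this on every node (e.g., over ssh) so all node pairs talk once.
    for host in HOSTS:
        touch(host, PORT)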
10. Heterogeneity in Virtualized Environments
- Virtual machine technology isolates CPU and memory effectively
- However, disk and network are shared
  - Full bandwidth when there is no contention
  - Fair sharing if there is contention
11. Disk I/O Heterogeneity on EC2
- Result: I/O bandwidth per VM varies by 2.5x
- Same effect seen at the application level
[Figure: effect of contention on I/O performance on EC2]
12. Speculative Scheduling in MapReduce
- Hadoop assumes nodes have similar performance and launches backup copies of slow tasks to speed up response times
- This is critical for short interactive jobs, which are a large fraction of the workload
  - E.g., the average job at Google is 395 s
- Problem: how to select backup tasks in a heterogeneous environment?
13. Hadoop's Scheduler
- Hadoop's scheduler uses a fixed threshold: create backup copies of all tasks with progress < averageProgress - 20% (sketched below)
- Problems
  - Too many tasks may be backed up, thrashing shared resources like the network
  - The wrong tasks may be backed up if they are only slightly slow
  - Tasks are never backed up if progress > 80%
  - Tasks may be restarted on slow nodes
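A minimal sketch of the fixed-threshold rule and why the 80% problem falls out of it (names are illustrative, not Hadoop's actual code):

    # Progress scores are in [0, 1]; Hadoop backs up any task more than
    # 20% below the average progress of its category.
    THRESHOLD = 0.2

    def needs_backup(task_progress, average_progress):
        """Hadoop's rule: progress < averageProgress - 20%."""
        return task_progress < average_progress - THRESHOLD

    # Why tasks past 80% are never backed up: average_progress <= 1.0, so
    # average_progress - 0.2 <= 0.8, and a task with progress > 0.8 can
    # never fall below the threshold no matter how slowly it is running.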
14. Example: 900-Node Run
- 80% of reduce tasks had backups started
- 1.3 hours to sort 100 GB
15. New Scheduler: LATE
- Longest Approximate Time to End
- Back up the task with the largest estimated finish time
- Cap backup tasks at 10%
- Only launch backups on fast nodes
- Only back up tasks that are slow enough
16. LATE Details
- Estimating finish time: progress rate = progressScore / executionTime, so estimated time left = (1 - progressScore) / progressRate (see the sketch below)
- Thresholds
  - 25th percentiles work well for the slow-node and slow-task cutoffs
  - The backup task cap can be tuned to trade off throughput vs. response time (currently 10%)
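A minimal sketch of the LATE heuristic under the details above (the record type and names are illustrative; the 10% cap and 25th-percentile cutoff follow the slide):

    from collections import namedtuple

    # Illustrative task record: progress score in [0, 1], seconds running.
    Task = namedtuple("Task", ["progress", "elapsed"])

    def time_left(t):
        """LATE's estimate: (1 - progressScore) / progressRate."""
        if t.progress == 0:
            return float("inf")           # no progress yet: worst case
        rate = t.progress / t.elapsed     # progress per second so far
        return (1.0 - t.progress) / rate  # estimated seconds remaining

    def pick_backup(tasks, slow_task_cutoff, backups_running, total_slots):
        """Back up the running task expected to finish farthest in the future."""
        if backups_running >= 0.10 * total_slots:  # cap backups at 10% of slots
            return None
        # Only tasks below the 25th-percentile progress-rate cutoff (computed
        # by the caller) are slow enough to be worth backing up.
        slow = [t for t in tasks
                if t.elapsed > 0 and t.progress / t.elapsed <= slow_task_cutoff]
        return max(slow, key=time_left) if slow else None

Per the "only launch backups on fast nodes" rule, the caller should also refuse to place the chosen backup on a node whose measured speed falls below the slow-node percentile.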
17. Evaluation
- 2 environments
  - 200 EC2 VMs, with 1-7 VMs per physical host, emulating the mix seen in production
  - A smaller, controlled testbed (RECC: RADLab Elastic Compute Cloud)
- 3 job types: Sort, Grep, WordCount
- 2 metrics: response time (primary) and throughput (which should not be degraded)
18. EC2 Sort with No Stragglers
- Average 27% gain over native, 31% gain over no speculation
19. EC2 Sort with Stragglers
- Average 58% gain over native, 220% gain over no speculation
- 93% max gain over native
20. EC2 Sort Throughput
- Jobs/second for three simultaneously submitted Sorts
- Average 5.1% gain over native, 18% gain over no speculation
21. EC2 Grep and WordCount
- Grep: average 36% gain over native, 57% gain over no speculation
- WordCount: average 8.5% gain over native, 179% gain over no speculation
22. RECC Sort with No Stragglers
- Average 162% gain over native, 104% gain over no speculation
- 261% max speedup over native
23. RECC Sort with Stragglers
- Average 53% gain over native, 121% gain over no speculation
- 100% max speedup over native
24. RECC WordCount
- Average 14% gain over native, 113% gain over no speculation
25. Summary of Results
[Table: summary of the gains reported on the preceding slides]
26. Lessons for Virtualized Environments
- Heterogeneity matters
  - CPU and memory isolation work well; disk and network isolation, not so much
- Ideally, the environment should minimize variance
  - Otherwise, modify the application to tolerate variance
- Document best practices and quirks
- Provide better visibility and monitoring tools
27. Conclusion
- Analyzed EC2 with a real application
- Identified heterogeneity as a challenge and addressed it in Hadoop
  - Up to 2x better response times through LATE
- Demonstrated X-Trace at 900 nodes
28. Future Work
- Integrate LATE into the Hadoop codebase
- Run LATE on more realistic jobs (Hadoop GridMix benchmark?)
- Scheduling among different jobs
- Scheduling tasks on the same node
29. Questions?
30. Improving MapReduce Performance in Large Virtualized Environments
- Andy Konwinski and Matei Zaharia
31. Tips for Hadoop on EC2
- Automate cluster setup
- Warm up the system with dd and a small job
- Choose instance type based on needs
  - Small instances usually have more I/O bandwidth per dollar
  - A large instance is necessary for the Hadoop master
  - EC2 has now added high-CPU instances too!
- Consider turning down DFS replication for a significant performance gain (config sketch below)
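For the DFS replication tip, the knob is HDFS's dfs.replication property; a minimal sketch for the 0.x-era hadoop-site.xml (the value 2 is an assumption for illustration: fewer replicas mean fewer copies written per block, at the cost of durability):

    <!-- hadoop-site.xml: lower HDFS replication from its default of 3 -->
    <property>
      <name>dfs.replication</name>
      <value>2</value> <!-- illustrative; 1 is faster still but riskier -->
    </property>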
32. Tracing Hadoop on EC2
[Diagram: the Hadoop master and slaves send trace data over TCP to an X-Trace backend, which stores it in a Derby DB and the filesystem; a trace-analysis web UI queries the backend and serves results to the user over HTTP]
33. Scaling up X-Trace
- X-Trace backend improved to use asynchronous I/O and a hybrid filesystem + RDBMS data store; still a single server
- Found the overhead of tracing to be negligible in Hadoop, since operations are large
- 1-2 MB of trace data generated per node per hour; compresses by a factor of 20