1
The Hadoop Fair Scheduler
  • Matei Zaharia
  • Cloudera / Facebook / UC Berkeley

2
Outline
  • Motivation / Hadoop usage at Facebook
  • Fair scheduler basics
  • Configuring the fair scheduler
  • Future plans

3
Motivation
  • Provide short response times to small jobs in a
    shared Hadoop cluster
  • Improve utilization over private clusters / HOD

4
Hadoop Usage at Facebook
  • Data warehouse running Hive
  • 600 machines, 4800 cores, 2.4 PB disk
  • 3200 jobs per day
  • 50 engineers have used Hadoop

5
Facebook Data Pipeline

6
Facebook Job Types
  • Production jobs: load data, compute statistics,
    detect spam, etc.
  • Long experiments: machine learning, etc.
  • Small ad-hoc queries: Hive jobs, sampling

GOAL: Provide fast response times for small jobs
and guaranteed service levels for production jobs

7
Outline
  • Motivation / Hadoop usage at Facebook
  • Fair scheduler basics
  • Configuring the fair scheduler
  • Future plans

8
Fair Scheduler Basics
  • Group jobs into pools
  • Assign each pool a guaranteed minimum share
    (split up among its jobs)
  • Split excess capacity evenly between jobs
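
The three bullets above can be read as a two-step share computation. The following is a minimal sketch of that literal reading for one slot type, not the scheduler's actual code; the function name `job_shares` and the input layout are assumptions made for illustration.

```python
def job_shares(pools, total_slots):
    """Return {job: slots} for one slot type (map or reduce).

    pools maps a pool name to (min_share, list_of_job_names).
    Hypothetical layout, chosen only for this sketch.
    """
    shares = {}
    # Step 1: split each pool's guaranteed minimum equally among its jobs.
    for min_share, jobs in pools.values():
        for job in jobs:
            shares[job] = min_share / len(jobs)
    # Step 2: split whatever capacity is left evenly across all jobs.
    excess = max(total_slots - sum(shares.values()), 0)
    if shares:
        bonus = excess / len(shares)
        for job in shares:
            shares[job] += bonus
    return shares


# Two jobs in a pool with min share 10, one job in a pool with no min share.
example = job_shares(
    {"ads": (10, ["a1", "a2"]), "default": (0, ["d1"])},
    total_slots=16,
)
```

Here each "ads" job starts from 5 guaranteed slots, and the 6 leftover slots are split three ways.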

9
Pools
  • Determined from a configurable job property
      • Default (before 0.20): mapred.queue.name
      • At Facebook: user.name (one pool per user)
  • Unmarked jobs go into a default pool
  • Pools have properties:
      • Minimum map slots
      • Minimum reduce slots
      • Limit on number of running jobs
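
As a rough illustration of how a pool might be looked up from a job property, here is a minimal sketch. The function name `pool_for_job`, the dict-based job configuration, and the literal pool name "default" are assumptions for this example, not the scheduler's API.

```python
DEFAULT_POOL = "default"  # assumed name for the pool that holds unmarked jobs


def pool_for_job(jobconf, pool_name_property="user.name"):
    """Pick a job's pool from one configurable jobconf property.

    jobconf is a plain dict standing in for the job's configuration.
    """
    name = (jobconf.get(pool_name_property) or "").strip()
    # Jobs that do not set the property fall into the default pool.
    return name or DEFAULT_POOL
```

With `user.name` as the property, each user gets a separate pool; passing a made-up property such as `project` groups jobs by project instead.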

10
Scheduling Algorithm
  • Divide each pool's min share among its jobs
  • Divide excess capacity among all jobs
  • When a slot needs to be assigned:
      • If there is any job below its min share,
        schedule it
      • Else schedule the job that we've been most
        unfair to (based on deficit)
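
The slot-assignment rule above can be sketched as follows. This is a simplified illustration, not the scheduler's code: the `Job` fields and the tie-break among jobs below their min share are assumptions, since the slide does not say how such jobs are ordered.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    running: int      # tasks currently running
    min_share: float  # slots guaranteed to this job
    deficit: float    # accumulated unfairness; larger means more starved


def pick_job(jobs):
    """Choose which job receives the next free slot, or None if no jobs."""
    if not jobs:
        return None
    below_min = [j for j in jobs if j.running < j.min_share]
    if below_min:
        # Assumed tie-break: serve the job furthest below its min share.
        return max(below_min, key=lambda j: j.min_share - j.running)
    # Everyone has their minimum: serve the job with the largest deficit.
    return max(jobs, key=lambda j: j.deficit)
```

A job below its min share wins even when another job has a much larger deficit, which is what gives production pools their guarantee.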

Note: Fair Scheduler versions from Hadoop 0.20 on will
share equally between pools, not jobs; a patch is
available at
https://issues.apache.org/jira/browse/HADOOP-4789

11-15
Scheduler Dashboard (five screenshot slides; images not included in the transcript)

16
Additional Features
  • Job weights for unequal sharing
      • Based on priority (each level is 2x more)
      • Based on size
        (mapred.fairscheduler.sizebasedweight)
  • Limits on number of running jobs
      • Per user
      • Per pool
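
To make "each level is 2x more" concrete, here is a small sketch of priority-based weights. The level names follow Hadoop's job priorities; treating NORMAL as the 1.0 baseline and the function name `priority_weight` are assumptions for this example.

```python
# Job priority levels, lowest to highest.
PRIORITY_LEVELS = ["VERY_LOW", "LOW", "NORMAL", "HIGH", "VERY_HIGH"]


def priority_weight(priority):
    """Weight doubles with each priority level; NORMAL is assumed to be 1.0."""
    steps_above_normal = (
        PRIORITY_LEVELS.index(priority) - PRIORITY_LEVELS.index("NORMAL")
    )
    return 2.0 ** steps_above_normal
```

Under this reading, a HIGH job is entitled to twice the share of a NORMAL job, and a VERY_LOW job to a quarter of it.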

17
Outline
  • Motivation / Hadoop usage at Facebook
  • Fair scheduler basics
  • Configuring the fair scheduler
  • Future plans

18
Installing the Scheduler
  • Compile it:
      • ant package
  • Place it on the classpath:
      • cp build/contrib/fairscheduler/*.jar lib
      • Alternatively, add the JAR to HADOOP_CLASSPATH
        in conf/hadoop-env.sh

19
Configuration Files
  • Hadoop config (conf/hadoop-site.xml)
      • Contains scheduler options, pointer to pools
        file
  • Pools file (pools.xml)
      • Contains min share allocations and limits on
        pools
      • Reloaded every 15 seconds to allow
        reconfiguring pools at runtime

20
Minimal hadoop-site.xml
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/path/to/pools.xml</value>
    </property>

21
Minimal pools.xml
    <?xml version="1.0"?>
    <allocations>
    </allocations>

22
Configuring a Pool
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
      </pool>
    </allocations>
  • Any pools not configured in pools.xml will have
    minMaps=0 and minReduces=0

23
Setting Running Job Limits
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <maxRunningJobs>3</maxRunningJobs>
      </pool>
      <user name="matei">
        <maxRunningJobs>1</maxRunningJobs>
      </user>
    </allocations>

24
Default Jobs Limit for Users
    <?xml version="1.0"?>
    <allocations>
      <pool name="ads">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <maxRunningJobs>3</maxRunningJobs>
      </pool>
      <user name="matei">
        <maxRunningJobs>1</maxRunningJobs>
      </user>
      <userMaxJobsDefault>10</userMaxJobsDefault>
    </allocations>
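
For checking a pools file before deploying it, a small reader can help. The sketch below parses the element names used on these slides with Python's standard `xml.etree.ElementTree`; the function name `parse_allocations` and the returned layout are assumptions for illustration and say nothing about how the scheduler itself loads the file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="matei">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
  <userMaxJobsDefault>10</userMaxJobsDefault>
</allocations>"""


def parse_allocations(xml_text):
    """Return (pools, user_limits, user_default) from pools.xml text."""
    root = ET.fromstring(xml_text)
    # Each <pool> becomes a dict of its integer settings, keyed by tag name.
    pools = {
        pool.get("name"): {child.tag: int(child.text) for child in pool}
        for pool in root.findall("pool")
    }
    # Each <user> carries a per-user running-job limit.
    user_limits = {
        user.get("name"): int(user.findtext("maxRunningJobs"))
        for user in root.findall("user")
    }
    default_text = root.findtext("userMaxJobsDefault")
    user_default = int(default_text) if default_text is not None else None
    return pools, user_limits, user_default


pools, user_limits, user_default = parse_allocations(SAMPLE)
```

An empty `<allocations>` element parses to no pools, no user limits, and no default, matching the minimal pools.xml.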

25
Other hadoop-site.xml Properties
  • mapred.fairscheduler.assignmultiple
      • Assign both a map and a reduce on each
        heartbeat; improves ramp-up speed and
        throughput
      • Recommendation: set to true

26
Other hadoop-site.xml Properties
  • mapred.fairscheduler.poolnameproperty
      • Which jobconf property to use to determine
        what pool a job is in
      • Default: mapred.queue.name (queue name)
      • Another useful option: user.name
      • Can also make up your own, e.g. project

27
Other hadoop-site.xml Properties
  • mapred.fairscheduler.weightadjuster
      • Allows modifying job weights through a plugin
        class; one useful example is provided: a
        new-job booster to let short jobs finish
        faster
      • Please see README for details
    <property>
      <name>mapred.fairscheduler.weightadjuster</name>
      <value>org.apache.hadoop.mapred.NewJobWeightBooster</value>
    </property>
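
The idea behind a new-job booster can be sketched in a few lines. This is an illustration of the concept only, not the NewJobWeightBooster class: the function name and the `factor` and `duration_seconds` defaults are hypothetical values chosen for the example, and the README remains the authority on the real settings.

```python
def boosted_weight(base_weight, job_age_seconds,
                   factor=3.0, duration_seconds=300.0):
    """Raise a job's weight while it is young so short jobs finish faster.

    factor and duration_seconds are hypothetical illustration values.
    """
    if job_age_seconds < duration_seconds:
        return base_weight * factor
    # Once the job is older than the boost window, its weight is unchanged.
    return base_weight
```

Because the boost expires, a long job gains only a brief head start, while a job that completes inside the window runs at the higher weight for its whole life.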

28
Outline
  • Motivation / Hadoop usage at Facebook
  • Fair scheduler basics
  • Configuring the fair scheduler
  • Future plans

29
Future Plans
  • Share equally between pools, not jobs (Hadoop
    0.20 release, HADOOP-4789)
  • Preemption if a job is starved of its min or fair
    share for some timeout (HADOOP-4665)
  • Locality wait optimization (HADOOP-4667)

30
Future Plans
  • Simpler scheduling model (HADOOP-4803)
  • FIFO pools (HADOOP-4803, HADOOP-5186)
  • Delayed job initialization (HADOOP-5186)
  • Scalability and operational improvements

31
Thanks!
  • The Fair Scheduler is available in Hadoop 0.19;
    docs in src/contrib/fairscheduler/README
  • Hadoop 0.17 and 0.18 versions at
    http://issues.apache.org/jira/browse/HADOOP-3746

matei_at_cloudera.com