Hadoop Jobs and Tasks


  • ReddyRaja

Brief Overview
  • Client Node (Client JVM): MapReduce Program
  • Job Tracker Node: Job Tracker
  • Task Tracker Node: Map Task or Reduce Task
  • Shared File System (HDFS)
  • Step 1: Run Job
  • Step 2: Get new Job ID
  • Step 3: Copy Resources
  • Step 4: Submit Job
  • Step 5: Initialize Job
  • Step 6: Retrieve Input Splits
  • Step 7: Heartbeat (returns task)
  • Step 8: Retrieve Job Resources
  • Step 9: Launch

Submit Job
  • Asks the Job Tracker for a new job ID
  • Checks the output specification of the job. If the
    output directory already exists, an error is thrown
    and the job is not submitted
  • Computes the input splits for the job. If the
    splits cannot be computed (for example, the input
    paths do not exist), an error is thrown and the job
    is not submitted
  • Copies the resources needed to run the job to the
    Job Tracker's file system, in a directory named
    after the job ID
  • Job JAR file. Copied with a high replication
    factor, 10 by default. Can be set with the
    mapred.submit.replication property
  • Configuration file
  • Computed input splits
  • Tells the JobTracker that the job is ready
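
The two pre-submission checks above can be modelled in a few lines. This is a minimal sketch in plain Python, not Hadoop code: the dictionary `fs` standing in for the file system, the `JobSubmissionError` class and both function names are hypothetical, and the split computation is deliberately simplified to fixed-size chunks.

```python
class JobSubmissionError(Exception):
    """Raised when a job is rejected before it reaches the job tracker."""


def compute_splits(file_sizes, split_size):
    """Return (path, offset, length) tuples covering every input file."""
    splits = []
    for path, size in sorted(file_sizes.items()):
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits


def check_and_split(fs, input_paths, output_dir, split_size):
    """fs is a hypothetical dict mapping existing paths to sizes in bytes."""
    # Check 1: refuse to overwrite an existing output directory.
    if output_dir in fs:
        raise JobSubmissionError("output directory already exists: " + output_dir)
    # Check 2: splits cannot be computed if an input path is missing.
    missing = [p for p in input_paths if p not in fs]
    if missing:
        raise JobSubmissionError("input path does not exist: " + missing[0])
    return compute_splits({p: fs[p] for p in input_paths}, split_size)
```

Either failure stops the submission before any resources are copied, which matches the order of the bullets above.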

Job Initialization
  • Puts the job in an internal queue
  • The job scheduler picks it up and initializes it
  • Creates an object to represent the job being run
  • The object encapsulates the job's tasks
  • It also holds bookkeeping information to track the
    tasks' status and progress
  • Creates the list of tasks to run
  • Retrieves the input splits computed by the
    JobClient from the shared filesystem
  • Creates one map task for each split
  • Scheduler creates the reduce tasks
  • The number of reduce tasks is determined by the
    job configuration
  • A task ID is assigned to each task

Task Assignment
  • Task trackers send periodic heartbeats to the
    JobTracker
  • In a heartbeat, a task tracker indicates its
    readiness for a new task
  • Job Tracker then allocates it a task
  • Job Tracker communicates the task in the heartbeat
    return value
  • Choosing a task for a Task Tracker
  • Job Tracker must choose a task for the TaskTracker
  • Uses the scheduler to choose the job to take the
    task from
  • Several job scheduling algorithms exist. The
    default one is based on job priority

Task Assignment
  • Task trackers have a fixed number of slots for map
    tasks and for reduce tasks
  • For example, a task tracker may be able to run 2
    map and 2 reduce tasks simultaneously. The slot
    counts are configured per task tracker and are
    typically chosen based on its number of cores and
    amount of memory
  • Scheduler fills the map task slots before filling
    the reduce task slots
  • For map tasks, the Job Tracker takes into account
    the task tracker's network location and picks a
    task whose split is as close as possible to the
    task tracker
  • Ideal case is a task whose split resides on the
    task tracker's own node. This is called data-local
  • Rack-local: the split is on the same rack, but not
    on the same node
  • Some tasks are neither data-local nor rack-local.
    They retrieve their data from a different rack
  • Job counters track how many tasks were data-local,
    rack-local or non-local
  • For reduce tasks, the Job Tracker simply picks the
    next one in its list of yet-to-be-run reduce tasks,
    since there are no data locality considerations
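
The locality preference for map tasks can be illustrated with a small ranking function. This is a sketch under stated assumptions, not the scheduler's real code: `rack_of` (node to rack mapping), `pending` (task ID to the hosts holding that task's split) and both function names are made up for the example.

```python
def locality(tracker, split_hosts, rack_of):
    """Classify a split: 0 = data-local, 1 = rack-local, 2 = off-rack."""
    if tracker in split_hosts:
        return 0
    if rack_of[tracker] in {rack_of[host] for host in split_hosts}:
        return 1
    return 2


def pick_map_task(tracker, pending, rack_of):
    """Pick the pending map task whose split is closest to the tracker."""
    if not pending:
        return None
    # Ties on locality are broken by task ID to keep the choice deterministic.
    return min(pending, key=lambda task: (locality(tracker, pending[task], rack_of), task))
```

A reduce task would not go through this ranking at all, since the text notes that reduce tasks have no data locality to exploit.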

Task Execution
  • Task tracker has been assigned the task
  • Next step is to run the task
  • Localizes the job by copying the job JAR file from
    the shared file system to the task tracker's local
    disk. Also copies any other files the job needs
  • Creates a local working directory for the task and
    un-jars the contents of the JAR into it
  • Creates an instance of TaskRunner to run the task
  • TaskRunner launches a new JVM to run each task
  • This keeps bugs in user MapReduce code from
    bringing down the task tracker
  • Only the child JVM exits in case of a problem

Progress and Status Updates
  • MapReduce jobs are long-running jobs
  • User needs feedback from time to time on the
    progress of the job
  • A job and each of its tasks have a status, which
    includes
  • State: running, successfully completed, failed
  • Progress of maps and reduces
  • Values of the job's counters
  • Status message or description
  • A task estimates its progress based on the phase
    it is running in

Progress Reporting
  • Not 100% accurate
  • Nevertheless important, since it shows whether the
    job is still making progress
  • The following operations constitute progress
  • Reading an input record
  • Writing an output record
  • Setting the status description on a Reporter
  • Incrementing a counter
  • Calling the Reporter's progress() method
  • Tasks can also maintain counters
  • Framework built-in ones
  • User-defined ones

Progress Reporting .. continued
  • Framework support
  • If the progress flag is set, a status update needs
    to be sent to the task tracker
  • The flag is checked in a separate thread every 3
    seconds, and the TaskTracker is notified of the
    status
  • TaskTracker forwards the status to the JobTracker
    in its heartbeats, sent about every 5 seconds
  • The status of all the tasks run by the TaskTracker
    is sent in each heartbeat
  • Counter values are sent less frequently to avoid
    congestion
  • Job Tracker combines these status reports
  • This gives a global view of all the jobs and
    their constituent tasks and statuses
  • JobClient receives the status by polling the
    JobTracker every second
  • Client can also call getJobStatus to get the
    status information

Job Completion
  • Job Tracker receives notification that the last
    task of the job is complete
  • Changes the job status to successful
  • JobClient learns of the completion the next time
    it polls for status
  • Prints a message to the user and
  • Returns from the runJob method
  • JobTracker can also send an HTTP job notification
  • Can be configured by clients wishing to be
    notified via callbacks
  • Clients set the job.end.notification.url property
  • JobTracker cleans up its working state for the job
  • Also instructs the TaskTrackers to do the same

Task Failure
  • Causes
  • User code is buggy
  • Processes crash
  • Machines fail
  • Hadoop handles such failures and still lets the
    job complete

Task Failure .. continued
  • If user code throws a runtime exception, the child
    JVM reports the error back to the TaskTracker
    before exiting
  • Error is logged into the user logs
  • TaskTracker marks the task attempt as failed
  • Frees up the slot for another task
  • Hanging tasks
  • TaskTracker notices that it has not received a
    progress update for a while
  • Proceeds to mark the task as failed
  • Child JVM process is killed after the timeout
    period, which is normally 10 minutes
  • Timeout can be configured on a per-job basis
    (mapred.task.timeout, in milliseconds)
  • Setting a timeout of zero disables the timeout, so
    a hanging task never frees up its slot. Avoid this
  • Instead, have long-running tasks report progress
    periodically so that the progress flag is set

Task Failure .. continued
  • Task attempt failed
  • JobTracker is notified of the failure
  • JobTracker reschedules the execution of the task
  • Avoids rescheduling it on a TaskTracker where it
    has failed earlier
  • Tries 4 times by default before giving up
  • mapred.map.max.attempts for map tasks
  • mapred.reduce.max.attempts for reduce tasks
  • By default, if any task fails 4 times (or the
    configured maximum number of attempts), the whole
    job is marked as failed
  • Jobs that can tolerate some failed tasks can set
    the allowed percentage with
  • mapred.max.map.failures.percent
  • mapred.max.reduce.failures.percent
  • A task attempt can also be killed, for example
    when it is a speculative duplicate
  • Killed task attempts do not count towards the
    number of failed attempts
TaskTracker Failure
  • Symptoms
  • Fails to send heartbeats
  • Might have crashed or
  • Might be running slowly
  • JobTracker marks it as failed and removes it from
    the pool of TaskTrackers to schedule tasks on
  • Happens when no heartbeat is received for 10
    minutes
  • Interval is set by
    mapred.task.tracker.expiry.interval (in
    milliseconds)
  • JobTracker arranges for map tasks that completed
    successfully on that TaskTracker to be rerun on a
    different TaskTracker if they belong to incomplete
    jobs, since their output on the failed node may no
    longer be reachable by the reducers
  • Any tasks in progress are also rescheduled
  • JobTracker can also blacklist a TaskTracker
  • Happens if its task failure rate is significantly
    higher than the average failure rate on the cluster
  • Blacklisted TaskTrackers can be restarted to
    remove them from the JobTracker's blacklist
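
A rough model of the expiry check and of which tasks get rescheduled is shown below. It is an illustrative sketch only: the task dictionaries with `id`, `job`, `kind`, `state` and `tracker` keys are a hypothetical data layout, and timestamps are plain millisecond integers.

```python
def expired_trackers(last_heartbeat, now, expiry_interval_ms=10 * 60 * 1000):
    """Return trackers whose last heartbeat is older than the expiry interval."""
    return sorted(t for t, ts in last_heartbeat.items() if now - ts > expiry_interval_ms)


def tasks_to_reschedule(tasks, dead_tracker, incomplete_jobs):
    """Select tasks from a failed tracker that must run again elsewhere."""
    rerun = []
    for task in tasks:
        if task["tracker"] != dead_tracker or task["job"] not in incomplete_jobs:
            continue
        in_progress = task["state"] == "running"
        # Completed map output lived on the dead node's local disk.
        lost_map_output = task["kind"] == "map" and task["state"] == "succeeded"
        if in_progress or lost_map_output:
            rerun.append(task["id"])
    return rerun
```

Completed reduce tasks are left alone in this model because their output is already in the shared file system rather than on the failed node's local disk.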

Job Scheduling
  • Simple
  • Jobs run in the order of submission (FIFO)
  • Fair Scheduler
  • Capacity Scheduler

Shuffle and Sort
  • MapReduce framework guarantees that the input to
    every reducer is sorted by key
  • The process by which the system performs the sort
    and transfers the map outputs to the reducers as
    inputs is called the shuffle
  • Shuffle code base keeps changing as continuous
    improvements are made
  • Shuffle is the heart of MapReduce

Shuffle and Sort ..continued
  • Map side
  • Map output is written to a circular memory buffer
  • When the buffer reaches a threshold (80% full), a
    background thread starts spilling its contents to
    disk
  • Map outputs continue to be written to the buffer
    while the spill takes place
  • If the buffer fills up, the map blocks until the
    spill completes
  • Before writing to disk, the thread partitions the
    data by the reducer it has to go to
  • Within each partition, an in-memory sort by key is
    performed
  • If a combiner function is defined, it is run on
    the output of the sort
  • Several spill files may be created
  • Spills are merged into a single partitioned and
    sorted output file
  • Combiner may be run again before the output file
    is written
  • Data written to disk can be compressed
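
The buffer, spill and merge sequence on the map side can be imitated with in-memory lists. This is a simplified sketch, not the real shuffle code: the toy `partition_for` function, the tiny `buffer_capacity` measured in records rather than bytes, and sorting whole tuples (so values are compared too) are all assumptions made for the example. It also spills synchronously instead of in a background thread.

```python
import heapq


def partition_for(key, num_reducers):
    """Deterministic toy partitioner based on the key's bytes."""
    return sum(key.encode("utf-8")) % num_reducers


def map_side_sort(records, num_reducers, buffer_capacity=10, spill_fraction=0.8):
    """Buffer (key, value) records, spill sorted runs, then merge the runs."""
    threshold = int(buffer_capacity * spill_fraction)
    buffer, spills = [], []
    for key, value in records:
        buffer.append((partition_for(key, num_reducers), key, value))
        if len(buffer) >= threshold:
            spills.append(sorted(buffer))  # ordered by partition, then key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))  # final spill when the map finishes
    merged = list(heapq.merge(*spills))  # one partitioned, sorted output
    return spills, merged
```

Because every spill is already sorted by partition and key, the final step is a cheap k-way merge rather than a full re-sort.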

Shuffle and Sort ..continued
  • Reduce side
  • Needs the map output from several mappers
  • Copy phase copies the mappers' output to the
    reducer
  • A small number of copier threads fetch map
    outputs in parallel

Shuffle and Sort ..continued
  • How the reducers know where to get the map
    outputs from
  • Map tasks notify their TaskTracker when they
    complete
  • TaskTracker sends the update to the JobTracker
  • For a given job, the JobTracker knows the map
    outputs and the TaskTrackers they are available on
  • Reducer asks the JobTracker for this information
    periodically until it has retrieved them all
  • Task Trackers do not delete map outputs from disk
    as soon as a reducer has retrieved them
  • Reducer task may fail and need them again
  • They wait until told to delete them by the
    JobTracker, after the job has completed

Task Execution
  • Speculative Execution
  • Task JVM Reuse
  • Skipping Bad records
  • Task Execution environment
  • Counters
  • Sorting
  • Secondary Sort
  • Joins
  • Side data distribution

Speculative Execution
  • Tasks are run in parallel
  • A slow task can make the whole job take
    significantly longer
  • Out of a few thousand tasks, some are likely to be
    slow
  • Hadoop tries to detect slow-running tasks
  • Hadoop launches a backup (speculative) copy of a
    task when it detects that the task is running
    slower than expected
  • When one copy of a task completes successfully,
    the other copies are killed
  • It is an optimization technique. If a task is slow
    by design or because of a bug, it does not help
  • Can be turned on or off
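
One way to picture the detection of slow tasks is to compare each running task's progress with the average. The rule below is a hypothetical heuristic chosen for illustration (the `lag` margin and the function name are assumptions), not a statement of the exact rule Hadoop applies.

```python
def speculation_candidates(progress, already_speculated=(), lag=0.2):
    """Return running tasks whose progress trails the average by more than lag.

    progress maps task ID to the fraction complete, between 0.0 and 1.0.
    """
    running = {task: p for task, p in progress.items() if p < 1.0}
    if not running:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(
        task
        for task, p in running.items()
        if p < average - lag and task not in already_speculated
    )
```

Tasks that already have a backup copy are skipped, so at most one speculative duplicate per task is requested in this model.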

Task JVM Reuse
  • By default, Hadoop runs each task in its own JVM
  • When JVM reuse is enabled
  • Tasks that share a child JVM are run sequentially
  • Task Tracker can still run tasks in parallel, in
    separate JVMs
  • Tasks from different jobs are always run in
    different child JVMs
  • Controlled by mapred.job.reuse.jvm.num.tasks, the
    maximum number of tasks to run per JVM
  • -1 indicates no limit
Skipping Bad Records
  • Large data sets often contain corrupt records
  • They often have missing fields
  • In practice, the code should ignore these records
  • Bad records have to be handled in the Mapper or
    Reducer
  • Detect and ignore the records
  • TextInputFormat lets you set a maximum expected
    line length
  • Corrupted records often show up as very long lines

Task Execution Environment
  • Hadoop provides information about the execution
    environment to tasks
  • Several properties can be accessed from the job
    configuration
  • Task files
  • Multiple instances of the same task
  • Should not write into the same file
  • If a task fails and is retried, the old partial
    output file is still present
  • With speculative execution, two instances of the
    same task could write into the same file
  • Solution
  • Hadoop writes the output into a temporary
    directory specific to the task attempt
  • ${mapred.output.dir}/_temporary/${mapred.task.id}/
  • On successful completion, the file is moved to the
    output directory
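
The write-to-temporary-then-promote idea can be tried out on a local disk. The sketch below uses only the Python standard library and mimics the directory layout named above; the function names and attempt IDs are hypothetical, and a real cluster performs these steps on the shared file system rather than with `os` calls.

```python
import os
import tempfile


def attempt_dir(output_dir, attempt_id):
    """Directory private to one task attempt, under _temporary."""
    return os.path.join(output_dir, "_temporary", attempt_id)


def write_attempt_output(output_dir, attempt_id, filename, data):
    """Write task output where no other attempt can collide with it."""
    work = attempt_dir(output_dir, attempt_id)
    os.makedirs(work, exist_ok=True)
    with open(os.path.join(work, filename), "w") as handle:
        handle.write(data)


def commit_attempt(output_dir, attempt_id):
    """Promote a successful attempt's files into the job output directory."""
    work = attempt_dir(output_dir, attempt_id)
    for name in os.listdir(work):
        os.replace(os.path.join(work, name), os.path.join(output_dir, name))
    os.rmdir(work)
```

Two attempts of the same task can both write `part-00000` without interfering, and only the attempt that is committed becomes visible in the output directory.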

Counters
  • Counters are used to gather statistics about the
    job
  • Quality control (good vs bad records)
  • Application-level statistics
  • Problem diagnosis
  • Counters are easier to retrieve than log output
  • Built-in counters
  • Input records, bytes
  • Output records, bytes, etc.

User Defined Counters
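  • Counters are defined by a Java enum. The enum name
    is the group name
  • The enum's fields are the counter names
  • Dynamic counters are named with strings at run
    time instead of an enum
  • Readable counter names
  • Provided using a resource bundle
  • For example, display "Air Temperature Records"
    instead of a raw enum-based name such as
    Temperature.MISSING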

Retrieving Counters
  • Counters can be retrieved through the Java API
    (for example RunningJob.getCounters()) or with the
    hadoop job -counter command
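
As a rough model of how enum-based and dynamic counters are incremented and later read back, consider the plain-Python sketch below. It does not use or reproduce the Hadoop counters API: the `Counters` class, its methods and the `Temperature` enum values are stand-ins invented for illustration.

```python
from collections import Counter
from enum import Enum


class Temperature(Enum):
    """Enum name acts as the counter group, members as counter names."""
    MISSING = "Missing"
    MALFORMED = "Malformed"


class Counters:
    """Tiny stand-in for a job-wide counter store."""

    def __init__(self):
        self._values = Counter()

    def increment(self, counter, amount=1):
        if isinstance(counter, Enum):
            key = (type(counter).__name__, counter.name)
        else:
            key = counter  # dynamic counter given as a (group, name) tuple
        self._values[key] += amount

    def get(self, group, name):
        return self._values[(group, name)]
```

A mapper in this model would call `increment(Temperature.MISSING)` for each bad record, and the client would call `get("Temperature", "MISSING")` once the job has finished.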

Sorting
  • By default, keys are sorted before being sent to
    the reduce task
  • Sort order for keys is controlled by a comparator
  • Set with the mapred.output.key.comparator.class
    property
  • Otherwise, keys must implement WritableComparable
  • Partitioned MapFile lookup
  • If MapFileOutputFormat is used, lookups by key can
    be done on the output

Secondary Sort
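  • MapReduce sorts records by key
  • Values for a key are not sorted
  • Use the following strategy to get the values sorted
  • Use a composite key that includes the value portion
  • Key comparator orders by the composite key
  • Partitioner and grouping comparator consider only
    the natural key part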
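
The composite-key strategy can be demonstrated outside Hadoop with an ordinary sort followed by grouping. This is a minimal sketch, assuming records are (natural key, value) pairs; the function name is hypothetical and the two lambdas play the roles of the key comparator and the grouping comparator.

```python
from itertools import groupby


def secondary_sort(records):
    """Return {natural_key: values in ascending order} for (key, value) pairs."""
    # Build a composite key that carries the value alongside the natural key.
    composite = [((key, value), value) for key, value in records]
    # Key comparator: order by the full composite key.
    composite.sort(key=lambda pair: pair[0])
    # Grouping comparator: group by the natural key only.
    grouped = groupby(composite, key=lambda pair: pair[0][0])
    return {key: [value for _, value in group] for key, group in grouped}
```

For example, `secondary_sort([(1950, 22), (1949, 111), (1950, 0)])` groups by year and hands each group its values already in ascending order, which is what the reducer would see.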

Joins
  • MapReduce can perform joins between large data sets
  • Consider frameworks such as Pig, Hive or Cascading
    to achieve a join
  • Map-side joins
  • Use CompositeInputFormat
  • Allows the join to be performed before the data is
    passed to the map
  • Reduce-side joins
  • The join key is the map output key, so matching
    records meet in the same reducer
  • MultipleInputs
  • Use a different mapper per input source. Map
    output types must be the same
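
A reduce-side join with one mapper per input can be sketched in plain Python as below. The station and reading record shapes, the tag strings and all function names are assumptions made for the example; the dictionary stands in for the shuffle that brings equal keys together.

```python
from collections import defaultdict


def map_stations(record):
    """Mapper for the first input: emit (join key, tagged value)."""
    station_id, name = record
    return station_id, ("station", name)


def map_readings(record):
    """Mapper for the second input: same output shape as map_stations."""
    station_id, temperature = record
    return station_id, ("reading", temperature)


def reduce_side_join(stations, readings):
    """Group tagged values by join key, then pair them up per key."""
    shuffled = defaultdict(list)
    mapped = [map_stations(r) for r in stations] + [map_readings(r) for r in readings]
    for key, tagged in mapped:
        shuffled[key].append(tagged)
    joined = []
    for key in sorted(shuffled):
        names = [value for tag, value in shuffled[key] if tag == "station"]
        temps = [value for tag, value in shuffled[key] if tag == "reading"]
        joined.extend((key, name, temp) for name in names for temp in temps)
    return joined
```

Both mappers emit the same output shape, which is why the bullet above insists that the map output types match across inputs.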

Side data distribution
  • Extra read-only data needed by MapReduce jobs
  • The challenge is to make this data available to
    all map and reduce tasks
  • Cache side data in memory, for example in a static
    field
  • Use the job configuration
  • Override the configure() method to read the values
    in the task
  • To pass objects, use DefaultStringifier (uses
    Hadoop's serialization framework)
  • Do not use it to transfer more than about 1 KB of
    data

Side Data distribution - continued
  • Distributed cache
  • Copies files and archives once per job to the task
    nodes
  • Makes them available to the map and reduce
    functions
  • Use the -files and -archives options
  • Files can be local or in HDFS
  • Hadoop other args -files input/ncdc/metadata/st

Side Data: How It Works
  • When the job is launched, Hadoop copies the files
    specified by the -files options to the
    JobTracker's file system. Before a task runs, the
    TaskTracker copies them to a local disk, the cache
  • From the task's point of view, the files are just
    there
  • A reference count of the number of tasks using
    each file is maintained. When it reaches zero, the
    file is eligible for deletion
  • Files are deleted when the cache size exceeds 10
    GB, making way for other jobs
  • Files are localized under
    (mapred.local.dir)/taskTracker/archi dir on the
    task trackers
  • Applications can use the files as they are. Files
    are symbolically linked from the task's working
    directory
