MapReduce: Simplified Data Processing on Large Clusters - PowerPoint PPT Presentation
1
MapReduce Simplified Data Processing on Large
Clusters
  • Jeffrey Dean
  • Sanjay Ghemawat
  • Presented By Brendan Melville

2
Why Am I Doing This?
3
Model
  • Two user-defined functions
  • Map
  • Input key/value pairs → intermediate output
    key/value pairs
  • Reduce
  • Intermediate key K and all intermediate values
    associated with K → output value(s)
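The two user-defined functions above can be sketched in Python, with word count as the example program (the `run_mapreduce` driver is a hypothetical, single-machine stand-in for the library):

```python
from collections import defaultdict

def map_fn(key, value):
    # Input key/value pair -> intermediate key/value pairs.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Intermediate key K and all values for K -> output value(s).
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: group intermediate pairs by key, then reduce
    # each key in sorted order (the real library distributes this).
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    output = {}
    for ik in sorted(intermediate):
        for ok, ov in reduce_fn(ik, intermediate[ik]):
            output[ok] = ov
    return output
```

For example, `run_mapreduce([("doc1", "a b a")], map_fn, reduce_fn)` yields `{"a": 2, "b": 1}`.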

4
Examples
  • Grep
  • Map emits a line if the pattern is matched
  • Reduce copies the data to output
  • Inverted Index
  • Map parses a document and emits <word, docId>
    pairs
  • Reduce takes all pairs for a given word, sorts
    the docId values, and emits a <word, list(docId)>
    pair
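The inverted-index example above can be sketched as follows (the tiny grouping driver and the document set are illustrative, not part of the library):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Parse a document and emit <word, docId> pairs
    # (deduplicated per document, a choice made for this sketch).
    for word in set(text.split()):
        yield word, doc_id

def reduce_fn(word, doc_ids):
    # All pairs for a word -> <word, sorted list of docIds>.
    yield word, sorted(doc_ids)

# Toy driver: group map output by word, then reduce each word.
docs = [("d1", "the quick fox"), ("d2", "the lazy dog")]
groups = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)
index = dict(kv for w in groups for kv in reduce_fn(w, groups[w]))
# index["the"] -> ["d1", "d2"]
```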

5
Execution
  • Input files are split into M pieces
  • Master and workers are assigned
  • Map tasks run
  • Intermediate data is written to local disk,
    partitioned into R regions
  • Intermediate data is read and sorted by key
  • Reduce tasks run
  • Control returns to the user program
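The partitioning step above can be sketched as follows: each intermediate pair is assigned to one of R regions by hashing its key, so reduce task r reads region r from every map worker. (R = 4, the stable CRC32 hash, and the sample pairs are illustrative choices for this sketch.)

```python
import zlib

R = 4  # number of reduce regions (illustrative value)

def partition(key, num_regions):
    # Stable hash of the key, modulo the number of regions;
    # every pair with the same key lands in the same region.
    return zlib.crc32(key.encode()) % num_regions

regions = [[] for _ in range(R)]
for key, value in [("apple", 1), ("banana", 1), ("apple", 1)]:
    regions[partition(key, R)].append((key, value))
```

Because all pairs for a given key fall in one region, a single reduce task sees every value for that key.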

6
Execution (diagram)
7
Fault Recovery
  • Workers are pinged by master periodically
  • Non-responsive workers are marked as failed
  • All tasks in-progress or completed by failed
    worker become eligible for rescheduling
  • Master could periodically checkpoint
  • Current implementations abort on master failure
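The rescheduling rules above can be sketched as follows (`task_state` and its fields are hypothetical names; note that in the paper, completed reduce tasks need not be re-run because their output is already in the global file system, while completed map output sits on the failed worker's now-unreachable local disk):

```python
def handle_failed_worker(failed, task_state):
    # Mark tasks on the failed worker as eligible for rescheduling:
    # anything in progress there, plus any map task it completed.
    for state in task_state.values():
        if state["worker"] != failed:
            continue
        if state["status"] == "in_progress" or state["kind"] == "map":
            state["status"] = "idle"
            state["worker"] = None

tasks = {
    "m1": {"kind": "map", "status": "completed", "worker": "w1"},
    "r1": {"kind": "reduce", "status": "in_progress", "worker": "w1"},
    "r2": {"kind": "reduce", "status": "completed", "worker": "w2"},
}
handle_failed_worker("w1", tasks)
# m1 and r1 become idle; r2 on the healthy worker is untouched.
```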

8
Execution Issues
  • Bandwidth
  • Scarce resource, must be conserved
  • GFS is used as the global file system
  • Master schedules tasks based on the location of
    input data
  • Straggler
  • A task that takes an unusually long time to
    complete
  • Backup copies of the remaining in-progress tasks
    are scheduled toward the end of the computation

9
Benefits
  • User code is smaller and easier to understand
  • Cluster issues (failures, network problems, slow
    machines) handled by library
  • Performance is good enough that conceptually
    unrelated computations can be kept separate

10
Hadoop
  • MapReduce and Distributed File-system framework
    for large commodity clusters
  • Master/Slave relationship (JobTracker /
    TaskTracker)
  • JobTracker handles all scheduling and data flow
    between TaskTrackers
  • TaskTracker handles all worker tasks on a node
  • Individual Task runs map or reduce operation
  • Integrates with HDFS for data locality

11
Hadoop
  • HDFS also Master/Slave (NameNode / DataNode)
  • Client in communication with both
  • Master handles replication, deletion, creation
  • Slave handles data retrieval
  • Files are stored as many blocks
  • Each block is identified by a block ID
  • Each block ID maps to several nodes
    (hostname:port), depending on the level of
    replication
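The metadata split above can be pictured with a toy model of the NameNode's tables (file paths, block IDs, node addresses, and the replication factor here are all illustrative, not HDFS internals):

```python
REPLICATION = 3  # illustrative replication level

# file path -> ordered list of block IDs
files = {"/logs/jan.log": ["blk_1", "blk_2"]}

# block ID -> hostname:port of each DataNode holding a replica
blocks = {
    "blk_1": ["dn1:50010", "dn2:50010", "dn3:50010"],
    "blk_2": ["dn2:50010", "dn3:50010", "dn4:50010"],
}

def locate(path):
    # The client asks the master for block locations, then reads
    # the bytes directly from the slaves (data retrieval).
    return [(blk, blocks[blk]) for blk in files[path]]
```

The master only serves this metadata; the client fetches block contents from the DataNodes themselves.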

12
Hadoop
  • Used by last.fm on 25 nodes
  • Chart calculation & web log analysis
  • Yahoo! on 900 nodes
  • 3M files (150 TB of data)
  • Sort benchmark on a 200-node cluster using 700
    reduce operations: 2010 GB in 6.6 hours (47 hours
    using non-MapReduce)

13
The End
  • Visit Hadoop at http://lucene.apache.org/hadoop/