Title: An Overview of Map-Reduce Research
1 An Overview of Map-Reduce Research
2 Main Themes
- Designing Efficient Algorithms on Map-Reduce
- Extensions on Map-Reduce
- Modeling Map-Reduce Computation
3 Limitations
- Selective Access To Data
- High Communication Cost
- Redundant and Wasteful Processing
- Lack of Early Termination
- Lack of Iteration
- Quick Retrieval of Approximate Results
- Load Balancing
- Lack of Real-time and Interactive Processing
- Lack of Support for n-way Operations
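The load-balancing limitation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the CRC32-based `partition` function, the `reducer_loads` helper, and the synthetic key distribution are all hypothetical stand-ins for a hash partitioner, not the behaviour of any particular framework.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Assumed hash partitioner: CRC32 of the key, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(records, num_reducers):
    """Count how many (key, value) records each reduce task would receive."""
    loads = Counter()
    for key, _value in records:
        loads[partition(key, num_reducers)] += 1
    return [loads[i] for i in range(num_reducers)]


# Synthetic skewed input: one "hot" key carries 900 of the 1000 records.
records = [("hot", 1)] * 900 + [(f"k{i}", 1) for i in range(100)]
loads = reducer_loads(records, 4)
print(loads)
```

Because all records with the same key go to the same reduce task, the task that owns the hot key receives at least 900 records no matter how many reducers are added.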
4 Extensions On Map-Reduce
Each theme lists its techniques or focus areas first, then representative systems.
- Interactive Processing
  - Streaming, Pipelining, In-Memory Processing, Pre-computation
  - Dremel, Tenzing, BlinkDB, M3R, Shark
- Data Access
  - Indexing, Partitioning, Co-location, Data Layout
  - Co-Hadoop(), Hadoop, HAIL, LIAH, Llama, Cheetah
- Avoidance of Redundant Processing
  - Batch Processing of Queries, Result Materialization, Incremental Processing, Result Sharing
  - ReStore, InCoop, MRShare
- Processing n-way Operations
  - Spatial / Temporal Joins, Additional MR Phase, Redistribution of Keys, Record Duplication
  - Controlled-Replicate(), RCCIS()
- Iterative Processing
  - Looping, Caching, Pipelining, Recursion, Incremental Processing
  - HaLoop, ReDoop, InCoop
- Query Optimization
  - Parameter Tuning, Plan Refinement, Operator Reordering, Code Analysis, Data Flow Optimization
  - HadoopDB, Clydesdale, Starfish, AQUA, Adaptive-MR()
- Processing Industry Specific Data
  - Spatio-Temporal Data, Geo-Spatial Data, Agriculture / Oil and Gas / Energy
  - BLAST(), Spatial-Hadoop, Hadoop-GIS
- Fair Work Allocation
  - Batching, Sampling, Re-partitioning
  - Skew-Tune, Skew-Reduce, Themis
- Early Termination
  - Sorting, Sampling
  - EARL, RanKloud
() Contributed by IBM
5 Designing Efficient Algorithms on Map-Reduce
- Joins
- Multi-way Joins
- Similarity Joins
- Theta Joins
- Spatial Joins
- Interval Joins
- Entity Resolution
- Graph Algorithms
- Machine Learning
- Computational Geometry
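As a concrete instance of the first item, a two-way equi-join is commonly expressed as a repartition (reduce-side) join: map tasks tag each row with its relation and emit it under the join key, and each reduce task pairs up the rows that share a key. The single-process sketch below only simulates that flow; the function names (`map_phase`, `shuffle`, `reduce_join`, `repartition_join`) and the sample relations are illustrative assumptions, not taken from any system named in this deck.

```python
from collections import defaultdict


def map_phase(relation_name, rows, key_index):
    """Emit (join_key, (relation_tag, row)) for every row of one relation."""
    for row in rows:
        yield row[key_index], (relation_name, row)


def shuffle(pairs):
    """Group mapper output by key, as the framework's shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_rows):
    """Pair every R row with every S row that arrived under the same key."""
    left = [row for tag, row in tagged_rows if tag == "R"]
    right = [row for tag, row in tagged_rows if tag == "S"]
    for l_row in left:
        for r_row in right:
            yield l_row + r_row


def repartition_join(r_rows, s_rows, r_key, s_key):
    """Simulate one map-reduce round computing R join S on the given columns."""
    pairs = list(map_phase("R", r_rows, r_key)) + list(map_phase("S", s_rows, s_key))
    output = []
    for key, tagged in shuffle(pairs).items():
        output.extend(reduce_join(key, tagged))
    return output


R = [(1, "a"), (2, "b")]
S = [(2, "x"), (2, "y"), (3, "z")]
result = repartition_join(R, S, 0, 0)
print(sorted(result))
```

Every input row crosses the network exactly once in this scheme. Multi-way, theta, and similarity joins generally need some rows replicated to several reduce tasks, which is where the algorithm design questions arise.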
6 Modeling Computation on Map-Reduce
- Two main cost components
  - Time spent in communication from map tasks to reduce tasks
  - Time spent in computation as part of reduce tasks
- These two components involve a trade-off
- Given an analytics problem, the input data, and the number of reduce tasks:
  - What is the minimum communication cost that any map-reduce algorithm for that problem and input must incur?
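The trade-off can be made concrete with an all-pairs comparison, where every pair of inputs must meet at some reduce task. The sketch below assumes one simple grouping scheme: the n inputs are split into g groups and there is one reduce task per pair of groups. It uses the number of inputs a reduce task receives as a rough proxy for its computation time. The function name `all_pairs_costs` and the choice of scheme are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_costs(n, g):
    """Return (total records shuffled, largest reducer input) for n inputs in g groups."""
    assert g >= 2 and n % g == 0, "sketch assumes g >= 2 and g divides n"
    reducers = defaultdict(list)
    for x in range(n):
        group = x % g
        # Send input x to every reduce task whose group pair contains x's group.
        for a, b in combinations(range(g), 2):
            if group in (a, b):
                reducers[(a, b)].append(x)
    communication = sum(len(inputs) for inputs in reducers.values())
    max_reducer_input = max(len(inputs) for inputs in reducers.values())
    return communication, max_reducer_input


for g in (2, 3, 4, 6):
    print(g, all_pairs_costs(12, g))
```

Under this scheme each input is sent to g - 1 reduce tasks, so communication grows as n(g - 1) while each reduce task receives 2n/g inputs: more groups mean less work per reduce task but more data shuffled.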
7 Survey References
- A Survey of Large-Scale Analytical Query Processing in MapReduce
  - Christos Doulkeridis and Kjetil Nørvåg
  - In VLDB Journal, 23(3), 2014
- Distributed Data Management Using MapReduce
  - Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu
  - In ACM Computing Surveys, 46(3), 2014