An introduction to Apache Crunch - PowerPoint PPT Presentation

Description:

A short introduction to Apache Crunch: what it is, and how it simplifies and aids the creation of Hadoop pipelines.

Slides: 9
Provided by: semtechs

Transcript and Presenter's Notes


1
Apache Crunch
  • What is it?
  • How does it work?
  • Why use it?
  • Hadoop MapReduce pipelines
  • Scrunch
  • Joins

www.semtech-solutions.co.nz  info_at_semtech-solutions.co.nz
2
Apache Crunch Pipeline
  • Crunch is based on Google's FlumeJava
  • Provides a Java-based API for M/R pipelines
  • It uses an MST (multiple serializable type) data model
  • Good for processing complex data types
  • Better suited to non-tuple data types, e.g.
    • Images
    • Audio
    • Seismic data

3
Apache Crunch Pipeline
  • What is a MapReduce pipeline?
  • A set of stages:
    • Map
    • Shuffle
    • Reduce
    • Combine
  • Arranged in sequence and/or in parallel
  • Potentially very long chains
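
The stages listed on this slide can be sketched in plain Java. This is only an illustration of the data flow using standard collections, not the Crunch or Hadoop API; the class name MapReduceSketch and all of its method names are made up for this example. Note that in Hadoop the combine step runs on each mapper's output before the shuffle, which is the order used here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the map / combine / shuffle / reduce stages.
public class MapReduceSketch {

    // Map: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine: pre-aggregate a single mapper's output to shrink what is shuffled.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle: group the partial counts from every mapper by key.
    public static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return grouped;
    }

    // Reduce: collapse each key's list of values into one total.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    // Chain the stages in sequence; each input line plays the role of one mapper.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String line : lines) {
            partials.add(combine(map(line)));
        }
        return reduce(shuffle(partials));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to crunch")));
    }
}
```

A real pipeline chains many such jobs, with the output of one reduce feeding the next map, which is where the long chains mentioned above come from.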

4
Apache Crunch Scala
  • Scrunch is a Scala wrapper for Apache Crunch
  • Reduced code
  • Functional and OO styles
  • Uses type inference for Map / Reduce
  • Incorporates Java Materialize functionality
  • Includes a REPL (read-eval-print loop)

5
Apache Crunch Joins
  • Details of Joins available in Crunch
  • Inner / outer joins, as in SQL
  • Likewise left / right / full joins
  • A map-side join is an in-memory join
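
As a rough illustration of the join semantics above, here is a plain Java sketch over two small keyed data sets. It does not use the Crunch join API; JoinSketch and its methods are invented for this example, and it assumes each key appears at most once per side. Looking up each left-hand row in an in-memory map of the right-hand side is also the basic idea behind a map-side join: the smaller side has to fit in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of SQL-style join semantics on keyed data.
public class JoinSketch {

    // Format one joined row; a null side means "no matching row", as in SQL.
    static String row(String key, String left, String right) {
        return key + "=(" + left + ", " + right + ")";
    }

    // Inner join: keep only keys present on both sides.
    public static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            String r = right.get(l.getKey()); // in-memory lookup of the other side
            if (r != null) {
                out.add(row(l.getKey(), l.getValue(), r));
            }
        }
        return out;
    }

    // Left outer join: keep every left row, with null where the right side has no match.
    // A right outer join is the same call with the arguments swapped.
    public static List<String> leftJoin(Map<String, String> left, Map<String, String> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            out.add(row(l.getKey(), l.getValue(), right.get(l.getKey())));
        }
        return out;
    }

    // Full outer join: keep every key that appears on either side.
    public static List<String> fullJoin(Map<String, String> left, Map<String, String> right) {
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(row(key, left.get(key), right.get(key)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new TreeMap<>();
        users.put("u1", "Ann");
        users.put("u2", "Bob");
        Map<String, String> orders = new TreeMap<>();
        orders.put("u2", "book");
        orders.put("u3", "pen");

        System.out.println(innerJoin(users, orders));
        System.out.println(leftJoin(users, orders));
        System.out.println(fullJoin(users, orders));
    }
}
```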

6
Apache Crunch Performance
  • A lightweight API that runs efficiently
  • Crunch is a thin veneer on top of MapReduce
  • Two implementations available
    • Hadoop Writables
    • Avro
  • The Avro implementation is much faster

7
Apache Crunch API
  • Data Model
  • Pipeline
  • MRPipeline
  • MemPipeline
  • PCollection
  • PTable
  • PGroupedTable
  • Source
  • Target
  • Emitter
  • PType
  • Operators
  • DoFn
  • CombineFn
  • FilterFn
  • Joins
  • Cartesian
  • Sort
  • Secondary Sort
  • PObject
  • BloomFilters
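
To give a feel for how DoFn and Emitter fit together, here is a self-contained sketch of the pattern in plain Java. SketchDoFn, SketchEmitter and apply are hypothetical stand-ins written for this example, not the Crunch classes themselves; they only mirror the general shape in which a function processes one input element and emits zero or more outputs, which covers both transformation and filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, in-memory imitation of the DoFn / Emitter pattern.
public class DoFnSketch {

    // Stand-in for an emitter: receives zero or more outputs per input element.
    public interface SketchEmitter<T> {
        void emit(T value);
    }

    // Stand-in for a DoFn: processes one input and writes results to the emitter.
    public interface SketchDoFn<S, T> {
        void process(S input, SketchEmitter<T> emitter);
    }

    // Run a function over every element of a collection and gather what it emits.
    public static <S, T> List<T> apply(List<S> input, SketchDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S item : input) {
            fn.process(item, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        // One input line can emit many words (a one-to-many transformation).
        SketchDoFn<String, String> tokenize = (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        };
        // A filter is a function that emits its input only when a condition holds.
        SketchDoFn<String, String> longWords = (word, emitter) -> {
            if (word.length() > 3) {
                emitter.emit(word);
            }
        };

        List<String> words = apply(Arrays.asList("apache crunch", "is a thin veneer"), tokenize);
        System.out.println(apply(words, longWords));
    }
}
```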

8
Contact Us
  • Feel free to contact us at
    • www.semtech-solutions.co.nz
    • info_at_semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You pay only for the hours you need to solve your problems