Pig, Making Hadoop Easy - PowerPoint PPT Presentation

About This Presentation
Title:

Pig, Making Hadoop Easy

Slides: 17
Provided by: DeZyre


Transcript and Presenter's Notes


1
Pig, Making Hadoop Easy
  • Alan F. Gates
  • Yahoo!

2
Who Am I?
  • Pig committer
  • Hadoop PMC Member
  • An architect on the Yahoo! grid team
  • Or, as one coworker put it, the lipstick on the
    Pig

3
Who are you?
4
Motivation By Example
Suppose you have user data in one file,
website data in another, and you need to find the
top 5 most visited pages by users aged 18 - 25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
5
In Map Reduce
6
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
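The eight steps from the Motivation slide (load, filter, join, group, count, order, limit) can also be traced in plain Python on a handful of in-memory records. This is only a sketch of what the dataflow computes, not how Pig executes it. The sample records and the function name `top_pages` are made up for illustration, and user names are assumed to be unique.

```python
from collections import Counter

# Hypothetical sample data; the slides do not show any actual records.
users = [("alice", 22), ("bob", 31), ("carol", 18), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("dave", "a.com"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # Filter by age, with inclusive bounds to match "aged 18 - 25".
    young = {name for name, age in users if lo <= age <= hi}
    # Join on name: keep the url of every click made by a filtered user.
    joined = [url for user, url in pages if user in young]
    # Group on url and count clicks.
    clicks = Counter(joined)
    # Order by clicks descending (ties broken by url), then take the top n.
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

result = top_pages(users, pages)
print(result)
```

Each Python statement corresponds to one relation in the Pig Latin version, which is the point of the slide: the script reads as a sequence of named dataflow steps.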
7
Performance
(Chart; data points labeled by Pig release: 0.1, 0.2, 0.3, 0.4/0.5, 0.6/0.7)
8
Why not SQL?
Data Collection
Data Factory (Pig)
  • Pipelines
  • Iterative Processing
  • Research

Data Warehouse (Hive)
  • BI Tools
  • Analysis

9
Pig Highlights
  • User defined functions (UDFs) can be written for
    column transformation (TOUPPER), or aggregation
    (SUM)
  • UDFs can be written to take advantage of the
    combiner
  • Four join implementations built in: hash,
    fragment-replicate, merge, skewed
  • Multi-query: Pig will combine certain types of
    operations together in a single pipeline to
    reduce the number of times data is scanned
  • Order by provides total ordering across reducers
    in a balanced way
  • Writing load and store functions is easy once an
    InputFormat and OutputFormat exist
  • Piggybank, a collection of user contributed UDFs
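Of the four join implementations listed above, fragment-replicate is the easiest to picture: the small input is held in memory by every map task, and each task joins its own fragment of the large input against it, so no reduce phase is needed. The following is a minimal Python sketch of that idea, not Pig's implementation. The function name `replicated_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def replicated_join(large_fragment, small, large_key=0, small_key=0):
    """Join one fragment of the large input against an in-memory small input."""
    # Each "map task" builds its own hash index over the whole small input.
    index = defaultdict(list)
    for row in small:
        index[row[small_key]].append(row)
    # Stream the fragment and emit one joined row per match.
    for row in large_fragment:
        for match in index.get(row[large_key], []):
            yield row + match

# Hypothetical data: the large input arrives as two fragments (input splits).
fragments = [
    [("alice", "a.com"), ("bob", "b.com")],
    [("carol", "c.com"), ("alice", "d.com")],
]
small_users = [("alice", 22), ("carol", 18)]

joined = [row for frag in fragments for row in replicated_join(frag, small_users)]
print(joined)
```

The trade-off is memory: this only works when the replicated input fits in each task's memory, which is why it sits alongside the other three strategies rather than replacing them.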

10
Who uses Pig for What?
  • 70% of production jobs at Yahoo (tens of thousands per day)
  • Also used by Twitter, LinkedIn, eBay, AOL,
  • Used to:
      • Process web logs
      • Build user behavior models
      • Process images
      • Build maps of the web
      • Do research on raw data sets

11
Accessing Pig
  • Submit a script directly
  • Grunt, the pig shell
  • PigServer Java class, a JDBC-like interface

12
Components
  • User machine: Pig resides on the user machine
  • Hadoop cluster: the job executes on the cluster

No need to install anything extra on your Hadoop
cluster.
13
How It Works
Pig Latin
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
pig.jar:
  • parses
  • checks
  • optimizes
  • plans execution
  • submits jar to Hadoop
  • monitors job progress

Execution Plan
  • Map: Filter, Count
  • Combine/Reduce: Sum
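The execution plan on this slide splits the counting across phases: each map task filters its rows and produces partial counts, and the combine/reduce phase sums those partial counts per key. A rough Python simulation of that split is sketched below. The function names and the sample input splits are illustrative assumptions, not part of Pig.

```python
from collections import Counter

def map_task(rows):
    """Filter rows with x > 0 and count them per x within one input split."""
    partial = Counter()
    for x, y, z in rows:
        if x > 0:
            partial[x] += 1
    return partial

def combine_or_reduce(partials):
    """Sum the partial counts per key; the same logic serves combiner and reducer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing
    return dict(total)

# Hypothetical input splits, one per map task.
splits = [
    [(1, "a", "b"), (-1, "c", "d"), (2, "e", "f")],
    [(1, "g", "h"), (1, "i", "j"), (0, "k", "l")],
]
counts = combine_or_reduce(map_task(s) for s in splits)
print(counts)
```

Because summing partial counts is associative, the same step can run early as a combiner on each node and again in the reducer, which reduces the data shuffled across the network.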
14
Demo
  • s3://hadoopday/pig_tutorial

15
Upcoming Features
  • In 0.8 (plan to branch end of August, release
    this fall)
      • Runtime statistics collection
      • UDFs in scripting languages (e.g. Python)
      • Ability to specify a custom partitioner
      • Adding many string and math functions as Pig
        supported UDFs
  • Post 0.8
      • Adding branches, loops, functions, and modules
      • Usability
          • Better error messages
          • Fix ILLUSTRATE
          • Improved integration with workflow systems

16
Learn More
  • Read the online documentation:
    http://hadoop.apache.org/pig/
  • Online tutorials
      • From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
      • From Cloudera: http://www.cloudera.com/hadoop-training
  • Using Pig on EC2:
    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
  • A couple of Hadoop books available that include
    chapters on Pig, search at your favorite
    bookstore
  • Join the mailing lists
      • pig-user@hadoop.apache.org for user questions
      • pig-dev@hadoop.apache.org for developer issues
      • howldev@yahoogroups.com for Howl