Title: Pig, Making Hadoop Easy
1. Pig, Making Hadoop Easy
2. Who Am I?
- Pig committer
- Hadoop PMC member
- An architect on the Yahoo! grid team
- Or, as one coworker put it, the lipstick on the Pig
3. Who are you?
4. Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
- Load Users
- Load Pages
- Filter by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5
5. In Map Reduce
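The deck contrasts the pipeline above with the equivalent hand-written MapReduce code, which is not reproduced in this transcript. As a rough sketch of just the first of the several chained jobs such a pipeline needs (the class name, paths, and tab-delimited field layout are illustrative assumptions, not from the slides):

// Hypothetical sketch of the first of several chained MapReduce jobs the
// pipeline would need: filter users by age and tag records for the later join.
// Class name, paths, and the tab-delimited field layout are assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterUsersJob {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assume each line is "name<TAB>age".
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) {
                return;                        // skip malformed records
            }
            int age = Integer.parseInt(fields[1].trim());
            if (age >= 18 && age <= 25) {
                // Tag the record so a later join job can tell users from pages.
                context.write(new Text(fields[0]), new Text("U"));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "filter users 18-25");
        job.setJarByClass(FilterUsersJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);              // map-only filter step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("users"));
        FileOutputFormat.setOutputPath(job, new Path("filtered_users"));
        // Separate jobs would still be needed for the join, the group/count,
        // and the order-by/top-5 steps that Pig Latin expresses in a few lines.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}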
6. In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
7. Performance
[Chart: Pig performance across releases 0.1 through 0.7]
8. Why not SQL?
Data Collection
Data Factory (Pig)
  - Pipelines
  - Iterative processing
  - Research
Data Warehouse (Hive)
  - BI tools
  - Analysis
9. Pig Highlights
- User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM); see the sketch after this list
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user-contributed UDFs
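To make the first bullet concrete, here is a minimal sketch of a column-transformation UDF, assuming Pig's Java UDF API (org.apache.pig.EvalFunc); the class name and the null handling are illustrative:

// Minimal sketch of a column-transformation UDF using Pig's Java UDF API.
// The class name and the null handling are illustrative.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null on empty or null input so bad records pass through quietly.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

Once packaged in a jar, the function is made available to a script with REGISTER and called like a built-in, by its (package-qualified) class name.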
10. Who uses Pig for What?
- 70% of production jobs at Yahoo (tens of thousands per day)
- Also used by Twitter, LinkedIn, eBay, AOL, ...
- Used to
  - Process web logs
  - Build user behavior models
  - Process images
  - Build maps of the web
  - Do research on raw data sets
11. Accessing Pig
- Submit a script directly
- Grunt, the Pig shell
- PigServer Java class, a JDBC-like interface (see the sketch below)
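A minimal sketch of the PigServer route, assuming the org.apache.pig.PigServer API of that era; the input path and schema are illustrative:

// Sketch of embedding Pig via the PigServer class. The input path and schema
// are illustrative; ExecType.LOCAL can be used to run without a cluster.
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigServerExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery adds one Pig Latin statement to the logical plan.
        pig.registerQuery("Users = load 'users' as (name, age);");
        pig.registerQuery("Fltrd = filter Users by age >= 18 and age <= 25;");

        // Nothing executes until results are requested (or stored).
        Iterator<Tuple> it = pig.openIterator("Fltrd");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}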
12. Components
- Pig resides on the user machine
- The job executes on the Hadoop cluster
- No need to install anything extra on your Hadoop cluster.
13. How It Works
Pig Latin:
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
- pig.jar:
  - parses
  - checks
  - optimizes
  - plans execution
  - submits jar to Hadoop
  - monitors job progress
Execution plan:
  Map: Filter, Count
  Combine / Reduce: Sum
14. Demo
- s3://hadoopday/pig_tutorial
15. Upcoming Features
- In 0.8 (plan to branch end of August, release this fall)
  - Runtime statistics collection
  - UDFs in scripting languages (e.g. Python)
  - Ability to specify a custom partitioner (see the sketch after this list)
  - Adding many string and math functions as Pig supported UDFs
- Post 0.8
  - Adding branches, loops, functions, and modules
  - Usability
    - Better error messages
    - Fix ILLUSTRATE
    - Improved integration with workflow systems
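For context on the custom-partitioner item: in Hadoop a partitioner is simply a class that decides which reducer receives each key, and the 0.8 feature is about letting a Pig script point at such a class. A minimal sketch (the class name and Text key type are assumptions, and the exact Pig syntax for hooking it in is not shown here):

// Sketch of a custom Hadoop partitioner of the kind Pig 0.8 is expected to let
// scripts plug in. The class name and the Text key type are assumptions.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, Writable> {
    @Override
    public int getPartition(Text key, Writable value, int numPartitions) {
        // Route keys to reducers by their first character, e.g. to keep
        // related keys together on one reducer.
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}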
16. Learn More
- Read the online documentation: http://hadoop.apache.org/pig/
- Online tutorials
  - From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
  - From Cloudera: http://www.cloudera.com/hadoop-training
  - Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
- A couple of Hadoop books available that include chapters on Pig; search at your favorite bookstore
- Join the mailing lists
  - pig-user@hadoop.apache.org for user questions
  - pig-dev@hadoop.apache.org for developer issues
  - howldev@yahoogroups.com for Howl