Title: Pig, Making Hadoop Easy
1. Pig, Making Hadoop Easy
2. Who Am I?
- Pig committer
- Hadoop PMC member
- An architect on the Yahoo! grid team
- Or, as one coworker put it, the lipstick on the Pig
3. Who are you?
4. Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
- Load Users
- Load Pages
- Filter by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5
5. In Map Reduce
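The deck contrasts the pipeline above with the equivalent hand-written MapReduce code, which is not reproduced in this transcript. As a rough sketch of just the first of the several chained jobs such a pipeline needs (the class name, paths, and tab-delimited field layout are illustrative assumptions, not from the slides):

// Hypothetical sketch of the first of several chained MapReduce jobs the
// pipeline would need: filter users by age and tag records for the later join.
// Class name, paths, and the tab-delimited field layout are assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilterUsersJob {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assume each line is "name<TAB>age".
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) {
                return;                        // skip malformed records
            }
            int age = Integer.parseInt(fields[1].trim());
            if (age >= 18 && age <= 25) {
                // Tag the record so a later join job can tell users from pages.
                context.write(new Text(fields[0]), new Text("U"));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "filter users 18-25");
        job.setJarByClass(FilterUsersJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);              // map-only filter step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("users"));
        FileOutputFormat.setOutputPath(job, new Path("filtered_users"));
        // Separate jobs would still be needed for the join, the group/count,
        // and the order-by/top-5 steps that Pig Latin expresses in a few lines.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}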
6. In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
7. Performance
[Chart: Pig performance across releases 0.1 through 0.7]
8. Why not SQL?
Data Collection
Data Factory (Pig)
  - Pipelines
  - Iterative processing
  - Research
Data Warehouse (Hive)
  - BI tools
  - Analysis
9. Pig Highlights
- User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM); see the sketch after this list
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user-contributed UDFs
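To make the first bullet concrete, here is a minimal sketch of a column-transformation UDF, assuming Pig's Java UDF API (org.apache.pig.EvalFunc); the class name and the null handling are illustrative:

// Minimal sketch of a column-transformation UDF using Pig's Java UDF API.
// The class name and the null handling are illustrative.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null on empty or null input so bad records pass through quietly.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}

Once packaged in a jar, the function is made available to a script with REGISTER and called like a built-in, by its (package-qualified) class name.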
10. Who uses Pig for What?
- 70% of production jobs at Yahoo (tens of thousands per day)
- Also used by Twitter, LinkedIn, eBay, AOL, ...
- Used to
  - Process web logs
  - Build user behavior models
  - Process images
  - Build maps of the web
  - Do research on raw data sets
11. Accessing Pig
- Submit a script directly
- Grunt, the Pig shell
- PigServer Java class, a JDBC-like interface (see the sketch below)
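A minimal sketch of the PigServer route, assuming the org.apache.pig.PigServer API of that era; the input path and schema are illustrative:

// Sketch of embedding Pig via the PigServer class. The input path and schema
// are illustrative; ExecType.LOCAL can be used to run without a cluster.
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigServerExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery adds one Pig Latin statement to the logical plan.
        pig.registerQuery("Users = load 'users' as (name, age);");
        pig.registerQuery("Fltrd = filter Users by age >= 18 and age <= 25;");

        // Nothing executes until results are requested (or stored).
        Iterator<Tuple> it = pig.openIterator("Fltrd");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}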
12. Components
- Pig resides on the user machine
- The job executes on the Hadoop cluster
- No need to install anything extra on your Hadoop cluster.
13. How It Works
Pig Latin:
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
- pig.jar:
  - parses
  - checks
  - optimizes
  - plans execution
  - submits jar to Hadoop
  - monitors job progress
Execution plan:
  Map: Filter, Count
  Combine / Reduce: Sum
14. Demo
- s3://hadoopday/pig_tutorial
15. Upcoming Features
- In 0.8 (plan to branch end of August, release this fall)
  - Runtime statistics collection
  - UDFs in scripting languages (e.g. Python)
  - Ability to specify a custom partitioner (see the sketch after this list)
  - Adding many string and math functions as Pig supported UDFs
- Post 0.8
  - Adding branches, loops, functions, and modules
  - Usability
    - Better error messages
    - Fix ILLUSTRATE
    - Improved integration with workflow systems
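For context on the custom-partitioner item: in Hadoop a partitioner is simply a class that decides which reducer receives each key, and the 0.8 feature is about letting a Pig script point at such a class. A minimal sketch (the class name and Text key type are assumptions, and the exact Pig syntax for hooking it in is not shown here):

// Sketch of a custom Hadoop partitioner of the kind Pig 0.8 is expected to let
// scripts plug in. The class name and the Text key type are assumptions.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, Writable> {
    @Override
    public int getPartition(Text key, Writable value, int numPartitions) {
        // Route keys to reducers by their first character, e.g. to keep
        // related keys together on one reducer.
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}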
16. Learn More
- Read the online documentation: http://hadoop.apache.org/pig/
- Online tutorials
  - From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
  - From Cloudera: http://www.cloudera.com/hadoop-training
  - Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
- A couple of Hadoop books available that include chapters on Pig; search at your favorite bookstore
- Join the mailing lists
  - pig-user@hadoop.apache.org for user questions
  - pig-dev@hadoop.apache.org for developer issues
  - howldev@yahoogroups.com for Howl