Title: The Pig Latin Dataflow Language
1 The Pig Latin Dataflow Language
- A Brief Overview
- James Jolly
- University of Wisconsin-Madison
- jolly_at_cs.wisc.edu
2 What is Pig Latin?
- set-oriented data transformation language
  - primitives filter, combine, split, and order data
  - users describe transformations in steps
  - steps bundled into queries
  - each set transformation is stateless
- flexible data model
  - nested bags of tuples
  - semi-structured datatypes
- extensible
  - supports user-defined functions
3 How is it used in practice?
- useful for computations across large, distributed datasets
  - abstracts away details of execution framework
  - users can change order of steps to improve performance
- often used in tandem with Hadoop and HDFS
  - transformations converted to MapReduce dataflows
  - HDFS tracks where data is stored
  - operations scheduled near their data
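To make the point about reordering steps concrete, here is a minimal Python sketch (Python stands in for Pig Latin; the helper names `join` and `frequent` and the sample tuples are illustrative assumptions). It runs the same filter and join in two orders: both plans return the same tuples, but filtering first hands fewer tuples to the join.

```python
def join(left, right):
    """Nested-loop equijoin on the first field of each tuple."""
    return [l + r for l in left for r in right if l[0] == r[0]]

def frequent(t):
    """Keep tuples whose last field (the frequency) exceeds 0.002."""
    return t[-1] > 0.002

visits = [("www.cnn.com", "mike"), ("www.reuters.com", "drew")]
freqs = [
    ("www.cnn.com", "mccain", 0.031),
    ("www.cnn.com", "obama", 0.001),
    ("www.reuters.com", "obama", 0.005),
]

# Plan A: join first, then filter the larger joined bag.
joined_all = join(visits, freqs)
plan_a = [t for t in joined_all if frequent(t)]

# Plan B: filter first, so the join sees fewer tuples.
filtered = [t for t in freqs if frequent(t)]
plan_b = join(visits, filtered)
```

In this toy case Plan A joins three frequency tuples and Plan B only two; on large datasets the same reordering can shrink the intermediate data considerably.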
4 An example...
- Given two datasets:
  - a list of words and their frequency of appearance on webpages
  - a list of users and the webpages they visit
- Let's find words users might be interested in lately.
5 Dataset: words and their frequency of appearance...
- website word frequency date
- news.bbc.co.uk obama 0.010 20081005
- abcnews.go.com scheme 0.025 20081010
- abcnews.go.com bombing 0.021 20081006
- www.foxnews.com bush 0.001 20081006
- www.cnn.com mccain 0.031 20081017
- www.cnn.com obama 0.001 20081002
- www.reuters.com bush 0.012 20080921
- abcnews.go.com congress 0.002 20080927
- www.reuters.com bush 0.012 20080921
- www.foxnews.com bush 0.001 20081006
- www.latimes.com abortion 0.001 20081015
- www.latimes.com attack 0.010 20081015
- www.reuters.com obama 0.005 20080917
- www.foxnews.com economy 0.038 20081006
6 Dataset: webpages users visit...
- website user
- www.reuters.com bill
- news.bbc.co.uk mike
- www.cnn.com mike
- www.foxnews.com bill
- www.reuters.com drew
- www.latimes.com james
- abcnews.go.com james
7 Loading word frequency data...
- freqs = LOAD '/home/jolly/TestData/NewsWords.txt'
      USING PigStorage(',')
      AS (website_indexed, word, freq, date);
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (www.foxnews.com, bush, 0.001, 20081006)
- (www.cnn.com, mccain, 0.031, 20081017)
- (www.cnn.com, obama, 0.001, 20081002)
- (www.reuters.com, bush, 0.012, 20080921)
- (abcnews.go.com, congress, 0.002, 20080927)
- (www.reuters.com, bush, 0.012, 20080921)
- (www.foxnews.com, bush, 0.001, 20081006)
- (www.latimes.com, abortion, 0.001, 20081015)
- (www.latimes.com, attack, 0.010, 20081015)
- (www.reuters.com, obama, 0.005, 20080917)
- (www.foxnews.com, economy, 0.038, 20081006)
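As a rough picture of what the LOAD step does, the following Python sketch (not Pig itself) splits comma-delimited lines into tuples and attaches the field names from the AS clause. The `load` helper and the inline `sample_lines` are assumptions for illustration; in Pig the lines would come from the file named in the LOAD statement.

```python
# Field names taken from the AS (...) clause on the slide.
FIELDS = ("website_indexed", "word", "freq", "date")

def load(lines, delimiter=","):
    """Split each line on the delimiter and return one tuple per line."""
    return [tuple(line.split(delimiter)) for line in lines]

# Sample lines assumed to match the comma-delimited input file.
sample_lines = [
    "news.bbc.co.uk,obama,0.010,20081005",
    "www.reuters.com,bush,0.012,20080921",
    "www.reuters.com,bush,0.012,20080921",
]

freqs = load(sample_lines)
first = dict(zip(FIELDS, freqs[0]))  # name the fields of the first tuple
```

The fields stay as untyped strings here because the AS clause on the slide declares names but no types.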
8 Hmm, we have some repeats...
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (www.foxnews.com, bush, 0.001, 20081006)  <- repeated
- (www.cnn.com, mccain, 0.031, 20081017)
- (www.cnn.com, obama, 0.001, 20081002)
- (www.reuters.com, bush, 0.012, 20080921)  <- repeated
- (abcnews.go.com, congress, 0.002, 20080927)
- (www.reuters.com, bush, 0.012, 20080921)  <- repeated
- (www.foxnews.com, bush, 0.001, 20081006)  <- repeated
- (www.latimes.com, abortion, 0.001, 20081015)
- (www.latimes.com, attack, 0.010, 20081015)
- (www.reuters.com, obama, 0.005, 20080917)
- (www.foxnews.com, economy, 0.038, 20081006)
9 Duplicate data no more!
- distinct_freqs = DISTINCT freqs;
- (www.cnn.com, obama, 0.001, 20081002)
- (www.cnn.com, mccain, 0.031, 20081017)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (abcnews.go.com, congress, 0.002, 20080927)
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (www.foxnews.com, bush, 0.001, 20081006)
- (www.foxnews.com, economy, 0.038, 20081006)
- (www.latimes.com, attack, 0.010, 20081015)
- (www.latimes.com, abortion, 0.001, 20081015)
- (www.reuters.com, bush, 0.012, 20080921)
- (www.reuters.com, obama, 0.005, 20080917)
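The effect of DISTINCT can be sketched in a few lines of Python (an analogue only, not how Pig implements it). The `distinct` helper is a hypothetical name, and the sample bag is a subset of the tuples above.

```python
def distinct(bag):
    """Return the bag with duplicate tuples removed.

    Bags are unordered, so the sorted order here is only for repeatability.
    """
    return sorted(set(bag))

freqs = [
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.foxnews.com", "bush", 0.001, 20081006),
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.foxnews.com", "bush", 0.001, 20081006),
    ("www.cnn.com", "mccain", 0.031, 20081017),
]

distinct_freqs = distinct(freqs)
```

Two tuples count as duplicates only when every field matches, which is why whole tuples are compared rather than a single key field.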
10 Hmm, these tuples are old
- (www.cnn.com, obama, 0.001, 20081002)
- (www.cnn.com, mccain, 0.031, 20081017)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (abcnews.go.com, congress, 0.002, 20080927)  <- old
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (www.foxnews.com, bush, 0.001, 20081006)
- (www.foxnews.com, economy, 0.038, 20081006)
- (www.latimes.com, attack, 0.010, 20081015)
- (www.latimes.com, abortion, 0.001, 20081015)
- (www.reuters.com, bush, 0.012, 20080921)  <- old
- (www.reuters.com, obama, 0.005, 20080917)  <- old
11 ... and these tuples are not very significant.
- (www.cnn.com, obama, 0.001, 20081002)  <- low frequency
- (www.cnn.com, mccain, 0.031, 20081017)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (abcnews.go.com, congress, 0.002, 20080927)  <- low frequency
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (www.foxnews.com, bush, 0.001, 20081006)  <- low frequency
- (www.foxnews.com, economy, 0.038, 20081006)
- (www.latimes.com, attack, 0.010, 20081015)
- (www.latimes.com, abortion, 0.001, 20081015)  <- low frequency
- (www.reuters.com, bush, 0.012, 20080921)
- (www.reuters.com, obama, 0.005, 20080917)
12 Let's filter them out.
- important_freqs = FILTER distinct_freqs
      BY date > 20081001 AND freq > 0.002;
- (www.cnn.com, mccain, 0.031, 20081017)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (www.foxnews.com, economy, 0.038, 20081006)
- (www.latimes.com, attack, 0.010, 20081015)
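The filter condition can be checked by hand with a small Python analogue (a sketch, not Pig). The helper name `keep_important` is an assumption; the thresholds are the ones in the FILTER statement, and both comparisons are strict, so a frequency of exactly 0.002 is dropped.

```python
def keep_important(bag, min_date=20081001, min_freq=0.002):
    """Keep tuples that are both recent enough and frequent enough."""
    return [t for t in bag if t[3] > min_date and t[2] > min_freq]

distinct_freqs = [
    ("www.cnn.com", "obama", 0.001, 20081002),        # recent, but too rare
    ("www.cnn.com", "mccain", 0.031, 20081017),       # recent and frequent
    ("abcnews.go.com", "congress", 0.002, 20080927),  # too old, and not above 0.002
    ("www.reuters.com", "bush", 0.012, 20080921),     # frequent, but too old
]

important_freqs = keep_important(distinct_freqs)
```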
13 Hmm, we don't need these anymore...
- (www.cnn.com, mccain, 0.031, 20081017)
- (abcnews.go.com, scheme, 0.025, 20081010)
- (abcnews.go.com, bombing, 0.021, 20081006)
- (news.bbc.co.uk, obama, 0.010, 20081005)
- (www.foxnews.com, economy, 0.038, 20081006)
- (www.latimes.com, attack, 0.010, 20081015)
14 Let's project them out.
- websites_to_words = FOREACH important_freqs
      GENERATE website_indexed, word;
- (www.cnn.com, mccain)
- (abcnews.go.com, scheme)
- (abcnews.go.com, bombing)
- (news.bbc.co.uk, obama)
- (www.foxnews.com, economy)
- (www.latimes.com, attack)
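Projection with FOREACH ... GENERATE amounts to keeping a subset of each tuple's fields. A minimal Python sketch, assuming a hypothetical `project` helper and the field order from the LOAD statement:

```python
FIELDS = ("website_indexed", "word", "freq", "date")

def project(bag, keep):
    """Return tuples containing only the named fields, in the order given."""
    positions = [FIELDS.index(name) for name in keep]
    return [tuple(t[i] for i in positions) for t in bag]

important_freqs = [
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("abcnews.go.com", "scheme", 0.025, 20081010),
]

websites_to_words = project(important_freqs, ("website_indexed", "word"))
```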
15 Now we are ready to join our lists.
- Websites to Users
  - (news.bbc.co.uk, mike)
  - (www.cnn.com, mike)
  - (www.foxnews.com, bill)
  - (www.reuters.com, drew)
  - (www.latimes.com, james)
  - (abcnews.go.com, james)
- Websites to Words
  - (www.cnn.com, mccain)
  - (abcnews.go.com, scheme)
  - (abcnews.go.com, bombing)
  - (news.bbc.co.uk, obama)
  - (www.foxnews.com, economy)
  - (www.latimes.com, attack)
16 Joining on website: finding words interesting to users...
- users_to_words_equijoin = JOIN websites_to_users BY website_visited,
      websites_to_words BY website_indexed;
- users_to_words = FOREACH users_to_words_equijoin
      GENERATE user, word;
- (mike, mccain)
- (james, scheme)
- (james, bombing)
- (mike, obama)
- (bill, economy)
- (james, attack)
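One common way to evaluate an equijoin like this is a hash join: index one input by the join key, then probe the index with each tuple of the other input. The Python sketch below illustrates the idea on a subset of the data; it is not a description of how Pig executes JOIN, and `equijoin` is a hypothetical helper.

```python
from collections import defaultdict

def equijoin(left, right, left_key=0, right_key=0):
    """Pair every left tuple with every right tuple that shares its key."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]

websites_to_users = [
    ("news.bbc.co.uk", "mike"),
    ("www.cnn.com", "mike"),
    ("www.reuters.com", "drew"),
    ("abcnews.go.com", "james"),
]
websites_to_words = [
    ("www.cnn.com", "mccain"),
    ("abcnews.go.com", "scheme"),
    ("abcnews.go.com", "bombing"),
    ("news.bbc.co.uk", "obama"),
]

joined = equijoin(websites_to_users, websites_to_words)
# Each joined tuple is (website_visited, user, website_indexed, word).
users_to_words = [(user, word) for _, user, _, word in joined]
```

Note that drew produces no output: www.reuters.com has no words left after filtering, so an inner join drops that user.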
17 Let's group our results.
- interests = GROUP users_to_words BY user;
- (bill, {(bill, economy)})
- (mike, {(mike, mccain), (mike, obama)})
- (james, {(james, scheme), (james, bombing), (james, attack)})
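GROUP produces one output tuple per key, whose second field is a nested bag holding the complete input tuples for that key. A small Python analogue, with a hypothetical `group_by` helper and Python lists standing in for bags:

```python
from collections import defaultdict

def group_by(bag, key_index):
    """Return (key, nested_bag) pairs; each nested bag keeps whole tuples."""
    groups = defaultdict(list)
    for t in bag:
        groups[t[key_index]].append(t)
    return list(groups.items())

users_to_words = [
    ("mike", "mccain"),
    ("james", "scheme"),
    ("james", "bombing"),
    ("mike", "obama"),
    ("bill", "economy"),
    ("james", "attack"),
]

# A dict is used here only to make lookups by user convenient.
interests = dict(group_by(users_to_words, 0))
```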
18 How does it work?
- logic factored into MapReduce jobs
  - mapper processes run on machines with input tuples
  - input tuples processed using MAP( ) function, producing intermediate tuples
  - intermediate tuples grouped together, transferred to reducer nodes
  - reducer processes consume intermediate tuples with REDUCE( ) function
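The four steps above can be imitated in a single Python process. This is only a sketch of the data movement, assuming hypothetical helpers (`mapper`, `partition`, `run_job`) and an invented job that counts words per website; real mappers and reducers run on separate machines.

```python
import zlib
from collections import defaultdict

def mapper(split, map_fn):
    """Run MAP() over the input tuples stored on one simulated machine."""
    return [pair for item in split for pair in map_fn(item)]

def partition(key, num_reducers):
    """Choose the reducer node for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    # Shuffle: route each intermediate (key, value) pair to its reducer,
    # grouping values that share a key.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in mapper(split, map_fn):
            reducers[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer consumes its groups with REDUCE().
    results = {}
    for groups in reducers:
        for key, values in groups.items():
            results[key] = reduce_fn(key, values)
    return results

# Two input splits, as if stored on two machines.
splits = [
    [("www.cnn.com", "mccain"), ("abcnews.go.com", "scheme")],
    [("abcnews.go.com", "bombing")],
]
words_per_site = run_job(
    splits,
    map_fn=lambda row: [(row[0], 1)],
    reduce_fn=lambda key, values: sum(values),
)
```

Because all values for a key are routed to the same reducer, the result does not depend on how the input was split across mappers.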
19 Translating Pig Latin to MapReduce...
- transformed_by_map = FOREACH input_tuple
      GENERATE MAP();
- intermediate_tuple_partition = GROUP transformed_by_map
      BY input_tuple_key;
- result_tuples = FOREACH intermediate_tuple_partition
      GENERATE REDUCE();
- These statements can be executed using a single MapReduce job.
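The three statements map one-to-one onto the phases of a job. The Python sketch below mirrors them line by line; the bodies of `MAP` and `REDUCE` (emit a word with its frequency, keep the highest frequency per word) are invented for illustration and are not from the slides.

```python
from collections import defaultdict

def MAP(input_tuple):
    """Illustrative MAP(): turn (website, word, freq) into (word, freq)."""
    _, word, freq = input_tuple
    return (word, freq)

def REDUCE(key, values):
    """Illustrative REDUCE(): keep the highest frequency seen for a word."""
    return (key, max(values))

input_tuples = [
    ("news.bbc.co.uk", "obama", 0.010),
    ("www.reuters.com", "obama", 0.005),
    ("www.cnn.com", "mccain", 0.031),
]

# transformed_by_map = FOREACH input_tuple GENERATE MAP();
transformed_by_map = [MAP(t) for t in input_tuples]

# intermediate_tuple_partition = GROUP transformed_by_map BY input_tuple_key;
intermediate_tuple_partition = defaultdict(list)
for key, value in transformed_by_map:
    intermediate_tuple_partition[key].append(value)

# result_tuples = FOREACH intermediate_tuple_partition GENERATE REDUCE();
result_tuples = [REDUCE(k, v) for k, v in intermediate_tuple_partition.items()]
```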
20 Example message traffic...
21 Why Pig Latin? Why not a C library?
- We could just supply MAP( ) and REDUCE( ) to a C library...
- Pig Latin allows you to
  - describe long tasks
    - in a friendly scripting language
  - use many built-in datatypes
    - support for semi-structured data
  - use many built-in functions
    - filters, projections, joins, unions, splits, etc.
    - tends to make user-defined functions simpler
22 Why Pig Latin? Why not SQL?
- Pig Latin
  - is imperative
  - lets users manually tune query execution plan
  - doesn't need a schema
  - can easily read, write, and represent semi-structured data
23 Pig Latin really describes a generic dataflow.
inputs = LOAD 'input.txt';
results = FILTER inputs BY IsBoring(important_attribute);
STORE results INTO 'results.txt';
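The same load, filter, store shape can be written as a small Python pipeline. This is a sketch under stated assumptions: `is_boring` is a hypothetical stand-in for the user-defined function `IsBoring` (here, "boring" means an empty value), the first field plays the role of `important_attribute`, and in-memory streams replace the input and output files.

```python
import io

def is_boring(value):
    """Hypothetical user-defined predicate: an empty value counts as boring."""
    return value == ""

def run_dataflow(source, sink, predicate):
    """LOAD tuples from source, FILTER them by the predicate, STORE to sink."""
    for line in source:
        fields = line.rstrip("\n").split(",")
        # FILTER keeps the tuples for which the predicate is true.
        if predicate(fields[0]):
            sink.write(",".join(fields) + "\n")

source = io.StringIO(",a\nx,b\n,c\n")
sink = io.StringIO()
run_dataflow(source, sink, is_boring)
stored = sink.getvalue()
```

As in Pig Latin, FILTER keeps the tuples for which the predicate holds, so this dataflow stores the boring tuples rather than discarding them.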
24 Summary
- Pig Latin programs
  - typically operate on large volumes of unstructured data
  - describe a dataflow between primitive operations
    - many RDBMS-like operations built into the language
    - custom operations can be provided by the user
    - user specifies order of operations
  - dataflows can be executed using the MapReduce paradigm
- Thanks for listening!