The Pig Latin Dataflow Language - PowerPoint PPT Presentation

Slides: 25
Provided by: kenth

Transcript and Presenter's Notes



1
  • The Pig Latin Dataflow Language
  • A Brief Overview
  • James Jolly
  • University of Wisconsin-Madison
  • jolly_at_cs.wisc.edu

2
What is Pig Latin?
  • set-oriented data transformation language
  • primitives filter, combine, split, and order data
  • users describe transformations in steps
  • steps bundled into queries
  • each set transformation is stateless
  • flexible data model
      • nested bags of tuples
      • semi-structured datatypes
  • extensible
      • supports user-defined functions

3
How is it used in practice?
  • useful for computations across large, distributed
    datasets
  • abstracts away details of execution framework
  • users can change order of steps to improve
    performance
  • often used in tandem with Hadoop and HDFS
      • transformations converted to MapReduce dataflows
      • HDFS tracks where data is stored
      • operations scheduled near their data

4
An example...
  • Given two datasets:
      • a list of words and their frequency of appearance
        on webpages
      • a list of users and the webpages they visit
  • Let's find words users might be interested in
    lately.

5
Dataset: words and their frequency of appearance...
  • website          word      frequency  date
  • news.bbc.co.uk   obama     0.010      20081005
  • abcnews.go.com   scheme    0.025      20081010
  • abcnews.go.com   bombing   0.021      20081006
  • www.foxnews.com  bush      0.001      20081006
  • www.cnn.com      mccain    0.031      20081017
  • www.cnn.com      obama     0.001      20081002
  • www.reuters.com  bush      0.012      20080921
  • abcnews.go.com   congress  0.002      20080927
  • www.reuters.com  bush      0.012      20080921
  • www.foxnews.com  bush      0.001      20081006
  • www.latimes.com  abortion  0.001      20081015
  • www.latimes.com  attack    0.010      20081015
  • www.reuters.com  obama     0.005      20080917
  • www.foxnews.com  economy   0.038      20081006

6
Dataset: webpages users visit...
  • website          user
  • www.reuters.com  bill
  • news.bbc.co.uk   mike
  • www.cnn.com      mike
  • www.foxnews.com  bill
  • www.reuters.com  drew
  • www.latimes.com  james
  • abcnews.go.com   james

7
Loading word frequency data...
  • freqs = LOAD '/home/jolly/TestData/NewsWords.txt'
    USING PigStorage(',')
    AS (website_indexed, word, freq, date);
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (www.cnn.com, obama, 0.001, 20081002)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (abcnews.go.com, congress, 0.002, 20080927)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.latimes.com, abortion, 0.001, 20081015)
  • (www.latimes.com, attack, 0.010, 20081015)
  • (www.reuters.com, obama, 0.005, 20080917)
  • (www.foxnews.com, economy, 0.038, 20081006)
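For readers without Pig at hand, the LOAD step can be approximated in plain Python. This is only a sketch: the three sample rows, the comma delimiter, and the load helper are assumptions made for illustration and are not part of the original slides. As in an untyped Pig schema, every field stays a raw string.

```python
import csv
import io

# Hypothetical stand-in for the first rows of NewsWords.txt,
# assuming the file is comma-separated as PigStorage(',') suggests.
SAMPLE = """news.bbc.co.uk,obama,0.010,20081005
abcnews.go.com,scheme,0.025,20081010
www.foxnews.com,bush,0.001,20081006
"""


def load(text, delimiter=","):
    """Parse delimited text into a list of tuples, one tuple per line."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(row) for row in reader if row]


freqs = load(SAMPLE)
for row in freqs:
    print(row)
```

Each printed tuple has four string fields in the order (website_indexed, word, freq, date).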

8
Hmm, we have some repeats...
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (www.cnn.com, obama, 0.001, 20081002)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (abcnews.go.com, congress, 0.002, 20080927)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.latimes.com, abortion, 0.001, 20081015)
  • (www.latimes.com, attack, 0.010, 20081015)
  • (www.reuters.com, obama, 0.005, 20080917)
  • (www.foxnews.com, economy, 0.038, 20081006)

9
Duplicate data no more!
  • distinct_freqs = DISTINCT freqs;
  • (www.cnn.com, obama, 0.001, 20081002)
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (abcnews.go.com, congress, 0.002, 20080927)
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.foxnews.com, economy, 0.038, 20081006)
  • (www.latimes.com, attack, 0.010, 20081015)
  • (www.latimes.com, abortion, 0.001, 20081015)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (www.reuters.com, obama, 0.005, 20080917)
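What DISTINCT does to a bag of tuples can be sketched in plain Python with a set. The distinct helper and the five-row sample are assumptions for illustration. The result is sorted here only to make the output stable, since the order of a deduplicated bag should not be relied on.

```python
# A small sample containing the two repeated tuples from the slides.
freqs = [
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.foxnews.com", "bush", 0.001, 20081006),
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.foxnews.com", "bush", 0.001, 20081006),
]


def distinct(bag):
    """Drop exact duplicate tuples; sorting is only for a stable result."""
    return sorted(set(bag))


distinct_freqs = distinct(freqs)
for row in distinct_freqs:
    print(row)
```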

10
Hmm, these tuples are old
  • (www.cnn.com, obama, 0.001, 20081002)
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (abcnews.go.com, congress, 0.002, 20080927)
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.foxnews.com, economy, 0.038, 20081006)
  • (www.latimes.com, attack, 0.010, 20081015)
  • (www.latimes.com, abortion, 0.001, 20081015)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (www.reuters.com, obama, 0.005, 20080917)

11
... and these (green) tuples are not very
significant.
  • (www.cnn.com, obama, 0.001, 20081002)
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (abcnews.go.com, congress, 0.002, 20080927)
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (www.foxnews.com, bush, 0.001, 20081006)
  • (www.foxnews.com, economy, 0.038, 20081006)
  • (www.latimes.com, attack, 0.010, 20081015)
  • (www.latimes.com, abortion, 0.001, 20081015)
  • (www.reuters.com, bush, 0.012, 20080921)
  • (www.reuters.com, obama, 0.005, 20080917)

12
Let's filter them out.
  • important_freqs = FILTER distinct_freqs
    BY date > 20081001
    AND freq > 0.002;
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (www.foxnews.com, economy, 0.038, 20081006)
  • (www.latimes.com, attack, 0.010, 20081015)
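The same filter can be checked by hand in plain Python. This is a sketch that assumes freq is held as a float and date as an integer in YYYYMMDD form; the variable names mirror the slides but the code is not Pig.

```python
# The twelve distinct tuples from the slides: (website, word, freq, date).
distinct_freqs = [
    ("www.cnn.com", "obama", 0.001, 20081002),
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("abcnews.go.com", "scheme", 0.025, 20081010),
    ("abcnews.go.com", "bombing", 0.021, 20081006),
    ("abcnews.go.com", "congress", 0.002, 20080927),
    ("news.bbc.co.uk", "obama", 0.010, 20081005),
    ("www.foxnews.com", "bush", 0.001, 20081006),
    ("www.foxnews.com", "economy", 0.038, 20081006),
    ("www.latimes.com", "attack", 0.010, 20081015),
    ("www.latimes.com", "abortion", 0.001, 20081015),
    ("www.reuters.com", "bush", 0.012, 20080921),
    ("www.reuters.com", "obama", 0.005, 20080917),
]

# Keep only tuples that are both recent and significant.
important_freqs = [
    (website, word, freq, date)
    for website, word, freq, date in distinct_freqs
    if date > 20081001 and freq > 0.002
]

for row in important_freqs:
    print(row)
```

Six tuples survive, matching the list shown on the slide.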

13
Hmm, we don't need these anymore...
  • (www.cnn.com, mccain, 0.031, 20081017)
  • (abcnews.go.com, scheme, 0.025, 20081010)
  • (abcnews.go.com, bombing, 0.021, 20081006)
  • (news.bbc.co.uk, obama, 0.010, 20081005)
  • (www.foxnews.com, economy, 0.038, 20081006)
  • (www.latimes.com, attack, 0.010, 20081015)

14
Let's project them out.
  • websites_to_words = FOREACH important_freqs
    GENERATE website_indexed, word;
  • (www.cnn.com, mccain)
  • (abcnews.go.com, scheme)
  • (abcnews.go.com, bombing)
  • (news.bbc.co.uk, obama)
  • (www.foxnews.com, economy)
  • (www.latimes.com, attack)
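A projection of this kind is a one-line comprehension in plain Python. The sketch below assumes the four-field tuple layout used on the slides and simply keeps the first two fields.

```python
# The six filtered tuples: (website_indexed, word, freq, date).
important_freqs = [
    ("www.cnn.com", "mccain", 0.031, 20081017),
    ("abcnews.go.com", "scheme", 0.025, 20081010),
    ("abcnews.go.com", "bombing", 0.021, 20081006),
    ("news.bbc.co.uk", "obama", 0.010, 20081005),
    ("www.foxnews.com", "economy", 0.038, 20081006),
    ("www.latimes.com", "attack", 0.010, 20081015),
]

# Keep website_indexed and word; drop freq and date.
websites_to_words = [(website, word) for website, word, _freq, _date in important_freqs]

for row in websites_to_words:
    print(row)
```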

15
Now we are ready to join our lists.
  • Websites to Users
  • (news.bbc.co.uk, mike)
  • (www.cnn.com, mike)
  • (www.foxnews.com, bill)
  • (www.reuters.com, drew)
  • (www.latimes.com, james)
  • (abcnews.go.com, james)

  • Websites to Words
  • (www.cnn.com, mccain)
  • (abcnews.go.com, scheme)
  • (abcnews.go.com, bombing)
  • (news.bbc.co.uk, obama)
  • (www.foxnews.com, economy)
  • (www.latimes.com, attack)
16
Joining on website, finding words interesting to
users...
  • users_to_words_equijoin = JOIN websites_to_users
    BY website_visited,
    websites_to_words BY website_indexed;
  • users_to_words = FOREACH users_to_words_equijoin
    GENERATE user, word;
  • (mike, mccain)
  • (james, scheme)
  • (james, bombing)
  • (mike, obama)
  • (bill, economy)
  • (james, attack)
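The join followed by the projection can be mimicked with a small hash join in plain Python. The equijoin helper is an assumption for illustration, and the two input lists are the ones shown on the previous slide. Pig's own join strategy may differ.

```python
from collections import defaultdict

websites_to_users = [
    ("news.bbc.co.uk", "mike"),
    ("www.cnn.com", "mike"),
    ("www.foxnews.com", "bill"),
    ("www.reuters.com", "drew"),
    ("www.latimes.com", "james"),
    ("abcnews.go.com", "james"),
]
websites_to_words = [
    ("www.cnn.com", "mccain"),
    ("abcnews.go.com", "scheme"),
    ("abcnews.go.com", "bombing"),
    ("news.bbc.co.uk", "obama"),
    ("www.foxnews.com", "economy"),
    ("www.latimes.com", "attack"),
]


def equijoin(users, words):
    """Join (website, user) with (website, word) on website; emit (user, word)."""
    index = defaultdict(list)
    for website, word in words:
        index[website].append(word)
    return [
        (user, word)
        for website, user in users
        for word in index.get(website, [])
    ]


users_to_words = equijoin(websites_to_users, websites_to_words)
for row in users_to_words:
    print(row)
```

Note that drew produces no output: www.reuters.com has no surviving words, and an inner join drops unmatched rows.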

17
Let's group our results.
  • interests = GROUP users_to_words BY user;
  • (bill, (bill, economy))
  • (mike, (mike, mccain), (mike, obama))
  • (james, (james, scheme), (james, bombing),
    (james, attack))
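Grouping collects every tuple that shares a key into one nested collection. A plain-Python sketch, with a hypothetical group_by helper, might look like this; as on the slide, each group keeps the whole original tuple, including the key.

```python
from collections import defaultdict

users_to_words = [
    ("mike", "mccain"),
    ("james", "scheme"),
    ("james", "bombing"),
    ("mike", "obama"),
    ("bill", "economy"),
    ("james", "attack"),
]


def group_by(bag, key_index=0):
    """Map each key to the list of full tuples that carry it."""
    groups = defaultdict(list)
    for row in bag:
        groups[row[key_index]].append(row)
    return dict(groups)


interests = group_by(users_to_words)
for user, rows in interests.items():
    print(user, rows)
```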

18
How does it work?
  • logic factored into MapReduce jobs
  • mapper processes run on machines with input
    tuples
  • input tuples processed using MAP( )
    function, producing intermediate tuples
  • intermediate tuples grouped together, transferred
    to reducer nodes
  • reducer processes consume intermediate
    tuples with REDUCE( ) function

19
Translating Pig Latin to MapReduce...
  • transformed_by_map = FOREACH input_tuple
    GENERATE MAP();
  • intermediate_tuple_partition = GROUP
    transformed_by_map
    BY input_tuple_key;
  • result_tuples = FOREACH intermediate_tuple_partition
    GENERATE REDUCE();

These statements can be executed using a single
MapReduce job
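The three phases (map, group by key, reduce) can be simulated in a single Python process to see how the pieces fit. This is only a sketch: run_job, map_fn, and reduce_fn are hypothetical names, and a real MapReduce job would spread the work across machines rather than use an in-memory dictionary.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in a single process."""
    intermediate = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle: values that share a key land in the same partition.
            intermediate[key].append(value)
    # Reduce phase: each key's partition is reduced independently.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def map_fn(record):
    """Emit the user as the key and the word as the value."""
    user, word = record
    yield user, word


def reduce_fn(user, words):
    """Collect one user's words into a sorted list."""
    return sorted(words)


users_to_words = [
    ("mike", "mccain"), ("james", "scheme"), ("james", "bombing"),
    ("mike", "obama"), ("bill", "economy"), ("james", "attack"),
]
result = run_job(users_to_words, map_fn, reduce_fn)
print(result)
```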
20
Example message traffic...
21
Why Pig Latin? Why not a C library?
  • We could just supply MAP( ) and REDUCE( ) to a C
    library...
  • Pig Latin allows you to
  • describe long tasks
  • in a friendly scripting language
  • use many built-in datatypes
  • support for semi-structured data
  • use many built-in functions
  • filters, projections, joins, unions, splits, etc.
  • tends to make user-defined functions simpler

22
Why Pig Latin? Why not SQL?
  • Pig Latin
  • is imperative
  • lets users manually tune query execution plan
  • doesn't need a schema
  • can easily read, write, and represent
    semi-structured data

23
Pig Latin really describes a generic dataflow.
inputs = LOAD 'input.txt';
results = FILTER inputs BY IsBoring(important_attribute);
STORE results INTO 'results.txt';
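The same load, filter, store shape can be written in plain Python to show how generic the dataflow is. Everything here is an assumption for illustration: is_boring is a made-up predicate standing in for the IsBoring user-defined function, and the files live in a temporary directory.

```python
import tempfile
from pathlib import Path


def is_boring(value):
    """Hypothetical predicate standing in for the IsBoring UDF."""
    return len(value) < 4


def run_dataflow(src, dst, predicate):
    """Load lines from src, keep those the predicate accepts, store them in dst."""
    lines = src.read_text().splitlines()
    kept = [line for line in lines if predicate(line)]
    dst.write_text("".join(line + "\n" for line in kept))
    return kept


workdir = Path(tempfile.mkdtemp())
src = workdir / "input.txt"
dst = workdir / "results.txt"
src.write_text("cat\nelephant\ndog\n")

kept = run_dataflow(src, dst, is_boring)
print(kept)
```

As in the Pig statements, the filter keeps the rows for which the predicate is true.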
24
Summary
  • Pig Latin programs
  • typically operate on large volumes of
    unstructured data
  • describe a dataflow between primitive operations
  • many RDBMS-like operations built into the
    language
  • custom operations can be provided by the user
  • user specifies order of operations
  • dataflows can be executed using MapReduce
    paradigm
  • Thanks for listening!
