Extract, Transform and Load
Transcript and Presenter's Notes
1
Extract, Transform and Load
  • Lecture 2

2
Objectives
  • Describe the key design considerations for
    transformation and load
  • Describe a recommended process for designing,
    coding, and testing the ETL process

3
Data Engineering
[Diagram: warehouse pipeline stages (Acquisition,
Engineering, Storage, Data Delivery, Access)]
Purpose: Applies the business meaning and rules to the
incoming (extracted) data and performs all necessary
transformations for compliance with those rules.
Re-engineered data is the transformed data per the
business and integrity requirements of the warehouse
data models.
Inputs: Metadata, Source Data Model(s), Warehouse Data Model
Process: Transformation
Output: Staged, transformed data
4
Data Engineering
[Diagram: warehouse pipeline stages (Acquisition,
Engineering, Storage, Data Delivery, Access)]
  • Considerations for Design
  • Standardization
  • Cleaning
  • Denormalization
  • Data mapping and type conversion
  • Derivation
  • Aggregation(s)
  • Integration: assigning surrogate keys
  • Managing Slowly Changing Dimensions
  • Audit
  • Loading

5
Standardization
  • Often, data extracted from the source system
    needs to be cleaned and standardized. The best
    way to understand this is through some examples.
  • Standardization
  • Consider the following names
  • Bob Jones
  • Rob Jones
  • Bobby Jones
  • Robert Jones
  • These can all be standardized as Robert Jones

Why would we want to standardize these names?
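For illustration, a minimal Python sketch of nickname
standardization; the lookup table and function are invented for this
example, not any vendor's API:

# Minimal name-standardization sketch. The nickname table is
# illustrative; real tools ship large curated dictionaries.
NICKNAMES = {"bob": "robert", "bobby": "robert", "rob": "robert"}

def standardize_name(full_name: str) -> str:
    """Map a first-name nickname to its canonical form, title-cased."""
    first, _, rest = full_name.strip().partition(" ")
    canonical = NICKNAMES.get(first.lower(), first.lower())
    return f"{canonical} {rest}".strip().title()

for name in ["Bob Jones", "Rob Jones", "Bobby Jones", "Robert Jones"]:
    print(standardize_name(name))  # each prints "Robert Jones"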
6
Standardization
  • Consider the following addresses
  • These should be standardized as
  • 123 Main, Antigonish, Nova Scotia, B2G 2A1

Why would we want to do this?
7
Standardization
  • Standardization is a key step in the process of
    integrating data.
  • Using the previous example, assume we get a
    customer record from 2 different sources.

System 1
System 2
8
Standardization
  • In the previous example, it is probably safe to
    assume that the individual referenced is in
    fact the same person
  • As such, we would only want one record for this
    individual in our data warehouse
  • Assuming we use different primary keys in each
    system (a pretty safe assumption) we would need
    to match on name and address to determine that
    this is the same person
  • Standardization of name and address will give us
    the ability to match the individual records from
    multiple systems and integrate into a single
    customer record

9
Cleaning
  • Cleaning is a process used to remove unwanted
    noise from data and to make data more
    presentable
  • Consider the following shipping address:
  • Address 1: 123 Main, deliver to rear entrance
  • City: Antigonish
  • Postal Code: B2G 2A1
  • Province: NS
  • The cleaning process would remove the comment in
    the address field

10
Cleaning
  • Consider the following name:
  • BOB JONES
  • Many tools provide a "nice name" function that
    will turn the above into Bob Jones
  • Consider the following phone number:
  • 9028630440
  • Many tools provide the ability to turn this into
    (902) 863-0440
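
A small Python sketch of these cleaning steps, assuming the address
comment always follows a comma (an illustrative rule, not a tool's
actual behavior):

import re

def clean_address(addr: str) -> str:
    # Keep only the street portion before any trailing comment.
    # Assumes comments are appended after a comma (illustrative rule).
    return addr.split(",")[0].strip()

def nice_name(name: str) -> str:
    # "BOB JONES" -> "Bob Jones"
    return name.title()

def format_phone(digits: str) -> str:
    # "9028630440" -> "(902) 863-0440"
    m = re.fullmatch(r"(\d{3})(\d{3})(\d{4})", digits)
    return f"({m.group(1)}) {m.group(2)}-{m.group(3)}" if m else digits

print(clean_address("123 Main, deliver to rear entrance"))  # 123 Main
print(nice_name("BOB JONES"))                               # Bob Jones
print(format_phone("9028630440"))                           # (902) 863-0440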

11
Cleaning
  • Disney example
  • Disney provided kiosks in Epcot that allowed
    visitors to sign up to receive information
    regarding Disney vacations (a great marketing tool)
  • Got a high number of bogus sign-ups, like
  • Goofy, 123 Main St
  • Mickey Mouse, etc.
  • Built rules (using a tool) to exclude these
    records from the DW
  • Obscenity checking was also done and offending
    records eliminated

12
Cleaning and standardization tools
  • Required primarily for cleaning and standardizing
    the customer dimension and, to a lesser extent,
    the product dimension
  • Provide lots of functionality for cleaning and
    standardizing names and addresses
  • Key functionality: the ability to match records
    based on very sophisticated heuristic rules
  • Provide users the ability to define their own rules
  • Allow matching on a percentage match or score;
    users define what constitutes a match (80, 90,
    100?)
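
A toy sketch of score-based matching using Python's standard difflib;
real tools use far more sophisticated heuristics, and the threshold
here is just one plausible choice:

from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Similarity ratio between two standardized strings, 0 to 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 90  # users decide what constitutes a match: 80? 90? 100?

rec1 = "robert jones 123 main antigonish ns b2g 2a1"
rec2 = "robert jones 123 main st antigonish ns b2g 2a1"
score = match_score(rec1, rec2)
print(f"score={score:.0f}", "MATCH" if score >= THRESHOLD else "NO MATCH")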

13
Cleaning and Standardization tools
  • Example of advanced functionality for business
    name/address matching
  • Are the following companies in fact the same
    company?
  • Cleaning and standardization tools provide the
    ability to find that A matches B and B matches C,
    and therefore that A, B, and C are all the same
    company

14
Cleaning and Standardization Vendors
  • Leading Vendors
  • Id Centric from FirstLogic
  • Trillium from Harte Hanks
  • Code 1 from Group 1 software
  • IDR from Vality
  • Typical cost of software: $250,000 US and up

15
Denormalization
  • Denormalization is the process of flattening
    dimension and fact tables from multiple tables to
    (in most cases) a single table.
  • Consider the following order (fact) data from a
    production system

Target: Order Fact (Customer_ID, Emp_ID, Branch_ID, Timestamp,
Order_ID, Line_ID, Product_ID, Price, Qty, Cost)
Source tables:
Order Header (Order_ID, Order_Timestamp)
Customer (Cust_ID)
Order Taker (Emp_ID, Branch_ID, Name)
Order Line (Order_ID, Line_ID, Product_ID, Price, Qty, Cost)
Product (Product_ID)
Branch (Branch_ID)
16
Denormalization - Dimensions
  • Consider the employee-to-branch relationship
  • We would probably denormalize that into something
    like a location dimension

Order Taker (Emp_ID, Branch_ID, Name)
Branch (Branch_ID, Sales_Region, City, Province)
Location (Emp_ID, Branch_ID, Name, Sales_Region, City, Province)
What are the benefits of denormalization?
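
A minimal sketch of this flattening using pandas merges; the column
names follow the slides, while the data values are invented for
illustration:

import pandas as pd

# Source tables (values invented for illustration)
order_taker = pd.DataFrame({"Emp_ID": [1, 2], "Branch_ID": [10, 10],
                            "Name": ["Ann", "Raj"]})
branch = pd.DataFrame({"Branch_ID": [10], "Sales_Region": ["East"],
                       "City": ["Antigonish"], "Province": ["NS"]})

# Denormalize: flatten the employee-to-branch relationship into one
# location dimension row per employee.
location = order_taker.merge(branch, on="Branch_ID", how="left")
print(location)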
17
Derivation
  • We often need to derive data items. A simple
    example is measures on a fact table. Consider
    the order fact:

Order Fact (Customer_ID, Emp_ID, Timestamp, Order_ID,
Line_ID, Product_ID, Price, Qty, Cost)

In this case we may want to derive data to create
Revenue, Line_Cost, and Line_GM (gross margin)
attributes, as they are often used as measures:

Order Fact (Customer_ID, Emp_ID, Branch_ID, Timestamp,
Order_ID, Line_ID, Product_ID, Price, Qty, Cost,
Revenue, Line_Cost, Line_GM)

What is the benefit of deriving the attributes in
the ETL process?
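
A sketch of the derivation, assuming Revenue = Price * Qty,
Line_Cost = Cost * Qty, and Line_GM = Revenue - Line_Cost; the slides
name these measures but do not spell out the formulas:

import pandas as pd

fact = pd.DataFrame({"Price": [10.0, 4.5], "Qty": [3, 2],
                     "Cost": [6.0, 3.0]})

# Assumed formulas: the slides name the measures, not the definitions.
fact["Revenue"] = fact["Price"] * fact["Qty"]
fact["Line_Cost"] = fact["Cost"] * fact["Qty"]
fact["Line_GM"] = fact["Revenue"] - fact["Line_Cost"]  # gross margin
print(fact)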
18
Derivation - Households
  • A more complex example.
  • Often, marketers are interested in understanding
    the households that customers live in and the
    relationships that exist (husband, wife, kids,
    etc.). Typically, this information is not
    available from any source system and needs to be
    derived.
  • A household may have more than one address (if,
    for example, a family has a permanent and a
    summer address)

Why would marketers want this information?
19
Derivation Household example
  • Typically, household derivation is done using
    cleaning/standardization tools like ID Centric
    and Trillium.
  • These tools look to match individuals into
    households based on standardized names and
    addresses
  • Examples

Is this a household?
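
A toy Python sketch of household grouping on standardized surname and
address; real tools apply much more conservative matching rules, and
the people below are invented:

from collections import defaultdict

people = [
    ("Robert Jones", "123 main antigonish ns b2g 2a1"),
    ("Mary Jones",   "123 main antigonish ns b2g 2a1"),
    ("Sue Smith",    "45 college antigonish ns b2g 1x2"),
]

# Group individuals sharing a standardized surname and address.
households = defaultdict(list)
for name, address in people:
    surname = name.split()[-1].lower()
    households[(surname, address)].append(name)

for key, members in households.items():
    print(key, members)  # the two Joneses form one household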
20
Derivation - examples
Is this a household?
Is this a household?
21
Derivation Households - Conclusions
  • Deriving households can be tricky, and
    organizations tend to be pretty conservative in
    their approach
  • They don't want to assume a relationship that
    does not exist, as that may be offensive
  • This is balanced against the desire not to (for
    example) make the same marketing offer (for
    credit cards, say) multiple times to the same
    household

22
Aggregation
  • Often, we need to generate aggregated fact data
    for 2 main reasons:
  • Speed up queries
  • Generate snapshot facts
  • Speed up query processing:
  • The DBA is continuously monitoring the
    performance of queries and will create an
    aggregate fact when atomic facts are often rolled
    up a certain way
  • The most common rollup: date.
  • Assuming we keep atomic fact data, it is very
    common that we want to do analysis of orders by a
    time period such as week or month.
  • If we build aggregate fact tables, these can be
    used in place of atomic-level facts and speed
    query processing, as they contain many fewer
    records (see the sketch after this list).
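
A minimal pandas sketch of building a monthly aggregate fact from
atomic order facts (values invented for illustration):

import pandas as pd

fact = pd.DataFrame({
    "order_date": pd.to_datetime(["2002-03-19", "2002-03-20",
                                  "2002-04-02"]),
    "Product_ID": ["A", "A", "B"],
    "Qty": [3, 2, 5],
    "Revenue": [30.0, 20.0, 45.0],
})

# Monthly aggregate fact: far fewer rows, so rollup queries run faster.
monthly = (fact
           .assign(month=fact["order_date"].dt.to_period("M"))
           .groupby(["month", "Product_ID"], as_index=False)
           [["Qty", "Revenue"]]
           .sum())
print(monthly)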

23
Aggregation
  • Speed up queries
  • Note that aggregate fact tables retain references
    to the other dimension tables
  • Many query tools are "aggregate aware". This
    means that they can detect when an aggregate
    table exists for a fact and automatically select
    that table in the query. This is transparent to
    the user.
  • Snapshot facts
  • As discussed in a previous lecture, organizations
    typically want both a transactional and a
    snapshot view of data.
  • Generation of snapshot facts requires a
    significant amount of aggregation of data.

24
Aggregation
  • When calculating aggregates, the following is
    typically true. Assume we are building an
    aggregate order fact where we want products
    ordered by customer by location by month.
  • The data for the current month would not be
    available until the end of the month
  • The FKs to the other relevant dimensions are the
    same as in the atomic-level fact
  • The FK to the date dimension needs to be to a
    month, not a day. To achieve this, we need to
    specify levels in the date dimension so that
    each unique month has its own unique key. So, in
    the date dimension, we have records that
    represent unique days (3/19/02) AND unique months
    (3/02). This makes things more complex for users.
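
A small sketch of a date dimension that carries both day-level and
month-level rows, with a level column distinguishing them; the layout
and key values are assumed for illustration:

# Date dimension with day- and month-level rows (layout assumed).
date_dim = [
    {"date_key": 1001, "level": "day",   "value": "3/19/02"},
    {"date_key": 1002, "level": "day",   "value": "3/20/02"},
    {"date_key": 2003, "level": "month", "value": "3/02"},
]

# Atomic facts reference day keys; a monthly aggregate fact references
# the month key, so each unique month has its own surrogate key.
month_keys = {row["value"]: row["date_key"]
              for row in date_dim if row["level"] == "month"}
print(month_keys["3/02"])  # 2003, the FK used by the monthly aggregate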

25
Data Mapping and Type conversions
  • Need to implement logic to choose which values to
    map into fields (e.g., different gender codes
    from different systems)
  • Need to implement IF-THEN-ELSE (or CASE)
    logic to determine how individual fields will be
    populated.
  • Often need to convert data types from one form to
    another
  • Example: a date stored in a CHAR field;
    convert it to DATE.
  • Example: time of day stored as part of a date;
    parse it out and store it as an integer
  • ETL tools are really good at mapping and
    converting from one data type to another (see the
    sketch below)
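
A short Python sketch of both ideas, mapping source gender codes onto
one warehouse code and converting a CHAR date/time field; the codes
and the field format are invented:

from datetime import datetime

# Source gender codes mapped to one warehouse code (codes invented).
GENDER_MAP = {"M": "M", "MALE": "M", "1": "M",
              "F": "F", "FEMALE": "F", "2": "F"}

def map_gender(code: str) -> str:
    return GENDER_MAP.get(code.strip().upper(), "U")  # "U" = unknown

def char_to_datetime(char_field: str) -> datetime:
    # Date/time stored in a CHAR field (format assumed) -> DATE value
    return datetime.strptime(char_field, "%Y%m%d%H%M")

dt = char_to_datetime("200203191430")
hour_of_day = dt.hour  # time of day parsed out, stored as an integer
print(map_gender("female"), dt.date(), hour_of_day)  # F 2002-03-19 14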

26
Surrogate Key Processing
  • Design needs to consider two aspects of surrogate
    key processing
  • Assigning and maintaining the keys associated
    with dimension records
  • Assigning surrogate keys to fact records

27
Surrogate Keys
  • For a given dimension table, the initial setup
    involves two steps:
  • Create a sequence that will provide the
    surrogate keys for the dimension table. The
    sequence generates a sequential set of integers
    that will be assigned to each row in the
    dimension table
  • Create a key map table that will map production
    key(s) to the surrogate key assigned.
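
An in-memory Python sketch of the sequence and key map table; the
database objects are stood in by a counter and a dict, and the product
keys come from the example on the next slide:

import itertools

product_sequence = itertools.count(start=1)   # the sequence
product_key_map: dict[str, int] = {}          # production key -> surrogate

def assign_surrogate(production_key: str) -> int:
    if production_key not in product_key_map:
        product_key_map[production_key] = next(product_sequence)
    return product_key_map[production_key]

# The overlap case: once we know X2334R (system 1) and STDHMMR
# (system 2) are the same product, both production keys map to one
# surrogate key.
sk = assign_surrogate("X2334R")
product_key_map["STDHMMR"] = sk
print(product_key_map)  # {'X2334R': 1, 'STDHMMR': 1}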

28
Surrogate keys Key Map table
  • Assume we have products coming from 2 systems and
    there is overlap in the products in the systems
  • Assume we have a method to tell us X2334R is the
    same as STDHMMR

System 1
System 2
29
Surrogate Keys key map table
  • The role of the key map table is to create and
    maintain a record of the production key(s) that
    map to the surrogate key assigned to a given
    dimension row.
  • Simple format, using the previous example:

What is the primary use of a key map table?
30
Surrogate Key assignment
  • Surrogate key assignment is a two-step process:
  • Manage assignment of keys to dimension records as
    primary keys
  • Manage assignment of surrogate keys to fact
    records
  • Key map table is critical to this process as a
    lookup table

31
Surrogate Key Assignment Process - Dimensions
[Diagram: a new product record (Product_ID, Description, other
attributes) flows through the Product Sequence and Product Key Map
Table as follows]
1. Determine if the production key is already mapped (Product Key
   Map Table)
2. If not, get the next sequential key (Product Sequence)
3. Update the Key Map table with the Product_ID and surrogate key
4. Update the new record with the surrogate key, producing
   (Surrogate_Key, Product_ID, Description, other attributes)
Why keep the production key?
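
The same four steps as a small Python function, reusing in-memory
stand-ins for the sequence and key map (record layout assumed):

import itertools

product_sequence = itertools.count(start=1)
product_key_map: dict[str, int] = {}

def process_new_product(record: dict) -> dict:
    prod_id = record["Product_ID"]
    if prod_id not in product_key_map:    # 1. production key mapped?
        key = next(product_sequence)      # 2. get next sequential key
        product_key_map[prod_id] = key    # 3. update key map table
    # 4. update the new record with the surrogate key
    return {"Surrogate_Key": product_key_map[prod_id], **record}

print(process_new_product({"Product_ID": "X2334R",
                           "Description": "Hammer"}))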
32
Surrogate key assignment - facts
Consider an order fact record:

Order Fact (Customer_ID, Emp_ID, Branch_ID, Timestamp,
Order_ID, Line_ID, Product_ID, Price, Qty, Cost,
Revenue, Line_cost, Line_GM)

Using the key map tables as lookups:
  • Replace Customer_ID with its surrogate key (Customer key map table)
  • Replace Emp_ID/Branch_ID with a surrogate key (Location key map table)
  • Replace Timestamp with a surrogate key (Date key map table)
  • Replace Product_ID with its surrogate key (Product key map table)

Order Fact (Cust_Key, Loc_Key, Date_Key, Product_Key,
Order_ID, Line_ID, Price, Qty, Cost, Revenue, Line_cost,
Line_GM)

Why do we keep Order_ID and Line_ID? What are these
called?
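
A sketch of the fact-row substitutions using key map lookups; all key
values are invented for illustration:

# Key map tables as simple lookups (values invented).
customer_map = {"C100": 1}
location_map = {("E7", "B2"): 4}
date_map = {"2002-03-19": 1001}
product_map = {"X2334R": 1}

fact_in = {"Customer_ID": "C100", "Emp_ID": "E7", "Branch_ID": "B2",
           "Timestamp": "2002-03-19", "Product_ID": "X2334R",
           "Order_ID": 555, "Line_ID": 1,
           "Price": 10.0, "Qty": 3, "Cost": 6.0}

fact_out = {
    "Cust_Key": customer_map[fact_in["Customer_ID"]],
    "Loc_Key": location_map[(fact_in["Emp_ID"], fact_in["Branch_ID"])],
    "Date_Key": date_map[fact_in["Timestamp"]],
    "Product_Key": product_map[fact_in["Product_ID"]],
    # Order_ID and Line_ID are retained from the source record
    "Order_ID": fact_in["Order_ID"], "Line_ID": fact_in["Line_ID"],
    "Price": fact_in["Price"], "Qty": fact_in["Qty"],
    "Cost": fact_in["Cost"],
}
print(fact_out)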
33
Slowly changing dimensions
  • We need to build the logic for processing slowly
    changing dimensions, and this can be challenging.
  • Note that for a complex dimension like customer,
    we may implement multiple change strategies,
    depending on what data is changing!
  • Type 1
  • Type 2
  • Type 3
  • Demographics

Examples of where each technique would be applied?
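
As one example, a toy Python sketch of Type 2 processing, closing out
the current row and inserting a new versioned row when a tracked
attribute changes (table layout assumed for illustration):

from datetime import date

customer_dim = [
    {"surrogate_key": 1, "cust_id": "C100", "city": "Antigonish",
     "effective_from": date(2000, 1, 1), "effective_to": None},
]
next_key = 2

def apply_type2_change(cust_id: str, new_city: str,
                       change_date: date) -> None:
    global next_key
    for row in customer_dim:
        if row["cust_id"] == cust_id and row["effective_to"] is None:
            row["effective_to"] = change_date  # close current version
    customer_dim.append({"surrogate_key": next_key, "cust_id": cust_id,
                         "city": new_city,
                         "effective_from": change_date,
                         "effective_to": None})
    next_key += 1

apply_type2_change("C100", "Halifax", date(2002, 3, 19))
print(customer_dim)  # two versions: history kept under distinct keys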
34
Audit
  • As with extract, it's important to build audit
    and error-handling logic into the process

35
Load Process
  • Always use the bulk loader for ALL inserts, even
    those from within a database
  • Most DBMSs provide a bulk loader function
  • Basically, the bulk loader provides the ability
    to load data more quickly, as it takes a number
    of shortcuts
  • For example, the bulk loader does not insert
    records into the recovery log on a
    record-by-record basis.
  • Presort what is to be loaded by primary key.
  • The DBMS provides sort capabilities
  • Can also purchase tools like SyncSort, which
    sorts data very quickly.
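
The slides don't name a DBMS; as one concrete illustration, a hedged
sketch of PostgreSQL's COPY bulk-load path via psycopg2, with the rows
presorted by primary key (connection details are placeholders):

import io
import psycopg2

# Presort the rows by primary key before the bulk load.
rows = sorted([(2, "STDHMMR"), (1, "X2334R")], key=lambda r: r[0])
buf = io.StringIO("".join(f"{key},{pid}\n" for key, pid in rows))

conn = psycopg2.connect("dbname=dw user=etl")  # placeholder connection
with conn, conn.cursor() as cur:
    # COPY is the bulk path: much faster than row-by-row INSERTs
    # because it skips per-record overhead such as per-row logging.
    cur.copy_expert(
        "COPY product_dim (product_key, product_id) "
        "FROM STDIN WITH (FORMAT csv)", buf)
conn.close()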

36
Load Process
  • Many DBMSs allow simple transformations (e.g.,
    aggregation of data) during the load process
  • Rule of thumb: stay away from any transformations
    using load facilities
  • Index management: need to determine whether or
    not to drop the index(es) on a table before a
    load, then rebuild
  • Rule of thumb: if loading more than 10% of the
    table, drop and recreate
  • Use the parallel capabilities of the DBMS, OS,
    and hardware to do as much processing
    simultaneously as possible.
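
A hedged sketch of the drop-load-recreate pattern; the table, index,
and DDL names are illustrative:

DROP_SQL = "DROP INDEX IF EXISTS idx_order_fact_date"
CREATE_SQL = "CREATE INDEX idx_order_fact_date ON order_fact (date_key)"

def load_with_index_management(cur, bulk_load, fraction_of_table: float):
    if fraction_of_table > 0.10:   # rule of thumb from the slide
        cur.execute(DROP_SQL)      # drop the index before the big load
        bulk_load(cur)
        cur.execute(CREATE_SQL)    # rebuild once, after the load
    else:
        bulk_load(cur)             # small load: keep the index in place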

37
Design Process Overview
  • Step 1. Environment Setup
  • Test, choose, and implement tools
  • - ETL
  • - Cleaning/Standardization
  • - Sort utilities
  • - Code management
  • - Metadata
  • Define documentation standards
  • Ensure all design documentation is saved as
    metadata

38
Design Process Overview
  • Step 2. High-Level Plan
  • Create a very high-level, one-page schematic of
    the source-to-target flow
  • Identify starting and ending points
  • Label known data sources
  • Label targets
  • Include notes about known gotchas

39
Example Process Overview
40
Design Process Overview
  • Step 3. Detailed Design
  • Drill down by target table or logical group of
    target tables, graphically sketching any complex
    data restructuring or transformations
  • Graphically illustrate the surrogate-key
    generation process
  • Develop a preliminary job sequencing

41
Design Process Overview
  • Step 4. Design ETL for a simple dimension
  • Start with a static extract, transform, and load
  • Step 5. Add change logic for the dimension
  • Design and implement the logic required for
    incremental extract and processing
  • Design and implement logic to manage slowly
    changing dimensions
  • Step 6. Design ETL for the initial (historical)
    load of facts
  • Step 7. Design ETL for incremental ETL of facts
  • Tune the fact table load to take advantage of
    parallelism, bulk load, etc.

42
Design Process Overview
  • Step 8. Design and build the process for
    aggregates and for moving data from the data
    warehouse to data marts
  • Special consideration required for the design of
    MDDB-based marts
  • Step 9. Build ETL automation, including
  • Job scheduling
  • Statistics generation
  • Error handling