Title: Extract, Transform and Load
1. Extract, Transform and Load
2. Objectives
- Describe the key design considerations for transformation and load
- Describe a recommended process for designing, coding, and testing the ETL process
3. Data Engineering
[Diagram: Acquisition, Engineering, Storage, Data Delivery, Access]
- Purpose: applies the business meaning and rules to the incoming (extracted) data and performs all transformations necessary to comply with them. Re-engineered data is the transformed data that meets the business and integrity requirements of the warehouse data models.
[Diagram: source data model(s) and warehouse data model metadata drive the transformation, producing staged transformed data]
4. Data Engineering
[Diagram: Acquisition, Engineering, Storage, Data Delivery, Access]
- Considerations for design
  - Standardization
  - Cleaning
  - Denormalization
  - Data mapping and type conversion
  - Derivation
  - Aggregation(s)
  - Integration: assigning surrogate keys
  - Managing slowly changing dimensions
  - Audit
  - Loading
5. Standardization
- Often, data extracted from the source system needs to be cleaned and standardized. The best way to understand this is through some examples.
- Consider the following names:
  - Bob Jones
  - Rob Jones
  - Bobby Jones
  - Robert Jones
- These can all be standardized as Robert Jones (a minimal sketch follows below).
Why would we want to standardize these names?
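A minimal Python sketch of this kind of name standardization, assuming a small hypothetical nickname lookup table (commercial tools ship with far larger reference dictionaries):

```python
# Hypothetical nickname lookup; a real tool's reference data is far more extensive.
NICKNAME_TO_STANDARD = {
    "bob": "Robert",
    "bobby": "Robert",
    "rob": "Robert",
    "robert": "Robert",
}

def standardize_name(full_name: str) -> str:
    """Standardize the first name of a 'First Last' string."""
    first, _, last = full_name.strip().partition(" ")
    standard_first = NICKNAME_TO_STANDARD.get(first.lower(), first.title())
    return f"{standard_first} {last.title()}".strip()

for name in ["Bob Jones", "Rob Jones", "Bobby Jones", "Robert Jones"]:
    print(standardize_name(name))   # each prints "Robert Jones"
```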
6. Standardization
- Consider the following addresses: [address variants shown on slide]
- These should be standardized as:
  - 123 Main, Antigonish, Nova Scotia, B2G 2A1
Why would we want to do this?
7. Standardization
- Standardization is a key step in the process of integrating data.
- Using the previous example, assume we get a customer record from two different sources.
[Table: the same customer as recorded in System 1 and System 2]
8. Standardization
- In the previous example, it is probably safe to assume that the individual referenced is in fact the same person.
- As such, we would want only one record for this individual in our data warehouse.
- Assuming we use different primary keys in each system (a pretty safe assumption), we would need to match on name and address to determine that this is the same person.
- Standardization of name and address gives us the ability to match the individual records from multiple systems and integrate them into a single customer record.
9. Cleaning
- Cleaning is a process used to remove unwanted noise from data and to make data more presentable.
- Consider the following shipping address:
  - Address 1: 123 Main, deliver to rear entrance
  - City: Antigonish
  - Postal Code: B2G 2A1
  - Province: NS
- The cleaning process would remove the comment in the address field.
10. Cleaning
- Consider the following name:
  - BOB JONES
- Many tools provide a "nice name" function that will turn the above into Bob Jones.
- Consider the following phone number:
  - 9028630440
- Many tools provide the ability to turn this into (902) 863-0440 (see the sketch below).
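A minimal cleaning sketch along these lines, assuming the comment keywords, name, and phone formats shown in the examples (real tools handle many more cases):

```python
import re

def clean_address(address: str) -> str:
    # Assumes anything from a known comment keyword onward is noise to remove.
    return re.split(r"\b(deliver|attn|attention)\b", address, flags=re.IGNORECASE)[0].strip()

def nice_name(name: str) -> str:
    return name.title()              # "BOB JONES" -> "Bob Jones"

def format_phone(raw: str) -> str:
    digits = re.sub(r"\D", "", raw)  # keep digits only
    return f"({digits[0:3]}) {digits[3:6]}-{digits[6:10]}" if len(digits) == 10 else raw

print(clean_address("123 Main deliver to rear entrance"))  # "123 Main"
print(nice_name("BOB JONES"))                               # "Bob Jones"
print(format_phone("9028630440"))                           # "(902) 863-0440"
```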
11. Cleaning
- Disney example:
  - Disney provided kiosks in Epcot that allowed visitors to sign up to receive information about Disney vacations (a great marketing tool).
  - They got a high number of bogus sign-ups, like:
    - Goofy, 123 Main St
    - Mickey Mouse, etc.
  - They built rules (using a tool) to exclude these records from the DW.
  - Obscenity checking was also done and offending records eliminated.
12. Cleaning and Standardization Tools
- Required primarily for cleaning and standardizing the customer dimension and, to a lesser extent, the product dimension.
- Provide lots of functionality for cleaning and standardizing names and addresses.
- Key functionality: the ability to match records based on very sophisticated heuristic rules.
- Provide users the ability to define their own rules.
- Allow matching on a percentage match or score; users define what constitutes a match (80%, 90%, 100%?). A minimal scoring sketch follows below.
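A minimal sketch of score-based matching, using a simple string-similarity ratio in place of a commercial tool's heuristics; the records and the 90% threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.90   # user-defined: what constitutes a match (80%? 90%? 100%?)

def match_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

rec1 = "Robert Jones, 123 Main, Antigonish, NS, B2G 2A1"
rec2 = "Robert Jones, 123 Main Street, Antigonish, NS, B2G 2A1"

score = match_score(rec1, rec2)
print(f"score={score:.2f}, match={score >= MATCH_THRESHOLD}")
```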
13. Cleaning and Standardization Tools
- Example of advanced functionality for business name/address matching.
- Are the following companies in fact the same company? [examples shown on slide]
- Cleaning and standardization tools provide the ability to find that A matches B and B matches C, and therefore group A, B, and C as the same company.
14. Cleaning and Standardization Vendors
- Leading vendors:
  - ID Centric from FirstLogic
  - Trillium from Harte Hanks
  - Code 1 from Group 1 Software
  - IDR from Vality
- Typical cost of the software: $250,000 US and up
15. Denormalization
- Denormalization is the process of flattening dimension and fact tables from multiple tables into (in most cases) a single table.
- Consider the following order (fact) data from a production system:
  - Order Header: Order_ID, Order Timestamp
  - Order Line: Order_ID, Line_ID, Product_ID, Price, Qty, Cost
  - Customer: Cust_ID
  - Order Taker: Emp_ID, Branch_ID, Name
  - Product: Product_ID
  - Branch: Branch_ID
- These flatten into a single Order Fact: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost
16. Denormalization - Dimensions
- Consider the employee-to-branch relationship:
  - Order Taker: Emp_ID, Branch_ID, Name
  - Branch: Branch_ID, Sales_Region, City, Province
- We would probably denormalize that into something like a location dimension (see the sketch below):
  - Location: Emp_ID, Branch_ID, Name, Sales_Region, City, Province
What are the benefits of denormalization?
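A minimal sketch of the flattening, joining illustrative Order Taker and Branch rows into a Location dimension row on Branch_ID:

```python
# Illustrative source rows; the attribute values are assumptions for the example.
order_takers = [
    {"Emp_ID": 101, "Branch_ID": 7, "Name": "Robert Jones"},
]
branches = {
    7: {"Sales_Region": "Atlantic", "City": "Antigonish", "Province": "NS"},
}

# Join on Branch_ID and keep all columns in a single (denormalized) row.
location_dim = [
    {**taker, **branches[taker["Branch_ID"]]}
    for taker in order_takers
]
print(location_dim[0])
# {'Emp_ID': 101, 'Branch_ID': 7, 'Name': 'Robert Jones',
#  'Sales_Region': 'Atlantic', 'City': 'Antigonish', 'Province': 'NS'}
```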
17. Derivation
- We often need to derive data items. A simple example is measures on the fact table. Consider the order fact:
  - Source: Order Fact: Customer_ID, Emp_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost
  - Derived: Order Fact: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM
- In this case we may want to derive data to create the Revenue, Line_GM (gross margin), and Line_cost attributes, as they are often used as measures (a derivation sketch follows below).
What is the benefit of deriving these attributes in the ETL process?
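A minimal derivation sketch, assuming Price and Cost on the source record are per-unit values:

```python
def derive_measures(fact: dict) -> dict:
    fact = dict(fact)                                        # work on a copy
    fact["Revenue"] = fact["Price"] * fact["Qty"]
    fact["Line_cost"] = fact["Cost"] * fact["Qty"]
    fact["Line_GM"] = fact["Revenue"] - fact["Line_cost"]    # gross margin
    return fact

row = {"Order_ID": 1, "Line_ID": 1, "Product_ID": "X2334R",
       "Price": 10.0, "Qty": 3, "Cost": 6.0}
print(derive_measures(row))   # Revenue=30.0, Line_cost=18.0, Line_GM=12.0
```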
18. Derivation - Households
- A more complex example.
- Often, marketers are interested in understanding the households that customers live in and the relationships that exist (husband, wife, kids, etc.). Typically, this information is not available from any source system and needs to be derived.
- A household may have more than one address (if, for example, a family has a permanent and a summer address).
Why would marketers want this information?
19. Derivation - Household Example
- Typically, household derivation is done using cleaning/standardization tools like ID Centric and Trillium.
- These tools look to match individuals into households based on standardized names and addresses.
- Examples: [shown on slide]
Is this a household?
20. Derivation - Examples
[Two further examples shown on slide, each asking: Is this a household? A conservative grouping sketch follows below.]
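A deliberately conservative sketch of household grouping, assuming individuals share a standardized address and surname; real tools apply many more matching rules:

```python
from collections import defaultdict

people = [
    {"Name": "Robert Jones", "Address": "123 Main, Antigonish, NS, B2G 2A1"},
    {"Name": "Mary Jones",   "Address": "123 Main, Antigonish, NS, B2G 2A1"},
    {"Name": "Ann Smith",    "Address": "45 Oak, Antigonish, NS, B2G 2A1"},
]

households = defaultdict(list)
for p in people:
    surname = p["Name"].split()[-1].lower()
    key = (p["Address"].lower(), surname)   # conservative: same address AND surname
    households[key].append(p["Name"])

for key, members in households.items():
    print(key, "->", members)
# The two Joneses group together; Ann Smith forms her own household.
```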
21. Derivation - Households: Conclusions
- Deriving households can be tricky, and organizations tend to be pretty conservative in their approach.
- They don't want to assume a relationship that does not exist; that may be offensive.
- This is balanced against the desire not to (for example) make the same marketing offer (for credit cards, as an example) multiple times to the same household.
22. Aggregation
- Often, we need to generate aggregated fact data for two main reasons:
  - Speed up queries
  - Generate snapshot facts
- Speed up query processing:
  - The DBA continuously monitors the performance of queries and will create an aggregate fact when atomic facts are often rolled up a certain way.
  - The most common rollup is by date.
  - Assuming we keep atomic fact data, it is very common that we want to analyze orders by a time period such as week or month.
  - If we build aggregate fact tables, these can be used in place of atomic-level facts and speed up query processing, as they contain many fewer records.
23. Aggregation
- Speed up queries
  - Note that aggregate fact tables retain references to the other dimension tables.
  - Many query tools are "aggregate aware": they can detect when an aggregate table exists for a fact and automatically select that table in the query. This is transparent to the user.
- Snapshot facts
  - As discussed in a previous lecture, organizations typically want both a transactional and a snapshot view of data.
  - Generation of snapshot facts requires a significant amount of aggregation of data.
24. Aggregation
- When calculating aggregates, the following is typically true. Assume we are building an aggregate order fact where we want products ordered by customer, by location, by month (a rollup sketch follows below):
  - The data for the current month would not be available until the end of the month.
  - The FKs to the other relevant dimensions are the same as in the atomic-level fact.
  - The FK to the date dimension needs to point to a month, not a day. To achieve this, we need to specify levels in the date dimension so that each unique month has its own unique key. So, in the date dimension, we have records that represent unique days (3/19/02) AND unique months (3/02). This makes things more complex for users.
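A minimal rollup sketch for the customer/location/month aggregate, with illustrative atomic facts; resolving the month-level date key against the date dimension is assumed to happen separately:

```python
from collections import defaultdict

atomic_facts = [
    {"Cust_Key": 1, "Loc_Key": 7, "Date": "2002-03-19", "Revenue": 30.0},
    {"Cust_Key": 1, "Loc_Key": 7, "Date": "2002-03-25", "Revenue": 45.0},
    {"Cust_Key": 2, "Loc_Key": 7, "Date": "2002-03-20", "Revenue": 12.5},
]

monthly = defaultdict(float)
for f in atomic_facts:
    month = f["Date"][:7]                                    # "2002-03"
    monthly[(f["Cust_Key"], f["Loc_Key"], month)] += f["Revenue"]

for (cust, loc, month), revenue in sorted(monthly.items()):
    print(cust, loc, month, revenue)
# prints: 1 7 2002-03 75.0  and  2 7 2002-03 12.5
```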
25. Data Mapping and Type Conversions
- Need to implement logic to choose which values to map into which fields (e.g., different gender codes from different systems).
- Need to implement IF-THEN-ELSE (or CASE) logic to determine how individual fields will be populated.
- Often need to convert data types from one form to another:
  - Example: a date stored in a CHAR field, converted to DATE.
  - Example: time of day stored as part of a date; parse it out and store it as an integer.
- ETL tools are really good at mapping and converting from one data type to another (a sketch follows below).
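A minimal mapping and type-conversion sketch; the gender codes and the 'YYYYMMDD' CHAR date layout are assumptions for illustration:

```python
from datetime import datetime

GENDER_MAP = {"M": "M", "MALE": "M", "1": "M", "F": "F", "FEMALE": "F", "2": "F"}

def map_gender(code: str) -> str:
    return GENDER_MAP.get(code.strip().upper(), "U")       # "U" = unknown

def char_to_date(char_date: str):
    return datetime.strptime(char_date, "%Y%m%d").date()   # CHAR 'YYYYMMDD' -> DATE

def time_of_day_seconds(timestamp: str) -> int:
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return ts.hour * 3600 + ts.minute * 60 + ts.second     # time of day as an integer

print(map_gender("female"))                        # F
print(char_to_date("20020319"))                    # 2002-03-19
print(time_of_day_seconds("2002-03-19 14:30:05"))  # 52205
```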
26. Surrogate Key Processing
- The design needs to consider two aspects of surrogate key processing:
  - Assigning and maintaining the keys associated with dimension records
  - Assigning surrogate keys to fact records
27. Surrogate Keys
- For a given dimension table, the initial setup involves two steps:
  - Create a sequence that will provide the surrogate keys for the dimension table. The sequence generates a sequential set of integers that will be assigned to each row in the dimension table.
  - Create a key map table that will map production key(s) to the assigned surrogate key.
28. Surrogate Keys - Key Map Table
- Assume we have products coming from two systems and there is overlap in the products in the systems.
- Assume we have a method to tell us that X2334R is the same as STDHMMR.
[Table: sample product records from System 1 and System 2]
29. Surrogate Keys - Key Map Table
- The role of the key map table is to create and maintain a record of the production key(s) that map to the surrogate key assigned to a given dimension row.
- Simple format, using the previous example: [shown on slide]
What is the primary use of a key map table?
30. Surrogate Key Assignment
- Surrogate key assignment is a two-step process:
  - Manage assignment of keys to dimension records as primary keys
  - Manage assignment of surrogate keys to fact records
- The key map table is critical to this process as a lookup table.
31. Surrogate Key Assignment Process - Dimensions
- For each new product record (Product_ID, Description, other attributes):
  1. Determine whether the production key is already mapped in the Product Key Map table.
  2. If not, get the next sequential key from the product sequence.
  3. Update the key map table with the Product_ID and surrogate key.
  4. Update the new record with the surrogate key (Surrogate_Key, Product_ID, Description, other attributes).
- A minimal sketch of this process follows below.
Why keep the production key?
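A minimal sketch of the dimension-side process, with a Python counter standing in for the database sequence and a dictionary standing in for the key map table (the sample record is illustrative):

```python
from itertools import count

product_sequence = count(1)   # stand-in for the surrogate key sequence
product_key_map = {}          # stand-in for the key map table: Product_ID -> Surrogate_Key

def assign_surrogate_key(record: dict) -> dict:
    production_key = record["Product_ID"]
    if production_key not in product_key_map:           # 1. is the production key mapped?
        surrogate = next(product_sequence)               # 2. get the next sequential key
        product_key_map[production_key] = surrogate      # 3. update the key map table
    return {"Surrogate_Key": product_key_map[production_key], **record}   # 4. key the record

print(assign_surrogate_key({"Product_ID": "X2334R", "Description": "Standard hammer"}))
print(assign_surrogate_key({"Product_ID": "X2334R", "Description": "Standard hammer"}))  # reuses key 1
```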
32. Surrogate Key Assignment - Facts
- Consider an order fact record: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM.
- Using the key map tables as lookups:
  - Replace Customer_ID with the customer surrogate key (Customer Key Map table)
  - Replace Emp_ID/Branch_ID with the location surrogate key (Location Key Map table)
  - Replace the timestamp with the date surrogate key (Date Key Map table)
  - Replace Product_ID with the product surrogate key (Product Key Map table)
- Result: Order Fact with Cust_Key, Loc_Key, Date_Key, Product_Key, Order_ID, Line_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM (a lookup sketch follows below).
Why do we keep Order_ID and Line_ID? What are these called?
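A minimal sketch of the fact-side substitution, with illustrative key map contents held in dictionaries:

```python
customer_key_map = {"C1001": 1}
location_key_map = {(101, 7): 4}        # (Emp_ID, Branch_ID) -> Loc_Key
date_key_map = {"2002-03-19": 737}
product_key_map = {"X2334R": 1}

def key_fact(fact: dict) -> dict:
    return {
        "Cust_Key": customer_key_map[fact["Customer_ID"]],
        "Loc_Key": location_key_map[(fact["Emp_ID"], fact["Branch_ID"])],
        "Date_Key": date_key_map[fact["Timestamp"][:10]],
        "Product_Key": product_key_map[fact["Product_ID"]],
        # Order_ID and Line_ID are carried across unchanged.
        "Order_ID": fact["Order_ID"], "Line_ID": fact["Line_ID"],
        "Price": fact["Price"], "Qty": fact["Qty"], "Cost": fact["Cost"],
    }

fact = {"Customer_ID": "C1001", "Emp_ID": 101, "Branch_ID": 7,
        "Timestamp": "2002-03-19 14:30:05", "Order_ID": 555, "Line_ID": 1,
        "Product_ID": "X2334R", "Price": 10.0, "Qty": 3, "Cost": 6.0}
print(key_fact(fact))
```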
33. Slowly Changing Dimensions
- We need to build the logic for processing slowly changing dimensions, and this can be challenging.
- Note that for a complex dimension like customer, we may implement multiple change strategies, depending on what data is changing!
  - Type 1
  - Type 2
  - Type 3
  - Demographics
- A minimal Type 1 / Type 2 sketch follows below.
Examples of where each technique would be applied?
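A minimal sketch of Type 1 (overwrite) versus Type 2 (new row with a new surrogate key); the effective/end-date columns are one common Type 2 design, assumed here for illustration:

```python
from datetime import date

def apply_type1(current_row: dict, changes: dict) -> dict:
    return {**current_row, **changes}                 # overwrite in place, no history kept

def apply_type2(dimension: list, current_row: dict, changes: dict, next_key: int) -> None:
    current_row["End_Date"] = date.today()            # expire the current version
    dimension.append({**current_row, **changes,       # insert a new version
                      "Surrogate_Key": next_key,
                      "Effective_Date": date.today(), "End_Date": None})

customer_dim = [{"Surrogate_Key": 1, "Customer_ID": "C1001", "Name": "Robert Jones",
                 "Province": "NS", "Effective_Date": date(2001, 1, 1), "End_Date": None}]

apply_type2(customer_dim, customer_dim[0], {"Province": "ON"}, next_key=2)
for row in customer_dim:
    print(row)
```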
34. Audit
- As with extract, it's important to build audit and error-handling logic into the process.
35. Load Process
- Always use the bulk loader for ALL inserts, even those from within a database.
  - Most DBMSs provide a bulk loader facility.
  - Basically, the bulk loader provides the ability to load data more quickly because it takes a number of shortcuts.
  - For example, the bulk loader does not insert records into the recovery log on a record-by-record basis.
- Presort what is to be loaded by primary key.
  - The DBMS provides capabilities for this.
  - Can also purchase tools like SyncSort, which sort data very quickly.
- A hedged bulk-load sketch follows below.
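A hedged sketch of invoking a bulk-load path from Python, assuming PostgreSQL with psycopg2; the connection string, table, and file names are hypothetical, and other DBMSs expose their own bulk loaders:

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")               # hypothetical connection
with conn, conn.cursor() as cur, open("order_fact_presorted.csv") as f:
    # The file is presorted by the fact table's primary key before loading.
    cur.copy_expert("COPY order_fact FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()
```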
36. Load Process
- Many DBMSs allow simple transformations (e.g., aggregation of data) during the load process.
  - Rule of thumb: stay away from any transformations using the load facilities.
- Index management: need to determine whether or not to drop the index(es) on a table before the load, then rebuild (see the sketch below).
  - Rule of thumb: if loading more than 10% of the table, drop and recreate.
- Use the parallel capabilities of the DBMS, OS, and hardware to do as much processing simultaneously as possible.
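A hedged sketch of the drop-and-recreate index pattern around a large load, again assuming PostgreSQL and hypothetical table/index names:

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS order_fact_product_idx")
    # ... bulk load the fact table here (see the previous sketch) ...
    cur.execute("CREATE INDEX order_fact_product_idx ON order_fact (product_key)")
conn.close()
```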
37. Design Process Overview
- Step 1: Environment setup
  - Test, choose, and implement tools:
    - ETL
    - Cleaning/standardization
    - Sort utilities
    - Code management
    - Metadata
  - Define documentation standards
  - Ensure all design documentation is saved as metadata
38. Design Process Overview
- Step 2: High-level plan
  - Create a very high-level, one-page schematic of the source-to-target flow
  - Identify starting and ending points
  - Label known data sources
  - Label targets
  - Include notes about known gotchas
39. Example Process Overview
[Example source-to-target process flow diagram shown on slide]
40. Design Process Overview
- Step 3: Detailed design
  - Drill down by target table or logical group of target tables, graphically sketching any complex data restructuring or transformations
  - Graphically illustrate the surrogate-key generation process
  - Develop a preliminary job sequencing
41. Design Process Overview
- Step 4: Design ETL for a simple dimension
  - Start with a static extract, transform, and load
- Step 5: Add change logic for the dimension
  - Design and implement the logic required for incremental extract and processing
  - Design and implement the logic to manage slowly changing dimensions
- Step 6: Design ETL for the initial (historical) load of facts
- Step 7: Design ETL for the incremental ETL of facts
  - Tune the fact table load to take advantage of parallelism, bulk load, etc.
42. Design Process Overview
- Step 8: Design and build the process for aggregates and for moving data from the data warehouse to the data marts
  - Special consideration is required for the design of MDDB-based marts
- Step 9: Build ETL automation, including:
  - Job scheduling
  - Statistics generation
  - Error handling