Title: Extract, Transform and Load
1. Extract, Transform and Load
2. Objectives
- Describe the key design considerations for transformation and load
- Describe a recommended process for designing, coding, and testing the ETL process
3. Data Engineering
[Diagram: Acquisition, Engineering, Storage, Data Delivery, Access]
- Purpose: applies the business meaning and rules to the incoming (extracted) data and performs all transformations necessary to comply with them. Re-engineered data is the transformed data that meets the business and integrity requirements of the warehouse data models.
[Diagram: source data model(s) and warehouse data model metadata drive the transformation, producing staged transformed data]
4. Data Engineering
[Diagram: Acquisition, Engineering, Storage, Data Delivery, Access]
- Considerations for design
  - Standardization
  - Cleaning
  - Denormalization
  - Data mapping and type conversion
  - Derivation
  - Aggregation(s)
  - Integration: assigning surrogate keys
  - Managing slowly changing dimensions
  - Audit
  - Loading
5. Standardization
- Often, data extracted from the source system needs to be cleaned and standardized. The best way to understand this is through some examples.
- Consider the following names:
  - Bob Jones
  - Rob Jones
  - Bobby Jones
  - Robert Jones
- These can all be standardized as Robert Jones (a minimal sketch follows below).
Why would we want to standardize these names?
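A minimal Python sketch of this kind of name standardization, assuming a small hypothetical nickname lookup table (commercial tools ship with far larger reference dictionaries):

```python
# Hypothetical nickname lookup; a real tool's reference data is far more extensive.
NICKNAME_TO_STANDARD = {
    "bob": "Robert",
    "bobby": "Robert",
    "rob": "Robert",
    "robert": "Robert",
}

def standardize_name(full_name: str) -> str:
    """Standardize the first name of a 'First Last' string."""
    first, _, last = full_name.strip().partition(" ")
    standard_first = NICKNAME_TO_STANDARD.get(first.lower(), first.title())
    return f"{standard_first} {last.title()}".strip()

for name in ["Bob Jones", "Rob Jones", "Bobby Jones", "Robert Jones"]:
    print(standardize_name(name))   # each prints "Robert Jones"
```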
6. Standardization
- Consider the following addresses: [address variants shown on slide]
- These should be standardized as:
  - 123 Main, Antigonish, Nova Scotia, B2G 2A1
Why would we want to do this?
7. Standardization
- Standardization is a key step in the process of integrating data.
- Using the previous example, assume we get a customer record from two different sources.
[Table: the same customer as recorded in System 1 and System 2]
8. Standardization
- In the previous example, it is probably safe to assume that the individual referenced is in fact the same person.
- As such, we would want only one record for this individual in our data warehouse.
- Assuming we use different primary keys in each system (a pretty safe assumption), we would need to match on name and address to determine that this is the same person.
- Standardization of name and address gives us the ability to match the individual records from multiple systems and integrate them into a single customer record.
9. Cleaning
- Cleaning is a process used to remove unwanted noise from data and to make data more presentable.
- Consider the following shipping address:
  - Address 1: 123 Main, deliver to rear entrance
  - City: Antigonish
  - Postal Code: B2G 2A1
  - Province: NS
- The cleaning process would remove the comment in the address field.
10. Cleaning
- Consider the following name:
  - BOB JONES
- Many tools provide a "nice name" function that will turn the above into Bob Jones.
- Consider the following phone number:
  - 9028630440
- Many tools provide the ability to turn this into (902) 863-0440 (see the sketch below).
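A minimal cleaning sketch along these lines, assuming the comment keywords, name, and phone formats shown in the examples (real tools handle many more cases):

```python
import re

def clean_address(address: str) -> str:
    # Assumes anything from a known comment keyword onward is noise to remove.
    return re.split(r"\b(deliver|attn|attention)\b", address, flags=re.IGNORECASE)[0].strip()

def nice_name(name: str) -> str:
    return name.title()              # "BOB JONES" -> "Bob Jones"

def format_phone(raw: str) -> str:
    digits = re.sub(r"\D", "", raw)  # keep digits only
    return f"({digits[0:3]}) {digits[3:6]}-{digits[6:10]}" if len(digits) == 10 else raw

print(clean_address("123 Main deliver to rear entrance"))  # "123 Main"
print(nice_name("BOB JONES"))                               # "Bob Jones"
print(format_phone("9028630440"))                           # "(902) 863-0440"
```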
11. Cleaning
- Disney example:
  - Disney provided kiosks in Epcot that allowed visitors to sign up to receive information about Disney vacations (a great marketing tool).
  - They got a high number of bogus sign-ups, like:
    - Goofy, 123 Main St
    - Mickey Mouse, etc.
  - They built rules (using a tool) to exclude these records from the DW.
  - Obscenity checking was also done and offending records eliminated.
12. Cleaning and Standardization Tools
- Required primarily for cleaning and standardizing the customer dimension and, to a lesser extent, the product dimension.
- Provide lots of functionality for cleaning and standardizing names and addresses.
- Key functionality: the ability to match records based on very sophisticated heuristic rules.
- Provide users the ability to define their own rules.
- Allow matching on a percentage match or score; users define what constitutes a match (80%, 90%, 100%?). A minimal scoring sketch follows below.
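A minimal sketch of score-based matching, using a simple string-similarity ratio in place of a commercial tool's heuristics; the records and the 90% threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.90   # user-defined: what constitutes a match (80%? 90%? 100%?)

def match_score(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

rec1 = "Robert Jones, 123 Main, Antigonish, NS, B2G 2A1"
rec2 = "Robert Jones, 123 Main Street, Antigonish, NS, B2G 2A1"

score = match_score(rec1, rec2)
print(f"score={score:.2f}, match={score >= MATCH_THRESHOLD}")
```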
13. Cleaning and Standardization Tools
- Example of advanced functionality for business name/address matching.
- Are the following companies in fact the same company? [examples shown on slide]
- Cleaning and standardization tools provide the ability to find that A matches B and B matches C, and therefore group A, B, and C as the same company.
14. Cleaning and Standardization Vendors
- Leading vendors:
  - ID Centric from FirstLogic
  - Trillium from Harte Hanks
  - Code 1 from Group 1 Software
  - IDR from Vality
- Typical cost of the software: $250,000 US and up
15. Denormalization
- Denormalization is the process of flattening dimension and fact tables from multiple tables into (in most cases) a single table.
- Consider the following order (fact) data from a production system:
  - Order Header: Order_ID, Order Timestamp
  - Order Line: Order_ID, Line_ID, Product_ID, Price, Qty, Cost
  - Customer: Cust_ID
  - Order Taker: Emp_ID, Branch_ID, Name
  - Product: Product_ID
  - Branch: Branch_ID
- These flatten into a single Order Fact: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost
16. Denormalization - Dimensions
- Consider the employee-to-branch relationship:
  - Order Taker: Emp_ID, Branch_ID, Name
  - Branch: Branch_ID, Sales_Region, City, Province
- We would probably denormalize that into something like a location dimension (see the sketch below):
  - Location: Emp_ID, Branch_ID, Name, Sales_Region, City, Province
What are the benefits of denormalization?
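A minimal sketch of the flattening, joining illustrative Order Taker and Branch rows into a Location dimension row on Branch_ID:

```python
# Illustrative source rows; the attribute values are assumptions for the example.
order_takers = [
    {"Emp_ID": 101, "Branch_ID": 7, "Name": "Robert Jones"},
]
branches = {
    7: {"Sales_Region": "Atlantic", "City": "Antigonish", "Province": "NS"},
}

# Join on Branch_ID and keep all columns in a single (denormalized) row.
location_dim = [
    {**taker, **branches[taker["Branch_ID"]]}
    for taker in order_takers
]
print(location_dim[0])
# {'Emp_ID': 101, 'Branch_ID': 7, 'Name': 'Robert Jones',
#  'Sales_Region': 'Atlantic', 'City': 'Antigonish', 'Province': 'NS'}
```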
17. Derivation
- We often need to derive data items. A simple example is measures on the fact table. Consider the order fact:
  - Source: Order Fact: Customer_ID, Emp_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost
  - Derived: Order Fact: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM
- In this case we may want to derive data to create the Revenue, Line_GM (gross margin), and Line_cost attributes, as they are often used as measures (a derivation sketch follows below).
What is the benefit of deriving these attributes in the ETL process?
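A minimal derivation sketch, assuming Price and Cost on the source record are per-unit values:

```python
def derive_measures(fact: dict) -> dict:
    fact = dict(fact)                                        # work on a copy
    fact["Revenue"] = fact["Price"] * fact["Qty"]
    fact["Line_cost"] = fact["Cost"] * fact["Qty"]
    fact["Line_GM"] = fact["Revenue"] - fact["Line_cost"]    # gross margin
    return fact

row = {"Order_ID": 1, "Line_ID": 1, "Product_ID": "X2334R",
       "Price": 10.0, "Qty": 3, "Cost": 6.0}
print(derive_measures(row))   # Revenue=30.0, Line_cost=18.0, Line_GM=12.0
```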
18. Derivation - Households
- A more complex example.
- Often, marketers are interested in understanding the households that customers live in and the relationships that exist (husband, wife, kids, etc.). Typically, this information is not available from any source system and needs to be derived.
- A household may have more than one address (if, for example, a family has a permanent and a summer address).
Why would marketers want this information?
19. Derivation - Household Example
- Typically, household derivation is done using cleaning/standardization tools like ID Centric and Trillium.
- These tools look to match individuals into households based on standardized names and addresses.
- Examples: [shown on slide]
Is this a household?
20. Derivation - Examples
[Two further examples shown on slide, each asking: Is this a household? A conservative grouping sketch follows below.]
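A deliberately conservative sketch of household grouping, assuming individuals share a standardized address and surname; real tools apply many more matching rules:

```python
from collections import defaultdict

people = [
    {"Name": "Robert Jones", "Address": "123 Main, Antigonish, NS, B2G 2A1"},
    {"Name": "Mary Jones",   "Address": "123 Main, Antigonish, NS, B2G 2A1"},
    {"Name": "Ann Smith",    "Address": "45 Oak, Antigonish, NS, B2G 2A1"},
]

households = defaultdict(list)
for p in people:
    surname = p["Name"].split()[-1].lower()
    key = (p["Address"].lower(), surname)   # conservative: same address AND surname
    households[key].append(p["Name"])

for key, members in households.items():
    print(key, "->", members)
# The two Joneses group together; Ann Smith forms her own household.
```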
21. Derivation - Households: Conclusions
- Deriving households can be tricky, and organizations tend to be pretty conservative in their approach.
- They don't want to assume a relationship that does not exist; that may be offensive.
- This is balanced against the desire not to (for example) make the same marketing offer (for credit cards, as an example) multiple times to the same household.
22. Aggregation
- Often, we need to generate aggregated fact data for two main reasons:
  - Speed up queries
  - Generate snapshot facts
- Speed up query processing:
  - The DBA continuously monitors the performance of queries and will create an aggregate fact when atomic facts are often rolled up a certain way.
  - The most common rollup is by date.
  - Assuming we keep atomic fact data, it is very common that we want to analyze orders by a time period such as week or month.
  - If we build aggregate fact tables, these can be used in place of atomic-level facts and speed up query processing, as they contain many fewer records.
23. Aggregation
- Speed up queries
  - Note that aggregate fact tables retain references to the other dimension tables.
  - Many query tools are "aggregate aware": they can detect when an aggregate table exists for a fact and automatically select that table in the query. This is transparent to the user.
- Snapshot facts
  - As discussed in a previous lecture, organizations typically want both a transactional and a snapshot view of data.
  - Generation of snapshot facts requires a significant amount of aggregation of data.
24. Aggregation
- When calculating aggregates, the following is typically true. Assume we are building an aggregate order fact where we want products ordered by customer, by location, by month (a rollup sketch follows below):
  - The data for the current month would not be available until the end of the month.
  - The FKs to the other relevant dimensions are the same as in the atomic-level fact.
  - The FK to the date dimension needs to point to a month, not a day. To achieve this, we need to specify levels in the date dimension so that each unique month has its own unique key. So, in the date dimension, we have records that represent unique days (3/19/02) AND unique months (3/02). This makes things more complex for users.
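A minimal rollup sketch for the customer/location/month aggregate, with illustrative atomic facts; resolving the month-level date key against the date dimension is assumed to happen separately:

```python
from collections import defaultdict

atomic_facts = [
    {"Cust_Key": 1, "Loc_Key": 7, "Date": "2002-03-19", "Revenue": 30.0},
    {"Cust_Key": 1, "Loc_Key": 7, "Date": "2002-03-25", "Revenue": 45.0},
    {"Cust_Key": 2, "Loc_Key": 7, "Date": "2002-03-20", "Revenue": 12.5},
]

monthly = defaultdict(float)
for f in atomic_facts:
    month = f["Date"][:7]                                    # "2002-03"
    monthly[(f["Cust_Key"], f["Loc_Key"], month)] += f["Revenue"]

for (cust, loc, month), revenue in sorted(monthly.items()):
    print(cust, loc, month, revenue)
# prints: 1 7 2002-03 75.0  and  2 7 2002-03 12.5
```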
25. Data Mapping and Type Conversions
- Need to implement logic to choose which values to map into which fields (e.g., different gender codes from different systems).
- Need to implement IF-THEN-ELSE (or CASE) logic to determine how individual fields will be populated.
- Often need to convert data types from one form to another:
  - Example: a date stored in a CHAR field, converted to DATE.
  - Example: time of day stored as part of a date; parse it out and store it as an integer.
- ETL tools are really good at mapping and converting from one data type to another (a sketch follows below).
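A minimal mapping and type-conversion sketch; the gender codes and the 'YYYYMMDD' CHAR date layout are assumptions for illustration:

```python
from datetime import datetime

GENDER_MAP = {"M": "M", "MALE": "M", "1": "M", "F": "F", "FEMALE": "F", "2": "F"}

def map_gender(code: str) -> str:
    return GENDER_MAP.get(code.strip().upper(), "U")       # "U" = unknown

def char_to_date(char_date: str):
    return datetime.strptime(char_date, "%Y%m%d").date()   # CHAR 'YYYYMMDD' -> DATE

def time_of_day_seconds(timestamp: str) -> int:
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return ts.hour * 3600 + ts.minute * 60 + ts.second     # time of day as an integer

print(map_gender("female"))                        # F
print(char_to_date("20020319"))                    # 2002-03-19
print(time_of_day_seconds("2002-03-19 14:30:05"))  # 52205
```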
26. Surrogate Key Processing
- The design needs to consider two aspects of surrogate key processing:
  - Assigning and maintaining the keys associated with dimension records
  - Assigning surrogate keys to fact records
27. Surrogate Keys
- For a given dimension table, the initial setup involves two steps:
  - Create a sequence that will provide the surrogate keys for the dimension table. The sequence generates a sequential set of integers that will be assigned to each row in the dimension table.
  - Create a key map table that will map production key(s) to the assigned surrogate key.
28. Surrogate Keys - Key Map Table
- Assume we have products coming from two systems and there is overlap in the products in the systems.
- Assume we have a method to tell us that X2334R is the same as STDHMMR.
[Table: sample product records from System 1 and System 2]
29. Surrogate Keys - Key Map Table
- The role of the key map table is to create and maintain a record of the production key(s) that map to the surrogate key assigned to a given dimension row.
- Simple format, using the previous example: [shown on slide]
What is the primary use of a key map table?
30. Surrogate Key Assignment
- Surrogate key assignment is a two-step process:
  - Manage assignment of keys to dimension records as primary keys
  - Manage assignment of surrogate keys to fact records
- The key map table is critical to this process as a lookup table.
31. Surrogate Key Assignment Process - Dimensions
- For each new product record (Product_ID, Description, other attributes):
  1. Determine whether the production key is already mapped in the Product Key Map table.
  2. If not, get the next sequential key from the product sequence.
  3. Update the key map table with the Product_ID and surrogate key.
  4. Update the new record with the surrogate key (Surrogate_Key, Product_ID, Description, other attributes).
- A minimal sketch of this process follows below.
Why keep the production key?
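A minimal sketch of the dimension-side process, with a Python counter standing in for the database sequence and a dictionary standing in for the key map table (the sample record is illustrative):

```python
from itertools import count

product_sequence = count(1)   # stand-in for the surrogate key sequence
product_key_map = {}          # stand-in for the key map table: Product_ID -> Surrogate_Key

def assign_surrogate_key(record: dict) -> dict:
    production_key = record["Product_ID"]
    if production_key not in product_key_map:           # 1. is the production key mapped?
        surrogate = next(product_sequence)               # 2. get the next sequential key
        product_key_map[production_key] = surrogate      # 3. update the key map table
    return {"Surrogate_Key": product_key_map[production_key], **record}   # 4. key the record

print(assign_surrogate_key({"Product_ID": "X2334R", "Description": "Standard hammer"}))
print(assign_surrogate_key({"Product_ID": "X2334R", "Description": "Standard hammer"}))  # reuses key 1
```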
32. Surrogate Key Assignment - Facts
- Consider an order fact record: Customer_ID, Emp_ID, Branch_ID, Timestamp, Order_ID, Line_ID, Product_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM.
- Using the key map tables as lookups:
  - Replace Customer_ID with the customer surrogate key (Customer Key Map table)
  - Replace Emp_ID/Branch_ID with the location surrogate key (Location Key Map table)
  - Replace the timestamp with the date surrogate key (Date Key Map table)
  - Replace Product_ID with the product surrogate key (Product Key Map table)
- Result: Order Fact with Cust_Key, Loc_Key, Date_Key, Product_Key, Order_ID, Line_ID, Price, Qty, Cost, Revenue, Line_cost, Line_GM (a lookup sketch follows below).
Why do we keep Order_ID and Line_ID? What are these called?
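A minimal sketch of the fact-side substitution, with illustrative key map contents held in dictionaries:

```python
customer_key_map = {"C1001": 1}
location_key_map = {(101, 7): 4}        # (Emp_ID, Branch_ID) -> Loc_Key
date_key_map = {"2002-03-19": 737}
product_key_map = {"X2334R": 1}

def key_fact(fact: dict) -> dict:
    return {
        "Cust_Key": customer_key_map[fact["Customer_ID"]],
        "Loc_Key": location_key_map[(fact["Emp_ID"], fact["Branch_ID"])],
        "Date_Key": date_key_map[fact["Timestamp"][:10]],
        "Product_Key": product_key_map[fact["Product_ID"]],
        # Order_ID and Line_ID are carried across unchanged.
        "Order_ID": fact["Order_ID"], "Line_ID": fact["Line_ID"],
        "Price": fact["Price"], "Qty": fact["Qty"], "Cost": fact["Cost"],
    }

fact = {"Customer_ID": "C1001", "Emp_ID": 101, "Branch_ID": 7,
        "Timestamp": "2002-03-19 14:30:05", "Order_ID": 555, "Line_ID": 1,
        "Product_ID": "X2334R", "Price": 10.0, "Qty": 3, "Cost": 6.0}
print(key_fact(fact))
```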
33. Slowly Changing Dimensions
- We need to build the logic for processing slowly changing dimensions, and this can be challenging.
- Note that for a complex dimension like customer, we may implement multiple change strategies, depending on what data is changing!
  - Type 1
  - Type 2
  - Type 3
  - Demographics
- A minimal Type 1 / Type 2 sketch follows below.
Examples of where each technique would be applied?
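A minimal sketch of Type 1 (overwrite) versus Type 2 (new row with a new surrogate key); the effective/end-date columns are one common Type 2 design, assumed here for illustration:

```python
from datetime import date

def apply_type1(current_row: dict, changes: dict) -> dict:
    return {**current_row, **changes}                 # overwrite in place, no history kept

def apply_type2(dimension: list, current_row: dict, changes: dict, next_key: int) -> None:
    current_row["End_Date"] = date.today()            # expire the current version
    dimension.append({**current_row, **changes,       # insert a new version
                      "Surrogate_Key": next_key,
                      "Effective_Date": date.today(), "End_Date": None})

customer_dim = [{"Surrogate_Key": 1, "Customer_ID": "C1001", "Name": "Robert Jones",
                 "Province": "NS", "Effective_Date": date(2001, 1, 1), "End_Date": None}]

apply_type2(customer_dim, customer_dim[0], {"Province": "ON"}, next_key=2)
for row in customer_dim:
    print(row)
```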
34. Audit
- As with extract, it's important to build audit and error-handling logic into the process.
35. Load Process
- Always use the bulk loader for ALL inserts, even those from within a database.
  - Most DBMSs provide a bulk loader facility.
  - Basically, the bulk loader provides the ability to load data more quickly because it takes a number of shortcuts.
  - For example, the bulk loader does not insert records into the recovery log on a record-by-record basis.
- Presort what is to be loaded by primary key.
  - The DBMS provides capabilities for this.
  - Can also purchase tools like SyncSort, which sort data very quickly.
- A hedged bulk-load sketch follows below.
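A hedged sketch of invoking a bulk-load path from Python, assuming PostgreSQL with psycopg2; the connection string, table, and file names are hypothetical, and other DBMSs expose their own bulk loaders:

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")               # hypothetical connection
with conn, conn.cursor() as cur, open("order_fact_presorted.csv") as f:
    # The file is presorted by the fact table's primary key before loading.
    cur.copy_expert("COPY order_fact FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.close()
```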
36. Load Process
- Many DBMSs allow simple transformations (e.g., aggregation of data) during the load process.
  - Rule of thumb: stay away from any transformations using the load facilities.
- Index management: need to determine whether or not to drop the index(es) on a table before the load, then rebuild (see the sketch below).
  - Rule of thumb: if loading more than 10% of the table, drop and recreate.
- Use the parallel capabilities of the DBMS, OS, and hardware to do as much processing simultaneously as possible.
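A hedged sketch of the drop-and-recreate index pattern around a large load, again assuming PostgreSQL and hypothetical table/index names:

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")
with conn, conn.cursor() as cur:
    cur.execute("DROP INDEX IF EXISTS order_fact_product_idx")
    # ... bulk load the fact table here (see the previous sketch) ...
    cur.execute("CREATE INDEX order_fact_product_idx ON order_fact (product_key)")
conn.close()
```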
37. Design Process Overview
- Step 1: Environment setup
  - Test, choose, and implement tools:
    - ETL
    - Cleaning/standardization
    - Sort utilities
    - Code management
    - Metadata
  - Define documentation standards
  - Ensure all design documentation is saved as metadata
38. Design Process Overview
- Step 2: High-level plan
  - Create a very high-level, one-page schematic of the source-to-target flow
  - Identify starting and ending points
  - Label known data sources
  - Label targets
  - Include notes about known gotchas
39. Example Process Overview
[Example source-to-target process flow diagram shown on slide]
40. Design Process Overview
- Step 3: Detailed design
  - Drill down by target table or logical group of target tables, graphically sketching any complex data restructuring or transformations
  - Graphically illustrate the surrogate-key generation process
  - Develop a preliminary job sequencing
41. Design Process Overview
- Step 4: Design ETL for a simple dimension
  - Start with a static extract, transform, and load
- Step 5: Add change logic for the dimension
  - Design and implement the logic required for incremental extract and processing
  - Design and implement the logic to manage slowly changing dimensions
- Step 6: Design ETL for the initial (historical) load of facts
- Step 7: Design ETL for the incremental ETL of facts
  - Tune the fact table load to take advantage of parallelism, bulk load, etc.
42. Design Process Overview
- Step 8: Design and build the process for aggregates and for moving data from the data warehouse to the data marts
  - Special consideration is required for the design of MDDB-based marts
- Step 9: Build ETL automation, including:
  - Job scheduling
  - Statistics generation
  - Error handling