Title: Data Warehouse Lecture 3
1. Data Warehouse Lecture 3
- ETL Tools, Techniques and Best Practices
2. Agenda
- ETL tools, techniques, methodologies, best practices, delivering dimension tables to the DW
- Sources used for this lecture
  - Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
3. ETL Solution Desired Features
An ETL tool is a tool that reads data from one or more sources, transforms the data so that it is compatible with a destination, and loads the data into the destination.
- Automated data movement across data stores and the analytical area in support of a data warehouse, data mart or ODS effort
- Extensible, robust, scalable infrastructure
- Standardization of ETL processes across the enterprise
- Reusability of custom and pre-built functions
- Better utilization of existing hardware resources
- Faster change-control management
- Integrated metadata management
- Complete development environment, "work as you think" design metaphor
4. Custom ETL Solutions
- Execute stored procedures at the source to extract data, perhaps filtering some
- Store the extracted data into staging tables at the source
- Generate a file with the source staging data
- Transfer the file to the destination
- Load the data into staging tables at the destination
- Execute stored procedures at the destination to read the incoming staged data and compute what changed since the last time by comparing with the current value of the data warehouse
- Populate the data warehouse with the new changes
- Supports incremental updates to sources
- Supports simple data cleansing
- Logs warnings/errors
5. Vendor ETL Tools
- DataStage
  - big player, high ability to execute, gets good results from Gartner, visionary
- Informatica
  - another big player, gets good results from Gartner, visionary with high ability to execute
  - most expensive
- SAS ETL Server
  - fast becoming a major player, very positive results from Gartner
  - low exposure as an ETL tool (SAS is a significant statistical analysis vendor)
- Information Builders' Data Migrator/ETL Manager tool suite
  - part of Enterprise Focus/WebFocus
  - not a major player, but industrial-strength language, data connectors, etc.
- Sunopsis
  - cheap
  - relies on native RDBMS functionality
  - CIGNA people exposed to it at conferences liked it
6. Buy versus Build
- Vendor tools promote standardization of the ETL process and reusability of custom and pre-built functions, lowering the time (and cost) of additional ETL efforts
- Vendor ETL tools are somewhat self-documenting
- Many tools can connect to a variety of sources (RDBMSs, non-relational stores, different OSs, ERP systems such as PeopleSoft, etc.) without exposing the ETL developer to the differences
- Vendor tools deal better with changes in the source systems or in the necessary transformations, reducing long-term ETL maintenance
- Metadata management is a huge advantage, especially when sharing data from many applications
- Vendor ETL tools place discipline in the extraction and transformation process
- ETL prices have not dropped much over the years, but there is increased functionality, performance, and usability from vendor tools
7. To stage or not to stage
- A conflict between
  - getting the data from the operational systems as fast as possible
  - having the ability to restart without repeating the process from the beginning
- Reasons for staging
  - Recoverability: stage the data as soon as it has been extracted from the source systems, and immediately after major processing (cleaning, transformation, etc.)
  - Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
  - Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse
8. Designing the staging area
- The staging area is owned by the ETL team
  - no indexes, no aggregations, no presentation access, no querying, no service level agreements
- Users are not allowed in the staging area for any reason
  - staging is a construction site
- Reports cannot access data in the staging area
  - tables can be added or dropped without notifying the user community
- Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area)
- The staging area consists of both RDBMS tables and data files
9. Staging Area Data Structures in the ETL System
- Flat files
  - fast to write, append to, sort and filter (grep) but slow to update, access or join
  - enable restart without going back to the sources
- XML Data Sets (not really used in staging)
- Relational Tables
  - metadata, SQL interface, DBA support
- Dimensional Model Constructs: Facts, Dimensions, Atomic Fact Tables, Aggregate Fact Tables (OLAP Cubes)
- Surrogate Key Mapping Tables
  - map natural keys from the OLTP systems to the surrogate keys of the DW
  - can be stored in files or in the RDBMS (you can use the IDENTITY function if you go with the RDBMS approach)
- Best practices for these data structures
  - perform impact analysis, capture metadata, use naming conventions
10. Extracting
- Effectively integrate data from
  - different DBMSs, OSs, hardware, communication protocols
- Need a logical map, data movement view documents, a data lineage report
- Have a plan
  - identify source candidates
  - analyze source systems with a data profiling tool
  - receive a walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
  - data alterations during data cleansing, calculations and formulas
  - measure twice, cut once
  - standard conformance to dimensions and numerical facts
  - receive a walk-through of the dimensional model
11. Components of the Data Movement Views
- Target table and column, table type (Dimension, Fact)
- Slowly changing dimension type per target column
  - Type 1: overwrite (Customer first name)
  - Type 2: retain history (Customer last name)
  - Type 3: retain multiple valid alternative values
- Source database, table, column
- Transformations (the guts of the document)
- Where do you capture this information? With which tool? How do you maintain this metadata?
12. Keeping track of source systems
- Data modelers and ETL developers should maintain documentation of the source systems, including
  - subject area, application name, business name, department name
  - priority
  - business owner and technical owner
  - DBMS, production server
  - DB size, users, complexity, transactions per day
  - comments
- Determine the system of record
13. Some good rules
- Dealing with derived data (derive from base facts, or accept calculated columns from the source systems?)
  - if calculations are done by the DW infrastructure, the ETL developer is responsible for them -- and what if the numbers don't match?
  - recommendation: stay true to the definition of system-of-record
  - The further downstream you go from the originating data source, the more you increase the risk of extracting corrupt data. Barring rare exceptions, maintain the practice of sourcing data only from the system-of-record.
- Analyze your source system
  - get an ER model for the system or reverse engineer one (develop one by looking at the metadata of the system)
  - reverse engineering is not the same as forward engineering, i.e., given the ER models of the source systems, deriving the dimensional schema of the data warehouse
14. Data Analysis
- Reverse engineering: building an understanding of a source system
  - unique identifiers and natural keys
  - data types
  - relationships between tables (1-to-1, many-to-1, many-to-many); problematic when the source database does not have foreign keys defined
  - discrete relationships (static data, reference tables)
- Data content analysis
  - NULL values, especially in foreign keys; NULLs result in lossy joins
  - In spite of the most detailed analysis, we recommend using outer join logic when extracting from relational source systems, simply because referential integrity often cannot be trusted on remote systems (see the sketch below)
  - Dates in non-date fields
15. Extract data from disparate systems
- What is the standard for the enterprise?
- ODBC, OLE DB, JDBC, .NET
  - access databases from Windows applications, so that applications are portable
  - performance is the major drawback
  - every DBMS has an ODBC driver, even flat files
  - adds two layers of interaction between the ETL and the database
16. Extracting from different sources
- Mainframe
  - COBOL copybooks give you the data types
  - EBCDIC rather than the ASCII character set (FTP does the translation between the mainframe and Unix/Windows)
  - Working with redefined fields (to save space, the same field is used for different types of data)
  - Extracting from IMS, IDMS, Adabas: you need special adapters, or you get someone in those systems to give you a file
- XML sources, web log files: doable, if you understand the structure of those sources
- Enterprise Resource Planning (ERP) systems (SAP, PeopleSoft, Oracle)
  - Don't treat it as a relational system -- it's a mess
  - Use adapters
17. Extracting Changed Data
- Using audit columns
  - Use the last-update timestamp, populated by triggers or the front-end application
  - Must ensure that the timestamp is dependable, that is, if the front-end modifies it, a batch job does not override it
  - Index the timestamp if it is dependable
- Database log scraping or sniffing
  - Take the log of the source file and try to determine the transactions that affect you
  - Sniffing does it in real time
- Timed extracts
  - Retrieve all records from the source that were modified today
  - POTENTIALLY dangerous -- what if the process fails today? When it runs tomorrow, you'd have lost today's changes
- Process of elimination
  - Preserve yesterday's data in the staging area
  - Bring today's entire data into the staging area
  - Perform a comparison (see the sketch below)
  - Inefficient, but the most reliable
- Initial and incremental loads
  - Create two tables, previous-load and current-load
  - Load into the current-load, compare with the previous-load; when you are done, drop the previous-load, rename the current-load to previous-load, and create a new current-load
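A minimal sketch of the process-of-elimination comparison, assuming both snapshots are available in memory as dictionaries keyed by the natural key (keys and attribute values here are hypothetical):

    # Compare yesterday's and today's full extracts to detect changes.
    # Records are keyed by their natural key; values are the remaining attributes.
    yesterday = {
        "C001": ("Alice", "Boston"),
        "C002": ("Bob", "Chicago"),
    }
    today = {
        "C001": ("Alice", "Denver"),   # changed
        "C003": ("Carol", "Austin"),   # new
    }

    inserts = {k: today[k] for k in today.keys() - yesterday.keys()}
    deletes = {k: yesterday[k] for k in yesterday.keys() - today.keys()}
    updates = {k: today[k] for k in today.keys() & yesterday.keys()
               if today[k] != yesterday[k]}

    print("inserts:", inserts)   # {'C003': ('Carol', 'Austin')}
    print("deletes:", deletes)   # {'C002': ('Bob', 'Chicago')}
    print("updates:", updates)   # {'C001': ('Alice', 'Denver')}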
18. Tips for Extracting
- Constrain on indexed columns (see the sketch below)
- Retrieve only the data you need
- Use DISTINCT sparingly
- Use the SET operations sparingly
- Use HINTs (a HINT tells the DBMS to make sure it uses a certain index)
- Avoid NOT
- Avoid functions in the WHERE clause
- Avoid subqueries
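As an illustrative sketch of two of these tips (constrain on the indexed column, avoid functions in the WHERE clause), the queries below are held as strings; the orders table and its indexed order_ts column are hypothetical:

    # Illustrative only: two extraction queries as strings.
    bad_query = """
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE DATE(order_ts) = '2005-10-10'   -- function on the indexed column defeats the index
    """

    good_query = """
        SELECT order_id, customer_id, amount  -- retrieve only the columns you need
        FROM orders
        WHERE order_ts >= '2005-10-10'
          AND order_ts <  '2005-10-11'        -- range predicate directly on the indexed column
    """

    for label, sql in (("avoid", bad_query), ("prefer", good_query)):
        print(label, sql)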
19. Rules
- When a dimension is populated by several distinct source systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, to give them peace of mind that the dimension reflects data they can tie back to their transaction system.
20. Delivering Dimension Tables
21. The basic structure of a dimension
- Primary key (PK)
  - Meaningless, unique integer
  - Also known as the surrogate key
  - Joins to fact tables
  - Is a foreign key in fact tables
- Natural key (NK)
  - Meaningful key extracted from the source systems
  - 1-to-1 relationship to the PK for static dimensions
  - 1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
- Descriptive attributes
  - Primarily textual; numbers are legitimate, but not numbers that are measured quantities
  - Around 100 such attributes is normal
  - Static or slowly changing only
  - Product price -- either a fact or a dimension attribute
22. Generating surrogate keys for dimensions
- Via triggers in the DBMS
  - Read the latest surrogate key, generate the next value, create the record
  - Disadvantage: severe performance bottlenecks
- Via the ETL process: an ETL tool or a third-party application generates the unique numbers (see the sketch below)
  - A surrogate key counter per dimension
  - Maintain consistency of surrogate keys between dev, test and production
- Using smart keys
  - Concatenate the natural key of the dimension in the source(s) with the timestamp of the record in the source or the data warehouse
  - Tempting, but wrong
23. Why smart keys are wrong
- By definition
  - Surrogate keys are supposed to be meaningless
  - Do you update the concatenated smart key if the natural key changes?
- Performance
  - Natural keys may be chars and varchars, not integers
  - Adding a timestamp makes the key very big
  - The dimension is bigger
  - The fact tables containing the foreign key are bigger
  - Joining facts with dimensions on chars/varchars becomes inefficient
- Heterogeneous sources
  - Smart keys work for homogeneous environments, but more likely than not the sources are heterogeneous, each having its own definition of the dimension
  - How does the definition of the smart key change when another source is added? It doesn't scale very well.
- One advantage: simplicity in the ETL process
24. The grain of a dimension
- The definition of the key of the dimension in business terms: what does the dimension represent?
- Challenge: analyze the source systems so that a particular set of fields in that source corresponds to the grain of the dimension
- Verify that a given source (file) implements the intended grain
  - Nothing should be returned by the query below from the source system/file
  - If something is returned, the fields A, B and C do not represent the grain of the dimension

select A, B, C, count(*) from DimensionTableSource
group by A, B, C having count(*) > 1
25. The basic load plan for a dimension
- Simple case: the dimension is loaded as a lookup table
- Typical case
  - Data cleaning
    - Validate the data, apply business rules to make the data consistent, column validity enforcement, cross-column value checking, row de-duplication
  - Data conforming
    - Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
    - Fact tables: billing transactions, customer support calls -- IF they use the same dimensions, then the dimensions are conformed
  - Data delivery
    - All the steps required to deal with slowly changing dimensions
    - Write the dimension to the physical table
    - Creating and assigning the surrogate key, making sure the natural key is correct, etc.
26. Dimensions and embedded hierarchies
- Dimensions are de-normalized flat tables
- It is normal to have multiple embedded hierarchical structures in the same dimension
- E.g., the Store dimension
  - Geographic hierarchy: location, city, state, country
  - Merchandizing hierarchy: location, district, region
  - These can co-exist in the same dimension IF every attribute is single-valued in the presence of the dimension's primary key
  - If some attribute takes multiple values, it is not part of the dimension
  - E.g., a store can have multiple cash registers; if the grain of the dimension is the store, the cash register cannot be part of the store dimension. To include the cash register in the dimension, the grain must be changed to the cash register, not the store
- Normalized structures (3NF)
  - OK to exist in the staging area
  - should be flattened (2NF) before the delivery phase, using a simple query over the normalized structures (see the sketch below)
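A minimal sketch of flattening normalized staging tables into a flat dimension with a single join query (in-memory SQLite; all table and column names are hypothetical):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Normalized staging tables (3NF)
        CREATE TABLE stg_region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
        CREATE TABLE stg_district (district_id INTEGER PRIMARY KEY, district_name TEXT,
                                   region_id INTEGER REFERENCES stg_region);
        CREATE TABLE stg_store    (store_id INTEGER PRIMARY KEY, store_name TEXT,
                                   district_id INTEGER REFERENCES stg_district);
        INSERT INTO stg_region   VALUES (1, 'Northeast');
        INSERT INTO stg_district VALUES (10, 'Boston Metro', 1);
        INSERT INTO stg_store    VALUES (100, 'Main St Store', 10);

        -- Flat (de-normalized) store dimension produced by a simple join
        CREATE TABLE dim_store AS
        SELECT s.store_id, s.store_name, d.district_name, r.region_name
        FROM stg_store s
        JOIN stg_district d ON d.district_id = s.district_id
        JOIN stg_region   r ON r.region_id   = d.region_id;
    """)

    print(con.execute("SELECT * FROM dim_store").fetchall())
    # [(100, 'Main St Store', 'Boston Metro', 'Northeast')]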
27. Snow-flaked Dimensions
- Dimensions that remain normalized
- Look like a snowflake if the levels of the hierarchy obey perfect many-to-1 relationships
28. Flat Dimensions versus Snow-flakes
- There is no semantic difference; the information content of the two approaches is the same
- Problems with snow-flakes
  - If the hierarchical model changes, the normalized table schema and the declared joins must change, as well as the resulting end-user queries
  - More confusing to the end user; flat dimensions mask the complexity of the data
- However, appropriate snow-flakes do exist (as sub-dimensions of dimensions)
29. Date and Time Dimensions
- Needed virtually everywhere: measurements are defined at specific times, repeated over time, etc.
- Most common: a calendar-day dimension with the grain of a single day and many attributes
  - Doesn't have a conventional source
  - Built by hand, e.g., in a spreadsheet (see the sketch below)
  - Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
  - 10 years are about 4K rows
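A minimal sketch of generating the rows of a calendar-day dimension with a script rather than a spreadsheet; the column set here is a small, hypothetical subset of what a real date dimension would carry:

    from datetime import date, timedelta

    def build_date_dimension(start: date, end: date):
        """Yield one row per calendar day between start and end (inclusive)."""
        day = start
        while day <= end:
            yield {
                "date_key": int(day.strftime("%Y%m%d")),   # YYYYMMDD key; slide 30 discusses meaningless integers instead
                "full_date": day.isoformat(),
                "day_of_week": day.strftime("%A"),
                "month": day.month,
                "quarter": (day.month - 1) // 3 + 1,
                "year": day.year,
                "is_weekend": day.weekday() >= 5,
                # Holidays, fiscal periods, etc. would be merged in from a
                # hand-maintained list, as the slide notes.
            }
            day += timedelta(days=1)

    rows = list(build_date_dimension(date(2000, 1, 1), date(2009, 12, 31)))
    print(len(rows))       # 3653 rows -- "10 years are about 4K rows"
    print(rows[0])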
30. Date Dimension
- Note the natural key: a day type and a full date
  - Day type: date, plus non-date types such as "inapplicable date", "corrupted date", "hasn't happened yet"
  - Fact tables must point to a valid date from the dimension, so we need special date types, at least one: the N/A date
- How to generate the primary key?
  - Meaningless integer?
  - Or 10102005, meaning Oct 10, 2005? (reserving 9999999 to mean N/A?)
  - This is a close call, but even if meaningless integers are used, the numbers should be assigned in date order (why? Because of data partitioning requirements in a DW: data in a fact table can be partitioned by time)
31. Other Time Dimensions
- Also typically needed are time dimensions whose grain is a month, a week, a quarter or a year, if there are fact tables at each of these grains
- These are physically different tables
- They are generated by eliminating selected columns and rows from the Date dimension; keep either the first or the last day of the month
- Do NOT use database views
  - A view would drag a much larger table (the Date dimension) into a month-based fact table
32. Time Dimensions
- How about a time dimension based on seconds?
  - There are over 31 million seconds in a year!
  - Avoid them as dimensions
  - But keep the SQL date-timestamp data as basic attributes in facts (not as dimensions), if needed to compute precise queries based on specific times
- Older approach: keep a dimension of minutes or seconds and base it on an offset from midnight of each day, but it's messy when timestamps cross days
- Might need something fancier, though, if the enterprise has well-defined time slices within a day, such as shift names or advertising slots -- then build a dimension
33. Big and Small Dimensions
- SMALL
  - Examples: Transaction Type, Claim Status
  - Tiny lookup tables with only a few records and one or more columns
  - Built by typing into a spreadsheet and loading the data into the DW
  - These dimensions should NOT be conformed
  - JUNK dimension: a tactical maneuver to reduce the number of FKs from a fact table by combining the low-cardinality values of small dimensions into a single junk dimension; generate it as you go, don't generate the cartesian product (see the sketch below)
- BIG
  - Examples: Customer, Product, Location
  - Millions of records with hundreds of fields (insurance customers)
  - Or hundreds of millions of records with few fields (supermarket customers)
  - Always derived from multiple sources
  - These dimensions should be conformed
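A minimal sketch of generating a junk dimension as you go: a surrogate key is assigned only to the attribute combinations that actually occur in the incoming facts, instead of pre-building the full cartesian product (attribute names are hypothetical):

    # Junk dimension built lazily: one row per observed combination of
    # low-cardinality attributes, not the full cartesian product.
    junk_rows = {}            # (payment_type, order_status, gift_wrap) -> junk_key

    def junk_key_for(combo):
        if combo not in junk_rows:
            junk_rows[combo] = len(junk_rows) + 1
        return junk_rows[combo]

    incoming_facts = [
        ("credit", "shipped", False),
        ("cash",   "returned", False),
        ("credit", "shipped", False),    # same combo -> same key, no new row
    ]

    fact_fks = [junk_key_for(combo) for combo in incoming_facts]
    print(fact_fks)        # [1, 2, 1]
    print(junk_rows)       # only 2 junk rows, not the full cartesian product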
34. One dimension or two?
- Are the dimensions independent, or is there a statistical correlation between them?
- Weak statistical correlation and big dimensions
  - Even if there is a weak correlation, build the DW as if the dimensions are independent, especially for large dimensions
  - 100 stores and 1 million products would give 100 million rows for the combined dimension -- too large
  - If the statistical correlation is significant, build a fact table for the correlation; there may be many such correlations, e.g., a merchandizing correlation, a pricing-strategy correlation, a changing-seasonality correlation
  - Leave the dimensions simple and independent if the dimensions are large and the statistical correlation is weak
- Strong correlations and small resulting dimensions
  - If the correlation is strong, e.g., a product always has a brand, leave the brand OUT of the fact table (because the product always rolls up to a single brand) and COMBINE the product and the brand into a single dimension
- Arbitrary bound: 100K records is not a small dimension
35. Dimensional Roles
- When the same dimension is attached to a fact table multiple times
  - A sale has an order date, a payment date, a shipping date, a return date
  - A claim transaction has multiple internal people: a claim intake person, a medical nurse, a claim adjuster, a payer, etc.
- End users might be confused, when they drill down into the dimension, as to what the dimension represents, so it helps to give the dimension different names
- Solution: create views over the dimension for each dimensional role to help with name recognition (see the sketch below)
- For very large dimensions (locations) it might be appropriate to create physical copies of the dimension table to help with the name confusion
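A minimal sketch of role-playing views over a single date dimension (in-memory SQLite; table, view and column names are hypothetical):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
        INSERT INTO dim_date VALUES (20051010, '2005-10-10', 2005);

        -- One view per dimensional role; each renames the columns so that
        -- order-date and ship-date attributes are distinguishable in queries.
        CREATE VIEW order_date_dim AS
            SELECT date_key AS order_date_key, full_date AS order_date, year AS order_year
            FROM dim_date;
        CREATE VIEW ship_date_dim AS
            SELECT date_key AS ship_date_key, full_date AS ship_date, year AS ship_year
            FROM dim_date;
    """)

    print(con.execute("SELECT * FROM order_date_dim").fetchall())
    print(con.execute("SELECT * FROM ship_date_dim").fetchall())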
36. Other dimensions
- Degenerate dimensions
  - When a parent-child relationship exists and the grain of the fact table is the child, the parent is somewhat left out of the design process
  - Example
    - the grain of the fact table is the line item of an order
    - the order number is a significant part of the key
    - but we don't create a dimension for the order number, because it would be useless
    - we insert the order number as part of the key, as if it were a dimension, but we don't create a dimension table for it
37. Slowly Changing Dimensions
- When the DW receives notification that some record in a dimension has changed, there are three basic responses
  - Type 1 slowly changing dimension (Overwrite)
  - Type 2 slowly changing dimension (Partitioning History)
  - Type 3 slowly changing dimension (Alternate Realities)
38. Type 1 Slowly Changing Dimension (Overwrite)
- Overwrite one or more values of the dimension with the new value
- Use when
  - the data are corrected
  - there is no interest in keeping history
  - there is no need to run previous reports, or the changed value is immaterial to the report
- A Type 1 overwrite results in an UPDATE SQL statement when the value changes (see the sketch below)
- If a column is Type 1, the ETL subsystem must
  - Add the dimension record, if it is new, or
  - Update the dimension attribute in place
- Must also update any staging tables, so that any subsequent DW load from the staging tables will preserve the overwrite
- This update never affects the surrogate key
- But it affects materialized aggregates that were built on the value that changed (discussed more next week when we talk about delivering fact tables)
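A minimal sketch of a Type 1 overwrite against the dimension table (in-memory SQLite; table and column names are hypothetical). The surrogate key is untouched; only the attribute is overwritten:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE dim_customer (
                       customer_key INTEGER PRIMARY KEY,   -- surrogate key, never changes
                       customer_nk  TEXT,                  -- natural key from the source
                       first_name   TEXT)""")
    con.execute("INSERT INTO dim_customer VALUES (1, 'C001', 'Jon')")

    # Type 1 change: the source corrects the first name; overwrite in place.
    con.execute("""UPDATE dim_customer
                   SET first_name = ?
                   WHERE customer_nk = ?""", ("John", "C001"))

    print(con.execute("SELECT * FROM dim_customer").fetchall())
    # [(1, 'C001', 'John')] -- same surrogate key, corrected attribute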
39. Type 1 Slowly Changing Dimension (Overwrite) - Cont.
- Beware of ETL tools' "UPDATE else INSERT" statements, which are convenient but inefficient
  - Some developers use "UPDATE else INSERT" for fast-changing dimensions and "INSERT else UPDATE" for very slowly changing dimensions
- Better approach: segregate INSERTs from UPDATEs, and feed the DW independently for the updates and for the inserts
  - No need to invoke a bulk loader for small tables; simply execute the SQL updates, the performance impact is immaterial, even with the DW logging the SQL statements
  - For larger tables a loader is preferable, because SQL updates will result in unacceptable database logging activity
    - Turn the logger off before you apply the separate SQL UPDATEs and SQL INSERTs
    - Or use a bulk loader
      - Prepare the new dimension in a staging file
      - Drop the old dimension table
      - Load the new dimension table using the bulk loader
40. Type 2 Slowly Changing Dimension (Partitioning History)
- Standard: when a record changes, instead of overwriting (see the sketch below),
  - create a new dimension record
  - with a new surrogate key
  - add the new record into the dimension table
  - use this record going forward in all fact tables
  - no fact tables need to change
  - no aggregates need to be re-computed
- Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct
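A minimal sketch of a Type 2 change (in-memory SQLite; table and column names are hypothetical). The current row is expired and a new row with a new surrogate key is inserted; the current_flag and effective-date columns anticipate the attributes listed on slide 42:

    import sqlite3
    from datetime import date

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE dim_employee (
                       employee_key INTEGER PRIMARY KEY,    -- surrogate key
                       employee_nk  TEXT,                   -- natural key, never changes
                       job          TEXT,
                       effective_dt TEXT,
                       current_flag INTEGER)""")
    con.execute("INSERT INTO dim_employee VALUES (1, 'E042', 'Clerk', '2003-01-01', 1)")

    # Type 2 change: Joe is promoted to Manager.
    today = date(2005, 10, 10).isoformat()
    con.execute("UPDATE dim_employee SET current_flag = 0 WHERE employee_nk = ? AND current_flag = 1",
                ("E042",))
    con.execute("INSERT INTO dim_employee VALUES (2, 'E042', 'Manager', ?, 1)", (today,))

    print(con.execute("SELECT * FROM dim_employee ORDER BY employee_key").fetchall())
    # Old facts keep pointing at key 1 (Clerk); new facts will use key 2 (Manager).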
41. Type 2 Slowly Changing Dimensions (Partitioning History) - Cont.
- The natural key does not change
- The job attribute changes
- We can constrain our query on
  - the Manager job
  - Joe's employee id
- Type 2 does not change the natural key (the natural key should never change)
42. Type 2 SCD: Precise Time Stamping
- With a Type 2 change, you might want to include the following additional attributes in the dimension
  - Date of change
  - Exact timestamp of change
  - Reason for change
  - Current flag (current/expired)
43. Worst Case Scenario
- The source system does not notify you of changes and does not time/date-stamp its own updates
- For small tables, use a brute-force approach: compare every incoming field with every field in the DW to see if anything changed
- For larger tables, use a CRC, or cyclic redundancy checksum, a number of about 20 digits, like this (see the sketch below):
  - Treat the entire incoming record as a string
  - Compute the CRC value of the string (some numeric value)
  - Compare the CRC value of today's record with the CRC value of yesterday's record
  - If the CRC values match, today's record is identical to yesterday's record
  - If the CRC values do not match, do a field-by-field comparison to see what has changed
  - Depending on whether the changed field is a Type 1, Type 2 or Type 3 change, do the necessary updates
- Most ETL packages and the Internet include CRC computation and comparison code
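A minimal sketch of the CRC comparison using Python's built-in zlib.crc32 (a 32-bit checksum, so shorter than the roughly 20-digit number mentioned above; a longer hash could be swapped in the same way). The record layout and keys are hypothetical:

    import zlib

    def record_crc(record: dict) -> int:
        """Treat the whole record as one string and compute its CRC."""
        as_string = "|".join(str(v) for v in record.values())
        return zlib.crc32(as_string.encode("utf-8"))

    yesterday     = {"customer_nk": "C001", "name": "Alice", "city": "Boston"}
    today_same    = {"customer_nk": "C001", "name": "Alice", "city": "Boston"}
    today_changed = {"customer_nk": "C001", "name": "Alice", "city": "Denver"}

    if record_crc(today_same) == record_crc(yesterday):
        print("identical record -- nothing to do")

    if record_crc(today_changed) != record_crc(yesterday):
        # Only now do the expensive field-by-field comparison.
        changed = [k for k in yesterday if yesterday[k] != today_changed[k]]
        print("changed fields:", changed)      # ['city']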
44. Identifying Deletions
- To identify deletions
  - Read the source log file, if one exists
  - Determine that the CRC number of a record in the DW does not match any incoming record
  - Use the MINUS set operator to compare the DW dimension with the incoming dimension (usually works if the ETL staging table with today's dimension and the DW are part of the same database) -- see the sketch below
- Delete the record only if this is valid for the business; in most cases you don't want to delete anything from the DW
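A minimal sketch of the set-difference approach, using EXCEPT (SQLite's equivalent of Oracle's MINUS) between the DW dimension and today's staged dimension; table and column names are hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_customer (customer_nk TEXT, name TEXT);
        CREATE TABLE stg_customer_today (customer_nk TEXT, name TEXT);
        INSERT INTO dim_customer VALUES ('C001', 'Alice'), ('C002', 'Bob');
        INSERT INTO stg_customer_today VALUES ('C001', 'Alice');   -- C002 vanished from the source
    """)

    # Rows present in the DW dimension but absent from today's extract
    # are candidate deletions (EXCEPT plays the role of MINUS here).
    deleted = con.execute("""
        SELECT customer_nk FROM dim_customer
        EXCEPT
        SELECT customer_nk FROM stg_customer_today
    """).fetchall()

    print(deleted)    # [('C002',)] -- flag for review; usually do not physically delete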
45. Type 3 Slowly Changing Dimensions (Alternate Realities)
- Applicable when a change happens to a dimension record but the old value remains valid as a second choice
  - Product category designations
  - Sales-territory assignments
- Instead of creating a new row, a new column is inserted, if it does not already exist (see the sketch below)
  - The old value is added to the secondary column
  - before the new value overwrites the primary column
  - Example: old category, new category
- Usually defined by the business after the main ETL process is implemented
  - "Please move Brand X from Men's Sportswear to Leather Goods, but allow me to track Brand X optionally in the old category"
- The old category is described as an "alternate reality"
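A minimal sketch of a Type 3 change (in-memory SQLite; table and column names are hypothetical): add the alternate column if needed, copy the old value into it, then overwrite the primary column:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dim_product (product_nk TEXT, category TEXT)")
    con.execute("INSERT INTO dim_product VALUES ('BRANDX', 'Mens Sportswear')")

    # Type 3: keep the old category available as an "alternate reality".
    con.execute("ALTER TABLE dim_product ADD COLUMN old_category TEXT")
    con.execute("""UPDATE dim_product
                   SET old_category = category,
                       category     = 'Leather Goods'
                   WHERE product_nk = 'BRANDX'""")

    print(con.execute("SELECT * FROM dim_product").fetchall())
    # [('BRANDX', 'Leather Goods', 'Mens Sportswear')]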