Title: Data Warehouse Lecture 3
1. Data Warehouse Lecture 3
- ETL Tools, Techniques and Best Practices
2. Agenda
- ETL tools, techniques, methodologies, best practices, delivering dimension tables to the DW
- Sources used for this lecture
  - Ralph Kimball, Joe Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
3. ETL Solution Desired Features
An ETL tool is a tool that reads data from one or more sources, transforms the data so that it is compatible with a destination, and loads the data into the destination.
- Automated data movement across data stores and the analytical area in support of a data warehouse, data mart or ODS effort
- Extensible, robust, scalable infrastructure
- Standardization of ETL processes across the enterprise
- Reusability of custom and pre-built functions
- Better utilization of existing hardware resources
- Faster change-control management
- Integrated metadata management
- Complete development environment, "work as you think" design metaphor
4. Custom ETL Solutions
- Execute stored procedures at the source to extract data, perhaps filtering some
- Store the extracted data into staging tables at the source
- Generate a file with the source staging data
- Transfer the file to the destination
- Load the data into staging tables at the destination
- Execute stored procedures at the destination to read the incoming staged data and compute what changed since the last time by comparing with the current value of the data warehouse
- Populate the data warehouse with the new changes
- Supports incremental updates to sources
- Supports simple data cleansing
- Logs warnings/errors
5. Vendor ETL Tools
- DataStage
  - big player, high ability to execute, gets good results from Gartner, visionary
- Informatica
  - another big player, gets good results from Gartner, visionary with high ability to execute
  - most expensive
- SAS ETL Server
  - fast becoming a major player, very positive results from Gartner
  - low exposure as an ETL tool (SAS is a significant statistical analysis vendor)
- Information Builders' Data Migrator/ETL Manager tool suite
  - part of Enterprise Focus/WebFocus
  - not a major player, but industrial-strength language, data connectors, etc.
- Sunopsis
  - cheap
  - relies on native RDBMS functionality
  - CIGNA people exposed to it at conferences liked it
6. Buy versus Build
- Vendor tools promote standardization of the ETL process and reusability of custom and pre-built functions, lowering the time (and cost) of additional ETL efforts
- Vendor ETL tools are somewhat self-documenting
- Many tools can connect to a variety of sources (RDBMSs, non-relational stores, different OSs, ERP systems such as PeopleSoft, etc.) without exposing the ETL developer to the differences
- Vendor tools deal better with changes in the source systems or in the necessary transformations, reducing long-term ETL maintenance
- Metadata management is a huge advantage, especially when sharing data from many applications
- Vendor ETL tools place discipline in the extraction and transformation process
- ETL prices have not dropped much over the years, but there is increased functionality, performance, and usability from vendor tools
7. To stage or not to stage
- A conflict between
  - getting the data from the operational systems as fast as possible
  - having the ability to restart without repeating the process from the beginning
- Reasons for staging
  - Recoverability: stage the data as soon as it has been extracted from the source systems, and immediately after major processing (cleaning, transformation, etc.)
  - Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
  - Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse
8. Designing the staging area
- The staging area is owned by the ETL team
  - no indexes, no aggregations, no presentation access, no querying, no service level agreements
- Users are not allowed in the staging area for any reason
  - staging is a construction site
- Reports cannot access data in the staging area
  - tables can be added or dropped without notifying the user community
- Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area)
- The staging area consists of both RDBMS tables and data files
9. Staging Area Data Structures in the ETL System
- Flat files
  - fast to write, append to, sort and filter (grep) but slow to update, access or join
  - enable restart without going back to the sources
- XML Data Sets (not really used in staging)
- Relational Tables
  - metadata, SQL interface, DBA support
- Dimensional Model Constructs: Facts, Dimensions, Atomic Fact Tables, Aggregate Fact Tables (OLAP Cubes)
- Surrogate Key Mapping Tables
  - map natural keys from the OLTP systems to the surrogate keys of the DW
  - can be stored in files or in the RDBMS (you can use the IDENTITY function if you go with the RDBMS approach)
- Best practices for these data structures
  - perform impact analysis, capture metadata, use naming conventions
10. Extracting
- Effectively integrate data from
  - different DBMSs, OSs, hardware, communication protocols
- Need a logical map, data movement view documents, a data lineage report
- Have a plan
  - identify source candidates
  - analyze source systems with a data profiling tool
  - receive a walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
  - data alterations during data cleansing, calculations and formulas
  - measure twice, cut once
  - standard conformance to dimensions and numerical facts
  - receive a walk-through of the dimensional model
11. Components of the Data Movement Views
- Target table and column, table type (Dimension, Fact)
- Slowly changing dimension type per target column
  - Type 1: overwrite (Customer first name)
  - Type 2: retain history (Customer last name)
  - Type 3: retain multiple valid alternative values
- Source database, table, column
- Transformations (the guts of the document)
- Where do you capture this information? With which tool? How do you maintain this metadata?
12. Keeping track of source systems
- Data modelers and ETL developers should maintain documentation of the source systems, including
  - subject area, application name, business name, department name
  - priority
  - business owner and technical owner
  - DBMS, production server
  - DB size, users, complexity, transactions per day
  - comments
- Determine the system of record
13. Some good rules
- Dealing with derived data (derive from base facts, or accept calculated columns from the source systems?)
  - if calculations are done by the DW infrastructure, the ETL developer is responsible for them -- and what if the numbers don't match?
  - recommendation: stay true to the definition of system-of-record
  - The further downstream you go from the originating data source, the more you increase the risk of extracting corrupt data. Barring rare exceptions, maintain the practice of sourcing data only from the system-of-record.
- Analyze your source system
  - get an ER model for the system or reverse engineer one (develop one by looking at the metadata of the system)
  - reverse engineering is not the same as forward engineering, i.e., given the ER models of the source systems, deriving the dimensional schema of the data warehouse
14. Data Analysis
- Reverse engineering: building an understanding of a source system
  - unique identifiers and natural keys
  - data types
  - relationships between tables (1-to-1, many-to-1, many-to-many); problematic when the source database does not have foreign keys defined
  - discrete relationships (static data, reference tables)
- Data content analysis
  - NULL values, especially in foreign keys; NULLs result in lossy joins
  - In spite of the most detailed analysis, we recommend using outer join logic when extracting from relational source systems, simply because referential integrity often cannot be trusted on remote systems (see the sketch below)
  - Dates in non-date fields
15. Extract data from disparate systems
- What is the standard for the enterprise?
- ODBC, OLE DB, JDBC, .NET
  - access databases from Windows applications, so that applications are portable
  - performance is the major drawback
  - every DBMS has an ODBC driver, even flat files
  - adds two layers of interaction between the ETL and the database
16. Extracting from different sources
- Mainframe
  - COBOL copybooks give you the data types
  - EBCDIC rather than the ASCII character set (FTP does the translation between the mainframe and Unix/Windows)
  - Working with redefined fields (to save space, the same field is used for different types of data)
  - Extracting from IMS, IDMS, Adabas: you need special adapters, or you get someone in those systems to give you a file
- XML sources, web log files: doable, if you understand the structure of those sources
- Enterprise Resource Planning (ERP) systems (SAP, PeopleSoft, Oracle)
  - Don't treat it as a relational system -- it's a mess
  - Use adapters
17. Extracting Changed Data
- Using audit columns
  - Use the last-update timestamp, populated by triggers or the front-end application
  - Must ensure that the timestamp is dependable, that is, if the front-end modifies it, a batch job does not override it
  - Index the timestamp if it is dependable
- Database log scraping or sniffing
  - Take the log of the source file and try to determine the transactions that affect you
  - Sniffing does it in real time
- Timed extracts
  - Retrieve all records from the source that were modified today
  - POTENTIALLY dangerous -- what if the process fails today? When it runs tomorrow, you'd have lost today's changes
- Process of elimination
  - Preserve yesterday's data in the staging area
  - Bring today's entire data into the staging area
  - Perform a comparison (see the sketch below)
  - Inefficient, but the most reliable
- Initial and incremental loads
  - Create two tables, previous-load and current-load
  - Load into the current-load, compare with the previous-load; when you are done, drop the previous-load, rename the current-load to previous-load, and create a new current-load
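A minimal sketch of the process-of-elimination comparison, assuming both snapshots are available in memory as dictionaries keyed by the natural key (keys and attribute values here are hypothetical):

    # Compare yesterday's and today's full extracts to detect changes.
    # Records are keyed by their natural key; values are the remaining attributes.
    yesterday = {
        "C001": ("Alice", "Boston"),
        "C002": ("Bob", "Chicago"),
    }
    today = {
        "C001": ("Alice", "Denver"),   # changed
        "C003": ("Carol", "Austin"),   # new
    }

    inserts = {k: today[k] for k in today.keys() - yesterday.keys()}
    deletes = {k: yesterday[k] for k in yesterday.keys() - today.keys()}
    updates = {k: today[k] for k in today.keys() & yesterday.keys()
               if today[k] != yesterday[k]}

    print("inserts:", inserts)   # {'C003': ('Carol', 'Austin')}
    print("deletes:", deletes)   # {'C002': ('Bob', 'Chicago')}
    print("updates:", updates)   # {'C001': ('Alice', 'Denver')}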
18. Tips for Extracting
- Constrain on indexed columns (see the sketch below)
- Retrieve only the data you need
- Use DISTINCT sparingly
- Use the SET operations sparingly
- Use HINTs (a HINT tells the DBMS to make sure it uses a certain index)
- Avoid NOT
- Avoid functions in the WHERE clause
- Avoid subqueries
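As an illustrative sketch of two of these tips (constrain on the indexed column, avoid functions in the WHERE clause), the queries below are held as strings; the orders table and its indexed order_ts column are hypothetical:

    # Illustrative only: two extraction queries as strings.
    bad_query = """
        SELECT order_id, customer_id, amount
        FROM orders
        WHERE DATE(order_ts) = '2005-10-10'   -- function on the indexed column defeats the index
    """

    good_query = """
        SELECT order_id, customer_id, amount  -- retrieve only the columns you need
        FROM orders
        WHERE order_ts >= '2005-10-10'
          AND order_ts <  '2005-10-11'        -- range predicate directly on the indexed column
    """

    for label, sql in (("avoid", bad_query), ("prefer", good_query)):
        print(label, sql)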
19. Rules
- When a dimension is populated by several distinct source systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, to give them peace of mind that the dimension reflects data they can tie back to their transaction system.
20. Delivering Dimension Tables
21. The basic structure of a dimension
- Primary key (PK)
  - Meaningless, unique integer
  - Also known as the surrogate key
  - Joins to fact tables
  - Is a foreign key in fact tables
- Natural key (NK)
  - Meaningful key extracted from the source systems
  - 1-to-1 relationship to the PK for static dimensions
  - 1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
- Descriptive attributes
  - Primarily textual; numbers are legitimate, but not numbers that are measured quantities
  - Around 100 such attributes is normal
  - Static or slowly changing only
  - Product price -- either a fact or a dimension attribute
22. Generating surrogate keys for dimensions
- Via triggers in the DBMS
  - Read the latest surrogate key, generate the next value, create the record
  - Disadvantage: severe performance bottlenecks
- Via the ETL process: an ETL tool or a third-party application generates the unique numbers (see the sketch below)
  - A surrogate key counter per dimension
  - Maintain consistency of surrogate keys between dev, test and production
- Using smart keys
  - Concatenate the natural key of the dimension in the source(s) with the timestamp of the record in the source or the data warehouse
  - Tempting, but wrong
23. Why smart keys are wrong
- By definition
  - Surrogate keys are supposed to be meaningless
  - Do you update the concatenated smart key if the natural key changes?
- Performance
  - Natural keys may be chars and varchars, not integers
  - Adding a timestamp makes the key very big
  - The dimension is bigger
  - The fact tables containing the foreign key are bigger
  - Joining facts with dimensions on chars/varchars becomes inefficient
- Heterogeneous sources
  - Smart keys work for homogeneous environments, but more likely than not the sources are heterogeneous, each having its own definition of the dimension
  - How does the definition of the smart key change when another source is added? It doesn't scale very well.
- One advantage: simplicity in the ETL process
24. The grain of a dimension
- The definition of the key of the dimension in business terms: what does the dimension represent?
- Challenge: analyze the source systems so that a particular set of fields in that source corresponds to the grain of the dimension
- Verify that a given source (file) implements the intended grain
  - Nothing should be returned by the query below from the source system/file
  - If something is returned, the fields A, B and C do not represent the grain of the dimension

select A, B, C, count(*) from DimensionTableSource
group by A, B, C having count(*) > 1
25. The basic load plan for a dimension
- Simple case: the dimension is loaded as a lookup table
- Typical case
  - Data cleaning
    - Validate the data, apply business rules to make the data consistent, column validity enforcement, cross-column value checking, row de-duplication
  - Data conforming
    - Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
    - Fact tables: billing transactions, customer support calls -- IF they use the same dimensions, then the dimensions are conformed
  - Data delivery
    - All the steps required to deal with slowly changing dimensions
    - Write the dimension to the physical table
    - Creating and assigning the surrogate key, making sure the natural key is correct, etc.
26. Dimensions and embedded hierarchies
- Dimensions are de-normalized flat tables
- It is normal to have multiple embedded hierarchical structures in the same dimension
- E.g., the Store dimension
  - Geographic hierarchy: location, city, state, country
  - Merchandizing hierarchy: location, district, region
  - These can co-exist in the same dimension IF every attribute is single-valued in the presence of the dimension's primary key
  - If some attribute takes multiple values, it is not part of the dimension
  - E.g., a store can have multiple cash registers; if the grain of the dimension is the store, the cash register cannot be part of the store dimension. To include the cash register in the dimension, the grain must be changed to the cash register, not the store
- Normalized structures (3NF)
  - OK to exist in the staging area
  - should be flattened (2NF) before the delivery phase, using a simple query over the normalized structures (see the sketch below)
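A minimal sketch of flattening normalized staging tables into a flat dimension with a single join query (in-memory SQLite; all table and column names are hypothetical):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Normalized staging tables (3NF)
        CREATE TABLE stg_region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
        CREATE TABLE stg_district (district_id INTEGER PRIMARY KEY, district_name TEXT,
                                   region_id INTEGER REFERENCES stg_region);
        CREATE TABLE stg_store    (store_id INTEGER PRIMARY KEY, store_name TEXT,
                                   district_id INTEGER REFERENCES stg_district);
        INSERT INTO stg_region   VALUES (1, 'Northeast');
        INSERT INTO stg_district VALUES (10, 'Boston Metro', 1);
        INSERT INTO stg_store    VALUES (100, 'Main St Store', 10);

        -- Flat (de-normalized) store dimension produced by a simple join
        CREATE TABLE dim_store AS
        SELECT s.store_id, s.store_name, d.district_name, r.region_name
        FROM stg_store s
        JOIN stg_district d ON d.district_id = s.district_id
        JOIN stg_region   r ON r.region_id   = d.region_id;
    """)

    print(con.execute("SELECT * FROM dim_store").fetchall())
    # [(100, 'Main St Store', 'Boston Metro', 'Northeast')]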
27. Snow-flaked Dimensions
- Dimensions that remain normalized
- Look like a snowflake if the levels of the hierarchy obey perfect many-to-1 relationships
28. Flat Dimensions versus Snow-flakes
- There is no semantic difference; the information content of the two approaches is the same
- Problems with snow-flakes
  - If the hierarchical model changes, the normalized table schema and the declared joins must change, as well as the resulting end-user queries
  - More confusing to the end user; flat dimensions mask the complexity of the data
- However, appropriate snow-flakes do exist (as sub-dimensions of dimensions)
29. Date and Time Dimensions
- Needed virtually everywhere: measurements are defined at specific times, repeated over time, etc.
- Most common: a calendar-day dimension with the grain of a single day and many attributes
  - Doesn't have a conventional source
  - Built by hand, e.g., in a spreadsheet (see the sketch below)
  - Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
  - 10 years are about 4K rows
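A minimal sketch of generating the rows of a calendar-day dimension with a script rather than a spreadsheet; the column set here is a small, hypothetical subset of what a real date dimension would carry:

    from datetime import date, timedelta

    def build_date_dimension(start: date, end: date):
        """Yield one row per calendar day between start and end (inclusive)."""
        day = start
        while day <= end:
            yield {
                "date_key": int(day.strftime("%Y%m%d")),   # YYYYMMDD key; slide 30 discusses meaningless integers instead
                "full_date": day.isoformat(),
                "day_of_week": day.strftime("%A"),
                "month": day.month,
                "quarter": (day.month - 1) // 3 + 1,
                "year": day.year,
                "is_weekend": day.weekday() >= 5,
                # Holidays, fiscal periods, etc. would be merged in from a
                # hand-maintained list, as the slide notes.
            }
            day += timedelta(days=1)

    rows = list(build_date_dimension(date(2000, 1, 1), date(2009, 12, 31)))
    print(len(rows))       # 3653 rows -- "10 years are about 4K rows"
    print(rows[0])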
30. Date Dimension
- Note the natural key: a day type and a full date
  - Day type: date, plus non-date types such as "inapplicable date", "corrupted date", "hasn't happened yet"
  - Fact tables must point to a valid date from the dimension, so we need special date types, at least one: the N/A date
- How to generate the primary key?
  - Meaningless integer?
  - Or 10102005, meaning Oct 10, 2005? (reserving 9999999 to mean N/A?)
  - This is a close call, but even if meaningless integers are used, the numbers should be assigned in date order (why? Because of data partitioning requirements in a DW: data in a fact table can be partitioned by time)
31. Other Time Dimensions
- Also typically needed are time dimensions whose grain is a month, a week, a quarter or a year, if there are fact tables at each of these grains
- These are physically different tables
- They are generated by eliminating selected columns and rows from the Date dimension; keep either the first or the last day of the month
- Do NOT use database views
  - A view would drag a much larger table (the Date dimension) into a month-based fact table
32. Time Dimensions
- How about a time dimension based on seconds?
  - There are over 31 million seconds in a year!
  - Avoid them as dimensions
  - But keep the SQL date-timestamp data as basic attributes in facts (not as dimensions), if needed to compute precise queries based on specific times
- Older approach: keep a dimension of minutes or seconds and base it on an offset from midnight of each day, but it's messy when timestamps cross days
- Might need something fancier, though, if the enterprise has well-defined time slices within a day, such as shift names or advertising slots -- then build a dimension
33. Big and Small Dimensions
- SMALL
  - Examples: Transaction Type, Claim Status
  - Tiny lookup tables with only a few records and one or more columns
  - Built by typing into a spreadsheet and loading the data into the DW
  - These dimensions should NOT be conformed
  - JUNK dimension: a tactical maneuver to reduce the number of FKs from a fact table by combining the low-cardinality values of small dimensions into a single junk dimension; generate it as you go, don't generate the cartesian product (see the sketch below)
- BIG
  - Examples: Customer, Product, Location
  - Millions of records with hundreds of fields (insurance customers)
  - Or hundreds of millions of records with few fields (supermarket customers)
  - Always derived from multiple sources
  - These dimensions should be conformed
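A minimal sketch of generating a junk dimension as you go: a surrogate key is assigned only to the attribute combinations that actually occur in the incoming facts, instead of pre-building the full cartesian product (attribute names are hypothetical):

    # Junk dimension built lazily: one row per observed combination of
    # low-cardinality attributes, not the full cartesian product.
    junk_rows = {}            # (payment_type, order_status, gift_wrap) -> junk_key

    def junk_key_for(combo):
        if combo not in junk_rows:
            junk_rows[combo] = len(junk_rows) + 1
        return junk_rows[combo]

    incoming_facts = [
        ("credit", "shipped", False),
        ("cash",   "returned", False),
        ("credit", "shipped", False),    # same combo -> same key, no new row
    ]

    fact_fks = [junk_key_for(combo) for combo in incoming_facts]
    print(fact_fks)        # [1, 2, 1]
    print(junk_rows)       # only 2 junk rows, not the full cartesian product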
34. One dimension or two?
- Are the dimensions independent, or is there a statistical correlation between them?
- Weak statistical correlation and big dimensions
  - Even if there is a weak correlation, build the DW as if the dimensions are independent, especially for large dimensions
  - 100 stores and 1 million products would give 100 million rows for the combined dimension -- too large
  - If the statistical correlation is significant, build a fact table for the correlation; there may be many such correlations, e.g., a merchandizing correlation, a pricing-strategy correlation, a changing-seasonality correlation
  - Leave the dimensions simple and independent if the dimensions are large and the statistical correlation is weak
- Strong correlations and small resulting dimensions
  - If the correlation is strong, e.g., a product always has a brand, leave the brand OUT of the fact table (because the product always rolls up to a single brand) and COMBINE the product and the brand into a single dimension
- Arbitrary bound: 100K records is not a small dimension
35. Dimensional Roles
- When the same dimension is attached to a fact table multiple times
  - A sale has an order date, a payment date, a shipping date, a return date
  - A claim transaction has multiple internal people: a claim intake person, a medical nurse, a claim adjuster, a payer, etc.
- End users might be confused, when they drill down into the dimension, as to what the dimension represents, so it helps to give the dimension different names
- Solution: create views over the dimension for each dimensional role to help with name recognition (see the sketch below)
- For very large dimensions (locations) it might be appropriate to create physical copies of the dimension table to help with the name confusion
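A minimal sketch of role-playing views over a single date dimension (in-memory SQLite; table, view and column names are hypothetical):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
        INSERT INTO dim_date VALUES (20051010, '2005-10-10', 2005);

        -- One view per dimensional role; each renames the columns so that
        -- order-date and ship-date attributes are distinguishable in queries.
        CREATE VIEW order_date_dim AS
            SELECT date_key AS order_date_key, full_date AS order_date, year AS order_year
            FROM dim_date;
        CREATE VIEW ship_date_dim AS
            SELECT date_key AS ship_date_key, full_date AS ship_date, year AS ship_year
            FROM dim_date;
    """)

    print(con.execute("SELECT * FROM order_date_dim").fetchall())
    print(con.execute("SELECT * FROM ship_date_dim").fetchall())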
36. Other dimensions
- Degenerate dimensions
  - When a parent-child relationship exists and the grain of the fact table is the child, the parent is somewhat left out of the design process
  - Example
    - the grain of the fact table is the line item of an order
    - the order number is a significant part of the key
    - but we don't create a dimension for the order number, because it would be useless
    - we insert the order number as part of the key, as if it were a dimension, but we don't create a dimension table for it
37. Slowly Changing Dimensions
- When the DW receives notification that some record in a dimension has changed, there are three basic responses
  - Type 1 slowly changing dimension (Overwrite)
  - Type 2 slowly changing dimension (Partitioning History)
  - Type 3 slowly changing dimension (Alternate Realities)
38. Type 1 Slowly Changing Dimension (Overwrite)
- Overwrite one or more values of the dimension with the new value
- Use when
  - the data are corrected
  - there is no interest in keeping history
  - there is no need to run previous reports, or the changed value is immaterial to the report
- A Type 1 overwrite results in an UPDATE SQL statement when the value changes (see the sketch below)
- If a column is Type 1, the ETL subsystem must
  - Add the dimension record, if it is new, or
  - Update the dimension attribute in place
- Must also update any staging tables, so that any subsequent DW load from the staging tables will preserve the overwrite
- This update never affects the surrogate key
- But it affects materialized aggregates that were built on the value that changed (discussed more next week when we talk about delivering fact tables)
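A minimal sketch of a Type 1 overwrite against the dimension table (in-memory SQLite; table and column names are hypothetical). The surrogate key is untouched; only the attribute is overwritten:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE dim_customer (
                       customer_key INTEGER PRIMARY KEY,   -- surrogate key, never changes
                       customer_nk  TEXT,                  -- natural key from the source
                       first_name   TEXT)""")
    con.execute("INSERT INTO dim_customer VALUES (1, 'C001', 'Jon')")

    # Type 1 change: the source corrects the first name; overwrite in place.
    con.execute("""UPDATE dim_customer
                   SET first_name = ?
                   WHERE customer_nk = ?""", ("John", "C001"))

    print(con.execute("SELECT * FROM dim_customer").fetchall())
    # [(1, 'C001', 'John')] -- same surrogate key, corrected attribute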
39. Type 1 Slowly Changing Dimension (Overwrite) - Cont.
- Beware of ETL tools' "UPDATE else INSERT" statements, which are convenient but inefficient
  - Some developers use "UPDATE else INSERT" for fast-changing dimensions and "INSERT else UPDATE" for very slowly changing dimensions
- Better approach: segregate INSERTs from UPDATEs, and feed the DW independently for the updates and for the inserts
  - No need to invoke a bulk loader for small tables; simply execute the SQL updates, the performance impact is immaterial, even with the DW logging the SQL statements
  - For larger tables a loader is preferable, because SQL updates will result in unacceptable database logging activity
    - Turn the logger off before you apply the separate SQL UPDATEs and SQL INSERTs
    - Or use a bulk loader
      - Prepare the new dimension in a staging file
      - Drop the old dimension table
      - Load the new dimension table using the bulk loader
40. Type 2 Slowly Changing Dimension (Partitioning History)
- Standard: when a record changes, instead of overwriting (see the sketch below),
  - create a new dimension record
  - with a new surrogate key
  - add the new record into the dimension table
  - use this record going forward in all fact tables
  - no fact tables need to change
  - no aggregates need to be re-computed
- Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct
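A minimal sketch of a Type 2 change (in-memory SQLite; table and column names are hypothetical). The current row is expired and a new row with a new surrogate key is inserted; the current_flag and effective-date columns anticipate the attributes listed on slide 42:

    import sqlite3
    from datetime import date

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE dim_employee (
                       employee_key INTEGER PRIMARY KEY,    -- surrogate key
                       employee_nk  TEXT,                   -- natural key, never changes
                       job          TEXT,
                       effective_dt TEXT,
                       current_flag INTEGER)""")
    con.execute("INSERT INTO dim_employee VALUES (1, 'E042', 'Clerk', '2003-01-01', 1)")

    # Type 2 change: Joe is promoted to Manager.
    today = date(2005, 10, 10).isoformat()
    con.execute("UPDATE dim_employee SET current_flag = 0 WHERE employee_nk = ? AND current_flag = 1",
                ("E042",))
    con.execute("INSERT INTO dim_employee VALUES (2, 'E042', 'Manager', ?, 1)", (today,))

    print(con.execute("SELECT * FROM dim_employee ORDER BY employee_key").fetchall())
    # Old facts keep pointing at key 1 (Clerk); new facts will use key 2 (Manager).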
41. Type 2 Slowly Changing Dimensions (Partitioning History) - Cont.
- The natural key does not change
- The job attribute changes
- We can constrain our query on
  - the Manager job
  - Joe's employee id
- Type 2 does not change the natural key (the natural key should never change)
42. Type 2 SCD: Precise Time Stamping
- With a Type 2 change, you might want to include the following additional attributes in the dimension
  - Date of change
  - Exact timestamp of change
  - Reason for change
  - Current flag (current/expired)
43. Worst Case Scenario
- The source system does not notify you of changes and does not time/date-stamp its own updates
- For small tables, use a brute-force approach: compare every incoming field with every field in the DW to see if anything changed
- For larger tables, use a CRC, or cyclic redundancy checksum, a number of about 20 digits, like this (see the sketch below):
  - Treat the entire incoming record as a string
  - Compute the CRC value of the string (some numeric value)
  - Compare the CRC value of today's record with the CRC value of yesterday's record
  - If the CRC values match, today's record is identical to yesterday's record
  - If the CRC values do not match, do a field-by-field comparison to see what has changed
  - Depending on whether the changed field is a Type 1, Type 2 or Type 3 change, do the necessary updates
- Most ETL packages and the Internet include CRC computation and comparison code
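A minimal sketch of the CRC comparison using Python's built-in zlib.crc32 (a 32-bit checksum, so shorter than the roughly 20-digit number mentioned above; a longer hash could be swapped in the same way). The record layout and keys are hypothetical:

    import zlib

    def record_crc(record: dict) -> int:
        """Treat the whole record as one string and compute its CRC."""
        as_string = "|".join(str(v) for v in record.values())
        return zlib.crc32(as_string.encode("utf-8"))

    yesterday     = {"customer_nk": "C001", "name": "Alice", "city": "Boston"}
    today_same    = {"customer_nk": "C001", "name": "Alice", "city": "Boston"}
    today_changed = {"customer_nk": "C001", "name": "Alice", "city": "Denver"}

    if record_crc(today_same) == record_crc(yesterday):
        print("identical record -- nothing to do")

    if record_crc(today_changed) != record_crc(yesterday):
        # Only now do the expensive field-by-field comparison.
        changed = [k for k in yesterday if yesterday[k] != today_changed[k]]
        print("changed fields:", changed)      # ['city']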
44. Identifying Deletions
- To identify deletions
  - Read the source log file, if one exists
  - Determine that the CRC number of a record in the DW does not match any incoming record
  - Use the MINUS set operator to compare the DW dimension with the incoming dimension (usually works if the ETL staging table with today's dimension and the DW are part of the same database) -- see the sketch below
- Delete the record only if this is valid for the business; in most cases you don't want to delete anything from the DW
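A minimal sketch of the set-difference approach, using EXCEPT (SQLite's equivalent of Oracle's MINUS) between the DW dimension and today's staged dimension; table and column names are hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_customer (customer_nk TEXT, name TEXT);
        CREATE TABLE stg_customer_today (customer_nk TEXT, name TEXT);
        INSERT INTO dim_customer VALUES ('C001', 'Alice'), ('C002', 'Bob');
        INSERT INTO stg_customer_today VALUES ('C001', 'Alice');   -- C002 vanished from the source
    """)

    # Rows present in the DW dimension but absent from today's extract
    # are candidate deletions (EXCEPT plays the role of MINUS here).
    deleted = con.execute("""
        SELECT customer_nk FROM dim_customer
        EXCEPT
        SELECT customer_nk FROM stg_customer_today
    """).fetchall()

    print(deleted)    # [('C002',)] -- flag for review; usually do not physically delete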
45. Type 3 Slowly Changing Dimensions (Alternate Realities)
- Applicable when a change happens to a dimension record but the old value remains valid as a second choice
  - Product category designations
  - Sales-territory assignments
- Instead of creating a new row, a new column is inserted, if it does not already exist (see the sketch below)
  - The old value is added to the secondary column
  - before the new value overwrites the primary column
  - Example: old category, new category
- Usually defined by the business after the main ETL process is implemented
  - "Please move Brand X from Men's Sportswear to Leather Goods, but allow me to track Brand X optionally in the old category"
- The old category is described as an "alternate reality"
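A minimal sketch of a Type 3 change (in-memory SQLite; table and column names are hypothetical): add the alternate column if needed, copy the old value into it, then overwrite the primary column:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE dim_product (product_nk TEXT, category TEXT)")
    con.execute("INSERT INTO dim_product VALUES ('BRANDX', 'Mens Sportswear')")

    # Type 3: keep the old category available as an "alternate reality".
    con.execute("ALTER TABLE dim_product ADD COLUMN old_category TEXT")
    con.execute("""UPDATE dim_product
                   SET old_category = category,
                       category     = 'Leather Goods'
                   WHERE product_nk = 'BRANDX'""")

    print(con.execute("SELECT * FROM dim_product").fetchall())
    # [('BRANDX', 'Leather Goods', 'Mens Sportswear')]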