Title: Data Warehousing (SS ZG515) - Extraction, Transformation, Loading
Data Warehousing (SS ZG515): Extraction - Transformation - Loading
- Prof. Navneet Goyal / SP Vimal
- Computer Science Department
- BITS, Pilani
To discuss
- Requirements
- Data Structures
- Extraction
- Cleaning & Conforming
- Delivering Dimension Tables
- Delivering Fact Tables
1. Requirements
ETL
- "A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions."
- "ETL makes or breaks the data warehouse." - Ralph Kimball
Requirements
- Business Needs
- Information requirements of the end users.
- Captured by interviews with users and by independent investigation of the possible sources by the ETL team.
- Compliance Requirements
- Sarbanes-Oxley Act 2002 (deals with the regulation of corporate governance; more at http://www.soxlaw.com)
- Proof of the complete transaction flow that changed any data.
Requirements (contd)
- Data Profiling
- Systematic examination of the quality, scope and context of a data source.
- Helps the ETL team determine how much data-cleaning activity is required.
- "Data profiling employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure and quality of the data. A good data profiling system can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed." - Jack Olson
Requirements (contd)
- Security Requirements
- The ETL team has complete read/write access to the entire corporate data.
- ETL workstations on the company intranet are a major threat; keep them in a separate subnet behind a packet-filtering gateway.
- Secure the backups as well.
- Data Integration
- Identified as the conform step
- Conform Dimensions & Conform Facts
Requirements (contd)
- Data Latency
- How quickly can the data be delivered to the users?
- The ETL architecture has a direct impact on it.
- Archiving & Lineage
- Stage data after each major transformation, not just after all four steps (extract, clean, conform, deliver).
- Each archived/staged data set should have accompanying metadata.
- Tracking this lineage is explicitly required by certain compliance requirements.
Requirements (contd)
- End User Delivery Interfaces
- The ETL team is responsible for the content and structure of the data, making the end-user applications fast.
- Available Skills
- Expertise in building an ETL system around a vendor's tool
- Decision between hand-coded ETL and a vendor's package of ETL tools
- Legacy Licenses
- Management's insistence on using legacy licenses
Choice of Architecture: Tool-Based ETL
- Simpler, cheaper and faster development
- People with business skills but not much technical skill can use it.
- Automatically generates metadata
- Automatically generates data lineage and data dependency analysis
- Offers in-line encryption and compression capabilities
- Manages complex load balancing across servers
Choice of Architecture: Hand-Coded ETL
- Quality of the tool proven by exhaustive unit testing
- Better metadata
- Requirement may be just file-based processes, not database stored procedures
- Use of existing legacy routines
- Use of in-house programmers
- Unlimited flexibility
Choice of Architecture: Batch vs Streaming
- Batch Update
- ETL processing is done on periodic batch extracts from the source systems.
- Streaming Dataflow
- Data flows at the record level from the source system to the users' screens.
- Extract-Clean-Conform-Deliver must be modified to accommodate record-by-record processing.
- Arrival of data may not be predictable.
2. Data Structures
To stage or not to stage
- A conflict between
- getting the data from the operational systems as fast as possible, and
- having the ability to restart without repeating the process from the beginning.
- Reasons for staging
- Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after each major processing step (cleaning, transformation, etc.).
- Backup: the data warehouse can be reloaded from the staging tables without going back to the sources.
- Auditing: lineage between the source data and the underlying transformations before the load into the data warehouse.
Designing the staging area
- The staging area is owned by the ETL team
- no indexes, no aggregations, no presentation access, no querying, no service level agreements
- Users are not allowed in the staging area for any reason
- staging is a construction site
- Reports cannot access data in the staging area
- tables can be added or dropped without notifying the user community
- Controlled environment
Designing the staging area (contd)
- Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area).
- The staging area consists of both RDBMS tables and data files.
Staging Area Data Structures in the ETL System
- Flat files
- fast to write, append to, sort and filter (grep), but slow to update, access or join
- enable restart without going back to the sources
- XML Data Sets
- Used as a medium of data transfer between incompatible data sources
- Give enough information to create tables using CREATE TABLE
- Relational Tables
- Metadata, SQL interface, DBA support
Staging Area Data Structures in the ETL System (contd)
- Dimensional Model Constructs: Facts, Dimensions, Atomic Fact Tables, Aggregate Fact Tables (OLAP Cubes)
- Surrogate Key Mapping Tables
- map natural keys from the OLTP systems to the surrogate keys of the DW
- can be stored in files or in the RDBMS (see the sketch below)
3. Extraction
Logical data map
- Represented as a table with the following attributes (a small example follows):
- Target table and column, table type (Dimension, Fact)
- Slowly changing dimension type per target column of each dimension
- Type 1: overwrite (e.g., customer first name)
- Type 2: retain history (e.g., customer last name)
- Type 3: retain multiple valid alternative values
- Source database, table, column
- Transformations
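A couple of illustrative logical data map rows, written here as Python records; every table, column and transformation name is invented for the example.

logical_data_map = [
    {
        "target_table": "DIM_CUSTOMER", "target_column": "first_name",
        "table_type": "Dimension", "scd_type": 1,
        "source": "CRM.customers.fname",
        "transformation": "TRIM, INITCAP",
    },
    {
        "target_table": "FACT_SALES", "target_column": "sales_amount",
        "table_type": "Fact", "scd_type": None,
        "source": "POS.sales_lines.amount",
        "transformation": "convert currency to USD",
    },
]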
Logical Map development
- Have a plan!
- The logical map is provided by the data warehouse architect to the ETL team and serves as the specification of the ETL processes.
- Identify the data lineage between the data source and the target.
- Identify source candidates
- Analyze source systems with a data profiling tool
- Detect data anomalies, identify appropriate actions and document them.
- Identify the quality
Logical Map development (contd)
- Receive a walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
- data alterations during data cleansing, calculations and formulas
- standard conformance to dimensions and numerical facts
- Receive a walk-through of the dimensional model
- The objective of the ETL team is to deliver data to the dimensional model effectively, so an understanding of the dimensional model is helpful.
- Validate the calculations and formulas used in ETL.
Logical Map development (contd)
- A complete logical map cannot exist until all the source systems are identified and analyzed.
- Analysis of sources
- Data discovery phase
- Collecting and documenting source systems
- Keeping track of source systems
- Identify ownership: who is responsible for the content, its storage and usage statistics
- Determine the system-of-record (source of data)
- Identify the source when redundant sources coexist
- Analyze the source systems for any relationships between tables
- Anomaly detection phase (data content analysis), e.g., as in the sketch below
- NULL values
- Dates in non-date fields
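A minimal anomaly-detection sketch for one source column: it counts NULLs and values that fail to parse as dates. The file name, column name and date format are assumptions.

import csv
from datetime import datetime

null_count, bad_dates, total = 0, 0, 0
with open("source_extract.csv", newline="") as f:          # hypothetical extract file
    for row in csv.DictReader(f):
        total += 1
        value = row.get("order_date")                       # hypothetical column
        if value in (None, "", "NULL"):
            null_count += 1
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            bad_dates += 1                                   # a date in a non-date format

print(f"{total} rows: {null_count} NULLs, {bad_dates} unparsable dates")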
Some good rules
- Analyze your source system
- get an ER model for the system or reverse-engineer one (develop one by looking at the metadata of the system)
- reverse engineering is not the same as forward engineering, i.e., deriving the dimensional schema of the data warehouse from the ER models of the source systems
- Reverse engineering helps in understanding, for the source systems:
- unique identifiers and natural keys
- data types
- relationships between tables
Extract data from disparate systems
- ODBC Manager
- Accepts SQL from ETL applications and routes it through the appropriate ODBC driver
- ODBC can provide a common gateway to diverse sources (see the sketch below)
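A sketch of extraction through ODBC from Python, assuming the pyodbc package and a DSN named SALES_SRC configured in the ODBC manager; the query, table and credentials are illustrative. Because only the connection string names the source, the same code can read from many different systems.

import pyodbc

conn = pyodbc.connect("DSN=SALES_SRC;UID=etl_user;PWD=secret")
cursor = conn.cursor()
cursor.execute(
    "SELECT order_id, customer_id, order_date, amount "
    "FROM orders WHERE order_date >= ?", ["2024-01-01"]
)
while True:
    batch = cursor.fetchmany(10000)      # stream in batches, not the whole table
    if not batch:
        break
    for row in batch:
        pass                              # write the row to the staging area here
conn.close()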
Different sources
- Mainframe Sources
- Flat Files
- XML sources
- Web Log sources
- ERP system sources
Tips for Extracting
- Constrain on indexed columns
- Retrieve only the data you need
- Do not retrieve an entire table and then select from it.
- Use DISTINCT and set operations sparingly
- Check whether the slower DISTINCT and set operations (UNION, MINUS, INTERSECT) can be avoided.
- Make a best effort to avoid the NOT operator, which normally forces a full scan.
- Avoid functions in the WHERE clause
- Difficult to avoid
- At least try different techniques before using functions.
- Avoid subqueries (a worked example follows)
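An illustrative rewrite of an extraction query following the tips above; Oracle-flavored SQL is shown only as an example, and all table, column and index names are assumptions.

# Avoid: a function on the (indexed) date column and a subquery.
slow_sql = """
SELECT * FROM orders
WHERE TRUNC(order_date) = TO_DATE('2024-01-15', 'YYYY-MM-DD')
  AND customer_id IN (SELECT customer_id FROM customers WHERE region = 'EU')
"""

# Prefer: a range predicate on the indexed column, a join instead of a
# subquery, and only the columns actually needed downstream.
fast_sql = """
SELECT o.order_id, o.customer_id, o.order_date, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2024-01-15'
  AND o.order_date <  DATE '2024-01-16'
  AND c.region = 'EU'
"""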
Do it also!
- When a dimension is populated by several distinct systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, to give them confidence that the dimension reflects data they can tie back to their transaction system.
4. Cleaning & Conforming
Cleaning and Conforming
- While the extracting and loading parts of an ETL process simply move data, the cleaning and conforming part is the transformation part that truly adds value.
- How do we deal with dirty data?
- Data profiling report
- The error event fact table
- Audit dimension
- Challenges
- Completeness vs speed
- Corrective vs transparent
- A system that is too corrective hides/obscures the operational deficiencies and slows organizational progress.
Defining Data Quality
- The basic definition of data quality is data accuracy, and that means
- Correct: the values of the data are valid, e.g., my resident state is PA
- Unambiguous: the values of the data can mean only one thing, e.g., there is only one PA
- Consistent: the values of the data use the same format, e.g., PA and not Penn or Pennsylvania
- Complete: data are not null, and aggregates do not lose data records somewhere in the information flow
Cleaning Deliverables
- Keep accurate records of the types of data quality problems you look for, when you look, what you look at, etc.
- Is data quality getting better or worse?
- Which source systems generate the most data quality errors?
- Is there any correlation between data quality levels and the performance of the organization as a whole?
Data Profiling Deliverable
- Start before building the ETL system
- Data profiling analysis including
- Schema definitions
- Business objects
- Domains
- Data Sources
- Table definitions
- Synonyms
- Data rules
- Value rules
- Issues that need to be addressed
Error Event Table Deliverable
- Built as a star schema
- Each data quality error or issue is added to the table (a minimal schema sketch follows)
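A minimal sketch of an error event star schema with one fact table and two dimensions; all table and column names are assumptions made for illustration, not a published design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_batch (
        batch_key   INTEGER PRIMARY KEY,
        batch_name  TEXT,
        load_date   TEXT
    );
    CREATE TABLE dim_screen (                 -- the data quality check that fired
        screen_key  INTEGER PRIMARY KEY,
        screen_name TEXT,
        severity    TEXT
    );
    CREATE TABLE fact_error_event (
        error_event_key INTEGER PRIMARY KEY,
        batch_key       INTEGER REFERENCES dim_batch(batch_key),
        screen_key      INTEGER REFERENCES dim_screen(screen_key),
        source_table    TEXT,
        source_row_id   TEXT,
        error_count     INTEGER               -- one row per error or issue found
    );
""")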
Conforming
- Integration of data
- A conformed product dimension is the enterprise's agreed-upon master list of products, including all attributes. It can be considered a master table of products with a clean surrogate product key and all relevant attributes.
- Processes of conforming (a small sketch follows)
- Standardizing
- Matching and deduplication
- Surviving
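A toy sketch of the three conforming steps for customer records arriving from two sources; matching here is just on a standardized name, which is far simpler than real record-linkage logic, and all values are invented.

records = [
    {"source": "CRM", "name": "ACME Corp.", "phone": "555-0100"},
    {"source": "ERP", "name": "Acme Corp",  "phone": None},
]

def standardize(rec):
    rec["name_std"] = rec["name"].upper().rstrip(".").strip()
    return rec

# Standardizing
records = [standardize(r) for r in records]

# Matching & deduplication: group records believed to be the same entity
groups = {}
for r in records:
    groups.setdefault(r["name_std"], []).append(r)

# Surviving: pick the best available value for each attribute in the group
conformed = []
for key, grp in groups.items():
    conformed.append({
        "name": key,
        "phone": next((r["phone"] for r in grp if r["phone"]), None),
    })

print(conformed)   # one master row per real-world customer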
5. Delivering Dimension Tables
The basic structure of a dimension
- Primary key (PK)
- Usually a surrogate key
- Referenced as a foreign key by fact tables
- Natural key (NK)
- Meaningful key extracted from the source systems
- 1-to-1 relationship to the PK for static dimensions
- 1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
- Descriptive attributes
- Primarily textual; numbers are legitimate, but not numbers that are measured quantities
- 100 such attributes is normal
- Static or slowly changing only
- Product price: can be either a fact or a dimension attribute
Smart keys in place of surrogate keys
- Smart key
- Natural key concatenated with a timestamp
- A wrong approach; reasons:
- Keys assume responsibility
- Changes may happen to the natural key, forcing changes to all fact table records that refer to it.
- Poor performance
- The key must be a CHAR or VARCHAR type, and comparisons are slower.
- Joins between CHAR/VARCHAR columns are slower.
- Heterogeneous source systems
- Common dimensions are sourced from different sources.
- Conflicts arise when a second source comes into the scene to source the key.
The grain of a dimension
- The grain of the dimension means the definition of the key of the dimension
- The grain of the customer dimension is the customer
- Verify that a given source (file) implements the intended grain
- Nothing should be returned by the following query against the source system/file if the fields A, B and C form the natural key
- If something is returned, the fields A, B and C do not represent the grain of the dimension
select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
The basic load plan for a dimension
- Data cleaning
- Validate the data, apply business rules to make the data consistent, column validity enforcement, cross-column value checking, row de-duplication
- Data conforming
- Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
- Data delivery
- All the steps required to deal with slowly changing dimensions
- Write the dimension to the physical table
- Creating and assigning the surrogate key, making sure the natural key is correct, etc.
Date and Time Dimensions
- Needed virtually everywhere: measurements are defined at specific times, repeated over time, etc.
- Most common: a calendar-day dimension with the grain of a single day and many attributes
- Doesn't have a conventional source
- Built by hand, e.g., in a spreadsheet (or generated, as in the sketch below)
- Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
- 10 years are about 4K rows
Date Dimension
- Note the natural key: a day type and a full date
- Day type: date, plus non-date types such as "inapplicable date", "corrupted date", "hasn't happened yet"
- Fact tables must point to a valid date from the dimension, so we need special date types, at least one: the N/A date
- How to generate the primary key?
- Surrogate keys, whose zero value is assigned to the standard starting date of the history of records
- Partitioning the warehouse on date helps improve performance and allows work on the current partition, such as creating and dropping indexes.
- Surrogate key 9999999 may be designed to mean N/A. This high value will keep it always in the current partition.
Other Time Dimensions
- If the grain is month, not day, then create a separate dimension table for month; similarly if the grain is week or some other domain-specific span.
- Creating a month view on the date dimension is not desirable, since it would drag the larger table into month-based queries; a monthly-grain dimension table has only about 120 records for 10 years.
- When the grain is seconds, it is not advisable to create a dimension row for every second (over 31 million rows per year); instead use the design shown below.
[Figure: fact table with a Calendar Date (FK) to the day-grain date dimension, other dimension FKs, a full SQL date-time column carrying the precise timestamp (a fact, not a dimension), and the other facts. This design is preferred over creating a dimension table record for every minute or second.]
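A minimal DDL sketch of the design above, with invented table and column names: the day-grain calendar date is a dimension foreign key, while the precise timestamp travels as an ordinary date-time fact column.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        calendar_date_key INTEGER NOT NULL,   -- FK to the day-grain date dimension
        product_key       INTEGER NOT NULL,   -- other dimension FKs ...
        store_key         INTEGER NOT NULL,
        transaction_ts    TEXT NOT NULL,      -- precise SQL date-time, not a dimension
        sales_amount      REAL,
        units_sold        INTEGER
    );
""")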
Big and Small Dimensions
- SMALL
- Examples: Transaction Type, Claim Status
- Tiny lookup tables with only a few records and one or more columns
- Built by typing into a spreadsheet and loading the data into the DW
- These dimensions should NOT be conformed
- JUNK dimension
- A tactical maneuver to reduce the number of FKs from a fact table by combining the low-cardinality values of small dimensions into a single junk dimension
- Generate rows as you go; don't generate the cartesian product (see the sketch after this slide)
- BIG
- Examples: Customer, Product, Location
- Millions of records with hundreds of fields (insurance customers)
- Or hundreds of millions of records with few fields (supermarket customers)
- Always derived from multiple sources
- These dimensions should be conformed
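A sketch of "generate as you go" for a junk dimension: a row is created only for combinations of low-cardinality flags actually seen in the incoming facts, instead of pre-building the full cartesian product. The flag names are invented.

junk_rows = {}          # (payment_type, order_status, gift_wrap) -> surrogate key

def junk_key(payment_type, order_status, gift_wrap):
    combo = (payment_type, order_status, gift_wrap)
    if combo not in junk_rows:
        junk_rows[combo] = len(junk_rows) + 1    # assign the next surrogate key
    return junk_rows[combo]

# Incoming fact records just ask for the key of their combination.
incoming = [("CARD", "SHIPPED", False), ("CASH", "SHIPPED", False),
            ("CARD", "SHIPPED", False)]
fact_fks = [junk_key(*flags) for flags in incoming]
print(fact_fks)     # -> [1, 2, 1]; only two junk dimension rows were created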
Dimensional Roles
- The same dimension can be attached to a fact table multiple times
- A sale has an order date, a payment date, a shipping date, a return date
- A claim transaction has multiple internal people: a claim intake person, a medical nurse, a claim adjuster, a payer, etc.
- End users might be confused, when they drill down into the dimension, about what the dimension represents, so it helps to give each role of the dimension a different name (e.g., via views, as sketched below)
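One common way to give each role its own name is a view per role over the single physical dimension; this is only a sketch with invented names, not the only way to implement roles.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

    -- Each role gets its own name (and renamed columns) for end users.
    CREATE VIEW order_date    AS SELECT date_key  AS order_date_key,
                                        full_date AS order_full_date    FROM dim_date;
    CREATE VIEW shipping_date AS SELECT date_key  AS shipping_date_key,
                                        full_date AS shipping_full_date FROM dim_date;
""")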
Type-2 Slowly Changing Dimensions
- When a record changes, instead of overwriting,
- create a new dimension record
- with a new surrogate key
- add the new record into the dimension table
- use this record going forward in all fact tables
- no fact tables need to change
- no aggregates need to be re-computed
- Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct
Type-2 Slowly Changing Dimensions (contd)
- Type-2 changes do not change the natural key (the natural key should never change)
Type-2 SCD: Precise Time Stamping
- With a Type-2 change, you might want to include the following additional attributes in the dimension:
- Date of change
- Exact time of change
- Reason for change
- Current flag (current/expired)
Type-2 SCD using CRC
- For small tables, use a brute-force approach of comparing every incoming field with every field in the DW to see if anything has changed
- For larger tables, use a CRC (cyclic redundancy checksum), a number of about 20 digits, like this:
- Treat the entire incoming record as a string
- Compute the CRC value of the string (some numeric value)
- Compare the CRC value of today's record with the CRC value of yesterday's record
- If the CRC values match, today's record is identical to yesterday's record
- If the CRC values do not match, do a field-by-field comparison to see what has changed
- Depending on whether the changed field is a Type-1, Type-2 or Type-3 change, do the necessary updates
- Most ETL packages and the Internet include CRC computation and comparison code (see the sketch below)
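A minimal change-detection sketch using zlib.crc32, a 32-bit checksum that is shorter than the roughly 20-digit value mentioned above but follows the same logic; the records are invented.

import zlib

def record_crc(record: dict) -> int:
    # Treat the entire incoming record as one string, in a fixed field order.
    as_string = "|".join(str(record[k]) for k in sorted(record))
    return zlib.crc32(as_string.encode("utf-8"))

yesterday = {"natural_key": "CUST-0001", "name": "Acme Corp", "city": "Pilani"}
today     = {"natural_key": "CUST-0001", "name": "Acme Corp", "city": "Jaipur"}

if record_crc(today) == record_crc(yesterday):
    pass   # identical record, nothing to do
else:
    # Field-by-field comparison to find what changed, then apply the
    # Type-1 / Type-2 / Type-3 handling that the changed column calls for.
    changed = [k for k in today if today[k] != yesterday[k]]
    print(changed)   # -> ['city']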
6. Delivering Fact Tables
The basic structure of a fact table
- Every fact table is defined by its grain
- in business terms
- in terms of the dimension foreign keys and other fields
- A set of foreign keys (FK)
- context for the facts
- join to dimension tables
- Degenerate dimensions
- part of the key
- not a foreign key to a dimension table
- Primary key
- a subset of the FKs
- must be defined in the table
- Surrogate keys
- carry no business meaning
- should not be delivered to the users
Guaranteeing Referential Integrity
- EVERY FACT TABLE ROW MUST BE FILLED WITH LEGITIMATE FOREIGN KEYS
- Check before loading
- Check before you add fact records
- Check before you delete dimension records
- Best approach
- Check while loading
- Serious performance issue while loading thousands of records
- Check after loading
- Theoretically OK for finding all violations
- Running the query over the entire fact table causes serious performance issues
- If the checking is restricted to data loaded only today, we make the assumption that the date foreign key is reliable (see the check below)
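A runnable sketch of a "check after loading" referential integrity query: find fact rows loaded today whose product key has no matching dimension row. Table, column and key values are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, load_date TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO fact_sales  VALUES (1, '2024-01-15', 10.0),
                                   (99, '2024-01-15', 5.0);  -- orphan key
""")
violations = conn.execute("""
    SELECT f.rowid, f.product_key
    FROM fact_sales f
    LEFT JOIN dim_product d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL
      AND f.load_date = '2024-01-15'       -- restrict the check to today's load
""").fetchall()
print(violations)   # -> [(2, 99)]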
Options for loading the Surrogate Keys of Dimensions
- Look up the current surrogate key in each dimension: fetch the record with the most current surrogate key for the natural key and use that surrogate key. A good option, but very slow.
- Maintain a surrogate key lookup table for each dimension. This table is updated whenever a new record is added or when a Type-2 update occurs in an existing dimensional entity.
- The dimensions must be updated with Type-2 updates before any facts are loaded into the data warehouse, to guarantee referential integrity.
Surrogate Key Pipeline
- Assume that all records to be added to the fact table are current, i.e.,
- in the incoming fact records, the value of the natural key of each dimension is the most current value known to the DW
- When loading a fact table, the final ETL step converts the natural keys of the new input records into the correct surrogate keys of the dimensions using the key mapping tables (as in the sketch below)
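A minimal in-memory sketch of the surrogate key pipeline: each incoming fact record's natural keys are swapped for surrogate keys via the mapping tables, represented here as plain dicts; all names and values are illustrative.

customer_map = {"CUST-0001": 1, "CUST-0002": 2}     # natural key -> surrogate key
product_map  = {"SKU-9": 17, "SKU-12": 18}

incoming_facts = [
    {"customer_nk": "CUST-0002", "product_nk": "SKU-9", "amount": 42.0},
]

loaded_facts = []
for fact in incoming_facts:
    loaded_facts.append({
        "customer_key": customer_map[fact["customer_nk"]],   # a KeyError here means
        "product_key":  product_map[fact["product_nk"]],     # a referential problem
        "amount":       fact["amount"],
    })

print(loaded_facts)   # -> [{'customer_key': 2, 'product_key': 17, 'amount': 42.0}]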
Surrogate Key Pipeline (contd)
Grains of Fact Tables
- Every fact table should belong to exactly one of the following grains:
- Transaction grain
- Periodic snapshot grain
- Accumulating snapshot grain
Transaction Grain Fact Tables
- The grain represents an instantaneous measurement at a specific point in space and time
- retail sales transaction
- The most detailed type of fact table
- Generally has a larger number of dimensions
- Each record corresponds to an instant of measurement
- Transactions are time-stamped to the nearest second or minute
- Fact table records arrive in batches at regular intervals
Periodic Snapshot Fact Tables
- The grain represents a span of time, periodically repeated
- A periodic snapshot for a checking account in a bank, reported every month:
- Beginning balance
- Ending balance
- Number of deposits
- Total of deposits
- Number of withdrawals
- Total of withdrawals
- Characterized by a larger number of facts
- Predictable sparseness
- one checking account record per account per month
Accumulating Snapshot Fact Tables
- The grain represents a finite process that has a definite beginning and end
- Order fulfillment, claim processing, small workflows
- But not large, complicated, looping workflows
- Example: shipment invoice line item
- Order date
- Requested ship date
- Actual ship date
- Delivery date
- Last payment date
- Return date
- Characteristics: a large number of date dimensions; data are created and overwritten multiple times as events unfold
Loading a Table
- Separate inserts from updates (see the sketch below)
- Identify the update records in the bulk data to be loaded and process them first
- Perform the insertion of the remaining records using a bulk loader
- Use a bulk loader
- To improve the performance of the inserts and decrease database overhead
- With many bulk loaders, updating is not possible
- Load in parallel
- If the ETL tool supports breaking the data into logical segments, say one per year, load the data in parallel
- Some ETL tools offer dynamic partitioning to help bulk loading
- Minimize physical updates
- To decrease the database overhead of writing rollback logs, delete the records that require updates and load them afresh using the bulk loader
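A sketch of separating updates from inserts before loading: rows whose natural key already exists in the target are routed to updates, the rest go to the bulk loader as inserts. The key set and records are invented for the example.

existing_keys = {"CUST-0001", "CUST-0002"}      # keys already in the target table

incoming = [
    {"natural_key": "CUST-0002", "city": "Jaipur"},   # existing -> update
    {"natural_key": "CUST-0003", "city": "Delhi"},    # new      -> bulk insert
]

updates = [r for r in incoming if r["natural_key"] in existing_keys]
inserts = [r for r in incoming if r["natural_key"] not in existing_keys]

# Process updates first (e.g., row-by-row SQL UPDATEs), then hand the inserts
# to the database's bulk loader (COPY, SQL*Loader, bcp, ...) in one pass.
print(len(updates), len(inserts))   # -> 1 1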
Updating and Correcting Facts
- Negate the fact
- Create an exact duplicate of the fact in which all the measures are negated, so the measures cancel each other in summaries (see the sketch below)
- Do it for audit reasons and/or if capturing/measuring erroneous entries is significant to the business
- Update the fact
- Delete and reload the fact
- Drawback: current versions of previously released reports are no longer valid
- Physical delete: the record is deleted
- Logical delete: the record is tagged as deleted
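A toy sketch of "negate the fact": the erroneous row is kept and a mirror row with negated measures is added, so totals cancel out while an audit trail remains. The fact record is invented.

erroneous_fact = {"product_key": 17, "date_key": 20240115, "amount": 42.0, "units": 3}

negation = dict(erroneous_fact)
negation["amount"] = -erroneous_fact["amount"]
negation["units"]  = -erroneous_fact["units"]

fact_table = [erroneous_fact, negation]
print(sum(f["amount"] for f in fact_table))   # -> 0.0, the error cancels in totals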
Graceful Modifications
- Modifications to the dimensional schema without affecting end-user applications
- Modifications include:
- Adding a fact to the existing fact table at the same grain
- Adding a dimension to the existing fact table at the same grain
- Adding an attribute to an existing dimension
- Increasing the granularity of existing fact and dimension tables
- A complicated job!
- Changing a weekly grain to a daily grain?
- Constraints on weekly aggregation only work with the historical data.
- Further additions can be at the daily grain
Aggregate tables
- The aggregate navigator sits between the user query and the DBMS, intercepting queries and transforming base-level queries into aggregate-aware queries wherever possible
- A good aggregate program
- Provides performance enhancement
- Adds only a reasonable amount of storage
- Is transparent to the end user
- Users are not aware of it.
- No user program can refer to aggregate tables directly!
- Affects the cost of the extract system as little as possible
- Each extract leads to updating the aggregates!
- Does not add much burden to the DBA
- Generates metadata automatically
Aggregate tables - Design Requirements
- Aggregates must be stored in their own fact tables, separate from the base-level data.
- Dimension tables attached to the aggregate tables must be shrunken versions of the corresponding dimensions (see the sketch below).
- The most shrunken version of a dimension is the dimension removed altogether.
- The base table and all its aggregate tables must be related together as a family of schemas, so that the aggregate navigator knows which tables are related.
- All queries from the users must be forced to refer to the base table only.
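A sketch of an aggregate fact table with a shrunken dimension: a daily sales base table is rolled up to monthly grain, and the date dimension shrinks to a month dimension. All names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, month_key INTEGER, full_date TEXT);
    CREATE TABLE dim_month  (month_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);

    -- Aggregate fact table at monthly grain, attached to the shrunken dimension.
    CREATE TABLE agg_sales_month (month_key INTEGER, product_key INTEGER, amount REAL);
""")
conn.execute("""
    INSERT INTO agg_sales_month (month_key, product_key, amount)
    SELECT d.month_key, f.product_key, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.month_key, f.product_key
""")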
Thank You