Title: Data Warehousing (SS ZG515) - Extraction, Transformation, Loading
Data Warehousing (SS ZG515): Extraction - Transformation - Loading
- Prof. Navneet Goyal / SP Vimal
- Computer Science Department
- BITS, Pilani
To discuss
- Requirements
- Data Structures
- Extraction
- Cleaning & Conforming
- Delivering Dimension Tables
- Delivering Fact Tables
1. Requirements
ETL
- "A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions."
- "ETL makes or breaks the data warehouse." - Ralph Kimball
Requirements
- Business Needs
- Information requirements of the end users.
- Captured by interviews with users and by independent investigation of the possible sources by the ETL team.
- Compliance Requirements
- Sarbanes-Oxley Act 2002 (deals with the regulation of corporate governance; more at http://www.soxlaw.com)
- Proof of the complete transaction flow that changed any data.
Requirements (contd)
- Data Profiling
- Systematic examination of the quality, scope and context of a data source.
- Helps the ETL team determine how much data-cleaning activity is required.
- "Data profiling employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure and quality of the data. A good data profiling system can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed." - Jack Olson
Requirements (contd)
- Security Requirements
- The ETL team has complete read/write access to the entire corporate data.
- ETL workstations on the company intranet are a major threat; keep them in a separate subnet behind a packet-filtering gateway.
- Secure the backups as well.
- Data Integration
- Identified as the conform step
- Conform Dimensions & Conform Facts
Requirements (contd)
- Data Latency
- How quickly can the data be delivered to the users?
- The ETL architecture has a direct impact on it.
- Archiving & Lineage
- Stage data after each major transformation, not just after all four steps (extract, clean, conform, deliver).
- Each archived/staged data set should have accompanying metadata.
- Tracking this lineage is explicitly required by certain compliance requirements.
Requirements (contd)
- End User Delivery Interfaces
- The ETL team is responsible for the content and structure of the data, making the end-user applications fast.
- Available Skills
- Expertise in building an ETL system around a vendor's tool
- Decision between hand-coded ETL and a vendor's package of ETL tools
- Legacy Licenses
- Management's insistence on using legacy licenses
Choice of Architecture: Tool-Based ETL
- Simpler, cheaper and faster development
- People with business skills but not much technical skill can use it.
- Automatically generates metadata
- Automatically generates data lineage and data dependency analysis
- Offers in-line encryption and compression capabilities
- Manages complex load balancing across servers
Choice of Architecture: Hand-Coded ETL
- Quality of the tool proven by exhaustive unit testing
- Better metadata
- Requirement may be just file-based processes, not database stored procedures
- Use of existing legacy routines
- Use of in-house programmers
- Unlimited flexibility
Choice of Architecture: Batch vs Streaming
- Batch Update
- ETL processing is done on periodic batch extracts from the source systems.
- Streaming Dataflow
- Data flows at the record level from the source system to the users' screens.
- Extract-Clean-Conform-Deliver must be modified to accommodate record-by-record processing.
- Arrival of data may not be predictable.
2. Data Structures
To stage or not to stage
- A conflict between
- getting the data from the operational systems as fast as possible, and
- having the ability to restart without repeating the process from the beginning.
- Reasons for staging
- Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after each major processing step (cleaning, transformation, etc.).
- Backup: the data warehouse can be reloaded from the staging tables without going back to the sources.
- Auditing: lineage between the source data and the underlying transformations before the load into the data warehouse.
Designing the staging area
- The staging area is owned by the ETL team
- no indexes, no aggregations, no presentation access, no querying, no service level agreements
- Users are not allowed in the staging area for any reason
- staging is a construction site
- Reports cannot access data in the staging area
- tables can be added or dropped without notifying the user community
- Controlled environment
Designing the staging area (contd)
- Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area).
- The staging area consists of both RDBMS tables and data files.
Staging Area Data Structures in the ETL System
- Flat files
- fast to write, append to, sort and filter (grep), but slow to update, access or join
- enable restart without going back to the sources
- XML Data Sets
- Used as a medium of data transfer between incompatible data sources
- Give enough information to create tables using CREATE TABLE
- Relational Tables
- Metadata, SQL interface, DBA support
Staging Area Data Structures in the ETL System (contd)
- Dimensional Model Constructs: Facts, Dimensions, Atomic Fact Tables, Aggregate Fact Tables (OLAP Cubes)
- Surrogate Key Mapping Tables
- map natural keys from the OLTP systems to the surrogate keys of the DW
- can be stored in files or in the RDBMS (see the sketch below)
3. Extraction
Logical data map
- Represented as a table with the following attributes (a small example follows):
- Target table and column, table type (Dimension, Fact)
- Slowly changing dimension type per target column of each dimension
- Type 1: overwrite (e.g., customer first name)
- Type 2: retain history (e.g., customer last name)
- Type 3: retain multiple valid alternative values
- Source database, table, column
- Transformations
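A couple of illustrative logical data map rows, written here as Python records; every table, column and transformation name is invented for the example.

logical_data_map = [
    {
        "target_table": "DIM_CUSTOMER", "target_column": "first_name",
        "table_type": "Dimension", "scd_type": 1,
        "source": "CRM.customers.fname",
        "transformation": "TRIM, INITCAP",
    },
    {
        "target_table": "FACT_SALES", "target_column": "sales_amount",
        "table_type": "Fact", "scd_type": None,
        "source": "POS.sales_lines.amount",
        "transformation": "convert currency to USD",
    },
]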
Logical Map development
- Have a plan!
- The logical map is provided by the data warehouse architect to the ETL team and serves as the specification of the ETL processes.
- Identify the data lineage between the data source and the target.
- Identify source candidates
- Analyze source systems with a data profiling tool
- Detect data anomalies, identify appropriate actions and document them.
- Identify the quality
Logical Map development (contd)
- Receive a walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
- data alterations during data cleansing, calculations and formulas
- standard conformance to dimensions and numerical facts
- Receive a walk-through of the dimensional model
- The objective of the ETL team is to deliver data to the dimensional model effectively, so an understanding of the dimensional model is helpful.
- Validate the calculations and formulas used in ETL.
Logical Map development (contd)
- A complete logical map cannot exist until all the source systems are identified and analyzed.
- Analysis of sources
- Data discovery phase
- Collecting and documenting source systems
- Keeping track of source systems
- Identify ownership: who is responsible for the content, its storage and usage statistics
- Determine the system-of-record (source of data)
- Identify the source when redundant sources coexist
- Analyze the source systems for any relationships between tables
- Anomaly detection phase (data content analysis), e.g., as in the sketch below
- NULL values
- Dates in non-date fields
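A minimal anomaly-detection sketch for one source column: it counts NULLs and values that fail to parse as dates. The file name, column name and date format are assumptions.

import csv
from datetime import datetime

null_count, bad_dates, total = 0, 0, 0
with open("source_extract.csv", newline="") as f:          # hypothetical extract file
    for row in csv.DictReader(f):
        total += 1
        value = row.get("order_date")                       # hypothetical column
        if value in (None, "", "NULL"):
            null_count += 1
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            bad_dates += 1                                   # a date in a non-date format

print(f"{total} rows: {null_count} NULLs, {bad_dates} unparsable dates")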
Some good rules
- Analyze your source system
- get an ER model for the system or reverse-engineer one (develop one by looking at the metadata of the system)
- reverse engineering is not the same as forward engineering, i.e., deriving the dimensional schema of the data warehouse from the ER models of the source systems
- Reverse engineering helps in understanding, for the source systems:
- unique identifiers and natural keys
- data types
- relationships between tables
Extract data from disparate systems
- ODBC Manager
- Accepts SQL from ETL applications and routes it through the appropriate ODBC driver
- ODBC can provide a common gateway to diverse sources (see the sketch below)
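A sketch of extraction through ODBC from Python, assuming the pyodbc package and a DSN named SALES_SRC configured in the ODBC manager; the query, table and credentials are illustrative. Because only the connection string names the source, the same code can read from many different systems.

import pyodbc

conn = pyodbc.connect("DSN=SALES_SRC;UID=etl_user;PWD=secret")
cursor = conn.cursor()
cursor.execute(
    "SELECT order_id, customer_id, order_date, amount "
    "FROM orders WHERE order_date >= ?", ["2024-01-01"]
)
while True:
    batch = cursor.fetchmany(10000)      # stream in batches, not the whole table
    if not batch:
        break
    for row in batch:
        pass                              # write the row to the staging area here
conn.close()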
Different sources
- Mainframe Sources
- Flat Files
- XML sources
- Web Log sources
- ERP system sources
Tips for Extracting
- Constrain on indexed columns
- Retrieve only the data you need
- Do not retrieve an entire table and then select from it.
- Use DISTINCT and set operations sparingly
- Check whether the slower DISTINCT and set operations (UNION, MINUS, INTERSECT) can be avoided.
- Make a best effort to avoid the NOT operator, which normally forces a full scan.
- Avoid functions in the WHERE clause
- Difficult to avoid
- At least try different techniques before using functions.
- Avoid subqueries (a worked example follows)
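An illustrative rewrite of an extraction query following the tips above; Oracle-flavored SQL is shown only as an example, and all table, column and index names are assumptions.

# Avoid: a function on the (indexed) date column and a subquery.
slow_sql = """
SELECT * FROM orders
WHERE TRUNC(order_date) = TO_DATE('2024-01-15', 'YYYY-MM-DD')
  AND customer_id IN (SELECT customer_id FROM customers WHERE region = 'EU')
"""

# Prefer: a range predicate on the indexed column, a join instead of a
# subquery, and only the columns actually needed downstream.
fast_sql = """
SELECT o.order_id, o.customer_id, o.order_date, o.amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2024-01-15'
  AND o.order_date <  DATE '2024-01-16'
  AND c.region = 'EU'
"""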
Do it also!
- When a dimension is populated by several distinct systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, to give them confidence that the dimension reflects data they can tie back to their transaction system.
4. Cleaning & Conforming
Cleaning and Conforming
- While the extracting and loading parts of an ETL process simply move data, the cleaning and conforming part is the transformation part that truly adds value.
- How do we deal with dirty data?
- Data profiling report
- The error event fact table
- Audit dimension
- Challenges
- Completeness vs speed
- Corrective vs transparent
- A system that is too corrective hides/obscures the operational deficiencies and slows organizational progress.
Defining Data Quality
- The basic definition of data quality is data accuracy, and that means
- Correct: the values of the data are valid, e.g., my resident state is PA
- Unambiguous: the values of the data can mean only one thing, e.g., there is only one PA
- Consistent: the values of the data use the same format, e.g., PA and not Penn or Pennsylvania
- Complete: data are not null, and aggregates do not lose data records somewhere in the information flow
Cleaning Deliverables
- Keep accurate records of the types of data quality problems you look for, when you look, what you look at, etc.
- Is data quality getting better or worse?
- Which source systems generate the most data quality errors?
- Is there any correlation between data quality levels and the performance of the organization as a whole?
Data Profiling Deliverable
- Start before building the ETL system
- Data profiling analysis including
- Schema definitions
- Business objects
- Domains
- Data Sources
- Table definitions
- Synonyms
- Data rules
- Value rules
- Issues that need to be addressed
Error Event Table Deliverable
- Built as a star schema
- Each data quality error or issue is added to the table (a minimal schema sketch follows)
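A minimal sketch of an error event star schema with one fact table and two dimensions; all table and column names are assumptions made for illustration, not a published design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_batch (
        batch_key   INTEGER PRIMARY KEY,
        batch_name  TEXT,
        load_date   TEXT
    );
    CREATE TABLE dim_screen (                 -- the data quality check that fired
        screen_key  INTEGER PRIMARY KEY,
        screen_name TEXT,
        severity    TEXT
    );
    CREATE TABLE fact_error_event (
        error_event_key INTEGER PRIMARY KEY,
        batch_key       INTEGER REFERENCES dim_batch(batch_key),
        screen_key      INTEGER REFERENCES dim_screen(screen_key),
        source_table    TEXT,
        source_row_id   TEXT,
        error_count     INTEGER               -- one row per error or issue found
    );
""")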
Conforming
- Integration of data
- A conformed product dimension is the enterprise's agreed-upon master list of products, including all attributes. It can be considered a master table of products with a clean surrogate product key and all relevant attributes.
- Processes of conforming (a small sketch follows)
- Standardizing
- Matching and deduplication
- Surviving
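A toy sketch of the three conforming steps for customer records arriving from two sources; matching here is just on a standardized name, which is far simpler than real record-linkage logic, and all values are invented.

records = [
    {"source": "CRM", "name": "ACME Corp.", "phone": "555-0100"},
    {"source": "ERP", "name": "Acme Corp",  "phone": None},
]

def standardize(rec):
    rec["name_std"] = rec["name"].upper().rstrip(".").strip()
    return rec

# Standardizing
records = [standardize(r) for r in records]

# Matching & deduplication: group records believed to be the same entity
groups = {}
for r in records:
    groups.setdefault(r["name_std"], []).append(r)

# Surviving: pick the best available value for each attribute in the group
conformed = []
for key, grp in groups.items():
    conformed.append({
        "name": key,
        "phone": next((r["phone"] for r in grp if r["phone"]), None),
    })

print(conformed)   # one master row per real-world customer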
5. Delivering Dimension Tables
The basic structure of a dimension
- Primary key (PK)
- Usually a surrogate key
- Referenced as a foreign key by fact tables
- Natural key (NK)
- Meaningful key extracted from the source systems
- 1-to-1 relationship to the PK for static dimensions
- 1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
- Descriptive attributes
- Primarily textual; numbers are legitimate, but not numbers that are measured quantities
- 100 such attributes is normal
- Static or slowly changing only
- Product price: can be either a fact or a dimension attribute
Smart keys in place of surrogate keys
- Smart key
- Natural key concatenated with a timestamp
- A wrong approach; reasons:
- Keys assume responsibility
- Changes may happen to the natural key, forcing changes to all fact table records that refer to it.
- Poor performance
- The key must be a CHAR or VARCHAR type, and comparisons are slower.
- Joins between CHAR/VARCHAR columns are slower.
- Heterogeneous source systems
- Common dimensions are sourced from different sources.
- Conflicts arise when a second source comes into the scene to source the key.
The grain of a dimension
- The grain of the dimension means the definition of the key of the dimension
- The grain of the customer dimension is the customer
- Verify that a given source (file) implements the intended grain
- Nothing should be returned by the following query against the source system/file if the fields A, B and C form the natural key
- If something is returned, the fields A, B and C do not represent the grain of the dimension
select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
The basic load plan for a dimension
- Data cleaning
- Validate the data, apply business rules to make the data consistent, column validity enforcement, cross-column value checking, row de-duplication
- Data conforming
- Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
- Data delivery
- All the steps required to deal with slowly changing dimensions
- Write the dimension to the physical table
- Creating and assigning the surrogate key, making sure the natural key is correct, etc.
Date and Time Dimensions
- Needed virtually everywhere: measurements are defined at specific times, repeated over time, etc.
- Most common: a calendar-day dimension with the grain of a single day and many attributes
- Doesn't have a conventional source
- Built by hand, e.g., in a spreadsheet (or generated, as in the sketch below)
- Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
- 10 years are about 4K rows
Date Dimension
- Note the natural key: a day type and a full date
- Day type: date, plus non-date types such as "inapplicable date", "corrupted date", "hasn't happened yet"
- Fact tables must point to a valid date from the dimension, so we need special date types, at least one: the N/A date
- How to generate the primary key?
- Surrogate keys, whose zero value is assigned to the standard starting date of the history of records
- Partitioning the warehouse on date helps improve performance and allows work on the current partition, such as creating and dropping indexes.
- Surrogate key 9999999 may be designed to mean N/A. This high value will keep it always in the current partition.
Other Time Dimensions
- If the grain is month, not day, then create a separate dimension table for month; similarly if the grain is week or some other domain-specific span.
- Creating a month view on the date dimension is not desirable, since it would drag the larger table into month-based queries; a monthly-grain dimension table has only about 120 records for 10 years.
- When the grain is seconds, it is not advisable to create a dimension row for every second (over 31 million rows per year); instead use the design shown below.
[Figure: fact table with a Calendar Date (FK) to the day-grain date dimension, other dimension FKs, a full SQL date-time column carrying the precise timestamp (a fact, not a dimension), and the other facts. This design is preferred over creating a dimension table record for every minute or second.]
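A minimal DDL sketch of the design above, with invented table and column names: the day-grain calendar date is a dimension foreign key, while the precise timestamp travels as an ordinary date-time fact column.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        calendar_date_key INTEGER NOT NULL,   -- FK to the day-grain date dimension
        product_key       INTEGER NOT NULL,   -- other dimension FKs ...
        store_key         INTEGER NOT NULL,
        transaction_ts    TEXT NOT NULL,      -- precise SQL date-time, not a dimension
        sales_amount      REAL,
        units_sold        INTEGER
    );
""")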
Big and Small Dimensions
- SMALL
- Examples: Transaction Type, Claim Status
- Tiny lookup tables with only a few records and one or more columns
- Built by typing into a spreadsheet and loading the data into the DW
- These dimensions should NOT be conformed
- JUNK dimension
- A tactical maneuver to reduce the number of FKs from a fact table by combining the low-cardinality values of small dimensions into a single junk dimension
- Generate rows as you go; don't generate the cartesian product (see the sketch after this slide)
- BIG
- Examples: Customer, Product, Location
- Millions of records with hundreds of fields (insurance customers)
- Or hundreds of millions of records with few fields (supermarket customers)
- Always derived from multiple sources
- These dimensions should be conformed
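A sketch of "generate as you go" for a junk dimension: a row is created only for combinations of low-cardinality flags actually seen in the incoming facts, instead of pre-building the full cartesian product. The flag names are invented.

junk_rows = {}          # (payment_type, order_status, gift_wrap) -> surrogate key

def junk_key(payment_type, order_status, gift_wrap):
    combo = (payment_type, order_status, gift_wrap)
    if combo not in junk_rows:
        junk_rows[combo] = len(junk_rows) + 1    # assign the next surrogate key
    return junk_rows[combo]

# Incoming fact records just ask for the key of their combination.
incoming = [("CARD", "SHIPPED", False), ("CASH", "SHIPPED", False),
            ("CARD", "SHIPPED", False)]
fact_fks = [junk_key(*flags) for flags in incoming]
print(fact_fks)     # -> [1, 2, 1]; only two junk dimension rows were created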
Dimensional Roles
- The same dimension can be attached to a fact table multiple times
- A sale has an order date, a payment date, a shipping date, a return date
- A claim transaction has multiple internal people: a claim intake person, a medical nurse, a claim adjuster, a payer, etc.
- End users might be confused, when they drill down into the dimension, about what the dimension represents, so it helps to give each role of the dimension a different name (e.g., via views, as sketched below)
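One common way to give each role its own name is a view per role over the single physical dimension; this is only a sketch with invented names, not the only way to implement roles.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

    -- Each role gets its own name (and renamed columns) for end users.
    CREATE VIEW order_date    AS SELECT date_key  AS order_date_key,
                                        full_date AS order_full_date    FROM dim_date;
    CREATE VIEW shipping_date AS SELECT date_key  AS shipping_date_key,
                                        full_date AS shipping_full_date FROM dim_date;
""")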
Type-2 Slowly Changing Dimensions
- When a record changes, instead of overwriting,
- create a new dimension record
- with a new surrogate key
- add the new record into the dimension table
- use this record going forward in all fact tables
- no fact tables need to change
- no aggregates need to be re-computed
- Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct
Type-2 Slowly Changing Dimensions (contd)
- Type-2 changes do not change the natural key (the natural key should never change)
Type-2 SCD: Precise Time Stamping
- With a Type-2 change, you might want to include the following additional attributes in the dimension:
- Date of change
- Exact time of change
- Reason for change
- Current flag (current/expired)
Type-2 SCD using CRC
- For small tables, use a brute-force approach of comparing every incoming field with every field in the DW to see if anything has changed
- For larger tables, use a CRC (cyclic redundancy checksum), a number of about 20 digits, like this:
- Treat the entire incoming record as a string
- Compute the CRC value of the string (some numeric value)
- Compare the CRC value of today's record with the CRC value of yesterday's record
- If the CRC values match, today's record is identical to yesterday's record
- If the CRC values do not match, do a field-by-field comparison to see what has changed
- Depending on whether the changed field is a Type-1, Type-2 or Type-3 change, do the necessary updates
- Most ETL packages and the Internet include CRC computation and comparison code (see the sketch below)
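A minimal change-detection sketch using zlib.crc32, a 32-bit checksum that is shorter than the roughly 20-digit value mentioned above but follows the same logic; the records are invented.

import zlib

def record_crc(record: dict) -> int:
    # Treat the entire incoming record as one string, in a fixed field order.
    as_string = "|".join(str(record[k]) for k in sorted(record))
    return zlib.crc32(as_string.encode("utf-8"))

yesterday = {"natural_key": "CUST-0001", "name": "Acme Corp", "city": "Pilani"}
today     = {"natural_key": "CUST-0001", "name": "Acme Corp", "city": "Jaipur"}

if record_crc(today) == record_crc(yesterday):
    pass   # identical record, nothing to do
else:
    # Field-by-field comparison to find what changed, then apply the
    # Type-1 / Type-2 / Type-3 handling that the changed column calls for.
    changed = [k for k in today if today[k] != yesterday[k]]
    print(changed)   # -> ['city']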
6. Delivering Fact Tables
The basic structure of a fact table
- Every fact table is defined by its grain
- in business terms
- in terms of the dimension foreign keys and other fields
- A set of foreign keys (FK)
- context for the facts
- join to dimension tables
- Degenerate dimensions
- part of the key
- not a foreign key to a dimension table
- Primary key
- a subset of the FKs
- must be defined in the table
- Surrogate keys
- carry no business meaning
- should not be delivered to the users
Guaranteeing Referential Integrity
- EVERY FACT TABLE ROW MUST BE FILLED WITH LEGITIMATE FOREIGN KEYS
- Check before loading
- Check before you add fact records
- Check before you delete dimension records
- Best approach
- Check while loading
- Serious performance issue while loading thousands of records
- Check after loading
- Theoretically OK for finding all violations
- Running the query over the entire fact table causes serious performance issues
- If the checking is restricted to data loaded only today, we make the assumption that the date foreign key is reliable (see the check below)
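A runnable sketch of a "check after loading" referential integrity query: find fact rows loaded today whose product key has no matching dimension row. Table, column and key values are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, load_date TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Widget');
    INSERT INTO fact_sales  VALUES (1, '2024-01-15', 10.0),
                                   (99, '2024-01-15', 5.0);  -- orphan key
""")
violations = conn.execute("""
    SELECT f.rowid, f.product_key
    FROM fact_sales f
    LEFT JOIN dim_product d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL
      AND f.load_date = '2024-01-15'       -- restrict the check to today's load
""").fetchall()
print(violations)   # -> [(2, 99)]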
Options for loading the Surrogate Keys of Dimensions
- Look up the current surrogate key in each dimension: fetch the record with the most current surrogate key for the natural key and use that surrogate key. A good option, but very slow.
- Maintain a surrogate key lookup table for each dimension. This table is updated whenever a new record is added or when a Type-2 update occurs in an existing dimensional entity.
- The dimensions must be updated with Type-2 updates before any facts are loaded into the data warehouse, to guarantee referential integrity.
Surrogate Key Pipeline
- Assume that all records to be added to the fact table are current, i.e.,
- in the incoming fact records, the value of the natural key of each dimension is the most current value known to the DW
- When loading a fact table, the final ETL step converts the natural keys of the new input records into the correct surrogate keys of the dimensions using the key mapping tables (as in the sketch below)
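A minimal in-memory sketch of the surrogate key pipeline: each incoming fact record's natural keys are swapped for surrogate keys via the mapping tables, represented here as plain dicts; all names and values are illustrative.

customer_map = {"CUST-0001": 1, "CUST-0002": 2}     # natural key -> surrogate key
product_map  = {"SKU-9": 17, "SKU-12": 18}

incoming_facts = [
    {"customer_nk": "CUST-0002", "product_nk": "SKU-9", "amount": 42.0},
]

loaded_facts = []
for fact in incoming_facts:
    loaded_facts.append({
        "customer_key": customer_map[fact["customer_nk"]],   # a KeyError here means
        "product_key":  product_map[fact["product_nk"]],     # a referential problem
        "amount":       fact["amount"],
    })

print(loaded_facts)   # -> [{'customer_key': 2, 'product_key': 17, 'amount': 42.0}]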
Surrogate Key Pipeline (contd)
Grains of Fact Tables
- Every fact table should belong to exactly one of the following grains:
- Transaction grain
- Periodic snapshot grain
- Accumulating snapshot grain
Transaction Grain Fact Tables
- The grain represents an instantaneous measurement at a specific point in space and time
- retail sales transaction
- The most detailed type of fact table
- Generally has a larger number of dimensions
- Each record corresponds to an instant of measurement
- Transactions are time-stamped to the nearest second or minute
- Fact table records arrive in batches at regular intervals
Periodic Snapshot Fact Tables
- The grain represents a span of time, periodically repeated
- A periodic snapshot for a checking account in a bank, reported every month:
- Beginning balance
- Ending balance
- Number of deposits
- Total of deposits
- Number of withdrawals
- Total of withdrawals
- Characterized by a larger number of facts
- Predictable sparseness
- one checking account record per account per month
Accumulating Snapshot Fact Tables
- The grain represents a finite process that has a definite beginning and end
- Order fulfillment, claim processing, small workflows
- But not large, complicated, looping workflows
- Example: shipment invoice line item
- Order date
- Requested ship date
- Actual ship date
- Delivery date
- Last payment date
- Return date
- Characteristics: a large number of date dimensions; data are created and overwritten multiple times as events unfold
Loading a Table
- Separate inserts from updates (see the sketch below)
- Identify the update records in the bulk data to be loaded and process them first
- Perform the insertion of the remaining records using a bulk loader
- Use a bulk loader
- To improve the performance of the inserts and decrease database overhead
- With many bulk loaders, updating is not possible
- Load in parallel
- If the ETL tool supports breaking the data into logical segments, say one per year, load the data in parallel
- Some ETL tools offer dynamic partitioning to help bulk loading
- Minimize physical updates
- To decrease the database overhead of writing rollback logs, delete the records that require updates and load them afresh using the bulk loader
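A sketch of separating updates from inserts before loading: rows whose natural key already exists in the target are routed to updates, the rest go to the bulk loader as inserts. The key set and records are invented for the example.

existing_keys = {"CUST-0001", "CUST-0002"}      # keys already in the target table

incoming = [
    {"natural_key": "CUST-0002", "city": "Jaipur"},   # existing -> update
    {"natural_key": "CUST-0003", "city": "Delhi"},    # new      -> bulk insert
]

updates = [r for r in incoming if r["natural_key"] in existing_keys]
inserts = [r for r in incoming if r["natural_key"] not in existing_keys]

# Process updates first (e.g., row-by-row SQL UPDATEs), then hand the inserts
# to the database's bulk loader (COPY, SQL*Loader, bcp, ...) in one pass.
print(len(updates), len(inserts))   # -> 1 1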
Updating and Correcting Facts
- Negate the fact
- Create an exact duplicate of the fact in which all the measures are negated, so the measures cancel each other in summaries (see the sketch below)
- Do it for audit reasons and/or if capturing/measuring erroneous entries is significant to the business
- Update the fact
- Delete and reload the fact
- Drawback: current versions of previously released reports are no longer valid
- Physical delete: the record is deleted
- Logical delete: the record is tagged as deleted
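A toy sketch of "negate the fact": the erroneous row is kept and a mirror row with negated measures is added, so totals cancel out while an audit trail remains. The fact record is invented.

erroneous_fact = {"product_key": 17, "date_key": 20240115, "amount": 42.0, "units": 3}

negation = dict(erroneous_fact)
negation["amount"] = -erroneous_fact["amount"]
negation["units"]  = -erroneous_fact["units"]

fact_table = [erroneous_fact, negation]
print(sum(f["amount"] for f in fact_table))   # -> 0.0, the error cancels in totals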
Graceful Modifications
- Modifications to the dimensional schema without affecting end-user applications
- Modifications include:
- Adding a fact to the existing fact table at the same grain
- Adding a dimension to the existing fact table at the same grain
- Adding an attribute to an existing dimension
- Increasing the granularity of existing fact and dimension tables
- A complicated job!
- Changing a weekly grain to a daily grain?
- Constraints on weekly aggregation only work with the historical data.
- Further additions can be at the daily grain
Aggregate tables
- The aggregate navigator sits between the user query and the DBMS, intercepting queries and transforming base-level queries into aggregate-aware queries wherever possible
- A good aggregate program
- Provides performance enhancement
- Adds only a reasonable amount of storage
- Is transparent to the end user
- Users are not aware of it.
- No user program can refer to aggregate tables directly!
- Affects the cost of the extract system as little as possible
- Each extract leads to updating the aggregates!
- Does not add much burden to the DBA
- Generates metadata automatically
Aggregate tables - Design Requirements
- Aggregates must be stored in their own fact tables, separate from the base-level data.
- Dimension tables attached to the aggregate tables must be shrunken versions of the corresponding dimensions (see the sketch below).
- The most shrunken version of a dimension is the dimension removed altogether.
- The base table and all its aggregate tables must be related together as a family of schemas, so that the aggregate navigator knows which tables are related.
- All queries from the users must be forced to refer to the base table only.
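A sketch of an aggregate fact table with a shrunken dimension: a daily sales base table is rolled up to monthly grain, and the date dimension shrinks to a month dimension. All names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, month_key INTEGER, full_date TEXT);
    CREATE TABLE dim_month  (month_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);

    -- Aggregate fact table at monthly grain, attached to the shrunken dimension.
    CREATE TABLE agg_sales_month (month_key INTEGER, product_key INTEGER, amount REAL);
""")
conn.execute("""
    INSERT INTO agg_sales_month (month_key, product_key, amount)
    SELECT d.month_key, f.product_key, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.month_key, f.product_key
""")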
Thank You