1
Data Warehouse Lecture 3
  • ETL Tools, Techniques and Best Practices

2
Agenda
  • ETL tools, techniques, methodologies, best
    practices, delivering dimension tables to the DW
  • Sources used for this lecture
  • Ralph Kimball, Joe Caserta, The Data Warehouse
    ETL Toolkit: Practical Techniques for
    Extracting, Cleaning, Conforming and Delivering
    Data

3
ETL Solution Desired Features
An ETL tool reads data from one or more sources,
transforms the data so that it is compatible with a
destination, and loads the data into that
destination.
  • Automated data movement across data stores and
    the analytical area in support of a data
    warehouse, data mart or ODS effort
  • Extensible, robust, scalable infrastructure
  • Standardization of ETL processes across the
    enterprise
  • Reusability of custom and pre-built functions
  • Better utilization of existing hardware resources
  • Faster change-control management
  • Integrated meta-data management
  • Complete development environment, with a "work as
    you think" design metaphor

4
Custom ETL Solutions
  • Execute stored procedures at the source to
    extract data, perhaps filtering some
  • Store the extracted data into staging tables at
    the source
  • Generate a file with the source staging data
  • Transfer the file into the destination
  • Load the data into staging tables at the
    destination
  • Execute stored procedures at the destination to
    read the incoming staged data and compute what
    changed since the last time by comparing with the
    current value of the data warehouse
  • Populate the data warehouse with new changes
  • Supports incremental updates to sources
  • Supports simple data cleansing
  • Logs warning/errors

5
Vendor ETL tools
  • DataStage
  • big player, high ability to execute, gets good
    results from Gartner, visionary
  • Informatica
  • another big player, gets good results from
    Gartner, visionary with high ability to execute
  • most expensive
  • SAS ETL Server
  • fast becoming a major player, very positive
    results from Gartner
  • low exposure as an ETL tool (SAS a significant
    statistical analysis vendor)
  • Information Builders' Data Migrator/ETL Manager
    tool suite
  • part of Enterprise Focus/WebFocus
  • not a major player but industrial strength
    language, data connectors, etc
  • Sunopsis
  • cheap
  • relies on native RDBMS functionality
  • CIGNA people exposed to it at conferences liked
    it

6
Buy versus Build
  • Vendor tools promote standardization of the ETL
    process, reusability of custom and pre-built
    functions, lowering the time (and cost) of
    additional ETL efforts
  • Vendor ETL tools are somewhat self-documenting
  • Many tools can connect to a variety of sources
    (RDBMSs, non-relational, different OS, ERP,
    PeopleSoft, etc) without exposing the ETL
    developer to the differences
  • Vendor tools deal with changes in the source
    systems or the necessary transformations better,
    reducing long term ETL maintenance
  • Meta-data management is a huge advantage,
    especially when sharing data from many
    applications
  • Vendor ETL tools impose discipline on the
    extraction and transformation process
  • ETL prices have not dropped much over the years,
    but there is increased functionality,
    performance, and usability from vendor tools

7
To stage or not to stage
  • A conflict between
  • getting the data from the operational systems as
    fast as possible
  • having the ability to restart without repeating
    the process from the beginning
  • Reasons for staging
  • Recoverability: stage the data as soon as it has
    been extracted from the source systems and
    immediately after major processing (cleaning,
    transformation, etc.)
  • Backup: can reload the data warehouse from the
    staging tables without going to the sources
  • Auditing: lineage between the source data and the
    underlying transformations before the load to the
    data warehouse

8
Designing the staging area
  • The staging area is owned by the ETL team
  • no indexes, no aggregations, no presentation
    access, no querying, no service level agreements
  • Users are not allowed in the staging area for any
    reason
  • staging is a construction site
  • Reports cannot access data in the staging area
  • tables can be added or dropped without notifying
    the user community
  • Only ETL processes can read/write the staging
    area (ETL developers must capture table names,
    update strategies, load frequency, ETL jobs,
    expected growth and other details about the
    staging area)
  • The staging area consists of both RDBMS tables
    and data files

9
Staging Area Data Structures in the ETL System
  • Flat files
  • fast to write, append to, sort and filter (grep)
    but slow to update, access or join
  • enables restart without going to the sources
  • XML Data Sets (not really used in Staging)
  • Relational Tables
  • Metadata, SQL interface, DBA support
  • Dimensional Model Constructs: Facts, Dimensions,
    Atomic Fact tables, Aggregate Fact Tables (OLAP
    Cubes)
  • Surrogate Key Mapping Tables (see the sketch at
    the end of this slide)
  • map natural keys from the OLTP systems to the
    surrogate keys of the DW
  • can be stored in files or the RDBMS (but you can
    use the IDENTITY function if you go with the
    RDBMS approach)
  • Best Practices about these data structures
  • perform impact analysis, capture metadata, use
    naming conventions, etc.
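As a minimal sketch of a surrogate key mapping table
(table and column names are illustrative, not taken
from the lecture), assuming generic ANSI SQL:

-- Hypothetical mapping of OLTP natural keys to DW surrogate keys
CREATE TABLE map_customer_key (
    source_system  VARCHAR(20) NOT NULL,  -- which OLTP system the key came from
    natural_key    VARCHAR(40) NOT NULL,  -- key as it appears in that source
    surrogate_key  INTEGER     NOT NULL,  -- key used by the DW dimension
    PRIMARY KEY (source_system, natural_key)
);

-- During a load, resolve incoming natural keys to DW surrogate keys
SELECT s.customer_id, m.surrogate_key
FROM stage_customer s
JOIN map_customer_key m
  ON m.source_system = s.source_system
 AND m.natural_key   = s.customer_id;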

10
Extracting
  • Effectively integrate data from
  • different DBMS, OS, H/W, communication protocols
  • need a logical map, data movement view documents,
    data lineage report
  • have a plan
  • identify source candidates
  • analyze source systems with a data profiling tool
  • receive walk-through of data lineage and business
    rules (from the DW architect and business analyst
    to the ETL developer)
  • data alterations during data cleansing,
    calculations and formulas
  • measure twice, cut once
  • standard conformance to dimensions and numerical
    facts
  • receive walk-through of the dimensional model

11
Components of the Data Movement Views
  • Target table and column, table type (Dimension,
    Fact)
  • Slow-changing dimension type per target column
  • Type 1, overwrite (Customer first name)
  • Type 2, retain history (Customer last name)
  • Type 3, retain multiple valid alternative values
  • Source database, table, column
  • Transformations (the guts of the document)
  • Where do you capture this information? Which
    tool? How do you maintain this metadata?

12
Keeping track of source systems
  • Data modelers and ETL developers should maintain
    a documentation of the source systems including
  • subject area, application name, business name,
    department name
  • priority
  • business owner and technical owner
  • DBMS, production server
  • DB size, users, complexity, transactions per
    day
  • comments
  • Determine the system of record

13
Some good rules
  • Dealing with derived data (derive from base facts
    or accepted calculated columns from the source
    systems?)
  • if calculations are done by the DW
    infrastructure, the ETL developer is responsible
    for them
  • and what if the numbers don't match?
  • recommendation: stay true to the definition of the
    system-of-record
  • The further downstream you go from the
    originating data source, the more you increase
    the risk of extracting corrupt data. Barring rare
    exceptions, maintain the practice of sourcing
    data only from the system-of-record.
  • Analyze your source system
  • get an ER model for the system or reverse-engineer
    one (develop one by looking at the metadata of the
    system)
  • reverse engineering is not the same as forward
    engineering, i.e., given the ER-models of the
    source systems derive the dimensional schema of
    the data warehouse

14
Data Analysis
  • Reverse engineering: building an understanding of
    a source system
  • unique identifiers and natural keys
  • data types
  • relationships between tables (1-to-1, many-to-1,
    many to many), problematic when source database
    does not have foreign keys defined
  • discrete relationships (static data, reference
    tables)
  • Data content analysis
  • NULL values, especially in foreign keys; NULLs
    result in lossy joins
  • In spite of the most detailed analysis, we
    recommend using outer join logic when extracting
    from relational source systems, simply because
    referential integrity often cannot be trusted on
    remote systems (see the sketch at the end of this
    slide)
  • Dates in non-date fields
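To illustrate the outer-join recommendation above, a
sketch with illustrative table names; the LEFT OUTER
JOIN keeps order rows whose customer_id is NULL or has
no matching customer, rows an inner join would
silently drop:

SELECT o.order_id, o.customer_id, c.customer_name
FROM src_orders o
LEFT OUTER JOIN src_customers c
  ON c.customer_id = o.customer_id;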

15
Extract data from disparate systems
  • What is the standard for the enterprise?
  • ODBC, OLE DB, JDBC, .NET
  • access databases from windows applications, so
    that applications are portable
  • performance is major drawback
  • every DBMS has an ODBC driver, even flat files
  • Adds two layers of interaction between the ETL
    and the database

16
Extracting from different sources
  • Mainframe
  • COBOL copybooks give you the datatypes
  • EBCDIC and not ASCII character set (FTP does the
    translation between the mainframe and
    Unix/Windows)
  • Working with redefined fields (To save space the
    same field is used for different types of data)
  • Extracting from IMS, IDMS, Adabas
  • you need special adapters or you get someone in
    those systems to give you a file
  • XML sources, Web log files: doable, if you
    understand the structure of those sources
  • Enterprise-Resource-Planning ERP Systems (SAP,
    PeopleSoft, Oracle)
  • Don't treat it as a relational system -- it's a
    mess
  • Use adapters

17
Extracting Changed Data
  • Using Audit Columns
  • Use the last update timestamp, populated by
    triggers or the front-end application (see the
    sketch at the end of this slide)
  • Must ensure that the timestamp is dependable,
    that is, if the front-end modifies it, a batch job
    does not override it
  • Index the timestamp if it is dependable
  • Database log scraping or sniffing
  • Take the log of the source file and try to
    determine the transactions that affect you
  • Sniffing does it in real time
  • Timed extracts
  • Retrieve all records from the source that were
    modified today
  • POTENTIALLY dangerous -- what if the process
    fails today? When it runs tomorrow, you'd have
    lost today's changes
  • Process of elimination
  • Preserve yesterday's data in the stage area
  • Bring today's entire data in the stage area
  • Perform a comparison
  • Inefficient, but the most reliable
  • Initial and Incremental Loads
  • Create two tables, previous-load and current-load
  • Load into the current-load, compare with the
    previous-load; when you are done, drop the
    previous-load, rename the current-load into
    previous-load, and create a new current-load
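A sketch of the audit-column approach, assuming a
dependable last_update_ts column in the source and an
etl_control table (an assumption for illustration, not
part of the lecture) that records the high-water mark
of the last successful run:

-- Extract only rows changed since the last successful run
SELECT c.*
FROM src_customer c
WHERE c.last_update_ts >
      (SELECT last_extract_ts
       FROM etl_control
       WHERE job_name = 'customer_extract');

Updating last_extract_ts only after the load commits
avoids the lost-day problem described above for timed
extracts.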

18
Tips for Extracting
  • Constrain on indexed columns
  • Retrieve only the data you need
  • Use DISTINCT sparingly
  • Use the SET operations sparingly
  • Use hints (a hint tells the DBMS to make sure it
    uses a certain index)
  • Avoid NOT
  • Avoid functions in the where clause (see the
    sketch at the end of this slide)
  • Avoid subqueries
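A before/after sketch of the "avoid functions in the
WHERE clause" tip; wrapping the indexed column in a
function usually prevents an index range scan (YEAR()
and the exact date syntax vary by DBMS, so treat this
as illustrative):

-- Likely to ignore an index on order_date
SELECT * FROM src_orders WHERE YEAR(order_date) = 2005;

-- Same rows, but the bare indexed column can be range-scanned
SELECT * FROM src_orders
WHERE order_date >= DATE '2005-01-01'
  AND order_date <  DATE '2006-01-01';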

19
Rules
  • When a dimension is populated by several distinct
    systems, it is important to include the unique
    identifier from each of those systems in the
    target dimension in the data warehouse. Those
    identifiers should be viewable by end users to
    ensure peace of mind that the dimension reflects
    data they can tie back to in their transaction
    systems. A hypothetical layout is sketched below.
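A hypothetical dimension layout following this rule,
carrying the natural key of each contributing source
system alongside the DW surrogate key (all names are
illustrative):

CREATE TABLE dim_customer (
    customer_key     INTEGER      NOT NULL,  -- DW surrogate key
    crm_customer_id  VARCHAR(20),            -- natural key from the CRM system
    billing_acct_no  VARCHAR(20),            -- natural key from the billing system
    customer_name    VARCHAR(100),
    PRIMARY KEY (customer_key)
);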

20
Delivering Dimension Tables
21
The basic structure of a dimension
  • Primary key (PK)
  • Meaningless, unique integer
  • Also known as the surrogate key
  • Joins to Fact Tables
  • Appears as a Foreign Key in Fact Tables
  • Natural key (NK)
  • Meaningful key extracted from source systems
  • 1-to-1 relationship to the PK for static
    dimensions
  • 1-to-many relationship to the PK for slowly
    changing dimensions, tracks history of changes to
    the dimension
  • Descriptive Attributes
  • Primarily textual; numbers are legitimate, but not
    numbers that are measured quantities
  • 100 such attributes is normal
  • Static or slowly changing only
  • Product price -- either fact or dimension
    attribute

22
Generating surrogate keys for Dimensions
  • Via triggers in the DBMS
  • Read the latest surrogate key, generate the next
    value, create the record
  • Disadvantage: severe performance bottlenecks
  • Via the ETL process: an ETL tool or a third-party
    application generates the unique numbers
  • A surrogate key counter per dimension (see the
    sketch at the end of this slide)
  • Maintain consistency of surrogate keys between
    dev, test and production
  • Using Smart Keys
  • Concatenate the natural key of the dimension in
    the source(s) with the timestamp of the record in
    the source or the Data Warehouse.
  • Tempting but wrong
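A minimal sketch of the per-dimension surrogate key
counter; the key_counter table and its columns are
assumptions for illustration, not something prescribed
by the lecture:

-- One row per dimension stores the last surrogate key handed out
UPDATE key_counter
SET last_key = last_key + 1
WHERE dimension_name = 'dim_customer';

SELECT last_key                 -- the value to assign to the new dimension row
FROM key_counter
WHERE dimension_name = 'dim_customer';

In practice an ETL tool usually caches a block of keys
instead of issuing one UPDATE per new row.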

23
Why smart keys are wrong
  • By definition
  • Surrogate keys are supposed to be meaningless
  • Do you update the concatenated smart key if the
    natural key changes?
  • Performance
  • Natural keys may be chars and varchars, not
    integers
  • Adding a timestamp to it makes the key very big
  • The dimension is bigger
  • The fact tables containing the foreign key are
    bigger
  • Joining facts with dimensions based on
    chars/varchars become inefficient
  • Heterogeneous sources
  • Smart keys work for homogeneous environments,
    but more likely than not the sources are
    heterogeneous, each having its own definition of
    the dimension
  • How does the definition of the smart key change
    when another source is added? It doesn't scale
    very well.
  • One advantage: simplicity in the ETL process

24
The grain of a dimension
  • The definition of the key of the dimension in
    business terms: what does the dimension represent?
  • Challenge: analyze the source systems so that a
    particular set of fields in that source
    corresponds to the grain of the dimension
  • Verify that a given source (file) implements the
    intended grain
  • Nothing should be returned by the query below when
    run against the source system/file
  • If something is returned, the fields A, B and C
    do not represent the grain of the dimension

select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
25
The basic load plan for a dimension
  • Simple Case: the dimension is loaded as a lookup
    table
  • Typical Case
  • Data cleaning
  • Validate the data, apply business rules to make
    the data consistent, column validity enforcement,
    cross-column value checking, row de-duplication
  • Data conforming
  • Align the content of some or all of the fields in
    the dimension with fields in similar or identical
    dimensions in other parts of the data warehouse
  • Fact tables: billing transactions, customer
    support calls
  • IF they use the same dimensions, then the
    dimensions are conformed
  • Data Delivery
  • All the steps required to deal with slow-changing
    dimensions
  • Write the dimension to the physical table
  • Creating and assigning the surrogate key, making
    sure the natural key is correct, etc.

26
Dimensions and embedded hierarchies
  • Dimensions are de-normalized flat tables
  • It is normal to have multiple embedded
    hierarchical structures in the same dimension
  • E.g., the Store dimension
  • Geographic hierarchy: location, city, state,
    country
  • Merchandising hierarchy: location, district,
    region
  • Can co-exist in the same dimension IF every
    attribute is single valued in the presence of the
    dimension's primary key
  • If some attribute takes multiple values, it is
    not part of the dimension
  • E.g., a store can have multiple cash registers;
    i.e., if the grain of the dimension is the store,
    the cash register cannot be part of the store
    dimension; to include the cash register in the
    dimension, the grain must be changed to be the
    cash register, not the store
  • Normalized structures (3NF)
  • ok to exist in the staging area
  • should be flattened (2NF) before the delivery
    phase, using a simple query over the normalized
    structures (see the sketch at the end of this
    slide)
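A sketch of such a flattening query over normalized
staging tables for the store example above (staging
table and column names are illustrative):

-- Flatten the normalized staging tables into the flat store dimension
SELECT st.store_id,
       st.store_name,
       c.city_name,
       s.state_name,
       co.country_name,
       d.district_name,
       r.region_name
FROM stg_store st
JOIN stg_city     c  ON c.city_id     = st.city_id
JOIN stg_state    s  ON s.state_id    = c.state_id
JOIN stg_country  co ON co.country_id = s.country_id
JOIN stg_district d  ON d.district_id = st.district_id
JOIN stg_region   r  ON r.region_id   = d.region_id;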

27
Snow-flaked Dimensions
  • Dimensions that remain normalized
  • Looks like a snowflake, if the levels of the
    hierarchy obey perfect many-to-1 relationships

28
Flat Dimensions versus Snow-flakes
  • There is no semantic difference; the information
    content of the two approaches is the same
  • Problems with snow-flakes
  • If the hierarchical model changes, the normalized
    table schema and the declared joins must change,
    as well as the resulting end-user queries
  • More confusing to the end user, flat dimensions
    mask the complexity of the data
  • However, appropriate snow-flakes do exist (as
    sub-dimensions of dimensions)

29
Date and Time Dimensions
  • Virtually everywhere measurements are defined at
    specific times, repeated over time, etc.
  • Most common: a calendar-day dimension with the
    grain of a single day and many attributes
  • Doesn't have a conventional source
  • Built by hand, e.g., in a spreadsheet
  • Holidays, workdays, fiscal periods, week numbers,
    last-day-of-month flags must be entered manually
  • 10 years is about 4K rows (a sketch of seeding the
    basic rows follows)
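A sketch of seeding the basic calendar rows with SQL
(PostgreSQL-style recursive CTE; other engines use
different syntax); holiday, workday and fiscal-period
flags would still be maintained by hand:

WITH RECURSIVE d(full_date) AS (
    SELECT DATE '2000-01-01'
    UNION ALL
    SELECT full_date + 1 FROM d WHERE full_date < DATE '2009-12-31'
)
SELECT full_date,
       EXTRACT(YEAR  FROM full_date) AS year_number,
       EXTRACT(MONTH FROM full_date) AS month_number,
       EXTRACT(DAY   FROM full_date) AS day_of_month
FROM d;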

30
Date Dimension
  • Note the Natural key: a day type and a full date
  • Day type: date, plus non-date types such as
    "inapplicable date", "corrupted date", "hasn't
    happened yet"
  • fact tables must point to a valid date from the
    dimension, so we need special date types, at
    least one: the N/A date
  • How to generate the primary key?
  • Meaningless integer?
  • Or 10102005 meaning Oct 10, 2005? (reserving
    9999999 to mean N/A?)
  • This is a close call, but even if meaningless
    integers are used, the numbers should appear in
    numerical order (why? Because of data
    partitioning requirements in a DW, data in a fact
    table can be partitioned by time)

31
Other Time Dimensions
  • Also typically needed are time dimensions whose
    grain is a month, a week, a quarter or a year, if
    there are fact tables in each of these grains
  • These are physically different tables
  • Generated by eliminating selected columns and
    rows from the Date dimension, keeping either the
    first or the last day of the month (see the
    sketch at the end of this slide)
  • Do NOT use database views
  • A view would drag a much larger table (the date
    dimension) into a month-based fact table
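A sketch of materializing a month-grain dimension from
a daily date dimension by keeping the first day of
each month (the dim_date column names are assumed for
illustration):

INSERT INTO dim_month (month_key, year_number, month_number, month_name)
SELECT date_key, year_number, month_number, month_name
FROM dim_date
WHERE day_of_month = 1;   -- keep one row per month, not a view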

32
Time Dimensions
  • How about a time dimension based on seconds?
  • There are over 31 million seconds in a year!
  • Avoid them as dimensions
  • But keep the SQL date-timestamp data as basic
    attributes in facts (not as dimensions), if
    needed to compute precise queries based on
    specific times
  • Older approach: keep a dimension of minutes or
    seconds based on an offset from midnight of each
    day, but it is messy when timestamps cross days
  • Might need something fancy though if the
    enterprise has well defined time slices within a
    day such as shift names, advertising slots --
    then build a dimension

33
Big and Small Dimensions
  • SMALL
  • Examples: Transaction Type, Claim Status
  • Tiny lookup tables with only a few records and
    one or more columns
  • Built by typing into a spreadsheet and loading
    the data into the DW
  • These dimensions should NOT be conformed
  • JUNK dimension: a tactical maneuver to reduce the
    number of FKs from a fact table by combining the
    low-cardinality values of small dimensions into a
    single junk dimension; generate it as you go,
    don't generate the Cartesian product (see the
    sketch at the end of this slide)
  • BIG
  • Examples: Customer, Product, Location
  • Millions of records with hundreds of fields
    (insurance customers)
  • Or hundreds of millions of records with few
    fields (supermarket customers)
  • Always derived from multiple sources
  • These dimensions should be conformed
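A sketch of the "generate as you go" junk dimension
load: only flag combinations actually observed in the
staged facts are inserted, never the full Cartesian
product (table and column names are illustrative, and
assigning the surrogate key is omitted):

INSERT INTO dim_txn_junk (payment_type, claim_status, channel_code)
SELECT DISTINCT s.payment_type, s.claim_status, s.channel_code
FROM stage_fact s
WHERE NOT EXISTS (
      SELECT 1
      FROM dim_txn_junk j
      WHERE j.payment_type = s.payment_type
        AND j.claim_status = s.claim_status
        AND j.channel_code = s.channel_code);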

34
One dimension or two?
  • Are the dimensions independent? Or is there a
    statistical correlation between them?
  • Weak statistical correlation and big dimensions
  • Even if there is a weak correlation, build the DW
    as if the dimensions are independent, especially
    for large dimensions
  • 100 stores and 1 million products would give 100
    million rows if the two were combined -- too large
  • If the statistical correlation is significant,
    build a fact table for the correlation, and there
    may be many such correlations, e.g.,
    merchandizing correlation, pricing-strategy
    correlation, changing-seasonality correlation
  • Leave the dimensions simple and independent if
    the dimensions are large and the statistical
    correlation is weak
  • Strong correlations and small resulting
    dimensions
  • If the correlation is strong, e.g., product
    always has a brand, leave the brand OUT of the
    fact table, because the product always rolls up
    to a single brand and COMBINE the product and the
    brand into a single dimension
  • Arbitrary bound: 100K records is not a small
    dimension

35
Dimensional Roles
  • When the same dimension is attached to a fact
    table multiple times
  • A sale has an order date, a payment date, a
    shipping date, a return date
  • A claim transaction has multiple internal people:
    a claim intake person, a medical nurse, a claim
    adjuster, a payer, etc.
  • End users might be confused when they drill down
    into the dimension as to what the dimension
    represents, so it helps to give the dimension a
    different name for each role
  • Solution: create views over the dimension, one for
    each dimensional role, to help with the name
    recognition (see the sketch at the end of this
    slide)
  • For very large dimensions (e.g., locations) it
    might be appropriate to create physical copies of
    the dimension table to help with the name
    confusion
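A sketch of role-playing views over a single date
dimension (names are illustrative); each view can also
rename columns to make the role obvious in queries:

-- One view of the date dimension per role played in the fact table
CREATE VIEW dim_order_date    AS SELECT * FROM dim_date;
CREATE VIEW dim_payment_date  AS SELECT * FROM dim_date;
CREATE VIEW dim_shipping_date AS SELECT * FROM dim_date;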

36
Other dimensions
  • Degenerate dimensions
  • When a parent-child relationship exists and the
    grain of the fact table is the child, the parent
    is kind of left out in the design process
  • Example
  • grain of the fact table is the line item in an
    order
  • the order number is a significant part of the key
  • but we don't create a dimension for the order
    number, because it would be useless
  • we insert the order number as part of the key, as
    if it were a dimension, but we don't create a
    dimension table for it (see the sketch at the end
    of this slide)
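A sketch of a fact table at the order-line grain with
the order number kept as a degenerate dimension (no
dim_order table exists; all names are illustrative):

CREATE TABLE fact_order_line (
    order_date_key INTEGER      NOT NULL,  -- FK to the date dimension
    product_key    INTEGER      NOT NULL,  -- FK to the product dimension
    customer_key   INTEGER      NOT NULL,  -- FK to the customer dimension
    order_number   VARCHAR(20)  NOT NULL,  -- degenerate dimension, no table behind it
    line_number    INTEGER      NOT NULL,
    quantity       INTEGER,
    amount         DECIMAL(12,2),
    PRIMARY KEY (order_number, line_number)
);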

37
Slow-changing Dimensions
  • When the DW receives notification that some
    record in a dimension has changed, there are
    three basic responses
  • Type 1 slowly changing dimension (Overwrite)
  • Type 2 slowly changing dimension (Partitioning
    History)
  • Type 3 slowly changing dimension (Alternate
    Realities)

38
Type 1 Slowly Changing Dimension (Overwrite)
  • Overwrite one or more values of the dimension
    with the new value
  • Use when
  • the data are corrected
  • there is no interest in keeping history
  • there is no need to run previous reports or the
    changed value is immaterial to the report
  • Type 1 Overwrite results in an UPDATE SQL
    statement when the value changes (see the sketch
    at the end of this slide)
  • If a column is Type-1, the ETL subsystem must
  • Add the dimension record, if it is a new value, or
  • Update the dimension attribute in place
  • Must also update any Staging tables, so that any
    subsequent DW load from the staging tables will
    preserve the overwrite
  • This update never affects the surrogate key
  • But it affects materialized aggregates that were
    built on the value that changed (will be
    discussed more next week when we talk about
    delivering fact tables)
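A sketch of the Type-1 overwrite driven from a staging
table (table and column names are illustrative; the
correlated-subquery form keeps it close to generic
SQL):

-- Overwrite the attribute in place; the surrogate key is untouched
UPDATE dim_customer
SET first_name = (SELECT s.first_name
                  FROM stage_customer s
                  WHERE s.customer_nk = dim_customer.customer_nk)
WHERE EXISTS (SELECT 1
              FROM stage_customer s
              WHERE s.customer_nk = dim_customer.customer_nk
                AND s.first_name <> dim_customer.first_name);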

39
Type 1 Slowly Changing Dimension (Overwrite) -
Cont
  • Beware of ETL tools' "update else insert"
    statements, which are convenient but inefficient
  • Some developers use "UPDATE else INSERT" for
    fast-changing dimensions and "INSERT else UPDATE"
    for very slowly changing dimensions
  • Better approach: segregate INSERTs from UPDATEs,
    and feed the DW independently with the updates and
    with the inserts
  • No need to invoke a bulk loader for small tables,
    simply execute the SQL updates, the performance
    impact is immaterial, even with the DW logging
    the SQL statement
  • For larger tables, a loader is preferable,
    because SQL updates will result in unacceptable
    database logging activity
  • Turn the logger off before you run the separated
    SQL UPDATEs and SQL INSERTs
  • Or use a bulk loader
  • Prepare the new dimension in a staging file
  • Drop the old dimension table
  • Load the new dimension table using the bulk loader

40
Type-2 Slowly Changing Dimension (Partitioning
History)
  • Standard
  • When a record changes, instead of overwriting
  • create a new dimension record
  • with a new surrogate key
  • add the new record into the dimension table
  • use this record going forward in all fact tables
  • no fact tables need to change
  • no aggregates need to be re-computed
  • Perfectly partitions history, because each
    detailed version of the dimension is correctly
    connected to the span of fact table records for
    which that version is correct (see the sketch at
    the end of this slide)
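A sketch of a Type-2 change for the employee example
used on the next slide; the effective-date and
current-flag columns anticipate the "precise time
stamping" slide, and the key values and names are
illustrative:

-- Expire the current version of the row ...
UPDATE dim_employee
SET row_current_flag = 'N',
    row_end_date     = CURRENT_DATE
WHERE employee_nk      = '10027'
  AND row_current_flag = 'Y';

-- ... then insert the new version under a brand-new surrogate key
INSERT INTO dim_employee
    (employee_key, employee_nk, job_title,
     row_start_date, row_end_date, row_current_flag)
VALUES
    (98213, '10027', 'Manager',
     CURRENT_DATE, DATE '9999-12-31', 'Y');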

41
Type-2 Slowly Changing Dimensions (Partitioning
History) - Cont
  • The natural key does not change
  • The job attribute changes
  • We can constrain our query on
  • the Manager job
  • Joe's employee id
  • Type-2 changes do not alter the natural key (the
    natural key should never change)

42
Type-2 SCD Precise Time Stamping
  • With a Type-2 change, you might want to include
    the following additional attributes in the
    dimension
  • Date of change
  • Exact timestamp of change
  • Reason for change
  • Current Flag (current/expired)

43
Worst Case Scenario
  • The source system does not notify you of changes
    and does not time/date-stamp its own updates
  • For small tables, use a brute force approach of
    comparing every incoming field with every field
    in the DW to see if anything changes
  • For larger tables, use the CRC, or
    cyclic-redundancy checksum, a number of about 20
    digits, like this
  • Treat the entire incoming record as a string
  • Compute the CRC value of the string (some
    numeric value)
  • Compare the CRC value of today's record with the
    CRC value of yesterday's record
  • If the CRC values match, today's record is
    identical to yesterday's record
  • If the CRC values do not match, do a field-by-field
    comparison to see what has changed
  • Depending on whether the changed field is a
    Type-1, Type-2 or Type-3 change, do the necessary
    updates
  • Most ETL packages include CRC computation and
    comparison code, and implementations are readily
    available on the Internet (a sketch using a hash
    function follows)
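A sketch of the comparison step using a hash in SQL;
MD5 stands in for the CRC the lecture describes and is
only available in some DBMSs, and the table and column
names are illustrative:

-- Find natural keys whose record content changed since yesterday
SELECT t.customer_nk
FROM stage_today t
JOIN stage_yesterday y
  ON y.customer_nk = t.customer_nk
WHERE MD5(t.first_name || '|' || t.last_name || '|' || t.address)
   <> y.row_hash;   -- then do the field-by-field check on these rows only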

44
Identifying Deletions
  • To identify deletions
  • Read the source log file, if one exists
  • Determine that the CRC number of a record in the
    DW does not match some incoming record
  • Use the MINUS set operator to compare the DW
    dimension with the incoming dimension (usually
    works if the ETL staging table with today's
    dimension and the DW are part of the same
    database; see the sketch at the end of this slide)
  • Delete the record only if this is valid for the
    business; in most cases you don't want to delete
    anything from the DW
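A sketch of the MINUS comparison (EXCEPT in some
dialects), assuming both tables live in the same
database and the names are illustrative:

-- Natural keys present in the DW dimension but missing from today's extract
SELECT customer_nk FROM dim_customer
MINUS
SELECT customer_nk FROM stage_customer_today;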

45
Type-3 Slowly Changing Dimensions (Alternate
Realities)
  • Applicable when a change happens to a dimension
    record but the old record remains valid as a
    second choice
  • Product category designations
  • Sales-territory assignments
  • Instead of creating a new row, a new column is
    added (if it does not already exist)
  • The old value is added to the secondary column
  • Before the new value overwrites the primary column
  • Example: old category, new category (see the
    sketch at the end of this slide)
  • Usually defined by the business after the main
    ETL process is implemented
  • "Please move Brand X from Men's Sportswear to
    Leather Goods, but allow me to track Brand X
    optionally in the old category"
  • The old category is described as an "alternate
    reality"
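A sketch of the Brand X example as a Type-3 change
(column and value names are illustrative):

-- One-time schema change: add the alternate-reality column if it is missing
ALTER TABLE dim_product ADD old_category VARCHAR(50);

-- Preserve the old value, then overwrite the primary column
UPDATE dim_product
SET old_category = category,
    category     = 'Leather Goods'
WHERE brand = 'Brand X';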