Transcript and Presenter's Notes

Title: Data Warehousing (SS ZG515) Extraction


1
Data Warehousing (SS ZG515)
Extraction - Transformation - Loading
  • Prof. Navneet Goyal/SP Vimal
  • Computer Science Department
  • BITS, Pilani

2
To discuss
  • Requirements
  • Data Structures
  • Extraction
  • Cleaning & Conforming
  • Delivering Dimension Tables
  • Delivering Fact Tables

3
1. Requirements

4
ETL
  • "A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions. ETL makes or breaks the data warehouse." - Ralph Kimball

5
Requirements
  • Business Needs
  • The information requirements of the end users.
  • Captured through interviews with users and independent investigation of the possible sources by the ETL team.
  • Compliance Requirements
  • Sarbanes-Oxley Act 2002 (deals with the regulation of corporate governance; more at http://www.soxlaw.com)
  • Proof of the complete transaction flow that changed any data.

6
Requirements (contd)
  • Data Profiling
  • Systematic examination of the quality, scope and context of a data source.
  • Helps the ETL team determine how much data-cleaning activity is required.
  • "Data profiling employs analytic methods for looking at data for the purpose of developing a thorough understanding of the content, structure and quality of the data. A good data profiling system can process very large amounts of data and, with the skills of the analyst, uncover all sorts of issues that need to be addressed." - Jack Olson

7
Requirements (contd)
  • Security Requirements
  • The ETL team has complete read/write access to the entire corporate data.
  • ETL workstations on the company intranet are a major threat; keep them in a separate subnet behind a packet-filtering gateway.
  • Secure the backups as well.
  • Data Integration
  • Identified as the conform step
  • Conform dimensions & conform facts

8
Requirements (contd)
  • Data Latency
  • How quickly can the data be delivered to the users?
  • The ETL architecture has a direct impact on it.
  • Archiving & Lineage
  • Stage data after each major transformation, not just after all four steps (extract, clean, conform, deliver).
  • Each archived/staged data set should have accompanying metadata.
  • Tracking this lineage is explicitly required by certain compliance requirements.

9
Requirements (contd)
  • End User Delivery Interfaces
  • The ETL team is responsible for the content and structure of the data, making the end-user applications fast.
  • Available Skills
  • Expertise in building an ETL system around a vendor's tool
  • Decision between hand-coded ETL or a vendor's package of ETL tools
  • Legacy Licenses
  • Management's insistence on using legacy licenses

10
Choice of Architecture: Tool-Based ETL
  • Simpler, cheaper, faster development
  • People with business skills but not much technical skill can use it
  • Automatically generates metadata
  • Automatically generates data lineage and data-dependency analysis
  • Offers in-line encryption and compression capabilities
  • Manages complex load balancing across servers

11
Choice of Architecture: Hand-Coded ETL
  • Quality of the tool is established by exhaustive unit testing
  • Better metadata
  • The requirement may be just file-based processes, not database stored procedures
  • Use of existing legacy routines
  • Use of in-house programmers
  • Unlimited flexibility

12
Choice of Architecture: Batch vs. Streaming
  • Batch Update
  • ETL processing is done on periodic batch extracts from the source systems
  • Streaming Dataflow
  • Data flows at the record level from the source system to the users' screens
  • Extract-Clean-Conform-Deliver must be modified to accommodate record-by-record processing
  • The arrival of data may not be predictable

13
2. Data Structures

14
To stage or not to stage
  • A conflict between
  • getting the data from the operational systems as fast as possible, and
  • having the ability to restart without repeating the process from the beginning
  • Reasons for staging
  • Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after each major processing step (cleaning, transformation, etc.)
  • Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
  • Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse

15
Designing the staging area
  • The staging area is owned by the ETL team
  • no indexes, no aggregations, no presentation
    access, no querying, no service level agreements
  • Users are not allowed in the staging area for any
    reason
  • staging is a construction site
  • Reports cannot access data in the staging area
  • tables can be added or dropped without notifying the user community
  • Controlled environment

16
Designing the staging area (contd)
  • Only ETL processes can read/write the staging
    area (ETL developers must capture table names,
    update strategies, load frequency, ETL jobs,
    expected growth and other details about the
    staging area)
  • The staging area consists of both RDBMS tables
    and data files

17
Staging Area Data Structures in the ETL System
  • Flat files
  • fast to write, append to, sort and filter (grep), but slow to update, access or join
  • enable restart without going back to the sources
  • XML Data Sets
  • used as a medium of data transfer between incompatible data sources
  • give enough information to create tables using CREATE TABLE
  • Relational Tables
  • metadata, SQL interface, DBA support

18
Staging Area Data Structures in the ETL System (contd)
  • Dimensional Model Constructs: facts, dimensions, atomic fact tables, aggregate fact tables (OLAP cubes)
  • Surrogate Key Mapping Tables
  • map natural keys from the OLTP systems to the surrogate keys of the DW
  • can be stored in files or in the RDBMS

19
3. Extraction

20
Logical data map
  • Represented as a table with the following attributes (see the sketch after this list)
  • Target table and column, table type (Dimension, Fact)
  • Slowly changing dimension type per target column of each dimension
  • Type 1, overwrite (e.g., customer first name)
  • Type 2, retain history (e.g., customer last name)
  • Type 3, retain multiple valid alternative values
  • Source database, table, column
  • Transformations
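
As a minimal sketch (all names are illustrative, not taken from the slides), one row of a logical data map could be captured like this:

# One illustrative row of a logical data map (all names are hypothetical).
logical_map_row = {
    "target_table": "DimCustomer",
    "target_column": "customer_last_name",
    "table_type": "Dimension",          # Dimension or Fact
    "scd_type": 2,                      # 1 = overwrite, 2 = retain history, 3 = alternate values
    "source_database": "CRM_PROD",
    "source_table": "CUSTOMER_MASTER",
    "source_column": "LAST_NAME",
    "transformation": "TRIM and proper-case; default to 'UNKNOWN' when NULL",
}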

21
Logical Map development
  • Have a plan!
  • The logical map is provided by the data warehouse architect to the ETL team and serves as the specification of the ETL processes.
  • Identify data lineage between the data source and the target
  • Identify source candidates
  • Analyze source systems with a data profiling tool
  • Detect data anomalies, identify appropriate actions and document them
  • Identify the quality of the data

22
Logical Map development (contd)
  • Receive a walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
  • data alterations during data cleansing, calculations and formulas
  • standard conformance of dimensions and numeric facts
  • Receive a walk-through of the dimensional model
  • The objective of the ETL team is to deliver data to the dimensional model effectively, so an understanding of the dimensional model is helpful
  • Validate the calculations and formulas used in ETL

23
Logical Map development (contd)
  • A complete logical map cannot exist until all the source systems are identified and analyzed.
  • Analysis of the sources
  • Data discovery phase
  • Collecting and documenting source systems
  • Keeping track of source systems
  • Identifying ownership: who is responsible for the content, its storage and usage statistics
  • Determining the system-of-record (source of data)
  • Identifying the source when redundant sources coexist
  • Analyzing the source systems for relationships between tables
  • Anomaly detection phase (data content analysis)
  • NULL values
  • Dates in non-date fields

24
Some good rules
  • Analyze your source system
  • get an ER model for the system or reverse-engineer one (develop one by looking at the metadata of the system)
  • reverse engineering is not the same as forward engineering, i.e., deriving the dimensional schema of the data warehouse from the ER models of the source systems
  • Reverse engineering helps in understanding
  • unique identifiers and natural keys
  • data types
  • relationships between tables
  • of the source systems

25
Extract data from disparate systems
  • ODBC Manager
  • Accepts SQL from ETL applications and routes it through the appropriate ODBC driver
  • ODBC can provide a common gateway to diverse sources (see the sketch below)
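
A minimal sketch of extraction through an ODBC gateway; pyodbc is one common Python interface to an ODBC driver manager, and the DSN, credentials, table and column names here are hypothetical:

import pyodbc

# Connect through the ODBC manager; the DSN name and credentials are made up.
conn = pyodbc.connect("DSN=orders_source;UID=etl_user;PWD=secret")
cursor = conn.cursor()

# Retrieve only the columns and rows needed, constraining on an indexed date column.
cursor.execute(
    """
    SELECT order_id, customer_id, order_date, order_amount
    FROM orders
    WHERE order_date >= ?
    """,
    "2024-01-01",
)

for row in cursor.fetchall():
    # Each row would be written to the staging area (flat file or staging table) here.
    print(row.order_id, row.order_amount)

conn.close()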

26
Different sources
  • Mainframe Sources
  • Flat Files
  • XML sources
  • Web Log sources
  • ERP system sources

27
Tips for Extracting
  • Constrain on indexed columns
  • Retrieve only the data you need
  • Do not retrieve an entire table and then select from it
  • Use DISTINCT and set operations sparingly
  • Check whether the slower DISTINCT and set operations (UNION, MINUS, INTERSECT) can be avoided
  • Make a best effort to avoid the NOT operator, which normally scans the entire table
  • Avoid functions in the WHERE clause
  • Difficult to avoid
  • At least try different techniques before resorting to functions (see the example below)
  • Avoid subqueries
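
As a hedged illustration of the "avoid functions in the WHERE clause" tip (Oracle-style SQL shown; table and column names are hypothetical), wrapping an indexed column in a function usually defeats the index, while an equivalent range predicate keeps it usable:

# Applying a function to the indexed column typically forces a full scan.
slow_query = """
SELECT order_id, order_amount
FROM orders
WHERE TRUNC(order_date) = DATE '2024-01-15'
"""

# Rewriting as a range predicate on the raw indexed column lets the index be used.
fast_query = """
SELECT order_id, order_amount
FROM orders
WHERE order_date >= DATE '2024-01-15'
  AND order_date <  DATE '2024-01-16'
"""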

28
Do this as well
  • When a dimension is populated by several distinct systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, so they have the peace of mind that the dimension reflects data they can tie back to their transaction systems.

29
4. Cleaning & Conforming

30
Cleaning and Conforming
  • While the extract and load parts of an ETL process simply move data, cleaning and conforming is the transformation that truly adds value
  • How do we deal with dirty data?
  • Data profiling report
  • The error event fact table
  • Audit dimension
  • Challenges
  • Completeness vs. speed
  • Corrective vs. transparent
  • An overly corrective system hides/obscures the operational deficiencies and slows organizational progress

31
Defining Data Quality
  • The basic definition of data quality is data accuracy, and that means
  • Correct: the values of the data are valid, e.g., my resident state is PA
  • Unambiguous: the values of the data can mean only one thing, e.g., there is only one PA
  • Consistent: the values of the data use the same format, e.g., PA and not Penn. or Pennsylvania
  • Complete: data are not null, and aggregates do not lose data records somewhere in the information flow

32
Cleaning Deliverables
  • Keep accurate records of the types of data quality problems you look for, when you look, what you look at, etc.
  • Is data quality getting better or worse?
  • Which source systems generate the most data
    quality errors?
  • Is there any correlation between data quality
    levels and the performance of the organization as
    a whole?

33
Data Profiling Deliverable
  • Start before building the ETL system
  • Data profiling analysis, including
  • Schema definitions
  • Business objects
  • Domains
  • Data Sources
  • Table definitions
  • Synonyms
  • Data rules
  • Value rules
  • Issues that need to be addressed

34
Error Event Table Deliverable
  • Built as a star schema
  • Each data quality error or issue is added to the
    table

35
Conforming
  • Integration of data
  • A conformed product dimension is the enterprise's agreed-upon master list of products, including all attributes. It can be considered a master table of products with a clean surrogate product key and all relevant attributes.
  • Processes of conforming
  • Standardizing
  • Matching and de-duplication
  • Surviving

36
5. Delivering Dimension Tables

37
The basic structure of a dimension
  • Primary key (PK)
  • Mostly a surrogate key
  • Foreign key in fact tables
  • Natural key (NK)
  • Meaningful key extracted from source systems
  • 1-to-1 relationship to the PK for static dimensions
  • 1-to-many relationship to the PK for slowly changing dimensions; tracks the history of changes to the dimension
  • Descriptive Attributes
  • Primarily textual; numbers are legitimate, but not numbers that are measured quantities
  • About 100 such attributes is normal
  • Static or slowly changing only
  • Product price: either a fact or a dimension attribute

38
Smart Keys in Place of Surrogate Keys
  • Smart key
  • Natural key concatenated with a timestamp
  • A wrong approach. Reasons:
  • Keys assume responsibility
  • Changes may happen to the natural key, forcing a change to all fact table records referring to it
  • Poor performance
  • The key may take a CHAR or VARCHAR type; comparison is slower
  • Joins between two CHAR/VARCHAR columns are slower
  • Heterogeneous source systems
  • Common dimensions are sourced by different sources
  • Conflicts arise when a second source comes into the scene to source the key

39
The grain of a dimension
  • The grain of a dimension means the definition of the key of the dimension
  • The grain of the customer dimension is the customer
  • Verify that a given source (file) implements the intended grain
  • Nothing should be returned by the following query against the source system/file for the fields A, B and C that form the natural key
  • If something is returned by this query, the fields A, B and C do not represent the grain of the dimension

select A, B, C, count(*)
from DimensionTableSource
group by A, B, C
having count(*) > 1
40
The basic load plan for a dimension
  • Data cleaning
  • Validate the data, apply business rules to make the data consistent: column validity enforcement, cross-column value checking, row de-duplication
  • Data conforming
  • Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
  • Data delivery
  • All the steps required to deal with slowly changing dimensions
  • Write the dimension to the physical table
  • Create and assign the surrogate key, make sure the natural key is correct, etc. (see the sketch below)
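
A minimal sketch of this clean / conform / deliver sequence for a customer dimension, assuming simple in-memory rows and hypothetical field names; a real ETL tool packages these steps differently:

def clean(rows):
    """Apply column validity checks, cross-column checks and row de-duplication."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row.get("customer_nk"):                      # column validity: natural key required
            continue
        if row["customer_nk"] in seen:                      # row de-duplication on the natural key
            continue
        seen.add(row["customer_nk"])
        row["state"] = (row.get("state") or "").strip().upper()  # simple standardization
        cleaned.append(row)
    return cleaned

def conform(rows, master_state_codes):
    """Align field content with the conformed dimension (here: state codes)."""
    return [r for r in rows if r["state"] in master_state_codes]

def deliver(rows, next_surrogate_key):
    """Assign surrogate keys and hand the rows to the physical dimension table."""
    for row in rows:
        row["customer_sk"] = next_surrogate_key
        next_surrogate_key += 1
    return rows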

41
Date and Time Dimensions
  • Virtually everywhere, measurements are defined at specific times, repeated over time, etc.
  • Most common: the calendar-day dimension, with the grain of a single day and many attributes
  • Doesn't have a conventional source
  • Built by hand, e.g., in a spreadsheet
  • Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must all be entered manually
  • 10 years are about 4K rows

42
Date Dimension
  • Note the natural key: a day type and a full date
  • Day type: date, plus non-date types such as "inapplicable date", "corrupted date", "hasn't happened yet"
  • Fact tables must point to a valid date in the dimension, so we need special date types, at least one: the N/A date
  • How to generate the primary key?
  • Surrogate keys, whose zero value is assigned to the standard starting date of the history of records
  • Partitioning the warehouse on date helps improve performance and allows work on the current partition, such as creating and dropping indexes
  • Surrogate key 9999999 may be designed to mean N/A. This high value will keep it always in the current partition. (See the sketch below.)
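
A minimal sketch of building the date dimension by script rather than by spreadsheet, under the assumptions above (surrogate key 0 for the start of history, 9999999 for the N/A member); attribute names are illustrative:

from datetime import date, timedelta

def build_date_dimension(start=date(2015, 1, 1), years=10):
    """Generate one row per calendar day plus a special 'N/A' member."""
    rows = []
    surrogate_key = 0                      # key 0 = standard starting date of history
    day = start
    end = date(start.year + years, start.month, start.day)
    while day < end:
        rows.append({
            "date_key": surrogate_key,
            "full_date": day,
            "day_of_week": day.strftime("%A"),
            "month_name": day.strftime("%B"),
            "year": day.year,
            "day_type": "date",
        })
        surrogate_key += 1
        day += timedelta(days=1)

    # Special non-date member so fact rows never carry an invalid date key.
    rows.append({
        "date_key": 9999999,
        "full_date": None,
        "day_of_week": "N/A",
        "month_name": "N/A",
        "year": None,
        "day_type": "not applicable",
    })
    return rows

dim_date = build_date_dimension()
print(len(dim_date))   # roughly 3,650 day rows for 10 years, plus the N/A row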

43
Other Time Dimensions
  • If the grain is month, not day, then create a separate dimension table for month; similarly if the grain is week or some other domain-specific span.
  • Creating a month view over the date dimension is not desirable, since it would drag the larger table into every month-based query. A monthly-grain dimension table holds only about 120 records for 10 years.
  • When the grain is seconds, it is not advisable to create a dimension row for every second (about 31.5 million rows per year); instead use the design shown below.

[Diagram: a fact table with a Calendar Date (FK) to the date dimension, other dimension FKs, other facts, and a precise SQL date-time stamp stored as a fact rather than a dimension. This design is preferred over creating a dimension-table record for every minute or second.]
44
Big and Small Dimensions
  • SMALL
  • Examples: transaction type, claim status
  • Tiny lookup tables with only a few records and one or more columns
  • Built by typing into a spreadsheet and loading the data into the DW
  • These dimensions should NOT be conformed
  • JUNK dimension
  • A tactical maneuver to reduce the number of FKs in a fact table by combining the low-cardinality values of small dimensions into a single junk dimension
  • Generate as you go; don't generate the Cartesian product (see the sketch below)
  • BIG
  • Examples: customer, product, location
  • Millions of records with hundreds of fields (insurance customers)
  • Or hundreds of millions of records with few fields (supermarket customers)
  • Always derived from multiple sources
  • These dimensions should be conformed
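
A minimal sketch of the generate-as-you-go approach, assuming three hypothetical low-cardinality flags; a junk-dimension key is created only when a combination is first seen, never for the full Cartesian product:

junk_rows = {}          # (payment_type, order_status, gift_flag) -> surrogate key
next_key = 1

def junk_key(payment_type, order_status, gift_flag):
    """Return the junk-dimension key for this combination, creating it on first use."""
    global next_key
    combo = (payment_type, order_status, gift_flag)
    if combo not in junk_rows:
        junk_rows[combo] = next_key      # a new junk-dimension row would be inserted here
        next_key += 1
    return junk_rows[combo]

# Only combinations that actually occur in the facts get a row.
print(junk_key("credit", "shipped", "N"))   # 1
print(junk_key("cash", "pending", "N"))     # 2
print(junk_key("credit", "shipped", "N"))   # 1 (reused)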

45
Dimensional Roles
  • The same dimension can be attached to a fact table multiple times
  • A sale has an order date, a payment date, a shipping date, a return date
  • A claim transaction has multiple internal people: a claim intake person, a medical nurse, a claim adjuster, a payer, etc.
  • End users might be confused when they drill down into the dimension as to what the dimension represents, so it helps to give each role of the dimension a different name

46
Type-2 Slowly Changing Dimension
  • When a record changes, instead of overwriting:
  • create a new dimension record
  • with a new surrogate key
  • add the new record into the dimension table
  • use this record going forward in all fact tables
  • no fact tables need to change
  • no aggregates need to be re-computed
  • Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct (see the sketch below)
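
A minimal sketch of applying a Type-2 change, with hypothetical column names: the current row is expired and a new row is added under a new surrogate key, so no fact rows or aggregates need to change:

from datetime import date

dim_customer = [
    {"customer_sk": 101, "customer_nk": "C-42", "city": "Pilani",
     "row_current": True, "row_effective": date(2020, 1, 1), "row_expiry": None},
]
next_sk = 102

def apply_type2_change(natural_key, changed_attrs, change_date):
    """Expire the current version and add a new row with a fresh surrogate key."""
    global next_sk
    current = next(r for r in dim_customer
                   if r["customer_nk"] == natural_key and r["row_current"])
    current["row_current"] = False
    current["row_expiry"] = change_date

    new_row = dict(current, **changed_attrs)
    new_row.update({"customer_sk": next_sk, "row_current": True,
                    "row_effective": change_date, "row_expiry": None})
    dim_customer.append(new_row)
    next_sk += 1
    return new_row

apply_type2_change("C-42", {"city": "Hyderabad"}, date(2024, 6, 1))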

47
Type-2 Slowly Changing Dimensions
  • A Type-2 change does not change the natural key (the natural key should never change)

48
Type-2 SCD Precise Time Stamping
  • With a Type-2 change, you might want to include
    the following additional attributes in the
    dimension
  • Date of change
  • Exact time of change
  • Reason for change
  • Current Flag (current/expired)

49
Type-2 SCD using CRC
  • For small tables, use a brute-force approach: compare every incoming field with every field in the DW to see if anything has changed
  • For larger tables, use a CRC (cyclic redundancy checksum), a number of about 20 digits, like this:
  • Treat the entire incoming record as a string
  • Compute the CRC value of the string (some numeric value)
  • Compare the CRC value of today's record with the CRC value of yesterday's record
  • If the CRC values match, today's record is identical to yesterday's record
  • If the CRC values do not match, do a field-by-field comparison to see what has changed
  • Depending on whether the changed field is a Type-1, Type-2 or Type-3 change, do the necessary updates
  • Most ETL packages, and code available on the Internet, include CRC computation and comparison routines (see the sketch below)
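
A minimal sketch of the CRC comparison using Python's zlib.crc32 (a 32-bit checksum rather than the roughly 20-digit number mentioned above; a longer hash could be substituted); field names are illustrative:

import zlib

def record_crc(record, field_order):
    """Treat the whole incoming record as one string and compute its CRC."""
    as_string = "|".join(str(record[f]) for f in field_order)
    return zlib.crc32(as_string.encode("utf-8"))

FIELDS = ["customer_nk", "first_name", "last_name", "city"]

yesterday = {"customer_nk": "C-42", "first_name": "Asha",
             "last_name": "Rao", "city": "Pilani"}
today = {"customer_nk": "C-42", "first_name": "Asha",
         "last_name": "Rao", "city": "Hyderabad"}

if record_crc(today, FIELDS) == record_crc(yesterday, FIELDS):
    pass  # identical record: nothing to do
else:
    # Only now compare field by field to decide Type-1 / Type-2 / Type-3 handling.
    changed = [f for f in FIELDS if today[f] != yesterday[f]]
    print("changed fields:", changed)        # ['city']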

50
6. Delivering Fact Tables

51
The basic structure of a fact table
  • Every fact table is defined by its grain
  • in business terms
  • in terms of the dimension foreign keys and other fields
  • A set of foreign keys (FKs)
  • provide context for the facts
  • join to the dimension tables
  • Degenerate dimensions
  • part of the key
  • not a foreign key to a dimension table
  • Primary key
  • a subset of the FKs
  • must be defined in the table
  • Surrogate keys
  • carry no business meaning
  • should not be delivered to the users

52
Guaranteeing Referential Integrity
  • EVERY FACT TABLE ROW MUST BE FILLED WITH LEGITIMATE FOREIGN KEYS
  • Check before loading
  • Check before you add fact records
  • Check before you delete dimension records
  • Best approach (see the sketch below)
  • Check while loading
  • Serious performance issue when loading thousands of records
  • Check after loading
  • Theoretically OK for finding all violations
  • Running the query over the entire fact table causes serious performance issues
  • If the checking is restricted to data loaded only today, we make the assumption that the date foreign key is reliable
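
A minimal sketch of the check-before-loading option, with hypothetical key sets and fact rows: each incoming fact row is verified against the known dimension keys, and rejected rows are routed aside rather than loaded:

# Surrogate keys currently present in each dimension (would come from the DW).
valid_date_keys = {20240101, 20240102, 9999999}
valid_product_keys = {1, 2, 3}

incoming_facts = [
    {"date_key": 20240101, "product_key": 2, "sales_amount": 40.0},
    {"date_key": 20240103, "product_key": 2, "sales_amount": 15.0},  # unknown date key
]

loadable, rejected = [], []
for fact in incoming_facts:
    if fact["date_key"] in valid_date_keys and fact["product_key"] in valid_product_keys:
        loadable.append(fact)
    else:
        rejected.append(fact)   # route to an error/suspense file instead of the fact table

print(len(loadable), len(rejected))   # 1 1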

53
Options for loading the Surrogate Keys of
Dimensions
  • Look up the current surrogate key in each dimension: fetch the record with the most current surrogate key for the natural key and use that surrogate key. A good option, but very slow.
  • Maintain a surrogate key lookup table for each dimension. This table is updated whenever a new record is added or when a Type-2 update occurs for an existing dimensional entity.
  • The dimensions must be updated with Type-2 changes before any facts are loaded into the data warehouse, to guarantee referential integrity.

54
Surrogate Key Pipeline
  • Assume that all records to be added to the fact table are current, i.e.,
  • in the incoming fact records, the value of the natural key of each dimension is the most current value known to the DW
  • When loading a fact table, the final ETL step converts the natural keys of the new input records into the correct surrogate keys of the dimensions, using the key mapping tables (see the sketch below)
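
A minimal sketch of the surrogate key pipeline, assuming in-memory lookup tables and hypothetical key names: natural keys in each incoming fact row are replaced by the current surrogate keys before the row is written to the fact table:

# Surrogate-key lookup tables, one per dimension (natural key -> current surrogate key).
customer_map = {"C-42": 102, "C-77": 55}
product_map = {"P-9": 3, "P-12": 7}

def to_surrogate_keys(fact):
    """Replace natural keys in an incoming fact row with current surrogate keys."""
    return {
        "customer_key": customer_map[fact["customer_nk"]],
        "product_key": product_map[fact["product_nk"]],
        "sales_amount": fact["sales_amount"],
    }

incoming = {"customer_nk": "C-42", "product_nk": "P-9", "sales_amount": 99.5}
print(to_surrogate_keys(incoming))   # {'customer_key': 102, 'product_key': 3, 'sales_amount': 99.5}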

55
Surrogate Key Pipeline (contd)
56
Grains of Fact Tables
  • Every fact table should belong to exactly one of
    the following grains
  • Transaction grain
  • Periodic snapshot grain
  • Accumulating snapshot grain

57
Transaction Grain Fact Tables
  • The grain represents an instantaneous measurement at a specific point in space and time
  • e.g., a retail sales transaction
  • The most detailed kind of fact table
  • Generally has a larger number of dimensions
  • Each record corresponds to an instant of measurement
  • Transactions are time-stamped to the nearest second or minute
  • Fact table records arrive in batches at regular intervals

58
Periodic Snapshot Fact Tables
  • The grain represents a span of time, periodically repeated
  • e.g., a periodic snapshot for a checking account in a bank, reported every month:
  • Beginning balance
  • Ending balance
  • Number of deposits
  • Total of deposits
  • Number of withdrawals
  • Total of withdrawals
  • Characterized by a larger number of facts
  • Predictable sparseness
  • one checking account record per account per month

59
Accumulating Snapshot Fact Tables
  • The grain represents finite processes that have a definite beginning and end
  • Order fulfillment, claim processing, small workflows
  • But not large, complicated, looping workflows
  • Example: a shipment invoice line item
  • Order date
  • Requested ship date
  • Actual ship date
  • Delivery date
  • Last payment date
  • Return date
  • Characteristics: a large number of date dimensions; data are created and overwritten multiple times as events unfold

60
Loading a Table
  • Separate inserts from updates (see the sketch below)
  • Identify the update records in the bulk data to be loaded and process them first
  • Perform the insertion of the remaining records using a bulk loader
  • Use a bulk loader
  • To improve the performance of the inserts and decrease database overhead
  • With many bulk loaders, updating is not possible
  • Load in parallel
  • If the ETL tool supports breaking the data into logical segments, say one per year, load the segments in parallel
  • Some ETL tools offer dynamic partitioning to help bulk loading
  • Minimize physical updates
  • To decrease the database overhead of writing rollback logs, delete the records that require updates and reload them afresh using the bulk loader
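
A minimal sketch of separating inserts from updates, with hypothetical keys and file names: rows whose natural key already exists go to the (slower) update path, and the remaining rows are written to a flat file for the bulk loader:

import csv

existing_keys = {"C-42", "C-77"}            # keys already present in the target table

updates, inserts = [], []
for row in [{"customer_nk": "C-42", "city": "Hyderabad"},
            {"customer_nk": "C-99", "city": "Pilani"}]:
    (updates if row["customer_nk"] in existing_keys else inserts).append(row)

# Updates are processed first through normal SQL; inserts go to a bulk-load file.
with open("dim_customer_inserts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["customer_nk", "city"])
    writer.writeheader()
    writer.writerows(inserts)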

61
Updating and Correcting Facts
  • Negate the fact (see the sketch below)
  • Create an exact duplicate of the fact in which all the measurements are negated (minus sign), so the measures cancel each other in summaries
  • Do it for audit reasons and/or if capturing/measuring erroneous entries is significant to the business
  • Update the fact
  • Delete and reload the fact
  • Drawback: current versions of previously released reports are no longer valid
  • Physical deletes: the record is deleted
  • Logical deletes: the record is tagged as deleted
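
A minimal sketch of negating a fact, with hypothetical column names: an exact copy of the row is produced with every measure sign-flipped, so the two rows cancel in summaries:

def negate_fact(fact, measure_columns):
    """Return a copy of the fact row with every measure negated."""
    negated = dict(fact)
    for column in measure_columns:
        negated[column] = -fact[column]
    return negated

bad_fact = {"date_key": 20240101, "product_key": 2, "sales_amount": 40.0, "units": 3}
print(negate_fact(bad_fact, ["sales_amount", "units"]))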

62
Graceful Modifications
  • Modifications to the dimensional schema without affecting end-user applications
  • Modifications include:
  • Adding a fact to the existing fact table at the same grain
  • Adding a dimension to the existing fact table at the same grain
  • Adding an attribute to an existing dimension
  • Increasing the granularity of existing fact and dimension tables
  • A complicated job!
  • Changing weekly grain to daily grain?
  • Constraints on weekly aggregation only work with the historical data
  • Further additions can be at the daily grain

63
Aggregate tables
  • An aggregate navigator sits between the user query and the DBMS, intercepting queries and transforming base-level queries into aggregate-aware queries wherever possible
  • A good aggregate program
  • Provides performance enhancement
  • Adds only a reasonable amount of storage
  • Is transparent to the end user
  • Users are not aware of it
  • No user program may refer to aggregate tables directly!
  • Affects the cost of the extract system as little as possible
  • Each extract leads to updating the aggregates!
  • Does not add much burden to the DBA
  • Generates metadata automatically

64
Aggregate tables- Design Requirements
  • Aggregates must be stored in their own fact tables, separate from the base-level data
  • Dimension tables attached to the aggregate tables must be shrunken versions of the corresponding dimensions
  • The most shrunken version of a dimension is the dimension removed altogether
  • The base table and all its aggregate tables must be related together as a family of schemas, so that the aggregate navigator knows which tables are related
  • All queries from the users must be directed at the base table only

65
Thank You