Data Cleansing - PowerPoint PPT Presentation

About This Presentation
Title:

Data Cleansing

Description:

One fix applies to many targets. May only fix symptoms. Fixes problems at the source ... Auditing reports data integrity violations, but does not fix them. ... – PowerPoint PPT presentation

Number of Views:2557
Avg rating:3.0/5.0
Slides: 16
Provided by: ABSchwa
Learn more at: https://www.ou.edu
Category:

less

Transcript and Presenter's Notes

Title: Data Cleansing


1
Data Cleansing
  • Rule Based Strategy

2
Without Data Cleansing Strategy
  • DW will suffer from
  • lack of quality
  • loss of trust
  • diminishing user base, and
  • loss of business sponsorship and funding
    Entity-Relationship vs. Dimensional Models

3
What is Data quality?
  • well understood
  • integrated
  • satisfies the needs of the business
  • user is satisfied with the quality of the data
    and the information derived from that data
  • complete
  • no duplicate records
  • Data anomalies
  • accurate
  • stored according to data type
  • has integrity
  • consistent
  • well designed database
  • non redundant
  • follows business rules
  • corresponds to established domains
  • timely

4
Rule-based strategy
  • Begins with the understanding that there are
    really only two options for data cleansing
  • clean the source data
  • clean the warehouse data.

5
2 options follow a sequence of
  • Define a robust and comprehensive set of data
    cleansing rules.
  • Data models are the keys.
  • Package the rules as executable components.
  • Objects or modules depending on the environment
    in which they will execute.

6
Data cleansing rules 2 parts
  1. Integrity rules tests integrity of the data
  2. Cleansing Rules specification of an action to be
    taken when integrity violations are encountered.

7
Uses of Data cleansing rules
  • Audit data
  • Report on the occurrence of integrity violations
    without taking any steps to restore integrity to
    the data.
  • Filter data
  • Detect data integrity violations and remove the
    offending data from the set used to populate the
    given target.
  • Correct data
  • Detect data integrity violations and repair the
    data to restore its integrity.

8
Integrity rules vs. Cleansing rules
  • Integrity rules
  • refer to the way the data must conform to meet
    business rules.
  • Cleansing rules
  • combine the definition from the integrity rule
    with the action to be taken in the event of a
    violation.

9
Data Integrity Rules Classification
10
Rule Based Cleansing Process
  • Determine the action to be taken from
    perspectives of
  • At which data is the rule directed source data
    or target data?
  • What type of cleansing action is being performed
    auditing, filtering, or correction?
  • At what point in the warehousing process will the
    cleansing action occur?

11
Source data vs. Target data
Source Data Cleansing Target Data Cleansing
Requires source data models Rules from target data models
Typically large number of rules Typically smaller number of rules
Implicit rules difficult to determine Implicit rules readily identified
Fixes problems at the source May only fix symptoms
One fix applies to many targets Multiple targets may redundant cleansing
May require operational system changes Avoids operational systems issues
12
Data Cleansing Economics
  • Completeness, complexity, and cost must all be
    considered when developing a data cleansing
    strategy.
  • Auditing least costly and least complex action,
    but least complete solution.
  • Correction most complex and costly, but most
    complete solution.
  • Cleansing of source data more complete solution,
    at higher cost and greater complexity, than does
    cleansing of target data.

13
Data Cleansing Economics
14
Principles of Rule Based DC
Data quality directly affects data warehouse acceptance.
Source data is typically dirty and must be made reliable for warehousing.
Data models are a good source of data integrity rules
Both source and target data models yield integrity rules.
Data cleansing rules combine integrity rules with actions to be taken when violations are detected.
Both source data and target data may be cleansed.
Auditing reports data integrity violations, but does not fix them.
Filtering discards data that violates integrity rules.
Correction repairs data that violates integrity rules.
Choice of techniques involves severity of the data quality problem, available data models, and commitment of time and resources.
Whatever techniques are chosen, a systematic, rule-based approach yields better results than an unstructured approach.
15
Bibliography
  • Bischoff, Alexander, 1997 Data Warehousing
    Joyce Bishop and Ted Alexander, 1997
  • Duncan, Wells, 1999 Rule Based Data Cleansing
    for Data Warehousing Karolyn Duncan and David
    Wells
  • Kelly, 1997 Data Warehousing in Action Sean
    Kelly, 1997
Write a Comment
User Comments (0)
About PowerShow.com