Title: Rule Based Data Cleansing for Data Warehousing
1Rule Based Data Cleansingfor Data Warehousing
David L. Wells Infocentric - Business
Intelligence and Knowledge Systems
Consulting (425) 503-4352 / dlwells_at_sprynet.com
2Role of Data Cleansing in a Warehouse Environment
Data Warehouse
3A Real Life Example
2 million claims per month. 377 data items per
claim. 9 items with cleansing rules.
Warehouse users sometimes receive incomplete or
inaccurate information.
Business decisions may be adversely affected.
368 data items with no rules for validation and
cleansing
An error rate of of .001 means 736,000 errors per
month -- more than 8.8 million per year!
DW
Trust in the warehouse as a reliable source of
business information is diminished.
And those errors are compounded and propagated in
data marts!
4Rule Based Data Cleansing Options
Audit, filter, and correct data using rules for
cleansing of source data
- Find rules in source data models
- package as executable components
- embed in data acquisition processes
Audit, filter, and correct data using rules for
cleansing of warehouse data
DW
- Find rules in warehouse data models
- package as executable components
- embed in warehouse population processes
5Recent Statistics on Data Quality Plans
- In a recent TDWI study, 175 data warehousing
professionals responded to the question Do you
have a data quality plan for your data
warehouse? Survey results showed that - 7 have no current plan
- 24 recognize the importance but had no current
plan - 30 recognize the importance, have a plan, but
not yet implemented - 22 have a plan and have started implementation
- 17 have implemented a solution
- 31 of respondents have no quality plan!
- 83 do not have an implemented solution!
Source TDWI Flashpoint July 21, 1999
6Data Cleansing Issues
What are data cleansing rules?
Business-based rules for data integrity, combined
with known actions to be taken when integrity
violations are detected.
Extract
Transform
Where do I find them?
Load
Integrity rules can be found in the data models
both source warehouse data models contain
rules. Analysis is necessary to determine
actions upon violation.
Secondary Extracts, Transforms, Loads
And how do I use them?
Data auditing, data filtering, and data
correction are all possible once the set of data
cleansing rules is known.
Dependent Data Marts
7A Classification of Data Integrity Rules
Data Integrity Rules
8An Example Data Model
9Identity Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
10Reference Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
11Cardinal Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
12Value Set Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
- list of allowable values
- range of allowable values
- allowable character set
- pattern mask
- combination of above
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
13Inheritance Rules
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime - - -
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
14Relationship Dependency
State Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
15Relationship Dependency
Mutually Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
16Relationship Dependency
Mutually Exclusive
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-ssn - - -
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
17Attribute Dependency
State Dependent
EMPLOYEE
POSITION
employee holds
employee-id empl-name empl-status-code empl-separa
tion-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
18Attribute Dependency
Mutually Exclusive
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-hourly-p
ay-rate empl-monthly-pay-rate
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
19Attribute Dependency
Mutually Dependent - Derived
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-fulltime
-salary empl-monthly-pay-rate
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
20Attribute Dependency
Mutually Dependent - Constrained
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-initial-
hire-date empl-separation-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
21Attribute Dependency
Mutually Dependent - Constrained (across
relationship)
EMPLOYEE
POSITION
employee holds
employee-id empl-percent-of-fulltime empl-initial-
hire-date empl-separation-date
position-number posn-description posn-percent-of-f
ulltime posn-status-code
position filled by
employee administers
position administered by
account expended by
account funds
employee authorized for
X
position funded by
ACCOUNT
is one of
account-number acct-type-code acct-begin-date acct
-end-date acct-title
SEASONAL POSITION
PERMANENT POSITION
posn-begin-date posn-end-date
22The Number of Possible Rules
This small data model yields . . .
- 3 identity rules - 1 for each non-subtype
entity - 6 reference rules - 1 per 1many relationship
and 2 per manymany relationship - 4 cardinal rules - 1 per cardinality (except
zero-many) - 13 value set rules - at least 1 per attribute
- 6 inheritance rules - 2 per supertype and 2 per
subtype - 32 explicit rules (and we saw 8 dependency rules
in the examples) - How many rules are possible in your warehouse?
How many are applied?
23Applying Rules for Data Cleansing
Extract
SOURCE CLEANSING
Transform
Load
TARGET CLEANSING
DW
24Choosing Data Cleansing Techniques
Audit Tests data against integrity rules and
individually and/or statistically reports rule
violations. Helps to know size and scope of
integrity problems, but does not fix the
problems. Filter Tests data against integrity
rules and removes data that violates the rules
from the set of data used to populate the data
warehouse. May result in a warehouse with
incomplete information. Correct Tests data
against integrity rules and repairs data found
to be in violation. Repairs may involve finding
an alternate data source, deriving a value from
other data, using a default value, etc.
25Data Auditing - A Continuous Process
- Auditing the data is a continuous process
expected to occur in many places
Implementation
26Data Auditing Techniques - Noting Integrity
Violations
Documentation or Metadata
Fields added to DW tables
new DW tables
Location
View quality info together with data
Simple Approach
Support for extensive detail
Pros
Quality info separated from the data
Static with limited detail
Cons
Complexity
27Data Auditing - Summary
- Auditing is an essential first step to data
quality - Problem scope cannot be assessed until the
problem is understood - Solutions cannot be appropriately designed until
the problem is understood - Auditing validates data models used to define
integrity and cleansing rules
28Data Filtering - Techniques
- Filtering can be performed at multiple levels
- Element Level - Setting null values for
individual elements - Row Level - Removing an entire row from a target
load set - Data Group - Removing rows from multiple related
target load sets
29Data Filtering - Summary
CAUTION!
- Filtering must be used with CAUTION!
- Removing data from a load set may decrease data
quality - Generally, an ineffective technique when used
with staging areas or atomic warehouses - More applicable for data marts
30Data Correction - Techniques
- Correction alters the specific value of
individual fields - Replacement value techniques include
- Inserting a default value
- Translating based on error patterns
- Deriving a value based on surrounding contents
- Searching for alternative sources
31Data Correction - Summary
- Corrections are often forced into the data
warehousing space - Corrections may be made close to the point of
entry for global changes - Other corrections may be better suited for
specific data marts - Always consider corrections to the source systems
32Choosing Data Cleansing Techniques
Source Data Cleansing requires source data
models typically large number of rules implicit
rules difficult to determine fixes problems at
the source one fix applies to many targets may
require operational system changes
Target Data Cleansing rules from target data
models typically smaller number of rules implicit
rules readily identified may only fix
symptoms multiple targets redundant
cleansing avoids operational systems issues
33Choosing Data Cleansing Techniques
near source
Completeness of Solution
Cost and Complexity
near target
filter
audit
correct
34Recalling the Example
2 million claims per month. 377 data items per
claim. 9 items with cleansing rules.
368 data items with no rules for validation and
cleansing
An error rate of of .001 means 736,000 errors per
month -- more than 8.8 million per year!
DW
And those errors are compounded and propagated in
data marts!
35Applying Rule Based Cleansing to the Example
Identify data integrity rules (IRs) and package
them as modules or objects.
Audit after extract and begin to monitor source
data quality.
DW
Correct and filter at transformation. Using
results of monitoring to address most severe
problems first.
Finally, consider using IRs to correct quality
problems at the source.
36Summary of Key Points
- Data quality directly affects data warehouse
acceptance. - Source data is typically dirty and must be
cleaned for warehousing. - Data models are a good source of data integrity
rules. - Seven kinds of integrity rules are either
expressed or implied by entity- relationship data
models. - Both source and target data models yield
integrity rules. - When combined with actions to be taken upon
violation, data integrity rules become data
cleansing rules. - Both source data and target data may be cleansed.
- Auditing reports data integrity violations but
does not fix them. - Filtering discards data that violates integrity
rules. - Correction repairs data that violates integrity
rules. - Choosing a technique depends on many factors
including severity of the problem, available data
models, and commitment of time resources. - Whatever techniques you choose, a systematic
rule-based approach yields better results than an
unstructured approach.
37 Thank You !
Please feel free to contact me with questions and
comments, or to share your experiences with this
or other data cleansing techniques
David L. Wells Infocentric - Business
Intelligence and Knowledge Systems
Consulting (425) 503-4352 / dlwells_at_sprynet.com