Title: Data Cleansing
1Data Cleansing
2A companys most important asset is information.
A corporations ability to compete, adapt, and
grow in a business climate of rapid change is
dependent in large measure on how well the
company uses information to make decisions
Sharing information that isnt clean and
consolidated to the fullest extent can
substantially reduce the effectiveness of a
system of significant investment and considerable
pay-off potential.
Stoker, 1999
3Todays Coverage
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Data Cleansing and Data Quality
- Steps in Data Cleansing
- Why is Dirty Data a Problem?
- Why is Legacy Data Dirty?
- To Cleanse or Not To Cleanse
- Parsing Matching
- Correcting Consolidating
- Standardizing
- Conclusion
- Demonstration
- Questions
-
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
CONCLUSION
4Data Cleansing and Data Quality
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Data is a product that can be characterized as
either quality or non-quality. The ability
to make quality decisions depends in part on the
decision-makers ability to access quality data. - Data cleansing is the process that insures that
the same piece of information is referred to in
only ONE way. When data is clean, its users can
focus on its use and not its credibility.
5Steps in Data Cleansing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
6Why is Data Dirty and Why is This a Problem?
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Simply put, dirty data for data warehouses is the
product of relying on data from legacy systems. - But if companys have relied on this data for
decades, why is it a problem today? - Because a data warehouse promises to deliver a
single version of the truth. Unfortunately
integrating data from different sources magnifies
its problems.
7Why is Legacy Data Dirty ?
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Dummy Values,
- Absence of Data,
- Multipurpose Fields,
- Cryptic Data,
- Contradicting Data,
- Inappropriate Use of Address Lines,
- Violation of Business Rules,
- Reused Primary Keys,
- Non-Unique Identifiers, and
- Data Integration Problems
8To Cleanse or Not to Cleanse
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- CAN the legacy data be cleansed?
- Sometimes the answer is NO
- Then, SHOULD it be cleansed?
- Again, sometimes NO
- Next, WHERE should it be cleansed?
- Finally, HOW should it be cleansed?
9Steps in Cleansing Data
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating
10Parsing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
11Parsing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
12Correcting
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Corrects parsed individual data components
using sophisticated data algorithms and secondary
data sources.
13Correcting
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
14Standardizing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and custom
business rules.
15Standardizing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
16Parsing, Correcting, Standardizing
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
TITLE
FIRST
CONC.
LAST
GENER.
NAME LINE
William
Mr. Bill St. John III 101 S.
Main Strete Sant. Louis, MO 63181
HSNO
ST-NM
ST-TYPE
ST-DIR
St.
STREET LINE
CITY
STATE
POST
St.
63118
GEOG. LINE
17Matching
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
-
- Searching and matching records within and across
the parsed, corrected and standardized data based
on predefined business rules to eliminate
duplications.
18Match Patterns
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
Customer /Tax ID
Branch Type
Vendor Code
Pattern I.D.
Business Name
Street
City
Pattern
Exact
Exact
Exact
Exact
Exact
Exact
AAAAAA
P110
Exact
Exact
Exact
VClose
VClose
Blanks
ABAAA-
P115
Exact
Exact
Exact
Exact
VClose
Blanks
ABA-AA
P120
Exact
Exact
Exact
VClose
Close
Close
ABCCAA
S300
Exact
Exact
Exact
VClose
VClose
Close
BBACAA
S310
19Matching
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
20Consolidating
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
- Analyzing and identifying relationships
between matched records and consolidating/merging
them into ONE representation.
21Consolidating
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
22Consolidating
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
23Recommended Best Practices
CONCLUSION
INTRODUCTION
WHY DIRTY DATA
CLEANSING STEPS
1. Use metadata to document rules 2. Determine
data cleansing schedule 3. Build quality into
new and existing systems
24Legacy Systems View (3 Clients)
CLEANSING STEPS
INTRODUCTION
WHY DIRTY DATA
CONCLUSION
Account No.83451234
Policy No.ME309451-2
Transaction B498/97
25The Reality ONE Client
CLEANSING STEPS
INTRODUCTION
WHY DIRTY DATA
CONCLUSION
Account No.83451234
Policy No.ME309451-2
Transaction B498/97
26Demonstration
CLEANSING STEPS
INTRODUCTION
WHY DIRTY DATA
CONCLUSION
-
- Valityhttp//www.vality.com
- Trillium Software http//www.trilliumsoft.com
- First Logichttp//www.firstlogic.com