Data Cleansing - PowerPoint PPT Presentation

Provided by: terrybus
Transcript and Presenter's Notes

Title: Data Cleansing


1
Data Cleansing
2
A company's most important asset is information.
A corporation's ability to compete, adapt, and
grow in a business climate of rapid change is
dependent in large measure on how well the
company uses information to make decisions.
Sharing information that isn't clean and
consolidated to the fullest extent can
substantially reduce the effectiveness of a
system of significant investment and considerable
pay-off potential.

Stoker, 1999
3
Today's Coverage
INTRODUCTION
  • Data Cleansing and Data Quality
  • Steps in Data Cleansing
WHY DIRTY DATA
  • Why is Dirty Data a Problem?
  • Why is Legacy Data Dirty?
  • To Cleanse or Not To Cleanse
CLEANSING STEPS
  • Parsing
  • Correcting
  • Standardizing
  • Matching
  • Consolidating
CONCLUSION
  • Conclusion
  • Demonstration
  • Questions

4
Data Cleansing and Data Quality
  • Data is a product that can be characterized as
    either quality or non-quality. The ability
    to make quality decisions depends in part on the
    decision-maker's ability to access quality data.
  • Data cleansing is the process that ensures that
    the same piece of information is referred to in
    only ONE way. When data is clean, its users can
    focus on its use and not its credibility.

5
Steps in Data Cleansing
  • Parsing
  • Correcting
  • Standardizing
  • Matching
  • Consolidating

6
Why is Data Dirty and Why is This a Problem?
  • Simply put, dirty data for data warehouses is the
    product of relying on data from legacy systems.
  • But if companies have relied on this data for
    decades, why is it a problem today?
  • Because a data warehouse promises to deliver a
    single version of the truth. Unfortunately,
    integrating data from different sources magnifies
    its problems.

7
Why is Legacy Data Dirty?
  • Dummy Values,
  • Absence of Data,
  • Multipurpose Fields,
  • Cryptic Data,
  • Contradicting Data,
  • Inappropriate Use of Address Lines,
  • Violation of Business Rules,
  • Reused Primary Keys,
  • Non-Unique Identifiers, and
  • Data Integration Problems
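
The first two causes in this list (dummy values and absence of data) are the easiest to detect mechanically. A minimal Python sketch, assuming a record is a plain dictionary of strings; the DUMMY_VALUES set and the field names are hypothetical and would differ for each legacy system:

```python
# Hypothetical placeholder values; real ones depend on the legacy system.
DUMMY_VALUES = {"999-99-9999", "99999", "N/A", "UNKNOWN", "XXX"}


def profile_record(record):
    """Return {field: issue} for fields that are absent or hold a dummy value."""
    issues = {}
    for field, value in record.items():
        text = (value or "").strip()
        if not text:
            issues[field] = "absent"
        elif text.upper() in DUMMY_VALUES:
            issues[field] = "dummy"
    return issues


# Illustrative record, not taken from the presentation.
customer = {"name": "Bill St. John", "ssn": "999-99-9999", "phone": "", "zip": "63118"}
print(profile_record(customer))
```

Profiling of this kind only flags suspect fields; deciding what to do with them is the "To Cleanse or Not to Cleanse" question.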

8
To Cleanse or Not to Cleanse
  • CAN the legacy data be cleansed?
  • Sometimes the answer is NO
  • Then, SHOULD it be cleansed?
  • Again, sometimes NO
  • Next, WHERE should it be cleansed?
  • Finally, HOW should it be cleansed?

9
Steps in Cleansing Data
  • Parsing
  • Correcting
  • Standardizing
  • Matching
  • Consolidating

10
Parsing
  • Parsing locates and identifies individual data
    elements in the source files and then isolates
    these data elements in the target files.
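
As a rough illustration of parsing, the Python sketch below splits a free-form name line into separate elements. The TITLES and GENERATIONS vocabularies and the output field names are assumptions made for this example; commercial tools use far richer dictionaries.

```python
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}           # assumed title vocabulary
GENERATIONS = {"Jr.", "Sr.", "II", "III", "IV"}  # assumed suffix vocabulary


def parse_name_line(name_line):
    """Isolate title, first name, last name, and generation from one string."""
    tokens = name_line.split()
    parsed = {"title": "", "first": "", "last": "", "generation": ""}
    if tokens and tokens[0] in TITLES:
        parsed["title"] = tokens.pop(0)
    if tokens and tokens[-1] in GENERATIONS:
        parsed["generation"] = tokens.pop()
    if tokens:
        parsed["first"] = tokens.pop(0)
    # Whatever remains is treated as the last name (it may contain spaces).
    parsed["last"] = " ".join(tokens)
    return parsed


print(parse_name_line("Mr. Bill St. John III"))
```

Note that parsing only isolates the elements. It does not yet change "Bill" to "William"; that belongs to the correcting step.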

11
Parsing
12
Correcting
  • Correcting fixes parsed individual data components
    using sophisticated data algorithms and secondary
    data sources.
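
One simple way to picture correcting against a secondary data source is fuzzy lookup in a reference list. The sketch below uses Python's standard difflib module; the VALID_STREET_TYPES list stands in for a real reference source and the 0.75 cutoff is an arbitrary assumption, not a recommended setting.

```python
from difflib import get_close_matches

# Hypothetical secondary data source: a reference list of valid street types.
VALID_STREET_TYPES = ["Street", "Avenue", "Boulevard", "Road", "Drive"]


def correct_value(value, reference_values, cutoff=0.75):
    """Return the closest reference value, or the input unchanged if none is close."""
    matches = get_close_matches(value, reference_values, n=1, cutoff=cutoff)
    return matches[0] if matches else value


print(correct_value("Strete", VALID_STREET_TYPES))  # misspelled street type
print(correct_value("Main", VALID_STREET_TYPES))    # no close match, left as is
```

A value with no close match is left alone on purpose, so that the routine does not "correct" data it does not recognize.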

13
Correcting
14
Standardizing
  • Standardizing applies conversion routines to
    transform data into its preferred (and
    consistent) format using both standard and custom
    business rules.
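
A conversion routine can be as small as a lookup table plus a normalizing function. The Python sketch below assumes two hypothetical business rules: street types are stored as abbreviations, and postal codes are stored as five digits.

```python
# Hypothetical business rules: preferred abbreviation for each street type.
STREET_TYPE_RULES = {
    "street": "St.", "st": "St.",
    "avenue": "Ave.", "ave": "Ave.",
    "road": "Rd.", "rd": "Rd.",
}


def standardize_street_type(value):
    """Map any known spelling of a street type to its preferred form."""
    key = value.strip().rstrip(".").lower()
    return STREET_TYPE_RULES.get(key, value.strip())


def standardize_post_code(value):
    """Keep the first five digits; leave the value alone if it has fewer."""
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits[:5] if len(digits) >= 5 else value.strip()


print(standardize_street_type("Street"), standardize_post_code("63118-1234"))
```

Keeping the rules in a table rather than in code makes custom rules easy to add without touching the routine itself.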

15
Standardizing
16
Parsing, Correcting, Standardizing
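[Diagram: one input record parsed, corrected, and standardized into three output lines]
Input: Mr. Bill St. John III 101 S. Main Strete Sant. Louis, MO 63181
NAME LINE fields: TITLE, FIRST, CONC., LAST, GENER.
STREET LINE fields: HSNO, ST-NM, ST-TYPE, ST-DIR
GEOG. LINE fields: CITY, STATE, POST
Output values legible in the transcript: William, St., St., 63118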
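
The street and geography parts of this record can be parsed with two regular expressions. This Python sketch is only an approximation: the patterns assume a US-style address with the house number first and a five-digit postal code last, and the lower-case group names (hsno, st_dir, st_nm, st_type) are adapted from the labels on the slide.

```python
import re

STREET_RE = re.compile(
    r"^(?P<hsno>\d+)\s+"              # house number
    r"(?:(?P<st_dir>[NSEW])\.?\s+)?"  # optional direction such as "S."
    r"(?P<st_nm>.+?)\s+"              # street name
    r"(?P<st_type>\S+)$"              # last word is taken as the street type
)
GEOG_RE = re.compile(r"^(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<post>\d{5})$")


def parse_street_line(line):
    """Return the street elements as a dict, or None if the line does not fit."""
    match = STREET_RE.match(line.strip())
    return match.groupdict() if match else None


def parse_geog_line(line):
    """Return city, state, and postal code as a dict, or None if no match."""
    match = GEOG_RE.match(line.strip())
    return match.groupdict() if match else None


print(parse_street_line("101 S. Main Strete"))
print(parse_geog_line("Sant. Louis, MO 63181"))
```

The misspellings ("Strete", "Sant.") and the wrong postal code survive parsing untouched; fixing them is the job of the correcting and standardizing steps.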
17
Matching
  • Matching searches for and matches records within and
    across the parsed, corrected and standardized data
    based on predefined business rules to eliminate
    duplicates.
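
A minimal Python sketch of matching, assuming one hypothetical business rule: two records are duplicates when their postal codes are equal and their names are at least 85% similar. The similarity measure is difflib's SequenceMatcher ratio, which is a stand-in for whatever comparison a real cleansing tool applies.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, name_threshold=0.85):
    """Return index pairs (i, j) of records that satisfy the assumed rule."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        same_post = left["post"] == right["post"]
        if same_post and similarity(left["name"], right["name"]) >= name_threshold:
            pairs.append((i, j))
    return pairs


# Illustrative records, not taken from the presentation.
records = [
    {"name": "William St. John", "post": "63118"},
    {"name": "william st john", "post": "63118"},
    {"name": "Mary Jones", "post": "63118"},
]
print(find_duplicates(records))
```

Comparing every pair is fine for a sketch but grows quadratically; larger files are usually grouped first (for example by postal code) so that only records in the same group are compared.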

18
Match Patterns

Columns: Customer/Tax ID, Branch Type, Vendor Code, Business Name, Street, City, Pattern, Pattern I.D.

Pattern I.D.  Pattern  Field results (in transcript order)
P110          AAAAAA   Exact, Exact, Exact, Exact, Exact, Exact
P115          ABAAA-   Exact, Exact, Exact, VClose, VClose, Blanks
P120          ABA-AA   Exact, Exact, Exact, Exact, VClose, Blanks
S300          ABCCAA   Exact, Exact, Exact, VClose, Close, Close
S310          BBACAA   Exact, Exact, Exact, VClose, VClose, Close
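
The pattern strings can be read as one grade per compared field. The Python sketch below builds such a string and looks up its pattern I.D. Several things here are assumptions rather than facts from the slide: the letter codes (A = Exact, B = VClose, C = Close, - = Blanks), the order of the six fields, the similarity thresholds, and the extra code X for "no match".

```python
from difflib import SequenceMatcher

# Pattern-to-I.D. pairs as listed on the slide.
PATTERN_IDS = {"AAAAAA": "P110", "ABAAA-": "P115", "ABA-AA": "P120",
               "ABCCAA": "S300", "BBACAA": "S310"}

# Assumed field order for the six pattern positions.
FIELDS = ["business_name", "street", "city", "tax_id", "branch_type", "vendor_code"]


def grade(a, b, vclose=0.9, close=0.75):
    """Grade one field comparison as a single letter (thresholds are assumptions)."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return "-"  # Blanks
    if a == b:
        return "A"  # Exact
    score = SequenceMatcher(None, a, b).ratio()
    if score >= vclose:
        return "B"  # VClose
    if score >= close:
        return "C"  # Close
    return "X"      # hypothetical code for "no match"; not on the slide


def match_pattern(left, right):
    return "".join(grade(left.get(f, ""), right.get(f, "")) for f in FIELDS)


# Illustrative records, not taken from the presentation.
left = {"business_name": "Acme Supply", "street": "101 S Main St", "city": "St. Louis",
        "tax_id": "", "branch_type": "HQ", "vendor_code": "V100"}
right = {"business_name": "Acme Supply", "street": "101 S. Main St", "city": "St. Louis",
         "tax_id": "43-1234567", "branch_type": "HQ", "vendor_code": "V100"}
pattern = match_pattern(left, right)
print(pattern, PATTERN_IDS.get(pattern, "no rule"))
```

A pattern that is not in the table returns "no rule", meaning the pair is not treated as a match under the listed rules.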
19
Matching
20
Consolidating
  • Consolidating analyzes and identifies relationships
    between matched records and merges
    them into ONE representation.
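
A minimal Python sketch of consolidating, assuming the matched records are dictionaries with the same fields. The survivorship rule used here (keep the longest non-empty value for each field) is a hypothetical choice made for illustration; real rules might prefer the most recent or the most trusted source instead.

```python
def consolidate(matched_records):
    """Merge matched records into one, keeping the longest value per field."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            value = (value or "").strip()
            merged.setdefault(field, "")
            if len(value) > len(merged[field]):
                merged[field] = value
    return merged


# Illustrative records, not taken from the presentation.
matched = [
    {"name": "W. St. John", "phone": "", "post": "63118"},
    {"name": "William St. John", "phone": "314-555-0100", "post": "63118"},
]
print(consolidate(matched))
```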

21
Consolidating
22
Consolidating

23
Recommended Best Practices
1. Use metadata to document rules
2. Determine data cleansing schedule
3. Build quality into new and existing systems
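
One possible reading of the first practice is to keep each rule as a piece of data with its own description, so the rule list doubles as documentation. The Python sketch below assumes two hypothetical rules and field names; it is not a description of any particular metadata tool.

```python
# Each rule carries its own description, so the list documents itself.
RULES = [
    {"field": "state",
     "description": "State is a two-letter upper-case code",
     "check": lambda v: len(v) == 2 and v.isalpha() and v.isupper()},
    {"field": "post",
     "description": "Postal code is five digits",
     "check": lambda v: len(v) == 5 and v.isdigit()},
]


def violations(record, rules=RULES):
    """Return the descriptions of every rule the record breaks."""
    return [rule["description"] for rule in rules
            if not rule["check"](record.get(rule["field"], ""))]


print(violations({"state": "Mo", "post": "6311"}))
```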

24
Legacy Systems View (3 Clients)

Account No. 83451234
Policy No. ME309451-2
Transaction B498/97
25
The Reality: ONE Client

Account No. 83451234
Policy No. ME309451-2
Transaction B498/97
26
Demonstration
  • Vality: http://www.vality.com
  • Trillium Software: http://www.trilliumsoft.com
  • First Logic: http://www.firstlogic.com