Title: Avoiding the Garbage In, Garbage Out Issue in Data Warehousing
1. Avoiding the Garbage In, Garbage Out Issue in Data Warehousing
- Best Practices for Creating, Measuring, and Maintaining Quality Data in Data Warehousing Systems
- Susan Teague Rector and Alex Vidas, 2004
2. Questions to Address
- Why is Data Quality Important in Data Warehousing Systems?
- What are the Steps and Best Practices for Implementing a Data Quality Initiative?
- How do you measure Quality Data?
- What Tools help Data Warehouse Developers and Architects Create Quality Data?
3. Questions to Address
- What is Clean, Quality Data?
- Why is Data Quality Important in Data Warehousing Systems?
- What are the Steps and Best Practices for Implementing a Data Quality Initiative?
- How do you identify and measure Quality Data?
- How do you identify problems?
- How do you maintain quality and fix problems?
- What Tools help Data Warehouse Developers and Architects Create Quality Data?
4. Context of data acquisition: ETL
- ETL: short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database.
- Extract -- the process of reading data from a database.
- Transform -- the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
- Load -- the process of writing the data into the target database.
- ETL is used to migrate data from one database to another, to form data marts and data warehouses, and also to convert databases from one format or type to another. (http://sbc.webopedia.com/TERM/E/ETL.html)
- ETL (Extract, Transform, Load) refers to the process of getting data out of one data store (Extract), modifying it (Transform), and inserting it into a different data store (Load). (http://www.sdgcomputing.com/glossary.htm)
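To make the extract/transform/load split above concrete, here is a minimal, hedged Python sketch using SQLite; the table names (customers, dim_customer), columns, and lookup rules are illustrative assumptions, not part of the definitions quoted above.

# Minimal ETL sketch: extract rows from a source SQLite database, apply a
# simple standardization rule, and load them into a target database.
# Table and column names are hypothetical.
import sqlite3

def extract(source_conn):
    """Extract: read raw customer rows from the source system."""
    return source_conn.execute(
        "SELECT name, state, balance FROM customers").fetchall()

def transform(rows):
    """Transform: apply lookup/standardization rules before loading."""
    state_lookup = {"north carolina": "NC", "n.c.": "NC", "nc": "NC"}
    cleaned = []
    for name, state, balance in rows:
        state = state_lookup.get(state.strip().lower(), state.strip().upper())
        cleaned.append((name.strip().title(), state, round(float(balance), 2)))
    return cleaned

def load(target_conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    target_conn.executemany(
        "INSERT INTO dim_customer (name, state, balance) VALUES (?, ?, ?)", rows)
    target_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (name TEXT, state TEXT, balance REAL)")
    source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                       [("  sue smith ", "north carolina", "100.5"),
                        ("ALEX VIDAS", "NC", "42")])
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE dim_customer (name TEXT, state TEXT, balance REAL)")
    load(target, transform(extract(source)))
    print(target.execute("SELECT * FROM dim_customer").fetchall())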
5. What is Clean Data?
- Clean data is the result of some combination of two efforts: making sure that data entered into the system are clean, and cleaning up problems once the data has been accepted. (Data Quality Methods)
- Methods and Terms
- Data Cleansing/Scrubbing: removing errors and inconsistencies from data being imported into a data warehouse
- Data Quality Assurance (a.k.a. Data Cleansing or Data Scrubbing): the process of checking the quality of the data being imported into the data warehouse. (http://www.sdgcomputing.com/glossary.htm)
6. What is Quality Data?
- In this sense, data quality doesn't really refer to whether data is clean or dirty - it's a matter of whether the data is good (valid) or bad (invalid). Validity is a measure of the relevance of the data to the process or analysis at hand. The measure of that validity changes with the context in which the data is used. (Peterson, 2000)
7. Where do problems in data quality come from?
- Quality degenerates over time: around 2% of records become obsolete within a month because customers die, divorce, marry, and move
- Data-entry errors
- Lack of validation routines
- Valid, but not correct, data (as when a typo slips through the validation scheme, e.g., a misspelled last name or SSN)
- Missing values
- System migrations and changes to source systems
- Source system changes may include additional code values or unanticipated use of fields
- Updating graphic interfaces may not keep pace with changes in the systems
- Data conversion errors, as with failing to migrate business processes that govern the use of data
- Data integration issues
- Mismatched syntax, formats, and structures, as when integrating multiple systems with differences in syntax (first-middle-last name vs. last-first-middle name), data formats (different byte allowances), and code structures (male-female vs. m-f vs. 1-2); see the normalization sketch after this list
- Lack of referential integrity checks
- Inconsistently represented data: multiple conceptions of how the data is/should be entered and stored (which must be accommodated in access) and their usefulness across business units
- Duplicate data: can lead to inaccurate analysis, increased marketing/mailing costs, customer annoyance, and relationship breakdown in a relational data-storage system
- Data integration: columns that constitute join fields between multiple datasets must be consistently represented to avoid errors
- Data verification and augmentation: resolving missing or inaccurate values
- Data structural incompatibilities
8. Why Is Clean Data Important in Data Warehousing Systems?
- Important to Business: data warehousing practices have become mission critical
- Important to End Users: many uses of warehoused data affect the consumer's ability to receive information (for instance, via mailing addresses or correct spelling of names). Incorrect personal information can hamper user interactions or alienate users from a company (thus causing loss of referrals, sales base, and possible revenue potential)
- External Standards: industry and governmental regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the Bank Secrecy Act require organizations to carefully manage customer data and privacy or face penalties, unfavorable publicity, and loss of credibility
9. Some cautionary examples
- A telecommunications firm lost $8 million a month because data-entry errors incorrectly coded accounts, preventing bills from being sent out
- An insurance company lost hundreds of thousands of dollars annually in mailing costs due to duplicate customer records
- An information services firm lost $500,000 annually and alienated customers because it repeatedly recalled reports sent to subscribers due to inaccurate data
- A large bank discovered that 62% of its home-equity loans were being calculated incorrectly, with the principal getting larger each month
- A health insurance company in the Midwest delayed a decision support system for two years because the quality of its data was suspect
- A global chemical company discovered it was losing millions of dollars in volume discounts in procuring supplies because it could not correctly identify and reconcile suppliers on a global basis
- A regional bank was unable to calculate customer and product profitability due to missing and inaccurate cost data
- In a conservative estimate, more than 175,000 IRS and state tax refund checks were marked as undeliverable by the post office last year
- Three zeroes inadvertently added to the reported trade volume of an inside executive of a public company in Atlanta caused its stock to plummet over 30% in one day
- After an audit, it was estimated that 15-20% of voters on voter registration lists have either moved or are deceased when compared to post office relocation data
- An acquiring company learned long after the deal was closed that its new consumer business had only half as many customers as it thought, because of the large number of duplicate records in the customer database
- A fiber-optics company lost $500,000 after a mislabeled shipment caused the wrong cable to be laid at the bottom of a lake
- A mail-order company faced massive litigation because it was unable to prevent bogus, insulting names from being entered into the catalog request section of its website; those names were ultimately mailed to unsuspecting and then humiliated recipients of the catalog
10. Statistics
- Data cleansing accounts for up to 70% of the cost and effort of implementing most data warehouse projects
- In 2001, the Data Warehousing Institute estimated that dirty data costs U.S. businesses $600 billion per year
- Data cleanliness and quality was the No. 2 problem, right behind budget cuts, cited in a 2003 IDC survey of 1,648 companies implementing business analytics software enterprise-wide
(COMPUTERWORLD, February 10, 2003)
11. Components of the Data Warehouse
[Figure from Shahzad: data warehouse architecture showing data extraction, transformation, and scrubbing from operational (OLTP) and legacy systems; data storing and cataloguing in the warehouse; and data access and delivery to OLAP, DSS, and analyst tools.]
12. Implementing a Data Quality Initiative
- You are the Data Warehouse Administrator, and you've managed to convince upper-level business managers that data quality is imperative to the future of the corporation.
- What are your next steps?
13. Implementing a Data Quality Initiative
- meet with stakeholders/experts
- analyze metadata documentation
- determine Business Needs
- select process
- implement process
- document process
- continue to measure
- incorporate statistical methods
- sample the data
- statistical process control (SPC)
- sample transactions
- sample the data repository
- flag suspect data (see the sampling sketch after this list)
- develop metrics
(Elements Borrowed from Data Quality Methods)
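The following is a minimal, hedged Python sketch of the "sample the data" and "flag suspect data" steps listed above; the sample fraction, field names, and validation rules are illustrative assumptions rather than prescribed values. A flagged-record rate like the one computed here could feed the metrics or statistical process control tracking the list mentions.

# Draw a random sample of warehouse rows and flag records that fail basic checks.
import random

def sample(records, fraction=0.05, seed=42):
    """Simple random sample of the data repository for quality review."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

def flag_suspect(record):
    """Return a list of rule violations for one record (empty list = clean)."""
    problems = []
    if not record.get("ssn") or len(record["ssn"].replace("-", "")) != 9:
        problems.append("bad SSN")
    if record.get("balance") is None or record["balance"] < 0:
        problems.append("missing or negative balance")
    return problems

records = [{"ssn": "123-45-6789", "balance": 10.0},
           {"ssn": "12-45", "balance": -3.0}] * 50   # toy repository

flagged = [(r, flag_suspect(r)) for r in sample(records, fraction=0.1)]
error_rate = sum(1 for _, p in flagged if p) / len(flagged)
print(f"sampled {len(flagged)} records, {error_rate:.0%} flagged as suspect")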
14. Other points to consider
- How much of the data will you clean?
- Twenty percent of the data may meet 80% of the essential needs.
- Cleanse only those portions of data that have real business value.
- Consider automated software tools for conversion and migration
(Atre, 1998)
15. Best Practices for Cleaning Data
- Scrubbing your data too soon or too late is a waste of effort; plan carefully to make your quality checks at the right time.
(Duncan and Gentry, 2002)
16. Best Practices for Cleaning Data
- The sweet spot lies in the ETL process, between the staging area and the data warehouse.
(Duncan and Gentry, 2002)
17. Implement a Cleaning Plan
- Questions to Ask
- How will you deal with Inconsistent Spellings?
- e.g., IBM, I.B.M., ibm, Int. Business Machines, I.b.m. corp
- How will you Map Data from Disparate Systems?
- How will you address Missing Data? (see the imputation sketch after this list)
- ignore?
- fill in values such as "unknown"?
- get the mean attribute value and use this?
- use a machine learning method such as Bayesian inference or decision trees?
- Do you need to Derive New Data from the Existing?
- calculations, aggregation
- How do you deal with Inconsistent Field Names such as addresses, names, and phone numbers?
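Here is a hedged Python sketch of two of the missing-data options above (fill categorical gaps with "unknown", fill numeric gaps with the mean of observed values); the record layout is hypothetical.

# Mean imputation and "unknown" sentinel filling on a toy record set.
from statistics import mean

records = [
    {"name": "Sue Smith",  "state": "NC",  "balance": 100.5},
    {"name": "Alex Vidas", "state": None,  "balance": None},
    {"name": "Pat Jones",  "state": "VA",  "balance": 42.0},
]

# Mean of the non-missing numeric values.
observed = [r["balance"] for r in records if r["balance"] is not None]
balance_mean = mean(observed)

for r in records:
    if r["state"] is None:          # categorical: fill with a sentinel value
        r["state"] = "unknown"
    if r["balance"] is None:        # numeric: mean imputation
        r["balance"] = round(balance_mean, 2)

print(records)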
18. Implement a Cleaning Plan: An Example
[Figure: customer tables from an HR database, an Accounts Receivable database, and a CRM/Sales database are the data sources feeding the data warehouse.]
19. Cleansing Data: Creating Data Rules with Oracle
- Oracle 9i Warehouse Builder (OWB)
- Use name and address cleansing, parsing, and standardization
- Generates postal reports
- Example (see the sketch below)
- Input: Sue Smith, 187 XYZ Lane, Chapel Hill NC
- Parsed and standardized output: Sue | Susan | Smith | 187 | XYZ | Lane | Chapel Hill | NC | 27344 | 2345
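The following rough Python sketch illustrates the kind of name-and-address parsing and standardization described above; it is not Oracle Warehouse Builder's implementation, and the nickname table and regular expression are simplified assumptions.

# Split a free-form "name, street, city state" string into fields and
# standardize the first name against a nickname lookup table.
import re

NICKNAMES = {"sue": "Susan", "bob": "Robert", "bill": "William"}

ADDRESS_RE = re.compile(
    r"(?P<first>\w+)\s+(?P<last>\w+),\s*"
    r"(?P<number>\d+)\s+(?P<street>.+?),\s*"
    r"(?P<city>.+?)\s+(?P<state>[A-Z]{2})$")

def parse_customer(raw):
    """Parse one raw customer string; return None so suspect records can be flagged."""
    m = ADDRESS_RE.match(raw.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["standard_first"] = NICKNAMES.get(fields["first"].lower(),
                                             fields["first"].title())
    return fields

print(parse_customer("Sue Smith, 187 XYZ Lane, Chapel Hill NC"))
# {'first': 'Sue', 'last': 'Smith', 'number': '187', 'street': 'XYZ Lane',
#  'city': 'Chapel Hill', 'state': 'NC', 'standard_first': 'Susan'}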
20. Cleansing Data: Creating Data Rules with Oracle
- Use the match and merge functionality
- can be specific: swap the first and last name of the person
- could be generic: match 2 words with 90% similarity
- use Soundex to match
- fuzzy matching and merging
- can be applied to all types of data
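A hedged illustration of the two matching ideas above, using only the Python standard library rather than OWB's match/merge operators: a 90% string-similarity test and a basic Soundex code for phonetic matching. The threshold and sample strings are assumptions.

from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    """True if two strings are at least `threshold` similar (ratio in 0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def soundex(name):
    """Classic 4-character Soundex code: first letter plus 3 digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":          # H and W do not reset the previous code
            prev = digit
    return (result + "000")[:4]

print(similar("Susan Teague Rector", "Susan Teague Recter"))  # True
print(soundex("Smith"), soundex("Smyth"))                     # S530 S530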
21. How do you maintain cleanliness and fix problems?
- Update and implement business and system rules
- Make sure definitions across systems are compliant (such as "customer" or "profit")
- Consider the original purpose for which the data was collected when adapting it for different uses (it may need to be flagged as incomplete, missing, or less accurate than desired, and as needing augmentation)
- Keep an eye on how your data is entered and obtained, and on its subsequent migration
- Use ETL tools to implement standardization, matching, data verification, de-duplication, data integration, accuracy, and data quality business rule management in your data warehousing environment (a de-duplication sketch follows this list). Pick your ETL tool carefully to make sure it suits your business needs.
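As referenced in the last bullet, here is a hedged de-duplication sketch in Python; the match key (normalized name plus ZIP) and field names are illustrative assumptions, and a production merge step would be more sophisticated.

# Build a normalized key from name and postal code so that trivially different
# spellings of the same customer collapse to a single warehouse record.
def dedup_key(record):
    """Normalized match key: lowercase alphanumeric name + trimmed ZIP."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return name, record["zip"].strip()

def deduplicate(records):
    seen = {}
    for rec in records:
        key = dedup_key(rec)
        # Keep the first record for each key; a real merge step could instead
        # combine the best fields from every duplicate.
        seen.setdefault(key, rec)
    return list(seen.values())

customers = [
    {"name": "Susan Teague Rector", "zip": "27514"},
    {"name": "susan teague-rector", "zip": "27514 "},
    {"name": "Alex Vidas",          "zip": "27510"},
]
print(deduplicate(customers))   # two unique customers remain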
22. Tool Statistics
- Just 23 of 130 companies surveyed by Cutter Consortium on their data warehousing and business-intelligence practices use specialized data cleansing tools
- Of those companies in the Cutter Consortium study using specialized data scrubbing software, 31% are using tools that were built in-house
(COMPUTERWORLD, February 10, 2003)
23. Tools to Clean Data
- Oracle Data Warehouse Builder
- SAS Data Warehouse Administrator
- DataFlux dfPower Studio 4.0
- Trillium: http://www.trilliumsoftware.com/site/content/product/methodology.asp
- Commercial tools review: http://www.kismeta.com/ex2.html
- Other companies: http://dmoz.org/Computers/Software/Datbases/Data_Warehousing/Data_Integraty_and_Cleansing_Tools/
24.
- Apertus Technologies, Enterprise Integrator Tool: Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.
- Bachman: Modeling and reverse engineering tool. The reverse engineering tool derives a consistent set of metadata from several potential source systems' metadata.
- Carleton, Passport: Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.
- DBStar, Migration Architect: High-end rules and relationships discovery.
- Embarcadero, ER1: Departmental modeling and reverse engineering tool. The reverse engineering tool derives a consistent set of metadata from several potential source systems' metadata.
- Evolutionary Technologies (ETI), EXTRACT: Users write transformation rules. Data is filtered against domains and ranges of legal values and compared to other data structures.
- Information Discovery, IDI: Low-end rule discovery.
- Kismet Analytic, KisMeta Workbench: Enterprise data modeling and reverse engineering tool. The tool derives a unified conceptual schema, and rationalizes and cross-indexes all of the enterprise's metadata. Evaluates metadata with lexical analysis, natural language interpretation, and statistical techniques.
- LogicWorks, ERwin ERX: Departmental modeling and reverse engineering tool. The reverse engineering tool derives a consistent set of metadata from several potential source systems' metadata.
- Oracle, Symmetric Replicator: A data replication product designed to extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases.
- Platinum, InfoRefiner: A data pump product designed to extract data from several mainframe platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database. The extraction uses custom code modules written in COBOL. This is a mainframe/MVS-based tool.
25.
- Platinum, InfoPump: A data pump product designed to extract data from several mainframe and client/server platforms, perform some filtering and transformation, and distribute and load to another mainframe platform database. Requires InfoHub for most services. The extraction uses custom code modules. This is a client/server-based tool.
- Platinum, InfoHub: A middleware product designed to establish a permanent relationship (including filtering and transformation) between source systems and a logical model. The logical model is then available as a virtual database to end-user query tools or a data migration product such as InfoPump.
- Praxis, Omni Replicator: A data replication product designed to extract data from several platforms, perform some filtering and transformation, and distribute and load to another database or databases.
- Prism Solutions, Warehouse Manager: Users enter extraction and transformation parameters, and data is filtered against domains and ranges of legal values.
- PostalSoft, ACE: Specialty address correction tool. Uses a "brute force" strategy of comparing each address with a USPS database of over one hundred million valid US addresses.
- PostalSoft, Library: A class library of callable routines that developers use to create "address-smart" data entry screens that automatically edit most address characteristics.
- Group 1: Specialty address correction tool. Uses a "brute force" strategy of comparing each address with a USPS database of over one hundred million valid US addresses.
- Mailers 4: A class library of callable routines that developers use to create "address-smart" data entry screens that automatically edit most address characteristics.
- QDB Solutions, QDB/Analyze: High-end rules and relationships discovery.
- Search Software America (SSA): Name and address data quality tool. Similar in technology to Vality Integrity, SSA evaluates data with lexical analysis tied to a database of known name frequency distribution in a given population.
- Vality Technology, Integrity: Integrity evaluates data with lexical analysis rather than with rules or look-up tables. Best for parsing unstructured text fields.
- WizSoft, WizRule: Low-end rule discovery.
26. References
- Atre, S. (1998). Rules for Data Cleansing. Computerworld. http://www.computerworld.com/news/1998/story/0,11280,30084,00.html
- Bowen, P., Fuhrer, D., & Guess, F. (1998). Continuously Improving Data Quality in Persistent Databases. Data Quality, 4(1). http://www.dataquality.com/998bowen.htm
- Database Advisor. (2002). Get Started with a Data Quality Initiative. www.advisor.com. http://www.advisor.com/Articles.nsf/aid/SMITT633
- Data Extraction, Transformation and Migration Tools. http://www.kismeta.com/extract.html
- Data Quality Methods (1996). http://database.ittoolbox.com/browse.asp?cDBPeerPublishingrhttp3A2F2Fwww2Ekismeta2Ecom2Fcleand12Ehtml
- Dubois, L. (2002). Ten Critical Factors for Successful Enterprise-Wide Data Quality. CRMguru.com. http://www.crmguru.com/features/2002a/0418ld.html
- Droogendyk, W., & Harschnitz, L. Successful Business Intelligence Systems: Improving Information Quality with the SAS System. http://www2.sas.com/proceedings/sugi22/POSTERS/PAPER243.PDF
- Duncan, K., & Gentry, J. (2002). When Should You Clean Warehoused Data? Enterprise Systems. http://www.esj.com/columns/article.asp?EditorialsID18
- Eckerson, W. (2003). The Evolution of ETL. Business Intelligence Journal, 8(4).
- Oracle 9i Data Warehousing Guide, Release 2 (9.2). http://download-east.oracle.com/docs/cd/B10501_01/server.920/a96520.pdf
- Oracle 9i DS Daily Feature. (2001). Advanced Data Cleansing with Oracle 9i Warehouse Builder. Oracle Technology Network. http://otn.oracle.com/products/ids/daily/jul01.html
- Oracle 9i Warehouse Builder, Integrated Data Quality. (2003). http://otn.oracle.com/products/warehouse/pdf/Warehouse_92_data_quality_whitepaper.pdf
- Peterson, T. (2000). Data Scrubbing. Computerworld. http://www.computerworld.com/databasetopics/data/story/0,10801,78230,00.html
- Shahzad, M. (2004). Data Warehousing with Oracle. Oracular.com. http://www.oracular.com/white_paper_pdfs/DataWarehousingwithOracle.pdf
- White, Colin. Data Warehousing: Cleaning and Transforming Data. InfoDB, 10(6), pp. 11-12. http://www.evaltech.com/wpapers/dwcleansing.pdf