Avoiding the Garbage In, Garbage Out Issue in Data Warehousing
1
Avoiding the Garbage In, Garbage Out Issue in
Data Warehousing
  • Best Practices for Creating, Measuring, and
    Maintaining Quality Data in Data Warehousing
    Systems
  • Susan Teague Rector and Alex Vidas, 2004

2
Questions to Address
  • Why is Data Quality Important in Data Warehousing
    Systems?
  • What are the Steps and Best Practices for
    Implementing a Data Quality Initiative?
  • How do you measure Quality Data?
  • What Tools help Data Warehouse Developers and
    Architects Create Quality Data?

3
Questions to Address
  • What is Clean, Quality Data?
  • Why is Data Quality Important in Data
    Warehousing Systems?
  • What are the Steps and Best Practices for
    Implementing a Data Quality Initiative?
  • How do you identify and measure Quality Data?
  • How do you identify problems?
  • How do you maintain quality and fix problems?
  • What Tools help Data Warehouse Developers and
    Architects Create Quality Data?

4
Context of Data Acquisition: ETL
  • ETL: short for extract, transform, load, three
    database functions that are combined into one
    tool to pull data out of one database and place
    it into another database.
  • Extract -- the process of reading data from a
    database.
  • Transform -- the process of converting the
    extracted data from its previous form into the
    form it needs to be in so that it can be placed
    into another database. Transformation occurs by
    using rules or lookup tables or by combining the
    data with other data.
  • Load -- the process of writing the data into the
    target database.
  • ETL is used to migrate data from one database to
    another, to form data marts and data warehouses,
    and also to convert databases from one format or
    type to another.
    (http://sbc.webopedia.com/TERM/E/ETL.html)
  • ETL (Extract, Transform, Load) refers to the
    process of getting data out of one data store
    (Extract), modifying it (Transform), and inserting
    it into a different data store (Load).
    (http://www.sdgcomputing.com/glossary.htm)

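To make the three steps concrete, here is a minimal ETL sketch (not part of the original slides) that moves one table through extract, transform, and load with Python's standard sqlite3 module. The file names, table names, and state-code lookup are illustrative assumptions.

    import sqlite3

    # Source (operational) and target (warehouse) databases -- file names are assumed.
    source = sqlite3.connect("source.db")
    target = sqlite3.connect("warehouse.db")

    # Lookup table used in the Transform step (assumed state-code mapping).
    STATE_LOOKUP = {"N.C.": "NC", "North Carolina": "NC"}

    # Extract: read data from the source database.
    rows = source.execute("SELECT id, name, state FROM customers").fetchall()

    # Transform: apply rules and the lookup table to match the target's form.
    clean_rows = [(cid, name.strip().title(), STATE_LOOKUP.get(state, state))
                  for cid, name, state in rows]

    # Load: write the transformed data into the target database.
    target.execute("CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER, name TEXT, state TEXT)")
    target.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", clean_rows)
    target.commit()
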
5
What is Clean Data?
  • Clean data is the result of some combination of
    two efforts: making sure that data entered into
    the system is clean, and cleaning up problems
    once the data has been accepted.
  • (Data Quality Methods)
  • Methods and Terms
  • Data Cleansing/Scrubbing: removing errors and
    inconsistencies from data being imported into a
    data warehouse
  • Data Quality Assurance (a.k.a. Data Cleansing or
    Data Scrubbing): the process of checking the
    quality of the data being imported into the data
    warehouse.

  • http://www.sdgcomputing.com/glossary.htm

6
What is Quality Data?
  • In this sense, data quality doesn't really refer
    to whether data is clean or dirty; it's a matter
    of whether the data is good (valid) or bad
    (invalid). Validity is a measure of the relevance
    of the data to the process or analysis at hand.
    The measure of that validity changes with the
    context in which the data is used.
  • (Peterson, 2000)

7
Where do problems in data quality come from?
  • Quality degenerates over time: around 2% of
    records become obsolete within a month because
    customers die, divorce, marry, and move
  • Data-entry errors
  • Lack of validation routines
  • Valid, but not correct, data (as with a typo that
    slips through the validation scheme, e.g., a
    misspelled last name or SSN)
  • Missing values
  • System migrations and changes to source systems
  • Source system changes may include additional code
    values or unanticipated use of fields
  • Updates to graphical interfaces may not keep pace
    with changes in the systems
  • Data conversion errors, as with failing to
    migrate business processes that govern the use of
    the data
  • Data integration issues
  • Mismatched syntax, formats, and structures, as
    when integrating multiple systems where
    differences in syntax (first-middle-last vs.
    last-first-middle name), data formats (different
    byte allowances), and code structures (male-female
    vs. m-f vs. 1-2) may appear
  • Lack of referential integrity checks
  • Inconsistently represented data: multiple
    conceptions of how the data is or should be
    entered and stored (which must be accommodated at
    access time) and of its usefulness across business
    units
  • Duplicate data, which can lead to inaccurate
    analysis, increased marketing/mailing costs,
    customer annoyance, and relationship breakdown in
    a relational data-storage system
  • Data integration: columns that constitute join
    fields between multiple datasets must be
    consistently represented to avoid errors
  • Data verification and augmentation: resolving
    missing or inaccurate values
  • Data structural incompatibilities

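Several of these failure modes can be caught with simple checks at load time. The sketch below is an illustration only (the field names and gender-code mapping are assumptions, not part of the presentation); it shows code-structure normalization, missing-value detection, and a basic duplicate check.

    # Illustrative load-time checks; field names and code mappings are assumed.
    GENDER_CODES = {"male": "M", "female": "F", "m": "M", "f": "F", "1": "M", "2": "F"}

    def validate(record, seen_keys):
        """Return a list of quality problems found in one customer record."""
        problems = []
        # Inconsistent code structures: map every variant to one representation.
        gender = str(record.get("gender", "")).strip().lower()
        record["gender"] = GENDER_CODES.get(gender)
        if record["gender"] is None:
            problems.append("unrecognized gender code")
        # Missing values.
        for field in ("last_name", "ssn", "address"):
            if not record.get(field):
                problems.append("missing " + field)
        # Duplicate data: the same natural key seen earlier in the load.
        key = (record.get("ssn"), record.get("last_name"))
        if key in seen_keys:
            problems.append("possible duplicate record")
        seen_keys.add(key)
        return problems
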
8
Why Is Clean Data Important in Data Warehousing
Systems?
  • Important to Business: data warehousing
    practices have become mission critical
  • Important to End Users: many uses of warehoused
    data affect the consumer's ability to receive
    information (for instance, via mailing addresses
    or correct spelling of names). Incorrect personal
    information can hamper user interactions or
    alienate users from a company (thus causing loss
    of referrals, sales base, and possible revenue
    potential)
  • External Standards: industry and governmental
    regulations such as the Health Insurance
    Portability and Accountability Act (HIPAA) and
    the Bank Secrecy Act require organizations to
    carefully manage customer data and privacy or
    face penalties, unfavorable publicity, and loss of
    credibility

9
Some cautionary examples
  • A telecommunications firm lost $8 million a month
    because data-entry errors incorrectly coded
    accounts, preventing bills from being sent out
  • An insurance company lost hundreds of thousands
    of dollars annually in mailing costs due to
    duplicate customer records
  • An information services firm lost $500,000
    annually and alienated customers because it
    repeatedly recalled reports sent to subscribers
    due to inaccurate data
  • A large bank discovered that 62% of its
    home-equity loans were being calculated
    incorrectly, with the principal getting larger
    each month
  • A health insurance company in the Midwest delayed
    a decision support system for two years because
    the quality of its data was suspect
  • A global chemical company discovered it was
    losing millions of dollars in volume discounts in
    procuring supplies because it could not correctly
    identify and reconcile suppliers on a global
    basis
  • A regional bank was unable to calculate customer
    and product profitability due to missing and
    inaccurate cost data
  • By a conservative estimate, more than 175,000
    IRS and state tax refund checks were marked as
    undeliverable by the post office last year
  • Three zeroes inadvertently added to the reported
    trade volume of an inside executive of a
    public company in Atlanta caused its stock to
    plummet over 30% in one day
  • After an audit, it was estimated that 15-20% of
    voters on voter registration lists have either
    moved or are deceased, based on comparison with
    post office relocation data
  • An acquiring company learned long after the deal
    was closed that its new consumer business had only
    half as many customers as it thought because of
    the large number of duplicate records in the
    customer database
  • A fiber-optics company lost $500,000 after a
    mislabeled shipment caused the wrong cable to be
    laid at the bottom of a lake
  • A mail-order company faced massive litigation
    because it was unable to prevent bogus, insulting
    names entered into the catalog request section of
    its website from ultimately being mailed to
    unsuspecting and humiliated recipients of the
    catalog

10
Statistics
  • Data cleansing accounts for up to 70% of the
    cost and effort of implementing most data
    warehouse projects
  • In 2001, the Data Warehousing Institute estimated
    that dirty data costs U.S. businesses $600
    billion per year
  • Data cleanliness and quality was the No. 2
    problem, right behind budget cuts, cited in a
    2003 IDC survey of 1,648 companies implementing
    business analytics software enterprise-wide
  • (COMPUTERWORLD, February 10, 2003)

11
Components of the Data Warehouse
[Figure from Shahzad: operational systems (OLTP and legacy systems) feed data
extraction, transformation, and scrubbing into the data warehouse; data storing
and cataloguing, together with data access and delivery, serve OLAP, DSS, and
analyst tools]
12
Implementing a Data Quality Initiative
  • You are the Data Warehouse Administrator, and
    you've managed to convince upper-level business
    managers that data quality is imperative to the
    future of the corporation.

What are your next steps?
13
Implementing a Data Quality Initiative
  • meet with stakeholders/experts
  • analyze metadata documentation
  • determine business needs
  • select process
  • implement process
  • document process
  • continue to measure
  • incorporate statistical methods
  • sample the data
  • statistical process control (SPC)
  • sample transactions
  • sample the data repository
  • flag suspect data
  • develop metrics

(Elements Borrowed from Data Quality Methods)
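One way to act on the sampling, flagging, and metrics steps above is sketched here. It is an illustration under assumed rule definitions and sample size, not a prescribed method.

    # Illustrative sampling of the data repository with suspect-data flagging
    # and a simple quality metric; the rules and sample size are assumptions.
    import random

    def quality_sample(records, rules, sample_size=1000):
        """Sample records, flag any that break a rule, and report the defect rate."""
        sample = random.sample(records, min(sample_size, len(records)))
        flagged = [r for r in sample if any(not rule(r) for rule in rules)]
        defect_rate = len(flagged) / len(sample) if sample else 0.0
        return flagged, defect_rate

    # Example rules: each returns True when the record passes.
    rules = [
        lambda r: bool(r.get("customer_id")),            # key must be present
        lambda r: len(str(r.get("zip", ""))) in (5, 9),  # US ZIP length check
    ]
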
14
Other points to consider
  • How much of the data will you clean?
  • Twenty percent of the data may meet 80% of the
    essential needs.
  • Cleanse only those portions of data that have
    real business value.
  • Consider automated software tools for conversion
    and migration

(Atre, S., 1998)
15
Best Practices for Cleaning Data
  • Scrubbing your data too soon or too late is a
    waste of effort; plan carefully to make your
    quality checks at the right time.
  • (Duncan & Gentry, 2002)

16
Best Practices for Cleaning Data
  • The sweet spot lies in the ETL process, between
    the staging area and the data warehouse

(Duncan & Gentry, 2002)
17
Implement a Cleaning Plan
  • Questions to Ask
  • How will you deal with Inconsistent Spellings?
  • IBM, I.B.M, ibm, Int. Business Machines, I.bm
    copr
  • How will you Map Data from Disparate Systems?
  • How will you address Missing Data?
  • ignore?
  • fill in values such as unknown?
  • get the mean attribute value and use this?
  • use a machine learning method such as Bayesian
    inference or decision trees?
  • Do you need to Derive New Data from the Existing?
  • calculations, aggregation
  • How do you deal with Inconsistent Field names
    such as addresses, names and phone numbers?

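The missing-data options above range from ignoring the row to machine-learning imputation. The sketch below shows only the two simplest choices (filling in a sentinel value and using the mean attribute value); the column names are assumptions for the example.

    # Illustrative handling of missing values: fill with a sentinel or the column mean.
    from statistics import mean

    def fill_missing(rows, numeric_col="annual_revenue", text_col="industry"):
        known = [r[numeric_col] for r in rows if r.get(numeric_col) is not None]
        col_mean = mean(known) if known else 0.0
        for r in rows:
            if r.get(numeric_col) is None:
                r[numeric_col] = col_mean      # mean-attribute-value strategy
            if not r.get(text_col):
                r[text_col] = "unknown"        # fill-in-a-value strategy
        return rows
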
18
Implement a Cleaning Plan: An Example
[Diagram: the Customer Table from the HR database, the Accounts Receivable
database, and the CRM/Sales database all feed the Data Warehouse]
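A sketch of the consolidation this example implies: each source's customer table is extracted and tagged with its origin before cleansing and loading. It is an illustration only; the sqlite connections, file names, and column list are assumptions.

    # Illustrative extraction of the customer table from three source databases.
    import sqlite3

    SOURCES = {"hr": "hr.db", "ar": "accounts_receivable.db", "crm": "crm_sales.db"}

    def extract_customers():
        all_rows = []
        for source_name, path in SOURCES.items():
            with sqlite3.connect(path) as conn:
                rows = conn.execute(
                    "SELECT customer_id, name, address FROM customer").fetchall()
            # Tag each row with its source so conflicts can be traced during cleansing.
            all_rows.extend((source_name,) + tuple(row) for row in rows)
        return all_rows
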
19
Cleansing Data: Creating Data Rules with Oracle
  • Oracle 9i Warehouse Builder (OWB)
  • Use name and address cleansing, parsing, and
    standardization
  • Generates postal reports
  • Example
  • Input: Sue Smith, 187 XYZ Lane, Chapel Hill NC
  • Parsed and standardized output: Sue | Susan |
    Smith | 187 | XYZ | Lane | Chapel Hill | NC |
    27344 | 2345

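A rough idea of what the parsing and standardization step produces can be sketched in a few lines. This is a generic illustration, not the Warehouse Builder API; the regular expression and nickname table are assumptions.

    # Generic name-and-address parsing/standardization sketch (NOT the OWB API).
    import re

    NICKNAMES = {"Sue": "Susan", "Bob": "Robert"}  # assumed standardization table

    def parse_customer(text):
        """Split 'First Last, Number Street Type, City ST' into standardized parts."""
        m = re.match(r"(\w+) (\w+), (\d+) (\w+) (\w+), ([\w ]+) ([A-Z]{2})$", text)
        if not m:
            return None
        first, last, number, street, street_type, city, state = m.groups()
        return [first, NICKNAMES.get(first, first), last,
                number, street, street_type, city, state]

    # parse_customer("Sue Smith, 187 XYZ Lane, Chapel Hill NC")
    #   -> ['Sue', 'Susan', 'Smith', '187', 'XYZ', 'Lane', 'Chapel Hill', 'NC']
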
20
Cleansing Data: Creating Data Rules with Oracle
  • Use the match and merge functionality
  • can be specific: swap the first and last name of
    the person
  • could be generic: match 2 words with 90%
    similarity
  • use Soundex to match
  • fuzzy matching and merging
  • can be applied to all types of data

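As a rough illustration of the match rules this slide describes, the sketch below uses Python's standard difflib for a 90% similarity test and a trivial merge. It is not OWB's match/merge engine, and the survivorship rule is an assumption.

    # Illustrative fuzzy matching and merging with difflib (not the OWB engine).
    from difflib import SequenceMatcher

    def is_probable_match(a, b, threshold=0.90):
        """Generic rule: treat two values as the same entity at >= 90% similarity."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def merge(record_a, record_b):
        """Simple survivorship: keep the first non-empty value for each field."""
        return {k: record_a.get(k) or record_b.get(k)
                for k in set(record_a) | set(record_b)}

    # is_probable_match("Jonathan Smith", "Jonathon Smith")  -> True
    # is_probable_match("Smith", "Smythe") -> False on raw similarity; a Soundex
    # comparison (both code to S530) would still catch this pair.
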
21
How do you maintain cleanness and fix problems?
  • Update and implement business and system rules
  • Make sure definitions across systems are
    consistent (such as "customer" or "profit")
  • Consider the original purpose for which the data
    was collected when adapting it for different uses
    (data may need to be flagged as incomplete,
    missing, or less accurate than desired, and as
    needing augmentation)
  • Keep an eye on how your data is entered and
    obtained and on its subsequent migration
  • Use ETL tools to implement standardization,
    matching, data verification, de-duplication, data
    integration, accuracy checks, and data-quality
    business-rule management in your data warehousing
    environment. Pick your ETL tool carefully to make
    sure it suits your business needs.

22
Tool Statistics
  • Just 23 of 130 companies surveyed by Cutter
    Consortium on their data warehousing and
    business-intelligence practices use specialized
    data cleansing tools
  • Of those companies in the Cutter Consortium study
    using specialized data scrubbing software, 31%
    are using tools that were built in-house.
  • (COMPUTERWORLD, February 10, 2003)

23
Tools to Clean Data
  • Oracle Data Warehouse Builder
  • SAS Data Warehouse Administrator
  • DataFlux dfPower Studio 4.0
  • Trillium:
    http://www.trilliumsoftware.com/site/content/product/methodology.asp
  • Commercial tools review:
    http://www.kismeta.com/ex2.html
  • Other companies:
    http://dmoz.org/Computers/Software/Datbases/Data_Warehousing/Data_Integraty_and_Cleansing_Tools/
24
  • Apertus Technologies, Enterprise Integrator: users
    write transformation rules. Data is filtered
    against domains and ranges of legal values and
    compared to other data structures.
  • Bachman: modeling and reverse engineering tool.
    The reverse engineering tool derives a consistent
    set of metadata from several potential source
    systems' metadata.
  • Carleton, Passport: users enter extraction and
    transformation parameters, and data is filtered
    against domains and ranges of legal values.
  • DBStar, Migration Architect: high-end rules and
    relationships discovery.
  • Embarcado, ER1: departmental modeling and reverse
    engineering tool. The reverse engineering tool
    derives a consistent set of metadata from several
    potential source systems' metadata.
  • Evolutionary Technology (ETI), EXTRACT: users
    write transformation rules. Data is filtered
    against domains and ranges of legal values and
    compared to other data structures.
  • Information Discovery, IDI: low-end rule
    discovery.
  • Kismet Analytic, KisMeta Workbench: enterprise
    data modeling and reverse engineering tool. The
    tool derives a unified conceptual schema, and
    rationalizes and cross-indexes all of the
    enterprise's metadata. Evaluates metadata with
    lexical analysis, natural language interpretation,
    and statistical techniques.
  • LogicWorks, ERWIN ERX: departmental modeling and
    reverse engineering tool. The reverse engineering
    tool derives a consistent set of metadata from
    several potential source systems' metadata.
  • Oracle, Symmetric Replicator: a data replication
    product designed to extract data from several
    platforms, perform some filtering and
    transformation, and distribute and load to another
    database or databases.
  • Platinum, InfoRefiner: a data pump product
    designed to extract data from several mainframe
    platforms, perform some filtering and
    transformation, and distribute and load to another
    mainframe platform database. The extraction uses
    custom code modules written in COBOL. This is a
    mainframe/MVS-based tool.
25
  • Platinum, InfoPump: a data pump product designed
    to extract data from several mainframe and
    client/server platforms, perform some filtering
    and transformation, and distribute and load to
    another mainframe platform database. Requires
    InfoHub for most services. The extraction uses
    custom code modules. This is a client/server-based
    tool.
  • Platinum, InfoHub: a middleware product designed
    to establish a permanent relationship (including
    filtering and transformation) between source
    systems and a logical model. The logical model is
    then available as a virtual database to end-user
    query tools or a data migration product such as
    InfoPump.
  • Praxis, Omni Replicator: a data replication
    product designed to extract data from several
    platforms, perform some filtering and
    transformation, and distribute and load to another
    database or databases.
  • Prism Solutions, Warehouse Manager: users enter
    extraction and transformation parameters, and data
    is filtered against domains and ranges of legal
    values.
  • PostalSoft, ACE: specialty address correction
    tool. Uses a "brute force" strategy of comparing
    each address with a USPS database of over one
    hundred million valid US addresses.
  • PostalSoft, Library: a class library of callable
    routines that developers use to create
    "address-smart" data entry screens that
    automatically edit most address characteristics.
  • Group 1: specialty address correction tool. Uses
    a "brute force" strategy of comparing each address
    with a USPS database of over one hundred million
    valid US addresses.
  • Mailers 4: a class library of callable routines
    that developers use to create "address-smart" data
    entry screens that automatically edit most address
    characteristics.
  • QDB Solutions, QDB/Analyze: high-end rules and
    relationships discovery.
  • Search Software America (SSA): name and address
    data quality tool. Similar in technology to Vality
    Integrity, SSA evaluates data with lexical
    analysis tied to a database of known name
    frequency distributions in a given population.
  • Vality Technology, Integrity: Integrity evaluates
    data with lexical analysis rather than with rules
    or look-up tables. Best for parsing unstructured
    text fields.
  • WizSoft, WizRule: low-end rule discovery.
26
References
  • Atre, S. (1998). Rules for Data Cleansing. Computer World.
    http://www.computerworld.com/news/1998/story/0,11280,30084,00.html
  • Bowen, P., Fuhrer, D., & Guess, F. (1998). Continuously Improving Data
    Quality in Persistent Databases. Data Quality, 4(1).
    http://www.dataquality.com/998bowen.htm
  • Database Advisor. (2002). Get Started with a Data Quality Initiative.
    www.advisor.com. http://www.advisor.com/Articles.nsf/aid/SMITT633
  • Data Extraction, Transformation and Migration Tools.
    http://www.kismeta.com/extract.html
  • Data Quality Methods. (1996).
    http://database.ittoolbox.com/browse.asp?c=DBPeerPublishing&r=http%3A%2F%2Fwww%2Ekismeta%2Ecom%2Fcleand1%2Ehtml
  • Dubois, L. (2002). Ten Critical Factors for Successful Enterprise-Wide Data
    Quality. CRMguru.com. http://www.crmguru.com/features/2002a/0418ld.html
  • Droogendyk, W., & Harschnitz, L. Successful Business Intelligence Systems:
    Improving Information Quality with the SAS System.
    http://www2.sas.com/proceedings/sugi22/POSTERS/PAPER243.PDF
  • Duncan, K., & Gentry, J. (2002). When Should You Clean Warehoused Data?
    Enterprise Systems. http://www.esj.com/columns/article.asp?EditorialsID=18
  • Eckerson, W. (2003). The Evolution of ETL. Business Intelligence Journal,
    8(4).
  • Oracle 9i Data Warehousing Guide, Release 2 (9.2).
    http://download-east.oracle.com/docs/cd/B10501_01/server.920/a96520.pdf
  • Oracle 9i DS Daily Feature. (2001). Advanced Data Cleansing with Oracle 9i
    Warehouse Builder. Oracle Technology Network.
    http://otn.oracle.com/products/ids/daily/jul01.html
  • Oracle 9i Warehouse Builder, Integrated Data Quality. (2003).
    http://otn.oracle.com/products/warehouse/pdf/Warehouse_92_data_quality_whitepaper.pdf
  • Peterson, T. (2000). Data Scrubbing. Computer World.
    http://www.computerworld.com/databasetopics/data/story/0,10801,78230,00.html
  • Shahzad, M. (2004). Data Warehousing with Oracle. Oracular.com.
    http://www.oracular.com/white_paper_pdfs/DataWarehousingwithOracle.pdf
  • White, Colin. Data Warehousing: Cleaning and Transforming Data. InfoDB,
    10(6), pp. 11-12. http://www.evaltech.com/wpapers/dwcleansing.pdf