Data Cleaning and Transformation - PowerPoint PPT Presentation

1 / 12
About This Presentation
Title:

Data Cleaning and Transformation

Description:

Data Cleaning and Transformation. Playing in the Mud ... Get it into another system (e.g. ERP) ... http://gnosis.cx/publish/programming/regular_expressions.html ... – PowerPoint PPT presentation

Number of Views:80
Avg rating:3.0/5.0
Slides: 13
Provided by: marki152
Category:

less

Transcript and Presenter's Notes

Title: Data Cleaning and Transformation


1
Data Cleaning and Transformation
  • Playing in the Mud

2
The Many Roles of Knowledge Workers
Brilliant IS idea
3
The Gist of the Problem
  • Getting data out of some system to
  • Analyze it (e.g. Excel, Access, stats package)
  • Get it into another system (e.g. ERP)
  • Smart manipulation of electronic reports with
    embedded data
  • Dont want to do it manually (why?)

4
A former student describing new job with major
consulting firm
  • Lots of modeling, muddy data problems, and
    working with OLAP tools and data warehousing. I
    know the muddy data area was a particular area of
    interest to you and it seems that it is a really
    BIG issue for many businesses - actually bigger
    than I imagined.

5
ACD Report Example
Report Header
Date
Split
Blank lines
Data!
Totals Lines
Next report
6
Why Talk About This?
  • Examples
  • Deloitte Touche Data Quality and Integrity
  • Oracle ERP project, Help Desk description field
    mess
  • Ptube log files
  • Many MIS 446/646 projects
  • Huge amount of time wasted doing manual
    processing
  • Dirty data is a big problem
  • http//www.computerworld.com/databasetopics/data/s
    tory/0,10801,78260,00.html
  • Really useful spreadsheet and database skills
    (and mindset)
  • Example from the EXCEL-L Developers listserv
  • Its fun to play in the mud sometimes
  • Commercial products such as Monarch Data Pump,
    Data Integrator, TextPipe, are available if you
    need to do this stuff routinely. Doing it here
    will allow you to become power users of such
    products quickly.
  • Next few slides are screen shots from Content
    Extractor, an earlier version of Data Integrator
    (and formerly called Cambio)

7
Useful resources
  • http//www.asciitable.com/
  • Dorian Pyles website author of the book Data
    Preparation for Data Mining
  • http//www.modelandmine.com/
  • An (Informal) Taxonomy of Data Warehouse Data
    Errors
  • http//www.dwinfocenter.org/errors.html
  • MIT Total Data Quality Management
  • http//web.mit.edu/tdqm/www/index.shtml
  • Dealing with Dirty Data
  • http//www.dbmsmag.com/9609d14.html
  • Regular expressions (for wizards)
  • Example Help Desk description field mess
  • Date1 desc1 date2 desc2 date3 desc3
  • http//www.regular-expressions.info/
  • http//gnosis.cx/publish/programming/regular_expre
    ssions.html

8
(No Transcript)
9
The date is always in the same spot so we can use
Fixed Column to get it out of the line weve
defined as DAY.
10
Use Floating Tags for data that could appear in
different positions on a line. Ex we dont know
how long the split name will be, but we do know
that it will end with a period.
For data appearing in headers/footers, (e.g.
Split Name and ), we tell Content Extractor to
Propogate Field Contents so we get the name and
number with each detail line.
11
Repeating day, split num and split name
Check out our progress.
12
Export to a wide variety of file types.
ContExTest-acddata.asc
Heres an export I did to a delimited ASCII file.
Write a Comment
User Comments (0)
About PowerShow.com