Title: Data Cleaning and Transformation
1Data Cleaning and Transformation
2The Many Roles of Knowledge Workers
Brilliant IS idea
3The Gist of the Problem
- Getting data out of some system to
- Analyze it (e.g. Excel, Access, stats package)
- Get it into another system (e.g. ERP)
- Smart manipulation of electronic reports with
embedded data - Dont want to do it manually (why?)
4A former student describing new job with major
consulting firm
- Lots of modeling, muddy data problems, and
working with OLAP tools and data warehousing. I
know the muddy data area was a particular area of
interest to you and it seems that it is a really
BIG issue for many businesses - actually bigger
than I imagined.
5ACD Report Example
Report Header
Date
Split
Blank lines
Data!
Totals Lines
Next report
6Why Talk About This?
- Examples
- Deloitte Touche Data Quality and Integrity
- Oracle ERP project, Help Desk description field
mess - Ptube log files
- Many MIS 446/646 projects
- Huge amount of time wasted doing manual
processing - Dirty data is a big problem
- http//www.computerworld.com/databasetopics/data/s
tory/0,10801,78260,00.html - Really useful spreadsheet and database skills
(and mindset) - Example from the EXCEL-L Developers listserv
- Its fun to play in the mud sometimes
- Commercial products such as Monarch Data Pump,
Data Integrator, TextPipe, are available if you
need to do this stuff routinely. Doing it here
will allow you to become power users of such
products quickly. - Next few slides are screen shots from Content
Extractor, an earlier version of Data Integrator
(and formerly called Cambio)
7Useful resources
- http//www.asciitable.com/
- Dorian Pyles website author of the book Data
Preparation for Data Mining - http//www.modelandmine.com/
- An (Informal) Taxonomy of Data Warehouse Data
Errors - http//www.dwinfocenter.org/errors.html
- MIT Total Data Quality Management
- http//web.mit.edu/tdqm/www/index.shtml
- Dealing with Dirty Data
- http//www.dbmsmag.com/9609d14.html
- Regular expressions (for wizards)
- Example Help Desk description field mess
- Date1 desc1 date2 desc2 date3 desc3
- http//www.regular-expressions.info/
- http//gnosis.cx/publish/programming/regular_expre
ssions.html
8(No Transcript)
9The date is always in the same spot so we can use
Fixed Column to get it out of the line weve
defined as DAY.
10Use Floating Tags for data that could appear in
different positions on a line. Ex we dont know
how long the split name will be, but we do know
that it will end with a period.
For data appearing in headers/footers, (e.g.
Split Name and ), we tell Content Extractor to
Propogate Field Contents so we get the name and
number with each detail line.
11Repeating day, split num and split name
Check out our progress.
12Export to a wide variety of file types.
ContExTest-acddata.asc
Heres an export I did to a delimited ASCII file.