Title: Data Cleaning and Transformation
1Data Cleaning and Transformation
2The Many Roles of Knowledge Workers
Brilliant IS idea
3The Gist of the Problem
- Getting data out of some system to
- Analyze it (e.g. Excel, Access, stats package)
- Get it into another system (e.g. ERP)
- Smart manipulation of electronic reports with
embedded data - Dont want to do it manually (why?)
4A former student describing new job with major
consulting firm
- Lots of modeling, muddy data problems, and
working with OLAP tools and data warehousing. I
know the muddy data area was a particular area of
interest to you and it seems that it is a really
BIG issue for many businesses - actually bigger
than I imagined.
5ACD Report Example
Report Header
Date
Split
Blank lines
Data!
Totals Lines
Next report
6Why Talk About This?
- Very common problem in business
- insert examples
- Huge amount of time wasted doing manual
processing - Really useful spreadsheet and database skills
(and mindset) - Example from the EXCEL-L Developers listserv
- Its fun to play in the mud sometimes
- Commercial products such as Content Extractor
(formerly Cambio) (www.datajunction.com) are
available if you need to do this stuff routinely.
Doing it here will allow you to become power
users of such products quickly. - Next few slides are screen shots from Content
Extractor
7This is the main window where one defines style
definitions for the different types of lines in
the data file.
8The date is always in the same spot so we can use
Fixed Column to get it out of the line weve
defined as DAY.
9Use Floating Tags for data that could appear in
different positions on a line. Ex we dont know
how long the split name will be, but we do know
that it will end with a period.
For data appearing in headers/footers, (e.g.
Split Name and ), we tell Content Extractor to
Propogate Field Contents so we get the name and
number with each detail line.
10Repeating day, split num and split name
Check out our progress.
11Export to a wide variety of file types.
ContExTest-acddata.asc
Heres an export I did to a delimited ASCII file.