Title: Data%20Cleaning
1Data Mining Process Source CRISP-DM (SPSS.com
website)
2Data Preparation
- Data Aggregation
- Turning raw data into variables at the right
level of aggregation for analysis. - New Variable Creation
- Typically creating ratio variables from existing
ones. - Eg. Number of Crimes/ Population, Number of Gold
Medals / Population - Data Cleaning
- includes preliminary analysis to find
- Missing Data
- Outliers
- Misclassification, Incorrect coding of values
- Variable Transformation
- Creating dummies
- Creating quadratic, log, or other transforms
- Creating interaction terms
3Data Cleaning
- MIS Issues
- (Source Article by Ralph Kimball)
- Analyst Issues
4MIS Issues
- Elementizing (Parsing)
- Standardizing
- Verifying
- Matching,
- Householding
- Documenting
5Elementising
- Ralph B and Julianne Kimball Trustees for Kimball
Fred CSte. 11613150 Hiway 9Box 1234 Boulder
CrkColo 95006
6Addressee First Name(1) RalphAddressee Middle
Initial(1) BAddressee Last Name(1)
KimballAddressee First Name(2)
JulianneAddressee Last Name(2)
KimballAddressee Relationship Trustees
forRelationship Person First Name
FredRelationship Person Middle Name
CRelationship Person Last Name KimballStreet
Address Number 13150Street Name Hiway 9Suite
Number 116Post Office Box Number 1234City
Boulder CrkState ColoFive Digit Zip 95006
7Standardizing
- Ste suite
- Hiway 9 Highway 9
- Other example -
- Grade D Distinction in Australia
8Verification
- Zip code 95006 is CA, not Colorado
9Matching/Householding
- Match record with other customer records
containing Ralph and Julianne Kimball - Establish that they are part of the same household
10Analyst Issues
- Physical data problems
- Data Dictionaries
- Validation (Frequencies)
- Missing Data
- The zero value problem
- Inappropriate (Future) data for modeling
- Unavailable data
11Physical
- Cannot access data
- ASCII vs EBCDIC
- On a medium that you cant use (certain type of
tape, for instance)
12Data Dictionaries
- What are the fields?
- Where are they located?
- What format are they stored in?
13Missing Data
- Ignore
- Find the right values if you can
- Use Average for that variable
- Replace with number that matches its
characteristics - (What do the missing people look like in terms
of the dependent? Who else looks like that?
14The zero problem
- What does 0 mean?
- If Number of Revolving Bankcard Trades Currently
Past Due 0, what does that mean?
15 of Bank Rev. Trds Currently Past Due
Cumulative
Cumulative BRPSTD Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0
13485 96.0 13485 96.0 1
486 3.5 13971
99.5 2 57
0.4 14028 99.9 3
12 0.1 14040 100.0 4
2 0.0
14042 100.0
16 of Trds Cumulative
Cumulative TRADES Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0
1606 11.4 1606 11.4
1 1080 7.7
2686 19.1 2
1056 7.5 3742 26.6
3 1007 7.2 4749
33.8 4 949
6.8 5698 40.6 5
911 6.5 6609
47.1 6 849
6.0 7458 53.1 7
793 5.6 8251 58.8
8 682 4.9
8933 63.6 9
622 4.4 9555 68.0
10 4487 32.0
14042 100.0
17 of Bank Rev. Trds
Cumulative Cumulative
BRTRDS Frequency Percent Frequency
Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ INQS. PR ONLY
64 0.5 64 0.5
PR ONLY 22 0.2
86 0.6 INQS. ONLY
960 6.8 1046 7.4 NO RECORD
560 4.0 1606
11.4 0 6183
44.0 7789 55.5 1
2616 18.6 10405 74.1
2 1427 10.2
11832 84.3 3
831 5.9 12663 90.2 4
496 3.5 13159
93.7 5 287
2.0 13446 95.8 6
188 1.3 13634 97.1 7
142 1.0
13776 98.1 8
92 0.7 13868 98.8 9
60 0.4 13928
99.2 10 114
0.8 14042 100.0
18 of Bank Rev. Trds Currently Past Due
Cumulative
Cumulative BRPSTD Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ NO TRADES OF THIS TYPE
6183 44.0 6183 44.0 INQS. PR
ONLY 64 0.5 6247
44.5 PR ONLY 22
0.2 6269 44.6 INQS. ONLY
960 6.8 7229 51.5 NO
RECORD 560 4.0
7789 55.5 MISSING
3475 24.7 11264 80.2 0
2221 15.8 13485
96.0 1 486
3.5 13971 99.5 2
57 0.4 14028 99.9 3
12 0.1
14040 100.0 4
2 0.0 14042 100.0
19Inappropriate Data Used
- Future data used to build great looking model.
- Used payments till month end instead of payments
until cycle date.
20Unavailable Data
- Data on Rejected Applicants
- Would they have been Good or Bad had they been
accepted? - Use Reject Inferencing techniques.