Data%20Cleaning - PowerPoint PPT Presentation

About This Presentation
Title:

Data%20Cleaning

Description:

Street Address Number: 13150. Street Name: Hiway 9. Suite Number: 116 ... Zip code 95006 is CA, not Colorado. Matching/Householding ... – PowerPoint PPT presentation

Number of Views:29
Avg rating:3.0/5.0
Slides: 21
Provided by: acs5
Category:

less

Transcript and Presenter's Notes

Title: Data%20Cleaning


1
Data Mining Process Source CRISP-DM (SPSS.com
website)
2
Data Preparation
  • Data Aggregation
  • Turning raw data into variables at the right
    level of aggregation for analysis.
  • New Variable Creation
  • Typically creating ratio variables from existing
    ones.
  • Eg. Number of Crimes/ Population, Number of Gold
    Medals / Population
  • Data Cleaning
  • includes preliminary analysis to find
  • Missing Data
  • Outliers
  • Misclassification, Incorrect coding of values
  • Variable Transformation
  • Creating dummies
  • Creating quadratic, log, or other transforms
  • Creating interaction terms

3
Data Cleaning
  • MIS Issues
  • (Source Article by Ralph Kimball)
  • Analyst Issues

4
MIS Issues
  • Elementizing (Parsing)
  • Standardizing
  • Verifying
  • Matching,
  • Householding
  • Documenting

5
Elementising
  • Ralph B and Julianne Kimball Trustees for Kimball
    Fred CSte. 11613150 Hiway 9Box 1234 Boulder
    CrkColo 95006

6
Addressee First Name(1) RalphAddressee Middle
Initial(1) BAddressee Last Name(1)
KimballAddressee First Name(2)
JulianneAddressee Last Name(2)
KimballAddressee Relationship Trustees
forRelationship Person First Name
FredRelationship Person Middle Name
CRelationship Person Last Name KimballStreet
Address Number 13150Street Name Hiway 9Suite
Number 116Post Office Box Number 1234City
Boulder CrkState ColoFive Digit Zip 95006
7
Standardizing
  • Ste suite
  • Hiway 9 Highway 9
  • Other example -
  • Grade D Distinction in Australia

8
Verification
  • Zip code 95006 is CA, not Colorado

9
Matching/Householding
  • Match record with other customer records
    containing Ralph and Julianne Kimball
  • Establish that they are part of the same household

10
Analyst Issues
  • Physical data problems
  • Data Dictionaries
  • Validation (Frequencies)
  • Missing Data
  • The zero value problem
  • Inappropriate (Future) data for modeling
  • Unavailable data

11
Physical
  • Cannot access data
  • ASCII vs EBCDIC
  • On a medium that you cant use (certain type of
    tape, for instance)

12
Data Dictionaries
  • What are the fields?
  • Where are they located?
  • What format are they stored in?

13
Missing Data
  • Ignore
  • Find the right values if you can
  • Use Average for that variable
  • Replace with number that matches its
    characteristics
  • (What do the missing people look like in terms
    of the dependent? Who else looks like that?

14
The zero problem
  • What does 0 mean?
  • If Number of Revolving Bankcard Trades Currently
    Past Due 0, what does that mean?

15
of Bank Rev. Trds Currently Past Due
Cumulative
Cumulative BRPSTD Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0
13485 96.0 13485 96.0 1
486 3.5 13971
99.5 2 57
0.4 14028 99.9 3
12 0.1 14040 100.0 4
2 0.0
14042 100.0
16
of Trds Cumulative
Cumulative TRADES Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0
1606 11.4 1606 11.4
1 1080 7.7
2686 19.1 2
1056 7.5 3742 26.6
3 1007 7.2 4749
33.8 4 949
6.8 5698 40.6 5
911 6.5 6609
47.1 6 849
6.0 7458 53.1 7
793 5.6 8251 58.8
8 682 4.9
8933 63.6 9
622 4.4 9555 68.0
10 4487 32.0
14042 100.0
17
of Bank Rev. Trds
Cumulative Cumulative
BRTRDS Frequency Percent Frequency
Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ INQS. PR ONLY
64 0.5 64 0.5
PR ONLY 22 0.2
86 0.6 INQS. ONLY
960 6.8 1046 7.4 NO RECORD
560 4.0 1606
11.4 0 6183
44.0 7789 55.5 1
2616 18.6 10405 74.1
2 1427 10.2
11832 84.3 3
831 5.9 12663 90.2 4
496 3.5 13159
93.7 5 287
2.0 13446 95.8 6
188 1.3 13634 97.1 7
142 1.0
13776 98.1 8
92 0.7 13868 98.8 9
60 0.4 13928
99.2 10 114
0.8 14042 100.0
18
of Bank Rev. Trds Currently Past Due
Cumulative
Cumulative BRPSTD Frequency
Percent Frequency Percent
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ NO TRADES OF THIS TYPE
6183 44.0 6183 44.0 INQS. PR
ONLY 64 0.5 6247
44.5 PR ONLY 22
0.2 6269 44.6 INQS. ONLY
960 6.8 7229 51.5 NO
RECORD 560 4.0
7789 55.5 MISSING
3475 24.7 11264 80.2 0
2221 15.8 13485
96.0 1 486
3.5 13971 99.5 2
57 0.4 14028 99.9 3
12 0.1
14040 100.0 4
2 0.0 14042 100.0
19
Inappropriate Data Used
  • Future data used to build great looking model.
  • Used payments till month end instead of payments
    until cycle date.

20
Unavailable Data
  • Data on Rejected Applicants
  • Would they have been Good or Bad had they been
    accepted?
  • Use Reject Inferencing techniques.
Write a Comment
User Comments (0)
About PowerShow.com