Title: Dirty Data - Can You Afford It?
1Dirty Data - Can You Afford It?
- Faron Kincheloe, Baylor University
2Types of Dirty Data
- Too many wrong mistakes Yogi Berra
- Nonidentical duplicates (names addresses)
- Missing data (gender)
- Non-standard entities (church names)
3Identity Crisis
Ever Get 2 of the Same Piece of Mail?
Laura
Lauren
- 2 Girls
- 5 Viewbooks
- Daughters of Baylor administrators
4Lauren Laura
5Duplication Dirt Devils
- Character recognition
- Misspelling
- Marriage/Divorce
- Middle name preference
- Nicknames
- First last name reversal
- Electronic downloads
- Delimiters in data
- Variable field lengths
6How Bad Was the Problem?
- 1.15 725 out of 62,000 (1450 pairs)
- 2175 at 3 per viewbook
- 3 yield on 725 prospects 21 students
- 15,000 per year per student not enrolled
- 315,000 upper limit
- Misapplied data
- Lost credibility
7The Little Brwon Chruch
8Top Ten List
9Wheres My Church?
How many ways can you say, First Baptist?
10Let Me Count the Ways
11Farons New Top Ten List
12Its Not My Mess!!!
- Why should I clean it up?
- Overall financial impact
- Data expertise
- Data Mine instead of Not Mine
- Improved accuracy
- Partner with data owners
13Cleaning Tools
- DataFlux dfPower Suite (GUI)
- SAS Data Quality Server (Code module)
14Data Knowledge Definitions
- ACCOUNT NUMBER
- ADDRESS
- CITY
- DATE
- E-MAIL
- NAME
- ORGANIZATION
- PHONE
- STATE
- TEXT
- ZIP
15Cleaning Functions
- PARSE Creates delimited text string
- GENDER Estimates gender based on name
- MATCH Creates index for matching
- SCHEME Standardizes data
16Parsing
FULL_NAME parsename
Smith Thomas //Thomas////Smith////
Thomas Smith //Thomas////Smith////
Mr. Christopher McDougal Mr.//Christopher////McDougal////
Christopher Mc Dougal Jr //Christopher////Mc Dougal//Jr//
Ryan Dawson //Ryan////Dawson////
Dawson, C. Ryan //C.//Ryan//Dawson////
Brandon Ledbetter //Brandon////Ledbetter////
Brandon Ledbtt Er //Brandon//Ledbtt//Er////
Dr. Walt Wagner Sr., EdD Dr.//Walt////Wagner//Sr.//EdD
17Gender
FULL_NAME parsename gender
C. Luck //C.////Luck//// U
C. Matthew Luck //C.//Matthew//Luck//// M
Lee Fanning //Lee////Fanning//// U
Lee Ellen Fanning //Lee//Ellen//Fanning//// F
Eli Beth Eizner //Eli//Beth//Eizner//// U
Elizabeth Eizner //Elizabeth////Eizner//// F
Jesse Merrimann //Jesse////Merrimann//// U
Jesse Herrmann //Jesse////Herrmann//// U
18A Sneak Peek at Match Codes
fullname MC_name MC_name95
1 Thomas Weldon Johnson CB4B CB4BB
2 Thomas Johnson CB4B CB4BB
3 Tom Johnson CB4B CB4BB
4 Johnson, Thomas Weldon CB4B CB4BB
5 Johnson, Theodore CB48 CB4B87
19Using Match Codes
- Prepare the data
- Create multiple match codes
- Create groups to target specific matches
- Merge groups together
- Remove repeated rows and sort by clusters
- Print list for cleanup
- Mark records for future match tests
20Scheming Against the Data
- Create scheme
- Church Names
- Cities
- Customize finalize schemes
- Apply schemes to data
- Use SAS code to override exceptions
- Compare with original entries
- Update student information system
- Automate the process
21Questions?
Faron_Kincheloe_at_baylor.edu