Dirty Data - Can You Afford It? - PowerPoint PPT Presentation

About This Presentation
Title:

Dirty Data - Can You Afford It?

Description:

'Too many wrong mistakes' Yogi Berra. Nonidentical duplicates ... A Sneak Peek at Match Codes. Using Match Codes. Prepare the data. Create multiple match codes ... – PowerPoint PPT presentation

Number of Views:24
Avg rating:3.0/5.0
Slides: 22
Provided by: FaronKi
Learn more at: https://texas-air.org
Category:
Tags: afford | data | dirty

less

Transcript and Presenter's Notes

Title: Dirty Data - Can You Afford It?


1
Dirty Data - Can You Afford It?
  • Faron Kincheloe, Baylor University

2
Types of Dirty Data
  • Too many wrong mistakes Yogi Berra
  • Nonidentical duplicates (names addresses)
  • Missing data (gender)
  • Non-standard entities (church names)

3
Identity Crisis
Ever Get 2 of the Same Piece of Mail?
Laura
Lauren
  • 2 Girls
  • 5 Viewbooks
  • Daughters of Baylor administrators

4
Lauren Laura
5
Duplication Dirt Devils
  • Character recognition
  • Misspelling
  • Marriage/Divorce
  • Middle name preference
  • Nicknames
  • First last name reversal
  • Electronic downloads
  • Delimiters in data
  • Variable field lengths

6
How Bad Was the Problem?
  • 1.15 725 out of 62,000 (1450 pairs)
  • 2175 at 3 per viewbook
  • 3 yield on 725 prospects 21 students
  • 15,000 per year per student not enrolled
  • 315,000 upper limit
  • Misapplied data
  • Lost credibility

7
The Little Brwon Chruch
8
Top Ten List
9
Wheres My Church?
How many ways can you say, First Baptist?
10
Let Me Count the Ways

11
Farons New Top Ten List
12
Its Not My Mess!!!
  • Why should I clean it up?
  • Overall financial impact
  • Data expertise
  • Data Mine instead of Not Mine
  • Improved accuracy
  • Partner with data owners

13
Cleaning Tools
  • DataFlux dfPower Suite (GUI)
  • SAS Data Quality Server (Code module)

14
Data Knowledge Definitions
  • ACCOUNT NUMBER
  • ADDRESS
  • CITY
  • DATE
  • E-MAIL
  • NAME
  • ORGANIZATION
  • PHONE
  • STATE
  • TEXT
  • ZIP

15
Cleaning Functions
  • PARSE Creates delimited text string
  • GENDER Estimates gender based on name
  • MATCH Creates index for matching
  • SCHEME Standardizes data

16
Parsing
FULL_NAME parsename
Smith Thomas //Thomas////Smith////
Thomas Smith //Thomas////Smith////
Mr. Christopher McDougal Mr.//Christopher////McDougal////
Christopher Mc Dougal Jr //Christopher////Mc Dougal//Jr//
Ryan Dawson //Ryan////Dawson////
Dawson, C. Ryan //C.//Ryan//Dawson////
Brandon Ledbetter //Brandon////Ledbetter////
Brandon Ledbtt Er //Brandon//Ledbtt//Er////
Dr. Walt Wagner Sr., EdD Dr.//Walt////Wagner//Sr.//EdD
17
Gender
FULL_NAME parsename gender
C. Luck //C.////Luck//// U
C. Matthew Luck //C.//Matthew//Luck//// M
Lee Fanning //Lee////Fanning//// U
Lee Ellen Fanning //Lee//Ellen//Fanning//// F
Eli Beth Eizner //Eli//Beth//Eizner//// U
Elizabeth Eizner //Elizabeth////Eizner//// F
Jesse Merrimann //Jesse////Merrimann//// U
Jesse Herrmann //Jesse////Herrmann//// U
18
A Sneak Peek at Match Codes
fullname MC_name MC_name95
1 Thomas Weldon Johnson CB4B CB4BB
2 Thomas Johnson CB4B CB4BB
3 Tom Johnson CB4B CB4BB
4 Johnson, Thomas Weldon CB4B CB4BB
5 Johnson, Theodore CB48 CB4B87
19
Using Match Codes
  • Prepare the data
  • Create multiple match codes
  • Create groups to target specific matches
  • Merge groups together
  • Remove repeated rows and sort by clusters
  • Print list for cleanup
  • Mark records for future match tests

20
Scheming Against the Data
  • Create scheme
  • Church Names
  • Cities
  • Customize finalize schemes
  • Apply schemes to data
  • Use SAS code to override exceptions
  • Compare with original entries
  • Update student information system
  • Automate the process

21
Questions?
Faron_Kincheloe_at_baylor.edu
Write a Comment
User Comments (0)
About PowerShow.com