Data Quality

About This Presentation

Title:

Data Quality

Description:

Critical component of data quality applications. Linkage involves finding a link ... Matching can be a combination of exact matching on particular fields, to ... – PowerPoint PPT presentation

Number of Views:13

Avg rating:3.0/5.0

Slides: 20

Provided by: davidl116

Learn more at: https://cs.nyu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Data Quality

1
Data Quality

Class 10

2
Agenda

Review of Last week
Cleansing Applications
Guest Speaker

3
Review of Similarity

We use measures of similarity to establish
measures and thresholds for linkage
Edit Distance
Phonetic Similarity
Ngrams

4
Linkage

Critical component of data quality applications
Linkage involves finding a link between a pair of
records, either through an exact match or through
an approximate match
Linkage is useful for
Deduplification
Merge/Purge
Enhancement
Householding

5
Linkage 2

Two records are linked when they match with
enough weighted similarity
Matching can be a combination of exact matching
on particular fields, to approximat ematching
with some degree of similarity
For example two customer records can be linked
via an account number, if account numbers are
uniquely assigned to customers

6
Thresholding

We have functions to determine similarity
Tune these functions to return a value between 0
and 1 (i.e., 0 and 100)
0 absolutely no match
1 absolute match

7
Threhsolding 2

For any pair of values, we can apply the
similarity function and get a score
For different kinds of data, we can set a minimum
threshold, above which the two values are said to
match
For example, for an n-gram match, we can set the
threshold at 75

8
Thresholding 3

When comparing 2 records, we apply the similairty
functions in a pairwise manner to each of the
important attribute value pairs
We assign some weight to each attribute value
pair score
Total similarity score for each pair of records
is the sum of individual attribute value pairs
scores, adjusted by weight
We can assign an overall threshold indicating a
match

9
Thresholds and Matches

We actually assign two thresholds, to partition
scores into three groups
Definite matches
User review
Definite no-matches
Scores above the high threshold are definite
matches
Scores between low and high thresholds are user
review matches
All others are not matches

10
Deduplification

Duplicates are sets of records that represent the
same entity
Duplicate elimination is the process of finding
records that belong to the same equivalence class
based on similarity above a specified threshold
When duplicates are found, one record is created
to replace all the suplicates
That record is composed of the best data
gleaned from all equivalence class members

11
Merge/Purge

Similar to duplicate elimination
Application used when merging two or more
different database
Goal find all the records associated with the
same entity
Example when two banks merge, find all accounts
owned by the same person in both banks

12
Enhancement

We can enhance data by merging it with other data
sets
Linkage may be based on profile information
extracted from each record

13
Householding

We link on address as well as some permutation of
the entity name
Look to establish a location match and some
relation between entities at that location

14
Naive Algorithm

Goal find all possible matches between any pair
of records
Approach Perform a pairwise similarity score for
all record pairs
Downside O(n2)

15
Improvements

Desire to reduce the number of candidates for
pairwise similarity testing
We can use fast matching for fixed pivot values
when merging data sets
We use a concept called blocking to reduce the
search set

16
Fast Matching for Merging

We have looked at Bloom filters for fast matching
We can load all (recordID,attribute value) tuples
from the first data set into the Bloom filter
O(n)
We can then test each of the values from the
second set to see if the pivot matches in the
first set

17
Blocking

Goal reduce number of match candidates by using
some form of compression on the records to be
linked
Example phonetic encodings
Example limit by fixing one of the attributes
Example find a pivot attribute and use that for
affinity

18
Blocking 2

Example if we want to perform householding on a
mailing list
Block by ZIP code, since we dont expect to find
members of the same household living in different
locations

19
Linkage Algorithms

All linkage algorithms make use of these ideas
A blocking mechanism
We must choose based on the data available and
what makes sense for the application
Similarity functions
Every data type and data domain should have an
associated similarity function, even if it is a
0/1 exact match test
Weights for the similarity functions
This requires more insight into the problem, to
see how each attributes scores should weigh for
the overall score