Data Quality - PowerPoint PPT Presentation

About This Presentation
Title:

Data Quality

Description:

Critical component of data quality applications. Linkage involves finding a link ... Matching can be a combination of exact matching on particular fields, to ... – PowerPoint PPT presentation

Slides: 20
Provided by: davidl116
Learn more at: https://cs.nyu.edu
Category:
Tags: data | forsee | quality


Transcript and Presenter's Notes

Title: Data Quality


1
Data Quality
  • Class 10

2
Agenda
  • Review of Last week
  • Cleansing Applications
  • Guest Speaker

3
Review of Similarity
  • We use measures of similarity to establish
    measures and thresholds for linkage
  • Edit Distance
  • Phonetic Similarity
  • Ngrams
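
As a reminder of what one of these measures looks like in practice, here is a minimal sketch of edit distance (Levenshtein distance), together with one possible way to turn it into a similarity score. The function names and the choice to normalize by the longer string's length are assumptions made for illustration, not something taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    position of the longer string would have to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For example, `edit_distance("kitten", "sitting")` is 3, since two substitutions and one insertion are needed.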

4
Linkage
  • Critical component of data quality applications
  • Linkage involves finding a link between a pair of
    records, either through an exact match or through
    an approximate match
  • Linkage is useful for:
      • Deduplification
      • Merge/Purge
      • Enhancement
      • Householding

5
Linkage 2
  • Two records are linked when they match with
    enough weighted similarity
  • Matching can combine exact matching on particular
    fields with approximate matching to some degree
    of similarity
  • For example two customer records can be linked
    via an account number, if account numbers are
    uniquely assigned to customers

6
Thresholding
  • We have functions to determine similarity
  • Tune these functions to return a value between 0
    and 1 (equivalently, between 0% and 100%)
  • 0: absolutely no match
  • 1: absolute match

7
Thresholding 2
  • For any pair of values, we can apply the
    similarity function and get a score
  • For different kinds of data, we can set a minimum
    threshold, above which the two values are said to
    match
  • For example, for an n-gram match, we can set the
    threshold at 0.75 (75%)
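
The n-gram example can be sketched as follows. This is an illustration under stated assumptions: the slides do not say which n-gram score is used, so the sketch uses character bigrams and the Dice coefficient, and it treats a score equal to the threshold as a match. The names `ngrams`, `ngram_similarity`, and `is_match` are hypothetical.

```python
def ngrams(text: str, n: int = 2) -> set:
    """Set of character n-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets (an assumed choice): 0.0 when no
    n-grams are shared, 1.0 when the two sets are identical."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Both values are shorter than n: fall back to an exact test.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


NGRAM_THRESHOLD = 0.75  # the threshold from the slide, on a 0-to-1 scale


def is_match(a: str, b: str, threshold: float = NGRAM_THRESHOLD) -> bool:
    """Two values are said to match when the score reaches the threshold."""
    return ngram_similarity(a, b) >= threshold
```

With these definitions, "Jonathan" and "Jonathon" share 5 bigrams out of 7 and 6 distinct bigrams respectively, giving a score of 10/13 (about 0.77), which clears the 0.75 threshold.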

8
Thresholding 3
  • When comparing two records, we apply the similarity
    functions in a pairwise manner to each of the
    important attribute value pairs
  • We assign some weight to each attribute value
    pair score
  • The total similarity score for each pair of records
    is the sum of the individual attribute value pair
    scores, each adjusted by its weight
  • We can assign an overall threshold indicating a
    match
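
The weighted sum described above might look like the sketch below. The attribute names, the weights, and the decision to divide by the total weight (so that the record score stays between 0 and 1) are all assumptions for illustration. The standard-library `difflib.SequenceMatcher` stands in for whatever approximate similarity function an application would actually use.

```python
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """0/1 exact match test."""
    return 1.0 if a == b else 0.0


def approximate(a: str, b: str) -> float:
    """Stand-in approximate similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical configuration: attribute -> (similarity function, weight).
ATTRIBUTES = {
    "name": (approximate, 0.5),
    "street": (approximate, 0.3),
    "zip": (exact, 0.2),
}


def record_similarity(rec_a: dict, rec_b: dict, attributes=ATTRIBUTES) -> float:
    """Weighted sum of per-attribute scores, divided by the total weight."""
    total_weight = sum(weight for _, weight in attributes.values())
    score = sum(weight * sim(rec_a[field], rec_b[field])
                for field, (sim, weight) in attributes.items())
    return score / total_weight
```

Dividing by the total weight is a design choice: it lets the overall threshold be expressed on the same 0-to-1 scale as the individual similarity functions, whatever weights are chosen.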

9
Thresholds and Matches
  • We actually assign two thresholds, to partition
    scores into three groups:
      • Definite matches
      • User review
      • Definite no-matches
  • Scores above the high threshold are definite
    matches
  • Scores between low and high thresholds are user
    review matches
  • All others are not matches
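
A minimal sketch of the three-way partition follows. The threshold values are placeholders, and the handling of scores that fall exactly on a threshold is an assumption, since the slide does not specify it.

```python
LOW_THRESHOLD = 0.60   # hypothetical value
HIGH_THRESHOLD = 0.85  # hypothetical value


def classify(score: float,
             low: float = LOW_THRESHOLD,
             high: float = HIGH_THRESHOLD) -> str:
    """Partition a record-pair score into one of three groups."""
    if score > high:
        return "match"      # definite match
    if score >= low:
        return "review"     # sent to a user for review
    return "no-match"       # definite no-match
```

Moving the two thresholds closer together reduces the manual review workload at the cost of more automatic decisions that may be wrong.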

10
Deduplification
  • Duplicates are sets of records that represent the
    same entity
  • Duplicate elimination is the process of finding
    records that belong to the same equivalence class
    based on similarity above a specified threshold
  • When duplicates are found, one record is created
    to replace all the duplicates
  • That record is composed of the best data
    gleaned from all equivalence class members
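
Building the replacement record could be sketched like this. The slides do not define what counts as the "best" data, so the rule used here (for each field, keep the longest non-empty value found among the duplicates) is purely a hypothetical stand-in; a real application would more likely use source reliability, recency, or validation rules.

```python
def merge_duplicates(records: list) -> dict:
    """Collapse one equivalence class of duplicate records into a single
    record, choosing a value per field with a simple assumed rule."""
    # Collect field names in first-seen order across all duplicates.
    fields = []
    for record in records:
        for field in record:
            if field not in fields:
                fields.append(field)

    merged = {}
    for field in fields:
        # Missing or None values count as empty strings.
        candidates = [record.get(field) or "" for record in records]
        # Assumed "best data" rule: the longest value wins; on a tie the
        # earliest record's value is kept.
        merged[field] = max(candidates, key=len)
    return merged
```

For instance, merging `{"name": "J. Smith", "phone": ""}` with `{"name": "John Smith", "phone": "555-0100"}` yields a record with the fuller name and the only available phone number.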

11
Merge/Purge
  • Similar to duplicate elimination
  • Application used when merging two or more
    different databases
  • Goal: find all the records associated with the
    same entity
  • Example: when two banks merge, find all accounts
    owned by the same person in both banks

12
Enhancement
  • We can enhance data by merging it with other data
    sets
  • Linkage may be based on profile information
    extracted from each record

13
Householding
  • We link on address as well as some permutation of
    the entity name
  • Look to establish a location match and some
    relation between entities at that location

14
Naive Algorithm
  • Goal: find all possible matches between any pair
    of records
  • Approach: compute a pairwise similarity score for
    all record pairs
  • Downside: O(n^2)
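
The naive approach can be sketched in a few lines. The function names are hypothetical, and `similarity` is any caller-supplied function returning a score between 0 and 1.

```python
from itertools import combinations


def naive_linkage(records, similarity, threshold):
    """Score every unordered pair of records and return the index pairs
    whose score reaches the threshold."""
    matches = []
    for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
        if similarity(rec_a, rec_b) >= threshold:
            matches.append((i, j))
    return matches


def pair_count(n: int) -> int:
    """Number of comparisons the naive approach performs: n(n-1)/2."""
    return n * (n - 1) // 2
```

The quadratic growth is what makes this impractical: `pair_count(1000)` is 499,500 comparisons, and `pair_count(1_000_000)` is roughly 5 x 10^11.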

15
Improvements
  • Desire to reduce the number of candidates for
    pairwise similarity testing
  • We can use fast matching for fixed pivot values
    when merging data sets
  • We use a concept called blocking to reduce the
    search set

16
Fast Matching for Merging
  • We have looked at Bloom filters for fast matching
  • We can load all (recordID, attribute value) tuples
    from the first data set into the Bloom filter in
    O(n) time
  • We can then test each of the values from the
    second set to see if the pivot matches in the
    first set
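
A minimal sketch of this idea follows, assuming a simple Bloom filter built on SHA-256 with double hashing. The class, its default sizes, and the `pivot_candidates` helper are illustrative and are not taken from the course material. The sketch stores only the pivot values, and because a Bloom filter can return false positives (but never false negatives), any hit is only a candidate that still has to be confirmed against the first data set.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per value."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Derive all positions from one digest via double hashing.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def pivot_candidates(first_set, second_set, pivot):
    """Load pivot values from the first set (one pass), then keep the
    records of the second set whose pivot value may occur in the first."""
    bloom = BloomFilter()
    for record in first_set:
        bloom.add(record[pivot])
    return [record for record in second_set if bloom.might_contain(record[pivot])]
```

Both passes are linear in the size of their data set, which is the point of the technique: the expensive pairwise scoring is reserved for the records that survive the filter.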

17
Blocking
  • Goal: reduce the number of match candidates by using
    some form of compression on the records to be
    linked
  • Example: phonetic encodings
  • Example: limit by fixing one of the attributes
  • Example: find a pivot attribute and use that for
    affinity
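
As an illustration of the phonetic encoding example, here is a sketch of American Soundex, one well-known phonetic code. The slides do not name a specific encoding, so the choice of Soundex is an assumption. Names that receive the same code would be placed in the same block and compared only with each other.

```python
# Consonant groups and their Soundex digits.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Four-character Soundex code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]  # pad with zeros, truncate to four characters
```

Under this encoding "Robert" and "Rupert" both map to R163, and "Smith" and "Smyth" both map to S530, so each pair would land in the same block.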

18
Blocking 2
  • Example: we want to perform householding on a
    mailing list
  • Block by ZIP code, since we don't expect to find
    members of the same household living in different
    locations
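
Blocking by ZIP code can be sketched as a grouping step followed by pairwise comparison within each group only. The record layout (dictionaries with a "zip" field) and the function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records, key):
    """Group records by the value of the blocking attribute."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(records, key="zip"):
    """Yield only the record pairs that share a blocking value."""
    for block in block_by(records, key).values():
        yield from combinations(block, 2)
```

With four records split evenly across two ZIP codes, this yields 2 candidate pairs instead of the 6 that an all-pairs comparison would score. The trade-off is that a record with a mistyped ZIP code lands in the wrong block and can never be matched.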

19
Linkage Algorithms
  • All linkage algorithms make use of these ideas:
  • A blocking mechanism
      • We must choose based on the data available and
        what makes sense for the application
  • Similarity functions
      • Every data type and data domain should have an
        associated similarity function, even if it is a
        0/1 exact match test
  • Weights for the similarity functions
      • This requires more insight into the problem, to
        see how much each attribute's score should weigh
        in the overall score