Reasoning about Record Matching Rules

1
Reasoning about Record Matching Rules
  • Wenfei Fan (1, 2), Xibei Jia (1), Shuai Ma (1)
  • (1) University of Edinburgh, (2) Bell Labs
  • Jianzhong Li
  • Harbin Institute of Technology

2
Record matching
To identify tuples (from one or more unreliable relations) that refer to the same real-world object.

FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

The same person?

FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

Also known as: record linkage, entity resolution, data deduplication, merge/purge, ...
3
Why bother?
Data quality, data integration, payment card fraud detection, ...

Records for card holders:
FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

Fraud?

Records for transaction logs:
FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

World-wide losses in 2006: 4.84 billion (www.sas.com)
4
Nontrivial: a longstanding problem
  • Real-life data is often dirty: errors in the data sources
  • Data is often represented differently in different sources

FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

Comparing attributes pairwise via equality only does not work!
5
Matching rules (Hernández & Stolfo, 1995)
IF card[LN, address] = trans[LN, post] AND card[FN] and trans[FN] are similar, THEN identify the two tuples

card:
FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

Match?

trans:
FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

Accommodate errors in the data sources
6
A new class of dependencies for record matching
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card[tel] = trans[phn] → card[address] ⇌ trans[post]
Identifying attributes (not necessarily entire records), across sources

card (attribute list X):
FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

trans (attribute list Y):
FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

2(mn) configurations
What attributes to compare? How to compare them?
7
Deducing new dependencies from given ones
Given:
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card[tel] = trans[phn] → card[address] ⇌ trans[post]
Deduction:
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

card:
FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

trans:
FN   | LN    | post                    | phn     | when       | where | amount
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

(Figure labels: "Radically different", "Match")
Matched by the deduced rule, but NOT by the given ones!
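
The difference between the given rule and the deduced rule can be tried out on the sample tuples. The following is a minimal Python sketch, not the authors' implementation: the function names, the edit-distance threshold of 2, and the treatment of initials such as "M." are all assumptions made for illustration, since the slides do not fix a particular similarity operator here.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Assumed similarity test: matching initials, or small edit distance."""
    a, b = a.rstrip("."), b.rstrip(".")
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:  # an initial such as "M."
        return a[0] == b[0]
    return edit_distance(a, b) <= max_dist


def given_rule(c: dict, t: dict) -> bool:
    # card[LN, address] = trans[LN, post] and card[FN] ~ trans[FN]
    return (c["LN"] == t["LN"] and c["address"] == t["post"]
            and similar(c["FN"], t["FN"]))


def deduced_rule(c: dict, t: dict) -> bool:
    # card[LN, tel] = trans[LN, phn] and card[FN] ~ trans[FN]
    return (c["LN"] == t["LN"] and t["phn"] is not None
            and c["tel"] == t["phn"] and similar(c["FN"], t["FN"]))


card = {"FN": "Mark", "LN": "Smith",
        "address": "10 Oak St, EDI, EH8 9LE", "tel": "3256777"}
trans_m = {"FN": "M.", "LN": "Smith",
           "post": "10 Oak St, EDI, EH8 9LE", "phn": None}
trans_max = {"FN": "Max", "LN": "Smith",
             "post": "PO Box 25, EDI", "phn": "3256777"}

print(given_rule(card, trans_m), given_rule(card, trans_max))
print(deduced_rule(card, trans_m), deduced_rule(card, trans_max))
```

Under these assumptions the given rule identifies the card tuple with the "M. Smith" transaction only, while the deduced rule identifies it with the "Max Smith" transaction only.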
8
Error correction, data enrichment, ...
1. card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
2. card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
3. card[tel] = trans[phn] → card[address] ⇌ trans[post]

FN   | LN    | address                 | tel     | DOB      | gender
Mark | Smith | 10 Oak St, EDI, EH8 9LE | 3256777 | 10/27/97 | M

FN   | LN    | post                    | phn     | when       | where | amount
M.   | Smith | 10 Oak St, EDI, EH8 9LE | null    | 1pm/7/7/09 | EDI   | 3,500
Max  | Smith | PO Box 25, EDI          | 3256777 | 2pm/7/7/09 | NYC   | 6,300

(Figure labels: "inconsistent", "enrich", "Match"; matches marked 1 and 2)
The need for matching dependencies and for reasoning about them
9
Outline
  • Matching dependencies (MDs): a departure from traditional dependencies
  • Dynamic semantics, similarity operators, across relations
  • Reasoning about matching dependencies
  • A sound and complete inference system
  • A low polynomial algorithm
  • Relative candidate keys (RCKs): matching rules
  • Deducing RCKs from MDs: an exponential-time problem
  • An effective (heuristic) polynomial-time algorithm
  • Applications: record matching, blocking, windowing
  • Experimental study

A dependency theory for record matching
10
Matching dependencies (MDs)
(R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
  • (Aj, Bj): pair of attributes in (R1, R2)
  • ≈j: similarity operator (equality, edit distance, q-gram, Jaro distance, ...)
  • (Z1, Z2): lists of attributes in (R1, R2), of the same length
  • ⇌: matching operator (identify two lists of attributes via updates)
  • R1[X] = card[X], R2[Y] = trans[Y]
  • card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  • card[tel] = trans[phn] → card[address] ⇌ trans[post]
  • card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]

Semantic relationship on attributes across different sources
11
Dynamic semantics
φ = (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
  • (D1, D2) satisfies φ iff for all (t1, t2) ∈ D1,
  • if t1[A1] ≈1 t2[B1] ∧ . . . ∧ t1[Ak] ≈k t2[Bk] in D1
  • then (t1, t2) ∈ D2, and t1[Z1] = t2[Z2] in D2
  • If (t1, t2) match the LHS, then their RHS are updated and equalized

D1:
tel     | address
3256777 | 10 Oak St, EDI
phn     | post
3256777 | PO Box 25, EDI

D2:
tel     | address
3256777 | 10 Oak St, EDI, EH8 9LE
phn     | post
3256777 | 10 Oak St, EDI, EH8 9LE

Two instances are needed to cope with the dynamic semantics
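
The two-instance semantics can be checked mechanically. The sketch below is a simplified illustration under stated assumptions: the LHS uses equality only (no similarity operators), each instance is a pair of dictionaries keyed by hypothetical tuple ids ("c1", "t1"), and the function name `satisfies` is made up for this example.

```python
def satisfies(d1, d2, lhs, rhs):
    """Check (D1, D2) against an MD with an equality-only LHS.

    d1, d2: (card, trans) pairs, each a dict {tuple_id: record}.
    lhs, rhs: lists of (card_attribute, trans_attribute) pairs.
    """
    card1, trans1 = d1
    card2, trans2 = d2
    for i, c in card1.items():
        for j, t in trans1.items():
            # The LHS is evaluated on the ORIGINAL instance D1.
            if all(c[a] == t[b] for a, b in lhs):
                c2, t2 = card2.get(i), trans2.get(j)
                # Both tuples must survive into D2 ...
                if c2 is None or t2 is None:
                    return False
                # ... and their RHS attributes must be equal in D2.
                if any(c2[a] != t2[b] for a, b in rhs):
                    return False
    return True


# card[tel] = trans[phn] -> card[address] <=> trans[post]
lhs, rhs = [("tel", "phn")], [("address", "post")]

d1 = ({"c1": {"tel": "3256777", "address": "10 Oak St, EDI"}},
      {"t1": {"phn": "3256777", "post": "PO Box 25, EDI"}})
d2 = ({"c1": {"tel": "3256777", "address": "10 Oak St, EDI, EH8 9LE"}},
      {"t1": {"phn": "3256777", "post": "10 Oak St, EDI, EH8 9LE"}})

print(satisfies(d1, d2, lhs, rhs))  # D2 equalizes address and post
print(satisfies(d1, d1, lhs, rhs))  # D1 alone leaves them different
```

Checking `(d1, d1)` shows why a single instance is not enough: the LHS holds in D1, but the addresses only agree after the update that produces D2.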
12
An extension of functional dependencies (FDs)?
MD: (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
(to accommodate unreliable data)
FD: tel → address
(developed for schema design, for clean data)
  • similarity operators vs. equality (=) only
  • across different relations (R1, R2) vs. on a single relation
  • dynamic semantics (matching operator ⇌) vs. static semantics

D1 (violation of the FD):
tel     | address
3256777 | 10 Oak St, EDI
3256777 | PO Box 25, EDI

D2 (satisfying the MD):
tel     | address
3256777 | 10 Oak St, EDI, EH8 9LE
3256777 | 10 Oak St, EDI, EH8 9LE

A departure from traditional dependency theory
13
An inference system for deduction of MDs
Recall Armstrong's axioms for FDs
There is a finite set of axioms that is sound and complete for MD deduction
Example: MD φ is provable from φ1, φ2 by using the inference system
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
(Augmentation Rule)
φ1′: card[LN, tel] = trans[LN, phn] → card[LN, address] ⇌ trans[LN, post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
(Transitivity Rule)
φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  • More involved than Armstrong's axioms (11 axioms vs. 3)
  • two relations, generic reasoning for similarity operators

14
An algorithm for deducing MDs from given MDs
  • Algorithm MDClosure
  • Input: a set Σ of MDs and a single MD φ
  • Output: yes if φ can be deduced from Σ, in O(n^2) time
  • Main ideas
  • Store deduced MDs in a table M
  • Process M based on inference rules, until M becomes stable
  • If the LHS of an MD is in M, then its RHS is added to M
  • Return yes if the RHS of φ is in M, and no otherwise
  • The algorithm is designed to have low complexity: O(n^2)

Comparable to O(n) time for FDs
The deduction analysis can be conducted efficiently
15
An algorithm for deducing MDs from given MDs
Example: MD φ can be deduced from φ1, φ2
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Step 1: M = {card[LN, tel] = trans[LN, phn], card[FN] ≈ trans[FN]} (add the LHS of φ)
Step 2: M = M ∪ {card[address] = trans[post]} (apply φ1)
Step 3: M = M ∪ {card[X] = trans[Y]} (apply φ2)
Return yes
A match may be found by deduced MDs, but NOT by given ones
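
The three steps above can be replayed with a small fixpoint loop. This is a simplified sketch in the spirit of MDClosure, not the algorithm from the talk: it only applies the "LHS in M, so add RHS" step, it treats the attribute lists X and Y as one opaque pair, it assumes that equality implies similarity, and it makes no attempt to reach the O(n^2) bound. The `MD` class and the function names are illustrative.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set, Tuple

# A fact (a, b, op): card attribute a and trans attribute b are related
# by op, where op is "=" (equal/identified) or "~" (similar).
Fact = Tuple[str, str, str]


class MD(NamedTuple):
    lhs: FrozenSet[Fact]              # comparisons that must hold
    rhs: FrozenSet[Tuple[str, str]]   # attribute pairs to identify


def entailed(fact: Fact, m: Set[Fact]) -> bool:
    """A fact holds if it is in M, or if the pair is known to be equal
    (assumption: equality implies any similarity)."""
    a, b, _ = fact
    return fact in m or (a, b, "=") in m


def md_closure(sigma: Iterable[MD], phi: MD) -> bool:
    sigma = list(sigma)
    m: Set[Fact] = set(phi.lhs)       # Step 1: add the LHS of phi
    changed = True
    while changed:                    # repeat until M is stable
        changed = False
        for md in sigma:
            if all(entailed(f, m) for f in md.lhs):
                for a, b in md.rhs:
                    if (a, b, "=") not in m:
                        m.add((a, b, "="))
                        changed = True
    return all((a, b, "=") in m for a, b in phi.rhs)


phi1 = MD(frozenset({("tel", "phn", "=")}),
          frozenset({("address", "post")}))
phi2 = MD(frozenset({("LN", "LN", "="), ("address", "post", "="),
                     ("FN", "FN", "~")}),
          frozenset({("X", "Y")}))
phi = MD(frozenset({("LN", "LN", "="), ("tel", "phn", "="),
                    ("FN", "FN", "~")}),
         frozenset({("X", "Y")}))

print(md_closure([phi1, phi2], phi))  # phi1 then phi2 fire, as in Steps 2-3
print(md_closure([phi2], phi))        # without phi1 nothing fires
```

Dropping φ1 from the input shows that φ is not deducible from φ2 alone, since nothing then relates tel/phn to address/post.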
16
Relative Candidate Keys (RCKs)
relative to R1[X] and R2[Y]
Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same object
An MD (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[X] ⇌ R2[Y]
is written as the key (R1[A1, ..., Ak], R2[B1, ..., Bk] | ≈1, . . ., ≈k)
what to compare and how to compare
  • R1[X] = card[X], R2[Y] = trans[Y]
  • card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  • RCK: (card[LN, address, FN], trans[LN, post, FN] | =, =, ≈)
  • card[tel] = trans[phn] → card[address] ⇌ trans[post]: NOT an RCK
  • card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
  • RCK: (card[LN, tel, FN], trans[LN, phn, FN] | =, =, ≈)

A departure from candidate keys: similarity, different sources
17
What is special about RCKs?
  • Matching rules: identify records from unreliable data sources
  • Optimization: efficiency is a big issue for record matching
  • blocking: only records in the same block are compared
    (figure: D partitioned into blocks B1, B2, B3 by discriminating attributes)
  • windowing (sorted neighborhood): a window of a fixed size; only records in the same window are compared
    (figure: D sorted via keys, then scanned with a sliding window)

The match quality is highly dependent on the choices of keys
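
A minimal Python sketch of how blocking and windowing cut down the pairs that get compared. The function names, the window size and the four sample records are made up for illustration and do not come from the slides.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Candidate pairs = pairs of records that share the same block key."""
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        blocks[key(r)].append(i)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


def window_pairs(records, key, w):
    """Sorted neighborhood: sort by key, then pair each record with the
    next w - 1 records in sorted order (a sliding window of size w)."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + w]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records, for illustration only.
records = [
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Smith", "tel": "3256777"},
    {"LN": "Jones", "tel": "5550000"},
    {"LN": "Smyth", "tel": "3256777"},
]

by_ln = lambda r: r["LN"]
print(sorted(blocking_pairs(records, by_ln)))     # [(0, 1)]
print(sorted(window_pairs(records, by_ln, w=2)))  # [(0, 1), (0, 2), (1, 3)]
```

With LN as the key, blocking never compares "Smith" with "Smyth", whereas the sliding window does because the two sort next to each other. Switching the key to tel puts records 0, 1 and 3 in one block, which illustrates how strongly the outcome depends on the choice of key.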
18
Deducing quality RCKs from MDs
Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
Output: a set Γ of top k RCKs deduced from Σ
  • A quality metric
  • nonredundancy
  • the diversity of attributes
  • the lengths of attributes
  • the accuracy of attributes

  • Nontrivial
  • first compute ALL RCKs, and then pick the top-k: exponential time

The deduction analysis can be conducted efficiently
19
A heuristic algorithm for deducing quality RCKs
  • Algorithm findRCKs
  • Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
  • Output: a set Γ of top k RCKs deduced from Σ, in O(kn^3) time
  • Main ideas
  • A notion of completeness
  • if RCKs deduced from Σ are already covered by smaller RCKs in Γ
  • Deduction
  • (R1[X], R2[Y] | =, ..., =) itself is an RCK
  • Make use of algorithm MDClosure to deduce RCKs

n: the size of Σ (meta-data)
A new RCK:
from the RCK (R1[V1, Z1], R2[V2, Z2] | ≈, ..., ≈) and the MD (R1[U1] ≈ R2[U2] → R1[Z1] ⇌ R2[Z2]),
deduce (R1[V1, U1], R2[V2, U2] | ≈, ..., ≈)
One can efficiently deduce keys for matching, blocking, windowing
20
A heuristic algorithm for deducing quality RCKs
Example: given a set {φ1, φ2} of MDs and (card[X], trans[Y]), deduce RCKs rck1, rck2, rck3.
φ1: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ2: card[tel] = trans[phn] → card[address] ⇌ trans[post]
Step 1: rck1 = (card[X], trans[Y] | =, ..., =)
Step 2: rk2 = (card[LN, address, FN], trans[LN, post, FN] | =, =, ≈) (apply φ1 to rck1)
Step 3: rck2 = minimize(rk2)
Step 4: rk3 = (card[LN, tel, FN], trans[LN, phn, FN] | =, =, ≈) (apply φ2 to rck2)
Step 5: rck3 = minimize(rk3)
Return rck1, rck2, rck3.
Minimize: remove redundant attribute pairs in an RCK
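
The "apply an MD to an RCK" step used in Steps 2 and 4 can be sketched as a set rewrite: the attribute pairs identified by the MD's RHS are replaced by the comparisons on its LHS. This is a simplified illustration only. It treats X and Y as one opaque pair, omits the minimize step and the quality ranking of findRCKs, and the name `apply_md` is an assumption.

```python
from typing import FrozenSet, Optional, Tuple

# (card attribute, trans attribute, operator), with "=" or "~" as operator
Fact = Tuple[str, str, str]
RCK = FrozenSet[Fact]


def apply_md(rck: RCK,
             md_lhs: FrozenSet[Fact],
             md_rhs: FrozenSet[Tuple[str, str]]) -> Optional[RCK]:
    """Replace the equality comparisons covered by the MD's RHS with the
    MD's LHS. Returns None when the MD does not apply to this RCK."""
    identified = frozenset((a, b, "=") for a, b in md_rhs)
    if not identified <= rck:
        return None
    return frozenset((rck - identified) | md_lhs)


# phi1 and phi2 numbered as in the example above.
phi1_lhs = frozenset({("LN", "LN", "="), ("address", "post", "="),
                      ("FN", "FN", "~")})
phi1_rhs = frozenset({("X", "Y")})
phi2_lhs = frozenset({("tel", "phn", "=")})
phi2_rhs = frozenset({("address", "post")})

rck1 = frozenset({("X", "Y", "=")})          # Step 1: the trivial key
rck2 = apply_md(rck1, phi1_lhs, phi1_rhs)    # Step 2: apply phi1 to rck1
rck3 = apply_md(rck2, phi2_lhs, phi2_rhs)    # Step 4: apply phi2 to rck2

print(sorted(rck2))
print(sorted(rck3))
```

In this example both intermediate keys are already minimal, so skipping minimize does not change the result.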
21
Experimental study: the reasoning algorithms
  • scales well with the number of MDs
  • also scales well with k, the number of RCKs
The algorithm scales well (100 seconds for 2k MDs, 50 RCKs)
22
The number of RCKs derived
Quality: reasonably diverse
Sufficient quality RCKs can be deduced from a
small number of MDs
23
Experimental study: match quality (FS)
  • Fellegi-Sunter (FS) method: a statistical method in action
  • Credit payment data scraped from the Web (relations of arity 21 and 13, with (X, Y) of length 11)
  • 7 MDs, using Damerau-Levenshtein distance and soundex for similarity
  • Precision (true matches found relative to all matches found), recall (true matches found relative to all true matches)

Improving the precision without lowering the recall
RCKs indeed improve the match quality (up to 20%)
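
For reference, the two measures can be computed from sets of matched pairs as follows. This is a generic sketch with invented pair ids; it is not the evaluation code behind the reported numbers.

```python
def precision_recall(found, true_matches):
    """found, true_matches: sets of (card_id, trans_id) pairs."""
    correct = len(found & true_matches)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(true_matches) if true_matches else 0.0
    return precision, recall


# Hypothetical pairs, for illustration only.
found = {(1, 1), (1, 2), (2, 3)}
true_matches = {(1, 1), (2, 3), (4, 4), (5, 5)}

p, r = precision_recall(found, true_matches)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```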
24
Experimental study: efficiency (FS)
comparable performance
RCKs do not incur extra cost while improving
match quality
25
Experimental study: precision (SN)
  • Sorted neighborhood (SN) method: a rule-based method

Insensitive to data size
RCKs consistently improve the precision (by 20%)
26
Experimental study: recall (SN)
RCKs consistently improve the recall (by 20%)
27
Experimental study: efficiency (SN)
RCKs reduce the number of comparisons and improve efficiency (by 30%)
28
Experimental study: blocking
  • Partial RCKs as keys for blocking
  • Pair completeness: S/N, where S and N are the numbers of matches with and without blocking

Similar results for windowing
RCKs make effective blocking (windowing) keys
29
Summary
  • A dependency theory for matching unreliable records
  • Matching dependencies, relative candidate keys: dynamic semantics, similarity operators, across unreliable data sources
  • A sound and complete inference system
  • An O(n^2)-time algorithm for the deduction analysis
  • An efficient (heuristic) algorithm for deducing quality RCKs
  • Record matching, optimization (blocking, windowing)
  • Future work
  • Negative rules: if condition then NO match
  • Conditions with constants
  • Interaction of record matching and data repairing (currently treated as separate processes)

A practical tool for deducing matching rules