Title: Reasoning about Record Matching Rules
1Reasoning about Record Matching Rules
- Wenfei Fan (1, 2), Xibei Jia (1), Shuai Ma (1)
- (1) University of Edinburgh, (2) Bell Labs
- Jianzhong Li
- Harbin Institute of Technology
2Record matching
To identify tuples (from one or more unreliable
relations) that refer to the same real-world
object.
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
the same person?
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
Also known as record linkage, entity resolution, data deduplication, merge/purge, ...
3Why bother?
Data quality, data integration, payment card fraud detection, ...
Records for card holders
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
fraud?
Records for transaction logs
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
World-wide losses in 2006: 4.84 billion (www.sas.com)
4Nontrivial: a longstanding problem
- Real-life data is often dirty: errors in the data sources
- Data is often represented differently in different sources
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
Pairwise comparing attributes via equality only
does not work!
5Matching rules (Hernández & Stolfo, 1995)
IF card[LN, address] = trans[LN, post] AND card[FN] and trans[FN] are similar, THEN identify the two tuples
card
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
trans
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
Match: under this rule, the card tuple and the first trans tuple (M. Smith) are identified.
Accommodate errors in the data sources
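The rule on this slide can be tried out on the example tuples. The sketch below is only an illustration: the slide does not say which similarity test is used, so `similar` (initial abbreviation, or Levenshtein distance of at most 2) and all function names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Assumed similarity test: initial abbreviation or small edit distance."""
    a, b = a.strip().lower(), b.strip().lower()
    for short, full in ((a, b), (b, a)):
        # "m." counts as an abbreviation of "mark"
        if len(short) > 1 and short.endswith(".") and full.startswith(short[:-1]):
            return True
    return edit_distance(a, b) <= max_dist


def rule_matches(card: dict, trans: dict) -> bool:
    """IF LN and address/post are equal AND the first names are similar."""
    return (
        card["LN"] == trans["LN"]
        and card["address"] == trans["post"]
        and similar(card["FN"], trans["FN"])
    )


card = {"FN": "Mark", "LN": "Smith", "address": "10 Oak St, EDI, EH8 9LE", "tel": "3256777"}
t1 = {"FN": "M.", "LN": "Smith", "post": "10 Oak St, EDI, EH8 9LE", "phn": None}
t2 = {"FN": "Max", "LN": "Smith", "post": "PO Box 25, EDI", "phn": "3256777"}

print(rule_matches(card, t1))  # True: same LN and address, "M." abbreviates "Mark"
print(rule_matches(card, t2))  # False: the addresses differ
```

Comparing all attributes by equality alone would reject both transaction tuples; the similarity test on FN is what lets the first one through.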
6A new class of dependencies for record matching
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card[tel] = trans[phn] → card[address] ⇌ trans[post]
Identifying attributes (not necessarily entire
records), across sources
card (attribute list X)
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
trans (attribute list Y)
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
2^(mn) configurations
What attributes to compare? How to compare them?
7Deducing new dependencies from given ones
Given MDs:
card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card[tel] = trans[phn] → card[address] ⇌ trans[post]
Deduced MD:
card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
card
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
trans
FN    LN     post                     phn      when        where  amount
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
Match, although the two tuples are radically different
Matched by the deduced rule, but NOT by the given
ones!
8Error correction, data enrichment, ...
1. card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
2. card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
3. card[tel] = trans[phn] → card[address] ⇌ trans[post]
card
FN    LN     address                  tel      DOB       gender
Mark  Smith  10 Oak St, EDI, EH8 9LE  3256777  10/27/97  M
trans
FN    LN     post                     phn      when        where  amount
M.    Smith  10 Oak St, EDI, EH8 9LE  null     1pm/7/7/09  EDI    3,500
Max   Smith  PO Box 25, EDI           3256777  2pm/7/7/09  NYC    6,300
Labels in the figure: matches via rules 1 and 2; "inconsistent" (error correction); "enrich" (data enrichment)
The need for matching dependencies and for
reasoning about them
9Outline
- Matching dependencies (MDs): a departure from traditional dependencies
- Dynamic semantics, similarity operators, across relations
- Reasoning about matching dependencies
- A sound and complete inference system
- A low polynomial algorithm
- Relative candidate keys (RCKs): matching rules
- Deducing RCKs from MDs: an exponential-time problem
- An effective (heuristic) polynomial-time algorithm
- Applications: record matching, blocking, windowing
- Experimental study
A dependency theory for record matching
10Matching dependencies (MDs)
(R1[A1] ≈1 R2[B1] ∧ ... ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
- (Aj, Bj): a pair of attributes in (R1, R2)
- ≈j: a similarity operator (equality, edit distance, q-gram, Jaro distance, ...)
- (Z1, Z2): lists of attributes in (R1, R2), of the same length
- ⇌: the matching operator (identify two lists of attributes via updates)
Examples:
- (R1[X], R2[Y]) is (card[X], trans[Y])
- card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
- card[tel] = trans[phn] → card[address] ⇌ trans[post]
- card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Semantic relationship on attributes across
different sources
11Dynamic semantics
φ = (R1[A1] ≈1 R2[B1] ∧ ... ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
- (D1, D2) satisfies φ iff for all (t1, t2) ∈ D1,
- if t1[A1] ≈1 t2[B1] ∧ ... ∧ t1[Ak] ≈k t2[Bk] in D1,
- then (t1, t2) ∈ D2, and t1[Z1] = t2[Z2] in D2
- If (t1, t2) match the LHS, then their RHS are updated and equalized
D1:
tel      address
3256777  10 Oak St, EDI
phn      post
3256777  PO Box 25, EDI
D2:
tel      address
3256777  10 Oak St, EDI, EH8 9LE
phn      post
3256777  10 Oak St, EDI, EH8 9LE
Two instances are needed to cope with the dynamic
semantics
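The two-instance semantics can be checked mechanically on the D1/D2 example. This is a minimal sketch under assumed representations: each instance maps a tuple id to a dict of attribute values (so a tuple can be followed from D1 to D2), and the function name `satisfies` is hypothetical.

```python
import operator


def satisfies(d1, d2, lhs, rhs):
    """Check whether (D1, D2) satisfies an MD.

    d1, d2: pairs (instance of R1, instance of R2), each instance a dict
            mapping a tuple id to a dict of attribute values.
    lhs:    list of (A, B, sim), where sim is a similarity predicate.
    rhs:    list of (Z1, Z2) attribute pairs to be identified.
    """
    (r1_old, r2_old), (r1_new, r2_new) = d1, d2
    for i, t1 in r1_old.items():
        for j, t2 in r2_old.items():
            # The LHS is evaluated in D1 ...
            if all(sim(t1[a], t2[b]) for a, b, sim in lhs):
                # ... and the RHS must be equalized in D2.
                if i not in r1_new or j not in r2_new:
                    return False
                if any(r1_new[i][z1] != r2_new[j][z2] for z1, z2 in rhs):
                    return False
    return True


# card[tel] = trans[phn] -> card[address] <=> trans[post]
lhs = [("tel", "phn", operator.eq)]
rhs = [("address", "post")]

D1 = ({1: {"tel": "3256777", "address": "10 Oak St, EDI"}},
      {1: {"phn": "3256777", "post": "PO Box 25, EDI"}})
D2 = ({1: {"tel": "3256777", "address": "10 Oak St, EDI, EH8 9LE"}},
      {1: {"phn": "3256777", "post": "10 Oak St, EDI, EH8 9LE"}})

print(satisfies(D1, D2, lhs, rhs))  # True: the addresses are equalized in D2
print(satisfies(D1, D1, lhs, rhs))  # False: without the update they still differ
```

The second call shows why a single instance is not enough: D1 on its own matches the LHS but leaves the RHS unequal.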
12An extension of functional dependencies (FDs)?
MD: (R1[A1] ≈1 R2[B1] ∧ ... ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2] (to accommodate unreliable data)
FD: tel → address (developed for schema design for clean data)
- similarity operators vs. equality (=) only
- across different relations (R1, R2) vs. on a single relation
- dynamic semantics (matching operator ⇌) vs. static semantics
D1 (violation of the FD):
tel      address
3256777  10 Oak St, EDI
3256777  PO Box 25, EDI
D2:
tel      address
3256777  10 Oak St, EDI, EH8 9LE
3256777  10 Oak St, EDI, EH8 9LE
(D1, D2): satisfying the MD
A departure from traditional dependency theory
13An inference system for deduction of MDs
Recall Armstrong's axioms for FDs
There is a finite set of axioms that is sound and complete for MD deduction
Example: MD φ is provable from φ1, φ2 by using the inference system
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
Augmentation Rule
φ1': card[LN, tel] = trans[LN, phn] → card[LN, address] ⇌ trans[LN, post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Transitivity Rule
φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
- More involved than Armstrong's axioms (11 axioms vs. 3)
- two relations, generic reasoning for similarity operators
14An algorithm for deducing MDs from given MDs
- Algorithm MDClosure
- Input: a set Σ of MDs and a single MD φ
- Output: yes if φ can be deduced from Σ, in O(n^2) time
- Main ideas
- Store deduced MDs in a table M
- Process M based on inference rules, until M becomes stable
- If the LHS of an MD is in M, then its RHS is added to M
- Return yes if the RHS of φ is in M, and no otherwise
- The algorithm is designed to have low complexity: O(n^2), comparable to O(n) time for FDs
The deduction analysis can be conducted
efficiently
15An algorithm for deducing MDs from given MDs
Example: MD φ can be deduced from φ1, φ2
φ1: card[tel] = trans[phn] → card[address] ⇌ trans[post]
φ2: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ: card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
Step 1: M := {card[LN, tel] = trans[LN, phn], card[FN] ≈ trans[FN]} (add the LHS of φ)
Step 2: M := M ∪ {card[address] = trans[post]} (apply φ1)
Step 3: M := M ∪ {card[X] = trans[Y]} (apply φ2)
Return yes
A match may be found by deduced MDs, but NOT by
given ones
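The three steps of this example can be replayed with a small fixpoint loop. The sketch below is a deliberately simplified stand-in, not the MDClosure algorithm itself: it ignores the full inference system, treats the lists X and Y as one symbolic attribute pair, and assumes only that equality implies any similarity. The names `MD`, `holds` and `deducible` are hypothetical.

```python
from typing import NamedTuple


class MD(NamedTuple):
    lhs: frozenset  # of (card_attr, trans_attr, op); op is "=" or "~" (similar)
    rhs: frozenset  # of (card_attr, trans_attr) pairs to be identified


def holds(fact, known):
    """A comparison holds if it is known, or if its "=" version is known."""
    a, b, op = fact
    return (a, b, op) in known or (a, b, "=") in known


def deducible(sigma, phi):
    """Return True if phi follows from sigma under this simplified closure."""
    known = set(phi.lhs)  # Step 1: start from the LHS of phi
    changed = True
    while changed:  # repeat until the table is stable
        changed = False
        for md in sigma:
            if all(holds(f, known) for f in md.lhs):
                new = {(a, b, "=") for a, b in md.rhs} - known
                if new:
                    known |= new
                    changed = True
    return all((a, b, "=") in known for a, b in phi.rhs)


phi1 = MD(frozenset({("tel", "phn", "=")}), frozenset({("address", "post")}))
phi2 = MD(
    frozenset({("LN", "LN", "="), ("address", "post", "="), ("FN", "FN", "~")}),
    frozenset({("X", "Y")}),  # symbolic stand-in for card[X] <=> trans[Y]
)
phi = MD(
    frozenset({("LN", "LN", "="), ("tel", "phn", "="), ("FN", "FN", "~")}),
    frozenset({("X", "Y")}),
)

print(deducible([phi1, phi2], phi))  # True
print(deducible([phi2], phi))        # False: phi1 is needed to reach address/post
```

Each pass either adds a new pair to the table or stops, so the loop terminates after at most one pass per attribute pair plus one.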
16Relative Candidate Keys (RCKs)
relative to (R1[X], R2[Y])
Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same object
(R1[A1] ≈1 R2[B1] ∧ ... ∧ R1[Ak] ≈k R2[Bk]) → R1[X] ⇌ R2[Y]
written as (R1[A1, ..., Ak], R2[B1, ..., Bk] ‖ ≈1, ..., ≈k)
what to compare and how to compare
- (R1[X], R2[Y]) is (card[X], trans[Y])
- card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y], i.e., (card[LN, address, FN], trans[LN, post, FN] ‖ =, =, ≈)
- card[tel] = trans[phn] → card[address] ⇌ trans[post]: NOT an RCK
- card[LN, tel] = trans[LN, phn] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y], i.e., (card[LN, tel, FN], trans[LN, phn, FN] ‖ =, =, ≈)
A departure from candidate keys similarity,
different sources
17What is special about RCKs?
- Matching rules: identify records from unreliable data sources
- Optimization: efficiency is a big issue for record matching
- blocking: only records in the same block are compared
(figure: D is partitioned into blocks B1, B2, B3 by discriminating attributes)
- windowing (sorted neighborhood): with a window of a fixed size, only records in the same window are compared
(figure: D is sorted via keys and scanned with a sliding window)
The match quality is highly dependent on the
choices of keys
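Both optimizations reduce to choosing a key and then limiting which pairs are compared. The following sketch generates the candidate pairs for each; the function names, the key function, and the sample records are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Compare only records that fall into the same block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def windowing_pairs(records, key, w):
    """Sort by key, then compare each record with the next w - 1 records."""
    order = sorted(records, key=lambda rid: (key(records[rid]), rid))
    pairs = set()
    for i, rid in enumerate(order):
        for other in order[i + 1 : i + w]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs


# Hypothetical records, keyed on the last name only.
records = {
    1: {"LN": "Smith"},
    2: {"LN": "Smith"},
    3: {"LN": "Smyth"},
    4: {"LN": "Jones"},
}


def by_last_name(rec):
    return rec["LN"]


print(sorted(blocking_pairs(records, by_last_name)))      # [(1, 2)]
print(sorted(windowing_pairs(records, by_last_name, 2)))  # [(1, 2), (1, 4), (2, 3)]
```

With this key, blocking never compares "Smith" with "Smyth", while the sliding window does because the two sort next to each other; a different key changes which true matches survive, which is the dependence on key choice that the slide points out.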
18Deducing quality RCKs from MDs
Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
Output: a set Γ of top k RCKs deduced from Σ
- A quality metric
- nonredundancy
- the diversity of attributes
- the lengths of attributes
- the accuracy of attributes
- Nontrivial
- first compute ALL RCKs, and then pick the top-k: exponential time
The deduction analysis can be conducted
efficiently
19A heuristic algorithm for deducing quality RCKs
- Algorithm findRCKs
- Input: a set Σ of MDs, (R1[X], R2[Y]), and a number k
- Output: a set Γ of top k RCKs deduced from Σ, in O(kn^3) time
- Main ideas
- A notion of completeness
- if RCKs deduced from Σ are already covered by smaller RCKs in Γ
- Deduction
- (R1[X], R2[Y] ‖ =, ..., =) itself is an RCK
- Make use of algorithm MDClosure to deduce RCKs
n: the size of Σ (meta-data)
A new RCK: from the RCK (R1[V1, Z1], R2[V2, Z2] ‖ ≈, ..., ≈) and the MD (R1[U1] ≈ R2[U2] → R1[Z1] ⇌ R2[Z2]), deduce
(R1[V1, U1], R2[V2, U2] ‖ ≈, ..., ≈)
One can efficiently deduce keys for matching,
blocking, windowing
20A heuristic algorithm for deducing quality RCKs
Example: given a set {φ1, φ2} of MDs and (card[X], trans[Y]), deduce RCKs rck1, rck2, rck3.
φ1: card[LN, address] = trans[LN, post] ∧ card[FN] ≈ trans[FN] → card[X] ⇌ trans[Y]
φ2: card[tel] = trans[phn] → card[address] ⇌ trans[post]
Step 1: rck1 := (card[X], trans[Y] ‖ =, ..., =)
Step 2: rk2 := (card[LN, address, FN], trans[LN, post, FN] ‖ =, =, ≈) (apply φ1 to rck1)
Step 3: rck2 := minimize(rk2)
Step 4: rk3 := (card[LN, tel, FN], trans[LN, phn, FN] ‖ =, =, ≈) (apply φ2 to rck2)
Step 5: rck3 := minimize(rk3)
Return rck1, rck2, rck3.
Minimize: remove redundant attribute pairs in an RCK
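The "apply an MD to an RCK" step in this example amounts to a substitution: the attribute pairs on the MD's RHS are dropped from the key and the MD's LHS comparisons are added. The sketch below shows only that step, without minimization or the quality metric. The function name `apply_md` is hypothetical, and the concrete lists X = [FN, LN, address, tel] and Y = [FN, LN, post, phn] are assumed for illustration.

```python
def apply_md(rck, md_lhs, md_rhs):
    """Replace the pairs in md_rhs by the comparisons in md_lhs.

    rck, md_lhs: sets of (card_attr, trans_attr, op), op "=" or "~" (similar).
    md_rhs:      set of (card_attr, trans_attr) pairs identified by the MD.
    Returns the new key, or None if the MD does not apply to this key.
    """
    compared = {(a, b) for a, b, _ in rck}
    if not md_rhs <= compared:
        return None
    kept = {(a, b, op) for a, b, op in rck if (a, b) not in md_rhs}
    return frozenset(kept | set(md_lhs))


# Assumed attribute lists X and Y (not spelled out on the slide).
xy_pairs = {("FN", "FN"), ("LN", "LN"), ("address", "post"), ("tel", "phn")}

phi1_lhs = {("LN", "LN", "="), ("address", "post", "="), ("FN", "FN", "~")}
phi1_rhs = xy_pairs
phi2_lhs = {("tel", "phn", "=")}
phi2_rhs = {("address", "post")}

rck1 = frozenset((a, b, "=") for a, b in xy_pairs)  # Step 1
rk2 = apply_md(rck1, phi1_lhs, phi1_rhs)             # Step 2
rk3 = apply_md(rk2, phi2_lhs, phi2_rhs)              # Step 4

print(sorted(rk2))  # [('FN', 'FN', '~'), ('LN', 'LN', '='), ('address', 'post', '=')]
print(sorted(rk3))  # [('FN', 'FN', '~'), ('LN', 'LN', '='), ('tel', 'phn', '=')]
```

Here rk2 and rk3 happen to contain no redundant pairs, so the minimize steps would leave them unchanged.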
21Experimental study: the reasoning algorithms
- scales well with the number of MDs
- also scales well with k, the number of RCKs
The algorithm scales well (100 seconds for 2k MDs, 50 RCKs)
22The number of RCKs derived
Quality reasonably diverse
Sufficient quality RCKs can be deduced from a
small number of MDs
23Experimental study: match quality (FS)
- Fellegi-Sunter method: a statistical method in action
- Credit payment data scraped from the Web (relations of arity 21 and 13, with (X, Y) of length 11)
- 7 MDs, using Damerau-Levenshtein distance and soundex for similarity
- Precision (true matches found relative to all matches found), recall (true matches found relative to all true matches)
improving the precision without lowering the recall
RCKs indeed improve the match quality (up to 20%)
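The two quality measures on this slide are simple set ratios over matched pairs. A minimal sketch follows; the pair ids are made up for illustration, and returning 0.0 for an empty denominator is an assumed convention.

```python
def precision_recall(found, truth):
    """Precision: correct / all matches found. Recall: correct / all true matches."""
    found, truth = set(found), set(truth)
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical (card id, trans id) pairs.
found = {(1, 1), (1, 2), (2, 3)}
truth = {(1, 1), (1, 2), (3, 3), (4, 4)}

precision, recall = precision_recall(found, truth)
print(round(precision, 3), recall)  # 0.667 0.5
```

A stricter key typically removes false pairs from `found` (raising precision) at the risk of dropping true ones (lowering recall), which is why the slide stresses that precision improves without lowering recall.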
24Experimental study: efficiency (FS)
comparable performance
RCKs do not incur extra cost while improving
match quality
25Experimental study: precision (SN)
- Sorted neighborhood method: a rule-based method
insensitive to data size
RCKs consistently improve the precision (by 20%)
26Experimental study: recall (SN)
RCKs consistently improve the recall (by 20%)
27Experimental study: efficiency (SN)
RCKs reduce the number of comparisons and improve efficiency (by 30%)
28Experimental study: blocking
- Partial RCKs as keys for blocking
- Pair completeness: S/N, where S and N are the numbers of matches with and without blocking
similar results for windowing
RCKs make effective blocking (windowing) keys
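Pair completeness can be computed directly once the candidate pairs that survive blocking are known. In the sketch below, N is approximated by the set of all true matches (assuming that comparing every pair finds them all); the ground truth and the function names are illustrative assumptions.

```python
def candidate_pairs(cards, trans, card_key, trans_key):
    """Pairs (card id, trans id) whose blocking keys agree."""
    return {
        (c, t)
        for c, crec in cards.items()
        for t, trec in trans.items()
        if card_key(crec) == trans_key(trec)
    }


def pair_completeness(candidates, true_matches):
    """S/N: true matches that survive blocking over all true matches."""
    true_matches = set(true_matches)
    return len(set(candidates) & true_matches) / len(true_matches)


cards = {"c1": {"LN": "Smith", "tel": "3256777"}}
trans = {
    "t1": {"LN": "Smith", "phn": None},
    "t2": {"LN": "Smith", "phn": "3256777"},
}
truth = {("c1", "t1"), ("c1", "t2")}  # assumed ground truth

# Two partial keys taken from the RCK attributes (LN, tel/phn).
by_ln = candidate_pairs(cards, trans, lambda r: r["LN"], lambda r: r["LN"])
by_ln_tel = candidate_pairs(
    cards, trans, lambda r: (r["LN"], r["tel"]), lambda r: (r["LN"], r["phn"])
)

print(pair_completeness(by_ln, truth))      # 1.0
print(pair_completeness(by_ln_tel, truth))  # 0.5: the null phn hides one match
```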
29Summary
- A dependency theory for matching unreliable records
- Matching dependencies, relative candidate keys: dynamic semantics, similarity operators, across unreliable data sources
- A sound and complete inference system
- An O(n^2)-time algorithm for the deduction analysis
- An efficient (heuristic) algorithm for deducing quality RCKs
- Record matching, optimization (blocking, windowing)
- Future work
- Negative rules: if condition then NO match
- Conditions with constants
- Interaction of record matching and data repairing, which are currently treated as separate processes
A practical tool for deducing matching rules