Schema: Cust(country, area-code, phone, street, city, zip

About This Presentation

Title:

Schema: Cust(country, area-code, phone, street, city, zip

Description:

Schema: Cust(country, area-code, phone, street, city, zip) Instance: 07974. NYC. Mountain Ave ... zip. city. street. phone. area-code. country. functional ... – PowerPoint PPT presentation

Number of Views:113

Avg rating:3.0/5.0

Slides: 57

Provided by: homepage7

Category:

more less

Transcript and Presenter's Notes

Title: Schema: Cust(country, area-code, phone, street, city, zip

1

TDD Research Topics in Distributed Databases

Dependencies for improving data quality
Conditional functional dependencies (CFDs)
Syntax and semantics
Static analysis consistency and implication,
axiom system
Conditional inclusion dependencies (CINDs)
Syntax and semantics
Static analysis consistency and implication
Matching dependencies for record matching (MDs)
Syntax and semantics
Relative candidate keys

2
Characterizing the consistency of data

One of the central technical problems for data
consistency is how to tell whether the data is
dirty or clean
Integrity constraints (data dependencies) as data
quality rules
Inconsistencies emerge as violations of
constraints
Traditional dependencies
functional dependencies
inclusion dependencies
denial constraints (a special case of full
dependencies)
. . .
Question are these traditional dependencies
sufficient?

3
Example customer relation

Schema Cust(country, area-code, phone, street,
city, zip)
Instance

functional dependencies (FDs)
custcountry, area-code, phone ? custstreet,
city, zip
custcountry, area-code ? custcity
The database satisfies the FDs. Is the data
consistent?

4
Capturing inconsistencies in the data

cust (country 44, zip ? street)
In the UK, zip code uniquely determines the
street
The constraint may not hold for other countries
It expresses a fundamental part of the semantics
of the data
It can NOT be expressed as a traditional FD
It does not hold on the entire relation instead,
it holds on tuples representing UK customers only

5
Two more constraints

cust(country 44, area-code 131, phone ?
street, zip, city EDI)
cust(country 01, area-code 908, phone ?
street, zip, city MH)
In the UK, if the area code is 131, then the city
has to be EDI
In the US, if the area code is 908, then the city
has to be MH
t1, t2 and t3 violate these constraints
refining cust(country, area-code, phno ?
street, city, zip)
combining data values and variables

6
The need for new constraints

cust(country 44, zip ? street)
cust(country 44, area-code 131, phone ?
street, zip, city EDI)
cust(country 01, area-code 908, phone ?
street, zip, city MH)
They capture inconsistencies that traditional FDs
cannot detect
Traditional constraints were developed for
schema design, not for data cleaning!
Data integration in real-life source constraints
hold on a subset of sources
hold conditionally on the integrated data
They are NOT expressible as traditional FDs
do not hold on the entire relation
contain constant data values, besides logical
variables

7

Conditional Functional Dependencies (CFDs)

An extension of traditional FDs (R X ? Y, Tp)
X ? Y embedded traditional FD on R
Tp a pattern tableau
attributes X ? Y
each tuple in Tp consists of constants and
unnamed variable _
Example cust(country 44, zip ? street)
(cust (country, zip ? street), Tp)
pattern tableau Tp

8

Example CFDs

cust(country 44, area-code 131, phone ?
street, zip, city EDI)
cust(country 01, area-code 908, phone ?
street, zip, city MH)
cust(country, area-code, phone ? street,
city, zip)
as a SINGLE CFD
(cust(country, area-code, phone ? street, city,
zip), Tp)
pattern tableau Tp one tuple for each constraint

CFDs subsume traditional FDs. Why?
9

Traditional FDs as a special case

Express
custcountry, area-code ? custcity
as a CFD
(cust(country, area-code, ? city), Tp)
pattern tableau Tp a single tuple consisting of
_ only

10

Semantics of CFDs

a ? b (a matches b) if
either a or b is _
both a and b are constants and a b
tuple t1 matches t2 t1 ? t2
(a, b) ? (a, _), but (a, b) does not match (a,
c)
DB satisfies (R X ? Y, Tp) iff for any tuple tp
in the pattern tableau Tp and for any tuples t1,
t2 in DB, if t1X t2X ? tpX, then t1Y
t2Y ? tpY
tpX identifying the set of tuples on which the
constraint tp applies, ie, t tX ? tpX
t1Y t2Y ? tpY enforcing the embedded FD,
and the pattern of tp

11

Example violation of CFDs

cust(country 44, zip ? street)
Tuples t1 and t2 violate the CFD
t1country, zip t2country, zip ? tpcountry,
zip
t1street ? t2street
The CFD applies to t1 and t2 since they match
tpcountry, zip

CFDs enforcing binding of semantically related
data values
12

Violation of CFDs by a single tuple

(cust(country, area-code ? city), Tp)
Tuple t1 does not satisfy the CFD
t1country, area-code t1country, area-code ?
tp1country, area-code
t1city t1city however, t1city does not
match tp1city
In contrast to traditional FDs, a single tuple
may violate a CFD

13

Exercise

(cust(country, area-code, phno ? street, city,
zip), Tp)
Tuple t3 violates the CFD. Why?
Tuples t1 and t4 violate the CFD. Why?

14
Dirty constraints?

A set of CFDs may be inconsistent!
Inconsistent (R(A ? B), Tp)
In any nonempty database DB and for any tuple t
in DB,
tp1 tB must be b
tp2 tB must be c
Inconsistent if b and c are different
inconsistent ? ?1, ?2 , ?1 (R(A ? B),
Tp1), ?2 (R(B ? A), Tp2)
Why?

Tp
15
The consistency problem

The consistency problem for CFDs is to determine,
given a set ? of CFDs, whether or not there
exists a nonempty database DB that satisfies ?,
i.e., for any ? in ?, DB satisfies ?.
Whether or not ? makes sense
For traditional FDs, the consistency problem is
not an issue one can specify any FDs without
worrying about their consistency
A set of CFDs may be inconsistent!
Theorem. The consistency problem for CFDs is
NP-complete.
Nontrivial contrast this with the trivial
consistency analysis of FDs!

16
The implication problem

The implication problem for CFDs is to determine,
given a set ? of CFDs and a single CFD ?, whether
? implies ?, denoted by ? ?, i.e., for any
database DB, if DB satisfies ?, then DB satisfies
?.
Example
? ?1, ?2 , ?1 (R(A ? B), Tp1), ?2
(R(B ? C), Tp2)
? (R(A ? C), Tp)
? ?.

Tp1
Tp2
17
The complexity of the implication problem

For traditional FDs, the implication problem is
in linear time
In contrast, the implication problem for CFDs is
intractable
Theorem. The implication problem for CFDs is
coNP-complete.

Question how about constant CFDs (without
wildcard)? Would it simplify the consistency and
implication analyses?
The expressive power of CFDs comes at a price
18
Finite axiomatizability Flashback

Armstrongs axioms can be found in every database
textbook
Reflexivity If Y ? X, then X ? Y
Augmentation If X ? Y , then XZ ? YZ
Transitivity If X ? Y and Y ? Z, then X ? Z
Sound and complete for FD implication, i.e, ?
? iff ? can be inferred ? from using
reflexivity, augmentation, transitivity.
Question is there a sound and complete inference
system for the implication analysis of CFDs?

19
Finite axiomatizability of CFDs

Theorem. There is a sound and complete inference
system I for implication analysis of CFDs
Sound if ? - ?, i.e., ? can be proved from ?
using I, then ? ?
Complete if ? ?, then ? - ? using I
The inference system is more involved than its
counterpart for traditional FDs, namely,
Armstrongs axioms.
There are 5 axioms.
A normal form of CFDs (R X ? A, tp), tp is a
single pattern tuple.

20
Axioms for CFDs transitivity

Transitivity if (A1,,Ak ? B1,,Bm, tp)

and (B1,,Bm ? C1,,Cn, tp)
match
(A1,,Ak ? C1,,Cn, tp)
21
Axioms for CFDs reduction

reduction if (B, X ? A, tp), tpB _,
and tpA a

then (X ? A, tp)
22
Static analyses CFD vs. FD

General setting

in the absence of finite-domain attributes

complications finite-domain attributes

23

Conditional Constraints for Data Cleaning

Conditional functional dependencies (CFDs)
Syntax and semantics
Static analysis consistency and implication,
axiom system
SQL techniques for inconsistency detection and
incremental detection
Conditional inclusion dependencies (CINDs)
Syntax and semantics
Static analysis consistency and implication
Matching dependencies for record matching (MDs)
Syntax and semantics
Relative candidate keys

24
Example Amazon database

Schema
order(asin, title, type, price, country, county)
-- source
book(asin, isbn, title, price, format)
-- target
CD(asin, title, price, genre)
asin Amazon standard identification number
Instances

order
book
CD
25
Schema matching

Inclusion dependencies from source to target
(e.g., Clio)

Do these make sense?

Traditional inclusion dependencies
orderasin, title, price ? bookasin, title,
price
orderasin, title, price ? CDasin, title,
price
These inclusion dependencies do not make sense!

26
Schema matching dependencies with conditions

Conditional inclusion dependencies
orderasin, title, price type book ?
bookasin, title, price
orderasin, title, price type CD ? CDasin,
title, price
orderasin, title, price ? bookasin, title,
price holds only if type book
orderasin, title, price ? CDasin, title,
price holds only if type CD
The constraints do not hold on the entire order
table

27
Date cleaning with conditional dependencies

CIND1 orderasin, title, price type book
? bookasin, title, price
CIND2 orderasin, title, price type CD ?
CDasin, title, price
Tuple t1 violates CIND1
Tuple t2 violates CIND2, why?

order
CD
book
28
More on data cleaning
CD
book

CDasin, title, price genre a-book ?
bookasin, title, price format audio
Inclusion dependency CDasin, title, price ?
bookasin, title, price holds only if genre
a-book, i.e., when the CD is an audio book
In addition, the format of the corresponding book
must be audio a pattern for the referenced tuple

And what?
29

Conditional Inclusion Dependencies (CINDs)

(R1X Xp ? R2Y Yp, Tp)
R1X ? R2Y embedded traditional IND from R1
to R2
Tp a pattern tableau
attributes Xp ? Yp
tuples in Tp consist of constants and unnamed
variable _
Example express
CIND1 orderasin, title, price type book
? bookasin, title, price
(orderasin, title, price type ? bookasin,
title, price nil, Tp)
nil empty list
pattern tableau Tp

30

Traditional CINDs as a special case

R1X ? R2Y
X A1, , An
Y B1, , Bn
As a CIND (R1X nil ? R2Y nil, Tp)
What is the pattern tableau?
pattern tableau Tp a single tuple ( )

CINDs subsume traditional INDs
31

Exercise

Express the following as CINDs
CIND2 orderasin, title, price type CD ?
CDasin, title, price
CIND3 CDasin, title, price genre a-book
? bookasin, title, price format audio

(orderasin, title, price type ? CDasin,
title, price nil, Tp)
(CDasin, title, price genre ? bookasin,
title, price format, Tp)

32

Semantics of CINDs

DB (DB1, DB2), where DBj is an instance of Rj,
j 1, 2.
DB satisfies (R1X Xp ? R2Y Yp, Tp) iff for
any tuples t1 in DB1, and any tuple tp in the
pattern tableau Tp, if t1Xp ? tpXp, then
there exists t2 in DB2 such that
t1Y t2Y (traditional IND semantics)
t2Yp ? tpYp (matching the pattern tuple on
Y, Yp)
Patterns
t1Xp ? tpXp identifying the set of R1
tuples on which tp applies t1 t1Xp ?
tpXp
t2Yp ? tpYp enforcing the embedded IND and
the constraint specified by patterns Yp

33

Example

(CDasin, title, price genre ? bookasin,
title, price format, Tp)
The following DB satisfies the CIND

book
CD
34
Exercise

CIND1 (orderasin, title, price type ?
bookasin, title, price nil, Tp)
The following DB violates CIND1. Why?

order
book
CD
35
The satisfiability problem for CINDs

The consistency problem for CINDs is to
determine, given a set ? of CINDs, whether or not
there exists a nonempty database DB that
satisfies ?, i.e., for any ? in ?, DB satisfies
?.
Recall
Any set of traditional INDs is always consistent!
For CFDs, the satisfiability problem is
intractable.
In contrast.
Theorem. Any set of CINDs is always consistent!
Despite the increased expressive power, the
complexity of the satisfiability analysis does
not go up.

36
The implication problem for CINDs

The implication problem for CINDs is to decide,
given a set ? of CINDs and a single CIND ?,
whether ? implies ? (? ?).
For traditional INDs, the implication problem is
PSPACE-complete
For CINDs, the complexity does not hike up, to an
extent
Theorem. For CINDs containing no finite-domain
attributes, the implication problem is
PSPACE-complete
In the general setting, however, we have to pay a
price
Theorem. The implication problem for CINDs is
EXPTIME-complete

37
Finite axiomatizability of CINDs

Rules for inferring IND implication
Reflexivity If RX ? RX
Projection and Permutation If R1A1, , Ak ?
R2B1, , Bk, then R1Ai1, , Aik ? R2Bi1, ,
Bik,
Transitivity If R1X ? R2Y and R2Y ? R3Z,
then R1X ? R3Z
Sound and complete for IND implication
CINDs retain the finite axiomatizability
Theorem. There is a sound and complete inference
system for implication analysis of CINDs
There are 8 axioms.

38
Inference rules for CINDs

Normal form of CINDs (R1X Xp ? R2Y Yp,
tp),
tp is a single pattern tuple
tpA is a constant iff A is in Xp or Yp (tpB
_ if B is in X or Y)
Inference rules

Reflexivity (RX nil ? RX nil, tp), where
tp ( )

Projection and permutation If (R1X Xp ?
R2Y Yp, tp), then (R1X Xp ? R2Y Yp,
tp), for any permutation of X, Xp

tp
tp
39
Axioms for CINDs transitivity

Transitivity if (R1X Xp ? R2Y Yp, tp),

and (R2Y Yp ? R3Z Zp, tp),
equal
(R1X Xp ? R3Z Zp, tp)
40
Axioms for CINDs augmentation

augmentation if (R1X Xp ? R2Y Yp, tp), A
? attr(R1),

(R1X Xp, A ? R2Y Yp, tp)
41
Static analyses CIND vs. IND

General setting

in the absence of finite-domain attributes

CINDs retain most complexity bounds of their
traditional counterpart
42
CFDs and CINDs taken together

We need both CFDs and CINDs for
data cleaning
schema matching
Theorem. The implication problem for CFDs and
CINDs is undecidable
Not surprising The implication problem for
traditional FDs and INDs is already undecidable
Theorem. The consistency problem for CFDs and
CINDs is undecidable
In contrast, any set of traditional FDs and INDs
is consistent!

43
Static analyses CFD CIND vs. FD IND

CINDs and CFDs properly subsume FDs and INDs
Both the satisfiability analysis and
implication analysis are
beyond reach in practice
This calls for effective heuristic methods

44

Conditional Constraints for Data Cleaning

Conditional functional dependencies (CFDs)
Syntax and semantics
Static analysis consistency and implication,
axiom system
SQL techniques for inconsistency detection and
incremental detection
Conditional inclusion dependencies (CINDs)
Syntax and semantics
Static analysis consistency and implication
Matching dependencies for record matching (MDs)
Syntax and semantics
Relative candidate keys

45
Record matching
To identify tuples from one or more unreliable
sources that refer to the same real-world object.

Nontrivial
Real-life data is often dirty errors in the
data sources
Data is often represented differently in
different sources

the same person?
Pairwise comparison of attributes via equality
only does not work!
Record linkage, entity resolution, data
deduplication, merge/purge,
45
46
Matching rules (Hernndez Stolfo, 1995)
IF cardLN, address transLN, post AND
cardFN and transFN are similar, THEN identify
the two tuples
?

card
Match
trans
Accommodate errors in the data sources
46
47
Dependencies for record matching
cardLN, address transLN, post ? cardFN ?
transFN ? cardX ? transY
cardtel transphn ? cardaddress ?
transpost
Identifying attributes (not necessarily entire
records), across sources
X
card
Y
trans
2(mn) configurations
What attributes to compare? How to compare them?
47
48
Deducing new dependencies from given rules
cardLN,address transLN,post ? cardFN ?
transFN ? cardX ? transY
cardtel transphn ? cardaddress ?
transpost
deduction
cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY
card
Radically different
Match
trans
Matched by the deduced rule, but NOT by the given
ones!
48
49
Matching dependencies (MDs)
(R1A1 ?1 R2B1 ? . . . ? R1Ak ?k R2Bk)
? R1Z1 ? R2Z2

R1X, R2Y entities to be identified
(Z1, Z2) lists of attributes in (X, Y), of the
same length
?j similarity operator (edit distance, q-gram,
jaro distance, )
? matching operator (identify two lists of
attributes via updates)

R1X cardFN, LN, address , R2Y
transFN, LN, post
cardLN, address transLN, post ? cardFN
? transFN ? cardX ? transY
cardtel transphn ? cardaddress ?
transpost
cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY

tel and phn are not part of X, Y
Semantic relationship on attributes across
different sources
49
50
Dynamic semantics
? (R1A1 ?1 R2B1 ? . . . ? R1Ak ?k
R2Bk) ? R1Z1 ? R2Z2

(D1, D2) satisfies ? iff for all (t1, t2) ? D1,
if t1A1 ?1 t2B1 ? . . . ? t1Ak ?k t2Bk
in D1
then (t1, t2) ? D2, and t1Z1 t2Z2 in D2
If (t1, t2) match the LHS, then their RHS are
updated and equalized

Different from FDs?
D1
D2
Two instances are needed to cope with the dynamic
semantics
50
51
An extension of functional dependencies (FDs)?
MD (R1A1 ?1 R2B1 ? . . . ? R1Ak ?k
R2Bk) ? R1Z1 ? R2Z2
developed for schema design for clean data
FD tel ? address
to accommodate unreliable data

similarity operators vs. equality () only
across different relations (R1, R2) vs. on a
single relation
dynamic semantic (matching operator ?) vs. static
semantics

violation of the FD
satisfying the MD
D1
D2
A departure from traditional dependency theory
51
52
Deduction of new MDs from given MDs
Given a set ? of MDs and a single ?, can ? be
deduced from ? ?

For all (D1, D2) if
(D1, D2) satisfies ? and
(D2, D2) satisfies ?
then (D1, D2) satisfies ?

fixpoint reached by enforcing ?
? is a logical consequence of ?
D1
D2
No matter how ? is interpreted, if ? is enforced,
so is ?
Example deduction of ? from ?1, ?2, where
? cardLN, tel transLN, phn ? cardFN
? transFN ? cardX ? transY
?1 cardtel transphn ? cardaddress ?
transpost ?2 cardLN,address
transLN,post ? cardFN ? transFN ? cardX ?
transY
Different from our familiar notion of implication
52
53
An inference system for MDs
Recall Armstrongs axioms for FDs
There is a finite set of axioms sound and
complete for MD deduction
Example MD ? is provable from ?1, ?2 by using
the inference system
?1 cardtel transphn ? cardaddress ?
transpost
Augmentation Rule
cardLN, tel transLN, phn ? cardLN,
address ? transLN, post
?2 cardLN,address transLN,post ? cardFN
? transFN ? cardX ? transY
Transitivity Rule
? cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY

More involved than Armstrongs axioms (11 axioms
vs. 3)
two relations, generic reasoning for similarity
operators

53
54
Relative Candidate Keys (RCKs)
relative to R1X and R2Y
Ultimate goal to decide whether R1X and R2Y
refer to the same object
(R1A1 ?1 R2B1 ? . . . ? R1Ak ?k R2Bk)
? R1X ? R2Y (R1A1, , Ak, R2B1, ,
Bk ?1 , . . ., ?k)
what to compare and how to compare

R1X cardX , R2Y transY
cardLN,address transLN,post ? cardFN ?
transFN ? cardX?transY ? (cardLN,
address, FN, transLN, post, FN , , ?)
cardtel transphn ? cardaddress ?
transpost NOT an RCK
cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY
? (cardLN, tel, FN, transLN, phn, FN
, , ?)

A departure from candidate keys similarity,
different sources
54
55
What is special about RCKs?
RCKs can be deduced from matching dependencies

Matching rules identify records from unreliable
data sources

Optimization efficiency is a big issue for
record matching
blocking

only records in the same block are compared
B1
D
B2
discriminating attributes
B3

windowing (sorted neighborhood)

window of a fixed size only records in the same
window are compared
D
D
sliding window
sorting via keys
The match quality is highly dependent on the
choices of keys
55
56
Summary and review

What are CFDs? CINDs? Why do we need new
constraints?
What is the consistency problem? Complexity?
What is the implication problem? Inference
system? Sound and complete?
What is record matching? Why bother?
What are matching rules?
Why matching dependencies and relative candidate
keys? Why are they different from functional
dependencies and relational keys?
What are blocking and windowing?
Projects
CFDs across different relations?
Negative rules for MDs if condition then NO
match
Conditions for MDs if ? then match

Write a Comment

User Comments (0)

About PowerShow.com

Schema: Cust(country, area-code, phone, street, city, zip - PowerPoint PPT Presentation

Schema: Cust(country, area-code, phone, street, city, zip

Schema: Cust(country, area-code, phone, street, city, zip) Instance: 07974. NYC. Mountain Ave ... zip. city. street. phone. area-code. country. functional ... – PowerPoint PPT presentation