Schema: Cust(country, area-code, phone, street, city, zip - PowerPoint PPT Presentation

About This Presentation
Title:

Schema: Cust(country, area-code, phone, street, city, zip

Description:

Schema: Cust(country, area-code, phone, street, city, zip) Instance: 07974. NYC. Mountain Ave ... zip. city. street. phone. area-code. country. functional ... – PowerPoint PPT presentation

Number of Views:113
Avg rating:3.0/5.0
Slides: 57
Provided by: homepage7
Category:
Tags: area | city | code | country | cust | phone | schema | street | zip

less

Transcript and Presenter's Notes

Title: Schema: Cust(country, area-code, phone, street, city, zip


1

TDD Research Topics in Distributed Databases
  • Dependencies for improving data quality
  • Conditional functional dependencies (CFDs)
  • Syntax and semantics
  • Static analysis consistency and implication,
    axiom system
  • Conditional inclusion dependencies (CINDs)
  • Syntax and semantics
  • Static analysis consistency and implication
  • Matching dependencies for record matching (MDs)
  • Syntax and semantics
  • Relative candidate keys

2
Characterizing the consistency of data
  • One of the central technical problems for data
    consistency is how to tell whether the data is
    dirty or clean
  • Integrity constraints (data dependencies) as data
    quality rules
  • Inconsistencies emerge as violations of
    constraints
  • Traditional dependencies
  • functional dependencies
  • inclusion dependencies
  • denial constraints (a special case of full
    dependencies)
  • . . .
  • Question are these traditional dependencies
    sufficient?

3
Example customer relation
  • Schema Cust(country, area-code, phone, street,
    city, zip)
  • Instance
  • functional dependencies (FDs)
  • custcountry, area-code, phone ? custstreet,
    city, zip
  • custcountry, area-code ? custcity
  • The database satisfies the FDs. Is the data
    consistent?

4
Capturing inconsistencies in the data
  • cust (country 44, zip ? street)
  • In the UK, zip code uniquely determines the
    street
  • The constraint may not hold for other countries
  • It expresses a fundamental part of the semantics
    of the data
  • It can NOT be expressed as a traditional FD
  • It does not hold on the entire relation instead,
    it holds on tuples representing UK customers only

5
Two more constraints
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • In the UK, if the area code is 131, then the city
    has to be EDI
  • In the US, if the area code is 908, then the city
    has to be MH
  • t1, t2 and t3 violate these constraints
  • refining cust(country, area-code, phno ?
    street, city, zip)
  • combining data values and variables

6
The need for new constraints
  • cust(country 44, zip ? street)
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • They capture inconsistencies that traditional FDs
    cannot detect
  • Traditional constraints were developed for
    schema design, not for data cleaning!
  • Data integration in real-life source constraints
  • hold on a subset of sources
  • hold conditionally on the integrated data
  • They are NOT expressible as traditional FDs
  • do not hold on the entire relation
  • contain constant data values, besides logical
    variables

7

Conditional Functional Dependencies (CFDs)
  • An extension of traditional FDs (R X ? Y, Tp)
  • X ? Y embedded traditional FD on R
  • Tp a pattern tableau
  • attributes X ? Y
  • each tuple in Tp consists of constants and
    unnamed variable _
  • Example cust(country 44, zip ? street)
  • (cust (country, zip ? street), Tp)
  • pattern tableau Tp

8

Example CFDs
  • cust(country 44, area-code 131, phone ?
    street, zip, city EDI)
  • cust(country 01, area-code 908, phone ?
    street, zip, city MH)
  • cust(country, area-code, phone ? street,
    city, zip)
  • as a SINGLE CFD
  • (cust(country, area-code, phone ? street, city,
    zip), Tp)
  • pattern tableau Tp one tuple for each constraint

CFDs subsume traditional FDs. Why?
9

Traditional FDs as a special case
  • Express
  • custcountry, area-code ? custcity
  • as a CFD
  • (cust(country, area-code, ? city), Tp)
  • pattern tableau Tp a single tuple consisting of
    _ only

10

Semantics of CFDs
  • a ? b (a matches b) if
  • either a or b is _
  • both a and b are constants and a b
  • tuple t1 matches t2 t1 ? t2
  • (a, b) ? (a, _), but (a, b) does not match (a,
    c)
  • DB satisfies (R X ? Y, Tp) iff for any tuple tp
    in the pattern tableau Tp and for any tuples t1,
    t2 in DB, if t1X t2X ? tpX, then t1Y
    t2Y ? tpY
  • tpX identifying the set of tuples on which the
    constraint tp applies, ie, t tX ? tpX
  • t1Y t2Y ? tpY enforcing the embedded FD,
    and the pattern of tp

11

Example violation of CFDs
  • cust(country 44, zip ? street)
  • Tuples t1 and t2 violate the CFD
  • t1country, zip t2country, zip ? tpcountry,
    zip
  • t1street ? t2street
  • The CFD applies to t1 and t2 since they match
    tpcountry, zip

CFDs enforcing binding of semantically related
data values
12

Violation of CFDs by a single tuple
  • (cust(country, area-code ? city), Tp)
  • Tuple t1 does not satisfy the CFD
  • t1country, area-code t1country, area-code ?
    tp1country, area-code
  • t1city t1city however, t1city does not
    match tp1city
  • In contrast to traditional FDs, a single tuple
    may violate a CFD

13

Exercise
  • (cust(country, area-code, phno ? street, city,
    zip), Tp)
  • Tuple t3 violates the CFD. Why?
  • Tuples t1 and t4 violate the CFD. Why?

14
Dirty constraints?
  • A set of CFDs may be inconsistent!
  • Inconsistent (R(A ? B), Tp)
  • In any nonempty database DB and for any tuple t
    in DB,
  • tp1 tB must be b
  • tp2 tB must be c
  • Inconsistent if b and c are different
  • inconsistent ? ?1, ?2 , ?1 (R(A ? B),
    Tp1), ?2 (R(B ? A), Tp2)
  • Why?

Tp
15
The consistency problem
  • The consistency problem for CFDs is to determine,
    given a set ? of CFDs, whether or not there
    exists a nonempty database DB that satisfies ?,
    i.e., for any ? in ?, DB satisfies ?.
  • Whether or not ? makes sense
  • For traditional FDs, the consistency problem is
    not an issue one can specify any FDs without
    worrying about their consistency
  • A set of CFDs may be inconsistent!
  • Theorem. The consistency problem for CFDs is
    NP-complete.
  • Nontrivial contrast this with the trivial
    consistency analysis of FDs!

16
The implication problem
  • The implication problem for CFDs is to determine,
    given a set ? of CFDs and a single CFD ?, whether
    ? implies ?, denoted by ? ?, i.e., for any
    database DB, if DB satisfies ?, then DB satisfies
    ?.
  • Example
  • ? ?1, ?2 , ?1 (R(A ? B), Tp1), ?2
    (R(B ? C), Tp2)
  • ? (R(A ? C), Tp)
  • ? ?.

Tp1
Tp2
17
The complexity of the implication problem
  • For traditional FDs, the implication problem is
    in linear time
  • In contrast, the implication problem for CFDs is
    intractable
  • Theorem. The implication problem for CFDs is
    coNP-complete.

Question how about constant CFDs (without
wildcard)? Would it simplify the consistency and
implication analyses?
The expressive power of CFDs comes at a price
18
Finite axiomatizability Flashback
  • Armstrongs axioms can be found in every database
    textbook
  • Reflexivity If Y ? X, then X ? Y
  • Augmentation If X ? Y , then XZ ? YZ
  • Transitivity If X ? Y and Y ? Z, then X ? Z
  • Sound and complete for FD implication, i.e, ?
    ? iff ? can be inferred ? from using
    reflexivity, augmentation, transitivity.
  • Question is there a sound and complete inference
    system for the implication analysis of CFDs?

19
Finite axiomatizability of CFDs
  • Theorem. There is a sound and complete inference
    system I for implication analysis of CFDs
  • Sound if ? - ?, i.e., ? can be proved from ?
    using I, then ? ?
  • Complete if ? ?, then ? - ? using I
  • The inference system is more involved than its
    counterpart for traditional FDs, namely,
    Armstrongs axioms.
  • There are 5 axioms.
  • A normal form of CFDs (R X ? A, tp), tp is a
    single pattern tuple.

20
Axioms for CFDs transitivity
  • Transitivity if (A1,,Ak ? B1,,Bm, tp)

and (B1,,Bm ? C1,,Cn, tp)
match
(A1,,Ak ? C1,,Cn, tp)
21
Axioms for CFDs reduction
  • reduction if (B, X ? A, tp), tpB _,
    and tpA a

then (X ? A, tp)
22
Static analyses CFD vs. FD
  • General setting
  • in the absence of finite-domain attributes
  • complications finite-domain attributes

23

Conditional Constraints for Data Cleaning
  • Conditional functional dependencies (CFDs)
  • Syntax and semantics
  • Static analysis consistency and implication,
    axiom system
  • SQL techniques for inconsistency detection and
    incremental detection
  • Conditional inclusion dependencies (CINDs)
  • Syntax and semantics
  • Static analysis consistency and implication
  • Matching dependencies for record matching (MDs)
  • Syntax and semantics
  • Relative candidate keys

24
Example Amazon database
  • Schema
  • order(asin, title, type, price, country, county)
    -- source
  • book(asin, isbn, title, price, format)
    -- target
  • CD(asin, title, price, genre)
  • asin Amazon standard identification number
  • Instances

order
book
CD
25
Schema matching
  • Inclusion dependencies from source to target
    (e.g., Clio)

Do these make sense?
  • Traditional inclusion dependencies
  • orderasin, title, price ? bookasin, title,
    price
  • orderasin, title, price ? CDasin, title,
    price
  • These inclusion dependencies do not make sense!

26
Schema matching dependencies with conditions
  • Conditional inclusion dependencies
  • orderasin, title, price type book ?
    bookasin, title, price
  • orderasin, title, price type CD ? CDasin,
    title, price
  • orderasin, title, price ? bookasin, title,
    price holds only if type book
  • orderasin, title, price ? CDasin, title,
    price holds only if type CD
  • The constraints do not hold on the entire order
    table

27
Date cleaning with conditional dependencies
  • CIND1 orderasin, title, price type book
    ? bookasin, title, price
  • CIND2 orderasin, title, price type CD ?
    CDasin, title, price
  • Tuple t1 violates CIND1
  • Tuple t2 violates CIND2, why?

order
CD
book
28
More on data cleaning
CD
book
  • CDasin, title, price genre a-book ?
    bookasin, title, price format audio
  • Inclusion dependency CDasin, title, price ?
    bookasin, title, price holds only if genre
    a-book, i.e., when the CD is an audio book
  • In addition, the format of the corresponding book
    must be audio a pattern for the referenced tuple

And what?
29

Conditional Inclusion Dependencies (CINDs)
  • (R1X Xp ? R2Y Yp, Tp)
  • R1X ? R2Y embedded traditional IND from R1
    to R2
  • Tp a pattern tableau
  • attributes Xp ? Yp
  • tuples in Tp consist of constants and unnamed
    variable _
  • Example express
  • CIND1 orderasin, title, price type book
    ? bookasin, title, price
  • (orderasin, title, price type ? bookasin,
    title, price nil, Tp)
  • nil empty list
  • pattern tableau Tp

30

Traditional CINDs as a special case
  • R1X ? R2Y
  • X A1, , An
  • Y B1, , Bn
  • As a CIND (R1X nil ? R2Y nil, Tp)
  • What is the pattern tableau?
  • pattern tableau Tp a single tuple ( )

CINDs subsume traditional INDs
31

Exercise
  • Express the following as CINDs
  • CIND2 orderasin, title, price type CD ?
    CDasin, title, price
  • CIND3 CDasin, title, price genre a-book
    ? bookasin, title, price format audio
  • (orderasin, title, price type ? CDasin,
    title, price nil, Tp)
  • (CDasin, title, price genre ? bookasin,
    title, price format, Tp)

32

Semantics of CINDs
  • DB (DB1, DB2), where DBj is an instance of Rj,
    j 1, 2.
  • DB satisfies (R1X Xp ? R2Y Yp, Tp) iff for
    any tuples t1 in DB1, and any tuple tp in the
    pattern tableau Tp, if t1Xp ? tpXp, then
    there exists t2 in DB2 such that
  • t1Y t2Y (traditional IND semantics)
  • t2Yp ? tpYp (matching the pattern tuple on
    Y, Yp)
  • Patterns
  • t1Xp ? tpXp identifying the set of R1
    tuples on which tp applies t1 t1Xp ?
    tpXp
  • t2Yp ? tpYp enforcing the embedded IND and
    the constraint specified by patterns Yp

33

Example
  • (CDasin, title, price genre ? bookasin,
    title, price format, Tp)
  • The following DB satisfies the CIND

book
CD
34
Exercise
  • CIND1 (orderasin, title, price type ?
    bookasin, title, price nil, Tp)
  • The following DB violates CIND1. Why?

order
book
CD
35
The satisfiability problem for CINDs
  • The consistency problem for CINDs is to
    determine, given a set ? of CINDs, whether or not
    there exists a nonempty database DB that
    satisfies ?, i.e., for any ? in ?, DB satisfies
    ?.
  • Recall
  • Any set of traditional INDs is always consistent!
  • For CFDs, the satisfiability problem is
    intractable.
  • In contrast.
  • Theorem. Any set of CINDs is always consistent!
  • Despite the increased expressive power, the
    complexity of the satisfiability analysis does
    not go up.

36
The implication problem for CINDs
  • The implication problem for CINDs is to decide,
    given a set ? of CINDs and a single CIND ?,
    whether ? implies ? (? ?).
  • For traditional INDs, the implication problem is
    PSPACE-complete
  • For CINDs, the complexity does not hike up, to an
    extent
  • Theorem. For CINDs containing no finite-domain
    attributes, the implication problem is
    PSPACE-complete
  • In the general setting, however, we have to pay a
    price
  • Theorem. The implication problem for CINDs is
    EXPTIME-complete

37
Finite axiomatizability of CINDs
  • Rules for inferring IND implication
  • Reflexivity If RX ? RX
  • Projection and Permutation If R1A1, , Ak ?
    R2B1, , Bk, then R1Ai1, , Aik ? R2Bi1, ,
    Bik,
  • Transitivity If R1X ? R2Y and R2Y ? R3Z,
    then R1X ? R3Z
  • Sound and complete for IND implication
  • CINDs retain the finite axiomatizability
  • Theorem. There is a sound and complete inference
    system for implication analysis of CINDs
  • There are 8 axioms.

38
Inference rules for CINDs
  • Normal form of CINDs (R1X Xp ? R2Y Yp,
    tp),
  • tp is a single pattern tuple
  • tpA is a constant iff A is in Xp or Yp (tpB
    _ if B is in X or Y)
  • Inference rules
  • Reflexivity (RX nil ? RX nil, tp), where
    tp ( )
  • Projection and permutation If (R1X Xp ?
    R2Y Yp, tp), then (R1X Xp ? R2Y Yp,
    tp), for any permutation of X, Xp

tp
tp
39
Axioms for CINDs transitivity
  • Transitivity if (R1X Xp ? R2Y Yp, tp),

and (R2Y Yp ? R3Z Zp, tp),
equal
(R1X Xp ? R3Z Zp, tp)
40
Axioms for CINDs augmentation
  • augmentation if (R1X Xp ? R2Y Yp, tp), A
    ? attr(R1),

(R1X Xp, A ? R2Y Yp, tp)
41
Static analyses CIND vs. IND
  • General setting
  • in the absence of finite-domain attributes

CINDs retain most complexity bounds of their
traditional counterpart
42
CFDs and CINDs taken together
  • We need both CFDs and CINDs for
  • data cleaning
  • schema matching
  • Theorem. The implication problem for CFDs and
    CINDs is undecidable
  • Not surprising The implication problem for
    traditional FDs and INDs is already undecidable
  • Theorem. The consistency problem for CFDs and
    CINDs is undecidable
  • In contrast, any set of traditional FDs and INDs
    is consistent!

43
Static analyses CFD CIND vs. FD IND
  • CINDs and CFDs properly subsume FDs and INDs
  • Both the satisfiability analysis and
    implication analysis are
  • beyond reach in practice
  • This calls for effective heuristic methods

44

Conditional Constraints for Data Cleaning
  • Conditional functional dependencies (CFDs)
  • Syntax and semantics
  • Static analysis consistency and implication,
    axiom system
  • SQL techniques for inconsistency detection and
    incremental detection
  • Conditional inclusion dependencies (CINDs)
  • Syntax and semantics
  • Static analysis consistency and implication
  • Matching dependencies for record matching (MDs)
  • Syntax and semantics
  • Relative candidate keys

45
Record matching
To identify tuples from one or more unreliable
sources that refer to the same real-world object.
  • Nontrivial
  • Real-life data is often dirty errors in the
    data sources
  • Data is often represented differently in
    different sources

the same person?
Pairwise comparison of attributes via equality
only does not work!
Record linkage, entity resolution, data
deduplication, merge/purge,
45
46
Matching rules (Hernndez Stolfo, 1995)
IF cardLN, address transLN, post AND
cardFN and transFN are similar, THEN identify
the two tuples
?

card
Match
trans
Accommodate errors in the data sources
46
47
Dependencies for record matching
cardLN, address transLN, post ? cardFN ?
transFN ? cardX ? transY
cardtel transphn ? cardaddress ?
transpost
Identifying attributes (not necessarily entire
records), across sources
X
card
Y
trans
2(mn) configurations
What attributes to compare? How to compare them?
47
48
Deducing new dependencies from given rules
cardLN,address transLN,post ? cardFN ?
transFN ? cardX ? transY
cardtel transphn ? cardaddress ?
transpost
deduction
cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY
card
Radically different
Match
trans
Matched by the deduced rule, but NOT by the given
ones!
48
49
Matching dependencies (MDs)
(R1A1 ?1 R2B1 ? . . . ? R1Ak ?k R2Bk)
? R1Z1 ? R2Z2
  • R1X, R2Y entities to be identified
  • (Z1, Z2) lists of attributes in (X, Y), of the
    same length
  • ?j similarity operator (edit distance, q-gram,
    jaro distance, )
  • ? matching operator (identify two lists of
    attributes via updates)
  • R1X cardFN, LN, address , R2Y
    transFN, LN, post
  • cardLN, address transLN, post ? cardFN
    ? transFN ? cardX ? transY
  • cardtel transphn ? cardaddress ?
    transpost
  • cardLN, tel transLN, phn ? cardFN ?
    transFN ? cardX ? transY

tel and phn are not part of X, Y
Semantic relationship on attributes across
different sources
49
50
Dynamic semantics
? (R1A1 ?1 R2B1 ? . . . ? R1Ak ?k
R2Bk) ? R1Z1 ? R2Z2
  • (D1, D2) satisfies ? iff for all (t1, t2) ? D1,
  • if t1A1 ?1 t2B1 ? . . . ? t1Ak ?k t2Bk
    in D1
  • then (t1, t2) ? D2, and t1Z1 t2Z2 in D2
  • If (t1, t2) match the LHS, then their RHS are
    updated and equalized

Different from FDs?
D1
D2
Two instances are needed to cope with the dynamic
semantics
50
51
An extension of functional dependencies (FDs)?
MD (R1A1 ?1 R2B1 ? . . . ? R1Ak ?k
R2Bk) ? R1Z1 ? R2Z2
developed for schema design for clean data
FD tel ? address
to accommodate unreliable data
  • similarity operators vs. equality () only
  • across different relations (R1, R2) vs. on a
    single relation
  • dynamic semantic (matching operator ?) vs. static
    semantics

violation of the FD
satisfying the MD
D1
D2
A departure from traditional dependency theory
51
52
Deduction of new MDs from given MDs
Given a set ? of MDs and a single ?, can ? be
deduced from ? ?
  • For all (D1, D2) if
  • (D1, D2) satisfies ? and
  • (D2, D2) satisfies ?
  • then (D1, D2) satisfies ?

fixpoint reached by enforcing ?
? is a logical consequence of ?
D1
D2
No matter how ? is interpreted, if ? is enforced,
so is ?
Example deduction of ? from ?1, ?2, where
? cardLN, tel transLN, phn ? cardFN
? transFN ? cardX ? transY
?1 cardtel transphn ? cardaddress ?
transpost ?2 cardLN,address
transLN,post ? cardFN ? transFN ? cardX ?
transY
Different from our familiar notion of implication
52
53
An inference system for MDs
Recall Armstrongs axioms for FDs
There is a finite set of axioms sound and
complete for MD deduction
Example MD ? is provable from ?1, ?2 by using
the inference system
?1 cardtel transphn ? cardaddress ?
transpost
Augmentation Rule
cardLN, tel transLN, phn ? cardLN,
address ? transLN, post
?2 cardLN,address transLN,post ? cardFN
? transFN ? cardX ? transY
Transitivity Rule
? cardLN, tel transLN, phn ? cardFN ?
transFN ? cardX ? transY
  • More involved than Armstrongs axioms (11 axioms
    vs. 3)
  • two relations, generic reasoning for similarity
    operators

53
54
Relative Candidate Keys (RCKs)
relative to R1X and R2Y
Ultimate goal to decide whether R1X and R2Y
refer to the same object
(R1A1 ?1 R2B1 ? . . . ? R1Ak ?k R2Bk)
? R1X ? R2Y (R1A1, , Ak, R2B1, ,
Bk ?1 , . . ., ?k)
what to compare and how to compare
  • R1X cardX , R2Y transY
  • cardLN,address transLN,post ? cardFN ?
    transFN ? cardX?transY ? (cardLN,
    address, FN, transLN, post, FN , , ?)
  • cardtel transphn ? cardaddress ?
    transpost NOT an RCK
  • cardLN, tel transLN, phn ? cardFN ?
    transFN ? cardX ? transY
  • ? (cardLN, tel, FN, transLN, phn, FN
    , , ?)

A departure from candidate keys similarity,
different sources
54
55
What is special about RCKs?
RCKs can be deduced from matching dependencies
  • Matching rules identify records from unreliable
    data sources
  • Optimization efficiency is a big issue for
    record matching
  • blocking

only records in the same block are compared
B1
D
B2
discriminating attributes
B3
  • windowing (sorted neighborhood)

window of a fixed size only records in the same
window are compared
D
D
sliding window
sorting via keys
The match quality is highly dependent on the
choices of keys
55
56
Summary and review
  • What are CFDs? CINDs? Why do we need new
    constraints?
  • What is the consistency problem? Complexity?
  • What is the implication problem? Inference
    system? Sound and complete?
  • What is record matching? Why bother?
  • What are matching rules?
  • Why matching dependencies and relative candidate
    keys? Why are they different from functional
    dependencies and relational keys?
  • What are blocking and windowing?
  • Projects
  • CFDs across different relations?
  • Negative rules for MDs if condition then NO
    match
  • Conditions for MDs if ? then match
Write a Comment
User Comments (0)
About PowerShow.com