1
Relational Model of Data over Domains with Similarities:
An Extension for Similarity Queries and Knowledge Extraction
  • Radim Belohlavek, Vilem Vychodil, Stanislav Opichal
  • Dept. Computer Science, Palacky University,
    Olomouc, Czech Republic

Belohlavek, Vychodil, Opichal (Palacky University)
Relational Model of Data over Domains with
IDA 2007
2
Outline
  • problem setting: introducing the extended Codd's
    model
  • preliminaries from fuzzy logic
  • functional dependencies (as an example of data
    dependencies): Armstrong axioms and completeness,
    entailment and non-redundant bases, computation
    of bases
  • relational algebra and calculus
  • practical issues
  • further issues, future research

3
Problem setting
  • Our paper
  • contribution to an extension of Codd's relational
    model
  • extension concerns uncertainty (imprecision)
  • Abiteboul S. et al.: The Lowell database research
    self-assessment. Comm. ACM 48(5) (2005),
    111-118: management of uncertainty in the
    foundations of databases
  • extension
  • provides a framework for approximate matches and
    related issues (similarity queries, similarity
    join, . . . ), in contrast to the exact matches of the
    classical model
  • we add
  • similarity relations on domains
  • ranks assigned to tuples
  • in this talk
  • data dependencies
  • relational algebra and calculus
  • practical issues

4
Problem setting (cntd.)
  • Our extension of Codd's model
  • (ranked) data tables over domains with
    similarities

ranked table: an answer to a similarity-based query
5
Problem setting (cntd.)
  • Related work
  • extensions of Codd's model employing fuzzy logic
  • several approaches, many papers
  • Raju, Majumdar: Fuzzy functional dependencies and
    lossless join decomposition of fuzzy relational
    database systems. ACM Trans. Database Systems
    Vol. 13, No. 2, 1988, pp. 129-166.
  • extensions of Codd's model employing probability
  • different both semantically and technically
    (probability ≠ fuzzy logic)
  • Fuhr, Rölleke: A probabilistic relational algebra
    for the integration of information retrieval and
    database systems. ACM Trans. Information Systems
    15:32-66, 1997.
  • D. Dey and S. Sarkar: A probabilistic
    relational model and algebra. ACM Trans. Database
    Syst. 21:339-369, 1996.

6
Problem setting (cntd.)
  • related work
  • Fagin et al.
  • R. Fagin: Combining fuzzy information: an
    overview. ACM SIGMOD Record 31(2):109-118, 2002.
  • Natsev, Chang, Smith, Li, Vitter: Supporting
    incremental join queries on ranked inputs. VLDB
    2001, pp. 281-290.
  • Cohen, Sagiv: An incremental algorithm for
    computing ranked full disjunctions. PODS 2005,
    pp. 98-107.
  • RankSQL: related research
  • Li, Chang, Ilyas, Song: RankSQL: Query Algebra
    and Optimization for Relational top-k
    queries. ACM SIGMOD 2005, pages 131-142, 2005.
  • Ilyas, Aref, Elmagarmid: Supporting top-k join
    queries in relational databases. The VLDB Journal
    13:207-221, 2004.

7
Preliminaries from fuzzy logic
  • fuzzy logic: invented by Zadeh; a simple calculus
    for handling vagueness
  • Zadeh L. A.: Fuzzy sets. Inf. Control (1965).
  • basic principle: allows propositions to have
    intermediate truth degrees
  • instead of just 0 (false) and 1 (true), e.g.
  • "John is tall.": 0.9, "A is similar to
    B.": 0.7
  • developed since the late 1960s
  • for a long time no firm logical foundations, ad
    hoc approaches, many results of low quality
  • logical foundations developed in the late 1990s,
    monographs available

8
Preliminaries: structures of truth degrees
  • classical logic: two-element Boolean algebra,
    given by
  • set $\{0, 1\}$ of truth degrees
  • (truth functions of) logical connectives
    (conjunction, implication, . . . )
  • fuzzy logic: several possibilities, a general
    one: complete residuated lattice, given by
  • (partially ordered) set $L$ of truth degrees, e.g.
    $L = [0, 1]$, $L = \{0, 0.1, 0.2, \ldots, 1\}$,
    non-linearly ordered $L$, . . .
  • (truth functions of) logical connectives (conj.
    $\otimes$, impl. $\rightarrow$, . . . )
  • Complete residuated lattice: basic structure of
    truth degrees $\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1 \rangle$, where
  • $\langle L, \wedge, \vee, 0, 1 \rangle$ is a complete lattice,
  • $\langle L, \otimes, 1 \rangle$ is a commutative monoid,
  • $\langle \otimes, \rightarrow \rangle$ is an adjoint pair ($a \otimes b \leq c$ iff $a \leq b \rightarrow c$).
  • details in proceedings
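
The adjointness condition can be checked mechanically. The following is a minimal sketch, not taken from the slides: it assumes three commonly used structures on [0, 1] (Lukasiewicz, Goedel, product), and the function names and the finite grid of test degrees are chosen only for illustration. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

# Three standard (conjunction, residuum) pairs on L = [0, 1].
def luk_and(a, b):
    return max(Fraction(0), a + b - 1)

def luk_imp(a, b):
    return min(Fraction(1), 1 - a + b)

def godel_and(a, b):
    return min(a, b)

def godel_imp(a, b):
    return Fraction(1) if a <= b else b

def prod_and(a, b):
    return a * b

def prod_imp(a, b):
    # a > b >= 0 in the else-branch, so the division is safe.
    return Fraction(1) if a <= b else b / a

def is_adjoint(conj, imp, degrees):
    """Check: conj(a, b) <= c  iff  a <= imp(b, c), for all a, b, c in degrees."""
    return all((conj(a, b) <= c) == (a <= imp(b, c))
               for a, b, c in product(degrees, repeat=3))

# Hypothetical finite grid {0, 0.1, ..., 1}, used only for the check.
GRID = [Fraction(i, 10) for i in range(11)]

for name, conj, imp in [("Lukasiewicz", luk_and, luk_imp),
                        ("Goedel", godel_and, godel_imp),
                        ("product", prod_and, prod_imp)]:
    print(name, is_adjoint(conj, imp, GRID))  # prints True for each structure
```

Any of these pairs can serve as the conjunction and implication of a complete residuated lattice on [0, 1]; which one fits an application is a modelling choice.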

9
Our extension of Codd's model
  • (ranked) data tables over domains with
    similarities

ranked table: an answer to a similarity-based query
10
Functional dependencies
  • formulas
  • $A \Rightarrow B$ ($A, B \subseteq Y$, sets of attributes)
  • describing attribute dependencies, e.g.
  • {flight No.} $\Rightarrow$ {depart. time, arriv. time}
  • used in
  • knowledge extraction
  • data mining
  • formal concept analysis (attribute implications)
  • interpreted in tables with yes/no-attributes
  • knowledge extraction
  • relational databases (functional dependencies)
  • interpreted in DB relations (tables with general
    attributes)
  • data redundancy, normalization, DB design, . . .
  • knowledge extraction (Mannila, Räihä: Algorithms
    for inferring functional dependencies from
    relations. Data & Knowledge Eng. 12:83-99.)

11
Recalling functional dependencies (FDs) . . .
ordinary setting
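table D (the table itself is not preserved in this transcript)
$A \Rightarrow B$ is true in table D means: for any tuples
$x_1, x_2$: IF $x_1$ and $x_2$ agree on their values of
attributes from $A$ THEN $x_1$ and $x_2$ agree on their
values of attributes from $B$.
Example: $\{y_1, y_2\} \Rightarrow \{y_3\}$ is true in D,
$\{y_1\} \Rightarrow \{y_2\}$ is not ($x_2$, $x_4$ is a counterexample).
{flight No.} $\Rightarrow$ {departure time, arrival time}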
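
The ordinary definition translates directly into a pairwise check. The sketch below is illustrative only: the slide's table D did not survive in the transcript, so the four rows are hypothetical and merely chosen to reproduce the stated outcome (the dependency from y1, y2 to y3 holds; the one from y1 to y2 fails on x2 and x4).

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """True iff every two rows that agree on all of lhs also agree on all of rhs."""
    for r1, r2 in combinations(rows, 2):
        agree_lhs = all(r1[a] == r2[a] for a in lhs)
        agree_rhs = all(r1[b] == r2[b] for b in rhs)
        if agree_lhs and not agree_rhs:
            return False
    return True

# Hypothetical table with tuples x1..x4 over attributes y1, y2, y3.
D = [
    {"y1": "a", "y2": "p", "y3": 1},  # x1
    {"y1": "b", "y2": "p", "y3": 2},  # x2
    {"y1": "a", "y2": "p", "y3": 1},  # x3
    {"y1": "b", "y2": "q", "y3": 3},  # x4
]

print(fd_holds(D, ["y1", "y2"], ["y3"]))  # True
print(fd_holds(D, ["y1"], ["y2"]))        # False: x2 and x4 agree on y1 only
```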
12
Fuzzy functional dependencies: syntax
Definition: A fuzzy functional dependence (FFD) over
attributes $Y$ is $A \Rightarrow B$, where $A, B \in L^Y$ (fuzzy
sets of attributes).
  • Example
  • $\{0.7/y_1\} \Rightarrow \{0.3/y_2\}$; $\{y_1, y_3\} \Rightarrow \{y_4\}$ (ordinary
    dependence)
  • $\{0.4/y_1, y_2, 0.1/y_3\} \Rightarrow \{y_3, 0.5/y_4\}$; $\{\} \Rightarrow \{\}$
    (empty)
  • Intended meaning of $A \Rightarrow B$
  • as in the ordinary case, but equality is replaced by
    similarity
  • for any two tuples $x_1, x_2 \in X$: IF $x_1$ and $x_2$
    have similar values on attributes from $A$ THEN $x_1$
    and $x_2$ have similar values on attributes from $B$
  • a new kind of dependencies (data mining appeal)
  • $A \Rightarrow B$ can be true to a degree from $L$, not only 0
    or 1
  • degrees $A(y)$, $B(y)$ act as thresholds

13
Semantics of FFDs
D: table with similarities (for simplicity,
ranks = 1)
Definition: the degree $||A \Rightarrow B||_D$ to which $A \Rightarrow B$ is
true in D is defined by
$$||A \Rightarrow B||_D = \bigwedge_{x_1, x_2 \in X} \big( (x_1(A) \approx x_2(A)) \rightarrow (x_1(B) \approx x_2(B)) \big)$$
Remark: The ordinary meaning of functional
dependencies is a particular case: $A$ and $B$ are
ordinary sets, $\approx_y$ is ordinary equality for each $y \in Y$.
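
To make the degree of validity concrete, here is a small sketch under stated assumptions. The slides leave the similarity of two tuples on a fuzzy set of attributes to the proceedings; the code assumes the threshold reading suggested earlier in the deck, namely the infimum over attributes y of A(y) → (x1(y) ≈_y x2(y)). It also assumes the Lukasiewicz implication on [0, 1], all ranks equal to 1, and a hypothetical distance-based similarity; the attribute names and data are made up.

```python
from itertools import product

def imp(a, b):
    """Lukasiewicz residuum on [0, 1] (an assumed structure of truth degrees)."""
    return min(1.0, 1.0 - a + b)

def linear_sim(scale):
    """Hypothetical similarity on numbers: 1 at distance 0, 0 at distance >= scale."""
    return lambda u, v: max(0.0, 1.0 - abs(u - v) / scale)

def tuples_similar(x1, x2, fuzzy_attrs, sims):
    """Assumed degree of x1(A) ~ x2(A): min over y of A(y) -> (x1[y] ~_y x2[y])."""
    return min((imp(deg, sims[y](x1[y], x2[y])) for y, deg in fuzzy_attrs.items()),
               default=1.0)

def ffd_degree(rows, A, B, sims):
    """Degree to which A => B is true in the table (all ranks equal to 1)."""
    return min(imp(tuples_similar(x1, x2, A, sims), tuples_similar(x1, x2, B, sims))
               for x1, x2 in product(rows, repeat=2))

rows = [{"y1": 10, "y2": 100}, {"y1": 12, "y2": 130}, {"y1": 20, "y2": 200}]
sims = {"y1": linear_sim(10), "y2": linear_sim(100)}

crisp = ffd_degree(rows, {"y1": 1.0}, {"y2": 1.0}, sims)    # about 0.9
relaxed = ffd_degree(rows, {"y1": 1.0}, {"y2": 0.7}, sims)  # about 1.0
print(round(crisp, 3), round(relaxed, 3))
```

Lowering the degree of y2 on the right-hand side from 1 to 0.7 relaxes the requirement on y2, which is why the second dependency is fully true in this data while the first is not.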
14
Semantics of FFDs: models, entailment
D: table with similarities
Definition (models and entailment in ranked
tables): for a set $T$ of FFDs, the models of $T$ are
$\mathrm{Mod}(T) = \{ D \mid \text{for each } A \Rightarrow B \in T: ||A \Rightarrow B||_D = 1 \}$,
in words: D is a model of $T$ means that each FFD
from $T$ is true in D.
Definition (models and entailment in ranked
tables): for a set (fuzzy set) $T$ of FFDs, the degree
of entailment of $A \Rightarrow B$ from $T$ is
$||A \Rightarrow B||_T = \bigwedge_{D \in \mathrm{Mod}(T)} ||A \Rightarrow B||_D$,
in words: the degree to which $A \Rightarrow B$ follows from
$T$ = the degree to which $A \Rightarrow B$ is true in each model of $T$.
15
Armstrong-like rules, provability, and
completeness
  • Recall
  • Armstrong W. W.: Dependency structures of data
    base relationships.
  • IFIP Congress, Geneva, Switzerland, 1974.
  • a system of deduction rules s. t.
  • $A \Rightarrow B$ is entailed by $T$ iff $A \Rightarrow B$ is provable from
    $T$
  • in our setting, entailment is a matter of degree,
  • two concepts of provability and completeness
  • ordinary completeness (interesting: only degree
    1): $\varphi$ follows from $T$ iff $\varphi$ is provable from $T$
  • graded completeness (any degree
    interesting): degree to which $\varphi$ follows from $T$
    = degree of provability of $\varphi$ from $T$.
  • We present a syntactico-semantically complete
    (both types) system of Armstrong-like rules.

16
Armstrong-like rules, provability, and
completeness
  • Deduction rules
  • rules describing which FFDs can be inferred (in
    one elementary step) from other FFDs
  • inspired by Armstrong-like rules, several
    equivalent systems
  • one of them (an elegant one) is
  • classical Armstrong rules + a fuzzy rule (the rule formulas are not preserved in this transcript)

17
Ordinary provability and completeness
  • Provability: $T$ . . . theory (set of FFDs)
  • $A \Rightarrow B$ is provable from $T$, written $T \vdash A \Rightarrow B$, if
    there is a sequence
  • $\varphi_1, \ldots, \varphi_n$ of FFDs such that
  • $\varphi_n$ is $A \Rightarrow B$,
  • for each $\varphi_i$: $\varphi_i \in T$ or $\varphi_i$ is inferred from the
    preceding formulas
  • (i.e., $\varphi_1, \ldots, \varphi_{i-1}$) using one of the
    deduction rules (Ax)-(Cut).
  • Provability is a bivalent notion (either $T \vdash A \Rightarrow B$ or
    $T \nvdash A \Rightarrow B$).

Theorem (ordinary completeness): $||A \Rightarrow B||_T = 1$
($A \Rightarrow B$ follows from $T$ in degree 1) iff
$T \vdash A \Rightarrow B$ ($A \Rightarrow B$ is provable from $T$)
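
In the classical two-valued special case, completeness of the Armstrong rules is what allows entailment to be decided by the standard attribute-closure procedure. The sketch below shows only that classical procedure; it is not the authors' graded system, and the example theory is hypothetical.

```python
def closure(attrs, fds):
    """Smallest superset of attrs closed under all dependencies lhs => rhs."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def entails(fds, lhs, rhs):
    """Classical case: T entails lhs => rhs iff rhs is inside the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)

# Hypothetical theory T: {a} => {b}, {b} => {c}.
T = [({"a"}, {"b"}), ({"b"}, {"c"})]

print(entails(T, {"a"}, {"c"}))  # True: follows by chaining the two dependencies
print(entails(T, {"c"}, {"a"}))  # False
```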
18
Graded provability and completeness
Provability is a bivalent notion (either $T \vdash A \Rightarrow B$ or
$T \nvdash A \Rightarrow B$). Can we capture a degree of semantic
entailment syntactically (i.e., by a
modification of the concept of proof)?
Graded provability: $T$ . . . set of FFDs; $|A \Rightarrow B|_T \in L$ is the
degree to which $A \Rightarrow B$ is provable from $T$ (details:
proceedings)
Theorem (graded completeness): $||A \Rightarrow B||_T = |A \Rightarrow B|_T$
(degree of entailment = degree of
provability).
19
Non-redundant bases of FFDs
aim: large sets of FFDs $\rightarrow$ small sets of
FFDs (equally informative)
example: given a ranked
table D with similarities, extract true FFDs
from D, but only the essential ones
  • Definition (complete set of FFDs)
  • A set $T$ of FFDs is complete in D if
  • for each $A \Rightarrow B \in T$: $||A \Rightarrow B||_D = 1$ (each FFD
    from $T$ is true in D)
  • for each $A \Rightarrow B$: $||A \Rightarrow B||_D = ||A \Rightarrow B||_T$

$\Rightarrow$ a complete set $T$ of D fully describes the validity of
FFDs in D
20
Non-redundant bases of FFDs (cntd.)
  • Definition (non-redundant bases of D)
  • A set $T$ of FFDs is a non-redundant basis of D if
  • $T$ is complete in D
  • no $T' \subset T$ is complete

In what follows: computation of particular
non-redundant bases based on so-called
pseudo-intents
21
Non-redundant bases using pseudo-intents
Definition (pseudo-intents of D): A system of
pseudo-intents of a ranked table D with
similarities is a system $\mathcal{P}$ of fuzzy sets of
attributes such that $P \in \mathcal{P}$ iff (detailed
description in proceedings)
the role of pseudo-intents:
Theorem (non-redundant basis based on
pseudo-intents): If $\mathcal{P}$ is a system of
pseudo-intents then $T = \{P \Rightarrow C(P) \mid P \in \mathcal{P}\}$ is
a non-redundant basis of D.
$C(P)$ is a particular modification of $P$, details
omitted.
22
Computing pseudo-intents (non-redundant bases)
Theorem (pseudo-intents from fixpoints of $cl_T$):
Let $\mathcal{P}$ be a system of pseudo-intents of D.
Then $\mathcal{P} = \{P \in \mathrm{fix}(cl_T) \mid P \neq C(P)\}$,
where $cl_T$ is defined by: for $Z \in L^Y$ we put
. . . (an operator on L-sets in $Y$)
$\mathrm{fix}(cl_T) = \{P \mid cl_T(P) = P\}$ . . . fixpoints
of $cl_T$
23
Computing non-redundant bases (algorithm)
Input: D (data table over domains with similarity
relations). Output: $\mathcal{P}$ (system of pseudo-intents)
  $B := \emptyset$
  if $B \neq C(B)$: add $B$ to $\mathcal{P}$
  while $B \neq Y$:
    $T := \{P \Rightarrow C(P) \mid P \in \mathcal{P}\}$
    $B := B^+$ ($B^+$ is the lectically smallest fixed point of $cl_T$ which is a successor of $B$)
    if $B \neq C(B)$: add $B$ to $\mathcal{P}$
polynomial time delay complexity
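
The loop above has the same shape as Ganter's NextClosure-based computation of pseudo-intents in formal concept analysis. The following sketch covers only that crisp special case (truth degrees 0 and 1, yes/no attributes), not the fuzzy algorithm of the proceedings. The closure `C`, the operator `cl_T`, and the example data are assumptions made for illustration.

```python
def stem_base(context, attrs):
    """Crisp case: list of (P, C(P)) for all pseudo-intents P, via NextClosure.

    context: list of sets (the attributes of each object)
    attrs:   list of all attributes, in the order used for the lectic order
    """
    Y = frozenset(attrs)
    basis = []  # pairs (P, C(P)) found so far, i.e. the current theory T

    def C(B):
        # Closure in the data: attributes shared by all objects that have B.
        result = set(Y)
        for intent in context:
            if B <= intent:
                result &= intent
        return frozenset(result)

    def cl_T(B):
        # Closure under T; premises must be proper subsets of the current set.
        B = set(B)
        changed = True
        while changed:
            changed = False
            for P, CP in basis:
                if P < B and not CP <= B:
                    B |= CP
                    changed = True
        return frozenset(B)

    def successor(B):
        # Lectically smallest fixed point of cl_T that comes after B.
        B = set(B)
        for i in reversed(range(len(attrs))):
            m = attrs[i]
            if m in B:
                B.discard(m)
            else:
                candidate = cl_T(B | {m})
                if all(attrs.index(x) >= i for x in candidate - B):
                    return candidate
        return None

    B = frozenset()
    while B is not None:
        if B != C(B):
            basis.append((B, C(B)))
        if B == Y:
            break
        B = successor(B)
    return basis

# Hypothetical data: one object has attributes a, b; another has only c.
for P, CP in stem_base([{"a", "b"}, {"c"}], ["a", "b", "c"]):
    print(sorted(P), "=>", sorted(CP))
```

For this data the sketch yields two implications, from b to a, b and from a to a, b. In the fuzzy setting the sets become L-sets and both closure operators are graded.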
24
Relational algebra and calculus
  • basic traits (details in proceedings and a
    forthcoming paper)
  • extension of classical relational algebra which
    takes similarities into account
  • relational algebra operations
  • counterparts to Boolean operations (union, . . .
    )
  • new operations arising within the framework of
    fuzzy logic (e.g. based on thresholds, like the
    $a$-cut ${}^{a}D = \{t \mid D(t) \geq a\}$)
  • operations where exact matches are extended by
    similarity-based matches (selection, join, . . .
    )
  • further operations, e.g. top-k (best k tuples
    satisfying a query; considerable interest)
  • relational calculus based on formal predicate
    fuzzy logic (essential are non-standard issues
    like the quantifier "most", etc.)
  • well-founded like in the classical case

Theorem (equivalence theorem) Relational algebra
and relational calculus for the extended model
are equivalent.
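
Two of the operations named above, the a-cut and top-k, are easy to sketch on a ranked table stored as (rank, tuple) pairs. This is an illustrative sketch, not the authors' implementation. The ranks and country names are those of the deck's "large population" example; the function names are assumptions.

```python
import heapq

# Ranked table: (rank D(t), tuple). Only the country name is kept per tuple.
D = [(1.0, "China"), (1.0, "India"), (0.6, "USA"), (0.3, "Russia"),
     (0.3, "Japan"), (0.2, "Germany"), (0.2, "UK"), (0.2, "France"),
     (0.1, "Spain")]

def a_cut(ranked, a):
    """Tuples whose rank is at least the threshold a."""
    return [t for rank, t in ranked if rank >= a]

def top_k(ranked, k):
    """The k pairs with the highest ranks."""
    return heapq.nlargest(k, ranked, key=lambda pair: pair[0])

print(a_cut(D, 0.3))
print(top_k(D, 3))
```

With a = 0.3 the cut keeps China, India, USA, Russia and Japan; with k = 3 the result consists of China, India and USA.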
25
Example I: Select power production of countries
with large population
D(t)  Country  COU  Population  Coal   Air   Water  Nuclear
----  -------  ---  ----------  -----  ----  -----  -------
1.0   China    Cn   1300000000  498    246   196    34.6
1.0   India    In   1000000000  154    1032  75     24.8
.6    USA      US   300000000   570.7  2533  330    743.9
.3    Russia   Ru   145000000   115.8  54    157    122.5
.3    Japan    Jp   127000000   0      120   90     293.8
.2    Germany  Ge   90000000    56.4   3817  50     161.2
.2    UK       UK   80000000    19.5   350   8      87.1
.2    France   Fr   80000000    0      63    62     394.4
.1    Spain    Sp   40000000    10.9   1180  11     58.9
D(t): degree of "large population" for each tuple
26
Implementation of domains with similarities I
Similarity of power production from coal is
defined by a table
Similarity table
c    Cn   In   US   Ru   Jp   Ge   Fr   UK   Sp
-----------------------------------------------
Cn   1         .3
In        1         .6
US   .3        1
Ru        .6        1         .4
Jp                       1    .4   1    .8   .9
Ge                  .4   .4   1    .4   .7   .6
Fr                       1    .4   1    .8   .9
UK                       .8   .7   .8   1    1
Sp                       .9   .6   .9   1    1
DDL
CREATE TABLE t_sim_coal (
  country_code1 VARCHAR(2),
  country_code2 VARCHAR(2),
  similarity    NUMBER(3,2),
  -- the key is the pair of country codes
  CONSTRAINT t_sim_coal_pk PRIMARY KEY ( country_code1, country_code2 )
)
COUNTRY_CODE1 COUNTRY_CODE2 SIMILARITY
------------- ------------- ----------
Cn            Cn                     1
Cn            US                    .3
In            In                     1
In            Ru                    .6
US            US                     1
Ru            Ru                     1
...
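
The role of the similarity table can be mimicked by a keyed lookup. The sketch below is illustrative only: the stored pairs are the sample rows shown above, while the symmetric fallback and the default of 0 for pairs that are not stored are assumptions.

```python
# Stored pairs, copied from the sample rows of t_sim_coal.
SIM_COAL = {
    ("Cn", "Cn"): 1.0, ("Cn", "US"): 0.3,
    ("In", "In"): 1.0, ("In", "Ru"): 0.6,
    ("US", "US"): 1.0, ("Ru", "Ru"): 1.0,
}

def similarity(table, code1, code2, default=0.0):
    """Similarity degree of two country codes.

    Assumptions: the relation is symmetric, and a pair that is not stored
    in either order has similarity `default`.
    """
    if (code1, code2) in table:
        return table[(code1, code2)]
    return table.get((code2, code1), default)

print(similarity(SIM_COAL, "Cn", "US"))  # 0.3
print(similarity(SIM_COAL, "Ru", "In"))  # 0.6, found via the symmetric pair
print(similarity(SIM_COAL, "Cn", "In"))  # 0.0, pair not stored
```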
27
Similarity defined by table: performance issues
Similarity table DDL
  • The similarity of two countries is retrieved by a join
    in the following steps (suppose that both
    country codes are available from a main loop):
  • Find the ROWID in the index t_sim_coal_pk for
    the two given country codes
  • Retrieve the similarity from the table t_sim_coal
    using the ROWID and provide it for further query
    execution
  • There is thus an additional step,
    retrieving the ROWID, although the ROWID is not
    needed for the result.
  • t_sim_coal should be replaced by a database
    structure supporting search that returns the
    similarity value directly instead of the ROWID
    (e.g. an index-organized table)

CREATE TABLE t_sim_coal (
  country_code1 VARCHAR(2),
  country_code2 VARCHAR(2),
  similarity    NUMBER(3,2),
  -- the key is the pair of country codes
  CONSTRAINT t_sim_coal_pk PRIMARY KEY ( country_code1, country_code2 )
)
COUNTRY_CODE1 COUNTRY_CODE2 SIMILARITY
------------- ------------- ----------
Cn            Cn                     1
Cn            US                    .3
In            In                     1
In            Ru                    .6
US            US                     1
Ru            Ru                     1
...
28
Implementation of domains with similarities II
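Ranking of large population is defined by a function
Definition in an RDBMS with a procedural extension
create or replace function large_population ( p_population in varchar2 )
  return number
is
  -- population at which the rank reaches 1
  large_popul_c constant number := 500000000;
  -- declared on the slide, not used
  l_ret number := 0;
begin
  -- the rank grows linearly with the population and is capped at 1
  return least( p_population/large_popul_c, 1 );
end;
/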
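
The same ranking function can be written as a short sketch and compared with the ranks shown in the "large population" example. The Python name and the rounding to one decimal place (to match how the ranks are displayed in the example table) are assumptions.

```python
def large_population(population, threshold=500_000_000):
    """Degree of 'large population': linear in the population, capped at 1."""
    return min(population / threshold, 1.0)

# Populations and displayed ranks from the deck's Example I.
EXAMPLE = {
    "China": (1_300_000_000, 1.0), "India": (1_000_000_000, 1.0),
    "USA": (300_000_000, 0.6), "Russia": (145_000_000, 0.3),
    "Japan": (127_000_000, 0.3), "Germany": (90_000_000, 0.2),
    "UK": (80_000_000, 0.2), "France": (80_000_000, 0.2),
    "Spain": (40_000_000, 0.1),
}

for country, (population, shown_rank) in EXAMPLE.items():
    rank = round(large_population(population), 1)
    print(country, rank, rank == shown_rank)
```

Every computed rank agrees with the displayed one after rounding, which suggests the example table was produced with this threshold.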
29
Ranking defined by function: performance
considerations
  • A ranking or similarity defined by a function can
    decrease performance during SQL
    execution on large data
  • It is possible to create a function-based index
    using row columns as parameters. But the
    index maps the function result to a ROWID, which is not
    very helpful
  • Proposal: extend the classical B-tree by degree values in
    the leaves. The leaves would consist of the indexed
    column, the degree value and the ROWID. The
    extended B-tree would support operations like
    top-k or a-cut very efficiently, provided that the
    ranking function is monotone

30
Extended Btree
[Figure: extended B-tree over population; each leaf stores the population with its degree of large population: 90,000,000 (0.2), 127,000,000 (0.3), 145,000,000 (0.3), 300,000,000 (0.6), 1,000,000,000 (1), 1,300,000,000 (1)]
  • The example above shows the a-cut of large population
    (a = 0.3)
  • Once the leftmost leaf with degree of large
    population 0.3 is found, the leaves to its right
    are read sequentially
  • time delay is logarithmic
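
The idea of storing degrees in key order can be imitated with a sorted list and binary search. This is a simplified stand-in for the proposed extended B-tree, not an implementation of it: a flat sorted list replaces the tree, ROWIDs are omitted, and the class and method names are assumptions.

```python
import bisect

class DegreeIndex:
    """Leaves (key, degree) in key order; degrees must be non-decreasing in the key."""

    def __init__(self, leaves):
        self.leaves = sorted(leaves)
        self.degrees = [degree for _, degree in self.leaves]
        if any(x > y for x, y in zip(self.degrees, self.degrees[1:])):
            raise ValueError("ranking function must be monotone in the key")

    def a_cut(self, a):
        # Binary search for the leftmost leaf with degree >= a,
        # then read every leaf to its right sequentially.
        start = bisect.bisect_left(self.degrees, a)
        return self.leaves[start:]

    def top_k(self, k):
        # With a monotone ranking, the best k leaves are the rightmost k.
        return self.leaves[-k:][::-1] if k > 0 else []

index = DegreeIndex([
    (1_300_000_000, 1.0), (145_000_000, 0.3), (127_000_000, 0.3),
    (90_000_000, 0.2), (1_000_000_000, 1.0), (300_000_000, 0.6),
])
print(index.a_cut(0.3))
print(index.top_k(2))
```

The monotonicity check in the constructor reflects the proviso on the slides: binary search over degrees is only valid when the ranking function is monotone in the indexed column.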

31
Implementation of ranked table in ORDBMS (cntd.)
Let's define an object type
Definition of object type in ORDBMS Oracle10g
create or replace type powerprod_t AS OBJECT (
  country    varchar2(30),
  population number,
  coal       number,
  air        number,
  water      number,
  nuclear    number,
  MEMBER FUNCTION similar_coal(itupple in powerprod_t) return number,
  MEMBER FUNCTION similar_air(itupple in powerprod_t) return number,
  MEMBER FUNCTION similar_water(itupple in powerprod_t) return number,
  MEMBER FUNCTION similar_nuclear(itupple in powerprod_t) return number,
  MEMBER FUNCTION large_popul(itupple in powerprod_t) return number
)
32
Implementation of ranked table in ORDBMS (cntd.)
Let's define an object view over the table
t_countries using the object type powerprod_t.
Definition of object view powerprod_v in ORDBMS
Oracle10g
create or replace view powerprod_v of powerprod_t
  with object identifier (country) as
  select country, population, coal, air, water, nuclear
  from t_countries
33
Implementation of ranked table in ORDBMS (cntd.)
And now we can select, e.g., all countries having
similar power production from nuclear power plants
Query over the object view powerprod_v in ORDBMS
Oracle10g
select a.similar_nuclear(value(b)) "D(t)",
       a.country    "Country",
       a.population "Population",
       a.coal       "Coal",
       a.air        "Air",
       a.water      "Water",
       a.nuclear    "Nuclear"
from powerprod_v a, powerprod_v b
where b.country = 'Japan'
order by 1 desc
/
D(t) represents the similarity of nuclear power
plant production with Japan. Note that
similar_nuclear looks up similarities in the table
t_sim_nuclear
34
Implementation of ranked table in ORDBMS (result
of the example)
D(t)  Country  Population  Coal   Air   Water  Nuclear
----  -------  ----------  -----  ----  -----  -------
1     Japan    127000000   0      120   90     293.8
.6    France   80000000    0      63    62     394.4
.4    Germany  90000000    56.4   3817  50     161.2
.1    Russia   145000000   115.8  54    157    122.5
0     India    1000000000  154    1032  75     24.8
0     USA      300000000   570.7  2533  330    743.9
0     Spain    40000000    10.9   1180  11     58.9
0     UK       80000000    19.5   350   8      87.1
0     China    1300000000  498    246   196    34.6
D(t) represents the similarity of nuclear power
plant production with Japan
35
Future research
  • further study of the extended Codd's model
    (foundations, algorithms, implementation),
  • connection to existing work on RankSQL, to work
    on algorithms, . . . ,
  • further data dependencies,
  • data redundancy (approximate redundancy),
  • data mining aspects,
  • implications true in degrees other than 1 (at
    least a, etc.): bases, . . .
  • involve tolerance, e.g. an almost complete basis:
    can it be smaller?
  • . . .
