Title: Relational Model of Data over Domains with Similarities
1 Relational Model of Data over Domains with Similarities
An Extension for Similarity Queries and Knowledge Extraction
- Radim Belohlavek, Vilem Vychodil, Stanislav Opichal
- Dept. of Computer Science, Palacky University, Olomouc, Czech Republic
Belohlavek, Vychodil, Opichal (Palacky University)
Relational Model of Data over Domains with
IDA 2007
2 Outline
- problem setting: introducing the extended Codd's model
- preliminaries from fuzzy logic
- functional dependencies (as an example of data dependencies): Armstrong axioms and completeness, entailment and non-redundant bases, computation of bases
- relational algebra and calculus
- practical issues
- further issues, future research
3 Problem setting
- Our paper
- contribution to an extension of Codd's relational model
- the extension concerns uncertainty (imprecision)
- Abiteboul S. et al.: The Lowell database research self-assessment. Comm. ACM 48(5) (2005), 111-118: management of uncertainty in the foundations of databases
- extension
- provides a framework for approximate matches and related issues (similarity queries, similarity join, ...), in contrast to the exact matches of the classical model
- we add
- similarity relations on domains
- ranks assigned to tuples
- in this talk
- data dependencies
- relational algebra and calculus
- practical issues
4 Problem setting (cntd.)
- Our extension of Codd's model
- (ranked) data tables over domains with similarities
ranked table: answer to a similarity-based query
5 Problem setting (cntd.)
- Related work
- extensions of Codd's model employing fuzzy logic
- several approaches, many papers
- Raju, Majumdar: Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Trans. Database Systems, Vol. 13, No. 2, 1988, pp. 129-166.
- extensions of Codd's model employing probability
- different both semantically and technically (probability $\neq$ fuzzy logic)
- Fuhr, Rölleke: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Information Systems 15:32-66, 1997.
- D. Dey and S. Sarkar: A probabilistic relational model and algebra. ACM Trans. Database Syst. 21:339-369, 1996.
6 Problem setting (cntd.)
- Related work
- Fagin et al.
- R. Fagin: Combining fuzzy information: an overview. ACM SIGMOD Record 31(2):109-118, 2002.
- Natsev, Chang, Smith, Li, Vitter: Supporting incremental join queries on ranked inputs. VLDB 2001, pp. 281-290.
- Cohen, Sagiv: An incremental algorithm for computing ranked full disjunctions. PODS 2005, pp. 98-107.
- RankSQL-related research
- Li, Chang, Ilyas, Song: RankSQL: Query Algebra and Optimization for Relational top-k queries. ACM SIGMOD 2005, pp. 131-142.
- Ilyas, Aref, Elmagarmid: Supporting top-k join queries in relational databases. The VLDB Journal 13:207-221, 2004.
7 Preliminaries from fuzzy logic
- fuzzy logic: invented by Zadeh; a simple calculus for handling vagueness
- Zadeh L. A.: Fuzzy sets. Inf. Control (1965).
- basic principle: allows propositions to have intermediate truth degrees
- instead of just 0 (false) and 1 (true), e.g.
- "John is tall": 0.9, "A is similar to B": 0.7
- developed since the late 1960s
- for a long time no firm logical foundations, ad hoc approaches, many results of low quality
- logical foundations developed in the late 1990s, monographs available
8 Preliminaries: structures of truth degrees
- classical logic: two-element Boolean algebra, given by
- set $\{0, 1\}$ of truth degrees
- (truth functions of) logical connectives (conjunction, implication, ...)
- fuzzy logic: several possibilities, a general one is a complete residuated lattice, given by
- (partially ordered) set $L$ of truth degrees, e.g. $L = [0, 1]$, $L = \{0, 0.1, 0.2, \dots, 1\}$, non-linearly ordered $L$, ...
- (truth functions of) logical connectives (conj. $\otimes$, impl. $\rightarrow$, ...)
- Complete residuated lattice: basic structure of truth degrees $\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1 \rangle$, where
- $\langle L, \wedge, \vee, 0, 1 \rangle$ is a complete lattice,
- $\langle L, \otimes, 1 \rangle$ is a commutative monoid,
- $\langle \otimes, \rightarrow \rangle$ is an adjoint pair ($a \otimes b \leq c$ iff $a \leq b \rightarrow c$).
- details in proceedings
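As an illustration of the adjointness condition, the sketch below checks it on a finite grid for three textbook structures on the unit interval (Łukasiewicz, Gödel, product). The choice of these three structures, the grid, and all function names are assumptions made for this example; the slides do not fix a particular structure.

```python
from fractions import Fraction
from itertools import product

# Three standard (conjunction, residuum) pairs on [0, 1].
# Fractions keep the arithmetic exact, so the checks are not
# disturbed by floating-point rounding.
def luk_and(a, b):
    return max(Fraction(0), a + b - 1)

def luk_imp(a, b):
    return min(Fraction(1), 1 - a + b)

def godel_and(a, b):
    return min(a, b)

def godel_imp(a, b):
    return Fraction(1) if a <= b else b

def prod_and(a, b):
    return a * b

def prod_imp(a, b):
    return Fraction(1) if a <= b else b / a

STRUCTURES = {
    "Lukasiewicz": (luk_and, luk_imp),
    "Goedel": (godel_and, godel_imp),
    "product": (prod_and, prod_imp),
}

def is_adjoint(conj, imp, degrees):
    """Check: conj(a, b) <= c  iff  a <= imp(b, c), for all a, b, c."""
    return all(
        (conj(a, b) <= c) == (a <= imp(b, c))
        for a, b, c in product(degrees, repeat=3)
    )

# Truth degrees 0, 0.1, ..., 1 (illustrative grid).
GRID = [Fraction(i, 10) for i in range(11)]
ADJOINT = {name: is_adjoint(c, i, GRID) for name, (c, i) in STRUCTURES.items()}
```

Each pair passes the check on this grid, which is what makes the implication usable as the "residuum" of the conjunction in the definitions that follow.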
9 Our extension of Codd's model
- (ranked) data tables over domains with similarities
ranked table: answer to a similarity-based query
10 Functional dependencies
- formulas
- $A \Rightarrow B$ ($A, B \subseteq Y$, sets of attributes)
- describing attribute dependencies, e.g.
- $\{\text{flight No.}\} \Rightarrow \{\text{depart. time}, \text{arriv. time}\}$
- used in
- knowledge extraction
- data mining
- formal concept analysis (attribute implications)
- interpreted in tables with yes/no-attributes
- knowledge extraction
- relational databases (functional dependencies)
- interpreted in DB relations (tables with general attributes)
- data redundancy, normalization, DB design, ...
- knowledge extraction (Mannila, Räihä: Algorithms for inferring functional dependencies from relations, Data & Knowledge Eng. 12:83-99.)
11 Recalling functional dependencies (FDs): ordinary setting
table D
$A \Rightarrow B$ is true in table D means: for any tuples $x_1, x_2$, IF $x_1$ and $x_2$ agree on their values of attributes from $A$ THEN $x_1$ and $x_2$ agree on their values of attributes from $B$.
Example: $\{y_1, y_2\} \Rightarrow \{y_3\}$ is true in D, $\{y_1\} \Rightarrow \{y_2\}$ is not ($x_2$, $x_4$ is a counterexample).
$\{\text{flight No.}\} \Rightarrow \{\text{departure time}, \text{arrival time}\}$
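The truth condition for an ordinary functional dependency can be checked by comparing every pair of tuples. The sketch below does this for a small hypothetical table; the contents of the slide's table D are not available in this transcript, so the table, the values and the function name are invented for illustration and were only chosen to reproduce the same two verdicts.

```python
from itertools import combinations

def fd_holds(table, lhs, rhs):
    """Check the FD lhs => rhs in `table` (dict: tuple name -> row dict).

    Returns (True, None) if the dependency holds, otherwise
    (False, (name1, name2)) with the first counterexample pair found.
    """
    for (n1, t1), (n2, t2) in combinations(table.items(), 2):
        agree_lhs = all(t1[y] == t2[y] for y in lhs)
        agree_rhs = all(t1[y] == t2[y] for y in rhs)
        if agree_lhs and not agree_rhs:
            return False, (n1, n2)
    return True, None

# Hypothetical table (NOT the slide's table D).
TABLE = {
    "x1": {"y1": "a", "y2": 1, "y3": "p"},
    "x2": {"y1": "b", "y2": 1, "y3": "q"},
    "x3": {"y1": "a", "y2": 1, "y3": "p"},
    "x4": {"y1": "b", "y2": 2, "y3": "r"},
}

HOLDS = fd_holds(TABLE, ["y1", "y2"], ["y3"])
FAILS = fd_holds(TABLE, ["y1"], ["y2"])
```

In this table `{y1, y2} => {y3}` holds, while `{y1} => {y2}` fails because `x2` and `x4` agree on `y1` but differ on `y2`.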
12 Fuzzy functional dependencies: syntax
Definition: A fuzzy functional dependence (FFD) over attributes $Y$ is a formula $A \Rightarrow B$, where $A, B \in L^Y$ (fuzzy sets of attributes).
- Example
- $\{0.7/y_1\} \Rightarrow \{0.3/y_2\}$; $\{y_1, y_3\} \Rightarrow \{y_4\}$ (ordinary dependence)
- $\{0.4/y_1, y_2, 0.1/y_3\} \Rightarrow \{y_3, 0.5/y_4\}$; $\emptyset$: empty set
- Intended meaning of $A \Rightarrow B$
- as in the ordinary case, but with equality replaced by similarity
- for any two tuples $x_1, x_2 \in X$: IF $x_1$ and $x_2$ have similar values on attributes from $A$ THEN $x_1$ and $x_2$ have similar values on attributes from $B$
- hence a new kind of dependencies (data mining appeal)
- $A \Rightarrow B$ can be true to a degree from $L$, not only 0 or 1
- degrees $A(y)$, $B(y)$ act as thresholds
13 Semantics of FFDs
D: table with similarities (for simplicity, all ranks equal 1)
Definition: the degree $\|A \Rightarrow B\|_D$ to which $A \Rightarrow B$ is true in D is defined by
$$\|A \Rightarrow B\|_D = \bigwedge_{x_1, x_2 \in X} \big( (x_1(A) \approx x_2(A)) \rightarrow (x_1(B) \approx x_2(B)) \big)$$
Remark: the ordinary meaning of functional dependencies is a particular case: $A$ and $B$ ordinary sets, $\approx_y$ ordinary equality for each $y \in Y$.
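To make the graded semantics concrete, here is a minimal sketch that computes the degree to which an FFD is true in a small table with all ranks equal to 1. It rests on several assumptions that the slide does not state: the Łukasiewicz implication is used as the residuum, the similarity of two tuples on a fuzzy set of attributes is taken to be the minimum over attributes y of A(y) -> (x1(y) ≈_y x2(y)), no hedge is applied, and the data and the linear similarity functions are invented for illustration.

```python
from fractions import Fraction
from itertools import product

def imp(a, b):
    """Lukasiewicz residuum (assumed structure of truth degrees)."""
    return min(Fraction(1), 1 - a + b)

def linear_sim(width):
    """Hypothetical similarity on integers: 1 at distance 0, 0 at distance >= width."""
    def sim(u, v):
        return max(Fraction(0), 1 - Fraction(abs(u - v), width))
    return sim

def tuple_sim(x1, x2, fuzzy_attrs, sims):
    """Assumed form of x1(A) ~ x2(A): min over y of A(y) -> (x1(y) ~_y x2(y)).

    `fuzzy_attrs` maps attribute -> membership degree; attributes that
    are not listed have degree 0 and therefore do not constrain anything.
    """
    return min(
        (imp(a, sims[y](x1[y], x2[y])) for y, a in fuzzy_attrs.items()),
        default=Fraction(1),
    )

def ffd_degree(table, A, B, sims):
    """Degree to which A => B is true: infimum over all pairs of tuples."""
    return min(
        imp(tuple_sim(x1, x2, A, sims), tuple_sim(x1, x2, B, sims))
        for x1, x2 in product(table, repeat=2)
    )

# Hypothetical data: three tuples over attributes y1, y2.
TABLE = [
    {"y1": 10, "y2": 100},
    {"y1": 12, "y2": 140},
    {"y1": 30, "y2": 100},
]
SIMS = {"y1": linear_sim(10), "y2": linear_sim(100)}

STRICT = ffd_degree(TABLE, {"y1": Fraction(1)}, {"y2": Fraction(1)}, SIMS)
RELAXED = ffd_degree(TABLE, {"y1": Fraction(1)}, {"y2": Fraction(1, 2)}, SIMS)
```

With these inputs the dependency `{1/y1} => {1/y2}` is true only to degree 4/5 (the first two tuples are more similar on `y1` than on `y2`), while lowering the threshold on the right-hand side to `{0.5/y2}` makes it true to degree 1, which illustrates how the degrees act as thresholds.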
14 Semantics of FFD: models, entailment
D: table with similarities
Definition (models and entailment in ranked tables): Let $T$ be a set of FFDs. The models of $T$ are
$$\mathrm{Mod}(T) = \{ D \mid \text{for each } A \Rightarrow B \in T: \|A \Rightarrow B\|_D = 1 \},$$
in words: D is a model of $T$ means that each FFD from $T$ is true in D.
Definition (models and entailment in ranked tables): Let $T$ be a set (fuzzy set) of FFDs. The degree of entailment of $A \Rightarrow B$ from $T$ is
$$\|A \Rightarrow B\|_T = \bigwedge_{D \in \mathrm{Mod}(T)} \|A \Rightarrow B\|_D,$$
in words: the degree to which $A \Rightarrow B$ follows from $T$ is the degree to which $A \Rightarrow B$ is true in each model of $T$.
15 Armstrong-like rules, provability, and completeness
- Recall
- Armstrong W. W.: Dependency structures in data base relationships. IFIP Congress, Geneva, Switzerland, 1974.
- a system of deduction rules s. t.
- $A \Rightarrow B$ is entailed by $T$ iff $A \Rightarrow B$ is provable from $T$
- in our setting, entailment is a matter of degree,
- two concepts of provability and completeness
- ordinary completeness (only degree 1 is interesting): $\varphi$ follows from $T$ iff $\varphi$ is provable from $T$
- graded completeness (any degree is interesting): degree to which $\varphi$ follows from $T$ = degree of provability of $\varphi$ from $T$.
- We present a syntactico-semantically complete (both types) system of Armstrong-like rules.
16 Armstrong-like rules, provability, and completeness
- Deduction rules
- rules describing what FFDs can be inferred (in one elementary step) from other FFDs
- inspired by Armstrong-like rules, several equivalent systems
- one of them (an elegant one) is
- classical Armstrong rules, fuzzy rule (rule figure not recovered)
17 Ordinary provability and completeness
- Provability: $T$ ... theory (set of FFDs)
- $A \Rightarrow B$ is provable from $T$, written $T \vdash A \Rightarrow B$, if there is a sequence $\varphi_1, \dots, \varphi_n$ of FFDs such that
- $\varphi_n$ is $A \Rightarrow B$,
- for each $\varphi_i$: $\varphi_i \in T$ or $\varphi_i$ is inferred from the preceding formulas (i.e., $\varphi_1, \dots, \varphi_{i-1}$) using one of the deduction rules (Ax)-(Cut).
- Provability: bivalent notion (either $T \vdash A \Rightarrow B$ or $T \nvdash A \Rightarrow B$).
Theorem (ordinary completeness): $\|A \Rightarrow B\|_T = 1$ ($A \Rightarrow B$ follows from $T$ in degree 1) iff $T \vdash A \Rightarrow B$ ($A \Rightarrow B$ is provable from $T$).
18 Graded provability and completeness
Provability: bivalent notion (either $T \vdash A \Rightarrow B$ or $T \nvdash A \Rightarrow B$).
Can we capture a degree of semantic entailment syntactically (i.e., by a modification of the concept of proof)?
Graded provability: $T$ ... set of FFDs; $|A \Rightarrow B|_T \in L$ ... degree to which $A \Rightarrow B$ is provable from $T$ (details in proceedings).
Theorem (graded completeness): $\|A \Rightarrow B\|_T = |A \Rightarrow B|_T$ (degree of entailment = degree of provability).
19 Non-redundant bases of FFDs
Aim: large sets of FFDs $\rightarrow$ small sets of FFDs (equally informative).
Example: given a ranked table D with similarities, extract the true FFDs from D, but only the essential ones.
- Definition (complete set of FFDs)
- A set $T$ of FFDs is complete in D if
- for each $A \Rightarrow B \in T$: $\|A \Rightarrow B\|_D = 1$ (each FFD from $T$ is true in D)
- for each $A \Rightarrow B$: $\|A \Rightarrow B\|_D = \|A \Rightarrow B\|_T$
Hence a set $T$ that is complete in D fully describes the validity of FFDs in D.
20 Non-redundant bases of FFDs (cntd.)
- Definition (non-redundant basis of D)
- A set $T$ of FFDs is a non-redundant basis of D if
- $T$ is complete in D
- no $T' \subset T$ is complete in D
In what follows: computation of particular non-redundant bases based on so-called pseudo-intents.
21 Non-redundant bases using pseudo-intents
Definition (pseudo-intents of D): A system of pseudo-intents of a ranked table D with similarities is a system $\mathcal{P}$ of fuzzy sets of attributes such that $P \in \mathcal{P}$ iff ... (detailed description in proceedings)
The role of pseudo-intents:
Theorem (non-redundant basis based on pseudo-intents): If $\mathcal{P}$ is a system of pseudo-intents, then $T = \{ P \Rightarrow C(P) \mid P \in \mathcal{P} \}$ is a non-redundant basis of D.
$C(P)$ is a particular modification of $P$, details omitted.
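22 Computing pseudo-intents (non-redundant bases)
Theorem (pseudo-intents from fixpoints of $\mathrm{cl}_T$): Let $\mathcal{P}$ be a system of pseudo-intents of D. Then
$$\mathcal{P} = \{ P \in \mathrm{fix}(\mathrm{cl}_T) \mid P \neq C(P) \},$$
where $\mathrm{cl}_T$ is defined as follows: for $Z \in L^Y$ we put ... (an operator on L-sets in $Y$), and
$$\mathrm{fix}(\mathrm{cl}_T) = \{ P \mid \mathrm{cl}_T(P) = P \}$$
is the set of fixpoints of $\mathrm{cl}_T$.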
23 Computing non-redundant bases (algorithm)
Input: D (data table over domains with similarity relations).
Output: $\mathcal{P}$ (system of pseudo-intents).
    $B := \emptyset$
    if $B \neq C(B)$: add $B$ to $\mathcal{P}$
    while $B \neq Y$:
        $T := \{ P \Rightarrow C(P) \mid P \in \mathcal{P} \}$
        $B := B^{+}$ ($B^{+}$ is the lectically smallest fixed point of $\mathrm{cl}_T$ which is a successor of $B$)
        if $B \neq C(B)$: add $B$ to $\mathcal{P}$
Complexity: polynomial time delay.
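The loop above can be sketched for the ordinary (crisp) special case, where attributes are yes/no and all similarities are equalities. Since the slides omit the fuzzy definitions of C and cl_T, the sketch swaps in their classical counterparts from formal concept analysis: C(B) is the set of attributes shared by all rows containing B, and cl_T saturates a set under the implications collected so far (applied only when the premise is a proper subset). The data and all names are illustrative assumptions, not the authors' implementation.

```python
def make_closure(rows, attrs):
    """C(B): attributes common to all rows that contain B (crisp case)."""
    def closure(B):
        result = set(attrs)
        for r in rows:
            if B <= r:
                result &= r
        return frozenset(result)
    return closure

def pseudo_closure(Z, T):
    """cl_T(Z): saturate Z under implications P => C whose premise P
    is a PROPER subset of the current set (crisp counterpart)."""
    Z = set(Z)
    changed = True
    while changed:
        changed = False
        for P, C in T:
            if P < Z and not C <= Z:
                Z |= C
                changed = True
    return frozenset(Z)

def next_fixpoint(B, attrs, T):
    """Lectically smallest fixed point of cl_T that is a successor of B."""
    B = set(B)
    for i in reversed(range(len(attrs))):
        if attrs[i] in B:
            B.discard(attrs[i])
        else:
            cand = pseudo_closure(B | {attrs[i]}, T)
            # Accept only if no attribute smaller than attrs[i] was added.
            if all(attrs.index(y) >= i for y in cand - B):
                return cand
    return None  # reached only when B already equals the full attribute set

def pseudo_intents(rows, attrs):
    C = make_closure(rows, attrs)
    Y = frozenset(attrs)
    B = frozenset()
    P = [B] if B != C(B) else []
    while B != Y:
        T = [(p, C(p)) for p in P]
        B = next_fixpoint(B, attrs, T)
        if B != C(B):
            P.append(B)
    return P

# Hypothetical yes/no table: each row is the set of attributes it has.
ROWS = [{"a", "b"}, {"a", "b", "c"}, {"c"}]
ATTRS = ["a", "b", "c"]
BASIS = pseudo_intents(ROWS, ATTRS)
```

For this small table the pseudo-intents are `{a}` and `{b}`, giving the basis `{a} => {a, b}` and `{b} => {a, b}`; the fuzzy algorithm follows the same loop with graded versions of the two operators.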
24 Relational algebra and calculus
- basic traits (details in proceedings and a forthcoming paper)
- extension of classical relational algebra which takes similarities into account
- relational algebra operations
- counterparts to Boolean operations (union, ...)
- new operations arising within the framework of fuzzy logic (e.g. based on thresholds, like the a-cut ${}^{a}D = \{ t \mid D(t) \geq a \}$)
- operations where exact matches are extended by similarity-based matches (selection, join, ...)
- further operations, e.g. top-k (best k tuples satisfying a query; of considerable interest)
- relational calculus based on formal predicate fuzzy logic (essential are non-standard issues like the quantifier "most", etc.)
- well-founded like in the classical case
Theorem (equivalence theorem): Relational algebra and relational calculus for the extended model are equivalent.
25 Example I: Select power production of countries with large population
D(t)  Country  COU  Population  Coal   Air   Water  Nuclear
----  -------  ---  ----------  -----  ----  -----  -------
1.0   China    Cn   1300000000  498    246   196    34.6
1.0   India    In   1000000000  154    1032  75     24.8
.6    USA      US   300000000   570.7  2533  330    743.9
.3    Russia   Ru   145000000   115.8  54    157    122.5
.3    Japan    Jp   127000000   0      120   90     293.8
.2    Germany  Ge   90000000    56.4   3817  50     161.2
.2    UK       UK   80000000    19.5   350   8      87.1
.2    France   Fr   80000000    0      63    62     394.4
.1    Spain    Sp   40000000    10.9   1180  11     58.9
D(t): degree of large population for each tuple
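The threshold-based operations mentioned earlier (a-cut and top-k) can be tried directly on this ranked table. The following is a minimal in-memory sketch using the ranks D(t) and country names from the table; the function names and the list representation are assumptions for illustration only.

```python
# Ranked table from the example: (D(t), country).
RANKED = [
    (1.0, "China"), (1.0, "India"), (0.6, "USA"),
    (0.3, "Russia"), (0.3, "Japan"), (0.2, "Germany"),
    (0.2, "UK"), (0.2, "France"), (0.1, "Spain"),
]

def a_cut(ranked, a):
    """a-cut: keep the tuples whose rank is at least a."""
    return [country for degree, country in ranked if degree >= a]

def top_k(ranked, k):
    """top-k: the k best-ranked tuples (ties keep their input order)."""
    best = sorted(ranked, key=lambda row: row[0], reverse=True)[:k]
    return [country for _, country in best]

CUT_03 = a_cut(RANKED, 0.3)
TOP_3 = top_k(RANKED, 3)
```

Here the 0.3-cut keeps China, India, USA, Russia and Japan, and top-3 returns China, India and USA.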
26 Implementation of domains with similarities I
Similarity of power production from coal is defined by a table.
Similarity table:
c    Cn   In   US   Ru   Jp   Ge   Fr   UK   Sp
---  ---  ---  ---  ---  ---  ---  ---  ---  ---
Cn   1         .3
In        1         .6
US   .3        1
Ru        .6        1         .4
Jp                       1    .4   1    .8   .9
Ge                  .4   .4   1    .4   .7   .6
Fr                       1    .4   1    .8   .9
UK                       .8   .7   .8   1    1
Sp                       .9   .6   .9   1    1
DDL:
    CREATE TABLE t_sim_coal (
      country_code1 VARCHAR(2),
      country_code2 VARCHAR(2),
      similarity    NUMBER(3,2),
      CONSTRAINT t_sim_coal_pk PRIMARY KEY ( country_code1, country_code2 )
    )
Table contents:
COUNTRY_CODE1 COUNTRY_CODE2 SIMILARITY
------------- ------------- ----------
Cn            Cn                     1
Cn            US                    .3
In            In                     1
In            Ru                    .6
US            US                     1
Ru            Ru                     1
...
27 Similarity defined by table: performance issues
- The similarity of two countries is retrieved by a join in the following steps (suppose that both country codes are available from a main loop):
- Find the ROWID in the index t_sim_coal_pk for the two given country codes.
- Retrieve the similarity from the table t_sim_coal using the ROWID and provide it for further query execution.
- There is an additional step: retrieving the ROWID. But the ROWID is not needed for the result.
- t_sim_coal should be replaced by a database structure supporting search which gives the similarity value immediately instead of the ROWID (e.g. an index-organized table).
DDL:
    CREATE TABLE t_sim_coal (
      country_code1 VARCHAR(2),
      country_code2 VARCHAR(2),
      similarity    NUMBER(3,2),
      CONSTRAINT t_sim_coal_pk PRIMARY KEY ( country_code1, country_code2 )
    )
Table contents:
COUNTRY_CODE1 COUNTRY_CODE2 SIMILARITY
------------- ------------- ----------
Cn            Cn                     1
Cn            US                    .3
In            In                     1
In            Ru                    .6
US            US                     1
Ru            Ru                     1
...
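The point about skipping the ROWID step can be illustrated outside the database: a structure keyed directly by the pair of country codes returns the similarity in a single lookup. The sketch below uses a Python dictionary as a stand-in for such a structure, filled with the rows listed above. Two assumptions are made here and are not stated in the slides: pairs that are not stored have similarity 0, and the relation is symmetric, so a pair may be stored in either order.

```python
# Stand-in for a key -> value structure on (country_code1, country_code2).
# Contents: the rows of t_sim_coal listed in the slide (the list is elided there).
SIM_COAL = {
    ("Cn", "Cn"): 1.0,
    ("Cn", "US"): 0.3,
    ("In", "In"): 1.0,
    ("In", "Ru"): 0.6,
    ("US", "US"): 1.0,
    ("Ru", "Ru"): 1.0,
}

def similarity(code1, code2, table=SIM_COAL):
    """Return the stored similarity of two country codes.

    Assumptions: missing pairs mean similarity 0; the relation is
    symmetric, so the reversed pair is tried as a fallback.
    """
    if (code1, code2) in table:
        return table[(code1, code2)]
    return table.get((code2, code1), 0.0)

US_CN = similarity("US", "Cn")
CN_IN = similarity("Cn", "In")
```

The lookup goes from key to value with no intermediate row identifier, which is the behaviour the slide asks of the replacement structure.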
28 Implementation of domains with similarities II
Ranking of large population is defined by a function.
Definition in an RDBMS with procedural extension:
    create or replace function large_population (
      p_population in varchar2
    ) return number
    is
      -- population at which the degree of "large" reaches 1
      large_popul_c constant number := 500000000;
    begin
      -- degree grows linearly with population and is capped at 1
      return least( p_population / large_popul_c, 1 );
    end;
    /
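As a cross-check of the ranking function, the same formula can be evaluated on the populations from Example I. The sketch below is a Python transcription (an illustration, not the database code); rounding to one decimal place is an assumption made here to compare with the D(t) column as printed in the example.

```python
LARGE_POPUL = 500_000_000  # same constant as large_popul_c

def large_population(population):
    """Degree of 'large population': linear in population, capped at 1."""
    return min(population / LARGE_POPUL, 1.0)

# Populations from the Example I table.
POPULATION = {
    "China": 1_300_000_000, "India": 1_000_000_000, "USA": 300_000_000,
    "Russia": 145_000_000, "Japan": 127_000_000, "Germany": 90_000_000,
    "UK": 80_000_000, "France": 80_000_000, "Spain": 40_000_000,
}

# Rounded to one decimal place (assumed display precision of D(t)).
RANKS = {c: round(large_population(p), 1) for c, p in POPULATION.items()}
```

Under that rounding assumption the computed ranks agree with the D(t) column of Example I.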
29 Ranking defined by function: performance considerations
- A ranking or similarity defined by a function can decrease the performance of SQL execution on large data.
- It is possible to create a function-based index that uses row columns as parameters. But the index maps the result to a ROWID, which is not very helpful.
- Extending the classical B-tree by degree values in the leaves: the leaves would consist of the indexed column, the degree value and the ROWID. The extended B-tree would support operations like top-k or a-cut very efficiently, provided that the ranking function is monotone.
30 Extended B-tree
[Figure: extended B-tree over Population; leaves hold (population, degree) pairs: (90,000,000; 0.2), (127,000,000; 0.3), (145,000,000; 0.3), (300,000,000; 0.6), (1,000,000,000; 1), (1,300,000,000; 1)]
- The example above shows the a-cut of large population (a = 0.3)
- When the leftmost leaf with degree of large population $\geq$ 0.3 is found, the leaves to its right are read sequentially
- time delay is logarithmic
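The access pattern described here (locate the leftmost qualifying leaf, then scan to the right) can be mimicked with a sorted list and binary search. This is only a sketch of the idea, not a B-tree: the leaf contents pair the populations of Example I with their degrees of large population, and it relies on the ranking function being monotone so that degrees are sorted in the same order as the indexed column.

```python
from bisect import bisect_left

# Leaves in index order: (population, degree of large population).
LEAVES = [
    (90_000_000, 0.2), (127_000_000, 0.3), (145_000_000, 0.3),
    (300_000_000, 0.6), (1_000_000_000, 1.0), (1_300_000_000, 1.0),
]
# Monotone ranking => the degrees are non-decreasing along the leaves.
DEGREES = [degree for _, degree in LEAVES]

def a_cut(a):
    """Binary-search the leftmost leaf with degree >= a, then read rightwards."""
    start = bisect_left(DEGREES, a)
    return LEAVES[start:]

def top_k(k):
    """The k best leaves are the k rightmost ones, best first."""
    return LEAVES[-k:][::-1] if k > 0 else []

CUT_03 = [population for population, _ in a_cut(0.3)]
TOP_2 = [population for population, _ in top_k(2)]
```

The binary search takes logarithmically many steps to find the starting leaf; after that each further result costs one sequential read.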
31 Implementation of ranked table in ORDBMS (cntd.)
Let's define an object type.
Definition of object type in ORDBMS Oracle10g:
    create or replace type powerprod_t AS OBJECT (
      country     varchar2(30),
      population  number,
      coal        number,
      air         number,
      water       number,
      nuclear     number,
      MEMBER FUNCTION similar_coal(itupple in powerprod_t) return number,
      MEMBER FUNCTION similar_air(itupple in powerprod_t) return number,
      MEMBER FUNCTION similar_water(itupple in powerprod_t) return number,
      MEMBER FUNCTION similar_nuclear(itupple in powerprod_t) return number,
      MEMBER FUNCTION large_popul(itupple in powerprod_t) return number
    )
32 Implementation of ranked table in ORDBMS (cntd.)
Let's define an object view over the table t_countries using the object type powerprod_t.
Definition of object view powerprod_v in ORDBMS Oracle10g:
    create or replace view powerprod_v of powerprod_t
      with object identifier (country)
    as
      select country, population, coal, air, water, nuclear
      from t_countries
33 Implementation of ranked table in ORDBMS (cntd.)
Now we can select, e.g., all countries having similar power production from nuclear power plants.
Query over object view powerprod_v in ORDBMS Oracle10g:
    select a.similar_nuclear(value(b)) "D(t)",
           a.country    "Country",
           a.population "Population",
           a.coal       "Coal",
           a.air        "Air",
           a.water      "Water",
           a.nuclear    "Nuclear"
    from powerprod_v a, powerprod_v b
    where b.country = 'Japan'
    order by 1 desc
    /
D(t) represents the similarity of nuclear power plant production with Japan.
Note that similar_nuclear looks up similarities in the table t_sim_nuclear.
34 Implementation of ranked table in ORDBMS (result of the example)
D(t)  Country  Population  Coal   Air   Water  Nuclear
----  -------  ----------  -----  ----  -----  -------
1     Japan    127000000   0      120   90     293.8
.6    France   80000000    0      63    62     394.4
.4    Germany  90000000    56.4   3817  50     161.2
.1    Russia   145000000   115.8  54    157    122.5
0     India    1000000000  154    1032  75     24.8
0     USA      300000000   570.7  2533  330    743.9
0     Spain    40000000    10.9   1180  11     58.9
0     UK       80000000    19.5   350   8      87.1
0     China    1300000000  498    246   196    34.6
D(t) represents the similarity of nuclear power plant production with Japan.
35 Future research
- further study of the extended Codd's model (foundations, algorithms, implementation),
- connection to existing work on RankSQL, to work on algorithms, ...,
- further data dependencies,
- data redundancy (approximate redundancy),
- data mining aspects,
- implications true in degrees other than 1 (at least a, etc.): bases, ...
- involve tolerance, e.g. an almost complete basis: can it be smaller?
- ...