Relational Model of Data over Domains with Similarities

1
Relational Model of Data over Domains with Similarities
An Extension for Similarity Queries and Knowledge Extraction

Radim Belohlavek, Vilem Vychodil, Stanislav Opichal
Dept. Computer Science, Palacky University, Olomouc, Czech Republic

Belohlavek, Vychodil, Opichal (Palacky University)
Relational Model of Data over Domains with
IDA 2007
2
Outline

problem setting: introducing the extended Codd's model
preliminaries from fuzzy logic
functional dependencies (as an example of data dependencies): Armstrong axioms and completeness, entailment and non-redundant bases, computation of bases
relational algebra and calculus
practical issues
further issues, future research

3
Problem setting

Our paper
  contribution to an extension of Codd's relational model
  the extension concerns uncertainty (imprecision)
  Abiteboul S. et al.: The Lowell database research self-assessment. Comm. ACM 48(5) (2005), 111-118: management of uncertainty in the foundations of databases
extension
  provides a framework for approximate matches and related issues (similarity queries, similarity join, ...), in contrast to the exact matches of the classical model
  we add
    similarity relations on domains
    ranks assigned to tuples
in this talk
  data dependencies
  relational algebra and calculus
  practical issues

4
Problem setting (cntd.)

Our extension of Codd's model
(ranked) data tables over domains with similarities

ranked table = answer to a similarity-based query
5
Problem setting (cntd.)

Related work
extensions of Codd's model employing fuzzy logic
  several approaches, many papers
  Raju, Majumdar: Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Trans. Database Systems, Vol. 13, No. 2, 1988, pp. 129-166.
extensions of Codd's model employing probability
  different both semantically and technically (probability ≠ fuzzy logic)
  Fuhr, Rölleke: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Information Systems 15:32-66, 1997.
  Dey, Sarkar: A probabilistic relational model and algebra. ACM Trans. Database Systems 21:339-369, 1996.

6
Problem setting (cntd.)

Related work
Fagin et al.
  Fagin: Combining fuzzy information: an overview. ACM SIGMOD Record 31(2):109-118, 2002.
  Natsev, Chang, Smith, Li, Vitter: Supporting incremental join queries on ranked inputs. VLDB 2001, pp. 281-290.
  Cohen, Sagiv: An incremental algorithm for computing ranked full disjunctions. PODS 2005, pp. 98-107.
RankSQL and related research
  Li, Chang, Ilyas, Song: RankSQL: Query Algebra and Optimization for Relational top-k queries. ACM SIGMOD 2005, pp. 131-142.
  Ilyas, Aref, Elmagarmid: Supporting top-k join queries in relational databases. The VLDB Journal 13:207-221, 2004.

7
Preliminaries from fuzzy logic

fuzzy logic: invented by Zadeh, a simple calculus for handling vagueness
  Zadeh L. A.: Fuzzy sets. Inf. Control (1965).
basic principle: propositions may have intermediate truth degrees instead of just 0 (false) and 1 (true), e.g. $\|\text{John is tall}\| = 0.9$, $\|\text{A is similar to B}\| = 0.7$
developed since the late 1960s
for a long time no firm logical foundations, ad hoc approaches, many results of low quality
logical foundations developed in the late 1990s, monographs available

8
Preliminaries: structures of truth degrees

classical logic: two-element Boolean algebra, given by
  the set $\{0, 1\}$ of truth degrees
  (truth functions of) logical connectives (conjunction, implication, ...)
fuzzy logic: several possibilities, a general one is a complete residuated lattice, given by
  a (partially ordered) set $L$ of truth degrees, e.g. $L = [0, 1]$, $L = \{0, 0.1, 0.2, \dots, 1\}$, a non-linearly ordered $L$, ...
  (truth functions of) logical connectives (conjunction $\otimes$, implication $\rightarrow$, ...)
Complete residuated lattice: the basic structure of truth degrees, $\mathbf{L} = \langle L, \wedge, \vee, \otimes, \rightarrow, 0, 1 \rangle$, where
  $\langle L, \wedge, \vee, 0, 1 \rangle$ is a complete lattice,
  $\langle L, \otimes, 1 \rangle$ is a commutative monoid,
  $\langle \otimes, \rightarrow \rangle$ is an adjoint pair ($a \otimes b \leq c$ iff $a \leq b \rightarrow c$).
details in proceedings
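
To make the adjointness condition concrete, the following minimal sketch checks it numerically for three textbook residuated structures on $[0, 1]$ (Łukasiewicz, Gödel, product). These three structures are not named on the slide; they, the function names, and the test grid are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Each structure is a pair (conjunction, residuum) on truth degrees in [0, 1].
def luk_and(a, b):
    return max(Fraction(0), a + b - 1)

def luk_imp(a, b):
    return min(Fraction(1), 1 - a + b)

def godel_and(a, b):
    return min(a, b)

def godel_imp(a, b):
    return Fraction(1) if a <= b else b

def prod_and(a, b):
    return a * b

def prod_imp(a, b):
    return Fraction(1) if a <= b else b / a

STRUCTURES = {
    "Lukasiewicz": (luk_and, luk_imp),
    "Goedel": (godel_and, godel_imp),
    "product": (prod_and, prod_imp),
}

def is_adjoint(conj, imp, degrees):
    """Check  a (x) b <= c  iff  a <= (b -> c)  for all a, b, c in `degrees`."""
    return all(
        (conj(a, b) <= c) == (a <= imp(b, c))
        for a, b, c in product(degrees, repeat=3)
    )

# An exact rational grid 0, 0.1, ..., 1 avoids floating-point rounding issues.
GRID = [Fraction(i, 10) for i in range(11)]
RESULTS = {name: is_adjoint(c, i, GRID) for name, (c, i) in STRUCTURES.items()}
```

A finite grid check is only a sanity test, not a proof; for these three structures adjointness can also be verified by hand.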

9
Our extension of Codd's model

(ranked) data tables over domains with similarities

ranked table = answer to a similarity-based query
10
Functional dependencies

formulas
  $A \Rightarrow B$ ($A, B \subseteq Y$, sets of attributes)
describing attribute dependencies, e.g.
  $\{\text{flight No.}\} \Rightarrow \{\text{depart. time, arriv. time}\}$
used in
  knowledge extraction
  data mining
formal concept analysis (attribute implications)
  interpreted in tables with yes/no-attributes
  knowledge extraction
relational databases (functional dependencies)
  interpreted in DB relations (tables with general attributes)
  data redundancy, normalization, DB design, ...
  knowledge extraction (Mannila, Räihä: Algorithms for inferring functional dependencies from relations. Data & Knowledge Eng. 12:83-99.)

11
Recalling functional dependencies (FDs): ordinary setting
table D
$A \Rightarrow B$ is true in table D means: for any tuples $x_1, x_2$,
  IF $x_1$ and $x_2$ agree on their values of attributes from $A$
  THEN $x_1$ and $x_2$ agree on their values of attributes from $B$
Example: $\{y_1, y_2\} \Rightarrow \{y_3\}$ is true in D, $\{y_1\} \Rightarrow \{y_2\}$ is not (tuples $x_2$ and $x_4$ form a counterexample)
$\{\text{flight No.}\} \Rightarrow \{\text{departure time, arrival time}\}$
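
The IF/THEN condition above translates directly into a pairwise check. The sketch below is a straightforward quadratic test; the flight data is hypothetical, since the table shown on the original slide is not part of this transcript.

```python
from itertools import combinations

def fd_holds(table, lhs, rhs):
    """Return True iff the functional dependency lhs => rhs is true in `table`.

    `table` is a list of dicts (tuples); `lhs` and `rhs` are sets of attribute names.
    """
    for x1, x2 in combinations(table, 2):
        agree_on_lhs = all(x1[y] == x2[y] for y in lhs)
        agree_on_rhs = all(x1[y] == x2[y] for y in rhs)
        if agree_on_lhs and not agree_on_rhs:
            return False  # (x1, x2) is a counterexample
    return True

# Hypothetical flight data, for illustration only.
FLIGHTS = [
    {"flight_no": "OK123", "depart": "08:00", "arrive": "09:30", "gate": "A1"},
    {"flight_no": "OK123", "depart": "08:00", "arrive": "09:30", "gate": "B4"},
    {"flight_no": "LH456", "depart": "08:00", "arrive": "10:15", "gate": "A1"},
]
```

Here `fd_holds(FLIGHTS, {"flight_no"}, {"depart", "arrive"})` succeeds, while `{"depart"} => {"arrive"}` fails because the first and third tuples agree on the departure time but not on the arrival time.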
12
Fuzzy functional dependencies: syntax
Definition: a fuzzy functional dependence (FFD) over attributes $Y$ is a formula $A \Rightarrow B$, where $A, B \in L^Y$ (fuzzy sets of attributes)

Example
  $\{0.7/y_1\} \Rightarrow \{0.3/y_2\}$
  $\{y_1, y_3\} \Rightarrow \{y_4\}$ (ordinary dependence)
  $\{0.4/y_1, y_2, 0.1/y_3\} \Rightarrow \{y_3, 0.5/y_4\}$
  $\{\} \Rightarrow \{\}$ (empty)
Intended meaning of $A \Rightarrow B$
  as in the ordinary case, but equality is replaced by similarity:
  for any two tuples $x_1, x_2 \in X$, IF $x_1$ and $x_2$ have similar values on attributes from $A$ THEN $x_1$ and $x_2$ have similar values on attributes from $B$
  a new kind of dependencies (data mining appeal)
$A \Rightarrow B$ can be true to a degree from $L$, not only 0 or 1
degrees $A(y)$, $B(y)$ act as thresholds

13
Semantics of FFDs
D: table with similarities (for simplicity, ranks = 1)
Definition: the degree $\|A \Rightarrow B\|_D$ to which $A \Rightarrow B$ is true in D is defined by
$$\|A \Rightarrow B\|_D = \bigwedge_{x_1, x_2 \in X} \bigl( (x_1(A) \approx x_2(A)) \rightarrow (x_1(B) \approx x_2(B)) \bigr)$$
Remark: the ordinary meaning of functional dependencies is a particular case: $A$ and $B$ are ordinary sets and $\approx_y$ is ordinary equality for each $y \in Y$.
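
The degree of validity can be computed directly on small tables. The sketch below is an illustration under stated assumptions: the transcript does not spell out how $x_1(A) \approx x_2(A)$ is obtained from the attribute-wise similarities, so the code assumes the infimum over $y$ of $A(y) \rightarrow (x_1(y) \approx_y x_2(y))$, which matches the remark that the degrees $A(y)$ act as thresholds. The Gödel implication, the `closeness` similarity, and the car data are likewise hypothetical choices.

```python
from itertools import product

def godel_imp(a, b):
    """Goedel residuum on [0, 1]: a -> b is 1 if a <= b, else b."""
    return 1.0 if a <= b else b

def tuple_similarity(x1, x2, fuzzy_attrs, sims, imp):
    """Degree to which x1 and x2 are similar on a fuzzy set of attributes.

    Assumed definition: infimum over y of  A(y) -> (x1(y) ~_y x2(y)).
    """
    return min(
        (imp(degree, sims[y](x1[y], x2[y])) for y, degree in fuzzy_attrs.items()),
        default=1.0,
    )

def ffd_degree(table, lhs, rhs, sims, imp=godel_imp):
    """Degree ||lhs => rhs||_D for a non-empty table whose ranks are all 1."""
    return min(
        imp(
            tuple_similarity(x1, x2, lhs, sims, imp),
            tuple_similarity(x1, x2, rhs, sims, imp),
        )
        for x1, x2 in product(table, repeat=2)
    )

def equality(u, v):
    """Ordinary equality as a (crisp) similarity."""
    return 1.0 if u == v else 0.0

def closeness(scale):
    """Hypothetical similarity on numbers: 1 at distance 0, falling linearly to 0."""
    return lambda u, v: max(0.0, 1.0 - abs(u - v) / scale)

# Hypothetical data: engine size (in 100 ccm) and price.
CARS = [
    {"engine": 14, "price": 10},
    {"engine": 15, "price": 11},
    {"engine": 20, "price": 20},
]
SIMS = {"engine": closeness(10), "price": closeness(20)}
```

With these choices `ffd_degree(CARS, {"engine": 1.0}, {"price": 1.0}, SIMS)` is 1, because every pair of cars is at least as similar in price as in engine size, while the converse dependency holds only to a degree of about 0.4. With crisp attribute sets and `equality` as the similarity, the function reduces to the ordinary FD check.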
14
Semantics of FFD: models, entailment
D: table with similarities
Definition (models and entailment in ranked tables): let $T$ be a set of FFDs. The models of $T$ are
$$\mathrm{Mod}(T) = \{ D \mid \text{for each } A \Rightarrow B \in T:\ \|A \Rightarrow B\|_D = 1 \}$$
in words: D is a model of $T$ means that each FFD from $T$ is true in D
Definition (models and entailment in ranked tables): let $T$ be a set (fuzzy set) of FFDs. The degree of entailment of $A \Rightarrow B$ from $T$ is
$$\|A \Rightarrow B\|_T = \bigwedge_{D \in \mathrm{Mod}(T)} \|A \Rightarrow B\|_D$$
in words: the degree to which $A \Rightarrow B$ follows from $T$ = the degree to which $A \Rightarrow B$ is true in each model of $T$
15
Armstrong-like rules, provability, and completeness

Recall
  Armstrong W. W.: Dependency structures in data base relationships. IFIP Congress, Geneva, Switzerland, 1974.
  a system of deduction rules such that $A \Rightarrow B$ is entailed by $T$ iff $A \Rightarrow B$ is provable from $T$
in our setting, entailment is a matter of degree; two concepts of provability and completeness
  ordinary completeness (only degree 1 is interesting): $\varphi$ follows from $T$ iff $\varphi$ is provable from $T$
  graded completeness (any degree is interesting): degree to which $\varphi$ follows from $T$ = degree of provability of $\varphi$ from $T$
We present a syntactico-semantically complete (both types) system of Armstrong-like rules.

16
Armstrong-like rules, provability, and completeness

Deduction rules
  rules describing what FFDs can be inferred (in one elementary step) from other FFDs
  inspired by Armstrong's rules, several equivalent systems
  one of them (an elegant one) consists of the classical Armstrong rules and a fuzzy rule

17
Ordinary provability and completeness

Provability: $T$ ... theory (set of FFDs)
$A \Rightarrow B$ is provable from $T$, written $T \vdash A \Rightarrow B$, if there is a sequence $\varphi_1, \dots, \varphi_n$ of FFDs such that
  $\varphi_n$ is $A \Rightarrow B$,
  for each $\varphi_i$: $\varphi_i \in T$ or $\varphi_i$ is inferred from the preceding formulas (i.e., $\varphi_1, \dots, \varphi_{i-1}$) using one of the deduction rules (Ax)-(Cut).
Provability is a bivalent notion (either $T \vdash A \Rightarrow B$ or $T \nvdash A \Rightarrow B$).

Theorem (ordinary completeness): $\|A \Rightarrow B\|_T = 1$ ($A \Rightarrow B$ follows from $T$ in degree 1) iff $T \vdash A \Rightarrow B$ ($A \Rightarrow B$ is provable from $T$)
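
For the ordinary (non-fuzzy) special case, provability from Armstrong's rules can be decided with the well-known attribute-closure procedure: $T \vdash A \Rightarrow B$ iff $B$ is contained in the closure of $A$ under $T$. The sketch below covers only this classical case, not the graded system of the talk; the function names and the example theory are illustrative assumptions.

```python
def attribute_closure(attrs, theory):
    """Smallest superset of `attrs` closed under all FDs (lhs, rhs) in `theory`."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in theory:
            # Fire an FD whose left-hand side is already contained in the closure.
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

def provable(theory, lhs, rhs):
    """Decide T |- lhs => rhs for ordinary FDs via attribute closure."""
    return set(rhs) <= attribute_closure(lhs, theory)

# Hypothetical theory over attributes y1..y4.
THEORY = [({"y1"}, {"y2"}), ({"y2", "y3"}, {"y4"})]
```

For example, `provable(THEORY, {"y1", "y3"}, {"y4"})` holds, whereas `{"y1"} => {"y4"}` is not provable because `y3` is never derived.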
18
Graded provability and completeness
Provability is a bivalent notion (either $T \vdash A \Rightarrow B$ or $T \nvdash A \Rightarrow B$).
Can we capture a degree of semantic entailment syntactically (i.e., by a modification of the concept of proof)?
Graded provability: $T$ ... set of FFDs; $|A \Rightarrow B|_T \in L$ is the degree to which $A \Rightarrow B$ is provable from $T$ (details in proceedings)
Theorem (graded completeness): $\|A \Rightarrow B\|_T = |A \Rightarrow B|_T$ (degree of entailment = degree of provability).
19
Non-redundant bases of FFDs
aim: large sets of FFDs → small sets of FFDs (equally informative)
example: given a ranked table D with similarities, extract the FFDs true in D, but only the essential ones

Definition (complete set of FFDs)
A set $T$ of FFDs is complete in D if
  for each $A \Rightarrow B \in T$: $\|A \Rightarrow B\|_D = 1$ (each FFD from $T$ is true in D)
  for each $A \Rightarrow B$: $\|A \Rightarrow B\|_D = \|A \Rightarrow B\|_T$

a complete set $T$ of D fully describes the validity of FFDs in D
20
Non-redundant bases of FFDs (cntd.)

Definition (non-redundant basis of D)
A set $T$ of FFDs is a non-redundant basis of D if
  $T$ is complete in D
  no $T' \subset T$ is complete in D

In what follows: computation of particular non-redundant bases based on so-called pseudo-intents
21
Non-redundant bases using pseudo-intents
Definition (pseudo-intents of D): a system of pseudo-intents of a ranked table D with similarities is a system $\mathcal{P}$ of fuzzy sets of attributes such that $P \in \mathcal{P}$ iff ... (detailed description in proceedings)
the role of pseudo-intents
Theorem (non-redundant basis based on pseudo-intents): if $\mathcal{P}$ is a system of pseudo-intents, then $T = \{ P \Rightarrow C(P) \mid P \in \mathcal{P} \}$ is a non-redundant basis of D
$C(P)$ is a particular modification of $P$, details omitted.
22
Computing pseudo-intents (non-redundant bases)
Theorem (pseudo-intents from fixpoints of $cl_T$): let $\mathcal{P}$ be a system of pseudo-intents of D. Then
$$\mathcal{P} = \{ P \in \mathrm{fix}(cl_T) \mid P \neq C(P) \}$$
where $cl_T$ is defined by: for $Z \in L^Y$ we put ... (an operator on L-sets in $Y$), and
$$\mathrm{fix}(cl_T) = \{ P \mid cl_T(P) = P \}$$
is the set of fixpoints of $cl_T$
23
Computing non-redundant bases (algorithm)
Input: D (data table over domains with similarity relations)
Output: $\mathcal{P}$ (system of pseudo-intents)
  $B := \emptyset$
  if $B \neq C(B)$: add $B$ to $\mathcal{P}$
  while $B \neq Y$:
    $T := \{ P \Rightarrow C(P) \mid P \in \mathcal{P} \}$
    $B := B^{+}$ ($B^{+}$ is the lectically smallest fixed point of $cl_T$ which is a successor of $B$)
    if $B \neq C(B)$: add $B$ to $\mathcal{P}$
complexity: polynomial time delay
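
The loop above has the same shape as the classical NextClosure computation of pseudo-intents for a binary (yes/no) table. The sketch below implements only that crisp special case, with $C(P)$ taken to be the ordinary closure $P''$ and ordinary sets in place of fuzzy sets; it is not the fuzzy algorithm of the talk, whose operators are defined in the proceedings. All function names and the example table are assumptions.

```python
def context_closure(attrs, rows, all_attrs):
    """B'' in a binary table: the attributes shared by every row that contains B."""
    closed = set(all_attrs)
    for row in rows:
        if attrs <= row:
            closed &= row
    return frozenset(closed)

def implication_closure(attrs, basis):
    """cl_T: add C(P) for every P => C(P) in the basis with P a proper subset."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in basis:
            if premise < closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return frozenset(closed)

def lectic_successor(current, order, basis):
    """Lectically smallest fixed point of cl_T that follows `current`, or None."""
    kept = set(current)
    for i in reversed(range(len(order))):
        if order[i] in kept:
            kept.discard(order[i])
        else:
            candidate = implication_closure(kept | {order[i]}, basis)
            # Accept only if no attribute before position i was newly added.
            if all(order.index(y) >= i for y in candidate - kept):
                return candidate
    return None

def pseudo_intent_basis(rows, order):
    """Non-redundant basis {P => P''} of a binary table (crisp special case)."""
    rows = [frozenset(r) for r in rows]
    basis = []
    current = frozenset()
    while current is not None:
        closed = context_closure(current, rows, order)
        if closed != current:  # B != C(B): B is a pseudo-intent
            basis.append((current, closed))
        current = lectic_successor(current, order, basis)
    return basis

# Hypothetical binary table: each row is the set of attributes an object has.
ROWS = [{"a", "b"}, {"a", "b", "c"}, {"c"}]
BASIS = pseudo_intent_basis(ROWS, ["a", "b", "c"])
```

For this table the basis consists of `{a} => {a, b}` and `{b} => {a, b}`; every other implication valid in the table (for instance `{a, c} => {a, b, c}`) follows from these two.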
24
Relational algebra and calculus

basic traits (details in proceedings and a forthcoming paper)
extension of classical relational algebra which takes similarities into account
relational algebra operations
  counterparts to Boolean operations (union, ...)
  new operations arising within the framework of fuzzy logic (e.g. based on thresholds, like the a-cut ${}^{a}D = \{ t \mid D(t) \geq a \}$)
  operations where exact matches are extended by similarity-based matches (selection, join, ...)
  further operations, e.g. top-k (the best k tuples satisfying a query; of considerable interest)
relational calculus based on formal predicate fuzzy logic (non-standard issues such as the quantifier "most" are essential)
well-founded, as in the classical case

Theorem (equivalence theorem): relational algebra and relational calculus for the extended model are equivalent.
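
A few of these operations can be sketched on a ranked table represented as a mapping from tuples to ranks. This is a simplified illustration, not the algebra defined in the proceedings: union is assumed to take the maximum of ranks (the lattice join on $[0, 1]$), and `top_k` breaks ties at the cut-off arbitrarily by tuple order, which a full definition may handle differently.

```python
def a_cut(ranked, a):
    """a-cut: keep the tuples whose rank is at least a."""
    return {t: r for t, r in ranked.items() if r >= a}

def top_k(ranked, k):
    """The k best-ranked tuples (ties at the cut-off are broken by tuple order)."""
    best = sorted(ranked.items(), key=lambda item: (-item[1], item[0]))[:k]
    return dict(best)

def union(d1, d2):
    """Union of ranked tables: a tuple's rank is the maximum of its two ranks."""
    return {t: max(d1.get(t, 0.0), d2.get(t, 0.0)) for t in d1.keys() | d2.keys()}

# Ranks of "large population" for four countries, taken from Example I of the talk.
LARGE = {("China",): 1.0, ("India",): 1.0, ("USA",): 0.6, ("Russia",): 0.3}
```

For instance, `a_cut(LARGE, 0.6)` keeps China, India, and the USA, and `top_k(LARGE, 2)` keeps China and India.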
25
Example I: Select power production of countries with large population

D(t)  Country  COU  Population  Coal   Air   Water  Nuclear
----  -------  ---  ----------  -----  ----  -----  -------
1.0   China    Cn   1300000000  498    246   196    34.6
1.0   India    In   1000000000  154    1032  75     24.8
.6    USA      US   300000000   570.7  2533  330    743.9
.3    Russia   Ru   145000000   115.8  54    157    122.5
.3    Japan    Jp   127000000   0      120   90     293.8
.2    Germany  Ge   90000000    56.4   3817  50     161.2
.2    UK       UK   80000000    19.5   350   8      87.1
.2    France   Fr   80000000    0      63    62     394.4
.1    Spain    Sp   40000000    10.9   1180  11     58.9

D(t): degree of large population for each tuple
26
Implementation of domains with similarities I
Similarity of power production from coal is defined by a table

Similarity table
c     Cn   In   US   Ru   Jp   Ge   Fr   UK   Sp
----  ---  ---  ---  ---  ---  ---  ---  ---  ---
Cn    1         .3
In         1         .6
US    .3        1
Ru         .6        1         .4
Jp                        1    .4   1    .8   .9
Ge                   .4   .4   1    .4   .7   .6
Fr                        1    .4   1    .8   .9
UK                        .8   .7   .8   1    1
Sp                        .9   .6   .9   1    1

DDL
  CREATE TABLE t_sim_coal (
    country_code1 VARCHAR(2),
    country_code2 VARCHAR(2),
    similarity    NUMBER(3,2),
    CONSTRAINT t_sim_coal_pk PRIMARY KEY ( country_code1, country_code2 )
  )

COUNTRY_CODE1 COUNTRY_CODE2 SIMILARITY
------------- ------------- ----------
Cn            Cn                     1
Cn            US                    .3
In            In                     1
In            Ru                    .6
US            US                     1
Ru            Ru                     1
...
27
Similarity defined by table: performance issues
Similarity table t_sim_coal (DDL as on the previous slide)

The similarity of two countries is retrieved by a join in the following steps (let's suppose that both country codes are available from a main loop):
  Find the ROWID in the index t_sim_coal_pk for the two given country codes.
  Retrieve the similarity from the table t_sim_coal using the ROWID and provide it for further query execution.
There is thus an additional step: retrieving the ROWID, which is not needed for the result.
t_sim_coal should be replaced by a database structure supporting search which returns the similarity value immediately instead of the ROWID (i.e. an index-organized table).
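
The idea of storing the value inside the key index can be tried out with SQLite, whose `WITHOUT ROWID` tables keep the whole row in the primary-key B-tree. The sketch below is an analogy run through Python's `sqlite3` module, not Oracle syntax; the symmetric lookup and the convention that unlisted pairs have similarity 0 are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID makes the primary-key B-tree store the whole row, so a key
# lookup reaches the similarity value directly (SQLite's analog of an
# index-organized table).
conn.execute(
    """
    CREATE TABLE t_sim_coal (
        country_code1 TEXT NOT NULL,
        country_code2 TEXT NOT NULL,
        similarity    REAL NOT NULL,
        PRIMARY KEY (country_code1, country_code2)
    ) WITHOUT ROWID
    """
)
# Sample rows taken from the listing on the slide.
conn.executemany(
    "INSERT INTO t_sim_coal VALUES (?, ?, ?)",
    [("Cn", "Cn", 1.0), ("Cn", "US", 0.3), ("In", "In", 1.0),
     ("In", "Ru", 0.6), ("US", "US", 1.0), ("Ru", "Ru", 1.0)],
)

def coal_similarity(code1, code2):
    """Look up the similarity of two countries in either orientation.

    Assumption: pairs that are not listed have similarity 0.
    """
    row = conn.execute(
        "SELECT similarity FROM t_sim_coal"
        " WHERE (country_code1 = ? AND country_code2 = ?)"
        "    OR (country_code1 = ? AND country_code2 = ?)",
        (code1, code2, code2, code1),
    ).fetchone()
    return row[0] if row else 0.0
```

A call such as `coal_similarity("US", "Cn")` is answered from the key structure alone, without a second lookup into a separate heap table.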

28
Implementation of domains with similarities II
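Ranking of large population is defined by a function
Definition in an RDBMS with a procedural extension
  create or replace function large_population (
    p_population in number
  ) return number
  is
    -- a population at or above this value counts as fully "large" (degree 1)
    large_popul_c constant number := 500000000;
  begin
    -- the degree grows linearly with the population and is capped at 1
    return least(p_population / large_popul_c, 1);
  end;
  /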
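
As a quick cross-check, the same ranking can be written in a few lines of Python and applied to the populations of Example I. The threshold of 500,000,000 comes from the function above; rounding to one decimal is an assumption about how the D(t) column of Example I was displayed.

```python
LARGE_POPULATION_THRESHOLD = 500_000_000

def large_population(population):
    """Degree to which `population` is large: linear up to the threshold, then 1."""
    return min(population / LARGE_POPULATION_THRESHOLD, 1)

# Populations from Example I of the talk.
POPULATIONS = {
    "China": 1_300_000_000, "India": 1_000_000_000, "USA": 300_000_000,
    "Russia": 145_000_000, "Japan": 127_000_000, "Germany": 90_000_000,
    "UK": 80_000_000, "France": 80_000_000, "Spain": 40_000_000,
}

# Ranks rounded to one decimal, as assumed for the D(t) column of Example I.
RANKS = {country: round(large_population(p), 1) for country, p in POPULATIONS.items()}
```

The rounded values agree with the D(t) column of Example I (for example 0.6 for the USA and 0.3 for Russia and Japan).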
29
Ranking defined by function - performance considerations

Ranking or similarity defined by a function can lead to decreased performance during SQL execution on large data.
It is possible to create an index based on a function that takes row columns as parameters. But the index maps the result to a ROWID, which is not very helpful.
Extending the classical B-tree by degree values in the leaves: the leaves would consist of the indexed column, the degree value, and the ROWID. The extended B-tree would support operations like top-k or a-cut very efficiently, provided that the ranking function is monotone.

30
Extended B-tree
[Figure: extended B-tree over the population column. Keys shown: 90,000,000; 127,000,000; 145,000,000; 300,000,000; 1,000,000,000; 1,300,000,000. Degrees of large population shown in the leaves: 0.2, 0.3, 0.3, 0.6, 1, 1.]

The example above shows the a-cut of large population (a = 0.3).
When the leftmost leaf with a degree of large population of at least 0.3 is found, the leaves to its right are read sequentially.
the time delay is logarithmic
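
The access pattern described here (one logarithmic search, then a sequential scan) can be imitated with a sorted list and binary search. The sketch below is a stand-in for the leaf level of such a tree, not a B-tree implementation; it relies on the ranking function being non-decreasing in the key, and it rounds degrees to one decimal on the assumption that this is how the slides display them.

```python
from bisect import bisect_left

def large_population(population):
    """Degree of large population, rounded to one decimal (assumed display format)."""
    return round(min(population / 500_000_000, 1.0), 1)

class RankedIndex:
    """Sorted (key, degree) entries standing in for the leaves of the extended B-tree."""

    def __init__(self, keys, rank):
        self.keys = sorted(keys)
        # The degree is stored next to each key, so no extra lookup is needed.
        self.degrees = [rank(k) for k in self.keys]

    def a_cut(self, a):
        """Entries with degree >= a; valid because `rank` is non-decreasing in the key."""
        start = bisect_left(self.degrees, a)  # logarithmic search for the leftmost hit
        return list(zip(self.keys[start:], self.degrees[start:]))  # sequential scan

    def top_k(self, k):
        """The k entries with the highest degrees, best first."""
        if k <= 0:
            return []
        return list(zip(self.keys[-k:], self.degrees[-k:]))[::-1]

INDEX = RankedIndex(
    [1_300_000_000, 145_000_000, 127_000_000, 90_000_000, 1_000_000_000, 300_000_000],
    large_population,
)
```

`INDEX.a_cut(0.3)` skips the entry for 90,000,000 (degree 0.2) and returns the remaining five entries in key order.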

31
Implementation of ranked table in ORDBMS (cntd.)
Let's define an object view, starting with its object type
Definition of the object type in ORDBMS Oracle10g
  create or replace type powerprod_t AS OBJECT (
    country    varchar2(30),
    population number,
    coal       number,
    air        number,
    water      number,
    nuclear    number,
    MEMBER FUNCTION similar_coal(itupple in powerprod_t) return number,
    MEMBER FUNCTION similar_air(itupple in powerprod_t) return number,
    MEMBER FUNCTION similar_water(itupple in powerprod_t) return number,
    MEMBER FUNCTION similar_nuclear(itupple in powerprod_t) return number,
    MEMBER FUNCTION large_popul(itupple in powerprod_t) return number
  )
  /
32
Implementation of ranked table in ORDBMS (cntd.)
Let's define an object view over the table t_countries using the object type powerprod_t.
Definition of the object view powerprod_v in ORDBMS Oracle10g
  create or replace view powerprod_v of powerprod_t
    with object identifier (country) as
  select country, population, coal, air, water, nuclear
  from t_countries
33
Implementation of ranked table in ORDBMS (cntd.)
And now we can select, e.g., all countries having similar power production from nuclear power plants
Query over the object view powerprod_v in ORDBMS Oracle10g
  select a.similar_nuclear(value(b)) "D(t)",
         a.country    "Country",
         a.population "Population",
         a.coal       "Coal",
         a.air        "Air",
         a.water      "Water",
         a.nuclear    "Nuclear"
  from powerprod_v a, powerprod_v b
  where b.country = 'Japan'
  order by 1 desc
  /
D(t) represents the similarity of nuclear power plant production with Japan.
Note that similar_nuclear looks up similarities in the table t_sim_nuclear.
34
Implementation of ranked table in ORDBMS (result of the example)

D(t)  Country  Population  Coal   Air   Water  Nuclear
----  -------  ----------  -----  ----  -----  -------
1     Japan    127000000   0      120   90     293.8
.6    France   80000000    0      63    62     394.4
.4    Germany  90000000    56.4   3817  50     161.2
.1    Russia   145000000   115.8  54    157    122.5
0     India    1000000000  154    1032  75     24.8
0     USA      300000000   570.7  2533  330    743.9
0     Spain    40000000    10.9   1180  11     58.9
0     UK       80000000    19.5   350   8      87.1
0     China    1300000000  498    246   196    34.6

D(t) represents the similarity of nuclear power plant production with Japan
35
Future research

further study of the extended Codd's model (foundations, algorithms, implementation),
connection to existing work on RankSQL, to work on algorithms, ...,
further data dependencies,
data redundancy (approximate redundancy),
data mining aspects,
implications true in degrees other than 1 (at least a, etc.), their bases, ...
involve tolerance, e.g. an almost complete basis: can it be smaller?
...
