Title: Deductive databases
 1Deductive databases
- Toon Calders 
- t.calders_at_tue.nl
2Motivation Deductive DB
- Motivation is two-fold 
- Add deductive capabilities to databases; the database contains
- facts (extensional relations) 
- rules to generate derived facts (intensional relations)
- The database becomes a knowledge base 
- Extend the querying 
- Datalog allows for recursion
3Motivation Deductive DB
- Datalog as engine of deductive databases 
- similarities with Prolog 
- has facts and rules 
- rules define -possibly recursive- views 
- Semantics not always clear 
- safety 
- negation 
- recursion
4Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
5Syntax of Datalog
- Datalog query/program 
- facts → traditional relational tables 
- rules → define intensional views 
- Rules 
- if-then rules 
- can contain recursion 
- can contain negations 
- Semantics of program can be ambiguous
6Syntax of Datalog
- Example 
-  father(X,Y) :- person(X,m), parent(X,Y). 
-  grandson(X,Y) :- parent(Y,Z), parent(Z,X), person(X,m).
-  hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
7Syntax of Datalog
- Variables: X, Y 
- Constants: m, f, rita, ... 
- Positive literal: p(t1, ..., tn) 
- p is the name of a relation (EDB or IDB) 
- t1, ..., tn are constants or variables 
- Negative literal: not p(t1, ..., tn) 
- Rule: h :- l1, ..., ln 
- h is a positive literal; l1, ..., ln are literals
In Datalog: correct negation (in contrast to Prolog's negation by failure)
 8Syntax of Datalog
- A rule can be recursive 
- Arithmetic operations are considered as special predicates
- A < B: smaller(A,B) 
- A + B = C: plus(A,B,C)
9Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- non-recursive 
- recursive datalog 
- aggregation 
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
10Semantics of Non-Recursive Datalog Programs
- Ground instantiation of a rule h :- l1, ..., ln: replace every variable in the rule by a constant
- Example 
-  father(X,Y) :- person(X,m), parent(X,Y). 
- instantiation 
-  father(toon,an) :- person(toon,m), parent(toon,an).
11Semantics of Non-Recursive Datalog Programs
- Let I be a set of facts 
- The body of a rule instantiation R is satisfied 
 by I if
- every positive literal in the body of R is in I 
- no negative literal in the body of R is in I 
- Example 
-  the body person(toon,m), parent(toon,an) is not satisfied by the facts given before
12Semantics of Non-Recursive Datalog Programs
- Let I be a set of facts 
- R is a rule h :- l1, ..., ln 
- Infer(R,I) = { h | h :- l1, ..., ln is a ground instantiation of R and l1, ..., ln is satisfied by I }
- For a set of rules R = {R1, ..., Rn}: Infer(R,I) = Infer(R1,I) ∪ ... ∪ Infer(Rn,I)
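The definition of Infer can be sketched in Python. This is only an illustration: the rule encoding (head, positive literals, negative literals), the uppercase-variable convention, the helper names and the sample facts are all assumptions, and the brute-force enumeration of ground instantiations is chosen for clarity, not efficiency.

```python
from itertools import product

def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[:1].isupper()

def ground(atom, binding):
    # Replace every variable of the atom by the constant bound to it.
    pred, args = atom
    return (pred, tuple(binding.get(a, a) for a in args))

def infer(rule, facts, constants):
    """Heads of all ground instantiations of `rule` whose body is satisfied by `facts`."""
    head, positives, negatives = rule
    atoms = [head, *positives, *negatives]
    variables = sorted({a for _, args in atoms for a in args if is_var(a)})
    derived = set()
    for values in product(sorted(constants), repeat=len(variables)):
        binding = dict(zip(variables, values))
        pos_ok = all(ground(l, binding) in facts for l in positives)
        neg_ok = all(ground(l, binding) not in facts for l in negatives)
        if pos_ok and neg_ok:
            derived.add(ground(head, binding))
    return derived

def infer_all(rules, facts, constants):
    # Infer for a set of rules is the union of Infer over the individual rules.
    return set().union(*(infer(r, facts, constants) for r in rules))

# Hypothetical facts, not the ones from the original slides.
facts = {("person", ("toon", "m")), ("person", ("an", "f")),
         ("parent", ("toon", "bernd")), ("parent", ("an", "bernd"))}
constants = {"toon", "an", "bernd", "m", "f"}

# father(X,Y) :- person(X,m), parent(X,Y).
father_rule = (("father", ("X", "Y")),
               [("person", ("X", "m")), ("parent", ("X", "Y"))],
               [])

print(infer(father_rule, facts, constants))
```

With these sample facts only the instantiation X = toon, Y = bernd has a satisfied body.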
13Semantics of Non-Recursive Datalog Programs
- A rule h :- l1, ..., ln is in layer 1 if 
- l1, ..., ln only involve extensional predicates 
- A rule h :- l1, ..., ln is in layer i if 
- for all 0 < j < i, it is not in layer j 
- l1, ..., ln only involve predicates that are extensional or defined in the layers 1, ..., i-1
14Semantics of Non-Recursive Datalog Programs
- Let I0 be the facts in a Datalog program 
-  Let R1 be the rules at layer 1 
-  ... 
-  Let Rn be the rules at layer n 
- I1 = I0 ∪ Infer(R1, I0) 
-  I2 = I1 ∪ Infer(R2, I1) 
-  ... 
-  In = In-1 ∪ Infer(Rn, In-1)
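The layer-by-layer computation can be sketched as follows. The facts are a small hypothetical family, each rule is hand-coded as a Python function over the current database, and the names `evaluate_layers`, `father` and `grandfather` are illustrative assumptions.

```python
def evaluate_layers(edb, layers):
    """Compute I_n from I_0 = edb, one layer at a time."""
    db = {name: set(rows) for name, rows in edb.items()}
    for rules in layers:
        new = {}
        for head, rule in rules:
            # Every rule of layer k reads the same snapshot I_{k-1}.
            new.setdefault(head, set()).update(rule(db))
        for head, rows in new.items():
            # I_k = I_{k-1} united with Infer(R_k, I_{k-1})
            db.setdefault(head, set()).update(rows)
    return db

# Hypothetical extensional facts.
edb = {
    "person": {("alex", "m"), ("toon", "m"), ("bernd", "m")},
    "parent": {("alex", "toon"), ("toon", "bernd")},
}

def father(db):
    # father(X,Y) :- person(X,m), parent(X,Y).
    return {(x, y) for (x, sex) in db["person"] if sex == "m"
            for (p, y) in db["parent"] if p == x}

def grandfather(db):
    # grandfather(X,Z) :- father(X,Y), parent(Y,Z).
    return {(x, z) for (x, y) in db["father"]
            for (p, z) in db["parent"] if p == y}

layers = [[("father", father)], [("grandfather", grandfather)]]
result = evaluate_layers(edb, layers)
print(result["father"], result["grandfather"])
```

Because grandfather only reads father, which is complete after layer 1, each rule needs to be applied exactly once.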
15Semantics of Non-Recursive Datalog Programs
father(X,Y) :- person(X,m), parent(X,Y).
grandfather(X,Z) :- father(X,Y), parent(Y,Z).
hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
 16Semantics of Non-Recursive Datalog Programs
Stratum 0: the extensional relations person and parent
father(X,Y) :- person(X,m), parent(X,Y).
grandfather(X,Z) :- father(X,Y), parent(Y,Z).
hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
 17Semantics of Non-Recursive Datalog Programs
Stratum 0: the extensional relations person and parent
Stratum 1:
- father: (alex, toon), (jan, an), (toon, bernd), (toon, mattijs)
- hbrothers: (bernd, mattijs), (mattijs, bernd), (mattijs, mattijs), (bernd, bernd)
father(X,Y) :- person(X,m), parent(X,Y).
grandfather(X,Z) :- father(X,Y), parent(Y,Z).
hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
 18Semantics of Non-Recursive Datalog Programs
Stratum 0: the extensional relations person and parent
Stratum 1:
- father: (alex, toon), (jan, an), (toon, bernd), (toon, mattijs)
- hbrothers: (bernd, mattijs), (mattijs, bernd), (mattijs, mattijs), (bernd, bernd)
Stratum 2:
- grandfather: (jan, mattijs), (jan, bernd), (alex, mattijs), (alex, bernd)
father(X,Y) :- person(X,m), parent(X,Y).
grandfather(X,Z) :- father(X,Y), parent(Y,Z).
hbrothers(X,Y) :- person(X,m), person(Y,m), parent(Z,X), parent(Z,Y).
 19Caveat Correct Negation
- Negation in Datalog ≠ negation in Prolog 
- Prolog negation (negation by failure) 
- not(p(X)) is true if we fail to prove p(X) 
- Datalog negation (correct negation) 
- not(p(X)) binds X to a value such that p(X) does not hold.
20Caveat Correct Negation
- Example 
- father(a,b). person(a). person(b). 
- nfather(X) :- person(X), not(father(X,Y)), person(Y).
- Datalog 
- ?- nfather(X) → { (a), (b) } 
- Prolog 
- ?- nfather(X) → { (b) } 
- ?- person(a), not(father(a,a)), person(a) → yes 
21Caveat Correct Negation
- Prolog 
- Order of the literals in a rule body is important 
-  nfather(X) :- person(X), not(father(X,Y)), person(Y).
-  versus 
-  nfather(X) :- person(X), person(Y), not(father(X,Y)).
- Order of the rules is important 
- Datalog 
- Order not important 
- More declarative
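The difference between the two readings of nfather can be mimicked in a few lines of Python. This is a hand-written simulation of the example above, not a Prolog or Datalog engine; the variable names are illustrative.

```python
person = ["a", "b"]
father = {("a", "b")}

# Datalog (correct negation): Y is an ordinary variable, so the rule fires
# for X as soon as SOME person Y makes father(X, Y) false.
datalog = {x for x in person for y in person if (x, y) not in father}

# Prolog (negation by failure), literals tried left to right:
# nfather(X) :- person(X), not(father(X,Y)), person(Y).
# Y is still unbound when the negation is reached, so the test only
# succeeds if father(X, _) has no solution at all.
prolog = [x for x in person
          if not any(f == x for (f, _) in father)
          for y in person]

# Prolog with person(Y) moved before the negation:
# nfather(X) :- person(X), person(Y), not(father(X,Y)).
# Now Y is bound when the negation is tested.
prolog_reordered = [x for x in person for y in person
                    if (x, y) not in father]

print(sorted(datalog), sorted(set(prolog)), sorted(set(prolog_reordered)))
```

In this simulation the reordered Prolog rule returns the same answers as the Datalog reading, which is why the literal order matters in Prolog but not in Datalog.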
22Caveat Correct Negation
- Difference is not fundamental 
- Prolog 
-  nfather(X) :- person(X), not(father(X,Y)). 
- is equivalent to 
- Datalog 
-  nfather(X) :- person(X), not(father_of_someone(X)).
-  father_of_someone(X) :- father(X,Y). 
23Caveat Correct Negation
- Difference is not fundamental 
- Many systems that claim to implement Datalog actually implement negation by failure.
- Debating whether or not this is correct is pointless; both perspectives are useful
- Check beforehand how an engine implements negation
- Throughout the course, in all exercises, in the exam, ..., we assume correct negation.
24Safety
- A rule can make no sense if variables appear in 
 funny ways
- Examples 
- S(x) :- R(y) 
- S(x) :- not R(x) 
- S(x) :- R(y), x < y 
- In each of these cases the result is infinite even if the relation R is finite
25Safety
- Even when not leading to infinite relations, such 
 Datalog Programs can be domain-dependent.
- Example 
- s(a,b). s(a,a). r(a). r(b). 
- t(X) :- not(s(X,Y)), r(X). 
- If the domain is {a, b} 
-  only t(b) holds. 
- If the domain is {a, b, c} 
-  not only t(b), but also t(a) holds 
-  (ground instantiation: t(a) :- not(s(a,c)), r(a).)
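The domain dependence of this rule can be checked directly. The sketch below hand-codes the rule as a Python set comprehension in which the unsafe variable Y ranges over an explicitly supplied domain; the function name `t` is an illustrative choice.

```python
s = {("a", "b"), ("a", "a")}
r = {"a", "b"}

def t(domain):
    # t(X) :- not(s(X,Y)), r(X).
    # Y occurs only in the negated literal, so it ranges over the whole domain.
    return {x for x in r for y in domain if (x, y) not in s}

print(sorted(t({"a", "b"})))
print(sorted(t({"a", "b", "c"})))
```

The same program and the same facts give different answers depending only on which constants are considered part of the domain.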
26Safety
- Therefore, we will only consider rules that are 
 safe.
- A rule h :- l1, ..., ln is safe if 
- every variable in the head of the rule also occurs in a non-arithmetic positive literal in the body
- every variable in a negative literal of the body also occurs in some positive literal of the body
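A safety check following these two conditions can be sketched as below. The rule encoding, the uppercase-variable convention and the set `ARITHMETIC` of built-in predicate names are assumptions made for the illustration.

```python
ARITHMETIC = {"smaller", "plus"}  # assumed names of the arithmetic predicates

def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[:1].isupper()

def is_safe(head, body):
    """head: (pred, args); body: list of (pred, args, is_positive)."""
    positive_vars = {a for _, args, pos in body if pos
                     for a in args if is_var(a)}
    non_arith_vars = {a for pred, args, pos in body
                      if pos and pred not in ARITHMETIC
                      for a in args if is_var(a)}
    head_vars = {a for a in head[1] if is_var(a)}
    negative_vars = {a for _, args, pos in body if not pos
                     for a in args if is_var(a)}
    # Condition 1: head variables occur in a non-arithmetic positive literal.
    # Condition 2: variables of negative literals occur in a positive literal.
    return head_vars <= non_arith_vars and negative_vars <= positive_vars

# s(X) :- r(Y).
print(is_safe(("s", ("X",)), [("r", ("Y",), True)]))
# s(X) :- not r(X).
print(is_safe(("s", ("X",)), [("r", ("X",), False)]))
# s(X) :- r(Y), smaller(X,Y).
print(is_safe(("s", ("X",)), [("r", ("Y",), True), ("smaller", ("X", "Y"), True)]))
# father(X,Y) :- person(X,m), parent(X,Y).
print(is_safe(("father", ("X", "Y")),
              [("person", ("X", "m"), True), ("parent", ("X", "Y"), True)]))
```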
27Model-Theoretic Semantics
- A model M of a Datalog program is 
- An instantiation of all intensional relations in the program
- That satisfies all rules in the program 
- If the body of a ground instantiation of a rule holds in M, the head must hold as well
- Some models are special 
28Model-Theoretic Semantics
- father(a,b). 
- person(X) :- father(X,Y). 
- person(Y) :- father(X,Y). 
- M1: father = { (a,b) }, person = { (a), (b) } 
- M2: father = { (a,b), (b,a), (a,a) }, person = { (a), (b) } 
29Model-Theoretic Semantics
- A model is minimal if no tuple can be removed from it without it ceasing to be a model 
- M1: father = { (a,b) }, person = { (a), (b) } (minimal) 
- M2: father = { (a,b), (b,a), (a,a) }, person = { (a), (b) } (not minimal) 
 30Model-Theoretic Semantics
- For non-recursive, safe Datalog programs the semantics is well defined
- The model = all facts that can be derived from the program
- Closed-World Assumption: if a fact cannot be derived from the database, then it is not true
- It is a minimal model
31Model-Theoretic Semantics
- Minimal model is, however, not necessarily unique 
- Example 
-  r(a). 
-  t(X) :- r(X), not s(X). 
-  minimal models: 
-  M1: r = { (a) }, s = { }, t = { (a) } 
-  M2: r = { (a) }, s = { (a) }, t = { } 
32Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- non-recursive 
- recursive datalog 
- aggregation 
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
33Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d). 
- reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y). 
- reach(X,Y) :- g(X,Y). 
- reach(X,Z) :- reach(X,Y), reach(Y,Z). 
- Fixpoint of a set of rules R, starting with a set of facts I
- repeat 
-  Old_I := I; I := I ∪ infer(R,I) 
- until I = Old_I 
- Always terminates (inflationary fixpoint)
34Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d). 
- reach(X,X) :- g(X,Y). reach(Y,Y) :- g(X,Y). 
- reach(X,Y) :- g(X,Y). 
- reach(X,Z) :- reach(X,Y), reach(Y,Z). 
- Step 0: reach = { } 
- Step 1: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d) }
- Step 2: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d), (a,c) }
- Step 3: reach = { (a,a), (b,b), (c,c), (d,d), (a,b), (b,c), (a,d), (a,c) }: no change, STOP
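The fixpoint loop and the trace above can be reproduced with a small sketch. The four reach rules are hand-coded as set comprehensions; the function names are illustrative.

```python
g = {("a", "b"), ("b", "c"), ("a", "d")}

def infer(reach):
    """One application of all four reach rules to the current facts."""
    new = set()
    new |= {(x, x) for (x, _) in g}            # reach(X,X) :- g(X,Y).
    new |= {(y, y) for (_, y) in g}            # reach(Y,Y) :- g(X,Y).
    new |= set(g)                              # reach(X,Y) :- g(X,Y).
    new |= {(x, z) for (x, y) in reach         # reach(X,Z) :- reach(X,Y),
            for (y2, z) in reach if y == y2}   #              reach(Y,Z).
    return new

def naive_fixpoint():
    reach = set()
    steps = [set(reach)]                       # step 0: empty relation
    while True:
        old = reach
        reach = old | infer(old)               # I := I united with infer(R, I)
        steps.append(set(reach))
        if reach == old:                       # until I = Old_I
            return reach, steps

reach, steps = naive_fixpoint()
for i, step in enumerate(steps):
    print(i, sorted(step))
```

Since facts are only ever added and only finitely many tuples over the constants exist, the loop has to stop.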
35Semantics of Recursive Datalog Programs
- Datalog without negation 
- Always a unique minimal model. 
- Semantics of recursive datalog with negation is 
 less clear.
- Example 
-  T(a). 
-  R(X) :- T(X), not S(X). 
-  S(X) :- T(X), not R(X). 
- What about R(a)? S(a)?
36Semantics of Recursive Datalog Programs
- For some classes of Datalog queries with negation a natural semantics can still be defined
- Important class: stratified programs 
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on (not S), then S cannot depend on T or (not T).
37Semantics of Recursive Datalog Programs
- The program 
-  T(a). 
-  R(X) :- T(X), not S(X). 
-  S(X) :- T(X), not R(X). 
- is not stratified 
- R depends negatively on S 
- S depends negatively on R
[Figure: dependency graph over R, T and S]
 38Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d). 
- reach(X,X) :- g(X,Y). 
- reach(Y,Y) :- g(X,Y). 
- reach(X,Y) :- g(X,Y). 
- reach(X,Z) :- reach(X,Y), reach(Y,Z). 
- node(X) :- g(X,Y). 
- node(Y) :- g(X,Y). 
- unreach(X,Y) :- node(X), node(Y), not reach(X,Y).
[Figure: dependency graph over g, reach, node and unreach]
 39Semantics of Recursive Datalog Programs
- If a program is stratified, the tables in the 
 program can be partitioned into strata
- Stratum 0: all database tables. 
- Stratum i: tables defined in terms of tables in stratum i and lower strata.
- If T depends on (not S), S is in a lower stratum than T.
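One way to test for stratification and to assign stratum numbers is sketched below. The encoding of the program as (head, body predicate, negated?) triples and the function name `stratify` are assumptions; the loop raises strata until all constraints hold and gives up once a stratum would exceed the number of derived predicates, which can only happen when a cycle goes through a negation.

```python
def stratify(edb, deps):
    """deps: (head, body_pred, is_negated) triples.
    Returns {predicate: stratum}, or None if the program is not stratified."""
    idb = {head for head, _, _ in deps}
    stratum = {p: 0 for p in edb}             # stratum 0: database tables
    stratum.update({p: 1 for p in idb})       # derived tables start at 1
    limit = len(idb)                          # highest stratum ever needed
    changed = True
    while changed:
        changed = False
        for head, body, negated in deps:
            # A negated dependency forces the head strictly above the body.
            need = stratum[body] + (1 if negated else 0)
            if stratum[head] < need:
                if need > limit:
                    return None               # cycle through negation
                stratum[head] = need
                changed = True
    return stratum

# reach / node / unreach program
graph_deps = [("reach", "g", False), ("reach", "reach", False),
              ("node", "g", False),
              ("unreach", "node", False), ("unreach", "reach", True)]
print(stratify({"g"}, graph_deps))

# R(X) :- T(X), not S(X).   S(X) :- T(X), not R(X).
rs_deps = [("R", "T", False), ("R", "S", True),
           ("S", "T", False), ("S", "R", True)]
print(stratify({"T"}, rs_deps))
```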
40Semantics of Recursive Datalog Programs
- g(a,b). g(b,c). g(a,d). 
- reach(X,X) :- g(X,Y). 
- reach(Y,Y) :- g(X,Y). 
- reach(X,Y) :- g(X,Y). 
- reach(X,Z) :- reach(X,Y), reach(Y,Z). 
- node(X) :- g(X,Y). 
- node(Y) :- g(X,Y). 
- unreach(X,Y) :- node(X), node(Y), not reach(X,Y).
[Figure: dependency graph with strata; stratum 0: g; stratum 1: reach, node; stratum 2: unreach]
 41Semantics of Recursive Datalog Programs
- Semantics of a stratified program given by 
- First, compute the least fixpoint of all tables in Stratum 1. (Stratum 0 tables are fixed.)
- Then, compute the least fixpoint of the tables in Stratum 2; then the lfp of the tables in Stratum 3, and so on, stratum by stratum.
42Semantics of Recursive Datalog Programs
- Fixpoint of a set of rules R, starting with set 
 of facts I
- repeat 
-  Old_I := I; I := I ∪ infer(R,I) 
- until I = Old_I 
- Fixpoint within one stratum always terminates 
- Due to monotonicity within the strata 
- Only positive dependence between tables within the same stratum.
- Because the program is finite, the number of strata is finite as well
43Semantics of Recursive Datalog Programs
- Stratum 0: g(a,b). g(b,c). g(a,d). 
- Stratum 1: node(a), node(b), node(c), node(d), reach(a,a), reach(b,b), reach(c,c), reach(d,d), reach(a,b), reach(b,c), reach(a,d), reach(a,c)
- Stratum 2: 
-  unreach(b,a), unreach(c,a), ... 
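The stratum-by-stratum evaluation of this program can be sketched as follows. The helper `lfp` and the hand-coded rules are illustrative; the point is that unreach is only computed after reach has reached its fixpoint, so the negation is evaluated against a complete table.

```python
g = {("a", "b"), ("b", "c"), ("a", "d")}     # stratum 0

def lfp(step):
    """Iterate `step` from the empty relation until nothing new is derived."""
    current = set()
    while True:
        nxt = current | step(current)
        if nxt == current:
            return current
        current = nxt

# Stratum 1: node and reach depend only positively on g and on themselves.
node = {x for (x, _) in g} | {y for (_, y) in g}

def reach_step(reach):
    return ({(x, x) for (x, _) in g} | {(y, y) for (_, y) in g} | g
            | {(x, z) for (x, y) in reach for (y2, z) in reach if y == y2})

reach = lfp(reach_step)

# Stratum 2: reach is now fixed, so "not reach(X,Y)" is a plain lookup.
unreach = {(x, y) for x in node for y in node if (x, y) not in reach}

print(sorted(reach))
print(sorted(unreach))
```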
44Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- non-recursive 
- recursive datalog 
- aggregation 
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
45Aggregate Operators
Degree(X, COUNT(<Y>)) :- g(X,Y).
- The < ... > notation in the head indicates grouping; the remaining arguments (X, in this example) are the GROUP BY fields.
- In order to apply such a rule, all of relation g must be available.
- Stratification with respect to the use of < ... > is similar to negation.
46Aggregate Operators
- bi(X,Y) :- g(X,Y). 
- bi(Y,X) :- g(X,Y). 
- Degree(X, COUNT(<Y>)) :- bi(X,Y). 
[Figure: dependency graph g → bi → degree]
47Aggregate Operators
- bi(X,Y) :- g(X,Y). 
- bi(Y,X) :- g(X,Y). 
- Degree(X, COUNT(<Y>)) :- bi(X,Y). 
[Figure: dependency graph g → bi → degree, with strata 0 (g), 1 (bi), 2 (degree)]
 48Aggregate Operators
- bi(X,Y) :- g(X,Y). 
- bi(Y,X) :- g(X,Y). 
- Degree(X, COUNT(<Y>)) :- bi(X,Y). 
- Compute stratum by stratum 
- Assume strata 1, ..., k are fixed when computing stratum k+1
[Figure: dependency graph g → bi → degree, with strata 0 (g), 1 (bi), 2 (degree)]
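A sketch of the stratified computation of the degree, reading the aggregate as a count of the Y values per X (an assumption) and using a small hypothetical graph g:

```python
from collections import Counter

# Stratum 0: hypothetical edge relation.
g = {("a", "b"), ("b", "c"), ("a", "d")}

# Stratum 1: bi(X,Y) :- g(X,Y).  bi(Y,X) :- g(X,Y).
bi = g | {(y, x) for (x, y) in g}

# Stratum 2: the aggregate is applied only once bi is complete.
# bi is a set, so every (X, Y) pair is counted exactly once.
degree = Counter(x for (x, _) in bi)

print(dict(degree))
```

Computing degree before bi is complete would count only part of each group, which is why the aggregate has to sit in a higher stratum than bi.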
 49Aggregate Operators
- r(a,b). r(a,c). s(a,d). 
- t(X,COUNT(<Y>)) :- r(X,Y). 
- r(X,Y) :- t(X,Z), Z=2, s(X,Y). 
50Aggregate Operators
- r(a,b). r(a,c). s(a,d). 
- t(X,COUNT(<Y>)) :- r(X,Y). 
- r(X,Y) :- t(X,Z), Z=2, s(X,Y). 
[Figure: dependency graph over t, r and s; t and r depend on each other]
51Aggregate Operators
- r(a,b). r(a,c). s(a,d). 
- t(X,COUNT(<Y>)) :- r(X,Y). 
- r(X,Y) :- t(X,Z), Z=2, s(X,Y). 
[Figure: dependency graph over t, r and s; t and r depend on each other]
- t is aggregating over a moving target 
- Step 1: t(a,2) is added 
- Step 2: r(a,d) is added 
- Step 3: t(a,3) is added and t(a,2) is no longer true; hence r(a,d) should not have been added
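The moving-target problem can be replayed with a short simulation. It reads the aggregate as a count of Y values per X (an assumption) and recomputes t from the current r in every round; the function name `run_naively` is illustrative.

```python
def run_naively(r, s):
    """Alternate between aggregating over r and extending r; record each round."""
    r = set(r)
    trace = []
    while True:
        # t(X, COUNT(<Y>)) :- r(X,Y).
        t = {x: sum(1 for (x2, _) in r if x2 == x) for (x, _) in r}
        # r(X,Y) :- t(X,Z), Z=2, s(X,Y).
        new_r = r | {(x, y) for (x, y) in s if t.get(x) == 2}
        trace.append((dict(t), set(new_r)))
        if new_r == r:
            return trace
        r = new_r

trace = run_naively({("a", "b"), ("a", "c")}, {("a", "d")})
for t, r in trace:
    print(t, sorted(r))
```

At the end r still contains (a,d), although its only justification, a count of 2 for a, no longer holds once (a,d) itself has been counted.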
52Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- Relational algebra = safe Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
53RA = Non-Recursive Datalog
- Every operator of RA can be simulated by non-recursive Datalog
- Projection on the first attribute of the ternary relation r: 
-  query(A) :- r(A, B, C).
- Cartesian product of relations r1 and r2: 
-  query(X1, X2, ..., Xn, Y1, Y2, ..., Ym) :- r1(X1, X2, ..., Xn), r2(Y1, Y2, ..., Ym).
- Union of relations r1 and r2: 
-  query(X1, X2, ..., Xn) :- r1(X1, X2, ..., Xn). 
-  query(X1, X2, ..., Xn) :- r2(X1, X2, ..., Xn).
- Set difference of r1 and r2: 
-  query(X1, X2, ..., Xn) :- r1(X1, X2, ..., Xn), not r2(X1, X2, ..., Xn).
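Each of these rules corresponds to a simple set expression. The sketch below mirrors them as Python set comprehensions over small hypothetical relations (tuples stand for rows); it is meant only to make the correspondence concrete.

```python
# Hypothetical relations; r is ternary, r1 and r2 are unary.
r = {(1, "x", True), (2, "y", False)}
r1 = {(1,), (2,)}
r2 = {(2,), (3,)}

# query(A) :- r(A, B, C).                    projection on attribute 1
projection = {(a,) for (a, b, c) in r}

# query(X, Y) :- r1(X), r2(Y).               Cartesian product
product = {t1 + t2 for t1 in r1 for t2 in r2}

# query(X) :- r1(X).  query(X) :- r2(X).     union: one rule per operand
union = {t for t in r1} | {t for t in r2}

# query(X) :- r1(X), not r2(X).              set difference
difference = {t for t in r1 if t not in r2}

print(projection, product, union, difference)
```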
54RA = Non-Recursive Datalog
- Every operator of RA can be simulated by non-recursive Datalog
- The result of our construction is always safe, and equivalent under the stratified semantics
-  σ_{1=3}((π_1 R) × R) 
-  is translated to 
-  query1(A) :- R(A,B). 
-  query2(A,B,C) :- query1(A), R(B,C). 
-  result(A,B,A) :- query2(A,B,A). 
55RA = Non-Recursive Datalog
- Every rule can be expressed by one RA expression 
- Translate every atom separately 
- Negation/arithmetic: use the complement construction
- Essential: safety 
- Combine the atoms with a Cartesian product 
- Do the joins with a selection 
- Project on the relevant attributes 
- Strata determine the order of evaluation 
- Because there is no recursion, every rule is executed only once.
56RA = Non-Recursive Datalog
- sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
- person(X,f): σ_{2=f} Person 
- parent(Z,X) and parent(Z,Y): Parent 
- not(X=Y): complement construction 
-  X comes from parent(Z,X) → π_2 Parent 
-  Y comes from parent(Z,Y) → π_2 Parent 
-  not(X=Y) → σ_{1≠2} (π_2 Parent × π_2 Parent)
57RA = Non-Recursive Datalog
- sister(X,Y) :- person(X,f), parent(Z,X), parent(Z,Y), not(X=Y).
- π_{1,6} 
-  σ_{1=4, 3=5, 1=7, 6=8} 
-  (σ_{2=f} Person × Parent × Parent 
-  × σ_{1≠2} (π_2 Parent × π_2 Parent)) 
58RA = Non-Recursive Datalog
- Hence, the following two are equivalent in 
 expressive power
- Safe Datalog with negation, without recursion or 
 aggregation, under the stratified semantics
- Relational Algebra 
- Every rule separately can be expressed by a 
 relational algebra expression
- Makes it very suitable for implementation on top 
 of a relational database
59Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- Relational algebra = Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
60Evaluation of Datalog Programs
- Running example 
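-  root(r). child(a,r). child(b,r). child(c,a). 
-  child(d,a). child(e,c). child(f,d). child(h,b). 
-  (child(X,Y): X is a child of Y) 
- sg(X,Y) :- root(X), root(Y). 
- sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
[Figure: tree rooted at r; a and b are children of r; c and d are children of a; h is a child of b; e is a child of c; f is a child of d]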
 61Evaluation of Datalog Programs Issues
- Repeated inferences: when recursive rules are applied repeatedly in the naive way, the same inferences are made in several iterations.
- Unnecessary inferences: if we just want to find sg of a particular node, say e, computing the fixpoint of the sg program and then selecting the tuples with e in the first column is wasteful, because we compute many irrelevant facts.
62Evaluation of Datalog Programs
- Running example 
-  Query: ?- sg(e,X) 
- (r,r) 
- (a,a), (b,b), (a,b), (b,a) 
- (c,c), (c,d), (c,h), (d,c), (d,d), ... 
- (e,e), (f,f), (e,f), (f,e)
[Figure: the tree of the running example, with nodes r, a, b, c, d, h, f, e]
 63Avoiding Repeated Inferences
- Semi-naive fixpoint evaluation: avoid repeated inferences by requiring that at least one of the body facts was generated in the most recent iteration.
- For each recursive table P, use a table delta_P. 
- Rewrite the program to use the delta tables. 
- A second evaluation of the rule 
-  r(X,Y) :- s(X), t(Y), u(X,Z), v(Y,Z). 
- only gives new tuples (X,Y) for ground instantiations in which at least one of the atoms is new.
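The delta idea can be sketched on the same-generation program. The facts below follow the tree of the running example and assume that child(X,P) means "X is a child of P", the reading required by the recursive rule; the function names are illustrative.

```python
root = {"r"}
# child(X, P): X is a child of P (assumed argument order).
child = {("a", "r"), ("b", "r"), ("c", "a"), ("d", "a"),
         ("e", "c"), ("f", "d"), ("h", "b")}

def apply_recursive_rule(sg_facts):
    # sg(X,Y) :- child(X,U), sg(U,V), child(Y,V).
    return {(x, y)
            for (x, u) in child
            for (u2, v) in sg_facts if u == u2
            for (y, v2) in child if v == v2}

def naive_sg():
    sg = set()
    while True:
        new = sg | {(x, y) for x in root for y in root} | apply_recursive_rule(sg)
        if new == sg:
            return sg
        sg = new                      # every round rejoins ALL known sg facts

def seminaive_sg():
    sg = {(x, y) for x in root for y in root}   # non-recursive rule, once
    delta = set(sg)
    while delta:
        # Only the facts found in the previous round are joined again.
        delta = apply_recursive_rule(delta) - sg
        sg |= delta
    return sg

print(sorted(seminaive_sg()))
```

Both functions compute the same relation; the semi-naive version only feeds the most recent delta into the join instead of the whole sg table.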
64Avoiding Unnecessary Inferences
- Still, in the running example 
- many unnecessary deductions when the query is ?- sg(e,X)
- Compare with top-down 
- as in Prolog 
- only facts that are connected to the ultimate 
 goal are being considered
65The Prolog Way
- sg(X,Y) :- root(X), root(Y). 
- sg(X,Y) :- child(X,U), sg(U,V), child(Y,V). 
- ?- sg(e,X). 
-  try root(e): FAIL 
-  try child(e,U) 
-  → U=c 
-  try sg(c,V) 
-  try root(c): FAIL 
-  try child(c,U) 
-  → U=a 
-  ... 
[Figure: the tree of the running example, with nodes r, a, b, c, d, h, f, e]
 67Magic Sets Idea
- We want to do something similar for Datalog 
- Idea: define a filter table that computes all relevant values and restricts the computation of sg(e,X).
- sg(X,Y) :- m(X), root(X), root(Y). 
- sg(X,Y) :- m(X), child(X,U), sg(U,V), child(Y,V). 
- m(X) :- m(Y), child(Y,X). 
- m(e).
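The effect of the filter table can be sketched as follows. The facts follow the tree of the running example and assume that child(X,P) means "X is a child of P", the reading required by the rules; the function name `magic_sg` is illustrative, and the bottom-up loops are deliberately naive.

```python
root = {"r"}
# child(X, P): X is a child of P (assumed argument order).
child = {("a", "r"), ("b", "r"), ("c", "a"), ("d", "a"),
         ("e", "c"), ("f", "d"), ("h", "b")}

def magic_sg(query_node):
    # m(query_node).   m(X) :- m(Y), child(Y,X).
    m = {query_node}
    while True:
        new_m = m | {p for (c, p) in child if c in m}
        if new_m == m:
            break
        m = new_m

    # sg(X,Y) :- m(X), root(X), root(Y).
    # sg(X,Y) :- m(X), child(X,U), sg(U,V), child(Y,V).
    sg = set()
    while True:
        new_sg = (sg
                  | {(x, y) for x in root if x in m for y in root}
                  | {(x, y)
                     for (x, u) in child if x in m
                     for (u2, v) in sg if u == u2
                     for (y, v2) in child if v == v2})
        if new_sg == sg:
            break
        sg = new_sg
    return m, sg

m, sg = magic_sg("e")
print(sorted(m))
print(sorted(sg))
print(sorted(y for (x, y) in sg if x == "e"))
```

Only sg facts whose first argument is e or one of its ancestors are derived, compared with the full sg relation computed by an unrestricted bottom-up evaluation.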
68Magic Sets
- It is always possible to do this in such a way 
 that bottom-up becomes as efficient as top-down!
- Different proposals exist in literature 
- how to introduce the magic filters 
69Optimization Techniques
- Many other techniques exist as well 
- Standard relational indexing techniques 
- (partly) materializing intensional relations beforehand
- Trade-off: memory ↔ query-time performance 
- (See also the OLAP-part for a similar technique) 
- Different representations for relations 
- BDD (Stanford) 
70Outline
- Syntax of the Datalog language 
- Semantics of a Datalog program 
- Relational algebra = Datalog with negation and without recursion
- Optimization techniques 
- Conclusions
71Conclusions
- Datalog adds deductive capabilities to databases 
- extensional relations 
- intensional relations 
- Datalog without recursion, with negation 
- safety requirement 
- stratification 
- equal in power to relational algebra 
- Closed World Assumption
72Conclusions
- Datalog without Negation 
- Always a unique minimal model 
- Datalog with negation and recursion 
- semantics not always clear 
- stratified negation 
- Evaluation of datalog queries 
- without negation → RA-optimization 
- with recursion 
- semi-naive recursion 
- magic sets
73Conclusions
- Very nice idea, but  
- Deductive databases did not make it as a database 
 paradigm
- Yet, many ideas survived 
- recursion in SQL  
- And others may re-surface in future. 
- Increasing need for adding meta-information in 
 databases