Title: WP 8: Assessment and Dissemination
1 INFOMIX: Data Integration meets Nonmonotonic Deductive Databases
Thomas Eiter, Institute of Information Systems, Vienna University of Technology
2 Overview
- Motivation
- Information Integration Framework
- Nonmonotonic Logic Programs
- INFOMIX Architecture
- Repair Programs
- Focussing Techniques
- Conclusion
3 Motivation
- Data integration: increasing demand
- byproduct of expansion of internet and WWW
- Highly complex problem
- Current solutions in practice: pragmatic
- Comprehensive, formal design methodologies and coherent tools for designs are missing
- Towards information integration at human-level competence, utilizing reasoning capabilities
4 INFOMIX Objectives
- Powerful information integration
- Comprehensive information model
- deal with incomplete and/or inconsistent information
- Information integration algorithms
- Usage of Computational Logic
- Integration of results on data acquisition and transformation
- Prototype system
5 Project Partners
- University of Calabria: Nonmonotonic LP, Deductive DB (Leone, Greco, Ianni, ...)
- University of Rome La Sapienza: Data Integration (Lenzerini, Cali, Lembo, Rosati, ...)
- TU Wien: Nonmonotonic LP, data acquisition and extraction (Eiter, Faber, Gottlob, Fink)
- RODAN Systems: Database system implementation (Staniszkis, Nowicki, Kalka)
6 Data Integration System: Basic View
[Figure: a user query is posed against the Global Schema and a result is returned; a Mapping connects the Global Schema to Source 1 and Source 2.]
7 Formal Information Integration Framework
- Data Integration System
- I = <G, M, S>
- G: Global Schema
- M: Mapping Assertions
- S: Source Schema
8 Global Schema
- G = <R, Σ>
- R: relational schema (set of relations)
- Σ: set of constraints
- Key constraints
- Inclusion dependencies
- Exclusion dependencies
- ...
9 Source Schema
- S = <RS, ∅>
- RS: relational schema (set of relations)
- No integrity constraints on sources
- Different interpretations of data retrieved from sources wrt data satisfying the global schema (later)
10 Mapping Assertions
- Link sources and global relations
- <qS, qG>
- qS: query over the sources RS
- qG: query over the global relations R
- Informally: lhs corresponds to rhs
- Covers
- GAV (qG is a relation r in R)
- LAV (qS is a relation s in S)
11 Semantics
- Given: instance D of the source schema S
- Issue: instance DG of the global schema G = <R, Σ>
- DG must satisfy the constraints Σ
- DG must comply with a mapping assumption (MA)
- sem(I,D) = { DG | DG complies with MA }
- Most important: soundness, exactness
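12 Example: Sound Semantics
KDs: player[1], team[1], coach[1]
IDs: team[3] ⊆ player[1]
EDs: player[1] ∩ coach[1] = ∅
[Figure: global relations team, player, coach, populated from the sources s1, s2, s3, s4 through the mapping below.]
player(X,Y,Z) :- s1(X,Y,Z,W)
team(X,Y,Z) :- s2(X,Y,Z)
team(X,Y,Z) :- s3(X,Y,Z)
coach(X,Y,Z) :- s4(X,Y,Z)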
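The GAV mapping of this example can be read as a set of view definitions over the sources. The following is a minimal Python sketch of that reading, not the INFOMIX implementation: it materializes the global relations, checks the key and exclusion constraints, and answers q(X) :- player(X,Y,Z) with the ID team[3] ⊆ player[1] taken into account. The source tuples and the helper names (`retrieve`, `key_violations`, `is_consistent`, `certain_player_codes`) are invented for illustration, since the tables of the original slide are not available.

```python
from itertools import combinations

# Invented source instance D for s1..s4; every value is a placeholder.
D = {
    "s1": {(10, "p10", "t1", 27), (9, "p9", "t2", 28)},
    "s2": {("t1", "Team One", 10)},
    "s3": {("t2", "Team Two", 8)},
    "s4": {(7, "c7", "t1")},
}

def retrieve(d):
    """Materialize the four GAV rules of the example."""
    return {
        "player": {(x, y, z) for (x, y, z, _w) in d["s1"]},  # player :- s1
        "team": set(d["s2"]) | set(d["s3"]),                 # team :- s2, team :- s3
        "coach": set(d["s4"]),                               # coach :- s4
    }

def key_violations(rel):
    """Pairs of distinct tuples that agree on the first attribute (the key)."""
    ordered = sorted(rel, key=repr)
    return [(t, u) for t, u in combinations(ordered, 2) if t[0] == u[0]]

def is_consistent(db):
    """Check the KDs on position 1 and the ED between player[1] and coach[1]."""
    keys_ok = all(not key_violations(db[r]) for r in ("player", "team", "coach"))
    ed_ok = not ({t[0] for t in db["player"]} & {t[0] for t in db["coach"]})
    return keys_ok and ed_ok

def certain_player_codes(db):
    """q(X) :- player(X,Y,Z), plus team leaders forced by team[3] ⊆ player[1]."""
    return {t[0] for t in db["player"]} | {t[2] for t in db["team"]}

ret = retrieve(D)
print(is_consistent(ret))
print(sorted(certain_player_codes(ret)))
```

In this invented instance the code 8 appears only as a team leader, so it is an answer only because every sound global database must contain a player tuple for it.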
13 User Queries
- Important: Core Conjunctive Queries (CQs)
- q(x) :- r1(x1), r2(x2), ..., rk(xk)
- Extensions: unions of conjunctive queries (UOCs), datalog w/o negation, recursion, built-ins
- Semantics: Certain answers
- ans_sem(q,I,D) = { c | c in q^DG, for each DG in sem(I,D) }
14 Example
- Query
- q(X) :- player(X,Y,Z).
- Result
- ans_sound(q,I,D) = { 10, 9, 8 }.
[Figure: the player relation.]
15 Semantics for Inconsistency
- Problem: sem(I,D) = ∅ possible
- Relax choice of DG (loose semantics vs strict)
- DG must satisfy global constraints
- DG should comply as close as possible with mapping assumption
- Select best (minimal) DG under ordering DG1 ≤D DG2
- GAV sound: get more of qS(D)
- r(DG1) ∩ qS(D) ⊇ r(DG2) ∩ qS(D)
- GAV complete: miss qS(D) less
- r(DG1) - qS(D) ⊆ r(DG2) - qS(D)
- GAV exact: get more of qS(D) and miss it less
16 Example: Sound Semantics
Additional tuple in s3: inconsistency (KD team[1] violated)
[Figure: global relations team, player, coach, populated from the sources s1, s2, s3, s4 through the mapping below.]
player(X,Y,Z) :- s1(X,Y,Z,W)
team(X,Y,Z) :- s2(X,Y,Z)
team(X,Y,Z) :- s3(X,Y,Z)
coach(X,Y,Z) :- s4(X,Y,Z)
17 Example / 2
Two possibilities for ≤D-minimal DG (no extra tuples)
[Figure: the two candidate global databases, 1) and 2), each with relations player, team, coach.]
Query answer: ans_loosely-sound(q,I,D) = { 10, 9, 8 }
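The effect of a key violation on query answers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's algorithm: only the key on the first attribute of team is repaired, repairs are the maximal subsets with one tuple per key value, and all tuples and names (`key_repairs`, `player_codes`) are invented because the tables of the slide are not available.

```python
from itertools import product

def key_repairs(rel):
    """All maximal subsets of rel with one tuple per key value (position 1)."""
    groups = {}
    for t in rel:
        groups.setdefault(t[0], []).append(t)
    ordered = sorted(groups.items(), key=lambda kv: repr(kv[0]))
    choices = [sorted(g, key=repr) for _key, g in ordered]
    return [set(pick) for pick in product(*choices)]

# Invented retrieved relations; "t1" occurs twice, violating the key of team.
player_ret = {(10, "p10", "t1"), (9, "p9", "t2")}
team_ret = {
    ("t1", "Team One", 10),
    ("t1", "Team Uno", 10),
    ("t2", "Team Two", 8),
}

def player_codes(player, team):
    """q(X) :- player(X,Y,Z), with team leaders added for team[3] ⊆ player[1]."""
    return {t[0] for t in player} | {t[2] for t in team}

repairs = key_repairs(team_ret)

# An answer is kept only if it holds in every repair.
certain_codes = set.intersection(*(player_codes(player_ret, r) for r in repairs))
certain_names = set.intersection(*({t[1] for t in r} for r in repairs))

print(len(repairs))
print(sorted(certain_codes))
print(sorted(certain_names))
```

Both repairs agree on the team leaders, so the player codes survive the inconsistency, while the name of team "t1" is no longer a certain answer.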
18 Complexity of Query Answering
- Queries: UOCs
- GAV: non-recursive Datalog, LAV: CQs
- Constraints and mapping assumptions interact
- Data/combined complexity (lower bounds if decidable above PTIME)
- NKC: non-key conflicting IDs: r[A] ⊆ s[B] such that, if s has key K, then K is not a proper subset of B.
- 1KC: 1-key conflicting IDs: r[A] ⊆ s[B] such that s has key K, K ⊂ B, and |B| = |K| + 1.
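The two classes of inclusion dependencies can be told apart by a purely syntactic test on the target attributes B and the key K of the target relation. The sketch below encodes one reading of the definitions above (NKC when there is no key or K is not a proper subset of B, 1KC when B extends K by exactly one attribute); the function name `classify_id` and the label "other" are assumptions made for illustration.

```python
def classify_id(target_attrs, key=None):
    """Classify an ID r[A] ⊆ s[B] from B (target_attrs) and the key K of s."""
    b = frozenset(target_attrs)
    if key is None:
        return "NKC"          # no key on s: nothing to conflict with
    k = frozenset(key)
    if not k < b:             # K is not a proper subset of B
        return "NKC"
    if len(b - k) == 1:       # B extends K by exactly one attribute
        return "1KC"
    return "other"

# team[3] ⊆ player[1], where player has key {1}: B equals K, so NKC.
print(classify_id({1}, key={1}))
# B = {1, 2} with key {1}: exactly one extra attribute.
print(classify_id({1, 2}, key={1}))
# B = {1, 2, 3} with key {1}: neither class.
print(classify_id({1, 2, 3}, key={1}))
```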
19 How to Evaluate Queries / Semantics?
- Approach: Use computational logic
- Advantages
- executable specification of semantics
- obtain computational power needed
- Desiderata
- close to database processing
- non-determinism (for global view)
- efficiency
20 Basic Approach
- Query Rewriting
- Query q(x) on global schema G → query q'(x) on source schema S.
- (perfect rewriting, data independent)
- Feasibility depends on
- mapping type (GAV/LAV) and language
- semantics
- type of constraints
- input / output query language
21 Some Results
- For UOCs
- GAV: non-recursive Datalog(neg) mapping
- perfect rewriting under
- strictly- / loosely-sound semantics with KDs, IDs, EDs
- output language: general Datalog(neg)
- LAV: CQs mappings
- compilation of LAV into GAV, for strictly-sound semantics with IDs and EDs
22 Nonmonotonic Logic Programs
- Disjunctive Datalog(neg) Rules
- h1(x1) v ... v hl(xl) :- b1(y1), ..., bm(ym), not c1(z1), ..., not cn(zn)
- function-free atoms (constants allowed)
- non-monotonic negation (not)
- Semantics
- minimal model semantics (not-free programs)
- stratified semantics (layered negation)
- stable model semantics (Gelfond and Lifschitz, all programs)
- Complexity and Expressiveness
- captures co-NP^NP queries
- co-NP^NP / co-NEXP^NP data / combined complexity
23 Nonmonotonic Logic Programs / 2
- Non-determinism
- Example: Select one element from a set s.
- in(X) :- s(X), not out(X).
- out(X) v out(Y) :- s(X), s(Y), X <> Y.
- Extensions
- strong (classical) negation
- weight constraints
- aggregates, ...
- Efficient implementations
- DLV (TU Vienna, U Calabria), Smodels (TU Helsinki), ...
- Important KR tools (e.g. for Answer Set Programming)
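To see why the two rules of the selection example pick exactly one element, the stable models of the ground program can be enumerated by brute force. The Python sketch below does this for s = {a, b, c}; it is a naive checker written for illustration and has nothing to do with how DLV or Smodels work internally. It assumes the facts s(a), s(b), s(c) are already evaluated during grounding, and the names `stable_models`, `is_model`, and `choice_program` are invented.

```python
from itertools import combinations

def powerset(atoms):
    """All subsets of atoms, as frozensets."""
    items = sorted(atoms)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def is_model(interp, positive_rules):
    """Every rule whose body holds in interp has some head atom in interp."""
    return all(bool(head & interp) or not body <= interp
               for head, body in positive_rules)

def stable_models(rules, atoms):
    """Rules are (head, positive body, negative body) triples of frozensets."""
    found = []
    for cand in powerset(atoms):
        # Gelfond-Lifschitz reduct: drop rules blocked by cand, strip 'not'.
        reduct = [(head, pos) for head, pos, neg in rules if not (neg & cand)]
        if not is_model(cand, reduct):
            continue
        # cand is stable iff no proper subset is also a model of the reduct.
        if all(sub == cand or not is_model(sub, reduct) for sub in powerset(cand)):
            found.append(cand)
    return found

def choice_program(elements):
    """Ground instances of the two rules for the given elements of s."""
    rules = []
    for x in elements:
        # in(x) :- not out(x).
        rules.append((frozenset({f"in({x})"}), frozenset(), frozenset({f"out({x})"})))
    for x, y in combinations(sorted(elements), 2):
        # out(x) v out(y).   (one instance per unordered pair)
        rules.append((frozenset({f"out({x})", f"out({y})"}), frozenset(), frozenset()))
    atoms = {f"{p}({x})" for p in ("in", "out") for x in elements}
    return rules, atoms

rules, atoms = choice_program(["a", "b", "c"])
models = stable_models(rules, atoms)
for m in models:
    print(sorted(m))
```

Each stable model contains in(x) for exactly one element and out(y) for the others, which is the non-deterministic choice the slide refers to.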
24 Challenges for Nonmonotonic LP
- Interfacing standard relational DB
- Scalability
- Facilities for query answering (DLV / Smodels were more conceived as model generators)
- Remark: Historically, DLV set out as a deductive DB engine; DLV uses a lot of DDB technology
25 INFOMIX High Level Architecture
26 Information Service Layer
- Define the data integration system I = (G,M,S)
- Store descriptions in Metadata Repository
- Accept user queries
- Visualize query results
- INFOMIX Query Language (IQL)
- subset of stratified Datalog(neg), depending on decidability
/- NKC / General
27 Data Acquisition and Transformation (DAT) Layer
- Access raw data in different formats (relational, HTML, XML, OO)
- Support data extraction from web pages (LiXto technology)
- Wrappers
- Code Wrappers (API)
- Query Wrappers (e.g., SQL/ODBC)
- Visual Wrappers (LiXto, RODAN Data Extractor)
- Logical source format: fragment of XML Schema (akin to complex values)
28 DAT: Relevant Data Formats in INFOMIX
ISDF XML fragment
29 Wrapper Design
Goal: Relieve designer from technical details
30 Integration Layer
- Perform the data integration
- Receive requests from Information Service Layer
- Compute query rewritings
- Evaluate query rewritings, interacting with DAT Layer
- Approach
- combine / couple DLV with standard relational engines
- narrow use of DLV to where it is needed (push work to relational engines as much as possible)
31 Repair Programs
- GAV setting, loose semantics (typically, exact sem.)
- Repair semantics (Bertossi et al., Chomicki and Marcinkowski, Greco et al., ...)
- each ≤G-minimal DG w.r.t. the retrieved database ret(I,D) (materialized mappings) is a repair
- repI(D) is the set of all repairs
- Query answering
- ans(q,I,D) = { c | q(c) holds w.r.t. each R in repI(D) }
32 LP Specification for querying I = (G,M,S)
- Disjunctive Datalog(neg) program
- PI(q) = PM ∪ PS ∪ Pq
- where
- PM is a (stratified) Datalog(neg) program, for retrieving the data from the sources
- PS is a disjunctive Datalog(neg) program, computing repI(D) in its stable models
- Pq is a nonrecursive Datalog(neg) program encoding the query q(x) on top
33
- Hierarchical structure
- PM > PS > Pq
- ret(I,D) corresponds to SM(PM ∪ D)
- repI(D) corresponds to SM(PS ∪ ret(I,D))
- ans(q,I,D) = { c | q(c) in M for each M in SM(Pq ∪ DG), DG ∈ repI(D) }
  = { c | q(c) in M for each M in SM(Pq ∪ PS ∪ ret(I,D)) }
  = { c | q(c) in M for each M in SM(PI(q) ∪ D) }.
- Compile-in non-key conflicting IDs
34 Example
- Pq:
  q(X) :- player(X,Y,Z).
  q(X) :- team(V,W,X).    (from ID team[3] ⊆ player[1])
- PM:
  player_D(X,Y,Z) :- s1(X,Y,W,Z).
  team_D(X,Y,Z) :- s2(X,Y,Z).
  team_D(X,Y,Z) :- s3(X,Y,Z).
  coach_D(X,Y,Z) :- s4(X,Y,Z).
- PS (p_bar stands for the overlined predicate of p; it collects the tuples of p_D that a repair drops):
  player(X,Y,Z) :- player_D(X,Y,Z), not player_bar(X,Y,Z).    (key player[1])
  player_bar(X,Y,Z) v player_bar(X,V,W) :- player_D(X,Y,Z), player_D(X,V,W), Y <> V.
  player_bar(X,Y,Z) v player_bar(X,V,W) :- player_D(X,Y,Z), player_D(X,V,W), Z <> W.
35 Example / 2
(p_bar stands for the overlined predicate of p; it collects the tuples of p_D that a repair drops.)
- team(X,Y,Z) :- team_D(X,Y,Z), not team_bar(X,Y,Z).    (key team[1])
  team_bar(X,Y,Z) v team_bar(X,V,W) :- team_D(X,Y,Z), team_D(X,V,W), Y <> V.
  team_bar(X,Y,Z) v team_bar(X,V,W) :- team_D(X,Y,Z), team_D(X,V,W), Z <> W.
- coach(X,Y,Z) :- coach_D(X,Y,Z), not coach_bar(X,Y,Z).    (key coach[1])
  coach_bar(X,Y,Z) v coach_bar(X,V,W) :- coach_D(X,Y,Z), coach_D(X,V,W), Y <> V.
  coach_bar(X,Y,Z) v coach_bar(X,V,W) :- coach_D(X,Y,Z), coach_D(X,V,W), Z <> W.
- ED player[1], coach[1]:
  player_bar(X,Y,Z) v coach_bar(X,V,W) :- player_D(X,Y,Z), coach_D(X,V,W).
36 Query Optimization
- Different repair encodings
- E.g., use of unstratified negation instead of disjunction
- → Equivalence of logic program encodings
- Focussing techniques
- Relevance: prune useless rules in PI(q).
- Decomposition: localize inconsistency in ret(I,D).
- Recombination: combine localized repairs to answer q.
37 Decomposition
- Conflict set Cret(I,D) (via Σ), syntactic conflict closure C*ret(I,D)
- affected part (Aret(I,D)) and safe part (Sret(I,D)) of ret(I,D)
- works for
- universal constraints
- ∀x A1 ∧ ... ∧ An → B1 ∨ ... ∨ Bm ∨ f1 ∨ ... ∨ fk, n+m > 0,
  fi built-in literals (=, <>, etc)
- similarity-compliant ≤D (R1 ∩ ret(I,D) ⊂ R2 ∩ ret(I,D) ⇒ R2 <D R1)
38 Main Results
- For each R in repI(D) there is some R' in rep(Aret(I,D)) such that
- R = (R' ∩ C*ret(I,D)) ∪ Sret(I,D).
- For each R' in rep(Aret(I,D)) there is such an R in repI(D).
- Computing Aret(I,D), C*ret(I,D) is expensive, while computing Cret(I,D) is efficient (use DB engines)
- If n > 0, each R' in rep(Aret(I,D)) is included in C*ret(I,D)
- If n > 0 and m = 0 (e.g., FDs, KDs, EDs), each R' in rep(Aret(I,D)) is included in Cret(I,D)
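For key constraints the idea of localizing repairs can be checked on a small example. The Python sketch below is an illustration under simplifying assumptions: only a key on the first attribute is considered, and the conflict set is used directly as the affected part, which is a shortcut of this sketch and not the general definition. The data and the names `conflict_set` and `repairs` are invented.

```python
from itertools import product

def groups_by_key(rel):
    """Group tuples by their first attribute (the key)."""
    groups = {}
    for t in rel:
        groups.setdefault(t[0], set()).add(t)
    return groups

def conflict_set(rel):
    """Tuples that share their key value with some other tuple."""
    return {t for g in groups_by_key(rel).values() if len(g) > 1 for t in g}

def repairs(rel):
    """Maximal key-consistent subsets: one tuple per key value."""
    choices = [sorted(g, key=repr) for g in groups_by_key(rel).values()]
    return [frozenset(pick) for pick in product(*choices)]

# Invented retrieved relation with one key conflict on "t1".
ret = {
    ("t1", "Team One", 10),
    ("t1", "Team Uno", 10),
    ("t2", "Team Two", 8),
    ("t3", "Team Three", 9),
}

conflicts = conflict_set(ret)   # the part that needs repairing
safe = ret - conflicts          # untouched by any repair

# Repair only the conflicting tuples, then add the safe part back.
localized = {r | safe for r in repairs(conflicts)}
direct = set(repairs(ret))

print(len(conflicts), len(safe))
print(localized == direct)
```

The repair step only ever sees the two conflicting tuples, so its cost depends on the amount of inconsistency rather than on the size of ret(I,D).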
39 DLV Developments
- Coupling with DBMS
- ODBC interface
- Relevance for query answering
- Magic set techniques
- Non-ground query answering
- internal marking of relations
40 Recombination
- Answer query q(x) from localized repairs
- ans(q,I,D) = { c | q(c) in each M in SM(Pq ∪ (R' ∩ C*ret(I,D)) ∪ Sret(I,D)), R' in rep(Aret(I,D)) }
- Simplifies for special constraints (n > 0; n > 0, m = 0)
- Practical method: Repair Compilation
- store all repairs R' in rep(Aret(I,D)) in a relational DB
- mark tuples of Aret(I,D) with bitstring (membership in R')
- rewrite q(x) to an SQL query on marked ret(I,D)
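The marking step can be pictured with integers used as bitstrings. The Python sketch below is a simplified stand-in for the SQL rewriting mentioned above, which is not reproduced here: bit i of a tuple's mark is set when the tuple belongs to repair i, and a single-atom projection query is answered by OR-ing the marks of all tuples that produce the same value. The data and the names `mark` and `certain_projection` are invented, and the OR test is only claimed for this simple query shape.

```python
def mark(affected, repairs):
    """Map each affected tuple to an int; bit i is set iff it is in repairs[i]."""
    return {
        t: sum(1 << i for i, r in enumerate(repairs) if t in r)
        for t in affected
    }

def certain_projection(marked, safe, pos, n_repairs):
    """Values at position pos that occur in every repair (safe tuples always do)."""
    full = (1 << n_repairs) - 1
    masks = {}
    for t in safe:
        masks[t[pos]] = full
    for t, bits in marked.items():
        masks[t[pos]] = masks.get(t[pos], 0) | bits
    return {value for value, bits in masks.items() if bits == full}

# Invented affected part with its two repairs, plus a safe tuple.
affected = {("t1", "Team One", 10), ("t1", "Team Uno", 10)}
repairs = [{("t1", "Team One", 10)}, {("t1", "Team Uno", 10)}]
safe = {("t2", "Team Two", 8)}

marked = mark(affected, repairs)
leaders = certain_projection(marked, safe, pos=2, n_repairs=len(repairs))
names = certain_projection(marked, safe, pos=1, n_repairs=len(repairs))

print(sorted(leaders))
print(sorted(names))
```

Leader 10 is supported by a different tuple in each repair, so the OR of the two marks covers all repairs, whereas neither name of team "t1" does.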
41 Experiments
- Experiments on synthetic data sets (football teams, graph 3-coloring) showed positive effects
- Drastic improvement over naïve DLV evaluation
- Still, marking effort increases quickly with number of conflicts (viable for few conflicts)
- But: inspiration for internal marking of relations for non-ground query answering in DLV
42 INFOMIX Demo Scenario
- University of Rome La Sapienza
- information about students, courses, professors, exams ...
- 3 legacy databases (MySQL), lots of web pages
- Global schema: 15 relations, 30 constraints (KDs, IDs, EDs)
- 40 Wrappers (query wrappers, visual wrappers)
- 10 user queries
- Experiments: Talk by Gianluigi Greco
43 Conclusion
- INFOMIX: Powerful information integration
- Dealing with inconsistent and incomplete sources
- Hard problems to solve
- Fruitful use of computational logic
- Rich Data Acquisition and Transformation Layer
- Prototype (under implementation, available soon)
44 Further Issues
- Data Cleaning: Important aspect
- Improve wrapper retrieval
- Methodology for Design / Usage
- Compile (parts of) logic program to DBMS: DLVDB
- Challenge: Information integration for semi-structured data
45 Publications
- INFOMIX homepage
- http://sv.mat.unical.it/infomix/
- INFOMIX Reports
- Papers in conferences and journals (PODS, ICDT, IJCAI, KR, ICLP, LPNMR, JELIA, ...)
46 Data Integration System
- Provides a global, unified view of a set of heterogeneous, autonomous sources
- A mapping specifies the relationship between the global view and the sources
- Users pose queries to the global view of the data
- The system computes the answers to the query by suitably accessing the sources
47 Example (non-disjunctive)
- Pq:
  q(X) :- player(X,Y,Z).
  q(X) :- team(V,W,X).    (from ID team[3] ⊆ player[1])
- PM:
  player_D(X,Y,Z) :- s1(X,Y,W,Z).
  team_D(X,Y,Z) :- s2(X,Y,Z).
  team_D(X,Y,Z) :- s3(X,Y,Z).
  coach_D(X,Y,Z) :- s4(X,Y,Z).
- PS (p_bar stands for the overlined predicate of p; it collects the tuples of p_D that a repair drops):
  player(X,Y,Z) :- player_D(X,Y,Z), not player_bar(X,Y,Z).    (key player[1])
  player_bar(X,Y,Z) :- player(X,V,W), player_D(X,Y,Z), Y <> V.
  player_bar(X,Y,Z) :- player(X,V,W), player_D(X,Y,Z), Z <> W.
48 Example / 2
(p_bar stands for the overlined predicate of p; it collects the tuples of p_D that a repair drops.)
- team(X,Y,Z) :- team_D(X,Y,Z), not team_bar(X,Y,Z).    (key team[1])
  team_bar(X,Y,Z) :- team(X,V,W), team_D(X,Y,Z), Y <> V.
  team_bar(X,Y,Z) :- team(X,V,W), team_D(X,Y,Z), Z <> W.
- coach(X,Y,Z) :- coach_D(X,Y,Z), not coach_bar(X,Y,Z).    (key coach[1])
  coach_bar(X,Y,Z) :- coach(X,V,W), coach_D(X,Y,Z), Y <> V.
  coach_bar(X,Y,Z) :- coach(X,V,W), coach_D(X,Y,Z), Z <> W.
- player_bar(X,Y,Z) :- player_D(X,Y,Z), coach(X,V,W).
  ED team[3], coach[1]:
  coach_bar(X,Y,Z) :- coach_D(X,Y,Z), team(V,W,X).
  coach_bar(X,Y,Z) :- coach_D(X,Y,Z), player(X,V,W).
  team_bar(X,Y,Z) :- team_D(X,Y,Z), coach(Z,V,W).