WP 8: Assessment and Dissemination


1
INFOMIX: Data Integration meets Nonmonotonic Deductive Databases
Thomas Eiter, Institute of Information Systems, Vienna University of Technology
2
Overview

Motivation
Information Integration Framework
Nonmonotonic Logic Programs
INFOMIX Architecture
Repair Programs
Focussing Techniques
Conclusion

3
Motivation

Data integration: increasing demand
(by-product of the expansion of the internet and WWW)
Highly complex problem
Current solutions in practice are pragmatic
Comprehensive, formal design methodologies and
coherent tools for designs are missing
Towards information integration at human-level
competence, utilizing reasoning capabilities

4
INFOMIX Objectives

Powerful information integration
Comprehensive information model
deal with incomplete and/or inconsistent
information
Information integration algorithms
Usage of Computational Logic
Integration of results on data acquisition and
transformation
Prototype system

5
Project Partners

University of Calabria: Nonmonotonic LP, deductive DB (Leone, Greco, Ianni, ...)
University of Rome La Sapienza: Data integration (Lenzerini, Cali, Lembo, Rosati, ...)
TU Wien: Nonmonotonic LP, data acquisition & extraction (Eiter, Faber, Gottlob, Fink, ...)
RODAN Systems: Database system implementation (Staniszkis, Nowicki, Kalka)

6
Data Integration System: Basic View
[Figure: a user query is posed against the global schema and a result is returned; a mapping links the global schema to Source 1 and Source 2]
7
Formal Information Integration Framework

Data Integration System
$I = \langle G, M, S \rangle$
$G$: global schema
$M$: mapping assertions
$S$: source schema

8
Global Schema

$G = \langle R, \Sigma \rangle$
$R$: relational schema (set of relations)
$\Sigma$: set of constraints
Key constraints (KDs)
Inclusion dependencies (IDs)
Exclusion dependencies (EDs)
...

9
Source Schema

$S = \langle R_S, \emptyset \rangle$
$R_S$: relational schema (set of relations)
No integrity constraints on sources
Different interpretations of the data retrieved
from the sources w.r.t. data satisfying the global
schema (later)

10
Mapping Assertions

Link sources and global relations:
$\langle q_S, q_G \rangle$
$q_S$: query over the sources $R_S$
$q_G$: query over the global relations $R$
Informally: the lhs corresponds to the rhs
Covers:
GAV ($q_G$ is a relation $r$ in $R$)
LAV ($q_S$ is a relation $s$ in $S$)
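
A GAV mapping can be read operationally: every global relation is filled by evaluating its source query. The Python below is a minimal sketch of that reading, assuming each source query is a single source relation followed by a projection. The relation names echo the football example used in this talk; the tuples, the `GavMapping` encoding and the helper `retrieve` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple
Sources = Dict[str, Set[Row]]

# One GAV assertion: (global relation, source relation, projection on source rows).
# This encoding is an assumption of the sketch, not part of INFOMIX.
GavMapping = List[Tuple[str, str, Callable[[Row], Row]]]


def retrieve(mapping: GavMapping, sources: Sources) -> Dict[str, Set[Row]]:
    """Materialize the global relations by evaluating every mapping assertion."""
    global_db: Dict[str, Set[Row]] = {}
    for global_rel, source_rel, project in mapping:
        rows = {project(row) for row in sources.get(source_rel, set())}
        global_db.setdefault(global_rel, set()).update(rows)
    return global_db


# Made-up source data, for illustration only.
sources: Sources = {
    "s1": {(10, "Totti", "RM", 27)},
    "s2": {("RM", "Roma", 10)},
    "s3": {("MN", "ManUtd", 8)},
    "s4": {(7, "Ferguson", "MN")},
}

mapping: GavMapping = [
    ("player", "s1", lambda r: r[:3]),  # player(X,Y,Z) :- s1(X,Y,Z,W).
    ("team", "s2", lambda r: r),        # team(X,Y,Z)   :- s2(X,Y,Z).
    ("team", "s3", lambda r: r),        # team(X,Y,Z)   :- s3(X,Y,Z).
    ("coach", "s4", lambda r: r),       # coach(X,Y,Z)  :- s4(X,Y,Z).
]

retrieved = retrieve(mapping, sources)
```

The result plays the role of a retrieved global database; it may still violate the global constraints, which is what the later slides on inconsistency address.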

11
Semantics

Given: instance $D$ of the source schema $S$
Issue: instance $D_G$ of the global schema $G = \langle R, \Sigma \rangle$
$D_G$ must satisfy the constraints $\Sigma$
$D_G$ must comply with a mapping assumption (MA)
$sem(I,D) = \{ D_G \mid D_G \text{ complies with MA} \}$
Most important assumptions: soundness, exactness

12
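Example: Sound Semantics
KDs: key(player) = {1}, key(team) = {1}, key(coach) = {1}
IDs: $team[3] \subseteq player[1]$
EDs: $player[1] \cap coach[1] = \emptyset$
Mapping:
player(X,Y,Z) :- s1(X,Y,Z,W).
team(X,Y,Z) :- s2(X,Y,Z).
team(X,Y,Z) :- s3(X,Y,Z).
coach(X,Y,Z) :- s4(X,Y,Z).
[Figure: source tables s1, s2, s3, s4 and the global relations player, team, coach]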
13
User Queries

Important core: conjunctive queries (CQs)
q(x) :- r1(x1), r2(x2), ..., rk(xk)
Extensions: UCQs (unions of CQs), Datalog w/o negation,
recursion, built-ins
Semantics: certain answers
$ans_{sem}(q,I,D) = \{ c \mid c \in q^{D_G} \text{ for each } D_G \in sem(I,D) \}$
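
Certain answers are the tuples returned by the query on every admissible global instance. The sketch below illustrates that intersection in Python, assuming the set of instances is given explicitly as a finite collection (in general it need not be finite). The names `certain_answers` and `q`, and the sample instances, are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

Database = Dict[str, Set[Tuple]]


def certain_answers(query: Callable[[Database], Set[Tuple]],
                    instances: Iterable[Database]) -> Set[Tuple]:
    """Intersect the query answers over all given global instances.

    Simplification: an empty collection yields the empty set here.
    """
    answer: Optional[Set[Tuple]] = None
    for db in instances:
        result = query(db)
        answer = result if answer is None else answer & result
    return answer if answer is not None else set()


def q(db: Database) -> Set[Tuple]:
    """q(X) :- player(X,Y,Z)."""
    return {(x,) for (x, _, _) in db.get("player", set())}


# Two made-up global instances that agree only on player 10.
instances = [
    {"player": {(10, "Totti", "RM"), (9, "Beckham", "MU")}},
    {"player": {(10, "Totti", "RM"), (8, "Figo", "RM")}},
]
result = certain_answers(q, instances)
```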

14
Example

Query:
q(X) :- player(X,Y,Z).
Result:
$ans_{sound}(q,I,D) = \{10, 9, 8\}$

[Figure: the global relation player]
15
Semantics for Inconsistency

Problem: $sem(I,D) = \emptyset$ is possible
Relax the choice of $D_G$ (loose semantics vs.
strict)
$D_G$ must satisfy the global constraints
$D_G$ should comply as closely as possible with the
mapping assumption
Select the best (minimal) $D_G$ under an ordering $D_G \leq_D D_G'$
GAV, sound: get more of $q_S(D)$
$r(D_G) \cap q_S(D) \supseteq r(D_G') \cap q_S(D)$
GAV, complete: miss $q_S(D)$ less
$r(D_G) \setminus q_S(D) \subseteq r(D_G') \setminus q_S(D)$
GAV, exact: get more of $q_S(D)$ and miss it less

16
Example: Sound Semantics
Additional tuple in s3: inconsistency (KD on
$team[1]$ violated)
Mapping:
player(X,Y,Z) :- s1(X,Y,Z,W).
team(X,Y,Z) :- s2(X,Y,Z).
team(X,Y,Z) :- s3(X,Y,Z).
coach(X,Y,Z) :- s4(X,Y,Z).
[Figure: source tables s1, s2, s3, s4 and the global relations player, team, coach]
17
Example / 2
Two possibilities for a $\leq_D$-minimal $D_G$ (no extra
tuples)
[Figure: the two candidate instances 1) and 2), each showing the global relations player, team, coach]
Query answer: $ans_{loosely\text{-}sound}(q,I,D) = \{10, 9, 8\}$
18
Complexity of Query Answering

Queries: UCQs
GAV: non-recursive Datalog; LAV: CQs
Constraints and mapping assumptions interact
Data/combined complexity (lower bounds if
decidable above PTIME)
NKC (non-key-conflicting IDs): $r[A] \subseteq s[B] \wedge s \text{ has key } K \Rightarrow K \not\subset B$
1KC (1-key-conflicting IDs): $r[A] \subseteq s[B] \wedge s \text{ has key } K \Rightarrow K \subset B \wedge |B| = |K| + 1$
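
The two ID classes can be checked syntactically. The sketch below assumes the following reading of the definitions: an ID $r[A] \subseteq s[B]$ is NKC when $s$ has no key or its key $K$ is not a proper subset of $B$, and 1KC when $K$ is a proper subset of $B$ with exactly one extra attribute. The function name `classify_id` and the encoding of attribute sets as sets of positions are hypothetical.

```python
from typing import FrozenSet, Optional


def classify_id(rhs_attrs: FrozenSet[int], key: Optional[FrozenSet[int]]) -> str:
    """Classify an ID r[A] <= s[B] against the key K of s.

    rhs_attrs is B (attribute positions of s); key is K, or None if s has no key.
    """
    if key is None or not key < rhs_attrs:  # K is not a proper subset of B
        return "NKC"
    if len(rhs_attrs) == len(key) + 1:      # K is a proper subset and |B| = |K| + 1
        return "1KC"
    return "other"


# team[3] <= player[1], where player has key {1}: here K equals B.
example = classify_id(frozenset({1}), frozenset({1}))
```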

19
How to Evaluate Queries / Semantics?

Approach: use computational logic
Advantages:
executable specification of semantics
obtain computational power needed
Desiderata:
close to database processing
non-determinism (for global view)
efficiency

20
Basic Approach

Query rewriting:
Query $q(x)$ on global schema $G$ $\rightarrow$ query $q'(x)$ on
source schema $S$.
(perfect rewriting, data independent)
Feasibility depends on
mapping type (GAV/LAV) and language
semantics
type of constraints
input / output query language

21
Some Results

For UCQs:
GAV, non-recursive Datalog(neg) mapping:
perfect rewriting under
strictly- / loosely-sound semantics with KDs,
IDs, EDs
output language: general Datalog(neg)
LAV, CQ mappings:
compilation of LAV into GAV, for strictly-sound
semantics with IDs and EDs

22
Nonmonotonic Logic Programs

Disjunctive Datalog(neg) rules:
h1(x1) v ... v hl(xl) :- b1(y1), ..., bm(ym), not
c1(z1), ..., not cn(zn)
function-free atoms (constants allowed)
non-monotonic negation (not)
Semantics:
minimal model semantics (not-free programs)
stratified semantics (layered negation)
stable model semantics (Gelfond & Lifschitz, all
programs)
Complexity & expressiveness:
captures co-NP^NP queries
co-NP^NP / co-NEXP^NP data / combined complexity

23
Nonmonotonic Logic Programs / 2

Non-determinism
Example: select one element from a set s.
in(X) :- s(X), not out(X).
out(X) v out(Y) :- s(X), s(Y), X <> Y.
Extensions:
strong (classical) negation
weight constraints
aggregates, ...
Efficient implementations:
DLV (TU Vienna, U Calabria), Smodels (TU
Helsinki), ...
Important KR tools (e.g. for Answer Set
Programming)
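
To see why the selection program has one stable model per element of s, its ground version can be checked by brute force. The Python below is a naive, exponential sketch of the stable model definition (Gelfond-Lifschitz reduct plus minimality), written only for illustration; it is not how DLV or Smodels work, and all function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

# A ground rule: (head atoms, positive body atoms, negated body atoms).
Rule = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]


def ground_choice_program(elements: List[str]) -> List[Rule]:
    """Ground instance of the selection program over the given elements of s."""
    rules: List[Rule] = []
    for x in elements:
        rules.append((frozenset({f"s({x})"}), frozenset(), frozenset()))  # s(x).
        rules.append((frozenset({f"in({x})"}), frozenset({f"s({x})"}),
                      frozenset({f"out({x})"})))  # in(x) :- s(x), not out(x).
    # The disjunctive rule is symmetric, so one instance per unordered pair suffices.
    for x, y in combinations(elements, 2):
        rules.append((frozenset({f"out({x})", f"out({y})"}),
                      frozenset({f"s({x})", f"s({y})"}), frozenset()))
    return rules


def is_model(positive_rules: List[Rule], interp: FrozenSet[str]) -> bool:
    """Check a negation-free program: a satisfied body needs a true head atom."""
    return all(head & interp for head, pos, _ in positive_rules if pos <= interp)


def stable_models(rules: List[Rule]) -> List[FrozenSet[str]]:
    atoms = sorted(set().union(*(h | p | n for h, p, n in rules)))
    models = []
    for size in range(len(atoms) + 1):
        for chosen in combinations(atoms, size):
            m = frozenset(chosen)
            # Reduct: drop rules blocked by m, then drop the negative literals.
            reduct = [(h, p, frozenset()) for h, p, n in rules if not (n & m)]
            if not is_model(reduct, m):
                continue
            # m is stable iff no proper subset of m is a model of the reduct.
            if not any(is_model(reduct, frozenset(sub))
                       for k in range(len(m)) for sub in combinations(m, k)):
                models.append(m)
    return models


models = stable_models(ground_choice_program(["a", "b", "c"]))
selected = sorted(atom for m in models for atom in m if atom.startswith("in("))
```

Each stable model contains exactly one `in(...)` atom, so the program behaves as a nondeterministic choice over s.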

24
Challenges for Nonmonotonic LP

Interfacing standard relational DB
Scalability
Facilities for query answering (DLV / Smodels
were conceived more as model generators)
Remark: historically, DLV set out as a deductive
DB engine; DLV uses a lot of DDB
technology

25
INFOMIX High Level Architecture
26
Information Service Layer

Define the data integration system $I = \langle G, M, S \rangle$
Store descriptions in Metadata Repository
Accept user queries
Visualize query results
INFOMIX Query Language (IQL)
subset of stratified Datalog(neg), depending on
decidability

/- NKC / General
27
Data Acquisition & Transformation (DAT) Layer

Access raw data in different formats
(relational, HTML, XML, OO)
Support data extraction from web pages (LiXto
technology)
Wrappers
Code Wrappers (API)
Query Wrappers (e.g., SQL/ODBC)
Visual Wrappers (LiXto, RODAN Data Extractor)
Logical source format: fragment of XML Schema
(akin to complex values)

28
DAT Relevant Data Formats in INFOMIX
ISDF XML fragment
29
Wrapper Design
Goal: relieve the designer from technical details
30
Integration Layer

Perform the data integration
Receive requests from Information Service Layer
Compute query rewritings
Evaluate query rewritings, interacting with DAT
Layer
Approach
combine / couple DLV with standard relational
engines
narrow use of DLV to where it is needed (push
work to relational engines as much as possible)

31
Repair Programs

GAV setting, loose semantics (typically, loosely-exact
semantics)
Repair semantics (Bertossi et al., Chomicki
and Marcinkowski, Greco et al., ...)
each $\leq_G$-minimal $D_G$ w.r.t. the retrieved database
$ret(I,D)$ (= materialized mappings) is a repair
$rep_I(D)$ is the set of all repairs
Query answering:
$ans(q,I,D) = \{ c \mid q(c) \text{ holds w.r.t. each } R \in rep_I(D) \}$
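
For intuition, the repair-based answers can be enumerated directly in a very restricted case. The sketch below assumes a single relation with one key constraint and deletion-only repairs, so a repair keeps exactly one tuple per key value; IDs and EDs are not handled. The tuples and the names `key_repairs`, `consistent_answers` and `names` are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple


def key_repairs(rows: Set[Row], key_len: int = 1) -> List[Set[Row]]:
    """All maximal subsets of rows in which the first key_len attributes form a key."""
    groups: Dict[Row, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[:key_len], []).append(row)
    # A repair keeps exactly one tuple from every group of key-equal tuples.
    return [set(choice) for choice in product(*groups.values())]


def consistent_answers(rows: Set[Row],
                       query: Callable[[Set[Row]], Set[Row]],
                       key_len: int = 1) -> Set[Row]:
    """Tuples returned by the query on every repair."""
    answers = [query(repair) for repair in key_repairs(rows, key_len)]
    return set.intersection(*answers)


def names(team: Set[Row]) -> Set[Row]:
    """q(N) :- team(C,N,L)."""
    return {(n,) for (_, n, _) in team}


# Made-up retrieved relation: two tuples clash on the key value "RM".
team_d = {("RM", "Roma", 10), ("RM", "Real Madrid", 10), ("MN", "ManUtd", 8)}
repairs = key_repairs(team_d)
answers = consistent_answers(team_d, names)
```

Only the name that survives in both repairs is returned; the two clashing names are dropped because neither holds in every repair.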

32
LP Specification for querying I(G,M,S)

Disjunctive Datalog(neg) program
$\Pi_I(q) = \Pi_M \cup \Pi_\Sigma \cup \Pi_q$
where
$\Pi_M$ is a (stratified) Datalog(neg) program for
retrieving the data from the sources
$\Pi_\Sigma$ is a disjunctive Datalog(neg) program
computing $rep_I(D)$ in its stable models
$\Pi_q$ is a nonrecursive Datalog(neg) program
encoding the query $q(x)$ on top

Hierarchical structure
$\Pi_M > \Pi_\Sigma > \Pi_q$
$ret(I,D) \rightleftharpoons SM(\Pi_M \cup D)$
$rep_I(D) \rightleftharpoons SM(\Pi_\Sigma \cup ret(I,D))$
$ans(q,I,D) = \{ c \mid q(c) \in M \text{ for each } M \in SM(\Pi_q \cup D_G),\ D_G \in rep_I(D) \}$
$= \{ c \mid q(c) \in M \text{ for each } M \in SM(\Pi_q \cup \Pi_\Sigma \cup ret(I,D)) \}$
$= \{ c \mid q(c) \in M \text{ for each } M \in SM(\Pi_I(q) \cup D) \}$
Compile-in non-key-conflicting IDs

34
Example

$\Pi_q$:
q(X) :- player(X,Y,Z).
q(X) :- team(V,W,X).    % from ID $team[3] \subseteq player[1]$
$\Pi_M$:
player_D(X,Y,Z) :- s1(X,Y,W,Z).
team_D(X,Y,Z) :- s2(X,Y,Z).
team_D(X,Y,Z) :- s3(X,Y,Z).
coach_D(X,Y,Z) :- s4(X,Y,Z).
$\Pi_\Sigma$:
% key player[1]; player_bar(t) means tuple t of player_D is dropped in the repair
player(X,Y,Z) :- player_D(X,Y,Z), not player_bar(X,Y,Z).
player_bar(X,Y,Z) v player_bar(X,V,W) :- player_D(X,Y,Z), player_D(X,V,W), Y <> V.
player_bar(X,Y,Z) v player_bar(X,V,W) :- player_D(X,Y,Z), player_D(X,V,W), Z <> W.

35
Example / 2

% key team[1]; team_bar(t) means tuple t of team_D is dropped in the repair
team(X,Y,Z) :- team_D(X,Y,Z), not team_bar(X,Y,Z).
team_bar(X,Y,Z) v team_bar(X,V,W) :- team_D(X,Y,Z), team_D(X,V,W), Y <> V.
team_bar(X,Y,Z) v team_bar(X,V,W) :- team_D(X,Y,Z), team_D(X,V,W), Z <> W.
% key coach[1]
coach(X,Y,Z) :- coach_D(X,Y,Z), not coach_bar(X,Y,Z).
coach_bar(X,Y,Z) v coach_bar(X,V,W) :- coach_D(X,Y,Z), coach_D(X,V,W), Y <> V.
coach_bar(X,Y,Z) v coach_bar(X,V,W) :- coach_D(X,Y,Z), coach_D(X,V,W), Z <> W.
% ED player[1], coach[1]
player_bar(X,Y,Z) v coach_bar(X,V,W) :- player_D(X,Y,Z), coach_D(X,V,W).

36
Query Optimization

Different repair encodings
E.g., use of unstratified negation instead of
disjunction
$\rightarrow$ Equivalence of logic program encodings
Focussing techniques
Relevance: prune useless rules in $\Pi_I(q)$.
Decomposition: localize inconsistency in
$ret(I,D)$.
Recombination: combine localized repairs to
answer $q$.

37
Decomposition

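Conflict set $C_{ret(I,D)}$ (via $\Sigma$), syntactic
conflict closure $C^*_{ret(I,D)}$
affected part ($A_{ret(I,D)}$) and safe part
($S_{ret(I,D)}$) of $ret(I,D)$
works for
universal constraints
$\forall x\; A_1 \wedge \dots \wedge A_n \rightarrow B_1 \vee \dots \vee B_m \vee \phi_1 \vee \dots \vee \phi_k$,
$n+m>0$,
$\phi_i$ built-in literals ($=$, $<>$, etc.)
similarity-compliant $\leq_D$ ($R \,\triangle\, ret(I,D) \subset
R' \,\triangle\, ret(I,D) \Rightarrow R <_D R'$)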

38
Main Results

For each $R$ in $rep_I(D)$ there is some $R'$ in
$rep(A_{ret(I,D)})$ such that
$R = (R' \cap C^*_{ret(I,D)}) \cup S_{ret(I,D)}$.
For each $R'$ in $rep(A_{ret(I,D)})$ there is such an $R$
in $rep_I(D)$.
Computing $A_{ret(I,D)}$ and $C^*_{ret(I,D)}$ is expensive,
while computing $C_{ret(I,D)}$ is efficient (use DB
engines)
If $n>0$, each $R'$ in $rep(A_{ret(I,D)})$ is included
in $C^*_{ret(I,D)}$
If $n>0$ and $m=0$ (e.g., FDs, KDs, EDs), each $R'$ in
$rep(A_{ret(I,D)})$ is included in $C_{ret(I,D)}$
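
The decomposition idea can be illustrated for the simplest case. The sketch below assumes one relation with a single key constraint and takes the affected part to be the tuples that share a key value with some other tuple; everything else is the safe part and is copied unchanged into every repair. This is a simplification for illustration, and the function names and tuples are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import List, Set, Tuple

Row = Tuple


def split_by_key_conflicts(rows: Set[Row], key_len: int = 1) -> Tuple[Set[Row], Set[Row]]:
    """Split rows into (affected, safe) with respect to a key on the first key_len attributes."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[:key_len]].add(row)
    affected = set().union(*(g for g in groups.values() if len(g) > 1))
    return affected, set(rows) - affected


def repairs_via_decomposition(rows: Set[Row], key_len: int = 1) -> List[Set[Row]]:
    """Repair only the affected part, then add the safe part to each local repair."""
    affected, safe = split_by_key_conflicts(rows, key_len)
    groups = defaultdict(list)
    for row in affected:
        groups[row[:key_len]].append(row)
    return [set(choice) | safe for choice in product(*groups.values())]


# Made-up retrieved relation with one key conflict on "RM".
team_d = {("RM", "Roma", 10), ("RM", "Real Madrid", 10), ("MN", "ManUtd", 8)}
affected, safe = split_by_key_conflicts(team_d)
repairs = repairs_via_decomposition(team_d)
```

The search for repairs touches only the two conflicting tuples; the safe tuple never enters the combinatorial part.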

39
DLV Developments

Coupling with DBMS
ODBC interface
Relevance for query answering
Magic set techniques
Non-ground query answering
internal marking of relations

40
Recombination

Answer query $q(x)$ from localized repairs:
$ans(q,I,D) = \{ c \mid q(c) \in M \text{ for each } M \in SM(\Pi_q \cup (R \cap C^*_{ret(I,D)}) \cup S_{ret(I,D)}),\ R \in rep(A_{ret(I,D)}) \}$
Simplifies for special constraints ($n>0$; $n>0$, $m=0$)
Practical method: repair compilation
store all repairs $R$ in $rep(A_{ret(I,D)})$ in a
relational DB
mark tuples of $A_{ret(I,D)}$ with a bitstring
(membership in $R$)
rewrite $q(x)$ to an SQL query on the marked $ret(I,D)$
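
The marking idea can be tried out with SQLite from Python. The sketch below assumes a single marked relation and a single-atom projection query: bit i of a tuple's mark says whether the tuple belongs to repair i, and a value is an answer if the marks of the tuples producing it cover all repairs. The table layout, the `bit_or` aggregate (registered here as a user-defined function) and the data are hypothetical; queries with joins would need a more involved rewriting.

```python
import sqlite3


class BitOr:
    """User-defined SQLite aggregate: bitwise OR of integer marks."""

    def __init__(self):
        self.value = 0

    def step(self, mark):
        self.value |= mark

    def finalize(self):
        return self.value


# Made-up marked relation with two repairs (bits 0 and 1).
# The two "RM" tuples each live in one repair; the safe tuple is in both.
rows = [
    ("RM", "Roma", 10, 0b01),
    ("RM", "Real Madrid", 10, 0b10),
    ("MN", "ManUtd", 8, 0b11),
]
ALL_REPAIRS = 0b11

conn = sqlite3.connect(":memory:")
conn.create_aggregate("bit_or", 1, BitOr)
conn.execute("CREATE TABLE team (code TEXT, name TEXT, leader INTEGER, mark INTEGER)")
conn.executemany("INSERT INTO team VALUES (?, ?, ?, ?)", rows)


def certain(column: str) -> set:
    """Values of one column that appear in every repair.

    The column name is interpolated into the SQL text, so pass trusted names only.
    """
    sql = f"SELECT {column} FROM team GROUP BY {column} HAVING bit_or(mark) = ?"
    return {r[0] for r in conn.execute(sql, (ALL_REPAIRS,))}


leaders = certain("leader")
team_names = certain("name")
```

The value 10 is kept although each of its tuples is in only one repair, because together they cover both repairs; the two clashing names are not.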

41
Experiments

Experiments on synthetic data sets (football
teams, graph 3-coloring) showed positive effects
Drastic improvement over naïve DLV evaluation
Still, marking effort increases quickly with
number of conflicts (viable for few conflicts)
But: inspiration for internal marking of
relations for non-ground query answering in DLV

42
INFOMIX Demo Scenario

University of Rome La Sapienza
information about students, courses, professors,
exams ...
3 legacy databases (MySQL), lots of web pages
Global schema: 15 relations,
30 constraints (KDs, IDs, EDs)
40 wrappers (query wrappers, visual wrappers)
10 user queries
Experiments: talk by Gianluigi Greco

43
Conclusion

INFOMIX Powerful information integration
Dealing with inconsistent and incomplete sources
Hard problems to solve
Fruitful use of computational logic
Rich Data Acquisition and Transformation Layer
Prototype (under implementation, available soon)

44
Further Issues

Data cleaning: important aspect
Improve wrapper retrieval
Methodology for design / usage
Compile (parts of) logic program to DBMS: DLVDB
Challenge: information integration for
semi-structured data

45
Publications

INFOMIX homepage
http://sv.mat.unical.it/infomix/
INFOMIX Reports
Papers in conferences and journals (PODS,
ICDT, IJCAI, KR, ICLP, LPNMR, JELIA, ...)

46
Data Integration System

Provides a global, unified view of a set of
heterogeneous, autonomous sources
A mapping specifies the relationship between the
global view and the sources
Users pose queries to the global view of the
data
The system computes the answers to the query by
suitably accessing the sources

47
Example (non-disjunctive)

$\Pi_q$:
q(X) :- player(X,Y,Z).
q(X) :- team(V,W,X).    % from ID $team[3] \subseteq player[1]$
$\Pi_M$:
player_D(X,Y,Z) :- s1(X,Y,W,Z).
team_D(X,Y,Z) :- s2(X,Y,Z).
team_D(X,Y,Z) :- s3(X,Y,Z).
coach_D(X,Y,Z) :- s4(X,Y,Z).
$\Pi_\Sigma$:
% key player[1]; player_bar(t) means tuple t of player_D is dropped in the repair
player(X,Y,Z) :- player_D(X,Y,Z), not player_bar(X,Y,Z).
player_bar(X,Y,Z) :- player(X,V,W), player_D(X,Y,Z), Y <> V.
player_bar(X,Y,Z) :- player(X,V,W), player_D(X,Y,Z), Z <> W.

48
Example / 2

% key team[1]; team_bar(t) means tuple t of team_D is dropped in the repair
team(X,Y,Z) :- team_D(X,Y,Z), not team_bar(X,Y,Z).
team_bar(X,Y,Z) :- team(X,V,W), team_D(X,Y,Z), Y <> V.
team_bar(X,Y,Z) :- team(X,V,W), team_D(X,Y,Z), Z <> W.
% key coach[1]
coach(X,Y,Z) :- coach_D(X,Y,Z), not coach_bar(X,Y,Z).
coach_bar(X,Y,Z) :- coach(X,V,W), coach_D(X,Y,Z), Y <> V.
coach_bar(X,Y,Z) :- coach(X,V,W), coach_D(X,Y,Z), Z <> W.
% ED player[1], coach[1]
player_bar(X,Y,Z) :- player_D(X,Y,Z), coach(X,V,W).
coach_bar(X,Y,Z) :- coach_D(X,Y,Z), player(X,V,W).
% ED team[3], coach[1]
coach_bar(X,Y,Z) :- coach_D(X,Y,Z), team(V,W,X).
team_bar(X,Y,Z) :- team_D(X,Y,Z), coach(Z,V,W).