Title: Answer Set Programming and Data Integration Systems
1Answer Set Programming and Data Integration
Systems
- S. Costantini
- Università degli Studi di LAquila
- stefcost_at_di.univaq.it
2A Traditional Database Architecture
Query (SQL)
Database Manager (DBMS) -Storage mgmt
-Query processing -View management
-(Transaction processing)
Database (relational)
Answer (relation)
3An Online Shoppers Information Integration
Problem
El Cheapo Where can I get the cheapest copy
(including shipping cost) of Wittgensteins
Tractatus Logicus-Philosophicus within a week?
One-World Mediation
4A Home Buyers Information Integration Problem
What houses for sale under 500k have at least 2
bathrooms, 2 bedrooms, a nearby school ranking
in the upper third, in a neighborhood with
below-average crime rate and diverse population?
Multiple-Worlds Mediation
5A Geoscientists Information Integration Problem
What is the distribution and U/ Pb zircon ages of
A-type plutons in VA? How about their 3-D
geometry ? How does it relate to host rock
structures?
Complex Multiple-Worlds Mediation
6A Neuroscientists Information Integration Problem
What is the cerebellar distribution of rat
proteins with more than 70 homology with human
NCS-1? Any structure specificity? How about other
rodents?
Complex Multiple-Worlds Mediation
7What is a Data Integration System?
A system providing
- Uniform (same query interface to all sources)
- Access to (queries eventually updates too)
- Multiple (we want many, but 2 is hard too)
- Autonomous (DBA doesnt report to you)
- Heterogeneous (data models are different)
- Structured (or at least semi-structured)
- Data Sources (not only databases).
8Past
- User must know which sites have relevant info
- User must go to each one in turn
- Slow Sequential access takes time
- Confusing Each site has a different interface
- User must manually integrate information
9Perspective
- Intelligent agents such that
- User says what she wants
- Agent decides how when to achieve it
10References
- Data Integration System Group at the Dipartimento
di Informatica e Sistemistica, Univ. Of Roma La
Sapienza (Prof. Maurizio Lenzerini) -
- INFOMIX information integration, database
theory, computational logic and deductive
databases, database implementation and
optimization, and logic-based software agent
technology. - University of Calabria (Italy), Vienna
University of Technology (Austria), University of
Rome La Sapienza (Italy), Rodan Systems
(Poland). - Università degli Studi di Roma "Tor Vergata"
Dipartimento di informatica, sistemi e
produzione (Prof. Maria Teresa Pazienza) - Database Group of Microsoft Research (Phil
Bernstein) - Answer Set Programming in Data Integration
Systems, Leopoldo Bertossi, Loreto Bravo
(Carleton University, Ottawa, Canada) Jan
Chomichi (University of Buffalo, USA)
11Integration vs. Distributed DBMS
- No common schema
- Sources with heterogeneous schemas
- Semi-structured sources
- Legacy Sources
- Not necessarily relational
- Access/process limitations
- Autonomous sources
- Uncontrolled source content overlap
- Unpredictable run-time behavior
- Makes query execution hard
12Building a Data Integration System
- Create a middleware mediator or data
integration system over the sources - Can be virtual or a data wharehouse
- Presents a uniform query interface and schema
- Abstracts away multitude of sources consults
them for relevant data
13Data Warehouse Architecture
OLAP / Decision support/ Data mining
User queries
(Relational?) database (warehouse)
Data extraction, cleaning/ scrubbing
Data extraction programs
Data source
Data source
Data source
14OLTP vs. OLAP
- OLTP On-Line Transaction Processing
- Many short transactions (queries updates)
- Examples
- Update account balance
- Enroll in course
- Add book to shopping cart
- Queries touch small amounts of data (one record
or a few records) - Updates are frequent
- Concurrency is biggest performance concern
- OLAP On-Line Analytical Processing
- Long transactions, complex queries
- Examples
- Report total sales for each department in each
month - Identify top-selling books
- Count classes with fewer than 10 students
- Queries touch large amounts of data
- Updates are infrequent
- Individual queries can require lots of resources
15Data Warehouse
Enterprise Database
Customers
Orders
Transactions
Vendors
Etc
Etc
- Data Miners
- Farmers they know
- Explorers - unpredictable
Copied, organized summarized
Data Warehouse
Data Mining
16Architecture for Virtual Integration
- Leave the data in the sources.
- When a query comes in
- Determine which sources are relevant to query.
- Break query into sub-queries for each source.
- Get answers from sources
- Combine.
- Data is fresh.
- Challenge performance.
17Virtual Integration Architecture
User queries
Mediated schema
Reformulator
Mediator
Optimizer
Data source catalog
Execution engine
wrapper
wrapper
wrapper
Data source
Data source
Data source
Sources can be relational, hierarchical
(IMS), structured files, web sites.
18Mediated Integration
- User poses queries in terms of global schema
- Relationship between global schema and local,
source schemas specied in the mediator, as source
descriptions - Mediator is responsible of solving problems of
data - redundancy
- complementarity
- inconsistency sources, independently, may be
consistent, but together, possibly not. E.g.,
Same ID card number may be assigned to different
people in different sources
19XML-Based Mediator Architecture
USER/Client
Query Q ( G (S1,..., Sk) )
Integrated Global (XML) View G
Integrated View Definition G(..)? S1(..)Sk(..)
MEDIATOR
(XML) Queries Results
(XML) View
(XML) View
(XML) View
wrappers implemented as web services
Wrapper
Wrapper
Wrapper
S1
S2
Sk
20Wrapper Programs
- Task
- to communicate with the data sources and do
format translations. - Built w.r.t. a specific source.
- Can sit either at the source or mediator.
- Often hard to build
- (very little science).
- Can be intelligent
- perform source-specific optimizations.
- Exploit ontologies to relate analogous domains
21Example
ltbgt Introduction to DB lt/bgt ltigt Phil Bernstein
lt/igt ltigt Eric Newcomer lt/igt Addison Wesley,
1999 ltbookgt lttitlegt Introduction to DB
lt/titlegt ltauthorgt Phil Bernstein
lt/authorgt ltauthorgt Eric Newcomer
lt/authorgt ltpublishergt Addison Wesley
lt/publishergt ltyeargt 1999 lt/yeargt lt/bookgt
Transform
into
22Semantic Mappings between Schemas
house
address
num-baths
contact-info
agent-name agent-phone
1-1 mapping
non 1-1 mapping
house
location contact
full-baths
half-baths
name phone
23Accessing Sources via Wrappers
SELECT address, tel FROM Restaurant WHERE cuisine
chinese
Chinois, 2720 Main St, 310-777-9876 Peking Star,
1 Broad St, 213-999-7676 .....
24Wrapper Generation
- Approaches to automating the process
- Exploit format information (structure, HTML etc.
) - Template based approaches
- Machine learning techniques
- (e.g., shallow NLP where NLP Natural
Language Processing) - LIXTO semi-automated wrapper generation
- Elog prolog patterns
- HTML ? operator ?Elog ? XML
25Semantic Web
- The semantic-web initiative attempts to
automate schema mapping - Idea Allow pages to write logical axioms
relating their vocabulary (tags) to other
external tags - Support automatic inference of relations between
source and mediator schema using these rules
26Query Model in Virtual Integration
- User formulates query in terms of his/her
ontology on the mediated (or global) schema - System reformulates queries in terms of
sub-queries for each source (local schema) - Structure of the query model should be more
intuitive for the user
27Reformulation Problem
- Given
- A query Q posed over the mediated schema
- Descriptions of the data sources
- Find
- A query Q over the data source relations, such
that - Q provides only correct answers to Q, and
- Q provides all possible answers from to Q given
the sources.
28Deductive Databases
- Tables viewed as predicates.
- Ops on tables expressed as datalog rules
- (Horn clauses, without function symbols)
- Enames(Name) - Employe(Name, SSN) Projection
- Wealthy-Employee(Name)-
- Employee(Name,SSN), Salary(SSN,Money),Moneygt
10 -
Selection - Ed(Name, Dname)-
- Employee(Name, SSN), E_Dependents(SSN,
Dname) -
Join - ERelated(Name,Dname) - Ed(Name,Dname)
- ERelated(Name,Dname) - Ed(Name,D1),
ERelated(D1,D2) -
Recursion
29More datalog terminology
- A datalog program is a set of datalog rules.
- A program with one rule is a conjunctive query.
- We distinguish EDB predicates and IDB predicates
- EDBs are stored in the database,
- appear only in rule bodies
- IDBs are intensionally defined,
- appear in both bodies and heads.
30Approaches to Specifying Schema Descriptions
- Global-as-View express the mediated schema
relations as a set of views over the data source
relations - Local-as-View express the source relations as
views over the mediated schema.
31GaV an example
32Approaches for Mapping
33Global-as-view vs. Local-as-view
- Global-as-view approach
- Each item in Global schema/ontology as a view
(query) over source schemas/ontologies - query(G) query(f(S1, S2, , Sn))
- Local-as-view approach
- Each source as a view/query over global
schema/ontology - query(G) query(f1-1 (S1), f2-1(S2), , fn-1
(Sn))
34 GAV vs. LAV
- Not modular
- Addition of new sources changes the schema
mapping - Can be awkward to write mediated schema without
loss of information - Query reformulation easy
- Often reduces to view unfolding (polynomial)
- Can build hierarchies of mediated schemas
- Best when
- Few, stable, data sources
- well-known to the mediator (e.g. corporate
integration)
- Modular--adding new sources is easy
- Very flexible--power of the entire query language
available to describe sources - Reformulation is hard
- Involves answering queries only using views (can
be intractable) - Best when
- Many, relatively unknown data sources
- possibility of addition/deletion of sources
35Data Integration Global-as-View in Answer Set
Programming
- S. Costantini
- Università degli Studi di LAquila
- stefcost_at_di.univaq.it
36Computing Answers in Data Integration Systems
- Answer set programming (ASP) gives a semantics to
datalog (logic) programs with negation (and
possibly disjunction in the heads) - ASP-based specification of a data integration
system - ASP-based specification of extensions (plausible
answers, repairs)
37Computing Answers in Data Integration Systems
- Methodology works for first-order queries (and
Datalog extensions), and many ICs - LAV Bravo and Bertossi, IJCAI03
- GAV Costantini and Formisano, ASP03
- (experiments and a working program)
38GaV in ASP
- Instead of unfolding w.r.t. each query, helper
model so as to get a general inference engine - Easily express integrity constraints
- Cope with incomplete information (also resulting
form adding new sources)
39GaV an example
40Front-end Global Schema through a Helper Model
- person(X)- entity(1,X). meta-concepts naming
- organization(X)- entity(2,X).
- member(X,Y)- relat(1,X,Y).
- student(X)- entity(3,X).
- university(X)- entity(4,X).
- enrolled(X,Y)- relat(2,X,Y).
- age(X,Y)- attr(1,X,Y).
- Constraints on student and enrolled
- enrolledE(X)- student(X),university(Y),enrolled(X
,Y). - - not student(X),enrolled(X,Y).
- - student(X),not enrolledE(X).
41Back-end Mapping to the Source Schemas
- entity(1,X)- s(1,X).
- entity(2,X)- s(2,X).
- entity(4,X)- s(5,X).
- student is described implicitly as the domain
of an unknown relationship - entity(3,X)- s(3,X,Y).
- student is described implicitly as the domain
of one of the sources of enrolled - entity(3,X)- s(4,X,Z).
42Back-end Mapping to the Source Schemas
- Relationships
- relat(1,X,Y)- s(7,X,Z),s(8,Z,Y).
- relat(2,X,Y)- s(4,X,Y).
- relat(2,X,Y)-s(9,X,Y).
- attr(1,X,Y)- s(3,X,Y).
- attr(1,X,Y)- s(6,X,Y,Z).
43Helper Model Relational Inference Engine
- Meta-rules for is_a chaining
- relat(M,X,Y)- schema(N,E1,E2),
entity(E1,X), entity(E2,Y),
is_a_r(N,M), relat(N,X,Y). - entity(M,X)- is_a_e(N,M), entity(N,X).
44GaV an example
45Dealing with Incomplete Information
- Assume to add a new source for Enrolled, that in
the back-end becomes, e.g., - relat(2,X,Y)- s(4,X,Y).
- relat(2,X,Y)- s(9,X,Y).
- No connection to student but, one who is
enrolled to a University should be a student
46Dealing with Incomplete Information (cont. 1)
ASP constraints
- Constraints on student and enrolled
- enrolledE(X)- university(Y),enrolled(X,Y).
- Student mandatory in Enrolled
- - student(X),not enrolledE(X).
- One if enrolled is a student
- - not student(X),enrolled(X,Y).
47Dealing with Incomplete Information (cont. 2)
Generating a Search Space
- specifies possibly incomplete entities
- ... student(X)- entity(3,X).
- incomplete(3). Student
- Any individual might occur in
- an incomplete entity
- entity(N,X)- incomplete(N), not noentity(N,X).
- noentity(N,X)- incomplete(N), not entity(N,X).
48Dealing with Incomplete Information (cont. 2)
Pruned Search space
- One who is enrolled is a student
- Instance of a well-known problem in GaV
- Not solvable with simple unfolding
- Simply (though not efficiently) solved in ASP
49Dealing with Incomplete Information
50Dealing with Incomplete Information (cont. 1)
51Dealing with Incomplete Information (cont. 2)
52Dealing with Incomplete Information (cont. 3)
53Dealing with Incomplete Information (cont. 4)
54Observations and Issues
55The End, thank you for your attention!