Title: Information Integration
1 2Information Resides on Heterogeneous Information
Sources
Personal database
Excel
WWW
Flat File
- different interfaces
- different data representations
- redundant and conflicting information
3Modes of Information Integration
- Federated Databases the sources are independent,
but one source can call on others to supply
information - Data warehouse copies of data from several
sources are stored in a single database, called a
data warehouse. The data stored at the warehouse
is first processed in some way before storage
e.g. data may be filtered, and relations may be
joined or aggregated. As the data is copied from
the sources, it may need to be transformed in
certain ways to make all data conform to the
schema at the data warehouse
4Modes of Information Integration
- Mediation a mediator is a software component
that supports a virtual database, which the user
may query as if it were materialized (physically
constructed like a warehouse). The mediator store
no data of its own. Rather, it translates the
users query into one or more queries to its
sources. The mediator then synthesizes the answer
to the users query from the responses of those
sources, and returns the answer to the user
5Problems of Information Integration
- Example
- The AAAI Automobile Co. has 1000 dealers each
of which maintains a database of their cars in
stock. AAAI wants to create an integrated
database containing the information of all 1000
sources. The integrated database will help
dealers locate a particular model if they dont
have one in stock. It also can be used by
corporate analysts to predict the market and
adjust production to provide the model most
likely to sell
6Problems of Information Integration
- The 1000 dealers do not all use the same database
schema - Cars (serialNo, model, color, autoTrans,
cdPlayer, ...) - or
- Autos (serialNo, model, color)
- Options (serialNo, option)
7Problems of Information Integration
- Schema difference
- Different equivalent names
- Data type differences numbers may be represented
by character strings of varying length at one
source and fixed length at another - Value differences the same concept may be
represented by different constants at different
sources (BLACK, BL, 100, etc) - Semantic differences Terms can be given
different interpretations at different sources
(Cars includes trucks or not) - Missing values a source may not record
information of a type that all of the other
sources provide
8Goal System Providing Integrated View of
Heterogeneous Data
Integration System
Personal database
Excel
WWW
Flat File
- collects and combines information
- provides integrated view, uniform user interface
9The Data Warehousing Approach to Integration
Client
Stored Integrated View
Mediator
Wrapper
Wrapper
Excel
Flat File
10The Data Warehousing Approach to Integration
- Data from several sources is extracted and
combined into a global schema - The data is stored at the warehouse which looks
like an ordinary database - There are three approaches to maintaining the
data in the data warehouse - off-line reconstruction of the whole data
warehouse - the data warehouse is updated periodically based
on the changes made to the original data sources - the data warehouse is updated immediately
11The Data Warehousing Approach to Integration
- Example
- Suppose that there are two dealers in the
system and that they use the schemas - Cars (serialNo, model,color,autoTrans, cdPlayer,
...) - and
- Autos (serialNo,model,color)
- Options (serialNo,option)
- Assume a data warehouse with the schema
- AutoWhse(serialNo,model,color,autoTrans, dealer)
12The Data Warehousing Approach to Integration
- The software to extracts data from the dealers
databases and populates the global schema can be
written as SQL-queries. The query for the first
dealer - insert into AutoWhse(serialNo,model,color,autoTran
s, dealer) - select serialNo, model, color, autoTrans,
dealer1 - from Cars
- The code for the second dealer is more complex
since we have to decide whether or not a given
car has an automatic transmission.
13The Data Warehousing Approach to Integration
- insert into AutoWhse(serialNo,model,color,autoTran
s, dealer) - select serialNo, model, color, yes, dealer2
- from Autos, Options
- where Autos.serialNoOptions.serialNo and
- optionautoTrans
- insert into AutoWhse(serialNo,model,color,autoTran
s, dealer) - select serialNo, model, color, no dealer2
- from Autos
- where not exists ( select from Options
- where serialNoAutos.serialNo and
- optionautoTrans)
14The Wrapper and Mediator Architecture
Client
Common Data Model
portfolios for each company
Mediator
business reports
stock market prices
Wrapper
Wrapper
Excel
Flat File
15The Wrapper and Mediator Architecture
- A mediator supports a virtual view, or
collection of views, that integrates several
sources in much the same way that the
materialized relation(s) in a data warehouse
integrate sources. - The mediator doesnt store any data lt lt lt
- Example
- Let us consider the same scenario. The mediator
integrates the same two data sources into a view
that is a single relation with the schema - AutoMed(serialNo,model,color,autoTrans, dealer)
16The Wrapper and Mediator Architecture
- Assume the user asks the mediator about the red
cars - select serialNo, model from AutosMed
- where color red
- The mediator forward the same query to each of
the two wrappers - (1) select serialNo, model from Cars where
colorred - (2) select serialNo, model from Autos where
colorred - The mediator can take the union of answers and
return the result to the user.
17The Lazy Integration Approach
Query Decomposition, Translation and Result
Fusion
Client
IBM portfolio
Mediator
IBM price
IBM related reports (in common model)
Wrapper
Wrapper
IBM related reports
Excel
Flat File
18Wrappers in Mediator-Based Systems
- In a data warehouse system, the source extractors
consist of - one or more queries built-in that are executed at
the source to produce data for the data warehouse - communication mechanisms, so that wrapper can
- pass ad-hoc queries to the source
- receive responses from the source
- pass information to the warehouse
- Mediator systems require more complex wrappers -
the wrapper must be able to accept a variety of
queries from the mediator and translate any of
them to the terms of the source.
19Wrappers in Mediator-Based Systems
- A systematic way to design a wrapper that
connects a mediator to a source is to classify
the possible queries that the mediator can ask
into templates, which are queries with parameters
that represent constants. - The mediator can provide the constants, and the
wrapper executes the query with the given
constants. - T ? S the template T is turned into the source
query S - Example
- The source of dealer1
- Cars (serialNo, model,color,autoTrans, cdPlayer,
...)
20Wrappers in Mediator-Based Systems
- Assume we use the mediator with schema
- AutoMed(serialNo,model,color,autoTrans, dealer)
- How the mediator could ask the wrapper for cars
of a given colorgt - The template
- select from AutoMed where color c ?
- select serialNo, model color, autoTrans,
dealer1 - from Cars where colorc
21Wrappers in Mediator-Based Systems
- The wrapper could have another template that
specified the parameter m representing a model - there would be 2N templates for N attributes
- the number of templates could grow unreasonably
large.
22Wrapper Generators
- The template defining a wrapper must be turned
into code for the wrapper itself - the software
that creates the wrapper is called a wrapper
generator - The wrapper generator creates a table that holds
the various query patterns contained in the
templates, and the source queries that are
associated with each. - A driver is used in each wrapper. The task of the
driver is to - accept a query from the mediator
- search the table for a template that match the
query - the source query is sent to the source using a
communication mechanism - the response is processed by the wrapper, if
necessary, and then returned to the mediator
23Wrappers Mediators from High-Level
Specifications
Client
Mediator Specification Interpreter
Mediator
Mediator Specification
Wrapper Generator
Wrapper
Wrapper
Wrapper Specification
Source
Source
24Filters
- Complex template
- select from AutoMed
- where color c and model m ?
- select serialNo, model color, autoTrans,
dealer1 - from Cars where colorc and model m
- Wrapper filter approach - if the wrapper has a
template that returns a superset of what the
query wants then it is possible to filter the
result at the wrapper - The decision whether a mediator asks for a subset
of what the pattern of some wrapper template
returns is a hard problem lt lt lt ltgtgtgtgt
25Filters
- Example
- Given the template
- select from AutoMed where color c
- The mediator needs to find blue Gobi model car
- select from AutoMed where color blue and
modelGobi - use the template with cblue to find all blue
cars - store the result in the temporary relation Temp
- select from Temp the Gobis and return the result
26Other Wrapper Operations
- It is possible to transform the data at the
wrapper in different ways - The mediator is asked to find dealers and models
such that the dealer has two red cars, of the
same model, one with and one without automatic
transmission. Suppose we have only one template
as before. - Select A1.model A1.dealer
- from AutoMed A1 AutoMed A2
- where A1.modelA2.model and A1.colorred and
- A2.colorred and A1.autoTransno and
- A2.autoTransyes
27Other Wrapper Operations
- It is possible to answer the query by first
obtaining from the Dealers 1 source a relation
with all the red cars (use the original template)
- RedAutos relation - select distinct A1.model A1.dealer
- from RedAutos A1, RedAutos A2
- where A1.modelA2.model and
- A1.autoTransno and
- A2.autoTransyes
28Challenge Sources Without a Well-Structured
Schema
Examples
- semistructured
- irregular
- deeply nested
- cross-referenced
- incomplete schema knowledge
- autonomous
- dynamic
- HTML pages
- SGML documents
- genome data
- chemical structures
- bibliographic information
- results of the integration process
29Challenge Different and Limited Source
Capabilities
Client
retrieve IBM data
Mediator (U A B)
retrieve IBM data
retrieve IBM data
Wrapper (A)
Wrapper (B)
30Mediator has to Adapt to Query Capabilities of
Sources
Client
retrieve IBM data
Mediator (U A B)
retrieve IBM data
retrieve IBM data
retrieve everything
(A) does not allow selection
Wrapper (A)
Wrapper (B)
31Part B
- Semistructured Data Representation
- Mediator Generation
- Wrapper Generation
- Capabilities-Based Rewriting
32Representation of Semistructured Information
using OEM
semantic object-id
label
Set Value
lthttp//www/doe, faculty, f1,l1,r1gt
ltf1, first_name, Johngt
ltl1, last_name, Doegt ltr1, rank,
professorgt
Atomic Value
structural object-id
33Object Exchange Model - Goals
- Easy to read
- Easy to edit
- Easy to generate or parse by a program
- Consistency with Stanfords other projects
(developed with the TSIMMIS) - Possibility of extensions in the future
34Graph Representation of OEM Data
lthttp//www/doe, faculty, f1,l1,r1gt
ltf1, first_name, Johngt
ltl1, last_name, Doegt ltr1, rank,
professorgt
http//www/doe
faculty first_name John
last_name Doe rank
professor
35OEM Structures Represent Arbitrary Labeled Graphs
http//www/smith
faculty name Mary Smith project
Air DB paper author
name John Doe author
name Mary Smith title Thin Air
DB
http//www/doe
faculty first_name John
last_name Doe rank professor
36Reprezentacja danych semistrukturalnych
- Object Exchange Model
- ACeDB
- XML
- Moga byc wykorzystywane w warstwie pomiedzy
mediatorem a wrapperami.
37Object Exchange Model
- Zdefiniowany przy okazji budowy systemu Tsimmis
sluzacego do integracji heterogenicznych zródel
danych. - Wykorzystywany przy projekcie Merlin (MQS) i
Lorel (QL)
38Object Exchange Model (cd)
- Wezel OEM sklada sie z czterech pól
- Object-ID jest wykorzystywany do unikalnej
identyfikacji okreslonego wezla OEM - Label jest ciagiem znaków który opisuje to co
wezel OEM reprezentuje - Type jest typem danych wartosci wezla. (atomowy
lub kolekcja) - Value moze byc albo wartoscia atomowa albo
referencja do kolekcji wezlów OEM - Jest zwykle reprezentowany jako ltObject-ID
Label Type Valuegt
39Object Exchange Model przyklad
- ltoid1 Notowanie Set oid11 oid12gt
- ltoid11 NrNotowania String 4004gt
- ltoid12 Rezultaty Set oid121 oid122gt
- ltoid121 Miejsce1 Set oid1211gt
- ltoid1211 Utwor String metropolisgt
- ltoid122 Miejsce2 Set oid1221gt
- ltoid1221 Utwor String moneygt
- ltoid2 Notowanie Set oid21 oid22gt
- ltoid21 NrNotowania String 4005gt
- ltoid22 Rezultaty Set oid221 oid222gt
- ltoid221 Miejsce1 Set oid2211gt
- ltoid2211 Utwor String learning to flygt
40Object Exchange Model cechy
- Reprezentowany jako graf z obiektami na
wierzcholkach i etykietami na krawedziach. - Wszystkie wystapienia sa obiektami.
- Kazdy obiekt ma swój unikalny identyfikator
(oid). - Rozrózniane sa dwa typy obiektów atomowe i
zlozone.
41Object Exchange Model cechy (cd)
- W OEM wystepuja tzw. nazwy (ang. names), które
moga byc traktowane jako aliasy do obiektów
wewnatrz bazy danych. - Nazwa sluzy jako wskaznik do bazy danych.
- Kazdy obiekt w bazie danych powinien byc
osiagalny za pomoca nazwy.
42ACeDB
- ACeDB (A C. elegans Database) byla rozwijana jako
baza danych informacji genetycznej organizmów. - Rozwijana od 1989.
- Posiada swój wlasny jezyk zapytan AQL - Acedb
Query Language - http//www.acedb.org/
43ACeDB cechy
- Schemat i dane moga byc traktowane jako drzewo z
etykietowanymi krawedziami. - Krawedzie moga byc etykietowane jakimkolwiek
typem podstawowym. (int, Notowanie)np. array
Int unique Int - Z okreslonego wierzcholka drzewa danych moze
wychodzic wiele galezi. - ACeDB pozwala na to aby jakakolwiek etykieta
rózna od etykiety glównej byla pominieta. - Identyfikatory obiektów wprowadzane sa przez
uzytkownika.
44ACeDB cechy (cd)
- ACeDB wymaga schematu
- Mimo to, fakt, ze dane moga byc pomijane oraz to,
ze etykietowane dane sa traktowane jednolicie z
innymi prostymi typami powoduje, ze jest on
bardzo bliski semistrukturalnemu modelowi danych.
45ACeDB przyklad
- gtBook title UNIQUE Text
- authors Text
- chapters int UNIQUE Text
- language UNIQUE english
- french
- other
- date UNIQUE month Int
- year Int
- hock2 title Computer Simulation Using
Particles - authors Hockney
- Eastwood
- chapters 1 Computer Experiments
- 2 A One-Dimensional Model
- ...
- language english
46XML
47Róznice pomiedzy XML a OEM
- XML jest uporzadkowany.
- Etykiety w OEM sa wykorzystywane tylko jako punkt
odniesienia oraz do oznaczania zaleznosci
pomiedzy obiektami. W XML kazdy element nie
bedacy ciagiem tekstowym zawiera identyfikujacy
go znacznik etykiete. - XML nie wspiera bezposrednio struktury grafu.
48Overview
- Semistructured Data Representation
- Mediator Generation
- Example of mediator specification
- Language expressiveness
- Implementation and performance
- Wrapper Generation
- Capabilities-Based Rewriting
49Merge Information Relating to a Faculty
faculty name John Doe rank
professor birthday April 1
papers ...
- Schema Integration
- Info fusion
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
50Mediator Specification Example
faculty name John Doe rank
professor birthday April 1
papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
51Mediator Specification Example Semantics of Rule
Bodies
faculty name John Doe rank
professor birthday April 1
papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
52Mediator Specification Example Semantics of Rule
Heads
John Doe faculty name John
Doe rank professor birthday
April 1 papers
...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
53Incrementally Add to Semantically Identified
Object
John Doe faculty name John
Doe rank professor birthday
April 1 papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s1
s2
faculty name John Doe rank
professor papers ...
person name John Doe birthday
April 1
54Irregularities Incomplete Schema Knowledge
John Doe
faculty name John Doe rank
professor birthday April 1
papers faculty name Mary Smith
project Air DB
Mary Smith
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1
s1
faculty name John Doe rank
professor papers faculty name
Mary Smith project Air DB
s2
person name John Doe birthday
April 1
55Second Rule Attaches More Subobjects to View
Objects
John Doe faculty name John
Doe rank professor birthday
April 1 papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s1
s2
faculty name John Doe rank
professor papers ...
person name John Doe birthday
April 1
56The OEM object structure of the cs wrapper
- lte1, employee, set, f1,l1,t1,rep1gt
- ltf1, first name, string, 'Joe'gt
- ltl1, last name, string, 'Chung'gt
- ltt1, title, string, 'professor'gt
- ltrep1, reports to, string, 'John Hennessy'gt
- lte2, employee, set, f2,l2,t2gt
- ltf2, first name, string, 'John'gt
- ltl2, last name, string, 'Hennessy'gt
- ltt2, title, string, 'chairman'gt
- . . .
- lts3, student, set, f3,l3,y3gt
- ltf3, first name, string, 'Pierre'gt
- ltl3, last name, string, 'Huyn'gt
- lty3, year, integer, 3gt
- . . .
57The OEM object structure of whois
- ltp1, person, set, n1,d1,rel1,elm1gt
- ltn1, name, string, 'Joe Chung'gt
- ltd1, dept, string, 'CS'gt
- ltrel1, relation, string, 'employee'gt
- ltelm1, e_mail, string, 'chung_at_cs'gt
- ltp2, person, set, n2,d2,rel2gt
- ltn2, name, string, 'Nick Naive'gt
- ltd2, dept, string, 'CS'gt
- ltrel2, relation, string, 'student'gt
- lty2, year, integer, 3gt
- ...
58Object exported by med
- ltcp1, cs_person, mn1,mrel1,t1,rep1,elm1gt
- ltmn1, name, string, 'Joe Chung'gt
- ltmrel1, rel, string, 'employee'gt
- ltt1, title, string, 'professor'gt
- ltrep1, reports_to, string, 'John Hennessy'gt
- ltelm1, e_mail, string, 'chung_at_cs'gt
59Problemy wystepujace przy tworzeniu Specyfikacji
Mediatora
- Schema-domain mismatch
- Schematic discrepancy
- Schema Evolution
- Structure Irregularities
60(MSL) Rules
- ltcs_person ltname Ngt ltrel Rgt Rest1 Rest2gt
- ltperson ltname Ngt ltdept 'CS'gt
- ltrelation Rgt Rest1gt_at_whois
- AND decomp(N, LN, FN)
- AND ltR ltfirst name FNgt
- ltlast name LNgt Rest2gt_at_cs
- External
- decomp(string,string,string)(bound,free,free)
- impl by name_to_lnfn
- decomp(string,string,string)(free,bound,bound)
- impl by lnfn_to_name
61(MSL) Rules
- ltcs_person ltname Ngt ltrel Rgtgt
- ltperson ltname Ngt ltdept 'CS'gt
- ltrelation Rgtgt_at_whois
- ltcs_person ltname Ngt ltrel Rgtgt
- - decomp(N, LN, FN)
- AND ltR ltfirst name FNgt
- ltlast name LNgtgt_at_cs
- ltcs_person ltname Ngt ltrel Rgt lttitle Tgtgt
- - ???
62(MSL) Rules
- ltcs_person ltname Ngt ltrel Rgt lte_mail Egtgt
- - ???
- ??? - ??? ltR ltfirst_name FNgt
- ltlast_name LNgt lttitle Egt
- ltreports_to Sgtgt_at_cs
- Rewriting
- ltcs_person ltname Ngt ltrel Rgt lte_mail Egt lttitle Egt
- ltreports_to Sgtgt
- - ???
63Language Expressiveness
- Information fusion problems solved by MSL
- Irregularities
- Incomplete knowledge of source structure
- Transformation of cross-referenced structures
- Inconsistent and redundant data
- Use of arbitrary matching criteria
- Theoretical analysis of expressiveness
- Consider the relational representation of OEM
graphs. Then MSL is equivalent to SQL special
form of transitive closure
64Inconsistent and Redundant Information
John Doe faculty
name John Doe rank associate
rank assistant
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2 AND NOT ltfaculty ltname Ngt ltL V1gtgt_at_s1
s1
s2
faculty name
John Doe rank associate
person name
John Doe rank assistant
65Overview
- Semistructured Data Representation
- Mediator Generation
- Example of mediator specification
- Language expressiveness
- Implementation and performance
- Wrapper Generation
- Capabilities-Based Rewriting
66Mediator Specification Interpreter Architecture
Result
Query
Mediator Specification
Query Rewriter
logical datamerge program
Cost-Based Optimizer
plan
Datamerge Engine
Queries to Wrappers
Results
67Query Rewriting When Known Origins of Information
- ltN faculty ltsalary Sgtgt - - ltfaculty
ltname Ngt ltsalary Sgtgt_at_s1 - ltN faculty lt rank R gtgt - ltperson ltname
Ngt ltrank Rgtgt_at_s2 - ltwell-paid ltname Ngt ltsalary Xgtgt - ltN
faculty ltsalary Xgt ltrank assistantgtgt AND
Xgt65000
68Query Rewriter Pushes Conditions to Sources
- ltN faculty ltsalary Sgtgt - - ltfaculty
ltname Ngt ltsalary Sgtgt_at_s1 ltN faculty lt
rank R gtgt - ltperson ltname Ngt ltrank
Rgtgt_at_s2 - ltwell-paid ltname Ngt ltsalary Xgtgt - ltN
faculty ltsalary Xgt ltrank assistantgtgt AND
Xgt65000 - logical datamerge program ltwell-paid
ltname Ngt ltsalary Xgtgt - (ltfaculty
ltname Ngt ltsalary Xgtgt AND Xgt65000)_at_s1
AND ltperson ltname Ngt ltrank assistantgtgt_at_s2
69Passing Bindings Local Join Plans
Passing Bindings
s1
s2
ltsalary Xgt - ltfaculty ltname Ngt
ltsalary Xgtgt AND Xgt65000
ltname Ngt - ltperson ltrank assistantgtgt
Local Join
s1
s2
lta lts Xgt ltn Ngtgt- ltfaculty ltname Ngt
ltsalary Xgtgt AND Xgt65000
N
ltname Ngt - ltperson ltrank assistantgtgt
70Query Decomposition When Unknown Origins of
Information
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
ltX faculty ltS Ygtgt - ltX faculty ltbirthday
1/20gt ltS Ygtgt
71Plan Considers All Possible Sources of birthday
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
ltX faculty ltS Ygtgt - ltX faculty ltbirthday
1/20gt ltS Ygtgt
s2
s1
name
birthday
name
birthday
72Overview
- Semistructured-Data Representation
- Mediator Generation
- Wrapper Generation
- Capabilities-Based Rewriting
73Query Translation in Wrappers
SELECT FROM person SELECT FROM person WHERE
nameSmith
Wrapper
Query Translator
Result Translator
find -all find -n Smith
Source
74Rapid Query Translation Using Templates and
Actions
SELECT FROM person SELECT FROM person WHERE
nameSmith
SELECT FROM person emit find -all
SELECT FROM person WHERE nameN emit
find -n N
Template Interpreter
Result Translator
find -all find -n Smith
Source
75Description of Infinite Sets of Supported Queries
- uses recursive nonterminals
- Example
- job description contains word w1 and word w2 and
... - SELECT subset(person) FROM person WHERE
\CJob \CJob job LIKE W AND \CJob \CJob
TRUE
76Overview
- Semistructured-Data Representation
- Mediator Generation
- Wrapper Generation
- Capabilities-Based Rewriting
77Capabilities-Based Rewriter in Mediator
Architecture
Query
logical datamerge program
Query Rewriter
Mediator Specification
Capabilities- Based Rewriter
supported plans
Cost-Based Optimizer
optimal plan
Datamerge Engine
Wrapper
Supported Queries Description
Wrapper
Supported Queries Description
78Capabilities-Based Rewriter Finds Supported Plans
SELECT FROM A WHERE salarygt65000
Supported Queries
SELECT FROM A
79Capabilities-Based Rewriter Finds Most-Selective
Supported Plans
SELECT FROM B WHERE salarygt65000
Supported Queries
SELECT FROM B WHERE salary gt65000
SELECT FROM B
80Capabilities-Based Rewriter Architecture
Query
Query Capabilities Description
Component SubQuery Discovery
Component SubQueries
Plan Construction
Plans (not fully optimized)
Plan Refinement
Algebraically optimal plans
81What TSIMMIS Achieved
- system for integration of heterogeneous sources
- challenges and solutions
- semistructured data incomplete schema knowledge
- appropriate specification language and query
processing algorithms - limited and different query capabilities
- query translation algorithm
- capabilities-based query rewriting algorithm
82Overview
- TSIMMIS goals, technical challenges, and
solutions - Insufficiencies of the TSIMMIS framework
- Going forward
83Insufficiencies of the TSIMMIS framework
- OEM was really unstructured data
- some loose and partial schematic info may pay off
tremendously - too databasy user/mediator/source interaction
84Overview
- TSIMMIS goals, technical challenges, and
solutions - Insufficiencies of the TSIMMIS framework
- Going forward
85Web emerges as a Distributed DB and XML as its
Data Model
XMAS Query Language
Also export 1. Schemas Metadata (XML-Data,
RDF,) 2. Description of supported queries
XML View Document(s)
XML View Document(s)
XML View Document(s)
Data Source
Wrapper
Native XML Database
Legacy Source
86Definition of Integrated Views
Integrated XML View
View Definition in XMAS
Mediator
XML View Document(s)
XML View Document(s)
XML View Document(s)
Data Source
Data Source
Data Source
87Non-Materialized Views in the MIX mediator system
Blended Browsing Querying (BBQ) GUI
Application
XMAS query
XML document
DOM for Virtual XML Docs
Integrated View DTD
View Definition in XMAS
MIX Mediator
DTD Inference
Query Processor
Source DTD
XML Source
XML Source
88Application
XML Document Fragments
Blended Browsing Querying (BBQ) GUI
DOM (VXD) Client API
XMAS Query
View DTD
MIX Mediator
XMAS Mediator View Definition
Resolution
Unfolded Query
DTD Inference
Simplification
Translation to Algebra
Optimization
DTD
Execution
XMAS Query
XML Document Fragments
XML Source 1
RDB2XML Wrapper
XML Source 2
RDB