Information Integration - PowerPoint PPT Presentation

About This Presentation
Title:

Information Integration

Description:

Federated Databases: the sources are independent, but one source can ... Wykorzystywany przy projekcie Merlin (MQS) i Lorel (QL) 38. Object Exchange Model (cd) ... – PowerPoint PPT presentation

Number of Views:31
Avg rating:3.0/5.0
Slides: 89
Provided by: yanni3
Category:

less

Transcript and Presenter's Notes

Title: Information Integration


1
  • Information Integration

2
Information Resides on Heterogeneous Information
Sources
Personal database
Excel
WWW
Flat File
  • different interfaces
  • different data representations
  • redundant and conflicting information

3
Modes of Information Integration
  • Federated Databases the sources are independent,
    but one source can call on others to supply
    information
  • Data warehouse copies of data from several
    sources are stored in a single database, called a
    data warehouse. The data stored at the warehouse
    is first processed in some way before storage
    e.g. data may be filtered, and relations may be
    joined or aggregated. As the data is copied from
    the sources, it may need to be transformed in
    certain ways to make all data conform to the
    schema at the data warehouse

4
Modes of Information Integration
  • Mediation a mediator is a software component
    that supports a virtual database, which the user
    may query as if it were materialized (physically
    constructed like a warehouse). The mediator store
    no data of its own. Rather, it translates the
    users query into one or more queries to its
    sources. The mediator then synthesizes the answer
    to the users query from the responses of those
    sources, and returns the answer to the user

5
Problems of Information Integration
  • Example
  • The AAAI Automobile Co. has 1000 dealers each
    of which maintains a database of their cars in
    stock. AAAI wants to create an integrated
    database containing the information of all 1000
    sources. The integrated database will help
    dealers locate a particular model if they dont
    have one in stock. It also can be used by
    corporate analysts to predict the market and
    adjust production to provide the model most
    likely to sell

6
Problems of Information Integration
  • The 1000 dealers do not all use the same database
    schema
  • Cars (serialNo, model, color, autoTrans,
    cdPlayer, ...)
  • or
  • Autos (serialNo, model, color)
  • Options (serialNo, option)

7
Problems of Information Integration
  • Schema difference
  • Different equivalent names
  • Data type differences numbers may be represented
    by character strings of varying length at one
    source and fixed length at another
  • Value differences the same concept may be
    represented by different constants at different
    sources (BLACK, BL, 100, etc)
  • Semantic differences Terms can be given
    different interpretations at different sources
    (Cars includes trucks or not)
  • Missing values a source may not record
    information of a type that all of the other
    sources provide

8
Goal System Providing Integrated View of
Heterogeneous Data
Integration System
Personal database
Excel
WWW
Flat File
  • collects and combines information
  • provides integrated view, uniform user interface

9
The Data Warehousing Approach to Integration
Client
Stored Integrated View
Mediator
Wrapper
Wrapper
Excel
Flat File
10
The Data Warehousing Approach to Integration
  • Data from several sources is extracted and
    combined into a global schema
  • The data is stored at the warehouse which looks
    like an ordinary database
  • There are three approaches to maintaining the
    data in the data warehouse
  • off-line reconstruction of the whole data
    warehouse
  • the data warehouse is updated periodically based
    on the changes made to the original data sources
  • the data warehouse is updated immediately

11
The Data Warehousing Approach to Integration
  • Example
  • Suppose that there are two dealers in the
    system and that they use the schemas
  • Cars (serialNo, model,color,autoTrans, cdPlayer,
    ...)
  • and
  • Autos (serialNo,model,color)
  • Options (serialNo,option)
  • Assume a data warehouse with the schema
  • AutoWhse(serialNo,model,color,autoTrans, dealer)

12
The Data Warehousing Approach to Integration
  • The software to extracts data from the dealers
    databases and populates the global schema can be
    written as SQL-queries. The query for the first
    dealer
  • insert into AutoWhse(serialNo,model,color,autoTran
    s, dealer)
  • select serialNo, model, color, autoTrans,
    dealer1
  • from Cars
  • The code for the second dealer is more complex
    since we have to decide whether or not a given
    car has an automatic transmission.

13
The Data Warehousing Approach to Integration
  • insert into AutoWhse(serialNo,model,color,autoTran
    s, dealer)
  • select serialNo, model, color, yes, dealer2
  • from Autos, Options
  • where Autos.serialNoOptions.serialNo and
  • optionautoTrans
  • insert into AutoWhse(serialNo,model,color,autoTran
    s, dealer)
  • select serialNo, model, color, no dealer2
  • from Autos
  • where not exists ( select from Options
  • where serialNoAutos.serialNo and
  • optionautoTrans)

14
The Wrapper and Mediator Architecture
Client
Common Data Model
portfolios for each company
Mediator
business reports
stock market prices
Wrapper
Wrapper
Excel
Flat File
15
The Wrapper and Mediator Architecture
  • A mediator supports a virtual view, or
    collection of views, that integrates several
    sources in much the same way that the
    materialized relation(s) in a data warehouse
    integrate sources.
  • The mediator doesnt store any data lt lt lt
  • Example
  • Let us consider the same scenario. The mediator
    integrates the same two data sources into a view
    that is a single relation with the schema
  • AutoMed(serialNo,model,color,autoTrans, dealer)

16
The Wrapper and Mediator Architecture
  • Assume the user asks the mediator about the red
    cars
  • select serialNo, model from AutosMed
  • where color red
  • The mediator forward the same query to each of
    the two wrappers
  • (1) select serialNo, model from Cars where
    colorred
  • (2) select serialNo, model from Autos where
    colorred
  • The mediator can take the union of answers and
    return the result to the user.

17
The Lazy Integration Approach
Query Decomposition, Translation and Result
Fusion
Client
IBM portfolio
Mediator
IBM price
IBM related reports (in common model)
Wrapper
Wrapper
IBM related reports
Excel
Flat File
18
Wrappers in Mediator-Based Systems
  • In a data warehouse system, the source extractors
    consist of
  • one or more queries built-in that are executed at
    the source to produce data for the data warehouse
  • communication mechanisms, so that wrapper can
  • pass ad-hoc queries to the source
  • receive responses from the source
  • pass information to the warehouse
  • Mediator systems require more complex wrappers -
    the wrapper must be able to accept a variety of
    queries from the mediator and translate any of
    them to the terms of the source.

19
Wrappers in Mediator-Based Systems
  • A systematic way to design a wrapper that
    connects a mediator to a source is to classify
    the possible queries that the mediator can ask
    into templates, which are queries with parameters
    that represent constants.
  • The mediator can provide the constants, and the
    wrapper executes the query with the given
    constants.
  • T ? S the template T is turned into the source
    query S
  • Example
  • The source of dealer1
  • Cars (serialNo, model,color,autoTrans, cdPlayer,
    ...)

20
Wrappers in Mediator-Based Systems
  • Assume we use the mediator with schema
  • AutoMed(serialNo,model,color,autoTrans, dealer)
  • How the mediator could ask the wrapper for cars
    of a given colorgt
  • The template
  • select from AutoMed where color c ?
  • select serialNo, model color, autoTrans,
    dealer1
  • from Cars where colorc

21
Wrappers in Mediator-Based Systems
  • The wrapper could have another template that
    specified the parameter m representing a model
  • there would be 2N templates for N attributes
  • the number of templates could grow unreasonably
    large.

22
Wrapper Generators
  • The template defining a wrapper must be turned
    into code for the wrapper itself - the software
    that creates the wrapper is called a wrapper
    generator
  • The wrapper generator creates a table that holds
    the various query patterns contained in the
    templates, and the source queries that are
    associated with each.
  • A driver is used in each wrapper. The task of the
    driver is to
  • accept a query from the mediator
  • search the table for a template that match the
    query
  • the source query is sent to the source using a
    communication mechanism
  • the response is processed by the wrapper, if
    necessary, and then returned to the mediator

23
Wrappers Mediators from High-Level
Specifications
Client
Mediator Specification Interpreter
Mediator
Mediator Specification
Wrapper Generator
Wrapper
Wrapper
Wrapper Specification
Source
Source
24
Filters
  • Complex template
  • select from AutoMed
  • where color c and model m ?
  • select serialNo, model color, autoTrans,
    dealer1
  • from Cars where colorc and model m
  • Wrapper filter approach - if the wrapper has a
    template that returns a superset of what the
    query wants then it is possible to filter the
    result at the wrapper
  • The decision whether a mediator asks for a subset
    of what the pattern of some wrapper template
    returns is a hard problem lt lt lt ltgtgtgtgt

25
Filters
  • Example
  • Given the template
  • select from AutoMed where color c
  • The mediator needs to find blue Gobi model car
  • select from AutoMed where color blue and
    modelGobi
  • use the template with cblue to find all blue
    cars
  • store the result in the temporary relation Temp
  • select from Temp the Gobis and return the result

26
Other Wrapper Operations
  • It is possible to transform the data at the
    wrapper in different ways
  • The mediator is asked to find dealers and models
    such that the dealer has two red cars, of the
    same model, one with and one without automatic
    transmission. Suppose we have only one template
    as before.
  • Select A1.model A1.dealer
  • from AutoMed A1 AutoMed A2
  • where A1.modelA2.model and A1.colorred and
  • A2.colorred and A1.autoTransno and
  • A2.autoTransyes

27
Other Wrapper Operations
  • It is possible to answer the query by first
    obtaining from the Dealers 1 source a relation
    with all the red cars (use the original template)
    - RedAutos relation
  • select distinct A1.model A1.dealer
  • from RedAutos A1, RedAutos A2
  • where A1.modelA2.model and
  • A1.autoTransno and
  • A2.autoTransyes

28
Challenge Sources Without a Well-Structured
Schema
Examples
  • semistructured
  • irregular
  • deeply nested
  • cross-referenced
  • incomplete schema knowledge
  • autonomous
  • dynamic
  • HTML pages
  • SGML documents
  • genome data
  • chemical structures
  • bibliographic information
  • results of the integration process

29
Challenge Different and Limited Source
Capabilities
Client
retrieve IBM data
Mediator (U A B)
retrieve IBM data
retrieve IBM data
Wrapper (A)
Wrapper (B)
30
Mediator has to Adapt to Query Capabilities of
Sources
Client
retrieve IBM data
Mediator (U A B)
retrieve IBM data
retrieve IBM data
retrieve everything
(A) does not allow selection
Wrapper (A)
Wrapper (B)
31
Part B
  • Semistructured Data Representation
  • Mediator Generation
  • Wrapper Generation
  • Capabilities-Based Rewriting

32
Representation of Semistructured Information
using OEM
semantic object-id
label
Set Value
lthttp//www/doe, faculty, f1,l1,r1gt
ltf1, first_name, Johngt
ltl1, last_name, Doegt ltr1, rank,
professorgt
Atomic Value
structural object-id
33
Object Exchange Model - Goals
  • Easy to read
  • Easy to edit
  • Easy to generate or parse by a program
  • Consistency with Stanfords other projects
    (developed with the TSIMMIS)
  • Possibility of extensions in the future

34
Graph Representation of OEM Data
lthttp//www/doe, faculty, f1,l1,r1gt
ltf1, first_name, Johngt
ltl1, last_name, Doegt ltr1, rank,
professorgt
http//www/doe
faculty first_name John
last_name Doe rank
professor
35
OEM Structures Represent Arbitrary Labeled Graphs
http//www/smith
faculty name Mary Smith project
Air DB paper author
name John Doe author
name Mary Smith title Thin Air
DB
http//www/doe
faculty first_name John
last_name Doe rank professor
36
Reprezentacja danych semistrukturalnych
  • Object Exchange Model
  • ACeDB
  • XML
  • Moga byc wykorzystywane w warstwie pomiedzy
    mediatorem a wrapperami.

37
Object Exchange Model
  • Zdefiniowany przy okazji budowy systemu Tsimmis
    sluzacego do integracji heterogenicznych zródel
    danych.
  • Wykorzystywany przy projekcie Merlin (MQS) i
    Lorel (QL)

38
Object Exchange Model (cd)
  • Wezel OEM sklada sie z czterech pól
  • Object-ID jest wykorzystywany do unikalnej
    identyfikacji okreslonego wezla OEM
  • Label jest ciagiem znaków który opisuje to co
    wezel OEM reprezentuje
  • Type jest typem danych wartosci wezla. (atomowy
    lub kolekcja)
  • Value moze byc albo wartoscia atomowa albo
    referencja do kolekcji wezlów OEM
  • Jest zwykle reprezentowany jako ltObject-ID
    Label Type Valuegt

39
Object Exchange Model przyklad
  • ltoid1 Notowanie Set oid11 oid12gt
  • ltoid11 NrNotowania String 4004gt
  • ltoid12 Rezultaty Set oid121 oid122gt
  • ltoid121 Miejsce1 Set oid1211gt
  • ltoid1211 Utwor String metropolisgt
  • ltoid122 Miejsce2 Set oid1221gt
  • ltoid1221 Utwor String moneygt
  • ltoid2 Notowanie Set oid21 oid22gt
  • ltoid21 NrNotowania String 4005gt
  • ltoid22 Rezultaty Set oid221 oid222gt
  • ltoid221 Miejsce1 Set oid2211gt
  • ltoid2211 Utwor String learning to flygt

40
Object Exchange Model cechy
  • Reprezentowany jako graf z obiektami na
    wierzcholkach i etykietami na krawedziach.
  • Wszystkie wystapienia sa obiektami.
  • Kazdy obiekt ma swój unikalny identyfikator
    (oid).
  • Rozrózniane sa dwa typy obiektów atomowe i
    zlozone.

41
Object Exchange Model cechy (cd)
  • W OEM wystepuja tzw. nazwy (ang. names), które
    moga byc traktowane jako aliasy do obiektów
    wewnatrz bazy danych.
  • Nazwa sluzy jako wskaznik do bazy danych.
  • Kazdy obiekt w bazie danych powinien byc
    osiagalny za pomoca nazwy.

42
ACeDB
  • ACeDB (A C. elegans Database) byla rozwijana jako
    baza danych informacji genetycznej organizmów.
  • Rozwijana od 1989.
  • Posiada swój wlasny jezyk zapytan AQL - Acedb
    Query Language
  • http//www.acedb.org/

43
ACeDB cechy
  • Schemat i dane moga byc traktowane jako drzewo z
    etykietowanymi krawedziami.
  • Krawedzie moga byc etykietowane jakimkolwiek
    typem podstawowym. (int, Notowanie)np. array
    Int unique Int
  • Z okreslonego wierzcholka drzewa danych moze
    wychodzic wiele galezi.
  • ACeDB pozwala na to aby jakakolwiek etykieta
    rózna od etykiety glównej byla pominieta.
  • Identyfikatory obiektów wprowadzane sa przez
    uzytkownika.

44
ACeDB cechy (cd)
  • ACeDB wymaga schematu
  • Mimo to, fakt, ze dane moga byc pomijane oraz to,
    ze etykietowane dane sa traktowane jednolicie z
    innymi prostymi typami powoduje, ze jest on
    bardzo bliski semistrukturalnemu modelowi danych.

45
ACeDB przyklad
  • gtBook title UNIQUE Text
  • authors Text
  • chapters int UNIQUE Text
  • language UNIQUE english
  • french
  • other
  • date UNIQUE month Int
  • year Int
  • hock2 title Computer Simulation Using
    Particles
  • authors Hockney
  • Eastwood
  • chapters 1 Computer Experiments
  • 2 A One-Dimensional Model
  • ...
  • language english

46
XML
  • Znany i lubiany

47
Róznice pomiedzy XML a OEM
  • XML jest uporzadkowany.
  • Etykiety w OEM sa wykorzystywane tylko jako punkt
    odniesienia oraz do oznaczania zaleznosci
    pomiedzy obiektami. W XML kazdy element nie
    bedacy ciagiem tekstowym zawiera identyfikujacy
    go znacznik etykiete.
  • XML nie wspiera bezposrednio struktury grafu.

48
Overview
  • Semistructured Data Representation
  • Mediator Generation
  • Example of mediator specification
  • Language expressiveness
  • Implementation and performance
  • Wrapper Generation
  • Capabilities-Based Rewriting

49
Merge Information Relating to a Faculty
faculty name John Doe rank
professor birthday April 1
papers ...
  • Schema Integration
  • Info fusion

s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
50
Mediator Specification Example
faculty name John Doe rank
professor birthday April 1
papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
51
Mediator Specification Example Semantics of Rule
Bodies
faculty name John Doe rank
professor birthday April 1
papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
52
Mediator Specification Example Semantics of Rule
Heads
John Doe faculty name John
Doe rank professor birthday
April 1 papers
...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s2
s1
faculty name John Doe rank
professor papers
...
person name John Doe birthday
April 1
53
Incrementally Add to Semantically Identified
Object
John Doe faculty name John
Doe rank professor birthday
April 1 papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s1
s2
faculty name John Doe rank
professor papers ...

person name John Doe birthday
April 1
54
Irregularities Incomplete Schema Knowledge
John Doe
faculty name John Doe rank
professor birthday April 1
papers faculty name Mary Smith
project Air DB
Mary Smith
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1
s1
faculty name John Doe rank
professor papers faculty name
Mary Smith project Air DB
s2
person name John Doe birthday
April 1
55
Second Rule Attaches More Subobjects to View
Objects
John Doe faculty name John
Doe rank professor birthday
April 1 papers ...
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
s1
s2
faculty name John Doe rank
professor papers ...

person name John Doe birthday
April 1
56
The OEM object structure of the cs wrapper
  • lte1, employee, set, f1,l1,t1,rep1gt
  • ltf1, first name, string, 'Joe'gt
  • ltl1, last name, string, 'Chung'gt
  • ltt1, title, string, 'professor'gt
  • ltrep1, reports to, string, 'John Hennessy'gt
  • lte2, employee, set, f2,l2,t2gt
  • ltf2, first name, string, 'John'gt
  • ltl2, last name, string, 'Hennessy'gt
  • ltt2, title, string, 'chairman'gt
  • . . .
  • lts3, student, set, f3,l3,y3gt
  • ltf3, first name, string, 'Pierre'gt
  • ltl3, last name, string, 'Huyn'gt
  • lty3, year, integer, 3gt
  • . . .

57
The OEM object structure of whois
  • ltp1, person, set, n1,d1,rel1,elm1gt
  • ltn1, name, string, 'Joe Chung'gt
  • ltd1, dept, string, 'CS'gt
  • ltrel1, relation, string, 'employee'gt
  • ltelm1, e_mail, string, 'chung_at_cs'gt
  • ltp2, person, set, n2,d2,rel2gt
  • ltn2, name, string, 'Nick Naive'gt
  • ltd2, dept, string, 'CS'gt
  • ltrel2, relation, string, 'student'gt
  • lty2, year, integer, 3gt
  • ...

58
Object exported by med
  • ltcp1, cs_person, mn1,mrel1,t1,rep1,elm1gt
  • ltmn1, name, string, 'Joe Chung'gt
  • ltmrel1, rel, string, 'employee'gt
  • ltt1, title, string, 'professor'gt
  • ltrep1, reports_to, string, 'John Hennessy'gt
  • ltelm1, e_mail, string, 'chung_at_cs'gt

59
Problemy wystepujace przy tworzeniu Specyfikacji
Mediatora
  • Schema-domain mismatch
  • Schematic discrepancy
  • Schema Evolution
  • Structure Irregularities

60
(MSL) Rules
  • ltcs_person ltname Ngt ltrel Rgt Rest1 Rest2gt
  • ltperson ltname Ngt ltdept 'CS'gt
  • ltrelation Rgt Rest1gt_at_whois
  • AND decomp(N, LN, FN)
  • AND ltR ltfirst name FNgt
  • ltlast name LNgt Rest2gt_at_cs
  • External
  • decomp(string,string,string)(bound,free,free)
  • impl by name_to_lnfn
  • decomp(string,string,string)(free,bound,bound)
  • impl by lnfn_to_name

61
(MSL) Rules
  • ltcs_person ltname Ngt ltrel Rgtgt
  • ltperson ltname Ngt ltdept 'CS'gt
  • ltrelation Rgtgt_at_whois
  • ltcs_person ltname Ngt ltrel Rgtgt
  • - decomp(N, LN, FN)
  • AND ltR ltfirst name FNgt
  • ltlast name LNgtgt_at_cs
  • ltcs_person ltname Ngt ltrel Rgt lttitle Tgtgt
  • - ???

62
(MSL) Rules
  • ltcs_person ltname Ngt ltrel Rgt lte_mail Egtgt
  • - ???
  • ??? - ??? ltR ltfirst_name FNgt
  • ltlast_name LNgt lttitle Egt
  • ltreports_to Sgtgt_at_cs
  • Rewriting
  • ltcs_person ltname Ngt ltrel Rgt lte_mail Egt lttitle Egt
  • ltreports_to Sgtgt
  • - ???

63
Language Expressiveness
  • Information fusion problems solved by MSL
  • Irregularities
  • Incomplete knowledge of source structure
  • Transformation of cross-referenced structures
  • Inconsistent and redundant data
  • Use of arbitrary matching criteria
  • Theoretical analysis of expressiveness
  • Consider the relational representation of OEM
    graphs. Then MSL is equivalent to SQL special
    form of transitive closure

64
Inconsistent and Redundant Information
John Doe faculty
name John Doe rank associate
rank assistant
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2 AND NOT ltfaculty ltname Ngt ltL V1gtgt_at_s1
s1
s2
faculty name
John Doe rank associate
person name
John Doe rank assistant
65
Overview
  • Semistructured Data Representation
  • Mediator Generation
  • Example of mediator specification
  • Language expressiveness
  • Implementation and performance
  • Wrapper Generation
  • Capabilities-Based Rewriting

66
Mediator Specification Interpreter Architecture
Result
Query
Mediator Specification
Query Rewriter
logical datamerge program
Cost-Based Optimizer
plan
Datamerge Engine
Queries to Wrappers
Results
67
Query Rewriting When Known Origins of Information
  • ltN faculty ltsalary Sgtgt - - ltfaculty
    ltname Ngt ltsalary Sgtgt_at_s1
  • ltN faculty lt rank R gtgt - ltperson ltname
    Ngt ltrank Rgtgt_at_s2
  • ltwell-paid ltname Ngt ltsalary Xgtgt - ltN
    faculty ltsalary Xgt ltrank assistantgtgt AND
    Xgt65000

68
Query Rewriter Pushes Conditions to Sources
  • ltN faculty ltsalary Sgtgt - - ltfaculty
    ltname Ngt ltsalary Sgtgt_at_s1 ltN faculty lt
    rank R gtgt - ltperson ltname Ngt ltrank
    Rgtgt_at_s2
  • ltwell-paid ltname Ngt ltsalary Xgtgt - ltN
    faculty ltsalary Xgt ltrank assistantgtgt AND
    Xgt65000
  • logical datamerge program ltwell-paid
    ltname Ngt ltsalary Xgtgt - (ltfaculty
    ltname Ngt ltsalary Xgtgt AND Xgt65000)_at_s1
    AND ltperson ltname Ngt ltrank assistantgtgt_at_s2

69
Passing Bindings Local Join Plans
Passing Bindings
s1
s2
ltsalary Xgt - ltfaculty ltname Ngt
ltsalary Xgtgt AND Xgt65000
ltname Ngt - ltperson ltrank assistantgtgt
Local Join
s1
s2
lta lts Xgt ltn Ngtgt- ltfaculty ltname Ngt
ltsalary Xgtgt AND Xgt65000
N
ltname Ngt - ltperson ltrank assistantgtgt
70
Query Decomposition When Unknown Origins of
Information
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
ltX faculty ltS Ygtgt - ltX faculty ltbirthday
1/20gt ltS Ygtgt
71
Plan Considers All Possible Sources of birthday
ltN faculty ltL Vgtgt - ltfaculty ltname Ngt ltL
Vgtgt_at_s1 ltN faculty ltL Vgtgt - ltperson ltname Ngt
ltL Vgtgt_at_s2
ltX faculty ltS Ygtgt - ltX faculty ltbirthday
1/20gt ltS Ygtgt
s2
s1
name
birthday
name
birthday
72
Overview
  • Semistructured-Data Representation
  • Mediator Generation
  • Wrapper Generation
  • Capabilities-Based Rewriting

73
Query Translation in Wrappers
SELECT FROM person SELECT FROM person WHERE
nameSmith
Wrapper
Query Translator
Result Translator
find -all find -n Smith
Source
74
Rapid Query Translation Using Templates and
Actions
SELECT FROM person SELECT FROM person WHERE
nameSmith
SELECT FROM person emit find -all
SELECT FROM person WHERE nameN emit
find -n N
Template Interpreter
Result Translator
find -all find -n Smith
Source
75
Description of Infinite Sets of Supported Queries
  • uses recursive nonterminals
  • Example
  • job description contains word w1 and word w2 and
    ...
  • SELECT subset(person) FROM person WHERE
    \CJob \CJob job LIKE W AND \CJob \CJob
    TRUE

76
Overview
  • Semistructured-Data Representation
  • Mediator Generation
  • Wrapper Generation
  • Capabilities-Based Rewriting

77
Capabilities-Based Rewriter in Mediator
Architecture
Query

logical datamerge program
Query Rewriter
Mediator Specification
Capabilities- Based Rewriter
supported plans
Cost-Based Optimizer
optimal plan
Datamerge Engine
Wrapper
Supported Queries Description
Wrapper
Supported Queries Description
78
Capabilities-Based Rewriter Finds Supported Plans
SELECT FROM A WHERE salarygt65000
Supported Queries
SELECT FROM A
79
Capabilities-Based Rewriter Finds Most-Selective
Supported Plans
SELECT FROM B WHERE salarygt65000
Supported Queries
SELECT FROM B WHERE salary gt65000
SELECT FROM B
80
Capabilities-Based Rewriter Architecture
Query
Query Capabilities Description
Component SubQuery Discovery
Component SubQueries
Plan Construction
Plans (not fully optimized)
Plan Refinement
Algebraically optimal plans
81
What TSIMMIS Achieved
  • system for integration of heterogeneous sources
  • challenges and solutions
  • semistructured data incomplete schema knowledge
  • appropriate specification language and query
    processing algorithms
  • limited and different query capabilities
  • query translation algorithm
  • capabilities-based query rewriting algorithm

82
Overview
  • TSIMMIS goals, technical challenges, and
    solutions
  • Insufficiencies of the TSIMMIS framework
  • Going forward

83
Insufficiencies of the TSIMMIS framework
  • OEM was really unstructured data
  • some loose and partial schematic info may pay off
    tremendously
  • too databasy user/mediator/source interaction

84
Overview
  • TSIMMIS goals, technical challenges, and
    solutions
  • Insufficiencies of the TSIMMIS framework
  • Going forward

85
Web emerges as a Distributed DB and XML as its
Data Model
XMAS Query Language
Also export 1. Schemas Metadata (XML-Data,
RDF,) 2. Description of supported queries
XML View Document(s)
XML View Document(s)
XML View Document(s)
Data Source
Wrapper
Native XML Database
Legacy Source
86
Definition of Integrated Views
Integrated XML View
View Definition in XMAS
Mediator
XML View Document(s)
XML View Document(s)
XML View Document(s)
Data Source
Data Source
Data Source
87
Non-Materialized Views in the MIX mediator system
Blended Browsing Querying (BBQ) GUI
Application
XMAS query
XML document
DOM for Virtual XML Docs
Integrated View DTD
View Definition in XMAS
MIX Mediator
DTD Inference
Query Processor
Source DTD
XML Source
XML Source
88
Application
XML Document Fragments
Blended Browsing Querying (BBQ) GUI
DOM (VXD) Client API
XMAS Query
View DTD
MIX Mediator
XMAS Mediator View Definition
Resolution
Unfolded Query
DTD Inference
Simplification
Translation to Algebra
Optimization
DTD
Execution
XMAS Query
XML Document Fragments
XML Source 1
RDB2XML Wrapper
XML Source 2
RDB
Write a Comment
User Comments (0)
About PowerShow.com