Bridging Different Data Representations - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Bridging Different Data Representations

Description:

Some content may be courtesy of Susan Davidson, Dan Suciu, & Raghu ... PieceOfArt(ID, Artist, Subject, Title, TypeOfArt) MotionPicture(ID, Title, Year) ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 29
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Bridging Different Data Representations


1
Bridging Different Data Representations
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • October 28, 2003

Some slide content may be courtesy of Susan
Davidson, Dan Suciu, Raghu Ramakrishnan
2
Administrivia
  • HW4 and midterms returned today
  • You all did great!
  • Median on midterm was 75 of 80 (mean was 73.4)
  • Remember to turn in your project plan on
    Thursday!
  • Should have a plan for how to break down the
    project tasks among your group
  • Should have some milestones that get you towards
    a completed project
  • Ill ask for a status report in a couple of weeks

3
A Problem
  • Weve seen that even with normalization and the
    same needs, different people will arrive at
    different schemas
  • In fact, most people also have different needs!
  • Often people build databases in isolation, then
    want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for
    others to use
  • This is the goal of data integration tie
    together different sources, controlled by many
    people, under a common schema

4
Building a Data Integration System
  • Create a middleware mediator or data
    integration system over the sources
  • Can be warehoused (a data warehouse) or virtual
  • Presents a uniform query interface and schema
  • Abstracts away multitude of sources consults
    them for relevant data
  • Unifies different source data formats (and
    possibly schemas)
  • Sources are generally autonomous, not designed to
    be integrated
  • Sources may be local DBs or remote web
    sources/services
  • Sources may require certain input to return
    output (e.g., web forms) binding patterns
    describe these

5
Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
6
Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
7
Challenges of Mapping Schemas
  • In a perfect world, it would be easy to match up
    items from one schema with another
  • Every table would have a similar table in the
    other schema
  • Every attribute would have an identical attribute
    in the other schema
  • Every value would clearly map to a value in the
    other schema
  • Real world as with human languages, things
    dont map clearly!
  • May have different numbers of tables different
    decompositions
  • Metadata in one relation may be data in another
  • Values may not exactly correspond
  • It may be unclear whether a value is the same

8
A Few Simple Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

9
How Do We Relate Schemas?
  • General approach is to use a view to define
    relations in one schema, given data in the other
    schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

10
Mapping Our Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
???
11
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Spawned Tukwila at Washington, and eventually a
    company as well

12
TSIMMIS and Information Manifold
  • Focus Web-based queryable sources
  • CGI forms, online databases, maybe a few RDBMSs
  • Each needs to be mapped into the system not as
    easy as web search but the benefits are
    significant vs. query engines
  • A few parenthetical notes
  • Part of a slew of works on wrappers, source
    profiling, etc.
  • The creation of mappings can be partly automated
    systems such as LSD, Cupid, Clio, do this
  • Today most people look at integrating large
    enterprises (thats where the is!) Nimble,
    BEA Liquid Data, Enosys, IBM

13
TSIMMIS
  • The Stanford-IBM Manager of Multiple Information
    Sources or, a Yiddish stew
  • An instance of a global-as-view mediation
    system
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years

14
Semi-structured Data OEM
  • Observation given a particular schema, its
    attributes may be unavailable from certain
    sources inherent irregularity
  • Proposal Object Exchange Model, OEM
  • OID ltlabel, type, valuegt
  • How does it relate to XML?
  • What problems does OEM solve, and not solve, in
    a heterogeneous system?

15
OEM Example
Show this XML fragment in OEM ltbookgt
ltauthorgtBernsteinlt/authorgt ltauthorgtNewcomerlt/aut
horgt lttitlegtPrinciples of TPlt/titlegtlt/bookgt ltbo
okgt ltauthorgtChamberlinlt/authorgt lttitlegtDB2
UDBlt/titlegtlt/bookgt
16
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select bookwhere book.author DB2 UDB and
    book.title Chamberlin
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for b in document(my-source)/bookwhere
    b/title/text DB2 UDB and b/author/text()
    Chamberlinreturn b

17
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

18
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B ltbook ltauthor Xgtgt // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • define function GetBook(X AS xsdstring) as book
    for x in sql(select from book where
    author x )return ltbookgtxltauthorgtxlt/au
    thorgtlt/bookgt

19
How to Answer the Query
  • Given our query
  • for b in document(my-source)/bookwhere
    b/title/text() DB2 UDB and b/author/text()
    Chamberlinreturn b
  • We want to find all wrapper definitions that
  • Either contain output enough information that we
    can evaluate all of our conditions over the
    output
  • Or have already tested the conditions for us!

20
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(select from book where
    author x )return ltbookgtbltauthorgtxlt/au
    thorgtlt/bookgt
  • for b in document(my-source)/bookwhere
    b/title/text() DB2 UDB and b/author/text()
    Chamberlinreturn b

21
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • define function GetBook(x AS xsdstring) as book
    for b in sql(select from book where
    author x )return ltbookgtbltauthorgtxlt/au
    thorgtlt/bookgt
  • where x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

22
The Final Step Unfolding
  • where x Chamberlinfor b in for b in
    sql(select from book where author x
    )return ltbookgtbltauthorgtxlt/authorgtlt/bookgt
    /bookwhere b/title/text() DB2 UDB return
    b

23
What Is the Answer?
  • Given schema book(author, year, title) and
    datalog rules defining an instance
  • book(Chamberlin, 1992, DB2 UDB)
  • book(Chamberlin, 1995, DB2/CS)

24
TSIMMIS
  • Early adopter of semistructured data
  • Can support irregular structure and missing
    attributes
  • Can support data from many different sources
  • Doesnt fully solve heterogeneity problem,
    though!
  • Simple algorithms for view unfolding
  • Easily can be composed in a hierarchy of mediators

25
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • How do we express these? (Important for
    performance!)
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

26
The Information Manifold
  • Defines the mediated schema independently of the
    sources!
  • Local-as-view instead of global-as-view
  • Guarantees soundness and completeness of answers
  • Allows us to specify information about data
    sources
  • Focuses on relations (with OO extensions),
    datalog

27
Observations of Levy et al.
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

28
The Local-as-View Model
  • If we look at the Information Manifold model
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema
  • Thursday well see what answering queries using
    views is all about
Write a Comment
User Comments (0)
About PowerShow.com