Bridging Different Data Representations - PowerPoint PPT Presentation

About This Presentation
Title:

Bridging Different Data Representations

Description:

Conjunctive queries are possible to reason about statically ... mappings can be partly automated systems such as LSD, Cupid, Clio, ... do this ... – PowerPoint PPT presentation

Number of Views:159
Avg rating:3.0/5.0
Slides: 37
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Bridging Different Data Representations


1
Bridging Different Data Representations
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • November 11, 2004

2
A Special Type of Query Conjunctive Queries
  • A single Datalog rule with no Ç, , 8 can
    express select, project, and join a conjunctive
    query
  • Conjunctive queries are possible to reason about
    statically
  • (Note that we can write CQs in other languages,
    e.g., SQL!)
  • We know how to minimize conjunctive queries
  • An important simplification that cant be done
    for general SQL
  • We can test whether one conjunctive querys
    answers always contain another conjunctive
    querys answers (for ANY instance)
  • Why might this be useful?

3
Example of Containment
  • Suppose we have two queriesq1(S,C) -
    Student(S, N), Takes(S, C), Course(C, X),
    inCIS(C), Course(C, DB Info
    Systems)q2(S,C) - Student(S, N), Takes(S, C),
    Course(C, X)
  • Intuitively, q1 must contain the same or fewer
    answers vs. q2
  • It has all of the same conditions, except one
    extra conjunction (i.e., its more restricted)
  • Theres no union or any other way it can add more
    data
  • We can say that q2 contains q1 because this holds
    for any instance of our DB Student, Takes,
    Course

4
Wrapping up Datalog
  • Weve seen a new language, Datalog
  • Its basically a glorified DRC with a special
    feature, recursion
  • Its much cleaner than SQL for reasoning about
  • But negation (as in the DRC) poses some
    challenges
  • Weve seen that a particular kind of query, the
    conjunctive query, is written naturally in
    Datalog
  • Conjunctive queries are possible to reason about
  • We can minimize them, or check containment
  • Conjunctive queries are very commonly used in our
    next problem, data integration

5
A Problem
  • Weve seen that even with normalization and the
    same needs, different people will arrive at
    different schemas
  • In fact, most people also have different needs!
  • Often people build databases in isolation, then
    want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for
    others to use
  • This is the goal of data integration tie
    together different sources, controlled by many
    people, under a common schema

6
Building a Data Integration System
  • Create a middleware mediator or data
    integration system over the sources
  • Can be warehoused (a data warehouse) or virtual
  • Presents a uniform query interface and schema
  • Abstracts away multitude of sources consults
    them for relevant data
  • Unifies different source data formats (and
    possibly schemas)
  • Sources are generally autonomous, not designed to
    be integrated
  • Sources may be local DBs or remote web
    sources/services
  • Sources may require certain input to return
    output (e.g., web forms) binding patterns
    describe these

7
Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
8
Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
9
Challenges of Mapping Schemas
  • In a perfect world, it would be easy to match up
    items from one schema with another
  • Every table would have a similar table in the
    other schema
  • Every attribute would have an identical attribute
    in the other schema
  • Every value would clearly map to a value in the
    other schema
  • Real world as with human languages, things
    dont map clearly!
  • May have different numbers of tables different
    decompositions
  • Metadata in one relation may be data in another
  • Values may not exactly correspond
  • It may be unclear whether a value is the same

10
A Few Simple Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

CustID CustName
1234 Ives, Z.
PennID EmpName
46732 Zachary Ives
11
How Do We Relate Schemas?
  • General approach is to use a view to define
    relations in one schema (typically either the
    mediated schema or the source schema), given data
    in the other schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

12
Mapping Our Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
CustID CustName
1234 Ives, Z.
PennID EmpName
46732 Zachary Ives
T2
Need a concordance table from CustIDs to PennIDs
13
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Other important systems of this vein Kleisli/K2
    _at_ Penn
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Requires a special
  • Spawned Tukwila at Washington, and eventually a
    company as well
  • Led to peer-to-peer integration approaches
    (Piazza, etc.)

14
TSIMMIS and Information Manifold
  • Focus Web-based queryable sources
  • CGI forms, online databases, maybe a few RDBMSs
  • Each needs to be mapped into the system not as
    easy as web search but the benefits are
    significant vs. query engines
  • A few parenthetical notes
  • Part of a slew of works on wrappers, source
    profiling, etc.
  • The creation of mappings can be partly automated
    systems such as LSD, Cupid, Clio, do this
  • Today most people look at integrating large
    enterprises (thats where the is!) Nimble,
    BEA, IBM

15
TSIMMIS
  • The Stanford-IBM Manager of Multiple Information
    Sources or, a Yiddish stew
  • An instance of a global-as-view mediation
    system
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years

16
Semi-structured Data OEM
  • Observation given a particular schema, its
    attributes may be unavailable from certain
    sources inherent irregularity
  • Proposal Object Exchange Model, OEM
  • OID ltlabel, type, valuegt
  • How does it relate to XML?
  • What problems does OEM solve, and not solve, in
    a heterogeneous system?

17
OEM Example
Show this XML fragment in OEM ltbookgt
ltauthorgtBernsteinlt/authorgt ltauthorgtNewcomerlt/aut
horgt lttitlegtPrinciples of TPlt/titlegtlt/bookgt ltbo
okgt ltauthorgtChamberlinlt/authorgt lttitlegtDB2
UDBlt/titlegtlt/bookgt
18
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select bookwhere book.title DB2 UDB and
    book.author Chamberlin
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

19
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

20
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B ltbook ltauthor Xgtgt // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgtb/titleltauthorgtxlt/authorgtlt/bookgt


book
author
title


The union of GetBooks results is unioned with
others to form the view AllData()
21
How to Answer the Query
  • Given our query
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b
  • Find all wrapper definitions that
  • Contain output enough structure to match the
    conditions of the query
  • Or have already tested the conditions for us!

22
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/authorgtlt/bookgt
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

book
author
title


23
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • define function GetBook(x AS xsdstring) as book
    for b in sql(Amazon.DB, select
    from book where author x )return
    ltbookgt b/title ltauthorgtxlt/author
    gtlt/bookgt
  • let x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

book
author
title


24
The Final Step Unfolding
  • let x Chamberlinfor b in ( for b in
    sql(Amazon.com, select from book where
    author x ) return ltbookgt b/title
    ltauthorgtxlt/authorgtlt/bookgt
  • )/bookwhere b/title/text() DB2 UDB
    return b
  • How do we simplify further to get to here?
  • for b in sql(Amazon.com, select from
    book where authorChamberlin)where
    b/title/text() DB2 UDB return b

25
Virtues of TSIMMIS
  • Early adopter of semistructured data, greatly
    predating XML
  • Can support data from many different kinds of
    sources
  • Obviously, doesnt fully solve heterogeneity
    problem
  • Presents a mediated schema that is the union of
    multiple views
  • Query answering based on view unfolding
  • Easily composed in a hierarchy of mediators

26
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • If we ask a query for students at Columbia, dont
    want to bother querying students at Penn
  • How do we express these?
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema

27
An Alternate ApproachThe Information Manifold
(Levy et al.)
  • When you integrate something, you have some
    conceptual model of the integrated domain
  • Define that as a basic frame of reference,
    everything else as a view over it
  • Local as View
  • May have overlapping/incomplete sources
  • Define each source as the subset of a query over
    the mediated schema
  • We can use selection or join predicates to
    specify that a source contains a range of values
  • ComputerBooks() ? Books(Title, , Subj), Subj
    Computers

28
The Local-as-View Model
  • The basic model is the following
  • Local sources are views over the mediated
    schema
  • Sources have the data mediated schema is
    virtual
  • Sources may not have all the data from the domain
    open-world assumption
  • The system must use the sources (views) to answer
    queries over the mediated schema

29
Answering Queries Using Views
  • Assumption conjunctive queries, set semantics
  • Suppose we have a mediated schema author(aID,
    isbn, year), book(isbn, title, publisher)
  • A conjunctive query might be q(a, t, p) -
    author(a, i, _), book(i, t, p), t DB2 UDB
  • Recall intuitions about this class of queries
  • Adding a conjunct to a query removes answers from
    the result but never adds any
  • Any conjunctive query with at least the same
    constraints conjuncts will give valid answers

30
Query Answering
  • Suppose we have the query
  • q(a, t, p) - author(a, i, _), book(i, t, p)
  • and sources
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • s5(a,i) ? author(a, i, _)
  • s6(i,p) ? book(i, t, p)
  • We want to compose the query with the source
    mappings but theyre in the wrong direction!

31
Inverse Rules
  • We can take every mapping and invert it, though
    sometimes we may have insufficient information
  • If
  • s5(a,i) ? author(a, i, _)
  • then we can also infer that
  • author(a, i, ???) ? s5(a,i)
  • But how to handle the absence of the 3rd
    (publisher) attribute?
  • We know that there must be AT LEAST one instance
    of ??? in author for each (a,i) pair
  • So we might simply insert a NULL and define that
    NULL means unknown (as opposed to missing)

32
But NULLs Lose Information
  • Suppose we take these rules and ask for
  • q(a,t) - author(a, i, _), book(i, t, p)
  • If we look at the rule
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • Clearly q(a,t) ? s1(a,t)
  • But if apply our inversion procedure, we get
  • author(a, NULL, NULL) ? s1(a,t)
  • book(NULL, t, p) ? s1(a,t), t 123
  • and theres no way to kow to join author and book
    on NULL!
  • We need a special NULL for each a-t combo so we
    can figure out which as and ts go together

33
The Solution Skolem Functions
  • Skolem functions
  • Conceptual perfect hash functions
  • Each function returns a unique, deterministic
    value for each combination of input values
  • Every function returns a non-overlapping set of
    values (Skolem function F will never return a
    value that matches any of Skolem function Gs
    values)
  • Skolem functions wont ever be part of the answer
    set or the computation it doesnt produce real
    values
  • Theyre just a way of logically generating
    special NULLs

34
Revisiting Our Example
  • Query
  • q(a,t) - author(a, i, _), book(i, t, p)
  • Mapping rule
  • s1(a,t) ? author(a, i, _), book(i, t, p), t
    123
  • Inverse rules
  • author(a, f(a,t), NULL) ? s1(a,t)
  • book(f(a,t), t, p) ? s1(a,t), t 123
  • Expand the query as follows
  • q(a,t) - author(a, i, NULL), book(i, t, p), i
    f(a,t)
  • q(a,t) - s1(a,t), s1(a,t), t 123, i f(a,t)

35
Query Answering Using Inverse Rules
  • Invert all rules using the procedures described
  • Take the query and the possible rule expansions
    and execute them in a Datalog interpreter
  • In the previous query, we expand with all
    combinations of expansions of book and of author
    every possible way of combining and
    cross-correlating info from different sources
  • Then we throw away all unsatisfiable rewritings
    (some expansions will be logically inconsistent)
  • More efficient, but equivalent, algorithms now
    exist
  • Bucket algorithm Levy et al.
  • MiniCon Pottinger Halevy
  • Also related chase and backchase Popa,
    Tannen, Deutsch

36
Summary of Data Integration
  • Local-as-view integration has replaced
    global-as-view as the standard
  • More robust way of defining mediated schemas and
    sources
  • Mediated schema is clearly defined, less likely
    to change
  • Sources can be more accurately described
  • Methods exist for query reformulation, including
    inverse rules
  • Integration requires standardization on a single
    schema
  • Can be hard to get consensus
  • Today we have peer-to-peer data integration,
    e.g., Piazza Halevy et al., Orchestra Ives et
    al., Hyperion Miller et al.
  • Some other aspects of integration were addressed
    in related papers
  • Overlap between sources coverage of data at
    sources
  • Semi-automated creation of mappings and wrappers
  • Data integration capabilities in commercial
    products BEAs Liquid Data, IBMs DB2
    Information Integrator, numerous packages from
    middleware companies
Write a Comment
User Comments (0)
About PowerShow.com