Recursive Views and Global Views - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Recursive Views and Global Views

Description:

We've seen how views are useful both within a data model, and as a way of going ... mappings can be partly automated systems such as LSD, Cupid, Clio, ... do this ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 42
Provided by: zack4
Category:

less

Transcript and Presenter's Notes

Title: Recursive Views and Global Views


1
Recursive Views and Global Views
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 550 Database Information Systems
  • November 9, 2004

Some slide content courtesy of Susan Davidson,
Dan Suciu, Raghu Ramakrishnan
2
Where We Are
  • Weve seen how views are useful both within a
    data model, and as a way of going from one model
    to another
  • You read the Shanmugasundaram paper on relational
    ? XML conversion
  • There have been many follow-up pieces of work
  • There have been attempts to build native XML
    databases instead
  • Now were going to talk about another important
    way views can be used to fix a limitation of the
    XML ? relational mappings
  • Well also talk about how certain classes of
    views can be manipulated and reasoned about in
    interesting ways
  • Then well consider the use of views in
    integrating data

3
An Important Set of Questions
  • Views are incredibly powerful formalisms for
    describing how data relates fn rel ? ? rel ?
    rel
  • Can I define a view recursively?
  • Why might this be useful in the XML construction
    case? When should the recursion stop?
  • Suppose we have two views, v1 and v2
  • How do I know whether they represent the same
    data?
  • If v1 is materialized, can we use it to compute
    v2?
  • This is fundamental to query optimization and
    data integration, as well see later

4
Reasoning about Queries and Views
  • SQL or XQuery are a bit too complex to reason
    about directly
  • Some aspects of it make reasoning about SQL
    queries undecidable
  • We need an elegant way of describing views (lets
    assume a relational model for now)
  • Should be declarative
  • Should be less complex than SQL
  • Doesnt need to support all of SQL aggregation,
    for instance, may be more than we need

5
Lets Go Back a Few WeeksDomain Relational
Calculus
  • Queries have form
  • p
  • Predicate boolean expression over x1,x2, , xn
  • We have the following operations
  • ? R xi op xj xi op const const op xi
  • ?xi. p ?xj. p p?q, p?q ?p, p?q
  • where op is ?, ?, ?, ?, ?, ? and
  • xi,xj, are domain variables p,q are predicates
  • Recall that this captures the same expressiveness
    as the relational algebra

domain variables
predicate
6
A Similar Logic-Based LanguageDatalog
  • Borrows the flavor of the relational calculus but
    is a real query language
  • Based on the Prolog logic-programming language
  • A datalog program will be a series of if-then
    rules (Horn rules) that define relations from
    predicates
  • Rules are generally of the form
  • Rout(T1) ? R1(T2), R2(T3), , c(T2 Tn)
  • where Rout is the relation representing the
    query result, Ri are predicates representing
    relations, c is an expression using
    arithmetic/boolean predicates over vars, and
    Ti are tuples of variables

7
Datalog Terminology
  • An example datalog rule
  • idb(x,y) ? r1(x,z), r2(z,y), z
  • Irrelevant variables can be replaced by _
    (anonymous var)
  • Extensional relations or database schemas (edbs)
    are relations only occurring in rules bodies
    these are base relations with ground facts
  • Intensional relations (idbs) appear in the heads
    these are basically views
  • Distinguished variables are the ones output in
    the head
  • Ground facts only have constants, e.g., r1(abc,
    123)

body
head
subgoals
8
Datalog in Action
  • As in DRC, the output (head) consists of a tuple
    for each possible assignment of variables that
    satisfies the predicate
  • We typically avoid 8 in Datalog queries
    variables in the body are existential, ranging
    over all possible values
  • Multiple rules with the same relation in the head
    represent a union
  • We often try to avoid disjunction (Ç) within
    rules
  • Lets see some examples of datalog queries
    (which consist of 1 or more rules)
  • Given Professor(fid, name), Teaches(fid, serno,
    sem), Courses(serno, cid, desc), Student(sid,
    name)
  • Return course names other than CIS 550
  • Return the names of the teachers of CIS 550
  • Return the names of all people (professors or
    students)

9
Datalog is Relationally Complete
  • We can map RA ? Datalog
  • Selection ?p p becomes a datalog subgoal
  • Projection ?A we drop projected-out variables
    from head
  • Cross-product r ? s q(A,B,C,D) ? r(A,B),s(C,D)
  • Join r ? s q(A,B,C,D) ? r(A,B),s(C,D),
    condition
  • Union r U s q(A,B) ? r(A,B) q(C, D) - s(C,D)
  • Difference r s q(A,B) ? r(A,B), s(A,B)
  • (If you think about it, DRC ? Datalog is even
    easier)
  • Great But then why do we care about Datalog?

10
A Query We CantAnswer in RA/TRC/DRC
  • Recall our example of a binary relation for
    graphs or trees (similar to an XML Edge
    relation)
  • edge(from, to)
  • If we want to know what nodes are reachable
  • reachable(F, T, 1) - edge(F, T) distance 1
  • reachable(F, T, 2) - edge(F, X), edge(X,
    T) dist. 2
  • reachable(F, T, 3) - edge(F, X), dist2(X,
    T) dist. 3
  • But how about all reachable paths? (Note this
    was easy in XPath over an XML representation --
    //edge)

(another way of writing ?)
11
Recursive Datalog Queries
  • Define a recursive query in datalog
  • reachable(F, T, 1) - edge(F, T) distance 1
  • reachable(F, T, D 1) - edge(F,
    X),reachable(X, T, D) distance 1
  • What does this mean, exactly, in terms of logic?
  • There are actually three different (equivalent)
    definitions of semantics
  • All make a closed-world assumption facts
    should exist only if they can be proven true from
    the input i.e., assume the DB contains all of
    the truths out there!

12
Fixpoint Semantics
  • One of the three Datalog models is based on a
    notion of fixpoint
  • We start with an instance of data, then derive
    all immediate consequences
  • We repeat as long as we derive new facts
  • In the RA, this requires a while loop!
  • However, that is too powerful and needs to be
    restricted
  • Special case inflationary semantics (which
    terminates in time polynomial in the size of the
    database!)

13
Our Query in RA while(inflationary semantics,
no negation)
  • Datalog
  • reachable(F, T, 1) - edge(F, T)
  • reachable(F, T, D1) - edge(F, X), reachable(X,
    T, D)
  • RA procedure with while
  • reachable edge ? literal1
  • while change
  • reachable ?F, T, D(?T ! X(edge) ? ?F ! X,D !
    D0(reachable) ? add1)

Note literal1(F,1) and add1(D0,D) are actually
arithmetic and literal functions modeled here as
relations.
14
Negation in Datalog
  • Datalog allows for negation in rules
  • Its essential for capturing RA set
    difference-style opsProfessor(, name),
    Student(, name)
  • But negation can be tricky
  • You may recall that in the DRC, we had a notion
    of unsafe queries, and they return here
  • Single(X) ? Person(X), Married(X,Y)

15
Safe Rules/Queries
  • Range restriction, which requires that every
    variable
  • Occurs at least once in a positive relational
    predicate in the body,
  • Or its constrained to equal a finite set of
    values by arithmetic predicates

Safeq(X) ? r(X,Y)q(X) ? X 5 q(X) ?
r(X,X), s(X)q(X) ? r(X) Ç (t(Y),u(X,Y))
Unsafeq(X) ? r(Y)q(X) ? r(X,X)q(X) ? r(X) Ç
t(Y)
  • For recursion, use stratified semantics
  • Allow negation only over edb predicates
  • Then recursively compute values for the idb
    predicates that depend on the edbs (layered like
    strata)

16
Conjunctive Queries
  • A single Datalog rule with no Ç, , 8 can
    express select, project, and join a conjunctive
    query
  • Conjunctive queries are possible to reason about
    statically
  • (Note that we can write CQs in other languages,
    e.g., SQL!)
  • We know how to minimize conjunctive queries
  • An important simplification that cant be done
    for general SQL
  • We can test whether one conjunctive querys
    answers always contain another conjunctive
    querys answers (for ANY instance)
  • Why might this be useful?

17
Example of Containment
  • Suppose we have two queriesq1(S,C) -
    Student(S, N), Takes(S, C), Course(C, X),
    inCSE(C), Course(C, DB Info
    Systems)q2(S,C) - Student(S, N), Takes(S, C),
    Course(C, X)
  • Intuitively, q1 must contain the same or fewer
    answers vs. q2
  • It has all of the same conditions, except one
    extra conjunction (i.e., its more restricted)
  • Theres no union or any other way it can add more
    data
  • We can say that q2 contains q1 because this holds
    for any instance of our DB Student, Takes,
    Course

18
Wrapping up Datalog
  • Weve seen a new language, Datalog
  • Its basically a glorified DRC with a special
    feature, recursion
  • Its much cleaner than SQL for reasoning about
  • But negation (as in the DRC) poses some
    challenges
  • Weve seen that a particular kind of query, the
    conjunctive query, is written naturally in
    Datalog
  • Conjunctive queries are possible to reason about
  • We can minimize them, or check containment
  • Conjunctive queries are very commonly used in our
    next problem, data integration

19
The Data Integration Problem
  • Weve seen that even with normalization and the
    same needs, different people will arrive at
    different schemas
  • In fact, most people also have different needs!
  • Often people build databases in isolation, then
    want to share their data
  • Different systems within an enterprise
  • Different information brokers on the Web
  • Scientific collaborators
  • Researchers who want to publish their data for
    others to use
  • This is the goal of data integration tie
    together different sources, controlled by many
    people, under a common schema
  • Typically its based on conjunctive queries, as
    with Datalog

20
Building a Data Integration System
  • Create a middleware mediator or data
    integration system over the sources
  • Can be warehoused (a data warehouse) or virtual
  • Presents a uniform query interface and schema
  • Abstracts away multitude of sources consults
    them for relevant data
  • Unifies different source data formats (and
    possibly schemas)
  • Sources are generally autonomous, not designed to
    be integrated
  • Sources may be local DBs or remote web
    sources/services
  • Sources may require certain input to return
    output (e.g., web forms) binding patterns
    describe these

21
Typical Data Integration Components
Query
Results
Data Integration System / Mediator
Mediated Schema
Source Catalog
Mappings in Catalog
Wrapper
Wrapper
Wrapper
Source Relations
22
Typical Data Integration Architecture
Source Descrs.
Query
Reformulator
Source Catalog
Query over sources
QueryProcessor
Results
Queries bindings
Data in mediated format
Wrapper
Wrapper
Wrapper
23
Challenges of Mapping Schemas
  • In a perfect world, it would be easy to match up
    items from one schema with another
  • Every table would have a similar table in the
    other schema
  • Every attribute would have an identical attribute
    in the other schema
  • Every value would clearly map to a value in the
    other schema
  • Real world as with human languages, things
    dont map clearly!
  • May have different numbers of tables different
    decompositions
  • Metadata in one relation may be data in another
  • Values may not exactly correspond
  • It may be unclear whether a value is the same

24
A Few Simple Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

25
How Do We Relate Schemas?
  • General approach is to use a view to define
    relations in one schema, given data in the other
    schema
  • This allows us to restructure or recompose
    decompose our data in a new way
  • We can also define mappings between values in a
    view
  • We use an intermediate table defining
    correspondences a concordance table
  • It can be filled in using some type of code, and
    corrected by hand

26
Mapping Our Examples
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • Movie(Title, Year, Director, Editor, Star1,
    Star2)
  • PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
  • MotionPicture(ID, Title, Year)Participant(ID,
    Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
???
27
Two Important Approaches
  • TSIMMIS Garcia-Molina97 Stanford
  • Focus semistructured data (OEM), OQL-based
    language (Lorel)
  • Creates a mediated schema as a view over the
    sources
  • Spawned a UCSD project called MIX, which led to a
    company now owned by BEA Systems
  • Other important systems of this vein Kleisli/K2
    _at_ Penn
  • Information Manifold Levy96 ATT Research
  • Focus local-as-view mappings, relational model
  • Sources defined as views over mediated schema
  • Requires a special
  • Spawned Tukwila at Washington, and eventually a
    company as well
  • Led to peer-to-peer integration approaches
    (Piazza, etc.)

28
The Focus of these Systems
  • Focus Web-based queryable sources
  • CGI forms, online databases, maybe a few RDBMSs
  • Each needs to be mapped into the system not as
    easy as web search but the benefits are
    significant vs. query engines
  • A few parenthetical notes
  • Part of a slew of works on wrappers, source
    profiling, etc.
  • The creation of mappings can be partly automated
    systems such as LSD, Cupid, Clio, do this
  • Today most people look at integrating large
    enterprises (thats where the is!) Nimble,
    BEA, IBM

29
TSIMMIS
  • The Stanford-IBM Manager of Multiple Information
    Sources or, a Yiddish stew
  • An instance of a global-as-view mediation
    system
  • One of the first systems to support
    semi-structured data, which predated XML by
    several years

30
Semi-structured Data OEM
  • Observation given a particular schema, its
    attributes may be unavailable from certain
    sources inherent irregularity
  • Proposal Object Exchange Model, OEM
  • OID
  • How does it relate to XML?
  • What problems does OEM solve, and not solve, in
    a heterogeneous system?

31
OEM Example
Show this XML fragment in OEM
Bernstein Newcomerhor Principles of TP ok Chamberlin DB2
UDB
32
Queries in TSIMMIS
  • Specified in OQL-style language called Lorel
  • OQL was an object-oriented query language
  • Lorel is, in many ways, a predecessor to XQuery
  • Based on path expressions over OEM structures
  • select bookwhere book.author DB2 UDB and
    book.title Chamberlin
  • This is basically like XQuery, which well use in
    place of Lorel and the MSL template language.
    Previous query restated
  • for b in document(my-source)/bookwhere
    b/title/text DB2 UDB and b/author/text()
    Chamberlinreturn b

33
Query Answering in TSIMMIS
  • Basically, its view unfolding, i.e., composing a
    query with a view
  • The query is the one being asked
  • The views are the MSL templates for the wrappers
  • Some of the views may actually require
    parameters, e.g., an author name, before theyll
    return answers
  • Common for web forms (see Amazon, Google, )
  • XQuery functions (XQuerys version of views)
    support parameters as well, so well see these in
    action

34
A Wrapper Definition in MSL
  • Wrappers have templates and binding patterns (X)
    in MSL
  • B - B // select
    from book where author X //
  • This reformats a SQL query over Book(author,
    year, title)
  • In XQuery, this might look like
  • define function GetBook(X AS xsdstring) as book
    for x in sql(select from book where
    author x )return xx
    thor

The union of GetBooks results, plus many
others,is the view AllData()
35
How to Answer the Query
  • Given our query
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b
  • We want to find all wrapper definitions that
  • Either contain output enough information that we
    can evaluate all of our conditions over the
    output
  • Or have already tested the conditions for us!

define function AllData(x AS xsdstring) as
element return GetBooks(x),
36
Query Composition with Views
  • We find all views that define book with author
    and title, and we compose the query with each
  • define function GetBook(x AS xsdstring) as book
    for b in sql(select from book where
    author x )return bx
    thor
  • for b in AllData()/bookwhere b/title/text()
    DB2 UDB and b/author/text()
    Chamberlinreturn b

We need a value for x!
37
Matching View Output to Our Querys Conditions
  • Determine that b/book/author/text() ?? x by
    matching the pattern on the functions output
  • define function GetBook(x AS xsdstring) as book
    for b in sql(select from book where
    author x )return bx
    thor
  • where x Chamberlinfor b in
    GetBook(x)/bookwhere b/title/text() DB2
    UDB return b

38
The Final Step Unfolding
  • where x Chamberlinfor b in for b in
    sql(select from book where author x
    )return bx
    /bookwhere b/title/text() DB2 UDB return
    b

39
What Is the Answer?
  • Given schema book(author, year, title) and
    datalog rules defining an instance
  • book(Chamberlin, 1992, DB2 UDB)
  • book(Chamberlin, 1995, DB2/CS)
  • What do we get for our query answer?

40
TSIMMIS
  • Early adopter of semistructured data
  • Can support irregular structure and missing
    attributes
  • Can support data from many different sources
  • Doesnt fully solve heterogeneity problem,
    though!
  • Simple algorithms for view unfolding
  • Easily can be composed in a hierarchy of mediators

41
Limitations of TSIMMIS Approach
  • Some data sources may contain data with certain
    ranges or properties
  • Books by Aho, Students at UPenn,
  • How do we express these? (Important for
    performance!)
  • Mediated schema is basically the union of the
    various MSL templates as they change, so may
    the mediated schema
  • Next time well see the opposite approach and
    some very cool logical inference!
Write a Comment
User Comments (0)
About PowerShow.com