An Annotation Management System for Relational Databases - PowerPoint PPT Presentation

1 / 35
About This Presentation
Title:

An Annotation Management System for Relational Databases

Description:

A system that is able to propagate meta-data along with the data as ... Yummy chicken curry!! NYRestaurants. Restaurant. Cost. Type. Peacock Alley. Bull & Bear ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 36
Provided by: usersS
Category:

less

Transcript and Presenter's Notes

Title: An Annotation Management System for Relational Databases


1
An Annotation Management System for Relational
Databases
  • Laura Chiticariu
  • University of California, Santa Cruz
  • Joint work with
  • Deepavali Bhagwat, Wang-Chiew Tan, Gaurav
    Vijayvargiya

2
Annotation Management System
  • A system that is able to propagate meta-data
    along with the data as the data is being moved
    around
  • Main motivation
  • To trace the provenance and flow of data
  • Many other uses

b2
transformation
transformation
b2
b1
b3
a1
a2
b1
a3
a1
a2
a1
a2
b3
transformation step a query, an ETL rule, etc.
3
Our Vision
NYRestaurants
Cost
Type
Restaurant
Zip
Peacock Alley

French
10022
Bull Bear

Seafood
10022
Pacifica

Chinese
10013

Soho Kitchen Bar
American
10022
Yummy chicken curry!!
All Restaurants
Cheap Restaurants
4
Other Applications
  • Keep information that cannot be otherwise stored
    in the current database design
  • Highlight wrong data
  • Erroneous data may be copied around but the
    comment that it is wrong goes along with it
  • Security and quality metric
  • Annotate security or quality levels of data items

5
Some Related Work
  • Idea is not new though propagation of annotations
    was never explicitly stated as provenance-based
  • Wang Madnick VLDB 90,
  • Lee, Bressan Madnick WIDM 98,
  • Bernstein Bergstraesser IEEE Data Eng. 99
  • Superimposed Information. Maier and Delcambre
    WebDB 99
  • Annotations of Web documents
  • Annotations on genomic sequences
  • Why-Provenance
  • Cui, Widom, Wiener CWW00

6
Outline
  • pSQL queries
  • Semantics
  • CUSTOM propagation scheme
  • DEFAULT propagation scheme
  • DEFAULT-ALL propagation scheme
  • Implementation
  • System architecture
  • Experimental results

7
pSQL an extension of SQL
  • A pSQL fragment
  • SELECT DISTINCT selectlist
  • FROM fromlist
  • WHERE wherelist
  • PROPAGATE DEFAULT
  • DEFAULT-ALL
  • r1.A1 TO B1, , rn.An TO Bn
  • A pSQL query is a union of pSQL fragments

8
The CUSTOM Scheme
R
  • Propagate annotations according to
    user specification

a
c
b
h
Result1
a
SELECT DISTINCT A FROM R r PROPAGATE r.A TO
A UNION SELECT DISTINCT A FROM R r PROPAGATE
r.B TO A
b
Result
annotation UNION
a
c
b
Result2
c
h
h
9
The DEFAULT Scheme
R
S
  • Propagate annotations according to
    where data is copied from

a
c
d
g
b
f
h
e
SELECT DISTINCT B FROM R r PROPAGATE
DEFAULT UNION SELECT DISTINCT B FROM S
s PROPAGATE DEFAULT
Result
r.B TO B
c
g
f
e
h
s.B TO B
natural semantics for tracing the provenance of
data
10
Annotation Propagation under the DEFAULT Scheme
R
S
a
b
Q1
SELECT DISTINCT r.A, r.B, s.C FROM R r, S s WHERE
r.B s.B PROPAGATE DEFAULT
versus SELECT DISTINCT FROM R NATURAL JOIN
S PROPAGATE DEFAULT
Ans1
a
a
equivalent queries, but different annotated
output
Q2
Ans2
a
b
11
The DEFAULT-ALL scheme
  • Propagate annotations according to where data is
    copied from according to all equivalent
    formulations of the given query
  • User Query Q
  • Compute the results of Q on a database D idea
  • E(Q) denotes the set of all queries that are
    equivalent to Q (more precisely, ()).
  • Execute each query in E(Q) on the database D
    under the DEFAULT scheme, then combine the
    results under ?a.

SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE
r.B s.B PROPAGATE DEFAULT-ALL
the SQL query corresponding to Q
()
12
Computing the results of a DEFAULT-ALL query
  • Question
  • Given a pSQL query Q with DEFAULT-ALL
    propagation scheme and a database D, can we
    compute the result of Q(D)?
  • Problem
  • There are infinitely many queries in E(Q). It is
    therefore impossible to execute every query in
    E(Q) in order to obtain the result of Q(D).
  • Solution Compute a finite basis of E(Q) first

13
A Query Basis of Q
  • A query basis of Q, denoted as B(Q), is a finite
    set of pSQL queries (with default propagation
    scheme) such that
  • Ua q(D) a Ua q(D)
  • Given B(Q), we can execute each query in B(Q) and
    combine the results to obtain the result of Q(D).
  • Question Given Q, does B(Q) always exist and how
    can we compute B(Q)?

q?B(Q)
q?E(Q)
14
Generating a Query Basis of Q
  • Given R(A,B) and S(B,C)
  • User query Q
  • Representative Query Q0

The representative query propagates annotations
according to where data is copied from or
equivalently copied from.
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE
r.B s.B PROPAGATE DEFAULT-ALL
Ans(x,y,z) - R(x,y), S(y,z).
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE
r.B s.B PROPAGATE r.A TO A, s.B TO B, s.C TO C,
r.B TO B
Propagations under the default propagation scheme
Additional propagation due to the equality r.B
s.B
15
Generating a Query Basis of Q
  • Auxiliary Queries
  • Q1
  • Q2

Ans(x,y,z) - R(x,y), S(y,z), R(x,w).
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, R
r WHERE r.B s.B, r.A r.A PROPAGATE r.A TO
A, s.B TO B, s.C TO C, r.B TO B, r.A
TO A
Ans(x,y,z) - R(x,y), S(y,z), S(w,z).
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, S
s WHERE r.B s.B, s.C s.C PROPAGATE r.A TO
A, s.B TO B, s.C TO C, r.B TO B, s.C
TO C
16
Generating a Query Basis of Q
  • Auxiliary Queries
  • Q3
  • Q4

Ans(x,y,z) - R(x,y), S(y,z), R(w,y).
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, R
r WHERE r.B s.B, r.B r.B PROPAGATE r.A TO
A, s.B TO B, s.C TO C, r.B TO B, r.B
TO B
Ans(x,y,z) - R(x,y), S(y,z), S(y,w).
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s, S
s WHERE r.B s.B, s.B s.B PROPAGATE r.A TO
A, s.B TO B, s.C TO C, r.B TO B, s.B
TO B
17
Correctness of the Algorithm
  • For the example, a query basis of Q
    consists of Q0, Q1, Q2, Q3, and Q4.
  • Theorem
  • Given a pSQL query Q with DEFAULT-ALL
    propagation scheme, the algorithm generates a
    query basis of Q.
  • Proof Idea
  • Every query in B(Q) is an equivalent query of Q
  • Every equivalent query of Q is annotation-containe
    d in Ua q(D)

q?B(Q)
18
Outline
  • pSQL queries
  • Semantics
  • CUSTOM propagation scheme
  • DEFAULT propagation scheme
  • DEFAULT-ALL propagation scheme
  • Implementation
  • System architecture
  • Experimental results

19
System Architecture
RDBMS
SQL query
USER pSQL query
sorted tuples
final result
Postprocessor
Translator
  • Translator Module
  • Input a pSQL query Q
  • Output an SQL query Q written against the naïve
    storage scheme
  • Q is sent to the RDBMS and executed
  • Postprocessor Module
  • Input sorted tuples (returned by the RDBMS)
  • Output An annotated set of tuples.
  • Annotations for the same output location are
    collected together
  • Duplicate tuples are removed

20
A Naïve Storage Scheme
  • For every attribute of every relation there is an
    additional attribute for storing the annotations
  • Conceivably, there are other possible storage
    schemes

R
R
a
c
d
b
21
The Translator module
Generate a Query Basis
pSQL query default-all scheme
set of pSQL queries with custom scheme
Translate default pSQL to custom pSQL
Translate custom pSQL to SQL
pSQL query default scheme
pSQL query custom scheme
SQL query
default pSQL query
custom pSQL query
22
Experiments
  • Goals
  • compare the performance of pSQL queries under
    different propagation schemes (DEFAULT,
    DEFAULT-ALL, or no propagation scheme)
  • compare the performance of pSQL queries when the
    number of annotations in a database is varied

23
Experimental setup
  • Implemented on top of Oracle 9i
  • Datasets
  • 100MB, 500MB, 1GB TPCH database
  • Unannotated database on original schema
  • 30, 60, 100 annotations on naïve schema
  • buffer size 256Mb
  • Test queries
  • SPJ queries
  • Varied the number of joins (0 to 4 joins)
  • Varied the number of selected attributes (1,3 or
    5 attributes)

24
100MB dataset 100 annotated
Qi(j) denotes a query with i joins and j output
attributes.
25
500MB dataset 100 annotated
Qi(j) denotes a query with i joins and j output
attributes.
26
1GB dataset 100 annotated
Qi(j) denotes a query with i joins and j output
attributes.
27
100MB dataset annotated in various degrees
Qi(j) denotes a query with i joins and j output
attributes.
28
Contributions
  • an annotation management system
  • for carrying annotations along as data is being
    transformed based on provenance
  • pSQL query language for propagation annotations
  • CUSTOM user defined
  • DEFAULT where data was copied from?
  • DEFAULT-ALL invariant under equivalent queries
  • Generate-Query-Basis algorithm
  • an initial implementation

29
Future work
  • Performance of our annotation management system
    on other storage schemes
  • pSQL extensions
  • Aggregates
  • Bag Queries

30
END
31
The CUSTOM Scheme - Example
R
a
c
b
SELECT DISTINCT B FROM R r PROPAGATE r.A TO B,
r.B TO B
h
Result
a
c
b
h
32
Terminology
  • A location is a triple (R, t, A)
  • Definition
  • A query Q1 is annotation contained in a query Q2
    if
  • Q1 ? Q2
  • for every database D, the set of annotations
    attached to every output location in Q1(D) is a
    subset of the set of annotations associated with
    the same location in the output of Q2(D).

R
The annotation a is attached to the location
(R,(1,2),B)
a
33
In a More Concise Notation
  • Ans(x,y,z) - R(x,y), S(y,z), y y.

  • x ! 1, y ! 2, y ! 2, z ! 3
  • Ans(x,y,z) - R(x,y), S(y,z).

  • x ! 1, y ! 2, z ! 3
  • Annotations of values that reside in different
    source locations but are bound to the same
    variable are unioned together.
  • Ans(y) - R(x,y).
  • Ans(y) - S(y,z).
  • Ans(2 ).
  • Annotations that belong to the same output
    location are unioned together.

a
b
a
b
a
b
34
Containment vs. annotation-containment
R
a
b
c
d
Q1 Ans(x,v) - R(x,y,u), R(x,z,v), R(t,w,z).
Q2 Ans(x,v) - R(p,q,v), R(x,z,v), R(t,w,z).
Ans1
c
a
b
Ans2
c
d
b
Q1 ? Q2 but Q1 ?a Q2 and Q2 ?a Q1
35
Translating a CUSTOM pSQL to SQL
custom pSQL query
SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE
r.B s.B PROPAGATE s.B TO B, s.C TO C, r.B TO B
SQL query
SELECT DISTINCT FROM ( Q1 UNION Q2 ) t ORDER BY
t.A, t.B, t.C
Q1 SELECT r.A, NULL, s.B, s.B,
s.C, s.C FROM R r, S s WHERE r.B s.B
Q2 SELECT r.A, NULL, s.B, r.B,
s.C, NULL FROM R r, S s WHERE r.B s.B
Write a Comment
User Comments (0)
About PowerShow.com