Information Preserving XML Schema Embedding

1
Information Preserving XML Schema Embedding

Philip Bohannon, Bell Laboratories
Wenfei Fan, Univ. of Edinburgh and Bell Labs
Michael Flaster, Bell Laboratories
PPS Narayan, Bell Laboratories

2
XML mapping

XML mapping σd: I(S1) → I(S2)
Instance-level: from XML instances of a given source DTD schema S1 to XML trees of a predefined target DTD schema S2
Information preserving (lossless)
XML data exchange, migration, integration, P2P, ...

[Figure: an XML mapping takes an XML tree T of S1 to an XML tree of S2]
3
Example XML mapping: source DTD

Source schema S1:
db → class*
class → cno, title, type
type → regular + project
regular → prereq
prereq → class*
DTD: (E, P, r), where E is the set of element types, r is the root, and P gives the element type definitions A → α
α ::= PCDATA | ε | B1, ..., Bk | B1 + ... + Bk | B*
Graph representation:
concatenation production B1, ..., Bk: AND edges (solid)
disjunction B1 + ... + Bk: OR edges (dashed)
Kleene star B*: STAR edge (with edge label *)
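
As a minimal sketch (not from the slides), the source DTD S1 can be written as a Python dictionary of productions and turned into the typed edge list of its graph representation. The tuple encoding ("seq", "alt", "star", "pcdata") and the helper name `edges` are illustrative assumptions. Treating cno, title, and project as PCDATA leaves and reading prereq as a Kleene-star production are also assumptions.

```python
# Productions P of S1: each element type maps to (kind, children).
# The encoding and the PCDATA leaves are assumptions made for illustration.
S1 = {
    "db": ("star", ["class"]),
    "class": ("seq", ["cno", "title", "type"]),
    "type": ("alt", ["regular", "project"]),
    "regular": ("seq", ["prereq"]),
    "prereq": ("star", ["class"]),
    "cno": ("pcdata", []),
    "title": ("pcdata", []),
    "project": ("pcdata", []),
}
ROOT = "db"

# Production kind -> edge type in the graph representation.
EDGE_KIND = {"seq": "AND", "alt": "OR", "star": "STAR"}


def edges(dtd):
    """Return the graph representation as (parent, child, edge type) triples."""
    return [
        (parent, child, EDGE_KIND[kind])
        for parent, (kind, children) in dtd.items()
        if kind in EDGE_KIND
        for child in children
    ]


for edge in edges(S1):
    print(edge)
```

The recursion in S1 (class reaches itself through type, regular, and prereq) shows up as a cycle in this edge list.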

4
Example XML mapping: target DTD

Target schema S2
[Figure: graph of the target schema S2, rooted at school. Element types shown: courses, students, current, history, student, course, gpa, name, ssn, taking, category, basic, cno, mandatory, credit, semester, advanced, term, year, title, project, seminar, lab, regular, required, prereq]

5
Information-preserving XML mapping

Objective: find an XML mapping σd: I(S1) → I(S2) such that
Type safety: for any XML tree T of S1, σd(T) is an XML document that conforms to the predefined target schema S2
Information preserving:
Invertibility: there exists an inverse σd⁻¹: I(S2) → I(S1) such that for any XML tree T of S1, T = σd⁻¹(σd(T)). The source T can be recovered from the target σd(T).
Query preservation w.r.t. a query language L: there is a query-rewriting function F: L → L such that for any Q in L and any T of S1, Q(T) = F(Q)(σd(T)). All queries in L on the source can be answered on the target.

6
Challenge: different structures

S1 and S2 have vastly different structures: graph similarity (simulation) does not work here!

[Figure: the source schema S1 next to the target schema S2 (school, courses, students, current, history, course, category, basic, cno, mandatory, credit, semester, advanced, term, year, title, seminar, lab, regular, project, required, gpa, prereq)]

7
Challenge: data integration

Multiple sources are to be mapped to a single target. The target schema must have a larger information capacity, so it cannot be similar to the sources.

[Figure: several source schemas mapped to a single target schema S2]
8
About query preservation: XML query languages

Regular XPath:
Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]
q ::= Q | Q/text() = 'c' | position() = k | q ∧ q | q ∨ q | not q
An XPath fragment: // instead of Q*
Example: a regular XPath query over S1. Find all prerequisites of CS331:

Q1: class[cno/text() = 'CS331'] / (type/regular/prereq/class)*

Query rewriting:
Q2: courses/current/course[basic/cno/text() = 'CS331'] / (category/mandatory/regular/required/prereq/course)*
9
Challenge: information preservation for XML

For relational data w.r.t. relational calculus (L), invertibility (calculus dominance) and query preservation (dominance) coincide [Hull 84].
Separation: (a) There is an invertible XML mapping that is NOT query preserving w.r.t. XPath. (b) There is an XML mapping that is query preserving w.r.t. XPath without position( ) but is NOT invertible.
Complexity: it is undecidable to determine, for an XML mapping defined in any language subsuming FO, whether it is (a) invertible, or (b) query preserving w.r.t. any query language with projection.
Both checks are therefore beyond reach for XML mappings defined in XQuery/XSLT.
Other results:
query preservation w.r.t. regular XPath is stronger than invertibility
sufficient conditions under which the two coincide
10
Previous work

XML mappings defined in XQuery/XSLT: no guidance on
type safety: for any XML tree T of S1, is σd(T) guaranteed to conform to a predefined (recursive) target schema S2?
how to ensure information preservation
Schema mapping, used to derive an instance-level mapping: similarity flooding, Cupid, Clio, TransSCM, ...
cannot guarantee information preservation
Information preservation in traditional data models: not directly applicable to XML mappings
No prior work has considered information-preserving XML mappings

11
Our approach

A systematic way to find information-preserving XML mappings:
find a schema embedding σ: S1 → S2, a schema mapping with certain properties
derive an instance-level mapping σd: I(S1) → I(S2) from σ
automatically guarantee information preservation
accommodate integration (multiple sources)
express XML mappings commonly used in practice

[Figure: at the schema level, a schema embedding from S1 to S2; at the instance level, the derived information-preserving XML mapping from T1 to T2]
12
Schema embedding σ = (λ( ), path( ))

Given
source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2)
similarity matrix att( ) on element type names: att(A, B) in [0, 1] indicates how close A ∈ E1 is to B ∈ E2
Schema embedding σ = (λ( ), path( ))
λ: E1 → E2, the type mapping, with λ(r1) = r2 and att(A, λ(A)) > 0
path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A) to λ(B) in S2, of the form A1[position( ) = k1] / ... / An[position( ) = kn]
path type: an AND (OR, STAR) edge is mapped to an AND (OR, STAR) path (AND path: AND/STAR edges; OR path: AND edges and at least one OR edge; STAR path: AND edges and a STAR edge)
Information capacity:
prefix-free: if P1(A) = A1, ..., An, then path(A, Ai) is NOT a prefix of path(A, Aj) for any j ≠ i; similarly for P1(A) = A1 + ... + An
Type safety: valid mapping
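
The prefix-free condition is easy to check mechanically. The sketch below assumes each path is stored as a list of element-type steps; the function names `is_prefix` and `prefix_free` are illustrative and do not come from the slides.

```python
from itertools import permutations


def is_prefix(p, q):
    """True if path p (a list of steps) is a prefix of path q."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """True if no path in `paths` is a prefix of another path in the list."""
    return not any(is_prefix(p, q) for p, q in permutations(paths, 2))


# Paths assigned to the children of `class` in the running example.
class_paths = [["basic", "cno"], ["basic", "semester", "title"], ["category"]]
print(prefix_free(class_paths))

# A hypothetical bad assignment: "basic" is a prefix of "basic/cno".
print(prefix_free([["basic"], ["basic", "cno"]]))
```

Two paths may share leading steps (here both start with basic) as long as neither is a complete prefix of the other.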

13
Example: schema embedding

Conditions: edge/path type match, prefix-free

[Figure: small schemas S1 and S2 over element types A and B, with edges labeled 1 and 2]
Schema embedding: NO. Graph simulation: YES.
Schema embedding is not a mild generalization of graph simulation.
14
Schema embedding: example

λ(db) = school, λ(class) = course
path(db, class) = courses/current/course
mapping an edge to a path
STAR edge to STAR path
Graph similarity? NO

[Figure: S1 (db, class) next to S2 (school, courses, students, current, history, student, course, gpa, name, ssn, taking)]
15
Schema embedding: example

λ(type) = category, λ(A) = A
path(class, cno) = basic/cno
path(class, title) = basic/semester/title
path(class, type) = category
AND (STAR) edges to AND (STAR) paths
Relative paths: relative to course

[Figure: S1 (class, cno, type, title) next to S2 (course, category, basic, cno, credit, semester, term, year, title)]
16
Schema embedding: example

λ(X) = X
path(type, regular) = mandatory/regular
path(type, project) = advanced/project
path(regular, prereq) = required/prereq
path(prereq, class) = course
OR edges to OR paths

[Figure: S1 (type, regular, project) next to S2 (category, mandatory, advanced, project, seminar, lab, regular); S1 (regular, prereq, class) next to S2 (regular, required, gpa, prereq, course)]
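
With all edges of S1 assigned a path, a plain label path over S1 can be translated step by step into a path over S2. The sketch below covers only this simple case (no qualifiers, no Kleene star), so it is far narrower than the query-rewriting function for regular XPath discussed in the talk. The dictionary `PATH` collects the path( ) assignments from the example slides; the function name `rewrite` is an assumption.

```python
# path(A, B) for every edge (A, B) of S1, as given in the example slides.
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
    ("class", "type"): ["category"],
    ("type", "regular"): ["mandatory", "regular"],
    ("type", "project"): ["advanced", "project"],
    ("regular", "prereq"): ["required", "prereq"],
    ("prereq", "class"): ["course"],
}


def rewrite(source_steps, root="db"):
    """Translate a label path over S1, starting below `root`, into a path over S2."""
    target, current = [], root
    for step in source_steps:
        target.extend(PATH[(current, step)])  # replace the edge by its path
        current = step
    return "/".join(target)


print(rewrite(["class", "type", "regular", "prereq", "class"]))
# courses/current/course/category/mandatory/regular/required/prereq/course
```

The output concatenates the unqualified parts of Q2: the path to course followed by one unrolling of the starred sub-path.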

17
Deriving instance-level mapping

Each schema embedding σ: S1 → S2 determines an XML mapping σd: I(S1) → I(S2)
Path types and prefix-free
Given an XML tree T1 of S1, σd(T1) constructs an instance T2 of S2 top-down, by mapping A-elements of T1 to λ(A)-nodes in T2:
the root of T2 is mapped from the root of T1
for each λ(A)-element in T2 mapped from an A-element of T1, generate path(A, B) in T2 for each B-child of the A-element
when all the elements in T2 mapped from nodes in T1 are fully expanded, add the necessary default elements to T2 so that T2 satisfies S2
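
The top-down construction can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the authors' implementation: it skips the final step of adding default elements, it assumes that the repeated step of a STAR path is its last step, and the names `embed`, `invert`, `PATH`, and `STAR_EDGES` are hypothetical. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

LAM_ROOT = {"db": "school"}  # type mapping for the root; other tags come from the paths
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
    ("class", "type"): ["category"],
    ("type", "regular"): ["mandatory", "regular"],
    ("type", "project"): ["advanced", "project"],
    ("regular", "prereq"): ["required", "prereq"],
    ("prereq", "class"): ["course"],
}
STAR_EDGES = {("db", "class"), ("prereq", "class")}  # assumed STAR edges of S1


def embed(src, tgt=None):
    """Map the S1 subtree rooted at src into the S2 element tgt (created for the root)."""
    if tgt is None:
        tgt = ET.Element(LAM_ROOT[src.tag])
    if src.text and src.text.strip():
        tgt.text = src.text
    for child in src:
        edge = (src.tag, child.tag)
        steps, node = PATH[edge], tgt
        for i, step in enumerate(steps):
            # Shared path steps are reused; a STAR edge gets a fresh last step per child.
            fresh = edge in STAR_EDGES and i == len(steps) - 1
            found = None if fresh else node.find(step)
            node = ET.SubElement(node, step) if found is None else found
        embed(child, node)
    return tgt


def invert(tgt, tag="db"):
    """Recover the S1 subtree of type `tag` by following path(tag, B) from tgt."""
    src = ET.Element(tag)
    src.text = tgt.text
    for (parent, child), steps in PATH.items():
        if parent == tag:
            for node in tgt.findall("/".join(steps)):
                src.append(invert(node, child))
    return src


source = ET.fromstring(
    "<db><class><cno>CS331</cno><title>Databases</title><type><regular><prereq>"
    "<class><cno>CS120</cno><title>Intro</title><type><project>p1</project></type></class>"
    "</prereq></regular></type></class></db>"
)
target = embed(source)
print(ET.tostring(target, encoding="unicode"))
print(ET.tostring(invert(target)) == ET.tostring(source))
```

On this sample the round trip through `invert` reproduces the source document, which is a spot check of invertibility for one input rather than a proof.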

18
Properties of schema embedding

Theorem: the XML mapping σd: I(S1) → I(S2) derived from a schema embedding σ: S1 → S2 is
well defined (type safety),
invertible (with a quadratic-time inverse), and
query preserving w.r.t. regular XPath (query rewriting: linear-time data complexity, quadratic-time combined complexity)

19
Integration: multiple sources

[Figure: two source schemas mapped into a single target schema S2]

λ(db) = school, λ(X) = X
path(db, student) = students/student
path(taking, cno) = cno

pairwise disjoint path mappings from the source schemas to S2
20
Schema embedding vs. graph simulation

Definition
  embedding: mapping edges to paths
  simulation: mapping edges to edges
Restructuring
  embedding: various DTD constructs, different structures
  simulation: source and target schemas with similar structures
Information preservation for XML mappings
  embedding: automatically guarantees both invertibility and query preservation w.r.t. regular XPath
  simulation: no
Data integration
  embedding: multiple source DTDs to a single target schema
  simulation: no
A systematic method to define information-preserving XML mappings

21
Complexity: finding schema embedding

Input: two DTD schemas S1 and S2, and a similarity matrix att( )
Output: a schema embedding σ: S1 → S2 such that qual(σ, att) is maximal, if there is any
qual(σ, att) is the sum of att(A, λ(A)) for all A in S1
Theorem: it is NP-complete to determine whether or not there is a schema embedding from S1 to S2, even when S1 and S2 are nonrecursive and consist of concatenation types only.
Efficient algorithms are necessarily heuristic:
find a local embedding for each DTD production of S1
assemble the local embeddings into a schema embedding
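
The quality measure and the type-mapping condition translate directly into code. In this sketch the similarity matrix is a dictionary keyed by (source type, target type) pairs, with missing pairs treated as 0; the att values shown are invented for illustration, and the function names are assumptions.

```python
def qual(lam, att):
    """Sum of att(A, lam(A)) over all source types A in the type mapping."""
    return sum(att.get((a, b), 0.0) for a, b in lam.items())


def is_valid_type_mapping(lam, att, r1, r2):
    """Check lam(r1) = r2 and att(A, lam(A)) > 0 for every mapped source type A."""
    return lam.get(r1) == r2 and all(att.get((a, b), 0.0) > 0 for a, b in lam.items())


# Hypothetical similarity values, not taken from the talk.
att = {("db", "school"): 0.5, ("class", "course"): 0.75, ("cno", "cno"): 1.0}
lam = {"db": "school", "class": "course", "cno": "cno"}

print(qual(lam, att))  # 2.25
print(is_valid_type_mapping(lam, att, "db", "school"))
```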

22
Computing local embedding

Input: a production A → P(A) in source DTD S1, and target schema S2
Output: σ0 = (λ0, path0), a partial embedding from P(A) to S2
Example: find λ0( ) from types in P(A) to types of S2, and path0( )

[Figure: fragment of S2 (category, mandatory, advanced, project, seminar, lab, regular)]

If λ0 is given: an O(|P(A)| |S2|) algorithm findPath finds a local embedding (depth-first search, checking each S2 subtree only once)
When λ0 is not fixed, the local embedding problem is NP-hard
Heuristic: randomized findPath finds both λ0 and path0 (randomly picking a possible type-node match in the search)
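
The slides give no pseudocode for findPath, so the following is only a plain depth-first search in its spirit: it looks for one path between two target types and visits each S2 node at most once. It ignores edge types (AND/OR/STAR) and the prefix-free condition, both of which the real algorithm must respect. The fragment `S2_EDGES` lists only the edges implied by the example paths in the talk, and `find_path` is a hypothetical name.

```python
# Fragment of the S2 graph, restricted to edges implied by the example paths.
S2_EDGES = {
    "school": ["courses"],
    "courses": ["current"],
    "current": ["course"],
    "course": ["basic", "category"],
    "basic": ["cno", "semester"],
    "semester": ["title"],
    "category": ["mandatory", "advanced"],
    "mandatory": ["regular"],
    "advanced": ["project"],
    "regular": ["required"],
    "required": ["prereq"],
    "prereq": ["course"],
}


def find_path(graph, start, goal, visited=None):
    """Depth-first search; returns the steps after `start` that reach `goal`, or None."""
    visited = {start} if visited is None else visited
    for child in graph.get(start, []):
        if child == goal:
            return [child]
        if child not in visited:  # each node is expanded at most once
            visited.add(child)
            rest = find_path(graph, child, goal, visited)
            if rest is not None:
                return [child] + rest
    return None


print(find_path(S2_EDGES, "school", "course"))
print(find_path(S2_EDGES, "course", "title"))
print(find_path(S2_EDGES, "category", "project"))
```

Because the visited set is shared across the whole search, the sketch terminates on the recursive part of S2 (course reaches itself through prereq).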

23
Assembling local embeddings

Input: C(A), a set of local embeddings for each A in the source DTD (initialized via randomized findPath), and a target schema S2
Output: σ = (λ, path), a schema embedding from S1 to S2, if any
Theorem: the assemble-embedding problem is NP-complete even when S1 and S2 are nonrecursive.
Conflicts: type mapping, prefix-free
Three heuristic algorithms:
(1) Fix an order O on S1 types via qual( ), pick a local embedding σA from C(A) in the order O, and increment σ with σA if there is no conflict
(2) Assume a random order O on S1 types, then do the same as (1)
(3) Reduction to the MAX-Weight-Independence-Set problem, leveraging an existing tool for that problem
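
The ordered heuristics (1) and (2) can be sketched as a greedy merge. The version below is a simplification under stated assumptions: a local embedding is reduced to its partial type mapping, only type-mapping conflicts are detected (prefix-free conflicts are ignored), and the candidate sets are invented for illustration. The name `assemble` is hypothetical.

```python
def assemble(order, candidates):
    """Greedily pick one compatible local embedding per source type, in the given order.

    Returns the merged type mapping, or None if some type has no compatible candidate.
    """
    lam = {}
    for a in order:
        for local in candidates[a]:
            # Compatible if it agrees with every type already mapped.
            if all(lam.get(t, target) == target for t, target in local.items()):
                lam.update(local)
                break
        else:
            return None  # every candidate for `a` conflicts with earlier choices
    return lam


# Hypothetical candidate sets C(A); each entry is a partial type mapping.
candidates = {
    "db": [{"db": "school", "class": "student"}, {"db": "school", "class": "course"}],
    "class": [{"class": "course", "cno": "cno", "title": "title", "type": "category"}],
}

print(assemble(["class", "db"], candidates))
print(assemble(["db", "class"], candidates))
```

The second call returns None because the first candidate for db commits class to student, which the only candidate for class contradicts. This sensitivity to the processing order is one reason a different order can succeed where a fixed one fails.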

24
Experimental evaluation

Benchmark
XMark (99 type nodes in its original form)
Real-life DTDs: SIGMOD (13), PSD (121), mondial (70), etc.
Generating target schemas by adding noise: changing edges to paths, mutating names, inserting new subtrees
selectivity/accuracy of att( ): in [0, 1] (1.0 is an exact match)
Target schemas with a noise parameter of 75%: XMark (581-748), SIGMOD (54-96), PSD (712-820), mondial (395-496)
System
933MHz/1.0GHz Pentium III, 256MB memory
QUALEX: a tool for MAX-Weight-Independence-Set
Algorithms implemented in Java

25
Experimental result: target size
XMark (acc = 0.75). RandomOrder and MAXSet-Reduction perform well.
26
Experimental result: running time required
XMark (acc = 0.75). Running times are in seconds for schemas of hundreds of nodes.
27
Experimental result: different source schemas
Various source schemas (acc = 0.75). RandomOrder finds solutions more than 90% of the time, in seconds.
28
Summary

Information preservation: the first study for XML mappings
more intriguing than its relational counterparts: separation, equivalence, complexity of invertibility and query preservation
important for data exchange, migration, integration, P2P, ...
Schema embedding
mapping edges to paths
captures various DTD constructs, supports restructuring
automatically guarantees information preservation
accommodates multiple sources to a single target
NP-complete, but with efficient and effective heuristics
A practical solution for finding information-preserving XML mappings