Title: Information Preserving XML Schema Embedding
1Information Preserving XML Schema Embedding
- Philip Bohannon, Bell Laboratories
- Wenfei Fan, Univ. of Edinburgh and Bell Labs
- Michael Flaster, Bell Laboratories
- PPS Narayan, Bell Laboratories
2XML mapping
- XML mapping σd: I(S1) → I(S2)
- Instance-level: from XML instances of a given source DTD schema S1 to XML trees of a predefined target DTD schema S2
- Information preserving (lossless)
- XML data exchange, migration, integration, P2P, ...
[Figure: an XML mapping takes an XML tree T of S1 to an XML tree of S2]
3Example XML mapping source DTD
- Source schema S1
  - db → class*
  - class → cno, title, type
  - type → (regular + project)
  - regular → prereq
  - prereq → class*
- DTD (E, P, r). E: element types; r: root; P: element type definitions A → α
  - α ::= PCDATA | ε | B1, ..., Bk | B1 + ... + Bk | B*
- Graph representation
  - concatenation B1, ..., Bk: AND edges (solid)
  - disjunction B1 + ... + Bk: OR edges (dashed)
  - Kleene star B*: STAR edge (with edge label *)
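The graph view of a DTD described on this slide can be sketched as a small data structure. This is only an illustration in Python, not the authors' implementation: the names `EdgeKind`, `Production`, `S1`, `edges`, and `is_recursive` are hypothetical, the leaf types are assumed to be PCDATA, and the `class` children of `db` and `prereq` are assumed to be starred (STAR edges).

```python
from dataclasses import dataclass
from enum import Enum


class EdgeKind(Enum):
    AND = "AND"    # concatenation B1, ..., Bk (solid edges)
    OR = "OR"      # disjunction B1 + ... + Bk (dashed edges)
    STAR = "STAR"  # Kleene star B* (edge labelled *)


@dataclass(frozen=True)
class Production:
    kind: EdgeKind
    children: tuple  # child element types; empty for leaves


# Source DTD S1 = (E, P, r) in a hypothetical encoding.
ROOT = "db"
S1 = {
    "db": Production(EdgeKind.STAR, ("class",)),
    "class": Production(EdgeKind.AND, ("cno", "title", "type")),
    "type": Production(EdgeKind.OR, ("regular", "project")),
    "regular": Production(EdgeKind.AND, ("prereq",)),
    "prereq": Production(EdgeKind.STAR, ("class",)),
    # Leaves: treated as PCDATA here (an assumption for illustration).
    "cno": Production(EdgeKind.AND, ()),
    "title": Production(EdgeKind.AND, ()),
    "project": Production(EdgeKind.AND, ()),
}


def edges(dtd):
    """Yield (parent, child, kind) for every edge of the DTD graph."""
    for parent, prod in dtd.items():
        for child in prod.children:
            yield parent, child, prod.kind


def is_recursive(dtd, root):
    """True if the DTD graph has a cycle reachable from the root."""
    on_stack, done = set(), set()

    def visit(node):
        if node in on_stack:
            return True
        if node in done:
            return False
        on_stack.add(node)
        found = any(visit(c) for c in dtd[node].children)
        on_stack.discard(node)
        done.add(node)
        return found

    return visit(root)
```

Under these assumptions S1 is recursive, since `class` reaches itself through `type`, `regular`, and `prereq`.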
4Example XML mapping target DTD
[Figure: graph of the target DTD S2, rooted at school, with element types courses, students, current, history, student, course, gpa, name, ssn, taking, category, basic, cno, mandatory, credit, semester, advanced, term, year, title, project, seminar, lab, regular, required, prereq]
5information preserving XML mapping
- Objective: find an XML mapping σd: I(S1) → I(S2) such that
  - Type safety: for any XML tree T of S1, σd(T) is an XML document that conforms to the predefined target schema S2
  - Information preserving
    - Invertibility: there exists an inverse σd⁻¹: I(S2) → I(S1) such that for any XML tree T of S1, T = σd⁻¹(σd(T)). The source T can be recovered from the target σd(T)
    - Query preservation w.r.t. a query language L: there is a query-rewriting function F: L → L such that for any Q in L and any T of S1, Q(T) = F(Q)(σd(T)). All queries in L on the source can be answered on the target
6Challenge different structures
- S1 and S2 have vastly different structures: graph similarity (simulation) does not work here!
[Figure: the source DTD S1 next to the target DTD S2 (school, courses, students, current, history, course, category, basic, cno, mandatory, credit, semester, advanced, term, year, title, seminar, lab, regular, project, required, gpa, prereq)]
7Challenge data integration
- Multiple sources are to be mapped to a single target: the target schema must have a larger information capacity; it cannot be similar to the sources
[Figure: multiple source schemas mapped to a single target schema S2]
8About query preservation XML query languages
- Regular XPath
  - Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]
  - q ::= Q | Q/text() = 'c' | position() = k | q ∧ q | q ∨ q | not q
- An XPath fragment: // instead of Q*
- Example: a regular XPath query over S1
  - Find all prerequisites of CS331
  - Q1: class[cno/text() = 'CS331']/(type/regular/prereq/class)*
- Query rewriting (the same query over S2)
  - Q2: courses/current/course[basic/cno/text() = 'CS331']/(category/mandatory/regular/required/prereq/course)*
9Challenge information preservation for XML
- For relational data w.r.t. relational calculus (L), invertibility (calculus dominance) and query preservation (dominance) coincide [Hull 84]
- Separation: (a) There is an invertible XML mapping that is NOT query preserving w.r.t. XPath. (b) There is an XML mapping that is query preserving w.r.t. XPath without position( ) but is NOT invertible.
- Complexity: It is undecidable to determine, for an XML mapping defined in any language subsuming FO, whether it is (a) invertible, or (b) query preserving w.r.t. any query language with projection.
  - beyond reach for XML mappings defined in XQuery/XSLT
- Other results
  - query preservation w.r.t. regular XPath is stronger than invertibility
  - sufficient conditions under which the two coincide
10Previous work
- XML mappings defined in XQuery/XSLT: no guidance on
  - type safety: for any XML tree T of S1, is σd(T) guaranteed to conform to a predefined (recursive) target schema S2?
  - how to ensure information preservation
- Schema mapping: to derive instance-level mapping
  - similarity flooding, Cupid, Clio, TransSCM
  - cannot guarantee information preservation
- Information preservation in traditional data models: not directly applicable to XML mappings
- No prior work has considered information-preserving XML mapping
11Our approach
- A systematic way to find information-preserving XML mappings
  - find a schema embedding σ: S1 → S2, a schema mapping with certain properties
  - derive an instance-level mapping σd: I(S1) → I(S2) from σ
- automatically guarantee information preservation
- accommodate integration (multiple sources)
- express XML mappings commonly used in practice
[Figure: at the schema level, a schema embedding from S1 to S2; at the instance level, the derived information-preserving XML mapping from T1 to T2]
12Schema embedding σ = (λ( ), path( ))
- Given
  - source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2)
  - similarity matrix att( ) on element type names: att(A, B) in [0, 1] indicates how close A ∈ E1 is to B ∈ E2
- Schema embedding σ = (λ( ), path( ))
  - λ: E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
  - path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A) to λ(B) in S2: A1[position( ) = k1]/ ... /An[position( ) = kn]
  - path type: AND (OR, STAR) edge to AND (OR, STAR) path (AND path: AND/STAR edges; OR path: AND edges and at least 1 OR edge; STAR path: AND edges and a STAR edge)
- Information capacity
  - prefix-free: if P1(A) = A1, ..., An, path(A, Ai) is NOT a prefix of any path(A, Aj) for j ≠ i; similarly for P1(A) = A1 + ... + An.
- Type safety: valid mapping
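The prefix-free condition is easy to check for a single production. The following Python sketch is an illustration only: `is_prefix` and `prefix_free` are hypothetical helper names, paths are written as '/'-separated strings of target element types, and position( ) qualifiers are compared as plain text if present.

```python
def is_prefix(p, q):
    """True if path p (a list of steps) is a prefix of path q."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """Check that no path(A, Ai) is a prefix of path(A, Aj) for j != i.

    `paths` maps each child type Ai of one production P1(A) to path(A, Ai).
    Paths are compared step by step, not character by character.
    """
    split = {child: p.split("/") for child, p in paths.items()}
    return not any(
        is_prefix(split[a], split[b])
        for a in split
        for b in split
        if a != b
    )
```

For instance, the paths basic/cno, basic/semester/title, and category are prefix-free, whereas basic and basic/semester/title are not, because the image of the first child would lie on the way to the image of the second.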
13Example Schema embedding
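[Figure: a source DTD S1 and a target DTD S2, each over element types A and B (with labels 1 and 2 in the figure). Schema embedding: NO; graph simulation: YES]
- Schema embedding is not a mild generalization of graph simulation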
14Schema embedding example
- λ(db) = school, λ(class) = course
- path(db, class) = courses/current/course
- mapping edge to path
- STAR edge to STAR path
- Graph similarity? NO
[Figure: S1 (db, class) and S2 (school, courses, students, current, history, student, course, gpa, name, ssn, taking)]
15Schema embedding example
- λ(type) = category, λ(A) = A
- path(class, cno) = basic/cno
- path(class, title) = basic/semester/title
- path(class, type) = category
- AND (STAR) edges to AND (STAR) paths
- Relative path: relative to course
[Figure: S1 (class, cno, title, type) and S2 (course, category, basic, cno, credit, semester, term, year, title)]
16Schema embedding example
- λ(X) = X
- path(type, regular) = mandatory/regular
- path(type, project) = advanced/project
- OR edges to OR paths
- path(regular, prereq) = required/prereq
- path(prereq, class) = course
[Figure: S1 (type, regular, project, prereq) and S2 (category, mandatory, advanced, regular, project, seminar, lab, required, gpa, prereq, course)]
17Deriving instance-level mapping
- Each schema embedding σ: S1 → S2 determines an XML mapping σd: I(S1) → I(S2)
  - Path types and prefix-free
- Given an XML tree T1 of S1, σd(T1) constructs an instance T2 of S2, top-down, by mapping A-elements of T1 to λ(A)-nodes in T2
  - the root of T2 is mapped from the root of T1
  - for each λ(A)-element in T2 mapped from an A-element of T1, generate path(A, B) in T2 for each B-child of the A-element
  - when all the elements in T2 mapped from nodes in T1 are fully expanded, add necessary default elements to T2 such that T2 satisfies S2.
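The top-down construction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' algorithm: `Node`, `LAMBDA`, `PATH`, `embed`, and `invert` are hypothetical names; only part of the running example is encoded; every child gets a fresh node for the last step of its path while earlier steps are shared (which assumes the STAR or OR edge is the last step of the path); position( ) qualifiers are ignored; and the final padding with default elements is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    text: str = ""
    children: list = field(default_factory=list)


# Hypothetical encoding of part of the running example's embedding.
LAMBDA = {"db": "school", "class": "course", "cno": "cno", "title": "title"}
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
}


def embed(src, tgt=None):
    """Map an A-element `src` of T1 to a lambda(A)-node of T2, top-down."""
    if tgt is None:
        tgt = Node(LAMBDA[src.label])  # root of T2 comes from root of T1
    tgt.text = src.text
    for child in src.children:
        steps = PATH[(src.label, child.label)]
        cur = tgt
        for step in steps[:-1]:  # share intermediate nodes of the path
            nxt = next((c for c in cur.children if c.label == step), None)
            if nxt is None:
                nxt = Node(step)
                cur.children.append(nxt)
            cur = nxt
        end = Node(steps[-1])  # one fresh end node per source child
        cur.children.append(end)
        embed(child, end)
    return tgt


def invert(tgt, src_label):
    """Recover the A-element (A = src_label) from its image in T2."""
    src = Node(src_label, tgt.text)
    for (a, b), steps in PATH.items():
        if a != src_label:
            continue
        frontier = [tgt]
        for step in steps:  # follow path(A, B) downwards
            frontier = [c for n in frontier for c in n.children
                        if c.label == step]
        src.children.extend(invert(n, b) for n in frontier)
    return src
```

On a small `db` tree with two `class` children, `invert(embed(t1), "db")` returns a tree equal to `t1`, which illustrates why prefix-free paths make the source recoverable from the target.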
18Properties of schema embedding
- Theorem: The XML mapping σd: I(S1) → I(S2) derived from a schema embedding σ: S1 → S2 is
  - well defined (type safety)
  - invertible (with a quadratic-time inverse), and
  - query preserving w.r.t. regular XPath (query rewriting: linear-time data complexity, quadratic-time combined complexity)
19Integration multiple sources
- λ(db) = school, λ(X) = X
- path(db, student) = students/student
- path(taking, cno) = cno
- pairwise disjoint path mappings from the source schemas to S2
[Figure: two source schemas mapped to the single target schema S2]
20Schema embedding vs. graph simulation
- Definition
  - embedding: mapping edges to paths
  - simulation: mapping edges to edges
- restructuring
  - embedding: various DTD constructs, different structures
  - simulation: source and target schemas with similar structures
- information preservation for XML mappings
  - embedding: automatically guarantees both invertibility and query preservation w.r.t. regular XPath
  - simulation: no
- data integration
  - embedding: multiple source DTDs to a single target schema
  - simulation: no
- A systematic method to define information-preserving XML mappings
21Complexity finding schema embedding
- Input: two DTD schemas S1 and S2, and a similarity matrix att( )
- Output: a schema embedding σ: S1 → S2 such that qual(σ, att) is maximal, if there is any
  - qual(σ, att) is the sum of att(A, λ(A)) for all A in S1
- Theorem: It is NP-complete to determine whether or not there is a schema embedding from S1 to S2, even when S1 and S2 are nonrecursive and they consist of concatenation types only.
- Efficient algorithms are necessarily heuristic.
  - Find a local embedding for each DTD production of S1
  - Assemble local embeddings to make a schema embedding
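The quality measure on this slide is a plain sum, which the following Python sketch makes concrete. The function names `qual` and `is_valid_type_mapping` and the dictionary encoding of att( ) and λ are assumptions for illustration; similarity scores missing from the dictionary are treated as 0.

```python
def qual(lam, att):
    """Sum of att(A, lambda(A)) over all source types A."""
    return sum(att.get((a, b), 0.0) for a, b in lam.items())


def is_valid_type_mapping(lam, att, r1, r2):
    """Check lambda(r1) = r2 and att(A, lambda(A)) > 0 for every A."""
    return lam.get(r1) == r2 and all(
        att.get((a, b), 0.0) > 0 for a, b in lam.items()
    )
```

A search for an embedding would compare candidate type mappings by `qual` and discard any candidate for which `is_valid_type_mapping` fails.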
22Computing local embedding
- Input: a production A → P(A) in source DTD S1, target schema S2
- Output: σ0 = (λ0, path0), a partial embedding from P(A) to S2
- Example: find λ0( ) from types in P(A) to types of S2, and path0( )
[Figure: fragment of S2 (category, mandatory, advanced, project, seminar, lab, regular)]
- If λ0 is given: an O(|P(A)| |S2|) algorithm findPath to find a local embedding (depth-first search, checking each S2 subtree only once)
- When λ0 is not fixed, the local embedding problem is NP-hard
- Heuristic: randomized findPath to find both λ0 and path0 (randomly picks a possible type-node match in the search)
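The depth-first search at the heart of findPath can be illustrated as follows. This Python sketch is not the authors' findPath: it only looks for some path of element types between two given target types, expanding each type at most once, and it ignores path types, prefix-freeness, and position( ) qualifiers. The names `find_path` and `S2` are hypothetical, and the graph contains only the edges implied by the path( ) examples on the earlier slides.

```python
# Hypothetical fragment of the target DTD graph: parent -> child types.
S2 = {
    "school": ["courses"],
    "courses": ["current"],
    "current": ["course"],
    "course": ["basic", "category"],
    "basic": ["cno", "semester"],
    "semester": ["title"],
    "category": ["mandatory", "advanced"],
    "mandatory": ["regular"],
    "advanced": ["project"],
    "regular": ["required"],
    "required": ["prereq"],
    "prereq": ["course"],
}


def find_path(graph, start, goal, seen=None):
    """Depth-first search for a path of element types from start to goal.

    Returns the list of steps below `start` (ending in `goal`), or None.
    The shared `seen` set ensures each type is expanded at most once.
    """
    if seen is None:
        seen = set()
    seen.add(start)
    for child in graph.get(start, ()):
        if child == goal:
            return [child]
        if child not in seen:
            rest = find_path(graph, child, goal, seen)
            if rest is not None:
                return [child] + rest
    return None
```

With a fixed type mapping λ0, such a search would be run once per edge (A, B) of the production, from λ0(A) to λ0(B). Because the goal test comes before the visited test, the search also finds cyclic paths such as the one from course back to course.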
23Assembling local embeddings
- Input: C(A), a set of local embeddings for each A in the source DTD (initialized via randomized findPath); a target schema S2
- Output: σ = (λ, path), a schema embedding from S1 to S2, if any
- Theorem: The assemble-embedding problem is NP-complete even when S1 and S2 are nonrecursive.
- Conflict: type mapping, prefix free
- Three heuristic algorithms
  - (1) Fix an order O on S1 types via qual( ), pick a local embedding σA from C(A) in O, and increment σ with σA if no conflict
  - (2) Assume a random order O on S1 types, then do the same as (1)
  - (3) Reduction to the MAX-Weight-Independent-Set problem, leveraging an existing tool for that problem.
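The ordered heuristics (1) and (2) share a simple greedy core, sketched below in Python. This is a minimal illustration under assumptions, not the authors' algorithm: `consistent` and `assemble` are hypothetical names, each local embedding is reduced to its type mapping (a dictionary from source types to target types), and only type-mapping conflicts are detected; prefix-free conflicts are not modelled.

```python
def consistent(lam, local):
    """A local mapping conflicts if it remaps an already-mapped type."""
    return all(lam.get(a, b) == b for a, b in local.items())


def assemble(order, candidates):
    """Greedily merge one local type mapping per source type.

    order      -- source types A in the order they are processed
    candidates -- C(A): for each A, a list of local type mappings
    Returns the merged type mapping, or None if some A has no
    conflict-free candidate left.
    """
    lam = {}
    for a in order:
        for local in candidates[a]:
            if consistent(lam, local):
                lam.update(local)
                break
        else:
            return None  # every candidate for A conflicts with lam
    return lam
```

The result depends on the processing order: an early greedy choice can block every candidate of a later type. That sensitivity is one reason to try a qual-based order, random orders, or a global formulation instead of a single fixed pass.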
24Experimental evaluation
- benchmark
  - XMark (99 type nodes in its original form)
  - Real-life DTDs: SIGMOD (13), PSD (121), mondial (70), etc.
  - Generating target schemas by adding noise: changing edges to paths, mutating names, inserting new subtrees.
  - selectivity/accuracy of att( ): [0, 1] (1.0: exact match)
  - Target schemas with a noise parameter of 75%: XMark (581-748), SIGMOD (54-96), PSD (712-820), mondial (395-496)
- system
  - 933MHz/1.0GHz Pentium III, 256M memory
  - QUALEX: a tool for MAX-Weight-Independent-Set
  - Algorithms implemented in Java
25Experimental result target size
- XMark (acc = 0.75). RandomOrder and MAXSet-Reduction perform well
26Experimental result running time required
- XMark (acc = 0.75). In seconds for schemas of hundreds of nodes
27Experimental result different source schemas
- Various source schemas (acc = 0.75). RandomOrder finds solutions more than 90% of the time, in seconds
28Summary
- Information preservation: the first study for XML mappings
  - more intriguing than its relational counterparts: separation, equivalence, complexity of invertibility and query preservation
  - important for data exchange, migration, integration, P2P, ...
- Schema embedding
  - mapping edges to paths
  - capture various DTD constructs, support restructuring
  - automatically guarantee information preservation
  - accommodate multiple sources to a single target
  - NP-complete, but with efficient and effective heuristics
- A practical solution for finding information-preserving XML mappings