Information Preserving XML Schema Embedding

1
Information Preserving XML Schema Embedding
  • Philip Bohannon, Bell Laboratories
  • Wenfei Fan, Univ. of Edinburgh and Bell Labs
  • Michael Flaster, Bell Laboratories
  • PPS Narayan, Bell Laboratories

2
XML mapping
  • XML mapping σd: I(S1) → I(S2)
  • Instance-level: from XML instances of a given
    source DTD schema S1 to XML trees of a predefined
    target DTD schema S2
  • Information preserving (lossless)
  • XML data exchange, migration, integration, P2P, …

[Figure: an XML mapping takes an XML tree T of S1 to an XML tree of S2]
3
Example XML mapping: source DTD
  • Source schema S1
  • db → class*
  • class → cno, title, type
  • type → regular + project
  • regular → prereq
  • prereq → class*
  • DTD (E, P, r): E: element types;
    r: root; P: element type definitions A → α
  • α ::= PCDATA | ε | B1, …, Bk | B1 + … + Bk | B*
  • Graph representation
  • concatenation production B1, …, Bk: AND edge
    (solid)
  • disjunction B1 + … + Bk: OR edge (dashed)
  • Kleene star B*: STAR edge (with edge label *)
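
The graph representation can be made concrete as an adjacency table keyed by element type. The sketch below is an illustration only, not the authors' data structure. The names `S1`, `edges`, and `is_recursive` are hypothetical, the productions for db and prereq are assumed to be Kleene-star productions, and cno, title, and project are assumed to be PCDATA leaves.

```python
# Hypothetical encoding of the source DTD S1 (assumptions noted above).
# Each entry maps an element type to (edge kind, children):
#   "AND" = concatenation, "OR" = disjunction, "STAR" = Kleene star,
#   "STR" = PCDATA leaf with no children.
S1 = {
    "db":      ("STAR", ["class"]),
    "class":   ("AND",  ["cno", "title", "type"]),
    "type":    ("OR",   ["regular", "project"]),
    "regular": ("AND",  ["prereq"]),
    "prereq":  ("STAR", ["class"]),
    "cno":     ("STR",  []),
    "title":   ("STR",  []),
    "project": ("STR",  []),
}


def edges(dtd):
    """Yield (parent, child, edge_kind) triples of the DTD graph."""
    for parent, (kind, children) in dtd.items():
        for child in children:
            yield parent, child, kind


def is_recursive(dtd, root):
    """Return True if a cycle is reachable from root (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in dtd}

    def visit(t):
        color[t] = GREY
        for child in dtd[t][1]:
            if color[child] == GREY:
                return True  # back edge: the DTD is recursive
            if color[child] == WHITE and visit(child):
                return True
        color[t] = BLACK
        return False

    return visit(root)


edge_list = list(edges(S1))
recursive = is_recursive(S1, "db")
```

Under these assumptions S1 is recursive, because class reaches itself through type, regular, and prereq.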

4
Example XML mapping: target DTD
  • Target schema S2

[Figure: graph of the target DTD S2, rooted at school. Node labels: courses, students, current, history, student, course, gpa, name, ssn, taking, category, basic, cno, mandatory, credit, semester, advanced, term, year, title, project, seminar, lab, regular, required, gpa, prereq]

5
Information-preserving XML mapping
  • Objective: find an XML mapping σd: I(S1) → I(S2)
    such that
  • Type safety: for any XML tree T of S1, σd(T) is
    an XML document that conforms to the predefined
    target schema S2
  • Information preserving
  • Invertibility: there exists an inverse σd⁻¹:
    I(S2) → I(S1) such that for any XML tree T of
    S1, T = σd⁻¹(σd(T)).
  • The source T can be recovered from the target
    σd(T)
  • Query preservation w.r.t. a query language L:
    there is a query-rewriting function F: L → L
    such that for any Q in L and any T of S1, Q(T) =
    F(Q)(σd(T)).
  • All queries in L on the source can be answered on
    the target

6
Challenge: different structures
  • S1 and S2 have vastly different structures; graph
    similarity (simulation) does not work here!

[Figure: source schema S1 next to target schema S2 (rooted at school), showing their different structures]

7
Challenge: data integration
  • Multiple sources are to be mapped to a single
    target; the target schema must have a larger
    information capacity, so it cannot be similar to
    the sources

[Figure: multiple source schemas mapped to a single target schema S2]
8
About query preservation: XML query languages
  • Regular XPath
  • Q ::= ε | A | Q/text() | Q/Q
    | Q ∪ Q | Q* | Q[q]
  • q ::= Q | Q/text() = 'c' | position() = k
    | q ∧ q | q ∨ q | not q
  • An XPath fragment: // instead of Q*
  • Example: a regular XPath query over
  • S1: find all prerequisites of CS331

Q1 = class[cno/text() = 'CS331'] /
(type/regular/prereq/class)*

Query rewriting:
Q2 = courses/current/course[basic/cno/text() = 'CS331'] /
(category/mandatory/regular/required/prereq/course)*
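
To illustrate what the starred step in Q1 computes, the sketch below evaluates it as a reflexive-transitive closure over a small sample document. It is an illustration only: the document, the titles, and the helper names `star` and `q1` are made up, and the code relies on the limited path syntax of Python's `xml.etree.ElementTree` rather than on a regular XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the source schema (course data is invented).
DOC = """
<db>
  <class><cno>CS331</cno><title>Databases</title>
    <type><regular><prereq>
      <class><cno>CS240</cno><title>Data Structures</title>
        <type><regular><prereq>
          <class><cno>CS101</cno><title>Intro</title>
            <type><project/></type></class>
        </prereq></regular></type>
      </class>
    </prereq></regular></type>
  </class>
  <class><cno>CS500</cno><title>Thesis</title><type><project/></type></class>
</db>
"""


def star(nodes, step):
    """Apply the child path `step` zero or more times, starting from `nodes`."""
    seen = list(nodes)
    frontier = list(nodes)
    while frontier:
        next_frontier = []
        for node in frontier:
            for match in node.findall(step):
                if not any(match is s for s in seen):
                    seen.append(match)
                    next_frontier.append(match)
        frontier = next_frontier
    return seen


def q1(db, cno):
    """class[cno/text() = cno] / (type/regular/prereq/class)*"""
    start = [c for c in db.findall("class") if c.findtext("cno") == cno]
    return star(start, "type/regular/prereq/class")


db = ET.fromstring(DOC.strip())
result = [c.findtext("cno") for c in q1(db, "CS331")]
```

Because the star allows zero iterations, the result contains the CS331 class itself followed by its direct and indirect prerequisites.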
9
Challenge: information preservation for XML
  • For relational data, w.r.t. relational calculus
    (L), invertibility (calculus dominance) and
    query preservation (dominance) coincide [Hull 84]
  • Separation: (a) there is an invertible XML
    mapping that is NOT query preserving w.r.t.
    XPath; (b) there is an XML mapping that is query
    preserving w.r.t. XPath without position() but
    is NOT invertible.
  • Complexity: it is undecidable to determine, for
    an XML mapping defined in any language subsuming
    FO, whether it is (a) invertible, or (b) query
    preserving w.r.t. any query language with
    projection.
  • Beyond reach for XML mappings defined in
    XQuery/XSLT
  • Other results
  • Query preservation w.r.t. regular XPath is stronger
    than invertibility
  • Sufficient conditions under which the two coincide

10
Previous work
  • XML mappings defined in XQuery/XSLT: no guidance
    on
  • type safety: for any XML tree T of S1, is σd(T)
    guaranteed to conform to a predefined (recursive)
    target schema S2?
  • how to ensure information preservation
  • Schema mapping: to derive instance-level mapping
  • similarity flooding, Cupid, Clio, TransSCM, …
  • cannot guarantee information preservation
  • Information preservation in traditional data
    models: not directly applicable to XML mappings
  • No prior work has considered information-preserving
    XML mapping

11
Our approach
  • A systematic way to find information-preserving
    XML mappings
  • find a schema embedding σ: S1 → S2, a schema
    mapping with certain properties
  • derive an instance-level mapping σd: I(S1) →
    I(S2) from σ
  • automatically guarantee information preservation
  • accommodate integration (multiple sources)
  • express XML mappings commonly used in practice

[Figure: at the schema level, a schema embedding from S1 to S2; at the instance level, the derived information-preserving XML mapping from T1 to T2]
12
Schema embedding σ = (λ( ), path( ))
  • Given
  • source DTD S1 = (E1, P1, r1), target DTD S2 =
    (E2, P2, r2)
  • similarity matrix att( ) on element type names:
    att(A, B) in [0, 1] indicates how close A ∈ E1 is
    to B ∈ E2
  • Schema embedding σ = (λ( ), path( ))
  • λ: E1 → E2, type mapping; λ(r1) = r2 and att(A,
    λ(A)) > 0
  • path(A, B) maps an edge (A, B) in S1 to a unique
    path from λ(A) to λ(B) in S2: A1[position( ) =
    k1] / … / An[position( ) = kn]
  • path type: AND (OR, STAR) edge to AND (OR, STAR)
    path (AND path: AND/STAR edges; OR path: AND edges
    and at least one OR edge; STAR path: AND edges and
    a STAR edge)
  • Information capacity
  • prefix-free: if P1(A) = A1, …, An, path(A, Ai) is
    NOT a prefix of any path(A, Aj) for j ≠ i;
    similarly for P1(A) = A1 + … + An.
  • Type safety: valid mapping
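
The prefix-free condition is easy to check mechanically for one production. The sketch below is a minimal illustration, assuming paths are written as slash-separated strings and ignoring position( ) qualifiers. The function names are hypothetical, and the sample paths are the ones the running example assigns to the children of class.

```python
def is_prefix(p, q):
    """True if the step list p is a prefix of the step list q (p == q counts)."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """Check the prefix-free condition for one production.

    `paths` maps each child type to its path string, e.g. "basic/cno".
    No path may be a prefix of the path of a different child.
    """
    items = [p.split("/") for p in paths.values()]
    return not any(
        is_prefix(p, q)
        for i, p in enumerate(items)
        for j, q in enumerate(items)
        if i != j
    )


# Paths for the children of class in the running example.
class_paths = {
    "cno": "basic/cno",
    "title": "basic/semester/title",
    "type": "category",
}
ok = prefix_free(class_paths)

# A violating assignment: "basic" is a prefix of "basic/semester/title".
bad = prefix_free({"cno": "basic", "title": "basic/semester/title"})
```

Sharing a common prefix such as basic is allowed. What is ruled out is one child's whole path being a prefix of another's, since the image of the first child could then no longer be told apart from an intermediate node of the second.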

13
Example: schema embedding
  • edge/path type match
  • prefix-free

[Figure: schemas S1 and S2 over element types A and B, with edges labeled 1 and 2. Schema embedding: NO. Graph simulation: YES.]
Schema embedding is not a mild generalization of
graph simulation
14
Schema embedding: example
  • λ(db) = school, λ(class) = course
  • path(db, class) = courses/current/course
  • mapping an edge to a path
  • STAR edge to STAR path
  • Graph similarity? NO

[Figure: S1 (db, class) next to the top of S2 (school, courses, students, current, history, student, course, gpa, name, ssn, taking)]
15
Schema embedding: example
  • λ(type) = category, λ(A) = A
  • path(class, cno) = basic/cno
  • path(class, title) = basic/semester/title
  • path(class, type) = category
  • AND (STAR) edges to AND (STAR) paths
  • Relative path: relative to course

[Figure: S1 (class, cno, type, title) next to the S2 subgraph under course (category, basic, cno, credit, semester, term, year, title)]
16
Schema embedding: example
  • λ(X) = X
  • path(type, regular) = mandatory/regular
  • path(type, project) = advanced/project
  • path(regular, prereq) = required/prereq
  • path(prereq, class) = course

OR edges to OR paths
[Figure: S1 subgraphs under type and regular next to the S2 subgraphs under category (mandatory, advanced, project, seminar, lab, regular) and regular (required, gpa, prereq, course)]

17
Deriving instance-level mapping
  • Each schema embedding σ: S1 → S2 determines an
    XML mapping σd: I(S1) → I(S2)
  • Path types and prefix-free
  • Given an XML tree T1 of S1, σd(T1) constructs an
    instance T2 of S2, top-down, by mapping A-elements
    of T1 to λ(A)-nodes in T2
  • the root of T2 is mapped from the root of T1
  • for each λ(A)-element in T2 mapped from an
    A-element of T1, generate path(A, B) in T2 for
    each B-child of the A-element
  • when all the elements in T2 mapped from nodes in
    T1 are fully expanded, add necessary default
    elements to T2 such that T2 satisfies S2.
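
The top-down construction can be sketched for the running example as follows. This is a simplified illustration, not the authors' algorithm. It assumes the path table from the example slides, assumes the repeated step of a STAR path is its last step, and omits the final step that adds default elements, so the output is only a partial instance of the target schema. The names `LAM`, `PATH`, `S1_CHILDREN`, `embed`, `sigma_d`, and `invert` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Type mapping and path table taken from the example slides.
LAM = {"db": "school", "class": "course", "type": "category"}
PATH = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "title"): "basic/semester/title",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("type", "project"): "advanced/project",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}
# Children of each non-leaf source type (assumed S1 productions).
S1_CHILDREN = {
    "db": ["class"],
    "class": ["cno", "title", "type"],
    "type": ["regular", "project"],
    "regular": ["prereq"],
    "prereq": ["class"],
}


def embed(src, tgt):
    """Expand the image tgt of source node src, one child at a time."""
    tgt.text = src.text
    for child in src:
        steps = PATH[(src.tag, child.tag)].split("/")
        node = tgt
        for step in steps[:-1]:
            # Reuse an existing intermediate node so shared prefixes merge.
            found = node.find(step)
            node = found if found is not None else ET.SubElement(node, step)
        # The last step is always a fresh node: the image of the child.
        embed(child, ET.SubElement(node, steps[-1]))


def sigma_d(t1):
    """Map a source tree to a (partial) target tree."""
    t2 = ET.Element(LAM.get(t1.tag, t1.tag))
    embed(t1, t2)
    return t2


def invert(tgt, tag):
    """Rebuild the source element of type `tag` from its image tgt."""
    src = ET.Element(tag)
    if tag not in S1_CHILDREN:  # leaf: copy the text back
        src.text = tgt.text
        return src
    for child in S1_CHILDREN[tag]:
        for image in tgt.findall(PATH[(tag, child)]):
            src.append(invert(image, child))
    return src


t1 = ET.fromstring(
    "<db><class><cno>CS331</cno><title>DB</title><type><regular><prereq>"
    "<class><cno>CS240</cno><title>DS</title><type><project/></type></class>"
    "</prereq></regular></type></class></db>"
)
t2 = sigma_d(t1)
back = invert(t2, "db")
```

The round trip through `sigma_d` and `invert` returns the original tree in this example, which is the invertibility property in miniature. A full inverse would also have to ignore the default elements that the real construction adds.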

18
Properties of schema embedding
  • Theorem: the XML mapping σd: I(S1) → I(S2)
    derived from a schema embedding σ: S1 → S2 is
  • well defined (type safety),
  • invertible (with a quadratic-time inverse), and
  • query preserving w.r.t. regular XPath (query
    rewriting: linear-time data complexity,
    quadratic-time combined complexity)
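
The intuition behind query rewriting is that each child step from A to B in a source query is replaced by path(A, B). The sketch below applies that idea to a tiny query syntax tree. It is a simplification for illustration, not the rewriting algorithm behind the theorem: it covers only child steps, concatenation, star, and a text-equality qualifier, and the tuple encoding and the function `rewrite` are hypothetical.

```python
# Path table from the example slides.
PATH = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "title"): "basic/semester/title",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("type", "project"): "advanced/project",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}


def rewrite(q, ctx):
    """Rewrite query q, evaluated at source type ctx.

    Returns (target query string, source type reached by q).
    Query encoding (hypothetical):
      ("step", B)              child step to B
      ("seq", q1, q2)          q1/q2
      ("star", q)              (q)*
      ("text_eq", q, p, c)     q[p/text() = 'c']
    """
    kind = q[0]
    if kind == "step":
        return PATH[(ctx, q[1])], q[1]
    if kind == "seq":
        left, mid = rewrite(q[1], ctx)
        right, end = rewrite(q[2], mid)
        return f"{left}/{right}", end
    if kind == "star":
        body, end = rewrite(q[1], ctx)
        if end != ctx:
            raise ValueError("starred query must return to its start type")
        return f"({body})*", ctx
    if kind == "text_eq":
        base, end = rewrite(q[1], ctx)
        pred, _ = rewrite(q[2], end)
        return f"{base}[{pred}/text() = '{q[3]}']", end
    raise ValueError(f"unsupported query node: {kind}")


# Q1 = class[cno/text() = 'CS331'] / (type/regular/prereq/class)*
loop = ("seq", ("step", "type"),
        ("seq", ("step", "regular"),
         ("seq", ("step", "prereq"), ("step", "class"))))
Q1 = ("seq",
      ("text_eq", ("step", "class"), ("step", "cno"), "CS331"),
      ("star", loop))

Q2, end_type = rewrite(Q1, "db")
```

On this input the sketch reproduces the rewritten query Q2 from the query-language slide, up to whitespace.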

19
Integration: multiple sources
[Figure: multiple source schemas mapped into a single target schema S2]
  • λ(db) = school, λ(X) = X
  • path(db, student) = students/student
  • path(taking, cno) = cno

Pairwise disjoint path mappings from the source schemas to S2
20
Schema embedding vs. graph simulation
  • Definition
  • embedding: mapping edges to paths
  • simulation: mapping edges to edges
  • Restructuring
  • embedding: various DTD constructs, different
    structures
  • simulation: source and target schemas with
    similar structures
  • Information preservation for XML mappings
  • embedding: automatically guarantees both
    invertibility and query preservation w.r.t.
    regular XPath
  • simulation: no
  • Data integration
  • embedding: multiple source DTDs to a single
    target schema
  • simulation: no
  • A systematic method to define information-preserving
    XML mappings

21
Complexity: finding a schema embedding
  • Input: two DTD schemas S1 and S2, and a
    similarity matrix att( )
  • Output: a schema embedding σ: S1 → S2
    such that qual(σ, att) is maximal, if there is
    any
  • qual(σ, att) is the sum of att(A, λ(A)) for
    all A in S1
  • Theorem: it is NP-complete to determine whether
    or not there is a schema embedding from S1 to S2,
    even when S1 and S2 are nonrecursive and they
    consist of concatenation types only.
  • Efficient algorithms are necessarily heuristic.
  • Find a local embedding for each DTD production of
    S1
  • Assemble the local embeddings into a schema
    embedding
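
The quality measure is a plain sum over the type mapping. The sketch below computes it and checks the two stated requirements on the type mapping, namely that the root maps to the root and that every matched pair has positive similarity. The similarity values are invented for illustration, and the function name `qual` mirrors the slide but its signature is an assumption.

```python
def qual(lam, att, r1, r2):
    """Sum att(A, lam(A)) over all source types A.

    Raises ValueError if the root is not mapped to the root or if some
    pair has similarity 0, since such a mapping is not a valid embedding.
    """
    if lam.get(r1) != r2:
        raise ValueError("the source root must map to the target root")
    total = 0.0
    for a, b in lam.items():
        similarity = att.get((a, b), 0.0)
        if similarity <= 0:
            raise ValueError(f"att({a}, {b}) must be positive")
        total += similarity
    return total


# Hypothetical similarity values in [0, 1]; 1.0 means an exact name match.
att = {
    ("db", "school"): 0.5,
    ("class", "course"): 0.75,
    ("type", "category"): 0.5,
    ("cno", "cno"): 1.0,
}
lam = {"db": "school", "class": "course", "type": "category", "cno": "cno"}
score = qual(lam, att, "db", "school")
```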

22
Computing local embedding
  • Input: a production A → P(A) in source DTD S1,
    target schema S2
  • Output: σ0 = (λ0, path0), a partial embedding
    from P(A) to S2
  • Example: find λ0( ) from types in P(A) to types
    of S2, and path0( )

[Figure: fragment of S2 under category (mandatory, advanced, project, seminar, lab, regular)]
  • If λ0 is given: an O(|P(A)| |S2|) algorithm
    findPath to find a local embedding (depth-first
    search, checking each S2 subtree only once)
  • When λ0 is not fixed, the local embedding
    problem is NP-hard
  • Heuristic: randomized findPath to find both λ0
    and path0 (randomly picking a possible type-node
    match in the search)
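
When the type mapping is fixed, the search for paths is a depth-first traversal of the target schema that expands each node once. The sketch below shows only that traversal. It is not the findPath algorithm itself: it ignores path types, position qualifiers, and the prefix-free check, the edges of the S2 fragment are assumed from the figure labels, and the name `find_paths` is hypothetical.

```python
# Assumed fragment of the target schema graph under category.
S2 = {
    "category": ["mandatory", "advanced"],
    "mandatory": ["regular"],
    "advanced": ["project", "seminar", "lab"],
}


def find_paths(schema, start, targets):
    """Depth-first search from start; record the first path to each target.

    Every node is expanded at most once, so a single call does work
    proportional to the size of the schema fragment.
    """
    found = {}
    visited = set()

    def dfs(node, trail):
        for child in schema.get(node, []):
            step = trail + [child]
            if child in targets and child not in found:
                found[child] = "/".join(step)
            if child not in visited:
                visited.add(child)
                dfs(child, step)

    dfs(start, [])
    return found


# Source production: type with children regular and project, identity mapping.
paths = find_paths(S2, "category", {"regular", "project"})
```

With this fragment the search returns the same two paths that the running example assigns to the edges leaving type.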

23
Assembling local embeddings
  • Input: C(A), a set of local embeddings for each A
    in the source DTD (initialized via randomized
    findPath), and a target schema S2
  • Output: σ = (λ, path), a schema embedding from
    S1 to S2, if any
  • Theorem: the assemble-embedding problem is
    NP-complete even when S1 and S2 are nonrecursive.
  • Conflicts: type mapping, prefix-free
  • Three heuristic algorithms
  • (1) Fix an order O on S1 types via qual( ), pick a
    local embedding σA from C(A) in O, and increment
    σ with σA if there is no conflict
  • (2) Assume a random order O on S1 types, then do the
    same as (1)
  • (3) Reduction to the MAX-Weight-Independence-Set
    problem, leveraging an existing tool for that
    problem.
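
The first two heuristics share one greedy loop and differ only in how the order is chosen. The sketch below shows that loop under simplifying assumptions: a local embedding is a pair of dictionaries, only type-mapping conflicts are checked (each local embedding is assumed to be prefix-free already), and the loop gives up as soon as some type has no conflict-free candidate. The function `assemble` and the sample candidates are hypothetical.

```python
def assemble(order, candidates):
    """Greedily merge local embeddings, visiting source types in `order`.

    candidates[A] is a list of (lam0, path0) pairs for the production of A.
    A candidate conflicts if it maps an already-mapped type elsewhere.
    Returns (lam, path), or None if some type has no usable candidate.
    """
    lam, path = {}, {}
    for a in order:
        for lam0, path0 in candidates[a]:
            if all(lam.get(t, image) == image for t, image in lam0.items()):
                lam.update(lam0)
                path.update(path0)
                break
        else:
            return None  # every candidate for `a` conflicts
    return lam, path


candidates = {
    "db": [
        ({"db": "school", "class": "course"},
         {("db", "class"): "courses/current/course"}),
    ],
    "class": [
        # Conflicts with the db candidate on the image of class.
        ({"class": "student", "cno": "ssn"}, {("class", "cno"): "ssn"}),
        ({"class": "course", "cno": "cno"}, {("class", "cno"): "basic/cno"}),
    ],
}

good = assemble(["db", "class"], candidates)
stuck = assemble(["class", "db"], candidates)
```

The second call fails because the first candidate for class is accepted before db is considered, which shows why the choice of order matters for these heuristics.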

24
Experimental evaluation
  • Benchmark
  • XMark (99 type nodes in its original form)
  • Real-life DTDs: SIGMOD (13), PSD (121), mondial
    (70), etc.
  • Generating target schemas by adding noise:
    changing edges to paths, mutating names,
    inserting new subtrees.
  • selectivity/accuracy of att( ): in [0, 1] (1.0:
    exact match)
  • Target schemas with a noise parameter of 75%:
    XMark (581-748), SIGMOD (54-96), PSD (712-820),
    mondial (395-496)
  • System
  • 933 MHz/1.0 GHz Pentium III, 256 MB memory
  • QUALEX: a tool for MAX-Weight-Independence-Set
  • Algorithms implemented in Java

25
Experimental result: target size
XMark (acc = 0.75). RandomOrder and
MAXSet-Reduction perform well
26
Experimental result: running time required
XMark (acc = 0.75). In seconds for schemas of
hundreds of nodes
27
Experimental result: different source schemas
Various source schemas (acc = 0.75). RandomOrder
finds solutions more than 90% of the time, in
seconds
28
Summary
  • Information preservation: the first study for XML
    mappings
  • more intriguing than its relational counterparts:
    separation, equivalence, complexity of
    invertibility and query preservation
  • important for data exchange, migration,
    integration, P2P, …
  • Schema embedding
  • mapping edges to paths
  • captures various DTD constructs, supports
    restructuring
  • automatically guarantees information preservation
  • accommodates multiple sources to a single target
  • NP-complete, but with efficient and effective
    heuristics
  • A practical solution for finding
    information-preserving XML mappings