XML Data Management 10. Deterministic DTDs and Schemas

1
XML Data Management 10. Deterministic DTDs and Schemas
  • Werner Nutt

2
How Expressive can a Schema Be?
This schema is a frequent example in teaching
material on XML Schema
<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>

<xsd:element name="A" type="oneB"/>

<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>
What would documents look like that satisfy this
schema?
Arbitrarily deep binary trees of A elements that
contain a single B element
How would one check validity? What would be the
cost? What are the pros and cons of allowing such
schemas?
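One possible answer to the validity question is a bottom-up pass that computes, for every element, the set of types it could have. The following is only a sketch under assumptions: it reads the schema as "an A of type xsd:string has no child elements", it ignores namespaces, and the function names possible_types and is_valid are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def possible_types(node):
    """Bottom-up: the set of schema types that `node` could have."""
    kids = list(node)
    if not kids:
        # No child elements: only simple string content fits (assumption).
        return {"xsd:string"}
    names = [k.tag for k in kids]
    kid_types = [possible_types(k) for k in kids]
    result = set()
    if names == ["B"] and "xsd:string" in kid_types[0]:
        result.add("oneB")                      # choice branch: B
    if names == ["A"] and "xsd:string" in kid_types[0]:
        result.add("onlyAs")                    # choice branch: A (string)
    if names == ["A", "A"]:
        left, right = kid_types
        if "onlyAs" in left and "onlyAs" in right:
            result.add("onlyAs")                # (A onlyAs, A onlyAs)
        if ("onlyAs" in left and "oneB" in right) or \
           ("oneB" in left and "onlyAs" in right):
            result.add("oneB")                  # the B sits in exactly one subtree
    return result

def is_valid(xml_text):
    """The root must be an A element that can be typed oneB."""
    root = ET.fromstring(xml_text)
    return root.tag == "A" and "oneB" in possible_types(root)
```

In this sketch the type of an A element is only known after its whole subtree has been inspected, which is exactly the kind of lookahead that a top-down validator cannot afford.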
3
Let's see what SAXON says
4
Here is the Full Error Message from Eclipse
  • cos-element-consistent: Error for type 'oneB'.
    Multiple elements with name 'A', with different
    types, appear in the model group.
  • cos-element-consistent: Error for type 'onlyAs'.
    Multiple elements with name 'A', with different
    types, appear in the model group.
  • cos-nonambig: A and A (or elements from their
    substitution group) violate "Unique Particle
    Attribution". During validation against this
    schema, ambiguity would be created for those two
    particles.
  • cos-nonambig: A and A (or elements from their
    substitution group) violate "Unique Particle
    Attribution". During validation against this
    schema, ambiguity would be created for those two
    particles.

cos-element-consistent: i.e., in a given context,
elements with the same name must have the same
type. Easy to check!
cos-nonambig: that's more subtle ...
5
The Country Example in XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.org/country"
    xmlns="http://www.example.org/country"
    elementFormDefault="qualified">
  <xsd:element name="country">
    <xsd:complexType>
      <xsd:choice>
        <xsd:element name="king" type="xsd:string"></xsd:element>
        <xsd:element name="queen" type="xsd:string"></xsd:element>
        <xsd:sequence>
          <xsd:element name="king" type="xsd:string"></xsd:element>
          <xsd:element name="queen" type="xsd:string"></xsd:element>
        </xsd:sequence>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

As DTD: <!ELEMENT country (king | queen | (king,queen))>
6
This schema is rejected as well
  • cos-nonambig: king and king (or elements from
    their substitution group) violate "Unique
    Particle Attribution". During validation against
    this schema, ambiguity would be created for those
    two particles.

Let's check what this means!
7
What the W3C Standard Explains
  • Schema Component Constraint: Unique Particle
    Attribution
  • A content model must be formed such that during
    validation of an element information item
    sequence, the particle contained directly,
    indirectly or implicitly therein with which to
    attempt to validate each item in the sequence
    in turn can be uniquely determined without
    examining the content or attributes of that item,
    and without any information about the items in
    the remainder of the sequence.
  • http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

8
Questions and Ideas
  • Questions
  • How can one make the standard formal?
  • How can a validator implement the standard?
  • Ideas
  • Content models are specified by regular
    expressions
  • A regular expression E can be translated into a
    finite state automaton A (Glushkov
    automaton) that checks which strings satisfy E
  • ⇒ Construct A from E and check whether A is
    deterministic

9
Formalization
  • Alphabet Σ (i.e., set of symbols): the element
    names occurring in the content model
  • Regular expressions over Σ are generated with the
    rule
  • e, f → a | (e · f) | (e | f) | (e)* | (e)+
  • where e, f are expressions and a ∈ Σ
  • Language L(e) of an expression e (inductively
    defined)
  • Exercise: Which of the following are in the
    language defined by a · (b | c) · a ?
  • aba
  • abca
  • aab
  • aaacaaa

In the following, we denote concatenation by a
dot, no longer by a comma.

10
Regular Expressions and DTDs
  • These are formalizations of DTDs and validation
  • A DTD is a pair (d, s) where
  • s ∈ Σ is the start symbol
  • d maps every Σ-symbol to a regular expression
    over Σ
  • A document tree t satisfies d (t is valid wrt d)
    iff
  • the root of t is labeled s
  • for every node n in t, with symbol a, the string
    formed by the names of the children of n
    satisfies d(a)
  • ⇒ Validation is checking whether a string
    satisfies a regexp
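The definition of validity can be tried out with a small sketch. It assumes a hypothetical encoding in which d is a Python dict from element names to regular expressions over child names, each name followed by one space, and it uses Python's re module as the string matcher rather than an automaton. The names DTD, START, satisfies and is_valid are invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD (d, s): each name in a content model is followed by a
# single space so that adjacent names cannot run into each other.
DTD = {
    "country": r"(king |queen |king queen )",
    "king": r"",
    "queen": r"",
}
START = "country"

def satisfies(node, d):
    """Every node's string of child names must match d(label of the node)."""
    if node.tag not in d:
        return False
    children = "".join(child.tag + " " for child in node)
    if re.fullmatch(d[node.tag], children) is None:
        return False
    return all(satisfies(child, d) for child in node)

def is_valid(xml_text, d=DTD, s=START):
    """t is valid wrt (d, s): the root is labeled s and all nodes satisfy d."""
    root = ET.fromstring(xml_text)
    return root.tag == s and satisfies(root, d)
```

For example, is_valid("<country><king/><queen/></country>") holds, while swapping the two children makes the document invalid.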

11
Markings
  • Distinguish between the different occurrences of
    a symbol in a regexp by using numbers: markings
    of regexps
  • Examples
  • a1 · (b2 | c3) · a4 is a marking of
    a · (b | c) · a
  • king1 | queen2 | king3 · queen4 is a marking
    of
    king | queen | king · queen
  • Definition
  • A marking e' of a regular expression e is an
    assignment of numbers to every symbol occurrence
    in e such that no marked symbol occurs twice.

12
Unmarked Version
  • Consider a regular expression e and a marking e'
    of e
  • Definition
  • For w ∈ L(e'), we denote by w# the
    corresponding unmarked string in L(e).
  • Example
  • If w = b2a1a3, then w# = baa
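Marking and unmarking are simple enough to sketch in a few lines. The version below assumes that symbols consist of letters only and that expressions and words are plain strings; the function names mark and unmark are chosen for this illustration.

```python
import itertools
import re

def mark(expr):
    """Append a running number to every symbol occurrence in `expr`."""
    counter = itertools.count(1)
    return re.sub(r"[A-Za-z]+", lambda m: m.group(0) + str(next(counter)), expr)

def unmark(word):
    """Drop all numbers again, giving the unmarked string."""
    return re.sub(r"\d+", "", word)
```

With these definitions, mark("king | queen | king . queen") gives "king1 | queen2 | king3 . queen4", and unmark("b2a1a3") gives "baa".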

13
Unique Particle Attribution Formalization

Brüggemann-Klein/Wood 1998
  • Definition: A regular expression r is
    deterministic iff
  • there are no strings uxv, uyw ∈ L(r') with
  • |x| = |y| = 1,
  • x ≠ y (x and y are
    different marked symbols),
  • x# = y# (their
    unmarking is the same).
  • Example: (a | b)* a is not deterministic because
    there are
  • the marking ((a1 | b2)* a3)
  • the strings b2 a1 a3 (u = b2, x = a1, v = a3)
    and b2 a3 (u = b2, y = a3, w = ε)

How can we check whether r is deterministic?
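Before turning to automata, the definition itself can be tested by brute force on a finite sample of marked words. This is only an illustration: a sample can reveal a witness pair uxv, uyw, but it can never prove that an expression is deterministic. Words are assumed to be tuples of marked symbols such as "a1", and find_witness is a name made up here.

```python
import re

def unmark(symbol):
    """Strip the trailing number from a marked symbol, e.g. 'a3' -> 'a'."""
    return re.sub(r"\d+$", "", symbol)

def find_witness(words):
    """Return (u, x, y) such that u+x... and u+y... are both in `words`,
    x != y, and x, y have the same unmarking; return None otherwise."""
    for w1 in words:
        for w2 in words:
            for i, (x, y) in enumerate(zip(w1, w2)):
                if x != y:
                    # u is the longest common prefix of w1 and w2
                    if unmark(x) == unmark(y):
                        return w1[:i], x, y
                    break
    return None
```

For the two strings of the example, find_witness([("b2", "a1", "a3"), ("b2", "a3")]) returns (("b2",), "a1", "a3").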
14
Finite State Automata
  • Regular languages can also be defined using
    automata
  • A finite state automaton (FSA) consists of
  • a set of states Q
  • an alphabet Σ (i.e., a set of symbols)
  • a transition function δ, which maps every pair
    (q,a) to a set of states
  • an initial state q0
  • a set of accepting states F
  • A word a1…an is in the language defined by an
    automaton if there is a path from q0 to a state
    in F with edges labeled a1,…,an

The automaton is deterministic if every pair
(q,a) is mapped to at most one state
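A minimal sketch of these definitions, assuming the transition function is stored as a Python dict from (state, symbol) pairs to sets of states. The automaton DELTA is a hypothetical example that accepts the words over {a, b} ending in a; it is not taken from the slides.

```python
def accepts(delta, q0, accepting, word):
    """Follow all paths at once by tracking the set of reachable states."""
    current = {q0}
    for symbol in word:
        current = {q2 for q in current
                   for q2 in delta.get((q, symbol), set())}
    return bool(current & accepting)

def is_deterministic(delta):
    """Every pair (q, a) is mapped to at most one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical automaton: stay in q0, or guess that the current a is the last one.
DELTA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
}
Q0 = "q0"
F = {"q1"}
```

Here accepts(DELTA, Q0, F, "ba") is true, accepts(DELTA, Q0, F, "ab") is false, and is_deterministic(DELTA) is false because ("q0", "a") has two successor states.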

15
Which Language Does this FSA Define?
16
Non-Deterministic Automata
  • An automaton is non-deterministic if there is a
    state q and a letter a such that there are at
    least two transitions from q via edges labeled
    with a
  • What words are in the language of a
    non-deterministic automaton?
  • We now create a Glushkov automaton from a
    regular expression

17
Creating a Glushkov Automaton from a Regular
Expression
Step 1: Create a marking of the expression
a · (b | c) · a
18
Creating a Glushkov Automaton from a Regular
Expression
Step 2: Create a state q0 and create a
state for each subscripted letter
a1 · (b1 | c1) · a2
Step 3: Choose as accepting states all
subscripted letters with which it is possible to
end a word
How do we find these states?

19
Creating a Glushkov Automaton from a Regular
Expression
Step 4: Create a transition from a state li to a
state kj if there is a word in which kj follows
li. Label the transition with k
a1 · (b1 | c1) · a2
How do we find these transitions?
20
Exercises
  • What are the Glushkov automata of
  • a · b · (a · b)
  • (a | b)* · a · (a | b)
  • (a | b)* · a ?
21
Recognizing Deterministic Regular Expressions
  • Theorem (Book et al 1971, Brüggemann-Klein, Wood,
    1998)
  • A regular expression is deterministic
    (one-unambiguous) iff its Glushkov automaton is
    deterministic.

22
Construction of the Glushkov Automaton
  • For an arbitrary alphabet Σ and a language L ⊆ Σ*
  • we define two sets
  • first(L) = {a ∈ Σ | ∃ u ∈ Σ*. a·u ∈ L}
  • last(L) = {a ∈ Σ | ∃ u ∈ Σ*. u·a ∈ L}
  • and the function
  • follow(L,a) = {b ∈ Σ | ∃ u,v ∈ Σ*. u·a·b·v ∈
    L}.
  • Consider an expression e and its marking e'
  • We can construct the Glushkov automaton for e if
    we know
  • the sets first(L(e')), last(L(e')),
  • the function follow(L(e'), ·),
  • and if we know whether ε ∈ L(e')
    (ε is the empty word).

Why?
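The definitions of first, last, and follow can be read off directly when the language is a finite set of words. The sketch below assumes that words are tuples of marked symbols; since L(e') is usually infinite, it only illustrates the definitions and is not a way to build the automaton. The sample language L is a hypothetical one, consisting of the two words of a1 · (b2 | c3) · a4.

```python
def first(L):
    """Symbols with which some word of L starts."""
    return {w[0] for w in L if w}

def last(L):
    """Symbols with which some word of L ends."""
    return {w[-1] for w in L if w}

def follow(L, a):
    """Symbols that come directly after a in some word of L."""
    return {w[i + 1] for w in L for i in range(len(w) - 1) if w[i] == a}

def has_empty_word(L):
    """Is the empty word in L?"""
    return () in L

# Hypothetical sample: the words of a1 . (b2 | c3) . a4
L = {("a1", "b2", "a4"), ("a1", "c3", "a4")}
```

These four pieces of information determine the automaton: first(L) gives the transitions leaving q0, follow(L, a) gives the transitions leaving state a, last(L) gives the accepting states, and q0 is accepting exactly if the empty word is in L.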
23
Construction of the Glushkov Automaton
  • Where do we get this info?
  • If e' = a1, then
  • first(L(e')) = {a1}
  • last(L(e')) = {a1}
  • follow(L(e'), li) = ∅ for every marked symbol li
  • Also, ε ∉ L(e')
  • If e' = (f | g), then
  • first(L(e')) = first(L(f)) ∪ first(L(g))
  • last(L(e')) = last(L(f)) ∪ last(L(g))
  • follow(L(e'), li) = follow(L(f), li) if li
    occurs in f, and follow(L(g), li) if li occurs
    in g
  • Also, ε ∈ L(e') if ε ∈ L(f) or ε ∈ L(g)

For e' = f*, f+, f·g: exercise!
24
Construction of the Glushkov Automaton
  • If e' = (f · g), then
  • first(L(e')) = first(L(f)) ∪ first(L(g)) if ε ∈
    L(f), and first(L(f)) otherwise
  • last(L(e')) = last(L(f)) ∪ last(L(g)) if ε ∈
    L(g), and last(L(g)) otherwise
  • follow(L(e'), li) = follow(L(f), li) if li in f
    but not li ∈ last(L(f));
    follow(L(f), li) ∪ first(L(g)) if li ∈ last(L(f));
    follow(L(g), li) if li in g
  • Also, ε ∈ L(e') if ε ∈ L(f) and ε ∈ L(g)

25
Construction of the Glushkov Automaton
  • If e' = (f)*, then
  • first(L(e')) = first(L(f))
  • last(L(e')) = last(L(f))
  • follow(L(e'), li) = follow(L(f), li) if li in f
    but not li ∈ last(L(f))
  • follow(L(e'), li) = follow(L(f), li) ∪
    first(L(f)) if li ∈ last(L(f))
  • Also, ε ∈ L(e') in any case (for (f)+, we have
    ε ∈ L(e') iff ε ∈ L(f))
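The inductive rules for symbols, union, concatenation, and star can be put together into a short sketch. It assumes a hypothetical encoding of marked expressions as nested tuples ("sym", (name, number)), ("alt", f, g), ("cat", f, g), and ("star", f); the function names and the three sample expressions are made up for this illustration, and the + operator is left out.

```python
def sym(name, index):
    """A marked symbol such as a1."""
    return ("sym", (name, index))

def glushkov_sets(e):
    """Return (has_empty_word, first, last, follow) for a marked expression."""
    kind = e[0]
    if kind == "sym":
        p = e[1]
        return False, {p}, {p}, {p: set()}
    if kind == "star":
        _, first, last, follow = glushkov_sets(e[1])
        for p in last:                      # after a last symbol, f may restart
            follow[p] = follow[p] | first
        return True, first, last, follow
    nf, first_f, last_f, follow_f = glushkov_sets(e[1])
    ng, first_g, last_g, follow_g = glushkov_sets(e[2])
    follow = {**follow_f, **follow_g}
    if kind == "alt":
        return nf or ng, first_f | first_g, last_f | last_g, follow
    if kind == "cat":
        for p in last_f:                    # after a last symbol of f, g may start
            follow[p] = follow[p] | first_g
        first = (first_f | first_g) if nf else first_f
        last = (last_f | last_g) if ng else last_g
        return nf and ng, first, last, follow
    raise ValueError(kind)

def is_deterministic(e):
    """The Glushkov automaton is deterministic iff no state (q0 or a marked
    symbol) has two successors with the same unmarked name."""
    _, first, _, follow = glushkov_sets(e)
    for targets in [first, *follow.values()]:
        names = [name for name, _ in targets]
        if len(names) != len(set(names)):
            return False
    return True

# (a1 | b2)* . a3
NOT_DET = ("cat", ("star", ("alt", sym("a", 1), sym("b", 2))), sym("a", 3))
# king1 | queen2 | king3 . queen4
COUNTRY = ("alt", sym("king", 1),
           ("alt", sym("queen", 2), ("cat", sym("king", 3), sym("queen", 4))))
# a1 . (b2 | c3)* . a4
DET = ("cat", sym("a", 1),
       ("cat", ("star", ("alt", sym("b", 2), sym("c", 3))), sym("a", 4)))
```

In this sketch is_deterministic reports False for NOT_DET and COUNTRY, because q0 has two successors named a (respectively king), and True for DET.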

26
Recognizing Deterministic Regular Expressions
  • Observation
  • For each operator, first, last, and follow can be
    computed in quadratic time.
  • ⇒ This yields an O(n³) algorithm.
  • Theorem (Brüggemann-Klein, Wood, 1998)
  • There is an O(n²) algorithm to check whether a
    regexp is deterministic.

27
More Results
  • Theorems (Brüggemann-Klein, Wood, 1998)
  • Not every regular language can be denoted by a
    deterministic regular expression.
  • E.g.,
    (a | b)* a (a | b)
  • Deterministic regular languages are not closed
    under union, concatenation, or Kleene-star.
  • I.e., there is no easy syntactic
    characterization
  • If it exists, an equivalent deterministic regular
    expression can be constructed in exponential
    time.
  • It is possible to help users, but
    that is costly

28
Theory for XML Schema
  • XML schema allows schemas where
  • the same element appears with different types
  • However,
  • it is illegal to have two elements of the same
    name, but different types in one content model.
  • Also, content models must be deterministic.
  • Consequence
  • Documents can be validated in a deterministic
    top-down pass

29
References
  • This material draws upon slides by
  • Sara Cohen
  • Frank Neven
  • notes by
  • Leonid Libkin
  • and the papers by A. Brüggemann-Klein and D. Wood