Title: XML Data Management 10. Deterministic DTDs and Schemas
1 XML Data Management 10. Deterministic DTDs and Schemas
2 How Expressive Can a Schema Be?
This schema is a frequent example in teaching
material on XML Schema
<xsd:complexType name="oneB">
  <xsd:choice>
    <xsd:element name="B" type="xsd:string"/>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="oneB"/>
    </xsd:sequence>
    <xsd:sequence>
      <xsd:element name="A" type="oneB"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
  </xsd:choice>
</xsd:complexType>

<xsd:element name="A" type="oneB"/>

<xsd:complexType name="onlyAs">
  <xsd:choice>
    <xsd:sequence>
      <xsd:element name="A" type="onlyAs"/>
      <xsd:element name="A" type="onlyAs"/>
    </xsd:sequence>
    <xsd:element name="A" type="xsd:string"/>
  </xsd:choice>
</xsd:complexType>
What would documents look like that satisfy this
schema?
Arbitrarily deep binary trees of A elements that
contain a single B element
How would one check validity? What would be the
cost? What are the pros and cons of allowing such
schemas?
3 Let's See What SAXON Says
4 Here Is the Full Error Message from Eclipse
- cos-element-consistent: Error for type 'oneB'.
Multiple elements with name 'A', with different
types, appear in the model group.
- cos-element-consistent: Error for type 'onlyAs'.
Multiple elements with name 'A', with different
types, appear in the model group.
- cos-nonambig: A and A (or elements from their
substitution group) violate "Unique Particle
Attribution". During validation against this
schema, ambiguity would be created for those two
particles.
- cos-nonambig: A and A (or elements from their
substitution group) violate "Unique Particle
Attribution". During validation against this
schema, ambiguity would be created for those two
particles.
cos-element-consistent: I.e., in a given context,
elements with the same name must have the same
type. Easy to check!
cos-nonambig: That's more subtle ...
5 The Country Example in XML Schema
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.org/country"
    xmlns="http://www.example.org/country"
    elementFormDefault="qualified">
  <xsd:element name="country">
    <xsd:complexType>
      <xsd:choice>
        <xsd:element name="king" type="xsd:string"></xsd:element>
        <xsd:element name="queen" type="xsd:string"></xsd:element>
        <xsd:sequence>
          <xsd:element name="king" type="xsd:string"></xsd:element>
          <xsd:element name="queen" type="xsd:string"></xsd:element>
        </xsd:sequence>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
As DTD: <!ELEMENT country (king | queen | (king, queen))>
6 Also This Is Not Validated
- cos-nonambig: king and king (or elements from
their substitution group) violate "Unique
Particle Attribution". During validation against
this schema, ambiguity would be created for those
two particles.
Let's check what this means!
7 What the W3C Standard Explains
- Schema Component Constraint: Unique Particle
Attribution
- A content model must be formed such that during
validation of an element information item
sequence, the particle contained directly,
indirectly or implicitly therein with which to
attempt to validate each item in the sequence
in turn can be uniquely determined without
examining the content or attributes of that item,
and without any information about the items in
the remainder of the sequence.
- http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig
8 Questions and Ideas
- Questions
- How can one make the standard formal?
- How can a validator implement the standard?
- Ideas
- Content models are specified by regular
expressions
- A regular expression E can be translated into a
finite state automaton A (the Glushkov automaton)
that checks which strings satisfy E
- ⇒ Construct A from E and check whether A is
deterministic
9 Formalization
- Alphabet Σ (i.e., set of symbols): the element
names occurring in the content model
- Regular expressions over Σ are generated with the
rule
- e, f → a | (e · f) | (e + f) | (e)* | (e)+
- where e, f are expressions and a ∈ Σ
- Language L(e) of an expression e (inductively
defined)
- Exercise: Which of the following are in the
language defined by a · (b + c)* · a?
- aba
- abca
In the following, we denote concatenation by a
dot, no longer by a comma.
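The exercise can be checked mechanically. The following is a minimal sketch using Python's `re` module, assuming the expression reads a · (b + c)* · a. Python writes union as `|` and concatenation by juxtaposition, and the helper name `in_language` is chosen here for illustration.

```python
import re

# Assumed reading of the slide's expression: a . (b + c)* . a
# In Python's regex syntax, union is "|" and concatenation is juxtaposition.
PATTERN = re.compile(r"a(b|c)*a")

def in_language(word: str) -> bool:
    """Return True iff the whole word belongs to L(a . (b + c)* . a)."""
    return PATTERN.fullmatch(word) is not None

for w in ["aba", "abca", "ab"]:
    print(w, in_language(w))
```

`fullmatch` is used rather than `search` because membership in L(e) concerns the entire string, not a substring.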
10 Regular Expressions and DTDs
- These are formalizations of DTDs and validation
- A DTD is a pair (d, s) where
- s ∈ Σ is the start symbol
- d maps every Σ-symbol to a regular expression
over Σ
- A document tree t satisfies d (t is valid wrt d)
iff
- the root of t is labeled s
- for every node n in t, with symbol a, the string
formed by the names of the children of n
satisfies d(a)
- ⇒ Validation is checking whether a string
satisfies a regexp
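The definition above can be sketched directly: a DTD becomes a dictionary from element names to regular expressions, and validation matches each node's string of child names against the expression for its label. This is only an illustration under stated assumptions: the names `D`, `START` and `is_valid` are hypothetical, child names are encoded as space-terminated tokens, and the content models for king and queen (no child elements) are assumed, not taken from the slides.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD (d, s). Each content model is a Python regex over
# space-terminated child names. The country model mirrors
# (king | queen | (king, queen)); the leaf models are assumptions.
START = "country"
D = {
    "country": r"(king |queen |king queen )",
    "king": r"",
    "queen": r"",
}

def is_valid(root: ET.Element) -> bool:
    """Check the root label, then every node's child-name string against d."""
    if root.tag != START:
        return False
    stack = [root]
    while stack:
        node = stack.pop()
        model = D.get(node.tag)
        if model is None:
            return False  # element name not declared in the DTD
        children = "".join(child.tag + " " for child in node)
        if re.fullmatch(model, children) is None:
            return False
        stack.extend(node)  # visit the children as well
    return True

print(is_valid(ET.fromstring("<country><king/><queen/></country>")))
print(is_valid(ET.fromstring("<country><queen/><king/></country>")))
```

Python's `re` engine backtracks, so it accepts non-deterministic content models such as this one; the determinism requirement discussed in these slides is what lets a validator avoid backtracking.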
11 Markings
- Distinguish between the different occurrences of
a symbol in a regexp by using numbers: markings
of regexps
- Examples
- a1 · (b2 + c3)* · a4 is a marking of
a · (b + c)* · a
- king1 + queen2 + king3 · queen4 is a marking of
king + queen + king · queen
- Definition
- A marking e′ of a regular expression e is an
assignment of numbers to every symbol in e.
12 Unmarked Version
- Consider a regular expression e and a marking e′
of e
- Definition
- For w ∈ L(e′), we denote by w# the
corresponding unmarked string in L(e).
- Example
- If w = b2 a1 a3, then w# = baa
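Marking and unmarking are simple string operations. The sketch below assumes that symbols are alphabetic names and that expressions are written in ASCII, with `.` standing for the concatenation dot. The function names `mark` and `unmark` are illustrative only.

```python
import re

def mark(expr: str) -> str:
    """Number every symbol occurrence from left to right."""
    counter = 0

    def number(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"{match.group(0)}{counter}"

    # Every maximal run of letters is treated as one symbol.
    return re.sub(r"[A-Za-z]+", number, expr)

def unmark(word: list[str]) -> list[str]:
    """Drop the trailing number from each marked symbol of a string."""
    return [re.sub(r"\d+$", "", symbol) for symbol in word]

print(mark("a.(b+c)*.a"))
print(mark("king+queen+king.queen"))
print(unmark(["b2", "a1", "a3"]))
```

Numbering left to right is just one possible marking; any assignment that tells the occurrences of a symbol apart would serve equally well.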
13 Unique Particle Attribution: Formalization
(Brüggemann-Klein/Wood 1998)
- Definition: A regular expression r is
deterministic iff
- there are no strings uxv, uyw ∈ L(r′) with
- |x| = |y| = 1
- x ≠ y (x and y are different marked symbols)
- x# = y# (their unmarking is the same).
- Example: (a + b)* · a is not deterministic because
there are
- the marking (a1 + b2)* · a3
- the strings b2 a1 a3 = u x v and b2 a3 = u y w,
with u = b2, x = a1, v = a3, y = a3, w = ε
How can we check whether e is deterministic?
14 Finite State Automata
- Regular languages can also be defined using
automata
- A finite state automaton (FSA) consists of
- a set of states Q
- an alphabet Σ (i.e., a set of symbols)
- a transition function δ, which maps every pair
(q, a) to a set of states
- an initial state q0
- a set of accepting states F
- A word a1…an is in the language defined by an
automaton if there is a path from q0 to a state
in F with edges labeled a1, …, an
The automaton is deterministic if every pair
(q, a) is mapped to at most one state
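As an illustration of these definitions, here is a minimal sketch that represents δ as a dictionary from (state, symbol) pairs to sets of states. The example automaton `DELTA` is hypothetical (it is not the one drawn on the slides); it accepts the words over {a, b} that end in a.

```python
from typing import Dict, Iterable, Set, Tuple

Delta = Dict[Tuple[str, str], Set[str]]

def accepts(delta: Delta, q0: str, finals: Set[str], word: Iterable[str]) -> bool:
    """Follow all paths labeled by the word; accept if one ends in a final state."""
    current = {q0}
    for a in word:
        current = {q2 for q in current for q2 in delta.get((q, a), set())}
    return bool(current & finals)

def is_deterministic(delta: Delta) -> bool:
    """Every pair (q, a) is mapped to at most one state."""
    return all(len(targets) <= 1 for targets in delta.values())

# Hypothetical example: words over {a, b} that end in a.
DELTA: Delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
}
FINALS = {"q1"}

print(accepts(DELTA, "q0", FINALS, "ba"))
print(accepts(DELTA, "q0", FINALS, "ab"))
print(is_deterministic(DELTA))
```

Tracking the whole set of reachable states handles non-deterministic automata as well; for a deterministic automaton that set never contains more than one state.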
15 Which Language Does this FSA Define?
16 Non-Deterministic Automata
- An automaton is non-deterministic if there is a
state q and a letter a such that there are at
least two transitions from q via edges labeled
with a
- What words are in the language of a
non-deterministic automaton?
- We now create a Glushkov automaton from a
regular expression
17 Creating a Glushkov Automaton from a Regular
Expression
Step 1: Create a marking of the expression
a · (b + c)* · a
18 Creating a Glushkov Automaton from a Regular
Expression
Step 2: Create a state q0 and create a
state for each subscripted letter
a1 · (b1 + c1)* · a2
Step 3: Choose as accepting states all
subscripted letters with which it is possible to
end a word
How do we find these states?
19 Creating a Glushkov Automaton from a Regular
Expression
Step 4: Create a transition from a state li to a
state kj if there is a word in which kj follows
li, and from q0 to kj if a word can begin with
kj. Label the transition with k
a1 · (b1 + c1)* · a2
How do we find these transitions?
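One way to answer this question is to read the transitions off three pieces of information: the letters that can begin a word, those that can end one, and which letters can follow which. The sketch below writes these sets down by hand for a1 · (b1 + c1)* · a2 (assuming the starred reading of the example) and assembles the automaton from them. The names `FIRST`, `LAST`, `FOLLOW` and `glushkov` are chosen here for illustration.

```python
import re

# Hand-computed sets for the marked expression a1 . (b1 + c1)* . a2
# (assumption: the example expression contains a star).
FIRST = {"a1"}
LAST = {"a2"}
FOLLOW = {
    "a1": {"b1", "c1", "a2"},
    "b1": {"b1", "c1", "a2"},
    "c1": {"b1", "c1", "a2"},
    "a2": set(),
}
NULLABLE = False  # the empty word is not in the language

def letter(position: str) -> str:
    """Unmark a subscripted letter, e.g. 'b1' -> 'b'."""
    return re.sub(r"\d+$", "", position)

def glushkov(first, last, follow, nullable):
    """Build the transition table and accepting states (Steps 2 to 4)."""
    delta = {}
    for p in first:  # transitions out of the initial state q0
        delta.setdefault(("q0", letter(p)), set()).add(p)
    for p, successors in follow.items():  # transitions between letters
        for q in successors:
            delta.setdefault((p, letter(q)), set()).add(q)
    accepting = set(last) | ({"q0"} if nullable else set())
    return delta, accepting

delta, accepting = glushkov(FIRST, LAST, FOLLOW, NULLABLE)
for (state, symbol), targets in sorted(delta.items()):
    print(f"{state} --{symbol}--> {sorted(targets)}")
print("accepting:", sorted(accepting))
```

Every entry of the resulting table has a single target state, so this particular automaton is deterministic.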
20 Exercises
- What are the Glushkov automata of
- a · b · (a · b)
- (a + b)* · a · (a + b)
- (a + b)* · a ?
21 Recognizing Deterministic Regular Expressions
- Theorem (Book et al. 1971; Brüggemann-Klein, Wood,
1998)
- A regular expression is deterministic
(one-unambiguous) iff its Glushkov automaton is
deterministic.
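Applied to the country example, the theorem explains the cos-nonambig message. The sketch below uses hand-computed sets for the marking king1 + queen2 + king3 · queen4 and looks for two different marked symbols with the same name that can be reached from the same state. The names `FIRST`, `FOLLOW` and `conflicts` are illustrative assumptions.

```python
import re
from itertools import combinations

# Hand-computed sets for king1 + queen2 + king3 . queen4
FIRST = {"king1", "queen2", "king3"}
FOLLOW = {
    "king1": set(),
    "queen2": set(),
    "king3": {"queen4"},
    "queen4": set(),
}

def letter(position: str) -> str:
    """Unmark a subscripted symbol, e.g. 'king3' -> 'king'."""
    return re.sub(r"\d+$", "", position)

def conflicts(first, follow):
    """List (state, p, q) where p != q carry the same name and both leave state."""
    found = []
    for source, candidates in [("q0", first), *follow.items()]:
        for p, q in combinations(sorted(candidates), 2):
            if letter(p) == letter(q):
                found.append((source, p, q))
    return found

print(conflicts(FIRST, FOLLOW))
```

The only conflict is between king1 and king3 in the initial state: after reading a king, the automaton cannot tell which of the two particles was meant without looking ahead. This corresponds to the "king and king" pair in the validator's message.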
22 Construction of the Glushkov Automaton
- For an arbitrary alphabet Σ and a language L ⊆ Σ*
- we define two sets
- first(L) = { a ∈ Σ | ∃ u ∈ Σ*. a·u ∈ L }
- last(L) = { a ∈ Σ | ∃ u ∈ Σ*. u·a ∈ L }
- and the function
- follow(L, a) = { b ∈ Σ | ∃ u, v ∈ Σ*. u·a·b·v ∈ L }
- Consider an expression e and its marking e′
- We can construct the Glushkov automaton for e if
we know
- the sets first(L(e′)), last(L(e′)),
- the function follow(L(e′), ·),
- and whether ε ∈ L(e′) (ε is the empty word).
Why?
23 Construction of the Glushkov Automaton
- Where do we get this info?
- If e′ = a1, then
- first(L(e′)) = {a1}
- last(L(e′)) = {a1}
- follow(L(e′), a1) = ∅; follow(L(e′), li) is not
defined for any other symbol li
- Also, ε ∉ L(e′)
- If e′ = (f + g), then
- first(L(e′)) = first(L(f)) ∪ first(L(g))
- last(L(e′)) = last(L(f)) ∪ last(L(g))
- follow(L(e′), li) = follow(L(f), li) if li occurs in f;
follow(L(g), li) if li occurs in g
- Also, ε ∈ L(e′) if ε ∈ L(f) or ε ∈ L(g)
For e′ = (f)*, (f)+, (f · g): exercise!
24 Construction of the Glushkov Automaton
- If e′ = (f · g), then
- first(L(e′)) = first(L(f)) ∪ first(L(g)) if ε ∈ L(f);
first(L(f)) otherwise
- last(L(e′)) = last(L(f)) ∪ last(L(g)) if ε ∈ L(g);
last(L(g)) otherwise
- follow(L(e′), li) = follow(L(f), li) if li occurs in f
but li ∉ last(L(f));
follow(L(f), li) ∪ first(L(g)) if li ∈ last(L(f));
follow(L(g), li) if li occurs in g
- Also, ε ∈ L(e′) if ε ∈ L(f) and ε ∈ L(g)
25 Construction of the Glushkov Automaton
- If e′ = (f)*, then
- first(L(e′)) = first(L(f))
- last(L(e′)) = last(L(f))
- follow(L(e′), li) = follow(L(f), li) if li occurs in f
but li ∉ last(L(f));
follow(L(f), li) ∪ first(L(f)) if li ∈ last(L(f))
- Also, ε ∈ L(e′) always, since the star allows zero
repetitions
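The rules for symbols, union, concatenation and star can be turned into one recursive pass over the expression. The following is a minimal sketch, not an optimized algorithm. It assumes expressions are given as nested tuples `("sym", name)`, `("alt", f, g)`, `("cat", f, g)` and `("star", f)`, a representation chosen here for illustration. Positions are (name, number) pairs, numbered as they are encountered.

```python
def analyse(expr, follow, counter):
    """Return (nullable, first, last) of the marked expr; fill follow in place."""
    kind = expr[0]
    if kind == "sym":
        counter[0] += 1
        pos = (expr[1], counter[0])      # marked symbol, e.g. ("a", 1)
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind == "star":
        _, first, last = analyse(expr[1], follow, counter)
        for p in last:                   # loop back: last(f) is followed by first(f)
            follow[p] |= first
        return True, first, last
    nf, ff, lf = analyse(expr[1], follow, counter)
    ng, fg, lg = analyse(expr[2], follow, counter)
    if kind == "alt":
        return nf or ng, ff | fg, lf | lg
    if kind == "cat":
        for p in lf:                     # last(f) is followed by first(g)
            follow[p] |= fg
        first = ff | (fg if nf else set())
        last = lg | (lf if ng else set())
        return nf and ng, first, last
    raise ValueError(f"unknown operator: {kind}")

def is_deterministic(expr) -> bool:
    """True iff no state of the Glushkov automaton has two successors with one name."""
    follow = {}
    _, first, _ = analyse(expr, follow, [0])
    for candidates in [first, *follow.values()]:
        names = [name for name, _ in candidates]
        if len(names) != len(set(names)):
            return False
    return True

sym = lambda name: ("sym", name)
country = ("alt", ("alt", sym("king"), sym("queen")),
           ("cat", sym("king"), sym("queen")))
ab_star_a = ("cat", ("star", ("alt", sym("a"), sym("b"))), sym("a"))
a_bc_star_a = ("cat", ("cat", sym("a"), ("star", ("alt", sym("b"), sym("c")))), sym("a"))

print(is_deterministic(country))
print(is_deterministic(ab_star_a))
print(is_deterministic(a_bc_star_a))
```

The set `first` plays the role of the transitions out of q0, and each `follow[p]` the transitions out of state p, so checking these sets for repeated names is the same as checking the Glushkov automaton for determinism.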
26 Recognizing Deterministic Regular Expressions
- Observation
- For each operator, first, last, and follow can be
computed in quadratic time.
- ⇒ This yields an O(n^3) algorithm.
- Theorem (Brüggemann-Klein, Wood, 1998)
- There is an O(n^2) algorithm to check whether a
regexp is deterministic.
27 More Results
- Theorems (Brüggemann-Klein, Wood, 1998)
- Not every regular language can be denoted by a
deterministic regular expression.
- E.g., the language of (a + b)* · a · (a + b)
- Deterministic regular languages are not closed
under union, concatenation, or Kleene-star.
- ⇒ I.e., there is no easy syntactic
characterization
- If it exists, an equivalent deterministic regular
expression can be constructed in exponential
time.
- ⇒ It is possible to help users, but
that is costly
28 Theory for XML Schema
- XML Schema allows schemas where
- the same element name appears with different types
- However,
- it is illegal to have two elements of the same
name, but different types, in one content model.
- Also, content models must be deterministic.
- Consequence
- Documents can be validated in a deterministic
top-down pass
29 References
- This material draws upon slides by
- Sara Cohen
- Frank Neven
- notes by
- Leonid Libkin
- and the papers by A. Brüggemann-Klein and D. Wood