Title: Constraints for XML
1. Constraints for XML
- Susan B. Davidson
- University of Pennsylvania
- Wenfei Fan
- Bell Labs and Temple University
2. Outline
- XML, Web data and database techniques
- XML specifications: types and constraints
- XML constraints: absolute/relative keys and foreign keys
- Analysis of XML constraints: consistency and implication
- Constraints in practice
- Area references:
  - "A Web odyssey: From Codd to XML", V. Vianu, PODS 2001. http://www.cis.upenn.edu/wfan/PODS2001/proceedings.html
  - "Constraints for semistructured data and XML", P. Buneman, W. Fan, J. Simeon and S. Weinstein, SIGMOD Record 30(1), March 2001. http://www.cis.temple.edu/fan/papers/xml/survey.ps.gz
3. Part 1. XML: a brief introduction
4. What is wrong with HTML?
- HTML (HyperText Markup Language) is good for presentation, but does not help programs extract information.
    <h3> George Bush </h3>
    <b> Taking Eng 055 </b> <br>
    <em> GPA 1.5 </em> <br>
    <h3> Eng 055 </h3>
    <b> Spelling </b>
- HTML tags:
  - predefined and fixed
  - describe display format rather than the structure of data
5. eXtensible Markup Language
- XML tags:
  - user-defined, arbitrarily nested
  - describe the structure of the data rather than its display
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking>
      <GPA> 1.5 </GPA>
    </student>
    <course cno="Eng 055">
      <title> Spelling </title>
    </course>
6. XML basics
- Element: the segment between a start tag and the corresponding end tag, e.g., student, name.
- Subelement: the relation between an element and its component elements, e.g., name to student.
- Attribute: marked text within a start tag, e.g., id.
- Text: the single basic type (PCDATA), e.g., Bush.
- XML elements are ordered, whereas attributes are not.
    <student id="123">
      <name>
        <firstName> George </firstName> <lastName> Bush </lastName>
      </name>
      <taking> Eng 055 </taking> <GPA> 1.5 </GPA>
    </student>
7. Representing relational databases
- A relational database for a school, with three relations: student, course, enroll
8. XML representation
    <school>
      <student id="001">
        <name> Joe </name> <gpa> 3.0 </gpa>
      </student>
      <student id="002">
        <name> Mary </name> <gpa> 4.0 </gpa>
      </student>
      <course cno="331">
        <title> DB </title> <credit> 3.0 </credit>
      </course>
      <course cno="350">
        <title> Web </title> <credit> 3.0 </credit>
      </course>
9. XML representation
      <enroll>
        <id> 001 </id> <cno> 331 </cno>
      </enroll>
      <enroll>
        <id> 001 </id> <cno> 350 </cno>
      </enroll>
      <enroll>
        <id> 002 </id> <cno> 331 </cno>
      </enroll>
    </school>
10. Representing object-oriented databases
- An object-oriented database for a school:
  - student: {s1, s2}
    - value(s1) = (id: 001, name: Joe, gpa: 3.0, taking: {c1, c2})
    - value(s2) = (id: 002, name: Mary, gpa: 4.0, taking: {c1})
  - course: {c1, c2}
    - value(c1) = (cno: 331, title: DB, credit: 3.0, taken_by: {s1, s2})
    - value(c2) = (cno: 350, title: Web, credit: 3.0, taken_by: {s1})
11. XML representation
    <school>
      <student id="s1">
        <id> 001 </id>
        <name> Joe </name>
        <gpa> 3.0 </gpa>
        <taking idrefs="c1 c2" />
      </student>
      <student id="s2">
        <id> 002 </id>
        <name> Mary </name>
        <gpa> 4.0 </gpa>
        <taking idrefs="c1" />
      </student>
12. XML representation
      <course id="c1">
        <cno> 331 </cno>
        <title> DB </title>
        <credit> 3.0 </credit>
        <taken_by idrefs="s1 s2" />
      </course>
      <course id="c2">
        <cno> 350 </cno>
        <title> Web </title>
        <credit> 3.0 </credit>
        <taken_by idrefs="s1" />
      </course>
    </school>
13. The XML tree model
- An XML document is typically modeled as a node-labeled tree.
- Element node: internal, with a name (tag) and children (subelements and attributes), e.g., student, name.
- Attribute node: leaf with a name (tag) and text, e.g., @id.
- Text node: leaf with text (string) but without a name.
14. XML and Web data
- Web data is semistructured: schemaless, irregular
- Traditional database systems can't model Web data
- XML model: a special case of the semistructured data model
  - flexible: can model Web data (with references, foreign keys)
  - powerful: can represent data from databases
15. XML in data exchange
- XML: the primary standard for data exchange on the Web
  - across formats/platforms/enterprises
  - generated and consumed by applications
  - healthcare industry, e-commerce, digital library, ...
[Figure: XML as the exchange format between the Web, an OODB (Unix) and an RDB (MS)]
16. XML in data integration
- mediator/wrapper vs. virtual view of a database
- data warehouse vs. materialized view of a database
- Web databases, e-commerce
[Figure: clients query a mediator (XML), which sits on wrappers over a file, the Web and a DB]
17. XML in e-commerce
- A site for a car dealer provides a uniform query interface for price, rating, review and competitors' price/availability.
- Integrating local data, a national archive for safety records, review data, competitors' sites
- e-commerce: query interface (XML), integration system (XML), database system, workflow management
[Figure: clients use a query interface / warehouse (XML) fed by integrators over the local DB, national records, review and competitor sources]
18. Database techniques for managing XML data
- specifying XML: types and constraints
- querying XML: XSL, XQL, XML-QL, Lorel, UnQL
- updating XML: constraints and concurrency control
- integrating XML: database transformations and integration
- storing XML: efficient storage and access methods, indexing
- These are crucial for Web applications:
  - e-commerce, digital library, data exchange, Web databases, Web site management, ...
- XML players: W3C, Microsoft, HP, Oracle, Adobe, ...
19. Part 2. XML specification: types and constraints
20. A relational schema (SQL)
- Types and constraints:
    create table students
      ( id   char(9),
        name char(20),
        primary key (id) )
    create table courses
      ( cno   char(9),
        title char(20),
        primary key (cno) )
    create table enroll
      ( id  char(9),
        cno char(9),
        primary key (id, cno),
        foreign key (id) references students,
        foreign key (cno) references courses )
21. An object-oriented schema (ODMG)
- Types and constraints:
    class student
      (key id,
       extent students)
      attribute string id
      attribute string name
      relationship set<course> taking
        inverse course::takenBy
    class course
      (key cno,
       extent courses)
      attribute string cno
      attribute string title
      relationship set<student> takenBy
        inverse student::taking
- The distinction between types and constraints is dictated by what programming languages treat as types.
22. XML specification: DTD
- DTD (Document Type Definition)
- Type:
    <!ELEMENT db (student*, course*)>
    <!ELEMENT student (name, taking*)>
    <!ELEMENT course (title, taken_by*)>
    <!ELEMENT taking EMPTY>
    <!ELEMENT taken_by EMPTY>
- Constraints: ID and IDREF attributes in DTD
    <!ATTLIST student id ID #REQUIRED>
    <!ATTLIST course cno ID #REQUIRED>
    <!ATTLIST taking cno IDREF #IMPLIED>
    <!ATTLIST taken_by id IDREF #IMPLIED>
- Others: XML Schema, XML-Data, XDR, SOX, Schematron, DSD, ...
23. Capturing oids with IDs
- Recall our XML encoding of our OODB:
  - student: {s1, s2}
  - course: {c1, c2}
    <school>
      <student id="s1">
        <id> 001 </id> <name> Joe </name>
        <gpa> 3.0 </gpa> <taking idrefs="c1 c2" />
      . . .
      <course id="c2">
        <cno> 350 </cno> <title> Web </title>
        <credit> 3.0 </credit> <taken_by idrefs="s1" />
      </course>
    </school>
24. A DTD for the OODB
- Types:
    <!ELEMENT db (student*, course*)>
    <!ELEMENT student (id, name, gpa, taking)>
    <!ELEMENT course (cno, title, credit, taken_by)>
    <!ELEMENT taking EMPTY>
- Constraints:
    <!ATTLIST student id ID #REQUIRED>
    <!ATTLIST course id ID #REQUIRED>
    <!ATTLIST taking idrefs IDREFS #IMPLIED>
    <!ATTLIST taken_by idrefs IDREFS #IMPLIED>
- ID vs. object-identifier (oid)
25. Part 3. XML constraints: keys and foreign keys
26. Keys and foreign keys for XML
- Keys: locating a specific object, an invariant connection from an object in the real world to its representation
    student.@id → student
    course.@cno → course
- Foreign keys: referencing an object from another object
    taking.@cno ⊆ course.@cno, course.@cno → course
    taken_by.@id ⊆ student.@id, student.@id → student
- Central issues: value equality, typing, scoping, absolute/relative, ...
- Key specifications:
  - the XML standard (DTD), XML Schema, XML-Data, ...
27. Specification of student in XML Schema
    <element name="student">
      <complexType>
        <sequence>
          <element name="name" type="string"/>
          <element name="taking" minOccurs="0"
                   maxOccurs="unbounded">
            <complexType>
              <attribute name="cno" type="string"/>
            </complexType>
          </element>
        </sequence>
        <attribute name="id" type="string"/>
      </complexType>
28. Keys and foreign keys in student
      <key name="k1">
        <selector xpath="."/>
        <field xpath="@id"/>
      </key>
      <keyref name="fk1" refer="k2">
        <selector xpath="taking"/>
        <field xpath="@cno"/>
      </keyref>
    </element>
29. Specification of course in XML Schema
    <element name="course">
      <complexType>
        <sequence>
          <element name="title" type="string"/>
          <element name="taken_by" minOccurs="0"
                   maxOccurs="unbounded">
            <complexType>
              <attribute name="id" type="string"/>
            </complexType>
          </element>
        </sequence>
        <attribute name="cno" type="string"/>
      </complexType>
30. Keys and foreign keys in course
      <key name="k2">
        <selector xpath="."/>
        <field xpath="@cno"/>
      </key>
      <keyref name="fk2" refer="k1">
        <selector xpath="taken_by"/>
        <field xpath="@id"/>
      </keyref>
    </element>
31. Keys in XML-Data
    <elementType id="student">
      <element id="p1" type="id"/>
      <element type="name"/>
      <element type="taking" occurs="ONEORMORE"/>
      <key id="k1"> <keyPart href="p1"/> </key>
    </elementType>
    <elementType id="course">
      <element id="p2" type="cno"/>
      <element type="title"/>
      <element type="taken_by" occurs="ONEORMORE"/>
      <key id="k2"> <keyPart href="p2"/> </key>
    </elementType>
32. Foreign keys in XML-Data
    <elementType id="taking">
      <element type="cno"/>
      <domain type="student"/>
      <foreignKey range="course" key="k2"/>
    </elementType>
    <elementType id="taken_by">
      <element type="id"/>
      <domain type="course"/>
      <foreignKey range="student" key="k1"/>
    </elementType>
33. Constraints are important for XML
- XML is semistructured and may not come with a DTD/type
- Constraints are a fundamental part of the semantics
- Constraints have proved useful in:
  - semantic specifications: obvious
  - query optimization: chasing algorithm
  - database conversion to an XML encoding: a must
  - data integration: information preservation
  - update anomaly prevention: classical
  - normal forms for XML specifications: BCNF, 3NF
  - efficient storage/access: indexing
  - ...
34. The limitations of the XML standard
- ID and IDREF attributes in DTD:
    <!ATTLIST student id ID #REQUIRED>
    <!ATTLIST course cno ID #REQUIRED>
    <!ATTLIST taking cno IDREF #IMPLIED>
    <!ATTLIST taken_by id IDREF #IMPLIED>
- Scoping:
  - ID: unique within the entire document (like oids)
  - IDREF: untyped; one has no control over what it points to
- Unary and primary
- Defined in a type
- A mixture of relational keys and object identities (oids)
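The scoping problem above can be seen on a small document. The following is a minimal sketch, not a DTD validator: it assumes the ID and IDREF declarations are passed in by hand as (tag, attribute) pairs, and the helper name `check_ids` is purely illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<school>
  <student id="s1"><taking cno="c1"/><taking cno="s1"/></student>
  <course cno="c1"/>
</school>"""

# Assumed declarations, mirroring the ATTLIST lines above.
IDS = {("student", "id"), ("course", "cno")}
IDREFS = {("taking", "cno"), ("taken_by", "id")}

def check_ids(root, ids, idrefs):
    """Return (duplicate ID values, [(referrer tag, value, target tag or None)])."""
    targets = {}          # ID value -> tag of the element that owns it
    duplicates = set()
    for e in root.iter():
        for name, value in e.attrib.items():
            if (e.tag, name) in ids:
                if value in targets:
                    duplicates.add(value)   # IDs are unique document-wide
                targets[value] = e.tag
    refs = [(e.tag, v, targets.get(v))
            for e in root.iter()
            for name, v in e.attrib.items() if (e.tag, name) in idrefs]
    return duplicates, refs

duplicates, refs = check_ids(ET.fromstring(DOC), IDS, IDREFS)
print(duplicates, refs)
```

The second `taking` element refers to a student rather than a course, yet nothing in the ID/IDREF mechanism rules that out: the reference resolves, so the document passes.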
35. The limitations of XML Schema
- Keys are defined with a list of XPath expressions:
    (student, firstName, lastName)
    (student, lastName, firstName)
    (student, lastName, lastName, firstName)
- Equivalence/containment of XPath expressions is unresolved
- No efficient way to tell whether two keys are equivalent
- The notion of value equality is too restricted (text only)
- The notion of relative keys is not addressed
- Mild generalizations of relational keys fail to capture some fundamental semantics associated with the hierarchical structure of XML data
36. To overcome the limitations [WWW10]
- Absolute key: (Q, {P1, ..., Pk})
  - target path Q: identifies a target set [[Q]] of nodes on which the key is defined (vs. relation)
  - a set of key paths {P1, ..., Pk}: provides an identification for nodes in [[Q]] (vs. key attributes)
  - semantics: for any two nodes in [[Q]], if they have all the key paths and agree on them up to value equality, then they must be the same node (value equality and node identity)
- Examples:
    (_*.student, {@id})
    (_*.student, {_*.name})
    (_*.enroll, {@id, @cno})
    (_*, {@id})
37. Value equality on trees
- Two nodes are value equal iff
  - either they are text nodes (PCDATA) with the same value,
  - or they are attributes with the same tag and the same value,
  - or they are elements having the same tag and their children are pairwise value equal.
...
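The recursive definition of value equality can be sketched over Python's ElementTree model. This is an illustration under stated assumptions rather than the definition used in the cited work: attributes are compared as an unordered name/value map, text is compared after stripping surrounding whitespace, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET

def value_equal(a, b):
    """Value equality on element nodes: same tag, same attributes, same text,
    and pairwise value-equal children in document order."""
    if a.tag != b.tag:
        return False
    if a.attrib != b.attrib:                       # attributes are unordered
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    kids_a, kids_b = list(a), list(b)
    return len(kids_a) == len(kids_b) and all(
        value_equal(x, y) for x, y in zip(kids_a, kids_b))

a = ET.fromstring("<name><first>Joe</first><last>Doe</last></name>")
b = ET.fromstring("<name>\n  <first> Joe </first>\n  <last>Doe</last>\n</name>")
c = ET.fromstring("<name><last>Doe</last><first>Joe</first></name>")
print(value_equal(a, b), value_equal(a, c))
```

Because elements are ordered, swapping `first` and `last` in `c` breaks value equality even though the same subelements are present.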
38. Capturing the semistructured nature
- independent of types
- no structural requirement: tolerating missing/multiple paths
    (person, {name})   (person, {name, @phone})
39. Path expressions
- A simple yet powerful regular path language:
    q ::= ε | l | q.q | _*
  - ε: empty path
  - l: tag
  - q.q: concatenation
  - _*: combination of wildcard and the Kleene closure
- Theorem. The containment and equivalence problems for these path expressions are finitely axiomatizable and decidable in quadratic time.
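To make the path language concrete, here is a small sketch of the membership test only: does a given sequence of tags belong to the language of a path expression? It does not implement the containment or equivalence procedure that the theorem refers to. The textual encoding (dot-separated steps, `_*` for the wildcard closure, the empty string for ε) and the function name `matches` are assumptions made for illustration.

```python
def matches(path_expr, tags):
    """True if the tag sequence `tags` is in the language of `path_expr`.

    path_expr: dot-separated steps, each a tag or '_*' (any sequence of tags,
    possibly empty); the empty string stands for the empty path.
    """
    steps = [s for s in path_expr.split(".") if s]

    def go(i, j):
        if i == len(steps):
            return j == len(tags)
        if steps[i] == "_*":
            # '_*' may absorb zero or more of the remaining tags
            return any(go(i + 1, k) for k in range(j, len(tags) + 1))
        return j < len(tags) and steps[i] == tags[j] and go(i + 1, j + 1)

    return go(0, 0)

print(matches("_*.student", ["school", "student"]))
print(matches("book.chapter", ["book", "section"]))
```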
40. Relative constraints
- How to identify, in a document,
  - a book?
  - a chapter?
  - a section?
41. A key constraint language K
- Relative key: (Q, K)
  - path Q identifies a set [[Q]] of nodes, called the context
  - K = (Q', {P1, ..., Pk}) is a key on sub-documents rooted at nodes in [[Q]] (relative to Q).
- Example.
    (book, (chapter, {number}))
    (book.chapter, (section, {number}))
    (book, {title}) -- absolute key
- Analogous to keys for weak entities in a relational database:
  - the key of the parent entity
  - an identification relative to the parent entity
42. Examples of K constraints
- absolute: (book, {title})
- relative: (book, (chapter, {number}))
- relative: (book.chapter, (section, {number}))
43. Absolute vs. relative keys
- Absolute keys are a special case of relative keys: (Q, K) when Q is the empty path
- Absolute keys are scoped within the context of the entire document, while relative keys are scoped within the context of a sub-document
- Important for hierarchically structured data: XML, scientific databases, ...
  - absolute: (book, {title})
  - relative: (book, (chapter, {number}))
  - relative: (book.chapter, (section, {number}))
- XML keys are more complex than relational keys!
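The difference in scope can be checked mechanically. The sketch below is a simplified checker, not the algorithm from the cited papers: paths are plain dot-separated tag sequences (no `_*`), key values are compared as stripped text or attribute strings rather than by full value equality, and two nodes are taken to agree on a key path when their value sets for that path intersect. The names `reach`, `key_values` and `satisfies` are illustrative.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

DOC = """<lib>
  <book><title>XML</title>
    <chapter><number>1</number></chapter>
    <chapter><number>2</number></chapter>
  </book>
  <book><title>SQL</title>
    <chapter><number>1</number></chapter>
  </book>
</lib>"""

def reach(node, path):
    """Elements reached from `node` by a dot-separated tag path ('' = node itself)."""
    nodes = [node]
    for step in filter(None, path.split(".")):
        nodes = [c for n in nodes for c in n if c.tag == step]
    return nodes

def key_values(node, key_path):
    """Values at a key path: '@a' reads an attribute, otherwise element text."""
    if key_path.startswith("@"):
        v = node.get(key_path[1:])
        return set() if v is None else {v}
    return {(e.text or "").strip() for e in reach(node, key_path)}

def satisfies(root, context, target, key_paths):
    """Check the relative key (context, (target, {key_paths})); '' is the empty path."""
    for ctx in reach(root, context):
        for n1, n2 in combinations(reach(ctx, target), 2):
            if all(key_values(n1, p) & key_values(n2, p) for p in key_paths):
                return False      # two distinct targets agree on every key path
    return True

root = ET.fromstring(DOC)
print(satisfies(root, "", "book", ["title"]))            # absolute key on books
print(satisfies(root, "book", "chapter", ["number"]))    # chapters, relative to a book
print(satisfies(root, "", "book.chapter", ["number"]))   # chapters, document-wide
```

Chapter numbers identify a chapter only within its book: the relative key holds, while the same key paths used as an absolute key fail because both books have a chapter numbered 1.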
44. Inverse constraints
- Recall inverse constraints in OODB:
    class student
      (key id,
       extent students)
      attribute string id
      attribute string name
      relationship set<course> taking
        inverse course::takenBy
    class course
      (key cno,
       extent courses)
      attribute string cno
      attribute string title
      relationship set<student> takenBy
        inverse student::taking
- Inverse constraints:
  - if student s is taking course c, then c must be taken by s
  - if course c is taken by student s, then s must be taking c.
45. Inverse constraints for XML [PODS00]
    <!ELEMENT student (name, taking*)>
    <!ELEMENT course (title, taken_by*)>
    <!ATTLIST student id ID #REQUIRED>
    <!ATTLIST course cno ID #REQUIRED>
    <!ATTLIST taking cno IDREF #IMPLIED>
    <!ATTLIST taken_by id IDREF #IMPLIED>
- Inverse constraints:
    student(id).taking(cno) ⇌ course(cno).taken_by(id)
  - for any student s and any course c,
    - if c.cno ∈ s.taking.cno, then s.id ∈ c.taken_by.id
    - if s.id ∈ c.taken_by.id, then c.cno ∈ s.taking.cno
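As a rough illustration of the two implications, the following sketch checks the inverse constraint on a small document shaped like the DTD above. It compares the set of (student id, course cno) pairs seen from the student side with the set seen from the course side; the function name `inverse_holds` and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<school>
  <student id="001"><taking cno="331"/><taking cno="350"/></student>
  <student id="002"><taking cno="331"/></student>
  <course cno="331"><taken_by id="001"/><taken_by id="002"/></course>
  <course cno="350"><taken_by id="001"/></course>
</school>"""

def inverse_holds(root):
    """True iff 'student s takes course c' and 'course c is taken by s' coincide."""
    taking = {(s.get("id"), t.get("cno"))
              for s in root.iter("student") for t in s.iter("taking")}
    taken_by = {(t.get("id"), c.get("cno"))
                for c in root.iter("course") for t in c.iter("taken_by")}
    return taking == taken_by

good = ET.fromstring(DOC)
# Dropping one taken_by element breaks the first implication.
bad = ET.fromstring(DOC.replace('<taken_by id="002"/>', ""))
print(inverse_holds(good), inverse_holds(bad))
```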
46. Other constraints [PODS00]
- Path inclusion constraints:
    student.taking.cno ⊆ course.cno
    course.taken_by.id ⊆ student.id
- Path functional constraints:
    <!ELEMENT professor (name, research, course)>
    <!ELEMENT course (cno, title, credit)>
    professor.research → professor.course.cno
  - value equality in both sides
47. Part 4. XML constraint analysis
48. Consistency of an XML specification
- Given:
  - D: a DTD
  - Σ: a set of keys and foreign keys
- Consistency: is there an XML document that both conforms to D and satisfies Σ?
- Example.
  - DTD D:
      <!ELEMENT foo (X, X)>
      <!ELEMENT X EMPTY>
  - constraints Σ: (X, {ε})
- One wants to know whether an XML specification makes sense!
49. Implication of XML constraints
- Given:
  - D: a DTD
  - Σ: a set of keys and foreign keys
  - φ: a property (a key or foreign key)
- Implication: is it the case that for any XML document, if it conforms to D and satisfies Σ, then it must satisfy φ?
- The need for studying implication:
  - data integration: constraints cannot be checked directly at the mediator level
  - design theory for XML specifications: along the same lines as database normalization
  - query optimization (chase), ...
50. Consistency analysis
- Trivial for relational databases: given any schema and keys, foreign keys, one can always find a nonempty instance of the schema satisfying the constraints.
- Hard for XML: XML specifications with a DTD and keys, foreign keys may not be consistent!
- DTDs interact with constraints in an intricate way.
51. The interaction between DTDs and constraints
- DTD D:
    <!ELEMENT foo (X, X)>
    <!ELEMENT X EMPTY>
- key φ: (X, {ε})
  - (1) conforms to D: two X nodes under the root
  - (2) satisfies φ: no two X nodes under the root can have the same value
- There is no XML tree both conforming to D and satisfying φ
52. Consistency of DTDs
- There is a need for consistency analysis even in the absence of constraints
- Example. DTD:
    <!ELEMENT foo (foo)>
- There exists no XML document that conforms to the DTD!
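One way to see why `foo (foo)` has no finite instance is to compute which element types can derive a finite tree, much like finding the productive nonterminals of a grammar. The sketch below assumes a simplified encoding that is not part of the slides: each element type maps to a list of alternatives, and each alternative lists only the child types that must occur (optional or starred children are left out because they can be empty). This simple fixpoint loop is quadratic in the worst case, not the linear-time procedure mentioned elsewhere in these slides.

```python
def consistent(dtd, root):
    """True if some finite document with the given root type conforms to `dtd`.

    dtd: {element type: [alternative, ...]}, where each alternative is the
    list of child types that are required to occur.
    """
    productive = set()
    changed = True
    while changed:
        changed = False
        for etype, alternatives in dtd.items():
            if etype not in productive and any(
                    all(child in productive for child in alt)
                    for alt in alternatives):
                productive.add(etype)
                changed = True
    return root in productive

# <!ELEMENT foo (foo)>: every foo needs another foo, so no finite tree exists.
print(consistent({"foo": [["foo"]]}, "foo"))
# <!ELEMENT foo (X, X)>, <!ELEMENT X EMPTY>: consistent as a DTD on its own.
print(consistent({"foo": [["X", "X"]], "X": [[]]}, "foo"))
```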
53. A simple constraint language, C
- absolute key: $\tau[X] \to \tau$. A document satisfies the key iff
    $\forall x, y \in \mathit{ext}(\tau)\ (x[X] =_v y[X] \to x = y)$
- absolute foreign key: an inclusion constraint $\tau_1[X] \subseteq \tau_2[Y]$ and a key $\tau_2[Y] \to \tau_2$. A document satisfies the foreign key iff it satisfies the key and
    $\forall x \in \mathit{ext}(\tau_1)\ \exists y \in \mathit{ext}(\tau_2)\ (x[X] =_v y[Y])$
- where
  - $\tau, \tau_1, \tau_2$: element types
  - $X, Y$: sets (sequences) of attributes
  - $\mathit{ext}(\tau)$: the set of all $\tau$ elements in the document
  - $=_v$: value equality.
54. Examples of C constraints
- Specifying keys and foreign keys in terms of element types, rather than paths (in the flavor of XML-Data):
    student.@id → student
    course.@cno → course
    taking.@cno ⊆ course.@cno
    person[@firstName, @lastName] → person
- C constraints vs. K constraints:
  - an absolute key τ[X] → τ in C is equivalent to the absolute key (_*.τ, X) in K
  - absolute keys are a special case of K constraints
  - an absolute foreign key τ1[X] ⊆ τ2[Y] and τ2[Y] → τ2 of C is not expressible in K
55. Unary constraints
- Keys and foreign keys defined in terms of a single attribute.
- Example.
    student.@id → student
    course.@cno → course
    taking.@cno ⊆ course.@cno
56. Analysis of C constraints [PODS01]
- Theorem. In the presence of DTDs, the following problems are undecidable for keys and foreign keys of C:
  - the consistency problem
  - the implication problem.
- This is in contrast to the trivial consistency analysis in relational databases.
- These negative results carry over to:
  - other schema languages: XML Schema, XML-Data, XDuce, ...
  - other constraint languages: XML Schema, XML-Data, ...
57. Analysis of unary constraints
- Theorem. In the presence of DTDs, for unary constraints of C:
  - the consistency problem is NP-complete
  - the implication problem is coNP-complete.
- In relational databases, implication of unary keys and foreign keys is decidable in linear time.
- Primary key restriction: at most one key for each element type.
- Theorem. In the presence of DTDs, the consistency and implication problems remain intractable for unary keys and foreign keys of C even under the primary key restriction.
- Keys specified with ID attributes are primary and unary!
58. A simple language for relative constraints, R
- relative key: $(Q, \tau[X] \to \tau)$. A document satisfies the key iff
    $\forall x \in [\![Q]\!]\ \forall y, z \in \mathit{ext}(x.\tau)\ (y[X] =_v z[X] \to y = z)$
- relative foreign key: $(Q_1, \tau_1[X]) \subseteq (Q_2, \tau_2[Y])$ and a key $(Q_2, \tau_2[Y] \to \tau_2)$. A document satisfies the foreign key iff it satisfies the key and
    $\forall x \in [\![Q_1]\!]\ \exists y \in [\![Q_2]\!]\ (\mathit{ext}(x.\tau_1)[X] \subseteq_v \mathit{ext}(y.\tau_2)[Y])$
- where
  - $Q, Q_1, Q_2$: path expressions
  - $\tau, \tau_1, \tau_2$: element types; $X, Y$: attributes
  - $\mathit{ext}(x.\tau)$: the set of $\tau$ sub-elements of $x$
  - $\subseteq_v$: set inclusion defined in terms of value equality
59. Examples of R constraints
- Specifying relative constraints in terms of element types:
    (CS.student, (taking.@cno → taking))
    (_*, (course.@cno → course))
    (CS.student, taking.@cno) ⊆ (CS, course.@cno)
    (CS, course.@cno) ⊆ (CS.student, taking.@cno)
- R constraints vs. K constraints:
  - a key (Q, τ[X] → τ) of R is equivalent to (Q, (τ, X))
  - relative keys are a special case of K constraints
  - a foreign key (Q1, τ1[X]) ⊆ (Q2, τ2[Y]) and (Q2, τ2[Y] → τ2) of R is not expressible in K
60. Analysis of relative constraints
- Theorem. In the presence of DTDs, the following problems are undecidable even for unary relative constraints of R:
  - the consistency problem
  - the implication problem.
- The analysis of XML constraints is far more intriguing than its database counterparts!
61. Tractable special cases
- Theorem. In the absence of constraints, the consistency problem for arbitrary DTDs is decidable in linear time.
- Theorem. When the DTD is fixed, the consistency and implication problems for unary constraints of C are in PTIME.
- Theorem. When only keys of C are considered, the consistency and implication problems are decidable in linear time in the presence of DTDs.
62. Constraint analysis in the absence of DTDs
- Regardless of DTDs:
  - Consistency: given any set of keys and foreign keys, can they be satisfied by an XML document?
  - Implication: given a set Σ of keys and foreign keys, does it follow that all documents satisfying Σ must also satisfy another key or foreign key?
- The need for investigating these issues:
  - many XML documents do not come with a DTD
  - one is interested in implication that generally holds for all kinds of documents, regardless of their DTDs.
63. Analysis of C constraints [PODS00]
- Without DTDs, the consistency problem becomes trivial: any keys and foreign keys of C are satisfiable.
- Theorem. In the absence of DTDs, the implication problem for C constraints remains undecidable.
- Theorem. In the absence of DTDs, the implication problem is decidable in PSPACE for keys and foreign keys of C under the primary key restriction.
- Theorem. In the absence of DTDs, the implication problem is decidable in linear time for unary keys and foreign keys of C.
- These results also hold when inverse constraints are allowed.
64. Analysis of K constraints [DBPL01]
- Without DTDs, the consistency problem for K also becomes trivial: any keys of K are satisfiable.
- Theorem. In the absence of DTDs, the implication problem for keys of K is finitely axiomatizable and is decidable in PTIME.
- Theorem. In the absence of DTDs, the implication problem for absolute keys of K is finitely axiomatizable and is decidable in O(n^3) time.
- The absence of DTDs simplifies the constraint analysis but does not make it trivial!
65. Inference rules for K constraint implication
- superkey: if (Q, (Q', S)) then (Q, (Q', S ∪ {P})), where P is any path
    Example: (_*, (person, {id})) ⇒ (_*, (person, {id, name}))
- containment-reduce: if (Q, (Q', S ∪ {P1, P2})) and P1 ⊆ P2, then (Q, (Q', S ∪ {P1}))
    Example: (_*, (person, {id, _*.id})) ⇒ (_*, (person, {id}))
- context-target: if (Q, (Q1.Q2, S)), then (Q.Q1, (Q2, S))
    Example: (_*, (university.employee, {id})) ⇒ (_*.university, (employee, {id}))
66. Part 5. Constraints in practice
67. Updates in XML [Tatarinov et al., SIGMOD01; Zhang & Shasha, SIAM J. Comput. 18(5); Chawathe, SIGMOD97]
- Updates for XML are based on its ordered tree model: Insert, Delete, Rename, InsertBefore/InsertAfter, Replace.
68. Using Keys to Update: Transitive Keys [WWW01]
- To update a unique node, we must be able to identify it uniquely.
- Example 1: (ε, (bible.book, {name})), (bible.book, (chapter, {number}))
- Example 2: (ε, (bible.book, {name})), (bible.book.chapter, (verse, {number}))
- In the first example, the second (relative) key is given a context by the first. This is not the case in the second example.
- (Q1, (Q1', S1)) immediately precedes (Q2, (Q2', S2)) if Q2 = Q1.Q1'. Precedes is the transitive closure of immediately precedes.
- A set Σ of relative keys is transitive if for any relative key (Q1, (Q1', S1)) ∈ Σ there is a key (ε, (Q2', S2)) ∈ Σ which precedes (Q1, (Q1', S1)).
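The definition of a transitive key set can be sketched as a reachability check starting from the absolute keys. This is an illustration under simplifying assumptions: keys are reduced to (context path, target path) pairs since the key paths play no role here, the empty string stands for ε, and paths are compared syntactically, so expressions involving `_*` would need a real equivalence test. The names `concat` and `is_transitive` are illustrative.

```python
def concat(q1, q2):
    """Concatenate two dot-separated paths, treating '' as the empty path."""
    return ".".join(p for p in (q1, q2) if p)

def is_transitive(keys):
    """keys: iterable of (context, target) path pairs.

    A key is reachable if it is absolute (context '') or if its context equals
    context.target of an already reachable key ('immediately precedes').
    """
    keys = set(keys)
    reachable = {k for k in keys if k[0] == ""}
    frontier = list(reachable)
    while frontier:
        ctx, tgt = frontier.pop()
        for k in keys - reachable:
            if k[0] == concat(ctx, tgt):
                reachable.add(k)
                frontier.append(k)
    return reachable == keys

example1 = [("", "bible.book"), ("bible.book", "chapter")]
example2 = [("", "bible.book"), ("bible.book.chapter", "verse")]
print(is_transitive(example1), is_transitive(example2))
```

In the second set the verse key has context bible.book.chapter, but no key identifies chapters within a book, so the chain from an absolute key is broken.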
69. Checking Key Constraints
- How efficiently can we check that a document satisfies a key specification of absolute and relative keys (Q, (Q', K))? It turns out there is an incremental technique which runs in linear time in the size of the document, and uses efficient indexing and SAX.
- The index is a hierarchical hash table, which is composed of levels:
  - Key specification level
  - Context node level
  - Key path level
  - Key value level
- Nodes are partitioned by key path and key value.
- The index is incremental, and updates can be performed in linear time in the size of the update.
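The four levels can be pictured as nested hash tables. The sketch below is one possible reading of that structure, not the published algorithm: it assumes each node has a single value per key path, takes node and context identifiers as plain strings, and leaves out the SAX-driven traversal entirely. The class name `KeyIndex` and its `insert` method are hypothetical.

```python
from collections import defaultdict

class KeyIndex:
    """index[key_spec][context_id][key_path][key_value] -> set of node ids.

    The innermost set plays the role of a key value sharing class: the nodes
    that share one value on one key path within one context.
    """
    def __init__(self):
        self.index = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(set))))

    def insert(self, key_spec, context_id, node_id, values):
        """Register node_id with {key_path: key_value}.

        Returns the previously inserted nodes that agree with it on every
        key path, i.e. the nodes with which it violates the key.
        """
        level = self.index[key_spec][context_id]
        clash = None
        for path, value in values.items():
            level[path][value].add(node_id)
            sharing = level[path][value]
            clash = set(sharing) if clash is None else clash & sharing
        return (clash or set()) - {node_id}

idx = KeyIndex()
r1 = idx.insert("KS2", "book1", "a1", {"fn": "Ann", "ln": "Lee"})
r2 = idx.insert("KS2", "book1", "a2", {"fn": "Ann", "ln": "Kim"})
r3 = idx.insert("KS2", "book1", "a3", {"fn": "Ann", "ln": "Lee"})   # clashes with a1
r4 = idx.insert("KS2", "book2", "a4", {"fn": "Ann", "ln": "Lee"})   # other context
print(r1, r2, r3, r4)
```

Each insertion touches one hash bucket per key path, which is the sense in which such an index can be maintained incrementally.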
70. Structure of Index
[Figure: structure of the index]
- The Key Value Sharing Class (KVSC) represents a set of nodes (oids) that share the key value.
71. Example
[Figure: example document tree]
72. Example, cont.
- Suppose we have the following key specification:
  - KS1: (ε, (book, {ISBN}))
  - KS2: (book, (author, {fn, ln}))
  - KS3: (ε, (author, {@ID}))
73. XML keys and relational storage
- The previous approach does not consider how the XML document is being stored:
  - Text file?
  - Relational storage?
  - Object system?
- If a relational store is used, can we use the native key or constraint checking to check XML keys?
74. XML Relational Storage Strategies: quick review
- Edge approach: create a single relational table called the edge table (Florescu & Kossmann, Data Eng. Bull. 22(3)):
    (sourceID, tag, ordinal, targetID, data)
- Basic inlining: each table corresponds to an element; using the DTD, place within an element table as many single-valued attributes as possible (Wisconsin, VLDB99)
    <?xml?>
    <!ELEMENT Dept (Student*)>
    <!ATTLIST Dept dept_id ID #REQUIRED>
    <!ELEMENT Student (Name, Enroll*)>
    <!ATTLIST Student student_id ID #REQUIRED>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Enroll (#PCDATA)>

    Dept(parentID, ID, dept_id)
    Student(parentID, ID, student_id, Name)
    Enroll(parentID, ID, TEXT)
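To illustrate the edge approach, the sketch below shreds a small document into rows of the form (sourceID, tag, ordinal, targetID, data). Several details are assumptions made for illustration rather than taken from the cited paper: node ids are assigned in document order starting at 1, the root hangs off a virtual source 0, attributes are stored as rows with an `@` prefix and ordinal 0, and only leading element text is kept as data.

```python
import xml.etree.ElementTree as ET
from itertools import count

DOC = """<Dept dept_id="d1">
  <Student student_id="s1"><Name>Joe</Name><Enroll>331</Enroll></Student>
</Dept>"""

def edge_table(root):
    """Shred a document into (sourceID, tag, ordinal, targetID, data) rows."""
    ids = count(1)
    rows = []

    def visit(elem, source, ordinal):
        target = next(ids)
        data = (elem.text or "").strip() or None
        rows.append((source, elem.tag, ordinal, target, data))
        for name, value in elem.attrib.items():      # attributes are unordered
            rows.append((target, "@" + name, 0, next(ids), value))
        for i, child in enumerate(elem, start=1):    # children keep their order
            visit(child, target, i)

    visit(root, 0, 1)
    return rows

rows = edge_table(ET.fromstring(DOC))
for row in rows:
    print(row)
```

Every parent-child edge ends up as its own row in one table, which is why relational key constraints have little to hold on to in this encoding.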
75. Mapping XML keys to relational constraints
- How do XML keys translate to relational keys or functional dependencies?
- The edge model separates all edges out, so it cannot use key constraints.
- Inlining allows more:
  - (ε, (Student, {@student_id})), (ε, (Dept, {@dept_id})): check that student_id is a key in the Student relation and dept_id is a key in the Dept relation
  - (Dept, (Student, {Name})): check that (parentID, Name) is a key for the Student relation.
- However, these are special cases in which the Q path is simple and consists of a single label!
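For the inlined schema, the relative key on student names within a department can be handed to the relational engine as a uniqueness constraint on (parentID, Name). The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table layout follows the inlined relations, while the column types, sample rows and the helper `try_insert` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dept    (ID INTEGER PRIMARY KEY, parentID INTEGER,
                      dept_id TEXT UNIQUE);            -- absolute key on dept_id
CREATE TABLE Student (ID INTEGER PRIMARY KEY,
                      parentID INTEGER REFERENCES Dept(ID),
                      student_id TEXT UNIQUE,          -- absolute key on student_id
                      Name TEXT,
                      UNIQUE (parentID, Name));        -- relative key within a Dept
""")
conn.execute("INSERT INTO Dept VALUES (1, 0, 'cs')")
conn.execute("INSERT INTO Dept VALUES (2, 0, 'math')")
conn.execute("INSERT INTO Student VALUES (10, 1, 's1', 'Joe')")
# The same name in a different department does not violate the relative key.
conn.execute("INSERT INTO Student VALUES (11, 2, 's2', 'Joe')")

def try_insert(row):
    """Insert a Student row; return False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO Student VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

same_dept_same_name = try_insert((12, 1, "s3", "Joe"))
same_dept_new_name = try_insert((12, 1, "s3", "Mary"))
print(same_dept_same_name, same_dept_new_name)
```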
76. Constraints-Preserving Transformations [Lee & Chu, ER00]
- DTDs encapsulate certain types of constraints:
  - Domain: <!ATTLIST author gender (male|female) >
  - Cardinality: <!ELEMENT article (title, author, ref, price?)>
  - Inclusion: <!ATTLIST contact aid IDREF #REQUIRED>
- Hybrid inlining can be modified to preserve these constraints, and to generate SQL constraint statements: create domain, NOT NULL, UNIQUE, id and foreign key.
- The key is assumed to be the attribute of type ID, whenever it exists.
- Can our extended notion of keys be captured as well to influence the transformation? This is an area of future research.
77. XML Relational Storage Strategies, cont.
- There are many more storage strategies:
  - Shared inlining (Wisconsin): inlines element tags that are single valued and are not subelements of more than one element type.
  - Hybrid inlining (Wisconsin): inlines all element tags
- Both of these approaches may pull subelements and attributes that are needed in the key into separate relations, making it complex to check XML keys.
- These storage strategies go automatically from the DTD of the document to a relational schema. What if we want more control?
78. Mapping Constraints Through Views
- Describing a transformation (e.g. basic, shared or hybrid inlining) can be done using a basic set of primitives (a language). This also allows other possibilities in how the data will be stored.
- We are then faced with the general problem of mapping constraints through a view definition.
- Mapping constraints through a view definition is understood in the context of relational and object-oriented databases:
  - Klug, TODS 5(3), 1980: mapping functional dependencies and join dependencies over relational views
  - Beeri & Vardi, SIAM J. Comput. 13(1), 1984: algebraic dependencies over relational views
  - Popa, ICDT99: mapping constraints over object-oriented views
- For XML this is an area of current research.
79. Other uses of constraints: Query Optimization
- Initial work on query optimization for XML focused on indices (Stanford, AT&T, Wisconsin, etc.):
  - Value, label, and edge indices; Dataguides (Lore)
  - Template index (Milo & Suciu, ICDT99)
- Work on query optimization using statistics and a cost model has been done for the Lore system (e.g. McHugh, VLDB99)
- Other work has focused on pushing XML queries into relational databases (e.g. Silkroute WWW9, Manolescu VLDB01, Shanmugasundaram VLDB2001)
- What about constraints? They have been used in relational databases, and more recently in object-oriented databases with a constraint language that can capture keys, foreign keys, inclusion constraints and indices (Popa VLDB99, ICDT99). This is an active area of research (Deutsch, UPenn).
80. Other uses of constraints: Normalization
- Consider the following transitive set of keys:
  - (ε, (university, {name}))
  - (university, (dept, {dept-name}))
  - (university, (dept.employee, {emp-id}))
- Note that employee is nested under dept. However, to insert an employee nothing about the dept is necessary to identify the employee! This is reminiscent of non-second normal form relations. We would like to say that employees should be directly nested under university, and that the linkage between employee and dept be expressed by a foreign key.
- This is also an area which needs further research.
81. XML Keys: Practical Observations
- In bioinformatics, the popular sequence databases tend to have natural keys. For example, EMBL-format SwissProt has a natural translation to XML, and keys can be formulated:
    STANDARD PRT 924 AA.
    AC P15711
    DT 01-APR-1990 (REL. 14, CREATED)
    DT 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
    DT 01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)
    DE 104 KD MICRONEME-RHOPTRY ANTIGEN.
    OS THEILERIA PARVA.
    RN 1
    RC STRAIN=MUGUGA
    RX MEDLINE 90158697.
    RA IAMS K.P., YOUNG J.R., NENE V.
    RL MOL. BIOCHEM. PARASITOL. 39:47-60(1990).
    DR EMBL M29954 G161866 -.
    DR PIR A44945 A44945.
    KW ANTIGEN SPOROZOITE.
    FT DOMAIN 1 19 HYDROPHOBIC.
    FT DOMAIN 905 924 HYDROPHOBIC.
82. SwissProt Entry in XML
    <Entry mtype="PRT" seqlen="924">
      <PrimAC>P15711</PrimAC>
      <Mod date="01-APR-1990" Rel="14" type="CREATED"></Mod>
      <Mod date="01-APR-1990" Rel="14" type="LAST SEQ UPD"></Mod>
      <Mod date="01-AUG-1992" Rel="23" type="LAST ANNOT UPD"></Mod>
      <Descr>104 KD MICRONEME-RHOPTRY ANTIGEN</Descr>
      <Species>THEILERIA PARVA</Species>
      <Ref num="1">
        <STRAIN>MUGUGA</STRAIN>
        <MedlineID>90158697</MedlineID>
        <Author>IAMS K.P.</Author>
        <Author>YOUNG J.R.</Author>
        <Author>NENE V</Author>
        <Cite>MOL. BIOCHEM. PARASITOL. 39:47-60(1990)</Cite>
      </Ref>
      <EMBL prim_id="M29954" sec_id="G161866" status="-"></EMBL>
      <PIR prim_id="A44945" sec_id="A44945"></PIR>
      <Keyword>ANTIGEN</Keyword>
      <Keyword>SPOROZOITE</Keyword>
      <Features>
        <DOMAIN from="1" to="19"> <Descr>HYDROPHOBIC</Descr> </DOMAIN>
        <DOMAIN from="905" to="924"> <Descr>HYDROPHOBIC</Descr> </DOMAIN>
      </Features>
    </Entry>
83. Practical Observations, cont.
- Many DTDs are now being formulated for data exchange within bioinformatics. In particular, gene expression data uses MAGE, representing the merge of MAML (MicroArray Markup Language) and GEML (Gene Expression Markup Language). They have also switched to modeling the concepts in UML, from which there is a natural translation to DTDs.
- Within these representations, attributes are often used to hold key information; IDs are occasionally used with special prefixes to capture their element type.
84. Conclusions and Future Work
- Constraints are extremely important for XML data management
- XML constraints and their analysis are more intricate than their database counterparts
- Further work is needed for a better understanding of:
  - XML constraints
  - consistency and implication of XML constraints
85. Open problems
- Practical, tractable classes of XML constraints
- Normal forms for XML specifications: is (D, Σ) good?
- XML query optimization: chasing for XML constraints
- Constraint propagation: given certain database constraints, what is the XML constraint that must hold on the XML view of the database?
- Constraint implementation: given an XML constraint, what impact does this have on the storage representation? Can the constraint be checked by the underlying storage system (e.g. relational)?
- Relative information capacity: is it the case that if an XML document conforms to (D1, Σ1), then it must also conform to (D2, Σ2)?
- . . .