Title: Managing XML and Semistructured Data
1Managing XML and Semistructured Data
7OEM vs. XML
- OEM objects correspond to elements in XML.
- Sub-elements in XML are inherently ordered.
- XML elements may optionally include a list of attribute-value pairs.
- Graph structure (multiple incoming edges) is specified in XML with references (ID, IDREF attributes), i.e. the project attribute.
9OEM to XML
- Example:
  <member project="5 6"> <name>Jones</name> <age>46</age> <office> <building>gates</building> <room>252</room> </office> </member>
- This corresponds to the rightmost member in the example OEM, where project is an attribute.
10Select x From A.B x Where exists y in x.C y 5
11In this lecture
- Indexes
- XSet
- Region algebras
- Indexes for Arbitrary Semistructured Data
- Dataguides
- 1-2 indexes
- Resources
- Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99
- XSet description: http://www.openhealth.org/XSet/
- Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2
12The problem
- Input: large, irregular data graph
- Output: index structure for evaluating regular path expressions
13The Data
- Semistructured data instance: a large graph
14The queries
- Regular expressions (using Lorel-like syntax):
  SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X
- Select x from part._.supplier.name x
  - Requires traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.
- Select X From part._.supplier: {name: X, address: "Philadelphia"}
  - Needs an index on values to narrow the search to the parts of the database that contain the string "Philadelphia".
15Analyzing the problem
- what kind of data
- tree data (XML) easier to index
- graph data used in more complex applications
- what kind of queries
- restricted regular expressions (e.g. XPath) may
be more efficient
16XSet: a simple index for XML
- Part of the Ninja project at Berkeley
- Example XML data
17XSet: a simple index for XML
- Each node: a hash table
- Each entry: list of pointers to data nodes (not shown)
18XSet: Efficient query evaluation
- (R1) SELECT X FROM part.name X - yes
- (R2) SELECT X FROM part.supplier.name X - yes
- (R3) SELECT X FROM *.supplier.name X - maybe
- (R4) SELECT X FROM part.*.subpart.name X - maybe
- To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.
- R4: following part leads to h2; traverse all nodes in the index below h2 (corresponding to *), then continue with the path subpart.name.
- Thus, explore the entire subtree dominated by h2.
- Will be efficient if the index is small and fits in memory.
- R3: the leading wildcard forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.
- Can index the index itself.
- Retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
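The evaluation strategy above can be sketched in a few lines. This is a minimal illustration, not XSet's actual implementation: the class name IndexNode, the tuple encoding of the data tree, and the sample document are all assumptions made for the example.

```python
class IndexNode:
    """One hash table of the index tree (hypothetical name, not from XSet itself)."""
    def __init__(self):
        self.children = {}   # label -> IndexNode (the next hash table)
        self.extent = []     # ids of the data nodes reached by this label path


def build_index(data, table=None):
    """data is (node_id, {label: [child, ...]}); returns the root hash table."""
    if table is None:
        table = IndexNode()
    node_id, kids = data
    table.extent.append(node_id)
    for label, children in kids.items():
        child_table = table.children.setdefault(label, IndexNode())
        for child in children:
            build_index(child, child_table)
    return table


def descendants(table):
    """The table itself and every table below it (what '*' has to visit)."""
    yield table
    for child in table.children.values():
        yield from descendants(child)


def evaluate(root, path):
    """Evaluate a path such as 'part.*.subpart.name'; '*' matches any label sequence."""
    frontier = [root]
    for step in path.split("."):
        if step == "*":
            frontier = [d for t in frontier for d in descendants(t)]
        else:
            frontier = [t.children[step] for t in frontier if step in t.children]
    result, seen = [], set()
    for t in frontier:                 # several '*' steps can reach a table twice
        if id(t) not in seen:
            seen.add(id(t))
            result.extend(t.extent)
    return sorted(result)


def leaf(i):
    return (i, {})

# Made-up sample data: two parts, one supplier, nested subparts.
doc = (0, {"part": [
    (1, {"name": [leaf(2)],
         "supplier": [(3, {"name": [leaf(4)]})],
         "subpart": [(5, {"name": [leaf(6)]})]}),
    (7, {"name": [leaf(8)],
         "subpart": [(9, {"subpart": [(10, {"name": [leaf(11)]})]})]}),
]})
index = build_index(doc)
r1 = evaluate(index, "part.name")
r3 = evaluate(index, "*.supplier.name")
r4 = evaluate(index, "part.*.subpart.name")
```

R1 touches two hash tables, while R3 and R4 walk every table under the point where the wildcard appears, which is why they are only "maybe" efficient.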
19Region Algebras
- Structured text: text with tags (like XML)
- New Oxford English Dictionary
- Critical limitation: ordered data only (like text)
- Assume data given as an XML text file, with implicit ordering in the file.
- Less critical limitation: restricted regular expressions
20Region Algebras: Definitions
- data: sequence of characters c_1 c_2 c_3 ...
- region: segment of the text in a file
- representation: (x, y) = c_x, c_{x+1}, ..., c_y, where x is the start position and y the end position of the region
- example: <section> ... </section>
- region set: a set of regions s.t. any two regions are either disjoint or one is included in the other
- example: all <section> regions (may be nested)
- Tree data: each node defines a region and each set of nodes defines a region set.
- example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions
21Representation of a region set
- Example: the <subpart> region set
- Region algebra: operators on region sets
- s1 op s2 defines a new region set
22Region algebra: some operators
- s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
- s1 included s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊆ r'}
- s1 including s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊇ r'}
- s1 parent s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a parent of r'}
- s1 child s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a child of r'}
- Examples:
  - <subpart> included <part> = {s1, s2, s3, s5}
  - <part> including <subpart> = {p2, p3}
  - <name> child <part> = {n1, n3, n12}
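A small sketch of these operators, assuming each region is a (start, end) pair of character positions. The parent and child operators need to know the whole document tree, so this sketch passes in a hypothetical `universe` argument holding every region of the file; the positions below are made up for illustration.

```python
def contains(outer, inner):
    """True if region inner lies inside region outer (non-strict)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intersect(s1, s2):
    return {r for r in s1 if r in s2}

def included(s1, s2):
    return {r for r in s1 if any(contains(q, r) for q in s2)}

def including(s1, s2):
    return {r for r in s1 if any(contains(r, q) for q in s2)}

def is_parent(p, c, universe):
    """p is the parent of c: p contains c and no other region sits in between."""
    if p == c or not contains(p, c):
        return False
    return not any(m != p and m != c and contains(p, m) and contains(m, c)
                   for m in universe)

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}

def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}

# Made-up document: two parts; the first holds a subpart that holds a subpart.
root = {(0, 100)}
part = {(1, 40), (41, 60)}
subpart = {(6, 30), (11, 25)}
name = {(2, 5), (7, 10), (12, 15), (42, 45)}
universe = root | part | subpart | name

parts_with_subparts = including(part, subpart)
part_names = child(name, part, universe)
subpart_names = child(name, included(subpart, part), universe)
```

The quadratic scans keep the sketch short; a real implementation would exploit the fact that region sets can be kept sorted by start position.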
23From path expressions to region expressions
- Use region algebra operators to answer regular path expressions
- Only restricted forms of regular path expressions can be translated into region algebra operators
- Expressions of the form R1.R2...Rn, where each Ri is either a label constant or the Kleene closure *
- Translations:
  - part.name → name child (part child root)
  - part.supplier.name → name child (supplier child (part child root))
  - *.supplier.name → name child supplier
  - part.*.subpart.name → name child (subpart included (part child root))
- Region expressions correspond to simple XPath expressions
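The four translations follow one mechanical rule: a plain step becomes `child`, a step preceded by * becomes `included`, and a leading * needs no context at all because every region is included in root. The function below is a sketch of that rule inferred from the examples; the name `to_region_expr` is hypothetical, and a trailing * is deliberately left unsupported since the slides do not show that case.

```python
def to_region_expr(path):
    """Translate a restricted path such as 'part.*.subpart.name' into a region expression."""
    context, op = "root", "child"
    for step in path.split("."):
        if step == "*":
            op = "included"          # the next label may sit at any depth
            continue
        if op == "included" and context == "root":
            context = step           # "step included root" is just "step"
        else:
            inner = f"({context})" if " " in context else context
            context = f"{step} {op} {inner}"
        op = "child"
    if op == "included":
        raise ValueError("a trailing '*' is not covered by this sketch")
    return context

expr = to_region_expr("part.*.subpart.name")
```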
24From path expressions to region expressions
- Answering more complex queries:
  Select X From *.subpart: {name: X, *.supplier.address: "Philadelphia"}
- Translates into the following region algebra expression:
  name child (subpart including (supplier parent (address intersect "Philadelphia")))
- "Philadelphia" denotes a region set consisting of all regions corresponding to the word Philadelphia in the text.
- Such a region set can be computed dynamically using a full-text index.
- Region expressions correspond to simple XPath expressions
25Indexes for Arbitrary Semistructured Data
- A semistructured data instance that is a DAG
26Indexes for Arbitrary Semistructured Data
- The data represents employees and projects in a company.
- Two kinds of employees: programmers and statisticians
- Three kinds of links to projects: leads, workson, consults
- Index graph: a reduced graph that summarizes all paths from the root in the data graph
- Example: node p1. The paths from the root to p1 are labeled with the following five sequences:
  - project
  - employee.leads
  - employee.workson
  - programmer.employee.leads
  - programmer.employee.workson
- Node p2: the paths from the root to p2 are labeled by the same five sequences
- p1 and p2 are language-equivalent
32Indexes for Arbitrary Semistructured Data
- For each node x in the data graph, L_x = {w | ∃ a path from the root to x labeled w}
- Note that L_x will be infinite if the graph has a cycle!
- Two nodes x and y are language-equivalent when they have the same language: ∀x,y: x ≡ y ⇔ L_x = L_y
- Equivalence class of x: [x] = {y | x ≡ y}
- The index I:
  - Nodes(I) = {[x] | x ∈ nodes(G)}
  - Edges(I) = {[x] → [y] | x' ∈ [x], y' ∈ [y], x' → y'}
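A direct, deliberately naive reading of these definitions for an acyclic graph: enumerate every label path from the root to get L_x, group nodes with equal languages, and lift the edges to the classes. The function name `language_index` and the tiny employee/project graph are assumptions for illustration; path enumeration is exponential in the worst case and does not terminate on cyclic data.

```python
from collections import defaultdict

def language_index(root, edges):
    """edges: list of (source, label, target) triples of an acyclic data graph."""
    out = defaultdict(list)
    for x, a, y in edges:
        out[x].append((a, y))

    lang = defaultdict(set)            # node x -> L_x, as a set of label tuples
    def walk(x, word):
        lang[x].add(word)
        for a, y in out[x]:
            walk(y, word + (a,))
    walk(root, ())

    classes = defaultdict(set)         # L_x -> equivalence class [x]
    for x, words in lang.items():
        classes[frozenset(words)].add(x)
    cls = {x: frozenset(c) for c in classes.values() for x in c}

    index_edges = {(cls[x], a, cls[y]) for x, a, y in edges if x in cls}
    return set(cls.values()), index_edges

# Made-up data: two employees, two projects, symmetric links.
edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p2"), ("e2", "workson", "p1"),
]
index_nodes, index_edges = language_index("root", edges)
```

Here five data nodes and eight edges collapse into three index nodes and four index edges, because e1 and e2 (and likewise p1 and p2) are reached by exactly the same label sequences.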
33Indexes for Arbitrary Semistructured Data
- We have the following equivalences:
- e1 ≡ e2
- e3 ≡ e4 ≡ e5
- p1 ≡ p2
- p3 ≡ p4
- p5 ≡ p6 ≡ p7
34Indexes for Arbitrary Semistructured Data
- Computing path expression queries:
  - Compute the query on I and obtain a set of index nodes
  - Compute the union of all their extents; an extent is a list of pointers to all data nodes in the equivalence class
- Example: Select X From statistician.employee.(leads|consults) X
  - Returns nodes h8, h9.
  - Their extents are [p5, p6, p7] and [p8], respectively
  - Result set: [p5, p6, p7, p8]
- Always: size(I) ≤ size(G)
- Efficient when I can be stored in main memory
- Checking x ≡ y is expensive.
35DataGuides
- Goldman & Widom, VLDB '97
- graph data
- arbitrary regular expressions
36DataGuides
- Definition:
- Given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.
  - every path in DB also occurs in G
  - every path in G occurs in DB
  - every path in G is unique
37Dataguides
38DataGuides
- Multiple DataGuides for the same data
39DataGuides
- Definition:
- Let w, w' be two words (i.e. word queries) and G a graph
- w ≡_G w' if w(G) = w'(G)
- Definition:
- G is a strong DataGuide for a database DB if ≡_G is the same as ≡_DB
41DataGuides
- Example:
- G1 is a strong DataGuide
- G2 is not strong
- person.project ≢_DB dept.project
- person.project ≡_G2 dept.project
43DataGuides
- Constructing the strong DataGuide G:
- Nodes(G) = {{root}}
- Edges(G) = ∅
- while changes do
  - choose s in Nodes(G), a in Labels
  - add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
  - add (s -a-> s') to Edges(G)
- Use a hash table for Nodes(G)
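The loop above is essentially the subset construction used to determinize an automaton. A minimal sketch follows, assuming the database is given as (source, label, target) triples; the function name `strong_dataguide` and the sample data are hypothetical, and a Python set of frozensets plays the role of the hash table for Nodes(G).

```python
def strong_dataguide(root, edges):
    """Return (nodes, guide_edges); each DataGuide node is a frozenset of data nodes."""
    out = {}                                   # x -> label -> set of targets
    for x, a, y in edges:
        out.setdefault(x, {}).setdefault(a, set()).add(y)

    start = frozenset([root])
    nodes = {start}                            # hash table of target sets seen so far
    guide_edges = set()
    todo = [start]
    while todo:                                # "while changes do"
        s = todo.pop()
        labels = {a for x in s for a in out.get(x, {})}
        for a in labels:
            s2 = frozenset(y for x in s for y in out.get(x, {}).get(a, ()))
            guide_edges.add((s, a, s2))
            if s2 not in nodes:
                nodes.add(s2)
                todo.append(s2)
    return nodes, guide_edges

# Made-up data: node 5 is a project of both a person and a department.
db = [(0, "person", 1), (0, "person", 2), (0, "dept", 3),
      (1, "project", 4), (2, "project", 5), (3, "project", 5)]
guide_nodes, guide_edges = strong_dataguide(0, db)
```

In this example the target sets {4, 5} (reached by person.project) and {5} (reached by dept.project) overlap, so a data node can appear in more than one extent once the data is not a tree.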
44DataGuides
- How large are the DataGuides?
- If DB is a tree, then size(G) ≤ size(DB)
- Why? Answer: every node is in exactly one extent of G
- Here: DataGuide = XSet
- DataGuides usually fail on data with cyclic schemas.
45T-Indexes
- Milo & Suciu, ICDT '99
- 1-index:
  - data graph
  - arbitrary regular expressions
- 2-index, T-index: for more complex queries, consisting of more regular expressions.
46T-Indexes
- T-index: template index
- Trades space for generality
- The class of paths associated with a given T-index is specified by a path template
- Example 1: P x P y. Here P can be replaced by any regular expression.
- Example 2: (*.Restaurant) x P y. The first regular expression is fixed; this T-index takes less space but is less general.
- T-indexes can be generated efficiently.
- The size of a T-index associated to a single regular expression is at most linear in that of the database
481-Indexes
- Database DB = (V, E, Roots): V is a finite set of nodes, E is a set of labeled edges, Roots is a set of root nodes.
- Regular path expressions:
  - P ::= ε | ƒ | (P|P) | (P.P) | P*, where the ƒ are formulas defined over predicates p1, p2, ... on the set of data values.
- A path: p = v0 -a1-> v1 -a2-> v2 ... vn-1 -an-> vn
- Queries: regular path expressions, q(DB)
- A query path is an expression of the form
  - P1 x1 P2 x2 ... Pn xn, where the xi are variable names and the Pi are path expressions
- A query has the form
  - Select x1, x2, ..., xn from P1 x1 P2 x2 ... Pn xn
491-Indexes
- Path template: t = T1 x1 T2 x2 T3 x3, each Ti a regular expression or one of the placeholders P or F
- Instantiating query paths:
  - A query path q is obtained by instantiating P and F in template t by a regular path expression and some formula, respectively
- Example path template: t = (*.Restaurant) x1 P x2 Name x3 F x4
- Query path instantiations:
  - q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4
  - q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 ( _ is a predicate that is always True)
  - q3 = (*.Restaurant) x1 (ε | _) x2 Name x3 Fridays x4
501-Indexes
- Goal: compute efficiently queries q ∈ inst(P x)
- A first attempt:
- L_u is the set of words on the paths from a root to u.
- That is, all the path queries that lead to u.
- ∀u ∈ V: L_u = {a1...an | v0 -a1-> ... -an-> vn ∈ DB, v0 ∈ Roots, vn = u}
- ∀u,v ∈ V: u ≡ v ⇔ L_u = L_v
- That is, u and v are indistinguishable by path queries from the root.
- ∀u ∈ V: [u] = {v | u ≡ v} is the equivalence class containing u
511-Indexes
- Nodes(I) = {[u] | u ∈ nodes(DB)}
- Edges(I) = {[u] -a-> [u'] | ∃v ∈ [u], ∃v' ∈ [u'], (v -a-> v') ∈ Edges(DB)}
- Roots(I) = {[r] | r ∈ roots(DB)}
- That is, there is an edge labeled a in the index between s and s' if there is an edge labeled a between a node in s and a node in s'.
- q(DB) = {u | [u'] ∈ q(I), u ∈ [u']}
- Inefficient: high construction cost
52Analyzing 1-Indexes
- Storing the 1-index:
  - Associate an oid s to each node in I
  - Store the graph I in standard form
  - Store for each node s its extent(s)
  - Extent(s) = {v | s is an oid for [v]}
- Always: size(I) ≤ size(DB) (unlike DataGuide)
- Always: can compute in O(n log n) time, n = size(DB)
- When DB is a tree:
  - 1-index = DataGuide = XSet
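The slides do not spell out how the construction avoids the expensive language-equivalence test. In the Milo and Suciu paper the 1-index is built from a bisimulation, which refines language equivalence and is cheap to compute. The sketch below uses a naive partition refinement over incoming edges, which is simpler than (and slower than) the O(n log n) algorithm; the function name `one_index` and the sample graph are assumptions.

```python
def one_index(nodes, edges, roots):
    """Group nodes by backward bisimulation; returns (block, extent, index_edges)."""
    incoming = {v: [] for v in nodes}
    for x, a, y in edges:
        incoming[y].append((a, x))

    block = {v: int(v in roots) for v in nodes}       # start: roots vs. non-roots
    while True:
        # A node's signature: its own block plus the (label, block) of each incoming edge.
        sig = {v: (block[v], frozenset((a, block[x]) for a, x in incoming[v]))
               for v in nodes}
        ids = {}
        refined = {v: ids.setdefault(sig[v], len(ids)) for v in nodes}
        if len(ids) == len(set(block.values())):      # no block was split: stable
            break
        block = refined

    extent = {}
    for v in nodes:
        extent.setdefault(block[v], set()).add(v)
    index_edges = {(block[x], a, block[y]) for x, a, y in edges}
    return block, extent, index_edges

# Made-up tree-shaped data rooted at node 1.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, "t", 2), (1, "t", 3), (2, "a", 4), (3, "a", 5), (2, "b", 6)]
block, extent, index_edges = one_index(nodes, edges, roots={1})

# Evaluate the path t.a on the index, then take the union of the extents.
frontier = {block[1]}
for label in ["t", "a"]:
    frontier = {t for s, a, t in index_edges if s in frontier and a == label}
answer = set().union(*(extent[s] for s in frontier))
```

Because the extents partition the data nodes, the index never has more nodes than the database, which is the size(I) ≤ size(DB) guarantee stated above.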
53Analyzing 1-Indexes
- Do we have size(I) << size(DB)? No: there are two worst cases.
- Facts:
  - in theory: except for these two DBs, size(I) << size(DB)
  - in practice: it's a different story. Experiments: size(I) ≈ 1/3 size(DB)
54Evaluating Query Paths with 1-indexes
- Example: evaluate query path P x
- q(DB) is obtained from q(I)
- Let s1, s2, ..., sk be the nodes of I that satisfy the query path P x (each si, 1 ≤ i ≤ k)
- q(DB) = extent(s1) ∪ extent(s2) ∪ ... ∪ extent(sk)
55Evaluating Query Paths with 1-indexes
- Example query: q = t.a x
- The evaluation of q follows two paths t.a in I rather than five in DB, and unions their extents: {7, 13} ∪ {8, 10, 12}
- The extents in a strong DataGuide overlap, hence its storage may be larger
562-Indexes
- Database DB = (V, E, Roots)
- Queries: select x1, x2 from * x1 P x2, with P a regular path expression
- Template: * x1 P x2
- Find pairs of nodes (x1, x2)
- L(u,v): the set of words on the paths from u to v
- L(u,v) = {a1 ... an | u -a1-> ... -an-> v in DB}
- (u,v) ≡ (u',v') ⇔ L(u,v) = L(u',v'), that is, they are indistinguishable by path queries of the form * x1 P x2 evaluated from the root.
582-Indexes
- Nodes(I2) = {[(u,v)] | u,v ∈ Nodes(DB)}
- Roots(I2) = {[(u,u)] | u ∈ Nodes(DB)}
- Edges(I2) = {[(u,v)] -a-> [(u,v')] | v -a-> v' ∈ Edges(DB)}
- Storing I2:
  - The graph
  - Extent(s) = [(v,u)], for each node s representing the equivalence class [(v,u)]
- L(v,u)(DB) = L[(v,u)](I2):
  - L(v,u)(DB) represents the paths between v and u
  - L[(v,u)](I2) represents the paths in the 2-index I2 between some root of the index and [(v,u)]
- Query evaluation:
  - To compute select x, y from * x P y, we compute the query path P y on I2 and take the union of the extents.
  - This saves the search for x, but may have to start at several roots in I2; there is only one root in the case of acyclic databases.
592-Index Example
- Cost: size(I2) is O(n²)
- May be less in practice, similar to PAT trees (Patricia trees) for text databases
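To make the pair-based equivalence concrete, the sketch below computes L(u,v) for every connected pair of a small acyclic graph by plain path enumeration and groups the pairs with equal languages. It illustrates the definition only, not the efficient construction; the function name `two_index_classes` and the four-node graph are assumptions, and the enumeration would not terminate on cyclic data.

```python
def two_index_classes(nodes, edges):
    """Map each language L(u,v) to the set of pairs (u,v) that share it."""
    out = {v: [] for v in nodes}
    for x, a, y in edges:
        out[x].append((a, y))

    lang = {}                                  # (u, v) -> set of label tuples
    def walk(u, v, word):
        lang.setdefault((u, v), set()).add(word)
        for a, w in out[v]:
            walk(u, w, word + (a,))
    for u in nodes:
        walk(u, u, ())                         # every pair (u, u) has the empty word

    classes = {}
    for pair, words in lang.items():
        classes.setdefault(frozenset(words), set()).add(pair)
    return classes

# Made-up diamond-shaped data: 1 -a-> 2, 1 -a-> 3, 2 -b-> 4, 3 -b-> 4.
nodes = [1, 2, 3, 4]
edges = [(1, "a", 2), (1, "a", 3), (2, "b", 4), (3, "b", 4)]
classes = two_index_classes(nodes, edges)

# Pairs (x, y) connected by the word a.b: pick the classes whose language contains it.
ab_pairs = set().union(*(ps for words, ps in classes.items() if ("a", "b") in words))
```

All four pairs (u,u) fall into a single class, matching the remark that an acyclic database yields one root in I2, and the nine connected pairs collapse into four index nodes.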
60Conclusions
- Work on structured text: relevant but restrictive
- Trees are simple: XSet = DataGuides = 1-index (conceptually)
- 1-index scales to cyclic data too
- More complex queries: 2-index, T-index