Managing XML and Semistructured Data - PowerPoint PPT Presentation

Slides: 61

Provided by: web78

Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

Title: Managing XML and Semistructured Data

1
Managing XML and Semistructured Data

Lecture Indexes

7
OEM vs. XML

OEM objects correspond to elements in XML.
Sub-elements in XML are inherently ordered.
XML elements may optionally include a list of
attribute-value pairs.
Graph structure (nodes with multiple incoming edges) is
specified in XML with references (ID and IDREF
attributes), e.g. the project attribute.

9
OEM to XML

Example:
<member project="5 6">
  <name>Jones</name>
  <age>46</age>
  <office>
    <building>gates</building>
    <room>252</room>
  </office>
</member>
This corresponds to the rightmost member in the
example OEM, where project is an attribute.

10
Select x
From A.B x
Where exists y in x.C : y = 5
11
In this lecture

Indexes
XSet
Region algebras
Indexes for Arbitrary Semistructured Data
Dataguides
1-2 indexes
Resources
Index Structures for Path Expressions, by Milo and
Suciu, in ICDT'99
XSet description: http://www.openhealth.org/XSet/
Data on the Web, by Abiteboul, Buneman, Suciu,
section 8.2

12
The problem

Input: large, irregular data graph
Output: index structure for evaluating regular
path expressions

13
The Data

Semistructured data instance: a large graph

14
The queries

Regular path expressions (using Lorel-like syntax):

SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X

SELECT X FROM part._*.supplier.name X
Requires traversing the data from the root and returning all
nodes X reachable by a path matching the given
path expression.

SELECT X FROM part._*.supplier: {name: X, address: "Philadelphia"}
Needs an index on values to narrow the search to the parts of
the database that contain the string
"Philadelphia".
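
The naive evaluation described above (start at the root, follow every path that matches) can be sketched in a few lines of Python. This is only an illustration under assumptions: the graph, the node names, and the helper eval_path are hypothetical, and only three kinds of steps are supported (a label, `_` for any single label, `_*` for any sequence of labels), not full regular expressions.

```python
from collections import deque

def eval_path(graph, root, steps):
    """Return the set of nodes reachable from root by a path matching steps.

    graph maps a node to a list of (label, target) edges.
    Each step is a label, "_" (any one label) or "_*" (any label sequence).
    """
    current = {root}
    for step in steps:
        if step == "_*":
            # Closure: every node reachable by zero or more edges.
            seen = set(current)
            queue = deque(current)
            while queue:
                node = queue.popleft()
                for _label, target in graph.get(node, []):
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
            current = seen
        else:
            # Follow exactly one edge whose label matches the step.
            current = {target
                       for node in current
                       for label, target in graph.get(node, [])
                       if step == "_" or label == step}
    return current

# Hypothetical data graph, loosely modeled on the part/supplier example.
graph = {
    "root": [("part", "p1"), ("part", "p2")],
    "p1": [("name", "n1"), ("supplier", "s1")],
    "p2": [("subpart", "p3")],
    "p3": [("supplier", "s2")],
    "s1": [("name", "n2")],
    "s2": [("name", "n3")],
}
result = eval_path(graph, "root", ["part", "_*", "supplier", "name"])
print(sorted(result))
```

Every query touches the data graph itself, which is the cost that the index structures in the rest of the lecture try to avoid.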
15
Analyzing the problem

What kind of data?
tree data (XML): easier to index
graph data: used in more complex applications
What kind of queries?
restricted regular expressions (e.g. XPath): may
be more efficient

16
XSet: a simple index for XML

Part of the Ninja project at Berkeley
Example XML data

17
XSet: a simple index for XML

Each node: a hash table
Each entry: a list of pointers to data nodes (not
shown)

18
XSet: Efficient query evaluation

(R1) SELECT X FROM part.name X             - yes
(R2) SELECT X FROM part.supplier.name X    - yes
(R3) SELECT X FROM *.supplier.name X       - maybe
(R4) SELECT X FROM part.*.subpart.name X   - maybe

R1: look up part in the root hash
table h1, follow the link to table h2, then look
up name.
R4: following part leads to h2; traverse all
nodes in the index below h2 (corresponding to *), then
continue with the path subpart.name.
Thus, we explore the entire subtree dominated by h2.
This is efficient if the index is small and fits in
memory.
R3: the leading wildcard forces us to consider all
nodes in the index tree, resulting in less
efficient computation than for R4.
Remedy: index the index itself.
Retrieve all hash tables that contain a supplier
entry, then continue a normal search from there.
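
A minimal sketch of such a tree of hash tables, assuming each table stores its child tables plus an extent (the list of pointers to data nodes). The function names and the sample paths are hypothetical and this is not the actual XSet implementation; it only illustrates why R1 is a direct lookup while R4 has to walk the whole subtree below h2.

```python
def new_table():
    # One index node: a hash table of child tables plus an extent.
    return {"children": {}, "extent": []}

def build_index(paths):
    """Build the tree of hash tables from (label_path, data_node) pairs."""
    root = new_table()
    for labels, data_node in paths:
        table = root
        for label in labels:
            table = table["children"].setdefault(label, new_table())
        table["extent"].append(data_node)
    return root

def lookup(table, labels):
    """Follow a fixed label path (as in R1, R2); one hash lookup per label."""
    for label in labels:
        table = table["children"].get(label)
        if table is None:
            return []
    return list(table["extent"])

def subtree(table):
    """Yield table and every table below it (what * has to visit)."""
    yield table
    for child in table["children"].values():
        yield from subtree(child)

def lookup_star(root, prefix, suffix):
    """Evaluate prefix.*.suffix (as in R4): walk the subtree under prefix."""
    start = root
    for label in prefix:
        start = start["children"].get(label)
        if start is None:
            return []
    found = []
    for table in subtree(start):
        found.extend(lookup(table, suffix))
    return found

# Hypothetical label paths and data node ids.
index = build_index([
    (("part", "name"), "n1"),
    (("part", "supplier", "name"), "n2"),
    (("part", "subpart", "name"), "n3"),
    (("part", "subpart", "subpart", "name"), "n4"),
])
r1 = lookup(index, ["part", "name"])
r4 = lookup_star(index, ["part"], ["subpart", "name"])
print(r1, sorted(r4))
```

R3 corresponds to calling lookup_star with an empty prefix, which visits every table in the index.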

19
Region Algebras

Structured text: text with tags (like XML)
New Oxford English Dictionary
Critical limitation: ordered data only (like text)
Assume the data is given as an XML text file, with an
implicit ordering in the file.
Less critical limitation: restricted regular
expressions

20
Region Algebras: Definitions

data: a sequence of characters c1 c2 c3 ...
region: a segment of the text in a file
representation: (x, y) = cx, cx+1, ..., cy, where x is the start
position and y the end position of the region
example: <section> ... </section>
region set: a set of regions s.t. any two
regions are either disjoint or one is included in
the other
example: all <section> regions (may be nested)
Tree data: each node defines a region, and each
set of nodes defines a region set.
example: region p2 consists of the text under p2;
the set {p2, s2, s1} is a region set with three regions

21
Representation of a region set

Example: the <subpart> region set
Region algebra: operators on region sets;
s1 op s2 defines a new region set

22
Region algebra: some operators

s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
s1 included s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊆ r'}
s1 including s2 = {r | r ∈ s1, ∃r' ∈ s2, r ⊇ r'}
s1 parent s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a
parent of r'}
s1 child s2 = {r | r ∈ s1, ∃r' ∈ s2, r is a child of
r'}

Examples:
<subpart> included <part> = {s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
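
The operators can be tried out on regions stored as (start, end) pairs. The sketch below is an assumption-laden illustration: the positions are made up, and parent/child take an extra universe argument (the set of all regions in the document), which is one possible way to decide that no other region lies between the two.

```python
def contains(outer, inner):
    """True if region inner lies within region outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def intersect(s1, s2):
    return {r for r in s1 if r in s2}

def included(s1, s2):
    return {r for r in s1 if any(contains(r2, r) for r2 in s2)}

def including(s1, s2):
    return {r for r in s1 if any(contains(r, r2) for r2 in s2)}

def is_child(r, r2, universe):
    """r is directly nested in r2: no third region sits between them."""
    if r == r2 or not contains(r2, r):
        return False
    return not any(m != r and m != r2 and contains(r2, m) and contains(m, r)
                   for m in universe)

def child(s1, s2, universe):
    return {r for r in s1 if any(is_child(r, r2, universe) for r2 in s2)}

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_child(r2, r, universe) for r2 in s2)}

# Hypothetical character positions for two parts, one subpart, three names.
parts = {(0, 100), (110, 150)}
subparts = {(20, 90)}
names = {(5, 15), (25, 35), (115, 125)}
universe = parts | subparts | names

print(included(subparts, parts))
print(including(parts, subparts))
print(sorted(child(names, parts, universe)))
```

The name at (25, 35) is inside the first part but is a child of the subpart, so it is not returned by child(names, parts, universe).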
23
From path expressions to region expressions

Use region algebra operators to answer regular
path expressions.
Only restricted forms of regular path expressions
can be translated into region algebra operators:
expressions of the form R1.R2...Rn, where each Ri
is either a label constant or the Kleene closure
*.

part.name → name child (part child root)
part.supplier.name → name child (supplier child (part child root))
*.supplier.name → name child supplier
part.*.subpart.name → name child (subpart included (part child root))

Region expressions correspond to simple XPath
expressions
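
The translation pattern in these four examples is mechanical: a plain step becomes child, a step that follows * becomes included, and a leading * drops the root constraint. A small sketch, assuming paths are written as dot-separated strings and that to_region_expr is a hypothetical helper name (a trailing * is not handled):

```python
def to_region_expr(path):
    """Translate a restricted path expression into a region expression string.

    Steps are label constants or "*"; the result uses the child and
    included operators, following the pattern of the four examples.
    """
    expr = "root"
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # Leading *: any region with this label, at any depth.
            expr = step
        else:
            op = "included" if after_star else "child"
            operand = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {operand}"
        after_star = False
    return expr

for p in ["part.name", "part.supplier.name",
          "*.supplier.name", "part.*.subpart.name"]:
    print(p, "->", to_region_expr(p))
```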
24
From path expressions to region expressions

Answering more complex queries:

SELECT X FROM *.subpart: {name: X,
*.supplier.address: "Philadelphia"}

Translates into the following region algebra
expression:

name child (subpart including (supplier parent
(address intersect "Philadelphia")))

"Philadelphia" denotes a region set consisting of
all regions corresponding to the word
"Philadelphia" in the text.
Such a region set can be computed dynamically using a
full-text index.
Region expressions correspond to simple XPath
expressions
25
Indexes for Arbitrary Semistructured Data

A semistructured data instance that is a DAG

26
Indexes for Arbitrary Semistructured Data

The data represents employees and projects in a
company.
Two kinds of employees: programmers and
statisticians
Three kinds of links to projects: leads,
workson, consults
Index graph: a reduced graph that summarizes all
paths from the root in the data graph
Example: for node p1, the paths from the root to p1 are labeled
with the following five sequences:
project
employee.leads
employee.workson
programmer.employee.leads
programmer.employee.workson
Node p2: the paths from the root to p2 are labeled by the same
five sequences
p1 and p2 are language-equivalent

32
Indexes for Arbitrary Semistructured Data

For each node x in the data graph,
Lx = {w | ∃ a path from the root to x
labeled w}
Note that Lx will be infinite if the graph has a
cycle!
Two nodes x and y are language-equivalent (x ≡ y)
when they have the same language:
∀x,y: x ≡ y ⇔ Lx = Ly
Equivalence class of x: [x] = {y | x ≡ y}
Index graph I:
Nodes(I) = {[x] | x ∈ nodes(G)}
Edges(I) = {[x] → [y] | x' ∈ [x], y' ∈ [y], x' → y' is an edge of G}
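
For a small acyclic graph these definitions can be executed literally: enumerate Lx for every node, group nodes with equal languages, and lift the edges to the classes. The sketch below assumes a DAG given as (source, label, target) triples; the node names are hypothetical and the enumeration is exponential in the worst case, so it is meant only to make the definitions concrete.

```python
from collections import defaultdict

def path_languages(edges, root):
    """Compute L_x for every node reachable from root (acyclic graphs only)."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))
    langs = defaultdict(set)

    def walk(node, word):
        langs[node].add(word)
        for label, dst in out[node]:
            walk(dst, word + (label,))

    walk(root, ())
    return langs

def build_index(edges, root):
    """Group nodes by language and lift the edges to equivalence classes."""
    langs = path_languages(edges, root)
    # The class of a node is identified by its (frozen) language.
    cls = {node: frozenset(words) for node, words in langs.items()}
    extents = defaultdict(set)
    for node, key in cls.items():
        extents[key].add(node)
    index_edges = {(cls[s], a, cls[d])
                   for s, a, d in edges if s in cls and d in cls}
    return cls, extents, index_edges

# Hypothetical DAG: two employees who each lead a project.
edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
]
cls, extents, index_edges = build_index(edges, "root")
print(sorted(sorted(nodes) for nodes in extents.values()))
```

Here e1 ≡ e2 and p1 ≡ p2, so five data nodes collapse into three index nodes.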

33
Indexes for Arbitrary Semistructured Data

We have the following equivalences:
e1 ≡ e2
e3 ≡ e4 ≡ e5
p1 ≡ p2
p3 ≡ p4
p5 ≡ p6 ≡ p7

34
Indexes for Arbitrary Semistructured Data

Computing path expression queries:
Compute the query on I and obtain a set of index nodes
Compute the union of their extents (an extent is a list of pointers
to all data nodes in the equivalence class)
Example: SELECT X FROM statistician.employee.(leads|consults) X
Returns index nodes h8, h9.
Their extents are [p5, p6, p7] and [p8],
respectively
Result set: {p5, p6, p7, p8}
Always size(I) ≤ size(G)
Efficient when I can be stored in main memory
Checking x ≡ y is expensive.
35
DataGuides

Goldman & Widom, VLDB '97
graph data
arbitrary regular expressions

36
DataGuides

Definition:
given a semistructured data instance DB, a
DataGuide for DB is a graph G s.t.
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique (each label path occurs exactly once in G)

37
Dataguides

Example

38
DataGuides

Multiple DataGuides for the same data

39
DataGuides

Definition:
Let w, w' be two words (i.e. word queries) and G
a graph.
w ≡G w' if w(G) = w'(G)
Definition:
G is a strong DataGuide for a database DB if ≡G
is the same as ≡DB

41
DataGuides

Example:
G1 is a strong DataGuide
G2 is not strong:
person.project ≢DB dept.project
person.project ≡G2 dept.project

43
DataGuides

Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
  choose s in Nodes(G), a in Labels
  let s' = {y | x in s, (x -a-> y) in Edges(DB)}
  if s' is nonempty: add s' to
  Nodes(G)
  and add (s -a-> s') to Edges(G)
Use a hash table for Nodes(G)
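
The loop above is essentially the subset construction used to make an automaton deterministic. A runnable sketch, assuming the database is a list of (source, label, target) triples and using frozensets as DataGuide nodes (a Python set of frozensets plays the role of the hash table); the sample database and the function name are hypothetical:

```python
from collections import defaultdict, deque

def strong_dataguide(edges, root):
    """Build the strong DataGuide of a database by subset construction.

    Each DataGuide node is a frozenset of database nodes (its target set).
    """
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset([root])
    nodes = {start}              # hash table of DataGuide nodes seen so far
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Collect, per label a, the set s' = {y | x in s, x -a-> y in DB}.
        targets = defaultdict(set)
        for x in s:
            for label, ys in out[x].items():
                targets[label] |= ys
        for label, ys in targets.items():
            t = frozenset(ys)
            guide_edges.add((s, label, t))
            if t not in nodes:   # only new target sets are explored further
                nodes.add(t)
                queue.append(t)
    return nodes, guide_edges

# Hypothetical database: persons and a department pointing at projects.
db = [
    ("root", "person", "a"), ("root", "person", "b"), ("root", "dept", "d"),
    ("a", "project", "p1"), ("b", "project", "p2"), ("d", "project", "p1"),
]
nodes, guide_edges = strong_dataguide(db, "root")
print(len(nodes), len(guide_edges))
```

Because already-seen target sets are never re-explored, the loop also terminates on cyclic data, although the number of distinct target sets can be exponential in the size of the database.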

44
DataGuides

How large are DataGuides?
if DB is a tree, then size(G) ≤ size(DB)
why? answer: every node is in exactly one extent
of G
here DataGuide = XSet

DataGuides usually fail on data with cyclic
schemas, like

45
T-Indexes

Milo & Suciu, ICDT '99
1-index:
data graph
arbitrary regular expressions
2-index, T-index: for more complex queries,
consisting of more regular expressions.

46
T-Indexes

T-index: template index
Trades space for generality
The class of paths associated with a given
T-index is specified by a path template
Example 1: P x P y. Here P can
be replaced by any regular expression.
Example 2: (*.Restaurant) x P y. The first
regular expression is fixed; this T-index takes
less space but is less general.
T-indexes can be generated efficiently.
The size of a T-index associated with a single
regular expression is at most linear in that of
the database.
48
1-Indexes

Database: DB = (V, E, Roots), where V is a finite set of
nodes, E is a set of labeled edges, and Roots is a set of
root nodes.
Regular path expressions:
P ::= ε | ƒ | (P|P) | (P.P) | P*, where the ƒ
are formulas defined over predicates p1, p2, ... on
the set of data values.
A path: p = v0 -a1-> v1 -a2-> v2 ... vn-1 -an-> vn
Queries: regular path expressions; q(DB) is the answer of q on DB
A query path is an expression of the form
P1 x1 P2 x2 ... Pn xn, where the xi are variable names and
the Pi are path expressions
A query has the form
select x1, x2, ..., xn from P1 x1 P2 x2 ... Pn xn
49
1-Indexes

Path template: t = T1 x1 T2 x2 T3 x3 ..., where each Ti is a
regular expression, or P, or F
Instantiating query paths:
a query path q is obtained by instantiating P and F
with a regular path expression and a formula,
respectively, in template t
Example: path template t = (*.Restaurant) x1 P
x2 Name x3 F x4
Query path instantiations:
q1 = (*.Restaurant) x1 * x2 Name x3 Fridays
x4
q2 = (*.Restaurant) x1 * x2 Name x3 _ x4
( _ is a predicate that is always True)
q3 = (*.Restaurant) x1 (ε | _) x2 Name x3
Fridays x4
50
1-Indexes

Goal: efficiently compute queries q ∈ inst(P
x)
A first attempt:
Lu is the set of words on paths from
a root to u.
That is, all the path queries that lead to u.
∀u ∈ V: Lu = {a1...an | v0 -a1-> ... -an-> vn in DB, v0 ∈ Roots,
vn = u}
∀u,v ∈ V: u ≡ v ⇔ Lu = Lv
That is, u and v are indistinguishable by path
queries from the root.
∀u ∈ V:
[u] = {v | u ≡ v} is the equivalence class
containing u
51
1-Indexes

Index I:
Nodes(I) = {[u] | u ∈ nodes(DB)}
Edges(I) = {[u] -a-> [u'] | ∃v ∈ [u], ∃v' ∈ [u'],
(v -a-> v') ∈ Edges(DB)}
That is, there is an edge labeled a in the index
between s and s' if there is an edge labeled a
between a node in s and a node in s'.
Roots(I) = {[r] | r ∈ roots(DB)}

q(DB) = {u | ∃[u'] ∈ q(I), u ∈ [u']}
Example
Inefficient: high construction cost
52
Analyzing 1-Indexes

Storing the 1-index:
Associate an oid s with each node in I
Store the graph I in standard form
Store, for each node s, extent(s)
Extent(s) = {v | s is the oid for [v]}
Always size(I) ≤ size(DB) (unlike DataGuides)
Can always be computed in O(n log n) time, n = size(DB)
When DB is a tree:
1-index = DataGuide = XSet

53
Analyzing 1-Indexes

Do we have size(I) << size(DB)? No. Two worst
cases
Facts:
in theory: except for these two DBs, size(I) <<
size(DB)
in practice: it's a different story. Experiments:
size(I) ≈ 1/3 size(DB)

54
Evaluating Query Paths with 1-indexes

Example: evaluate query path P x
q(DB) is obtained from q(I):
Let s1, s2, ..., sk (1 ≤ i ≤ k) be the nodes of I
that satisfy query path P x
q(DB) = extent(s1) ∪ extent(s2) ∪ ... ∪ extent(sk)

55
Evaluating Query Paths with 1-indexes

Example: query q = t.a x
The evaluation of q follows two paths t.a in I
(rather than five in DB) and unions their extents:
{7, 13} ∪ {8, 10, 12}
The extents in a strong DataGuide overlap, hence
its storage may be larger
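
The two-step evaluation (run the query on I, then union the extents) can be sketched as follows. The index below is hypothetical: only the two extents {7, 13} and {8, 10, 12} are taken from the example, while the index node names s0 to s4 and the edge structure are assumptions made for illustration.

```python
def eval_on_index(index_edges, index_roots, extents, labels):
    """Evaluate a label path on the index, then union the extents.

    index_edges: list of (source, label, target) triples over index nodes.
    extents: maps an index node to the set of data nodes it represents.
    Returns (matching index nodes, matching data nodes).
    """
    current = set(index_roots)
    for label in labels:
        current = {dst for src, a, dst in index_edges
                   if src in current and a == label}
    result = set()
    for s in current:
        result |= extents.get(s, set())
    return current, result

# Hypothetical 1-index in which two different paths are labeled t.a.
index_edges = [
    ("s0", "t", "s1"), ("s0", "t", "s2"), ("s0", "u", "s2"),
    ("s1", "a", "s3"), ("s2", "a", "s4"),
]
extents = {"s3": {7, 13}, "s4": {8, 10, 12}}

hit_nodes, answer = eval_on_index(index_edges, {"s0"}, extents, ["t", "a"])
print(sorted(hit_nodes), sorted(answer))
```

Only the index graph is traversed; the data nodes are touched once, when the extents are read out.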

56
2-Indexes

Database: DB = (V, E, Roots)
Queries: select x1, x2 from * x1 P x2, with P a
regular path expression
Template: * x1 P x2
Find pairs of nodes (x1, x2)
L(u,v) = set of words on the paths between u and v:
L(u,v) = {a1...an | u -a1-> ... -an-> v in DB}
(u,v) ≡ (u',v') ⇔ L(u,v) = L(u',v'), that is,
they are indistinguishable by path queries of the
form * x1 P x2.
58
2-Indexes

I2:
Nodes(I2) = {[(u,v)] | u,v ∈ Nodes(DB)}
Roots(I2) = {[(u,u)] | u ∈
Nodes(DB)}
Edges(I2) = {[(u,v)] -a-> [(u,v')] | (v -a-> v') ∈
Edges(DB)}
Storing I2:
The graph
Extent(s) = [(v,u)], for each node s representing
the equivalence class [(v,u)]
L(v,u)(DB) = L[(v,u)](I2), where
L(v,u)(DB) represents the paths between v and u in DB and
L[(v,u)](I2) represents the paths in the 2-index
I2 between some root of the index and [(v,u)]
Query evaluation:
To compute select x, y from * x P y, we compute
the query path P y on I2 and take the union of
the extents.
This saves the search, but may have to start at
several roots in I2; there is only one root in the case of
acyclic databases.