Efficient Evaluation of Regular Path Expressions on Streaming XML Data

1
Efficient Evaluation of Regular Path Expressions
on Streaming XML Data

By Zachary G. Ives, Alon Y. Levy and Daniel S. Weld

2
Table of Contents

A bit about XML (yes, again)
Our goal, problem and solution
Our XML data model
How to ask questions ?

3
Table of Contents

X-scan operation and structure
Digging deep into x-scan
How good is it ? Performance Evaluation
Conclusion

4
A Bit About XML (yes, again)

XML: the eXtensible Markup Language
Has become a standard
Useful for the dissemination and exchange of information

5
A Bit About XML (yes, again)

Advantages
Simple
Self-describing nature
Flexible
Represents both structured and semi-structured
data

6
XML Structure

An XML document consists of:
Elements: pairs of matching open and close tags.
Elements may enclose additional elements or data values.
Attributes: included in element tags.
Attributes are single-valued and describe the element.

7
XML Structure (Cont.)

ID is a special attribute that uniquely identifies the element.
IDREF attributes form links to other elements in the document.
Combining ID and IDREF forms a graph structure rather than just a tree structure.

8
XML Example
We will use this example throughout the rest of
the lecture
9
Our Goal

Our goal is to perform queries and search
operations on the XML document.
Several query languages have been proposed.
They represent the XML document as a graph.

10
Our Goal (Cont.)

They represent the query as a regular path expression that should be matched against the XML source.
These regular path expressions describe
traversals along edges in the XML graph.
The variables in the query are mapped to XML
elements along these paths.

11
Our Problem

Most XML query processors work by:
Loading the data into a local repository
Building indexes on the repository
Processing the query
The repository is one of:
A relational database
An object-oriented database
A repository of semi-structured data

12
Our Problem (Cont.)

Storing and indexing the data locally is expensive, especially when the query is made over streams of incoming XML.
The streams can come from many sources, some fast and some slow.
Sometimes we want a partial answer, but as soon as possible.

13
Our Solution

The query can be performed while the data streams
in.
The XML-Scan (x-scan) operator does exactly that.
It is used at the lowest level of the query plan and supplies data to other operators.

14
The X-Scan Operator

Input:
An XML data stream.
A set of regular path expressions.
Output:
A stream of bindings for the variables occurring in the expressions.
The bindings are produced incrementally, as the
XML data is streaming in.

15
The X-Scan Operator (Cont.)

The entire graph can be constructed in a single pass.
X-Scan simultaneously:
Parses the XML data.
Indexes nodes by their IDs.
Resolves IDREFs.
Returns the nodes that match the path expressions of the query.

16
The X-Scan Operator (Cont.)

Some issues in the X-Scan operation are:
Dealing with possibly cyclic data
Preserving the order of elements
Removing duplicate bindings that are generated due to multiple paths to the same elements

17
Data Model for XML

Naturally, the XML data model is a graph.
Each XML tag is an edge labeled with the tag name.
The edge is directed to a node whose label is the tag's ID (if the element has no ID, it gets a number).
A given element node has labeled edges directed to its attribute values, sub-elements, and any other elements referenced via IDREF.
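
As a rough illustration of this data model, the sketch below turns a small document into an edge-labeled graph. The document, the `IDREF_ATTRS` set and the function name `build_graph` are all assumptions made for this example (in practice the DTD declares which attributes are IDREFs), and edges to plain attribute values and text are left out to keep it short.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical document, loosely modeled on the lecture's running example.
DOC = """
<db>
  <lab ID="baselab" manager="smith1">
    <name>Seattle Bio Lab</name>
    <location><city>Seattle</city></location>
  </lab>
  <person ID="smith1"><name>Smith</name></person>
</db>
"""

IDREF_ATTRS = {"manager"}  # assumption: normally this comes from the DTD


def build_graph(xml_text):
    """Return {node: [(edge_label, target_node), ...]} for the document."""
    fresh = count(1)
    edges = {"root": []}

    def visit(elem, parent):
        # The node label is the element's ID, or a fresh number if it has none.
        node = elem.get("ID") or next(fresh)
        edges.setdefault(node, [])
        edges[parent].append((elem.tag, node))
        for attr, value in elem.attrib.items():
            if attr in IDREF_ATTRS:
                edges[node].append((attr, value))  # edge to the referenced element
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), "root")
    return edges


graph = build_graph(DOC)
for node, out_edges in graph.items():
    print(node, out_edges)
```

The `manager` edge from `baselab` to `smith1` is what makes the result a graph rather than a tree.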

18
Data Model for XML (Cont.)

Example is always the best way

19
How to ask questions ?

A variety of query languages have been proposed.
The key feature in all of these languages is the
use of regular path expressions over the data.
Most of them also give the answer to the query as an XML document.
X-Scan uses XML-QL.

20
The XML-QL Syntax

The syntax of XML-QL is:

WHERE pattern1 IN source1, pattern2 IN source2, ... CONSTRUCT result

The i-th pattern template is matched against the XML data graph from the i-th source, and the resulting tuples are formatted as described in result.
21
The XML-QL Syntax (Cont.)

An XML-QL pattern is a set of nested tags with embedded variable names (prefixed by $) that specify bindings of graph nodes to variables.
The CONSTRUCT clause specifies a tree-structured
set of edges and nodes to add to the output graph
for each tuple of variable bindings.

22
The XML-QL Syntax (Cont.)

Again, an example is the best way.
Let's look at:

WHERE <db> <lab> <name>$n</> <_*><city>$c</></> </> ELEMENT_AS $l </> IN "fig1.xml"
CONSTRUCT <result> <center> <name>$n</> <location>$c</> </> </>
23
The XML-QL Syntax (Cont.)

As we can see, the result will be:

<result>
<center> <name>Seattle Bio Lab</name> <location>Seattle</location> </center>
<center> <name>PMBL</name> <location>Philadelphia</location> </center>
</result>
24
The XML-QL Syntax (Cont.)

If a variable is bound to a node with sub-elements, the whole sub-graph is inserted into the resulting graph.
We will use dot-notation to describe the X-Scan operation.
The previous example is rewritten as:
El = root.db.lab
En = El.name
Ec = El._*.city

25
The X-Scan Place

The goal of the X-Scan operator is therefore to
produce a set of bindings for each pattern in the
WHERE clause.

26
So, What Does X-Scan Do ?

Given the XML stream and a set of regular path expressions, X-Scan outputs a stream of tuples assigning binding values to each variable in the set of regular path expressions.
The central mechanism is a set of state machines
that traverse the XML graph, trying to satisfy
the path expressions.

27
What is it made of ?

The data components of X-Scan are

28
Where the data flows?

As the data streams into the system, several structures are created:
The data is parsed and stored locally
A structural index of the XML graph is created
An ID index records the IDs of all elements and
their location in the structural index
A list of references to not-yet-seen element IDs
is maintained

29
Where the data flows?

In parallel with the creation of those structures, a set of finite state machines performs a depth-first search (DFS) over the partial structural index.
When a machine reaches an accepting state, a new value is added to the binding-value table of that machine.
Those values are later combined to form the complete answer.

30
Example problems

It sounds easy, but there are still some problems to deal with, for example:
The handling of cycles
How to prune duplicate bindings as they are created? Remember that X-Scan is an online operator.

31
The State Machines

As described earlier, we create one regular expression, in dot-notation, for every variable in the query.
So we build a finite-state machine for each expression.
State transitions correspond to edge traversals in the XML data graph.

32
The State Machines (Cont.)

The end of the path expression yields an accepting state, which outputs instances of the corresponding variable.
When one variable depends on another variable, the accepting state of the other variable's machine points to the state machine of the first one.
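
To make the idea of one machine per path expression concrete, here is a minimal sketch. It assumes a dot-notation in which `_*` is a wildcard step matching any sequence of edges, and it simulates the deterministic machine by tracking a set of positions in the expression. The function names are hypothetical and are not taken from the X-Scan implementation.

```python
def compile_path(expr):
    """Split a dot-notation path such as 'db.lab' or '_*.city' into steps."""
    return expr.split(".")


def closure(steps, positions):
    """A '_*' step may match zero edges, so also allow the position after it."""
    result = set(positions)
    pending = list(positions)
    while pending:
        pos = pending.pop()
        if pos < len(steps) and steps[pos] == "_*" and pos + 1 not in result:
            result.add(pos + 1)
            pending.append(pos + 1)
    return frozenset(result)


def start_state(steps):
    return closure(steps, {0})


def transition(steps, state, label):
    """Follow one edge labeled `label`; an empty result means the machine is stuck."""
    moved = set()
    for pos in state:
        if pos == len(steps):
            continue  # already past the last step
        if steps[pos] == "_*":
            moved.add(pos)       # the wildcard loop consumes any edge
        elif steps[pos] == label:
            moved.add(pos + 1)   # a labeled step consumes a matching edge
    return closure(steps, moved)


def accepting(steps, state):
    return len(steps) in state


city = compile_path("_*.city")
state = start_state(city)
for edge in ["location", "city"]:
    state = transition(city, state, edge)
print(accepting(city, state))
```

Each frozen set of positions plays the role of one state of the deterministic machine, so a state transition still corresponds to exactly one edge traversal.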

33
The State Machines (Cont.)

And back to our example

34
Indexing the XML Graph

The structural index should allow x-scan to
quickly traverse the XML data graph.
Each node in the index contains
The ID of the element and its offset in the
document
Pointers to all the sub-elements, attributes and
IDREFs of the element.
Essentially it looks like the graph, except for the leaves.

35
The Algorithm Step by Step

X-Scan proceeds by building the structural index
and running a set of active state machines in
parallel.
The core of the algorithm is the way those state machines run, so let's focus on that by running our example.

36
The Algorithm Step by Step

Initially, only the top level machine is active.
When a machine M reaches an accepting state, it
produces a binding b for its variable, writes it
and the parent value to its table and activates
all of its dependent state machines.

37
The Algorithm Step by Step

Those machines remain active while x-scan is
scanning b or any element accessible by a path
from b.
The final output of x-scan is the equi-join of
all the appropriate tables.
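
The final equi-join can be sketched as follows. The table layout (each dependent machine stores pairs of its own value and the parent binding) follows the description above, while the function name and the concrete rows, borrowed from the lecture's walkthrough, are assumptions for illustration.

```python
def join_tables(l_table, n_table, c_table):
    """Equi-join the dependent tables with the parent table on the parent binding l."""
    tuples = []
    for l in l_table:
        for n, n_parent in n_table:
            if n_parent != l:
                continue
            for c, c_parent in c_table:
                if c_parent == l:
                    tuples.append({"l": l, "n": n, "c": c})
    return tuples


# Rows as (value, parent value); Ml has no parent, so it stores values only.
l_table = ["baselab", "lab2"]
n_table = [(2, "baselab"), (6, "lab2")]
c_table = [(4, "baselab"), (7, "lab2")]

for row in join_tables(l_table, n_table, c_table):
    print(row)
```

A nested-loop join is used here only for brevity; any join method that matches rows on the parent value would give the same tuples.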

38
The Algorithm By Example

Ml is initialized in state 1 as the only active machine.

39
The Algorithm By Example

The root has a db edge, so the machine's current state is pushed onto its stack and the machine moves to state 2 with value node 1.

40
The Algorithm By Example

Next, x-scan follows the first outgoing edge, pushes the old state and value, and sets Ml to state 3 with value baselab.

41
The Algorithm By Example

Since Ml is now in an accepting state:
The baselab value is written to the Ml table
Ml is suspended
Mn and Mc are activated

42
The Algorithm By Example

The next edge takes Mn from state 4 to state 5.
Mc follows its loop back to state 6.
Both machines have node 2 as their binding value.

43
The Algorithm By Example

Since Mn is now in an accepting state, x-scan writes <2, baselab> into Mn's table.
Since no edges remain to explore, x-scan pops the stack and backs up the state machines, resetting Mn to state 4 and Mc to state 6.

44
The Algorithm By Example

The next edge is labeled location, so:
Mn stays in state 4
Mc also stays in state 6 but advances to node 3
Then Mc advances to state 7 on the city edge to node 4

45
The Algorithm By Example

At this point x-scan writes <4, baselab> into Mc's table.
It can also produce the first tuple of bindings:
<l/baselab, n/2, c/4>

46
The Algorithm By Example

X-Scan keeps running Mc, but no more cities are found.
It pops back up to baselab.
Running Mc along the IDREF to smith1 gives no more cities.

47
The Algorithm By Example

Now Mn and Mc are deactivated and control returns to Ml.
X-Scan pops back up to node 1 in state 2.
The other lab edge yields another tuple:
<l/lab2, n/6, c/7>

48
Where should we go ?

On occasion x-scan will encounter an IDREF to a node that has not yet been parsed.
Such an unknown node will simply not be in the ID index.

49
Where should we go ?

When X-Scan hits such an unseen reference, it:
Pauses all the relevant state machines
Adds an entry to the list of unresolved IDREFs: <desired ID value, referrer's address>
Continues to parse and build the structural index

50
Where should we go ?

Once the target element is parsed, x-scan:
Fills its address into each referring IDREF in the structural index
Removes the entry from the list of unresolved IDREFs
Awakens the state machines and proceeds
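
The bookkeeping for unresolved references might look roughly like the sketch below. The class `StructuralIndex`, its dictionary-based nodes and the `add_element` method are hypothetical simplifications; pausing and waking the state machines is reduced to returning the IDs of the referrers that can now proceed.

```python
class StructuralIndex:
    """Toy ID index with a list of unresolved IDREFs (a simplified sketch)."""

    def __init__(self):
        self.id_index = {}    # element ID -> index node
        self.unresolved = {}  # desired ID -> [(referring node, attribute), ...]

    def add_element(self, elem_id, idrefs):
        """Record a newly parsed element; return IDs of referrers to wake up."""
        node = {"id": elem_id, "refs": {}}
        for attr, target_id in idrefs.items():
            target = self.id_index.get(target_id)
            node["refs"][attr] = target
            if target is None:
                # Target not parsed yet: remember who is waiting for it.
                self.unresolved.setdefault(target_id, []).append((node, attr))
        self.id_index[elem_id] = node

        # Patch every reference that was waiting for this element.
        woken = []
        for referrer, attr in self.unresolved.pop(elem_id, []):
            referrer["refs"][attr] = node
            woken.append(referrer["id"])
        return woken


index = StructuralIndex()
print(index.add_element("baselab", {"manager": "smith1"}))  # smith1 not seen yet
print(index.add_element("smith1", {}))                      # baselab can resume
```

Keying the unresolved list by the desired ID means that resolving a reference costs one lookup when the target finally arrives.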

51
Are We Going in Circles ?

Sometimes the input XML graph contains cycles.
X-Scan must not get trapped in an infinite loop.

52
Are We Going in Circles ?

Consider the following XML graph:

[Figure: XML graph with nodes 1, 2, 3 and 4 and edges labeled a and b]
53
Are We Going in Circles ?

If we refuse to move in circles, we will miss the answer to the query
Ex = root._*.b.a
But if we allow moving in circles, we are going to get in trouble with this one:
Ey = root._*.z

54
Are We Going in Circles ?

What can we do ?
Now we are going to use the stack.
The stack contains pairs of the form (binding, state), describing which bindings have been associated with states of the machine along the current path.

55
Are We Going in Circles ?

Since x-scan uses deterministic finite state
machines, returning to a previous state with the
same binding will not add any new possible
actions.
So, when a machine enters a state, it should check that this state has not been bound to the same binding along the current path.
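
The check can be sketched as a depth-first search that remembers the (binding, state) pairs on the current path. The graph below is a made-up cyclic example, not the one from the slide, `_*` is assumed to be a wildcard step matching any sequence of edges, and the path expressions are given relative to the root node.

```python
def closure(steps, positions):
    """A '_*' step may match zero edges, so also allow the position after it."""
    result = set(positions)
    pending = list(positions)
    while pending:
        pos = pending.pop()
        if pos < len(steps) and steps[pos] == "_*" and pos + 1 not in result:
            result.add(pos + 1)
            pending.append(pos + 1)
    return frozenset(result)


def transition(steps, state, label):
    """Advance the machine over one edge; an empty set means no match."""
    moved = set()
    for pos in state:
        if pos < len(steps):
            if steps[pos] == "_*":
                moved.add(pos)
            elif steps[pos] == label:
                moved.add(pos + 1)
    return closure(steps, moved)


def evaluate(graph, expr, root="root"):
    """Return the nodes bound by `expr`, visiting cycles only while useful."""
    steps = expr.split(".")
    on_path = set()  # (binding, state) pairs along the current path
    bindings = []

    def visit(node, state):
        if (node, state) in on_path:
            return  # same binding in the same state: nothing new to learn
        on_path.add((node, state))
        if len(steps) in state:
            bindings.append(node)
        for label, target in graph.get(node, []):
            nxt = transition(steps, state, label)
            if nxt:
                visit(target, nxt)
        on_path.remove((node, state))

    visit(root, closure(steps, {0}))
    return bindings


# Hypothetical graph with the cycle 1 -a-> 2 -b-> 3 -a-> 1.
CYCLIC = {"root": [("a", 1)], 1: [("a", 2)], 2: [("b", 3)], 3: [("a", 1)]}
print(evaluate(CYCLIC, "_*.b.a"))
print(evaluate(CYCLIC, "_*.z"))
```

In the first query node 1 is bound only after going around the cycle once, because it is revisited in a different machine state. The second query finds nothing and still terminates, since node 1 is reached again in a state it already holds on the current path.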

56
Are We Going in Circles ?

Is it working for our example ?
Look at those state machines.

57
Some Enhancements

It is important to prevent the operator from
spending time evaluating paths that are not
useful.
There are two enhancements.
Selection Push-Down.
Duplicate elimination.

58
Selection Push-Down

The query optimizer creates selection operators and pushes them down into the x-scan operator.
This works only on attributes, since they are single-valued.
So X-Scan evaluates all of a node's attribute edges before its sub-element edges.

59
Selection Push-Down

Here also, the best way to explain is by example.

WHERE <db> <lab manager="smith1"> <name>$n</> <_*><city>$c</></> </> ELEMENT_AS $l </> IN "fig1.xml"
CONSTRUCT <result><center> <name>$n</> <location>$c</> </></>
60
Selection Push-Down

The query plan generator must create an additional temporary variable temp1 and a regular path expression:
Etemp1 = El.@manager
It also adds a selection predicate:
Etemp1 = "smith1"

61
Selection Push-Down

Now, for the second lab: since it has only an ID attribute, X-Scan iterates through all of lab2's attributes and finds no manager attribute.
So it can short-circuit on this sub-graph, discarding the value of l and ignoring its children.
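
A small sketch of the short-circuit follows. The list-of-dictionaries representation of the labs and the function `scan_labs` are assumptions for illustration; the counter only shows how many sub-elements are actually visited once the predicate is pushed down.

```python
def scan_labs(labs, predicate):
    """Check attribute predicates first; skip the sub-graph when they fail."""
    bindings = []
    visited_children = 0
    for lab in labs:
        attrs = lab["attrs"]
        if any(attrs.get(name) != wanted for name, wanted in predicate.items()):
            continue  # short-circuit: discard l and ignore its children
        for label, value in lab["children"]:
            visited_children += 1
            if label == "name":
                bindings.append((attrs["ID"], value))
    return bindings, visited_children


LABS = [
    {"attrs": {"ID": "baselab", "manager": "smith1"},
     "children": [("name", "Seattle Bio Lab"), ("location", "Seattle")]},
    {"attrs": {"ID": "lab2"},
     "children": [("name", "PMBL"), ("location", "Philadelphia")]},
]

print(scan_labs(LABS, {"manager": "smith1"}))
```

With the predicate in place only the two children of the first lab are visited; with an empty predicate all four would be.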

62
Duplicate Elimination

Sometimes we can visit an element multiple times
through different paths.
This can produce duplicate binding tuples.
The naive way is to add a post-processing stage.
It can be done more cleverly, so that there is no need to save the entire history.

63
There is a big one on the way

Main memory may not be large enough to hold all of the index structures.
This is handled by:
Paging the XML source document
Paging the structural index
A conventional buffer manager using LRU or a similar policy is sufficient.
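
For reference, an LRU buffer of the conventional kind can be sketched in a few lines. The class name `LRUPageBuffer` and the `load_page` callback are hypothetical; a real buffer manager would also handle dirty pages and pinning.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """Keeps at most `capacity` pages in memory, evicting the least recently used."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page  # callback that fetches a page from disk
        self.pages = OrderedDict()  # page number -> page contents, oldest first
        self.misses = 0

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # mark as most recently used
            return self.pages[page_no]
        self.misses += 1
        page = self.load_page(page_no)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return page


buffer = LRUPageBuffer(2, lambda n: f"page-{n}")
for page_no in [1, 2, 1, 3]:
    buffer.get(page_no)
print(list(buffer.pages), buffer.misses)
```

After the four accesses page 2 has been evicted, because page 1 was touched more recently.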

64
There is a big one on the way

But what about the ID lookup index, the list of unresolved IDREFs and the state machine stack?
They use either a B-Tree or a multilevel hash table.
The size of the stack is bounded by the product of the number of variables and the length of the longest non-repeating path.
The stacks of inactive state machines can be naturally paged out.

65
How good is it ?

The authors used the IBM XML4C parser, version 3.0.1, with the SAX parser API to implement X-Scan.
The SAX API provides callbacks to the code as elements are read, allowing X-Scan to evaluate streaming XML data.
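
The callback style can be illustrated with Python's standard `xml.sax` module, used here as a stand-in for the XML4C SAX API. The handler class and the two-chunk input are assumptions for this example; the point is only that the parser is fed pieces of the document and reports elements through callbacks rather than returning a finished tree.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records open and close events as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("open", name))

    def endElement(self, name):
        self.events.append(("close", name))


handler = EventRecorder()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# The document arrives in two chunks, as it might from a network stream.
parser.feed("<db><lab ID='baselab'><name>Seattle")
parser.feed(" Bio Lab</name></lab></db>")
parser.close()

print(handler.events)
```

A streaming operator would run its state machines inside these callbacks instead of collecting the events in a list.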

66
How good is it ?

They compared X-Scan against:
Stanford's Lore semi-structured/XML database system
A commercial OO-based XML repository
The experiments were performed with locally stored XML files, so X-Scan loses some of its advantages.

67
How good is it ?

Also, this X-Scan implementation didn't include selection predicates.
All the queries were run on a single-processor 450 MHz Pentium II with 256 MB of memory.
X-Scan and the OO-based system ran on Windows NT, and Lore ran on Linux.

68
How good is it ?

The queries were run on the following documents.
Mondial and VLDB contain many references, whereas the rest are mostly tree-structured.

69
How good is it ?
70
How good is it ?

Conclusions.
Neither Lore nor the commercial system scales up well to queries across multi-megabyte data files.
They failed particularly on files that contain graph structure.
X-Scan scales better in all cases.

71
How good is it ?

Further experiments were conducted on synthetic XML data files.
These XML files were randomly generated.
The aim was to check the scalability of X-Scan.
They averaged three different runs across each of three different random graphs with the same generation parameters.

72
How good is it ?

The first experiments were conducted on tree-structured data.
Therefore they didn't have to build the structural index or the ID and IDREF tables.
This was to check how well the state machines work.

73
How good is it ?

The results were

74
How good is it ?

Next, they wanted to measure the cost of graph indexing and resolving references, without traversing the IDREFs.
They took the same graphs from the previous tests and changed the DTD back so that the data would be treated as a graph.
The results were:

75
How good is it ?
76
How good is it ?
77
How good is it ?

Next, they wanted to check the effectiveness of
the structural index when called to evaluate such
reference edges.
The results were.

78
How good is it ?
79
How good is it ?
80
Conclusion

X-Scan differs from previous work in three key ways:
The structural index allows more efficient traversal without splitting the data into tables or objects that must be recombined later
X-Scan's state machines are based on the query rather than on the data source; they are not reusable, but the data is always reread anyway
X-Scan is pipelined and produces bindings as data is being streamed into the system

81
Conclusion (Cont.)

Other points regarding X-Scan:
It handles cycles well
It preserves the document order and structure
It eliminates duplicate tuples
The state machines are independent and so can run in parallel
X-Scan is very efficient, typically imposing 8% overhead on top of the time required to parse the XML document

82
THE END
