Title: Queries on Tree-Structured Data: Logical Languages
1Queries on Tree-Structured Data: (Logical) Languages, Complexity
- Christoph Koch
- Universität des Saarlandes
- Saarbrücken, Germany
2Query Languages for Trees
- Languages that select (tuples of) nodes in unranked ordered node-labeled trees
- FO, MSO, conjunctive queries, (monadic) datalog, XPath
- Queries on trees with data values
- XPath with data values
- Tree transformation languages
- XSLT, tree transducers
- XQuery (functional programming language on trees)
3XPath Examples
/descendant::a/child::b
/descendant::a/child::b[descendant::c and not(following-sibling::d)]
/descendant::a/child::b[following-sibling::d]
(Figure: the example document tree, with node labels a, b, c, d, shown once per query.)
4Queries on Trees
5Complexity of XPath
XPath: The Language
6XPath Examples
/descendant::b[position() = 2]
/descendant::*[position() mod 2 = 1]
/descendant::b[position() = last()]
(Figure: the example document tree for each query, with nodes numbered in document order.)
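7XPath Examples
/descendant::c/ancestor::*[position() = 2]
/descendant::*[id(concat(text(), "cd"))/child::c]
(Figure: example document trees. In the first, the ancestors of a c node are numbered 1 and 2. In the second, a node <c>ab</c> refers via id("abcd") to the node <b id="abcd">.)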
8XPath and Fragments
- Core XPath: logically clean core of XPath
- All axes of XPath (incl. ancestor, following, ...)
- Node tests: label checks (axis::label)
- Conditions (predicates): existence of paths, boolean expressions (and, or, not).
- Wadler fragment: Core XPath + position arithmetics.
- Full XPath: id/idref, string manipulations, ...
9Core XPath Example
/descendant::a/child::b
(Figure: the example document tree with node labels a, b, c, d.)
10Complexity of XPath
XPath: State of the Art
11XT vs. XALAN (Java implementations, UNIX)
Appending /parent::a/b to the query doubles its running time (for branching factor 2 of document)!!!
Queries: a/b/parent::a/b/.../parent::a/b
Document (fixed): <a> <b/> <b/> </a>
12Observation
- THE SYSTEMS REQUIRE TIME EXPONENTIAL IN THE SIZE OF THE XPATH QUERY!!!
- Interpretation follows.
- (However, as we will show, it can be done in polynomial time (for full XPath!!!))
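The doubling effect can be reproduced with a small experiment. The following is a minimal sketch, not the code of XT, XALAN or any other system named in these slides: `naive_eval` is a hypothetical evaluator that processes every context node separately and never merges duplicates, while `set_eval` (also hypothetical) keeps one deduplicated node set per location step. The document is the fixed one from the slides, and the query is evaluated relative to its root element a.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)  # identity-based hashing, so nodes can live in sets
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def make_doc() -> Node:
    """The fixed document <a> <b/> <b/> </a>."""
    root = Node("a")
    root.children = [Node("b", parent=root), Node("b", parent=root)]
    return root


def query(k):
    """Steps of b/parent::a/b/.../parent::a/b with k repetitions of parent::a/b."""
    return [("child", "b")] + [("parent", "a"), ("child", "b")] * k


def apply_step(node, axis, label):
    """Nodes reached from one context node via a child or parent step."""
    if axis == "child":
        candidates = node.children
    else:
        candidates = [node.parent] if node.parent is not None else []
    return [n for n in candidates if n.label == label]


def naive_eval(node, steps, counter):
    """Context-at-a-time evaluation without duplicate elimination."""
    counter[0] += 1  # one more node visit
    if not steps:
        return [node]
    axis, label = steps[0]
    result = []
    for n in apply_step(node, axis, label):
        result.extend(naive_eval(n, steps[1:], counter))
    return result


def set_eval(start, steps):
    """Set-at-a-time evaluation: one deduplicated node set per step."""
    current, work = {start}, 0
    for axis, label in steps:
        work += len(current)
        current = {m for n in current for m in apply_step(n, axis, label)}
    return current, work


if __name__ == "__main__":
    doc = make_doc()
    for k in range(6):
        counter = [0]
        naive_eval(doc, query(k), counter)
        _, work = set_eval(doc, query(k))
        print(k, counter[0], work)
```

In this sketch the number of visits of the naive evaluator roughly doubles with every appended parent::a/b, whereas the set-based variant grows by a constant amount per step.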
13XPath Query /a/b/parent::a/b/parent::a/b
Document
<a> <b/> <b/> </a>
(Figure, built up step by step over slides 13 to 18: the tree of nodes visited during evaluation, with alternating levels of a and b nodes that grow wider with each location step.)
Tree of nodes visited is of size [...] !!!
19IE6 (native code, Windows)
MSXML4: constant-factor improvement as compared to MSXML3/IE6.
Queries (below: size 3): a/b[count(parent::a/b[count(parent::a/b[count(parent::a/b) > 1]) > 1]) > 1]
20Efficient XPath Processing
- Theorem: Full XPath is in polynomial time w.r.t. combined complexity (both query and data assumed variable).
- Dynamic programming algorithm.
- [Gottlob, K., Pichler, "Efficient Algorithms for Processing XPath Queries", Proc. VLDB 2002]
21Documents and Contexts
Example document
<a> <b/> <b/> <b/> <b/> </a>
Contexts: a context consists of
- a node,
- a position in the node set,
- the size of the node set.
Context-value Tables (CVT)
- Four types of values in XPath (nset, num, str, bool)
- Defined for each XPath expression e.
- The CVT of e is a relation between contexts and the values of e.
22Query Trees
Query: descendant::b/following-sibling::*[position() != last()]
(Figure: the query tree of this expression.)
Evaluation idea: traverse the query tree bottom-up and compute the CVT of each subexpression using the CVTs of the constituent subexpressions.
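The evaluation idea can be sketched on the example document with four b children. This is a simplified illustration under stated assumptions, not the algorithm of the cited papers: the node names `a`, `b1` to `b4` and the helpers `step_cvt` and `compose` are made up for this sketch, and tables for node-set expressions are indexed by the context node only, since position and size do not influence their value.

```python
# Example document <a> <b/> <b/> <b/> <b/> </a>, nodes listed in document order.
NODES = ["a", "b1", "b2", "b3", "b4"]
LABEL = {"a": "a", "b1": "b", "b2": "b", "b3": "b", "b4": "b"}
CHILDREN = {"a": ["b1", "b2", "b3", "b4"]}


def descendants(x):
    out = []
    for c in CHILDREN.get(x, []):
        out.append(c)
        out.extend(descendants(c))
    return out


def following_siblings(x):
    for kids in CHILDREN.values():
        if x in kids:
            return kids[kids.index(x) + 1:]
    return []


# All contexts (node, position, size) with 1 <= position <= size <= |NODES|.
CONTEXTS = [(x, k, n) for x in NODES
            for n in range(1, len(NODES) + 1) for k in range(1, n + 1)]

# CVTs of the leaves of the query tree.
cvt_position = {c: c[1] for c in CONTEXTS}
cvt_last = {c: c[2] for c in CONTEXTS}
# CVT of the inner node "position() != last()", built only from its children's CVTs.
cvt_pred = {c: cvt_position[c] != cvt_last[c] for c in CONTEXTS}


def step_cvt(axis, test, pred=None):
    """Table of a location step: context node -> list of selected nodes."""
    table = {}
    for x in NODES:
        cand = [y for y in axis(x) if test(y)]
        if pred is not None:
            # Each candidate is checked in the context (node, position, size).
            cand = [y for i, y in enumerate(cand, 1) if pred[(y, i, len(cand))]]
        table[x] = cand
    return table


def compose(t1, t2):
    """Table of p1/p2 from the tables of p1 and p2 (results in document order)."""
    out = {}
    for x in NODES:
        reached = {z for y in t1[x] for z in t2[y]}
        out[x] = [n for n in NODES if n in reached]
    return out


cvt_desc_b = step_cvt(descendants, lambda y: LABEL[y] == "b")
cvt_fs = step_cvt(following_siblings, lambda y: True, cvt_pred)
cvt_query = compose(cvt_desc_b, cvt_fs)

if __name__ == "__main__":
    print(cvt_query["a"])  # result of the query when evaluated at node a
```

Every table is computed once from the tables of its subexpressions, so no subexpression is re-evaluated for the same context.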
23Best Algorithms for XPath
- Bottom-up algorithm: time [...], space [...] (memory consumption).
- Improved top-down algorithm: time [...], space [...].
- Outside-in strategy: minimization of contexts.
- Prototype implementation exists: www.xmltaskforce.com
- Core XPath + position arithmetics (Wadler Fragment): time [...], space [...].
- Core XPath: time [...].
- n: size of data, q: size of query.
- [Gottlob, K., Pichler, VLDB 2002; ICDE 2003]
24Core XPath with data values
25Core XPath with data values
26Core XPath with data values
27Combined Complexity
Core XPath is P-hard!
P-hard: not parallelizable (conj.)
In NC: massively parallelizable
XPath without negation, , concatenation, multiplication is in LOGCFL!
[Gottlob, K., Pichler, PODS 2003]
28Combined Complexity
- LOGCFL is in NC2: highly parallelizable, solvable in O(log^2 n) time with polynomially many processors.
- Core XPath can be processed in time O(n). However, parallel algorithms for LOGCFL-hard problems do not run in linear time on a fixed number of processors!
- => not realistic
[Gottlob, K., Pichler, PODS 2003]
29Combined Complexity
P-hardness: highly unlikely that there are algorithms for XPath which do not take at least polynomial memory => Refutation of automata-based techniques.
[Gottlob, K., Pichler, PODS 2003]
30Combined Complexity
- Path queries with downward axes in LOGSPACE: current SDI algorithms (XFilter, XTrie) are highly suboptimal!
- However, naive algorithm takes time O(n^3) and will certainly not become linear!
[Gottlob, K., Pichler, PODS 2003]
31Data and Query Complexity
- XPath is in L (data complexity).
- Also [Segoufin, PODS 2003]
- PF is L-hard under NC1-reductions (data complexity).
- XPath without multiplication, concatenation is in L w.r.t. query complexity.
(Figure: data complexity of XPath and PF: L-complete under NC1-reductions.)
[Gottlob, K., Pichler, PODS 2003]
32Conjunctive Queries over Trees
Conjunctive Queries
33Conjunctive Queries
- FO queries w/o disjunction, negation, universal quantification.
- Importance in database theory
- Great success story, very nice properties (e.g., containment is decidable).
- Well-studied (complexity, optimization, ...)
34Data Model: Tree Signatures
- Any unary relations (node labels)
- Binary relations: Child, Child+, Child*, Nextsibling, Nextsibling+, Nextsibling*, Following
- Subsumes XPath axes
- Descendant = Child+
- Descendant-or-self = Child*
- Following-sibling = Nextsibling+
- Reverse axes (e.g. Parent) redundant in CQs.
35Conjunctive Query Example
36Applications
- XML Queries: XPath, XQuery
- Data Extraction and Integration
- Monadic datalog rules: Lixto [Baumgartner, Gottlob, Flesca 2001]
- Internal query language of MARS XQuery Rewriter [Deutsch, Tannen 2003].
- Computational Linguistics
- Queries on Parse Trees: lpath [Bird et al., 2005]
- Dominance Constraints [Marcus, Hindle, Fleck, 1983]: incomplete specification of parse trees.
- Higher-order Unification: Context Matching Problem
37Cyclic Query Example (from Computational Linguistics)
38Complexity Results
(combined complexity)
(Partition of set of axes!)
39Conjunctive Queries over Trees
Ptime Results
40Hemichordal Relations
41Special Case R ⊆
42Examples of HC Relations
- Child is <_bflr-hemichordal
(Figure: a tree with nodes numbered 1 to 12 and a configuration marked "Never true!")
43Examples of HC Relations
44Examples of non-HC Relations
45PTime CQ Eval on HC Structures
46PTime CQ Eval on HC Structures
47Properties of HC-Relations
48Properties of HC-Relations
49Properties of HC-Relations
- For HC structures: Global consistency = arc-consistency!
50Complexity Results
(combined complexity)
(Partition of set of axes!)
51Conjunctive Queries over Trees
Expressive Power: CQ ⊆ APQ
52CQ ⊆ APQ
- Acyclic positive query (APQ): union of acyclic conjunctive queries.
- Each APQ can be rewritten (in linear time) into XPath.
- Theorem [Gottlob, K., Schulz, PODS 2004]. Each CQ can be expressed as an APQ.
53Conjunctive Queries over Trees
Exponential Blowup CQ => APQ
54Succinctness of Conj. Queries
(Figure: the n-diamond query Dn.)
Theorem [Gottlob, K., Schulz, PODS 2004]. There is no polynomial mapping of the Dn to equivalent APQs.
55Summary: Conjunctive Queries
- Complete characterization of tractability frontier (P/NP) of CQs over trees (in terms of axis relations).
- Machinery for proving Ptime-Results
- <-hemichordality property
- Expressive power of CQs over trees
- Each CQ is equivalent to an acyclic positive query (APQ)
- Blow-up from CQs to APQs is exponential and necessarily so.
56Queries on Trees
Monadic Datalog
57Relevance: Web Wrapping
- Extract structured data from unstructured Web pages.
- Numerous applications, e.g., extract current list of book prices from amazon.com Web site.
- Commercial wrapping system Lixto [Baumgartner, Flesca, Gottlob, VLDB 2001] uses monadic datalog as kernel language.
58Scope of Wrapping
(Figure: a document tree with its root marked.)
- Assign new labels to some nodes of the tree.
- Drop others as irrelevant.
- No need for arbitrary data transformations.
- Labeling: unary information extraction functions over domain of nodes.
- Navigate recursively in trees.
59Monadic Datalog on Trees
- Unary IDB predicates.
- Over unranked ordered trees: signature τ_U with predicates firstchild, lastchild, nextsibling, label_l, root, leaf.
- Shortcuts, e.g. subelem (ancestor accessible via a given path); rewrite using firstchild and nextsibling.
(Figure: an example tree with its firstchild, lastchild and nextsibling edges.)
60How complex is Monadic Datalog?
- Previously known facts on full Datalog over Graphs:
- Data complexity of datalog: P-complete (impl. in [Vardi 88])
- Combined complexity: EXPTIME-complete (impl. in [Vardi 88])
- Comb. compl. of sirups: EXPTIME-cplt. [Gottlob, Papadimitriou 99]
- Comb. compl. of monadic datalog: NP-complete (folklore?)
Theorem [Gottlob & K. 2002]: Monadic Datalog over τ_U has combined complexity O(|data| · |query|).
Query complexity: P-complete and linear-time.
61Proof idea
1.) Transform datalog program + input tree in linear time into a ground propositional logic program
- Exploit functional dependencies, e.g. in nextsibling(X,Y): each connected rule has only a linear number of ground instances.
- Decouple independent atoms of rule bodies: the rule
p(X) ← q(X) ∧ r(Y) ∧ nextsibling(X,Z) ∧ s(Z).
becomes
p(X) ← q(X) ∧ r ∧ nextsibling(X,Z) ∧ s(Z). r ← r(Y).
2.) Execute ground program in linear time by using well-known algorithms [Dowling & Gallier; Minoux]
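The two steps can be sketched for the decoupled rule above. This is a minimal illustration and an assumption-laden one: the tiny sibling chain, the facts, and the function name `horn_least_model` are invented for this sketch, and the evaluation uses the standard counter-based technique for propositional Horn programs in the spirit of Dowling and Gallier rather than any code from the cited work.

```python
from collections import defaultdict, deque


def horn_least_model(rules):
    """Least model of a ground Horn program.

    rules: list of (head, body) pairs, body being a list of ground atoms.
    Each rule keeps a counter of body atoms not yet derived, so every
    (rule, body atom) pair is touched at most once: linear time overall.
    """
    remaining = [len(set(body)) for _, body in rules]
    watchers = defaultdict(list)  # atom -> indices of rules with it in the body
    for i, (_, body) in enumerate(rules):
        for atom in set(body):
            watchers[atom].append(i)
    true = set()
    queue = deque(head for head, body in rules if not body)
    while queue:
        atom = queue.popleft()
        if atom in true:
            continue
        true.add(atom)
        for i in watchers[atom]:
            remaining[i] -= 1
            if remaining[i] == 0:
                queue.append(rules[i][0])
    return true


# Hypothetical input: three siblings 1 -> 2 -> 3 and a few unary facts.
nodes = [1, 2, 3]
nextsibling = [(1, 2), (2, 3)]
facts = [("q", 1), ("q", 2), ("s", 2), ("r", 3)]

# Step 1: grounding. Because nextsibling is functional, the connected rule
# p(X) <- q(X), r, nextsibling(X,Z), s(Z) has one instance per sibling edge.
ground = [(f, []) for f in facts]
ground += [(("nextsibling", x, z), []) for x, z in nextsibling]
ground += [(("p", x), [("q", x), ("r",), ("nextsibling", x, z), ("s", z)])
           for x, z in nextsibling]
# The decoupled rule r <- r(Y) has one instance per node.
ground += [(("r",), [("r", y)]) for y in nodes]

# Step 2: evaluate the ground program.
model = horn_least_model(ground)

if __name__ == "__main__":
    print(sorted(a for a in model if a[0] == "p"))
```

Without decoupling, the variable Y would range over all nodes independently of X and Z, and the number of ground instances of the original rule would be quadratic rather than linear.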
62Expressiveness Result
- Theorem. Given a unary query Q over <firstchild, lastchild, nextsibling, label_a>,
- Q is definable in MSO iff Q is definable as a monadic datalog program.
63Monadic Datalog and TMNF
- Example
- D0(x) :- Root(x).
- D1(x) :- D0(x0), First-Child(x0, x).
- D0(x) :- D1(x0), First-Child(x0, x).
- D0(x) :- D0(x0), Next-Sibling(x0, x).
- D1(x) :- D1(x0), Next-Sibling(x0, x).
- TMNF (tree-marking normal form): restricted syntax
- P(x) :- P1(x), P2(x).
- P(x) :- P0(x0), R(x0, x).
- P(x) :- P0(x0), R(x, x0).
- D0: nodes at even depth in tree.
- D1: nodes at odd depth in tree.
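The example program can be run as a plain fixpoint computation. The sketch below is illustrative only: the function name `even_odd_depth` and the four-node tree are assumptions, and the loop is a naive fixpoint iteration rather than the linear-time evaluation discussed in these slides.

```python
def even_odd_depth(root, first_child, next_sibling):
    """Naive fixpoint of the D0/D1 program.

    first_child and next_sibling are sets of (x0, x) pairs.
    Returns (D0, D1): the nodes at even depth and at odd depth.
    """
    d0, d1 = {root}, set()  # D0(x) :- Root(x).
    changed = True
    while changed:
        changed = False
        for x0, x in first_child:
            # D1(x) :- D0(x0), First-Child(x0, x).
            if x0 in d0 and x not in d1:
                d1.add(x)
                changed = True
            # D0(x) :- D1(x0), First-Child(x0, x).
            if x0 in d1 and x not in d0:
                d0.add(x)
                changed = True
        for x0, x in next_sibling:
            # D0(x) :- D0(x0), Next-Sibling(x0, x).
            if x0 in d0 and x not in d0:
                d0.add(x)
                changed = True
            # D1(x) :- D1(x0), Next-Sibling(x0, x).
            if x0 in d1 and x not in d1:
                d1.add(x)
                changed = True
    return d0, d1


# Hypothetical tree: r has children c1, c2; c1 has a single child g1.
FIRST_CHILD = {("r", "c1"), ("c1", "g1")}
NEXT_SIBLING = {("c1", "c2")}

if __name__ == "__main__":
    d0, d1 = even_odd_depth("r", FIRST_CHILD, NEXT_SIBLING)
    print(sorted(d0), sorted(d1))
```

Note how the depth parity travels down through First-Child, where it flips, and sideways through Next-Sibling, where it is preserved.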
64Summary: Monadic Datalog
- [Gottlob & K., JACM 2004]
- M.dl.o.t. can be evaluated in time O(|Program| · |Data|).
- M.dl.o.t. captures the unary MSO queries over trees.
- [Gottlob & K., LICS 2002], [Frick, Grohe, K., LICS 2003]
- Linear-time reduction to TMNF.
- Linear-time reduction also from Core XPath to TMNF (negation!)
- [Grohe and Schweikardt, CSL 2003]
- But M.dl. much less succinct than MSO, monadic fixpoint logic (nonelementarily less succinct unless FPT = W[1]).
- However, no problems observed in practice yet.
- Wrapping: Killer Application for Datalog?
- Also known: Containment for monadic datalog (on arbitrary finite structures) is decidable! [Cosmadakis, Gaifman, Kanellakis, Vardi, 1988]
65Monadic Datalog with XPath Axes
- It would be desirable to support XPath axes (child, descendant, ...) in monadic datalog queries.
- But how difficult is it to evaluate such queries?
- Proposition. Monadic datalog is in NP w.r.t. combined complexity (arbitrary finite structures).
- Observation: Monadic datalog over some signature τ is in P (resp., NP-complete) iff the conjunctive queries over τ are.
66Queries on Trees
Summary
67Expressiveness
68Complexity
69End
70State of the Art: XPath
- XT, XALAN, IE, SAXON: exponential time w.r.t. size of queries.
- XQuery algebra motivates Exptime implementation.
- Queries tend to be small, but
- Not small enough in practice (some queries with m location steps require time [...]).
- XPath as a tree pattern matching language: queries are not that short.
71Complexity of XPath
P-Time Bottom-up XPath Evaluation
72PTime Evaluation Example
Document Tree
Query Tree
(In fact, this is only a relevant subset of the
full tables.)
Query result: b2, b3
79Context-Value Table Principle
If the CVT for each operation can be computed in polynomial time given the CVTs for sub-expressions, then the CVT of the overall query can be computed (bottom-up) in polynomial time.
80Queries on Trees
Proof: PF is NL-complete
82Where can we go from v2 in one step?
- Reachable from v2 in one step: v1, v3!
89PF is NL-hard.
- Reachability in precisely m steps
- Add loop at each node to graph
- Set m = |E|.
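The loop trick in this reduction can be checked on a small graph. In this sketch the graph is hypothetical (only the fact that v1 and v3 are the one-step successors of v2 is taken from the slides), and `reach_exactly` and `reachable` are illustrative names. A self-loop at every node lets a walk idle, so "reachable in precisely m steps" coincides with "reachable in at most m steps", and m = |E| is enough because a simple path never uses more than |E| edges.

```python
def reach_exactly(edges, src, m):
    """Nodes reachable from src by a walk of precisely m edges."""
    current = {src}
    for _ in range(m):
        current = {v for (u, v) in edges if u in current}
    return current


def reachable(edges, nodes, src):
    """Plain reachability, expressed as reachability in precisely |E| steps
    on the graph extended with a self-loop at each node."""
    looped = set(edges) | {(v, v) for v in nodes}
    return reach_exactly(looped, src, len(edges))


# Hypothetical graph in which v1 and v3 are the one-step successors of v2.
NODES = ["v1", "v2", "v3", "v4"]
EDGES = [("v2", "v1"), ("v2", "v3"), ("v3", "v4")]

if __name__ == "__main__":
    print(sorted(reach_exactly(EDGES, "v2", 1)))
    print(sorted(reach_exactly(EDGES, "v2", len(EDGES))))  # no loops: walks die out
    print(sorted(reachable(EDGES, NODES, "v2")))
```

Without the loops, walks of precisely |E| steps may simply not exist, which is why the second call differs from the third.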
90Conjunctive Queries over Trees
various
91Conjunctive Queries in MARS
- Example from [Deutsch & Tannen, VLDB 2003]
- XQuery:
<result> for a in distinct(//author/text()) return <item> <writer> a </writer> for b in //book, a1 in b/author/text(), t in b/title where a = a1 return t </item> </result>
- Conjunctive Query:
Q(a,b,a1,t) :- //author/text()(a), //book(b), ./author/text()(b,a1), ./title(b,t), a = a1.
92Contributions: Conjunctive Queries
- Complete characterization of tractability frontier (P/NP) of CQs over trees (in terms of axis relations).
- Machinery for proving Ptime-Results
- <-hemichordality property
- Expressive power of CQs over trees
- Each CQ is equivalent to an acyclic positive query (APQ)
- Blow-up from CQs to APQs is exponential and necessarily so.
93Hemichordal Relations
94Hemichordata
- half the properties of chordata.
- Hemichordality somewhat reminiscent of chordality.
95Conjunctive Queries over Trees
NP-Hardness Results
96NP-Hardness Proofs
- All by reduction from 1-in-3 3SAT
- In each clause, precisely one variable must be true.
- No negative literals.
- Problem NP-Complete.
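For readers unfamiliar with the source problem, the following brute-force checker states it precisely. It is a sketch for illustration only: the function name `one_in_three_sat` is an assumption, the search is exponential in the number of variables, and it has nothing to do with the gadget constructions of the reductions. The two clauses in the usage example reuse the variable names A to E that appear with the clause gadgets.

```python
from itertools import product


def one_in_three_sat(variables, clauses):
    """Return an assignment making precisely one variable per clause true,
    or None if there is none. Clauses contain positive literals only."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(sum(assignment[v] for v in clause) == 1 for clause in clauses):
            return assignment
    return None


if __name__ == "__main__":
    # Two clauses sharing the variable B.
    print(one_in_three_sat("ABCDE", [("A", "B", "C"), ("B", "D", "E")]))
    # All four 3-element clauses over four variables: no solution exists.
    print(one_in_three_sat(
        "ABCD",
        [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")]))
```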
97Clause Gadget, ChildFollowing
98NP-Hardness, ChildFollowing
Clause 1: (P1, Q1, R1)
Clause 2: (P2, Q2, R2)
...
Clause n: (Pn, Qn, Rn)
(Figure: clause gadgets connected via Following; shared variables link clauses, e.g. Clause i: (A,B,C) and Clause j: (B,D,E).)
99Clause Gadget, NextsiblingFollowing
100Clause Gadget, ChildChild
101Conjunctive Queries over Trees
Expressive Power: CQ ⊆ APQ
102CQ ⊆ APQ
- Theorem [Gottlob, K., Schulz, PODS 2004]. Each CQ can be expressed as an APQ.
- Proof idea: bottom-up rewriting of query
- polynomially many steps
- At most 2 (3) alternatives in each step.
- In each step, move a v up in the query graph.
- Takes exponential time, query may be of exp. size.
109Conjunctive Queries over Trees
Exponential Blowup CQ => APQ
110n-Diamond Query, Models
(Figure: the n-diamond query Dn.)
111Step 1
112Step 2
113Step 3
114Step 4
- There are only polynomially many paths in Qi, but there are exponentially many in Dn.
- => There is (much more than) one path of Dn which is not in Qi.
- We construct a model M from Qi and Dn s.t. Qi is true on M but Dn is false on M (because that path is not satisfied).
115Step 4
(Figure, slides 115 to 117: the model M, with the topmost X1 and the bottommost X2 marked.)
M satisfies (b) but not (a)!!!
118Queries on Trees
Monadic Datalog
119HTML Example
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html> <body> <h1>People @ DBAI</h1>
<table border="1" cellpadding="3" cellspacing="1">
<tr> <td>Georg Gottlob</td> <td>gottlob@dbai.tuwien.ac.at</td> <td>18420</td> </tr>
<tr> <td>Christoph Koch</td> <td>koch@dbai.tuwien.ac.at</td> <td>18449</td> </tr>
</table> </body> </html>
(Figure: the rendered page, headed "People @ DBAI".)
125Monadic Datalog as a Wrapping Language
entry(X) :- root(R), subelem_html.body.table.tr(R, X).
name(X) :- entry(E), firstchild(E, X), label_td(X).
email(X) :- name(N), nextsibling(N, X), label_td(X).
phone(X) :- email(M), nextsibling(M, X), label_td(X).
126Monadic Datalog as a Wrapping Language
- Rules simple -> visual specification
- Select root as start pattern.
- Define destination pattern entry by selecting a document region.
- Selection is interpreted relative to start pattern.
- Add further constraints.
128Monadic Datalog as a Wrapping Language
(Figure: the document tree root, html, body, table, with two tr nodes that each have three td children.)
130Monadic Datalog as a Wrapping Language
<?xml version="1.0"?>
<peopledb>
<entry> <name>Georg Gottlob</name> <email>gottlob@dbai.tuwien.ac.at</email> <phone>18420</phone> </entry>
<entry> <name>Christoph Koch</name> <email>koch@dbai.tuwien.ac.at</email> <phone>18449</phone> </entry>
</peopledb>
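The wrapper program entry/name/email/phone can be traced on the example page with a few lines of code. This is a minimal sketch under stated assumptions and not how Lixto works internally: the `Node` class, the hand-built tree (including the artificial `root` node above `html`), and the helpers `subelem` and `wrap` are all made up for illustration, and each rule is evaluated once in dependency order instead of by a general datalog engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False, repr=False)  # identity-based hashing, so nodes can live in sets
class Node:
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def td(text):
    return Node("td", text)


# Hand-built tree for the example page.
DOC = Node("root", children=[Node("html", children=[Node("body", children=[
    Node("h1", "People @ DBAI"),
    Node("table", children=[
        Node("tr", children=[td("Georg Gottlob"),
                             td("gottlob@dbai.tuwien.ac.at"), td("18420")]),
        Node("tr", children=[td("Christoph Koch"),
                             td("koch@dbai.tuwien.ac.at"), td("18449")]),
    ]),
])])])


def all_nodes(n):
    """All nodes of the subtree rooted at n, in document order."""
    yield n
    for c in n.children:
        yield from all_nodes(c)


def subelem(node, path):
    """Nodes reached from node by following the child labels in path."""
    frontier = [node]
    for label in path:
        frontier = [c for n in frontier for c in n.children if c.label == label]
    return frontier


def wrap(doc):
    first_child = {(p, p.children[0]) for p in all_nodes(doc) if p.children}
    next_sibling = {(a, b) for p in all_nodes(doc)
                    for a, b in zip(p.children, p.children[1:])}
    # entry(X) :- root(R), subelem_html.body.table.tr(R, X).
    entry = set(subelem(doc, ["html", "body", "table", "tr"]))
    # name(X) :- entry(E), firstchild(E, X), label_td(X).
    name = {x for e, x in first_child if e in entry and x.label == "td"}
    # email(X) :- name(N), nextsibling(N, X), label_td(X).
    email = {x for n, x in next_sibling if n in name and x.label == "td"}
    # phone(X) :- email(M), nextsibling(M, X), label_td(X).
    phone = {x for m, x in next_sibling if m in email and x.label == "td"}
    labels = [("name", name), ("email", email), ("phone", phone)]
    # Keep the relabeled nodes, grouped by entry, and drop everything else.
    return [{lab: c.text for c in e.children for lab, s in labels if c in s}
            for e in all_nodes(doc) if e in entry]


if __name__ == "__main__":
    for record in wrap(DOC):
        print(record)
```

The result of `wrap` mirrors the XML output: only the nodes that received one of the new labels are kept, grouped by entry.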
131Queries on Trees
XQuery
132Expressiveness 2