Title: Flexible Queries over Semistructured Data
- Yaron Kanza
- Yehoshua Sagiv
- The Hebrew University
2Overview of the Talk
- New semantics for queries over semistructured data
- New results for
  - Query evaluation
  - Query equivalence
  - Database equivalence (databases can be equivalent even if they are not identical!)
  - Transforming a database into a tree
3Why is it Difficult to Formulate Queries over
Semistructured Data?
It is difficult to design queries
- Data does not conform to a rigid schema
- The structure of the database changes frequently, so queries have to be rewritten frequently
- Data is contributed by many users in a variety of designs, so a query has to deal with different structures of data
- The description of the schema is large (e.g., a DTD of XML), so it is difficult to use the schema for formulating queries
4A University Scenario
University Website Database
5Database
- Following OEM, the database is represented as a
rooted labeled directed graph
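A minimal sketch of such a rooted, edge-labeled directed graph in Python. The class name `Graph`, its fields and the small database fragment are assumptions made for illustration; the fragment is not the database drawn in the talk, it only mimics the situation where a teacher can appear below or above a course.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A rooted, edge-labeled directed graph in the spirit of OEM."""
    root: int
    edges: list                                  # (source, label, target) triples
    values: dict = field(default_factory=dict)   # atomic values stored at leaves

    def successors(self, node):
        """Return (label, target) pairs for the edges leaving `node`."""
        return [(lab, dst) for src, lab, dst in self.edges if src == node]

    def reachable(self, start):
        """Return every node reachable from `start`, including `start`."""
        seen, stack = {start}, [start]
        while stack:
            for _, dst in self.successors(stack.pop()):
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen


# Hypothetical fragment: a teacher below a course (2 -> 6)
# and a course below a teacher (3 -> 8).
db = Graph(
    root=1,
    edges=[
        (1, "Course", 2), (2, "Title", 5), (2, "Teacher", 6), (6, "Name", 9),
        (1, "Teacher", 3), (3, "Name", 7), (3, "Course", 8), (8, "Title", 10),
    ],
    values={5: "Logic", 9: "A. Cohen", 7: "B. Levi", 10: "Databases"},
)
```

Storing edges as plain triples keeps the later matching conditions easy to state, since each condition is phrased in terms of labeled edges and reachability.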
6[Figure: the university database graph, rooted at node 1 (University), with nodes 1 to 15, edges labeled Course, Teacher, Title and Name, and the values Logic, OS, Databases, Compilers, A. Cohen, B. Levi and C. Katz]
A teacher node can either be below or above a
course node
Thus, it is difficult to write a query that
looks for all the teachers and their courses
7Queries
- Queries are represented as rooted labeled directed graphs
- The nodes of the graph are considered as variables
8[Figure: a query graph rooted at r (University), with variables u, v, w and y and edges labeled Course, Course, Teacher, Teacher and Name]
A query that finds all pairs of courses taught
by the same teacher
However, if courses are descendants of teachers in the database, the query has to be
reformulated
Instead, we propose new ways of matching queries
to databases
9A Rigid Matching
- The query root is mapped to the db root
- A query edge with label l is mapped to a db edge with label l (and, hence, a path is mapped to a path)
- This is the usual semantics for queries (e.g., Lorel, XML-QL, XQL)
[Figure: a rigid matching from the query (root r) to the database (root 1); a query edge labeled l is mapped to a database edge labeled l]
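The two conditions of a rigid matching can be checked one edge at a time. The following is a sketch under assumptions: graphs are given as lists of `(source, label, target)` triples, and the function name, the query and the database fragment are all hypothetical.

```python
def is_rigid_matching(q_root, q_edges, d_root, d_edges, mapping):
    """Check the two rigid conditions: root to root, and edge to edge."""
    if mapping[q_root] != d_root:
        return False
    db_edge_set = set(d_edges)
    # Every query edge (s, l, t) must land on a database edge with label l.
    return all((mapping[s], lab, mapping[t]) in db_edge_set
               for s, lab, t in q_edges)


# Hypothetical database: teacher 6 is below course 2, course 8 is below teacher 3.
db_edges = [(1, "Course", 2), (2, "Teacher", 6),
            (1, "Teacher", 3), (3, "Course", 8)]

# Hypothetical query: a course u and, below it, a teacher w.
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w")]

teacher_below = is_rigid_matching("r", query_edges, 1, db_edges,
                                  {"r": 1, "u": 2, "w": 6})
teacher_above = is_rigid_matching("r", query_edges, 1, db_edges,
                                  {"r": 1, "u": 8, "w": 3})
```

In this fragment only the course with the teacher below it is matched; the pair where the teacher sits above the course is rejected because there is no Course edge from node 1 to node 8.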
10[Figure: the university database graph with the query variables u, v and w marked on the database nodes they are mapped to]
A Rigid Matching Example
Another Rigid Matching
11A Semiflexible Matching
- The query root is mapped to the db root
- A query node with an incoming label l is mapped to a db node with an incoming label l
- The image of every query path is embedded in some database path
- A strongly connected component (SCC) is mapped to an SCC
The last two conditions cannot be verified
locally, i.e., by considering one query edge at
a time
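A brute-force sketch of the semiflexible test, restricted to the case where both the query and the database are acyclic, so the SCC condition is trivial and is not checked. It follows the definition literally by enumerating root-to-sink paths, which can be exponential in number; it is not the polynomial procedure referred to in the talk. It also assumes every node is reachable from the root, so that every path is contained in some root-to-sink path. All names and the example data are hypothetical.

```python
def incoming_labels(edges, node):
    """Labels of the edges that enter `node`."""
    return {lab for _, lab, dst in edges if dst == node}


def maximal_paths(root, edges):
    """All root-to-sink paths of an acyclic graph, each as a list of nodes."""
    succ = {}
    for src, _, dst in edges:
        succ.setdefault(src, []).append(dst)

    def walk(node):
        if node not in succ:          # sink: the path ends here
            yield [node]
            return
        for nxt in succ[node]:
            for rest in walk(nxt):
                yield [node] + rest

    return list(walk(root))


def is_semiflexible_matching(q_root, q_edges, d_root, d_edges, mapping):
    """Semiflexible test for an acyclic query and an acyclic database."""
    if mapping[q_root] != d_root:
        return False
    q_nodes = {n for s, _, t in q_edges for n in (s, t)}
    # Incoming labels of a query node must also enter its image.
    if any(not incoming_labels(q_edges, v) <= incoming_labels(d_edges, mapping[v])
           for v in q_nodes):
        return False
    db_paths = [set(p) for p in maximal_paths(d_root, d_edges)]
    # The image of each query path must sit inside one database path,
    # in any order.
    for q_path in maximal_paths(q_root, q_edges):
        image = {mapping[n] for n in q_path}
        if not any(image <= p for p in db_paths):
            return False
    return True


# Hypothetical database: teacher 3 has two courses (8 and 11) below it.
db_edges = [(1, "Course", 2), (2, "Teacher", 6),
            (1, "Teacher", 3), (3, "Course", 8), (3, "Course", 11)]
path_query = [("r", "Course", "u"), ("u", "Teacher", "w")]
dag_query = [("r", "Course", "u"), ("r", "Course", "v"),
             ("u", "Teacher", "w"), ("v", "Teacher", "w")]

teacher_above = is_semiflexible_matching(
    "r", path_query, 1, db_edges, {"r": 1, "u": 8, "w": 3})
same_teacher = is_semiflexible_matching(
    "r", dag_query, 1, db_edges, {"r": 1, "u": 8, "v": 11, "w": 3})
unrelated = is_semiflexible_matching(
    "r", path_query, 1, db_edges, {"r": 1, "u": 2, "w": 3})
```

Here the course below a teacher is accepted even though the query places the teacher below the course, and the dag query finds two courses of the same teacher in a tree-shaped database. The mapping that pairs course 2 with teacher 3 is rejected because no database path contains both nodes.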
12[Figure: the university database graph with the query variables u, v and w marked on the database nodes they are mapped to]
A Semiflexible Matching Example
We get all the teacher-course pairs
13[Figure: the university database graph with the query variables u, v, x and w marked on the database nodes they are mapped to]
Impossible to get this pair by means of a rigid
matching, since the query is a dag and the db is
a tree
Another Example of a Semiflexible Matching
The SF matching gives a pair of courses taught by
the same teacher
14A Flexible Matching
- The query root is mapped to the db root
- A query node with an incoming label l is mapped to a db node with an incoming label l
- An edge is mapped to two nodes on one path
- Notice that a path in the query is not necessarily mapped to a path in the db
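Unlike the semiflexible conditions, the flexible ones can be tested edge by edge. The sketch below reads "two nodes on one path" as "one node is reachable from the other", which is an interpretation made for this example. Graphs are lists of `(source, label, target)` triples, and the names and data are hypothetical.

```python
def reachable(edges, start):
    """Every node reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, _, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def incoming_labels(edges, node):
    """Labels of the edges that enter `node`."""
    return {lab for _, lab, dst in edges if dst == node}


def is_flexible_matching(q_root, q_edges, d_root, d_edges, mapping):
    """Root to root, labels preserved, each query edge on one database path."""
    if mapping[q_root] != d_root:
        return False
    q_nodes = {n for s, _, t in q_edges for n in (s, t)}
    if any(not incoming_labels(q_edges, v) <= incoming_labels(d_edges, mapping[v])
           for v in q_nodes):
        return False
    for s, _, t in q_edges:
        a, b = mapping[s], mapping[t]
        # The two images must be connected in one direction or the other.
        if b not in reachable(d_edges, a) and a not in reachable(d_edges, b):
            return False
    return True


# Hypothetical database: teacher 3 has two courses (8 and 11) below it.
db_edges = [(1, "Course", 2), (2, "Teacher", 6),
            (1, "Teacher", 3), (3, "Course", 8), (3, "Course", 11)]

# Hypothetical path query: course u, its teacher w, another course x of w.
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w"),
               ("w", "Course", "x")]

two_courses = is_flexible_matching(
    "r", query_edges, 1, db_edges, {"r": 1, "u": 8, "w": 3, "x": 11})
unrelated = is_flexible_matching(
    "r", query_edges, 1, db_edges, {"r": 1, "u": 2, "w": 3, "x": 8})
```

The first mapping is accepted although the images of the query path r, u, w, x (nodes 1, 8, 3, 11) do not lie on a single path of this database, so it would not pass the semiflexible path condition.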
15[Figure: the university database graph with the query variables u, v, x, w and y marked on the database nodes they are mapped to]
A Flexible Matching Example
A query edge is mapped to two db nodes on one path
This flexible matching is neither a rigid
matching nor a semiflexible matching
16Differences Between the Semiflexible and Flexible
Semantics
- On a technical level, in flexible matchings
  - Query paths are not necessarily embedded in database paths
  - SCCs are not necessarily mapped to SCCs
- On a conceptual level, in the semiflexible semantics, nodes are semantically related if they are on the same path, and hence
  - Query paths are embedded in database paths
- In the flexible semantics, this condition is relaxed
  - Query edges are embedded in database paths
17Inclusion
- Proposition
  - R-MAT_Q(D) ⊆ SF-MAT_Q(D) ⊆ F-MAT_Q(D)
- where, for a query Q and a database D,
  - R-MAT_Q(D) is the set of rigid matchings
  - SF-MAT_Q(D) is the set of semiflexible matchings
  - F-MAT_Q(D) is the set of flexible matchings
18Verifying that Mappings are Semiflexible Matchings
- Is a given mapping of query nodes to database nodes a semiflexible matching?
- Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges)
- In a dag query, the number of paths may be exponential
  - Yet, verifying is in polynomial time
- In a cyclic query, the number of paths may be infinite
  - Yet, verifying is in exponential time
19Verifying that a Mapping is a Semiflexible
Matching
Database / Query    Path Query    Tree Query    DAG Query    Cyclic Query
Path Database       PTIME         PTIME         PTIME        No matchings
Tree Database       PTIME         PTIME         PTIME        No matchings
DAG Database        PTIME         PTIME         PTIME        No matchings
Cyclic Database     PTIME         PTIME         coNP         coNP
20Complexity of Query Evaluation
- Not surprisingly, for both the semiflexible and flexible semantics
  - Data complexity is polynomial
  - Query complexity is exponential
But is it exponential because the result is
large or because the result is hard to compute?
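The data-versus-query distinction can be seen in a naive evaluator that tries every assignment of database nodes to the non-root query variables: for a fixed query with n such variables it tests |D|^n candidates, which is polynomial in the database and exponential in the query. The sketch below does this for the flexible semantics, reading "on one path" as reachability in either direction. It is a brute-force illustration with hypothetical names and data, not an algorithm from the talk.

```python
from itertools import product


def reach(edges, start):
    """Every node reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, _, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def in_labels(edges, node):
    """Labels of the edges that enter `node`."""
    return {lab for _, lab, dst in edges if dst == node}


def evaluate_flexible(q_root, q_edges, d_root, d_edges):
    """Return (flexible matchings, number of candidate mappings tried)."""
    q_nodes = sorted({n for s, _, t in q_edges for n in (s, t)} - {q_root})
    d_nodes = sorted({n for s, _, t in d_edges for n in (s, t)})
    below = {n: reach(d_edges, n) for n in d_nodes}
    results, tried = [], 0
    for images in product(d_nodes, repeat=len(q_nodes)):
        tried += 1
        m = dict(zip(q_nodes, images))
        m[q_root] = d_root            # the root is always mapped to the root
        labels_ok = all(in_labels(q_edges, v) <= in_labels(d_edges, m[v])
                        for v in q_nodes)
        edges_ok = all(m[t] in below[m[s]] or m[s] in below[m[t]]
                       for s, _, t in q_edges)
        if labels_ok and edges_ok:
            results.append(m)
    return results, tried


# Hypothetical database and query: a course u related to a teacher w.
db_edges = [(1, "Course", 2), (2, "Teacher", 6),
            (1, "Teacher", 3), (3, "Course", 8)]
query_edges = [("r", "Course", "u"), ("u", "Teacher", "w")]

matchings, tried = evaluate_flexible("r", query_edges, 1, db_edges)
```

With 5 database nodes and 2 non-root variables, the evaluator tries 25 candidates and keeps the two teacher-course pairs of this fragment. Adding one more query variable multiplies the number of candidates by 5, regardless of how many matchings exist, which is exactly why the size of the result has to be separated from the cost of computing it.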
21Input-Output Complexity of Query Evaluation for
the Semiflexible Semantics
- The input consists of both the query and the database
- The input-output complexity is measured as a function of the sizes of the query, the database and the result
- The next slide summarizes results about the input-output complexity
  - Polynomial for a dag query and a tree database (or simpler cases)
    - Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings
  - Exponential lower bounds for other cases
22I/O Complexity for SF Semantics (lower bounds
are for non-emptiness)
Database / Query    Path Query     Tree Query     DAG Query             Cyclic Query
Path Database       PTIME          PTIME          PTIME                 Result is empty
Tree Database       PTIME          PTIME          PTIME                 Result is empty
DAG Database        NP-Complete    NP-Complete    NP-Complete           Result is empty
Cyclic Database     NP-Complete    NP-Complete    NP-Hard (in Σ^P_2)    NP-Hard (in Σ^P_2)
23I/O Complexity of Query Evaluation for the
Flexible Semantics
- Results follow from a reduction to query evaluation under the rigid semantics
- Tree query
  - Input-output complexity is polynomial
- DAG query
  - Testing for non-emptiness is NP-Complete
24Query Containment
- Q1 ⊆ Q2 if, for all databases D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D
- We assume that
  - Both queries have the same set of variables, and
  - All variables are distinguished
25Query Equivalence
- Useful for optimization
- Given a query, equivalent queries can be created
by transformations
These two queries are equivalent under both the
flexible and semiflexible semantics
26Database Equivalence
- D1 and D2 are equivalent if, for all queries Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2
- Both databases must have the same set of objects and the same root
27Database Transformation
[Figure: a university database rooted at node 1 (University), with Course and Teacher edges and the values Logic, Compilers, Databases, A. Cohen and C. Katz, shown before and after the transformation]
The databases are equivalent under both the
flexible and semiflexible semantics
A DAG has become a TREE!
28Transforming a Database into a Tree
- Reasons for transforming a database into an equivalent tree database
  - Evaluation of queries over a tree database is more efficient
  - In a graphical user interface, it is easier to present trees than dags or cyclic graphs
  - Storing the data in a serial form (e.g., XML) requires no references
29Transformation into a Tree
- There are algorithms for
  - Testing if a database can be transformed into an equivalent tree database, and
  - Performing the transformation
- For the semiflexible semantics
  - The algorithms are polynomial
- For the flexible semantics
  - The algorithms are exponential
30[Figure: a database graph with objects o0 to o6 and edges labeled l1 to l6, annotated with the following object lists]
o0, o1, o2, o4, o5
o0, o1, o3, o4, o5
o0, o5, o6
31Complexity Analysis
- for Query Containment and Database Equivalence
32Complexity of Query Containment
- Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping is a semiflexible matching of Q1 w.r.t. Q2
- Thus, containment is
  - in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph
  - in polynomial time in all other cases
- Under the flexible semantics, query containment is always in polynomial time
33Complexity of Database Equivalence
- For the semiflexible semantics, deciding equivalence of databases is
  - in polynomial time if both databases are dags
  - in coNP if one of the databases has cycles
- For the flexible semantics, deciding equivalence of databases is polynomial in all cases
34Conclusion
- Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases
  - Querying the database even when the user is oblivious to the structure of the database
  - Queries are insensitive to variations in the structure of the database
35Conclusion (contd)
- Compared to languages that use regular path expressions,
  - Less expressive power, but
  - Easier to formulate queries, and
  - More favorable complexities for
    - Query evaluation, and
    - Query optimization
36Thank You