Title: The Graph Query Language
1The Graph Query Language
- David Silberberg
- The Johns Hopkins University
- Applied Physics Laboratory
- July 18, 2006
2Team Members
- Wayne Bethea
- Jim Cavanaugh
- Clay Fink
- Paul Frank
- John Gersh
- Elisabeth Immer
- Roger Remington
3Outline
- Goals and Example Scenario
- Related Work and Key Features of GQL
- Graph Model and Query Language
- Computational Complexity of Query Execution
- Future Directions
4Goals of the Graph Query Language (GQL) Project
- To introduce a new approach to graph query languages for graph analysis
- Enable graph analysts to perform semantic search and iterative analysis over large graphs in a scalable fashion
- Seamlessly integrate graph analysis functions into the graph query language
- To quantify the scalability of this type of language
- To use ontologies to enrich graph querying
5Example Scenario
- Farmer Jones' lettuce crop did well this year, but few other farmers did well. Why?
- First, find Farmer Jones.
6Example Scenario
- Rabbits usually eat lettuce. Let's find the
rabbits that ate Farmer Jones' lettuce.
7Example Scenario
- Let's look at all the farmers, and their
locations, whose lettuce was eaten by fewer than
5 rabbits.
8Example Scenario
- What commonalities do the farmers have with each
other and with the rabbits?
9Graph Interaction Methods
- Graph analysis is a process of both browsing and searching elements of the graph
- Browsing
- One-step-at-a-time graph navigation
- One-operation-at-a-time graph algorithms
- Searching
- Several-steps-at-a-time graph navigation
- The steps can include one or more graph algorithms
- GQL is a declarative graph query language for searching!
11Related Work
- Four categories of graph query languages
- Knowledge base (subject-predicate-object) query languages
- SPARQL, RQL, RAL, RDF Query Language
- Graph reasoning query languages
- OWL-QL, GraphLog, Query and Inference Service for RDF
- Query languages with graph operators
- GOQL
- GRAM
- Graphical user interface query language
- QGRAPH
12Key Features of GQL
- Graph Paradigm
- Syntax, operators and results use the graph paradigm
- Returns a single graph or a set of graphs (not tables or XML files) to support analysis of large graphs
- Facilitates iterative graph querying
- Semantic Graph Query
- Schema-based
- Can be extended to utilize ontology-based inference
- Graph Exploration
- Wildcard searches
- Query over patterns
13Key Features of GQL (continued)
- Expressivity
- Composite entities
- New graph construction of results
- Universal and existential quantification
- Analysis support
- Hypothesis expressions
- Special graph functions (Shortest Path, Adjacent Vertices, etc.)
- Aggregation functions (count, sum, average, min, max)
- Set aggregation functions (union, intersection, difference)
15Graph Data Models
- Simple model
- Vertices usually represent concepts or objects
- Edges usually represent relationships between vertices
- Properties are attributes of objects or relationships
- Represent highly-connected information such as
- Social networks
- Knowledge bases
- Disciplines that use graphs
- Link mining analysis
- Semantic Web
- Bioinformatics
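The simple model above can be sketched as a small in-memory property graph. This is a minimal illustration only: the class names `Vertex`, `Edge` and `Graph`, the adjacency-list layout, and the sample data are assumptions made for this sketch, not part of GQL.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    type: str                                   # e.g. "Fox", "Rabbit"
    props: dict = field(default_factory=dict)   # attributes of the object


@dataclass
class Edge:
    id: str
    type: str                                   # e.g. "Chases", "Eats"
    source: str                                 # id of the source vertex
    target: str                                 # id of the target vertex
    props: dict = field(default_factory=dict)   # attributes of the relationship


class Graph:
    """Hypothetical container: vertices by id, outgoing edges per vertex."""

    def __init__(self):
        self.vertices = {}
        self.out_edges = {}

    def add_vertex(self, vid, vtype, **props):
        self.vertices[vid] = Vertex(vid, vtype, props)
        self.out_edges.setdefault(vid, [])

    def add_edge(self, eid, etype, source, target, **props):
        edge = Edge(eid, etype, source, target, props)
        self.out_edges[source].append(edge)
        return edge


# Illustrative data only.
graph = Graph()
graph.add_vertex("fox1", "Fox", name="George", age=3)
graph.add_vertex("rabbit1", "Rabbit", name="Peter", age=2)
graph.add_edge("chases1", "Chases", "fox1", "rabbit1", time="2pm")
```

Keeping each vertex's outgoing edges in its own list mirrors the "vertex edge set" that the later cost estimates refer to.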
16Example Graph Model
17GQL Operators - Overview
- Basic Syntax
- SUBGRAPH clause
- Finds a subgraph in the source graph
- CONSTRAINT clause
- Filters the subgraph based on property constraints
- RETURN clause
- Describes the resulting graph or sets of graphs to return
- Syntax for analysis
- ASSUME clause
- Supports hypothesis statements
- PATTERN clause
- Defines search patterns
18Basic GQL Operators
- Subgraph Template Operators: SUBGRAPH clause
- Conjunctions and disjunctions of path-segment operators
- Hierarchy operators (for composite vertices)
- Constraint Operators: CONSTRAINT clause
- Standard first-order logic
- Conjunctions, disjunctions and negations as well as universal and existential quantification of predicates
- Projection Operators: RETURN clause
- Constructs the result graph(s)
- Path segment operator
- Hierarchy operator (for composite vertices)
- Present results as a set of graphs
- Edge expansion operator
- Common join operator
19Simple Query
- SUBGRAPH Fox Chases Rabbit AND Fox Eats Rabbit
- CONSTRAINT Chases.Time < Eats.Time
- RETURN Fox Chases Rabbit AND Fox Eats Rabbit
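One way to read the simple query is as a match over pairs of path segments that share the same fox and the same rabbit. The sketch below is not GQL's implementation: the tuple layout, the helper names, the 24-hour integer times and the sample data are all assumptions, as is the reading that a repeated type name in the SUBGRAPH clause binds to the same vertex.

```python
# Edges as (edge_type, source_id, target_id, properties); hypothetical data.
vertex_type = {
    "fox1": "Fox", "fox2": "Fox",
    "rabbit1": "Rabbit", "rabbit2": "Rabbit",
}
edges = [
    ("Chases", "fox1", "rabbit1", {"time": 14}),
    ("Eats",   "fox1", "rabbit1", {"time": 15}),
    ("Chases", "fox1", "rabbit2", {"time": 17}),
    ("Eats",   "fox2", "rabbit2", {"time": 9}),
]


def path_segments(src_type, edge_type, dst_type):
    """All edges of edge_type that connect src_type to dst_type."""
    return [e for e in edges
            if e[0] == edge_type
            and vertex_type[e[1]] == src_type
            and vertex_type[e[2]] == dst_type]


def simple_query():
    """Fox Chases Rabbit AND Fox Eats Rabbit, with Chases.time < Eats.time."""
    results = []
    for chase in path_segments("Fox", "Chases", "Rabbit"):
        for eat in path_segments("Fox", "Eats", "Rabbit"):
            same_fox = chase[1] == eat[1]
            same_rabbit = chase[2] == eat[2]
            if same_fox and same_rabbit and chase[3]["time"] < eat[3]["time"]:
                results.append((chase, eat))
    return results


matches = simple_query()
```

With this data only fox1 and rabbit1 satisfy both path segments and the time constraint.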
20New Result Graph Structure Query
- SUBGRAPH Fox Eats Rabbit AND Rabbit Eats Lettuce
- RETURN Fox new(Ingests) Lettuce
[Figure: result graph containing Fox vertices fox1 (name George, age 3) and fox2 (name Fred, age 2), Lettuce vertices lettuce1 and lettuce2 (names PrizeLettuce and Icy), and new Ingests edges ingests1 and ingests3.]
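The new(Ingests) construction can be pictured as building a fresh edge for every Fox-Rabbit-Lettuce chain that the SUBGRAPH clause matches. A rough sketch under assumed data follows; the function name `ingests_edges`, the triple format of the new edges and the sample graph are hypothetical.

```python
# Hypothetical vertex types and Eats edges as (eater_id, eaten_id).
vertex_type = {
    "fox1": "Fox", "fox2": "Fox",
    "rabbit1": "Rabbit", "rabbit3": "Rabbit",
    "lettuce1": "Lettuce", "lettuce2": "Lettuce",
}
eats = [
    ("fox1", "rabbit1"), ("rabbit1", "lettuce1"),
    ("fox2", "rabbit3"), ("rabbit3", "lettuce2"),
]


def ingests_edges(vertex_type, eats):
    """Build a new Fox-Ingests-Lettuce edge for each Fox Eats Rabbit Eats Lettuce chain."""
    new_edges = set()
    for fox, rabbit in eats:
        if (vertex_type[fox], vertex_type[rabbit]) != ("Fox", "Rabbit"):
            continue
        for eater, lettuce in eats:
            if eater == rabbit and vertex_type[lettuce] == "Lettuce":
                new_edges.add((fox, "Ingests", lettuce))
    return new_edges


result = ingests_edges(vertex_type, eats)
```

The rabbits do not appear in the result: only the vertices and edges named in the RETURN clause are kept.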
21Aliasing
- SUBGRAPH Fox ALIAS ChasingFox Chases Rabbit AND
- Fox ALIAS EatingFox Eats Rabbit
- CONSTRAINT ChasingFox.name <> EatingFox.name
- RETURN ChasingFox Chases Rabbit AND
- EatingFox Eats Rabbit
- If our graph had an additional edge in which George Fox chased Jack Rabbit at 8 a.m., the result would look like
[Figure: result graph with Fox fox1 (name George, age 3), Chases chases3 (time 8am), Rabbit rabbit3 (name Jack, age 1), Eats eats2 (time 9am) and Fox fox2 (name Fred, age 2).]
22Wildcard Queries
- SUBGRAPH Fox ALIAS InterestingEdge Rabbit
- RETURN Fox InterestingEdge Rabbit
[Figure: result graph with Fox vertices fox1 (name George, age 3) and fox2 (name Fred, age 2); Rabbit vertices rabbit1 (name Peter, age 2), rabbit2 (name Bugs, age 4) and rabbit3 (name Jack, age 1); Chases edges chases1 (time 2pm) and chases2 (time 5pm); Eats edges eats1 (time 3pm) and eats2 (time 9am).]
23Composite Vertices
- Composite vertices
- Composed of vertices and edges
- Contained vertices can be composite as well
24Composite Vertex Queries - continued
- SUBGRAPH HuntingEvent OccuredAt Place AND
- HuntingEvent DIRECTLY CONTAINS Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Place.name = "Smith Game Park"
- RETURN Rabbit Eats Lettuce
[Figure: result graph in which Rabbit (name, age) Eats (time) Lettuce (name).]
25Patterns
- Pattern Definition
- Assigns names to interesting graph patterns
- Can be used in multiple queries
- PATTERN Predator (Fox new(PreysUpon) Rabbit)
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit
- CONSTRAINT Chases.time < Eats.time
- RETURN Fox new(PreysUpon) Rabbit
26Pattern Use
- Query
- SUBGRAPH Predator(Fox PreysUpon Rabbit) AND
- Rabbit Eats Lettuce
- RETURN Fox new(Ingests) Lettuce
- Is evaluated as if it were
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Chases.time < Eats.time
- RETURN Fox new(Ingests) Lettuce
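Pattern use behaves like macro expansion: the pattern reference is replaced by the pattern's own SUBGRAPH terms and its constraints are appended. The sketch below illustrates that rewriting on plain strings; the dictionary layout and the `expand` function are assumptions for illustration, not how GQL represents queries.

```python
# Hypothetical pattern table, keyed by pattern name.
PATTERNS = {
    "Predator": {
        "subgraph": ["Fox Chases Rabbit", "Fox Eats Rabbit"],
        "constraint": ["Chases.time < Eats.time"],
    }
}


def expand(query):
    """Replace each pattern reference with the pattern's terms and constraints."""
    subgraph = []
    constraint = list(query.get("constraint", []))
    for term in query["subgraph"]:
        name = term.split("(", 1)[0]
        if "(" in term and name in PATTERNS:
            subgraph.extend(PATTERNS[name]["subgraph"])
            constraint.extend(PATTERNS[name]["constraint"])
        else:
            subgraph.append(term)
    return {"subgraph": subgraph, "constraint": constraint,
            "return": query["return"]}


query = {
    "subgraph": ["Predator(Fox PreysUpon Rabbit)", "Rabbit Eats Lettuce"],
    "return": "Fox new(Ingests) Lettuce",
}
expanded = expand(query)
```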
27Hypothesis Expressions
- Enables queries on hypothetical data
- SUBGRAPH Fox Chases Rabbit AND
- Fox Eats Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Chases.time < 8am
- RETURN Fox new(Ingests) Lettuce
- ASSUME EDGE Chases NEW time = 7am
- FROM Fox CONSTRAINT name = Fred
- TO Rabbit CONSTRAINT name = Jack
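A hypothesis can be thought of as a temporary overlay: the assumed edge is visible to the query but never written back to the stored graph. The sketch below shows only that idea, on a simplified query that looks just at the Chases constraint; the helper names, integer hours and sample edges are assumptions.

```python
# Stored edges as (edge_type, source_id, target_id, properties); hypothetical data.
edges = [
    ("Chases", "fox1", "rabbit1", {"time": 14}),
    ("Eats", "fox2", "rabbit3", {"time": 9}),
]


def early_chases(edge_list, before_hour=8):
    """Chases edges whose time is earlier than before_hour."""
    return [e for e in edge_list
            if e[0] == "Chases" and e[3]["time"] < before_hour]


def assume_edge(edge_list, assumed):
    """Return a new edge list with the assumed edge; the input is unchanged."""
    return edge_list + [assumed]


# ASSUME: Fred (fox2) chased Jack (rabbit3) at 7am.
assumed = ("Chases", "fox2", "rabbit3", {"time": 7})
hypothetical = assume_edge(edges, assumed)
```

Querying `edges` finds no early chase, while querying `hypothetical` finds the assumed one.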
28Special Graph Operator Queries
- Shortest Path
- SUBGRAPH GameWarden Chases Fox AND
- ShortestPath(Fox, Rabbit) ALIAS SP_alias AND
- Rabbit Eats Lettuce
- RETURN GameWarden Chases Fox AND
- SP_alias AND
- Rabbit Eats Lettuce
- Adjacent Vertices
- SUBGRAPH AdjacentVertices(Rabbit) ALIAS AV_alias
- CONSTRAINT count_edges(Rabbit) > 10
- RETURN AV_alias
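The two special operators correspond to standard graph routines. The sketch below uses breadth-first search for the shortest path, which assumes unweighted, directed edges and measures length in hops; the slides do not say which definition GQL uses. The function names and the adjacency data are hypothetical.

```python
from collections import deque

# Directed adjacency lists; hypothetical data.
adjacency = {
    "fox1": ["den1", "field1"],
    "den1": ["rabbit1"],
    "field1": ["hedge1"],
    "hedge1": ["rabbit1"],
    "rabbit1": [],
}


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the vertex list of a fewest-hops path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def adjacent_vertices(adj, node):
    """Vertices one edge away from node, in either direction."""
    outgoing = set(adj.get(node, []))
    incoming = {src for src, targets in adj.items() if node in targets}
    return outgoing | incoming
```

For example, `shortest_path(adjacency, "fox1", "rabbit1")` follows the two-hop route through `den1` rather than the three-hop route through `field1`.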
29Returning a Set of Graphs
- Can be done with edge expansion or joins in the RETURN clause
- Can be seamlessly integrated with non-graph expansion expressions
- Any query can be returned as a set of graphs if desired
- SUBGRAPH Fox Chases Rabbit
- RETURN Fox Chases Rabbit
31Query Optimization
- Query execution time is the key to success for any query language; GQL is no exception
- Our approach
- Address query optimization on a per path-segment basis
- Address path-segment ordering
- Address the management of large amounts of intermediate results of a query
- Our efforts so far
- Addressed per path-segment optimization
- Started to address path-segment ordering
- Have not yet addressed the management of large amounts of intermediate results
32Query Optimization
- Query plan representations are used to define query execution plans
- Query plan representations are manipulated to optimize the query execution time
- Via laws of graph algebra
- Via graph statistics to estimate query costs for each operation
- Query optimizer determines
- The best algorithm to execute each operation
- The best operation ordering to optimize overall query execution time
33Query Planning and Optimization
- Query planning process determines the operators required to solve a query
- Query optimization process determines the most efficient way to
- Execute query operators
- Order the execution of query operators
- Heuristics have been identified to implement query planning and optimization based on statistical analysis
34Graph Statistics
- Estimating costs requires statistical knowledge of the graph
- We estimate the cost of the path segment operator
- One of the most common and costly operations
- Statistics that we initially considered useful
- Vertex Cardinality: the number of vertices of type v is count(v), or just $V$.
- Vertex Edge Set Cardinality: the total number of edges e that emanate from all vertices of type v is count(e_v), or just $E_V$.
- Edge Cardinality: the number of edges of type e is count(e), or just $E$.
- Edge Distribution: the number of different vertex type pairs that edges of type e connect, or just $E_D$.
- Selectivity Factor: the percentage of vertices or edges that match a property constraint is $\mathrm{sel}(\theta)$, where $\theta$ is the property constraint.
- Uniformity assumption
- Independence assumption
35Path Segment Vertex Search, No Indices
- Algorithm
- Iterate through a set of vertices of type v in $O(V)$ time
- For each vertex, iterate through its edge list to find edges of type e in $O(E_V/V)$ time
- Follow the edge to vertex w in constant time
- Execution time is $O(V \cdot (E_V/V)) = O(E_V)$
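The unindexed vertex search can be sketched as two nested loops: one over the vertices of type v and one over each vertex's full edge list. The data layout (`vertices_by_type`, `out_edges`) and the function name are assumptions for this sketch.

```python
# Hypothetical graph: vertex types and per-vertex outgoing edge lists.
vertex_type = {
    "fox1": "Fox", "fox2": "Fox",
    "rabbit1": "Rabbit", "rabbit2": "Rabbit",
}
vertices_by_type = {
    "Fox": ["fox1", "fox2"],
    "Rabbit": ["rabbit1", "rabbit2"],
}
out_edges = {
    "fox1": [("Chases", "rabbit1"), ("Eats", "rabbit1"), ("Chases", "rabbit2")],
    "fox2": [("Eats", "rabbit2")],
    "rabbit1": [],
    "rabbit2": [],
}


def path_segment_scan(v_type, e_type, w_type):
    """Find v -e-> w by scanning every edge of every vertex of type v_type."""
    matches = []
    for v in vertices_by_type.get(v_type, []):      # V vertices
        for edge_type, w in out_edges.get(v, []):   # about E_V / V edges each
            if edge_type == e_type and vertex_type[w] == w_type:
                matches.append((v, edge_type, w))   # following the edge is O(1)
    return matches
```

Every edge that leaves a vertex of type v is touched once, which is where the overall $O(E_V)$ bound comes from.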
36Path Segment Indices on Vertex Edge Set
- Requires each edge set to be indexed through a logarithmic-time search tree (e.g., B tree)
- Next values are (virtually) collocated with the matching value
- Enables a constant time search for the next value(s)
- Algorithm
- Iterate through vertices of type v in time $O(V)$
- Find matching edge(s) in logarithmic time $O(\log(E_V/V))$
- Iterate through the matching edges in time $O(E/(E_D V))$
- Execution time is $O(V \cdot (\log(E_V/V) + E/(E_D V))) = O(V \log(E_V/V) + E/E_D)$
- If $E_D \approx E$ (i.e., one edge of type e emanates from each v), then the algorithm tends to operate in time $O(V \log(E_V/V))$
- If $E_D \approx E$ and $E_V \approx V$, the algorithm tends to operate in time $O(V)$
- If $E_D \approx E$ and $E_V \gg V$, the algorithm tends to operate in time $O(V \log(E_V))$
- If $E_D \ll E$, then the algorithm tends to operate in time $O(E/E_D)$
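An indexed edge set can be imitated with a sorted list and binary search, standing in for the search tree mentioned above. This is a sketch under that substitution: Python's `bisect` module provides the logarithmic lookup, matching edges sit next to each other in the sorted list, and the data and function name are hypothetical.

```python
from bisect import bisect_left

# Each vertex's edge set kept sorted by (edge_type, target_id); hypothetical data.
sorted_edges = {
    "fox1": sorted([("Chases", "rabbit1"), ("Eats", "rabbit1"), ("Chases", "rabbit2")]),
    "fox2": sorted([("Eats", "rabbit2")]),
}


def matching_edges(v, e_type):
    """Binary-search to the first edge of e_type, then read the collocated run."""
    edge_list = sorted_edges.get(v, [])
    # (e_type,) sorts before every (e_type, target), so this finds the run's start.
    i = bisect_left(edge_list, (e_type,))
    found = []
    while i < len(edge_list) and edge_list[i][0] == e_type:
        found.append(edge_list[i])
        i += 1
    return found
```

The binary search corresponds to the $O(\log(E_V/V))$ step, and the short loop over the run corresponds to iterating through the matching edges.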
37Path Segment Edge Indices, Constraint
- Beneficial when the query includes a constraint $\theta_v$ on an indexed property of vertices of type v
- Vertex edge sets are indexed as well
- Algorithm
- Logarithmic-time search through the indexed properties $\theta_v$ in time $O(\log(V))$
- Iterate through vertices (collocated in the index) that satisfy the constraint in time $O(\mathrm{sel}(\theta_v) V)$
- Performs a logarithmic-time search on the edges of each matching vertex in time $O(\log(E_V/V))$
- Iterate through the matching edges in time $O(E/(E_D V))$
- Execution time is $O(\log(V) + \mathrm{sel}(\theta_v) V \cdot (\log(E_V/V) + E/(E_D V))) = O(\log(V) + \mathrm{sel}(\theta_v) V \log(E_V/V) + \mathrm{sel}(\theta_v) E/E_D)$
- If $\mathrm{sel}(\theta_v) \approx 0$, the dominant factor is the search for vertices, or $O(\log(V))$
- If the selectivity factor is higher, the execution time approaches the times of the previous slide
38Path Segment Edge Search, No Indices
- Algorithm
- Iterate over edges of type e and select those that connect v to w in time $O(E)$
- Find the corresponding vertices in constant time
- Execution time is $O(E)$
39Path Segment Edge Search, Constraint
- Beneficial when the query statement includes a constraint $\theta_e$ on an indexed property of edges of type e
- Algorithm
- Performs a logarithmic-time search through properties to find the first matching edge in time $O(\log(E))$
- Performs a linear search through all subsequent matching edges in time $O(\mathrm{sel}(\theta_e) E)$
- Find both vertices attached to each edge in constant time
- Execution time is $O(\log(E) + \mathrm{sel}(\theta_e) E)$
- If $\mathrm{sel}(\theta_e) \approx 0$, the algorithm tends to an execution time of $O(\log(E))$
- Otherwise, the algorithm tends to an execution time of $O(E)$
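The constrained edge search can be sketched with an index sorted on the constrained property: a binary search finds the first matching edge and a linear scan collects the rest. The sketch assumes a range constraint on a hypothetical integer `time` property and uses a sorted list with `bisect` in place of a real index structure.

```python
from bisect import bisect_left

# Index on the time property of Chases edges: (time, source_id, target_id), sorted.
chases_by_time = sorted([
    (14, "fox1", "rabbit1"),
    (17, "fox1", "rabbit2"),
    (8, "fox1", "rabbit3"),
    (9, "fox2", "rabbit3"),
])


def edges_in_time_range(index, low, high):
    """Edges with low <= time < high: O(log E) search plus a scan of the matches."""
    i = bisect_left(index, (low,))       # first entry with time >= low
    found = []
    while i < len(index) and index[i][0] < high:
        found.append(index[i])           # both endpoints are stored with the edge
        i += 1
    return found
```

When few edges satisfy the constraint the scan is short and the logarithmic search dominates; when most edges match, the scan approaches a pass over all $E$ edges.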
40Varying Number of Vertices per Vertex Type
41Varying Number of Edges per Vertex
42Varying Edge Types with Constraints
43Path Segment Ordering
- Assume the following query
- SUBGRAPH Fox Chases Rabbit AND
- Rabbit Eats Lettuce
- CONSTRAINT Rabbit.age < 3
- RETURN Fox new(Ingests) Lettuce
- Query processing produces the following query
execution plan
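[Figure: query execution plan tree. The projection π Fox new(Ingests) Lettuce is applied over the selection σ Rabbit.age < 3, which is applied over the path segments Fox Chases Rabbit and Rabbit Eats Lettuce.]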
44Path Segment Execution Order Choice
[Figure: two alternative execution plans for the projection π Fox new(Ingests) Lettuce with the selection σ Rabbit.age < 3, built from the path segments Fox Chases Rabbit and Rabbit Eats Lettuce.]
45Execution Order Heuristics
- In simple terms
- Identify the path segment operation that promises to return the fewest results
- Then identify the next operation that promises to return the next fewest results
- It is actually more complicated than this
- Need to search an exponential number of orderings to find the most efficient ordering
- Heuristics can make this search tractable
46Path-Segment Ordering Metric
- Order the path segment operators to return the fewest results
- Rough heuristic
- If predicates $\theta_v$, $\theta_e$, and $\theta_w$ are applied to V, E and W respectively
- Start with V and use selectivity factors to estimate execution time
- Execution time is
- $V \cdot \mathrm{sel}(\theta_v) \cdot (E/(E_D V)) \cdot \mathrm{sel}(\theta_e) \cdot (W E_D/E) \cdot \mathrm{sel}(\theta_w)$
- Or, $\mathrm{sel}(\theta_v) \cdot \mathrm{sel}(\theta_e) \cdot \mathrm{sel}(\theta_w) \cdot W$
- Use this formula to determine whether Fox Chases Rabbit should precede or follow Rabbit Eats Lettuce
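The rough heuristic can be tried out numerically. The sketch below evaluates the metric for the two path segments and picks the one with the smaller estimate to run first. All cardinalities and selectivities are invented for illustration, and the function names are assumptions.

```python
def segment_estimate(V, E, E_D, W, sel_v=1.0, sel_e=1.0, sel_w=1.0):
    """V * sel(v) * (E / (E_D * V)) * sel(e) * (W * E_D / E) * sel(w)."""
    start = V * sel_v              # vertices of type v that pass their constraint
    fan_out = E / (E_D * V)        # edges of type e per starting vertex
    target_factor = W * E_D / E    # target-side factor from the slide's formula
    return start * fan_out * sel_e * target_factor * sel_w


def simplified_estimate(W, sel_v=1.0, sel_e=1.0, sel_w=1.0):
    """The same quantity after cancellation: sel(v) * sel(e) * sel(w) * W."""
    return sel_v * sel_e * sel_w * W


# Hypothetical statistics; Rabbit.age < 3 is assumed to select 30% of rabbits.
estimates = {
    "Fox Chases Rabbit": segment_estimate(V=100, E=400, E_D=2, W=1000, sel_w=0.3),
    "Rabbit Eats Lettuce": segment_estimate(V=1000, E=5000, E_D=4, W=200, sel_v=0.3),
}
first_segment = min(estimates, key=estimates.get)
```

With these made-up numbers the estimates are roughly 300 and 60, so Rabbit Eats Lettuce would be executed first.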
48Future Work
- Create an operational prototype of a Graph Query Language system
- Continue to address query optimization issues
- Use ontologies to enrich graph queries
- Address language issues
- Define the query execution process
- Inferences
- Ontology to graph mappings
- Tie GQL to a graphical interface
- Enables analysts to express queries through graphical means
- Can leverage several technologies (QGraph, Conceptual Graphs, etc.)
- Augment GQL to include Uncertainty, Geospatial and Temporal operators and data structures
49Backups
50Costs of Various Path Strategies
- Search by Vertex Type
- Plain: $O(E_V)$
- With indexed Edges: $O(V \log(E_V/V) + E/E_D)$
- If $E_D \approx E$ (i.e., one edge of type e emanates from each v), then the algorithm tends to operate in time $O(V \log(E_V/V))$
- If $E_D \approx E$ and $E_V \approx V$, the algorithm tends to operate in time $O(V)$
- If $E_D \approx E$ and $E_V \gg V$, the algorithm tends to operate in time $O(V \log(E_V))$
- If $E_D \ll E$, then the algorithm tends to operate in time $O(E/E_D)$
- With indexed Properties and Edges: $O(\log(V) + \mathrm{sel}(\theta_v) V \log(E_V/V) + \mathrm{sel}(\theta_v) E/E_D)$
- If $\mathrm{sel}(\theta_v) \approx 0$, the dominant factor is the search for vertices, or $O(\log(V))$
- Otherwise, the execution time approaches the times of the previous strategy
- Search by Edge Type
- Plain: $O(E)$
- Since $E_{VW} \le E_V$, the execution time is at least as fast as that of the first algorithm
- With indexed Properties: $O(\log(E) + \mathrm{sel}(\theta_e) E)$
- If $\mathrm{sel}(\theta_e) \approx 0$, the algorithm tends to an execution time of $O(\log(E))$
- Otherwise, the algorithm tends to an execution time of $O(E)$