Title: AutoJoin: Providing Freedom from Specifying Joins
1AutoJoin Providing Freedom from Specifying Joins
- Terrence Mason (terrence-mason_at_uiowa.edu)
- Lixin Wang (lixin-wang_at_uiowa.edu)
- Dr. Ramon Lawrence (ramon-lawrence_at_uiowa.edu)
- Iowa Database and Emerging Application Laboratory
- University of Iowa
7th International Conference on Enterprise
Information Systems ICEIS 2005 Miami, Florida
2Presentation Outline
- Define Query Inference
- Query Languages that require Inference
- AutoJoin Architecture
- Join Graph represent a schema
- Queries and Query Interpretations on a Join Graph
- Pre-compute maximal join trees
- Algorithm EMO
- Query time processing Example
- Performance Evaluation
3Query Inference Problem New Languages
- The query inference problem requires enumerating
and ranking query interpretations of a query such
that the query interpretation desired by the user
is among the highest ranked interpretations.
4Motivation for Query Inference
- State of the art query languages require it
- Keyword Search automatically relate keywords
across relations of a schema - Conceptual Queries Concepts mapped to database
must be related - Natural Language Queries
- Natural language query mapped to concepts
- Relate concepts as in Conceptual Queries
- Current approaches not scalable
- Tied to specific language
- Or conceptual model
5Motivation for Query Inference
- Reduces to graph problem
- Connect relations (nodes) with joins (edges)
- Exponential solutions for highly connected graphs
(database graphs less connected) - Approaches to join determination
- Grow all ways
- Universal Relation (Maier and Ullman, 1983)
- Discover (Keyword) (Hristidis and
Papakonstantinou, 2002, 2003, 2004) - Shortest Paths
- CQL Conceptual Query Language (Owei and Navathe,
2001) - Limited Interpretations
- Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
- Limit number of joins and interpretations (Zhang
et al., 1999) - Query time find spanning trees of keywords
- DBXplorer Keyword Search (Agrawal et al. 2002)
6Goal of AutoJoin
- Consistent, Scalable Inference Engine
- Abstract database schema from users
- Automatically determine joins to relate relations
and attributes - Consistent approach to handle ambiguity in
queries - Efficient algorithm to pre-compute potential
joins - Minimal overhead at query time
- Demonstrate efficiency and scalability
- Structured on relational model without any
required conceptual models
7Example Query on TPC-H Schema
- English Query
- List all parts ordered by Customers
- in the United States.
- Attribute-only SQL
- Determine Joins with AutoJoin
- New formulation for Query Inference problem.
8- TPC-H Schema
- TPC-H BENCHMARK (http//www.tpc.org/)
- List all parts ordered by Customers in the United
States.
Table Attributes
Part partkey, name, mfgr, brand, type, size, container, retailprice, comment
Supplier supkey, name, address, nationkey, phone, acctbal, comment
PartSupp partkey, suppkey, availqty, supplycost, comment
Customer custkey, name, address, nationkey, phone, acctbal, mktsegment, comment
Order orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment
LineItem orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment
Nation nationkey, name, regionkey, comment
Region regionkey, name, comment
9- Attribute-only Query
- Select Part.Name where Nation.NameUnited
States - Part.Name - name attribute in Part Table
- Nation.Name name attribute in Nation Table
- Select and where similar to SQL
- No From clause or joins specified
- Keyword Query
- Part United States
- Maps Part to Part relation
- Maps United States to tuple in Nation relation
- No joins specified
10SQL Query Select Part.Name where Nation.Name
United States
- SELECT P.name
- FROM part P, nation N, partsupp PS, lineitem LI,
- orders O, customer C
- WHERE N.name United States
- And P.partkey PS.partkey
- And PS.partkey LI.partkey
- And PS.suppkey LI.suppkey
- And O.custkey C.custkey
- And C.nationkey N.nationkey
- And LI.orderkey O.orderkey
Specified Joins and Tables
11AutoJoin Architecture
User
Relational Database
Execute Queries
Query Interface
Inference Request
Interpretations
Query Builder
Iterator
Generator
Ranker
XML Document
Loader
AutoJoin Inference Engine
12Representing Joins of a SchemaJoin Graph
- Graph representation of relational schema
- Nodes
- Relations in schema
- Directed Edges
- Foreign key constraint between relations
- Edges directed from N to 1 cardinality of
relationships - Maintain Lossless property (No spurious tuples on
joins)
13Create Join Graph TPC-H
Nodes Joined Foreign key/Join
Line Item to Part partkey ? partkey
Line Item to PartSupp partkey, suppkey ? partkey, suppkey
Line Item to Supplier suppkey ? suppkey
Line Item to Order l_orderkey ? o_orderkey
PartSupp to Part ps_partkey ? p_partkey
PartSupp to Supplier ps_suppkey ? s_suppkey
Supplier to Nation s_nationkey ? n_nationkey
Order to Customer o_custkey ? c_custkey
Customer to Nation c_nationkey ? n_nationkey
Nation to Region n_regionkey ? r_regionkey
Tables as Nodes
Line Item
Part Supp
Part
Supplier
Order
Nation
Customer
Region
14Pre-compute Maximal Join Trees
- EMO Algorithm on Join Graph
- Efficiently computes all Trees
- Executes where previous strategy failed
- Direction of edges results in lossless join trees
- Pre-computed
- Executed once prior to query time
- Structures built for query time performance
15Compute Lossless Joins
- Maximal sets of lossless joins
- Ambiguity inherent in the schema
- Two types of ambiguity
- Single relation that plays multiple roles
- Node with more than one incoming edge in join
graph - Multiple semantic relationships between entities
- Strongly connected components greater than one
node
16Creation of Maximal Join TreesLossless Joins
- Efficient Algorithm EMO
- Determine all reachable graphs from nodes that
may be a root for Maximal Set of Lossless Joins - Identify all Strong Connected Components (SCC)
- For each SCC
- If SCC is single node and no incoming edges,
create reachable graph from this node - If SCC has multiple nodes, for each node in SCC
with no incoming edges that are not part of SCC
create reachable graph. - For each reachable graph find all spanning trees
- Spanning trees represent Maximal Join Trees
17Maximal Join Trees of TPC-H
- LineItem is the only root for a reachable graph.
- No strongly connected components
- Join graph is reachable graph
- Enumerate spanning trees on original graph
- Remove shortcut joins and re-compute
18TPC-H Join Graph
Line Item
Part Supp
Order
Supplier
Part
Nation
Customer
Region
19TPC-H Maximal Join Trees
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
4
Region
1
3
Region
Region
2
Region
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
8
7
6
5
Region
Region
Region
Region
20Shortcut Joins
- Semantically equivalent join paths
- A shortcut join is a join that is semantically
equivalent to a longer join path - Core join path (longer) preserved in join graph
- Shortcut join removed for join determination
- Appears to be a semantically different
interpretation of the query - Substituted back into query
- No nodes on core path in query (faster)
execution) - TPC-H has two shortcut joins
21TPC-H Join Graph Remove Shortcut Joins
Line Item
Red Shortcut Joins
Part Supp
Order
Supplier
Part
Nation
Customer
Region
22Original TPC-H Maximal Join Trees
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
4
Region
1
3
Region
Region
2
Region
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
8
7
6
5
Region
Region
Region
Region
23TPC-H Semantically Unique Maximal Join Trees
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Region
1
2
24Query and Query Interpretation AutoJoin
- Join Graphs
- Query
- Sub-graph of the join graph
- Nodes and (optionally) edges
- Not connected requires inference
- Query Interpretation
- Connected sub-graph of the join graph
- Includes all specified nodes and edges
25Example Query
- SELECT Part.Name
- WHERE Nation.Name United States
- Relate Part.Name to Nation.Name
- Part and Nation Nodes.
- Query of Part and Nation nodes to AutoJoin.
- The query is ambiguous
- More than one query interpretation
- Nation relates to Supplier and Customer
- Return the query with fewest joins first
26Efficient Query Time Execution
- Find maximal join trees with query nodes
- Reverse index - relation to its set of join trees
- Intersect lists
- Build Interpretations
- Least common ancestor (vs. recursive prune)
- Pre-compute ancestor lists
- No lossless interpretations (no trees)
- Find lossy interpretation
- Rank interpretations by cost function
maximal sets of lossless joins
27Both Trees Contain Query Nodes Select Part.Name
where Nation.Name United States
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Red Target Nodes
Region
1
2
28Query Processing
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Red Target Nodes Blue Tree Nodes Gray Nodes
to Prune
Region
Region
1
2
29Query Interpretations
Select Part.Name where Customer.Nation.Name
United States
Select Part.Name where Supplier.Nation.Name
United States
Line Item
Part Supp
Supplier
Part
Part Supp
Part
Order
Nation
Nation
Customer
1
2
30Unambiguous Query Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Red Target Nodes
Region
1
2
31Query Processing Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Red Target Nodes Blue Tree Nodes Gray Nodes
to Prune
Region
Region
1
2
32Query Interpretations Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Order
Supplier
Order
1
2
33The Unambiguous Query Interpretation Select
Supplier.Name where Order.Id 73
Line Item
Part Supp
Supplier
Order
34Additional InterpretationsLossy Joins
- Related through a node involved in two distinct
roles - Two maximal join trees contain all query nodes
and have at least one node in common - Union maximal join trees
- Common nodes provide relation for trees.
- Interpretation where node will have two incoming
edges - No longer lossless
- Example Customer and Supplier related through
Nation in TPC-H. - Cross products of Customers and Suppliers with
the same nation
35Beyond Natural Joins
- Theta joins
- Merge the two nodes related by theta join into
single node and re-compute maximal objects. - Expand this node for final query interpretation
with theta join - Tuple Variables
- A query interface may specify tuple variables
- Additional nodes and edges will be added to join
graph to complete the query interpretations
36Performance Experiments
- Broad Range of Schemas
- caBIO (NCI) 149 relations, 213 joins, and 1253
maximal join trees - TPC-H Standard Database
- Inferred standard queries (21 specified queries)
- Ambiguity reduced by removing shortcut joins
- Tenant 9 nodes, 50 joins, and 1286 maximal join
trees
37Peformance Results
- Time to generate all Maximal Join Trees
- Handles schemas where previous method failed
- Worst test 2.7 seconds
- Average lt 1 second
- Reduce Ambiguity
- Removing shortcut joins reduces ambiguity
- Increased number of unambiguous query
- From 45 to 68 for TPC-H Benchmark Queries
- Minimal overhead of inference at query time
- Average lt 1 millisecond
- Worst test 7.4 milliseconds
38Compute Maximal Join TreesEMO vs. All Ways
39Reducing Ambiguity Remove Shortcut Joins
40Query Inference Time(Milliseconds)
41AutoJoin Conclusions
- Scalable inference engine
- Efficiently pre-compute maximal join trees
- Reduced ambiguity by removing shortcut joins
- Overhead is minimal
- Complex queries can be inferred
- Built directly on relational model
42Future Work
- Develop a query language
- Remove requirement of understanding the
underlying schema - Automatically determines joins
- End user interface based on AutoJoin
- Query inference for integration systems.
43Query Inference(Previous)
- The translation of a query in a query language
into an unambiguous representation of the query - Wald and Sorenson, 1984
44Universal Relation
- First model to require query inference
- Maximal Objects (Maier and Ullman, 1983)
- Lossless Join property to identify potential
joins - Grows all ways on hyper-graph
- Returns a union of all query interpretations
- Minimum Directed Cost Steiner Tree (Wald and
Sorenson, 1984) - Limited to Partial 2-Trees
- Returns only lowest cost query interpretation
- Generate a single interpretation
- Do not meet need of new query languages
- Limited query interpretations possible
45State of the Art Query Languages
- Keyword Searches
- Keywords map to either specific data, attribute
names, or relation names in a database. - Must identify joins to relate keywords spread
across multiple relations. - Multiple approaches to identifying the top-k
relationships between keywords.
46Keyword SearchTop-K Relationships
- Discover (Hristidis and Papakonstantinou, 2002,
2003, 2004) - Grow all ways from a keyword
- Limit on number of joins
- Creates extra graphs
- DBXplorer (Agrawal et al. 2002)
- Generates spanning trees at query time
- BANKS ( )
- Graph of all tuples related by joins
- Must fit in memory (limited to smaller databases)
47State of the Art Query Languages
- Conceptual Query Languages or Models
- Queries built with concepts that map to a
database. - Remove the burden of knowledge of the schema.
- Must determine joins to relate concepts in query.
- Use conceptual model to determine joins
48Conceptual Query Languages
- CQL (Owei and Navathe, 2001)
- Queries may include roles or joins required for a
query - Pathfinder algorithm for completing the query
- Based on shortest path between source and target
concepts in query - Semantically Constrained ER Diagram as a graph
used to determine joins. - Conceptual Model (Zhang et al., 1999)
- Semantic graph of database
- Search algorithm constrained by number of joins
or number of interpretations
49State of the Art Query Languages
- Natural Language Queries
- Natural language queries map the language to
concepts in a database - Joins must be determined to relate concepts in
database similar to Conceptual Query Languages
50Functional Dependencies due to Primary KeysTPC-H
Table Functional Dependencies
Part p_partkey ? p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment
Supplier s_suppkey ? s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment
PartSupp ps_partkey, ps_suppkey ? ps_availqty, ps_supplycost, ps_comment
Customer c_custkey ? c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
Order o_orderkey ? o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment
LineItem l_orderkey, l_linenumber ? l_partkey, l_suppkey, l_orderkey , l_quantity, l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
Nation n_nationkey ? n_name, n_regionkey, n_comment
Region r_regionkey ? r_name, r_comment
Primary Keys Foreign Keys
51Function Dependencies TPC-H implied by Foreign
Keys
Table with Foreign Key Table Referenced Functional Dependencies
LineItem Part l_partkey ? p_partkey
LineItem Supplier l_suppkey ? s_suppkey
LineItem PartSupp l_partkey, l_suppkey ? ps_partkey, ps_suppkey
LineItem Order l_orderkey ? o_orderkey
PartSupp Part ps_partkey ? p_partkey
PartSupp Supplier ps_suppkey ? s_suppkey
Supplier Nation s_nationkey ? n_nationkey
Customer Nation c_nationkey ? n_nationkey
Order Customer o_custkey ? c_custkey
Nation Region n_regionkey ? r_regionkey
Primary Keys Foreign Keys
52TPC-H Join Graph
Line Item
Part Supp
Order
Supplier
Part
Nation
Customer
Region