AutoJoin: Providing Freedom from Specifying Joins

About This Presentation
Title:

AutoJoin: Providing Freedom from Specifying Joins

Description:

State of the art query languages require it ... EMO Algorithm on Join Graph. Efficiently computes all Trees ... Efficient Algorithm EMO ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 53
Provided by: idea1

less

Transcript and Presenter's Notes

Title: AutoJoin: Providing Freedom from Specifying Joins


1
AutoJoin Providing Freedom from Specifying Joins
  • Terrence Mason (terrence-mason_at_uiowa.edu)
  • Lixin Wang (lixin-wang_at_uiowa.edu)
  • Dr. Ramon Lawrence (ramon-lawrence_at_uiowa.edu)
  • Iowa Database and Emerging Application Laboratory
  • University of Iowa

7th International Conference on Enterprise
Information Systems ICEIS 2005 Miami, Florida
2
Presentation Outline
  • Define Query Inference
  • Query Languages that require Inference
  • AutoJoin Architecture
  • Join Graph represent a schema
  • Queries and Query Interpretations on a Join Graph
  • Pre-compute maximal join trees
  • Algorithm EMO
  • Query time processing Example
  • Performance Evaluation

3
Query Inference Problem New Languages
  • The query inference problem requires enumerating
    and ranking query interpretations of a query such
    that the query interpretation desired by the user
    is among the highest ranked interpretations.

4
Motivation for Query Inference
  • State of the art query languages require it
  • Keyword Search automatically relate keywords
    across relations of a schema
  • Conceptual Queries Concepts mapped to database
    must be related
  • Natural Language Queries
  • Natural language query mapped to concepts
  • Relate concepts as in Conceptual Queries
  • Current approaches not scalable
  • Tied to specific language
  • Or conceptual model

5
Motivation for Query Inference
  • Reduces to graph problem
  • Connect relations (nodes) with joins (edges)
  • Exponential solutions for highly connected graphs
    (database graphs less connected)
  • Approaches to join determination
  • Grow all ways
  • Universal Relation (Maier and Ullman, 1983)
  • Discover (Keyword) (Hristidis and
    Papakonstantinou, 2002, 2003, 2004)
  • Shortest Paths
  • CQL Conceptual Query Language (Owei and Navathe,
    2001)
  • Limited Interpretations
  • Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
  • Limit number of joins and interpretations (Zhang
    et al., 1999)
  • Query time find spanning trees of keywords
  • DBXplorer Keyword Search (Agrawal et al. 2002)

6
Goal of AutoJoin
  • Consistent, Scalable Inference Engine
  • Abstract database schema from users
  • Automatically determine joins to relate relations
    and attributes
  • Consistent approach to handle ambiguity in
    queries
  • Efficient algorithm to pre-compute potential
    joins
  • Minimal overhead at query time
  • Demonstrate efficiency and scalability
  • Structured on relational model without any
    required conceptual models

7
Example Query on TPC-H Schema
  • English Query
  • List all parts ordered by Customers
  • in the United States.
  • Attribute-only SQL
  • Determine Joins with AutoJoin
  • New formulation for Query Inference problem.

8
  • TPC-H Schema
  • TPC-H BENCHMARK (http//www.tpc.org/)
  • List all parts ordered by Customers in the United
    States.

Table Attributes
Part partkey, name, mfgr, brand, type, size, container, retailprice, comment
Supplier supkey, name, address, nationkey, phone, acctbal, comment
PartSupp partkey, suppkey, availqty, supplycost, comment
Customer custkey, name, address, nationkey, phone, acctbal, mktsegment, comment
Order orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment
LineItem orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment
Nation nationkey, name, regionkey, comment
Region regionkey, name, comment
9
  • Attribute-only Query
  • Select Part.Name where Nation.NameUnited
    States
  • Part.Name - name attribute in Part Table
  • Nation.Name name attribute in Nation Table
  • Select and where similar to SQL
  • No From clause or joins specified
  • Keyword Query
  • Part United States
  • Maps Part to Part relation
  • Maps United States to tuple in Nation relation
  • No joins specified

10
SQL Query Select Part.Name where Nation.Name
United States
  • SELECT P.name
  • FROM part P, nation N, partsupp PS, lineitem LI,
  • orders O, customer C
  • WHERE N.name United States
  • And P.partkey PS.partkey
  • And PS.partkey LI.partkey
  • And PS.suppkey LI.suppkey
  • And O.custkey C.custkey
  • And C.nationkey N.nationkey
  • And LI.orderkey O.orderkey

Specified Joins and Tables
11
AutoJoin Architecture
User
Relational Database
Execute Queries
Query Interface
Inference Request
Interpretations
Query Builder
Iterator
Generator
Ranker
XML Document
Loader
AutoJoin Inference Engine
12
Representing Joins of a SchemaJoin Graph
  • Graph representation of relational schema
  • Nodes
  • Relations in schema
  • Directed Edges
  • Foreign key constraint between relations
  • Edges directed from N to 1 cardinality of
    relationships
  • Maintain Lossless property (No spurious tuples on
    joins)

13
Create Join Graph TPC-H
Nodes Joined Foreign key/Join
Line Item to Part partkey ? partkey
Line Item to PartSupp partkey, suppkey ? partkey, suppkey
Line Item to Supplier suppkey ? suppkey
Line Item to Order l_orderkey ? o_orderkey
PartSupp to Part ps_partkey ? p_partkey
PartSupp to Supplier ps_suppkey ? s_suppkey
Supplier to Nation s_nationkey ? n_nationkey
Order to Customer o_custkey ? c_custkey
Customer to Nation c_nationkey ? n_nationkey
Nation to Region n_regionkey ? r_regionkey
Tables as Nodes
Line Item
Part Supp
Part
Supplier
Order
Nation
Customer
Region
14
Pre-compute Maximal Join Trees
  • EMO Algorithm on Join Graph
  • Efficiently computes all Trees
  • Executes where previous strategy failed
  • Direction of edges results in lossless join trees
  • Pre-computed
  • Executed once prior to query time
  • Structures built for query time performance

15
Compute Lossless Joins
  • Maximal sets of lossless joins
  • Ambiguity inherent in the schema
  • Two types of ambiguity
  • Single relation that plays multiple roles
  • Node with more than one incoming edge in join
    graph
  • Multiple semantic relationships between entities
  • Strongly connected components greater than one
    node

16
Creation of Maximal Join TreesLossless Joins
  • Efficient Algorithm EMO
  • Determine all reachable graphs from nodes that
    may be a root for Maximal Set of Lossless Joins
  • Identify all Strong Connected Components (SCC)
  • For each SCC
  • If SCC is single node and no incoming edges,
    create reachable graph from this node
  • If SCC has multiple nodes, for each node in SCC
    with no incoming edges that are not part of SCC
    create reachable graph.
  • For each reachable graph find all spanning trees
  • Spanning trees represent Maximal Join Trees

17
Maximal Join Trees of TPC-H
  • LineItem is the only root for a reachable graph.
  • No strongly connected components
  • Join graph is reachable graph
  • Enumerate spanning trees on original graph
  • Remove shortcut joins and re-compute

18
TPC-H Join Graph
Line Item
Part Supp
Order
Supplier
Part
Nation
Customer
Region
19
TPC-H Maximal Join Trees
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
4
Region
1
3
Region
Region
2
Region
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
8
7
6
5
Region
Region
Region
Region
20
Shortcut Joins
  • Semantically equivalent join paths
  • A shortcut join is a join that is semantically
    equivalent to a longer join path
  • Core join path (longer) preserved in join graph
  • Shortcut join removed for join determination
  • Appears to be a semantically different
    interpretation of the query
  • Substituted back into query
  • No nodes on core path in query (faster)
    execution)
  • TPC-H has two shortcut joins

21
TPC-H Join Graph Remove Shortcut Joins
Line Item
Red Shortcut Joins
Part Supp
Order
Supplier
Part
Nation
Customer
Region
22
Original TPC-H Maximal Join Trees
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
4
Region
1
3
Region
Region
2
Region
Line Item
Line Item
Line Item
Line Item
Part Supp
Part Supp
Part Supp
Part Supp
Supplier
Order
Supplier
Order
Part
Supplier
Order
Part
Part
Supplier
Order
Part
Nation
Customer
Nation
Customer
Nation
Customer
Nation
Customer
8
7
6
5
Region
Region
Region
Region
23
TPC-H Semantically Unique Maximal Join Trees
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Region
1
2
24
Query and Query Interpretation AutoJoin
  • Join Graphs
  • Query
  • Sub-graph of the join graph
  • Nodes and (optionally) edges
  • Not connected requires inference
  • Query Interpretation
  • Connected sub-graph of the join graph
  • Includes all specified nodes and edges

25
Example Query
  • SELECT Part.Name
  • WHERE Nation.Name United States
  • Relate Part.Name to Nation.Name
  • Part and Nation Nodes.
  • Query of Part and Nation nodes to AutoJoin.
  • The query is ambiguous
  • More than one query interpretation
  • Nation relates to Supplier and Customer
  • Return the query with fewest joins first

26
Efficient Query Time Execution
  • Find maximal join trees with query nodes
  • Reverse index - relation to its set of join trees
  • Intersect lists
  • Build Interpretations
  • Least common ancestor (vs. recursive prune)
  • Pre-compute ancestor lists
  • No lossless interpretations (no trees)
  • Find lossy interpretation
  • Rank interpretations by cost function

maximal sets of lossless joins
27
Both Trees Contain Query Nodes Select Part.Name
where Nation.Name United States
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Red Target Nodes
Region
1
2
28
Query Processing
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Red Target Nodes Blue Tree Nodes Gray Nodes
to Prune
Region
Region
1
2
29
Query Interpretations
Select Part.Name where Customer.Nation.Name
United States
Select Part.Name where Supplier.Nation.Name
United States
Line Item
Part Supp
Supplier
Part
Part Supp
Part
Order
Nation
Nation
Customer
1
2
30
Unambiguous Query Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Region
Red Target Nodes
Region
1
2
31
Query Processing Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Part
Order
Order
Supplier
Part
Nation
Customer
Nation
Customer
Red Target Nodes Blue Tree Nodes Gray Nodes
to Prune
Region
Region
1
2
32
Query Interpretations Select Supplier.Name where
Order.Id 73
Line Item
Line Item
Part Supp
Part Supp
Supplier
Order
Supplier
Order
1
2
33
The Unambiguous Query Interpretation Select
Supplier.Name where Order.Id 73
Line Item
Part Supp
Supplier
Order
34
Additional InterpretationsLossy Joins
  • Related through a node involved in two distinct
    roles
  • Two maximal join trees contain all query nodes
    and have at least one node in common
  • Union maximal join trees
  • Common nodes provide relation for trees.
  • Interpretation where node will have two incoming
    edges
  • No longer lossless
  • Example Customer and Supplier related through
    Nation in TPC-H.
  • Cross products of Customers and Suppliers with
    the same nation

35
Beyond Natural Joins
  • Theta joins
  • Merge the two nodes related by theta join into
    single node and re-compute maximal objects.
  • Expand this node for final query interpretation
    with theta join
  • Tuple Variables
  • A query interface may specify tuple variables
  • Additional nodes and edges will be added to join
    graph to complete the query interpretations

36
Performance Experiments
  • Broad Range of Schemas
  • caBIO (NCI) 149 relations, 213 joins, and 1253
    maximal join trees
  • TPC-H Standard Database
  • Inferred standard queries (21 specified queries)
  • Ambiguity reduced by removing shortcut joins
  • Tenant 9 nodes, 50 joins, and 1286 maximal join
    trees

37
Peformance Results
  • Time to generate all Maximal Join Trees
  • Handles schemas where previous method failed
  • Worst test 2.7 seconds
  • Average lt 1 second
  • Reduce Ambiguity
  • Removing shortcut joins reduces ambiguity
  • Increased number of unambiguous query
  • From 45 to 68 for TPC-H Benchmark Queries
  • Minimal overhead of inference at query time
  • Average lt 1 millisecond
  • Worst test 7.4 milliseconds

38
Compute Maximal Join TreesEMO vs. All Ways
39
Reducing Ambiguity Remove Shortcut Joins
40
Query Inference Time(Milliseconds)
41
AutoJoin Conclusions
  • Scalable inference engine
  • Efficiently pre-compute maximal join trees
  • Reduced ambiguity by removing shortcut joins
  • Overhead is minimal
  • Complex queries can be inferred
  • Built directly on relational model

42
Future Work
  • Develop a query language
  • Remove requirement of understanding the
    underlying schema
  • Automatically determines joins
  • End user interface based on AutoJoin
  • Query inference for integration systems.

43
Query Inference(Previous)
  • The translation of a query in a query language
    into an unambiguous representation of the query
  • Wald and Sorenson, 1984

44
Universal Relation
  • First model to require query inference
  • Maximal Objects (Maier and Ullman, 1983)
  • Lossless Join property to identify potential
    joins
  • Grows all ways on hyper-graph
  • Returns a union of all query interpretations
  • Minimum Directed Cost Steiner Tree (Wald and
    Sorenson, 1984)
  • Limited to Partial 2-Trees
  • Returns only lowest cost query interpretation
  • Generate a single interpretation
  • Do not meet need of new query languages
  • Limited query interpretations possible

45
State of the Art Query Languages
  • Keyword Searches
  • Keywords map to either specific data, attribute
    names, or relation names in a database.
  • Must identify joins to relate keywords spread
    across multiple relations.
  • Multiple approaches to identifying the top-k
    relationships between keywords.

46
Keyword SearchTop-K Relationships
  • Discover (Hristidis and Papakonstantinou, 2002,
    2003, 2004)
  • Grow all ways from a keyword
  • Limit on number of joins
  • Creates extra graphs
  • DBXplorer (Agrawal et al. 2002)
  • Generates spanning trees at query time
  • BANKS ( )
  • Graph of all tuples related by joins
  • Must fit in memory (limited to smaller databases)

47
State of the Art Query Languages
  • Conceptual Query Languages or Models
  • Queries built with concepts that map to a
    database.
  • Remove the burden of knowledge of the schema.
  • Must determine joins to relate concepts in query.
  • Use conceptual model to determine joins

48
Conceptual Query Languages
  • CQL (Owei and Navathe, 2001)
  • Queries may include roles or joins required for a
    query
  • Pathfinder algorithm for completing the query
  • Based on shortest path between source and target
    concepts in query
  • Semantically Constrained ER Diagram as a graph
    used to determine joins.
  • Conceptual Model (Zhang et al., 1999)
  • Semantic graph of database
  • Search algorithm constrained by number of joins
    or number of interpretations

49
State of the Art Query Languages
  • Natural Language Queries
  • Natural language queries map the language to
    concepts in a database
  • Joins must be determined to relate concepts in
    database similar to Conceptual Query Languages

50
Functional Dependencies due to Primary KeysTPC-H
Table Functional Dependencies
Part p_partkey ? p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment
Supplier s_suppkey ? s_name, s_address, s_nationkey, s_phone, s_acctbal, s_comment
PartSupp ps_partkey, ps_suppkey ? ps_availqty, ps_supplycost, ps_comment
Customer c_custkey ? c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
Order o_orderkey ? o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment
LineItem l_orderkey, l_linenumber ? l_partkey, l_suppkey, l_orderkey , l_quantity, l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
Nation n_nationkey ? n_name, n_regionkey, n_comment
Region r_regionkey ? r_name, r_comment
Primary Keys Foreign Keys
51
Function Dependencies TPC-H implied by Foreign
Keys
Table with Foreign Key Table Referenced Functional Dependencies
LineItem Part l_partkey ? p_partkey
LineItem Supplier l_suppkey ? s_suppkey
LineItem PartSupp l_partkey, l_suppkey ? ps_partkey, ps_suppkey
LineItem Order l_orderkey ? o_orderkey
PartSupp Part ps_partkey ? p_partkey
PartSupp Supplier ps_suppkey ? s_suppkey
Supplier Nation s_nationkey ? n_nationkey
Customer Nation c_nationkey ? n_nationkey
Order Customer o_custkey ? c_custkey
Nation Region n_regionkey ? r_regionkey
Primary Keys Foreign Keys
52
TPC-H Join Graph
Line Item
Part Supp
Order
Supplier
Part
Nation
Customer
Region
Write a Comment
User Comments (0)