Title: DBXplorer:%20A%20System%20for%20Keyword-Based%20Search%20over%20Relational%20Databases
1DBXplorer A System for Keyword-Based Search over
Relational Databases
- Sanjay Agrawal
- Surajit Chaudhuri
- Gautam Das
- Presented by
- Bhushan Pachpande
2Contents
- Introduction
- Overview of DBXplorer
- Symbol Table Design - Publish
- Keyword Search
- Support for Generalized Matches
- Conclusion
3Introduction
- Internet search engines have popularized
keyword-based search. - Searching on traditional database management
system is done through customized applications
which are closely tied to the database schema. - Traditional database management systems do not
support keyword-based search. - e.g. search the Microsoft intranet on Jim Gray
to obtain matched rows, i.e., rows in the
database where Jim Gray occur. - In this paper, DBXplorer, an efficient and
scalable keyword search utility for relational
databases, is described. - The goal is to enable such searches without
necessarily requiring the users to know the
schema of the respective databases.
4Example
Searching for a book
- Keywords Programming by Ritchie
- Less probability of presence of both keywords in
single row of table - For result, rows need to be generated by joining
tables on the fly (all possible combinations)
Authors
AuthorsBooks
Books
BookStores
Store
5Overview of DBXplorer
- Objective - Given a set of query keywords,
DBXplorer returns all rows (either from single
tables, or by joining tables connected by
foreign-key joins) such that the each row
contains all keywords. - Applying IR techniques from the documents world
to databases is difficult, because of - Database normalization by which logical units
of information may be fragmented and scattered
across several tables - Matching row may be obtained by joining several
tables on the fly - IR techniques use Inverted Lists Symbol Table
in databases - Symbol Table - Stores the information about
keywords at different granularities (column/row),
i.e. for each keyword stores the list of all rows
6Overview of DBXplorer - Publish
- Enabling keyword search in DBXplorer requires 2
steps - Publish
- enables database for keywords by building Symbol
table and associated structures - Search
- retrieves matching rows from published database
- Steps in Publish
- Step 1 A database is identified, along with the
set of tables and columns within the database to
be published. - Step 2 Auxiliary tables like Symbol table are
created for supporting keyword searches.
7Overview of DBXplorer - Search
- Steps in Search
- Looks up symbol table to find tables / columns
which contain keywords - Identify all possible joins (subsets of table if
joined) whose rows might contain required
keywords - Generate SQL statements for each join (gives rows
which contain all keywords), rank rows and return
8Architecture of DBXplorer
- The publish component provides interfaces to
- select a database,
- select tables/columns within the database to
publish, and - modify/remove/maintain the publication.
- For a given set of keywords, the search component
provides interfaces to - retrieve matching databases from a set of
published databases, and - selectively identify tables, columns/rows that
need to be searched within each database
identified in step (1).
Architecture of DBXplorer
9Symbol Table Design
- Exact match problem considered only
- Symbol Table (S) - stores the information about
keywords at different granularities (column/row),
i.e. for each keyword stores the list of all rows - Column Level granularity (Pub-Col)
- For every keyword S maintains list of all
database columns (i.e. table.column) - Cell level granularity (Pub-cell)
- For every keyword S maintains list of all
database cells (i.e. table.column.rowid)
10Symbol Table - Design factors
- Space and Time Requirements
- Size pub-col are smaller than pub-cell since
repetition of keyword in a column does not
increases entries in case of pub-col - Time pub-col takes less time to build
- Keyword search performance
- Depends on efficient generation and execution of
SQL statement (built from symbol table entries) - pub-cell returns more number of SQL statements
than pub-col as for a keyword in column there are
multiple entries - Ease of maintenance
- Insert/Update required for insertion of distinct
new value in case of pub-col while pub-cell needs
for every update/insert - Same for delete
11Storing Symbol Table
- Store symbol tables (pub-col) in database as
(keyword hash , column Id) - FK Compression (Foreign Key)
- If there is foreign key relationship between c1
and c2, store only c1 - CP Compression
- Partition H into a minimum number of bipartite
cliques (a bipartite clique is any subgraph of H
with a maximal number of edges). - Compress each clique.
- Stores symbol table (pub-cell) in database as
(keyword hash, list of all cellids)
v2 v3 v4
c1 c2
x
Uncompressed hash table
ColumnsMap table
Compressed hash table
12Search - Enumerating Join Trees
- Step1 - Looks up symbol table to find tables /
columns which contain keywords - Step2 - Enumerate join trees
- Identify and enumerate all potential subsets of
tables in the database that, if joined, might
contain rows having all keywords. - The resulting relation will contain all potential
rows having all keywords specified in the query.
keywords
- If we view the schema graph G as an undirected
graph, this step enumerates join trees, i.e.,
sub-trees of G such that - the leaves belong to MatchedTables
- together, the leaves contain all keywords of the
query
Join Trees
13Search Identify matching rows
- The input to this final search step is the
enumerated join trees. - Each join tree is then mapped to a single SQL
statement that joins the tables as specified in
the tree, and selects those rows that contain all
keywords. - The retrieved rows are ranked before being
output. - Rows ranked by number of joins involved (ties
broken arbitrarily) (same as keywords occurring
close to one another in documents are ranked
higher)
Join Trees
14Generalized Matches Token Matches
- Token matches - the keyword in the query matches
only a token or sub-string of an attribute value
(e.g., retrieve rows of address by specifying
only a street name). - Pub-Prefix method
- B tree indexes can be used to retrieve rows
whose cell matches a given prefix string - This clause is of the form
- WHERE T.C LIKE PK
- During publishing of a database, for every
keyword K, the entry (hash(K), T.C, P) is kept in
the symbol table if there exists a string in
column T.C which - contains a token K, and
- has prefix P
15Generalized Matches - Token Matches
Let the hash values of the searchable tokens
i.e., string, ball and round be 1, 2 and 3
respectively
RowId C
1 this is a string
2 this string
3 this is a ball
4 no string
5 any ball is round
Pub-Prefix table
Database table T
Consider searching keyword string Pub-Prefix
table returns prefixes th and no and
subsequent SQL will contain (T.C LIKE
thstring) OR (T.C LIKE nostring)
16Conclusion
- This paper discusses DBXplorer, a system that
enables keyword-based search in relational
databases. - DBXplorer uses symbol table alternatives to store
the location of keywords in database. - DBXplorer support exact matches and generalized
matches upto some extent.