DBXplorer:%20A%20System%20for%20Keyword-Based%20Search%20over%20Relational%20Databases - PowerPoint PPT Presentation

About This Presentation

Title:

DBXplorer:%20A%20System%20for%20Keyword-Based%20Search%20over%20Relational%20Databases

Description:

Internet search engines have popularized keyword-based search. ... pub-cell returns more number of SQL statements than pub-col as for a keyword in ... – PowerPoint PPT presentation

Number of Views:70

Avg rating:3.0/5.0

Slides: 17

Provided by: cmk264

Learn more at: https://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: DBXplorer:%20A%20System%20for%20Keyword-Based%20Search%20over%20Relational%20Databases

1
DBXplorer A System for Keyword-Based Search over
Relational Databases

Sanjay Agrawal
Surajit Chaudhuri
Gautam Das
Presented by
Bhushan Pachpande

2
Contents

Introduction
Overview of DBXplorer
Symbol Table Design - Publish
Keyword Search
Support for Generalized Matches
Conclusion

3
Introduction

Internet search engines have popularized
keyword-based search.
Searching on traditional database management
system is done through customized applications
which are closely tied to the database schema.
Traditional database management systems do not
support keyword-based search.
e.g. search the Microsoft intranet on Jim Gray
to obtain matched rows, i.e., rows in the
database where Jim Gray occur.
In this paper, DBXplorer, an efficient and
scalable keyword search utility for relational
databases, is described.
The goal is to enable such searches without
necessarily requiring the users to know the
schema of the respective databases.

4
Example
Searching for a book

Keywords Programming by Ritchie
Less probability of presence of both keywords in
single row of table
For result, rows need to be generated by joining
tables on the fly (all possible combinations)

Authors
AuthorsBooks
Books
BookStores
Store
5
Overview of DBXplorer

Objective - Given a set of query keywords,
DBXplorer returns all rows (either from single
tables, or by joining tables connected by
foreign-key joins) such that the each row
contains all keywords.
Applying IR techniques from the documents world
to databases is difficult, because of
Database normalization by which logical units
of information may be fragmented and scattered
across several tables
Matching row may be obtained by joining several
tables on the fly
IR techniques use Inverted Lists Symbol Table
in databases
Symbol Table - Stores the information about
keywords at different granularities (column/row),
i.e. for each keyword stores the list of all rows

6
Overview of DBXplorer - Publish

Enabling keyword search in DBXplorer requires 2
steps
Publish
enables database for keywords by building Symbol
table and associated structures
Search
retrieves matching rows from published database
Steps in Publish
Step 1 A database is identified, along with the
set of tables and columns within the database to
be published.
Step 2 Auxiliary tables like Symbol table are
created for supporting keyword searches.

7
Overview of DBXplorer - Search

Steps in Search
Looks up symbol table to find tables / columns
which contain keywords
Identify all possible joins (subsets of table if
joined) whose rows might contain required
keywords
Generate SQL statements for each join (gives rows
which contain all keywords), rank rows and return

8
Architecture of DBXplorer

The publish component provides interfaces to
select a database,
select tables/columns within the database to
publish, and
modify/remove/maintain the publication.
For a given set of keywords, the search component
provides interfaces to
retrieve matching databases from a set of
published databases, and
selectively identify tables, columns/rows that
need to be searched within each database
identified in step (1).

Architecture of DBXplorer
9
Symbol Table Design

Exact match problem considered only
Symbol Table (S) - stores the information about
keywords at different granularities (column/row),
i.e. for each keyword stores the list of all rows
Column Level granularity (Pub-Col)
For every keyword S maintains list of all
database columns (i.e. table.column)
Cell level granularity (Pub-cell)
For every keyword S maintains list of all
database cells (i.e. table.column.rowid)

10
Symbol Table - Design factors

Space and Time Requirements
Size pub-col are smaller than pub-cell since
repetition of keyword in a column does not
increases entries in case of pub-col
Time pub-col takes less time to build
Keyword search performance
Depends on efficient generation and execution of
SQL statement (built from symbol table entries)
pub-cell returns more number of SQL statements
than pub-col as for a keyword in column there are
multiple entries
Ease of maintenance
Insert/Update required for insertion of distinct
new value in case of pub-col while pub-cell needs
for every update/insert
Same for delete

11
Storing Symbol Table

Store symbol tables (pub-col) in database as
(keyword hash , column Id)
FK Compression (Foreign Key)
If there is foreign key relationship between c1
and c2, store only c1
CP Compression
Partition H into a minimum number of bipartite
cliques (a bipartite clique is any subgraph of H
with a maximal number of edges).
Compress each clique.
Stores symbol table (pub-cell) in database as
(keyword hash, list of all cellids)

v2 v3 v4
c1 c2
x
Uncompressed hash table
ColumnsMap table
Compressed hash table
12
Search - Enumerating Join Trees

Step1 - Looks up symbol table to find tables /
columns which contain keywords
Step2 - Enumerate join trees
Identify and enumerate all potential subsets of
tables in the database that, if joined, might
contain rows having all keywords.
The resulting relation will contain all potential
rows having all keywords specified in the query.

keywords

If we view the schema graph G as an undirected
graph, this step enumerates join trees, i.e.,
sub-trees of G such that
the leaves belong to MatchedTables
together, the leaves contain all keywords of the
query

Join Trees
13
Search Identify matching rows

The input to this final search step is the
enumerated join trees.
Each join tree is then mapped to a single SQL
statement that joins the tables as specified in
the tree, and selects those rows that contain all
keywords.
The retrieved rows are ranked before being
output.
Rows ranked by number of joins involved (ties
broken arbitrarily) (same as keywords occurring
close to one another in documents are ranked
higher)

Join Trees
14
Generalized Matches Token Matches

Token matches - the keyword in the query matches
only a token or sub-string of an attribute value
(e.g., retrieve rows of address by specifying
only a street name).
Pub-Prefix method
B tree indexes can be used to retrieve rows
whose cell matches a given prefix string
This clause is of the form
WHERE T.C LIKE PK
During publishing of a database, for every
keyword K, the entry (hash(K), T.C, P) is kept in
the symbol table if there exists a string in
column T.C which
contains a token K, and
has prefix P

15
Generalized Matches - Token Matches
Let the hash values of the searchable tokens
i.e., string, ball and round be 1, 2 and 3
respectively
RowId C
1 this is a string
2 this string
3 this is a ball
4 no string
5 any ball is round
Pub-Prefix table
Database table T
Consider searching keyword string Pub-Prefix
table returns prefixes th and no and
subsequent SQL will contain (T.C LIKE
thstring) OR (T.C LIKE nostring)
16
Conclusion

This paper discusses DBXplorer, a system that
enables keyword-based search in relational
databases.
DBXplorer uses symbol table alternatives to store
the location of keywords in database.
DBXplorer support exact matches and generalized
matches upto some extent.

Write a Comment

User Comments (0)