Keyword Searching in Relational Databases - PowerPoint PPT Presentation

1 / 62

About This Presentation

Title:

Keyword Searching in Relational Databases

Description:

... forward and backward edge weights ... Make edge weight scale-free by dividing edge weigth by wmin ... Subdue extreme node weights by using log(1 indegree) ... – PowerPoint PPT presentation

Number of Views:583

Avg rating:3.0/5.0

Slides: 63

Provided by: cseIi3

Category:

more less

Transcript and Presenter's Notes

Title: Keyword Searching in Relational Databases

1

Keyword Searching in Relational Databases
Esha Palta (05329017)
Kumar Gaurav Bijay (02005013)

2
Dilbert Strip ?
3
Motivation

Keyword search
We have SQL, why keyword-querying?
SQL - not appropriate for naive users
So many online databases (imdb, citeseer,
bseindia ) user cannot keep track of schema
for all of these

4
Simple Approaches

Using Form interfaces
Require separate form for each type of query
confusing
Not suitable for ad-hoc queries how many forms
will you provide?
How about Google?
Export data from db to documents and do keyword-
querying on these
Suffers from duplication overheads
Google wants all keywords in one document. DB is
often normalized, so need to join tables and
store as documents
Multiple combinations of tables to join. Not
scalable

5
Differences from Web Search

Related data split across multiple tuples due to
normalization
Different keywords may match tuples from
different relations
What joins are to be computed can only be decided
on the fly
Need to find result containing all keywords and
rank them somehow

Writes (AuthorId, PaperId)
Paper (PaperId, PaperName)
Author (AuthorId, AuthorName)
Cites (Citing, Cited)
The DBLP Bibliography Schema
6
Systems for DB search

BANKS (Browsing and Keyword Search)
IITB (ICDE 02)
DBXplorer Microsoft Research (ICDE 02)
ObjectRank IBM, UCSD, FIU (VLDB 04)
Bidirectional BANKS IITB (VLDB 05)

7
Systems for DB search

BANKS (Browsing and Keyword Search)
IITB (ICDE 02)
DBXplorer Microsoft Research (ICDE 02)
ObjectRank IBM, UCSD, FIU (VLDB 04)
Bidirectional BANKS IITB (VLDB 05)
will cover in depth

8
BANKS (ICDE 02)
9
The BANKS system

BANKS Architecture
Available on the web
http//www.cse.iitb.ac.in/banks
Connects to database using JDBC
JDBC metadata features used to provide schema
browsing
Preprocesses db

User
BANKS
HTTP
JDBC
Web-server
Database
10
Basic Model

Database modeled as a graph
Nodes tuples
Edges references between tuples
foreign key (assume for this talk), inclusion
dependencies, ..
Edges are directed.

PaperIdPaperName
AuthorIDPaperId
AuthorId
DBLP example
11
The BANKS Answer Model

Query set of search terms t1, t2, .., tn
For each search term ti we find set of nodes Si
matching ti
Eg Query Sudarshan Roy (t1 Sudarshan, t2
Roy)
Answer rooted, directed tree connecting nodes
matching keywords
Root node has special significance, may be
restricted to some relations
E.g. relations representing entities, not
relationships
May include intermediate nodes not in any Si
(Steiner Tree)
Multiple answers
Ranking based on proximity prestige

12
Answer Example
Query sudarshan roy
Paper
Writes
Writes
Author
Author

We would like to find sets of (closely) connected
tuples that match all given keywords

13
Edge Directionality

Directed tree will miss desired answers. For eg
Query DBXplorer ObjectRank
So, for each forward edge, BANKS adds a back edge

CitedBy
Cited
Cites
BANKS
DBXPlorer
ObjectRank
Cites
CitedBy
Cited
DBXPlorer
BANKS
Cites
ObjectRank
Cites
14
Edge Directionality

What if we ignore directionality?
Some popular tuples are connected to many other
tuples
E.g. Students -gt departments -gt university
Problem A popular tuple would create misleading
shortcuts between tuples
E.g. every student would be closely linked with
every other student via the department/university
Solution define different forward and backward
edge weights
Forward edges In the direction of the foreign
key reference

15
Edge Weight

Weight of forward edge based on schema
e.g. citation link weights gt writes link
weights
Weight of backward edge indegree of edges
pointing to the node

16
Edge Weight Scaling

Normalize edge score Escore(e)
Make edge weight scale-free by dividing edge
weigth by wmin
Problem Some backward edges have unduly large
weights
Depress the scale by defining Escore(e) as
log(1w(e)/wmin )
Overall Escore E 1 / (1 ?e Escore(e))

17
Node Weight

Set weight of a Node Indegree of the node
As per prestige rankings nodes with multiple
pointers to them get a higher prestige
So, higher node weight corresponds to higher
prestige
Problem Nodes with many in-edges result in
skewed answers
Subdue extreme node weights by using
log(1indegree)
Node score Nscore Average of node scores
(root-node-weight ? leaf-node-weights)

18
Combining Scores

Combining two independent metrics node weight
and edge weight
Normalize each to 0-1
Combine using weighting factor ?
Additive (1- ?) Escore ? Nscore
Multiplicative Escore Nscore?
Performance study to compare alternatives and to
find reasonable values for ?

19
First Step Symbol Table

The first step is to build a symbol table
This table is in the db and is not normalized
Example

20
Searching for Best Answers

Backward Expanding Search Algorithm
Assume graph fits in memory
Idea find vertices from which a forward path
exists to at least one node from each Si.
Run concurrent single source shortest path
algorithm from each node matching a keyword
Create an iterator for each node matching a
keyword
Traverse the graph edges in reverse direction
Output a node whenever it is on the intersection
of the sets of nodes reached from each keyword
Answer trees may not be generated in relevance
order

21
Backward Expanding Search
Iterators
22
BANKS Query Result Example

Result of Sudarshan Roy

23
Result Ordering

Answers need not be always in Relevance order

This tree is output
Better Root Missed
2
2
2
5
2
1
24
Result Ordering (contd)

Solution
Generate all connection trees and then sort them
Increases computation costs and leads to a
greatly increased time to generate initial
results
Create a small heap ordered on the relevance of
the trees
Output highest ranked tree from heap to user when
heap is full
What about duplicate results?
Maintain a list of generated results for
duplicate detection
Discard result according to relevance

25
Experience and Performance

BANKS provides keyword search coupled with
extensive browsing facilities
Schema browsing data browsing
Graphical display of data
Implemented using Java servlets
Keyword search response times typically 1 to 3
seconds on
DBLP database with 100,000 tuples/300,000 edges
P3 600 MHz, 512 MB RAM

26
Anecdotes

Mohan
Returns C. Mohan at top based on prestige (number
of papers written)
Transaction
Returns Jim Grays classic paper and textbook as
top answers based on prestige (number of
citations)
Sunita Seltzer
No common papers, but both have papers with
Stonebraker system finds this connection

27
Effect of Parameters

Log scaling of edge weights worked well
(1- ?) E ? N versus E N???????made little
difference
Best with ? .2 (subdue node weights but not
entirely)

28
BANKS (VLDB 05)
29
Motivation

BANKS performs poorly if
Keyword matches lot of nodes (so lot of Dijkstra
sources)
Search hits a node with large fan in.

Wastes time
Sudarshan
Roy
30
New Ideas Forward Search

Why only backward, lets search forward too

How about fwd Searching ?

Sudarshan
Roy
31
New Ideas - Activation

Activation - Cannot forward search from each
node.
Spread activation from keyword nodes to others.
Activation is like Page Rank with decay.
High Activation ? close to many keywords.

32
Activation Spreading

Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (µ 0.3)
Gives low activation to neighbors of hubs

33
Modifications to Model

Graph model stays the same.
BANKS is concerned with search more than how to
tune parameters or define node weights / edge
weights.
BANKS code
Tree Node Score, N
Tree Edge Score, E
Total Score ENl (l 0.2)

34
The New Algorithm

Need two priority queues
Qin - do backward search from these nodes
Qout - do forward search from these nodes
Each node, n keeps 3 variables per keyword, ti
sp i Node to got to from n for shortest-path
to ti
distance i Length of the shortest-path from
n to ti
Activation i Activation to n from keyword ti

35
The New Algorithm continued

Set initial activation keyword nodes and add to
Qin for backward-search.
At each step, pick node with maximum activation
i.e. if (Qin.getMaxActivation gt Qout.
getMaxActivation))
// use node from Qin
else
// use node from Qout
If node from Qin, do backward search and add
itself to Qout. (newly explored nodes into Qin)
If node from Qout, do forward search
If node has reached from all keyword, generate
result-tree. answer is buffered as results can
be out of order

36
Explanation with example
Qin
Qout
N100
N4
Roy Sudarshan
N1

N3
N2
Roy
Sudarshan
37
Explanation with example
Qin
Qout
N100
Roy Sudarshan
N4
N1 N2
N1

N3
N2
Roy
Sudarshan
38
Explanation with example
Qin
Qout
N100
N1 Roy Sudarshan
N2 N3 N100
N4
N1

N3
N2
Roy
Sudarshan
Result Found !
39
Generation of top-k results

If we know the score of next-best answer, all
buffered answers with better score can be output.
Need upper bounds

40
Computation of upper bound

For each keyword ti, we have explored nodes upto
some length say li.
So, next best score (approx.)
This is not a true upper bound, but works quite
well and is simple !

41
Are we losing answers ?

BANKS I used many Dijkstra states, BANKS II
uses 2 only forward and backward search-states.
The result is that we can now lose answers !

42
Answer Loss Example
Ny
K1
K2
Nx
K1
This is the generated answer.
This answer is lost.
43

But, we will generate this tree rooted at Nx
So, a rotated tree with same nodes but different
root is often generated !

NY
K2
NX
K1
44
Metrics of Performance

Manually obtain best relevant answers.
Determine 2 times
Time taken to produce last relevant answer.
Time taken to output last relevant answer.
Search algorithms
MI-Bkwd original backward search
Iterator for every node matching a keyword
SI-Bkwd backward search with single backward
iterator
Bidirec bidirectional search
Datasets
DBLP, IMDB 2 million nodes, 9 million edges
US Patent DB 4 million nodes, 15 million edges

45
Graph - I

MI-Bkwd versus SI-Bkwd

SI-Bkwd gain increases with origin size,
keywords

46
Graph - II

SI-Bkwd versus Bidirec

Bidirec gain increases with origin size,
keywords

47
A Critique

BANKS needs a lot of memory.
Need to cluster and keep parts of graph on disk.
Work is in progress ?

48
DBXplorer (ICDE 02)
49
DBXplorer (Microsoft Research)

Use symbol table to determine which tables to
join.
Generate all possible table join combinations
Figure

T1, T2, T3, T4 and T5 are tables
50
Cool ideas in DBXplorer

Symbol table need not be at tuple level. If
column has an index, column level symbol table
is ok.
Table Compression
e.g. Keywords Columns Keywords Columns

K1
K1
C1
C1
K2
K2
X
K3
C2
C2
K3
K4
K4
K5
K5
Intermediate Column
51
ObjectRank (VLDB 04)
52
ObjectRank (IBM, FIU, UCSD)

Creates objects in database. Object definition is
manual.
e.g. in DBLP, author, conference and paper can
be defined as objects.
Heavily inspired by PageRank.
Each node is given global ObjectRank just like
PageRank of Google.

53
ObjectRank Ideas

Keyword-level ObjectRank for each keyword,
precompute and save object ranks of nodes can
optimize by defining cut-off)
Score of node, n w.r.t. keyword k
scorek(n) f (Global-object-rank (n),
Objectrankk (n))
At run time, scores are combined
scorek1,k2,,km(n) scorek1(n) scorek2(n)
scorekm(n)

54
ObjectRank Algorithm and answers

If graph is DAG or near DAG, topologically sort
and spread ObjectRank in this order.
Answers are single objects and not Cluster /
group as in BANKS.
Demo at
http//teriyaki.ucsd.edu9099/objrank/main05_ne
w.html

55
Conclusion

Studied BANKS, both versions.
Covered cool ideas from DBXplorer and ObjectRank.
Graph of BANKS must be made disk-resident.

56
References

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, and S. Sudarshan.Keyword
Searching and Browsing in Databases using
BANKS.In International Conference on Data
Engineering (ICDE), pages 10831096, 2002.
Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan et. al.Bidirectional
Expansion for Keyword Search on Graph
Databases.In VLDB Conference, pages 505516,
2005.
Sanjay Agrawal, Surajit Chaudhari, and Gautam
Das.DBXplorer A System for Keyword-Based Search
over Relational Databases.In International
Conference on Data Engineering (ICDE), pages
522, 2002.
Andrey Balmin, Vagelis Hristidis, and Yannis
Papakonstantinou.ObjectRank Authority-Based
Keyword Search in Databases.In VLDB Conference,
pages 564575, 2004.

57
Appendix

Extra slides

58
Browsing - May add??????

Hyperlinks are there for all primary key foreign
key attributes
Each table is displayed with set of tools for
interacting with data
Projection (using drop), Selection, Join,
Group-by, Sort
Template facilities to do a variety of tasks
Browsing data by grouping and creating crosstabs
e.g., theses grouped by department and year
Hierarchical views of data
Nested XML style, even on relational data
Graphical displays
Bar charts, pie charts, etc
Templates are generic and can be applied on any
data matching assumed schema
Can be applied after applying selections
New templates can be created by user,
interactively

59
Example of Browsing in BANKS
60
Related Work

DataSpot (DTL)/Mercado Intuifind VLDB 98
Based on patent by Palmon (filed 1995, granted
1998)
Based on hypergraph model, similar answer model
to ours
Differences our model of backward link weights
and prestige
Proximity Search VLDB98
Different model of proximity based on adding up
support
No edge weights, prestige, different evaluation
algorithm
Information units (linked Web pages) WWW10
No directionality, only studied in Web context
Microsoft DBExplorer (this conference)
No ranking, based on SQL generation
Addresses efficient construction of text indexes
Microsoft English query

61
Extensions

Summarization of output
group the output tuples into sets that have same
tree structure
define the notion of similarity between two
result trees
perform restricted search
Metadata queries (attributekeyword queries)
For example authorlevy
match all the tuples of a relation
costly
Forward searching approach

62
Proposed Conclusions and Future Work

BANKS is an integrated browsing and keyword
querying system for relational databases
Future work
Keyword queries on XML
Disambiguating queries by selecting
Nodes G.W.Bush Bush Jr or Bush Sr
Tree structure coauthors or cites
Boolean queries
Metadata queries
Summarization of output