Lecture 17: Keyword Search on Relational Databases - PowerPoint PPT Presentation

1 / 55

About This Presentation

Title:

Lecture 17: Keyword Search on Relational Databases

Description:

We have SQL, why keyword-querying? SQL - not appropriate for naive users. So many online databases (imdb, ... Query: set of search terms {t1, t2, .., tn} ... – PowerPoint PPT presentation

Number of Views:36

Avg rating:3.0/5.0

Slides: 56

Provided by: ChengXi4

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 17: Keyword Search on Relational Databases

1
Lecture 17 Keyword Search on Relational
Databases
Oct. 31, 2007 ChengXiang Zhai
Most slides are from Esha Palta and Kumar Gaurav
Bijays presentation
2
Motivation

Keyword search
We have SQL, why keyword-querying?
SQL - not appropriate for naive users
So many online databases (imdb, citeseer,
bseindia ) user cannot keep track of schema
for all of these

3
Simple Approaches

Using Form interfaces
Require separate form for each type of query
confusing
Not suitable for ad-hoc queries how many forms
will you provide?
How about Google?
Export data from db to documents and do keyword-
querying on these
Suffers from duplication overheads
Google wants all keywords in one document. DB is
often normalized, so need to join tables and
store as documents
Multiple combinations of tables to join. Not
scalable

4
Differences from Web Search

Related data split across multiple tuples due to
normalization
Different keywords may match tuples from
different relations
What joins are to be computed can only be decided
on the fly
Need to find result containing all keywords and
rank them somehow

Writes (AuthorId, PaperId)
Paper (PaperId, PaperName)
Author (AuthorId, AuthorName)
Cites (Citing, Cited)
The DBLP Bibliography Schema
5
Systems for DB search

BANKS (Browsing and Keyword Search)
IITB (ICDE 02)
DBXplorer Microsoft Research (ICDE 02)
ObjectRank IBM, UCSD, FIU (VLDB 04)
Bidirectional BANKS IITB (VLDB 05)

6
Systems for DB search

BANKS (Browsing and Keyword Search)
IITB (ICDE 02)
DBXplorer Microsoft Research (ICDE 02)
ObjectRank IBM, UCSD, FIU (VLDB 04)
Bidirectional BANKS IITB (VLDB 05)

7
BANKS (ICDE 02)
8
The BANKS system

BANKS Architecture
Available on the web
http//www.cse.iitb.ac.in/banks
Connects to database using JDBC
JDBC metadata features used to provide schema
browsing
Preprocesses db

User
BANKS
HTTP
JDBC
Web-server
Database
9
Basic Model

Database modeled as a graph
Nodes tuples
Edges references between tuples
foreign key (assume for this talk), inclusion
dependencies, ..
Edges are directed.

PaperIdPaperName
AuthorIDPaperId
AuthorId
DBLP example
10
The BANKS Answer Model

Query set of search terms t1, t2, .., tn
For each search term ti we find set of nodes Si
matching ti
Eg Query Sudarshan Roy (t1 Sudarshan, t2
Roy)
Answer rooted, directed tree connecting nodes
matching keywords
Root node has special significance, may be
restricted to some relations
E.g. relations representing entities, not
relationships
May include intermediate nodes not in any Si
(Steiner Tree)
Multiple answers
Ranking based on proximity prestige

11
Answer Example
Query sudarshan roy
Paper
Writes
Writes
Author
Author

We would like to find sets of (closely) connected
tuples that match all given keywords

12
Edge Directionality

Directed tree will miss desired answers. For eg
Query DBXplorer ObjectRank
So, for each forward edge, BANKS adds a back edge

CitedBy
Cited
Cites
BANKS
DBXPlorer
ObjectRank
Cites
CitedBy
Cited
DBXPlorer
BANKS
Cites
ObjectRank
Cites
13
Edge Directionality

What if we ignore directionality?
Some popular tuples are connected to many other
tuples
E.g. Students -gt departments -gt university
Problem A popular tuple would create misleading
shortcuts between tuples
E.g. every student would be closely linked with
every other student via the department/university
Solution define different forward and backward
edge weights
Forward edges In the direction of the foreign
key reference

14
Edge Weight

Weight of forward edge based on schema
e.g. citation link weights gt writes link
weights
Weight of backward edge indegree of edges
pointing to the node

15
Edge Weight Scaling

Normalize edge score Escore(e)
Make edge weight scale-free by dividing edge
weigth by wmin
Problem Some backward edges have unduly large
weights
Depress the scale by defining Escore(e) as
log(1w(e)/wmin )
Overall Escore E 1 / (1 ?e Escore(e))

16
Node Weight

Set weight of a Node Indegree of the node
As per prestige rankings nodes with multiple
pointers to them get a higher prestige
So, higher node weight corresponds to higher
prestige
Problem Nodes with many in-edges result in
skewed answers
Subdue extreme node weights by using
log(1indegree)
Node score Nscore Average of node scores
(root-node-weight ? leaf-node-weights)

17
Combining Scores

Combining two independent metrics node weight
and edge weight
Normalize each to 0-1
Combine using weighting factor ?
Additive (1- ?) Escore ? Nscore
Multiplicative Escore Nscore?
Performance study to compare alternatives and to
find reasonable values for ?

18
First Step Symbol Table

The first step is to build a symbol table
This table is in the db and is not normalized
Example

19
Searching for Best Answers

Backward Expanding Search Algorithm
Assume graph fits in memory
Idea find vertices from which a forward path
exists to at least one node from each Si.
Run concurrent single source shortest path
algorithm from each node matching a keyword
Create an iterator for each node matching a
keyword
Traverse the graph edges in reverse direction
Output a node whenever it is on the intersection
of the sets of nodes reached from each keyword
Answer trees may not be generated in relevance
order

20
Backward Expanding Search
Iterators
21
BANKS Query Result Example

Result of Sudarshan Roy

22
Result Ordering

Answers need not be always in Relevance order

This tree is output
Better Root Missed
2
2
2
5
2
1
23
Result Ordering (contd)

Solution
Generate all connection trees and then sort them
Increases computation costs and leads to a
greatly increased time to generate initial
results
Create a small heap ordered on the relevance of
the trees
Output highest ranked tree from heap to user when
heap is full
What about duplicate results?
Maintain a list of generated results for
duplicate detection
Discard result according to relevance

24
Experience and Performance

BANKS provides keyword search coupled with
extensive browsing facilities
Schema browsing data browsing
Graphical display of data
Implemented using Java servlets
Keyword search response times typically 1 to 3
seconds on
DBLP database with 100,000 tuples/300,000 edges
P3 600 MHz, 512 MB RAM

25
Anecdotes

Mohan
Returns C. Mohan at top based on prestige (number
of papers written)
Transaction
Returns Jim Grays classic paper and textbook as
top answers based on prestige (number of
citations)
Sunita Seltzer
No common papers, but both have papers with
Stonebraker system finds this connection

26
Effect of Parameters

Log scaling of edge weights worked well
(1- ?) E ? N versus E N???????made little
difference
Best with ? .2 (subdue node weights but not
entirely)

27
BANKS (VLDB 05)
28
Motivation

BANKS performs poorly if
Keyword matches lot of nodes (so lot of Dijkstra
sources)
Search hits a node with large fan in.

Wastes time
Sudarshan
Roy
29
New Ideas Forward Search

Why only backward, lets search forward too

How about fwd Searching ?

Sudarshan
Roy
30
New Ideas - Activation

Activation - Cannot forward search from each
node.
Spread activation from keyword nodes to others.
Activation is like Page Rank with decay.
High Activation ? close to many keywords.

31
Activation Spreading

Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (µ 0.3)
Gives low activation to neighbors of hubs

32
Modifications to Model

Graph model stays the same.
BANKS is concerned with search more than how to
tune parameters or define node weights / edge
weights.
BANKS code
Tree Node Score, N
Tree Edge Score, E
Total Score ENl (l 0.2)

33
The New Algorithm

Need two priority queues
Qin - do backward search from these nodes
Qout - do forward search from these nodes
Each node, n keeps 3 variables per keyword, ti
sp i Node to got to from n for shortest-path
to ti
distance i Length of the shortest-path from
n to ti
Activation i Activation to n from keyword ti

34
The New Algorithm continued

Set initial activation keyword nodes and add to
Qin for backward-search.
At each step, pick node with maximum activation
i.e. if (Qin.getMaxActivation gt Qout.
getMaxActivation))
// use node from Qin
else
// use node from Qout
If node from Qin, do backward search and add
itself to Qout. (newly explored nodes into Qin)
If node from Qout, do forward search
If node has reached from all keyword, generate
result-tree. answer is buffered as results can
be out of order

35
Explanation with example
Qin
Qout
N100
N4
Roy Sudarshan
N1

N3
N2
Roy
Sudarshan
36
Explanation with example
Qin
Qout
N100
Roy Sudarshan
N4
N1 N2
N1

N3
N2
Roy
Sudarshan
37
Explanation with example
Qin
Qout
N100
N1 Roy Sudarshan
N2 N3 N100
N4
N1

N3
N2
Roy
Sudarshan
Result Found !
38
Generation of top-k results

If we know the score of next-best answer, all
buffered answers with better score can be output.
Need upper bounds

39
Computation of upper bound

For each keyword ti, we have explored nodes upto
some length say li.
So, next best score (approx.)
This is not a true upper bound, but works quite
well and is simple !

40
Are we losing answers ?

BANKS I used many Dijkstra states, BANKS II
uses 2 only forward and backward search-states.
The result is that we can now lose answers !

41
Answer Loss Example
Ny
K1
K2
Nx
K1
This is the generated answer.
This answer is lost.
42

But, we will generate this tree rooted at Nx
So, a rotated tree with same nodes but different
root is often generated !

NY
K2
NX
K1
43
Metrics of Performance

Manually obtain best relevant answers.
Determine 2 times
Time taken to produce last relevant answer.
Time taken to output last relevant answer.
Search algorithms
MI-Bkwd original backward search
Iterator for every node matching a keyword
SI-Bkwd backward search with single backward
iterator
Bidirec bidirectional search
Datasets
DBLP, IMDB 2 million nodes, 9 million edges
US Patent DB 4 million nodes, 15 million edges

44
Graph - I

MI-Bkwd versus SI-Bkwd

SI-Bkwd gain increases with origin size,
keywords

45
Graph - II

SI-Bkwd versus Bidirec

Bidirec gain increases with origin size,
keywords

46
A Critique

BANKS needs a lot of memory.
Need to cluster and keep parts of graph on disk.

47
DBXplorer (ICDE 02)
48
DBXplorer (Microsoft Research)

Use symbol table to determine which tables to
join.
Generate all possible table join combinations
Figure

T1, T2, T3, T4 and T5 are tables
49
Cool ideas in DBXplorer

Symbol table need not be at tuple level. If
column has an index, column level symbol table
is ok.
Table Compression
e.g. Keywords Columns Keywords Columns

K1
K1
C1
C1
K2
K2
X
K3
C2
C2
K3
K4
K4
K5
K5
Intermediate Column
50
ObjectRank (VLDB 04)
51
ObjectRank (IBM, FIU, UCSD)

Creates objects in database. Object definition is
manual.
e.g. in DBLP, author, conference and paper can
be defined as objects.
Heavily inspired by PageRank.
Each node is given global ObjectRank just like
PageRank of Google.

52
ObjectRank Ideas

Keyword-level ObjectRank for each keyword,
precompute and save object ranks of nodes can
optimize by defining cut-off)
Score of node, n w.r.t. keyword k
scorek(n) f (Global-object-rank (n),
Objectrankk (n))
At run time, scores are combined
scorek1,k2,,km(n) scorek1(n) scorek2(n)
scorekm(n)

53
ObjectRank Algorithm and answers

If graph is DAG or near DAG, topologically sort
and spread ObjectRank in this order.
Answers are single objects and not Cluster /
group as in BANKS.
Demo at
http//teriyaki.ucsd.edu9099/objrank/main05_ne
w.html

54
References

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, and S. Sudarshan.Keyword
Searching and Browsing in Databases using
BANKS.In International Conference on Data
Engineering (ICDE), pages 10831096, 2002.
Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan et. al.Bidirectional
Expansion for Keyword Search on Graph
Databases.In VLDB Conference, pages 505516,
2005.
Sanjay Agrawal, Surajit Chaudhari, and Gautam
Das.DBXplorer A System for Keyword-Based Search
over Relational Databases.In International
Conference on Data Engineering (ICDE), pages
522, 2002.
Andrey Balmin, Vagelis Hristidis, and Yannis
Papakonstantinou.ObjectRank Authority-Based
Keyword Search in Databases.In VLDB Conference,
pages 564575, 2004.