Title: Lecture 17: Keyword Search on Relational Databases
1Lecture 17 Keyword Search on Relational
Databases
Oct. 31, 2007 ChengXiang Zhai
Most slides are from Esha Palta and Kumar Gaurav
Bijays presentation
2Motivation
- Keyword search
-
- We have SQL, why keyword-querying?
- SQL - not appropriate for naive users
- So many online databases (imdb, citeseer,
bseindia ) user cannot keep track of schema
for all of these
3Simple Approaches
- Using Form interfaces
- Require separate form for each type of query
confusing - Not suitable for ad-hoc queries how many forms
will you provide? - How about Google?
- Export data from db to documents and do keyword-
querying on these - Suffers from duplication overheads
- Google wants all keywords in one document. DB is
often normalized, so need to join tables and
store as documents - Multiple combinations of tables to join. Not
scalable
4Differences from Web Search
- Related data split across multiple tuples due to
normalization - Different keywords may match tuples from
different relations - What joins are to be computed can only be decided
on the fly - Need to find result containing all keywords and
rank them somehow
Writes (AuthorId, PaperId)
Paper (PaperId, PaperName)
Author (AuthorId, AuthorName)
Cites (Citing, Cited)
The DBLP Bibliography Schema
5Systems for DB search
- BANKS (Browsing and Keyword Search)
- IITB (ICDE 02)
- DBXplorer Microsoft Research (ICDE 02)
- ObjectRank IBM, UCSD, FIU (VLDB 04)
- Bidirectional BANKS IITB (VLDB 05)
-
6Systems for DB search
- BANKS (Browsing and Keyword Search)
- IITB (ICDE 02)
- DBXplorer Microsoft Research (ICDE 02)
- ObjectRank IBM, UCSD, FIU (VLDB 04)
- Bidirectional BANKS IITB (VLDB 05)
-
7 BANKS (ICDE 02)
8The BANKS system
- BANKS Architecture
- Available on the web
- http//www.cse.iitb.ac.in/banks
- Connects to database using JDBC
- JDBC metadata features used to provide schema
browsing - Preprocesses db
User
BANKS
HTTP
JDBC
Web-server
Database
9Basic Model
- Database modeled as a graph
- Nodes tuples
- Edges references between tuples
- foreign key (assume for this talk), inclusion
dependencies, .. - Edges are directed.
PaperIdPaperName
AuthorIDPaperId
AuthorId
DBLP example
10The BANKS Answer Model
- Query set of search terms t1, t2, .., tn
- For each search term ti we find set of nodes Si
matching ti - Eg Query Sudarshan Roy (t1 Sudarshan, t2
Roy) - Answer rooted, directed tree connecting nodes
matching keywords - Root node has special significance, may be
restricted to some relations - E.g. relations representing entities, not
relationships - May include intermediate nodes not in any Si
(Steiner Tree) - Multiple answers
- Ranking based on proximity prestige
11Answer Example
Query sudarshan roy
Paper
Writes
Writes
Author
Author
- We would like to find sets of (closely) connected
tuples that match all given keywords
12Edge Directionality
- Directed tree will miss desired answers. For eg
- Query DBXplorer ObjectRank
- So, for each forward edge, BANKS adds a back edge
CitedBy
Cited
Cites
BANKS
DBXPlorer
ObjectRank
Cites
CitedBy
Cited
DBXPlorer
BANKS
Cites
ObjectRank
Cites
13Edge Directionality
- What if we ignore directionality?
- Some popular tuples are connected to many other
tuples - E.g. Students -gt departments -gt university
- Problem A popular tuple would create misleading
shortcuts between tuples - E.g. every student would be closely linked with
every other student via the department/university - Solution define different forward and backward
edge weights - Forward edges In the direction of the foreign
key reference
14Edge Weight
- Weight of forward edge based on schema
- e.g. citation link weights gt writes link
weights - Weight of backward edge indegree of edges
pointing to the node
15Edge Weight Scaling
- Normalize edge score Escore(e)
- Make edge weight scale-free by dividing edge
weigth by wmin - Problem Some backward edges have unduly large
weights - Depress the scale by defining Escore(e) as
log(1w(e)/wmin ) - Overall Escore E 1 / (1 ?e Escore(e))
16Node Weight
- Set weight of a Node Indegree of the node
- As per prestige rankings nodes with multiple
pointers to them get a higher prestige - So, higher node weight corresponds to higher
prestige - Problem Nodes with many in-edges result in
skewed answers - Subdue extreme node weights by using
log(1indegree) - Node score Nscore Average of node scores
(root-node-weight ? leaf-node-weights)
17Combining Scores
- Combining two independent metrics node weight
and edge weight - Normalize each to 0-1
- Combine using weighting factor ?
- Additive (1- ?) Escore ? Nscore
- Multiplicative Escore Nscore?
- Performance study to compare alternatives and to
find reasonable values for ?
18First Step Symbol Table
- The first step is to build a symbol table
- This table is in the db and is not normalized
- Example
19Searching for Best Answers
- Backward Expanding Search Algorithm
- Assume graph fits in memory
- Idea find vertices from which a forward path
exists to at least one node from each Si. - Run concurrent single source shortest path
algorithm from each node matching a keyword - Create an iterator for each node matching a
keyword - Traverse the graph edges in reverse direction
- Output a node whenever it is on the intersection
of the sets of nodes reached from each keyword - Answer trees may not be generated in relevance
order
20Backward Expanding Search
Iterators
21BANKS Query Result Example
22Result Ordering
- Answers need not be always in Relevance order
This tree is output
Better Root Missed
2
2
2
5
2
1
23Result Ordering (contd)
- Solution
- Generate all connection trees and then sort them
- Increases computation costs and leads to a
greatly increased time to generate initial
results - Create a small heap ordered on the relevance of
the trees - Output highest ranked tree from heap to user when
heap is full - What about duplicate results?
- Maintain a list of generated results for
duplicate detection - Discard result according to relevance
24Experience and Performance
- BANKS provides keyword search coupled with
extensive browsing facilities - Schema browsing data browsing
- Graphical display of data
- Implemented using Java servlets
- Keyword search response times typically 1 to 3
seconds on - DBLP database with 100,000 tuples/300,000 edges
- P3 600 MHz, 512 MB RAM
25Anecdotes
- Mohan
- Returns C. Mohan at top based on prestige (number
of papers written) - Transaction
- Returns Jim Grays classic paper and textbook as
top answers based on prestige (number of
citations) - Sunita Seltzer
- No common papers, but both have papers with
Stonebraker system finds this connection
26Effect of Parameters
- Log scaling of edge weights worked well
- (1- ?) E ? N versus E N???????made little
difference - Best with ? .2 (subdue node weights but not
entirely)
27 BANKS (VLDB 05)
28Motivation
- BANKS performs poorly if
- Keyword matches lot of nodes (so lot of Dijkstra
sources) - Search hits a node with large fan in.
Wastes time
Sudarshan
Roy
29New Ideas Forward Search
- Why only backward, lets search forward too
How about fwd Searching ?
Sudarshan
Roy
30New Ideas - Activation
- Activation - Cannot forward search from each
node. - Spread activation from keyword nodes to others.
- Activation is like Page Rank with decay.
- High Activation ? close to many keywords.
31Activation Spreading
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ 0.3)
- Gives low activation to neighbors of hubs
32Modifications to Model
- Graph model stays the same.
- BANKS is concerned with search more than how to
tune parameters or define node weights / edge
weights. - BANKS code
- Tree Node Score, N
- Tree Edge Score, E
- Total Score ENl (l 0.2)
33The New Algorithm
- Need two priority queues
- Qin - do backward search from these nodes
- Qout - do forward search from these nodes
- Each node, n keeps 3 variables per keyword, ti
- sp i Node to got to from n for shortest-path
to ti - distance i Length of the shortest-path from
n to ti - Activation i Activation to n from keyword ti
34The New Algorithm continued
- Set initial activation keyword nodes and add to
Qin for backward-search. - At each step, pick node with maximum activation
- i.e. if (Qin.getMaxActivation gt Qout.
getMaxActivation)) - // use node from Qin
- else
- // use node from Qout
- If node from Qin, do backward search and add
itself to Qout. (newly explored nodes into Qin) - If node from Qout, do forward search
- If node has reached from all keyword, generate
result-tree. answer is buffered as results can
be out of order
35Explanation with example
Qin
Qout
N100
N4
Roy Sudarshan
N1
N3
N2
Roy
Sudarshan
36Explanation with example
Qin
Qout
N100
Roy Sudarshan
N4
N1 N2
N1
N3
N2
Roy
Sudarshan
37Explanation with example
Qin
Qout
N100
N1 Roy Sudarshan
N2 N3 N100
N4
N1
N3
N2
Roy
Sudarshan
Result Found !
38Generation of top-k results
- If we know the score of next-best answer, all
buffered answers with better score can be output. - Need upper bounds
39Computation of upper bound
- For each keyword ti, we have explored nodes upto
some length say li. - So, next best score (approx.)
- This is not a true upper bound, but works quite
well and is simple !
40Are we losing answers ?
- BANKS I used many Dijkstra states, BANKS II
uses 2 only forward and backward search-states. - The result is that we can now lose answers !
41Answer Loss Example
Ny
K1
K2
Nx
K1
This is the generated answer.
This answer is lost.
42- But, we will generate this tree rooted at Nx
- So, a rotated tree with same nodes but different
root is often generated !
NY
K2
NX
K1
43Metrics of Performance
- Manually obtain best relevant answers.
- Determine 2 times
- Time taken to produce last relevant answer.
- Time taken to output last relevant answer.
- Search algorithms
- MI-Bkwd original backward search
- Iterator for every node matching a keyword
- SI-Bkwd backward search with single backward
iterator - Bidirec bidirectional search
- Datasets
- DBLP, IMDB 2 million nodes, 9 million edges
- US Patent DB 4 million nodes, 15 million edges
44Graph - I
- SI-Bkwd gain increases with origin size,
keywords
45Graph - II
- Bidirec gain increases with origin size,
keywords
46A Critique
- BANKS needs a lot of memory.
- Need to cluster and keep parts of graph on disk.
-
47 DBXplorer (ICDE 02)
48DBXplorer (Microsoft Research)
- Use symbol table to determine which tables to
join. - Generate all possible table join combinations
-
- Figure
T1, T2, T3, T4 and T5 are tables
49Cool ideas in DBXplorer
- Symbol table need not be at tuple level. If
column has an index, column level symbol table
is ok. - Table Compression
- e.g. Keywords Columns Keywords Columns
K1
K1
C1
C1
K2
K2
X
K3
C2
C2
K3
K4
K4
K5
K5
Intermediate Column
50 ObjectRank (VLDB 04)
51ObjectRank (IBM, FIU, UCSD)
- Creates objects in database. Object definition is
manual. - e.g. in DBLP, author, conference and paper can
be defined as objects. - Heavily inspired by PageRank.
- Each node is given global ObjectRank just like
PageRank of Google.
52ObjectRank Ideas
- Keyword-level ObjectRank for each keyword,
precompute and save object ranks of nodes can
optimize by defining cut-off) - Score of node, n w.r.t. keyword k
- scorek(n) f (Global-object-rank (n),
Objectrankk (n)) - At run time, scores are combined
- scorek1,k2,,km(n) scorek1(n) scorek2(n)
scorekm(n)
53ObjectRank Algorithm and answers
- If graph is DAG or near DAG, topologically sort
and spread ObjectRank in this order. - Answers are single objects and not Cluster /
group as in BANKS. - Demo at
- http//teriyaki.ucsd.edu9099/objrank/main05_ne
w.html
54References
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, and S. Sudarshan.Keyword
Searching and Browsing in Databases using
BANKS.In International Conference on Data
Engineering (ICDE), pages 10831096, 2002. - Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan et. al.Bidirectional
Expansion for Keyword Search on Graph
Databases.In VLDB Conference, pages 505516,
2005. - Sanjay Agrawal, Surajit Chaudhari, and Gautam
Das.DBXplorer A System for Keyword-Based Search
over Relational Databases.In International
Conference on Data Engineering (ICDE), pages
522, 2002. - Andrey Balmin, Vagelis Hristidis, and Yannis
Papakonstantinou.ObjectRank Authority-Based
Keyword Search in Databases.In VLDB Conference,
pages 564575, 2004.
55What You Should Know
- Why answering keyword queries on relational
databases is useful - Basic ideas of BANKS