Title: Keyword Searching in Relational Databases
1- Keyword Searching in Relational Databases
- Esha Palta (05329017)
- Kumar Gaurav Bijay (02005013)
2Dilbert Strip ?
3Motivation
- Keyword search
-
- We have SQL, why keyword-querying?
- SQL - not appropriate for naive users
- So many online databases (imdb, citeseer,
bseindia ) user cannot keep track of schema
for all of these
4Simple Approaches
- Using Form interfaces
- Require separate form for each type of query
confusing - Not suitable for ad-hoc queries how many forms
will you provide? - How about Google?
- Export data from db to documents and do keyword-
querying on these - Suffers from duplication overheads
- Google wants all keywords in one document. DB is
often normalized, so need to join tables and
store as documents - Multiple combinations of tables to join. Not
scalable
5Differences from Web Search
- Related data split across multiple tuples due to
normalization - Different keywords may match tuples from
different relations - What joins are to be computed can only be decided
on the fly - Need to find result containing all keywords and
rank them somehow
Writes (AuthorId, PaperId)
Paper (PaperId, PaperName)
Author (AuthorId, AuthorName)
Cites (Citing, Cited)
The DBLP Bibliography Schema
6Systems for DB search
- BANKS (Browsing and Keyword Search)
- IITB (ICDE 02)
- DBXplorer Microsoft Research (ICDE 02)
- ObjectRank IBM, UCSD, FIU (VLDB 04)
- Bidirectional BANKS IITB (VLDB 05)
-
7Systems for DB search
- BANKS (Browsing and Keyword Search)
- IITB (ICDE 02)
- DBXplorer Microsoft Research (ICDE 02)
- ObjectRank IBM, UCSD, FIU (VLDB 04)
- Bidirectional BANKS IITB (VLDB 05)
- will cover in depth
8 BANKS (ICDE 02)
9The BANKS system
- BANKS Architecture
- Available on the web
- http//www.cse.iitb.ac.in/banks
- Connects to database using JDBC
- JDBC metadata features used to provide schema
browsing - Preprocesses db
User
BANKS
HTTP
JDBC
Web-server
Database
10Basic Model
- Database modeled as a graph
- Nodes tuples
- Edges references between tuples
- foreign key (assume for this talk), inclusion
dependencies, .. - Edges are directed.
PaperIdPaperName
AuthorIDPaperId
AuthorId
DBLP example
11The BANKS Answer Model
- Query set of search terms t1, t2, .., tn
- For each search term ti we find set of nodes Si
matching ti - Eg Query Sudarshan Roy (t1 Sudarshan, t2
Roy) - Answer rooted, directed tree connecting nodes
matching keywords - Root node has special significance, may be
restricted to some relations - E.g. relations representing entities, not
relationships - May include intermediate nodes not in any Si
(Steiner Tree) - Multiple answers
- Ranking based on proximity prestige
12Answer Example
Query sudarshan roy
Paper
Writes
Writes
Author
Author
- We would like to find sets of (closely) connected
tuples that match all given keywords
13Edge Directionality
- Directed tree will miss desired answers. For eg
- Query DBXplorer ObjectRank
- So, for each forward edge, BANKS adds a back edge
CitedBy
Cited
Cites
BANKS
DBXPlorer
ObjectRank
Cites
CitedBy
Cited
DBXPlorer
BANKS
Cites
ObjectRank
Cites
14Edge Directionality
- What if we ignore directionality?
- Some popular tuples are connected to many other
tuples - E.g. Students -gt departments -gt university
- Problem A popular tuple would create misleading
shortcuts between tuples - E.g. every student would be closely linked with
every other student via the department/university - Solution define different forward and backward
edge weights - Forward edges In the direction of the foreign
key reference
15Edge Weight
- Weight of forward edge based on schema
- e.g. citation link weights gt writes link
weights - Weight of backward edge indegree of edges
pointing to the node
16Edge Weight Scaling
- Normalize edge score Escore(e)
- Make edge weight scale-free by dividing edge
weigth by wmin - Problem Some backward edges have unduly large
weights - Depress the scale by defining Escore(e) as
log(1w(e)/wmin ) - Overall Escore E 1 / (1 ?e Escore(e))
17Node Weight
- Set weight of a Node Indegree of the node
- As per prestige rankings nodes with multiple
pointers to them get a higher prestige - So, higher node weight corresponds to higher
prestige - Problem Nodes with many in-edges result in
skewed answers - Subdue extreme node weights by using
log(1indegree) - Node score Nscore Average of node scores
(root-node-weight ? leaf-node-weights)
18Combining Scores
- Combining two independent metrics node weight
and edge weight - Normalize each to 0-1
- Combine using weighting factor ?
- Additive (1- ?) Escore ? Nscore
- Multiplicative Escore Nscore?
- Performance study to compare alternatives and to
find reasonable values for ?
19First Step Symbol Table
- The first step is to build a symbol table
- This table is in the db and is not normalized
- Example
20Searching for Best Answers
- Backward Expanding Search Algorithm
- Assume graph fits in memory
- Idea find vertices from which a forward path
exists to at least one node from each Si. - Run concurrent single source shortest path
algorithm from each node matching a keyword - Create an iterator for each node matching a
keyword - Traverse the graph edges in reverse direction
- Output a node whenever it is on the intersection
of the sets of nodes reached from each keyword - Answer trees may not be generated in relevance
order
21Backward Expanding Search
Iterators
22BANKS Query Result Example
23Result Ordering
- Answers need not be always in Relevance order
This tree is output
Better Root Missed
2
2
2
5
2
1
24Result Ordering (contd)
- Solution
- Generate all connection trees and then sort them
- Increases computation costs and leads to a
greatly increased time to generate initial
results - Create a small heap ordered on the relevance of
the trees - Output highest ranked tree from heap to user when
heap is full - What about duplicate results?
- Maintain a list of generated results for
duplicate detection - Discard result according to relevance
25Experience and Performance
- BANKS provides keyword search coupled with
extensive browsing facilities - Schema browsing data browsing
- Graphical display of data
- Implemented using Java servlets
- Keyword search response times typically 1 to 3
seconds on - DBLP database with 100,000 tuples/300,000 edges
- P3 600 MHz, 512 MB RAM
26Anecdotes
- Mohan
- Returns C. Mohan at top based on prestige (number
of papers written) - Transaction
- Returns Jim Grays classic paper and textbook as
top answers based on prestige (number of
citations) - Sunita Seltzer
- No common papers, but both have papers with
Stonebraker system finds this connection
27Effect of Parameters
- Log scaling of edge weights worked well
- (1- ?) E ? N versus E N???????made little
difference - Best with ? .2 (subdue node weights but not
entirely)
28 BANKS (VLDB 05)
29Motivation
- BANKS performs poorly if
- Keyword matches lot of nodes (so lot of Dijkstra
sources) - Search hits a node with large fan in.
Wastes time
Sudarshan
Roy
30New Ideas Forward Search
- Why only backward, lets search forward too
How about fwd Searching ?
Sudarshan
Roy
31New Ideas - Activation
- Activation - Cannot forward search from each
node. - Spread activation from keyword nodes to others.
- Activation is like Page Rank with decay.
- High Activation ? close to many keywords.
32Activation Spreading
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ 0.3)
- Gives low activation to neighbors of hubs
33Modifications to Model
- Graph model stays the same.
- BANKS is concerned with search more than how to
tune parameters or define node weights / edge
weights. - BANKS code
- Tree Node Score, N
- Tree Edge Score, E
-
- Total Score ENl (l 0.2)
34The New Algorithm
- Need two priority queues
- Qin - do backward search from these nodes
- Qout - do forward search from these nodes
- Each node, n keeps 3 variables per keyword, ti
- sp i Node to got to from n for shortest-path
to ti - distance i Length of the shortest-path from
n to ti - Activation i Activation to n from keyword ti
35The New Algorithm continued
- Set initial activation keyword nodes and add to
Qin for backward-search. - At each step, pick node with maximum activation
- i.e. if (Qin.getMaxActivation gt Qout.
getMaxActivation)) - // use node from Qin
- else
- // use node from Qout
- If node from Qin, do backward search and add
itself to Qout. (newly explored nodes into Qin) - If node from Qout, do forward search
- If node has reached from all keyword, generate
result-tree. answer is buffered as results can
be out of order
36Explanation with example
Qin
Qout
N100
N4
Roy Sudarshan
N1
N3
N2
Roy
Sudarshan
37Explanation with example
Qin
Qout
N100
Roy Sudarshan
N4
N1 N2
N1
N3
N2
Roy
Sudarshan
38Explanation with example
Qin
Qout
N100
N1 Roy Sudarshan
N2 N3 N100
N4
N1
N3
N2
Roy
Sudarshan
Result Found !
39Generation of top-k results
- If we know the score of next-best answer, all
buffered answers with better score can be output. - Need upper bounds
40Computation of upper bound
- For each keyword ti, we have explored nodes upto
some length say li. - So, next best score (approx.)
- This is not a true upper bound, but works quite
well and is simple !
41Are we losing answers ?
- BANKS I used many Dijkstra states, BANKS II
uses 2 only forward and backward search-states. - The result is that we can now lose answers !
42Answer Loss Example
Ny
K1
K2
Nx
K1
This is the generated answer.
This answer is lost.
43- But, we will generate this tree rooted at Nx
- So, a rotated tree with same nodes but different
root is often generated !
NY
K2
NX
K1
44Metrics of Performance
- Manually obtain best relevant answers.
- Determine 2 times
- Time taken to produce last relevant answer.
- Time taken to output last relevant answer.
- Search algorithms
- MI-Bkwd original backward search
- Iterator for every node matching a keyword
- SI-Bkwd backward search with single backward
iterator - Bidirec bidirectional search
- Datasets
- DBLP, IMDB 2 million nodes, 9 million edges
- US Patent DB 4 million nodes, 15 million edges
45Graph - I
- SI-Bkwd gain increases with origin size,
keywords
46Graph - II
- Bidirec gain increases with origin size,
keywords
47A Critique
- BANKS needs a lot of memory.
- Need to cluster and keep parts of graph on disk.
- Work is in progress ?
48 DBXplorer (ICDE 02)
49DBXplorer (Microsoft Research)
- Use symbol table to determine which tables to
join. - Generate all possible table join combinations
-
- Figure
T1, T2, T3, T4 and T5 are tables
50Cool ideas in DBXplorer
- Symbol table need not be at tuple level. If
column has an index, column level symbol table
is ok. - Table Compression
- e.g. Keywords Columns Keywords Columns
K1
K1
C1
C1
K2
K2
X
K3
C2
C2
K3
K4
K4
K5
K5
Intermediate Column
51 ObjectRank (VLDB 04)
52ObjectRank (IBM, FIU, UCSD)
- Creates objects in database. Object definition is
manual. - e.g. in DBLP, author, conference and paper can
be defined as objects. - Heavily inspired by PageRank.
- Each node is given global ObjectRank just like
PageRank of Google.
53ObjectRank Ideas
- Keyword-level ObjectRank for each keyword,
precompute and save object ranks of nodes can
optimize by defining cut-off) - Score of node, n w.r.t. keyword k
- scorek(n) f (Global-object-rank (n),
Objectrankk (n)) - At run time, scores are combined
- scorek1,k2,,km(n) scorek1(n) scorek2(n)
scorekm(n)
54ObjectRank Algorithm and answers
- If graph is DAG or near DAG, topologically sort
and spread ObjectRank in this order. - Answers are single objects and not Cluster /
group as in BANKS. - Demo at
- http//teriyaki.ucsd.edu9099/objrank/main05_ne
w.html
55Conclusion
- Studied BANKS, both versions.
- Covered cool ideas from DBXplorer and ObjectRank.
- Graph of BANKS must be made disk-resident.
56References
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, and S. Sudarshan.Keyword
Searching and Browsing in Databases using
BANKS.In International Conference on Data
Engineering (ICDE), pages 10831096, 2002. - Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan et. al.Bidirectional
Expansion for Keyword Search on Graph
Databases.In VLDB Conference, pages 505516,
2005. - Sanjay Agrawal, Surajit Chaudhari, and Gautam
Das.DBXplorer A System for Keyword-Based Search
over Relational Databases.In International
Conference on Data Engineering (ICDE), pages
522, 2002. - Andrey Balmin, Vagelis Hristidis, and Yannis
Papakonstantinou.ObjectRank Authority-Based
Keyword Search in Databases.In VLDB Conference,
pages 564575, 2004.
57 Appendix
58Browsing - May add??????
- Hyperlinks are there for all primary key foreign
key attributes - Each table is displayed with set of tools for
interacting with data - Projection (using drop), Selection, Join,
Group-by, Sort - Template facilities to do a variety of tasks
- Browsing data by grouping and creating crosstabs
- e.g., theses grouped by department and year
- Hierarchical views of data
- Nested XML style, even on relational data
- Graphical displays
- Bar charts, pie charts, etc
- Templates are generic and can be applied on any
data matching assumed schema - Can be applied after applying selections
- New templates can be created by user,
interactively
59Example of Browsing in BANKS
60Related Work
- DataSpot (DTL)/Mercado Intuifind VLDB 98
- Based on patent by Palmon (filed 1995, granted
1998) - Based on hypergraph model, similar answer model
to ours - Differences our model of backward link weights
and prestige - Proximity Search VLDB98
- Different model of proximity based on adding up
support - No edge weights, prestige, different evaluation
algorithm - Information units (linked Web pages) WWW10
- No directionality, only studied in Web context
- Microsoft DBExplorer (this conference)
- No ranking, based on SQL generation
- Addresses efficient construction of text indexes
- Microsoft English query
61Extensions
- Summarization of output
- group the output tuples into sets that have same
tree structure - define the notion of similarity between two
result trees - perform restricted search
- Metadata queries (attributekeyword queries)
- For example authorlevy
- match all the tuples of a relation
- costly
- Forward searching approach
62Proposed Conclusions and Future Work
- BANKS is an integrated browsing and keyword
querying system for relational databases - Future work
- Keyword queries on XML
- Disambiguating queries by selecting
- Nodes G.W.Bush Bush Jr or Bush Sr
- Tree structure coauthors or cites
- Boolean queries
- Metadata queries
- Summarization of output