Title: Keyword Search in RDB
 1Keyword Search in RDB
- Jeffrey Xu Yu 
- Chinese Univeristy of Hong Kong 
- yu_at_se.cuhk.edu.hk
Acknowledgement many slides used in this talk 
are originally taken from the authors 
presentations given in the conferences. 
 2Traditional Data Access Methods (SIGMOD09 
Tutorial)
- Databases / XML data 
- Structured, with rich meta-data 
- Accessed by query languages 
- High search quality 
- Small user population that masters DB 
-  Text documents 
- Unstructured 
- Accessed by keywords 
- Limited search quality 
- Large user population 
2
SIGMOD09 Tutorial
2010-1-2 
 3The Challenges of Accessing Structured Data
- Query languages long learning curves 
- Schemas Complex, evolving, or even unavailable. 
- What about filling in query forms? 
- Limited access pattern. 
- Hard to design and maintain forms on dynamic and 
 heterogeneous data!
select paper.title from conference c, paper p, 
author a1, author a2, write w1, write w2 
 where c.cid  p.cid AND p.pid  w1.pid 
AND p.pid  w2.pid AND w1.aid  a1.aid AND w2.aid 
 a2.aid AND a1.name  John AND a2.name  
Mary AND c.name  SIGMOD
The usability of DB is severely limited unless 
easier ways to access databases are developed 
Jagadish, SIGMOD 07.
3
SIGMOD09 Tutorial
2010-1-2 
 4Supporting Keyword Search on DB  Advantages /1
- Easy to use 
- The most important factor for the majority of 
 users.
- The same advantage of keyword search on text 
 documents
4
SIGMOD09 Tutorial
2010-1-2 
 5Supporting Keyword Search on DB  Advantages /2
- Enabling interesting or unexpected discoveries 
- Relevant data pieces that are scattered but are 
 collectively relevant to the query should be
 automatically assembled in the results
- Larger scope for data inter-connection 
Seltzer, Berkeley 
Is Seltzer a student at UC Berkeley?
Seltzer is a developer of Berkeley DB.
Wow.
5
SIGMOD09 Tutorial
2010-1-2 
 6Supporting Keyword Search on DB  Advantages /3
- Returning meaningful results by exploiting 
 structural information.
- An unique opportunity in structured data 
Query Bernstein, skyline
Structured Document
Such a result will have a low rank.
Text Document
scientist
scientist
Bernstein is a computer scientist.......... One 
of Bernsteins colleagues, Duane, recently 
published a paper about skyline query processing.
publications
name
publications
name
paper
Bernstein
paper
Duane
title
title
skyline
model management
6
SIGMOD09 Tutorial
2010-1-2 
 7Supporting Keyword Search on DB  Summary of 
Advantages
- Increasing the DB usability 
- Increasing the coverage and quality of keyword 
 search
7
SIGMOD09 Tutorial
2010-1-2 
 8Supporting Keyword Search on DB  Challenges /1
-  Semantics keyword queries are ambiguous 
- How to infer the query semantics and find 
 relevant answers?
- How to effectively rank the results in the order 
 of their relevance?
- How to help users analyze results? 
- How to evaluate the quality of search results?
8
SIGMOD09 Tutorial
2010-1-2 
 9Supporting Keyword Search on DB  Challenges /2
- Efficiency 
- Many problems in keyword search on DB are shown 
 to be NP-hard.
- Generating results, query segmentation, snippet 
 generation, etc.,
- Large datasets 
- How to generate (top-k) query results efficiently?
9
SIGMOD09 Tutorial
2010-1-2 
 10Keyword Search on DB State-of-the Art
- Keyword search on DB has become a hot research 
 direction, and attracted researchers in DB, IR,
 theory, etc
- More than 50 research papers, from both research 
 labs and universities in major database
 conferences/journals
- Workshop about keyword search on DB (KEYS, June 
 28, 09)
and counting...
10
SIGMOD09 Tutorial
2010-1-2 
 11Keyword Search in RDBs
- Report on the DB/IR Panel at SIGMOD 2005 by Sihem 
 Amer-Yahia and Pat Case (SIGMOD Record)
- Internet search engines have popularized 
 keyword-based search.
- DBMSs do not support IR-style keyword-based 
 search.
- An Example 
- Keywords Programming by Ritchie 
- For result, rows need to be generated by joining 
 tables on the fly (all possible combinations)
Authors
AuthorsBooks
Books
BookStores
Store 
 12DBXplorer S. Agrawal et al. ICDE02 
- Given a set of query keywords, DBXplorer returns 
 all rows (either from single tables, or by
 joining tables connected by foreign-key joins)
 such that the each row contains all keywords.
- IR techniques use Inverted Lists  Symbol Table 
 in databases
13Symbol Table Design
- Symbol Table (S) - stores the information about 
 keywords at different granularities (column/row),
 i.e. for each keyword it stores the list of all
 rows
- Column Level granularity (Pub-Col) 
- For every keyword S it maintains a list of all 
 database columns (i.e. table.column)
- Cell level granularity (Pub-cell) 
- For every keyword S it maintains a list of all 
 database cells (i.e. table.column.rowid)
14Storing Symbol Table
- Store symbol tables (pub-col) in database as 
 (keyword hash, column Id)
- FK Compression (Foreign Key) 
- If there is foreign key relationship between c1 
 and c2, store only c1
- CP Compression 
- Partition H into a minimum number of bipartite 
 cliques (a bipartite clique is any subgraph of H
 with a maximal number of edges).
- Compress each clique. 
- Stores symbol table (pub-cell) in database as 
 (keyword hash, list of all cellids)
v2 v3 v4
c1 c2
x
Uncompressed hash table
ColumnsMap table
Compressed hash table 
 15Search - Enumerating Join Trees
- Step1 - Looks up symbol table to find tables / 
 columns which contain keywords
- Step2 - Enumerate join trees 
- Identify and enumerate all potential subsets of 
 tables in the database that, if joined, might
 contain rows having all keywords.
- The resulting relation will contain all potential 
 rows having all keywords specified in the query.
keywords
-  If it views the schema graph G as an undirected 
 graph, this step enumerates join trees, i.e.,
 sub-trees of G such that
- the leaves belong to MatchedTables 
- together, the leaves contain all keywords of the 
 query
Join Trees 
 16Search  Identify matching rows
- The input to this final search step is the 
 enumerated join trees.
- Each join tree is then mapped to a single SQL 
 statement that joins the tables as specified in
 the tree, and selects those rows that contain all
 keywords.
- The retrieved rows are ranked before being 
 output.
- Rows ranked by number of joins involved (ties 
 broken arbitrarily) (same as keywords occurring
 close to one another in documents are ranked
 higher)
Join Trees 
 17Generalized Matches  Token Matches
- Token matches - the keyword in the query matches 
 only a token or sub-string of an attribute value
 (e.g., retrieve rows of address by specifying
 only a street name).
- Pub-Prefix method 
- B tree indexes can be used to retrieve rows 
 whose cell matches a given prefix string
- This clause is of the form 
-  WHERE T.C LIKE PK 
- During publishing of a database, for every 
 keyword K, the entry (hash(K), T.C, P) is kept in
 the symbol table if there exists a string in
 column T.C which
- contains a token K, and 
- has prefix P
18Generalized Matches - Token Matches
Let the hash values of the searchable tokens 
i.e., string, ball and round be 1, 2 and 3 
respectively
Pub-Prefix table
Database table T
Consider searching keyword string Pub-Prefix 
table returns prefixes th and no and 
subsequent SQL will contain (T.C LIKE 
nostring) OR (T.C LIKE thstring) 
 19DISCOVER Hristidis et al. VLDB02
- Result is tree T of tuples where 
- each edge corresponds to a primary-foreign key 
 relationship
- every keyword contained in a tuple of T (total) 
- no tuple of T is redundant (minimal) 
20Example - Data 
 21Example  Keyword Query
Query Smith, Miller 
 22Example  Keyword Query
Results
Query Smith, Miller 
 23Example  Keyword Query
Results
Query Smith, Miller 
 24Architecture
User 
 25Architecture 
 26Candidate Network - Example
ORDERS Smith
n1 
n1 
n1 
CUSTOMER 
NATION 
ORDERS 
n1 
ORDERS Miller 
 27Candidate Network - Example
ORDERS Smith
n1 
n1 
n1 
CUSTOMER 
NATION 
ORDERS 
n1 
ORDERS Miller
CN1 OSmith ? C ? OMiller size2 
 28Candidate Network - Example
ORDERS Smith
n1 
n1 
n1 
CUSTOMER 
NATION 
ORDERS 
n1 
ORDERS Miller
CN1 OSmith ? C ? OMiller size2
CN2 OSmith ? C ? N ? C ? OMiller size4 
 29Candidate Network - Example
ORDERS Smith
n1 
n1 
n1 
CUSTOMER 
NATION 
ORDERS 
n1 
ORDERS Miller
CN3 OSmith ? C ? OMiller ? C size3 
- The CN3 is not minimal, because the rightmost C 
 does not contain a key.
30Candidate Network - Example
ORDERS Smith
n1 
n1 
n1 
CUSTOMER 
NATION 
ORDERS 
n1 
ORDERS Miller
CN4 OSmith ? C ? O ? C ? OMiller size4 
- c1 ? c2 , because primary to foreign key from 
 CUSTOMER to ORDERS (an answer may contain the
 same tuple twice  not a tree)
- Pruning Condition RK?S?RL
31Candidate Networks Generator - Algorithm
- Traverse tuple set graph breadth first 
- Q ? tuple sets containing keyword k1 
- For each network n of tuple sets in Q do 
- If pruning_condition(n) drop n 
- else if is_CN(n) output n 
- else expand n by one tuple set to all possible 
 directions in tuple set graph and insert
 expansions to Q
-  eg if n is OSmith ? C then we add to Q 
-  OSmith ? C ? OMiller, OSmith ? C ? O, OSmith ? 
 C ? N
32Candidate Networks Generator is Complete and 
Non-Redundant
- The set of Candidate Networks generated is 
- Complete All solutions generated by a CN 
- Non-redundant There is database instance, where 
 by removing a CN a solution is lost
33Architecture 
 34Execution Plan
- Each CN corresponds to a SQL statement 
- CN1 OSmith ? C ? OMiller 
-  CN2 OSmith ? C ? N ? C ? OMiller 
- Execution Plan 
-  CN1 ? OSmith ?? C ?? OMiller 
-  CN2 ? OSmith ?? C ?? N ?? C ?? OMiller
35Reuse Common Subexpressions - Example
- Execution Plan 
-  CN1 ? OSmith ?? C ?? OMiller 
-  CN2 ? OSmith ?? C ?? N ?? C ?? OMiller 
- Optimized Execution Plan 
-  Temp ? OSmith ?? C 
-  CN1 ? Temp ?? OMiller 
-  CN2 ? Temp ?? N ?? C ?? OMiller
36KS in RDB Streams Markowetz et al. SIGMOD07
Query Tarantino, Travolta 
 37Data Graph G
- Nodes  Tuples 
- Edges  can be joined
38MTJNT
- Sub-graph of G 
- Contain all keywords 
- Minimal 
- Answer R-KWS query 
- Limited to Tmax nodes 
- Longer joins  irrelevant results
39Candidate Networks (CN)
  40Operator Trees for Candidate Networks
CN
- Leaves  Selections 
- Inner nodes  Joins
OP-Tree
Output MTJNT 
 41Instantaneous Data Graph 
 42Operator Mesh 
 43Demand Driven Operator Execution (I)
- d at least one parent is running 
- r right input is not empty
44Demand Driven Operator Execution (II) 
 45BANKS-I/II (ICDE02,VLDB05)
- Keyword Search on (Directed) Graphs
paper
Multi-Query Optimization
E.g., Sudarshan Roy
writes
writes
author
author
Prasan Roy
Sudarshan 
 46Ranking
- Edge Score  EA 
- Smaller tree gt higher score 
- EA  1/ (S edge weights) 
- Node Score  NA 
- Measure of authority of nodes in tree 
- NA  S (leaf and root node authorities) 
- Overall score  f (EA, NA) 
- f (EA, NA)  EA . NAl
47Finding Answer Trees
- Intuition travel backwards from keyword nodes 
 till you hit a common node
Query sudarshan roy
MultiQuery Optimization
paper
writes
Sudarshan
Prasan Roy
authors 
 48Backward Search Algorithm
- Run concurrent single source shortest path 
 iterators from each node matching a keyword
- Traverse the graph edges in reverse direction 
- Output next nearest node on each get-next() call 
- Do best-first search across iterators 
- Output node if in the intersection of sets of 
 nodes reached from each keyword
49Backward Search Limitations
- Wasteful exploration of graph 
- Frequently occurring keywords 
- Hub nodes in the graph (high in-degree) 
Shashank Sudarshan Database
Schema Legend
Database
author
writes
paper
Shashank
Sudarshan 
 50Bidirectional Search Motivation 
 51Bidir Search Intuition
- First cut solution 
- Dont go backward if a keyword matches many nodes 
- Dont go backward if a node points to a hub 
- Instead explore forward from other keywords
52Bidir Search Example
Shashank Sudarshan Database
Database
Schema Legend
author
writes
Shashank
Sudarshan
paper 
 53Bidir Search Issues
- What should threshold for not expanding be? 
- The proposed solution prioritize expansion of 
 nodes based on spreading activation
- to penalize frequent keywords and bushy trees 
- How to manage exploration in both directions?
54Bidir Search Spreading Activation
- Spreading Activation 
- Node with highest activation explored first 
- Every node given an initial activation 
- Gives low activation to frequently occurring 
 keywords
1/5
1/5
1/5
1/5
1/5
John 
 55Bidir Search Spreading Activation
- Spreading Activation 
- Node with highest activation explored first 
- Activation spread to neighbors (µ  0.3) 
- Gives low activation to neighbors of hubs
0.7 x 1/5 x 1/4
0
1
1/5
1
0.7 x 1/5 x 1/4
0
1
0
0.7 x 1/5 x 1/4
0.3 x 1/5
1
0.7 x 1/5 x 1/4
0 
 56Bidir Search Iterators
- How to manage exploration in both directions? 
-  
- Single backward iterator  single forward 
 iterator w/ suitable datastructures
- E.g., to keep track of parents of nodes
Dist from A, Dist from B
7
6
8,8
2,3 8
8,8 2
2,8
8,1
8,1
1,8
3
4
5
0,8
8,0
2
1
A
B 
 57Bidir Search top-k results
- Results need not be generated in-order 
- Naïve solution 
- Store results in an intermediate heap 
- Output top k results after mk total results have 
 been generated (m  10)
58Minimum Group Steiner Tree 
 59An Example 
 60An Example 
 61An Example 
 62Dynamic Programming Ding et al. ICDE07
  63Dynamic Programming Equation 
 64Dynamic Programming Equation 
 65The Order to Compute T(v,p) 
 66BLINKS He et al. SIGMOD07
- Score definition 
- For an answer T ?r,(n1,,nm)? to a query q  
 (w1, , wm), the score is defined as S(T)  f(
 Sr(r)  ? Sn(ni, wi)  ? Sp(r, ni) )
- Considers both content and graph structure 
- Match-distributive property 
- Contribution of matches and root-match paths can 
 be computed in a distributive manner by summing
 over all matches
- Allow pre-computation of best path, independently 
 for each node/keyword
- Graph-distance property 
- The contribution of a root-match path, Sp(r,ni), 
 is defined to be shortest-path distance from r to
 ni
- To simplify presentation, we focus on the path 
 contribution ?Sp(r,ni)
paths from root to matches 
 67Graph Search Strategies
- Backward search Bhalotia et al., ICDE02 
- Starting from keyword nodes (containing at least 
 one query keyword)
- In each search step, choose an incoming edge to a 
 previously visited node and follow the edge
 backward to visit its source node
- Discover an answer root r if r is visited from 
 every keyword
- Bidirectional search Kacholia et al., VLDB05 
- Explore the graph by following forward edges as 
 well
- Choose which node to visit by heuristic 
 activation factors
w1
Conceptually, expand clusters of visited nodes 
for each keyword
w2
Graph 
 68Graph Search Strategies (contd)
- Each search step needs to decide 
- Which node to expand within a cluster 
- Which keyword cluster to expand 
- New approach 
- Equi-distance expansion in each cluster 
- Cost-balanced expansion across clusters balance 
 of nodes expanded across clusters
- Cost is at most m times that of an oracle 
 backward search algorithm (m   of query
 keywords)
? Equi-distance expansion node closest to 
cluster origin in graph distance
Optimal
No Guarantee
? Distance-balanced expansion balance diameter 
across all clusters
Assume 3 keywords w1 w2  w3 
Optimal
m-optimal 
 69Using a Single-Level Index
- What is inefficient with search without index? 
- Needs to maintain, for each keyword, a priority 
 queue storing nodes in current expansion
 frontier ? High space/time complexity
- Existing forward expansion is largely guesswork 
- New ideas 
- (I) For each keyword, index nodes in the order of 
 visiting them in search Keyword-node lists
- For each keyword w, a list LKN(w) contains nodes 
 that can reach w, ordered by their shortest
 distances to w
- (II) Index shortest distances from nodes to 
 keywords, enabling forward jumps Node-keyword
 map
- Given node u and keyword w, a hash map MNK (u,w) 
 returns the shortest distance from u to w in
 O(1) time
v1
v2
v3
v4
v5
v6
v7
LKN(w1)
0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
MNK(v2,w1)
0, v2, v2
MNK(v2,w2)
1, v4, v4 
 70Search with Single-Level Index
- Search algorithm using the single-level index, 
 applying the search strategies
- Equi-distance expansion ? Use one cursor to 
 traverse each LKN(wi)
- Cost-balanced expansion ? Pick the cursor to 
 expand next in a round-robin manner
- Forward expansion ? When visiting a node, look up 
 its distances to other keywords by MNK
-  
- Efficiency 
- Managing exploration states by m cursors instead 
 of m priority queues
Keyword-node lists
v1
LKN(w1)
0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
v2
1
v3
LKN(w2)
0, v4, v4, v4
0,v11,v11,v11
1, v2, v4, v4
v4
v5
v8
3
v6
v9
Node-keyword map
v7
MNK
v10
Partial Answers
v12
v11
ltv2,(0, ?)gt
ltv4,(?, 0)gt
Answers
ltv2,(0, 1)gt
ltv6,(1, 2)gt 
 71Bi-Level Indexing in BLINKS
- Single-level index is impractical for large 
 graphs
- Space complexity O(VK) where K is the number 
 of keywords
- BLINKS Bi-Level Index for Keyword Search 
- Partition a data graph into multiple, say B, 
 subgraphs, or blocks
- Partitioning by nodes, called portals, which will 
 play key roles in search
- There are many partitioning algorithms, such as 
 Breadth-first and METIS
- (Top-level) block index map keywords and portals 
 to blocks
- Purpose Initiate backward expansion in relevant 
 blocks guide backward expansion across blocks
 (through portals)
- (Low-level) intra-block index store similar 
 information as in a single-level index, but
 restricted to within each block
- Purpose Help backward expansion and forward 
 jumps within blocks
72Search with the Bi-Level Index
- Similar to searching with single-level index in 
- Overall expansion policies (which keyword 
 cluster/node to explore next)
- Index access (scanning LKN lists and looking-up 
 MNK hash map)
- New challenges/complications introduced by graph 
 partitioning
- A single cursor for a keyword is no longer 
 sufficient
- Need simultaneously backward expansion in 
 multiple blocks that contains the keyword
- So it maintains a queue of cursors, one for each 
 block we are currently exploring
- Backward expansion needs to continue across block 
 boundaries
- When encountering boundaries, it retrieves new 
 blocks to visit from the block index and add them
 to the queue
- Distance information in the intra-block index ? 
 global shortest distance
- The path with the shortest distance may happen to 
 go across blocks
- Our exploration order guarantees correct global 
 shortest distance
73References (Keywords)
- Roy Goldman, Narayanan Shivakumar, Suresh 
 Venkatasubramanian, Hector Garcia-Molina
 Proximity Search in Databases, VLDB99, 1999.
- Bettina Berendt, myra Spiliopoulou Analysis of 
 navigation behaviour in web sites integrating
 multiple information systems, VLDBJ, 2000.
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, 
 Soumen Chakrabarti, S. Sudarshan Keyword
 Searching and Browsing in Database using BANKS,
 ICDE02, 2002.
- Sanjay Agrawal, Surajit Chaudhuri, Gautam Das 
 DBXplorer A System for Keyword-Based Search over
 Relational Databases, ICDE02, 2002.
- Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin 
 Keyword Search on Structured and Semistructured
 Data, SIGMOD09, 2009.
- Vagelis Hristidis, Yannis Papakonstantinou 
 DISCOVER Keyword Search in Relational Databases,
 VLDB02, 2002.
- Vagelis Hristidis, Yannis Papakonstantinou, 
 Andrey Balmin Keyword Proximity Search on XML
 Graps, ICDE03, 2003.
- Vagelis Hristidis, Luis Gravano, Yannis 
 Papakonstantinou Efficient IR-Style Keyword
 Search over Relational Databases, VLDB03, 2003.
74References (Keywords)
- Andrey Balmin, Vagelis Hristidis, Yannis, 
 Papakonstantinou ObjectRank Authority-Based
 Keyword Search in Databases, VLDB04, 2004.
- Sara Cohen, Yaron Kanza, Benny Kimelfeld 
 Interconnection Semantics for Keyword Search in
 XML, CIKM05, 2005.
- Varun Kacholia, Shashank Pandit, Soumen 
 Chakrabarti, S. Sudarshan Bidirectional
 Expansion for Keyword Search on Graph Databases,
 VLDB05, 2005.
- Benny Kimelfeld, Yehoshua Sagiv Efficient 
 Engines for Keyword Proximity Search, WebDB05,
 2005.
- Benny Kimelfeld, Yehoshua Sagiv Efficiently 
 Enumerating Results of Keyword Search, DBLP05,
 2005.
- Fang Liu, Clement Yu, Weiyi Meng, Abdur 
 Chowdhury Effective Keyword Search in Relational
 Databases, SIMOD06, 2006.
- Benny Kimelfeld, Yehoshua Sagiv Finding and 
 Approximating Top-k Answersin Keyword Proximity
 search, PODS06, 2006.
75References (Keywords)
- Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, 
 Xiao Zhang, Xuemin Lin Finding Top-k Min-Cost
 Connected Trees in Databases, ICDE07, 2007.
- Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou 
 SPARK Top-k Keyword Query in Relational
 Databases, SIGMOD07, 2007.
- Hao He, Haixun Wang, Jun Yang, Philip S. Yu 
 BLINKS Ranked Keyword Searches on Graphs,
 SIGMOD07, 2007.
- Alexander Markowetz, Ying Yang, Dimitris 
 Papadias Keyword Search on Relational Data
 Streams, SIGMOD07, 2007.
- Konstantin Golenberg, Benny Kimelfeld, Yehoshua 
 Sagiv Keyword Proximity Search in Complex Data
 Graphs, SIGMOD08, 2008.
- Guoliang Li, BengChin Ooi, Jianhua Feng, Jianyong 
 Wang, Lizhu Zhou EASE An Effective 3-in-1
 Keyword Search Method for Unstructured,
 Semi-structured and Structured Data, SIGMOD08,
 2008.
- Guang Hleu Vu, Beng Chin Ooi, Dimitris Papadias, 
 Anthony K. H. Tung A Graph Method for
 Keyword-based Selection of the top-K
 Databases,SIGMOD08, 2008.