Title: Keyword Search in RDB
1Keyword Search in RDB
- Jeffrey Xu Yu
- Chinese Univeristy of Hong Kong
- yu_at_se.cuhk.edu.hk
Acknowledgement many slides used in this talk
are originally taken from the authors
presentations given in the conferences.
2Traditional Data Access Methods (SIGMOD09
Tutorial)
- Databases / XML data
- Structured, with rich meta-data
- Accessed by query languages
- High search quality
- Small user population that masters DB
- Text documents
- Unstructured
- Accessed by keywords
- Limited search quality
- Large user population
2
SIGMOD09 Tutorial
2010-1-2
3The Challenges of Accessing Structured Data
- Query languages long learning curves
- Schemas Complex, evolving, or even unavailable.
- What about filling in query forms?
- Limited access pattern.
- Hard to design and maintain forms on dynamic and
heterogeneous data!
select paper.title from conference c, paper p,
author a1, author a2, write w1, write w2
where c.cid p.cid AND p.pid w1.pid
AND p.pid w2.pid AND w1.aid a1.aid AND w2.aid
a2.aid AND a1.name John AND a2.name
Mary AND c.name SIGMOD
The usability of DB is severely limited unless
easier ways to access databases are developed
Jagadish, SIGMOD 07.
3
SIGMOD09 Tutorial
2010-1-2
4Supporting Keyword Search on DB Advantages /1
- Easy to use
- The most important factor for the majority of
users. - The same advantage of keyword search on text
documents
4
SIGMOD09 Tutorial
2010-1-2
5Supporting Keyword Search on DB Advantages /2
- Enabling interesting or unexpected discoveries
- Relevant data pieces that are scattered but are
collectively relevant to the query should be
automatically assembled in the results - Larger scope for data inter-connection
Seltzer, Berkeley
Is Seltzer a student at UC Berkeley?
Seltzer is a developer of Berkeley DB.
Wow.
5
SIGMOD09 Tutorial
2010-1-2
6Supporting Keyword Search on DB Advantages /3
- Returning meaningful results by exploiting
structural information. - An unique opportunity in structured data
Query Bernstein, skyline
Structured Document
Such a result will have a low rank.
Text Document
scientist
scientist
Bernstein is a computer scientist.......... One
of Bernsteins colleagues, Duane, recently
published a paper about skyline query processing.
publications
name
publications
name
paper
Bernstein
paper
Duane
title
title
skyline
model management
6
SIGMOD09 Tutorial
2010-1-2
7Supporting Keyword Search on DB Summary of
Advantages
- Increasing the DB usability
- Increasing the coverage and quality of keyword
search
7
SIGMOD09 Tutorial
2010-1-2
8Supporting Keyword Search on DB Challenges /1
- Semantics keyword queries are ambiguous
- How to infer the query semantics and find
relevant answers? - How to effectively rank the results in the order
of their relevance? - How to help users analyze results?
- How to evaluate the quality of search results?
8
SIGMOD09 Tutorial
2010-1-2
9Supporting Keyword Search on DB Challenges /2
- Efficiency
- Many problems in keyword search on DB are shown
to be NP-hard. - Generating results, query segmentation, snippet
generation, etc., - Large datasets
- How to generate (top-k) query results efficiently?
9
SIGMOD09 Tutorial
2010-1-2
10Keyword Search on DB State-of-the Art
- Keyword search on DB has become a hot research
direction, and attracted researchers in DB, IR,
theory, etc - More than 50 research papers, from both research
labs and universities in major database
conferences/journals - Workshop about keyword search on DB (KEYS, June
28, 09)
and counting...
10
SIGMOD09 Tutorial
2010-1-2
11Keyword Search in RDBs
- Report on the DB/IR Panel at SIGMOD 2005 by Sihem
Amer-Yahia and Pat Case (SIGMOD Record) - Internet search engines have popularized
keyword-based search. - DBMSs do not support IR-style keyword-based
search. - An Example
- Keywords Programming by Ritchie
- For result, rows need to be generated by joining
tables on the fly (all possible combinations)
Authors
AuthorsBooks
Books
BookStores
Store
12DBXplorer S. Agrawal et al. ICDE02
- Given a set of query keywords, DBXplorer returns
all rows (either from single tables, or by
joining tables connected by foreign-key joins)
such that the each row contains all keywords. - IR techniques use Inverted Lists Symbol Table
in databases
13Symbol Table Design
- Symbol Table (S) - stores the information about
keywords at different granularities (column/row),
i.e. for each keyword it stores the list of all
rows - Column Level granularity (Pub-Col)
- For every keyword S it maintains a list of all
database columns (i.e. table.column) - Cell level granularity (Pub-cell)
- For every keyword S it maintains a list of all
database cells (i.e. table.column.rowid)
14Storing Symbol Table
- Store symbol tables (pub-col) in database as
(keyword hash, column Id) - FK Compression (Foreign Key)
- If there is foreign key relationship between c1
and c2, store only c1 - CP Compression
- Partition H into a minimum number of bipartite
cliques (a bipartite clique is any subgraph of H
with a maximal number of edges). - Compress each clique.
- Stores symbol table (pub-cell) in database as
(keyword hash, list of all cellids)
v2 v3 v4
c1 c2
x
Uncompressed hash table
ColumnsMap table
Compressed hash table
15Search - Enumerating Join Trees
- Step1 - Looks up symbol table to find tables /
columns which contain keywords - Step2 - Enumerate join trees
- Identify and enumerate all potential subsets of
tables in the database that, if joined, might
contain rows having all keywords. - The resulting relation will contain all potential
rows having all keywords specified in the query.
keywords
- If it views the schema graph G as an undirected
graph, this step enumerates join trees, i.e.,
sub-trees of G such that - the leaves belong to MatchedTables
- together, the leaves contain all keywords of the
query
Join Trees
16Search Identify matching rows
- The input to this final search step is the
enumerated join trees. - Each join tree is then mapped to a single SQL
statement that joins the tables as specified in
the tree, and selects those rows that contain all
keywords. - The retrieved rows are ranked before being
output. - Rows ranked by number of joins involved (ties
broken arbitrarily) (same as keywords occurring
close to one another in documents are ranked
higher)
Join Trees
17Generalized Matches Token Matches
- Token matches - the keyword in the query matches
only a token or sub-string of an attribute value
(e.g., retrieve rows of address by specifying
only a street name). - Pub-Prefix method
- B tree indexes can be used to retrieve rows
whose cell matches a given prefix string - This clause is of the form
- WHERE T.C LIKE PK
- During publishing of a database, for every
keyword K, the entry (hash(K), T.C, P) is kept in
the symbol table if there exists a string in
column T.C which - contains a token K, and
- has prefix P
18Generalized Matches - Token Matches
Let the hash values of the searchable tokens
i.e., string, ball and round be 1, 2 and 3
respectively
Pub-Prefix table
Database table T
Consider searching keyword string Pub-Prefix
table returns prefixes th and no and
subsequent SQL will contain (T.C LIKE
nostring) OR (T.C LIKE thstring)
19DISCOVER Hristidis et al. VLDB02
- Result is tree T of tuples where
- each edge corresponds to a primary-foreign key
relationship - every keyword contained in a tuple of T (total)
- no tuple of T is redundant (minimal)
20Example - Data
21Example Keyword Query
Query Smith, Miller
22Example Keyword Query
Results
Query Smith, Miller
23Example Keyword Query
Results
Query Smith, Miller
24Architecture
User
25Architecture
26Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
27Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN1 OSmith ? C ? OMiller size2
28Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN1 OSmith ? C ? OMiller size2
CN2 OSmith ? C ? N ? C ? OMiller size4
29Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN3 OSmith ? C ? OMiller ? C size3
- The CN3 is not minimal, because the rightmost C
does not contain a key.
30Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN4 OSmith ? C ? O ? C ? OMiller size4
- c1 ? c2 , because primary to foreign key from
CUSTOMER to ORDERS (an answer may contain the
same tuple twice not a tree) - Pruning Condition RK?S?RL
31Candidate Networks Generator - Algorithm
- Traverse tuple set graph breadth first
- Q ? tuple sets containing keyword k1
- For each network n of tuple sets in Q do
- If pruning_condition(n) drop n
- else if is_CN(n) output n
- else expand n by one tuple set to all possible
directions in tuple set graph and insert
expansions to Q - eg if n is OSmith ? C then we add to Q
- OSmith ? C ? OMiller, OSmith ? C ? O, OSmith ?
C ? N
32Candidate Networks Generator is Complete and
Non-Redundant
- The set of Candidate Networks generated is
- Complete All solutions generated by a CN
- Non-redundant There is database instance, where
by removing a CN a solution is lost
33Architecture
34Execution Plan
- Each CN corresponds to a SQL statement
- CN1 OSmith ? C ? OMiller
- CN2 OSmith ? C ? N ? C ? OMiller
- Execution Plan
- CN1 ? OSmith ?? C ?? OMiller
- CN2 ? OSmith ?? C ?? N ?? C ?? OMiller
35Reuse Common Subexpressions - Example
- Execution Plan
- CN1 ? OSmith ?? C ?? OMiller
- CN2 ? OSmith ?? C ?? N ?? C ?? OMiller
- Optimized Execution Plan
- Temp ? OSmith ?? C
- CN1 ? Temp ?? OMiller
- CN2 ? Temp ?? N ?? C ?? OMiller
36KS in RDB Streams Markowetz et al. SIGMOD07
Query Tarantino, Travolta
37Data Graph G
- Nodes Tuples
- Edges can be joined
38MTJNT
- Sub-graph of G
- Contain all keywords
- Minimal
- Answer R-KWS query
- Limited to Tmax nodes
- Longer joins irrelevant results
39Candidate Networks (CN)
40Operator Trees for Candidate Networks
CN
- Leaves Selections
- Inner nodes Joins
OP-Tree
Output MTJNT
41Instantaneous Data Graph
42Operator Mesh
43Demand Driven Operator Execution (I)
- d at least one parent is running
- r right input is not empty
44Demand Driven Operator Execution (II)
45BANKS-I/II (ICDE02,VLDB05)
- Keyword Search on (Directed) Graphs
paper
Multi-Query Optimization
E.g., Sudarshan Roy
writes
writes
author
author
Prasan Roy
Sudarshan
46Ranking
- Edge Score EA
- Smaller tree gt higher score
- EA 1/ (S edge weights)
- Node Score NA
- Measure of authority of nodes in tree
- NA S (leaf and root node authorities)
- Overall score f (EA, NA)
- f (EA, NA) EA . NAl
47Finding Answer Trees
- Intuition travel backwards from keyword nodes
till you hit a common node
Query sudarshan roy
MultiQuery Optimization
paper
writes
Sudarshan
Prasan Roy
authors
48Backward Search Algorithm
- Run concurrent single source shortest path
iterators from each node matching a keyword - Traverse the graph edges in reverse direction
- Output next nearest node on each get-next() call
- Do best-first search across iterators
- Output node if in the intersection of sets of
nodes reached from each keyword
49Backward Search Limitations
- Wasteful exploration of graph
- Frequently occurring keywords
- Hub nodes in the graph (high in-degree)
Shashank Sudarshan Database
Schema Legend
Database
author
writes
paper
Shashank
Sudarshan
50Bidirectional Search Motivation
51Bidir Search Intuition
- First cut solution
- Dont go backward if a keyword matches many nodes
- Dont go backward if a node points to a hub
- Instead explore forward from other keywords
52Bidir Search Example
Shashank Sudarshan Database
Database
Schema Legend
author
writes
Shashank
Sudarshan
paper
53Bidir Search Issues
- What should threshold for not expanding be?
- The proposed solution prioritize expansion of
nodes based on spreading activation - to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?
54Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
- Gives low activation to frequently occurring
keywords
1/5
1/5
1/5
1/5
1/5
John
55Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ 0.3)
- Gives low activation to neighbors of hubs
0.7 x 1/5 x 1/4
0
1
1/5
1
0.7 x 1/5 x 1/4
0
1
0
0.7 x 1/5 x 1/4
0.3 x 1/5
1
0.7 x 1/5 x 1/4
0
56Bidir Search Iterators
- How to manage exploration in both directions?
-
- Single backward iterator single forward
iterator w/ suitable datastructures - E.g., to keep track of parents of nodes
Dist from A, Dist from B
7
6
8,8
2,3 8
8,8 2
2,8
8,1
8,1
1,8
3
4
5
0,8
8,0
2
1
A
B
57Bidir Search top-k results
- Results need not be generated in-order
- Naïve solution
- Store results in an intermediate heap
- Output top k results after mk total results have
been generated (m 10)
58Minimum Group Steiner Tree
59An Example
60An Example
61An Example
62Dynamic Programming Ding et al. ICDE07
63Dynamic Programming Equation
64Dynamic Programming Equation
65The Order to Compute T(v,p)
66BLINKS He et al. SIGMOD07
- Score definition
- For an answer T ?r,(n1,,nm)? to a query q
(w1, , wm), the score is defined as S(T) f(
Sr(r) ? Sn(ni, wi) ? Sp(r, ni) ) - Considers both content and graph structure
- Match-distributive property
- Contribution of matches and root-match paths can
be computed in a distributive manner by summing
over all matches - Allow pre-computation of best path, independently
for each node/keyword - Graph-distance property
- The contribution of a root-match path, Sp(r,ni),
is defined to be shortest-path distance from r to
ni - To simplify presentation, we focus on the path
contribution ?Sp(r,ni)
paths from root to matches
67Graph Search Strategies
- Backward search Bhalotia et al., ICDE02
- Starting from keyword nodes (containing at least
one query keyword) - In each search step, choose an incoming edge to a
previously visited node and follow the edge
backward to visit its source node - Discover an answer root r if r is visited from
every keyword - Bidirectional search Kacholia et al., VLDB05
- Explore the graph by following forward edges as
well - Choose which node to visit by heuristic
activation factors
w1
Conceptually, expand clusters of visited nodes
for each keyword
w2
Graph
68Graph Search Strategies (contd)
- Each search step needs to decide
- Which node to expand within a cluster
- Which keyword cluster to expand
- New approach
- Equi-distance expansion in each cluster
- Cost-balanced expansion across clusters balance
of nodes expanded across clusters - Cost is at most m times that of an oracle
backward search algorithm (m of query
keywords)
? Equi-distance expansion node closest to
cluster origin in graph distance
Optimal
No Guarantee
? Distance-balanced expansion balance diameter
across all clusters
Assume 3 keywords w1 w2 w3
Optimal
m-optimal
69Using a Single-Level Index
- What is inefficient with search without index?
- Needs to maintain, for each keyword, a priority
queue storing nodes in current expansion
frontier ? High space/time complexity - Existing forward expansion is largely guesswork
- New ideas
- (I) For each keyword, index nodes in the order of
visiting them in search Keyword-node lists - For each keyword w, a list LKN(w) contains nodes
that can reach w, ordered by their shortest
distances to w - (II) Index shortest distances from nodes to
keywords, enabling forward jumps Node-keyword
map - Given node u and keyword w, a hash map MNK (u,w)
returns the shortest distance from u to w in
O(1) time
v1
v2
v3
v4
v5
v6
v7
LKN(w1)
0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
MNK(v2,w1)
0, v2, v2
MNK(v2,w2)
1, v4, v4
70Search with Single-Level Index
- Search algorithm using the single-level index,
applying the search strategies - Equi-distance expansion ? Use one cursor to
traverse each LKN(wi) - Cost-balanced expansion ? Pick the cursor to
expand next in a round-robin manner - Forward expansion ? When visiting a node, look up
its distances to other keywords by MNK -
- Efficiency
- Managing exploration states by m cursors instead
of m priority queues
Keyword-node lists
v1
LKN(w1)
0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
v2
1
v3
LKN(w2)
0, v4, v4, v4
0,v11,v11,v11
1, v2, v4, v4
v4
v5
v8
3
v6
v9
Node-keyword map
v7
MNK
v10
Partial Answers
v12
v11
ltv2,(0, ?)gt
ltv4,(?, 0)gt
Answers
ltv2,(0, 1)gt
ltv6,(1, 2)gt
71Bi-Level Indexing in BLINKS
- Single-level index is impractical for large
graphs - Space complexity O(VK) where K is the number
of keywords - BLINKS Bi-Level Index for Keyword Search
- Partition a data graph into multiple, say B,
subgraphs, or blocks - Partitioning by nodes, called portals, which will
play key roles in search - There are many partitioning algorithms, such as
Breadth-first and METIS - (Top-level) block index map keywords and portals
to blocks - Purpose Initiate backward expansion in relevant
blocks guide backward expansion across blocks
(through portals) - (Low-level) intra-block index store similar
information as in a single-level index, but
restricted to within each block - Purpose Help backward expansion and forward
jumps within blocks
72Search with the Bi-Level Index
- Similar to searching with single-level index in
- Overall expansion policies (which keyword
cluster/node to explore next) - Index access (scanning LKN lists and looking-up
MNK hash map) - New challenges/complications introduced by graph
partitioning - A single cursor for a keyword is no longer
sufficient - Need simultaneously backward expansion in
multiple blocks that contains the keyword - So it maintains a queue of cursors, one for each
block we are currently exploring - Backward expansion needs to continue across block
boundaries - When encountering boundaries, it retrieves new
blocks to visit from the block index and add them
to the queue - Distance information in the intra-block index ?
global shortest distance - The path with the shortest distance may happen to
go across blocks - Our exploration order guarantees correct global
shortest distance
73References (Keywords)
- Roy Goldman, Narayanan Shivakumar, Suresh
Venkatasubramanian, Hector Garcia-Molina
Proximity Search in Databases, VLDB99, 1999. - Bettina Berendt, myra Spiliopoulou Analysis of
navigation behaviour in web sites integrating
multiple information systems, VLDBJ, 2000. - Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, S. Sudarshan Keyword
Searching and Browsing in Database using BANKS,
ICDE02, 2002. - Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
DBXplorer A System for Keyword-Based Search over
Relational Databases, ICDE02, 2002. - Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin
Keyword Search on Structured and Semistructured
Data, SIGMOD09, 2009. - Vagelis Hristidis, Yannis Papakonstantinou
DISCOVER Keyword Search in Relational Databases,
VLDB02, 2002. - Vagelis Hristidis, Yannis Papakonstantinou,
Andrey Balmin Keyword Proximity Search on XML
Graps, ICDE03, 2003. - Vagelis Hristidis, Luis Gravano, Yannis
Papakonstantinou Efficient IR-Style Keyword
Search over Relational Databases, VLDB03, 2003.
74References (Keywords)
- Andrey Balmin, Vagelis Hristidis, Yannis,
Papakonstantinou ObjectRank Authority-Based
Keyword Search in Databases, VLDB04, 2004. - Sara Cohen, Yaron Kanza, Benny Kimelfeld
Interconnection Semantics for Keyword Search in
XML, CIKM05, 2005. - Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan Bidirectional
Expansion for Keyword Search on Graph Databases,
VLDB05, 2005. - Benny Kimelfeld, Yehoshua Sagiv Efficient
Engines for Keyword Proximity Search, WebDB05,
2005. - Benny Kimelfeld, Yehoshua Sagiv Efficiently
Enumerating Results of Keyword Search, DBLP05,
2005. - Fang Liu, Clement Yu, Weiyi Meng, Abdur
Chowdhury Effective Keyword Search in Relational
Databases, SIMOD06, 2006. - Benny Kimelfeld, Yehoshua Sagiv Finding and
Approximating Top-k Answersin Keyword Proximity
search, PODS06, 2006.
75References (Keywords)
- Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin,
Xiao Zhang, Xuemin Lin Finding Top-k Min-Cost
Connected Trees in Databases, ICDE07, 2007. - Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou
SPARK Top-k Keyword Query in Relational
Databases, SIGMOD07, 2007. - Hao He, Haixun Wang, Jun Yang, Philip S. Yu
BLINKS Ranked Keyword Searches on Graphs,
SIGMOD07, 2007. - Alexander Markowetz, Ying Yang, Dimitris
Papadias Keyword Search on Relational Data
Streams, SIGMOD07, 2007. - Konstantin Golenberg, Benny Kimelfeld, Yehoshua
Sagiv Keyword Proximity Search in Complex Data
Graphs, SIGMOD08, 2008. - Guoliang Li, BengChin Ooi, Jianhua Feng, Jianyong
Wang, Lizhu Zhou EASE An Effective 3-in-1
Keyword Search Method for Unstructured,
Semi-structured and Structured Data, SIGMOD08,
2008. - Guang Hleu Vu, Beng Chin Ooi, Dimitris Papadias,
Anthony K. H. Tung A Graph Method for
Keyword-based Selection of the top-K
Databases,SIGMOD08, 2008.