Keyword Search in RDB presentation

1
Keyword Search in RDB

Jeffrey Xu Yu
Chinese University of Hong Kong
yu_at_se.cuhk.edu.hk

Acknowledgement: many slides used in this talk
are originally taken from the authors'
presentations given at the conferences.
2
Traditional Data Access Methods (SIGMOD09
Tutorial)

Databases / XML data
Structured, with rich meta-data
Accessed by query languages
High search quality
Small user population that masters DB

Text documents
Unstructured
Accessed by keywords
Limited search quality
Large user population

2
SIGMOD09 Tutorial
2010-1-2
3
The Challenges of Accessing Structured Data

Query languages: long learning curves
Schemas: complex, evolving, or even unavailable.
What about filling in query forms?
Limited access pattern.
Hard to design and maintain forms on dynamic and
heterogeneous data!

SELECT p.title
FROM conference c, paper p,
     author a1, author a2, write w1, write w2
WHERE c.cid = p.cid AND p.pid = w1.pid
  AND p.pid = w2.pid AND w1.aid = a1.aid AND w2.aid = a2.aid
  AND a1.name = 'John' AND a2.name = 'Mary'
  AND c.name = 'SIGMOD'
"The usability of DB is severely limited unless
easier ways to access databases are developed"
[Jagadish, SIGMOD 07].
4
Supporting Keyword Search on DB: Advantages /1

Easy to use
The most important factor for the majority of
users.
The same advantage of keyword search on text
documents

5
Supporting Keyword Search on DB: Advantages /2

Enabling interesting or unexpected discoveries
Relevant data pieces that are scattered but are
collectively relevant to the query should be
automatically assembled in the results
Larger scope for data inter-connection

Query: Seltzer, Berkeley
Is Seltzer a student at UC Berkeley?
Seltzer is a developer of Berkeley DB.
Wow.
6
Supporting Keyword Search on DB: Advantages /3

Returning meaningful results by exploiting
structural information.
A unique opportunity in structured data

Query: Bernstein, skyline
Structured Document
Such a result will have a low rank.
[Figure: two scientist elements, each with name and
publications/paper/title children; the names are
Bernstein and Duane, the titles are "skyline" and
"model management"]
Text Document
"Bernstein is a computer scientist.......... One
of Bernstein's colleagues, Duane, recently
published a paper about skyline query processing."
7
Supporting Keyword Search on DB: Summary of
Advantages

Increasing the DB usability
Increasing the coverage and quality of keyword
search

8
Supporting Keyword Search on DB: Challenges /1

Semantics: keyword queries are ambiguous
How to infer the query semantics and find
relevant answers?
How to effectively rank the results in the order
of their relevance?
How to help users analyze results?
How to evaluate the quality of search results?

9
Supporting Keyword Search on DB: Challenges /2

Efficiency
Many problems in keyword search on DB are shown
to be NP-hard:
generating results, query segmentation, snippet
generation, etc.
Large datasets
How to generate (top-k) query results efficiently?

10
Keyword Search on DB: State of the Art

Keyword search on DB has become a hot research
direction and has attracted researchers in DB, IR,
theory, etc.
More than 50 research papers (and counting...),
from both research labs and universities, in major
database conferences/journals
Workshop about keyword search on DB (KEYS, June
28, 09)
11
Keyword Search in RDBs

Report on the DB/IR Panel at SIGMOD 2005 by Sihem
Amer-Yahia and Pat Case (SIGMOD Record)
Internet search engines have popularized
keyword-based search.
DBMSs do not support IR-style keyword-based
search.
An Example
Keywords: Programming by Ritchie
To produce the result, rows need to be generated by
joining tables on the fly (all possible combinations)

[Figure: tables Authors, AuthorsBooks, Books,
BookStores, Store]
12
DBXplorer [S. Agrawal et al., ICDE02]

Given a set of query keywords, DBXplorer returns
all rows (either from single tables, or by
joining tables connected by foreign-key joins)
such that each row contains all keywords.
Where IR techniques use inverted lists, DBXplorer
uses a symbol table stored in the database.

13
Symbol Table Design

Symbol Table (S) - stores the information about
keywords at different granularities (column/row),
i.e. for each keyword it stores the list of all
columns or cells that contain it
Column level granularity (Pub-Col)
For every keyword, S maintains a list of all
database columns that contain it (i.e. table.column)
Cell level granularity (Pub-Cell)
For every keyword, S maintains a list of all
database cells that contain it (i.e. table.column.rowid)
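
The two granularities can be illustrated with a minimal sketch. The toy database, the whitespace tokenizer and the function name `build_symbol_tables` are assumptions made for illustration; they are not taken from DBXplorer itself, which stores keyword hashes rather than plain keywords.

```python
from collections import defaultdict

# Hypothetical toy database: table name -> list of rows (column -> value).
DATABASE = {
    "Authors": [{"name": "Dennis Ritchie"}, {"name": "Brian Kernighan"}],
    "Books": [{"title": "The C Programming Language"}],
}


def build_symbol_tables(database):
    """Return (pub_col, pub_cell), both keyed by lower-cased keyword."""
    pub_col = defaultdict(set)   # keyword -> {"table.column"}
    pub_cell = defaultdict(set)  # keyword -> {"table.column.rowid"}
    for table, rows in database.items():
        for rowid, row in enumerate(rows):
            for column, value in row.items():
                # Assumption: keywords are whitespace-separated tokens.
                for token in str(value).lower().split():
                    pub_col[token].add(f"{table}.{column}")
                    pub_cell[token].add(f"{table}.{column}.{rowid}")
    return pub_col, pub_cell


pub_col, pub_cell = build_symbol_tables(DATABASE)
print(sorted(pub_col["programming"]))  # ['Books.title']
print(sorted(pub_cell["ritchie"]))     # ['Authors.name.0']
```

Pub-Col is smaller but only narrows the search to columns, while Pub-Cell pinpoints the matching rows at the cost of a larger table.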

14
Storing Symbol Table

Store symbol tables (Pub-Col) in database as
(keyword hash, column Id)
FK Compression (Foreign Key)
If there is a foreign key relationship between c1
and c2, store only c1
CP Compression
Partition H into a minimum number of bipartite
cliques (a bipartite clique is any subgraph of H
with a maximal number of edges).
Compress each clique.
Store symbol tables (Pub-Cell) in database as
(keyword hash, list of all cell ids)

[Figure: an uncompressed hash table over hash values
v2, v3, v4 and columns c1, c2, a ColumnsMap table
with entry x, and the compressed hash table]
15
Search - Enumerating Join Trees

Step 1 - Look up the symbol table to find the tables /
columns which contain keywords
Step 2 - Enumerate join trees
Identify and enumerate all potential subsets of
tables in the database that, if joined, might
contain rows having all keywords.
The resulting relation will contain all potential
rows having all keywords specified in the query.

Viewing the schema graph G as an undirected
graph, this step enumerates join trees, i.e.,
sub-trees of G such that
the leaves belong to MatchedTables
together, the leaves contain all keywords of the
query
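
The two leaf conditions can be checked with a brute-force sketch. It assumes an acyclic schema graph (so every connected set of tables induces exactly one sub-tree) and uses the bookstore tables from the earlier example; the `SCHEMA` edges and the `matched` input are hypothetical, and this exhaustive enumeration is only an illustration, not the enumeration strategy of the system.

```python
from itertools import combinations

# Hypothetical undirected, acyclic schema graph: table -> neighbouring tables.
SCHEMA = {
    "Authors": {"AuthorsBooks"},
    "AuthorsBooks": {"Authors", "Books"},
    "Books": {"AuthorsBooks", "BookStores"},
    "BookStores": {"Books", "Store"},
    "Store": {"BookStores"},
}


def is_connected(tables):
    """True if the given tables form a connected subgraph of SCHEMA."""
    tables = set(tables)
    start = next(iter(tables))
    seen, stack = {start}, [start]
    while stack:
        for nxt in SCHEMA[stack.pop()] & tables:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == tables


def join_trees(matched):
    """matched: keyword -> set of tables that contain it (from the symbol table)."""
    matched_tables = set().union(*matched.values())
    result = []
    for size in range(1, len(SCHEMA) + 1):
        for tables in combinations(sorted(SCHEMA), size):
            if not is_connected(tables):
                continue
            chosen = set(tables)
            # A leaf has at most one neighbour inside the sub-tree.
            leaves = {t for t in chosen if len(SCHEMA[t] & chosen) <= 1}
            covered = {kw for kw, ts in matched.items() if ts & leaves}
            if leaves <= matched_tables and covered == set(matched):
                result.append(tables)
    return result


trees = join_trees({"programming": {"Books"}, "ritchie": {"Authors"}})
print(trees)  # [('Authors', 'AuthorsBooks', 'Books')]
```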

16
Search - Identify Matching Rows

The input to this final search step is the
enumerated join trees.
Each join tree is then mapped to a single SQL
statement that joins the tables as specified in
the tree, and selects those rows that contain all
keywords.
The retrieved rows are ranked before being
output.
Rows are ranked by the number of joins involved,
fewer joins first (ties broken arbitrarily). This
mirrors document search, where keywords occurring
close to one another are ranked higher.
17
Generalized Matches - Token Matches

Token matches - the keyword in the query matches
only a token or sub-string of an attribute value
(e.g., retrieve rows of address by specifying
only a street name).
Pub-Prefix method
B+ tree indexes can be used to retrieve rows
whose cell matches a given prefix string
This clause is of the form
WHERE T.C LIKE 'P%K%'
During publishing of a database, for every
keyword K, the entry (hash(K), T.C, P) is kept in
the symbol table if there exists a string in
column T.C which
contains a token K, and
has prefix P

18
Generalized Matches - Token Matches
Let the hash values of the searchable tokens,
i.e., "string", "ball" and "round", be 1, 2 and 3
respectively
[Figure: Pub-Prefix table and database table T]
Consider searching for the keyword "string". The
Pub-Prefix table returns the prefixes "th" and "no", and
the subsequent SQL will contain (T.C LIKE
'no%string%') OR (T.C LIKE 'th%string%')
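
A small sketch of how the clause could be assembled from Pub-Prefix entries follows. The hash values come from the slide; the `PUB_PREFIX` rows (including the entry for "ball") and the helper `like_clause` are hypothetical.

```python
# Hash values of the searchable tokens, as given on the slide.
TOKEN_HASH = {"string": 1, "ball": 2, "round": 3}

# Hypothetical Pub-Prefix entries: (hash(K), column, prefix P).
PUB_PREFIX = [
    (1, "T.C", "th"),
    (1, "T.C", "no"),
    (2, "T.C", "th"),
]


def like_clause(keyword):
    """Build the WHERE clause for a token match on `keyword`."""
    wanted = TOKEN_HASH[keyword]
    terms = sorted(
        f"({column} LIKE '{prefix}%{keyword}%')"
        for key_hash, column, prefix in PUB_PREFIX
        if key_hash == wanted
    )
    # Illustration only: real code should bind the patterns as query
    # parameters instead of formatting them into the SQL text.
    return "WHERE " + " OR ".join(terms)


print(like_clause("string"))
# WHERE (T.C LIKE 'no%string%') OR (T.C LIKE 'th%string%')
```

Because each pattern starts with a fixed prefix, an ordinary index on T.C can narrow the scan before the substring test is applied.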
19
DISCOVER [Hristidis et al., VLDB02]

Result is a tree T of tuples where
each edge corresponds to a primary-foreign key
relationship
every keyword is contained in some tuple of T (total)
no tuple of T is redundant (minimal)

20
Example - Data
21
Example - Keyword Query
Query: Smith, Miller
22
Example - Keyword Query
Results
Query: Smith, Miller
24
Architecture
User
25
Architecture
26
Candidate Network - Example
[Figure: tuple set graph over ORDERS^Smith, CUSTOMER,
NATION, ORDERS and ORDERS^Miller]
27
Candidate Network - Example
CN1: O^Smith ← C → O^Miller (size 2)
28
Candidate Network - Example
CN1: O^Smith ← C → O^Miller (size 2)
CN2: O^Smith ← C ← N → C → O^Miller (size 4)
29
Candidate Network - Example
CN3: O^Smith ← C → O^Miller ← C (size 3)

CN3 is not minimal, because the rightmost C
does not contain a keyword.

30
Candidate Network - Example
CN4: O^Smith ← C → O ← C → O^Miller (size 4)

c1 = c2, because of the primary-to-foreign-key edge from
CUSTOMER to ORDERS (an answer would contain the
same tuple twice, so it is not a tree)
Pruning condition: R^K → S ← R^L

31
Candidate Networks Generator - Algorithm

Traverse the tuple set graph breadth first
Q ← tuple sets containing keyword k1
For each network n of tuple sets in Q do
  If pruning_condition(n), drop n
  else if is_CN(n), output n
  else expand n by one tuple set in all possible
  directions in the tuple set graph and insert the
  expansions into Q
e.g. if n is O^Smith ← C then we add to Q
O^Smith ← C → O^Miller, O^Smith ← C → O, O^Smith ←
C ← N
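
The breadth-first loop can be sketched in Python under a strong simplification: networks are restricted to paths that grow only at their tail, which is enough for two-keyword queries such as Smith, Miller but misses tree-shaped networks in general. The tuple sets, the `PK_TO_FK` edges and all function names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical tuple sets: (relation, keywords); free tuple sets have none.
O_SMITH = ("ORDERS", frozenset({"Smith"}))
O_MILLER = ("ORDERS", frozenset({"Miller"}))
O_FREE = ("ORDERS", frozenset())
C_FREE = ("CUSTOMER", frozenset())
N_FREE = ("NATION", frozenset())
TUPLE_SETS = [O_SMITH, O_MILLER, O_FREE, C_FREE, N_FREE]

# Primary-to-foreign-key edges: (referenced relation, referencing relation).
PK_TO_FK = {("CUSTOMER", "ORDERS"), ("NATION", "CUSTOMER")}


def joinable(a, b):
    return (a[0], b[0]) in PK_TO_FK or (b[0], a[0]) in PK_TO_FK


def covered(path):
    return frozenset().union(*(keywords for _, keywords in path))


def pruned(path):
    """R -> S <- R: S references a single R tuple, so both R tuples coincide."""
    return any(
        left[0] == right[0] and (left[0], mid[0]) in PK_TO_FK
        for left, mid, right in zip(path, path[1:], path[2:])
    )


def candidate_networks(keywords, max_joins):
    query = frozenset(keywords)
    queue = deque((ts,) for ts in TUPLE_SETS if keywords[0] in ts[1])
    found = []
    while queue:
        path = queue.popleft()
        if pruned(path):
            continue
        if covered(path) == query:
            # Minimal only if dropping either end loses a keyword.
            if covered(path[1:]) != query and covered(path[:-1]) != query:
                found.append(path)
            continue
        if len(path) - 1 < max_joins:
            queue.extend(path + (ts,) for ts in TUPLE_SETS if joinable(path[-1], ts))
    return found


def show(path):
    return " - ".join(
        rel + ("^" + ",".join(sorted(kws)) if kws else "") for rel, kws in path
    )


networks = candidate_networks(["Smith", "Miller"], max_joins=4)
for network in networks:
    print(show(network))
# ORDERS^Smith - CUSTOMER - ORDERS^Miller
# ORDERS^Smith - CUSTOMER - NATION - CUSTOMER - ORDERS^Miller
```

With a limit of four joins the sketch outputs the counterparts of CN1 and CN2, while paths through C → O ← C are dropped by the pruning check.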

32
Candidate Networks Generator is Complete and
Non-Redundant

The set of Candidate Networks generated is
Complete: every solution is generated by some CN
Non-redundant: for each CN there is a database
instance where removing that CN loses a solution

33
Architecture
34
Execution Plan

Each CN corresponds to a SQL statement
CN1: O^Smith ← C → O^Miller
CN2: O^Smith ← C ← N → C → O^Miller
Execution Plan
CN1 ← O^Smith ⋈ C ⋈ O^Miller
CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

35
Reuse Common Subexpressions - Example

Execution Plan
CN1 ← O^Smith ⋈ C ⋈ O^Miller
CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller
Optimized Execution Plan
Temp ← O^Smith ⋈ C
CN1 ← Temp ⋈ O^Miller
CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
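
The rewrite on this slide can be reproduced with a very simple rule: factor out the longest join prefix shared by all plans. This rule is an assumption made for illustration (the slide does not say how the shared subexpressions are chosen), and `PLANS` and `optimize` are hypothetical names.

```python
# Join sequences of the two candidate networks from the slide.
PLANS = {
    "CN1": ["O^Smith", "C", "O^Miller"],
    "CN2": ["O^Smith", "C", "N", "C", "O^Miller"],
}


def shared_prefix(plans):
    """Longest sequence of join operands common to the start of every plan."""
    prefix = []
    for operands in zip(*plans.values()):
        if len(set(operands)) != 1:
            break
        prefix.append(operands[0])
    return prefix


def optimize(plans):
    """Materialise the shared prefix once as Temp and reuse it in every plan."""
    prefix = shared_prefix(plans)
    if len(prefix) < 2:  # nothing worth sharing: fewer than one join
        return {name: " JOIN ".join(ops) for name, ops in plans.items()}
    optimized = {"Temp": " JOIN ".join(prefix)}
    for name, ops in plans.items():
        optimized[name] = " JOIN ".join(["Temp"] + ops[len(prefix):])
    return optimized


for name, expression in optimize(PLANS).items():
    print(name, "<-", expression)
# Temp <- O^Smith JOIN C
# CN1 <- Temp JOIN O^Miller
# CN2 <- Temp JOIN N JOIN C JOIN O^Miller
```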

36
KS in RDB Streams [Markowetz et al., SIGMOD07]
Query: Tarantino, Travolta
37
Data Graph G

Nodes: tuples
Edges: pairs of tuples that can be joined

38
MTJNT (minimal total joining network of tuples)

Sub-graph of G
Contains all keywords
Minimal
Answers an R-KWS query
Limited to Tmax nodes
Longer joins give irrelevant results

39
Candidate Networks (CN)

Abstractions of MTJNT

40
Operator Trees for Candidate Networks
CN

Leaves: selections
Inner nodes: joins

OP-Tree
Output: MTJNT
41
Instantaneous Data Graph
42
Operator Mesh
43
Demand Driven Operator Execution (I)

d: at least one parent is running
r: right input is not empty

44
Demand Driven Operator Execution (II)
45
BANKS-I/II (ICDE02, VLDB05)

Keyword Search on (Directed) Graphs

E.g., query: Sudarshan Roy
[Figure: a paper node "Multi-Query Optimization" linked
through two writes nodes to the author nodes
"Prasan Roy" and "Sudarshan"]
46
Ranking

Edge Score EA
Smaller tree => higher score
EA = 1 / (Σ edge weights)
Node Score NA
Measure of authority of nodes in tree
NA = Σ (leaf and root node authorities)
Overall score f(EA, NA)
f(EA, NA) = EA · NA^λ
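
The formulas above translate directly into code. In this minimal sketch the function names, the sample weights and authorities, and the default λ = 0.2 are assumptions chosen for illustration.

```python
def edge_score(edge_weights):
    """EA = 1 / (sum of edge weights): smaller trees score higher."""
    return 1.0 / sum(edge_weights)


def node_score(root_authority, leaf_authorities):
    """NA = sum of the root and leaf node authorities."""
    return root_authority + sum(leaf_authorities)


def overall_score(edge_weights, root_authority, leaf_authorities, lam=0.2):
    """f(EA, NA) = EA * NA ** lam; lam = 0.2 is an arbitrary illustrative value."""
    return edge_score(edge_weights) * node_score(root_authority, leaf_authorities) ** lam


# Two hypothetical answer trees with the same authorities but different sizes.
small_tree = overall_score([1, 1], 0.5, [0.25, 0.25])
large_tree = overall_score([1, 1, 1, 1], 0.5, [0.25, 0.25])
print(small_tree > large_tree)  # True
```

The exponent λ controls how much node authority can compensate for a larger tree.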

47
Finding Answer Trees

Intuition: travel backwards from keyword nodes
till you hit a common node

Query: sudarshan roy
[Figure: a paper node "MultiQuery Optimization" linked
through writes nodes to the author nodes
"Sudarshan" and "Prasan Roy"]
48
Backward Search Algorithm

Run concurrent single source shortest path
iterators from each node matching a keyword
Traverse the graph edges in reverse direction
Output next nearest node on each get-next() call
Do best-first search across iterators
Output node if in the intersection of sets of
nodes reached from each keyword
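
A compact sketch of these steps follows. It simplifies the algorithm by keeping one shortest-path frontier per keyword (rather than one iterator per matching node) and merging all frontiers in a single heap to get the best-first order. The example graph, its edge directions and the unit weights are hypothetical.

```python
import heapq
from collections import defaultdict


def backward_search(edges, keyword_nodes):
    """edges: (source, target, weight) triples of a directed graph.
    keyword_nodes: keyword -> set of nodes matching it.
    Yields (root, {keyword: distance}) once a node is reached from every keyword."""
    reverse = defaultdict(list)
    for src, dst, weight in edges:
        reverse[dst].append((src, weight))  # edges are traversed backwards
    keywords = list(keyword_nodes)
    heap = [(0, kw, node) for kw in keywords for node in sorted(keyword_nodes[kw])]
    heapq.heapify(heap)
    reached = defaultdict(dict)  # node -> {keyword: shortest distance}
    while heap:
        dist, kw, node = heapq.heappop(heap)  # globally nearest node next
        if kw in reached[node]:
            continue  # already settled for this keyword
        reached[node][kw] = dist
        if len(reached[node]) == len(keywords):
            yield node, dict(reached[node])
        for prev, weight in reverse[node]:
            if kw not in reached[prev]:
                heapq.heappush(heap, (dist + weight, kw, prev))


# Hypothetical graph: a paper points to two writes nodes, each pointing to an author.
EDGES = [
    ("paper", "writes1", 1), ("paper", "writes2", 1),
    ("writes1", "Sudarshan", 1), ("writes2", "PrasanRoy", 1),
]
MATCHES = {"sudarshan": {"Sudarshan"}, "roy": {"PrasanRoy"}}
roots = list(backward_search(EDGES, MATCHES))
print(roots)  # [('paper', {'roy': 2, 'sudarshan': 2})]
```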

49
Backward Search Limitations

Wasteful exploration of graph
Frequently occurring keywords
Hub nodes in the graph (high in-degree)

Query: Shashank Sudarshan Database

[Figure: data graph with a schema legend (author,
writes, paper) and keyword nodes Database, Shashank
and Sudarshan]
50
Bidirectional Search Motivation
51
Bidir Search Intuition

First cut solution
Don't go backward if a keyword matches many nodes
Don't go backward if a node points to a hub
Instead explore forward from other keywords

52
Bidir Search Example
Query: Shashank Sudarshan Database

[Figure: data graph with a schema legend (author,
writes, paper) and keyword nodes Database, Shashank
and Sudarshan]
53
Bidir Search Issues

What should the threshold for not expanding be?
The proposed solution prioritizes expansion of
nodes based on spreading activation,
to penalize frequent keywords and bushy trees
How to manage exploration in both directions?

54
Bidir Search Spreading Activation

Spreading Activation
Node with highest activation explored first
Every node given an initial activation
Gives low activation to frequently occurring
keywords

[Figure: five nodes matching the keyword John, each
with initial activation 1/5]
55
Bidir Search Spreading Activation

Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (µ = 0.3)
Gives low activation to neighbors of hubs

[Figure: activation values after spreading from a node
with activation 1/5: 0.3 x 1/5 at one node and
0.7 x 1/5 x 1/4 at each of four nodes]
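
The arithmetic in the figure can be reproduced with a short sketch. The slide states µ = 0.3, while the figure shows 0.3 x 1/5 at one node and 0.7 x 1/5 x 1/4 at four others; the code follows the figure's numbers and names the kept fraction `retained`. The equal split among neighbours and all function names are assumptions.

```python
def initial_activation(keyword_nodes):
    """Each node matching a keyword starts with 1 / (number of nodes matching it)."""
    activation = {}
    for nodes in keyword_nodes.values():
        for node in nodes:
            activation[node] = activation.get(node, 0.0) + 1.0 / len(nodes)
    return activation


def spread(activation, node, neighbours, retained=0.3):
    """Return a new activation map after `node` spreads to its neighbours.
    The node keeps `retained` of its activation; the remainder is divided
    equally among the neighbours (equal split is an assumption)."""
    amount = activation[node]
    share = (1.0 - retained) * amount / len(neighbours)
    updated = dict(activation)
    updated[node] = retained * amount
    for neighbour in neighbours:
        updated[neighbour] = updated.get(neighbour, 0.0) + share
    return updated


# Five nodes match "John", so each starts at 1/5 as on the previous slide.
start = initial_activation({"John": ["j1", "j2", "j3", "j4", "j5"]})
after = spread(start, "j1", ["n1", "n2", "n3", "n4"])
print(after["j1"], after["n1"])
```

A keyword that matches many nodes gives each of them little activation, and a node with many neighbours passes only a small share to each, which is how frequent keywords and hubs end up with low priority.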
56
Bidir Search Iterators

How to manage exploration in both directions?
Single backward iterator + single forward
iterator w/ suitable data structures
E.g., to keep track of parents of nodes

[Figure: example graph between keyword nodes A and B,
with each node labelled by its pair (Dist from A,
Dist from B)]
57
Bidir Search top-k results

Results need not be generated in-order
Naïve solution
Store results in an intermediate heap
Output top k results after m·k total results have
been generated (m = 10)
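
A minimal sketch of this naive buffering scheme, assuming results arrive as (score, answer) pairs in generation order; the function name and sample data are hypothetical.

```python
import heapq


def buffered_top_k(results, k, m=10):
    """Buffer results in a heap and return the k best once m*k results
    have been generated (or the input is exhausted)."""
    heap = []
    for count, (score, answer) in enumerate(results, start=1):
        # Negated score turns heapq's min-heap into a max-heap; `count`
        # breaks ties so answers themselves are never compared.
        heapq.heappush(heap, (-score, count, answer))
        if count >= m * k:
            break
    return [(-neg_score, answer) for neg_score, _, answer in heapq.nsmallest(k, heap)]


generated = [(0.2, "t1"), (0.9, "t2"), (0.5, "t3"), (0.7, "t4"), (1.0, "t5")]
print(buffered_top_k(generated, k=2, m=2))  # [(0.9, 't2'), (0.7, 't4')]
```

In the usage above the fifth result is never seen because output starts after m·k = 4 results, which shows why this scheme gives no guarantee that the true top k are returned.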

58
Minimum Group Steiner Tree
59
An Example
60
An Example
61
An Example
62
Dynamic Programming [Ding et al., ICDE07]

A Naïve Approach

63
Dynamic Programming Equation
64
Dynamic Programming Equation
65
The Order to Compute T(v,p)
66
BLINKS [He et al., SIGMOD07]

Score definition
For an answer T = ⟨r, (n1, ..., nm)⟩ to a query q =
(w1, ..., wm), the score is defined as S(T) = f(
Sr(r) + Σ Sn(ni, wi) + Σ Sp(r, ni) )
Considers both content and graph structure
Match-distributive property
Contribution of matches and root-match paths can
be computed in a distributive manner by summing
over all matches
Allows pre-computation of the best path, independently
for each node/keyword
Graph-distance property
The contribution of a root-match path, Sp(r, ni),
is defined to be the shortest-path distance from r to
ni
To simplify presentation, we focus on the path
contribution Σ Sp(r, ni), i.e. the
paths from root to matches
67
Graph Search Strategies

Backward search [Bhalotia et al., ICDE02]
Starting from keyword nodes (containing at least
one query keyword)
In each search step, choose an incoming edge to a
previously visited node and follow the edge
backward to visit its source node
Discover an answer root r if r is visited from
every keyword
Bidirectional search [Kacholia et al., VLDB05]
Explore the graph by following forward edges as
well
Choose which node to visit by heuristic
activation factors

Conceptually, expand clusters of visited nodes
for each keyword
[Figure: graph with one cluster of visited nodes per
keyword, w1 and w2]
68
Graph Search Strategies (cont'd)

Each search step needs to decide
Which node to expand within a cluster
Which keyword cluster to expand
New approach
Equi-distance expansion in each cluster
Cost-balanced expansion across clusters: balance
the number of nodes expanded across clusters
Cost is at most m times that of an oracle
backward search algorithm (m = number of query
keywords)

Equi-distance expansion: expand the node closest to
the cluster origin in graph distance
Distance-balanced expansion: balance the diameter
across all clusters
[Figure: comparison of expansion strategies, assuming
3 keywords w1, w2, w3; labels: Optimal, No Guarantee,
Optimal, m-optimal]
69
Using a Single-Level Index

What is inefficient about search without an index?
Needs to maintain, for each keyword, a priority
queue storing nodes in the current expansion
frontier → high space/time complexity
Existing forward expansion is largely guesswork
New ideas
(I) For each keyword, index nodes in the order of
visiting them in search: keyword-node lists
For each keyword w, a list LKN(w) contains nodes
that can reach w, ordered by their shortest
distances to w
(II) Index shortest distances from nodes to
keywords, enabling forward jumps: node-keyword
map
Given node u and keyword w, a hash map MNK(u, w)
returns the shortest distance from u to w in
O(1) time

[Figure: example graph with nodes v1 to v7]
LKN(w1): (0, v2, v2, v2), (0, v7, v7, v7), (1, v6, v7, v7)
MNK(v2, w1): (0, v2, v2)
MNK(v2, w2): (1, v4, v4)
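
Both structures can be built with one backward shortest-path pass per keyword. In the sketch below the edge list is hypothetical, chosen so that the visible slide entries (for example MNK(v2, w2) = 1) come out; only (distance, node) pairs are stored, whereas the slide's entries carry additional fields.

```python
import heapq
from collections import defaultdict


def build_single_level_index(edges, keyword_nodes):
    """edges: (source, target, weight); keyword_nodes: keyword -> matching nodes.
    Returns (lkn, mnk): lkn[w] is a list of (distance, node) sorted by distance,
    mnk[(node, w)] is the shortest distance from node to keyword w."""
    reverse = defaultdict(list)
    for src, dst, weight in edges:
        reverse[dst].append((src, weight))
    lkn, mnk = {}, {}
    for kw, sources in keyword_nodes.items():
        dist = {}
        heap = [(0, node) for node in sorted(sources)]
        heapq.heapify(heap)
        while heap:  # multi-source Dijkstra over reversed edges
            d, node = heapq.heappop(heap)
            if node in dist:
                continue
            dist[node] = d
            for prev, weight in reverse[node]:
                if prev not in dist:
                    heapq.heappush(heap, (d + weight, prev))
        lkn[kw] = sorted((d, node) for node, d in dist.items())
        for node, d in dist.items():
            mnk[(node, kw)] = d
    return lkn, mnk


# Hypothetical edges and keyword matches.
EDGES = [("v2", "v4", 1), ("v6", "v7", 1), ("v6", "v2", 1)]
MATCHES = {"w1": {"v2", "v7"}, "w2": {"v4", "v11"}}
lkn, mnk = build_single_level_index(EDGES, MATCHES)
print(lkn["w1"])          # [(0, 'v2'), (0, 'v7'), (1, 'v6')]
print(mnk[("v2", "w2")])  # 1
```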

70
Search with Single-Level Index

Search algorithm using the single-level index,
applying the search strategies
Equi-distance expansion → Use one cursor to
traverse each LKN(wi)
Cost-balanced expansion → Pick the cursor to
expand next in a round-robin manner
Forward expansion → When visiting a node, look up
its distances to other keywords by MNK
Efficiency
Managing exploration states by m cursors instead
of m priority queues

[Figure: example graph with nodes v1 to v12]
Keyword-node lists
LKN(w1): (0, v2, v2, v2), (0, v7, v7, v7), (1, v6, v7, v7)
LKN(w2): (0, v4, v4, v4), (0, v11, v11, v11), (1, v2, v4, v4)
Node-keyword map: MNK
Partial answers: <v2, (0, ?)>, <v4, (?, 0)>
Answers: <v2, (0, 1)>, <v6, (1, 2)>
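
The cursor-based search can be sketched as follows. The `LKN` contents are hypothetical (reduced to (distance, node) pairs and extended so that v6 can reach w2), and the sketch scans every list to the end instead of applying a top-k stopping condition, so it illustrates only the cursor, round-robin and forward-lookup mechanics.

```python
# Hypothetical keyword-node lists as (distance, node), sorted by distance.
LKN = {
    "w1": [(0, "v2"), (0, "v7"), (1, "v6")],
    "w2": [(0, "v11"), (0, "v4"), (1, "v2"), (2, "v6")],
}
# Node-keyword map derived from the lists: (node, keyword) -> distance.
MNK = {(node, kw): d for kw, entries in LKN.items() for d, node in entries}


def search(keywords):
    """Return answer roots as (node, distances per keyword), best first."""
    cursors = {kw: 0 for kw in keywords}  # one cursor per keyword list
    active = list(keywords)
    answers = {}
    while active:
        for kw in list(active):  # round-robin over the cursors
            position = cursors[kw]
            if position == len(LKN[kw]):
                active.remove(kw)  # this list is exhausted
                continue
            cursors[kw] = position + 1
            _, node = LKN[kw][position]  # next nearest node for this keyword
            if node in answers:
                continue
            # Forward lookup: distances from this node to every keyword.
            distances = [MNK.get((node, other)) for other in keywords]
            if None not in distances:
                answers[node] = tuple(distances)
    return sorted(answers.items(), key=lambda item: sum(item[1]))


found = search(["w1", "w2"])
print(found)  # [('v2', (0, 1)), ('v6', (1, 2))]
```

The state kept per keyword is a single integer cursor, which is the saving over a priority queue per keyword.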
71
Bi-Level Indexing in BLINKS

Single-level index is impractical for large
graphs
Space complexity O(|V| × K), where K is the number
of keywords
BLINKS: Bi-Level Index for Keyword Search
Partition a data graph into multiple, say B,
subgraphs, or blocks
Partitioning by nodes, called portals, which will
play key roles in search
There are many partitioning algorithms, such as
breadth-first and METIS
(Top-level) block index: maps keywords and portals
to blocks
Purpose: initiate backward expansion in relevant
blocks; guide backward expansion across blocks
(through portals)
(Low-level) intra-block index: stores similar
information as in a single-level index, but
restricted to within each block
Purpose: help backward expansion and forward
jumps within blocks

72
Search with the Bi-Level Index

Similar to searching with single-level index in
Overall expansion policies (which keyword
cluster/node to explore next)
Index access (scanning LKN lists and looking up
the MNK hash map)
New challenges/complications introduced by graph
partitioning
A single cursor for a keyword is no longer
sufficient
Need simultaneous backward expansion in
multiple blocks that contain the keyword
So it maintains a queue of cursors, one for each
block currently being explored
Backward expansion needs to continue across block
boundaries
When encountering boundaries, it retrieves new
blocks to visit from the block index and adds them
to the queue
Distance information in the intra-block index ≠
global shortest distance
The path with the shortest distance may happen to
go across blocks
Our exploration order guarantees correct global
shortest distance

73
References (Keywords)

Roy Goldman, Narayanan Shivakumar, Suresh
Venkatasubramanian, Hector Garcia-Molina.
Proximity Search in Databases. VLDB99, 1999.
Bettina Berendt, Myra Spiliopoulou. Analysis of
navigation behaviour in web sites integrating
multiple information systems. VLDBJ, 2000.
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, S. Sudarshan. Keyword
Searching and Browsing in Databases using BANKS.
ICDE02, 2002.
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das.
DBXplorer: A System for Keyword-Based Search over
Relational Databases. ICDE02, 2002.
Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin.
Keyword Search on Structured and Semistructured
Data. SIGMOD09, 2009.
Vagelis Hristidis, Yannis Papakonstantinou.
DISCOVER: Keyword Search in Relational Databases.
VLDB02, 2002.
Vagelis Hristidis, Yannis Papakonstantinou,
Andrey Balmin. Keyword Proximity Search on XML
Graphs. ICDE03, 2003.
Vagelis Hristidis, Luis Gravano, Yannis
Papakonstantinou. Efficient IR-Style Keyword
Search over Relational Databases. VLDB03, 2003.

74
References (Keywords)

Andrey Balmin, Vagelis Hristidis, Yannis
Papakonstantinou. ObjectRank: Authority-Based
Keyword Search in Databases. VLDB04, 2004.
Sara Cohen, Yaron Kanza, Benny Kimelfeld.
Interconnection Semantics for Keyword Search in
XML. CIKM05, 2005.
Varun Kacholia, Shashank Pandit, Soumen
Chakrabarti, S. Sudarshan. Bidirectional
Expansion for Keyword Search on Graph Databases.
VLDB05, 2005.
Benny Kimelfeld, Yehoshua Sagiv. Efficient
Engines for Keyword Proximity Search. WebDB05,
2005.
Benny Kimelfeld, Yehoshua Sagiv. Efficiently
Enumerating Results of Keyword Search. DBPL05,
2005.
Fang Liu, Clement Yu, Weiyi Meng, Abdur
Chowdhury. Effective Keyword Search in Relational
Databases. SIGMOD06, 2006.
Benny Kimelfeld, Yehoshua Sagiv. Finding and
Approximating Top-k Answers in Keyword Proximity
Search. PODS06, 2006.

75
References (Keywords)

Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin,
Xiao Zhang, Xuemin Lin. Finding Top-k Min-Cost
Connected Trees in Databases. ICDE07, 2007.
Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou.
SPARK: Top-k Keyword Query in Relational
Databases. SIGMOD07, 2007.
Hao He, Haixun Wang, Jun Yang, Philip S. Yu.
BLINKS: Ranked Keyword Searches on Graphs.
SIGMOD07, 2007.
Alexander Markowetz, Ying Yang, Dimitris
Papadias. Keyword Search on Relational Data
Streams. SIGMOD07, 2007.
Konstantin Golenberg, Benny Kimelfeld, Yehoshua
Sagiv. Keyword Proximity Search in Complex Data
Graphs. SIGMOD08, 2008.
Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong
Wang, Lizhu Zhou. EASE: An Effective 3-in-1
Keyword Search Method for Unstructured,
Semi-structured and Structured Data. SIGMOD08,
2008.
Quang Hieu Vu, Beng Chin Ooi, Dimitris Papadias,
Anthony K. H. Tung. A Graph Method for
Keyword-based Selection of the top-K
Databases. SIGMOD08, 2008.
