Keyword Search in RDB - PowerPoint PPT Presentation

1 / 75
About This Presentation
Title:

Keyword Search in RDB

Description:

One of Bernstein's colleagues, Duane, recently published a paper about skyline query processing. ... (S) - stores the information about keywords at different ... – PowerPoint PPT presentation

Number of Views:390
Avg rating:3.0/5.0
Slides: 76
Provided by: JimKurosea309
Category:
Tags: rdb | keyword | run | search | style | up

less

Transcript and Presenter's Notes

Title: Keyword Search in RDB


1
Keyword Search in RDB
  • Jeffrey Xu Yu
  • Chinese Univeristy of Hong Kong
  • yu_at_se.cuhk.edu.hk

Acknowledgement many slides used in this talk
are originally taken from the authors
presentations given in the conferences.
2
Traditional Data Access Methods (SIGMOD09
Tutorial)
  • Databases / XML data
  • Structured, with rich meta-data
  • Accessed by query languages
  • High search quality
  • Small user population that masters DB
  • Text documents
  • Unstructured
  • Accessed by keywords
  • Limited search quality
  • Large user population

2
SIGMOD09 Tutorial
2010-1-2
3
The Challenges of Accessing Structured Data
  • Query languages long learning curves
  • Schemas Complex, evolving, or even unavailable.
  • What about filling in query forms?
  • Limited access pattern.
  • Hard to design and maintain forms on dynamic and
    heterogeneous data!

select paper.title from conference c, paper p,
author a1, author a2, write w1, write w2
where c.cid p.cid AND p.pid w1.pid
AND p.pid w2.pid AND w1.aid a1.aid AND w2.aid
a2.aid AND a1.name John AND a2.name
Mary AND c.name SIGMOD
The usability of DB is severely limited unless
easier ways to access databases are developed
Jagadish, SIGMOD 07.
3
SIGMOD09 Tutorial
2010-1-2
4
Supporting Keyword Search on DB Advantages /1
  • Easy to use
  • The most important factor for the majority of
    users.
  • The same advantage of keyword search on text
    documents

4
SIGMOD09 Tutorial
2010-1-2
5
Supporting Keyword Search on DB Advantages /2
  • Enabling interesting or unexpected discoveries
  • Relevant data pieces that are scattered but are
    collectively relevant to the query should be
    automatically assembled in the results
  • Larger scope for data inter-connection

Seltzer, Berkeley
Is Seltzer a student at UC Berkeley?
Seltzer is a developer of Berkeley DB.
Wow.
5
SIGMOD09 Tutorial
2010-1-2
6
Supporting Keyword Search on DB Advantages /3
  • Returning meaningful results by exploiting
    structural information.
  • An unique opportunity in structured data

Query Bernstein, skyline
Structured Document
Such a result will have a low rank.
Text Document
scientist
scientist
Bernstein is a computer scientist.......... One
of Bernsteins colleagues, Duane, recently
published a paper about skyline query processing.
publications
name
publications
name
paper
Bernstein
paper
Duane
title
title
skyline
model management
6
SIGMOD09 Tutorial
2010-1-2
7
Supporting Keyword Search on DB Summary of
Advantages
  • Increasing the DB usability
  • Increasing the coverage and quality of keyword
    search

7
SIGMOD09 Tutorial
2010-1-2
8
Supporting Keyword Search on DB Challenges /1
  • Semantics keyword queries are ambiguous
  • How to infer the query semantics and find
    relevant answers?
  • How to effectively rank the results in the order
    of their relevance?
  • How to help users analyze results?
  • How to evaluate the quality of search results?

8
SIGMOD09 Tutorial
2010-1-2
9
Supporting Keyword Search on DB Challenges /2
  • Efficiency
  • Many problems in keyword search on DB are shown
    to be NP-hard.
  • Generating results, query segmentation, snippet
    generation, etc.,
  • Large datasets
  • How to generate (top-k) query results efficiently?

9
SIGMOD09 Tutorial
2010-1-2
10
Keyword Search on DB State-of-the Art
  • Keyword search on DB has become a hot research
    direction, and attracted researchers in DB, IR,
    theory, etc
  • More than 50 research papers, from both research
    labs and universities in major database
    conferences/journals
  • Workshop about keyword search on DB (KEYS, June
    28, 09)

and counting...
10
SIGMOD09 Tutorial
2010-1-2
11
Keyword Search in RDBs
  • Report on the DB/IR Panel at SIGMOD 2005 by Sihem
    Amer-Yahia and Pat Case (SIGMOD Record)
  • Internet search engines have popularized
    keyword-based search.
  • DBMSs do not support IR-style keyword-based
    search.
  • An Example
  • Keywords Programming by Ritchie
  • For result, rows need to be generated by joining
    tables on the fly (all possible combinations)

Authors
AuthorsBooks
Books
BookStores
Store
12
DBXplorer S. Agrawal et al. ICDE02
  • Given a set of query keywords, DBXplorer returns
    all rows (either from single tables, or by
    joining tables connected by foreign-key joins)
    such that the each row contains all keywords.
  • IR techniques use Inverted Lists Symbol Table
    in databases

13
Symbol Table Design
  • Symbol Table (S) - stores the information about
    keywords at different granularities (column/row),
    i.e. for each keyword it stores the list of all
    rows
  • Column Level granularity (Pub-Col)
  • For every keyword S it maintains a list of all
    database columns (i.e. table.column)
  • Cell level granularity (Pub-cell)
  • For every keyword S it maintains a list of all
    database cells (i.e. table.column.rowid)

14
Storing Symbol Table
  • Store symbol tables (pub-col) in database as
    (keyword hash, column Id)
  • FK Compression (Foreign Key)
  • If there is foreign key relationship between c1
    and c2, store only c1
  • CP Compression
  • Partition H into a minimum number of bipartite
    cliques (a bipartite clique is any subgraph of H
    with a maximal number of edges).
  • Compress each clique.
  • Stores symbol table (pub-cell) in database as
    (keyword hash, list of all cellids)

v2 v3 v4
c1 c2
x
Uncompressed hash table
ColumnsMap table
Compressed hash table
15
Search - Enumerating Join Trees
  • Step1 - Looks up symbol table to find tables /
    columns which contain keywords
  • Step2 - Enumerate join trees
  • Identify and enumerate all potential subsets of
    tables in the database that, if joined, might
    contain rows having all keywords.
  • The resulting relation will contain all potential
    rows having all keywords specified in the query.

keywords
  • If it views the schema graph G as an undirected
    graph, this step enumerates join trees, i.e.,
    sub-trees of G such that
  • the leaves belong to MatchedTables
  • together, the leaves contain all keywords of the
    query

Join Trees
16
Search Identify matching rows
  • The input to this final search step is the
    enumerated join trees.
  • Each join tree is then mapped to a single SQL
    statement that joins the tables as specified in
    the tree, and selects those rows that contain all
    keywords.
  • The retrieved rows are ranked before being
    output.
  • Rows ranked by number of joins involved (ties
    broken arbitrarily) (same as keywords occurring
    close to one another in documents are ranked
    higher)

Join Trees
17
Generalized Matches Token Matches
  • Token matches - the keyword in the query matches
    only a token or sub-string of an attribute value
    (e.g., retrieve rows of address by specifying
    only a street name).
  • Pub-Prefix method
  • B tree indexes can be used to retrieve rows
    whose cell matches a given prefix string
  • This clause is of the form
  • WHERE T.C LIKE PK
  • During publishing of a database, for every
    keyword K, the entry (hash(K), T.C, P) is kept in
    the symbol table if there exists a string in
    column T.C which
  • contains a token K, and
  • has prefix P

18
Generalized Matches - Token Matches
Let the hash values of the searchable tokens
i.e., string, ball and round be 1, 2 and 3
respectively
Pub-Prefix table
Database table T
Consider searching keyword string Pub-Prefix
table returns prefixes th and no and
subsequent SQL will contain (T.C LIKE
nostring) OR (T.C LIKE thstring)
19
DISCOVER Hristidis et al. VLDB02
  • Result is tree T of tuples where
  • each edge corresponds to a primary-foreign key
    relationship
  • every keyword contained in a tuple of T (total)
  • no tuple of T is redundant (minimal)

20
Example - Data
21
Example Keyword Query
Query Smith, Miller
22
Example Keyword Query
Results
Query Smith, Miller
23
Example Keyword Query
Results
Query Smith, Miller
24
Architecture
User
25
Architecture
26
Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
27
Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN1 OSmith ? C ? OMiller size2
28
Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN1 OSmith ? C ? OMiller size2
CN2 OSmith ? C ? N ? C ? OMiller size4
29
Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN3 OSmith ? C ? OMiller ? C size3
  • The CN3 is not minimal, because the rightmost C
    does not contain a key.

30
Candidate Network - Example
ORDERS Smith
n1
n1
n1
CUSTOMER
NATION
ORDERS
n1
ORDERS Miller
CN4 OSmith ? C ? O ? C ? OMiller size4
  • c1 ? c2 , because primary to foreign key from
    CUSTOMER to ORDERS (an answer may contain the
    same tuple twice not a tree)
  • Pruning Condition RK?S?RL

31
Candidate Networks Generator - Algorithm
  • Traverse tuple set graph breadth first
  • Q ? tuple sets containing keyword k1
  • For each network n of tuple sets in Q do
  • If pruning_condition(n) drop n
  • else if is_CN(n) output n
  • else expand n by one tuple set to all possible
    directions in tuple set graph and insert
    expansions to Q
  • eg if n is OSmith ? C then we add to Q
  • OSmith ? C ? OMiller, OSmith ? C ? O, OSmith ?
    C ? N

32
Candidate Networks Generator is Complete and
Non-Redundant
  • The set of Candidate Networks generated is
  • Complete All solutions generated by a CN
  • Non-redundant There is database instance, where
    by removing a CN a solution is lost

33
Architecture
34
Execution Plan
  • Each CN corresponds to a SQL statement
  • CN1 OSmith ? C ? OMiller
  • CN2 OSmith ? C ? N ? C ? OMiller
  • Execution Plan
  • CN1 ? OSmith ?? C ?? OMiller
  • CN2 ? OSmith ?? C ?? N ?? C ?? OMiller

35
Reuse Common Subexpressions - Example
  • Execution Plan
  • CN1 ? OSmith ?? C ?? OMiller
  • CN2 ? OSmith ?? C ?? N ?? C ?? OMiller
  • Optimized Execution Plan
  • Temp ? OSmith ?? C
  • CN1 ? Temp ?? OMiller
  • CN2 ? Temp ?? N ?? C ?? OMiller

36
KS in RDB Streams Markowetz et al. SIGMOD07
Query Tarantino, Travolta
37
Data Graph G
  • Nodes Tuples
  • Edges can be joined

38
MTJNT
  • Sub-graph of G
  • Contain all keywords
  • Minimal
  • Answer R-KWS query
  • Limited to Tmax nodes
  • Longer joins irrelevant results

39
Candidate Networks (CN)
  • Abstractions of MTJNT

40
Operator Trees for Candidate Networks
CN
  • Leaves Selections
  • Inner nodes Joins

OP-Tree
Output MTJNT
41
Instantaneous Data Graph
42
Operator Mesh
43
Demand Driven Operator Execution (I)
  • d at least one parent is running
  • r right input is not empty

44
Demand Driven Operator Execution (II)
45
BANKS-I/II (ICDE02,VLDB05)
  • Keyword Search on (Directed) Graphs

paper
Multi-Query Optimization
E.g., Sudarshan Roy
writes
writes
author
author
Prasan Roy
Sudarshan
46
Ranking
  • Edge Score EA
  • Smaller tree gt higher score
  • EA 1/ (S edge weights)
  • Node Score NA
  • Measure of authority of nodes in tree
  • NA S (leaf and root node authorities)
  • Overall score f (EA, NA)
  • f (EA, NA) EA . NAl

47
Finding Answer Trees
  • Intuition travel backwards from keyword nodes
    till you hit a common node

Query sudarshan roy
MultiQuery Optimization
paper
writes
Sudarshan
Prasan Roy
authors
48
Backward Search Algorithm
  • Run concurrent single source shortest path
    iterators from each node matching a keyword
  • Traverse the graph edges in reverse direction
  • Output next nearest node on each get-next() call
  • Do best-first search across iterators
  • Output node if in the intersection of sets of
    nodes reached from each keyword

49
Backward Search Limitations
  • Wasteful exploration of graph
  • Frequently occurring keywords
  • Hub nodes in the graph (high in-degree)

Shashank Sudarshan Database

Schema Legend
Database

author
writes
paper
Shashank
Sudarshan
50
Bidirectional Search Motivation
51
Bidir Search Intuition
  • First cut solution
  • Dont go backward if a keyword matches many nodes
  • Dont go backward if a node points to a hub
  • Instead explore forward from other keywords

52
Bidir Search Example
Shashank Sudarshan Database


Database
Schema Legend

author
writes
Shashank
Sudarshan
paper
53
Bidir Search Issues
  • What should threshold for not expanding be?
  • The proposed solution prioritize expansion of
    nodes based on spreading activation
  • to penalize frequent keywords and bushy trees
  • How to manage exploration in both directions?

54
Bidir Search Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Every node given an initial activation
  • Gives low activation to frequently occurring
    keywords

1/5
1/5
1/5
1/5
1/5
John
55
Bidir Search Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Activation spread to neighbors (µ 0.3)
  • Gives low activation to neighbors of hubs

0.7 x 1/5 x 1/4
0
1
1/5
1
0.7 x 1/5 x 1/4
0
1
0
0.7 x 1/5 x 1/4
0.3 x 1/5
1
0.7 x 1/5 x 1/4
0
56
Bidir Search Iterators
  • How to manage exploration in both directions?
  • Single backward iterator single forward
    iterator w/ suitable datastructures
  • E.g., to keep track of parents of nodes

Dist from A, Dist from B
7
6
8,8
2,3 8
8,8 2
2,8

8,1
8,1
1,8
3
4
5
0,8
8,0
2
1
A
B
57
Bidir Search top-k results
  • Results need not be generated in-order
  • Naïve solution
  • Store results in an intermediate heap
  • Output top k results after mk total results have
    been generated (m 10)

58
Minimum Group Steiner Tree
59
An Example
60
An Example
61
An Example
62
Dynamic Programming Ding et al. ICDE07
  • A Naïve Approach

63
Dynamic Programming Equation
64
Dynamic Programming Equation
65
The Order to Compute T(v,p)
66
BLINKS He et al. SIGMOD07
  • Score definition
  • For an answer T ?r,(n1,,nm)? to a query q
    (w1, , wm), the score is defined as S(T) f(
    Sr(r) ? Sn(ni, wi) ? Sp(r, ni) )
  • Considers both content and graph structure
  • Match-distributive property
  • Contribution of matches and root-match paths can
    be computed in a distributive manner by summing
    over all matches
  • Allow pre-computation of best path, independently
    for each node/keyword
  • Graph-distance property
  • The contribution of a root-match path, Sp(r,ni),
    is defined to be shortest-path distance from r to
    ni
  • To simplify presentation, we focus on the path
    contribution ?Sp(r,ni)

paths from root to matches
67
Graph Search Strategies
  • Backward search Bhalotia et al., ICDE02
  • Starting from keyword nodes (containing at least
    one query keyword)
  • In each search step, choose an incoming edge to a
    previously visited node and follow the edge
    backward to visit its source node
  • Discover an answer root r if r is visited from
    every keyword
  • Bidirectional search Kacholia et al., VLDB05
  • Explore the graph by following forward edges as
    well
  • Choose which node to visit by heuristic
    activation factors

w1
Conceptually, expand clusters of visited nodes
for each keyword
w2
Graph
68
Graph Search Strategies (contd)
  • Each search step needs to decide
  • Which node to expand within a cluster
  • Which keyword cluster to expand
  • New approach
  • Equi-distance expansion in each cluster
  • Cost-balanced expansion across clusters balance
    of nodes expanded across clusters
  • Cost is at most m times that of an oracle
    backward search algorithm (m of query
    keywords)

? Equi-distance expansion node closest to
cluster origin in graph distance
Optimal
No Guarantee
? Distance-balanced expansion balance diameter
across all clusters
Assume 3 keywords w1 w2 w3
Optimal
m-optimal
69
Using a Single-Level Index
  • What is inefficient with search without index?
  • Needs to maintain, for each keyword, a priority
    queue storing nodes in current expansion
    frontier ? High space/time complexity
  • Existing forward expansion is largely guesswork
  • New ideas
  • (I) For each keyword, index nodes in the order of
    visiting them in search Keyword-node lists
  • For each keyword w, a list LKN(w) contains nodes
    that can reach w, ordered by their shortest
    distances to w
  • (II) Index shortest distances from nodes to
    keywords, enabling forward jumps Node-keyword
    map
  • Given node u and keyword w, a hash map MNK (u,w)
    returns the shortest distance from u to w in
    O(1) time

v1
v2
v3
v4
v5
v6
v7
LKN(w1)

0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
MNK(v2,w1)
0, v2, v2
MNK(v2,w2)
1, v4, v4

70
Search with Single-Level Index
  • Search algorithm using the single-level index,
    applying the search strategies
  • Equi-distance expansion ? Use one cursor to
    traverse each LKN(wi)
  • Cost-balanced expansion ? Pick the cursor to
    expand next in a round-robin manner
  • Forward expansion ? When visiting a node, look up
    its distances to other keywords by MNK
  • Efficiency
  • Managing exploration states by m cursors instead
    of m priority queues

Keyword-node lists
v1
LKN(w1)

0, v2, v2, v2
0, v7, v7, v7
1, v6, v7, v7
v2
1
v3
LKN(w2)

0, v4, v4, v4
0,v11,v11,v11
1, v2, v4, v4
v4
v5
v8
3
v6
v9
Node-keyword map
v7
MNK
v10
Partial Answers
v12
v11
ltv2,(0, ?)gt
ltv4,(?, 0)gt
Answers
ltv2,(0, 1)gt
ltv6,(1, 2)gt
71
Bi-Level Indexing in BLINKS
  • Single-level index is impractical for large
    graphs
  • Space complexity O(VK) where K is the number
    of keywords
  • BLINKS Bi-Level Index for Keyword Search
  • Partition a data graph into multiple, say B,
    subgraphs, or blocks
  • Partitioning by nodes, called portals, which will
    play key roles in search
  • There are many partitioning algorithms, such as
    Breadth-first and METIS
  • (Top-level) block index map keywords and portals
    to blocks
  • Purpose Initiate backward expansion in relevant
    blocks guide backward expansion across blocks
    (through portals)
  • (Low-level) intra-block index store similar
    information as in a single-level index, but
    restricted to within each block
  • Purpose Help backward expansion and forward
    jumps within blocks

72
Search with the Bi-Level Index
  • Similar to searching with single-level index in
  • Overall expansion policies (which keyword
    cluster/node to explore next)
  • Index access (scanning LKN lists and looking-up
    MNK hash map)
  • New challenges/complications introduced by graph
    partitioning
  • A single cursor for a keyword is no longer
    sufficient
  • Need simultaneously backward expansion in
    multiple blocks that contains the keyword
  • So it maintains a queue of cursors, one for each
    block we are currently exploring
  • Backward expansion needs to continue across block
    boundaries
  • When encountering boundaries, it retrieves new
    blocks to visit from the block index and add them
    to the queue
  • Distance information in the intra-block index ?
    global shortest distance
  • The path with the shortest distance may happen to
    go across blocks
  • Our exploration order guarantees correct global
    shortest distance

73
References (Keywords)
  • Roy Goldman, Narayanan Shivakumar, Suresh
    Venkatasubramanian, Hector Garcia-Molina
    Proximity Search in Databases, VLDB99, 1999.
  • Bettina Berendt, myra Spiliopoulou Analysis of
    navigation behaviour in web sites integrating
    multiple information systems, VLDBJ, 2000.
  • Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
    Soumen Chakrabarti, S. Sudarshan Keyword
    Searching and Browsing in Database using BANKS,
    ICDE02, 2002.
  • Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
    DBXplorer A System for Keyword-Based Search over
    Relational Databases, ICDE02, 2002.
  • Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin
    Keyword Search on Structured and Semistructured
    Data, SIGMOD09, 2009.
  • Vagelis Hristidis, Yannis Papakonstantinou
    DISCOVER Keyword Search in Relational Databases,
    VLDB02, 2002.
  • Vagelis Hristidis, Yannis Papakonstantinou,
    Andrey Balmin Keyword Proximity Search on XML
    Graps, ICDE03, 2003.
  • Vagelis Hristidis, Luis Gravano, Yannis
    Papakonstantinou Efficient IR-Style Keyword
    Search over Relational Databases, VLDB03, 2003.

74
References (Keywords)
  • Andrey Balmin, Vagelis Hristidis, Yannis,
    Papakonstantinou ObjectRank Authority-Based
    Keyword Search in Databases, VLDB04, 2004.
  • Sara Cohen, Yaron Kanza, Benny Kimelfeld
    Interconnection Semantics for Keyword Search in
    XML, CIKM05, 2005.
  • Varun Kacholia, Shashank Pandit, Soumen
    Chakrabarti, S. Sudarshan Bidirectional
    Expansion for Keyword Search on Graph Databases,
    VLDB05, 2005.
  • Benny Kimelfeld, Yehoshua Sagiv Efficient
    Engines for Keyword Proximity Search, WebDB05,
    2005.
  • Benny Kimelfeld, Yehoshua Sagiv Efficiently
    Enumerating Results of Keyword Search, DBLP05,
    2005.
  • Fang Liu, Clement Yu, Weiyi Meng, Abdur
    Chowdhury Effective Keyword Search in Relational
    Databases, SIMOD06, 2006.
  • Benny Kimelfeld, Yehoshua Sagiv Finding and
    Approximating Top-k Answersin Keyword Proximity
    search, PODS06, 2006.

75
References (Keywords)
  • Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin,
    Xiao Zhang, Xuemin Lin Finding Top-k Min-Cost
    Connected Trees in Databases, ICDE07, 2007.
  • Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou
    SPARK Top-k Keyword Query in Relational
    Databases, SIGMOD07, 2007.
  • Hao He, Haixun Wang, Jun Yang, Philip S. Yu
    BLINKS Ranked Keyword Searches on Graphs,
    SIGMOD07, 2007.
  • Alexander Markowetz, Ying Yang, Dimitris
    Papadias Keyword Search on Relational Data
    Streams, SIGMOD07, 2007.
  • Konstantin Golenberg, Benny Kimelfeld, Yehoshua
    Sagiv Keyword Proximity Search in Complex Data
    Graphs, SIGMOD08, 2008.
  • Guoliang Li, BengChin Ooi, Jianhua Feng, Jianyong
    Wang, Lizhu Zhou EASE An Effective 3-in-1
    Keyword Search Method for Unstructured,
    Semi-structured and Structured Data, SIGMOD08,
    2008.
  • Guang Hleu Vu, Beng Chin Ooi, Dimitris Papadias,
    Anthony K. H. Tung A Graph Method for
    Keyword-based Selection of the top-K
    Databases,SIGMOD08, 2008.
Write a Comment
User Comments (0)
About PowerShow.com