Title: Bidirectional Expansion for Keyword Search on Graph Databases
1Bidirectional Expansion for Keyword Search on Graph Databases
Varun Kacholia
Shashank Pandit Soumen Chakrabarti
S. Sudarshan Rushi Desai
Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
2Keyword Search on Graph Representation of Data
- Keyword search on relational, XML, HTML, etc. data
- BANKS, Discover, DBXplorer, XRank, etc.
- Need to find a (closely) connected set of nodes that together match all given keywords
- Focus of our work
- Search algorithms to find connections between nodes
3Outline
- Data, Query and Response Models
- Backward Search Algorithm
- Bidirectional Search Algorithm
- Experiments
- Related Work
- Conclusions
4Graph Data Model
- Data modeled as a directed weighted graph (BANKS, ICDE02)
- Can model relational, XML, HTML, etc. data
- E.g., DBLP database
- Node: tuple
- Edge: foreign key reference
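A minimal sketch of this model, assuming a tiny hypothetical DBLP-like instance: each tuple becomes a node and each foreign key reference becomes a directed edge. The table contents, the `build_graph` helper, and the uniform edge weight are illustrative assumptions; the slide does not say how BANKS assigns weights.

```python
# Hypothetical tuples: a node is identified by (table name, primary key).
papers = [("paper", "p1")]
authors = [("author", "a1"), ("author", "a2")]
# Each writes tuple: (table, key, referenced author key, referenced paper key).
writes = [("writes", "w1", "a1", "p1"), ("writes", "w2", "a2", "p1")]

def build_graph(papers, authors, writes, weight=1.0):
    """Return {node: [(neighbor, weight), ...]} with one edge per foreign key reference."""
    graph = {node: [] for node in papers + authors}
    for table, key, author_key, paper_key in writes:
        # The writes tuple references an author and a paper, so it points at both.
        graph[(table, key)] = [
            (("author", author_key), weight),
            (("paper", paper_key), weight),
        ]
    return graph

graph = build_graph(papers, authors, writes)
```

Here the two `writes` nodes each carry two outgoing edges, while the paper and author nodes have none.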
5Graph Data Model (2)
- E.g., XML data
  <proceedings>
    <paper id="1">
      <title>Databases</title>
    </paper>
    <paper id="2">
      <title>Keyword Search</title>
      <cite ref="1">Databases</cite>
    </paper>
  </proceedings>
[Figure: graph of the XML above, with nodes proceedings, paper (@id 1), paper (@id 2), title, title, and cite]
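As a rough illustration of how the XML fragment on this slide could be turned into a graph, the sketch below makes each element a node, adds a containment edge from parent to child, and adds a reference edge from a `cite` element to the `paper` whose `id` matches its `ref`. The edge labels `"contains"` and `"ref"` and the helper name `xml_to_graph` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

XML = """<proceedings>
  <paper id="1"><title>Databases</title></paper>
  <paper id="2"><title>Keyword Search</title><cite ref="1">Databases</cite></paper>
</proceedings>"""

def xml_to_graph(xml_text):
    """Return (labels, edges): labels[i] is the tag of node i, edges are (src, dst, kind)."""
    root = ET.fromstring(xml_text)
    elems = list(root.iter())  # document order, root first
    index = {id(e): i for i, e in enumerate(elems)}
    labels = [e.tag for e in elems]
    # Map each id attribute value to the node that carries it.
    by_id = {e.get("id"): index[id(e)] for e in elems if e.get("id") is not None}
    edges = []
    for e in elems:
        for child in e:
            edges.append((index[id(e)], index[id(child)], "contains"))
        ref = e.get("ref")
        if ref in by_id:
            edges.append((index[id(e)], by_id[ref], "ref"))
    return labels, edges

labels, edges = xml_to_graph(XML)
```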
6Response Model
- Response: minimal, rooted tree connecting keyword nodes
- Undirected: Discover, DBXplorer
- Directed: BANKS
- E.g., query "Sudarshan Roy"
[Figure: answer tree for the query "Sudarshan Roy", with paper node "Multi-Query Optimization", two writes nodes, and author nodes "Prasan Roy" and "Sudarshan"]
7Response Ranking
- Edge score $E_A$
- Smaller tree => higher score
- E.g., BANKS: $E_A = 1 / \sum (\text{edge weights})$
- Node score $N_A$
- Measure of authority of nodes in tree
- E.g., BANKS: $N_A = \sum (\text{leaf and root node authorities})$
- Overall score $f(E_A, N_A)$
- E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A^{\lambda}$
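The BANKS-style ranking on this slide can be written out as a small sketch: the edge score is the reciprocal of the total edge weight, the node score is the sum of leaf and root authorities, and the two are combined multiplicatively with an exponent on the node score. The default exponent `lam=0.2` and the sample numbers are assumptions for illustration; the slide does not give a value.

```python
def edge_score(edge_weights):
    """Reciprocal of the total edge weight; smaller trees score higher."""
    return 1.0 / sum(edge_weights)

def node_score(authorities):
    """Sum of the authority values of the leaf and root nodes."""
    return sum(authorities)

def overall_score(edge_weights, authorities, lam=0.2):
    """Edge score times node score raised to lam (lam is a hypothetical default)."""
    return edge_score(edge_weights) * node_score(authorities) ** lam

# A tree with four unit-weight edges and three scored nodes (made-up values).
score = overall_score([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 1.0])
```

A tree with no edges (a single node matching every keyword) would need separate handling, since the sum of edge weights is zero.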
8Finding Answer Trees
- Backward Expanding Search
- (BANKS, ICDE02)
- Intuition: travel backwards from keyword nodes till you hit a common node
- E.g., query "sudarshan roy"
[Figure: backward expansion from author nodes "Sudarshan" and "Prasan Roy" through writes nodes to the common paper node "MultiQuery Optimization"]
9Backward Search Algorithm
- Algorithm
- Run concurrent single source shortest path iterators from each node matching a keyword
- Traverse the graph edges in reverse direction
- Output next nearest node on each get-next() call
- Do best-first search across iterators
- Output node if in the intersection of sets of nodes reached from each keyword
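The steps above can be sketched roughly as follows. This is a simplified illustration, not the BANKS implementation: one shortest-path iterator per keyword node is simulated with a single shared heap, which gives best-first order across iterators by distance, and a node is reported once it has been reached from every keyword. The function name, the edge-list input format, and the use of string node names are assumptions.

```python
import heapq
from collections import defaultdict

def backward_search(edges, keyword_nodes):
    """edges: (u, v, w) for a directed edge u -> v with w >= 0.
    keyword_nodes: one set of matching nodes per keyword.
    Yields (root, total path weight to the nearest match of each keyword)."""
    in_adj = defaultdict(list)
    for u, v, w in edges:
        in_adj[v].append((u, w))  # edges are traversed in reverse
    n = len(keyword_nodes)
    # Heap entries: (distance, keyword index, origin node, current node).
    heap = [(0.0, i, s, s) for i, nodes in enumerate(keyword_nodes) for s in sorted(nodes)]
    heapq.heapify(heap)
    settled = set()           # (keyword, origin, node) already output by that iterator
    best = defaultdict(dict)  # node -> {keyword index: shortest distance}
    emitted = set()
    while heap:
        d, i, origin, node = heapq.heappop(heap)
        if (i, origin, node) in settled:
            continue
        settled.add((i, origin, node))
        if i not in best[node]:  # pops come in distance order, so the first is nearest
            best[node][i] = d
        if len(best[node]) == n and node not in emitted:
            emitted.add(node)
            yield node, sum(best[node].values())
        for u, w in in_adj[node]:
            if (i, origin, u) not in settled:
                heapq.heappush(heap, (d + w, i, origin, u))

roots = list(backward_search([("r", "a", 1.0), ("r", "b", 2.0)], [{"a"}, {"b"}]))
```

With one iterator per matching node, a keyword that matches many nodes creates many iterators, which is the cost the following slides set out to reduce.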
10Backward Search Limitations
- Wasteful exploration of graph
- Frequently occurring keywords
- Hub nodes in the graph (high in-degree)
[Figure: query "Shashank Sudarshan Database" on a graph with author, writes, and paper nodes (schema legend); labeled nodes include Database, Shashank, and Sudarshan]
11Bidirectional Search Motivation
12Bidir Search Intuition
- First cut solution
- Don't go backward if keyword matches many nodes
- Don't go backward if node points to a hub
- Instead explore forward from other keywords
13Bidir Search Example
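[Figure: bidirectional search example for the query "Shashank Sudarshan Database" on a graph with author, writes, and paper nodes (schema legend); labeled nodes include Database, Shashank, and Sudarshan]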
14Bidir Search Issues
- What should the threshold for not expanding be?
- Our solution: prioritize expansion of nodes based on spreading activation
- To penalize frequent keywords and bushy trees
- How to manage exploration in both directions?
15Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
- Gives low activation to frequently occurring keywords
[Figure: keyword "John" matching five nodes, each with initial activation 1/5]
16Bidir Search Spreading Activation
- Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (µ = 0.3)
- Gives low activation to neighbors of hubs
[Figure: a node with activation 1/5 and four neighbors; after spreading, the node is labeled 0.3 x 1/5 and each neighbor 0.7 x 1/5 x 1/4]
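The arithmetic in the two spreading-activation slides can be reproduced with a short sketch. It assumes, following the figure labels, that each node matching a keyword starts with 1 divided by the number of matching nodes, that an expanded node keeps a fraction µ of its activation, and that the remaining 1 - µ is split equally among its neighbors. The equal split and the function names are assumptions; the slides do not say how edge weights affect the split.

```python
def initial_activation(matching_nodes):
    """Each node matching a keyword starts with 1 / (number of matching nodes)."""
    return {u: 1.0 / len(matching_nodes) for u in matching_nodes}

def spread(activation, node, neighbors, mu=0.3):
    """Return a new activation map after `node` spreads to `neighbors`.
    The node keeps mu of its activation; the rest is divided equally (an assumption)."""
    result = dict(activation)
    a = result.get(node, 0.0)
    if not neighbors:
        return result  # nothing to spread to, so the node keeps everything
    result[node] = mu * a
    share = (1.0 - mu) * a / len(neighbors)
    for v in neighbors:
        result[v] = result.get(v, 0.0) + share
    return result

act = initial_activation(["j1", "j2", "j3", "j4", "j5"])  # 1/5 each
after = spread(act, "j1", ["n1", "n2", "n3", "n4"])       # four neighbors
```

A frequent keyword dilutes the starting activation, and a high-degree node dilutes each neighbor's share, so both are explored later in a highest-activation-first order.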
17Bidir Search Iterators
- How to manage exploration in both directions?
- Single backward iterator + single forward iterator w/ suitable data structures
- E.g., to keep track of parents of nodes
- Details in full paper
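[Figure: example graph with keyword nodes A and B; each node is annotated with (dist from A, dist from B), with ∞ where a keyword has not yet been reached, e.g. A = (0, ∞) and B = (∞, 0)]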
18Bidir Search Algorithm
- Algorithm
- Activate matching nodes, insert into backward iterator
- while (iterators are not empty)
- Choose iterator for expansion in best-first manner
- Explore node with highest activation
- Spread activation to neighbors
- Update path weights (and other data structures)
- Propagate values to ancestors if necessary
- Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
- Stop when top-k results are produced
19Bidir Search top-k results
- Results need not be generated in-order
- Naïve solution
- Store results in an intermediate heap
- Output top k results after m·k total results have been generated (m = 10)
- Can do better
- Compute upper bound on score of next result; output answers with a higher score
- Similar to NRA algorithm (Fagin et al., PODS01)
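The naive buffering strategy can be sketched in a few lines, assuming results arrive as (score, answer) pairs in generation order. The function name and the input format are illustrative assumptions.

```python
import heapq

def naive_top_k(results, k, m=10):
    """Buffer up to m*k results from `results` (an iterable of (score, answer)
    pairs in generation order), then return the k highest-scoring ones."""
    buffer = []
    for score, answer in results:
        buffer.append((score, answer))
        if len(buffer) >= m * k:
            break
    return heapq.nlargest(k, buffer)

generated = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (0.1, "e")]
top = naive_top_k(generated, k=2, m=2)  # looks at the first 4 results only
```

The weakness is visible in the sketch: nothing is output until m·k results exist, and a better answer generated later is missed, which is why a score upper bound is the preferred approach.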
20Experiments
- Datasets
- DBLP, IMDB: 2 million nodes, 9 million edges
- US Patent DB: 4 million nodes, 15 million edges
- Workload
- Keywords randomly picked from results of SQL join statements
- Search algorithms
- MI-Bkwd: original backward search
- Iterator for every node matching a keyword
- SI-Bkwd: backward search with single backward iterator
- Bidirec: bidirectional search
- Time taken/nodes explored
- Measured when 10th answer is generated (or last answer if answers < 10)
- Origin size
- Number of nodes matched by keywords in the query
21Experiments (2)
- MI-Bkwd versus SI-Bkwd
- SI-Bkwd gain increases with origin size and number of keywords
22Experiments (3)
- SI-Bkwd versus Bidirec
- Bidirec gain increases with origin size and number of keywords
23Experiments (4)
- Precision/Recall experiments
- Relevant answers are well-defined; can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
- Recall: 100%
- Precision: 100% at near full recall
- Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!
24Experiments (5)
- Comparison with Sparse
- (Hristidis et al., VLDB03)
- Generate join expressions leading to query results
- Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
- For top-k results, automatically determine required number of join expressions
- Sparse-LB
- Manually generate required join expressions
- Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
- Bidirectional outperforms by a factor of 3 (esp. when number of joins is large)
25Experiments (6)
- SI-Bkwd versus Bidirec by origin size
- Bidirec gains more with unbalanced origin sizes
[Figure legend, origin-size combinations: A (T,S,S,S), B (M,M,M,M), C (M,L,L,L), D (M,M,L,L), E (T,L,L,L), F (T,S,M,L), G (T,M,L,L), H (T,T,T,L)]
26Discussion
- Bidirectional search as dynamic per-tuple join ordering
- Related work in this area: Eddies
- Bidirectional search
- Schema-less
- Prioritization based on activation instead of selectivity
- Generate answers in relevance order
27Related Work
- Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
- Use SQL generation, without in-memory data structures
- Issues: generate join plans, re-use common sub-expressions, etc.
- Keyword querying on XML
- XRank (Cornell), Schema-Free XQuery (Michigan)
- Tree model is too limited
- ObjectRank
28Conclusions
- Graph model
- Convenient common denominator representation
- Schema-free querying leads to graph search
- Purely backward strategy inadequate
- Bidirectional search with spreading activation performs much better
- Dynamically choose join order on per-tuple basis
29Thank You!
Questions??
30Future of Keyword Search in DBs
- Next generation of intelligent search will require context information
- E.g. search email, files, calendar, ..
- Information integration will be important
- Graph structured data will be a key component
- Is there a killer app?
- Deep web?
- Display of answers
- Users don't want to see schema details
- Can we leverage off existing (Web) apps?
31BANKS Future Work
- Applications of BANKS
- Soumen Chakrabarti, Sunita Sarawagi and students
- Exploit BANKS to integrate different sources of data
- Extract information, infer soft links
- BANKS for personal information management
- SPIN: Search Personal Information Networks
- Ongoing/future work on BANKS
- More sysadmin/user control on ranking
- One size does not fit all
- BANKS provides infrastructure
- Characterize bidirectional search better
- And find other applications
- Security
32Bidir Search top-k results (2)
- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
- $m_i$: minimum path length explored backward from keyword $i$
- Unseen answer node: $1/(m_1 + m_2 + \dots + m_n)$
- Visited answer node: suppose reached from first $x$ keywords with distance $d_i$
- $1/\big((d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)\big)$
- Combine this with max node prestige
- We simply use $1/(m_1 + m_2 + \dots + m_n)$!
- Experiments show no significant loss in using this heuristic
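The two bounds on this slide can be written out as a small sketch. It covers only the edge-score part of the bound; combining it with the maximum node prestige is left out, and the function names, the strict comparison, and the sample numbers are assumptions.

```python
import math

def unseen_bound(m):
    """Bound for an answer rooted at a node not yet seen.
    m[i] is the minimum path length explored backward from keyword i so far."""
    total = sum(m)
    return math.inf if total == 0 else 1.0 / total

def visited_bound(d, m):
    """Bound for a node already reached from the first len(d) keywords at distances d;
    the remaining keywords contribute at least their current minima m[len(d):]."""
    total = sum(d) + sum(m[len(d):])
    return math.inf if total == 0 else 1.0 / total

def can_output(answer_edge_score, m):
    """Heuristic from the slide: release a buffered answer once its score exceeds
    the bound for unseen nodes."""
    return answer_edge_score > unseen_bound(m)

m = [1.0, 2.0, 2.0]  # made-up frontier distances for three keywords
```

With these values the unseen bound is 1/5, so a buffered answer with edge score 0.25 can be output immediately, while one with edge score 0.1 has to wait until the frontier moves further out.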