Keyword Search on Graph-Structured Data

Title: Keyword Searching and Browsing in databases using BANKS Author: S. Sudarshan Last modified by: S. Sudarshan Created Date: 9/14/2001 9:28:43 AM – PowerPoint PPT presentation

Slides: 59

1
Keyword Search on Graph-Structured Data

S. Sudarshan IIT Bombay

Joint work with Soumen Chakrabarti, Gaurav
Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K.,
Arvind Hulgeri, Bhavana Dalvi and Meghana
Kshirsagar
Jan 2009
2
Outline

Motivation and Graph Data Model
Query/Answer models
Tree answer model
Proximity queries
Graph Search Algorithms
Backward Expanding Search
Bidirectional Search
Search on external memory graphs
Conclusion

3
Keyword Search on Semi-Structured Data

Keyword search of documents on the Web has been
enormously successful
Much data is resident in databases
Organizational, government, scientific, medical
data
Deep web
Goal: querying of data from multiple data
sources, with different data models
Often with no schema or partially defined schema

4
Keyword Search on Structured/Semi-Structured Data

Key differences from IR/Web Search
Normalization (implicit/explicit) splits related
data across multiple tuples
To answer a keyword query we need to find a
(closely) connected set of entities that together
match all given keywords
E.g. "soumen crawling" or "soumen byron"

5
Graph Data Model

Lowest common denominator across many data models
Relational
node = tuple, edge = foreign key
XML
node = element, edge = containment/idref/keyref
HTML
node = page, edge = hyperlink
Documents
node = document, edge = links inferred by data
extraction
Knowledge representation
node = entity, edge = relationship
Network data
e.g. social networks, communication networks

6
Graph Data Model (Cont)

Nodes can have
labels
E.g. relation name, or XML tag
textual or structured (attribute-value) data
Edges can have labels
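
A minimal sketch of this data model, assuming a relational source where tuples become nodes and foreign keys become edges. The class, field, and edge-label names are illustrative and not taken from BANKS:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    label: str          # e.g. relation name or XML tag
    text: str = ""      # textual content used for keyword matching

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (source, target, edge label)

    def add_node(self, node_id, label, text=""):
        self.nodes[node_id] = Node(node_id, label, text)

    def add_edge(self, src, dst, label=""):
        self.edges.append((src, dst, label))

    def match(self, keyword):
        """Return ids of nodes whose text or label contains the keyword."""
        kw = keyword.lower()
        return sorted(n.node_id for n in self.nodes.values()
                      if kw in n.text.lower() or kw in n.label.lower())

# Hypothetical bibliographic fragment: two authors writing one paper.
g = Graph()
g.add_node("p1", "paper", "Focused Crawling")
g.add_node("a1", "author", "Soumen C.")
g.add_node("a2", "author", "Byron Dom")
g.add_node("w1", "writes")
g.add_node("w2", "writes")
g.add_edge("w1", "a1", "author_fk")
g.add_edge("w1", "p1", "paper_fk")
g.add_edge("w2", "a2", "author_fk")
g.add_edge("w2", "p1", "paper_fk")
```

Here `g.match("soumen")` finds the node for Soumen C., and `g.match("author")` finds both author nodes through their label.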

7
Outline

Motivation and Graph Data Model
Query/Answer models
Tree answer model
Proximity queries
Graph Search Algorithms
Backward Expanding Search
Bidirectional Search
Search on external memory graphs
Conclusion

8
Query/Answer Models

Basic query model
Keywords match node text/labels
Can extend query model with attribute
specification, path specifications
e.g. paper(year < 2005, title:xquery)
Alternative answer models
tree connecting nodes matching query keywords
nodes in proximity to (near) query keywords

9
Tree Answer Model

Answer: rooted, directed tree connecting keyword
nodes
In general, a Steiner tree
Multiple answers possible
Answer relevance computed from
answer edge score combined with
answer node score

[Figure: answer tree for the query "soumen byron": the paper "Focused Crawling" connects through two "writes" nodes to the authors Soumen C. and Byron Dom]
10
Answer Ranking

Naïve model: answers ranked by number of edges
Problem
Some tuples are connected to many other tuples
E.g. highly cited papers, popular web sites
Highly connected tuples create misleading
shortcuts
"six degrees of separation"
Solution: use directed edges with edge weights
allow answer tree to have edge u→v if original
graph has v→u, but at higher cost

11
Edge Weight Model-1

Forward edge weight (edge present in data)
Default to 1, can be based on schema
Lower weight ⇒ closer connection
Create extra backward edges v→u for each edge u→v
present in data
Edge weight ∝ log(1 + #edges pointing to v)
Overall answer-tree edge score EA = 1 / (Σ edge
weights)
Higher score ⇒ better result
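
The weighting scheme on this slide can be sketched as follows. This is an illustration under stated assumptions, not the BANKS implementation: the logarithm base (2), the proportionality constant (1), and the function names are all choices made for this example.

```python
import math
from collections import defaultdict

def build_edge_weights(data_edges, forward_weight=1.0):
    """Map (src, dst) -> weight for forward edges and derived backward edges."""
    indegree = defaultdict(int)
    for _, v in data_edges:
        indegree[v] += 1
    weights = {}
    for u, v in data_edges:
        # Forward edge u->v, present in the data.
        weights[(u, v)] = min(forward_weight, weights.get((u, v), math.inf))
        # Backward edge v->u; log base 2 and a proportionality constant
        # of 1 are assumptions of this sketch.
        backward = math.log2(1 + indegree[v])
        weights[(v, u)] = min(backward, weights.get((v, u), math.inf))
    return weights

def edge_score(tree_edges, weights):
    """Edge score = 1 / (sum of edge weights); a single-node tree scores infinity."""
    total = sum(weights[e] for e in tree_edges)
    return math.inf if total == 0 else 1.0 / total

# Hypothetical data: two "writes" tuples referencing authors and one paper.
data_edges = [("w1", "a1"), ("w1", "p1"), ("w2", "a2"), ("w2", "p1")]
weights = build_edge_weights(data_edges)
# Answer tree rooted at the paper; it uses two backward edges p1->w1, p1->w2.
tree = [("p1", "w1"), ("w1", "a1"), ("p1", "w2"), ("w2", "a2")]
score = edge_score(tree, weights)
```

Because the paper node has two incoming edges, each backward edge out of it costs log2(3), which is more than the forward weight of 1. This is how hub nodes are made less attractive as shortcuts.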

12
Edge Weight Model -2

Probabilistic edge scoring model
Edge traversal probability (from a given node)
Forward ∝ 1/out-degree
Backward ∝ 1/in-degree
Can be weighted by edge type
Path weight ∝ probability of following each
edge in path
Edge score = log(edge traversal probability)
Answer-tree edge score EA = (harmonic) mean of
path weights from root to each leaf

Note
other edge weight models possible
our search algorithms are independent of how edge
weights are computed
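
One possible reading of the probabilistic model, sketched below, takes the path weight to be the product of the per-edge traversal probabilities and computes it by summing log probabilities. That reading, the unweighted 1/degree probabilities, and the function names are assumptions made for illustration.

```python
import math
from statistics import harmonic_mean

def path_weight(degrees):
    """Probability of following every edge on a path.

    `degrees` lists, for each edge on the path, the out-degree (forward
    edge) or in-degree (backward edge) of the node the edge leaves.
    """
    log_prob = sum(math.log(1.0 / d) for d in degrees)  # sum of edge scores
    return math.exp(log_prob)

def tree_edge_score(paths):
    """Harmonic mean of the root-to-leaf path weights."""
    return harmonic_mean([path_weight(p) for p in paths])

# Hypothetical answer tree with two root-to-leaf paths of two edges each.
paths = [[2, 1], [2, 4]]
score = tree_edge_score(paths)
```

The harmonic mean is dominated by the least probable path, so one weakly connected keyword pulls the whole answer's score down.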

13
Node Weight

Node prestige based on indegree
More incoming edges ⇒ higher prestige
PageRank-style transfer of prestige
Node weight computed using biased random walk
model
Node weight: function of node prestige, other
optional criteria such as TF/IDF
Answer-tree node score NA = root node weight +
Σ leaf node weights

14
Overall Tree Answer Score

Overall score of answer tree A
combine tree and node scores
for details, and recall/precision metrics see
BANKS papers in ICDE 2002 and VLDB 2005
Anecdotal results on DBLP Bibliography
"transaction": Jim Gray's classic paper and
textbook at the top because of prestige (# of
citations)
"soumen sudarshan": several coauthored papers,
links via common co-authors
"goldman shivakumar hector": the VLDB 98
proximity search paper, followed by
citation/co-author connections

15
Answer Models

Tree Answer Model
Proximity (near query) model

16
Proximity Queries

Node weight by proximity
author (near olap) (on DBLP)
faculty (near earthquake) (on IITB thesis
database)
Node prestige is higher if close to multiple nodes
matching near keywords
Example applications
Finding experts on a particular area

[Figure: authors Widom and Raghu linked to OLAP papers: "OLAP over uncertain ..", "Computing sparse cubes", "Overview of OLAP", "Allocation in OLAP"]
17
Proximity via Spreading Activation

Idea
Each near keyword has activation of 1
Divided among nodes matching keyword,
proportional to their node prestige
Each node
keeps fraction 1-µ of its received activation and
spreads fraction µ amongst its neighbors
Combine activation ai received from neighbors
a = 1 − ∏(1 − ai) (belief function)
Graph may have cycles
Iterate till convergence
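
The idea above can be sketched as a fixed-point iteration. This is an illustration, not the BANKS implementation: spreading the fraction µ equally among neighbors, the convergence threshold, and all names are assumptions of this example, and node prestige values are assumed to be positive.

```python
from collections import defaultdict

def spread_activation(neighbors, prestige, matches, mu=0.5, eps=1e-9, max_iter=1000):
    """neighbors: node -> list of adjacent nodes; matches: nodes matching the
    near keyword. Returns node -> activation kept by that node."""
    # The keyword's activation of 1 is divided among matching nodes,
    # proportional to node prestige.
    total = sum(prestige[n] for n in matches)
    initial = {n: prestige[n] / total for n in matches}
    incoming = defaultdict(list)
    for m, outs in neighbors.items():
        for n in outs:
            incoming[n].append(m)
    received = {n: 0.0 for n in neighbors}
    for _ in range(max_iter):
        new = {}
        for n in neighbors:
            # Belief combination a = 1 - prod(1 - a_i) over all contributions.
            miss = 1.0 - initial.get(n, 0.0)
            for m in incoming[n]:
                # m spreads fraction mu, split equally among its neighbors.
                miss *= 1.0 - mu * received[m] / len(neighbors[m])
            new[n] = 1.0 - miss
        delta = max((abs(new[n] - received[n]) for n in neighbors), default=0.0)
        received = new
        if delta < eps:   # the graph may have cycles, so iterate to convergence
            break
    # Each node keeps fraction 1 - mu of what it received.
    return {n: (1.0 - mu) * a for n, a in received.items()}

# Hypothetical graph: papers p1, p2 match "olap"; a1 wrote both, a2 only p2.
neighbors = {"p1": ["a1"], "p2": ["a1", "a2"], "a1": ["p1", "p2"], "a2": ["p2"]}
prestige = {n: 1.0 for n in neighbors}
act = spread_activation(neighbors, prestige, ["p1", "p2"])
```

In this toy graph author a1 ends up with more activation than a2, since a1 is close to both matching papers.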

18
Example Answers

Anecdotal results on DBLP Bibliography
author (near recovery): Dave Lomet, C. Mohan, etc.
sudarshan (near change): Sudarshan Chawate
sudarshan (near query): S. Sudarshan
Queries can combine proximity scores with tree
scores
"hector sudarshan (near query)" vs. "hector
sudarshan"
author(near transactions) data integration

19
Related Work

Proximity Search
Goldman, Shivakumar, Venkatasubramanian and
Garcia-Molina VLDB98
Considers only shortest path from each node,
aggregates across nodes
Our version aggregates evidence from alternative
paths
E.g. author (near Surajit Chaudhuri)
ObjectRank (VLDB04)
Similar idea to ours, precomputed

20
Related Work

Keyword querying on relational databases
DBXplorer (Microsoft, ICDE02), Discover (UCSD,
VLDB02, VLDB03),
Use SQL generation, not applicable to arbitrary
graphs
ranking based only on nodes/edges
Keyword querying on XML Tree Model
XRank (Cornell, SIGMOD03), proximity in XML (AT&T
Research, VLDB03), Schema-Free XQuery (Michigan,
VLDB04),
Tree model is too limited
Keyword querying on XML Graph Model
XKeyword (UCSD, ICDE03, VLDB03), SphereSearch
(MaxPlanck, VLDB05)
ranking based only on nodes/edges

21
Outline

Motivation and Graph Data Model
Query/Answer models
Tree answer model
Proximity queries
Graph Search Algorithms
Backward Expanding Search
Bidirectional Search
Search on external memory graphs
Conclusion

22
Finding Answer Trees

Backward Expanding Search Algorithm (Bhalotia et
al, ICDE02)
Intuition: find vertices from which a forward
path exists to at least one node from each Si,
where Si is the set of nodes matching keyword i.
Run concurrent single source shortest path
algorithm from each node matching a keyword
Create an iterator for each node matching a
keyword
Traverse the graph edges in reverse direction
Output next nearest node on each get-next() call
Do best-first search across iterators
Output an answer when its root has been reached
from each keyword
Answer heap to collect and output results in
score order
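
The steps above can be sketched as follows. This is a simplified illustration, not the published algorithm: it runs every iterator to exhaustion and sorts at the end instead of emitting answers incrementally through the answer heap, it ranks roots only by the sum of their shortest distances to each keyword (no node scores), and all names are hypothetical.

```python
import heapq
from collections import defaultdict

def dijkstra_iter(rev, source):
    """Yield (distance, node) in nondecreasing distance, following edges backwards."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        yield d, u
        for p, w in rev.get(u, ()):
            nd = d + w
            if nd < dist.get(p, float("inf")):
                dist[p] = nd
                heapq.heappush(heap, (nd, p))

def backward_expanding_search(edges, keyword_nodes):
    """edges: list of (u, v, weight); keyword_nodes: list of sets S_i.
    Returns [(total_cost, root)] sorted by cost (lower is better)."""
    rev = defaultdict(list)
    for u, v, w in edges:
        rev[v].append((u, w))
    iters = []       # (keyword index, iterator), one per keyword node
    frontier = []    # heap of (next distance, iterator id, node)
    for i, s_i in enumerate(keyword_nodes):
        for src in sorted(s_i):
            it = dijkstra_iter(rev, src)
            d, n = next(it)
            iters.append((i, it))
            heapq.heappush(frontier, (d, len(iters) - 1, n))
    best = defaultdict(dict)   # node -> {keyword index: shortest distance}
    while frontier:
        # Best-first across iterators: always advance the nearest one.
        d, it_id, node = heapq.heappop(frontier)
        kw, it = iters[it_id]
        if kw not in best[node] or d < best[node][kw]:
            best[node][kw] = d
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], it_id, nxt[1]))
    # A node is an answer root once it has been reached from every keyword.
    answers = [(sum(b.values()), n) for n, b in best.items()
               if len(b) == len(keyword_nodes)]
    return sorted(answers)

# Hypothetical graph: p1 points at both keyword nodes, p2 reaches a2 via c.
edges = [("p1", "a1", 1), ("p1", "a2", 1), ("p2", "a1", 1),
         ("p2", "c", 1), ("c", "a2", 1)]
roots = backward_expanding_search(edges, [{"a1"}, {"a2"}])
```

In this toy graph `p1` is the better root because it reaches both keyword nodes in one step, while `p2` needs a two-step path to `a2`.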

23
Backward Expanding Search
Query: soumen byron
[Figure: graph with nodes "Focused Crawling" (paper), writes, and authors Soumen C. and Byron Dom]
24
Backward Exp. Search Limitations

Too many iterators (one per keyword node)
Solution: single iterator per keyword (SI-Bkwd
search)
tracks shortest path from node to keyword
Changes answer set slightly
Different justifications for same root may be
lost
Not a big problem in practice
Nodes explored for different keywords can vary
greatly
E.g. "mining" or "query" vs. "knuth"
High fan-out when traversing backwards from some
nodes
Connection with join ordering
Similar to traversing backwards from all
relations that have selections

25
Bidirectional Search Motivation
26
Bidirectional Search Intuition

First cut solution
Don't expand backward if keyword matches many
nodes
Instead explore forward from other keywords
Problems
Doesn't deal with high fan-out during search
What should cutoff for not expanding be?
Better solution (Kacholia et al., VLDB 2005)
Perform forward search from all nodes reached
Prioritize expansion of nodes based on
path weight (as in backward expanding search)
spreading activation
to penalize frequent keyword and bushy trees

27
Bidirectional Search Example
Query: harper divesh olap
[Figure: example graph with keyword nodes OLAP, Divesh and Harper]
28
Bidirectional Search (1)

Spreading activation to prioritize backward
search
(Different from spreading activation for near
queries)
Lower weight edges get higher share of activation
Nodes prioritized by sum of activations
Single forward iterator

29
Bidirectional Search (2)

Forward search iterator
Forward search from all nodes reached by backward
search
Track best forward path to each keyword
Initially infinite cost
Whenever this changes, propagate cost change to
all affected ancestors

[Figure: example graph with keyword nodes k1 and k2]
30
Bidirectional Search (3)

On each path length update (due to backward or
forward search)
Check if node can reach all keywords
If so, add it to output heap
When to output nodes from heap
For each keyword Ki, track Mi
Mi = minimum path length to Ki among all
yet-to-be-explored nodes in backward search tree
Edge score bounds
What is the best possible edge score of a future
answer?
Bounds similar to NRA algorithm (Fagin)
Cheaper bounds (e.g. 1/Max(Mi)) or heuristics
(e.g. 1/Sum(Mi)) can be used
Output answer if its score is > overall score
upper bound for future answers
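
The output rule can be sketched as below, assuming answers wait in a max-heap keyed on score and that the cheaper bound 1/Max(Mi) or the heuristic 1/Sum(Mi) stands in for the tighter NRA-style bound. The function name and heap layout are hypothetical.

```python
import heapq

def release_answers(output_heap, min_unexplored, heuristic=False):
    """Pop answers whose score beats the bound on any future answer.

    output_heap: heap of (-score, answer) so the best score is on top.
    min_unexplored: list of M_i, one per keyword.
    """
    denom = sum(min_unexplored) if heuristic else max(min_unexplored)
    bound = 1.0 / denom if denom > 0 else float("inf")
    released = []
    while output_heap and -output_heap[0][0] > bound:
        neg_score, answer = heapq.heappop(output_heap)
        released.append((-neg_score, answer))
    return released

def make_heap():
    # Three hypothetical buffered answers with their scores.
    heap = [(-0.5, "A"), (-0.2, "B"), (-0.1, "C")]
    heapq.heapify(heap)
    return heap

safe = release_answers(make_heap(), [3, 4])                   # bound = 1/4
eager = release_answers(make_heap(), [3, 4], heuristic=True)  # bound = 1/7
```

With the 1/Max(Mi) bound only answer A is released; the 1/Sum(Mi) heuristic has a lower threshold and so releases B as well, earlier but without a guarantee.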

31
Performance

Worst case complexity: polynomial in size of
graph
But for typical (average) case, even linear is
too expensive
Intuition: typical query should access only small
part of graph
Studied experimentally
Datasets: DBLP, IMDB, US Patent
Queries: manually created
Typical cases
< 1 second to generate answer
10K-100K nodes explored

32
Performance Results

Two versions of backward search
Iterator per node (MI-Bkwd) vs Iterator per
keyword (SI-Bkwd)
Origin size = number of nodes matching keywords

Time ratio MI/SI

Very minor loss in recall

33
Performance Results

SI-Bkwd versus Bidirectional search
Bidirectional search gain increases with origin
size, #keywords

34
Related Work (1)

Publish as document approach
Gather related data into a (virtual) document and
index the document (Su/Widom, IDEAS05)
Positives
Avoids run-time graph search
Works well for a class of applications
E.g. bibliographic data ⇒ DBLP page per author
Negatives
Not all connections can be captured
Duplication of data across multiple documents
High index space overhead

35
Related Work (2)

DPBF (Ding et al., ICDE07)
dynamic programming technique
exact for top-1 answer, heuristic for top-k
BLINKS (He et al., SIGMOD07)
Round-robin expansion across iterators
Optimal within a factor of m, with m keywords
Forward index: node-to-keyword distance
Used instead of searching forward
single-level index: impractically large space
bi-level index: > main memory ⇒ IO efficiency?

36
Outline

Motivation and Graph Data Model
Query/Answer models
Tree answer model
Proximity queries
Graph Search Algorithms
Backward Expanding Search
Bidirectional Search
Search on external memory graphs
Conclusion

37
External Memory Graph Search

Graph representation quite efficient
Requires < 20 bytes per node/edge
Problem: what if graph size > memory?
Alternative 1: virtual memory
thrashing
Alternative 2 (for relational data): SQL
not good for top-K answer generation across
multiple SQL queries
Alternative 3: use compressed graph
representation to reduce IO
Dalvi et al., VLDB 2008

38
Supernodes and Superedges
39
Multi-granular Graph

Dumb algorithm
search on supernode graph
get kF answers, expand their supernodes into
memory, search on resultant graph
no guarantees on answers
Better idea: use multi-granular graph
Supernode graph in memory
Some nodes expanded
Expanded nodes are part of cache
Algorithms on multi-granular graph (coming up)

40
Multi-granular Graph
41
Expanding Nodes

Key idea: edge score of an answer containing a
supernode is a lower bound on the actual edge score
of any corresponding real answer

42
Iterative Search

Iterative search on multi-granular graph
Repeat
search on current multi-granular graph using any
search algorithm, to find top results
expand super nodes in top results
Until top K answers are all pure
Guarantees finding top-K answers
Very good IO efficiency
But high CPU cost due to repeated work
Details: nodes expanded above are never evicted from
the virtual memory cache

43
Incremental Search

Idea: when a node is expanded, incrementally update
state of search algorithm to reflect change in
multi-granular graph
Run search algorithm until top K answers are all
pure
Currently implemented for backward search
Modifies the state of the Dijkstra shortest path
algorithm used by backward search
One shortest path search iterator per keyword
SPI tree = shortest path iterator tree

44
Incremental Search (1)
SPI tree for k1
45
Incremental Search (2)
46
Incremental Search (3)
47
External Memory Search Performance
48
External Memory Search Performance

Supernode graph very effective at minimizing IO
Cache misses with incremental often < # nodes
matching keywords
Iterative has high CPU cost
VM (backward search with cache as virtual memory)
has high IO cost
Incremental combines low IO cost with low CPU cost

49
Conclusions

Keyword search on graphs continues to grow in
importance
E.g. graph representation of extracted knowledge
in YAGO/NAGA (Max Planck)
Ranking is critical
Edge and node weights, spreading activation
Efficient graph search is important
In-memory and external-memory

50
Ongoing/Future Work

External memory graph search
Compression ratios for supernode graph for
DBLP/IMDB: factor of 5 to 10
Ongoing work on graph clustering shows good
results
Graph search in a parallel cluster
Goal: search integrated WWW/Wikipedia graph
New search algorithms
Integration with existing applications
To provide more natural display of results,
hiding schema details
Authorization

51
BANKS References

Keyword Searching and Browsing in Databases using
BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta
Nakhe, Soumen Chakrabarti, S. Sudarshan, ICDE
2002
User Interaction in the BANKS System, demo
paper, B. Aditya, Soumen Chakrabarti, Rushi
Desai, Arvind Hulgeri, Hrishikesh Karambelkar,
Rupesh Nasre, Parag, S. Sudarshan, ICDE 2003
Bidirectional Expansion For Keyword Search on
Graph Databases, Varun Kacholia, Shashank
Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi
Desai and Hrishikesh Karambelkar, VLDB 2005
Keyword Search on External Memory Data
Graphs, Bhavana Dalvi, Meghana Kshirsagar and S.
Sudarshan, VLDB 2008

52
Thanks!
53
Time and Nodes Explored
[Figure: nodes explored and time for bidirectional search]
54
Screenshots (1)

author (near recovery)

55
Near Queries with Multiple Keywords

Spread activation from each keyword separately
Then combine the activations from different
keywords
OR: use addition or belief combination
AND: take product of activations
Gives better results
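
A small sketch of the two combination rules, using belief combination for OR and the product for AND. The function names and the per-keyword dictionary layout are assumptions made for illustration.

```python
import math

def combine_or(activations):
    """Belief combination: 1 - prod(1 - a_i)."""
    return 1.0 - math.prod(1.0 - a for a in activations)

def combine_and(activations):
    """Product of activations; zero if any keyword gave no activation."""
    return math.prod(activations)

def combine_near_keywords(per_keyword, mode="and"):
    """per_keyword: list of dicts node -> activation, one dict per keyword."""
    combine = combine_and if mode == "and" else combine_or
    nodes = set().union(*per_keyword)
    return {n: combine([acts.get(n, 0.0) for acts in per_keyword]) for n in nodes}

# Hypothetical activations spread separately from two near keywords.
per_keyword = [{"a1": 0.5, "a2": 0.5}, {"a1": 0.5}]
and_scores = combine_near_keywords(per_keyword, "and")
or_scores = combine_near_keywords(per_keyword, "or")
```

Under AND, node a2 scores zero because the second keyword never reached it, which favors nodes close to all of the near keywords.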

56
The BANKS System
[Figure: system architecture: User, HTTP, Web Server + Servlets, BANKS, JDBC, Database]

Available on the web, with DBLP, IMDB and IITB
ETD data
http://www.cse.iitb.ac.in/banks/
No programming needed for customization
Minimal preprocessing to create indices and give
weights to links
Provides keyword search coupled with extensive
browsing features
Schema browsing, data browsing
Hyperlinks are automatically added to all
displayed results
Browsing data by grouping and creating crosstabs
Graphical display of data: bar charts, pie
charts, etc.

57
BANKS Architecture

Data resident on disk
Graph structure of data resident in memory
Nodes and edges with their types/counts
16×|V| + 8×|E| bytes
Search done in memory
Why?
Allows us to use interesting graph traversal
based algorithms without being constrained by SQL
and related performance issues
With current memory sizes, database graphs for
most applications will fit in memory
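
As a rough worked example, reading the size estimate on this slide as 16 bytes per node plus 8 bytes per edge, the in-memory footprint of a graph can be estimated as below. The node and edge counts are hypothetical and not measurements of any BANKS dataset.

```python
def graph_memory_bytes(num_nodes, num_edges, bytes_per_node=16, bytes_per_edge=8):
    """Estimated size of the in-memory graph structure (not the data itself)."""
    return bytes_per_node * num_nodes + bytes_per_edge * num_edges

# Hypothetical graph with one million nodes and five million edges.
estimate = graph_memory_bytes(1_000_000, 5_000_000)
print(f"{estimate} bytes, about {estimate / 2**20:.1f} MiB")
```

At that scale the structure needs 56 million bytes, which is consistent with the claim that graphs for most applications fit in memory.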