Title: Keyword Search on Graph-Structured Data
1. Keyword Search on Graph-Structured Data
Joint work with Soumen Chakrabarti, Gaurav Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K., Arvind Hulgeri, Bhavana Dalvi and Meghana Kshirsagar
Jan 2009
2. Outline
- Motivation and Graph Data Model
- Query/Answer models
  - Tree answer model
  - Proximity queries
- Graph Search Algorithms
  - Backward Expanding Search
  - Bidirectional Search
  - Search on external memory graphs
- Conclusion
3. Keyword Search on Semi-Structured Data
- Keyword search of documents on the Web has been enormously successful
- Much data is resident in databases
  - Organizational, government, scientific, medical data
  - Deep web
- Goal: querying of data from multiple data sources, with different data models
  - Often with no schema or partially defined schema
4. Keyword Search on Structured/Semi-Structured Data
- Key differences from IR/Web search
  - Normalization (implicit/explicit) splits related data across multiple tuples
  - To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
    - e.g. "soumen crawling" or "soumen byron"
5. Graph Data Model
- Lowest common denominator across many data models
- Relational
  - node = tuple, edge = foreign key
- XML
  - node = element, edge = containment/idref/keyref
- HTML
  - node = page, edge = hyperlink
- Documents
  - node = document, edge = links inferred by data extraction
- Knowledge representation
  - node = entity, edge = relationship
- Network data
  - e.g. social networks, communication networks
6. Graph Data Model (Cont.)
- Nodes can have
  - labels
    - e.g. relation name, or XML tag
  - textual or structured (attribute-value) data
- Edges can have labels
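As a rough illustration of the graph data model, the sketch below builds a tiny graph from relational-style tuples. The class names, field names and the substring-based `match` method are assumptions made for this example, not the BANKS data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # e.g. relation name or XML tag
    data: dict = field(default_factory=dict)    # attribute-value data


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node id -> Node
    edges: list = field(default_factory=list)   # (source id, target id, edge label)

    def add_node(self, node_id, label, **data):
        self.nodes[node_id] = Node(label, data)

    def add_edge(self, src, dst, label):
        self.edges.append((src, dst, label))

    def match(self, keyword):
        """Ids of nodes whose label or attribute values contain the keyword."""
        kw = keyword.lower()
        return {
            nid for nid, n in self.nodes.items()
            if kw in n.label.lower()
            or any(kw in str(v).lower() for v in n.data.values())
        }


# Relational example: each tuple is a node, each foreign key is an edge.
g = Graph()
g.add_node("a1", "author", name="Soumen C.")
g.add_node("a2", "author", name="Byron Dom")
g.add_node("p1", "paper", title="Focused Crawling")
g.add_node("w1", "writes")
g.add_node("w2", "writes")
for writes_id, author_id in (("w1", "a1"), ("w2", "a2")):
    g.add_edge(writes_id, author_id, "author")   # writes.author -> author
    g.add_edge(writes_id, "p1", "paper")         # writes.paper  -> paper
```

Calling `g.match("soumen")` returns the ids of the nodes that a keyword query would start from.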
8. Query/Answer Models
- Basic query model
  - Keywords match node text/labels
- Can extend query model with attribute specifications, path specifications
  - e.g. paper(year < 2005, title:xquery)
- Alternative answer models
  - tree connecting nodes matching query keywords
  - nodes in proximity to (near) query keywords
9. Tree Answer Model
- Answer: rooted, directed tree connecting keyword nodes
  - In general, a Steiner tree
- Multiple answers possible
- Answer relevance computed from
  - answer edge score, combined with
  - answer node score
[Figure: answer tree for the query "soumen byron": paper "Focused Crawling" connected through two "writes" nodes to authors "Soumen C." and "Byron Dom"]
10. Answer Ranking
- Naïve model: answers ranked by number of edges
- Problem
  - Some tuples are connected to many other tuples
    - e.g. highly cited papers, popular web sites
  - Highly connected tuples create misleading shortcuts
    - six degrees of separation
- Solution: use directed edges with edge weights
  - allow answer tree to have edge u→v if original graph has v→u, but at higher cost
11. Edge Weight Model-1
- Forward edge weight (edge present in data)
  - Default to 1, can be based on schema
  - Lower weight ⇒ closer connection
- Create extra backward edges v→u for each edge u→v present in data
  - Edge weight ∝ log(1 + #edges pointing to v)
- Overall answer-tree edge score E_A = 1 / (Σ edge weights)
  - Higher score ⇒ better result
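The first edge weight model can be sketched as follows. This is an illustration under stated assumptions: the backward edge weight is taken to be exactly log2(1 + indegree(v)) (the base of the logarithm and the constant of proportionality are not given on the slide), and the function names are hypothetical.

```python
import math
from collections import Counter


def add_backward_edges(forward_edges):
    """forward_edges: list of (u, v, weight) tuples for edges present in the data.

    Returns the forward edges plus one backward edge v->u per forward edge u->v.
    Assumption: backward weight = log2(1 + number of edges pointing to v).
    """
    indegree = Counter(v for _, v, _ in forward_edges)
    backward = [(v, u, math.log2(1 + indegree[v])) for u, v, _ in forward_edges]
    return list(forward_edges) + backward


def answer_edge_score(tree_edge_weights):
    """E_A = 1 / (sum of edge weights); a higher score means a better answer."""
    total = sum(tree_edge_weights)
    # Assumption: a single-node answer (no edges) gets the best possible score.
    return math.inf if total == 0 else 1.0 / total


edges = add_backward_edges([
    ("w1", "paper", 1.0),
    ("w2", "paper", 1.0),
    ("w1", "author", 1.0),
])
```

With this choice a node that many edges point to gets expensive backward edges, which is what discourages shortcuts through highly connected tuples.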
12. Edge Weight Model-2
- Probabilistic edge scoring model
- Edge traversal probability (from a given node)
  - Forward ∝ 1/out-degree
  - Backward ∝ 1/in-degree
  - Can be weighted by edge type
- Path weight: probability of following each edge in path
- Edge score = log(edge traversal probability)
- Answer-tree edge score E_A = (harmonic) mean of path weights from root to each leaf
- Note
  - other edge weight models possible
  - our search algorithms are independent of how edge weights are computed
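A minimal sketch of the probabilistic model, assuming unweighted edge types, so that the forward probability is exactly 1/out-degree and the backward probability exactly 1/in-degree. The function names are hypothetical. Multiplying probabilities along a path is equivalent to adding their logarithms, which is how the per-edge log score relates to the path weight.

```python
import math
from collections import Counter


def traversal_probabilities(edges):
    """edges: list of (u, v) pairs present in the data.

    Returns {(source, target, direction): probability}, where the forward
    edge u->v has probability 1/out-degree(u) and the backward edge v->u
    has probability 1/in-degree(v).
    """
    outdeg = Counter(u for u, _ in edges)
    indeg = Counter(v for _, v in edges)
    prob = {}
    for u, v in edges:
        prob[(u, v, "fwd")] = 1.0 / outdeg[u]
        prob[(v, u, "bwd")] = 1.0 / indeg[v]
    return prob


def path_weight(edge_probs):
    """Probability of following every edge on one root-to-leaf path."""
    return math.prod(edge_probs)


def answer_edge_score(paths):
    """Harmonic mean of the path weights, one path per leaf of the answer tree."""
    weights = [path_weight(p) for p in paths]
    return len(weights) / sum(1.0 / w for w in weights)


prob = traversal_probabilities([("w1", "p"), ("w1", "a"), ("w2", "p")])
score = answer_edge_score([[0.5, 1.0], [0.5, 0.5]])
```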
13. Node Weight
- Node prestige based on indegree
  - More incoming edges ⇒ higher prestige
- PageRank-style transfer of prestige
  - Node weight computed using biased random walk model
- Node weight: function of node prestige, other optional criteria such as TF/IDF
- Answer-tree node score N_A = root node weight + Σ leaf node weights
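The PageRank-style prestige computation can be illustrated with a short power iteration. The details below are assumptions for the sketch, not taken from the slides: the walk is biased so that an edge is followed with probability inversely proportional to its weight, the damping factor is 0.85, and nodes without outgoing edges spread their prestige uniformly.

```python
def node_prestige(out_edges, damping=0.85, iterations=50):
    """out_edges: {node: {neighbor: edge weight}}.

    Returns {node: prestige}, with prestige values summing to 1.
    """
    nodes = set(out_edges) | {v for nbrs in out_edges.values() for v in nbrs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            nbrs = out_edges.get(u, {})
            if not nbrs:
                # Dangling node: spread its prestige uniformly over all nodes.
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
                continue
            # Biased walk: lower-weight edges are followed more often.
            total = sum(1.0 / w for w in nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += damping * rank[u] * (1.0 / w) / total
        rank = nxt
    return rank


# Two papers citing a third: the cited paper ends up with the highest prestige.
prestige = node_prestige({"p1": {"p3": 1.0}, "p2": {"p3": 1.0}, "p3": {}})
```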
14. Overall Tree Answer Score
- Overall score of answer tree A
  - combine tree and node scores
  - for details, and recall/precision metrics, see BANKS papers in ICDE 2002 and VLDB 2005
- Anecdotal results on DBLP Bibliography
  - "transaction": Jim Gray's classic paper and textbook at the top because of prestige (# of citations)
  - "soumen sudarshan": several coauthored papers, links via common co-authors
  - "goldman shivakumar hector": the VLDB 98 proximity search paper, followed by citation/co-author connections
15. Answer Models
- Tree Answer Model
- Proximity (near query) model
16. Proximity Queries
- Node weight by proximity
  - author (near olap) (on DBLP)
  - faculty (near earthquake) (on IITB thesis database)
- Node prestige is higher if close to multiple nodes matching near keywords
- Example applications
  - Finding experts on a particular area
[Figure: example graph for author (near olap), with author nodes "Widom" and "Raghu" and paper nodes "OLAP over uncertain ..", "Computing sparse cubes", "Overview of OLAP" and "Allocation in OLAP"]
17. Proximity via Spreading Activation
- Idea
  - Each near keyword has activation of 1
    - Divided among nodes matching keyword, proportional to their node prestige
  - Each node
    - keeps fraction 1-µ of its received activation and
    - spreads fraction µ amongst its neighbors
  - Combine activation a_i received from neighbors
    - a = 1 - Π(1 - a_i) (belief function)
  - Graph may have cycles
    - Iterate till convergence
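The spreading activation idea can be sketched as a simple message-passing loop. The sketch makes several assumptions that the slides do not specify: activation is split equally among neighbors, the belief function is read as a = 1 - Π(1 - a_i), and "convergence" is approximated by dropping any message smaller than a threshold `eps` (which also guarantees termination on cyclic graphs).

```python
from collections import deque


def spread_activation(neighbors, prestige, matches, mu=0.5, eps=1e-6):
    """neighbors: {node: [adjacent nodes]}; prestige: {node: prestige};
    matches: nodes matching the near keyword.

    Returns {node: combined activation}.
    """
    # The keyword's activation of 1 is divided in proportion to node prestige.
    total = sum(prestige[n] for n in matches)
    queue = deque((n, prestige[n] / total) for n in matches)
    activation = {}
    while queue:
        node, received = queue.popleft()
        kept = (1.0 - mu) * received
        # Belief combination of the kept share with what the node already has.
        activation[node] = 1.0 - (1.0 - activation.get(node, 0.0)) * (1.0 - kept)
        nbrs = neighbors.get(node, [])
        if nbrs:
            share = mu * received / len(nbrs)
            if share > eps:  # stop propagating negligible activation
                queue.extend((nb, share) for nb in nbrs)
    return activation


# Path graph a - b - c, with the near keyword matching only node a.
act = spread_activation(
    neighbors={"a": ["b"], "b": ["a", "c"], "c": ["b"]},
    prestige={"a": 1.0, "b": 1.0, "c": 1.0},
    matches=["a"],
)
```

In this toy graph the activation decreases with distance from the matching node, which is the intended proximity effect.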
18. Example Answers
- Anecdotal results on DBLP Bibliography
  - author (near recovery): Dave Lomet, C. Mohan, etc.
  - sudarshan (near change): Sudarshan Chawathe
  - sudarshan (near query): S. Sudarshan
- Queries can combine proximity scores with tree scores
  - hector sudarshan (near query) vs. hector sudarshan
  - author (near transactions) data integration
19. Related Work
- Proximity Search
  - Goldman, Shivakumar, Venkatasubramanian and Garcia-Molina, VLDB98
    - Considers only shortest path from each node, aggregates across nodes
    - Our version aggregates evidence from alternative paths
    - e.g. author (near Surajit Chaudhuri)
- ObjectRank, VLDB04
  - Similar idea to ours, precomputed
20. Related Work
- Keyword querying on relational databases
  - DBXplorer (Microsoft, ICDE02), Discover (UCSD, VLDB02, VLDB03)
  - Use SQL generation, not applicable to arbitrary graphs
  - ranking based only on nodes/edges
- Keyword querying on XML: tree model
  - XRank (Cornell, SIGMOD03), proximity in XML (AT&T Research, VLDB03), Schema-Free XQuery (Michigan, VLDB04)
  - Tree model is too limited
- Keyword querying on XML: graph model
  - XKeyword (UCSD, ICDE03, VLDB03), SphereSearch (Max Planck, VLDB05)
  - ranking based only on nodes/edges
22. Finding Answer Trees
- Backward Expanding Search Algorithm (Bhalotia et al., ICDE02)
  - Intuition: find vertices from which a forward path exists to at least one node from each S_i (S_i = set of nodes matching keyword i)
  - Run concurrent single source shortest path algorithm from each node matching a keyword
    - Create an iterator for each node matching a keyword
    - Traverse the graph edges in reverse direction
    - Output next nearest node on each get-next() call
  - Do best-first search across iterators
    - Output an answer when its root has been reached from each keyword
  - Answer heap to collect and output results in score order
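The steps above can be sketched in a few lines. This is a simplified illustration, not the BANKS implementation: all per-node Dijkstra iterators share one global priority queue (which gives the best-first order across iterators), the cost of a root is taken to be the sum of its shortest path lengths to each keyword, and answers are sorted at the end rather than released incrementally from an answer heap. The function name and tuple layouts are hypothetical.

```python
import heapq
from collections import defaultdict


def backward_expanding_search(edges, keyword_sets, k=10):
    """edges: list of (u, v, weight); keyword_sets: one set of matching nodes
    per keyword. Returns up to k (cost, root) pairs, cheapest first."""
    # Reverse adjacency, so that edges can be traversed backward.
    rev = defaultdict(list)
    for u, v, w in edges:
        rev[v].append((u, w))

    # One shortest-path iterator per keyword node.
    origins = [(i, s) for i, nodes in enumerate(keyword_sets) for s in sorted(nodes)]
    heap = [(0.0, s, it) for it, (_, s) in enumerate(origins)]
    heapq.heapify(heap)
    settled = [set() for _ in origins]
    reached = defaultdict(dict)   # node -> {keyword index: (distance, origin)}
    answers = []

    while heap:
        dist, node, it = heapq.heappop(heap)
        if node in settled[it]:
            continue
        settled[it].add(node)
        kw, origin = origins[it]
        if kw not in reached[node]:
            # First (hence nearest) origin of this keyword to reach the node.
            reached[node][kw] = (dist, origin)
            if len(reached[node]) == len(keyword_sets):
                cost = sum(d for d, _ in reached[node].values())
                answers.append((cost, node))
        for pred, w in rev[node]:
            if pred not in settled[it]:
                heapq.heappush(heap, (dist + w, pred, it))
    return sorted(answers)[:k]


# Toy graph: forward edges (weight 1.0) plus hand-picked backward edges.
example_edges = [
    ("w1", "paper", 1.0), ("w1", "soumen", 1.0),
    ("w2", "paper", 1.0), ("w2", "byron", 1.0),
    ("paper", "w1", 2.0), ("paper", "w2", 2.0),
    ("soumen", "w1", 1.0), ("byron", "w2", 1.0),
]
answers = backward_expanding_search(example_edges, [{"soumen"}, {"byron"}])
```

Each returned root is a node from which every keyword can be reached by a forward path. Rebuilding the answer tree would additionally require storing predecessor pointers, which the sketch omits.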
23. Backward Expanding Search
[Figure: backward expansion for the query "soumen byron" over the paper node "Focused Crawling", the "writes" node, and the authors "Soumen C." and "Byron Dom"]
24. Backward Exp. Search Limitations
- Too many iterators (one per keyword node)
  - Solution: single iterator per keyword (SI-Bkwd search)
    - tracks shortest path from node to keyword
  - Changes answer set slightly
    - Different justifications for same root may be lost
    - Not a big problem in practice
- Nodes explored for different keywords can vary greatly
  - e.g. "mining" or "query" vs "knuth"
- High fan-out when traversing backwards from some nodes
- Connection with join ordering
  - Similar to traversing backwards from all relations that have selections
25. Bidirectional Search: Motivation
26. Bidirectional Search: Intuition
- First-cut solution
  - Don't expand backward if keyword matches many nodes
  - Instead explore forward from other keywords
- Problems
  - Doesn't deal with high fan-out during search
  - What should the cutoff for not expanding be?
- Better solution (Kacholia et al., VLDB 2005)
  - Perform forward search from all nodes reached
  - Prioritize expansion of nodes based on
    - path weight (as in backward expanding search)
    - spreading activation
      - to penalize frequent keywords and bushy trees
27. Bidirectional Search: Example
[Figure: example graph for the query "harper divesh olap", with keyword nodes "Harper", "Divesh" and "OLAP"]
28. Bidirectional Search (1)
- Spreading activation to prioritize backward search
  - (Different from spreading activation for near queries)
  - Lower weight edges get higher share of activation
  - Nodes prioritized by sum of activations
- Single forward iterator
29. Bidirectional Search (2)
- Forward search iterator
  - Forward search from all nodes reached by backward search
  - Track best forward path to each keyword
    - Initially infinite cost
    - Whenever this changes, propagate cost change to all affected ancestors
[Figure: example graph with keyword nodes k1 and k2 and node labels 2,8; 2,2; 8,8; 8,2]
30. Bidirectional Search (3)
- On each path length update (due to backward or forward search)
  - Check if node can reach all keywords
  - If so, add it to output heap
- When to output nodes from heap
  - For each keyword K_i, track M_i
    - M_i: minimum path length to K_i among all yet-to-be-explored nodes in backward search tree
  - Edge score bounds
    - What is the best possible edge score of a future answer?
    - Bounds similar to NRA algorithm (Fagin)
    - Cheaper bounds (e.g. 1/Max(M_i)) or heuristics (e.g. 1/Sum(M_i)) can be used
  - Output answer if its score is > overall score upper bound for future answers
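The output rule can be sketched as follows, using the two cheap bounds named above. The function names and the `(-score, answer)` heap layout are assumptions for this illustration, and the sketch only compares edge scores, whereas the slide refers to an overall score bound.

```python
import heapq


def future_score_bound(frontier_min, heuristic=False):
    """frontier_min: list of M_i values, one per keyword.

    Returns 1/Max(M_i), or the heuristic 1/Sum(M_i) when heuristic=True.
    """
    denom = sum(frontier_min) if heuristic else max(frontier_min)
    return float("inf") if denom == 0 else 1.0 / denom


def release_answers(output_heap, frontier_min, heuristic=False):
    """output_heap: heap of (-score, answer) pairs, so the best answer is on top.

    Pops and returns every buffered answer whose score exceeds the bound on
    the score of any answer that may still be generated.
    """
    bound = future_score_bound(frontier_min, heuristic)
    released = []
    while output_heap and -output_heap[0][0] > bound:
        neg_score, answer = heapq.heappop(output_heap)
        released.append((answer, -neg_score))
    return released


buffered = []
for score, answer in [(0.5, "A"), (0.2, "B"), (0.05, "C")]:
    heapq.heappush(buffered, (-score, answer))
ready = release_answers(buffered, frontier_min=[4.0, 10.0])
```

Answers that do not beat the bound stay in the heap until further search raises the M_i values.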
31. Performance
- Worst case complexity: polynomial in size of graph
  - But for typical (average) case, even linear is too expensive
  - Intuition: typical query should access only small part of graph
- Studied experimentally
  - Datasets: DBLP, IMDB, US Patent
  - Queries: manually created
- Typical cases
  - < 1 second to generate answer
  - 10K-100K nodes explored
32. Performance Results
- Two versions of backward search
  - Iterator per node (MI-Bkwd) vs iterator per keyword (SI-Bkwd)
- Origin size: number of nodes matching keywords
[Figure: time ratio MI/SI]
- Very minor loss in recall
33. Performance Results
- SI-Bkwd versus Bidirectional search
  - Bidirectional search gain increases with origin size and number of keywords
34. Related Work (1)
- Publish-as-document approach
  - Gather related data into a (virtual) document and index the document (Su/Widom, IDEAS05)
- Positives
  - Avoids run-time graph search
  - Works well for a class of applications
    - e.g. bibliographic data → DBLP page per author
- Negatives
  - Not all connections can be captured
  - Duplication of data across multiple documents
  - High index space overhead
35. Related Work (2)
- DPBF (Ding et al., ICDE07)
  - dynamic programming technique
  - exact for top-1 answer, heuristic for top-k
- BLINKS (He et al., SIGMOD07)
  - Round-robin expansion across iterators
    - Optimal within a factor of m, with m keywords
  - Forward index: node to keyword distance
    - Used instead of searching forward
    - single-level index: impractically large space
    - bi-level index: > main memory → IO efficiency?
37. External Memory Graph Search
- Graph representation quite efficient
  - Requires < 20 bytes per node/edge
- Problem: what if graph size > memory?
  - Alternative 1: virtual memory
    - thrashing
  - Alternative 2 (for relational data): SQL
    - not good for top-K answer generation across multiple SQL queries
  - Alternative 3: use compressed graph representation to reduce IO
    - Dalvi et al., VLDB 2008
38. Supernodes and Superedges
39. Multi-granular Graph
- Dumb algorithm
  - search on supernode graph
  - get kF answers, expand their supernodes into memory, search on resultant graph
  - no guarantees on answers
- Better idea: use multi-granular graph
  - Supernode graph in memory
  - Some nodes expanded
    - Expanded nodes are part of cache
  - Algorithms on multi-granular graph (coming up)
40. Multi-granular Graph
41. Expanding Nodes
- Key idea: edge score of answer containing a supernode is lower bound on actual edge score of any corresponding real answer
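One way to obtain such a bound, assumed here rather than taken from the slides, is to give each superedge the minimum weight of the original edges it stands for, so that the total edge weight computed on the supernode graph never exceeds that of a corresponding real answer. The function name and the `cluster_of` mapping are hypothetical.

```python
def build_supernode_graph(edges, cluster_of):
    """edges: list of (u, v, weight); cluster_of: {node: supernode id}.

    Returns {(source supernode, target supernode): superedge weight}.
    Assumption: a superedge carries the minimum weight of the edges it
    represents; edges inside a single supernode are dropped.
    """
    super_edges = {}
    for u, v, w in edges:
        su, sv = cluster_of[u], cluster_of[v]
        if su == sv:
            continue
        key = (su, sv)
        super_edges[key] = min(w, super_edges.get(key, float("inf")))
    return super_edges


super_edges = build_supernode_graph(
    edges=[("a", "c", 3.0), ("b", "c", 1.0), ("a", "b", 1.0)],
    cluster_of={"a": "S1", "b": "S1", "c": "S2"},
)
```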
42. Iterative Search
- Iterative search on multi-granular graph
  - Repeat
    - search on current multi-granular graph using any search algorithm, to find top results
    - expand supernodes in top results
  - Until top K answers are all pure (contain no supernodes)
- Guarantees finding top-K answers
- Very good IO efficiency
- But high CPU cost due to repeated work
- Details: nodes expanded above are never evicted from the "virtual memory" cache
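The repeat/until loop above can be written down directly. In this sketch `search`, `expand` and `is_supernode` are hypothetical callbacks supplied by the caller: `search(k)` runs any search algorithm on the current multi-granular graph and returns the top k answers as lists of nodes, and `expand(s)` replaces supernode `s` by its inner nodes. The toy callbacks below exist only to make the example runnable.

```python
def iterative_search(search, expand, is_supernode, k):
    """Search repeatedly, expanding supernodes, until the top k answers are pure."""
    while True:
        top = search(k)
        impure = {n for answer in top for n in answer if is_supernode(n)}
        if not impure:
            return top                # every top answer contains only real nodes
        for supernode in sorted(impure):
            expand(supernode)         # bring the inner nodes into the cache


# Toy multi-granular graph: two supernodes and the real nodes they contain.
members = {"S1": ["a", "b"], "S2": ["c"]}
expanded = set()


def toy_search(k):
    # An answer mentions a supernode until that supernode has been expanded.
    answers = [members[s] if s in expanded else [s] for s in ("S1", "S2")]
    return answers[:k]


result = iterative_search(toy_search, expanded.add, lambda n: n in members, k=2)
```

The loop restarts the search from scratch after every round of expansion, which is the repeated work that the incremental variant avoids.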
43. Incremental Search
- Idea: when node expanded, incrementally update state of search algorithm to reflect change in multi-granular graph
- Run search algorithm until top K answers are all pure
- Currently implemented for backward search
  - Modifies the state of the Dijkstra shortest path algorithm used by backward search
  - One shortest path search iterator per keyword
  - SPI tree: shortest path iterator tree
44. Incremental Search (1)
[Figure: SPI tree for k1]
45. Incremental Search (2)
46. Incremental Search (3)
47. External Memory Search Performance
48. External Memory Search Performance
- Supernode graph very effective at minimizing IO
  - Cache misses with incremental often < number of nodes matching keywords
- Iterative has high CPU cost
- VM (backward search with cache as virtual memory) has high IO cost
- Incremental combines low IO cost with low CPU cost
49. Conclusions
- Keyword search on graphs continues to grow in importance
  - e.g. graph representation of extracted knowledge in YAGO/NAGA (Max Planck)
- Ranking is critical
  - Edge and node weights, spreading activation
- Efficient graph search is important
  - In-memory and external-memory
50. Ongoing/Future Work
- External memory graph search
  - Compression ratios for supernode graph for DBLP/IMDB: factor of 5 to 10
  - Ongoing work on graph clustering shows good results
- Graph search in a parallel cluster
  - Goal: search integrated WWW/Wikipedia graph
- New search algorithms
- Integration with existing applications
  - To provide more natural display of results, hiding schema details
- Authorization
51. BANKS References
- Keyword Searching and Browsing in Databases using BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan, ICDE 2002
- User Interaction in the BANKS System (demo paper), B. Aditya, Soumen Chakrabarti, Rushi Desai, Arvind Hulgeri, Hrishikesh Karambelkar, Rupesh Nasre, Parag, S. Sudarshan, ICDE 2003
- Bidirectional Expansion For Keyword Search on Graph Databases, Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai and Hrishikesh Karambelkar, VLDB 2005
- Keyword Search on External Memory Data Graphs, Bhavana Dalvi, Meghana Kshirsagar and S. Sudarshan, VLDB 2008
52. Thanks!
53. Time and Nodes Explored
[Figure: chart with series "Bidir Nodes" and "BiDir Time"]
54. Screenshots (1)
55. Near Queries with Multiple Keywords
- Spread activation from each keyword separately
- Then combine the activations from different keywords
  - OR: use addition or belief combination
  - AND: take product of activations
  - Gives better results
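The two combination rules can be sketched as below. The function name and the dictionary layout are assumptions for this example; for OR the sketch uses belief combination, 1 - Π(1 - a_i), rather than plain addition.

```python
import math


def combine_near_keywords(per_keyword, mode="and"):
    """per_keyword: list of {node: activation} dicts, one per near keyword.

    A node missing from a dict is treated as having activation 0 for that keyword.
    """
    nodes = set().union(*per_keyword)
    combined = {}
    for node in nodes:
        acts = [act.get(node, 0.0) for act in per_keyword]
        if mode == "and":
            combined[node] = math.prod(acts)                         # product
        elif mode == "or":
            combined[node] = 1.0 - math.prod(1.0 - a for a in acts)  # belief combination
        else:
            raise ValueError("mode must be 'and' or 'or'")
    return combined


activations = [{"x": 0.5, "y": 0.2}, {"x": 0.5}]
and_scores = combine_near_keywords(activations, mode="and")
or_scores = combine_near_keywords(activations, mode="or")
```

Under AND, a node that is near only one of the keywords gets a combined activation of 0, whereas under OR it keeps the activation it received.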
56. The BANKS System
[Figure: system architecture: the user talks over HTTP to a web server running the BANKS servlets, which access the database over JDBC]
- Available on the web, with DBLP, IMDB and IITB ETD data
  - http://www.cse.iitb.ac.in/banks/
- No programming needed for customization
  - Minimal preprocessing to create indices and give weights to links
- Provides keyword search coupled with extensive browsing features
  - Schema browsing, data browsing
  - Hyperlinks are automatically added to all displayed results
  - Browsing data by grouping and creating crosstabs
  - Graphical display of data: bar charts, pie charts, etc.
57. BANKS Architecture
- Data resident on disk
- Graph structure of data resident in memory
  - Nodes and edges with their types/counts
  - 16×|V| + 8×|E| bytes
- Search done in memory
- Why?
  - Allows us to use interesting graph-traversal-based algorithms without being constrained by SQL and related performance issues
  - With current memory sizes, database graphs for most applications will fit in memory
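Taking the size estimate above to mean 16 bytes per node plus 8 bytes per edge, the in-memory footprint can be estimated with a one-line calculation. The node and edge counts in the example are hypothetical and chosen only for illustration.

```python
def graph_memory_bytes(num_nodes, num_edges, node_bytes=16, edge_bytes=8):
    """Estimated size of the in-memory graph structure, in bytes."""
    return node_bytes * num_nodes + edge_bytes * num_edges


# Hypothetical graph with 2 million nodes and 9 million edges.
estimate = graph_memory_bytes(num_nodes=2_000_000, num_edges=9_000_000)
print(f"{estimate / 2**20:.0f} MiB")
```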
58. Probabilistic Edge Score Model (2)
- Paths from root to leaves are considered separately, even if they share edges
- More efficient search algorithms with this model (He et al., SIGMOD07)