Bidirectional Expansion for Keyword Search on Graph Databases

1
Bidirectional Expansion for Keyword Search on
Graph Databases
Varun Kacholia
Shashank Pandit Soumen Chakrabarti
S. Sudarshan Rushi Desai
Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
2
Keyword Search on Graph Representation of Data
  • Keyword search on relational, XML, HTML, etc.
    data
  • BANKS, Discover, DBXplorer, XRank, etc.
  • Need to find a (closely) connected set of nodes
    that together match all given keywords
  • Focus of our work
  • Search algorithms to find connections between
    nodes

3
Outline
  • Data, Query and Response Models
  • Backward Search Algorithm
  • Bidirectional Search Algorithm
  • Experiments
  • Related Work
  • Conclusions

4
Graph Data Model
  • Data modeled as a directed, weighted graph [BANKS,
    ICDE'02]
  • Can model relational, XML, HTML, etc. data
  • E.g., DBLP database
  • Node = tuple
  • Edge = foreign key reference

5
Graph Data Model (2)
  • E.g., XML data
      <proceedings>
        <paper id="1">
          <title>Databases</title>
        </paper>
        <paper id="2">
          <title>Keyword Search</title>
          <cite ref="1">Databases</cite>
        </paper>
      </proceedings>

[Figure: the same data drawn as a graph, with nodes proceedings, paper (@id = 1), paper (@id = 2), title, title and cite]
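As a rough illustration of how the XML example maps onto a graph, here is a minimal Python sketch. It is not the BANKS loader: the function name `xml_to_graph`, the edge kinds "contains" and "ref", and the convention that a `ref` attribute points at the element with the matching `id` are all assumptions made for this example, and edge weights are left out.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<proceedings>
  <paper id="1">
    <title>Databases</title>
  </paper>
  <paper id="2">
    <title>Keyword Search</title>
    <cite ref="1">Databases</cite>
  </paper>
</proceedings>"""


def xml_to_graph(xml_text):
    """Return (labels, edges) for an XML document.

    labels maps node number -> element tag; edges is a list of directed
    (source, target, kind) triples, kind being "contains" or "ref".
    """
    labels, edges, by_id, pending_refs = {}, [], {}, []

    def visit(elem, parent):
        node = len(labels)              # next free node number (preorder)
        labels[node] = elem.tag
        if "id" in elem.attrib:
            by_id[elem.attrib["id"]] = node
        if "ref" in elem.attrib:
            pending_refs.append((node, elem.attrib["ref"]))
        if parent is not None:
            edges.append((parent, node, "contains"))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), None)
    # Resolve references only after every id has been seen.
    for node, ref in pending_refs:
        edges.append((node, by_id[ref], "ref"))
    return labels, edges


labels, edges = xml_to_graph(XML_DOC)
```

Containment and reference edges are both stored as directed edges, so the cite element of paper 2 ends up pointing at paper 1.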
6
Response Model
  • Response: minimal, rooted tree connecting keyword
    nodes
  • Undirected: Discover, DBXplorer
  • Directed: BANKS

[Figure: answer tree for the example query "Sudarshan Roy": the paper "Multi-Query Optimization" linked through two writes nodes to the authors Prasan Roy and Sudarshan]
7
Response Ranking
  • Edge score EA
      • Smaller tree => higher score
      • E.g., BANKS: EA = 1 / (Σ edge weights)
  • Node score NA
      • Measure of authority of nodes in tree
      • E.g., BANKS: NA = Σ (leaf and root node
        authorities)
  • Overall score f(EA, NA)
      • E.g., BANKS: f(EA, NA) = EA · NA^λ
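The ranking formulas on this slide can be worked through in a few lines of Python. This is only a sketch: it reads the overall score as EA multiplied by NA raised to a power λ, the function names are hypothetical, and the default `lam=0.2` is an arbitrary placeholder rather than a value taken from the slides.

```python
def edge_score(edge_weights):
    """EA = 1 / (sum of edge weights); smaller trees score higher."""
    total = sum(edge_weights)
    if total <= 0:
        raise ValueError("tree must contain at least one positively weighted edge")
    return 1.0 / total


def node_score(authorities):
    """NA = sum of the authorities of the leaf and root nodes."""
    return sum(authorities)


def overall_score(edge_weights, authorities, lam=0.2):
    """f(EA, NA) = EA * NA**lam; lam is an assumed tuning parameter."""
    return edge_score(edge_weights) * node_score(authorities) ** lam


# A tree with edge weights 1, 1, 2 and leaf/root authorities 2 and 2.
example = overall_score([1, 1, 2], [2, 2], lam=2)
```

With `lam=0` the node score drops out and the ranking depends on tree size alone; larger values give node authority more influence.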

8
Finding Answer Trees
  • Backward Expanding Search
  • [BANKS, ICDE'02]
  • Intuition: travel backwards from keyword nodes
    until you hit a common node

Query: sudarshan roy
[Figure: backward expansion from the author nodes Sudarshan and Prasan Roy through writes nodes to the common paper node "MultiQuery Optimization"]
9
Backward Search Algorithm
  • Algorithm
  • Run concurrent single source shortest path
    iterators from each node matching a keyword
  • Traverse the graph edges in reverse direction
  • Output next nearest node on each get-next() call
  • Do best-first search across iterators
  • Output node if in the intersection of sets of
    nodes reached from each keyword
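The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the BANKS implementation: the per-node shortest-path iterators are merged into one global heap (which gives the same best-first order), a candidate root is scored by the sum of its distances to the nearest node of each keyword, and the function name and edge-list format are made up for the example.

```python
import heapq
from collections import defaultdict


def backward_search(edges, keyword_nodes, k=10):
    """edges: (src, dst, weight) triples with non-negative weights.
    keyword_nodes: one set of matching nodes per keyword.
    Returns up to k (root, summed_distance) pairs."""
    rev = defaultdict(list)
    for src, dst, w in edges:
        rev[dst].append((src, w))          # edges are traversed in reverse

    heap, tie = [], 0                      # entries: (dist, tie, origin, kw, node)
    for kw, nodes in enumerate(keyword_nodes):
        for origin in nodes:               # one logical iterator per origin node
            heap.append((0.0, tie, origin, kw, origin))
            tie += 1
    heapq.heapify(heap)

    settled = set()                        # (origin, node) pairs already output
    best = defaultdict(dict)               # node -> {keyword index: distance}
    emitted, results = set(), []
    while heap and len(results) < k:
        dist, _, origin, kw, node = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        # Pops come in non-decreasing distance, so the first arrival per
        # keyword is the nearest one.
        best[node].setdefault(kw, dist)
        if len(best[node]) == len(keyword_nodes) and node not in emitted:
            emitted.add(node)              # reached from every keyword
            results.append((node, sum(best[node].values())))
        for pred, w in rev[node]:
            if (origin, pred) not in settled:
                heapq.heappush(heap, (dist + w, tie, origin, kw, pred))
                tie += 1
    return results


edges = [("paper", "w1", 1), ("w1", "sudarshan", 1),
         ("paper", "w2", 1), ("w2", "roy", 1)]
answers = backward_search(edges, [{"sudarshan"}, {"roy"}])
```

In the toy graph both keyword nodes reach the paper node after two reverse hops, so it is the only root reported.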

10
Backward Search Limitations
  • Wasteful exploration of graph
  • Frequently occurring keywords
  • Hub nodes in the graph (high in-degree)

Query: Shashank Sudarshan Database
[Figure: example graph over the schema author, writes, paper, with keyword nodes Shashank, Sudarshan and Database]
11
Bidirectional Search Motivation
12
Bidir Search Intuition
  • First cut solution
  • Dont go backward if keyword matches many nodes
  • Dont go backward if node points to a hub
  • Instead explore forward from other keywords

13
Bidir Search: Example
Query: Shashank Sudarshan Database
[Figure: the same example graph over the schema author, writes, paper, with keyword nodes Shashank, Sudarshan and Database]
14
Bidir Search: Issues
  • What should the threshold for not expanding be?
  • Our solution: prioritize expansion of nodes based
    on spreading activation
      • to penalize frequent keywords and bushy trees
  • How to manage exploration in both directions?

15
Bidir Search: Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Every node given an initial activation
  • Gives low activation to frequently occurring
    keywords

[Figure: five nodes matching the keyword "John", each with initial activation 1/5]
16
Bidir Search: Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Activation spread to neighbors (µ = 0.3)
  • Gives low activation to neighbors of hubs

[Figure: a node with activation 1/5 keeps 0.3 x 1/5 and passes 0.7 x 1/5 x 1/4 to each of its four neighbors]
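The arithmetic on the two spreading-activation slides can be reproduced with a small Python sketch. It assumes the reading suggested by the figure values: a node matching a keyword with five matches starts at 1/5, keeps a fraction µ = 0.3 when explored, and splits the remaining 0.7 evenly over its neighbors. The even split, the additive accumulation and the function names are assumptions of this sketch, not details confirmed by the slides.

```python
def initial_activation(origin_sets):
    """origin_sets: one set of matching nodes per keyword.
    Each match of a keyword with |S| matches starts with activation 1/|S|."""
    activation = {}
    for nodes in origin_sets:
        for node in nodes:
            activation[node] = activation.get(node, 0.0) + 1.0 / len(nodes)
    return activation


def spread(activation, node, neighbors, mu=0.3):
    """Return a new activation map after exploring `node`.

    The node keeps mu of its activation; the remaining (1 - mu) is divided
    evenly among its neighbors, so neighbors of a hub each receive little.
    """
    updated = dict(activation)
    if not neighbors:
        return updated
    current = activation[node]
    updated[node] = mu * current
    share = (1.0 - mu) * current / len(neighbors)
    for nb in neighbors:
        updated[nb] = updated.get(nb, 0.0) + share
    return updated


start = initial_activation([{"john1", "john2", "john3", "john4", "john5"}])
after = spread(start, "john1", ["n1", "n2", "n3", "n4"])
```

Frequent keywords are penalized by the 1/|S| start value, and hubs by the division over their many neighbors.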
17
Bidir Search: Iterators
  • How to manage exploration in both directions?
  • Single backward iterator + single forward
    iterator, with suitable data structures
  • E.g., to keep track of parents of nodes
  • Details in full paper

[Figure: example graph with keyword nodes A and B; each node is labeled with (dist from A, dist from B)]
18
Bidir Search: Algorithm
  • Algorithm
      • Activate matching nodes; insert them into the backward
        iterator
      • while (iterators are not empty)
          • Choose iterator for expansion in best-first manner
          • Explore node with highest activation
          • Spread activation to neighbors
          • Update path weights (and other data structures)
          • Propagate values to ancestors if necessary
          • Insert nodes explored in the backward direction into
            the forward iterator /* for future forward exploration */
      • Stop when top-k results are produced
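The loop above can be sketched in Python as follows. This is a heavily simplified illustration, not the algorithm from the full paper: the activation rule (keep µ, split the rest evenly), the way the two queues are compared, the scoring by summed distances, and all names are assumptions of this sketch, and result trees are reduced to their root node.

```python
import heapq
import itertools
import math
from collections import defaultdict


def bidirectional_search(edges, origin_sets, k=10, mu=0.3):
    """edges: (src, dst, weight) triples; origin_sets: one node set per keyword.
    Returns {root: summed distance to the nearest node of each keyword}."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v, w in edges:
        fwd[u].append((v, w))
        rev[v].append((u, w))
    n_kw = len(origin_sets)
    dist = defaultdict(lambda: [math.inf] * n_kw)   # node -> distance per keyword
    act = defaultdict(float)                        # node -> activation
    tie = itertools.count()                         # heap tie-breaker
    q_in, q_out = [], []                            # backward / forward iterators
    done_in, done_out, roots = set(), set(), set()

    def relax(parent, child, w):
        # parent -> child is an edge: parent reaches whatever child reaches.
        stack = [(parent, child, w)]
        while stack:
            p, c, pw = stack.pop()
            improved = False
            for i in range(n_kw):
                if dist[c][i] + pw < dist[p][i]:
                    dist[p][i] = dist[c][i] + pw
                    improved = True
            if improved:
                if all(d < math.inf for d in dist[p]):
                    roots.add(p)
                # Propagate the better value to already-seen ancestors.
                stack.extend((gp, p, gw) for gp, gw in rev[p] if gp in dist)

    def spread(v, neighbors, queue, done):
        if not neighbors:
            return
        share = (1 - mu) * act[v] / len(neighbors)
        for u, _ in neighbors:
            act[u] += share
            if u not in done:
                heapq.heappush(queue, (-act[u], next(tie), u))
        act[v] *= mu

    for i, nodes in enumerate(origin_sets):
        for n in nodes:
            dist[n][i] = 0.0
            act[n] += 1.0 / len(nodes)
    for n in list(act):
        heapq.heappush(q_in, (-act[n], next(tie), n))
        if all(d < math.inf for d in dist[n]):
            roots.add(n)

    while (q_in or q_out) and len(roots) < k:
        # Expand whichever iterator holds the node with the highest activation.
        backward = bool(q_in) and (not q_out or q_in[0][0] <= q_out[0][0])
        _, _, v = heapq.heappop(q_in if backward else q_out)
        done = done_in if backward else done_out
        if v in done:
            continue
        done.add(v)
        if backward:
            for u, w in rev[v]:
                relax(u, v, w)
            spread(v, rev[v], q_in, done_in)
            if v not in done_out:                   # for future forward exploration
                heapq.heappush(q_out, (-act[v], next(tie), v))
        else:
            for u, w in fwd[v]:
                relax(v, u, w)
            spread(v, fwd[v], q_out, done_out)
    return {r: sum(dist[r]) for r in roots}


paper_graph = [("paper", "w1", 1), ("w1", "sudarshan", 1),
               ("paper", "w2", 1), ("w2", "roy", 1)]
found = bidirectional_search(paper_graph, [{"sudarshan"}, {"roy"}])
```

Because distances can still improve after a root is first found, results are not guaranteed to come out in score order, which is the issue the next slide addresses.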

19
Bidir Search: top-k results
  • Results need not be generated in order
  • Naïve solution
      • Store results in an intermediate heap
      • Output top k results after m·k total results have
        been generated (m = 10)
  • Can do better
      • Compute upper bound on score of next result;
        output answers with a higher score
      • Similar to NRA algorithm (Fagin et al., PODS'01)
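The naïve buffering scheme is easy to sketch in Python. The function name and the (score, answer) stream format are assumptions for illustration; only the "collect m·k results, then output the best k" rule comes from the slide.

```python
import heapq


def naive_top_k(result_stream, k, m=10):
    """Buffer (score, answer) pairs until m*k have been generated (or the
    stream ends), then return the k highest-scoring ones, best first."""
    buffered = []
    for score, answer in result_stream:
        buffered.append((score, answer))
        if len(buffered) >= m * k:
            break                      # stop generating further results
    return heapq.nlargest(k, buffered, key=lambda pair: pair[0])


stream = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d"), (1.0, "e")]
top = naive_top_k(iter(stream), k=2, m=2)
```

In this toy run the best answer "e" arrives as the fifth result, after the buffer of m·k = 4 is already full, so it is missed. That is the weakness the upper-bound approach is meant to remove.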

20
Experiments
  • Datasets
      • DBLP, IMDB: 2 million nodes, 9 million edges
      • US Patent DB: 4 million nodes, 15 million edges
  • Workload
      • Keywords randomly picked from results of SQL join
        statements
  • Search algorithms
      • MI-Bkwd: original backward search
          • Iterator for every node matching a keyword
      • SI-Bkwd: backward search with single backward
        iterator
      • Bidirec: bidirectional search
  • Time taken / nodes explored
      • Measured when 10th answer is generated (or last
        answer if # answers < 10)
  • Origin size
      • # nodes matched by keywords in the query

21
Experiments (2)
  • MI-Bkwd versus SI-Bkwd
  • SI-Bkwd gain increases with origin size,
    # keywords

22
Experiments (3)
  • SI-Bkwd versus Bidirec
  • Bidirec gain increases with origin size,
    # keywords

23
Experiments (4)
  • Precision/Recall experiments
  • Relevant answers are well-defined: they can be
    generated through SQL statements
  • Both MI-Backward and Bidirectional show similar
    performance
      • Recall: 100%
      • Precision: 100% at near-full recall
      • Few irrelevant answers produced before generating
        all relevant answers
  • Bidirectional runs faster, yet with minimal loss of
    relevant results!

24
Experiments (5)
  • Comparison with Sparse
  • [Hristidis et al., VLDB'03]
  • Generate join expressions leading to query
    results
  • Use DB-provided scores for ranking tuples and
    aggregate them to rank answer trees
  • For top-k results automatically determine
    required number of join expressions
  • Sparse-LB
  • Manually generate required join expressions
  • Sparse needs to do at least this much (and
    usually a lot more!)
  • Bidirectional versus Sparse-LB
  • Bidirectional outperforms by a factor of 3
    (esp. when # joins is large)

25
Experiments (6)
  • SI-Bkwd versus Bidirec by origin size
  • Bidirec gains more with unbalanced origin sizes

A (T,S,S,S)   B (M,M,M,M)   C (M,L,L,L)   D (M,M,L,L)
E (T,L,L,L)   F (T,S,M,L)   G (T,M,L,L)   H (T,T,T,L)
26
Discussion
  • Bidirectional search as dynamic per-tuple join
    ordering
  • Related work in this area: Eddies
  • Bidirectional search
  • Schema-less
  • Prioritization based on activation instead of
    selectivity
  • Generate answers in relevance order

27
Related Work
  • Keyword querying on relational data: Discover
    (UCSD), DBXplorer (Microsoft)
  • Use SQL generation, without in-memory data
    structures
  • Issues: generate join plans, re-use common
    sub-expressions, etc.
  • Keyword querying on XML
  • XRank (Cornell), Schema-Free XQuery (Michigan),
  • Tree model is too limited
  • ObjectRank

28
Conclusions
  • Graph model
  • Convenient common denominator representation
  • Schema-free querying leads to graph search
  • Purely backward strategy inadequate
  • Bidirectional search with spreading activation
    performs much better
  • Dynamically choose join order on per-tuple basis

29
Thank You!
Questions??
30
Future of Keyword Search in DBs
  • Next generation of intelligent search will
    require context information
  • E.g. search email, files, calendar, ..
  • Information integration will be important
  • Graph structured data will be a key component
  • Is there a killer app?
  • Deep web?
  • Display of answers
  • Users don't want to see schema details
  • Can we leverage off existing (Web) apps?

31
BANKS Future Work
  • Applications of BANKS
  • Soumen Chakrabarti, Sunita Sarawagi and students
  • Exploit BANKS to integrate different sources of
    data
  • Extract information, Infer soft links
  • BANKS for personal information management
  • SPIN: Search Personal Information Networks
  • Ongoing/future work on BANKS
  • More sysadmin/user control on ranking
  • One size does not fit all
  • BANKS provides infrastructure
  • Characterize bidirectional search better
  • And find other applications
  • Security

32
Bidir Search: top-k results (2)
  • Compute upper bound on score of next result;
    output answers with a higher score
  • Computing the bound
      • m_i = minimum path length explored backward from
        keyword i
      • Unseen answer node: 1/(m_1 + m_2 + ... + m_n)
      • Visited answer node: suppose reached from first x
        keywords with distance d_i
          • 1/[(d_1 + d_2 + ... + d_x) + (m_{x+1} + m_{x+2} + ... + m_n)]
      • Combine this with max node prestige
  • We simply use 1/(m_1 + m_2 + ... + m_n)!
  • Experiments show no significant loss in using
    this heuristic
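The two bounds can be worked through with a short Python sketch. It assumes the operators lost from the slide text are sums, which matches the edge score 1/(Σ edge weights) used earlier; the function names are hypothetical, and the combination with node prestige is left out.

```python
import math


def unseen_bound(m):
    """Upper bound on the edge score of any answer rooted at an unseen node.
    m[i] is the minimum path length explored backward from keyword i."""
    total = sum(m)
    return math.inf if total == 0 else 1.0 / total


def visited_bound(d, m):
    """Bound for a node already reached from the first x = len(d) keywords
    at distances d; the remaining keywords contribute at least m[x:]."""
    total = sum(d) + sum(m[len(d):])
    return math.inf if total == 0 else 1.0 / total


frontier = [1, 2, 1]                      # m_1, m_2, m_3
loose = unseen_bound(frontier)            # 1 / (1 + 2 + 1)
tight = visited_bound([3], frontier)      # 1 / (3 + 2 + 1)
```

Since each known distance d_i is at least m_i, the visited-node bound is never larger than the unseen-node bound, which is why using the simpler 1/(m_1 + ... + m_n) everywhere remains a valid (if looser) upper bound.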