Keyword Search on Graph-Structured Data

Title: Keyword Searching and Browsing in Databases using BANKS. Author: S. Sudarshan. Last modified by: S. Sudarshan. Created: 9/14/2001 9:28:43 AM.

1
Keyword Search on Graph-Structured Data
  • S. Sudarshan, IIT Bombay

Joint work with Soumen Chakrabarti, Gaurav
Bhalotia, Charuta Nakhe, Rushi Desai, Hrishi K.,
Arvind Hulgeri, Bhavana Dalvi and Meghana
Kshirsagar
Jan 2009
2
Outline
  • Motivation and Graph Data Model
  • Query/Answer models
  • Tree answer model
  • Proximity queries
  • Graph Search Algorithms
  • Backward Expanding Search
  • Bidirectional Search
  • Search on external memory graphs
  • Conclusion

3
Keyword Search on Semi-Structured Data
  • Keyword search of documents on the Web has been
    enormously successful
  • Much data is resident in databases
  • Organizational, government, scientific, medical
    data
  • Deep web
  • Goal: querying of data from multiple data
    sources, with different data models
  • Often with no schema or partially defined schema

4
Keyword Search on Structured/Semi-Structured Data
  • Key differences from IR/Web Search
  • Normalization (implicit/explicit) splits related
    data across multiple tuples
  • To answer a keyword query we need to find a
    (closely) connected set of entities that together
    match all given keywords
  • E.g. "soumen crawling" or "soumen byron"

5
Graph Data Model
  • Lowest common denominator across many data models
  • Relational
  • node = tuple, edge = foreign key
  • XML
  • node = element, edge = containment/idref/keyref
  • HTML
  • node = page, edge = hyperlink
  • Documents
  • node = document, edge = links inferred by data
    extraction
  • Knowledge representation
  • node = entity, edge = relationship
  • Network data
  • e.g. social networks, communication networks

6
Graph Data Model (Cont)
  • Nodes can have
  • labels
  • E.g. relation name, or XML tag
  • textual or structured (attribute-value) data
  • Edges can have labels
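
As a rough illustration of the relational mapping above (node = tuple, edge = foreign key), the sketch below builds a tiny labelled graph and matches a keyword against node text and labels. The `Node` class, the tuple ids and the helper names are made up for illustration; they are not taken from BANKS.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One graph node: a label (e.g. relation name) plus its text."""
    node_id: str
    label: str
    text: str
    out_edges: list = field(default_factory=list)  # (edge_label, target_id)


def build_graph(tuples, foreign_keys):
    """tuples: (id, relation, text); foreign_keys: (src_id, edge_label, dst_id)."""
    graph = {tid: Node(tid, rel, text) for tid, rel, text in tuples}
    for src, edge_label, dst in foreign_keys:
        graph[src].out_edges.append((edge_label, dst))
    return graph


def match_keyword(graph, keyword):
    """Ids of nodes whose text or label contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return sorted(n.node_id for n in graph.values()
                  if kw in n.text.lower() or kw in n.label.lower())


graph = build_graph(
    tuples=[("a1", "author", "Soumen C."),
            ("a2", "author", "Byron Dom"),
            ("p1", "paper", "Focused Crawling"),
            ("w1", "writes", ""),
            ("w2", "writes", "")],
    foreign_keys=[("w1", "author", "a1"), ("w1", "paper", "p1"),
                  ("w2", "author", "a2"), ("w2", "paper", "p1")],
)
print(match_keyword(graph, "soumen"))  # ['a1']
```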

7
Outline
  • Motivation and Graph Data Model
  • Query/Answer models
  • Tree answer model
  • Proximity queries
  • Graph Search Algorithms
  • Backward Expanding Search
  • Bidirectional Search
  • Search on external memory graphs
  • Conclusion

8
Query/Answer Models
  • Basic query model
  • Keywords match node text/labels
  • Can extend query model with attribute
    specification, path specifications
  • e.g. paper(year < 2005, title:xquery)
  • Alternative answer models
  • tree connecting nodes matching query keywords
  • nodes in proximity to (near) query keywords

9
Tree Answer Model
  • Answer: rooted, directed tree connecting keyword
    nodes
  • In general, a Steiner tree
  • Multiple answers possible
  • Answer relevance computed from
  • answer edge score combined with
  • answer node score

[Figure: example answer tree for the query "soumen byron": the paper "Focused Crawling" linked through two "writes" tuples to the authors "Soumen C." and "Byron Dom".]
10
Answer Ranking
  • Naïve model: answers ranked by number of edges
  • Problem
  • Some tuples are connected to many other tuples
  • E.g. highly cited papers, popular web sites
  • Highly connected tuples create misleading
    shortcuts
  • six degrees of separation
  • Solution: use directed edges with edge weights
  • allow answer tree to have edge u→v if original
    graph has v→u, but at higher cost

11
Edge Weight Model-1
  • Forward edge weight (edge present in data)
  • Default to 1, can be based on schema
  • Lower weight ⇒ closer connection
  • Create extra backward edges v→u for each edge u→v
    present in data
  • Edge weight ∝ log(1 + # edges pointing to v)
  • Overall answer-tree edge score E_A = 1 / (Σ edge
    weights)
  • Higher score ⇒ better result
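
The weighting above can be sketched in a few lines. This is only an illustration: the slide gives the backward weight up to proportionality, so the constant of 1 and the base-2 logarithm are assumptions, as are the function names.

```python
import math
from collections import defaultdict


def add_backward_edges(forward_edges):
    """forward_edges: list of (u, v, weight). Returns forward plus backward edges.

    Each backward edge v->u gets weight log2(1 + indegree(v)).
    The base and the proportionality constant are assumptions.
    """
    indegree = defaultdict(int)
    for _, v, _ in forward_edges:
        indegree[v] += 1
    edges = list(forward_edges)
    for u, v, _ in forward_edges:
        edges.append((v, u, math.log2(1 + indegree[v])))
    return edges


def edge_score(tree_edge_weights):
    """E_A = 1 / (sum of the edge weights in the answer tree)."""
    return 1.0 / sum(tree_edge_weights)


# Two "writes" tuples point at the same paper, so the paper has in-degree 2
# and each backward edge out of it costs log2(3), more than a forward edge.
edges = add_backward_edges([("w1", "p1", 1.0), ("w2", "p1", 1.0)])
weights = {(u, v): w for u, v, w in edges}
print(weights[("p1", "w1")], edge_score([1.0, 1.0]))
```

With this choice a node with a single incoming edge gets a backward weight of 1, the same as the default forward weight, and the cost grows slowly with in-degree.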

12
Edge Weight Model -2
  • Probabilistic edge scoring model
  • Edge traversal probability (from a given node)
  • Forward: 1/out-degree
  • Backward: 1/in-degree
  • Can be weighted by edge type
  • Path weight: probability of following each
    edge in path
  • Edge score: log(edge traversal probability)
  • Answer-tree edge score E_A = (harmonic) mean of
    path weights from root to each leaf
  • Note
  • other edge weight models possible
  • our search algorithms are independent of how edge
    weights are computed
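
A small worked example of the probabilistic model, under one possible reading: the path weight is taken to be the product of the traversal probabilities along the path, and E_A is the harmonic mean over root-to-leaf paths. The helper names and the degrees used are illustrative assumptions.

```python
import math
from statistics import harmonic_mean


def traversal_probability(degree):
    """Probability of following one particular edge out of `degree` choices."""
    return 1.0 / degree


def path_weight(probabilities):
    """Probability of following every edge on the path (assumed: product)."""
    return math.prod(probabilities)


def tree_edge_score(paths):
    """E_A as the harmonic mean of the root-to-leaf path weights."""
    return harmonic_mean([path_weight(p) for p in paths])


# Root-to-leaf paths of a two-leaf answer tree (degrees are made up).
paths = [
    [traversal_probability(2), traversal_probability(2)],  # weight 0.25
    [traversal_probability(1), traversal_probability(2)],  # weight 0.5
]
score = tree_edge_score(paths)
print(score)
```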

13
Node Weight
  • Node prestige based on indegree
  • More incoming edges ⇒ higher prestige
  • PageRank-style transfer of prestige
  • Node weight computed using a biased random walk
    model
  • Node weight: function of node prestige, other
    optional criteria such as TF/IDF
  • Answer-tree node score N_A = root node weight +
    Σ leaf node weights
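
To make the idea of PageRank-style prestige transfer concrete, here is a minimal power-iteration sketch. It is a plain, unbiased random walk with a damping factor, not the biased walk mentioned above; the damping value, iteration count and function name are assumptions.

```python
def prestige(out_links, damping=0.85, iterations=50):
    """out_links: node -> list of nodes it points to. Returns node -> prestige."""
    nodes = list(out_links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share on each round.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node without outgoing links spreads its prestige evenly.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# p1 is cited by p2 and p3, so it should end up with the highest prestige.
rank = prestige({"p1": [], "p2": ["p1"], "p3": ["p1"]})
print(sorted(rank, key=rank.get, reverse=True))
```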

14
Overall Tree Answer Score
  • Overall score of answer tree A
  • combine tree and node scores
  • for details, and recall/precision metrics see
    BANKS papers in ICDE 2002 and VLDB 2005
  • Anecdotal results on DBLP Bibliography
  • "transaction": Jim Gray's classic paper and
    textbook at the top because of prestige (# of
    citations)
  • "soumen sudarshan": several coauthored papers,
    links via common co-authors
  • "goldman shivakumar hector": the VLDB 98
    proximity search paper, followed by
    citation/co-author connections

15
Answer Models
  • Tree Answer Model
  • Proximity (near query) model

16
Proximity Queries
  • Node weight by proximity
  • author (near olap) (on DBLP)
  • faculty (near earthquake) (on IITB thesis
    database)
  • Node prestige is higher if the node is close to
    multiple nodes matching the near keywords
  • Example applications
  • Finding experts on a particular area

[Figure: example for author (near olap), showing the authors Widom and Raghu and the papers "OLAP over uncertain ..", "Computing sparse cubes", "Overview of OLAP" and "Allocation in OLAP".]
17
Proximity via Spreading Activation
  • Idea
  • Each near keyword has activation of 1
  • Divided among nodes matching keyword,
    proportional to their node prestige
  • Each node
  • keeps fraction 1-µ of its received activation and
  • spreads fraction µ amongst its neighbors
  • Combine activation a_i received from neighbors
  • a = 1 − ∏(1 − a_i) (belief function)
  • Graph may have cycles
  • Iterate till convergence
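
The steps above can be sketched as a simple message-passing loop. Several details are assumptions made for the sake of a runnable example: activation is split equally among neighbors, propagation stops once a message falls below a small threshold `eps` (standing in for "iterate till convergence"), and all names are hypothetical.

```python
from collections import defaultdict, deque


def spread_activation(neighbors, seeds, prestige, mu=0.5, eps=1e-4):
    """neighbors: node -> list of nodes; seeds: nodes matching the near keyword."""
    total = sum(prestige[s] for s in seeds)
    # The keyword's activation of 1 is divided in proportion to prestige.
    queue = deque((s, prestige[s] / total) for s in seeds)
    kept = defaultdict(float)
    while queue:
        node, a = queue.popleft()
        # Keep fraction (1 - mu), combined with earlier activation
        # using the belief function a = 1 - prod(1 - a_i).
        kept[node] = 1.0 - (1.0 - kept[node]) * (1.0 - (1.0 - mu) * a)
        nbrs = neighbors.get(node, [])
        if nbrs:
            share = mu * a / len(nbrs)  # spread fraction mu to neighbors
            if share > eps:
                queue.extend((n, share) for n in nbrs)
    return dict(kept)


neighbors = {"olap1": ["widom"], "olap2": ["widom", "raghu"]}
activation = spread_activation(
    neighbors, seeds=["olap1", "olap2"], prestige={"olap1": 1.0, "olap2": 1.0})
print(activation["widom"] > activation["raghu"])  # True
```

Because every hop multiplies the message by at most mu, messages shrink geometrically and the loop terminates even when the graph has cycles.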

18
Example Answers
  • Anecdotal results on DBLP Bibliography
  • author (near recovery): Dave Lomet, C. Mohan, etc.
  • sudarshan (near change): Sudarshan Chawathe
  • sudarshan (near query): S. Sudarshan
  • Queries can combine proximity scores with tree
    scores
  • hector sudarshan (near query) vs. hector
    sudarshan
  • author (near transactions) data integration

19
Related Work
  • Proximity Search
  • Goldman, Shivakumar, Venkatasubramanian and
    Garcia-Molina (VLDB98)
  • Considers only shortest path from each node,
    aggregates across nodes
  • Our version aggregates evidence from alternative
    paths
  • E.g. author (near Surajit Chaudhuri)
  • ObjectRank (VLDB04)
  • Similar idea to ours, precomputed

20
Related Work
  • Keyword querying on relational databases
  • DBXplorer (Microsoft, ICDE02), DISCOVER (UCSD,
    VLDB02, VLDB03)
  • Use SQL generation, not applicable to arbitrary
    graphs
  • ranking based only on nodes/edges
  • Keyword querying on XML: tree model
  • XRank (Cornell, SIGMOD03), proximity in XML (AT&T
    Research, VLDB03), Schema-Free XQuery (Michigan,
    VLDB04)
  • Tree model is too limited
  • Keyword querying on XML: graph model
  • XKeyword (UCSD, ICDE03, VLDB03), SphereSearch
    (Max Planck, VLDB05)
  • ranking based only on nodes/edges

21
Outline
  • Motivation and Graph Data Model
  • Query/Answer models
  • Tree answer model
  • Proximity queries
  • Graph Search Algorithms
  • Backward Expanding Search
  • Bidirectional Search
  • Search on external memory graphs
  • Conclusion

22
Finding Answer Trees
  • Backward Expanding Search Algorithm (Bhalotia et
    al, ICDE02)
  • Intuition: find vertices from which a forward
    path exists to at least one node from each S_i
    (the set of nodes matching keyword i)
  • Run concurrent single source shortest path
    algorithm from each node matching a keyword
  • Create an iterator for each node matching a
    keyword
  • Traverse the graph edges in reverse direction
  • Output next nearest node on each get-next() call
  • Do best-first search across iterators
  • Output an answer when its root has been reached
    from each keyword
  • Answer heap to collect and output results in
    score order
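
A simplified sketch of the idea follows. To stay short it departs from the slide in stated ways: it runs one multi-source Dijkstra per keyword rather than one iterator per keyword node, runs each search to completion instead of interleaving them best-first, and ranks roots by the summed path cost to the keywords. The function names and the example weights are assumptions.

```python
import heapq


def backward_distances(reverse_adj, sources):
    """Dijkstra over reversed edges: cost from each node to its nearest source."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for prev, w in reverse_adj.get(node, []):
            nd = d + w
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist


def answer_roots(edges, keyword_nodes):
    """edges: (u, v, weight); keyword_nodes: one set of matching nodes per keyword."""
    reverse_adj = {}
    for u, v, w in edges:
        reverse_adj.setdefault(v, []).append((u, w))
    per_keyword = [backward_distances(reverse_adj, s) for s in keyword_nodes]
    # A root must have been reached from every keyword.
    roots = set(per_keyword[0])
    for dist in per_keyword[1:]:
        roots &= set(dist)
    return sorted((sum(d[r] for d in per_keyword), r) for r in roots)


edges = [("p1", "w1", 1.58), ("w1", "a1", 1.0),
         ("p1", "w2", 1.58), ("w2", "a2", 1.0)]
print(answer_roots(edges, [{"a1"}, {"a2"}]))
```

In this toy graph only the paper node reaches both authors, so it is the single answer root.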

23
Backward Expanding Search
[Figure: backward expanding search for the query "soumen byron", over the paper "Focused Crawling", its "writes" tuples, and the authors "Soumen C." and "Byron Dom".]
24
Backward Exp. Search Limitations
  • Too many iterators (one per keyword node)
  • Solution: single iterator per keyword (SI-Bkwd
    search)
  • tracks shortest path from node to keyword
  • Changes answer set slightly
  • Different justifications for same root may be
    lost
  • Not a big problem in practice
  • Nodes explored for different keywords can vary
    greatly
  • E.g. "mining" or "query" vs "knuth"
  • High fan-out when traversing backwards from some
    nodes
  • Connection with join ordering
  • Similar to traversing backwards from all
    relations that have selections

25
Bidirectional Search: Motivation
26
Bidirectional Search: Intuition
  • First cut solution
  • Don't expand backward if keyword matches many
    nodes
  • Instead explore forward from other keywords
  • Problems
  • Doesn't deal with high fan-out during search
  • What should the cutoff for not expanding be?
  • Better solution (Kacholia et al., VLDB 2005)
  • Perform forward search from all nodes reached
  • Prioritize expansion of nodes based on
  • path weight (as in backward expanding search)
  • spreading activation
  • to penalize frequent keyword and bushy trees

27
Bidirectional Search Example
[Figure: bidirectional search example for the query "harper divesh olap", with keyword nodes OLAP, Divesh and Harper.]
28
Bidirectional Search (1)
  • Spreading activation to prioritize backward
    search
  • (Different from spreading activation for near
    queries)
  • Lower weight edges get higher share of activation
  • Nodes prioritized by sum of activations
  • Single forward iterator

29
Bidirectional Search (2)
  • Forward search iterator
  • Forward search from all nodes reached by backward
    search
  • Track best forward path to each keyword
  • Initially infinite cost
  • Whenever this changes, propagate cost change to
    all affected ancestors

[Figure: cost propagation example with keyword nodes k1 and k2; node labels (2,8), (2,2), (8,8), (8,2).]
30
Bidirectional Search (3)
  • On each path length update (due to backward or
    forward search)
  • Check if node can reach all keywords
  • If so, add it to output heap
  • When to output nodes from heap
  • For each keyword K_i, track M_i
  • M_i = minimum path length to K_i among all
    yet-to-be-explored nodes in backward search tree
  • Edge score bounds
  • What is the best possible edge score of a future
    answer?
  • Bounds similar to NRA algorithm (Fagin)
  • Cheaper bounds (e.g. 1/Max(M_i)) or heuristics
    (e.g. 1/Sum(M_i)) can be used
  • Output answer if its score is > overall score
    upper bound for future answers
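
The output rule can be illustrated with the cheaper bound 1/Max(M_i). This is a sketch under stated assumptions: answers carry only an edge score (node scores are ignored), the score is the 1/(sum of edge weights) form from the first edge weight model, and `release_answers` is a hypothetical helper name.

```python
import heapq


def release_answers(candidates, frontier_min):
    """candidates: (edge_score, root) pairs; frontier_min: [M_1, ..., M_k].

    Returns the answers that already beat the bound, best first, and the bound.
    """
    worst = max(frontier_min)
    # Any future answer has total weight of at least max(M_i).
    bound = 1.0 / worst if worst > 0 else float("inf")
    heap = [(-score, root) for score, root in candidates]  # max-heap via negation
    heapq.heapify(heap)
    released = []
    while heap and -heap[0][0] > bound:
        neg_score, root = heapq.heappop(heap)
        released.append((-neg_score, root))
    return released, bound


released, bound = release_answers(
    [(0.5, "r1"), (0.2, "r2"), (0.3, "r3")], frontier_min=[2.0, 4.0])
print(released, bound)  # [(0.5, 'r1'), (0.3, 'r3')] 0.25
```

The answer with score 0.2 stays in the heap because a better answer could still be found.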

31
Performance
  • Worst case complexity: polynomial in size of
    graph
  • But for typical (average) case, even linear is
    too expensive
  • Intuition: typical query should access only a
    small part of graph
  • Studied experimentally
  • Datasets: DBLP, IMDB, US Patent
  • Queries: manually created
  • Typical cases
  • < 1 second to generate answer
  • 10K-100K nodes explored

32
Performance Results
  • Two versions of backward search
  • Iterator per node (MI-Bkwd) vs iterator per
    keyword (SI-Bkwd)
  • Origin size: number of nodes matching keywords

[Figure: time ratio MI/SI]
  • Very minor loss in recall

33
Performance Results
  • SI-Bkwd versus Bidirectional search
  • Bidirectional search gain increases with origin
    size, keywords

34
Related Work (1)
  • Publish as document approach
  • Gather related data into a (virtual) document and
    index the document (Su/Widom, IDEAS05)
  • Positives
  • Avoids run-time graph search
  • Works well for a class of applications
  • E.g. bibliographic data → DBLP page per author
  • Negatives
  • Not all connections can be captured
  • Duplication of data across multiple documents
  • High index space overhead

35
Related Work (2)
  • DPBF (Ding et al., ICDE07)
  • dynamic programming technique
  • exact for top-1 answer, heuristic for top-k
  • BLINKS (He et al., SIGMOD07)
  • Round-robin expansion across iterators
  • Optimal within a factor of m, with m keywords
  • Forward index: node to keyword distance
  • Used instead of searching forward
  • single-level index: impractically large space
  • bi-level index: > main memory ⇒ IO efficiency?

36
Outline
  • Motivation and Graph Data Model
  • Query/Answer models
  • Tree answer model
  • Proximity queries
  • Graph Search Algorithms
  • Backward Expanding Search
  • Bidirectional Search
  • Search on external memory graphs
  • Conclusion

37
External Memory Graph Search
  • Graph representation quite efficient
  • Requires < 20 bytes per node/edge
  • Problem: what if graph size > memory?
  • Alternative 1: virtual memory
  • thrashing
  • Alternative 2 (for relational data): SQL
  • not good for top-K answer generation across
    multiple SQL queries
  • Alternative 3: use compressed graph
    representation to reduce IO
  • Dalvi et al, VLDB 2008

38
Supernodes and Superedges
39
Multi-granular Graph
  • Dumb algorithm
  • search on supernode graph
  • get kF answers, expand their supernodes into
    memory, search on resultant graph
  • no guarantees on answers
  • Better idea: use multi-granular graph
  • Supernode graph in memory
  • Some nodes expanded
  • Expanded nodes are part of cache
  • Algorithms on multi-granular graph (coming up)

40
Multi-granular Graph
41
Expanding Nodes
  • Key idea: the edge score of an answer containing a
    supernode is a lower bound on the actual edge score of
    any corresponding real answer

42
Iterative Search
  • Iterative search on multi-granular graph
  • Repeat
  • search on current multi-granular graph using any
    search algorithm, to find top results
  • expand super nodes in top results
  • Until top K answers are all pure
  • Guarantees finding top-K answers
  • Very good IO efficiency
  • But high CPU cost due to repeated work
  • Details: nodes expanded above are never evicted from
    the virtual memory cache

43
Incremental Search
  • Idea: when a node is expanded, incrementally update
    the state of the search algorithm to reflect the change in
    the multi-granular graph
  • Run search algorithm until top K answers are all
    pure
  • Currently implemented for backward search
  • Modifies the state of the Dijkstra shortest path
    algorithm used by backward search
  • One shortest path search iterator per keyword
  • SPI tree: shortest path iterator tree

44
Incremental Search (1)
SPI tree for k1
45
Incremental Search (2)
46
Incremental Search (3)
47
External Memory Search Performance
48
External Memory Search Performance
  • Supernode graph very effective at minimizing IO
  • Cache misses with incremental are often fewer than
    the number of nodes matching keywords
  • Iterative has high CPU cost
  • VM (backward search with cache as virtual memory)
    has high IO cost
  • Incremental combines low IO cost with low CPU cost

49
Conclusions
  • Keyword search on graphs continues to grow in
    importance
  • E.g. graph representation of extracted knowledge
    in YAGO/NAGA (Max Planck)
  • Ranking is critical
  • Edge and node weights, spreading activation
  • Efficient graph search is important
  • In-memory and external-memory

50
Ongoing/Future Work
  • External memory graph search
  • Compression ratios for supernode graph for
    DBLP/IMDB: factor of 5 to 10
  • Ongoing work on graph clustering shows good
    results
  • Graph search in a parallel cluster
  • Goal: search integrated WWW/Wikipedia graph
  • New search algorithms
  • Integration with existing applications
  • To provide more natural display of results,
    hiding schema details
  • Authorization

51
BANKS References
  • Keyword Searching and Browsing in Databases using
    BANKS, Gaurav Bhalotia, Arvind Hulgeri, Charuta
    Nakhe, Soumen Chakrabarti, S. Sudarshan, ICDE
    2002
  • User Interaction in the BANKS System (demo
    paper), B. Aditya, Soumen Chakrabarti, Rushi
    Desai, Arvind Hulgeri, Hrishikesh Karambelkar,
    Rupesh Nasre, Parag, S. Sudarshan, ICDE 2003
  • Bidirectional Expansion for Keyword Search on
    Graph Databases, Varun Kacholia, Shashank
    Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi
    Desai and Hrishikesh Karambelkar, VLDB 2005
  • Keyword Search on External Memory Data
    Graphs, Bhavana Dalvi, Meghana Kshirsagar and S.
    Sudarshan, VLDB 2008

52
Thanks!
53
Time and Nodes Explored
[Figure: nodes explored and time for bidirectional search (BiDir Nodes, BiDir Time)]
54
Screenshots (1)
  • author (near recovery)

55
Near Queries with Multiple Keywords
  • Spread activation from each keyword separately
  • Then combine the activations from different
    keywords
  • OR: use addition or belief combination
  • AND: take product of activations
  • Gives better results

56
The BANKS System
[Figure: system diagram with the components User, Web Server + Servlets, BANKS and Database, connected via HTTP and JDBC.]
  • Available on the web, with DBLP, IMDB and IITB
    ETD data
  • http://www.cse.iitb.ac.in/banks/
  • No programming needed for customization
  • Minimal preprocessing to create indices and give
    weights to links
  • Provides keyword search coupled with extensive
    browsing features
  • Schema browsing + data browsing
  • Hyperlinks are automatically added to all
    displayed results
  • Browsing data by grouping and creating crosstabs
  • Graphical display of data: bar charts, pie
    charts, etc.

57
BANKS Architecture
  • Data resident on disk
  • Graph structure of data resident in memory
  • Nodes and edges with their types/counts
  • 16×|V| + 8×|E| bytes
  • Search done in memory
  • Why?
  • Allows us to use interesting graph traversal
    based algorithms without being constrained by SQL
    and related performance issues
  • With current memory sizes, database graphs for
    most applications will fit in memory
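
As a quick back-of-the-envelope check of the in-memory claim, the following sketch evaluates the size formula, read here as 16 bytes per node plus 8 bytes per edge. The node and edge counts are invented for illustration and do not describe any of the datasets in the talk.

```python
def graph_memory_bytes(num_nodes, num_edges, node_bytes=16, edge_bytes=8):
    """In-memory graph size, assuming the formula reads 16*|V| + 8*|E| bytes."""
    return node_bytes * num_nodes + edge_bytes * num_edges


# Hypothetical graph: 2 million nodes and 9 million edges.
size = graph_memory_bytes(2_000_000, 9_000_000)
print(size / 1e6, "MB")  # 104.0 MB
```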

58
Probabilistic Edge Score Model (2)
  • Paths from root to leaves are considered
    separately, even if they share edges
  • More efficient search algorithms with this model
    (He et al., SIGMOD07)