Title: Keyword Querying on SemiStructured Data


1
Keyword Querying on Semi-Structured Data
  • S. Sudarshan, IIT Bombay

Joint work with Soumen Chakrabarti and a large number of past and current students
http://www.cse.iitb.ac.in/banks/
2
Keyword Search on Semi-Structured Data
  • A significant fraction of data is resident in
    relational databases or in semi-structured data
    (XML)
  • Organizational, government, scientific, medical
    data
  • Goal: ad hoc/exploratory database querying using
    keywords
  • SQL/XQuery is not appropriate for casual users
  • Form interfaces are cumbersome
  • They require a separate form for each type of query
  • They are not suitable for ad hoc queries

3
Keyword Search on Semistructured Data (Cont.)
  • Differences from IR/Web Search
  • Normalization splits related data across multiple
    tuples
  • In XML/HTML, edges represent connections between
    different nodes
  • To answer a keyword query we need to find a
    (closely) connected set of tuples/nodes that
    together match all given keywords

4
Graph Representation of Data
  • Database modeled as a graph
  • Nodes: tuples
  • Directed edges for foreign key, inclusion
    dependencies, etc.
  • Information integration: graph representation of
    integrated information plus keyword querying
  • Can model relational, XML, HTML, .., data in a
    single graph

5
Graph Data Model (2)
  • E.g., XML data

    <proceedings conf="VLDB" year="2009">
      <paper id="1">
        <title>Recovering from Query Optimization</title> ...
      </paper>
      <paper id="2">
        <title>Concurrency Control for Keyword Search</title> ...
        <cite ref="1">Recovering from Query Optimization</cite>
      </paper>
    </proceedings>

6
Answer Model
  • Answer: minimal rooted directed tree connecting
    nodes containing keywords
  • Undirected tree: Discover, DBXplorer, ..
  • Multiple answers possible
  • Answer relevance computed from answer edge
    score combined with answer node score
    (prestige)

E.g., Sudarshan Roy
7
Edge Directionality
  • Some popular tuples are connected to many other
    tuples
  • Paper1 → vldb06 ← paper2
  • Students → departments → university
  • Popular tuples would create misleading shortcuts
  • E.g. every student would be closely linked with
    every other student via the department/university

8
Edge Weight Model in BANKS
  • Idea: define different forward and backward edge
    weights
  • Forward edge weight (in direction of foreign key
    ref/XML containment)
  • Defaults to 1, can be based on schema
  • E.g., citation link weight = 10, writes link
    weight = 5
  • Backward edge u→v weight (where the foreign key is v→u)
  • Proportional to # edges pointing to u
  • Log scaling
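
The forward/backward weighting idea can be sketched as follows. This is only an illustration, not the BANKS implementation: the function name `build_weighted_graph` and the exact form `log2(1 + indegree)` are assumptions chosen to match "proportional to the number of edges pointing to u" combined with "log scaling".

```python
import math
from collections import defaultdict

def build_weighted_graph(fk_edges, forward_weight=None):
    """fk_edges: (v, u) pairs meaning tuple v references tuple u.
    forward_weight: optional dict (v, u) -> schema-based forward weight.
    Returns dict (src, dst) -> weight, containing both directions."""
    fk_edges = list(fk_edges)
    forward_weight = forward_weight or {}
    indegree = defaultdict(int)
    for _, u in fk_edges:
        indegree[u] += 1

    weights = {}
    for v, u in fk_edges:
        w_fwd = forward_weight.get((v, u), 1.0)  # forward weight defaults to 1
        weights[(v, u)] = w_fwd
        # Backward edge u -> v grows with the number of edges into u,
        # damped by a logarithm (assumed formula, see lead-in).
        weights[(u, v)] = w_fwd * math.log2(1 + indegree[u])
    return weights

w = build_weighted_graph([("paper1", "vldb06"),
                          ("paper2", "vldb06"),
                          ("paper3", "vldb06")])
```

With three papers pointing at one conference node, each forward edge keeps weight 1 while each backward edge out of the conference gets weight log2(4) = 2, so paths through the popular node become more expensive.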

9
Node Prestige in BANKS
  • Node weight (prestige) based on indegree
  • More incoming edges => higher prestige
  • Google PageRank style transfer of prestige
  • Node weight computed using biased random walk
    model
  • Bias based on edge type, direction
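
A biased random walk of this kind can be approximated by power iteration. The sketch below is a generic PageRank-style computation under stated assumptions: the damping factor 0.85, the uniform teleport, and the per-edge `bias` values are illustrative and are not taken from the slides.

```python
def prestige(nodes, out_edges, damping=0.85, iters=50):
    """nodes: list of node ids.
    out_edges: dict node -> list of (neighbor, bias), bias > 0, where the
    bias could encode edge type and direction.
    Returns dict node -> stationary probability of the biased walk."""
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}  # teleport share
        for v in nodes:
            edges = out_edges.get(v, [])
            total = sum(b for _, b in edges)
            if total == 0:
                # Dangling node: distribute its mass uniformly.
                for u in nodes:
                    nxt[u] += damping * p[v] / n
            else:
                # Follow each edge with probability proportional to its bias.
                for u, b in edges:
                    nxt[u] += damping * p[v] * b / total
        p = nxt
    return p

scores = prestige(["a", "b", "c"],
                  {"a": [("c", 1.0)], "b": [("c", 1.0)]})
```

Node `c` has two incoming edges and ends up with the highest prestige, and the scores always sum to 1.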

10
Response Ranking
  • Edge score $E_A$
  • Smaller tree => higher score
  • E.g., BANKS: $E_A = 1 / \sum (\text{edge weights})$
  • Variant
  • Score of path from root to leaf = probability of
    random walk from root reaching that leaf
  • Tree score = product of leaf path scores
  • Node score $N_A$
  • Measure of authority of nodes in tree
  • E.g., BANKS: $N_A = \sum_{\text{leaf/root nodes}} \log(\text{node authority})$
  • Overall score: $f(E_A, N_A)$
  • E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A^{\lambda}$
  • $\lambda = 0.2$ works well
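
A minimal sketch of this ranking function, reading the slide as: edge score = 1 / (sum of edge weights), node score = sum of log authority over the root and leaf nodes, and overall score = edge score times node score raised to the power 0.2. The function name is hypothetical, and the code assumes the tree has at least one edge and that authority values are large enough for the log sum to be positive.

```python
import math

def tree_score(edge_weights, node_authorities, lam=0.2):
    """edge_weights: weights of the edges in the answer tree (non-empty).
    node_authorities: authority values of the root and leaf nodes.
    Returns E_A * N_A ** lam."""
    edge_score = 1.0 / sum(edge_weights)                     # smaller tree, higher score
    node_score = sum(math.log(a) for a in node_authorities)  # assumed positive
    return edge_score * node_score ** lam

small = tree_score([1.0, 1.0], [math.e, math.e])
large = tree_score([1.0, 1.0, 2.0], [math.e, math.e])
```

With the same nodes, the tree with the smaller total edge weight (`small`) scores higher than `large`. The small exponent means node authority only mildly adjusts the edge-based score.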

11
The BANKS System
[Figure: BANKS system diagram. Components: User, Web Server + Servlets (BANKS), Database, XML data source. Connection labels: HTTP, JDBC, preprocess.]
  • Available on the web, with DBLP, IMDB and IITB
    ETD data
  • http://www.cse.iitb.ac.in/banks/
  • Preprocessing to create indices and give weights
    to links
  • Provides keyword search and browsing features

12
BANKS Architecture
  • Data resident on disk
  • Graph representation of data resident in memory
  • Nodes and edges with their types/counts
  • $16 \times |V| + 8 \times |E|$ bytes
  • Search done in memory
  • Why in memory?
  • Allows us to use interesting graph traversal
    based algorithms without being constrained by SQL
    and related performance issues
  • With current memory sizes, database graphs for
    many applications will fit in server memory
  • External memory search: ongoing work
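
As a rough check on the in-memory claim, the sketch below evaluates the graph size, reading the size expression on this slide as 16 bytes per node plus 8 bytes per edge (an assumption about the garbled formula). The graph sizes are the DBLP/IMDB and US Patent figures quoted in the experiments section of this talk.

```python
def graph_bytes(num_nodes, num_edges, node_bytes=16, edge_bytes=8):
    """Estimated in-memory graph size under the assumed per-node and
    per-edge costs."""
    return node_bytes * num_nodes + edge_bytes * num_edges

for name, v, e in [("DBLP/IMDB", 2_000_000, 9_000_000),
                   ("US Patent", 4_000_000, 15_000_000)]:
    print(f"{name}: {graph_bytes(v, e) / 1e6:.0f} MB")
```

Under these assumptions both graphs need well under 200 MB, which is consistent with the argument that many database graphs fit in server memory.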

13
Related Work
  • Keyword querying on relational databases
  • DBXplorer (ICDE'02), Discover (VLDB'02)
  • Use SQL generation
  • BANKS (ICDE'02; G. Bhalotia, Charuta N., A.
    Hulgeri, Soumen C., S. Sudarshan)
  • pays more attention to result ranking
  • does not require schema
  • Keyword querying on XML
  • Tree model (answer based on containment edges)
  • XRank (Cornell), proximity in XML (AT&T Research)
  • Schema-Free XQuery (Michigan)
  • Tree model cannot handle arbitrary graph edges
  • Graph model
  • SphereSearch (VLDB 2005)
  • Generates XML tags to represent context
  • Query can specify keyword context
  • Does not explore edge weights

14
Proximity (Near) Queries
15
Proximity Queries
  • Node weight by proximity
  • E.g. author (near recovery)
  • Node prestige is higher if the node is close to multiple nodes
    matching the near keywords
  • Example applications
  • Finding experts on a particular area
  • faculty (near earthquake)
  • Author (near recovery)

[Figure: example graph; node labels "Mohan", "Raghu", "Analysis of Earthquake ..", "Earthquake Resistant", "Earthquake Measurement", "Building Earthquake"]
16
Proximity via Spreading Activation
  • Idea
  • Each near keyword has activation of 1
  • Divided among nodes matching keyword,
    proportional to their node prestige
  • Each node
  • keeps fraction 1-µ of its received activation and
  • spreads fraction µ amongst its neighbors
  • Graph may have cycles
  • Combine activation received from neighbors
  • $a = 1 - (1 - a_1)(1 - a_2)$ (belief function)
  • Additive combination ($a_1 + a_2$) may diverge with
    cycles
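
The difference between the two combining rules is easy to see numerically. In this small sketch (function names are illustrative), the same activation of 0.5 is fed back into a node 100 times, as could happen around a cycle.

```python
def combine_belief(a1, a2):
    """Belief-style combination: a = 1 - (1 - a1)(1 - a2); stays within [0, 1]."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

def combine_additive(a1, a2):
    """Plain addition: unbounded when activation keeps arriving."""
    return a1 + a2

belief = additive = 0.0
for _ in range(100):
    belief = combine_belief(belief, 0.5)
    additive = combine_additive(additive, 0.5)
```

After the loop `belief` has saturated at a value no larger than 1, while `additive` has grown to 50.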

17
Activation Change Propagation
  • Algorithm to incrementally propagate activation
    change $\delta$
  • Nodes to propagate $\delta$ from are in a queue
  • Best-first propagation
  • Propagation to a node already in the queue simply
    modifies its $\delta$ value
  • Stops when $\delta$ becomes smaller than cutoff
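
A minimal sketch of queue-based change propagation. It follows the convention of the proximity slides (a node keeps fraction 1-µ and spreads fraction µ) but, for brevity, accumulates changes additively instead of using the belief function. The function name, the dictionary-based queue, and the default values of `mu` and `cutoff` are assumptions.

```python
def propagate(neighbors, initial, mu=0.5, cutoff=1e-3):
    """neighbors: dict node -> list of neighbor nodes.
    initial: dict node -> initial activation change.
    Returns dict node -> accumulated activation."""
    activation = {}
    pending = dict(initial)  # the queue: node -> pending change
    while pending:
        # Best-first: pick the node with the largest pending change
        # (a linear scan keeps the sketch short; a heap would be faster).
        node = max(pending, key=pending.get)
        delta = pending.pop(node)
        if delta < cutoff:
            break  # every remaining change is smaller still
        activation[node] = activation.get(node, 0.0) + (1.0 - mu) * delta
        nbrs = neighbors.get(node, [])
        for nb in nbrs:
            # A node already in the queue just has its pending change updated.
            pending[nb] = pending.get(nb, 0.0) + mu * delta / len(nbrs)
    return activation

act = propagate({"a": ["b"], "b": ["a"]}, {"a": 1.0})
```

On this two-node cycle the loop terminates because each hop passes on only the fraction µ of the change; `a` ends with roughly 2/3 of the activation and `b` with roughly 1/3.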

18
Near Queries with Multiple Keywords
  • Spread activation from each keyword separately
  • Then combine the activations from different
    keywords
  • OR: use addition or belief combination
  • AND: take product of activations
  • Gives better results
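
Combining per-keyword activations could look like the sketch below. The function name and the dictionary layout are illustrative; OR is shown with the belief combination, AND with the product.

```python
import math

def combine_keywords(per_keyword, mode="AND"):
    """per_keyword: list of dicts node -> activation, one per near keyword,
    each computed by spreading activation separately.
    Returns dict node -> combined activation."""
    nodes = set().union(*per_keyword)
    combined = {}
    for v in nodes:
        acts = [d.get(v, 0.0) for d in per_keyword]
        if mode == "AND":
            combined[v] = math.prod(acts)  # zero unless every keyword reaches v
        else:
            combined[v] = 1.0 - math.prod(1.0 - a for a in acts)  # belief OR
    return combined

k1 = {"x": 0.5, "y": 0.2}
k2 = {"x": 0.4}
and_scores = combine_keywords([k1, k2], "AND")
or_scores = combine_keywords([k1, k2], "OR")
```

Under AND, node `y` drops to zero because only one keyword reaches it, which is one way to see why the product favors nodes close to all of the near keywords.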

19
Proximity and Tree Scores
  • Queries can combine proximity scores with tree
    scores
  • author(near transactions) data integration
  • Related work
  • Goldman et al. (VLDB'98)
  • Considers only shortest path from each node
  • Author (near Surajit Chaudhuri)
  • ObjectRank (VLDB'04)
  • Done independently
  • Precomputed: high space overhead
  • Subsequently extended to IR context in the SPIN
    system

20
Example Answers
  • Anecdotal results on DBLP Bibliography
  • author (near recovery): Dave Lomet, C. Mohan,
    etc.
  • transaction: Jim Gray's classic paper and
    textbook at the top, based on prestige (# of
    citations)
  • Johnson (near OLAP): Theodore Johnson
  • And on IIT Bombay Thesis/Dissertation Database
  • faculty (near earthquake): R.S. Jangid, P.
    Banerji, R. Sinha
  • faculty (near database)

21
Other Query Extensions
  • Restriction of context
  • author:washington vs. state:washington
  • Twig and approximate twig patterns
  • recovery cites optimization

22
Graph Search Algorithms To Find Answer Trees
23
Bidirectional Expansion for Keyword Search on
Graph Databases
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti,
S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
24
Finding Answer Trees
  • Backward Expanding Search
  • BANKS (ICDE'02)
  • Intuition: travel backwards from keyword nodes
    till you hit a common node

Query: sudarshan roy
[Figure: answer tree in which the paper "MultiQuery Optimization" is linked through writes tuples to the authors "Sudarshan" and "Prasan Roy"]
25
Backward Search Algorithm
  • Algorithm
  • Run concurrent single source shortest path
    iterators from each node matching a keyword
  • Traverse the graph edges in reverse direction
  • Output next nearest node on each get-next() call
  • Do best-first search across iterators
  • Output node if in the intersection of sets of
    nodes reached from each keyword
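
The steps above can be sketched as follows. This is a simplified illustration rather than the BANKS code: it reports only answer roots with the sum of their shortest distances to each keyword, does not build the trees, and makes no claim about emitting roots in exact score order. All names and the toy graph (including its unit backward-edge weights) are assumptions.

```python
import heapq
from itertools import count

def dijkstra_reverse(rev_adj, source):
    """Yield (distance, node) in nondecreasing distance order, traversing
    edges in reverse: rev_adj[v] lists (u, w) for each edge u -> v."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        yield d, v
        for u, w in rev_adj.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))

def backward_search(rev_adj, keyword_nodes):
    """keyword_nodes: list of sets of matching nodes, one set per keyword.
    Yields (root, total distance) once a node is reached from every keyword."""
    tie = count()  # tie-breaker so the heap never compares iterators
    frontier = []  # (next distance, tie, keyword index, iterator, node)
    for k, nodes in enumerate(keyword_nodes):
        for s in nodes:  # one shortest-path iterator per matching node
            it = dijkstra_reverse(rev_adj, s)
            d, v = next(it)
            heapq.heappush(frontier, (d, next(tie), k, it, v))
    best, emitted = {}, set()
    while frontier:  # best-first across all iterators
        d, _, k, it, v = heapq.heappop(frontier)
        reached = best.setdefault(v, {})
        reached[k] = min(d, reached.get(k, float("inf")))
        if len(reached) == len(keyword_nodes) and v not in emitted:
            emitted.add(v)
            yield v, sum(reached.values())
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], next(tie), k, it, nxt[1]))

# Toy graph: writes tuples w1, w2 reference a paper and its two authors;
# every foreign-key edge also gets a backward edge (all weights 1 here).
fk = [("w1", "sudarshan"), ("w1", "paper"), ("w2", "roy"), ("w2", "paper")]
rev_adj = {}
for u, v in fk:
    rev_adj.setdefault(v, []).append((u, 1.0))  # forward edge u -> v
    rev_adj.setdefault(u, []).append((v, 1.0))  # backward edge v -> u
first_root = next(backward_search(rev_adj, [{"sudarshan"}, {"roy"}]))
```

Using a generator per origin node mirrors the get-next() interface: each call to `next(it)` produces that iterator's next nearest node.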

26
Backward Search Limitations
  • Wasteful exploration of graph
  • Frequently occurring keywords
  • Hub nodes in the graph (high in-degree)

[Figure: query "Shashank Sudarshan Database" on a graph of author, writes, and paper nodes (schema legend); keyword nodes "Shashank", "Sudarshan", and "Database"]
27
Bidirectional Search Motivation
28
Bidir Search Intuition
  • First cut solution
  • Don't go backward if keyword matches many nodes
  • Don't go backward if node points to a hub
  • Instead explore forward from other keywords

29
Bidir Search Example
[Figure: the same query "Shashank Sudarshan Database" on the author / writes / paper graph, illustrating bidirectional exploration]
30
Bidir Search Issues
  • What should threshold for not expanding be?
  • Our solution: prioritize expansion of nodes based
    on spreading activation,
    to penalize frequent keywords and bushy trees
  • How to manage exploration in both directions?

31
Bidir Search Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Every node given an initial activation
  • Gives low activation to frequently occurring
    keywords

[Figure: keyword "John" matches five nodes, each given initial activation 1/5]
32
Bidir Search Spreading Activation
  • Spreading Activation
  • Node with highest activation explored first
  • Activation spread to neighbors (µ = 0.3)
  • Gives low activation to neighbors of hubs

[Figure: a node with activation 1/5 and four neighbors retains 0.3 x 1/5 and passes 0.7 x 1/5 x 1/4 to each neighbor]
33
Bidir Search Iterators
  • How to manage exploration in both directions?
  • Single backward iterator + single forward
    iterator with suitable data structures
  • E.g., to keep track of parents of nodes

[Figure: example graph with keyword nodes A and B; each node is labeled with (dist from A, dist from B), with ∞ where no path is known yet]
34
Bidir Search Algorithm
  • Algorithm
  • Activate matching nodes and insert them into the backward
    iterator
  • while (iterators are not empty)
  • Choose iterator for expansion in best-first
    manner
  • Explore node with highest activation
  • Spread activation to neighbors
  • Update path weights (and other data structures)
  • Propagate values to ancestors if necessary
  • Insert nodes explored in the backward direction
    into the forward iterator /* for future forward
    exploration */
  • Stop when top-k results are produced

35
Bidir Search top-k results
  • Results need not be generated in-order
  • Naïve solution
  • Store results in an intermediate heap
  • Output top k results after $m \cdot k$ total results have
    been generated ($m = 10$)
  • Can do better
  • Compute upper bound on score of next result;
    output answers with a higher score
  • Similar to NRA algorithm (Fagin et al., PODS'01)
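
The "can do better" idea amounts to buffering answers and releasing only those that no future answer can beat. A minimal sketch follows; the class and method names are hypothetical, and computing the upper bound itself is left to the caller.

```python
import heapq

class OutputBuffer:
    """Holds answers generated out of order and releases them in score order."""

    def __init__(self):
        self._heap = []  # max-heap simulated with negated scores

    def add(self, score, answer):
        heapq.heappush(self._heap, (-score, answer))

    def release(self, upper_bound):
        """Return buffered answers scoring above upper_bound, where
        upper_bound is the best score any not-yet-generated answer can have."""
        out = []
        while self._heap and -self._heap[0][0] > upper_bound:
            neg_score, answer = heapq.heappop(self._heap)
            out.append((-neg_score, answer))
        return out

buf = OutputBuffer()
buf.add(0.5, "t1")
buf.add(0.2, "t2")
buf.add(0.3, "t3")
safe = buf.release(upper_bound=0.25)
```

Here `t1` and `t3` are released in score order, while `t2` stays buffered because a later answer could still outrank it.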

36
Experiments
  • Datasets
  • DBLP, IMDB: 2 million nodes, 9 million edges
  • US Patent DB: 4 million nodes, 15 million edges
  • Workload
  • Keywords randomly picked from results of SQL join
    statements
  • Search algorithms
  • MI-Bkwd: original backward search
  • Iterator for every node matching a keyword
  • SI-Bkwd: backward search with single backward
    iterator
  • Bidirec: bidirectional search
  • Time taken/nodes explored
  • Measured when 10th answer is generated (or last
    answer if # answers < 10)
  • Origin size
  • # nodes matched by keywords in the query

37
Experiments (2)
  • MI-Bkwd versus SI-Bkwd
  • SI-Bkwd gain increases with origin size,
    # keywords

38
Experiments (3)
  • SI-Bkwd versus Bidirec
  • Bidirec gain increases with origin size,
    # keywords

39
Experiments (4)
  • Precision/Recall experiments
  • Relevant answers are well-defined: they can be
    generated through SQL statements
  • Both MI-Backward and Bidirectional show similar
    performance
  • Recall: 100%
  • Precision: 100% at near-full recall
  • Few irrelevant answers produced before generating
    all relevant answers
  • Bidirectional runs faster, yet minimal loss of
    relevant results!

40
Discussion
  • Bidirectional search as dynamic per-tuple join
    ordering
  • Related work in this area: Eddies
  • Unlike Eddies, bidirectional search is
  • Schema-less
  • Priority based on activation instead of
    selectivity
  • Generates answers in relevance order

41
Conclusions
  • Graph model
  • Common denominator representation
  • Multiple types of queries required
  • Near queries, spanning tree queries
  • Ranking is critical
  • Edge and node weights, spreading activation
  • Efficient graph search is critical
  • Bidirectional graph search

42
Ongoing and Future Work
  • Graphs larger than memory
  • Idea: use multi-level graph representation
  • Higher levels are condensed representation of
    lower levels
  • Revised approach to search
  • Search on condensed super-graph (in-memory), to
    find potential answers
  • Expand nodes (disk I/O)
  • Redo search on expanded nodes to find real
    answers

43
Graph Condensation
[Figure: nodes clustered into supernodes S1, S2, S3]
  • Cluster nodes to get supergraph
  • Different clusterings possible, e.g.,
    Raghavan/Garcia-Molina (ICDE 2003) for web graphs
  • Currently building infrastructure and exploring
    techniques

44
Searching in Condensed Graphs
  • Weight of super-edge S1→S2 = min(real edge
    weights between S1 and S2)
  • Issues
  • E.g. minimal answers on full graph may be
    non-minimal on condensed graph
  • Multi-granularity representation
  • Other types of queries on condensed graphs
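
The super-edge weight rule can be sketched as below, reading it as: the weight of a super-edge is the minimum weight of any real edge between the two clusters. The data layout (`edges` as a dict from node pairs to weights, `cluster_of` mapping each node to its supernode) and the function name are illustrative assumptions; the slides do not describe the actual infrastructure.

```python
def condense(edges, cluster_of):
    """edges: dict (u, v) -> weight of the real directed edge u -> v.
    cluster_of: dict node -> supernode id.
    Returns dict (S_u, S_v) -> minimum real edge weight between the clusters."""
    super_edges = {}
    for (u, v), w in edges.items():
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue  # edges inside a supernode vanish in the supergraph
        key = (cu, cv)
        super_edges[key] = min(w, super_edges.get(key, float("inf")))
    return super_edges

supergraph = condense(
    {("a", "b"): 3.0, ("a", "c"): 1.0, ("b", "c"): 2.0},
    {"a": "S1", "b": "S2", "c": "S2"},
)
```

Because the minimum is used and intra-cluster edges cost nothing, a path in the supergraph is never costlier than the corresponding path in the full graph, which suits a first pass that only looks for potential answers.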

45
Further Directions
  • Querying graph representation of integrated data

[Figure: graph fragment with name nodes "S. Sarawagi", "Sunita S.", and "Suneeta S." and edge labels 0.1 and 5]
46
Future of Keyword Search in DBs
  • Next generation of intelligent search will
    require context information
  • E.g. search email, files, calendar, ..
  • Information integration will be important
  • Graph structured data will be a key component
  • Security
  • Is there a killer app?

47
Thank You!
Questions??
48
Experiments (5)
  • Comparison with Sparse
  • Hristidis et al. (VLDB'03)
  • Generate join expressions leading to query
    results
  • Use DB-provided scores for ranking tuples and
    aggregate them to rank answer trees
  • For top-k results automatically determine
    required number of join expressions
  • Sparse-LB
  • Manually generate required join expressions
  • Sparse needs to do at least this much (and
    usually a lot more!)
  • Bidirectional versus Sparse-LB
  • Bidirectional outperforms by a factor of 3
    (esp. when # joins is large)

49
Experiments (6)
  • SI-Bkwd versus Bidirec by origin size
  • Bidirec gains more with unbalanced origin sizes

A: (T,S,S,S); B: (M,M,M,M); C: (M,L,L,L); D: (M,M,L,L);
E: (T,L,L,L); F: (T,S,M,L); G: (T,M,L,L); H: (T,T,T,L)
50
Bidir Search top-k results (2)
  • Compute upper bound on score of next result;
    output answers with a higher score
  • Computing the bound
  • $m_i$ = minimum path length explored backward from
    keyword $i$
  • Unseen answer node: $1 / (m_1 + m_2 + \dots + m_n)$
  • Visited answer node: suppose it is reached from the first $x$
    keywords with distance $d_i$
  • $1 / \big((d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)\big)$
  • Combine this with max node prestige
  • We simply use $1 / (m_1 + m_2 + \dots + m_n)$!
  • Experiments show no significant loss in using
    this heuristic
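
The two edge-score bounds can be written out directly, reading the garbled expressions on this slide as sums of path lengths. The function names are hypothetical, and the code assumes the sums are positive.

```python
def unseen_bound(m):
    """m: list of m_i, the minimum path length explored backward from each
    keyword. Bound for an answer root that has not been seen at all."""
    return 1.0 / sum(m)

def visited_bound(d, m):
    """d: distances d_1..d_x from a visited node to the first x keywords.
    m: frontier minima m_1..m_n for all n keywords.
    Known distances are used where available, frontier minima elsewhere."""
    x = len(d)
    return 1.0 / (sum(d) + sum(m[x:]))

m = [2.0, 3.0, 1.0]
b_unseen = unseen_bound(m)          # 1 / (2 + 3 + 1)
b_visited = visited_bound([4.0], m) # 1 / (4 + 3 + 1)
```

The simplification on the slide corresponds to calling only `unseen_bound`, trading a slightly looser guarantee for less bookkeeping.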

51
Bidirectional Search (1)
  • Single backward search iterator across all
    keywords
  • Unlike per-matched-node in backward exp.
  • Changes answer set slightly
  • Different justifications for same root may be
    lost
  • Didn't find any meaningful answers lost
  • Spreading activation to prioritize backward
    search
  • Activation spread per keyword
  • Nodes prioritized by sum of activations
  • Single forward iterator

52
Bidir. Search (2)
  • For each node in backward iterator
  • dist(u,i): length of best path from u to a node in $S_i$
  • $S_i$: nodes matching keyword $K_i$
  • sp(u,i): next node in shortest path from u to $K_i$
  • a(u,i): activation at u from keyword $K_i$
  • a(u): sum of a(u,i)

53
Bidir. Search (3)
  • Spreading of activation
  • Done separately for each keyword
  • For nodes in $S_i$ (nodes matching keyword $K_i$)
  • initial activation proportional to node prestige
  • total of 1 across nodes in $S_i$
  • Node retains µ fraction of received activation,
    spreads (1-µ) fraction
  • Activation spread from a node V is divided among
    neighbors $U_i$ in inverse proportion to the weight of edge $U_i \to V$
  • Thus incorporates path score too
  • Activation combining function: max
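
One spreading step under this slide's convention (retain µ, spread 1-µ) could look like the sketch below. The function name and argument layout are assumptions; with µ = 0.3, an activation of 1/5, and four equally weighted neighbors it reproduces the 0.3 x 1/5 and 0.7 x 1/5 x 1/4 values shown earlier in the talk.

```python
def spread(activation_v, in_edges, mu=0.3):
    """activation_v: activation received by node V.
    in_edges: list of (U, weight of edge U -> V) for V's neighbors.
    Returns (activation retained by V, dict U -> activation passed to U)."""
    retained = mu * activation_v
    inverse = {u: 1.0 / w for u, w in in_edges}  # lighter edges get more
    total = sum(inverse.values())
    passed = {u: (1.0 - mu) * activation_v * x / total
              for u, x in inverse.items()}
    return retained, passed

def combine(old, incoming):
    """Activation combining function from the slide: max."""
    return max(old, incoming)

kept, out = spread(0.2, [("u1", 1.0), ("u2", 1.0), ("u3", 1.0), ("u4", 1.0)])
```

Dividing in inverse proportion to edge weight means activation flows preferentially along cheap edges, which is how the path score enters the prioritization.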

[Figure: example graph with edge weights (mostly 1, one edge of weight 2) and two keyword nodes, keyword1 and keyword2]
54
Bidir. Search (4)
  • Forward search iterator
  • Forward search from all nodes reached by backward
    search
  • Track best forward path to each keyword
  • Initially infinite cost
  • Whenever this changes, propagate cost change to
    all affected ancestors

55
Bidir. Search (5)
  • On each path length update (due to backward or
    forward search)
  • Check if node can reach all keywords
  • If so, add it to output heap
  • If same undirected tree not already present
  • Output heap deals with out of order answer
    generation
  • When to output nodes from heap
  • For each keyword $K_i$, track $M_i$
  • $M_i$ = minimum path length to $K_i$ among all
    yet-to-be-explored nodes in backward search tree
  • $1 / \max_i(M_i)$ is an upper bound on edge score
  • $1 / \sum_i M_i$ can be used instead, at risk of
    out-of-order answers
  • Use max of node scores to compute overall score
    upper bound
  • Output answer if its score is > overall score
    upper bound

56
Probabilistic Edge Score Model
  • Probabilistic edge scoring model: alternative to
    edge weight model
  • Path weight = product of the probabilities of following each
    edge
  • Edge probability
  • Forward: proportional to 1/out-degree
  • Backward: proportional to 1/in-degree
  • Can have separate in/out degrees by edge type,
    probability of following each edge type
  • Edge score E = (harmonic) mean of path weights
    from leaves to root
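
A small sketch of this scoring model under one reading of the slide: a path's weight is the product of the probabilities of following its edges, and the edge score of a tree is the harmonic mean of its root-to-leaf path weights. The function names are hypothetical, and the example probabilities are made up for illustration.

```python
import math
from statistics import harmonic_mean

def path_weight(edge_probs):
    """Probability of following every edge on one root-to-leaf path."""
    return math.prod(edge_probs)

def edge_score(paths):
    """paths: one list of edge probabilities per leaf of the answer tree.
    Returns the harmonic mean of the path weights (all must be positive)."""
    return harmonic_mean([path_weight(p) for p in paths])

# Two leaves: one two hops away (0.5 per hop), one a single hop away.
score = edge_score([[0.5, 0.5], [0.5]])
```

The harmonic mean is dominated by the least probable path, so one long or unlikely branch pulls the whole tree's score down.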