Keyword Searching and Browsing in Databases using BANKS

1
Keyword Searching and Browsing in Databases using
BANKS
  • Gaurav Bhalotia, Arvind Hulgeri,
  • Charuta Nakhe,
  • Soumen Chakrabarti, S. Sudarshan
  • I.I.T. Bombay

2
Motivation
  • Keyword search of documents on the Web has been
    enormously successful
  • Simple and intuitive, no need to learn any query
    language
  • Database querying using keywords is desirable
  • SQL is not appropriate for casual users
  • Form interfaces are cumbersome
  • Require a separate form for each type of query;
    confusing for casual users of Web information
    systems
  • Not suitable for ad hoc queries

3
Motivation
  • Many Web documents are dynamically generated from
    databases
  • E.g. Catalog data
  • Keyword querying of generated Web documents
  • May miss answers that need to combine information
    on different pages
  • Suffers from duplication overheads

4
Examples of Keyword Queries
  • On a railway reservation database
  • mumbai bangalore
  • On an e-store database
  • camcorder panasonic
  • On a book store database
  • sudarshan databases

5
Differences from IR/Web Search
  • Related data split across multiple tuples due to
    normalization
  • E.g. Paper(paper-id, title, journal),
    Author(author-id, name),
    Writes(author-id, paper-id, position),
    Cites(citing-paper-id, cited-paper-id)
  • Different keywords may match tuples from
    different relations
  • What joins are to be computed can only be decided
    on the fly

6
Connectivity
  • Tuples may be connected by
  • Foreign key
  • Implicit links (shared words), etc.
  • Tuples belonging to the same relation
  • Would like to find sets of (closely) connected
    tuples that match all given keywords

7
Basic Model
  • Database modeled as a graph
  • Nodes: tuples
  • Edges: references between tuples
  • Foreign keys, other kinds of relationships
  • Edges are directed.
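
The graph model above can be sketched in a few lines of Python. The tuple ids, the toy data, and the helper names `build_graph` and `match_keyword` are assumptions made for illustration; they are not taken from the BANKS implementation.

```python
from collections import defaultdict

# Hypothetical toy data (not from the slides): node id -> (relation, text).
nodes = {
    "p1": ("paper", "MultiQuery Optimization"),
    "p2": ("paper", "BANKS Keyword search"),
    "a1": ("author", "S. Sudarshan"),
    "a2": ("author", "Prasan Roy"),
    "w1": ("writes", ""),
    "w2": ("writes", ""),
    "w3": ("writes", ""),
}

# Foreign-key references as (referencing tuple, referenced tuple).
references = [
    ("w1", "a1"), ("w1", "p1"),
    ("w2", "a2"), ("w2", "p1"),
    ("w3", "a1"), ("w3", "p2"),
]

def build_graph(references):
    """Return forward adjacency lists: referencing node -> referenced nodes."""
    forward = defaultdict(list)
    for src, dst in references:
        forward[src].append(dst)
    return dict(forward)

def match_keyword(nodes, keyword):
    """Return ids of nodes whose text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return {nid for nid, (_, text) in nodes.items() if kw in text.lower()}

graph = build_graph(references)
```

Each writes tuple points at one author and one paper, so a paper and its authors are two forward edges apart only when the edges are also traversable in the opposite direction.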

[Figure: example graph with paper nodes "MultiQuery Optimization" and "BANKS Keyword search", writes nodes, and author nodes S. Sudarshan, Prasan Roy, and Charuta]

8
Answer Example
  • Query: sudarshan roy
[Figure: answer tree connecting the paper "MultiQuery Optimization" through two writes nodes to the authors S. Sudarshan and Prasan Roy]

9
Edge Directionality
  • Some popular tuples are connected to many other
    tuples
  • E.g. students -> departments -> university
  • Popular tuples would create misleading shortcuts
    from every tuple to every other
  • E.g. every student would be closely linked with
    every other student via the department/university
  • Solution: define different forward and backward
    edge weights
  • Forward edges: in the direction of the foreign
    key reference

10
Edge Weight
  • Weight of forward edge is based on the schema
  • E.g. citation link weights > writes link weights
  • Weight of backward edge = indegree of edges
    pointing to the node

11
Edge Weight Scaling
  • Problem: some backward edges have unduly large
    weights
  • Scale edge weights by using log(1 + raw-edge-weight)
  • total-edge-weight = Σ edge-weights
  • Edge score E = 1 / total-edge-weight
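
A minimal sketch of the edge score, assuming the natural logarithm (the slide does not state the base) and assuming that a tree with no edges scores 1.0. The function names are hypothetical.

```python
import math

def scaled_edge_weight(raw_weight):
    """Dampen large raw weights with log(1 + raw-edge-weight)."""
    return math.log(1 + raw_weight)

def edge_score(raw_weights):
    """E = 1 / total-edge-weight, where the total sums the scaled weights."""
    total = sum(scaled_edge_weight(w) for w in raw_weights)
    if total == 0:
        # Assumption: a single-node answer tree has no edges; give it score 1.0.
        return 1.0
    return 1.0 / total

# A tree whose backward edge has raw weight 100 (a popular node)
# scores lower than a tree with two unit-weight edges.
light = edge_score([1, 1])
heavy = edge_score([1, 100])
```

Without the log scaling, the heavy tree's total would be 101 rather than about 5.3, so a single popular node would dominate the ranking.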

12
Node Weight
  • Nodes have prestige weights too
  • Observation: nodes with intuitively greater
    prestige tend to have greater indegree
  • Set node weight = indegree
  • Problem: nodes with many in-edges result in
    skewed answers
  • Subdue extreme node weights by using
    log(1 + indegree)
  • Node score N = root-node-weight +
    Σ leaf-node-weights
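
A small sketch of the node score, reading it as the root weight plus the sum of the leaf weights and using the natural logarithm. Both choices are assumptions about the slide's formula, and the function names are hypothetical.

```python
import math

def node_weight(indegree):
    """Prestige of a node: log(1 + indegree) subdues extreme indegrees."""
    return math.log(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """N = weight of the root plus the weights of the keyword (leaf) nodes."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)

# A heavily cited root paper (indegree 50) versus a rarely cited one (indegree 2),
# each with the same two leaf nodes.
popular = node_score(50, [3, 3])
obscure = node_score(2, [3, 3])
```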

13
Combining Scores
  • Problem: how to combine two independent metrics,
    node weight and edge weight
  • Normalize each to the range 0-1
  • Combine using a weighting factor λ
  • Additive: (1 - λ) E + λ N
  • Multiplicative: E × N^λ
  • Performance study to compare alternatives and to
    find reasonable values for λ
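
The two combination rules can be written directly as functions. This sketch assumes the edge score E and node score N have already been normalized to the range 0-1; normalizing by dividing by the maximum is an illustrative choice, not necessarily the one used in BANKS.

```python
def normalize(values):
    """Scale non-negative scores into 0-1 by dividing by the maximum (assumed scheme)."""
    top = max(values)
    if top == 0:
        return [0.0 for _ in values]
    return [v / top for v in values]

def additive(e, n, lam):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n

def multiplicative(e, n, lam):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam

# With a small weighting factor the edge score dominates in both forms.
a = additive(0.5, 1.0, 0.2)
m = multiplicative(0.5, 1.0, 0.2)
```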

14
The BANKS Answer Model
  • Query: set of keywords k1, k2, ..., kn
  • Each keyword ki matches a set of nodes Si
  • Answer: rooted, directed tree connecting nodes,
    with one node from each Si
  • Root node (also referred to as Information Node)
    has special significance, may be restricted to
    some relations
  • E.g. relations representing entities, not
    relationships
  • May include intermediate nodes not in any Si, and
    is hence a Steiner tree
  • Multiple answers
  • Ranking based on proximity + prestige

15
Finding Answer Trees
  • Computation of minimum-weight Steiner trees is
    NP-complete
  • Backward Expanding Search Algorithm
  • Intuition: find vertices from which a forward
    path exists to at least one node from each Si
  • Run concurrent single source shortest path
    algorithm from each node matching a keyword
  • Create an iterator for each node matching a
    keyword
  • Traverse the graph edges in reverse direction
  • Output a node whenever it is on the intersection
    of the sets of nodes reached from each keyword
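
The steps above can be sketched as follows. This is a simplified illustration rather than the BANKS implementation: all the per-node shortest-path iterators share one priority queue, only the nearest matching node per keyword is kept for each root, and the toy graph (with unit-weight forward and backward edges) is hypothetical.

```python
import heapq
from collections import defaultdict

def backward_expanding_search(edges, keyword_sets):
    """edges: {u: [(v, weight), ...]} for directed edges u -> v.
    keyword_sets: one set of matching nodes S_i per keyword.
    Returns (root, {keyword index: matched node}) pairs in discovery order."""
    # Reverse adjacency, so that edges can be traversed backwards.
    reverse = defaultdict(list)
    for u, outs in edges.items():
        for v, w in outs:
            reverse[v].append((u, w))

    # One Dijkstra iterator per (keyword, origin); sharing a heap means the
    # globally nearest frontier node is always expanded next.
    heap = []
    for i, matches in enumerate(keyword_sets):
        for origin in sorted(matches):
            heapq.heappush(heap, (0.0, i, origin, origin))

    settled = set()              # (keyword, origin, node) already expanded
    reached = defaultdict(dict)  # node -> {keyword: nearest origin}
    emitted = set()
    answers = []

    while heap:
        dist, i, origin, node = heapq.heappop(heap)
        if (i, origin, node) in settled:
            continue
        settled.add((i, origin, node))
        reached[node].setdefault(i, origin)  # first arrival is the nearest
        # Output the node once every keyword's search has reached it.
        if len(reached[node]) == len(keyword_sets) and node not in emitted:
            emitted.add(node)
            answers.append((node, dict(reached[node])))
        for prev, w in reverse[node]:
            if (i, origin, prev) not in settled:
                heapq.heappush(heap, (dist + w, i, origin, prev))
    return answers

# Toy graph: writes tuples w1, w2 reference authors a1, a2 and paper p1;
# each forward edge has a matching backward edge (all weights 1 here).
forward = [("w1", "a1"), ("w1", "p1"), ("w2", "a2"), ("w2", "p1")]
edges = defaultdict(list)
for u, v in forward:
    edges[u].append((v, 1.0))
    edges[v].append((u, 1.0))

answers = backward_expanding_search(edges, [{"a1"}, {"a2"}])
```

In this toy run the first root reported is the paper `p1`, which has a forward path to both authors. Later roots correspond to heavier trees, which is why answers still need to be ranked before being shown.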

16
Finding Answer Trees
  • For each vertex v visited, maintain a nodelist v.Li
    for each search term ti
  • Update the ith nodelist when the search starting
    from a vertex u ∈ Si reaches the vertex v
  • The new result trees produced correspond to the
    cross product {u} × ∏ (j ≠ i) v.Lj

17
Backward Expanding Search
  • Query: sudarshan roy
[Figure: backward expansion starting from the author nodes S. Sudarshan and Prasan Roy]

18
Result Ordering
  • Answer trees may not be generated in relevance
    order
  • Solution
  • Best-first search across all iterators, based on
    path length
  • Output answers to a buffer
  • Eliminate duplicates (isomorphic trees)
  • Output highest ranked answer from buffer to user
    when buffer is full
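
A possible shape for the output buffer, as a sketch. The class name `ResultBuffer` is hypothetical, and duplicate detection is simplified: two trees count as duplicates when they have the same node set and the same edges ignoring direction, which is an assumed stand-in for a full isomorphism check.

```python
import heapq

class ResultBuffer:
    """Fixed-size buffer that releases the best-scored answer once it fills up."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []     # entries are (-score, insertion order, tree)
        self._seen = set()  # signatures of trees already added
        self._count = 0

    @staticmethod
    def _signature(tree):
        root, tree_edges = tree
        tree_nodes = {root} | {n for edge in tree_edges for n in edge}
        undirected = frozenset(frozenset(edge) for edge in tree_edges)
        return (frozenset(tree_nodes), undirected)

    def add(self, score, tree):
        """Add an answer tree (root, edges). Returns (score, tree) to output, or None."""
        sig = self._signature(tree)
        if sig in self._seen:
            return None  # duplicate of an earlier answer
        self._seen.add(sig)
        heapq.heappush(self._heap, (-score, self._count, tree))
        self._count += 1
        if len(self._heap) >= self.capacity:
            return self.pop_best()
        return None

    def pop_best(self):
        neg_score, _, tree = heapq.heappop(self._heap)
        return (-neg_score, tree)

    def drain(self):
        """Output the remaining answers, best first, when the search ends."""
        while self._heap:
            yield self.pop_best()

t1 = ("p1", (("p1", "w1"), ("w1", "a1")))
t2 = ("p2", (("p2", "w3"), ("w3", "a1")))
t1_rerooted = ("w1", (("w1", "p1"), ("w1", "a1")))

buf = ResultBuffer(capacity=2)
first = buf.add(0.5, t1)            # buffer not yet full
second = buf.add(0.9, t2)           # buffer full: best answer is released
third = buf.add(0.7, t1_rerooted)   # same tree as t1 ignoring direction
rest = list(buf.drain())
```

A larger buffer gives output closer to true relevance order, at the cost of delaying the first answer.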

19
The BANKS System
  • BANKS provides keyword search coupled with
    extensive browsing facilities
  • Schema browsing + data browsing
  • Graphical display of data
  • Implemented using Java servlets
  • Keyword search response times are typically 1 to 3
    seconds on
  • DBLP database with 100,000 tuples/300,000 edges
  • P3 600 MHz, 512 MB RAM
  • Try it out at www.cse.iitb.ac.in/banks/

20
The BANKS Architecture
  • Connects to any database using JDBC
  • JDBC metadata features used to provide schema
    browsing
  • No programming needed for customization
  • Minimal preprocessing of database to create
    indices and give weights to links
  • Extensive set of browsing features

[Figure: BANKS architecture. The user talks over HTTP to a Web server with servlets (BANKS), which connects to the database over JDBC]

21
Browsing Features
  • Hyperlinks are automatically added to all
    displayed results
  • Template facilities to do a variety of tasks
  • Browsing data by grouping and creating crosstabs
  • e.g., theses grouped by department and year
  • Hierarchical views of data
  • Nested XML style, even on relational data
  • Graphical displays
  • Bar charts, pie charts, etc

22
Example of Browsing in BANKS

23
BANKS Query Result Example
  • Result of the query: Soumen Sunita

24
Anecdotes
  • Query: Mohan
  • Returns C. Mohan at top based on prestige (number
    of papers written)
  • Query: Transaction
  • Returns Jim Gray's classic paper and textbook as
    top answers based on prestige (number of
    citations)
  • Query: Sunita Seltzer
  • No common papers, but both have papers with
    Stonebraker; the system finds this connection

25
Effect of Parameters
  • Log scaling of edge weights worked well
  • (1 - λ) E + λ N versus E × N^λ made little
    difference
  • Best with λ = 0.2 (subdue node weights but not
    entirely)

26
Related Work
  • DataSpot (DTL) / Mercado Intuifind [VLDB 98]
  • Based on patent by Palmon (filed 1995, granted
    1998)
  • Similar answer model to ours
  • Differences: our model of backward link weights
    and prestige
  • Proximity Search [VLDB 98]
  • Different model of proximity
  • No edge weights or prestige; different evaluation
    algorithm
  • Information units (linked Web pages) [WWW10]
  • No directionality, only studied in Web context
  • Microsoft DBExplorer
  • No ranking, based on SQL generation
  • Addresses efficient construction of text indexes

27
Some Extensions to BANKS
  • Searching for similar results: Template Search
  • define the notion of similarity between two
    result trees
  • perform the restricted search
  • Efficiently handling meta-data queries
  • starting the search from each of the tuples in a
    table is too costly

28
Template Search
  • Feedback in terms of result tree
  • Type of a result tree defined in terms of
  • type of nodes
  • the table to which the node belongs
  • type of edges
  • the type of nodes which it connects
  • the link information, e.g. the cites and cited
    links between two papers
  • Which nodes to start the search from
  • only the chosen nodes
  • all the nodes corresponding to a particular
    keyword

29
Template Search
  • Start the backward search only from allowed set
    of nodes
  • Follow the edges as defined by the result type
  • Example: consider the query sudarshan database
  • Two types of results for above query
  • papers written by professor sudarshan
  • papers cited by papers written by professor
    sudarshan
  • Two result types distinguished by whether to
    follow the cites/cited link from a paper node.

30
Metadata Keyword Queries
  • Metadata keywords match all the tuples of
    a relation
  • Too costly to start the search from each of
    the tuples of a table
  • First-cut approach: start the forward search from
    the information node for the non-metadata
    keywords
  • Selectively choose the nodes from where to
    start the forward search

31
Example of Metadata Query
  • Consider the query sudarshan paper
[Figure: the sudarshan node, the writes table nodes, and a forward search to the paper table]

32
Conclusions and Future Work
  • The next big wave: keyword searching and browsing
    of databases?
  • Future work
  • Keyword queries on XML
  • Disambiguating queries by selecting
  • Nodes: G. W. Bush (Bush Jr. or Bush Sr.)
  • Tree structure: coauthors or cites
  • Boolean queries, stemming, thesaurus
  • Metadata: column/relation names

33
Thank You