Keyword Searching and Browsing in Databases using BANKS

Description:

9/24/09. 2. Motivation. Keyword search of documents on the Web has been enormously successful ... Root node has special significance, may be restricted to some ... – PowerPoint PPT presentation

Slides: 30
Provided by: Char183
Transcript and Presenter's Notes


1
Keyword Searching and Browsing in Databases using
BANKS
  • Charuta Nakhe
  • Joint work with
  • Arvind Hulgeri, Gaurav Bhalotia,
  • Soumen Chakrabarti, S. Sudarshan
  • I.I.T. Bombay

2
Motivation
  • Keyword search of documents on the Web has been
    enormously successful
  • Simple and intuitive, no need to learn any query
    language
  • Database querying using keywords is desirable
  • SQL is not appropriate for casual users
  • Form interfaces cumbersome
  • Require a separate form for each type of query;
    confusing for casual users of Web information
    systems
  • Not suitable for ad hoc queries

3
Motivation
  • Many Web documents are dynamically generated from
    databases
  • E.g. Catalog data
  • Keyword querying of generated Web documents
  • May miss answers that need to combine information
    on different pages
  • Suffers from duplication overheads

4
Examples of Keyword Queries
  • On a railway reservation database
  • mumbai bangalore
  • On a university database
  • database course
  • On an e-store database
  • camcorder panasonic
  • On a book store database
  • sudarshan databases

5
Differences from IR/Web Search
  • Related data split across multiple tuples due to
    normalization
  • E.g. Paper(paper-id, title, journal),
    Author(author-id, name),
    Writes(author-id, paper-id, position),
    Cites(citing-paper-id, cited-paper-id)
  • Different keywords may match tuples from
    different relations
  • What joins are to be computed can only be decided
    on the fly

6
Connectivity
  • Tuples may be connected by
  • Foreign key and object references
  • Inclusion dependencies and join conditions
  • Implicit links (shared words), etc.
  • Would like to find sets of (closely) connected
    tuples that match all given keywords

7
Basic Model
  • Database modeled as a graph
  • Nodes: tuples
  • Edges: references between tuples
  • Foreign keys, inclusion dependencies, ...
  • Edges are directed.
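
The graph model above can be sketched in a few lines of Python. This is only an illustration on a tiny hand-made database: the tuple ids (such as "writes:w1") and the helper build_graph are assumptions made for the example, not part of BANKS.

```python
from collections import defaultdict

# Hypothetical tuples, keyed by an id of the form "relation:key".
tuples = {
    "paper:p1": {"title": "MultiQuery Optimization"},
    "author:a1": {"name": "S. Sudarshan"},
    "author:a2": {"name": "Prasan Roy"},
    "writes:w1": {},
    "writes:w2": {},
}

# Foreign-key references as (referencing tuple, referenced tuple) pairs.
references = [
    ("writes:w1", "paper:p1"), ("writes:w1", "author:a1"),
    ("writes:w2", "paper:p1"), ("writes:w2", "author:a2"),
]

def build_graph(references):
    """Return directed adjacency lists: node -> list of referenced nodes."""
    graph = defaultdict(list)
    for src, dst in references:
        graph[src].append(dst)
    return graph

graph = build_graph(references)
print(graph["writes:w1"])  # ['paper:p1', 'author:a1']
```

Each writes tuple points at one paper and one author, so two authors of the same paper are connected only through that paper's node.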

(Figure: example database graph with paper nodes "MultiQuery Optimization" and "BANKS Keyword search", writes nodes, and author nodes S. Sudarshan, Prasan Roy and Charuta)
8
Answer Example
  • Query: sudarshan roy

(Figure: answer tree connecting the paper "MultiQuery Optimization" through two writes tuples to the authors S. Sudarshan and Prasan Roy)
9
The BANKS Answer Model
  • Query: set of keywords k1, k2, ..., kn
  • Each keyword ki matches a set of nodes Si
  • Answer: rooted, directed tree connecting nodes,
    with one node from each Si
  • Root node has special significance, may be
    restricted to some relations
  • E.g. relations representing entities, not
    relationships
  • May include intermediate nodes not in any Si, and
    is hence a Steiner tree
  • Multiple answers
  • Ranking based on proximity + prestige

10
Edge Directionality
  • Some popular tuples are connected to many other
    tuples
  • E.g. Students -> departments -> university
  • Popular tuples would create misleading shortcuts
    from every tuple to every other
  • E.g. every student would be closely linked with
    every other student via the department/university
  • Solution: define different forward and backward
    edge weights
  • Forward edges: in the direction of the foreign
    key reference

11
Edge Weight
  • Weight of forward edge: based on schema
  • E.g. citation link weights > writes link
    weights
  • Weight of backward edge = indegree of edges
    pointing to the node
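
A minimal sketch of these two weighting rules, under stated assumptions: the numeric values in FORWARD_WEIGHT are invented for illustration (the slide only says citation links weigh more than writes links), and the ids and the function name edge_weights are hypothetical.

```python
from collections import Counter

# Assumed schema-level forward weights per referencing relation.
# Only the ordering (cites > writes) comes from the slide.
FORWARD_WEIGHT = {"writes": 1.0, "cites": 2.0}

def edge_weights(references):
    """references: (src, dst) foreign-key edges with ids like 'writes:w1'.

    Returns {(u, v): weight} containing each forward edge and the
    matching backward edge.
    """
    indegree = Counter(dst for _, dst in references)
    weights = {}
    for src, dst in references:
        relation = src.split(":")[0]
        weights[(src, dst)] = FORWARD_WEIGHT[relation]  # forward: from schema
        weights[(dst, src)] = float(indegree[dst])      # backward: indegree
    return weights

references = [
    ("writes:w1", "paper:p1"), ("writes:w1", "author:a1"),
    ("writes:w2", "paper:p1"), ("writes:w2", "author:a2"),
]
weights = edge_weights(references)
print(weights[("paper:p1", "writes:w1")])  # 2.0
```

The paper is referenced by two writes tuples, so leaving it along a backward edge costs 2.0, while the forward edge into it costs only 1.0. A node referenced by thousands of tuples becomes correspondingly expensive to use as a shortcut.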

12
Edge Weight Scaling
  • Problem: some backward edges have unduly large
    weights
  • Scale edge weights by using log(1 + raw-edge-weight)
  • total-edge-weight = Σ edge-weights
  • Edge score E = 1 / total-edge-weight
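
As a worked sketch of the scaling and edge score: the slide does not state the base of the logarithm, so base 2 is an assumption here, and the function name edge_score is hypothetical.

```python
import math

def edge_score(raw_edge_weights):
    """Edge score E for an answer tree, given its raw edge weights.

    Each weight is damped with log(1 + w), the damped weights are summed,
    and the score is the reciprocal of that sum. A tree with no edges
    (a single node) has total 0 and would need a special case.
    """
    scaled = [math.log2(1 + w) for w in raw_edge_weights]  # base 2 is assumed
    total_edge_weight = sum(scaled)
    return 1 / total_edge_weight

# Raw weights 1 and 3 scale to 1 and 2, so E = 1 / 3.
print(edge_score([1, 3]))
```

A raw backward weight of 1023 scales to only 10, which is how the logarithm keeps one very popular node from dominating the total.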

13
Node Weight
  • Nodes have prestige weights too
  • Observation: nodes with intuitively greater
    prestige tend to have greater indegree
  • Set node weight = indegree
  • Problem: nodes with many in-edges result in
    skewed answers
  • Subdue extreme node weights by using
    log(1 + indegree)
  • Node score N = root-node-weight +
    Σ leaf-node-weights
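
A small sketch of the node score, assuming it adds the root's weight to the sum of the leaf weights and that the logarithm is base 2 (the slide does not give the base). The function names are hypothetical.

```python
import math

def node_weight(indegree):
    """Prestige weight of a node: log-damped indegree (base 2 assumed)."""
    return math.log2(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """Node score N: root weight plus the sum of the leaf weights."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)

# Root with indegree 3 (weight 2) and two leaves with indegree 1 (weight 1 each).
print(node_score(3, [1, 1]))  # 4.0
```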

14
Combining Scores
  • Problem: how to combine two independent metrics,
    node weight and edge weight
  • Normalize each to the range 0-1
  • Combine using weighting factor λ
  • Additive: (1 - λ) E + λ N
  • Multiplicative: E × N^λ
  • Performance study to compare alternatives and to
    find reasonable values for λ
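
The two combination rules can be sketched as follows, assuming the weighting factor (written lam in the code) lies between 0 and 1 and that E and N have already been normalized to the range 0-1. The default of 0.2 is the value the later "Effect of Parameters" slide reports as working best; the function names are hypothetical.

```python
def combine_additive(e, n, lam=0.2):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n

def combine_multiplicative(e, n, lam=0.2):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam

# Two candidate answers: (edge score E, node score N), both already in 0-1.
close_but_obscure = (0.9, 0.1)
far_but_prestigious = (0.4, 1.0)
for name, (e, n) in [("close", close_but_obscure), ("far", far_but_prestigious)]:
    print(name, combine_additive(e, n), combine_multiplicative(e, n))
```

With a small weighting factor, proximity dominates: the closely connected answer ranks first under both rules even though its nodes have little prestige.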

15
Finding Answer Trees
  • Backward Expanding Search Algorithm
  • Intuition: find vertices from which a forward
    path exists to at least one node from each Si.
  • Run concurrent single source shortest path
    algorithm from each node matching a keyword
  • Create an iterator for each node matching a
    keyword
  • Traverse the graph edges in reverse direction
  • Output a node whenever it is on the intersection
    of the sets of nodes reached from each keyword
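
The steps above can be sketched as a simplified Python generator. It is an illustration rather than the BANKS implementation: the per-node shortest-path iterators are merged into one priority queue, only candidate root nodes are reported (not the full answer trees), and the edge weights and ids in the example are invented.

```python
import heapq
from collections import defaultdict

def backward_expanding_search(edges, keyword_sets):
    """edges: {(u, v): weight} for directed edges u -> v.
    keyword_sets: list of sets S_i of nodes matching keyword i.
    Yields nodes from which some node of every S_i is reachable.
    """
    incoming = defaultdict(list)  # v -> [(u, weight)], for reverse traversal
    for (u, v), w in edges.items():
        incoming[v].append((u, w))

    # One shortest-path "iterator" per keyword node, merged into a single
    # heap so the globally closest frontier node is expanded first.
    heap = [(0.0, src, src, i) for i, s in enumerate(keyword_sets) for src in s]
    heapq.heapify(heap)
    settled = set()             # (source, node) pairs already expanded
    reached = defaultdict(set)  # node -> indices of keywords that reach it
    emitted = set()

    while heap:
        dist, node, src, i = heapq.heappop(heap)
        if (src, node) in settled:
            continue
        settled.add((src, node))
        reached[node].add(i)
        if len(reached[node]) == len(keyword_sets) and node not in emitted:
            emitted.add(node)
            yield node
        for prev, w in incoming[node]:
            if (src, prev) not in settled:
                heapq.heappush(heap, (dist + w, prev, src, i))

edges = {
    # forward edges (foreign-key direction)
    ("writes:w1", "paper:p1"): 1.0, ("writes:w1", "author:a1"): 1.0,
    ("writes:w2", "paper:p1"): 1.0, ("writes:w2", "author:a2"): 1.0,
    # backward edges
    ("paper:p1", "writes:w1"): 2.0, ("paper:p1", "writes:w2"): 2.0,
    ("author:a1", "writes:w1"): 1.0, ("author:a2", "writes:w2"): 1.0,
}
roots = list(backward_expanding_search(edges, [{"author:a1"}, {"author:a2"}]))
print(roots[0])  # paper:p1
```

For the query sudarshan roy, both searches first meet at the shared paper, which becomes the root of the best answer.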

16
Backward Expanding Search
  • Query: sudarshan roy

(Figure: backward expansion from the author nodes S. Sudarshan and Prasan Roy)
17
Result Ordering
  • Answer trees may not be generated in relevance
    order
  • Solution:
  • Best-first search across all iterators, based on
    path length
  • Output answers to a buffer
  • Output highest ranked answer from buffer to user
    when buffer is full
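
A minimal sketch of the buffering idea, assuming each answer arrives with a relevance score where higher is better. The function name reorder_with_buffer and the sample scores are made up for illustration.

```python
import heapq

def reorder_with_buffer(answers, buffer_size):
    """answers: iterable of (score, answer) pairs in generation order.

    Keeps up to buffer_size answers; once the buffer overflows, the
    highest-ranked buffered answer is released to the user.
    """
    buffer = []  # min-heap on negated score, so the best answer pops first
    for seq, (score, answer) in enumerate(answers):
        heapq.heappush(buffer, (-score, seq, answer))
        if len(buffer) > buffer_size:
            yield heapq.heappop(buffer)[2]
    while buffer:  # generation finished: drain the rest in rank order
        yield heapq.heappop(buffer)[2]

generated = [(0.2, "A"), (0.9, "B"), (0.5, "C"), (0.7, "D")]
print(list(reorder_with_buffer(generated, buffer_size=2)))  # ['B', 'D', 'C', 'A']
```

The output is only approximately ordered in general: a larger buffer gives a better ordering at the cost of a longer wait for the first answer.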

18
The BANKS System
  • BANKS provides keyword search coupled with
    extensive browsing facilities
  • Schema browsing + data browsing
  • Graphical display of data
  • Implemented using Java servlets
  • Keyword search response times typically 1 to 3
    seconds on
  • DBLP database with 100,000 tuples/300,000 edges
  • P3 600 MHz, 512 MB RAM
  • Try it out at www.cse.iitb.ac.in/banks/

19
Example of Browsing in BANKS
20
Anecdotes
  • Mohan
  • Returns C. Mohan at top based on prestige (number
    of papers written)
  • Transaction
  • Returns Jim Gray's classic paper and textbook as
    top answers based on prestige (number of
    citations)
  • Sunita Seltzer
  • No common papers, but both have papers with
    Stonebraker; system finds this connection

21
Effect of Parameters
  • Log scaling of edge weights worked well
  • (1 - λ) E + λ N versus E × N^λ made little
    difference
  • Best with λ = 0.2 (subdue node weights but not
    entirely)

22
Related Work
  • DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  • Based on patent by Palmon (filed 1995, granted
    1998)
  • Based on hypergraph model, similar answer model
    to ours
  • Differences: our model of backward link weights
    and prestige
  • Proximity Search [VLDB 98]
  • Different model of proximity based on adding up
    support
  • No edge weights or prestige; different evaluation
    algorithm
  • Information units (linked Web pages) [WWW10]
  • No directionality, only studied in Web context
  • Microsoft DBExplorer (this conference)
  • No ranking, based on SQL generation
  • Addresses efficient construction of text indexes
  • Microsoft English Query

23
Conclusions and Future Work
  • The next big wave: keyword searching and browsing
    of databases?
  • Future work
  • Keyword queries on XML
  • Disambiguating queries by selecting
  • Nodes: for G.W. Bush, Bush Jr. or Bush Sr.?
  • Tree structure: coauthors or cites?
  • Boolean queries, stemming, thesaurus
  • Metadata: column/relation names

24
Thank You
25
BANKS Query Result Example
  • Result of the query "Soumen Sunita"

26
(No Transcript)
27
Browsing Features
  • Hyperlinks are automatically added to all
    displayed results
  • Template facilities to do a variety of tasks
  • Browsing data by grouping and creating crosstabs
  • e.g., theses grouped by department and year
  • Hierarchical views of data
  • Nested XML style, even on relational data
  • Graphical displays
  • Bar charts, pie charts, etc
  • Templates are generic and can be applied on any
    data matching assumed schema
  • Can be applied after applying selections
  • New templates can be created by user,
    interactively

28
Combining Keyword Search and Browsing
  • Catalog searching applications
  • Keywords may restrict answers to a small set,
    then user needs to browse answers
  • If there are multiple answers, hierarchical
    browsing required on the answers

29
The BANKS System
  • Available on the web, with (part of) DBLP data
  • http://www.cse.iitb.ac.in/banks
  • Connects to any database using JDBC
  • JDBC metadata features used to provide schema
    browsing
  • No programming needed for customization
  • Minimal preprocessing of database to create
    indices and give weights to links
  • Extensive set of browsing features

(Figure: BANKS architecture. The user reaches the Web server and its servlets over HTTP; the servlets connect to the database over JDBC.)