Lecture 17: Keyword Search on Relational Databases - PowerPoint PPT Presentation

1 / 55
About This Presentation
Title:

Lecture 17: Keyword Search on Relational Databases

Description:

We have SQL, why keyword-querying? SQL - not appropriate for naive users. So many online databases (imdb, ... Query: set of search terms {t1, t2, .., tn} ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 56
Provided by: ChengXi4
Category:

less

Transcript and Presenter's Notes

Title: Lecture 17: Keyword Search on Relational Databases


1
Lecture 17 Keyword Search on Relational
Databases
Oct. 31, 2007 ChengXiang Zhai
Most slides are from Esha Palta and Kumar Gaurav
Bijays presentation
2
Motivation
  • Keyword search
  • We have SQL, why keyword-querying?
  • SQL - not appropriate for naive users
  • So many online databases (imdb, citeseer,
    bseindia ) user cannot keep track of schema
    for all of these

3
Simple Approaches
  • Using Form interfaces
  • Require separate form for each type of query
    confusing
  • Not suitable for ad-hoc queries how many forms
    will you provide?
  • How about Google?
  • Export data from db to documents and do keyword-
    querying on these
  • Suffers from duplication overheads
  • Google wants all keywords in one document. DB is
    often normalized, so need to join tables and
    store as documents
  • Multiple combinations of tables to join. Not
    scalable

4
Differences from Web Search
  • Related data split across multiple tuples due to
    normalization
  • Different keywords may match tuples from
    different relations
  • What joins are to be computed can only be decided
    on the fly
  • Need to find result containing all keywords and
    rank them somehow

Writes (AuthorId, PaperId)
Paper (PaperId, PaperName)
Author (AuthorId, AuthorName)
Cites (Citing, Cited)
The DBLP Bibliography Schema
5
Systems for DB search
  • BANKS (Browsing and Keyword Search)
  • IITB (ICDE 02)
  • DBXplorer Microsoft Research (ICDE 02)
  • ObjectRank IBM, UCSD, FIU (VLDB 04)
  • Bidirectional BANKS IITB (VLDB 05)

6
Systems for DB search
  • BANKS (Browsing and Keyword Search)
  • IITB (ICDE 02)
  • DBXplorer Microsoft Research (ICDE 02)
  • ObjectRank IBM, UCSD, FIU (VLDB 04)
  • Bidirectional BANKS IITB (VLDB 05)

7
BANKS (ICDE 02)
8
The BANKS system
  • BANKS Architecture
  • Available on the web
  • http//www.cse.iitb.ac.in/banks
  • Connects to database using JDBC
  • JDBC metadata features used to provide schema
    browsing
  • Preprocesses db

User
BANKS
HTTP
JDBC
Web-server
Database
9
Basic Model
  • Database modeled as a graph
  • Nodes tuples
  • Edges references between tuples
  • foreign key (assume for this talk), inclusion
    dependencies, ..
  • Edges are directed.

PaperIdPaperName
AuthorIDPaperId
AuthorId
DBLP example
10
The BANKS Answer Model
  • Query set of search terms t1, t2, .., tn
  • For each search term ti we find set of nodes Si
    matching ti
  • Eg Query Sudarshan Roy (t1 Sudarshan, t2
    Roy)
  • Answer rooted, directed tree connecting nodes
    matching keywords
  • Root node has special significance, may be
    restricted to some relations
  • E.g. relations representing entities, not
    relationships
  • May include intermediate nodes not in any Si
    (Steiner Tree)
  • Multiple answers
  • Ranking based on proximity prestige

11
Answer Example
Query sudarshan roy
Paper
Writes
Writes
Author
Author
  • We would like to find sets of (closely) connected
    tuples that match all given keywords

12
Edge Directionality
  • Directed tree will miss desired answers. For eg
  • Query DBXplorer ObjectRank
  • So, for each forward edge, BANKS adds a back edge

CitedBy
Cited
Cites
BANKS
DBXPlorer
ObjectRank
Cites
CitedBy
Cited
DBXPlorer
BANKS
Cites
ObjectRank
Cites
13
Edge Directionality
  • What if we ignore directionality?
  • Some popular tuples are connected to many other
    tuples
  • E.g. Students -gt departments -gt university
  • Problem A popular tuple would create misleading
    shortcuts between tuples
  • E.g. every student would be closely linked with
    every other student via the department/university
  • Solution define different forward and backward
    edge weights
  • Forward edges In the direction of the foreign
    key reference

14
Edge Weight
  • Weight of forward edge based on schema
  • e.g. citation link weights gt writes link
    weights
  • Weight of backward edge indegree of edges
    pointing to the node

15
Edge Weight Scaling
  • Normalize edge score Escore(e)
  • Make edge weight scale-free by dividing edge
    weigth by wmin
  • Problem Some backward edges have unduly large
    weights
  • Depress the scale by defining Escore(e) as
    log(1w(e)/wmin )
  • Overall Escore E 1 / (1 ?e Escore(e))

16
Node Weight
  • Set weight of a Node Indegree of the node
  • As per prestige rankings nodes with multiple
    pointers to them get a higher prestige
  • So, higher node weight corresponds to higher
    prestige
  • Problem Nodes with many in-edges result in
    skewed answers
  • Subdue extreme node weights by using
    log(1indegree)
  • Node score Nscore Average of node scores
    (root-node-weight ? leaf-node-weights)

17
Combining Scores
  • Combining two independent metrics node weight
    and edge weight
  • Normalize each to 0-1
  • Combine using weighting factor ?
  • Additive (1- ?) Escore ? Nscore
  • Multiplicative Escore Nscore?
  • Performance study to compare alternatives and to
    find reasonable values for ?

18
First Step Symbol Table
  • The first step is to build a symbol table
  • This table is in the db and is not normalized
  • Example

19
Searching for Best Answers
  • Backward Expanding Search Algorithm
  • Assume graph fits in memory
  • Idea find vertices from which a forward path
    exists to at least one node from each Si.
  • Run concurrent single source shortest path
    algorithm from each node matching a keyword
  • Create an iterator for each node matching a
    keyword
  • Traverse the graph edges in reverse direction
  • Output a node whenever it is on the intersection
    of the sets of nodes reached from each keyword
  • Answer trees may not be generated in relevance
    order

20
Backward Expanding Search
Iterators
21
BANKS Query Result Example
  • Result of Sudarshan Roy

22
Result Ordering
  • Answers need not be always in Relevance order

This tree is output
Better Root Missed
2
2
2
5
2
1
23
Result Ordering (contd)
  • Solution
  • Generate all connection trees and then sort them
  • Increases computation costs and leads to a
    greatly increased time to generate initial
    results
  • Create a small heap ordered on the relevance of
    the trees
  • Output highest ranked tree from heap to user when
    heap is full
  • What about duplicate results?
  • Maintain a list of generated results for
    duplicate detection
  • Discard result according to relevance

24
Experience and Performance
  • BANKS provides keyword search coupled with
    extensive browsing facilities
  • Schema browsing data browsing
  • Graphical display of data
  • Implemented using Java servlets
  • Keyword search response times typically 1 to 3
    seconds on
  • DBLP database with 100,000 tuples/300,000 edges
  • P3 600 MHz, 512 MB RAM

25
Anecdotes
  • Mohan
  • Returns C. Mohan at top based on prestige (number
    of papers written)
  • Transaction
  • Returns Jim Grays classic paper and textbook as
    top answers based on prestige (number of
    citations)
  • Sunita Seltzer
  • No common papers, but both have papers with
    Stonebraker system finds this connection

26
Effect of Parameters
  • Log scaling of edge weights worked well
  • (1- ?) E ? N versus E N???????made little
    difference
  • Best with ? .2 (subdue node weights but not
    entirely)

27
BANKS (VLDB 05)
28
Motivation
  • BANKS performs poorly if
  • Keyword matches lot of nodes (so lot of Dijkstra
    sources)
  • Search hits a node with large fan in.


Wastes time
Sudarshan
Roy
29
New Ideas Forward Search
  • Why only backward, lets search forward too

How about fwd Searching ?

Sudarshan
Roy
30
New Ideas - Activation
  • Activation - Cannot forward search from each
    node.
  • Spread activation from keyword nodes to others.
  • Activation is like Page Rank with decay.
  • High Activation ? close to many keywords.

31
Activation Spreading
  • Spreading Activation
  • Node with highest activation explored first
  • Activation spread to neighbors (µ 0.3)
  • Gives low activation to neighbors of hubs

32
Modifications to Model
  • Graph model stays the same.
  • BANKS is concerned with search more than how to
    tune parameters or define node weights / edge
    weights.
  • BANKS code
  • Tree Node Score, N
  • Tree Edge Score, E
  • Total Score ENl (l 0.2)

33
The New Algorithm
  • Need two priority queues
  • Qin - do backward search from these nodes
  • Qout - do forward search from these nodes
  • Each node, n keeps 3 variables per keyword, ti
  • sp i Node to got to from n for shortest-path
    to ti
  • distance i Length of the shortest-path from
    n to ti
  • Activation i Activation to n from keyword ti

34
The New Algorithm continued
  • Set initial activation keyword nodes and add to
    Qin for backward-search.
  • At each step, pick node with maximum activation
  • i.e. if (Qin.getMaxActivation gt Qout.
    getMaxActivation))
  • // use node from Qin
  • else
  • // use node from Qout
  • If node from Qin, do backward search and add
    itself to Qout. (newly explored nodes into Qin)
  • If node from Qout, do forward search
  • If node has reached from all keyword, generate
    result-tree. answer is buffered as results can
    be out of order

35
Explanation with example
Qin
Qout
N100
N4
Roy Sudarshan
N1

N3
N2
Roy
Sudarshan
36
Explanation with example
Qin
Qout
N100
Roy Sudarshan
N4
N1 N2
N1

N3
N2
Roy
Sudarshan
37
Explanation with example
Qin
Qout
N100
N1 Roy Sudarshan
N2 N3 N100
N4
N1

N3
N2
Roy
Sudarshan
Result Found !
38
Generation of top-k results
  • If we know the score of next-best answer, all
    buffered answers with better score can be output.
  • Need upper bounds

39
Computation of upper bound
  • For each keyword ti, we have explored nodes upto
    some length say li.
  • So, next best score (approx.)
  • This is not a true upper bound, but works quite
    well and is simple !

40
Are we losing answers ?
  • BANKS I used many Dijkstra states, BANKS II
    uses 2 only forward and backward search-states.
  • The result is that we can now lose answers !

41
Answer Loss Example
Ny
K1
K2
Nx
K1
This is the generated answer.
This answer is lost.
42
  • But, we will generate this tree rooted at Nx
  • So, a rotated tree with same nodes but different
    root is often generated !

NY
K2
NX
K1
43
Metrics of Performance
  • Manually obtain best relevant answers.
  • Determine 2 times
  • Time taken to produce last relevant answer.
  • Time taken to output last relevant answer.
  • Search algorithms
  • MI-Bkwd original backward search
  • Iterator for every node matching a keyword
  • SI-Bkwd backward search with single backward
    iterator
  • Bidirec bidirectional search
  • Datasets
  • DBLP, IMDB 2 million nodes, 9 million edges
  • US Patent DB 4 million nodes, 15 million edges

44
Graph - I
  • MI-Bkwd versus SI-Bkwd
  • SI-Bkwd gain increases with origin size,
    keywords

45
Graph - II
  • SI-Bkwd versus Bidirec
  • Bidirec gain increases with origin size,
    keywords

46
A Critique
  • BANKS needs a lot of memory.
  • Need to cluster and keep parts of graph on disk.

47
DBXplorer (ICDE 02)
48
DBXplorer (Microsoft Research)
  • Use symbol table to determine which tables to
    join.
  • Generate all possible table join combinations
  • Figure

T1, T2, T3, T4 and T5 are tables
49
Cool ideas in DBXplorer
  • Symbol table need not be at tuple level. If
    column has an index, column level symbol table
    is ok.
  • Table Compression
  • e.g. Keywords Columns Keywords Columns

K1
K1
C1
C1
K2
K2
X
K3
C2
C2
K3
K4
K4
K5
K5
Intermediate Column
50
ObjectRank (VLDB 04)
51
ObjectRank (IBM, FIU, UCSD)
  • Creates objects in database. Object definition is
    manual.
  • e.g. in DBLP, author, conference and paper can
    be defined as objects.
  • Heavily inspired by PageRank.
  • Each node is given global ObjectRank just like
    PageRank of Google.

52
ObjectRank Ideas
  • Keyword-level ObjectRank for each keyword,
    precompute and save object ranks of nodes can
    optimize by defining cut-off)
  • Score of node, n w.r.t. keyword k
  • scorek(n) f (Global-object-rank (n),
    Objectrankk (n))
  • At run time, scores are combined
  • scorek1,k2,,km(n) scorek1(n) scorek2(n)
    scorekm(n)

53
ObjectRank Algorithm and answers
  • If graph is DAG or near DAG, topologically sort
    and spread ObjectRank in this order.
  • Answers are single objects and not Cluster /
    group as in BANKS.
  • Demo at
  • http//teriyaki.ucsd.edu9099/objrank/main05_ne
    w.html

54
References
  • Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
    Soumen Chakrabarti, and S. Sudarshan.Keyword
    Searching and Browsing in Databases using
    BANKS.In International Conference on Data
    Engineering (ICDE), pages 10831096, 2002.
  • Varun Kacholia, Shashank Pandit, Soumen
    Chakrabarti, S. Sudarshan et. al.Bidirectional
    Expansion for Keyword Search on Graph
    Databases.In VLDB Conference, pages 505516,
    2005.
  • Sanjay Agrawal, Surajit Chaudhari, and Gautam
    Das.DBXplorer A System for Keyword-Based Search
    over Relational Databases.In International
    Conference on Data Engineering (ICDE), pages
    522, 2002.
  • Andrey Balmin, Vagelis Hristidis, and Yannis
    Papakonstantinou.ObjectRank Authority-Based
    Keyword Search in Databases.In VLDB Conference,
    pages 564575, 2004.

55
What You Should Know
  • Why answering keyword queries on relational
    databases is useful
  • Basic ideas of BANKS
Write a Comment
User Comments (0)
About PowerShow.com