Efficient IRStyle Keyword Search over Relational Databases - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

Efficient IRStyle Keyword Search over Relational Databases

Description:

Hence, submitting to DBMS a SQL query for each CN (Na ve algorithm) is inefficient. ... at most very few results, Na ve algorithm is efficient, since it fully ... – PowerPoint PPT presentation

Number of Views:143
Avg rating:3.0/5.0
Slides: 33
Provided by: users
Category:

less

Transcript and Presenter's Notes

Title: Efficient IRStyle Keyword Search over Relational Databases


1
Efficient IR-Style Keyword Searchover Relational
Databases
  • Vagelis Hristidis
  • University of California, San Diego
  • Luis Gravano
  • Columbia University
  • Yannis Papakonstantinou
  • University of California, San Diego

2
Motivation
  • Keyword search is the dominant information
    discovery method in documents
  • Increasing amount of data is stored in databases
  • Plain text coexists with structured data

3
Motivation
  • Up until recently, information discovery in
    databases required
  • Knowledge of schema
  • Knowledge of a query language (e.g., SQL)
  • Knowledge of the role of the keywords
  • Goal Enable IR-style keyword search over DBMSs
    without the above requirements

4
IR-Style Search over DBMSs
  • IR keyword search well developed for document
    search
  • Modern DBMSs offer IR-style keyword search over
    individual text attributes
  • What is equivalent to document in databases?

5
Example Complaints Database Schema
6
Example - Complaints Database Data
Complaints
Customers
Products
7
Example Keyword Query Maxtor Netvista
Complaints
Customers
Products
8
Keyword Query Semantics (definition of
document in databases)
  • Keywords are
  • in same tuple
  • in same relation
  • in tuples connected through primary-foreign key
    relationships
  • Score of result
  • distance of keywords within a tuple
  • distance between keywords in terms of
    primary-foreign key connections
  • IR-style score of result tree

9
Example Keyword Query Maxtor Netvista
Complaints
Customers
Products
Results (1) c3, (2) p2? c3, (3) p1? c1
10
Result of Keyword Query
  • Result is tree T of tuples where
  • each edge corresponds to a primary-foreign key
    relationship
  • no tuple of T is redundant (minimality)
  • - AND query semantics Every query keyword
    appears in T
  • - OR query semantics Some query keywords
    might be missing from T

11
Score of Result T
  • Combining function Score combines scores of
    attribute values of T
  • One reasonable choice
  • Score?a?TScore(a)/size(T)
  • Attribute value scores Score(a) calculated using
    the DBMS's IR datablades

12
Shortcomings of Prior Work
  • Simplistic ranking methods (e.g., based only on
    size of connecting tree), ignoring well-studied
    IR ranking strategies
  • No straightforward extension to improve
    efficiency by returning just top-k results
  • Not good in handling free-text attributes

DBXplorer,DISCOVER
13
Example Keyword Query Maxtor Netvista
Complaints
Score(c3) 4/3
Score(p1? c1) (11/3)/2 4/6
Customers
Products
Score(p2? c3) (14/3)/2 7/6
Results (1) c3, (2) p2? c3, (3) p1? c1
14
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
15
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
16
Candidate Network Generator
  • Find all trees of tuple sets (free or non-free)
    that may produce a result, based on DISCOVER's CN
    generator VLDB 2002
  • Use single non-free tuple set for each relation
  • allows OR semantics
  • fewer CNs are generated
  • extra filtering step required for AND semantics

17
Candidate Network Generator Example
  • For query Maxtor Netvista, CNs
  • ComplaintsQ
  • ProductsQ
  • ComplaintsQ ? ProductsQ
  • ComplaintsQ ? Customer ?ComplaintsQ
  • ComplaintsQ ? Product ? ComplaintsQ
  • Non-CNs
  • ComplaintsQ ? Customer ?Complaints
  • Product Q ? Complaints ? ProductQ

18
Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
19
Execution Algorithms
  • Users usually want top-k results.
  • Hence, submitting to DBMS a SQL query for each CN
    (Naïve algorithm) is inefficient.
  • When queries produce at most very few results,
    Naïve algorithm is efficient, since it fully
    exploits DBMS.
  • Monotonic combining functions if results T, T'
    have same schema and for every attribute
    Score(ai)Score(a'i) then Score(T)Score(T')

20
Sparse Algorithm Example Execution
p1 9 9
c2 7 7
c1 ? p1 (95)/27 (97)/2 8
  • Best when query produces at most a few results

21
Single Pipelined Algorithm Example Execution
CN ComplaintsQ ? ProductsQ
Results queue
Get next tuple from most promising non-free tuple
set
MPFS
Max(59)/2, (76)/27
Max(19)/2, (76)/26.5
Max(19)/2, (71)/25
p1?c1 7
p2?c2 6.5
Output p1?c1
p2?c2
22
Global Pipelined Algorithm Example Execution
global MPFSmax(MPFSi) over all CNs Ci
  • Best when query produces many results.

23
Hybrid Algorithm
  • Estimate number of results.
  • For OR-semantics, use DBMS estimator
  • For AND-semantics, probabilistically adjust
    DBMS estimator.
  • If at most a few query results expected, then use
    Sparse Algorithm.
  • If many query results expected, then use Global
    Pipelined Algorithm.

24
Related Work
  • DBXplorer ICDE 2002, DISCOVER VLDB 2002
  • Similar three-step architecture
  • Score 1/size(T)
  • Only AND semantics
  • No straightforward extension for efficient top-k
    execution
  • BANKS ICDE 2002, Goldman et al. VLDB 1998
  • Database viewed as graph
  • No use of schema
  • Florescu et al. WWW 2000, XQuery Full-Text
  • Ilyas et al. VLDB 2003, J algorithm VLDB
    2001
  • Top-k algorithms for join queries

25
Experiments DBLP Dataset
C Conference Y Year P Paper A Author
DBLP contains few citation edges. Synthetic
citation edges were added such that average
citations is 20. Final dataset is
56MB. Experiments run over state-of-the-art
commercial RDBMS.
26
OR Semantics Effect of Maximum Allowed CN Size
Average execution time of 100 2-keyword top-10
queries
27
OR Semantics Effect of Number of Objects
Requested k
Average execution time of 100 2-keyword queries
with maximum candidate-network size of 6
28
OR Semantics Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with
maximum candidate-network size of 6
29
Conclusions
  • Extend IR-style ranking to databases.
  • Exploit text-search capabilities of modern DBMSs,
    to generate results of higher quality.
  • Support both AND and OR semantics.
  • Achieve substantial speedup over prior work via
    pipelined top-k query processing algorithms.

30
Questions?
31
Compare algorithms wrt Result size
OR-semantics
AND-semantics
Max CN size 6, top-10, 2 keywords, OR-semantics
32
Ranking Functions
  • Proposed algorithms support tuple monotone
    combining functions
  • That is, if results T, T' have same schema and
    for every attribute Score(ai)Score(a'i) then
    Score(T)Score(T')
Write a Comment
User Comments (0)
About PowerShow.com