Title: Efficient IRStyle Keyword Search over Relational Databases
1Efficient IR-Style Keyword Searchover Relational
Databases
- Vagelis Hristidis
- University of California, San Diego
- Luis Gravano
- Columbia University
- Yannis Papakonstantinou
- University of California, San Diego
2Motivation
- Keyword search is the dominant information
discovery method in documents - Increasing amount of data is stored in databases
- Plain text coexists with structured data
3Motivation
- Up until recently, information discovery in
databases required - Knowledge of schema
- Knowledge of a query language (e.g., SQL)
- Knowledge of the role of the keywords
- Goal Enable IR-style keyword search over DBMSs
without the above requirements
4IR-Style Search over DBMSs
- IR keyword search well developed for document
search - Modern DBMSs offer IR-style keyword search over
individual text attributes - What is equivalent to document in databases?
5Example Complaints Database Schema
6Example - Complaints Database Data
Complaints
Customers
Products
7Example Keyword Query Maxtor Netvista
Complaints
Customers
Products
8Keyword Query Semantics (definition of
document in databases)
- Keywords are
- in same tuple
- in same relation
- in tuples connected through primary-foreign key
relationships - Score of result
- distance of keywords within a tuple
- distance between keywords in terms of
primary-foreign key connections - IR-style score of result tree
9Example Keyword Query Maxtor Netvista
Complaints
Customers
Products
Results (1) c3, (2) p2? c3, (3) p1? c1
10Result of Keyword Query
- Result is tree T of tuples where
- each edge corresponds to a primary-foreign key
relationship - no tuple of T is redundant (minimality)
- - AND query semantics Every query keyword
appears in T - - OR query semantics Some query keywords
might be missing from T
11Score of Result T
- Combining function Score combines scores of
attribute values of T - One reasonable choice
- Score?a?TScore(a)/size(T)
- Attribute value scores Score(a) calculated using
the DBMS's IR datablades
12Shortcomings of Prior Work
- Simplistic ranking methods (e.g., based only on
size of connecting tree), ignoring well-studied
IR ranking strategies - No straightforward extension to improve
efficiency by returning just top-k results - Not good in handling free-text attributes
DBXplorer,DISCOVER
13Example Keyword Query Maxtor Netvista
Complaints
Score(c3) 4/3
Score(p1? c1) (11/3)/2 4/6
Customers
Products
Score(p2? c3) (14/3)/2 7/6
Results (1) c3, (2) p2? c3, (3) p1? c1
14Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
15Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
16Candidate Network Generator
- Find all trees of tuple sets (free or non-free)
that may produce a result, based on DISCOVER's CN
generator VLDB 2002 - Use single non-free tuple set for each relation
- allows OR semantics
- fewer CNs are generated
- extra filtering step required for AND semantics
17Candidate Network Generator Example
- For query Maxtor Netvista, CNs
- ComplaintsQ
- ProductsQ
- ComplaintsQ ? ProductsQ
- ComplaintsQ ? Customer ?ComplaintsQ
- ComplaintsQ ? Product ? ComplaintsQ
- Non-CNs
- ComplaintsQ ? Customer ?Complaints
- Product Q ? Complaints ? ProductQ
18Architecture
ComplaintsQ (c3,comments,1.33),
(c1,comments,0.33), (c2,comments,0.33)
ProductsQ (p1,manufacturer,1), (p2,model,1)
Maxtor Netvista
ComplaintsQ ProductsQ ComplaintsQ ?
ProductsQ ComplaintsQ ? Customer
?ComplaintsQ ComplaintsQ ? Product ?
ComplaintsQ
... SELECT FROM ComplaintsQ c, ProductsQ
p WHERE c.prodId p.prodId AND c.prodId? AND
c.custId ? ...
c3 p2 ? c3 p1 ? c2
19Execution Algorithms
- Users usually want top-k results.
- Hence, submitting to DBMS a SQL query for each CN
(Naïve algorithm) is inefficient. - When queries produce at most very few results,
Naïve algorithm is efficient, since it fully
exploits DBMS. - Monotonic combining functions if results T, T'
have same schema and for every attribute
Score(ai)Score(a'i) then Score(T)Score(T')
20Sparse Algorithm Example Execution
p1 9 9
c2 7 7
c1 ? p1 (95)/27 (97)/2 8
- Best when query produces at most a few results
21Single Pipelined Algorithm Example Execution
CN ComplaintsQ ? ProductsQ
Results queue
Get next tuple from most promising non-free tuple
set
MPFS
Max(59)/2, (76)/27
Max(19)/2, (76)/26.5
Max(19)/2, (71)/25
p1?c1 7
p2?c2 6.5
Output p1?c1
p2?c2
22Global Pipelined Algorithm Example Execution
global MPFSmax(MPFSi) over all CNs Ci
- Best when query produces many results.
23Hybrid Algorithm
- Estimate number of results.
- For OR-semantics, use DBMS estimator
- For AND-semantics, probabilistically adjust
DBMS estimator. - If at most a few query results expected, then use
Sparse Algorithm. - If many query results expected, then use Global
Pipelined Algorithm.
24Related Work
- DBXplorer ICDE 2002, DISCOVER VLDB 2002
- Similar three-step architecture
- Score 1/size(T)
- Only AND semantics
- No straightforward extension for efficient top-k
execution - BANKS ICDE 2002, Goldman et al. VLDB 1998
- Database viewed as graph
- No use of schema
- Florescu et al. WWW 2000, XQuery Full-Text
- Ilyas et al. VLDB 2003, J algorithm VLDB
2001 - Top-k algorithms for join queries
25Experiments DBLP Dataset
C Conference Y Year P Paper A Author
DBLP contains few citation edges. Synthetic
citation edges were added such that average
citations is 20. Final dataset is
56MB. Experiments run over state-of-the-art
commercial RDBMS.
26OR Semantics Effect of Maximum Allowed CN Size
Average execution time of 100 2-keyword top-10
queries
27OR Semantics Effect of Number of Objects
Requested k
Average execution time of 100 2-keyword queries
with maximum candidate-network size of 6
28OR Semantics Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with
maximum candidate-network size of 6
29Conclusions
- Extend IR-style ranking to databases.
- Exploit text-search capabilities of modern DBMSs,
to generate results of higher quality. - Support both AND and OR semantics.
- Achieve substantial speedup over prior work via
pipelined top-k query processing algorithms.
30Questions?
31Compare algorithms wrt Result size
OR-semantics
AND-semantics
Max CN size 6, top-10, 2 keywords, OR-semantics
32Ranking Functions
- Proposed algorithms support tuple monotone
combining functions - That is, if results T, T' have same schema and
for every attribute Score(ai)Score(a'i) then
Score(T)Score(T')