Title: Expressing and Optimizing the Similaritybased Queries In SQL
1Expressing and Optimizingthe Similarity-based
QueriesIn SQL
- by Like Gao (lgao_at_gmu.edu)
- ISE Department
- GMU
- Advisor Prof. X. Sean Wang
2What is similarity-based search?
3Example time series case
- Similarity Euclidean Distance, Correlation
Coefficient - Similarity-based Search
Dissimilar Distance2.5
Similar Distance0.5
t
t
Pattern set
Query object
Nearest
Near
4Example not similarity-based search!
5Similarity-based Search is content based!
6Query optimization in DBMS? What is
available/unavailable?
7Query Optimization
Select ID From Employee Where Salary gt
100K And Age lt 40
Query Optimization is to find the query
execution plan that has the smallest cost
Employee
Supported by current DBMS!
8UDT, UDF(P) are supported
Select ID From Animal Where Photo is most like
???.jpg And animal in Photo is puppy And Age
lt 4
UDP written by users, supported by DBMS
Animal
UDT supported by DBMS i.e., BLOB
9But query optimization?
Select ID From Animal Where UDP1
(similarity-based search 1) And UDP2
(similarity-based search 2) And Age lt 4
The optimization for similarity-based queries can
not be properly optimized by current
DBMS! (traditional DBMS evaluates UDP as early as
possible!)
10Summary
RDBMS
Build-in Optimization Algorithms
?
Similarity-Based Queries (SQL)
Answer
Similarity-functions (UDP), provided by users
11Expressing and Optimizingthe Similarity-based
QueriesIn SQL
End of Part 1
12Outline
- RDBMS and SQL
- Query Optimization
- Similarity-Based Search
- Expressing and Optimizing Similarity- Based
Queries
13RDBMS and SQL
- RDBMS (Relational Database Management System)
- Based on E. F. Codds relational model (theory),
1969 - Venders IBM DB2, ORACLE, Microsoft SQL Server,
etc. - http//www.cs.yale.edu/homes/avi/db-book/i
ndex.html - SQL (Structured Query Language)
- The standard language used by RDBMS
- http//sqlzoo.net/
- A quick review of RDBMS and SQL
- http//mason.gmu.edu/xywang/infs797/lectu
re1.pdf
14Query Optimization
If most of employees salaries are very likely
gt100K, ? sequential scan
Select ID From Employee Where Salary gt 100K
If 95 of the employees salaries are unlikely
gt100K, ? using the index on Salary
Execution Engine
Optimizer ?SSCAN ?Index
Plan
IDs
Query Optimization is to find the plan that has
the smallest execution cost
Employee Data
Statistics About Employee
15Similarity-Based Search
- Nearest Neighbor
- Given a set of pattern objects P and a query
object q, an object pi in P such that sim(pi,q) lt
sim(pj,q), for any j?i. - k-nearest neighbors
- Near Neighbors
- Given a set of pattern objects P, a query object
q and a threshold TH, all object pi in P such
that sim(pi,q) lt TH. - Both nearest and near neighbors
16Example time series case
- Similarity Euclidean Distance, Correlation
Coefficient - Similarity-based Search
similar
dissimilar
t
t
pi
Nearest
q
P
pi1
Near
pi2
17Example not similarity-based search!
18Similarity-Based Queries in SQL
- Current DBMS support UDT and UDF.
- (Uuser, D defined, T type and F function)
- Express the similarity predicate as UDP
(user-defined-predicate) - NN(query object, one given object, pattern set,
K, TH) ? T/F?
PatternSet
Select P.ID From PatternSet P Where NN(q,
P.ID, Set1, 1, 0.2) And P.ID lt 1000
19Optimizing
Select P.ID From PatternSet P Where NN(q,
P.ID, Set1, 1, 0.2) And P.ID lt 1000
Only a few P.ID lt 1000
Plan A 1) P.ID lt 1000 2) NN(q, P.ID,
Set1, 1, 0.2)
Many P.ID lt 1000
Plan B 1) NN(q, P.ID, Set1, 1, 0.2) 2)
P.ID lt 1000