Title: Database Implementation of a Model-Free Classifier
ADBIS 2007
Introduction
Motivation
LOCUS
Parallel Execution
Experimental Evaluation
Conclusions & Future Work
Introduction
Classification
x = <x1, x2, ..., xD>
ω = f(x)
Introduction
<x1,1, x1,2, ..., x1,D, ω1>
<x2,1, x2,2, ..., x2,D, ω2>
<x3,1, x3,2, ..., x3,D, ω1>
<x4,1, x4,2, ..., x4,D, ω1>
. . .
x1 = <x1, x2, ..., xD>
x2 = <x1, x2, ..., xD>
Lazy (Nearest Neighbors) vs. Eager (Decision Trees)
Eager classifiers:
(+) Faster decisions
(-) Large/complex datasets
(-) Dynamic datasets
(-) Dynamic models
Motivation
- Large/complex datasets
- Dynamic datasets
- Dynamic models
Lazy (model-free)
Nearest Neighbors
Motivation
Nearest Neighbors suffers from the curse of dimensionality:
- Not reliable [Beyer et al., ICDT 1999]
- Not indexable [Shaft et al., ICDT 2005]
Motivation
LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
- Based on simple SQL queries
- Converges to the optimal Bayes classifier
LOCUS
Example
[Figure: training points in the two-dimensional feature space (axes f1, f2)]
Ideally: dense space
ω(<7, 4>) = ?
Reality:
- Many features
- Large domains
⇒ Sparse space
LOCUS
3-NN
[Figure: the three nearest neighbors of <7, 4> in the (f1, f2) plane]
ω(<7, 4>) = ?
ω1: 2
ω2: 1
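The 3-NN vote on this slide can be sketched in a few lines of Python. This is only an illustration: the training points below are hypothetical (the slide's actual points are not recoverable), chosen so that the three nearest neighbors of <7, 4> split 2 to 1 as on the slide, and the labels "w1" and "w2" stand in for ω1 and ω2.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], dict(votes)


# Hypothetical training points: ((f1, f2), class label).
train = [
    ((6, 4), "w1"), ((7, 5), "w1"), ((8, 3), "w2"),
    ((1, 1), "w2"), ((2, 9), "w1"), ((9, 9), "w2"),
]

label, votes = knn_classify(train, (7, 4))
print(label, votes)
```

Note that the whole training set is sorted by distance for every query, which is the part that becomes expensive and hard to index in high dimensions.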
LOCUS
[Figure: the region around <7, 4> examined by LOCUS in the (f1, f2) plane]
ω(<7, 4>) = ?
ω1: 7
ω2: 3
LOCUS
Disk-based implementation
LOCUS
R(f1, f2, ω)
Query point: <x1, x2>
SELECT ω, count(*) FROM R
WHERE f1 >= x1-d1 AND f1 <= x1+d1
  AND f2 >= x2-d2 AND f2 <= x2+d2
GROUP BY ω
ω(<7, 4>) = ?
ω1: 7
ω2: 3
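The range-count query above can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the column name `label` (standing in for the class column of R), the function name `locus_classify`, the tie-breaking `ORDER BY`, and the sample tuples are all hypothetical and not taken from the paper.

```python
import sqlite3


def locus_classify(conn, x1, x2, d1, d2):
    """Count training tuples per class inside the box around (x1, x2)
    and return the majority class together with the counts."""
    rows = conn.execute(
        """
        SELECT label, count(*) FROM R
        WHERE f1 >= ? AND f1 <= ?
          AND f2 >= ? AND f2 <= ?
        GROUP BY label
        ORDER BY count(*) DESC, label
        """,
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    predicted = rows[0][0] if rows else None  # None if the box is empty
    return predicted, dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
# Hypothetical training tuples (f1, f2, class label).
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        (6, 4, "w1"), (7, 5, "w1"), (8, 3, "w2"), (5, 2, "w1"),
        (9, 6, "w1"), (1, 1, "w2"), (2, 9, "w1"), (9.5, 4, "w2"),
    ],
)

predicted, counts = locus_classify(conn, x1=7, x2=4, d1=2, d2=2)
print(predicted, counts)
```

No model is built ahead of time: inserting or deleting tuples in R changes the next answer immediately, which is the sense in which the classifier is lazy.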
LOCUS
What if R is large?
Classical optimization techniques apply, since this is a well-known type of aggregate query
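The slide does not say which optimization techniques are meant. One plausible candidate, shown here purely as an assumption, is an ordinary composite index over the feature columns, so that the range predicates do not require scanning all of R. The index name `r_f1_f2_label` and the column name `label` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
# Assumption: a composite B-tree index, one possible "classical" technique.
conn.execute("CREATE INDEX r_f1_f2_label ON R (f1, f2, label)")

query = """
    SELECT label, count(*) FROM R
    WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ?
    GROUP BY label
"""
# Ask SQLite how it would evaluate the query; the last column of each
# row is a human-readable description of one plan step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (5, 9, 2, 6)).fetchall()
plan_text = "\n".join(row[-1] for row in plan)
print(plan_text)
```

Whether the planner actually picks the index depends on the database engine and its statistics, so the plan should be inspected rather than assumed.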
LOCUS
Method reliability?
LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper)
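For reference, "optimal" here refers to the Bayes decision rule, which picks the class with the highest posterior probability. The notation below is a sketch, not the paper's proof: the symbols B(x) for the query box and n_ω(x) for the per-class count returned by the query are assumed names introduced for illustration.

```latex
% Bayes-optimal rule: choose the class with the largest posterior.
\omega^{*}(x) = \arg\max_{\omega} P(\omega \mid x)

% Count-based rule (assumed notation): choose the class with the
% most training tuples inside the box B(x) around the query point.
\hat{\omega}(x) = \arg\max_{\omega} n_{\omega}(x),
\qquad
n_{\omega}(x) = \bigl|\{\, t \in R : t \in B(x),\ \mathrm{class}(t) = \omega \,\}\bigr|
```

The convergence claim on the slide concerns how the second rule relates to the first as the dataset grows; the precise conditions are given in the paper.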
LOCUS
What if a feature, say f2, is categorical (e.g. sex)?
R(f1, f2, ω)
SELECT ω, count(*) FROM R
WHERE f1 >= x1-d1 AND f1 <= x1+d1 AND f2 = x2
GROUP BY ω
Not a problem, since in practice:
- Datasets combine categorical and numeric features
- Categorical features have small domains
Hence, categorical features do not contribute to sparsity
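The mixed numeric/categorical case can be sketched the same way with sqlite3: a range predicate on the numeric feature and an equality predicate on the categorical one. The column name `label`, the function name `classify_mixed`, and the sample tuples (f1 as an age-like number, f2 as sex) are hypothetical.

```python
import sqlite3


def classify_mixed(conn, x1, x2, d1):
    """Per-class counts with a range on numeric f1 and equality on categorical f2."""
    rows = conn.execute(
        """
        SELECT label, count(*) FROM R
        WHERE f1 >= ? AND f1 <= ? AND f2 = ?
        GROUP BY label
        """,
        (x1 - d1, x1 + d1, x2),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, label TEXT)")
# Hypothetical training tuples (f1, f2, class label).
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [
        (30, "F", "w1"), (32, "F", "w1"), (31, "M", "w2"),
        (29, "F", "w2"), (50, "F", "w2"),
    ],
)

mixed_counts = classify_mixed(conn, x1=30, x2="F", d1=2)
print(mixed_counts)
```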
Parallel Execution
R = R1 ∪ R2 ∪ R3 ∪ R4
Parallel Execution
Count: distributive function
Partial counts, one line per partition of R:
ω1: 7, ω2: 1
ω1: 5, ω2: 2
ω1: 6, ω2: 0
ω1: 5, ω2: 1
Total: ω1: 23, ω2: 4
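Because count is distributive, each partition can answer the query on its own and only the small per-class counts need to be combined. The sketch below merges the four partial results shown on this slide; the function name `merge_partial_counts` and the labels "w1" and "w2" (standing in for ω1 and ω2) are illustrative assumptions, and the partitions are plain dictionaries rather than real database servers.

```python
from collections import Counter


def merge_partial_counts(partials):
    """Sum per-class counts computed independently on each partition."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Partial counts from the slide, one dictionary per partition of R.
partials = [
    {"w1": 7, "w2": 1},
    {"w1": 5, "w2": 2},
    {"w1": 6, "w2": 0},
    {"w1": 5, "w2": 1},
]

total = merge_partial_counts(partials)
predicted = max(total, key=total.get)
print(dict(total), predicted)
```

Each partition ships only one number per class, so the merge step stays cheap regardless of how many tuples each partition holds.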
Parallel Execution
- Small network traffic
- Load balancing
- Lightweight operations on the main server
Experimental Evaluation
- LOCUS vs. DTs and NNs (Weka)
- Synthetic datasets
  - Ten functions [Agrawal et al., IEEE TKDE 1993]
  - D = 9
  - N ∈ [5×10^3, 5×10^6]
- Real-world datasets
  - UCI Repository
Figure: Classification error rate (synthetic datasets, N = 5×10^4)
Figure: Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
Figure: Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
Figure: Classification error rate (real-world datasets)
Figure: Effect of dataset size on classification error rate (dataset CovType, N ∈ [5×10^3, 5×10^5])
Conclusions & Future Work
- LOCUS
  - Lazy (complex/dynamic datasets and models)
  - Efficient (based on simple SQL queries)
  - Reliable (converging to optimal)
  - Parallelizable
- Similar techniques for
  - feature selection
  - regression
- Implementation of a parallel version
Questions?
Thank you!