Database Implementation of a Model-Free Classifier

1
Database Implementation of a Model-Free Classifier
ADBIS 2007

Konstantinos Morfonios

2
Introduction
Motivation
LOCUS
Parallel Execution
Experimental Evaluation
Conclusions & Future Work
4
Introduction
Classification
x = <x1, x2, ..., xD>
ω = f(x)
5
Introduction
<x1,1, x1,2, ..., x1,D, ω1>
<x2,1, x2,2, ..., x2,D, ω2>
<x3,1, x3,2, ..., x3,D, ω1>
<x4,1, x4,2, ..., x4,D, ω1>
. . .
x1 = <x1, x2, ..., xD>
x2 = <x1, x2, ..., xD>
Lazy (Nearest Neighbors)
Eager (Decision Trees)
(+) Faster decisions
(-) Large/complex datasets
(-) Dynamic datasets
(-) Dynamic models
15
Motivation

Large/complex datasets
Dynamic datasets
Dynamic models

Lazy (model-free)
Nearest Neighbors
16
Motivation
Nearest Neighbors
Suffers from curse of dimensionality

Not reliable [Beyer et al., ICDT 1999]

Not indexable [Shaft et al., ICDT 2005]

LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)
17-24
Motivation
LOCUS
(Lazy Optimal Classifier of Unlimited Scalability)

Category? Lazy
Scaling? Based on simple SQL queries
Accuracy? Converges to optimal Bayes Classifier
Other features? Parallelizable

27
LOCUS
Example
28
LOCUS
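Ideally: Dense space
29
LOCUS
Ideally: Dense space
[Figure: feature space with axes f1 and f2; query point ω(<7, 4>) = ?]
30
LOCUS
[Figure: feature space with axes f1 and f2; query point ω(<7, 4>)]
31
LOCUS
Reality

Many features
Large domains

⇒ Sparse space
[Figure: feature space with axes f1 and f2]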
32
LOCUS
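[Figure: feature space with axes f1 and f2; query point ω(<7, 4>) = ?]
33
LOCUS
[Figure (3-NN): axes f1 and f2; query point ω(<7, 4>) = ?; counts ω1: 2, ω2: 1]
34
LOCUS
[Figure (3-NN): axes f1 and f2; query point ω(<7, 4>); counts ω1: 2, ω2: 1]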
35
LOCUS
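[Figure (LOCUS): axes f1 and f2; query point ω(<7, 4>) = ?]
36
LOCUS
[Figure (LOCUS): axes f1 and f2; query point ω(<7, 4>) = ?; counts ω1: 7, ω2: 3]
37
LOCUS
[Figure (LOCUS): axes f1 and f2; query point ω(<7, 4>); counts ω1: 7, ω2: 3]
38
LOCUS
Disk-based implementation
[Figure (LOCUS): axes f1 and f2]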
39
LOCUS
SELECT ω, count(*) FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
R(f1, f2, ω)
Query point: <x1, x2>
Counts for ω(<7, 4>): ω1: 7, ω2: 3
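The counting query on this slide can be tried out with a minimal sketch like the one below. It uses Python's built-in sqlite3 module; the column name `label` (standing in for the class attribute), the toy rows, the half-widths d = (2, 2), and the majority-vote decision rule are all assumptions made for illustration, not details taken from the talk.

```python
import sqlite3

def locus_classify(conn, x, d):
    """Count class labels inside the box [x - d, x + d]; return (majority, counts)."""
    lo1, hi1 = x[0] - d[0], x[0] + d[0]
    lo2, hi2 = x[1] - d[1], x[1] + d[1]
    rows = conn.execute(
        "SELECT label, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
        "GROUP BY label",
        (lo1, hi1, lo2, hi2),
    ).fetchall()
    counts = dict(rows)
    # Assumed decision rule: pick the class with the most points in the box.
    winner = max(counts, key=counts.get) if counts else None
    return winner, counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [
    # Hypothetical points inside the box [5, 9] x [2, 6] around <7, 4>
    (6, 3, "w1"), (6, 4, "w1"), (7, 3, "w1"), (7, 5, "w1"),
    (8, 4, "w1"), (8, 5, "w1"), (5, 4, "w1"),
    (9, 6, "w2"), (8, 2, "w2"), (6, 6, "w2"),
    # Hypothetical points outside the box
    (1, 1, "w2"), (2, 8, "w2"), (12, 4, "w2"), (7, 9, "w1"), (3, 4, "w2"),
])

winner, counts = locus_classify(conn, x=(7, 4), d=(2, 2))
print(winner, counts)
```

With these made-up rows the box holds seven points of one class and three of the other, mirroring the 7 vs. 3 counts shown on the slide.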
40
LOCUS
SELECT ω, count(*) FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
R(f1, f2, ω)
What if R is large?
Classical optimization techniques for a well-known type of aggregate queries

Indexing

Materialized views

Presorting
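As a rough illustration of the first technique, the sketch below adds a composite index to a toy table and re-runs the same range-count query. The table layout, the column name `label`, the index name `idx_r_f1_f2`, and the synthetic grid data are assumptions; SQLite is used only because it ships with Python, and whether a given optimizer actually uses the index depends on the engine and the data.

```python
import sqlite3

QUERY = ("SELECT label, count(*) FROM R "
         "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
         "GROUP BY label")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
# Hypothetical 20 x 20 grid of points with a simple labeling rule.
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(i, j, "w1" if i >= j else "w2") for i in range(20) for j in range(20)],
)

box = (5, 9, 2, 6)  # f1 in [5, 9], f2 in [2, 6]
before = dict(conn.execute(QUERY, box).fetchall())

# Composite index on the range columns; including the label lets the
# engine answer the count from the index alone if it chooses to.
conn.execute("CREATE INDEX idx_r_f1_f2 ON R (f1, f2, label)")
after = dict(conn.execute(QUERY, box).fetchall())

# Inspect the plan chosen by SQLite (output format varies by version).
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, box).fetchall()
print(before, after)
for step in plan:
    print(step)
```

The index changes only how the rows are found, not the counts, so the classifier's answer is unaffected by this kind of tuning.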

41
LOCUS
SELECT ω, count(*) FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
R(f1, f2, ω)
Method reliability?
LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper)
42
LOCUS
SELECT ω, count(*) FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
AND f2 >= x2 - d2 AND f2 <= x2 + d2
GROUP BY ω
R(f1, f2, ω)
What if a feature, say f2, is categorical (e.g. sex)?
43
LOCUS
SELECT ω, count(*) FROM R
WHERE f1 >= x1 - d1 AND f1 <= x1 + d1
AND f2 = x2
GROUP BY ω
R(f1, f2, ω)
What if a feature, say f2, is categorical (e.g. sex)?
Not a problem, since generally in practice:

Datasets combine categorical and numeric features

Categorical features have small domains

Hence, they do not contribute to sparsity
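The mixed case can be sketched as a small query builder that emits a range predicate for each numeric feature and an equality predicate for each categorical one. The helper name `box_query`, the column name `label`, and the sample rows are hypothetical and serve only to show the shape of the generated SQL.

```python
import sqlite3

def box_query(numeric, categorical):
    """Build the count query.

    numeric:     {column: (value, half_width)} -> range predicate
    categorical: {column: value}               -> equality predicate
    Column names are assumed to come from trusted code, not user input.
    """
    clauses, params = [], []
    for col, (x, d) in numeric.items():
        clauses.append(f"{col} >= ? AND {col} <= ?")
        params += [x - d, x + d]
    for col, value in categorical.items():
        clauses.append(f"{col} = ?")
        params.append(value)
    sql = ("SELECT label, count(*) FROM R WHERE "
           + " AND ".join(clauses) + " GROUP BY label")
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, label TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [
    (28, "F", "w1"), (30, "F", "w1"), (32, "F", "w1"), (29, "F", "w2"),
    (35, "F", "w2"), (50, "F", "w2"),  # outside the f1 range
    (31, "M", "w2"), (33, "M", "w2"),  # different category
])

sql, params = box_query({"f1": (31, 3)}, {"f2": "F"})
counts = dict(conn.execute(sql, params).fetchall())
print(sql)
print(counts)
```

Only rows that match the category exactly and fall inside the numeric range are counted, so a categorical feature narrows the neighborhood without needing a width parameter.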
46
Parallel Execution
R = R1 ∪ R2 ∪ R3 ∪ R4
47
Parallel Execution
Count: distributive function
Total: ω1: 23, ω2: 4
Partial counts per partition: (ω1: 7, ω2: 1), (ω1: 5, ω2: 2), (ω1: 6, ω2: 0), (ω1: 5, ω2: 1)
48
Parallel Execution

Small network traffic
Load balancing
Lightweight operations on the main server

Partial counts per partition: (ω1: 7, ω2: 1), (ω1: 5, ω2: 2), (ω1: 6, ω2: 0), (ω1: 5, ω2: 1)
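Because count is distributive, each partition can answer the same query locally and the main server only has to add up a handful of numbers. The sketch below imitates this with four in-memory SQLite databases queried one after another; the round-robin partitioning, the column name `label`, the helper names, and the synthetic data are assumptions, and a real deployment would run the local queries concurrently on separate machines.

```python
import sqlite3
from collections import Counter

QUERY = ("SELECT label, count(*) FROM R "
         "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
         "GROUP BY label")

def make_node(rows):
    """Create one hypothetical partition holding a subset of R."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, label TEXT)")
    conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)
    return conn

def local_counts(conn, box):
    """Run the count query on one partition; absent classes simply have no row."""
    return Counter(dict(conn.execute(QUERY, box).fetchall()))

# Hypothetical dataset, split round-robin into four partitions R1..R4.
rows = [(i, j, "w1" if i >= j else "w2") for i in range(20) for j in range(20)]
nodes = [make_node(rows[k::4]) for k in range(4)]

box = (5, 9, 2, 6)
partials = [local_counts(node, box) for node in nodes]

# The main server only sums the small per-class partial counts.
total = sum(partials, Counter())

# Sanity check against a single unpartitioned copy of R.
single = local_counts(make_node(rows), box)
print(partials)
print(dict(total), dict(single))
```

Each node ships back one small number per class, which is why network traffic stays low and the merge step on the main server is lightweight.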
51
Experimental Evaluation

LOCUS vs. DTs and NNs (Weka)
Synthetic datasets
Ten functions [Agrawal et al., IEEE TKDE 1993]
D = 9
N ∈ [5×10^3, 5×10^6]
Real-world datasets
UCI Repository

52
Experimental Evaluation
Classification error rate (synthetic datasets, N = 5×10^4)
53
Experimental Evaluation
Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
54
Experimental Evaluation
Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5×10^3, 5×10^6])
55
Experimental Evaluation
Classification error rate (real-world datasets)
56
Experimental Evaluation
Effect of dataset size on classification error rate (dataset CovType, N ∈ [5×10^3, 5×10^5])
59
Conclusions & Future Work