Title: Grasshoppers
1. The Classification Problem (informal definition)
Given a collection of annotated data (in this
case, five instances of Katydids and five of
Grasshoppers), decide what type of insect the
unlabeled example is.
(Scatter plot of the two labeled classes, Katydids and Grasshoppers.)
Katydid or Grasshopper?
2. Simple Linear Classifier
R. A. Fisher, 1890-1962
If the previously unseen instance is above the
line, then its class is Katydid; else its
class is Grasshopper.
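The rule can be written as a tiny decision function. This is a minimal sketch only: the slope and intercept below are made-up placeholders, not values fit to the insect data.

```python
# Minimal sketch of the "above the line" rule. The slope and intercept are
# hypothetical placeholders; in practice they would be fit to the training data.
def classify_linear(abdomen, antenna, slope=1.0, intercept=0.0):
    if antenna > slope * abdomen + intercept:
        return "Katydid"      # instance falls above the line
    return "Grasshopper"      # instance falls on or below the line

print(classify_linear(abdomen=5.0, antenna=7.0))  # -> Katydid
```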
3. Nearest Neighbor Classifier
Evelyn Fix, 1904-1965
Joe Hodges, 1922-2000
If the nearest instance to the previously unseen
instance is a Katydid, its class is Katydid; else its
class is Grasshopper.
(Scatter plot axes: Abdomen Length vs. Antenna Length.)
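A minimal 1-NN sketch, using a few of the labeled insects from the hold-out table later in the deck; `math.dist` (Python 3.8+) computes the Euclidean distance.

```python
import math

# Labeled (abdomen length, antenna length) instances from the insect table.
train = [
    ((2.7, 5.5), "Grasshopper"),
    ((8.0, 9.1), "Katydid"),
    ((1.1, 3.1), "Grasshopper"),
    ((5.4, 8.5), "Katydid"),
]

def classify_1nn(query):
    """Return the class of the training instance nearest to `query`."""
    nearest = min(train, key=lambda item: math.dist(item[0], query))
    return nearest[1]

print(classify_1nn((6.0, 7.0)))  # nearest labeled instance is a Katydid
```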
4. Up to now we have assumed that the nearest
neighbor algorithm uses the Euclidean distance;
however, this need not be the case. Other choices include:
- Max (p = infinity)
- Manhattan (p = 1)
- Weighted Euclidean
- Mahalanobis
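Each of the alternatives listed above fits in a line or two. A minimal sketch assuming NumPy; `q` and `c` are feature vectors of equal length.

```python
import numpy as np

def manhattan(q, c):                 # L1 norm (p = 1)
    return np.sum(np.abs(q - c))

def chebyshev(q, c):                 # "Max" distance (p = infinity)
    return np.max(np.abs(q - c))

def weighted_euclidean(q, c, w):     # per-feature weights w
    return np.sqrt(np.sum(w * (q - c) ** 2))

def mahalanobis(q, c, cov):          # cov is the covariance matrix of the data
    diff = q - c
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)
```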
5. Suppose that you have two features, GPA and GRE, and
you think that GRE is twice as important as GPA.
You can use the weighted Euclidean distance:
D(Q, C) = sqrt( sum_i w_i (q_i - c_i)^2 )
Feature 1 is GPA, Feature 2 is GRE, and the weight
vector W gives GRE twice the weight of GPA.
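A concrete illustration of the weighting, as a minimal sketch: the two applicants and their GPA/GRE values are made up, and the weight vector simply gives the GRE feature twice the weight of the GPA feature.

```python
import numpy as np

def weighted_euclidean(q, c, w):
    return np.sqrt(np.sum(w * (np.asarray(q) - np.asarray(c)) ** 2))

w = np.array([1.0, 2.0])        # feature 1 = GPA, feature 2 = GRE (twice the weight)
applicant_a = [3.5, 160.0]      # hypothetical (GPA, GRE) pairs
applicant_b = [3.9, 155.0]
print(weighted_euclidean(applicant_a, applicant_b, w))
```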
6. Hold Out Data
- How do we estimate the accuracy of our classifier?
- We can use hold out data.
We divide the dataset into two partitions, called
train and test. We build our models on train, and
see how well we do on test. A sketch appears after the tables.

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

train (instances 1-5):
1   2.7  5.5  Grasshopper
2   8.0  9.1  Katydid
3   0.9  4.7  Grasshopper
4   1.1  3.1  Grasshopper
5   5.4  8.5  Katydid

test (instances 6-10):
6   2.9  1.9  Grasshopper
7   6.1  6.6  Katydid
8   0.5  1.0  Grasshopper
9   8.3  6.6  Katydid
10  8.1  4.7  Katydid
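A minimal hold-out sketch over the table above: the model (here a 1-NN classifier, just one possible choice) is built on instances 1-5 and scored once on instances 6-10.

```python
import math

# (abdomen length, antenna length, class) for the ten insects above
data = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid"),
]
train, test = data[:5], data[5:]   # instances 1-5 vs. 6-10, as in the slide

def predict_1nn(rows, x, y):
    return min(rows, key=lambda r: math.dist((r[0], r[1]), (x, y)))[2]

correct = sum(predict_1nn(train, x, y) == label for x, y, label in test)
print(f"hold-out accuracy: {correct / len(test):.2f}")
```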
7. Cross Validation
- How do we estimate the accuracy of our classifier?
- We can use K-fold cross validation.
We divide the dataset into K equal-sized
sections. The algorithm is tested K times, each
time leaving out one of the K sections when
building the classifier, but using it to test the
classifier instead.

Accuracy = (number of correct classifications) /
(number of instances in our database)

K = 5

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

A sketch of K-fold cross validation follows.
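A minimal K = 5 cross-validation sketch over the same ten insects, again with 1-NN as the classifier being evaluated; accuracy is the number of correct classifications divided by the number of instances, as in the formula above.

```python
import math

data = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid"),
]

def predict_1nn(rows, x, y):
    return min(rows, key=lambda r: math.dist((r[0], r[1]), (x, y)))[2]

K = 5
fold = len(data) // K
correct = 0
for k in range(K):                                   # leave one section out each time
    test = data[k * fold:(k + 1) * fold]
    train = data[:k * fold] + data[(k + 1) * fold:]
    correct += sum(predict_1nn(train, x, y) == label for x, y, label in test)

print(f"{K}-fold accuracy: {correct / len(data):.2f}")
```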
8. Setting Parameters and Overfitting
- You need to classify widgets; you get a training set.
- You could use a Linear Classifier or Nearest Neighbor.
- Nearest Neighbor:
  - You could use 1NN, 3NN, 5NN
  - You could use Euclidean distance, L1, L-infinity, Mahalanobis
  - You could do some data editing
  - You could do some feature weighting
  - You could ...
- Linear Classifier:
  - You could use a constant classifier
  - You could use a linear classifier
  - You could use a quadratic classifier
  - You could ...
Model Selection
Parameter Selection (also called parameter tuning or tweaking)
9. Setting Parameters and Overfitting
- You need to classify widgets; you get a training set.
- You could use a Linear Classifier or Nearest Neighbor.
- Nearest Neighbor:
  - You could use 1NN, 3NN, 5NN
  - You could use Euclidean distance, L1, L-infinity, Mahalanobis
  - You could do some data editing
  - You could do some feature weighting
  - You could ...
- Linear Classifier:
  - You could use a constant classifier
  - You could use a linear classifier
  - You could use a quadratic classifier
  - You could ...
10. Overfitting
Overfitting occurs when a statistical
model describes random error or noise instead of
the underlying relationship. Overfitting
generally occurs when a model is excessively
complex, such as having too many parameters
relative to the number of observations. A model
which has been overfit will generally have
poor predictive performance, as it can exaggerate
minor fluctuations in the data.
11.
- Suppose we need to solve a classification problem.
- We are not sure if we should use the
  - simple linear classifier, or the
  - simple quadratic classifier.
- How do we decide which to use?
We do cross validation or leave-one-out and
choose the best one.
12.
- Simple linear classifier gets 81% accuracy
- Simple quadratic classifier gets 99% accuracy
13.
- Simple linear classifier gets 96% accuracy
- Simple quadratic classifier gets 97% accuracy
14.
- This problem is greatly exacerbated by having too little data
- Simple linear classifier gets 90% accuracy
- Simple quadratic classifier gets 95% accuracy
15. What happens as we have more and more training
examples? The accuracy of all models goes
up! The chance of making a mistake (choosing the
wrong model) goes down. Even if we make a mistake,
it will not matter too much (because we would
learn a degenerate quadratic that is basically a
straight line).
- Simple linear 70% accuracy, simple quadratic 90% accuracy
- Simple linear 90% accuracy, simple quadratic 95% accuracy
- Simple linear 99.999999% accuracy, simple quadratic 99.999999% accuracy
16. One Solution: Charge a Penalty for Complex Models
- For example, for the simple polynomial
classifier, we could charge 1 point for every
increase in the degree of the polynomial:
- Simple linear classifier gets 90.5% accuracy, minus 0, equals 90.5
- Simple quadratic classifier gets 97.0% accuracy, minus 1, equals 96.0
- Simple cubic classifier gets 97.05% accuracy, minus 2, equals 95.05
(Figure: three panels showing the linear, quadratic, and cubic classifiers on
the same data, labeled Accuracy 90.5, 97.0, and 97.05; both axes run 1-10.)
17. One Solution: Charge a Penalty for Complex Models
- For example, for the simple polynomial
classifier, we could charge 1 point for every increase
in the degree of the polynomial.
- There are more principled ways to charge penalties.
- In particular, there is a technique called
Minimum Description Length (MDL). A sketch of the simple
per-degree penalty follows.
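A minimal sketch of the per-degree penalty described above, using the accuracies quoted on the previous slide (this is the simple charge-1-per-degree scheme, not MDL).

```python
# Accuracy quoted on the slide for each polynomial degree.
accuracies = {1: 90.5, 2: 97.0, 3: 97.05}   # degree -> raw accuracy (%)

def penalized(degree, accuracy, charge=1.0):
    """Charge `charge` points for every degree above linear."""
    return accuracy - charge * (degree - 1)

scores = {d: penalized(d, a) for d, a in accuracies.items()}
best_degree = max(scores, key=scores.get)
print(scores)           # {1: 90.5, 2: 96.0, 3: 95.05}
print(best_degree)      # 2: the quadratic wins under this penalty
```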
18. Suppose you have a four-feature problem, and you
want to search over feature subsets. It happens
to be the case that features 2 and 3, shown
here, are all you need, and the other features are
random.
19. Suppose you have a four-feature problem, and you
want to search over feature subsets. It happens
to be the case that features 2 and 3, shown
here, are all you need, and the other features are
random.
20. (No transcript)
21. Which is better, Euclidean distance or DTW?
22. Leave-one-out error rate on two datasets:

Name      Classes  Instances  Euclidean Error (%)  DTW Error (%)  r
Chicken   5        446        19.96                19.96
MixedBag  9        160        4.375                4.375
23.
Name            Classes  Instances  Euclidean Error (%)  DTW Error (%)  r
Face            16       2240       3.839                3.170
Swedish Leaves  15       1125       13.33                10.84
Chicken         5        446        19.96                19.96
MixedBag        9        160        4.375                4.375
OSU Leaves      6        442        33.71                15.61
Diatoms         37       781        27.53                27.53
Plane           7        210        0.95                 0.0
Fish            7        350        11.43                9.71
24. (No transcript)
25.
Name        Classes  Instances  Euclidean Error (%)  DTW Error (%)  r
OSU Leaves  6        442        33.71                15.61
(Plot annotated with the two error rates, 33.71 and 15.61.)
26. (Scatter plot with Euclidean Error on one axis and DTW Error on the other:
in one region Euclidean is better; in the other region DTW is better.)
27. In this region, DTW is better.
28. A paper claims that ERP performs the best (over DTW).
The datasets are small, and there are only three of them (one is
synthetic), so let us do better tests.
Lei Chen, Raymond T. Ng: On the Marriage of
Lp-norms and Edit Distance. VLDB 2004: 792-803
29. "ERP performs the best (over DTW)"
In this region, ERP is better.
30. Suppose the claim was "ERP performs the best
(over DTW) when the data is periodic (or X)".
That is fine, but you must state what X is ahead
of time.
In this region, ERP is better.
31. "Our approach, TQuEST, significantly outperforms
the only competitor (DTW)."
- Johannes Aßfalg, Thomas Bernecker, Hans-Peter Kriegel, Peer Kröger, Matthias Renz: Periodic Pattern Analysis in Time Series Databases. DASFAA 2009: 354-368
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: T-Time: Threshold-Based Data Mining on Time Series. ICDE 2008: 1620-1623
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Similarity Search in Multimedia Time Series Data Using Amplitude-Level Features. MMM 2008: 123-133
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Interval-Focused Similarity Search in Time Series Databases. DASFAA 2007: 586-597
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Semi-Supervised Threshold Queries on Pharmacogenomics Time Sequences. APBC 2006: 307-316
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: TQuEST: Threshold Query Execution for Large Sets of Time Series. EDBT 2006: 1147-1150
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Similarity Search on Time Series Based on Threshold Queries. EDBT 2006: 276-294
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Threshold Similarity Queries in Large Time Series Databases. ICDE 2006: 149
- Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, Matthias Renz: Time Series Analysis Using the Concept of Adaptable Threshold Similarity. SSDBM 2006: 251-260
32. "Our approach, TQuEST, significantly outperforms
the only competitor (DTW)."
In this region, TQuEST is better.
33. How come there are so many claims in the
literature that are simply not true?
34. Two time series, C and Q:
C = 1.2, 1.3, 1.5, ..., 2.9
Q = 1.0, 1.2, 1.2, ..., 3.1
35. The same two time series, now with a weight vector W:
C = 1.2, 1.3, 1.5, ..., 2.9
Q = 1.0, 1.2, 1.2, ..., 3.1
W = 2, 2, 2, ..., 1, 1, 1, ..., 2, 2, 2
Weighted Euclidean distance
36. How do we set the weights?
For every dataset, we come up with a set of
weights (somehow), test them with leave-one-out,
and report the best results. A sketch of this
procedure follows.
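Here is a minimal sketch of the procedure just described (the one the following slides criticize): score each candidate weight vector by leave-one-out 1-NN accuracy over the whole dataset, then keep the best one. The arrays X and y below are random placeholders, not real time series.

```python
import numpy as np

def loo_accuracy(X, y, w):
    """Leave-one-out 1-NN accuracy under the weighted Euclidean distance."""
    correct = 0
    for i in range(len(X)):
        dists = np.sqrt(np.sum(w * (X - X[i]) ** 2, axis=1))
        dists[i] = np.inf                      # exclude the held-out instance itself
        correct += y[np.argmin(dists)] == y[i]
    return correct / len(X)

rng = np.random.default_rng(0)
X = rng.random((20, 8))                        # placeholder: 20 series of length 8
y = rng.integers(0, 2, size=20)                # placeholder class labels
candidates = [np.ones(8)] + [rng.random(8) for _ in range(5)]
best_w = max(candidates, key=lambda w: loo_accuracy(X, y, w))
print(best_w, loo_accuracy(X, y, best_w))
```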
37. I have done this... Weighted Euclidean DOES work
a lot better.
38. How does ANA work?
We downloaded the mitochondrial DNA of a monkey,
Macaca mulatta. We converted the DNA to a string
of integers, with A (Adenine) = 0, C (Cytosine) = 1,
G (Guanine) = 2, T (Thymine) = 3. So the
DNA string GATCA... becomes 2, 0, 3, 1, 0, ...
Given that we have a string of 16,564 integers,
we can use the integers starting at position K as the
weights of the weighted Euclidean distance. If K = 1 is
not good, we try K = 2, then K = 3. So ANA is nothing
more than the weighted Euclidean distance, weighted by
monkey DNA. A sketch appears below.
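A minimal sketch of what the slide describes: map the DNA string to integers and use a window of them, starting at offset K, as the weights of a weighted Euclidean distance. The short DNA string here is just a stand-in for the full 16,564-base sequence.

```python
import numpy as np

code = {"A": 0, "C": 1, "G": 2, "T": 3}
dna = "GATCACAGGTCTATCACCCT"              # placeholder for the Macaca mulatta mtDNA
dna_weights = np.array([code[b] for b in dna], dtype=float)

def ana_distance(q, c, k):
    """Weighted Euclidean distance, weighted by DNA integers starting at offset k."""
    q, c = np.asarray(q), np.asarray(c)
    w = dna_weights[k:k + len(q)]
    return np.sqrt(np.sum(w * (q - c) ** 2))

q = [1.0, 1.2, 1.2, 3.1]
c = [1.2, 1.3, 1.5, 2.9]
print(ana_distance(q, c, k=1))             # if k = 1 is not good, try k = 2, 3, ...
```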
39. Researchers are adjusting the parameters after
seeing the results on the test set.
40. Hold Out Data
- How do we estimate the accuracy of our classifier?
- We can use hold out data.
We divide the dataset into two partitions, called
train and test. We build our models on train, and
see how well we do on test.

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

train (instances 1-5):
1   2.7  5.5  Grasshopper
2   8.0  9.1  Katydid
3   0.9  4.7  Grasshopper
4   1.1  3.1  Grasshopper
5   5.4  8.5  Katydid

test (instances 6-10):
6   2.9  1.9  Grasshopper
7   6.1  6.6  Katydid
8   0.5  1.0  Grasshopper
9   8.3  6.6  Katydid
10  8.1  4.7  Katydid
41. You can do what you want on the training data:
try different weights, parameter settings,
algorithms, feature subsets, etc.

train (instances 1-5):
1   2.7  5.5  Grasshopper
2   8.0  9.1  Katydid
3   0.9  4.7  Grasshopper
4   1.1  3.1  Grasshopper
5   5.4  8.5  Katydid

When you are done, you test once on the testing
set, STOP, and report the results in your
paper. If you go back to the training data and
make a change, your results will be optimistically biased.

test (instances 6-10):
6   2.9  1.9  Grasshopper
7   6.1  6.6  Katydid
8   0.5  1.0  Grasshopper
9   8.3  6.6  Katydid
10  8.1  4.7  Katydid
42. Take-home lessons:
- Don't believe papers you read.
- Avoid fooling yourself.
- Convince the reviewers that you are not fooling yourself.