Transcript and Presenter's Notes

Title: Grasshoppers


1
The Classification Problem (informal definition)
Given a collection of annotated data (in this case, five instances
of Katydids and five of Grasshoppers), decide what type of insect
the unlabeled example is.
(Figure: five labeled Katydids, five labeled Grasshoppers, and one
unlabeled insect.)
Katydid or Grasshopper?
2
Simple Linear Classifier
R. A. Fisher (1890-1962)
If the previously unseen instance is above the line, then the class
is Katydid; else the class is Grasshopper.
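
As a minimal sketch of this decision rule (not Fisher's actual
fitting procedure), suppose a separating line w . x + b = 0 has
already been found in the (antenna length, abdomen length) plane;
the coefficients below are purely illustrative:

```python
# Hypothetical separating line w . x + b = 0; the coefficients are
# illustrative, not a fitted model.
w = (1.0, -0.8)
b = -0.5

def classify(antenna, abdomen):
    score = w[0] * antenna + w[1] * abdomen + b
    return "Katydid" if score > 0 else "Grasshopper"

print(classify(6.6, 6.1))  # -> Katydid
```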
3
Nearest Neighbor Classifier
Evelyn Fix (1904-1965), Joe Hodges (1922-2000)
If the nearest instance to the previously unseen instance is a
Katydid, the class is Katydid; else the class is Grasshopper.
(Axes: Antenna Length vs. Abdomen Length)
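
A minimal 1-NN sketch, reusing the five training instances from the
hold-out table that appears later in the deck:

```python
import math

# Labeled instances as ((abdomen, antenna), class) pairs, taken
# from the hold-out table later in the deck.
train = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
         ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper"),
         ((5.4, 8.5), "Katydid")]

def nearest_neighbor(x):
    # Label of the closest instance under Euclidean distance.
    return min(train, key=lambda t: math.dist(x, t[0]))[1]

print(nearest_neighbor((6.1, 6.6)))  # -> Katydid
```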
4
Up to now we have assumed that the nearest neighbor algorithm uses
the Euclidean distance; however, this need not be the case.
  • Max (p = ∞)
  • Manhattan (p = 1)
  • Weighted Euclidean
  • Mahalanobis
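
A sketch of these alternatives (Mahalanobis, which additionally
rescales by the inverse covariance matrix of the data, is omitted
for brevity):

```python
import math

def minkowski(x, y, p):
    # L_p distance: p = 1 is Manhattan, p = 2 is Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    # The p -> infinity limit: the largest per-feature difference.
    return max(abs(a - b) for a, b in zip(x, y))

def weighted_euclidean(x, y, w):
    # Each squared feature difference is scaled by its own weight.
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

x, y = (2.7, 5.5), (8.0, 9.1)
print(minkowski(x, y, 1))                # ~8.9  (Manhattan)
print(minkowski(x, y, 2))                # ~6.41 (Euclidean)
print(chebyshev(x, y))                   # 5.3   (Max)
print(weighted_euclidean(x, y, (2, 1)))  # ~8.32
```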
5
Suppose that you have two features, GPA and GRE, and you think that
GRE is twice as important as GPA. You can use the weighted
Euclidean distance.
Feature 1 is GPA, Feature 2 is GRE; weight vector W = (1, 2).
Weighted Euclidean
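
A tiny worked example under that weighting; the applicant scores
below are made up:

```python
import math

# Made-up (GPA, GRE) vectors for two applicants; GRE gets twice
# the weight of GPA, so W = (1, 2).
W = (1, 2)
a = (3.2, 310.0)
b = (3.9, 305.0)

d = math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(W, a, b)))
print(d)  # sqrt(1 * 0.7**2 + 2 * 5.0**2) ~= 7.11
```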
6
Hold Out Data
  • How do we estimate the accuracy of our
    classifier?
  • We can use Hold Out data

We divide the dataset into 2 partitions, called
train and test. We build our models on train, and
see how well we do on test.
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

train
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid

test
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
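
A hold-out evaluation sketch using the split above, with 1-NN
standing in for the model:

```python
import math

# Data mirrors the split above: build the model on 'train',
# measure accuracy once on 'test'.
train = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
         ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper"),
         ((5.4, 8.5), "Katydid")]
test = [((2.9, 1.9), "Grasshopper"), ((6.1, 6.6), "Katydid"),
        ((0.5, 1.0), "Grasshopper"), ((8.3, 6.6), "Katydid"),
        ((8.1, 4.7), "Katydid")]

def predict(x):
    # 1-NN: label of the closest training instance.
    return min(train, key=lambda t: math.dist(x, t[0]))[1]

accuracy = sum(predict(x) == label for x, label in test) / len(test)
print(accuracy)  # 1.0 on this toy split
```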
7
Cross Validation
  • How do we estimate the accuracy of our
    classifier?
  • We can use K-fold cross validation

We divide the dataset into K equal-sized sections. The algorithm is
tested K times, each time leaving out one of the K sections when
building the classifier, but using it to test the classifier
instead.

Accuracy = (number of correct classifications) /
           (number of instances in our database)

K = 5
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
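
A K-fold cross-validation sketch with K = 5, again with 1-NN
standing in for the classifier:

```python
import math

def k_fold_accuracy(data, k):
    # Round-robin split into k folds; each fold is held out once
    # while a 1-NN classifier is built from the remaining folds.
    folds = [data[i::k] for i in range(k)]
    correct = 0
    for i, fold in enumerate(folds):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        for x, label in fold:
            pred = min(train, key=lambda t: math.dist(x, t[0]))[1]
            correct += pred == label
    return correct / len(data)  # correct classifications / instances

data = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
        ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper"),
        ((5.4, 8.5), "Katydid"), ((2.9, 1.9), "Grasshopper"),
        ((6.1, 6.6), "Katydid"), ((0.5, 1.0), "Grasshopper"),
        ((8.3, 6.6), "Katydid"), ((8.1, 4.7), "Katydid")]
print(k_fold_accuracy(data, 5))
```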
8
Setting parameters and overfitting
  • You need to classify widgets; you get a training set...
  • You could use a Linear Classifier or Nearest Neighbor.
  • Nearest Neighbor:
    • You could use 1NN, 3NN, 5NN...
    • You could use Euclidean distance, L1, L∞, Mahalanobis...
    • You could do some data editing...
    • You could do some feature weighting...
    • You could ...
  • Linear Classifier:
    • You could use a Constant classifier.
    • You could use a Linear classifier.
    • You could use a Quadratic classifier.
    • You could ...

Model Selection
Parameter Selection (or parameter tuning, tweaking)
9
(Repeat of the previous slide.)

10
Overfitting
Overfitting occurs when a statistical
model describes random error or noise instead of
the underlying relationship. Overfitting
generally occurs when a model is excessively
complex, such as having too many parameters
relative to the number of observations. A model
which has been overfit will generally have
poor predictive performance, as it can exaggerate
minor fluctuations in the data.
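
A small sketch of the phenomenon, assuming numpy is available: the
underlying relationship is linear, and a degree-9 polynomial fit
chases the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# The underlying relationship is linear; observations carry noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test                    # noise-free ground truth

for degree in (1, 9):
    # Degree 9 interpolates all 10 noisy points (polyfit may warn
    # about conditioning, itself a symptom of excess complexity).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 4), round(test_mse, 4))
```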
11
  • Suppose we need to solve a classification problem.
  • We are not sure if we should use the...
  • Simple linear classifier
  • or the
  • Simple quadratic classifier
  • How do we decide which to use?

We do cross validation or leave-one-out and choose the best one.
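
The selection mechanics are model-agnostic; here is a leave-one-out
sketch with 1NN and 3NN standing in for the linear and quadratic
classifiers:

```python
import math

def knn(train, x, k):
    # Majority vote among the k nearest training instances.
    neighbors = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

def loo_accuracy(data, k):
    # Leave-one-out: hold each instance out, train on the rest.
    correct = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        correct += knn(rest, x, k) == label
    return correct / len(data)

data = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
        ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper"),
        ((5.4, 8.5), "Katydid"), ((2.9, 1.9), "Grasshopper"),
        ((6.1, 6.6), "Katydid"), ((0.5, 1.0), "Grasshopper"),
        ((8.3, 6.6), "Katydid"), ((8.1, 4.7), "Katydid")]

scores = {k: loo_accuracy(data, k) for k in (1, 3)}
print(scores, "-> choose k =", max(scores, key=scores.get))
```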
12
  • Simple linear classifier gets 81% accuracy.
  • Simple quadratic classifier gets 99% accuracy.

13
  • Simple linear classifier gets 96% accuracy.
  • Simple quadratic classifier gets 97% accuracy.

14
  • This problem is greatly exacerbated by having too little data.
  • Simple linear classifier gets 90% accuracy.
  • Simple quadratic classifier gets 95% accuracy.

15
What happens as we have more and more training examples? The
accuracy for all models goes up! The chance of making a mistake
(choosing the wrong model) goes down. Even if we make a mistake, it
will not matter too much, because we would learn a degenerate
quadratic that is basically a straight line.
  • Simple linear: 70% accuracy
  • Simple quadratic: 90% accuracy
  • Simple linear: 90% accuracy
  • Simple quadratic: 95% accuracy
  • Simple linear: 99.999999% accuracy
  • Simple quadratic: 99.999999% accuracy

16
One Solution: Charge a Penalty for Complex Models
  • For example, for the simple polynomial classifier, we could
    charge 1% for every increase in the degree of the polynomial.
  • Simple linear classifier gets 90.5% accuracy, minus 0, equals
    90.5.
  • Simple quadratic classifier gets 97.0% accuracy, minus 1, equals
    96.0.
  • Simple cubic classifier gets 97.05% accuracy, minus 2, equals
    95.05.

(Three scatter plots on 0-10 axes: linear fit, accuracy 90.5%;
quadratic fit, accuracy 97.0%; cubic fit, accuracy 97.05%.)
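
A sketch of this penalty rule, using the accuracies quoted above:

```python
# (degree, accuracy%) pairs quoted on this slide; charge one point
# of accuracy per degree above linear, then keep the best net score.
candidates = {"linear": (1, 90.5), "quadratic": (2, 97.0),
              "cubic": (3, 97.05)}

def net_score(name):
    degree, accuracy = candidates[name]
    return accuracy - (degree - 1)

for name in candidates:
    print(name, net_score(name))       # 90.5, 96.0, 95.05
print("chosen:", max(candidates, key=net_score))  # quadratic
```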
17
One Solution: Charge a Penalty for Complex Models
  • For example, for the simple polynomial classifier, we could
    charge 1% for every increase in the degree of the polynomial.
  • There are more principled ways to charge penalties.
  • In particular, there is a technique called Minimum Description
    Length (MDL).

18
Suppose you have a four-feature problem, and you want to search
over feature subsets. It happens to be the case that features 2 and
3, shown here, are all you need, and the other features are random.
19
(Repeat of the previous slide, now with a plot of features 2 and 3
on 0-4 axes.)
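
An exhaustive subset-search sketch on synthetic data mirroring this
setup (features 1 and 2, 0-indexed, carry the signal; the others
are noise):

```python
import math, random
from itertools import combinations

random.seed(0)

# Synthetic stand-in for the four-feature problem: features 1 and 2
# (0-indexed) carry the class signal, features 0 and 3 are random.
def make_instance(label):
    signal = 1.0 if label == "A" else 3.0
    return ((random.uniform(0, 4),
             signal + random.random(),
             signal + random.random(),
             random.uniform(0, 4)), label)

data = [make_instance(lab) for lab in "AAAAAAAAAABBBBBBBBBB"]

def loo_accuracy(subset):
    # Leave-one-out accuracy of 1-NN restricted to 'subset'.
    proj = lambda v: [v[f] for f in subset]
    correct = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        pred = min(rest, key=lambda t: math.dist(proj(x), proj(t[0])))[1]
        correct += pred == label
    return correct / len(data)

# Four features -> only 15 non-empty subsets; search exhaustively.
subsets = [s for r in range(1, 5) for s in combinations(range(4), r)]
best = max(subsets, key=loo_accuracy)
print(best, loo_accuracy(best))
```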
20
(No Transcript)
21
Which is better, Euclidean Distance or DTW?
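
For reference, a standard dynamic-programming DTW sketch alongside
the Euclidean distance; the series values are illustrative:

```python
import math

def dtw(a, b):
    # Classic O(n*m) dynamic-programming DTW with squared costs.
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],        # shift in a
                                 D[i][j - 1],        # shift in b
                                 D[i - 1][j - 1])    # match both
    return math.sqrt(D[n][m])

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = [1.0, 1.2, 1.2, 1.8, 2.4, 3.1]  # illustrative series
c = [1.2, 1.3, 1.5, 2.0, 2.5, 2.9]
print("Euclidean:", round(euclidean(q, c), 3))
print("DTW:      ", round(dtw(q, c), 3))
```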
22
Name      Classes  Instances  Euclidean Error (%)  DTW Error (%)
Chicken   5        446        19.96                19.96
MixedBag  9        160        4.375                4.375

Leave-one-out error rate on two datasets.
23
Name            Classes  Instances  Euclidean Error (%)  DTW Error (%)
Face            16       2240       3.839                3.170
Swedish Leaves  15       1125       13.33                10.84
Chicken         5        446        19.96                19.96
MixedBag        9        160        4.375                4.375
OSU Leaves      6        442        33.71                15.61
Diatoms         37       781        27.53                27.53
Plane           7        210        0.95                 0.0
Fish            7        350        11.43                9.71
24
(No Transcript)
25
Name        Classes  Instances  Euclidean Error (%)  DTW Error (%)
OSU Leaves  6        442        33.71                15.61

(Figure: the two error rates, 33.71% Euclidean and 15.61% DTW,
called out.)
26
(Plot of DTW Error vs. Euclidean Error, one point per dataset: in
the region above the diagonal Euclidean is better; below it, DTW is
better.)
27
(Same plot, highlighting the region where DTW is better.)
28
A paper claims ERP performs the best (over DTW).
But the datasets are small, and there are only three of them (one
is synthetic). So let us do better tests.
Lei Chen, Raymond T. Ng: On the Marriage of Lp-norms and Edit
Distance. VLDB 2004: 792-803
29
"ERP performs the best" (over DTW)
(Plot: in this region ERP is better.)
30
Suppose the claim was "ERP performs the best (over DTW) when the
data is periodic (or X)". That is fine, but you must state what X
is ahead of time.
(Plot: in this region ERP is better.)
31
"Our approach, TQuEST, significantly outperforms the only
competitor (DTW)."
  1. Johannes Aßfalg, Thomas Bernecker, Hans-Peter Kriegel, Peer
     Kröger, Matthias Renz: Periodic Pattern Analysis in Time
     Series Databases. DASFAA 2009: 354-368
  2. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: T-Time: Threshold-Based
     Data Mining on Time Series. ICDE 2008: 1620-1623
  3. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Similarity Search in
     Multimedia Time Series Data Using Amplitude-Level Features.
     MMM 2008: 123-133
  4. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Interval-Focused
     Similarity Search in Time Series Databases. DASFAA 2007:
     586-597
  5. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Semi-Supervised
     Threshold Queries on Pharmacogenomics Time Sequences. APBC
     2006: 307-316
  6. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: TQuEST: Threshold
     Query Execution for Large Sets of Time Series. EDBT 2006:
     1147-1150
  7. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Similarity Search on
     Time Series Based on Threshold Queries. EDBT 2006: 276-294
  8. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Threshold Similarity
     Queries in Large Time Series Databases. ICDE 2006: 149
  9. Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter
     Kunath, Alexey Pryakhin, Matthias Renz: Time Series Analysis
     Using the Concept of Adaptable Threshold Similarity. SSDBM
     2006: 251-260

32
"Our approach, TQuEST, significantly outperforms the only
competitor (DTW)."
(Plot: in this region TQuEST is better.)
33
How come there are so many claims in the literature that are simply
not true?
34
Two time series, c and q:
c = (1.2, 1.3, 1.5, ..., 2.9)
q = (1.0, 1.2, 1.2, ..., 3.1)
35
Two time series, c and q, and a weight vector w:
c = (1.2, 1.3, 1.5, ..., 2.9)
q = (1.0, 1.2, 1.2, ..., 3.1)
w = (2, 2, 2, ..., 1, 1, 1, ..., 2, 2, 2)
Weighted Euclidean Distance
36
How do we set the weights?
For every dataset, we come up with a set of weights (somehow), test
them with leave-one-out, and report the best results.
37
I have done this... Weighted Euclidean DOES work a lot better.
38
How does ANA work?
We downloaded the mitochondrial DNA of a monkey, Macaca mulatta. We
converted the DNA to a string of integers, with A (Adenine) = 0,
C (Cytosine) = 1, G (Guanine) = 2, T (Thymine) = 3. So the DNA
string GATCA... becomes 2, 0, 3, 1, 0, ...
Given that we have a string of 16,564 integers, we can use the
integers starting at position K as the weights of the weighted
Euclidean distance. If K = 1 is not good, we try K = 2, then K = 3,
and so on. So ANA is nothing more than the weighted Euclidean
distance, weighted by monkey DNA.
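
A sketch of the trick, with a short made-up sequence standing in
for the actual 16,564-base genome:

```python
import math

# 'ANA' sketch: weights for a weighted Euclidean distance are read
# straight out of an integer-encoded DNA string, starting at offset
# K. The sequence below is a made-up stand-in, not the real Macaca
# mulatta mitochondrial genome.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
dna = "GATCAGGTTACCGATTGCACTAGGCATCC"  # placeholder sequence
pool = [ENCODE[base] for base in dna]  # GATCA... -> 2, 0, 3, 1, 0

def ana_distance(x, y, k):
    w = pool[k:k + len(x)]  # weights from the DNA at offset k
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

q = [1.0, 1.2, 1.2, 1.8, 3.1]  # illustrative series
c = [1.2, 1.3, 1.5, 2.0, 2.9]
for k in (1, 2, 3):  # "if K = 1 is not good, try K = 2, then K = 3"
    print(k, round(ana_distance(q, c, k), 3))
```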
39
Researchers are adjusting the parameters after
seeing the results on the test set.
40
Hold Out Data
(Repeat of the earlier Hold Out Data slide: the dataset is split
into a train partition, instances 1-5, and a test partition,
instances 6-10.)
41
You can do what you want on the training data: try different
weights, parameter settings, algorithms, feature subsets, etc.

train
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid

When you are done, you test once on the testing set, STOP, and
report the results in your paper. If you go back to the training
data and make a change, your results will be optimistic.

test
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
42
Take home lessons:
  • Don't believe papers you read.
  • Avoid fooling yourself.
  • Convince the reviewers that you are not fooling yourself.