Title: Classification
1
Classification
  • We have seen two classification techniques:
  • the simple linear classifier and nearest neighbor.
  • Let us see two more techniques:
  • decision trees and Naïve Bayes.
  • There are other techniques (Neural Networks, Support
    Vector Machines) that we will not consider.

2
For any domain of interest, we can measure
features
Color: Green, Brown, Gray, Other
Has Wings?
Thorax Length
Abdomen Length
Antennae Length
Mandible Size
Spiracle Diameter
Leg Length
3
Feature Generation
  • Feature generation refers to any technique to
    make new features from existing features
  • Recall pigeon problem 2, and assume we are using
    the linear classifier

Pigeon Problem 2
[Figure: examples of class A, including (4,4), (5,5), (6,6),
and (3,3), and examples of class B, plotted on the two
features X and Y]
Using both features works poorly, using just X works
poorly, using just Y works poorly.
4
Feature Generation
  • Solution: create a new feature Z (a sketch in code
    follows after the figure below)
  • Z = absolute_value(X - Y)

[Figure: the examples re-plotted on the new Z-axis, 0 to 10]
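A minimal sketch of this feature-generation step in Python (only the class A points are taken from the slide; the class B points here are hypothetical):

# Feature generation: derive Z = |X - Y| from the two original features
class_a = [(4, 4), (5, 5), (6, 6), (3, 3)]   # class A examples from the slide
class_b = [(4, 7), (7, 3), (2, 5), (6, 1)]   # hypothetical class B examples

def z_feature(x, y):
    # the new feature Z = absolute_value(X - Y)
    return abs(x - y)

print([z_feature(x, y) for x, y in class_a])  # [0, 0, 0, 0]
print([z_feature(x, y) for x, y in class_b])  # [3, 4, 3, 5] -> separable on Z alone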
5
Recall this example? It was a teaching example to
show that NN could use any distance measure
ID Name Class
1 Gunopulos Greek
2 Papadopoulos Greek
3 Kollios Greek
4 Dardanos Greek
5 Keogh Irish
6 Gough Irish
7 Greenhaugh Irish
8 Hadleigh Irish
It would not really work very well, unless we had
LOTS more data
6
Irish Names
Japanese Names
ABERCROMBIE ABERNETHY ACKART
ACKERMAN ACKERS ACKLAND ACTON ADAIR
ADLAM ADOLPH AFFLECK ALVIN AMMADON
AIKO AIMI AINA AIRI AKANE AKEMI AKI
AKIKO AKIO AKIRA AMI AOI ARATA ASUKA
7
Z = number of vowels / word length
Vowels = I O U A E
Irish Names:
ABERCROMBIE 0.45, ABERNETHY 0.33, ACKART 0.33, ACKERMAN 0.375, ACKERS 0.33, ACKLAND 0.28, ACTON 0.33
Japanese Names:
AIKO 0.75, AIMI 0.75, AINA 0.75, AIRI 0.75, AKANE 0.6, AKEMI 0.6
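A minimal sketch of this derived feature:

def vowel_ratio(name):
    # Z = number of vowels / word length, with vowels = I, O, U, A, E
    vowels = set("AEIOU")
    return sum(ch in vowels for ch in name.upper()) / len(name)

print(round(vowel_ratio("ABERCROMBIE"), 2))  # 0.45
print(round(vowel_ratio("AIKO"), 2))         # 0.75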
8
I have a box of apples..
If Pr(X = good) = p, then Pr(X = bad) = 1 - p, and the
entropy of X is given by
H(X) = -p log2(p) - (1 - p) log2(1 - p)
[Plot of H(X) against p: the binary entropy function is 0
when all the apples are good or all are bad, and attains
its maximum value of 1 when p = 0.5]
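A minimal sketch of the binary entropy function:

import math

def binary_entropy(p):
    # H(X) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) taken to be 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 (maximum: half good, half bad)
print(binary_entropy(1.0))  # 0.0 (all good)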
9
Decision Tree Classifier
Ross Quinlan
[Figure: grasshoppers and katydids plotted by Antenna
Length against Abdomen Length, partitioned by the tree]
Abdomen Length > 7.1?
  yes: Katydid
  no: Antenna Length > 6.0?
    yes: Katydid
    no: Grasshopper
10
Decision trees predate computers.
[Figure: a classical insect identification key drawn as a
decision tree]
Antennae shorter than body?
  yes: Grasshopper
  no: 3 Tarsi?
    no: Katydids
    yes: Foretiba has ears?
      yes: Cricket
      no: Camel Cricket
11
Decision Tree Classification
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree: classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree

12
How do we construct the decision tree?
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they can be discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes.
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning (majority voting is employed for
    classifying the leaf)
  • There are no samples left

13
Information Gain as A Splitting Criteria
  • Select the attribute with the highest information
    gain (information gain is the expected reduction
    in entropy).
  • Assume there are two classes, P and N.
  • Let the set of examples S contain p elements of
    class P and n elements of class N.
  • The amount of information needed to decide if an
    arbitrary example in S belongs to P or N is
    defined as
    I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

0 log(0) is defined as 0. (A sketch in code follows below.)
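A minimal sketch of this measure in Python:

import math

def information(p, n):
    # I(p, n): entropy of a set with p examples of class P and n examples of class N
    def term(count):
        frac = count / (p + n)
        return 0.0 if frac == 0 else -frac * math.log2(frac)
    return term(p) + term(n)

print(round(information(4, 5), 4))  # 0.9911, as used in the worked example below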
14
Information Gain in Decision Tree Induction
  • Assume that using attribute A, a current set will
    be partitioned into some number of child sets
  • The encoding information that would be gained by
    branching on A is
    Gain(A) = I(p, n) - E(A), where E(A) = Σi ((pi + ni) / (p + n)) I(pi, ni)
    and pi, ni are the numbers of P and N examples in the
    i-th child set

Note: entropy is at its minimum (zero) if the collection
of objects is completely homogeneous (all one class)
15
Person Hair Length Weight Age Class
Homer 0 250 36 M
Marge 10 150 34 F
Bart 2 90 10 M
Lisa 6 78 8 F
Maggie 4 20 1 F
Abe 1 170 70 M
Selma 8 160 41 F
Otto 10 180 38 M
Krusty 6 200 45 M
Comic 8 290 38 ?
16
Let us try splitting on Hair Length.
Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Hair Length ≤ 5?
  yes: Entropy(1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
  no:  Entropy(3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain(Hair Length ≤ 5) = 0.9911 - (4/9 × 0.8113 + 5/9 × 0.9710) = 0.0911
17
Let us try splitting on Weight.
Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Weight ≤ 160?
  yes: Entropy(4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
  no:  Entropy(0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight ≤ 160) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900
18
Let us try splitting on Age.
Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Age ≤ 40?
  yes: Entropy(3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
  no:  Entropy(1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age ≤ 40) = 0.9911 - (6/9 × 1 + 3/9 × 0.9183) = 0.0183
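A minimal sketch that reproduces these three gains from the table of nine people shown earlier:

import math

# (hair length, weight, age, sex) for the nine labelled people
people = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
          (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
          (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]

def entropy(labels):
    total = len(labels)
    return sum(-labels.count(c) / total * math.log2(labels.count(c) / total)
               for c in set(labels))

def gain(feature_index, threshold):
    # information gain of the split "feature <= threshold"
    yes = [p[3] for p in people if p[feature_index] <= threshold]
    no = [p[3] for p in people if p[feature_index] > threshold]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(people)
    return entropy([p[3] for p in people]) - weighted

print(f"Gain(Hair Length <= 5) = {gain(0, 5):.4f}")   # 0.0911
print(f"Gain(Weight <= 160)    = {gain(1, 160):.4f}") # 0.5900
print(f"Gain(Age <= 40)        = {gain(2, 40):.4f}")  # 0.0183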
19
Of the three features we had, Weight was best. But while
people who weigh over 160 are perfectly classified (as
males), the under-160 people are not perfectly classified,
so we simply recurse on that subset!
Weight ≤ 160?
  no: Male
  yes: (recurse)
This time we find that we can split on Hair Length ≤ 2,
and we are done!
20
We don't need to keep the data around, just the test
conditions. How would these people be classified?
Weight ≤ 160?
  no: Male
  yes: Hair Length ≤ 2?
    yes: Male
    no: Female
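A sketch of the same construction with a library implementation; this assumes scikit-learn is available, and its chosen thresholds may differ slightly from the hand-worked ones:

from sklearn.tree import DecisionTreeClassifier, export_text

# [hair length, weight, age] for the nine labelled people, and their sexes
X = [[0, 250, 36], [10, 150, 34], [2, 90, 10], [6, 78, 8], [4, 20, 1],
     [1, 170, 70], [8, 160, 41], [10, 180, 38], [6, 200, 45]]
y = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["hair", "weight", "age"]))

# Classify the unlabelled person from the table (hair 8, weight 290, age 38)
print(tree.predict([[8, 290, 38]]))  # ['M']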
21
It is trivial to convert Decision Trees to rules.
Weight ≤ 160?
  no: Male
  yes: Hair Length ≤ 2?
    yes: Male
    no: Female

Rules to Classify Males/Females:
If Weight greater than 160, classify as Male;
Elseif Hair Length less than or equal to 2, classify as Male;
Else classify as Female.
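The same rules as a minimal sketch in code:

def classify(hair_length, weight):
    # rule form of the learned decision tree
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(hair_length=8, weight=290))  # Male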
22
Once we have learned the decision tree, we don't even
need a computer!
This decision tree is attached to a medical
machine, and is designed to help nurses make
decisions about what type of doctor to call.
Decision tree for a typical shared-care setting
applying the system for the diagnosis of
prostatic obstructions.
23
PSA = serum prostate-specific antigen level; PSAD = PSA
density; TRUS = transrectal ultrasound.
Garzotto M et al. JCO 2005; 23: 4322-4329
24
The worked examples we have seen were performed on small
datasets. However, with small datasets there is a great
danger of overfitting the data. When you have few
datapoints, there are many possible splitting rules that
perfectly classify the data but will not generalize to
future datasets.
[Figure: a tree that splits on 'Wears green?' and assigns
Male to one branch and Female to the other]
For example, the rule 'Wears green?' perfectly classifies
the data, but so does 'Mother's name is Jacqueline?', and
so does 'Has blue shoes'.
25
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some of which may reflect
    anomalies due to noise or outliers
  • The result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not
    split a node if this would result in the goodness
    measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown
    tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training
    data to decide which is the best pruned tree

26
Which of the Pigeon Problems can be solved by a
Decision Tree?
  1. Deep Bushy Tree
  2. Useless
  3. Deep Bushy Tree

?
The Decision Tree has a hard time with correlated
attributes
27
Advantages/Disadvantages of Decision Trees
  • Advantages
  • Easy to understand (Doctors love them!)
  • Easy to generate rules
  • Disadvantages
  • May suffer from overfitting.
  • Classifies by rectangular partitioning (so does
    not handle correlated features very well).
  • Can be quite large; pruning is necessary.
  • Does not handle streaming data easily

28
There now exist perhaps tens of millions of digitized
pages of historical manuscripts, dating back to the 12th
century, that feature one or more heraldic shields.
The images are often stained, faded, or torn.
29
Wouldn't it be great if we could automatically hyperlink
all similar shields to each other?
For example, here we could link two occurrences of the
Von Sax family shield. To do this, we need to consider
shape, color, and texture. Let's just consider shape for
now.
Manesse Codex: an illuminated manuscript in codex form,
copied and illustrated between 1304 and 1340 in Zurich.
30
Using the entire shape is not a good idea, because the
shields can have flourishes or tears.
Decision Tree for Shields
[Figure: training data (subset) of Spanish, Polish, and
French shield outlines, with a flourish and a tear
highlighted]
An NSF funded project (IIS 0803410) is attempting to solve
this by using parts of the shapes, called shapelets.
Shapelets allow you to build decision trees for shapes.
Ye and Keogh (2009). Time Series Shapelets: A New
Primitive for Data Mining. SIGKDD 2009.
32
Naïve Bayes Classifier
Thomas Bayes 1702 - 1761
We will start off with a visual intuition, before
looking at the math
33
Remember this example? Let's get lots more data.
[Figure: grasshoppers and katydids plotted by Antenna
Length against Abdomen Length]
34
With a lot of data, we can build a histogram. Let us just
build one for the Antenna Length feature for now.
[Histogram of Antenna Length for each class]
35
We can leave the histograms as they are, or we can
summarize them with two normal distributions. Let us use
two normal distributions, for ease of visualization, in
the following slides.
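A minimal sketch of summarizing each class's antenna lengths with a normal distribution (the samples here are hypothetical stand-ins for the histograms on the slide):

import statistics
from math import exp, pi, sqrt

grasshopper_lengths = [2.1, 2.9, 3.4, 4.0, 4.8, 5.5]  # hypothetical samples
katydid_lengths = [5.2, 6.1, 6.8, 7.4, 8.0, 8.9]      # hypothetical samples

def fit_normal(samples):
    # summarize a histogram with a normal distribution (mean, standard deviation)
    return statistics.mean(samples), statistics.stdev(samples)

def normal_pdf(x, mean, std):
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

g_mean, g_std = fit_normal(grasshopper_lengths)
k_mean, k_std = fit_normal(katydid_lengths)
# an antenna length of 3 is more probable under the grasshopper distribution
print(normal_pdf(3, g_mean, g_std) > normal_pdf(3, k_mean, k_std))  # True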
36
  • We want to classify an insect we have found. Its
    antennae are 3 units long. How can we classify it?
  • We can just ask ourselves: given the distributions
    of antennae lengths we have seen, is it more
    probable that our insect is a Grasshopper or a
    Katydid?
  • There is a formal way to discuss the most
    probable classification

p(cj | d) = probability of class cj, given that we have
observed d
[Figure: the two class distributions, with the observed
antennae length of 3 marked]
37
p(cj | d) = probability of class cj, given that we have
observed d
P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.166
[Figure: at antennae length 3, the grasshopper curve has
height 10 and the katydid curve has height 2]
38
p(cj | d) = probability of class cj, given that we have
observed d
P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750
[Figure: at antennae length 7, the grasshopper curve has
height 3 and the katydid curve has height 9]
39
p(cj | d) = probability of class cj, given that we have
observed d
P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
[Figure: at antennae length 5, both curves have height 6]
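A minimal sketch of these three calculations, using the curve heights read off the figures:

def posteriors(grasshopper_height, katydid_height):
    # P(class | antennae length) from the heights of the two class curves at that length
    total = grasshopper_height + katydid_height
    return grasshopper_height / total, katydid_height / total

for length, heights in {3: (10, 2), 7: (3, 9), 5: (6, 6)}.items():
    g, k = posteriors(*heights)
    print(f"length {length}: P(Grasshopper) = {g:.3f}, P(Katydid) = {k:.3f}")
# length 3: 0.833 / 0.167,  length 7: 0.250 / 0.750,  length 5: 0.500 / 0.500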
40
Bayes Classifiers
  • That was a visual intuition for a simple case of
    the Bayes classifier, also called
  • Idiot Bayes
  • Naïve Bayes
  • Simple Bayes
  • We are about to see some of the mathematical
    formalisms, and more examples, but keep in mind
    the basic idea.
  • Find out the probability of the previously unseen
    instance belonging to each class, then simply
    pick the most probable class.

41
Bayes Classifiers
  • Bayesian classifiers use Bayes theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
  • p(cj | d) = probability of instance d being in
    class cj
  • This is what we are trying to compute
  • p(d | cj) = probability of generating instance d
    given class cj
  • We can imagine that being in class cj causes
    you to have feature d with some probability
  • p(cj) = probability of occurrence of class cj
  • This is just how frequent the class cj is in
    our database
  • p(d) = probability of instance d occurring
  • This can actually be ignored, since it is
    the same for all classes

42
  • Assume that we have two classes:
  • c1 = male, and c2 = female.
  • We have a person whose sex we do not know, say
    'drew' or d.
  • Classifying drew as male or female is
    equivalent to asking: is it more probable that
    drew is male or female, i.e., which is greater,
    p(male | drew) or p(female | drew)?

(Note: Drew can be a male or female name)
Drew Barrymore
Drew Carey
p(male | drew) = p(drew | male) p(male) / p(drew)
where p(drew | male) is the probability of being called
'drew' given that you are a male, p(male) is the
probability of being a male, and p(drew) is the
probability of being named 'drew' (actually irrelevant,
since it is the same for all classes).
43
This is Officer Drew (who arrested me in 1997).
Is Officer Drew a Male or Female?
Luckily, we have a small database with names and
sex. We can use it to apply Bayes rule
Name Sex
Drew Male
Claudia Female
Drew Female
Drew Female
Alberto Male
Karin Female
Nina Female
Sergio Male
Officer Drew
p(cj | d) = p(d | cj) p(cj) / p(d)
44
Name Sex
Drew Male
Claudia Female
Drew Female
Drew Female
Alberto Male
Karin Female
Nina Female
Sergio Male
p(cj | d) = p(d | cj) p(cj) / p(d)
Officer Drew
p(male | drew) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)
Officer Drew is more likely to be a Female.
45
Officer Drew IS a female!
p(male | drew) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)
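A minimal sketch of this calculation from the name/sex table above:

data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"), ("Drew", "Female"),
        ("Alberto", "Male"), ("Karin", "Female"), ("Nina", "Female"), ("Sergio", "Male")]

def numerator(name, sex):
    # p(name | sex) * p(sex); the shared denominator p(name) can be ignored
    same_sex = [n for n, s in data if s == sex]
    p_name_given_sex = same_sex.count(name) / len(same_sex)
    p_sex = len(same_sex) / len(data)
    return p_name_given_sex * p_sex

print(numerator("Drew", "Male"))    # 1/3 * 3/8 = 0.125
print(numerator("Drew", "Female"))  # 2/5 * 5/8 = 0.25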
46
So far we have only considered Bayes
Classification when we have one attribute (the
antennae length, or the name). But we may
have many features. How do we use all the
features?
p(cj | d) = p(d | cj) p(cj) / p(d)
Name Over 170CM Eye Hair length Sex
Drew No Blue Short Male
Claudia Yes Brown Long Female
Drew No Blue Long Female
Drew No Blue Long Female
Alberto Yes Brown Short Male
Karin No Blue Long Female
Nina Yes Brown Short Female
Sergio Yes Blue Long Male
47
  • To simplify the task, naïve Bayesian classifiers
    assume attributes have independent distributions,
    and thereby estimate
  • p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

The probability of class cj generating instance d equals...
the probability of class cj generating the observed value
for feature 1, multiplied by...
the probability of class cj generating the observed value
for feature 2, multiplied by... (and so on)
48
  • To simplify the task, naïve Bayesian classifiers
    assume attributes have independent distributions,
    and thereby estimate
  • p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

Officer Drew is blue-eyed, over 170cm tall, and has long
hair.
p(officer drew | cj) = p(over_170cm = yes | cj) ×
p(eye = blue | cj) × ...
p(officer drew | Female) = 2/5 × 3/5 × ...
p(officer drew | Male) = 2/3 × 2/3 × ...
(a sketch in code follows below)
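A minimal sketch of the full naïve Bayes calculation on the eight-person table above, using all three features:

# (name, over_170cm, eye, hair_length, sex) rows from the table above
data = [("Drew", "No", "Blue", "Short", "Male"),
        ("Claudia", "Yes", "Brown", "Long", "Female"),
        ("Drew", "No", "Blue", "Long", "Female"),
        ("Drew", "No", "Blue", "Long", "Female"),
        ("Alberto", "Yes", "Brown", "Short", "Male"),
        ("Karin", "No", "Blue", "Long", "Female"),
        ("Nina", "Yes", "Brown", "Short", "Female"),
        ("Sergio", "Yes", "Blue", "Long", "Male")]

def naive_bayes_score(features, sex):
    # p(d1|cj) * p(d2|cj) * ... * p(cj), ignoring the shared p(d)
    rows = [r for r in data if r[4] == sex]
    score = len(rows) / len(data)                  # the prior p(cj)
    for i, value in enumerate(features, start=1):  # columns 1..3 hold the features
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

officer_drew = ("Yes", "Blue", "Long")  # over 170cm, blue-eyed, long hair
print(round(naive_bayes_score(officer_drew, "Female"), 4))  # 2/5 * 3/5 * 4/5 * 5/8 = 0.12
print(round(naive_bayes_score(officer_drew, "Male"), 4))    # 2/3 * 2/3 * 1/3 * 3/8 ≈ 0.0556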
49
The Naïve Bayes classifier is often represented as this
type of graph. Note the direction of the arrows, which
state that each class causes certain features, with a
certain probability.
[Graph: a root node cj with arrows down to nodes
p(d1 | cj), p(d2 | cj), ..., p(dn | cj)]
50
Naïve Bayes is fast and space efficient. We can look up
all the probabilities with a single scan of the database
and store them in (small) tables, one per feature (see the
sketch after the tables below).
[Graph: the class node cj with arrows to p(d1 | cj),
p(d2 | cj), ..., p(dn | cj)]
Sex Over190cm
Male Yes 0.15
Male No 0.85
Female Yes 0.01
Female No 0.99
Sex Long Hair
Male Yes 0.05
Male No 0.95
Female Yes 0.70
Female No 0.30
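A minimal sketch of building such lookup tables with a single scan over a (hypothetical) database:

from collections import defaultdict

# hypothetical database rows: (sex, over_190cm, long_hair)
database = [("Male", "Yes", "No"), ("Male", "No", "No"),
            ("Female", "No", "Yes"), ("Female", "No", "Yes")]

class_counts = defaultdict(int)
feature_counts = defaultdict(int)
for sex, over_190, long_hair in database:          # the single scan
    class_counts[sex] += 1
    feature_counts[(sex, "over_190cm", over_190)] += 1
    feature_counts[(sex, "long_hair", long_hair)] += 1

# the (small) probability tables, e.g. p(long_hair = Yes | Female)
p = {key: count / class_counts[key[0]] for key, count in feature_counts.items()}
print(p[("Female", "long_hair", "Yes")])  # 1.0 in this tiny hypothetical database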
51
Naïve Bayes is NOT sensitive to irrelevant features...
Suppose we are trying to classify a person's sex based on
several features, including eye color. (Of course, eye
color is completely irrelevant to a person's gender.)
p(Jessica | cj) = p(eye = brown | cj) ×
p(wears_dress = yes | cj) × ...
p(Jessica | Female) = 9,000/10,000 × 9,975/10,000 × ...
p(Jessica | Male) = 9,001/10,000 × 2/10,000 × ...
The eye-color factors are almost the same for both
classes, so they barely affect the result!
However, this assumes that we have good enough estimates
of the probabilities, so the more data the better.
52
An obvious point: I have used a simple two-class problem,
and two possible values for each feature, in my previous
examples. However, we can have an arbitrary number of
classes, or of feature values.
[Graph: the class node cj with arrows to p(d1 | cj),
p(d2 | cj), ..., p(dn | cj)]
Animal Mass >10kg
Cat Yes 0.15
Cat No 0.85
Dog Yes 0.91
Dog No 0.09
Pig Yes 0.99
Pig No 0.01
Animal Color
Cat Black 0.33
Cat White 0.23
Cat Brown 0.44
Dog Black 0.97
Dog White 0.03
Dog Brown 0.90
Pig Black 0.04
Pig White 0.01
Pig Brown 0.95
53
Naïve Bayesian Classifier
Problem! Naïve Bayes assumes independence of features:
p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)
But some features are clearly not independent, for
example the two below:
Sex Over 6 foot
Male Yes 0.15
Male No 0.85
Female Yes 0.01
Female No 0.99
Sex Over 200 pounds
Male Yes 0.11
Male No 0.80
Female Yes 0.05
Female No 0.95
54
Naïve Bayesian Classifier
Solution: consider the relationships between attributes:
p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

Sex Over 6 foot
Male Yes 0.15
Male No 0.85
Female Yes 0.01
Female No 0.99
Sex Over 200 pounds
Male Yes and Over 6 foot 0.11
Male No and Over 6 foot 0.59
Male Yes and NOT Over 6 foot 0.05
Male No and NOT Over 6 foot 0.35
Female Yes and Over 6 foot 0.01
55
Naïve Bayesian Classifier
Solution: consider the relationships between attributes:
p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

But how do we find the set of connecting arcs??
56
The Naïve Bayesian Classifier has a piecewise
quadratic decision boundary
[Figure: decision regions for Grasshoppers, Katydids, and
Ants]
Adapted from slide by Ricardo Gutierrez-Osuna
57
Which of the Pigeon Problems can be solved by a
decision tree?
58
Dear SIR, I am Mr. John Coleman and my sister is
Miss Rose Colemen, we are the children of late
Chief Paul Colemen from Sierra Leone. I am
writing you in absolute confidence primarily to
seek your assistance to transfer our cash of
twenty one Million Dollars (21,000.000.00) now
in the custody of a private Security trust firm
in Europe the money is in trunk boxes deposited
and declared as family valuables by my late
father as a matter of fact the company does not
know the content as money, although my father
made them to under stand that the boxes belongs
to his foreign partner.
59
This mail is probably spam. The original message has been
attached along with this report, so you can recognize or
block similar unwanted mail in future. See
http://spamassassin.org/tag/ for more details.
Content analysis details (12.20 points, 5 required):
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase (NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]
60
Advantages/Disadvantages of Naïve Bayes
  • Advantages
  • Fast to train (single scan). Fast to classify
  • Not sensitive to irrelevant features
  • Handles real and discrete data
  • Handles streaming data well
  • Disadvantages
  • Assumes independence of features