Title: Data Mining (and machine learning)
- A few important things in brief
- top10dm, neural networks, overfitting, SVM
http://is.gd/top10dm
1. C4.5
2. k-means
3. SVMs
4. Apriori
5. EM algorithm
6. PageRank
7. Adaboost
8. k-NN
9. Naive Bayes
10. CART
The 10 most mentioned in Data Mining academic literature up to 2007, not including Machine Learning literature.
Of these: C4.5 and CART are decision tree methods; SVMs are a machine-learning decision boundary method; EM is heavily mathematical, a generalised version of k-means; and some, such as Apriori and PageRank, are specific to certain kinds of data.
So ...
- Today we will look at:
- The no. 1 Machine Learning method
- Overfitting (this is a good place for it...)
- Support vector machines
- In the last lecture we will see the new number 1 Machine Learning method
Decision boundary methods: finding a separating boundary
If your data were 2D, you could plot your known data points, colour them with their known classes, draw the boundary you think looks best, and use that as the classifier.
But it's much more common (and effective) to use machine learning methods. Neural Networks and Support Vector Machines are the most common decision boundary methods. In both cases they learn by finding the parameters of a very complex curve, and that is the decision boundary.
Artificial Neural Networks
An artificial neuron (node); an ANN (neural network).
Nodes abstractly model neurons: they do very simple number crunching. Numbers flow from left to right: the numbers arriving at the input layer get transformed to a new set of numbers at the output layer. There are many kinds of nodes, and many ways of combining them into a network, but we need only be concerned with the types described here, which turn out to be sufficient for any (consistent) pattern classification task.
A single node (artificial neuron) works like this:
- Field values come along (inputs from us, or from other nodes); in the pictured example the inputs are 4, -3 and 0.
- They get multiplied by the strengths (weights) on the input lines: 4 × 3 = 12, -3 × 1 = -3, 0 × 2 = 0.
- The node adds up its inputs, and applies a simple function to the sum: f(12 - 3 + 0) = f(9).
- It sends the result out along its output lines, where it will in turn get multiplied by the line weights (here 2 and -2) before being delivered: 2 × f(9) and -2 × f(9).
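A minimal Python sketch of this computation; the step function used as f here is an assumption, since the slides do not fix a particular activation:

```python
def neuron(inputs, weights, f):
    """Weighted sum of the inputs, then a simple function f applied to it."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return f(total)

# The numbers from the worked example: inputs 4, -3, 0 with line weights 3, 1, 2.
step = lambda s: 1 if s >= 1 else 0          # an assumed threshold activation
out = neuron([4, -3, 0], [3, 1, 2], step)    # sum is 12 - 3 + 0 = 9, so f(9)
print(out)                                   # with this f: 1
# The result would then travel along the output lines, scaled by 2 and -2.
```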
Computing AND with a NN
Inputs A and B are connected to the output node by lines with weights 0.5 and 0.5. The blue node is the output node: it adds the weighted inputs, and outputs 1 if the result is at least 1, otherwise 0. So the output is 1 only when both A and B are 1.
Computing OR with a NN
Inputs A and B are connected to the output node by lines with weights 1 and 1. The blue node is the output node: it adds the weighted inputs, and outputs 1 if the result is at least 1, otherwise 0. With these weights, only one of the inputs needs to be a 1 and the output will be 1; the output will be 0 only if both inputs are zero.
Computing NOT with a NN
Input A is connected with weight -1, and a bias unit (which always sends a fixed signal of 1) is connected with weight 1. This NN computes the NOT of input A. The blue unit is a threshold unit with a threshold of 1, as before. So if A is 1, the weighted sum at the output unit is 0, hence the output is 0; if A is 0, the weighted sum is 1, so the output is 1.
So, an NN can compute AND, OR and NOT... so what?
- It is straightforward to combine ANNs together, with outputs from some becoming the inputs of others, etc. That is, we can combine them just like logic gates on a microchip.
- And this means that a neural network can compute ANY function of the inputs (see the sketch below).
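As an illustration (not from the slides), here is a minimal sketch of threshold units wired together like logic gates; the particular combination into XOR is an assumed example of a function no single unit can compute:

```python
def unit(inputs, weights, threshold=1):
    """A threshold unit: output 1 if the weighted sum reaches the threshold."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

def AND(a, b): return unit([a, b], [0.5, 0.5])
def OR(a, b):  return unit([a, b], [1, 1])
def NOT(a):    return unit([a, 1], [-1, 1])   # second input is the bias unit, fixed at 1

def XOR(a, b):                                # outputs of some units become inputs of others
    return AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))          # 0, 1, 1, 0
```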
And you're telling me this because...?
Imagine this: an image of a handwritten character is converted into an array of grey levels (the inputs), and there are 26 outputs, one for each character.
The weights on the links are chosen such that the output corresponding to the correct letter emits a 1, and all the others emit a 0. This sort of thing is not only possible, but routine: medical diagnosis, wine-tasting, lift control, sales prediction, ...
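A minimal sketch of that idea, assuming a single layer of weights, an 8x8 grey-level image and made-up data; the real network's size and weights are not specified in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_letters = 64, 26                 # assumed 8x8 image, 26 output nodes
W = rng.normal(size=(n_letters, n_pixels))   # weights would be learned, not random

grey_levels = rng.uniform(0, 1, n_pixels)    # made-up input pattern
outputs = W @ grey_levels                    # each output node's weighted sum
predicted = "abcdefghijklmnopqrstuvwxyz"[int(np.argmax(outputs))]
print(predicted)                             # the letter whose output fires strongest
```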
Getting the Right Weights
Clearly, an application will only be accurate if the weights are right.
An ANN starts with randomised weights, and with a database of known examples for training. If an input pattern corresponds to a 'c', we want the 'c' output to be 1 and all the other outputs to be 0. If the network gets it wrong, the weights are adjusted in a simple way which makes it more likely that the ANN will be correct for this input next time.
Training an NN
It works like this (a loop: send a training pattern in, crunch to outputs, adjust the weights if some are wrong; stop when all are correct):
- Present a pattern as a series of numbers at the first layer of nodes, e.g. the field values for instance 1.
- Each node in the next layer does its simple processing and sends its results to the next layer, and so on, until numbers come out at the output layer.
- Compare the NN's output pattern with the known correct pattern (target class). If different, adjust the weights somehow to make it more likely to be correct on this pattern next time.
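A sketch of that loop in Python for a single threshold unit, using a perceptron-style rule as one simple way of "adjusting the weights"; the lecture does not commit to a particular rule or network size:

```python
import random

def train(patterns, targets, n_inputs, rate=0.1, max_epochs=100):
    """Train one threshold unit until every training pattern is classified correctly."""
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]        # randomised start
    for _ in range(max_epochs):
        all_correct = True
        for x, t in zip(patterns, targets):
            out = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 1 else 0   # crunch to output
            if out != t:                                                  # some wrong: adjust
                all_correct = False
                weights = [w + rate * (t - out) * xi for w, xi in zip(weights, x)]
        if all_correct:                                                   # all correct: STOP
            return weights
    return weights

# e.g. learn OR from its four known examples
print(train([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 1], n_inputs=2))
```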
Classical NN Training
An algorithm called backpropagation (BP) is the classic way of training a neural network. Based on partial differentiation, it prescribes a way to adjust the weights so that the error on the latest pattern would probably be reduced next time. We can instead use any optimisation algorithm (e.g. a GA) to find the weights for a NN; for example, the first ever significant application of particle swarm optimisation showed that it was faster than BP, with better results.
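A minimal numpy sketch of the kind of gradient-based weight update BP prescribes, here for a single sigmoid unit; the unit, the data and the learning rate are assumptions for illustration, not the lecture's example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w = rng.normal(size=3)                 # randomised starting weights
x = np.array([4.0, -3.0, 0.0])         # one training pattern (inputs)
t = 1.0                                # its known target output
rate = 0.5                             # assumed learning rate

for _ in range(100):
    y = sigmoid(w @ x)                 # forward pass: crunch to the output
    error = 0.5 * (t - y) ** 2         # squared error on the latest pattern
    grad = (y - t) * y * (1 - y) * x   # dError/dw by the chain rule (partial differentiation)
    w -= rate * grad                   # adjust weights so the error will probably shrink
```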
Generalisation and Decision Boundaries
The ANN is learning during its training phase. When it is in use, providing decisions/classifications for live cases it hasn't seen before, we expect a reasonable decision from it; i.e. we want it to generalise well.
Suppose a network was trained with the black As and Bs in the figures; the black line is a visualisation of its decision space: it will think anything on one side is an A, and anything on the other side is a B. The white A represents an unseen test case. The three examples show good generalisation, fairly poor generalisation, and "stereotyping?"; in the third example, the network thinks the unseen case is a B.
Coverage and extent of training data helps to avoid poor generalisation. Main point: when an NN generalises well, its results seem sensible, intuitive, and generally more accurate than people.
Poor generalisation: some insight into overfitting
Suppose we train a classifier to tell the difference between handwritten 't' and 'c', using only these examples (a few 't's and a few 'c's). The classifier will learn easily; it will probably give 100% correct predictions on these cases.
Overfitting
BUT this classifier will probably generalise very poorly: it will perform very badly on a test set. E.g. the slide shows likely performance on certain unseen cases: it will probably predict that one of them is a 'c', and another is a 't'. Why?
Avoiding Overfitting
It can be avoided by using as much training data as possible, ensuring as much diversity as possible in the data. This cuts down on the potential existence of features that might be discriminative in the training data, but are otherwise spurious.
It can also be avoided by jittering (adding noise): during training, every time an input pattern is presented, it is randomly perturbed. The idea is that spurious features will be washed out by the noise, but valid discriminatory features will remain. The problem with this approach is how to correctly choose the level of noise.
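A minimal sketch of jittering, assuming numpy arrays of input patterns and a Gaussian noise level chosen by hand; nothing here is prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

def jitter(pattern, noise_level=0.05):
    """Randomly perturb an input pattern each time it is presented."""
    return pattern + rng.normal(scale=noise_level, size=pattern.shape)

# During training, present jitter(x) instead of x on every pass:
x = np.array([0.2, 0.8, 0.4])        # a made-up input pattern
for epoch in range(3):
    noisy_x = jitter(x)              # a slightly different version every time
    # ... feed noisy_x through the network and adjust weights as usual ...
```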
Avoiding Overfitting II
A typical plot of error against training time for methods like neural networks: error on the training data keeps falling, but error on unseen data (not in the training set) eventually starts to rise again.
Avoiding Overfitting III
Another approach is early stopping. During training, keep track of the network's performance on a separate validation set of data. At the point where the error continues to improve on the training set but starts to get worse on the validation set, training should be stopped, since the network is starting to overfit on the training data. The problem here is that this point is far from always clear-cut.
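As an illustration (the lecture does not tie this to any particular library), scikit-learn's MLP classifier implements exactly this idea: with early_stopping=True it holds back a validation fraction of the training data and stops when the validation score stops improving. The dataset and settings below are made up:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# A made-up classification problem purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the training data as a validation set; stop when its score
# has not improved for 10 consecutive epochs.
net = MLPClassifier(hidden_layer_sizes=(20,), early_stopping=True,
                    validation_fraction=0.2, n_iter_no_change=10,
                    max_iter=1000, random_state=0)
net.fit(X, y)
print(net.n_iter_)   # epochs actually run before early stopping kicked in
```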
Real-world applications of ANNs are all over the place.
Stocks, Commodities and Futures
- Currency Price Predictions (James O'Sullivan): controls trading of more than 10 different financial markets with consistent profits.
- Corporate Bond Rating (George Pugh): predicts corporate bond ratings with 100% accuracy for consulting and trading.
- Standard and Poor's 500 Prediction (LBS Capital Management, Inc): predicts the S&P 500 one day ahead and one week ahead with better accuracy than traditional methods.
- Forecasting Stock Prices (Walkrich Investments): neural networks rate underpriced stock, beating the S&P.
Business, Management, and Finance
- Direct Marketing Mail Prediction (Microsoft): improves response rates from 4.9% to 8.2%.
- Credit Scoring (Herbert Jensen): predicts loan application success with 75-80% accuracy.
- Identifying Policemen with Potential for Misconduct: the Chicago Police Department predicts misconduct potential based on employee records.
- Jury Summoning with Neural Networks: the Montgomery Court House in Norristown, PA saves $70 million annually using The Intelligent Summoner from MEA.
- Forecasting Highway Maintenance with Neural Networks: Professor Awad Hanna at the University of Wisconsin in Madison has trained a neural network to predict which type of concrete is better than another for a particular highway problem.
Medical Applications
- Breast Cancer Cell Analysis (David Weinberg, MD): image analysis ignores benign cells and classifies malignant cells.
- Hospital Expenses Reduced (Anderson Memorial Hospital): improves the quality of care, reduces the death rate, and saved $500,000 in the first 15 months of use.
- Diagnosing Heart Attacks (J. Furlong, MD): recognizes acute myocardial infarction from enzyme data.
- Emergency Room Lab Test Ordering (S. Berkov, MD): saves time and money ordering tests using symptoms and demographics.
- Classifying Patients for Psychiatric Care (G. Davis, MD): predicts length of stay for psychiatric patients, saving money.
Sports Applications
- Thoroughbred Horse Racing (Don Emmons): 22 races, 17 winning horses.
- Thoroughbred Horse Racing (Rich Janeva): 39% of winners picked at odds better than 4.5 to 1.
- Dog Racing (Derek Anderson): 94% accuracy picking first place.
Science
- Solar Flare Prediction (Dr. Henrik Lundstet): predicts the next major solar flare; helps prevent problems for power plants.
- Mosquito Identification (Aubrey Moore): 100% accuracy distinguishing between male and female of two species.
- Spectroscopy (StellarNet Inc): analyzes spectral data to classify materials.
- Weather Forecasting (Fort Worth National Weather Service): predicts rainfall to 85% accuracy.
- Air Quality Testing: researchers at the Defense Research Establishment Suffield, Chemical Biological Defense Section, in Alberta, Canada have trained a neural network to recognize, classify and characterize aerosols of unknown origin with a high degree of accuracy.
Manufacturing
- Plastics Testing (Monsanto): predicts plastics quality, saving research time, processing time, and manufacturing expense.
- Computer Chip Manufacturing Quality (Intel): analyzes chip failures to help improve yields.
- Nondestructive Concrete Testing (Donald G. Pratt): detects the presence and position of flaws in reinforced concrete.
- Beer Testing (Anheuser-Busch): identifies the organic content of competitors' beer vapors with 96% accuracy.
- Steam Quality Testing: AECL Research in Manitoba, Canada has developed the INSIGHT steam quality monitor, an instrument used to measure steam quality and mass flow rate.
Support Vector Machines: a different approach to finding the decision surface, particularly good at generalisation.
Suppose we can divide the classes with a simple hyperplane.
There will be infinitely many such lines.
One of them is optimal.
Because it maximises the average distance of the hyperplane from the support vectors: the instances that are closest to instances of a different class.
A Support Vector Machine (SVM) finds this hyperplane.
But usually there is no simple hyperplane that separates the classes!
One dimension (x), two classes.
Two dimensions (x, x·sin(x)).
Now we can separate the classes.
That's what SVMs do
If we add enough extra dimensions/fields using arbitrary functions of the existing fields, then it becomes very likely that we can separate the data with a straight line / hyperplane. SVMs:
- apply such a transformation,
- then find the optimal separating hyperplane.
The optimality of the separating hyperplane means good generalisation properties.
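A minimal scikit-learn sketch of the one-dimensional example above: the raw feature x is not linearly separable, but after adding the extra field x·sin(x) a linear SVM can find a separating hyperplane. The dataset (and the C setting) is made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data: the class depends on x*sin(x), so no single threshold on x separates it.
x = np.linspace(-10, 10, 400)
value = x * np.sin(x)
keep = np.abs(value - 2) > 0.5          # leave a gap around the boundary for a clean margin
x, value = x[keep], value[keep]
labels = (value > 2).astype(int)

# One dimension only: a linear SVM cannot separate the interleaved classes.
print(SVC(kernel="linear", C=100).fit(x.reshape(-1, 1), labels)
      .score(x.reshape(-1, 1), labels))                    # well below 100%

# Add the extra field x*sin(x): now a straight-line hyperplane separates the classes.
xy = np.column_stack([x, value])
print(SVC(kernel="linear", C=100).fit(xy, labels).score(xy, labels))   # 1.0
```

In practice SVMs do this transformation implicitly via a kernel function rather than by adding columns by hand, but the effect is the same: separate the data in a higher-dimensional space with a hyperplane.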
Next: the classic, field-defining DM algorithm.