Data Mining (and machine learning)
David Corne and Nick Taylor, Heriot-Watt ...

1
Data Mining (and machine learning)
  • A few important things in brief:
  • top10dm; neural networks; overfitting; SVMs

2-5
http://is.gd/top10dm
  1. C4.5
  2. k-means
  3. SVMs
  4. Apriori
  5. EM algorithm
  6. PageRank
  7. AdaBoost
  8. k-NN
  9. Naive Bayes
  10. CART
The 10 most mentioned in Data Mining academic literature up to 2007, not including Machine Learning literature. C4.5 and CART are decision tree methods; SVMs are a machine-learning decision boundary method; the EM algorithm is heavily mathematical, a generalised version of k-means; some of the others are specific to certain kinds of data.
6
So ...
  • Today we will look at:
  • the no. 1 Machine Learning method
  • overfitting (this is a good place for it...)
  • support vector machines
  • in the last lecture we will see the new no. 1 machine learning method

7-12
Decision boundary methods: finding a separating boundary
13-14
If your data were 2D, you could plot your known data points, colour them with their known classes, draw the boundary you think looks best, and use that as the classifier. But it's much more common (and effective) to use machine learning methods. Neural networks and support vector machines are the most common decision boundary methods. In both cases they learn by finding the parameters of a very complex curve, and that's the decision boundary.
15
Artificial Neural Networks
[Figure: an artificial neuron (node), and an ANN (neural network)]
Nodes abstractly model neurons: they do very simple number crunching. Numbers flow from left to right: the numbers arriving at the input layer get transformed to a new set of numbers at the output layer. There are many kinds of nodes, and many ways of combining them into a network, but we need only be concerned with the types described here, which turn out to be sufficient for any (consistent) pattern classification task.
16
A single node (artificial neuron) works like this:
[Figure: a node with three input lines, weighted 3, 1 and 2, and two output lines, weighted 2 and -2]
17
A single node (artificial neuron) works like this:
[Figure: input values 4, -3 and 0 arrive on the lines weighted 3, 1 and 2]
Field values come along (inputs from us, or from other nodes).
18
A single node (artificial neuron) works like this:
4 × 3 = 12
-3 × 1 = -3
0 × 2 = 0
They get multiplied by the strengths on the input lines.
19
A single node (artificial neuron) works like this:
f(12 - 3 + 0)
The node adds up its inputs, and applies a simple function f to the sum.
20
A single node (artificial neuron) works like this:
2 × f(9)
-2 × f(9)
It sends the result out along its output lines, where it will in turn get multiplied by the line weights before being delivered.
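A minimal sketch of this node in Python (the slides give no code; the identity function stands in for the unspecified f):

    # The node from the figure: input weights 3, 1, 2; output weights 2, -2.
    def node(inputs, weights, f):
        total = sum(x * w for x, w in zip(inputs, weights))  # 4*3 + -3*1 + 0*2 = 9
        return f(total)

    out = node([4, -3, 0], [3, 1, 2], f=lambda s: s)  # identity f, for illustration
    print(out)                 # 9
    print(2 * out, -2 * out)   # what the two output lines (weights 2, -2) deliver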
21
Computing AND with a NN
[Figure: inputs A and B connect to the blue output node with weights 0.5 and 0.5]
The blue node is the output node. It adds the weighted inputs, and outputs 1 if the result is ≥ 1, otherwise 0.
22
Computing OR with a NN
[Figure: inputs A and B connect to the blue output node with weights 1 and 1]
The blue node is the output node. It adds the weighted inputs, and outputs 1 if the result is ≥ 1, otherwise 0. With these weights, only one of the inputs needs to be a 1, and the output will be 1. Output will be 0 only if both inputs are zero.
23
Computing NOT with a NN
[Figure: input A connects with weight -1; a bias unit, which always sends a fixed signal of 1, connects with weight 1]
This NN computes the NOT of input A. The blue unit is a threshold unit with a threshold of 1, as before. So if A is 1, the weighted sum at the output unit is 0, hence the output is 0. If A is 0, the weighted sum is 1, so the output is 1.
24
So, an NN can compute AND, OR and NOT so what?
  • It is straightforward to combine ANNs together, with outputs from some becoming the inputs of others, etc. That is, we can combine them just like logic gates on a microchip.
  • And this means that a neural network can compute ANY function of the inputs (see the sketch below).
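A minimal sketch in Python of the three gates above, using the threshold-at-1 units from the previous slides, and combining them (just like logic gates) into XOR, which no single unit can compute:

    # Each unit outputs 1 if its weighted input sum is at least 1, else 0.
    def unit(inputs, weights):
        return 1 if sum(x * w for x, w in zip(inputs, weights)) >= 1 else 0

    def AND(a, b): return unit([a, b], [0.5, 0.5])
    def OR(a, b):  return unit([a, b], [1, 1])
    def NOT(a):    return unit([a, 1], [-1, 1])  # second input is the bias unit, fixed at 1

    def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", XOR(a, b))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0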

25
And you're telling me this because...?
Imagine this: an image of a handwritten character is converted into an array of grey levels (the inputs), with 26 outputs, one for each character.
[Figure: grey-level inputs (7, 2, 0, 3, ...) feed the network; the outputs are labelled a, b, c, d, e, f, ...]
Weights on the links are chosen such that the output corresponding to the correct letter emits a 1, and all the others emit a 0. This sort of thing is not only possible, but routine: medical diagnosis, wine-tasting, lift control, sales prediction, ...
26
Getting the Right Weights
Clearly, an application will only be accurate if the weights are right. An ANN starts with randomised weights, and with a database of known examples for training.
[Figure: the input pattern (7, 2, 0, 3, ...) corresponds to a 'c', so we want the 'c' output to be 1 and every other output to be 0]
If wrong, the weights are adjusted in a simple way which makes it more likely that the ANN will be correct for this input next time.
27
Training an NN
It works like this:
  • Present a pattern as a series of numbers at the first layer of nodes, e.g. the field values for instance 1.
  • Each node in the next layer does its simple processing, and sends its results to the next layer, and so on, until numbers come out at the output layer.
  • Compare the NN's output pattern with the known correct pattern (target class). If different, adjust the weights somehow to make it more likely to be correct on this pattern next time.
  • If some patterns are still wrong, repeat; STOP when all are correct. (A sketch of this loop follows.)
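A minimal sketch of this loop in Python, for a single threshold unit; the perceptron rule is used as the "adjust the weights somehow" step (an assumption, since the slides do not name a rule):

    # Train until every pattern is classified correctly (STOP), or give up.
    def train(patterns, targets, n_inputs, rate=0.1, max_epochs=100):
        weights = [0.0] * n_inputs            # randomised in practice; zeros for brevity
        for _ in range(max_epochs):
            all_correct = True
            for inputs, target in zip(patterns, targets):
                output = 1 if sum(x * w for x, w in zip(inputs, weights)) >= 1 else 0
                if output != target:          # some wrong: adjust the weights
                    all_correct = False
                    weights = [w + rate * (target - output) * x
                               for w, x in zip(weights, inputs)]
            if all_correct:                   # all correct: STOP
                break
        return weights

    # e.g. learn AND (the third input is a fixed bias of 1)
    w = train([(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)], [0, 0, 0, 1], n_inputs=3)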
28
Classical NN Training
An algorithm called backpropagation (BP) is the classic way of training a neural network. Based on partial differentiation, it prescribes a way to adjust the weights so that the error on the latest pattern would probably be reduced next time (see the sketch below). We can instead use any optimisation algorithm (e.g. a GA) to find the weights for a NN. E.g. the first ever significant application of particle swarm optimisation showed that it was faster than BP, with better results.
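The slides do not show BP's update rule; as an illustration of the partial-differentiation idea, here is the gradient step for a single sigmoid unit (the one-unit special case of BP):

    import math

    def sigmoid(s):
        return 1.0 / (1.0 + math.exp(-s))

    # One BP-style step: differentiate the squared error w.r.t. each weight
    # and move the weights a little way downhill.
    def bp_step(inputs, target, weights, rate=0.5):
        out = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
        # dE/dw_i for E = (out - target)^2 / 2, using sigmoid'(s) = out * (1 - out)
        delta = (out - target) * out * (1.0 - out)
        return [w - rate * delta * x for w, x in zip(weights, inputs)]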
29
Generalisation: decision boundaries
The ANN is learning during its training phase. When it is in use, providing decisions/classifications for live cases it hasn't seen before, we expect a reasonable decision from it. I.e. we want it to generalise well.
Suppose a network was trained with the black As and Bs here; the black line is a visualisation of its decision space: it will think anything on one side is an A, and anything on the other side is a B. The white A represents an unseen test case. In the third example, it thinks this is a B.
[Figure: three plots of the same As and Bs with different learned boundaries, captioned "Good generalisation", "Fairly poor generalisation" and "Stereotyping?"]
Coverage and extent of the training data help to avoid poor generalisation. Main point: when an NN generalises well, its results seem sensible, intuitive, and generally more accurate than people.
30
Poor generalisation: some insight into overfitting
Suppose we train a classifier to tell the difference between handwritten 't' and 'c', using only these examples:
[Figure: a few handwritten t's and c's]
The classifier will learn easily. It will probably give 100% correct prediction on these cases.
31
Overfitting
BUT this classifier will probably generalise very poorly; it will perform very badly on a test set. E.g. here is potential (very likely) performance on certain unseen cases:
[Figure: an unseen character it will probably predict is a 'c', and another it will probably predict is a 't']
Why?
32
Avoiding Overfitting
It can be avoided by using as much training data as possible, ensuring as much diversity as possible in the data. This cuts down on the potential existence of features that might be discriminative in the training data, but are otherwise spurious.
It can also be avoided by jittering (adding noise): during training, every time an input pattern is presented, it is randomly perturbed (see the sketch below). The idea is that spurious features will be washed out by the noise, but valid discriminatory features will remain. The problem with this approach is how to correctly choose the level of noise.
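A minimal sketch of jittering in Python, assuming zero-mean Gaussian noise (the slide does not specify the distribution); sigma is the hard-to-choose noise level:

    import random

    def jitter(pattern, sigma=0.1):
        # perturb each field of the input pattern independently
        return [x + random.gauss(0.0, sigma) for x in pattern]

    # during training, each time a pattern is presented: inputs = jitter(inputs)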
33
Avoiding Overfitting II
[Figure: error against training time, for methods like neural networks. A typical curve shows error on the training data falling steadily, while error on unseen data, not in the training set, falls at first and then rises again.]
34
Avoiding Overfitting III
Another approach is early stopping. During training, keep track of the network's performance on a separate validation set of data. At the point where error continues to improve on the training set, but starts to get worse on the validation set, training should be stopped, since the network is starting to overfit on the training data (see the sketch below). The problem here is that this point is far from always clear-cut.
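A minimal sketch of early stopping in Python; train_one_epoch and error are hypothetical helpers standing in for the network's own training and evaluation routines:

    # Keep the weights from the best validation point seen; stop once the
    # validation error has failed to improve for `patience` epochs.
    def train_with_early_stopping(net, train_set, valid_set, patience=10):
        best_error, best_weights, bad_epochs = float("inf"), net.weights[:], 0
        while bad_epochs < patience:
            train_one_epoch(net, train_set)   # hypothetical: one pass of weight updates
            e = error(net, valid_set)         # hypothetical: error on the validation set
            if e < best_error:
                best_error, best_weights, bad_epochs = e, net.weights[:], 0
            else:
                bad_epochs += 1               # getting worse: possibly overfitting
        net.weights = best_weights            # roll back to the best validation point
        return net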
35
Real-world applications of ANNs are all over the
place
36
Stocks, Commodities and Futures
  • Currency Price Predictions (James O'Sullivan): controls trading of more than 10 different financial markets with consistent profits.
  • Corporate Bond Rating (George Pugh): predicts corporate bond ratings with 100% accuracy, for consulting and trading.
  • Standard and Poor's 500 Prediction (LBS Capital Management, Inc): predicts the S&P 500 one day ahead and one week ahead with better accuracy than traditional methods.
  • Forecasting Stock Prices (Walkrich Investments): neural networks rate underpriced stock, beating the S&P.
37
Business, Management, and Finance
  • Direct Marketing Mail Prediction (Microsoft): improves response rates from 4.9% to 8.2%.
  • Credit Scoring (Herbert Jensen): predicts loan application success with 75-80% accuracy.
  • Identifying Policemen with Potential for Misconduct: the Chicago Police Department predicts misconduct potential based on employee records.
  • Jury Summoning with Neural Networks: the Montgomery Court House in Norristown, PA saves $70 million annually using The Intelligent Summoner from MEA.
  • Forecasting Highway Maintenance with Neural Networks: Professor Awad Hanna at the University of Wisconsin in Madison has trained a neural network to predict which type of concrete is better than another for a particular highway problem.
38
Medical Applications
  • Breast Cancer Cell Analysis (David Weinberg, MD): image analysis ignores benign cells and classifies malignant cells.
  • Hospital Expenses Reduced (Anderson Memorial Hospital): improves the quality of care, reduces the death rate, and saved $500,000 in the first 15 months of use.
  • Diagnosing Heart Attacks (J. Furlong, MD): recognizes Acute Myocardial Infarction from enzyme data.
  • Emergency Room Lab Test Ordering (S. Berkov, MD): saves time and money ordering tests using symptoms and demographics.
  • Classifying Patients for Psychiatric Care (G. Davis, MD): predicts length of stay for psychiatric patients, saving money.
39
Sports Applications
  • Thoroughbred Horse Racing (Don Emmons): 22 races, 17 winning horses.
  • Thoroughbred Horse Racing (Rich Janeva): 39% of winners picked, at odds better than 4.5 to 1.
  • Dog Racing (Derek Anderson): 94% accuracy picking first place.
40
Science
  • Solar Flare Prediction (Dr. Henrik Lundstet): predicts the next major solar flare; helps prevent problems for power plants.
  • Mosquito Identification (Aubrey Moore): 100% accuracy distinguishing between male and female of two species.
  • Spectroscopy (StellarNet Inc): analyzes spectral data to classify materials.
  • Weather Forecasting (Fort Worth National Weather Service): predicts rainfall to 85% accuracy.
  • Air Quality Testing: researchers at the Defense Research Establishment Suffield, Chemical Biological Defense Section, in Alberta, Canada have trained a neural network to recognize, classify and characterize aerosols of unknown origin with a high degree of accuracy.
41
Manufacturing
  • Plastics Testing (Monsanto): predicts plastics quality, saving research time, processing time, and manufacturing expense.
  • Computer Chip Manufacturing Quality (Intel): analyzes chip failures to help improve yields.
  • Nondestructive Concrete Testing (Donald G. Pratt): detects the presence and position of flaws in reinforced concrete.
  • Beer Testing (Anheuser-Busch): identifies the organic content of competitors' beer vapors with 96% accuracy.
  • Steam Quality Testing: AECL Research in Manitoba, Canada has developed the INSIGHT steam quality monitor, an instrument used to measure steam quality and mass flowrate.
42
Support Vector Machines: a different approach to finding the decision surface, particularly good at generalisation
43
Suppose we can divide the classes with a simple
hyperplane
44
There will be infinitely many such lines
45
One of them is optimal
46
Because it maximises the margin: the distance between the hyperplane and the support vectors, the instances that are closest to instances of a different class (see the formulation below).
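In the standard formulation (not spelled out on the slide), with the hyperplane written as w · x + b = 0 and the weights scaled so the closest instances satisfy |w · x + b| = 1, the quantity being maximised is:

    \[
    d(x) = \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert},
    \qquad
    \text{margin} = \frac{2}{\lVert w \rVert}
    \]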
47
A Support Vector Machine (SVM) finds this
hyperplane
48
But, usually there is no simple hyperplane that
separates the classes!
49
One dimension (x), two classes
50
Two dimensions (x, x·sin(x))
51
Now we can separate the classes
52
That's what SVMs do
If we add enough extra dimensions/fields, using arbitrary functions of the existing fields, then it becomes very likely that we can separate the data with a straight line (hyperplane). SVMs:
  • apply such a transformation
  • then find the optimal separating hyperplane.
The optimality of the separating hyperplane means good generalisation properties (see the sketch below).
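A minimal sketch in Python, assuming scikit-learn (the slides name no library): a one-dimensional dataset where the class depends on |x|, so no single threshold on x separates it, but the RBF kernel's implicit extra dimensions make it separable:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(200, 1))    # one field, x
    y = (np.abs(x[:, 0]) < 1).astype(int)    # class depends on |x|

    linear = SVC(kernel="linear").fit(x, y)  # straight-line boundary in the original field
    rbf = SVC(kernel="rbf").fit(x, y)        # implicit transformation to extra dimensions

    print("linear kernel accuracy:", linear.score(x, y))  # poor: no threshold works
    print("RBF kernel accuracy:", rbf.score(x, y))        # near 1.0 after the mapping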
53
Next: the classic, field-defining DM algorithm