Title: Machine Learning and Neural Networks
Professor Tony Martinez
Computer Science Department
Brigham Young University
http://axon.cs.byu.edu/~martinez
2. Tutorial Overview
- Introduction and Motivation
- Neural Network Model Descriptions
- Perceptron
- Backpropagation
- Issues
- Overfitting
- Applications
- Other Models
  - Decision Trees, Nearest Neighbor/IBL, Genetic Algorithms, Rule Induction, Ensembles
3. More Information
- You can download this presentation from
  ftp://axon.cs.byu.edu/pub/papers/NNML.ppt
- An excellent introductory text to Machine Learning: Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
4. What is Inductive Learning?
- Gather a set of input-output examples from some application (the Training Set) - e.g. speech recognition, financial forecasting
- Train the learning model (neural network, etc.) on the training set until it solves it well
- The goal is to generalize on novel data not yet seen
- Gather a further set of input-output examples from the same application (the Test Set)
- Use the learning system on actual data
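A minimal sketch of this train/test workflow. The 80/20 split, the random shuffling, and the model's predict interface are assumptions for illustration, not from the slides:

```python
# Split gathered input-output examples into a training set and a test set,
# train any model on the first, and estimate generalization on the second.
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]          # (training set, test set)

def test_accuracy(model, test_set):
    """Fraction of novel (x, y) examples the trained model classifies correctly."""
    correct = sum(1 for x, y in test_set if model.predict(x) == y)
    return correct / len(test_set)
```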
5. Motivation
- Costs and errors in programming
- Our inability to program "subjective" problems
- General, easy-to-use mechanism for a large set of applications
- Improvement in application accuracy - Empirical
6. Example Application - Heart Attack Diagnosis
- The patient has a set of symptoms - age, type of pain, heart rate, blood pressure, temperature, etc.
- Given these symptoms in an Emergency Room setting, a doctor must diagnose whether a heart attack has occurred.
- How do you train a machine learning model to solve this problem using the inductive learning model?
  - Consistent approach
  - Knowledge of the ML approach is not critical
  - Need to select a reasonable set of input features
7. Examples and Discussion
- Loan Underwriting
- Which Input Features (Data)
- Divide into Training Set and Test Set
- Choose a learning model
- Train model on Training set
- Predict accuracy with the Test Set
- How to generalize better?
- Different Input Features
- Different Learning Model
- Issues
- Intuition vs. Prejudice
- Social Response
8. UC Irvine Machine Learning Data Base - Iris Data Set
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
9. Voting Records Data Base
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n
10. Machine Learning Sketch History
- Neural Networks - Connectionist - Biological Plausibility
  - Late 50s, early 60s: Rosenblatt, Perceptron
  - Minsky & Papert 1969 - The Lull, symbolic expansion
  - Late 80s - Backpropagation, Hopfield, etc. - The explosion
- Machine Learning - Artificial Intelligence - Symbolic - Psychological Plausibility
  - Samuel (1959) - Checkers evaluation strategies
  - 1970s and on - ID3, Instance Based Learning, Rule induction, ...
  - Currently symbolic and connectionist are lumped under ML
- Genetic Algorithms - 1970s
  - Originally lumped in with connectionist
  - Now an exploding area: Evolutionary Algorithms
11. Inductive Learning - Supervised
- Assume a set T of examples of the form (x,y) where x is a vector of features/attributes and y is a scalar or vector output
- By examining the examples, postulate a hypothesis H(x) -> y for arbitrary x
- Spectrum of Supervised Algorithms
- Unsupervised Learning
- Reinforcement Learning
12. Other Machine Learning Areas
- Case Based Reasoning
- Analogical Reasoning
- Speed-up Learning
- Inductive Learning is the most studied and successful to date
- Data Mining
- COLT - Computational Learning Theory
13. (No Transcript)
14. Perceptron Node - Threshold Logic Unit
[Diagram: inputs x1, x2, ..., xn with weights w1, w2, ..., wn feeding a threshold node with output Z]
15. Learning Algorithm
[Diagram: two-input perceptron with weights .4 on x1 and -.2 on x2, threshold .1, and output Z]
16. First Training Instance
[Diagram: input pattern (.8, .3) applied to the perceptron with weights .4 and -.2; output Z = 1]
Net = .8(.4) + .3(-.2) = .26
17. Second Training Instance
[Diagram: input pattern (.4, .1) applied to the perceptron with weights .4 and -.2; output Z = 1]
Net = .4(.4) + .1(-.2) = .14
Δwi = C(T - Z)xi
18. Delta Rule Learning
- Δwij = C(Tj - Zj) xi
- Create a network with n input and m output nodes
- Each iteration through the training set is an epoch
- Continue training until error is less than some epsilon
- Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
- As can be seen from the node activation function, the decision surface is an n-dimensional hyperplane
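A minimal code sketch of this perceptron/delta-rule training loop. The OR data, the learning rate C = 0.1, and the zero threshold are illustrative assumptions, not from the slides:

```python
# Perceptron training with the delta rule: Δw_i = C * (T - Z) * x_i
import numpy as np

def train_perceptron(X, T, C=0.1, epochs=10, threshold=0.0):
    """X: (n_samples, n_features) inputs, T: 0/1 targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                 # one pass through the data = one epoch
        for x, t in zip(X, T):
            net = np.dot(w, x)              # weighted sum of inputs
            z = 1 if net > threshold else 0 # threshold logic unit output
            w += C * (t - z) * x            # delta rule weight update
    return w

# Tiny linearly separable example (logical OR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 1])
print(train_perceptron(X, T))
```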
19. Linear Separability
20. Linear Separability and Generalization
When is data noise vs. a legitimate exception?
21. Limited Functionality of Hyperplane
22. Gradient Descent Learning
[Figure: error landscape - TSS (Total Sum Squared Error) plotted against weight values]
23. Deriving a Gradient Descent Learning Algorithm
- Goal: to decrease overall error (or other objective function) each time a weight is changed
- Total Sum Squared error: TSS = Σ (Ti - Zi)²
- Seek a weight-changing algorithm such that the resulting change in error is negative
- If a formula can be found then we have a gradient descent learning algorithm
- Perceptron/Delta rule is a gradient descent learning algorithm
- Linearly-separable problems have no local minima
24. Multi-layer Perceptron
- Can compute arbitrary mappings
- Assumes a non-linear activation function
- Training algorithms less obvious
- Backpropagation learning algorithm not exploited until the 1980s
- First of many powerful multi-layer learning algorithms
25. Responsibility Problem
[Figure: output = 1, wanted 0]
26. Multi-Layer Generalization
27. Backpropagation
- Multi-layer supervised learner
- Gradient descent weight updates
- Sigmoid activation function (smoothed threshold logic)
- Backpropagation requires a differentiable activation function
28. Multi-layer Perceptron Topology
[Figure: Input Layer -> Hidden Layer(s) -> Output Layer]
29. Backpropagation Learning Algorithm
- Until convergence (low error or other criteria) do
  - Present a training pattern
  - Calculate the error of the output nodes (based on T - Z)
  - Calculate the error of the hidden nodes (based on the error of the output nodes, which is propagated back to the hidden nodes)
  - Continue propagating error back until the input layer is reached
  - Update all weights based on the standard delta rule with the appropriate error function δ: Δwij = C δj Zi
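A minimal code sketch of the loop above, assuming one hidden layer, sigmoid activations, per-pattern (online) updates, and the XOR data as the training set; none of those specifics come from the slides:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, C = 2, 4, 0.5
W1 = rng.normal(0, 0.5, (n_in + 1, n_hid))    # input -> hidden weights (+1 row for bias)
W2 = rng.normal(0, 0.5, (n_hid + 1, 1))       # hidden -> output weights (+1 row for bias)

def forward(x):
    xb = np.append(x, 1.0)                    # bias input
    h = sigmoid(xb @ W1)                      # hidden activations
    hb = np.append(h, 1.0)                    # hidden bias
    z = sigmoid(hb @ W2)                      # output activations
    return xb, hb, z

for epoch in range(10000):
    for x, t in zip(X, T):
        xb, hb, z = forward(x)
        # Output error, then hidden error propagated back through the weights
        delta_out = (t - z) * z * (1 - z)                              # (T - Z) * f'(net)
        delta_hid = (delta_out @ W2[:-1].T) * hb[:-1] * (1 - hb[:-1])  # back-propagated error
        # Delta rule updates with the appropriate error term
        W2 += C * np.outer(hb, delta_out)
        W1 += C * np.outer(xb, delta_hid)

for x in X:
    print(x, np.round(forward(x)[2], 2))       # outputs should approach [0, 1, 1, 0]
```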
30. Activation Function and its Derivative
- Node activation function f(net) is typically the sigmoid: f(net) = 1 / (1 + e^-net)
- Derivative of the activation function is a critical part of the algorithm: f'(net) = f(net)(1 - f(net))
[Plots: the sigmoid rises from 0 to 1 (value .5 at net = 0) over net in [-5, 5]; its derivative peaks at .25 at net = 0 and falls toward 0 at net = ±5]
31. Backpropagation Learning Equations
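The equations themselves appear to have been an image on the original slide. The standard textbook form (e.g., Mitchell), consistent with the Δwij = C δj Zi update on slide 29, is:

```latex
% Standard backpropagation equations for a sigmoid network
% (w_{ij} is the weight from node i to node j; C is the learning rate).
\begin{align*}
\Delta w_{ij} &= C\,\delta_j\,Z_i \\
\delta_j &= (T_j - Z_j)\,f'(\mathrm{net}_j) &&\text{for output nodes} \\
\delta_j &= \Big(\sum_k \delta_k\,w_{jk}\Big)\,f'(\mathrm{net}_j) &&\text{for hidden nodes} \\
f'(\mathrm{net}_j) &= Z_j\,(1 - Z_j) &&\text{for the sigmoid}
\end{align*}
```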
32. Backpropagation Summary
- Excellent empirical results
- Scaling - the pleasant surprise
  - Local minima very rare as problem and network complexity increase
- Most common neural network approach
- User-defined parameters lead to more difficulty of use
  - Number of hidden nodes, layers, learning rate, etc.
- Many variants
  - Adaptive parameters, ontogenic (growing and pruning) learning algorithms
  - Higher order gradient descent (Newton, Conjugate Gradient, etc.)
  - Recurrent networks
33. Inductive Bias
- The approach used to decide how to generalize novel cases
- Occam's Razor: the simplest hypothesis which fits the data is usually the best. Still many remaining options:
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
- Now you receive the new input A B C. What is your output?
34. Overfitting
- Noise vs. Exceptions revisited
35. The Overfit Problem
[Figure: TSS vs. epochs - training-set error and validation/test-set error curves]
- Newer powerful models can have very complex decision surfaces which can converge well on most training sets by learning noisy and irrelevant aspects of the training set in order to minimize error (memorization in the limit)
- This makes them susceptible to overfit if not carefully considered
36. Avoiding Overfit
- Inductive bias - simplest accurate model
- More training data (vs. overtraining - one-epoch limit)
- Validation set (requires a separate test set)
- Backpropagation tends to build from a simple model (0 weights) to just large enough weights (validation set)
- Stopping criteria with any constructive model (accuracy increase vs. statistical significance) - noise vs. exceptions
- Specific techniques
  - Weight decay, pruning, jitter, regularization
  - Ensembles
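A minimal early-stopping sketch of the validation-set idea above; the train_epoch and error callables are hypothetical placeholders for whatever model-specific routines are in use:

```python
import copy

def train_with_early_stopping(model, train_epoch, error, train_set, val_set,
                              patience=10, max_epochs=1000):
    """train_epoch(model, data) runs one epoch; error(model, data) returns a scalar error."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch(model, train_set)            # one pass through the training set
        val_error = error(model, val_set)        # error on the held-out validation set
        if val_error < best_error:
            best_error = val_error
            best_model = copy.deepcopy(model)    # remember the best-so-far weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                # validation error stopped improving
    return best_model                            # this model is then scored on the test set
```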
37. Ensembles
- Many different ensemble approaches
  - Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking, Combinations
- Multiple diverse models are trained on the same problem and then their outputs are combined
- The specific overfit of each learning model is averaged out
- If models are diverse (uncorrelated errors) then even if the individual models are weak generalizers, the ensemble can be very accurate
[Diagram: models M1, M2, M3, ..., Mn feeding a combining technique]
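A minimal sketch of one combining technique, simple majority voting over already-trained models; the `models` list and their `.predict(x)` interface are assumptions for illustration:

```python
from collections import Counter

def ensemble_predict(models, x):
    """Combine the class predictions of diverse models by simple majority vote."""
    votes = [m.predict(x) for m in models]      # each model votes for a class
    return Counter(votes).most_common(1)[0][0]  # the most common class wins
```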
38. Application Issues
- Choose relevant features
- Normalize features
- Can learn to ignore irrelevant features, but will have to fight the curse of dimensionality
- The more data (training examples) the better
- Slower training is acceptable for complex and production applications if accuracy improves (the week phenomenon)
- Execution is normally fast regardless of training time
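For the "normalize features" point, a minimal min-max normalization sketch; the example feature values are made up:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) of X into [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / ranges

# Example: age, heart rate, temperature on very different scales
print(min_max_normalize([[25, 80, 36.6], [70, 120, 39.0], [50, 95, 37.2]]))
```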
39. Decision Trees - ID3/C4.5
- Top-down induction of decision trees
- Highly used and successful
- Attribute features - discrete nominal (mutually exclusive); real-valued features are discretized
- Search for the smallest tree is too complex (NP hard)
- C4.5 uses the common symbolic ML philosophy of a greedy iterative approach
40. Decision Tree Learning
- Mapping by hyper-rectangles
[Figure: the A1-A2 feature space partitioned into hyper-rectangles]
41. ID3 Learning Approach
- C is the current set of examples
- A test on attribute A partitions C into C1, C2, ..., Cw where w is the number of values of A
[Diagram: Attribute = Color partitions C into C1 (Red), C2 (Green), C3 (Purple)]
42. Decision Tree Learning Algorithm
- Start with the Training Set as C and test how each attribute partitions C
- Choose the best A for the root
  - The goodness measure is based on how well attribute A divides C into different output classes. A perfect attribute would divide C into partitions that contain only one output class each. A poor (irrelevant) attribute would leave each partition with the same ratio of classes as in C
  - 20-questions analogy: good questions quickly minimize the possibilities
- Continue recursively until sets are unambiguously classified or a stopping criterion is reached
43. ID3 Example and Discussion

  Temperature   P  N        Humidity   P  N
  Hot           2  2        High       3  4
  Mild          4  2        Normal     6  1
  Cool          3  1
  Gain = .029              Gain = .151

- 14 examples. Uses Information Gain. Attributes which best discriminate between classes are chosen
- If the same ratios are found in a partitioned set, then gain is 0
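A sketch of the information-gain calculation behind the table above, using the class counts shown there (9 P and 5 N examples overall):

```python
from math import log2

def entropy(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0
    return -(p / total) * log2(p / total) - (n / total) * log2(n / total)

def information_gain(partitions):
    """partitions: list of (p, n) counts, one per attribute value."""
    p_total = sum(p for p, _ in partitions)
    n_total = sum(n for _, n in partitions)
    total = p_total + n_total
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(p_total, n_total) - remainder

print(information_gain([(2, 2), (4, 2), (3, 1)]))  # Temperature gain (slide reports .029)
print(information_gain([(3, 4), (6, 1)]))          # Humidity gain (slide reports .151)
```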
44. ID3 - Conclusions
- Good empirical results
- Comparable application robustness and accuracy with neural networks - faster learning (though NNs are more natural with continuous features, both input and output)
- Most used and well known of current symbolic systems - used widely to aid in creating rules for expert systems
45. Nearest Neighbor Learners
- Broad spectrum
  - Basic k-NN, Instance Based Learning, Case Based Reasoning, Analogical Reasoning
- Simply store all, or some representative subset, of the examples in the training set
- Generalize on the fly rather than use a pre-acquired hypothesis - faster learning, slower execution, information retained, memory intensive
46. Nearest Neighbor Algorithms
47Nearest Neighbor Variations
- How many examples to store
- How do stored example vote (distance weighted,
etc.) - Can we choose a smaller set of near-optimal
examples (prototypes/exemplars) - Storage reduction
- Faster execution
- Noise robustness
- Distance Metrics non-Euclidean
- Irrelevant Features Feature weighting
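A minimal k-nearest-neighbor sketch with Euclidean distance and simple majority voting; the stored examples and k = 3 are illustrative assumptions:

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by a majority vote of its k nearest stored examples."""
    dists = np.linalg.norm(np.asarray(train_X) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    votes = [train_y[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

# Example with two stored classes
train_X = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.8]]
train_y = ["setosa", "setosa", "virginica", "virginica"]
print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # -> "setosa"
```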
48. Evolutionary Computation/Algorithms - Genetic Algorithms
- Simulate natural evolution of structures via selection and reproduction, based on performance (fitness)
- Type of heuristic search - discovery, not inductive in isolation
- Genetic operators - Recombination (Crossover) and Mutation are most common
  - Parent 1: 1 1 0 2 3 1 0 2 2 1 (Fitness: 10)
  - Parent 2: 2 2 0 1 1 3 1 1 0 0 (Fitness: 12)
  - Child:    2 2 0 1 3 1 0 2 2 1 (Fitness calculated, or f(parents))
49Evolutionary Algorithms
- Start with initialized population P(t) - random,
domain- knowledge, etc. - Population usually made up of possible parameter
settings for a complex problem - Typically have fixed population size (like beam
search) - Selection
- Parent_Selection P(t) - Promising Parents used to
create new children - Survive P(t) - Pruning of unpromising candidates
- Evaluate P(t) - Calculate fitness of population
members. Ranges from simple metrics to complex
simulations.
50. Evolutionary Algorithm
- Procedure EA
  - t = 0
  - Initialize Population P(t)
  - Evaluate P(t)
  - Until Done  /* sufficiently good individuals discovered */
    - t = t + 1
    - Parent_Selection P(t)
    - Recombine P(t)
    - Mutate P(t)
    - Evaluate P(t)
    - Survive P(t)
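A minimal Python sketch of this loop, using integer-string chromosomes like those on slide 48; the fitness function, gene alphabet, population size, and generation count are illustrative assumptions:

```python
import random

GENES, LENGTH, POP_SIZE = [0, 1, 2, 3], 10, 20

def fitness(chrom):
    return sum(chrom)                       # placeholder fitness: favor large gene values

def crossover(p1, p2):
    point = random.randint(1, LENGTH - 1)   # one-point recombination
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    return [random.choice(GENES) if random.random() < rate else g for g in chrom]

population = [[random.choice(GENES) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
for t in range(100):                                        # "Until Done": fixed generations here
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:POP_SIZE // 2]                        # Parent_Selection: keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE // 2)]              # Recombine and Mutate
    population = parents + children                         # Survive: parents plus new children

best = max(population, key=fitness)
print(best, fitness(best))
```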
51. EA Example
- Goal: discover a new automotive engine to maximize performance, reliability, and mileage while minimizing emissions
- Features: CID (cubic inch displacement), fuel system, number of valves, number of cylinders, presence of turbo-charging
- Assume a test unit which tests possible engines and returns an integer measure of goodness
- Start with a population of random engines
52. (No Transcript)
53. (No Transcript)
54. Genetic Operators
- Crossover variations - multi-point, uniform probability, averaging, etc.
- Mutation - random changes in features; adaptive, different for each feature, etc.
- Others - many schemes mimicking natural genetics: dominance, selective mating, inversion, reordering, speciation, knowledge-based, etc.
- Reproduction - terminology - selection based on fitness - keep the best around - supported in the algorithms
- Critical to maintain a balance of diversity and quality in the population
55. Evolutionary Algorithms
- There exist mathematical proofs that evolutionary techniques are efficient search strategies
- There are a number of different evolutionary strategies
  - Genetic Algorithms
  - Evolutionary Programming
  - Evolution Strategies
  - Genetic Programming
- Strategies differ in representations, selection, operators, evaluation, etc.
- Most were independently discovered, initially for function optimization (EP, ES)
- Strategies continue to evolve
56. Genetic Algorithm Comments
- Much current work and extensions
- Numerous application attempts. Can plug into many algorithms requiring search. Has a built-in heuristic. Could augment with domain heuristics
- "Lazy Man's Solution" to any tough parameter search
57. Rule Induction
- Creates a set of symbolic rules to solve a classification problem
- Sequential Covering Algorithms
  - Until no good and significant rules can be created:
    - Create all first-order rules Ax -> Classy
    - Score each rule based on goodness (accuracy) and significance using the current training set
    - Iteratively (greedily) expand the best rules to n+1 attributes, score the new rules, and prune weak rules to keep the total candidate list at a fixed size (beam search)
    - Pick the one best rule and remove all instances from the training set that the rule covers
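A simplified sequential-covering sketch of the steps above. It scores rules by accuracy only and skips the beam search and significance test; the example format (attribute dictionary plus class label) is an assumption:

```python
def covers(rule, example):
    """A rule is a dict of attribute -> required value; the empty rule covers everything."""
    return all(example[0].get(a) == v for a, v in rule.items())

def accuracy(rule, target, examples):
    covered = [ex for ex in examples if covers(rule, ex)]
    if not covered:
        return 0.0
    return sum(1 for ex in covered if ex[1] == target) / len(covered)

def learn_one_rule(target, examples):
    """Greedily add attribute tests until the rule is pure or no test helps."""
    rule = {}
    while accuracy(rule, target, examples) < 1.0:
        candidates = [dict(rule, **{a: v}) for ex in examples for a, v in ex[0].items()
                      if a not in rule]
        best = max(candidates, key=lambda r: accuracy(r, target, examples), default=None)
        if best is None or accuracy(best, target, examples) <= accuracy(rule, target, examples):
            break                                   # no candidate improves the rule
        rule = best
    return rule

def sequential_covering(target, examples):
    rules, remaining = [], list(examples)
    while any(ex[1] == target for ex in remaining):
        rule = learn_one_rule(target, remaining)
        if accuracy(rule, target, remaining) < 1.0 and rules:
            break                                   # stop when no good rule can be created
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]  # remove covered instances
    return rules
```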
58. Rule Induction Variants
- Ordered rule lists (decision lists) - naturally supports multiple output classes
  - A=Green and B=Tall -> Class 1
  - A=Red and C=Fast -> Class 2
  - Else Class 1
- Placing new rules at the beginning or end of the list
- Unordered rule lists for each output class (must handle multiple matches)
- Rule induction can handle noise by no longer creating new rules when the gain is negligible or not statistically significant
59. Conclusion
- Many new algorithms and approaches being proposed
- Application areas rapidly increasing
- Amount of available data and information growing
- User desire for more adaptive and user-specific computer interaction
- This need for specific and adaptable user interaction will make machine learning a more important tool in user interface research and applications