Title: Machine Learning and Neural Networks
Professor Tony Martinez
Computer Science Department
Brigham Young University
http://axon.cs.byu.edu/~martinez
2. Tutorial Overview
- Introduction and Motivation
- Neural Network Model Descriptions
- Perceptron
- Backpropagation
- Issues
- Overfitting
- Applications
- Other Models
  - Decision Trees, Nearest Neighbor/IBL, Genetic Algorithms, Rule Induction, Ensembles
3. More Information
- You can download this presentation from
  ftp://axon.cs.byu.edu/pub/papers/NNML.ppt
- An excellent introductory text to Machine Learning: Machine Learning, Tom M. Mitchell, McGraw Hill, 1997
4. What is Inductive Learning?
- Gather a set of input-output examples from some application (the Training Set) - e.g. speech recognition, financial forecasting
- Train the learning model (neural network, etc.) on the training set until it solves it well
- The goal is to generalize on novel data not yet seen
- Gather a further set of input-output examples from the same application (the Test Set)
- Use the learning system on actual data
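A minimal sketch of this train/test workflow. The 80/20 split, the random shuffling, and the model's predict interface are assumptions for illustration, not from the slides:

```python
# Split gathered input-output examples into a training set and a test set,
# train any model on the first, and estimate generalization on the second.
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]          # (training set, test set)

def test_accuracy(model, test_set):
    """Fraction of novel (x, y) examples the trained model classifies correctly."""
    correct = sum(1 for x, y in test_set if model.predict(x) == y)
    return correct / len(test_set)
```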
5. Motivation
- Costs and errors in programming
- Our inability to program "subjective" problems
- General, easy-to-use mechanism for a large set of applications
- Improvement in application accuracy - Empirical
6. Example Application - Heart Attack Diagnosis
- The patient has a set of symptoms - age, type of pain, heart rate, blood pressure, temperature, etc.
- Given these symptoms in an Emergency Room setting, a doctor must diagnose whether a heart attack has occurred.
- How do you train a machine learning model to solve this problem using the inductive learning model?
  - Consistent approach
  - Knowledge of the ML approach is not critical
  - Need to select a reasonable set of input features
7. Examples and Discussion
- Loan Underwriting
- Which Input Features (Data)
- Divide into Training Set and Test Set
- Choose a learning model
- Train model on Training set
- Predict accuracy with the Test Set
- How to generalize better?
- Different Input Features
- Different Learning Model
- Issues
- Intuition vs. Prejudice
- Social Response
8. UC Irvine Machine Learning Data Base - Iris Data Set
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
9. Voting Records Data Base
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n
10. Machine Learning Sketch History
- Neural Networks - Connectionist - Biological Plausibility
  - Late 50s, early 60s: Rosenblatt, Perceptron
  - Minsky & Papert 1969 - The Lull, symbolic expansion
  - Late 80s - Backpropagation, Hopfield, etc. - The explosion
- Machine Learning - Artificial Intelligence - Symbolic - Psychological Plausibility
  - Samuel (1959) - Checkers evaluation strategies
  - 1970s and on - ID3, Instance Based Learning, Rule induction, ...
  - Currently symbolic and connectionist are lumped under ML
- Genetic Algorithms - 1970s
  - Originally lumped in with connectionist
  - Now an exploding area: Evolutionary Algorithms
11. Inductive Learning - Supervised
- Assume a set T of examples of the form (x,y) where x is a vector of features/attributes and y is a scalar or vector output
- By examining the examples, postulate a hypothesis H(x) -> y for arbitrary x
- Spectrum of Supervised Algorithms
- Unsupervised Learning
- Reinforcement Learning
12. Other Machine Learning Areas
- Case Based Reasoning
- Analogical Reasoning
- Speed-up Learning
- Inductive Learning is the most studied and successful to date
- Data Mining
- COLT - Computational Learning Theory
13. (No Transcript)
14. Perceptron Node - Threshold Logic Unit
[Diagram: inputs x1, x2, ..., xn with weights w1, w2, ..., wn feeding a threshold node with output Z]
15. Learning Algorithm
[Diagram: two-input perceptron with weights .4 on x1 and -.2 on x2, threshold .1, and output Z]
16. First Training Instance
[Diagram: input pattern (.8, .3) applied to the perceptron with weights .4 and -.2; output Z = 1]
Net = .8(.4) + .3(-.2) = .26
17. Second Training Instance
[Diagram: input pattern (.4, .1) applied to the perceptron with weights .4 and -.2; output Z = 1]
Net = .4(.4) + .1(-.2) = .14
Δwi = C(T - Z)xi
18. Delta Rule Learning
- Δwij = C(Tj - Zj) xi
- Create a network with n input and m output nodes
- Each iteration through the training set is an epoch
- Continue training until error is less than some epsilon
- Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
- As can be seen from the node activation function, the decision surface is an n-dimensional hyperplane
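A minimal code sketch of this perceptron/delta-rule training loop. The OR data, the learning rate C = 0.1, and the zero threshold are illustrative assumptions, not from the slides:

```python
# Perceptron training with the delta rule: Δw_i = C * (T - Z) * x_i
import numpy as np

def train_perceptron(X, T, C=0.1, epochs=10, threshold=0.0):
    """X: (n_samples, n_features) inputs, T: 0/1 targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                 # one pass through the data = one epoch
        for x, t in zip(X, T):
            net = np.dot(w, x)              # weighted sum of inputs
            z = 1 if net > threshold else 0 # threshold logic unit output
            w += C * (t - z) * x            # delta rule weight update
    return w

# Tiny linearly separable example (logical OR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 1])
print(train_perceptron(X, T))
```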
19. Linear Separability
20. Linear Separability and Generalization
When is data noise vs. a legitimate exception?
21. Limited Functionality of Hyperplane
22. Gradient Descent Learning
[Figure: error landscape - TSS (Total Sum Squared Error) plotted against weight values]
23. Deriving a Gradient Descent Learning Algorithm
- Goal: to decrease overall error (or other objective function) each time a weight is changed
- Total Sum Squared error: TSS = Σ (Ti - Zi)²
- Seek a weight-changing algorithm such that the resulting change in error is negative
- If a formula can be found then we have a gradient descent learning algorithm
- Perceptron/Delta rule is a gradient descent learning algorithm
- Linearly-separable problems have no local minima
24. Multi-layer Perceptron
- Can compute arbitrary mappings
- Assumes a non-linear activation function
- Training algorithms less obvious
- Backpropagation learning algorithm not exploited until the 1980s
- First of many powerful multi-layer learning algorithms
25. Responsibility Problem
[Figure: output = 1, wanted 0]
26. Multi-Layer Generalization
27. Backpropagation
- Multi-layer supervised learner
- Gradient descent weight updates
- Sigmoid activation function (smoothed threshold logic)
- Backpropagation requires a differentiable activation function
28. Multi-layer Perceptron Topology
[Figure: Input Layer -> Hidden Layer(s) -> Output Layer]
29. Backpropagation Learning Algorithm
- Until convergence (low error or other criteria) do
  - Present a training pattern
  - Calculate the error of the output nodes (based on T - Z)
  - Calculate the error of the hidden nodes (based on the error of the output nodes, which is propagated back to the hidden nodes)
  - Continue propagating error back until the input layer is reached
  - Update all weights based on the standard delta rule with the appropriate error function δ: Δwij = C δj Zi
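A minimal code sketch of the loop above, assuming one hidden layer, sigmoid activations, per-pattern (online) updates, and the XOR data as the training set; none of those specifics come from the slides:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, C = 2, 4, 0.5
W1 = rng.normal(0, 0.5, (n_in + 1, n_hid))    # input -> hidden weights (+1 row for bias)
W2 = rng.normal(0, 0.5, (n_hid + 1, 1))       # hidden -> output weights (+1 row for bias)

def forward(x):
    xb = np.append(x, 1.0)                    # bias input
    h = sigmoid(xb @ W1)                      # hidden activations
    hb = np.append(h, 1.0)                    # hidden bias
    z = sigmoid(hb @ W2)                      # output activations
    return xb, hb, z

for epoch in range(10000):
    for x, t in zip(X, T):
        xb, hb, z = forward(x)
        # Output error, then hidden error propagated back through the weights
        delta_out = (t - z) * z * (1 - z)                              # (T - Z) * f'(net)
        delta_hid = (delta_out @ W2[:-1].T) * hb[:-1] * (1 - hb[:-1])  # back-propagated error
        # Delta rule updates with the appropriate error term
        W2 += C * np.outer(hb, delta_out)
        W1 += C * np.outer(xb, delta_hid)

for x in X:
    print(x, np.round(forward(x)[2], 2))       # outputs should approach [0, 1, 1, 0]
```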
30. Activation Function and its Derivative
- Node activation function f(net) is typically the sigmoid: f(net) = 1 / (1 + e^-net)
- Derivative of the activation function is a critical part of the algorithm: f'(net) = f(net)(1 - f(net))
[Plots: the sigmoid rises from 0 to 1 (value .5 at net = 0) over net in [-5, 5]; its derivative peaks at .25 at net = 0 and falls toward 0 at net = ±5]
31. Backpropagation Learning Equations
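The equations themselves appear to have been an image on the original slide. The standard textbook form (e.g., Mitchell), consistent with the Δwij = C δj Zi update on slide 29, is:

```latex
% Standard backpropagation equations for a sigmoid network
% (w_{ij} is the weight from node i to node j; C is the learning rate).
\begin{align*}
\Delta w_{ij} &= C\,\delta_j\,Z_i \\
\delta_j &= (T_j - Z_j)\,f'(\mathrm{net}_j) &&\text{for output nodes} \\
\delta_j &= \Big(\sum_k \delta_k\,w_{jk}\Big)\,f'(\mathrm{net}_j) &&\text{for hidden nodes} \\
f'(\mathrm{net}_j) &= Z_j\,(1 - Z_j) &&\text{for the sigmoid}
\end{align*}
```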
32. Backpropagation Summary
- Excellent empirical results
- Scaling - the pleasant surprise
  - Local minima very rare as problem and network complexity increase
- Most common neural network approach
- User-defined parameters lead to more difficulty of use
  - Number of hidden nodes, layers, learning rate, etc.
- Many variants
  - Adaptive parameters, ontogenic (growing and pruning) learning algorithms
  - Higher order gradient descent (Newton, Conjugate Gradient, etc.)
  - Recurrent networks
33. Inductive Bias
- The approach used to decide how to generalize novel cases
- Occam's Razor: the simplest hypothesis which fits the data is usually the best. Still many remaining options:
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
  - A B C -> Z
- Now you receive the new input A B C. What is your output?
34. Overfitting
- Noise vs. Exceptions revisited
35. The Overfit Problem
[Figure: TSS vs. epochs - training-set error and validation/test-set error curves]
- Newer powerful models can have very complex decision surfaces which can converge well on most training sets by learning noisy and irrelevant aspects of the training set in order to minimize error (memorization in the limit)
- This makes them susceptible to overfit if not carefully considered
36. Avoiding Overfit
- Inductive bias - simplest accurate model
- More training data (vs. overtraining - one-epoch limit)
- Validation set (requires a separate test set)
- Backpropagation tends to build from a simple model (0 weights) to just large enough weights (validation set)
- Stopping criteria with any constructive model (accuracy increase vs. statistical significance) - noise vs. exceptions
- Specific techniques
  - Weight decay, pruning, jitter, regularization
  - Ensembles
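A minimal early-stopping sketch of the validation-set idea above; the train_epoch and error callables are hypothetical placeholders for whatever model-specific routines are in use:

```python
import copy

def train_with_early_stopping(model, train_epoch, error, train_set, val_set,
                              patience=10, max_epochs=1000):
    """train_epoch(model, data) runs one epoch; error(model, data) returns a scalar error."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch(model, train_set)            # one pass through the training set
        val_error = error(model, val_set)        # error on the held-out validation set
        if val_error < best_error:
            best_error = val_error
            best_model = copy.deepcopy(model)    # remember the best-so-far weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                # validation error stopped improving
    return best_model                            # this model is then scored on the test set
```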
37. Ensembles
- Many different ensemble approaches
  - Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking, Combinations
- Multiple diverse models are trained on the same problem and then their outputs are combined
- The specific overfit of each learning model is averaged out
- If models are diverse (uncorrelated errors) then even if the individual models are weak generalizers, the ensemble can be very accurate
[Diagram: models M1, M2, M3, ..., Mn feeding a combining technique]
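A minimal sketch of one combining technique, simple majority voting over already-trained models; the `models` list and their `.predict(x)` interface are assumptions for illustration:

```python
from collections import Counter

def ensemble_predict(models, x):
    """Combine the class predictions of diverse models by simple majority vote."""
    votes = [m.predict(x) for m in models]      # each model votes for a class
    return Counter(votes).most_common(1)[0][0]  # the most common class wins
```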
38. Application Issues
- Choose relevant features
- Normalize features
- Can learn to ignore irrelevant features, but will have to fight the curse of dimensionality
- The more data (training examples) the better
- Slower training is acceptable for complex and production applications if accuracy improves (the week phenomenon)
- Execution is normally fast regardless of training time
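For the "normalize features" point, a minimal min-max normalization sketch; the example feature values are made up:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) of X into [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / ranges

# Example: age, heart rate, temperature on very different scales
print(min_max_normalize([[25, 80, 36.6], [70, 120, 39.0], [50, 95, 37.2]]))
```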
39. Decision Trees - ID3/C4.5
- Top-down induction of decision trees
- Highly used and successful
- Attribute features - discrete nominal (mutually exclusive); real-valued features are discretized
- Search for the smallest tree is too complex (NP hard)
- C4.5 uses the common symbolic ML philosophy of a greedy iterative approach
40. Decision Tree Learning
- Mapping by hyper-rectangles
[Figure: the A1-A2 feature space partitioned into hyper-rectangles]
41. ID3 Learning Approach
- C is the current set of examples
- A test on attribute A partitions C into C1, C2, ..., Cw where w is the number of values of A
[Diagram: Attribute = Color partitions C into C1 (Red), C2 (Green), C3 (Purple)]
42. Decision Tree Learning Algorithm
- Start with the Training Set as C and test how each attribute partitions C
- Choose the best A for the root
  - The goodness measure is based on how well attribute A divides C into different output classes. A perfect attribute would divide C into partitions that contain only one output class each. A poor (irrelevant) attribute would leave each partition with the same ratio of classes as in C
  - 20-questions analogy: good questions quickly minimize the possibilities
- Continue recursively until sets are unambiguously classified or a stopping criterion is reached
43. ID3 Example and Discussion

  Temperature   P  N        Humidity   P  N
  Hot           2  2        High       3  4
  Mild          4  2        Normal     6  1
  Cool          3  1
  Gain = .029              Gain = .151

- 14 examples. Uses Information Gain. Attributes which best discriminate between classes are chosen
- If the same ratios are found in a partitioned set, then gain is 0
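A sketch of the information-gain calculation behind the table above, using the class counts shown there (9 P and 5 N examples overall):

```python
from math import log2

def entropy(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    if total == 0 or p == 0 or n == 0:
        return 0.0
    return -(p / total) * log2(p / total) - (n / total) * log2(n / total)

def information_gain(partitions):
    """partitions: list of (p, n) counts, one per attribute value."""
    p_total = sum(p for p, _ in partitions)
    n_total = sum(n for _, n in partitions)
    total = p_total + n_total
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(p_total, n_total) - remainder

print(information_gain([(2, 2), (4, 2), (3, 1)]))  # Temperature gain (slide reports .029)
print(information_gain([(3, 4), (6, 1)]))          # Humidity gain (slide reports .151)
```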
44. ID3 - Conclusions
- Good empirical results
- Comparable application robustness and accuracy with neural networks - faster learning (though NNs are more natural with continuous features, both input and output)
- Most used and well known of current symbolic systems - used widely to aid in creating rules for expert systems
45. Nearest Neighbor Learners
- Broad spectrum
  - Basic k-NN, Instance Based Learning, Case Based Reasoning, Analogical Reasoning
- Simply store all, or some representative subset, of the examples in the training set
- Generalize on the fly rather than use a pre-acquired hypothesis - faster learning, slower execution, information retained, memory intensive
46. Nearest Neighbor Algorithms
47Nearest Neighbor Variations
- How many examples to store
- How do stored example vote (distance weighted,
etc.) - Can we choose a smaller set of near-optimal
examples (prototypes/exemplars) - Storage reduction
- Faster execution
- Noise robustness
- Distance Metrics non-Euclidean
- Irrelevant Features Feature weighting
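A minimal k-nearest-neighbor sketch with Euclidean distance and simple majority voting; the stored examples and k = 3 are illustrative assumptions:

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by a majority vote of its k nearest stored examples."""
    dists = np.linalg.norm(np.asarray(train_X) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    votes = [train_y[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]

# Example with two stored classes
train_X = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.8]]
train_y = ["setosa", "setosa", "virginica", "virginica"]
print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # -> "setosa"
```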
48. Evolutionary Computation/Algorithms - Genetic Algorithms
- Simulate natural evolution of structures via selection and reproduction, based on performance (fitness)
- Type of heuristic search - discovery, not inductive in isolation
- Genetic operators - Recombination (Crossover) and Mutation are most common
  - Parent 1: 1 1 0 2 3 1 0 2 2 1 (Fitness: 10)
  - Parent 2: 2 2 0 1 1 3 1 1 0 0 (Fitness: 12)
  - Child:    2 2 0 1 3 1 0 2 2 1 (Fitness calculated, or f(parents))
49Evolutionary Algorithms
- Start with initialized population P(t) - random,
domain- knowledge, etc. - Population usually made up of possible parameter
settings for a complex problem - Typically have fixed population size (like beam
search) - Selection
- Parent_Selection P(t) - Promising Parents used to
create new children - Survive P(t) - Pruning of unpromising candidates
- Evaluate P(t) - Calculate fitness of population
members. Ranges from simple metrics to complex
simulations.
50. Evolutionary Algorithm
- Procedure EA
  - t = 0
  - Initialize Population P(t)
  - Evaluate P(t)
  - Until Done  /* sufficiently good individuals discovered */
    - t = t + 1
    - Parent_Selection P(t)
    - Recombine P(t)
    - Mutate P(t)
    - Evaluate P(t)
    - Survive P(t)
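A minimal Python sketch of this loop, using integer-string chromosomes like those on slide 48; the fitness function, gene alphabet, population size, and generation count are illustrative assumptions:

```python
import random

GENES, LENGTH, POP_SIZE = [0, 1, 2, 3], 10, 20

def fitness(chrom):
    return sum(chrom)                       # placeholder fitness: favor large gene values

def crossover(p1, p2):
    point = random.randint(1, LENGTH - 1)   # one-point recombination
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    return [random.choice(GENES) if random.random() < rate else g for g in chrom]

population = [[random.choice(GENES) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
for t in range(100):                                        # "Until Done": fixed generations here
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:POP_SIZE // 2]                        # Parent_Selection: keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE // 2)]              # Recombine and Mutate
    population = parents + children                         # Survive: parents plus new children

best = max(population, key=fitness)
print(best, fitness(best))
```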
51. EA Example
- Goal: discover a new automotive engine to maximize performance, reliability, and mileage while minimizing emissions
- Features: CID (cubic inch displacement), fuel system, number of valves, number of cylinders, presence of turbo-charging
- Assume a test unit which tests possible engines and returns an integer measure of goodness
- Start with a population of random engines
52. (No Transcript)
53. (No Transcript)
54. Genetic Operators
- Crossover variations - multi-point, uniform probability, averaging, etc.
- Mutation - random changes in features; adaptive, different for each feature, etc.
- Others - many schemes mimicking natural genetics: dominance, selective mating, inversion, reordering, speciation, knowledge-based, etc.
- Reproduction - terminology - selection based on fitness - keep the best around - supported in the algorithms
- Critical to maintain a balance of diversity and quality in the population
55. Evolutionary Algorithms
- There exist mathematical proofs that evolutionary techniques are efficient search strategies
- There are a number of different evolutionary strategies
  - Genetic Algorithms
  - Evolutionary Programming
  - Evolution Strategies
  - Genetic Programming
- Strategies differ in representations, selection, operators, evaluation, etc.
- Most were independently discovered, initially for function optimization (EP, ES)
- Strategies continue to evolve
56. Genetic Algorithm Comments
- Much current work and extensions
- Numerous application attempts. Can plug into many algorithms requiring search. Has a built-in heuristic. Could augment with domain heuristics
- "Lazy Man's Solution" to any tough parameter search
57. Rule Induction
- Creates a set of symbolic rules to solve a classification problem
- Sequential Covering Algorithms
  - Until no good and significant rules can be created:
    - Create all first-order rules Ax -> Classy
    - Score each rule based on goodness (accuracy) and significance using the current training set
    - Iteratively (greedily) expand the best rules to n+1 attributes, score the new rules, and prune weak rules to keep the total candidate list at a fixed size (beam search)
    - Pick the one best rule and remove all instances from the training set that the rule covers
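A simplified sequential-covering sketch of the steps above. It scores rules by accuracy only and skips the beam search and significance test; the example format (attribute dictionary plus class label) is an assumption:

```python
def covers(rule, example):
    """A rule is a dict of attribute -> required value; the empty rule covers everything."""
    return all(example[0].get(a) == v for a, v in rule.items())

def accuracy(rule, target, examples):
    covered = [ex for ex in examples if covers(rule, ex)]
    if not covered:
        return 0.0
    return sum(1 for ex in covered if ex[1] == target) / len(covered)

def learn_one_rule(target, examples):
    """Greedily add attribute tests until the rule is pure or no test helps."""
    rule = {}
    while accuracy(rule, target, examples) < 1.0:
        candidates = [dict(rule, **{a: v}) for ex in examples for a, v in ex[0].items()
                      if a not in rule]
        best = max(candidates, key=lambda r: accuracy(r, target, examples), default=None)
        if best is None or accuracy(best, target, examples) <= accuracy(rule, target, examples):
            break                                   # no candidate improves the rule
        rule = best
    return rule

def sequential_covering(target, examples):
    rules, remaining = [], list(examples)
    while any(ex[1] == target for ex in remaining):
        rule = learn_one_rule(target, remaining)
        if accuracy(rule, target, remaining) < 1.0 and rules:
            break                                   # stop when no good rule can be created
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]  # remove covered instances
    return rules
```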
58. Rule Induction Variants
- Ordered rule lists (decision lists) - naturally supports multiple output classes
  - A=Green and B=Tall -> Class 1
  - A=Red and C=Fast -> Class 2
  - Else Class 1
- Placing new rules at the beginning or end of the list
- Unordered rule lists for each output class (must handle multiple matches)
- Rule induction can handle noise by no longer creating new rules when the gain is negligible or not statistically significant
59. Conclusion
- Many new algorithms and approaches being proposed
- Application areas rapidly increasing
- Amount of available data and information growing
- User desire for more adaptive and user-specific computer interaction
- This need for specific and adaptable user interaction will make machine learning a more important tool in user interface research and applications