1
Machine Learning and Neural Networks
Professor Tony Martinez
Computer Science Department
Brigham Young University
http://axon.cs.byu.edu/~martinez
2
Tutorial Overview
  • Introduction and Motivation
  • Neural Network Model Descriptions
  • Perceptron
  • Backpropagation
  • Issues
  • Overfitting
  • Applications
  • Other Models
  • Decision Trees, Nearest Neighbor/IBL, Genetic
    Algorithms, Rule Induction, Ensembles

3
More Information
  • You can download this presentation from
  • ftp://axon.cs.byu.edu/pub/papers/NNML.ppt
  • An excellent introductory text to Machine
    Learning
  • Machine Learning, Tom M. Mitchell, McGraw Hill,
    1997

4
What is Inductive Learning?
  • Gather a set of input-output examples from some
    application: the Training Set
  • e.g., speech recognition, financial forecasting
  • Train the learning model (neural network, etc.)
    on the training set until it solves it well
  • The goal is to generalize on novel data not yet
    seen
  • Gather a further set of input-output examples
    from the same application: the Test Set
  • Use the learning system on actual data

5
Motivation
  • Costs and Errors in Programming
  • Our inability to program "subjective" problems
  • General, easy-to-use mechanism for a large set of
    applications
  • Improvement in application accuracy - Empirical

6
Example Application - Heart Attack Diagnosis
  • The patient has a set of symptoms - Age, type of
    pain, heart rate, blood pressure, temperature,
    etc.
  • Given these symptoms in an Emergency Room
    setting, a doctor must diagnose whether a heart
    attack has occurred.
  • How do you train a machine learning model to
    solve this problem using the inductive learning
    approach?
  • Consistent approach
  • Knowledge of ML approach not critical
  • Need to select a reasonable set of input features

7
Examples and Discussion
  • Loan Underwriting
  • Which Input Features (Data)
  • Divide into Training Set and Test Set
  • Choose a learning model
  • Train model on Training set
  • Predict accuracy with the Test Set
  • How to generalize better?
  • Different Input Features
  • Different Learning Model
  • Issues
  • Intuition vs. Prejudice
  • Social Response

8
UC Irvine Machine Learning Database - Iris Data Set
4.8,3.0,1.4,0.3, Iris-setosa
5.1,3.8,1.6,0.2, Iris-setosa
4.6,3.2,1.4,0.2, Iris-setosa
5.3,3.7,1.5,0.2, Iris-setosa
5.0,3.3,1.4,0.2, Iris-setosa
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
6.9,3.1,4.9,1.5, Iris-versicolor
5.5,2.3,4.0,1.3, Iris-versicolor
6.5,2.8,4.6,1.5, Iris-versicolor
6.0,2.2,5.0,1.5, Iris-virginica
6.9,3.2,5.7,2.3, Iris-virginica
5.6,2.8,4.9,2.0, Iris-virginica
7.7,2.8,6.7,2.0, Iris-virginica
6.3,2.7,4.9,1.8, Iris-virginica
9
Voting Records Data Base
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n
republican,n,y,n,y,y,y,n,n,n,n,y,?,y,y,?,?
democrat,n,y,y,n,n,n,y,y,y,n,n,n,y,n,?,?
democrat,y,y,y,n,n,y,y,y,?,y,y,?,n,n,y,?
republican,n,y,n,y,y,y,n,n,n,n,n,y,?,?,n,?
republican,n,y,n,y,y,y,n,n,n,y,n,y,y,?,n,?
democrat,y,n,y,n,n,y,n,y,?,y,y,y,?,n,n,y
democrat,y,?,y,n,n,n,y,y,y,n,n,n,y,n,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,?,y,y,n,n
10
Machine Learning Sketch History
  • Neural Networks - Connectionist - Biological
    Plausibility
  • Late 50s, early 60s, Rosenblatt, Perceptron
  • Minsky & Papert 1969 - The Lull, symbolic
    expansion
  • Late 80s - Backpropagation, Hopfield, etc. - The
    explosion
  • Machine Learning - Artificial Intelligence -
    Symbolic - Psychological Plausibility
  • Samuel (1959) - Checkers evaluation strategies
  • 1970s and on - ID3, Instance Based Learning,
    Rule induction,
  • Currently, symbolic and connectionist approaches
    are lumped under ML
  • Genetic Algorithms - 1970s
  • Originally lumped in with connectionist models
  • Now an exploding area: Evolutionary Algorithms

11
Inductive Learning - Supervised
  • Assume a set T of examples of the form (x,y)
    where x is a vector of features/attributes and y
    is a scalar or vector output
  • By examining the examples, postulate a hypothesis
    H(x) => y for arbitrary x
  • Spectrum of Supervised Algorithms
  • Unsupervised Learning
  • Reinforcement Learning

12
Other Machine Learning Areas
  • Case Based Reasoning
  • Analogical Reasoning
  • Speed-up Learning
  • Inductive Learning is the most studied and
    successful to date
  • Data Mining
  • COLT Computational Learning Theory

13
(No Transcript)
14
Perceptron Node Threshold Logic Unit
[Diagram: perceptron node with inputs x1 ... xn, weights w1 ... wn, and output Z]
15
Learning Algorithm
[Diagram: two-input perceptron node with inputs x1 and x2; the values .4, .1, and -.2 label its weights/threshold; output Z]
16
First Training Instance
[Diagram: the perceptron applied to the first training instance, x = (.8, .3); output Z = 1]
Net = .8(.4) + .3(-.2) = .26
17
Second Training Instance
[Diagram: the perceptron applied to the second training instance, x = (.4, .1)]
Net = .4(.4) + .1(-.2) = .14
Δwi = C (T - Z) xi
18
Delta Rule Learning
  • Δwij = C (Tj - Zj) xi
  • Create a network with n input and m output nodes
  • Each iteration through the training set is an
    epoch
  • Continue training until error is less than some
    epsilon
  • Perceptron Convergence Theorem: guaranteed to
    find a solution in finite time if a solution
    exists
  • As can be seen from the node activation function,
    the decision surface is an n-dimensional
    hyperplane (see the sketch below)
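A minimal sketch of delta-rule training for a single threshold unit, assuming a 0/1 step activation, a bias input standing in for the threshold, and an illustrative learning rate C = 0.1 and toy data not taken from the slides.

    import numpy as np

    def train_perceptron(X, T, C=0.1, epochs=20):
        """Delta rule: w_i += C * (t - z) * x_i for each training pattern."""
        X = np.hstack([X, np.ones((len(X), 1))])   # append a bias input of 1
        w = np.zeros(X.shape[1])                   # start from zero weights
        for _ in range(epochs):                    # each pass through the set is an epoch
            for x, t in zip(X, T):
                z = 1.0 if x @ w > 0 else 0.0      # threshold logic unit output
                w += C * (t - z) * x               # delta rule weight update
        return w

    # Toy linearly separable problem (logical OR)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    T = np.array([0, 1, 1, 1])
    print(train_perceptron(X, T))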

19
Linear Separability
20
Linear Separability and Generalization
When is data noise vs. a legitimate exception?
21
Limited Functionality of Hyperplane
22
Gradient Descent Learning
[Figure: error landscape - TSS (total sum squared error) plotted against the weight values]
23
Deriving a Gradient Descent Learning Algorithm
  • Goal: to decrease overall error (or other
    objective function) each time a weight is changed
  • Total sum squared error: SSE = Σ (Ti - Zi)²
  • Seek a weight-changing algorithm such that the
    resulting change in error is negative
  • If such a formula can be found then we have a
    gradient descent learning algorithm
  • The Perceptron/Delta rule is a gradient descent
    learning algorithm
  • Linearly-separable problems have no local minima
    (see the sketch below)
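A small illustrative sketch (not from the slides) of gradient descent on the SSE of a linear unit, using the analytic gradient ∂E/∂w = -2 Σ (Ti - Zi) xi; the data and learning rate are made up for illustration.

    import numpy as np

    def gradient_descent_sse(X, T, lr=0.01, epochs=200):
        """Minimize SSE = sum((T - Xw)^2) by stepping against its gradient."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            Z = X @ w                      # linear unit outputs
            grad = -2 * X.T @ (T - Z)      # dE/dw for the SSE objective
            w -= lr * grad                 # move downhill on the error landscape
        return w

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    T = np.array([1.0, 2.0, 3.0])
    print(gradient_descent_sse(X, T))      # approaches w = [1, 2]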

24
Multi-layer Perceptron
  • Can compute arbitrary mappings
  • Assumes a non-linear activation function
  • Training Algorithms less obvious
  • Backpropagation learning algorithm not exploited
    until 1980s
  • First of many powerful multi-layer learning
    algorithms

25
Responsibility Problem
Output = 1, Wanted = 0
26
Multi-Layer Generalization
27
Backpropagation
  • Multi-layer supervised learner
  • Gradient Descent weight updates
  • Sigmoid activation function (smoothed threshold
    logic)
  • Backpropagation requires a differentiable
    activation function

28
Multi-layer Perceptron Topology
Input Layer Hidden Layer(s) Output Layer
29
Backpropagation Learning Algorithm
  • Until Convergence (low error or other criteria)
    do
  • Present a training pattern
  • Calculate the error of the output nodes (based on
    T - Z)
  • Calculate the error of the hidden nodes (based on
    the error of the output nodes which is propagated
    back to the hidden nodes)
  • Continue propagating error back until the input
    layer is reached
  • Update all weights based on the standard delta
    rule with the appropriate error term δ
  • Δwij = C δj Zi (see the sketch below)
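A compact sketch of the propagate-error-back-then-update loop described above. Assumed details not taken from the slides: one hidden layer with two nodes, sigmoid activations, learning rate C = 0.5, and toy XOR data.

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(3, 2))   # input (+bias) -> hidden weights
    W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden (+bias) -> output weights
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)
    C = 0.5

    for epoch in range(10000):
        for x, t in zip(X, T):
            # Forward pass: present a training pattern
            xb = np.append(x, 1.0)                    # add bias input
            h = sigmoid(xb @ W1)
            hb = np.append(h, 1.0)
            z = sigmoid(hb @ W2)
            # Output-node error, then hidden-node error propagated back
            d_out = (t - z) * z * (1 - z)
            d_hid = (W2[:2] @ d_out) * h * (1 - h)
            # Delta-rule updates: Δw_ij = C * δ_j * Z_i
            W2 += C * np.outer(hb, d_out)
            W1 += C * np.outer(xb, d_hid)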

30
Activation Function and its Derivative
  • Node activation function f(net) is typically the
    sigmoid
  • The derivative of the activation function is a
    critical part of the algorithm

[Plots: the sigmoid f(net) rising from 0 to 1 over net in [-5, 5], with value .5 at net = 0; its derivative, peaking at .25 at net = 0]
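A one-line sketch of the sigmoid and its derivative f'(net) = f(net)(1 - f(net)), which peaks at .25 as in the plot above.

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def sigmoid_prime(net):
        f = sigmoid(net)
        return f * (1.0 - f)                    # maximum value .25 at net = 0

    print(sigmoid(0.0), sigmoid_prime(0.0))     # 0.5 0.25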
31
Backpropagation Learning Equations
32
Backpropagation Summary
  • Excellent Empirical results
  • Scaling: the pleasant surprise
  • Local minima are very rare as problem and network
    complexity increase
  • Most common neural network approach
  • User-defined parameters make it more difficult to
    use
  • Number of hidden nodes, layers, learning rate,
    etc.
  • Many variants
  • Adaptive Parameters, Ontogenic (growing and
    pruning) learning algorithms
  • Higher order gradient descent (Newton, Conjugate
    Gradient, etc.)
  • Recurrent networks

33
Inductive Bias
  • The approach used to decide how to generalize
    novel cases
  • Occam's Razor: the simplest hypothesis which
    fits the data is usually the best - still many
    remaining options
  • A B C -> Z
  • A B C -> Z
  • A B C -> Z
  • A B C -> Z
  • A B C -> Z
  • Now you receive the new input A B C. What is
    your output?

34
Overfitting
  • Noise vs. Exceptions revisited

35
The Overfit Problem
[Figure: TSS vs. epochs - training-set error keeps decreasing while validation/test-set error eventually rises]
  • Newer, more powerful models can have very complex
    decision surfaces; they converge well on most
    training sets by learning noisy and irrelevant
    aspects of the training set in order to minimize
    error (memorization in the limit)
  • This makes them susceptible to overfitting if not
    handled carefully

36
Avoiding Overfit
  • Inductive bias: the simplest accurate model
  • More training data (vs. overtraining - one-epoch
    limit)
  • Validation set (requires a separate test set)
  • Backpropagation tends to build from a simple
    model (0 weights) to just-large-enough weights
    (validation set)
  • Stopping criteria with any constructive model
    (accuracy increase vs. statistical significance):
    noise vs. exceptions
  • Specific techniques
  • Weight decay, pruning, jitter, regularization
  • Ensembles (see the sketch below)
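A minimal sketch of validation-set early stopping as it might be wrapped around any iterative trainer; train_one_epoch, error_on, and the model's get_weights/set_weights are hypothetical stand-ins for the learner's own routines.

    def early_stopping(model, train_set, val_set, patience=10, max_epochs=1000):
        """Stop when validation error has not improved for `patience` epochs."""
        best_err, best_weights, since_best = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model, train_set)       # hypothetical training step
            err = error_on(model, val_set)          # hypothetical TSS on the validation set
            if err < best_err:
                best_err, best_weights, since_best = err, model.get_weights(), 0
            else:
                since_best += 1
                if since_best >= patience:          # validation error rising: overfit
                    break
        model.set_weights(best_weights)             # roll back to the best point seen
        return model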

37
Ensembles
  • Many different Ensemble approaches
  • Stacking, Gating/Mixture of Experts, Bagging,
    Boosting, Wagging, Mimicking, Combinations
  • Multiple diverse models trained on same problem
    and then their outputs are combined
  • The specific overfit of each learning model is
    averaged out
  • If models are diverse (uncorrelated errors) then
    even if the individual models are weak
    generalizers, the ensemble can be very accurate

[Diagram: diverse models M1, M2, M3, ..., Mn feeding their outputs into a combining technique]
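A small sketch of one of the listed approaches, bagging with majority voting; LearnerClass and its fit/predict methods are hypothetical placeholders for any base learner.

    import random
    from collections import Counter

    def bagging_ensemble(LearnerClass, data, n_models=10):
        """Train n diverse models on bootstrap resamples of the same training set."""
        models = []
        for _ in range(n_models):
            sample = [random.choice(data) for _ in data]   # bootstrap resample
            m = LearnerClass()
            m.fit(sample)                                  # hypothetical training call
            models.append(m)
        return models

    def ensemble_predict(models, x):
        """Majority vote: each model's specific overfit tends to average out."""
        votes = [m.predict(x) for m in models]             # hypothetical predict call
        return Counter(votes).most_common(1)[0][0]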
38
Application Issues
  • Choose relevant features
  • Normalize features
  • Models can learn to ignore irrelevant features,
    but will have to fight the curse of dimensionality
  • The more data (training examples), the better
  • Slower training is acceptable for complex and
    production applications if it improves accuracy
    (the "week" phenomenon)
  • Execution is normally fast regardless of training
    time

39
Decision Trees - ID3/C4.5
  • Top-down induction of decision trees
  • Highly used and successful
  • Attribute features - discrete/nominal (mutually
    exclusive); real-valued features are discretized
  • Search for the smallest tree is too complex
    (NP-hard)
  • C4.5 uses the common symbolic ML philosophy of a
    greedy iterative approach

40
Decision Tree Learning
  • Mapping by Hyper-Rectangles

[Figure: the input space over attributes A1 and A2 partitioned into hyper-rectangles]
41
ID3 Learning Approach
  • C is the current set of examples
  • A test on attribute A partitions C into C1,
    C2, ..., Cw where w is the number of values of A

[Diagram: the attribute Color partitions C into C1 (Red), C2 (Green), and C3 (Purple)]
42
Decision Tree Learning Algorithm
  • Start with the Training Set as C and test how
    each attribute partitions C
  • Choose the best A for the root
  • The goodness measure is based on how well
    attribute A divides C into different output
    classes: a perfect attribute would divide C into
    partitions that contain only one output class
    each; a poor (irrelevant) attribute would leave
    each partition with the same ratio of classes as
    in C
  • 20-questions analogy: good questions quickly
    minimize the possibilities
  • Continue recursively until sets are unambiguously
    classified or a stopping criterion is reached

43
ID3 Example and Discussion
Temperature   P   N          Humidity   P   N
Hot           2   2          High       3   4
Mild          4   2          Normal     6   1
Cool          3   1
Gain = .029                  Gain = .151
  • 14 Examples. Uses Information Gain. Attributes
    which best discriminate between classes are
    chosen
  • If the same class ratios are found in the
    partitioned sets, then the gain is 0 (a sketch of
    the gain computation follows)
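A short sketch of the information-gain computation behind the table above, using the standard 14-example counts (9 P, 5 N overall); it reproduces Gain ≈ .151 for Humidity and ≈ .029 for Temperature.

    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    def gain(parent_counts, partitions):
        """Information gain = parent entropy - weighted entropy of the partitions."""
        total = sum(parent_counts)
        remainder = sum(sum(p) / total * entropy(p) for p in partitions)
        return entropy(parent_counts) - remainder

    parent = [9, 5]                                   # 9 P and 5 N examples overall
    print(gain(parent, [[3, 4], [6, 1]]))             # Humidity: High, Normal  -> ~0.151
    print(gain(parent, [[2, 2], [4, 2], [3, 1]]))     # Temperature: Hot, Mild, Cool -> ~0.029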

44
ID3 - Conclusions
  • Good Empirical Results
  • Comparable application robustness and accuracy
    to neural networks - faster learning (though
    NNs are more natural with continuous features -
    both input and output)
  • Most used and well known of current symbolic
    systems - used widely to aid in creating rules
    for expert systems

45
Nearest Neighbor Learners
  • Broad Spectrum
  • Basic K-NN, Instance Based Learning, Case Based
    Reasoning, Analogical Reasoning
  • Simply store all or some representative subset of
    the examples in the training set
  • Generalize on the fly rather than use
    pre-acquired hypothesis - faster learning, slower
    execution, information retained, memory intensive

46
Nearest Neighbor Algorithms
47
Nearest Neighbor Variations
  • How many examples to store
  • How do stored examples vote (distance-weighted,
    etc.)
  • Can we choose a smaller set of near-optimal
    examples (prototypes/exemplars)?
  • Storage reduction
  • Faster execution
  • Noise robustness
  • Distance metrics: non-Euclidean
  • Irrelevant features: feature weighting (see the
    sketch below)
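A minimal sketch of the basic k-NN generalize-on-the-fly idea with distance-weighted voting; the Euclidean metric, k = 3, and the tiny data set are just illustrative defaults.

    import numpy as np
    from collections import defaultdict

    def knn_predict(X_train, y_train, x, k=3):
        """Store everything; at query time let the k nearest examples vote,
        weighted by inverse distance."""
        dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distance metric
        nearest = np.argsort(dists)[:k]
        votes = defaultdict(float)
        for i in nearest:
            votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)  # distance-weighted vote
        return max(votes, key=votes.get)

    X_train = np.array([[5.0, 3.4], [4.9, 3.1], [6.7, 3.1], [6.3, 2.5]])
    y_train = ["setosa", "setosa", "versicolor", "versicolor"]
    print(knn_predict(X_train, y_train, np.array([5.1, 3.3])))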

48
Evolutionary Computation/Algorithms - Genetic Algorithms
  • Simulate natural evolution of structures via
    selection and reproduction, based on performance
    (fitness)
  • Type of Heuristic Search - Discovery, not
    inductive in isolation
  • Genetic Operators - Recombination (Crossover) and
    Mutation are most common
  • Parent 1:  1 1 0 2 3 1 0 2 2 1 (Fitness 10)
  • Parent 2:  2 2 0 1 1 3 1 1 0 0 (Fitness 12)
  • Child:     2 2 0 1 3 1 0 2 2 1 (Fitness
    calculated, or a function of the parents'
    fitnesses) - see the crossover sketch below
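A sketch of the two most common genetic operators named above, one-point crossover and per-gene mutation; a crossover point of 4 reproduces the child string in the example, and the mutation rate is an illustrative choice.

    import random

    def crossover(parent_a, parent_b, point):
        """One-point crossover: take the first `point` genes from one parent,
        the rest from the other."""
        return parent_b[:point] + parent_a[point:]

    def mutate(chromosome, rate=0.05, alphabet=(0, 1, 2, 3)):
        """Random changes in features, each gene with a small probability."""
        return [random.choice(alphabet) if random.random() < rate else g
                for g in chromosome]

    p1 = [1, 1, 0, 2, 3, 1, 0, 2, 2, 1]
    p2 = [2, 2, 0, 1, 1, 3, 1, 1, 0, 0]
    child = crossover(p1, p2, 4)
    print(child)          # [2, 2, 0, 1, 3, 1, 0, 2, 2, 1] as in the slide
    print(mutate(child))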

49
Evolutionary Algorithms
  • Start with an initialized population P(t) -
    random, domain knowledge, etc.
  • Population usually made up of possible parameter
    settings for a complex problem
  • Typically have fixed population size (like beam
    search)
  • Selection
  • Parent_Selection P(t) - Promising Parents used to
    create new children
  • Survive P(t) - Pruning of unpromising candidates
  • Evaluate P(t) - Calculate fitness of population
    members. Ranges from simple metrics to complex
    simulations.

50
Evolutionary Algorithm
  • Procedure EA
  • t = 0
  • Initialize Population P(t)
  • Evaluate P(t)
  • Until Done /* sufficiently good individuals
    discovered */
  • t = t + 1
  • Parent_Selection P(t)
  • Recombine P(t)
  • Mutate P(t)
  • Evaluate P(t)
  • Survive P(t)
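A runnable sketch of the EA procedure above, with assumed choices not specified in the slides: integer chromosomes, tournament parent selection, one-point crossover, per-gene mutation, truncation survival, and a toy fitness (sum of genes).

    import random

    GENES, POP, ALPHABET = 10, 20, (0, 1, 2, 3)

    def fitness(ind):                      # toy evaluation; real EAs may run full simulations
        return sum(ind)

    def parent_selection(pop, k=3):        # tournament: best of k random members
        return max(random.sample(pop, k), key=fitness)

    def recombine(a, b):                   # one-point crossover
        pt = random.randint(1, GENES - 1)
        return a[:pt] + b[pt:]

    def mutate(ind, rate=0.05):            # random per-gene changes
        return [random.choice(ALPHABET) if random.random() < rate else g for g in ind]

    def survive(pop):                      # prune unpromising candidates, keep pop size fixed
        return sorted(pop, key=fitness, reverse=True)[:POP]

    population = [[random.choice(ALPHABET) for _ in range(GENES)] for _ in range(POP)]
    for t in range(100):                   # "until done" (here: a fixed number of generations)
        children = [mutate(recombine(parent_selection(population),
                                     parent_selection(population)))
                    for _ in range(POP)]
        population = survive(population + children)
    print(max(population, key=fitness))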

51
EA Example
  • Goal: discover a new automotive engine that
    maximizes performance, reliability, and mileage
    while minimizing emissions
  • Features: CID (cubic inch displacement), fuel
    system, number of valves, number of cylinders,
    presence of turbo-charging
  • Assume a test unit which tests possible engines
    and returns an integer measure of goodness
  • Start with a population of random engines

52
(No Transcript)
53
(No Transcript)
54
Genetic Operators
  • Crossover variations - multi-point, uniform
    probability, averaging, etc.
  • Mutation - Random changes in features, adaptive,
    different for each feature, etc.
  • Others - many schemes mimicking natural
    genetics: dominance, selective mating, inversion,
    reordering, speciation, knowledge-based, etc.
  • Reproduction - terminology - selection based on
    fitness - keep best around - supported in the
    algorithms
  • Critical to maintain balance of diversity and
    quality in the population

55
Evolutionary Algorithms
  • There exist mathematical proofs that evolutionary
    techniques are efficient search strategies
  • There are a number of different Evolutionary
    strategies
  • Genetic Algorithms
  • Evolutionary Programming
  • Evolution Strategies
  • Genetic Programming
  • The strategies differ in representations,
    selection, operators, evaluation, etc.
  • Most were discovered independently; EP and ES
    began as function optimization methods
  • Strategies continue to evolve

56
Genetic Algorithm Comments
  • Much current work and extensions
  • Numerous application attempts. Can plug into
    many algorithms requiring search. Has built-in
    heuristic. Could augment with domain heuristics
  • The "lazy man's" solution to any tough parameter
    search

57
Rule Induction
  • Creates a set of symbolic rules to solve a
    classification problem
  • Sequential Covering Algorithms
  • Until no good and significant rules can be
    created
  • Create all first-order rules: Ax -> Classy
  • Score each rule based on goodness (accuracy) and
    significance using the current training set
  • Iteratively (greedily) expand the best rules to
    n+1 attributes, score the new rules, and prune
    weak rules to keep the total candidate list at a
    fixed size (beam search)
  • Pick the single best rule and remove all
    instances from the training set that the rule
    covers (a sketch of this loop follows)
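A compact sketch of the sequential covering loop described above; first_order_rules, specialize, score, and covers are hypothetical helpers standing in for the rule representation, the goodness/significance measure, and the coverage test.

    def sequential_covering(training_set, beam_width=10, min_score=0.0):
        """Learn rules one at a time, removing covered instances after each."""
        learned_rules = []
        while training_set:
            beam = first_order_rules(training_set)     # hypothetical: all Ax -> Classy rules
            best = max(beam, key=lambda r: score(r, training_set))
            while beam:
                # Greedily extend each candidate by one attribute,
                # keep only the best few (beam search)
                expanded = [r2 for r in beam for r2 in specialize(r, training_set)]
                beam = sorted(expanded, key=lambda r: score(r, training_set),
                              reverse=True)[:beam_width]
                if beam and score(beam[0], training_set) > score(best, training_set):
                    best = beam[0]
                else:
                    break
            if score(best, training_set) <= min_score: # no good and significant rule remains
                break
            learned_rules.append(best)
            # Remove the instances the chosen rule covers
            training_set = [x for x in training_set if not covers(best, x)]
        return learned_rules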

58
Rule Induction Variants
  • Ordered Rule lists (decision lists) - naturally
    supports multiple output classes
  • A=Green and B=Tall -> Class 1
  • A=Red and C=Fast -> Class 2
  • Else Class 1
  • Placing new rules at beginning or end of list
  • Unordered rule lists for each output class (must
    handle multiple matches)
  • Rule induction can handle noise by no longer
    creating new rules when gain is negligible or not
    statistically significant

59
Conclusion
  • Many new algorithms and approaches being proposed
  • Application areas rapidly increasing
  • Amount of available data and information growing
  • User desire for more adaptive and user-specific
    computer interaction
  • This need for specific and adaptable user
    interaction will make machine learning a more
    important tool in user interface research and
    applications