
Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Midterm Review
  • Linear Models
  • Perceptron algorithm
  • Support vector machines
  • Multiple Layer Perceptrons
  • Instance-based Learning
  • Decision Trees
  • Bayesian Decision Theory
  • Classification rules

2
Announcement
  • The midterm exam will be on March 30, 2006
  • It will be open book and open notes
  • A calculator may be needed, so remember to bring
    one with you

3
Pattern Recognition
  • The problem is defined as follows
  • We have a set of examples, represented by a set
    of attributes
  • Each example also has a label

4
Hand-Written Digit Recognition
  • For homework 1, we recognized the ten hand-written
    digits

5
Pattern Recognition Problem Statement
  • Now we want to predict a decision or a value
    based on an input pattern's attributes
  • For linear models, we have one weight for each
    attribute, and the output is a linear
    combination of the attributes

6
Linear Model as a Neural Network
Figure 4.10 (b)
7
Binary Classification
  • There are two different labels, class 1 and class
    2
  • When an example i is from class 1, y(i) = 1
  • When an example i is from class 2, y(i) = -1
  • In this case, we use the sign of the output x to
    decide which class
  • If x > 0, then we will predict the first class
  • Otherwise, we will predict the second class.

8
McCulloch and Pitts Model
  • The linear model is equivalent to a simple neuron
    model, known as the McCulloch and Pitts Model
  • It is a simple model: a binary threshold unit
  • The model neuron first computes a weighted sum of
    its inputs plus a bias
  • It outputs one if the weighted sum is larger than
    zero and zero otherwise (see the sketch below)
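A minimal Python sketch of this threshold unit (the function name, and the use of Python rather than the course's Matlab, are illustrative):

  import numpy as np

  def mp_unit(inputs, weights, bias):
      # McCulloch-Pitts style unit: weighted sum plus bias, thresholded at zero
      return 1 if np.dot(weights, inputs) + bias > 0 else 0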

9
McCulloch and Pitts Model cont.
10
Linear Models cont.
  • This implements an AND gate of two inputs

11
Linear Models cont.
  • This implements an OR gate of two inputs
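With suitable weights and biases the same threshold unit realizes both gates; the values below are one workable choice, not necessarily the figures on the slides:

  import numpy as np

  def mp_unit(inputs, weights, bias):
      return 1 if np.dot(weights, inputs) + bias > 0 else 0

  for x1 in (0, 1):
      for x2 in (0, 1):
          print(x1, x2,
                mp_unit([x1, x2], [1, 1], -1.5),   # AND: fires only when both inputs are 1
                mp_unit([x1, x2], [1, 1], -0.5))   # OR: fires when at least one input is 1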

12
Linear Models cont.
13
Linear Models - cont.
  • How to design a linear model for XOR?

14
Linear Models cont.
  • How to design an adder of two two-bit binary
    numbers

15
Perceptron Learning Rule
Figure 4.10 (a)
16
Example
  • We have one instance in class 1 (a1 = 1 and a2 = 1)
    and one in class 2 (a1 = -1 and a2 = 1); a sketch
    of the learning rule on this data follows
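A minimal Python sketch of the perceptron learning rule applied to these two instances; the zero initial weights, the learning rate, and the trick of absorbing the bias as a constant input are assumptions:

  import numpy as np

  # Two training instances from the slide, with a constant 1 appended for the bias
  X = np.array([[1.0,  1.0, 1.0],    # class 1 instance (a1 = 1, a2 = 1)
                [-1.0, 1.0, 1.0]])   # class 2 instance (a1 = -1, a2 = 1)
  y = np.array([1, -1])

  w = np.zeros(3)                    # initial weights (arbitrary)
  eta = 1.0                          # learning rate (arbitrary)

  converged = False
  while not converged:
      converged = True
      for xi, yi in zip(X, y):
          if yi * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
              w += eta * yi * xi         # perceptron update: w <- w + eta * y * x
              converged = False
  print("weights:", w)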

17
Perceptron Learning Rule
Perceptron Convergence Theorem: if the training
samples are linearly separable, then the sequence
of weight vectors produced by the above algorithm
will terminate at a solution vector in a finite
number of steps.
18
Linear Models cont.
  • How to handle multi-class cases?
  • For many applications, there are often more than
    two classes
  • There are two ways to devise multi-category
    classifiers using binary linear models
  • One against the rest (c linear models)
  • One against another (c(c-1)/2 linear models)

19
Multi-category Case One against the Rest
  • If there are c classes,
  • We train c linear models, one for each class
    against the rest
  • Then we classify an instance as one class if it
    is the only class whose output x > 0
  • Otherwise, it is ambiguous (see the sketch below)
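A Python sketch of the one-against-the-rest decision, assuming the c trained linear models are stored as rows of a weight matrix W with a bias vector b:

  import numpy as np

  def one_vs_rest_predict(W, b, a):
      # W: (c, d) weights, b: (c,) biases, a: (d,) attribute vector
      scores = W @ a + b
      positive = np.flatnonzero(scores > 0)
      if len(positive) == 1:
          return int(positive[0])   # exactly one model says "my class"
      return None                   # ambiguous, as noted on the slide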

20
Multi-category Case One against the Rest
21
Multi-category Case One against the Other
  • If there are c classes, we train one linear model
    for each pair of classes c1 and c2 (c1 < c2)
  • There will be c(c-1)/2 total linear models
  • To classify an unknown instance, we apply each of
    the c(c-1)/2 linear models
  • For c1 against c2, if x > 0, then class c1
    receives one vote
  • Otherwise, class c2 receives one vote
  • The unknown instance is classified as the class
    that receives the most votes (see the sketch below)
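A Python sketch of this pairwise voting scheme; `models` is assumed to map each pair (c1, c2) with c1 < c2 to its trained weight vector and bias:

  import numpy as np
  from collections import Counter

  def one_vs_one_predict(models, a):
      # models: dict mapping (c1, c2) to a (w, b) pair
      votes = Counter()
      for (c1, c2), (w, b) in models.items():
          if np.dot(w, a) + b > 0:   # x > 0: c1 gets the vote
              votes[c1] += 1
          else:                      # otherwise c2 gets the vote
              votes[c2] += 1
      return votes.most_common(1)[0][0]   # class with the most votes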

22
Multi-category Case One against the Other
23
Multiple Response Linear Models for Multiple
Class Classification
  • Instead of combining binary class classifiers for
    classification, we can also learn one linear
    model for each class
  • That is, if we have C different classes, we will
    have C different linear models, given by weights

24
Multiple Response Linear Models
25
Multiple Response Linear Models
  • Classification
  • Given an instance, we compute the response from
    each class

26
Multiple Response Linear Models
  • Classification continued
  • It is classified as the class whose x is the
    maximum
  • In Matlab,
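A Python sketch of the same argmax rule (the slide shows the Matlab equivalent); W is assumed to hold one row of weights per class and b the per-class biases:

  import numpy as np

  def multi_response_predict(W, b, a):
      # Compute the response of each class's linear model and pick the maximum
      responses = W @ a + b
      return int(np.argmax(responses))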

27
Multiple Response Linear Models
  • Learning the weights
  • Generalized perceptron learning rule based on
    Kessler's construction

28
Decision Boundaries and Decision Regions
  • The final result of any classifier is to
    partition the attribute space into regions
    corresponding to the classes
  • Decision regions are regions of the same class in
    the attribute space
  • Decision boundaries are the boundaries between
    decision regions of different classes

29
Decision Boundary of a Linear Model
  • If there are only two classes, the decision
    boundary is given by

30
The Margin of a Linear Model
  • The margin of a linear model with respect to a
    training set is the minimum distance of the
    instances to the decision boundary

31
Maximum Margin
32
Maximum Margin Hyperplane
  • Support vectors are the instances that are
    closest to the maximum margin hyperplane
  • Note that the decision boundary does not depend
    on instances that are not support vectors
  • They can be deleted without changing the maximum
    margin hyperplane
  • In other words, the support vectors uniquely
    define the linear model that has the largest
    margin

33
Maximum Margin Hyperplane
  • Now we change the notation a little bit
  • We have two classes and we label each training
    instance as either 1 (the first class) or -1 (the
    second class)
  • Then the maximum margin hyperplane can be written
    as x = b + Σi αi y(i) (a(i) · a)
  • Here the a(i) with nonzero αi are the support
    vectors and a is the attribute vector of a test
    instance

34
How to Determine the Parameters
  • It requires solving a standard class of
    optimization problems, known as constrained
    quadratic optimization
  • There are off-the-shelf software packages for
    solving these problems
  • Matlab provides quadprog for quadratic
    programming
  • Sequential minimal optimization

35
Support Vector Machines cont.
  • Kernel methods can be used to solve problems that
    are not linearly separable
  • Mapping to a high-dimensional space is done
    implicitly using a kernel function, which can be
    used to compute the inner product in the
    high-dimensional space directly (without mapping)
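For illustration only (the slides use Matlab's quadprog / SMO), scikit-learn's SVC shows the effect of swapping a linear kernel for an RBF kernel on data that is not linearly separable; the dataset here is synthetic:

  import numpy as np
  from sklearn.svm import SVC

  rng = np.random.default_rng(0)
  X = rng.normal(size=(200, 2))
  y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular class boundary

  linear_svm = SVC(kernel="linear").fit(X, y)
  rbf_svm = SVC(kernel="rbf").fit(X, y)               # kernel trick: implicit mapping
  print("linear accuracy:", linear_svm.score(X, y))
  print("RBF accuracy:", rbf_svm.score(X, y))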

36
Support Vector Machines cont.
  • Support vector machines with a kernel function
  • Linear support vector machine
  • Kernel support vector machine

37
Digital Computers
  • All digital computers consist of digital gates to
    control the dataflow and perform computation
  • Any digital computer can be simulated by a
    network of neurons

38
Neural Network Architecture
  • Neural network architecture
  • Consists of a set of neurons
  • The inputs are connected to each of the neurons
    in the input layer
  • The outputs are connected to each of the neurons
    in the output layer
  • Neurons are inter-connected following some
    patterns

39
A More General Neuron
40
Transfer Functions
41
Multiple Layers of Neurons
Note that there is no standard definition of the
number of layers in a neural network
42
Notations
  • We use a superscript to specify the layer a
    neuron belongs to
  • We use a subscript to specify which neuron within
    that layer
  • For weights, we need double subscript indices
  • The first index specifies the neuron within the
    layer
  • The second index specifies the neuron from the
    preceding layer
  • w_ij^m means the weight between neuron i in the
    m-th layer and neuron j in the (m-1)-th layer

43
Example
44
Elementary Decision Boundaries
First Boundary
Second Boundary
First Subnetwork
45
Elementary Decision Boundaries
Third Boundary
Fourth Boundary
Second Subnetwork
46
Total Network
47
XOR Example
48
Multilayer Neural Networks
  • Given the expressive power of multilayer neural
    networks, the key question now is how to learn
    the weights (parameters) from a set of training
    examples

49
Multilayer Neural Networks cont.
  • Training error
  • Here we consider the training error of an input
    pattern to be the sum over the output units of the
    squared difference between the desired output and
    the actual output of each output unit
  • Note that w is a vector consisting of all the
    parameters in the multilayer neural network

50
Back Propagation Algorithm
  • Initialize the weights to small random values
  • Choose a training pattern (a_q, t_q) and apply a_q
    to the input layer, i.e., let a^0 = a_q and t = t_q
  • Propagate the input forwards through the network
  • Compute the deltas (errors) for the output layer

51
Back Propagation Algorithm cont.
  • Compute the deltas (errors) for the preceding
    layers by propagating the errors backwards
  • Update all the connections by
  • Go back to step 2 and repeat for the next pattern
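A minimal numpy sketch of these steps for a network with one hidden layer, sigmoid transfer functions, and squared error; the layer sizes, learning rate, and variable names are assumptions:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def backprop_step(a0, t, W1, b1, W2, b2, lr=0.1):
      # One pattern: forward pass, output/hidden deltas, weight update
      n1 = W1 @ a0 + b1
      a1 = sigmoid(n1)
      n2 = W2 @ a1 + b2
      a2 = sigmoid(n2)
      # Deltas for the output layer (error times transfer-function derivative)
      d2 = (a2 - t) * a2 * (1 - a2)
      # Deltas for the preceding layer, propagated backwards
      d1 = (W2.T @ d2) * a1 * (1 - a1)
      # Update all connections (gradient descent step)
      W2 -= lr * np.outer(d2, a1); b2 -= lr * d2
      W1 -= lr * np.outer(d1, a0); b1 -= lr * d1
      return 0.5 * np.sum((t - a2) ** 2)   # training error for this pattern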

52
Backpropagation Algorithm cont.
53
Example Function Approximation
54
Network
55
Initial Conditions
56
Forward Propagation
57
Transfer Function Derivatives
58
Backpropagation
59
Weight Update
60
Practical Issues
  • While in theory a multiple layer neural network
    with nonlinear transfer functions trained using
    backpropagation is sufficient to approximate any
    function or solve any recognition problem, there
    are practical issues
  • What is the optimal architecture for a particular
    problem/application?
  • What is the performance on unknown test data?
  • Will the network converge to a good solution?
  • How long does it take to train a network?

61
Choice of Architecture
1-3-1 Network; figure panels for i = 1, 2, 4, 8
62
Choice of Network Architecture
Figure panels: 1-2-1, 1-3-1, 1-5-1, and 1-4-1 networks
63
The Issue of Generalization
  • We are interested in whether a neural network
    trained only on a training set will work well for
    novel and unseen test data
  • For example, for face recognition, a neural
    network that can only recognize the faces in the
    training set is not very useful
  • Generalization is one of the most fundamental
    problems in neural networks and many other
    recognition techniques

64
Improving Generalization
  • Heuristics
  • A neural network should have fewer parameters
    than the number of data points in the training
    set
  • More domain specific knowledge
  • Cross validation
  • Divide the labeled examples into training and
    validation sets
  • Stop training when the error on the validation
    set increases

65
Cross Validation
66
Convergence Issues
  • A neural network may converge to a bad solution
  • Train several neural networks from different
    initial conditions
  • Learning rate may be too large

67
Convergence Issues
  • The convergence can be slow
  • Practical techniques
  • Variations of basic backpropagation algorithms
  • Heuristics
  • Momentum
  • Variable learning rate
  • Conjugate gradient
  • Second-order methods
  • Newton's method
  • Levenberg-Marquardt algorithm

68
Momentum
  • Momentum
  • A heuristic term to speed up the training
  • The momentum parameter α must be between 0 and 1;
    a value of 0.9 is often chosen (see the sketch
    below)
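A sketch of one common form of the momentum heuristic (the exact update used in the course text may differ):

  def momentum_update(w, grad, velocity, lr=0.1, alpha=0.9):
      # Blend the previous weight change (velocity) into the current gradient step
      velocity = alpha * velocity - lr * grad
      return w + velocity, velocity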

69
Momentum Backpropagation
70
Derivation of Backpropagation Algorithm
  • The backpropagation algorithm is not a magic
    algorithm
  • It is essentially a gradient descent algorithm
    that optimizes the training error over the neural
    network parameters (i.e., the weights and biases)
  • Understanding this allows us to derive new
    algorithms for different neural networks

71
Gradient Descent
Choose the next step so that the function decreases.
For small changes in x we can approximate F(x) by the
first-order expansion
F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k,
where g_k is the gradient of F evaluated at x_k.
If we want the function to decrease, we need
g_k^T Δx_k < 0.
We can maximize the decrease by choosing
Δx_k = -α g_k, a step along the negative gradient.
72
Example
73
Gradient Descent
74
Backpropagation for a Neuron
75
Neural Network Based Face Detection
76
Instance-Based Learning
77
The Nearest Neighbor Rule
  • Suppose we have D = {a(1), ..., a(n)} labeled
    training instances
  • Let a' in D be the closest point to instance a,
    which needs to be classified
  • The nearest neighbor rule is to assign a the
    label associated with a'

78
The Nearest Neighbor Rule cont.
  • Example: x = (0.10, 0.25)^t

79
The Nearest Neighbor Rule cont.
The nearest neighbor rule leads to a partition of
the attribute space into Voronoi cells
80
The k-Nearest Neighbor Rule
  • An extension of the nearest neighbor rule
  • The k-nearest neighbor rule classifies a by
    assigning it the label most frequently
    represented among the k nearest samples
  • In other words, given a, we find the k nearest
    labeled samples; the label that appears most often
    is assigned to a (see the sketch below)
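A minimal Python sketch of the k-nearest-neighbor rule with Euclidean distance (the distance measure and the numpy data layout are assumptions):

  import numpy as np
  from collections import Counter

  def knn_classify(X_train, y_train, a, k=3):
      # X_train: (n, d) array, y_train: (n,) array of labels, a: (d,) query
      dists = np.linalg.norm(X_train - a, axis=1)   # Euclidean distances
      nearest = np.argsort(dists)[:k]               # indices of the k closest instances
      return Counter(y_train[nearest]).most_common(1)[0][0]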

81
The k-Nearest Neighbor Rule cont.
  • Example
  • k = 3 (odd value) and a = (0.10, 0.25)^t
  • Closest vectors to a with their labels are
  • (0.10, 0.28, ω2), (0.12, 0.20, ω2), (0.15, 0.35, ω1)
  • One voting scheme assigns the label ω2 to a
    since ω2 is the most frequently represented

82
The k-Nearest Neighbor Rule cont.
83
Computational Complexity of the
k-Nearest-Neighbor Rule
  • Straightforward implementation: O(dn)
  • More efficient implementations
  • Tree-based data structures
  • Editing

84
K-d Tree
  • A k-d tree is a generalization of a binary search
    tree in high dimensions
  • Each internal node in a k-d tree is associated
    with a hyper-rectangle and a hyper-plane
    orthogonal to one of the coordinate axes
  • The hyper-plane splits the hyper-rectangle into
    two parts, which are associated with the child
    nodes
  • The partitioning process goes on until the number
    of data points in the hyper-rectangle falls below
    some given threshold

85
K-d Tree cont.
86
K-d Tree cont.
  • For a given query point, the algorithm works by
    first descending the tree to find the data points
    lying in the cell that contains the query point
  • Then it examines surrounding cells if they
    overlap the ball centered at the query point whose
    radius is the distance to the closest data point
    found so far (see the sketch below)
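For illustration only (not part of the slides), SciPy's cKDTree provides such a structure:

  import numpy as np
  from scipy.spatial import cKDTree

  points = np.random.rand(1000, 2)       # training data in a 2-D attribute space
  tree = cKDTree(points)                  # builds the k-d tree

  query = np.array([0.10, 0.25])
  dist, idx = tree.query(query, k=3)      # 3 nearest neighbors of the query point
  print(idx, dist)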

87
Condensing
  • Aim is to reduce the number of training samples
  • Retain only the samples that are needed to define
    the decision boundary

88
Dataset Reduction Editing
  • Training data may contain noise and overlapping
    classes
  • Editing seeks to remove noisy points and produce
    smooth decision boundaries, often by retaining
    points far from the decision boundaries
  • This results in homogeneous clusters of points

89
Wilson Editing
  • Remove points that do not agree with the majority
    of their k nearest neighbours

90
Decision Trees
  • Decision trees are based on the idea that we can
    classify a pattern through a sequence of
    questions
  • This is known as a 20-questions approach, very
    similar to the strategy of the Guess Who game

91
Decision Tree Example
92
Constructing Decision Trees
  • As for constructing other tree structures, we use
    a recursive procedure (a sketch follows this list)
  • First we select an attribute for the root node and
    create a branch for each possible attribute value
  • Then we split the instances into subsets, one for
    each branch extending from the node
  • Finally we repeat recursively for each branch,
    using only the instances that reach the branch
  • Stop if all instances have the same class
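A compact Python sketch of this recursive procedure, using information gain (defined on the following slides) as the selection criterion; the data layout (a list of attribute dicts plus a parallel label list) is an assumption:

  import math
  from collections import Counter

  def entropy(labels):
      # Entropy of a list of class labels, in bits
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  def info_gain(rows, labels, attr):
      # Entropy before the split minus the weighted entropy after splitting on attr
      n = len(labels)
      parts = {}
      for row, lab in zip(rows, labels):
          parts.setdefault(row[attr], []).append(lab)
      after = sum(len(p) / n * entropy(p) for p in parts.values())
      return entropy(labels) - after

  def build_tree(rows, labels, attrs):
      if len(set(labels)) == 1:            # all instances have the same class: stop
          return labels[0]
      if not attrs:                        # no attributes left: return the majority class
          return Counter(labels).most_common(1)[0][0]
      best = max(attrs, key=lambda a: info_gain(rows, labels, a))
      branches = {}
      for row, lab in zip(rows, labels):   # split instances into subsets, one per value
          branches.setdefault(row[best], ([], []))
          branches[row[best]][0].append(row)
          branches[row[best]][1].append(lab)
      rest = [a for a in attrs if a != best]
      return {best: {v: build_tree(r, l, rest) for v, (r, l) in branches.items()}}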

93
The Weather Data
94
Which Attribute to Select?
95
Criterion for Attribute Selection
  • Which is the best attribute?
  • Want to get the smallest tree
  • Heuristic: choose the attribute that produces the
    purest nodes
  • Popular impurity criterion: information gain
  • Information gain increases with the average
    purity of the subsets
  • Strategy: choose the attribute that gives the
    greatest information gain

96
Computing Information
  • Measure information in bits
  • Given a probability distribution, the info
    required to predict an event is the
    distribution's entropy
  • Entropy gives the information required in
    bits (can involve fractions of bits!)
  • Formula for computing the entropy:
    entropy(p1, ..., pn) = -p1 log2 p1 - ... - pn log2 pn

97
Example Attribute Outlook
  • Outlook = Sunny: info([2,3]) = 0.971 bits
  • Outlook = Overcast: info([4,0]) = 0 bits
  • Outlook = Rainy: info([3,2]) = 0.971 bits
  • Expected information for the attribute:
    info([2,3], [4,0], [3,2]) = 0.693 bits
98
Computing Information Gain
  • Information gain = difference between information
    before splitting and information after splitting
  • Information gain for attributes from weather data

gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2])
              = 0.940 - 0.693 = 0.247 bits

gain(Outlook)     = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity)    = 0.152 bits
gain(Windy)       = 0.048 bits
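These figures can be checked directly from the weather data's class counts (9 yes / 5 no overall; Outlook splits them into [2,3], [4,0], [3,2]); a small Python check:

  import math

  def info(counts):
      # Entropy of a class distribution given as a list of counts, in bits
      n = sum(counts)
      return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

  before = info([9, 5])                               # information before splitting
  splits = [[2, 3], [4, 0], [3, 2]]                   # Outlook = Sunny / Overcast / Rainy
  after = sum(sum(s) / 14 * info(s) for s in splits)  # weighted information after splitting
  print(before, after, before - after)                # approx. 0.940, 0.693 and 0.247 bits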
99
Continuing to Split
gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits
100
Final Decision Tree
  • Note: not all leaves need to be pure; sometimes
    identical instances have different classes
  • ⇒ Splitting stops when the data can't be split any
    further

101
Highly-branching Attributes
  • Attributes with a large number of values can be
    problematic
  • Subsets are more likely to be pure if there is a
    large number of values
  • Information gain is biased towards choosing
    attributes with a large number of values
  • This may result in overfitting (selection of an
    attribute that is non-optimal for prediction)

102
Weather data with ID code
103
Tree Stump for ID code Attribute
  • Entropy of the split is 0, since each leaf node
    contains a single instance
  • Information gain is maximal for ID code (namely
    0.940 bits)

104
Gain Ratio
  • Gain ratio: a modification of the information
    gain that reduces its bias
  • Gain ratio takes the number and size of branches
    into account when choosing an attribute
  • It corrects the information gain by taking the
    intrinsic information of a split into account
  • Intrinsic information: entropy of the distribution
    of instances into branches (i.e., how much info do
    we need to tell which branch an instance belongs
    to)

105
Computing the gain ratio
  • Example: intrinsic information for ID code is
    info([1,1,...,1]) = 3.807 bits
  • The value of an attribute decreases as its
    intrinsic information gets larger
  • Definition of gain ratio:
    gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
  • Example: gain_ratio(ID code) = 0.940 / 3.807 = 0.247

106
Gain ratios for Weather Data
107
Numeric Attributes
  • Standard method: binary splits
  • E.g. temp < 45
  • Unlike nominal attributes, every numeric attribute
    has many possible split points
  • The solution is a straightforward extension
  • Evaluate info gain (or other measure) for every
    possible split point of the attribute
  • Choose the best split point
  • Info gain for the best split point is the info
    gain for the attribute
  • Computationally more demanding

108
Example
  • Split on temperature attribute
  • E.g. temperature < 71.5: yes/4, no/2;
    temperature ≥ 71.5: yes/5, no/3
  • info([4,2], [5,3]) = 6/14 × info([4,2]) +
    8/14 × info([5,3]) = 0.939 bits
  • Place split points halfway between values
  • Can evaluate all split points in one pass!

109
Pruning
  • Prevent overfitting to noise in the data
  • Prune the decision tree
  • Two strategies
  • Postpruning: take a fully-grown decision tree and
    discard unreliable parts
  • Prepruning: stop growing a branch when information
    becomes unreliable
  • Postpruning preferred in practice; prepruning can
    stop early

110
Statistical Pattern Recognition
  • We have a set of training examples for each class
    we are interested in
  • We want to design/learn a classifier that can
    classify an input into one of the classes
  • Note the examples are influenced by factors that
    are random in nature
  • While the classifiers we considered give us a
    definite answer, note that the problem is
    intrinsically probabilistic
  • In other words, we should estimate the
    probability of an input belonging to a particular
    class rather than give a definite answer

111
The Illustration Example
112
Review Probability Theory cont.
  • Conditional probability
  • The law of total probability
  • Bayes rule

113
Bayesian Decision Theory
  • We have two pattern classes w1 and w2
  • Prior probability
  • Our prior knowledge of how likely to get class w1
    or w2
  • Class conditional probability
  • Class conditional probability densities p(x | w1)
    and p(x | w2)

114
Bayesian Decision Theory cont.
  • Bayes formula: P(wi | x) = p(x | wi) P(wi) / p(x)
  • In English, it can be expressed as
    posterior = likelihood × prior / evidence

115
Bayesian Decision Theory cont.
  • Minimum error classification
  • Decide w1 if P(w1 | x) > P(w2 | x); otherwise
    decide w2
  • P(error | x) = min{P(w1 | x), P(w2 | x)}
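A minimal Python sketch of this minimum-error rule, assuming the class-conditional densities are available as callables and the priors as numbers:

  def bayes_decide(x, p1, p2, prior1, prior2):
      # p1, p2 are the class-conditional densities p(x|w1), p(x|w2)
      joint1 = p1(x) * prior1
      joint2 = p2(x) * prior2
      evidence = joint1 + joint2                  # p(x), from the law of total probability
      post1, post2 = joint1 / evidence, joint2 / evidence
      decision = "w1" if post1 > post2 else "w2"  # decide w1 if P(w1|x) > P(w2|x)
      p_error = min(post1, post2)                 # P(error|x) = min(P(w1|x), P(w2|x))
      return decision, p_error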

116
Example
  • Number of fins of sea bass and salmon

117
Example cont.
  • Estimated class conditional

118
Example cont.
  • Prior probabilities
  • Priors come from prior knowledge
  • For example, if a fish expert says there are
    twice as many salmon as sea bass
  • What are priors?
  • How to estimate priors from data?

119
Bayesian Decision Theory cont.
  • Classification results are often associated with
    actions
  • Decisions with different risks
  • For example, a typical user can tolerate a spam
    being classified incorrectly as a ham more than a
    ham being classified incorrectly as a spam

120
Bayesian Decision Theory cont.
  • Conditional risk
  • Bayesian decision theory
  • Take the action that has the minimum conditional
    risk

121
General Case
  • In general, we can have more than two classes and
    a number of actions
  • The Bayesian decision theory is the same
  • For each one, we compute its posterior
    probability
  • For minimum error rate classification, we decide
    to be the class with the largest posterior
    probability
  • If we have actions with different risks, we take
    the action that has the least conditional risk

122
Bayesian Decision Theory cont.
  • Bayes decision rule for minimum error rate
  • Decide wi if P(wi | x) ≥ P(wj | x) for all j ≠ i
  • In case of ties, you can break the tie any way
    you like
  • The probability of error is given by

123
Bayesian Decision Theory General Case
  • Suppose that there are c categories
  • w1, w2, ..., wc
  • and there are d actions associated with
    recognition labels
  • a1, a2, ..., ad
  • Loss function
  • The loss function states exactly how costly each
    action is and is used to convert a probability
    determination into a decision
  • Written as λ(ai | wj), the loss incurred for
    taking action ai when the true class is wj

124
Bayesian Decision Theory cont.
  • Conditional risk: R(ai | x) = Σj λ(ai | wj) P(wj | x)
  • Bayes decision rule
  • For a given x, select the action ai for which the
    conditional risk is minimum

125
Naïve Bayesian Classifier
  • When there is more than one feature, the class
    conditional probability distribution can be
    complicated and expensive to model
  • One way to overcome this problem is to assume
    that all the features are statistically
    independent
  • The resulting Bayesian classifier is called naïve
    Bayesian classifier

126
Training Set
  • We need to have two sets of email messages
  • Spam, consisting of known spam messages
  • Note that the definition of spam is subjective
  • Ham, consisting of known non-spam messages
  • You can generate the sets based on your email
    messages
  • You can also obtain publicly available spam
    corpora

127
Bayesian Spam Filtering
  • Training stage
  • Tokenize each message in the training sets
  • This step can affect the performance
    significantly
  • Create two tables, one for spam class and one for
    non-spam class, counting the number of times each
    token occurred in the corresponding corpus
  • Estimate the probability that a message belongs
    to spam if it contains a particular word using
    the following formula, resulting in another table

128
Estimating Probability (Pseudo Code)
  • nbad = the number of messages in the spam corpus
  • ngood = the number of messages in the non-spam
    corpus
  • For each word in the combined list (consisting of
    words of both corpora)
  • g = the number of occurrences in the non-spam
    corpus
  • b = the number of occurrences in the spam corpus
  • If 2g + b < 5, then
  • Delete the word from the list
  • Otherwise
  • Its probability is given by the formula sketched
    below
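A Python sketch of this estimation step, following Paul Graham's well-known spam-filtering recipe; the doubling of g, the rare-word cutoff, and the 0.01/0.99 clamps are Graham's choices and are assumed here, so the slide's exact formula may differ:

  def word_spam_prob(b, g, nbad, ngood):
      # b, g: occurrences of the word in the spam / non-spam corpus
      # nbad, ngood: number of messages in the spam / non-spam corpus
      g2 = 2 * g                         # non-spam occurrences double-counted (assumption)
      if g2 + b < 5:
          return None                    # too rare: delete the word from the list
      bad_freq = min(1.0, b / nbad)
      good_freq = min(1.0, g2 / ngood)
      p = bad_freq / (good_freq + bad_freq)
      return min(0.99, max(0.01, p))     # keep the estimate away from 0 and 1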

129
Bayesian Spam Filtering
  • Testing stage
  • When a new message arrives, it is first tokenized
    the same way the training messages are
  • Find the fifteen tokens that are most certain,
    where the certainty of a token is measured by its
    entropy (the smaller the entropy, the more the
    certainty)
  • This can be done by ranking tokens according to
    |p - 0.5|

130
Testing
  • Combined probability (Lisp):

    (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x)
                                             (- 1 x))
                                         probs)))))

  • This translates to
    p = (p1 p2 ... pn) /
        (p1 p2 ... pn + (1 - p1)(1 - p2) ... (1 - pn))

131
Testing
  • Thresholding
  • If the combined probability is larger than a
    threshold, the message is classified as spam;
    otherwise, it is classified as ham
  • The threshold is an important parameter
  • By varying the threshold, we will get a curve
    called the ROC (receiver operating characteristic)
    curve

132
Statistical Pattern Recognition
133
Classification Using Rules
  • Perform classification using If-Then rules
  • Classification rule: r = <a, c>
  • a: antecedent, c: consequent
  • May be generated from other techniques (DT, NN)
    or generated directly
  • Algorithms: 1R, PRISM

134
Generating Rules from DTs
135
Generating Rules Example
136
Inferring Rudimentary Rules
  • 1R algorithm
  • It is an algorithm for finding simple rules
  • It often comes up with quite good rules for
    characterizing the structure in the data (a sketch
    follows)
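A Python sketch of 1R under the assumption that the data is a list of attribute dictionaries with a parallel list of class labels: for each attribute, build one rule mapping each value to its majority class, then keep the attribute whose rule makes the fewest errors.

  from collections import Counter

  def one_r(rows, labels, attrs):
      # rows: list of dicts, labels: parallel list of classes, attrs: attribute names
      best_attr, best_rule, best_errors = None, None, None
      for attr in attrs:
          by_value = {}
          for row, lab in zip(rows, labels):
              by_value.setdefault(row[attr], []).append(lab)
          # One branch per attribute value, each predicting the majority class
          rule = {v: Counter(labs).most_common(1)[0][0] for v, labs in by_value.items()}
          errors = sum(lab != rule[row[attr]] for row, lab in zip(rows, labels))
          if best_errors is None or errors < best_errors:
              best_attr, best_rule, best_errors = attr, rule, errors
      return best_attr, best_rule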

137
1R Algorithm
138
1R Algorithm Example
139
PRISM Algorithm
140
The contact lenses data
141
Example contact lens data cont.
  • Rule we seek
  • Possible tests

142
Modified rule and resulting data
  • Rule with best test added
  • Instances covered by modified rule

143
Further refinement
  • Current state
  • Possible tests

144
Modified rule and resulting data
  • Rule with best test added
  • Instances covered by modified rule

145
Further refinement
  • Current state
  • Possible tests
  • Tie between the first and the fourth test
  • We choose the one with greater coverage

146
The result
  • Final rule
  • Second rule for recommending hard lenses
    (built from instances not covered by first rule)
  • These two rules cover all hard lenses
  • Process is repeated with other two classes

147
Meta learning schemes
  • Basic idea: build different experts and let
    them vote
  • Advantage
  • often improves predictive performance
  • Disadvantage
  • produces output that is very hard to analyze
  • Schemes
  • Bagging
  • Boosting
  • Stacking
  • error-correcting output codes

apply to both classification and numeric
prediction
148
Bias-variance decomposition
  • To analyze how much any specific training set
    affects performance
  • Assume infinitely many classifiers, built from
    different training sets of size n
  • For any learning scheme,
  • Bias = expected error of the combined classifier
    on new data
  • Variance = expected error due to the particular
    training set used
  • Total expected error = bias + variance

149
Bagging
  • Combining predictions by voting/averaging
  • Simplest way!
  • Each model receives equal weight
  • Idealized version
  • Sample several training sets of size n (instead
    of just having one training set of size n)
  • Build a classifier for each training set
  • Combine the classifiers' predictions
  • Learning scheme is unstable ⇒ almost always
    improves performance
  • A small change in the training data can make a
    big change in the model
  • (e.g. decision trees)

150
Bagging classifiers
Model generation
  • Let n be the number of instances in the training
    data
  • For each of t iterations
  • Sample n instances from training set
  • (with replacement)
  • Apply learning algorithm to the sample
  • Store resulting model

Classification
For each of the t models:
  Predict the class of the instance using the model
Return the class that is predicted most often
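A Python sketch of this pseudocode; `base_learner` is an assumed callable that fits and returns a model exposing a `predict` method, and X, y are numpy arrays:

  import numpy as np
  from collections import Counter

  def bagging(base_learner, X, y, t=10, rng=None):
      # Train t models, each on a bootstrap sample (n draws with replacement)
      rng = rng or np.random.default_rng()
      n = len(X)
      models = []
      for _ in range(t):
          idx = rng.integers(0, n, size=n)      # sample n instances with replacement
          models.append(base_learner(X[idx], y[idx]))
      return models

  def bagged_predict(models, x):
      # Return the class predicted most often by the t models
      votes = [m.predict([x])[0] for m in models]
      return Counter(votes).most_common(1)[0][0]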
151
Boosting
  • Also uses voting/averaging
  • Weights models according to performance
  • Iterative: new models are influenced by the
    performance of previously built ones
  • Encourages the new model to become an expert for
    instances misclassified by earlier models
  • Intuitive justification: models should be experts
    that complement each other
  • Several variants

152
AdaBoost
153
AdaBoost cont.
  • The final classifier is a weighted vote of the
    component classifiers, g(x) = sign(Σt αt ht(x))
  • The error on the training set of the final
    classifier is bounded by Πt 2√(εt(1 - εt)), where
    εt is the weighted training error of the t-th
    component classifier
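For illustration only (the slides derive the algorithm analytically), scikit-learn's AdaBoostClassifier applies this reweighting scheme over decision-tree stumps; the dataset here is synthetic:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import AdaBoostClassifier

  X, y = make_classification(n_samples=300, random_state=0)
  clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
  print("training accuracy:", clf.score(X, y))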

154
AdaBoost cont.