Title: Statistical Learning
1. Statistical Learning
- Artificial Intelligence: A Modern Approach (Chapter 20)
- Juntae Kim
- Department of Computer Engineering
- Dongguk University
2. The Brain
- The brain consists of billions of neurons
- Structure of a neuron
- Dendrites: input fibers
- Axon: single long output fiber
- Synapse: connecting junction - 10 to 100,000 connections / cell
3. The Neuron
- Propagation of signals - electrochemical
- 1. Chemical transmitters enter through the dendrites
- 2. The cell body's electrical potential increases
- 3. When the potential reaches a threshold, an electrical pulse is sent down the axon
- Changes in structure
- Building new connections
- Plasticity
- Long-term changes in the strength of connections
- Changes in response to the pattern of stimulation
- Learning
4. Brain vs. Computer
- Speed
- Brain - 10^-3 sec/cycle, Computer - 10^-8 sec/cycle
- Parallelism
- Brain - massively parallel computation
- Computer - sequential computation
- Fault tolerance
- Brain - cells die with little effect on the overall function
- Computer - an error in one bit can cause total failure
- Learning
- Brain - the connections change
- Computer - no change
5. A Unit in Neural Networks
- A computational model of the neuron
- Link - connects units
- Weight - a number associated with each link
- Activation function - g: input → output
6. Activation Function
- Activation functions: step, sigmoid
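A minimal sketch (mine, not from the slides) of the two activation functions and a unit that applies one of them to its weighted input sum:

```python
import numpy as np

def step(x):
    # Step activation: fires (1) when the weighted input sum is positive
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    # Smooth, differentiable activation used later for multilayer networks
    return 1.0 / (1.0 + np.exp(-x))

def unit_output(weights, inputs, g=step):
    # A unit: activation function g applied to the weighted sum of its inputs
    return g(np.dot(weights, inputs))

print(unit_output(np.array([1.0, 1.0, -1.5]), np.array([1.0, 1.0, 1.0])))              # 1.0
print(unit_output(np.array([1.0, 1.0, -1.5]), np.array([1.0, 1.0, 1.0]), g=sigmoid))   # ~0.62
```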
7. Moving the Threshold
- Make every unit's threshold T equal to 0
- by adding an extra input a0 = -1 with weight W0i = T
8. Function of a Unit
- Neural units as logic gates
- Inputs: 0 or 1
- Activation function: step
- O = 1 if
- I1·W1 + I2·W2 - 1.5 > 0
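A small sketch (mine, assuming W1 = W2 = 1) of the step unit above acting as an AND gate:

```python
def step(x):
    return 1 if x > 0 else 0

def and_unit(i1, i2, w1=1.0, w2=1.0, threshold=1.5):
    # Output is 1 only when the weighted input sum exceeds the 1.5 threshold
    return step(i1 * w1 + i2 * w2 - threshold)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, and_unit(i1, i2))   # 1 only for inputs (1, 1)
```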
9. Perceptrons
- A single-layer feed-forward network with a step activation function
10. Function of a Perceptron
- A perceptron represents linearly separable functions
- O = 1 if W·I > 0
- Input space is divided by the plane W·I = 0
- Example
- T = 1.5 (W0 = 1.5), W1 = 1.0, W2 = 1.0
- Output = 1 if -1.5 + I1 + I2 > 0
- I2 > -I1 + 1.5
- Ex> I = (1, 1)
- → 1.0·1 + 1.0·1 + (-1.5)·1 = 0.5
- → O = 1
- Function = a line (plane)
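A quick check (mine) of the slide's example, with the threshold folded in as a bias weight:

```python
import numpy as np

def perceptron_output(weights, inputs):
    # Step perceptron: fires when the weighted sum (bias included) is positive
    return 1 if np.dot(weights, inputs) > 0 else 0

W = np.array([-1.5, 1.0, 1.0])   # bias weight -T, then W1, W2
I = np.array([1.0, 1.0, 1.0])    # bias input 1, then I1 = 1, I2 = 1
print(np.dot(W, I))              # 0.5
print(perceptron_output(W, I))   # 1
```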
11. Perceptron Learning
- Learning
- Adjusting the weights to reduce the error for (I, T)
- Error = T (target output) - O (perceptron output)
- If error > 0: increase Wj if Ij > 0, decrease Wj if Ij < 0
- else: decrease Wj if Ij > 0, increase Wj if Ij < 0
- Wj ← Wj + α (T - O) Ij
- Learning = adjusting the weights in (w1·I1 + w2·I2 - w0 > 0)
- = moving the line (plane)
12. Perceptron Learning
- Example (α = 0.5)
- Current: w1 = 1, w2 = 1, w0 = 1
- I1 + I2 - 1 > 0
- Training example: (1.0, 0.5) → 0
- O = step(1·1 + 0.5·1 - 1) = 1
- w1 = 1 + 0.5·(0-1)·1.0 = 0.5
- w2 = 1 + 0.5·(0-1)·0.5 = 0.75
- w0 = 1 + 0.5·(0-1)·(-1.0) = 1.5
- After learning: w1 = 0.5, w2 = 0.75, w0 = 1.5
- 0.5·I1 + 0.75·I2 - 1.5 > 0
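A sketch (mine) of the update rule Wj ← Wj + α (T - O) Ij that reproduces the numbers above, using a fixed bias input a0 = -1:

```python
import numpy as np

def step(x):
    return 1 if x > 0 else 0

def perceptron_update(w, x, target, alpha=0.5):
    # w and x include the bias term: x[0] = -1 and w[0] is the threshold weight
    output = step(np.dot(w, x))
    return w + alpha * (target - output) * x

w = np.array([1.0, 1.0, 1.0])       # w0, w1, w2
x = np.array([-1.0, 1.0, 0.5])      # bias input -1, then I1 = 1.0, I2 = 0.5
print(perceptron_update(w, x, target=0))   # [1.5  0.5  0.75]
```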
13. Perceptron Learning
- Learning curves
- Majority function (linearly separable): the perceptron learns it well
- WillWait problem (not linearly separable): the perceptron learns it poorly
14. Expressiveness of a Perceptron
- A perceptron can represent only linear separators
- Linearly separable - (a), (b)
- Not linearly separable - (c)
15. Multilayer Networks
- Multiple layers, sigmoid activation function
16. Minimizing Error
- The error E is a function of the weights W
- If E = f(w): minimize E by moving w in the direction of -dE/dw (gradient descent)
- If E = f(w1, w2, ..., wn): move the weights in the direction of the negative gradient -∇E
17. Minimizing Error
- Example
- E = W1² + W2²
- ∇E = (∂E/∂W1, ∂E/∂W2) = (2W1, 2W2)
- From (1, 1), move toward (-2, -2)
- From (1, 0), move toward (-2, 0)
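A small gradient-descent sketch (mine, with an assumed step size of 0.1) for E = W1² + W2²:

```python
import numpy as np

def grad_E(w):
    # Gradient of E = w1^2 + w2^2
    return 2.0 * w

w = np.array([1.0, 1.0])
for _ in range(10):
    w = w - 0.1 * grad_E(w)   # move against the gradient (toward (-2, -2) at the start)
print(w)                      # approaches (0, 0), the minimum of E
```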
18. Learning in Multilayer Networks
- Minimizing E in the output layer
19. Learning in Multilayer Networks
- Error propagation to the hidden layer
20. Backpropagation Learning
- Learning (training)
- 1. Decide the number of units and connections
- 2. Initialize the weights
- 3. Train the weights using examples
- For an example (I, T),
- Compute the outputs
- Compute the error in the output layer - Δi
- Adjust the hidden-to-output weights Wji
- Compute the error in the hidden layer - Δj
- Adjust the input-to-hidden weights Wkj
21. Backpropagation Learning
22. Backpropagation Learning
- Algorithm (for the sigmoid activation function)
- For each example (I, T): compute the outputs o(I)
- For each output unit i: Δi ← (Ti - oi) · oi · (1 - oi)
- For each hidden unit j: Δj ← (Σi Wji · Δi) · aj · (1 - aj)
- Wji ← Wji + α · aj · Δi
- Wkj ← Wkj + α · Ik · Δj
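A compact NumPy sketch (mine) of one backpropagation update for a single-hidden-layer sigmoid network, following the rules above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W1, W2, x, t, alpha=1.0):
    # Forward pass; inputs and hidden activations are augmented with a bias input of 1.0
    x_b = np.append(1.0, x)
    a = sigmoid(W1 @ x_b)            # hidden activations a_j
    a_b = np.append(1.0, a)
    o = sigmoid(W2 @ a_b)            # outputs o_i

    # Backward pass: deltas for output and hidden units
    delta_o = (t - o) * o * (1 - o)
    delta_h = (W2[:, 1:].T @ delta_o) * a * (1 - a)

    # Weight updates: W <- W + alpha * activation * delta
    W2 = W2 + alpha * np.outer(delta_o, a_b)
    W1 = W1 + alpha * np.outer(delta_h, x_b)
    return W1, W2
```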
23. Example
- Initial weights 0.0, learning rate 1.0, example (0.4, 0.8, 0)
- h1 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
- h2 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
- o1 = sig(0.0·0.5 + 0.0·0.5 + 0.0·1.0) = 0.5
- d21 = (0 - 0.5)·0.5·(1 - 0.5) = -0.125
- d11 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
- d12 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
- w201 = 0.0 + 1.0·1.0·(-0.125) = -0.125
- w211 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
- w221 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
- w101 = 0.0 + 1.0·1.0·0 = 0.0
- w111 = 0.0 + 1.0·0.4·0 = 0.0
(Network diagram: inputs I1 = 0.4 and I2 = 0.8, hidden units h1 = h2 = 0.5, output o1 = 0.5, with input-to-hidden weights w1kj and hidden-to-output weights w2j1.)
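Running the backprop_step sketch from slide 22 on this example (all weights 0, α = 1.0) reproduces the numbers above:

```python
import numpy as np

W1 = np.zeros((2, 3))   # input-to-hidden weights, bias column first
W2 = np.zeros((1, 3))   # hidden-to-output weights, bias column first
W1, W2 = backprop_step(W1, W2, x=np.array([0.4, 0.8]), t=np.array([0.0]), alpha=1.0)
print(W2)   # [[-0.125  -0.0625 -0.0625]]
print(W1)   # all zeros, since the hidden deltas are 0
```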
24. Example: Restaurant Domain
- Learning curve
- 100 training examples, MLP with 4 hidden units
25. Applications
- Handwritten character recognition
- Zip code reader (LeCun, 1989)
- Input: 16x16 array of pixels
- Output: the digits 0 - 9
- Network: 256 x 768 x 192 x 30 x 10 units
- Hidden units are grouped for feature detection
- Training set: 7300 examples
- 99% accuracy after rejecting 12% on 2000 examples
26. Applications
- Autonomous vehicle
- ALVINN (Pomerleau, 1993)
- Input: 30x32 pixels from video
- Output: 30 steering directions
- Network: 960 x 5 x 30 units
- Training set: examples from 5 minutes of human driving
- Drives at 70 mph on public highways
27. Bayesian Learning
- Concept
- Data = evidence: d = d1, d2, ..., dn
- Hypotheses = theories of the domain: H = h1, h2, ... (e.g. a Bayesian net)
- Learning: given d, find the probabilities of H
- Full Bayesian learning
- Learning: P(hi | d) = α P(d | hi) P(hi)
- P(hi): prior probability of the hypothesis
- P(d | hi): likelihood of the data under each hypothesis
- Prediction: P(X | d) = Σi P(X | hi) P(hi | d)
28. Bayesian Learning - Example
- There are 5 kinds of candy bags
- h1: 100% cherry
- h2: 75% cherry + 25% lime
- h3: 50% cherry + 50% lime
- h4: 25% cherry + 75% lime
- h5: 100% lime
- P(hi) = (0.1, 0.2, 0.4, 0.2, 0.1)
- Data (evidence): 3 limes
- P(h1 | d) = α P(d | h1) P(h1) = α · 0.0³ · 0.1 = α · 0 → 0
- P(h2 | d) = α P(d | h2) P(h2) = α · 0.25³ · 0.2 = α · 0.003125 → 0.01316
- P(h3 | d) = α P(d | h3) P(h3) = α · 0.5³ · 0.4 = α · 0.05 → 0.2105
- P(h4 | d) = α P(d | h4) P(h4) = α · 0.75³ · 0.2 = α · 0.084375 → 0.3553
- P(h5 | d) = α P(d | h5) P(h5) = α · 1.0³ · 0.1 = α · 0.1 → 0.4211
29. Bayesian Learning - Example
- P(next is lime | d = (lime, lime, lime)) = Σi P(next is lime | hi) P(hi | d)
- = P(lime | h1) P(h1 | d) + P(lime | h2) P(h2 | d) + ... + P(lime | h5) P(h5 | d)
- = 0·0 + 0.25·0.01316 + 0.5·0.2105 + 0.75·0.3553 + 1·0.4211
- = 0.7961
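A short sketch (mine) reproducing the candy-bag numbers from the last two slides:

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])      # P(hi)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # P(lime | hi)

likelihood = p_lime ** 3                 # P(d | hi) for d = 3 limes
posterior = likelihood * prior
posterior /= posterior.sum()             # normalization (the alpha factor)
print(posterior)                         # [0. 0.0132 0.2105 0.3553 0.4211]

# Full Bayesian prediction: P(next is lime | d) = sum_i P(lime | hi) P(hi | d)
print(p_lime @ posterior)                # ~0.7961
```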
30. MAP and ML
- Maximum a Posteriori (MAP) hypothesis
- Make predictions based on the single most probable hypothesis
- Learning: hMAP = argmax_hi P(d | hi) P(hi)
- Prediction: P(X | d) ≈ P(X | hMAP)
- hMAP = argmax_hi P(d | hi) P(hi) = h5
- P(next is lime | d = (lime, lime, lime)) = P(lime | hMAP) = P(lime | h5) = 1.0
- Maximum Likelihood (ML) hypothesis
- Assume a uniform prior P(H)
- Learning: hML = argmax_hi P(d | hi)
- Prediction: P(X | d) ≈ P(X | hML)
- hML = argmax_hi P(d | hi) = h5
- P(next is lime | d = (lime, lime, lime)) = P(lime | hML) = P(lime | h5) = 1.0
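Continuing the same example, a short sketch (mine) of picking hMAP and hML:

```python
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
likelihood = np.array([0.0, 0.25, 0.5, 0.75, 1.0]) ** 3   # P(d | hi) for d = 3 limes

h_map = np.argmax(likelihood * prior) + 1   # argmax of P(d | hi) P(hi)  -> h5
h_ml = np.argmax(likelihood) + 1            # uniform prior: argmax of P(d | hi) -> h5
print(h_map, h_ml)                          # 5 5
```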
31. ML Learning in Bayesian Net
- We know
- Cherry and lime candy
- Red and green wrappers
- The color of the wrapper depends on the flavor of the candy
- We want to learn
- The conditional probabilities (θ, θ1, θ2) for the above Bayesian network
32. ML Learning in Bayesian Net
- Learning θ
- Data: c cherries, l limes
- Likelihood: P(d | hθ) = θ^c (1-θ)^l
- Log likelihood: L(d | hθ) = c log θ + l log(1-θ)
- Find the maximum
- Procedure
- Find an expression for the likelihood as a function of the parameters
- Find the derivative of the log likelihood
- Find the parameters by setting the derivative to 0 (worked out below)
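A worked version (mine) of the last two steps for the one-parameter case:

```latex
\frac{\partial L}{\partial \theta} = \frac{c}{\theta} - \frac{l}{1-\theta} = 0
\quad\Longrightarrow\quad
\theta = \frac{c}{c+l}
```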
33. ML Learning in Bayesian Net
- Learning θ, θ1, θ2
- Wrappers are selected probabilistically
- P(red ∧ cherry) = P(cherry) P(red | cherry) = θ·θ1
- Data: c cherries, l limes; rc cherries in red wrappers, gc cherries in green, rl limes in red, gl limes in green
- Likelihood: P(d | hθ,θ1,θ2) = θ^c (1-θ)^l · θ1^rc (1-θ1)^gc · θ2^rl (1-θ2)^gl
- Log likelihood: L(d | h) = c log θ + l log(1-θ)
- + rc log θ1 + gc log(1-θ1)
- + rl log θ2 + gl log(1-θ2)
- Find the maximum (see the estimates below)
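Setting each partial derivative of L to zero gives (a sketch, mine) the familiar count-ratio estimates:

```latex
\theta = \frac{c}{c+l}, \qquad
\theta_1 = \frac{r_c}{r_c + g_c}, \qquad
\theta_2 = \frac{r_l}{r_l + g_l}
```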
34. Naïve Bayes Model
- Learn P(C = T) = θ, P(xi | C = T) = θi1, P(xi | C = F) = θi2
- by the maximum-likelihood method
- Then
- (Network diagram: class variable C with attribute children x1, x2, ..., xn)
- P(C | x1, ..., xn) = α P(C) Πi P(xi | C)
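A minimal sketch (mine, with boolean attributes and made-up parameter values) of the prediction rule above:

```python
import numpy as np

def naive_bayes_predict(x, theta, theta1, theta2):
    # x: boolean attribute vector
    # theta = P(C=T), theta1[i] = P(xi=T | C=T), theta2[i] = P(xi=T | C=F)
    p_true = theta * np.prod(np.where(x, theta1, 1 - theta1))
    p_false = (1 - theta) * np.prod(np.where(x, theta2, 1 - theta2))
    alpha = 1.0 / (p_true + p_false)   # normalization constant
    return alpha * p_true              # P(C = T | x1 ... xn)

# Hypothetical ML-learned parameters, purely for illustration
print(naive_bayes_predict(np.array([True, False, True]),
                          theta=0.6,
                          theta1=np.array([0.9, 0.3, 0.8]),
                          theta2=np.array([0.2, 0.5, 0.4])))
```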
35. Instance-Based Learning
- IBL, or memory-based learning
- Construct the hypothesis directly from the training instances
- xs ← the most similar (stored) instances to x
- f(x) ← average of f(xs)
36. K-Nearest Neighbor Method
- Given examples <xi, f(xi)>
- Find the k instances xi with minimum D(x, xi)
- D(x, xi): distance between x and xi (e.g. Euclidean distance ||x - xi||)
- Compute f(x) as the average (or majority vote) of f over those k neighbors
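A small sketch (mine) of k-NN regression with Euclidean distance:

```python
import numpy as np

def knn_predict(x, X, y, k=3):
    # X: stored instances (one per row), y: their f values, x: query point
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to every stored instance
    nearest = np.argsort(dists)[:k]         # indices of the k closest instances
    return y[nearest].mean()                # average of f over those neighbors

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(knn_predict(np.array([1.5, 1.5]), X, y, k=2))   # 0.0
```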
37. K-Nearest Neighbor Method
- d_gender(A, B) = |A - B| (female = 0, male = 1)
- d_age(A, B) = |A - B| / max difference
- d_salary(A, B) = |A - B| / max difference
- d = d_gender + d_age + d_salary
- k = 3 → the 3 nearest neighbors are examples 4, 3, 5
- f(x) = (w4·1 + w3·1 + w5·0) / (w4 + w3 + w5) = 0.71 → yes (71%)
38. K-Nearest Neighbor Method
- Curse of Dimensionality
- For high-dimensional data, the neighbors can be far away
- Number of data points N, dimension d, all values in [0, 1] → total volume = 1
- A cubical neighborhood with side length b (0 ≤ b ≤ 1) → volume = b^d
- To contain k data points, the neighborhood should occupy a k/N fraction of the volume
- → b^d = (k/N)·1, or b = (k/N)^(1/d)
- Example
- N = 1,000,000, d = 2, k = 10 → b ≈ 0.003
- N = 1,000,000, d = 100, k = 10 → b ≈ 0.89
(Figure: neighborhood size needed to capture a fraction k/N = 1/100 of the data, shown for d = 1, 2, 3.)
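A quick check (mine) of b = (k/N)^(1/d) for the two examples:

```python
def neighborhood_side(k, N, d):
    # Side length of the cube expected to contain k of the N points in [0, 1]^d
    return (k / N) ** (1.0 / d)

print(neighborhood_side(10, 1_000_000, 2))     # ~0.003
print(neighborhood_side(10, 1_000_000, 100))   # ~0.89
```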
39. Performance of BC, k-NN
- (Figure: learning curves comparing Naïve Bayes and k-NN.)