Statistical Learning
1
Statistical Learning
  • Artificial Intelligence A Modern Approach
  • (Chapter 20)
  • Juntae Kim
  • Department of Computer Engineering
  • Dongguk University

2
The Brain
  • The brain consists of billions of neurons
  • Structure of a neuron
  • Dendrites: input fibers
  • Axon: a single long output fiber
  • Synapse: connecting junction - 10 to 100,000
    connections per cell

3
The Neuron
  • Propagation of signals - electrochemical
  • 1. Chemical transmitters enter through the
    dendrites
  • 2. The cell body's electrical potential increases
  • 3. When the potential reaches a threshold, an
    electrical pulse is sent down the axon
  • Changes in structure
  • Building new connections
  • Plasticity
  • Long-term changes in the strength of connections
  • Changes in response to the pattern of stimulation
  • Learning

4
Brain vs. Computer
  • Speed
  • Brain: 10^-3 sec/cycle, Computer: 10^-8 sec/cycle
  • Parallelism
  • Brain: massively parallel computation
  • Computer: sequential computation
  • Fault tolerance
  • Brain: cells die with no effect on the overall
    function
  • Computer: an error in one bit causes total failure
  • Learning
  • Brain: the connections change
  • Computer: no change

5
A Unit in Neural Networks
  • A computational model of the neuron
  • Link: connects units
  • Weight: a number associated with each link
  • Activation function g: input → output

6
Activation Function
  • Activation functions: step, sigmoid

7
Moving Threshold
  • Set every unit's threshold T to 0
  • by adding an extra input a0 = -1 with weight W0i = T
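
A minimal Python sketch of such a unit, with the step and sigmoid activations and the threshold folded into weight w0 on a fixed input a0 = -1 (function names are illustrative, not from the slides):

import math

def step(x):
    # Step activation: output 1 once the weighted sum exceeds 0
    return 1 if x > 0 else 0

def sigmoid(x):
    # Sigmoid activation: a smooth, differentiable version of the step
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=step):
    # weights[0] holds the old threshold T, paired with the fixed input -1
    activations = [-1.0] + list(inputs)
    return g(sum(w * a for w, a in zip(weights, activations)))

With g = step this is the threshold unit of the previous slides; swapping in sigmoid gives the smooth unit used later for multilayer networks.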

8
Function of a Unit
  • Neural units as logic gates
  • Inputs: 0 or 1
  • Activation function: step
  • O = 1 if
  • I1·w1 + I2·w2 - 1.5 > 0
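
With binary inputs and weights w1 = w2 = 1, the 1.5 threshold makes the unit an AND gate; a quick self-contained check:

def step(x):
    return 1 if x > 0 else 0

def and_gate(i1, i2):
    # O = 1 iff I1*w1 + I2*w2 - 1.5 > 0 with w1 = w2 = 1
    return step(i1 * 1.0 + i2 * 1.0 - 1.5)

# Truth table: only (1, 1) fires
print([and_gate(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 0, 0, 1]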

9
Perceptrons
  • A single-layer feed-forward network with step
    function

10
Function of a Perceptron
  • A perceptron represents linearly separable
    functions
  • O = 1 if W·I > 0
  • Input space is divided by the plane W·I = 0
  • Example
  • T = 1.5 (W0 = 1.5), W1 = 1.0, W2 = 1.0
  • Output = 1 if -1.5 + I1 + I2 > 0
  • i.e. I2 > -I1 + 1.5
  • Ex) I = (1, 1)
  • → 1.0·1 + 1.0·1 + (-1.5)·1 = 0.5 > 0
  • → O = 1
  • Function: a line (plane)

11
Perceptron Learning
  • Learning
  • Adjusting weights to reduce the error for (I, T)
  • Error = T (target output) - O (perceptron output)
  • If error > 0: increase Wj if Ij > 0,
    decrease Wj if Ij < 0
  • else: decrease Wj if Ij > 0,
    increase Wj if Ij < 0
  • Wj ← Wj + α (T - O) Ij
  • Learning = adjusting the weights in (w1·I1 + w2·I2
    - w0 > 0)
  • Moving the line (plane)

12
Perceptron Learning
  • Example (α = 0.5)
  • Current weights: w1 = 1, w2 = 1, w0 = 1
  • Decision: I1 + I2 - 1 > 0
  • Training example: (1.0, 0.5) → 0
  • O = step(1·1.0 + 1·0.5 - 1·1) = step(0.5) = 1
  • w1 = 1 + 0.5·(0-1)·1.0 = 0.5
  • w2 = 1 + 0.5·(0-1)·0.5 = 0.75
  • w0 = 1 + 0.5·(0-1)·(-1.0) = 1.5
  • After learning: w1 = 0.5, w2 = 0.75, w0 = 1.5
  • 0.5 I1 + 0.75 I2 - 1.5 > 0
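
A sketch of the update rule Wj ← Wj + α (T - O) Ij that reproduces these numbers (the threshold is treated as weight w0 on the fixed input a0 = -1, as on slide 7; names are illustrative):

def step(x):
    return 1 if x > 0 else 0

def perceptron_update(w, x, target, alpha=0.5):
    # One learning step on example (x, target); w[0] pairs with the bias input -1
    inputs = [-1.0] + list(x)
    output = step(sum(wj * ij for wj, ij in zip(w, inputs)))
    return [wj + alpha * (target - output) * ij for wj, ij in zip(w, inputs)]

# Slide 12: w0 = w1 = w2 = 1, example (1.0, 0.5) -> 0, alpha = 0.5
print(perceptron_update([1.0, 1.0, 1.0], [1.0, 0.5], 0))   # [1.5, 0.5, 0.75]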

13
Perceptron Learning
  • Learning curves
  • Majority function (linearly separable):
    the perceptron learns it well
  • WillWait problem (not linearly separable):
    the perceptron does poorly

14
Expressiveness of a Perceptron
  • A perceptron can represent only linear separators
  • Linearly separable: (a), (b)
  • Not linearly separable: (c)

15
Multilayer Networks
  • Multiple layers, sigmoid activation function

16
Minimizing Error
  • Error E is a function of W
  • If E = f(w): how do we minimize E?
  • If E = f(w1, w2, ..., wn): how do we minimize E?

17
Minimizing Error
  • Example
  • E = W1^2 + W2^2
  • ∇E = (∂E/∂W1, ∂E/∂W2) = (2W1, 2W2)
  • From (1, 1), move in the direction (-2, -2)
  • From (1, 0), move in the direction (-2, 0)
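
A few steps of gradient descent on this example, moving opposite ∇E (a sketch; the step size 0.1 is an assumption, not from the slides):

def grad_E(w1, w2):
    # E = w1^2 + w2^2, so the gradient is (2*w1, 2*w2)
    return 2 * w1, 2 * w2

w, eta = (1.0, 1.0), 0.1
for _ in range(5):
    g = grad_E(*w)
    w = (w[0] - eta * g[0], w[1] - eta * g[1])   # first move is toward (-2, -2)
print(w)   # shrinks toward the minimum at (0, 0)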

18
Learning in Multilayer Networks
  • Minimizing E in output layer

19
Learning in Multilayer Networks
  • Error propagation to hidden layer

20
Backpropagation Learning
  • Learning (training)
  • 1. Decide the number of units and connections
  • 2. Initialize the weights
  • 3. Train the weights using examples
  • For an example (I, T):
  • Compute the outputs
  • Compute the error in the output layer: Δi
  • Adjust the hidden-to-output weights Wji
  • Compute the error in the hidden layer: Δj
  • Adjust the input-to-hidden weights Wkj

21
Backpropagation Learning
  • Algorithm

22
Backpropagation Learning
  • Algorithm (for sigmoid activation function)

For each example (I, T):
  compute the outputs o(I)
  For each output unit i:  Δi ← (ti - oi) · oi · (1 - oi)
  For each hidden unit j:  Δj ← (Σi wji Δi) · aj · (1 - aj)
  wji ← wji + α · aj · Δi
  wkj ← wkj + α · Ik · Δj
23
Example
  • Initial weights 0.0, learning rate α = 1.0,
    example (I1, I2, T) = (0.4, 0.8, 0)
  • h1 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
  • h2 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
  • o1 = sig(0.0·0.5 + 0.0·0.5 + 0.0·1.0) = 0.5
  • Δ21 = (0 - 0.5)·0.5·(1 - 0.5) = -0.125
  • Δ11 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
  • Δ12 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
  • w201 = 0.0 + 1.0·1.0·(-0.125) = -0.125
  • w211 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
  • w221 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
  • w101 = 0.0 + 1.0·1.0·0 = 0.0
  • w111 = 0.0 + 1.0·0.4·0 = 0.0

[Network diagram: inputs I1 = 0.4 and I2 = 0.8, hidden units h1 = h2 = 0.5,
output o1 = 0.5, with weights w111, w112, w211, w221 labeled on the links]
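A sketch of this single training step in Python for the 2-2-1 network (all weights initialized to 0.0, α = 1.0, bias input fixed at 1.0; variable names are illustrative):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

alpha = 1.0
x, target = [0.4, 0.8], 0.0

# Weights all start at 0.0; index 0 is the bias weight (bias input fixed at 1.0)
w_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # inputs -> h1, inputs -> h2
w_out = [0.0, 0.0, 0.0]                         # bias, h1, h2 -> o1

# Forward pass
h = [sigmoid(wj[0] * 1.0 + wj[1] * x[0] + wj[2] * x[1]) for wj in w_hidden]
o = sigmoid(w_out[0] * 1.0 + w_out[1] * h[0] + w_out[2] * h[1])

# Backward pass: deltas as on slide 22
delta_o = (target - o) * o * (1 - o)                                        # -0.125
delta_h = [w_out[j + 1] * delta_o * h[j] * (1 - h[j]) for j in range(2)]    # [0, 0]

# Weight updates: w <- w + alpha * input * delta
w_out = [w_out[0] + alpha * 1.0 * delta_o,
         w_out[1] + alpha * h[0] * delta_o,
         w_out[2] + alpha * h[1] * delta_o]
w_hidden = [[wj[0] + alpha * 1.0 * delta_h[j],
             wj[1] + alpha * x[0] * delta_h[j],
             wj[2] + alpha * x[1] * delta_h[j]] for j, wj in enumerate(w_hidden)]

print(h, o, delta_o)   # [0.5, 0.5] 0.5 -0.125
print(w_out)           # [-0.125, -0.0625, -0.0625]
print(w_hidden)        # unchanged: all zeros

Only the hidden-to-output weights move on this first step, because each hidden delta is multiplied by a hidden-to-output weight that is still zero.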
24
Example: Restaurant domain
  • Learning curve
  • 100 training examples, MLP with 4 hidden units

25
Applications
  • Handwritten character recognition
  • Zip code reader (Le Cun, 1989)
  • Input: 16x16 array of pixels
  • Output: digits 0 - 9
  • Network: 256 x 768 x 192 x 30 x 10
  • Hidden units are grouped for feature detection
  • Training set: 7300 examples
  • 99% accuracy after rejecting 12% on 2000 examples

26
Applications
  • Autonomous vehicle
  • ALVINN (Pomerleau, 1993)
  • Input: 30x32 pixels from video
  • Output: 30 steering directions
  • Network: 960 x 5 x 30
  • Training set: examples from 5 minutes of human driving
  • Drives at 70 mph on public highways

27
Bayesian Learning
  • Concept
  • Data (evidence): d = d1, d2, ..., dn
  • Hypotheses (theories of the domain): H = h1, h2, ...
    (e.g. a Bayesian net)
  • Learning: given d, find the probabilities of H
  • Full Bayesian learning
  • Learning: P(hi | d) ∝ P(d | hi) P(hi)
  • P(hi): hypothesis prior probability
  • P(d | hi): likelihood of the data under
    each hypothesis
  • Prediction: P(X | d) = Σi P(X | hi) P(hi | d)

28
Bayesian Learning - Example
  • There are 5 kinds of candy bags
  • h1: 100% cherry
  • h2: 75% cherry + 25% lime
  • h3: 50% cherry + 50% lime
  • h4: 25% cherry + 75% lime
  • h5: 100% lime
  • P(hi) = (0.1, 0.2, 0.4, 0.2, 0.1)
  • Data (evidence): 3 limes
  • P(h1 | d) ∝ P(d | h1) P(h1) = 0.0^3 · 0.1
    = 0 → 0
  • P(h2 | d) ∝ P(d | h2) P(h2) = 0.25^3 · 0.2
    = 0.003125 → 0.01316
  • P(h3 | d) ∝ P(d | h3) P(h3) = 0.5^3 · 0.4
    = 0.05 → 0.2105
  • P(h4 | d) ∝ P(d | h4) P(h4) = 0.75^3 · 0.2
    = 0.084375 → 0.3553
  • P(h5 | d) ∝ P(d | h5) P(h5) = 1.0^3 · 0.1
    = 0.1 → 0.4211
  • (the value after → is the posterior after normalization)

29
Bayesian Learning - Example
  • P(next is lime | d = (lime, lime, lime)) = Σi P(next
    is lime | hi) P(hi | d)
  • = P(lime | h1) P(h1 | d) + P(lime | h2) P(h2 | d) + ...
  • = 0·0 + 0.25·0.01316 + 0.5·0.2105 + 0.75·0.3553
    + 1·0.4211
  • = 0.7961
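
A short sketch reproducing the posterior and the prediction above (the last line picks out the single most probable hypothesis, used for MAP prediction on the next slide):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)

# Posterior after 3 limes: P(hi | d) is proportional to P(d | hi) P(hi)
unnorm = [p ** 3 * prior for p, prior in zip(p_lime, priors)]
posterior = [u / sum(unnorm) for u in unnorm]
print([round(p, 4) for p in posterior])   # [0.0, 0.0132, 0.2105, 0.3553, 0.4211]

# Full Bayesian prediction: P(next is lime | d) = sum_i P(lime | hi) P(hi | d)
print(round(sum(p * q for p, q in zip(p_lime, posterior)), 4))   # 0.7961

# Single most probable hypothesis: h5
print(1 + max(range(5), key=lambda i: posterior[i]))   # 5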

30
MAP and ML
  • Maximum a Posteriori (MAP) hypothesis
  • Make predictions based on the single most probable h
  • Learning: hMAP = argmax_hi P(d | hi) P(hi)
  • Prediction: P(X | d) ≈ P(X | hMAP)
  • hMAP = argmax_hi P(d | hi) P(hi) = h5
  • P(next is lime | d = (lime, lime, lime)) ≈ P(lime |
    hMAP) = P(lime | h5) = 1.0
  • Maximum Likelihood (ML) hypothesis
  • Assume a uniform prior P(H)
  • Learning: hML = argmax_hi P(d | hi)
  • Prediction: P(X | d) ≈ P(X | hML)
  • hML = argmax_hi P(d | hi) = h5
  • P(next is lime | d = (lime, lime, lime)) ≈ P(lime |
    hML) = P(lime | h5) = 1.0

31
ML learning in Bayesian Net
  • We know
  • Cherry and lime candies
  • Red and green wrappers
  • The color of the wrapper depends on the flavor of the
    candy
  • We want to learn
  • The conditional probabilities (θ, θ1, θ2) for the
    above Bayesian network

32
ML learning in Bayesian Net
  • Learning θ
  • Data: c cherries, l limes
  • Likelihood: P(d | hθ) = θ^c (1-θ)^l
  • Log likelihood: L(d | hθ) = c log θ + l log (1-θ)
  • Find the maximum
  • Procedure
  • Find an expression for the likelihood as a function of
    the parameters
  • Take the derivative of the log likelihood
  • Find the parameters by setting the derivative to 0
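
Carrying out that last step explicitly (the standard closed form):

  dL/dθ = c/θ - l/(1-θ) = 0   →   θ = c / (c + l)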

33
ML learning in Bayesian Net
  • Learning θ, θ1, θ2
  • Wrappers are selected probabilistically
  • P(red ∧ cherry) = P(cherry) P(red | cherry) = θ θ1
  • Data: c cherries, l limes; rc cherries in red
    wrappers, gc cherries in green, rl limes in red,
    gl limes in green
  • Likelihood: P(d | hθ,θ1,θ2) = θ^c (1-θ)^l · θ1^rc
    (1-θ1)^gc · θ2^rl (1-θ2)^gl
  • Log likelihood: L(d | hθ,θ1,θ2) = c log θ + l log
    (1-θ)
  • + rc log θ1 + gc log (1-θ1)
  • + rl log θ2 + gl log (1-θ2)
  • Find the maximum
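
Setting each partial derivative of L to zero decouples into three single-parameter problems of the form above, so the maximum-likelihood values are observed frequencies; a small sketch with made-up counts (function name and numbers are illustrative):

def ml_params(c, l, rc, gc, rl, gl):
    # Closed-form ML estimates from the counts defined above
    theta = c / (c + l)          # P(flavor = cherry)
    theta1 = rc / (rc + gc)      # P(wrapper = red | cherry)
    theta2 = rl / (rl + gl)      # P(wrapper = red | lime)
    return theta, theta1, theta2

# Made-up counts: 60 cherries (45 red, 15 green), 40 limes (10 red, 30 green)
print(ml_params(60, 40, 45, 15, 10, 30))   # (0.6, 0.75, 0.25)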

34
Naïve Bayes Model
  • Learn P(C = T) = θ, P(xi = T | C = T) = θi1,
    P(xi = T | C = F) = θi2
  • by the Maximum Likelihood method
  • Then

[Naïve Bayes network: class node C with children x1, x2, ..., xn]

P(C | x1, ..., xn) = α P(C) Πi P(xi | C)
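
A minimal sketch of using this formula for prediction, assuming the parameters have already been learned; all names and numbers here are illustrative:

def naive_bayes_posterior(theta, theta1, theta2, x):
    # P(C = T | x) for boolean features x, via alpha * P(C) * prod_i P(xi | C)
    p_true, p_false = theta, 1.0 - theta
    for xi, p1, p2 in zip(x, theta1, theta2):
        p_true *= p1 if xi else (1.0 - p1)
        p_false *= p2 if xi else (1.0 - p2)
    return p_true / (p_true + p_false)   # normalization plays the role of alpha

# Two features; illustrative parameters P(C=T), P(xi=T|C=T), P(xi=T|C=F)
print(naive_bayes_posterior(0.6, [0.9, 0.2], [0.3, 0.7], [1, 0]))   # about 0.92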
35
Instance-Based Learning
  • IBL, or memory-based learning
  • Constructs the hypothesis directly from the training
    instances
  • xs ← most similar (stored) instances to x
  • f(x) ← average of f(xs)

[Plot: query point x and its predicted value f(x) among the stored instances]
36
K-Nearest Neighbor Method
  • Given examples ⟨xi, f(xi)⟩
  • Find the k instances xi with minimum D(x, xi)
  • D(x, xi): distance between x and xi (e.g.
    Euclidean distance ||x - xi||)
  • Compute f(x) as a (weighted) average of the f(xi) of
    the k neighbors

37
K-Nearest Neighbor Method
  • dgender(A, B) = |A - B| (female = 0, male = 1)
  • dage(A, B) = |A - B| / max difference
  • dsalary(A, B) = |A - B| / max difference
  • d = dgender + dage + dsalary
  • k = 3 → the 3 nearest neighbors are instances 4, 3, 5
  • f(x) = (w4·1 + w3·1 + w5·0) / (w4 + w3 + w5) = 0.71
    → yes, 71%
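
A sketch of a distance-weighted k-NN prediction in this style; the customer table behind the 0.71 is not on this slide, so the data below and the 1/distance weighting are assumptions:

def knn_predict(query, examples, k=3):
    # examples: list of (attributes, target); distance = sum of per-attribute distances
    def dist(a, b):
        # gender is 0/1; age and salary are assumed already scaled to [0, 1]
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    weights = [1.0 / (dist(query, ex[0]) + 1e-9) for ex in neighbours]   # assumed w = 1/d
    return sum(w * ex[1] for w, ex in zip(weights, neighbours)) / sum(weights)

# Illustrative stored examples: ((gender, age, salary), buys?) with scaled attributes
data = [((0, 0.2, 0.3), 1), ((1, 0.5, 0.4), 1), ((0, 0.9, 0.8), 0), ((1, 0.1, 0.2), 0)]
print(knn_predict((0, 0.3, 0.35), data, k=3))   # about 0.89 -> "yes"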

38
K-Nearest Neighbor Method
  • Curse of Dimensionality
  • For high-dimensional data, neighbors can be far
    away
  • Number of data points N, dimension d, all values
    in [0, 1] → total volume 1
  • A neighborhood of size b (0 ≤ b ≤ 1) → volume b^d
  • To contain k data points, the neighborhood should
    occupy a k/N fraction of the total volume
  • → b^d = (k/N)·1, or b = (k/N)^(1/d)
  • Example
  • N = 1,000,000, d = 2, k = 10 → b ≈ 0.003
  • N = 1,000,000, d = 100, k = 10 → b ≈ 0.89

[Plot: neighborhood size b = (k/N)^(1/d) for k/N = 1/100, at d = 1, 2, 3]
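
Checking the two example figures from b = (k/N)^(1/d) (a quick sketch):

def neighborhood_size(n, d, k):
    # Size b of a neighborhood expected to hold k of n uniform points in [0, 1]^d
    return (k / n) ** (1.0 / d)

print(round(neighborhood_size(10**6, 2, 10), 3))     # 0.003
print(round(neighborhood_size(10**6, 100, 10), 2))   # 0.89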
39
Performance of BC, k-NN
  • Learning curve

[Learning curves: Naïve Bayes vs. k-NN]