Statistical Learning
1
Statistical Learning
  • Artificial Intelligence A Modern Approach
  • (Chapter 20)
  • Juntae Kim
  • Department of Computer Engineering
  • Dongguk University

2
The Brain
  • The brain consists of billions of neurons
  • Structure of a neuron
  • Dendrites: input fibers
  • Axon: a single long output fiber
  • Synapse: connecting junction - 10 to 100,000
    connections per cell

3
The Neuron
  • Propagation of signals - electrochemical
  • 1. Chemical transmitters enter through the
    dendrites
  • 2. The cell body's electrical potential increases
  • 3. When the potential reaches a threshold, an
    electrical pulse is sent down the axon
  • Changes in structure
  • Building new connections
  • Plasticity
  • Long-term changes in the strength of connections
  • Changes in response to the pattern of stimulation
  • Learning

4
Brain vs. Computer
  • Speed
  • Brain: 10^-3 sec/cycle, Computer: 10^-8 sec/cycle
  • Parallelism
  • Brain: massively parallel computation
  • Computer: sequential computation
  • Fault tolerance
  • Brain: cells die with no effect on the overall
    function
  • Computer: an error in one bit causes total failure
  • Learning
  • Brain: the connections change
  • Computer: no change

5
A Unit in Neural Networks
  • A computational model of the neuron
  • Link: connects units
  • Weight: a number associated with each link
  • Activation function g: input → output

6
Activation Function
  • Activation functions: step, sigmoid

7
Moving Threshold
  • Set every unit's threshold T to 0
  • by adding an extra input a0 = -1 with weight W0i = T
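
A minimal Python sketch of such a unit, with the step and sigmoid activations and the threshold folded into weight w0 on a fixed input a0 = -1 (function names are illustrative, not from the slides):

import math

def step(x):
    # Step activation: output 1 once the weighted sum exceeds 0
    return 1 if x > 0 else 0

def sigmoid(x):
    # Sigmoid activation: a smooth, differentiable version of the step
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=step):
    # weights[0] holds the old threshold T, paired with the fixed input -1
    activations = [-1.0] + list(inputs)
    return g(sum(w * a for w, a in zip(weights, activations)))

With g = step this is the threshold unit of the previous slides; swapping in sigmoid gives the smooth unit used later for multilayer networks.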

8
Function of a Unit
  • Neural units as logic gates
  • Inputs: 0 or 1
  • Activation function: step
  • O = 1 if
  • I1·w1 + I2·w2 - 1.5 > 0
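
With binary inputs and weights w1 = w2 = 1, the 1.5 threshold makes the unit an AND gate; a quick self-contained check:

def step(x):
    return 1 if x > 0 else 0

def and_gate(i1, i2):
    # O = 1 iff I1*w1 + I2*w2 - 1.5 > 0 with w1 = w2 = 1
    return step(i1 * 1.0 + i2 * 1.0 - 1.5)

# Truth table: only (1, 1) fires
print([and_gate(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 0, 0, 1]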

9
Perceptrons
  • A single-layer feed-forward network with step
    function

10
Function of a Perceptron
  • A perceptron represents linearly separable
    functions
  • O = 1 if W·I > 0
  • Input space is divided by the plane W·I = 0
  • Example
  • T = 1.5 (W0 = 1.5), W1 = 1.0, W2 = 1.0
  • Output = 1 if -1.5 + I1 + I2 > 0
  • i.e. I2 > -I1 + 1.5
  • Ex) I = (1, 1)
  • → 1.0·1 + 1.0·1 + (-1.5)·1 = 0.5 > 0
  • → O = 1
  • Function: a line (plane)

11
Perceptron Learning
  • Learning
  • Adjusting weights to reduce the error for (I, T)
  • Error = T (target output) - O (perceptron output)
  • If error > 0: increase Wj if Ij > 0,
    decrease Wj if Ij < 0
  • else: decrease Wj if Ij > 0,
    increase Wj if Ij < 0
  • Wj ← Wj + α (T - O) Ij
  • Learning = adjusting the weights in (w1·I1 + w2·I2
    - w0 > 0)
  • Moving the line (plane)

12
Perceptron Learning
  • Example (α = 0.5)
  • Current weights: w1 = 1, w2 = 1, w0 = 1
  • Decision: I1 + I2 - 1 > 0
  • Training example: (1.0, 0.5) → 0
  • O = step(1·1.0 + 1·0.5 - 1·1) = step(0.5) = 1
  • w1 = 1 + 0.5·(0-1)·1.0 = 0.5
  • w2 = 1 + 0.5·(0-1)·0.5 = 0.75
  • w0 = 1 + 0.5·(0-1)·(-1.0) = 1.5
  • After learning: w1 = 0.5, w2 = 0.75, w0 = 1.5
  • 0.5 I1 + 0.75 I2 - 1.5 > 0
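
A sketch of the update rule Wj ← Wj + α (T - O) Ij that reproduces these numbers (the threshold is treated as weight w0 on the fixed input a0 = -1, as on slide 7; names are illustrative):

def step(x):
    return 1 if x > 0 else 0

def perceptron_update(w, x, target, alpha=0.5):
    # One learning step on example (x, target); w[0] pairs with the bias input -1
    inputs = [-1.0] + list(x)
    output = step(sum(wj * ij for wj, ij in zip(w, inputs)))
    return [wj + alpha * (target - output) * ij for wj, ij in zip(w, inputs)]

# Slide 12: w0 = w1 = w2 = 1, example (1.0, 0.5) -> 0, alpha = 0.5
print(perceptron_update([1.0, 1.0, 1.0], [1.0, 0.5], 0))   # [1.5, 0.5, 0.75]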

13
Perceptron Learning
  • Learning curves
  • Majority function (linearly separable):
    the perceptron learns it well
  • WillWait problem (not linearly separable):
    the perceptron does poorly

14
Expressiveness of a Perceptron
  • A perceptron can represent only linear separators
  • Linearly separable: (a), (b)
  • Not linearly separable: (c)

15
Multilayer Networks
  • Multiple layers, sigmoid activation function

16
Minimizing Error
  • Error E is a function of W
  • If E = f(w): how do we minimize E?
  • If E = f(w1, w2, ..., wn): how do we minimize E?

17
Minimizing Error
  • Example
  • E = W1^2 + W2^2
  • ∇E = (∂E/∂W1, ∂E/∂W2) = (2W1, 2W2)
  • From (1, 1), move in the direction (-2, -2)
  • From (1, 0), move in the direction (-2, 0)
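
A few steps of gradient descent on this example, moving opposite ∇E (a sketch; the step size 0.1 is an assumption, not from the slides):

def grad_E(w1, w2):
    # E = w1^2 + w2^2, so the gradient is (2*w1, 2*w2)
    return 2 * w1, 2 * w2

w, eta = (1.0, 1.0), 0.1
for _ in range(5):
    g = grad_E(*w)
    w = (w[0] - eta * g[0], w[1] - eta * g[1])   # first move is toward (-2, -2)
print(w)   # shrinks toward the minimum at (0, 0)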

18
Learning in Multilayer Networks
  • Minimizing E in output layer

19
Learning in Multilayer Networks
  • Error propagation to hidden layer

20
Backpropagation Learning
  • Learning (training)
  • 1. Decide the number of units and connections
  • 2. Initialize the weights
  • 3. Train the weights using examples
  • For an example (I, T):
  • Compute the outputs
  • Compute the error in the output layer: Δi
  • Adjust the hidden-to-output weights Wji
  • Compute the error in the hidden layer: Δj
  • Adjust the input-to-hidden weights Wkj

21
Backpropagation Learning
  • Algorithm

22
Backpropagation Learning
  • Algorithm (for sigmoid activation function)

For each example (I, T):
  compute the outputs o(I)
  For each output unit i:  Δi ← (ti - oi) · oi · (1 - oi)
  For each hidden unit j:  Δj ← (Σi wji Δi) · aj · (1 - aj)
  wji ← wji + α · aj · Δi
  wkj ← wkj + α · Ik · Δj
23
Example
  • Initial weights 0.0, learning rate α = 1.0,
    example (I1, I2, T) = (0.4, 0.8, 0)
  • h1 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
  • h2 = sig(0.0·0.4 + 0.0·0.8 + 0.0·1.0) = 0.5
  • o1 = sig(0.0·0.5 + 0.0·0.5 + 0.0·1.0) = 0.5
  • Δ21 = (0 - 0.5)·0.5·(1 - 0.5) = -0.125
  • Δ11 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
  • Δ12 = 0.0·(-0.125)·0.5·(1 - 0.5) = 0
  • w201 = 0.0 + 1.0·1.0·(-0.125) = -0.125
  • w211 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
  • w221 = 0.0 + 1.0·0.5·(-0.125) = -0.0625
  • w101 = 0.0 + 1.0·1.0·0 = 0.0
  • w111 = 0.0 + 1.0·0.4·0 = 0.0

[Network diagram: inputs I1 = 0.4 and I2 = 0.8, hidden units h1 = h2 = 0.5,
output o1 = 0.5, with weights w111, w112, w211, w221 labeled on the links]
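A sketch of this single training step in Python for the 2-2-1 network (all weights initialized to 0.0, α = 1.0, bias input fixed at 1.0; variable names are illustrative):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

alpha = 1.0
x, target = [0.4, 0.8], 0.0

# Weights all start at 0.0; index 0 is the bias weight (bias input fixed at 1.0)
w_hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # inputs -> h1, inputs -> h2
w_out = [0.0, 0.0, 0.0]                         # bias, h1, h2 -> o1

# Forward pass
h = [sigmoid(wj[0] * 1.0 + wj[1] * x[0] + wj[2] * x[1]) for wj in w_hidden]
o = sigmoid(w_out[0] * 1.0 + w_out[1] * h[0] + w_out[2] * h[1])

# Backward pass: deltas as on slide 22
delta_o = (target - o) * o * (1 - o)                                        # -0.125
delta_h = [w_out[j + 1] * delta_o * h[j] * (1 - h[j]) for j in range(2)]    # [0, 0]

# Weight updates: w <- w + alpha * input * delta
w_out = [w_out[0] + alpha * 1.0 * delta_o,
         w_out[1] + alpha * h[0] * delta_o,
         w_out[2] + alpha * h[1] * delta_o]
w_hidden = [[wj[0] + alpha * 1.0 * delta_h[j],
             wj[1] + alpha * x[0] * delta_h[j],
             wj[2] + alpha * x[1] * delta_h[j]] for j, wj in enumerate(w_hidden)]

print(h, o, delta_o)   # [0.5, 0.5] 0.5 -0.125
print(w_out)           # [-0.125, -0.0625, -0.0625]
print(w_hidden)        # unchanged: all zeros

Only the hidden-to-output weights move on this first step, because each hidden delta is multiplied by a hidden-to-output weight that is still zero.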
24
Example: Restaurant domain
  • Learning curve
  • 100 training examples, MLP with 4 hidden units

25
Applications
  • Handwritten character recognition
  • Zip code reader (Le Cun, 1989)
  • Input: 16x16 array of pixels
  • Output: digits 0 - 9
  • Network: 256 x 768 x 192 x 30 x 10
  • Hidden units are grouped for feature detection
  • Training set: 7300 examples
  • 99% accuracy after rejecting 12% on 2000 examples

26
Applications
  • Autonomous vehicle
  • ALVINN (Pomerleau, 1993)
  • Input: 30x32 pixels from video
  • Output: 30 steering directions
  • Network: 960 x 5 x 30
  • Training set: examples from 5 minutes of human driving
  • Drives at 70 mph on public highways

27
Bayesian Learning
  • Concept
  • Data (evidence): d = d1, d2, ..., dn
  • Hypotheses (theories of the domain): H = h1, h2, ...
    (e.g. a Bayesian net)
  • Learning: given d, find the probabilities of H
  • Full Bayesian learning
  • Learning: P(hi | d) ∝ P(d | hi) P(hi)
  • P(hi): hypothesis prior probability
  • P(d | hi): likelihood of the data under
    each hypothesis
  • Prediction: P(X | d) = Σi P(X | hi) P(hi | d)

28
Bayesian Learning - Example
  • There are 5 kinds of candy bags
  • h1: 100% cherry
  • h2: 75% cherry + 25% lime
  • h3: 50% cherry + 50% lime
  • h4: 25% cherry + 75% lime
  • h5: 100% lime
  • P(hi) = (0.1, 0.2, 0.4, 0.2, 0.1)
  • Data (evidence): 3 limes
  • P(h1 | d) ∝ P(d | h1) P(h1) = 0.0^3 · 0.1
    = 0 → 0
  • P(h2 | d) ∝ P(d | h2) P(h2) = 0.25^3 · 0.2
    = 0.003125 → 0.01316
  • P(h3 | d) ∝ P(d | h3) P(h3) = 0.5^3 · 0.4
    = 0.05 → 0.2105
  • P(h4 | d) ∝ P(d | h4) P(h4) = 0.75^3 · 0.2
    = 0.084375 → 0.3553
  • P(h5 | d) ∝ P(d | h5) P(h5) = 1.0^3 · 0.1
    = 0.1 → 0.4211
  • (the value after → is the posterior after normalization)

29
Bayesian Learning - Example
  • P(next is lime | d = (lime, lime, lime)) = Σi P(next
    is lime | hi) P(hi | d)
  • = P(lime | h1) P(h1 | d) + P(lime | h2) P(h2 | d) + ...
  • = 0·0 + 0.25·0.01316 + 0.5·0.2105 + 0.75·0.3553
    + 1·0.4211
  • = 0.7961
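
A short sketch reproducing the posterior and the prediction above (the last line picks out the single most probable hypothesis, used for MAP prediction on the next slide):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)

# Posterior after 3 limes: P(hi | d) is proportional to P(d | hi) P(hi)
unnorm = [p ** 3 * prior for p, prior in zip(p_lime, priors)]
posterior = [u / sum(unnorm) for u in unnorm]
print([round(p, 4) for p in posterior])   # [0.0, 0.0132, 0.2105, 0.3553, 0.4211]

# Full Bayesian prediction: P(next is lime | d) = sum_i P(lime | hi) P(hi | d)
print(round(sum(p * q for p, q in zip(p_lime, posterior)), 4))   # 0.7961

# Single most probable hypothesis: h5
print(1 + max(range(5), key=lambda i: posterior[i]))   # 5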

30
MAP and ML
  • Maximum a Posteriori (MAP) hypothesis
  • Make predictions based on the single most probable h
  • Learning: hMAP = argmax_hi P(d | hi) P(hi)
  • Prediction: P(X | d) ≈ P(X | hMAP)
  • hMAP = argmax_hi P(d | hi) P(hi) = h5
  • P(next is lime | d = (lime, lime, lime)) ≈ P(lime |
    hMAP) = P(lime | h5) = 1.0
  • Maximum Likelihood (ML) hypothesis
  • Assume a uniform prior P(H)
  • Learning: hML = argmax_hi P(d | hi)
  • Prediction: P(X | d) ≈ P(X | hML)
  • hML = argmax_hi P(d | hi) = h5
  • P(next is lime | d = (lime, lime, lime)) ≈ P(lime |
    hML) = P(lime | h5) = 1.0

31
ML learning in Bayesian Net
  • We know
  • Cherry and lime candies
  • Red and green wrappers
  • The color of the wrapper depends on the flavor of the
    candy
  • We want to learn
  • The conditional probabilities (θ, θ1, θ2) for the
    above Bayesian network

32
ML learning in Bayesian Net
  • Learning θ
  • Data: c cherries, l limes
  • Likelihood: P(d | hθ) = θ^c (1-θ)^l
  • Log likelihood: L(d | hθ) = c log θ + l log (1-θ)
  • Find the maximum
  • Procedure
  • Find an expression for the likelihood as a function of
    the parameters
  • Take the derivative of the log likelihood
  • Find the parameters by setting the derivative to 0
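
Carrying out that last step explicitly (the standard closed form):

  dL/dθ = c/θ - l/(1-θ) = 0   →   θ = c / (c + l)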

33
ML learning in Bayesian Net
  • Learning θ, θ1, θ2
  • Wrappers are selected probabilistically
  • P(red ∧ cherry) = P(cherry) P(red | cherry) = θ θ1
  • Data: c cherries, l limes; rc cherries in red
    wrappers, gc cherries in green, rl limes in red,
    gl limes in green
  • Likelihood: P(d | hθ,θ1,θ2) = θ^c (1-θ)^l · θ1^rc
    (1-θ1)^gc · θ2^rl (1-θ2)^gl
  • Log likelihood: L(d | hθ,θ1,θ2) = c log θ + l log
    (1-θ)
  • + rc log θ1 + gc log (1-θ1)
  • + rl log θ2 + gl log (1-θ2)
  • Find the maximum
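
Setting each partial derivative of L to zero decouples into three single-parameter problems of the form above, so the maximum-likelihood values are observed frequencies; a small sketch with made-up counts (function name and numbers are illustrative):

def ml_params(c, l, rc, gc, rl, gl):
    # Closed-form ML estimates from the counts defined above
    theta = c / (c + l)          # P(flavor = cherry)
    theta1 = rc / (rc + gc)      # P(wrapper = red | cherry)
    theta2 = rl / (rl + gl)      # P(wrapper = red | lime)
    return theta, theta1, theta2

# Made-up counts: 60 cherries (45 red, 15 green), 40 limes (10 red, 30 green)
print(ml_params(60, 40, 45, 15, 10, 30))   # (0.6, 0.75, 0.25)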

34
Naïve Bayes Model
  • Learn P(C = T) = θ, P(xi = T | C = T) = θi1,
    P(xi = T | C = F) = θi2
  • by the Maximum Likelihood method
  • Then

[Naïve Bayes network: class node C with children x1, x2, ..., xn]

P(C | x1, ..., xn) = α P(C) Πi P(xi | C)
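
A minimal sketch of using this formula for prediction, assuming the parameters have already been learned; all names and numbers here are illustrative:

def naive_bayes_posterior(theta, theta1, theta2, x):
    # P(C = T | x) for boolean features x, via alpha * P(C) * prod_i P(xi | C)
    p_true, p_false = theta, 1.0 - theta
    for xi, p1, p2 in zip(x, theta1, theta2):
        p_true *= p1 if xi else (1.0 - p1)
        p_false *= p2 if xi else (1.0 - p2)
    return p_true / (p_true + p_false)   # normalization plays the role of alpha

# Two features; illustrative parameters P(C=T), P(xi=T|C=T), P(xi=T|C=F)
print(naive_bayes_posterior(0.6, [0.9, 0.2], [0.3, 0.7], [1, 0]))   # about 0.92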
35
Instance-Based Learning
  • IBL, or memory-based learning
  • Constructs the hypothesis directly from the training
    instances
  • xs ← most similar (stored) instances to x
  • f(x) ← average of f(xs)

[Plot: query point x and its predicted value f(x) among the stored instances]
36
K-Nearest Neighbor Method
  • Given examples ⟨xi, f(xi)⟩
  • Find the k instances xi with minimum D(x, xi)
  • D(x, xi): distance between x and xi (e.g.
    Euclidean distance ||x - xi||)
  • Compute f(x) as a (weighted) average of the f(xi) of
    the k neighbors

37
K-Nearest Neighbor Method
  • dgender(A, B) = |A - B| (female = 0, male = 1)
  • dage(A, B) = |A - B| / max difference
  • dsalary(A, B) = |A - B| / max difference
  • d = dgender + dage + dsalary
  • k = 3 → the 3 nearest neighbors are instances 4, 3, 5
  • f(x) = (w4·1 + w3·1 + w5·0) / (w4 + w3 + w5) = 0.71
    → yes, 71%
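
A sketch of a distance-weighted k-NN prediction in this style; the customer table behind the 0.71 is not on this slide, so the data below and the 1/distance weighting are assumptions:

def knn_predict(query, examples, k=3):
    # examples: list of (attributes, target); distance = sum of per-attribute distances
    def dist(a, b):
        # gender is 0/1; age and salary are assumed already scaled to [0, 1]
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    neighbours = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    weights = [1.0 / (dist(query, ex[0]) + 1e-9) for ex in neighbours]   # assumed w = 1/d
    return sum(w * ex[1] for w, ex in zip(weights, neighbours)) / sum(weights)

# Illustrative stored examples: ((gender, age, salary), buys?) with scaled attributes
data = [((0, 0.2, 0.3), 1), ((1, 0.5, 0.4), 1), ((0, 0.9, 0.8), 0), ((1, 0.1, 0.2), 0)]
print(knn_predict((0, 0.3, 0.35), data, k=3))   # about 0.89 -> "yes"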

38
K-Nearest Neighbor Method
  • Curse of Dimensionality
  • For high-dimensional data, neighbors can be far
    away
  • Number of data points N, dimension d, all values
    in [0, 1] → total volume 1
  • A neighborhood of size b (0 ≤ b ≤ 1) → volume b^d
  • To contain k data points, the neighborhood should
    occupy a k/N fraction of the total volume
  • → b^d = (k/N)·1, or b = (k/N)^(1/d)
  • Example
  • N = 1,000,000, d = 2, k = 10 → b ≈ 0.003
  • N = 1,000,000, d = 100, k = 10 → b ≈ 0.89

[Plot: neighborhood size b = (k/N)^(1/d) for k/N = 1/100, at d = 1, 2, 3]
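
Checking the two example figures from b = (k/N)^(1/d) (a quick sketch):

def neighborhood_size(n, d, k):
    # Size b of a neighborhood expected to hold k of n uniform points in [0, 1]^d
    return (k / n) ** (1.0 / d)

print(round(neighborhood_size(10**6, 2, 10), 3))     # 0.003
print(round(neighborhood_size(10**6, 100, 10), 2))   # 0.89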
39
Performance of BC, k-NN
  • Learning curve

[Learning curves: Naïve Bayes vs. k-NN]