For Wednesday - PowerPoint PPT Presentation

Provided by: maryelai

1
For Wednesday
  • Read ch. 20, sections 1, 2, 5, and 7
  • No homework

2
Program 4
  • Any questions?

3
Learning mini-project
  • Worth 2 homeworks
  • Due next Monday
  • Foil6 is available in
    /home/mecalif/public/itk340/foil
  • A manual and sample data files are there as well.
  • Create a data file that will allow FOIL to learn
    rules for a sister/2 relation from background
    relations of parent/2, male/1, and female/1. You
    can look in the prolog folder of my 327 folder
    for sample data if you like.
  • Electronically submit your data file, which should
    be named sister.d, and turn in a hard copy of the
    rules FOIL learns.

4
Inductive Logic Programming
  • Representation is Horn clauses
  • Builds rules using background predicates
  • Rules are potentially much more expressive than
    attribute-value representations

5
Example Results
  • Rules for family relations from data of primitive
    or related predicates.
  • uncle(A,B) :- brother(A,C), parent(C,B).
  • uncle(A,B) :- husband(A,C), sister(C,D),
    parent(D,B).
  • Recursive list programs.
  • member(X, [X|Y]).
  • member(X, [Y|Z]) :- member(X, Z).

6
ILP
  • Goal is to induce a Horn-clause definition for
    some target predicate P given definitions of
    background predicates $Q_i$.
  • Goal is to find a syntactically simple definition
    D for P such that, given background predicate
    definitions B:
  • For every positive example $p_i$: $D \cup B \vdash p_i$
  • For every negative example $n_i$: $D \cup B \nvdash n_i$
  • Background definitions are provided either
  • Extensionally: list of ground tuples satisfying
    the predicate.
  • Intensionally: Prolog definition of the predicate.

7
Sequential Covering Algorithm
  • Let P be the set of positive examples
  • Until P is empty do
  • Learn a rule R that covers a large number of
    positives without covering any negatives.
  • Add R to the list of learned rules.
  • Remove positives covered by R from P
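
The covering loop above can be sketched in Python as follows. This is a minimal illustration, not FOIL itself: rules are modeled as plain Python predicates, and `toy_learn_rule` is a hypothetical single-rule learner (one attribute-equals-value test that excludes every negative) invented only to make the loop runnable.

```python
from typing import Callable, List, Set, Tuple

Example = Tuple[int, ...]
Rule = Callable[[Example], bool]  # a rule is modeled as a predicate on examples


def sequential_covering(positives: Set[Example], negatives: Set[Example],
                        learn_rule) -> List[Rule]:
    """Learn a rule, keep it, remove the positives it covers, repeat."""
    remaining = set(positives)
    rules: List[Rule] = []
    while remaining:
        rule = learn_rule(remaining, negatives)
        covered = {e for e in remaining if rule(e)}
        if not covered:  # added safeguard: stop if the learner makes no progress
            break
        rules.append(rule)
        remaining -= covered
    return rules


def toy_learn_rule(pos: Set[Example], neg: Set[Example]) -> Rule:
    """Hypothetical learner: the test 'attribute i == v' that covers no
    negatives and the largest number of remaining positives."""
    best, best_cover = None, 0
    for i, v in sorted({(i, e[i]) for e in pos for i in range(len(e))}):
        if any(n[i] == v for n in neg):
            continue  # this test would cover a negative example
        cover = sum(1 for e in pos if e[i] == v)
        if cover > best_cover:
            best, best_cover = (i, v), cover
    if best is None:
        return lambda e: False
    i, v = best
    return lambda e: e[i] == v


POSITIVES = {(1, 0), (1, 1), (0, 2)}
NEGATIVES = {(0, 0), (0, 1)}
rules = sequential_covering(POSITIVES, NEGATIVES, toy_learn_rule)
```

On this toy data the loop needs two rules: one test covers two of the positives, and a second test covers the remaining one.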

8
  • This is just an instance of the greedy algorithm
    for minimum set covering. It does not guarantee
    that a minimum number of rules is learned, but it
    tends to learn a reasonably small rule set.
  • Minimum set covering is an NP-hard problem, and
    the greedy algorithm is a common approximation
    algorithm.
  • Different methods use different strategies for
    learning a single rule.

9
Strategies for Learning a Single Rule
  • Top-Down (General to Specific)
  • Start with the most general (empty) rule.
  • Repeatedly add feature constraints that eliminate
    negatives while retaining positives.
  • Stop when only positives are covered.
  • Bottom-Up (Specific to General)
  • Start with a most specific rule (complete
    description of a single instance).
  • Repeatedly eliminate feature constraints in order
    to cover more positive examples.
  • Stop when further generalization results in
    covering negatives.
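
A minimal sketch of the top-down strategy for attribute-value data, under stated assumptions: examples are dictionaries, a rule is a dictionary of required attribute values, and the scoring heuristic (purity of the covered set, ties broken by positives kept) is an illustrative choice rather than the one any particular system uses.

```python
def learn_rule_top_down(positives, negatives):
    """Start from the empty (most general) rule and add constraints
    until no negative example is covered."""
    rule = {}  # empty rule: no constraints, so it covers every example
    pos, neg = list(positives), list(negatives)
    while neg:
        # Candidate constraints come from the positives still covered.
        candidates = {(a, e[a]) for e in pos for a in e if a not in rule}
        if not candidates:
            break  # nothing left to add; rule may still cover negatives

        def score(constraint):
            a, v = constraint
            p = sum(1 for e in pos if e[a] == v)
            n = sum(1 for e in neg if e[a] == v)
            return (p / (p + n), p)  # p >= 1 because candidates come from pos

        a, v = max(sorted(candidates), key=score)
        rule[a] = v
        pos = [e for e in pos if e[a] == v]
        neg = [e for e in neg if e[a] == v]
    return rule


POS = [{"color": "red", "shape": "circle"}, {"color": "red", "shape": "square"}]
NEG = [{"color": "blue", "shape": "circle"}, {"color": "green", "shape": "square"}]
learned = learn_rule_top_down(POS, NEG)
```

Here a single constraint on `color` already excludes both negatives while keeping both positives, so the loop stops after one step.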

10
FOIL
  • Basic top-down sequential covering algorithm
    adapted for Prolog clauses.
  • Background provided extensionally.
  • Initialize clause for target predicate P to
  • P(X1,...,Xr) :- .
  • Possible specializations of a clause include
    adding all possible literals
  • Qi(V1,...,Vr)
  • not(Qi(V1,...,Vr))
  • Xi = Xj
  • not(Xi = Xj)
  • where the X's are variables in the existing clause,
    at least one of V1,...,Vr is an existing
    variable, and the others can be new.
  • Allow recursive literals if they do not cause
    infinite regress.

11
Foil Input Data
  • Consider the example of finding a path in a directed
    acyclic graph.
  • Intended clauses
  • path(X,Y) :- edge(X,Y).
  • path(X,Y) :- edge(X,Z), path(Z,Y).
  • Examples
  • edge: <1,2>, <1,3>, <3,6>, <4,2>, <4,6>, <6,5>
  • path: <1,2>, <1,3>, <1,6>, <1,5>, <3,6>, <3,5>,
    <4,2>, <4,6>, <4,5>, <6,5>
  • Negative examples of the target predicate can be
    provided directly or produced indirectly using a
    closed world assumption: every pair <x,y> not in
    the positive tuples for path.
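
As a sketch, the closed world assumption can be applied mechanically to the path tuples above. The function name and the `include_reflexive` flag are assumptions made for this illustration; reflexive pairs <x,x> are dropped by default because the negatives listed on the next slide do not include them.

```python
from itertools import product

PATH_POSITIVES = {(1, 2), (1, 3), (1, 6), (1, 5), (3, 6),
                  (3, 5), (4, 2), (4, 6), (4, 5), (6, 5)}
NODES = range(1, 7)  # the graph uses node constants 1..6


def closed_world_negatives(positives, constants, include_reflexive=False):
    """Every pair over the constants that is not a known positive tuple."""
    return {
        (x, y)
        for x, y in product(constants, repeat=2)
        if (x, y) not in positives and (include_reflexive or x != y)
    }


negatives = closed_world_negatives(PATH_POSITIVES, NODES)
```

With six constants there are 36 ordered pairs; removing the 10 positives and the 6 reflexive pairs leaves 20 negatives.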

12
Example Induction
  • + <1,2>, <1,3>, <1,6>, <1,5>, <3,6>, <3,5>,
    <4,2>, <4,6>, <4,5>, <6,5>
  • - <1,4>, <2,1>, <2,3>, <2,4>, <2,5>, <2,6>,
    <3,1>, <3,2>, <3,4>, <4,1>, <4,3>, <5,1>, <5,2>,
    <5,3>, <5,4>, <5,6>, <6,1>, <6,2>, <6,3>, <6,4>
  • Start with empty rule path(X,Y) :- .
  • Among others, consider adding literal edge(X,Y)
    (also consider edge(Y,X), edge(X,Z), edge(Z,X),
    path(Y,X), path(X,Z), path(Z,X), X=Y, and
    negations)
  • 6 positive tuples and NO negative tuples covered.
  • Create base case and remove covered examples
  • path(X,Y) :- edge(X,Y).

13
  • + <1,6>, <1,5>, <3,5>, <4,5>
  • - <1,4>, <2,1>, <2,3>, <2,4>, <2,5>, <2,6>,
    <3,1>, <3,2>, <3,4>, <4,1>, <4,3>, <5,1>, <5,2>,
    <5,3>, <5,4>, <5,6>, <6,1>, <6,2>, <6,3>, <6,4>
  • Start with new empty rule path(X,Y) :- .
  • Consider literal edge(X,Z) (among others...)
  • 4 remaining positives satisfy it but so do 10 of
    20 negatives
  • Current rule path(X,Y) :- edge(X,Z).
  • Consider literal path(Z,Y) (as well as edge(X,Y),
    edge(Y,Z), edge(X,Z), path(Z,X), etc....)
  • No negatives covered, complete clause.
  • path(X,Y) :- edge(X,Z), path(Z,Y).
  • New clause actually covers all remaining positive
    tuples of path, so definition is complete.
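
The counts quoted in this walk-through can be checked with a short script. This is only a verification sketch over the extensional edge and path tuples from the slides; the helper names are made up for the illustration, and the recursive literal is evaluated against the given path tuples, as FOIL does with extensional background.

```python
EDGE = {(1, 2), (1, 3), (3, 6), (4, 2), (4, 6), (6, 5)}
PATH = {(1, 2), (1, 3), (1, 6), (1, 5), (3, 6),
        (3, 5), (4, 2), (4, 6), (4, 5), (6, 5)}
NODES = range(1, 7)
NEGATIVES = {(x, y) for x in NODES for y in NODES
             if x != y and (x, y) not in PATH}


def covers_edge_xz(x, y):
    """Body edge(X,Z): true if some Z exists with edge(X,Z)."""
    return any((x, z) in EDGE for z in NODES)


def covers_edge_path(x, y):
    """Body edge(X,Z), path(Z,Y), using the extensional path tuples."""
    return any((x, z) in EDGE and (z, y) in PATH for z in NODES)


# Base clause path(X,Y) :- edge(X,Y).
base_positives = PATH & EDGE
base_negatives = NEGATIVES & EDGE
remaining = PATH - EDGE  # positives left for the second clause

# Partial clause path(X,Y) :- edge(X,Z).
partial_pos = {t for t in remaining if covers_edge_xz(*t)}
partial_neg = {t for t in NEGATIVES if covers_edge_xz(*t)}

# Complete clause path(X,Y) :- edge(X,Z), path(Z,Y).
final_pos = {t for t in remaining if covers_edge_path(*t)}
final_neg = {t for t in NEGATIVES if covers_edge_path(*t)}
```

The sets reproduce the numbers on the slides: 6 positives and no negatives for the base clause, 4 positives and 10 negatives after adding edge(X,Z), and 4 positives with no negatives for the finished recursive clause.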

14
Picking the Best Literal
  • Based on information gain (similar to ID3).
  • $\mathrm{gain}(L) = p \left( \log_2 \frac{p}{p+n} - \log_2 \frac{P}{P+N} \right)$
  • P is number of positives before adding literal L
  • N is number of negatives before adding literal L
  • p is number of positives after adding literal L
  • n is number of negatives after adding literal L
  • Given n predicates of arity m there are $O(n 2^m)$
    possible literals to choose from, so the branching
    factor can be quite large.
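
A small sketch of the gain formula as stated on this slide, applied to counts from the path example. It uses example counts directly; the full FOIL system counts variable-binding tuples rather than examples, so treat the numbers as illustrative. The function name is an assumption.

```python
import math


def foil_gain(P, N, p, n):
    """p * (log2(p / (p + n)) - log2(P / (P + N))), per the slide's definitions."""
    if p == 0:
        return 0.0  # a literal that keeps no positives gains nothing
    return p * (math.log2(p / (p + n)) - math.log2(P / (P + N)))


# First clause: adding edge(X,Y) takes 10 pos / 20 neg to 6 pos / 0 neg.
gain_base = foil_gain(P=10, N=20, p=6, n=0)

# Second clause: adding edge(X,Z) takes 4 pos / 20 neg to 4 pos / 10 neg.
gain_partial = foil_gain(P=4, N=20, p=4, n=10)
```

The first literal scores far higher (6 · log2 3, about 9.5) than the second (about 3.1), reflecting that it removes every negative in one step.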

15
Other Approaches
  • Golem
  • CHILL
  • Foidl
  • Bufoidl

16
Domains
  • Any kind of concept learning where background
    knowledge is useful.
  • Natural Language Processing
  • Planning
  • Chemistry and biology
  • DNA
  • Protein structure

17
Why Neural Networks?

18
Why Neural Networks?
  • Analogy to biological systems, the best examples
    we have of robust learning systems.
  • Models of biological systems allowing us to
    understand how they learn and adapt.
  • Massive parallelism that allows for computational
    efficiency.
  • Graceful degradation due to distributed
    representations that spread knowledge
    representation over large numbers of
    computational units.
  • Intelligent behavior is an emergent property from
    large numbers of simple units rather than
    resulting from explicit symbolically encoded
    rules.

19
Neural Speed Constraints
  • Neuron switching time is on the order of
    milliseconds compared to nanoseconds for current
    transistors.
  • A factor of a million difference in speed.
  • However, biological systems can perform
    significant cognitive tasks (vision, language
    understanding) in seconds or tenths of seconds.

20
What That Means
  • Therefore, there is only time for about a hundred
    serial steps to perform such tasks.
  • Even with limited abilities, current AI systems
    require orders of magnitude more serial steps.
  • The human brain has approximately $10^{11}$ neurons,
    each connected on average to $10^{4}$ others, and
    therefore must exploit massive parallelism.
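
The arithmetic behind these claims, written out under the assumption of round figures (1 ms per neuron switch, 1 ns per transistor switch, 0.1 s per cognitive task):

```latex
\[
\frac{10^{-3}\,\mathrm{s}}{10^{-9}\,\mathrm{s}} = 10^{6}
\quad\text{(speed ratio)},
\qquad
\frac{10^{-1}\,\mathrm{s}}{10^{-3}\,\mathrm{s}} = 10^{2}
\quad\text{(serial steps per task)},
\qquad
10^{11} \times 10^{4} = 10^{15}
\quad\text{(connections)}.
\]
```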

21
Real Neurons
  • Cells forming the basis of neural tissue
  • Cell body
  • Dendrites
  • Axon
  • Synaptic terminals
  • The electrical potential across the cell membrane
    exhibits spikes called action potentials.
  • Originating in the cell body, this spike travels
    down the axon and causes chemical
    neurotransmitters to be released at synaptic
    terminals.
  • This chemical diffuses across the synapse into
    dendrites of neighboring cells.

22
Real Neurons (cont)
  • Synapses can be excitatory or inhibitory.
  • Size of synaptic terminal influences strength of
    connection.
  • Cells add up the incoming chemical messages
    from all neighboring cells and if the net
    positive influence exceeds a threshold, they
    fire and emit an action potential.

23
Model Neuron (Linear Threshold Unit)
  • Neuron modelled by a unit (j) connected by
    weights, $w_{ji}$, to other units (i)
  • Net input to a unit is defined as
  • $net_j = \sum_i w_{ji} o_i$
  • Output of a unit is a threshold function on the
    net input
  • 1 if $net_j > T_j$
  • 0 otherwise
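
The unit defined above is only a few lines of code. A minimal sketch, with the function name chosen for this illustration:

```python
from typing import Sequence


def linear_threshold_unit(weights: Sequence[float],
                          inputs: Sequence[float],
                          threshold: float) -> int:
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    net = sum(w * o for w, o in zip(weights, inputs))  # net_j = sum_i w_ji * o_i
    return 1 if net > threshold else 0


# Two inputs with weight 0.5 each and threshold 0.75: fires only if both are on.
both_on = linear_threshold_unit([0.5, 0.5], [1, 1], 0.75)
one_on = linear_threshold_unit([0.5, 0.5], [1, 0], 0.75)
```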

24
Neural Computation
  • McCulloch and Pitts (1943) showed how linear
    threshold units can be used to compute logical
    functions.
  • Can build basic logic gates
  • AND: Let all $w_{ji}$ be $(T_j / n) + \epsilon$, where n = number of
    inputs
  • OR: Let all $w_{ji}$ be $T_j + \epsilon$
  • NOT: Let one input be a constant 1 with weight
    $T_j + \epsilon$ and the input to be inverted have weight $-T_j$
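
A sketch of the three gates, assuming the weight settings are read as (T/n) + ε for AND, T + ε for OR, and a constant input weighted T + ε plus a weight of −T on the inverted input for NOT. The values T = 1 and ε = 0.1 are arbitrary choices for the illustration; for AND, ε has to be small enough that n − 1 active inputs stay at or below T, which works out to ε ≤ T / (n(n − 1)).

```python
T = 1.0    # threshold (arbitrary choice for this sketch)
EPS = 0.1  # small positive margin; must satisfy EPS <= T / (n * (n - 1)) for AND


def ltu(weights, inputs, threshold=T):
    """Linear threshold unit: 1 if the weighted sum exceeds the threshold."""
    return 1 if sum(w * o for w, o in zip(weights, inputs)) > threshold else 0


def and_gate(*inputs):
    n = len(inputs)
    return ltu([T / n + EPS] * n, inputs)


def or_gate(*inputs):
    return ltu([T + EPS] * len(inputs), inputs)


def not_gate(x):
    # Constant input 1 with weight T + EPS, inverted input with weight -T.
    return ltu([T + EPS, -T], [1, x])


truth_table = [(a, b, and_gate(a, b), or_gate(a, b))
               for a in (0, 1) for b in (0, 1)]
```

Each row of `truth_table` holds the two inputs followed by the AND and OR outputs, and `not_gate` flips a single bit.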

25
Neural Computation (cont)
  • Can build arbitrary logic circuits, finite-state
    machines, and computers given these basic gates.
  • Given negated inputs, two layers of linear
    threshold units can specify any boolean function
    using a two-layer AND-OR network.
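
To illustrate the two-layer AND-OR construction, here is a sketch that evaluates a function written as an OR of AND terms, with negated inputs supplied alongside the originals. Exclusive-or is used as the example. The term encoding, the function names, and the constants T = 1 and ε = 0.1 are assumptions made for this illustration.

```python
T, EPS = 1.0, 0.1  # threshold and margin, arbitrary choices for this sketch


def ltu(weights, inputs, threshold=T):
    """Linear threshold unit: 1 if the weighted sum exceeds the threshold."""
    return 1 if sum(w * o for w, o in zip(weights, inputs)) > threshold else 0


def and_or_network(terms, x):
    """terms: list of AND terms; each term is a list of (index, is_positive)."""
    n = len(x)
    literals = list(x) + [1 - v for v in x]  # originals, then negated inputs
    hidden = []
    for term in terms:
        selected = [literals[i] if positive else literals[n + i]
                    for i, positive in term]
        k = len(selected)
        hidden.append(ltu([T / k + EPS] * k, selected))  # first layer: AND units
    return ltu([T + EPS] * len(hidden), hidden)          # second layer: OR unit


# xor(x0, x1) = (x0 AND NOT x1) OR (NOT x0 AND x1)
XOR_TERMS = [[(0, True), (1, False)], [(0, False), (1, True)]]
xor_outputs = {(a, b): and_or_network(XOR_TERMS, (a, b))
               for a in (0, 1) for b in (0, 1)}
```

Any boolean function can be written in this OR-of-ANDs form, which is why two layers suffice once negated inputs are available.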

26
Learning
  • Hebb (1949) suggested that if two units are both
    active (firing) then the weight between them
    should increase
  • $w_{ji} = w_{ji} + \eta o_j o_i$
  • $\eta$ is a constant called the learning rate
  • Supported by physiological evidence

27
Alternate Learning Rule
  • Rosenblatt (1959) suggested that if a target
    output value is provided for a single neuron with
    fixed inputs, one can incrementally change weights
    to learn to produce these outputs using the
    perceptron learning rule.
  • Assumes binary-valued inputs/outputs
  • Assumes a single linear threshold unit.
  • Assumes input features are detected by fixed
    networks.

28
Perceptron Learning Rule
  • If the target output for output unit j is $t_j$
  • $w_{ji} = w_{ji} + \eta (t_j - o_j) o_i$
  • Equivalent to the intuitive rules
  • If output is correct, don't change the weights
  • If output is low ($o_j = 0$, $t_j = 1$), increment
    weights for all inputs which are 1.
  • If output is high ($o_j = 1$, $t_j = 0$), decrement
    weights for all inputs which are 1.
  • Must also adjust threshold
  • $T_j = T_j - \eta (t_j - o_j)$
  • or equivalently assume there is a weight $w_{j0} = -T_j$
    for an extra input unit 0 that has constant
    output $o_0 = 1$ and that the threshold is always 0.
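
A single application of the rule can be sketched as below, using the equivalent formulation in which the threshold is folded into a weight on a constant input of 1. The function name and the default learning rate of 1.0 are assumptions for the illustration.

```python
def perceptron_update(weights, inputs, target, eta=1.0):
    """One step of the perceptron rule.

    weights[0] plays the role of w_j0 = -T_j and is paired with a
    constant input o_0 = 1, so the effective threshold is always 0.
    Returns the updated weights and the output computed before the update.
    """
    extended = [1] + list(inputs)
    net = sum(w * o for w, o in zip(weights, extended))
    output = 1 if net > 0 else 0
    new_weights = [w + eta * (target - output) * o
                   for w, o in zip(weights, extended)]
    return new_weights, output


# Output too low: weights on active inputs (and the bias weight) go up.
raised, out_low = perceptron_update([0, 0, 0], [1, 0], target=1)
# Output too high: weights on active inputs (and the bias weight) go down.
lowered, out_high = perceptron_update([1, 1, 1], [1, 0], target=0)
# Output correct: nothing changes.
same, out_ok = perceptron_update([1, 1, 1], [1, 0], target=1)
```

Note that the weight on the inactive second input is untouched in all three cases, matching the intuitive rules above.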

29
Perceptron Learning Algorithm
  • Repeatedly iterate through examples, adjusting
    weights according to the perceptron learning rule
    until all outputs are correct
  • Initialize the weights to all zero (or randomly)
  • Until outputs for all training examples are
    correct
  • For each training example, e, do
  • Compute the current output $o_j$
  • Compare it to the target $t_j$ and update the
    weights according to the perceptron learning rule.
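
The algorithm can be sketched as follows, with the threshold represented as a bias weight on a constant input of 1. The `max_epochs` cap is an added safeguard that is not part of the algorithm on the slide, and the function name is chosen for this illustration. The example learns logical AND, which is linearly separable.

```python
def train_perceptron(examples, n_inputs, eta=1, max_epochs=100):
    """Return (weights, epochs_used), or (weights, None) if the cap is hit.

    weights[0] is the bias weight (the negated threshold).
    """
    weights = [0] * (n_inputs + 1)           # initialize all weights to zero
    for epoch in range(1, max_epochs + 1):   # each pass over the data is an epoch
        all_correct = True
        for inputs, target in examples:
            extended = [1] + list(inputs)
            net = sum(w * o for w, o in zip(weights, extended))
            output = 1 if net > 0 else 0
            if output != target:
                all_correct = False
                weights = [w + eta * (target - output) * o
                           for w, o in zip(weights, extended)]
        if all_correct:
            return weights, epoch
    return weights, None


def predict(weights, inputs):
    net = sum(w * o for w, o in zip(weights, [1] + list(inputs)))
    return 1 if net > 0 else 0


AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights, and_epochs = train_perceptron(AND_EXAMPLES, n_inputs=2)
```

The loop returns only after a full pass with no mistakes, so the returned weights classify every training example correctly.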

30
Algorithm Notes
  • Each execution of the outer loop is called an
    epoch.
  • If the output is considered as concept membership
    and inputs as binary input features, then easily
    applied to concept learning problems.
  • For multiple category problems, learn a separate
    perceptron for each category and assign to the
    class whose perceptron most exceeds its
    threshold.
  • When will this algorithm terminate (converge)?

31
Representational Limitations
  • Perceptrons can only represent linear threshold
    functions and can therefore only learn data that
    is linearly separable (positive and negative
    examples are separable by a hyperplane in
    n-dimensional space)
  • Cannot represent exclusive-or (xor)

32
Perceptron Learnability
  • System obviously cannot learn what it cannot
    represent.
  • Minsky and Papert (1969) demonstrated that many
    functions like parity (n-input generalization of
    xor) could not be represented.
  • In visual pattern recognition, if input features
    are assumed to be local (each extracts a feature
    within a fixed radius), then no input features
    support learning
  • Symmetry
  • Connectivity
  • These limitations discouraged subsequent research
    on neural networks.

33
Perceptron Convergence and Cycling Theorems
  • Perceptron Convergence Theorem: If there is a
    set of weights that is consistent with the
    training data (i.e., the data is linearly
    separable), the perceptron learning algorithm
    will converge (Minsky & Papert, 1969).
  • Perceptron Cycling Theorem: If the training data
    is not linearly separable, the perceptron
    learning algorithm will eventually repeat the
    same set of weights and threshold at the end of
    some epoch and therefore enter an infinite loop.
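
Both theorems can be observed on small examples. The sketch below records the weight vector at the end of every epoch and stops when either no mistakes occur (convergence) or a previously seen vector comes back (a cycle). It assumes zero initial weights and a learning rate of 1 so that the weights stay integers and can be compared exactly; the function name and the `max_epochs` cap are additions for this illustration.

```python
def run_until_repeat(examples, n_inputs, eta=1, max_epochs=1000):
    """Return ('converged' | 'cycled' | 'undecided', epoch number)."""
    weights = [0] * (n_inputs + 1)  # weights[0] is the bias (negated threshold)
    seen = set()                    # end-of-epoch weight vectors observed so far
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in examples:
            extended = [1] + list(inputs)
            net = sum(w * o for w, o in zip(weights, extended))
            output = 1 if net > 0 else 0
            if output != target:
                errors += 1
                weights = [w + eta * (target - output) * o
                           for w, o in zip(weights, extended)]
        if errors == 0:
            return "converged", epoch
        state = tuple(weights)
        if state in seen:
            return "cycled", epoch
        seen.add(state)
    return "undecided", max_epochs


AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
and_status, _ = run_until_repeat(AND_EXAMPLES, n_inputs=2)
xor_status, _ = run_until_repeat(XOR_EXAMPLES, n_inputs=2)
```

AND is linearly separable and converges; xor is not, and the same end-of-epoch weights reappear after a few passes.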

34
Perceptron Learning as Hill Climbing
  • The search space for Perceptron learning is the
    space of possible values for the weights (and
    threshold).
  • The evaluation metric is the error these weights
    produce when used to classify the training
    examples.
  • The perceptron learning algorithm performs a form
    of hill-climbing (gradient descent), at each
    point altering the weights slightly in a
    direction to help minimize this error.
  • Perceptron convergence theorem guarantees that
    for the linearly separable case there is only one
    local minimum and the space is well behaved.

35
Perceptron Performance
  • Can represent and learn conjunctive concepts and
    M-of-N concepts (true if any M of a set of N
    selected binary features are true).
  • Although simple and restrictive, this high-bias
    algorithm performs quite well on many realistic
    problems.
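
An M-of-N concept maps directly onto a single threshold unit: give each of the N selected features a weight of 1, every other feature a weight of 0, and set the threshold just below M. A minimal sketch, where the function name and the threshold choice of M − 0.5 are assumptions for the illustration:

```python
def m_of_n_unit(m, selected, x):
    """1 if at least m of the selected binary features in x are on, else 0."""
    weights = [1 if i in selected else 0 for i in range(len(x))]
    net = sum(w * v for w, v in zip(weights, x))
    return 1 if net > m - 0.5 else 0


SELECTED = {0, 2, 3}  # N = 3 selected features out of 5

two_on = m_of_n_unit(2, SELECTED, [1, 0, 1, 0, 0])          # features 0 and 2
one_on = m_of_n_unit(2, SELECTED, [1, 1, 0, 0, 1])          # only feature 0 counts
conjunction = m_of_n_unit(3, SELECTED, [1, 0, 1, 1, 0])     # N-of-N is a conjunction
```

Setting M equal to N recovers a plain conjunction of the selected features, and M = 1 gives a disjunction.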
  • However, the representational restriction is
    limiting in many applications.