For Wednesday - PowerPoint PPT Presentation


Provided by: maryelai


1
For Wednesday

Read ch. 20, sections 1, 2, 5, and 7
No homework

2
Program 4

Any questions?

3
Learning mini-project

Worth 2 homeworks
Due next Monday
Foil6 is available in /home/mecalif/public/itk340/foil
A manual and sample data files are there as well.
Create a data file that will allow FOIL to learn
rules for a sister/2 relation from background
relations of parent/2, male/1, and female/1. You
can look in the prolog folder of my 327 folder
for sample data if you like.
Electronically submit your data file, which should
be named sister.d, and turn in a hard copy of the
rules FOIL learns.

4
Inductive Logic Programming

Representation is Horn clauses
Builds rules using background predicates
Rules are potentially much more expressive than
attribute-value representations

5
Example Results

Rules for family relations from data of primitive
or related predicates.
uncle(A,B) :- brother(A,C), parent(C,B).
uncle(A,B) :- husband(A,C), sister(C,D), parent(D,B).
Recursive list programs.
member(X,[X|Y]).
member(X,[Y|Z]) :- member(X,Z).

6
ILP

Goal is to induce a Horn-clause definition for
some target predicate P given definitions of
background predicates $Q_i$.
Goal is to find a syntactically simple definition
D for P such that given background predicate
definitions B
For every positive example $p_i$: $D \cup B \vdash p_i$
For every negative example $n_i$: $D \cup B \nvdash n_i$
Background definitions are provided either
Extensionally: List of ground tuples satisfying
the predicate.
Intensionally: Prolog definition of the predicate.

7
Sequential Covering Algorithm

Let P be the set of positive examples
Until P is empty do
    Learn a rule R that covers a large number of
    positives without covering any negatives.
    Add R to the list of learned rules.
    Remove positives covered by R from P

This is just an instance of the greedy algorithm
for minimum set covering and does not guarantee
that a minimum number of rules is learned but
tends to learn a reasonably small rule set.
Minimum set covering is an NP-hard problem and
the greedy algorithm is a common approximation
algorithm.
There are several ways to learn a single rule
used in various methods.
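The covering loop above can be sketched in a few lines. This is a minimal illustration, not FOIL itself: the "learn a rule" step is replaced by picking from a fixed pool of hand-written candidate rules, and the three-feature encoding `(a, b, c)`, the rule names, and the example data are all hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Example = Tuple[int, int, int]   # hypothetical encoding: binary features (a, b, c)
NamedRule = Tuple[str, Callable[[Example], bool]]


def sequential_covering(positives: Set[Example],
                        negatives: Set[Example],
                        candidates: Sequence[NamedRule]) -> Tuple[List[str], Set[Example]]:
    """Greedy covering: keep choosing the rule that covers the most remaining positives."""
    remaining = set(positives)
    learned: List[str] = []
    # Only rules that cover no negative example are acceptable.
    consistent = [(name, rule) for name, rule in candidates
                  if not any(rule(n) for n in negatives)]
    while remaining and consistent:
        name, rule = max(consistent, key=lambda nr: sum(nr[1](p) for p in remaining))
        covered = {p for p in remaining if rule(p)}
        if not covered:
            break  # no acceptable rule covers anything that is left
        learned.append(name)
        remaining -= covered
    return learned, remaining


candidates = [
    ("a", lambda e: e[0] == 1),
    ("b", lambda e: e[1] == 1),
    ("c", lambda e: e[2] == 1),
    ("a and b", lambda e: e[0] == 1 and e[1] == 1),
]
positives = {(1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 0, 1), (0, 1, 1)}
negatives = {(0, 0, 0), (1, 0, 0), (0, 1, 0)}

learned, uncovered = sequential_covering(positives, negatives, candidates)
print(learned, uncovered)
```

On this toy data the rule "c" is picked first because it covers the most positives, and "a and b" then covers the one positive that is left.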

9
Strategies for Learning a Single Rule

Top-Down (General to Specific)
Start with the most general (empty) rule.
Repeatedly add feature constraints that eliminate
negatives while retaining positives.
Stop when only positives are covered.
Bottom-Up (Specific to General)
Start with a most specific rule (complete
description of a single instance).
Repeatedly eliminate feature constraints in order
to cover more positive examples.
Stop when further generalization results in
covering negatives.
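A rough sketch of the top-down strategy on attribute-value data follows. The scoring heuristic (fraction of covered examples that are positive, ties broken by number of positives kept) is an assumption chosen for illustration, and the feature names and examples are made up.

```python
from typing import Dict, List

Example = Dict[str, str]


def covers(rule: Dict[str, str], example: Example) -> bool:
    """A rule is a set of feature=value constraints; the empty rule covers everything."""
    return all(example[f] == v for f, v in rule.items())


def learn_rule_top_down(positives: List[Example],
                        negatives: List[Example],
                        features: List[str]) -> Dict[str, str]:
    rule: Dict[str, str] = {}          # most general (empty) rule
    pos, neg = list(positives), list(negatives)
    while neg:                         # stop when only positives are covered
        candidates = []
        for f in features:
            if f in rule:
                continue
            for v in sorted({e[f] for e in pos}):
                p = sum(e[f] == v for e in pos)
                n = sum(e[f] == v for e in neg)
                candidates.append((p / (p + n), p, f, v))
        if not candidates:
            break                      # nothing left to add
        _, _, f, v = max(candidates, key=lambda c: (c[0], c[1]))
        rule[f] = v                    # add the constraint and shrink the covered sets
        pos = [e for e in pos if e[f] == v]
        neg = [e for e in neg if e[f] == v]
    return rule


positives = [
    {"color": "red", "size": "big", "shape": "round"},
    {"color": "red", "size": "big", "shape": "square"},
]
negatives = [
    {"color": "red", "size": "small", "shape": "round"},
    {"color": "blue", "size": "big", "shape": "round"},
    {"color": "blue", "size": "small", "shape": "square"},
]
rule = learn_rule_top_down(positives, negatives, ["color", "size", "shape"])
print(rule)
```

A bottom-up learner would run the other way: start from all constraints of one positive example and drop constraints while no negative becomes covered.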

10
FOIL

Basic top-down sequential covering algorithm
adapted for Prolog clauses.
Background provided extensionally.
Initialize clause for target predicate P to
P(X1,...,Xr) :- .
Possible specializations of a clause include
adding all possible literals
Qi(V1,...,Vr)
not(Qi(V1,...,Vr))
Xi = Xj
not(Xi = Xj)
where X's are variables in the existing clause,
at least one of V1,...,Vr is an existing
variable, others can be new.
Allow recursive literals if they do not cause
infinite regress.

11
Foil Input Data

Consider example of finding a path in a directed
acyclic graph.
Intended Clause
path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z), path(Z,Y).
Examples
edge: <1,2>, <1,3>, <3,6>, <4,2>, <4,6>, <6,5>
path: <1,2>, <1,3>, <1,6>, <1,5>, <3,6>, <3,5>,
<4,2>, <4,6>, <4,5>, <6,5>
Negative examples of the target predicate can be
provided directly or indirectly produced using a
closed world assumption: every pair <x,y> not in
the positive tuples for path.
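The positive and negative tuples for path can be derived mechanically from the edge tuples. The sketch below assumes, as the slide's counts suggest, that pairs with x = y are left out of the closed-world negatives; the function name `transitive_closure` is just an illustrative helper.

```python
from itertools import permutations

edges = {(1, 2), (1, 3), (3, 6), (4, 2), (4, 6), (6, 5)}
nodes = sorted({n for e in edges for n in e})


def transitive_closure(edge_set):
    """All pairs (a, d) such that d is reachable from a along one or more edges."""
    closure = set(edge_set)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in edge_set:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


path_pos = transitive_closure(edges)
# Closed world assumption: every ordered pair of distinct nodes that is not a
# positive path tuple is treated as a negative example.
path_neg = set(permutations(nodes, 2)) - path_pos

print(len(path_pos), len(path_neg))
```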

12
Example Induction

+: <1,2>, <1,3>, <1,6>, <1,5>, <3,6>, <3,5>,
<4,2>, <4,6>, <4,5>, <6,5>
-: <1,4>, <2,1>, <2,3>, <2,4>, <2,5>, <2,6>,
<3,1>, <3,2>, <3,4>, <4,1>, <4,3>, <5,1>, <5,2>,
<5,3>, <5,4>, <5,6>, <6,1>, <6,2>, <6,3>, <6,4>
Start with empty rule path(X,Y) :- .
Among others, consider adding literal edge(X,Y)
(also consider edge(Y,X), edge(X,Z), edge(Z,X),
path(Y,X), path(X,Z), path(Z,X), X=Y, and
negations)
6 positive tuples and NO negative tuples covered.
Create base case and remove covered examples
path(X,Y) :- edge(X,Y).

+: <1,6>, <1,5>, <3,5>, <4,5>
-: <1,4>, <2,1>, <2,3>, <2,4>, <2,5>, <2,6>,
<3,1>, <3,2>, <3,4>, <4,1>, <4,3>, <5,1>, <5,2>,
<5,3>, <5,4>, <5,6>, <6,1>, <6,2>, <6,3>, <6,4>
Start with new empty rule path(X,Y) :- .
Consider literal edge(X,Z) (among others...)
4 remaining positives satisfy it but so do 10 of
20 negatives
Current rule path(X,Y) :- edge(X,Z).
Consider literal path(Z,Y) (as well as edge(X,Y),
edge(Y,Z), edge(X,Z), path(Z,X), etc....)
No negatives covered, complete clause.
path(X,Y) :- edge(X,Z), path(Z,Y).
New clause actually covers all remaining positive
tuples of path, so definition is complete.
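The coverage counts in this walkthrough can be checked directly. The sketch below is only a verification of the example, not FOIL's search: each clause body is hand-coded as a Python function, and the recursive literal path(Z,Y) is looked up in the extensional positive tuples.

```python
from itertools import permutations

edge = {(1, 2), (1, 3), (3, 6), (4, 2), (4, 6), (6, 5)}
path = {(1, 2), (1, 3), (1, 6), (1, 5), (3, 6), (3, 5),
        (4, 2), (4, 6), (4, 5), (6, 5)}
nodes = range(1, 7)
negatives = set(permutations(nodes, 2)) - path


def base_body(x, y):
    """Body of path(X,Y) :- edge(X,Y)."""
    return (x, y) in edge


def partial_body(x, y):
    """Body of path(X,Y) :- edge(X,Z): true if some Z makes edge(X,Z) hold."""
    return any((x, z) in edge for z in nodes)


def recursive_body(x, y):
    """Body of path(X,Y) :- edge(X,Z), path(Z,Y), with path read from the tuples."""
    return any((x, z) in edge and (z, y) in path for z in nodes)


base_pos = {t for t in path if base_body(*t)}
base_neg = {t for t in negatives if base_body(*t)}
remaining = path - base_pos

partial_pos = {t for t in remaining if partial_body(*t)}
partial_neg = {t for t in negatives if partial_body(*t)}

rec_pos = {t for t in remaining if recursive_body(*t)}
rec_neg = {t for t in negatives if recursive_body(*t)}

print(len(base_pos), len(base_neg))        # base clause coverage
print(len(partial_pos), len(partial_neg))  # after adding edge(X,Z)
print(len(rec_pos), len(rec_neg))          # after adding path(Z,Y)
```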

14
Picking the Best Literal

Based on information gain (similar to ID3).
$p \left( \log_2 \frac{p}{p+n} - \log_2 \frac{P}{P+N} \right)$
P is number of positives before adding literal L
N is number of negatives before adding literal L
p is number of positives after adding literal L
n is number of negatives after adding literal L
Given n predicates of arity m there are $O(n 2^m)$
possible literals to choose from, so branching
factor can be quite large.
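The gain measure is easy to compute once the four counts are known. The sketch below uses the simplified form on the slide, with counts of covered examples; the function name `foil_gain` and the zero-gain convention for p = 0 are assumptions. The sample counts come from the path example (10 positives and 20 negatives initially, then 4 positives left for the second clause).

```python
import math


def foil_gain(P: int, N: int, p: int, n: int) -> float:
    """p * (log2(p/(p+n)) - log2(P/(P+N))); taken to be 0 when no positives remain."""
    if p == 0:
        return 0.0
    return p * (math.log2(p / (p + n)) - math.log2(P / (P + N)))


# First clause: adding edge(X,Y) keeps 6 positives and no negatives.
gain_edge_xy = foil_gain(10, 20, 6, 0)
# Second clause: adding edge(X,Z) keeps all 4 remaining positives but 10 negatives.
gain_edge_xz = foil_gain(4, 20, 4, 10)

print(round(gain_edge_xy, 3), round(gain_edge_xz, 3))
```

A literal that keeps positives while removing negatives raises the first logarithm toward 0, so its gain grows with both the purity of what is covered and the number of positives retained.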

15
Other Approaches

Golem
CHILL
Foidl
Bufoidl

16
Domains

Any kind of concept learning where background
knowledge is useful.
Natural Language Processing
Planning
Chemistry and biology
DNA
Protein structure

17
Why Neural Networks?
18
Why Neural Networks?

Analogy to biological systems, the best examples
we have of robust learning systems.
Models of biological systems allowing us to
understand how they learn and adapt.
Massive parallelism that allows for computational
efficiency.
Graceful degradation due to distributed
representations that spread knowledge
representation over large numbers of
computational units.
Intelligent behavior is an emergent property from
large numbers of simple units rather than
resulting from explicit symbolically encoded
rules.

19
Neural Speed Constraints

Neuron switching time is on the order of
milliseconds compared to nanoseconds for current
transistors.
A factor of a million difference in speed.
However, biological systems can perform
significant cognitive tasks (vision, language
understanding) in seconds or tenths of seconds.

20
What That Means

Therefore, there is only time for about a hundred
serial steps needed to perform such tasks.
Even with limited abilities, current AI systems
require orders of magnitude more serial steps.
Human brain has approximately $10^{11}$ neurons each
connected on average to $10^4$ others, therefore
must exploit massive parallelism.
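As a rough check on the "about a hundred serial steps" figure, take a task time of about a tenth of a second and a neuron switching time of about a millisecond (order-of-magnitude values only). The same kind of estimate gives the total number of connections for about $10^{11}$ neurons with about $10^4$ connections each:

```latex
\[
\frac{10^{-1}\ \text{s}}{10^{-3}\ \text{s/step}} = 10^{2}\ \text{steps},
\qquad
10^{11}\ \text{neurons} \times 10^{4}\ \text{connections/neuron} = 10^{15}\ \text{connections}.
\]
```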

21
Real Neurons

Cells forming the basis of neural tissue
Cell body
Dendrites
Axon
Synaptic terminals
The electrical potential across the cell membrane
exhibits spikes called action potentials.
Originating in the cell body, this spike travels
down the axon and causes chemical
neurotransmitters to be released at synaptic
terminals.
This chemical diffuses across the synapse into
dendrites of neighboring cells.

22
Real Neurons (cont)

Synapses can be excitatory or inhibitory.
Size of synaptic terminal influences strength of
connection.
Cells add up the incoming chemical messages
from all neighboring cells and if the net
positive influence exceeds a threshold, they
fire and emit an action potential.

23
Model Neuron (Linear Threshold Unit)

Neuron modelled by a unit (j) connected by
weights, $w_{ji}$, to other units (i)
Net input to a unit is defined as
$net_j = \sum_i w_{ji} o_i$
Output of a unit is a threshold function on the
net input
$o_j = 1$ if $net_j > T_j$
$o_j = 0$ otherwise
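A linear threshold unit is only a weighted sum followed by a comparison. A minimal sketch, with illustrative weights and threshold that are not from the slides:

```python
from typing import Sequence


def ltu_output(weights: Sequence[float], inputs: Sequence[float], threshold: float) -> int:
    """Return 1 if the net input (sum of weight * input) exceeds the threshold, else 0."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net > threshold else 0


# Two inputs with weight 0.5 each and threshold 0.7: fires only when both are on.
both_on = ltu_output([0.5, 0.5], [1, 1], 0.7)
one_on = ltu_output([0.5, 0.5], [1, 0], 0.7)
print(both_on, one_on)
```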

24
Neural Computation

McCulloch and Pitts (1943) showed how linear
threshold units can be used to compute logical
functions.
Can build basic logic gates
AND: Let all $w_{ji}$ be $(T_j / n) + \epsilon$, where n = number of
inputs
OR: Let all $w_{ji}$ be $T_j + \epsilon$
NOT: Let one input be a constant 1 with weight
$T_j + \epsilon$ and the input to be inverted have weight $-T_j$
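The gate constructions can be tried out directly. This sketch reads the slide's weights as T/n + ε for AND, T + ε for OR, and T + ε on a constant input with −T on the inverted input for NOT; that reading, the values T = 1 and ε = 0.1, and the function names are assumptions. For AND, ε has to be small relative to T/n, which holds for the two- and three-input cases shown.

```python
from itertools import product

T = 1.0     # threshold shared by all units (illustrative value)
EPS = 0.1   # small positive margin (illustrative value)


def ltu(weights, inputs, threshold=T):
    """Linear threshold unit: 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if sum(w * o for w, o in zip(weights, inputs)) > threshold else 0


def and_gate(*xs):
    n = len(xs)
    return ltu([T / n + EPS] * n, xs)       # exceeds T only if every input is 1


def or_gate(*xs):
    return ltu([T + EPS] * len(xs), xs)     # any single 1 input exceeds T


def not_gate(x):
    return ltu([T + EPS, -T], [1, x])       # constant 1 input plus inhibitory input


for a, b in product([0, 1], repeat=2):
    print(a, b, and_gate(a, b), or_gate(a, b))
print(not_gate(0), not_gate(1))
```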

25
Neural Computation (cont)

Can build arbitrary logic circuits, finite-state
machines, and computers given these basis gates.
Given negated inputs, two layers of linear
threshold units can specify any boolean function
using a two-layer AND-OR network.
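One way to see the two-layer claim is to build the network from a truth table: one AND unit per input combination on which the function is true, feeding a single OR unit. The sketch below is an illustrative construction with assumed weights (a margin of T/n² on the AND units so that n − 1 active inputs stay below the threshold); `dnf_network` is a hypothetical helper name.

```python
from itertools import product

T = 1.0
EPS = 0.1


def ltu(weights, inputs, threshold=T):
    return 1 if sum(w * o for w, o in zip(weights, inputs)) > threshold else 0


def dnf_network(fn, n):
    """Build a two-layer AND-OR threshold network computing the boolean function fn."""
    terms = [bits for bits in product([0, 1], repeat=n) if fn(*bits)]
    and_weight = T / n + T / (n * n)   # n active inputs exceed T, n - 1 do not

    def network(*xs):
        extended = list(xs) + [1 - x for x in xs]   # inputs followed by their negations
        hidden = []
        for bits in terms:
            weights = [0.0] * (2 * n)
            for i, b in enumerate(bits):
                # Connect to x_i when the term needs 1, to not-x_i when it needs 0.
                weights[i if b else n + i] = and_weight
            hidden.append(ltu(weights, extended))
        if not hidden:
            return 0                                # function is false everywhere
        return ltu([T + EPS] * len(hidden), hidden) # OR over the AND units

    return network


xor = dnf_network(lambda a, b: a != b, 2)
parity3 = dnf_network(lambda a, b, c: (a + b + c) % 2 == 1, 3)
print([xor(a, b) for a, b in product([0, 1], repeat=2)])
```

The number of AND units can grow exponentially with the number of inputs, so this shows representability, not efficiency.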

26
Learning

Hebb (1949) suggested if two units are both
active (firing) then the weight between them
should increase
$w_{ji} \leftarrow w_{ji} + \eta o_j o_i$
$\eta$ is a constant called the learning rate
Supported by physiological evidence
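Hebb's rule in its simplest form adds the learning rate times the product of the two activations to the weight, so the weight grows only when both units are active. A minimal sketch, with an arbitrary learning rate of 0.1 and a hypothetical function name:

```python
def hebb_update(weight: float, o_j: int, o_i: int, eta: float = 0.1) -> float:
    """Return the weight after one Hebbian step: w + eta * o_j * o_i."""
    return weight + eta * o_j * o_i


w_both_active = hebb_update(0.5, 1, 1)   # both units fire: weight increases
w_one_active = hebb_update(0.5, 1, 0)    # only one fires: weight unchanged
print(w_both_active, w_one_active)
```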

27
Alternate Learning Rule

Rosenblatt (1959) suggested that if a target
output value is provided for a single neuron with
fixed inputs, one can incrementally change weights to
learn to produce these outputs using the
perceptron learning rule.
Assumes binary valued input/outputs
Assumes a single linear threshold unit.
Assumes input features are detected by fixed
networks.

28
Perceptron Learning Rule

If the target output for output unit j is $t_j$
$w_{ji} \leftarrow w_{ji} + \eta (t_j - o_j) o_i$
Equivalent to the intuitive rules
If output is correct, don't change the weights
If output is low ($o_j = 0$, $t_j = 1$), increment
weights for all inputs which are 1.
If output is high ($o_j = 1$, $t_j = 0$), decrement
weights for all inputs which are 1.
Must also adjust threshold
$T_j \leftarrow T_j - \eta (t_j - o_j)$
or equivalently assume there is a weight $w_{j0} = -T_j$
for an extra input unit 0 that has constant
output $o_0 = 1$ and that the threshold is always 0.
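A single application of the rule can be written as a small function. The sketch below uses the equivalent formulation in which the threshold is folded into an extra weight on a constant input of 1 and the unit fires when the net input exceeds 0; the function name and the learning rate of 1 are illustrative choices.

```python
from typing import List, Sequence, Tuple


def perceptron_update(weights: Sequence[float],
                      inputs: Sequence[int],
                      target: int,
                      eta: float = 1.0) -> Tuple[List[float], int]:
    """Apply one perceptron step; weights[0] plays the role of -T (the bias weight)."""
    x = [1] + list(inputs)                         # constant input o_0 = 1
    net = sum(w * o for w, o in zip(weights, x))
    output = 1 if net > 0 else 0                   # threshold is always 0
    new_weights = [w + eta * (target - output) * o for w, o in zip(weights, x)]
    return new_weights, output


# Output too low (0 instead of 1): weights on active inputs and the bias go up.
raised, out_low = perceptron_update([0, 0, 0], (1, 0), target=1)
# Output too high (1 instead of 0): the same weights go down.
lowered, out_high = perceptron_update([1, 1, 0], (1, 0), target=0)
# Output correct: nothing changes.
same, out_ok = perceptron_update([1, 1, 0], (1, 0), target=1)
print(raised, lowered, same)
```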

29
Perceptron Learning Algorithm

Repeatedly iterate through examples adjusting
weights according to the perceptron learning rule
until all outputs are correct
Initialize the weights to all zero (or randomly)
Until outputs for all training examples are
correct
    For each training example, e, do
        Compute the current output $o_j$
        Compare it to the target $t_j$ and update the
        weights according to the perceptron learning rule.
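Put together, the algorithm is a loop over epochs that stops after the first epoch with no errors. This is a minimal sketch assuming zero initial weights, a learning rate of 1, the threshold folded into a bias weight, and a `max_epochs` safety cap that is not part of the algorithm as stated; the AND training set is just an example of linearly separable data.

```python
from typing import List, Optional, Sequence, Tuple

TrainingSet = Sequence[Tuple[Sequence[int], int]]


def predict(weights: Sequence[float], inputs: Sequence[int]) -> int:
    x = [1] + list(inputs)                      # constant bias input
    return 1 if sum(w * o for w, o in zip(weights, x)) > 0 else 0


def train_perceptron(examples: TrainingSet, n_inputs: int,
                     eta: float = 1.0,
                     max_epochs: int = 100) -> Tuple[List[float], Optional[int]]:
    """Return (weights, epoch of first error-free pass), or (weights, None) if capped."""
    weights = [0.0] * (n_inputs + 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in examples:
            output = predict(weights, inputs)
            if output != target:
                errors += 1
                x = [1] + list(inputs)
                weights = [w + eta * (target - output) * o for w, o in zip(weights, x)]
        if errors == 0:
            return weights, epoch
    return weights, None


and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, epochs = train_perceptron(and_data, n_inputs=2)
print(weights, epochs)
```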

30
Algorithm Notes

Each execution of the outer loop is called an
epoch.
If the output is considered as concept membership
and inputs as binary input features, then easily
applied to concept learning problems.
For multiple category problems, learn a separate
perceptron for each category and assign to the
class whose perceptron most exceeds its
threshold.
When will this algorithm terminate (converge)?

31
Representational Limitations

Perceptrons can only represent linear threshold
functions and can therefore only learn data which
is linearly separable (positive and negative
examples are separable by a hyperplane in
n-dimensional space)
Cannot represent exclusive-or (xor)

32
Perceptron Learnability

System obviously cannot learn what it cannot
represent.
Minsky and Papert (1969) demonstrated that many
functions like parity (n-input generalization of
xor) could not be represented.
In visual pattern recognition, they assumed that input
features are local, each extracting a feature within a
fixed radius. In that case, no set of input features
supports learning
Symmetry
Connectivity
These limitations discouraged subsequent research
on neural networks.

33
Perceptron Convergence and Cycling Theorems

Perceptron Convergence Theorem: If there is a
set of weights that is consistent with the
training data (i.e. the data is linearly
separable), the perceptron learning algorithm
will converge (Minsky & Papert, 1969).
Perceptron Cycling Theorem: If the training data
is not linearly separable, the perceptron
learning algorithm will eventually repeat the
same set of weights and threshold at the end of
some epoch and therefore enter an infinite loop.
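The cycling behaviour can be observed on xor, which is not linearly separable. The sketch below assumes integer weights starting at zero, a learning rate of 1, and a fixed example order, so that weight vectors can be compared exactly; it records the weights at the end of each epoch and stops as soon as a state repeats. The function name and the `max_epochs` cap are illustrative.

```python
from typing import Sequence, Tuple

TrainingSet = Sequence[Tuple[Sequence[int], int]]


def run_until_cycle(examples: TrainingSet, n_inputs: int,
                    max_epochs: int = 1000) -> Tuple[str, int, Tuple[int, ...]]:
    """Run perceptron epochs until convergence or until end-of-epoch weights repeat."""
    weights = tuple([0] * (n_inputs + 1))       # weights[0] is the bias weight
    seen = set()
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for inputs, target in examples:
            x = (1,) + tuple(inputs)
            output = 1 if sum(w * o for w, o in zip(weights, x)) > 0 else 0
            if output != target:
                errors += 1
                weights = tuple(w + (target - output) * o for w, o in zip(weights, x))
        if errors == 0:
            return "converged", epoch, weights
        if weights in seen:
            # Same weights at the end of an earlier epoch: the run will repeat forever.
            return "cycle", epoch, weights
        seen.add(weights)
    return "undecided", max_epochs, weights


xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_status, xor_epoch, xor_weights = run_until_cycle(xor_data, 2)
and_status, and_epoch, and_weights = run_until_cycle(and_data, 2)
print(xor_status, and_status)
```

Because the update is deterministic, seeing the same end-of-epoch weights twice means every later epoch replays the same sequence of updates.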

34
Perceptron Learning as Hill Climbing

The search space for Perceptron learning is the
space of possible values for the weights (and
threshold).
The evaluation metric is the error these weights
produce when used to classify the training
examples.
The perceptron learning algorithm performs a form
of hill-climbing (gradient descent), at each
point altering the weights slightly in a
direction to help minimize this error.
Perceptron convergence theorem guarantees that
for the linearly separable case there is only one
local minimum and the space is well behaved.

35
Perceptron Performance

Can represent and learn conjunctive concepts and
M-of-N concepts (true if any M of a set of N
selected binary features are true).
Although simple and restrictive, this high-bias
algorithm performs quite well on many realistic
problems.
However, the representational restriction is
limiting in many applications.
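An M-of-N concept fits a single threshold unit: give each of the N selected features weight 1 and set the threshold just below M. The sketch below reads "any M" as "at least M" and uses a threshold of M − 0.5, both of which are assumptions; a conjunction of N features is the special case M = N.

```python
from itertools import product
from typing import Callable, Sequence


def m_of_n_unit(m: int, n: int) -> Callable[[Sequence[int]], int]:
    """Threshold unit that outputs 1 when at least m of its n binary inputs are 1."""
    weights = [1] * n
    threshold = m - 0.5

    def unit(inputs: Sequence[int]) -> int:
        net = sum(w * x for w, x in zip(weights, inputs))
        return 1 if net > threshold else 0

    return unit


two_of_four = m_of_n_unit(2, 4)
conjunction = m_of_n_unit(3, 3)    # 3-of-3 behaves as a logical AND
print(two_of_four((1, 0, 1, 0)), two_of_four((0, 0, 1, 0)), conjunction((1, 1, 1)))
```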
