Title: Machine Learning: Symbol-based
1. Machine Learning: Symbol-based (Chapter 9)
9.0 Introduction
9.1 A Framework for Symbol-based Learning
9.2 Version Space Search
9.3 The ID3 Decision Tree Induction Algorithm
9.4 Inductive Bias and Learnability
9.5 Knowledge and Learning
9.6 Unsupervised Learning
9.7 Reinforcement Learning
9.8 Epilogue and References
9.9 Exercises
Additional sources used in preparing the slides:
- Jeffrey Ullman's data mining lecture notes (clustering)
- Ernest Davis's lecture notes (clustering)
- Dean, Allen, and Aloimonos' AI textbook (reinforcement learning)
2. Unsupervised learning
3. Conceptual Clustering
- The clustering problem:
  - Given
    - a collection of unclassified objects, and
    - a means for measuring the similarity of objects (a distance metric),
  - find
    - classes (clusters) of objects such that some standard of quality is met (e.g., maximize the similarity of objects in the same class).
- Essentially, it is an approach to discover a useful summary of the data.
4. Conceptual Clustering (cont'd)
- Essentially, it is an approach to discover a useful summary of the data.
- Ideally, we would like to represent clusters along with their semantic explanations. In other words, we would like to define clusters intensionally (i.e., by general rules) rather than extensionally (i.e., by enumeration).
- For instance, compare
  - {X | X teaches AI at MTU CS}, and
  - {John Lowther, Nilufer Onder}
5. Example: a cholera outbreak in London
- Many years ago, during a cholera outbreak in London, a physician plotted the locations of cases on a map. Properly visualized, the data indicated that cases clustered around certain intersections where there were polluted wells, not only exposing the cause of cholera but also indicating what to do about the problem.
(Figure: map with an X plotted at the location of each cholera case; the cases cluster around the polluted wells.)
6. Higher dimensional examples
- The observation that customers who buy diapers are more likely than average to buy beer allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between the two increased the sales of all three items.
7. Higher dimensional examples (cont'd)
- Skycat clustered 2 x 10^9 sky objects into stars, galaxies, quasars, etc. Each object was a point in a space of 7 dimensions, with each dimension representing radiation in one band of the spectrum. The Sloan Sky Survey is a more ambitious attempt to catalog and cluster the entire visible universe. Clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
8. Skycat software
9. Higher dimensional examples (cont'd)
- Documents may be thought of as points in a high-dimensional space, where each dimension corresponds to one possible word. The position of a document in a dimension is the number of times the word occurs in the document (or just 1 if it occurs, 0 if not). Clusters of documents in this space often correspond to groups of documents on the same topic.
- The query "salsa" submitted to MetaCrawler returns 246 documents in 15 clusters, of which the top ones are:
  - Puerto Rico, Latin Music (8 docs)
  - Follow Up Post, York Salsa Dancers (20 docs)
  - music, entertainment, latin, artists (40 docs)
  - hot, food, chiles, sauces, condiments, companies (79 docs)
  - pepper, onion, tomatoes (41 docs)
10. Measuring distance
- To discuss whether a set of points is close enough to be considered a cluster, we need a distance measure D(x,y) that tells how far apart points x and y are.
- The usual axioms for a distance measure D are:
  1. D(x,x) = 0. A point is distance 0 from itself.
  2. D(x,y) = D(y,x). Distance is symmetric.
  3. D(x,y) <= D(x,z) + D(z,y). The triangle inequality.
11. K-dimensional Euclidean space
- The distance between any two points, say a = (a1, a2, ..., ak) and b = (b1, b2, ..., bk), is given in one of the usual manners:
  1. Common distance (L2 norm): sqrt( sum over i = 1..k of (ai - bi)^2 )
  2. Manhattan distance (L1 norm): sum over i = 1..k of |ai - bi|
  3. Max of dimensions (L-infinity norm): max over i = 1..k of |ai - bi|
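A minimal sketch of the three norms in Python (the function names are mine, not from the slides):

```python
import math

def l2_distance(a, b):
    # Euclidean distance: square root of the sum of squared differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def l1_distance(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def linf_distance(a, b):
    # L-infinity norm: the largest single coordinate difference.
    return max(abs(ai - bi) for ai, bi in zip(a, b))

a, b = (0, 0), (3, 4)
print(l2_distance(a, b))    # 5.0
print(l1_distance(a, b))    # 7
print(linf_distance(a, b))  # 4
```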
12. Non-Euclidean spaces
- Here are some examples where a distance measure without a Euclidean space makes sense.
- Web pages: Roughly a 10^8-dimensional space where each dimension corresponds to one word. Rather, use vectors that deal with only the words actually present in documents a and b.
- Character strings, such as DNA sequences: Rather, use a metric based on the LCS (Longest Common Subsequence).
- Objects represented as sets of symbolic, rather than numeric, features: Rather, base similarity on the proportion of features that they have in common.
13. Non-Euclidean spaces (cont'd)
- object1 = {small, red, rubber, ball}
- object2 = {small, blue, rubber, ball}
- object3 = {large, black, wooden, ball}
- similarity(object1, object2) = 3/4
- similarity(object1, object3) = similarity(object2, object3) = 1/4
- Note that it is possible to assign different weights to features.
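A minimal sketch of this feature-overlap similarity, assuming the objects are fixed-length tuples of feature values (the uniform weighting could be replaced by a weighted sum):

```python
def similarity(obj1, obj2):
    # Proportion of feature slots on which the two objects agree.
    assert len(obj1) == len(obj2)
    return sum(f1 == f2 for f1, f2 in zip(obj1, obj2)) / len(obj1)

object1 = ("small", "red",   "rubber", "ball")
object2 = ("small", "blue",  "rubber", "ball")
object3 = ("large", "black", "wooden", "ball")

print(similarity(object1, object2))  # 0.75 (= 3/4)
print(similarity(object1, object3))  # 0.25 (= 1/4)
```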
14. Approaches to Clustering
- Broadly speaking, there are two classes of clustering algorithms:
- 1. Centroid approaches: We guess the centroid (central point) of each cluster, and assign points to the cluster of their nearest centroid.
- 2. Hierarchical approaches: We begin by assuming that each point is a cluster by itself. We repeatedly merge nearby clusters, using some measure of how close two clusters are (e.g., the distance between their centroids) or how good a cluster the resulting group would be (e.g., the average distance of points in the cluster from the resulting centroid). A minimal sketch of this idea follows.
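The sketch below is an agglomerative version using centroid distance as the merge criterion; all names and the example points are mine:

```python
def hierarchical_cluster(points, target_k):
    # Start with every point as its own cluster, then repeatedly merge
    # the two clusters whose centroids are closest, until target_k remain.
    clusters = [[p] for p in points]

    def centroid(cluster):
        dims = len(cluster[0])
        return tuple(sum(p[d] for p in cluster) / len(cluster)
                     for d in range(dims))

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist2(centroid(clusters[ij[0]]),
                                        centroid(clusters[ij[1]])))
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
    return clusters

print(hierarchical_cluster([(0, 0), (0, 1), (5, 5), (6, 5)], 2))
# [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```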
15. The k-means algorithm
- Pick k cluster centroids.
- Assign points to clusters by picking the closest centroid to the point in question. As points are assigned to clusters, the centroid of the cluster may migrate.
- Example: Suppose that k = 2 and we assign points 1, 2, 3, 4, 5, in that order. Outline circles represent points; filled circles represent centroids.
(Figure: the points 1-5 plotted before assignment.)
16. The k-means algorithm example (cont'd)
(Figure: successive snapshots of points 1-5 being assigned to the two clusters, with the centroids migrating after each assignment.)
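A minimal k-means sketch in Python. Note that the slides describe the centroid migrating as each point is assigned in turn; the sketch below uses the common batch variant (assign all points, then recompute each centroid), and all names are mine:

```python
import random

def k_means(points, k, iterations=20):
    # Pick k initial centroids, then alternate between assigning each
    # point to its nearest centroid and moving each centroid to the
    # mean of the points assigned to it.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((pc - cc) ** 2
                                            for pc, cc in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:
                dims = len(cluster[0])
                centroids[i] = tuple(sum(p[d] for p in cluster) / len(cluster)
                                     for d in range(dims))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(centroids)   # two centroids, one near (1.3, 1.3), one near (8.3, 8.3)
```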
17. Issues
- How to initialize the k centroids? Pick points sufficiently far away from any other centroid, until there are k.
- As computation progresses, one can decide to split one cluster and merge two, to keep the total at k. A test for whether to do so might be to ask whether doing so reduces the average distance from points to their centroids.
- Having located the centroids of the k clusters, we can reassign all points, since some points that were assigned early may actually wind up closer to another centroid as the centroids move about.
18. Issues (cont'd)
- How to determine k? One can try different values for k, looking for the smallest k such that increasing k does not much decrease the average distance of points to their centroids.
(Figure: a data set with three apparent clusters.)
19. Issues (cont'd)
- When k = 1, all the points are in one cluster, and the average distance to the centroid will be high.
- When k = 2, one of the clusters will be by itself and the other two will be forced into one cluster. The average distance of points to the centroid will shrink considerably.
(Figure: the three-cluster data set grouped with k = 1 and with k = 2.)
20. Issues (cont'd)
- When k = 3, each of the apparent clusters should be a cluster by itself, and the average distance from the points to their centroids shrinks again.
- When k = 4, one of the true clusters will be artificially partitioned into two nearby clusters. The average distance to centroid will drop a bit, but not much.
(Figure: the same data set grouped with k = 3 and with k = 4.)
21. Issues (cont'd)
(Figure: average radius plotted against k = 1, 2, 3, 4; the curve drops steeply up to k = 3 and then flattens.)
- This failure to drop further suggests that k = 3 is right. This conclusion can be drawn even if the data is in so many dimensions that we cannot visualize the clusters.
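This "try increasing k and watch the average radius" idea can be sketched directly on top of the k-means skeleton from slide 16 (synthetic data and all names are mine; math.dist needs Python 3.8+):

```python
import math, random

def average_radius(points, k, iterations=20):
    # Run a simple batch k-means, then return the average distance
    # from each point to its assigned centroid.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(q[d] for q in cl) / len(cl)
                                     for d in range(len(cl[0])))
    return (sum(math.dist(p, centroids[i])
                for i, cl in enumerate(clusters) for p in cl) / len(points))

# Three synthetic clusters; the average radius should flatten after k = 3.
random.seed(0)
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(30)]
for k in range(1, 6):
    print(k, round(average_radius(points, k), 2))
```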
22. The CLUSTER/2 algorithm
- 1. Select k seeds from the set of observed objects. This may be done randomly or according to some selection function.
- 2. For each seed, using that seed as a positive instance and all other seeds as negative instances, produce a maximally general definition that covers all of the positive and none of the negative instances. (Multiple classifications of non-seed objects are possible.)
23. The CLUSTER/2 algorithm (cont'd)
- 3. Classify all objects in the sample according to these descriptions. Replace each maximally general description with a maximally specific description that covers all objects in the category (to decrease the likelihood that classes overlap on unseen objects).
- 4. Adjust remaining overlapping definitions.
- 5. Using a distance metric, select an element closest to the center of each class.
- 6. Repeat steps 1-5 using the new central elements as seeds. Stop when clusters are satisfactory.
24. The CLUSTER/2 algorithm (cont'd)
- 7. If clusters are unsatisfactory and no improvement occurs over several iterations, select the new seeds closest to the edge of the cluster, rather than the center.
25. The steps of a CLUSTER/2 run
26. A COBWEB clustering for four one-celled organisms (Gennari et al., 1989)
Note: we will skip the COBWEB algorithm.
27. Related communities
- data mining (in databases, over the web)
- statistics
- clustering algorithms
- visualization
- databases
28. Reinforcement Learning
- A form of learning where the agent can explore and learn through interaction with the environment.
- The agent learns a policy, which is a mapping from states to actions. The policy tells what the best move is in a particular state.
- It is a general methodology: planning, decision making, and search can all be viewed in the context of reinforcement learning.
29. Tic-tac-toe: a different approach
- Recall the minimax approach: The agent knows its current state. It generates a two-layer search tree, taking into account all the possible moves for itself and the opponent. It backs up values from the leaf nodes and takes the best move, assuming that the opponent will also do so.
- An alternative is to directly start playing with an opponent (who does not have to be perfect, but could as well be). Assume no prior knowledge or lookahead. Assign values to states: 1 if it is a win, 0 if it is a loss or a draw, 0.5 for anything else.
30. Tic-tac-toe: a different approach (cont'd)
Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves. So the learner has no guidance initially. It engages in playing. When a game ends, if it is a win, the value 1 is propagated backwards. If it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their true value. After several plays, the learner will learn the best move given a state (a policy).
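One common way to implement this backward propagation (a sketch, not necessarily the exact scheme the slides intend) is a temporal-difference-style update applied to the visited states after each game; the learning rate ALPHA and the state keys are assumptions of mine:

```python
ALPHA = 0.1   # learning rate: an assumed constant, not from the slides
values = {}   # maps a state key to its estimated value; default 0.5

def update_after_game(visited_states, outcome):
    # Walk the game backwards, nudging each state's value toward the
    # value of the state that followed it (the outcome for the last one).
    target = outcome
    for state in reversed(visited_states):
        v = values.get(state, 0.5)
        values[state] = v + ALPHA * (target - v)
        target = values[state]

update_after_game(["s1", "s2", "s3"], outcome=1)
print(values)  # later states have moved furthest toward the winning value
```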
31. Issues in generalizing this approach
- How will the state values be initialized or propagated backwards?
- What if there is no end to the game (infinite horizon)?
- This is an optimization problem, which suggests that it is hard. How can an optimal policy be learned?
32. A simple robot domain
The robot is in one of the states 0, 1, 2, 3. Each one represents an office; the offices are connected in a ring. Three actions are available:
  + moves to the next state
  - moves to the previous state
  @ remains at the same state
(Figure: state-transition diagram of offices 0-3 arranged in a ring, with a + arc to the next office, a - arc to the previous office, and a @ self-loop at each office.)
33. The robot domain (cont'd)
- The robot can observe the label of the state it is in and perform any action corresponding to an arc leading out of its current state.
- We assume that there is a clock governing the passage of time, and that at each tick of the clock the robot has to perform an action.
- The environment is deterministic: there is a unique state resulting from any initial state and action. (Yes, the diagram on the previous page is a state-transition diagram.)
- Each state has a reward: 10 for state 3, 0 for the others.
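The domain is small enough to write down directly; a minimal encoding (the names f and R anticipate the notation of slide 38):

```python
# The four-office ring: states 0-3, actions '+' (next office),
# '-' (previous office), '@' (stay put).
def f(state, action):
    # Deterministic state-transition function.
    return {'+': (state + 1) % 4, '-': (state - 1) % 4, '@': state}[action]

def R(state):
    # Reward: 10 for office 3, 0 for the others.
    return 10 if state == 3 else 0

# Simulate a few clock ticks under the always-move-forward behavior.
state = 0
for tick in range(6):
    print(tick, state, R(state))
    state = f(state, '+')
```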
34. Compare three policies
- a. Every state is mapped to @.
  - The value of this policy is 0, because the robot will never get to office 3.
- b. Every state is mapped to + (policy 0).
  - The value of this policy is infinite, because the robot will end up in office 3 infinitely often.
- c. Every state except 3 is mapped to +; 3 is mapped to @ (policy 1).
  - The value of this policy is also infinite, because the robot will end up in (stay in) office 3 infinitely often.
35. Compare three policies (cont'd)
So, it is easy to rule case a out, but how can we show that policy 1 is better than policy 0?
- POLICY 1
  - The average reward per tick for state 0 is 10.
  - The discounted cumulative reward for state 0 is 2.5.
- POLICY 0
  - The average reward per tick for state 0 is 10/4.
  - The discounted cumulative reward for state 0 is 1.33.
36. Discounted cumulative reward
- Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards.
- The discount rate (gamma) is a number between 0 and 1 used to discount future rewards.
- The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of gamma^n times the reward associated with the state reached after the n-th tick of the clock.
37. Discounted cumulative reward (cont'd)
- Take gamma = 0.5.
- For state 0 with respect to policy 0:
  0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 + 0.5^4 x 0 + 0.5^5 x 0 + 0.5^6 x 0 + 0.5^7 x 10 + ... = 1.25 + 0.078 + ... = 1.33 in the limit
- For state 0 with respect to policy 1:
  0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 + 0.5^4 x 10 + 0.5^5 x 10 + 0.5^6 x 10 + 0.5^7 x 10 + ... = 2.5 in the limit
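These numbers can be checked with a short computation; a sketch using the domain encoding from slide 33 (repeated here so the snippet stands alone; the 50-tick horizon is an arbitrary truncation that approximates the infinite sum closely for gamma < 1):

```python
def f(state, action):
    return {'+': (state + 1) % 4, '-': (state - 1) % 4, '@': state}[action]

def R(state):
    return 10 if state == 3 else 0

def discounted_reward(start, policy, gamma=0.5, horizon=50):
    # Sum of gamma^n times the reward of the state after n ticks.
    total, state = 0.0, start
    for n in range(horizon):
        total += gamma ** n * R(state)
        state = f(state, policy[state])
    return total

policy0 = {0: '+', 1: '+', 2: '+', 3: '+'}   # keep moving forward
policy1 = {0: '+', 1: '+', 2: '+', 3: '@'}   # stop once office 3 is reached
print(round(discounted_reward(0, policy0), 2))  # 1.33
print(round(discounted_reward(0, policy1), 2))  # 2.5
```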
38. Discounted cumulative reward (cont'd)
- Let
  j be a state,
  R(j) be the reward for ending up in state j,
  pi be a fixed policy,
  pi(j) be the action dictated by pi in state j,
  f(j,a) be the next state given the robot starts in state j and performs action a,
  V_pi_i(j) be the estimated value of state j with respect to the policy pi after the i-th iteration of the algorithm.
- Using a dynamic programming algorithm, one can obtain a good estimate of V_pi, the value function for policy pi, as i goes to infinity.
39. A dynamic programming algorithm to compute values for states
- 1. For each j, set V_pi_0(j) to 0.
- 2. Set i to 0.
- 3. For each j, set V_pi_{i+1}(j) to R(j) + gamma * V_pi_i( f(j, pi(j)) ).
- 4. Set i to i + 1.
- 5. If i is equal to the maximum number of iterations, then return V_pi_i; otherwise, return to step 3.
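A direct transcription of this algorithm for the robot domain (a sketch; the 30-iteration cap is an arbitrary choice, enough for the values to converge at gamma = 0.5):

```python
GAMMA = 0.5

def f(j, a):
    return {'+': (j + 1) % 4, '-': (j - 1) % 4, '@': j}[a]

def R(j):
    return 10 if j == 3 else 0

def estimate_values(policy, iterations=30):
    # Steps 1-2: start with V(j) = 0 for every state.
    V = {j: 0.0 for j in range(4)}
    # Steps 3-5: repeatedly back up V(j) <- R(j) + gamma * V(f(j, pi(j))).
    for _ in range(iterations):
        V = {j: R(j) + GAMMA * V[f(j, policy[j])] for j in range(4)}
    return V

policy1 = {0: '+', 1: '+', 2: '+', 3: '@'}
print(estimate_values(policy1))
# V(0) -> 2.5, V(1) -> 5.0, V(2) -> 10.0, V(3) -> 20.0
```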
40. Temporal credit assignment problem
- The problem of assigning credit or blame to the actions in a sequence of actions where feedback is available only at the end of the sequence.
- When you lose a game of chess or checkers, the blame for your loss cannot necessarily be attributed to the last move you made, or even the next-to-last move.
- Dynamic programming solves the temporal credit assignment problem by propagating rewards backwards to earlier states and hence to actions earlier in the sequence of actions determined by a policy.
41. Computing an optimal policy
- Given a method for estimating the value of states with respect to a fixed policy, it is possible to find an optimal policy, i.e., one that maximizes the discounted cumulative reward.
- Policy iteration [Howard, 1960] is an algorithm that uses the algorithm for computing the value of a state as a subroutine.
42. Policy iteration algorithm
- 1. Let pi_0 be an arbitrary policy.
- 2. Set i to 0.
- 3. Compute V_pi_i(j) for each j.
- 4. Compute a new policy pi_{i+1} so that pi_{i+1}(j) is the action a maximizing R(j) + gamma * V_pi_i( f(j,a) ).
- 5. If pi_{i+1} = pi_i, then return pi_i; otherwise, set i to i + 1, and go to step 3.
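A sketch of policy iteration for the robot domain, using the value-estimation routine from slide 39 as the subroutine (repeated here so the snippet stands alone):

```python
GAMMA = 0.5
ACTIONS = ['+', '-', '@']

def f(j, a):
    return {'+': (j + 1) % 4, '-': (j - 1) % 4, '@': j}[a]

def R(j):
    return 10 if j == 3 else 0

def estimate_values(policy, iterations=30):
    V = {j: 0.0 for j in range(4)}
    for _ in range(iterations):
        V = {j: R(j) + GAMMA * V[f(j, policy[j])] for j in range(4)}
    return V

def policy_iteration():
    policy = {j: '@' for j in range(4)}        # step 1: arbitrary policy
    while True:
        V = estimate_values(policy)            # step 3: evaluate current policy
        # Step 4: pick the greedy action in every state.
        new_policy = {j: max(ACTIONS, key=lambda a: R(j) + GAMMA * V[f(j, a)])
                      for j in range(4)}
        if new_policy == policy:               # step 5: no change, so done
            return policy, V
        policy = new_policy

print(policy_iteration())
```

Note that the returned policy sends office 0 backwards rather than forwards: in the ring, office 3 is directly behind office 0, so '-' reaches the reward in one step. Elsewhere it agrees with policy 1.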
43. Policy iteration algorithm (cont'd)
- A policy pi* is said to be the optimal policy if there is no other policy pi and state j such that V_pi(j) > V_pi*(j) while V_pi(k) >= V_pi*(k) for all k != j.
- The policy iteration algorithm is guaranteed to terminate in a finite number of steps with an optimal policy.
44. Comments on reinforcement learning
- A general model where an agent can learn to
function in dynamic environments - The agent can learn while interacting with the
environment - No prior knowledge except the (probabilistic)
transitions is assumed - Can be generalized to stochastic domains (an
action might have several different probabilistic
consequences, i.e., the state-transition function
is not deterministic) - Can also be generalized to domains where the
reward function is not known
45. Famous example: TD-Gammon (Tesauro, 1995)
- Learns to play Backgammon.
- Immediate reward: +100 if win, -100 if lose, 0 for all other states.
- Trained by playing 1.5 million games against itself (several weeks).
- Now approximately equal to the best human player (won the World Cup of Backgammon in 1992; among the top 3 since 1995).
- Predecessor: NeuroGammon [Tesauro and Sejnowski, 1989] learned from examples of labelled moves (very tedious for a human expert).
46. Other examples
- Robot learning to dock on a battery charger
- Pole balancing
- Elevator dispatching [Crites and Barto, 1995]: better than the industry standard
- Inventory management [Van Roy et al.]: 10-15% improvement over industry standards
- Job-shop scheduling for NASA space missions [Zhang and Dietterich, 1997]
- Dynamic channel assignment in cellular phones [Singh and Bertsekas, 1994]
- Robotic soccer
47. Common characteristics
- delayed reward
- opportunity for active exploration
- possibility that the state is only partially observable
- possible need to learn multiple tasks with the same sensors/effectors
- there may not be an adequate teacher