Title: Machine Learning
1. Machine Learning
- Foundations of Artificial Intelligence
2. Learning
- What is Learning?
- Learning in AI is also called machine learning or pattern recognition.
- The basic objective is to allow an intelligent agent to discover knowledge autonomously from experience.
- Let's examine the definition more closely:
- "an intelligent agent": The ability to learn requires a prior level of intelligence and knowledge. Learning has to start from an existing level of capability.
- "to discover autonomously": Learning is fundamentally about an agent recognizing new facts for its own use and acquiring new abilities that reinforce its existing abilities. Literal programming, i.e. rote learning from instruction, is not useful.
- "knowledge": Whatever is learned has to be represented in some way that the agent can use. "If you can't represent it, you can't learn it" is a corollary of the slogan "Knowledge is power."
- "from experience": Experience is typically a set of so-called training examples. Examples may be categorized or not; they may be random or selected by a teacher; they may include explanations or not.
3. Learning Agent
[Diagram: learning agent architecture. Percepts feed the Critic and the Problem solver; the Critic informs the Learning element, which updates the KB used by the Problem solver to produce Actions.]
4. Learning Element
- Design of a learning element is affected by:
- Which components of the performance element are to be learned
- What feedback is available to learn these components
- What representation is used for the components
- Type of feedback:
- Supervised learning: correct answers for each training example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards/feedback
5. Inductive Learning
- Inductive learning involves learning generalized rules from specific examples (can think of this as the inverse of deduction).
- Main task: given a set of examples, each classified as positive or negative, produce a concept description that matches exactly the positive examples.
- Some notes:
- The examples are coded in some representation language, e.g. they are coded by a finite set of real-valued features.
- The concept description is in a certain language that is presumably a superset of the language of possible example encodings.
- A correct concept description is one that classifies correctly ALL possible examples, not just those given in the training set.
- Fundamental difficulties with induction:
- Can't generalize with perfect certainty.
- Examples and concepts are NOT available directly; they are only available through representations, which may be more or less adequate to capture them.
- Some examples may be classified as both positive and negative.
- The features supplied may not be sufficient to discriminate between positive and negative examples.
6. Inductive Learning Frameworks
- Function-learning formulation
- Logic-inference formulation
7. Inductive Learning
- Simplest form: learn a function from examples
- f is the target function
- An example is a pair (x, f(x))
- Problem: find a hypothesis h
- such that h ≈ f
- given a training set of examples
- This is a highly simplified model of real learning:
- Ignores prior knowledge
- Assumes examples are given
8. Inductive Learning
- Construct/adjust h to agree with f on the training set
- h is consistent if it agrees with f on all examples
- E.g., curve fitting
13. Inductive Learning
- Construct/adjust h to agree with f on the training set
- h is consistent if it agrees with f on all examples
- E.g., curve fitting
- Ockham's razor: prefer the simplest hypothesis consistent with the data
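The curve-fitting idea above can be sketched in code: fit the simplest hypothesis class (a line) to example pairs (x, f(x)) and check that the hypothesis is consistent with the training set. The data points below are illustrative.

```python
# Sketch: curve fitting as inductive learning (pure Python, illustrative data).
# Training set: pairs (x, f(x)) sampled from an unknown target function f.
points = [(0, 1), (1, 3), (2, 5), (3, 7)]  # secretly f(x) = 2x + 1

def fit_line(pts):
    """Least-squares line h(x) = a*x + b -- the simplest hypothesis class."""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_line(points)
h = lambda x: a * x + b
# h is *consistent*: it agrees with f on every training example,
# and by Ockham's razor we prefer it over higher-degree fits.
consistent = all(abs(h(x) - y) < 1e-9 for x, y in points)
```

A degree-3 polynomial would also pass through these four points, but the line is the simpler consistent hypothesis.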
14. Logic-Inference Formulation
- Background knowledge KB
- Training set D (observed knowledge) that is not logically implied by KB
- Inductive inference: find h (inductive hypothesis) such that KB ∧ h implies D
- h = D is a trivial but uninteresting solution (data caching)
- Usually not a sound inference
15. Rewarded Card Example
- Deck of cards, with each card designated by [r,s], its rank and suit, and some cards "rewarded"
- Background knowledge KB:
  ((r=1) ∨ … ∨ (r=10)) ⇒ NUM(r)
  ((r=J) ∨ (r=Q) ∨ (r=K)) ⇒ FACE(r)
  ((s=S) ∨ (s=C)) ⇒ BLACK(s)
  ((s=D) ∨ (s=H)) ⇒ RED(s)
- Training set D:
  REWARD([4,C]) ∧ REWARD([7,C]) ∧ REWARD([2,S]) ∧ ¬REWARD([5,H]) ∧ ¬REWARD([J,S])
- Possible inductive hypothesis:
  h ≡ (NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s]))
- Note: there are several possible inductive hypotheses
16. Learning a Predicate
- Set E of objects (e.g., cards)
- Goal predicate CONCEPT(x), where x is an object in E, takes the value True or False (e.g., REWARD)
- Observable predicates A(x), B(x), …
- e.g., NUM, RED
- Training set:
- values of CONCEPT for some combinations of values of the observable predicates
17. A Possible Training Set
Ex. A B C D E CONCEPT
1 True True False True False False
2 True False False False False True
3 False False True True True False
4 True True True False True True
5 False True True False False False
6 True True False True True False
7 False False True False True False
8 True False True False True True
9 False False False True True False
10 True True True True False True
Note that the training set does not say whether an observable predicate A, …, E is pertinent or not.
18. Learning a Predicate
- Set E of objects (e.g., cards)
- Goal predicate CONCEPT(x), where x is an object in E, takes the value True or False (e.g., REWARD)
- Observable predicates A(x), B(x), …
- e.g., NUM, RED
- Training set:
- values of CONCEPT for some combinations of values of the observable predicates
- Find a representation of CONCEPT in the form
  CONCEPT(x) ⇔ S(A,B,…)
  where S(A,B,…) is a sentence built with the observable predicates, e.g.
  CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
19. Example Set
- An example consists of the values of CONCEPT and the observable predicates for some object x
- An example is positive if CONCEPT is True, else it is negative
- The set X of all examples is the example set
- The training set is a subset of X
20. Hypothesis Space
- A hypothesis is any sentence h of the form
  CONCEPT(x) ⇔ S(A,B,…)
  where S(A,B,…) is a sentence built with the observable predicates
- The set of all hypotheses is called the hypothesis space H
- A hypothesis h agrees with an example if it gives the correct value of CONCEPT
21. Inductive Learning Scheme
22. Size of Hypothesis Space
- n observable predicates
- 2^n entries in the truth table
- In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from
- n = 6 ⇒ about 2×10^19 hypotheses!
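The count works out as follows: each hypothesis assigns True or False to each of the 2^n truth-table rows, giving 2^(2^n) distinct hypotheses.

```python
# Sketch: size of the unrestricted hypothesis space.
# n observable predicates -> 2**n truth-table rows; a hypothesis picks
# True/False for every row, so there are 2**(2**n) hypotheses in total.
def hypothesis_space_size(n):
    return 2 ** (2 ** n)

size = hypothesis_space_size(6)  # 2**64, roughly 1.8 * 10**19
```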
23. Multiple Inductive Hypotheses
Rewarded Card Example (continued). All of the following agree with all the examples in the training set:
  h1 ≡ NUM(x) ∧ BLACK(x) ⇒ REWARD(x)
  h2 ≡ BLACK([r,s]) ∧ ¬(r=J) ⇒ REWARD([r,s])
  h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇒ REWARD([r,s])
  h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇒ REWARD([r,s])
24. Inductive Bias
- Need for a system of preferences, called a bias, to compare possible hypotheses
- Keep-It-Simple (KIS) bias:
- If a hypothesis is too complex it may not be worth learning it
- There are far fewer simple hypotheses than complex ones, hence the hypothesis space is smaller
- Examples:
- Use far fewer observable predicates than suggested by the training set
- Constrain the learnt predicate, e.g., to use only high-level observable predicates such as NUM, FACE, BLACK, and RED and/or to have simple syntax (e.g., conjunction of literals)
- If the bias allows only sentences S that are conjunctions of k << n predicates picked from the n observable predicates, then the size of H is O(n^k)
25. Version Spaces
- Idea: assume you are looking for a CONJUNCTIVE CONCEPT
- e.g., spade A, club 7, club 9: yes; club 8, heart 5: no
- concept: "odd and black"
- Now notice that the set of conjunctive concepts is partially ordered by specificity
  [Specificity lattice, most general to most specific: any card > black > odd black / spade > odd spade > 3 of spades]
- At any point, keep the most specific and least specific conjuncts consistent with the data
- Most specific:
- anything more specific misses some positive instances
- always exists -- conjoin all OK conjunctions
- Least specific:
- anything less specific admits some negative instances
- may not be unique -- imagine all you know is "club": "4" not OK, "odd black" OK, "spade" OK, "black" not OK
- The idea is to gradually merge least and most specific as data comes in.
26. Version Spaces Example
- Step 0: the most specific concept (msc) is the empty set; the least specific concept (lsc) is the set of all cards.
- Step 1: A-spade is found to be in the target set
- msc: A-spade
- lsc: set of all cards
- Step 2: 7-club is found to be in the target set
- msc: odd black cards
- lsc: set of all cards
- Step 3: 8-heart is not in the target set
- msc: odd black cards
- lsc: all odd cards OR all black cards
- . . .
- The training examples are obtained incrementally.
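The most-specific-boundary updates in steps 0-2 can be sketched as a Find-S-style fragment. The two-feature card description and the helper names are illustrative assumptions, not from the slides.

```python
# Sketch of the most-specific-boundary update for conjunctive card concepts.
# A card is a (rank, suit) pair; a hypothesis is a dict of required feature
# values (the conjunction); the empty dict {} means "any card".
def describe(card):
    rank, suit = card
    return {
        "color": "black" if suit in ("S", "C") else "red",
        "parity": "odd" if rank in (1, 3, 5, 7, 9, 11, 13) else "even",
    }

def generalize(msc, card):
    """Drop every conjunct that the new positive example violates."""
    feats = describe(card)
    if msc is None:                       # first positive example:
        return dict(feats)                # msc = its full description
    return {k: v for k, v in msc.items() if feats.get(k) == v}

msc = None
msc = generalize(msc, (1, "S"))   # A-spade -> {color: black, parity: odd}
msc = generalize(msc, (7, "C"))   # 7-club is also odd and black: msc unchanged
```

A full version-space learner would also maintain the least-specific boundary and specialize it on negative examples; only the msc half is shown here.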
27. Predicate as a Decision Tree
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by a decision tree.
- Example: a mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted
- x is a mushroom
- CONCEPT = POISONOUS
- A = YELLOW
- B = BIG
- C = SPOTTED
28. Decision Trees
- What is a decision tree?
- It takes as input the description of a situation as a set of attributes (features) and outputs a yes/no decision (so it represents a Boolean function)
- Each leaf is labeled "positive" or "negative", each node is labeled with an attribute (or feature), and each edge is labeled with a value for the feature of its parent node
- Attribute-value language for examples:
- In many inductive tasks, especially learning decision trees, we need a representation language for examples
- Each example is a finite feature vector
- A concept is a decision tree where nodes are features
29. Decision Trees
- Example: is it a good day to play golf?
- A set of attributes and their possible values:
- outlook: sunny, overcast, rain
- temperature: cool, mild, hot
- humidity: high, normal
- windy: true, false
- A particular instance in the training set might be ⟨overcast, hot, normal, false⟩: play
- In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
30. Using Decision Trees for Classification
- Examples can be classified as follows:
- 1. look at the example's value for the feature specified
- 2. move along the edge labeled with this value
- 3. if you reach a leaf, return the label of the leaf
- 4. otherwise, repeat from step 1
- Example: a decision tree for deciding whether to go play golf, with "outlook" at the root (edges: sunny, overcast, rain), a "humidity" test under sunny (edges: high, normal), and a "windy" test under rain (edges: true, false).
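The four-step procedure can be sketched over the golf tree. The nested-tuple encoding and the leaf labels ("play" / "don't play") are illustrative assumptions; the original slide's figure does not show its leaf labels.

```python
# Sketch: the golf decision tree plus the classification loop (steps 1-4).
# An interior node is (feature_name, {edge_value: subtree}); a leaf is a label.
golf_tree = ("outlook", {
    "sunny":    ("humidity", {"high": "don't play", "normal": "play"}),
    "overcast": "play",
    "rain":     ("windy", {"true": "don't play", "false": "play"}),
})

def classify(tree, example):
    while isinstance(tree, tuple):          # step 1: at an interior node,
        feature, branches = tree            # look at the specified feature
        tree = branches[example[feature]]   # step 2: follow the matching edge
    return tree                             # step 3: leaf reached, return label

example = {"outlook": "overcast", "temperature": "hot",
           "humidity": "normal", "windy": "false"}
result = classify(golf_tree, example)       # overcast -> "play"
```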
31. Classification: A 3-Step Process
- 1. Model construction (learning)
- Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label
- The set of records used for construction of the model is called the training set
- The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees
- 2. Model evaluation (accuracy)
- Estimate the accuracy rate of the model based on a test set
- The known label of each test sample is compared to the classified result from the model
- Accuracy rate: percentage of test-set samples correctly classified by the model
- The test set must be independent of the training set, otherwise over-fitting will occur
- 3. Model use (classification)
- The model is used to classify unseen instances (assigning class labels)
- Predict the value of an actual attribute
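Step 2's accuracy rate can be sketched directly; the model and test set below are toy placeholders.

```python
# Sketch: accuracy rate on a held-out test set (step 2 of the process).
def accuracy(model, test_set):
    """Fraction of test instances whose predicted label matches the known one."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy model and test set (illustrative values only):
model = lambda x: x >= 5                       # predicts True for x >= 5
test_set = [(3, False), (7, True), (5, True), (1, True)]
acc = accuracy(model, test_set)                # 3 of 4 correct -> 0.75
```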
32. Memory-Based Reasoning
- Basic idea: classify new instances based on their similarity to instances we have seen before
- Also called instance-based learning
- Simplest form of MBR: rote learning
- learning by memorization
- save all previously encountered instances; given a new instance, find one from the memorized set that most closely resembles the new one; assign the new instance to the same class as its nearest neighbor
- more general methods try to find the k nearest neighbors rather than just one
- but how do we define "resembles"?
- MBR is "lazy":
- it defers all of the real work until a new instance is obtained; no attempt is made to learn a generalized model from the training set
- less data preprocessing and model evaluation, but more work has to be done at classification time
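A minimal k-nearest-neighbor sketch of MBR, assuming "resembles" means Euclidean distance on feature vectors; the memory contents are illustrative.

```python
# Sketch: k-nearest-neighbor classification, the general form of MBR.
from collections import Counter
import math

def knn_classify(memory, new_instance, k=3):
    """memory: list of (feature_vector, label) pairs already seen.
    Returns the majority label among the k stored instances closest
    to new_instance (Euclidean distance)."""
    nearest = sorted(memory, key=lambda m: math.dist(m[0], new_instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# All work happens here, at classification time -- nothing was "trained".
memory = [((0, 0), "neg"), ((0, 1), "neg"),
          ((5, 5), "pos"), ((6, 5), "pos"), ((5, 6), "pos")]
```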
33. MBR: Collaborative Filtering
- Collaborative filtering, or "social" learning:
- the idea is to give recommendations to a user based on the ratings of objects by other users
- usually assumes that features in the data are similar objects (e.g., Web pages, music, movies, etc.)
- usually requires explicit ratings of objects by users based on a rating scale
- there have been some attempts to obtain ratings implicitly based on user behavior (mixed results; the problem is that implicit ratings are often binary)
- Nearest-neighbor strategy:
- find similar users and predict a (weighted) average of their ratings
- we can use any distance or similarity measure to compute similarity among users (user ratings on items viewed as a vector)
- in the case of ratings, the Pearson r correlation is often used
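The Pearson r computation mentioned above can be sketched as follows; the user dictionaries and item names are illustrative.

```python
# Sketch: Pearson correlation between two users' ratings on co-rated items,
# the user-similarity measure used in collaborative filtering.
import math

def pearson(u, v):
    """u, v: dicts mapping item -> rating. Correlation over common items."""
    common = [i for i in u if i in v]
    n = len(common)
    if n == 0:
        return 0.0
    mu = sum(u[i] for i in common) / n
    mv = sum(v[i] for i in common) / n
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = math.sqrt(sum((u[i] - mu) ** 2 for i in common) *
                    sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

sally = {"m1": 7, "m2": 6, "m3": 2}
karen = {"m1": 7, "m2": 6, "m3": 2}   # identical ratings -> r = 1.0
```

Users with high correlation to Karen would then contribute most weight to her predicted rating for an unseen movie.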
34. MBR: Collaborative Filtering
- Collaborative filtering example:
- a movie rating system
- ratings scale: 1 = detest; 7 = love it
- the historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
- Karen is a new user who has rated 3 movies but has not yet seen Independence Day; should we recommend it to her?
- Will Karen like Independence Day?
35. Clustering
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set.
- Cluster:
- a collection of data objects that are similar to one another and thus can be treated collectively as one group
- but, as a collection, sufficiently different from other groups
- Clustering:
- unsupervised classification
- no predefined classes
36. Distance or Similarity Measures
- Measuring distance:
- In order to group similar items, we need a way to measure the distance between objects (e.g., records)
- Note: distance is the inverse of similarity
- Often based on the representation of objects as feature vectors
- Examples: term frequencies for documents; an employee DB (tables omitted)
37. Distance or Similarity Measures
- Common distance measures:
- Manhattan distance
- Euclidean distance
- Cosine similarity
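Minimal sketches of the three measures on numeric feature vectors:

```python
# Sketch: common distance/similarity measures on feature vectors.
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

a, b = (1, 2, 3), (4, 6, 3)
# manhattan(a, b) = 3 + 4 + 0 = 7; euclidean(a, b) = sqrt(9 + 16) = 5
```

Note the direction of each scale: the two distances are 0 for identical vectors, while cosine similarity is 1 for vectors pointing the same way.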
38. What Is Good Clustering?
- A good clustering will produce high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
- The quality of a clustering result also depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
- The quality of a clustering result also depends on the definition and representation of cluster chosen
39. Applications of Clustering
- Clustering has wide applications in pattern recognition
- Spatial data analysis:
- create thematic maps in GIS by clustering feature spaces
- detect spatial clusters and explain them in spatial data mining
- Image processing
- Market research
- Information retrieval:
- document or term categorization
- information visualization and IR interfaces
- Web mining:
- cluster Web usage data to discover groups of similar access patterns
- Web personalization
40. Learning by Discovery
- One example: AM, by Doug Lenat at Stanford
- a mathematical discovery system
- inputs: set theory (union, intersection, etc.) and heuristics for how to do mathematics (based on a book by Polya), e.g., if f is an interesting function of two arguments, then f(x,x) is an interesting function of one, etc.
- speculated about what was interesting and made conjectures, etc.
- What AM discovered:
- integers (as an equivalence relation on cardinality of sets)
- addition (using disjoint union of sets)
- multiplication
- primes: "1" was interesting, the function returning the cardinality of the set of divisors was interesting, etc.
- Goldbach's conjecture: all even numbers are the sum of two prime numbers (note that AM did not prove it, just discovered that it was interesting)
- Why was AM so successful?
- connection between LISP and mathematics (mutations of small bits of LISP code are likely to be interesting)
- doesn't extend to other domains
- Lessons from EURISKO (fleet game)
41. Explanation-Based Learning
- Explanation-based learning (EBL) systems try to explain why each training instance belongs to the target concept.
- The resulting proof is then generalized and saved.
- If a new instance can be explained in the same manner as a previous instance, then it is also assumed to be a member of the target concept.
- Like macro-operators, EBL systems never learn to solve a problem that they couldn't solve before (in principle).
- However, they can become much more efficient at problem-solving by reorganizing the search space.
- One of the strengths of EBL is that the resulting explanations are typically easy to understand.
- One of the weaknesses of EBL is that it relies on a domain theory to generate the explanations.
42. Case-Based Learning
- Case-based reasoning (CBR) systems keep track of previously seen instances and apply them directly to new ones.
- In general, a CBR system simply stores each case that it experiences in a "case base", which represents its memory of previous episodes.
- To reason about a new instance, the system consults its case base and finds the most similar case that it's seen before. The old case is then adapted and applied to the new situation.
- CBR is similar to reasoning by analogy. Many people believe that much of human learning is case-based in nature.
43. Connectionist Algorithms
- Connectionist models (also called neural networks) are inspired by the interconnectivity of the brain.
- Connectionist networks typically consist of many nodes that are highly interconnected. When a node is activated, it sends signals to other nodes so that they are activated in turn.
- Using layers of nodes allows connectionist models to learn fairly complex functions.
- Neural networks are loosely modeled after the biological processes involved in cognition:
- 1. Information processing involves many simple elements called neurons.
- 2. Signals are transmitted between neurons using connecting links.
- 3. Each link has a weight that controls the strength of its signal.
- 4. Each neuron applies an activation function to the input that it receives from other neurons. This function determines its output.
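Points 1-4 above can be sketched as a single artificial neuron; the sigmoid is chosen here as an illustrative activation function.

```python
# Sketch: a single artificial neuron, following points 1-4 above.
import math

def neuron(inputs, weights, bias=0.0):
    """Each link weight controls the strength of its input signal (point 3);
    the neuron sums the weighted signals and applies an activation
    function to produce its output (point 4)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

out = neuron([1.0, 0.0], [2.0, -3.0])  # sigmoid(2.0), roughly 0.88
```

Layering such units, with each layer's outputs feeding the next layer's inputs, is what lets connectionist models represent fairly complex functions.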