Title: CS 512 Machine Learning
Slide 1: CS 512 Machine Learning
- Berrin Yanikoglu
- Slides are expanded from the Machine Learning (Mitchell) book slides
- Some of the extra slides thanks to T. Jaakkola (MIT) and others
Slide 2: CS512 - Machine Learning
- Description: This is an introductory-level course on machine learning. Topics covered will include theoretical aspects of learning and main approaches to pattern recognition (Bayesian approaches, decision trees, neural networks, ...). The emphasis will be on what is possible rather than on techniques (which is what is covered in the EE566 Pattern Recognition course). Some of the content will be tailored according to student composition.
- This course is especially intended for students working in the area of pattern recognition or related fields, to deepen their understanding of machine learning/pattern recognition topics. Students who have already taken EE566 Pattern Recognition may find significant material repeated in this course (see the syllabus); while there will be significant overlap, this course will cover some topics that were not covered sufficiently in EE566, particularly theoretical aspects, neural network approaches, support vector machines, and different learning paradigms.
- Prerequisites: None. Undergraduate-level probability and linear algebra are helpful.
- Matlab or other toolboxes will be used for homework assignments.
- Book: Machine Learning by T. Mitchell (ML).
- Supplementary book: Introduction to Machine Learning, by Ethem Alpaydin (Ethem).
- We will follow and cover the ML book until the end of Chapter 9.
- We will also read important research articles on the topic, and some student presentations/discussions are expected.
- Course schedule: Wed 10:40-12:30 in FENS L058 (note change of time); Thu 11:40-12:30 in FENS L065
- Instructor: Berrin Yanikoglu (berrin_at_sabanciuniv.edu, FENS 2056)
- Office hours: Walk-in.
Slide 3: Syllabus
- Week 1 (28 September - 2 October): ML1 - Introduction to ML; ML2 - Concept Learning
- Week 2 (5-9 October): ML2 - Concept Learning
- Week 3 (12-16 October): ML3 - Decision Trees
- Week 4 (19-23 October): ML4 - ANN (MLP)
- Week 5 (26-30 October; 29th is a holiday): ML4 - ANN (MLP)
- Week 6 (2-6 November): ML6 - Bayesian Learning (Bayes formula, naive Bayes)
- Week 7 (9-13 November): ML6 - Bayesian Learning (Bayes formula, naive Bayes) continued; Ethem 4-5 - Intro to multivariate methods
- Week 8 (16-20 November): Midterm 1 (1.5 hrs - no class); Ethem Chp 5 - Multivariate methods continued
- Week 9 (23-27 November; 27th is a holiday): Ethem 4 - Parameter Estimation (ML, MAP and Bayes estimates); slides on intrinsic error, bias-variance
- Week 10 (30 November - 4 December): ML8, Ethem 8 - Non-parametric density estimation (Parzen windows, KNN); RBF networks
- Week 11 (7-11 December): Ethem 7 - Linear Discriminant Analysis (intro) and Support Vector Machines
- Week 12 (14-18 December): Midterm 2 (1.5 hrs - no class); ML5 - Evaluating Hypotheses
- Week 13 (21-25 December): ML7 - Computational Learning Theory (PAC learning, VC dimension)
- Week 14 (28 December - 1 January; Jan 1st is a holiday): Ethem 14 - Assessing and Comparing Classification Algorithms; Ethem 15 - Classifier Combination
Slide 4: What is learning?
- "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time." (Herbert Simon)
- "Learning is any process by which a system improves performance from experience." (Herbert Simon)
- "Learning is constructing or modifying representations of what is being experienced." (Ryszard Michalski)
- "Learning is making useful changes in our minds." (Marvin Minsky)
Slide 5: Machine Learning - Example
- The mind-reading game
- Written by Y. Freund and R. Schapire
- Repeat 200 times:
- Computer guesses whether you'll type 0 or 1
- You type 0 or 1
- The computer is right much more than half the time. How? (See the sketch below.)
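The trick is that human 0/1 sequences are far from random: recent keystrokes predict the next one. Below is a minimal Python sketch of one way such a predictor can work, a simple context model that counts what followed each recent k-bit history. This is only an illustration; the actual Freund and Schapire applet uses a more elaborate scheme, and all names here (`play`, `get_user_bit`) are hypothetical.

```python
from collections import defaultdict
import random

def play(get_user_bit, rounds=200, k=4):
    """Guess the user's next bit from counts of what followed
    each recent k-bit history (a simple context model)."""
    counts = defaultdict(lambda: [0, 0])   # history -> [count of 0s, count of 1s]
    history = ()
    correct = 0
    for _ in range(rounds):
        c0, c1 = counts[history]
        # Guess the majority continuation; break ties randomly.
        guess = random.randint(0, 1) if c0 == c1 else int(c1 > c0)
        bit = get_user_bit()               # the user types 0 or 1
        correct += (guess == bit)
        counts[history][bit] += 1          # learn from this round
        history = (history + (bit,))[-k:]  # slide the context window
    return correct / rounds                # typically well above 0.5 for humans
```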
Slide 6: Machine Learning - Example
- One of my favorite AI/Machine Learning sites:
- http://www.20q.net/
Slide 7: Why learn?
- Build software agents that can adapt to their users, to other software agents, or to changing environments
- Fill in skeletal or incomplete specifications about a domain
- Large, complex AI systems cannot be completely derived by hand and require dynamic updating to incorporate new information
- Learning new characteristics expands the domain of expertise and lessens the brittleness of the system
- Discover new things or structure that were previously unknown to humans
- Examples: data mining, scientific discovery
- Understand and improve the efficiency of human learning
Slide 8: Why Study Machine Learning? Building Better Engineering Systems
- Develop systems that can automatically adapt and customize themselves to individual users
- Personalized news or mail filter
- Personalized tutoring
- Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (the knowledge engineering bottleneck)
- Discover new knowledge from large databases (data mining)
- Market basket analysis (e.g. diapers and beer)
- Medical text mining (e.g. migraines to calcium channel blockers to magnesium)
Slide 9: Why Study Machine Learning? Cognitive Science
- Computational studies of learning may help us understand learning in humans and other biological organisms
- Hebbian neural learning
- "Neurons that fire together, wire together."
Slide 10: Related Disciplines
- Artificial Intelligence
- Pattern Recognition
- Data Mining
- Probability and Statistics
- Information theory
- Psychology (developmental, cognitive)
- Neurobiology
- Linguistics
- Philosophy
Slide 11: History of Machine Learning
- 1950s
- Samuel's checker player
- Selfridge's Pandemonium
- 1960s
- Neural networks: Perceptron
- Pattern recognition
- Learning-in-the-limit theory
- Minsky and Papert prove limitations of the Perceptron
- 1970s
- Symbolic concept induction
- Winston's arch learner
- Expert systems and the knowledge acquisition bottleneck
- Quinlan's ID3
- Michalski's AQ and soybean diagnosis
- Scientific discovery with BACON
- Mathematical discovery with AM
Slide 12: History of Machine Learning (cont.)
- 1980s
- Advanced decision tree and rule learning
- Explanation-Based Learning (EBL)
- Learning, planning and problem solving
- Utility theory
- Analogy
- Cognitive architectures
- Resurgence of neural networks (connectionism, backpropagation)
- Valiant's PAC Learning Theory
- 1990s
- Data mining
- Reinforcement learning (RL)
- Inductive Logic Programming (ILP)
- Ensembles: Bagging, Boosting, and Stacking
- Bayes Net learning
Slide 13: History of Machine Learning (cont.)
- 2000s
- Kernel methods
- Support vector machines
- Graphical models
- Statistical relational learning
- Transfer learning
- Applications
- Adaptive software agents and web applications
- Learning in robotics and vision
- E-mail management (spam detection)
Slide 14: Major paradigms of machine learning
- Rote learning: Learning by memorization
- Employed by the first machine learning systems, in the 1950s
- Samuel's Checkers program
- Supervised learning: Use specific examples to reach general conclusions or extract general rules
- Classification (concept learning)
- Regression
- Unsupervised learning (clustering): Unsupervised identification of natural groups in data
- Reinforcement learning: Feedback (positive or negative reward) given at the end of a sequence of steps
- Analogy: Determine correspondence between two different representations
- Discovery: Unsupervised; a specific goal is not given
- Batch vs. online learning: All training examples are provided at once, or one at a time (with error estimate and training after each example)
Slide 15: Rote Learning is Limited
- Memorize I/O pairs and perform exact matching with new inputs
- If a computer has not seen the precise case before, it cannot apply its experience
- We want computers to generalize from prior experience
- Generalization is the most important factor in learning
Slide 16: The inductive learning problem
- Extrapolate from a given set of examples to make accurate predictions about future examples
- Supervised versus unsupervised learning
- Learn an unknown function f: X → Y, where X is an input example and Y is the desired output
- Supervised learning implies we are given a training set of (X, Y) pairs by a teacher
- Unsupervised learning means we are only given the Xs
- Semi-supervised learning: mostly unlabelled data
- Reinforcement learning: delayed feedback
Slide 17: Types of supervised learning
- Classification
- We are given the labels of the training objects: (x1, x2, y=T/O)
- We are interested in classifying future objects (x1, x2) with the correct label, i.e. find y for a given (x1, x2)
- [Figure: training objects plotted in a 2-D feature space with axes x1 = size and x2 = color]
- Concept Learning
- We are given positive and negative samples for the concept we want to learn (e.g. Tangerine): (x1, x2, y=+/-)
- We are interested in classifying future objects as a member of the class (a positive example of the concept) or not, i.e. answer +/- for a given (x1, x2)
Slide 18: Types of Supervised Learning
- Regression
- The target function is continuous rather than class membership
Slide 19: Classification
- Assign an object/event to one of a given finite set of categories
- Medical diagnosis
- Credit card applications or transactions
- Fraud detection in e-commerce
- Spam filtering in email
- Recommended books, movies, music
- Financial investments
- Spoken words
- Handwritten letters
Slide 20: Concept learning
- Given a training set of positive and negative examples of a concept
- Construct a description that will accurately classify whether future examples are positive or negative
- Examples: points in a multi-dimensional feature space
- Concept: a function that labels every point in feature space (as +, -, and possibly ?)
Slide 21: Example
- [Figure: a set of positive examples, a set of negative examples, and a query symbol with the caption "How does this symbol classify?"]
- Concept
- Solid red circle in a (regular?) polygon
- What about?
- Figures on left side of page
- Figures drawn before 5pm 2/2/89 <etc.>
Slide 22: Inductive learning framework - Feature Space
- Raw input data from sensors are typically preprocessed to obtain a feature vector, X, that adequately describes all of the relevant features for classifying examples
- Each X is a list of (attribute, value) pairs
- X = [Color=Orange, Shape=Round, Weight=200g]
- Each attribute can be discrete or continuous
- Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes (see the sketch below)
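As a concrete illustration, here is a small Python sketch of turning one such attribute-value example into a point in a 3-dimensional feature space. The attribute names and encodings are made up for this example.

```python
# One example as (attribute, value) pairs; discrete attribute vocabularies
# are hypothetical and chosen just for this illustration.
example = {"Color": "Orange", "Shape": "Round", "Weight": 200}

COLORS = ["Red", "Orange", "Green"]
SHAPES = ["Round", "Square"]

def to_point(x):
    """Map discrete attributes to integer codes, keep continuous ones as-is."""
    return (COLORS.index(x["Color"]), SHAPES.index(x["Shape"]), x["Weight"])

print(to_point(example))  # (1, 0, 200): one point in a 3-D feature space
```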
Slide 23: Feature Space
- [Figure: a 3-D feature space with axes Size (e.g. Big), Color (e.g. Gray), and Weight (e.g. 2500); each example is a point in this space]
- A concept is then a (possibly disjoint) volume in this space.
Slide 24: Inductive learning as search
- Instance space I
- Each instance i ∈ I is a feature vector: i = (v1, v2, ..., vk) ∈ I = V1 x V2 x ... x Vk
- Class C gives the instance's class (to be predicted)
- Model space M defines the possible hypotheses: M: I → C, M = {m1, ..., mn} (possibly infinite)
- Training data can be used to direct the search for a good (consistent, complete, simple) hypothesis in the model space, as sketched below
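To make the search view concrete, the following toy Python sketch (all data and thresholds are hypothetical) enumerates a small finite model space of threshold rules over one feature and keeps only the hypotheses consistent with the training data:

```python
# Training data: (feature value v1, class) pairs -- made up for illustration.
train = [(1.0, 0), (2.0, 0), (3.5, 1), (4.0, 1)]

# Model space M: each model maps an instance to a class via a threshold t.
model_space = [lambda v, t=t: int(v > t) for t in (0.5, 1.5, 2.5, 3.0, 3.75)]

# The training data directs the search: keep hypotheses consistent with
# every labeled instance.
consistent = [m for m in model_space
              if all(m(v) == c for v, c in train)]
print(len(consistent), "consistent hypotheses found")  # here: 2 (t=2.5 and t=3.0)
```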
Slide 25: Learning - Key Steps
- Data and assumptions
- What data is available for the learning task?
- What can we assume about the problem?
- Representation
- How should we represent the examples to be classified?
- Method and estimation
- What are the possible hypotheses?
- How do we adjust our predictions based on the feedback?
- Evaluation
- How well are we doing?
Slide 27: Evaluation of Learning Systems
- Experimental
- Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets (see the sketch after this list)
- Gather data on their performance, e.g. test accuracy, training time, testing time
- Analyze differences for statistical significance
- Theoretical
- Analyze algorithms mathematically and prove theorems about their:
- Computational complexity
- Ability to fit training data
- Sample complexity (number of training examples needed to learn an accurate function)
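As a sketch of the experimental methodology, the hypothetical Python helper below computes k-fold cross-validation scores for one method; running it for two methods with the same seed yields paired, fold-by-fold scores that can then be tested for statistical significance. `fit` and `predict_acc` stand in for whatever learner is being compared.

```python
import random

def k_fold_scores(fit, predict_acc, data, k=5, seed=0):
    """k-fold cross-validation accuracies for one method.
    fit(train) -> model; predict_acc(model, test) -> accuracy in [0, 1]."""
    data = data[:]
    random.Random(seed).shuffle(data)        # same seed -> same folds per method
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(predict_acc(fit(train), test))
    return scores  # compare two methods fold-by-fold, then test significance
```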
Slide 28: Measuring Performance
- Performance of the learner can be measured in one of the following ways, as suitable for the application:
- Classification accuracy
- Solution quality (length, efficiency)
- Speed of performance
Slide 30: Inductive Learning
- "Learning is a characteristic of adaptive systems which are capable of improving their performance on a problem as a function of previous experience, for example, in solving similar problems" (Simon, 1983).
- "The inductive learning process is a heuristic search through a space of symbolic descriptions, generated by the application of various inference rules to the initial observational statements" (Michalski, 1983).
Slide 31: Defining the Learning Task
- Improve on task, T,
- with respect to performance metric, P,
- based on experience, E.
Slide 32: Defining the Learning Task
- T: Playing checkers
- P: Percentage of games won against an arbitrary opponent
- E: Playing practice games against itself
- T: Recognizing hand-written words
- P: Percentage of words correctly classified
- E: Database of human-labeled images of handwritten words
- T: Driving on four-lane highways using vision sensors
- P: Average distance traveled before a human-judged error
- E: A sequence of images and steering commands recorded while observing a human driver
- T: Categorize email messages as spam or legitimate
- P: Percentage of email messages correctly classified
- E: Database of emails, some with human-given labels
Slide 33: Sample Learning Problem
- Learn to play checkers from self-play, using an approach analogous to that used in the first machine learning system, developed by Arthur Samuel at IBM in 1959.
Slide 34: Designing a Learning System
- Choose the training experience
- Choose exactly what is to be learned, i.e. the target function
- Choose the representation for the target function
- Selecting the features and the hypothesis class
- Choose a learning algorithm to infer the target function from the experience
Slide 35: Training Experience
- Direct experience: Sample input and output pairs are given for a useful target function
- E.g. checker boards labeled with the correct move, extracted from records of expert play
- Indirect experience: The given feedback is not direct I/O pairs for a useful target function
- E.g. potentially arbitrary sequences of game moves and their final game results
- Credit/Blame Assignment Problem: How to assign credit/blame to individual moves given only indirect feedback?
Slide 36: Source of Training Data
- Random examples provided outside of the learner's control
- Negative examples available, or only positive?
- Good training examples selected by a benevolent teacher
- "Near miss" examples
- Learner can query an oracle about the class of an unlabeled example in the environment
- Learner can construct an arbitrary example and query an oracle for its label
- Learner can design and run experiments directly in the environment without any human guidance
- The last three cases are forms of active learning
Slide 37: Training vs. Test Distribution
- Generally assume that the training and test examples are independently drawn from the same overall distribution of data
- IID: Independently and identically distributed
- If examples are not independent, collective classification is required
- If the test distribution is different, transfer learning is required
- Almost all ML/PR systems assume IID training data and identical train/test distributions
Slide 38: Choosing a Target Function
- What function is to be learned, and how will it be used by the performance system?
- For 2-class classification, we may want to learn a separating boundary between classes
- For checkers, assume we are given a function for generating the legal moves for a given board position. We want to decide the best move.
- Could learn a function that returns the best move for a given board: ChooseMove(board, legal-moves) → best-move
- Or could learn an evaluation function, V(board) → R, that gives each board position a score for how favorable it is
- V can then be used to pick a move by applying each legal move, scoring the resulting board position, and choosing the move that results in the highest-scoring board position (see the sketch below)
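A minimal Python sketch of this use of V: generate the legal moves, score each resulting board with the learned evaluation function, and take the argmax. `generate_legal_moves` and `apply_move` are assumed to be supplied by the game and are not defined here.

```python
def choose_move(board, v_hat, generate_legal_moves, apply_move):
    """Pick the legal move whose resulting board position scores highest
    under the learned evaluation function v_hat."""
    return max(generate_legal_moves(board),
               key=lambda move: v_hat(apply_move(board, move)))
```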
Slide 39: [Figure: a tree of board positions with evaluation values such as +100 and -100 attached to the boards]
Slide 40: Deciding on the Target Function
- For checkers, we need to further qualify the function we want to learn. For instance, it would be useful to have a function V that assigns the following values to a board b:
- If b is a final winning board, then V(b) = 100
- If b is a final losing board, then V(b) = -100
- If b is a final draw board, then V(b) = 0
- If b is a non-terminal board, then V(b) is in [-100, 100] according to the goodness of the board
- This is what we want to learn.
Slide 41: Representing the Target Function
- The target function can be represented in many ways:
- lookup table
- symbolic rules
- numerical function (of varying complexity)
- neural network
- ...
- There is a trade-off between the expressiveness of a representation and the ease of learning.
- The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function.
Slide 42: Various Target Function Representations
- Numerical functions
- Linear regression
- Neural networks
- Support vector machines
- Symbolic functions
- Decision trees
- Rules in propositional logic
- Rules in first-order predicate logic
- Instance-based functions
- Nearest-neighbor
- Case-based
- Probabilistic Graphical Models
- Naïve Bayes
- Bayesian networks
- Hidden-Markov Models (HMMs)
- Probabilistic Context Free Grammars (PCFGs)
Slide 43: Target Function Representation (1/2)
- Let's first assume that the following features/attributes are useful to decide the value of a checker board:
- bp(b): number of black pieces on board b
- rp(b): number of red pieces on board b
- bk(b): number of black kings on board b
- rk(b): number of red kings on board b
- bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
- rt(b): number of red pieces threatened
- Why these features? These are meaningful features for a board, but one could add more features or combinations of features as well.
Slide 44: Target Function Representation (2/2)
- Now assume that there is a function V that gives the value of a particular board state given these features; we want to find that function.
- Since we do not know V, we can only estimate it, i.e. find V̂ (V-hat), which is an approximation to V.
- To do this, we need to first decide on a representation, e.g. a linear combination of weighted attributes: V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b) (other numerical functions are also possible).
- Then, we need to learn this function (i.e. the weights wi); a sketch of the representation follows below.
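A minimal Python sketch of this linear representation, assuming a board has already been reduced to its six feature values (bp, rp, bk, rk, bt, rt). The weight values shown are arbitrary placeholders, not learned values.

```python
def v_hat(feats, w):
    """Linear evaluation: V-hat(b) = w0 + sum_i wi * fi(b).
    feats: the 6 board features (bp, rp, bk, rk, bt, rt); w = [w0, w1, ..., w6]."""
    return w[0] + sum(wi * fi for wi, fi in zip(w[1:], feats))

b = (3, 0, 1, 0, 0, 0)                        # bp=3, rp=0, bk=1, rk=0, bt=0, rt=0
print(v_hat(b, [0, 5, -5, 10, -10, -3, 3]))   # 25: higher = better for black
```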
Slide 45: Learning Algorithm
- Use training values for the target function to induce a hypothesized definition that fits these examples and hopefully generalizes to unseen examples.
- We can minimize some measure of error (loss function), such as the mean squared error over the training examples: E = Σ (Vtrain(b) - V̂(b))², summed over the training pairs ⟨b, Vtrain(b)⟩.
- Notation: V' and V̂ (V-hat) are used to mean the same thing, the approximation to the target function V learned by the system; Vtrain(b) is the training value for board b.
Slide 46: Various Search Algorithms
- Gradient descent
- Perceptron
- Backpropagation
- Divide and Conquer
- Decision tree induction
- Rule learning
- Evolutionary Computation
- Genetic Algorithms (GAs)
- Dynamic Programming
- HMM Learning
Slide 47: Learning V̂ (V' and V-hat are the same thing!)
- Start with a rough approximation V̂ of V
- Assign some initial (possibly random) weights wi
- There are many learning algorithms to learn V̂, i.e. to learn the weights wi, since the attributes are the inputs
- We will learn the function V̂ using our training experience
- The training experience is obtained by self-play
- Modify the weights so that V̂(b) is closer to Vtrain(b) for the given training samples
- One possible learning algorithm to adjust the weights is the Least Mean Squares (LMS) algorithm
Slide 48: Learning
- Least Mean Squares (LMS) Algorithm: A gradient descent algorithm that incrementally updates the weights of a linear function in an attempt to minimize the mean squared error
- Until the weights converge:
- For each training example b, do:
- 1) Compute the error: error(b) = Vtrain(b) - V̂(b)
- 2) For each board feature fi, update its weight wi: wi ← wi + c · fi · error(b), for some small constant (learning rate) c
- What is the intuition behind this update rule? (A sketch of one LMS pass follows below.)
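Here is a short Python sketch of one LMS pass over the training examples, reusing the `v_hat` representation sketched earlier. It treats w[0] as an intercept with a constant feature f0 = 1, an assumption on my part; the learning rate value is arbitrary.

```python
def lms_update(examples, w, c=0.1):
    """One LMS pass. examples: list of (feats, v_train) pairs, where feats
    are the 6 board features and v_train is the training value for that board.
    Updates the weight list w in place and returns it."""
    for feats, v_train in examples:
        error = v_train - v_hat(feats, w)        # 1) compute error(b)
        w[0] += c * error                         # intercept: constant feature f0 = 1
        for i, fi in enumerate(feats, start=1):   # 2) wi <- wi + c * fi * error(b)
            w[i] += c * fi * error
    return w
```

Intuition: each weight moves in proportion to both the error and its own feature value, so features that contributed more to a wrong score are corrected more.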
Slide 49: Gradient Descent
Slide 52: LMS as Gradient Descent
Slide 53: Obtaining Training Values
- Some direct supervision may be available for the target function:
- < <bp=?, rp=0, bk=1, rk=0, bt=0, rt=0>, +100 > (a win for black)
- With indirect feedback for intermediate board states, training values are also only estimates
Slide 54: Estimating Training Values
- How to learn to play a game:
- In game playing, the standard approach is to apply the minimax algorithm, where the player picks the moves that maximize his return (highest board value), assuming a perfect opponent.
- Since the game tree is exponential in size, normally the search is cut at some point and the current best option is selected (see the sketch below).
- For very small games, we could have the computer play against itself and assign a value to each board state that is considered in the game tree, as follows:
- A final board's value is known (+100, -100, or 0) by definition.
- An intermediate board b is assigned a value V(b) equal to V(b'), where b' is the highest-scoring final board position that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally as well).
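For reference, a depth-limited minimax sketch in Python, using the learned V̂ as the static evaluator at the cutoff. `legal_moves` and `apply_move` are assumed to be given by the game; this is an illustration, not the course's prescribed implementation.

```python
def minimax(board, depth, maximizing, v, legal_moves, apply_move):
    """Depth-limited minimax value of `board`. `v` is the static evaluator
    (e.g. the learned V-hat) applied when the search is cut off."""
    moves = legal_moves(board)
    if depth == 0 or not moves:          # cutoff or terminal position
        return v(board)
    children = (minimax(apply_move(board, m), depth - 1,
                        not maximizing, v, legal_moves, apply_move)
                for m in moves)
    return max(children) if maximizing else min(children)
```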
Slide 55: Estimating Training Values
- Computing V(b) in this way is intractable, since it involves searching the complete exponential game tree.
- Therefore, this definition is said to be non-operational.
- An operational definition is one that can be computed in reasonable (polynomial) time.
- We need to learn an operational approximation to the ideal evaluation function.
Slide 56: Estimating Training Values
- Estimate training values for intermediate (non-terminal) board positions by the estimated value of their successor in an actual game trace (one path in the game tree):
- Vtrain(b) ← V̂(successor(b))
- where successor(b) is the next board position where it is the program's move in actual play (see the sketch below).
- Values towards the end of the game are initially more accurate, and continued training slowly "backs up" accurate values to earlier board positions.
- Temporal difference learning deals with the credit assignment problem with exponentially decaying credit/penalty for boards farther away in time.
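A small Python sketch of this rule, Vtrain(b) ← V̂(successor(b)): given one game trace of boards at the program's successive turns (each already reduced to its feature vector, as assumed earlier), pair every intermediate board with the current estimate of its successor, and the terminal board with its known value.

```python
def training_values(trace, v_hat, w, final_value):
    """trace: feature vectors of the boards at the program's successive turns,
    the last one terminal. final_value is known by definition (+100, -100, 0).
    Returns (board, v_train) pairs, e.g. to feed into lms_update."""
    pairs = [(b, v_hat(succ, w))                 # Vtrain(b) <- V-hat(successor(b))
             for b, succ in zip(trace, trace[1:])]
    pairs.append((trace[-1], final_value))       # terminal board: exact value
    return pairs
```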
Slide 57: Lessons Learned about Learning
- Learning can be viewed as using direct or indirect experience to approximate a chosen target function.
- Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data.
- Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.
Slide 58: Issues in Machine Learning
- Training Experience
- What can the training experience be (labelled samples, self-play, ...)?
- Target Function
- What should we aim to learn?
- What should the representation of the target function be (features, hypothesis class, ...)?
- Learning
- What learning algorithms exist for learning general target functions from specific training examples?
- Which algorithms can approximate functions well (and when)?
- How does noisy data influence accuracy?
- ...
- Training Data
- How much training data is sufficient?
- How does the number of training examples influence accuracy?
- What is the best strategy for choosing a useful next training experience? How does it affect complexity?
- Prior Knowledge/Domain Knowledge