Title: CS 512 Machine Learning
Slide 1: CS 512 Machine Learning
- Berrin Yanikoglu
- Slides are expanded from the Machine Learning (Mitchell) book slides
- Some of the extra slides thanks to T. Jaakkola (MIT) and others
Slide 2: CS512 - Machine Learning
- Description: This is an introductory-level course on machine learning. Topics covered will include theoretical aspects of learning and main approaches to pattern recognition (Bayesian approaches, decision trees, neural networks, ...). The emphasis will be on what is possible rather than on techniques (which is what is covered in the EE566 Pattern Recognition course). Some of the content will be tailored according to student composition.
- This course is especially intended for students working in the area of pattern recognition or related fields, to deepen their understanding of machine learning/pattern recognition topics. Students who have already taken EE566 Pattern Recognition may find significant material repeated in this course (see the syllabus); while there will be significant overlap, this course will cover some topics that were not covered sufficiently in EE566, particularly theoretical aspects, neural network approaches, support vector machines, and different learning paradigms.
- Prerequisites: None. Undergraduate-level probability and linear algebra are helpful.
- Matlab or other toolboxes will be used for homework assignments.
- Book: Machine Learning by T. Mitchell (ML).
- Supplementary book: Introduction to Machine Learning, by Ethem Alpaydin (Ethem).
- We will follow and cover the ML book until the end of Chapter 9.
- We will also read important research articles on the topic, and some student presentations/discussions are expected.
- Course schedule: Wed 10:40-12:30 in FENS L058 (note change of time); Thu 11:40-12:30 in FENS L065
- Instructor: Berrin Yanikoglu (berrin_at_sabanciuniv.edu, FENS 2056)
- Office hours: Walk-in.
Slide 3: Syllabus
- Week 1 (28 September - 2 October): ML1 - Introduction to ML; ML2 - Concept Learning
- Week 2 (5-9 October): ML2 - Concept Learning
- Week 3 (12-16 October): ML3 - Decision Trees
- Week 4 (19-23 October): ML4 - ANN (MLP)
- Week 5 (26-30 October; 29th is a holiday): ML4 - ANN (MLP)
- Week 6 (2-6 November): ML6 - Bayesian Learning (Bayes formula, naive Bayes)
- Week 7 (9-13 November): ML6 - Bayesian Learning (Bayes formula, naive Bayes) continued; Ethem 4-5 - Intro to multivariate methods
- Week 8 (16-20 November): Midterm 1 (1.5 hrs - no class); Ethem Chp 5 - Multivariate methods continued
- Week 9 (23-27 November; 27th is a holiday): Ethem 4 - Parameter Estimation (ML, MAP and Bayes estimates); slides on intrinsic error, bias-variance
- Week 10 (30 November - 4 December): ML8, Ethem 8 - Non-parametric density estimation (Parzen windows, KNN); RBF networks
- Week 11 (7-11 December): Ethem 7 - Linear Discriminant Analysis (intro) and Support Vector Machines
- Week 12 (14-18 December): Midterm 2 (1.5 hrs - no class); ML5 - Evaluating Hypotheses
- Week 13 (21-25 December): ML7 - Computational Learning Theory (PAC learning, VC dimension)
- Week 14 (28 December - 1 January; Jan 1st is a holiday): Ethem 14 - Assessing and Comparing Classification Algorithms; Ethem 15 - Classifier Combination
Slide 4: What is learning?
- "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time." (Herbert Simon)
- "Learning is any process by which a system improves performance from experience." (Herbert Simon)
- "Learning is constructing or modifying representations of what is being experienced." (Ryszard Michalski)
- "Learning is making useful changes in our minds." (Marvin Minsky)
Slide 5: Machine Learning - Example
- The mind-reading game
- Written by Y. Freund and R. Schapire
- Repeat 200 times:
- Computer guesses whether you'll type 0 or 1
- You type 0 or 1
- The computer is right much more than half the time. How? (See the sketch below.)
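The trick is that human 0/1 sequences are far from random: recent keystrokes predict the next one. Below is a minimal Python sketch of one way such a predictor can work, a simple context model that counts what followed each recent k-bit history. This is only an illustration; the actual Freund and Schapire applet uses a more elaborate scheme, and all names here (`play`, `get_user_bit`) are hypothetical.

```python
from collections import defaultdict
import random

def play(get_user_bit, rounds=200, k=4):
    """Guess the user's next bit from counts of what followed
    each recent k-bit history (a simple context model)."""
    counts = defaultdict(lambda: [0, 0])   # history -> [count of 0s, count of 1s]
    history = ()
    correct = 0
    for _ in range(rounds):
        c0, c1 = counts[history]
        # Guess the majority continuation; break ties randomly.
        guess = random.randint(0, 1) if c0 == c1 else int(c1 > c0)
        bit = get_user_bit()               # the user types 0 or 1
        correct += (guess == bit)
        counts[history][bit] += 1          # learn from this round
        history = (history + (bit,))[-k:]  # slide the context window
    return correct / rounds                # typically well above 0.5 for humans
```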
Slide 6: Machine Learning - Example
- One of my favorite AI/Machine Learning sites:
- http://www.20q.net/
Slide 7: Why learn?
- Build software agents that can adapt to their users, to other software agents, or to changing environments
- Fill in skeletal or incomplete specifications about a domain
- Large, complex AI systems cannot be completely derived by hand and require dynamic updating to incorporate new information
- Learning new characteristics expands the domain of expertise and lessens the brittleness of the system
- Discover new things or structure that were previously unknown to humans
- Examples: data mining, scientific discovery
- Understand and improve the efficiency of human learning
Slide 8: Why Study Machine Learning? Building Better Engineering Systems
- Develop systems that can automatically adapt and customize themselves to individual users
- Personalized news or mail filter
- Personalized tutoring
- Develop systems that are too difficult/expensive to construct manually because they require specific detailed skills or knowledge tuned to a specific task (the knowledge engineering bottleneck)
- Discover new knowledge from large databases (data mining)
- Market basket analysis (e.g. diapers and beer)
- Medical text mining (e.g. migraines to calcium channel blockers to magnesium)
Slide 9: Why Study Machine Learning? Cognitive Science
- Computational studies of learning may help us understand learning in humans and other biological organisms
- Hebbian neural learning
- "Neurons that fire together, wire together."
Slide 10: Related Disciplines
- Artificial Intelligence
- Pattern Recognition
- Data Mining
- Probability and Statistics
- Information theory
- Psychology (developmental, cognitive)
- Neurobiology
- Linguistics
- Philosophy
Slide 11: History of Machine Learning
- 1950s
- Samuel's checker player
- Selfridge's Pandemonium
- 1960s
- Neural networks: Perceptron
- Pattern recognition
- Learning-in-the-limit theory
- Minsky and Papert prove limitations of the Perceptron
- 1970s
- Symbolic concept induction
- Winston's arch learner
- Expert systems and the knowledge acquisition bottleneck
- Quinlan's ID3
- Michalski's AQ and soybean diagnosis
- Scientific discovery with BACON
- Mathematical discovery with AM
Slide 12: History of Machine Learning (cont.)
- 1980s
- Advanced decision tree and rule learning
- Explanation-Based Learning (EBL)
- Learning, planning and problem solving
- Utility theory
- Analogy
- Cognitive architectures
- Resurgence of neural networks (connectionism, backpropagation)
- Valiant's PAC Learning Theory
- 1990s
- Data mining
- Reinforcement learning (RL)
- Inductive Logic Programming (ILP)
- Ensembles: Bagging, Boosting, and Stacking
- Bayes Net learning
Slide 13: History of Machine Learning (cont.)
- 2000s
- Kernel methods
- Support vector machines
- Graphical models
- Statistical relational learning
- Transfer learning
- Applications
- Adaptive software agents and web applications
- Learning in robotics and vision
- E-mail management (spam detection)
Slide 14: Major paradigms of machine learning
- Rote learning: Learning by memorization
- Employed by the first machine learning systems, in the 1950s
- Samuel's Checkers program
- Supervised learning: Use specific examples to reach general conclusions or extract general rules
- Classification (concept learning)
- Regression
- Unsupervised learning (clustering): Unsupervised identification of natural groups in data
- Reinforcement learning: Feedback (positive or negative reward) given at the end of a sequence of steps
- Analogy: Determine correspondence between two different representations
- Discovery: Unsupervised; a specific goal is not given
- Batch vs. online learning: All training examples are provided at once, or one at a time (with error estimate and training after each example)
Slide 15: Rote Learning is Limited
- Memorize I/O pairs and perform exact matching with new inputs
- If a computer has not seen the precise case before, it cannot apply its experience
- We want computers to generalize from prior experience
- Generalization is the most important factor in learning
Slide 16: The inductive learning problem
- Extrapolate from a given set of examples to make accurate predictions about future examples
- Supervised versus unsupervised learning
- Learn an unknown function f: X → Y, where X is an input example and Y is the desired output
- Supervised learning implies we are given a training set of (X, Y) pairs by a teacher
- Unsupervised learning means we are only given the Xs
- Semi-supervised learning: mostly unlabelled data
- Reinforcement learning: delayed feedback
Slide 17: Types of supervised learning
- Classification
- We are given the labels of the training objects: (x1, x2, y=T/O)
- We are interested in classifying future objects (x1, x2) with the correct label, i.e. find y for a given (x1, x2)
- [Figure: training objects plotted in a 2-D feature space with axes x1 = size and x2 = color]
- Concept Learning
- We are given positive and negative samples for the concept we want to learn (e.g. Tangerine): (x1, x2, y=+/-)
- We are interested in classifying future objects as a member of the class (a positive example of the concept) or not, i.e. answer +/- for a given (x1, x2)
Slide 18: Types of Supervised Learning
- Regression
- The target function is continuous rather than class membership
Slide 19: Classification
- Assign an object/event to one of a given finite set of categories
- Medical diagnosis
- Credit card applications or transactions
- Fraud detection in e-commerce
- Spam filtering in email
- Recommended books, movies, music
- Financial investments
- Spoken words
- Handwritten letters
Slide 20: Concept learning
- Given a training set of positive and negative examples of a concept
- Construct a description that will accurately classify whether future examples are positive or negative
- Examples: points in a multi-dimensional feature space
- Concept: a function that labels every point in feature space (as +, -, and possibly ?)
Slide 21: Example
- [Figure: a set of positive examples, a set of negative examples, and a query symbol with the caption "How does this symbol classify?"]
- Concept
- Solid red circle in a (regular?) polygon
- What about?
- Figures on left side of page
- Figures drawn before 5pm 2/2/89 <etc.>
Slide 22: Inductive learning framework - Feature Space
- Raw input data from sensors are typically preprocessed to obtain a feature vector, X, that adequately describes all of the relevant features for classifying examples
- Each X is a list of (attribute, value) pairs
- X = [Color=Orange, Shape=Round, Weight=200g]
- Each attribute can be discrete or continuous
- Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes (see the sketch below)
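As a concrete illustration, here is a small Python sketch of turning one such attribute-value example into a point in a 3-dimensional feature space. The attribute names and encodings are made up for this example.

```python
# One example as (attribute, value) pairs; discrete attribute vocabularies
# are hypothetical and chosen just for this illustration.
example = {"Color": "Orange", "Shape": "Round", "Weight": 200}

COLORS = ["Red", "Orange", "Green"]
SHAPES = ["Round", "Square"]

def to_point(x):
    """Map discrete attributes to integer codes, keep continuous ones as-is."""
    return (COLORS.index(x["Color"]), SHAPES.index(x["Shape"]), x["Weight"])

print(to_point(example))  # (1, 0, 200): one point in a 3-D feature space
```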
Slide 23: Feature Space
- [Figure: a 3-D feature space with axes Size (e.g. Big), Color (e.g. Gray), and Weight (e.g. 2500); each example is a point in this space]
- A concept is then a (possibly disjoint) volume in this space.
Slide 24: Inductive learning as search
- Instance space I
- Each instance i ∈ I is a feature vector: i = (v1, v2, ..., vk) ∈ I = V1 x V2 x ... x Vk
- Class C gives the instance's class (to be predicted)
- Model space M defines the possible hypotheses: M: I → C, M = {m1, ..., mn} (possibly infinite)
- Training data can be used to direct the search for a good (consistent, complete, simple) hypothesis in the model space, as sketched below
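To make the search view concrete, the following toy Python sketch (all data and thresholds are hypothetical) enumerates a small finite model space of threshold rules over one feature and keeps only the hypotheses consistent with the training data:

```python
# Training data: (feature value v1, class) pairs -- made up for illustration.
train = [(1.0, 0), (2.0, 0), (3.5, 1), (4.0, 1)]

# Model space M: each model maps an instance to a class via a threshold t.
model_space = [lambda v, t=t: int(v > t) for t in (0.5, 1.5, 2.5, 3.0, 3.75)]

# The training data directs the search: keep hypotheses consistent with
# every labeled instance.
consistent = [m for m in model_space
              if all(m(v) == c for v, c in train)]
print(len(consistent), "consistent hypotheses found")  # here: 2 (t=2.5 and t=3.0)
```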
Slide 25: Learning - Key Steps
- Data and assumptions
- What data is available for the learning task?
- What can we assume about the problem?
- Representation
- How should we represent the examples to be classified?
- Method and estimation
- What are the possible hypotheses?
- How do we adjust our predictions based on the feedback?
- Evaluation
- How well are we doing?
Slide 27: Evaluation of Learning Systems
- Experimental
- Conduct controlled cross-validation experiments to compare various methods on a variety of benchmark datasets (see the sketch after this list)
- Gather data on their performance, e.g. test accuracy, training time, testing time
- Analyze differences for statistical significance
- Theoretical
- Analyze algorithms mathematically and prove theorems about their:
- Computational complexity
- Ability to fit training data
- Sample complexity (number of training examples needed to learn an accurate function)
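As a sketch of the experimental methodology, the hypothetical Python helper below computes k-fold cross-validation scores for one method; running it for two methods with the same seed yields paired, fold-by-fold scores that can then be tested for statistical significance. `fit` and `predict_acc` stand in for whatever learner is being compared.

```python
import random

def k_fold_scores(fit, predict_acc, data, k=5, seed=0):
    """k-fold cross-validation accuracies for one method.
    fit(train) -> model; predict_acc(model, test) -> accuracy in [0, 1]."""
    data = data[:]
    random.Random(seed).shuffle(data)        # same seed -> same folds per method
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(predict_acc(fit(train), test))
    return scores  # compare two methods fold-by-fold, then test significance
```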
Slide 28: Measuring Performance
- Performance of the learner can be measured in one of the following ways, as suitable for the application:
- Classification accuracy
- Solution quality (length, efficiency)
- Speed of performance
Slide 30: Inductive Learning
- "Learning is a characteristic of adaptive systems which are capable of improving their performance on a problem as a function of previous experience, for example, in solving similar problems" (Simon, 1983).
- "The inductive learning process is a heuristic search through a space of symbolic descriptions, generated by the application of various inference rules to the initial observational statements" (Michalski, 1983).
Slide 31: Defining the Learning Task
- Improve on task, T,
- with respect to performance metric, P,
- based on experience, E.
Slide 32: Defining the Learning Task
- T: Playing checkers
- P: Percentage of games won against an arbitrary opponent
- E: Playing practice games against itself
- T: Recognizing hand-written words
- P: Percentage of words correctly classified
- E: Database of human-labeled images of handwritten words
- T: Driving on four-lane highways using vision sensors
- P: Average distance traveled before a human-judged error
- E: A sequence of images and steering commands recorded while observing a human driver
- T: Categorize email messages as spam or legitimate
- P: Percentage of email messages correctly classified
- E: Database of emails, some with human-given labels
Slide 33: Sample Learning Problem
- Learn to play checkers from self-play, using an approach analogous to that used in the first machine learning system, developed by Arthur Samuel at IBM in 1959.
Slide 34: Designing a Learning System
- Choose the training experience
- Choose exactly what is to be learned, i.e. the target function
- Choose the representation for the target function
- Selecting the features and the hypothesis class
- Choose a learning algorithm to infer the target function from the experience
Slide 35: Training Experience
- Direct experience: Sample input and output pairs are given for a useful target function
- E.g. checker boards labeled with the correct move, extracted from records of expert play
- Indirect experience: The given feedback is not direct I/O pairs for a useful target function
- E.g. potentially arbitrary sequences of game moves and their final game results
- Credit/Blame Assignment Problem: How to assign credit/blame to individual moves given only indirect feedback?
Slide 36: Source of Training Data
- Random examples provided outside of the learner's control
- Negative examples available, or only positive?
- Good training examples selected by a benevolent teacher
- "Near miss" examples
- Learner can query an oracle about the class of an unlabeled example in the environment
- Learner can construct an arbitrary example and query an oracle for its label
- Learner can design and run experiments directly in the environment without any human guidance
- The last three cases are forms of active learning
Slide 37: Training vs. Test Distribution
- Generally assume that the training and test examples are independently drawn from the same overall distribution of data
- IID: Independently and identically distributed
- If examples are not independent, collective classification is required
- If the test distribution is different, transfer learning is required
- Almost all ML/PR systems assume IID training data and identical train/test distributions
Slide 38: Choosing a Target Function
- What function is to be learned, and how will it be used by the performance system?
- For 2-class classification, we may want to learn a separating boundary between classes
- For checkers, assume we are given a function for generating the legal moves for a given board position. We want to decide the best move.
- Could learn a function that returns the best move for a given board: ChooseMove(board, legal-moves) → best-move
- Or could learn an evaluation function, V(board) → R, that gives each board position a score for how favorable it is
- V can then be used to pick a move by applying each legal move, scoring the resulting board position, and choosing the move that results in the highest-scoring board position (see the sketch below)
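A minimal Python sketch of this use of V: generate the legal moves, score each resulting board with the learned evaluation function, and take the argmax. `generate_legal_moves` and `apply_move` are assumed to be supplied by the game and are not defined here.

```python
def choose_move(board, v_hat, generate_legal_moves, apply_move):
    """Pick the legal move whose resulting board position scores highest
    under the learned evaluation function v_hat."""
    return max(generate_legal_moves(board),
               key=lambda move: v_hat(apply_move(board, move)))
```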
Slide 39: [Figure: a tree of board positions with evaluation values such as +100 and -100 attached to the boards]
Slide 40: Deciding on the Target Function
- For checkers, we need to further qualify the function we want to learn. For instance, it would be useful to have a function V that assigns the following values to a board b:
- If b is a final winning board, then V(b) = 100
- If b is a final losing board, then V(b) = -100
- If b is a final draw board, then V(b) = 0
- If b is a non-terminal board, then V(b) is in [-100, 100] according to the goodness of the board
- This is what we want to learn.
Slide 41: Representing the Target Function
- The target function can be represented in many ways:
- lookup table
- symbolic rules
- numerical function (of varying complexity)
- neural network
- ...
- There is a trade-off between the expressiveness of a representation and the ease of learning.
- The more expressive a representation, the better it will be at approximating an arbitrary function; however, the more examples will be needed to learn an accurate function.
Slide 42: Various Target Function Representations
- Numerical functions
- Linear regression
- Neural networks
- Support vector machines
- Symbolic functions
- Decision trees
- Rules in propositional logic
- Rules in first-order predicate logic
- Instance-based functions
- Nearest-neighbor
- Case-based
- Probabilistic Graphical Models
- Naïve Bayes
- Bayesian networks
- Hidden-Markov Models (HMMs)
- Probabilistic Context Free Grammars (PCFGs)
Slide 43: Target Function Representation (1/2)
- Let's first assume that the following features/attributes are useful to decide the value of a checker board:
- bp(b): number of black pieces on board b
- rp(b): number of red pieces on board b
- bk(b): number of black kings on board b
- rk(b): number of red kings on board b
- bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
- rt(b): number of red pieces threatened
- Why these features? These are meaningful features for a board, but one could add more features or combinations of features as well.
Slide 44: Target Function Representation (2/2)
- Now assume that there is a function V that gives the value of a particular board state given these features; we want to find that function.
- Since we do not know V, we can only estimate it, i.e. find V̂ (V-hat), which is an approximation to V.
- To do this, we need to first decide on a representation, e.g. a linear combination of weighted attributes: V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b) (other numerical functions are also possible).
- Then, we need to learn this function (i.e. the weights wi); a sketch of the representation follows below.
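A minimal Python sketch of this linear representation, assuming a board has already been reduced to its six feature values (bp, rp, bk, rk, bt, rt). The weight values shown are arbitrary placeholders, not learned values.

```python
def v_hat(feats, w):
    """Linear evaluation: V-hat(b) = w0 + sum_i wi * fi(b).
    feats: the 6 board features (bp, rp, bk, rk, bt, rt); w = [w0, w1, ..., w6]."""
    return w[0] + sum(wi * fi for wi, fi in zip(w[1:], feats))

b = (3, 0, 1, 0, 0, 0)                        # bp=3, rp=0, bk=1, rk=0, bt=0, rt=0
print(v_hat(b, [0, 5, -5, 10, -10, -3, 3]))   # 25: higher = better for black
```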
Slide 45: Learning Algorithm
- Use training values for the target function to induce a hypothesized definition that fits these examples and hopefully generalizes to unseen examples.
- We can minimize some measure of error (loss function), such as the mean squared error over the training examples: E = Σ (Vtrain(b) - V̂(b))², summed over the training pairs ⟨b, Vtrain(b)⟩.
- Notation: V' and V̂ (V-hat) are used to mean the same thing, the approximation to the target function V learned by the system; Vtrain(b) is the training value for board b.
Slide 46: Various Search Algorithms
- Gradient descent
- Perceptron
- Backpropagation
- Divide and Conquer
- Decision tree induction
- Rule learning
- Evolutionary Computation
- Genetic Algorithms (GAs)
- Dynamic Programming
- HMM Learning
Slide 47: Learning V̂ (V' and V-hat are the same thing!)
- Start with a rough approximation V̂ of V
- Assign some initial (possibly random) weights wi
- There are many learning algorithms to learn V̂, i.e. to learn the weights wi, since the attributes are the inputs
- We will learn the function V̂ using our training experience
- The training experience is obtained by self-play
- Modify the weights so that V̂(b) is closer to Vtrain(b) for the given training samples
- One possible learning algorithm to adjust the weights is the Least Mean Squares (LMS) algorithm
Slide 48: Learning
- Least Mean Squares (LMS) Algorithm: A gradient descent algorithm that incrementally updates the weights of a linear function in an attempt to minimize the mean squared error
- Until the weights converge:
- For each training example b, do:
- 1) Compute the error: error(b) = Vtrain(b) - V̂(b)
- 2) For each board feature fi, update its weight wi: wi ← wi + c · fi · error(b), for some small constant (learning rate) c
- What is the intuition behind this update rule? (A sketch of one LMS pass follows below.)
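Here is a short Python sketch of one LMS pass over the training examples, reusing the `v_hat` representation sketched earlier. It treats w[0] as an intercept with a constant feature f0 = 1, an assumption on my part; the learning rate value is arbitrary.

```python
def lms_update(examples, w, c=0.1):
    """One LMS pass. examples: list of (feats, v_train) pairs, where feats
    are the 6 board features and v_train is the training value for that board.
    Updates the weight list w in place and returns it."""
    for feats, v_train in examples:
        error = v_train - v_hat(feats, w)        # 1) compute error(b)
        w[0] += c * error                         # intercept: constant feature f0 = 1
        for i, fi in enumerate(feats, start=1):   # 2) wi <- wi + c * fi * error(b)
            w[i] += c * fi * error
    return w
```

Intuition: each weight moves in proportion to both the error and its own feature value, so features that contributed more to a wrong score are corrected more.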
Slide 49: Gradient Descent
Slide 52: LMS as Gradient Descent
Slide 53: Obtaining Training Values
- Some direct supervision may be available for the target function:
- < <bp=?, rp=0, bk=1, rk=0, bt=0, rt=0>, +100 > (a win for black)
- With indirect feedback for intermediate board states, training values are also only estimates
Slide 54: Estimating Training Values
- How to learn to play a game:
- In game playing, the standard approach is to apply the minimax algorithm, where the player picks the moves that maximize his return (highest board value), assuming a perfect opponent.
- Since the game tree is exponential in size, normally the search is cut at some point and the current best option is selected (see the sketch below).
- For very small games, we could have the computer play against itself and assign a value to each board state that is considered in the game tree, as follows:
- A final board's value is known (+100, -100, or 0) by definition.
- An intermediate board b is assigned a value V(b) equal to V(b'), where b' is the highest-scoring final board position that can be achieved starting from b and playing optimally until the end of the game (assuming the opponent plays optimally as well).
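For reference, a depth-limited minimax sketch in Python, using the learned V̂ as the static evaluator at the cutoff. `legal_moves` and `apply_move` are assumed to be given by the game; this is an illustration, not the course's prescribed implementation.

```python
def minimax(board, depth, maximizing, v, legal_moves, apply_move):
    """Depth-limited minimax value of `board`. `v` is the static evaluator
    (e.g. the learned V-hat) applied when the search is cut off."""
    moves = legal_moves(board)
    if depth == 0 or not moves:          # cutoff or terminal position
        return v(board)
    children = (minimax(apply_move(board, m), depth - 1,
                        not maximizing, v, legal_moves, apply_move)
                for m in moves)
    return max(children) if maximizing else min(children)
```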
Slide 55: Estimating Training Values
- Computing V(b) in this way is intractable, since it involves searching the complete exponential game tree.
- Therefore, this definition is said to be non-operational.
- An operational definition is one that can be computed in reasonable (polynomial) time.
- We need to learn an operational approximation to the ideal evaluation function.
Slide 56: Estimating Training Values
- Estimate training values for intermediate (non-terminal) board positions by the estimated value of their successor in an actual game trace (one path in the game tree):
- Vtrain(b) ← V̂(successor(b))
- where successor(b) is the next board position where it is the program's move in actual play (see the sketch below).
- Values towards the end of the game are initially more accurate, and continued training slowly "backs up" accurate values to earlier board positions.
- Temporal difference learning deals with the credit assignment problem with exponentially decaying credit/penalty for boards farther away in time.
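A small Python sketch of this rule, Vtrain(b) ← V̂(successor(b)): given one game trace of boards at the program's successive turns (each already reduced to its feature vector, as assumed earlier), pair every intermediate board with the current estimate of its successor, and the terminal board with its known value.

```python
def training_values(trace, v_hat, w, final_value):
    """trace: feature vectors of the boards at the program's successive turns,
    the last one terminal. final_value is known by definition (+100, -100, 0).
    Returns (board, v_train) pairs, e.g. to feed into lms_update."""
    pairs = [(b, v_hat(succ, w))                 # Vtrain(b) <- V-hat(successor(b))
             for b, succ in zip(trace, trace[1:])]
    pairs.append((trace[-1], final_value))       # terminal board: exact value
    return pairs
```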
Slide 57: Lessons Learned about Learning
- Learning can be viewed as using direct or indirect experience to approximate a chosen target function.
- Function approximation can be viewed as a search through a space of hypotheses (representations of functions) for one that best fits a set of training data.
- Different learning methods assume different hypothesis spaces (representation languages) and/or employ different search techniques.
Slide 58: Issues in Machine Learning
- Training Experience
- What can the training experience be (labelled samples, self-play, ...)?
- Target Function
- What should we aim to learn?
- What should the representation of the target function be (features, hypothesis class, ...)?
- Learning
- What learning algorithms exist for learning general target functions from specific training examples?
- Which algorithms can approximate functions well (and when)?
- How does noisy data influence accuracy?
- ...
- Training Data
- How much training data is sufficient?
- How does the number of training examples influence accuracy?
- What is the best strategy for choosing a useful next training experience? How does it affect complexity?
- Prior Knowledge/Domain Knowledge