CS 512 Machine Learning - Presentation Transcript
1
CS 512 Machine Learning
  • Berrin Yanikoglu
  • Slides are expanded from the slides
    accompanying the Machine Learning book by
    T. Mitchell
  • Some of the extra slides are thanks to
    T. Jaakkola (MIT) and others

2
CS512-Machine Learning
  • Description: This is an introductory-level
    course on machine learning. Topics covered will
    include theoretical aspects of learning and main
    approaches to pattern recognition (Bayesian
    approaches, decision trees, NNs, ...). The
    emphasis will be on what is possible rather than
    on techniques (which is what is covered in the
    EE566-Pattern Recognition course). Some of the
    content will be tailored according to student
    composition.
  • This course is especially intended for students
    working in the area of pattern recognition or
    related fields, to deepen their understanding of
    machine learning/pattern recognition topics.
    Students who have already taken EE566-Pattern
    Recognition may find significant material
    repeated in this course (see the syllabus);
    while there will be significant overlap, this
    course will cover some topics that were not
    covered sufficiently in EE566, particularly
    theoretical aspects, neural network approaches,
    support vector machines, and different learning
    paradigms.
  • Prereq: None. Undergraduate-level Probability
    and Linear Algebra helpful.
  • Matlab or other toolboxes will be used
    for homework assignments.
  • Book: Machine Learning by T. Mitchell (ML).
  • Supplementary book: Introduction to Machine
    Learning (Ethem) by Ethem Alpaydin.
  • We will follow and cover the ML book until the
    end of Chapter 9.
  • We will also read important research articles
    on the topic, and some student
    presentations/discussions are expected.
  • Course Schedule: Wed 10:40-12:30, FENS L058
    (note change of time); Thu 11:40-12:30, FENS
    L065
  • Instructor: Berrin Yanikoglu
    (berrin@sabanciuniv.edu), FENS 2056
  • Office Hours: Walk-in.

3
Syllabus
  • Week 1 (28 September - 2 October): ML1 -
    Introduction to ML; ML2 - Concept Learning
  • Week 2 (5-9 October): ML2 - Concept Learning
  • Week 3 (12-16 October): ML3 - Decision Trees
  • Week 4 (19-23 October): ML4 - ANN (MLP)
  • Week 5 (26-30 October; 29th is a holiday):
    ML4 - ANN (MLP)
  • Week 6 (2-6 November): ML6 - Bayesian Learning
    (Bayes formula, naive Bayes)
  • Week 7 (9-13 November): ML6 - Bayesian Learning
    (Bayes formula, naive Bayes) continued;
    Ethem 4-5 - Intro to multivariate methods
  • Week 8 (16-20 November): Midterm 1 (1.5 hrs -
    no class); Ethem Chp 5 - Multivariate methods
    continued
  • Week 9 (23-27 November; 27th is a holiday):
    Ethem 4 - Parameter Estimation (ML, MAP and
    Bayes estimates); Slides - Intrinsic error,
    Bias-Variance
  • Week 10 (30 November - 4 December): ML8,
    Ethem 8 - Non-parametric density estimation
    (Parzen windows, KNN); RBF Networks
  • Week 11 (7-11 December): Ethem 7 - Linear
    Discriminant Analysis (intro) and Support
    Vector Machines
  • Week 12 (14-18 December): Midterm 2 (1.5 hrs -
    no class); ML5 - Evaluating Hypotheses
  • Week 13 (21-25 December): ML7 - Computational
    Learning Theory (PAC learning, VC dimension)
  • Week 14 (28 December - 1 January; Jan 1st is a
    holiday): Ethem 14 - Assessing and Comparing
    Classification Algorithms; Ethem 15 -
    Classifier Combination

4
What is learning?
  • "Learning denotes changes in a system that ...
    enable a system to do the same task more
    efficiently the next time." - Herbert Simon
  • "Learning is any process by which a system
    improves performance from experience." -
    Herbert Simon
  • "Learning is constructing or modifying
    representations of what is being experienced."
    - Ryszard Michalski
  • "Learning is making useful changes in our
    minds." - Marvin Minsky

5
Machine Learning - Example
  • The mind-reading game
  • written by Y. Freund and R. Schapire
  • Repeat 200 times
  • Computer guesses whether you'll type 0/1
  • You type 0 or 1
  • The computer is right much more than half the
    time. How?

6
Machine Learning - Example
  • One of my favorite AI/Machine Learning sites
  • http://www.20q.net/

7
Why learn?
  • Build software agents that can adapt to their
    users or to other software agents or to changing
    environments
  • Fill in skeletal or incomplete specifications
    about a domain
  • Large, complex AI systems cannot be completely
    derived by hand and require dynamic updating to
    incorporate new information.
  • Learning new characteristics expands the domain
    of expertise and lessens the brittleness of the
    system
  • Discover new things or structures that were
    previously unknown to humans
  • Examples: data mining, scientific discovery
  • Understand and improve efficiency of human
    learning

8
Why Study Machine Learning? Building Better
Engineering Systems
  • Develop systems that can automatically adapt and
    customize themselves to individual users.
  • Personalized news or mail filter
  • Personalized tutoring
  • Develop systems that are too difficult/expensive
    to construct manually because they require
    specific detailed skills or knowledge tuned to a
    specific task (knowledge engineering bottleneck).
  • Discover new knowledge from large databases (data
    mining).
  • Market basket analysis (e.g. diapers and beer)
  • Medical text mining (e.g. migraines to calcium
    channel blockers to magnesium)

9
Why Study Machine Learning? Cognitive Science
  • Computational studies of learning may help us
    understand learning in humans and other
    biological organisms.
  • Hebbian neural learning
  • "Neurons that fire together, wire together."

10
Related Disciplines
  • Artificial Intelligence
  • Pattern Recognition
  • Data Mining
  • Probability and Statistics
  • Information theory
  • Psychology (developmental, cognitive)
  • Neurobiology
  • Linguistics
  • Philosophy

11
History of Machine Learning
  • 1950s
  • Samuel's checker player
  • Selfridge's Pandemonium
  • 1960s
  • Neural networks: Perceptron
  • Pattern recognition
  • Learning in the limit theory
  • Minsky and Papert prove limitations of Perceptron
  • 1970s
  • Symbolic concept induction
  • Winston's arch learner
  • Expert systems and the knowledge acquisition
    bottleneck
  • Quinlan's ID3
  • Michalski's AQ and soybean diagnosis
  • Scientific discovery with BACON
  • Mathematical discovery with AM

12
History of Machine Learning (cont.)
  • 1980s
  • Advanced decision tree and rule learning
  • Explanation-based Learning (EBL)
  • Learning, planning and problem solving
  • Utility theory
  • Analogy
  • Cognitive architectures
  • Resurgence of neural networks (connectionism,
    backpropagation)
  • Valiant's PAC Learning Theory
  • 1990s
  • Data mining
  • Reinforcement learning (RL)
  • Inductive Logic Programming (ILP)
  • Ensembles: Bagging, Boosting, and Stacking
  • Bayes Net learning

13
History of Machine Learning (cont.)
  • 2000s
  • Kernel methods
  • Support vector machines
  • Graphical models
  • Statistical relational learning
  • Transfer learning
  • Applications
  • Adaptive software agents and web applications
  • Learning in robotics and vision
  • E-mail management (spam detection)

14
Major paradigms of machine learning
  • Rote learning: Learning by memorization.
  • Employed by the first machine learning systems
    in the 1950s
  • Samuel's Checkers program
  • Supervised learning: Use specific examples to
    reach general conclusions or extract general
    rules
  • Classification (Concept learning)
  • Regression
  • Unsupervised learning (Clustering): Unsupervised
    identification of natural groups in data
  • Reinforcement learning: Feedback (positive or
    negative reward) given at the end of a sequence
    of steps
  • Analogy: Determine correspondence between two
    different representations
  • Discovery: Unsupervised; a specific goal is not
    given
  • Batch vs. online learning:
  • All training examples are provided at once, or
    one at a time (with an error estimate and
    training after each example).

15
Rote Learning is Limited
  • Memorize I/O pairs and perform exact matching
    with new inputs
  • If a computer has not seen the precise case
    before, it cannot apply its experience
  • We want computers to generalize from prior
    experience
  • Generalization is the most important factor in
    learning

16
The inductive learning problem
  • Extrapolate from a given set of examples to make
    accurate predictions about future examples
  • Supervised versus unsupervised learning
  • Learn an unknown function f(X) = Y, where X is
    an input example and Y is the desired output.
  • Supervised learning implies we are given a
    training set of (X, Y) pairs by a teacher
  • Unsupervised learning means we are only given the
    Xs.
  • Semi-supervised learning: mostly unlabelled data
  • Reinforcement learning: delayed feedback

17
Types of supervised learning
[Figure: objects plotted in a 2-D feature space
with axes x1 = size and x2 = color]
  • Classification
  • We are given the labels of the training objects:
    (x1, x2, y), e.g. y = T/O
  • We are interested in classifying future objects
    (x1, x2) with the correct label.
  • i.e. Find y for a given (x1, x2).

  • Concept Learning
  • We are given positive and negative samples for
    the concept we want to learn (e.g. Tangerine):
    (x1, x2, y = +/-)
  • We are interested in classifying future objects
    as a member of the class (or a positive example
    for the concept) or not.
  • i.e. Answer +/- for a given (x1, x2).

18
Types of Supervised Learning
  • Regression
  • Target function is continuous rather than class
    membership

19
Classification
  • Assign object/event to one of a given finite set
    of categories.
  • Medical diagnosis
  • Credit card applications or transactions
  • Fraud detection in e-commerce
  • Spam filtering in email
  • Recommended books, movies, music
  • Financial investments
  • Spoken words
  • Handwritten letters

20
Concept learning
  • Given a training set of positive and negative
    examples of a concept
  • Construct a description that will accurately
    classify whether future examples are positive or
    negative
  • Examples: points in a multi-dimensional feature
    space
  • Concepts: functions that label every point in
    feature space
  • (as +, -, and possibly ?)

21
Example
Positive Examples
Negative Examples
How does this symbol classify?
  • Concept
  • Solid Red Circle in a (regular?) Polygon
  • What about?
  • Figures on left side of page
  • Figures drawn before 5pm 2/2/89 <etc>

22
Inductive learning framework Feature Space
  • Raw input data from sensors are typically
    preprocessed to obtain a feature vector, X, that
    adequately describes all of the relevant features
    for classifying examples
  • Each x is a list of (attribute, value) pairs
  • X = (Color=Orange, Shape=Round, Weight=200g)
  • Each attribute can be discrete or continuous
  • Each example can be interpreted as a point in an
    n-dimensional feature space, where n is the
    number of attributes
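As a small illustration of the idea above (not from
the slides; the attribute names and integer
encodings are assumptions), an example's
(attribute, value) pairs can be mapped to a point
in feature space:

    # Illustrative sketch only: mapping an example's (attribute, value)
    # pairs to a point in an n-dimensional feature space.
    # The attribute names and integer encodings are assumptions.

    example = {"Color": "Orange", "Shape": "Round", "Weight": 200}

    # Discrete attributes are encoded as integers so that every example
    # becomes a point in an n-dimensional space (n = number of attributes).
    color_codes = {"Orange": 0, "Gray": 1}
    shape_codes = {"Round": 0, "Square": 1}

    point = (color_codes[example["Color"]],
             shape_codes[example["Shape"]],
             example["Weight"])
    print(point)  # (0, 0, 200)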

23
Feature Space
[Figure: a 3-D feature space with axes Size, Color,
and Weight; an example object plotted at the point
(Big, Gray, 2500)]
A concept is then a (possibly disjoint) volume
in this space.
24
Inductive learning as search
  • Instance space I
  • Each instance i ∈ I is a feature vector
  • i = (v1, v2, ..., vk) ∈ I = V1 × V2 × ... × Vk
  • Class C gives the instance's class (to be
    predicted)
  • Model space M defines the possible hypotheses
  • M: I → C, M = {m1, ..., mn} (possibly infinite)
  • Training data can be used to direct the search
    for a good (consistent, complete, simple)
    hypothesis in the model space

25
Learning Key Steps
  • data and assumptions
  • what data is available for the learning task?
  • what can we assume about the problem?
  • representation
  • how should we represent the examples to be
    classified?
  • method and estimation
  • what are the possible hypotheses?
  • how do we adjust our predictions based on the
    feedback?
  • evaluation
  • how well are we doing?

26
(No Transcript)
27
Evaluation of Learning Systems
  • Experimental
  • Conduct controlled cross-validation experiments
    to compare various methods on a variety of
    benchmark datasets.
  • Gather data on their performance, e.g. test
    accuracy, training-time, testing-time.
  • Analyze differences for statistical significance.
  • Theoretical
  • Analyze algorithms mathematically and prove
    theorems about their
  • Computational complexity
  • Ability to fit training data
  • Sample complexity (number of training examples
    needed to learn an accurate function)
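As a sketch of the experimental protocol above, the
k-fold cross-validation loop can be set up as
follows (the train and accuracy functions are
hypothetical stand-ins for whatever learner and
metric are being compared):

    # Hedged sketch of k-fold cross-validation; `train` and `accuracy`
    # are hypothetical callables for the learner being evaluated.
    import random

    def k_fold_cv(examples, train, accuracy, k=10, seed=0):
        data = examples[:]
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]   # k roughly equal folds
        scores = []
        for i in range(k):
            test = folds[i]
            training = [x for j, f in enumerate(folds) if j != i
                        for x in f]
            model = train(training)
            scores.append(accuracy(model, test))
        return sum(scores) / k                   # mean test accuracy

Running this for each method on each benchmark
dataset yields the per-method accuracies whose
differences can then be tested for statistical
significance.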

28
Measuring Performance
  • Performance of the learner can be measured in one
    of the following ways, as suitable for the
    application
  • Classification Accuracy
  • Solution quality (length, efficiency)
  • Speed of performance

29
(No Transcript)
30
Inductive Learning
  • "Learning is a characteristic of adaptive
    systems which are capable of improving their
    performance on a problem as a function of
    previous experience, for example, in solving
    similar problems" (Simon, 1983).
  • "The inductive learning process is a heuristic
    search through a space of symbolic descriptions,
    generated by the application of various
    inference rules to the initial observational
    statements" (Michalski, 1983).

31
Defining the Learning Task
  • Improve on task, T,
  • with respect to performance metric, P,
  • based on experience, E.


32
Defining the Learning Task
  • T: Playing checkers
  • P: Percentage of games won against an arbitrary
    opponent
  • E: Playing practice games against itself
  • T: Recognizing hand-written words
  • P: Percentage of words correctly classified
  • E: Database of human-labeled images of
    handwritten words
  • T: Driving on four-lane highways using vision
    sensors
  • P: Average distance traveled before a
    human-judged error
  • E: A sequence of images and steering commands
    recorded while observing a human driver
  • T: Categorize email messages as spam or
    legitimate
  • P: Percentage of email messages correctly
    classified
  • E: Database of emails, some with human-given
    labels

33
Sample Learning Problem
  • Learn to play checkers from self-play, using an
    approach analogous to that used in the first
    machine learning system, developed by Arthur
    Samuel at IBM in 1959.

34
Designing a Learning System
  • Choose the training experience
  • Choose exactly what is to be learned,
  • i.e. the target function.
  • Choose the representation for the target function
  • Selecting the features and the hypothesis class
  • Choose a learning algorithm to infer the target
    function from the experience.

35
Training Experience
  • Direct experience: Sample input and output pairs
    are given for a useful target function.
  • E.g. Checker boards labeled with the correct
    move, extracted from records of expert play
  • Indirect experience: The given feedback is not
    direct I/O pairs for a useful target function.
  • Potentially arbitrary sequences of game moves
    and their final game results.
  • Credit/Blame Assignment Problem: How to assign
    credit/blame to individual moves given only
    indirect feedback?

36
Source of Training Data
  • Random examples provided outside of the
    learner's control
  • Negative examples available or only positive?
  • Good training examples selected by a benevolent
    teacher.
  • Near miss examples
  • Learner can query an oracle about class of an
    unlabeled example in the environment
  • Learner can construct an arbitrary example and
    query an oracle for its label.
  • Learner can design and run experiments directly
    in the environment without any human guidance.
  • The last three cases are forms of active
    learning

37
Training vs. Test Distribution
  • Generally assume that the training and test
    examples are independently drawn from the same
    overall distribution of data.
  • IID: Independently and identically distributed
  • If examples are not independent, it requires
    collective classification.
  • If test distribution is different, it requires
    transfer learning.
  • Almost all ML/PR systems assume iid training and
    identical train/test distributions

38
Choosing a Target Function
  • What function is to be learned and how will it be
    used by the performance system?
  • For the 2-class classification, we may want to
    learn a separating boundary between classes
  • For checkers, assume we are given a function for
    generating the legal moves for a given board
    position. We want to decide the best move.
  • Could learn a function that returns the best move
    for a given board
  • ChooseMove(board, legal-moves) → best-move
  • Or could learn an evaluation function,
    V(board) → R, that gives each board position a
    score for how favorable it is.
  • V can then be used to pick a move by applying
    each legal move, scoring the resulting board
    position, and choosing the move that results in
    the highest scoring board position.
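A minimal sketch of this second option, picking the
move whose resulting board scores highest under V
(legal_moves and apply_move are hypothetical game
helpers, not from the slides):

    # Sketch: use a learned evaluation function V to pick a move.
    # `legal_moves(board)` and `apply_move(board, move)` are assumed
    # helpers for the game being played.

    def choose_move(board, legal_moves, apply_move, V):
        # Score the board resulting from each legal move; pick the best.
        return max(legal_moves(board),
                   key=lambda move: V(apply_move(board, move)))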

39
[Figure: example final checker boards labeled with
their values, +100 and -100]
40
Deciding on the Target Function
  • For checkers, we need to further qualify the
    function we want to learn. For instance, it
    would be useful to have a function V that
    assigns the following values to a board b:
  • If b is a final winning board, then V(b) = 100
  • If b is a final losing board, then V(b) = -100
  • If b is a final draw board, then V(b) = 0
  • If b is a non-terminal board, then V(b) is in
    [-100, 100] according to the goodness of the
    board.
  • This is what we want to learn.

41
Representing the Target Function
  • Target function can be represented in many ways
  • lookup table
  • symbolic rules
  • numerical function (of varying complexity)
  • neural network
  • There is a trade-off between the expressiveness
    of a representation and the ease of learning.
  • The more expressive a representation, the better
    it will be at approximating an arbitrary
    function; however, the more examples will be
    needed to learn an accurate function.

42
Various Target Function Representations
  • Numerical functions
  • Linear regression
  • Neural networks
  • Support vector machines
  • Symbolic functions
  • Decision trees
  • Rules in propositional logic
  • Rules in first-order predicate logic
  • Instance-based functions
  • Nearest-neighbor
  • Case-based
  • Probabilistic Graphical Models
  • Naïve Bayes
  • Bayesian networks
  • Hidden-Markov Models (HMMs)
  • Probabilistic Context Free Grammars (PCFGs)

43
Target Function Representation (1/2)
  • Let's first assume that the following
    features/attributes are useful to decide the
    value of a checker board:
  • bp(b): number of black pieces on board b
  • rp(b): number of red pieces on board b
  • bk(b): number of black kings on board b
  • rk(b): number of red kings on board b
  • bt(b): number of black pieces threatened (i.e.
    which can be immediately taken by red on its
    next turn)
  • rt(b): number of red pieces threatened
  • Why these features? These are meaningful
    features for a board, but one could add more
    features or combinations of features as well.

44
Target Function Representation (2/2)
  • Now assume that there is a function V that gives
    the value of a particular board state given
    these features - we want to find that function.
  • Since we do not know V, we can only estimate it
  • i.e. Find V' (or V-hat), which is an
    approximation to V
  • To do this, we need to first decide on a
    representation
  • E.g. A linear combination of weighted attributes
    (other numerical functions are also possible);
    see the formula below
  • Then, we need to learn this function (i.e. the
    weights wi)
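The formula itself did not survive transcription;
the linear form below follows Chapter 1 of
Mitchell's Machine Learning, which these slides
track:

    \hat{V}(b) = w_0 + w_1\,bp(b) + w_2\,rp(b) + w_3\,bk(b)
               + w_4\,rk(b) + w_5\,bt(b) + w_6\,rt(b)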

45
Learning Algorithm
  • Use training values for the target function to
    induce a hypothesized definition that fits these
    examples and hopefully generalizes to unseen
    examples.
  • We can minimize some measure of error (loss
    function) such as mean squared error

Note: Vtrain and V are used to mean the same thing
(the target function); V' (or V-hat) is the
approximation learned by the system.
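The error measure referred to above appeared as an
image; in Mitchell's notation, the squared error
over the training examples is:

    E \equiv \sum_{\langle b,\, V_{train}(b) \rangle
             \,\in\, \text{training examples}}
             \left( V_{train}(b) - \hat{V}(b) \right)^2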
46
Various Search Algorithms
  • Gradient descent
  • Perceptron
  • Backpropagation
  • Divide and Conquer
  • Decision tree induction
  • Rule learning
  • Evolutionary Computation
  • Genetic Algorithms (GAs)
  • Dynamic Programming
  • HMM Learning

47
Learning V' (V' and V-hat are the same thing!)
  • Start with a rough approximation V' of V
  • Assign some initial (possibly random) weights wi
  • There are many learning algorithms to learn V'
  • i.e. learning the weights wi, since the
    attributes are the inputs
  • We will learn the function V' using our training
    experience.
  • The training experience is obtained by self-play
  • Modify the weights so that V'(b) is closer to
    the training value Vtrain(b) for the given
    training samples
  • One possible learning algorithm to adjust the
    weights is the Least Mean Squares (LMS)
    Algorithm

48
Learning
  • Least Mean Squares (LMS) Algorithm: A gradient
    descent algorithm that incrementally updates the
    weights of a linear function in an attempt to
    minimize the mean squared error
  • Until the weights converge:
  • For each training example b do
  • 1) Compute the error:
    error(b) = Vtrain(b) - V'(b)
  • 2) For each board feature fi, update its weight
    wi: wi ← wi + c · fi · error(b)
  • for some small constant (learning rate) c
  • What is the intuition behind this update rule?
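A runnable sketch of one LMS step for the linear
representation above, following the rule in
Mitchell's book (the feature layout, with a leading
1 for the bias weight w0, is an assumption):

    # Sketch of the LMS update: w_i <- w_i + c * f_i * error(b).
    # `features_b` is assumed to be (1, bp, rp, bk, rk, bt, rt),
    # with the leading 1 multiplying the bias weight w0.

    def lms_update(weights, features_b, v_train, c=0.1):
        """One LMS step for a single training board b."""
        # Current estimate V'(b) under the linear representation.
        v_hat = sum(w * f for w, f in zip(weights, features_b))
        error = v_train - v_hat
        # Move each weight in proportion to its feature and the error.
        return [w + c * f * error
                for w, f in zip(weights, features_b)]

Intuition: a feature that contributed strongly to
an over- or under-estimate has its weight corrected
proportionally; features equal to zero on this
board are left unchanged.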

49
Gradient Descent
50
(No Transcript)
51
(No Transcript)
52
LMS as Gradient Descent
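A brief reconstruction of the argument this title
refers to, using the squared error E and the linear
V-hat defined earlier (Mitchell's notation):

    \frac{\partial E}{\partial w_i}
      = \frac{\partial}{\partial w_i}
        \sum_b \left( V_{train}(b) - \hat{V}(b) \right)^2
      = -2 \sum_b \left( V_{train}(b) - \hat{V}(b) \right) f_i(b)

Stepping opposite the gradient gives
w_i ← w_i + c Σ_b (Vtrain(b) - V'(b)) f_i(b); the
LMS rule of slide 48 is the incremental
(per-example) version of exactly this update.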
53
Obtaining Training Values
  • Some direct supervision may be available for the
    target function.
  • < <bp=3, rp=0, bk=1, rk=0, bt=0, rt=0>, +100 >
    (win for black)
  • With indirect feedback for intermediate board
    states, training values are also only estimates

54
Estimating Training Values
  • How to learn to play a game?
  • In game playing, in general, the standard
    approach is to apply the minimax algorithm,
    where the player picks the moves that maximize
    his return (highest board value) assuming a
    perfect opponent.
  • Since the game tree is exponential in size,
    normally the search is cut at some point and the
    current best option is selected.
  • For very small games, we could have the computer
    play against itself and assign a value to each
    board state that is considered in the game tree,
    as follows:
  • A final board's value is known (+100, -100, or
    0) by definition
  • An intermediate board b is assigned a value V(b)
    equal to V(b'), where b' is the highest-scoring
    final board position that can be achieved
    starting from b and playing optimally until the
    end of the game (assuming the opponent plays
    optimally as well).
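A hedged sketch of the depth-limited minimax search
described above; V is the (learned) evaluation
function applied when the search is cut off, and
legal_moves, apply_move, and is_final are
hypothetical game helpers:

    # Depth-limited minimax: back up max values on the program's turns
    # and min values on the (assumed perfect) opponent's turns.

    def minimax(board, depth, maximizing, V,
                legal_moves, apply_move, is_final):
        if depth == 0 or is_final(board):
            return V(board)          # evaluate at the search cutoff
        values = [minimax(apply_move(board, m), depth - 1,
                          not maximizing, V,
                          legal_moves, apply_move, is_final)
                  for m in legal_moves(board)]
        return max(values) if maximizing else min(values)

Cutting the search at a fixed depth is what makes
the computation tractable, at the cost of relying
on the evaluation function for non-final boards.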

55
Estimating Training Values
  • Computing V(b) in this way is intractable since
    it involves searching the complete exponential
    game tree.
  • Therefore, this definition is said to be
    non-operational.
  • An operational definition can be computed in
    reasonable (polynomial) time.
  • Need to learn an operational approximation to the
    ideal evaluation function.

56
Estimating Training Values
  • Estimate training values for intermediate
    (non-terminal) board positions by the estimated
    value of their successor in an actual game trace
    (one path in the game tree)
  • where successor(b) is the next board position
    where it is the program's move in actual play
    (the update rule is written out below)
  • Values towards the end of the game are initially
    more accurate and continued training slowly
    backs up accurate values to earlier board
    positions.
  • Temporal difference learning deals with the
    credit assignment problem with exponentially
    decaying credit/penalty for boards farther away
    in time.
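The update rule referenced in the first bullet
appeared as an image; in Mitchell's notation it is:

    V_{train}(b) \leftarrow \hat{V}(successor(b))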

57
Lessons Learned about Learning
  • Learning can be viewed as using direct or
    indirect experience to approximate a chosen
    target function.
  • Function approximation can be viewed as a search
    through a space of hypotheses (representations of
    functions) for one that best fits a set of
    training data.
  • Different learning methods assume different
    hypothesis spaces (representation languages)
    and/or employ different search techniques.

58
Issues in Machine Learning
  • Training Experience
  • What can be the training experience (labelled
    samples, self-play, ...)
  • Target Function
  • What should we aim to learn?
  • What should the representation of the target
    function be (features, hypothesis class, ...)
  • Learning
  • What learning algorithms exist for learning
    general target functions from specific training
    examples?
  • Which algorithms can approximate functions well
    (and when)?
  • How does noisy data influence accuracy?
  • ...
  • Training Data
  • How much training data is sufficient?
  • How does number of training examples influence
    accuracy?
  • What is the best strategy for choosing a useful
    next training experience? How does it affect
    complexity?
  • Prior Knowledge/Domain Knowledge