1
Lecture 12: Introduction to Intelligent Systems
  • Dr Martin Brown
  • Room E1k
  • Email martin.brown_at_manchester.ac.uk
  • Telephone 0161 306 4672
  • http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/

2
Lecture 12: Outline
  • Introduction to machine learning and intelligent
    systems
  • Intelligence and learning
  • Definition of machine learning
  • Application areas
  • Face detection (classification)
  • Energy prediction (prediction/system
    identification)
  • The machine learning process
  • Steps necessary to solve a machine learning
    application
  • Aspects of machine learning
  • Performance, generalisation and parameter
    estimation
  • Simple regression, classification and clustering
    examples

3
Lecture 12: Resources
  • This set of slides is largely self-contained;
    however, it is based on the following texts, which
    provide useful background/supplementary
    information
  • Chapters 1-3, Machine Learning (MIT Open
    Courseware), T Jaakkola,
    http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CourseHome/index.htm
  • An introduction to Support Vector Machines and
    other kernel-based learning methods, N
    Cristianini, J Shawe-Taylor, Cambridge University
    Press, 2000
  • Machine Learning, T Mitchell, McGraw Hill, 1997
  • Machine Learning, Neural and Statistical
    Classification, D Michie, DJ Spiegelhalter and CC
    Taylor, 1994 (out of print, but available from
    http://www.amsta.leeds.ac.uk/~charles/statlog/)

4
Examples of Machine Learning
  • The field of machine learning is concerned with
    the question of how to construct computer
    programs that automatically improve with
    experience
  • Systems for credit card transactions approval.
    The system is designed by showing it examples of
    good and bad (fraudulent) transactions, and
    letting it learn the difference
  • Learning to play Backgammon. The world's best
    computer backgammon players are based on machine
    learning algorithms
  • Systems for spam filtering. Decisions about
    whether or not a mail is regarded as junk can be
    built and refined during actual usage
  • Note that recursive model estimation can also be
    regarded as a type of machine learning
  • The definition is broad, covering everything from
    statistical regression and system identification
    to genetic algorithms

5
Definition of Machine Learning
  • A computer program M is said to learn from
    experience E with respect to some tasks T and
    performance measure P, if M's performance at
    tasks in T, as measured by P, improves with
    experience E.
  • The task T must be adequately described in
    terms of measurable signals (inputs and outputs)
    and success criteria
  • The computer program M could be a physical model,
    where the parameters require tuning, a
    statistical distribution or a set of rules
  • The experience E may occur during design
    (off-line) or during actual operation (on-line)
  • The actual, on-line performance P may be
    difficult to estimate, if learning takes place
    off-line
  • The experience set E needs to be rich enough so
    that M is sufficiently exercised, should be
    balanced to reflect the actual usage and should
    contain enough examples to estimate M
    sufficiently accurately.

6
Predict/Learn Cycle
Task T: specify measurable inputs and outputs
Experience E: a data set $D = \{X, y\}$ of measurements and desired targets
Model M: parameterized by the unknown vector $\theta$
Prediction: $\hat{y} = m(x, \theta)$
Performance P: compares M's predictions $\hat{y}$ against $y$
Learning: $\Delta\theta = f(y, \hat{y})$

  • This is the most commonly applied view of machine
    learning
  • Supervised learning
  • It's also worth remembering the
    statistician's view that
  • All models are wrong, just some are more useful
    than others
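
The predict/learn cycle above can be sketched as a short loop. This is a minimal illustration only: it assumes a linear model for m, a least-mean-squares style update for f, a hypothetical learning rate `eta`, and made-up data. None of these choices is specified on the slide.

```python
def predict(x, theta):
    """Prediction y_hat = m(x, theta), assumed linear here: theta . x."""
    return sum(t * xi for t, xi in zip(theta, x))


def learn(theta, x, y, eta=0.1):
    """One learning step: delta_theta = eta * (y - y_hat) * x (LMS rule, an assumption)."""
    error = y - predict(x, theta)
    return [t + eta * error * xi for t, xi in zip(theta, x)]


# Illustrative experience E: inputs [1, x] and targets generated by y = 1 + 2x.
data = [([1.0, x / 10.0], 1.0 + 2.0 * x / 10.0) for x in range(11)]

theta = [0.0, 0.0]
for epoch in range(2000):
    for x, y in data:
        theta = learn(theta, x, y)  # predict, compare with target, update

# Performance P: mean squared error between targets and predictions.
mse = sum((y - predict(x, theta)) ** 2 for x, y in data) / len(data)
```

Because the example targets are noise-free, the parameters settle at the generating values (bias 1, gain 2); with real data the loop would only approach the least-squares solution.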

7
Simple, Single Regression Example
  • Consider trying to build a simple, single
    variable linear predictor of the relationship
    between job (lot) size and work hours for Toluca
  • Data set is of the form
  • X = [ones(25,1) work(:,1)]
  • y = work(:,2)
  • Linear prediction model is
  • Quadratic Performance function is
  • Parameters can be learnt/estimated
  • This is the basis for today's laboratory
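
For reference, a standard least-squares formulation that fits the three bullets above, and that is consistent with the linear model $\hat{y} = x^T\theta$ on slide 28, is sketched below. Here $x_i^T$ denotes the i-th row of X. The scaling of the performance function (no factor of 1/2 or 1/N) is an assumption, not something stated in the slides.

```latex
\begin{aligned}
\hat{y} &= X\theta \\
f(\theta) &= (y - X\theta)^T (y - X\theta) = \sum_{i=1}^{25} (y_i - x_i^T\theta)^2 \\
\hat{\theta} &= (X^T X)^{-1} X^T y
\end{aligned}
```

In Matlab the last line is usually computed with the backslash operator, `theta = X \ y`, rather than by forming the inverse explicitly.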

8
History of Machine Learning
  • Rosenblatt 1958: Perceptron
  • Minsky and Papert 1969: Critique of basic
    Perceptron theory
  • Holland 1975: Genetic algorithms
  • Barto and Sutton 1985: Proposed basic
    reinforcement learning algorithms
  • Rumelhart 1986: Development of back-propagation,
    a gradient descent procedure for multi-layer
    perceptrons
  • MacKay 1992: Bayesian MLPs (energy prediction
    winner)
  • Tesauro 1992: Backgammon application, world
    class
  • Heckerman 1994: Bayesian networks (medical,
    Windows 95 applications)
  • Vapnik 1995: Support vector machines
  • Neal 1996: Gaussian processes
  • Jordan 1997: Variational learning and Bayesian
    nets

9
Rationale for Machine Learning
  • Poor understanding of the optimal model that
    maximises the performance P. The optimal model
    may be the true physical model or a statistical
    model with known parameters.
  • In practice, there is always some uncertainty
    about effects not included in the model, or the
    structure of the model itself; hence the need to
    learn/adapt during design.
  • Similarly, the model may change during operation,
    or training information about the optimal
    behaviour may be only weakly available; hence the
    need to learn during operation.

10
Relationship to System Identification
  • This is obviously closely related to system
    identification, so it's worthwhile considering
    what the differences are
  • System identification is largely concerned with
    linear, dynamical prediction, based on
    real-valued, time-delayed variables
  • The models are described by their structures and
    their unknown parameter vectors, which are tuned
    to best fit the data
  • Intelligent systems is largely concerned with
    non-linear classification and prediction
    problems, where the variables may be real-valued
    or categorical.
  • Again, the models are described by their
    structure and their unknown parameter vectors,
    which are tuned to best fit the data
  • In practice, there is a reasonable amount of
    overlap between the two areas.

11
Application 1: Face Detection
  • Task: Given an arbitrary image, which could be a
    digitized video signal or a scanned photograph,
    classify image regions according to whether they
    contain human faces.
  • Data: Representative sample of images containing
    examples of faces and non-faces. Input features
    are derived from the raw pixel values; binary
    class labels are assigned by experts.
  • Rationale for machine learning: most pixel-based
    detection problems are hard, due to significant
    pattern variations that are hard to parameterise
    analytically.
  • Reference: Support Vector Machines: Training and
    Applications, Osuna, Freund and Girosi, MIT
    AI-1602 Technical Report, 1997.

12
Faces: Feature Extraction
Aim: given a 19×19 pixel window (283 features),
determine whether the normalized image
window contains a face (+1) or not (-1).
  • Rescale image several times
  • (Scale invariance)
  • Cut 19×19 window patterns
  • (Location detection)
  • Preprocessing: light correction and
    histogram equalization
  • (Brightness invariance)
  • Classify using SVM
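
The window-cutting step can be sketched as follows. This is only an illustration of how the number of candidate windows grows with image size: the one-pixel stride, the image dimensions and the helper names are assumptions, and rescaling, masking and preprocessing are left out.

```python
def sliding_windows(image, size=19, stride=1):
    """Yield (row, col, window) for every size x size window in a 2-D list image."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            window = [row[c:c + size] for row in image[r:r + size]]
            yield r, c, window


def count_windows(rows, cols, size=19, stride=1):
    """Number of windows produced for an image of the given shape."""
    if rows < size or cols < size:
        return 0
    return ((rows - size) // stride + 1) * ((cols - size) // stride + 1)


# Illustrative 30 x 40 image of zeros (not real data).
image = [[0] * 40 for _ in range(30)]
windows = list(sliding_windows(image))
```

Even this tiny image yields 264 windows at a single scale, which suggests why a few hundred test images can produce millions of windows once several scales are scanned.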

13
Faces: Classification Problem
  • Determine whether the 19×19 window contains a
    face
  • Support vector machines only use the examples
    closest to the decision boundary; the rest are
    ignored
  • Training set contained 2,000 examples

[Figure: plot with axes x1 and x2]
14
Faces: Results and Comments
  • Use a support vector machine classifier with
    polynomial kernels of degree 2.
  • Number of parameters in the polynomial: 40,000!
  • Test on two data sets
  • A: 313 images, 313 faces (one per image, mug
    shots), 4.5M windows
  • B: 23 images, 155 faces, 5.5M windows
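
The figure of roughly 40,000 parameters can be checked by counting the monomials of degree at most 2 in 283 input features. The sketch below assumes the kernel has the common form K(x, z) = (1 + x·z)^2; the slides only state that the kernel is a polynomial of degree 2, and the function names are illustrative.

```python
from math import comb


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (1 + x . z)^2 (form assumed)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** 2


def num_monomials(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables."""
    return comb(n_features + degree, degree)


n_terms = num_monomials(283, 2)  # 285 * 284 / 2 = 40470
```

The kernel evaluates the inner product in this 40,470-dimensional space while only ever touching the 283 original features.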

15
Faces: Example Classifications

16
Application 2: Building Energy Prediction
  • Task: Create (time series) prediction model(s)
    which can predict the energy load from a series
    of environmental factors
  • Data: Four months of actual energy usage were
    recorded, and used to estimate a predictive model
  • Rationale: A physical model of energy usage is too
    complex to estimate; often the best guide is
    prior experience, or matching to similar
    buildings already in existence
  • Reference: Bayesian non-linear modelling for the
    prediction competition, MacKay, Technical Report,
    Cambridge University, 1994

17
Energy: Task Details and Data
  • The data set consisted of hourly measurements
    from 1/9/89 to 31/12/89 of four environmental
    (input) variables:
  • Temperature
  • Humidity
  • Solar flux
  • Wind
  • as well as three dependent targets:
  • A1: Electricity
  • A2: Cooling water
  • A3: Heating water
  • There were 2926 training data points in total, and
    the aim was to predict hourly usage of the three
    targets over the next 54 days (1282 test data points)

18
Energy: Data Preparation
[Figure: electricity usage plotted against hours]
  • Data preparation involved determining the time
    delay associated with the environmental values,
    as well as deciding how best to represent them
    (raw values, rotated projections, exponential
    averaging)
  • Outlier rejection: some of the data were recorded
    during a burst pipe; the standard model should
    not be trained on these
19
Energy: Modelling Approach
  • The inputs included the four environmental
    factors at the current time and at 2.5, 24 and
    72 hour delays
  • Categorical inputs denoting
    day, week, holiday, year

The modelling algorithm was a Bayesian
multi-layer perceptron. Many models were trained
and averaged to get robust performance with some
indication of expected error.
20
Energy: Prediction and Uncertainty
[Figure: cooling water usage plotted against hours]
  • As the designer is not certain about the model
    structure or the type of features, there will be
    some discrepancy between the predictions from
    different models
  • These can be averaged to get a mean prediction,
    and the standard deviation can be used to give
    the prediction error bars
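
The averaging step can be sketched as below. This is a plain unweighted average over made-up model outputs; how the models in the competition entry were actually weighted is not described in the slides, and `committee_prediction` is a hypothetical helper name.

```python
from statistics import mean, stdev


def committee_prediction(predictions):
    """Combine per-model predictions for one time step into (mean, error bar)."""
    return mean(predictions), stdev(predictions)


# Made-up predictions from three models at four hourly time steps.
model_outputs = [
    [1.0, 2.0, 3.0, 4.0],  # model 1
    [1.2, 2.1, 2.7, 4.4],  # model 2
    [0.8, 1.9, 3.3, 3.6],  # model 3
]

# zip(*...) regroups the outputs by time step rather than by model.
combined = [committee_prediction(step) for step in zip(*model_outputs)]
```

Time steps where the models disagree more receive wider error bars, which is the behaviour the slide describes.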

21
Energy: Results and Conclusions
  • The winning entry in this competition was
    created using the following data modelling
    philosophy: use huge flexible models, including
    all possibilities that you can imagine might be
    appropriate; control the flexibility of these
    models using sophisticated priors; and use Bayes
    as a helmsman to guide the search through model
    space.
  • A physical model would have been very difficult
    to produce, whereas an empirical regression model
  • Independent test data:
  • A1: $\sigma(y - \hat{y}) = 65$
  • A2: $\sigma(y - \hat{y}) = 0.64$
  • A3: $\sigma(y - \hat{y}) = 0.53$




22
Data Modelling Methodology
1. Define task: specify goal, ROI to guide
accuracy
2. Data model: define relevant data sources and
attributes
3. Transform: consolidate, clean and transform
data
4. Exploratory data analysis: analyse ranges,
distributions, simple relationships
(correlations)
5. Model building: association, classification,
clustering and prediction
6. Validation: assess model accuracy, find
important relationships, try alternative models
23
2. Data Models
  • Decide on which informative, measurable variables
    will be used in this project and select their
    source(s)
  • Selecting which attributes/variables to use
  • Deriving a table that contains those
    attributes/variables
  • Define structure of x -> y mapping

24
4. Exploratory Data Analysis
  • Understand the distributions of each variable,
    and simple relationships (correlations) between
    variables during off-line data exploration
  • Visualise the distributions associated with each
    variable in the system
  • Continuous variable: distribution
  • Categorical variable: histogram
  • Very important for things like classification,
    when one of the class labels is rare (e.g.
    fraudulent transaction)
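
A histogram of a categorical variable is just a table of counts. The sketch below uses made-up transaction labels to show how quickly a rare class shows up; the labels and the 3% fraud rate are illustrative assumptions.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of records} for a categorical variable."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Made-up transaction labels with a rare "fraud" class.
labels = ["ok"] * 97 + ["fraud"] * 3
proportions = class_proportions(labels)
```

With these proportions, a classifier that always answers "ok" is already 97% accurate, so raw accuracy alone says little about how well the rare class is detected.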

25
5. Machine Learning Techniques
  • Summarise a data set, using a model, in order to
    extract useful relationships
  • Associations
  • Identify items that occur together within a
    transaction
  • Supermarket goods, patent authors, viewed web
    pages
  • Classification
  • Predict whether an image window contains a face
  • Awarding credit; will a user complete a web transaction?
  • Clustering
  • Identify a small number of groups that contain
    similar records
  • Customer segmentation, web page grouping
  • Regression
  • Predict a real value
  • Time series analysis, lifetime value prediction

26
6. Model Analysis and Validation
  • When building a model to describe relationships
    within the data, you need to perform model
    analysis and validation in order to estimate its
    performance when applied to unseen data
  • Models are not 100% accurate!
  • This is due to limited
  • Features (cannot measure everything)
  • Data examples (cannot sample every case)
  • The ROI of the data mining project will depend on
    the accuracy of the classification/prediction
    models, therefore, this needs to be carefully
    estimated
  • Error analysis: 95% accurate
  • Confusion matrices
  • Lift charts
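
A confusion matrix for a two-class problem can be computed as in the following sketch. The labels (+1 for the positive class, -1 otherwise) follow the face-detection coding used earlier; the held-out data and predictions are made up.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def accuracy(cm):
    """Fraction of examples classified correctly."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


# Made-up held-out labels and predictions.
actual = [1, 1, 1, -1, -1, -1, -1, -1]
predicted = [1, 1, -1, -1, -1, -1, 1, -1]
cm = confusion_matrix(actual, predicted)
```

The four counts separate the two kinds of error (missed positives and false alarms), which a single accuracy figure hides.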

27
Aspects of Machine Learning Theory
  • Aspects of machine learning theory
  • Optimal estimator properties
  • Parameter estimation and prediction uncertainty
  • Generalisation bias/variance (overfitting)
  • Model hypothesis space
  • In machine learning, we're always aiming for the
    best second-best solution
  • The model needs to be guided by, but not overfit,
    the exemplar training data
  • The model needs to represent its prediction
    uncertainty
  • Due to the effects of having a finite training set

28
Modelling as Data Compression

Data: $D = \{X, y\}$, 25 degrees of freedom (data points)
Model: $\hat{y} = x^T\theta$, 2 degrees of freedom (parameters)
  • Typically, the model represents the underlying
    signal in the data, which has been corrupted by
    noise
  • Assume data is generated by
  • $y_i = f(x_i) + n_i$
  • where $E(n) = 0$, $E(n^2) = \sigma^2$. Aim: produce an
    optimal model
  • $\hat{y}_i = E(y \mid x_i) = m(x_i)$, where $m(\cdot) = f(\cdot)$
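
One way to see why the conditional mean is the optimal model is to expand the expected squared error of an arbitrary predictor $m(x)$. The short derivation below is a sketch that assumes the noise $n$ is independent of $x$, which the slide does not state explicitly.

```latex
\begin{aligned}
E\big[(y - m(x))^2 \mid x\big]
  &= E\big[(f(x) + n - m(x))^2 \mid x\big] \\
  &= (f(x) - m(x))^2 + 2\,(f(x) - m(x))\,E(n) + E(n^2) \\
  &= (f(x) - m(x))^2 + \sigma^2
\end{aligned}
```

The first term vanishes when $m(\cdot) = f(\cdot)$, leaving the noise variance $\sigma^2$ as the error that no model can remove.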


29
Parameter Estimation
  • A/the key aspect of machine learning is parameter
    estimation
  • Once the model structure and data set are fixed,
    learning can be as simple as finding the minimum
    point of the performance function
  • Machine learning approaches often use iterative
    schemes

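At a turning point, the gradient of $f$ with respect to $\theta$ is zero.
[Figure: performance function $f$ plotted against $\theta$, showing an update step $\Delta\theta$]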
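
An iterative scheme of this kind can be sketched with plain gradient descent on a one-parameter performance function. The quadratic f, the step size `eta` and the stopping tolerance are illustrative assumptions, not values from the lecture.

```python
def gradient_descent(grad, theta0, eta=0.1, tol=1e-10, max_iter=10000):
    """Iterate theta <- theta - eta * grad(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = -eta * grad(theta)  # delta_theta points downhill
        theta += step
        if abs(step) < tol:
            break
    return theta


# Illustrative performance function f(theta) = (theta - 3)^2 + 1.
def f(theta):
    return (theta - 3.0) ** 2 + 1.0


def df(theta):
    """Derivative of f; zero at the turning point theta = 3."""
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(df, theta0=0.0)
```

For a quadratic f the minimum could be found in closed form; the iterative version matters when the model is non-linear in its parameters and no closed form exists.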
30
Bias/Variance Dilemma (Over-Modelling)
  • There is no such thing as a free lunch
  • In machine learning, whenever we select a certain
    class of models we want one that closely
    approximates f (low bias) while being insensitive
    to particular choices of D (low variance)
  • Typically, this is only achieved with lots of
    data, but sensible tricks can be used to achieve
    a free pudding

The expected prediction error has two parts: the (squared) bias of the
optimal predictor in the class, and the parameter estimation variance
due to the choice of D.
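
In symbols, the usual form of this decomposition is sketched below. It writes $m(x;\hat{\theta}_D)$ for the model fitted to a particular data set D and $\bar{m}(x)$ for its average over data sets; treating $\bar{m}$ as the optimal predictor in the class assumes an unbiased parameter estimate, which holds for linear least squares but is an assumption in general.

```latex
\begin{aligned}
E_D\big[(f(x) - m(x;\hat{\theta}_D))^2\big]
  &= \underbrace{(f(x) - \bar{m}(x))^2}_{\text{(squared) bias}}
   + \underbrace{E_D\big[(m(x;\hat{\theta}_D) - \bar{m}(x))^2\big]}_{\text{variance due to choice of } D},
\\
\bar{m}(x) &= E_D\big[m(x;\hat{\theta}_D)\big]
\end{aligned}
```

The cross term drops out because $E_D[m(x;\hat{\theta}_D) - \bar{m}(x)] = 0$ by the definition of $\bar{m}$.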
31
Lecture 12: Summary
  • Many system design tasks require a machine
    learning approach, ranging from parameter
    estimation to model selection
  • In machine learning applications, key aspects
    include data preparation, transformation and
    model validation
  • This course is focussing on the actual
    modelling/machine learning, key aspects of which
    include
  • Modelling as data compression
  • Optimal parameter estimation (optimization)
  • Prediction/parameter uncertainty (statistics)
  • Model specification and selection

32
Lecture 12: Laboratory
  • Load the data set work.dat into Matlab. This has
    two columns, the Lot Size and the Work Hours,
    and the aim is to build a linear, least squares
    regression model for the Toluca company to
    estimate the time taken to produce lots of a
    particular size (taken from Applied Linear
    Statistical Models, Neter et al).
  • With regard to the definition of machine learning
  • Clearly state the task, environment, model and
    performance
  • Is this off-line or on-line learning?
  • For a linear model, create two Matlab functions
    that allow you to
  • Train/estimate the parameters from the training
    data
  • Predict/estimate new data
  • Test the routines on the given data set and
    comment on
  • What are the optimal parameters (bias and gain)
    for the linear model?
  • Plot the linear input-output relationship as well
    as the original training data
  • What is the standard deviation of the prediction
    error and how does this relate to the range
    (standard deviation) of the target data? How
    does the error's standard deviation relate to the
    performance function on slide 7?
  • How does this model compare to simply predicting
    the average Work Hours, irrespective of
    information about the Lot Size, i.e. a model
    that just contains a bias term?
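
The shape of the two routines can be sketched as follows. The sketch is in Python rather than the Matlab the laboratory asks for, the function names `train` and `predict` are illustrative, and the data are made up (they are NOT the Toluca values in work.dat), so it shows the structure of a solution rather than the answers.

```python
from statistics import mean, pstdev


def train(x, y):
    """Least-squares estimate of (bias, gain) for y ~ bias + gain * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    gain = sxy / sxx
    bias = y_bar - gain * x_bar
    return bias, gain


def predict(x, params):
    """Predict targets for new inputs using the fitted (bias, gain)."""
    bias, gain = params
    return [bias + gain * xi for xi in x]


# Made-up lot sizes and work hours (not the Toluca data).
lot_size = [20.0, 40.0, 60.0, 80.0, 100.0]
work_hours = [110.0, 190.0, 270.0, 350.0, 440.0]

params = train(lot_size, work_hours)
residuals = [y - p for y, p in zip(work_hours, predict(lot_size, params))]
error_std = pstdev(residuals)      # spread of the linear model's errors
baseline_std = pstdev(work_hours)  # error spread of a bias-only model
```

A bias-only model predicts the mean of the targets for every lot size, so its error spread equals the standard deviation of the targets; comparing `error_std` with `baseline_std` shows how much the Lot Size input helps.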