Title: Lecture 1&2, v3.0
1. Lecture 1&2: Introduction to Intelligent Systems
- Dr Martin Brown
- Room E1k
- Email: martin.brown_at_manchester.ac.uk
- Telephone: 0161 306 4672
- http://www.eee.manchester.ac.uk/intranet/pg/course material/
2. Lecture 1&2: Outline
- Introduction to machine learning and intelligent systems
  - Intelligence and learning
  - Definition of machine learning
- Application areas
  - Face detection (classification)
  - Energy prediction (prediction/system identification)
- The machine learning process
  - Steps necessary to solve a machine learning application
- Aspects of machine learning
  - Performance, generalisation and parameter estimation
  - Simple regression, classification and clustering examples
3. Lecture 1&2: Resources
- This set of slides is largely self-contained; however, it is based on the following texts, which provide useful background/supplementary information:
  - Chapters 1-3, Machine Learning (MIT OpenCourseWare), T. Jaakkola, http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CourseHome/index.htm
  - An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000
  - Machine Learning, T. Mitchell, McGraw Hill, 1997
  - Machine Learning, Neural and Statistical Classification, D. Michie, D.J. Spiegelhalter and C.C. Taylor, 1994 (out of print, but available from http://www.amsta.leeds.ac.uk/~charles/statlog/)
4. Examples of Machine Learning
- The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience
- Systems for credit card transaction approval: the system is designed by showing it examples of good and bad (fraudulent) transactions and letting it learn the difference
- Learning to play backgammon: the world's best computer backgammon players are based on machine learning algorithms
- Systems for spam filtering: decisions about whether or not a mail is regarded as junk can be built and refined during actual usage
- Note that recursive model estimation can also be regarded as a type of machine learning
- This is a broad definition, covering everything from statistical regression and system identification to genetic algorithms
5. Definition of Machine Learning
- A computer program M is said to learn from experience E with respect to some tasks T and performance measure P, if M's performance at tasks in T, as measured by P, improves with experience E
- The task T must be described adequately in terms of measurable signals (inputs and outputs) and success criteria
- The computer program M could be a physical model whose parameters require tuning, a statistical distribution, or a set of rules
- The experience E may occur during design (off-line) or during actual operation (on-line)
- The actual, on-line performance P may be difficult to estimate if learning takes place off-line
- The experience set E needs to be rich enough that M is sufficiently exercised, should be balanced to reflect the actual usage, and should contain enough examples to estimate M sufficiently accurately
6. Predict/Learn Cycle
- Task T: specify measurable inputs and outputs
- Experience E is a data set D = {X, y} of measurements and desired targets
- Model M, which is parameterized by the unknown vector θ
- Prediction: ŷ = m(x, θ)
- Performance P compares M's predictions ŷ against the targets y
- Learning: Δθ = f(ŷ, y)
- This is the most commonly applied view of machine learning: supervised learning
- It's also worthwhile remembering the statistician's view that "all models are wrong, just some are more useful than others"
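Concretely, the cycle above can be sketched in Matlab for a linear model with a quadratic performance function and a simple gradient-based learning step. Everything here (the data, the "true" parameters, the learning rate eta) is an illustrative assumption, not from the lecture:

    % Experience E: a hypothetical data set D = {X, y}
    X = [ones(25,1) rand(25,1)];       % measurable inputs, with a bias column
    y = X*[62; 3.5] + randn(25,1);     % targets from an assumed true model, plus noise
    % Model M: yhat = X*theta, parameterized by the unknown vector theta
    theta = zeros(2,1);
    eta = 0.5;                         % assumed learning rate
    for k = 1:500
        yhat = X*theta;                % Prediction: yhat = m(x, theta)
        e = y - yhat;                  % Performance P: compare predictions with targets
        theta = theta + eta*(X'*e)/25; % Learning: delta theta = f(yhat, y)
    end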
7. Simple, Single Regression Example
- Consider trying to build a simple, single-variable linear predictor of the relationship between job (lot) size and work hours for the Toluca company
- The data set is of the form:
  - X = [ones(25,1) work(:,1)]
  - y = work(:,2)
- Linear prediction model: ŷ = Xθ
- Quadratic performance function: f(θ) = ½ Σᵢ (yᵢ − ŷᵢ)²
- Parameters can be learnt/estimated as θ̂ = (XᵀX)⁻¹Xᵀy
- This is the basis for today's laboratory
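A hedged Matlab sketch of this least-squares fit, assuming work.dat holds the Toluca data with Lot Size in column 1 and Work Hours in column 2 (as described in the laboratory on slide 32):

    work = load('work.dat');        % 25 rows: [Lot Size, Work Hours]
    X = [ones(25,1) work(:,1)];     % design matrix with a bias column
    y = work(:,2);
    theta = (X'*X)\(X'*y);          % least-squares minimiser of the quadratic performance
    yhat = X*theta;                 % fitted Work Hours
    res_std = std(y - yhat);        % standard deviation of the prediction error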
8. History of Machine Learning
- Rosenblatt 1956: Perceptron
- Minsky and Papert 1969: Critique of basic Perceptron theory
- Holland 1975: Genetic algorithms
- Barto & Sutton 1985: Proposed basic reinforcement learning algorithms
- Rumelhart 1986: Development of back-propagation, a gradient descent procedure for multi-layer perceptrons
- MacKay 1992: Bayesian MLPs (energy prediction winner)
- Tesauro 1992: Backgammon application, world class
- Heckerman 1994: Bayesian networks (medical, Win95 applications)
- Vapnik 1995: Support vector machines
- Neal 1996: Gaussian processes
- Jordan 1997: Variational learning and Bayesian nets
9. Rationale for Machine Learning
- Poor understanding of the optimal model that maximises the performance P. The optimal model may be the true physical model or a statistical model with known parameters
- In practice, there is always some uncertainty about effects not included in the model, or about the structure of the model itself; hence the need to learn/adapt during design
- Similarly, the model may change during operation, or training information about the optimal behaviour may only be weakly available; hence the need to learn during operation
10. Relationship to System Identification
- This is obviously closely related to system identification, so it's worthwhile considering what the differences are
- System identification is largely concerned with linear, dynamical prediction based on real-valued, time-delayed variables
  - The models are described by their structures and their unknown parameter vectors, which are tuned to best fit the data
- Intelligent systems is largely concerned with non-linear classification and prediction problems, where the variables may be real-valued or categorical
  - Again, the models are described by their structure and their unknown parameter vectors, which are tuned to best fit the data
- In practice, there is a reasonable amount of overlap between the two areas
11. Application 1: Face Detection
- Task: given an arbitrary image, which could be a digitized video signal or a scanned photograph, classify image regions according to whether they contain human faces
- Data: a representative sample of images containing examples of faces and non-faces. Input features are derived from the raw pixel values; binary class labels are assigned by experts
- Rationale for machine learning: most pixel-based detection problems are hard, due to significant pattern variations that are difficult to parameterise analytically
- Reference: Support Vector Machines: Training and Applications, Osuna, Freund and Girosi, MIT AI Memo 1602, 1997
12. Faces: Feature Extraction
The aim is, given a 19×19 pixel window (283 features), to determine whether the normalized image window contains a face (+1) or not (-1). The pipeline is:
- Rescale the image several times (scale invariance)
- Cut 19×19 window patterns (location detection)
- Preprocessing: light correction, histogram equalization (brightness invariance)
- Classify using an SVM
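Schematically, the pipeline might look like the following Matlab sketch. Here svm_classify is a hypothetical placeholder for the trained SVM, imresize assumes the Image Processing Toolbox, and the scales, stride and file name are illustrative choices:

    img = double(imread('photo.png'));     % hypothetical greyscale input image
    for scale = [1.0 0.8 0.64 0.5]         % rescale the image several times (scale invariance)
        im = imresize(img, scale);
        for r = 1:4:size(im,1)-18          % slide a 19x19 window (location detection)
            for c = 1:4:size(im,2)-18
                w = im(r:r+18, c:c+18);
                w = (w - mean(w(:)))/std(w(:));  % crude brightness normalisation
                label = svm_classify(w(:));      % placeholder classifier: +1 face, -1 not
            end
        end
    end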
13. Faces: Classification Problem
- Determine whether the 19×19 window contains a face
- Support vector machines only use the examples closest to the decision boundary; the rest are ignored
- The training set contained 2,000 examples
[Figure: two-dimensional illustration of the SVM decision boundary, with axes x1 and x2]
14. Faces: Results and Comments
- Use a support vector machine classifier with polynomial kernels of degree 2
- Number of parameters in the polynomial: ~40,000!
- Test on two data sets:
  - A: 313 images, 313 faces (one per image; mug shots), 4.5M windows
  - B: 23 images, 155 faces, 5.5M windows
15. Faces: Example Classifications
16. Application 2: Building Energy Prediction
- Task: create (time series) prediction model(s) which can predict the energy load from a series of environmental factors
- Data: four months of actual energy usage was recorded and used to estimate a predictive model
- Rationale: a physical model of energy usage is too complex to estimate; often the best guide is prior experience, or matching to similar buildings already in existence
- Reference: Bayesian Non-linear Modelling for the Prediction Competition, MacKay, Technical Report, Cambridge University, 1994
17. Energy: Task Details and Data
- The data set consisted of hourly measurements from 1/9/89 to 31/12/89 of four environmental (input) variables:
  - Temperature
  - Humidity
  - Solar flux
  - Wind
- as well as three dependent targets:
  - A1: Electricity
  - A2: Cooling water
  - A3: Heating water
- There were a total of 2,926 training data points, and the aim was to predict hourly usage of the three targets over the next 54 days (1,282 test data points)
18. Energy: Data Preparation
[Figure: recorded electricity usage plotted against hours]
- Data preparation involved determining the time delay associated with the environmental values, as well as deciding how best to represent them (raw values, rotated projections, exponential averaging); see the sketch below
- Outlier rejection: some of the data actually included a burst pipe, and the standard model should not be trained on these points
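As an example of one such representation, a minimal Matlab sketch of exponential averaging applied to an environmental input; the file name and smoothing factor alpha are assumptions, not values from the report:

    temp = load('temperature.dat');   % hypothetical hourly temperature series
    alpha = 0.9;                      % assumed smoothing factor
    ema = zeros(size(temp));
    ema(1) = temp(1);
    for t = 2:length(temp)
        % exponentially weighted average: smooths noise, tracks slow trends
        ema(t) = alpha*ema(t-1) + (1 - alpha)*temp(t);
    end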
19. Energy: Modelling Approach
- The inputs included the four environmental factors at:
  - the current time and at 2.5, 24 and 72 hour delays (see the sketch below)
- Categorical inputs denoting:
  - day, week, holiday, year
- The modelling algorithm was a Bayesian multi-layer perceptron. Many models were trained and averaged to get robust performance, with some indication of the expected error
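A sketch of how such delayed inputs might be stacked into a design matrix in Matlab; the 2.5 hour delay is rounded to 3 samples here for simplicity, and all file and variable names are illustrative:

    env = load('environment.dat');    % hypothetical T-by-4 matrix of hourly factors
    lags = [0 3 24 72];               % current value plus approx. 2.5, 24 and 72 hour delays
    T = size(env,1);
    X = [];
    for d = lags
        % shift each factor back by d hours; early rows repeat the first measurement
        X = [X env(max(1, (1:T)' - d), :)];
    end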
20. Energy: Prediction and Uncertainty
[Figure: predicted cooling water usage plotted against hours, with error bars]
- As the designer is not certain about the model structure or the type of features, there will be some discrepancy between the predictions made by different models
- These can be averaged to get a mean prediction, and their standard deviation can be used to give the prediction error bars
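A minimal Matlab sketch of this committee idea, where preds is assumed to be a T-by-K matrix whose columns are the predictions of K independently trained models (randomly generated here for illustration):

    preds = randn(100, 10) + 5;       % hypothetical: 100 time steps, 10 trained models
    mean_pred = mean(preds, 2);       % committee (mean) prediction
    err_bar = std(preds, 0, 2);       % model-to-model spread gives the error bars
    upper = mean_pred + 2*err_bar;    % approximate 95% band, assuming a Gaussian spread
    lower = mean_pred - 2*err_bar;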
21. Energy: Results and Conclusions
- The winning entry in this competition was created using the following data modelling philosophy: "use huge flexible models, including all possibilities that you can imagine might be appropriate; control the flexibility of these models using sophisticated priors; and use Bayes as a helmsman to guide the search through model space"
- A physical model would have been very difficult to produce, whereas an empirical regression model could be estimated directly from the recorded data
- Independent test data:
  - A1: σ(ŷ − y) = 65
  - A2: σ(ŷ − y) = 0.64
  - A3: σ(ŷ − y) = 0.53
22. Data Modelling Methodology
1. Define task: specify goal and ROI to guide accuracy
2. Data model: define relevant data sources and attributes
3. Transform: consolidate, clean and transform data
4. Exploratory data analysis: analyse ranges, distributions and simple relationships (correlations)
5. Model building: association, classification, clustering and prediction
6. Validation: assess model accuracy, find important relationships, try alternative models
23. Step 2: Data Models
- Decide which informative, measurable variables will be used in this project and select their source(s):
  - Selecting which attributes/variables to use
  - Deriving a table that contains those attributes/variables
  - Defining the structure of the x -> y mapping
24. Step 4: Exploratory Data Analysis
- Understand the distributions of each variable, and simple relationships (correlations) between variables, during off-line data exploration
- Visualise the distributions associated with each variable in the system:
  - Continuous variable: distribution
  - Categorical variable: histogram
- Very important for things like classification, when one of the class labels is rare (e.g. fraudulent transactions)
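A small Matlab sketch of these checks on hypothetical data, using hist for both the continuous distributions and the categorical counts:

    data = randn(1000, 3);                 % hypothetical continuous variables
    labels = (rand(1000,1) < 0.02) + 1;    % hypothetical labels with a rare class (~2%)
    for j = 1:size(data,2)
        figure; hist(data(:,j), 30);       % continuous variable: distribution
    end
    figure; hist(labels, 1:2);             % categorical variable: histogram
    fprintf('rare class frequency: %.1f%%\n', 100*mean(labels == 2));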
25. Step 5: Machine Learning Techniques
- Summarise a data set, using a model, in order to extract useful relationships:
- Associations
  - Identify items that occur together within a transaction
  - Supermarket goods, patent authors, viewed web pages
- Classification
  - Predict whether an image window contains a face
  - Awarding credit; will a user complete a web transaction?
- Clustering
  - Identify a small number of groups that contain similar records
  - Customer segmentation, web page grouping
- Regression
  - Predict a real value
  - Time series analysis, lifetime value prediction
26. Step 6: Model Analysis & Validation
- When building a model to describe relationships within the data, you need to perform model analysis and validation in order to estimate its performance when applied to unseen data
- Models are not 100% accurate!
- This is due to limited:
  - Features (cannot measure everything)
  - Data examples (cannot sample every case)
- The ROI of the data mining project will depend on the accuracy of the classification/prediction models; therefore, this needs to be carefully estimated:
  - Error analysis (e.g. 95% accurate)
  - Confusion matrices
  - Lift charts
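A minimal Matlab sketch of one of these checks, a two-class confusion matrix and accuracy estimate on held-out data; the label vectors are hypothetical:

    ytrue = [ 1  1 -1 -1  1 -1 -1 -1]';   % hypothetical test labels (+1/-1)
    ypred = [ 1 -1 -1 -1  1 -1  1 -1]';   % hypothetical model predictions
    tp = sum(ytrue ==  1 & ypred ==  1);
    fn = sum(ytrue ==  1 & ypred == -1);
    fp = sum(ytrue == -1 & ypred ==  1);
    tn = sum(ytrue == -1 & ypred == -1);
    C = [tp fn; fp tn];                   % confusion matrix: rows = truth, columns = prediction
    accuracy = (tp + tn)/numel(ytrue);    % overall accuracy estimate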
27. Aspects of Machine Learning Theory
- Aspects of machine learning theory:
  - Optimal estimator properties
  - Parameter estimation and prediction uncertainty
  - Generalisation: bias/variance (overfitting)
  - Model hypothesis space
- In machine learning, we're always aiming for the best "second best" solution
- The model needs to be guided by, but not overfit, the exemplar training data
- The model needs to represent its prediction uncertainty
  - Due to the effects of having a finite training set
28. Modelling as Data Compression
- D = {X, y}: 25 degrees of freedom (data points)
- ŷ = xᵀθ: 2 degrees of freedom (parameters)
- Typically, the model represents the underlying signal in the data, which has been corrupted by noise
- Assume the data is generated by yᵢ = f(xᵢ) + nᵢ, where E(n) = 0 and E(n²) = σ²
- Aim: produce an optimal model ŷᵢ = E(y|xᵢ) = m(xᵢ), where m(·) ≈ f(·)
29. Parameter Estimation
- A key aspect of machine learning is parameter estimation
- Once the model structure and data set are fixed, learning can be as simple as finding the minimum point of the performance function
- Machine learning approaches often use iterative schemes, as sketched below
[Figure: performance function f plotted against θ, with iterative updates Δθ converging to the turning point where ∂f/∂θ = 0]
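A minimal Matlab sketch of such an iterative scheme: gradient descent on an illustrative one-dimensional quadratic performance function (the function, start point and step size are all assumptions):

    f = @(q) 0.5*(q - 3).^2;     % illustrative quadratic performance function
    df = @(q) q - 3;             % its derivative
    q = 0;                       % initial parameter guess
    eta = 0.2;                   % assumed step size
    for k = 1:50
        q = q - eta*df(q);       % update: delta q = -eta * df/dq
    end
    % at the turning point df/dq = 0, so q converges towards the minimiser 3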
30. Bias/Variance Dilemma (Over-Modelling)
- There is no such thing as a free lunch
- In machine learning, whenever we select a certain class of models, we want one that closely approximates f (low bias) while being insensitive to the particular choice of D (low variance)
- Typically, this is only achieved with lots of data, but sensible tricks can be used to achieve a "free pudding"
- The expected error decomposes into:
  - the (squared) bias of the optimal predictor in the class
  - the parameter estimation variance due to the choice of D
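The dilemma can be illustrated with a Matlab Monte Carlo sketch that refits a deliberately simple model class on many random draws of D; the target function, noise level and sample sizes are illustrative assumptions:

    f = @(x) sin(2*pi*x);                % illustrative true function
    x0 = 0.3;                            % test point for the bias/variance estimates
    preds = zeros(200,1);
    for trial = 1:200
        x = rand(25,1);                  % a fresh data set D of 25 points
        y = f(x) + 0.2*randn(25,1);      % targets corrupted by noise
        theta = [ones(25,1) x] \ y;      % fit a straight line (a biased model class)
        preds(trial) = [1 x0]*theta;     % its prediction at the test point
    end
    bias2 = (mean(preds) - f(x0))^2;     % (squared) bias of the model class
    variance = var(preds);               % variance due to the choice of D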
31. Lecture 1&2: Summary
- Many system design tasks require a machine learning approach, ranging from parameter estimation to model selection
- In machine learning applications, key aspects include data preparation, transformation and model validation
- This course focusses on the actual modelling/machine learning, key aspects of which include:
  - Modelling as data compression
  - Optimal parameter estimation (optimization)
  - Prediction/parameter uncertainty (statistics)
  - Model specification and selection
32. Lecture 1&2: Laboratory
- Load the data set work.dat into Matlab. This has two columns, the Lot Size and the Work Hours, and the aim is to build a linear, least squares regression model for the Toluca company to estimate the time taken to produce lots of a particular size (taken from Applied Linear Statistical Models, Neter et al)
- With regard to the definition of machine learning:
  - Clearly state the task, environment, model and performance
  - Is this off-line or on-line learning?
- For a linear model, create two Matlab functions (see the skeletons below) that allow you to:
  - Train/estimate the parameters from the training data
  - Predict/estimate new data
- Test the routines on the given data set and comment on:
  - What are the optimal parameters (bias and gain) for the linear model?
  - Plot the linear input-output relationship as well as the original training data
  - What is the standard deviation of the prediction error, and how does it relate to the range (standard deviation) of the target data? How does the error's standard deviation relate to the performance function on slide 7?
  - How does this model compare to simply predicting the average Work Hours, irrespective of any information about the Lot Size, i.e. a model that just contains a bias term?
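One possible shape for the two requested routines, sketched as Matlab function skeletons (one per file, as usual); the names are suggestions only, and this is not the complete laboratory solution:

    function theta = train_linear(X, y)
    % TRAIN_LINEAR  Estimate the linear model parameters by least squares.
    theta = (X'*X)\(X'*y);

    function yhat = predict_linear(X, theta)
    % PREDICT_LINEAR  Evaluate the linear model on (new) input data.
    yhat = X*theta;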