Program for North American Mobility in Higher Education

1
NC STATE UNIVERSITY
Program for North American Mobility in Higher Education
Introducing Process Integration for Environmental
Control in Engineering Curricula
MODULE 17: Introduction to Multivariate Analysis
Created at École Polytechnique de Montréal and
North Carolina State University, 2003.
2
Purpose of Module 17
  • What is the purpose of this module?
  • This module provides a basic introduction to
    multivariate analysis (MVA) as it is applied to
    chemical engineering. After completing this
    module, the student should have sufficient
    understanding to apply this statistical method to
    real data.
  • The target audience for this module is:
  • Upper-year engineering students, and
  • Practising engineers, particularly those in an
    industrial setting.

3
Prerequisites for Module 17
What are the prerequisites for this module?
Before starting this module, the student must
have completed Module 8, Introduction to Process
Integration. Module 8 covers basic concepts that
are not repeated here, notably those related to
data quality. Applying MVA to real data without
understanding data quality is a recipe for
disaster: the software will generate results, but
they could be meaningless and misleading.
It is further assumed that students already have
an introductory-level background in statistics,
such as would normally be part of any
undergraduate engineering curriculum.
4
Structure of Module 17
  • What is the structure of this module?
  • Module 17 is divided into 3 tiers, each with a
    specific goal:
  • Tier 1: Basic introduction
  • Tier 2: Worked example
  • Tier 3: Open-ended problem
  • These tiers are intended to be completed in
    order. Students are quizzed at various points,
    to measure their degree of understanding, before
    proceeding.
  • Each tier contains a statement of intent at the
    beginning, and a quiz at the end.

5
TIER 1: Basic Introduction
6
Tier 1 Statement of Intent
  • Tier 1 Statement of intent
  • The goal of Tier 1 is to familiarise the student
    with the basic concepts of multivariate analysis
    (MVA). At the end of Tier 1, the student should
    be able to answer the following questions:
  • What is the difference between univariate and
    multivariate statistics?
  • Why is MVA used in a process integration context?
  • How does MVA fit into the bigger picture?
  • What are the specific types of MVA analysis?
  • Tier 1 also includes some selected readings, to
    help the student acquire a deeper understanding
    of this subject. A technique as complex as MVA
    cannot be spoon-fed. The student must begin to
    delve into this topic independently right from
    the start.

7
Tier 1 Contents
Tier 1 is broken down into two sections:
1.1 What is MVA used for?
1.2 How does MVA work?
At the end of Tier 1 there is a short
multiple-answer quiz.
8
1.1 What is MVA used for?
9
Process Integration Challenge: Make sense of
masses of data
  • Drowning in data!
  • Many organisations today are faced with the same
    challenge: TOO MUCH DATA. These include:
  • Business - customer transactions
  • Communications - website use
  • Government - intelligence
  • Science - astronomical data
  • Pharmaceuticals - molecular configurations
  • Industry - process data
  • It is the last item that is of interest to us as
    chemical engineers.

10
Too Much Process Data
A typical industrial plant has hundreds of
control loops, and thousands of measured
variables, many of which are updated every few
seconds. This situation generates tens of
millions of new data points each day, and
billions of data points each year. Obviously,
this is far too much for a human brain to absorb.
Because of the way we visualise things, we are
basically limited to looking at only one or two
variables at a time.
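The orders of magnitude quoted above can be checked with a quick calculation. The sketch below is only illustrative: the number of measured variables and the update interval are assumed values chosen to match "thousands" and "every few seconds", not figures taken from a real plant.

```python
# Rough data-volume estimate for a hypothetical plant.
# Both inputs are assumed, illustrative values.
MEASURED_VARIABLES = 2000   # "thousands of measured variables" (assumption)
UPDATE_INTERVAL_S = 5       # "updated every few seconds" (assumption)

SECONDS_PER_DAY = 24 * 60 * 60

# Number of times each variable is refreshed in one day.
updates_per_day = SECONDS_PER_DAY // UPDATE_INTERVAL_S

points_per_day = MEASURED_VARIABLES * updates_per_day
points_per_year = points_per_day * 365

print(f"{points_per_day:,} data points per day")
print(f"{points_per_year:,} data points per year")
```

With these assumptions the totals come to tens of millions of points per day and roughly ten billion per year, in line with the figures on this slide.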
11
Data-Rich but Knowledge-Poor
  • As a result, we have become data-rich but
    knowledge-poor.
  • The biggest problem is that interesting, useful
    patterns and relationships which are not
    intuitively obvious lie hidden inside enormous,
    unwieldy databases. Also, many variables are
    correlated.
  • This has led to the creation of data-mining
    techniques, aimed at extracting this useful
    knowledge. Some examples are:
  • Neural Networks
  • Multiple Regression
  • Decision Trees
  • Genetic Algorithms
  • Clustering
  • MVA (the subject of this module)

12
Data → Information → Knowledge
The aim of data-mining can be illustrated
graphically as follows:
  • Data: unrelated facts
  • Information: facts plus relations
  • Knowledge: information plus patterns

[Figure: DATA (raw numbers) leads via relations to
INFORMATION (observed associations), which leads
via patterns to KNOWLEDGE (scientific principles).
The axes are labelled Connectedness and
Understanding.]
13
Process Modelling from First Principles
Theoretical Model
Chemical engineers create two types of models to
simulate an industrial process. The first of
these is a theoretical model, which uses First
Principles to mimic the inner workings of the
process. These models are based on a process
flowsheet, and each unit operation is modelled
separately: reactors, tanks, mixers, heat
exchangers, and so forth. Heat and mass balances
are calculated, along with other thermodynamic
factors. Chemical reactions are accounted for
explicitly, as are the physical properties of the
various gas, liquid and solid streams.
14
Data-Driven Process Modelling
Empirical Model
The second type of model created by chemical
engineers is the empirical or "black-box" model.
This approach uses the plant process data
directly, to establish mathematical
correlations. Unlike the theoretical models,
empirical models do NOT take the process
fundamentals into account. They use only
mathematical and statistical techniques. MVA is
one such method, because it reveals patterns and
correlations independently of any pre-conceived
notions. Obviously this approach is very
sensitive to "garbage in, garbage out", which is
why validation of the model is so important.
15
What is MVA?
Multivariate analysis (MVA) is defined as the
simultaneous analysis of more than five
variables. Some people use the term
"megavariate analysis" to denote cases where
there are more than a hundred variables. MVA
uses ALL available data to capture the most
information possible. The basic principle is to
boil hundreds of variables down to a mere
handful.
16
Multivariate Analysis is Based on Ockham's Razor
"Pluralitas non est ponenda sine necessitate."
Rough translation: "Don't make things more
complicated than they need to be."
William of Ockham was an English monk who laid
one of the cornerstones of the Scientific Method
with his famous "razor" (so named because it
serves to cut out the unnecessary parts of a
scientific theory). Essentially, Ockham
realised back in the 14th century that deep down,
Nature is simple.
William of Ockham (1285-1347)
17
Example: Apples and Oranges
  • A good example of these ideas is "Apple versus
    Orange".
  • Clever scientists could easily come up with
    hundreds of different things to measure on apples
    and oranges, to tell them apart:
  • Colour, shape, firmness, reflectivity
  • Skin: smoothness, thickness, morphology
  • Juice: water content, pH, composition
  • Seeds: colour, weight, size distribution
  • etc.
  • However, there will never be more than one
    difference: is it an apple or an orange? In MVA
    parlance, we would say that there is only one
    latent attribute.

18
Graphical Representation of MVA
The main element of MVA is the reduction in
dimensionality. Taken to its extreme, this can
mean going from hundreds of dimensions
(variables) down to just two, allowing us to
create a 2-dimensional graph. Using these
graphs, which our eyes and brains can easily
handle, we are able to peer into the database
and identify trends and correlations. This is
illustrated on the next page.
[Figure: Peering into the data]
19
Graphical representation of MVA
[Figure: Raw data (hundreds of columns, thousands
of rows; impossible to interpret) pass through a
statistical model (internal to the software) to
give 2-D visual outputs that show trends among
the X and Y variables.]
20
Illustrative Data Set: Food Consumption in
European Countries
To illustrate these concepts, we take an
easy-to-understand example involving food. Data
on food preferences in 16 different European
countries are considered, involving the
consumption patterns for 18 different food
groups. Look at the table on the following
page. Can you tell anything from the raw
numbers? Of course not. No one could.
21
Data Table: Food Consumption in European Countries
Note that MVA can handle up to 10-20% missing data.
Courtesy of Umetrics corp.
22
Score Plot
The MVA software generates two main types of
plots to represent the data: Score plots and
Loadings plots. The first of these, the Score
plot, shows all the original data points
(observations) in a new set of coordinates or
components. Each score is the value of that data
point on one of the new component
dimensions. A score plot shows how the
observations are arranged in the new component
space. The score plot for the food data is shown
on the next page. Note how similar countries
cluster together.
The Score Plot is the projection of the original
data points onto a plane defined by two new
components.
23
Score Plot for Food Example
[Figure: Score plot of the observations, with the
95% confidence interval (analogous to a t-test)
marked.]
24
Loadings Plot
The second type of data plot generated by the MVA
software is the Loadings plot. This is the
equivalent of the score plot, only from the point
of view of the original variables. Each
component has a set of loadings or weights, which
express the projection of each original variable
onto each new component. Loadings show how
strongly each variable is associated with each
new component. The loadings plot for the food
example is shown on the next page. The further
from the origin, the more significant the
correlation. Note that the quadrants are the
same on each type of plot. Sweden and Denmark
are in the top-right corner; so are frozen fish
and vegetables. Using both plots, variables and
observations can be correlated with one another.
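To make scores and loadings concrete, here is a minimal sketch of how they could be computed for a small table. It assumes principal component analysis (PCA), one common MVA technique, which the slides do not name at this point, and it extracts the components by simple power iteration. The data table and all function names are hypothetical; commercial MVA software will differ in its details.

```python
import math

# Hypothetical table: 8 observations (rows) x 5 variables (columns).
# Columns 0-2 were made up to move together; columns 3-4 follow a
# second, largely separate pattern (in opposite directions).
DATA = [
    [4.1, 35.5, 0.72, 23.2, 4.1],
    [5.8, 39.2, 0.79, 16.9, 5.9],
    [8.3, 45.6, 0.88, 22.7, 3.8],
    [9.9, 50.4, 1.03, 17.3, 6.2],
    [10.2, 49.3, 0.98, 23.1, 4.0],
    [11.7, 55.8, 1.12, 16.8, 6.1],
    [14.1, 59.5, 1.19, 23.4, 3.9],
    [15.9, 65.1, 1.31, 17.1, 5.8],
]


def autoscale(rows):
    """Mean-centre every column and scale it to unit standard deviation."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]


def dominant_component(x, iterations=500):
    """Power iteration on X'X: returns (loadings p, scores t) with t = X p."""
    n, m = len(x), len(x[0])
    p = [1.0] + [0.0] * (m - 1)          # arbitrary starting direction
    for _ in range(iterations):
        t = [sum(x[i][j] * p[j] for j in range(m)) for i in range(n)]
        p = [sum(x[i][j] * t[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]        # keep the loadings at unit length
    t = [sum(x[i][j] * p[j] for j in range(m)) for i in range(n)]
    return p, t


def deflate(x, p, t):
    """Subtract the part of X already described by one component."""
    return [[x[i][j] - t[i] * p[j] for j in range(len(p))]
            for i in range(len(x))]


x = autoscale(DATA)
total_ss = sum(v * v for row in x for v in row)

p1, t1 = dominant_component(x)                    # first component
p2, t2 = dominant_component(deflate(x, p1, t1))   # second component

explained1 = sum(v * v for v in t1) / total_ss
explained2 = sum(v * v for v in t2) / total_ss

print("Loadings (one row per variable): component 1, component 2")
for j, (a, b) in enumerate(zip(p1, p2)):
    print(f"  variable {j}: {a:+.2f} {b:+.2f}")
print("Scores (one row per observation): component 1, component 2")
for i, (a, b) in enumerate(zip(t1, t2)):
    print(f"  observation {i}: {a:+.2f} {b:+.2f}")
print(f"Fraction of variance explained: {explained1:.2f}, {explained2:.2f}")
```

Plotting the two score columns against each other would give a score plot, and plotting the two loading columns would give the matching loadings plot. The sign of each component is arbitrary, so a plot may appear mirrored from one program to another.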
25
Use of loadings (illustration)
[Figure: Loadings plot of the variables, showing
the projection of the old variables onto the new
components.]
26
To MVA, Data Overload is Good!
One great advantage of MVA is that the more data
are available, the less noise matters (assuming
that the noise is normally distributed). This is
one of the reasons MVA is used to mine huge
amounts of data. This is analogous to NMR
measurements in a laboratory. The more trials
there are, the clearer the spectrum becomes.
[Figure: three numbered spectra, with the captions
"Looks random", "After 1500 trials?" and "Not
random at all (+ve and -ve noise cancels out)".]
27
Too Much Data is Good!
Another analogy is the toy compass that used to
be given as a prize in a box of Cracker
Jack. One of these compasses alone was next to
useless. However, if somebody had a thousand
compasses and took an average, a useful result
might be obtained.
Dictionary time: Look up the definitions of
"induction" and "deduction".
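The compass analogy can be tried out numerically. The sketch below assumes that each toy compass reads the true heading plus normally distributed noise; the heading, the noise level and the number of repetitions are made-up values. For noise of this kind the error of an average of N readings shrinks roughly in proportion to 1/sqrt(N).

```python
import random
import statistics

random.seed(17)  # fixed seed so that the sketch is repeatable

TRUE_HEADING_DEG = 40.0  # hypothetical true bearing
NOISE_SD_DEG = 30.0      # assumed error of a single toy compass


def average_reading(n_compasses):
    """Average n noisy, normally distributed readings of the same heading.

    Readings are treated as plain numbers; wrap-around at 0/360 degrees
    is ignored, which is acceptable for this illustration only.
    """
    readings = [random.gauss(TRUE_HEADING_DEG, NOISE_SD_DEG)
                for _ in range(n_compasses)]
    return statistics.fmean(readings)


typical_error = {}
for n in (1, 10, 100, 1000):
    # Repeat each experiment 200 times to estimate the typical error.
    errors = [abs(average_reading(n) - TRUE_HEADING_DEG) for _ in range(200)]
    typical_error[n] = statistics.fmean(errors)
    print(f"{n:>5} compasses: typical error {typical_error[n]:6.2f} degrees")
```

A single compass is typically off by tens of degrees, while the average of a thousand is typically off by about a degree, which is the point of the analogy.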
28
Multivariate Analysis Benefits
What is the point of doing MVA?
The first potential benefit is to explore the
inter-relationships between different process
variables. It is well known that simply creating
a model can provide insight into the process
itself ("Learn by modelling").
Once a representative model has been created, the
engineer can perform "what if?" exercises without
affecting the real process. This is a low-cost
way to investigate options.
Some important parameters, like final product
quality, cannot be measured in real time. They
can, however, be inferred from other variables
that are measured on-line. When incorporated in
the process control system, this "inferential
controller" or "soft sensor" can greatly improve
process performance.
29
MVA is Different to Neural Networks
  • Both are data-driven black box models
  • Both learn using real data
  • However, what is inside the black box is totally
    different (NN is non-linear)
  • Neural Networks seek to reproduce the
    neuron-to-neuron linkages in the brain
  • Much as genetic algorithms seek to reproduce
    Darwinian evolution

30
Reading List
There is no "paint-by-numbers" way to learn MVA.
Students are strongly encouraged to read the
following papers, in order to begin to develop an
independent understanding of what MVA is used for
and how it works. After doing this on-line
course, reading the references and playing around
with real data, the student should at some point
experience a "Eureka!" moment when suddenly MVA
makes sense. Unfortunately, there is no shortcut
to achieving this insight.
  • Broderick, G., J. Paris, J.L. Valade and J. Wood.
    Applying Latent Vector Analysis to Pulp
    Characterization, Paperi ja Puu 77(6-7): 410-419.
  • Saltin, J. F., and B. C. Strand. Analysis and
    Control of Newsprint Quality and Paper Machine
    Operation Using Integrated Factor Networks, Pulp
    and Paper Canada 96(7): 48-51.
31
Reading List (contd)
  • Kooi, S. Adaptive Inferential Control of Wood
    Chip Refiner, Tappi Journal 77(11): 185-194.
  • Kresta, J. V., T. E. Marlin and J. F. MacGregor
    (1994). Development of Inferential Process
    Models Using PLS, Computers and Chemical
    Engineering 18(7): 597-611.
  • Marklund, A. Prediction of Strength Parameters
    for Softwood Kraft Pulps, Nordic Pulp & Paper
    Research Journal 13(3): 211-219.
  • Tessier, P., G. Broderick, P. Plouffe (2001).
    Competitive Analysis of North American Newsprint
    Producers Using Composite Statistical Indicators
    of Product and Process Performance, TAPPI
    Journal 84(3).
32
1.2 How does MVA work?
33
Basic Statistics
It is assumed that the student is familiar with
the following basic statistical concepts:
  • Mean / median / mode
  • Standard deviation / variance
  • Normality / symmetry
  • Degree of association: correlation coefficients
  • Degree of explanation: R², F-test
  • Significance of differences: t-test, Chi-square

If not, or if it's been a while, it is advisable
to consult an introductory statistics text and do
a cursory review.
34
Statistical Tests
  • Classical statistics is severely hampered by
    certain assumptions about data:
  • All values are accurate
  • All variables are uncorrelated
  • There are no missing data
  • For real process data, such assumptions are
    totally unrealistic.

Statistical tests help characterise an existing
dataset. They do NOT enable you to make
predictions about future data. For this we must
turn to regression techniques.
35
Regression
Regression can be summarised as follows:
  • Take a set of data points, each described by a
    vector of values $(y, x_1, x_2, \ldots, x_n)$.
  • Find an algebraic equation
    $y = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e$
    that best expresses the relationship between
    $y$ and the $x_i$.
  • This equation can be used to predict a new
    $y$-value given new $x_i$.
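The three steps above can be sketched in a few lines of code. This is a minimal illustration rather than a recommended implementation: the data are synthetic, the "true" coefficients and noise level are invented, and the helper names (`solve_linear_system`, `fit`, `predict`) are our own. The coefficients are found by solving the least-squares normal equations, and the model has no intercept, matching the equation as written on this slide.

```python
import random


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fit(xs, ys):
    """Least-squares b for y = b1*x1 + ... + bn*xn + e (no intercept)."""
    k = len(xs[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(b, x_new):
    """Apply the fitted equation to a new vector of x-values."""
    return sum(bi * xi for bi, xi in zip(b, x_new))


# Step 1: a set of data points (synthetic; coefficients and noise made up).
random.seed(8)
TRUE_B = [2.0, -1.0, 0.5]
xs = [[random.uniform(0, 10) for _ in TRUE_B] for _ in range(50)]
ys = [predict(TRUE_B, row) + random.gauss(0, 0.1) for row in xs]

# Step 2: find the equation that best fits the data.
b_hat = fit(xs, ys)
print("estimated coefficients:", [round(b, 3) for b in b_hat])

# Step 3: predict a new y-value from new x-values.
print("prediction for x = (1, 2, 3):", round(predict(b_hat, [1, 2, 3]), 3))
```

Because the synthetic noise is small, the estimated coefficients land close to the invented "true" values; with real process data the match would be less tidy.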

36
Independent vs. Dependent Variables
  • The $x_i$ in the preceding equation are called
    independent variables. They are used to predict
    $y$.
  • $y$ is called the dependent variable, because the
    way the equation is written, its value depends on
    the $x_i$.

37
Simple vs. Multiple Regression
  • Simple regression has only one $x$:
    $y = bx + e$
  • Multiple regression has more than one $x$:
    $y = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + e$

38
Linear vs. Nonlinear Regression
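  • Linear regression involves no powers of $x_i$
    (square, cube etc.) and no cross-product terms of
    the form $x_i x_j$.
  • If such terms are present, we are dealing with
    nonlinear regression. (Such an equation is
    nonlinear in the $x_i$ but still linear in the
    coefficients $b_i$.)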
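As a small illustration of a squared term, the sketch below fits $y = b_1 x + b_2 x^2$ to made-up, noise-free data. The equation is nonlinear in $x$, but because it is still linear in the coefficients, $x$ and $x^2$ can simply be treated as two separate variables. The data and variable names are illustrative assumptions only.

```python
# Exact, noise-free illustrative data generated from y = 3*x + 0.5*x**2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 0.5 * x ** 2 for x in xs]

# Treat x and x**2 as two separate "variables" in y = b1*x1 + b2*x2 + e.
x1 = xs
x2 = [x ** 2 for x in xs]

# Sums needed for the two least-squares normal equations.
s11 = sum(a * a for a in x1)
s12 = sum(a * b for a, b in zip(x1, x2))
s22 = sum(b * b for b in x2)
s1y = sum(a * y for a, y in zip(x1, ys))
s2y = sum(b * y for b, y in zip(x2, ys))

# Solve the 2 x 2 system with Cramer's rule.
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(f"b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Since the data contain no noise, the fit recovers the coefficients used to generate them.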

39
The Error Term e
  • The error term expresses the uncertainty in an
    empirical predictive equation derived from
    imperfect observations.
  • Factors contributing to the error term include:
  • measurement error
  • measurement noise
  • unaccounted-for natural variations
  • disturbances to the process being measured

40
The Least Squares Principle
  • Regression tries to produce a "best fit"
    equation, but what is "best"?
  • Criterion: minimize the sum of squared deviations
    of data points from the regression line.
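For the simple model $y = bx + e$, the criterion can be worked through directly. Setting the derivative of $\sum_i (y_i - b x_i)^2$ with respect to $b$ to zero gives $b = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below, using made-up observations, checks that this value of $b$ gives a smaller sum of squared deviations than nearby values.

```python
# Made-up observations that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_squared_deviations(b):
    """Sum of squared vertical distances between the data and y = b*x."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))


# Least-squares estimate for the no-intercept model y = b*x + e.
b_best = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Compare the least-squares slope with slightly different slopes.
for b in (b_best - 0.1, b_best, b_best + 0.1):
    print(f"b = {b:.3f}: sum of squared deviations = "
          f"{sum_squared_deviations(b):.4f}")
```

Any slope other than `b_best` gives a larger sum of squared deviations, which is what "best fit" means under this criterion.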
