Title: Fitting Equations to Data
- Suppose that we have a single dependent variable Y (continuous numerical) and one or several independent variables X1, X2, X3, ... (also continuous numerical, although there are techniques that allow you to handle categorical independent variables).
- The objective will be to fit an equation to the data collected on these measurements that explains the dependence of Y on X1, X2, X3, ...
Example
- Data collected on n = 110 countries
- Some of the variables:
- Y = infant mortality
- X1 = population size
- X2 = population density
- X3 = urban
- X4 = GDP
- etc.
- Our interest is in determining how Y is related to X1, X2, X3, X4, etc.
What is the value of these equations?
Equations give very precise and concise descriptions (models) of data and of how dependent variables are related to independent variables.
Examples
- Linear models: Y = blood pressure, X = age
- Y = aX + b + e
- Exponential growth or decay models
- Y = average of the 5 best times for the 100m during an Olympic year, X = the Olympic year.
- Note the presence of the random error term e (random noise).
- This is an important term in any statistical model.
- Without this term the model is deterministic and does not require statistical analysis.
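To make the role of the error term concrete, here is a minimal Python sketch (not part of the original slides) that simulates data from the linear model Y = aX + b + e and recovers the parameters by least squares; all numbers are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate the linear model Y = a*X + b + e with e ~ N(0, s^2)
    a_true, b_true, s = 1.5, 60.0, 5.0      # hypothetical parameters
    x = rng.uniform(20, 70, size=50)        # e.g., age
    y = a_true * x + b_true + rng.normal(0, s, size=50)  # e.g., blood pressure

    # Least squares fit of a straight line
    a_hat, b_hat = np.polyfit(x, y, deg=1)
    print(f"slope a = {a_hat:.3f}, intercept b = {b_hat:.3f}")

Without the random term e the simulated y values would lie exactly on the line and there would be nothing to estimate statistically.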
What is the value of these equations?
- Equations give very precise and concise descriptions (models) of data and of how dependent variables are related to independent variables.
- The parameters of the equations usually have very useful interpretations relative to the phenomenon that is being studied.
- The equations can be used to calculate and estimate very useful quantities related to the phenomenon: relative extrema, and future or out-of-range values of the phenomenon.
- Equations can provide the framework for comparison.
The Multiple Linear Regression Model
- Again we assume that we have a single dependent variable Y and p (say) independent variables X1, X2, X3, ..., Xp.
- The equation (model) that generally describes the relationship between Y and the independent variables is of the form
- Y = f(X1, X2, ..., Xp; q1, q2, ..., qq) + e
- where q1, q2, ..., qq are unknown parameters of the function f and e is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation s).
- In Multiple Linear Regression we assume the following model:
- Y = b0 + b1 X1 + b2 X2 + ... + bp Xp + e
- This model is called the Multiple Linear Regression Model.
- Here b0, b1, b2, ..., bp are unknown parameters of the model and e is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation s.
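As a sketch of what fitting this model involves (again, not from the original slides; the data below are simulated purely for illustration), the least squares estimates can be computed using a design matrix whose first column is all ones to carry the intercept b0:

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data for Y = b0 + b1*X1 + b2*X2 + e (values are made up)
    n = 30
    X = rng.uniform(0, 10, size=(n, 2))            # columns are X1 and X2
    y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, n)

    # Design matrix: a leading column of ones carries the intercept b0
    X_design = np.column_stack([np.ones(n), X])
    b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    print("estimates of (b0, b1, b2):", b_hat.round(3))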
The importance of the Linear model
- 1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y. When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is sometimes the first model to be fitted and is only abandoned if it turns out to be inadequate.
- 2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.
- 3. Many non-linear models can be put into the form of a linear model by appropriately transforming the dependent variable and/or any or all of the independent variables. This important fact (i.e., that many non-linear models are linearizable) ensures the wide utility of the linear model.
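For instance, the exponential decay model Y = A e^(kX) becomes linear after taking logarithms: ln Y = ln A + kX. A small Python sketch of this linearization (illustrative only, with made-up numbers):

    import numpy as np

    rng = np.random.default_rng(2)

    # Exponential decay with multiplicative noise: Y = A * exp(k*X) * error
    A_true, k_true = 12.0, -0.05            # hypothetical parameters
    x = np.linspace(0, 40, 25)
    y = A_true * np.exp(k_true * x) * rng.lognormal(0, 0.05, size=x.size)

    # ln(Y) = ln(A) + k*X is linear, so fit a straight line to (X, ln Y)
    k_hat, logA_hat = np.polyfit(x, np.log(y), deg=1)
    print(f"A = {np.exp(logA_hat):.3f}, k = {k_hat:.4f}")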
An Example
- The following data comes from an experiment investigating the source from which corn plants in various soils obtain their phosphorous. The concentration of inorganic phosphorous (X1) and the concentration of organic phosphorous (X2) were measured in the soil of n = 18 test plots. In addition, the phosphorous content (Y) of corn grown in the soil was also measured. The data is displayed below.
[Data table shown on the original slides.]
Equation: Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
Least Squares for Multiple Regression
- Assume we have taken n observations on Y:
- y1, y2, ..., yn
- for n sets of values of X1, X2, ..., Xp:
- (x11, x12, ..., x1p)
- (x21, x22, ..., x2p)
- ...
- (xn1, xn2, ..., xnp)
- For any choice of the parameters b0, b1, b2, ..., bp, the residual sum of squares is defined to be
- RSS(b0, b1, ..., bp) = Σ [yi − (b0 + b1 xi1 + b2 xi2 + ... + bp xip)]², summed over i = 1, ..., n
- The Least Squares estimators of b0, b1, b2, ..., bp are chosen to minimize the residual sum of squares.
To achieve this we solve the following system of equations.
Setting the partial derivative of the residual sum of squares with respect to each parameter bj equal to zero gives, for j = 0, 1, ..., p (taking xi0 = 1):
Σ xij [yi − (b0 + b1 xi1 + ... + bp xip)] = 0
This is a system of (p + 1) linear equations in the (p + 1) unknowns b0, b1, ..., bp; in matrix form, (X'X) b = X'Y. These equations are called the Normal equations. The solutions are called the least squares estimates.
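In matrix notation the Normal equations can be solved directly, as in this Python sketch (the data here are simulated stand-ins, since the sketch is only meant to show the computation):

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated data with p = 2 independent variables (values are made up)
    n, p = 18, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 25, size=(n, p))])
    y = X @ np.array([50.0, 1.8, 0.1]) + rng.normal(0, 8, n)

    # Solve the Normal equations (X'X) b = X'Y
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print("least squares estimates:", b_hat.round(4))

In practice a routine such as numpy.linalg.lstsq is usually preferred, since it is more numerically stable than forming X'X explicitly, but it gives the same answer.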
The Example
- Recall the corn phosphorous experiment: the concentration of inorganic phosphorous (X1) and the concentration of organic phosphorous (X2) were measured in the soil of n = 18 test plots, along with the phosphorous content (Y) of corn grown in the soil. The data is displayed below.
The Normal equations
[The numerical Normal equations computed from this data are shown on the original slides.]
Their solution gives the least squares estimates in the fitted equation below.
Equation: Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
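The fitted equation can then be used to predict phosphorous content for given soil concentrations, as in this small sketch (the input values are made up, since the original data table is not reproduced here):

    # Fitted equation from the slides
    def predict_phosphorous(x1, x2):
        return 56.2510241 + 1.78977412 * x1 + 0.08664925 * x2

    print(predict_phosphorous(10.0, 40.0))  # hypothetical soil concentrations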
Summary of the Statistics used in Multiple Regression
- The Least Squares Estimates: the values of b0, b1, ..., bp that minimize the residual sum of squares RSS(b0, b1, ..., bp).
- The Analysis of Variance Table Entries
- a) Adjusted Total Sum of Squares (SSTotal): Σ (yi − ȳ)²
- b) Residual Sum of Squares (SSError): Σ (yi − ŷi)², where ŷi is the fitted value
- c) Regression Sum of Squares (SSReg): Σ (ŷi − ȳ)²
- Note:
- SSTotal = SSReg + SSError
The Analysis of Variance Table

Source       Sum of Squares   d.f.     Mean Square                      F
Regression   SSReg            p        MSReg = SSReg/p                  MSReg/s²
Error        SSError          n-p-1    MSError = SSError/(n-p-1) = s²
Total        SSTotal          n-1
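The entries of this table are straightforward to compute from a fitted model; here is a minimal Python sketch (assuming X already contains a leading column of ones for the intercept):

    import numpy as np

    def anova_table(X, y):
        # X is the n x (p+1) design matrix, including the column of ones
        n, k = X.shape
        p = k - 1
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        y_hat = X @ b
        ss_total = np.sum((y - y.mean()) ** 2)
        ss_error = np.sum((y - y_hat) ** 2)
        ss_reg = ss_total - ss_error            # SSTotal = SSReg + SSError
        ms_reg = ss_reg / p
        ms_error = ss_error / (n - p - 1)       # = s^2, the error variance estimate
        return ss_reg, ss_error, ss_total, ms_reg / ms_error  # last entry is F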
Uses
- 1. To estimate s² (the error variance): use s² = MSError.
- 2. To test the hypothesis
- H0: b1 = b2 = ... = bp = 0
- using the test statistic F = MSReg / MSError.
- Reject H0 if F > Fα(p, n-p-1).
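The p-value for this test comes from the F distribution with (p, n-p-1) degrees of freedom, as in this sketch pairing with the anova_table function above:

    from scipy.stats import f

    def f_test_pvalue(F, p, n):
        # P(F(p, n-p-1) > observed F); reject H0 when this is below alpha
        return f.sf(F, p, n - p - 1)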
- 3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).
- a) R², the coefficient of determination:
- R² = SSReg/SSTotal
- = the proportion of variance in Y explained by X1, X2, ..., Xp
- 1 − R² = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp
- = SSError/SSTotal
- b) Ra², "R² adjusted" for degrees of freedom:
- Ra² = 1 − [SSError/(n−p−1)] / [SSTotal/(n−1)]
- = 1 − the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.
- c) R = √R², the Multiple correlation coefficient of Y with X1, X2, ..., Xp:
- R is the maximum correlation between Y and a linear combination of X1, X2, ..., Xp.
- Comment: The statistics F, R², Ra², and R are equivalent statistics.
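These quantities follow directly from the sums of squares, e.g. continuing the anova_table sketch above:

    import numpy as np

    def r_squared_stats(ss_reg, ss_error, ss_total, n, p):
        r2 = ss_reg / ss_total                                        # R^2
        r2_adj = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))  # Ra^2
        return r2, r2_adj, np.sqrt(r2)                                # R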
Using Statistical Packages
- To perform Multiple Regression
Using SPSS
Note: the use of another statistical package such as Minitab is similar to using SPSS.
After starting the SPSS program, the following dialogue box appears.
If you select Opening an existing file and press OK, the following dialogue box appears.
The following dialogue box then appears.
If the variable names are in the file, ask it to read the names. If you do not specify the Range, the program will identify the Range.
Once you click OK, two windows will appear:
one that will contain the output,
and the other containing the data.
To perform any statistical analysis, select the Analyze menu.
Then select Regression and Linear.
The following Regression dialogue box appears.
Select the Dependent variable Y.
Select the Independent variables X1, X2, etc.
If you select the Method Enter, all variables will be put into the equation. There are also several other methods that can be used:
- Forward selection
- Backward Elimination
- Stepwise Regression
Forward selection
- This method starts with no variables in the equation.
- It carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
- It adds the most significant.
- It continues until all variables not in the equation have no significant effect on the dependent variable.
Backward Elimination
- This method starts with all variables in the equation.
- It carries out statistical tests on variables in the equation to see which have no significant effect on the dependent variable.
- It deletes the least significant.
- It continues until all variables in the equation have a significant effect on the dependent variable.
Stepwise Regression (uses both forward and backward techniques)
- This method starts with no variables in the equation.
- It carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
- It then adds the most significant.
- After a variable is added, it checks to see if any variables added earlier can now be deleted.
- It continues until all variables not in the equation have no significant effect on the dependent variable.
- All of these methods are procedures for attempting to find the best equation, as sketched below.
The best equation is the equation that is the simplest (not containing variables that are not important) yet adequate (containing the variables that are important).
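One simple way to implement forward selection is sketched below (illustrative only; packages such as SPSS use more refined entry and removal criteria, and the forward_select helper here is hypothetical):

    import numpy as np
    from scipy.stats import t as t_dist

    def forward_select(X, y, names, alpha=0.05):
        # Repeatedly add the variable whose t test (coefficient = 0)
        # is most significant, stopping when nothing enters at level alpha.
        n = len(y)
        chosen, remaining = [], list(range(X.shape[1]))
        while remaining:
            best = None
            for j in remaining:
                Xd = np.column_stack([np.ones(n), X[:, chosen + [j]]])
                b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
                resid = y - Xd @ b
                df = n - Xd.shape[1]
                s2 = resid @ resid / df
                cov = s2 * np.linalg.inv(Xd.T @ Xd)
                tval = b[-1] / np.sqrt(cov[-1, -1])   # t for the candidate
                pval = 2 * t_dist.sf(abs(tval), df)
                if best is None or pval < best[1]:
                    best = (j, pval)
            if best[1] >= alpha:
                break
            chosen.append(best[0])
            remaining.remove(best[0])
        return [names[j] for j in chosen]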
Once the dependent variable, the independent variables, and the Method have been selected, if you press OK the Analysis will be performed.
The output will contain the following table.
[Model Summary table shown on the original slide.]
R² and R² adjusted measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3% here).
R is the Multiple correlation coefficient (the maximum correlation between Y and a linear combination of X1, X2, X3, etc.).
The next table is the Analysis of Variance Table.
[ANOVA table shown on the original slide.]
The F test tests whether the regression coefficients of the predictor variables are all zero, namely that none of the independent variables X1, X2, X3, etc. have any effect on Y.
The final table in the output
[Coefficients table shown on the original slide.]
gives the estimates of the regression coefficients, their standard errors, and the t tests for testing whether they are zero. Note: Engine size has no significant effect on Mileage.
The estimated equation from the table below is:
[Equation shown on the original slide.]
Note the equation: Mileage decreases
- with increases in Engine Size (not significant, p = 0.432)
- with increases in Horsepower (significant, p = 0.000)
- with increases in Weight (significant, p = 0.000)
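The same analysis can be reproduced outside SPSS; for example, this Python sketch with statsmodels prints a coefficient table with standard errors, t statistics, and p-values like the one described above. The file name and DataFrame columns are assumptions, since the Mileage dataset itself is not included in the slides.

    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("cars.csv")   # hypothetical file with the car data

    # Columns assumed: Mileage, EngineSize, Horsepower, Weight
    X = sm.add_constant(cars[["EngineSize", "Horsepower", "Weight"]])
    model = sm.OLS(cars["Mileage"], X).fit()

    # summary() shows the F test, R^2 / adjusted R^2, and the coefficient
    # estimates with standard errors, t tests, and p-values
    print(model.summary())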