Multiple Regression in GIS - PowerPoint PPT Presentation


Slides: 36

Provided by: AJL2

1
Multiple Regression in GIS

Review of Regression Lab: Amare's neighbor problem

2
Multiple Regression (Chou p. 276)

Explaining the distribution of a spatial
phenomenon requires the analysis of relationships
between the phenomenon and potential explanatory
variables
The two most useful methods in GIS spatial
analysis are multiple regression analysis and
logistic regression analysis
Regression analysis is used to examine the
relationship between the study phenomenon and
multiple explanatory variables
Y = b0 + b1X1 + b2X2 + ... + bnXn + e

3
Multiple Regression

Y = b0 + b1X1 + b2X2 + ... + bnXn + e
where Y denotes the dependent variable and the Xs
are explanatory variables; b0 represents the
intercept, bn denotes the estimated parameter
of the variable Xn, and e is the randomly
distributed error term
Single most widely used statistical technique in
the social sciences
Multiple regression can be implemented in matrix
form (see the MathCAD example)

4
Multiple Regression

We are now going to kick things up a notch. In
many cases, more than one independent variable is
needed to explain a dependent variable.
For example, income is probably not related only
to education level. It may also have something
to do with the number of years a person has
worked, or the industry a person is in.
For example, in our sample plot, there were three
points with the same X value but different Y
values. How can this be possible if a given
treatment should produce a given yield? It's
probably because there is some error, as we
discussed before, but it's very likely that
there are some other explanatory variables that
we haven't accounted for.

5
Multiple Regression

Multiple regression is simply an extension of the
simple linear model. The difference is that
there are more independent variables
The assumptions for multiple regression are
similar to simple linear regression
The average value of the dependent variable Y is
a linear combination of the independent
variables.
The only random component is the error term e,
and the independent variables are assumed to be
fixed, and independent of e.
Errors between observations are uncorrelated, and
normally distributed with a mean of zero and a
constant variance σ².
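
Stated compactly, the three assumptions above can be written as follows. This is a restatement in standard notation, not something taken from the slides; the subscripts i and j index observations and are introduced here for illustration.

```latex
E[Y \mid X_1, \dots, X_n] = b_0 + b_1 X_1 + \dots + b_n X_n,
\qquad
e_i \sim N(0, \sigma^2),
\qquad
\operatorname{Cov}(e_i, e_j) = 0 \quad (i \neq j)
```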

6
Multiple Regression

While we were able to solve for a single
explanatory variable using a spreadsheet, as we
add more explanatory variables, it will be harder
to solve. This is especially true when the
number of independent variables is greater than
3.
However, through the use of matrices, the task is
far less difficult. We can structure the
regression equations in matrix form as follows:
Y = Xb + e
where Y is the vector of observed values of the
dependent variable, X is the observation matrix
of independent variables, b is the vector of
unknown parameters, and e is the vector of errors.

7
Multiple Regression

So, if we had 4 observations and three
explanatory variables, our matrices would look
like the following
The reason we have a 1 in the first column is
because we have to include the intercept
parameter. Therefore, we really have 4 unknowns
to solve for (the coefficient for each
explanatory variable, and the intercept)
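
The matrix figure from this slide did not survive in the transcript. A plausible reconstruction for 4 observations and three explanatory variables, assuming the usual layout with X_ij as the value of variable j for observation i, would be:

```latex
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & X_{13} \\
1 & X_{21} & X_{22} & X_{23} \\
1 & X_{31} & X_{32} & X_{33} \\
1 & X_{41} & X_{42} & X_{43}
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}
```

The column of ones multiplies b_0, which is how the intercept enters the model.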
Using basic least squares with matrix algebra is
fairly simple once you have a computer program
to do the work. We simply solve for the following:
b = (X'X)^-1 X'Y
where X' is the transpose of X and Y is the
matrix of dependent variables.
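
As a rough illustration of the least-squares solution b = (X'X)^-1 X'Y, here is a minimal sketch in plain Python. It solves the normal equations by Gauss-Jordan elimination rather than forming the inverse explicitly. The helper names (`transpose`, `matmul`, `solve`, `ols`) and the small data set are made up for this example and are not from the presentation.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, rhs):
    """Solve a @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; check for collinear columns")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(x_rows, y):
    """Estimate [b0, b1, ..., bn]; x_rows excludes the intercept column."""
    X = [[1.0] + list(row) for row in x_rows]   # prepend the column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)                          # X'X
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]  # X'Y
    return solve(XtX, Xty)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term)
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefficients = ols(x_rows, y)
print(coefficients)
```

Because the example data contain no error, the estimated coefficients should recover the intercept and the two slopes used to generate them, up to floating-point rounding.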
8
Multiple Regression

We will not be performing the math, but it will
be useful to create the matrices, just to see how
it all gets formed.
In our example, suppose a gas utility company is
trying to estimate revenue. They may have
determined that heating cost is a function of the
temperature, the amount of insulation in an
attic, and the age of a furnace. They decided to
look at 20 customer sites, and quantify the data
as shown

9
Multiple Regression

We now need to store this information in a matrix
as follows (we are only going to do the first 4
rows, just to make it simple):
You can see how for the first four rows, we have
defined the monthly cost for the heating, the
intercept, temperature, insulation, and age of
furnace.
Now, using least square principles with matrix
algebra, we can come up with our unknown
coefficients.

10
Multiple Regression

Microsoft Excel has an excellent regression tool
for relatively small problems. You will find it
under Tools > Data Analysis.
Once you select the tool, an interactive dialog
will come up stepping you through the regression
wizard.
Here is where you will enter the range for the Y
value (a single column), and the X values
(multiple columns) as shown below.

11
Multiple Regression

You should type the numbers into Excel, and
attempt to perform the regression yourself.
Check your answers against ours.
What this tells us is that our R-squared value is
quite high (0.80), representing a good fit, and we
have a standard error of 51 (in dollars) for our
20 observations.
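
For readers who want to see where these two summary numbers come from, the sketch below computes R-squared and the standard error of the regression from observed and predicted values. The function name `fit_summary` and the five data points are invented for illustration; they are not the heating-cost data.

```python
import math


def fit_summary(y, y_hat, k):
    """Return (R-squared, standard error) for k explanatory variables."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    r_squared = 1.0 - ss_res / ss_tot
    # Residual degrees of freedom: n observations minus k slopes and 1 intercept
    std_error = math.sqrt(ss_res / (n - k - 1))
    return r_squared, std_error


# Made-up observed and predicted values
observed = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 14.0, 17.0, 17.0]
r2, se = fit_summary(observed, predicted, k=1)
print(r2, se)
```

The standard error is in the same units as the dependent variable, which is why the value of 51 above can be read as dollars.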

12
Multiple Regression

The next chart tells us our coefficient values
(intercept, temperature, insulation, age of
furnace). It also tells us the P-value of each
coefficient, a measure of significance. All the
P-values except the one for age of furnace are
very low, meaning that those coefficients are
significant at the 95% level.
So, what we now have is a formula:
Cost to Heat Home = 427 - 4.58(temperature)
- 14.83(attic insulation) + 6.101(age of the
furnace).
Therefore, if a person with no attic insulation
decided to add 12 inches, what would they save
when the average temperature is 12 degrees?

13
What it means

The intercept is 427.194. This is the cost of
heating when all the independent variables are
equal to zero.
The regression coefficients for the mean
temperature and the amount of attic insulation
are both negative. This is logical: as the
outside temperature increases, the cost of
heating the house will go down.
For each degree the mean temperature increases,
we expect the heating cost to decrease by $4.583
per month, holding the other variables constant.
The p-values of all the coefficients are
significant at α = 0.05 except for the coefficient
of the variable age of furnace (β3). Hence, we
can conclude that the other coefficients are
significantly different from zero.
However, if we examine the p-value for the
variable age of furnace, we see that it is not
significant at α = 0.05. Hence, we cannot conclude
that it is significantly different from zero.
In that case, we can drop this variable from the
model. Let's see what happens if we drop the
age of furnace variable from the model.

14
Multiple Regression

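Rather than rerunning things, we'll go with the
first conclusions:
Cost to Heat Home = 427 - 4.58(temperature)
- 14.83(attic insulation) + 6.101(age of the
furnace).
With no attic insulation, at 12 degrees, and with
a 6-year-old furnace:
Cost to Heat Home = 427 - 4.58(12) - 14.83(0)
+ 6.101(6)
≈ $408.65
With 12 inches of attic insulation:
Cost to Heat Home = 427 - 4.58(12) - 14.83(12)
+ 6.101(6)
≈ $230.69
The saving is 14.83(12) = $177.96 per month.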
A utility company could then use this information
to determine how much revenue they would generate
if they provided service to a neighborhood.
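
The two calculations above can be checked with a few lines of Python. The function name `heating_cost` is made up for this sketch; the coefficients are the rounded values from the slide, and the furnace age of 6 follows the slide's own substitution.

```python
def heating_cost(temperature, insulation, furnace_age):
    """Predicted monthly heating cost from the fitted formula on the slide."""
    return 427 - 4.58 * temperature - 14.83 * insulation + 6.101 * furnace_age


no_insulation = heating_cost(temperature=12, insulation=0, furnace_age=6)
with_insulation = heating_cost(temperature=12, insulation=12, furnace_age=6)
saving = no_insulation - with_insulation

print(round(no_insulation, 2), round(with_insulation, 2), round(saving, 2))
```

Since only the insulation term changes between the two cases, the saving is simply the insulation coefficient times the 12 inches added.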

15
Using Geography in Multiple Regression

GIS is a great tool for obtaining the explanatory
variables. For example, consider the following
problem.
Assume that an environmental remediation company
wants to know how much phosphorus is being
dumped in a lake.
If they had all the data together, they could
develop a regression model to predict the amount,
and then prescribe different land use options for
reducing the load.
Let's assume that the company determined that the
following information is a pretty good predictor
of phosphorus loading:
Land use: developed land and dairy farms
will have more phosphorus than forest.
Distance to a stream: areas near a stream will
be more likely to load into the lake.
Soil type: certain soil types and their
erosion factors may play a role in the amount of
loading.
Slope: areas uphill from the water will be more
likely to load into the lake.

16
The data

Here we have a picture of a number of phosphorus
sampling sites, along streams right near the
lake, and the land cover data.

17
The data

The soil types and watershed boundaries for each
sample site (that is, all the areas that pour
into the sample site)

18
The data
Buffering the stream by 200 meters
19
The data

Using basic GIS tools, we can determine the area
of each watershed, and, we can overlay the land
use, and figure out each of the following
Area of each land use type per watershed
Area of each land use type within each 200 meter
buffer of the stream
Area of each soil type per watershed
The average slope of each watershed
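
The area summaries listed above amount to a group-and-sum over the polygons produced by the overlay. The sketch below assumes the overlay result has already been exported as simple (watershed, land use, area) records; the record layout, the watershed IDs, and the areas are all hypothetical, and the percentage step is an added normalization rather than something the slide specifies.

```python
from collections import defaultdict

# Hypothetical overlay output: (watershed_id, land_use, area_in_hectares)
overlay_pieces = [
    ("W1", "forest", 40.0),
    ("W1", "dairy", 25.0),
    ("W1", "forest", 10.0),
    ("W2", "developed", 30.0),
    ("W2", "forest", 70.0),
]


def land_use_share(pieces):
    """Return {watershed: {land_use: percent of watershed area}}."""
    area = defaultdict(lambda: defaultdict(float))
    for watershed, land_use, hectares in pieces:
        area[watershed][land_use] += hectares   # sum pieces of the same class
    shares = {}
    for watershed, by_use in area.items():
        total = sum(by_use.values())
        shares[watershed] = {use: 100.0 * a / total for use, a in by_use.items()}
    return shares


shares = land_use_share(overlay_pieces)
print(shares)
```

The same function could be applied to the pieces that fall inside the 200 meter stream buffer, or to soil types instead of land use classes.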

20
Spatial analysis results, cont.: Land use (%) for
each sub-basin
21
Spatial analysis results, cont.: Land use (%)
within 200m buffer
22
Soil Type Coverage per sub-basin
23
Soil Type Coverage per sub-basin (200m buffer)
24
Slope Average (parameter for regression)

Slope average for each sub-basin was calculated
from the DEM
This may give an indication of runoff: the
steeper the slope, the higher the runoff.
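
To make the slope calculation concrete, here is a minimal sketch that estimates mean slope from a small elevation grid using central differences on interior cells. GIS packages typically use a 3x3 neighborhood method and would also clip the grid to the sub-basin boundary; neither is shown here. The function name `mean_slope_degrees` and the DEM values are invented for the example.

```python
import math


def mean_slope_degrees(dem, cell_size):
    """Mean slope in degrees over interior cells, by central differences."""
    slopes = []
    for r in range(1, len(dem) - 1):
        for c in range(1, len(dem[0]) - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            gradient = math.hypot(dz_dx, dz_dy)       # rise over run
            slopes.append(math.degrees(math.atan(gradient)))
    return sum(slopes) / len(slopes)


# Made-up 4x4 DEM on 10 m cells: elevation rises 10 m per cell to the east
dem = [[10.0 * c for c in range(4)] for _ in range(4)]
mean_slope = mean_slope_degrees(dem, cell_size=10.0)
print(mean_slope)
```

A surface that rises one unit for every unit of horizontal distance has a 45 degree slope, which makes this test grid easy to check by hand.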

25
Putting it together

Once all the tables are categorized for each
sample site, the matrices can be formed with the
dependent variable being the loading, and the
independent variables being the land use,
soil type, average slope, etc.
Then, the company would just run a regression
analysis like we did earlier, and obtain the
coefficients.
After the coefficients are obtained, the users
can modify the values (assuming the coefficients
yield good results), such as converting some
agricultural land to forest, and then see how
that may or may not impact the phosphorus
loading in the lake.
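
The "what if" step described above can be sketched as evaluating the fitted equation twice, once with the current values and once with the modified ones. Every number below is hypothetical: the coefficients, the variable names, and the land use percentages are placeholders chosen only to show the mechanics, not results from the study.

```python
def predicted_load(coefficients, values):
    """Evaluate b0 + sum(b_i * x_i) for one watershed."""
    return coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in values.items()
    )


# Hypothetical coefficients from an earlier regression run
coefficients = {
    "intercept": 5.0,
    "pct_agriculture": 0.8,
    "pct_forest": -0.2,
    "mean_slope": 1.5,
}

# Hypothetical current conditions for one watershed
baseline = {"pct_agriculture": 40.0, "pct_forest": 30.0, "mean_slope": 4.0}

# Scenario: convert 10 percentage points of agricultural land to forest
scenario = dict(baseline, pct_agriculture=30.0, pct_forest=40.0)

change = predicted_load(coefficients, scenario) - predicted_load(coefficients, baseline)
print(change)
```

A negative change would indicate a lower predicted load under the scenario. Such a prediction is only as reliable as the fitted coefficients, which is the caveat the slide raises.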

26
Example

Examining the relationship between bird
distribution and selected environmental variables
An example from Chou

27
Data
28
Comments on Data

The nominal variables (species and vegetation)
are not included in the model
Can the observation frequency of birds be
explained by forest density, proximity to rivers,
and slope?

29
Results of Analysis

Which variables in the model are significant, and
which ones are not necessary?
An efficient spatial model incorporates a small
number of critical variables while generating a
sufficiently accurate prediction

30
Results of Analysis

According to statistical tables, the critical
value of t is 2.101 (α = 0.025 per tail, d.f. = 18)
Therefore, only Forest Cover is significant in
the distribution of birds in this study.
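
The decision rule behind this conclusion is a comparison of each coefficient's absolute t statistic with the critical value. The regression output table is not in the transcript, so the t statistics below are hypothetical placeholders chosen only to illustrate the rule; they are not Chou's values, and the variable names are assumed.

```python
def significant_variables(t_values, critical_t):
    """Return the names whose |t| exceeds the two-tailed critical value."""
    return [name for name, t in t_values.items() if abs(t) > critical_t]


# Hypothetical t statistics (NOT the values from the bird example)
t_values = {"forest_cover": 3.4, "river_distance": -1.2, "slope": 0.7}

significant = significant_variables(t_values, critical_t=2.101)
print(significant)
```

The absolute value matters because a strongly negative t statistic is as much evidence against a zero coefficient as a strongly positive one.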

31
Theoretical Formation of Multiple Regression
using Matrices
Watch this one. With spatial data we often
violate this assumption. More about this in a
few weeks.
34
Here we are going to rerun the bird regression
example using matrices
