Title: Multiple Regression in GIS
1Multiple Regression in GIS
- Review of Regression Lab
- Amare's neighbor problem
2Multiple Regression (Chou p. 276)
- Explaining the distribution of a spatial phenomenon requires the analysis of relationships between the phenomenon and potential explanatory variables
- The two most useful methods in GIS spatial analysis are multiple regression analysis and logistic regression analysis
- Regression analysis is used to examine the relationship between the study phenomenon and multiple explanatory variables
- $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + e$
3Multiple Regression
- $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + e$
- Where $Y$ denotes the dependent variable and the $X$s are explanatory variables; $b_0$ represents the intercept, $b_n$ denotes the estimated parameter of the variable $X_n$, and $e$ is the randomly distributed error term
- Single most widely used statistical technique in the social sciences
- Multiple regression can be implemented in matrix form; see the MathCAD example
4Multiple Regression
- We are now going to kick things up a notch. In many cases, there is more than one independent variable to explain a dependent variable.
- For example, income is probably not just related to education level. It may also have something to do with the number of years a person works, or the industry a person is in.
- For example, in our sample plot, there were three points at the same X value. How can this be possible if we give one treatment and get different yields? It's probably because there is some error, as we discussed before, but it's very likely that there are some other explanatory variables that we haven't accounted for.
5Multiple Regression
- Multiple regression is simply an extension of the simple linear model. The difference is that there are more independent variables.
- The assumptions for multiple regression are similar to those for simple linear regression:
- The average value of the dependent variable Y is a linear combination of the independent variables.
- The only random component is the error term e, and the independent variables are assumed to be fixed and independent of e.
- Errors between observations are uncorrelated, and normally distributed with a mean of zero and a constant variance $\sigma^2$.
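The assumptions above can be written compactly. This is a notational sketch only: the subscripts $i$ and $j$ (indexing observations) are introduced here for illustration, and $\sigma^2$ stands for the constant error variance.

```latex
\begin{aligned}
E[Y \mid X_1, \dots, X_n] &= b_0 + b_1 X_1 + \dots + b_n X_n \\
E[e_i] &= 0, \qquad \operatorname{Var}(e_i) = \sigma^2 \\
\operatorname{Cov}(e_i, e_j) &= 0 \quad (i \neq j), \qquad e_i \sim N(0, \sigma^2)
\end{aligned}
```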
6Multiple Regression
- While we were able to solve for a single explanatory variable using a spreadsheet, as we add more explanatory variables, it will be harder to solve. This is especially true when the number of independent variables is greater than 3.
- However, through the use of matrices, the task is far less difficult. We can structure the regression equations in matrix form as $Y = Xb + e$, where $Y$ is the vector of observed values of the dependent variable, $X$ is the observation matrix of independent variables, $b$ is the vector of unknown parameters, and $e$ is the vector of errors.
7Multiple Regression
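- So, if we had 4 observations and three explanatory variables, our matrices would look like the following:

$$
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
1 & x_{31} & x_{32} & x_{33} \\
1 & x_{41} & x_{42} & x_{43}
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix}
$$

- The reason we have a 1 in the first column is because we have to include the intercept parameter. Therefore, we really have 4 unknowns to solve for (the coefficient for each explanatory variable, and the intercept).
- Using basic least squares with matrix algebra is fairly simple once you have a computer program to do the work. We simply solve $b = (X^T X)^{-1} X^T Y$, where $Y$ is the matrix of dependent variables.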
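The least squares solve described above amounts to solving the normal equations $(X^T X)\,b = X^T Y$. The following is a minimal sketch in plain Python so that each matrix step is visible. The helper names and the data are hypothetical (the values are generated from a known equation so the result can be checked), and in practice a library routine such as `numpy.linalg.lstsq` would be the usual choice.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, rhs):
    """Solve a * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(x_rows, y):
    """Return the coefficient vector b that solves (X'X) b = X'Y."""
    X = [[1.0] + list(row) for row in x_rows]  # leading 1 is the intercept column
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)


# Hypothetical data: six observations of three explanatory variables,
# with y generated exactly from y = 2 + 3*x1 - 1*x2 + 0.5*x3.
x_rows = [(1, 2, 0), (2, 1, 4), (3, 5, 2), (4, 3, 6), (5, 8, 1), (6, 4, 3)]
y = [2 + 3 * x1 - x2 + 0.5 * x3 for x1, x2, x3 in x_rows]

b = least_squares(x_rows, y)
print([round(value, 6) for value in b])
```

Because the sample values contain no error term, the recovered coefficients should match the generating equation up to floating-point rounding; with real observations they would be estimates instead.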
8Multiple Regression
- We will not be performing the math, but it will be useful to create the matrices, just to see how it all gets formed.
- In our example, suppose a gas utility company is trying to estimate revenue. They may have determined that heating cost is a function of the temperature, the amount of insulation in an attic, and the age of a furnace. They decided to look at 20 customer sites, and quantify the data as shown
9Multiple Regression
- We now need to store this information in a matrix as follows (we are only going to do the first 4 rows, just to make it simple)
- You can see how, for the first four rows, we have defined the monthly cost for the heating, the intercept, temperature, insulation, and age of furnace.
- Now, using least squares principles with matrix algebra, we can come up with our unknown coefficients.
10Multiple Regression
- Microsoft Excel has an excellent regression tool for relatively small problems. You will find it under Tools > Data Analysis.
- Once you select the tool, an interactive dialog will come up, stepping you through the regression wizard.
- Here is where you will enter the range for the Y value (a single column), and the X values (multiple columns) as shown below.
11Multiple Regression
- You should type the numbers into Excel, and attempt to perform the regression yourself. Check your answers against ours.
- What this tells us is that our R-squared value is quite high (0.80), representing a good fit, and we have a standard error of 51 (in dollars) for our 20 observations.
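The two summary figures reported here can be computed from the observed and fitted values. The sketch below assumes the usual definitions: R-squared is 1 minus the ratio of unexplained to total variation, and the standard error of the estimate divides the unexplained variation by n - k - 1 degrees of freedom, where k is the number of explanatory variables (for 20 sites and three variables that would be 16). The function name and the sample numbers are hypothetical and are not the utility data.

```python
import math


def fit_summary(y, y_hat, k):
    """Return (R squared, standard error of the estimate) for k explanatory variables."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    r_squared = 1 - sse / sst
    std_error = math.sqrt(sse / (n - k - 1))               # n - k - 1 degrees of freedom
    return r_squared, std_error


# Hypothetical observed and fitted values for a one-variable model.
y = [10.0, 12.0, 15.0, 19.0, 24.0]
y_hat = [9.0, 13.0, 15.0, 20.0, 23.0]

r2, se = fit_summary(y, y_hat, k=1)
print(f"R squared = {r2:.3f}, standard error = {se:.3f}")
```

The standard error is in the same units as the dependent variable, which is why the slide reads it in dollars.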
12Multiple Regression
- The next chart tells us our coefficient values (intercept, temperature, insulation, age of furnace). It also tells us our P-value, or a measure of significance. All the values except age of furnace are very low, meaning that they are all significant at the 95% level.
- So, what we now have is a formula:
- Cost to Heat Home = 427 - 4.58(temperature) - 14.83(attic insulation) + 6.101(age of the furnace)
- Therefore, if a person with no attic insulation decided to add 12 inches, what would they save when the average temperature is 12 degrees?
13What it means
- The intercept is 427.194. This is the cost of heating when all the independent variables are equal to zero.
- The regression coefficients for the mean temperature and the amount of attic insulation are both negative. This is logical: as the outside temperature increases, the cost of heating the house will go down.
- For each degree the mean temperature increases, we expect the heating cost to decrease by 4.583 dollars per month.
- The p-values for all the coefficients are significant at α = 0.05 except for the coefficient of the variable age of furnace (β3). Hence, we can conclude that the other coefficients are significantly different from zero.
- However, if we examine the p-value for the variable age of furnace, we see that it is not significant at α = 0.05. Hence, we cannot conclude that it is significantly different from zero.
- In that case, we can drop this variable from the model. Let's see what happens if we drop the age of furnace variable from the model.
14Multiple Regression
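- Rather than rerunning things, we'll go with the first conclusions
- Cost to Heat Home = 427 - 4.58(temperature) - 14.83(attic insulation) + 6.101(age of the furnace)
- With no attic insulation: Cost to Heat Home = 427 - 4.58(12) - 14.83(0) + 6.101(6) ≈ 408.65
- With 12 inches of attic insulation: Cost to Heat Home = 427 - 4.58(12) - 14.83(12) + 6.101(6) ≈ 230.69
- So adding the insulation saves 14.83 × 12 ≈ 178 dollars per month at an average temperature of 12 degrees (both cases use a furnace age of 6)
- A utility company could then use this information to determine how much revenue they would generate if they provided service to a neighborhood.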
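The same calculation can be wrapped in a small function so other temperatures, insulation depths, and furnace ages can be tried. This is only a sketch: the function and parameter names are hypothetical, and it uses the rounded coefficients from the formula on the slide.

```python
def heating_cost(temperature, insulation, furnace_age):
    """Monthly heating cost from the fitted equation (rounded slide coefficients)."""
    return 427 - 4.58 * temperature - 14.83 * insulation + 6.101 * furnace_age


# Same inputs as the worked example: 12 degrees, a furnace age of 6,
# and attic insulation of 0 versus 12 inches.
no_insulation = heating_cost(temperature=12, insulation=0, furnace_age=6)
with_insulation = heating_cost(temperature=12, insulation=12, furnace_age=6)
saving = no_insulation - with_insulation

print(f"{no_insulation:.2f} {with_insulation:.2f} {saving:.2f}")
```

Note that the saving depends only on the insulation coefficient and the change in insulation, because the temperature and furnace terms cancel when the two costs are subtracted.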
15Using Geography in Multiple Regression
- GIS is a great tool for obtaining the explanatory variables. For example, consider the following problem to solve.
- Assume that an environmental remediation company wants to know how much phosphorus is being dumped in a lake.
- If they had all the data together, they could develop a regression model to predict the amount, and then prescribe different land use options for reducing the load.
- Let's assume that the company determined that the following information is a pretty good predictor of phosphorus loading:
- Land use: developed land and dairy farms will have more phosphorus than forest
- Distance to a stream: areas near a stream will be more likely to load into the lake
- Soil type: certain soil types and their erosion factors may play a role in the amount of loading
- Slope: areas uphill from the water will be more likely to load into the lake
16The data
- Here we have a picture of a number of phosphorus sampling sites along streams right near the lake, together with the land cover data.
17The data
- The soil types and watershed boundaries for each
sample site (that is, all the areas that pour
into the sample site)
18The data
Buffering the stream by 200 meters
19The data
- Using basic GIS tools, we can determine the area of each watershed, and we can overlay the land use and figure out each of the following:
- Area of each land use type per watershed
- Area of each land use type within each 200 meter buffer of the stream
- Area of each soil type per watershed
- The average slope of each watershed
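Once an overlay has been run, the tabulation step is a simple grouped sum. The sketch below assumes the overlay has already been exported as a list of records, one per overlay polygon; the record layout, watershed ids, land use classes, and areas are all hypothetical.

```python
from collections import defaultdict

# Hypothetical overlay output: (watershed id, land use class, area in hectares).
overlay_pieces = [
    ("W1", "forest", 40.0),
    ("W1", "dairy", 10.0),
    ("W1", "forest", 30.0),
    ("W1", "developed", 20.0),
    ("W2", "forest", 15.0),
    ("W2", "dairy", 35.0),
]


def landuse_share(pieces):
    """Return {watershed: {land use: fraction of the watershed's total area}}."""
    area = defaultdict(lambda: defaultdict(float))
    for watershed, landuse, hectares in pieces:
        area[watershed][landuse] += hectares  # sum the pieces of each class
    shares = {}
    for watershed, by_class in area.items():
        total = sum(by_class.values())
        shares[watershed] = {cls: a / total for cls, a in by_class.items()}
    return shares


shares = landuse_share(overlay_pieces)
for watershed, by_class in shares.items():
    print(watershed, by_class)
```

The same function could be applied to the pieces that fall inside the 200 meter stream buffer, or to soil types instead of land use classes, to build the other tables.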
20Spatial analysis results cont.: Land use for each sub-basin
21Spatial analysis results cont.: Land use within 200m buffer
22 Soil Type Coverage per sub-basin
23 Soil Type Coverage per sub-basin (200m buffer)
24Slope Average (parameter for regression)
- Slope average for each sub-basin was calculated from the DEM
- This may give an indication of runoff: the steeper the slope, the higher the runoff.
25Putting it together
- Once all the tables are categorized for each sample site, the matrices can be formed with the dependent variable being the loading, and the independent variables being the land use, soil type, average slope, etc.
- Then, the company would just run a regression analysis like we did earlier, and obtain the coefficients.
- After the coefficients are obtained, the users can modify the values (assuming the coefficients yield good results), such as converting some agricultural land to forest, and then see how that may or may not impact the phosphorus loading in the lake.
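The "what if" step can be sketched as follows. Everything here is hypothetical: the variable names, the coefficient values, and the site values are invented for illustration, and a real analysis would take the coefficients from the regression output for the sampled watersheds.

```python
# Hypothetical coefficients; a real model would take these from the regression output.
coefficients = {
    "intercept": 5.0,
    "pct_agriculture": 0.8,
    "pct_forest": -0.2,
    "mean_slope": 0.5,
}


def predict_load(site, coef):
    """Predicted phosphorus load: intercept plus coefficient times value per variable."""
    return coef["intercept"] + sum(coef[name] * value for name, value in site.items())


def convert_land(site, from_use, to_use, points):
    """Return a copy of the site with `points` percentage points moved between land uses."""
    moved = min(points, site[from_use])  # cannot convert more land than exists
    scenario = dict(site)
    scenario[from_use] -= moved
    scenario[to_use] += moved
    return scenario


baseline = {"pct_agriculture": 40.0, "pct_forest": 30.0, "mean_slope": 6.0}
scenario = convert_land(baseline, "pct_agriculture", "pct_forest", 10.0)

change = predict_load(scenario, coefficients) - predict_load(baseline, coefficients)
print(f"Predicted change in load: {change:+.1f}")
```

A prediction like this is only as trustworthy as the fitted model, so it makes sense to check the fit and the significance of each coefficient before comparing scenarios.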
26Example
- Examining the relationship between bird distribution and selected environmental variables
- An example from Chou
27Data
28Comments on Data
- The nominal variables (species and vegetation) are not included in the model
- Can observation frequency of birds be explained by forest density, proximity to rivers, and slope?
29Results of Analysis
- Which variables in the model are significant, and which ones are not necessary?
- An efficient spatial model incorporates a small number of critical variables while generating a sufficiently accurate prediction
30Results of Analysis
- According to statistical tables, the critical value of t is 2.101 (α = 0.025 in each tail, d.f. = 18)
- Therefore, only Forest Cover is significant in the distribution of birds in this study.
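The decision rule behind this comparison can be written out as below. The notation $SE(b_j)$, for the standard error of coefficient $b_j$ as reported in the regression output, is assumed here rather than taken from the slides.

```latex
t_j = \frac{b_j}{SE(b_j)}, \qquad
\text{reject } H_0 : b_j = 0 \ \text{ if } \ |t_j| > t_{0.025,\,18} = 2.101
```

A variable whose $|t_j|$ falls below 2.101 cannot be distinguished from having no effect at the 5% level, which is why it is a candidate for removal from the model.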
31Theoretical Formation of Multiple Regression using Matrices
Watch this one. With spatial data we often violate this assumption. More about this in a few weeks.
34Here we are going to rerun the bird regression example using matrices
35(No Transcript)