Multiple Regression in GIS - PowerPoint PPT Presentation

Provided by: AJL2
1
Multiple Regression in GIS
  • Review of Regression Lab: Amare's neighbor problem

2
Multiple Regression (Chou p. 276)
  • Explaining the distribution of a spatial
    phenomenon requires the analysis of relationships
    between the phenomenon and potential explanatory
    variables
  • The two most useful methods in GIS spatial
    analysis are multiple regression analysis and
    logistic regression analysis
  • Regression analysis is used to examine the
    relationship between the study phenomenon and
    multiple explanatory variables
  • Y = b0 + b1X1 + b2X2 + ... + bnXn + e

3
Multiple Regression
  • Y = b0 + b1X1 + b2X2 + ... + bnXn + e
  • Where Y denotes the dependent variable and the Xs
    are explanatory variables; b0 represents the
    intercept, bn denotes the estimated parameter
    of the variable Xn, and e is the randomly
    distributed error term
  • Single most widely used statistical technique in
    the social sciences
  • Multiple regression can be implemented in matrix
    form in the following MathCAD example
  • see MathCAD example

4
Multiple Regression
  • We are now going to kick things up a notch. In
    many cases, more than one independent variable
    is needed to explain a dependent variable.
  • For example, income is probably not related just
    to education level. It may also have something
    to do with the number of years a person has
    worked, or the industry a person is in.
  • For example, in our sample plot, there were three
    points with the same X value. How can this be
    possible if one treatment should give one yield?
    It's probably because there is some error, as we
    discussed before, but it's very likely that
    there are other explanatory variables that
    we haven't accounted for.

5
Multiple Regression
  • Multiple regression is simply an extension of the
    simple linear model. The difference is that
    there are more independent variables
  • The assumptions for multiple regression are
    similar to simple linear regression
  • The average value of the dependent variable Y is
    a linear combination of the independent
    variables.
  • The only random component is the error term e,
    and the independent variables are assumed to be
    fixed and independent of e.
  • Errors between observations are uncorrelated and
    normally distributed, with a mean of zero and a
    constant variance σ².

6
Multiple Regression
  • While we were able to solve for a single
    explanatory variable using a spreadsheet, the
    problem becomes harder to solve as we add more
    explanatory variables. This is especially true
    when the number of independent variables is
    greater than 3.
  • However, through the use of matrices, the task is
    far less difficult. We can structure the
    regression equations in matrix form as
    Y = Xb + e,
  • where Y is the vector of observed values of the
    dependent variable, X is the observation matrix
    of independent variables, b is the vector of
    unknown parameters, and e is the vector of
    errors.

7
Multiple Regression
  • So, if we had 4 observations, and three
    explanatory variables, our matrices would look
    like the following
  • The reason we have a 1 in the first column is
    that we have to include the intercept
    parameter. Therefore, we really have 4 unknowns
    to solve for (the coefficient for each
    explanatory variable, and the intercept)
  • Using basic least squares with matrix algebra is
    fairly simple once you have a computer program
    to do the work. We simply solve for the following
Matrix of dependent variables
8
Multiple Regression
  • We will not be performing the math, but it will
    be useful to create the matrices, just to see how
    it all gets formed.
  • In our example, suppose a gas utility company is
    trying to estimate revenue. They may have
    determined that heating cost is a function of the
    temperature, the amount of insulation in an
    attic, and the age of a furnace. They decided to
    look at 20 customer sites, and quantify the data
    as shown

9
Multiple Regression
  • We now need to store this information in a matrix
    as follows (we are only going to do the first 4
    rows, just to make it simple)
  • You can see how, for the first four rows, we have
    defined the monthly cost for the heating, the
    intercept, temperature, insulation, and age of
    furnace.
  • Now, using least squares principles with matrix
    algebra, we can come up with our unknown
    coefficients.
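
Forming the two matrices from customer records can be sketched as below. The four rows of values are hypothetical placeholders, and the field names (`cost`, `temperature`, `insulation`, `furnace_age`) are assumptions made for illustration.

```python
# Hypothetical records for the first four customer sites.
records = [
    {"cost": 250, "temperature": 35, "insulation": 3, "furnace_age": 6},
    {"cost": 360, "temperature": 29, "insulation": 4, "furnace_age": 10},
    {"cost": 165, "temperature": 36, "insulation": 7, "furnace_age": 3},
    {"cost": 43, "temperature": 60, "insulation": 6, "furnace_age": 9},
]
predictors = ["temperature", "insulation", "furnace_age"]

# Y: one column holding the dependent variable (monthly heating cost).
Y = [[r["cost"]] for r in records]

# X: a leading 1 for the intercept, then one column per explanatory variable.
X = [[1] + [r[name] for name in predictors] for r in records]

for y_row, x_row in zip(Y, X):
    print(y_row, x_row)
```

Each printed line pairs one entry of Y with the matching row of X, which is the layout the least-squares step expects.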

10
Multiple Regression
  • Microsoft Excel has an excellent regression tool
    for relatively small problems. You will find it
    under Tools -> Data Analysis.
  • Once you select the tool, an interactive dialog
    will come up stepping you through the regression
    wizard.
  • Here is where you will enter the range for the Y
    value (a single column), and the X values
    (multiple columns) as shown below.

11
Multiple Regression
  • You should type the numbers into Excel, and
    attempt to perform the regression yourself.
    Check your answers against ours.
  • What this tells us is that our R-squared value is
    quite high (.80), representing a good fit, and we
    have a standard error of 51 (in dollars) for our
    20 observations.
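
The two summary figures come from comparing observed and fitted values: R-squared is 1 - SSE/SST, and the standard error of the regression is sqrt(SSE / (n - k - 1)) for k explanatory variables. A minimal sketch follows, with hypothetical observed and fitted values rather than the 20 customer sites; the function name `fit_statistics` is illustrative.

```python
import math


def fit_statistics(observed, fitted, n_predictors):
    """Return (R-squared, standard error of the regression)."""
    n = len(observed)
    mean_y = sum(observed) / n
    # SSE: unexplained variation; SST: total variation about the mean.
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - mean_y) ** 2 for y in observed)
    r_squared = 1.0 - sse / sst
    std_error = math.sqrt(sse / (n - n_predictors - 1))
    return r_squared, std_error


# Hypothetical values, assuming a model with 3 explanatory variables.
observed = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
fitted = [12.0, 18.0, 31.0, 39.0, 52.0, 58.0]

r_squared, std_error = fit_statistics(observed, fitted, n_predictors=3)
print(round(r_squared, 4), round(std_error, 4))
```

The standard error is in the same units as the dependent variable, which is why the slide reports it in dollars.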

12
Multiple Regression
  • The next chart tells us our coefficient values
    (intercept, temperature, insulation, age of
    furnace). It also tells us the P-value of each, a
    measure of significance. All the P-values except
    the one for age of furnace are very low, meaning
    that those coefficients are significant at the
    95% level.
  • So, what we now have is a formula
  • Cost to Heat Home = 427 - 4.58(temperature)
    - 14.83(attic insulation) + 6.101(age of the
    furnace).
  • Therefore, if a person with no attic insulation
    decided to add 12 inches, what would they save
    when the average temperature is 12 degrees?

13
What it means
  • The intercept is 427.194. This is the predicted
    cost of heating when all the independent
    variables are equal to zero.
  • The regression coefficients for the mean
    temperature and the amount of attic insulation
    are both negative. This is logical: as the
    outside temperature increases, the cost of
    heating the house will go down.
  • For each degree the mean temperature increases,
    we expect the heating cost to decrease by $4.583
    per month.
  • The p-values for all the coefficients are
    significant at α = 0.05, except for the
    coefficient of the variable age of furnace (β3).
    Hence, we can conclude that the other
    coefficients are significantly different from
    zero.
  • However, if we examine the p-value for the
    variable age of furnace, we see that it is not
    significant at α = 0.05. Hence, we cannot
    conclude that it is significantly different from
    zero.
  • In that case, we can drop this variable from the
    model. Let's see what happens if we drop the
    age of furnace variable from the model
14
Multiple Regression
  • Rather than rerunning things, we'll go with the
    first conclusions
  • Cost to Heat Home = 427 - 4.58(temperature)
    - 14.83(attic insulation) + 6.101(age of the
    furnace).
  • Cost to Heat Home = 427 - 4.58(12) - 14.83(0)
    + 6.101(6)
  • ≈ 408.65
  • Cost to Heat Home = 427 - 4.58(12) - 14.83(12)
    + 6.101(6)
  • ≈ 230.69, a saving of about $178 per month
  • A utility company could then use this information
    to determine how much revenue they would generate
    if they provided service to a neighborhood.
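
The worked calculation can be checked with a few lines of Python. This is a sketch using the rounded coefficients quoted on the slides; the furnace age of 6 years is the value used in the slide's own arithmetic, and the names `COEFFICIENTS` and `heating_cost` are illustrative.

```python
# Rounded coefficients as quoted on the slides.
COEFFICIENTS = {
    "intercept": 427.0,
    "temperature": -4.58,
    "insulation": -14.83,
    "furnace_age": 6.101,
}


def heating_cost(temperature, insulation, furnace_age):
    """Predicted monthly heating cost in dollars."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["temperature"] * temperature
        + c["insulation"] * insulation
        + c["furnace_age"] * furnace_age
    )


before = heating_cost(temperature=12, insulation=0, furnace_age=6)
after = heating_cost(temperature=12, insulation=12, furnace_age=6)
savings = before - after

print(f"no insulation: {before:.2f}")
print(f"12 inches:     {after:.2f}")
print(f"savings:       {savings:.2f}")
```

Because the model is linear, the savings depend only on the insulation coefficient: 12 inches times 14.83 dollars per inch, whatever the temperature or furnace age.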

15
Using Geography in Multiple Regression
  • GIS is a great tool for obtaining the explanatory
    variables. For example, consider the following
    problem to solve.
  • Assume that an environmental remediation company
    wants to know how much phosphorus is being
    dumped in a lake.
  • If they had all the data together, they could
    develop a regression model to predict the amount,
    and then prescribe different land use options for
    reducing the load.
  • Let's assume that the company determined that the
    following information is a pretty good predictor
    of phosphorus loading
  • Land use: developed land and dairy farms
    will have more phosphorus than forest
  • Distance to a stream: areas near a stream will
    be more likely to load into the lake
  • Soil type: certain soil types and their
    erosion factors may play a role in the amount of
    loading.
  • Slope: areas uphill from the water will be more
    likely to load into the lake.

16
The data
  • Here we have a picture of a number of phosphorus
    sampling sites along streams right near the
    lake, together with the land cover data.

17
The data
  • The soil types and watershed boundaries for each
    sample site (that is, all the areas that pour
    into the sample site)

18
The data
Buffering the stream by 200 meters
19
The data
  • Using basic GIS tools, we can determine the area
    of each watershed, and, we can overlay the land
    use, and figure out each of the following
  • Area of each land use type per watershed
  • Area of each land use type within each 200 meter
    buffer of the stream
  • Area of each soil type per watershed
  • The average slope of each watershed
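
Once the overlay has produced pieces tagged with a watershed and a land use type, the per-watershed summary is a simple group-and-sum. The sketch below assumes the overlay output has already been exported as (watershed, land use, area) records; the records and the function name `land_use_percent` are hypothetical.

```python
from collections import defaultdict

# Hypothetical overlay output: (watershed_id, land_use, area_in_hectares).
overlay_pieces = [
    ("W1", "forest", 60.0),
    ("W1", "dairy", 30.0),
    ("W1", "developed", 10.0),
    ("W2", "forest", 20.0),
    ("W2", "forest", 5.0),
    ("W2", "dairy", 75.0),
]


def land_use_percent(pieces):
    """Return {watershed: {land_use: percent of watershed area}}."""
    area = defaultdict(lambda: defaultdict(float))
    for watershed, land_use, hectares in pieces:
        area[watershed][land_use] += hectares
    result = {}
    for watershed, by_use in area.items():
        total = sum(by_use.values())
        result[watershed] = {
            use: 100.0 * hectares / total for use, hectares in by_use.items()
        }
    return result


percent = land_use_percent(overlay_pieces)
for watershed, shares in percent.items():
    print(watershed, shares)
```

The same function applies to the 200 meter buffer and to soil types by feeding it the pieces from those overlays instead.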

20
Spatial analysis results, cont.: Land use (%) for
each sub-basin
21
Spatial analysis results, cont.: Land use (%)
within 200m buffer
22
Soil Type Coverage per sub-basin
23
Soil Type Coverage per sub-basin (200m buffer)
24
Slope Average (parameter for regression)
  • Slope average for each sub-basin was calculated
    from the DEM
  • This may give an indication of runoff: the
    steeper the slope, the higher the runoff.
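
One way to picture the calculation is a central-difference slope for each interior DEM cell, averaged over the grid. This is only a sketch under simplifying assumptions: the tiny DEM is hypothetical, every cell is treated as part of the sub-basin, and GIS packages typically use their own slope algorithm, so results from a real tool may differ.

```python
import math


def mean_slope_degrees(dem, cell_size):
    """Average slope (degrees) over interior cells using central differences."""
    slopes = []
    for r in range(1, len(dem) - 1):
        for c in range(1, len(dem[0]) - 1):
            # Elevation change per unit distance in the x and y directions.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            gradient = math.hypot(dz_dx, dz_dy)
            slopes.append(math.degrees(math.atan(gradient)))
    return sum(slopes) / len(slopes)


# Hypothetical 4x4 DEM: a plane rising 1 m per 10 m cell toward the east.
dem = [[100.0 + col for col in range(4)] for _ in range(4)]

mean_slope = mean_slope_degrees(dem, cell_size=10.0)
print(round(mean_slope, 2))
```

For a real sub-basin, only the cells inside the watershed boundary would be included in the average.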

25
Putting it together
  • Once all the tables are categorized for each
    sample site, the matrices can be formed with the
    dependent variable being the loading, and the
    independent variables being the land use,
    soil type, average slope, etc.
  • Then, the company would just run a regression
    analysis like we did earlier, and obtain the
    coefficients.
  • After the coefficients are obtained, the users
    can modify the values (assuming the coefficients
    yield good results), such as converting some
    agricultural land to forest, and then see how
    that may or may not impact the phosphorus
    loading in the lake.
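
The "what if" step amounts to re-evaluating the fitted equation with changed inputs. The sketch below uses invented coefficients and site values purely for illustration (they are not fitted from any lake data), and it assumes forest is the baseline land use, so converting agricultural land to forest simply lowers the agricultural percentage.

```python
# Hypothetical coefficients, NOT fitted from real data.
coefficients = {
    "intercept": 5.0,
    "pct_agriculture": 0.8,
    "pct_developed": 0.5,
    "mean_slope": 1.2,
}


def predicted_loading(site):
    """Evaluate the linear model for one watershed (units are arbitrary)."""
    return coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in site.items()
    )


# Hypothetical current conditions for one watershed.
current = {"pct_agriculture": 40.0, "pct_developed": 10.0, "mean_slope": 5.0}

# Scenario: convert half of the agricultural land to forest.
scenario = dict(current, pct_agriculture=20.0)

loading_now = predicted_loading(current)
loading_scenario = predicted_loading(scenario)
change = loading_scenario - loading_now
print(loading_now, loading_scenario, change)
```

A prediction like this is only as trustworthy as the fitted model, which is why the slide conditions it on the coefficients yielding good results.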

26
Example
  • Examining the relationship between bird
    distribution and selected environmental variables
  • An example from Chou

27
Data
CLICK FOR EXCEL SHEET
28
Comments on Data
  • The nominal variables (species and vegetation)
    are not included in the model
  • Can observation frequency of birds be explained
    by forest density, proximity to rivers, and
    slope?

29
Results of Analysis
  • Which variables in the model are significant, and
    which ones are not necessary
  • An efficient spatial model incorporates a small
    number of critical variables while generating a
    sufficiently accurate prediction

30
Results of Analysis
  • According to statistical tables, the critical
    value of t is 2.101 (α = .025 for d.f. = 18)
  • Therefore, only Forest Cover is significant in
    the distribution of birds in this study.
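
The significance check compares each coefficient's t statistic (the coefficient divided by its standard error) with the critical value 2.101. The sketch below uses hypothetical coefficients and standard errors, chosen only so the outcome mirrors the conclusion above; they are not the values from the Chou example.

```python
T_CRITICAL = 2.101  # critical value quoted above for 18 degrees of freedom

# Hypothetical (coefficient, standard error) pairs for illustration only.
estimates = {
    "forest_cover": (0.90, 0.30),
    "river_proximity": (0.20, 0.25),
    "slope": (-0.10, 0.20),
}


def is_significant(coefficient, std_error, critical=T_CRITICAL):
    """True when |t| = |coefficient / std_error| exceeds the critical value."""
    return abs(coefficient / std_error) > critical


results = {
    name: is_significant(coef, se) for name, (coef, se) in estimates.items()
}
for name, significant in results.items():
    print(name, "significant" if significant else "not significant")
```

Variables that fail the test are candidates for removal, in line with the goal of a model built on a small number of critical variables.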

31
Theoretical Formation of Multiple Regression
using Matrices
Watch this one. With spatial data we often
violate this assumption. More about this in a
few weeks.
34
Here we are going to rerun the bird regression
example using matrices