1
A Brief Introduction to Spatial Regression
  • Eugene Brusilovskiy

2
Outline
  • Review of Correlation
  • OLS Regression
  • Regression with a non-normal dependent variable
  • Spatial Regression

3
Correlation
  • Defined as a measure of how much two variables X
    and Y change together
  • Dimensionless measure
  • A correlation between two variables is a single
    number that can range from -1 to 1, with positive
    values close to one indicating a strong direct
    relationship and negative values close to -1
    indicating a strong inverse relationship
  • E.g., a positive correlation between income and
    years of schooling indicates that more years of
    schooling would correspond to greater income
    (i.e., an increase in the years of schooling is
    associated with an increase in income)
  • A correlation of 0 indicates a lack of
    relationship between the variables
  • Generally denoted by the Greek letter ρ (rho)
  • Pearson correlation: used when the variables are
    normally distributed
  • Spearman correlation: used when the variables
    aren't normally distributed

4
Some Remarks
  • In practice, we rarely see perfect positive or
    negative correlations (i.e., correlations of
    exactly 1 or -1)
  • Correlations higher than 0.6 (or lower than -0.6)
    are generally considered strong
  • There might be confounding factors that explain a
    strong positive or negative correlation between
    variables
  • E.g., volume of ice cream consumption might be
    correlated with crime rates. Why?
  • Both tend to be high when the temperatures are
    warmer!
  • The correlation between two seemingly unrelated
    variables does not always equal exactly zero
    (although it will often be close to it)

5
Correlation does not imply causation!
Source: http://imgs.xkcd.com/comics/correlation.png
6
Regression
  • A statistical method used to examine the
    relationship between a variable of interest
    (dependent variable) and one or more explanatory
    variables (predictors)
  • Strength of the relationship
  • Direction of the relationship (positive,
    negative, zero)
  • Goodness of model fit
  • Allows you to calculate the amount by which your
    dependent variable changes when a predictor
    variable changes by one unit (holding all other
    predictors constant)
  • Often referred to as Ordinary Least Squares (OLS)
    regression
  • Regression with one predictor is called simple
    regression
  • Regression with two or more predictors is called
    multiple regression
  • Available in all statistical packages
  • Just as with correlation, the fact that an
    explanatory variable is a significant predictor
    of the dependent variable doesn't imply that it
    is a cause of the dependent variable

7
Example
  • Assume we have data on median income and median
    house value in 381 Philadelphia census tracts
    (i.e., our unit of analysis is the tract)
  • Each of the 381 tracts has information on income
    (call it Y) and on house value (call it X). So,
    we can create a scatter-plot of Y against X.
  • Through this scatter plot, we can calculate the
    equation of the line that best fits the pattern
    (recall Y = mX + b, where m is the slope and b is
    the y-intercept)
  • This is done by finding a line such that the sum
    of the squared (vertical) distances between the
    points and the line is minimized
  • Hence the term ordinary least squares
  • Now, we can examine the relationship between
    these two variables
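
A minimal sketch of this single-predictor fit in Python with statsmodels; the data frame and its column names (income, house_value) are assumptions, not the presenter's actual Philadelphia data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical tract-level data (in practice, 381 rows, one per tract)
df = pd.DataFrame({
    "house_value": [60000, 85000, 120000, 150000, 95000, 200000],
    "income":      [28000, 35000, 48000, 61000, 40000, 75000],
})

# Add the intercept term (the 'b' in Y = mX + b)
X = sm.add_constant(df["house_value"])

# OLS finds the line minimizing the sum of squared vertical distances
model = sm.OLS(df["income"], X).fit()
print(model.params)  # the intercept (const) and the slope (house_value)
```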

8
[Figure: scatter plot of income (Y) against house
value (X) for the 381 tracts, with the best-fit
line - image not transcribed]
9
We can easily extend this to cases with 2 or more
predictors
  • When we have n > 1 predictors, rather than
    getting a line in 2 dimensions, we get a fitted
    surface (hyperplane) in n + 1 dimensions (the +1
    accounts for the dependent variable)
  • Each independent variable will have its own slope
    coefficient which will indicate the relationship
    of that particular predictor with the dependent
    variable, controlling for all other independent
    variables in the regression.
  • The equation of the best-fit line becomes
    Y = β0 + β1X1 + β2X2 + … + βnXn + ε,
    where ε is the random error term
  • The coefficient βi of each predictor Xi may be
    interpreted as the amount by which the dependent
    variable changes as that independent variable
    increases by one unit (holding all other
    variables constant)
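
A sketch of a two-predictor fit using statsmodels' formula interface; the crime column, like the rest of the data frame, is a made-up stand-in for the example on the next slide.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical tract-level data (column names are assumptions)
df = pd.DataFrame({
    "income":      [28000, 35000, 48000, 61000, 40000, 75000],
    "house_value": [60000, 85000, 120000, 150000, 95000, 200000],
    "crime":       [420, 380, 210, 140, 300, 90],
})

# With two predictors, the fitted model is a plane in 3 dimensions
model = smf.ols("income ~ house_value + crime", data=df).fit()

# Each coefficient: change in income per one-unit change in that
# predictor, holding the other predictor constant
print(model.params)
```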

10
An Example with 2 Predictors: Income as a
Function of House Value and Crime
11
Some Basic Regression Diagnostics
  • The so-called p-value associated with the
    variable
  • For any statistical method, including regression,
    we are testing some hypothesis. In regression, we
    are testing the null hypothesis that the
    coefficient (i.e., slope) β is equal to zero
    (i.e., that the explanatory variable is not a
    significant predictor of the dependent variable).
  • Formally, the p-value is the probability of
    observing a value of β as extreme as its
    estimated value (i.e., as different from 0) when
    in reality it equals zero (i.e., when the null
    hypothesis holds). If this probability is small
    enough (generally, p < 0.05), we reject the null
    hypothesis of β = 0 in favor of the alternative
    hypothesis of β ≠ 0.
  • Again, when the null hypothesis (of β = 0) cannot
    be rejected, the dependent variable is not
    related to the independent variable.
  • The rejection of the null hypothesis (i.e., when
    p < 0.05) indicates that the independent variable
    is a statistically significant predictor of the
    dependent variable
  • One p-value per independent variable

12
Some Basic Regression Diagnostics (Cont'd)
  • The sign of the coefficient of the independent
    variable (i.e., the slope of the regression line)
  • One coefficient per independent variable
  • Indicates whether the relationship between the
    dependent and independent variables is positive
    or negative
  • We should only interpret the sign when the
    coefficient is statistically significant (i.e.,
    significantly different from zero)

13
Some Basic Regression Diagnostics (Cont'd)
  • R-squared (a.k.a. the coefficient of
    determination): the percent of variance in the
    dependent variable that is explained by the
    predictors
  • In the single-predictor case, R-squared is simply
    the square of the correlation between the
    predictor and the dependent variable
  • The more independent variables are included, the
    higher the R-squared becomes
  • Adjusted R-squared: the percent of variance in
    the dependent variable explained, adjusted for
    the number of predictors
  • One R-squared for the regression model
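
Continuing the hypothetical two-predictor sketch from slide 9, all three diagnostics can be read directly off the fitted statsmodels result.

```python
# model is the fitted OLS result from the earlier sketch
print(model.pvalues)       # one p-value per term; reject beta = 0 if p < 0.05
print(model.params)        # the signs of the coefficients
print(model.rsquared)      # R-squared: share of variance explained
print(model.rsquared_adj)  # adjusted for the number of predictors
print(model.summary())     # all of the above in a single table
```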

14
Some (but not all) regression assumptions
  • The dependent variable should be normally
    distributed (i.e., its histogram should look like
    a bell curve); strictly speaking, it is the
    regression residuals that need to be normal
  • Ideally, this will also be true of independent
    variables, but this is not essential. Independent
    variables can also be binary (i.e., have two
    values, such as 1 (yes) and 0 (no))
  • The predictors should not be strongly correlated
    with each other (i.e., no multicollinearity)
  • Very importantly, the observations should be
    independent of each other (the same holds for the
    regression residuals). If this assumption is
    violated, our coefficient estimates could be
    wrong!
  • General rule of thumb: at least 10 observations
    per independent variable
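
The no-multicollinearity assumption is commonly checked with variance inflation factors (VIFs); a sketch with statsmodels, reusing the hypothetical data frame from slide 9.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors plus an intercept column (the intercept's VIF can be ignored)
X = sm.add_constant(df[["house_value", "crime"]])

# Rule of thumb: a VIF above ~10 signals problematic multicollinearity
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```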

15
An Example of a Normal Distribution
16
Data Transformations
  • Sometimes, it is possible to transform a
    variable's distribution by subjecting it to some
    simple algebraic operation.
  • The logarithmic transformation is the most widely
    used to achieve normality when the variable is
    positively skewed (as in the left-hand image on
    the original slide)
  • Analysis is then performed on the transformed
    variable.
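
A sketch of the log transformation in Python; the data are simulated, and np.log1p (i.e., log(1 + x)) is chosen here so that zero values don't break, an assumption to revisit for real data.

```python
import numpy as np
from scipy import stats

# A hypothetical positively skewed variable (e.g., income)
rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.6, size=500)

log_income = np.log1p(income)  # log(1 + x) tolerates zeros

# Skewness should move toward 0 after the transformation
print("before:", stats.skew(income))
print("after: ", stats.skew(log_income))
# The regression would then be run on log_income, not income
```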

17
Additional Regression Methods
  • Logistic regression/Probit regression
  • When your dependent variable is binary (i.e., has
    two possible outcomes).
  • E.g., Employment Indicator (Are you employed?
    Yes/No)
  • Multinomial logistic regression
  • When your dependent variable is categorical and
    has more than two (unordered) categories
  • E.g., Race: Black, Asian, White, Other
  • Ordinal logistic regression
  • When your dependent variable is ordinal and has
    more than two (ordered) categories
  • E.g., Education (1 = Less than High School,
    2 = High School, 3 = More than High School)
  • Poisson regression
  • When your dependent variable is a count
  • E.g., Number of traffic violations (0, 1, 2, 3,
    4, 5, etc)
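
Sketches of two of these models in statsmodels, with simulated data and made-up variable names; the same add-constant pattern as in OLS applies.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=300)
X = sm.add_constant(x)

# Logistic regression: binary outcome (e.g., employed yes/no)
employed = (x + rng.normal(size=300) > 0).astype(int)
logit_res = sm.Logit(employed, X).fit(disp=False)

# Poisson regression: count outcome (e.g., number of traffic violations)
violations = rng.poisson(np.exp(0.3 * x))
pois_res = sm.Poisson(violations, X).fit(disp=False)

print(logit_res.params)
print(pois_res.params)
```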

18
Spatial Autocorrelation
  • Recall:
  • There is spatial autocorrelation in a variable if
    observations that are closer to each other in
    space have related values (Tobler's Law)
  • One of the regression assumptions is independence
    of observations. If this doesn't hold, we obtain
    inaccurate estimates of the β coefficients, and
    the error term ε contains spatial dependencies
    (i.e., meaningful information), whereas we want
    the error to be indistinguishable from random
    noise.

19
Imagine a problem with a spatial component
This example is obviously a dramatization, but
nonetheless, in many spatial problems points
which are close together have similar values
20
But how do we know if spatial dependencies exist?
  • Moran's I (1950): a rather old and perhaps the
    most widely used method of testing for spatial
    autocorrelation, or spatial dependencies
  • We can determine a p-value for Moran's I (i.e.,
    an indicator of whether spatial autocorrelation
    is statistically significant)
  • For more on Moran's I, see
    http://en.wikipedia.org/wiki/Moran%27s_I
  • Like the non-spatial correlation coefficient, it
    ranges from -1 to 1
  • Can be calculated in ArcGIS (a sketch with
    open-source tools follows below)
  • Other commonly used indices of spatial
    autocorrelation include:
  • Geary's c (1954)
  • Getis and Ord's G-statistic (1992)
  • For non-negative values only
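
Outside ArcGIS, Moran's I can also be computed with the open-source PySAL stack; a minimal sketch assuming a polygon shapefile and an income column, both hypothetical.

```python
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

# Hypothetical tract polygons (the file name is a placeholder)
tracts = gpd.read_file("phila_tracts.shp")

# Queen contiguity: tracts sharing a border or corner are neighbors
w = Queen.from_dataframe(tracts)
w.transform = "r"  # row-standardize the weights

mi = Moran(tracts["income"], w)
print(mi.I, mi.p_sim)  # the statistic and a permutation-based p-value
```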

21
So, when a problem has a spatial component, we
should
  • Run the non-spatial regression
  • Test the regression residuals for spatial
    autocorrelation, using Moran's I or some other
    index
  • If no significant spatial autocorrelation exists,
    STOP. Otherwise, if the spatial dependencies are
    significant, use a special model which takes
    spatial dependencies into account.
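
A sketch of this procedure with PySAL's spreg package, which can report Moran's I on the OLS residuals as part of its diagnostics; it reuses the hypothetical tracts data and weights w from the previous sketch, and the 0.05 cutoff is the usual convention.

```python
from spreg import OLS

# y as an (n, 1) array, X as an (n, k) array of predictors
y = tracts[["income"]].values
X = tracts[["house_value", "crime"]].values

# spat_diag=True and moran=True add spatial diagnostics to the output
ols = OLS(y, X, w=w, spat_diag=True, moran=True,
          name_y="income", name_x=["house_value", "crime"])

moran_i, z, p = ols.moran_res
if p < 0.05:
    print("Residuals are spatially autocorrelated: fit a spatial model")
else:
    print("No significant spatial dependence: the OLS results stand")
```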

22
Spatial Regression Models
  • A spatial lag (SL) model
  • Assumes that dependencies exist directly among
    the levels of the dependent variable
  • That is, the income at one location is affected
    by the income at the nearby locations
  • A 'lag' term, which is a specification of income
    at nearby locations, is included in the
    regression, and its coefficient and p-value are
    interpreted as for the independent variables.
  • As in OLS regression, we can include independent
    variables in the model.
  • Whereas we will see spatial autocorrelation in
    OLS residuals, the SL model should account for
    the spatial dependencies, so the SL residuals
    would not be autocorrelated
  • Hence the SL residuals should not be
    distinguishable from random noise (i.e., should
    have no consistent patterns or dependencies in
    them)
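
A sketch of the spatial lag model with spreg's maximum-likelihood estimator, reusing the hypothetical y, X, and w from the previous sketch.

```python
from spreg import ML_Lag

# The lag term (rho times the spatially lagged dependent variable)
# is constructed from w and added by the estimator itself
lag = ML_Lag(y, X, w=w, name_y="income",
             name_x=["house_value", "crime"])

print(lag.rho)      # coefficient on spatially lagged income
print(lag.betas)    # intercept, predictor coefficients, and rho
print(lag.summary)  # full output with z-statistics and p-values
```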

23
OLS Residuals vs. SL Residuals
[Two maps: the OLS residuals show non-random
patterns and clustering, while the SL residuals
resemble random noise]
24
But how is spatial proximity defined?
  • For each point (or areal unit), we need to
    identify its spatial relationship with all the
    other points (or areal units). This can be done
    by looking at the (inverse of the) distance
    between each pair of points, or in a number of
    other ways
  • A binary indicator stating whether two points (or
    census tract centroids) are within a certain
    distance of each other (1 = yes, 0 = no)
  • A binary indicator stating whether point A is one
    of the ___ (1, 5, 10, 15, etc.) nearest neighbors
    of point B (1 = yes, 0 = no)
  • For areal datasets, the proportion of the
    boundary that zone 1 shares with zone 2, or
    simply a binary indicator of whether zones 1 and
    2 share a border (1 = yes, 0 = no)
  • Etc., etc., etc.
  • When we have n observations, we form an n x n
    table (called a weight matrix or a link matrix)
    which summarizes all the pairwise spatial
    relationships in the dataset
  • These weight matrices are used in the estimation
    of spatial regression models (and in the
    calculation of Moran's I).
  • Unless we have compelling reasons not to do so,
    it's generally a good idea to see whether our
    results hold with different types of weight
    matrices; several are sketched below
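
Sketches of several of these definitions in libpysal, reusing the hypothetical tracts GeoDataFrame; the k and threshold values are arbitrary placeholders.

```python
from libpysal.weights import Queen, KNN, DistanceBand

# Contiguity: 1 if two tracts share a boundary, 0 otherwise
w_queen = Queen.from_dataframe(tracts)

# k-nearest neighbors: 1 if B is among A's 5 nearest centroids
w_knn = KNN.from_dataframe(tracts, k=5)

# Distance band: 1 if centroids lie within 2000 map units of each other
w_dist = DistanceBand.from_dataframe(tracts, threshold=2000, binary=True)

# Good practice: re-run the analysis with each and compare the results
```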

25
Assume we have a map with 10 Census tracts
The hypothetical weight matrix below indicates
whether any given Census tract shares a boundary
with another tract: 1 means yes and 0 means no.
For instance, tracts 3 and 6 do share a boundary,
as indicated by the 1s in row 3, column 6 and in
row 6, column 3.
Tract 1 2 3 4 5 6 7 8 9 10
1 0 1 0 0 1 0 0 0 0 1
2 1 0 0 1 0 1 0 0 0 0
3 0 0 0 1 1 1 1 0 0 0
4 0 1 1 0 1 0 1 0 1 0
5 1 0 1 1 0 1 1 1 0 0
6 0 1 1 0 1 0 1 0 0 0
7 0 0 1 1 1 1 0 0 0 1
8 0 0 0 0 1 0 0 0 1 1
9 0 0 0 1 0 0 0 1 0 1
10 1 0 0 0 0 0 1 1 1 0
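
For a small hand-specified matrix like this one, the full array can be wrapped into a PySAL weights object directly; a sketch assuming libpysal's full2W helper.

```python
import numpy as np
from libpysal.weights import full2W

# The 10 x 10 contiguity matrix from the slide, row by row
M = np.array([
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1, 1, 0],
])

w = full2W(M, ids=list(range(1, 11)))
print(w.neighbors[3])  # tracts adjacent to tract 3: [4, 5, 6, 7]
```
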
26
Now, we need a software package that can
  • Run the good old OLS regression model
  • Create a weight matrix
  • Test for spatial autocorrelation in OLS residuals
  • Run a spatial lag model (or some other spatial
    model)
  • Such packages do exist!

27
GeoDa
  • A software package developed by Luc Anselin
  • Can be downloaded free of charge (for members of
    educational and research institutions) at
    https://www.geoda.uiuc.edu/
  • Has a user-friendly interface
  • Accepts ESRI shapefiles as inputs
  • Is able to perform a number of basic GIS
    operations in addition to running sophisticated
    spatial statistics models

28
Other Spatial Regression Models
  • Spatial Error (can be implemented in GeoDa)
  • Geographically-Weighted Regression (can be run in
    ArcGIS 9.3)
  • These methods also aim to account for spatial
    dependencies in the data

29
Some References: Spatial Regression
  • Bailey, T.C. and Gatrell, A.C. (1995).
    Interactive Spatial Data Analysis. Addison Wesley
    Longman, Harlow, Essex.
  • Cressie, N.A.C. (1993). Statistics for Spatial
    Data (Revised Edition). John Wiley & Sons, Inc.
  • LeSage, J. and Pace, R.K. (2009). Introduction to
    Spatial Econometrics. CRC Press/Taylor & Francis
    Group.