Title: Spatial Data Analysis of Areas: Regression
1Spatial Data Analysis of Areas Regression
2Introduction
- Basic Idea
- Dependent variable (Y) determined by independent variables X1, X2, ... (e.g., $Y = mX + b$).
- Uses of regression
- Description
- Control
- Prediction
3Simple Linear Regression
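Basic model
- $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- $Y_i$: value of the dependent variable on trial i
- $\beta_0$, $\beta_1$: unknown parameters
- $X_i$: value of the independent variable on trial i
- $\varepsilon_i$: ith error term (unexplained variation), where
- $E(\varepsilon_i) = 0$,
- $\sigma^2(\varepsilon_i) = \sigma^2$
- error terms are $N(0, \sigma^2)$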
4Multiple Regression
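Basic model: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i$
- $Y_i$ is the ith observation of the dependent variable
- $\beta_0, \beta_1, \dots, \beta_p$ are parameters
- $X_{i1}, \dots, X_{ip}$ are observations of the independent variables
- $\varepsilon_i$ are independent and normal
Estimated model: $\hat{Y}_i = b_0 + b_1 X_{i1} + \dots + b_p X_{ip}$
ith residual: $e_i = Y_i - \hat{Y}_i$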
5Sometimes we need to transform the data
Scatter plots: (a) Y versus PORC3_NR (percentage of large farms, by number); (b) log10 Y versus log10 PORC3_NR.
Predicted versus observed plots: (a) model with untransformed variables, $R^2 = 0.61$; (b) Model 7, $R^2 = 0.85$.
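A minimal sketch of why a log-log transform can help, using made-up data that follow a power law (the values below are illustrative assumptions, not the deforestation data from the slides). A relation that is curved in the original units becomes a straight line after taking log10 of both variables, so an ordinary linear fit recovers the exponent as the slope.

```python
import numpy as np

# Hypothetical data following y = 3 * x**2 (not taken from the slides).
x = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0])
y = 3.0 * x ** 2

# After the transform the model is linear: log10(y) = a + b * log10(x).
log_x, log_y = np.log10(x), np.log10(y)
A = np.column_stack([np.ones_like(log_x), log_x])
(a, b), *_ = np.linalg.lstsq(A, log_y, rcond=None)

# b estimates the exponent, and 10**a estimates the multiplicative constant.
print(f"intercept a = {a:.4f}, slope b = {b:.4f}")
```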
6Precision of estimates and fit
- Analysis of variation
- Sum of squares of Y = sum of squares of estimate + sum of squares of residuals: $TSS = ESS + RSS$
- Dividing both sides by TSS (sum of squares of Y):
- $1 = ESS/TSS + RSS/TSS$
- where $ESS/TSS = r^2$ (coefficient of determination)
- $r^2$ gives the proportion of total variation explained by the sample regression equation.
- The closer $r^2$ is to 1.00, the better the fit.
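The decomposition of variation can be checked numerically. The sketch below fits a straight line by least squares to a small, made-up sample (the numbers and the function name `variance_decomposition` are illustrative assumptions) and computes TSS, ESS, RSS and r2.

```python
import numpy as np

def variance_decomposition(x, y):
    """Fit y = b0 + b1*x by least squares and return (TSS, ESS, RSS, r2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ coef
    tss = np.sum((y - y.mean()) ** 2)      # total variation of Y
    ess = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the fit
    rss = np.sum((y - y_hat) ** 2)         # variation left in the residuals
    return tss, ess, rss, ess / tss

# Hypothetical sample, roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
tss, ess, rss, r2 = variance_decomposition(x, y)
print(f"TSS={tss:.3f}  ESS={ess:.3f}  RSS={rss:.3f}  r2={r2:.4f}")
```

The identity TSS = ESS + RSS holds here because the model includes an intercept and is fitted by least squares.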
7Analysis of Residuals
- It is a good idea to plot the residuals against the independent variables to see if they show a trend.
- Possible behaviors
- Correlation (e.g., the higher the independent variable, the higher the residual)
- Nonlinearity
- Heteroskedasticity (i.e., the variance of the residual increases or decreases with the independent variable).
- Regression assumes that residuals have constant variance and are normally distributed.
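As a rough numerical companion to the residual plot, one can look at how the absolute residual changes with the independent variable. The sketch below is only a crude indicator, not a formal test. The data are constructed on purpose so that the disturbance grows with x, and the function name `residuals_of_line_fit` is an assumption made for illustration.

```python
import numpy as np

def residuals_of_line_fit(x, y):
    """Residuals of the least-squares line y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Hypothetical data: the size of the disturbance is proportional to x.
x = np.arange(1.0, 41.0)
signs = np.where(np.arange(x.size) % 2 == 0, -1.0, 1.0)
y = 1.0 + 2.0 * x + signs * 0.1 * x

e = residuals_of_line_fit(x, y)

# Correlation between |residual| and x: values far from zero suggest that
# the residual spread changes with x (heteroskedasticity).
spread = np.corrcoef(x, np.abs(e))[0, 1]
print(f"corr(x, |residual|) = {spread:.3f}")
```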
8Good Residual Plot
9Nonlinearity
(Plot of residual versus X showing a nonlinear pattern; X axis from 0 to 60, residual axis from -0.15 to 0.25.)
10Heteroskedasticity
(Plot of residual versus X; X axis from 0 to 60, residual axis from -1 to 1.)
11Regression with Spatial Data: Understanding Deforestation in Amazonia
12The forest...
14The rains...
15The rivers...
16Deforestation...
17Fire...
18Fire...
19Amazon Deforestation 2003
Deforestation 2002/2003
Deforestation until 2002
Source: INPE PRODES Digital, 2004.
20What Drives Tropical Deforestation?
Legend: % of the cases (5, 10, 50)
Underlying factors driving proximate causes
Causative interlinkages at proximate/underlying levels
Internal drivers
If less than 5% of cases, not depicted here.
Source: Geist & Lambin
21 1973
22 1991
Courtesy INPE/OBT
23 1999
Courtesy INPE/OBT
24Deforestation in Amazonia
PRODES (Total 1997): 532,086 km²
PRODES (Total 2001): 607,957 km²
25Modelling Tropical Deforestation
Coarse 100 km x 100 km grid
Fine 25 km x 25 km grid
26Amazônia in 2015?
Source: Aguiar et al., 2004
27Factors Affecting Deforestation
28Coarse resolution candidate models
29Coarse resolution Hot-spots map
30Modelling Deforestation in Amazonia
- High coefficients of multiple determination were obtained for all models built ($R^2$ from 0.80 to 0.86).
- The main factors identified were:
- Population density
- Connection to national markets
- Climatic conditions
- Indicators related to land distribution between large and small farmers
- The main current agricultural frontier areas, in Pará and Amazonas States, where intense deforestation is now taking place, were correctly identified as hot-spots of change.
31Spatial regression models
32Spatial regression
- Specifying the Structure of Spatial dependence
- which locations/observations interact
- Testing for the Presence of Spatial Dependence
- what type of dependence, what is the alternative
- Estimating Models with Spatial Dependence
- spatial lag, spatial error, higher order
- Spatial Prediction
- interpolation, missing values
Source: Luc Anselin
33Nonspatial regression
- Objective
- Predict the behaviour of a response variable, given a set of known factors (explanatory variables).
- Multivariate nonspatial models
- $y_k = \beta_0 + \beta_1 x_{1k} + \dots + \beta_i x_{ik} + \varepsilon_k$
- $y_k$: estimate of response variable for object k
- $\beta_i$: regression coefficient for factor i
- $x_{ik}$: explanatory variable i for region k
- $\varepsilon_k$: random error
- Goodness of fit
- $R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
34Nonspatial regression hypotheses
- $Y = X\beta + \varepsilon$ (model)
- Explanatory variables are linearly independent
- $Y$: vector of samples of the response variable (n x 1)
- $X$: matrix of explanatory variables (n x k)
- $\beta$: coefficient vector (k x 1)
- $\varepsilon$: error vector (n x 1)
- $E(\varepsilon_i) = 0$ (expected value)
- $\varepsilon_i \sim N(0, \sigma_i^2)$ (normal distribution)
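In matrix form, the least-squares estimate solves the normal equations (X'X) b = X'Y, which is why the explanatory variables must be linearly independent: otherwise X'X is singular. A minimal sketch, with a hypothetical helper `ols` and made-up noise-free data so the true coefficients are recovered exactly:

```python
import numpy as np

def ols(X, y):
    """Return the coefficient vector solving the normal equations (X'X) b = X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The hypothesis of linearly independent explanatory variables
    # corresponds to X having full column rank.
    if np.linalg.matrix_rank(X) < X.shape[1]:
        raise ValueError("columns of X are linearly dependent")
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical design matrix (first column is the intercept) and coefficients.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 2, 2],
              [1, 3, 1],
              [1, 4, 3]], dtype=float)
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true            # no error term, for illustration only

beta_hat = ols(X, y)
print(beta_hat)
```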
35Generalized linear models
- $g(Y) = X\beta + U$
- Response is some function of the explanatory variables
- $g(\cdot)$ is a link function
- Example: logarithm function
- $U$: error vector
- $E(U) = 0$ (expected value)
- $E(UU^T) = C$ (covariance matrix)
- if $C = \sigma^2 I$, the error is homoskedastic
36Spatial regression
- Spatial effects
- What happens if the original data is spatially autocorrelated?
- The results will be influenced, showing statistical association where there is none
- How can we evaluate the spatial effects?
- Measure the spatial autocorrelation (Moran's I) of the regression residuals
37Regression using spatial data
- Try a linear model first
- Fit the model and calculate residuals
- Are the residuals spatially autocorrelated?
- No: we're OK
- Yes: the nonspatial model will be biased and we should propose a spatial model
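The residual check can be sketched with a direct implementation of global Moran's I, I = (n / S0) (z'Wz) / (z'z), where z are the residuals centred on their mean and S0 is the sum of all weights. The chain of six areas, the binary contiguity weights and the residual values below are illustrative assumptions. A real analysis would build W from the map and assess significance, for example with a permutation test.

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I of `values` for a spatial weights matrix W."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    W = np.asarray(W, dtype=float)
    s0 = W.sum()                      # sum of all weights
    return (z.size / s0) * (z @ W @ z) / (z @ z)

def chain_weights(n):
    """Binary contiguity for n areas in a row (each touches its neighbours)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W

W = chain_weights(6)
clustered = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])    # similar residuals side by side
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # neighbours always differ

i_clustered = morans_i(clustered, W)      # positive spatial autocorrelation
i_alternating = morans_i(alternating, W)  # negative spatial autocorrelation
print(i_clustered, i_alternating)
```

A clearly positive value for the residuals is the "Yes" branch above: the nonspatial model has left spatial structure unexplained.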
38Spatial dependence
- Estimating the Form/Extent of Spatial Interaction
- substantive spatial dependence
- spatial lag models
- Correcting for the Effect of Spatial Spill-overs
- spatial dependence as a nuisance
- spatial error models
Source: Luc Anselin
39Spatial dependence
- Substantive Spatial Dependence
- lag dependence
- include Wy as explanatory variable in regression
- $y = \rho W y + X\beta + \varepsilon$
- Dependence as a Nuisance
- error dependence
- non-spherical error variance
- $E[\varepsilon\varepsilon'] = \Omega$
- where $\Omega$ incorporates the dependence structure
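To see what the lag term implies, note that y appears on both sides of y = rho W y + X beta + e, so the model has to be solved for y: y = (I - rho W)^(-1) (X beta + e). The sketch below simulates one draw from this reduced form. The chain of five areas, the row-standardised W, rho = 0.5 and the coefficients are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical neighbourhood structure: five areas in a row.
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardise: each row sums to 1

rho = 0.5                               # assumed strength of the spatial lag
beta = np.array([1.0, 2.0])             # assumed coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(scale=0.1, size=n)

# Reduced form: solve (I - rho W) y = X beta + eps for y.
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# The simulated y satisfies the structural equation with Wy on the right-hand side.
print(np.allclose(y, rho * W @ y + X @ beta + eps))
```

Because each y value depends on the errors of its neighbours through the inverse matrix, Wy is correlated with the error term, which is one way to see why treating it as an ordinary regressor in OLS is problematic.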
40Interpretation of spatial lag
- True Contagion
- related to economic-behavioral process
- only meaningful if areal units are appropriate (ecological fallacy)
- interesting economic interpretation (substantive)
- Apparent Contagion
- scale problem, spatial filtering
Source: Luc Anselin
41Interpretation of Spatial Error
- Spill-Over in Ignored Variables
- poor match of process with unit of observation or level of aggregation
- apparent contagion: regional structural change
- economic interpretation less interesting: nuisance parameter
- Common in Empirical Practice
Source: Luc Anselin
42Cost of ignoring spatial dependence
- Ignoring Spatial Lag
- omitted variable problem
- OLS estimates biased and inconsistent
- Ignoring Spatial Error
- efficiency problem
- OLS still unbiased, but inefficient
- OLS standard errors and t-tests biased
Source: Luc Anselin
43Spatial regression models
- Incorporate spatial dependency
- Spatial lag model
- Two explanatory terms
- The first is the dependent variable observed in the neighborhood (the spatial lag)
- The second is the set of other explanatory variables
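The neighbourhood term is usually computed as the average of the variable over each area's neighbours, which corresponds to multiplying by a row-standardised weights matrix. A small sketch, with a hypothetical neighbour list for four areas in a row and made-up values:

```python
import numpy as np

# Hypothetical adjacency: area index -> indices of its neighbours.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
y = np.array([1.0, 2.0, 3.0, 4.0])

def spatial_lag(values, neighbours):
    """Mean of the neighbouring values for each area (row-standardised W times y)."""
    return np.array([values[neighbours[i]].mean() for i in range(len(values))])

Wy = spatial_lag(y, neighbours)
print(Wy)   # one neighbourhood average per area
```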
44Spatial regimes
- Extension of the non-spatial regression model
- Considers clusters of areas
- Groups each cluster in a different explanatory variable
- $y_i = \beta_0 + \beta_1 x_1 + \dots + \beta_i x_i + \varepsilon_i$
- Gets different parameters for each cluster
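One simple way to obtain different parameters per cluster is to fit the same regression separately within each regime. The sketch below assumes the regimes are already known and uses made-up, noise-free data. The function name `fit_regimes` and the labels "A" and "B" are illustrative only.

```python
import numpy as np

def fit_regimes(x, y, regime):
    """Fit y = b0 + b1*x separately within each regime label."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    regime = np.asarray(regime)
    params = {}
    for label in np.unique(regime):
        mask = regime == label
        X = np.column_stack([np.ones(mask.sum()), x[mask]])
        coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        params[str(label)] = coef      # (intercept, slope) for this regime
    return params

# Hypothetical data: regime A follows y = 1 + 2x, regime B follows y = 5 - x.
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
regime = np.array(["A"] * 4 + ["B"] * 4)
y = np.where(regime == "A", 1.0 + 2.0 * x, 5.0 - x)

params = fit_regimes(x, y, regime)
print(params)
```

A single pooled regression on these data would average the two opposite slopes, which is the situation that spatial regimes are meant to avoid.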
45A study of the spatially varying relationship
between homicide rates and socio-economic data of
São Paulo using GWR
Frederico Roman Ramos CEDEST/Brasil
46Geographically Weighted Regression
- Extension of the traditional regression model in which the parameters are estimated locally
- $y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i) x_{ik} + \varepsilon_i$
- $(u_i, v_i)$ are the geographical coordinates of point i.
- The betas vary in space (each location has a different coefficient)
- We estimate an ordinary regression for each point, in which the neighbours have more weight
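The idea of "one regression per point, with nearer observations weighted more" can be sketched as a weighted least-squares fit at every location. The Gaussian kernel, the fixed bandwidth and the function name `gwr_fit` are assumptions for illustration. The slides do not say which kernel or software settings were used, and the data below are simulated.

```python
import numpy as np

def gwr_fit(coords, X, y, bandwidth):
    """Local coefficients: one weighted least-squares fit per location."""
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    betas = np.empty((n, k))
    for i in range(n):
        # Distance from location i to every observation.
        d = np.linalg.norm(coords - coords[i], axis=1)
        # Gaussian kernel (assumed): weight decays smoothly with distance.
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        XtW = X.T * w                  # scales observation j by its weight w[j]
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return betas

# Simulated example: the slope of y on x increases from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(50, 2))
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x])
y = (1.0 + 2.0 * coords[:, 0]) * x

betas = gwr_fit(coords, X, y, bandwidth=0.2)
print(betas[:5])   # local (intercept, slope) at the first five locations
```

With a very large bandwidth all weights approach 1 and every local fit approaches the global OLS fit, so the bandwidth controls how local the estimates are.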
47Introducing São Paulo
Some numbers
Metropolitan region
- Population: 17,878,703 (IBGE, 200)
- 39 municipalities
Municipality of São Paulo
- Population: 10,434,252
- HDI_M: 0.841 (PNUD, 2000)
- 96 districts
- IEX: 74 out of 96 districts were classified as socially excluded (CEDEST, 2002)
- 4,637 homicide victims in 2001
48Data
4,637 homicide victims, geocoded by residence address (2001)
456 Census Sample Tracts (2000)
49Density surface of victim-based homicides
50Victim-based homicide rate (Tx_homic)
$\text{Tx\_homic} = \dfrac{\text{count of homicide events (2001)}}{\text{population (census, 2000)}} \times 100{,}000$
51LISA Victim-based homicide rate
52Percentage of illiterate house-head (Xanlf)
Definition: the house-head is the person responsible for the household; generally, but not necessarily, the person with the highest income in the household.
53LISA Percentage of illiterate house-head
54OLS regression results for TX_homic and X_analf
55OLS regression results for TX_homic and X_analf
56LISA for standardized residuals of the OLS
regression for TX_homic and X_analf
Moran's I = 0.2624
57GWR regression results for TX_homic and Xanlf
GWR ESTIMATION
Fitting Geographically Weighted Regression Model...
Number of observations............ 456
Number of independent variables... 2 (Intercept is variable 1)
Bandwidth (in data units)......... 0.0246524516
Number of locations to fit model.. 456
Diagnostic information...
Residual sum of squares........... 111179.875
Effective number of parameters.... 83.1309998
Sigma............................. 17.2677182
Akaike Information Criterion...... 4007.32139
Coefficient of Determination...... 0.699720224
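The diagnostics can be cross-checked with a little arithmetic. Assuming that Sigma is the residual standard error computed as sqrt(RSS / (n - effective number of parameters)), which is an assumption about the software's definition rather than something stated on the slide, the reported values are mutually consistent:

```python
import math

# Values copied from the GWR output above.
n_obs = 456
rss = 111179.875                 # residual sum of squares
effective_params = 83.1309998    # effective number of parameters

# Assumed definition: residual standard error with effective degrees of freedom.
sigma = math.sqrt(rss / (n_obs - effective_params))
print(round(sigma, 4))           # about 17.2677, in line with the reported Sigma
```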
58GWR regression results for TX_homic and Xanlf
residuals
Moran's I = -0.0303
59GWR regression results for TX_homic and Xanlf
Local Beta1
Local t-value
60CONCLUSIONS
- There are significant differences in the relationship between violence rates and social territorial data across the intra-urban area of São Paulo
- These results reinforce our hypothesis that we should avoid using general concepts
- The GWR technique is a useful instrument in social territorial analysis