Title: Spatial Statistics: Relationships
Lecture 3
- Spatial Statistics: Relationships
Relationship statements
- Male moderate drinkers are less likely to suffer from insulin-dependent diabetes than non-drinkers.
- A better economy has more potential for people to be employed.
- The number of watches someone wears is directly proportional to the number of arms they have.
- Mountains cause rainfall. Smoking causes cancer.
- Girls are better than boys, so there.
- The number of teapots in China has no effect on the frequency of volcanic eruptions in Italy.
This lecture
- Correlation: how much do two variables vary together?
- Regression: what is their relationship?
- Spatial autocorrelation and cross-correlation.
- Semi-variograms.
- Geographically Weighted Regression.
Correlation
- As one variable changes, how closely do others follow it?
- Usually represented on graphs.
- Can plot unrelated pairs of data from datasets of different sizes (q-q plots): rank both datasets and plot the 10th-percentile value against the 10th-percentile value, the 90th against the 90th, and so on.
- Or plot linked pairs of data in scatterplots.
- Correlation can be positive or negative.
Positive correlations
- Attractiveness of chosen gender vs. alcohol intake.
- Bus trip time from Headingley vs. importance of travel reason.
- Ash's cumulative Pokémon losses vs. matches played.
Negative correlations
- Ability to perform with chosen gender vs. alcohol intake.
- Money vs. clubs visited.
- Will to live vs. time in statistics lectures.
Correlation is one of the most useful and used statistical techniques.
- Correlation is an essential part of science.
- Correlation is an essential part of politics.
Examples
Correlation is one of the most abused statistical techniques.
- Correlation is an essential part of dodgy science.
- Correlation is an essential part of political misinformation.
- Data can be selectively correlated.
- There is no cause-and-effect link just because two variables correlate.
Examples
What can we do about this?
- Tricky, but one start is to build convincing cause-and-effect models that demonstrate the same behavior.
- This gives us something concrete to investigate.
- But we then have to test our predictions.
Correlation can be strong or weak.
How do we measure correlation?
- We use correlation coefficients.
- These are usually denoted r and vary between -1 and 1.
  - -1: very strong negative correlation.
  - 0: no correlation.
  - 1: very strong positive correlation.
Which one depends on the type of data
- Parametric tests: used for data that is
  - Interval or ratio.
  - Normally distributed.
  - From sample populations with the same standard deviations.
- Non-parametric tests: used for all other data, including
  - Ranked data.
  - Categorized data.
One parametric test: Pearson's Correlation Coefficient
- The idea is to calculate the average covariance: how much one variable varies as the other varies.
- Deviation = value - mean.
- The product of the two variables' deviations gives a measure of covariance:
  (valueOne - meanOne) x (valueTwo - meanTwo)
- If both variables' values are far from the mean, the product is large. If one deviation is large and the other small, the product will be smaller.
Pearson's Correlation Coefficient
- Pearson's correlation coefficient r is the sum of these products, normalised by the standard deviations.
- The simplest way of calculating this is:

  r = \frac{\sum xy / n - \bar{x}\,\bar{y}}{s_x s_y}

- where x and y are the samples, \bar{x} and \bar{y} the sample means, s_x and s_y the sample standard deviations, and n the sample size.
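A minimal sketch of this calculation in Python (numpy assumed; the data values below are invented for illustration):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: mean of the products of paired values, minus the
    product of the means, normalised by the two standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    covariance = np.sum(x * y) / n - x.mean() * y.mean()
    return covariance / (x.std() * y.std())   # population (n) std devs

# Example: strongly positively correlated data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(pearson_r(x, y))            # close to 1
print(np.corrcoef(x, y)[0, 1])    # numpy's own version, for comparison
```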
One non-parametric test: Spearman Rank Correlation Coefficient
- Given x and y sample pairs, we convert the xs into their rank among all the xs, and the ys into their rank among all the ys.
- Spearman's coefficient is then calculated using:

  r_s = 1 - \frac{6 \sum d^2}{n^3 - n}

- where d is the difference between the ranks for any given pair (a measure of the covariance).
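A hedged sketch of the same in Python, assuming no tied values (ties need averaged ranks); the sample data is invented:

```python
import numpy as np

def spearman_rs(x, y):
    """Spearman's rank correlation: rank both samples, then apply
    r_s = 1 - 6*sum(d^2) / (n^3 - n), where d is the rank difference."""
    x, y = np.asarray(x), np.asarray(y)
    # argsort of argsort gives 0-based ranks (assumes no tied values).
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = rank_x - rank_y
    n = len(x)
    return 1 - 6 * np.sum(d**2) / (n**3 - n)

print(spearman_rs([3, 1, 4, 15, 9], [30, 12, 40, 160, 85]))  # 1.0: same ordering
```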
Testing the significance of the coefficients
- When the data can be assumed normal, we can test the null hypothesis that there is no correlation (r = 0) using the following statistic:

  t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}

- which has a t distribution with n - 2 degrees of freedom.
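As a sketch, assuming scipy is available for the t distribution (the r and n values below are invented):

```python
import numpy as np
from scipy import stats

def correlation_p_value(r, n):
    """Two-tailed p-value for the null hypothesis r = 0, assuming
    normal data, using t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 d.o.f."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# r = 0.75 from only 10 samples: significant at the 5% level?
print(correlation_p_value(0.75, 10))   # ~0.012, so yes
```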
Problem correlations
- Bizarre.
- Strong but non-linear: with many non-linear relationships we can transform the data to a linear form. For example, exponential data can be made linear by taking the natural log of the data.
- Very bizarre.

[Figures: example scatterplots for each case.]
Regression
- Quantifying the relationship between two or more variables.
- Linear regression with two variables: we aim to produce a single line that quantifies the relationship.

[Figure: scatterplot of the dependent variable (y) against the independent variable (x), with a fitted line of intercept a, rise Δy and run Δx.]

The equation for such a line is y = a + bx, where b is the slope (Δy/Δx on the figure). We can use this line to predict new values given an independent value.
Finding the regression line
- We take the line that minimizes some measure of how well the line fits the data.
- In the case of two-variable linear regression, we try to minimize the deviations between the data and the line, or residuals.

The coefficients of such a line are given by:

  b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}
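A minimal sketch of these two formulas in Python (numpy assumed; data invented):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y = a + b*x from the slide's formulas:
    b = sum((x-xbar)(y-ybar)) / sum((x-xbar)^2),  a = ybar - b*xbar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    b = np.sum(dx * dy) / np.sum(dx * dx)
    a = y.mean() - b * x.mean()
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = fit_line(x, y)
print(a, b)           # intercept and slope
print(a + b * 6.0)    # predict y for a new independent value
```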
How much the line explains the data
- The sum of the squared residuals gives us a measure of how much of the data is not explained by the line.
- This value, divided by the total variation in the data (the sum of the squared deviations from the mean), gives the fraction of the data the line fails to match.
- One minus this gives how well it matches: the coefficient of determination.
- Conveniently, this value is the square of the correlation coefficient r, and is therefore also known as r².
- Thus, the significance test for the r value also gives us the significance of our line.
Multiple regression
- We can still do regression when there is more than one independent variable.
- For example, in the case of three variables (two independent) we are looking for a solution sheet, not a line.

[Figure: plane fitted through points plotted against axes x1, x2 and y.]

We can do the same thing with a computer for as many dimensions as we like, but more than three become hard to visualize as graphs. We're essentially trying to fit a line with the equation:

  y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots
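A sketch of a two-independent-variable fit via least squares in numpy; the data is synthetic, built so the true answer is known:

```python
import numpy as np

# Synthetic data: y = 1 + 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 0.1, 50)

# Design matrix: a column of ones for a, then one column per variable.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # approximately [1, 2, -3] = (a, b1, b2)
```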
Polynomial regression
- In some cases we may want to fit a curve through non-linear data in multi-variable space.
- For one independent variable, the equation (a type known as a polynomial) for the curve is:

  y = a + b_1 x + b_2 x^2 + b_3 x^3

- Excel, for example, will fit this for you.
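The same fit is also a one-liner in numpy; a sketch with invented data:

```python
import numpy as np

# Fit y = a + b1*x + b2*x^2 + b3*x^3 by least squares.
x = np.linspace(0, 5, 20)
y = 2 - x + 0.5 * x**3 + np.random.default_rng(1).normal(0, 0.5, 20)

coeffs = np.polyfit(x, y, deg=3)    # coefficients, highest power first
print(coeffs)                       # roughly [0.5, 0, -1, 2]
print(np.polyval(coeffs, 2.5))      # predict y at x = 2.5
```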
Polynomial curve fitting
- The degree of the polynomial is the power of the last term. Higher-degree curves fit the sample data better.
- However, we've seen that a sample doesn't necessarily have the same distribution as the population.
- Our curve should reflect a general population model, not the sample data with all its measurement and random errors.
- When we look at AI techniques we'll see that predictive models based on data can become less accurate about the world as they increasingly match our samples and not the population.
- Therefore we have to make a judgement as to the polynomial degree, and not necessarily pick the highest.
Summary
- Correlation measures covariance, but doesn't say anything about causal relationships.
- We can measure correlation and test its significance.
- We can quantify relationships using regression equations and use these to predict.
Spatial autocorrelation
- One of the major issues in dealing with geographical data.
- The idea that values at one point may be correlated with values of the same variable nearby (or cross-correlated with another variable nearby).
- Geodemographics: people living near each other may have the same interests because they have the same opportunities and self-cluster.
- Rainfall in one geographical area stops rainfall in another.
- Crime spots cause social decay, which in turn causes more crime in a limited geographical area.
- All graded or clustered information suffers from this.
Frog averaging
- Say we want to know the average number of frogs in the country.
- We take a sample of six points.
- Three of them are normal and fall across the whole country, but three fall in a small area where there's a hidden pond.
- It's as if we've only really taken four samples.
How does this affect significance testing / correlation / regression?
- Essentially, if our data is spatially correlated, we aren't sampling as randomly as we would like in our attempts to get an overview of the population.
- i.e. some of our samples are the same (not independent) / don't count. This is the equivalent of taking a smaller sample.
- In correlation, it is possible that all our correlation is due to geography, and none to our variables.
How do we test for it?
- First, plot the data and look for geographical trends.
- In particular, plot the residuals of any regressions.

For example, the plot to the left might represent murder-rate residuals in an area after deprivation and policing levels are taken into account / regressed out. Anyone want to guess where Dr Lecter lives?

- Cluster analysis (in two weeks) looks for these kinds of trends.
But what if it's more pervasive?
- How do we test if, for example, it's a constant relationship between neighbours? Statistics that lump individuals together are useless.
- Example: ring speciation.
  - If you head east from Alaska there's no real difference between herring gulls in one area of the Arctic Ocean and the next. But the minor differences build up around the globe, so that Alaskan and Siberian gulls can't interbreed.
  - There's negative spatial autocorrelation in the fertility that you wouldn't understand if you mixed the whole population up.
- Example: factors in the spread of Ebola.
- We need a measure of the covariance between neighbours.
Plotting autocorrelations
- Imagine we had the following map of mineral deposits, showing just one variable.

[Figure: map of deposit values with a trend running NNE.]

- Obviously there is spatial autocorrelation in the NNE direction and not in the others.
h-scattergrams
- One way of displaying autocorrelation is to plot the values of points against the values of neighbours a distance h away in some direction.
- Usually the correlation will decrease with distance.
- Correlation may vary with direction as well.
Correlogram
- We can get a number of h-scatterplots for different h, and work out their correlation coefficients.
- This shows how the strength of the correlation drops off with distance from a set of points.
- We can plot these against each other for one direction, or as a contour plot for all directions.
Moment of inertia
- If a point x1 and its neighbour x2 were identical and plotted against each other, they'd fall on the 45° line x1 = x2.
- A measure of how far the data falls from this line is the moment of inertia:

  m = \frac{1}{2n} \sum (x_1 - x_2)^2

- Unlike the correlation coefficient, m increases as the data gets more spread out.
Variograms
- A plot of the moment of inertia vs. h is called a semi-variogram or, more usually, just a variogram.

[Figures: m plotted against h-distance for one direction, and m plotted against h-angle.]
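A sketch of an experimental variogram in Python, ignoring direction (omnidirectional) for simplicity; the gridded example data is invented:

```python
import numpy as np

def semivariogram(coords, values, lags, tol):
    """Experimental variogram: for each lag h, the moment of inertia
    m(h) = 1/(2n) * sum((x1 - x2)^2) over all pairs of points whose
    separation distance is within tol of h."""
    coords, values = np.asarray(coords, float), np.asarray(values, float)
    # Matrix of distances between every pair of points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    sq = (values[:, None] - values[None, :])**2
    gamma = []
    for h in lags:
        mask = np.triu(np.abs(dist - h) < tol, k=1)  # count each pair once
        n = mask.sum()
        gamma.append(sq[mask].sum() / (2 * n) if n else np.nan)
    return np.array(gamma)

# Example: values on a 10x10 grid with a smooth east-west trend.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = xs.ravel() + np.random.default_rng(2).normal(0, 0.2, 100)
print(semivariogram(coords, values, lags=[1, 2, 4, 8], tol=0.5))
# m generally rises with h: nearby points are more alike (autocorrelated).
```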
Problems
- Variograms can't use all the data values without additional assumptions, e.g. what is north of the northernmost data point? Usually we ignore the boundaries.
- All the correlation plots can suffer badly from a few unusual values, which can badly reduce the correlations.
- h-scatterplots allow us to see which unusual points are causing the problems and let us decide whether to remove them.
Multi-variate plots
- We may be interested in the relationship between two variables and whether they are spatially cross-correlated.
- We can plot h-scatterplots for a variable x and a variable y, but shift the y locations by h.

[Figure: h-scatterplot of x against y offset by h.]
Multi-variate correlation
- We can also calculate the cross-correlation for this h-scatterplot:

  r(h) = \frac{\sum xy / n - \bar{x}\,\bar{y}}{s_x s_y}

- where the means and standard deviations are just for the variable points used, i.e. x at one position, y at another.
Multi-variate variograms
- Equally, the equation for the moment of inertia can be extended to:

  m = \frac{1}{2n} \sum (x_1 - x_2)(y_1 - y_2)

- Note that this is no longer strictly the moment of inertia, as the line can be off 45°.
- Also note that it uses both x and y at positions 1 and 2.
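Extending the earlier variogram sketch to two variables (a hedged sketch; `dist` is the pairwise distance matrix built there):

```python
import numpy as np

def cross_variogram_at_lag(x_vals, y_vals, dist, h, tol=0.5):
    """Cross-variogram value at lag h: 1/(2n) * sum((x1-x2)*(y1-y2))
    over all pairs of locations separated by roughly h. `dist` is the
    pairwise distance matrix from the single-variable sketch above."""
    x_vals, y_vals = np.asarray(x_vals, float), np.asarray(y_vals, float)
    mask = np.triu(np.abs(dist - h) < tol, k=1)   # count each pair once
    n = mask.sum()
    dx = x_vals[:, None] - x_vals[None, :]
    dy = y_vals[:, None] - y_vals[None, :]
    return (dx * dy)[mask].sum() / (2 * n) if n else np.nan
```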
The use of variograms
- As we'll see in later lectures, variograms can be very useful.
- They represent the variability at different distances from a point.
- You can therefore use them to construct probability models of a landscape and predict the value of missing areas.
- This is known as kriging, and we'll look at it in later sessions.
- However, it might be nice to have a single statistic we can use to assess autocorrelation. One way is using joint count statistics.
Joint count statistics
- Moran and Geary in the 1950s.
- Define a binary variable: something is either present (white) or not (black).
- Calculate the number of B-B, W-W and B-W connections.
- These totals can then be compared with the normal distribution, which is what we'd get if the process was random.
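As a sketch, here is a joint count on a small binary grid, assuming rook (shared-edge) contiguity; the clustered example grid is invented:

```python
import numpy as np

def joint_counts(grid):
    """Count B-B, W-W and B-W joins on a binary grid, using rook
    (horizontal and vertical) contiguity."""
    g = np.asarray(grid, bool)
    bb = ww = bw = 0
    # Pair each cell with its right-hand and lower neighbour.
    for a, b in [(g[:, :-1], g[:, 1:]), (g[:-1, :], g[1:, :])]:
        bb += np.sum(~a & ~b)   # black-black (absent-absent)
        ww += np.sum(a & b)     # white-white (present-present)
        bw += np.sum(a != b)    # black-white
    return bb, ww, bw

# Clustered pattern: fewer B-W joins than a random pattern would give.
grid = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
print(joint_counts(grid))   # (8, 8, 8)
```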
Developments of this
- Moran's I for contiguous areas.
- Geary's c for contiguous areas.
- However, this is strongly dependent on:
  - which directions you take as contiguous,
  - variation in the size of areas and boundaries.
Cliff and Ord's Moran's I test
- Core values are the deviations from the mean at two locations.
- These are then multiplied by an a priori weight which represents how much two areas might affect each other.
- This is then normalized by the variation and the sample-number-to-weights ratio.
How do we define the weights?
- Various options:
  - One or zero, depending on whether the areas are adjacent.
  - Each area has a total of one, and this is divided up between its adjacent neighbours depending on the number of them.
  - Exponentially related to the distance between the areas (it's possible to assess the relationship between each area and all the others).
- We have to pick the most reasonable. A sketch of the calculation with the simplest weighting follows.
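A minimal sketch of Moran's I with the first weighting option (one/zero adjacency); the four-area example is invented:

```python
import numpy as np

def morans_i(values, W):
    """Moran's I: products of deviations from the mean for each pair of
    areas, multiplied by the weights W, normalised by the variation and
    the sample-number-to-weights ratio. W[i, j] links areas i and j."""
    x = np.asarray(values, float)
    z = x - x.mean()
    n = len(x)
    return (n / W.sum()) * (z @ W @ z) / np.sum(z**2)

# Four areas in a row, binary (0/1) adjacency weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
print(morans_i([1.0, 2.0, 3.0, 4.0], W))   # positive: neighbours are alike
print(morans_i([1.0, 4.0, 2.0, 3.0], W))   # negative: neighbours differ
```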
Geographically Weighted Regression
- Pioneered by the Newcastle United team of Fotheringham, Brunsdon, and Charlton.
- A bit like Moran's for regression, only even more arduous.
The core idea
- A standard regression has the same parameters wherever you are geographically, e.g. the relationship between socioeconomics and secondary school performance.
- Usually the residuals tell you where you've gone wrong.
- GWR allows the parameters to vary spatially, so you can look at these.
- It assumes a link between where you are and the strength of a relationship.
Locally weighted regression
- Run a standard regression for each point, but weight near points as more important (see the sketch after this list).
- Often the weights are an exponential function of distance and/or limited to a fixed number of nearest neighbours.
- Weakly dependent on the form of the weights; strongly dependent on how far the weights stretch around an area.
- Can try to find the best distance: the one that gives the best prediction for each point, if that point is excluded from the GWR calculations.
- Run, run and run again.
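A minimal sketch of one local regression step, assuming a Gaussian (exponential in squared distance) kernel and an invented bandwidth of 20; real GWR software also handles the bandwidth search and the significance tests:

```python
import numpy as np

def gwr_at_point(coords, X, y, point, bandwidth):
    """One local regression: weighted least squares where each sample's
    weight decays exponentially with distance from `point`. Repeating
    this for every point gives spatially varying parameters."""
    d = np.sqrt(((coords - point)**2).sum(axis=1))
    w = np.exp(-(d / bandwidth)**2)             # Gaussian kernel weights
    Xd = np.column_stack([np.ones(len(X)), X])  # intercept column
    sw = np.sqrt(w)
    # Weighted least squares: scale rows by sqrt(w), then solve.
    beta, *_ = np.linalg.lstsq(Xd * sw[:, None], y * sw, rcond=None)
    return beta                                 # local (a, b1, ...)

# Synthetic data where the true slope grows from west to east.
rng = np.random.default_rng(3)
coords = rng.uniform(0, 100, (200, 2))
slope = coords[:, 0] / 50
x = rng.uniform(0, 10, 200)
y = slope * x + rng.normal(0, 0.1, 200)

for easting in [10.0, 50.0, 90.0]:
    beta = gwr_at_point(coords, x[:, None], y, np.array([easting, 50.0]), 20.0)
    print(easting, beta[1])   # the local slope rises eastwards
```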
GWR software
- Derives local t statistics.
- Performs tests to assess the significance of the spatial variation in the local parameter estimates.
- Performs tests to determine if the local model performs better than the global one, accounting for differences in degrees of freedom.
- http://www.ncl.ac.uk/geography/GWR
GWR example: spatial variations in school performance
- Did a global regression of primary school maths results vs. demographics.
- Then did a GWR regression, derived the weights for each factor at each point, and plotted them.
- Divided by an error-variation term to give a rough idea of the significance of the weights, and plotted these.
GWR example: results
- In Leeds / Bradford, school size was much more important than elsewhere (inverse relationship).
- In Manchester, middle-class children do proportionally better than their social group elsewhere.
- While the combination of variables and unknowns is complex, GWR does suggest interesting avenues of investigation.

[Figure: map of the weights for school size.]
Summary
- Correlation measures covariance, but doesn't say anything about causal relationships.
- We can measure correlation and test its significance.
- We can quantify relationships using regression equations and use these to predict.
Summary
- Spatial autocorrelation means our sampling strategies aren't as random / large as we'd like.
- Correlations can be due to geographical correlations, not the ones we've tested for.
- Plotting residuals geographically may let us see autocorrelation.
- Variograms and h-scatterplots are another good way.
- Moran's I test allows us to quantify spatial autocorrelation for a given weight scheme.
- Geographically Weighted Regression helps us to take autocorrelation into account and investigate the weight it has.
Next lecture
- Interpolation
- Homework:
  - Read the handout on autocorrelation stats.
  - View the GWR keynote talk: http://www.geocomputation.org/2001/