Title: Geostatistics with gstat
1. Geostatistics with gstat
- Organized by Gilberto Câmara from
- ITC course on Geostatistics by D G Rossiter
2. Original material
- ITC course on Geostatistics by D G Rossiter
- http://www.itc.nl/rossiter/teach/lecnotes.html
3. Topic: The Meuse soil pollution data set
- This will be used as a running example for the following lectures.
- It is an example of an environmental dataset which can be used to answer a variety of practical and theoretical research questions.
4. Meuse data set: source
- Rikken, M.G.J. & Van Rijn, R.P.G., 1993. Soil pollution with heavy metals: an inquiry into spatial variation, cost of mapping and the risk evaluation of copper, cadmium, lead and zinc in the floodplains of the Meuse west of Stein, the Netherlands.
- Doctoraalveldwerkverslag, Dept. of Physical Geography, Utrecht University
- This data set is also used in the GIS text of Burrough & McDonnell.
5. Meuse data set
- 155 samples taken on a support of 10x10 m from the top 0-20 cm of alluvial soils in a 5x2 km part of the floodplain of the Maas (Meuse) near Stein (NL).
- id: point number
- x, y: coordinates E and N, in m
- cadmium: concentration in the soil, in mg kg-1
- copper: concentration in the soil, in mg kg-1
- lead: concentration in the soil, in mg kg-1
- zinc: concentration in the soil, in mg kg-1
- elev: elevation above local reference level, in m
- om: organic matter, loss on ignition, in percent
- ffreq: flood frequency class; 1 = annual, 2 = every 2-5 years, 3 = every 5 years
- soil: soil class, coded
- lime: has the land here been limed? 0 or 1 = F or T
- landuse: land use, coded
- dist.m: distance from the main River Maas channel, in m
6. Accessing the Meuse data set in R
- > library(gstat)
- > data(meuse)
- > attach(meuse)
- > meuse[1:4, 1:5]
- x y cadmium copper lead
- 1 181072 333611 11.7 85 299
- 2 181025 333558 8.6 81 277
- 3 181165 333537 6.5 68 199
- 4 181298 333484 2.6 81 116
7. Exploratory data analysis
- Statistical analysis should lead to understanding, not confusion . . .
- . . . so it makes sense to examine and visualise the data with a critical eye to see:
- Patterns: outstanding features
- Unusual data items (not fitting a pattern): blunders? from a different population?
- Promising analyses
- Reconnaissance before the battle: draw obvious conclusions with a minimum of analysis
8. Boxplot, stem-and-leaf plot, histogram
- Questions:
- One population or several?
- Outliers?
- Centered or skewed (mean vs. median)?
- Heavy or light tails (kurtosis)?
- > stem(cadmium)
- > boxplot(cadmium)
- > boxplot(cadmium, horizontal=T)
- > points(mean(cadmium), 1, pch=20, cex=2, col="blue")
- > hist(cadmium) # automatic bin selection
- > hist(cadmium, n=16) # specify the number of bins
- > hist(cadmium, breaks=seq(0, 20, by=1)) # specify breakpoints
9. Cadmium histogram (skewed variable)
10. Summary statistics
- These summarize a single sample of a single variable:
- 5-number summary (min, 1st Q, median, 3rd Q, max)
- Sample mean and variance
- > summary(cadmium)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.200 0.800 2.100 3.246 3.850 18.100
- > var(cadmium)
- [1] 12.41678
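To see what summary() and var() actually compute, the same quantities can be reproduced from first principles. This is a minimal sketch on made-up illustrative values, not the Meuse cadmium data:

```r
# Toy data (hypothetical values, not the Meuse measurements)
z <- c(0.2, 0.8, 1.3, 2.1, 3.2, 3.9, 18.1)

# Five-number summary: min, lower hinge, median, upper hinge, max
fn <- fivenum(z)

# Sample mean and variance computed from their definitions
n  <- length(z)
m  <- sum(z) / n
s2 <- sum((z - m)^2) / (n - 1)  # n-1: unbiased sample variance

# var() uses the same n-1 denominator
stopifnot(all.equal(s2, var(z)))
print(fn)
print(c(mean = m, variance = s2))
```

Note that fivenum() returns Tukey's hinges, which for most sample sizes coincide with, or sit very close to, the quartiles that summary() reports.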
11. The normal distribution
- Arises naturally in many processes: variables that can be modelled as a sum of many small contributions, each with the same distribution of errors (central limit theorem)
- Easy mathematical manipulation
- Fits many observed distributions of errors or random effects
- Some statistical procedures require that a variable be at least approximately normally distributed
- Note: even if a variable itself is not normally distributed, its mean may be, since the deviations from the mean may be the sum of many small errors.
12. Evaluating normality
- Graphical:
- Histograms
- Quantile-quantile plots (normal probability plots)
- Numerical:
- Various tests, including Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilk
- These all work by comparing the observed distribution with the theoretical normal distribution having parameters estimated from the observations, and computing the probability that the observed is a realisation of the theoretical
- > qqnorm(cadmium); qqline(cadmium)
- > shapiro.test(cadmium)
- Shapiro-Wilk normality test
- W = 0.7856, p-value = 8.601e-14
13. Evaluating normality
14. Transforming to normality: based on what criteria?
- 1. A priori understanding of the process
- e.g. lognormal arises if contributing variables multiply, rather than add
- 2. EDA: visual impression of what should be done
- 3. Results: transformed variable appears and tests normal
15. Log transform of a variable with positive skew
- > summary(log(cadmium))
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- -1.6090 -0.2231 0.7419 0.5611 1.3480 2.8960
- > stem(log(cadmium))
- > hist(log(cadmium), n=20)
- > boxplot(log(cadmium), horizontal=T)
- > points(mean(log(cadmium)), 1, pch=20, cex=2, col="blue")
- > qqnorm(log(cadmium), main="Q-Q plot for log(cadmium ppm)")
- > qqline(log(cadmium))
- > shapiro.test(log(cadmium))
- Shapiro-Wilk normality test
- W = 0.9462, p-value = 1.18e-05
- This is still not normal, but much more symmetric
16. Topic: Regression
- A general term for modelling the distribution of one variable (the response or dependent) from (on) another (the predictor or independent)
- This is only logical if we have a priori (non-statistical) reasons to believe in a causal relation
- Correlation makes no assumptions about causation: both variables have the same logical status
- Regression assumes one variable is the predictor and the other the response
17. Actual vs. fictional causality
- Example: proportion of fine sand in a topsoil and subsoil layer
- Does one cause the other?
- Do they have a common cause?
- Can one be used to predict the other?
- Why would this be useful?
18. Linear regression model
- Model: y = b0 + b1*x + e
- b0: intercept, constant shift from x to y
- b1: slope, change in y for an equivalent change in x
- e: error, or better, unexplained variation
- The parameters b0 and b1 are selected to minimize some summary measure of e over all sample points
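When the summary measure is the sum of squared errors, b0 and b1 have a closed form: b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x). A minimal sketch on synthetic data (the values are invented, not from the Meuse set) confirms that lm() computes exactly these estimates:

```r
set.seed(42)
# Hypothetical predictor and response (not the Meuse data)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 1)

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept

# lm() gives the same coefficients
m <- lm(y ~ x)
stopifnot(all.equal(unname(coef(m)), c(b0, b1)))
c(intercept = b0, slope = b1)
```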
19. Simple regression
- Given the fitted model, we can predict at the original data points pi; these are called the fitted values
- Then we can compute the deviations of the fitted from the measured values: ei = (pi - yi); these are called the residuals
- The deviations can be summarized to give an overall goodness-of-fit measure
20. Look before you leap!
- Anscombe developed four different bivariate datasets, all with the exact same correlation r = 0.81 and linear regression y = 3 + 0.5x:
- 1. bivariate normal
- 2. quadratic
- 3. bivariate normal with one outlier
- 4. one high-leverage point
21. (No transcript)
22. Bivariate analysis: heavy metals vs. organic matter
- Scatterplot
- Scatterplot by flood frequency
- Regression of metal on organic matter (why this order?)
- Same, including flood frequency in the model
- > plot(om, log(cadmium))
- > plot(om, log(cadmium), col=as.numeric(ffreq), cex=1.5, pch=20)
23. Model: regression of metal on organic matter
- > m <- lm(log(cadmium) ~ om)
- > summary(m)
- Residuals:
- Min 1Q Median 3Q Max
- -2.3070 -0.3655 0.1270 0.6079 2.0503
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) -1.04574 0.19234 -5.437 2.13e-07
- om 0.21522 0.02339 9.202 2.70e-16
- ---
- Residual standard error: 0.9899 on 151 degrees of freedom
- Multiple R-squared: 0.3593, Adjusted R-squared: 0.3551
- F-statistic: 84.68 on 1 and 151 DF, p-value: 2.703e-16
- Highly significant model, but organic matter content explains only about 35% of the variability of log(Cd).
24. Good fit vs. significant fit
- R2 can be highly significant (reject null hypothesis of no relation), but . . .
- . . . the prediction can be poor
- In other words, only a small portion of the variance is explained by the model
- Two possibilities:
- 1. incorrect or incomplete model
- (a) other factors are more predictive
- (b) other factors can be included to improve the model
- (c) form of the model is wrong
- 2. correct model, noisy data
25. Regression diagnostics
- Objective: to see if the regression truly represents the presumed relation
- Objective: to see if the computational methods are adequate
- Main tool: plot of standardized residuals vs. fitted values
- Numerical measures: leverage, large residuals
26. Examining the scatterplot with the fitted line
- Is there a trend in lack of fit? (further away in part of the range)
- suggests a non-linear model
- Is there a trend in the spread?
- heteroscedasticity (unequal variances), so linear modelling is invalid
- Are there points that, if eliminated, would substantially change the fit?
- high leverage, isolated in the range and far from other points
27. Model diagnostics: regression of metal on organic matter
- > m <- lm(log(cadmium) ~ om)
- > plot(om, log(cadmium), col=as.numeric(ffreq), cex=1.5, pch=20); abline(m)
- > plot(log(cadmium[!is.na(om)]), fitted(m), col=as.numeric(ffreq), pch=20)
- > abline(0,1)
- > plot(fitted(m), studres(m), col=as.numeric(ffreq), pch=20); abline(h=0)
- > qqnorm(studres(m), col=as.numeric(ffreq), pch=20); qqline(studres(m))
- (studres() is provided by the MASS package.)
- We can see problems at the low metal concentrations. This is probably an artifact of the measurement precision at these levels (near or below the detection limit).
28. (No transcript)
29. Revised model: Cd detection limit
- Values of Cd below 1 mg kg-1 are unreliable
- Replace them all with 1 mg kg-1 and re-analyze
- > cdx <- ifelse(cadmium > 1, cadmium, 1)
- > plot(om, log(cdx), col=as.numeric(ffreq), cex=1.5, pch=20)
- > m <- lm(log(cdx) ~ om)
- > summary(m)
- Residuals:
- Min 1Q Median 3Q Max
- -1.0896 -0.4250 -0.0673 0.3527 1.5836
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) -0.43030 0.11092 -3.879 0.000156
- om 0.17272 0.01349 12.806 < 2e-16
- ---
30. Revised model: Cd detection limit
- Residual standard error: 0.5709 on 151 degrees of freedom
- Multiple R-squared: 0.5206, Adjusted R-squared: 0.5174
- F-statistic: 164 on 1 and 151 DF, p-value: < 2.2e-16
- > abline(m)
- > plot(log(cdx[!is.na(om)]), fitted(m), col=as.numeric(ffreq), pch=20); abline(0,1)
- > plot(fitted(m), studres(m), col=as.numeric(ffreq), pch=20); abline(h=0)
- > qqnorm(studres(m), col=as.numeric(ffreq), pch=20); qqline(studres(m))
- Much higher R2 and better diagnostics. Still, there is a lot of spread at any value of the predictor (organic matter).
31. Experimental variogram
- > v <- variogram(log(cadmium) ~ 1, ~x+y, meuse)
- np dist gamma
- 1 57 79.29244 0.6650872
- 2 299 163.97367 0.8584648
- 3 419 267.36483 1.0064382
- 4 457 372.73542 1.1567136
- 5 547 478.47670 1.3064732
- 6 533 585.34058 1.5135658
- 7 574 693.14526 1.6040086
- 8 564 796.18365 1.7096998
- 9 589 903.14650 1.7706890
- 10 543 1011.29177 1.9875659
- 11 500 1117.86235 1.8259154
- 12 477 1221.32810 1.8852099
- 13 452 1329.16407 1.9145967
- 14 457 1437.25620 1.8505336
- 15 415 1543.20248 1.8523791
- np is the number of point pairs in the bin; dist is the average separation of these pairs; gamma is the average semivariance in the bin.
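Each row of that table can be reproduced by hand: gamma for a bin is half the mean squared difference between the values of all point pairs whose separation falls in the bin. A minimal sketch on a hypothetical 1-D transect (invented values, not the Meuse data):

```r
# Hypothetical 1-D transect: locations and values (not the Meuse data)
loc <- c(0, 1, 2, 4, 5)
z   <- c(1.0, 1.2, 0.8, 1.6, 1.4)

# All point pairs, their separations, and their squared value differences
pairs <- t(combn(length(loc), 2))
h  <- abs(loc[pairs[, 1]] - loc[pairs[, 2]])
d2 <- (z[pairs[, 1]] - z[pairs[, 2]])^2

# Semivariance estimate for the bin of separations in (0, 2]:
# gamma = (sum of squared differences) / (2 * number of pairs)
in.bin    <- h > 0 & h <= 2
gamma.hat <- sum(d2[in.bin]) / (2 * sum(in.bin))
c(np = sum(in.bin), dist = mean(h[in.bin]), gamma = gamma.hat)
```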
32. Plotting the experimental variogram
- This can be plotted as semivariance gamma against average separation dist, along with the number of points that contributed to each estimate np:
- > plot(v, plot.numbers=T)
- (Note: gstat defaults to 15 equally-spaced bins and a maximum distance of 1/3 of the maximum separation. These can be overridden with the width and cutoff arguments, respectively.)
33. Plotting the variogram
34. Analysing the variogram
- Later we will look at fitting a model to the variogram, but even without a model we can notice some features, which we define here only qualitatively:
- Sill: maximum semivariance; represents variability in the absence of spatial dependence
- Range: separation between point-pairs at which the sill is reached; distance at which there is no evidence of spatial dependence
- Nugget: semivariance as the separation approaches zero; represents variability at a point that can't be explained by spatial structure
- In the previous slide, we can estimate the sill ≈ 1.9, the range ≈ 1200 m, and the nugget ≈ 0.5, i.e. 25% of the sill.
35. Using the experimental variogram to model the random process
- Notice that the semivariance of the separation vector, g(h), is now given as the estimate of covariance in the spatial field.
- So it models the spatially-correlated component of the regionalized variable
- We must go from the experimental variogram to a variogram model in order to be able to model the random process at any separation.
36. Modelling the variogram
- From the empirical variogram we now derive a variogram model which expresses semivariance as a function of the separation vector. It allows us to:
- Infer the characteristics of the underlying process from the functional form and its parameters
- Compute the semivariance between any point-pair, separated by any vector
- Interpolate between sample points using an optimal interpolator (kriging)
37. Authorized models
- Any variogram function must be able to model the following:
- 1. Monotonically increasing
- possibly with a fluctuation (hole)
- 2. Constant or asymptotic maximum (sill)
- 3. Non-negative intercept (nugget)
- 4. Anisotropy
- Variograms must obey mathematical constraints so that the resulting kriging equations are solvable (e.g., positive definite between-sample covariance matrices).
- The permitted functions are called authorized models.
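Two commonly used authorized models can be written down directly as functions of separation h. This sketch uses the nugget c0, partial sill c1, range a parameterisation, evaluated with values echoing the Meuse fit from later slides (c0 = 0.5, c1 = 1.4, a = 1200) purely as an illustration:

```r
# Spherical model: reaches the sill c0 + c1 exactly at the range a
sph <- function(h, c0, c1, a)
  ifelse(h == 0, 0,
         ifelse(h < a, c0 + c1 * (1.5 * h/a - 0.5 * (h/a)^3), c0 + c1))

# Exponential model: approaches the sill asymptotically
expm <- function(h, c0, c1, a)
  ifelse(h == 0, 0, c0 + c1 * (1 - exp(-h / a)))

sph(1200, 0.5, 1.4, 1200)  # at the range: total sill c0 + c1 = 1.9
sph(600, 0.5, 1.4, 1200)   # partway to the range
```

Note that both functions return 0 at h = 0: the nugget c0 appears as a discontinuity just above zero separation, which is how the nugget effect is conventionally modelled.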
38. (No transcript)
39. (No transcript)
40. Visualization of the model in gstat for Windows
- Open the meuse.txt file in gstatw
- Visualize the data
- Try to fit various variogram models
41. Gstat for Windows
42. Defining the bins
- Distance interval, specifying the centres. E.g. (0, 100, 200, . . .) means intervals of 0 . . . 50, 50 . . . 150, . . .
- All point pairs whose separation is in the interval are used to estimate g(h) for h as the interval centre
- Narrow intervals: more resolution, but fewer point pairs for each sample
- > v <- variogram(log(cadmium) ~ 1, ~x+y, meuse, boundaries = seq(50, 2050, by = 100))
- > plot(v, pl=T)
- > par(mfrow = c(2, 3)) # show all six plots together
- > for (bw in seq(20, 220, by = 40)) {
-     v <- variogram(log(cadmium) ~ 1, ~x+y, meuse, width=bw)
-     plot(v$dist, v$gamma, xlab=paste("bin width", bw)) }
43. Defining the bins
- Each bin should have > 100 point pairs; > 300 is much more reliable
- > v <- variogram(log(cadmium) ~ 1, ~x+y, meuse, width=20)
- > plot(v, plot.numbers=T)
- > v$np
- [1]   6  19  27  27  51  65  58  62  62  82  76  75  86  81  76
- [16]  91  92  90  88  92 112 103  80 116 108 106  79  94 117  99
- [31] 100 101 108 117 110 117 114 107  96 110 109 106 114 117 104
- [46]  98  94 117  92 110 105  91  89  98  89  91 103 102  93  92
- [61]  73  85  88  91  88  84  75  81  90  73  93  95  76  85  67
- [76]  77  88  60
- > v <- variogram(log(cadmium) ~ 1, ~x+y, meuse, width=120)
- > v$np
- [1]  79 380 485 577 583 642 654 648 609 572 522 491 493 148
- > plot(v, plot.numbers=T)
44. Fitting the model
- Once a model form is selected, the model parameters must be adjusted for a "best" fit of the experimental variogram.
- By eye, adjusting parameters for a good-looking fit
- Hard to judge the relative value of each point
- Use the Windows version of gstat
- Automatically, looking for the best fit according to some objective criterion
- Various criteria possible in gstat
- In both cases, favour sections of the variogram with more pairs and at shorter ranges (because kriging is a local interpolator).
- Mixed: adjust by eye, evaluate statistically, or vice versa
45. Fitting the model manually
- We've decided on a spherical + nugget model
- > # Calculate the experimental variogram and display it
- > v1 <- variogram(log(cadmium) ~ 1, ~x+y, meuse)
- > plot(v1, plot.numbers=T)
- > # Fit by eye, display fit
- > m1 <- vgm(1.4, "Sph", 1200, 0.5)
- > plot(v1, plot.numbers=T, model=m1)
46. Fitting the model automatically
- > # Let gstat adjust the parameters, display fit
- > m2 <- fit.variogram(v1, m1)
- > m2
- model     psill    range
- 1  Nug 0.5478478    0.000
- 2  Sph 1.3397965 1149.436
- > plot(v1, plot.numbers=T, model=m2)
- > # Fix the nugget, fit only the sill of the spherical model
- > m2a <- fit.variogram(v1, m1, fit.sills=c(F,T), fit.ranges=F)
- > m2a
- model    psill range
- 1  Nug 0.500000     0
- 2  Sph 1.465133  1200
47. Model comparison
- By eye: c0 = 0.5, c1 = 1.4, a = 1200; total sill c0 + c1 = 1.9
- Automatic: c0 = 0.548, c1 = 1.340, a = 1149; total sill c0 + c1 = 1.888
- The total sill was almost unchanged; gstat raised the nugget and lowered the partial sill of the spherical model a bit; the range was shortened by 51 m.
48. What sample size to fit a variogram model?
- Can't use non-spatial formulas for sample size, because spatial samples are correlated, and each sample is used multiple times in the variogram estimate
- No way to estimate the true error, since we have only one realisation
- Stochastic simulation from an assumed true variogram suggests:
- < 50 points: not at all reliable
- 100 to 150 points: more or less acceptable
- > 250 points: almost certainly reliable
- More points are needed to estimate an anisotropic variogram.
- This is very worrying for many environmental datasets (soil cores, vegetation plots, . . . ), especially from short-term fieldwork, where sample sizes of 40-60 are typical. Should variograms even be attempted on such small samples?
49. Approaches to spatial prediction
- This is the prediction of the value of some variable at an unsampled point, based on the values at the sampled points.
- This is often called interpolation, but strictly speaking that is only for points that are geographically inside the sample set (otherwise it is extrapolation).
50. Approaches to prediction: local predictors
- Value of the variable is predicted from nearby samples
- Example: concentrations of soil constituents (e.g. salts, pollutants)
- Example: vegetation density
51. Local predictors
- Each interpolator has its own assumptions, i.e. theory of spatial variability:
- Nearest neighbour
- Average within a radius
- Average of n nearest neighbours
- Distance-weighted average within a radius
- Distance-weighted average of n nearest neighbours
- Optimal weighting -> kriging
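A distance-weighted (inverse distance) average is easy to compute by hand; the sample configuration below is hypothetical, chosen only to illustrate how the weights are formed:

```r
# Inverse-distance weighting at one prediction point (hypothetical data)
samp <- data.frame(x = c(0, 4, 5, 9), y = c(0, 3, 8, 1),
                   z = c(2.0, 3.5, 1.0, 4.0))
x0 <- c(5, 5)  # point to predict
p  <- 2        # power: larger p gives nearby samples more influence

d  <- sqrt((samp$x - x0[1])^2 + (samp$y - x0[2])^2)  # distances
w  <- (1 / d^p) / sum(1 / d^p)  # weights, normalised to sum to 1
z0 <- sum(w * samp$z)           # the IDW prediction

stopifnot(all.equal(sum(w), 1))
z0
```

Because the weights are positive and sum to 1, the prediction always lies between the minimum and maximum of the sample values, one reason IDW surfaces never extrapolate beyond the observed range.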
52. Ordinary Kriging
- The theory of regionalised variables leads to an optimal interpolation method, in the sense that the prediction variance is minimized.
- This is based on the theory of random functions, and requires certain assumptions.
53. Optimal local interpolation: motivation
- Problems with average-in-circle methods:
- 1. No objective way to select the radius of the circle or the number of points
- Problems with inverse-distance methods:
- 1. How to choose the power (inverse, inverse squared, . . . )?
- 2. How to choose the limiting radius?
- In both cases:
- 1. Uneven distribution of samples could over- or under-emphasize some parts of the field
- 2. Prediction error must be estimated from a separate validation dataset
54. An optimal local predictor would have these features
- Prediction is made as a linear combination of known data values (a weighted average)
- Prediction is unbiased and exact at known points
- Points closer to the point to be predicted have larger weights
- Clusters of points reduce to single equivalent points, i.e., over-sampling in a small area can't bias the result
- Closer sample points mask further ones in the same direction
- Error estimate is based only on the sample configuration, not the data values
- Prediction error should be as small as possible
55. Kriging
- A Best Linear Unbiased Predictor (BLUP) that satisfies certain criteria for optimality
- It is only optimal with respect to the chosen model!
- Based on the theory of random processes, with covariances depending only on separation (i.e. a variogram model)
- Theory developed several times (Kolmogorov 1930s, Wiener 1949), but current practice dates back to Matheron (1963), formalizing the practical work of the mining engineer D G Krige (RSA)
56. How do we use kriging?
- 1. Sample, preferably at different resolutions
- 2. Calculate the experimental variogram
- 3. Model the variogram with one or more authorized functions
- 4. Apply the kriging system, with the variogram model of spatial dependence, at each point to be predicted
- Predictions are often at each point on a regular grid (e.g. a raster map)
- These points are actually blocks the size of the sampling support
- Can also predict in blocks larger than the original support
- 5. Calculate the error of each prediction; this is based only on the sample point locations, not their data values
57. Prediction with Ordinary Kriging (OK)
- In OK, we model the value of variable z at location si as the sum of a regional mean m and a spatially-correlated random component e(si):
- Z(si) = m + e(si)
- The regional mean m is estimated from the sample, but not as the simple average, because there is spatial dependence. It is implicit in the OK system.
58. Prediction with Ordinary Kriging (OK)
- Predict at points, with unknown mean (which must also be estimated) and no trend
- Each point x is predicted as the weighted average of the values at all sample points
- The weights assigned to each sample point sum to 1
- Therefore, the prediction is unbiased
- "Ordinary": no trend or strata; the regional mean must be estimated from the sample
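The OK weights come from solving a small linear system: semivariances between the samples on the left-hand side, semivariances from each sample to the prediction point on the right, plus a Lagrange-multiplier row that forces the weights to sum to 1. A sketch on a hypothetical three-point configuration, using the spherical model fitted earlier (nugget 0.5, partial sill 1.4, range 1200):

```r
# Spherical variogram echoing the fitted Meuse model
sv <- function(h) ifelse(h == 0, 0,
        ifelse(h < 1200, 0.5 + 1.4 * (1.5*h/1200 - 0.5*(h/1200)^3), 1.9))

xy <- rbind(c(0, 0), c(400, 0), c(0, 600))  # hypothetical sample locations (m)
x0 <- c(200, 200)                           # prediction point

D  <- as.matrix(dist(xy))                # sample-to-sample distances
d0 <- sqrt(colSums((t(xy) - x0)^2))      # sample-to-prediction distances

# OK system: [ sv(D)  1 ] [w]   [sv(d0)]
#            [ 1'     0 ] [L] = [  1   ]
n   <- nrow(xy)
A   <- rbind(cbind(sv(D), 1), c(rep(1, n), 0))
b   <- c(sv(d0), 1)
sol <- solve(A, b)
w   <- sol[1:n]   # kriging weights; sol[n+1] is the Lagrange multiplier

stopifnot(all.equal(sum(w), 1))
w
```

Note how the weights depend only on the sample configuration and the variogram, not on the data values, which is exactly why the kriging prediction error can be mapped before any values are measured.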
59. OK in gstat
- # load the 40 m interpolation grid
- data(meuse.grid)
- # also make a full grid in the bounding box covering the same area
- meuse.box <- expand.grid(
-   x=seq(min(meuse.grid$x), max(meuse.grid$x), by=40),
-   y=seq(min(meuse.grid$y), max(meuse.grid$y), by=40)
- )
60. OK in gstat
- The krige method is used with a variogram model
- # compute experimental variogram
- v <- variogram(log(cadmium) ~ 1, ~x+y, meuse)
- # estimated model
- m <- vgm(1.4, "Sph", 1200, 0.5)
- # fitted model
- m.f <- fit.variogram(v, m)
- [1] "SSErr 2.80736271217044e-05"
- kr <- krige(log(cadmium) ~ 1, loc=~x+y, data=meuse, newdata=meuse.grid, model=m.f)
- [using ordinary kriging]
61. OK in gstat
- library(lattice)
- # visualize interpolation; note the aspect option to get correct geometry
- levelplot(var1.pred ~ x+y, kr, aspect=mapasp(kr))
- # visualize prediction error
- levelplot(var1.var ~ x+y, kr, aspect=mapasp(kr))
62. (No transcript)
63. (No transcript)
64. How realistic are maps made by Ordinary Kriging?
- The resulting surface is smooth and shows no noise, no matter whether there is a nugget effect in the variogram model
- So the field is the "best" at each point taken separately, but taken as a whole it is not a realistic map
- The sample points are predicted exactly; they are assumed to be without error, again even if there is a nugget effect in the variogram model (Note: block kriging does not have this problem)