1
Geostatistics with gstat
  • Organized by Gilberto Câmara from
  • ITC course on Geostatistics by D G Rossiter

2
Original material
  • ITC course on Geostatistics by D G Rossiter
  • http://www.itc.nl/rossiter/teach/lecnotes.html

3
Topic: The Meuse soil pollution data set
  • This will be used as a running example for the
    following lectures.
  • It is an example of an environmental dataset
    which can be used to answer a variety of
    practical and theoretical research questions.

4
Meuse data set source
  • Rikken, M.G.J. & Van Rijn, R.P.G., 1993.
  • Soil pollution with heavy metals: An inquiry
    into spatial variation, cost of mapping and the
    risk evaluation of copper, cadmium, lead and zinc
    in the floodplains of the Meuse west of Stein,
    the Netherlands.
  • Doctoraalveldwerkverslag, Dept. of Physical
    Geography, Utrecht University
  • This data set is also used in the GIS text of
    Burrough & McDonnell.

5
Meuse data set
  • 155 samples taken on a support of 10x10 m from
    the top 0-20 cm of alluvial soils in a 5x2 km
    part of the floodplain of the Maas (Meuse) near
    Stein (NL).
  • id: point number
  • x, y: coordinates E and N, in m
  • cadmium: concentration in the soil, in mg kg-1
  • copper: concentration in the soil, in mg kg-1
  • lead: concentration in the soil, in mg kg-1
  • zinc: concentration in the soil, in mg kg-1
  • elev: elevation above local reference level, in m
  • om: organic matter, loss on ignition, in percent
  • ffreq: flood frequency class,
  • 1 = annual, 2 = every 2-5 years, 3 = every 5 years
  • soil: soil class, coded
  • lime: has the land here been limed? 0 or 1 = F or T
  • landuse: land use, coded
  • dist.m: distance from the main River Maas channel, in m

6
Assessing the Meuse data set in R
  • > library(gstat)
  • > data(meuse)
  • > attach(meuse)
  • > meuse[1:4, 1:5]
  • x y cadmium copper lead
  • 1 181072 333611 11.7 85 299
  • 2 181025 333558 8.6 81 277
  • 3 181165 333537 6.5 68 199
  • 4 181298 333484 2.6 81 116

7
Exploratory data analysis
  • Statistical analysis should lead to
    understanding, not confusion . . .
  • . . . so it makes sense to examine and visualise
    the data with a critical eye to see
  • Patterns: outstanding features
  • Unusual data items (not fitting a pattern):
    blunders? from a different population?
  • Promising analyses
  • Reconnaissance before the battle
  • Draw obvious conclusions with a minimum of
    analysis

8
Boxplot, stem-and-leaf plot, histogram
  • Questions
  • One population or several?
  • Outliers?
  • Centered or skewed (mean vs. median)?
  • Heavy or light tails (kurtosis)?
  • > stem(cadmium)
  • > boxplot(cadmium)
  • > boxplot(cadmium, horizontal = T)
  • > points(mean(cadmium), 1, pch = 20, cex = 2,
    col = "blue")
  • > hist(cadmium)                              # automatic bin selection
  • > hist(cadmium, n = 16)                      # specify the number of bins
  • > hist(cadmium, breaks = seq(0, 20, by = 1)) # specify breakpoints

9
Cadmium histogram (skewed variable)
10
Summary Statistics
  • These summarize a single sample of a single
    variable
  • 5-number summary (min, 1st Q, median, 3rd Q, max)
  • Sample mean and variance
  • > summary(cadmium)
  •   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
  •  0.200   0.800  2.100 3.246   3.850 18.100
  • > var(cadmium)
  • [1] 12.41678

11
The normal distribution
  • Arises naturally in many processes: a variable
    that can be modelled as a sum of many small
    contributions, each with the same distribution of
    errors (central limit theorem)
  • Easy mathematical manipulation
  • Fits many observed distributions of errors or
    random effects
  • Some statistical procedures require that a
    variable be at least approximately normally
    distributed
  • Note even if a variable itself is not normally
    distributed, its mean may be, since the
    deviations from the mean may be the sum of many
    small errors.
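
A quick illustration of the central limit theorem in R (a minimal sketch, not part of the original slides; the sample sizes are arbitrary): each simulated value below is the sum of 20 uniform contributions, and the distribution of these sums is already close to normal even though the contributions themselves are not.

  # sums of many small, identically distributed contributions -> approximately normal
  set.seed(42)
  sums <- replicate(1000, sum(runif(20)))   # 1000 sums of 20 uniform(0,1) deviates
  hist(sums, breaks = 30)                   # bell-shaped, unlike a uniform histogram
  qqnorm(sums); qqline(sums)                # points fall close to the line
  shapiro.test(sums)                        # p-value typically large: no evidence against normality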

12
Evaluating normality
  • Graphical
  • Histograms
  • Quantile-Quantile plots (normal probability
    plots)
  • Numerical
  • Various tests including Kolmogorov-Smirnov,
    Anderson-Darling, Shapiro-Wilk
  • These all work by comparing the observed
    distribution with the theoretical normal
    distribution having parameters estimated from the
    observed, and computing the probability that the
    observed is a realisation of the theoretical
  • > qqnorm(cadmium); qqline(cadmium)
  • > shapiro.test(cadmium)
  •   Shapiro-Wilk normality test
  •   W = 0.7856, p-value = 8.601e-14

13
Evaluating normality
14
Transforming to Normality: Based on what criteria?
  • 1. A priori understanding of the process
  • e.g. lognormal arises if contributing variables
    multiply, rather than add
  • 2. EDA: visual impression of what should be done
  • 3. Results: transformed variable appears and
    tests normal

15
Log transform of a variable with positive skew
  • > summary(log(cadmium))
  •    Min. 1st Qu. Median   Mean 3rd Qu.   Max.
  • -1.6090 -0.2231 0.7419 0.5611  1.3480 2.8960
  • > stem(log(cadmium))
  • > hist(log(cadmium), n = 20)
  • > boxplot(log(cadmium), horizontal = T)
  • > points(mean(log(cadmium)), 1, pch = 20, cex = 2,
    col = "blue")
  • > qqnorm(log(cadmium), main = "Q-Q plot for
    log(cadmium ppm)")
  • > qqline(log(cadmium))
  • > shapiro.test(log(cadmium))
  •   Shapiro-Wilk normality test
  •   W = 0.9462, p-value = 1.18e-05
  • This is still not normal, but much more symmetric

16
Topic: Regression
  • A general term for modelling the distribution of
    one variable (the response or dependent) from
    (on) another (the predictor or independent)
  • This is only logical if we have a priori
    (non-statistical) reasons to believe in a causal
    relation
  • Correlation makes no assumptions about
    causation: both variables have the same logical
    status
  • Regression assumes one variable is the predictor
    and the other the response

17
Actual vs. fictional causality
  • Example: proportion of fine sand in a topsoil and
    subsoil layer
  • Does one cause the other?
  • Do they have a common cause?
  • Can one be used to predict the other?
  • Why would this be useful?

18
Linear regression model
  • Model: y = b0 + b1*x + e
  • b0: intercept, constant shift from x to y
  • b1: slope, change in y for an equivalent change
    in x
  • e: error, or better, unexplained variation
  • The parameters b0 and b1 are selected to minimize
    some summary measure of e over all sample points
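
As a sketch of how b0 and b1 are "selected to minimize some summary measure of e" (an illustration, not from the original slides): for ordinary least squares that measure is the sum of squared errors, and the minimizing slope and intercept have closed forms that can be checked against lm(), here using om and log(cadmium) as in the following slides.

  library(gstat); data(meuse)
  x <- meuse$om[!is.na(meuse$om)]              # predictor: organic matter
  y <- log(meuse$cadmium)[!is.na(meuse$om)]    # response: log(Cd)
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
  b0 <- mean(y) - b1 * mean(x)                                      # intercept
  c(b0 = b0, b1 = b1)
  coef(lm(y ~ x))                              # same values from lm()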

19
Simple regression
  • Given the fitted model, we can predict at the
    original data points; these predictions pi are
    called the fitted values
  • Then we can compute the deviations of the fitted
    from the measured values
  • ei = (pi - yi)
  • these are called the residuals
  • The deviations can be summarized to give an
    overall goodness-of-fit measure
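
A small sketch of these quantities (not part of the original slides): compute the fitted values, residuals, and R2 directly and compare with what summary() reports for the model fitted later in these slides (about 0.36).

  library(gstat); data(meuse)
  d <- meuse[!is.na(meuse$om), ]               # drop rows with missing om
  m <- lm(log(cadmium) ~ om, data = d)
  p <- fitted(m)                               # fitted values p_i
  e <- p - log(d$cadmium)                      # residuals e_i = (p_i - y_i); sign does not affect sums of squares
  R2 <- 1 - sum(e^2) / sum((log(d$cadmium) - mean(log(d$cadmium)))^2)
  c(by.hand = R2, from.summary = summary(m)$r.squared)   # both about 0.36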

20
Look before you leap!
  • Anscombe developed four different bivariate
    datasets, all with the exact same correlation r =
    0.81 and fitted linear regression y = 3 + 0.5x
  • 1. bi-variate normal
  • 2. quadratic
  • 3. bi-variate normal with one outlier
  • 4. one high-leverage point
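
R ships with Anscombe's quartet, so the claim is easy to verify (a small check, not in the original slides; the built-in data give r of about 0.82 and a fitted line of roughly y = 3 + 0.5x for all four sets):

  data(anscombe)
  sapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    c(r = cor(x, y), coef(lm(y ~ x)))   # nearly identical for all four sets
  })
  # yet plotting each set shows four very different structures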

21
(No Transcript)
22
Bivariate analysis: heavy metals vs. organic
matter
  • Scatterplot
  • Scatterplot by flood frequency
  • Regression of metal on organic matter (why this
    order?)
  • Same, including flood frequency in the model
  • > plot(om, log(cadmium))
  • > plot(om, log(cadmium), col = as.numeric(ffreq),
    cex = 1.5, pch = 20)

23
Model: Regression of metal on organic matter
  • > m <- lm(log(cadmium) ~ om)
  • > summary(m)
  • Residuals:
  •     Min      1Q  Median      3Q     Max
  • -2.3070 -0.3655  0.1270  0.6079  2.0503
  • Coefficients:
  •             Estimate Std. Error t value Pr(>|t|)
  • (Intercept) -1.04574    0.19234  -5.437 2.13e-07
  • om           0.21522    0.02339   9.202 2.70e-16
  • ---
  • Residual standard error: 0.9899 on 151 degrees of
    freedom
  • Multiple R-Squared: 0.3593, Adjusted R-squared:
    0.3551
  • F-statistic: 84.68 on 1 and 151 DF, p-value:
    2.703e-16
  • Highly significant model, but organic matter
    content explains only about 35% of the
    variability of log(Cd).

24
Good fit vs. significant fit
  • R2 can be highly significant (reject null
    hypothesis of no relation), but . . .
  • . . . the prediction can be poor
  • In other words, only a small portion of the
    variance is explained by the model
  • Two possibilities
  • 1. incorrect or incomplete model
  • (a) other factors are more predictive
  • (b) other factors can be included to improve the
    model
  • (c) form of the model is wrong
  • 2. correct model, noisy data

25
Regression diagnostics
  • Objective: to see if the regression truly
    represents the presumed relation
  • Objective: to see if the computational methods
    are adequate
  • Main tool: plot of standardized residuals vs.
    fitted values
  • Numerical measures: leverage, large residuals

26
Examining the scatterplot with the fitted line
  • Is there a trend in lack of fit? (further away in
    part of range)
  • a non-linear model
  • Is there a trend in the spread?
  • heteroscedasticity (unequal variances), so linear
    modelling is invalid
  • Are there points that, if eliminated, would
    substantially change the fit?
  • high leverage, isolated in the range and far from
    other points

27
Model diagnostics: regression of metal on organic
matter
  • > library(MASS)   # provides studres()
  • > m <- lm(log(cadmium) ~ om)
  • > plot(om, log(cadmium), col = as.numeric(ffreq),
    cex = 1.5, pch = 20); abline(m)
  • > plot(log(cadmium[!is.na(om)]), fitted(m),
    col = as.numeric(ffreq), pch = 20)
  • > abline(0, 1)
  • > plot(fitted(m), studres(m), col = as.numeric(ffreq),
    pch = 20)
  • > abline(h = 0)
  • > qqnorm(studres(m), col = as.numeric(ffreq),
    pch = 20); qqline(studres(m))
  • We can see problems at the low metal
    concentrations. This is probably an artifact of
    the measurement precision at these levels (near
    or below the detection limit).

28
(No Transcript)
29
Revised model: Cd detection limit
  • Values of Cd below 1 mg kg-1 are unreliable
  • Replace them all with 1 mg kg-1 and re-analyze
  • > cdx <- ifelse(cadmium > 1, cadmium, 1)
  • > plot(om, log(cdx), col = as.numeric(ffreq),
    cex = 1.5, pch = 20)
  • > m <- lm(log(cdx) ~ om)
  • > summary(m)
  • Residuals:
  •     Min      1Q  Median      3Q     Max
  • -1.0896 -0.4250 -0.0673  0.3527  1.5836
  • Coefficients:
  •             Estimate Std. Error t value Pr(>|t|)
  • (Intercept) -0.43030    0.11092  -3.879 0.000156
  • om           0.17272    0.01349  12.806  < 2e-16
  • ---
  • ---

30
Revised model: Cd detection limit
  • Residual standard error: 0.5709 on 151 degrees of
    freedom
  • Multiple R-Squared: 0.5206, Adjusted R-squared:
    0.5174
  • F-statistic: 164 on 1 and 151 DF, p-value:
    < 2.2e-16
  • > abline(m)
  • > plot(log(cdx[!is.na(om)]), fitted(m),
    col = as.numeric(ffreq), pch = 20); abline(0, 1)
  • > plot(fitted(m), studres(m),
    col = as.numeric(ffreq), pch = 20); abline(h = 0)
  • > qqnorm(studres(m), col = as.numeric(ffreq), pch = 20);
    qqline(studres(m))
  • Much higher R2 and better diagnostics. Still,
    there is a lot of spread at any value of the
    predictor (organic matter).

31
Experimental variogram
  • > v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse)
  • np dist gamma
  • 1 57 79.29244 0.6650872
  • 2 299 163.97367 0.8584648
  • 3 419 267.36483 1.0064382
  • 4 457 372.73542 1.1567136
  • 5 547 478.47670 1.3064732
  • 6 533 585.34058 1.5135658
  • 7 574 693.14526 1.6040086
  • 8 564 796.18365 1.7096998
  • 9 589 903.14650 1.7706890
  • 10 543 1011.29177 1.9875659
  • 11 500 1117.86235 1.8259154
  • 12 477 1221.32810 1.8852099
  • 13 452 1329.16407 1.9145967
  • 14 457 1437.25620 1.8505336
  • 15 415 1543.20248 1.8523791
  • np is the number of point pairs in the bin; dist
    is the average separation of these pairs; gamma is
    the average semivariance in the bin.
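
To make the columns concrete, here is a sketch (not from the original slides) that recomputes the first row roughly by hand: the semivariance for a bin is half the average squared difference over all point pairs whose separation falls in that bin. The cutoff and bin width below follow the rule quoted on the next slide (one-third of the maximum separation, split into 15 bins), so the result should be close to np = 57, dist of about 79, gamma of about 0.67, though not identical to gstat's own binning.

  library(gstat); data(meuse)
  z <- log(meuse$cadmium)
  d <- as.matrix(dist(meuse[, c("x", "y")]))       # all pairwise separations
  sq <- outer(z, z, function(a, b) (a - b)^2)      # squared differences (z_i - z_j)^2
  in.bin <- upper.tri(d) & d > 0 & d <= (max(d) / 3) / 15   # pairs in the first of 15 bins
  c(np = sum(in.bin), dist = mean(d[in.bin]), gamma = mean(sq[in.bin]) / 2)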

32
Plotting the experimental variogram
  • This can be plotted as semivariance gamma against
    average separation dist, along with the number of
    point pairs that contributed to each estimate (np)
  • > plot(v, plot.numbers = T)
  • (Note: gstat defaults to 15 equally-spaced bins
    and a maximum distance of 1/3 of the maximum
    separation. These can be overridden with the
    width and cutoff arguments, respectively.)

33
Plotting the variogram
34
Analysing the variogram
  • Later we will look at fitting a model to the
    variogram but even without a model we can notice
    some features, which we define here only
    qualitatively
  • Sill: maximum semivariance; represents
    variability in the absence of spatial dependence
  • Range: separation between point-pairs at which
    the sill is reached; the distance at which there is
    no evidence of spatial dependence
  • Nugget: semivariance as the separation
    approaches zero; represents variability at a
    point that can't be explained by spatial
    structure.
  • In the previous slide, we can estimate the sill ≈
    1.9, the range ≈ 1200 m, and the nugget ≈ 0.5,
    i.e. about 25% of the sill.

35
Using the experimental variogram to model the
random process
  • Notice that the semivariance of the separation
    vector g(h) is now given as the estimate of
    covariance in the spatial field.
  • So it models the spatially-correlated component
    of the regionalized variable
  • We must go from the experimental variogram to a
    variogram model in order to be able to model the
    random process at any separation.

36
Modelling the variogram
  • From the empirical variogram we now derive a
    variogram model which expresses semivariance as a
    function of separation vector. It allows us to
  • Infer the characteristics of the underlying
    process from the functional form and its
    parameters
  • Compute the semi-variance between any point-pair,
    separated by any vector
  • Interpolate between sample points using an
    optimal interpolator (kriging)

37
Authorized Models
  • Any variogram function must be able to model the
    following
  • 1. Monotonically increasing
  • possibly with a fluctuation (hole)
  • 2. Constant or asymptotic maximum (sill)
  • 3. Non-negative intercept (nugget)
  • 4. Anisotropy
  • Variograms must obey mathematical constraints so
    that the resulting kriging equations are solvable
    (e.g., positive definite between-sample
    covariance matrices).
  • The permitted functions are called authorized
    models.
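
gstat can list and sketch its authorized models: vgm() called with no arguments returns the table of available model abbreviations, and show.vgms() plots the basic shapes (both are standard gstat functions, shown here as a small aside).

  library(gstat)
  vgm()                          # data frame of authorized model names
  show.vgms()                    # lattice plot of the basic model shapes
  vgm(1.4, "Sph", 1200, 0.5)     # one such model: partial sill, type, range, nugget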

38
(No Transcript)
39
(No Transcript)
40
Visualization of the model in gstat-windows
  • Open the meuse.txt file in gstatw
  • Visualize the data
  • Try to fit various variogram models

41
Gstat for windows
42
Defining the bins
  • Distance interval, specifying the centres. E.g.
    (0, 100, 200, ...) means intervals of 0...50,
    50...150, ...
  • All point pairs whose separation is in the
    interval are used to estimate g(h) for h as the
    interval centre
  • Narrow intervals: more resolution, but fewer point
    pairs for each sample estimate
  • > v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse,
    boundaries = seq(50, 2050, by = 100))
  • > plot(v, pl = T)
  • > par(mfrow = c(2, 3))   # show all six plots
    together
  • > for (bw in seq(20, 220, by = 40)) {
      v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse,
        width = bw)
      plot(v$dist, v$gamma, xlab = paste("bin width",
        bw))
    }

43
Defining the bins
  • Each bin should have > 100 point pairs; > 300 is
    much more reliable
  • > v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse,
    width = 20)
  • > plot(v, plot.numbers = T)
  • > v$np
  •  [1]   6  19  27  27  51  65  58  62  62  82  76  75  86  81  76
  • [16]  91  92  90  88  92 112 103  80 116 108 106  79  94 117  99
  • [31] 100 101 108 117 110 117 114 107  96 110 109 106 114 117 104
  • [46]  98  94 117  92 110 105  91  89  98  89  91 103 102  93  92
  • [61]  73  85  88  91  88  84  75  81  90  73  93  95  76  85  67
  • [76]  77  88  60
  • > v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse,
    width = 120)
  • > v$np
  •  [1]  79 380 485 577 583 642 654 648 609 572 522 491 493 148
  • > plot(v, plot.numbers = T)

44
Fitting the model
  • Once a model form is selected, then the model
    parameters must be adjusted for a best fit of
    the experimental variogram.
  • By eye, adjusting parameters for a good-looking fit
  • Hard to judge the relative value of each point
  • Use the Windows version of gstat
  • Automatically, looking for the best fit according
    to some objective criterion
  • Various criteria are possible in gstat
  • In both cases, favour sections of the variogram
    with more pairs and at shorter ranges (because
    kriging is a local interpolator).
  • Mixed: adjust by eye, evaluate statistically; or
    vice versa
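
A sketch of the "objective criterion" (the weighting details below are an assumption based on fit.variogram's documented defaults, so check ?fit.variogram): fit.variogram minimizes a weighted sum of squared differences between the experimental and model semivariances, and the default weights favour bins with many pairs at short distances, which matches the advice above; an unweighted (ordinary least squares) fit can be requested for comparison.

  library(gstat); data(meuse)
  v  <- variogram(log(cadmium) ~ 1, ~ x + y, meuse)
  m1 <- vgm(1.4, "Sph", 1200, 0.5)
  fit.variogram(v, m1)                   # default weighted fit
  fit.variogram(v, m1, fit.method = 6)   # ordinary least squares fit, for comparison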

45
Fitting the model manually
  • We've decided on a spherical + nugget model
  • # Calculate the experimental variogram and
    display it
  • > v1 <- variogram(log(cadmium) ~ 1, ~ x + y, meuse)
  • > plot(v1, plot.numbers = T)
  • # Fit by eye, display fit
  • > m1 <- vgm(1.4, "Sph", 1200, 0.5); plot(v1,
    plot.numbers = T, model = m1)

46
Fitting the model automatically
  • # Let gstat adjust the parameters, display fit
  • > m2 <- fit.variogram(v1, m1)
  • > m2
  •   model     psill    range
  • 1   Nug 0.5478478    0.000
  • 2   Sph 1.3397965 1149.436
  • > plot(v1, plot.numbers = T, model = m2)
  • # Fix the nugget, fit only the sill of the
    spherical model
  • > m2a <- fit.variogram(v1, m1, fit.sills = c(F, T),
    fit.range = F)
  • > m2a
  •   model    psill range
  • 1   Nug 0.500000     0
  • 2   Sph 1.465133  1200

47
Model comparison
  • By eye: c0 = 0.5, c1 = 1.4, a = 1200; total sill
    c0 + c1 = 1.9
  • Automatic: c0 = 0.548, c1 = 1.340, a = 1149; total
    sill c0 + c1 = 1.888
  • The total sill was almost unchanged; gstat raised
    the nugget and lowered the partial sill of the
    spherical model a bit; the range was shortened by
    51 m.

48
What sample size to fit a variogram model?
  • Can't use non-spatial formulas for sample size,
    because spatial samples are correlated, and each
    sample is used multiple times in the variogram
    estimate
  • No way to estimate the true error, since we have
    only one realisation
  • Stochastic simulation from an assumed true
    variogram suggests:
  • < 50 points: not at all reliable
  • 100 to 150 points: more or less acceptable
  • > 250 points: almost certainly reliable
  • More points are needed to estimate an anisotropic
    variogram.
  • This is very worrying for many environmental
    datasets (soil cores, vegetation plots, . . . )
    especially from short-term fieldwork, where
    sample sizes of 40-60 are typical. Should
    variograms even be attempted on such small
    samples?
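
A sketch of the "stochastic simulation" idea (modelled on the unconditional-simulation example in the gstat documentation; the grid, seed and subsample size are illustrative assumptions): simulate a field from a known variogram, keep only a small subsample, and see how poorly its experimental variogram reproduces the model.

  library(gstat)
  set.seed(1)
  grd <- expand.grid(x = seq(0, 2000, by = 50), y = seq(0, 2000, by = 50))
  g   <- gstat(formula = z ~ 1, locations = ~ x + y, dummy = TRUE, beta = 0,
               model = vgm(1.4, "Sph", 1200, 0.5), nmax = 20)
  sim <- predict(g, newdata = grd, nsim = 1)      # one unconditional realisation
  sub <- sim[sample(nrow(sim), 50), ]             # pretend only 50 points were sampled
  plot(variogram(sim1 ~ 1, ~ x + y, sub),
       model = vgm(1.4, "Sph", 1200, 0.5))        # compare with the "true" model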

49
Approaches to spatial prediction
  • This is the prediction of the value of some
    variable at an unsampled point, based on the
    values at the sampled points.
  • This is often called interpolation, but strictly
    speaking that is only for points that are
    geographically inside the sample set (otherwise
    it is extrapolation).

50
Approaches to prediction Local predictors
  • Value of the variable is predicted from nearby
    samples
  • Example: concentrations of soil constituents
    (e.g. salts, pollutants)
  • Example: vegetation density

51
Local Predictors
  • Each interpolator has its own assumptions, i.e.
    theory of spatial variability
  • Nearest neighbour
  • Average within a radius
  • Average of n nearest neighbours
  • Distance-weighted average within a radius
  • Distance-weighted average of n nearest neighbours
  • Optimal weighting -> Kriging

52
Ordinary Kriging
  • The theory of regionalised variables leads to an
    optimal interpolation method, in the sense that
    the prediction variance is minimized.
  • This is based on the theory of random functions,
    and requires certain assumptions.

53
Optimal local interpolation: motivation
  • Problems with average-in-circle methods
  • 1. No objective way to select radius of circle or
    number of points
  • Problems with inverse-distance methods
  • 1. How to choose power (inverse, inverse squared
    . . . )?
  • 2. How to choose limiting radius?
  • In both cases
  • 1. Uneven distribution of samples could over- or
    under-emphasize some parts of the field
  • 2. Prediction error must be estimated from a
    separate validation dataset
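
For later comparison with kriging, gstat's idw() function implements the distance-weighted average; the power is its idp argument, and, as noted above, there is no objective way to choose it (a small illustration; the two powers are arbitrary).

  library(gstat)
  data(meuse); data(meuse.grid)
  p2  <- idw(log(cadmium) ~ 1, ~ x + y, meuse, meuse.grid)             # default power 2
  p05 <- idw(log(cadmium) ~ 1, ~ x + y, meuse, meuse.grid, idp = 0.5)  # lower power: much smoother
  summary(p2$var1.pred); summary(p05$var1.pred)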

54
An optimal local predictor would have these
features
  • Prediction is made as a linear combination of
    known data values (a weighted average).
  • Prediction is unbiased and exact at known points
  • Points closer to the point to be predicted have
    larger weights
  • Clusters of points reduce to single equivalent
    points, i.e., over-sampling in a small area can't
    bias the result
  • Closer sample points mask further ones in the
    same direction
  • Error estimate is based only on the sample
    configuration, not the data values
  • Prediction error should be as small as possible.

55
Kriging
  • A Best Linear Unbiased Predictor (BLUP) that
    satisfies certain criteria for optimality.
  • It is only optimal with respect to the chosen
    model!
  • Based on the theory of random processes, with
    covariances depending only on separation (i.e. a
    variogram model)
  • Theory developed several times (Kolmogorov
    1930s, Wiener 1949) but current practice dates
    back to Matheron (1963), formalizing the
    practical work of the mining engineer D G Krige
    (RSA).

56
How do we use Kriging?
  • 1. Sample, preferably at different resolutions
  • 2. Calculate the experimental variogram
  • 3. Model the variogram with one or more
    authorized functions
  • 4. Apply the kriging system, with the variogram
    model of spatial dependence, at each point to be
    predicted
  • Predictions are often at each point on a regular
    grid (e.g. a raster map)
  • These points are actually blocks the size of
    the sampling support
  • Can also predict in blocks larger than the
    original support
  • 5. Calculate the error of each prediction; this
    is based only on the sample point locations, not
    their data values.

57
Prediction with Ordinary Kriging (OK)
  • In OK, we model the value of variable z at
    location si as the sum of a regional mean m and a
    spatially-correlated random component e(si)
  • Z(si) = m + e(si)
  • The regional mean m is estimated from the sample,
    but not as the simple average, because there is
    spatial dependence. It is implicit in the OK
    system.

58
Prediction with Ordinary Kriging (OK)
  • Predict at points, with unknown mean (which must
    also be estimated) and no trend
  • Each point x is predicted as the weighted average
    of the values at all sample points
  • The weights assigned to each sample point sum to
    1
  • Therefore, the prediction is unbiased
  • Ordinary: no trend or strata; the regional mean
    must be estimated from the sample
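
To make this concrete, here is a minimal sketch (not from the slides; the three sample locations and the prediction point are invented) that sets up and solves the ordinary kriging system in semivariance form for the spherical + nugget model fitted earlier, showing that the weights do sum to 1.

  # semivariance of the fitted Sph + nugget model (c0 = 0.5, c1 = 1.4, a = 1200)
  sv <- function(h, c0 = 0.5, c1 = 1.4, a = 1200)
    ifelse(h == 0, 0, ifelse(h < a, c0 + c1 * (1.5 * h / a - 0.5 * (h / a)^3), c0 + c1))
  s  <- cbind(x = c(0, 300, 800), y = c(0, 400, 100))   # three hypothetical samples
  s0 <- c(150, 150)                                     # point to be predicted
  D  <- as.matrix(dist(rbind(s, s0)))                   # all pairwise distances
  n  <- nrow(s)
  A  <- rbind(cbind(sv(D[1:n, 1:n]), 1), c(rep(1, n), 0))  # OK system with Lagrange row/column
  b  <- c(sv(D[1:n, n + 1]), 1)
  w  <- solve(A, b)
  w[1:n]        # kriging weights, largest for the nearest sample
  sum(w[1:n])   # exactly 1: the unbiasedness constraint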

59
OK in gstat
  • # load the 40 m interpolation grid
  • data(meuse.grid)
  • # also make a full grid in the bounding box
    covering the same area
  • meuse.box <- expand.grid(
      x = seq(min(meuse.grid$x), max(meuse.grid$x),
        by = 40),
      y = seq(min(meuse.grid$y), max(meuse.grid$y),
        by = 40)
    )

60
OK in gstat
  • The krige method is used with a variogram model
  • # compute the experimental variogram
  • v <- variogram(log(cadmium) ~ 1, ~ x + y, meuse)
  • # estimated model
  • m <- vgm(1.4, "Sph", 1200, 0.5)
  • # fitted model
  • m.f <- fit.variogram(v, m)
  • [1] "SSErr  2.80736271217044e-05"
  • kr <- krige(log(cadmium) ~ 1, loc = ~ x + y,
    data = meuse, newdata = meuse.grid, model = m.f)
  • [using ordinary kriging]

61
OK in gstat
  • library(lattice)
  • # visualize the interpolation; note the aspect
    option to get correct geometry
  • levelplot(var1.pred ~ x + y, kr, aspect = mapasp(kr))
  • # visualize the prediction error
  • levelplot(var1.var ~ x + y, kr, aspect = mapasp(kr))

62
(No Transcript)
63
(No Transcript)
64
How realistic are maps made by Ordinary Kriging?
  • The resulting surface is smooth and shows no
    noise, even if there is a nugget effect in the
    variogram model
  • So the field is the best at each point taken
    separately, but taken as a whole is not a
    realistic map
  • The sample points are predicted exactly; they are
    assumed to be without error, again even if there
    is a nugget effect in the variogram model. (Note:
    block kriging does not have this problem.)