Title: Model Based Geostatistics
1Model Based Geostatistics
- Archie Clements
- University of Queensland
- School of Population Health
2Overview
- Introduction to geostatistics
- Assumptions
- Variogram components
- Variogram models
- Kriging
- Assumptions
- Model-based geostatistics
- Principles
- Building the model
- Prediction
- Validation
- Applications: parasitic disease control in Africa
3Spatial variation
(Figure: three-dimensional plot with axes X, Y and Z)
4First and second order variation
- First-order variation
- Trend
- Large-scale variation
- Can be due to large-scale environmental drivers (e.g. temperature for vector-borne diseases)
- Second-order variation
- Localised variation, clustering
- Modelled using geostatistics
5Spatial dependence
- Observations close in space are more similar than observations far apart
- The variance of pairs of observations that are close together (small h) tends to be smaller than the variance of pairs far apart (large h)
- Basis of the semivariogram
- Spatial decomposition of the sample variance
6Semivariance: statistical notation
Semivariance is half the average squared difference of values observed at locations separated by a given distance (and direction)
Function of distance (and direction): distance in bins, direction in sectors of compass azimuth
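The definition above can be sketched in a few lines of Python. This is a minimal omnidirectional version (direction is ignored), and the function name, argument names and bin edges are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Omnidirectional empirical semivariance for each distance bin.

    coords: (n, 2) locations; values: (n,) observations;
    bin_edges: increasing lag-bin boundaries (an illustrative choice).
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # All unique pairs (i < j) of observations.
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2

    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = (dist > bin_edges[b]) & (dist <= bin_edges[b + 1])
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            # Half the average squared difference of the pairs in this bin.
            gamma[b] = 0.5 * sq_diff[in_bin].mean()
    return gamma, counts
```

Bins that contain no pairs are returned as NaN, so they can be dropped before a model is fitted.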
7Modelling spatial correlation: semivariogram
(Figure: semivariogram of semivariance against lag (h), with the nugget, partial sill and sill marked)
8Nugget
- Random variation (white noise): non-spatial measurement error
- Microvariation (spatial variation at a scale smaller than the smallest bin)
- If no spatial correlation
- Nugget = sill (flat semivariogram)
9Semivariogram: decisions to be made
- How many/what sized bins?
- Depends on density of data points
- For regularly spaced (grid-sampled) data, bin size = size of cells in the grid
- For irregular sampling, modify according to range of spatial correlation (big range, big bins; small range, small bins)
- What maximum lag (h) to use?
- Should be estimated up to half the length of the shortest side of the study area
- Which parametric model to use?
- Visual fit
- Statistical fit
10Variogram models
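The figure for this slide did not survive extraction, so the sketch below lists three commonly used parametric semivariogram models (spherical, exponential, Gaussian) as an assumption about what it showed. The parameter names `nugget`, `partial_sill` and `a` (a range parameter) are illustrative; other parameterisations of the exponential and Gaussian models rescale `a`.

```python
import numpy as np

def spherical(h, nugget, partial_sill, a):
    """Spherical model: reaches the sill exactly at lag a."""
    h = np.asarray(h, dtype=float)
    rising = nugget + partial_sill * (1.5 * h / a - 0.5 * (h / a) ** 3)
    gamma = np.where(h < a, rising, nugget + partial_sill)
    return np.where(h == 0, 0.0, gamma)  # semivariance is zero at lag 0

def exponential(h, nugget, partial_sill, a):
    """Exponential model: approaches the sill asymptotically."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-h / a))
    return np.where(h == 0, 0.0, gamma)

def gaussian(h, nugget, partial_sill, a):
    """Gaussian model: very smooth behaviour near the origin."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-((h / a) ** 2)))
    return np.where(h == 0, 0.0, gamma)
```

In each case the sill is `nugget + partial_sill`, matching the components labelled on the semivariogram slide.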
11Schistosoma mansoni, Uganda
Omnidirectional semivariograms
12Anisotropy
- Spatial dependence is different in different directions
- Semivariogram calculated in one direction is different from semivariogram calculated in another direction
- Should check for anisotropy and, if present, accommodate it in interpolation
- Range or sill (or both) can differ
13Schistosoma mansoni, Uganda: directional semivariograms
Direction         Range (km)   Sill   Nugget
Omnidirectional   43.4         7E-2   4E-2
0                 39.4         1E-1   -3E-3
45                43.6         7E-2   2E-2
90                35.8         8E-2   3E-2
135               39.5         1E-1   2E-2
14Schistosoma haematobium, Northwestern Tanzania
Direction         Range (km)   Sill   Nugget
Omnidirectional   36.0         5E-2   0
0                 260.1        2E-2   3E-2
45                163.9        6E-3   3E-2
90                56.2         5E-2   0
135               97.7         3E-2   7E-3
15Schistosoma haematobium, Northwestern Tanzania
16Trended and skewed data
- Data should be de-trended
- Polynomials (regression on XY coordinates)
- Generalised linear models (regression on covariates)
- Generalised additive models (can over-fit)
- If directional variograms are calculated and the range in one direction is > 3 x the range in the perpendicular direction, this is a sign of trend
- If skewed, consider transformation (e.g. log transformation, normal score transformation)
- Otherwise, extreme values overly influence interpolated map
- Have to back-transform interpolated values
- Called disjunctive Kriging
17Non-stationarity
- Spatial correlation structure cannot be generalised to the whole study area
- Why does it occur?
- Different factors may operate in different parts of the study area
- Different ecological zones with different disease epidemiology
- Need to estimate the spatial correlation structure separately in each homogeneous zone
18Kriging
Prediction at s0 is a weighted sum of the measured values: $\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i)$
- Z(si) is the measured value at the ith location
- λi is the weight attributed to the measured value at the ith location (calculated using semivariogram)
- s0 is the prediction location
For formulae on how the weights are estimated using the variogram: http://en.wikipedia.org/wiki/Kriging
Prediction standard error/variance gives an indication of precision of the prediction
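As a rough illustration of how the weights and the prediction variance can be obtained from a semivariogram, the sketch below solves the ordinary kriging system with NumPy. The slides do not say which form of Kriging was used, so ordinary kriging, the exponential semivariogram and all names and parameter values here are assumptions.

```python
import numpy as np

def exp_semivariogram(h, sill=1.0, a=10.0):
    """Illustrative exponential semivariogram with no nugget."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / a))

def ordinary_kriging(coords, values, s0, gamma):
    """Predict at s0 as a weighted sum of the measured values."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Semivariances between all pairs of data locations.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)
    A[n, n] = 0.0  # last row and column force the weights to sum to 1
    # Semivariances between each data location and the prediction location.
    d0 = np.linalg.norm(coords - np.asarray(s0, dtype=float), axis=1)
    b = np.ones(n + 1)
    b[:n] = gamma(d0)
    solution = np.linalg.solve(A, b)
    weights, lagrange = solution[:n], solution[n]
    prediction = weights @ values
    variance = weights @ b[:n] + lagrange  # kriging (prediction) variance
    return prediction, variance, weights
```

With no nugget the predictor reproduces the measured value at a sampled location, and the prediction variance there is zero.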
19Geostatistics summary
- Geostatistics involves 3 steps
- Exploratory data analysis
- Definition of a variogram
- Using the variogram for interpolation (Kriging)
- Technique applicable for
- Point-referenced data
- Spatially continuous processes
- Disease risk
- Rainfall, elevation, temperature, other climate variables
- Wildlife, vegetation, geology (mineral deposits)
20Bayesian model-based geostatistics
- Seminal paper
- Diggle, Tawn and Moyeed (1998). Model-based geostatistics. Appl. Stat. 47(3): 299-350
- Observed a need for addressing non-Gaussian observational error
- Idea is to embed linear Kriging methodology within a more general distributional framework
- Generalised linear models with an unobserved Gaussian process in the linear predictor
- Implemented in a Bayesian framework
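For prevalence data, the framework described above can be written as a binomial model with a logit link. This is a sketch of one common form, not necessarily the specification used in the applications that follow; the exponential correlation function and the symbols $\alpha$, $\beta_k$, $\phi$ and $\sigma^2$ are assumptions.

```latex
\begin{align*}
Y_i \mid p_i &\sim \mathrm{Binomial}(n_i, p_i), \\
\operatorname{logit}(p_i) &= \alpha + \sum_{k} \beta_k x_{ik} + S(s_i), \\
S &\sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma_{ij} = \sigma^2 \exp(-\phi\, d_{ij}).
\end{align*}
```

Here $Y_i$ is the number positive out of $n_i$ examined at location $s_i$, $x_{ik}$ are covariates, $S(s_i)$ is the unobserved Gaussian process, $d_{ij}$ is the distance between locations $i$ and $j$, $\phi$ is the rate of decay of spatial correlation and $\sigma^2$ is the variance of the spatial random effect (the sill).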
21Advantages of the Bayesian approach
- Natural framework for incorporation of parameter uncertainty into spatial prediction
- Can build uncertainty into parameters using priors
- Non-informative
- Informative (based on exploratory analysis, additional sources of information)
- Convenient for modelling hierarchical data structures
22Bayesian model-based geostatistics
23Predictions
- Can predict at specified validation locations (with observed outcomes for comparison)
- Can predict at non-sampled locations, e.g. a prediction grid
- Might be interested in
- outcome
- spatial random effect
- Standard error of predicted outcome
24Validation
- Jack-knifing: sampling with replacement
- Remove one observation, do prediction at that location and store predicted value
- Repeat for all observations
- Compare predicted to observed using statistical measures of fit (RMSE) and discriminatory performance (AUC)
- Not feasible with MBG other than with very small datasets
- Cross-validation: sampling without replacement
- Set aside a subset for validation (ideally 50%)
- Use remaining data to train model
- Compare predicted and observed for the validation subset using statistical measures
- Can then recombine the validation and training subsets for final model build
- External validation using other prospective or retrospective dataset
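The hold-out split and the two measures named above (RMSE and AUC) can be sketched as follows. The function names, the fixed seed and the pairwise AUC calculation are illustrative choices, not the code used for the analyses in these slides.

```python
import numpy as np

def holdout_split(n, validation_fraction=0.5, seed=0):
    """Randomly split indices 0..n-1 into training and validation subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(round(n * validation_fraction))
    return idx[n_val:], idx[:n_val]  # training indices, validation indices

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def auc(labels, scores):
    """Area under the ROC curve from all positive/negative pairs."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # A pair is concordant when the positive scores higher; ties count half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

For AUC the observed outcome has to be binary, for example whether observed prevalence at a validation location exceeds a control threshold.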
25Model-based geostatistics summary
- Model-based geostatistics involves
- Visual and exploratory data analysis
- Variography (to determine if there is second-order spatial variation)
- Variable selection (for deterministic component)
- Building model (e.g. in WinBUGS)
- Model selection (e.g. using DIC)
- Prediction and validation
26Application: Schistosomiasis in Sub-Saharan Africa
27- Schistosomiasis
- 779 million people at risk
- 207 million infected
- Most in Africa
- Significant illness and mortality
- Two main forms in Africa
- Urinary schistosomiasis caused by Schistosoma haematobium
- Intestinal schistosomiasis caused by S. mansoni
28Life cycle of Schistosoma haematobium
(Figure: life cycle diagram with stages labelled: adult worm in human bladder wall, eggs in urine, miracidia, sporocysts in snail, cercariae released)
29Diagnosis of infection
- S. haematobium
- Microscopic examination of urine slides: presence of eggs and egg counts
- Macrohaematuria (visible blood)
- Microhaematuria (invisible blood) tested using chemical reagent strips
- Blood in urine questionnaire
- S. mansoni and soil-transmitted helminths
- microscopic examination of stool samples
30School-based control programmes
- School-aged children have highest prevalence (proportion infected) and intensity (severity) of infection
- Education system is convenient for control: central location to access target population
31How do we determine which schools should be
targeted?
- World Health Organisation guidelines: treat communities biannually where prevalence in school-age children is > 10% and annually where prevalence is > 50%
- No surveillance
- Need to do surveys
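The two thresholds in the guideline above translate directly into a decision rule. In this minimal sketch the function name is hypothetical, prevalence is a proportion between 0 and 1, and the label for prevalence at or below 10% is an assumption because the slide does not cover that case.

```python
def treatment_strategy(prevalence):
    """Map prevalence in school-age children (0 to 1) to a treatment frequency."""
    if prevalence > 0.5:
        return "annual"
    if prevalence > 0.1:
        return "biannual"
    # Not covered by the slide; the label is a placeholder assumption.
    return "below threshold"
```

Applied to predicted prevalence at non-sampled locations, a rule of this kind turns a prevalence map into an intervention map.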
32Field survey: northwest Tanzania
Lake Victoria
- 153 schools surveyed
- 60 children per school
- What about non-sampled locations? Need to predict
(interpolate) values
33MBG model for S. haematobium prevalence
34S. haematobium model results
Variable             Coefficient          Odds Ratio
Intercept            1.9 (-2.3 - 10.3)
LST > 35-39 °C       0.4 (-0.3 - 1.1)     1.5 (0.8 - 2.9)
LST > 39 °C          0.3 (-1.5 - 2.2)     1.4 (0.2 - 8.6)
Rainfall > 1050 mm   -1.1 (-3.4 - 1.1)    0.3 (3.3 x 10^-2 - 3.1)
?                    0.9 (0.6 - 1.3)
φ                    0.2 (0.1 - 1.0)
35Clements et al. TMIH 2006
36Uncertainty
Lower bound 95% PI
Upper bound 95% PI
37- Co-ordinated surveys in 3 contiguous countries
- 418 schools
- > 26,000 children
Variable                                       Mean (95% CI)       SD
Sex: Female                                    0.70 (0.65, 0.76)   0.03
Age: 9-10 years                                1.16 (1.00, 1.33)   0.08
Age: 11-12 years                               1.51 (1.31, 1.73)   0.10
Age: 13-16 years                               1.79 (1.53, 2.06)   0.14
Distance to perennial water body               0.34 (0.21, 0.54)   0.08
Land surface temperature                       0.80 (0.51, 1.21)   0.18
Land surface temperature^2                     1.10 (0.85, 1.40)   0.14
Rate of decay of spatial correlation           2.03 (1.48, 2.74)   0.32
Variance of the spatial random effect (sill)   7.03 (5.36, 9.31)   1.01
Probability that prevalence is > 50%
Clements et al. EID 2008
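A map of the probability that prevalence exceeds 50% can be derived from posterior draws of predicted prevalence at each location. The sketch below assumes the draws are already available as an array with one row per posterior sample and one column per prediction location; the function name and array layout are illustrative.

```python
import numpy as np

def exceedance_probability(prevalence_samples, threshold=0.5):
    """Proportion of posterior draws above the threshold at each location.

    prevalence_samples: (n_samples, n_locations) array of predicted prevalence.
    """
    samples = np.asarray(prevalence_samples, dtype=float)
    return (samples > threshold).mean(axis=0)
```

Locations with a probability close to 0 or 1 can be classified with confidence; values near 0.5 flag places where further survey data would be most informative.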
38Other outcomes: co-infection
East Africa (Brooker and Clements, Int. J. Parasitol., in press)
- S. mansoni mono-infection: 7.9%
- Hookworm mono-infection: 40.5%
- Co-infection: 8.1%
39Model for co-infection
$Y_{ijk} \sim \mathrm{Multinomial}(p_{ijk}, n_{ijk})$
40Posterior mean (95% posterior CI) by outcome
Variable                 S. mansoni mono-infection   Hookworm mono-infection   S. mansoni/hookworm co-infection
Intercept                -3.8 (-4.7 - -2.9)          -0.6 (-1.1 - -0.3)        -4.4 (-5.0 - -3.7)
OR Elevation             0.35 (0.22 - 0.58)          0.77 (0.65 - 0.89)        0.30 (0.20 - 0.47)
OR DPWB                  0.23 (0.10 - 0.45)          0.94 (0.76 - 1.15)        0.30 (0.18 - 0.58)
OR Rural vs urban        0.43 (0.21 - 0.79)          0.98 (0.68 - 1.37)        0.61 (0.36 - 1.02)
OR Ext. rural vs urban   0.62 (0.23 - 1.44)          1.16 (0.82 - 1.81)        0.75 (0.31 - 1.62)
OR LST                   0.88 (0.62 - 1.25)          0.60 (0.50 - 0.72)        0.57 (0.31 - 0.87)
OR Female                0.86 (0.76 - 0.96)          0.91 (0.86 - 0.97)        0.70 (0.63 - 0.77)
OR Age (9-10 years)      1.67 (1.37 - 2.06)          1.17 (1.04 - 1.30)        1.82 (1.52 - 2.21)
OR Age (11-13 years)     2.44 (2.06 - 2.89)          1.55 (1.39 - 1.71)        2.99 (2.55 - 3.52)
OR Age (14 years)        2.87 (2.19 - 3.71)          1.88 (1.63 - 2.14)        3.83 (3.01 - 4.86)
Phi (rate of decay)      3.52 (1.73 - 7.21)          4.98 (3.38 - 7.33)        3.76 (2.10 - 7.36)
Sill                     6.39 (3.52 - 11.78)         1.31 (0.98 - 1.76)        6.34 (3.98 - 9.95)
41Co-infection
Hookworm monoinfection
S. mansoni - Hookworm coinfection
S. mansoni monoinfection
42Other outcomes: intensity of infection
- Prevalence is used (currently) for disease control planning
- Intensity of infection (eggs/ml urine or /g faeces) is more indicative of
- Morbidity (anaemia, urinary tract, hepatic pathology)
- Transmission
43Model for intensity of infection
44Intensity of S. mansoni infection, East Africa
Clements et al. Parasitol 2006
Variable         Posterior Mean (95% CI)
Intercept        10.06 (5.77 - 13.22)
Female           -0.41 (-0.72 - -0.11)
Elevation (m)    -0.007 (-0.01 - -0.004)
DPWB (dec deg)   -5.36 (-7.51 - -3.30)
Sill             23.96 (19.06 - 32.07)
Range            0.134 (0.09 - 0.20)
Overdispersion   0.06 (0.058 - 0.062)
45Conclusions
- In disease control we need an evidence-based framework for deciding where to allocate limited control resources
- Maps are useful tools for highlighting sub-national variation, targeting interventions, advocacy (national and local), integrated control programmes and estimating heterogeneities in disease burden
- Model-based geostatistics enables rich inference from spatial data, including uncertainty