Title: Between and beyond: Irregular series, interpolation, variograms, and smoothing
2Mind the gap!
- Repeated reminder, London Underground.
3Irregular series
- Irregular series are series in which non-missing values are not all equally spaced.
- Special case: Values would be equally spaced (every day, every year, ...), but there are some gaps with missing values, for human or inhuman reasons.
- General case: Values are just at known times or points, with no necessary rules about spacing.
- Irregular series often seem to invite interpolation.
4Luke Howard (1772-1864)
- Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth).
- Here we use as a sandbox some of his temperature data from Plaistow, near London, in 1807.
5- Howard, Luke. 1818. The Climate of London, Deduced from Meteorological Observations, Made at Different Places in the Neighbourhood of the Metropolis. Volume I. London: W. Phillips, etc.
7Series of events
- N.B. We are not talking here about series of events, or realisations of point processes.
- In such series occurrences are typically irregularly spaced, but the gaps are inherent in the process, not a failing of our data.
- Examples range from eruptions to elections.
9Interpolation
- Interpolation is the art of reading between the lines.
- Historically, it is a deterministic process, often a matter of going beyond printed tables of functions (logarithmic, trigonometric, and so forth).
- In principle, we should worry about the statistical properties of interpolation. It is estimation or prediction.
- In practice, imputation now appears better known among statistical researchers.
10Interpolation in (official) Stata
- The ipolate command for linear interpolation (and extrapolation) was added in Stata 3.1 (1993).
- The Mata functions spline3() and spline3eval() were added in Stata 9.0 (2005).
11User-written programs
- Programs (NJC) are available from SSC for:
- cubic interpolation: cipolate (2002)
- cubic spline interpolation: csipolate (2009)
- piecewise cubic Hermite interpolation: pchipolate (2012)
- nearest neighbour interpolation: nnipolate (2012)
- A combined and extended program mipolate will
shortly be available too.
12Two dimensions too
- Note also bipolate (Joseph Canner, SSC) (2014).
- By default it uses quintic polynomials.
- Other available methods include thin plate splines and Shepard's method.
- Note also twoway contour.
13mipolate generalises ipolate
- Interpolation is of yvar with respect to specified xvar.
- Prior tsset or xtset is not assumed.
- Regular spacing is not assumed.
- Multiple values of yvar at the same xvar are averaged first.
- Groupwise operations using by: are supported.
14Linear and cubic
- Linear interpolation just uses previous and following known values (only).
- This is done by ipolate, and also mipolate by default.
- Cubic interpolation is another classic method, using two previous and two following known values (only).
- This is done by mipolate, cubic.
- The default of mipolate with either method (as with ipolate) is not to extrapolate.
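
As a rough illustration of the two methods above, here is a minimal Python sketch (not the Stata code behind ipolate or mipolate). The function names and the data are invented for the example. The linear version uses one known value either side of the gap; the cubic version passes a polynomial through two known values either side, written here in Lagrange form.

```python
def linear_interp(x0, y0, x1, y1, x):
    """Straight line through (x0, y0) and (x1, y1), evaluated at x."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def lagrange_interp(xs, ys, x):
    """Polynomial through all the points (xs[i], ys[i]), evaluated at x.

    With four points (two before, two after the gap) this is a cubic.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


# Invented data: known values at irregular times 1, 2, 4, 7; time 3 is missing.
times = [1.0, 2.0, 4.0, 7.0]
values = [1.0, 8.0, 64.0, 343.0]   # y = x**3, so the true value at 3 is 27

linear_guess = linear_interp(2.0, 8.0, 4.0, 64.0, 3.0)
cubic_guess = lagrange_interp(times, values, 3.0)
print(linear_guess, cubic_guess)
```

Because the invented data lie exactly on a cubic, the cubic estimate recovers the true value while the linear estimate does not; with real data neither method has that guarantee.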
15Un peu d'histoire
- Cubic interpolation is often attributed to Joseph-Louis Lagrange (1736-1813) but was proposed earlier by Edward Waring (1735?-1798).
16 Lagrange Waring
17Cubic splines
- As before, we are using cubic polynomials locally, but they are constrained to join smoothly.
- The syntax is mipolate, spline.
- This is merely a wrapper for the official Mata functions.
- As before, the default of mipolate with this option is not to extrapolate.
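
To make "cubics constrained to join smoothly" concrete, here is a minimal Python sketch of a cubic spline through irregularly spaced points. It is not the Mata code: the function name is invented, and the natural end conditions (second derivative zero at both ends) are an assumption of this sketch rather than a statement about what mipolate, spline does internally.

```python
def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through the points.

    xs must be strictly increasing; spacing may be irregular.
    """
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives m[0..n-1] at the knots,
    # with m[0] = m[n-1] = 0 (the "natural" end conditions assumed here).
    diag = [1.0] * n
    lower = [0.0] * n   # lower[i] multiplies m[i-1] in row i
    upper = [0.0] * n   # upper[i] multiplies m[i+1] in row i
    rhs = [0.0] * n
    for i in range(1, n - 1):
        lower[i] = h[i - 1]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        upper[i] = h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i]
                        - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        factor = lower[i] / diag[i - 1]
        diag[i] -= factor * upper[i - 1]
        rhs[i] -= factor * rhs[i - 1]
    m = [0.0] * n
    m[n - 1] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        m[i] = (rhs[i] - upper[i] * m[i + 1]) / diag[i]

    def evaluate(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x outside the data range: no extrapolation")
        # Find the interval [xs[i], xs[i+1]] containing x.
        i = 0
        while i < n - 2 and x > xs[i + 1]:
            i += 1
        a = xs[i + 1] - x
        b = x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


# Invented data with irregular spacing.
spline = natural_cubic_spline([0.0, 1.0, 2.5, 4.0], [1.0, 3.0, 2.0, 5.0])
print(spline(2.0))
```

Like the default described above, the sketch refuses to extrapolate: it raises an error for x outside the range of the known values.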
18Linear extrapolation
- As with ipolate, linear extrapolation is available as an option in mipolate to fill in missings at the ends of series.
- What your teachers told you is true: extrapolation is dangerous.
- Don't point that straight line! It can go off anywhere.
- (Allude here to Mark Twain on the Mississippi.)
19Piecewise cubic Hermite interpolation
- This method also uses piecewise cubics joining smoothly. The syntax is mipolate, pchip.
- The interpolant is shape-preserving and cannot overshoot locally.
- Sections in which yvar is increasing, decreasing or constant with xvar remain so after interpolation.
- Hence local maxima and minima also remain so.
- This interpolation method also extrapolates.
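
The shape-preserving property comes from how the slopes at the known points are chosen. The Python sketch below is a simplified illustration, not the code in mipolate or in MATLAB's pchip: interior slopes use the weighted harmonic mean of the neighbouring secant slopes (zero where those slopes differ in sign or one is zero), while the end slopes are simply set to the end secant slopes, which is a simplification assumed here. Values of x outside the data range are evaluated on the nearest end cubic, again an assumption of this sketch.

```python
def pchip_slopes(xs, ys):
    """Slopes at the data points, chosen so the interpolant keeps the data's shape."""
    n = len(xs)
    h = [xs[k + 1] - xs[k] for k in range(n - 1)]
    delta = [(ys[k + 1] - ys[k]) / h[k] for k in range(n - 1)]
    d = [0.0] * n
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:
            # Same sign either side: weighted harmonic mean of the secant slopes.
            w1 = 2.0 * h[k] + h[k - 1]
            w2 = h[k] + 2.0 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # Otherwise a local extremum or a flat stretch: the slope stays 0.
    # Simplification in this sketch: end slopes are the end secant slopes.
    d[0] = delta[0]
    d[-1] = delta[-1]
    return h, delta, d


def pchip(xs, ys, x):
    """Evaluate the piecewise cubic Hermite interpolant at x (xs strictly increasing)."""
    h, delta, d = pchip_slopes(xs, ys)
    k = 0
    while k < len(xs) - 2 and x > xs[k + 1]:
        k += 1
    s = x - xs[k]
    c = (3.0 * delta[k] - 2.0 * d[k] - d[k + 1]) / h[k]
    b = (d[k] - 2.0 * delta[k] + d[k + 1]) / h[k] ** 2
    return ys[k] + s * (d[k] + s * (c + s * b))


# Invented step-like data: the interpolant should stay between 0 and 1.
step_x = [0.0, 1.0, 2.0, 3.0]
step_y = [0.0, 0.0, 1.0, 1.0]
print([round(pchip(step_x, step_y, 1.0 + i / 4.0), 3) for i in range(5)])
```

On the step-like data the curve climbs from 0 to 1 without dipping below 0 or rising above 1, which is the "cannot overshoot locally" behaviour described above.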
20 Charles Hermite (1822-1901)
21Other methods
- mipolate adds forward, backward and nearest neighbour interpolation.
- Use the previous, next or the nearest known value.
- Using the last known value is often dubious statistically, but it is a very common request in data management.
- The other methods are provided mostly for completeness.
- There is small print (option choices) about how to break ties when two values are equally near.
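
These three rules are simple enough to sketch in a few lines of Python. The function below is an illustration only: its name, the method labels and the tie argument are hypothetical and are not mipolate's option names. Missing values are None, and for "nearest" a gap with no known value on one side is left missing, which is an assumption of the sketch.

```python
def fill_missing(xs, ys, method="forward", tie="previous"):
    """Fill None entries in ys using known values; xs must be sorted ascending.

    method: "forward" (previous known value), "backward" (next known value)
            or "nearest" (closer of the two, interior gaps only).
    tie:    "previous" or "next", used by "nearest" when both are equally near.
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        before = [pair for pair in known if pair[0] < x]
        after = [pair for pair in known if pair[0] > x]
        prev = before[-1] if before else None
        nxt = after[0] if after else None
        if method == "forward":
            filled.append(prev[1] if prev else None)
        elif method == "backward":
            filled.append(nxt[1] if nxt else None)
        elif method == "nearest":
            if prev is None or nxt is None:
                filled.append(None)      # no value on one side: leave missing
            elif x - prev[0] < nxt[0] - x:
                filled.append(prev[1])
            elif x - prev[0] > nxt[0] - x:
                filled.append(nxt[1])
            else:
                filled.append(prev[1] if tie == "previous" else nxt[1])
        else:
            raise ValueError("unknown method: " + method)
    return filled


# Invented data: time 3 is equally near the known values at times 1 and 5.
demo_x = [1, 2, 3, 4, 5, 6]
demo_y = [10, None, None, None, 50, None]
print(fill_missing(demo_x, demo_y, "nearest", tie="previous"))
print(fill_missing(demo_x, demo_y, "nearest", tie="next"))
```

The two calls differ only at time 3, which shows why the tie-breaking rule has to be stated rather than left implicit.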
22mipolate summary
- Seven methods, and whether linear extrapolation is available as an option:

  Method            Linear extrapolation?
  linear            yes
  cubic             yes
  (cubic) spline    yes
  pchip             no
  forward           no
  backward          no
  nearest           no
25Simple messages
- There are many interpolation methods to choose from.
- They will often disagree, even for simple-looking instances.
- Disagreement gives a handle on uncertainty.
- In a real problem, simulate missings and test how well known values are estimated.
- What makes most sense in your problem will reflect its dependence structure.
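
One simple way to "simulate missings" is to hide each known interior value in turn, estimate it from its neighbours, and compare. The Python sketch below does that for three methods; the function names and the data are invented for illustration, and a real study would also want to mimic the pattern of gaps actually present.

```python
import math


def linear(x0, y0, x1, y1, x):
    """Estimate at x from the line through the two neighbouring known values."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def previous(x0, y0, x1, y1, x):
    """Carry the previous known value forward."""
    return y0


def nearest(x0, y0, x1, y1, x):
    """Use the nearer neighbour; ties go to the earlier value (arbitrary choice)."""
    return y0 if (x - x0) <= (x1 - x) else y1


def leave_one_out_rmse(xs, ys, method):
    """Hide each interior value in turn and estimate it from its two neighbours."""
    errors = []
    for i in range(1, len(xs) - 1):
        guess = method(xs[i - 1], ys[i - 1], xs[i + 1], ys[i + 1], xs[i])
        errors.append(guess - ys[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Invented, irregularly spaced series.
obs_x = [1, 2, 4, 5, 6, 9, 10]
obs_y = [2.0, 3.5, 3.0, 5.0, 7.5, 7.0, 9.0]
for name, method in [("linear", linear), ("previous", previous), ("nearest", nearest)]:
    print(name, round(leave_one_out_rmse(obs_x, obs_y, method), 3))
```

The root mean square errors give one summary of how the methods disagree on this series; which method does best depends on the dependence structure of the data at hand.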
26Leo Breiman (1928-2005)
- "The main thing to learn about statistics is what is sensible and honest and possible."
- "Doubt and suspicion, as well as technical knowledge, are indispensable tools in statistics."
- 1973. Statistics: With a View Towards Applications. Boston: Houghton Mifflin, pp.1, 18.
27- We turn from a project that is nearly done to one
that is very much in progress.
28Variograms
- Variograms (more properly semivariograms) are plots of half the (mean) squared difference between values versus separation, distance or lag.
- By a tempting abuse of terminology, we often use
the same name for the underlying relationship as
a function.
29First known use of term variogram
- Geoffrey H. Jowett (1922- ) in 1955.
- The comparison of means of sets of observations from sections of independent stochastic series. Journal of the Royal Statistical Society. Series B (Methodological) 17: 208-227.
30Spatial and time series
- Variograms are central to one approach to spatial statistics, in this context often known as geostatistics.
- Georges Matheron (1930-2000) is most often mentioned here.
- But variograms can be very useful for time series too.
31Time series too
- Variograms are prominent in these texts on time series and longitudinal data:
- Diggle, P.J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.
- Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
32User-written programs
- Programs (NJC) are available from SSC for:
- variograms in one dimension: variog (2005)
- variograms in two dimensions: variog2 (2005)
- A combined and extended program vgram is under
development.
33Generality of variograms
- So, variograms are, without undue strain, defined for time series and for spatial series, whether regular or irregular, as they just depend on separation being measured.
- Plotting the mean for each distinct separation is a common, but not compulsory, convention.
34A simple example: webuse air2
35Variograms
- vgram air, recast(connected) xla(0(12)72)
36Comparison at different lags
- We are plotting mean squared differences between values compared at lags 1, 2, 3, ...
- In this example, we have monthly data, so are comparing values 1, 2, 3, ... months apart.
- Many readers may be familiar with the same idea for calculating autocorrelation and cross-correlation.
- The variogram, like the raw data plot, hints at a structure of trend plus seasonality.
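
For an equally spaced series, the calculation at each lag can be sketched in a few lines of Python. This is an illustration of the idea, not the code in vgram; the function name is invented, missing values are assumed to be coded as None, and the test series is an artificial one with period 4 and no trend.

```python
def semivariogram(z, max_lag):
    """Return {lag: (gamma, number of pairs)} for an equally spaced series z.

    gamma is half the mean squared difference between values that many
    steps apart. Missing values (None) simply drop out of the pairs.
    """
    result = {}
    for h in range(1, max_lag + 1):
        squared = [
            (z[i] - z[i + h]) ** 2
            for i in range(len(z) - h)
            if z[i] is not None and z[i + h] is not None
        ]
        if squared:
            result[h] = (0.5 * sum(squared) / len(squared), len(squared))
    return result


# Artificial seasonal series with period 4 and no trend.
series = [0, 1, 0, -1] * 5
gamma = semivariogram(series, 8)
for lag in sorted(gamma):
    print(lag, gamma[lag])
```

For this artificial series the semivariogram falls back to zero at lags 4 and 8, the multiples of the period; a trend would add a steady rise on top of that pattern.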
37Variograms of residuals, not data
- Here, as elsewhere, it is a good idea to work with residuals, rather than the original data.
- Time series modellers could have a happy time arguing which model was best for the airline data, but we just use a Poisson regression on time and look at its residuals.
- On the versatility and virtuosity of Poisson regression, check out Gould, William. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/
38Sometimes, structure is this simple
39A little more formally
- The semivariogram $\gamma(h)$ for response $z$ is given by $2\gamma(h) = A(\{z(i) - z(i+h)\}^2)$, where $A()$ denotes averaging over pairs of values at lag $h$.
- As emphasised, using a mean is a convention. The fuller picture (literally!) is a plot of $\{z(i) - z(i+h)\}^2$ versus $h$.
- This is often known as a variogram cloud.
- I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.
40Where does the 2 come from?
- The units of the semivariogram are those of the response squared.
- Adding the variance to the graph as a reference line underlines the connection.
- A non-standard formula for the variance is, for any distinct $i, j$, $\tfrac{1}{2} E(z_i - z_j)^2$.
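
The formula can be checked in a few lines. The working below assumes, beyond what the slide states, that $z_i$ and $z_j$ have a common mean $\mu$ and a common variance $\sigma^2$.

```latex
\begin{aligned}
\tfrac{1}{2}\,E(z_i - z_j)^2
  &= \tfrac{1}{2}\,E\{(z_i - \mu) - (z_j - \mu)\}^2 \\
  &= \tfrac{1}{2}\,\{\sigma^2 + \sigma^2 - 2\,\mathrm{Cov}(z_i, z_j)\} \\
  &= \sigma^2 - \mathrm{Cov}(z_i, z_j).
\end{aligned}
```

So the expression equals the variance $\sigma^2$ exactly when $z_i$ and $z_j$ are uncorrelated. When values at lag $h$ have correlation $\rho(h)$, the same algebra gives $\gamma(h) = \sigma^2\{1 - \rho(h)\}$, which is why the semivariogram approaches the variance reference line as correlation dies away.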
41Back to vgram
- vgram (not yet public) is already quite general.
- We take possibilities one by one.
- With just one argument, the response, it checks for a tsset or xtset time variable and uses it to define separations if found. Note that panel data are supported for free.
- With just one argument otherwise, the order of the observations is taken to define position in time or space.
42- With two arguments, the second variable is taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Equal and unequal spacing can thus both be accommodated.
- With three arguments, the second and third variables are taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Distance is calculated from coordinates using Pythagoras' theorem.
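
The three-argument case can be sketched in Python as follows. This is an illustration only, not vgram's code: the function name is invented, and the binning rule (a pair at distance d goes into bin number floor(d / width)) is an assumption, since the slides do not say exactly how vgram assigns pairs to bins. The number of pairs in each bin is returned alongside the semivariogram value.

```python
import math


def binned_semivariogram(points, values, width):
    """Return {bin index: (gamma, number of pairs)}.

    points are (x, y) coordinates. A pair at distance d falls in bin
    int(d // width), i.e. the interval [bin * width, (bin + 1) * width).
    """
    sums = {}
    counts = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            distance = math.hypot(dx, dy)    # Pythagoras' theorem
            b = int(distance // width)
            sums[b] = sums.get(b, 0.0) + (values[i] - values[j]) ** 2
            counts[b] = counts.get(b, 0) + 1
    return {b: (0.5 * sums[b] / counts[b], counts[b]) for b in sorted(sums)}


# Invented data: four stations at the corners of a unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
readings = [1.0, 2.0, 3.0, 4.0]
print(binned_semivariogram(corners, readings, width=0.4))
```

With a bin width of 0.4, the four side pairs (distance 1) and the two diagonal pairs (distance about 1.41) land in different bins, and the pair counts show how little information the second bin rests on.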
43Why not just use autocorrelation?
- Variograms are defined for a wider class of processes. Autocorrelation functions require weak stationarity; variograms are defined for processes with stationary increments.
- Variograms are more flexible in the face of irregular spacing.
- The very wide use of autocorrelation reflects custom and familiarity as well as intrinsic merit.
44A further example
- We look at rainfalls for 8 May 1986 (a single
day) for 467 stations in Switzerland.
47How much information?
- Optionally the semivariogram results can be saved in vgram to new variables.
- Keeping track of the number of pairs used at each lag is important.
- Here we exploit the feature that spikeplot can show frequencies on a square root scale.
49To do list
- variogram clouds
- robust estimators
- more flexible binning
- spherical distances too
- direction as well as lag
- model fitting (valid functional forms)
- use for interpolation (and smoothing) (kriging, Gaussian process regression)
50Variogram virtues
- Defined for time and spatial series.
- Defined for regular and irregular series.
- Can help identify and check for structure, even if you have no interest in their most mentioned use, as a means towards the end of spatial interpolation.
51This paper
- This paper fills a much needed gap in the literature.
- See Jackson, A. 1997. Chinese acrobatics, an old-time brewery, and the "much needed gap": The life of Mathematical Reviews. Notices of the American Mathematical Society 44: 330-337.
52Acknowledgments
- Historical portraits: Wikipedia.
- MATLAB code for pchip: Moler, C. 2004. Numerical Computing with MATLAB. Philadelphia: SIAM. Chapter 3. (http://www.mathworks.com/moler/interp.pdf)
- The Swiss rainfall data can be found here: http://www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip