Title: Between and beyond: Irregular series, interpolation, variograms, and smoothing
1 Between and beyond: Irregular series, interpolation, variograms, and smoothing
2 Mind the gap!
- Repeated reminder, London Underground.
3 Executive summary
- A new program mipolate for several kinds of interpolation is now available.
- It can be downloaded from SSC (3 September 2015).
- Variograms are useful for examining dependence structure in time and spatial series.
- Work is in progress on a new program vgram for variograms.
4 Irregular series
- Irregular series are series in which non-missing values are not all equally spaced.
- Special case: Values would be equally spaced (every day, every year, ...), but there are some gaps with missing values, for human or inhuman reasons.
- General case: Values are just at known times or points with no necessary rules about spacing.
- Irregular series often seem to invite interpolation.
5 Luke Howard (1772-1864)
- Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth).
- Here we use as sandbox some of his temperature data from Plaistow, near London, in 1807.
6
- Howard, Luke. 1818. The Climate of London, Deduced from Meteorological Observations, Made at Different Places in the Neighbourhood of the Metropolis. Volume I. London: W. Phillips, etc.
8 Series of events
- N.B. We are not talking here about series of events, or realisations of point processes.
- In such series occurrences are typically irregularly spaced, but the gaps are inherent in the process, not a failing of our data.
- Examples range from eruptions to elections.
10 Interpolation
- Interpolation is the art of reading between the lines.
- Historically, it is a deterministic process, often a matter of going beyond printed tables of functions (logarithmic, trigonometric, and so forth).
- In principle, we should worry about the statistical properties of interpolation. It is local prediction.
- In practice, imputation now appears better known among statistical researchers.
11 Interpolation in (official) Stata
- The ipolate command for linear interpolation (and extrapolation) was added in Stata 3.1 (1993).
- The Mata functions spline3() and spline3eval() were added in Stata 9.0 (2005).
12 User-written programs on SSC
- Programs (NJC) have been available from SSC for
- cubic interpolation: cipolate (2002)
- cubic spline interpolation: csipolate (2009)
- piecewise cubic Hermite interpolation: pchipolate (2012)
- nearest neighbour interpolation: nnipolate (2012)
- A combined and extended program mipolate is now available too.
13 Two dimensions too
- Note also bipolate (Joseph Canner, SSC) (2014).
- By default it uses quintic polynomials.
- Other available methods include thin plate splines and Shepard's method.
- Note also twoway contour.
14 mipolate generalises ipolate
- Interpolation is of yvar with respect to specified xvar.
- Prior tsset or xtset is not assumed.
- Regular spacing is not assumed.
- Multiple values of yvar at the same xvar are averaged first.
- Groupwise operations using by are supported.
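The behaviour listed above can be illustrated outside Stata. What follows is a minimal Python sketch, not the mipolate code: the function names `average_duplicates` and `linear_interpolate` are hypothetical, missing values are represented by `None`, and the sketch assumes that targets outside the range of known values should stay missing.

```python
import bisect
from collections import defaultdict

def average_duplicates(x, y):
    """Average the non-missing y values that share an x; return sorted lists."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        if yi is not None:
            groups[xi].append(yi)
    xs = sorted(groups)
    return xs, [sum(groups[xi]) / len(groups[xi]) for xi in xs]

def linear_interpolate(x, y, targets):
    """Linear interpolation at each target; None outside the known range."""
    xs, ys = average_duplicates(x, y)
    out = []
    for t in targets:
        if not xs or t < xs[0] or t > xs[-1]:
            out.append(None)          # assumption: no extrapolation
            continue
        i = bisect.bisect_right(xs, t) - 1   # last known x with xs[i] <= t
        if xs[i] == t:
            out.append(ys[i])
            continue
        w = (t - xs[i]) / (xs[i + 1] - xs[i])
        out.append(ys[i] + w * (ys[i + 1] - ys[i]))
    return out

# Irregular spacing, with two values recorded at x = 2
print(linear_interpolate([1, 2, 2, 5], [10.0, 20.0, 30.0, 40.0], [0, 2, 3, 6]))
```

Nothing here needs the x values to be equally spaced or declared as a time variable in advance, which is the point of the slide.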
15 Linear and cubic
- Linear interpolation just uses previous and following known values (only).
- This is done by ipolate, and also mipolate by default.
- Cubic interpolation is another classic method, using two previous and two following known values (only).
- This is done by mipolate, cubic.
- The default of mipolate with either method (as with ipolate) is not to extrapolate.
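As an illustration of cubic interpolation from two previous and two following known values, here is a minimal Python sketch using the Lagrange form of the cubic through four points. The function names are hypothetical, and returning `None` when fewer than two known values lie on either side is a simplifying assumption of this sketch, not a statement about what mipolate does near the ends of a series.

```python
import bisect

def lagrange_cubic(xk, yk, t):
    """Evaluate at t the unique cubic through the four points (xk, yk)."""
    total = 0.0
    for i in range(4):
        term = yk[i]
        for j in range(4):
            if j != i:
                term *= (t - xk[j]) / (xk[i] - xk[j])
        total += term
    return total

def cubic_interpolate(xs, ys, t):
    """xs sorted and distinct, ys known. Uses two known points on each side of t."""
    i = bisect.bisect_right(xs, t) - 1       # last index with xs[i] <= t
    if i >= 0 and xs[i] == t:
        return ys[i]                          # value already known
    if i < 1 or i + 2 > len(xs) - 1:
        return None                           # assumption: not enough neighbours
    return lagrange_cubic(xs[i - 1:i + 3], ys[i - 1:i + 3], t)

xs = [0.0, 1.0, 2.0, 4.0, 7.0]
ys = [v ** 3 for v in xs]                     # data lying exactly on y = x^3
print(cubic_interpolate(xs, ys, 3.0))         # cubic through x = 1, 2, 4, 7
print((ys[2] + ys[3]) / 2)                    # linear interpolation at 3, for contrast
```

With data that really are cubic, the cubic method recovers the curve (up to rounding) while linear interpolation does not, a small instance of methods disagreeing.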
16 Un peu d'histoire (a little history)
- Cubic interpolation, as a particular kind of polynomial interpolation, is often attributed to Joseph-Louis Lagrange (1736-1813) but was proposed earlier by Edward Waring (1735?-1798).
- In fact there is a long history of work with contributions by many outstanding mathematicians, not least Isaac Newton (1643-1727) and Leonhard Euler (1707-1783).
17 Lagrange Waring
18 Cubic splines
- As before, we are using cubic polynomials locally, but they are constrained to join smoothly.
- The syntax is mipolate, spline.
- This is merely a wrapper for the official Mata functions.
- As before, the default of mipolate with this option is not to extrapolate.
19 Linear extrapolation
- As with ipolate, linear extrapolation is available as an option in mipolate to fill in missings at the end of series.
- What your teachers told you is true: extrapolation is dangerous.
- Don't point that straight line! It can go off anywhere.
- (Allude here to Mark Twain on the Mississippi.)
20 Piecewise cubic Hermite interpolation
- This method also uses piecewise cubics joining smoothly. The syntax is mipolate, pchip.
- The interpolant is shape-preserving and cannot overshoot locally.
- Sections in which yvar is increasing, decreasing or constant with xvar remain so after interpolation.
- Hence local maxima and minima also remain so.
- This interpolation method also extrapolates.
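To make "shape-preserving" concrete, here is a minimal Python sketch of one common way to build such an interpolant: cubic Hermite pieces whose interior slopes are weighted harmonic means of the neighbouring secant slopes, set to zero wherever the data change direction or are flat. This is an illustration under stated assumptions, not the mipolate or MATLAB code. The function names are hypothetical, the end slopes are simply the end secant slopes (a simplification), and the sketch evaluates only inside the range of the data.

```python
import bisect

def pchip_slopes(xs, ys):
    """Secant slopes and shape-preserving derivative estimates at each knot."""
    n = len(xs)
    h = [xs[k + 1] - xs[k] for k in range(n - 1)]
    delta = [(ys[k + 1] - ys[k]) / h[k] for k in range(n - 1)]
    d = [0.0] * n
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:       # same direction on both sides
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # otherwise d[k] stays 0: a local extremum or a flat section
    d[0] = delta[0]                            # simplification: end secant slopes
    d[-1] = delta[-1]
    return h, delta, d

def pchip_eval(xs, ys, t):
    """Evaluate the piecewise cubic Hermite interpolant at t (None outside range)."""
    if t < xs[0] or t > xs[-1]:
        return None
    h, delta, d = pchip_slopes(xs, ys)
    k = min(bisect.bisect_right(xs, t) - 1, len(xs) - 2)
    s = t - xs[k]
    c = (3 * delta[k] - 2 * d[k] - d[k + 1]) / h[k]
    b = (d[k] - 2 * delta[k] + d[k + 1]) / h[k] ** 2
    return ys[k] + s * (d[k] + s * (c + s * b))

# A step-like series: flat, rising, flat
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0, 1.0]
print([round(pchip_eval(xs, ys, k / 4), 4) for k in range(17)])
```

On the step-like series, the interpolated values stay between 0 and 1 and never decrease, whereas an unconstrained cubic through the same points would typically dip below 0 or rise above 1 near the step.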
21 Charles Hermite (1822-1901)
22 Inverse distance weighting
- Interpolation can use a weighted average of known values, the weights being inverse powers of distance $d$ from unknown value.
- If I don't know the value at 42, 41 and 43 are distance 1 away, 40 and 44 distance 2, and so on.
- For weights $d^{-p}$, limiting case $p = 0$ makes all weights equal, and so the interpolant is the overall mean, while $p$ very large means that only the very nearest values have effect.
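The weighting scheme and its two limiting cases can be checked with a few lines of Python. This is a minimal sketch: the function name `idw` and the default power `p=2.0` are assumptions made for illustration, not mipolate's defaults.

```python
def idw(known_x, known_y, t, p=2.0):
    """Inverse distance weighted average of known values, with weights d**(-p)."""
    num = 0.0
    den = 0.0
    for xi, yi in zip(known_x, known_y):
        d = abs(t - xi)
        if d == 0:
            return yi                 # target coincides with a known point
        w = d ** (-p)
        num += w * yi
        den += w
    return num / den

# Value at 42 unknown; 41 and 43 are distance 1 away, 40 and 44 distance 2
x = [40, 41, 43, 44]
y = [10.0, 20.0, 40.0, 90.0]
print(idw(x, y, 42, p=0))    # all weights equal: the overall mean
print(idw(x, y, 42, p=1))
print(idw(x, y, 42, p=50))   # dominated by the two nearest values
```

With `p=0` the result is the mean of all four values; with a very large `p` it is essentially the mean of the values at 41 and 43 alone.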
23 Other methods
- mipolate adds forward, backward, nearest neighbour and groupwise interpolation.
- Use the previous, next or the nearest known value. Or extend the single non-missing value in a group to all others.
- Using the last known value is often dubious statistically, but it is a very common request in data management.
- The other methods are provided partly for completeness.
- There is small print (option choices) about how to break ties when two values are equally near.
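These four methods are simple enough to sketch in a few lines of Python. The function names and the `ties` argument are hypothetical, chosen only to show the tie-breaking issue; missing values are `None`, and x values are assumed distinct.

```python
def forward_fill(y):
    """Carry the last known value forward."""
    out, last = [], None
    for v in y:
        if v is not None:
            last = v
        out.append(last)
    return out

def backward_fill(y):
    """Carry the next known value backward."""
    return forward_fill(y[::-1])[::-1]

def nearest_fill(x, y, ties="previous"):
    """Use the nearest known value; ties go to the previous or the next one."""
    known = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    out = []
    for xi, yi in zip(x, y):
        if yi is not None or not known:
            out.append(yi)
            continue
        best = min(abs(xi - xk) for xk, _ in known)
        cands = [(xk, yk) for xk, yk in known if abs(xi - xk) == best]
        pick = min(cands) if ties == "previous" else max(cands)
        out.append(pick[1])
    return out

def groupwise_fill(y):
    """Extend the single distinct non-missing value in a group to all members."""
    distinct = {v for v in y if v is not None}
    return [next(iter(distinct))] * len(y) if len(distinct) == 1 else list(y)

y = [None, 1.0, None, None, 4.0, None]
print(forward_fill(y))
print(backward_fill(y))
print(nearest_fill(list(range(6)), y))
print(nearest_fill([0, 1, 2], [5.0, None, 7.0], ties="next"))
```

The last call shows why the small print matters: the value at 1 is equally near to both neighbours, so the answer depends entirely on the tie rule.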
24 mipolate summary
Nine methods:
Method            Linear extrapolation?
linear            yes
cubic             yes
(cubic) spline    yes
pchip             no
idw               no
forward           no
backward          no
nearest           no
groupwise         no
27 Simple messages
- There are many interpolation methods to choose from.
- They will often disagree, even for simple-looking instances.
- Disagreement gives a handle on uncertainty.
- In a real problem, simulate missings and test how well known values are estimated.
- What makes most sense in your problem will reflect its dependence structure.
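One way to act on the advice to simulate missings is a leave-one-out check: hide each known interior value in turn, interpolate it from the rest, and summarise the errors. The Python sketch below does this for two methods; the function names are hypothetical and the data are made up for illustration.

```python
import math

def linear_at(xs, ys, t):
    """Linear interpolation at t from sorted known points (t inside the range)."""
    for k in range(len(xs) - 1):
        if xs[k] <= t <= xs[k + 1]:
            w = (t - xs[k]) / (xs[k + 1] - xs[k])
            return ys[k] + w * (ys[k + 1] - ys[k])
    raise ValueError("t lies outside the range of known values")

def nearest_at(xs, ys, t):
    """Nearest known value; ties go to the earlier point."""
    k = min(range(len(xs)), key=lambda i: abs(t - xs[i]))
    return ys[k]

def leave_one_out_rmse(xs, ys, method):
    """Hide each interior value in turn, estimate it, and return the RMSE."""
    errors = []
    for i in range(1, len(xs) - 1):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        errors.append(method(rest_x, rest_y, xs[i]) - ys[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up irregular series lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 3.0, 4.0, 7.0]
ys = [2 * v + 1 for v in xs]
print(leave_one_out_rmse(xs, ys, linear_at))
print(leave_one_out_rmse(xs, ys, nearest_at))
```

Only interior points are hidden, so that no method is asked to extrapolate. On real data neither error will be zero, and the comparison between methods is what carries the information.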
28
- We turn from a project that is done to one that is very much in progress.
29 Variograms
- Variograms (more properly semivariograms) are plots of half the (mean) squared difference between values versus separation, distance or lag.
- By a tempting abuse of terminology, we often use the same name for the underlying relationship as a function.
30 First known use of term variogram
- Geoffrey H. Jowett (1922- ) in 1955:
- The comparison of means of sets of observations from sections of independent stochastic series. Journal of the Royal Statistical Society. Series B (Methodological) 17: 208-227.
31 Spatial and time series
- Variograms are central to one approach to spatial statistics, in this context often known as geostatistics.
- Georges Matheron (1930-2000) is most often mentioned here.
- But variograms can be very useful for time series too.
32 Time series too
- Variograms are prominent in these texts on time series and longitudinal data:
- Diggle, P.J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.
- Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
33 User-written programs
- Programs (NJC) are available from SSC for
- variograms in one dimension: variog (2005)
- variograms in two dimensions: variog2 (2005)
- A combined and extended program vgram is under development.
34 Generality of variograms
- So, variograms are (without undue strain) defined for time series and for spatial series, whether regular or irregular, as they just depend on separation being measured.
- Plotting the mean for each distinct separation is a common, but not compulsory, convention.
35 A simple example: webuse air2
36 Variograms
- vgram air, recast(connected) xla(0(12)72)
37 Comparison at different lags
- We are plotting mean squared differences between values compared at lags 1, 2, 3, ...
- In this example, we have monthly data, so are comparing values 1, 2, 3, ... months apart.
- Many readers may be familiar with the same idea for calculating autocorrelation and cross-correlation.
- The variogram, like the raw data plot, hints at a structure of trend plus seasonality.
38 Variograms of residuals, not data
- Here, as elsewhere, it is a good idea to work with residuals, rather than the original data.
- Time series modellers could have a happy time arguing which model was best for the airline data, but we just use a Poisson regression on time and look at its residuals.
- On the versatility and virtuosity of Poisson regression, check out Gould, William.
- http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/
39 Sometimes, structure is this simple
40 A little more formally
- The semivariogram $\gamma(h)$ for response $z$ is given by
- $2 \gamma(h) = A\left([z(i) - z(i + h)]^2\right)$
- where $A(\cdot)$ denotes averaging over pairs of values at lag $h$.
- As emphasised, using a mean is a convention. The fuller picture (literally!) is a plot of $[z(i) - z(i + h)]^2$ versus $h$.
- This is often known as a variogram cloud.
- I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.
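For a regularly spaced series the definition translates directly into code. The Python sketch below is an illustration, not the vgram code: the function names are hypothetical and missing values are `None`. It returns, for each lag, half the mean squared difference together with the number of pairs used, and it also produces the points of a variogram cloud.

```python
def semivariogram(z, max_lag):
    """For each lag h: (half the mean squared difference, number of pairs)."""
    result = {}
    for h in range(1, max_lag + 1):
        sq = [(z[i] - z[i + h]) ** 2
              for i in range(len(z) - h)
              if z[i] is not None and z[i + h] is not None]
        if sq:                                  # skip lags with no usable pairs
            result[h] = (0.5 * sum(sq) / len(sq), len(sq))
    return result

def variogram_cloud(z, max_lag):
    """All (lag, squared difference) points, before any averaging."""
    return [(h, (z[i] - z[i + h]) ** 2)
            for h in range(1, max_lag + 1)
            for i in range(len(z) - h)
            if z[i] is not None and z[i + h] is not None]

z = [1.0, 3.0, 2.0, 5.0]
for lag, (gamma, pairs) in semivariogram(z, 3).items():
    print(lag, gamma, pairs)
```

Note how quickly the number of pairs falls as the lag grows: the estimate at the largest lag here rests on a single pair.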
41 Where does the 2 come from?
- The units of the semivariogram are those of the response squared.
- Adding the variance to the graph as a reference line underlines the connection.
- A non-standard formula for the variance is, for any $i$, $j$,
- $\frac{1}{2} E\left[(z_i - z_j)^2\right]$.
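The factor of 2 can be seen in one line. As a sketch, assume $z_i$ and $z_j$ are distinct, uncorrelated values with common mean $\mu$ and common variance $\sigma^2$ (these assumptions are added here to make the formula precise):

```latex
\begin{aligned}
E\left[(z_i - z_j)^2\right]
  &= E\left[\{(z_i - \mu) - (z_j - \mu)\}^2\right] \\
  &= E\left[(z_i - \mu)^2\right] + E\left[(z_j - \mu)^2\right]
     - 2\,E\left[(z_i - \mu)(z_j - \mu)\right] \\
  &= \sigma^2 + \sigma^2 - 0 = 2\sigma^2 ,
\end{aligned}
\qquad\text{so}\qquad
\tfrac{1}{2}\,E\left[(z_i - z_j)^2\right] = \sigma^2 .
```

Halving the mean squared difference therefore puts the semivariogram on the same scale as the variance, which is why the variance is a natural reference line.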
42 Back to vgram
- vgram (not yet public) is already quite general.
- We take possibilities one by one.
- With just one argument, the response, it checks for a tsset or xtset time variable and uses it to define separations if found. Note that panel data are supported for free.
- With just one argument otherwise, the order of the observations is taken to define position in time or space.
43
- With two arguments, the second variable is taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Equal and unequal spacing can thus both be accommodated.
- With three arguments, the second and third variables are taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Distance is calculated from coordinates using Pythagoras' theorem.
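The three-argument case can be sketched in Python as follows. This is an illustration of the idea, not the vgram code: the function name is hypothetical, and the sketch assumes bins of the form [k × width, (k + 1) × width), reported at their midpoints. How vgram defines and labels its bins is not stated here.

```python
import math
from collections import defaultdict

def binned_semivariogram(x, y, z, width):
    """Bin all pairs by Euclidean distance; return {bin: (midpoint, gamma, pairs)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.hypot(x[i] - x[j], y[i] - y[j])   # Pythagoras
            b = int(dist // width)                          # assumed bin rule
            sums[b] += (z[i] - z[j]) ** 2
            counts[b] += 1
    return {b: ((b + 0.5) * width, 0.5 * sums[b] / counts[b], counts[b])
            for b in sorted(counts)}

# Three made-up stations on a line, at distances 5, 5 and 10 from each other
x = [0.0, 3.0, 6.0]
y = [0.0, 4.0, 8.0]
z = [1.0, 2.0, 4.0]
for b, (mid, gamma, pairs) in binned_semivariogram(x, y, z, width=4).items():
    print(b, mid, gamma, pairs)
```

Every pair of points is compared, so the work grows with the square of the number of observations. The pair counts are returned because bins with few pairs deserve little trust.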
44 Why not just use autocorrelation?
- Variograms are defined for a wider class of processes. Autocorrelation functions require weak stationarity; variograms are defined for processes with stationary increments.
- Variograms are more flexible in the face of irregular spacing.
- The very wide use of autocorrelation reflects custom and familiarity as well as intrinsic merit.
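The link between the two tools, and the sense in which the variogram is more general, can be sketched with two standard results. The symbols $C(h)$ for the autocovariance, $\rho(h)$ for the autocorrelation and $\sigma_\varepsilon^2$ for the increment variance are notation introduced here for illustration.

```latex
% Weakly stationary process with variance \sigma^2 = C(0):
2\gamma(h) = E\left[\{z(i) - z(i+h)\}^2\right]
           = 2\sigma^2 - 2C(h)
\quad\Longrightarrow\quad
\gamma(h) = \sigma^2\,\{1 - \rho(h)\} .

% Random walk z(i+1) = z(i) + \varepsilon_{i+1}, with uncorrelated
% zero-mean increments of variance \sigma_\varepsilon^2:
\gamma(h) = \tfrac{1}{2}\,E\left[\Bigl(\sum_{k=1}^{h} \varepsilon_{i+k}\Bigr)^{2}\right]
          = \tfrac{1}{2}\,h\,\sigma_\varepsilon^2 .
```

In the first case the variogram carries the same information as the autocorrelation function. In the second, the process has no constant variance, so an autocorrelation function depending only on lag does not exist, yet the variogram is still well defined and grows linearly with lag.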
45 A further example
- We look at rainfalls for 8 May 1986 (a single day) for 467 stations in Switzerland.
48 How much information?
- Optionally the semivariogram results can be saved in vgram to new variables.
- Keeping track of the number of pairs used at each lag is important.
- Here we exploit the feature that spikeplot can show frequencies on a square root scale.
50 To do list
- variogram clouds
- robust estimators
- more flexible binning
- spherical distances too
- direction as well as lag
- model fitting (valid functional forms)
- use for interpolation (and smoothing) (kriging, Gaussian process regression)
51 Variogram virtues
- Defined for time and spatial series.
- Defined for regular and irregular series.
- Can help identify and check for structure, even if you have no interest in their most mentioned use, as a means towards the end of spatial interpolation.
52 This paper
- This paper fills a much needed gap in the literature.
- See Jackson, A. 1997. Chinese acrobatics, an old-time brewery, and the "much needed gap": The life of Mathematical Reviews. Notices of the American Mathematical Society 44: 330-337.
53 Acknowledgments
- Historical portraits: Wikipedia.
- MATLAB code for pchip: Moler, C. 2004. Numerical Computing with MATLAB. Philadelphia: SIAM. Chapter 3. (http://www.mathworks.com/moler/interp.pdf)
- The Swiss rainfall data can be found here: http://www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip
54 Leo Breiman (1928-2005)
- The main thing to learn about statistics is what is sensible and honest and possible.
- Doubt and suspicion, as well as technical knowledge, are indispensable tools in statistics.
- 1973. Statistics: With a view towards applications. Boston: Houghton Mifflin, pp.1, 18.