Between and beyond: Irregular series, interpolation, variograms, and smoothing

Provided by: DavidC393

Learn more at: http://fmwww.bc.edu

1
Between and beyond: Irregular series,
interpolation, variograms, and smoothing

Nicholas J. Cox

2
Mind the gap!

Repeated reminder, London Underground.

3
Executive summary

A new program mipolate for several kinds of
interpolation is now available.
It can be downloaded from SSC (3 September 2015).
Variograms are useful for examining dependence
structure in time and spatial series.
Work is in progress on a new program vgram for
variograms.

4
Irregular series

Irregular series are series in which non-missing
values are not all equally spaced.
Special case: Values would be equally spaced
(every day, every year, ...), but there are some
gaps with missing values, for human or inhuman
reasons.
General case: Values are just at known times or
points with no necessary rules about spacing.
Irregular series often seem to invite
interpolation.

5
Luke Howard (1772–1864)

Best remembered for his nomenclature for clouds
(cumulus, stratus, cirrus and so forth).
Here we use as a sandbox some of his temperature
data from Plaistow, near London, in 1807.

Howard, Luke. 1818.
The Climate of London, Deduced from
Meteorological Observations, Made at Different
Places in the Neighbourhood of the Metropolis.
Volume I.
London: W. Phillips, etc.

8
Series of events

N.B. We are not talking here about series of
events,
or realisations of point processes.
In such series occurrences are typically
irregularly spaced, but the gaps are inherent in
the process,
not a failing of our data.
Examples range from eruptions to elections.

10
Interpolation

Interpolation is the art of reading between the
lines.
Historically, it is a deterministic process,
often a matter of going beyond printed tables of
functions
(logarithmic, trigonometric, and so forth).
In principle, we should worry about the
statistical properties of interpolation. It is
local prediction.
In practice, imputation now appears better known
among statistical researchers.

11
Interpolation in (official) Stata

The ipolate command for linear interpolation
(and extrapolation) was added in Stata 3.1
(1993).
The Mata functions spline3() and spline3eval()
were added in Stata 9.0 (2005).

12
User-written programs on SSC

Programs (NJC) have been available from SSC for
cubic interpolation: cipolate (2002)
cubic spline interpolation: csipolate (2009)
piecewise cubic Hermite interpolation:
pchipolate (2012)
nearest neighbour interpolation:
nnipolate (2012)
A combined and extended program mipolate is now
available too.

13
Two dimensions too

Note also bipolate (Joseph Canner, SSC) (2014).
By default it uses quintic polynomials.
Other available methods include thin plate
splines
and Shepard's method.
Note also twoway contour.

14
mipolate generalises ipolate

Interpolation is of yvar with respect to
specified xvar.
Prior tsset or xtset is not assumed.
Regular spacing is not assumed.
Multiple values of yvar at the same xvar are
averaged first.
Groupwise operations using by are supported.

15
Linear and cubic

Linear interpolation just uses previous and
following known values (only).
This is done by ipolate, and also mipolate by
default.
Cubic interpolation is another classic method,
using two previous and two following known values
(only).
This is done by mipolate, cubic.
The default of mipolate with either method
(as with ipolate) is not to extrapolate.
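
The two classic methods can be sketched in a few lines. This is an illustration in Python, not the code of ipolate or mipolate: the function names are hypothetical, the known x values are assumed sorted and distinct, and returning a missing value (NaN) when fewer than two known points lie on either side is an assumption about how a cubic method might behave near the ends.

```python
import math

def linear_interp(xs, ys, x):
    """Linear interpolation from the known values either side of x.

    Returns NaN outside [xs[0], xs[-1]], i.e. no extrapolation.
    """
    if x < xs[0] or x > xs[-1]:
        return math.nan
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return ys[0]  # only reached when there is a single known point

def cubic_interp(xs, ys, x):
    """Cubic (Lagrange form) interpolation through the two known points
    before x and the two known points after x.

    Assumption: NaN is returned when four such points do not exist.
    """
    if x in xs:
        return ys[xs.index(x)]
    k = sum(1 for v in xs if v < x)  # number of known points before x
    if k < 2 or len(xs) - k < 2:
        return math.nan
    px, py = xs[k - 2:k + 2], ys[k - 2:k + 2]
    total = 0.0
    for j in range(4):
        term = py[j]
        for m in range(4):
            if m != j:
                term *= (x - px[m]) / (px[j] - px[m])
        total += term
    return total

# Made-up data lying on y = x**3, so the cubic method is exact here.
xs = [0, 1, 2, 3]
ys = [0, 1, 8, 27]
print(linear_interp(xs, ys, 1.5), cubic_interp(xs, ys, 1.5))
```

On these made-up values the linear result sits on the chord between (1, 1) and (2, 8), while the cubic result recovers the underlying curve, a small instance of methods disagreeing.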

16
Un peu d'histoire (a little history)

Cubic interpolation, as a particular kind of
polynomial interpolation, is often attributed to
Joseph-Louis Lagrange (1736–1813) but was
proposed earlier by Edward Waring (1735?–1798).
In fact there is a long history of work with
contributions by many outstanding mathematicians,
not least Isaac Newton (1643–1727) and Leonhard
Euler (1707–1783).

17
Lagrange Waring
18
Cubic splines

As before, we are using cubic polynomials
locally, but they are constrained to join
smoothly.
The syntax is mipolate, spline.
This is merely a wrapper for the official Mata
functions.
As before, the default of mipolate with this
option is not to extrapolate.

19
Linear extrapolation

As with ipolate, linear extrapolation is available
as an option in mipolate to fill in missings at
the ends of series.
What your teachers told you is true:
extrapolation is dangerous.
Don't point that straight line! It can go off
anywhere.
(Allude here to Mark Twain on the Mississippi.)

20
Piecewise cubic Hermite interpolation

This method also uses piecewise cubics joining
smoothly. The syntax is mipolate, pchip.
The interpolant is shape-preserving and cannot
overshoot locally.
Sections in which yvar is increasing, decreasing
or constant with xvar remain so after
interpolation.
Hence local maxima and minima also remain so.
This interpolation method also extrapolates.

21
Charles Hermite (1822–1901)
22
Inverse distance weighting

Interpolation can use a weighted average of known
values, the weights being inverse powers of
distance $d$ from the point with the unknown value.
If I don't know the value at 42, 41 and 43 are
distance 1 away, 40 and 44 distance 2, and so on.
For weights $d^{-p}$, the limiting case $p = 0$ makes
all weights equal, and so the interpolant is the
overall mean, while $p$ very large means that only
the very nearest values have effect.
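
A minimal sketch of inverse distance weighting in Python, using the slide's example of an unknown value at 42. The function name and the y values are made up for illustration; this is not the mipolate code.

```python
def idw(xs, ys, x, power=2.0):
    """Weighted average of known ys with weights distance ** (-power)."""
    weights = []
    for xk in xs:
        d = abs(x - xk)
        if d == 0:
            return ys[xs.index(xk)]  # exact hit on a known point
        weights.append(d ** -power)
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Known values at 40, 41, 43, 44; the value at 42 is missing.
xs = [40, 41, 43, 44]
ys = [10.0, 12.0, 16.0, 20.0]  # hypothetical values

for p in (0, 2, 50):
    print(p, idw(xs, ys, 42, power=p))
```

With power 0 the result is the overall mean of the four known values. As the power grows, the result approaches the mean of the two nearest values, at 41 and 43.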

23
Other methods

mipolate adds forward, backward, nearest
neighbour and groupwise interpolation:
use the previous, the next or the nearest known
value, or extend the single non-missing value in
a group to all others.
Using the last known value is often dubious
statistically,
but it is a very common request in data
management.
The other methods are provided partly for
completeness.
There is small print (option choices) about how
to break ties when two values are equally near.
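
The forward, backward and nearest rules, and the need for a tie-breaking choice, can be illustrated as follows. This is a Python sketch under stated assumptions: the function name, the `ties` argument and its values ("mean", "before", "after") are hypothetical and are not claimed to match the option names in mipolate, and xs is assumed sorted.

```python
def fill_gaps(xs, ys, method="forward", ties="mean"):
    """Fill None entries of ys from known neighbours.

    method: "forward" (previous known value), "backward" (next known
    value); anything else is treated as "nearest".
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        before = [k for k in known if k[0] < x]
        after = [k for k in known if k[0] > x]
        prev = before[-1] if before else None
        nxt = after[0] if after else None
        if method == "forward":
            value = prev[1] if prev else None
        elif method == "backward":
            value = nxt[1] if nxt else None
        elif prev is None or nxt is None:
            # nearest, with a known value on one side only (or none)
            only = prev or nxt
            value = only[1] if only else None
        else:
            gap_prev, gap_next = x - prev[0], nxt[0] - x
            if gap_prev < gap_next:
                value = prev[1]
            elif gap_next < gap_prev:
                value = nxt[1]
            elif ties == "before":
                value = prev[1]
            elif ties == "after":
                value = nxt[1]
            else:  # tie broken by averaging the two candidates
                value = (prev[1] + nxt[1]) / 2
        filled.append(value)
    return filled

xs = [1, 2, 3, 4, 5]
ys = [10, None, None, None, 20]  # made-up series with a gap
print(fill_gaps(xs, ys, "forward"))
print(fill_gaps(xs, ys, "nearest", ties="mean"))
```

The point at 3 is equally near the known values at 1 and 5, so its filled value depends entirely on the tie rule.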

24
mipolate summary

Nine methods      Linear extrapolation?
linear            yes
cubic             yes
(cubic) spline    yes
pchip             no
idw               no
forward           no
backward          no
nearest           no
groupwise         no

27
Simple messages

There are many interpolation methods to choose
from.
They will often disagree, even for simple-looking
instances.
Disagreement gives a handle on uncertainty.
In a real problem, simulate missings and test how
well known values are estimated.
What makes most sense in your problem will
reflect its dependence structure.
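
One way to "simulate missings" is to hide each known value in turn, predict it from the rest, and summarise the errors. The sketch below is in Python with made-up data and hypothetical function names; it compares linear interpolation with carrying the previous value forward, and is only one of many possible testing schemes.

```python
import math

def linear(xs, ys, x):
    """Linear interpolation between the known neighbours of x."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return math.nan

def previous(xs, ys, x):
    """Carry the last known value before x forward."""
    return [y for xk, y in zip(xs, ys) if xk < x][-1]

def leave_one_out_rmse(xs, ys, predict):
    """Hide each interior value in turn, predict it from the others,
    and return the root mean square error of those predictions."""
    errors = []
    for i in range(1, len(xs) - 1):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        errors.append(predict(rest_x, rest_y, xs[i]) - ys[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 4, 9, 16, 25]  # made-up values on a smooth curve

print(leave_one_out_rmse(xs, ys, linear))
print(leave_one_out_rmse(xs, ys, previous))
```

On this smooth made-up series linear interpolation has the smaller error. On a series with abrupt steps the ranking could reverse, which is the sense in which the best method reflects the dependence structure.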

We turn from a project that is done to one that
is very much in progress.

29
Variograms

Variograms (more properly semivariograms) are
plots of
(mean) half squared difference between values
versus
separation, distance or lag.
By a tempting abuse of terminology, we often use
the same name for the underlying relationship as
a function.

30
First known use of term variogram

Geoffrey H. Jowett (1922– ) in 1955:
The comparison of means of sets of observations
from sections of independent stochastic series.
Journal of the Royal Statistical Society. Series
B (Methodological) 17: 208–227.

31
Spatial and time series

Variograms are central to one approach to spatial
statistics, in this context often known as
geostatistics.
Georges Matheron (1930–2000) is most often
mentioned here.
But variograms can be very useful for time series
too.

32
Time series too

Variograms are prominent in these texts on time
series and longitudinal data
Diggle, P.J. 1990. Time Series: A Biostatistical
Introduction. Oxford: Oxford University Press.
Diggle, P.J., Heagerty, P.J., Liang, K-Y. and
Zeger, S.L. 2002. Analysis of Longitudinal Data.
Oxford: Oxford University Press.

33
User-written programs

Programs (NJC) are available from SSC for
variograms in one dimension: variog (2005)
variograms in two dimensions: variog2 (2005)
A combined and extended program vgram is under
development.

34
Generality of variograms

So, variograms are, without undue strain,
defined
for time series and for spatial series,
whether regular or irregular,
as they just depend on separation being measured.
Plotting the mean for each distinct separation is
a common, but not compulsory, convention.

35
A simple example: webuse air2
36
Variograms

vgram air, recast(connected) xla(0(12)72)

vgram air

37
Comparison at different lags

We are plotting mean squared differences between
values compared at lags 1, 2, 3, ...
In this example, we have monthly data, so are
comparing values 1, 2, 3, ... months apart.
Many readers may be familiar with the same idea
for calculating autocorrelation and
cross-correlation.
The variogram, like the raw data plot, hints at
a structure of trend plus seasonality.

38
Variograms of residuals, not data

Here, as elsewhere, it is a good idea to work
with residuals, rather than the original data.
Time series modellers could have a happy time
arguing which model was best for the airline
data, but we just use a Poisson regression on
time and look at its residuals.
On the versatility and virtuosity of Poisson
regression, check out Gould, William.
http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

39
Sometimes, structure is this simple

Poisson regression

Residuals from Poisson

40
A little more formally

The semivariogram $\gamma(h)$ for response $z$ is
given by
$2\gamma(h) = A[\{z(i) - z(i+h)\}^2]$
where $A$ denotes averaging over pairs of values
at lag $h$.
As emphasised, using a mean is a convention. The
fuller picture (literally!) is a plot of
$\{z(i) - z(i+h)\}^2$ versus $h$.
This is often known as a variogram cloud.
I borrow the notation A() from Whittle, P. 1970.
Probability. Harmondsworth: Penguin.
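
For a regularly spaced series the definition, half the average squared difference between values a given lag apart, translates directly into code. The following Python sketch is an illustration only: the function name is hypothetical, missing values are represented by None, and it is not the vgram implementation.

```python
def semivariogram(z, max_lag):
    """Return {lag: (semivariance, number of pairs)}.

    Pairs with a missing (None) member are skipped; lags with no
    usable pairs are omitted from the result.
    """
    result = {}
    for h in range(1, max_lag + 1):
        squared = [(z[i] - z[i + h]) ** 2
                   for i in range(len(z) - h)
                   if z[i] is not None and z[i + h] is not None]
        if squared:
            result[h] = (0.5 * sum(squared) / len(squared), len(squared))
    return result

# A made-up series with a pure linear trend.
print(semivariogram([1, 2, 3, 4, 5], 2))
```

For a pure linear trend with unit slope the semivariance at lag h is h squared over 2, which is one reason to look at residuals rather than raw data.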

41
Where does the 2 come from?

The units of the semivariogram are those of the
response squared.
Adding the variance to the graph as a reference
line underlines the connection.
A non-standard formula for the variance is, for
any i, j,
$\frac{1}{2} E[(z_i - z_j)^2]$.
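
A quick numerical check of the sample analogue: half the mean squared difference over all distinct pairs equals the sample variance with divisor n - 1. The Python sketch below uses made-up numbers and a hypothetical function name.

```python
from itertools import combinations
from statistics import variance

def half_mean_squared_difference(z):
    """Half the mean of (a - b)**2 over all distinct pairs in z."""
    pairs = list(combinations(z, 2))
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

z = [3.0, 5.0, 8.0, 13.0]  # made-up values
print(half_mean_squared_difference(z), variance(z))
```

The two printed numbers agree up to rounding, which is why the variance is a natural reference line on a semivariogram plot.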

42
Back to vgram

vgram (not yet public) is already quite general.
We take possibilities one by one.
With just one argument, the response, it checks
for a tsset or xtset time variable and uses it to
define separations if found. Note that panel
data are supported for free.
With just one argument otherwise, the order of
the observations is taken to define position in
time or space.

With two arguments, the second variable is taken
to define position. A width() option is required
to specify the width of bins within which
differences squared are averaged. Equal and
unequal spacing can thus both be accommodated.
With three arguments, the second and third
variables are taken to define position. A width()
option is required to specify the width of bins
within which differences squared are averaged.
Distance is calculated from coordinates using
Pythagoras' theorem.
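
Binning by separation can be sketched as follows, for positions given by one or two coordinates. This is a Python illustration, not vgram itself: the function name is hypothetical, and the convention that bin b covers distances from b * width up to but not including (b + 1) * width is an assumption.

```python
import math
from itertools import combinations

def binned_semivariogram(points, values, width):
    """points: coordinate tuples, e.g. (t,) or (x, y).

    Returns {bin index: (mean distance, semivariance, number of pairs)}.
    """
    bins = {}
    for (p, zp), (q, zq) in combinations(zip(points, values), 2):
        d = math.dist(p, q)  # Euclidean distance (Pythagoras' theorem)
        b = int(d // width)  # assumed binning convention
        total_d, total_sq, n = bins.get(b, (0.0, 0.0, 0))
        bins[b] = (total_d + d, total_sq + (zp - zq) ** 2, n + 1)
    return {b: (total_d / n, 0.5 * total_sq / n, n)
            for b, (total_d, total_sq, n) in sorted(bins.items())}

# Three made-up stations with made-up responses.
points = [(0, 0), (3, 4), (6, 8)]
values = [1.0, 2.0, 4.0]
print(binned_semivariogram(points, values, width=5))
```

Returning the number of pairs alongside each bin's semivariance makes it easy to see which bins rest on little information.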

44
Why not just use autocorrelation?

Variograms are defined for a wider class of
processes. Autocorrelation functions require weak
stationarity; variograms are defined for
processes with stationary increments.
Variograms are more flexible in the face of
irregular spacing.
The very wide use of autocorrelation reflects
custom and familiarity as well as intrinsic
merit.

45
A further example

We look at rainfalls for 8 May 1986 (a single
day) for 467 stations in Switzerland.

48
How much information?

Optionally the semivariogram results can be saved
in vgram to new variables.
Keeping track of the number of pairs used at each
lag is important.
Here we exploit the feature that spikeplot can
show frequencies on a square root scale.

50
To do list

variogram clouds
robust estimators
more flexible binning
spherical distances too
direction as well as lag

model fitting
(valid functional forms)
use for interpolation
(and smoothing)
(kriging, Gaussian process regression)

51
Variogram virtues

Defined for time and spatial series.
Defined for regular and irregular series.
Can help identify and check for structure,
even if you have no interest in their most
mentioned use, as a means towards the end of
spatial interpolation.

52
This paper

This paper fills a much needed gap in the
literature.
See Jackson, A. 1997.
Chinese acrobatics, an old-time brewery,
and the "much needed gap":
The life of Mathematical Reviews.
Notices of the American Mathematical Society
44: 330–337.

53
Acknowledgments

Historical portraits: Wikipedia.
MATLAB code for pchip:
Moler, C. 2004. Numerical Computing with MATLAB.
Philadelphia: SIAM. Chapter 3.
(http://www.mathworks.com/moler/interp.pdf)
The Swiss rainfall data can be found here:
http://www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip

54
Leo Breiman (1928–2005)

The main thing to learn about statistics is what
is sensible and honest and possible.
Doubt and suspicion, as well as technical
knowledge, are indispensable tools in statistics.
1973.
Statistics: With a view towards applications.
Boston: Houghton Mifflin, pp. 1, 18.
