Between and beyond: Irregular series, interpolation, variograms, and smoothing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Between and beyond: Irregular series,
interpolation, variograms, and smoothing
  • Nicholas J. Cox

2
Mind the gap!
  • Repeated reminder, London Underground.

3
Irregular series
  • Irregular series are series in which non-missing
    values are not all equally spaced.
  • Special case: Values would be equally spaced
    (every day, every year, ...), but there are some
    gaps with missing values, for human or inhuman
    reasons.
  • General case: Values are just at known times or
    points with no necessary rules about spacing.
  • Irregular series often seem to invite
    interpolation.

4
Luke Howard (1772–1864)
  • Best remembered for his nomenclature for clouds
  • (cumulus, stratus, cirrus and so forth).
  • Here we use as sandbox some of his temperature
    data from Plaistow, near London, in 1807.

5
  • Howard, Luke. 1818.
  • The Climate of London, Deduced from
    Meteorological Observations, Made at Different
    Places in the Neighbourhood of the Metropolis.
  • Volume I.
  • London: W. Phillips, etc.

6
(No Transcript)
7
Series of events
  • N.B. We are not talking here about series of
    events,
  • or realisations of point processes.
  • In such series occurrences are typically
    irregularly spaced, but the gaps are inherent in
    the process,
  • not a failing of our data.
  • Examples range from eruptions to elections.

8
(No Transcript)
9
Interpolation
  • Interpolation is the art of reading between the
    lines.
  • Historically, it is a deterministic process,
    often a matter of going beyond printed tables of
    functions
  • (logarithmic, trigonometric, and so forth).
  • In principle, we should worry about the
    statistical properties of interpolation. It is
    estimation or prediction.
  • In practice, imputation now appears better known
    among statistical researchers.

10
Interpolation in (official) Stata
  • The ipolate command for linear interpolation
    (and extrapolation) was added in Stata 3.1
    (1993).
  • The Mata functions spline3() and spline3eval()
    were added in Stata 9.0 (2005).

11
User-written programs
  • Programs (NJC) are available from SSC for:
  • cubic interpolation: cipolate (2002)
  • cubic spline interpolation: csipolate (2009)
  • piecewise cubic Hermite interpolation: pchipolate
    (2012)
  • nearest neighbour interpolation: nnipolate (2012)
  • A combined and extended program mipolate will
    shortly be available too.

12
Two dimensions too
  • Note also bipolate (Joseph Canner, SSC) (2014).
  • By default it uses quintic polynomials.
  • Other available methods include thin plate
    splines
  • and Shepard's method.
  • Note also twoway contour.

13
mipolate generalises ipolate
  • Interpolation is of yvar with respect to
    specified xvar.
  • Prior tsset or xtset is not assumed.
  • Regular spacing is not assumed.
  • Multiple values of yvar at the same xvar are
    averaged first.
  • Groupwise operations using by are supported.
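
The averaging step can be illustrated with a short Python sketch. It is not mipolate's code: the function name `average_duplicates` and the use of `None` for a missing value are assumptions made for illustration only.

```python
from collections import defaultdict

def average_duplicates(x, y):
    """Return the sorted distinct x values that have at least one
    non-missing y, together with the mean of y at each of them."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        if yi is not None:              # None stands in for a missing value
            groups[xi].append(yi)
    xs = sorted(groups)
    return xs, [sum(groups[v]) / len(groups[v]) for v in xs]

# Two readings at x = 1 and at x = 3; the reading at x = 2 is missing.
xs, ys = average_duplicates([3, 1, 3, 2, 1], [10.0, 4.0, 14.0, None, 6.0])
print(xs, ys)
```

After this step each distinct x carries a single known value, so any interpolation rule can be applied without ambiguity.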

14
Linear and cubic
  • Linear interpolation just uses previous and
    following known values (only).
  • This is done by ipolate, and also mipolate by
    default.
  • Cubic interpolation is another classic method,
    using two previous and two following known values
    (only).
  • This is done by mipolate, cubic.
  • The default of mipolate with either method
  • (as with ipolate) is not to extrapolate.
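
The two rules can be sketched in Python as follows. This is an illustration of the ideas on this slide, not the code of ipolate or mipolate: the function names are hypothetical, `None` marks a missing value, known x values are assumed distinct, and leaving a value missing when fewer than two known values lie on either side is an assumption of this sketch.

```python
def neighbours(x, y, xi):
    """Known (x, y) pairs before and after xi, each sorted by x."""
    known = sorted((a, b) for a, b in zip(x, y) if b is not None)
    return [p for p in known if p[0] < xi], [p for p in known if p[0] > xi]

def linear_ipolate(x, y):
    """Fill each missing y from the previous and following known values."""
    out = []
    for xi, yi in zip(x, y):
        before, after = neighbours(x, y, xi)
        if yi is not None or not before or not after:
            out.append(yi)              # known, or would need extrapolation
            continue
        (x0, y0), (x1, y1) = before[-1], after[0]
        out.append(y0 + (y1 - y0) * (xi - x0) / (x1 - x0))
    return out

def cubic_ipolate(x, y):
    """Fill each missing y from the cubic through the two previous and
    two following known values (Lagrange form)."""
    out = []
    for xi, yi in zip(x, y):
        before, after = neighbours(x, y, xi)
        if yi is not None or len(before) < 2 or len(after) < 2:
            out.append(yi)
            continue
        nodes = before[-2:] + after[:2]
        total = 0.0
        for j, (xj, yj) in enumerate(nodes):
            weight = 1.0
            for k, (xk, _) in enumerate(nodes):
                if k != j:
                    weight *= (xi - xk) / (xj - xk)
            total += yj * weight
        out.append(total)
    return out

x = [0, 1, 2, 3, 4]
y = [0, 1, None, 27, 64]                # y = x^3 with the value at x = 2 missing
lin = linear_ipolate(x, y)
cub = cubic_ipolate(x, y)
print(lin[2], cub[2])
```

On data that really are cubic, the four-point rule recovers the missing value while the straight line between neighbours does not, which is one simple way in which methods can disagree.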

15
Un peu d'histoire (A little history)
  • Cubic interpolation is often attributed to
    Joseph-Louis Lagrange (1736–1813) but was
    proposed earlier by Edward Waring (1735?–1798).

16
Lagrange Waring
17
Cubic splines
  • As before, we are using cubic polynomials
    locally, but they are constrained to join
    smoothly.
  • The syntax is mipolate, spline.
  • This is merely a wrapper for the official Mata
    functions.
  • As before, the default of mipolate with this
    option is not to extrapolate.

18
Linear extrapolation
  • As with ipolate, linear extrapolation is available
    as an option in mipolate to fill in missings at
    the end of series.
  • What your teachers told you is true:
  • extrapolation is dangerous.
  • Don't point that straight line! It can go off
    anywhere.
  • (Allude here to Mark Twain on the Mississippi.)
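
A minimal Python sketch of the danger, assuming that extrapolation simply extends the straight line through the last two known values. The function name `extrapolate_end` and the data are invented for illustration; this is not the code of ipolate or mipolate.

```python
def extrapolate_end(xs, ys, x_new):
    """Extend the line through the last two known points out to x_new."""
    x0, y0 = xs[-2], ys[-2]
    x1, y1 = xs[-1], ys[-1]
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x_new - x1)

xs = [0, 1, 2, 3]
ys = [10.0, 12.0, 11.0, 15.0]           # known values all lie between 10 and 15
near = extrapolate_end(xs, ys, 4)       # one step beyond the data
far = extrapolate_end(xs, ys, 10)       # seven steps beyond the data
print(near, far)
```

The final jump from 11 to 15 fixes the slope, so the further out we go, the further the straight line leaves the range of anything observed.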

19
Piecewise cubic Hermite interpolation
  • This method also uses piecewise cubics joining
    smoothly. The syntax is mipolate, pchip.
  • The interpolant is shape-preserving and cannot
  • overshoot locally.
  • Sections in which yvar is increasing, decreasing
    or constant with xvar remain so after
    interpolation.
  • Hence local maxima and minima also remain so.
  • This interpolation method also extrapolates.
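
The shape-preserving idea can be sketched in Python. Interior slopes use the weighted harmonic mean of neighbouring secant slopes, as in Moler's pchip, and are set to zero wherever the secant slopes change sign or vanish. Two simplifications are assumptions of this sketch and not part of the method proper: end slopes are just the end secant slopes, and evaluation is only within the range of the data, so the extrapolation mentioned above is not covered. Function names are hypothetical.

```python
def pchip_slopes(x, y):
    """Slopes at the knots for a shape-preserving piecewise cubic.
    x must be strictly increasing with at least two points."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    delta = [(y[i + 1] - y[i]) / h[i] for i in range(n - 1)]
    d = [0.0] * n
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:         # same sign, neither zero
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # otherwise the slope stays 0: a local extremum or a flat stretch
    d[0], d[-1] = delta[0], delta[-1]           # simplified end slopes
    return d

def pchip_eval(x, y, d, t):
    """Evaluate the cubic Hermite interpolant at t, x[0] <= t <= x[-1]."""
    k = max(i for i in range(len(x) - 1) if x[i] <= t)
    h = x[k + 1] - x[k]
    s = (t - x[k]) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * y[k] + h10 * h * d[k] + h01 * y[k + 1] + h11 * h * d[k + 1]

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.0, 1.0, 1.0, 3.0]                   # flat, rising, flat, rising
d = pchip_slopes(x, y)
grid = [i / 20 for i in range(81)]
curve = [pchip_eval(x, y, d, t) for t in grid]
print(min(curve), max(curve))
```

Because the data never decrease, neither does the interpolated curve, and it never leaves the range of the known values.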

20
Charles Hermite (1822–1901)
21
Other methods
  • mipolate adds forward, backward and nearest
    neighbour interpolation
  • Use the previous, next or the nearest known
    value.
  • Using the last known value is often dubious
    statistically,
  • but it is a very common request in data
    management.
  • The other methods are provided mostly for
    completeness.
  • There is small print (option choices) about how
    to break ties when two values are equally near.
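
These three rules are simple enough to sketch in Python. The function name, the `tie` argument and its values are hypothetical, not mipolate's option names. Letting the nearest-value rule fill a gap that has a known value on one side only is likewise an assumption of this sketch.

```python
def fill_missing(x, y, method="nearest", tie="mean"):
    """Fill missing y (None) from the previous, next or nearest known value."""
    known = sorted((a, b) for a, b in zip(x, y) if b is not None)
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        before = [p for p in known if p[0] < xi]
        after = [p for p in known if p[0] > xi]
        prev = before[-1] if before else None
        nxt = after[0] if after else None
        if method == "forward":                 # copy the previous known value
            pick = prev[1] if prev else None
        elif method == "backward":              # copy the next known value
            pick = nxt[1] if nxt else None
        elif prev is None or nxt is None:       # nearest, one side only
            side = prev or nxt
            pick = side[1] if side else None
        else:                                   # nearest, both sides known
            gap_prev, gap_next = xi - prev[0], nxt[0] - xi
            if gap_prev < gap_next:
                pick = prev[1]
            elif gap_next < gap_prev:
                pick = nxt[1]
            elif tie == "previous":
                pick = prev[1]
            elif tie == "next":
                pick = nxt[1]
            else:                               # tie == "mean"
                pick = (prev[1] + nxt[1]) / 2
        filled.append(pick)
    return filled

x = [1, 2, 3, 4, 5]
y = [10, None, None, None, 30]
fwd = fill_missing(x, y, method="forward")
bwd = fill_missing(x, y, method="backward")
near = fill_missing(x, y, method="nearest", tie="mean")
print(fwd, bwd, near)
```

The value at x = 3 is equally far from both neighbours, so only the tie rule decides what is filled in there.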

22
mipolate summary
  • Seven methods, with whether linear extrapolation
    is available:
  • linear: yes
  • cubic: yes
  • (cubic) spline: yes
  • pchip: no
  • forward: no
  • backward: no
  • nearest: no

23
(No Transcript)
24
(No Transcript)
25
Simple messages
  • There are many interpolation methods to choose
    from.
  • They will often disagree, even for simple-looking
    instances.
  • Disagreement gives a handle on uncertainty.
  • In a real problem, simulate missings and test how
    well known values are estimated.
  • What makes most sense in your problem will
    reflect its dependence structure.
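
One way to act on the advice about simulating missings is a leave-one-out check: treat each interior known value in turn as missing, interpolate it from the rest, and summarise the errors. The Python sketch below does this for linear interpolation only; the function names are hypothetical and the data are invented for illustration.

```python
import math

def linear_at(xs, ys, t):
    """Linear interpolation at t from known points sorted by x."""
    for i in range(len(xs) - 1):
        if xs[i] <= t <= xs[i + 1]:
            frac = (t - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + frac * (ys[i + 1] - ys[i])
    raise ValueError("t lies outside the range of known points")

def leave_one_out_rmse(xs, ys, interpolate=linear_at):
    """Drop each interior point in turn, re-estimate it, and return the
    root mean square error of the estimates."""
    errors = []
    for i in range(1, len(xs) - 1):
        estimate = interpolate(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
        errors.append(estimate - ys[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))

xs = [0, 1, 2, 3, 4]
rmse_line = leave_one_out_rmse(xs, [2 * v + 1 for v in xs])    # straight line
rmse_curve = leave_one_out_rmse(xs, [v * v for v in xs])       # curved series
print(rmse_line, rmse_curve)
```

Passing a different function as `interpolate` gives the same check for another method, so methods can be compared on the data at hand.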

26
Leo Breiman (1928–2005)
  • The main thing to learn about statistics is what
    is sensible and honest and possible.
  • Doubt and suspicion, as well as technical
    knowledge, are indispensable tools in statistics.
  • 1973.
  • Statistics: With a view towards applications.
  • Boston: Houghton Mifflin, pp. 1, 18.

27
  • We turn from a project that is nearly done to one
    that is very much in progress.

28
Variograms
  • Variograms (more properly semivariograms) are
    plots of
  • (mean) half the squared difference between values
  • versus
  • separation, distance or lag.
  • By a tempting abuse of terminology, we often use
    the same name for the underlying relationship as
    a function.

29
First known use of term variogram
  • Geoffrey H. Jowett (1922– ) in 1955:
  • The comparison of means of sets of observations
    from sections of independent stochastic series.
  • Journal of the Royal Statistical Society. Series
    B (Methodological) 17: 208–227.

30
Spatial and time series
  • Variograms are central to one approach to spatial
    statistics, in this context often known as
    geostatistics.
  • Georges Matheron (1930–2000) is most often
    mentioned here.
  • But variograms can be very useful for time series
    too.

31
Time series too
  • Variograms are prominent in these texts on time
    series and longitudinal data
  • Diggle, P.J. 1990. Time Series: A Biostatistical
    Introduction. Oxford: Oxford University Press.
  • Diggle, P.J., Heagerty, P.J., Liang, K-Y. and
    Zeger, S.L. 2002. Analysis of Longitudinal Data.
    Oxford: Oxford University Press.

32
User-written programs
  • Programs (NJC) are available from SSC for:
  • variograms in one dimension: variog (2005)
  • variograms in two dimensions: variog2 (2005)
  • A combined and extended program vgram is under
    development.

33
Generality of variograms
  • So variograms are, without undue strain,
    defined
  • for time series and for spatial series,
  • whether regular or irregular,
  • as they just depend on separation being measured.
  • Plotting the mean for each distinct separation is
  • a common, but not compulsory, convention.

34
A simple example: webuse air2
35
Variograms
  • vgram air, recast(connected) xla(0(12)72)
  • vgram air

36
Comparison at different lags
  • We are plotting mean squared differences between
    values compared at lags 1, 2, 3, ...
  • In this example, we have monthly data, so are
    comparing values 1, 2, 3, ... months apart.
  • Many readers may be familiar with the same idea
    for calculating autocorrelation and
    cross-correlation.
  • The variogram, like the raw data plot, hints at
    a structure of trend plus seasonality.

37
Variograms of residuals, not data
  • Here, as elsewhere, it is a good idea to work
    with residuals, rather than the original data.
  • Time series modellers could have a happy time
    arguing which model was best for the airline
    data, but we just use a Poisson regression on
    time and look at its residuals.
  • On the versatility and virtuosity of Poisson
    regression, check out Gould, William.
  • http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

38
Sometimes, structure is this simple
  • Poisson regression
  • Residuals from Poisson

39
A little more formally
  • The semivariogram $\gamma(h)$ for response $z$ is
    given by
  • $2\gamma(h) = A\{[z(i) - z(i+h)]^2\}$
  • where $A$ denotes averaging over pairs of values
    at lag $h$.
  • As emphasised, using a mean is a convention. The
    fuller picture (literally!) is a plot of
    $[z(i) - z(i+h)]^2$ versus $h$.
  • This is often known as a variogram cloud.
  • I borrow the notation $A()$ from Whittle, P. 1970.
    Probability. Harmondsworth: Penguin.
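
For a regularly spaced series the definition translates directly into a few lines of Python. This is a sketch of the definition, not vgram's code: the function name is hypothetical, `None` marks a missing value, and any pair with a missing member is simply skipped.

```python
def semivariogram(z, max_lag):
    """Return {lag: (gamma, number of pairs)} for lags 1..max_lag, where
    gamma is half the mean squared difference between values lag apart."""
    result = {}
    for h in range(1, max_lag + 1):
        squares = [(z[i] - z[i + h]) ** 2
                   for i in range(len(z) - h)
                   if z[i] is not None and z[i + h] is not None]
        if squares:                     # no estimate when there are no pairs
            result[h] = (0.5 * sum(squares) / len(squares), len(squares))
    return result

gamma = semivariogram([1, 3, 2, 5, 4], max_lag=2)
print(gamma)
```

Returning the number of pairs with each estimate is a reminder that longer lags rest on fewer comparisons.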

40
Where does the 2 come from?
  • The units of the semivariogram are those of the
    response squared.
  • Adding the variance to the graph as a reference
    line underlines the connection.
  • A non-standard formula for the variance is, for
    any i, j,
  • $\tfrac{1}{2} E[(z_i - z_j)^2]$.
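
A quick numerical check of this formula in Python, using invented data. The expectation is taken here as the average over all pairs of distinct observations, which is an interpretation made for this sketch; it reproduces the usual sample variance with divisor n - 1.

```python
from itertools import combinations
from statistics import variance

z = [2.0, 4.0, 4.0, 7.0, 9.0]

# Squared differences over all pairs of distinct observations.
squares = [(a - b) ** 2 for a, b in combinations(z, 2)]
half_mean_square = 0.5 * sum(squares) / len(squares)

sample_variance = variance(z)           # divisor n - 1
print(half_mean_square, sample_variance)
```

This is why the variance is a natural reference line on a semivariogram: it is what half the mean squared difference comes to when separation is ignored.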

41
Back to vgram
  • vgram (not yet public) is already quite general.
  • We take possibilities one by one.
  • With just one argument, the response, it checks
    for a tsset or xtset time variable and uses it to
    define separations if found. Note that panel
    data are supported for free.
  • With just one argument otherwise, the order of
    the observations is taken to define position in
    time or space.

42
  • With two arguments, the second variable is taken
    to define position. A width() option is required
    to specify the width of bins within which
    differences squared are averaged. Equal and
    unequal spacing can thus both be accommodated.
  • With three arguments, the second and third
    variables are taken to define position. A width()
    option is required to specify the width of bins
    within which differences squared are averaged.
    Distance is calculated from coordinates using
    Pythagoras' theorem.
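
The three-argument case can be sketched in Python as follows. vgram is not yet public, so its binning rule is not known here. Placing a pair at distance d in bin number floor(d / width) is an assumption of this sketch, as are the function name and the invented data.

```python
import math
from itertools import combinations

def binned_semivariogram(points, width):
    """points is a list of (x, y, z). Returns
    {bin: (mean distance, gamma, number of pairs)}, where gamma is half
    the mean squared difference in z over the pairs in that bin."""
    sums = {}
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        dist = math.hypot(x1 - x2, y1 - y2)     # Pythagoras' theorem
        b = int(dist // width)                  # assumed binning rule
        total_d, total_sq, n = sums.get(b, (0.0, 0.0, 0))
        sums[b] = (total_d + dist, total_sq + (z1 - z2) ** 2, n + 1)
    return {b: (d / n, 0.5 * sq / n, n) for b, (d, sq, n) in sorted(sums.items())}

points = [(0, 0, 1.0), (3, 4, 3.0), (6, 8, 7.0)]
vg = binned_semivariogram(points, width=5)
print(vg)
```

The two-argument case is the same calculation with position given by a single variable, so that distance is just an absolute difference.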

43
Why not just use autocorrelation?
  • Variograms are defined for a wider class of
    processes. Autocorrelation functions require weak
    stationarity; variograms are defined for
    processes with stationary increments.
  • Variograms are more flexible in the face of
    irregular spacing.
  • The very wide use of autocorrelation reflects
    custom and familiarity as well as intrinsic
    merit.

44
A further example
  • We look at rainfalls for 8 May 1986 (a single
    day) for 467 stations in Switzerland.

45
(No Transcript)
46
(No Transcript)
47
How much information?
  • Optionally the semivariogram results can be saved
    in vgram to new variables.
  • Keeping track of the number of pairs used at each
    lag is important.
  • Here we exploit the feature that spikeplot can
    show frequencies on a square root scale.

48
(No Transcript)
49
To do list
  • variogram clouds
  • robust estimators
  • more flexible binning
  • spherical distances too
  • direction as well as lag
  • model fitting
  • (valid functional forms)
  • use for interpolation
  • (and smoothing)
  • (kriging, Gaussian process regression)

50
Variogram virtues
  • Defined for time and spatial series.
  • Defined for regular and irregular series.
  • Can help identify and check for structure.
  • Useful even if you have no interest in their most
    mentioned use, as a means towards the end of
    spatial interpolation.

51
This paper
  • This paper fills a much needed gap in the
    literature.
  • See Jackson, A. 1997.
  • Chinese acrobatics, an old-time brewery,
  • and the "much needed gap":
  • The life of Mathematical Reviews.
  • Notices of the American Mathematical Society
  • 44: 330–337.

52
Acknowledgments
  • Historical portraits: Wikipedia.
  • MATLAB code for pchip:
  • Moler, C. 2004. Numerical Computing with MATLAB.
    Philadelphia: SIAM. Chapter 3.
    (http://www.mathworks.com/moler/interp.pdf)
  • The Swiss rainfall data can be found here:
  • http://www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip