Title: An Introduction to Functional Data Analysis
1An Introduction to Functional Data Analysis
- Jim Ramsay
- McGill University
2Overview
- Well use three case studies to see what is meant
by functional data, and to consider some
important issues in the analysis of functional
data - Human growth data
- US nondurable goods manufacturing index
- Thirty years of Montreal weather
3Human Growth From data to functions
4- We need repeated and regular access to subjects
for up to 20 years. - Height changes over the day, and must be measured
at a fixed time. - Height is measured in supine position in infancy,
followed by standing height. The change involves
an adjustment of about 1 cm. - Measurement error is about 0.5 cm in later years,
but is rather larger in infancy. - Measurements are not taken at equally spaced
points in time.
5Challenges to functional modeling
- We want smooth curves that fit the data as well
as is reasonable. - We will want to look at velocity and
acceleration, so we want to differentiate twice
and still be smooth. - In principle the curves should be monotone i.
e., have a positive derivative.
6The monotonicity problem
- The tibia of a newborn measured daily shows us
that over the short term growth takes places in
spurts. - This babys tibia grows as fast as 2 mm/day!
- How can we fit a smooth monotone function?
7Weighted sums of basis functions
- We need a flexible method for constructing curves
to fit the data. - We begin with a set of basic functional building
blocks fk(t), called basis functions. - Our fitting function x(t) is a weighted sum of
these
8B-splines for growth data
- Order 4 splines look smooth, but their second
derivatives are rough. - We use order 6 B-splines because we want to
differentiate the result at least twice. - We place a knot at each of the 31 ages.
- The total number of basis functions order
number of interior knots. 35 in this case.
9Isnt using 35 basis functions to fit 31
observations a problem?
- Yes. We will fit each observation exactly.
- This will ignore the fact that the measurement
error is typically about 0.5 cm. - But well fix this up later, when we look at
roughness penalties.
10Okay, lets see what happens
- These two Matlab commands define the basis and
fit the data - hgtbasis
- create_bspline_basis(1,18, 35, 6, age)
- hgtfd
- data2fd(hgtfmat, age, hgtbasis)
11Why we need to smooth
- Noise in the data has a huge impact on derivative
estimates.
12Please let me smooth the data!
- This command sets up 12 B-spline basis functions
defined by equally spaced knots. This gives us
about the right amount of fitting power given the
error level. - hgtbasis
- create_bspline_basis(1,18, 12, 6)
13- These are velocities are much better.
- They go negative on the right, though.
14Lets see some accelerations
- These acceleration curves are too unstable at the
ends. - We need something better.
15A measure of roughness
- What do we mean by smooth?
- A function that is smooth has limited curvature.
- Curvature depends on the second derivative. A
straight line is completely smooth.
16Total curvature
- We can measure the roughness of a function x(t)
by integrating its squared second derivative. - The second derivative notation is D2x(t).
17Total curvature of acceleration
- Since we want acceleration to be smooth, we
measure roughness at the level of acceleration
18The penalized least squares criterion
- We strike a compromise between fitting the data
and keeping the fit smooth.
19How does this control roughness?
- Smoothing parameter ? controls roughness.
- When ? 0, only fitting the data matters.
- But as ? increases, we place more and more
emphasis on penalizing roughness. - As ? ? 8, only roughness matters, and functions
having zero roughness are used.
20- We can either smooth at the data fitting step, or
smooth a rough function. - This Matlab command smooths the fit to the data
obtained using knots at ages. The roughness of
the fourth derivative is controlled. - lambda 0.01
- hgtfd smooth_fd(hgtfd, lambda, 4)
21Accelerations using a roughness penalty
- These accelerations are much less variable at the
extremes.
22The corresponding velocities look good, too
23How did you choose ??
- We smooth just enough to obtain tolerable
roughness in the estimated curves (accelerations
in this case), but not so much as to lose
interesting variation. - There are data-driven methods for choosing ?, but
they offer only a reasonable place to begin
exploring. - But this is inevitably involves judgment.
24What about monotonicity?
- The growth curves should be monotonic.
- The velocities should be non-negative.
- Its hard to prevent linear combinations of
anything from breaking rules like monotonicity. - We need an indirect approach to constructing a
monotonic model.
25A differential equation for monotonicity
- Any strictly monotonic function x(t) must satisfy
a simple linear differential equation
The reason is simple Because of strict
monotonicity, the first derivative Dx(t) will
never be 0, and function w(t) is therefore
simply D2x(t)/Dx(t).
26The solution of the differential equation
- Consequently, any strictly monotonic function
x(t) must be expressible in the form
This suggests that we transform the monotone
smoothing problem into one of estimating function
w(t), and constants ß0 and ß1.
27What we have learned from the growth data
- We can control smoothness by either using a
restricted number of basis functions, or by
imposing a roughness penalty. - Roughness penalty methods generally work better
than simple basis expansions. - Differential equations can play a useful role in
defining constrained functions.
28Phase-Plane Plotting the Nondurable Goods Index
29- Nondurable goods last less than two years Food,
clothing, cigarettes, alcohol, but not personal
computers!! - The nondurable goods manufacturing index is an
indicator of the economics of everyday life. - The index has been published monthly by the US
Federal Reserve Board since 1919. - It complements the durable goods manufacturing
index.
30What we want to do
- Look at important events.
- Examine the overall trend in the index.
- Have a look at the annual or seasonal behavior of
the index. - Understand how the seasonal behavior changes over
the years and with specific events.
31The log nondurable goods index
32Events and Trends
- Short term
- 1929 stock market crash
- 1937 restriction of money supply
- 1974 end of Vietnam war, OPEC oil crisis
- Medium term
- Depression
- World War II
- Unusually rapid growth 1960-1974
- Unusually slow growth 1990 to present
- Long term increase of 1.5 per year
33The evolution of seasonal trend
- We focus on the years 1948 to 1999
- We estimate long- and medium-term trend by spline
smoothing, but with knots too far apart to
capture seasonal trend - We subtract this smooth trend to leave only
seasonal trend
34Smoothing the data
We want to represent the data yj by a smooth
curve x(t). The curve should have at least two
smooth derivatives. We use spline smoothing,
penalizing the size of the 4th derivative.
A function Pspline in S-PLUS is available by ftp
from ego.psych.mcgill.ca/pub/ramsay/FDAfuns
35Three years of typical trend 1964-1966
36Seasonal Trend
- Typically three peaks per year
- The largest is in the fall, peaking at the
beginning of October - The low point is mid-December
37Non-seasonal trend is in red
38Seasonal trend data nonseasonal trend
39Phase-Plane Plots
- Looking at seasonal trend itself does not reveal
as much as looking at the interplay between - Velocity or its first derivative, reflecting
kinetic energy in the system. - Acceleration or its second derivative, reflecting
potential energy. - The phase-plane diagram plots acceleration
against velocity. - For purely sinusoidal trend, the plot would be an
ellipse.
40Position of a swinging pendulum
41Phase-plane plot for pendulum
42Phase-plane plot for 1964
- There are three large loops separated by two
small loops or cusps - Spring cycle mid-January into April
- Summer cycle May through August
- Fall cycle October through December
43A look at the years 1929-1931.
44(No Transcript)
45(No Transcript)
46(No Transcript)
471929 through 1931
- The stock market crash shows up as a large
negative surge in velocity. - Subsequent years nearly lose the fall production
cycle, as people tighten their belts and spend
less at Christmas.
48What happened in 1937-1938?
49(No Transcript)
50(No Transcript)
51(No Transcript)
521937 and 1938
- The Treasury Board, fearing that the economy was
becoming overheated again, clamped down on the
money supply. The effect was catastrophic, and
nearly wiped out the fall cycle. - This new crash was even more dramatic than that
of 1929, but was forgotten because of the
outbreak of World War II.
53What about World War II?
54(No Transcript)
55- During World War II, the seasonal cycle became
very small, since the war, and the production
that fed it, lasted all year long. - Now look at three pivotal years, 1974 to 1976,
when the Vietnam War ended and the OPEC oil
crisis happened. Watch the shrinking of the fall
cycle.
56(No Transcript)
57(No Transcript)
58(No Transcript)
59What about today?
60(No Transcript)
61These days
- Over the last ten years the size of all three
cycles have become much smaller. - Why?
- Is variation now smoothed out by information
technology? - Are the aging baby boomers spending less?
- Are personal computers, video games, and other
electronic goods really durable? - Has manufacturing now moved off shore?
62Conclusions
- We can separate long- and medium-term trends from
seasonal trends by smoothing. - Phase-plane plots are great ways to inspect
seasonality. - Derivatives were used in two ways to penalize
roughness, and to reflect the dynamics of
manufacturing.
63Trends in Seasonality
- We see by inspection that seasonal trends change
systematically over time, and can also change
abruptly. - We first estimate the principal components of
seasonal variation, using a version of principal
components analysis adapted to functional data,
and sensitive only to effects periodic over one
year.
64(No Transcript)
65(No Transcript)
66(No Transcript)
67The Components
- Relative sizes of spring and summer cycles (53)
- Joint size of spring and summer cycles (25)
- Size of fall cycle (11)
68Plotting Component Scores
- We can compute scores at each year for these
three principal components, sometimes called
empirical orthogonal functions. - Plotting the evolution of these scores over the
51 years shows some interesting structural
changes in the economics of everyday life.
69(No Transcript)
70(No Transcript)
71Wrap-up
- Phase-plane plots are good for inspecting
seasonal quasi-harmonic trends - Principal components analysis reveals main
components of variation in seasonal trend. - Plotting component scores shows how trend has
evolved.
72- This was joint with work with James B. Ramsey,
Dept. of Economics, New York University, and is
reported in - Ramsay, J. O. and Ramsey, J. B. (2001) Functional
data analysis of the dynamics of the monthly
index of non-durable goods production. Journal
of Econometrics, 107, 327-344.
73Phase and Amplitude Variation in Montreal Weather
- 34 years of daily temperatures, 1961-1994
inclusive - Values are averages of daily maximum and minimum
- 12410 observations in tenths of a degree Celsius
- Available for Montreal and 34 other Canadian
weather stations
74(No Transcript)
75- We know that there are two kinds of variation in
these data - Amplitude variation day-to-day and year-to-year
variation in temperature at events such as the
depth of winter. - Phase variation the timing of these events --
the seasons arrive early in some years, and late
in others.
76Goals
- Separate phase variation from amplitude variation
by registering the series to its strictly
periodic image. - Estimate components of variation due to amplitude
and phase variation.
77Smoothing
- The registration process requires that we smooth
the data two ways - With an unconstrained smooth that removes the
day-to-day variation, but leaves longer-term
variation unchanged. - With a strictly periodic smooth that eliminates
all but strictly periodic trend.
78Unconstrained smooth
- Raw data are represented by a B-spline expansion
using 500 basis functions of order 6. - Knot about every 25 days.
- The standard deviation of the raw data about this
smooth, adjusted for degrees of freedom, is 1.52
degrees Celsius.
79(No Transcript)
80Periodic smooth
- The basis is Fourier, with 9 basis functions
judged to be enough to capture most of the
strictly periodic trend for a period of one year. - The standard deviation of the raw about data
about this smooth is 2.18 deg C. - Compare this to 2.07 deg C. for the unconstrained
smooth.
81(No Transcript)
82- Plotting the unconstrained B-spline smooth minus
the constrained Fourier smooth reveals some
striking discrepancies. - We focus on Christmas, 1989. The Ramsays spent
the holidays in a chalet in the Townships, and
awoke to 37 deg C. No skiing, car dead,
marooned! - This temperature would still be cold in
mid-January, but less unusual.
83(No Transcript)
84Registration
- Let the unconstrained smooth be x(t) and the
strictly periodic smooth be x0(t). - We need to estimate a nonlinear strictly
increasing smooth transformation of time h(t),
called a warping function, such that a fitting
criterion is minimized.
85Fitting criterion
The fitting criterion was the smallest eigenvalue
of the matrix
This criterion measures the extent to which a
plot of xh(t) against x0(t) is linear, and thus
whether the two curves are in phase.
86The warping function h(t)
- Every smooth strictly monotone function h(t) such
that h(0) 0 can be represented as
We represent unconstrained function w(v) by a
B-spline expansion. Constant C is determined by
constraint h(T) T.
87The deformation u(t) h(t) - t
- Plotting this allows us to see when the seasons
come early (negative deformation) or late
(positive deformation).
88(No Transcript)
89- Mid-winter for 1989-1990 arrived about 25 days
early. - The next step is to register the temperature data
by computing x(t) xh(t). The registered
curve x(t) contains only amplitude variation. - Registration was done by Matlab function
registerfd, available by ftp from - ego.psych.mcgill.ca/pub/ramsay/FDAfuns
90(No Transcript)
91Amplitude variation
- The standard deviation of the difference between
the unconstrained smooth and the strictly
periodic smooth is 2.15 C. - The standard deviation of the difference between
the registered smooth and the period smooth is
1.73 C. - (2.152 1.732)/2.152 .35, the proportion of
the variation due to phase.
92- The standard deviation of the raw data around the
registered smooth is 2.13 C, compared with 2.07 C
for the unregistered smooth. - About 10 of the total variation is due to phase.
93Conclusions
- Phase variation is an important part of weather
behavior. - Statisticians seldom think about phase variation,
and classical time series methods ignore it
completely. - Phase variation needs more attention, and
registration is an essential tool.
94Unique Aspects of Functional Data Analysis
- The data are smooth, so we can use derivatives in
various ways. - Differential equations can play a big role.
- Events in functional data occur over different
time scales. - Time itself may be an elastic medium, and vary
over functional observations.
95Finding out More
- Ramsay, J. O. and Silverman, B. W. (1997, 2004)
Functional Data Analysis. Springer. - Ramsay, J. O. and Silverman, B. W.
- (2002) Applied Functional Data Analysis. Springer
- Visit the FDA website www.psych.mcgill.ca/misc/fd
a/ - Software in Matlab, R and S-PLUS available at
ego.psych.mcgill.ca/pub/ramsay