Chapter 4: Prediction and Bayesian Inference
1
Chapter 4: Prediction and Bayesian Inference
  • 4.1 Estimators versus predictors
  • 4.2 Prediction for one-way ANOVA models
  • Shrinkage estimation, types of predictions
  • 4.3 Best linear unbiased predictors (BLUPs)
  • 4.4 Mixed model predictors
  • 4.5 Bayesian inference
  • 4.6 Case study: Forecasting lottery sales
  • 4.7 Credibility Theory
  • Appendix 4A Linear unbiased predictors

2
4.1 Estimators versus predictors
  • In the longitudinal data model, y_it = z_it′ α_i + x_it′ β + ε_it, the variables α_i describe subject-specific effects.
  • Given the data {y_it, z_it, x_it}, in some problems it is of interest to summarize subject effects.
  • We have discussed how to estimate the fixed, unknown parameters β.
  • It is also of interest to summarize subject-specific effects, such as those described by the random variable α_i.
  • Predictors are estimators of random variables.
  • Like estimators, predictors are said to be linear if they are formed from a linear combination of the responses y.

3
Applications of prediction
  • In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) herds (fixed).
  • In credibility theory, one wishes to predict
    expected claims for a policyholder given exposure
    to several risk factors
  • In sample surveys, one wishes to predict the size
    of a specific age-sex-race cohort within a small
    geographical area (known as small area
    estimation).
  • In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring the quality of a production plan, and (3) ranking baseball players' abilities.

4
4.2 Prediction for one-way ANOVA models
  • Consider the traditional one-way random effects ANOVA (analysis of variance) model
  • y_it = μ + α_i + ε_it.
  • Suppose that we wish to summarize the subject-specific conditional mean, μ + α_i.
  • For contrast, first consider using the fixed effects model with μ = 0.
  • Here, we have that the subject-specific mean ȳ_i is the best (Gauss-Markov) estimate of α_i.
  • This estimate is unbiased, that is, E ȳ_i = α_i.
  • This estimate has minimum variance among all linear unbiased estimators (BLUE).

5
Shrinkage estimator
  • We use the one-way random effects model.
  • Consider an estimator of μ + α_i that is a linear combination of ȳ_i and ȳ, that is, c1 ȳ_i + c2 ȳ, for constants c1 and c2.
  • Calculations show that the best values of c1 and c2, those minimizing the mean square error E[c1 ȳ_i + c2 ȳ - (μ + α_i)]², satisfy c2 = 1 - c1.
  • For large n, we have the shrinkage estimator, or predictor, of μ + α_i:
  • ȳ_i,s = ζ_i ȳ_i + (1 - ζ_i) ȳ, where ζ_i = T_i σ_α² / (T_i σ_α² + σ_ε²).

6
Example of shrinkage estimator: Hypothetical Run Times for Three Machines
  • Machine   Run Times         Average Run Time
  •   1       14, 12, 10, 12    ȳ_1 = 12
  •   2       9, 16, 15, 12     ȳ_2 = 13
  •   3       8, 10, 7, 7       ȳ_3 = 8
  • Notation: y_ij means the jth run from the ith machine.
  • For example, y_21 = 9 and y_23 = 15.
  • Are there real differences among machines?
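To make the shrinkage computation concrete, here is a minimal Python sketch (not part of the original slides) that estimates the variance components with the usual one-way ANOVA method-of-moments formulas and reproduces the shrinkage estimators of Figure 4.1; all variable names are illustrative.

```python
# A minimal sketch: ANOVA variance components and shrinkage estimators
# for the hypothetical machine run-time data above.
import numpy as np

runs = {1: [14, 12, 10, 12],
        2: [9, 16, 15, 12],
        3: [8, 10, 7, 7]}

n = len(runs)                        # number of machines (subjects)
T = 4                                # runs per machine (balanced design)
subject_means = {i: np.mean(y) for i, y in runs.items()}
grand_mean = np.mean([y for ys in runs.values() for y in ys])   # 11.0

# Within- and between-machine mean squares (method of moments)
sse = sum(np.sum((np.array(y) - subject_means[i]) ** 2) for i, y in runs.items())
mse = sse / (n * T - n)                             # estimates sigma_eps^2
msb = T * sum((m - grand_mean) ** 2 for m in subject_means.values()) / (n - 1)
sigma2_alpha = (msb - mse) / T                      # estimates sigma_alpha^2

# Shrinkage (credibility) factor: zeta = T*sa2 / (T*sa2 + se2)
zeta = T * sigma2_alpha / (T * sigma2_alpha + mse)  # about 0.825

for i, m in subject_means.items():
    shrunk = zeta * m + (1 - zeta) * grand_mean
    print(f"machine {i}: mean {m:.1f} -> shrinkage estimator {shrunk:.3f}")
# Prints values matching Figure 4.1 (11.825, 12.650, 8.525) up to rounding.
```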

7
Example - Continued
  • To see the shrinkage effect, consider
  • Figure 4.1, Comparison of Subject-Specific Means to Shrinkage Estimators.

[Figure 4.1: the subject-specific means ȳ_1 = 12, ȳ_2 = 13, and ȳ_3 = 8 are shrunk toward the overall mean 11, giving shrinkage estimators 11.825, 12.650, and 8.525.]
8
More on shrinkage estimators
  • Under the random effects model, ȳ_i is an unbiased predictor of μ + α_i in the sense that E[ȳ_i - (μ + α_i)] = 0.
  • However, ȳ_i is inefficient in the sense that ȳ_i,s has a smaller mean square error than ȳ_i.
  • Here, ȳ_i has been shrunk towards the stable estimator ȳ.
  • The estimator ȳ_i,s is said to "borrow strength" from the stable estimator ȳ.
  • Recall ζ_i = T_i σ_α² / (T_i σ_α² + σ_ε²).
  • Note that ζ_i → 1 as either (i) T_i → ∞ or (ii) σ_α²/σ_ε² → ∞.

9
Best predictors
  • From Section 3.1, it is easy to check that the generalized least squares estimator of μ is m_α,GLS = (Σ_i ζ_i ȳ_i) / (Σ_i ζ_i).
  • The linear predictor of μ + α_i that has minimum variance is ζ_i ȳ_i + (1 - ζ_i) m_α,GLS.
  • Here, the acronym BLUP stands for best linear unbiased predictor.

10
Types of Predictors
  • We have now introduced the BLUP of μ + α_i. This quantity is a linear combination of global parameters and subject-specific effects.
  • Two other types of predictors are of interest.
  • Residuals. Here, we wish to predict ε_it. The BLUP residual turns out to be e_it,BLUP = y_it - (ζ_i ȳ_i + (1 - ζ_i) m_α,GLS).
  • Forecasts. Here, we wish to predict, for L lead time units into the future, y_i,Ti+L = μ + α_i + ε_i,Ti+L.
  • Without serial correlation, the predictor is the same as the predictor of μ + α_i. However, we will see that the mean square error turns out to be larger.

11
4.3 Best linear unbiased predictors
  • This section develops best linear unbiased predictors in the context of mixed linear models, then specializes them to longitudinal data mixed models.
  • BLUPs are developed by examining the minimum mean square error predictor of a random variable, w.
  • We give a development due to Harville (1976).
  • The argument is originally due to Goldberger (1962), who coined the phrase "best linear unbiased predictor."
  • The acronym was first used by Henderson (1973).
  • BLUPs can also be developed as conditional expectations using multivariate normality.
  • BLUPs can also be developed in a Bayesian context.

12
Mixed linear models
  • Suppose that we observe an N × 1 random vector y with mean E y = X β and variance Var y = V.
  • We wish to predict a random variable w that has mean E w = λ′ β and Var w = σ_w².
  • Denote the covariance between w and y as Cov(w, y) = cov_wy.
  • Assuming known regression parameters β, the best linear (in y) predictor of w is
  • w* = E w + cov_wy′ V^{-1}(y - E y) = λ′ β + cov_wy′ V^{-1}(y - X β).
  • If (w, y′)′ is multivariate normal, then w* equals E(w | y) and hence is a minimum mean square predictor of w.
  • The predictor w* is also a minimum mean square predictor of w without the assumption of normality. See Appendix 4A.1.

13
BLUPs as predictors
  • To develop the BLUP,
  • define b_GLS = (X′ V^{-1} X)^{-1} X′ V^{-1} y to be the generalized least squares (GLS) estimator of β.
  • This is the best linear unbiased estimator (BLUE).
  • Replace β by b_GLS in the definition of w* to get the BLUP:
  • w_BLUP = λ′ b_GLS + cov_wy′ V^{-1}(y - X b_GLS)
  •        = (λ′ - cov_wy′ V^{-1} X) b_GLS + cov_wy′ V^{-1} y.
  • See Appendix 4A.2 for a check, establishing w_BLUP as the best linear unbiased predictor of w.
  • From Appendix 4A.3, we also have the form for the minimum mean square error:
  • Var(w_BLUP - w) = (λ′ - cov_wy′ V^{-1} X)(X′ V^{-1} X)^{-1}(λ′ - cov_wy′ V^{-1} X)′ - cov_wy′ V^{-1} cov_wy + σ_w².
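As a concrete illustration, here is a minimal NumPy sketch of these two formulas; the design matrix, covariance vector, and responses below are made-up inputs, not data from the chapter.

```python
# A minimal numerical sketch of b_GLS and the BLUP formula on this slide;
# all arrays are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 2
X = rng.normal(size=(N, p))            # design matrix, E y = X beta
V = np.eye(N) + 0.5 * np.ones((N, N))  # a valid (positive definite) Var y
y = rng.normal(size=N)                 # observed responses
lam = np.array([1.0, 0.0])             # lambda, with E w = lambda' beta
cov_wy = 0.3 * np.ones(N)              # Cov(w, y), assumed known

Vinv = np.linalg.inv(V)
# GLS estimator: b_GLS = (X' V^{-1} X)^{-1} X' V^{-1} y
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# BLUP: w_BLUP = lambda' b_GLS + cov_wy' V^{-1} (y - X b_GLS)
w_blup = lam @ b_gls + cov_wy @ Vinv @ (y - X @ b_gls)
print(b_gls, w_blup)
```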

14
Example: One-way model
  • Recall y_it = μ + α_i + ε_it.
  • Thus, y_i = 1_i (μ + α_i) + ε_i, so that X_i = 1_i and V_i = σ_α² J_i + σ_ε² I_i, where J_i is a T_i × T_i matrix of ones and I_i is the T_i × T_i identity matrix.
  • With this, we note that V_i^{-1} = σ_ε^{-2} (I_i - (ζ_i/T_i) J_i), which gives V_i^{-1}(y_i - X_i b_GLS) in closed form.
  • Thus, for predicting w = μ + α_i we have λ = 1 and Cov(w, y_i) = σ_α² 1_i for the ith subject, 0 otherwise. Thus,
  • w_BLUP = ζ_i ȳ_i + (1 - ζ_i) m_α,GLS.

15
Random effects ANOVA model
  • For predicting residuals ε_it we have λ = 0 and Cov(w, y_i) = σ_ε² 1_it for the ith subject, tth time period, 0 otherwise.
  • Let 1_it be a T_i × 1 vector with a 1 in the tth position, 0 otherwise. Thus,
  • e_it,BLUP = σ_ε² 1_it′ V_i^{-1}(y_i - X_i b_GLS)
  • is our BLUP residual.

16
4.4 Mixed model predictors
  • Recall the longitudinal data mixed model
  • y_i = Z_i α_i + X_i β + ε_i.
  • As described in Section 3.3, this is a special case of the mixed linear model. We use
  • V = block diagonal(V_1, ..., V_n), where V_i = Z_i D Z_i′ + R_i, and
  • X = (X_1′, ..., X_n′)′.
  • For BLUP calculations, note that
  • cov_wy = (Cov(w, y_1′), ..., Cov(w, y_n′))′.

17
Longitudinal data mixed model BLUP
  • Recall that the random variable w has mean E w = λ′ β and Var w = σ_w².
  • The BLUP is w_BLUP = λ′ b_GLS + Σ_i Cov(w, y_i′) V_i^{-1}(y_i - X_i b_GLS).
  • The mean square error Var(w_BLUP - w) takes the same form as on the previous slide, with the block diagonal V.

18
BLUP special cases
  • Global parameters and subject-specific effects. Suppose that the interest is in predicting linear combinations of the global parameters β and the subject-specific effects α_i.
  • Consider linear combinations of the form w = c1′ α_i + c2′ β.
  • Residuals. Here, w = ε_it.
  • Forecasts. Suppose that the ith subject is included in the data set; predict y_i,Ti+L for L lead time units in the future.

19
Predicting global parameters and subject-specific
effects
  • Consider linear combinations of the form w = c1′ α_i + c2′ β.
  • Straightforward calculations show that
  • E w = c2′ β, so that λ = c2,
  • Cov(w, y_j′) = c1′ D Z_i′ for j = i, and
  • Cov(w, y_j′) = 0 for j ≠ i.
  • Thus, w_BLUP = c2′ b_GLS + c1′ D Z_i′ V_i^{-1}(y_i - X_i b_GLS).

20
Special case 1
  • Take c2 = 0. Because the mean and variance expressions hold for all vectors c1, we may write this in vector notation to get the BLUP of α_i, the vector
  • a_i,BLUP = D Z_i′ V_i^{-1}(y_i - X_i b_GLS).
  • This is unbiased in the sense that E(a_i,BLUP - α_i) = 0.
  • This predictor has minimum variance among all linear unbiased predictors (BLUP).
  • In the case of the error components model (z_it = 1), this reduces to a_i,BLUP = ζ_i (ȳ_i - x̄_i′ b_GLS), as shown in the sketch below.
  • For comparison, recall the fixed effects parameter estimate, a_i = ȳ_i - x̄_i′ b.
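A minimal sketch of the error components special case, assuming the variance components and b_GLS have already been estimated (function and argument names are illustrative):

```python
# Sketch: a_{i,BLUP} = zeta_i * (ybar_i - xbar_i' b_GLS) for the error
# components model (z_it = 1); inputs are assumed to be precomputed.
import numpy as np

def a_blup_error_components(y_i, X_i, b_gls, sigma2_alpha, sigma2_eps):
    """BLUP of the random intercept a_i for one subject."""
    T_i = len(y_i)
    zeta_i = T_i * sigma2_alpha / (T_i * sigma2_alpha + sigma2_eps)
    return zeta_i * (np.mean(y_i) - np.mean(X_i, axis=0) @ b_gls)
```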

21
Motivating BLUPs
  • We can also motivate BLUPs using normal theory.
  • Consider the case where α_i and ε are multivariate normally distributed.
  • Then, it can be shown that E(α_i | y_i) = D Z_i′ V_i^{-1}(y_i - X_i β).
  • To motivate this, consider asking the question: what realization of α_i could be associated with y_i? The expectation!
  • The BLUP is the BLUE of E(α_i | y_i). (That is, replace β by b_GLS.)

22
Special case 2
  • As another example, it is of interest to predict the conditional mean of the response, z_it′ α_i + x_it′ β.
  • Choose c1 = z_it and c2 = x_it.
  • This yields w_BLUP = z_it′ a_i,BLUP + x_it′ b_GLS.
  • This predictor is of interest in actuarial science, where it is known as the credibility estimator.

23
BLUP Residuals
  • Here, w = ε_it. Because E w = 0, it follows that λ = 0.
  • Straightforward calculations show that
  • Cov(w, y_j′) = σ_ε² 1_it′ for j = i, and
  • Cov(w, y_j′) = 0 for j ≠ i.
  • Here, the symbol 1_it denotes a T_i × 1 vector that has a one in the tth position and is zero otherwise.
  • Thus,
  • e_it,BLUP = σ_ε² 1_it′ V_i^{-1}(y_i - X_i b_GLS).
  • This can also be expressed as e_it,BLUP = y_it - (z_it′ a_i,BLUP + x_it′ b_GLS).
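In code, the second expression is the more convenient one; a minimal sketch, assuming a_i,BLUP and b_GLS have already been computed:

```python
# Sketch: e_{it,BLUP} = y_it - (z_it' a_blup + x_it' b_gls).
import numpy as np

def blup_residual(y_it, z_it, x_it, a_blup, b_gls):
    """BLUP residual for subject i at time t."""
    return y_it - (z_it @ a_blup + x_it @ b_gls)
```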

24
Predicting future observations
  • Suppose that the ith subject is included in the data set; predict y_i,Ti+L for L lead time units in the future.
  • We will assume that x_i,Ti+L and z_i,Ti+L are known.
  • It follows that y_i,Ti+L = z_i,Ti+L′ α_i + x_i,Ti+L′ β + ε_i,Ti+L.
  • Straightforward calculations show that the forecast of y_i,Ti+L is
  • ŷ_i,Ti+L = x_i,Ti+L′ b_GLS + z_i,Ti+L′ a_i,BLUP + Cov(ε_i,Ti+L, ε_i)′ R_i^{-1} e_i,BLUP.
  • Thus, the forecast is the estimate of the conditional mean plus the serial correlation correction factor.

25
Predicting future observations
  • To illustrate, consider the special case where we have autoregressive of order 1 (AR(1)) serially correlated errors.
  • Thus, we have Cov(ε_ir, ε_is) = σ² ρ^|r-s|.
  • After some algebra, the L-step forecast is
  • ŷ_i,Ti+L = x_i,Ti+L′ b_GLS + z_i,Ti+L′ a_i,BLUP + ρ^L e_i,Ti,BLUP, as sketched below.
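A sketch of the resulting forecast rule, assuming ρ, b_GLS, a_i,BLUP, and the last BLUP residual have already been estimated (names are illustrative):

```python
# Sketch: L-step forecast under AR(1) errors,
# yhat_{i,Ti+L} = x' b_gls + z' a_blup + rho^L * e_{i,Ti,BLUP}.
import numpy as np

def forecast_ar1(x_future, z_future, b_gls, a_blup, rho, last_resid, L):
    """Estimated conditional mean plus the serial correlation correction."""
    return x_future @ b_gls + z_future @ a_blup + rho**L * last_resid
```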

26
4.5 Bayesian Inference
  • With Bayesian statistical models, one views both the model parameters and the data as random variables.
  • We assume distributions for each type of random variable.
  • Given the parameters β and α, the response model is y = Z α + X β + ε.
  • Specifically, we assume that the responses y conditional on α and β are normally distributed and that
  • E(y | α, β) = Z α + X β and Var(y | α, β) = R.
  • Assume that α is distributed normally with mean μ_α and variance D and that β is distributed normally with mean μ_β and variance Σ_β, each independent of the other.

27
Distributions
  • The joint distribution of (α′, β′)′ is known as the prior distribution.
  • To summarize, the joint distribution of (α′, β′, y′)′ is multivariate normal, with mean (μ_α′, μ_β′, (Z μ_α + X μ_β)′)′ and with variance-covariance blocks Var α = D, Var β = Σ_β, Var y = V + X Σ_β X′, Cov(α, y′) = D Z′, and Cov(β, y′) = Σ_β X′,
  • where V = R + Z D Z′.

28
Posterior Distribution
  • The distribution of parameters given the data is known as the posterior distribution.
  • The posterior distribution of (α′, β′)′ given y is normal.
  • The conditional moments follow from the usual rules for conditioning in the multivariate normal distribution, recalled below.
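As a reminder, the standard conditioning identity for multivariate normal vectors, applied with u = (α′, β′)′, E y = Z μ_α + X μ_β, and Var y = V + X Σ_β X′:

```latex
% Standard multivariate normal conditioning identity (a known result,
% applied here with u = (alpha', beta')'):
\begin{aligned}
\mathrm{E}(u \mid y)   &= \mathrm{E}\,u + \mathrm{Cov}(u, y)\,\bigl[\mathrm{Var}\,y\bigr]^{-1}\bigl(y - \mathrm{E}\,y\bigr), \\
\mathrm{Var}(u \mid y) &= \mathrm{Var}\,u - \mathrm{Cov}(u, y)\,\bigl[\mathrm{Var}\,y\bigr]^{-1}\,\mathrm{Cov}(y, u).
\end{aligned}
```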

29
Relation with BLUPs
  • In longitudinal data applications, one typically has more information about the global parameters β than about the subject-specific parameters α.
  • Consider first the case Σ_β = 0, so that β = μ_β with probability one.
  • Intuitively, this means that β is precisely known, generally from collateral information.
  • Assuming that μ_α = 0, it is easy to check that the best linear unbiased estimator (BLUE) of E(α | y) is
  • a_BLUP = D Z′ V^{-1}(y - X b_GLS).
  • Recall from equation (4.11) that a_BLUP is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.

30
Relation with BLUPs
  • Consider second the case where Σ_β^{-1} = 0.
  • In this case, prior information about the parameter β is vague; this is known as using a diffuse prior.
  • Assuming μ_α = 0, one can show that
  • E(α | y) = a_BLUP.
  • It is interesting that in both extreme cases, we arrive at the statistic a_BLUP as a predictor of α.
  • This analysis assumes D and R are matrices of fixed parameters.
  • It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for D^{-1} and R^{-1}, as these are conjugate priors.
  • The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.

31
Example: One-way random effects ANOVA model
  • For this model, the posterior means can be computed in closed form.
  • A factor in these expressions measures the precision of knowledge about μ. Specifically, it approaches one as the prior variance σ_μ² → ∞, and approaches zero as σ_μ² → 0.

32
4.6 Wisconsin Lottery Sales
  • T = 40 weeks of sales from n = 50 zip codes

33
Lottery Sales Data Analysis
  • Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier.
  • Multiple time series plots:
  • show the effect of jackpots that is common to all postal codes,
  • show the heterogeneity among postal codes (reaffirmed by a pooling test), and
  • show the heteroscedasticity that is accommodated through a logarithmic transformation.

34
Lottery Sales Model Selection
  • In-sample results show that:
  • One-way error components dominates pooled cross-sectional models.
  • An AR(1) error specification significantly improves the fit.
  • The best model is probably the two-way error components model with an AR(1) error specification (not yet documented).
  • Out-of-sample analysis suggests that:
  • logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.

35
4.7 What is Credibility?
  • Hickman's (1975) analogy:
  • In politics, leaders begin with a reservoir of credibility, which decreases as executive experience is compiled.
  • Insurance behaves in the reverse fashion!
  • Here, credibility increases as experience increases.

36
Credibility Theory
  • Credibility is a technique for predicting future
    expected claims for a risk class, given past
    claims of that and related risk classes.
  • Importance
  • Credibility is widely used for pricing property and casualty, workers' compensation, and health care coverages.
  • According to Rodermund (1989), the concept of credibility has been the casualty actuaries' most important and enduring contribution to casualty actuarial science.

37
History
  • Mowbray (1914, PCAS)
  • asked the question: how extensive an exposure is necessary to give a dependable pure premium?
  • This approach is now known as the limited fluctuation or "American" credibility approach.
  • Question 1: do we have enough exposure to give full weight to the risk class under consideration?
  • Question 2: if not, how can we combine information from this and related risk classes?

38
More History
  • Whitney (1918, PCAS)
  • introduced the idea of using a weighted average of the average claims of (1) a given risk class and (2) all risk classes.
  • The weight is known as the credibility factor.
  • It is of the form
  • New Premium = Z × Claims Experience + (1 - Z) × Old Premium.

39
Example - Balanced Bühlmann
  • Consider the model
  • y_it = μ + α_i + ε_it.
  • The credibility factor is ζ = T / (T + σ_ε²/σ_α²).
  • The traditional credibility estimator is ζ ȳ_i + (1 - ζ) ȳ, as sketched below.
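A one-line sketch of this estimator, using the same shrinkage machinery as Section 4.2 (function and argument names are illustrative):

```python
# Sketch: balanced Buhlmann credibility estimate for risk class i.
def credibility_estimate(ybar_i, ybar_all, T, sigma2_alpha, sigma2_eps):
    zeta = T / (T + sigma2_eps / sigma2_alpha)   # credibility factor
    return zeta * ybar_i + (1 - zeta) * ybar_all
```

With the ANOVA variance estimates from the machine example of Section 4.2 (the towns data below are identical), this reproduces the credibility estimators 11.825, 12.650, and 8.525.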

40
Example: Hypothetical Claims for Three Towns
  • Town   Claims            Average Claim
  •   1    14, 12, 10, 12    ȳ_1 = 12
  •   2    9, 16, 15, 12     ȳ_2 = 13
  •   3    8, 10, 7, 7       ȳ_3 = 8
  • Are there real differences among towns?
  • Mowbray: does Town 3 have enough data to support its own estimator of pure premiums?
  • Whitney: how can I use the information in Towns 1 and 2 to help determine my rate for Town 3?

41
Response to Whitney
  • Known as the shrinkage effect.
  • Comparison of subject-specific means to credibility estimators.

[Figure: the town means ȳ_1 = 12, ȳ_2 = 13, and ȳ_3 = 8 are shrunk toward the overall mean 11, giving credibility estimators 11.825, 12.650, and 8.525.]
42
Why study credibility theory?
  • Long history of applications; a business necessity.
  • More recently, many theoretical advances with fewer innovative applications.
  • Credibility techniques are required in legal statutes and standards of practice:
  • Standard of Practice 25 by the Actuarial Standards Board of the American Academy of Actuaries
  • Wisconsin statutes on credibility for insurance and disability income
  • Advanced techniques are critical for keeping up with competition (health insurance, health economists).
  • Innovative techniques enhance the "credibility" of the profession.