Chapter 4: Prediction and Bayesian Inference
- 4.1 Estimators versus predictors
- 4.2 Prediction for one-way ANOVA models
- Shrinkage estimation, types of predictions
- 4.3 Best linear unbiased predictors (BLUPs)
- 4.4 Mixed model predictors
- 4.5 Bayesian inference
- 4.6 Case study: Forecasting lottery sales
- 4.7 Credibility Theory
- Appendix 4A Linear unbiased predictors
4.1 Estimators versus predictors
- In the longitudinal data model, $y_{it} = z_{it}' \alpha_i + x_{it}' \beta + \varepsilon_{it}$, the variables $\alpha_i$ describe subject-specific effects.
- Given the data $\{y_{it}, z_{it}, x_{it}\}$, in some problems it is of interest to summarize subject effects.
- We have discussed how to estimate fixed, unknown parameters.
- It is also of interest to summarize subject-specific effects, such as those described by the random variable $\alpha_i$.
- Predictors are estimators of random variables.
- Like estimators, predictors are said to be linear if they are formed from a linear combination of the response $y$.
Applications of prediction
- In animal and plant breeding, one wishes to predict the production of milk for cows based on (1) their lineage (random) and (2) herds (fixed).
- In credibility theory, one wishes to predict expected claims for a policyholder given exposure to several risk factors.
- In sample surveys, one wishes to predict the size of a specific age-sex-race cohort within a small geographical area (known as small area estimation).
- In a survey article, Robinson (1991) also cites (1) ore reserve estimation in geological surveys, (2) measuring the quality of a production plan, and (3) ranking baseball players' abilities.
4.2 Prediction for one-way ANOVA models
- Consider the traditional one-way random effects ANOVA (analysis of variance) model: $y_{it} = \mu + \alpha_i + \varepsilon_{it}$.
- Suppose that we wish to summarize the subject-specific conditional mean, $\mu + \alpha_i$.
- For contrast, first consider using the fixed effects model with $\mu = 0$.
- Here, we have that the subject mean $\bar{y}_i$ is the best (Gauss-Markov) estimate of $\alpha_i$.
- This estimate is unbiased, that is, $E\,\bar{y}_i = \alpha_i$.
- This estimate has minimum variance among all linear unbiased estimators (it is the BLUE).
Shrinkage estimator
- Use the one-way random effects model.
- Consider an estimator of $\mu + \alpha_i$ that is a linear combination of the subject mean $\bar{y}_i$ and the overall mean $\bar{y}$, that is, $c_1 \bar{y}_i + c_2 \bar{y}$, for constants $c_1$ and $c_2$.
- Calculations show that the best values of $c_1$ and $c_2$ that minimize the mean square error satisfy $c_2 = 1 - c_1$.
- For large $n$, we have the shrinkage estimator, or predictor, of $\mu + \alpha_i$: $\bar{y}_{i,s} = \zeta_i \bar{y}_i + (1 - \zeta_i)\,\bar{y}$, where $\zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$. (A numerical illustration appears after Figure 4.1 below.)
Example of shrinkage estimator: Hypothetical Run Times for Three Machines

  Machine   Run Times        Average run time
  1         14, 12, 10, 12   $\bar{y}_1 = 12$
  2         9, 16, 15, 12    $\bar{y}_2 = 13$
  3         8, 10, 7, 7      $\bar{y}_3 = 8$

- Notation: $y_{ij}$ means the $j$th run from the $i$th machine. For example, $y_{21} = 9$ and $y_{23} = 15$.
- Are there real differences among machines?
Example - Continued
- To see the shrinkage effect, consider Figure 4.1: Comparison of Subject-Specific Means to Shrinkage Estimators.
[Figure 4.1: the subject means 12, 13, and 8 are shrunk toward the overall mean 11, giving shrinkage estimators 11.825, 12.650, and 8.525.]
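To make the computation concrete, here is a minimal Python sketch that reproduces these values from the run-time data. It assumes the standard ANOVA (method-of-moments) estimates of the variance components; the variable names are illustrative only.

```python
import numpy as np

# Run times for the three machines (Section 4.2 example).
machines = [np.array([14., 12., 10., 12.]),
            np.array([9., 16., 15., 12.]),
            np.array([8., 10., 7., 7.])]
T, n = 4, len(machines)                              # balanced design: T_i = 4

means = np.array([m.mean() for m in machines])       # subject means: 12, 13, 8
grand = np.concatenate(machines).mean()              # overall mean: 11

# Method-of-moments (ANOVA) variance component estimates.
ssw = sum(((m - mi) ** 2).sum() for m, mi in zip(machines, means))
sigma2_eps = ssw / (n * T - n)                       # within variance, about 4.889
msb = T * ((means - grand) ** 2).sum() / (n - 1)     # between mean square, 28
sigma2_alpha = (msb - sigma2_eps) / T                # between variance, about 5.778

# Shrinkage factor and shrinkage estimators of mu + alpha_i.
zeta = T * sigma2_alpha / (T * sigma2_alpha + sigma2_eps)   # about 0.825
print(zeta, zeta * means + (1 - zeta) * grand)
# -> 0.825..., approximately [11.825, 12.651, 8.524]
```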
More on shrinkage estimators
- Under the random effects model, $\bar{y}_i$ is an unbiased predictor of $\mu + \alpha_i$ in the sense that $E(\bar{y}_i - (\mu + \alpha_i)) = 0$.
- However, $\bar{y}_i$ is inefficient in the sense that the shrinkage estimator $\bar{y}_{i,s}$ has a smaller mean square error than $\bar{y}_i$.
- Here, $\bar{y}_i$ has been shrunk towards the stable estimator $\bar{y}$.
- The estimator $\bar{y}_{i,s}$ is said to "borrow strength" from the stable estimator $\bar{y}$.
- Recall $\zeta_i = \frac{T_i \sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}$.
- Note that $\zeta_i \to 1$ as either (i) $T_i \to \infty$ or (ii) $\sigma_\alpha^2 / \sigma_\varepsilon^2 \to \infty$.
Best predictors
- From Section 3.1, it is easy to check that the generalized least squares estimator of $\mu$ is $\mu_{GLS} = \left(\sum_i \zeta_i\right)^{-1} \sum_i \zeta_i \bar{y}_i$.
- The linear predictor of $\mu + \alpha_i$ that has minimum variance is $\zeta_i \bar{y}_i + (1 - \zeta_i)\,\mu_{GLS}$.
- Here, the acronym BLUP stands for best linear unbiased predictor.
Types of Predictors
- We have now introduced the BLUP of $\mu + \alpha_i$. This quantity is a linear combination of global parameters and subject-specific effects.
- Two other types of predictors are of interest.
- Residuals. Here, we wish to predict $\varepsilon_{it}$. The BLUP residual turns out to be $e_{it,BLUP} = y_{it} - \left(\zeta_i \bar{y}_i + (1 - \zeta_i)\,\mu_{GLS}\right)$.
- Forecasts. Here, we wish to predict $y_{i,T_i+L}$, for $L$ lead time units into the future.
- Without serial correlation, the forecast is the same as the predictor of $\mu + \alpha_i$. However, we will see that its mean square error turns out to be larger.
4.3 Best linear unbiased predictors
- This section develops best linear unbiased predictors in the context of mixed linear models, then specializes the consideration to longitudinal data mixed models.
- BLUPs are developed by examining the minimum mean square error predictor of a random variable, $w$.
- We give a development due to Harville (1976).
- The argument is originally due to Goldberger (1962), who coined the phrase "best linear unbiased predictor."
- The acronym was first used by Henderson (1973).
- BLUPs can also be developed as conditional expectations using multivariate normality.
- BLUPs can also be developed in a Bayesian context.
Mixed linear models
- Suppose that we observe an $N \times 1$ random vector $y$ with mean $E\,y = X\beta$ and variance $\text{Var}\,y = V$.
- We wish to predict a random variable $w$ that has mean $E\,w = \lambda'\beta$ and $\text{Var}\,w = \sigma_w^2$.
- Denote the covariance between $w$ and $y$ as $\text{cov}_{wy} = \text{Cov}(w, y)$.
- Assuming known regression parameters $\beta$, the best linear (in $y$) predictor of $w$ is $w^* = E\,w + \text{cov}_{wy}' V^{-1}(y - E\,y) = \lambda'\beta + \text{cov}_{wy}' V^{-1}(y - X\beta)$.
- If $(w, y)$ are multivariate normal, then $w^*$ equals $E(w \mid y)$ and hence is a minimum mean square predictor of $w$.
- The predictor $w^*$ is also a minimum mean square predictor of $w$ without the assumption of normality. See Appendix 4A.1.
BLUPs as predictors
- To develop the BLUP:
- Define $b_{GLS} = (X' V^{-1} X)^{-1} X' V^{-1} y$, the generalized least squares (GLS) estimator of $\beta$. This is the best linear unbiased estimator (BLUE).
- Replace $\beta$ by $b_{GLS}$ in the definition of $w^*$ to get the BLUP:
  $w_{BLUP} = \lambda' b_{GLS} + \text{cov}_{wy}' V^{-1}(y - X b_{GLS}) = (\lambda' - \text{cov}_{wy}' V^{-1} X)\, b_{GLS} + \text{cov}_{wy}' V^{-1} y$.
- See Appendix 4A.2 for a check establishing $w_{BLUP}$ as the best linear unbiased predictor of $w$.
- From Appendix 4A.3, we also have the form of the minimum mean square error:
  $\text{Var}(w_{BLUP} - w) = (\lambda' - \text{cov}_{wy}' V^{-1} X)(X' V^{-1} X)^{-1}(\lambda' - \text{cov}_{wy}' V^{-1} X)' - \text{cov}_{wy}' V^{-1} \text{cov}_{wy} + \sigma_w^2$.
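The defining formulas translate directly into matrix code. The sketch below is an illustration, not library code: `blup` implements $w_{BLUP} = \lambda' b_{GLS} + \text{cov}_{wy}' V^{-1}(y - X b_{GLS})$, and the usage example predicts $w = \mu + \alpha_1$ for the machine data of Section 4.2, with the variance components fixed at the values estimated earlier (an assumption for illustration).

```python
import numpy as np

def blup(y, X, V, lam, cov_wy):
    """BLUP of w with E w = lam' beta, where y has mean X beta and variance V,
    and cov_wy = Cov(w, y).  Returns lam' b_GLS + cov_wy' V^{-1} (y - X b_GLS)."""
    Vinv = np.linalg.inv(V)
    b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # GLS estimator (BLUE)
    return lam @ b_gls + cov_wy @ Vinv @ (y - X @ b_gls)

# Illustration: one-way model for the machine data, variances assumed known.
s2a, s2e, T, n = 5.778, 4.889, 4, 3
y = np.array([14., 12., 10., 12., 9., 16., 15., 12., 8., 10., 7., 7.])
X = np.ones((n * T, 1))
V = s2e * np.eye(n * T)
for i in range(n):                                  # V_i = s2a J + s2e I per subject
    V[i*T:(i+1)*T, i*T:(i+1)*T] += s2a
lam = np.array([1.0])                               # w = mu + alpha_1, so lambda = 1
cov_wy = np.zeros(n * T); cov_wy[:T] = s2a          # Cov(w, y) = s2a within subject 1
print(blup(y, X, V, lam, cov_wy))                   # about 11.825
```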
Example: One-way model
- Recall $y_{it} = \mu + \alpha_i + \varepsilon_{it}$.
- Thus, $y_i = \mathbf{1}_i(\mu + \alpha_i) + \varepsilon_i$, so that $X_i = \mathbf{1}_i$ and $V_i = \sigma_\alpha^2 J_i + \sigma_\varepsilon^2 I_i$.
- With this, we can compute $V_i^{-1}(y_i - X_i b_{GLS})$ explicitly (see the derivation below).
- For predicting $w = \mu + \alpha_i$, we have $\lambda = 1$ and $\text{Cov}(w, y_i) = \sigma_\alpha^2 \mathbf{1}_i$ for the $i$th subject, 0 otherwise. Thus, $w_{BLUP} = \zeta_i \bar{y}_i + (1 - \zeta_i)\,\mu_{GLS}$.
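The matrix algebra behind this step is worth recording. Using the standard inverse of a compound symmetric matrix,

$$V_i^{-1} = \frac{1}{\sigma_\varepsilon^2}\left( I_i - \frac{\sigma_\alpha^2}{T_i \sigma_\alpha^2 + \sigma_\varepsilon^2}\, J_i \right),$$

one finds

$$w_{BLUP} = \mu_{GLS} + \sigma_\alpha^2\, \mathbf{1}_i' V_i^{-1} (y_i - \mathbf{1}_i\, \mu_{GLS}) = \zeta_i\, \bar{y}_i + (1 - \zeta_i)\, \mu_{GLS},$$

recovering the shrinkage form of Section 4.2.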
Random effects ANOVA model
- For predicting residuals $\varepsilon_{it}$, we have $\lambda = 0$ and $\text{Cov}(w, y_i) = \sigma_\varepsilon^2 \mathbf{1}_{it}$ for the $i$th subject and $t$th time period, 0 otherwise.
- Let $\mathbf{1}_{it}$ be a $T_i \times 1$ vector with a 1 in the $t$th position and 0 otherwise.
- Thus, $e_{it,BLUP} = \sigma_\varepsilon^2 \mathbf{1}_{it}' V_i^{-1}(y_i - \mathbf{1}_i\, \mu_{GLS})$ is our BLUP residual.
4.4 Mixed model predictors
- Recall the longitudinal data mixed model: $y_i = Z_i \alpha_i + X_i \beta + \varepsilon_i$.
- As described in Section 3.3, this is a special case of the mixed linear model. We use $V = \text{block diagonal}(V_1, \ldots, V_n)$, where $V_i = Z_i D Z_i' + R_i$, and $X = (X_1', \ldots, X_n')'$.
- For BLUP calculations, note that $\text{cov}_{wy} = \left(\text{Cov}(w, y_1'), \ldots, \text{Cov}(w, y_n')\right)'$.
Longitudinal data mixed model BLUP
- Recall that the random variable $w$ has mean $E\,w = \lambda'\beta$ and $\text{Var}\,w = \sigma_w^2$.
- The BLUP is $w_{BLUP} = \lambda' b_{GLS} + \sum_{i=1}^n \text{Cov}(w, y_i')\, V_i^{-1} (y_i - X_i b_{GLS})$.
- The mean square error is $\text{Var}(w_{BLUP} - w) = \left(\lambda' - \sum_i \text{Cov}(w, y_i') V_i^{-1} X_i\right) \left(\sum_i X_i' V_i^{-1} X_i\right)^{-1} \left(\lambda' - \sum_i \text{Cov}(w, y_i') V_i^{-1} X_i\right)' - \sum_i \text{Cov}(w, y_i') V_i^{-1} \text{Cov}(w, y_i')' + \sigma_w^2$.
BLUP special cases
- Global parameters and subject-specific effects. Suppose that the interest is in predicting linear combinations of the global parameters $\beta$ and the subject-specific effects $\alpha_i$, that is, combinations of the form $w = c_1' \alpha_i + c_2' \beta$.
- Residuals. Here, $w = \varepsilon_{it}$.
- Forecasts. Suppose that the $i$th subject is included in the data set; predict $y_{i,T_i+L}$ for $L$ lead time units in the future.
Predicting global parameters and subject-specific effects
- Consider linear combinations of the form $w = c_1' \alpha_i + c_2' \beta$.
- Straightforward calculations show that $E\,w = c_2' \beta$, so that $\lambda = c_2$; $\text{Cov}(w, y_j) = c_1' D Z_i'$ for $j = i$; and $\text{Cov}(w, y_j) = 0$ for $j \neq i$.
- Thus, $w_{BLUP} = c_2' b_{GLS} + c_1' D Z_i' V_i^{-1} (y_i - X_i b_{GLS})$.
Special case 1
- Take $c_2 = 0$. Because the mean and variance expressions hold for all vectors $c_1$, we may write the result in vector notation to get the BLUP of the vector $\alpha_i$: $a_{i,BLUP} = D Z_i' V_i^{-1} (y_i - X_i b_{GLS})$.
- This is unbiased in the sense that $E(a_{i,BLUP} - \alpha_i) = 0$.
- This predictor has minimum variance among all linear unbiased predictors (it is the BLUP).
- In the case of the error components model ($z_{it} = 1$), this reduces to $a_{i,BLUP} = \zeta_i (\bar{y}_i - \bar{x}_i' b_{GLS})$.
- For comparison, recall the fixed effects parameter estimate, $a_i = \bar{y}_i - \bar{x}_i' b$.
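For the machine data of Section 4.2 (error components with no covariates, so that $b_{GLS} = \mu_{GLS} = \bar{y} = 11$ and $\zeta_i \approx 0.825$), the comparison works out as

$$a_{i,BLUP} = \zeta_i(\bar{y}_i - \bar{y}) \approx (0.825,\ 1.650,\ -2.475),$$

so that $\mu_{GLS} + a_{i,BLUP}$ reproduces the shrinkage estimates 11.825, 12.650, 8.525 of Figure 4.1, while the subject-specific means $\bar{y}_i = (12, 13, 8)$ are left unshrunk.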
Motivating BLUPs
- We can also motivate BLUPs using normal theory.
- Consider the case where $\alpha_i$ and $\varepsilon$ are multivariate normally distributed.
- Then, it can be shown that $E(\alpha_i \mid y_i) = D Z_i' V_i^{-1} (y_i - X_i \beta)$.
- To motivate this, ask: what realization of $\alpha_i$ could be associated with $y_i$? The conditional expectation!
- The BLUP is the BLUE of $E(\alpha_i \mid y_i)$. (That is, replace $\beta$ by $b_{GLS}$.)
Special case 2
- As another example, it is of interest to predict the conditional mean of the next response, $E(y_{i,T_i+1} \mid \alpha_i) = z_{i,T_i+1}' \alpha_i + x_{i,T_i+1}' \beta$.
- Choose $c_1 = z_{i,T_i+1}$ and $c_2 = x_{i,T_i+1}$.
- This yields $w_{BLUP} = x_{i,T_i+1}' b_{GLS} + z_{i,T_i+1}' a_{i,BLUP}$.
- This predictor is of interest in actuarial science, where it is known as the credibility estimator.
BLUP Residuals
- Here, $w = \varepsilon_{it}$. Because $E\,w = 0$, it follows that $\lambda = 0$.
- Straightforward calculations show that $\text{Cov}(w, y_j) = \sigma_\varepsilon^2 \mathbf{1}_{it}'$ for $j = i$ and $\text{Cov}(w, y_j) = 0$ for $j \neq i$.
- Here, the symbol $\mathbf{1}_{it}$ denotes a $T_i \times 1$ vector that has a one in the $t$th position and is zero otherwise.
- Thus, $e_{it,BLUP} = \sigma_\varepsilon^2 \mathbf{1}_{it}' V_i^{-1} (y_i - X_i b_{GLS})$.
- This can also be expressed as $e_{it,BLUP} = y_{it} - (z_{it}' a_{i,BLUP} + x_{it}' b_{GLS})$.
Predicting future observations
- Suppose that the $i$th subject is included in the data set; predict $y_{i,T_i+L} = z_{i,T_i+L}' \alpha_i + x_{i,T_i+L}' \beta + \varepsilon_{i,T_i+L}$ for $L$ lead time units in the future.
- We will assume that $z_{i,T_i+L}$ and $x_{i,T_i+L}$ are known.
- It follows that $\lambda = x_{i,T_i+L}$.
- Straightforward calculations show that $\text{Cov}(y_{i,T_i+L}, y_i) = z_{i,T_i+L}' D Z_i' + \text{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)$.
- Thus, the forecast of $y_{i,T_i+L}$ is $\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{i,T_i+L}' a_{i,BLUP} + \text{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_i)\, R_i^{-1} e_{i,BLUP}$.
- Thus, the forecast is the estimate of the conditional mean plus a serial correlation correction factor.
Predicting future observations
- To illustrate, consider the special case of serially correlated errors that are autoregressive of order 1 (AR(1)).
- Thus, we have $\text{Cov}(\varepsilon_{i,T_i+L}, \varepsilon_{it}) = \sigma^2 \rho^{T_i + L - t}$.
- After some algebra, the $L$-step forecast is $\hat{y}_{i,T_i+L} = x_{i,T_i+L}' b_{GLS} + z_{i,T_i+L}' a_{i,BLUP} + \rho^L e_{i,T_i,BLUP}$.
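A sketch of this forecast in Python follows. All inputs (the design vectors, $b_{GLS}$, $a_{i,BLUP}$, $\rho$, and the last BLUP residual) are assumed to have been computed already, and the numbers are purely illustrative.

```python
import numpy as np

def ar1_forecast(x_f, z_f, b_gls, a_blup, rho, e_last, L):
    """L-step forecast under AR(1) errors: the estimated conditional mean
    x_f' b_GLS + z_f' a_BLUP plus the serial correlation correction
    rho**L times the BLUP residual at time T_i."""
    return x_f @ b_gls + z_f @ a_blup + rho**L * e_last

# Illustrative inputs only.
print(ar1_forecast(x_f=np.array([1.0, 0.5]), z_f=np.array([1.0]),
                   b_gls=np.array([10.0, 2.0]), a_blup=np.array([0.8]),
                   rho=0.6, e_last=1.5, L=2))
# 11.0 + 0.8 + 0.36 * 1.5 = 12.34
```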
4.5 Bayesian Inference
- With Bayesian statistical models, one views both the model parameters and the data as random variables.
- We assume distributions for each type of random variable.
- Given the parameters $\beta$ and $\alpha$, the response model is specified by its conditional moments. Specifically, we assume that the responses $y$, conditional on $\alpha$ and $\beta$, are normally distributed and that $E(y \mid \alpha, \beta) = Z\alpha + X\beta$ and $\text{Var}(y \mid \alpha, \beta) = R$.
- Assume that $\alpha$ is distributed normally with mean $\mu_\alpha$ and variance $D$, and that $\beta$ is distributed normally with mean $\mu_\beta$ and variance $\Sigma_\beta$, each independent of the other.
Distributions
- The joint distribution of $(\alpha', \beta')'$ is known as the prior distribution.
- To summarize, the joint distribution of $(\alpha', \beta', y')'$ is multivariate normal; in particular, $\text{Var}\,y = V + X \Sigma_\beta X'$, where $V = R + Z D Z'$.
Posterior Distribution
- The distribution of the parameters given the data is known as the posterior distribution.
- The posterior distribution of $(\alpha', \beta')'$ given $y$ is normal.
- The conditional moments follow from the usual multivariate normal conditioning formulas: with $\theta = (\alpha', \beta')'$, $E(\theta \mid y) = E\,\theta + \text{Cov}(\theta, y)(\text{Var}\,y)^{-1}(y - E\,y)$ and $\text{Var}(\theta \mid y) = \text{Var}\,\theta - \text{Cov}(\theta, y)(\text{Var}\,y)^{-1}\text{Cov}(\theta, y)'$.
Relation with BLUPs
- In longitudinal data applications, one typically has more information about the global parameters $\beta$ than about the subject-specific parameters $\alpha$.
- Consider first the case $\Sigma_\beta = 0$, so that $\beta = \mu_\beta$ with probability one.
- Intuitively, this means that $\beta$ is precisely known, generally from collateral information.
- Assuming that $\mu_\alpha = 0$, it is easy to check that the best linear unbiased estimator (BLUE) of $E(\alpha \mid y)$ is $a_{BLUP} = D Z' V^{-1}(y - X b_{GLS})$.
- Recall from equation (4.11) that $a_{BLUP}$ is also the best linear unbiased predictor in the frequentist (non-Bayesian) model framework.
Relation with BLUPs
- Consider second the case where $\Sigma_\beta^{-1} = 0$.
- In this case, prior information about the parameter $\beta$ is vague; this is known as using a diffuse prior.
- Assuming $\mu_\alpha = 0$, one can show that $E(\alpha \mid y) = a_{BLUP}$.
- It is interesting that in both extreme cases, we arrive at the statistic $a_{BLUP}$ as a predictor of $\alpha$.
- This analysis assumes that $D$ and $R$ are matrices of fixed parameters.
- It is also possible to assume distributions for these parameters; typically, independent Wishart distributions are used for $D^{-1}$ and $R^{-1}$, as these are conjugate priors.
- The general strategy of substituting point estimates for certain parameters in a posterior distribution is called empirical Bayes estimation.
Example: One-way random effects ANOVA model
- The posterior means turn out to be credibility-type weighted averages of the subject means and the prior mean of $\mu$, with weights determined by a factor $\zeta_\mu$.
- Note that $\zeta_\mu$ measures the precision of knowledge about $\mu$. Specifically, we see that $\zeta_\mu$ approaches one as $\sigma_\mu^2 \to \infty$, and approaches zero as $\sigma_\mu^2 \to 0$.
4.6 Case Study: Wisconsin Lottery Sales
- $T = 40$ weeks of sales from $n = 50$ ZIP codes.
Lottery Sales Data Analysis
- Cross-sectional analysis shows that population size heavily influences sales, with Kenosha as an outlier.
- Multiple time series plots:
  - show the effect of jackpots that is common to all postal codes
  - show the heterogeneity among postal codes (reaffirmed by a pooling test)
  - show the heteroscedasticity that is accommodated through a logarithmic transformation
Lottery Sales Model Selection
- In-sample results show that:
  - the one-way error components model dominates pooled cross-sectional models
  - an AR(1) error specification significantly improves the fit
  - the best model is probably the two-way error components model with an AR(1) error specification (not yet documented)
- Out-of-sample analysis suggests that logarithmic sales is the preferred choice of response; it outperforms sales and percentage change.
4.7 What is Credibility?
- Hickman's (1975) analogy:
- In politics, leaders begin with a reservoir of credibility, which decreases as executive experience is compiled.
- Insurance behaves in the reverse fashion! Here, credibility increases as experience increases.
Credibility Theory
- Credibility is a technique for predicting future expected claims for a risk class, given past claims of that and related risk classes.
- Importance:
  - Credibility is widely used for pricing property and casualty, workers' compensation, and health care coverages.
  - According to Rodermund (1989), the concept of credibility has been the casualty actuaries' most important and enduring contribution to casualty actuarial science.
History
- Mowbray (1914, PCAS) asked the question: how extensive an exposure is necessary to give a dependable pure premium?
- This approach is now known as limited fluctuation, or "American," credibility.
- Question 1: do we have enough exposure to give full weight to the risk class under consideration?
- Question 2: if not, how can we combine information from this and related risk classes?
More History
- Whitney (1918, PCAS) introduced the idea of using a weighted average of the average claims of (1) a given risk class and (2) all risk classes.
- The weight $Z$ is known as the credibility factor. The premium is of the form: New Premium = $Z$ × Claims Experience + $(1 - Z)$ × Old Premium.
Example: Balanced Bühlmann
- Consider the model $y_{it} = \mu + \alpha_i + \varepsilon_{it}$.
- The credibility factor is $Z = \frac{T}{T + \sigma_\varepsilon^2 / \sigma_\alpha^2}$.
- The traditional credibility estimator is $Z \bar{y}_i + (1 - Z)\,\bar{y}$.
Example: Hypothetical Claims for Three Towns
  Town   Claims           Average claim
  1      14, 12, 10, 12   $\bar{y}_1 = 12$
  2      9, 16, 15, 12    $\bar{y}_2 = 13$
  3      8, 10, 7, 7      $\bar{y}_3 = 8$
- Are there real differences among towns?
- Mowbray: does Town 3 have enough data to support its own estimator of pure premiums?
- Whitney: how can I use the information in Towns 1 and 2 to help determine my rate for Town 3? (A worked calculation follows.)
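The claims data are the same numbers as the run-time data of Section 4.2, so the method-of-moments variance estimates computed there ($\hat\sigma_\varepsilon^2 \approx 4.889$, $\hat\sigma_\alpha^2 \approx 5.778$, an assumption for illustration) carry over, giving

$$Z = \frac{T}{T + \hat\sigma_\varepsilon^2/\hat\sigma_\alpha^2} = \frac{4}{4 + 4.889/5.778} \approx 0.825,$$

so the credibility estimators $Z \bar{y}_i + (1 - Z)\bar{y}$ are approximately 11.825, 12.650, and 8.525 for Towns 1, 2, and 3.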
Response to Whitney
- Known as the shrinkage effect.
- Figure: Comparison of Subject-Specific Means to Credibility Estimators.
[Figure: the town means 12, 13, and 8 are shrunk toward the overall mean 11, giving credibility estimators 11.825, 12.650, and 8.525.]
Why study credibility theory?
- Long history of applications; a business necessity.
- More recently, many theoretical advances with fewer innovative applications.
- Credibility techniques are required in legal statutes and standards of practice:
  - Standard of Practice No. 25 of the Actuarial Standards Board of the American Academy of Actuaries
  - Wisconsin statutes on credibility (insurance and disability income)
- Advanced techniques are critical for keeping up with the competition (health insurance, health economists).
- Innovative techniques enhance the "credibility" of the profession.