Title: Additional Topics in Prediction Methodology
1. Additional Topics in Prediction Methodology
2. Introduction
- The predictive distribution for the random variable Y_0 is meant to capture all the information about Y_0 that is contained in the training data Y^n.
- It does not completely specify Y_0, but it does provide a probability distribution over more likely and less likely values of Y_0.
- E[Y_0 | Y^n] is the best MSPE (minimum mean squared prediction error) predictor of Y_0.
3. Hierarchical models have two stages
- X ⊂ R^d (the input space)
- f_0 = f(x_0): known p × 1 vector of regressors at x_0
- F = (f_j(x_i)): known n × p matrix of regressors at the training inputs
- β: unknown p × 1 vector of regression coefficients
- R = (R(x_i − x_j)): known n × n matrix of correlations among the training data Y^n
- r_0 = (R(x_i − x_0)): known n × 1 vector of correlations of Y_0 with Y^n (a sketch building these quantities follows below)
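To make these quantities concrete, here is a minimal Python/NumPy sketch for a one-dimensional input space. The constant-mean regression (f(x) ≡ 1), the Gaussian correlation function, and the value of theta are illustrative assumptions, not choices made on the slides.

```python
import numpy as np

def gauss_corr(x1, x2, theta=4.0):
    """Gaussian (squared-exponential) correlation R(h) = exp(-theta * h^2)."""
    return np.exp(-theta * (x1 - x2) ** 2)

# Training inputs x_1, ..., x_n and a prediction point x_0 (illustrative values)
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # n = 5 design points
x0 = 0.6                                    # new input
n = len(x)

# Regression: constant mean, so f(x) = 1 and p = 1
F = np.ones((n, 1))          # n x p matrix of regressors f_j(x_i)
f0 = np.ones((1, 1))         # p x 1 vector f(x_0)

# Correlations among the training data and between Y_0 and Y^n
R = gauss_corr(x[:, None], x[None, :])      # n x n matrix R(x_i - x_j)
r0 = gauss_corr(x, x0)[:, None]             # n x 1 vector R(x_i - x_0)

print("R shape:", R.shape, " r0 shape:", r0.shape)
```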
4. Predictive Distributions when σ_Z², R and r_0 are known
5. (No transcript)
6. Interesting features of (a) and (b)
- The non-informative prior is the limit of the normal prior as the prior variance tends to infinity.
- While the non-informative prior is not a proper distribution, the corresponding predictive distribution is proper.
- The same conditioning argument can be applied to derive the posterior mean for both the non-informative prior and the normal prior.
7. The mean and variance of the predictive distribution (mean)
- μ_0n(x_0) and σ²_0n(x_0) depend on x_0 only through the regression vector f_0 and the correlation vector r_0.
- μ_0n(x_0) is a linear unbiased predictor of Y(x_0).
- The continuity and other smoothness properties of μ_0n(x_0) are inherited from the correlation function R(·) and the regressors f_j(·), j = 1, …, p.
8. (continued)
- μ_0n(x_0) depends on the parameters σ_Z² and τ² only through their ratio.
- μ_0n(x_0) interpolates the training data: when x_0 = x_i, f_0 = f(x_i) and r_0^T R^{-1} = e_i^T, the i-th unit vector, so the predictor reproduces the observed value (see the sketch below).
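The formulas behind the posterior mean sit on untranscribed slides; under the non-informative prior it reduces to the BLUP form μ_0n(x_0) = f_0^T β̂ + r_0^T R^{-1}(Y^n − F β̂). A minimal sketch under the same illustrative constant-mean/Gaussian-correlation assumptions as above, checking the interpolation property:

```python
import numpy as np

def gauss_corr(x1, x2, theta=4.0):
    return np.exp(-theta * (x1 - x2) ** 2)

def blup(x, y, x0, theta=4.0):
    """Posterior mean mu_0n(x0) = f0' bhat + r0' R^{-1} (y - F bhat), constant mean."""
    n = len(x)
    F = np.ones((n, 1))
    f0 = np.ones(1)
    R = gauss_corr(x[:, None], x[None, :], theta)
    r0 = gauss_corr(x, x0, theta)
    Rinv = np.linalg.inv(R)
    bhat = np.linalg.solve(F.T @ Rinv @ F, F.T @ Rinv @ y)   # GLS estimate of beta
    return float(f0 @ bhat + r0 @ Rinv @ (y - F @ bhat))

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * x)                    # made-up training responses

print(blup(x, y, 0.6))                       # prediction at a new input
print(blup(x, y, 0.5), y[2])                 # interpolation: matches the third observed value
```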
9. (No transcript)
10. The mean and variance of the predictive distribution (variance)
- MSPE(μ_0n(x_0)) = σ²_0n(x_0).
- The variance of the posterior of Y(x_0) given Y^n should be 0 whenever x_0 = x_i, and indeed σ²_0n(x_i) = 0 at every training point (see the sketch below).
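For the non-informative prior on β with known σ_Z², the predictive variance takes the universal-kriging form σ²_0n(x_0) = σ_Z² [1 − r_0^T R^{-1} r_0 + h^T (F^T R^{-1} F)^{-1} h] with h = f_0 − F^T R^{-1} r_0; that specific form is an assumption here, since the slide carrying the formula was not transcribed. A sketch verifying that it vanishes at the training points:

```python
import numpy as np

def gauss_corr(x1, x2, theta=4.0):
    return np.exp(-theta * (x1 - x2) ** 2)

def pred_var(x, x0, sigma2_z=1.0, theta=4.0):
    """sigma^2_0n(x0) for the non-informative prior on beta, constant mean."""
    n = len(x)
    F = np.ones((n, 1))
    f0 = np.ones(1)
    R = gauss_corr(x[:, None], x[None, :], theta)
    r0 = gauss_corr(x, x0, theta)
    Rinv = np.linalg.inv(R)
    h = f0 - F.T @ Rinv @ r0                       # accounts for estimating beta
    quad = h @ np.linalg.solve(F.T @ Rinv @ F, h)
    return float(sigma2_z * (1.0 - r0 @ Rinv @ r0 + quad))

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(pred_var(x, 0.6))    # positive between design points
print(pred_var(x, 0.5))    # (numerically) zero at a training point
```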
11. Most important use of Theorem 4.1.1
12. Predictive Distributions when R and r_0 are known
The posterior is a location-shifted and scaled univariate t distribution whose degrees of freedom are increased when there is informative prior information for either β or σ_Z².
13. (No transcript)
14. (No transcript)
15. Degrees of freedom
- Base value for the degrees of freedom: ν_i = n − p.
- p additional degrees of freedom when the prior on β is informative.
- ν_0 additional degrees of freedom when the prior on σ_Z² is informative.
16. Location shift
The centering value is the same as in Theorem 4.1.1 (known σ_Z²). The non-informative prior gives the BLUP.
17. Scale factor σ_i²(x_0) (compare (4.1.15) with (4.1.6))
- σ_i²(x_0) is an estimate of the scale factor σ²_0n(x_0).
- Q_i²/ν_i estimates σ_Z².
- Q_i² combines information about σ_Z² from the conditional distribution of Y^n given σ_Z² with information from the prior on σ_Z².
- σ_i²(x_i) = 0 when x_i is any of the training data points (see the sketch below).
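A sketch of the non-informative-prior case described above: ν = n − p degrees of freedom, Q²/ν as the estimate of σ_Z², and a t-based prediction interval centered at the BLUP. The constant mean, Gaussian correlation, data, and the exact scale formula are illustrative assumptions rather than content taken from the slides.

```python
import numpy as np
from scipy.stats import t as student_t

def gauss_corr(x1, x2, theta=4.0):
    return np.exp(-theta * (x1 - x2) ** 2)

def t_predict(x, y, x0, theta=4.0, level=0.95):
    """Center and interval of the t predictive distribution
    (non-informative priors on beta and sigma_Z^2, constant mean)."""
    n, p = len(x), 1
    F = np.ones((n, p))
    f0 = np.ones(p)
    R = gauss_corr(x[:, None], x[None, :], theta)
    r0 = gauss_corr(x, x0, theta)
    Rinv = np.linalg.inv(R)
    FtRi = F.T @ Rinv
    bhat = np.linalg.solve(FtRi @ F, FtRi @ y)
    resid = y - F @ bhat
    nu = n - p                                   # base degrees of freedom
    s2 = (resid @ Rinv @ resid) / nu             # Q^2 / nu, estimate of sigma_Z^2
    mu = float(f0 @ bhat + r0 @ Rinv @ resid)    # same centering value as the BLUP
    h = f0 - FtRi @ r0
    scale2 = s2 * (1.0 - r0 @ Rinv @ r0 + h @ np.linalg.solve(FtRi @ F, h))
    half = student_t.ppf(0.5 + level / 2, nu) * np.sqrt(max(scale2, 0.0))
    return mu, mu - half, mu + half

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * x)
print(t_predict(x, y, 0.6))   # point prediction and 95% prediction interval
```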
18. Predictive Distributions when Correlation Parameters are Unknown
- What if the correlations among the observations are unknown (R and r_0 are unknown)?
- Assume y(·) has a Gaussian prior with correlation function R(· | θ), where θ is an unknown vector of parameters.
- Two issues:
- The standard error of the plug-in predictor μ_0n(x_0 | θ̂), obtained by substituting an estimate θ̂ from MLE or REML (see the sketch below).
- A Bayesian approach to the uncertainty in θ, which is to model it by a prior distribution.
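A minimal sketch of the plug-in route: the profile log-likelihood of a one-parameter Gaussian correlation is maximized by a crude grid search (in practice one would use a numerical optimizer, or REML), and the resulting θ̂ would then be substituted into μ_0n(x_0 | θ̂). The data and grid are made up.

```python
import numpy as np

def gauss_corr(x1, x2, theta):
    return np.exp(-theta * (x1 - x2) ** 2)

def profile_loglik(theta, x, y):
    """Profile log-likelihood of theta, with beta and sigma_Z^2 profiled out."""
    n = len(x)
    F = np.ones((n, 1))
    R = gauss_corr(x[:, None], x[None, :], theta)
    R += 1e-10 * np.eye(n)                      # small nugget for numerical stability
    Rinv = np.linalg.inv(R)
    FtRi = F.T @ Rinv
    bhat = np.linalg.solve(FtRi @ F, FtRi @ y)
    resid = y - F @ bhat
    s2 = (resid @ Rinv @ resid) / n             # MLE of sigma_Z^2 given theta
    _, logdet = np.linalg.slogdet(R)
    return -0.5 * (n * np.log(s2) + logdet)

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * x)

grid = np.linspace(0.5, 50.0, 200)              # candidate values of theta
theta_hat = grid[np.argmax([profile_loglik(t, x, y) for t in grid])]
print("MLE of theta:", theta_hat)
# theta_hat is then plugged into mu_0n(x0 | theta_hat); the plug-in standard error
# ignores the uncertainty in theta_hat, which is exactly the issue raised above.
```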
19. Prediction for Multiple Response Models
- Several outputs are available from a computer experiment.
- Several codes are available for computing the same response (e.g., a fast code and a slow code).
- Competing responses.
- Several stochastic models for the joint response.
- These models are used to describe the optimal predictor for one of the several computed responses.
20. Modeling Multiple Outputs
- Z_i(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variances and correlation functions R_i.
- Stationarity of Z_i(·) implies that the correlation between Z_i(x_1) and Z_i(x_2) depends only on x_1 − x_2.
- Assume Cov(Z_i(x_1), Z_j(x_2)) = σ_i σ_j R_ij(x_1 − x_2).
- R_ij(·): the cross-correlation function of Z_i(·) and Z_j(·).
- The linear model is the global mean of the Y_i process; the f_i(·) are known regression functions and the β_i are unknown regression parameters.
21. Selection of correlation and cross-correlation functions is complicated
- Reason: for any set of input sites, the multivariate normally distributed random vector (Z_1(x_11), …)^T must have a nonnegative definite covariance matrix.
- Solution: construct the Z_i(·) from a set of elementary processes (usually these processes are mutually independent).
22. Example by Kennedy and O'Hagan
- Y_i(x): prior for the i-th code level (i = m is the top-level code). The autoregressive model is
- Y_i(x) = ρ_{i−1} Y_{i−1}(x) + δ_i(x), i = 2, …, m.
- The output of each successively higher-level code i at x is related to the output of the less precise code i−1 at x plus the refinement δ_i(x) (see the sketch below).
- Cov(Y_i(x), Y_{i−1}(w) | Y_{i−1}(x)) = 0 for all w ≠ x.
- No additional second-order knowledge of code i at x can be obtained from the lower-level code i−1 once the value of code i−1 at x is known (a Markov property on the hierarchy of codes).
- Since some applications have no natural hierarchy of computer codes, something better is needed there.
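To illustrate what the autoregressive structure implies, the sketch below takes m = 2 (a cheap code Y_1 and a top-level code Y_2 = ρ Y_1 + δ, with Y_1 and δ independent mean-zero Gaussian processes), builds the joint covariance matrix of the two codes on a common grid, and checks that it is a valid covariance. The Gaussian correlations, ρ, and the variances are illustrative values, not taken from Kennedy and O'Hagan.

```python
import numpy as np

def gauss_corr(x1, x2, theta):
    return np.exp(-theta * (x1 - x2) ** 2)

# Illustrative parameters for Y1 (cheap code) and delta (refinement), with
# Y2(x) = rho * Y1(x) + delta(x) and Y1, delta independent mean-zero GPs.
rho, s1, sd = 0.8, 1.0, 0.3
theta1, thetad = 4.0, 8.0

x = np.linspace(0.0, 1.0, 6)
R1 = gauss_corr(x[:, None], x[None, :], theta1)
Rd = gauss_corr(x[:, None], x[None, :], thetad)

# Covariance blocks implied by the autoregressive model
C11 = s1**2 * R1                        # Cov(Y1, Y1)
C21 = rho * s1**2 * R1                  # Cov(Y2, Y1)
C22 = rho**2 * s1**2 * R1 + sd**2 * Rd  # Cov(Y2, Y2)

C = np.block([[C11, C21.T],
              [C21, C22]])
print("valid covariance:", np.all(np.linalg.eigvalsh(C) > -1e-10))

# A joint draw of (Y1, Y2) on the grid, consistent with the hierarchy of codes
sample = np.random.default_rng(0).multivariate_normal(np.zeros(2 * len(x)), C)
print(sample.shape)
```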
23. More reasonable model
- Each constraint function is associated with the objective function plus a refinement:
- Y_i(x) = ρ_i Y_1(x) + δ_i(x), i = 2, …, m + 1.
- Ver Hoef and Barry form models in the environmental sciences.
- These models include an unknown smooth surface plus a random measurement error.
- They are constructed as moving averages over white noise processes.
24. Morris and Mitchell model
- Prior information about y(x) is specified by a Gaussian process Y(·).
- Prior information about the partial derivatives y^(j)(x) is obtained by considering the derivative processes of Y(·).
- y_1(·) = y(·), y_2(·) = y^(1)(·), …, y_{1+m}(·) = y^(m)(·).
- The natural prior for y^(j)(x) is the derivative process Y^(j)(·).
- The covariances between Y(x_1) and Y^(j)(x_2), and between Y^(i)(x_1) and Y^(j)(x_2), are obtained by differentiating the covariance function of Y(·) (see the sketch below).
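The covariance formulas themselves are on an untranscribed slide; for a one-dimensional Gaussian covariance k(x_1, x_2) = σ² exp(−θ(x_1 − x_2)²) they follow by differentiating k, as sketched below. The block matrix built here is the joint covariance of (Y(x_i), Y'(x_i)); the parameter values are illustrative.

```python
import numpy as np

# Gaussian covariance k(x1, x2) = s2 * exp(-theta * h^2), h = x1 - x2, and the
# covariances involving the derivative process, obtained by differentiating k:
#   Cov(Y(x1),  Y'(x2)) = d/dx2 k       = 2*s2*theta*h*exp(-theta*h^2)
#   Cov(Y'(x1), Y'(x2)) = d^2/dx1dx2 k  = 2*s2*theta*(1 - 2*theta*h^2)*exp(-theta*h^2)
s2, theta = 1.0, 4.0

def k00(x1, x2):            # Cov(Y(x1), Y(x2))
    h = x1 - x2
    return s2 * np.exp(-theta * h**2)

def k01(x1, x2):            # Cov(Y(x1), Y'(x2))
    h = x1 - x2
    return 2 * s2 * theta * h * np.exp(-theta * h**2)

def k11(x1, x2):            # Cov(Y'(x1), Y'(x2))
    h = x1 - x2
    return 2 * s2 * theta * (1 - 2 * theta * h**2) * np.exp(-theta * h**2)

x = np.array([0.0, 0.3, 0.7, 1.0])
X1, X2 = x[:, None], x[None, :]
K01 = k01(X1, X2)                         # Cov(Y(x_i), Y'(x_j))
K = np.block([[k00(X1, X2), K01],
              [K01.T, k11(X1, X2)]])      # joint covariance of (Y(x), Y'(x))
print("nonnegative definite:", np.all(np.linalg.eigvalsh(K) > -1e-10))
```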
25. Optimal Predictors for Multiple Outputs
- The best MSPE predictor based on the training data is the conditional mean of Y_0 given all the observed outputs,
- where Y_0 = Y_1(x_0), Y_i^{n_i} = (Y_i(x_1i), …)^T, and y_i^{n_i} is its observed value, for i = 1, …, m.
26. The joint distribution is the multivariate normal distribution
27. Conditional expectation
- In practice, this is not usable directly: it requires knowledge of the marginal correlation functions, the joint (cross-)correlation functions, and the ratios of all the process variances.
- Empirical versions are of practical use (see the sketch below).
- In each case we assume the correlation matrices R_i and cross-correlation matrices R_ij are known up to a vector of parameters.
- Estimate these parameters using MLE or REML.
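The empirical predictor is just multivariate-normal conditioning with the estimated covariances plugged in. A generic sketch of that conditioning step, with made-up numbers standing in for the quantities one would assemble from the estimated R_i, R_ij, process variances, and regression fits:

```python
import numpy as np

def conditional_mean_and_var(mu0, mu_n, cov_00, cov_0n, cov_nn, y_n):
    """E[Y0 | Y^n = y_n] and Var[Y0 | Y^n = y_n] for jointly normal (Y0, Y^n)."""
    w = np.linalg.solve(cov_nn, y_n - mu_n)
    mean = mu0 + cov_0n @ w
    var = cov_00 - cov_0n @ np.linalg.solve(cov_nn, cov_0n)
    return mean, var

# Illustrative plug-in quantities (not from the slides): in the empirical version
# these would be built from the estimated correlation/cross-correlation matrices.
mu0 = 0.0
mu_n = np.zeros(4)
cov_00 = 1.0
cov_0n = np.array([0.8, 0.5, 0.4, 0.2])     # covariances of Y0 with the observed outputs
cov_nn = np.array([[1.0, 0.6, 0.5, 0.3],    # joint covariance of the observed outputs
                   [0.6, 1.0, 0.4, 0.2],
                   [0.5, 0.4, 1.0, 0.6],
                   [0.3, 0.2, 0.6, 1.0]])
y_n = np.array([0.7, 0.2, -0.1, 0.4])

print(conditional_mean_and_var(mu0, mu_n, cov_00, cov_0n, cov_nn, y_n))
```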
28. Example 1
- The 14-point training design is space-filling: it allows us to learn about the response over the entire input space.
- Compare two models:
- the predictor of y(·) based on y(·) alone;
- the predictor of y(·) based on (y(·), y^(1)(·), y^(2)(·)).
- The second is both a visually better fit and has a 24% smaller ERMSPE.
29. (No transcript)
30. Thank you!