Title: Regression and Calibration
1Regression and Calibration
- Forensic Statistics CIS205
2Introduction
- More on the nature of the relationship between continuous covariates.
- This is useful because many variables are impossible to measure directly in a forensic context: age at death for human identification, post-mortem interval, time since discharge for a firearm.
- However, changes occur which are covariates of these immeasurables, e.g. root dentine translucency, concentration of potassium in the vitreous humour, chemical residues in the barrels and chambers of firearms.
- From measurements of these variables and exact knowledge of the relationships with their immeasurable covariables, an estimate of the immeasurable covariate can be made.
- This process is called calibration.
3Values for post-mortem interval (PMI) and
vitreous potassium ion concentration K for a
sample of 8 cadavers (Munoz et al., 2001)
4Linear Models
- Figure 7.1 is a scatterplot of the data in Table 7.1.
- From Figure 7.1 it is easy to imagine a single straight line going through the cloud of (x, y) points.
- This line would represent a linear model of PMI and K. Such a line is marked as the straight line on Figure 7.2.
7Parameters
- Any linear model which describes the relationship between two covariates may be described by two parameters, the values of which are termed coefficients.
- The first is the gradient or slope, the second is the intercept with the y axis.
- The slope is dy/dx (the change in y divided by the change in x). The gradient of a simple linear model is conventionally denoted b.
- In Figure 7.2 the gradient is (9.35 - 7.07) / (20 - 7) = 2.28 / 13 ≈ 0.17.
- Thus for an increase of 1 h in PMI there is an increase of 0.17 mmol/l in vitreous K. This can be turned round to say that for every increase of 1 mmol/l in K there is an increase of 1/0.17 = 5.88 h in PMI.
- The intercept, conventionally denoted a, is the value of K when PMI = 0. From Figure 7.2 this is 5.85.
- A general form for a linear model is y = a + bx.
- E.g. K = 5.85 + (0.17 × PMI).
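The arithmetic on this slide can be checked with a short Python sketch. It assumes the two points read off the line in Figure 7.2 are (7 h, 7.07 mmol/l) and (20 h, 9.35 mmol/l), as the gradient calculation implies; the names `predict_k` and `hours_per_mmol` are illustrative only.

```python
# Two points read off the straight line in Figure 7.2: (PMI in h, K in mmol/l).
x1, y1 = 7.0, 7.07
x2, y2 = 20.0, 9.35

# Gradient b = change in y divided by change in x.
b_from_points = (y2 - y1) / (x2 - x1)

# Rounded coefficients quoted on the slide.
a, b = 5.85, 0.17


def predict_k(pmi_hours):
    """Linear model y = a + b*x: predicted K (mmol/l) at a given PMI (h)."""
    return a + b * pmi_hours


# Increase in PMI (h) for each 1 mmol/l increase in K.
hours_per_mmol = 1.0 / b

print(f"gradient from the two points: {b_from_points:.3f}")
print(f"predicted K at PMI = 10 h: {predict_k(10.0):.2f} mmol/l")
print(f"hours of PMI per mmol/l of K: {hours_per_mmol:.2f}")
```

The gradient read from the figure is slightly larger than the rounded 0.17 because the plotted points are only read to two decimal places.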
8Calculation of a linear regression model
- How are a and b calculated?
- Usually in linear regression we calculate a best fit model which minimises the sum of squared errors in either the x or y direction.
- These errors are called the residuals, and are the differences between the model and the observed data.
- Figure 7.3 is a detail of three of the points from Figure 7.2 showing these residuals. The objective of least squares regression is to select the model which minimises dy1² + dy2² + dy3².
10Estimating a and b
- Without going into mathematical detail, the estimate of the gradient b is b = Sxy / Sxx,
- where Sxx = Σ(x - mean x)²
- and Sxy = Σ(x - mean x)(y - mean y).
- Sxx is related to the variance of x, while Sxy is related to the covariance between x and y (each is the corresponding sample statistic multiplied by n - 1).
- An estimate of the intercept a is given by
- a = mean y - b(mean x)
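A minimal Python sketch of these estimates follows. The five (x, y) points are made up purely for illustration and are not the data of Table 7.1; `fit_line` and `sum_squared_residuals` are hypothetical helper names. The sketch also checks that the fitted line has a smaller sum of squared y residuals than a nearby candidate line.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for y = a + b*x, via Sxx and Sxy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared residuals in the y direction for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative points (NOT the cadaver data).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.1, 2.9, 4.2, 4.8, 6.0]

a_hat, b_hat = fit_line(toy_x, toy_y)
sse_fit = sum_squared_residuals(toy_x, toy_y, a_hat, b_hat)
sse_other = sum_squared_residuals(toy_x, toy_y, 1.0, 1.0)  # a nearby candidate

print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")
print(f"SSE of fitted line: {sse_fit:.3f}, SSE of y = 1 + x: {sse_other:.3f}")
```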
11Table 7.2 Calculations for the regression y
(K) fitted to x (PMI)
12Calculation of a and b continued
- Sxx = Σ(x - mean x)² = 371.73
- Sxy = Σ(x - mean x)(y - mean y) = 64.89
- b = Sxy / Sxx = 64.89 / 371.73 = 0.1745
- a = mean y - b(mean x) = 7.47 - (0.1745 × 9.28) = 5.85
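As a quick check, the same arithmetic can be reproduced in Python from the rounded sums quoted on this slide. Because the inputs are already rounded, the last decimal place of b may differ slightly from the slide.

```python
# Rounded summary values quoted on the slide (x = PMI in h, y = K in mmol/l).
sxx = 371.73
sxy = 64.89
mean_x = 9.28
mean_y = 7.47

b = sxy / sxx            # gradient
a = mean_y - b * mean_x  # intercept

print(f"b = {b:.4f}")
print(f"a = {a:.2f}")
```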
13Testing goodness of fit
- One of the first things we may wish to know about our model is whether it is a good fit to the data. Measures of this type are known as goodness of fit statistics.
- A suitable test statistic is F = Σ(ŷ - mean y)² / [(1/df) × Σ(y - ŷ)²], where df = n - 2 and ŷ is the estimate of y from the model.
14Calculations for the goodness of fit statistic
for the regression y (K) fitted to x (PMI)
15Calculation of F continued
- F = 11.33 / ((1/6) × 4.92) = 13.82
- See appendix F, the F-distribution, df = n - 2 = 6.
- For 6 df at 5% significance, F = 5.99. The calculated value of F is 13.82, which is greater than 5.99, so we act as though the model is an adequate fit at the 5% level of significance.
- Other assumptions include that the residuals must be normally distributed.
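The F calculation can be sketched in Python as below. It assumes that 11.33 is the sum of squares explained by the model and 4.92 is the residual sum of squares, which is how the numbers combine in the calculation above; the critical value 5.99 is the tabulated figure quoted on the slide rather than one computed here.

```python
# Values quoted on the slide; their interpretation is an assumption (see above).
regression_ss = 11.33  # assumed: sum of (y_hat - mean y)^2
residual_ss = 4.92     # assumed: sum of (y - y_hat)^2
n = 8                  # number of cadavers in the sample
df = n - 2

# F = explained sum of squares divided by the residual mean square.
f_stat = regression_ss / (residual_ss / df)

# Tabulated critical value for 6 df at 5% significance, as quoted on the slide.
f_critical = 5.99
model_adequate = f_stat > f_critical

print(f"F = {f_stat:.2f}, critical value = {f_critical}")
print("adequate fit at the 5% level" if model_adequate else "fit not adequate")
```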
16Testing coefficients a and b
17Estimated Standard Errors
- Using the equations on the previous slide,
- s = 0.9
- ESE(b) = 0.047
- ESE(a) = 0.54
- A confidence interval for both a and b is found using the t-distribution with n - 2 df, so the 99% confidence level occurs at 3.707 standard errors.
- This means the 99% confidence interval for b is 0.1745 ± (3.707 × 0.047) ≈ 0.0002 to 0.3487. The lower limit is very close to 0, so it might be wise not to reject the null hypothesis that the gradient of the model is 0, i.e. that there is no relation between K and PMI.
- The 99% confidence interval for a is 5.85 ± (3.707 × 0.54) = 3.85 to 7.85. This interval does not contain 0, so it can be regarded as evidence for the hypothesis that the intercept is non-zero.
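The equations from the previous slide are not reproduced in this transcript, so the sketch below uses the usual textbook formulas for simple linear regression as an assumption: s = sqrt(residual SS / (n - 2)), ESE(b) = s / sqrt(Sxx) and ESE(a) = s × sqrt(1/n + (mean x)² / Sxx). With the sums quoted earlier they reproduce the values on this slide. The function name `confidence_interval` is illustrative, and 3.707 is the tabulated t value quoted above.

```python
import math

# Summary values quoted earlier in the slides.
n = 8
residual_ss = 4.92  # assumed to be the residual sum of squares
sxx = 371.73
mean_x = 9.28

# Assumed textbook formulas for the residual standard deviation and the ESEs.
s = math.sqrt(residual_ss / (n - 2))
ese_b = s / math.sqrt(sxx)
ese_a = s * math.sqrt(1.0 / n + mean_x ** 2 / sxx)


def confidence_interval(estimate, std_error, t_value):
    """Return (lower, upper) limits: estimate +/- t_value * std_error."""
    half_width = t_value * std_error
    return estimate - half_width, estimate + half_width


t_99 = 3.707  # tabulated t for 6 df at 99% confidence, as quoted on the slide

# Intervals computed from the rounded values shown on the slide.
b_low, b_high = confidence_interval(0.1745, 0.047, t_99)
a_low, a_high = confidence_interval(5.85, 0.54, t_99)

print(f"s = {s:.2f}, ESE(b) = {ese_b:.3f}, ESE(a) = {ese_a:.2f}")
print(f"99% interval for b: {b_low:.4f} to {b_high:.4f}")
print(f"99% interval for a: {a_low:.2f} to {a_high:.2f}")
```

The lower limit for b is sensitive to rounding of the inputs, so its last decimal place should not be over-interpreted.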
18Calibration
- We have now established that there is good evidence to suggest a linear relationship between K and PMI, and we know the parameters of the relationship.
- In this case PMI is not directly observable, but a measurement of K is possible. From the regression model it should be possible to make an estimate of PMI from the measurement of K. This is calibration.
- From our equation K = (0.17 × PMI) + 5.85, we derive PMI = (K - 5.85) / 0.17.
- Check this by drawing a residual plot of (PMI - estimates of PMI) vs. PMI.
- We will see later how to calculate the standard errors of the estimates.
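The inversion used for calibration can be sketched in Python as follows, using the rounded coefficients 5.85 and 0.17 from the slides. The function names are illustrative, and the K value of 7.55 mmol/l is a hypothetical measurement chosen only to show the round trip.

```python
A = 5.85  # intercept: K (mmol/l) at PMI = 0
B = 0.17  # gradient: mmol/l of K per hour of PMI


def predict_k(pmi_hours):
    """Regression model: K = A + B * PMI."""
    return A + B * pmi_hours


def estimate_pmi(k_mmol_per_l):
    """Calibration: invert the model to estimate PMI (h) from a measured K."""
    return (k_mmol_per_l - A) / B


measured_k = 7.55  # hypothetical measurement, for illustration only
print(f"estimated PMI: {estimate_pmi(measured_k):.1f} h")
```

Applying `estimate_pmi` to each observed K and subtracting the result from the known PMI gives the residuals for the plot described above.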
20Table 7.5. Calculation of estimated PMI values
and standard errors
21An approximation of standard error for the point
value (y0)
22Interpretation of Table 7.5
- The final column of Table 7.5 gives the calculation on the previous slide for all 8 points.
- To arrive at a suitable confidence interval the standard error of the estimate has to be multiplied by the appropriate value from the t-distribution for n - 2 degrees of freedom, i.e. the interval is x0 ± t × Sx0.
- From appendix C, at 95% confidence and n - 2 = 6 df, the value of t is 2.447.
- The standard error for the first value in Table 7.5 is 5.91 and the point estimate for PMI is 16.87, so a 95% confidence interval is 16.87 ± (5.91 × 2.447) = 16.87 ± 14.46 = 2.41 to 31.33 h.
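The interval for the first value in Table 7.5 can be reproduced with a few lines of Python. The point estimate, standard error and t value are those quoted on the slide; `pmi_interval` is an illustrative name.

```python
def pmi_interval(point_estimate, std_error, t_value):
    """Confidence limits for a calibrated PMI: x0 +/- t * S(x0)."""
    half_width = t_value * std_error
    return point_estimate - half_width, point_estimate + half_width


# First row of Table 7.5, with tabulated t for 6 df at 95% confidence.
low, high = pmi_interval(16.87, 5.91, 2.447)

print(f"95% interval for PMI: {low:.2f} to {high:.2f} h")
```

The width of the interval, almost 29 h around an estimate of about 17 h, shows how imprecise a single calibrated PMI is with this sample.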
23Points to remember
- Avoid using overly complex models: use linear modelling unless background theory suggests a non-linear model, or unless a linear model does not meet goodness of fit criteria.
- If covariates really are related in a non-linear way, then usually a simple transformation such as taking logs of one or both of the variables will produce an adequate linear fit.
- Plot the independent variable (e.g. PMI) on the x axis, and the dependent variable (e.g. K) on the y axis. This is because of the notion of causation: it is PMI which causes the change in K, and there is no way K could cause PMI. Minimise the residuals in the y direction.
- Finally, always plot covariates and residuals. Good eyeball statistics are more effective and give a better understanding of the relationships between variables than poorly understood and inappropriate tests.
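The point about transformations can be illustrated with a small Python sketch. The data are synthetic, generated exactly from y = 2 × exp(0.3x), and are not forensic measurements; `fit_line` is a hypothetical helper that applies the Sxy / Sxx estimates. Taking logs of y turns the curved relationship into a straight line whose gradient and intercept recover 0.3 and ln 2.

```python
import math


def fit_line(xs, ys):
    """Least squares estimates (a, b) for y = a + b*x, via Sxx and Sxy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Synthetic, exactly exponential data (for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.3 * x) for x in xs]

# Fit a straight line to log(y) against x instead of y against x.
log_ys = [math.log(y) for y in ys]
a_log, b_log = fit_line(xs, log_ys)

print(f"gradient on the log scale: {b_log:.3f}")
print(f"intercept on the log scale: {a_log:.3f} (ln 2 = {math.log(2.0):.3f})")
```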