Title: Regression Analysis of Count Data and Development of Statistical Models
2. Why use Statistical Models?
- Crashes are independent and random events (probabilistic events)
- Estimate a relationship between crashes and covariates (or explanatory variables)
- Determine the long-term average of crash occurrences for transportation facilities
- Have a wide variety of applications in safety analyses
  - Prediction
  - Variable screening
  - Risk factors
  - Before-after study
3. Back to Basics
Definition: A model is an abstraction of reality in that it provides an approximation of some relatively complex phenomenon. A model can be deterministic or probabilistic.
Most common types of probabilistic models:
- Linear Models (aka Multivariate Linear Models): Y = b + aX
- Non-Linear Models: Y aXbX
- Generalized Linear Models (GLM) (generalization of linear models; appropriate for crash data)
4. Review of Multivariate Linear Models
The probabilistic linear models have the following form
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where
- $y$ = outcome or response variable
- $x_1, x_2, \ldots, x_k$ = covariates or explanatory variables
- $\beta_0, \beta_1, \ldots, \beta_k$ = unknown coefficients (regression parameters)
- $\varepsilon$ = random error term
If $\varepsilon$ is assumed to be normally and independently distributed with constant variance, then statistical tests on the model parameters and confidence intervals on the coefficients and predictions can be easily obtained.
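5. Review of Multivariate Linear Models
Ordinary Least Squares Method
This estimation technique is used for estimating the unknown coefficients. It consists of solving $p = k + 1$ simultaneous linear equations obtained by minimizing the sum of squared errors.
Let
$$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
Note: $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$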
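6. Review of Multivariate Linear Models
The least squares function $S$ is given by
$$S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2$$
The function $S$ is to be minimized with respect to $\beta_0, \beta_1, \ldots, \beta_k$. The least squares estimators, say $b_0, b_1, \ldots, b_k$, must satisfy
$$\frac{\partial S}{\partial \beta_0}\Big|_{b_0, \ldots, b_k} = -2 \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{l=1}^{k} b_l x_{il} \Big) = 0$$
$$\frac{\partial S}{\partial \beta_j}\Big|_{b_0, \ldots, b_k} = -2 \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{l=1}^{k} b_l x_{il} \Big) x_{ij} = 0, \quad j = 1, 2, \ldots, k$$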
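7. Review of Multivariate Linear Models
It is easier to solve the equations by using a matrix format. The equations can be written the following way
$$y = X\beta + \varepsilon$$
where $y$ is the $n \times 1$ vector of observations, $X$ is the $n \times p$ matrix of covariates (with a leading column of ones, $p = k + 1$), $\beta$ is the $p \times 1$ vector of coefficients, and $\varepsilon$ is the $n \times 1$ vector of random errors.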
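8. Review of Multivariate Linear Models
Need to find the least squares estimator $b$ that minimizes
$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)$$
It can be shown that $S(\beta)$ can be expressed this way
$$S(\beta) = y'y - 2\beta'X'y + \beta'X'X\beta$$
The least squares estimator must satisfy
$$\frac{\partial S}{\partial \beta}\Big|_{b} = -2X'y + 2X'Xb = 0$$
which simplifies to
$$X'Xb = X'y \quad\Rightarrow\quad b = (X'X)^{-1}X'y$$
$b$ is called the ordinary least squares estimator of $\beta$.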
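The matrix steps above can be sketched in a few lines of plain Python: build X with a leading column of ones, form X'X and X'y, and solve the normal equations X'Xb = X'y. This is only an illustrative sketch; the helper names (`solve`, `ols`) and the data are hypothetical and are not taken from the lecture.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def ols(x_rows, y):
    """Ordinary least squares: solve the normal equations X'X b = X'y."""
    X = [[1.0] + list(row) for row in x_rows]  # leading column of ones for b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(u * v for u, v in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2 (no noise).
x_rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2 + 3 * x1 - x2 for x1, x2 in x_rows]
b = ols(x_rows, y)
print([round(v, 6) for v in b])
```

Because the made-up data contain no error term, the estimates should reproduce the generating coefficients up to rounding. In practice a statistical package would be used; solving the normal equations directly is shown here only because it mirrors the derivation.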
9. Review of Multivariate Linear Models
Note that the variance-covariance matrix of $b$ is $\mathrm{Var}(b) = \sigma^2 (X'X)^{-1}$, where $(X'X)^{-1}$ is the inverse of $X'X$. This inverse matrix is very useful for estimating the confidence intervals, as described below. Software programs can usually provide the variance-covariance matrix or $(X'X)^{-1}$.
10. Review of Multivariate Linear Models
Example taken from Myers et al. (2002). The study described the relationship between transistor gain in an integrated circuit and two emitter variables, drive-in time ($x_1$) and emitter dose ($x_2$).
11. Review of Multivariate Linear Models
The multiple linear regression model is given by
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
The $X$ matrix and $y$ vector are
12. Review of Multivariate Linear Models
The $X'X$ matrix is constructed as follows
The $X'y$ vector is
13. Review of Multivariate Linear Models
The least squares estimate of $\beta$ is
$$b = (X'X)^{-1}X'y$$
or
The fitted regression equation becomes
14. Review of Multivariate Linear Models
Observed and Fitted Values
[Figure: observed values compared with the mean response (fitted values)]
15. Review of Multivariate Linear Models
Analysis of Variance
Note: $SSE/\sigma^2$ is distributed as $\chi^2_{n-p}$, and $SSE$ and $SSR$ are independent.
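16. Review of Multivariate Linear Models
The computations of $SSE$, $SSR$, $SST$, $F_0$, and $R^2$ are as follows
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
$$SSE = SST - SSR$$
$$F_0 = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}$$
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Reject $H_0$ if $F_0 > F_{\alpha,\,k,\,n-k-1}$.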
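The analysis-of-variance quantities can be illustrated with a short Python sketch for the simplest case of one covariate (k = 1), where the least squares estimates have a closed form. The data values are hypothetical, and the critical value of the F distribution would still have to be read from a table or from statistical software.

```python
def anova_simple(x, y):
    """ANOVA quantities for a one-covariate least squares fit (k = 1)."""
    n, k = len(y), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    fitted = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    ssr = sst - sse                                        # regression sum of squares
    mse = sse / (n - k - 1)
    return {
        "fitted": fitted,
        "y_bar": y_bar,
        "SST": sst,
        "SSE": sse,
        "SSR": ssr,
        "MSE": mse,
        "F0": (ssr / k) / mse,
        "R2": ssr / sst,
    }


# Hypothetical observations, roughly linear in x.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
result = anova_simple(x, y)
print({key: round(val, 4) for key, val in result.items() if key != "fitted"})
```

The same decomposition holds for any number of covariates; only the degrees of freedom (k and n - k - 1) change.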
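17. Review of Multivariate Linear Models
Confidence intervals for the coefficients of the models are
$$b_j - t_{\alpha/2,\,n-p}\; se(b_j) \le \beta_j \le b_j + t_{\alpha/2,\,n-p}\; se(b_j)$$
where
$$se(b_j) = \sqrt{\hat{\sigma}^2 C_{jj}}, \quad \hat{\sigma}^2 = MSE$$
Test statistic (t-test)
$$t_0 = \frac{b_j}{se(b_j)}$$
$C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to $b_j$.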
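18. Review of Multivariate Linear Models
Confidence intervals on the mean response
$$\hat{y}(x_0) \pm t_{\alpha/2,\,n-p} \sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}$$
where
$$x_0 = [1, x_{01}, x_{02}, \ldots, x_{0k}]' \quad \text{and} \quad \hat{y}(x_0) = x_0'b$$
Confidence intervals on new response observations
$$\hat{y}(x_0) \pm t_{\alpha/2,\,n-p} \sqrt{\hat{\sigma}^2 \left(1 + x_0'(X'X)^{-1}x_0\right)}$$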
19. Review of Multivariate Linear Models
Example: suppose we wish to find a 95% confidence interval on the mean response for the point $x_{01} = 225$ min and $x_{02} = 4.36 \times 10^{14}$ ions, so that
The estimate of the mean response is
20. Review of Multivariate Linear Models
We find
as
Using $\hat{\sigma}^2 = MSE = 1220.1$, we find the confidence interval
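21. Review of Multivariate Linear Models
Maximum Likelihood Method
The likelihood function is found from the joint probability distribution of the observations. Given the assumption that the errors are normally distributed and the variance $\sigma^2$ is constant, the likelihood function is the following (normal distribution)
$$L(y, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right]$$
Same model as before: $y = X\beta + \varepsilon$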
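22. Review of Multivariate Linear Models
The maximum likelihood estimators are the values of the parameters $\beta$ and $\sigma^2$ that maximize the likelihood function. Maximizing the likelihood is equivalent to maximizing the log-likelihood, $\ln L$. The log-likelihood is
$$\ln L(y, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$
The derivative of the log-likelihood function is called the score function. Taking the derivatives with respect to the coefficients $\beta$ and equating to zero yields
$$\frac{\partial \ln L}{\partial \beta}\Big|_{b} = \frac{1}{\sigma^2} X'(y - Xb) = 0 \quad\Rightarrow\quad b = (X'X)^{-1}X'y$$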
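23. Review of Multivariate Linear Models
Taking the partial derivative with respect to $\sigma^2$ gives
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - Xb)'(y - Xb) = 0$$
which is
$$\hat{\sigma}^2 = \frac{(y - Xb)'(y - Xb)}{n}$$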
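One practical consequence of the maximum likelihood result is that the ML variance estimate divides the residual sum of squares by n, whereas the MSE used for the confidence intervals divides by n - p. The sketch below compares the two on hypothetical residuals; the function name and the numbers are illustrative assumptions only.

```python
def sigma2_estimates(residuals, p):
    """Return (ML estimate, MSE) of the error variance from OLS residuals.

    residuals : y - Xb for each observation
    p         : number of estimated coefficients (k + 1)
    """
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return sse / n, sse / (n - p)


# Hypothetical residuals from a fit with p = 2 coefficients.
residuals = [0.5, -1.0, 0.25, 0.75, -0.5]
p = 2
sigma2_ml, mse = sigma2_estimates(residuals, p)
print(sigma2_ml, mse)
```

The two estimates differ by the factor (n - p) / n, so the difference matters mainly for small samples.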
24. Generalized Linear Models
The previous overheads showed how the normal distribution plays an important role in estimating the coefficients of probabilistic models and in making inferences about them. Unfortunately, there are many practical situations where the normal assumption is not valid. Count data, binary responses (0 or 1), and continuous variables with a positive, highly skewed distribution cannot be modeled with normally distributed errors. The generalized linear model (GLM) was developed to allow fitting regression models for univariate response data that follow a very general family of distributions called the exponential family. This family includes the normal, binomial, negative binomial, geometric, gamma, etc.
25. Generalized Linear Models
Members of the exponential family of distributions all have probability density functions for an observed response $y$ that can be expressed in the form
$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$
where $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are specific functions. The parameter $\theta$ is the natural location parameter, and $\phi$ is often called the dispersion or scale parameter. The function $a(\phi)$ is generally of the form $a(\phi) = \phi \cdot \omega$, where $\omega$ is a known constant.
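26. Generalized Linear Models
The most prominent member of the exponential family is the normal distribution
$$f(y; \theta, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right] = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{1}{2}\left[\frac{y^2}{\sigma^2} + \ln(2\pi\sigma^2)\right]\right\}$$
where
$$\theta = \mu, \quad b(\theta) = \frac{\mu^2}{2}, \quad a(\phi) = \phi = \sigma^2, \quad c(y, \phi) = -\frac{1}{2}\left[\frac{y^2}{\sigma^2} + \ln(2\pi\sigma^2)\right]$$
Note: the location parameter is $\mu$ and the scale parameter is $\sigma^2$.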
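27. Generalized Linear Models
The Poisson distribution can be expressed this way
$$f(y; \theta, \phi) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\{y\ln\mu - \mu - \ln(y!)\}$$
where
$$\theta = \ln\mu, \quad b(\theta) = e^{\theta} = \mu, \quad a(\phi) = 1, \quad c(y, \phi) = -\ln(y!)$$
The location parameter is $\theta = \ln\mu$ and the scale parameter is $\phi = 1$.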
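As a quick numerical check of the Poisson case, the sketch below evaluates the usual Poisson probability and the exponential-family form exp{[yθ - b(θ)]/a(φ) + c(y, φ)} with θ = ln µ, b(θ) = exp(θ), a(φ) = 1 and c(y, φ) = -ln(y!), and confirms that they agree. The value of µ and the function names are assumptions chosen for illustration.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability written in its usual form."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def exp_family_density(y, theta, b, c, a_phi=1.0):
    """Generic exponential-family form exp{[y*theta - b(theta)]/a(phi) + c(y)}."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))


mu = 3.5                     # hypothetical mean
theta = math.log(mu)         # natural location parameter for the Poisson


def b(t):
    return math.exp(t)       # b(theta) = exp(theta) = mu


def c(y):
    return -math.lgamma(y + 1)   # -ln(y!)


for y in range(6):
    usual = poisson_pmf(y, mu)
    family = exp_family_density(y, theta, b, c)
    print(y, round(usual, 6), round(family, 6))

# Central-difference estimate of b'(theta), which should equal the mean mu.
h = 1e-6
mean_from_b = (b(theta + h) - b(theta - h)) / (2 * h)
```

The final two lines illustrate a general property of the family: the first derivative of b(θ) gives the mean of the response.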
28. Generalized Linear Models
The formal structure of GLMs is as follows (Myers et al., 2002)
1. We have $y_1, y_2, \ldots, y_n$ independent response observations with means $\mu_1, \mu_2, \ldots, \mu_n$, respectively.
2. The observation $y_i$ has a distribution that is a member of the exponential family.
3. The systematic portion of the model involves regressors or explanatory variables $x_1, x_2, \ldots, x_k$.
4. The model is constructed around the linear predictor $\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. The involvement of this linear predictor suggests the terminology generalized linear model.
29. Generalized Linear Models
The formal structure of GLMs is as follows (Myers et al., 2002)
5. The model is found through the use of a link function
$$\eta_i = g(\mu_i), \quad i = 1, 2, \ldots, n$$
The term link is derived from the fact that the function is the link between the mean and the linear predictor. Note: the expected response is
$$E(y_i) = \mu_i = g^{-1}(\eta_i) = g^{-1}(x_i'\beta)$$
Note that for multivariate linear regression, the model $\mu_i = \eta_i = x_i'\beta$ suggests the special case in which $g(\mu_i) = \mu_i$ (the identity link).
Note: the link for which $\eta_i = \theta_i$ is usually defined as the canonical link.
The formal structure of GLMs is as follows (Myers
et al., 2002)
6. The link function is a monotonic
differentiable function.
7. The variance s12, s22, , sn2 is a function
of the mean µ.
The topic of GLMs is very broad and can easily
cover an entire semester. You are referred to
McCullagh and Nelder (1989) Generalized Linear
Models. Chapman and Hall New York, NY for
additional details. Other recent books, including
the one by Myers et al. (2002), on GLMs are
available.
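31. Generalized Linear Models
Common Canonical Links
Distribution         Link function                          Name
Normal               $\eta_i = \mu_i$                       identity link
Binomial             $\eta_i = \ln[\pi_i / (1 - \pi_i)]$    logistic link
Poisson              $\eta_i = \ln(\mu_i)$                  log link
Negative binomial*   $\eta_i = \ln(\mu_i)$                  log link
Exponential          $\eta_i = 1/\mu_i$                     reciprocal link
Gamma                $\eta_i = 1/\mu_i$                     reciprocal link
* Only valid if modeled as a Poisson-gamma distribution (mean $\mu$ and dispersion parameter $\phi$)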
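The identity, logistic, log, and reciprocal links listed on this slide can each be written as a pair of functions: the link g, which maps the mean to the linear predictor, and its inverse, which maps the linear predictor back to the mean. The following sketch is illustrative only; the dictionary name and the test value are hypothetical.

```python
import math

# Each entry holds (g, g_inverse): eta = g(mu) and mu = g_inverse(eta).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logistic": (lambda p: math.log(p / (1.0 - p)),
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "reciprocal": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}


def mean_from_predictor(link_name, eta):
    """Recover the expected response mu from the linear predictor eta."""
    _, g_inverse = LINKS[link_name]
    return g_inverse(eta)


# Round trip on a hypothetical mean value that is valid for every link.
mu = 0.3
for name, (g, g_inverse) in LINKS.items():
    eta = g(mu)
    print(name, round(eta, 4), round(g_inverse(eta), 4))
```

With the log link, for instance, a linear predictor of 0 corresponds to a mean of exp(0) = 1, which is why predictions from log-link models are always positive.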
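32. Generalized Linear Models
Maximum Likelihood Method
Similar to the multivariate linear models, the coefficients of the models can be estimated via the MLE. The log-likelihood function of the exponential family is
$$\ell(y, \beta) = \sum_{i=1}^{n} \left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right\}$$
Using the canonical link $\eta_i = \theta_i = x_i'\beta$, we have
$$\frac{\partial \ell}{\partial \beta} = \frac{1}{a(\phi)} \sum_{i=1}^{n} \left[y_i - \frac{\partial b(\theta_i)}{\partial \theta_i}\right] x_i = \frac{1}{a(\phi)} \sum_{i=1}^{n} (y_i - \mu_i)\, x_i$$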
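33. Generalized Linear Models
The maximum likelihood estimates of the coefficients are found by solving the following system of equations for $\beta$
$$\sum_{i=1}^{n} \frac{1}{a(\phi)} (y_i - \mu_i)\, x_i = 0$$
If $a(\phi) = a$ is constant, these equations become
$$\sum_{i=1}^{n} (y_i - \mu_i)\, x_i = 0$$
In matrix format
$$X'(y - \mu) = 0$$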
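Unlike the linear model, these score equations have no closed-form solution because µ depends on the coefficients through the link. A minimal sketch of one common numerical approach, Newton-Raphson, is given below for a Poisson model with log link, an intercept, and a single covariate. The data, starting values, and function names are assumptions made for illustration; statistical software typically uses an equivalent iteratively reweighted least squares scheme.

```python
import math


def score(x, y, b0, b1):
    """Score vector X'(y - mu) for a Poisson model with log link."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    u0 = sum(yi - mi for yi, mi in zip(y, mu))
    u1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
    return u0, u1


def fit_poisson_log_link(x, y, max_iter=50, tol=1e-10):
    """Solve X'(y - mu) = 0 by Newton-Raphson for mu = exp(b0 + b1*x)."""
    n = len(y)
    b0, b1 = math.log(sum(y) / n), 0.0      # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        u0, u1 = score(x, y, b0, b1)
        # Information matrix X'WX with W = diag(mu), since Var(y_i) = mu_i.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) d = X'(y - mu).
        d0 = (h11 * u0 - h01 * u1) / det
        d1 = (h00 * u1 - h01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical counts that grow roughly exponentially with x.
x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 9, 15]
b0, b1 = fit_poisson_log_link(x, y)
print(round(b0, 4), round(b1, 4), score(x, y, b0, b1))
```

At convergence both components of the score are numerically zero, which is exactly the condition X'(y - µ) = 0.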
34. Generalized Linear Models
In the case where the variance is unequal, we need to find the solution this way
where is a positive matrix
Note that $\sigma_i^2$ is a function of $\mu_i$.
The variance-covariance matrix is
35. Statistical Models For Crash Data
Given that crash data are discrete and non-negative events, statistical models are usually built using GLMs:
- Poisson-based models (Poisson, Poisson-gamma, Poisson-lognormal, etc.)
- Log-link canonical function
- The variance function is related to the mean
  - The dispersion parameter $\phi$ in models is usually assumed to be constant
  - Recent work has shown that the variance function may be structured, meaning that $\phi$ will vary for each observation (or site)
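To make the link between the variance and the mean concrete, the sketch below compares the Poisson variance function with one common Poisson-gamma (negative binomial) parameterization, Var(y) = µ + αµ². This parameterization is an assumption: some sources and software packages work with the inverse, 1/α, so the dispersion parameter on these slides may correspond to either form. All numerical values are hypothetical.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def poisson_gamma_variance(mu, alpha):
    """Poisson-gamma under the assumed form Var(y) = mu + alpha * mu**2."""
    return mu + alpha * mu ** 2


# Hypothetical site means (crashes per year).
means = [0.5, 2.0, 8.0]

# Constant dispersion: the same alpha at every site.
constant = [poisson_gamma_variance(m, 0.4) for m in means]

# Structured variance: a different (hypothetical) alpha for each site.
site_alphas = [0.8, 0.4, 0.2]
structured = [poisson_gamma_variance(m, a) for m, a in zip(means, site_alphas)]

for m, vc, vs in zip(means, constant, structured):
    print(m, poisson_variance(m), round(vc, 3), round(vs, 3))
```

Under this form the Poisson-gamma variance is never smaller than the Poisson variance and reduces to it when α = 0.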
36. Statistical Models For Crash Data
Traditional functional forms for statistical models in safety
[Figure: model notation, annotated as follows]
- Mean structure: the software program fits this equation and automatically transforms the dependent variable via the link function
- Error structure or variance function
37. Statistical Models For Crash Data
Example of statistical models for intersection
crashes (flow only)
38. Statistical Models For Crash Data
Example of codes used in Genstat
CALCULATE L_FLOW = LOG(FLOW)
CALCULATE Y = ACC
OPEN 'OUT_1.TXT', 'OUT_2.txt'; CHANNEL=3,4; FILETYPE=output, output
MODEL [DISTRIBUTION=negativebinomial; LINK=loga; AGGREGATION=1.83; \
  DISPERSION; OFFSET] Y
FIT [PRINT=model, estimates, summary; CONSTANT=ESTIMATE] L_FLOW
RKEEP FITTEDVALUES=fitted; RESIDUALS=resi; ESTIMATES=Esti; DEVIANCE=Devi; \
  SE=Stand; DF=Degree; PEARSONCHI=Pear
PRINT [CHANNEL=3] Y, fitted, resi
CLOSE 3,4; FILETYPE=output, output
39. Statistical Models For Crash Data
Note: How this translates for the Poisson model
where
Because of the log link, the actual equation becomes
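For the flow-only case, the Genstat listing regresses the crash count on L_FLOW = LOG(FLOW) with a log link. Assuming the linear predictor is therefore ln(µ) = β0 + β1 ln(flow), back-transforming gives µ = exp(β0) · flow^β1. The sketch below checks that the two forms agree; the coefficient values and the flow are hypothetical and are not estimates from the lecture.

```python
import math


def predicted_crashes_log_scale(flow, b0, b1):
    """Mean from the fitted linear predictor: exp(b0 + b1 * ln(flow))."""
    return math.exp(b0 + b1 * math.log(flow))


def predicted_crashes_power_form(flow, b0, b1):
    """Equivalent back-transformed form: exp(b0) * flow**b1."""
    return math.exp(b0) * flow ** b1


# Hypothetical coefficients and entering flow (vehicles per day).
b0, b1 = -7.0, 0.8
flow = 10000.0
mu_log = predicted_crashes_log_scale(flow, b0, b1)
mu_power = predicted_crashes_power_form(flow, b0, b1)
print(mu_log, mu_power)
```

Written this way, exp(β0) acts as a multiplicative constant and β1 as the exponent on flow, which is the form in which such models are usually reported.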
40. Statistical Models For Crash Data
Next Lecture
- More details about modeling crash data using Poisson and Poisson-gamma models (selection of variables, statistical fit, etc.)
- Time-trend effects in models
- Statistical inferences on coefficients and predicted values
- Issues with GLMs specific to crash data