Title: Generalized Linear Models
Chapter 8
GLM: A General Introduction
What is a generalized linear model? A traditional linear model is of the form
$Y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, 2, \ldots, n,$
where the responses $Y_i$ are assumed to be independent, normally distributed random variables with mean $\mu_i = \mathbf{x}_i^\top \boldsymbol{\beta}$ and constant variance $\sigma^2$. While traditional linear models are used extensively in statistical data analysis, there are types of problems for which they are not appropriate.
- It may not be reasonable to assume that the data are normally distributed, e.g., proportion data, count data, etc.
- If the mean of the data is naturally restricted to a range of values, the traditional linear model may not be appropriate, since the linear predictor can take on any value.
  - For example, the mean of a measured proportion is between 0 and 1, but the linear predictor of the mean in a traditional linear model is not restricted to this range.
- It may not be realistic to assume that the variance of the data is constant for all observations. For example, it is not unusual to observe data where the variance increases with the mean of the data.
A generalized linear model (GLM) extends the traditional linear model and is therefore applicable to a wider range of data analysis problems. A generalized linear model consists of the following components:
- a linear component, defined just as it is for the traditional linear model: $\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$;
- a link function $g$, monotonic and differentiable, describing how the expected value $\mu_i$ of $Y_i$ is related to the linear predictor $\eta_i$: $g(\mu_i) = \eta_i$;
- a family of distributions called the exponential family, from which $Y_i$, $i = 1, 2, \ldots, n$, are independently drawn.
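The density of $Y_i$ in the exponential family has the form
$f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\},$
where $\theta_i$ is called the natural parameter and $\phi$ the dispersion parameter, which is constant across $i$ and may be known.
- Naturally, the GLM models $\theta_i$ by a linear model, i.e., $\theta_i = \mathbf{x}_i^\top \boldsymbol{\beta}$.
- If this is the case, the resulting link is called the canonical link.
- The mean of the exponential family is $\mu_i = E(Y_i) = b'(\theta_i)$, i.e., $\theta_i$ is a function of the mean $\mu_i$ only, which gives the link function $g$ upon inversion: $\theta_i = (b')^{-1}(\mu_i) = g(\mu_i)$.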
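The mean identity follows from the fact that the density integrates to one. A brief sketch of that step is given below, assuming differentiation and integration can be interchanged (for a discrete response the integral becomes a sum). The variance identity on the last line is a standard companion result, included for reference rather than taken from the slide.

```latex
\begin{aligned}
1 &= \int \exp\left\{ \frac{y\,\theta_i - b(\theta_i)}{a(\phi)} + c(y, \phi) \right\} dy
   = \int f(y; \theta_i, \phi)\, dy \\
0 &= \frac{\partial}{\partial \theta_i} \int f(y; \theta_i, \phi)\, dy
   = \int \frac{y - b'(\theta_i)}{a(\phi)}\, f(y; \theta_i, \phi)\, dy
   = \frac{E(Y_i) - b'(\theta_i)}{a(\phi)} \\
\Rightarrow\quad \mu_i &= E(Y_i) = b'(\theta_i),
\qquad \mathrm{Var}(Y_i) = b''(\theta_i)\, a(\phi).
\end{aligned}
```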
It can easily be seen that the following distributions are members of the exponential family:
- Normal: leads to regular linear regression
- Binomial: leads to logistic regression
- Poisson: leads to the loglinear model
- Multinomial: multinomial response model
- Gamma: lifetime data analysis
GLM provides a unified approach in terms of both modeling and statistical inference.
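Normal GLM
If $Y_i \sim N(\mu_i, \sigma^2)$, then $E(Y_i) = \mu_i$, $\mathrm{Var}(Y_i) = \sigma^2$, and the pdf of $Y_i$ is
$f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right\},$
which can be rewritten as
$f(y_i) = \exp\left\{ \frac{y_i \mu_i - \mu_i^2/2}{\sigma^2} - \frac{y_i^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\}.$
So, in the normal GLM, the natural parameter is $\theta_i = \mu_i$ and the canonical link is the identity link, leading to linear regression.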
glm(): the R function for fitting GLMs
Example 8.1. Education Expenditure Data
Fitting a linear regression to the Education Expenditure Data (Text 1, p144):
    edu <- read.table("P146.txt", header = TRUE)
    y  <- edu[, 2]   # per capita expenditure on public education
    x1 <- edu[, 3]   # income (column index inferred)
    x2 <- edu[, 4]   # thousands under 18 years of age (column index inferred)
    x3 <- edu[, 5]   # number of people per thousand residing in urban areas
    # x4, x5, x6: indicators for the northeast, north central,
    # and south regions (definitions elided)
    fit <- glm(y ~ ..., family = gaussian)
    (summary(fit))
    (vcov(fit))
Example 8.1 (contd.). GLM Fit
The results are identical to those from lm().
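The agreement between glm() with the gaussian family and lm() can be checked on any data set. The sketch below uses R's built-in `cars` data as a stand-in, since the education expenditure file is not assumed to be available; the variable names `fit_lm` and `fit_glm` are illustrative.

```r
# Compare lm() and glm(family = gaussian) on the built-in 'cars' data.
# ('cars' is a stand-in here, not the education expenditure data.)
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian, data = cars)

# Coefficient estimates agree up to numerical tolerance
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))

# The estimated variance-covariance matrices agree as well
same_vcov <- isTRUE(all.equal(vcov(fit_lm), vcov(fit_glm)))

print(c(same_coef = same_coef, same_vcov = same_vcov))
```

Both checks compare with a numerical tolerance rather than exact equality, because the two functions use different fitting routines.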
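GLM for Binary Data
If $Y_i \sim \mathrm{Bernoulli}(\pi_i)$, then $E(Y_i) = \pi_i$ and $\mathrm{Var}(Y_i) = \pi_i(1 - \pi_i)$. The probability function of $Y_i$ is
$f(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}, \quad y_i = 0, 1.$
This can be rewritten in the exponential family form
$f(y_i) = \exp\left\{ y_i \log\frac{\pi_i}{1 - \pi_i} + \log(1 - \pi_i) \right\},$
which shows that the natural parameter is $\theta_i = \log\{\pi_i/(1 - \pi_i)\}$, with $b(\theta_i) = \log(1 + \exp(\theta_i))$, $a(\phi) = 1$, and $c(y_i, \phi) = 0$.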
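As a check, the Bernoulli mean and variance can be recovered from $b(\theta_i)$ using the standard exponential-family identities $E(Y_i) = b'(\theta_i)$ and $\mathrm{Var}(Y_i) = b''(\theta_i)\,a(\phi)$. A short worked version, writing $\pi_i$ for the success probability:

```latex
\begin{aligned}
b(\theta_i) &= \log\left(1 + e^{\theta_i}\right), \qquad
\theta_i = \log\frac{\pi_i}{1 - \pi_i}
\;\Longleftrightarrow\; \pi_i = \frac{e^{\theta_i}}{1 + e^{\theta_i}} \\
b'(\theta_i) &= \frac{e^{\theta_i}}{1 + e^{\theta_i}} = \pi_i = E(Y_i) \\
b''(\theta_i) &= \frac{e^{\theta_i}}{\left(1 + e^{\theta_i}\right)^2}
 = \pi_i (1 - \pi_i) = \mathrm{Var}(Y_i) \quad \text{since } a(\phi) = 1 .
\end{aligned}
```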
Thus, a generalized linear model for binary response data, with the canonical link function, has the form
$\log\frac{\pi_i}{1 - \pi_i} = \mathbf{x}_i^\top \boldsymbol{\beta},$
which is referred to as logistic regression with the logit link.
The function on the left-hand side is called the logit function: $\mathrm{logit}(\pi_i) = \log\{\pi_i/(1 - \pi_i)\}$.
- As in the regular linear model, the explanatory variables can be
  - continuous, giving logistic regression;
  - categorical, giving logistic ANOVA;
  - a mixture of continuous and categorical, giving the logistic analysis of covariance model.
To fit a binary response model using glm(), there are three possibilities for the response:
- The response is a vector of 0s and 1s, representing failure and success.
- The response is a two-column matrix, with the first column containing the number of successes and the second the number of failures.
- The response is a factor, with its first level taken as failure (0) and all others as success (1).
In each of the three cases, the values of the explanatory variables X should be arranged accordingly. The following example illustrates the second case.
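The three response formats describe the same likelihood, so they should give the same coefficient estimates. The sketch below illustrates this on a small made-up data set (the values in `grouped` are hypothetical and chosen only for illustration).

```r
# Hypothetical grouped data: x = covariate, n = trials, y = successes
grouped <- data.frame(x = c(1, 2, 3, 4), n = rep(10, 4), y = c(2, 4, 7, 9))

# Case 2: two-column matrix (successes, failures)
fit_matrix <- glm(cbind(y, n - y) ~ x, family = binomial, data = grouped)

# Expand to one row per trial for cases 1 and 3
long <- data.frame(
  x = rep(grouped$x, times = grouped$n),
  z = unlist(mapply(function(s, f) c(rep(1, s), rep(0, f)),
                    grouped$y, grouped$n - grouped$y, SIMPLIFY = FALSE))
)

# Case 1: vector of 0s and 1s
fit_binary <- glm(z ~ x, family = binomial, data = long)

# Case 3: factor whose first level is the failure
long$zf <- factor(long$z, levels = c(0, 1), labels = c("failure", "success"))
fit_factor <- glm(zf ~ x, family = binomial, data = long)

print(rbind(matrix = coef(fit_matrix),
            binary = coef(fit_binary),
            factor = coef(fit_factor)))
```

The estimates agree up to the convergence tolerance of the fitting algorithm. The reported deviances differ between the grouped and ungrouped formats, because the saturated model is different in each case.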
Example 8.2. Age and Eye Disease, Silvey (1970)
On the Aegean island of Kalythos the male inhabitants suffer from a congenital eye disease, the effect of which becomes more marked with increasing age. Samples of islander males of various ages were tested for blindness, with the results shown below.
    Age (x)          20  35  45  55  70
    No. tested (n)   50  50  50  50  50
    No. blind (y)     6  17  26  37  44
- Fit a logistic regression model relating age to the probability of blindness.
- Estimate the age at which the chance of blindness for a male inhabitant is 50%.
Example 8.2. Solution
The R code for running the GLM is as follows:
    # Fit a GLM to the Age and Eye Disease Data
    kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                           n = rep(50, 5),
                           y = c(6, 17, 26, 37, 44))
    # Add a matrix Ymat to the data frame "kalythos" containing
    # a column of successes and a column of failures
    kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
    fit <- glm(Ymat ~ x, family = binomial, data = kalythos)
    (summary(fit))
    (vcov(fit))
Example 8.2. Solution (contd.)
Highly significant relation!
                 Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -3.53778     0.50232   -7.043  1.88e-12
    x             0.08114     0.01082    7.498  6.47e-14
Variance-covariance matrix of the parameter estimates:
                  (Intercept)              x
    (Intercept)   0.252329917  -0.0051852955
    x            -0.005185295   0.0001170988
- If y is the number of blind at age x and n the number tested, then $y \sim \mathrm{Binomial}(n, \pi(x))$, where $\log\{\pi(x)/(1 - \pi(x))\} = \beta_0 + \beta_1 x$.
- Let $\pi(x) = 0.5$. Then $0 = \beta_0 + \beta_1 x$, giving $x = -\beta_0/\beta_1$, which is estimated from the fitted coefficients as $3.53778/0.08114 \approx 43.60$ years.
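The age at which the fitted probability reaches a given value is obtained by inverting the logit model, using the coefficient estimates (not the variance-covariance entries). A minimal sketch based on the printed estimates; the helper name `age_at_prob` is hypothetical:

```r
# Printed coefficient estimates from the logit fit of Example 8.2
b0 <- -3.53778
b1 <-  0.08114

# Hypothetical helper: solve logit(p) = b0 + b1 * x for x
age_at_prob <- function(p, b0, b1) (qlogis(p) - b0) / b1

# Age at which the chance of blindness is 50%: logit(0.5) = 0, so x = -b0 / b1
age50 <- age_at_prob(0.5, b0, b1)
print(round(age50, 2))

# Sanity check: the fitted probability at that age is 0.5
print(plogis(b0 + b1 * age50))
```

The same helper gives other quantiles, for example `age_at_prob(0.9, b0, b1)` for the age at which the fitted chance of blindness is 90%.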
Logistic Regression with Probit Link
- The logit link function used earlier, $g(\pi) = \log\{\pi/(1 - \pi)\}$, can be replaced by other functions with similar features:
  - it is a monotonically increasing function of $\pi$;
  - it maps the values of $\pi \in (0, 1)$ onto the whole real line $(-\infty, \infty)$.
- The inverse of a continuous, strictly increasing cumulative distribution function (CDF) has these features.
If $g(\pi) = \Phi^{-1}(\pi)$, where $\Phi^{-1}$ denotes the inverse of the standard normal CDF, then this link function is called the probit link.
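Both the logit and the probit are inverse CDFs (of the standard logistic and standard normal distributions), which R exposes as `qlogis()` and `qnorm()`. The sketch below checks the two features listed above on a grid of probabilities; the grid itself is an arbitrary choice.

```r
p <- seq(0.01, 0.99, by = 0.01)

logit_p  <- qlogis(p)   # inverse standard logistic CDF, log(p / (1 - p))
probit_p <- qnorm(p)    # inverse standard normal CDF

# Feature 1: both are strictly increasing in p
increasing <- c(logit = all(diff(logit_p) > 0), probit = all(diff(probit_p) > 0))
print(increasing)

# Feature 2: the endpoints 0 and 1 map to -Inf and +Inf
print(rbind(logit = qlogis(c(0, 1)), probit = qnorm(c(0, 1))))

# Applying the CDF undoes the link
roundtrip <- c(logit  = isTRUE(all.equal(plogis(logit_p), p)),
               probit = isTRUE(all.equal(pnorm(probit_p), p)))
print(roundtrip)
```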
Example 8.2 (contd.)
The R statement for fitting the model with the probit link would be
    fit <- glm(Ymat ~ x, family = binomial(link = probit), data = kalythos)
The output is given below.
                  Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -2.102270    0.276287   -7.609  2.76e-14
    x             0.048147    0.005885    8.181  2.82e-16
The variance-covariance matrix:
                  (Intercept)              x
    (Intercept)   0.076334285  -1.539996e-03
    x            -0.001539996   3.463864e-05
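The logit and probit coefficients are on different scales, so they are easier to compare through the fitted probabilities. The sketch below evaluates both fitted curves from the printed estimates on a grid of ages from 20 to 70 (the grid is an assumption made for illustration).

```r
# Printed estimates from the two fits of Example 8.2
logit_coef  <- c(b0 = -3.53778,  b1 = 0.08114)
probit_coef <- c(b0 = -2.102270, b1 = 0.048147)

age <- 20:70  # illustrative grid of ages

# Fitted probabilities: inverse link applied to the linear predictor
p_logit  <- plogis(logit_coef[["b0"]]  + logit_coef[["b1"]]  * age)
p_probit <- pnorm(probit_coef[["b0"]] + probit_coef[["b1"]] * age)

# Largest disagreement between the two fitted curves on the grid
max_diff <- max(abs(p_logit - p_probit))
print(max_diff)

# Age with a 50% chance of blindness under each link
# (both links map 0.5 to 0, so the formula is -b0 / b1 in each case)
age50 <- c(logit  = -logit_coef[["b0"]]  / logit_coef[["b1"]],
           probit = -probit_coef[["b0"]] / probit_coef[["b1"]])
print(round(age50, 2))
```

On this grid the two links give nearly the same fitted probabilities, which is typical when the fitted probabilities are not extreme.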
GLM for Count Data
- Many discrete response variables have counts as possible outcomes. Examples are
  - Y = number of parties attended in the past month, for a sample of students;
  - Y = number of imperfections on each of a sample of silicon wafers used in manufacturing computer chips;
  - Y = number of customers entering a bank on Monday.
If $Y_i \sim \mathrm{Poisson}(\mu_i)$, then $E(Y_i) = \mathrm{Var}(Y_i) = \mu_i$, and the probability function has the form
$f(y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots,$
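which can be written as
$f(y_i) = \exp\{ y_i \log(\mu_i) - \mu_i - \log(y_i!) \}.$
Clearly this is an exponential family with natural parameter $\theta_i = \log(\mu_i)$, $b(\theta_i) = \exp(\theta_i)$, $a(\phi) = 1$, and $c(y_i, \phi) = -\log(y_i!)$.
This leads to a Poisson loglinear model under the canonical link (the log):
$\log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}.$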
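The Poisson mean and variance can be recovered from $b(\theta_i)$ in the same way as for the other families, using the standard identities $E(Y_i) = b'(\theta_i)$ and $\mathrm{Var}(Y_i) = b''(\theta_i)\,a(\phi)$. A short worked check:

```latex
\begin{aligned}
b(\theta_i) &= e^{\theta_i}, \qquad \theta_i = \log(\mu_i) \\
b'(\theta_i) &= e^{\theta_i} = \mu_i = E(Y_i) \\
b''(\theta_i)\, a(\phi) &= e^{\theta_i} \cdot 1 = \mu_i = \mathrm{Var}(Y_i) .
\end{aligned}
```

The equality of mean and variance is therefore built into the Poisson family.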
Example 8.3. Female Horseshoe Crabs and Their Satellites
http://www.stat.ufl.edu/~aa/intro-cda/appendix.html
In a study of nesting horseshoe crabs, each female horseshoe crab had a male crab attached to her in her nest. The study investigated the factors that affect whether the female crab had any other males, called satellites, residing near her. The response variable is the number of satellites (Sa); the explanatory variables are the female crab's color (C), spine condition (S), shell width (W), and weight (Wt). The data are given at the above web site.
Example 8.3 (contd.). R code:
    # Fit a GLM to the Horseshoe Crabs Data
    crab <- read.table("horseshoecrab.txt", header = TRUE)
    y  <- crab[, 4]   # number of satellites (column index inferred)
    x1 <- crab[, 1]   # color of the female horseshoe crab
    x2 <- crab[, 2]   # spine condition
    x3 <- crab[, 3]   # shell width of the female horseshoe crab (column index inferred)
    x4 <- crab[, 5]   # weight of the female horseshoe crab
    fit <- glm(y ~ x3, family = poisson)
    (summary(fit))
    (vcov(fit))
Example 8.3 (contd.). Results:
                 Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -3.30476     0.54224   -6.095  1.10e-09
    x3            0.16405     0.01997    8.216
Variance-covariance matrix:
                 (Intercept)           x3
    (Intercept)   0.29402590  -0.01078952
    x3           -0.01078952   0.00039861
- Shell width has a highly significant effect on the number of satellites.
- More sophisticated loglinear models can be fitted using glm() by adding more explanatory variables.
- Meaningful explanations can be given to the estimated model parameters.
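One such explanation comes from the log link: a one-unit increase in shell width multiplies the expected number of satellites by exp(slope). The sketch below works this out from the printed estimates; the width value 26 is a hypothetical input chosen only to illustrate a prediction.

```r
# Printed estimates from the Poisson loglinear fit on shell width
b0 <- -3.30476
b1 <-  0.16405

# Multiplicative effect on the expected count per one-unit increase in width
rate_ratio <- exp(b1)
print(round(rate_ratio, 3))

# Expected number of satellites at a given width (inverse of the log link)
expected_count <- function(width) exp(b0 + b1 * width)

w <- 26  # hypothetical shell width
print(expected_count(w))

# The ratio of expected counts one unit apart equals the rate ratio
print(expected_count(w + 1) / expected_count(w))
```

In words, each additional unit of shell width is associated with an increase of roughly 18% in the expected number of satellites under this model.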
Statistical Inferences
Summary of GLM Families by R Function glm()
The exact definitions of the built-in link functions:
- Identity link: $g(\mu) = \mu$
- Logit link: $g(\mu) = \log\{\mu/(1 - \mu)\}$
- Probit link: $g(\mu) = \Phi^{-1}(\mu)$
- Log link: $g(\mu) = \log(\mu)$
- Complementary log-log link: $g(\mu) = \log\{-\log(1 - \mu)\}$
In glm() of the R software, the default link function for each family is the canonical link, under which the natural parameter is modeled by a linear combination of the predictors.
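The default links and the link formulas can be inspected directly in R through the family objects and `make.link()` from the stats package. A small sketch; the test values in `mu` are arbitrary:

```r
# Default links of three common families
default_links <- c(gaussian = gaussian()$link,
                   binomial = binomial()$link,
                   poisson  = poisson()$link)
print(default_links)

# Compare R's built-in link functions with the formulas
mu <- c(0.1, 0.25, 0.5, 0.9)

logit_ok   <- isTRUE(all.equal(make.link("logit")$linkfun(mu),   log(mu / (1 - mu))))
probit_ok  <- isTRUE(all.equal(make.link("probit")$linkfun(mu),  qnorm(mu)))
log_ok     <- isTRUE(all.equal(make.link("log")$linkfun(mu),     log(mu)))
cloglog_ok <- isTRUE(all.equal(make.link("cloglog")$linkfun(mu), log(-log(1 - mu))))

print(c(logit = logit_ok, probit = probit_ok, log = log_ok, cloglog = cloglog_ok))

# Each link object also carries its inverse
cl <- make.link("cloglog")
print(cl$linkinv(cl$linkfun(mu)))
```

A non-default link is requested through the family, for example `binomial(link = "cloglog")`.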
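- Method of estimation for GLM
  - Maximum likelihood method, or
  - Quasi-maximum likelihood method
- Inference for model parameters
  - A $100(1 - \alpha)\%$ confidence interval for $\beta$ is $\hat{\beta} \pm z_{\alpha/2}\,\mathrm{SE}(\hat{\beta})$, where SE is the square root of the corresponding diagonal element of the VCOV output.
  - A level-$\alpha$ test of $H_0: \beta = 0$ rejects the null hypothesis if $|\hat{\beta}/\mathrm{SE}(\hat{\beta})| > z_{\alpha/2}$, or equivalently if the p-value is less than $\alpha$.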
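These Wald calculations can be reproduced by hand from a coefficient estimate and its standard error. The sketch below uses the printed slope of the logit fit in Example 8.2 (estimate 0.08114, standard error 0.01082) with alpha = 0.05, which is an assumed level.

```r
est   <- 0.08114   # printed slope estimate, logit fit of Example 8.2
se    <- 0.01082   # printed standard error
alpha <- 0.05      # assumed significance level

z_crit <- qnorm(1 - alpha / 2)

# 100(1 - alpha)% Wald confidence interval
ci <- est + c(-1, 1) * z_crit * se
print(ci)

# Wald test of H0: beta = 0
z       <- est / se
p_value <- 2 * pnorm(-abs(z))
reject  <- abs(z) > z_crit
print(c(z = z, p_value = p_value))
print(reject)
```

With a fitted glm object, `confint.default(fit)` computes Wald intervals of this form for all coefficients.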
- The Deviance
  - $L_M$: the maximized log-likelihood for a model of interest
  - $L_S$: the maximized log-likelihood for the most complex model possible, i.e., the model which has a separate parameter for each observation and provides a perfect fit to the data. This model is called the saturated model.
  - Deviance $= -2(L_M - L_S)$
The deviance is the likelihood-ratio statistic for comparing the model of interest to the saturated model.
Likelihood-ratio statistic $= -2 \times$ (maximized log-likelihood under the null hypothesis $-$ maximized log-likelihood under the alternative model).
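The definition can be checked numerically for a Poisson model, where the saturated model simply sets each fitted mean equal to the observed count. The sketch below uses simulated data (the seed, sample size, and true coefficients are arbitrary assumptions) and compares the hand-computed deviance with the value reported by R.

```r
set.seed(1)
x <- runif(50, 0, 2)
y <- rpois(50, lambda = exp(0.3 + 0.8 * x))   # simulated counts

fit  <- glm(y ~ x, family = poisson)   # model of interest
fit0 <- glm(y ~ 1, family = poisson)   # null (intercept-only) model

# Maximized log-likelihoods: model of interest and saturated model (mu_i = y_i)
L_M <- sum(dpois(y, lambda = fitted(fit), log = TRUE))
L_S <- sum(dpois(y, lambda = y, log = TRUE))

dev_by_hand <- -2 * (L_M - L_S)
print(c(by_hand = dev_by_hand, from_glm = deviance(fit)))

# Likelihood-ratio statistic for the null model against the model with x:
# it equals the drop in deviance, because L_S cancels
lr_stat <- as.numeric(-2 * (logLik(fit0) - logLik(fit)))
print(c(lr_stat = lr_stat, deviance_drop = fit$null.deviance - deviance(fit)))
```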
Model Checking
glm() has many generic functions for extracting information.
For more details on glm(), search the R help for glm.
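A few of the standard extractor functions that work on a fitted glm object are sketched below. The data frame `d` is a small made-up example, and the selection of functions is illustrative rather than a reproduction of the slide's list.

```r
# Hypothetical small count data set
d <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(1, 3, 2, 6, 8, 13))
fit <- glm(y ~ x, family = poisson, data = d)

coef(fit)                       # estimated coefficients
vcov(fit)                       # variance-covariance matrix of the estimates
deviance(fit)                   # residual deviance
mu_hat <- fitted(fit)           # fitted means on the response scale

# Residuals commonly used for model checking
r_dev     <- residuals(fit, type = "deviance")
r_pearson <- residuals(fit, type = "pearson")

# Squared deviance residuals add up to the residual deviance
print(c(sum(r_dev^2), deviance(fit)))

# Prediction at a new covariate value, on the response scale
new_mean <- predict(fit, newdata = data.frame(x = 7), type = "response")
print(new_mean)
```

Plotting the deviance or Pearson residuals against the fitted values is a common first check of model adequacy.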
Example 8.3 (contd.): fit a larger model
R command:
    fit <- glm(y ~ x1 + x2 + x3 + x4, family = poisson)
Deviance residuals:
        Min       1Q   Median       3Q      Max
    -3.0126  -1.8846  -0.5406   0.9448   4.9602
Coefficients:
                   Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -0.3435447   0.9684204   -0.355   0.72278
    x1           -0.1849325   0.0665236   -2.780   0.00544 **
    x2            0.0399764   0.0568062    0.704   0.48160
    x3            0.0275251   0.0479425    0.574   0.56588
    x4            0.0004725   0.0001649    2.865   0.00417 **
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    (Dispersion parameter for poisson family taken to be 1)
    Null deviance:     632.79 on 172 degrees of freedom
    Residual deviance: 551.85 on 168 degrees of freedom
    AIC: 917.15
    Number of Fisher Scoring iterations: 6
Variance-covariance matrix:
                 (Intercept)          x1          x2          x3          x4
    (Intercept)     0.937838  -0.0211481   0.0004346  -0.0438027   0.0001193
    x1             -0.021148   0.0044254  -0.0013256   0.0004440  -0.0000008
    x2              0.000435  -0.0013256   0.0032269  -0.0003295   0.0000019
    x3             -0.043803   0.0004440  -0.0003295   0.0022985  -0.0000072
    x4              0.000119  -0.0000008   0.0000019  -0.0000072   0.0000000
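The printed deviances and variance-covariance matrix can be combined with the inference tools of the previous section. The sketch below uses only numbers shown in the output: it forms the likelihood-ratio statistic for the four predictors jointly (the drop from the null to the residual deviance) and recovers the standard error and z value of x1 from the diagonal of the matrix.

```r
# Deviances and degrees of freedom from the printed output
null_dev <- 632.79; null_df <- 172
res_dev  <- 551.85; res_df  <- 168

# Likelihood-ratio test of the null model against the four-predictor model
lr_stat <- null_dev - res_dev
lr_df   <- null_df - res_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = lr_df, p_value = p_value))

# Standard error of x1: square root of its diagonal entry in the matrix
se_x1 <- sqrt(0.0044254)
z_x1  <- -0.1849325 / se_x1
print(c(se_x1 = se_x1, z_x1 = z_x1))
```

The joint test is highly significant even though only x1 and x4 are individually significant in the coefficient table, which is one reason to look at both kinds of test.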