Generalized Linear Models - PowerPoint PPT Presentation

Provided by: themegal
1
Generalized Linear Models
Chapter 8
  • www.smu.edu.sg

2
Contents
3
GLM: A General Introduction
What is a Generalized Linear Model? A
traditional linear model is of the form
$Y_i = x_i^T \beta + \varepsilon_i$, $i = 1, 2, \dots, n$,
where $Y_i$, the responses, are assumed to be
independent, normally distributed random
variables with mean $\mu_i = x_i^T \beta$ and constant variance
$\sigma^2$. While traditional linear models are used
extensively in statistical data analysis, there
are types of problems for which they are not
appropriate.
4
GLM: A General Introduction
  • It may not be reasonable to assume that the data are
    normally distributed, e.g., proportion data,
    count data, etc.
  • If the mean of the data is naturally restricted
    to a range of values, the traditional linear
    model may not be appropriate, since the linear
    predictor can take on any value.
  • For example, the mean of a measured proportion
    is between 0 and 1, but the linear predictor of
    the mean in a traditional linear model is not
    restricted to this range.
  • It may not be realistic to assume that the
    variance of the data is constant for all
    observations. For example, it is not unusual to
    observe data where the variance increases with
    the mean of the data.
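
To make the range problem concrete, here is a minimal R sketch that fits an ordinary linear model to observed proportions and then predicts at two ages outside the observed range. The data are the age and blindness counts of Example 8.2 (Silvey, 1970); the prediction ages 5 and 90 are arbitrary values chosen for illustration.

```r
# Proportion data: number blind (y) out of n tested at each age (x)
x <- c(20, 35, 45, 55, 70)
n <- rep(50, 5)
y <- c(6, 17, 26, 37, 44)
p <- y / n

# Ordinary linear model for the observed proportion
lin <- lm(p ~ x)

# Predicted "proportions" at ages 5 and 90 (illustrative ages)
pred <- predict(lin, newdata = data.frame(x = c(5, 90)))
print(pred)  # the first value is below 0, the second is above 1
```

Nothing in the linear predictor keeps the fitted mean inside the interval from 0 to 1, which is the restriction a link function is designed to handle.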

5
GLM: A General Introduction
A generalized linear model (GLM) extends the
traditional linear model and is, therefore,
applicable to a wider range of data analysis
problems. A generalized linear model consists of
the following components:
  • a linear component, defined just as it is for the
    traditional linear model:
    $\eta_i = x_i^T \beta$
  • a link function g, monotonic and differentiable,
    describing how the expected value $\mu_i$ of $Y_i$ is
    related to the linear predictor $\eta_i$:
    $g(\mu_i) = \eta_i$
  • a family of distributions called the exponential
    family, from which $Y_i$, i = 1, 2, ..., n, are
    independently drawn:
    $f(y_i; \theta_i, \phi) = \exp\{ [y_i \theta_i - b(\theta_i)] / a(\phi) + c(y_i, \phi) \}$

6
GLM: A General Introduction
where $\theta_i$ is called the natural parameter and
$\phi$ the dispersion parameter, which is constant
across i and may be known.
  • Naturally, the GLM models $\theta_i$ by a linear
    model, i.e., $\theta_i = x_i^T \beta$.
  • If this is the case, the resulting link is called
    the canonical link.
  • The mean of the exponential family is
    $\mu_i = E(Y_i) = b'(\theta_i)$,
  • i.e., $\theta_i$ is a function of the mean $\mu_i$ only,
    which gives the link function g upon inversion:
    $g(\mu_i) = (b')^{-1}(\mu_i) = \theta_i$.
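
The identity for the mean follows from the fact that the density integrates to one. A short derivation, writing the exponential-family density as $f(y; \theta, \phi) = \exp\{ [y\theta - b(\theta)]/a(\phi) + c(y, \phi) \}$ and assuming that differentiation under the integral sign is allowed:

```latex
\begin{aligned}
1 &= \int f(y;\theta,\phi)\,dy \\
0 &= \frac{\partial}{\partial\theta} \int f(y;\theta,\phi)\,dy
   = \int \frac{y - b'(\theta)}{a(\phi)}\, f(y;\theta,\phi)\,dy
   = \frac{E(Y) - b'(\theta)}{a(\phi)} \\
\Rightarrow\quad E(Y) &= b'(\theta).
\end{aligned}
```

Differentiating once more in the same way gives $\mathrm{Var}(Y) = b''(\theta)\,a(\phi)$, so the variance is in general a function of the mean.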

7
GLM: A General Introduction
It can easily be seen that the following
distributions are members of the exponential
family:
  • Normal: leading to regular linear regression
  • Binomial: leading to logistic regression
  • Poisson: leading to the loglinear model
  • Multinomial: multinomial response model
  • Gamma: lifetime data analysis

GLM provides a unified approach in terms of both
modeling and statistical inference.
8
Normal GLM
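If $Y_i \sim N(\mu_i, \sigma^2)$, then $E(Y_i) = \mu_i$, $\mathrm{Var}(Y_i) = \sigma^2$,
and the pdf of $Y_i$ is
$f(y_i) = (2\pi\sigma^2)^{-1/2} \exp\{ -(y_i - \mu_i)^2 / (2\sigma^2) \}$,
which can be rewritten as
$f(y_i) = \exp\{ (y_i \mu_i - \mu_i^2/2)/\sigma^2 - y_i^2/(2\sigma^2) - \tfrac{1}{2}\log(2\pi\sigma^2) \}$,
so that $\theta_i = \mu_i$, $b(\theta_i) = \theta_i^2/2$ and $a(\phi) = \sigma^2$.
So, in the normal GLM, the canonical link is the
identity link, leading to linear regression.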
9
Normal GLM
glm(): the R function that fits GLMs
Example 8.1. Education Expenditure Data
# Finding a Linear Regression to Education Expenditure Data (Text 1, p144)
edu <- read.table("P146.txt", header = TRUE)
y  <- edu[, 2]   # Per capita expenditure on public education
x1 <- edu[, 3]   # income
x2 <- edu[, 4]   # thousands under 18 years of age
x3 <- edu[, 5]   # Number of people per thousand residing in urban areas
x4 <- ...        # northeast region
x5 <- ...        # north central
x6 <- ...        # South
fit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6, family = gaussian)
(summary(fit))
(vcov(fit))
10
Normal GLM
Example 8.1. Contd. GLM Fit
The results are identical to those from lm()
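
The agreement between glm() with the gaussian family and lm() can be checked directly. The sketch below uses R's built-in mtcars data as a stand-in, since the education data file is not reproduced here; the choice of variables (mpg, wt, hp) is only an assumption for illustration.

```r
# Stand-in data: built-in mtcars (not the education expenditure data)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, family = gaussian, data = mtcars)

# Coefficient estimates from the two fits
print(coef(fit_lm))
print(coef(fit_glm))

# Both comparisons should report TRUE
print(all.equal(coef(fit_lm), coef(fit_glm)))
print(all.equal(vcov(fit_lm), vcov(fit_glm)))
```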
11
GLM for Binary Data
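If $Y_i \sim \text{Bernoulli}(\pi_i)$, then $E(Y_i) = \pi_i$ and
$\mathrm{Var}(Y_i) = \pi_i(1 - \pi_i)$. The probability function of
$Y_i$ is
$f(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$, $y_i = 0, 1$.
This can be rewritten in the form of the exponential
family,
$f(y_i) = \exp\{ y_i \log[\pi_i/(1 - \pi_i)] + \log(1 - \pi_i) \}$,
which shows that the natural parameter is
$\theta_i = \log[\pi_i/(1 - \pi_i)]$, with $b(\theta_i) = \log(1 + \exp(\theta_i))$,
$a(\phi) = 1$, and $c(y_i, \phi) = 0$.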
12
GLM for Binary Data
Thus, a generalized linear model for binary
response data, with the canonical link function, has
the form
$\log[\pi_i/(1 - \pi_i)] = x_i^T \beta$,
which is referred to as logistic
regression with the logit link.
The function on the left is called the logit function:
$\text{logit}(\pi_i) = \log[\pi_i/(1 - \pi_i)]$.
  • As in the regular linear model, the explanatory
    variables can be
  • continuous, giving logistic regression,
  • categorical, giving logistic ANOVA,
  • a mixture of continuous and categorical, giving the
    logistic analysis of covariance model.

13
GLM for Binary Data
To fit a binary response model using glm(), there
are three possibilities for the response:
  • The response is a vector of 0s and 1s,
    representing failure and success,
  • The response is a two-column matrix, with the 1st
    column containing the number of successes and the
    2nd the number of failures,
  • The response is a factor, with its 1st level
    taken as failure (0) and all others as success
    (1).

In each of the three cases, the values of the
explanatory variables X should be arranged
accordingly. The following example illustrates the
second case.
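
As a rough check that the three response formats describe the same model, the sketch below fits the age and blindness counts of Example 8.2 in each form. Expanding the grouped counts into one 0/1 record per individual is an illustrative construction, not part of the original slides.

```r
x <- c(20, 35, 45, 55, 70)      # age
n <- rep(50, 5)                 # number tested
y <- c(6, 17, 26, 37, 44)       # number blind

# Case 2: two-column matrix (successes, failures)
fit_mat <- glm(cbind(y, n - y) ~ x, family = binomial)

# Case 1: one 0/1 value per individual, with age repeated to match
age  <- rep(x, times = n)
resp <- rep(rep(c(1, 0), 5), times = c(rbind(y, n - y)))
fit_01 <- glm(resp ~ age, family = binomial)

# Case 3: factor whose first level is the failure category
resp_f  <- factor(resp, levels = c(0, 1), labels = c("not blind", "blind"))
fit_fac <- glm(resp_f ~ age, family = binomial)

# The three sets of coefficient estimates agree
print(rbind(matrix = coef(fit_mat), zero_one = coef(fit_01), factor = coef(fit_fac)))
```

The coefficient estimates coincide across the three forms, but the residual deviance and degrees of freedom differ between grouped and individual-level data because the saturated model differs.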
14
GLM for Binary Data
Example 8.2. Age and Eye Disease, Silvey (1970)
On the Aegean island of Kalythos the male
inhabitants suffer from a congenital eye disease,
the effect of which becomes more marked with
increasing age. Samples of islander males of
various ages were tested for blindness, with the
results shown below.
Age:         20  35  45  55  70
No. tested:  50  50  50  50  50
No. blind:    6  17  26  37  44
  • Fit a logistic regression model relating age to
    the probability of blindness.
  • Estimate the age at which the chance of blindness
    for a male inhabitant is 50%.

15
GLM for Binary Data
Example 8.2. Solution
The R code for running the GLM is as follows:
# Fit a GLM to Age and Eye Disease Data
kalythos <- data.frame(x = c(20, 35, 45, 55, 70), n = rep(50, 5), y = c(6, 17, 26, 37, 44))
# add a matrix Ymat to the data frame "kalythos"
# containing a column of successes and a column of failures
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
fit <- glm(Ymat ~ x, family = binomial, data = kalythos)
(summary(fit))
(vcov(fit))
16
GLM for Binary Data
Example 8.2. Solution 1)
Highly significant relation!
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.53778    0.50232  -7.043 1.88e-12
x            0.08114    0.01082   7.498 6.47e-14
Variance-Covariance Matrix of the Parameter
Estimates
             (Intercept)             x
(Intercept)  0.252329917 -0.0051852955
x           -0.005185295  0.0001170988
  • If y is the number of blind at age x and n the
    number tested, then $y \sim \text{Binomial}(n, \pi(x))$, where
  • $\log[\pi(x)/(1 - \pi(x))] = \beta_0 + \beta_1 x$.
  • Let $\pi(x) = 0.5$. Then we have $0 = \beta_0 + \beta_1 x$,
    giving $x = -\beta_0/\beta_1$, which is estimated from the
    coefficient estimates as 3.53778/0.08114 = 43.60.
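
The age at which the fitted probability of blindness equals 0.5 can also be computed directly in R. The sketch below refits the logistic model and takes the ratio of the coefficient estimates (not of the variance-covariance entries); the name age50 is introduced here only for illustration.

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)

# logit(0.5) = 0, so the 50% age solves b0 + b1 * x = 0
b <- coef(fit)
age50 <- unname(-b[1] / b[2])
print(age50)  # about 43.6 years
```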

17
GLM for Binary Data
Logistic Regression with Probit Link
  • The logit link function, i.e., $g(\pi) = \log[\pi/(1 - \pi)]$,
    used in the earlier case can be replaced
    by other functions with similar features:
  • it is a monotonically increasing function of $\pi$,
  • it maps the values of $\pi \in (0, 1)$ onto the whole
    real line $(-\infty, \infty)$.
  • The inverse of a continuous cumulative distribution
    function (CDF) possesses these features.

If $g(\pi) = \Phi^{-1}(\pi)$, where $\Phi^{-1}$ denotes the inverse
of the standard normal CDF, then this link
function is called the probit link.
18
GLM for Binary Data
Example 8.2. Contd.
The R statement for fitting a logistic regression
with probit link would be
fit <- glm(Ymat ~ x, family = binomial(link = probit), data = kalythos)
The output is given below.
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.102270   0.276287  -7.609 2.76e-14
x            0.048147   0.005885   8.181 2.82e-16
The variance-covariance matrix:
             (Intercept)             x
(Intercept)  0.076334285 -1.539996e-03
x           -0.001539996  3.463864e-05
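
To see how little the choice between logit and probit matters for these data, the following sketch fits both links and compares the fitted probabilities and the estimated 50% age. The object names (fit_logit, fit_probit, age50_probit) are illustrative.

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))

fit_logit  <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)
fit_probit <- glm(cbind(y, n - y) ~ x, family = binomial(link = probit),
                  data = kalythos)

# Fitted probabilities of blindness under each link
print(cbind(age = kalythos$x,
            logit = fitted(fit_logit),
            probit = fitted(fit_probit)))

# Under the probit link, the inverse normal CDF of 0.5 is 0,
# so the 50% age is again -b0 / b1
bp <- coef(fit_probit)
age50_probit <- unname(-bp[1] / bp[2])
print(age50_probit)  # about 43.7 years
```

The coefficients are on different scales under the two links, so they should not be compared directly; the fitted probabilities and derived quantities such as the 50% age are the comparable outputs.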
19
GLM for Count Data
  • Many discrete response variables have counts as
    possible outcomes. Examples are
  • Y = number of parties attended in the past
    month, for a sample of students,
  • Y = number of imperfections on each of a sample
    of silicon wafers used in manufacturing computer
    chips,
  • Y = number of customers entering a bank on Monday.

If $Y_i \sim \text{Poisson}(\mu_i)$, then $E(Y_i) = \mathrm{Var}(Y_i) = \mu_i$,
and the probability function has the form
$f(y_i) = \mu_i^{y_i} e^{-\mu_i} / y_i!$, $y_i = 0, 1, 2, \dots$
20
GLM for Count Data
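which can be written as
$f(y_i) = \exp\{ y_i \log(\mu_i) - \mu_i - \log(y_i!) \}$.
Clearly this is an exponential family with
natural parameter $\theta_i = \log(\mu_i)$, $b(\theta_i) = \exp(\theta_i)$,
$a(\phi) = 1$, and $c(y_i, \phi) = -\log(y_i!)$.
This leads to a Poisson loglinear model under the
canonical link (the log):
$\log(\mu_i) = x_i^T \beta$.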
21
GLM for Count Data
Example 8.3. Female Horseshoe Crabs and Their
Satellites
http://www.stat.ufl.edu/~aa/intro-cda/appendix.html
In a study of nesting horseshoe crabs, each
female horseshoe crab had a male crab attached to
her in her nest. The study investigated the
factors that affect whether the female crab had
any other males, called satellites, residing
near her. The response variable is the number
of satellites (Sa); the explanatory variables are the
female crab's color (C), spine condition (S),
shell width (W), and weight (Wt). The data are
given at the above web site, and are also
available here
22
GLM for Count Data
Example 8.3. Contd.
# Fit a GLM to Horseshoe Crabs Data
crab <- read.table("horseshoecrab.txt", header = TRUE)
y  <- crab[, 4]   # Number of satellites
x1 <- crab[, 1]   # Color of the female horseshoe crab
x2 <- crab[, 2]   # Spine Condition
x3 <- crab[, 3]   # Shell width of the female horseshoe crab
x4 <- crab[, 5]   # Weight of the female horseshoe crab
fit <- glm(y ~ x3, family = poisson)
(summary(fit))
(vcov(fit))
23
GLM for Count Data
Example 8.3. Contd.
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54224  -6.095 1.10e-09
x3           0.16405    0.01997   8.216
Variance-covariance matrix (COV):
            (Intercept)          x3
(Intercept)  0.29402590 -0.01078952
x3          -0.01078952  0.00039861
  • Shell width has a highly significant effect on
    the number of satellites.
  • More sophisticated loglinear models can be fitted
    using glm() by adding more explanatory variables.
  • Meaningful explanations can be given to the
    estimated model parameters.
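
One such explanation comes from the log link: exponentiating a slope gives the multiplicative change in the expected count per one-unit increase in the predictor. The sketch below applies this to the estimates reported for the shell-width model; the widths 25 and 26 are arbitrary illustrative values.

```r
# Estimates reported for the Poisson model with shell width (x3)
b0 <- -3.30476
b1 <-  0.16405

# Multiplicative effect on the expected number of satellites
# of a one-unit increase in shell width
rate_ratio <- exp(b1)
print(rate_ratio)  # roughly 1.18, i.e. about an 18% increase

# Expected counts at two illustrative widths one unit apart
mu_25 <- exp(b0 + b1 * 25)
mu_26 <- exp(b0 + b1 * 26)
print(c(mu_25 = mu_25, mu_26 = mu_26, ratio = mu_26 / mu_25))
```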

24
Statistical Inferences
Summary of GLM Families by R Function glm()
25
Statistical Inferences
The exact definitions of the built-in link
functions:
  • Identity link: $g(\mu) = \mu$
  • Logit link: $g(\mu) = \log[\mu/(1 - \mu)]$
  • Probit link: $g(\mu) = \Phi^{-1}(\mu)$
  • Log link: $g(\mu) = \log(\mu)$
  • Complementary log-log link: $g(\mu) = \log(-\log(1 - \mu))$

In glm() of the R software, the default link function
is the canonical link, where the natural parameter
is modeled by the linear combination of the
predictors.
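
The default link of each family can be inspected from the family objects in base R. A small sketch (the list name fams is illustrative):

```r
# Family objects with their default links
fams <- list(gaussian = gaussian(), binomial = binomial(),
             poisson = poisson(), Gamma = Gamma(),
             inverse.gaussian = inverse.gaussian())
links <- sapply(fams, function(f) f$link)
print(links)

# A family object also carries the link function g and its inverse
g     <- binomial()$linkfun
g_inv <- binomial()$linkinv
print(g(0.5))         # logit(0.5) = 0
print(g_inv(g(0.3)))  # recovers 0.3
```

A non-default link is requested through the family argument, for example binomial(link = "probit") or binomial(link = "cloglog").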
26
Statistical Inferences
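  • Method of Estimation for GLM
  • Maximum likelihood method or
  • Quasi-maximum likelihood method
  • Inference for Model Parameters
  • $100(1 - \alpha)\%$ confidence interval for $\beta$:
    $\hat\beta \pm z_{\alpha/2}\, SE(\hat\beta)$,
  • where SE is the square root of a diagonal
    element of the output of vcov()
  • Level $\alpha$ test of $H_0: \beta = 0$ rejects the null
    hypothesis if $|\hat\beta| / SE(\hat\beta) > z_{\alpha/2}$
  • or, equivalently, if the p-value is less than $\alpha$.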
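
The interval and test above can be computed by hand from coef() and vcov(). The sketch below does this for the logistic fit to the Example 8.2 data, with alpha = 0.05 chosen for illustration.

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # standard errors from the VCOV diagonal

alpha <- 0.05
zcrit <- qnorm(1 - alpha / 2)

# Wald confidence intervals
ci <- cbind(lower = est - zcrit * se, upper = est + zcrit * se)
print(ci)

# Wald z statistics and two-sided p-values
zstat <- est / se
pval  <- 2 * pnorm(-abs(zstat))
print(cbind(z = zstat, p = pval))
```

These Wald intervals match what confint.default(fit) returns; they rely on the large-sample normal approximation for the estimates.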

27
Statistical Inferences
  • The Deviance
  • $L_M$: the maximized log-likelihood for a model of
    interest
  • $L_S$: the maximized log-likelihood for the most
    complex model possible, i.e., the model which has
    a separate parameter for each observation and
    provides a perfect fit to the data. This model is
    called the saturated model.
  • Deviance $= -2(L_M - L_S)$

The deviance is the likelihood-ratio statistic
for comparing the model of interest to the
saturated model. Likelihood-ratio statistic
$= -2$(maximized log-likelihood under the null
hypothesis $-$ maximized log-likelihood under the
alternative model).
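
The definition can be verified numerically. In the sketch below, the saturated model for the grouped Example 8.2 data is obtained by giving each age group its own parameter via factor(x); this construction is an assumption for illustration, not something shown in the slides.

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))

# Model of interest: logit linear in age
fit_M <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)

# Saturated model: one parameter per observation (age group)
fit_S <- glm(cbind(y, n - y) ~ factor(x), family = binomial, data = kalythos)

L_M <- as.numeric(logLik(fit_M))
L_S <- as.numeric(logLik(fit_S))

dev_manual <- -2 * (L_M - L_S)
print(c(manual = dev_manual, reported = deviance(fit_M)))
```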
28
Model Checking
Many generic functions are available for extracting
information from a fitted glm() object
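
A few commonly used extractor functions from base R, applied to the logistic fit of Example 8.2 as a sketch (the age 50 used for prediction is an arbitrary illustrative value):

```r
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)

print(coef(fit))          # parameter estimates
print(vcov(fit))          # variance-covariance matrix of the estimates
print(deviance(fit))      # residual deviance
print(df.residual(fit))   # residual degrees of freedom
print(AIC(fit))           # Akaike information criterion
print(fitted(fit))        # fitted probabilities

# Deviance residuals; their squares sum to the residual deviance
dres <- residuals(fit, type = "deviance")
print(dres)

# Predicted probability of blindness at age 50
p50 <- predict(fit, newdata = data.frame(x = 50), type = "response")
print(p50)
```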
29
Model Checking
For more details on glm(), see the R help page:
?glm
30
Model Checking
Example 8.3. Contd.: fit a larger model
R Command
fit <- glm(y ~ x1 + x2 + x3 + x4, family = poisson)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0126  -1.8846  -0.5406   0.9448   4.9602
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3435447  0.9684204  -0.355  0.72278
x1          -0.1849325  0.0665236  -2.780  0.00544 **
x2           0.0399764  0.0568062   0.704  0.48160
x3           0.0275251  0.0479425   0.574  0.56588
x4           0.0004725  0.0001649   2.865  0.00417 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
31
Model Checking
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 551.85 on 168 degrees of freedom
AIC: 917.15
Number of Fisher Scoring iterations: 6
            (Intercept)         x1         x2         x3         x4
(Intercept)    0.937838 -0.0211481  0.0004346 -0.0438027  0.0001193
x1            -0.021148  0.0044254 -0.0013256  0.0004440 -0.0000008
x2             0.000435 -0.0013256  0.0032269 -0.0003295  0.0000019
x3            -0.043803  0.0004440 -0.0003295  0.0022985 -0.0000072
x4             0.000119 -0.0000008  0.0000019 -0.0000072  0.0000000
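
The null and residual deviances reported above give a likelihood-ratio test of the four-predictor model against the intercept-only model. A minimal sketch using only the printed values, and assuming the usual large-sample chi-squared approximation:

```r
# Values copied from the glm() summary output
null_dev <- 632.79; null_df <- 172
res_dev  <- 551.85; res_df  <- 168

# Likelihood-ratio statistic: the drop in deviance
lr <- null_dev - res_dev
df <- null_df - res_df

# Upper-tail chi-squared probability
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
print(c(LR = lr, df = df, p = p_value))
```

The same comparison between two nested fitted models can be requested with anova(fit_small, fit_large, test = "Chisq"), where fit_small and fit_large are hypothetical names for the two glm objects.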