1
Intermediate Social Statistics: Lecture 5
  • Hilary Term 2006
  • Dr David Rueda

2
Today: Models for Count Data. Poisson Regression
and Negative Binomial Models.
  • Models for Count Data.
  • Poisson Regression: setup, interpretation and
    analysis.
  • Negative Binomial: setup, interpretation and
    analysis.

3
Models for Count Data (1)
  • Count data simply refers to variables that can be
    measured by counts or summary frequency data.
  • Event counts are variables that, for each
    observation, record the number of occurrences
    of an event in a fixed domain.
  • The fixed domain for each observation can be
    time-related (day, month, year) or space
    (geographic unit, individual, etc).
  • The observations are non-negative, integer
    valued, and generally contain a small number of
    meaningful values with high proportion of zeros.
  • Count data are abundant in many disciplines and
    in many applications in social science.
  • In Political Science they are very common: in IR
    (conflict between countries, etc.), presidential
    vetoes per year, the number of Parliamentary
    representatives that switch parties, the months a
    government lasts, etc.
  • In all these cases, our observations would be
    non-negative, integer valued, and often would
    contain a small number of meaningful values with
    high proportion of zeros.

4
Models for Count Data (2)
  • We model the data in order to describe and
    predict the counts.
  • The preponderance of zeros, the small values
    and the discrete nature of the dependent variable
    make OLS estimation inappropriate.
  • Instead we can use two different methods: Poisson
    regression and negative binomial regression.

5
Poisson Regression Setup (1)
  • An interesting early application of the Poisson
    distribution: the number of soldiers kicked to
    death by horses in the Prussian army
    (Bortkewitsch 1898).
  • In general, the dependent variable Y is the
    number of occurrences of an event we are
    interested in.
  • The Poisson regression model specifies that each
    $y_i$ is drawn from a Poisson distribution with
    parameter $\lambda_i$ (which is related to the
    regressors $x_i$).
  • What does this mean? It means the probability of
    observing $Y_i = k$ is
  • $P(Y_i = k) = \frac{e^{-\lambda_i} \lambda_i^{k}}{k!}, \quad k = 0, 1, 2, \ldots$
  • The most common specification for $\lambda_i$ is
    the log-linear model
  • $\ln \lambda_i = \beta' x_i$
  • The expected number of events per period is
  • $E[Y_i \mid x_i] = \lambda_i = e^{\beta' x_i}$
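The two formulas on this slide can be evaluated directly. The following is a minimal sketch, not the lecture's own code: the coefficient vector and the single observation are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_count(beta: list[float], x: list[float]) -> float:
    """Log-linear mean: lambda_i = exp(beta' x_i)."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))

# Hypothetical coefficients (intercept, slope) and one observation (1, x).
beta = [0.5, 0.2]
x_i = [1.0, 3.0]

lam_i = expected_count(beta, x_i)  # exp(0.5 + 0.2 * 3) = exp(1.1)

# Probabilities of observing 0, 1, 2, ... events for this observation.
probs = [poisson_pmf(k, lam_i) for k in range(60)]
print(f"lambda_i = {lam_i:.4f}, P(Y=0) = {probs[0]:.4f}, total = {sum(probs):.6f}")
```

Because the mean is the exponential of a linear index, it is always positive, which is one reason the log-linear form suits count outcomes.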

6
Poisson Regression Setup (2)
  • We use maximum likelihood to estimate the
    parameters of the regression model.
  • Note that the mean in the log-linear model
    ($\ln \lambda_i = \beta' x_i$) is nonlinear in the
    regressors, which means that the effect of a
    change in $x_i$ will depend not only on $\beta$
    (as in the classical linear regression), but
    also on the value of $x_i$.
  • What do we gain from using a Poisson
    distribution?
  • Imagine the probability of the number of soldiers
    kicked to death by horses in the Prussian army.
  • The following is the plot of the Poisson
    probability mass function for four values of
    $\lambda$.
7
(Figure: Poisson probability mass function for four values of $\lambda$; no transcript available.)
8
Poisson Regression Setup (2)
  • However:
  • What are we assuming in a Poisson regression?
    Equidispersion.
  • This means that the mean equals the variance.
  • This may not be the case, and then a Poisson
    regression is not appropriate.
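Two points from the setup can be written out explicitly. This is a short worked statement under the log-linear specification used in this lecture; the index $j$ for an individual regressor is notation introduced here for illustration. The first expression shows why the effect of a regressor depends on the values of the regressors through $\lambda_i$, and the second states equidispersion.

```latex
\frac{\partial E[Y_i \mid x_i]}{\partial x_{ij}}
  = \beta_j \exp(\beta' x_i)
  = \beta_j \lambda_i ,
\qquad
E[Y_i \mid x_i] = \operatorname{Var}[Y_i \mid x_i] = \lambda_i .
```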

9
Poisson Regression Interpretation (1)
  • In Stata, we will obtain a z test value and a p
    value for a two-tailed significance test,
    P > |z|. The null hypothesis is $b_i = 0$.
  • Interpretation of the coefficients:
  • A positive coefficient means a one-unit increase
    in the independent variable has the effect of
    increasing the predicted number of events.
  • Often we want to compare the rate at which events
    occur. We can do this by calculating incidence
    rate ratios (in Stata, irr). This means we can
    estimate the incidence rate ratio associated with
    a one-unit increase in an independent variable
    (keeping the rest constant).
  • We can also compute percentage changes (more in
    computer session).
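As an illustration of the interpretation above, an incidence rate ratio is the exponential of a coefficient, and the percentage change in the expected count follows from it. The sketch below is illustrative only: the function names are assumptions, and the coefficient is the gender estimate reported in this lecture's Poisson output.

```python
import math

def incidence_rate_ratio(coef: float) -> float:
    """Multiplicative change in the expected count for a one-unit increase."""
    return math.exp(coef)

def percent_change(coef: float, delta: float = 1.0) -> float:
    """Percentage change in the expected count for a change of delta units."""
    return 100.0 * (math.exp(coef * delta) - 1.0)

# gender coefficient from the Poisson output in this lecture.
gender_coef = -0.4009209

irr = incidence_rate_ratio(gender_coef)
pct = percent_change(gender_coef)
print(f"IRR = {irr:.4f}, percent change = {pct:.1f}%")
```

An IRR below 1 corresponds to a negative coefficient: here a one-unit increase in gender is associated with an expected count roughly one third lower, holding the other variables constant.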

10
Poisson Regression Interpretation (2)
  • More intuitive (as usual).
  • We can also calculate the predicted number of
    events associated with a particular set of
    independent variable values. For example, the
    predicted number of events when $x_1 = 33$,
    $x_2 = 0$, $x_3 = 0$ and $x_4 = 0$.
  • Predicted number of events
    $= \exp(b_0 + b_1 \cdot 33 + b_2 \cdot 0 + b_3 \cdot 0 + b_4 \cdot 0)$.
  • For goodness of fit:
  • The chi-square test tells us whether all the
    slope estimates in the model are jointly
    insignificant (the usual likelihood ratio test).
  • Stata also provides a Pseudo R squared.
  • We can also perform a likelihood ratio
    chi-squared statistic test comparing our model
    with a model taking into consideration all
    possible effects of the variables (more in
    computer session).
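The predicted number of events for a given profile can be sketched as follows. This is an illustration, not the procedure from the computer session: the coefficients are those reported in this lecture's Poisson output, while the covariate profile (scores of 50) and the function name are hypothetical.

```python
import math

def predicted_count(intercept: float, coefs: dict, values: dict) -> float:
    """exp(b0 + b1*x1 + ... + bm*xm) for one covariate profile."""
    linear_index = intercept + sum(coefs[name] * values[name] for name in coefs)
    return math.exp(linear_index)

# Estimates from the Poisson output in this lecture.
intercept = 3.088587
coefs = {"gender": -0.4009209, "mathnce": -0.0035232, "langnce": -0.0121521}

# Hypothetical profiles: same test scores, gender coded 1 and 2.
profile_1 = {"gender": 1, "mathnce": 50, "langnce": 50}
profile_2 = {"gender": 2, "mathnce": 50, "langnce": 50}

pred_1 = predicted_count(intercept, coefs, profile_1)
pred_2 = predicted_count(intercept, coefs, profile_2)
print(f"gender=1: {pred_1:.2f} days, gender=2: {pred_2:.2f} days")
```

The ratio of the two predictions equals the exponential of the gender coefficient, which ties predicted counts back to the incidence rate ratio.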

11
Poisson Regression Analysis (1)
  • Data we'll look at:
  • Los Angeles High School data.
  • 316 students at two Los Angeles high schools.
  • Example taken from UCLA's Statistical Computing
    Resources:
    http://www.ats.ucla.edu/stat/stata/stat130/count2.htm
  • Explanatory variables we are using:
  • gender: female = 1, male = 2.
  • mathpr: Math Exam Score (percentile rank).
  • langpr: Language Exam Score (percentile rank).
  • Dependent variable:
  • daysabs: Number of days absent.

12
Poisson Regression Analysis (2)
  • Theoretical claims:
  • We think that (controlling for academic
    attainment) being male is associated with a
    higher number of days absent.
  • We will test our hypothesis with a Poisson
    regression analysis.
  • Should we do OLS?
  • Let's see a histogram.

13
(Figure: histogram of daysabs; no transcript available.)
14
Poisson Regression Analysis (2)
  • The data are strongly skewed to the right and
    there are a large number of 0s, so OLS would be
    inappropriate.
  • Let's do a Poisson regression.

15
Poisson Regression Analysis (3)
    Poisson regression                        Number of obs   =        316
                                              LR chi2(3)      =     175.27
                                              Prob > chi2     =     0.0000
    Log likelihood = -1547.9709               Pseudo R2       =     0.0536

    ----------------------------------------------------------------------------
     daysabs |      Coef.  Std. Err.         z    P>|z|     [95% Conf. Interval]
    ---------+------------------------------------------------------------------
      gender |  -.4009209   .0484122    -8.281    0.000     -.495807   -.3060348
     mathnce |  -.0035232   .0018213    -1.934    0.053     -.007093    .0000466
     langnce |  -.0121521   .0018348    -6.623    0.000    -.0157483   -.0085559
       _cons |   3.088587   .1017365    30.359    0.000     2.889187    3.287987
    ----------------------------------------------------------------------------
  • More interpretation in the computer session, but...
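The z, P>|z| and confidence interval columns can be reproduced from a coefficient and its standard error. The sketch below assumes the usual Wald construction with a standard normal reference distribution; the function name is an assumption, and the inputs are the mathnce row of the output above.

```python
from statistics import NormalDist

def wald_summary(coef: float, se: float, level: float = 0.95):
    """Return (z, two-sided p value, confidence interval) for one coefficient."""
    std_normal = NormalDist()
    z = coef / se
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    crit = std_normal.inv_cdf(1.0 - (1.0 - level) / 2.0)
    return z, p_value, (coef - crit * se, coef + crit * se)

# Coefficient and standard error for mathnce, copied from the output above.
z, p_value, (ci_low, ci_high) = wald_summary(-0.0035232, 0.0018213)
print(f"z = {z:.3f}, p = {p_value:.3f}, CI = ({ci_low:.6f}, {ci_high:.7f})")
```

The interval just includes zero, which matches a p value slightly above 0.05 for mathnce.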

16
Poisson Regression Analysis (4)
  • Problems? In a Poisson distribution, the mean and
    the variance are the same.
  • In a preliminary way, we can test this by
    checking our dependent variable
                      number days absent
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                 316
    25%            1              0       Sum of Wgt.         316

    50%            3                      Mean           5.810127
                            Largest       Std. Dev.      7.449003
    75%            8             35
    90%           14             35       Variance       55.48764
    95%           23             41       Skewness       2.250587
    99%           35             45       Kurtosis       8.949302

  • The variance is nearly 10 times larger than the
    mean.
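The preliminary check amounts to dividing the variance by the mean. A minimal sketch follows: the reported values are the summary statistics for daysabs shown above, while the toy sample and the function name are hypothetical and only show how the same ratio would be computed from raw counts.

```python
from statistics import fmean, variance

def dispersion_ratio(counts) -> float:
    """Sample variance divided by sample mean; about 1 under equidispersion."""
    return variance(counts) / fmean(counts)

# Summary statistics reported for daysabs in this lecture.
reported_mean = 5.810127
reported_variance = 55.48764
reported_ratio = reported_variance / reported_mean

# Hypothetical toy sample, only to show the function on raw counts.
toy_counts = [0, 0, 0, 1, 3, 8, 14, 35]

print(f"reported ratio = {reported_ratio:.2f}")
print(f"toy sample ratio = {dispersion_ratio(toy_counts):.2f}")
```

This compares the unconditional mean and variance, so it is only a rough indication; the Poisson assumption concerns the mean and variance conditional on the regressors.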

17
Poisson Regression Analysis (5)
  • In a more systematic way, we can test this with a
    likelihood ratio chi-squared statistic test
    comparing our model with a model taking into
    consideration all possible effects of the
    variables.
  • If the test is significant, the Poisson
    regression is not appropriate.
  • Goodness-of-fit chi2 = 2234.546
  • Prob > chi2(312) = 0.0000
  • The large value of the chi-square is another sign
    that the Poisson distribution is not a good
    choice.
  • What do we do now?

18
Negative Binomial Setup (1)
  • The negative binomial model is used to estimate
    counts of an event when the counts show
    overdispersion (extra-Poisson variation).
  • Some details about the negative binomial
    distribution (in general):
  • The number of successes is fixed and we're
    interested in the number of failures before
    reaching the fixed number of successes.
  • The experiment consists of a sequence of
    independent trials.
  • Each trial has two possible outcomes, S or F.
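  • The probability of success, $p = P(S)$, is
    constant from one trial to another.
  • The experiment continues until a total of r
    successes have been observed.
  • A random variable X which follows a negative
    binomial distribution is denoted
    $X \sim NB(r, p)$. With X counting the failures
    before the r-th success, its probabilities are
    computed with the formula
  • $P(X = k) = \binom{k + r - 1}{k} p^{r} (1 - p)^{k}, \quad k = 0, 1, 2, \ldots$
  • The expected value and the variance are
  • $E[X] = \frac{r(1 - p)}{p}, \qquad \operatorname{Var}[X] = \frac{r(1 - p)}{p^{2}}$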
19
Negative Binomial Setup (2)
  • We assume that the model is the same as the one
    described in the Poisson Regression case, except
    that the variance is allowed to exceed the mean.
  • The log of the mean, $\lambda$, is a linear
    function of some independent variables:
  • $\log(\lambda) = \text{intercept} + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m$
  • This means that $\lambda$ is the exponential
    function of the independent variables:
  • $\lambda = \exp(\text{intercept} + b_1 X_1 + b_2 X_2 + \cdots + b_m X_m)$
  • Before, we assumed that the distribution of Y
    (the number of occurrences of an event) was
    Poisson.
  • A negative binomial distribution can be
    understood as a gamma mixture of Poisson random
    variables (for more details, see Long, Regression
    Models for Categorical and Limited Dependent
    Variables, Sage 1997).
  • We use maximum likelihood to estimate the
    parameters of the regression model.
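The gamma-mixture view can be written compactly. The statement below assumes the mean-dispersion (often called NB2) parameterization, in which a multiplicative gamma error with mean 1 and variance $\alpha$ is added to the Poisson mean; the symbol $\nu_i$ for that error is notation introduced here for illustration.

```latex
Y_i \mid \nu_i \sim \text{Poisson}(\lambda_i \nu_i), \qquad
\nu_i \sim \text{Gamma}\!\left(\tfrac{1}{\alpha},\, \alpha\right), \quad
E[\nu_i] = 1, \quad \operatorname{Var}[\nu_i] = \alpha
\\[6pt]
E[Y_i \mid x_i] = \lambda_i, \qquad
\operatorname{Var}[Y_i \mid x_i] = \lambda_i + \alpha \lambda_i^{2}
```

When $\alpha = 0$ the variance collapses to $\lambda_i$ and the model reduces to the Poisson regression.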

20
Negative Binomial Interpretation (1)
  • Same as with Poisson regression.
  • In Stata, we will obtain a z test value and a p
    value for a two-tailed significance test,
    P > |z|. The null hypothesis is $b_i = 0$.
  • Interpretation of the coefficients:
  • A positive coefficient means a one-unit increase
    in the independent variable has the effect of
    increasing the predicted number of events.
  • Often we want to compare the rate at which events
    occur. We can do this by calculating incidence
    rate ratios (in Stata, irr). This means we can
    estimate the incidence rate ratio associated with
    a one-unit increase in an independent variable
    (keeping the rest constant).
  • We can also compute percentage changes (more in
    computer session).

21
Negative Binomial Interpretation (2)
  • More intuitive (as usual).
  • We can also calculate the predicted number of
    events associated with a particular set of
    independent variable values. For example, the
    predicted number of events when $x_1 = 33$,
    $x_2 = 0$, $x_3 = 0$ and $x_4 = 0$.
  • Predicted number of events
    $= \exp(b_0 + b_1 \cdot 33 + b_2 \cdot 0 + b_3 \cdot 0 + b_4 \cdot 0)$.
  • For goodness of fit:
  • The chi-square test tells us whether all the
    slope estimates in the model are jointly
    insignificant (the usual likelihood ratio test).
  • Stata also provides a Pseudo R squared.
  • We also get an estimate for a parameter measuring
    overdispersion (alpha).
  • Stata provides a likelihood ratio test for this
    estimate. A significant p value means that the
    data are not well described by a Poisson
    distribution (when the parameter is not
    significantly different from 0, NB and Poisson
    are equivalent).
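The model chi-square and the Pseudo R squared are both functions of two log likelihoods. The sketch below assumes the Pseudo R squared is McFadden's measure (one minus the ratio of the fitted to the intercept-only log likelihood) and backs out the intercept-only value from the figures reported in this lecture's two outputs; the function names are assumptions.

```python
import math

def null_log_likelihood(ll_model: float, lr_chi2: float) -> float:
    """Back out the intercept-only log likelihood from LR = 2 * (ll_model - ll_null)."""
    return ll_model - lr_chi2 / 2.0

def pseudo_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - ll_model / ll_null."""
    return 1.0 - ll_model / ll_null

def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# (log likelihood, LR chi2(3)) as reported in this lecture's outputs.
fits = {"poisson": (-1547.9709, 175.27), "negbin": (-880.87312, 20.74)}

results = {}
for name, (ll, lr) in fits.items():
    ll_null = null_log_likelihood(ll, lr)
    results[name] = (pseudo_r2(ll, ll_null), chi2_sf_df3(lr))
    print(f"{name}: pseudo R2 = {results[name][0]:.4f}, Prob > chi2 = {results[name][1]:.4f}")
```

The closed-form tail probability is specific to 3 degrees of freedom, which is the number of slope coefficients in both models here.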

22
Negative Binomial Analysis (1)
  • Data we'll look at (same as before):
  • Los Angeles High School data.
  • 316 students at two Los Angeles high schools.
  • Explanatory variables we are using:
  • gender: female = 1, male = 2.
  • mathpr: Math Exam Score (percentile rank).
  • langpr: Language Exam Score (percentile rank).
  • Dependent variable:
  • daysabs: Number of days absent.
  • This time we estimate a negative binomial model.

23
Negative Binomial Analysis (2)
    Negative binomial regression              Number of obs   =        316
                                              LR chi2(3)      =      20.74
                                              Prob > chi2     =     0.0001
    Log likelihood = -880.87312               Pseudo R2       =     0.0116

    ----------------------------------------------------------------------------
     daysabs |      Coef.  Std. Err.         z    P>|z|     [95% Conf. Interval]
    ---------+------------------------------------------------------------------
      gender |  -.4311844   .1396656    -3.087    0.002     -.704924   -.1574448
     mathnce |   -.001601     .00485    -0.330    0.741    -.0111067    .0079048
     langnce |  -.0143475   .0055815    -2.571    0.010    -.0252871    -.003408
       _cons |   3.147254   .3211669     9.799    0.000     2.517778    3.776729
    ---------+------------------------------------------------------------------
    /lnalpha |   .2533877   .0955362                        .0661402    .4406351
    ---------+------------------------------------------------------------------
       alpha |   1.288383   .1230871    10.467    0.000     1.068377    1.553694
    ----------------------------------------------------------------------------
    Likelihood ratio test of alpha=0:  chi2(1) = 1334.20   Prob > chi2 = 0.0000
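The last rows of the output can be reproduced from numbers already shown in this lecture. This is a sketch with assumed variable names: alpha is the exponential of /lnalpha, and the likelihood ratio statistic is twice the difference between the negative binomial and Poisson log likelihoods. The p value treats the statistic as chi-squared with 1 degree of freedom, as the output does.

```python
import math

# Log likelihoods reported for the two models in this lecture.
ll_poisson = -1547.9709
ll_negbin = -880.87312

# Estimate of ln(alpha) from the negative binomial output.
ln_alpha = 0.2533877

alpha = math.exp(ln_alpha)
lr_stat = 2.0 * (ll_negbin - ll_poisson)

# For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"alpha = {alpha:.6f}")
print(f"LR chi2(1) = {lr_stat:.2f}, Prob > chi2 = {p_value:.4f}")
```

The statistic is very large relative to a chi-squared reference with 1 degree of freedom, so the restriction alpha = 0 (the Poisson model) is rejected for these data.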