Data Handling ZO4030 Lecture 5 Count data

Provided by: Andr864

1
Data Handling ZO4030 Lecture 5 Count data
  • Andrew Jackson

2
So far
  • GLMs to study effects on continuous response data
  • Fixed factors
  • Linear covariates
  • Normal errors (residuals)

3
More types of data
  • Count data
  • E.g. the number of deaths resulting from IHD
    (Ischaemic Heart Disease) in an age group
  • Non-negative integers (0, 1, 2, ...)
  • Proportional data
  • Number of males born as a proportion of females
  • Binary data
  • Species presence (True / False)
  • Infection status
  • Behaviour (ate / didn't eat)

4
Proportional counts
  • Flowering in 5 varieties of perennial plants
  • 6 dose levels of a growth fertiliser applied
  • Response is number flowered / number of flower
    buds

5
What data type is our response variable?
  • Number of successes (flowers) out of a number of
    attempts (flower buds)
  • Remind you of anything?
  • Number of heads (successes) out of repeated coin
    tosses (trials)

6
The Binomial Distribution
  • PMF: Probability Mass Function (the discrete counterpart of a
    Probability Density Function, PDF)
  • $Y \sim \mathrm{binom}(p, N)$
  • $p = 0.5$
  • $N = 6$
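
As a rough illustration of the distribution on this slide, the sketch below tabulates the binomial probabilities for p = 0.5 and N = 6 using base R's `dbinom()`. The variable names are chosen here for illustration and are not from the lecture.

```r
# Binomial probabilities for N = 6 trials with success probability p = 0.5
N <- 6
p <- 0.5
successes <- 0:N

# dbinom() gives P(Y = k) for each possible number of successes k
probs <- dbinom(successes, size = N, prob = p)
names(probs) <- successes
print(round(probs, 4))

# The probabilities over all possible outcomes sum to 1
print(sum(probs))
```

With p = 0.5 the distribution is symmetric, so 3 successes out of 6 is the single most likely outcome.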

7
For our flowers
  • $\text{Flowers} \sim \mathrm{Binomial}(p, \text{Number})$
  • Now, the probability p is surely related to
    things like dose and variety.
  • So we have something like
    $p = b_0 + b_1 \cdot \text{dose} + b_2 \cdot \text{variety}$

8
But
  • We have a problem
  • p is a probability and as such is strictly
    bounded by $0 \le p \le 1$
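
A small sketch of why the bound is a problem: with a straight line for p, nothing stops the predicted value from leaving the interval from 0 to 1. The coefficients below are made up purely for illustration and are not estimates from the lecture data.

```r
# Hypothetical coefficients, chosen only to illustrate the problem
b0 <- 0.1
b1 <- 0.2
dose <- 0:5

# A straight-line model for the probability itself
p_linear <- b0 + b1 * dose
print(p_linear)

# Flag predicted values that are not valid probabilities
invalid <- p_linear < 0 | p_linear > 1
print(invalid)
```

At the highest dose the line predicts a "probability" above 1, which is meaningless.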

9
Need to transform our relationship
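  • $p = b_0 + b_1 \cdot \text{dose} + b_2 \cdot \text{variety}$ becomes
  • $p = \frac{1}{1 + \exp(-(b_0 + b_1 \cdot \text{dose} + b_2 \cdot \text{variety}))}$
  • This equation looks pretty terrible
  • and it is!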

10
Can simplify though..
  • p is probability of flowering
  • 1-p is probability of not flowering
  • Odds of flowering $= p / (1 - p)$
  • $\ln(p / (1 - p)) = \mathrm{logit}(p)$
  • $\mathrm{logit}(p) = b_0 + b_1 \cdot \text{dose} + b_2 \cdot \text{variety}$
  • And we have a straight line again!
  • Perfect for a GLM (twiddles and all)
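
The chain from probability to odds to logit can be sketched in a few lines of R. The probabilities below are arbitrary example values; `qlogis()` and `plogis()` are base R's built-in logit and inverse logit.

```r
# Arbitrary example probabilities
p <- c(0.01, 0.25, 0.5, 0.75, 0.99)

# Odds and log-odds (logit), computed by hand
odds <- p / (1 - p)
logit_p <- log(odds)

# qlogis() is the built-in logit, so the two should agree
stopifnot(isTRUE(all.equal(logit_p, qlogis(p))))

# plogis() is the inverse logit and recovers the original probabilities
p_back <- plogis(logit_p)
print(data.frame(p, odds, logit_p, p_back))
```

The logit is unbounded in both directions, which is what allows a straight line on that scale while p itself stays between 0 and 1.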

11
Running Logistic Regression
  • Now that it's a simple GLM, all we have to do is
    identify
  • Response variable
  • Fixed factors
  • Covariates

12
Running Logistic Regression in R
  • In R this becomes
  • Y <- cbind(flowers, number - flowers)
  • Model1 <- glm(Y ~ dose * variety, family = binomial)
  • So this is an ANCOVA or a GLM with multiple
    slopes and intercepts
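
The lecture's data set is not included in this transcript, so the following is only a sketch on simulated data, assuming 5 varieties, 6 dose levels, 50 buds per plant and an invented "true" relationship. The interaction form `dose * variety` is also an assumption, chosen to match the "multiple slopes and intercepts" description.

```r
set.seed(1)

# Simulated stand-in for the flowering data: 5 varieties x 6 dose levels
dat <- expand.grid(dose = 0:5, variety = factor(LETTERS[1:5]))
dat$number <- 50  # flower buds per plant (assumed)

# Assumed "true" relationship on the logit scale, for simulation only
eta <- -3 + 0.5 * dat$dose + 0.3 * as.integer(dat$variety)
dat$flowers <- rbinom(nrow(dat), size = dat$number, prob = plogis(eta))

# Two-column response: successes and failures
Y <- cbind(dat$flowers, dat$number - dat$flowers)

# Binomial GLM with a separate intercept and dose slope per variety
Model1 <- glm(Y ~ dose * variety, family = binomial, data = dat)
print(summary(Model1))
```

The two-column response tells `glm()` how many trials lie behind each proportion, which a bare proportion would not.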

13
Interpreting the output

14
But what about the transformation?
  • When we calculate the effect sizes (coefficients)
    they are on the transformed logit(p) scale
  • So, variety D is 3.18 larger than variety A
  • BUT... this means the log(odds) are 3.18 bigger, not the
    probability itself

15
Back-calculating
  • To convert back to raw probabilities we use the
    equation $p = 1 / (1 + \exp(-\mathrm{logit}(p)))$
  • For the intercept (i.e. variety A and dose = 0):
    $p = 1 / (1 + \exp(-(-4.6))) \approx 0.01$
  • For variety D:
    $p = 1 / (1 + \exp(-(-4.6 + 3.18))) \approx 0.19$
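
The back-calculation can be checked directly in R. The sketch below uses only the two coefficients quoted on the slides (intercept -4.6, variety D effect 3.18); the variable names are illustrative.

```r
b0 <- -4.6   # intercept: variety A at dose 0 (value quoted on the slide)
bD <- 3.18   # variety D effect on the logit scale (value quoted on the slide)

# Inverse logit by hand
p_A <- 1 / (1 + exp(-b0))
p_D <- 1 / (1 + exp(-(b0 + bD)))
print(round(c(A = p_A, D = p_D), 2))

# plogis() is the built-in inverse logit and gives the same answer
stopifnot(isTRUE(all.equal(p_A, plogis(b0))))

# On the odds scale, the variety D effect is a multiplicative factor
odds_ratio <- exp(bD)
print(odds_ratio)
```

Variety A at dose 0 comes out at roughly 0.01 and variety D at roughly 0.19, so a difference of 3.18 in log(odds) corresponds here to a much smaller difference on the probability scale.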

16
Visualising the output

17
Binary Outcome Data

18
Binary Dataset
  • Fish infected with a parasite
  • Binary response variable - Infected
  • Gender as fixed factor
  • Age and weight as covariates

19
Our hypotheses
  • That age affects parasite incidence
  • That gender affects parasite incidence
  • That weight affects parasite incidence

20
Exactly as before
  • Number of trials here = 1 (just like a single
    coin toss)
  • The probability of having the disease is modelled
    as a function of the explanatory variables

21
Visualise the Data

22
Run the model
  • Coefficients:
                         Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)         -0.109124    1.375388   -0.079     0.937
    age                  0.024128    0.020874    1.156     0.248
    weight              -0.074156    0.147678   -0.502     0.616
    sexmale             -5.969109    4.278066   -1.395     0.163
    age:weight          -0.001977    0.002006   -0.985     0.325
    age:sexmale          0.038086    0.041325    0.922     0.357
    weight:sexmale       0.213830    0.343265    0.623     0.533
    age:weight:sexmale   0.001651    0.003419   -0.483     0.629
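
The fish data are not reproduced in this transcript, so here is a minimal sketch on simulated data of how a model with this coefficient structure might be fitted. The data frame `fish`, its value ranges and the simulated effect sizes are all assumptions; only the variable names (age, weight, sex, infected) follow the slides.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the fish data (ranges are invented)
fish <- data.frame(
  age    = runif(n, 10, 100),
  weight = runif(n, 2, 15),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Assumed "true" log-odds of infection, for simulation only
eta <- -1 + 0.02 * fish$age
fish$infected <- rbinom(n, size = 1, prob = plogis(eta))

# Binary response: each fish is a single trial (0 or 1)
Model2 <- glm(infected ~ age * weight * sex, family = binomial, data = fish)
print(coef(summary(Model2)))
```

With a 0/1 response there is no need for the two-column `cbind()` form, because every row is one trial.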

23
What do these parameters mean?
  • Again
  • We have modelled the log(odds)
  • And depending on what you want to compare or
    describe, you may need to back transform them
    using the earlier equation
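
As a worked example of that back-transformation, the sketch below plugs the printed estimates into the linear predictor for a female fish (the reference level, since the table reports `sexmale`). The age and weight values are hypothetical and their units are not given in the slides.

```r
# Estimates copied from the coefficient table (terms for females only)
b <- c(intercept  = -0.109124,
       age        =  0.024128,
       weight     = -0.074156,
       age_weight = -0.001977)

# Hypothetical fish: values chosen only for illustration
age <- 50
weight <- 5

# Linear predictor on the log(odds) scale
eta <- unname(b["intercept"] + b["age"] * age + b["weight"] * weight +
              b["age_weight"] * age * weight)

# Back-transform to a probability of infection
p_infected <- plogis(eta)
print(c(log_odds = eta, probability = p_infected))
```

For a male fish the `sexmale` main effect and the three interaction terms involving it would be added to the linear predictor before back-transforming.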

24
Summary
  • Non-continuous data can be modelled under
    the same GLM framework by picking the right
    family or distribution
  • Binomial
  • Proportion data
  • Binary data
  • Poisson
  • Counts
  • Things that happen at a given rate
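
The slides do not show a Poisson example, so the following is only a sketch of what picking the Poisson family might look like for counts. The data are simulated and the names `x`, `counts` and `Model3` are invented for illustration.

```r
set.seed(7)

# Simulated counts whose expected value rises with a covariate x
x <- seq(0, 2, length.out = 50)
counts <- rpois(length(x), lambda = exp(0.5 + 0.8 * x))

# Same glm() call as before, but with the Poisson family (log link by default)
Model3 <- glm(counts ~ x, family = poisson)
print(coef(Model3))

# Coefficients are on the log scale; exp() turns them into rate multipliers
print(exp(coef(Model3)))
```

Just as the binomial family needs an inverse logit to get back to probabilities, the Poisson family needs `exp()` to get back to expected counts.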

25
To do
  • Revise your notes from last year on the
    chi-square test for analysing proportions
  • R can do this test too
  • Read up on the following distributions
  • Binomial
  • Poisson
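
For the revision item on the chi-square test, a minimal sketch using base R's `chisq.test()` is shown below. The counts in the table are hypothetical and are not taken from the lecture.

```r
# Hypothetical 2 x 2 table of counts (not from the lecture data)
tab <- matrix(c(30, 20,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(treatment = c("low", "high"),
                              outcome   = c("flowered", "not flowered")))
print(tab)

# Chi-square test of independence; for 2 x 2 tables R applies
# Yates' continuity correction by default
result <- chisq.test(tab)
print(result)
```

Passing `correct = FALSE` switches off the continuity correction if the uncorrected statistic is wanted.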