Title: Longitudinal Data Analysis and Survival Analysis
1Longitudinal Data Analysis and Survival Analysis
- Ming-Yu Fan, PhD
- June 25, 2008
2Outline
- Longitudinal data
- Methods for LDA
- Robust standard error
- Generalized Estimating Equation (GEE)
- Random effects model
- Survival data
- Methods for analyzing survival data
3Longitudinal Data
- Each individual has multiple observations
- The intervals between observations are approximately the same for each individual
- Even intervals are nice but not necessary
- Ex: depression severity (SCL-20) evaluated at baseline and at 3, 6, and 12 months
4Why are longitudinal data desirable?
- More information
- Can control for individual heterogeneity
- Can better assess causality than cross-sectional
data
5Problem with longitudinal data
- Conventional statistical methods require independence between observations
- Longitudinal data are likely to violate this assumption
- Missing data due to attrition
6Notation
- $y_{it}$: $i$th individual, $t$th observation
- $i = 1, 2, \ldots, n$; $t = 1, 2, \ldots, T$
- $y_{i1}, y_{i2}, y_{i3}, \ldots, y_{iT}$ are very likely to be correlated
7Example
- Data from the IMPACT study (PI: Dr. Unützer)
- 8 organizations, total N = 1801
- Outcome: SCL-20 measured at baseline and 3 months
- Comparison between the intervention and the usual care groups
8Example cont.
- General linear model: $\beta = -0.1323$, SE = 0.0222, t = -5.96, p-value < 0.0001
- Ignores the correlation coefficient ($\rho = 0.4$) between SCL00 and SCL03
- Robust standard error: $\beta = -0.1323$, SE = 0.0251, z = -5.26, p-value < 0.0001
9Example 2
- Suppose we are only interested in two large study sites (N = 551)
- Outcome: MCS12 (mental health component score from SF12) measured at baseline and 3 months
- Comparison between the intervention and the usual care groups
10Example 2 cont.
- General linear model: $\beta = 0.8088$, SE = 0.4117, t = 1.96, p-value = 0.0497
- Ignores the correlation coefficient ($\rho = 0.23$) between MCS1200 and MCS1203
- Robust standard error: $\beta = 0.8088$, SE = 0.4448, z = 1.82, p-value = 0.0690
- The robust standard error is greater than the SE estimated without accounting for correlation
- Different methods lead to different conclusions
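The test statistics in Example 2 are the coefficient divided by its standard error. A minimal sketch of that arithmetic, assuming a two-sided test against the standard normal distribution; the function name `wald_test` is illustrative. The slide's general linear model p-value (0.0497) comes from a t distribution, so the normal approximation for that row is slightly smaller.

```python
import math

def wald_test(estimate, standard_error):
    """Return the Wald statistic and its two-sided normal p-value."""
    z = estimate / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Coefficient and standard errors reported for Example 2.
z_naive, p_naive = wald_test(0.8088, 0.4117)    # correlation ignored
z_robust, p_robust = wald_test(0.8088, 0.4448)  # robust standard error

print(f"naive:  z = {z_naive:.2f}, p = {p_naive:.4f}")
print(f"robust: z = {z_robust:.2f}, p = {p_robust:.4f}")
```

The same coefficient falls on opposite sides of the 0.05 threshold depending only on which standard error is used.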
11Methods for LDA
- Robust standard error
- Generalized Estimating Equations (GEE)
- Random effects models (hierarchical models)
12Robust standard error
- Ordinary least squares covariance estimator
- Robust covariance estimator
- Residuals
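A sketch of the usual textbook forms of these three quantities, assuming a linear model $y_i = X_i \beta + e_i$ in which $X_i$ is the $T \times p$ design matrix for individual $i$. The symbols are assumptions and may differ from the notation used on the slide.

```latex
\begin{aligned}
\text{OLS:}\quad
  & \widehat{\mathrm{Var}}(\hat\beta)
    = \hat\sigma^2 \Big(\sum_{i=1}^{n} X_i^\top X_i\Big)^{-1} \\
\text{Robust:}\quad
  & \widehat{\mathrm{Var}}(\hat\beta)
    = \Big(\sum_{i=1}^{n} X_i^\top X_i\Big)^{-1}
      \Big(\sum_{i=1}^{n} X_i^\top \hat e_i \hat e_i^\top X_i\Big)
      \Big(\sum_{i=1}^{n} X_i^\top X_i\Big)^{-1} \\
\text{Residuals:}\quad
  & \hat e_i = y_i - X_i \hat\beta
\end{aligned}
```

The middle term of the robust estimator multiplies residuals from the same individual together, which is how the within-person correlation enters the standard error.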
13Robust standard error cont.
- Robust standard errors are usually larger than conventional standard errors, but it is possible to see a smaller robust standard error
- Robust standard errors may be inaccurate if the sample size is small
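To see why the robust standard error tends to be larger when repeated observations are positively correlated, here is a minimal sketch for the simplest case, an intercept-only model (estimating a mean). The data are hypothetical, the function name is illustrative, and no small-sample correction is applied.

```python
import math

def mean_with_standard_errors(clusters):
    """Estimate a mean from clustered data.

    clusters: list of lists, one inner list of outcomes per individual.
    Returns (estimate, conventional_se, cluster_robust_se).
    """
    values = [y for cluster in clusters for y in cluster]
    n_obs = len(values)
    estimate = sum(values) / n_obs

    # Conventional variance: residual variance divided by the number of
    # observations, treating every observation as independent.
    rss = sum((y - estimate) ** 2 for y in values)
    conventional_var = (rss / (n_obs - 1)) / n_obs

    # Cluster-robust (sandwich) variance: residuals are summed within each
    # individual before squaring, so within-person correlation is retained.
    meat = sum(sum(y - estimate for y in cluster) ** 2 for cluster in clusters)
    robust_var = meat / n_obs ** 2

    return estimate, math.sqrt(conventional_var), math.sqrt(robust_var)

# Hypothetical data: 4 individuals with 2 similar observations each.
data = [[1.0, 1.2], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2]]
estimate, se_conventional, se_robust = mean_with_standard_errors(data)
print(f"mean = {estimate:.2f}, conventional SE = {se_conventional:.3f}, "
      f"robust SE = {se_robust:.3f}")
```

With only four individuals this also illustrates the small-sample caveat: the robust estimate rests on just four cluster-level residual sums.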
14Generalized Estimating Equation (GEE)
- Estimate $\beta$ by solving the following equation (Wedderburn, 1974)
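In the GEE literature the estimating equation is commonly written as below. This is a sketch of the standard form rather than a copy of the slide; the symbols $\mu_i$, $D_i$, $V_i$, $A_i$, $R_i(\alpha)$, and $\phi$ are assumed notation.

```latex
\sum_{i=1}^{n} D_i^\top V_i^{-1} \big( y_i - \mu_i(\beta) \big) = 0,
\qquad
D_i = \frac{\partial \mu_i}{\partial \beta^\top},
\qquad
V_i = \phi \, A_i^{1/2} R_i(\alpha) A_i^{1/2}
```

Here $y_i = (y_{i1}, \ldots, y_{iT})^\top$, $\mu_i(\beta)$ is its mean under the model, $A_i$ is the diagonal matrix of variance functions, and $R_i(\alpha)$ is the working correlation matrix chosen by the analyst.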
15Variance Structure
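- Covariance matrix for $y_{i1}, y_{i2}, y_{i3}, \ldots, y_{iT}$
- $\mathrm{Corr}(y_{ik}, y_{il}) = \rho_{kl} \neq 0$
- $\mathrm{Corr}(y_{ik}, y_{jl}) = 0$ when $i \neq j$ (for both $k = l$ and $k \neq l$)
- $\rho_{kl} = \rho_{lk}$ (symmetry)
- Ex: $T = 4$, need to estimate 6 correlation coefficients ($\rho_{12}, \rho_{13}, \rho_{14}, \rho_{23}, \rho_{24}, \rho_{34}$)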
16Possible Variance Structure
- Unstructured (UN): no constraints on the $\rho_{kl}$
- Exchangeable (EXCH): $\rho_{kl} = \rho$ for all $k \neq l$
- 1st order autoregressive (AR): $\rho_{kl} = \rho^{|k-l|}$
- Banded structure (MDEP(m)): $\rho_{kl} = \rho_{|k-l|}$ when $|k-l| \le m$, otherwise $\rho_{kl} = 0$
- Independent (IND) → robust standard errors: $\rho_{kl} = 0$ for all $k \neq l$
17Example of Variance Structure
- Outcome measured at 4 time points: $y_{i1}, y_{i2}, y_{i3}, y_{i4}$
- In total 6 correlation coefficients ($\rho_{12}, \rho_{13}, \rho_{14}, \rho_{23}, \rho_{24}, \rho_{34}$)
- Unstructured (UN) → need to estimate all 6 correlation coefficients
- Exchangeable (EXCH) → need to estimate only 1 correlation coefficient
- 1st order autoregressive (AR): $\rho_{12} = \rho$, $\rho_{13} = \rho^2$, $\rho_{14} = \rho^3$, ... → need to estimate only 1 correlation coefficient (e.g. $\rho_{12} = 0.4$, $\rho_{13} = 0.16$, $\rho_{14} = 0.064$)
- Banded structure (MDEP(2)): $\rho_{12} = \rho_{23} = \rho_{34}$, $\rho_{13} = \rho_{24}$, $\rho_{14} = 0$ → need to estimate 2 correlation coefficients (distance 1, 2)
- Independent (IND) → no need to estimate the correlation coefficient
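The structures above can be made concrete by building the working correlation matrix for each one. A minimal sketch in plain Python; the function names and the MDEP correlations (0.5 and 0.2) are illustrative assumptions, while the AR value of 0.4 is the one used in the example above.

```python
def working_correlation(structure, n_times, rho=None):
    """Build an n_times x n_times working correlation matrix.

    structure: "IND", "EXCH", "AR", or "MDEP".
    rho: one correlation for EXCH and AR; for MDEP, a list whose entry
         d - 1 is the correlation at distance d (list length is m).
    """
    if structure not in ("IND", "EXCH", "AR", "MDEP"):
        raise ValueError(f"unknown structure: {structure}")
    matrix = []
    for j in range(n_times):
        row = []
        for k in range(n_times):
            distance = abs(j - k)
            if distance == 0:
                row.append(1.0)
            elif structure == "IND":
                row.append(0.0)
            elif structure == "EXCH":
                row.append(rho)
            elif structure == "AR":
                row.append(rho ** distance)
            else:  # MDEP: zero beyond distance m
                row.append(rho[distance - 1] if distance <= len(rho) else 0.0)
        matrix.append(row)
    return matrix

def n_unstructured_parameters(n_times):
    """Number of distinct correlations when no constraints are imposed."""
    return n_times * (n_times - 1) // 2

ar = working_correlation("AR", 4, 0.4)
mdep = working_correlation("MDEP", 4, [0.5, 0.2])
for row in ar:
    print([round(value, 3) for value in row])
```

The unstructured case has no builder here because every off-diagonal entry is a free parameter; `n_unstructured_parameters(4)` gives the count of 6 quoted above.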
18GEE cont.
- GEE produces efficient estimates of the coefficients
- Can assume different variance structures. The results are usually robust to the choice
- GEE assumes the drop-outs are Missing Completely At Random
19Random effects models
- General linear model
- Random effects model
- Other names: hierarchical models, multilevel models, mixed models
- This model is also called the random intercepts model. It implies equal correlations and thus is equivalent to the exchangeable model in GEE estimation
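A sketch of the two models in their usual textbook form, assuming a single covariate $x_{it}$ for brevity. The symbols $u_i$, $\sigma_u^2$, and $\sigma_\varepsilon^2$ are assumed notation.

```latex
\begin{aligned}
\text{General linear model:}\quad
  & y_{it} = \beta_0 + \beta_1 x_{it} + \varepsilon_{it} \\
\text{Random intercepts model:}\quad
  & y_{it} = \beta_0 + \beta_1 x_{it} + u_i + \varepsilon_{it},
  \qquad u_i \sim N(0, \sigma_u^2),\;
         \varepsilon_{it} \sim N(0, \sigma_\varepsilon^2) \\
\text{Implied correlation:}\quad
  & \mathrm{Corr}(y_{ik}, y_{il})
    = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
  \quad \text{for all } k \neq l
\end{aligned}
```

Because the implied correlation does not depend on $k$ or $l$, every pair of observations on the same individual is equally correlated, which is the exchangeable structure.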
20Random effects models cont.
- Can have more than 2 levels
- Allow for random coefficients / slopes
- The dependent variable can have missing data under a weaker assumption: Missing At Random
21Compared with GEE
- Takes more computing time than GEE
- Computationally less stable than GEE
- More restrictive assumption on correlation
structure (GEE can assume unstructured
correlation)
22LDA Methods for Continuous Dependent Variables
- Robust standard errors
- Can be derived using SAS Proc GENMOD (variance structure IND) or Proc SURVEYREG
- GEE
- Stata: xtgee, xtreg
- SAS: Proc GENMOD
- Choices of variance structure
- Unstructured (UN)
- Exchangeable (EXCH)
- 1st order autoregressive (AR)
- Banded structure (MDEP(m))
23LDA Methods for Continuous Dependent Variables cont.
- How to check variance structure?
- Assign UN (unstructured) first
- Use CORRW option to print out the estimated correlation matrix
- Random effects models
- Stata: xtreg
- SAS: Proc MIXED
24LDA Methods for Categorical Dependent Variables
- Robust standard errors
- SAS Proc GENMOD (with IND variance structure)
- Specify distribution family (e.g. Binomial for binary outcomes, Poisson for count data); the default is the Normal distribution
- Can also use Proc SURVEYLOGISTIC for a binary outcome
- GEE
- Stata: xtlogit, xtpoisson, etc.
- SAS: Proc GENMOD
- Specify distribution family for categorical dependent variables
- Assume variance structure to be UN, EXCH, AR, or MDEP(m)
25LDA Methods for Categorical Dependent Variables cont.
- Random effects models
- Use SAS Proc NLMIXED (non-linear mixed models)
- Can only estimate two-level models
- Syntax is quite complicated
- Computationally intensive and often unstable, and thus not recommended (by Dr. Paul D. Allison)
- Can also use SAS Proc GLIMMIX (need to download it from the SAS web site: http://support.sas.com/rnd/app/da/glimmix.html)
- According to Dr. Allison:
- Can handle more than two levels of data
- Much faster than NLMIXED
- Syntax is simpler
- Inaccurate for small numbers of time points (e.g. 2-3 points per person)
26IMPACT Example
- Outcome: SCL-20 measured at baseline, 3 months, 6 months, 12 months, 18 months, and 24 months (time = 0, 3, 6, 12, 18, 24)
- Compare the intervention and usual care groups
- Models are adjusted for age, sex, education, ethnicity, number of chronic conditions, and time
28UN, AR, EXCH, or MDEP(5)?
29Survival Analysis
- Outcome: failure and failure time
- Unlike repeated measures, survival data have only 1 outcome measure
- Methods for recurrent events are available
- Failure time is (often) the clock time between the time origin and failure
- Time origin should be precisely defined
- Ex: the date of randomization in a randomized clinical trial
- Time origin doesn't have to be the same calendar time for all individuals
- Censoring: individuals are not observed for the full time to failure
30BMT Example
- From Klein and Moeschberger (1997)
- Sample: 137 patients who received a bone marrow transplant
- At the time of transplant, each patient is classified into one of three risk categories
- ALL (Acute Lymphoblastic Leukemia)
- Low-risk AML (Acute Myeloid Leukemia)
- High-risk AML
- End point: disease-free survival in days
- Time origin: the date of transplant
- Failure: death or relapse
- Censored: no death or relapse by the end of the study
31Survival Function
- Failure time: $T$
- Survival function: $S(t) = P(T > t)$
- the probability that the failure time is greater than $t$ (for a continuous $T$ this is the same as greater than or equal to $t$)
- Hazard function
- the chance that the failure occurs within the time interval $[t, t + \Delta t)$ (let $\Delta t$ be extremely small), given that the individual survives at $t$
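The verbal definition of the hazard corresponds to the following limit. This is a sketch of the standard form for a continuous failure time; the density $f(t)$ is assumed notation.

```latex
h(t) = \lim_{\Delta t \to 0}
       \frac{P\big(t \le T < t + \Delta t \mid T \ge t\big)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \log S(t)
```

The last equality shows that the hazard and the survival function carry the same information: $S(t) = \exp\!\big(-\int_0^t h(u)\,du\big)$.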
32Estimating Survival Function
- Life-table
- Divide the period of observation into a series of time intervals (often of equal length)
- Compute the number of deaths and the number of censored survival times
- Estimate the survival probability in each interval
- Take the product of the probabilities
- Kaplan-Meier estimate
- Similar to the life-table method
- Each interval contains only one death, which occurs at the start of the interval
- Cox proportional hazard model
33A subsample from the BMT example
34Life-Table
- D = deaths; C = censored; N = number of individuals who are alive (at risk) at the beginning of the interval
- N' = N - (C/2): number of individuals who are at risk during the interval
- S(t) = cumulative survival
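The life-table calculation can be sketched in a few lines. This assumes the usual actuarial adjustment, in which censored individuals count as at risk for half the interval (N minus C/2). The interval counts are hypothetical, not the BMT data, and the function name is illustrative.

```python
def life_table(n_start, intervals):
    """Actuarial life-table estimate of cumulative survival.

    n_start: number of individuals at risk at the start of follow-up.
    intervals: list of (deaths, censored) tuples, one per interval,
               in time order.
    Returns one (at_risk_start, effective_at_risk, cumulative_survival)
    tuple per interval.
    """
    rows = []
    at_risk_start = n_start
    cumulative_survival = 1.0
    for deaths, censored in intervals:
        # Censored individuals are assumed at risk for half the interval.
        effective_at_risk = at_risk_start - censored / 2
        interval_survival = 1 - deaths / effective_at_risk
        cumulative_survival *= interval_survival
        rows.append((at_risk_start, effective_at_risk, cumulative_survival))
        # Everyone who died or was censored leaves the risk set.
        at_risk_start -= deaths + censored
    return rows

# Hypothetical counts: 10 individuals followed over three intervals.
rows = life_table(10, [(2, 0), (1, 2), (3, 1)])
for at_risk, effective, survival in rows:
    print(f"N = {at_risk}, N' = {effective}, S(t) = {survival:.4f}")
```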
35Kaplan-Meier Estimate
- The beginning of each interval is determined by a death
- Each interval contains one death (or more if there are ties)
- N(t) includes individuals with censored data at t
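A minimal sketch of the Kaplan-Meier product, following the rules above: intervals start at observed failure times, tied deaths are counted together, and individuals censored at t remain in the risk set N(t). The six observations are hypothetical and the function name is illustrative.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times: observed times (failure or censoring), one per individual.
    events: 1 if a failure was observed at that time, 0 if censored.
    Returns one (time, at_risk, deaths, survival) tuple per distinct
    failure time, in increasing order of time.
    """
    observations = list(zip(times, events))
    failure_times = sorted({time for time, event in observations if event == 1})
    survival = 1.0
    curve = []
    for t in failure_times:
        # The risk set includes anyone whose observed time is at least t,
        # so individuals censored exactly at t are still counted.
        at_risk = sum(1 for time, _ in observations if time >= t)
        deaths = sum(1 for time, event in observations if time == t and event == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, at_risk, deaths, survival))
    return curve

# Hypothetical data: times in days, 1 = failure, 0 = censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, at_risk, deaths, survival in curve:
    print(f"t = {t}: N(t) = {at_risk}, deaths = {deaths}, S(t) = {survival:.4f}")
```

Censored observations never lower the curve themselves; they only shrink the risk set used at later failure times.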
38Cox Proportional Hazard Model
- $h_0(t)$: baseline hazard function
- The interpretation of $b_1$
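A sketch of the Cox model in its standard form, assuming covariates $x_1, \ldots, x_p$ with coefficients $b_1, \ldots, b_p$ (the covariate names are assumed notation).

```latex
h(t \mid x) = h_0(t) \exp\big(b_1 x_1 + b_2 x_2 + \cdots + b_p x_p\big),
\qquad
\frac{h(t \mid x_1 + 1, x_2, \ldots, x_p)}{h(t \mid x_1, x_2, \ldots, x_p)}
  = \exp(b_1)
```

Under this form, $\exp(b_1)$ is the hazard ratio for a one-unit increase in $x_1$ with the other covariates held fixed, and it does not depend on $t$, which is the proportional hazards assumption.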
39BMT Example
41Reference
- Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61, 439-447.
- Dr. Paul Allison's upcoming short courses: http://www.statisticalhorizons.com/index.html
- Klein, J. P. and Moeschberger, M. L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer-Verlag.