Title: Statistical Models for Stream Ecology Data: Random Effects Graphical Models
1 Statistical Models for Stream Ecology Data: Random Effects Graphical Models
- Devin S. Johnson
- Jennifer A. Hoeting
- STARMAP
- Department of Statistics
- Colorado State University
2 The work reported here was developed under the
STAR Research Assistance Agreement CR-829095
awarded by the U.S. Environmental Protection
Agency (EPA) to Colorado State University. This
presentation has not been formally reviewed by
EPA. The views expressed here are solely those
of the presenter and STARMAP, the program he
represents. EPA does not endorse any products or
commercial services mentioned in this
presentation.
3 Motivating Problem
- Various stream sites in the Mid-Atlantic region of the United States were visited in summer 1994.
- For each site, each observed fish species was cross-categorized according to several traits.
- Environmental variables were also measured at each site (e.g., precipitation, chloride concentration).
- Relative proportions are more informative.
- How can we determine whether the collected environmental variables affect species richness compositions, and if so, which ones?
4 Outline
- Introduction
- Compositional data
- Probability models
- Brief introduction to chain graphs
- A graphical model for compositional data
- Modeling individual probabilities
- Markov properties of random effects graphical models
- Analysis of fish species richness compositional data
- Conclusions and Future Research
5 Discrete Compositions and Probability Models
- Compositional data are multivariate observations Z = (Z_1,…,Z_D) subject to the constraints that Σ_i Z_i = 1 and Z_i ≥ 0.
- Compositional data are usually modeled with the Logistic-Normal (LN) distribution (Aitchison 1986).
  - Scale and location parameters provide a large amount of flexibility compared to the Dirichlet model
  - LN model defined for positive compositions only
- Problem: With discrete counts one has a non-trivial probability of observing 0 individuals in a particular category
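The zero-count problem can be illustrated with a minimal sketch, assuming the additive log-ratio (alr) transform that underlies the logistic-normal model. The counts and the helper names `to_composition` and `alr` are hypothetical and are not taken from the slides.

```python
import math

def to_composition(counts):
    """Convert category counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def alr(z):
    """Additive log-ratio transform, using the last part as the reference.

    Raises ValueError when any part is zero, because log(0) is undefined.
    """
    if any(p <= 0 for p in z):
        raise ValueError("alr is undefined for compositions with zero parts")
    return [math.log(p / z[-1]) for p in z[:-1]]

# Hypothetical species counts at two sites; the second has an empty category.
site_a = to_composition([12, 5, 3])
site_b = to_composition([9, 0, 1])

print(alr(site_a))              # finite log-ratios
try:
    alr(site_b)
except ValueError as err:
    print("site_b:", err)       # a positive-only model cannot represent this site
```

A composition built from counts can contain exact zeros, which is why a model defined only on strictly positive compositions is awkward for species counts.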
6 Existing Compositional Data Models
- Billheimer and Guttorp (2001) proposed using a multinomial state-space model for a single composition, where Y_ij is the number of individuals belonging to category j = 1,…,D at site i = 1,…,S.
- Limitations
  - Models proportions of a single categorical variable.
  - Abstract interpretation of included covariate effects
7 Existing Graphical Models
- Graphical model theory (see Lauritzen 1996) has been used for many years to
  - model cell probabilities for high-dimensional contingency tables
  - determine dependence relationships among categorical and continuous variables
- Limitation
  - Graphical models are designed for a single sample (or site in the case of the Oregon stream data). Compositional data may arise at many sites.
8 New Improvements for Compositional Data Models
- The Billheimer and Guttorp model can be generalized by the application of graphical model theory.
  - Generalized models can be applied to cross-classified compositions
  - Simple interpretation of covariate effects as a variable in a Markov random field
- Conversely, graphical model theory can be expanded to include models for multiple-site sampling schemes
9 Chain Graphs
[Figure: example chain graph on the vertices a, b, c, d, e]
- Mathematical graphs are used to illustrate complex dependence relationships in a multivariate distribution.
- A random vector is represented as a set of vertices, V.
- Pairs of vertices are connected by directed edges if a causal relationship is assumed, and by undirected edges if the relationship is mutual.
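A minimal sketch of how a chain graph might be stored, assuming one set of directed edges and one set of undirected edges. The particular edges among a, b, c, d, e are hypothetical, since the edges of the slide's figure are not recoverable; `chain_components` and `parents` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical chain graph on the vertices a..e (edges are assumed).
directed = {("a", "c"), ("b", "c"), ("c", "e")}   # (parent, child): causal
undirected = {("c", "d")}                          # mutual association

def chain_components(vertices, undirected_edges):
    """Group vertices into the connected components of the undirected part."""
    nbrs = defaultdict(set)
    for u, v in undirected_edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    seen, components = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(nbrs[v] - comp)
        seen |= comp
        components.append(comp)
    return components

def parents(vertex, directed_edges):
    """Vertices with a directed edge pointing into `vertex`."""
    return {u for u, v in directed_edges if v == vertex}

vertices = {"a", "b", "c", "d", "e"}
print(chain_components(vertices, undirected))
print(parents("c", directed))
```

Vertices joined only by undirected edges fall in the same chain component; directed edges then run between components.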
10 Probability Model for Individuals (Unobserved Composition)
- Response variables
  - Set F of discrete categorical variables
  - Notation: y is a specific cell
- Explanatory variables
  - Set Γ ∪ Δ of categorical (Δ) and/or continuous (Γ) variables
  - Notation: x refers to a specific explanatory observation
- Random effects
  - Allow flexibility when sampling many sites
  - Unobserved covariates
  - Notation: ε_f, f ⊆ F, refers to a random effect.
11 Probability Model and Extended Chain Graph, G^ε
- Joint distribution
  - f(y, x, ε) = f(y | x, ε) · f(x) · f(ε)
- Graph illustrating possible dependence relationships for the full model, G^ε.
12 Random Effects Discrete Regression Model (REDR)
- Sampling of individuals occurs at many different random sites, i = 1,…,S, where covariates are measured only once per site
- Hierarchical model for individual probabilities
13 Random Effects Discrete Regression Model (REDR)
- Response parameter constraints
  - The function a_F(x, ε) is a normalizing constant w.r.t. y | (x, ε), and therefore is not a function of y.
  - The parameters β_fcd(y, x_Δ), ω_fγdm(y, x_Δ), and ε_f(y) are interaction effects that depend on y and x_Δ through the levels of the variables in f and d only.
  - Interaction parameters (and random effects) are set to zero for identifiability of the model if the cells y or x_Δ are indexed by the first level of any variable in f or d.
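Why a first-level-zero constraint is needed can be seen in a minimal sketch, assuming the cell probabilities take a log-linear (softmax) form in which the effects are exponentiated and divided by a normalizing constant. That form, the effect values, and the names `cell_probs` and `corner_constrain` are assumptions for illustration only.

```python
import math

def cell_probs(effects):
    """Softmax: exponentiate effects and divide by the normalizing constant."""
    m = max(effects)                      # subtract the max for numerical stability
    w = [math.exp(e - m) for e in effects]
    total = sum(w)
    return [x / total for x in w]

def corner_constrain(effects):
    """Shift effects so the first level is exactly zero (reference level)."""
    return [e - effects[0] for e in effects]

raw = [0.7, 1.9, -0.4]                    # hypothetical, unconstrained effects
shifted = [e + 5.0 for e in raw]          # a different parameter vector...

# ...that yields the same probabilities, so unconstrained effects are not identifiable.
same = all(abs(p - q) < 1e-12
           for p, q in zip(cell_probs(raw), cell_probs(shifted)))
print(same)

# Fixing the first level at zero picks one representative per probability vector.
print(corner_constrain(raw))
print(corner_constrain(shifted))
```

Both parameter vectors map to the same constrained vector (up to floating-point rounding), which is the sense in which the constraint removes the redundancy.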
14 Random Effects Discrete Regression Model (REDR)
- Model for explanatory variables (CG distribution)
- Again, interactions depend on x_Δ through the levels of the variables in the set d only, and identifiability constraints are imposed.
15 Graphical Models for Discrete Compositions
- Sampling many individuals at a site results in cell counts,
  - C(y)_i = number of individuals in cell y at site i.
- Conditional count likelihood
  - {C(y)_i}_y | x_i, N_i ~ multinomial(N_i; {f_RE(y | x_i, ε_i)}_y),
  - or
  - C(y)_i | x_i ~ indep. Poisson(κ(x_i) · f_RE(y | x_i, ε_i))
- Joint covariate count likelihood
  - multinomial(N_i; {f_RE(y | x_i, ε_i)}_y) × CG(λ, τ, Ψ_∅)
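The multinomial count likelihood can be sketched by simulation. This is a minimal sketch under stated assumptions: cell probabilities are taken to be a softmax of an intercept, one covariate effect, and a normal random effect per cell, with the first cell as reference. That functional form, the parameter values, and the names `cell_probs` and `simulate_site` are hypothetical and are not the slides' model specification.

```python
import math
import random

def cell_probs(linear_predictors):
    """Normalize exp(linear predictor) over cells so probabilities sum to 1."""
    m = max(linear_predictors)
    w = [math.exp(v - m) for v in linear_predictors]
    total = sum(w)
    return [x / total for x in w]

def simulate_site(n_individuals, x, beta0, beta1, re_sd, rng):
    """Draw multinomial cell counts for one site.

    Assumed form: linear predictor = beta0 + beta1 * x + random effect,
    with every term for the first (reference) cell fixed at zero.
    """
    eps = [0.0] + [rng.gauss(0.0, re_sd) for _ in beta0[1:]]
    eta = [b0 + b1 * x + e for b0, b1, e in zip(beta0, beta1, eps)]
    probs = cell_probs(eta)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_individuals)
    counts = [draws.count(cell) for cell in range(len(probs))]
    return probs, counts

rng = random.Random(1)
beta0 = [0.0, 0.5, -1.0]   # hypothetical intercepts; first cell is the reference
beta1 = [0.0, -0.8, 0.3]   # hypothetical effects of one site-level covariate
for x in (-1.0, 0.0, 1.0):
    probs, counts = simulate_site(50, x, beta0, beta1, re_sd=0.5, rng=rng)
    print(x, [round(p, 3) for p in probs], counts)
```

Because the counts are drawn from a multinomial, a cell with zero observed individuals is an ordinary outcome rather than a value outside the model's support.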
16 Markov Properties of Chain Graph Models
- Let P denote a probability measure on the product space
  - X = ×_{α∈V} X_α
- Global Markov property
  - The probability measure P is Markovian with respect to a chain graph G if, for any triple (A, B, S) of disjoint sets in V such that S separates A from B in (G_an(A∪B∪S))^m, the moral graph of the smallest ancestral set containing A ∪ B ∪ S, we have
  - A ⊥ B | S.
- There are two weaker Markov properties, the pairwise and local Markov properties.
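The separation check in the global Markov property can be sketched as follows. This is a minimal sketch following the standard chain graph definitions (an ancestral set is closed under parents and undirected neighbours; moralization joins parents of the same chain component); the function names and the three-vertex example graph are hypothetical.

```python
from itertools import combinations

def ancestral_set(targets, directed, undirected):
    """Smallest set containing `targets` closed under parents and neighbours."""
    result, frontier = set(targets), list(targets)
    while frontier:
        v = frontier.pop()
        boundary = {u for u, w in directed if w == v}
        boundary |= {u for u, w in undirected if w == v}
        boundary |= {w for u, w in undirected if u == v}
        for u in boundary - result:
            result.add(u)
            frontier.append(u)
    return result

def components(vertices, undirected):
    """Chain components: connected components of the undirected edges."""
    label = {v: v for v in vertices}
    def find(v):
        while label[v] != v:
            v = label[v]
        return v
    for u, w in undirected:
        label[find(u)] = find(w)
    return {v: find(v) for v in vertices}

def moralize(vertices, directed, undirected):
    """Undirected moral graph: join parents of each chain component."""
    edges = {frozenset(e) for e in directed | undirected}
    comp = components(vertices, undirected)
    for (p1, c1), (p2, c2) in combinations(directed, 2):
        if p1 != p2 and comp[c1] == comp[c2]:
            edges.add(frozenset((p1, p2)))
    return edges

def separated(A, B, S, directed, undirected):
    """True if S separates A from B in the moral graph of an(A | B | S)."""
    keep = ancestral_set(set(A) | set(B) | set(S), directed, undirected)
    d = {(u, w) for u, w in directed if u in keep and w in keep}
    un = {(u, w) for u, w in undirected if u in keep and w in keep}
    edges = moralize(keep, d, un)
    reached, frontier = set(A), list(A)
    while frontier:
        v = frontier.pop()
        for e in edges:
            if v in e:
                (u,) = e - {v}
                if u not in S and u not in reached:
                    reached.add(u)
                    frontier.append(u)
    return not (reached & set(B))

# Hypothetical graph: a -> c <- b, no undirected edges.
directed = {("a", "c"), ("b", "c")}
undirected = set()
print(separated({"a"}, {"b"}, set(), directed, undirected))   # True
print(separated({"a"}, {"b"}, {"c"}, directed, undirected))   # False
```

In the example, a and b are separated marginally, but conditioning on their common child c brings c into the ancestral set, and moralization then joins a and b.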
17 Markov Properties of the REDR Model
- Proposition 1. A REDR model is G^ε Markovian if and only if the following six constraints are satisfied for a given extended graph G^ε.
- Response model
  - β_fcd(y, x_Δ) = 0 unless f ∪ c ∪ d is complete, for c ∪ d ≠ ∅.
  - ω_fγdm(y, x_Δ) = 0 for m = 1,…,M, unless f ∪ {γ} ∪ d is complete, where γ ∈ Γ and d ⊆ Δ.
  - ε_f(y) = −β_f∅∅(y) with probability 1 if f is not complete.
18 Markov Properties of the REDR Model
- Proposition 1. A REDR model is G^ε Markovian if and only if the following six constraints are satisfied for a given extended graph G^ε.
- Covariate model
  - λ_d(x_d) = 0 unless d is complete.
  - τ_dγ(x_d) = 0 unless {γ} ∪ d is complete, where γ ∈ Γ and d ⊆ Δ.
  - ψ_μγ = 0 unless {μ, γ} is complete, where γ, μ ∈ Γ and ψ_μγ is the (μ, γ) element of Ψ_∅.
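The role of completeness in these constraints can be sketched with a small check of which variable sets are complete in a given graph, and hence which interaction terms may be nonzero. This is a minimal sketch; the graph, the variable names `H`, `T`, and `elev`, and the helper `is_complete` are hypothetical.

```python
from itertools import combinations

def is_complete(subset, edges):
    """True if every pair of distinct vertices in `subset` is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))

# Hypothetical extended graph: responses H and T, one continuous covariate "elev".
# Edge direction does not matter for completeness, so edges are unordered pairs.
edges = {frozenset(("H", "elev")), frozenset(("H", "T"))}

candidate_terms = [("H",), ("T",), ("H", "T"), ("H", "elev"), ("T", "elev"),
                   ("H", "T", "elev")]
allowed = [t for t in candidate_terms if is_complete(t, edges)]
print(allowed)   # [('H',), ('T',), ('H', 'T'), ('H', 'elev')]
```

With no edge between T and elev, any term involving both is forced to zero, including the three-way term.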
19 Markov Properties of the REDR Model
- Sketch of proof.
  - Lauritzen and Wermuth (1989) prove the conditions concerning the λ, τ, and Ψ_∅ parameters for the CG distribution.
  - If the β and ω parameters are 0 for the specified sets, then the density factorizes according to Frydenberg's (1990) theorem.
  - A modified version of the proof of the Hammersley-Clifford theorem shows that if f(y | x, ε) separates into complete factors, then the corresponding β and ω vectors for non-complete sets must be 0.
20 Preservative REDR Models
- Preservative REDR models are defined by the following conditions:
  - All connected components a_q, q = 1,…,Q, of F in G^ε are complete, where Q is the total number of connected components.
  - Any d ∈ Γ ∪ Δ that is a parent of some f ∈ a_q is also a parent of every other f ∈ a_q, q = 1,…,Q.
21 Markov Properties of the REDR Model
- Proposition 2. If P is a preservative REDR model, and P is G^ε Markovian, then the marginal distribution, P_{F∪Γ∪Δ}, of the covariates and response variables is G = (G^ε)_{F∪Γ∪Δ} Markovian.
- Sketch of proof. The integrated REDR density follows Frydenberg's (1990) factorization criterion. The factor functions, however, do not exist in closed form.
22 Parameter Estimation
- A Gibbs sampling approach is used for parameter estimation
- Hierarchical centering
  - Produces Gibbs samplers that converge to the posterior distributions faster
  - Most parameters have standard full conditionals if given conditionally conjugate prior distributions.
- Independent priors imply that covariate and response models can be analyzed with separate MCMC procedures.
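Hierarchical centering can be illustrated on a toy model. This is a minimal sketch, assuming a two-level normal model with known variances rather than the REDR model itself: site means are sampled directly (the centered form) instead of their deviations from the overall mean. The model, the simulated data, and the name `gibbs_centered` are assumptions for illustration.

```python
import random
import statistics

def gibbs_centered(y, sigma2, tau2, n_iter, rng):
    """Gibbs sampler for y_ij ~ N(eta_i, sigma2), eta_i ~ N(mu, tau2), flat prior on mu.

    Centered form: sample the site means eta_i directly instead of the
    deviations eta_i - mu. Variances are treated as known for brevity.
    """
    n_sites = len(y)
    mu = 0.0
    mu_draws = []
    for _ in range(n_iter):
        # eta_i | mu, y: conjugate normal update with a precision-weighted mean
        eta = []
        for obs in y:
            prec = len(obs) / sigma2 + 1.0 / tau2
            mean = (sum(obs) / sigma2 + mu / tau2) / prec
            eta.append(rng.gauss(mean, prec ** -0.5))
        # mu | eta: normal centered at the average site mean
        mu = rng.gauss(statistics.mean(eta), (tau2 / n_sites) ** 0.5)
        mu_draws.append(mu)
    return mu_draws

rng = random.Random(42)
# Simulated data: 30 hypothetical sites, 20 observations each, true mu = 2
true_eta = [rng.gauss(2.0, 1.0) for _ in range(30)]
y = [[rng.gauss(e, 1.0) for _ in range(20)] for e in true_eta]

draws = gibbs_centered(y, sigma2=1.0, tau2=1.0, n_iter=2000, rng=rng)
kept = draws[200:]   # discard burn-in
print(round(statistics.mean(kept), 2))
```

Both full conditionals are standard normal distributions here, which mirrors the slide's point that conditionally conjugate priors give standard full conditionals.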
23 Fish Species Richness in the Mid-Atlantic Highlands
- 91 stream sites in the Mid-Atlantic region of the United States were visited in an EPA EMAP study
- Response composition
  - Observed fish species were cross-categorized according to 2 discrete variables
    - Habit
      - Column species
      - Benthic species
    - Pollution tolerance
      - Intolerant
      - Intermediate
      - Tolerant
24 Stream Covariates
- Environmental covariates
  - Values were measured at each site for the following covariates
    - Mean watershed precipitation (m)
    - Minimum watershed elevation (m)
    - Turbidity (ln NTU)
    - Chloride concentration (ln meq/L)
    - Sulfate concentration (ln meq/L)
    - Watershed area (ln km^2)
25 Fish Species Richness Model
- Composition graphical model
- Prior distributions
26 Model Selection
- Three different models are considered
  - Independent response
    - (i.e., β_fγ(y_i) = ε_f(y_i) = 0 for f = {H, T})
  - Dependent response w/ independent errors
  - Dependent response w/ correlated errors
    - (equivalent to the Billheimer-Guttorp model)
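Candidate models such as these are commonly compared with the deviance information criterion (DIC). A minimal sketch of the standard computation from MCMC output follows; the deviance values are made-up numbers chosen only to exercise the function and are not results from the fish data, and the name `dic` is illustrative.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Deviance information criterion from MCMC output.

    p_D (effective number of parameters) = mean deviance - deviance at the
    posterior mean; DIC = mean deviance + p_D.
    """
    mean_dev = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_dev - deviance_at_posterior_mean
    return mean_dev + p_d, p_d

# Hypothetical deviance summaries (not real results) for three candidate models.
models = {
    "independent response": ([410.0, 414.0, 412.0], 405.0),
    "dependent, independent errors": ([409.0, 415.0, 412.0], 403.0),
    "dependent, correlated errors": ([408.0, 416.0, 412.0], 401.0),
}
scores = {name: dic(draws, d_hat)[0] for name, (draws, d_hat) in models.items()}
best = min(scores, key=scores.get)   # smaller DIC is preferred
print(scores)
print(best)
```

DIC trades fit (mean deviance) against complexity (p_D), so a richer dependence structure is preferred only when it improves fit by more than it adds effective parameters.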
27 Fish Species Functional Groups
[Figure: posterior suggested chain graph for the independence model (lowest-DIC model)]
- Edge exclusion determined from 95% HPD intervals for β parameters and off-diagonal elements of Ψ_∅.
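Edge exclusion from HPD intervals can be sketched as follows. This is a minimal sketch, assuming a unimodal posterior so that the shortest interval containing 95% of the draws approximates the HPD interval; the draws are synthetic and the names `hpd_interval` and `keep_edge` are hypothetical.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sorted samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))          # number of samples inside the interval
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

def keep_edge(samples, mass=0.95):
    """Retain the edge only if the HPD interval excludes zero."""
    lo, hi = hpd_interval(samples, mass)
    return not (lo <= 0.0 <= hi)

# Synthetic posterior draws for two interaction parameters.
strong = [0.8 + 0.01 * i for i in range(100)]     # all draws well above zero
weak = [-0.5 + 0.01 * i for i in range(100)]      # draws straddle zero
print(keep_edge(strong), keep_edge(weak))   # True False
```

An edge whose parameter interval covers zero is dropped from the displayed graph; an edge whose interval excludes zero is kept.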
28 Comments and Conclusions
- Using a discrete response (DR) model with random effects, the Billheimer-Guttorp model can be generalized
  - Relationships evaluated through a graphical model
  - Multi-way compositions can be analyzed with a specified dependence structure between cells
  - MVN random effects imply that the cell probabilities have a constrained LN distribution
- DR models also extend the capabilities of graphical models
  - Data can be analyzed from multiple sites
  - Overdispersion in cell counts can be added
29 Future Work
- Model determination under a Bayesian framework
  - Models involve regression coefficients as well as many random effects
  - Initial investigation suggests that selection based on exclusion/inclusion of parameters, not edges, produces models with higher posterior mass
- Accounting for spatial correlation