Title: in data
1Structure in data, structure in models, and uncertainty and complexity
2What do I mean by structure?
- The key idea is conditional independence (CI)
- x and z are conditionally independent given y if p(x,z|y) = p(x|y) p(z|y)
- implying, for example, that p(x|y,z) = p(x|y)
- CI turns out to be a remarkably powerful and pervasive idea in probability and statistics
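The definition can be checked numerically. The following is a minimal sketch, assuming a made-up joint distribution over three binary variables that is built so that x and z are CI given y; all probabilities and function names are illustrative assumptions, not taken from the talk.

```python
from itertools import product

# Hypothetical building blocks (illustrative numbers only).
p_y = {0: 0.3, 1: 0.7}
p_x_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_y[y][x]
p_z_given_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}  # p_z_given_y[y][z]

# Joint p(x, y, z), constructed so that x and z are CI given y.
joint = {
    (x, y, z): p_y[y] * p_x_given_y[y][x] * p_z_given_y[y][z]
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(keep):
    """Sum the joint over all variables not in `keep` (indices into (x, y, z))."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

m_xy, m_yz, m_y = marginal((0, 1)), marginal((1, 2)), marginal((1,))
m_xz, m_x, m_z = marginal((0, 2)), marginal((0,)), marginal((2,))

def is_ci_given_y(tol=1e-12):
    """Check p(x,z|y) = p(x|y) p(z|y) for every configuration."""
    return all(
        abs(joint[x, y, z] / m_y[(y,)]
            - (m_xy[x, y] / m_y[(y,)]) * (m_yz[y, z] / m_y[(y,)])) < tol
        for x, y, z in product((0, 1), repeat=3)
    )

def is_marginally_independent(tol=1e-12):
    """Check p(x,z) = p(x) p(z), which need not hold even when CI holds."""
    return all(
        abs(m_xz[x, z] - m_x[(x,)] * m_z[(z,)]) < tol
        for x, z in product((0, 1), repeat=2)
    )
```

In this example `is_ci_given_y()` holds while `is_marginally_independent()` does not: conditioning on y is what separates x from z.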
3How to represent this structure?
- The idea of graphical modelling: we draw graphs in which nodes represent variables, connected by lines and arrows representing relationships
- We separate logical (the graph) and quantitative (the assumed distributions) aspects of the model
4Contingency tables
Spatial statistics
Regression
Genetics
Graphical models
Markov chains
AI
Statistical physics
Sufficiency
Covariance selection
5Graphical modelling 1
- Assuming structure to do probability calculations
- Inferring structure to make substantive conclusions
- Structure in model building
- Inference about latent variables
6Basic DAG
in general
for example
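For reference, the textbook factorisation associated with a DAG is sketched below. This is an assumption about the kind of formula such a slide shows, not a transcription of it; pa(v) denotes the parents of node v, and the three-node example (a and b both pointing to c) is purely illustrative.

```latex
p(x) = \prod_{v \in V} p\bigl(x_v \mid x_{\mathrm{pa}(v)}\bigr)
\qquad \text{e.g.} \qquad
p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b)
```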
7A natural DAG from genetics
AB
AO
AO
OO
OO
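The arrows in a pedigree DAG carry Mendelian segregation probabilities. A minimal sketch follows, assuming each parent passes on one of its two alleles with probability 1/2; the function name and the choice of parents are illustrative assumptions, since the pedigree drawing itself is not reproduced here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def child_genotype_dist(mother, father):
    """Distribution of a child's unordered genotype under Mendelian segregation.

    Each parent contributes one of its two alleles with probability 1/2,
    independently of the other parent.
    """
    dist = Counter()
    for a, b in product(mother, father):
        genotype = "".join(sorted((a, b)))  # unordered: "OA" is stored as "AO"
        dist[genotype] += Fraction(1, 4)
    return dict(dist)

# Hypothetical pairing of two genotypes of the kind shown on the slide.
dist_ab_ao = child_genotype_dist("AB", "AO")
dist_ao_ao = child_genotype_dist("AO", "AO")
```

These conditional tables are the quantitative part of the model; the pedigree supplies the graph.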
9DNA forensics example (thanks to Julia Mortera)
- A blood stain is found at a crime scene
- A body is found somewhere else!
- There is a suspect
- DNA profiles on all three: the crime scene sample is a mixed trace. Is it a mix of the victim and the suspect?
10DNA forensics in Hugin
- Disaggregate problem in terms of paternal and maternal genes of both victim and suspect
- Assume Hardy-Weinberg equilibrium
- We have profiles on 8 STR markers, treated as independent (linkage equilibrium)
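The two equilibrium assumptions translate into simple arithmetic. A minimal sketch, assuming hypothetical allele labels and frequencies for one marker (they are not the D7S820 values): Hardy-Weinberg gives genotype probabilities from allele frequencies, and linkage equilibrium lets probabilities multiply across markers.

```python
from itertools import combinations_with_replacement

def hardy_weinberg(allele_freqs):
    """Genotype probabilities under Hardy-Weinberg equilibrium.

    A homozygote (a, a) has probability p_a**2; a heterozygote (a, b) has 2*p_a*p_b.
    """
    probs = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        pa, pb = allele_freqs[a], allele_freqs[b]
        probs[(a, b)] = pa * pa if a == b else 2 * pa * pb
    return probs

def profile_probability(per_marker_probs):
    """Under linkage equilibrium, multiply genotype probabilities across markers."""
    total = 1.0
    for p in per_marker_probs:
        total *= p
    return total

# Hypothetical allele frequencies for a single marker (illustrative only).
freqs = {"8": 0.2, "10": 0.3, "11": 0.5}
genotypes = hardy_weinberg(freqs)
```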
11DNA forensics
- The data
- 2 of 8 markers show more than 2 alleles at the crime scene, so the trace is a mixture of 2 or more people
12DNA forensics in Hugin
13DNA forensics
- Population gene frequencies for D7S820 (used as
prior on founder nodes)
15DNA forensics
- Results (suspect + victim vs. unknown + victim)
16Graphical modelling 2
- Assuming structure to do probability calculations
- Inferring structure to make substantive conclusions
- Structure in model building
- Inference about latent variables
17Conditional independence graph
- draw an (undirected) edge between two variables if they are not conditionally independent given all other variables
18Infant mortality example
- Data on infant mortality from 2 clinics, by level
of ante-natal care (Bishop, Biometrics, 1969)
19Infant mortality example
- Same data broken down also by clinic
20Analysis of deviance
                      Df  Deviance  Resid. Df  Resid. Dev   P(>|Chi|)
NULL                                        7     1066.43
Clinic                 1     80.06          6      986.36   3.625e-19
Ante                   1      7.06          5      979.30   0.01
Survival               1    767.82          4      211.48   5.355e-169
Clinic:Ante            1    193.65          3       17.83   5.068e-44
Clinic:Survival        1     17.75          2        0.08   2.524e-05
Ante:Survival          1      0.04          1        0.04   0.84
Clinic:Ante:Survival   1      0.04          0   1.007e-12   0.84
21Infant mortality example
survival
ante
clinic
survival and clinic are dependent
and ante and clinic are dependent
but survival and ante are CI given clinic
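The pattern can be reproduced with a few lines of arithmetic. This is a minimal sketch, and the counts are an assumption: they are the figures commonly quoted for Bishop (1969), entered from memory rather than from the slide, so check them against the paper before relying on them. The function names are illustrative.

```python
from math import log

# (died, survived) by clinic and level of ante-natal care.
# ASSUMPTION: counts as commonly quoted for Bishop (1969), not taken from the slide.
counts = {
    "A": {"less": (3, 176), "more": (4, 293)},
    "B": {"less": (17, 197), "more": (2, 23)},
}

def odds_ratio(table):
    """Odds ratio of death, less care vs. more care, in one 2x2 table."""
    (d_less, s_less), (d_more, s_more) = table["less"], table["more"]
    return (d_less * s_more) / (s_less * d_more)

# Collapsing over clinic makes care and survival look associated.
pooled = {
    care: tuple(sum(counts[c][care][k] for c in counts) for k in (0, 1))
    for care in ("less", "more")
}

def ci_deviance(tables):
    """Deviance (G^2) of the model 'survival CI of ante given clinic'.

    Within each clinic the fitted count is row total * column total / clinic total.
    """
    g2 = 0.0
    for table in tables.values():
        n = sum(sum(cell) for cell in table.values())
        col = [sum(table[care][k] for care in table) for k in (0, 1)]
        for cell in table.values():
            row = sum(cell)
            for k, obs in enumerate(cell):
                fitted = row * col[k] / n
                g2 += 2 * obs * log(obs / fitted)
    return g2
```

With these counts the pooled odds ratio is well above 1, the within-clinic odds ratios are close to 1, and `ci_deviance(counts)` is about 0.08, in line with a residual deviance of 0.08 on 2 degrees of freedom.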
22Prognostic factors for coronary heart disease
Analysis of a 2^6 contingency table (Edwards & Havranek, Biometrika, 1985)
strenuous physical work?
smoking?
family history of CHD?
blood pressure > 140?
strenuous mental work?
ratio of beta and alpha lipoproteins > 3?
23Graphical modelling 3
- Assuming structure to do probability calculations
- Inferring structure to make substantive conclusions
- Structure in model building
- Inference about latent variables
24Modelling with undirected graphs
- Directed acyclic graphs are a natural representation of the way we usually specify a statistical model: directionally
- disease → symptom
- past → future
- parameters → data ...
- However, sometimes (e.g. spatial models) there is no natural direction
25Scottish lip cancer data
- The rates of lip cancer in 56 counties in Scotland have been analysed by Clayton and Kaldor (1987) and Breslow and Clayton (1993)
- (the analysis here is based on the example in the WinBUGS manual)
26Scottish lip cancer data (2)
- the observed and expected cases (expected
numbers based on the population and its age and
sex distribution in the county),
- a covariate measuring the percentage of the
population engaged in agriculture, fishing, or
forestry, and
- the "position" of each county expressed as a list of adjacent counties.
27Scottish lip cancer data (3)
County  Obs cases  Exp cases  x (% in agric.)  SMR    Adjacent counties
1       9          1.4        16               652.2  5,9,11,19
2       39         8.7        16               450.3  7,10
...     ...        ...        ...              ...    ...
56      0          1.8        10               0.0    18,24,30,33,45,55
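The SMR column is the ratio of observed to expected cases, expressed as a percentage. A minimal sketch using the three rows shown above; the function name is illustrative, and the results differ slightly from the tabulated SMRs, presumably because the expected counts are displayed rounded to one decimal (an assumption).

```python
def smr(observed, expected):
    """Standardised mortality ratio as a percentage: 100 * observed / expected."""
    return 100.0 * observed / expected

# (county, observed cases, expected cases) for the rows shown in the table.
rows = [(1, 9, 1.4), (2, 39, 8.7), (56, 0, 1.8)]
smrs = {county: smr(obs, exp) for county, obs, exp in rows}
```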
28Model for lip cancer data
(1) Graph
regression coefficient
covariate
random spatial effects
relative risks
observed counts
29Model for lip cancer data
(2) Distributions
- Data
- Link function
- Random spatial effects
- Priors
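One common way to write such a model is sketched below: a Poisson likelihood with a log link and intrinsic conditional autoregressive (CAR) spatial effects. This is an assumption about the general form, not a transcription of the slide; the symbols O_i, E_i, x_i, b_i, n_i (number of neighbours of county i) and tau are illustrative, and the priors are left unspecified.

```latex
\begin{aligned}
O_i &\sim \mathrm{Poisson}(\mu_i) && \text{(data)} \\
\log \mu_i &= \log E_i + \alpha_0 + \alpha_1 x_i + b_i && \text{(link function)} \\
b_i \mid b_{-i} &\sim \mathrm{N}\!\Bigl(\tfrac{1}{n_i} \sum_{j \sim i} b_j,\ \tfrac{1}{n_i \tau}\Bigr) && \text{(random spatial effects)} \\
\alpha_0,\ \alpha_1,\ \tau &\sim \text{prior distributions} && \text{(priors)}
\end{aligned}
```

Under this form the relative risk for county i is exp(alpha_0 + alpha_1 x_i + b_i).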
30WinBugs for lip cancer data
- BUGS and WinBUGS are systems for estimating the posterior distribution in a Bayesian model by simulation, using MCMC
- Data analytic techniques can be used to summarise (marginal) posteriors for parameters of interest
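To illustrate posterior simulation by MCMC on a much smaller problem than the spatial model, here is a minimal random-walk Metropolis sketch for a single county's relative risk. It is not what WinBUGS does internally; the Gamma(1, 1) prior, step size, and function name are assumptions chosen so the result can be checked against the conjugate answer.

```python
import math
import random

def metropolis_relative_risk(obs, exp, a=1.0, b=1.0, n_iter=20000, step=0.5, seed=0):
    """Random-walk Metropolis for theta, where obs ~ Poisson(exp * theta).

    ASSUMPTION: theta has a Gamma(a, b) prior (shape a, rate b), so the exact
    posterior is Gamma(a + obs, b + exp). Sampling is done on phi = log(theta).
    """
    rng = random.Random(seed)

    def log_target(phi):
        # Log posterior density of phi up to a constant, Jacobian included.
        return (a + obs) * phi - (b + exp) * math.exp(phi)

    phi = 0.0
    samples = []
    for _ in range(n_iter):
        proposal = phi + rng.gauss(0.0, step)
        log_accept = log_target(proposal) - log_target(phi)
        # Accept with probability min(1, exp(log_accept)).
        if rng.random() < math.exp(min(0.0, log_accept)):
            phi = proposal
        samples.append(math.exp(phi))
    return samples

# County 1 from the lip cancer table: 9 observed cases, 1.4 expected.
draws = metropolis_relative_risk(obs=9, exp=1.4)
kept = draws[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

The retained draws can then be summarised like any data set (means, quantiles, density estimates), which is the sense in which data analytic techniques summarise marginal posteriors.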
31WinBugs for lip cancer data
Dynamic traces for some parameters
32WinBugs for lip cancer data
Posterior densities for some parameters
33Graphical modelling 4
- Assuming structure to do probability calculations
- Inferring structure to make substantive conclusions
- Structure in model building
- Inference about latent variables
34Latent variable problems
variable unknown
variable known
edges known
edges unknown
value set unknown
value set known
35Hidden Markov models
e.g. Hidden Markov chain
z0
z1
z2
z3
z4
hidden
y1
y2
y3
y4
observed
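The chain structure is what makes inference about the hidden states tractable. As a minimal sketch, the forward recursion below computes the likelihood of y1..y4 for a discrete hidden Markov chain laid out as on the slide (z0 unobserved, each y_t depending only on z_t); the two-state parameters are hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """Likelihood p(y_1, ..., y_T) of a discrete hidden Markov chain.

    init[i]     = p(z_0 = i)
    trans[i][j] = p(z_t = j | z_{t-1} = i)
    emit[j][y]  = p(y_t = y | z_t = j)
    """
    n = len(init)
    alpha = list(init)  # no observation is attached to z_0
    for y in observations:
        # Move one step along the chain, then weight by the emission probability.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][y]
            for j in range(n)
        ]
    return sum(alpha)

# Hypothetical two-state example (illustrative numbers only).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
likelihood = forward_likelihood(init, trans, emit, [0, 1, 1, 0])
```

The cost is linear in the length of the chain, whereas summing over every hidden path directly grows exponentially.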
36Hidden Markov models
- Richardson & Green (2000) used a hidden Markov random field model for disease mapping
observed incidence
relative risk parameters
expected incidence
hidden MRF
37Larynx cancer in females in France
SMRs
38Latent variable problems
variable unknown
variable known
edges unknown
edges known
value set known
value set unknown
39Ion channel model choice
Hodgson and Green, Proc Roy Soc Lond A, 1999
40Example: hidden continuous time models
O2
O1
C1
C2
C1
C2
C3
O1
O2
41Ion channel model DAG
model indicator
transition rates
hidden state
binary signal
levels & variances
data
42model indicator
C1
C2
C3
O1
O2
transition rates
hidden state
binary signal
levels & variances
data
43Posterior model probabilities
.41
O1
C1
.12
O2
O1
C1
.36
O1
C1
C2
O2
O1
C1
C2
.10
44Alarm network
Learning a Bayesian network for an ICU ventilator management system from 10,000 cases on 37 variables (Spirtes & Meek, 1995)
45Latent variable problems
variable unknown
variable known
edges known
edges unknown
value set known
value set unknown
46Wisconsin students' college plans
10,318 high school seniors (Sewell & Shah, 1968, and many authors since)
ses
sex
5 categorical variables: sex (2), socioeconomic status (4), IQ (4), parental encouragement (2), college plans (2)
pe
iq
cp
47(Vastly) most probable graph according to an
exact Bayesian analysis by Heckerman (1999)
ses
sex
5 categorical variables: sex (2), socioeconomic status (4), IQ (4), parental encouragement (2), college plans (2)
pe
iq
cp
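To give a sense of scale for structure learning on five variables, the sketch below counts the cells of the full contingency table and the number of labelled DAGs on five nodes using Robinson's recurrence. It counts every DAG with no ordering or edge constraints, which may be a larger set than the one actually searched in the cited analysis; the function and variable names are illustrative.

```python
from math import comb, prod

def count_dags(n):
    """Number of labelled DAGs on n nodes, via Robinson's recurrence."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum(
            (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
            for k in range(1, m + 1)
        ))
    return a[n]

# Number of levels of each variable, as listed on the slide.
levels = {"sex": 2, "ses": 4, "iq": 4, "pe": 2, "cp": 2}
n_cells = prod(levels.values())   # cells in the full contingency table
n_dags = count_dags(len(levels))  # unconstrained DAGs over the five variables
```

With five variables an exhaustive comparison of graphs is feasible; the count grows super-exponentially with the number of nodes.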
48h
ses
sex
pe
iq
Heckerman's most probable graph with one hidden variable
decompos
cp