Title: Searching for structure in random field data
1. Searching for structure in random field data
- Keith J. Worsley (1,2), Thomas W. Yee (3), Russell B. Millar (3)
- (1) Department of Mathematics and Statistics, McGill University; (2) McConnell Brain Imaging Centre, Montreal Neurological Institute, Montreal, Canada; (3) Department of Statistics, University of Auckland, New Zealand
- www.math.mcgill.ca/keith
2. What is Data Mining?
- The June 26, 2000, issue of TIME predicted that one of the 10 hottest jobs of the 21st century will be Data Mining: "research gurus will be on hand to extract useful tidbits from mountains of data, pinpointing behaviour patterns for marketers and epidemiologists alike."
3. Some definitions
- Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage (SAS 1998 Annual Report, p. 51).
- Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad).
- Data mining is the process of discovering advantageous patterns in data (John).
- Data mining is the computer-automated exploratory data analysis of (usually) large complex data sets (Friedman, 1998).
- Data mining is the search for valuable information in large volumes of data (Weiss and Indurkhya, 1998).
- In contrast, Statistics is the science of collecting, organizing and presenting data.
4. Why is it called Data Mining?
- Plentiful data can be mined for nuggets of gold (i.e. truth/insight/knowledge) by sifting through vast amounts of raw data.
- Some statisticians have criticized it as "data dredging" or a "fishing expedition" in search of publishable P-values, or "torturing the data until it confesses".
- Many DM methods are heuristic, complex and computer-intensive, so their statistical properties are usually not tractable.
- The focus of DM is often prediction rather than statistical inference.
- "I understand mining to be a very carefully planned search for valuables hidden out of sight, not a haphazard ramble. Mining is thus rewarding, but, of course, a dangerous activity." (D.R. Cox, in the discussion of Chatfield, 1995)
5. Striking fool's gold
- The Bible Code, a best-selling book by Michael Drosnin, claims to find hidden messages in the Bible about dinosaurs, Bill Clinton, the Rabin assassination, etc., from searches of arrays of letters.
- In 1992, ProCyte Corp. was dismayed when a newly developed drug, Iamin, failed to promote general healing of diabetic ulcer wounds. So the company searched through subsets of the data and found that Iamin appeared to work on certain foot wounds. But that was a statistical fluke, as it turned out after an expensive clinical trial. Denied drug status, Iamin is now sold as a wound dressing.
6. Confirming vs. Discovering
- There are two types of DM:
  - Hypothesis testing (aka the top-down approach)
  - Knowledge Discovery in Databases (KDD) (aka the bottom-up approach)
- Directed KDD seeks to explain the value of some particular variable in terms of other variables.
- Undirected KDD identifies patterns in the data.
- Undirected KDD recognizes relationships in data; directed KDD explains those relationships once they have been found.
7. Mining the miners
- DM so far has been largely a commercial enterprise. As in most gold rushes of the past, the goal is to mine the miners: the largest profits are made by selling tools to the miners, rather than by doing the actual mining.
- Hardware manufacturers emphasize the high computational requirements of DM.
- Software developers emphasize competitive edge: "Your competitor is doing it, so you had better keep up."
8. Some commercial software
- SAS Enterprise Miner
- SPSS Clementine, Neural Connection and AnswerTree
- IBM Intelligent Miner
- SGI MineSet
- NeoVista Software ASIC
- Mathsoft S-PLUS (for small data sets)
9. Some methods
- Hypothesis testing: regression, analysis of variance, time series analysis.
- Directed KDD: classification, discrimination, structural equation modeling, supervised neural networks.
- Undirected KDD: cluster analysis, tree methods (AID, CHAID, CART), principal components analysis (PCA), independent components analysis (ICA), unsupervised neural networks.
10. Allied fields
- Exploratory Data Analysis (EDA): Tukey defined statistics in terms of problems rather than tools.
- Informatics is research on, development of, and use of technological, sociological, and organizational tools and applications for the dynamic acquisition, indexing, dissemination, storage, querying, retrieval, visualization, integration, analysis, synthesis, sharing (which includes electronic means of collaboration), and publication of data such that economic and other benefits may be derived from the information by users of all sections of society.
- Pattern recognition: given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples, e.g. identify plants or tumors, or decide to buy or sell stocks.
- Machine learning is the study of computer algorithms that improve automatically through experience. Applications range from data mining programs that discover rules in large data sets, to information filtering systems that automatically learn users' interests (Mitchell, 1997).
- Meta-analysis is the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.
11. Brain mapping data
- We have huge databases of brain images (MRI, fMRI, PET, EEG, MEG, ...) together with patient information (age, sex, psychological tests, disease, genotype, ...).
- The novelty is that the image variables are 3D images rather than single numbers (such as blood pressure, cholesterol level, ...).
- These images can themselves be mined for interesting information, e.g. peaks or clusters of activated regions.
12. Some data mining tools already used in brain mapping
- Regression, analysis of variance, time series
- Cluster analysis (e.g. clustering of fMRI time courses)
- PCA and ICA of the voxels x scans matrix
- Structural equation modeling to analyze connectivity
- Pattern recognition to segment gray/white/CSF
- Meta-analysis to combine locations of activation from different studies
13. Tree methods: Automatic Interaction Detection (AID)
- Morgan, J.N. and Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58, 415-434.
- Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119-127.
- Worsley, K.J. (1978). Significance testing in Automatic Interaction Detection (AID). PhD thesis, University of Auckland.
14. How AID works
- Split the observations into two groups according to the values of a predictor.
- Two types of predictors:
  - Monotonic: split by thresholding (predictor <= x vs. predictor > x)
  - Free: split into any two subsets, e.g. if the predictor takes values x1, ..., x7: {x1, x5, x6} vs. {x2, x3, x4, x7}
- Choose the split that maximizes a test statistic for the difference in the dependent or target variable.
- Repeat on the two subgroups until some stopping criterion is reached (the split is not significant, or the subgroup size is too small); see the sketch below.
15. SPSS example: credit risk data
[Screenshot of the credit-risk data table: a dependent (target) variable plus predictors, each marked M or F.]
- M = monotonic (split by thresholding), F = free (split into any two subsets)
16. (No transcript)
17. (No transcript)
18. (No transcript)
19. Brain mapping example: cortical thickness
Dependent or target: Sex. Predictors (all monotonic, M): Node1, ..., Node40962.

Subject  Node1  Node2  Node3  Node4  ...  Node40962  Sex
1        3.73   3.05   3.93   2.30   ...  1.59       m
2        2.95   1.17   3.33   2.75   ...  1.03       f
3        2.30   1.23   2.56   1.20   ...  1.46       f
4        2.64   2.19   2.57   2.25   ...  1.29       m
5        2.39   2.76   2.51   2.82   ...  1.02       f
6        3.26   1.85   3.31   1.70   ...  1.65       f
7        2.68   2.52   3.23   2.30   ...  1.47       m
8        3.60   3.66   2.90   2.25   ...  1.79       m
9        3.27   1.43   2.88   1.81   ...  2.14       f
...
321      4.10   2.67   2.83   1.78   ...  1.70       f
20. Misclassification matrix: cortical thickness

                      Actual category
                      Male    Female
Predicted   Male      145     18
category    Female    18      140
21. (No transcript)
22. fMRI data: 120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, ...
T = (hot - warm effect) / s.d., distributed as t with 110 df if there is no effect.
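A rough sketch of where this statistic comes from (our simplified design with only hot/warm indicator regressors; the real analysis also removes drift and other nuisance terms, which is how it arrives at 110 df):

```python
# Per-voxel t statistic for the (hot - warm) effect: regress each voxel's
# time course on hot/warm indicators.  With only these three columns the
# df is scans - 3; the slide's 110 df reflects extra nuisance terms.
import numpy as np

def hot_warm_t(Y, stim):
    """Y: scans x voxels array; stim: sequence of 'hot'/'warm'/'rest' labels."""
    stim = np.asarray(stim)
    X = np.column_stack([np.ones(stim.size),          # intercept
                         stim == "hot",
                         stim == "warm"]).astype(float)
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]       # least-squares fit
    df = X.shape[0] - X.shape[1]
    sigma2 = ((Y - X @ beta) ** 2).sum(axis=0) / df   # residual variance
    c = np.array([0.0, 1.0, -1.0])                    # hot - warm contrast
    var_c = c @ np.linalg.inv(X.T @ X) @ c
    return (c @ beta) / np.sqrt(sigma2 * var_c)       # ~ t_df if no effect
```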
23. Brain mapping example: fMRI
Dependent or target: Stimulus. Predictors (all monotonic, M): Voxel1, ..., Voxel30786.

Frame  Voxel1  Voxel2  Voxel3  Voxel4  ...  Voxel30786  Stimulus
1       1.10    1.66    1.53    0.77   ...  -0.12       hot
2      -0.59    0.23    0.38   -0.43   ...  -1.73       hot
3       1.06    1.57    1.56    1.14   ...   0.64       hot
4       1.63    1.79    0.88   -0.22   ...  -0.07       hot
5       2.30    1.96    1.41    1.33   ...   1.76       hot
6       1.27    1.36    0.73    0.24   ...   1.22       warm
7       1.18    1.33    1.35    1.30   ...   0.88       warm
8       0.98    0.90    0.47    0.18   ...   0.60       warm
9       1.46    1.25    0.77    0.73   ...   1.30       warm
10      0.07    0.70    1.29    1.96   ...   2.04       warm
11      0.39    0.68    1.13    1.81   ...   1.80       warm
12      0.04   -0.04   -0.18    0.37   ...   1.63       hot
13     -0.06    0.20    0.29    0.49   ...   0.70       hot
14     -0.48   -0.26   -0.19   -0.16   ...  -0.42       hot
15     -0.09   -0.39   -0.84   -0.94   ...  -0.68       hot
16     -0.24    0.02    0.51    1.20   ...   1.38       hot
17     -1.52   -1.11   -1.44   -1.88   ...  -1.11       hot
18     -0.07    0.10   -0.07   -0.24   ...   0.17       warm
19     -1.40   -0.57    0.01    0.30   ...   0.41       warm
...
117    -0.01    0.50    0.74    0.83   ...   0.99       warm
24. Misclassification matrix: fMRI

                      Actual category
                      Hot     Warm
Predicted   Hot       51      1
category    Warm      7       58
25. Splitting the SPM itself
Dependent or target: T statistic. Predictors: x, y, z (type ?).

Voxel  x       y         z        T statistic
1      1.1719  -10.5469  7.2921   5.4852
2      3.5156  -10.5469  7.2921   5.9170
3      5.8594  -10.5469  7.2921   5.0115
4      1.1719  -8.2031   7.2921   6.1082
5      3.5156  -8.2031   7.2921   6.4825
6      5.8594  -8.2031   7.2921   5.7299
7      1.1719  -5.8594   7.2921   6.7113
8      3.5156  -5.8594   7.2921   7.3540
9      5.8594  -5.8594   7.2921   6.5934
10     1.1719  -10.5469  14.2921  5.4519
11     3.5156  -10.5469  14.2921  6.3674
12     5.8594  -10.5469  14.2921  6.3184
13     1.1719  -8.2031   14.2921  6.2774
14     3.5156  -8.2031   14.2921  6.5888
15     5.8594  -8.2031   14.2921  6.2456
16     1.1719  -5.8594   14.2921  6.3583
17     3.5156  -5.8594   14.2921  6.4093
18     5.8594  -5.8594   14.2921  5.8665
26. How do we split on a spatial predictor?
Splits can be regarded as models with different means for the two groups.
[Figure: the SPM and the fitted model for a monotonic predictor and for a free predictor; for a spatial predictor, the free split shown on the unsmoothed and the smoothed SPM. Smooth the SPM with a filter that matches the model.]
27. So...
- Treating spatial location as a free predictor (for the smoothed SPM) is equivalent to simply thresholding the smoothed SPM.
- We can choose the threshold to control the false splitting rate to P < 0.05 using Bonferroni corrections or random field theory; see the sketch below.
- If the model width is unknown, we can make the filter width another parameter of the model, which leads to scale space.
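A sketch of the two corrected thresholds, assuming a Gaussian SPM (hedging: the random field part below keeps only the 3-D term of the expected Euler characteristic of a smooth Gaussian field; a full analysis adds the lower-dimensional boundary terms):

```python
# Corrected thresholds for a Gaussian SPM: Bonferroni over the voxels, and
# the random-field-theory threshold solving E[EC] = alpha using only the
# 3-D EC density (4 ln 2)^(3/2) / (2 pi)^2 * (t^2 - 1) * exp(-t^2 / 2).
import numpy as np
from scipy import stats, optimize

def bonferroni_threshold(n_voxels, alpha=0.05):
    return stats.norm.isf(alpha / n_voxels)

def rft_threshold(volume, fwhm, alpha=0.05):
    resels = volume / fwhm ** 3                  # resels = volume / FWHM^3
    def expected_ec(t):
        return (resels * (4 * np.log(2)) ** 1.5 / (2 * np.pi) ** 2
                * (t ** 2 - 1) * np.exp(-t ** 2 / 2))
    return optimize.brentq(lambda t: expected_ec(t) - alpha, 2.0, 10.0)

print(bonferroni_threshold(30786))               # about 4.7 for ~31,000 voxels
print(rft_threshold(volume=1.0e6, fwhm=20.0))    # about 4.1 for 125 resels
```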
28. (No transcript)
29. Scale space: smooth X(t) with a range of filter widths s; this continuous wavelet transform adds an extra dimension to the random field: X(t, s).
[Figure: scale-space images, S = FWHM (mm, on a log scale, 6.8-34 mm) vs. t (mm, -60 to 60). Top panel: scale space, no signal. Bottom panel: one 15 mm signal.]
A 15 mm signal is best detected with a 15 mm smoothing filter.
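A small toy version of this construction in 1-D (our numbers; filter widths are quoted as FWHM, with FWHM = sigma * sqrt(8 ln 2), and the rescaling keeps the smoothed noise at unit variance):

```python
# Build a toy scale-space field X(t, s): one 15 mm FWHM Gaussian signal in
# unit white noise, smoothed with a log-spaced range of Gaussian filters.
import numpy as np
from scipy.ndimage import gaussian_filter1d

c = np.sqrt(8 * np.log(2))                       # FWHM = c * sigma
t = np.arange(-60.0, 61.0)                       # mm, 1 mm grid
rng = np.random.default_rng(0)
signal = 4 * np.exp(-(t / (15 / c)) ** 2 / 2)    # 15 mm FWHM bump
x = signal + rng.standard_normal(t.size)         # signal + white noise

fwhms = np.geomspace(6.8, 34, 30)                # filter widths, log scale
field = np.empty((fwhms.size, t.size))
for i, w in enumerate(fwhms):
    sigma = w / c
    smooth = gaussian_filter1d(x, sigma, mode="nearest")
    # rescale so smoothed unit white noise keeps unit variance
    field[i] = smooth * np.sqrt(2 * sigma * np.sqrt(np.pi))

i, j = np.unravel_index(field.argmax(), field.shape)
print(f"peak at t = {t[j]:.0f} mm with FWHM = {fwhms[i]:.1f} mm")  # near 15 mm
```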
30. Matched Filter Theorem (= Gauss-Markov Theorem): to best detect a signal + white noise, the filter should match the signal.
[Figure: scale-space images, S = FWHM (mm, on a log scale, 6.8-34 mm) vs. t (mm, -60 to 60). Top panel: 10 mm and 23 mm signals. Bottom panel: two 10 mm signals 20 mm apart.]
But if the signals are too close together, they are detected as a single signal half-way between them.
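A quick numerical check of the matched-filter claim in the continuous white-noise case (the closed forms below for the smoothed peak height and smoothed noise s.d. are standard Gaussian-convolution identities):

```python
# Peak SNR of a unit Gaussian signal (width ss) in unit white noise after
# Gaussian smoothing (width sf): peak height ss / sqrt(ss^2 + sf^2), noise
# s.d. 1 / sqrt(2 * sf * sqrt(pi)).  The ratio is maximized at sf = ss.
import numpy as np

def peak_snr(signal_fwhm, filter_fwhm):
    c = np.sqrt(8 * np.log(2))                    # FWHM = c * sigma
    ss, sf = signal_fwhm / c, filter_fwhm / c
    return (ss / np.sqrt(ss**2 + sf**2)) * np.sqrt(2 * sf * np.sqrt(np.pi))

widths = np.geomspace(5, 60, 500)
best = widths[np.argmax(peak_snr(15.0, widths))]
print(f"best filter for a 15 mm signal: {best:.1f} mm FWHM")  # ~15 mm
```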
31. Scale space can even separate two signals at the same location!
[Figure: scale-space image of 8 mm and 150 mm signals at the same location; S = FWHM (mm, on a log scale, 6.8-170 mm) vs. t (mm, -60 to 60).]
32. FWHM = 6.8 mm
33. FWHM = 9 mm
34. FWHM = 11 mm
35. FWHM = 15 mm
36. FWHM = 20 mm
37. FWHM = 26 mm
38. FWHM = 34 mm
39. FWHM
40. FWHM
41. FWHM
42. FWHM
43. FWHM
44. FWHM
45. (No transcript)
46. FWHM
47. FWHM
48. Functional connectivity
- Measured by the correlation between the residuals at every pair of voxels (6D data!).
- Local maxima are larger than all 12 neighbours.
- The P-value can be calculated using random field theory.
- Good at detecting focal connectivity, but...
- PCA of the scans x voxels residuals is better at detecting large regions of co-correlated voxels; see the sketch below.
[Figure: scatterplots of the residuals at voxel 1 vs. voxel 2, under activation only and under correlation only.]
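A toy sketch of both measures on a hypothetical scans x voxels residual matrix (the dimensions and variable names are ours):

```python
# Two connectivity measures: the all-pairs correlation matrix of the
# residuals (a 2-D slice of the 6-D correlation field), and the first
# principal component of the residuals.
import numpy as np

rng = np.random.default_rng(1)
R = rng.standard_normal((100, 500))           # residuals: scans x voxels

Rz = (R - R.mean(axis=0)) / R.std(axis=0)     # standardize each voxel
corr = Rz.T @ Rz / R.shape[0]                 # voxel-by-voxel correlations
# focal connectivity: local maxima of `corr` above an RFT threshold

U, S, Vt = np.linalg.svd(Rz, full_matrices=False)
first_pc = Vt[0]                              # voxel loadings on PC 1
# extended connectivity: large regions of co-correlated voxels load
# together on the first principal component
```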
49. Correlations > 0.7, P < 10^-10 (corrected)
First principal component > threshold
50. False Discovery Rate (FDR)
- Benjamini and Hochberg (1995), Journal of the Royal Statistical Society
- Benjamini and Yekutieli (2001), Annals of Statistics
- Genovese et al. (2001), NeuroImage
- FDR controls the expected proportion of false positives amongst the discoveries, whereas
- Bonferroni / random field theory controls the probability of any false positives, and
- no correction controls the proportion of false positives in the volume.
51. Signal + Gaussian white noise, thresholded three ways
[Figure: the true discoveries (signal) and false discoveries (noise) at each threshold.]
- P < 0.05 (uncorrected), T > 1.64: 5% of the volume is false positive
- FDR < 0.05, T > 2.82: 5% of the discoveries are false positives
- P < 0.05 (corrected), T > 4.22: 5% probability of any false positives
52. Comparison of thresholds
- FDR depends on the ordered P-values P(1) < P(2) < ... < P(n). To control the FDR at α = 0.05, find K = max{ i : P(i) < (i/n) α } and threshold the P-values at P(K); see the sketch below.

  Proportion of true:  1     0.1   0.01  0.001  0.0001
  Threshold T:         1.64  2.56  3.28  3.88   4.41

- Bonferroni thresholds the P-values at α/n.

  Number of voxels:    1     10    100   1000   10000
  Threshold T:         1.64  2.58  3.29  3.89   4.42

- Random field theory: resels = volume / FWHM^3.

  Number of resels:    0     1     10    100    1000
  Threshold T:         1.64  2.82  3.46  4.09   4.65
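A minimal sketch of the step-up rule next to the Bonferroni cut, on simulated one-sided P-values (the toy mixture of null and activated voxels is ours, not the data behind the table above):

```python
# Benjamini-Hochberg: sort the P-values and find the largest P(i) with
# P(i) <= (i/n) * alpha; reject everything below that cut-off.
import numpy as np
from scipy import stats

def bh_threshold(p, alpha=0.05):
    ps = np.sort(p)
    ok = ps <= np.arange(1, ps.size + 1) / ps.size * alpha
    return ps[ok].max() if ok.any() else 0.0

rng = np.random.default_rng(0)
t = np.concatenate([rng.standard_normal(9000),        # null voxels
                    rng.standard_normal(1000) + 4])   # 10% activated voxels
p = stats.norm.sf(t)                                   # one-sided P-values

for name, cut in [("FDR < 0.05", bh_threshold(p)),
                  ("Bonferroni", 0.05 / p.size)]:
    print(f"{name}: reject P < {cut:.2g}  (T > {stats.norm.isf(cut):.2f})")
```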
53. P < 0.05 (uncorrected), T > 1.64: 5% of the volume is false positive
54. FDR < 0.05, T > 2.66: 5% of the discoveries are false positives
55. P < 0.05 (corrected), T > 4.90: 5% probability of any false positives