1 Combination of Independent Component Analysis and statistical modeling for the identification of metabonomic biomarkers in 1H-NMR spectroscopy
Réjane Rousseau (Institut de Statistique, UCL, Belgium)
2 Metabonomics
[Diagram: organs, tissues and metabolites, without contact and after contact; biofluid (urine) analysed by 1H-NMR spectroscopy in the frequency domain; spectral alterations allow the detection of altered metabolites; a specific altered region is a biomarker]
- One molecule gives several peaks (1 to 3) with specific positions in the frequency domain.
- The concentration of a molecule is proportional to the area under the curve in its peaks.
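As a minimal sketch of the proportionality between concentration and peak area, the snippet below integrates a synthetic single-peak spectrum with the trapezoidal rule. The peak shape, the frequency window and the helper name `peak_area` are illustrative assumptions, not data or code from the study.

```python
import numpy as np

def peak_area(freq, intensity, lo, hi):
    """Trapezoidal area under the spectrum between frequencies lo and hi."""
    mask = (freq >= lo) & (freq <= hi)
    f, y = freq[mask], intensity[mask]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(f)))

# Synthetic single-peak spectrum (hypothetical numbers, not real NMR data).
freq = np.linspace(0.0, 10.0, 2001)
shape = np.exp(-0.5 * ((freq - 5.0) / 0.1) ** 2)   # Gaussian-shaped peak at 5.0

area_1 = peak_area(freq, 1.0 * shape, 4.0, 6.0)    # reference "concentration"
area_2 = peak_area(freq, 2.0 * shape, 4.0, 6.0)    # doubled "concentration"
ratio = area_2 / area_1
print(ratio)
```

Doubling the intensity of the peak doubles its area, which is what makes peak areas usable as concentration proxies.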
Identification of biomarkers: which part of the spectrum to examine?
- How?
  - In an experimental database
  - with a methodology based on Principal Component Analysis (PCA)
Objective: propose a methodology combining ICA and statistical modeling
3 Identification of biomarkers
- Experimental data
  - Biologists have a pool of n rats in a control or a pathological state.
  - They collect n samples of urine.
  - 1H-NMR produces n spectra of m values: the spectral data X (n x m).
  - Each sample (spectrum) is described by l variables in a design matrix Y (n x l).
  - One of these variables, yk, describes the characteristic related to the biomarkers.
- Biomarker identification: statistical methods to answer the question "In these multivariate data, which are the most altered variables xj?"
Spectral data: X (n x m), with columns xj
Design data: Y (n x l), with columns y1, y2, ..., yl
Example rows of Y (first two columns):
  y1   y2
  13   320
  11   200
  11   100
  12   270
with y1 = age of the rat and yk = y2 = severity of diabetes
4 Example: controlled data
- Advantage of controlled data: we know the spectral regions that should be identified as biomarkers.
- The controlled data
  - 28 spectra of 600 points: X (28 x 600)
  - Each spectrum: a sample of urine + a chosen concentration of citrate + a chosen concentration of hippurate
  - X (28 x 600), Y (28 x 2), with y1 = concentration of citrate and y2 = concentration of hippurate
- We need a biomarker to detect changes in the level of citrate described by y1.
- Which spectral regions xj are the most altered when y1 changes?
- The biomarkers to identify: the spectral regions corresponding to citrate.
[Figure: axes citrate (y1) and hippurate (y2)]
5 Spectrum: 3000 points
A spectrum of 600 values xj with $\sum_j x_j = 1$
Each sample: urine + citrate + hippurate
14 mixtures in 2 replications = 28 samples
[Figure: spectra of natural urine, hippurate (y2) and citrate (y1)]
The biomarkers to identify: the spectral regions corresponding to citrate.
6 USUAL (PCA) versus NEW (ICA)
USUAL
I. Reduction of the dimension: PCA
- $X_C = T P$
- Principal components are uncorrelated and in the direction of maximum variance
- Examination of the first 2 components (score plot; loadings L1, L2)
Identification of biomarkers
This is only powerful if the biological question is related to the highest variance in the dataset!
NEW
I. Reduction of the dimension: ICA
- $X^{T}_{C} = S A^{T}$
- Components are independent, with a biological meaning
- Examination of ALL the components to visualize unconnected molecules in the samples
II. Biomarker discovery through statistical modelling (e.g. citrate plays an important role)
Identification of biomarkers
Comparison of the intensities of biomarkers between spectra from different conditions
7 The proposed methodology
Part I: Dimension reduction with ICA on the spectral data
- $X^{T}_{C} = S A^{T}$
- Resulting components: meaningful components, with some advantages over the principal components
Part II: Biomarker discovery through statistical modelling
- Identification of biomarkers
- Comparison of the values in biomarkers between spectra from different conditions
8 What is Independent Component Analysis (ICA)?
- The idea
  - Each observed vector of data is a linear combination of unknown independent (not only linearly independent) components.
  - ICA provides the independent components (sources, sk) which have created a vector of data, and the corresponding mixing weights aki.
- How do we estimate the sources?
  - With linear transformations of the observed signals that maximize the independence of the sources.
- How do we evaluate this property of independence?
  - By the Central Limit Theorem (*), the independence of the source components can be reflected by non-Gaussianity.
  - Solving the ICA problem consists of finding a demixing matrix which maximises the non-Gaussianity of the estimated sources under the constraint that their variances are constant.
- FastICA algorithm
  - uses an objective function related to negentropy
  - uses a fixed-point iteration scheme.
(*) Almost any measured quantity which depends on several underlying independent factors has an approximately Gaussian PDF.
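The link between mixing and Gaussianity can be illustrated numerically. The sketch below uses excess kurtosis as a simple non-Gaussianity measure (an illustrative choice; FastICA itself relies on a negentropy-related objective) and two simulated uniform sources, which are assumptions made only for this example.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a Gaussian, non-zero for non-Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 200_000
s1 = rng.uniform(-1.0, 1.0, n)        # simulated independent source 1
s2 = rng.uniform(-1.0, 1.0, n)        # simulated independent source 2
mix = (s1 + s2) / np.sqrt(2.0)        # a linear mixture of the two sources

k_source = excess_kurtosis(s1)        # theoretical value for a uniform: -1.2
k_mix = excess_kurtosis(mix)          # theoretical value for this mixture: -0.6
print(k_source, k_mix)
```

The mixture is closer to Gaussian (kurtosis closer to 0) than either source, so searching for the linear combinations that are as non-Gaussian as possible leads back towards the individual sources.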
9 I. Dimension reduction by ICA
X (n x m): n spectra defined by m variables, e.g. (28 x 600)
Transposition: $X^{T}$ (m x n)
Centering: $X^{T}_{C}$ (m x n); the mixtures and the sources sj have zero mean
Model: $X^{T}_{C} = S A^{T}$
- Each spectrum is a weighted sum of the independent spectral expressions, each of which can correspond to an independent (composite) metabolite contained in the studied sample.
- ($a^{T}$: weight, reflecting quantity)
Whitening by PCA: $T_{(m \times q)} = X^{T}_{C} P$
- we obtain uncorrelated scores with unit variance ⇒ the demixing matrix is orthogonal
- possibility to discard irrelevant scores: choose the number of sources to estimate
ICA: $S_{(m \times q)} = X^{T}_{C} P W = X^{T}_{C} A$
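The steps of this slide (transposition, centering, PCA whitening, orthogonal demixing) can be sketched with NumPy alone. This is a simplified symmetric fixed-point iteration with a tanh nonlinearity, written for illustration and not the author's implementation; the function name `ica_reduce`, the simulated Laplace sources and all sizes are assumptions.

```python
import numpy as np

def ica_reduce(X, q, max_iter=200, tol=1e-6, seed=0):
    """Return (S, AT, T): centred X.T ~= S @ AT, with S (m x q), AT (q x n), T whitened scores."""
    XT = X.T                                   # transposition: (m x n)
    XTC = XT - XT.mean(axis=0)                 # centering: each mixture has zero mean
    m, n = XTC.shape

    # Whitening by PCA: keep the q leading directions, rescale to unit variance.
    evals, evecs = np.linalg.eigh(XTC.T @ XTC / m)
    top = np.argsort(evals)[::-1][:q]
    P = evecs[:, top] / np.sqrt(evals[top])    # (n x q)
    T = XTC @ P                                # (m x q), T.T @ T / m = I

    # Symmetric fixed-point iterations on an orthogonal demixing matrix W.
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.normal(size=(q, q)))[0]
    for _ in range(max_iter):
        G = np.tanh(T @ W)
        W_new = T.T @ G / m - W * (1.0 - G ** 2).mean(axis=0)
        U, _, Vt = np.linalg.svd(W_new)
        W_new = U @ Vt                         # re-orthogonalise
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=0)) - 1.0)) < tol
        W = W_new
        if converged:
            break

    S = T @ W                                  # estimated sources
    AT = S.T @ XTC / m                         # mixing weights (S.T @ S / m = I)
    return S, AT, T

# Simulated example (hypothetical data): 3 Laplace sources mixed into 8 "spectra".
rng = np.random.default_rng(42)
m, n, q = 5000, 8, 3
S_true = rng.laplace(size=(m, q))
A_true = rng.uniform(0.5, 2.0, size=(n, q))
X = (S_true @ A_true.T).T                      # (n x m)

S, AT, T = ica_reduce(X, q)
corr = np.abs(np.corrcoef(S_true.T, S.T)[:q, q:])
best_match = corr.max(axis=1)                  # best |correlation| per true source
print(best_match)
```

The sources are recovered only up to order and sign, which is why the check uses absolute correlations and the best match per true source.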
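10 Example: I. Dimension reduction by ICA
$X^{T}_{C} = S A^{T}$, with $X^{T}_{C}$ (600 x 28), S (600 x 6) and $A^{T}$ (6 x 28)
- $X^{T}_{C}$: columns $x^{TC}_{1}, \dots, x^{TC}_{28}$ (the 28 spectra: urine + citrate + hippurate)
- S: columns s1, ..., s6, with entries sij from s1,1 to s600,6
- $A^{T}$: rows $a^{T}_{1}, \dots, a^{T}_{6}$, with entries from $a^{T}_{1,1}$ to $a^{T}_{6,28}$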
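11 S (600 x 6) and $A^{T}$
[Figure: the sources in S and their weights in $A^{T}$ over the 28 spectra; $a^{T}_{i,8}$ is the weight of source i in spectrum 8; labelled sources: natural urine, citrate, hippurate]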
12 Note: comparison with the usual PCA
- Similarities: projection methods linearly decomposing multi-dimensional data into components.
- Differences
  - The number of sources, q, has to be fixed.
  - Sources are not naturally sorted according to their importance.
  - The independence condition is the biggest advantage of ICA:
    - independent components are more meaningful than uncorrelated components;
    - more suitable for our question, in which the components of interest are not always in the direction of maximum variance.
[Figure: PCA and ICA directions 1 and 2; label: natural urine]
13 PCA / ICA
[Figure: PCA loadings 1 to 3 next to ICA sources s1 to s3, with labels natural urine, citrate and hippurate; PCA score plot (PC1, PC2) next to ICA weights ($a^{T}_{2}$, $a^{T}_{3}$)]
14 The proposed methodology
Part I: Dimension reduction with ICA on the spectral data
- $X^{T}_{C} = S A^{T}$
- q sources representing the spectra of independent (unrelated) composite metabolites contained in the samples.
Part II: Biomarker discovery through statistical modeling on the mixing matrix $A^{T}$, with covariates chosen in the design matrix
- Identification of biomarkers
- Comparison of the intensities in biomarkers between spectra from different conditions
15 Part II: Biomarker discovery with a statistical model
- The idea: among the q recovered sj, we suppose that some sources
  - present biomarker regions for a chosen factor yk;
  - are interpretable as the spectra of pure or composite independent metabolites whose concentration in the samples is influenced by a chosen factor yk;
  - have weights influenced by a chosen factor yk.
- $A^{T}$ (q x n): modelling of the relation between the weight vectors and the design variables.
16 Part II
Part I: ICA on the spectral data, $X^{T}_{C} = S A^{T}$
Part II: Biomarker discovery through statistical modelling
Step 1: Fit a linear model on $A^{T}$
- relation between the weight vector and the covariates in the design variables
- different models.
Step 2: Biomarker identification
- apply statistical tests on the parameters of the models
- selection of the sources with significant effects.
Step 3: Comparison of the intensities in biomarkers between spectra from different conditions
- prediction of mixing weights by the model
- reconstruction with the biomarker sources
- comparison between factor levels.
17 Step 1: Fit a model on $A^{T}$
- The design matrix Y is rewritten into 2 separate matrices:
  - Z1, the (n x p1) incidence matrix for the p1 covariates with fixed effects
  - Z2, the (n x p2) incidence matrix for the p2 covariates with random effects
- For each of the q recovered sj, we assume a linear relation between its vector of weights and the design variables:
  $a_j = Z_1 \beta_j + Z_2 \gamma_j + \epsilon_j$
- Models with only fixed-effects covariates: $a_j = Z_1 \beta_j + \epsilon_j$
  - Case 1: categorical covariates ⇒ ANOVA
    - ex. 1: biomarker to discriminate 2 groups of subjects (diseased, healthy)
    - ex. 2: biomarker to discriminate 3 groups of subjects (disease 1, disease 2, healthy)
  - Case 2: quantitative covariates ⇒ linear regression
18 Step 1: Fit a model, example
- For each of the q = 6 recovered sj, we construct a multiple linear regression model with 2 quantitative covariates (p = 3) and no interaction:
  $a_j = \beta_{j0} + \beta_{j1} y_1 + \beta_{j2} y_2 + \epsilon_j$
  with
  - βj0: the intercept
  - y1: the citrate concentration in mg (quantitative)
  - y2: the hippurate concentration in mg (quantitative)
  - βj1: the effect of citrate on the mean of aj for a fixed value of hippurate
  - βj2: the effect of hippurate on the mean of aj for a fixed value of citrate
  - εj: the vector of independent random errors, $N(0, \sigma^2)$
- For each of the q recovered sj, the model fitted by least squares is
  $\hat{a}_j = b_{j0} + b_{j1} y_1 + b_{j2} y_2$
- In this example, we want to identify biomarkers for the concentration of citrate: the covariate of interest is yk = y1.
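A minimal sketch of this least-squares fit for one weight vector aj, using `numpy.linalg.lstsq`. The concentrations, the "true" coefficients and the noise level are simulated values chosen for illustration, not the experimental design of the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 28                                         # number of spectra
y1 = rng.uniform(0.0, 100.0, n)                # simulated citrate concentrations
y2 = rng.uniform(0.0, 100.0, n)                # simulated hippurate concentrations

Z1 = np.column_stack([np.ones(n), y1, y2])     # (n x p) design, p = 3 (intercept, y1, y2)
beta_true = np.array([0.5, 0.02, 0.0])         # hypothetical: this source reacts to citrate only
a_j = Z1 @ beta_true + rng.normal(0.0, 0.05, n)   # simulated weights of source j

b_j, *_ = np.linalg.lstsq(Z1, a_j, rcond=None)    # least-squares estimates (b_j0, b_j1, b_j2)
a_hat = Z1 @ b_j                                  # fitted weights
residuals = a_j - a_hat
print(b_j)
```

In practice the same design matrix Z1 is reused for each of the q weight vectors, so the q fits can be obtained in one call by passing the (n x q) matrix of weights as the right-hand side.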
19 Step 1: Fit a model, example
[Figure: weights a2 of source s2 (citrate) and weights a3 of source s3 (hippurate), plotted against citrate (y1) and hippurate (y2)]
20 Step 2: Biomarker identification
- Goal: among the q sources, we want to select the ones presenting a significant effect of the chosen covariate yk on their weights.
- These discriminant sources represent the spectrum of an independent metabolite with a concentration depending on the chosen covariate: the biomarkers.
- For each source sj, test the significance of the parameter βjk of the covariate of interest yk (e.g. when searching for biomarkers for the dose of citrate y1, we test each of the 6 βj1):
  - $H_0: \beta_{jk} = 0$ vs $H_1: \beta_{jk} \neq 0$
  - compute the statistic $t_j = b_{jk} / s(b_{jk})$, which follows a $t_{(n-p)}$ distribution under $H_0$
  - take the corresponding p-value $p_j = P(|t_{(n-p)}| \geq |t_j|)$
- We are in a multiple-testing situation: the selection of a significant set of r coefficients βjk is based on q p-values pj obtained from q individual tests.
  - ⇒ Bonferroni correction: select, in an (m x r) matrix S, the r sources with $p_j < 0.05/q$
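The test and the Bonferroni selection can be sketched as follows, assuming a two-sided t-test and SciPy's `scipy.stats.t.sf` for the tail probability. The helper name `coef_pvalues` and the simulated weights (only source index 1 depends on y1) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def coef_pvalues(Z, A, k):
    """t statistics and two-sided p-values for coefficient k, one regression per column of A."""
    n, p = Z.shape
    B, *_ = np.linalg.lstsq(Z, A, rcond=None)       # (p x q) estimates
    resid = A - Z @ B
    s2 = (resid ** 2).sum(axis=0) / (n - p)          # residual variance per source
    var_k = np.linalg.inv(Z.T @ Z)[k, k]
    t = B[k] / np.sqrt(s2 * var_k)                   # t_j = b_jk / s(b_jk)
    pvals = 2.0 * stats.t.sf(np.abs(t), df=n - p)
    return t, pvals

# Simulated weights (hypothetical): q = 6 sources, n = 28 spectra.
rng = np.random.default_rng(5)
n, q = 28, 6
y1 = rng.uniform(0.0, 100.0, n)
y2 = rng.uniform(0.0, 100.0, n)
Z = np.column_stack([np.ones(n), y1, y2])
A = rng.normal(0.0, 0.05, size=(n, q))               # columns a_1 ... a_6, pure noise
A[:, 1] += 0.02 * y1                                 # only this source reacts to y1

t, pvals = coef_pvalues(Z, A, k=1)                   # k = 1: coefficient of y1
threshold = 0.05 / q                                 # Bonferroni-corrected level
selected = pvals < threshold
print(np.flatnonzero(selected))
```

The Bonferroni threshold 0.05/q controls the family-wise error rate over the q tests at the price of some power, which is acceptable here because q is small.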
21 Step 2: Biomarker identification, example
We search for biomarkers for the dose of citrate (yk = y1) ⇒ we test each of the 6 βj1.
[Figure: p-values per source, with threshold α = 0.05/6; p-values shown: 9.18 x 10^-13, 1.84 x 10^-15, 2.86 x 10^-31]
22 Step 3: Comparison of the intensities in biomarkers
- Goal: comparison of the effects on the biomarker caused by different changes in yk.
- Choose 3 or more values of yk:
  - yk1: a first value, the reference value of yk
  - yk2: a new value of interest of yk
  - yk3: a second new value of interest of yk
- Compute
  - the effect on the biomarker of the change of yk from yk1 to yk2: $C_1 = S \beta_k (y_{k2} - y_{k1})$
  - the effect on the biomarker of the change of yk from yk1 to yk3: $C_2 = S \beta_k (y_{k3} - y_{k1})$
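A short sketch of the two contrasts, with S taken as the (m x r) matrix of selected sources and βk as the vector of the r estimated coefficients of yk. The random sources, coefficient values and levels of yk are placeholders, not results from the study.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 600, 2
S_sel = rng.normal(size=(m, r))         # placeholder for the r selected sources (m x r)
beta_k = np.array([0.02, -0.01])        # placeholder coefficients of y_k, one per selected source

yk1, yk2, yk3 = 0.0, 50.0, 100.0        # reference value and two values of interest

C1 = S_sel @ beta_k * (yk2 - yk1)       # expected spectral change from yk1 to yk2, length m
C2 = S_sel @ beta_k * (yk3 - yk1)       # expected spectral change from yk1 to yk3, length m
print(C1.shape, C2.shape)
```

Because the model is linear in yk, C2 is simply C1 rescaled by (yk3 - yk1) / (yk2 - yk1); the contrasts differ in amplitude, not in shape.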
23 Step 3: example
[Figure: comparison for the citrate values yk1, yk2, yk3, yk4]
24 Conclusions
- With the presented methodology combining ICA with statistical modeling,
  - we visualize the independent metabolites contained in the studied biofluid (through the sources) and their quantity (through the mixing weights);
  - we identify biomarkers, i.e. spectral regions changing according to a chosen factor, by a selection of sources;
  - we compare the effects on these spectral biomarkers caused by different changes of this factor.
- In comparison with PCA, ICA gives more biologically meaningful and natural representations of these data.
25 Thank you for your attention
26 Example 2: the data
- 18 spectra of 600 values
- 1 characteristic in Y
- X (18 x 600), Y (18 x 1)
- y1: disease group of the rat (qualitative)
  - Group 1: disease 1; Group 2: disease 2; Group 3: no disease
- We want biomarkers for the disease group described in y1 ⇒ a model with qualitative covariates
27 Example 2: Part I. Dimension reduction by ICA
$X^{T}_{C} = S A^{T}$, with S (600 x 5) and $A^{T}$ (5 x 18)
28 Example 2: Part II. Biomarker discovery through statistical modeling
- Step 1: Fit a model on $A^{T}$. Model with only one categorical covariate with fixed effects (one-way ANOVA): $a_j = Z_1 \beta_j + \epsilon_j$
- Step 2: Biomarker identification
  - For each of the q recovered sj, test the effect of y1 ⇒ Fj statistic ⇒ pj
  - Bonferroni correction: select, in an (m x r) matrix S, the r sources with $p_j < 0.05/q$
  - p-values shown: 0.009797431, 0.0002412604, 0.005710213
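For a single categorical covariate, the F statistic of the one-way ANOVA can be computed directly from the between-group and within-group sums of squares. The sketch below assumes SciPy's `scipy.stats.f.sf` for the p-value; the weights and the split into three groups of six are simulated assumptions (the slides give only the total of 18 spectra).

```python
import numpy as np
from scipy import stats

def anova_pvalue(a, groups):
    """One-way ANOVA F statistic and p-value for weights a and group labels."""
    a = np.asarray(a, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, g = a.size, labels.size
    grand = a.mean()
    ss_between = sum(a[groups == lab].size * (a[groups == lab].mean() - grand) ** 2
                     for lab in labels)
    ss_within = sum(((a[groups == lab] - a[groups == lab].mean()) ** 2).sum()
                    for lab in labels)
    F = (ss_between / (g - 1)) / (ss_within / (n - g))
    return float(F), float(stats.f.sf(F, g - 1, n - g))

# Simulated weights of one source for 18 rats (hypothetical group sizes and means).
rng = np.random.default_rng(7)
groups = np.repeat([1, 2, 3], 6)
a_j = rng.normal(0.0, 0.1, 18) + np.where(groups == 3, 0.0, 0.5)

F_j, p_j = anova_pvalue(a_j, groups)
q = 5
is_biomarker = p_j < 0.05 / q            # Bonferroni-corrected decision for this source
print(F_j, p_j, is_biomarker)
```

The same function would be applied to each of the q weight vectors, and the sources whose p-value passes the corrected threshold would be kept in S.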
29 Step 3: Comparison of the intensities in biomarkers
- Goal: comparison of the effects on the biomarker caused by different changes in yk.
- Choose 3 or more values of yk:
  - yk1: a first value, the reference value of yk
  - yk2: a new value of interest of yk
  - yk3: a second new value of interest of yk
- Compute
  - the effect on the biomarker of the change of yk from yk1 to yk2: $C_1 = S \beta_k (y_{k2} - y_{k1})$
  - the effect on the biomarker of the change of yk from yk1 to yk3: $C_2 = S \beta_k (y_{k3} - y_{k1})$
30 Step 3: Comparison of the intensities in biomarkers
Goal: comparison of the effects on the biomarker caused by the changes of group.
[Figure: label citrate]
32 Example 1: Step 4, comparison of the intensities in biomarkers
- For a first chosen value of interest of yk (not necessarily observed), $y_{k0}$:
  - choose a reference value for the other factors, $y^{0}_{\neq k}$
  - ⇒ $Z_{01}$: the vector of values $(1, y_{k0}, y^{0}_{\neq k})$
- For each of the r sources selected in S, use the model to predict the weights for the biomarkers:
  - $\hat{a}_j(y_{k0}) = b_j Z_{01}$
  - ⇒ $\hat{a}(y_{k0})$: a vector of r new weights
- Reconstruction of the values to expect in the biomarkers: $S \hat{a}(y_{k0})$
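The prediction and reconstruction steps amount to two matrix products. In this sketch the selected sources, the fitted coefficients and the chosen covariate values are placeholders introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 600, 2
S_sel = rng.normal(size=(m, r))          # placeholder for the r selected sources (m x r)

# Placeholder fitted coefficients: column j holds (b_j0, b_j1, b_j2) for selected source j.
B = np.array([[0.50, 0.10],
              [0.02, -0.01],
              [0.00, 0.03]])

yk0 = 50.0                                # chosen value of the covariate of interest
y_other = 20.0                            # reference value for the other factor
z01 = np.array([1.0, yk0, y_other])       # vector (1, y_k0, y0 of the other factor)

a_hat = z01 @ B                           # r predicted weights, one per selected source
x_hat = S_sel @ a_hat                     # reconstructed values expected in the biomarkers
print(a_hat, x_hat.shape)
```

Repeating this for several values of yk0, with the other factors held at their reference value, gives reconstructed spectra that can be compared directly.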
34 Example 2: the reconstructed spectra