Title: Christine Steinhoff
1PICB Group Meeting, Shanghai, 18th Oct. 2007
Ongoing Projects: Visualization of Multivariate Data and Exploration of DNA Methylation
Christine Steinhoff
2OUTLINE
- Visualization of Multivariate Data
- Principal Component Analysis
- Correspondence Analysis
- Multiple Correspondence Analysis with Supplementary Data
- Can we find Master Regulation Regions in Terms of DNA Methylation?
- What is DNA Methylation?
- What is DNA Methylation good for?
- DNA Methylation and Tissue Specific Expression
- Deriving Scores for Tissue Specific Expression Switches
- Exploration of Candidate Regions with respect to Genomic Features
- Correlation of Candidate Regions with Evolutionary Models and Methylation Prediction
3OUTLINE
- Visualization of Multivariate Data
- Principal Component Analysis
- Correspondence Analysis
- Multiple Correspondence Analysis with Supplementary Data
Cooperation with Matteo Pardo, CNR-INFM Sensor Lab, Brescia, Italy
4Idea of PCA
We often do not know which measurements best reflect the dynamics of our system.
PCA should reveal them; in this example, the dynamics are along the x axis.
http://www.snl.salk.edu/shlens/pub/notes/pca.pdf
5Back to Biology
Matrix X: rows are patients 1, ..., n; columns are genes x1, ..., xp.
The dimension is high. So how can we reduce it?
Simplest way: take the first one, two, or three genes, plot them, and discard the rest. Obviously a very bad idea.
6How can we transform our data appropriately ?
Assume we can manipulate X a bit. Let's call the result Y.
Y should be manipulated in such a way that it is a bit more optimal than X was.
What does optimal mean? It means that in Cov(Y):
- the off-diagonal terms (covariances) are SMALL
- the diagonal terms (variances) are LARGE
In other words, Cov(Y) should be diagonal, with large values on the diagonal.
The principal components of X are the eigenvectors of Cov(X).
7Why Covariance small and Variance high ?
- The diagonal terms of Cov(X) are the variances of the genes across patients
- The off-diagonal terms of Cov(X) are the covariances between gene vectors
- Cov(X) captures the correlations between all possible pairs of measurements
- In the diagonal terms, large values correspond to interesting dynamics
- In the off-diagonal terms, large values correspond to high redundancy
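The reading of Cov(X) above can be checked on a small toy example. This is only a sketch: the three "genes" are simulated, hypothetical data, with genes in rows and patients in columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 200

# Hypothetical toy data: rows are genes (variables), columns are patients.
g1 = rng.normal(0.0, 3.0, n_patients)        # high-variance gene
g2 = g1 + rng.normal(0.0, 0.1, n_patients)   # nearly redundant copy of g1
g3 = rng.normal(0.0, 0.5, n_patients)        # low-variance, independent gene
X = np.vstack([g1, g2, g3])

# np.cov treats rows as variables and normalizes by n - 1.
cov = np.cov(X)

print("variances (diagonal):", np.diag(cov))
print("cov(g1, g2), redundant pair:", cov[0, 1])
print("cov(g1, g3), independent pair:", cov[0, 2])
```

The diagonal reproduces the per-gene variances, the redundant pair shows a large off-diagonal value, and the independent pair a value close to zero.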
8How to choose the transformation?
Find an orthonormal matrix P such that
$Y = PX$
with Cov(Y) diagonalized.
Then the rows of P are the principal components of X.
Let D be the diagonal matrix containing the eigenvalues of $XX^T$, and let E be the corresponding matrix containing the eigenvectors of $XX^T$ as columns.
Now I claim that this definition does the job: $P = E^T$.
This can be calculated ...
9How do we get there?
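$\mathrm{Cov}(Y) = \frac{1}{n-1} YY^T = \frac{1}{n-1} (PX)(PX)^T = \frac{1}{n-1} P (XX^T) P^T = \frac{1}{n-1} P A P^T$
with $A = XX^T$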
10How do we get there?
A is symmetric. Therefore there is a matrix E of eigenvectors and a diagonal matrix D such that
$A = E D E^T$
Now define P to be the transpose of the matrix E of eigenvectors: $P = E^T$
Then we can write
$A = P^T D P$
11How do we get there?
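Now we can go back to our covariance expression:
$\mathrm{Cov}(Y) = \frac{1}{n-1} P A P^T = \frac{1}{n-1} P (P^T D P) P^T = \frac{1}{n-1} (P P^T) D (P P^T)$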
12How do we get there?
The inverse of an orthogonal matrix is its transpose (by definition): $P^{-1} = P^T$
In our context that means
$\mathrm{Cov}(Y) = \frac{1}{n-1} (P P^{-1}) D (P P^{-1}) = \frac{1}{n-1} D$
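The derivation can be verified numerically. The following is a minimal sketch on simulated data, assuming the convention used in Y = PX: genes (variables) in rows, patients in columns, each gene centered. The mixing matrix M and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 300                           # p genes (rows), n patients (columns)

M = rng.normal(size=(p, p))             # hypothetical mixing matrix
X = M @ rng.normal(size=(p, n))         # correlated toy data
X = X - X.mean(axis=1, keepdims=True)   # center each gene

A = X @ X.T                             # symmetric p x p matrix
eigvals, E = np.linalg.eigh(A)          # A = E diag(eigvals) E^T
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, E = eigvals[order], E[:, order]

P = E.T                                 # rows of P are the principal components
Y = P @ X                               # transformed data
cov_Y = (Y @ Y.T) / (n - 1)             # should equal D / (n - 1)

print(np.round(cov_Y, 6))
```

The printed matrix is diagonal up to rounding, with the eigenvalues of A divided by n - 1 on the diagonal, in decreasing order.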
13Some Remarks
- If you multiply one variable by a scalar, you get different results
- This is because PCA uses the covariance matrix (and not the correlation matrix)
- PCA should be applied to data that have approximately the same scale in each variable
- The relative variance explained by each PC is given by eigenvalue / sum(eigenvalues)
- When to stop? For example: keep enough PCs so that the cumulative variance explained by the PCs is > 50-70%
- Kaiser criterion: keep PCs with eigenvalues > 1
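The two stopping rules from the remarks can be sketched as follows. The eigenvalues are hypothetical, and the helper names are illustrative only. Note that the Kaiser criterion is usually applied to eigenvalues of standardized variables.

```python
import numpy as np

def explained_variance(eigenvalues):
    """Return sorted eigenvalues, their relative share, and the cumulative share."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratio = ev / ev.sum()
    return ev, ratio, np.cumsum(ratio)

def n_components_for(cumulative, threshold):
    """Smallest number of PCs whose cumulative share reaches the threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

eigenvalues = [2.5, 1.2, 0.2, 0.1]        # hypothetical eigenvalues
ev, ratio, cumulative = explained_variance(eigenvalues)

k_70 = n_components_for(cumulative, 0.70) # cumulative variance rule (70%)
k_kaiser = int((ev > 1.0).sum())          # Kaiser criterion: eigenvalue > 1

print("relative variance:", ratio)
print("PCs for 70%:", k_70, "| PCs by Kaiser:", k_kaiser)
```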
14Some Remarks
15Some Remarks
If variables have very heterogeneous variances, we standardize them.
The standardized variables: $X_i^{*} = (X_i - \mathrm{mean}(X_i)) / \sqrt{\mathrm{variance}(X_i)}$
The new variables all have the same variance, so each variable has the same weight.
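A short sketch of why standardization matters, on simulated data (both variables and their scales are hypothetical). Without standardization, the first principal component is dominated by the variable with the larger scale. After standardization, the covariance matrix equals the correlation matrix.

```python
import numpy as np

def standardize(X):
    """Standardize each variable (row) to mean 0 and variance 1."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0.0, 1.0, 100),      # variable on a small scale
    rng.normal(0.0, 1000.0, 100),   # variable on a much larger scale
])

# Leading eigenvector of the covariance matrix of the raw data
_, vecs = np.linalg.eigh(np.cov(X))
first_pc_raw = vecs[:, -1]          # eigh returns eigenvalues in ascending order

Z = standardize(X)

print("first PC (raw data):", first_pc_raw)
print("variances after standardization:", Z.var(axis=1, ddof=1))
```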
16REMARKS
- PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low-variance features
- PCA is only powerful if the biological question is related to the highest variance in the dataset
17Example 2
18Yang et al.
- Transforming growth factor-beta (TGF-beta)
- TGF-beta is a potent inducer of growth arrest in many cell types, including epithelial cells.
- This activity is the basis of the tumor suppressor role of the TGF-beta signaling system in carcinomas.
- It can also contribute to cancer progression.
- It has special relevance in mesenchymal differentiation, including bone development.
- Deregulated expression or activation of components of this signaling system can contribute to skeletal diseases, e.g. osteoarthritis.
19Yang et al.
Stock 1 (T): constitutively active tkv receptor
Stock 2 (B): constitutively active babo receptor
Experiments: T1, T2, T3; B1, B2, B3; Contr1, Contr2, Contr3
Genes: x1, ..., xp
20Yang et al.
21Yang et al.
tkv
Babo
Control
22What are the drawbacks in PCA?
But: we only see the different experiments.
If we do it the other way round, that is, analyse the genes rather than the experiments, we see a grouping of genes.
But we never see both together.
So, can we somehow relate the experiments and the genes? That is, can we group genes whose expression might be explained by the respective experimental group (tkv, babo, control)?
This leads to correspondence analysis.
23Idea of Correspondence Analysis
We cannot see genes and patients at the same time, essentially because rows and columns do not have the same scale.
We want to find something such that our picture looks like the one here, and such that we can define a distance between tkv1 and gene blah.
[Figure: joint plot with points labeled "Gene blah" and "Tkv 1"]
24Idea of Correspondence Analysis
Correspondence analysis can be understood as an extension of PCA in which the variance is replaced by an inertia proportional to the chi-square distance of the table from independence.
PCA: X -> A = X t(X) (Cov(X) up to the factor 1/(n-1)); decompose A = V D t(V)
CA: X -> Z (chi-square distance); A = Z t(Z) (proportional to Cov(Z)); decompose A = U D t(U), which corresponds to Z = U sqrt(D) t(V)
Use the chi-square distance instead of the covariance.
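The CA scheme above can be sketched with the usual SVD formulation: Z is taken as the matrix of standardized residuals from independence, and its SVD gives coordinates for rows and columns in the same space. This assumes a table of non-negative entries with positive row and column sums. The table N and the function name are hypothetical, and this is not necessarily the exact variant used for the slides.

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a non-negative table via SVD of the standardized residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Z: deviation from independence, scaled in the chi-square metric
    Z = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    rows = (U * s) / np.sqrt(r)[:, None]     # row principal coordinates
    cols = (Vt.T * s) / np.sqrt(c)[:, None]  # column principal coordinates
    return rows, cols, s ** 2                # s**2: principal inertias

# Hypothetical table: 4 rows (e.g. genes) x 3 columns (e.g. experiments)
N = np.array([[20, 5, 2],
              [4, 18, 3],
              [3, 6, 25],
              [10, 9, 11]])
rows, cols, inertia = correspondence_analysis(N)

print("principal inertias:", inertia)
print("row coordinates (first two axes):\n", rows[:, :2])
print("column coordinates (first two axes):\n", cols[:, :2])
```

The total inertia equals the chi-square statistic of the table divided by its grand total, which is the sense in which CA replaces variance by chi-square distance from independence.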
25Application in Bioinformatics
26Still Problems???
Let's assume we have different datasets: not only expression measurements but also aCGH data, ...
This we call multivariate data.
IDEA of MULTIPLE CORRESPONDENCE ANALYSIS
27Multiple Correspondence Analysis
28Multiple Correspondence Analysis
29Multiple Correspondence Analysis
Integrate this information as well????
30Multiple Correspondence Analysis
- We derive this directly from our biological application
- Multivariate data (expression, aCGH)
- Supplementary information (patient information)
31Data Sources
32Can we explore different sources in ONE picture?
Genomic (GC content, strand, ...), TFBS, Expression, GO categories
33What do we want ?
[Figure: data blocks over the same n genes (rows): Expr X, aCGH Y, Patients data, Genomic data, with p1, p2, p3, p4 columns]
34Concept of Indicator matrix
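Expression values are discretized into the categories -1, 0, 1.
[Figure: indicator matrix for the Expr categories X, with n rows (G1, G2, G3, ...) and p1 columns]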
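The indicator coding can be sketched as follows. The thresholds, the toy expression values, and the function names are assumptions for illustration. Each variable is first discretized into -1, 0, 1 and then expanded into one 0/1 column per category.

```python
import numpy as np

def discretize(values, low, high):
    """Map continuous values to -1 (below low), +1 (above high), else 0."""
    values = np.asarray(values, dtype=float)
    return np.where(values < low, -1, np.where(values > high, 1, 0))

def indicator_matrix(categories, levels=(-1, 0, 1)):
    """Expand an (n x p) category matrix into n x (p * len(levels)) 0/1 columns.

    Columns belonging to the same variable are adjacent, in the order of `levels`.
    """
    categories = np.asarray(categories)
    blocks = [(categories == lvl).astype(int) for lvl in levels]
    return np.stack(blocks, axis=2).reshape(categories.shape[0], -1)

# Hypothetical log-ratios: 3 genes (rows) x 2 experiments (columns)
expr = np.array([[-2.0, 0.1],
                 [0.3, 1.5],
                 [1.2, -0.2]])
cats = discretize(expr, low=-1.0, high=1.0)   # hypothetical thresholds
Z = indicator_matrix(cats)

print(cats)
print(Z)
```

Every row of Z contains exactly one 1 per original variable, so the ordering of the categories is no longer encoded.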
35All matrices in indicator writing, then append
[Figure: appended indicator matrices Expr X, aCGH Y, Cat, Genomic G, with p1, p2, p3, p4 columns]
Now define
36- Disadvantages
  - Losing ordering
  - Discretization: loss of information
- Advantages
  - Get rid of scaling and distribution problems
  - As many datasets as memory allows
  - Combining very different data sources
37Concept of supplementary Information
- Averaging the respondent row points in principal coordinates is equivalent to appending the cross-tabulations t(Z*) Z (of the supplementary variable with the active variables) as supplementary rows of Z, where Z* denotes the indicator matrix of the supplementary variable.
- This gives the same numerical coordinates as appending t(Z*) Z as supplementary rows to the Burt matrix t(Z) Z.
- Or, since the Burt matrix is symmetric, one can append the transposed cross-tabulations t(Z) Z* to C as supplementary columns.
- To illustrate the calculations, suppose that C* denotes the latter cross-tabulations t(Z) Z* (stacked vertically) appended as columns to the Burt matrix, with general element c*_ij and column sums c*_.j.
- Now the positions of the supplementary columns C* are obtained by weighted averaging, using the elements c*_ij relative to the column sums c*_.j as weights.
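The weighted averaging for supplementary columns can be sketched as follows. The assumption here is the standard CA transition formula: a supplementary column is placed at the average of the row standard coordinates of the Burt matrix, weighted by the column's entries divided by its column sum. All data and function names are hypothetical.

```python
import numpy as np

def indicator(codes, n_levels):
    """0/1 indicator matrix with one column per level of one variable."""
    codes = np.asarray(codes)
    return (codes[:, None] == np.arange(n_levels)[None, :]).astype(float)

def ca_coordinates(table):
    """Row standard and column principal coordinates of a CA of `table`."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_standard = U / np.sqrt(r)[:, None]
    col_principal = (Vt.T * s) / np.sqrt(c)[:, None]
    return row_standard, col_principal

def supplementary_columns(C_sup, row_standard):
    """Weighted average of row coordinates with weights c_ij / c_.j."""
    weights = C_sup / C_sup.sum(axis=0, keepdims=True)
    return weights.T @ row_standard

# Hypothetical toy data: 8 genes, two active variables, one supplementary
A = [0, 0, 1, 1, 2, 2, 0, 1]
B = [0, 1, 1, 2, 2, 0, 0, 2]
S_var = [0, 0, 0, 1, 1, 1, 0, 1]

Z = np.hstack([indicator(A, 3), indicator(B, 3)])  # active indicator matrix
Z_sup = indicator(S_var, 2)                        # supplementary indicator
burt = Z.T @ Z                                     # Burt matrix t(Z) Z
row_std, col_pr = ca_coordinates(burt)

C_sup = Z.T @ Z_sup                                # cross-tabulations t(Z) Z*
sup_coords = supplementary_columns(C_sup, row_std[:, :2])
print(sup_coords)                                  # positions on the first two axes
```

As a sanity check, passing an active block of the Burt matrix through the same function reproduces the principal coordinates of those active categories, so supplementary and active points share one coordinate system.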
38RESULTS Integrative analysis of different data
types
39Acknowledgement
Martin Vingron
Matteo Pardo
40Comparative Analysis of DNA Methylation and
Tissue Specific Expression
Christine Steinhoff
41OUTLINE
- Tissue specific expression in health
- Two models for discovering highly variable gene expression regions
- Definition of candidates
- Genomic features of these areas
- Combination with methylation features
42Variable gene expression regions
Definition of Variable Regions
- Variability: change in expression
  - locally, that is, along the chromosome
  - between tissues
Restrict them based on methylation indication
Explore
43DATA
http://symatlas.gnf.org/SymAtlas/
44Expression DATA
Mouse Expression: 36,182 probesets
Human Expression: 22,283 probesets
Orthologous genes: Human 13,576 probesets, Mouse 11,518 probesets
45Expression DATA
Mouse Expression: 122 hybridizations
Human Expression: 158 hybridizations
Clearly matching tissue annotation (averaged over repeated measurements): 22 tissues
Adrenal gland, Amygdala, Bone marrow, Cerebellum, Heart, Hypothalamus, Kidney, Liver, Lung, Lymph node, Ovary, Pancreas, Pituitary, Placenta, Prostate, Salivary gland, Skeletal muscle, Testis, Thymus, Thyroid, Trachea, Uterus
46Definition of Variance Score
[Figure: window w_i on chromosome j; local variance]
47Deriving a Variance Score
Local Variance in expression
Variance across tissues Variance score
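The exact formula of the variance score is not preserved in this transcript. The sketch below shows one possible reading, stated as an assumption: for each sliding window of neighbouring genes, compute the local variance of expression within the window separately for each tissue, then take the variance of these local variances across tissues. Window size, data, and function name are hypothetical.

```python
import numpy as np

def var_var_score(expr, window):
    """One possible 'variance of local variance' score (an assumption).

    expr: genes (in chromosomal order) x tissues
    window: number of consecutive genes per window
    Returns one score per sliding window position.
    """
    expr = np.asarray(expr, dtype=float)
    n_genes = expr.shape[0]
    scores = []
    for start in range(n_genes - window + 1):
        block = expr[start:start + window]       # window x tissues
        local_var = block.var(axis=0, ddof=1)    # local variance per tissue
        scores.append(local_var.var(ddof=1))     # variance across tissues
    return np.array(scores)

# Hypothetical toy data: 6 genes x 3 tissues, flat except in tissue 0
expr = np.zeros((6, 3))
expr[2:5, 0] = [3.0, -3.0, 3.0]

scores = var_var_score(expr, window=3)
print(scores)   # the window covering genes 2..4 scores highest
```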
48Variance Score Genomewide
Variance Score
49Second Variance Score Linear Model
Gene 1, gene 2, gene 3, gene 4, gene 5
50Second Variance Score Linear Model
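For the i-th gene in the j-th window in the k-th tissue, we can write the expression value as the sum of an overall mean, a window effect, a tissue effect, a window-tissue interaction effect, and an error term.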
51Second Variance Score Linear Model
Analysis of variance for the 3-fold classification with (5 genes) x (number of windows) x (number of tissues) observations
52Second Variance Score Linear Model
- Estimate window effect, tissue effect, and interaction effect
- Scale
- Maximize the three scores simultaneously using the L2 norm
This gives, for each pair (window, tissue), a score that measures variability.
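The steps above can be sketched for a balanced design, where the least-squares effect estimates are simple differences of means. How the effects were scaled on the slides is not stated. Dividing each effect type by its standard deviation is an assumption made here, as are the function name and the toy data.

```python
import numpy as np

def window_tissue_scores(x):
    """Two-way layout with replication: genes within (window, tissue) cells.

    x: array of shape (genes_per_window, n_windows, n_tissues)
    Returns window, tissue, interaction effects and an L2 score per cell.
    """
    x = np.asarray(x, dtype=float)
    grand = x.mean()
    cell = x.mean(axis=0)                        # (windows, tissues)
    window_eff = cell.mean(axis=1) - grand       # one value per window
    tissue_eff = cell.mean(axis=0) - grand       # one value per tissue
    inter_eff = cell - grand - window_eff[:, None] - tissue_eff[None, :]

    def scale(e):
        # Assumption: scale each effect type by its standard deviation
        sd = e.std()
        return e / sd if sd > 0 else np.zeros_like(e)

    w, t, wt = scale(window_eff), scale(tissue_eff), scale(inter_eff)
    score = np.sqrt(w[:, None] ** 2 + t[None, :] ** 2 + wt ** 2)
    return window_eff, tissue_eff, inter_eff, score

# Hypothetical noise-free data: 5 genes, 3 windows, 2 tissues
mu = 5.0
a = np.array([1.0, -1.0, 0.0])                   # window effects
b = np.array([2.0, -2.0])                        # tissue effects
g = np.array([[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]])  # interaction effects
x = np.broadcast_to(mu + a[:, None] + b[None, :] + g, (5, 3, 2))

window_eff, tissue_eff, inter_eff, score = window_tissue_scores(x)
print(score)   # one score per (window, tissue) pair
```

On noise-free data the estimated effects reproduce the effects used to build x, which is a quick check that the decomposition is set up as intended.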
53Distribution
[Figure: distributions of window effects, tissue effects, interaction effects, and the L2 norm]
54top scoring Examples
Pancreas
SEMA4F HK2 TACR1 MRPL19 GCF LRRTM4 REG1B REG1A REG3A CTNNA2 SUCKG1 Q9H5E1
Relative expression
55top scoring Examples
RPS18 NM_003782 WDR46 HKE2 RGL2 TAPBP ZNF297 DAXX
KIFC1 PHF1
Relative expression
56Sequence Features within genes GC content
Top 50 windows
57Sequence Features within genes GC content
58Sequence Features within genes SINEs, LINEs
59Example
NP_006240.3 PRB4 NP_955385.1 Q7M4Q5 ETV6 BCL2L14 LRP6 MANSC1 CREBL2 GPR19
Salivary Gland Trachea
Relative expression
60Training SVM on epigenetic data for CpG islands
on human Chromosome 21 and 22
Identify informative DNA attributes that correlate with open versus compact chromatin structures
Use these attributes to predict the epigenetic states of all CpG islands genome-wide
61Methylation Prediction
[Figure: panels for Variance Score (Var-Var Score) and L2 Score (L2 Norm of win, tis, win-tis); axis: Frequency]
62Methylation Prediction
Window score only
Tissue score only
Interaction score only (win-tis)
63Summary
Varvar Score L2Norm Score
GC GC SINE LINE Meth pred
?
?
64Acknowledgement
Martin Vingron Szymon Kielbasa Peter Arndt
Hans-Peter Lenhof Martina Paulsen Jörn Walter
Christoph Bock
65Thank you for your attention