Christine Steinhoff - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Christine Steinhoff

1
PICB Group Meeting, Shanghai, 18th Oct. 2007
Ongoing Projects: Visualization of Multivariate Data
and Exploration of DNA Methylation
Christine Steinhoff
2
OUTLINE

Visualization of Multivariate Data
Principal Component Analysis
Correspondence Analysis
Multiple Correspondence Analysis with
Supplementary Data
Can we find Master Regulation Regions in Terms of
DNA Methylation?
What is DNA Methylation?
What is DNA Methylation good for?
DNA Methylation and Tissue Specific Expression
Deriving Scores for Tissue Specific Expression
Switches
Exploration of Candidate Regions with respect to
Genomic Features
Correlation of Candidate Regions with
Evolutionary Models and Methylation Prediction

3
OUTLINE

Visualization of Multivariate Data
Principal Component Analysis
Correspondence Analysis
Multiple Correspondence Analysis with
Supplementary Data

Cooperation work with Matteo Pardo, CNR-INFM
Sensor Lab, Brescia, Italy
4
Idea of PCA
We often do not know which measurements best
reflect the dynamics in our system.
So PCA should reveal: the dynamics are along
the x axis.
http://www.snl.salk.edu/shlens/pub/notes/pca.pdf
5
Back to Biology
Matrix X: genes x1, ..., xp measured in patients 1, ..., n
The dimension is high. So how can we reduce the
dimension?
Simplest way: take the first one, two, three genes,
plot them and discard the rest.
Obviously a very bad idea.
6
How can we transform our data appropriately?
Assume we can manipulate X a bit. Let's call
the result Y.
Y should be manipulated in such a way that it
is a bit more optimal than X was.
What does optimal mean? It means that in Cov(Y)
the off-diagonal entries (covariances) are SMALL
and the diagonal entries (variances) are LARGE.
In other words, Cov(Y) should be diagonal, with
large values on the diagonal.
The principal components of X are the
eigenvectors of Cov(X).
7
Why Covariance small and Variance high ?

The diagonal terms of Cov(X) are the variances
of the genes across patients.
The off-diagonal terms of Cov(X) are the
covariances between gene vectors.
Cov(X) captures the correlations between all
possible pairs of measurements.
In the diagonal terms, large values correspond to
interesting dynamics.
In the off-diagonal terms, large values correspond
to high redundancy.

8
How to choose the transformation?
Find an orthonormal P such that
$Y = P X$
with Cov(Y) diagonalized.
Then the rows of P are the principal components
of X.
Let D be the diagonal matrix containing the
eigenvalues of $X X^t$, and let E be the
corresponding matrix containing the eigenvectors
of $X X^t$.
Now I claim that this definition does the job,
that means $P = E^t$.
This can be calculated ...
9
How do we get there?
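$\mathrm{Cov}(Y) = \frac{1}{n-1} Y Y^t = \frac{1}{n-1} (PX)(PX)^t = \frac{1}{n-1} P X X^t P^t = \frac{1}{n-1} P A P^t$
with $A = X X^t$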
10
How do we get there?
A is symmetric. Therefore there is a matrix E of
eigenvectors and a diagonal matrix D such that
$A = E D E^t$
Now define P to be the transpose of the matrix E
of eigenvectors: $P = E^t$
Then we can write $A = P^t D P$
11
How do we get there?
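Now we can go back to our covariance expression:
$\mathrm{Cov}(Y) = \frac{1}{n-1} P A P^t = \frac{1}{n-1} P (P^t D P) P^t = \frac{1}{n-1} (P P^t) D (P P^t)$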
12
How do we get there?
The inverse of an orthogonal matrix is its
transpose (due to its definition).
In our context that means $P^{-1} = P^t$, so
$\mathrm{Cov}(Y) = \frac{1}{n-1} (P P^{-1}) D (P P^{-1}) = \frac{1}{n-1} D$
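The derivation on these slides can be checked numerically on a toy case. The sketch below is only an illustration, not the code behind the talk: it assumes just two genes (so that the eigenvectors of the 2x2 covariance matrix follow from a closed-form rotation angle instead of a general eigen solver), and the expression values and all function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Centre each row (one row per gene) and return (centred, 1/(n-1) X X^t)."""
    n = len(rows[0])
    centred = []
    for r in rows:
        m = sum(r) / n
        centred.append([v - m for v in r])
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centred]
           for ri in centred]
    return centred, cov

def principal_axes_2x2(cov):
    """Rows of P = eigenvectors of a symmetric 2x2 matrix, largest eigenvalue first."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    theta = 0.5 * math.atan2(2 * b, a - c)  # rotation angle that diagonalizes cov
    return [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]

def project(P, X):
    """Matrix product Y = P X for lists of rows."""
    return [[sum(P[i][j] * X[j][k] for j in range(len(X)))
             for k in range(len(X[0]))] for i in range(len(P))]

# Hypothetical expression values: 2 genes (rows) in 5 patients (columns).
X = [[2.0, 4.0, 6.0, 8.0, 10.0],
     [1.0, 3.0, 2.0, 6.0, 5.0]]

centred, cov_x = covariance_matrix(X)
P = principal_axes_2x2(cov_x)
Y = project(P, centred)
_, cov_y = covariance_matrix(Y)

print("Cov(X):", cov_x)
print("Cov(Y):", cov_y)  # off-diagonal entries vanish up to rounding
```

For more than two genes the same idea needs a numerical eigen decomposition of Cov(X); the closed-form angle is used here only to keep the example self-contained.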
13
Some Remarks

If you multiply one variable by a scalar, you get
different results.
This is because PCA uses the covariance matrix
(and not the correlation matrix).
PCA should be applied to data that have
approximately the same scale in each variable.
The relative variance explained by each PC is
given by eigenvalue/sum(eigenvalues).
When to stop? For example: enough PCs to have a
cumulative variance explained by the PCs that is
> 50-70%.
Kaiser criterion: keep PCs with eigenvalues > 1.
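The two stopping rules above can be written down in a few lines. This is a minimal sketch with hypothetical eigenvalues and function names; note that the Kaiser criterion is normally applied to eigenvalues of the correlation matrix (standardized variables), where the average eigenvalue is 1.

```python
def explained_variance_ratio(eigenvalues):
    """Relative variance explained by each PC: eigenvalue / sum(eigenvalues)."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def n_components_for(eigenvalues, threshold):
    """Smallest number of leading PCs whose cumulative explained variance reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(ordered), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(ordered)

def kaiser(eigenvalues):
    """Kaiser criterion: keep only eigenvalues larger than 1."""
    return [ev for ev in eigenvalues if ev > 1.0]

# Hypothetical eigenvalues of a 4-variable correlation matrix.
eigenvalues = [2.5, 1.2, 0.2, 0.1]

print(explained_variance_ratio(eigenvalues))
print(n_components_for(eigenvalues, 0.7))
print(kaiser(eigenvalues))
```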

14
Some Remarks
15
Some Remarks
If variables have very heterogeneous variances, we
standardize them. The standardized variables are
$X_i^* = (X_i - \text{mean}) / \sqrt{\text{variance}}$
The new variables all have the same variance, so
each variable has the same weight.
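As a small illustration of this standardization, the sketch below rescales two hypothetical genes measured on very different scales; it assumes the sample variance (denominator n-1) is meant, and the variable names are made up.

```python
import statistics

def standardize(values):
    """Return (x - mean) / sqrt(variance) for every value, using the sample variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # square root of the sample variance
    return [(v - mean) / sd for v in values]

# Hypothetical genes with very heterogeneous variances.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [100.0, 300.0, 200.0, 400.0]

for name, gene in (("gene_a", gene_a), ("gene_b", gene_b)):
    z = standardize(gene)
    print(name, statistics.mean(z), statistics.variance(z))
```

After standardization both genes have mean 0 and variance 1 (up to rounding), so neither dominates the covariance matrix.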
16
REMARKS

PCA is useful for finding new, more informative,
uncorrelated features; it reduces dimensionality
by rejecting low-variance features.
PCA is only powerful if the biological question
is related to the highest variance in the dataset.

17
Example 2
18
Yang et al.

Transforming growth factor-beta
TGF-beta is a potent inducer of growth arrest in
many cell types, including epithelial cells.
This activity is the basis of the tumor
suppressor role of the TGF-beta signaling system
in carcinomas.
... contribute to cancer progression.
... special relevance in mesenchymal differentiation,
including bone development.
Deregulated expression or activation of
components of this signaling system can
contribute to skeletal diseases, e.g.
osteoarthritis.

19
Yang et al.
Stock 1 (T): constitutively active tkv receptor
Stock 2 (B): constitutively active babo receptor
Samples: T1, T2, T3; B1, B2, B3; Contr1, Contr2, Contr3
Genes: x1, ..., xp
20
Yang et al.
21
Yang et al.
tkv
Babo
Control
22
What are the drawbacks in PCA?
But: we only see the different experiments.
If we do it the other way round, that means
analysing the genes and not the experiments,
we see a grouping of genes.
But we never see both together.
So, can we somehow relate the experiments and the
genes? That means: can we group genes whose
expression might be explained by the respective
experimental group (tkv, babo, control)?
This leads to correspondence analysis.
23
Idea of Correspondence Analysis
We cannot see genes and patients at the same
time, essentially because rows and columns do not
have the same scale.
We want to find something such that our picture
looks like the one here, and such that I can
define a distance between tkv1 and gene blah.
[Figure labels: Gene blah, Tkv 1]
24
Idea of Correspondence Analysis
Correspondence analysis can be understood as an
extension of PCA where the variance in PCA is
replaced by an inertia proportional to the
Chi-square distance of the table from
independence.
PCA: X -> A = X t(X), proportional to Cov(X);
decompose A = V D t(V)
CA: X -> Z (Chi-square distance); A = Z t(Z),
proportional to Cov(Z); decompose A = V D t(V),
which corresponds to the singular value
decomposition Z = V sqrt(D) t(U)
Use the Chi-square distance instead of the
covariance.
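To make the step from X to Z concrete, the sketch below computes the standardized residuals that correspondence analysis decomposes, together with the total inertia. It is an illustration under assumptions: the table is a hypothetical non-negative count table, Z is taken to be the usual matrix of standardized residuals from independence, and the function names are invented.

```python
import math

def standardized_residuals(table):
    """Z[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j) for a non-negative table."""
    total = sum(sum(row) for row in table)
    P = [[v / total for v in row] for row in table]   # correspondence matrix
    r = [sum(row) for row in P]                       # row masses
    c = [sum(col) for col in zip(*P)]                 # column masses
    return [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j in range(len(c))] for i in range(len(r))]

def total_inertia(table):
    """Sum of squared standardized residuals = Chi-square statistic / grand total."""
    return sum(z * z for row in standardized_residuals(table) for z in row)

# Hypothetical table: 3 genes (rows) x 3 experiments (columns).
table = [[20, 5, 5],
         [4, 18, 8],
         [6, 7, 27]]

print(total_inertia(table))
```

A table whose rows are exactly proportional to each other has zero inertia, which is the sense in which inertia measures the distance of the table from independence.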
25
Application in Bioinformatics
26
Still Problems???
Let's assume we have different datasets, not only
expression measurements but also aCGH data, ...
This we call multivariate data.
IDEA of MULTIPLE CORRESPONDENCE ANALYSIS
27
Multiple Correspondence Analysis
28
Multiple Correspondence Analysis
29
Multiple Correspondence Analysis
Integrate this information as well????
30
Multiple Correspondence Analysis

We derive this directly from our biological
application
Multivariate data (expression, aCGH)
Supplementary information (patients information)

31
Data Sources
32
Can we explore different sources in ONE picture?
Genomic (GC content, strand, ...), TFBS,
Expression, GO categories
33
What do we want ?
[Figure: data blocks Expr X, aCGH Y, Patients data and Genomic data, with p1, p2, p3 and p4 columns, over n genes]
34
Concept of Indicator matrix
-1 0 1.
p1
Expr categories X
G1 G2 G3 ...

0 0
0 1 0
0 1 0

n
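The indicator coding on this slide can be sketched as follows. The discretization thresholds, the example values and the function names are hypothetical; only the three categories -1, 0, 1 come from the slide.

```python
def discretize(value, threshold=1.0):
    """Map a continuous value to -1, 0 or 1 (the threshold is an assumed cut-off)."""
    if value <= -threshold:
        return -1
    if value >= threshold:
        return 1
    return 0

def indicator_matrix(codes, categories=(-1, 0, 1)):
    """One row per observation, one 0/1 column per category."""
    return [[1 if code == cat else 0 for cat in categories] for code in codes]

# Hypothetical log expression ratios of four genes in one experiment.
log_ratios = [-1.8, 0.2, 0.4, 2.1]
codes = [discretize(v) for v in log_ratios]
Z_expr = indicator_matrix(codes)

for row in Z_expr:
    print(row)
```

Each row contains exactly one 1, so the block for one variable always has row sums of 1 regardless of the original scale of the data.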
35
All matrices in indicator coding, then append
p1
p2
p3
p4
Expr X
aCGH Y
Cat
Genomic G
Now define
36

Disadvantages
Losing the ordering
Discretization: loss of information
Advantages
Get rid of scaling and distribution problems
As many datasets as memory allows
Combining very different data sources

37
Concept of supplementary Information

Averaging the respondent row points in principal
coordinates is equivalent to appending the
cross-tabulations t(Z*) Z (of the supplementary
variable with the active variables) as
supplementary rows of Z, where Z* denotes the
indicator matrix of the supplementary variable.
This gives the same numerical coordinates as
appending t(Z*) Z as supplementary rows to the
Burt matrix t(Z) Z.
Or, since the Burt matrix is symmetric, one can
append the transposed cross-tabulations
t(Z) Z* to C as supplementary columns.
To illustrate the calculations, suppose that C*
denotes the latter cross-tabulations t(Z) Z*
(stacked vertically) appended as columns to the
Burt matrix, with general element c*_ij and column
sums c*_.j.
Now the positions of the supplementary columns
C* are obtained by weighted averaging.
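The weighted averaging can be sketched on a toy example. Everything here is an assumption made for illustration: the indicator matrices are invented, the coordinates of the active categories are hypothetical numbers rather than the result of an actual analysis, and the averaging weights are taken to be the column profiles of the supplementary cross-tabulation (each element divided by its column sum).

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Indicator matrix of two active variables (categories a1, a2 | b1, b2), 5 respondents.
Z = [[1, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [0, 1, 0, 1],
     [0, 1, 0, 1]]
# Indicator matrix of one supplementary variable (categories s1, s2).
Z_sup = [[1, 0],
         [1, 0],
         [0, 1],
         [0, 1],
         [0, 1]]

burt = matmul(transpose(Z), Z)        # Burt matrix of the active variables
C_sup = matmul(transpose(Z), Z_sup)   # active categories x supplementary categories
col_sums = [sum(col) for col in zip(*C_sup)]

# Hypothetical coordinates of the four active categories on one axis.
coords = [-1.2, 0.8, -0.5, 0.4]

# Position of each supplementary category = weighted average of the active
# category coordinates, with weights C_sup[i][j] / col_sums[j].
positions = [sum(C_sup[i][j] / col_sums[j] * coords[i] for i in range(len(coords)))
             for j in range(len(col_sums))]

print(burt)
print(positions)
```

Because the weights in each column sum to 1, every supplementary category is placed inside the range spanned by the active categories it co-occurs with, and it does not influence the axes themselves.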

38
RESULTS Integrative analysis of different data
types
39
Acknowledgement
Martin Vingron
Matteo Pardo
40
Comparative Analysis of DNA Methylation and
Tissue Specific Expression
Christine Steinhoff
41
OUTLINE

Tissue specific expression in health
Two models for discovering highly variable
gene expression regions
Definition of candidates
Genomic features of these areas
Combination with methylation features

42
Variable gene expression regions
Definition of Variable Regions

Variability: change in expression
locally, that means along the chromosome
between tissues

Restrict them according to methylation indication
Explore
43
DATA
http://symatlas.gnf.org/SymAtlas/
44
Expression DATA
Mouse expression: 36,182 probesets
Human expression: 22,283 probesets
Orthologous genes: human 13,576 probesets, mouse
11,518 probesets
45
Expression DATA
Mouse expression: 122 hybridizations
Human expression: 158 hybridizations
Clearly matching tissue annotation (average for
repeated measurements): 22 tissues
Adrenalgland, Amygdala, Bonemarrow, Cerebellum, Heart,
Hypothalamus, Kidney, Liver, Lung, Lymphnode, Ovary,
Pancreas, Pituitary, Placenta, Prostate, Salivarygland,
Skeletalmuscle, Testis, Thymus, Thyroid, Trachea, Uterus
46
Definition of Variance Score
w_i

Chromosome j
Local variance
47
Deriving a Variance Score
Local Variance in expression
Variance across tissues Variance score
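The formula for the variance score did not survive in this transcript, so the sketch below is only one possible reading of "local variance" combined with "variance across tissues", not the definition used in the talk. It assumes that, for a window of neighbouring genes, the local variance is computed within each tissue and the score is the variance of these local variances across tissues; the data and function names are hypothetical.

```python
import statistics

def local_variances(expr, start, width):
    """Per tissue: variance of expression among the genes of one window.

    expr is a list of genes in chromosomal order, each a list of values per tissue.
    """
    window = expr[start:start + width]
    return [statistics.variance(tissue_values) for tissue_values in zip(*window)]

def var_var_score(expr, start, width):
    """Assumed score: variance across tissues of the local variances."""
    return statistics.variance(local_variances(expr, start, width))

# Hypothetical data: 6 genes along a chromosome, 3 tissues.
expr = [[1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [5.0, 1.0, 1.0],
        [9.0, 1.0, 1.0]]

print(var_var_score(expr, 0, 3))  # flat window
print(var_var_score(expr, 3, 3))  # window with a change in one tissue only
```

Under this reading a window scores high when expression changes along the chromosome in some tissues but not in others.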
48
Variance Score Genomewide
Variance Score
49
Second Variance Score Linear Model
Gene 1, gene 2, gene 3, gene 4, gene 5
50
Second Variance Score Linear Model
For the ith gene in the jth window in
the kth tissue, we can write
51
Second Variance Score Linear Model
Analysis of variance for the 3-fold
classification with (5 genes) x (number of windows)
x (number of tissues) observations
52
Second Variance Score Linear Model
Estimate window effect, tissue effect and
interaction effect.
Scale.
Maximize the three scores simultaneously using
the L2 norm.
This gives for each pair (window, tissue) a score
that measures variability.
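A sketch of how such a score could be computed is given below. The model equation itself is missing from this transcript, so the code assumes the standard two-way layout with interaction, y_ijk = mu + w_j + t_k + (wt)_jk + e_ijk, a balanced design, and a scaling that divides each kind of effect by its largest absolute value; all of these, as well as the data and the function names, are assumptions rather than the method of the talk.

```python
import math
from statistics import mean

def effects(y):
    """Estimate window, tissue and interaction effects.

    y[j][k] is the list of expression values of the genes in window j and tissue k.
    """
    J, K = len(y), len(y[0])
    cell = [[mean(y[j][k]) for k in range(K)] for j in range(J)]
    mu = mean([v for row in cell for v in row])
    w = [mean(cell[j]) - mu for j in range(J)]
    t = [mean([cell[j][k] for j in range(J)]) - mu for k in range(K)]
    wt = [[cell[j][k] - mu - w[j] - t[k] for k in range(K)] for j in range(J)]
    return w, t, wt

def scale(values):
    """Divide by the largest absolute value (assumed scaling)."""
    m = max(abs(v) for v in values)
    return [v / m if m else 0.0 for v in values]

def l2_scores(y):
    """L2 norm of the three scaled effects for every (window, tissue) pair."""
    w, t, wt = effects(y)
    K = len(t)
    ws, ts = scale(w), scale(t)
    flat = scale([v for row in wt for v in row])
    wts = [flat[j * K:(j + 1) * K] for j in range(len(w))]
    return [[math.sqrt(ws[j] ** 2 + ts[k] ** 2 + wts[j][k] ** 2)
             for k in range(K)] for j in range(len(w))]

# Hypothetical data: 3 windows x 3 tissues, 5 genes per window.
flat_cell = [1.0] * 5
y = [[flat_cell, flat_cell, flat_cell],
     [flat_cell, flat_cell, flat_cell],
     [flat_cell, flat_cell, [9.0] * 5]]

scores = l2_scores(y)
for row in scores:
    print(row)
```

In this toy example the pair (third window, third tissue), where expression differs from everything else, receives the largest score.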
53
Distribution
Window effects
Tissue effects
Interaction effects
L2 Norm
54
top scoring Examples
Pancreas
SEMA4F HK2 TACR1 MRPL19 GCF LRRTM4 REG1B REG1A
REG3A CTNNA2 SUCKG1 Q9H5E1
Relative expression
55
top scoring Examples
RPS18 NM_003782 WDR46 HKE2 RGL2 TAPBP ZNF297 DAXX
KIFC1 PHF1
Relative expression
56
Sequence Features within genes GC content
Top 50 windows
57
Sequence Features within genes GC content
58
Sequence Features within genes SINEs, LINEs
59
Example
NP_006240.3 PRB4 NP_955385.1 Q7M4Q5 ETV6 BCL2L14
LRP6 MANSC1 CREBL2 GPR19
Salivary Gland Trachea
Relative expression
60
Training an SVM on epigenetic data for CpG islands
on human chromosomes 21 and 22
Identify informative DNA attributes that
correlate with open versus compact chromatin
structures
Use the attributes to predict the epigenetic states
of all CpG islands genome-wide
61
Methylation Prediction
Variance Score
L2 Score
Frequency
L2 Norm (win,tis,win-tis)
Var-Var Score
62
Methylation Prediction
Window score only
Tissue score only
Interaction score only (win-tis)
63
Summary
Varvar Score L2Norm Score
GC GC SINE LINE Meth pred

?
?
64
Acknowledgement
Martin Vingron Szymon Kielbasa Peter Arndt
Hans-Peter Lenhof Martina Paulsen Jörn Walter
Christoph Bock
65
Thank you for your attention
