Christine Steinhoff - PowerPoint PPT Presentation

1
PICB Group Meeting, Shanghai, 18 Oct. 2007
Ongoing Projects: Visualization of Multivariate Data and Exploration of DNA
Methylation
Christine Steinhoff
2
OUTLINE
  • Visualization of Multivariate Data
  • Principal Component Analysis
  • Correspondence Analysis
  • Multiple Correspondence Analysis with
    Supplementary Data
  • Can we find master regulation regions in terms of
    DNA methylation?
  • What is DNA Methylation?
  • What is DNA Methylation good for?
  • DNA Methylation and Tissue Specific Expression
  • Deriving Scores for Tissue Specific Expression
    Switches
  • Exploration of Candidate Regions with respect to
    Genomic Features
  • Correlation of Candidate Regions with
    Evolutionary Models and Methylation Prediction

3
OUTLINE
  • Visualization of Multivariate Data
  • Principal Component Analysis
  • Correspondence Analysis
  • Multiple Correspondence Analysis with
    Supplementary Data

Cooperation work with Matteo Pardo, CNR-INFM
Sensor Lab, Brescia, Italy
4
Idea of PCA
We often do not know which measurements best
reflect the dynamics in our system.
So PCA should reveal: the dynamics are along
the x axis.
http://www.snl.salk.edu/shlens/pub/notes/pca.pdf
5
Back to Biology
Matrix X: rows are patients 1, ..., n; columns are genes x1, ..., xp.
The dimension is high. So how can we reduce the
dimension? Simplest way: take the first one,
two, three; plot them and discard the rest. Obviously
a very bad idea.
6
How can we transform our data appropriately?
Assume we can manipulate X a bit. Let's call
the result Y. Y should be manipulated in such a way that it
is a bit more optimal than X was. What does
optimal mean? It means that in Cov(Y)
the covariances (off-diagonal terms) are SMALL and
the variances (diagonal terms) are LARGE.
In other words, Cov(Y) should be diagonal, with large
values on the diagonal.
The principal components of X are the
eigenvectors of Cov(X).
7
Why Covariance small and Variance high ?
  • The diagonal terms of Cov(X) are the variances of the
    genes across patients
  • The off-diagonal terms of Cov(X) are the
    covariance between gene vectors
  • Cov(X) captures the correlations between all
    possible pairs of measurements
  • In the diagonal terms, large values correspond to
    interesting dynamics
  • In the off diagonal terms large values correspond
    to high redundancy

8
How to choose the transformation?
Find an orthonormal P such that
$Y = PX$
with Cov(Y) diagonalized.
Then the rows of P are the principal components
of X.
Let D be the diagonal matrix containing the
eigenvalues of $XX^T$, and let E be the
corresponding matrix containing the eigenvectors
of $XX^T$. Now I claim that this definition
does the job, that is, $P = E^T$. This can be
calculated ...
9
How do we get there?
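$\mathrm{Cov}(Y) = \frac{1}{n-1} Y Y^T = \frac{1}{n-1} (PX)(PX)^T = \frac{1}{n-1} P X X^T P^T = \frac{1}{n-1} P A P^T$
with $A = X X^T$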
10
How do we get there?
A is symmetric. Therefore there is a matrix E of
eigenvectors and a diagonal matrix D such that
$A = E D E^T$
Now define P to be the transpose of the matrix E
of eigenvectors: $P = E^T$
Then we can write $A = P^T D P$
11
How do we get there?
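Now we can go back to our covariance expression:
$\mathrm{Cov}(Y) = \frac{1}{n-1} P A P^T = \frac{1}{n-1} P (P^T D P) P^T = \frac{1}{n-1} (P P^T) D (P P^T)$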
12
How do we get there?
The inverse of an orthogonal matrix is its
transpose (by definition): $P^{-1} = P^T$
In our context that means
$\mathrm{Cov}(Y) = \frac{1}{n-1} (P P^{-1}) D (P P^{-1}) = \frac{1}{n-1} D$
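The derivation on the last few slides can be checked numerically. The following is a minimal sketch, not the speaker's code: it assumes X holds genes as rows and patients as columns (the layout that Y = PX and XX^T require), that each gene has been centred, and it uses random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: p = 4 genes (rows) measured on n = 50 patients (columns).
p, n = 4, 50
X = rng.normal(size=(p, n))
X[1] = 0.8 * X[0] + 0.2 * X[1]          # make two genes redundant on purpose
X = X - X.mean(axis=1, keepdims=True)   # centre each gene (row)

A = X @ X.T                             # A = X X^T, symmetric
eigvals, E = np.linalg.eigh(A)          # columns of E are eigenvectors of A
order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
eigvals, E = eigvals[order], E[:, order]

P = E.T                                 # rows of P are the principal components
Y = P @ X                               # transformed data
cov_Y = (Y @ Y.T) / (n - 1)             # Cov(Y) = 1/(n-1) Y Y^T

print(np.round(cov_Y, 6))               # off-diagonal entries are numerically zero
```

With P chosen as E^T, Cov(Y) comes out diagonal with the eigenvalues of A divided by n-1 on the diagonal, which is the statement of the last slide.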
13
Some Remarks
  • If you multiply one variable by a scalar, you get
    different results.
  • This is because PCA uses the covariance matrix (and
    not the correlation matrix).
  • PCA should be applied to data that have
    approximately the same scale in each variable.
  • The relative variance explained by each PC is
    given by eigenvalue / sum(eigenvalues).
  • When to stop? For example: keep enough PCs that the
    cumulative variance explained by the PCs is
    > 50-70%.
  • Kaiser criterion: keep PCs with eigenvalues > 1.
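The stopping rules in these remarks are simple arithmetic on the eigenvalues. Here is a small sketch; the function names and the eigenvalues are made up for illustration.

```python
import numpy as np

def explained_variance(eigenvalues):
    """Relative variance per PC: eigenvalue / sum(eigenvalues)."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return eigenvalues / eigenvalues.sum()

def n_components_for(eigenvalues, threshold=0.7):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    cumulative = np.cumsum(explained_variance(eigenvalues))
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))      # guard against rounding at threshold = 1.0

def kaiser(eigenvalues):
    """Number of PCs with eigenvalue > 1 (Kaiser criterion)."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) > 1.0))

eigenvalues = [2.5, 1.0, 0.3, 0.2]      # hypothetical eigenvalues
ratios = explained_variance(eigenvalues)
print(ratios, n_components_for(eigenvalues, 0.7), kaiser(eigenvalues))
```

For these made-up eigenvalues the first PC alone explains more than 50% of the variance, two PCs are needed to pass 70%, and the Kaiser criterion keeps only one PC.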

14
Some Remarks
15
Some Remarks
If variables have very heterogeneous variances, we
standardize them. The standardized variables are
$X_i^* = (X_i - \mathrm{mean}(X_i)) / \sqrt{\mathrm{variance}(X_i)}$
The new variables all have the same variance, so each
variable has the same weight.
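A short sketch of the standardization, and of why it removes the dependence on the scale of a single variable. It assumes variables are stored as columns and patients as rows, and the data are random toy values.

```python
import numpy as np

def standardize(X):
    """Standardize each variable (column): (x - mean) / sqrt(variance)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))         # 30 patients (rows) x 3 variables (columns)
X_scaled = X.copy()
X_scaled[:, 0] *= 1000.0             # rescale one variable, e.g. a change of units

Z = standardize(X)
Z_scaled = standardize(X_scaled)

# The covariance matrices differ, the standardized data do not.
print(np.allclose(np.cov(X, rowvar=False), np.cov(X_scaled, rowvar=False)))
print(np.allclose(Z, Z_scaled))
```

PCA on the covariance matrix of the standardized variables is the same as PCA on the correlation matrix of the original variables.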
16
REMARKS
  • PCA is useful for finding new, more informative,
    uncorrelated features; it reduces dimensionality
    by rejecting low-variance features.
  • PCA is only powerful if the biological question
    is related to the highest variance in the dataset.

17
Example 2
18
Yang et al.
  • Transforming growth factor-beta
  • TGF-beta is a potent inducer of growth arrest in
    many cell types, including epithelial cells.
  • This activity is the basis of the tumor
    suppressor role of the TGF-beta signaling system
    in carcinomas.
  • contribute to cancer progression.
  • special relevance in mesenchymal differentiation,
    including bone development.
  • Deregulated expression or activation of
    components of this signaling system can
    contribute to skeletal diseases, e.g.
    osteoarthritis.

19
Yang et al.
Stock 1 (T): constitutively active tkv receptor
Stock 2 (B): constitutively active babo receptor
Samples: T1, T2, T3; B1, B2, B3; Contr1, Contr2, Contr3
Genes: x1, ..., xp
20
Yang et al.
21
Yang et al.
tkv
Babo
Control
22
What are the drawbacks in PCA?
But: we only see the different experiments. If
we do it the other way round, that is,
analysing the genes rather than the experiments,
we see a grouping of genes. But we never see both
together. So can we somehow relate the
experiments and the genes? That is, can we group
genes whose expression might be explained by
the respective experimental group (tkv, babo,
control)? This leads to correspondence
analysis.
23
Idea of Correspondence Analysis
We cannot see genes and patients at the same
time, essentially because rows and columns do not
have the same scale. We want to find a representation
such that our picture looks like the one here, and I can
define a distance between tkv1 and gene blah.
Gene blah
Tkv 1
24
Idea of Correspondence Analysis
Correspondence analysis can be understood as an
extension of PCA in which the variance in PCA is
replaced by an inertia proportional
to the chi-square distance of the table from
independence.
PCA: X -> A = X t(X) = Cov(X); decompose A = V D t(V)
CA: X -> Z (chi-square distance); A = Z t(Z) = Cov(Z);
decompose A = V D t(V), which is Z = U sqrt(D) t(V)
Use the chi-square distance instead of the covariance.
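To make the comparison with PCA concrete, here is a minimal sketch of correspondence analysis in its standard textbook form: an SVD of the chi-square standardized residuals of a non-negative table. The table, the function name and the variable names are hypothetical and not taken from the talk.

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a non-negative table N via SVD of the standardized residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                                # correspondence matrix
    r = P.sum(axis=1)                              # row masses
    c = P.sum(axis=0)                              # column masses
    # Z: deviations from independence, scaled as in the chi-square statistic.
    Z = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]       # row principal coordinates
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]    # column principal coordinates
    return rows, cols, sigma ** 2                  # inertia per axis

# Hypothetical table: 4 genes (rows) x 3 conditions (columns).
N = np.array([[20,  5,  2],
              [ 3, 18,  4],
              [ 2,  6, 25],
              [10, 10, 10]])
rows, cols, inertia = correspondence_analysis(N)
print(np.round(rows[:, :2], 3))
print(np.round(cols[:, :2], 3))
```

Rows and columns end up in the same coordinate system, so genes and conditions can be drawn in one plot. The total inertia equals the chi-square statistic of the table divided by the grand total.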
25
Application in Bioinformatics
26
Still Problems???
Let's assume we have different datasets: not only
expression measurements but also aCGH
data, ... This we call multivariate data. IDEA
of MULTIPLE CORRESPONDENCE ANALYSIS
27
Multiple Correspondence Analysis
28
Multiple Correspondence Analysis
29
Multiple Correspondence Analysis
Integrate this information as well????
30
Multiple Correspondence Analysis
  • We derive this directly from our biological
    application
  • Multivariate data (expression, aCGH)
  • Supplementary information (patients information)

31
Data Sources
32
Can we explore different sources in ONE picture?
Genomic (GC content, strand, ...), TFBS, Expression,
GO categories
33
What do we want ?
[Figure: n genes (rows) against four column blocks of widths p1, p2, p3, p4: Expr X, aCGH Y, Patients data, Genomic data]
34
Concept of Indicator matrix
Expression values are discretized into the categories -1, 0, 1.
Expr categories X (n genes as rows, p1 category columns):
        -1   0   1
G1       1   0   0
G2       0   1   0
G3       0   1   0
...
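A small sketch of how such an indicator matrix could be built. The thresholds and the input values are made up for illustration; the talk does not say how the discretization was done.

```python
import numpy as np

def discretize(x, low, high):
    """Map continuous values to -1 (below low), 1 (above high), 0 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < low, -1, np.where(x > high, 1, 0))

def indicator_matrix(values, categories):
    """One row per gene, one column per category, exactly one 1 per row."""
    values = np.asarray(values)
    categories = np.asarray(categories)
    return (values[:, None] == categories[None, :]).astype(int)

log_ratios = [-1.8, 0.1, 0.3, 2.2]   # hypothetical values for genes G1..G4
categories = [-1, 0, 1]
Z_expr = indicator_matrix(discretize(log_ratios, -1.0, 1.0), categories)
print(Z_expr)
```

Each row sums to 1, which is what makes the matrix usable as input to a correspondence analysis regardless of the original scale of the data.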
35
All matrices in indicator coding, then append
[Figure: column blocks of widths p1, p2, p3, p4: Expr X, aCGH Y, Cat, Genomic G]
Now define
36
  • Disadvantages:
  • Losing ordering
  • Discretization: loss of information
  • Advantages:
  • Get rid of scaling and distribution problems
  • As many datasets as memory allows
  • Combining very different data sources
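Appending indicator matrices side by side, and the Burt matrix t(Z)Z that is built from the result, can be sketched as follows. The two small indicator matrices and the aCGH category labels are hypothetical.

```python
import numpy as np

# Hypothetical indicator matrices for the same 4 genes (rows).
Z_expr = np.array([[1, 0, 0],      # expression categories -1, 0, 1
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
Z_acgh = np.array([[1, 0],         # aCGH categories, e.g. loss / no loss
                   [1, 0],
                   [0, 1],
                   [0, 1]])

Z = np.hstack([Z_expr, Z_acgh])    # append the blocks column-wise
burt = Z.T @ Z                     # Burt matrix: all pairwise cross-tabulations

print(Z)
print(burt)
```

The diagonal blocks of the Burt matrix hold the category counts of each variable, and the off-diagonal blocks hold the cross-tabulations between variables. Because every block is a table of counts, very different data sources can be combined without worrying about their original scales.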

37
Concept of supplementary Information
  • averaging the respondent row points in principal
    coordinates
  • equivalent to appending the cross-tabulations
    tr(Z) Z (of the supplementary variable with the
    active variables) as supplementary rows of Z.
  • This gives the same numerical coordinates as
    appending tr(Z) Z as supplementary rows to the Burt
    matrix t(Z)Z.
  • Or, since the Burt matrix is symmetric, one can
    append the transposed cross-tabulations
    tr(Z) Z to C as supplementary columns.
  • To illustrate the calculations, suppose that C
    denotes the latter cross-tabulations tr(Z) Z
    (stacked vertically) appended as columns to the
    Burt matrix, with general element c_ij and column
    sums c_.j.
  • Now the positions of the supplementary columns
    C are obtained by weighted averaging as follows
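The weighted averaging can be illustrated with the general transition formula of correspondence analysis: a supplementary column is placed at the average of the row standard coordinates, weighted by c_ij / c_.j. This sketch runs on a plain two-way table rather than a Burt matrix, and all names and numbers are hypothetical.

```python
import numpy as np

def ca_coordinates(N, k=2):
    """Row standard and column principal coordinates on the first k axes."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_standard = U[:, :k] / np.sqrt(r)[:, None]
    col_principal = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]
    return row_standard, col_principal

def supplementary_columns(C, row_standard):
    """Place columns of C by weighted averaging with weights c_ij / c_.j."""
    C = np.asarray(C, dtype=float)
    weights = C / C.sum(axis=0, keepdims=True)
    return weights.T @ row_standard

N = np.array([[20,  5,  2],        # hypothetical active table
              [ 3, 18,  4],
              [ 2,  6, 25],
              [10, 10, 10]])
C_sup = np.array([[12, 1],         # hypothetical supplementary columns
                  [ 2, 9],
                  [ 1, 8],
                  [ 5, 5]])

row_standard, col_principal = ca_coordinates(N)
sup = supplementary_columns(C_sup, row_standard)
print(np.round(sup, 3))
```

Supplementary columns do not influence the axes; they are only projected onto them. As a sanity check, treating an active column as supplementary reproduces its own principal coordinates.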

38
RESULTS: Integrative analysis of different data
types
39
Acknowledgement
Martin Vingron
Matteo Pardo
40
Comparative Analysis of DNA Methylation and
Tissue Specific Expression
Christine Steinhoff
41
OUTLINE
  • Tissue specific expression in health
  • Two models for discovering highly variable
    gene expression regions
  • Definition of candidates
  • Genomic features of these areas
  • Combination with methylation features

42
Variable gene expression regions
Definition of Variable Regions
  • Variability = change in expression
  • locally, that is, along the chromosome
  • between tissues

Restrict them due to methylation indication
Explore
43
DATA
http://symatlas.gnf.org/SymAtlas/
44
Expression DATA
Mouse expression: 36,182 probesets
Human expression: 22,283 probesets
Orthologous genes: human 13,576 probesets, mouse
11,518 probesets
45
Expression DATA
Mouse expression: 122 hybridizations
Human expression: 158 hybridizations
Clearly matching tissue annotation (average for
repeated measurements): 22 tissues
Adrenal gland, Amygdala, Bone marrow, Cerebellum, Heart,
Hypothalamus, Kidney, Liver, Lung, Lymph node, Ovary,
Pancreas, Pituitary, Placenta, Prostate, Salivary gland,
Skeletal muscle, Testis, Thymus, Thyroid, Trachea, Uterus
46
Definition of Variance Score
[Figure: window w_i along chromosome j; the local variance is computed within the window]
47
Deriving a Variance Score
Local variance in expression
-> variance across tissues -> variance score
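One possible reading of this two-step score is sketched below: the variance of expression within a window of neighbouring genes is computed per tissue, and then the variance of those local variances is taken across tissues. The window size of 5 genes, the sliding step of 1 and the toy data are assumptions and are not stated on the slide.

```python
import numpy as np

def variance_score(expr, window=5):
    """expr: genes (in chromosomal order) x tissues. Returns one score per window."""
    expr = np.asarray(expr, dtype=float)
    n_genes = expr.shape[0]
    starts = range(0, n_genes - window + 1)      # sliding windows, step 1 (assumed)
    # Local variance: variance over the genes of each window, separately per tissue.
    local = np.array([expr[s:s + window].var(axis=0, ddof=1) for s in starts])
    # Variance of the local variances across tissues.
    return local.var(axis=1, ddof=1)

# Hypothetical data: 8 genes x 3 tissues, variable in tissue 0 only.
expr = np.zeros((8, 3))
expr[:5, 0] = [0, 2, 0, 2, 0]
scores = variance_score(expr)
print(np.round(scores, 3))
```

A window scores high when expression fluctuates strongly along the chromosome in some tissues but not in others, and scores zero when all tissues behave alike.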
48
Variance Score Genomewide
Variance Score
49
Second Variance Score Linear Model
Gene 1, gene 2, gene 3, gene 4, gene 5
50
Second Variance Score Linear Model
For the i-th gene in the j-th window in
the k-th tissue, we can write
51
Second Variance Score Linear Model
Analysis of variance for the 3-fold
classification with (5 genes) x (number of windows)
x (number of tissues) observations
52
Second Variance Score Linear Model
Estimate the window effect, the tissue effect and the
interaction effect. Scale. Maximize the three
scores simultaneously using the L2 norm.
This gives, for each pair (window, tissue), a score
that measures variability.
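A hedged sketch of how such a score could be computed for a balanced design. It assumes the usual two-factor decomposition with interaction, with the 5 genes of a window acting as replicates, and it assumes that "scale" means dividing each set of effects by its standard deviation. Neither the model equation nor the scaling is spelled out on the slides, so both are assumptions.

```python
import numpy as np

def l2_scores(y):
    """y: array of shape (genes_per_window, windows, tissues), balanced design."""
    y = np.asarray(y, dtype=float)
    cell = y.mean(axis=0)                        # mean per (window, tissue)
    mu = cell.mean()                             # grand mean
    window = cell.mean(axis=1) - mu              # window main effects
    tissue = cell.mean(axis=0) - mu              # tissue main effects
    inter = cell - mu - window[:, None] - tissue[None, :]   # interaction effects

    def scale(effects):
        sd = effects.std()
        return effects / sd if sd > 0 else effects   # assumed scaling: unit std

    w, t, wt = scale(window), scale(tissue), scale(inter)
    # L2 norm of the three scaled effects for every (window, tissue) pair.
    return np.sqrt(w[:, None] ** 2 + t[None, :] ** 2 + wt ** 2)

rng = np.random.default_rng(2)
y = rng.normal(0.0, 0.1, size=(5, 6, 4))         # 5 genes, 6 windows, 4 tissues (toy data)
y[:, 2, 1] += 5.0                                # planted switch in window 2, tissue 1
scores = l2_scores(y)
print(np.unravel_index(np.argmax(scores), scores.shape))
```

In this toy example the planted (window, tissue) pair receives the highest score, since all three scaled effects are large for it at the same time.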
53
[Figure: distributions of the window effects, tissue effects, interaction effects, and the L2 norm]
54
top scoring Examples
Pancreas
SEMA4F HK2 TACR1 MRPL19 GCF LRRTM4 REG1B REG1A
REG3A CTNNA2 SUCKG1 Q9H5E1
Relative expression
55
top scoring Examples
RPS18 NM_003782 WDR46 HKE2 RGL2 TAPBP ZNF297 DAXX
KIFC1 PHF1
Relative expression
56
Sequence Features within genes GC content
Top 50 windows
57
Sequence Features within genes GC content
58
Sequence Features within genes SINEs, LINEs
59
Example
NP_006240.3 PRB4 NP_955385.1 Q7M4Q5 ETV6 BCL2L14
LRP6 MANSC1 CREBL2 GPR19
Salivary Gland Trachea
Relative expression
60
Training an SVM on epigenetic data for CpG islands
on human chromosomes 21 and 22
Identify informative DNA attributes that
correlate with open versus compact chromatin
structures
Use the attributes to predict the epigenetic states
of all CpG islands genome-wide
61
Methylation Prediction
Variance Score
L2 Score
Frequency
L2 Norm (win,tis,win-tis)
Var-Var Score
62
Methylation Prediction
Window score only
Tissue score only
Interaction score only (win-tis)
63
Summary
[Table: Varvar Score and L2Norm Score compared with respect to GC, SINE, LINE, Meth pred]
64
Acknowledgement
Martin Vingron Szymon Kielbasa Peter Arndt
Hans-Peter Lenhof Martina Paulsen Jörn Walter
Christoph Bock
65
Thank you for your attention