Title: Visualization of Multivariate Data
1. Visualization of Multivariate Data
Christine Steinhoff
Max Planck Institute for Molecular Genetics
Berlin, Germany
2. Outline
- Motivation: data integration
- Data types: expression, aCGH, patient information
- Problems: distribution, scale
- Procedure: discretization, filtering, indicator matrix, MCA
- Towards a distance definition
- Results
3. Data Sources
4. DATA INTEGRATION
Patient covariates: information on the patients under study
5. DATA INTEGRATION
- Gene x1: overexpressed
- Gene x2: amplified
- Gene x3: loss
- Gene x4: overexpressed
6. PROBLEMS
- Discrete categories
- After appropriate normalization: approximately lognormal, symmetric
- Not symmetric, skewed
Scale and distribution differ!
7. Procedure
- Data input
- Discretization
- Filtering
- Indicator coding
- Multiple Correspondence Analysis
8. Step 1: Discretization
- Patient covariates: categorical, e.g. staging, grading, smoking, mutation, ...
- arrayCGH
- Expression
9. Step 1: Discretization
- arrayCGH: for example CBS (package DNAcopy), segmentation and discretization of arrayCGH data
- Expression: for example a fold-change criterion
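The fold-change criterion for expression data can be sketched as follows. This is only an illustration: the two-fold cutoff, the function name `discretize_fold_change`, and the toy ratios are assumptions, since the slides do not state a threshold. The states -1, 0, 1 follow the gene-state labels used later in the slides. CBS segmentation of arrayCGH data is done with the R package DNAcopy and is not reproduced here.

```python
import math


def discretize_fold_change(ratios, threshold=2.0):
    """Map expression ratios (sample / reference) to the states -1, 0, 1.

    `threshold` is an assumed fold-change cutoff, not a value from the slides.
    """
    cutoff = math.log2(threshold)
    states = []
    for ratio in ratios:
        log_ratio = math.log2(ratio)
        if log_ratio >= cutoff:
            states.append(1)       # overexpressed
        elif log_ratio <= -cutoff:
            states.append(-1)      # underexpressed
        else:
            states.append(0)       # no change
    return states


# Hypothetical ratios for one gene across four patients.
example_states = discretize_fold_change([4.0, 1.1, 0.25, 0.9])
```

Working on the log scale makes the cutoff symmetric: a two-fold increase and a two-fold decrease are treated alike.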
10. Step 1: Discretization
- Patient covariates
- arrayCGH
- Expression
Typically n ≈ 23,000 -> reduce number
11. Step 2: Filtering (optional)
Suggestion:
- Neglect all genes with no change in any patient
- Select for high correlation between arrayCGH and expression
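A minimal sketch of the two suggested filters, under stated assumptions: the correlation is a Pearson correlation computed on the discretized states, and the cutoff `min_corr=0.5` is arbitrary. The slides specify neither. The names `pearson` and `filter_genes` and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_genes(cgh_states, expr_states, min_corr=0.5):
    """Keep genes that change in some patient and correlate across platforms.

    Both arguments map gene name -> list of states (-1, 0, 1), one per patient.
    `min_corr` is an assumed cutoff.
    """
    kept = []
    for gene in cgh_states:
        cgh, expr = cgh_states[gene], expr_states[gene]
        # Filter 1: neglect genes with no change in any patient.
        if all(s == 0 for s in cgh) and all(s == 0 for s in expr):
            continue
        # Filter 2: require high correlation between arrayCGH and expression.
        if pearson(cgh, expr) >= min_corr:
            kept.append(gene)
    return kept


# Hypothetical states for three genes in four patients.
cgh = {"g1": [1, 1, 0, -1], "g2": [0, 0, 0, 0], "g3": [1, 0, -1, 0]}
expr = {"g1": [1, 1, 0, -1], "g2": [0, 0, 0, 0], "g3": [-1, 0, 1, 0]}
kept_genes = filter_genes(cgh, expr)
```

Here `g2` is dropped because it never changes, and `g3` is dropped because its copy number and expression are anticorrelated.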
12. Step 3: Indicator Matrix - Binary Coding
- Original matrix: with categories
- Indicator matrix: with binary coding
13. Step 3: Indicator Matrix - Binary Coding
- Original matrix: with categories
- Indicator matrix: with binary coding
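Binary coding turns each categorical column into one 0/1 column per category. The sketch below assumes the three states -1, 0, 1 for every gene and labels the columns in the style `G1 (-1)`. The function name `indicator_matrix` and the toy matrix are hypothetical.

```python
def indicator_matrix(states, levels=(-1, 0, 1)):
    """Binary-code a matrix of categories.

    states: list of rows (patients), each a list of gene states.
    Returns the indicator matrix and one label per indicator column.
    """
    n_genes = len(states[0])
    labels = [f"G{g + 1} ({lev})" for g in range(n_genes) for lev in levels]
    matrix = [
        [1 if row[g] == lev else 0 for g in range(n_genes) for lev in levels]
        for row in states
    ]
    return matrix, labels


# Original matrix with categories: 3 patients x 2 genes (hypothetical values).
original = [[-1, 1],
            [0, 0],
            [1, -1]]
indicator, column_labels = indicator_matrix(original)
```

Every row of the indicator matrix sums to the number of genes, because each gene is in exactly one state per patient. With a fixed list of levels, a state that never occurs produces an all-zero column, which should be removed before a correspondence analysis.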
14. Step 4: Appending Matrices
- Experimental: A, E
- Supplemental covariates: P
15. Multiple Correspondence Analysis with Supplementary Information
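16. Multiple Correspondence Analysis
Gene 251, state 1
Rows and columns are labeled by gene states: G1 (-1), G1 (0), G1 (1), G2 (-1), ...
Block structure of the matrix, where t(.) denotes the transpose:
$$\begin{pmatrix} t(E)\,E & t(E)\,A \\ t(A)\,E & t(A)\,A \end{pmatrix}$$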
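The appending step and the analysis itself can be sketched with NumPy. This is not the author's implementation. It assumes that MCA is computed as a correspondence analysis of the appended indicator matrix via an SVD of standardized residuals, and that the patient covariates enter as supplementary columns projected onto the resulting axes. The names `indicator` and `mca` and all toy data are hypothetical.

```python
import numpy as np


def indicator(categories):
    """Binary-code a (patients x variables) array, one column per observed level."""
    blocks = []
    for column in np.asarray(categories).T:
        levels = np.unique(column)
        blocks.append((column[:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)


def mca(Z_active, Z_supp, n_dims=2):
    """Correspondence analysis of an indicator matrix with supplementary columns."""
    freq = Z_active / Z_active.sum()
    r = freq.sum(axis=1)                      # row masses (patients)
    c = freq.sum(axis=0)                      # column masses (gene states)
    expected = np.outer(r, c)
    S = (freq - expected) / np.sqrt(expected)  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, s, Vt = U[:, :n_dims], s[:n_dims], Vt[:n_dims]
    row_standard = U / np.sqrt(r)[:, None]
    col_principal = (Vt.T * s) / np.sqrt(c)[:, None]
    # Supplementary columns: project their column profiles onto the row axes.
    supp_profiles = Z_supp / Z_supp.sum(axis=0)
    supp_principal = supp_profiles.T @ row_standard
    return col_principal, supp_principal, s


# Hypothetical discretized data: 6 patients x 2 genes per platform.
expr_states = [[1, 1], [1, 0], [0, 0], [0, 0], [-1, -1], [-1, 0]]
cgh_states = [[1, 0], [1, 0], [0, 0], [0, 0], [-1, 0], [-1, 1]]
grade = [["low"], ["low"], ["low"], ["high"], ["high"], ["high"]]

E = indicator(expr_states)   # expression indicator matrix
A = indicator(cgh_states)    # arrayCGH indicator matrix
P = indicator(grade)         # patient covariates (supplementary)

Z = np.hstack([E, A])        # appended experimental matrix
cross = Z.T @ Z              # blocks t(E)E, t(E)A, t(A)E, t(A)A

gene_coords, covariate_coords, singular_values = mca(Z, P)
```

The rows of `gene_coords` are the gene-state points of the MCA map and the rows of `covariate_coords` are the covariate categories placed in the same plane. Because the covariates are supplementary, they do not influence the axes.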
17. Multiple Correspondence Analysis
How to read: gene states cluster according to
- Distance from origin
- Angle
18. Patient Information
19. Towards Distance Definition
- Determine
  - Angle
  - Vector length
- Select genes according to a predefined angle
- Or select genes according to angle and length
[Figure label: angle a]
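The selection by angle and vector length can be sketched as below. The slides do not state what the angle is measured against. This sketch assumes it is the angle, seen from the origin, between a gene-state point and a reference point such as a covariate category. The cutoffs, the names `angle_and_length` and `select_states`, and the coordinates are hypothetical.

```python
import math


def angle_and_length(point, reference):
    """Return (angle in degrees between point and reference, length of point).

    Both are 2-D vectors from the origin. The angle is NaN if either is zero.
    """
    length = math.hypot(*point)
    ref_length = math.hypot(*reference)
    if length == 0 or ref_length == 0:
        return math.nan, length
    cosine = (point[0] * reference[0] + point[1] * reference[1]) / (length * ref_length)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine)), length


def select_states(coords, reference, max_angle=15.0, min_length=0.0):
    """Select gene states within `max_angle` of the reference direction.

    With `min_length` > 0 the vector length is used as a second criterion.
    """
    selected = []
    for label, point in coords.items():
        angle, length = angle_and_length(point, reference)
        if angle <= max_angle and length >= min_length:
            selected.append(label)
    return selected


# Hypothetical map coordinates of three gene states and one covariate category.
coords = {"G1 (1)": (2.0, 0.2), "G2 (-1)": (0.1, 0.01), "G3 (0)": (0.0, 1.5)}
covariate_point = (1.0, 0.0)

by_angle = select_states(coords, covariate_point, max_angle=15.0)
by_angle_and_length = select_states(coords, covariate_point, max_angle=15.0, min_length=0.5)
```

Selecting by angle alone also keeps points close to the origin, whose direction is poorly determined. Adding the length criterion removes them.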
20. EXAMPLE: PUBLISHED DATA
21. EXAMPLE: PUBLISHED DATA
22. EXAMPLE: PUBLISHED DATA
P = 0.006
23. EXAMPLE: PUBLISHED DATA
P = 0.005
P = 0.008
24. SUMMARY
Pipeline for joint visualization of
(a) experimental continuous data, e.g. arrayCGH and expression data
(b) patient covariates
Application:
- Data set: parallel investigation of arrayCGH and expression in breast cancer patients, covariate data available
- Determination of candidate gene sets
- Enrichment of specific cancer-related pathways
25. FURTHER DIRECTIONS AND OPEN QUESTIONS
- Integration of variable data sources
- Appropriate discretization methods
- Avoid filtering by choosing an algorithm for decomposition of sparse matrices
- Evaluation scheme (problem of simulation and noise adding)
- Appropriate comparison with the Berger et al. approach on continuous data (no implementation of patient covariates)
- ...
26. ACKNOWLEDGEMENT
- Max Planck Institute for Molecular Genetics: Martin Vingron
- Sensor Lab, CNR-INFM: Matteo Pardo