Title: More On Preprocessing
 1More On Preprocessing Javier Cabrera 
 2Outline
- Transform the data into a scale suitable for 
 analysis.
- Remove the effects of systematic and obfuscating 
 sources of variation.
- Identify discrepant observations.
3Outline
- Preprocessing => quality of downstream analyses
- 1. Log transformation, $X \to \log(X)$
-  The variation of logged intensities may be less 
 dependent on magnitude.
-  Taking logs reduces the skewness of highly skewed 
 distributions.
-  Taking logs improves variance estimation.
- 2. Other transformations
-  Power transformations ($X \to X^b$ for some $b = 1/2$, 
 $1/3$ or other)
-  Amaratunga and Cabrera (2000), Tusher et al. 
 (2001)
- 3. Variance-stabilizing transformations
-  $X \to \log(X + c)$: symmetrizes the spot intensity 
 data and stabilizes their variances.
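The effect of these transformations on skewness can be checked numerically. The sketch below is only an illustration: the intensities are simulated from a lognormal distribution rather than taken from any array, and the function names and the constant c = 10 are assumptions chosen for the example.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """X -> log(X); requires X > 0."""
    return [math.log(v) for v in values]


def power_transform(values, b):
    """X -> X^b; requires X > 0 for fractional b."""
    return [v ** b for v in values]


def started_log(values, c):
    """X -> log(X + c); requires X + c > 0."""
    return [math.log(v + c) for v in values]


random.seed(0)
# Simulated right-skewed intensities (assumption: lognormal, not real data).
raw = [random.lognormvariate(6.0, 1.0) for _ in range(5000)]

skew_raw = skewness(raw)
skew_cube_root = skewness(power_transform(raw, 1 / 3))
skew_log = skewness(log_transform(raw))
skew_started = skewness(started_log(raw, 10.0))

print("raw:", skew_raw)
print("cube root:", skew_cube_root)
print("log:", skew_log)
print("log(X + 10):", skew_started)
```

On data of this kind the raw skewness is large, the cube root reduces it, and the two log-type transformations bring it close to zero.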
4Transformations
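- 4. Rocke and Durbin (2001): arrays with replicate 
 spots.
- By analogy with models used for estimating the concentration 
 of an analyte: $X = \alpha + \mu e^{\eta} + \varepsilon$
- $\alpha$ = mean background, $\mu$ = true expression level, $\eta$ 
 and $\varepsilon$ = normally distributed errors with variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$
- 5. Durbin et al. (2002): generalized log 
 transformation
-  $\alpha$, $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ must be estimated.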
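The slide's formula for the generalized log did not survive, so the sketch below uses the form in which this transformation is commonly written, log((X - alpha) + sqrt((X - alpha)^2 + c)). The choice c = sigma_eps^2 / Var(exp(eta)) is an assumption that follows from a delta-method argument on the two-component model. All parameter values are hypothetical, and the data are simulated from the model rather than estimated from arrays.

```python
import math
import random


def glog(x, alpha, c):
    """Generalized log: log((x - alpha) + sqrt((x - alpha)^2 + c)), c > 0."""
    z = x - alpha
    return math.log(z + math.sqrt(z * z + c))


def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def simulate(mu, alpha, sd_eta, sd_eps, n, rng):
    """Draw n values from X = alpha + mu * exp(eta) + eps."""
    return [
        alpha + mu * math.exp(rng.gauss(0.0, sd_eta)) + rng.gauss(0.0, sd_eps)
        for _ in range(n)
    ]


# Hypothetical parameter values, chosen only for illustration.
alpha, sd_eta, sd_eps = 100.0, 0.2, 20.0
# Var(exp(eta)) for eta ~ N(0, sd_eta^2).
s_eta2 = math.exp(sd_eta ** 2) * (math.exp(sd_eta ** 2) - 1.0)
c = sd_eps ** 2 / s_eta2  # assumed choice of the glog constant

rng = random.Random(1)
low = simulate(0.0, alpha, sd_eta, sd_eps, 4000, rng)        # unexpressed gene
high = simulate(10000.0, alpha, sd_eta, sd_eps, 4000, rng)   # highly expressed gene

raw_ratio = variance(high) / variance(low)
glog_ratio = (
    variance([glog(x, alpha, c) for x in high])
    / variance([glog(x, alpha, c) for x in low])
)
print("variance ratio, raw scale:", raw_ratio)
print("variance ratio, glog scale:", glog_ratio)
```

On the raw scale the variance at high expression is several orders of magnitude larger than at background level. On the glog scale the two variances are of similar size, which is the sense in which the transformation stabilizes variance.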
5Power Transformations
- Power transformation $X \to (X - a)^b$: $a$ and $b$ must be estimated.
- Three criteria:
-  Equal variances: CV(gene variances)
-  Low skewness: mean(skewness)
-  No mean-variance correlation: correlation 
 between mean and variance
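The three criteria can be computed for any candidate pair (a, b) and compared across candidates. The following is a minimal sketch, not the authors' estimation procedure. It assumes the power transformation has the form (X - a)^b, negated when b is negative, as the panel titles "(X-3.60)^-0.4" and "-(X-0.66)^-0.04" in the examples suggest. The data are simulated, and the values of a and b are hypothetical.

```python
import math
import random


def power_transform(x, a, b):
    """(x - a)^b for x > a, negated when b < 0 so that order is preserved."""
    y = (x - a) ** b
    return y if b > 0 else -y


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)


def gene_summary(row):
    """Mean, variance and skewness of one gene across arrays."""
    n = len(row)
    mean = sum(row) / n
    m2 = sum((x - mean) ** 2 for x in row) / n
    m3 = sum((x - mean) ** 3 for x in row) / n
    return mean, m2, m3 / m2 ** 1.5


def criteria(matrix):
    """Return (CV of gene variances, mean skewness, mean-variance correlation)."""
    means, variances, skews = zip(*(gene_summary(row) for row in matrix))
    g = len(variances)
    var_mean = sum(variances) / g
    var_sd = math.sqrt(sum((v - var_mean) ** 2 for v in variances) / g)
    return var_sd / var_mean, sum(skews) / g, pearson(means, variances)


rng = random.Random(0)
# Simulated genes x arrays matrix with multiplicative noise (not real data).
levels = [rng.lognormvariate(6.0, 1.0) for _ in range(300)]
raw = [[mu * math.exp(rng.gauss(0.0, 0.2)) for _ in range(10)] for mu in levels]

a, b = 0.0, 0.1  # hypothetical candidate values
transformed = [[power_transform(x, a, b) for x in row] for row in raw]

cv_raw, skew_raw, corr_raw = criteria(raw)
cv_pow, skew_pow, corr_pow = criteria(transformed)
print("raw:        ", cv_raw, skew_raw, corr_raw)
print("transformed:", cv_pow, skew_pow, corr_pow)
```

In practice one would evaluate `criteria` over a grid of (a, b) values and prefer the pair that gives a small CV of gene variances, a mean skewness near zero and a mean-variance correlation near zero.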
6Example 1 Tissue Data
Tissue data: 3 treatments (A, B, C) applied to mouse tissue.
Arrays: Treatment A 11, Treatment B 11, Treatment C 19.
Genes: 3487 genes.
Gene expression matrix X, Dim(X) = 100 x 41 (first 10 rows and 13 columns shown):

   treatA.1 treatA.2 treatA.3 treatA.4 treatA.5 treatA.6 treatA.7
1     3.706    3.900    3.877    3.769    3.654    3.805    3.661
2     3.762    4.034    4.402    3.912    3.889    3.988    4.280
3     4.140    4.114    4.182    4.200    4.117    4.029    4.200
4     3.555    3.555    3.555    3.555    3.555    3.555    3.555
5     4.228    4.152    3.828    4.216    3.889    3.923    3.912
6     6.622    6.749    6.625    6.883    6.865    6.335    6.241
7     7.322    7.437    7.523    7.267    7.586    7.562    7.238
8     3.555    3.555    3.555    3.555    3.555    3.555    3.555
9     4.756    4.605    4.935    4.295    4.510    4.571    4.396
10    4.468    4.306    4.483    4.396    4.432    4.008    4.475

   treatA.8 treatA.9 treatA.10 treatA.11 treatB.12 treatB.13
1     3.878    4.213     3.989     3.877     3.797     3.743
2     3.901    4.385     3.835     4.051     4.583     4.973
3     4.137    4.344     4.122     3.989     4.273     4.368
4     3.621    4.181     3.555     3.555     3.555     3.571
5     4.102    4.273     3.858     4.031     4.144     3.976
6     6.201    5.895     6.548     6.577     6.298     6.546
7     7.294    6.812     7.557     7.370     7.497     6.834
8     3.591    4.165     3.555     3.555     3.555     3.571
9     4.804    4.639     5.239     4.402     4.502     4.248
10    4.357    4.344     4.208     4.147     4.227     4.436
...
7 Figure panels: Raw Data Equal 75pctl; Power Trans $(X - 3.60)^{-0.4}$; Quantile Normalized; Log Transformed
8Gene selection for classification
- Left panel: PC2 vs PC1 plot, log transformation
- Right panel: PC2 vs PC1 plot, power transformation
9Example 2 Khan et al. (2001)
4 types of small round blue cell tumors (SRBC):
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Training set: 63 (23 EWS, 20 RMS, 12 NB, 8 BL)
Testing set: 25 (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other)
Genes: of 6567 initial genes, 2308 were selected because they showed a minimal level of expression.
Subset A
Cells: 23 EWS and 20 RMS from the training set.
Genes: 100 most significant genes after performing a t-test.
Gene expression matrix X, Dim(X) = 100 x 43 (first 3 rows and 19 columns shown):

  EWS.T1 EWS.T2 EWS.T3 EWS.T4 EWS.T6 EWS.T7 EWS.T9
1  3.203  1.655  3.278  1.006  2.710  2.059  1.848
2  0.068  0.071  0.116  0.191  0.237  0.082  0.123
3  1.046  1.041  0.893  0.430  0.369  0.902  0.998

  EWS.T11 EWS.T12 EWS.T13 EWS.T14 EWS.T15 EWS.T19
1   2.714   2.356   1.929   3.616   2.151   2.312
2   0.180   0.079   0.252   0.106   0.097   0.160
3   0.496   0.761   0.574   0.583   0.499   0.579

  EWS.C8 EWS.C3 EWS.C2 EWS.C4 EWS.C6 EWS.C9
1  1.069  0.919  0.925  2.626  1.079  1.099
2  0.197  0.192  0.089  0.092  0.178  0.166
3  1.681  0.786  1.511  1.869  2.346  2.019
...
10 Figure panels: Raw Data Equal 75pctl; Power Trans $-(X - 0.66)^{-0.04}$; Quantile Normalized; Log Transformed
 11Judging the success of a normalization
- $Y_{g1}$ and $Y_{g2}$: values for gene $g$ ($g = 1, \ldots, G$) on arrays 1 and 2.
- Successful workflow => arrays are monotonically 
 related to each other.
- Pearson's correlation coefficient measures 
 linearity rather than agreement.
-  The concordance correlation coefficient measures agreement.
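The difference between linearity and agreement is easy to see on a small example. The sketch below assumes the usual form of the concordance correlation coefficient, 2*s12 / (s1^2 + s2^2 + (mean1 - mean2)^2), since the slide's formula was not preserved. The two "arrays" are made-up numbers in which the second is the first shifted by a constant.

```python
import math


def pearson(y1, y2):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    return s12 / math.sqrt(s11 * s22)


def concordance(y1, y2):
    """Concordance correlation: penalizes differences in mean and in variance."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    return 2.0 * s12 / (s11 + s22 + (m1 - m2) ** 2)


# Made-up values: array 2 is array 1 plus a constant offset of 2.
array1 = [1.0, 2.0, 3.0, 4.0, 5.0]
array2 = [3.0, 4.0, 5.0, 6.0, 7.0]

r = pearson(array1, array2)        # perfectly linear relationship
ccc = concordance(array1, array2)  # but the arrays do not agree
print(r, ccc)
```

Here the Pearson correlation is 1 because the points lie on a straight line, while the concordance correlation is 0.5 because that line is not the 45-degree line through the origin.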
12Judging the success of a normalization
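- $Y_{g1}$ and $Y_{g2}$: values for gene $g$ ($g = 1, \ldots, G$) on arrays 1 and 2.
- Successful workflow => arrays are monotonically 
 related to each other.
- Spearman's rank correlation coefficient
- $R_{gi}$ is the rank of $Y_{gi}$ when the $Y_{gi}$ are ranked 
 from 1 to $G$ within array $i$.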
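A rank-based coefficient responds only to the ordering of the genes, so any monotone relationship between two arrays gives the maximum value. The sketch below uses the classical no-ties formula, 1 - 6 * sum(d^2) / (G * (G^2 - 1)), where d is the difference in ranks. The formula on the slide was not preserved, so this form is an assumption, and the values are made up for illustration.

```python
import math


def ranks(values):
    """Ranks from 1 to G; assumes there are no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(y1, y2):
    """Spearman rank correlation via squared rank differences (no ties)."""
    g = len(y1)
    r1, r2 = ranks(y1), ranks(y2)
    d2 = sum((p - q) ** 2 for p, q in zip(r1, r2))
    return 1.0 - 6.0 * d2 / (g * (g * g - 1))


# Made-up values for five genes on three arrays.
array1 = [1.2, 3.4, 2.2, 5.0, 4.1]
array2 = [math.exp(v) for v in array1]   # monotone but nonlinear in array1
array3 = [2.0, 3.0, 1.0, 5.0, 4.0]       # two genes swap order relative to array1

rho_monotone = spearman(array1, array2)
rho_swapped = spearman(array1, array3)
print(rho_monotone, rho_swapped)
```

The exponential relationship leaves the ranks unchanged, so the coefficient is 1. Swapping the order of two genes lowers it to 0.9.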
13Concordance Map
Image Plot of Concordance Correlations

      X44   X45   X46   X47   X48   X49   X50
X44 1.000 0.703 0.622 0.706 0.674 0.746 0.694
X45 0.703 1.000 0.702 0.679 0.784 0.710 0.788
X46 0.622 0.702 1.000 0.791 0.683 0.562 0.776
X47 0.706 0.679 0.791 1.000 0.691 0.607 0.760
X48 0.674 0.784 0.683 0.691 1.000 0.770 0.832
X49 0.746 0.710 0.562 0.607 0.770 1.000 0.727
X50 0.694 0.788 0.776 0.760 0.832 0.727 1.000
 14Concordance Map
Image Plot of Concordance Correlations

      X44   X45   X46   X47   X48   X49   X50
X44 1.000 0.756 0.622 0.700 0.695 0.813 0.698
X45 0.756 1.000 0.813 0.722 0.793 0.710 0.803
X46 0.622 0.813 1.000 0.789 0.753 0.655 0.826
X47 0.700 0.722 0.789 1.000 0.714 0.663 0.763
X48 0.695 0.793 0.753 0.714 1.000 0.779 0.834
X49 0.813 0.710 0.655 0.663 0.779 1.000 0.742
X50 0.698 0.803 0.826 0.763 0.834 0.742 1.000
 15Linear correlation
Figure panels: Standard Normal; t distribution, df = 6; t distribution, df = 2
 16correlation
-  If the distributional properties of the values 
 change substantially during a normalization
 (e.g., the skewness is decreased), it is possible
 that the concordance correlation coefficients
 might increase, but this may only be an
 artificial improvement.
- For microarrays that have been normalized by 
 equating all the quantiles, the concordance
 correlation coefficient will be equal to
 Pearson's correlation coefficient. This is
 because, after such a normalization, the
 quantiles of both samples are identical and,
 therefore, both means are equal and both
 variances are equal too.
- Spearman's rank correlation coefficient is equal 
 to (a) Pearson's correlation coefficient
 calculated on the ranks of the data and (b) the
 concordance correlation coefficient calculated on
 the ranks of the data.
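Both statements can be verified numerically. The sketch below quantile-normalizes two simulated arrays by replacing the k-th smallest value on each with the average of the two k-th smallest values, which is one simple way of equating all the quantiles, and then compares the coefficients. The data, function names and simulation settings are assumptions made for illustration, and ties are assumed absent.

```python
import math
import random


def moments(u, v):
    """Means, variances and covariance (all with divisor n)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((p - mu) ** 2 for p in u) / n
    svv = sum((q - mv) ** 2 for q in v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v)) / n
    return mu, mv, suu, svv, suv


def pearson(u, v):
    _, _, suu, svv, suv = moments(u, v)
    return suv / math.sqrt(suu * svv)


def concordance(u, v):
    mu, mv, suu, svv, suv = moments(u, v)
    return 2.0 * suv / (suu + svv + (mu - mv) ** 2)


def ranks(values):
    """Ranks from 1 to G; assumes there are no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(u, v):
    """Classical no-ties formula based on squared rank differences."""
    g = len(u)
    d2 = sum((p - q) ** 2 for p, q in zip(ranks(u), ranks(v)))
    return 1.0 - 6.0 * d2 / (g * (g * g - 1))


def quantile_normalize(u, v):
    """Give both arrays the same set of values: the averaged sorted values."""
    target = [(p + q) / 2.0 for p, q in zip(sorted(u), sorted(v))]
    return [target[r - 1] for r in ranks(u)], [target[r - 1] for r in ranks(v)]


rng = random.Random(2)
# Simulated arrays: array 2 is shifted, rescaled and noisier than array 1.
y1 = [rng.gauss(8.0, 1.0) for _ in range(500)]
y2 = [0.5 + 1.5 * a + rng.gauss(0.0, 0.5) for a in y1]

q1, q2 = quantile_normalize(y1, y2)
r1, r2 = ranks(y1), ranks(y2)

print("before:", pearson(y1, y2), concordance(y1, y2))
print("after: ", pearson(q1, q2), concordance(q1, q2))
print("ranks: ", spearman(y1, y2), pearson(r1, r2), concordance(r1, r2))
```

Before normalization the concordance correlation is lower than Pearson's because the arrays differ in mean and variance. After normalization the two coincide, and on the ranks all three coefficients coincide, up to floating-point rounding.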