Title: Data Analysis of Microarrays Bioconductor.org
1Data Analysis of Microarrays Bioconductor.org
- affy package
- limma package
2Open R and Load libraries
3Affymetrix Data files
- CEL files contain information about the
expression levels of the individual probes. - CDF file contains information about which probes
belong to which probe set. The probe set
information in the CEL file by itself is not
particularly useful as there is no indication in
the file as to which probe set a probe belongs.
This information is stored in the CDF library
file associated with a chip type. All the arrays
belonging to a given type will share this same
information.
4Analysis
- Sample Generation, experimental design,
hybridization ) RAW DATA. - Data analysis.
- (a) Scanning and image analysis ) PROBE
INTENSITIES. low level analysis - (b) Background correction, normalization and
quantifying expression measures ) NORMALIZED
EXPRESSION MEASURES FOR EACH GENE (PROBE SET).
low level analysis - (c) Ranking of genes, clustering/classification
of genes ) INTERESTING GENE SET. high level
analysis - Validation of the interesting genes using
additional biological data and performing
targeted experiments.
5Estrogen dataset
- "The investigators in this experiment were
interested in the effect of estrogen on the genes
in ER breast cancer cells over time. - After serum starvation of all eight samples, they
exposed four samples to estrogen, and then
measured mRNA transcript abundance after 10 hours
for two samples and 48 hours for the other two. - They left the remaining four samples untreated,
and measured mRNA transcript abundance at 10
hours for two samples, and 48 hours for the other
two. Since there are two factors in this
experiment (estrogen and time), each at two
levels (present or absent,10 hours or 48 hours),
this experiment is said to have a 2x2 factorial
design."
6Experimental Design
7Put .cel files in a directory
- Estrogen.zip file should be extracted into a file
named estrogen. - Working directory should be set to this file
using the change dir command under the file menu.
8Information on chips/samples
- A text file that looks like the following should
be prepared using a text editor and saved as
filename.txt (e.g., estrogen.txt)
9R read the sample information
- You have to load the file into a phenoData object
- gt pd lt- read.phenoData("estrogen.txt", header
TRUE, row.names 1) - gt pData(pd)
10phenoData objects
- phenoData objects are where the Bioconductor
Package stores information about samples - treatment conditions in a cell line experiment or
clinical or histopathological characteristics of
tissue biopsies. - The header option lets the read.phenoData
function know that the first line in the file
contains column headings, and the row.names
option indicates that the first column of the
file contains the row names.
11phenoData objects
gt pd lt- read.phenoData("estrogen.txt", header
TRUE, row.names 1) gt pData(pd) estrogen
time.h low10-1.cel absent 10 low10-2.cel absent
10 high10-1.cel present 10 high10-2.cel present
10 low48-1.cel absent 48 low48-2.cel absent
48 high48-1.cel present 48 high48-2.cel present 48
12Read .cel files and name the object as a
- gt a lt- ReadAffy(filenames rownames(pData(pd)),
phenoData pd, - verbose TRUE)
Copy and paste the above statements onto the
command line in R
13Read .cel files and name the object as a
- gt a lt- ReadAffy(filenames rownames(pData(pd)),
phenoData pd, - verbose TRUE)
- 1 reading low10-1.cel ...instanciating an
AffyBatch (intensity a 409600x8 matrix)...done. - Reading in low10-1.cel
- Reading in low10-2.cel
- Reading in high10-1.cel
- Reading in high10-2.cel
- Reading in low48-1.cel
- Reading in low48-2.cel
- Reading in high48-1.cel
- Reading in high48-2.cel
14Type a
gt a AffyBatch object size of arrays640x640
features (25606 kb) cdfHG_U95Av2 (12625
affyids) number of samples8 number of
genes12625 annotationhgu95av2
15Visualize chips
16Visualize log2 transformed raw values using hist
(histogram of distribution)
gt hist(log2(intensity(a, 4)), breaks 100, col
"blue")
17Boxplots
18Normalize data using rma
- gt eset rma(a)
- gtboxplot(data.frame(exprs(eset)), col "blue")
19MA Plots
- Measurement of relative expression versus Average
log intensity plot. Also known as the RI (Ratio
versus Intensity). - MA plots can show the intensity-dependent ratio
of raw microarray data. Let R represent raw
intensity for one chip and G represent raw
intensity for another chip. Define - M log2(R/G)
- A log2(RG)1/2.
20MA plots after normalization
- gtmva.pairs(exprs(eset),c(1,3,5,7),log.itFALSE)
21MA plots before normalization
- gt mva.pairs(pm(a),c(1,3,5))
22Memory Problems
- If you experience memory errors, please restart
your R session before trying a less
memory-expensive alternative. - The following function justRMA() should read in
all 8 arrays, normalize them, and create an
object of class exprSet with as little as 128
Megabytes of RAM - gt library(affy)
- gt pd lt- read.phenoData("EstrogenTargets.txt",heade
rTRUE,row.names1,as.isTRUE) - gt eset lt- justRMA(filenamespData(pd)FileName,phe
noDatapd) - gt eset
- Or to read in only the first four arrays, you can
use - gteset lt- justRMA(filenamespData(pd)FileName14
,phenoDatapd) - gteset
23Glossary
Term Definition
AffyID An identifier for a probe-pair set.
Affymetrix The largest company which manufactures (single-channel) high-density oligonucleotide microarray chips.
CDF file Chip Description File. Describes which probes go in which probe sets and the location of probe-pair sets (genes, gene fragments, ESTs). See http//www.bioconductor.org/data/metaData.html
CEL file Cell intensity file, including probe level PM and MM values.
Cell line Cells grown in tissue culture, representing generations of a primary culture.
DAT file Image file, approximately 107 pixels, approximately 50 MB.
24Glossary
Genotype The type of RNA in a biological sample as described by the DNA sequence, e.g. using Single Nucleotide Polymorphisms (SNPs) or microsatellite markers.
MAS 5.0 The main microarray analysis software from the Affymetrix company MicroArraySuite-MAS, now version 5.
Mismatch(MM) The same as PM but with a single homomericbase change for the middle (13th) base (transversionpurinelt-gt pyrimidine, G lt-gtC, A lt-gtT). The purpose of the MM probe design is to measure non-specific binding and background.
Perfect match(PM) A 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
25Glossary
Phenotype The type of RNA in a biological sample as described by physical characteristics of the biological sample, e.g. time or estrogen presence/absence or observed susceptibility to disease.
Probe An oligonucleotideof 25 base-pairs, i.e., a 25-mer.
Probe-pair A (PM,MM) pair.
Probe-pair set A collection of probe-pairs (16 to 20) related to a common gene or fraction of a gene.
RMA Robust Multichip Averaging (Irizarry et al. 6).
Target A type of RNA under a particular condition (e.g. Estrogen present) which is hybridized to a microarray chip.
26Gene Filtering (Genefilter package)
- gt library(genefilter)
- Loading required package survival
- Loading required package splines
- gt f1 lt- pOverA(0.25, log2(100))
- gt f2 lt- function(x) (IQR(x) gt 0.5)
- gt ff lt- filterfun(f1, f2)
- gt selected lt- genefilter(eset, ff)
- gt sum(selected)
- 1 1664
- gt esetSub lt- esetselected,
27See some data
- gt esetSub13,
- Expression Set (exprSet) with
- 3 genes
- 8 samples
- phenoData object with 2
variables and 8 cases - varLabels
- estrogen read from file
- time.h read from file
- gt exprs(esetSub13, )
- low10-1.cel low10-2.cel high10-1.cel
high10-2.cel low48-1.cel - 1005_at 9.206734 8.993805 8.237886
8.338003 9.173192 - 1008_f_at 10.119335 10.986661 10.830301
10.025106 11.044671 - 1009_at 10.553034 10.500375 11.159770
11.048265 10.211211 - low48-2.cel high48-1.cel high48-2.cel
- 1005_at 9.040490 7.926101 8.06968
- 1008_f_at 11.138321 10.705763 11.36952
- 1009_at 9.565864 11.363234 10.76293
28Clustering Algorithms in R
- Two main categories
- Agglomerative (e.g., hierarchical)
- Partitioning (e.g., kmeans)
29Clustering Algorithms in R
- First, dist is used to compute distances. This
function takes a matrix as its first argument and
computes the distances between the rows of the
matrix. - We will use the expression data matrix from
estrogen for this purpose - The commands below are used to carry out
hierarchical clustering using the Manhattan
distance metric and to plot the corresponding
dendrogram with nodes labeled according to file
names.
30Cluster data (hierarchical)
- gt dgTr lt- dist(t(exprs(esetSub)), method
"manhattan") - gt hcgTr lt- hclust(dgTr, method "average")
31Cluster data (hierarchical)
- gt kmgTr lt- kmeans(t(exprs(esetSub)), centers 2)
- gt kmgTrcluster
- low10-1.cel low10-2.cel high10-1.cel
high10-2.cel low48-1.cel low48-2.cel - 1 1 2
2 1 1 - high48-1.cel high48-2.cel
- 2 2
32T-test on a subset of estrogen data
- gtsubESTexprs(eset,14)
- gt group1c(1,1,2,2)
- gt group1
- 1 1 1 2 2
- gt subEST
- Expression Set (exprSet) with
- 12625 genes
- 4 samples
- phenoData object with 2
variables and 4 cases - varLabels
- estrogen read from file
- time.h read from file
- gt tf1 lt- ttest(group1, 0.1)
- gt ff2 lt- filterfun(tf1)
- gt wh2 lt- genefilter(exprs(subEST), ff2)
- gt sum(wh2)
- 1 1711
- gt
33ApoAI data
- The data is from a study of lipid metabolism by
Callow et al (2000). The apolipoprotein AI
(ApoAI) gene is known to play a pivotal role in
high density lipoprotein (HDL) metabolism. Mice
which have the ApoAI gene knocked out have very
low HDL cholesterol levels. The purpose of this
experiment is to determine how ApoAI deficiency
affects the action of other genes in the liver,
with the idea that this will help determine the
molecular pathways through which ApoAI operates.
34Hybridizations
- The experiment compared 8 ApoAI knockout mice
with 8 wild type (normal) C57BL/6 ("black six")
mice, the control mice. For each of these 16
mice, target mRNA was obtained from liver tissue
and labelled using a Cy5 dye. The RNA from each
mouse was hybridized to a separate microarray.
Common reference RNA was labelled with Cy3 dye
and used for all the arrays. The reference RNA
was obtained by pooling RNA extracted from the 8
control mice.
35Experimental Design
36Load data
- gt library(limma)
- gt load("ApoAI.RData")
- gt objects()
- 1 "last.warning" "mart" "RG"
- gt names(RG)
- 1 "R" "G" "Rb" "Gb"
"printer" "genes" "targets"
37What did you download?
gt RGtargets FileName Cy3 Cy5 c1
a1koc1.spot Pool C57BL/6 c2 a1koc2.spot Pool
C57BL/6 c3 a1koc3.spot Pool C57BL/6 c4
a1koc4.spot Pool C57BL/6 c5 a1koc5.spot Pool
C57BL/6 c6 a1koc6.spot Pool C57BL/6 c7
a1koc7.spot Pool C57BL/6 c8 a1koc8.spot Pool
C57BL/6 k1 a1kok1.spot Pool ApoAI-/- k2
a1kok2.spot Pool ApoAI-/- k3 a1kok3.spot Pool
ApoAI-/- k4 a1kok4.spot Pool ApoAI-/- k5
a1kok5.spot Pool ApoAI-/- k6 a1kok6.spot Pool
ApoAI-/- k7 a1kok7.spot Pool ApoAI-/- k8
a1kok8.spot Pool ApoAI-/-
38What does RG contain
- Type in RG at the command line
- gtRG
- R
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5
a1koc6 a1koc7 a1koc8 a1kok1 - 1, 4184.08 3300.22 2432.54 2069.81 2809.83
2029.05 1720.56 1795.58 1845.97 - 2, 4148.48 3774.18 2579.92 2566.33 2286.43
2244.33 1987.83 1908.33 2924.68 - 3, 2452.32 3028.47 3574.45 2348.09 4462.67
2703.20 1543.10 2123.91 1702.91 - 4, 1577.31 1999.92 1296.81 1926.69 1010.74
1764.58 1161.71 1495.13 1778.82 - 5, 1525.48 1967.44 1210.10 1929.64 950.80
1638.89 936.42 1540.65 1554.96 - a1kok2 a1kok3 a1kok4 a1kok5 a1kok6
a1kok7 a1kok8 - 1, 2330.53 2630.76 2234.17 2811.72 2721.10
2361.67 2585.89 - 2, 3027.00 2411.81 2023.39 4148.58 3384.13
2763.00 2775.44 - 3, 2235.84 2343.61 4196.89 2079.53 4344.00
3051.24 3277.82 - 4, 1313.93 1551.66 1359.58 1305.88 1512.14
1617.78 1584.25 - 5, 1288.93 1378.91 1216.87 1218.35 1205.67
1412.08 1666.32 - 6379 more rows ...
39What does RG contain
- G
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5
a1koc6 a1koc7 a1koc8 a1kok1 - 1, 6256.08 5269.89 4080.85 3229.38 3763.22
2773.58 2800.44 1923.58 804.80 - 2, 5389.81 3608.12 2614.42 2107.56 1878.30
1926.00 1711.17 1205.20 1385.96 - 3, 2653.29 4168.16 8091.95 3134.39 3046.00
3612.40 2002.62 1651.41 915.24 - 4, 1071.26 1162.71 819.13 1295.82 428.95
900.78 838.19 689.00 763.64 - 5, 1321.75 1148.64 642.94 1289.74 500.49
684.00 571.92 736.41 685.88 - a1kok2 a1kok3 a1kok4 a1kok5 a1kok6
a1kok7 a1kok8 - 1, 2558.68 1724.41 2309.08 2758.44 2632.43
1955.33 3191.78 - 2, 1943.40 1100.10 1233.33 2337.75 1630.91
1422.45 1961.50 - 3, 2404.77 3011.42 3708.85 2830.40 3030.90
2809.82 2482.39 - 4, 583.08 873.06 737.57 834.58 887.97
827.38 890.58 - 5, 536.43 656.76 593.17 679.73 756.44
572.64 1233.45 - 6379 more rows ...
40What does RG contain
- Rb
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6
a1koc7 a1koc8 a1kok1 a1kok2 - 1, 1418.50 1532.00 992.00 1306.75 781.89 1165
761.88 1151.00 1098.86 941.74 - 2, 1280.05 1497.00 980.00 1328.00 773.00 1165
759.17 1151.00 994.43 934.00 - 3, 1216.00 1481.63 935.00 1348.61 773.00 1198
758.00 1129.05 949.39 935.84 - 4, 1193.69 1467.42 973.26 1341.55 760.00 1198
752.53 1077.34 949.00 911.09 - 5, 1148.12 1442.00 929.43 1376.21 760.00 1177
710.79 1075.00 920.16 898.00 - a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8
- 1, 1042.00 954.00 930.00 987.57 1190.83 1073.44
- 2, 1042.00 952.22 930.00 933.09 1158.00 1074.62
- 3, 1042.00 904.63 930.30 919.70 1150.18 1077.00
- 4, 1037.75 899.89 914.79 911.14 1179.75 1077.00
- 5, 1023.15 898.00 892.76 882.02 1069.02 1064.04
- 6379 more rows ...
41What does RG contain
- Gb
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6
a1koc7 a1koc8 a1kok1 a1kok2 - 1, 663.50 520.00 371.54 563.25 228.00 274.05
354.50 292.00 224.00 239.95 - 2, 643.43 520.00 355.00 589.44 231.00 268.00
324.61 286.93 206.86 239.00 - 3, 544.81 498.63 317.55 619.58 231.00 267.00
281.66 264.00 181.00 239.00 - 4, 522.80 454.55 356.46 615.76 218.37 267.00
289.30 264.00 181.00 248.18 - 5, 465.27 433.00 295.67 567.81 203.00 267.00
238.42 257.82 174.57 229.10 - a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8
- 1, 282.53 254.00 321.28 363.00 285.00 309.00
- 2, 264.24 254.00 309.00 336.00 285.00 309.00
- 3, 250.06 254.00 305.00 336.00 285.00 309.86
- 4, 233.66 244.56 297.75 335.46 286.78 351.78
- 5, 231.00 231.05 298.74 334.22 274.44 358.92
- 6379 more rows ...
42What does RG contain
- genes
- GridROW GridCOL ROW COL NAME TYPE CLID
ACC - 1 1 1 1 1 Cy3RT Control BLANK
BLANK - 2 1 1 1 2 Cy5RT Control BLANK
BLANK - 3 1 1 1 3 mSRB1 cDNA mSRB1
mSRB1 - 4 1 1 1 4 BLANK BLANK BLANK
BLANK - 5 1 1 1 5 BLANK BLANK BLANK
BLANK - 6379 more rows ...
- targets
- FileName Cy3 Cy5
- c1 a1koc1.spot Pool C57BL/6
- c2 a1koc2.spot Pool C57BL/6
- c3 a1koc3.spot Pool C57BL/6
- c4 a1koc4.spot Pool C57BL/6
- c5 a1koc5.spot Pool C57BL/6
- 11 more rows ...
43Exercise
- gt dim(RG)
- 1 6384 16
- gt ncol(RG)
- 1 16
- gt colnames(RG)
- 1 "a1koc1" "a1koc2" "a1koc3" "a1koc4""a1koc5"
"a1koc6" "a1koc7" "a1koc8" - 9 "a1kok1" "a1kok2" "a1kok3" "a1kok4""a1kok5"
"a1kok6" "a1kok7" "a1kok8"
44subsets
- gt RG12,
- An object of class "RGList"
- R
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5
a1koc6 a1koc7 a1koc8 a1kok1 - 1, 4184.08 3300.22 2432.54 2069.81 2809.83
2029.05 1720.56 1795.58 1845.97 - 2, 4148.48 3774.18 2579.92 2566.33 2286.43
2244.33 1987.83 1908.33 2924.68 - a1kok2 a1kok3 a1kok4 a1kok5 a1kok6
a1kok7 a1kok8 - 1, 2330.53 2630.76 2234.17 2811.72 2721.10
2361.67 2585.89 - 2, 3027.00 2411.81 2023.39 4148.58 3384.13
2763.00 2775.44 - G
- a1koc1 a1koc2 a1koc3 a1koc4 a1koc5
a1koc6 a1koc7 a1koc8 a1kok1 - 1, 6256.08 5269.89 4080.85 3229.38 3763.22
2773.58 2800.44 1923.58 804.80 - 2, 5389.81 3608.12 2614.42 2107.56 1878.30
1926.00 1711.17 1205.20 1385.96 - a1kok2 a1kok3 a1kok4 a1kok5 a1kok6
a1kok7 a1kok8 - 1, 2558.68 1724.41 2309.08 2758.44 2632.43
1955.33 3191.78 - 2, 1943.40 1100.10 1233.33 2337.75 1630.91
1422.45 1961.50
45Background correction
- RG.b lt- backgroundCorrect(RG,method"minimum")
46Loess normalization
47Quantile Normalization
- gt MA.pAq lt- normalizeBetweenArrays(MA.p,
method"Aquantile") - gt plotDensities(MA.pAq)
48Print-tip loess normalization
- gt MA lt- normalizeWithinArrays(RG)
49Generate factors
- gt design lt- cbind("WT-Ref"1,"KO-WT"rep(01,c(8,8
))) - gt design
- WT-Ref KO-WT
- 1, 1 0
- 2, 1 0
- 3, 1 0
- 4, 1 0
- 5, 1 0
- 6, 1 0
- 7, 1 0
- 8, 1 0
- 9, 1 1
- 10, 1 1
- 11, 1 1
- 12, 1 1
- 13, 1 1
- 14, 1 1
- 15, 1 1
- 16, 1 1
50Fit a linear model
- gt fit lt- lmFit(MA,designdesign)
- gt fit
- An object of class "MArrayLM"
- coefficients
- WT-Ref KO-WT
- 1, -0.65953808 0.63931974
- 2, 0.22938847 0.65516034
- 3, -0.25176365 0.33421048
- 4, -0.05169672 0.04049065
- 5, -0.25006850 0.22302129
- 6379 more rows ...
51Ebayes fit
52Significance
- p.value
- WT-Ref KO-WT
- 1, 0.0005273336 0.009821223
- 2, 0.0840905898 0.001674820
- 3, 0.2514756923 0.280677844
- 4, 0.5306727045 0.727393168
- 5, 0.0140994442 0.103570036
- 6379 more rows ...
- lods
- WT-Ref KO-WT
- 1, -0.1466897 -2.581616
- 2, -4.8450256 -1.065213
- 3, -5.7239418 -5.260559
- 4, -6.2106950 -5.777607
- 5, -3.2510578 -4.528469
- 6379 more rows ...
- F
53Top table
- gt topTable(fit,coef"KO-WT",adjust"fdr")
- GridROW GridCOL ROW COL NAME
TYPE CLID ACC - 2149 2 2 8 7 ApoAI,lipid-Img
cDNA 1077520 - 540 1 2 7 15 EST,HighlysimilartoA
cDNA 439353 - 5356 4 2 9 1 CATECHOLO-METHYLTRAN
cDNA 1350232 - 4139 3 3 8 2 EST,WeaklysimilartoC
cDNA 374370 - 1739 2 1 7 17 ApoCIII,lipid-Img
cDNA 483614 - 2537 2 3 7 17 ESTs,Highlysimilarto
cDNA 483614 - 1496 1 4 15 5 est
cDNA 484183 genome.wustl - 4941 4 1 8 6 similartoyeaststerol
cDNA 737183 - 947 1 3 8 2 EST,WeaklysimilartoF
cDNA 353292 - 5604 4 3 1 18
cDNA 317638 - M A t P.Value
B - 2149 -3.1661645 12.46803 -23.980007 3.045649e-11
14.9266311 - 540 -3.0485504 12.28064 -12.962085 5.020967e-07
10.8133421 - 5356 -1.8481659 12.92660 -12.440765 6.505619e-07
10.4479375 - 4139 -1.0269537 12.60572 -11.756162 1.209061e-06
9.9285280 - 1739 -0.9325824 13.73744 -9.835355 1.563760e-05
8.1924099 - 2537 -1.0098117 13.63012 -9.015243 4.222433e-05
7.3048994
54MA plots
55Venn diagram
- gt results lt- decideTests(fit)
- gt vennDiagram(results)
56Switch function
center function(x, type) switch(type, mean
mean(x), median median(x), trimmed mean(x,
trim .1))
57Switch function
The switch function uses the value of its first
argument to determine which of its remaining
arguments to evaluate and return. The first
argument can be either an integer index, or a
character string to be used in matching one of
the following arguments.
58Apply function
- For example,
- gt apply(x, 2, mean)
- computes the column means of a matrix x, while
- gt apply(x, 1, median)
- computes the row medians.