Title: Introduction to R and Bioconductor
1. Introduction to R and Bioconductor
- W. Huber
- EMBL-EBI
- Cambridge UK
2. Why?
- Biology is becoming a computational science
- Data analysis and mathematical modeling require computational solutions
- We put a premium on code reuse
  - many of the tasks have already been solved
  - if we use those solutions we can put effort into new research
- Data complexity is dealt with using well-designed, self-describing data structures
- Reproducible research requires open access to computational code
3. The S language
- The S language has been developed since the late 1970s by John Chambers and his colleagues at Bell Labs.
- The language has been through a number of major changes but has been relatively stable since the mid 1990s.
- The language combines ideas from a variety of sources (e.g. Awk, Lisp, APL...) and provides an environment for quantitative computations and visualization.
4. Implementations
- S-Plus is a commercialization of the Bell Labs code.
- R is an independent open source version that was originally developed at the University of Auckland but which is now developed by a worldwide group of developers.
- Each version has advantages and problems.
5. References
- The New S Language, Statistical Models in S, Programming with Data, by John Chambers and various co-authors
- Modern Applied Statistics with S, S Programming, by W. N. Venables and B. D. Ripley
- Introductory Statistics with R, by P. Dalgaard
- Data Analysis and Graphics Using R, by J. Maindonald and J. Braun
6. Packages
- Packages are the main unit of software authoring, versioning and distribution
- CRAN is the major repository for R packages. It is hosted by TU Vienna and ETH Zürich, and has many mirrors world-wide
- Bioconductor is a repository for biology related packages. It is hosted at the Fred Hutchinson Cancer Research Centre.
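A minimal sketch of how packages look from inside an R session. The metadata calls run offline on any installation. The install commands are left as comments because they need network access; `BiocManager` is the present-day Bioconductor installer and is an assumption here, since these slides predate it, and the package names `xtable` and `Biobase` are only examples.

```r
# Every installed package carries its own metadata (name, version, ...).
desc <- packageDescription("stats")   # 'stats' ships with every R installation
desc$Package
desc$Version

# Versions are comparable objects, which is what makes versioning practical.
packageVersion("stats") >= "2.0.0"

# Attach a package to make its functions available in the session.
library(stats)

# Installing from the repositories (needs network access, hence commented out):
# install.packages("xtable")        # from a CRAN mirror
# install.packages("BiocManager")   # Bioconductor's installer, itself on CRAN
# BiocManager::install("Biobase")   # from the Bioconductor repository
```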
7. Bioconductor
- an open source and open development software project for the analysis of biomedical and genomic data
- was started in the autumn of 2001 and includes core developers in the US, Europe, and Australia
- R and the R package system are used to design and distribute software
8. Goals of the Bioconductor project
- Provide access to powerful statistical and graphical methods for the analysis of genomic data.
- Facilitate the integration of biological metadata (e.g. Entrez, Ensembl, GO(A), PubMed) in the analysis of experimental data.
- Allow the rapid development of extensible, interoperable, and scalable software.
- Promote high-quality documentation and reproducible research.
- Provide training in computational and statistical methods.
9. Why are we Open Source?
- so that you can find out what algorithm is being used, and how it is being used
- so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs
- so that they can be used by others as components (potentially modified)
10. Introduction
- R is a powerful language dedicated to data analysis
- Puts a premium on statistical analysis
- Able to process, visualize, import, export almost any kind of data
- From your raw data to publishable material
- Open-source and free
- Programmatic (i.e. text) interface; GUIs for some functionalities are available but are much less powerful
- Easy human-readable syntax (cf. Perl)
- The goal is to make users become programmers
- Great for flexibility and extensibility
- Can be used as a middleware to integrate multiple specialized tools (e.g. XML, graph theoretic libraries, image analysis, relational database access)
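The path "from raw data to publishable material" can be sketched in a few lines of base R. This is only an illustration: the data frame `expr`, its values, and the file names are made up, and temporary files are used so that nothing is written into the working directory.

```r
# A small data frame stands in for "raw data" (all values are made up).
expr <- data.frame(
  sample = paste0("S", 1:6),
  group  = rep(c("control", "treated"), each = 3),
  value  = c(5.1, 4.8, 5.3, 7.9, 8.4, 8.1)
)

# Process: per-group means.
group_means <- tapply(expr$value, expr$group, mean)
group_means

# Statistical analysis: two-sample t-test comparing the groups.
tt <- t.test(value ~ group, data = expr)
tt$p.value

# Export and re-import through a CSV file in a temporary location.
csv_file <- tempfile(fileext = ".csv")
write.csv(expr, csv_file, row.names = FALSE)
expr2 <- read.csv(csv_file)

# Visualize: write a boxplot to a PDF file.
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file)
boxplot(value ~ group, data = expr, main = "Made-up expression values")
dev.off()
```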
11. Starting R
- R is free software, available on Windows, MacOS and Unixes
- Command-line application
- Just a window
- Waiting for an input command with a > prompt
- The result of a command can be a line of text in the window
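A few commands of the kind one might type at the prompt, as a small illustration. The prompt character itself is omitted so that the lines can be pasted directly; the text R prints in response is shown as a comment after each command.

```r
# Arithmetic: the result is printed as a line of text.
1 + 2                      # [1] 3

# Assignment prints nothing; the value is stored under the name x.
x <- c(1, 2, 3)

# Calling a function on the stored value.
mean(x)                    # [1] 2

# Results can be text as well as numbers.
toupper("bioconductor")    # [1] "BIOCONDUCTOR"
```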
12. Tinn-R, Emacs / ESS
13. Reproducible Research and Compendia
"There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proof of correctness."
F. Leisch, T. Rossini, Chance 16 (2003)
"We re-analyzed the breast cancer data from van't Veer et al. (2002). ... Even with some help of the authors, we were unable to exactly reproduce this analysis."
R. Tibshirani, B. Efron, SAGMB (2002)
14. Re-analysis of a breast cancer outcome study
- E. Huang et al., Gene expression predictors of breast cancer outcome, The Lancet 361 (9369): 1590-6 (2003)
- 89 primary breast tumors on Affymetrix chips (HG-U95av2), among them 52 with 1-3 positive lymph nodes; of these, 18 led to recurrence within 3 years, 34 did not.
- Goal: predict recurrence
- Claim: 5 misclassification errors, 1 unclear (leave-one-out cross-validation)
- Method: Bayesian binary prediction trees (at the time, unpublished)
- http://www.cagp.duke.edu
15. We tried to reproduce these results, starting from the published microarray raw data (CEL files)
But couldn't.
The paper (and supplements) didn't contain the necessary details to re-implement their algorithm.
The authors didn't provide comparisons to simple, well-known methods.
In our hands, all other methods resulted in worse misclassification results.
Is their new Bayesian tree method miles better than everything else? Or was their analysis over-optimistic (over-fitting, selection bias)?
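How selection bias can make a cross-validated error rate over-optimistic can be illustrated on pure noise. The sketch below is not the method of Huang et al. and not the re-analysis described above: the simulated data, the mean-difference gene filter, and the nearest-centroid classifier are all stand-ins chosen for brevity, and every function name is hypothetical. Because the labels are unrelated to the data, an honest error estimate should be around 50%.

```r
set.seed(1)

n <- 40      # samples
p <- 2000    # genes, all pure noise
k <- 20      # genes kept by the filter

x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rep(c(0, 1), each = n / 2)   # labels are unrelated to x by construction

# Rank genes by absolute difference in class means; return the top k columns.
top_genes <- function(x, y, k) {
  d <- abs(colMeans(x[y == 1, , drop = FALSE]) -
           colMeans(x[y == 0, , drop = FALSE]))
  order(d, decreasing = TRUE)[seq_len(k)]
}

# Nearest-centroid classifier: predict the class whose mean profile is closer.
predict_centroid <- function(x_train, y_train, x_new) {
  c0 <- colMeans(x_train[y_train == 0, , drop = FALSE])
  c1 <- colMeans(x_train[y_train == 1, , drop = FALSE])
  as.numeric(sum((x_new - c1)^2) < sum((x_new - c0)^2))
}

# Leave-one-out cross-validation; 'select_inside' says where filtering happens.
loocv_error <- function(x, y, k, select_inside) {
  genes_all <- top_genes(x, y, k)   # uses every sample, including the test one
  wrong <- 0
  for (i in seq_len(nrow(x))) {
    genes <- if (select_inside) {
      top_genes(x[-i, , drop = FALSE], y[-i], k)   # filter on training data only
    } else {
      genes_all
    }
    pred <- predict_centroid(x[-i, genes, drop = FALSE], y[-i], x[i, genes])
    wrong <- wrong + (pred != y[i])
  }
  wrong / nrow(x)
}

biased_err <- loocv_error(x, y, k, select_inside = FALSE)
honest_err <- loocv_error(x, y, k, select_inside = TRUE)
c(biased = biased_err, honest = honest_err)
```

Selecting genes once on all samples lets information about each left-out sample leak into the classifier, so the first estimate typically looks far better than chance even though there is nothing to learn. Repeating the selection inside every cross-validation fold removes the leak.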
16. A general pattern
New publications often present a new (microarray) data set and a new (classification) method.
Merits of the method and merits of the data are entangled.
Is it necessary to develop an idiosyncratic method?
Which result could be achieved with other approaches?
Is there a big difference, and what are the reasons for it?
17. Compendia
- Interactive documents that contain
  - Primary data
  - Processing methods (computer code)
  - Derived data, figures, tables and other output
  - Text: research report (result, materials and methods, conclusions)
- Based on R/Bioconductor's package and vignette technologies
- Published examples
  - M. Ruschhaupt et al., SAGMB 2004 (cancer classification with microarrays)
  - T. Chiang et al., Genome Biology 2007 (large scale protein interaction datasets)
18. Sweave: from source markup (here: LaTeX + R) to processed document (here: PDF)
<<MCRestimate call,eval=FALSE,echo=TRUE>>=
r.forest <- MCRestimate(eset, class.label,
                        class.function="RF.wrap",
                        select.fun=red.fct,
                        cross.outer=10, cross.inner=5, cross.repeat=20)
@

<<rf.save,echo=FALSE,results=hide>>=
savepdf(plot(r.forest, main="Random Forest"), "image-RF.pdf")
@

<<result>>=
r.forest
@

The final document includes results of the calculation, graphical outputs, tables, and optionally parts of the R-Code which has been used. Also the description of the experiment, the interpretation of the results, and the conclusion can be integrated. In this example we applied our compendium to T. Golub's ALL/AML data \cite{Golub.1999}.

\begin{figure}[h]
\begin{center}
\includegraphics[width=0.4\textwidth]{image-RF}
\end{center}
\end{figure}

\smallskip

<<summary,echo=FALSE>>=
method.list <- list(r.forest, r.pam, r.logReg, r.svm)
name.list <- c("RF", "PAM", "PLR", "SVM")
conf.table <- MCRconfusion(method.list, col.names=name.list)
@

<<writinglatex1,echo=FALSE,results=tex>>=
xtable(conf.table, "Overall number of misclassifications",
       label="conf.table", display=rep("d",6))
@

\input{samples.1}
\input{conf.table}

\begin{thebibliography}{1}
\bibitem[Golub et al., 1999]{Golub.1999}
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.
\newblock Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
\newblock \textit{Science} 286(5439): 531-7 (1999).
\end{thebibliography}
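The listing above depends on data and functions that are not part of this deck, so here is a smaller self-contained sketch of the same mechanism. It writes a tiny Sweave source file, weaves it into LaTeX, and tangles out the R code. The file name `mini` and the chunk label `total` are made up; `Sweave()` and `Stangle()` are the functions from R's `utils` package.

```r
# Write a tiny Sweave source file (LaTeX text with embedded R chunks).
rnw <- file.path(tempdir(), "mini.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "A code chunk whose code and output both appear in the document:",
  "<<total,echo=TRUE>>=",
  "sum(1:3)",
  "@",
  "An inline result: the mean of 1 to 9 is \\Sexpr{mean(1:9)}.",
  "\\end{document}"
), rnw)

# Weave: run the chunks and write a LaTeX file containing code and results.
tex <- file.path(tempdir(), "mini.tex")
Sweave(rnw, output = tex, quiet = TRUE)

# Tangle: extract only the R code from the same source.
r_code <- file.path(tempdir(), "mini.R")
Stangle(rnw, output = r_code, quiet = TRUE)

# Inspect the woven LaTeX. Compiling it to PDF needs a LaTeX installation,
# e.g. via tools::texi2pdf(tex), which is deliberately not run here.
woven <- readLines(tex)
woven
```

Because the numbers in the woven document are computed at weave time, changing the data or the code and re-running `Sweave()` regenerates every result in the text.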
19. Structure of a compendium
Package directory
- General info: author, version, ...
- Source markup
- Additional software code: function definitions
- Software documentation: manual pages
- Data directory: data files
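Since a compendium is organised as an R package, its directory layout can be sketched with `package.skeleton()` from the `utils` package. The package name `demoCompendium` and the objects `normalize` and `raw_data` are hypothetical placeholders, and everything is created under a temporary directory.

```r
# Objects that would go into the compendium (both are made-up placeholders).
env <- new.env()
env$normalize <- function(x) (x - mean(x)) / sd(x)               # a function definition
env$raw_data  <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.0))  # a data set

# package.skeleton() lays out a package directory from these objects.
root <- tempfile("compendium")
dir.create(root)
package.skeleton(name = "demoCompendium",
                 list = c("normalize", "raw_data"),
                 environment = env, path = root)

# List what was generated: general info, code, manual pages, data files.
pkg <- file.path(root, "demoCompendium")
list.files(pkg, recursive = TRUE)

# The source markup (vignette) is not generated automatically; by convention
# it lives in a 'vignettes' subdirectory (older packages used 'inst/doc').
dir.create(file.path(pkg, "vignettes"))
```

The generated files are stubs: the DESCRIPTION file and the manual pages still have to be filled in by hand before the package can be built.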
20. Compendia
See also the work by Donald Knuth, H. P. Wolf, Günther Sawitzki, Friedrich Leisch, Robert Gentleman, Duncan Temple Lang
21. R_intro_easy
22. EBImage demo.R