Title: Bioinformatics: The Analysis of Microarray Data
Bioinformatics: The Analysis of Microarray Data
- Robert Gentleman
- Department of Biostatistics
- Harvard University
- DFCI
Bioconductor
- a new project aimed at providing software resources for analysing and manipulating biological data
- the project has multiple goals, including:
- provide high-quality software to researchers
- provide structure and examples that will enable rapid development of new methodology
- explore new methods in both statistics and computing, and make these available as rapidly as possible
Bioconductor
- the web site will be located at www.bioconductor.org
- it is not active yet, but it should be within the next couple of weeks
- the initial offerings will be several libraries of functions providing:
- infrastructure support in the form of data structures
- annotation support in the form of a synthesis of different databases into a form that is useful for the analyses we want to carry out
Bioconductor
- Bioconductor is a collaborative project; we have participants from:
- DFCI, the Harvard School of Public Health, and FAS
- UC Berkeley, Department of Biostatistics
- University of Heidelberg
- ETH, Zurich
- Technical University, Vienna
Bioconductor
- our current membership is mainly statisticians with a strong computing background
- we would like to have a team of statisticians, computer scientists, and biologists identifying both problems and potential solutions
Bioinformatics
Computer Science
Biology
Statistics
Bioinformatics
- there are many challenges
- large and complex data
- complex models
- computational requirements can be enormous
- data are often a mix of numeric and non-numeric (we can deal with the former better than the latter)
- perhaps the largest challenge is to develop tools that easily and accurately reveal the biology
Bioinformatics
- the tools used to analyse the data are themselves complex, and they need to be!
- we are asking and answering very complex questions
- the user interface should be simple and intuitive
- we need a flexible system for the development of
new tools
Bioinformatics Tools
- some methods that have been successfully employed in similar situations:
- object-oriented programming
- visualization
- statistical modeling
- parallel algorithms
Bioinformatics Tools
- I contend that the basis for constructing
bioinformatics tools should be a proper
programming language
Bioinformatics Tools
- the ideal development environment should have these properties:
- high-quality graphics (preferably interactive)
- seamless access to databases
- good numerics (preferably with many math/stat algorithms as primitives)
- a system for producing packages
- an intuitive user interface
- it doesn't exist!
Bioinformatics Tools
- our approach is to start with a language that has most, but not all, of these properties
- we will then work on extending that language to provide the missing pieces
- the language: R, a language for statistical computing and graphics
- www.r-project.org
Bioinformatics
- in the remainder of this talk I will consider some basic tasks that we are interested in carrying out and show how our use of R has simplified the implementation
- note that a number of other teams working on the development of tools for bioinformatics have also adopted R; however, they typically have a less aggressive strategy
Specifics
- we now turn our attention to the analysis of DNA microarray data (only as a specific example)
- most of the important points here are transferable to the analysis of other types of data
Experimental Design
- random errors: exact replication using the same reagents, same samples, same technicians, etc. will still yield variation
- systematic variation:
- between technicians
- between batches/reagents
- we don't want systematic components to align with experimental conditions (confounding)
Types of assays
- The main types of gene expression assays
- Serial analysis of gene expression (SAGE)
- Short oligonucleotide arrays (Affymetrix)
- Long oligonucleotide arrays (Agilent)
- Fibre optic arrays (Illumina)
- cDNA arrays (Brown/Botstein).
Microarray Data
- data are typically obtained from three distinct sources:
- the experimental data, which provide expression levels for a selected set of genes (or ESTs)
- the sample-level covariates, including experimental conditions
- the biological metadata (GenBank, LocusLink, KEGG, and so on)
Applications of microarrays
- Measuring transcript abundance (cDNA arrays)
- Genotyping
- Estimating DNA copy number (CGH)
- Determining identity by descent (GMS)
- Measuring mRNA decay rates
- Identifying protein binding sites
- Determining sub-cellular localization of gene
products
Some Questions
- Which genes have expression levels that are correlated with some external variable?
- For a given pathway, which of the genes in our collection are most likely to be involved?
- For a diffuse disease, which genes are associated with different outcomes?
Answering the questions
- we need to obtain and then analyse the expression data:
- preprocessing of the images
- normalization of the images
- modeling to extract expression-level data
- gene filtering
- clustering
- relating to biological data
Steps in image analysis
1. Addressing. Estimate the location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye, extract:
- signal intensities
- background intensities
- quality measures
Addressing
Automatic addressing within the same batch of images.
- Estimate translation of grids.
- Estimate row and column positions.
Other problems:
- mis-registration
- rotation
- skew in the array
(Figures: 4 by 4 grids; foreground and background grids.)
Segmentation
Adaptive segmentation (SRG, seeded region growing)
Fixed circle segmentation
Spots usually vary in size and shape.
Image Analysis
- we need a mechanism for storing and accessing the raw data (note: databases are not really the answer)
- we need tools to allow us to go back from the expression data to the set of spots that were used to compute that expression data
Image processing
- Almost all steps in this process seem to lend themselves to a Bayesian analysis.
- Many processing techniques use only the single slide of interest, but there are usually other slides with similar structure that could be used.
- In most cases there is structure that exists across slides, technicians, and machines.
An Image Storage Solution
- we have developed an HDF5 library
- http://hdf.ncsa.uiuc.edu/
- HDF5 is a widely used storage format for image data
- our package allows users to access image data in R as if it were an ordinary array, but the image data remains in a disk file
- we have used R's ability to link with other software to quickly and effectively implement this
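The package itself links R to the HDF5 C library; as a hedged, self-contained sketch of the underlying idea (element-by-element access to on-disk array data, without loading the whole array into memory), here is a base-R analogue using a flat binary file and seek()/readBin(). The function names are illustrative, not part of the actual package.

```r
# Illustrative only: a flat-file stand-in for disk-backed array access.
write_disk_matrix <- function(x, path) {
  con <- file(path, "wb")
  writeBin(as.vector(x), con)                # column-major, 8-byte doubles
  close(con)
}

read_disk_element <- function(path, i, j, nrow) {
  con <- file(path, "rb")
  on.exit(close(con))
  offset <- ((j - 1) * nrow + (i - 1)) * 8   # byte offset, column-major
  seek(con, where = offset)
  readBin(con, what = "double", n = 1)       # read just the one element
}

m <- matrix(as.double(1:12), nrow = 3)
f <- tempfile()
write_disk_matrix(m, f)
read_disk_element(f, 2, 4, nrow = 3)   # returns m[2, 4] without loading m
```

HDF5 adds chunking, compression, and metadata on top of this idea, but the user-visible model is the same: index into the data, and only what is needed comes off the disk.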
Post-Processing
- once the spot intensities have been obtained, further processing is required to obtain expression-level data (expression is often in terms of mRNA levels)
- for Affymetrix arrays the spot intensities are for short oligos and must be processed to obtain gene-level data
- in all cases some form of normalization is required (basically an intensity alignment)
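As a hedged illustration of such an intensity alignment, here is a base-R sketch of quantile normalization, one common method (not necessarily the one used in this project): every array (column) is forced to share the same empirical intensity distribution.

```r
# Quantile normalization: map each array's values onto the average
# empirical distribution across arrays.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  means  <- rowMeans(sorted)                 # reference distribution
  apply(ranks, 2, function(r) means[r])      # place each value at its rank
}

set.seed(1)
x  <- cbind(rexp(100, 1), 5 * rexp(100, 1))  # two arrays on different scales
xn <- quantile_normalize(x)
all.equal(sort(xn[, 1]), sort(xn[, 2]))      # distributions now identical
```

The gene-by-gene ordering within each array is preserved; only the marginal intensity scale is aligned.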
Expression Level Data
- now, for each array, we have obtained expression-level data
- the next step is to select those genes that have interesting expression levels
- "interesting" is interpreted in many different ways:
- high levels of expression in a subgroup of interest
- lack of expression in a subgroup of interest
- a pattern of expression that correlates well with experimental conditions
Data Structures
- one of our goals is to introduce some standard data structures
- in an object-oriented setting these are called classes
- a particular data set is referred to as an instance of the class
- they allow us to model complex data in a natural way
Data Structures
- if the class is well defined and describes the physical data well, then using it is natural
- methods or functions can be written to perform calculations on instances of the class
- this makes it easier both to write the functions and to share methods across developers
Data Structures
- one of the difficult tasks that confronts a data analyst in this field is to ensure that the data are correctly aligned
- the expression data for each sample must be related to the correct phenotypic data, and the gene annotation must be aligned correctly with the genes on the microarray
- the class structure can help ensure proper alignment
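As a sketch of how a class can enforce this alignment, here is a small S4 class; the name ExprSet and its slots are hypothetical illustrations, not the project's actual class definitions.

```r
library(methods)

# Hypothetical class tying expression data to sample and gene metadata;
# the validity check refuses misaligned data at construction time.
setClass("ExprSet", representation(
  exprs      = "matrix",       # genes x samples
  pheno      = "data.frame",   # one row per sample
  annotation = "data.frame"    # one row per gene
), validity = function(object) {
  if (ncol(object@exprs) != nrow(object@pheno))
    return("exprs and pheno disagree on the number of samples")
  if (nrow(object@exprs) != nrow(object@annotation))
    return("exprs and annotation disagree on the number of genes")
  TRUE
})

e  <- matrix(rnorm(6), nrow = 3,
             dimnames = list(paste0("g", 1:3), paste0("s", 1:2)))
es <- new("ExprSet",
          exprs      = e,
          pheno      = data.frame(group = c("A", "B")),
          annotation = data.frame(symbol = c("x", "y", "z")))
```

Constructing an ExprSet with, say, a one-row pheno for a two-sample exprs raises an error instead of silently misaligning the data, which is exactly the guarantee a well-designed class provides.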
Data Structures
- these problems will become more severe as we obtain more data
- for example, relating data from several different experiments with different sources (i.e. both cDNA arrays and short oligo arrays)
- starting to look at pathways or binding/promoter sites
The Evolution of Gene Selection
- first, fold change: the ratio of expression levels between two groups
- then t-tests: now statistical variation comes into play
- other statistical models: ANOVA, the Cox model, etc.
- for large enough samples we can tailor the test to the distribution (which might be different in the two groups)
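The first two steps in this evolution can be sketched in a few lines of base R on simulated data (the array sizes and the planted signal are made up for illustration):

```r
set.seed(2)
exprs <- matrix(rnorm(200, mean = 8), nrow = 20)      # 20 genes x 10 arrays
group <- rep(c("A", "B"), each = 5)
exprs[1, group == "B"] <- exprs[1, group == "B"] + 6  # plant one real difference

# fold change on the log scale: difference of group means, per gene
fold <- rowMeans(exprs[, group == "A"]) - rowMeans(exprs[, group == "B"])

# the t-test refinement: a p-value per gene, accounting for variation
pvals <- apply(exprs, 1, function(g)
  t.test(g[group == "A"], g[group == "B"])$p.value)

order(pvals)[1:3]   # top-ranked genes; gene 1 should score very well
```

Fold change ignores within-group variability entirely; the t-test is the first step toward treating gene selection as a statistical problem.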
Filters and Orders
- what sorts of questions can we answer?
- does a gene have a pattern of expression that correlates well with experimental conditions?
- does a gene express at a reasonably high level in a reasonably large portion of my samples?
- can we find genes that have a pattern of expression that is similar to that of some known gene(s)?
Filters
- A filter is a mechanism for removing a gene from further consideration
- we want to reduce the number of genes under consideration so that we can concentrate on those that are more interesting (it is a waste of resources to study genes that are not likely to be of interest)
Orders or Rankings
- while there is a strong desire to have methods that determine which genes are important, such methods do not exist and cannot exist
- we can rank genes according to some measure of interestingness: a test statistic, a p-value, etc.
- such rankings can be used to select genes (those with high ranks) for further study
Non-specific filters
- at least k (or a proportion p) of the samples must have expression values larger than some specified amount, A
- the gene should show sufficient variation to be interesting:
- either a gap of size A in the central portion of the data
- or an interquartile range of at least B
- genes that fail to pass can be eliminated early
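These filters translate directly into base R; the gene-filtering package mentioned later provides similar building blocks, but this standalone sketch shows the idea:

```r
# Keep a gene only if at least k samples exceed intensity A and the
# gene's interquartile range is at least B.
nonspecific_filter <- function(exprs, k, A, B) {
  above  <- rowSums(exprs > A) >= k      # expressed in enough samples
  spread <- apply(exprs, 1, IQR) >= B    # shows enough variation
  above & spread
}

set.seed(3)
exprs <- matrix(rexp(500, rate = 1 / 100), nrow = 50)  # 50 genes x 10 arrays
keep  <- nonspecific_filter(exprs, k = 5, A = 100, B = 50)
sum(keep)   # how many genes survive both filters
```

Because the result is a logical vector, filters compose naturally: combine any number of criteria with & before subsetting the expression matrix.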
Specific Filters
- any filter based on a statistical comparison
- t-test, ANOVA, Cox Model and so on
- these are all readily available in R and hence in
our gene filtering package
Multiple Testing
- to assess the significance of the p-values one must make some adjustments for multiple testing (Dudoit et al.)
- this is especially important if the expression data are coming from multiple sources
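Base R's p.adjust implements several such adjustments; a small example with Bonferroni and the Benjamini-Hochberg false discovery rate:

```r
pvals <- c(0.001, 0.01, 0.02, 0.4, 0.9)
p.adjust(pvals, method = "bonferroni")  # 0.005 0.050 0.100 1.000 1.000
p.adjust(pvals, method = "BH")          # less conservative FDR control
```

With thousands of genes tested at once, the Bonferroni correction is often too severe; FDR-style adjustments are the usual compromise.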
An Example
- two cell lines
- observed at four time points
- we want to select genes that have interesting patterns
- non-linear mixed-effects models seem appropriate
- R provides the filter without any extra coding
Annotation
- obtained from multiple sources
- we align it, using a combination of R, XML, and Postgres, into a coherent collection
- we can produce uniform annotation for any set of genes:
- UniGene, LocusLink
- chromosome, cytoband
- GO (the Gene Ontology)
Annotation
- in some cases knowing the chromosome or cytoband is useful
- hyperdiploid samples, hypodiploid samples
- examine gene amplification/deletion patterns
- if multiple genes are involved we may be able to detect this
- given information of this kind we want access to visualization tools
- where are the top-expressing genes located?
Annotation
- pathways: given pathway information, can we determine which genes express in a manner similar to known genes in the pathway?
- we need to link, via browser technology, with other sources; it is particularly useful if the information can be parsed and analysed by machine (e.g. XML)
- again, R has support for exactly this type of processing
Clustering
- Supervised:
- some clusters are determined a priori (the training set), and we assign new samples to those clusters (classification)
- Unsupervised:
- data are grouped, using some method, to provide (potentially) new groupings or classifications
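Both flavours are available in base R. As an unsupervised sketch, here is hierarchical clustering of samples from simulated expression profiles with two planted groups:

```r
set.seed(4)
# 100 genes x 6 samples; samples 1-3 and 4-6 come from different groups
exprs <- cbind(matrix(rnorm(300, mean = 0), ncol = 3),
               matrix(rnorm(300, mean = 2), ncol = 3))
d  <- dist(t(exprs))                 # distances between samples (columns)
hc <- hclust(d, method = "average")  # agglomerative hierarchical clustering
cutree(hc, k = 2)                    # cluster labels recover the two groups
```

For the supervised case, the same distance machinery supports simple classifiers (e.g. nearest-centroid assignment of a new sample to the closest training group).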
Some References
- Classification, A. D. Gordon, Chapman and Hall.
- Finding Groups in Data, L. Kaufman and P. J. Rousseeuw, Wiley.
- Pattern Recognition and Neural Networks, B. D. Ripley, Cambridge University Press.
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer.
Clustering
- Clustering is often used to answer the following questions.
- Can I find a set of genes that helps me to correctly classify the samples into specific groups (often disease categories)?
- If I can do this, which genes were used and how important are they?
Inference
- Permutation tests
- Cross-validation
- How many clusters are there?
- Gordon Ch 3 addresses some of these issues
- Are my clusters significant?
- it is more important to ask whether they provide
a meaningful representation or reduction of the
data
Variable Selection
- how do we select genes to help in clustering?
- this process has been studied in other areas of statistics; however, few of the solutions seem appropriate
Model Selection
- gene selection can be characterized as a model
selection procedure - most statistical research in this area has been
concentrated on the situation where the number of
variables is much less than the number of
observations - we need to develop model selection procedures for
the situation where there are many more variables
than observations
Selection Problems
- we may need to adjust for other variables
- this makes the bootstrap and cross-validation algorithms more complex
- computing is already hard, and now it is getting much more difficult; data structures play an essential role in simplifying the computations
Selection Problems
- use bootstrap- or cross-validation-type experiments to do variable selection
- for example, select genes that show up as important predictors in many different subsets
- to assess variable importance (when there are several variables), Breiman has proposed reclassifying using a new data set that has the covariate values permuted
Cross-validation
- leave-one-out CV is easy to implement, but it is probably not the best method to use for assessing or selecting a model
- the problem is that leaving out one observation yields too small a change to the data
- in general we need to consider perturbations on the order of the square root of the number of observations
Cross-validation
- make sure that you are cross-validating the experiment that you have carried out
- in particular, if you are selecting genes, rather than working with known genes, you must cross-validate the gene selection process as well
- most examples I have seen with low classification error rates do not cross-validate properly (model/gene selection was not validated)
Classification
- Breiman's random forest ideas
- notions of boosting: using several weak classifiers to obtain a good classifier
- weighted averages of predictions; the weights might be different in different parts of the covariate (gene) space
Acknowledgements
- Vincent Carey, Channing Laboratory
- Jianhua Zhang
- Sabina Chiaretti
- The DFCI for funding
- Dirk Iglehart
- Cheng Li
- Byron Ellis
- Sandrine Dudoit