Title: Bioinformatics: The Analysis of Microarray Data
Bioinformatics: The Analysis of Microarray Data
- Robert Gentleman
- Department of Biostatistics
- Harvard University
- DFCI
Bioconductor
- a new project aimed at providing software resources for analysing and manipulating biological data
- the project has multiple goals, including:
- provide high-quality software to researchers
- provide structure and examples that will enable rapid development of new methodology
- explore new methods in both statistics and computing, and make these available as rapidly as possible
Bioconductor
- the web site will be located at www.bioconductor.org
- it is not active yet, but it should be within the next couple of weeks
- the initial offerings will be several libraries of functions providing:
- infrastructure support in the form of data structures
- annotation support in the form of a synthesis of different databases into a form that is useful for the analyses we want to carry out
Bioconductor
- Bioconductor is a collaborative project; we have participants from:
- DFCI, the Harvard School of Public Health, and FAS
- UC Berkeley, Department of Biostatistics
- University of Heidelberg
- ETH, Zurich
- Technical University, Vienna
Bioconductor
- our current membership is mainly statisticians with a strong computing background
- we would like to have a team of statisticians, computer scientists, and biologists identifying both problems and potential solutions
Bioinformatics
Computer Science
Biology
Statistics
Bioinformatics
- there are many challenges
- large and complex data
- complex models
- computational requirements can be enormous
- data are often a mix of numeric and non-numeric (we can deal with the former better than the latter)
- perhaps the largest challenge is to develop tools that easily and accurately reveal the biology
Bioinformatics
- the tools used to analyse the data are themselves complex, and they need to be!
- we are asking and answering very complex questions
- the user interface should be simple and intuitive
- we need a flexible system for the development of
new tools
Bioinformatics Tools
- some methods that have been successfully employed in similar situations:
- object-oriented programming
- visualization
- statistical modeling
- parallel algorithms
Bioinformatics Tools
- I contend that the basis for constructing
bioinformatics tools should be a proper
programming language
Bioinformatics Tools
- the ideal development environment should have these properties:
- high-quality graphics (preferably interactive)
- seamless access to databases
- good numerics (preferably with many math/stat algorithms as primitives)
- a system for producing packages
- an intuitive user interface
- it doesn't exist!
Bioinformatics Tools
- our approach is to start with a language that has most, but not all, of these properties
- we will then work on extending that language to provide the missing pieces
- the language: R, a language for statistical computing and graphics
- www.r-project.org
Bioinformatics
- in the remainder of this talk I will consider some basic tasks that we are interested in carrying out and show how our use of R has simplified the implementation
- note that a number of other teams working on the development of tools for bioinformatics have also adopted R; however, they typically have a less aggressive strategy
Specifics
- we now turn our attention to the analysis of DNA microarray data (only as a specific example)
- most of the important points here are transferable to the analysis of other types of data
Experimental Design
- random errors: exact replication using the same reagents, same samples, same technicians, etc. will still yield variation
- systematic variation:
- between technicians
- between batches/reagents
- we don't want systematic components to align with experimental conditions (confounding)
Types of assays
- The main types of gene expression assays
- Serial analysis of gene expression (SAGE)
- Short oligonucleotide arrays (Affymetrix)
- Long oligonucleotide arrays (Agilent)
- Fibre optic arrays (Illumina)
- cDNA arrays (Brown/Botstein).
Microarray Data
- data are typically obtained from three distinct sources:
- the experimental data, which provide expression levels for a selected set of genes (or ESTs)
- the sample-level covariates, including experimental conditions
- the biological metadata (GenBank, LocusLink, KEGG, and so on)
Applications of microarrays
- Measuring transcript abundance (cDNA arrays)
- Genotyping
- Estimating DNA copy number (CGH)
- Determining identity by descent (GMS)
- Measuring mRNA decay rates
- Identifying protein binding sites
- Determining sub-cellular localization of gene
products
Some Questions
- Which genes have expression levels that are correlated with some external variable?
- For a given pathway, which of the genes in our collection are most likely to be involved?
- For a diffuse disease, which genes are associated with different outcomes?
Answering the questions
- we need to obtain and then analyse the expression data:
- preprocessing of the images
- normalization of the images
- modeling to extract expression-level data
- gene filtering
- clustering
- relating to biological data
Steps in image analysis
1. Addressing. Estimate the location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye, extract:
- signal intensities
- background intensities
- quality measures
Addressing
Automatic addressing within the same batch of images.
- Estimate translation of grids.
- Estimate row and column positions.
Other problems:
- mis-registration
- rotation
- skew in the array
(Figures: 4 by 4 grids; foreground and background grids.)
Segmentation
Adaptive segmentation (SRG, seeded region growing)
Fixed circle segmentation
Spots usually vary in size and shape.
Image Analysis
- we need a mechanism for storing and accessing the raw data (note: databases are not really the answer)
- we need tools to allow us to go back from the expression data to the set of spots that were used to compute that expression data
Image processing
- Almost all steps in this process seem to lend themselves to a Bayesian analysis.
- Many processing techniques use only the single slide of interest, but there are usually other slides with similar structure that could be used.
- In most cases there is structure that exists across slides, technicians, and machines.
An Image Storage Solution
- we have developed an HDF5 library
- http://hdf.ncsa.uiuc.edu/
- HDF5 is a widely used storage format for image data
- our package allows users to access image data in R as if it were an ordinary array, but the image data remains in a disk file
- we have used R's ability to link with other software to quickly and effectively implement this
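The package itself links R to the HDF5 C library; as a hedged, self-contained sketch of the underlying idea (element-by-element access to on-disk array data, without loading the whole array into memory), here is a base-R analogue using a flat binary file and seek()/readBin(). The function names are illustrative, not part of the actual package.

```r
# Illustrative only: a flat-file stand-in for disk-backed array access.
write_disk_matrix <- function(x, path) {
  con <- file(path, "wb")
  writeBin(as.vector(x), con)                # column-major, 8-byte doubles
  close(con)
}

read_disk_element <- function(path, i, j, nrow) {
  con <- file(path, "rb")
  on.exit(close(con))
  offset <- ((j - 1) * nrow + (i - 1)) * 8   # byte offset, column-major
  seek(con, where = offset)
  readBin(con, what = "double", n = 1)       # read just the one element
}

m <- matrix(as.double(1:12), nrow = 3)
f <- tempfile()
write_disk_matrix(m, f)
read_disk_element(f, 2, 4, nrow = 3)   # returns m[2, 4] without loading m
```

HDF5 adds chunking, compression, and metadata on top of this idea, but the user-visible model is the same: index into the data, and only what is needed comes off the disk.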
Post-Processing
- once the spot intensities have been obtained, further processing is required to obtain expression-level data (expression is often in terms of mRNA levels)
- for Affymetrix arrays the spot intensities are for short oligos and must be processed to obtain gene-level data
- in all cases some form of normalization is required (basically an intensity alignment)
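As a hedged illustration of such an intensity alignment, here is a base-R sketch of quantile normalization, one common method (not necessarily the one used in this project): every array (column) is forced to share the same empirical intensity distribution.

```r
# Quantile normalization: map each array's values onto the average
# empirical distribution across arrays.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  means  <- rowMeans(sorted)                 # reference distribution
  apply(ranks, 2, function(r) means[r])      # place each value at its rank
}

set.seed(1)
x  <- cbind(rexp(100, 1), 5 * rexp(100, 1))  # two arrays on different scales
xn <- quantile_normalize(x)
all.equal(sort(xn[, 1]), sort(xn[, 2]))      # distributions now identical
```

The gene-by-gene ordering within each array is preserved; only the marginal intensity scale is aligned.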
Expression Level Data
- now, for each array, we have obtained expression-level data
- the next step is to select those genes that have interesting expression levels
- "interesting" is interpreted in many different ways:
- high levels of expression in a subgroup of interest
- lack of expression in a subgroup of interest
- a pattern of expression that correlates well with experimental conditions
Data Structures
- one of our goals is to introduce some standard data structures
- in an object-oriented setting these are called classes
- a particular data set is referred to as an instance of the class
- they allow us to model complex data in a natural way
Data Structures
- if the class is well defined and describes the physical data well, then using it is natural
- methods or functions can be written to perform calculations on instances of the class
- this makes it easier both to write the functions and to share methods across developers
Data Structures
- one of the difficult tasks that confronts a data analyst in this field is to ensure that the data are correctly aligned
- the expression data for each sample must be related to the correct phenotypic data, and the gene annotation must be aligned correctly with the genes on the microarray
- the class structure can help ensure proper alignment
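As a sketch of how a class can enforce this alignment, here is a small S4 class; the name ExprSet and its slots are hypothetical illustrations, not the project's actual class definitions.

```r
library(methods)

# Hypothetical class tying expression data to sample and gene metadata;
# the validity check refuses misaligned data at construction time.
setClass("ExprSet", representation(
  exprs      = "matrix",       # genes x samples
  pheno      = "data.frame",   # one row per sample
  annotation = "data.frame"    # one row per gene
), validity = function(object) {
  if (ncol(object@exprs) != nrow(object@pheno))
    return("exprs and pheno disagree on the number of samples")
  if (nrow(object@exprs) != nrow(object@annotation))
    return("exprs and annotation disagree on the number of genes")
  TRUE
})

e  <- matrix(rnorm(6), nrow = 3,
             dimnames = list(paste0("g", 1:3), paste0("s", 1:2)))
es <- new("ExprSet",
          exprs      = e,
          pheno      = data.frame(group = c("A", "B")),
          annotation = data.frame(symbol = c("x", "y", "z")))
```

Constructing an ExprSet with, say, a one-row pheno for a two-sample exprs raises an error instead of silently misaligning the data, which is exactly the guarantee a well-designed class provides.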
Data Structures
- these problems will become more severe as we obtain more data
- for example, relating data from several different experiments with different sources (i.e. both cDNA arrays and short oligo arrays)
- starting to look at pathways or binding/promoter sites
The Evolution of Gene Selection
- first, fold change: the ratio of expression levels between two groups
- then t-tests: now statistical variation comes into play
- other statistical models: ANOVA, the Cox model, etc.
- for large enough samples we can tailor the test to the distribution (which might be different in the two groups)
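The first two steps in this evolution can be sketched in a few lines of base R on simulated data (the array sizes and the planted signal are made up for illustration):

```r
set.seed(2)
exprs <- matrix(rnorm(200, mean = 8), nrow = 20)      # 20 genes x 10 arrays
group <- rep(c("A", "B"), each = 5)
exprs[1, group == "B"] <- exprs[1, group == "B"] + 6  # plant one real difference

# fold change on the log scale: difference of group means, per gene
fold <- rowMeans(exprs[, group == "A"]) - rowMeans(exprs[, group == "B"])

# the t-test refinement: a p-value per gene, accounting for variation
pvals <- apply(exprs, 1, function(g)
  t.test(g[group == "A"], g[group == "B"])$p.value)

order(pvals)[1:3]   # top-ranked genes; gene 1 should score very well
```

Fold change ignores within-group variability entirely; the t-test is the first step toward treating gene selection as a statistical problem.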
Filters and Orders
- what sorts of questions can we answer?
- does a gene have a pattern of expression that correlates well with experimental conditions?
- does a gene express at a reasonably high level in a reasonably large portion of my samples?
- can we find genes that have a pattern of expression that is similar to that of some known gene(s)?
Filters
- A filter is a mechanism for removing a gene from further consideration
- we want to reduce the number of genes under consideration so that we can concentrate on those that are more interesting (it is a waste of resources to study genes that are not likely to be of interest)
Orders or Rankings
- while there is a strong desire to have methods that determine which genes are important, such methods do not exist and cannot exist
- we can rank genes according to some measure of interestingness: a test statistic, a p-value, etc.
- such rankings can be used to select genes (those with high ranks) for further study
Non-specific filters
- at least k (or a proportion p) of the samples must have expression values larger than some specified amount, A
- the gene should show sufficient variation to be interesting:
- either a gap of size A in the central portion of the data
- or an interquartile range of at least B
- genes that fail to pass can be eliminated early
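These filters translate directly into base R; the gene-filtering package mentioned later provides similar building blocks, but this standalone sketch shows the idea:

```r
# Keep a gene only if at least k samples exceed intensity A and the
# gene's interquartile range is at least B.
nonspecific_filter <- function(exprs, k, A, B) {
  above  <- rowSums(exprs > A) >= k      # expressed in enough samples
  spread <- apply(exprs, 1, IQR) >= B    # shows enough variation
  above & spread
}

set.seed(3)
exprs <- matrix(rexp(500, rate = 1 / 100), nrow = 50)  # 50 genes x 10 arrays
keep  <- nonspecific_filter(exprs, k = 5, A = 100, B = 50)
sum(keep)   # how many genes survive both filters
```

Because the result is a logical vector, filters compose naturally: combine any number of criteria with & before subsetting the expression matrix.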
Specific Filters
- any filter based on a statistical comparison
- t-test, ANOVA, Cox Model and so on
- these are all readily available in R and hence in
our gene filtering package
Multiple Testing
- to assess the significance of the p-values one must make some adjustments for multiple testing (Dudoit et al.)
- this is especially important if the expression data are coming from multiple sources
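Base R's p.adjust implements several such adjustments; a small example with Bonferroni and the Benjamini-Hochberg false discovery rate:

```r
pvals <- c(0.001, 0.01, 0.02, 0.4, 0.9)
p.adjust(pvals, method = "bonferroni")  # 0.005 0.050 0.100 1.000 1.000
p.adjust(pvals, method = "BH")          # less conservative FDR control
```

With thousands of genes tested at once, the Bonferroni correction is often too severe; FDR-style adjustments are the usual compromise.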
An Example
- two cell lines
- observed at four time points
- we want to select genes that have interesting patterns
- non-linear mixed-effects models seem appropriate
- R provides the filter without any extra coding
Annotation
- obtained from multiple sources
- we align it, using a combination of R, XML, and Postgres, into a coherent collection
- we can produce uniform annotation for any set of genes:
- UniGene, LocusLink
- chromosome, cytoband
- GO (the Gene Ontology)
Annotation
- in some cases knowing the chromosome or cytoband is useful
- hyperdiploid samples, hypodiploid samples
- examine gene amplification/deletion patterns
- if multiple genes are involved we may be able to detect this
- given information of this kind we want access to visualization tools
- where are the top-expressing genes located?
Annotation
- pathways: given pathway information, can we determine which genes express in a manner similar to known genes in the pathway?
- we need to link, via browser technology, with other sources; it is particularly useful if the information can be parsed and analysed by machine (e.g. XML)
- again, R has support for exactly this type of processing
Clustering
- Supervised:
- some clusters are determined a priori (the training set), and we assign new samples to those clusters (classification)
- Unsupervised:
- data are grouped, using some method, to provide (potentially) new groupings or classifications
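Both flavours are available in base R. As an unsupervised sketch, here is hierarchical clustering of samples from simulated expression profiles with two planted groups:

```r
set.seed(4)
# 100 genes x 6 samples; samples 1-3 and 4-6 come from different groups
exprs <- cbind(matrix(rnorm(300, mean = 0), ncol = 3),
               matrix(rnorm(300, mean = 2), ncol = 3))
d  <- dist(t(exprs))                 # distances between samples (columns)
hc <- hclust(d, method = "average")  # agglomerative hierarchical clustering
cutree(hc, k = 2)                    # cluster labels recover the two groups
```

For the supervised case, the same distance machinery supports simple classifiers (e.g. nearest-centroid assignment of a new sample to the closest training group).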
Some References
- Classification, A. D. Gordon, Chapman and Hall.
- Finding Groups in Data, L. Kaufman and P. J. Rousseeuw, Wiley.
- Pattern Recognition and Neural Networks, B. D. Ripley, Cambridge University Press.
- The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman, Springer.
Clustering
- Clustering is often used to answer the following questions.
- Can I find a set of genes that helps me to correctly classify the samples into specific groups (often disease categories)?
- If I can do this, which genes were used and how important are they?
Inference
- Permutation tests
- Cross-validation
- How many clusters are there?
- Gordon Ch 3 addresses some of these issues
- Are my clusters significant?
- it is more important to ask whether they provide
a meaningful representation or reduction of the
data
Variable Selection
- how do we select genes to help in clustering?
- this process has been studied in other areas of statistics; however, few of the solutions seem appropriate
Model Selection
- gene selection can be characterized as a model
selection procedure - most statistical research in this area has been
concentrated on the situation where the number of
variables is much less than the number of
observations - we need to develop model selection procedures for
the situation where there are many more variables
than observations
Selection Problems
- we may need to adjust for other variables
- this makes the bootstrap and cross-validation algorithms more complex
- computing is already hard, and now it is getting much more difficult; data structures play an essential role in simplifying the computations
Selection Problems
- use bootstrap- or cross-validation-type experiments to do variable selection
- for example, select genes that show up as important predictors in many different subsets
- to assess variable importance (when there are several variables), Breiman has proposed reclassifying using a new data set that has the covariate values permuted
Cross-validation
- leave-one-out CV is easy to implement, but it is probably not the best method to use for assessing or selecting a model
- the problem is that leaving out one observation yields too small a change to the data
- in general we need to consider perturbations on the order of the square root of the number of observations
Cross-validation
- make sure that you are cross-validating the experiment that you have carried out
- in particular, if you are selecting genes, rather than working with known genes, you must cross-validate the gene selection process as well
- most examples I have seen with low classification error rates do not cross-validate properly (model/gene selection was not validated)
Classification
- Breiman's random forest ideas
- notions of boosting: using several weak classifiers to obtain a good classifier
- weighted averages of predictions; the weights might be different in different parts of the covariate (gene) space
Acknowledgements
- Vincent Carey, Channing Laboratory
- Jianhua Zhang
- Sabina Chiaretti
- The DFCI for funding
- Dirk Iglehart
- Cheng Li
- Byron Ellis
- Sandrine Dudoit