Microarray Data Analysis - PowerPoint PPT Presentation

About This Presentation

Title:

Microarray Data Analysis

Description:

Title: Microarray Data Analysis Author: Kath Last modified by: Janet Murray Created Date: 10/14/2003 5:52:40 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 56

Provided by: kath208

Learn more at: https://www.uvm.edu

1
Microarray Data Analysis

The Bioinformatics side of the bench

2
The anatomy of your data files from Affymetrix
array analysis

.DAT image file (~10^7 pixels)
.CEL measured cell intensities
.CDF cell description file (identifies probe
sets and probe pairs)
.CHP calculated probe set data
.RPT report generated from .CHP

3
Quality Control (QC) of the chip: visual
inspection

Look at the .DAT file or the .CHP file image
Scratches? Spots?
Corners and outside border checkerboard
appearance (B2 oligo)
Positive hybridization control
Used by software to place grid over image
Array name is written out in oligos!

5
Chip defects
6
Internal controls

B. subtilis genes (added poly-A tails)
Assessment of quality of sample preparation
Also as hybridization controls
Hybridization controls (bioB, bioC, bioD, cre)
E. coli and P1 bacteriophage biotin-labeled cRNAs
Spiked into the hybridization cocktail
Assess hybridization efficiency
Actin and GAPDH assess RNA sample/assay quality
Compare signal values from the 3' end to signal
values from the 5' end
The 3'/5' ratio generally should not exceed 3
Percent genes present (P)
Replicate samples - similar P values
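
The 3'/5' check above can be sketched in a few lines of Python. This is only an illustration: the function names, the control names as dictionary keys, and the signal values are hypothetical, and the cutoff of 3 is the rule of thumb from this slide.

```python
def three_to_five_ratio(signal_3prime, signal_5prime):
    """Return the 3'/5' signal ratio for one control gene."""
    if signal_5prime <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3prime / signal_5prime


def flag_degraded(controls, max_ratio=3.0):
    """Return the control genes whose 3'/5' ratio exceeds max_ratio."""
    return [name for name, (s3, s5) in controls.items()
            if three_to_five_ratio(s3, s5) > max_ratio]


# Hypothetical (3' signal, 5' signal) pairs for one chip.
controls = {"GAPDH": (1200.0, 1000.0), "Actin": (2000.0, 500.0)}
flagged = flag_degraded(controls)
print(flagged)
```

A flagged control suggests a problem with RNA quality or the labeling assay for that sample, so the chip deserves a closer look before analysis.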

7
Microarray Data Process/Outline

Experimental Design
Image Analysis: scan to intensity measures (raw
data)
Normalization: clean data
More low-level analysis: fold change, ANOVA,
data filtering
Data mining: how to interpret > 6000 measures
Databases
Software
Techniques-clustering, pattern recognition etc.
Comparing to prior studies, across platforms?
Validation

8
Experimental Design

A good microarray design has 4 elements
A clearly defined biological question or
hypothesis
Treatment, perturbation and observation of
biological materials should minimize systematic
bias
Simple and statistically sound arrangement that
minimizes cost and gains maximal information
Compliance with MIAME (Minimum Information About
a Microarray Experiment)

The goal of statistics is to find signals in a
sea of noise
The goal of exp. design is to reduce the noise so
signals can be found with as small a sample size
as possible

9
Observational Study vs. Designed Experiment

Observational study-
Investigator is a passive observer who measures
variables of interest, but does not attempt to
influence the responses
Designed Experiment-
Investigator intervenes in natural course of
events
What type is our DMSO exp?

10
Experimental Replicates

Why?
In any exp. system there is a certain amount of
noise, so even 2 identical processes yield
slightly different results
Sources?
In order to understand how much variation there
is, it is necessary to repeat an exp. a number of
independent times
Replicates allow us to use statistical tests to
ascertain if the differences we see are real

12
Technical vs. Biological Replicates
As we progress from the starting material to the
scanned image, we move from a system dominated by
biological effects to one dominated by chemistry
and physics noise.
Within the Affy platform the dominant variation
is usually biological, so the best strategy is to
produce replicates as high up the experimental
tree as possible.
13
Low level data analysis / pre-processing

Varying biological or cellular composition among
sample types.
Differences in sample preparation, labeling or
hybridization
Non specific cross-hybridization of target to
probes.
These lead to systematic differences between
individual arrays

Raw Data Quality Control
Scaling
Normalization and filtering.

14
Image Analysis - Raw Data
15
From probe level signals to gene abundance
estimates
The job of the expression summary algorithm is to
take a set of Perfect Match (PM) and Mis-Match
(MM) probes, and use these to generate a single
value representing the estimated amount of
transcript in solution, as measured by that
probeset.
To do this, .DAT files containing array images
are first processed to produce a .CEL file, which
contains measured intensities for each probe on
the array. It is the .CEL files that are
analyzed by the expression calling algorithm.
16
MAS 5.0 output files

For each transcript (gene) on the chip
signal intensity
a present or absent call (presence call)
p-value (significance value) for making that call
Each gene associated with GenBank accession
number (NCBI database)

17
How are transcripts determined to be present or
absent?

Probe pair (PM vs. MM) intensities
generate a detection p-value
assign Present, Absent, or Marginal call
for transcript
Every probe pair in a probe SET has a potential
vote for presence call
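
The voting idea above could be sketched as follows. This is a simplified illustration, not the MAS 5.0 implementation. It computes a discrimination score for each probe pair and applies an exact one-sided signed-rank test against a small threshold. The parameter names `tau`, `alpha1`, `alpha2` and their defaults (0.015, 0.04, 0.06) are assumptions based on commonly cited Affymetrix defaults, and the probe intensities are made up.

```python
def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair (intensities > 0)."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_p(values, tau):
    """Exact one-sided signed-rank p-value for 'median of values > tau'.

    Simplifications: values equal to tau are dropped, and tied absolute
    differences are ranked in sort order instead of sharing average ranks.
    """
    diffs = [v - tau for v in values if v != tau]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    # counts[s] = number of sign patterns whose positive ranks sum to s.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[w_plus:]) / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return ('P' | 'M' | 'A', p_value) for one probe set."""
    p = signed_rank_p(discrimination_scores(pm, mm), tau)
    if p < alpha1:
        return "P", p
    if p < alpha2:
        return "M", p
    return "A", p


# Hypothetical probe set with 11 probe pairs: PM clearly above MM.
pm = [400 + 20 * i for i in range(11)]
mm = [100] * 11
call, p_value = detection_call(pm, mm)
print(call, p_value)
```

Because every probe pair contributes a signed rank, a few noisy pairs cannot flip the call on their own; the call changes only when the balance of evidence across the probe set shifts.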

18
PM and MM Probes

The purpose of each MM probe is to provide a
direct measure of background and stray-signal
(perhaps due to cross-hybridization) for its
perfect-match partner. In most situations the
signal from each probe-pair is simply the
difference PM - MM.
For some probe-pairs, however, the MM signal is
greater than the PM value; we then have an
apparently impossible measure of background.
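
A minimal sketch of the PM - MM calculation, using invented intensities, shows how such a probe pair turns up. The function names are hypothetical, and this is only the naive difference described on this slide, not the correction any particular algorithm applies.

```python
def probe_pair_differences(pm, mm):
    """Return PM - MM for each probe pair."""
    return [p - m for p, m in zip(pm, mm)]


def summarize(pm, mm):
    """Return the differences and the indices of pairs where MM > PM."""
    diffs = probe_pair_differences(pm, mm)
    negative = [i for i, d in enumerate(diffs) if d < 0]
    return diffs, negative


# Hypothetical intensities for a probe set with four probe pairs.
pm = [820, 640, 300, 910]
mm = [210, 180, 450, 230]
diffs, negative = summarize(pm, mm)
print(diffs, negative)
```

The third pair yields a negative difference, which cannot be log-transformed; this is the case an expression summary algorithm has to handle specially.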

19
Thank goodness for software!!!

MAS 5.0 does these calculations for you
.CHP file
Basic analysis in MAS 5.0, but it won't handle
replicates
Import MAS 5.0 (.CHP) data into other software,
Genesifter, GCOS, SpotFire, and many others

20
Signal Intensity

Following these calculations, the MAS 5.0
algorithm now has a measure of the signal for
each probe in a probeset.
Other algorithms, e.g. RMA, GCRMA, dChip, PLIER
and others, have been developed by academic teams
to improve the precision and accuracy of this
calculation
In our Exp we will use RMA and GCRMA

21
How do we want to analyze this data?

Pairwise analysis is most appropriate
Control vs. DMSO
List of genes that are upregulated or
downregulated
Determine fold up or down cutoffs
What is significant?
1.5 fold up/down?
2 fold up/down?
10 fold up/down?
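
To make the effect of the cutoff concrete, here is a small sketch that classifies genes by log2 fold change. The gene names and signal values are hypothetical, and the cutoff is a parameter precisely because, as the slide asks, there is no single correct value.

```python
from math import log2


def log2_fold_change(treated, control):
    """Return log2(treated / control); positive means upregulated."""
    return log2(treated / control)


def classify(genes, fold_cutoff=2.0):
    """Label each gene 'up', 'down' or 'unchanged' for a given fold cutoff."""
    threshold = log2(fold_cutoff)
    calls = {}
    for name, (control, treated) in genes.items():
        lfc = log2_fold_change(treated, control)
        if lfc >= threshold:
            calls[name] = "up"
        elif lfc <= -threshold:
            calls[name] = "down"
        else:
            calls[name] = "unchanged"
    return calls


# Hypothetical (control, DMSO-treated) signals.
genes = {"geneA": (100.0, 400.0), "geneB": (800.0, 200.0), "geneC": (300.0, 480.0)}
print(classify(genes, fold_cutoff=2.0))
print(classify(genes, fold_cutoff=1.5))
```

In this toy data geneC changes 1.6 fold, so it is reported at the 1.5 fold cutoff but not at 2 fold. Working on the log2 scale treats up and down changes symmetrically.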

22
Normalization - clean data

Normalizing data allows comparisons ACROSS
different chips
Intensity of fluorescent markers might be
different from one batch to the other
Normalization allows us to compare those chips
without altering the interpretation of changes in
GENE EXPRESSION

Why Normalize Data?
The experimental goal is to identify biological
variation (expression changes between samples)
Technical variation can hide the real data
Unavoidable systematic bias should be recognized
and corrected
Normalization is necessary to effectively make
comparisons between chips, and sometimes within a
single chip.

There are different methods of normalization; the
assumptions about where variation exists will
determine the normalization technique used.
Always look at data before and after
normalization
Spike in controls can help show which method may
be best
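
One of the simplest approaches is global scaling: multiply every value on a chip by a factor so that all chips share the same average intensity. The sketch below assumes that overall brightness differences are purely technical. The target of 500 and the chip values are arbitrary illustrations, and the plain mean is a simplification (MAS 5.0 scaling uses a trimmed mean, while RMA uses quantile normalization instead).

```python
from statistics import mean


def scale_to_target(chip, target=500.0):
    """Scale one chip's intensities so that their mean equals target."""
    factor = target / mean(chip)
    return [x * factor for x in chip]


# Hypothetical chips: chip2 was scanned twice as bright as chip1.
chips = {
    "chip1": [100.0, 200.0, 300.0],
    "chip2": [200.0, 400.0, 600.0],
}
normalized = {name: scale_to_target(values) for name, values in chips.items()}
print(normalized)
```

After scaling, the two chips agree gene by gene, so the apparent two fold difference was brightness rather than biology. Comparing the data before and after, as advised above, is how this kind of artifact is caught.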

24
Caveat

There is NO standard way to analyze microarray
data
Still figuring out how to get the best answers
from microarray experiments
Best to combine knowledge of biology, statistics,
and computers to get answers

25
Venn Diagrams
MAS 5.0
GCRMA
RMA
26
Data processing is completed, now what? Fold
change, ANOVA, data filtering
35
Where are we now?

Ran analysis, output is a GENE LIST
List indicates what genes are up or down
regulated
p values for t-test
Graphs of signal levels
Absolute numbers not as important here as the
trends you see
Now what????

36
What is the first set of genes on our chips that
will be filtered out?
37
Follow the links

Click on a gene
Find links to other databases
Follow links to discover what the protein does
Now the fun part begins.

38
Back to Biology

Do the changes you see in gene expression make
sense BIOLOGICALLY?
If they don't make sense, can you hypothesize as
to why those genes might be changing?
Leads to many, many more experiments

39
The Gene Ontologies
A Common Language for Annotation of Genes from
Yeast, Flies and Mice
and Plants and Worms
and Humans
and anything else!
40
Gene Ontology Objectives

GO represents concepts used to classify specific
parts of our biological knowledge
Biological Process
Molecular Function
Cellular Component
GO develops a common language applicable to any
organism
GO terms can be used to annotate gene products
from any species, allowing comparison of
information across species

41
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge into a
clean set of categories is a lot like trying to
figure out where to find that suspenseful black
comedy at your corner video store. Questions
inevitably come up, like are Movies part of Art
or Entertainment? (Yahoo! lists them under the
latter.) -Wired Magazine, May 1996
42
The 3 Gene Ontologies

Molecular Function: elemental activity/task
the tasks performed by individual gene products;
examples are carbohydrate binding and ATPase
activity
Biological Process: biological goal or objective
broad biological goals, such as mitosis or purine
metabolism, that are accomplished by ordered
assemblies of molecular functions
Cellular Component: location or complex
subcellular structures, locations, and
macromolecular complexes; examples include
nucleus, telomere, and RNA polymerase II
holoenzyme
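
As a rough sketch of how annotations in the three ontologies might be held in a program, the example below groups the terms attached to one gene product. The class, the gene product name `geneX` and the ontology keys are hypothetical; the terms are the examples given on this slide.

```python
from collections import defaultdict
from dataclasses import dataclass

ONTOLOGIES = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    ontology: str  # one of ONTOLOGIES
    term: str


def group_by_ontology(annotations):
    """Return {ontology: [terms]} for a list of annotations."""
    grouped = defaultdict(list)
    for ann in annotations:
        if ann.ontology not in ONTOLOGIES:
            raise ValueError(f"unknown ontology: {ann.ontology}")
        grouped[ann.ontology].append(ann.term)
    return dict(grouped)


# Hypothetical annotations for a made-up gene product.
annotations = [
    Annotation("geneX", "molecular_function", "ATPase activity"),
    Annotation("geneX", "biological_process", "mitosis"),
    Annotation("geneX", "cellular_component", "nucleus"),
]
grouped = group_by_ontology(annotations)
print(grouped)
```

Keeping the three ontologies separate mirrors the questions they answer: what the product does, which larger goal it serves, and where it is found.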

43
Example Gene Product: hammer

Function (what)             Process (why)
Drive nail (into wood)      Carpentry
Drive stake (into soil)     Gardening
Smash roach                 Pest Control
Clown's juggling object     Entertainment
44
Biological Examples
Molecular Function
Biological Process
Cellular Component
45
Validation

Not enough to just do microarrays
Usually validate microarray results via some
other technique
RT-PCR
TaqMan
Northern analysis
Protein level analysis
No technique is perfect

46
Yeast Genome and Data Mining
47
Dynamic Nature of Yeast Genome
eORF: essential
kORF: known
hORF: homology identified
shORF: short
tORF: transposon identified
qORF: questionable
dORF: disabled
First published sequence claimed 6274 genes, a
number that has been revised many times. Why?
48
6603 4373 1410 820
The Affy detection oligonucleotide sequences are
frozen at the time of synthesis, how does this
impact downstream data analysis?
49
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: MAPKKK cascade involved in
transduction of mating pheromone signal, as
described in Saccharomyces
definition_reference: PMID:9561267
50
SGD
53
SGD public microarray data sets available for
public query
54
Homework

Go to http://www.yeastgenome.org/ and find 3
candidate genes of known f(x) and one of
undefined f(x) that you might predict to be
altered by DMSO treatment
What GO biological processes and molecular
mechanisms are associated with your candidate
genes?
Where in the cell does the protein reside?
What other proteins are known or inferred to
interact with yours? How was this interaction
determined? Is this a genetic or physical
interaction?
Find the expression of at least one of your known
genes in another publicly deposited microarray
data set.
Name of data set and how you found it?
What is the largest Fold change observed for this
gene in the public study?
Now that you are microarray technology experts,
can you give me 3 reasons why the observed
transcript level difference may not be confirmed
through a second technology like RT-qPCR?