Microarray Data Analysis

1
Microarray Data Analysis
  • Stuart M. Brown
  • NYU School of Medicine

2
The Central Dogma of Molecular Biology: DNA is
transcribed into RNA, which is then translated
into protein
Measured by Microarray
3
What is a Microarray?
  • A simple concept: Dot Blot + Northern
  • Reverse the hybridization - put the probes on the
    filter and label the bulk RNA
  • Make probes for lots of genes - a massively
    parallel experiment
  • Make it tiny so you don't need so much RNA from
    your experimental cells.
  • Make quantitative measurements

4
Microarrays are Popular
  • At NYU Med Center we are now collecting about 3
    GB of microarray data per week (60 chips, 6-10
    different experiments)
  • PubMed search "microarray": 13,948 papers
  • 2005: 4,406
  • 2004: 3,509
  • 2003: 2,421
  • 2002: 1,557
  • 2001: 834
  • 2000: 294

5
A Filter Array
6
DNA Chip Microarrays
  • Put a large number (100K) of cDNA sequences or
    synthetic DNA oligomers onto a glass slide (or
    other substrate) in known locations on a grid.
  • Label an RNA sample and hybridize
  • Measure amounts of RNA bound to each square in
    the grid
  • Make comparisons
  • Cancerous vs. normal tissue
  • Treated vs. untreated
  • Time course
  • Many applications in both basic and clinical
    research

7
cDNA Microarray Technologies
  • Spot cloned cDNAs onto a glass microscope slide
  • usually PCR amplified segments of plasmids
  • Label 2 RNA samples with 2 different colors of
    fluorescent dye - control vs. experimental
  • Mix two labeled RNAs and hybridize to the chip
  • Make two scans - one for each color
  • Combine the images to calculate ratios of amounts
    of each RNA that bind to each spot
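
The ratio step above can be sketched in a few lines. This is only an illustration, assuming background-subtracted intensities, red for the experimental sample and green for the control; the gene names, values, and the `floor` parameter are hypothetical and not from the slides.

```python
import math

def log2_ratio(red, green, floor=1.0):
    """Return log2(red / green) for one spot.

    Intensities are floored (an assumption of this sketch) so that a
    zero or negative background-subtracted value cannot break the log.
    """
    return math.log2(max(red, floor) / max(green, floor))

# Hypothetical spots: (red = experimental, green = control)
spots = {
    "geneA": (2000.0, 500.0),   # 4-fold up
    "geneB": (300.0, 1200.0),   # 4-fold down
    "geneC": (800.0, 800.0),    # unchanged
}

ratios = {gene: log2_ratio(r, g) for gene, (r, g) in spots.items()}
for gene, value in ratios.items():
    print(gene, value)
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which is why ratios are usually reported this way.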

8
Spot your own Chip (plans available for free
from Pat Brown's website)
Robot spotter
Ordinary glass microscope slide
10
Combine scans for Red & Green
False color image is made from digitized
fluorescence data, not by superimposing scanned
images
11
cDNA Spotted Microarrays
13
Affymetrix GeneChip System
  • Uses 25 base oligos synthesized in place on a
    chip (20 pairs of oligos for each gene)
  • RNA labeled and scanned in a single color
  • one sample per chip
  • Can have as many as 20,000 genes on a chip
  • Arrays get smaller every year (more genes)
  • Chips are expensive
  • Proprietary system: "black box" software, can
    only use their chips

14
Affymetrix Gene Chip
17
Affymetrix Technology
19
Affymetrix Pivot Table
20
Data Acquisition
  • Scan the arrays
  • Quantitate each spot
  • Subtract background
  • Normalize
  • Export a table of fluorescent intensities for
    each gene in the array

21
Automate!!
  • All of this can be done automatically by
    software.
  • Much more consistent
  • Mistakes will be made (especially in the spot
    quantitation) but you can't manually check
    hundreds of thousands of spots

23
Affymetrix Software
  • Affymetrix System is totally automated
  • Computes a single value for each gene from 40
    probes - (using surprisingly kludgy math)
  • Highly reproducible (re-scan of same chip or
    hyb. of duplicate chips with same labeled sample
    gives very similar results)
  • Incorporates false results due to image artefacts
  • dust, bubbles
  • pixel spillover from bright spot to neighboring
    dark spots

25
Goals of a Microarray Experiment
  • Find the genes that change expression between
    experimental and control samples
  • Classify samples based on a gene expression
    profile
  • Find patterns: groups of biologically related
    genes that change expression together across
    samples/treatments

26
Basic Data Analysis
  • Fold change (relative increase or decrease in
    intensity for each gene)
  • Set cutoff filter for low values (background
    noise)
  • Cluster genes by similar changes - only really
    meaningful across multiple treatments or time
    points
  • Cluster samples by similar gene expression
    profiles
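
A minimal sketch of the first two bullets (fold change plus a low-value cutoff), assuming one normalized intensity per gene in each condition. The `min_intensity` floor of 100 and the gene names are hypothetical choices for illustration, not values from the slides.

```python
def fold_changes(control, treated, min_intensity=100.0):
    """Map gene -> signed fold change (treated vs. control).

    Genes where both values fall below min_intensity are skipped, since
    a ratio of two background-level numbers is mostly noise. Remaining
    values are raised to the floor before dividing.
    """
    result = {}
    for gene in control:
        c, t = control[gene], treated[gene]
        if max(c, t) < min_intensity:
            continue  # both near background: drop the gene
        c, t = max(c, min_intensity), max(t, min_intensity)
        # Positive = increase, negative = decrease (e.g. -4.0 is 4-fold down)
        result[gene] = t / c if t >= c else -(c / t)
    return result

control = {"g1": 200.0, "g2": 50.0, "g3": 1000.0}
treated = {"g1": 800.0, "g2": 60.0, "g3": 250.0}
fc = fold_changes(control, treated)
print(fc)
```

Here `g2` is filtered out because both measurements sit below the assumed noise floor, even though its raw ratio is 1.2.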

27
Streamlined Affy Analysis
  • Raw data
  • Normalize (RMA)
  • Filter: Present/Absent, Minimum value, Fold change
  • Significance: t-test, SAM, Rank Product
  • Classification: PAM, Machine learning
  • Clustering
  • Gene lists
28
Sources of Variability
  • Image analysis (identifying and quantitating each
    spot on the array)
  • Scanning (laser and detector, chemistry of the
    fluorescent label)
  • Hybridization (temperature, time, mixing, etc.)
  • Probe labeling
  • RNA extraction
  • Biological variability

29
Scatter plot of all genes in a simple comparison
of two controls (A) and two treatments (B: high
vs. low glucose), showing changes in expression
greater than 2.2- and 3-fold.
30
Thomas Hudson, Montreal Genome Center
31
Normalization
  • Can control for many of the experimental sources
    of variability (systematic, not random or gene
    specific)
  • Bring each image to the same average brightness
  • Can use simple math or fancy math:
  • divide by the mean (whole chip or by sectors)
  • LOESS (locally weighted regression)
  • No sure biological standards
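
The "divide by the mean" option can be sketched as follows. This is an illustrative whole-chip version only (no sectors, no LOESS), and the chip names and intensities are hypothetical.

```python
from statistics import mean

def scale_to_common_mean(chips, target=None):
    """Rescale each chip so that all chips share the same mean intensity.

    chips maps a chip name to a list of intensities. If no target is
    given, the average of the per-chip means is used (an assumption of
    this sketch; any common value would do).
    """
    chip_means = {name: mean(values) for name, values in chips.items()}
    if target is None:
        target = mean(chip_means.values())
    return {
        name: [v * target / chip_means[name] for v in values]
        for name, values in chips.items()
    }

# Chip "b" was scanned twice as bright as chip "a"
chips = {"a": [100.0, 200.0, 300.0], "b": [200.0, 400.0, 600.0]}
scaled = scale_to_common_mean(chips)
print(scaled)
```

After scaling, both hypothetical chips have a mean of 300, so the overall brightness difference no longer looks like a change in expression.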

32
RMA
  • Robust Multichip Average
  • Bolstad, B.M., Irizarry, R.A., Astrand, M., and
    Speed, T.P. (2003), A Comparison of Normalization
    Methods for High Density Oligonucleotide Array
    Data Based on Bias and Variance. Bioinformatics
    19(2):185-193
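
The paper cited above compares normalization methods, including quantile normalization, which RMA uses as one of its steps. The sketch below shows quantile normalization only; it is not the full RMA procedure (no background correction or probe-set summarization), it ignores ties, and the input values are hypothetical.

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list per chip).

    Every chip ends up with the same distribution of values: the mean,
    across chips, of the values at each rank.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: average of the k-th smallest value on each chip
    reference = [mean(col[i] for col in sorted_cols) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference at its rank
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
qn = quantile_normalize(chips)
print(qn)
```

Each probe keeps its rank within its own chip, but the actual numbers are replaced so that all chips become directly comparable.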

33
Are the Treatments Different?
  • Analysis of microarray data has tended to focus
    on making lists of genes that are up or down
    regulated between treatments
  • Before making these lists, ask the
    question "Are the treatments different?"
  • Use standard statistical methods to evaluate
    expression profiles for each treatment (t-test or
    F-test)
  • If there are differences, find the genes most
    responsible
  • If there are not significant overall differences,
    then lists of genes with large fold changes may
    only reflect random variability.

34
Statistics
  • When you have variability in measurements, you
    need replication and statistics to find real
    differences
  • It's not just the genes with a 2-fold increase,
    but those with a significant p-value across
    replicates
  • Non-parametric (i.e. rank) or paired value
    statistics may be more appropriate
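
To illustrate why fold change alone is not enough, here is a sketch of a per-gene Welch t statistic computed on replicate log2 values. The two genes and their numbers are hypothetical. Turning the statistic into a p-value needs the t distribution, which is left out of this sketch.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of replicate measurements."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

# Both hypothetical genes show the same average change: +1 on the log2
# scale, i.e. a 2-fold increase. Only the replicate scatter differs.
steady_control, steady_treated = [8.0, 8.1, 7.9], [9.0, 9.1, 8.9]
noisy_control, noisy_treated = [6.0, 8.0, 10.0], [7.0, 9.0, 11.0]

t_steady = welch_t(steady_control, steady_treated)
t_noisy = welch_t(noisy_control, noisy_treated)
print(round(t_steady, 2), round(t_noisy, 2))
```

The consistent gene gets a large t statistic while the noisy gene, with the identical 2-fold change, does not, which is the point of the replication bullet above.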

35
Multiple Comparisons
  • In a microarray experiment, each gene (each probe
    or probe set) is really a separate experiment
  • Yet if you treat each gene as an independent
    comparison, you will always find some with
    significant differences
  • (the tails of a normal distribution)

36
False Discovery
  • Statisticians call false positives a "Type I
    error" or a "False Discovery"
  • The expected number of false discoveries is equal
    to the p-value cutoff of the t-test × the number
    of genes in the array (the False Discovery Rate,
    FDR, is this number as a fraction of the genes
    called different)
  • For a p-value of 0.01 × 10,000 genes = 100
    falsely "different" genes
  • You cannot eliminate false positives, but by
    choosing a more stringent p-value, you can keep
    them manageable (try p < 0.001)
  • The number of false discoveries must be smaller
    than the number of real differences that you find,
    which in turn depends on the size of the
    differences and variability of the measured
    expression values
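
The arithmetic on this slide can be written out directly. The function names and the figure of 400 called genes are hypothetical, used only to show how the expected count of false positives relates to the size of a gene list.

```python
def expected_false_positives(p_cutoff, n_genes):
    """Expected number of genes passing the cutoff by chance alone."""
    return p_cutoff * n_genes

def estimated_false_fraction(p_cutoff, n_genes, n_called):
    """Rough fraction of the called genes expected to be false positives."""
    if n_called == 0:
        return 0.0
    return min(1.0, expected_false_positives(p_cutoff, n_genes) / n_called)

print(expected_false_positives(0.01, 10_000))        # the slide's example
print(expected_false_positives(0.001, 10_000))       # the stricter cutoff
print(estimated_false_fraction(0.01, 10_000, 400))   # if 400 genes were called
```

With 10,000 genes and a cutoff of 0.01, a list of 400 "different" genes would be expected to contain about 100 false ones, roughly a quarter of the list.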

37
SAM
  • Significance Analysis of Microarrays
  • Tusher, Tibshirani and Chu (2001) Significance
    analysis of microarrays applied to the ionizing
    radiation response. PNAS 2001 98:5116-5121 (Apr
    24).
  • Excel plugin
  • Free
  • Permutation based
  • Most published method of microarray data analysis
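
"Permutation based" means that significance is judged by reshuffling the sample labels rather than by assuming a distribution. The sketch below is a generic exact permutation test on the difference in means for one gene. It is not SAM itself (SAM uses its own modified statistic and estimates false discovery rates across all genes), and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Two-sided exact permutation p-value for a difference in group means.

    Every way of relabelling the pooled samples into groups of the
    original sizes is tried, and we count how often the relabelled
    difference is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(b) - mean(a))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point ties count as "at least as large"
        if abs(mean(group_b) - mean(group_a)) >= observed - 1e-12:
            count += 1
    return count / total

p = permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(p)
```

With only three replicates per group there are just 20 possible relabellings, so the smallest achievable two-sided p-value is 2/20. This shows how replicate count limits what a permutation test can detect.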

38
Higher Level Microarray Data Analysis
  • Clustering and pattern detection
  • Data mining and visualization
  • Controls and normalization of results
  • Statistical validation
  • Linkage between gene expression data and gene
    sequence/function/metabolic pathways databases
  • Discovery of common sequences in co-regulated
    genes
  • Meta-studies using data from multiple experiments

39
Types of Clustering
  • Hierarchical
  • Link similar genes, build up to a tree of all
  • Self Organizing Maps (SOM)
  • Split all genes into similar sub-groups
  • Finds its own groups (machine learning)
  • Principal Component
  • every gene is a dimension (vector), find a single
    dimension that best represents the differences in
    the data
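
The hierarchical approach ("link similar genes, build up to a tree") can be sketched as repeated merging of the two most similar groups. This version assumes average linkage and a distance of 1 minus the Pearson correlation, which are common choices but are not specified on the slide. The gene profiles are hypothetical, and a profile with no variation would cause a division by zero.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def hierarchical_clusters(profiles, n_clusters):
    """Merge the closest clusters until n_clusters remain."""
    clusters = [[gene] for gene in profiles]

    def dist(c1, c2):
        # Average linkage: mean pairwise distance between members
        return mean(1 - pearson(profiles[g], profiles[h]) for g in c1 for h in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles across four time points
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 1.0],
}
clusters = hierarchical_clusters(profiles, 2)
print(clusters)
```

Recording the order of the merges, instead of stopping at a fixed number of clusters, would give the tree that is usually drawn next to a heat map.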

40
Cluster by color difference
41
GeneSpring
43
SOM Clusters
44
Classification
  • How to sort samples into two classes based on
    gene expression data
  • Cancer vs. normal
  • Cancer sub-types (benign vs. malignant)
  • Responds well to drug vs. poor response (e.g.
    tamoxifen for breast cancer)

45
Support Vector Machines
Fat planes: With an infinitely thin plane the
data can always be separated correctly, but not
necessarily with a fat one. Again, if a large
margin separation exists, chances are good that
we found something relevant.
Large Margin Classifiers
46
  • PAM: Prediction Analysis for Microarrays
  • Class Prediction and Survival Analysis for
    Genomic Expression Data Mining
  • Performs sample classification from gene
    expression data,
  • via the "nearest shrunken centroid method" of
    Tibshirani, Hastie, Narasimhan and Chu (2002)
  • "Diagnosis of multiple cancer types by shrunken
    centroids of gene expression"
  • PNAS 2002 99:6567-6572 (May 14).
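
The idea underneath PAM is a nearest-centroid classifier: average the expression profiles of each class, then assign a new sample to the class whose average it is closest to. The sketch below shows only that plain version. The shrinkage step that gives the published method its name, and that removes uninformative genes, is deliberately omitted, and the training data are hypothetical.

```python
from statistics import mean

def class_centroids(samples, labels):
    """Average expression profile (centroid) for each class label."""
    by_class = {}
    for profile, label in zip(samples, labels):
        by_class.setdefault(label, []).append(profile)
    return {
        label: [mean(column) for column in zip(*profiles)]
        for label, profiles in by_class.items()
    }

def classify(profile, centroids):
    """Return the label of the centroid nearest to profile (squared Euclidean)."""
    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(profile, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical training set: two genes measured in four samples
samples = [[1.0, 9.0], [2.0, 8.0], [9.0, 1.0], [8.0, 2.0]]
labels = ["normal", "normal", "cancer", "cancer"]
centroids = class_centroids(samples, labels)

print(classify([7.5, 2.5], centroids))
print(classify([1.5, 8.0], centroids))
```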

47
Bioconductor
  • All of these normalization, statistical, and
    clustering methods are available in a free
    software package called Bioconductor.
  • www.bioconductor.org
  • User-hostile command line interface
  • Uses scripts in the 'R' statistical language

> library(affy)  # assumption: SpikeIn, pm() and mm() come from the affy package
> data(SpikeIn)
> pms <- pm(SpikeIn)   # perfect-match probe intensities
> mms <- mm(SpikeIn)   # mismatch probe intensities
> par(mfrow = c(1, 2)) # two plots side by side
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20, 12, byrow = TRUE)
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)
48
Functional Genomics
  • Take a list of "interesting" genes and find their
    biological relationships
  • Gene lists may come from significance/classification
    analysis of microarrays, proteomics, or other
    high-throughput methods
  • Requires a reference set of "biological
    knowledge"

49
Gene Ontology
  • How to organize biological knowledge?
  • Biologists work on a variety of different
    research organisms: yeast, fruit fly, mouse,
    human
  • the same gene can have very different functions
    (antennapedia)
  • and very different names (sonic hedgehog)

50
GO
  • Biologists got together a few years ago and
    developed a sensible system called Gene
    Ontology (GO)
  • 3 hierarchical sets of terminology
  • Biological Process
  • Cellular Component (location within cell)
  • Molecular Function
  • about 1000 categories of functions

53
Biological Pathways
54
Microarray Databases
  • Large experiments may have hundreds of individual
    array hybridizations
  • Core lab at an institution or multiple
    investigators using one machine - data archive
    and validate across experiments
  • Data-mining - look for similar patterns of gene
    expression across different experiments

55
Public Databases
  • Gene Expression data is an essential aspect of
    annotating the genome
  • Publication and data exchange for microarray
    experiments
  • Data mining/Meta-studies
  • Common data format - XML
  • MIAME (Minimal Information About a Microarray
    Experiment)

56
GEO at the NCBI
57
Array Express at EMBL
59
Gene Expression Technologies
  • cDNA (EST) libraries
  • SAGE
  • Microarray
  • RT-PCR
  • RNA-seq

60
The Cancer Genome Anatomy Project
  • CGAP has collected a large amount of cDNA and
    related data online
  • http://cgap.nci.nih.gov/
  • cDNA libraries from various tissues
  • search for genes
  • compare expression levels

63
SAGE
  • Serial Analysis of Gene Expression is a
    technology that sequences very short fragments of
    mRNA (10 or 17 bp) that have been randomly
    ligated together
  • The short tags are assigned to genes and then
    relative counts for each gene are computed for
    cDNA libraries from various tissues
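
The tag-to-gene counting step can be sketched as a simple tally. The 10 bp tags, the gene names, and the lookup table below are all hypothetical. Tags that match no gene are dropped, which is an assumption of this sketch.

```python
from collections import Counter

def relative_gene_counts(tags, tag_to_gene):
    """Assign tags to genes and return each gene's share of the assigned tags."""
    counts = Counter(tag_to_gene[tag] for tag in tags if tag in tag_to_gene)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical lookup table and one small tag library
tag_to_gene = {
    "ACGTACGTAC": "geneA",
    "TTGCAAGGCT": "geneB",
    "GGGCCCAAAT": "geneA",  # a second tag assigned to the same gene
}
library = ["ACGTACGTAC", "TTGCAAGGCT", "ACGTACGTAC", "GGGCCCAAAT", "NNNNNNNNNN"]

shares = relative_gene_counts(library, tag_to_gene)
print(shares)
```

Because the result is a proportion rather than a raw count, libraries of different sizes (for example from different tissues) can be compared gene by gene.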

64
SAGE Genie
  • SAGE Anatomic Viewer
  • SAGE Digital Gene Expression Displayer
  • Digital Northern
  • SAGE Experiment Viewer

67
Microarray
  • GEO database at NCBI
  • Microarray experiments
  • Defined arrays
  • Published results
  • Also lots of inconclusive experiments
  • Tools to search for specific genes
  • Unreliable to search for tissue or disease in
    experiment description text

72
RNA-seq
  • Next Generation DNA sequencing
  • NYU currently has one Illumina Genome Analyzer
  • generates more than 1 million RNA sequences per
    sample
  • Currently seeking funding for a Roche/454
  • produces 100K reads of 250-400 bp

73
Count Transcripts
  • Technology exists to accurately count transcripts
    and compare samples
  • Digital Gene Expression
  • Can also identify alternate isoforms, splice
    variants, etc.