Title: S'Shriram,
1High Throughput Gene Expression Analysis:
Techniques and Opportunities in Bioinformatics
- S.Shriram,
- Reametrix India (P) Ltd.,
- Bangalore
- Feb 2005
2Expressed Genes mRNA
DNA (genes)
messenger RNA
Protein (effector molecules)
3Why analyze so many genes?
- Just because we sequenced a genome doesn't mean
we know everything about the genes. Thousands of
genes remain without an assigned function.
- Patterns/clusters of expression (studying the complex
interplay of all genes simultaneously) are more
predictive than looking at one or two prognostic
markers, and can reveal new pathways.
4Genes do not reveal everything
C. elegans: 20,000 genes
H. sapiens: 30,000 genes
M. musculus: 30,000 genes
5Techniques
- Microarrays
- Serial Analysis of Gene Expression (SAGE)
methodology
- Microbead Technology
- Random Activation of Gene Expression (RAGE)
methodology
6DNA MICROARRAY TECHNOLOGY
7About Microarray
- Relatively young technology
- Mainly used in gene discovery
8Microarray of thousands of genes on a glass slide
9What are Microarrays?
- Microarrays are simply small glass or silicon
slides upon the surface of which are arrayed
thousands of genes (usually between 500 and 20,000)
- Via a conventional DNA hybridization process, the
level of expression/activity of genes is measured
- Data are read using laser-activated fluorescence
readers
- The process is ultra-high throughput
10Types of Microarray
- DNA microarrays are classified as:
- cDNA microarrays
- Use cDNA as probes (100-200 mers)
- Use pin technology for spotting
- Originated from Pat Brown's lab at Stanford
- Ink-jet microarrays
- Short probes (25-60 mers)
- Originated from Agilent
- Oligonucleotide microarrays
- Oligonucleotides (25 mers)
- Use photolithography for spotting
- Originated from Affymetrix
11An Array Experiment
12The 6 steps of a DNA microarray experiment (1-3)
- 1. Manufacturing of the microarray slide
- 2. Experimental design and choice of reference:
what to compare to what?
- 3. Target preparation (labeling) and hybridization
13The 6 steps of a microarray experiment (4-6)
- 4. Image acquisition (scanning) and
quantification (signal intensity to numbers)
- 5. Database building, filtering and normalization
- 6. Statistical analysis and data mining
14Microarray simulation
- http://www.bio.davidson.edu/courses/genomics/chip/chip.html
15[Figure: cDNA microarray workflow. cDNA clones (probes)
are PCR-amplified, purified and printed (0.1 nl/spot) onto
the microarray; the mRNA target is hybridised to the
microarray; scanning uses laser 1 and laser 2 (excitation
and emission); the images are overlaid and normalised
before analysis.]
17Microarray Steps: Quantification of RNA with a
NanoDrop Spectrophotometer
20Chip Hybridisation
21Injecting the labeled cRNA into the Yeast S98
GeneChip
22Rotation: 60 rpm; Temp: 45 °C; Time: 16 hrs
25GeneChip scanning
28Data acquisition and analysis
30The Scanned Array
- 220,000 probes
- 6,400 genes
- 24 um features
- Sensitivity 1:100,000
- 25 bp oligos
32What do we want to know?
- Genes involved in a specific biological process
(e.g. heat shock)
- Guilt by association: the assumption that genes
with the same pattern of changes in expression are
involved in the same pathway
- Tumor classification: predict outcome /
prescribe appropriate treatment based on
clustering with tumors of known outcome
33Steps in microarray data analysis:
Bioinformatics opportunities
- IMAGE ANALYSIS: assign the degree of expression
of genes based on intensity
- STATISTICAL ANALYSIS: identify the
differentially expressed genes (through
statistical methods and through other
bioinformatics methods)
- PATHWAY ANALYSIS: correlate the differentially
regulated genes to pathways based on biological
context
- SYSTEMS BIOLOGY: explain the observed phenotypic
(or macro-level) changes/effects at the organism
level based on overall changes in affected
pathways of various cells/tissues
36Image Analysis and Data Visualization
[Figure: heat map of log2(Cy5/Cy3) ratios, genes (rows)
by experiments (columns); colour scale runs from 8-fold
underexpressed through 2-fold to 8-fold overexpressed.]
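A minimal sketch of how the log2 ratios behind such a heat map might be computed from two-channel spot intensities. Treating Cy5 as the experimental channel and Cy3 as the reference, the intensity floor, and the gene names and intensities are all illustrative assumptions, not values from the slides.

```python
import math

def log2_ratio(cy5, cy3, floor=1.0):
    """Return log2(Cy5/Cy3) for one spot.

    Intensities below `floor` are clamped so that empty spots
    do not cause a division by zero or log of zero.
    """
    return math.log2(max(cy5, floor) / max(cy3, floor))

def describe(log_ratio):
    """Convert a log2 ratio into (fold change, direction)."""
    fold = 2 ** abs(log_ratio)
    if log_ratio > 0:
        return fold, "overexpressed"
    if log_ratio < 0:
        return fold, "underexpressed"
    return fold, "unchanged"

# Hypothetical background-corrected intensities: gene -> (Cy5, Cy3)
spots = {
    "geneA": (8000.0, 1000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1500.0, 1500.0),
}

ratios = {gene: log2_ratio(cy5, cy3) for gene, (cy5, cy3) in spots.items()}
for gene, value in ratios.items():
    fold, direction = describe(value)
    print(f"{gene}: log2 ratio {value:+.1f} ({fold:.0f}-fold, {direction})")
```

Working in log2 makes the scale symmetric: a 4-fold increase and a 4-fold decrease sit at +2 and -2, which is why the colour bar is labelled in fold steps on both sides.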
38Microarray data analysis
- Supervised versus unsupervised analysis
- Clustering: organization of genes that are
similar to each other and samples that are
similar to each other using clustering algorithms
- Statistical analysis: how significant are the
results?
39Two-dimensional hierarchical clustering (Eisen
et al., PNAS (1998) 95, p. 14863)
- Unsupervised: no assumption on samples
- The algorithm successively joins gene expression
profiles to form a dendrogram based on their
pair-wise similarities.
- Two-dimensional hierarchical clustering first
reorders genes and then reorders tumors based on
similarities of gene expression between samples.
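The joining procedure described above can be sketched in a few lines. This is a plain average-linkage clustering on Pearson correlation, written for illustration under that assumption; it is not claimed to match the CLUSTER implementation step for step, and the gene and sample names and values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average_linkage(profiles):
    """Repeatedly join the two most similar clusters; return a nested-tuple tree."""
    clusters = [(name, [name]) for name in profiles]  # (subtree, member names)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [pearson(profiles[a], profiles[b])
                        for a in clusters[i][1] for b in clusters[j][1]]
                score = sum(sims) / len(sims)  # average pairwise similarity
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def leaf_order(tree):
    """Left-to-right order of the leaves in the dendrogram."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaf_order(left) + leaf_order(right)

# Hypothetical log-ratio profiles: gene -> values across four samples
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [3.9, 3.2, 1.8, 1.1],
}
gene_order = leaf_order(average_linkage(genes))

# Second dimension: cluster the samples (columns) the same way
sample_names = ["s1", "s2", "s3", "s4"]
samples = {name: [genes[g][i] for g in genes] for i, name in enumerate(sample_names)}
sample_order = leaf_order(average_linkage(samples))

print(gene_order, sample_order)
```

Reordering the rows by `gene_order` and the columns by `sample_order` is what places similar profiles next to each other in the final heat map.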
40Two dimensional hierarchical (Eisen) Clustering
41Publicly Available Software: CLUSTER and
TREEVIEW
- Hierarchical Clustering
- K means Clustering
- Self Organizing Maps
42Array Sequence Analysis
- Promoter motif extraction
- Cluster / classify genes with common response
pattern
- Align upstream promoter regions (Gibbs sampler)
or count over-represented X-mers
- Develop profile / motif from set; search genome
for new candidates w/ motif
- Return to array data, look for supporting
evidence for new members
- Carry out experiment to support hypothesis
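As an illustration of the "count over-represented X-mers" step, here is a rough sketch that compares how often each k-mer occurs in the upstream regions of a co-expressed cluster versus a background set. The sequences, the pseudocount, and the `min_ratio` cutoff are assumptions chosen for the example; a real analysis would use a proper statistical test and both strands.

```python
from collections import Counter

def kmer_presence(seqs, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(cluster_seqs, background_seqs, k, min_ratio=2.0):
    """Return (k-mer, enrichment ratio) pairs, most enriched first."""
    fg = kmer_presence(cluster_seqs, k)
    bg = kmer_presence(background_seqs, k)
    hits = []
    for kmer, n in fg.items():
        fg_frac = n / len(cluster_seqs)
        # Pseudocount keeps the ratio finite for k-mers absent from the background
        bg_frac = (bg[kmer] + 1) / (len(background_seqs) + 1)
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda item: -item[1])

# Hypothetical upstream regions; the cluster shares the 6-mer TGACTC
cluster = ["AAATGACTCAAA", "CCCTGACTCCCC", "GGGTGACTCGGG"]
background = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG", "TTTTTTTTTTTT"]

motifs = enriched_kmers(cluster, background, k=6)
print(motifs)
```

The enriched k-mers would then seed the profile used to scan the genome for further candidates.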
44Features of Pathway Knowledgebase
- Tagging of gene expression data (from Microarray,
SAGE, etc.) onto simple clickable pathway
maps.
- In-silico manipulation of pathways, i.e. predict
the alterations in expression levels in any given
tissue or disease condition
- Ease target prioritization
45Publicly Available Software
GenMAPP: visualize gene expression data on maps
representing biological pathways and groupings of
genes.
53Building blocks of SYSTEM BIOLOGY
Organism / disease
Cell / tissue
Pathways
Gene/Protein/Target
Ligands/ Drugs
54Predictive biology
Computer Simulation
Predictive Biological Models
Novel Insights
Bioinformatics
55Systems Biology.
56Applications
Clinical
PreClinical
Leads
Discovery
- Screening
- Validation
- Optimization
- Target Discovery
- Target Validation
57Limitations of Arrays
- Do not necessarily reflect true levels of
proteins - protein levels are regulated by
translation initiation and degradation as well
- Generally, do not prove new biology - simply
suggest genes involved in a process, a hypothesis
that will require traditional experimental
verification
- Expensive! 20-100K to make your own / buy
enough to get publishable data
58The DNA Microarray Industry
- The DNA microarray industry comprises
companies which supply:
- microarray slides
- microarrayers (e.g. robotic spotters and
photolithographic equipment)
- scanners
- software for designing and analysing microarrays
- pre-spotted slides
59Microarray Industry
- Slide Suppliers
- Corning
- Surmodics
- Suppliers of Microarrayers
- Affymetrix
- Amersham Pharmacia Biotech
- Biorobotics
- Cartesian Technologies
- ESI
- GeneMachines
- Genomic Solutions
- Nimblegen
- Packard Instruments
60Microarray Industry
- Scanners
- Affymetrix
- Agilent Technologies
- Applied Precision
- Axon Instruments
- GSI-Lumonics
- Virtek
- Suppliers of Pre-Spotted Arrays
- Affymetrix
- Agilent Technologies
- Clontech Laboratories
- GeneLogic
- Incyte Genomics
- Rosetta Inpharmatics
61Microarray Industry
- Microarray Bioinformatics
- Affymetrix
- Agilent Technologies
- Axon Instruments
- BioDiscovery
- Clontech Laboratories
- GeneLogic
- GeneMachines
- Rosetta Inpharmatics
- Silicon Genetics
- Spotfire
- Microarray Bioinformatics in Bangalore
- Strand Genomics
- Cytogenomics / Siri Technologies
- Jubilant Biosys
- Genotypic
- SysArris
- Kshema Technologies
- iSakthi Technologies
- Avasthagen (Gene-x platform)
- MWG The Genomic Company
- Array Solutions
62SERIAL ANALYSIS OF GENE EXPRESSION
63What is SAGE?
- Serial Analysis of Gene Expression
- Method to quantify gene expression levels in
samples of cells
- Open system
- Can potentially reveal expression levels of all
genes: unbiased and comprehensive
- Microarrays are closed, since they only tell you
about the genes spotted on the array
Ref: Velculescu et al., Science 1995; 270:484-487
64How does SAGE work?
3.(c) Discard loose fragments.
9. Sequence and record the tags and frequencies.
65SAGE Serial Analysis of Gene Expression
Simultaneous and quantitative comparison of
gene-specific sequence tags.
68From ditags to counts
- Locate the punctuation CATG
- Extract ditags of length 20-26 between the
punctuation
- Discard duplicate ditags (including in reverse
direction) -- probably PCR artifacts
- Take extreme 10 bases as the two tags, reversing
right-hand tag
- Discard linker sequences
- Count occurrences of each tag
SAGE software available at http://www.sagenet.org
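The steps above can be sketched as follows. This is an illustrative outline, not the SAGE software itself: "reversing" the right-hand tag is interpreted here as taking its reverse complement, linker removal is reduced to an optional exclusion set, and the tag sequences are invented.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_ditags(concatemer, anchor="CATG", min_len=20, max_len=26):
    """Split on the CATG punctuation and keep pieces of ditag length.

    The first and last pieces are dropped because they are not
    flanked by the anchor on both sides.
    """
    pieces = concatemer.upper().split(anchor)
    return [p for p in pieces[1:-1] if min_len <= len(p) <= max_len]

def drop_duplicates(ditags):
    """Keep one copy of each ditag, treating both orientations as the same."""
    seen, kept = set(), []
    for ditag in ditags:
        key = min(ditag, revcomp(ditag))
        if key not in seen:
            seen.add(key)
            kept.append(ditag)
    return kept

def count_tags(ditags, tag_len=10, linker_tags=frozenset()):
    """Take the outer 10 bases of each ditag as two tags and count them."""
    counts = Counter()
    for ditag in ditags:
        for tag in (ditag[:tag_len], revcomp(ditag[-tag_len:])):
            if tag not in linker_tags:
                counts[tag] += 1
    return counts

# Hypothetical tags and a concatemer built from them
tag_a, tag_b, tag_c = "AAACCCAAAC", "GGGTTTGGGA", "TTTGGGTTTG"
d1 = tag_a + revcomp(tag_b)
d2 = tag_a + revcomp(tag_c)
concatemer = "CATG".join(["", d1, d2, d1, revcomp(d2), ""])

raw = extract_ditags(concatemer)
unique = drop_duplicates(raw)
counts = count_tags(unique)
print(len(raw), len(unique), dict(counts))
```

In this toy concatemer the repeated `d1` and the reversed copy of `d2` are both discarded before counting, so `tag_a` is counted twice rather than four times.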
69What does the data look like?
70Genes in Normal Kidney
- GACTTCACGCC
- Mouse kidney androgen-regulated protein
- CTATTCCTCTCA
- Plasma glutathione peroxidase
- TGTAGCCTCAT
- Na/KATPase .chain
- GGCCTTACTTC
- Na/Pi transporter
- Am. J. Physiol. 279: F383, 2000
72From tags to genes
- Collect sequence records from GenBank that are
represented in UniGene
- Assign sequence orientation (by finding poly-A
tail or poly-A signal, or from annotations)
- Extract the 10 bases 3'-adjacent to the 3'-most CATG
- Assign UniGene identifier to each sequence with a
SAGE tag
- Record (for each tag-gene pair)
- number of sequences with this tag
- number of sequences in gene cluster with this tag
Maps available at http://www.ncbi.nlm.nih.gov/SAGE
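A small sketch of the tag-extraction step, assuming each transcript is already in sense orientation (5' to 3'). The sequences and the cluster identifiers `cluster_1` and `cluster_2` are hypothetical placeholders, not real UniGene entries.

```python
from collections import defaultdict

def sage_tag(transcript, anchor="CATG", tag_len=10):
    """Return the 10 bases immediately 3' of the 3'-most CATG, or None."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)          # 3'-most occurrence of the anchor
    if pos == -1:
        return None                  # no anchoring site in this transcript
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None

# Hypothetical oriented transcripts: (cluster id, sequence)
records = [
    ("cluster_1", "GGCATGAAACCCGGGTTTA" + "CATG" + "CCCAAATTTGGGAAAAAAAA"),
    ("cluster_1", "TTTA" + "CATG" + "CCCAAATTTGGGAAAAAAAA"),
    ("cluster_2", "ACGT" + "CATG" + "GGGTTTAAACCCAAAAAAAA"),
]

# For each tag, count how many sequences in each cluster carry it
tag_to_clusters = defaultdict(lambda: defaultdict(int))
for cluster_id, sequence in records:
    tag = sage_tag(sequence)
    if tag is not None:
        tag_to_clusters[tag][cluster_id] += 1

for tag, clusters in tag_to_clusters.items():
    print(tag, dict(clusters))
```

Keeping the per-cluster sequence counts is what later allows an ambiguous tag to be assigned to its most likely gene.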
73SAGE Data Analysis on NCBI
- Tag to Gene mapping
- Query page
- Results page
74From tags to genes
- Ideal situation
- one gene, one tag
- True situation
- one gene, many tags (alternative splicing,
alternative polyadenylation)
- one tag, many genes (conserved 3' regions)
75Sequencing Errors
- Estimated sequencing error rate
- 0.7% per base (range 0.2% - 1%)
- Affects
- ditags in a SAGE experiment
- can improve by using phred scores and discarding
ambiguous sequences
- tag-gene mappings from GenBank
- RNA better than EST
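To see what a per-base error rate of this size means for a 10-base tag, a back-of-the-envelope calculation helps. It assumes errors at different positions are independent, which is a simplification.

```python
def prob_tag_has_error(per_base_error, tag_len=10):
    """Chance that at least one base of a tag is miscalled (independent errors assumed)."""
    return 1.0 - (1.0 - per_base_error) ** tag_len

for rate in (0.002, 0.007, 0.01):
    affected = prob_tag_has_error(rate)
    print(f"{rate:.1%} per base -> {affected:.1%} of 10-base tags carry an error")

p_tag = prob_tag_has_error(0.007)                # single 10-base tag
p_ditag = prob_tag_has_error(0.007, tag_len=20)  # a 20-base ditag
```

At 0.7% per base, roughly 1 tag in 15 contains at least one miscalled base, which is why low-count tags need cautious interpretation.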
76Reliable tag-gene assignments
77Statistical Methods
- Audic and Claverie, Genome Res 1997; 7:986-995
- Chen et al., J Exp Med 1998; 188:1657-1668
- Kal et al., Mol Biol Cell 1999; 10:1859-1872
- Michiels et al., Physiol Genomics 1999; 1:83-91
- Stollberg et al., Genome Res 2000; 10:1241-1248
- Man et al., Bioinformatics 2000; 16:953-959
78Audic and Claverie
- Main goal: confidence limits for differential
expression
- Use Poisson approximation for the number of times x
you see the same tag.
- Put a uniform prior on the Poisson parameter; get
the posterior probability of seeing the tag y times
in a new experiment
- $p(y \mid x) = \dfrac{(x+y)!}{x!\,y!\,2^{x+y+1}}$
- Generalizes to unequal sample sizes
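The formula above is straightforward to evaluate directly. The sketch below covers only the equal-library-size case stated on the slide; the one-sided tail used as a significance measure is an illustrative choice, and the example counts are invented.

```python
from math import comb

def ac_prob(y, x):
    """p(y | x) = (x+y)! / (x! y! 2^(x+y+1)) for equal library sizes."""
    return comb(x + y, x) / 2 ** (x + y + 1)

def ac_tail(x, y):
    """One-sided tail: probability of a count at least as extreme as y, given x."""
    if y >= x:
        # Upper tail P(Y >= y | x), computed as 1 minus the lower part
        return 1.0 - sum(ac_prob(k, x) for k in range(y))
    # Lower tail P(Y <= y | x)
    return sum(ac_prob(k, x) for k in range(y + 1))

# A tag seen 5 times in one library and 20 times in the other
p_up = ac_tail(5, 20)
print(f"P(Y >= 20 | x = 5) = {p_up:.5f}")

# Sanity check: the probabilities over all y sum to 1
total = sum(ac_prob(k, 3) for k in range(200))
```

Note that `ac_prob(y, x)` is a negative binomial distribution in `y`, which is why its values sum to one for any fixed `x`.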
79Chen et al.
- Assume
- equal sample sizes
- tag has concentration X, Y in two samples
- Look at W = X/(X+Y)
- Use a symmetric Beta prior distribution with a
peak near 0.5 (since most genes don't change)
- Use Bayes' theorem to compute the posterior
probability of a threefold difference in expression
80Unequal sample sizes
- This analysis generalizes easily to the case of
SAGE libraries of unequal size
- Lal et al., Cancer Res 1999; 59:5403-5407
- This method is used at the NCBI SAGEmap web site
for online differential expression queries
- http://www.ncbi.nlm.nih.gov/SAGE
81Kal et al.
- Assume the proportion of times you see a tag has
a binomial distribution
- Replace with a normal approximation to compute
confidence limits
- Used at http://www.cmbi.kun.nl/usage
- Equivalent to chi-square test on 2x2 table
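A sketch of the normal-approximation test and its equivalence to the 2x2 chi-square. It assumes the pooled-proportion form of the two-sample Z statistic and an uncorrected chi-square (no continuity correction); the tag counts and library sizes are hypothetical.

```python
import math

def z_statistic(x1, n1, x2, n2):
    """Two-proportion Z statistic with a pooled variance estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square (no continuity correction) on the table
    [[x1, n1 - x1], [x2, n2 - x2]]."""
    a, b, c, d = x1, n1 - x1, x2, n2 - x2
    n = n1 + n2
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# A tag seen 30 times in 50,000 tags versus 60 times in 50,000 tags
z = z_statistic(30, 50_000, 60, 50_000)
chi2 = chi_square_2x2(30, 50_000, 60, 50_000)
p_value = two_sided_p(z)
print(f"Z = {z:.3f}, Z^2 = {z * z:.3f}, chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

The squared Z statistic and the chi-square value agree, which is the equivalence noted in the last bullet.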
82Michiels et al.
- First perform an overall chi-square test to decide
if the two SAGE libraries being compared are
different.
- Get significance by Monte Carlo simulation
- Perform gene-by-gene chi-square tests and use
them to rank genes in order of most likely to be
different
83Stollberg et al.
- Assume binomial distributions
- Model the binomial parameters as a sum of two
exponentials
- Fit to the Zhang step function data
- Simulate from this model, adding
- sequencing errors
- nonuniqueness of tags
- nonrandomness of DNA sequences
84Stollberg et al.
- Key finding
- Naively using observed data to fit model
parameters cannot recover the observed data by
simulation
- Maximum likelihood estimates of the parameters that
do recover the observed data look very different
85Stollberg et al.
86Man et al.
- Compares specificity and sensitivity of different
tests for differential expression
- Audic and Claverie
- Kal
- Fisher's exact test
- Monte Carlo simulation of experiments
- Findings
- Similar power at high abundance
- Kal has highest power at low abundance
87Questions
- Sample size computations
- How many tags should we sequence if we want to
see tags of a given frequency?
- How many tags should we sequence if we want to
see a given percentage of tags?
- How many tags are expressed in a sample?
- Best method for identifying differential
expression?
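For the first sample-size question, one simple way to get a number is to treat each sequenced tag as an independent draw and ask how many draws give a chosen chance of seeing a tag at least once. This is a rough sketch under that assumption; the frequency and confidence values are examples only, and it ignores sequencing errors and tag ambiguity.

```python
import math

def tags_needed(tag_frequency, confidence=0.95):
    """Smallest N with P(tag seen at least once in N draws) >= confidence.

    Solves 1 - (1 - f)^N >= confidence for N.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-tag_frequency))

# A transcript making up 1 in 100,000 of the mRNA pool
n_tags = tags_needed(1e-5, confidence=0.95)
achieved = 1.0 - (1.0 - 1e-5) ** n_tags
print(f"Sequence about {n_tags:,} tags for a {achieved:.1%} chance of seeing it")
```

The required depth scales roughly as 3/f for 95% confidence, so halving the abundance of interest doubles the sequencing effort.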
88Additional SAGE references
- Review
- Madden et al., Drug Disc Today 2000; 5:415-425
- Online Tools
- Lash et al., Genome Res 2000; 10:1051-1060
- van Kampen et al., Bioinformatics 2000;
16:899-905
- Comparison of SAGE and Affymetrix
- Ishii et al., Genomics 2000; 68:136-143
- Combine SAGE and custom microarrays
- Nacht et al., Cancer Res 1999; 59:5464-5470
- Mapping SAGE data onto genome
- Caron et al., Science 2001; 291:1289-1292
- Data mining the public SAGE libraries
- Argani et al., Cancer Res 2001; 61:4320-4324
89SAGE Summary
- Advantages
- Identify novel genes
- Comprehensive
- Quantitative
- Cost-effective
- Can be done in a small lab
90SAGE Summary
- Drawbacks
- Labor-intensive (a single run takes 90 days)
- No automation, not high-throughput
- Difficult to replicate results
- Limited analytic tools - tough to annotate /
identify proteins from a 10-base tag
- Is a tag sequence unique for its gene?
91SAGE - companies
- Technique licensed to Genzyme (Molecular Oncology
division) by Johns Hopkins University
- Silico Insights at Hyderabad working in tag
annotation and pathway analysis
- Mainly used in academic laboratories
92THANK YOU. Any questions?
- shriram_at_reametrix.com
93BACKUP SLIDES
94Creating Targets
Reverse Transcriptase
PCR amplification of DNA
More
in vitro transcription
95RNA-DNA Hybridization
Targets (RNA)
probe sets on chip (DNA) (25 base
oligonucleotides of known sequence)
96Non-Hybridized Targets are Washed Away
Targets (fluorescently tagged)
probe sets (oligos)
Non-bound ones are washed away
97Experimental Design
- Choice of reference: common (non-biologically
relevant) reference, or paired samples?
- Number of replicates: how many are needed? (How
many are affordable?)
- Are the replicate results going to be averaged or
treated independently?
- Is this a fishing expedition or a
hypothesis-based experiment?
98The Process
Building the Chip
PCR PURIFICATION and PREPARATION
MASSIVE PCR
PREPARING SLIDES
PRINTING
Preparing RNA
Hybridising the Chip
CELL CULTURE AND HARVEST
POST PROCESSING
ARRAY HYBRIDIZATION
RNA ISOLATION
DATA ANALYSIS
PROBE LABELING
cDNA PRODUCTION
99Building the Chip
PCR PURIFICATION and PREPARATION
MASSIVE PCR
Full yeast genome 6,500 reactions
IPA precipitation EtOH washes 384-well format
PRINTING
The arrayer high precision spotting device
capable of printing 10,000 products in 14 hrs,
with a plate change every 25 mins
PREPARING SLIDES
Polylysine coating for adhering PCR products to
glass slides
POST PROCESSING
Chemically converting the positive polylysine
surface to prevent non-specific hybridization
100Preparing RNA
CELL CULTURE AND HARVEST
Designing experiments to profile
conditions/perturbations/ mutations and carefully
controlled growth conditions
RNA ISOLATION
RNA yield and purity are determined by system.
PolyA isolation is preferable but total RNA is
useable. Two RNA samples are hybridized/chip.
cDNA PRODUCTION
Single strand synthesis or amplification of RNA
can be performed. cDNA production includes
incorporation of Aminoallyl-dUTP.
101Hybridising the Chip
ARRAY HYBRIDIZATION
Cy3 and Cy5 RNA samples are simultaneously
hybridized to chip. Hybs are performed for 5-12
hours and then chips are washed.
DATA ANALYSIS
Ratio measurements are determined via
quantification of 532 nm and 635 nm emission
values. Data are uploaded to the appropriate
database where statistical and other analyses can
then be performed.
PROBE LABELING
Two RNA samples are labelled with Cy3 or Cy5
monofunctional dyes via a chemical coupling to
AA-dUTP. Samples are purified using a PCR
cleanup kit.
102Analysis of SAGE Data: An Introduction
- Kevin R. Coombes
- Section of Bioinformatics
103Outline
- Description of SAGE method
- Preliminary bioinformatics issues
- Description of analysis methods introduced in
early paper
- Review of literature: statistics and SAGE
104Genome Sequence Flood
- Typical results from initial analysis of a new
genome by the best computational methods:
- For 1/3 of the genes, we have a good idea what
they are doing (high similarity to experimentally
studied genes)
- For 1/3 of the genes, we have a guess at what
they are doing (some similarity to previously
seen genes)
- For 1/3 of the genes, we have no idea what they are
doing (no similarity to studied genes)
105Gene expression
- It's essential to study the complex interplay of all
genes simultaneously
- This study requires high-throughput, large-scale
technologies, such as microarray technology
106DNA MICROARRAY TECHNOLOGY
- This technology was conceived to detect expression
of 1000s of genes simultaneously
- Its potential applications include
- Identification of complex diseases
- Drug discovery and toxicology studies
- Mutation and polymorphism detection
- Pathogen detection
- Differential expression of genes over time,
between tissues and disease states
107Cluster analysis of genes in G1 and G2
Chaudhry et. al., 2002
108Other Software
Extraction of information from DNA chips with the
technology of promoter analysis: Genomatix
Software GmbH
109SAGE and cancer
- Ten SAGE libraries, two each from
- normal colon
- colon tumors
- colon cancer cell lines
- pancreatic tumors
- pancreatic cell lines
- Pooled each pair
Ref: Zhang et al., Science 1997; 276:1268-1272
110Variability in SAGE libraries
111Distribution of tags
- 303,706 total tags
- 48,471 distinct tags
- Distribution
- 85.9% seen up to 5 times (25% of mass)
- 12.7% between 5 and 50 times (30%)
- 0.1% between 50 and 500 times (26%)
- 0.1% more than 500 times (19%)
Ref: Zhang et al., Science 1997; 276:1268-1272
112How many tags were missed?
- They simulated to find a 92% chance of detecting
tags at 3 copies/cell
- Using binomial approximation
- Get 95% chance for 3 copies/cell
- Only get 63% chance for 1 copy/cell
- Most of what they saw occurred at 1-5 copies per
cell
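The binomial figures quoted above can be reproduced with a short calculation. The sketch assumes roughly 300,000 mRNA molecules per cell and about 300,000 sequenced tags (the 303,706 total rounded); both numbers are assumptions made for this illustration.

```python
def detection_probability(copies_per_cell, tags_sequenced, mrnas_per_cell=300_000):
    """Chance of seeing a transcript at least once among the sequenced tags.

    Each tag is treated as an independent draw that hits the transcript
    with probability copies_per_cell / mrnas_per_cell.
    """
    p = copies_per_cell / mrnas_per_cell
    return 1.0 - (1.0 - p) ** tags_sequenced

p_one = detection_probability(1, 300_000)
p_three = detection_probability(3, 300_000)
print(f"1 copy/cell:   {p_one:.0%}")
print(f"3 copies/cell: {p_three:.0%}")
```

With equal numbers of tags and mRNAs per cell, the single-copy case is close to 1 - 1/e, which is where the 63% comes from.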
113Differential Expression
- Found 289 tags differentially expressed between
normal colon and colon cancer (181 decreased, 108
increased)
- Method: Monte Carlo simulation
- 100,000 simulations per transcript for relative
likelihood of seeing observed difference
- Used observed distribution of transcripts to
simulate 40 experiments
Ref: Zhang et al., Science 1997; 276:1268-1272
114Sensitivity
- Claim: 95% chance of detecting a 6-fold difference
- Method: Monte Carlo
- 200 simulations, assuming abundance of 0.0001 in
first sample and 0.0006 in second sample
Ref: Zhang et al., Science 1997; 276:1268-1272
115Weaknesses in Analysis
- Failed to account for intrinsic variability in
samples (which changes depending on abundance) in
assessing significance
- Monte Carlo used the observed distribution, which is
definitely not the true distribution
- Sensitivity only measured at one abundance level
116Why use Microarrays?
- What genes are expressed in a cell?
- What genes are Present/Absent in the experiment
vs. control?
- Which genes have increased/decreased expression
in experiment vs. control?
- Which genes have biological significance?