1
High Throughput Gene Expression Analysis:
Techniques and Opportunities in Bioinformatics
  • S.Shriram,
  • Reametrix India (P) Ltd.,
  • Bangalore
  • Feb 2005

2
Expressed Genes = mRNA
DNA (genes) → messenger RNA → Protein (effector molecules)
3
Why analyze so many genes?
  • Just because we sequenced a genome doesn't mean
    we know everything about the genes. Thousands of
    genes remain without an assigned function.
  • Patterns/clusters of expression (studying the complex
    interplay of all genes simultaneously) are more
    predictive than looking at one or two prognostic
    markers, and can reveal new pathways

4
Genes do not reveal everything
C. elegans: 20,000 genes
H. sapiens: 30,000 genes
M. musculus: 30,000 genes
5
Techniques
  • Microarrays
  • Serial Analysis of Gene Expression (SAGE)
    methodology
  • Microbead Technology
  • Random Activation of Gene Expression (RAGE)
    methodology

6
DNA MICROARRAY TECHNOLOGY
7
About Microarray
  • Relatively young technology
  • Widely adopted
  • Mainly used in gene discovery

8
Microarray of thousands of genes on a glass slide
9
What are Microarrays?
  • Microarrays are simply small glass or silicon
    slides upon the surface of which are arrayed
    thousands of genes (usually between 500-20,000)
  • Via a conventional DNA hybridization process, the
    level of expression/activity of genes is measured
  • Data are read using laser-activated fluorescence
    readers
  • The process is ultra-high throughput

10
Types of Microarray
  • DNA microarrays are classified as:
  • cDNA microarrays
      • Use cDNA as probes (100-200 mers)
      • Use pin technology for spotting
      • Originated from Pat Brown's lab at Stanford
  • Ink-jet microarrays
      • Short probes (25-60 mers)
      • Originated from Agilent
  • Oligonucleotide microarrays
      • Oligonucleotides (25 mers)
      • Use photolithography for spotting
      • Originated from Affymetrix

11
An Array Experiment
12
The 6 steps of a DNA microarray experiment (1-3)
  • 1. Manufacturing of the microarray slide
  • 2. Experimental design and choice of reference:
    what to compare to what?
  • 3. Target preparation (labeling) and hybridization

13
The 6 steps of a microarray experiment (4-6)
  • 4. Image acquisition (scanning) and
    quantification (signal intensity to numbers)
  • 5. Database building, filtering and normalization
  • 6. Statistical analysis and data mining

14
Microarray simulation
  • http://www.bio.davidson.edu/courses/genomics/chip/chip.html

15
[Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); mRNA target; hybridise target to microarray; scanning with laser 1 and laser 2 (excitation, emission); overlay images and normalise; analysis]
17
Microarray Steps: Quantification of RNA with the
NanoDrop Spectrophotometer
20
Chip Hybridisation
21
Injecting the labeled cRNA into the Yeast S98
GeneChip
22
Rotation: 60 rpm; Temp: 45 °C; Time: 16 hrs
25
GeneChip scanning
28
Data acquisition and analysis
30
The Scanned Array
  • 220,000 probes
  • 6,400 genes
  • 24 µm features
  • Sensitivity 1:100,000
  • 25 bp oligos
32
What do we want to know?
  • Genes involved in a specific biological process
    (e.g. heat shock)
  • Guilt by association: the assumption that genes
    with the same pattern of changes in expression are
    involved in the same pathway
  • Tumor classification: predict outcome /
    prescribe appropriate treatment based on
    clustering with tumors of known outcome

33
Steps in microarray data analysis:
Bioinformatics opportunities
  • IMAGE ANALYSIS: assign the degree of expression
    of genes based on intensity
  • STATISTICAL ANALYSIS: identify the
    differentially expressed genes (through
    statistical methods and through other
    bioinformatics methods)
  • PATHWAY ANALYSIS: map the differentially
    regulated genes onto pathways, based on
    biological context
  • SYSTEMS BIOLOGY: explain the observed phenotypic
    (or macro-level) changes/effects at the organism
    level based on overall changes in the affected
    pathways of various cells/tissues

36
Image Analysis and Data Visualization
[Figure: heat map of log2(Cy5/Cy3) ratios; axes: Experiments, Genes; color scale runs from 8-fold underexpressed through 2-fold to 8-fold overexpressed]
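The log2 ratio behind this kind of display can be sketched as follows. This is a minimal illustration, not the presenter's pipeline: the intensity values, the gene names and the `floor` guard against zero intensities are all assumptions made for the example.

```python
import math

def log2_ratio(cy5, cy3, floor=1.0):
    """Return log2(Cy5/Cy3) for one spot; intensities are floored to avoid log(0)."""
    return math.log2(max(cy5, floor) / max(cy3, floor))

def fold_label(log_ratio):
    """Turn a log2 ratio into (fold change, direction)."""
    fold = 2 ** abs(log_ratio)
    if log_ratio > 0:
        direction = "overexpressed"
    elif log_ratio < 0:
        direction = "underexpressed"
    else:
        direction = "unchanged"
    return fold, direction

# Hypothetical background-subtracted intensities (Cy5, Cy3) for three spots.
spots = {
    "geneA": (8000.0, 1000.0),
    "geneB": (500.0, 2000.0),
    "geneC": (1200.0, 1200.0),
}
ratios = {name: log2_ratio(cy5, cy3) for name, (cy5, cy3) in spots.items()}

for name, value in ratios.items():
    fold, direction = fold_label(value)
    print(f"{name}: log2 ratio {value:+.2f} ({fold:.0f}-fold, {direction})")
```

Working on the log2 scale makes an 8-fold increase (+3) and an 8-fold decrease (-3) symmetric around zero, which is why the color scale in such displays is centered on 0.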
38
Microarray data analysis
  • Supervised versus unsupervised analysis
  • Clustering: organization of genes that are
    similar to each other and samples that are
    similar to each other using clustering algorithms
  • Statistical analysis: how significant are the
    results?

39
Two-dimensional hierarchical clustering (Eisen
et al., PNAS 1998; 95:14863)
  • Unsupervised: no assumptions about the samples
  • The algorithm successively joins gene expression
    profiles to form a dendrogram based on their
    pair-wise similarities.
  • Two-dimensional hierarchical clustering first
    reorders genes and then reorders tumors based on
    similarities of gene expression between samples.
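The successive joining of profiles can be sketched in a few lines of plain Python. This is a simplified illustration rather than the CLUSTER implementation: it assumes average linkage on the distance 1 - Pearson correlation, the expression values are made up, and the names `cluster_order` and `transpose` are hypothetical helpers. Profiles with zero variance are not handled.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def cluster_order(profiles):
    """Average-linkage agglomerative clustering.

    Returns (merges, order): the list of joins (id_a, id_b, distance) that
    define the dendrogram, and the resulting left-to-right leaf order.
    """
    names = list(profiles)
    clusters = {i: [name] for i, name in enumerate(names)}  # id -> ordered leaves
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average pairwise distance between members of the two clusters.
                dists = [1 - pearson(profiles[p], profiles[q])
                         for p in clusters[a] for q in clusters[b]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    order = next(iter(clusters.values()))
    return merges, order

def transpose(profiles):
    """Turn gene profiles (gene -> values per sample) into sample profiles."""
    names = list(profiles)
    n_cols = len(profiles[names[0]])
    return {f"sample{j + 1}": [profiles[g][j] for g in names] for j in range(n_cols)}

# Hypothetical expression values: g1/g2 rise across samples, g3/g4 fall.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
gene_merges, gene_order = cluster_order(genes)               # first dimension: genes
sample_merges, sample_order = cluster_order(transpose(genes))  # second: samples
print(gene_order, sample_order)
```

Reordering rows by `gene_order` and columns by `sample_order` is what places similar profiles next to each other in the clustered display.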

40
Two dimensional hierarchical (Eisen) Clustering
41
Publicly Available Software: CLUSTER and
TREEVIEW
  • Hierarchical Clustering
  • K means Clustering
  • Self Organizing Maps

42
Array Sequence Analysis
  • Promoter motif extraction
      • Cluster / classify genes with a common response
        pattern
      • Align upstream promoter regions (Gibbs sampler)
        or count over-represented X-mers
      • Develop a profile / motif from the set; search
        the genome for new candidates with the motif
      • Return to the array data and look for supporting
        evidence for the new members
      • Carry out an experiment to support the hypothesis
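Counting over-represented X-mers can be sketched as below. This is only an illustration of the idea: the sequences are invented, the scoring (fraction of cluster sequences containing a k-mer divided by a smoothed background fraction) is one simple choice among many, and `min_ratio` is an arbitrary threshold chosen for the example.

```python
from collections import Counter

def count_kmers(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

def overrepresented(cluster_seqs, background_seqs, k, min_ratio=2.0):
    """Return (k-mer, enrichment ratio) pairs, most enriched first."""
    fg = count_kmers(cluster_seqs, k)
    bg = count_kmers(background_seqs, k)
    hits = []
    for kmer, n in fg.items():
        fg_frac = n / len(cluster_seqs)
        # Add-one smoothing so k-mers absent from the background do not divide by zero.
        bg_frac = (bg[kmer] + 1) / (len(background_seqs) + 1)
        ratio = fg_frac / bg_frac
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda h: -h[1])

# Hypothetical upstream regions of co-clustered genes and of background genes.
cluster = ["AATGACTCAA", "CCTGACTCGG", "GTTGACTCTT"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG", "TTTTTTTTTT"]
hits = overrepresented(cluster, background, k=6)
print(hits)
```

In practice the enriched k-mers would then be turned into a profile and used to scan the genome for further candidates, as the list above describes.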

44
Features of Pathway Knowledgebase
  • Tagging of gene expression data (from microarrays,
    SAGE, etc.) onto simple clickable pathway
    maps.
  • In-silico manipulation of pathways, i.e. predict
    the alterations in expression levels in any given
    tissue or disease condition
  • Easier target prioritization

45
Publicly Available Software
GenMAPP: visualize gene expression data on maps
representing biological pathways and groupings of
genes.
53
Building blocks of SYSTEMS BIOLOGY
Organism / disease
Cell / tissue
Pathways
Gene/Protein/Target
Ligands/ Drugs
54
Predictive biology
Computer Simulation
Predictive Biological Models
Novel Insights
Bioinformatics
55
Systems Biology.

56
Applications
Clinical
PreClinical
Leads
  • Genotyping
  • ADE Screens

Discovery
  • Toxicology
  • Optimization
  • Screening
  • Validation
  • Optimization
  • Target Discovery
  • Target Validation

57
Limitations of Arrays
  • Do not necessarily reflect true levels of
    proteins: protein levels are regulated by
    translation initiation and degradation as well
  • Generally, do not prove new biology: they simply
    suggest genes involved in a process, a hypothesis
    that will require traditional experimental
    verification
  • Expensive! 20-100K to make your own / buy
    enough to get publishable data

58
The DNA Microarray Industry
  • The DNA microarray industry comprises
    companies that supply:
      • microarray slides
      • microarrayers (e.g. robotic spotters and
        photolithographic equipment)
      • scanners
      • software for designing and analysing microarrays
      • pre-spotted slides

59
Microarray Industry
  • Slide Suppliers
      • Corning
      • Surmodics
  • Suppliers of Microarrayers
      • Affymetrix
      • Amersham Pharmacia Biotech
      • Biorobotics
      • Cartesian Technologies
      • ESI
      • GeneMachines
      • Genomic Solutions
      • Nimblegen
      • Packard Instruments

60
Microarray Industry
  • Scanners
      • Affymetrix
      • Agilent Technologies
      • Applied Precision
      • Axon Instruments
      • GSI-Lumonics
      • Virtek
  • Suppliers of Pre-Spotted Arrays
      • Affymetrix
      • Agilent Technologies
      • Clontech Laboratories
      • GeneLogic
      • Incyte Genomics
      • Rosetta Inpharmatics

61
Microarray Industry
  • Microarray Bioinformatics
      • Affymetrix
      • Agilent Technologies
      • Axon Instruments
      • BioDiscovery
      • Clontech Laboratories
      • GeneLogic
      • GeneMachines
      • Rosetta Inpharmatics
      • Silicon Genetics
      • Spotfire
  • Microarray Bioinformatics in Bangalore
      • Strand Genomics
      • Cytogenomics / Siri Technologies
      • Jubilant Biosys
      • Genotypic
      • SysArris
      • Kshema Technologies
      • iSakthi Technologies
      • Avasthagen (Gene-x platform)
      • MWG The Genomic Company
      • Array Solutions

62
SERIAL ANALYSIS OF GENE EXPRESSION
63
What is SAGE?
  • Serial Analysis of Gene Expression
  • Method to quantify gene expression levels in
    samples of cells
  • Open system
      • Can potentially reveal expression levels of all
        genes: unbiased and comprehensive
      • Microarrays are closed, since they only tell you
        about the genes spotted on the array

Ref: Velculescu et al., Science 1995; 270:484-487
64
How does SAGE work?
3.(c) Discard loose fragments.
9. Sequence and record the tags and frequencies.
65
SAGE: Serial Analysis of Gene Expression
Simultaneous and quantitative comparison of
gene-specific sequence tags.
68
From ditags to counts
  • Locate the punctuation (CATG)
  • Extract ditags of length 20-26 between the
    punctuation
  • Discard duplicate ditags (including those in the
    reverse direction); they are probably PCR artifacts
  • Take the extreme 10 bases as the two tags, reversing
    the right-hand tag
  • Discard linker sequences
  • Count occurrences of each tag

SAGE software is available at http://www.sagenet.org
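The steps from ditags to counts can be sketched as follows. This is a minimal illustration and not the SAGE software itself: it assumes "reversing" the right-hand tag means taking its reverse complement, treats linker removal as an optional set of tags to skip (`linker_tags`, a hypothetical parameter), and uses made-up sequences.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def ditags_from_concatemer(seq, min_len=20, max_len=26):
    """Split a sequenced concatemer at CATG and keep pieces of ditag length."""
    parts = seq.split("CATG")
    # Only pieces flanked by CATG on both sides are candidate ditags.
    return [p for p in parts[1:-1] if min_len <= len(p) <= max_len]

def count_tags(concatemers, tag_len=10, linker_tags=()):
    """Count tags across concatemers, discarding duplicate ditags."""
    seen = set()
    counts = Counter()
    for seq in concatemers:
        for ditag in ditags_from_concatemer(seq):
            key = min(ditag, revcomp(ditag))  # same ditag read from either strand
            if key in seen:
                continue  # duplicate ditag: probably a PCR artifact
            seen.add(key)
            left = ditag[:tag_len]
            right = revcomp(ditag[-tag_len:])
            for tag in (left, right):
                if tag not in linker_tags:
                    counts[tag] += 1
    return counts

# Hypothetical tags and a concatemer in which the first ditag occurs twice.
tag_a, tag_b, tag_c = "ACGTACGTAA", "TTGGCCAATC", "GATTACAGAT"
ditag_1 = tag_a + revcomp(tag_b)
ditag_2 = tag_a + revcomp(tag_c)
concatemer = "CATG" + "CATG".join([ditag_1, ditag_2, ditag_1]) + "CATG"
tag_counts = count_tags([concatemer])
print(tag_counts)
```

The repeated ditag contributes only once, so `tag_a` is counted twice (once per distinct ditag) and the other two tags once each.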
69
What does the data look like?
70
Genes in Normal Kidney
  • GACTTCACGCC: Mouse kidney androgen-regulated protein
  • CTATTCCTCTCA: Plasma glutathione peroxidase
  • TGTAGCCTCAT: Na/K ATPase .chain
  • GGCCTTACTTC: Na/Pi transporter
  • Ref: Am. J. Physiol. 279: F383, 2000

72
From tags to genes
  • Collect sequence records from GenBank that are
    represented in UniGene
  • Assign sequence orientation (by finding the poly-A
    tail or poly-A signal, or from annotations)
  • Extract the 10 bases 3'-adjacent to the 3'-most CATG
  • Assign a UniGene identifier to each sequence with a
    SAGE tag
  • Record (for each tag-gene pair):
      • sequences with this tag
      • sequences in the gene cluster with this tag

Maps available at http://www.ncbi.nlm.nih.gov/SAGE
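Tag extraction and the tag-to-gene bookkeeping can be sketched as below. This is an illustration under stated assumptions, not the SAGEmap pipeline: sequences are assumed to be already oriented 5' to 3' with the poly-A tail trimmed, and the cluster identifiers `Hs.1` and `Hs.2` and all sequences are invented.

```python
from collections import Counter, defaultdict

def sage_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag_len bases just 3' of the 3'-most anchor site, or None."""
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None  # no anchoring site: this transcript yields no tag
    start = pos + len(anchor)
    tag = transcript[start:start + tag_len]
    return tag if len(tag) == tag_len else None

# Hypothetical (cluster id, oriented sequence) records.
seq_1 = "GG" + "CATG" + "AAACCCGGGT" + "TT" + "CATG" + "TTGACCTGAA" + "GGC"
seq_2 = "AC" + "CATG" + "TTGACCTGAA" + "T"
seq_3 = "CATG" + "GATTACAGAT" + "CC"
records = [("Hs.1", seq_1), ("Hs.1", seq_2), ("Hs.2", seq_3), ("Hs.2", "ACGTACGT")]

tag_to_genes = defaultdict(Counter)   # tag -> cluster -> sequences with this tag
cluster_sizes = Counter()             # cluster -> sequences in the cluster
for cluster_id, seq in records:
    cluster_sizes[cluster_id] += 1
    tag = sage_tag(seq)
    if tag is not None:
        tag_to_genes[tag][cluster_id] += 1

for tag, genes in tag_to_genes.items():
    for cluster_id, n in genes.items():
        print(tag, cluster_id, f"{n} of {cluster_sizes[cluster_id]} sequences")
```

Keeping both counts per tag-gene pair makes it possible to judge how consistently a cluster's sequences support a given tag.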
73
SAGE Data Analysis on NCBI
  • Tag to Gene mapping
  • Query page
  • Results page

74
From tags to genes
  • Ideal situation
      • one gene, one tag
  • True situation
      • one gene, many tags (alternative splicing,
        alternative polyadenylation)
      • one tag, many genes (conserved 3' regions)

75
Sequencing Errors
  • Estimated sequencing error rate
      • 0.7% per base (range 0.2% - 1%)
  • Affects
      • ditags in a SAGE experiment
        (can improve by using phred scores and discarding
        ambiguous sequences)
      • tag-gene mappings from GenBank
        (RNA better than EST)
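A quick calculation shows why a small per-base error rate still matters for short tags. The sketch below assumes the quoted rates are percentages per base (0.7%, range 0.2% to 1%) and that errors at different positions are independent; both are assumptions made for illustration.

```python
def p_error_free(length, per_base_error):
    """Probability that a read of the given length contains no sequencing error."""
    return (1.0 - per_base_error) ** length

for rate in (0.002, 0.007, 0.01):
    tag_ok = p_error_free(10, rate)      # one 10-base tag
    ditag_ok = p_error_free(20, rate)    # a 20-base ditag
    print(f"error rate {rate:.1%}: tag error-free {tag_ok:.3f}, "
          f"ditag error-free {ditag_ok:.3f}")
```

Under these assumptions, roughly 7% of 10-base tags would carry at least one error at the 0.7% rate, which can create spurious low-count tags.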

76
Reliable tag-gene assignments
77
Statistical Methods
  • Audic and Claverie, Genome Res 1997; 7:986-995
  • Chen et al., J Exp Med 1998; 9:1657-1668
  • Kal et al., Mol Biol of Cell 1999; 10:1859-1872
  • Michiels et al., Physiol Genomics 1999; 1:83-91
  • Stollberg et al., Genome Res 2000; 10:1241-1248
  • Man et al., Bioinformatics 2000; 16:953-959

78
Audic and Claverie
  • Main goal: confidence limits for differential
    expression
  • Use a Poisson approximation for the number of times x
    you see the same tag.
  • Put a uniform prior on the Poisson parameter to get
    the posterior probability of seeing the tag y times
    in a new experiment:
  • $p(y \mid x) = \dfrac{(x+y)!}{x!\, y!\, 2^{x+y+1}}$
  • Generalizes to unequal sample sizes
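The probability p(y | x) = (x+y)! / (x! y! 2^(x+y+1)) is easy to evaluate directly. The sketch below does so; the unequal-size form with library sizes `n1` and `n2` is included as an assumption (it reduces to the formula above when the sizes are equal), and the function names and example counts are hypothetical.

```python
from math import comb

def ac_prob(y, x, n1=1, n2=1):
    """P(y | x): probability of y counts in library 2 given x counts in library 1.

    n1 and n2 are the library sizes; with n1 == n2 this is
    (x + y)! / (x! y! 2^(x + y + 1)).
    """
    r = n2 / n1
    return comb(x + y, x) * r ** y / (1 + r) ** (x + y + 1)

def ac_upper_tail(y, x, n1=1, n2=1):
    """P(Y >= y | x): chance of at least y counts in the second library."""
    return 1.0 - sum(ac_prob(k, x, n1, n2) for k in range(y))

# Hypothetical example: a tag seen 2 times in one library and 12 in the other.
print(ac_prob(12, 2), ac_upper_tail(12, 2))
```

A small upper-tail probability suggests that the increase from x to y counts is unlikely to be sampling noise alone.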

79
Chen et al.
  • Assume
      • equal sample sizes
      • the tag has concentrations X, Y in the two samples
  • Look at $W = X/(X+Y)$
  • Use a symmetric Beta prior distribution with a
    peak near 0.5 (since most genes don't change)
  • Use Bayes' theorem to compute the posterior
    probability of a threefold difference in expression
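One way to make this concrete is sketched below. It is not taken from Chen et al.: it assumes that, given the total count x + y, the count x is binomial with parameter W = X/(X+Y), so a symmetric Beta(a, a) prior gives a Beta(a + x, a + y) posterior. The prior parameter `prior=3.0` is an arbitrary illustrative value, and a threefold difference is read as W >= 0.75 or W <= 0.25.

```python
import math

def beta_pdf(w, a, b):
    """Density of the Beta(a, b) distribution at w, for 0 < w < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(w) + (b - 1) * math.log(1 - w))

def prob_threefold(x, y, prior=3.0, steps=20000):
    """Posterior probability that expression differs at least threefold.

    Integrates the Beta(prior + x, prior + y) posterior over 0.25 < W < 0.75
    with the midpoint rule and returns the probability outside that interval.
    """
    a, b = prior + x, prior + y
    lo, hi = 0.25, 0.75
    h = (hi - lo) / steps
    inside = sum(beta_pdf(lo + (i + 0.5) * h, a, b) for i in range(steps)) * h
    return 1.0 - inside

# Hypothetical tag counts in two libraries of equal size.
print(prob_threefold(10, 10))   # similar counts: low probability
print(prob_threefold(40, 2))    # very different counts: high probability
```

The symmetric prior pulls small-count comparisons toward "no change", so a 3 versus 0 observation is treated far more cautiously than 30 versus 0.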

80
Unequal sample sizes
  • This analysis generalizes easily to the case of
    SAGE libraries of unequal size
  • Lal et al., Cancer Res 1999; 59:5403-5407
  • This method is used at the NCBI SAGEmap web site
    for online differential expression queries
  • http://www.ncbi.nlm.nih.gov/SAGE

81
Kal et al.
  • Assume the proportion of times you see a tag has a
    binomial distribution
  • Replace it with a normal approximation to compute
    confidence limits
  • Used at http://www.cmbi.kun.nl/usage
  • Equivalent to a chi-square test on a 2x2 table
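The equivalence with the chi-square test can be checked numerically. The sketch below assumes the z statistic uses the pooled proportion under the null hypothesis and the chi-square statistic is Pearson's, without continuity correction; the counts and library sizes are made up.

```python
import math

def z_statistic(x1, n1, x2, n2):
    """Two-proportion z statistic with a pooled variance estimate."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic for the table [[x1, n1-x1], [x2, n2-x2]]."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    rows = [n1, n2]
    cols = [x1 + x2, total - (x1 + x2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical tag: 30 of 50,000 tags in library 1, 12 of 60,000 in library 2.
z = z_statistic(30, 50_000, 12, 60_000)
chi2 = chi_square_2x2(30, 50_000, 12, 60_000)
print(z, z ** 2, chi2, two_sided_p(z))
```

The squared z statistic and the chi-square statistic agree up to rounding, which is the equivalence the last bullet refers to.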

82
Michiels et al.
  • First perform overall chi-square test to decide
    if the two SAGE libraries being compared are
    different.
  • Get significance by Monte Carlo simulation
  • Perform gene-by-gene chi-square tests and use
    them to rank genes in order of most likely to be
    different

83
Stollberg et al.
  • Assume binomial distributions
  • Model the binomial parameters as a sum of two
    exponentials
      • fit to the Zhang step function data
  • Simulate from this model, adding
      • sequencing errors
      • nonuniqueness of tags
      • nonrandomness of DNA sequences

84
Stollberg et al.
  • Key finding
      • Naively using the observed data to fit the model
        parameters cannot recover the observed data by
        simulation
      • Maximum likelihood estimates of the parameters
        that do recover the observed data look very
        different

85
Stollberg et al.
86
Man et al.
  • Compares specificity and sensitivity of different
    tests for differential expression
      • Audic and Claverie
      • Kal
      • Fisher's exact test
  • Monte Carlo simulation of experiments
  • Findings
      • Similar power at high abundance
      • Kal has highest power at low abundance

87
Questions
  • Sample size computations
      • How many tags should we sequence if we want to
        see tags of a given frequency?
      • How many tags should we sequence if we want to
        see a given percentage of tags?
  • How many tags are expressed in a sample?
  • What is the best method for identifying differential
    expression?

88
Additional SAGE references
  • Review
      • Madden et al., Drug Disc Today 2000; 5:415-425
  • Online Tools
      • Lash et al., Genome Res 2000; 10:1051-1060
      • van Kampen et al., Bioinformatics 2000;
        16:899-905
  • Comparison of SAGE and Affymetrix
      • Ishii et al., Genomics 2000; 68:136-143
  • Combine SAGE and custom microarrays
      • Nacht et al., Cancer Res 1999; 59:5464-5470
  • Mapping SAGE data onto genome
      • Caron et al., Science 2001; 291:1289-1292
  • Data mining the public SAGE libraries
      • Argani et al., Cancer Res 2001; 61:4320-4324

89
SAGE Summary
  • Advantages
      • Identify novel genes
      • Comprehensive
      • Quantitative
      • Cost-effective
      • Can be done in a small lab

90
SAGE Summary
  • Drawbacks
      • Labor-intensive (a single run takes 90 days)
      • No automation, not high-throughput
      • Difficult to replicate results
      • Limited analytic tools: tough to annotate /
        identify proteins from a 10-base tag
      • Is a tag sequence unique for its gene?

91
SAGE - companies
  • Technique licensed to Genzyme (Molecular Oncology
    division) by Johns Hopkins University
  • Silico Insights in Hyderabad works on tag
    annotation and pathway analysis
  • Mainly used in academic laboratories

92
  • THANK YOU. Any questions?
  • shriram_at_reametrix.com

93
BACKUP SLIDES
94
Creating Targets
Reverse Transcriptase
PCR amplification of DNA
More
in vitro transcription
95
RNA-DNA Hybridization
Targets (RNA)
probe sets on chip (DNA) (25 base
oligonucleotides of known sequence)
96
Non-Hybridized Targets are Washed Away
Targets (fluorescently tagged)
probe sets (oligos)
Non-bound ones are washed away
97
Experimental Design
  • Choice of reference: common (non-biologically
    relevant) reference, or paired samples?
  • Number of replicates: how many are needed? (How
    many are affordable?)
  • Are the replicate results going to be averaged or
    treated independently?
  • Is this a fishing expedition or a
    hypothesis-based experiment?

98
The Process
Building the Chip
  • MASSIVE PCR
  • PCR PURIFICATION and PREPARATION
  • PREPARING SLIDES
  • PRINTING
  • POST PROCESSING
Preparing RNA
  • CELL CULTURE AND HARVEST
  • RNA ISOLATION
  • cDNA PRODUCTION
Hybridising the Chip
  • PROBE LABELING
  • ARRAY HYBRIDIZATION
  • DATA ANALYSIS
99
Building the Chip
MASSIVE PCR
Full yeast genome: 6,500 reactions
PCR PURIFICATION and PREPARATION
IPA precipitation and EtOH washes, 384-well format
PREPARING SLIDES
Polylysine coating for adhering PCR products to
glass slides
PRINTING
The arrayer: a high-precision spotting device
capable of printing 10,000 products in 14 hrs,
with a plate change every 25 mins
POST PROCESSING
Chemically converting the positive polylysine
surface to prevent non-specific hybridization
100
Preparing RNA
CELL CULTURE AND HARVEST
Designing experiments to profile
conditions/perturbations/mutations, with carefully
controlled growth conditions
RNA ISOLATION
RNA yield and purity are determined by system.
PolyA isolation is preferable but total RNA is
usable. Two RNA samples are hybridized per chip.
cDNA PRODUCTION
Single strand synthesis or amplification of RNA
can be performed. cDNA production includes
incorporation of Aminoallyl-dUTP.
101
Hybridising the Chip
ARRAY HYBRIDIZATION
Cy3 and Cy5 RNA samples are simultaneously
hybridized to chip. Hybs are performed for 5-12
hours and then chips are washed.
DATA ANALYSIS
Ratio measurements are determined via
quantification of 532 nm and 635 nm emission
values. Data are uploaded to the appropriate
database where statistical and other analyses can
then be performed.
PROBE LABELING
Two RNA samples are labelled with Cy3 or Cy5
monofunctional dyes via a chemical coupling to
AA-dUTP. Samples are purified using a PCR
cleanup kit.
102
Analysis of SAGE Data: An Introduction
  • Kevin R. Coombes
  • Section of Bioinformatics

103
Outline
  • Description of SAGE method
  • Preliminary bioinformatics issues
  • Description of analysis methods introduced in
    early paper
  • Review of literature: statistics and SAGE

104
Genome Sequence Flood
  • Typical results from initial analysis of a new
    genome by the best computational methods
  • For 1/3 of the genes, we have a good idea what
    they are doing (high similarity to experimentally
    studied genes)
  • For 1/3 of the genes, we have a guess at what
    they are doing (some similarity to previously
    seen genes)
  • For 1/3 of genes, we have no idea what they are
    doing (no similarity to studied genes)

105
Gene expression
  • It's essential to study the complex interplay of all
    genes simultaneously
  • This study requires high-throughput, large-scale
    technologies
  • such as microarray technology

106
DNA MICROARRAY TECHNOLOGY
  • This technology was conceived to detect the expression
    of 1000s of genes simultaneously
  • Its potential applications include
      • Identification of complex diseases
      • Drug discovery and toxicology studies
      • Mutation and polymorphism detection
      • Pathogen detection
      • Differential expression of genes over time,
        between tissues and between disease states

107
Cluster analysis of genes in G1 and G2
Chaudhry et al., 2002
108
Other Software
Extraction of information from DNA chips using
promoter analysis technology: Genomatix
Software GmbH
109
SAGE and cancer
  • Ten SAGE libraries, two each from
      • normal colon
      • colon tumors
      • colon cancer cell lines
      • pancreatic tumors
      • pancreatic cell lines
  • Pooled each pair

Ref: Zhang et al., Science 1997; 276:1268-1272
110
Variability in SAGE libraries
111
Distribution of tags
  • 303,706 total tags
  • 48,471 distinct tags
  • Distribution
      • 85.9% seen up to 5 times (25% of mass)
      • 12.7% between 5 and 50 times (30%)
      • 0.1% between 50 and 500 times (26%)
      • 0.1% more than 500 times (19%)

Ref: Zhang et al., Science 1997; 276:1268-1272
112
How many tags were missed?
  • They simulated to find a 92% chance of detecting
    tags at 3 copies/cell
  • Using a binomial approximation
      • Get a 95% chance for 3 copies/cell
      • Only get a 63% chance for 1 copy/cell
  • Most of what they saw occurred at 1-5 copies per
    cell
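The binomial figures can be reproduced with a short calculation. The sketch below rests on an assumption that is not stated on the slide: that a cell holds about 300,000 transcripts and that about 300,000 tags are sequenced, so a transcript at c copies per cell has a per-tag probability of c / 300,000. The function names are hypothetical.

```python
import math

def p_detect(copies_per_cell, n_tags, transcripts_per_cell=300_000):
    """Chance of seeing a transcript at least once among n_tags sampled tags."""
    p = copies_per_cell / transcripts_per_cell
    return 1.0 - (1.0 - p) ** n_tags

def tags_needed(copies_per_cell, target, transcripts_per_cell=300_000):
    """Smallest number of tags giving at least the target detection chance."""
    p = copies_per_cell / transcripts_per_cell
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

print(f"3 copies/cell: {p_detect(3, 300_000):.2f}")
print(f"1 copy/cell:   {p_detect(1, 300_000):.2f}")
print(f"tags for 95% at 1 copy/cell: {tags_needed(1, 0.95):,}")
```

Under these assumptions the detection chance is about 1 - e^(-c), which gives roughly 95% for 3 copies per cell and 63% for 1 copy per cell, and shows that reliably catching single-copy transcripts would need about three times as many tags.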

113
Differential Expression
  • Found 289 tags differentially expressed between
    normal colon and colon cancer (181 decreased, 108
    increased)
  • Method: Monte Carlo simulation
      • 100,000 simulations per transcript for the relative
        likelihood of seeing the observed difference
      • Used the observed distribution of transcripts to
        simulate 40 experiments

Ref: Zhang et al., Science 1997; 276:1268-1272
114
Sensitivity
  • Claim: 95% chance of detecting a 6-fold difference
  • Method: Monte Carlo
      • 200 simulations, assuming an abundance of 0.0001 in
        the first sample and 0.0006 in the second sample

Ref: Zhang et al., Science 1997; 276:1268-1272
115
Weaknesses in Analysis
  • Failed to account for intrinsic variability in
    samples (which changes depending on abundance) in
    assessing significance
  • Monte Carlo used the observed distribution, which is
    definitely not the true distribution.
  • Sensitivity only measured at one abundance level.

116
Why use Microarrays?
  • What genes are expressed in a cell?
  • What genes are Present/Absent in the experiment
    vs. control?
  • Which genes have increased/decreased expression
    in experiment vs. control?
  • Which genes have biological significance?