Title: Beyond the Human Genome: Transcriptomics
1Beyond the Human Genome: Transcriptomics
- Dr Jen Taylor
- Henry Wellcome Centre for Gene Function
- Bioinformatics
- Department of Statistics
- taylor_at_stats.ox.ac.uk
2Beyond the Human Genome
Mapping the Book of Life
- 1995: Human Genome sequencing begins in earnest
- 1999: Human Genome
- 2000: First Draft Human Genome
- 2003: Essential Completion Human Genome
Gene number estimates:
- approx. 140,000 genes
- 30,000 to 40,000 genes ??
- 24,195 genes !!!???
Image: Commemorative stained glass window for F.C. Crick, designed by Maria McClafferty (Photograph: Paul Forster). Gonville & Caius College, Cambridge, UK.
3Beyond the Human Genome
Gene Number ? Complexity
Gene
4Introduction
- The scope of transcriptomics: a definition of the transcriptome
Part I: Observing the transcriptome
- Experimental methodology
- Data analysis
Part II: Using the transcriptome
- The regulation of the transcriptome
- The transcriptome and the genome
- The transcriptome and the proteome
Beyond the Human Transcriptome
5Transcriptome
"Transcriptome: the mRNAs expressed by a genome at any given time." (Abbott, 1999)
6Central Dogma of Molecular Biology
- mRNA: single-stranded RNA molecule
- Complementary to DNA
- Processed (spliced and polyadenylated) RNA transcript
- Carries the sequence of a gene out of the nucleus into the cytoplasm, where it can be translated into a protein structure
Image: Access Excellence, National Institutes of Health
7Transcriptome: An evolving definition
- "(the population of) mRNAs expressed by a genome at any given time" (Abbott, 1999)
- "The complete collection of transcribed elements of the genome." (Affymetrix, 2004)
- mRNAs: 35,913 transcripts (including alternative spliced variants)
- Non-coding RNAs
- tRNAs (497 genes)
- rRNAs (243 genes)
- snmRNAs (small non-messenger RNAs)
- microRNAs and siRNAs (small interfering RNAs)
- snoRNAs (small nucleolar RNAs)
- snRNAs (small nuclear RNAs)
- Pseudogenes ( 2,000)
8The human transcriptome
High-density oligonucleotide arrays across 11 different cell lines:
- 70% of transcripts non-coding
- 79-88% have multiple transcripts
- 90% of transcribed nucleotides outside annotated exons
Kapranov et al., 2002
The dimensions of the unique transcriptome?? >>> current 40,000 estimate
Kampa et al., Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Research. 2004
9Transcriptomics
- Scope
- the population of functional RNA transcripts
- the mechanisms that regulate the production of RNA transcripts
- dynamics of the transcriptome (time, cell type, genotype, external stimuli)
- Definition
- The study of the characteristics and regulation of the functional RNA transcript population of a cell, cells or organism at a specific time.
11Observing the transcriptome
High-throughput friendly
Genome
Predicts Biology
Regulatory network
Transcriptome
Context dependent and dynamic
Proteome
Li et al., 2004
12Publications Expression Profiling vs Proteomics
Data from PubMed
13Observing the transcriptome?
Classic Human Transcriptome Profiling Studies: Transcriptome reflects Biology
- Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999. (ALL: acute lymphoblastic leukemia; AML: acute myeloid leukemia)
- Scherf et al., A gene expression database for the molecular pharmacology of cancer. Nature Genetics 2000. (60 human cancer cell lines)
14Observing the transcriptome
- Focussed Experimental Approaches
- Northern Blotting Analysis
- Real time PCR (quantitative or semi-quantitative)
- High-throughput Approaches
- Closed System Profiling
- Microarray expression profiling
- Open System Profiling
- Serial analysis of gene expression (SAGE)
- Massively Parallel Signature Sequencing (MPSS)
15Red: increase of Cy5 sample transcripts
Green: increase of Cy3 sample transcripts
Yellow: equal abundance
Limit of Detection: 1 in 30,000 transcripts; 20 transcripts/cell
16Experimental overview
18Platforms and Formats
- Isotope
- Nylon; cDNA (300-900 nt)
- Two-colour
- Glass
- cDNA or Oligo (80 nt)
- 500 to 11,000 elements
- Affymetrix
- Silicon; oligo (20 nt)
- 22,000 elements
- Tissue Arrays
- Glass
- Tissue Discs (20-150)
19Affymetrix GeneChip
Limits: 1 in 100,000 transcripts; 5 transcripts/cell
20http://www.affymetrix.com
21Affymetrix
- Gene Expression Arrays: Transcripts/Genes
- Arabidopsis Genome: 24,000
- C. elegans Genome: 22,500
- Drosophila Genome: 18,500
- E. coli Genome: 20,366
- Human Genome U133 Plus: 47,000
- Mouse Genome: 39,000
- Yeast Genome: 5,841 (S. cerevisiae), 5,031 (S. pombe)
- Rat Genome: 30,000
- Zebrafish: 14,900
- Plasmodium/Anopheles: 4,300 (P. falciparum), 14,900 (A. gambiae)
- Barley (25,500), Soybean (37,500; 23,300 pathogen), Grape (15,700)
- Canine (21,700), Bovine (23,000)
- B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)
22Microarray and GeneChip Approaches
- Advantages
- Rapid
- Method and data analysis well described and supported
- Robust
- Convenient for directed and focussed studies
- Disadvantages
- Closed system approach
- Difficult to correlate with absolute transcript number
- Sensitive to alternative splicing ambiguities
23Serial Analysis of Gene Expression (SAGE)
- The principles
- Velculescu et al., Science 1995
- A transcript (new or novel) can be recognised by a small subset (e.g. 14) of its nucleotides: a tag
- Linking tags allows for rapid sequencing.
- Open system for transcript profiling
- Modified SAGE methods
- LongSAGE (21 nt)
- SAGE-lite, micro-SAGE, mini-SAGE
- RASL/DASL methods (5' and 3' Tags)
[Figure: a 14 nt tag is taken from each polyadenylated transcript (AAAAAAAAA 3'); the tags are linked together and sequenced, e.g. AGCTTGAACCGTGACATCATGGCCATTGGCCCCAATTGAGACAGTGAGTTCAATGC]
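The tag principle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published SAGE protocol: it assumes the anchoring-enzyme site is CATG (the NlaIII site commonly used in SAGE), takes the 3'-most site in each transcript, and reads a 14 nt tag counted from the start of that site. The function names and the toy transcripts are hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"   # assumed anchoring-enzyme recognition site
TAG_LENGTH = 14        # tag length in nt, counted from the start of the site


def extract_tag(transcript, site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag at the 3'-most anchoring site, or None if no usable site exists."""
    position = transcript.rfind(site)
    if position == -1:
        return None  # transcript lacks the recognition sequence
    tag = transcript[position:position + tag_length]
    if len(tag) < tag_length:
        return None  # site too close to the 3' end for a full-length tag
    return tag


def count_tags(transcripts):
    """Digital expression profile: how many times each tag is observed."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


# Toy transcripts (hypothetical sequences, poly-A tails omitted)
transcripts = [
    "GGATCCATGAAGCTTGAACCGTTTT",
    "GGATCCATGAAGCTTGAACCGTTTT",
    "TTTTCATGGCCATTGGCCCCAAGG",
    "ACGTACGTACGTACGT",          # no CATG site, so it yields no tag
]
profile = count_tags(transcripts)
```

The last toy transcript illustrates one limitation of the approach: a transcript without the recognition sequence is never counted.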
24SAGE
- Advantages
- Potential open system method: new transcripts can be identified
- Accuracy of unambiguous transcript observation
- Digital output of data
- Quantitative and qualitative information
- Disadvantages
- Characterising novel transcripts is often computationally difficult from short tag sequences
- Tag specificity (recently increased length to 21 bp)
- Length of tags can vary (restriction enzyme activity variable with temperature)
- A subset of transcripts do not contain the enzyme recognition sequence
- Sensitive to a subset of alternative splice variants
26Biological question
Sample Attributes
Experimental design Platform Choice
16-bit TIFF Files
Microarray experiment
(Rspot, Rbkg), (Gspot, Gbkg)
Image analysis
Normalization
Statistical Analysis
Clustering
Data Mining
Pattern Discovery
Classification
Biological verification and interpretation
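As a minimal sketch of the step from image analysis to normalization: each spot yields foreground and background intensities per channel, (Rspot, Rbkg) and (Gspot, Gbkg), which can be turned into a background-corrected log ratio. The simple subtraction and the global median centring used here are assumptions chosen for illustration; the slides do not specify a method, and the function names and spot values are hypothetical.

```python
import math
from statistics import median


def log_ratio(r_spot, r_bkg, g_spot, g_bkg):
    """Background-corrected log2(red/green) ratio for one spot, or None if unusable."""
    red = r_spot - r_bkg
    green = g_spot - g_bkg
    if red <= 0 or green <= 0:
        return None  # signal not above background in at least one channel
    return math.log2(red / green)


def median_normalise(log_ratios):
    """Centre the usable log ratios so that their median is zero."""
    usable = [m for m in log_ratios if m is not None]
    centre = median(usable)
    return [None if m is None else m - centre for m in log_ratios]


# Hypothetical spots: (Rspot, Rbkg, Gspot, Gbkg)
spots = [
    (900, 100, 300, 100),   # red channel four-fold higher
    (500, 100, 500, 100),   # equal abundance
    (200, 100, 500, 100),   # green channel four-fold higher
    (90, 100, 400, 100),    # red below background, spot dropped
]
raw = [log_ratio(*spot) for spot in spots]
normalised = median_normalise(raw)
```

Working on the log2 scale makes a four-fold increase and a four-fold decrease symmetric around zero (+2 and -2).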
27Analysis
- Liver: 47,000 x 2 x 2 datapoints = 188,000
- Brain: 47,000 x 2 x 2 datapoints = 188,000
- Lymphocyte: 47,000 x 2 x 2 datapoints = 188,000
28Analysis
- Essential problem
- Given a large dataset with technical and biological noise
- Find
- A) Transcript patterns (common themes or differences)
- measures of robustness or some idea of uncertainty
- B) Similarities or differences between samples on a global/multi-gene level
29Analysis
Brain
Liver
Lymphocytes
Which transcripts are different?
What are the patterns?
30Biologist's Nightmare, Statistician's Playground
- Characteristics of the expression profiling data
- High dimensionality
- Sample number (n) low and observation number (p) high
- Non-independence of observations
- Complex patterns: visualisation and extraction
- Incorporation of contextual information
- Standardisation and data sharing
- Integration with other data types
31Analysis Methods
- Classical parametric and non-parametric statistical tests for hypothesis testing
- Unsupervised clustering algorithms
- Hierarchical clustering
- K-means and Self-Organising Maps
- Classification
- e.g. Machine learning and Linear discriminant analysis
- Dimensionality Reduction or Principal Component Analysis
- e.g. Gene Shaving and Multi-dimensional Scaling
- Probabilistic Modelling
- Dynamic Bayesian Networks
- Markov Models
32Analysis Methods
- Classical Parametric Statistical Analysis
Tools: t-test, ANOVA, Mann-Whitney U test, Fold Change
Liver
Brain
Lymphocyte
33Analysis Methods
- Classical Parametric Statistical Analysis
- (P < 0.01): 20,000 transcripts -> 200 transcripts (20,000 x 0.01)
- Difficulties
- Assumes that observations are normally distributed and independent
- Statistical significance does not equal biological significance
- Appropriate multiple testing corrections are difficult
???
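The multiple-testing problem can be made concrete with a short sketch. Testing 20,000 transcripts at a 0.01 threshold is expected to pass about 200 by chance alone when nothing truly differs. The two corrections shown, Bonferroni and Benjamini-Hochberg, are standard choices added here for illustration; the slides do not name a specific correction, and the p-values are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Tests expected to pass the threshold by chance when no transcript differs."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr):
    """Indices of the tests called significant at the given false discovery rate."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / n * fdr:
            cutoff_rank = rank  # keep the largest rank that satisfies the condition
    return sorted(order[:cutoff_rank])


print(expected_false_positives(20_000, 0.01))
print(bonferroni_threshold(20_000, 0.01))

# Hypothetical p-values for five transcripts
p_values = [0.001, 0.045, 0.025, 0.2, 0.008]
significant = benjamini_hochberg(p_values, fdr=0.05)
print(significant)
```

Bonferroni controls the chance of any false positive and is very strict at this scale; Benjamini-Hochberg instead controls the expected proportion of false positives among the transcripts reported.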
34Analysis Methods
Clustering Approaches: divide or group genes/samples into clusters, based on similarities and differences. The number of groups is user defined.
Algorithms: Hierarchical clustering, K-means clustering, Self-organising maps
35Distance Metrics
Distance between 2 expression vectors (profiles plotted against Time)
- Euclidean: 1.4; Pearson(r-1): -0.90
- Euclidean: 4.2; Pearson(r-1): -1.00
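A minimal sketch of the two metrics, assuming each transcript is represented by a list of abundance values over time. The example profiles are hypothetical and are not the ones plotted on the slide; the sign convention behind the slide's "Pearson(r-1)" column is not reproduced here, only the plain correlation coefficient r.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical time courses: same shape, different magnitude
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [3.0, 6.0, 9.0, 6.0]

distance = euclidean_distance(gene_a, gene_b)
correlation = pearson_r(gene_a, gene_b)
```

The two profiles are far apart in Euclidean terms yet perfectly correlated, which is why the choice of metric changes which transcripts end up clustered together.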
36Distance Metric
Transcription Factor Transcript
Target Transcript 1 Target Transcript 2
37Hierarchical Clustering
g1 is most like g8
g4 is most like g1, g8
38Hierarchical Tree
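The tree-building idea ("g1 is most like g8", then "g4 is most like g1, g8") can be sketched as a simple agglomerative procedure: repeatedly merge the two closest clusters. Euclidean distance and average linkage are assumptions made for this sketch; the slides do not state which distance or linkage was used, and the expression profiles are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    distances = [euclidean(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b]
    return sum(distances) / len(distances)


def hierarchical_clustering(profiles):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical expression profiles, named after the genes on the slide
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g8": [1.1, 2.1, 3.1],
    "g4": [1.5, 2.5, 3.5],
    "g5": [9.0, 1.0, 0.0],
}
merges = hierarchical_clustering(profiles)
```

Reading `merges` from first to last gives the tree from the leaves upward: the earliest merges join the most similar transcripts, and the merge distances give the branch heights.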
39Clustering Case Study
- Sorlie et al., 2001
- Breast tissue subtypes
- Hierarchical clustering
40K-means clustering
Partition or centroid algorithms
Step 1: User specifies K clusters (K = 3)
[Figure: Brain Expression Level plotted against Liver Expression Level, with the three initial centroids marked x]
41K-means clustering
Step 2: Using Euclidean distance, the nearest points are assigned to clusters (k)
Step 3: New centroids calculated
[Figure: K = 3, updated centroids marked x]
42K-means clustering
Step 4: Points re-assigned to nearest centroid
Step 5: New centroids calculated
[Figure: K = 3]
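The five steps can be sketched as a loop that alternates between assigning points to the nearest centroid and recomputing the centroids, stopping when the assignments no longer change. This is a minimal sketch: the starting centroids are supplied by the caller rather than chosen at random, and the data points (pairs of brain and liver expression levels) are hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    """Run k-means from the given starting centroids; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iterations):
        # Steps 2 and 4: assign every point to its nearest centroid
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Steps 3 and 5: recompute each centroid from its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = mean_point(members)
    return labels, centroids


# Hypothetical (brain, liver) expression levels and K = 3 starting centroids
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (8.8, 1.2)]
start = [(0.0, 0.0), (5.0, 4.0), (10.0, 0.0)]
labels, centroids = k_means(points, start)
```

The result depends on K and on the starting centroids, so in practice the algorithm is usually run from several starting positions.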
43Classification
[Figure axes: Transcript A, Transcript B]
- K-nearest neighbour methods (KNN)
- Linear Discriminant Analysis (LDA)
- Machine Learning: Support Vector Machines, Neural Network Analysis
Adapted from Florian Markowetz
44Classification
Training Set: 2/3 sample set
Test Set: 1/3 sample set
Define Classification Rule: Linear Discriminant Analysis, KNN
[Figure axes: Gene A, Gene B]
45Classification: More complex classifiers
[Figure axes: Gene A, Gene B]
KNN voting scheme (k = 3): use the three closest points to classify
Adapted from Florian Markowetz
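The voting scheme can be sketched as follows: find the k training samples closest to the new sample and return the majority class. Euclidean distance on two gene measurements is an assumption for this sketch, and the training samples and class labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, training_set, k=3):
    """Classify a sample by majority vote among its k nearest training samples."""
    by_distance = sorted(training_set, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((Gene A level, Gene B level), class label)
training_set = [
    ((1.0, 1.0), "class 1"),
    ((1.5, 2.0), "class 1"),
    ((2.0, 1.5), "class 1"),
    ((6.0, 6.5), "class 2"),
    ((7.0, 6.0), "class 2"),
    ((6.5, 7.5), "class 2"),
]

prediction_low = knn_classify((1.2, 1.4), training_set)
prediction_high = knn_classify((6.4, 6.6), training_set)
```

With two classes, an odd k such as 3 avoids tied votes. To estimate how well the rule generalises, it would be built on the training set only and then applied to the held-out test set.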
46Probabilistic Modelling
- Incorporate dependencies and prior knowledge into the identification of patterns/clusters
- relationships in time between samples
- relationships between genes
- Handle measures of uncertainty well
- Conceptually simple, consideration needed on implementation
- Markov modelling
- Dynamic Bayesian networks
47Analysis Methods
- Classical parametric and non-parametric statistical tests for hypothesis testing
- Unsupervised clustering algorithms
- Hierarchical clustering
- K-means and Self-Organising Maps
- Classification
- Machine learning and Linear discriminant analysis
- Dimensionality Reduction or Principal Component Analysis
- Gene Shaving and Multi-dimensional Scaling
- Probabilistic Modelling
- Dynamic Bayesian Networks and Pattern recognition
- Markov Models
49. to be continued.
51Regulation of Gene Expression
Abundance (transcript): Rate of Transcription, Rate of Decay
- Transcription
- Protein/DNA interactions
- cis and trans regulatory sequence motifs
- chromatin structure
- Methylation
- Decay
- Protein/RNA interactions
- cis-acting regulatory motifs
- secondary structure
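One way to make the link between abundance, transcription and decay concrete is a simple kinetic sketch. It assumes a constant transcription rate and first-order decay, dA/dt = transcription_rate - decay_rate * A, which is an illustrative model and not one stated on the slides. Under that assumption the steady-state abundance is the ratio of the two rates and the half-life is ln(2) divided by the decay rate. All names and numbers are hypothetical.

```python
import math


def steady_state_abundance(transcription_rate, decay_rate):
    """Abundance at which production and decay balance."""
    return transcription_rate / decay_rate


def half_life(decay_rate):
    """Time for a transcript pool to halve once production stops."""
    return math.log(2) / decay_rate


def abundance_at(time, initial, transcription_rate, decay_rate):
    """Closed-form solution of dA/dt = transcription_rate - decay_rate * A."""
    steady = steady_state_abundance(transcription_rate, decay_rate)
    return steady + (initial - steady) * math.exp(-decay_rate * time)


# Hypothetical transcript: 10 copies made per hour, decay rate 0.5 per hour
steady = steady_state_abundance(10.0, 0.5)
t_half = half_life(0.5)
halfway = abundance_at(t_half, 0.0, 10.0, 0.5)
```

In this model the decay rate alone sets how quickly abundance responds to a change in transcription: starting from zero, the transcript reaches half of its steady-state level after one half-life.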
52Regulation of Transcription
Wray et al., 2003
53Regulation of Decay
- Stabilisation facilitates rapid increase in potential protein production
- Destabilisation facilitates precise time and dose control of transcripts
[Figure: two plots of Abundance against Time, labelled Stable and Decay]
- Sequence-mediated mRNA decay: AU-rich elements (AREs)
- 3' UTR, 50 to 150 nucleotides
- usually multiple copies (e.g. AUUUA x 5)
- protein recruitment for destabilisation
- size and content variation (functionally critical motif unknown)
- >30% of vertebrate homologous mRNAs have highly conserved elements in the 3' UTR
- often sequence and position
54The importance of the decay process
- BMP2 (bone morphogenetic protein 2): developmentally critical, highly conserved protein in vertebrates (Fritz et al., 2004)
- 3' UTR conservation
- 73/100 nucleotides, 450 myr evolution
- 95% within mammals
- Cancer related genes
- C-fos, C-myc, C-jun, MMP-13, Cyclooxygenase-2, Cyclin D, Cyclin E, Cyclins A and B, Cdk inhibitors, DNA methyltransferase 1.
- (Review: Audic and Hartley, 2004)
55Regulation of Transcription
Wray et al., 2003
56Regulation of Transcription
Diverse orientations, structure and functional
properties of regulatory modules
Wray et al., 2003
57Regulation of the transcriptome
- Finding regulatory elements using co-abundant transcripts
Assumption: a shared abundance profile (same cluster) implies shared regulatory machinery
Pennacchio and Rubin, 2001
59The transcriptome and the genome
- Using the genome to infer/observe the transcriptome
- Construction of whole genome/transcriptome arrays and SAGE tags
- Using sequence features to predict gene expression
- Beer and Tavazoie. Predicting gene expression from sequence. Cell 2004
- Using chromatin structure to predict regulation of gene expression
- Sabo et al. Genome-wide identification of DNaseI hypersensitive sites. PNAS 2004
- Quantitative trait loci mapping
- Morley et al., Genetic analysis of genome-wide variation in human gene expression. Nature 2004
- Schadt et al., Genetics of gene expression surveyed in maize, mouse and man. Nature 2003
60Transcriptome Genome
- Beer and Tavazoie, Cell. 2004
Abundance profile
Predict potential gene expression patterns
Transcription factor binding site
61Transcriptome Genome
- Beer and Tavazoie, Cell. 2004
AND Logic, OR Logic
AND Logic
OR Logic, NOT Logic
Combinatorial patterns help identify groups of
transcripts predicted to show similar abundance
profiles
Solid: actual expression; Dashed: predicted
63The transcriptome and the proteome
- Functional annotations of co-abundant genes
- Yang et al., 2003. Decay rates of human mRNAs: correlation with functional characteristics and sequence attributes. Genome Research.
- Co-ordinated patterns of decay rates within functional classes of transcripts
- Transcription factor functional classes have fast-decaying mRNAs (<2 hr half lives).
- Transcripts of multi-subunit proteins have correlated decay patterns and rates
64The transcriptome and the proteome
- Do they agree?
- Studies of direct correlation between mRNA abundance and protein abundances
- (r of about 0.6) (Hegde et al., 2003)
- Biological Issues
- Post-translational modifications
- Protein stability and folding
- Alternative splicing products
- Technical Issues
- Inter-platform variability (microarray and RT-PCR, r of about 0.8)
- Protein abundance measures: 2D gel electrophoresis
65The transcriptome and the proteome
- The integration of transcriptomics and proteomics
Hegde et al., 2003
Synergistic approaches to biological problems
using both transcriptomics and proteomics
66Beyond the Human Transcriptome
- Challenges for the Future (short and long term)
- Integration of different datatypes
- sequence, exon structure, transcript abundance, protein abundance and function
- Dealing with alternative splice variants
- The regulatory processes behind any given RNA abundance
- Dealing with gene ontologies in a quantitative manner
67Beyond the Human Transcriptome
- Future Directions
- Open systems for comprehensively cataloguing the transcriptome
- between tissues/cells/developmental time points
- between individuals
- Variation of transcriptome between individuals
- coding variants, epigenetic variation and inheritance
- Clinical deployment of transcriptome profiling approaches in diagnostics and pharmacogenetics
- Human Regulatory Network Resources for Tissues
68Acknowledgements
Oxford Centre for Gene Function
Jotun Hein, Chris Holmes, Gerton Lunter, Lizhong Hao, Ben Holtom, Karen Lees
http://www.stats.ox.ac.uk/taylor/Presentations