Title: Inside the Genome
1Inside the Genome
22001 The Human Genome
The club resident JD Watson Back2back with DJ.
Venter and
International Human Genome Sequencing
Consortium, Nature 409, 860-921 (2001)
Venter et al., Science 291, 1304-1351 (2001)
3Prologue: the RNA world, the dark matter of genomics
- How many coding genes are in the human genome?
- The Bet of 2000
- Mean: 61,710
- Range: 30,000 to 150,000
- By the end of the genome project, the estimated
number of human protein-coding genes had declined to
only 25,000
- What is the source of that discrepancy?
- EST-based estimation vs. whole-genome annotation
4RNA revolution
- The majority of the transcriptional output comes
from non-coding RNA
- An average of 10% of the human genome (compared
with 1.5% exonic sequence) is transcribed
(Cheng et al. 2005)
- Or even more: 62% of the mouse genome is
transcribed (FANTOM3, Science 2005)
5Various RNAs A partial list
- Messenger RNA (mRNA)
- Ribosomal RNA (rRNA)
- Transfer RNA (tRNA)
- Small nuclear RNA (snRNA)
- Small nucleolar RNA (snoRNA)
- Short interfering RNA (siRNA)
- MicroRNA (miRNA)
6RNAs are not merely the intermediary cousins of
proteins: the central dogma of molecular biology
revisited
[Diagram labels: Genome, Transcriptome, Proteome, miRNA, Regulation by proteins, Regulation by RNA]
7Research in Biology is complex
- Deciphering Biological Systems
- The advantage (what makes this quest feasible)
and the hindrance (what makes this quest
inherently difficult) are both explained by
evolution.
8The hindrance: topological entanglement of
functional interconnections
- The difficulties in our research fundamentally
owe their complexity to the designer: natural
selection.
- What is it: a robot or a UFO?
- The reason lies in the profound difference
between systems designed by natural selection
and those designed by intelligent engineers
(Langton 1989, Artificial Life).
9- Bottom line: we investigate an outrageously
complex weave of interconnections
- The textbook networks represent only the tip of
the iceberg.
- miRNAs and regulomics
- microRNAs are expected to represent 1% of
predicted genes (Lim et al., 2003)
- Lewis et al. (2003) estimate an average of five
targets per miRNA
- Many targets are transcription factors: miRNAs
regulate the regulators
10The advantage: universal homology, which
enables comparative biology.
- Bottom line: research in biology advances
through a reductionist approach, using simple
model organisms to infer the functionality of
homologous systems.
11Human genome statistics
- 2.91 billion base pairs
- 24,000 protein-coding genes (>30,000 non-coding genes ???)
- 1.5% exons (127 nucleotides)
- 24% introns (3,000 nucleotides)
- 75% intergenic (no genes)
- Repetitive elements rule (45% dispersed repeats)
- Average size of a gene is 27,894 bases
- A gene contains an average of 8.8 exons; titin contains 234 exons
- Average of 4 different proteins per gene
(alternative splicing)
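The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope calculation: it assumes the per-exon length and the per-gene averages can simply be multiplied, which the slide does not claim, and all variable names are illustrative.

```python
# Back-of-envelope check of the slide's figures (all inputs taken from the slide).
GENOME_BP = 2.91e9        # total base pairs
GENES = 24_000            # protein-coding genes
EXONS_PER_GENE = 8.8      # average exons per gene
EXON_LEN = 127            # nucleotides per exon, as quoted on the slide
AVG_GENE_LEN = 27_894     # average gene size in bases

# Assumption: every gene has the average exon count and exon length.
exonic_bp = GENES * EXONS_PER_GENE * EXON_LEN
# Assumption: every gene has the average gene size (exons plus introns).
genic_bp = GENES * AVG_GENE_LEN

exonic_fraction = exonic_bp / GENOME_BP
genic_fraction = genic_bp / GENOME_BP

print(f"exonic fraction: {exonic_fraction:.2%}")
print(f"genic fraction (exons + introns): {genic_fraction:.2%}")
```

Under these assumptions the exonic fraction comes out at roughly 0.9% and the genic fraction at roughly 23%, the same order of magnitude as the 1.5% and 1.5% + 24% quoted above. The inputs are averages of different kinds, so closer agreement should not be expected.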
12Detecting genes in the human genome
- Gene finding methods
- Ab initio: use general knowledge of gene
structure (rules and statistics). The challenge:
small exons in a sea of introns
- Homology-based. The problem: will not detect
novel genes
13Genscan (ab initio)
- Based on a probabilistic model of gene
structure
- Takes into account: promoters, gene
composition (exons/introns), GC content, splice
signals
- Goes over all 6 reading frames
Burge and Karlin, 1997, Prediction of complete
gene structure in human genomic DNA, J. Mol.
Biol. 268
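To make "goes over all 6 reading frames" concrete, here is a minimal sketch that scans both strands in three frames each for naive ATG-to-stop open reading frames. It is not Genscan's method (Genscan uses a probabilistic gene model, and this toy ignores introns, promoters and splice signals entirely); the function names and the `min_codons` parameter are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for each ATG...stop stretch in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3         # end is exclusive, stop codon included
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Scan 3 frames on each strand; minus-strand coordinates refer to the
    reverse-complemented sequence, not to the original string."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results


print(six_frame_orfs("CCATGAAATTTTAGCC"))
```

In a real eukaryotic genome this kind of scan finds few genes on its own, which is the "small exons in a sea of introns" problem: coding sequence is interrupted by introns, so a gene model has to combine frame information with splice and composition signals.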
14Splicing
15Eukaryotic splice sites
Poly-pyrimidine tract
16CpG islands: another signal
- CpG islands are regions of the genome with a
higher frequency of CG dinucleotides (not
base pairs!) than the rest of the genome
- CpG islands often occur near the beginning of
genes, possibly related to binding of the
transcription factor Sp1
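One common way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio together with GC content. The sketch below assumes that formulation; the thresholds (at least 200 bp, GC content above 50%, observed/expected above 0.6) are widely used conventions rather than values from this lecture, and the function names are illustrative.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # CG dinucleotides on this strand
    gc_content = (c + g) / n
    expected = c * g / n           # expected CG count if C and G were independent
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC-content and obs/exp thresholds to one window."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CCCCGGGG"))
print(looks_like_cpg_island("CG" * 100))
```

In practice the test would be applied to sliding windows along a chromosome, and windows that pass would be merged into islands.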
17Gene Ontology
- GO describes proteins in terms of:
- biological process (e.g. induction of apoptosis by external
signals)
- cellular component (e.g. membrane fraction)
- molecular function (e.g. protein kinase)
18Comparative proteome analysis
Functional categories based on GO
19Comparative proteome analysis
- Humans have more proteins involved in
cytoskeleton, immune defense, and transcription
20Evolutionary conservation of human proteins
???
21Horizontal (lateral) gene transfer
- Lateral Gene Transfer (LGT) is any process in
which an organism transfers genetic material to
another organism that is not its offspring
22- Mechanisms
- Transformation
- Transduction (phages/viruses)
- Conjugation
23Bacteria to vertebrate LGT detection
- E-value of the bacterial homolog is 10^9 times (nine
orders of magnitude) better than that of the non-vertebrate
eukaryotic homolog

Human query hit      E-value
Frog ..              4e-180
Mouse                1e-164
E. coli ..           7e-124
Streptococcus ..     9e-71
Worm ..              0.1
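The E-value criterion above can be written as a small filter. This is a sketch under stated assumptions, not the consortium's pipeline: the grouping of each hit into "vertebrate", "bacteria" or "non-vertebrate" is supplied by hand, the choice to flag a query that has no non-vertebrate hit at all is mine, and the function names are hypothetical.

```python
import math


def orders_better(evalue_a, evalue_b):
    """Number of orders of magnitude by which evalue_a is smaller than evalue_b."""
    floor = 1e-300  # guard against log10(0) for E-values reported as 0
    return math.log10(max(evalue_b, floor)) - math.log10(max(evalue_a, floor))


def is_lgt_candidate(hits, min_orders=9):
    """hits maps organism name -> (group, E-value).

    Flags the query when its best bacterial hit beats its best
    non-vertebrate eukaryotic hit by at least min_orders orders of magnitude.
    """
    bacterial = [e for group, e in hits.values() if group == "bacteria"]
    nonvert = [e for group, e in hits.values() if group == "non-vertebrate"]
    if not bacterial:
        return False
    if not nonvert:
        return True  # assumption: no non-vertebrate hit counts as a candidate
    return orders_better(min(bacterial), min(nonvert)) >= min_orders


# E-values from the slide; the group labels are added here for illustration.
slide_hits = {
    "Frog": ("vertebrate", 4e-180),
    "Mouse": ("vertebrate", 1e-164),
    "E. coli": ("bacteria", 7e-124),
    "Streptococcus": ("bacteria", 9e-71),
    "Worm": ("non-vertebrate", 0.1),
}
print(is_lgt_candidate(slide_hits))
```

For the slide's example the bacterial hit is far more than nine orders of magnitude better than the worm hit, so the query is flagged. The caveats on the later "LGT??" slide apply directly: with few sequenced eukaryotes, a missing or weak non-vertebrate hit may reflect sampling rather than transfer.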
24Bacteria to vertebrate LGT
Non-vertebrates
Bacteria
vertebrates
26Bacteria to vertebrate LGT??
- Hundreds of sequenced bacterial genomes vs. a
handful of eukaryotes
- Gene finding in bacteria is much easier than in
eukaryotes
- On the practical side: rigid mechanical barriers
to LGT in eukaryotes (nucleus, germ line)
27Repetitive Elements in the Human Genome
28Repeats statistics
- The human genome is 45% dispersed repeats
- 20% LINEs (AT rich)
- 13% SINEs (11% Alu) (GC rich)
- 8% LTR (retrovirus-like)
- 2% DNA transposons
- Another 3% is tandem simple sequence repeats
(e.g. triplet)
- And another 3-5% is segmentally duplicated at
high similarity (over 1 kb, over 90% identity)
- Identifying and screening these out is essential
to avoid fake matches
29LINEs and SINEs
- Highly successful elements in eukaryotes
- LINE: Long Interspersed Nuclear Element (>5,000
bp)
- SINE: Short Interspersed Nuclear Element (<500
bp)
- SINEs are free riders on the backs of LINEs;
they encode no proteins
30The C-value paradox
- Genome size does not correlate with organism
complexity
                   Amoeba        Rice          Human       Yeast
Genome size (bp)   670 billion   430 million   3 billion   12 million
Number of genes    ?             30,000        20-25,000   6,275
31Repetitive elements
- The C-value mystery was partially resolved when
it was found that large portions of genomes
contain repetitive elements
32Are Alus functional??
- SINEs are transcribed under stress
- SINE RNAs may bind a protein kinase and thereby promote
translation under stress
- Need to be in regions which are highly
transcribed
- Role in alternative splicing
33Segmental duplications
- 1077 segmental duplications detected
- Several genes in the duplicated regions are
associated with diseases (may be related to
homologous recombination)
- Most are recent duplications (conservation of
the entire segment, versus conservation of coding
sequences only)
34Genome-wide studies
35Sequenced genomes
36- 481 segments >200 bp absolutely conserved (100%
identity) between human, rat and mouse
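The definition above (long stretches with 100% identity across all three species) can be sketched as a scan over a multiple alignment. The example assumes the three sequences are already aligned to equal length with "-" for gaps, and it treats "at least `min_len`" as the cutoff; both are assumptions for illustration, and the function name is hypothetical.

```python
def conserved_runs(aligned, min_len=200):
    """Return (start, end) alignment intervals, end exclusive, where every
    sequence has the same non-gap base for at least min_len columns."""
    aligned = [s.upper() for s in aligned]
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    for i in range(length + 1):            # extra step closes a run at the end
        identical = (
            i < length
            and aligned[0][i] != "-"
            and all(s[i] == aligned[0][i] for s in aligned)
        )
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment with a small cutoff so the behaviour is visible.
human = "ACGTACGTAA"
rat   = "ACGTACGTTA"
mouse = "ACGAACGTAA"
print(conserved_runs([human, rat, mouse], min_len=3))
```

With real data the same scan would run over whole-genome alignments with `min_len=200`; a single mismatch or gap in any species ends a run, which is what makes these elements "absolutely" conserved.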
37Comparison with a neutral substitution rate
- Compare the substitution rate in any 1 Mb region
- Probability of 10^-22 of obtaining 1
ultraconserved element (UE) by chance
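To see how a number as small as 10^-22 can arise, consider a deliberately simple model: each aligned site is identical across the three species independently with probability q, so a given 200-bp window is fully conserved with probability q^200. The slide does not give q; the value 0.7 below is purely an illustrative assumption, chosen because it lands near the quoted order of magnitude, and the function name is hypothetical.

```python
import math


def expected_ultraconserved(genome_bp, per_site_identity, length=200):
    """Expected number of fully conserved windows of the given length,
    assuming independent sites (an assumption, not the slide's model)."""
    # Work in log10 to avoid underflow for very small probabilities.
    log10_window = length * math.log10(per_site_identity)
    return 10 ** (log10_window + math.log10(genome_bp))


GENOME_BP = 2.91e9   # genome size quoted earlier in the deck
Q = 0.7              # illustrative per-site identity under neutral evolution

print(f"{expected_ultraconserved(GENOME_BP, Q):.1e}")
```

Under these assumptions the expected count across the whole genome is on the order of 10^-22, which is why finding even one such element by chance, let alone 481, is implausible under neutral evolution.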
38481 UEs
- 111 UEs overlap a known mRNA (exonic UEs)
- 256 show no overlap (non-exonic)
  - 100 intronic
  - 156 intergenic
- 114 inconclusive
39Who are the genes?
- Type 1: exonic
- Type 2: genes which are near
non-exonic UEs (???)
40Intergenic UEs
- Genes which flank intergenic UEs are enriched for
early developmental genes
- Are UEs distal enhancers of these genes?
41Gene enhancer
- A short region of DNA, usually quite distant from
a gene (it can still act on the gene thanks to complex
chromatin folding), which binds an activator
- An activator recruits transcription factors to
the gene
42Experimental studies of UEs
- Tested 167 UEs (both mouse-human UEs and
fish-human UEs) for enhancer activity
- Each was cloned before a reporter gene to test its
activity
- 45% functioned as enhancers
43A bioinformatic success
- Ultraconservation can predict highly important
function!
44BUT
Ahituv et al., PLoS Biol. 2007 Sep; 5(9): e234
- Chose 4 UEs which are near specific genes: genes
which show a specific phenotype when
knocked out
- Performed complete deletion of these
UEs
- The mice were viable and did not show any
different phenotype
45Conclusions
- Ultraconservation can be indicative of important
function
- And sometimes not:
  - gene redundancy
  - long-range phenotypes
  - laboratories cannot mimic life