Title: Comparative genome analysis
1. Proteome analysis in silico
Peer Bork, EMBL Heidelberg and MDC Berlin
bork_at_embl.de http://www.bork.embl-heidelberg.de/
2. -omes: use and misuse
Original intention (exemplified by the genome):
-ome: the entirety of biomolecular objects (ALL genes etc.)
-omics: research on an entirety of biomolecular objects
Proteomics: research on the entirety of proteins (so far in an organism); term coined at the beginning of the 1990s
Common practice:
-omics: used to describe large-scale approaches (whereby "large" is sometimes 1)
Proteomics: used for research on many proteins (whereby "many" might mean 3)
3. Originally two main directions
Protein profiling and interaction proteomics
Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms).
Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins.
Bioinformatics analysis is essential in both areas.
4. Part I
Proteome analysis in silico
Protein detection and annotation by homology and orthology (function in 1D)
Part II
Protein interactions and protein networks (function in 2D)
Temporal and spatial considerations (function in 3D/4D)
5. Bork et al. J. Mol. Biol. 1998
Genome annotation
Alternative Splicing
Domain analysis
Protein networks
Literature mining coupled to genomic data
6. 70% prediction accuracy is great!
7. Concepts in function prediction
Homology-based (intrinsic molecular features)
- Sequence and domain DBs (BLAST, Pfam, SMART)
- Function transfer by orthology
Gene context (functional associations)
- Gene neighbourhood, fusion, co-occurrence
- Shared regulatory elements
Other (residue level, functional class)
- Correlated mutations
- Interaction threading
- Feature analysis
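The co-occurrence idea listed under gene context can be sketched as a comparison of phylogenetic profiles: genes whose orthologs are present and absent in the same genomes are candidates for a functional association. This is only a minimal illustration; the gene names, genome names, the Jaccard score and the 0.8 threshold are assumptions, not taken from the talk.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of genomes (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurring_pairs(profiles: dict, threshold: float = 0.8) -> list:
    """Return (gene1, gene2, score) for gene pairs with similar profiles."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[g1], profiles[g2])
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs


# Hypothetical toy data: gene -> genomes in which an ortholog was found.
profiles = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
}
links = cooccurring_pairs(profiles)
```

Here only geneA and geneB share a profile, so they form the single predicted association. Real analyses would also need to correct for the phylogenetic relatedness of the genomes.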
8. I. Homology-based protein annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side
Metazoan proteome analysis: human vs chicken
Evolution of protein function
9. Status of homology-based function prediction
Many homologues, an increasing number of predictable folds, but tough times for automatic function prediction
10. Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence
Henikoff et al. 1997 Science 278, 609
12. History of signaling domain discovery
Systematic discovery by 1) searching the regions in between annotated domains 2) starting with repeats
Doerks et al. 2002 Genome Res.; Ponting et al. 2001 Genome Res.
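The first strategy, searching the regions in between, amounts to collecting the stretches of a protein that no known domain covers and using them as new search queries. The sketch below shows only that bookkeeping step; the function name, the 1-based inclusive coordinates and the minimum length of 50 residues are assumptions for illustration, not the published procedure.

```python
def uncovered_regions(seq_len, domains, min_len=50):
    """Return 1-based inclusive (start, end) regions not covered by any domain.

    `domains` is a list of (start, end) tuples; overlapping domains are allowed.
    Regions shorter than `min_len` residues are skipped.
    """
    regions = []
    pos = 1  # first residue not yet known to be covered
    for start, end in sorted(domains):
        if start - pos >= min_len:
            regions.append((pos, start - 1))
        pos = max(pos, end + 1)
    # Tail of the sequence after the last domain.
    if seq_len - pos + 1 >= min_len:
        regions.append((pos, seq_len))
    return regions


# Hypothetical 500-residue protein with three annotated domains.
candidates = uncovered_regions(500, [(100, 200), (180, 260), (400, 450)])
```

Each returned region could then be submitted to a sequence similarity search to look for conserved, so far undescribed domains.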
13. Domain discovery in disease genes
14. SMART
BLAST-like input
- Access to different databases
- Domain annotation and architecture
Collaboration with Chris Ponting
www.smart.embl-heidelberg.de
15. SMART
Digested output
- Signal sequence, coiled coil and TM
- Pfam integrated
- Comparison of domain context
www.smart.embl-heidelberg.de
16. A putative transport-associated microtubule-binding domain
Unifying disorders associated with hereditary spastic paraplegia?
Mutation
Spartin
Plant-related
MIT
- Spastin
- SKD1 protein
- VPS4p ATPase (Vacuolar protein sorting factor 4A and 4B)
- Tobacco mosaic virus helicase domain-binding protein
Ciccarelli, F. D., et al. Genomics 81(2003)437
Patel, H. et al. Nat Genet 31(2002)347
17. I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side
Metazoan proteome analysis: human vs chicken
Evolution of protein function
18. Number of human genes in time
(Figure: estimated no. of human genes in thousands, y-axis 0 to 120, against time, x-axis Feb 00 to Jan 05. Labelled sources: HGS, Incyte and co; textbooks, public opinion; Celera; HGP; others. Annotation: basis for Feb 01 publications. Values shown on the plot include 52, 39, 38, 32, 27, 24, 22 and 21.)
19. Improvement of gene cluster predictions
Mouse chr4: 94-94.6 Mb, p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments (comparison performed in 2004)
20. BLAST2GENE finds independent gene copies
BLAST of cyp2j13 protein vs. mouse chr4: 94-94.6 Mb
150 alignments
(Figure: BLAST2GENE output grouping the alignments into independent gene copies; numeric alignment labels omitted.)
Hundreds of often considerable differences to current gene prediction pipelines!
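The step from many BLAST alignments to a few independent gene copies can be pictured as chaining alignments that are collinear in protein and genome, and starting a new copy whenever the protein coordinate jumps back. The sketch below is a simplified greedy illustration of that idea only; it is not the BLAST2GENE algorithm. The `HSP` record, the single-strand assumption and the `max_overlap` tolerance are all hypothetical.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    q_start: int  # start in the query protein
    q_end: int    # end in the query protein
    s_start: int  # start on the genomic sequence
    s_end: int    # end on the genomic sequence


def group_into_copies(hsps: List[HSP], max_overlap: int = 10) -> List[List[HSP]]:
    """Greedily chain HSPs (all on one strand) into putative gene copies.

    HSPs are walked in genomic order; an HSP extends the current copy if its
    query start continues after the previous HSP (allowing a small overlap),
    otherwise it opens a new copy.
    """
    copies: List[List[HSP]] = []
    for hsp in sorted(hsps, key=lambda h: h.s_start):
        if copies and hsp.q_start >= copies[-1][-1].q_end - max_overlap:
            copies[-1].append(hsp)
        else:
            copies.append([hsp])
    return copies


# Hypothetical alignments of one 500-residue protein against a genomic region.
hsps = [
    HSP(1, 100, 1000, 1300),
    HSP(95, 250, 2000, 2465),
    HSP(240, 500, 3000, 3780),
    HSP(1, 120, 9000, 9360),
    HSP(110, 500, 10000, 11170),
]
copies = group_into_copies(hsps)
```

In this toy input the first three alignments form one copy and the last two a second copy. A realistic implementation would also have to handle both strands, scores and partial (pseudogenic) copies.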
21. Annotation of pseudogenes changes gene numbers
1) Similarity search in intergenic regions
Torrents, Suyama, Bork Genome Res. 13(2003)2550
22. Annotation of pseudogenes changes gene numbers
2) Consistency check of gene predictions
Still >3000 pseudogenes among the predicted human genes in mid 2004 (build 34)
Arrays, chips et al.: 20% off?
23. Genes
What do we count?
20-40k genes
>100k transcripts
>1000k proteins?
Protein diversity
24. Rate of detectable alternative splicing depends on EST coverage and library range
(Figure: AS per mRNA (x), y-axis from 2.0 to 2.8.)
Brett et al. Nature Genet. 30(2002)29
25. Boue et al. Bioessays 2003
26. Homology-based predictions of exons and alternative transcripts (www.smart.embl-heidelberg.de)
SMART domain DB links to genomes
27. Top 10 domains in human: 30% diff.!
Species              human          fly     worm
Total no. of genes   26500 (26500)  13300   18200
Immunoglobulin       765 (381)      140     64
C2H2 zinc finger     706 (607)      357     151
Protein kinase       575 (501)      319     437
Rhod.-like GPCR      569 (616)      97      358
P-loop NTPase        433            198     183
Rev. transcriptase   350            10      50
RRM (RNA-binding)    300 (224)      157     96
WD40 (G-protein)     277 (136)      162     102
Ankyrin repeat       276 (145)      105     107
Homeobox             267 (160)      148     109
Only the no. of genes is given; the no. of domains is higher.
Note that only around 90% is sequenced.
Nature 409(2001)860; Science 291(2001)1304
28. Metazoan genome annotation: an ongoing process and far from complete
- >2000 pseudogenes in mammalian gene sets. Only now are they about to be included in prediction pipelines
- Ca. 150 retro-related genes in mammalian gene sets (>1000 in 2004), but true human genes sometimes suppressed
- Annotation of gene clusters needs considerable improvements
- Alternative splicing still a major unknown
- Considerable human factor in annotation
29. I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side
Metazoan proteome analysis: human vs chicken
Evolution of protein function
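30. (Figure: phylogenetic tree of human, chimp, mouse, rat, chicken, fugu, C. elegans, D. melanogaster and mosquito with approximate divergence times; labels on the tree: 5, 75, 40, 310 MY, 450 MY, 600-1200 MY?, 250 MY, ?)
Human: Nature Feb 2001
Mosquito: Science Oct 2002
Mouse: Nature Dec 2002
Chicken: Nature Dec 2004
Rat: Nature Apr 2004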
31. Chicken genome analysis
Hillier et al. Nature 2004
Zdobnov et al. Science 2002
(Figure labels: 15, 45)
32. Chicken genome analysis: orthology and cellular processes
75.4% identity (median) between chicken and human 1:1 orthologs
Immune response evolves fastest
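A median identity over 1:1 orthologs presupposes a way to pick the ortholog pairs. One widely used approximation is reciprocal best hits, sketched below; it is shown as an illustration and is not necessarily the procedure used in the chicken analysis. The gene identifiers and identity values are made up.

```python
from statistics import median


def reciprocal_best_hits(best_ab: dict, best_ba: dict) -> list:
    """Return (gene_a, gene_b, identity) for pairs that are each other's best hit.

    `best_ab` maps a gene in genome A to (best hit in genome B, % identity);
    `best_ba` is the same mapping in the opposite direction.
    """
    pairs = []
    for gene_a, (gene_b, identity) in best_ab.items():
        back = best_ba.get(gene_b)
        if back is not None and back[0] == gene_a:
            pairs.append((gene_a, gene_b, identity))
    return pairs


# Hypothetical best hits between a "human" and a "chicken" gene set.
best_ab = {"h1": ("c1", 80.0), "h2": ("c2", 70.0), "h3": ("c2", 60.0), "h4": ("c4", 90.0)}
best_ba = {"c1": ("h1", 80.0), "c2": ("h2", 70.0), "c4": ("h4", 90.0)}

orthologs = reciprocal_best_hits(best_ab, best_ba)
median_identity = median(identity for _, _, identity in orthologs)
```

Gene h3 is dropped because its best hit c2 points back to h2, which leaves three pairs and a median identity of 80.0 in this toy data set.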
33. Chicken genome analysis: innovation and expansion of domain families
34. Orthology analysis reveals more subtle functional changes
35. Evolution by duplication: burst of an olfactory receptor family
thought to recognize MHC diversity
(Figure: tree of chicken and human olfactory receptors; 221 copies in chicken)
given ca. 300 ORs in chicken and 450 in human
36. Chicken genome analysis: evolution of function by domain accretion
Scavenger receptor cysteine-rich domain acquired by a fibrinogen-domain-containing protein (identified and displayed by SMART)
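Domain accretion can be detected by comparing the domain architectures of two orthologs and listing the domains that occur more often in one than in the other. The sketch below assumes architectures are given as ordered lists of domain names, for example as read from a SMART result. The names `FBG` and `SR` and the domain order are illustrative placeholders, not data from the slide.

```python
from collections import Counter


def gained_domains(reference: list, derived: list) -> list:
    """Domains present more often in `derived` than in `reference`.

    Counter subtraction keeps only positive differences, so a domain that
    occurs twice in `derived` and once in `reference` is reported once.
    """
    extra = Counter(derived) - Counter(reference)
    return sorted(extra.elements())


# Hypothetical architectures of two orthologous proteins.
reference_arch = ["FBG"]        # fibrinogen-related domain only
derived_arch = ["SR", "FBG"]    # additional scavenger receptor Cys-rich domain

accreted = gained_domains(reference_arch, derived_arch)
```

For this pair the comparison reports the single accreted domain. Deciding which ortholog is ancestral still requires an outgroup.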
37. I. Homology-based genome annotation
Homology detection and domain annotation
Metazoan genome annotation: the dark side
Metazoan proteome analysis: human vs chicken
Evolution of protein function
38. Phylogenetic distribution of orthologs
- Losses
39. Gene loss in Diptera
D A P Y W H M
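Candidate gene losses in a clade can be read off a presence/absence table of orthologs: a gene is a candidate if it is missing from every species of the clade but found in several species outside it. The sketch below illustrates this filter; the species names, gene identifiers and the `min_outside` cutoff are assumptions and do not decode the species letters on the slide.

```python
def lost_in_clade(presence: dict, clade: set, min_outside: int = 2) -> list:
    """Return genes absent from all `clade` species but present elsewhere.

    `presence` maps a gene (ortholog group) to the set of species in which
    it was detected; at least `min_outside` species outside the clade must
    contain the gene for a loss to be called.
    """
    losses = []
    for gene, species in presence.items():
        outside = species - clade
        if not (species & clade) and len(outside) >= min_outside:
            losses.append(gene)
    return sorted(losses)


# Hypothetical presence/absence data.
diptera = {"fly", "mosquito"}
presence = {
    "g1": {"human", "mouse", "worm"},  # absent from both dipterans
    "g2": {"fly", "human"},            # still present in fly
    "g3": {"yeast"},                   # too little support outside the clade
    "g4": {"fly", "mosquito"},         # clade-specific gene
}
candidates = lost_in_clade(presence, diptera)
```

Only g1 qualifies here. Because annotation is incomplete, an apparent absence may also reflect a missed gene prediction rather than a true loss.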
40. Functional changes at evolutionary time scales
Orthologs mapped onto metazoan phylogeny
41. Summary (homology-based function prediction)
Emphasis in homology-based genome annotation shifts from sensitivity (e.g. domain identification) to selectivity issues (orthology assignment for 1:1 function transfer).
Metazoan genome annotation is far from complete, and caution is needed when using incomplete and partially erroneous parts lists (e.g. when predicting networks).
Yet, with the growing number of metazoan genomes, our understanding of functional diversification at the protein level will increase dramatically...
...although the proteome remains far from being deciphered.