Title: Comparative genomics and proteomics in Ensembl
1Comparative genomics and proteomics in Ensembl
November 2004
2Overview
- Rationale
- Species available
- Comparative proteomics
- Orthologues prediction
- Protein clustering into families
- Comparative genomics
- Genome-wide DNA alignments
- Conserved synteny blocks
- Future and perspectives
3Compara
- The Compara database is one single multispecies database
- Gene orthology/paralogy prediction
- Protein clustering
- Whole genome alignments
- Synteny regions
4Comparing different species
[Figure: phylogenetic tree of the following species, with divergence times on a time axis in million years (100 to 1000)]
H. sapiens (Human) 3Gb NCBI 34
P. troglodytes (Common chimpanzee) CHIMP1
M. mulatta (Rhesus macaque)
M. musculus (House mouse) 2.6Gb NCBIm33
R. norvegicus (Norway rat) 2.6Gb RGSC3.1
C. familiaris (Domestic dog) BROAD1
F. catus (Domestic cat)
E. caballus (Horse)
S. scrofa (Domestic pig)
B. taurus (Cow) Btau 1.0
O. aries (Domestic sheep)
M. domestica (opossum)
G. gallus (Domestic fowl) 1.2Gb WASHUC1
X. laevis (African clawed frog) JGI3 3.1Gb
X. tropicalis (Tropical clawed frog) 1.7Gb
D. rerio (Zebrafish) 1.7Gb WTSI Zv4
O. latipes (Japanese medaka) 800Mb
T. nigroviridis Tetraodon7 400Mb
T. rubripes (Tiger pufferfish) 400Mb Fugu v2.0
C. savignyi (sea squirt) 180Mb
C. intestinalis (sea squirt) 180Mb
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito) 230Mb MOZ2
D. melanogaster (fruitfly) 125Mb BDGP3.1
A. mellifera (honeybee) 270Mb Amel1.1
C. elegans (nematode) 100Mb WS116
C. briggsae (nematode) 100Mb cb25.agp8
Legend: red, whole genome assembly available; green, whole genome assembly due within the next 2 years
Currently in Ensembl
5Comparing different species
- From the Ensembl perspective, species are joined through
- orthologous/paralogous gene links
- chromosome synteny links
- protein family links
- From a broader perspective
- Where are syntenic regions located?
- How many genes are conserved?
- Where are orthologous/paralogous genes?
- Is gene order conserved?
- Where are potential regulatory regions?
- What is missing in one species, present only in
another?
6Orthologues prediction
- Use in model organism
- Gathering of information
- Identify potential species-specific
proteins/genes
7Identifying orthologous genes
[Figure: gene tree along a time axis for Gene 1, Gene 2 and Gene 3. A speciation event produces orthologous genes; a duplication event produces paralogous genes. Branch labels: original function maintained (twice), novel function, functional orthologous.]
8Orthologues prediction
- Find orthologous genes by comparing the protein sets of two species (only the longest peptide considered).
- blastpsw all versus all (on a paired species basis)
- Best Reciprocal Hit as putative orthologues (named BRH)
UBRH
MBRH subtype DUP1.3
MBRH subtype COMPLEX
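The best reciprocal hit idea can be sketched in a few lines. This is a minimal illustration, not the Ensembl pipeline: the input format (query, subject, score tuples), the function names and the toy gene identifiers are all assumptions made for the example, and ties are kept so that one gene may have several best hits.

```python
from collections import defaultdict

def best_hits(hits):
    """Map each query to the set of subjects sharing its top score.

    `hits` is an iterable of (query, subject, score) tuples (assumed format).
    """
    top = {}
    best = defaultdict(set)
    for query, subject, score in hits:
        if query not in top or score > top[query]:
            top[query] = score
            best[query] = {subject}
        elif score == top[query]:
            best[query].add(subject)
    return best

def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each gene is a best hit of the other."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, ())
    )

# Toy all-versus-all results for one species pair (hypothetical identifiers).
human_vs_mouse = [("hA", "mA", 900), ("hA", "mB", 300),
                  ("hB", "mB", 700), ("hC", "mB", 650)]
mouse_vs_human = [("mA", "hA", 880), ("mB", "hB", 710), ("mB", "hC", 640)]

pairs = best_reciprocal_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

Here hC finds mB as its best hit, but mB prefers hB, so hC gets no reciprocal partner. A gene with exactly one reciprocal partner would correspond to the unique case and a gene with several to the multiple case.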
9RHS, Orphans and Others
- Based on the genomic coordinates of UBRH and MBRH-DUP pairs in both
species and on gene order conservation, we identify additional
orthologues, named RHS for Reciprocal Hit supported by Synteny.
MBRH COMPLEX
Human
Orphan
Mouse
For chimp, due to the special nature of the gene
build process, we also have DWGA (Derived from
Whole Genome Alignment)
11For each orthologous gene pair
- We store
- identity, positivity, coverage, cigar lines, description (UBRH, MRHS), subtype (DUP1.2, SYN, COMPLEX), dN, dS
- All the blastpsw results are provided
- Using the compara perl API
- Protein or cDNA protein-based alignment
- 4D, 2D sites can also be easily retrieved
- Future developments
- UBRHSYN or UBRHNON-SYN
- Consider all isoforms for each gene
- Build clusters of orthologues
- Bringing in paralogues?
- Multiple alignments and phylogenies
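To make the stored cigar lines and identity values concrete, here is a minimal sketch of how a cigar line can be turned back into a gapped sequence and how a percent identity can be derived from two gapped sequences. The cigar grammar used here (an optional count followed by M for residues or D for gap columns) and the definition of identity (over columns where both sequences have a residue) are assumptions for illustration; the values stored in the database may be defined differently.

```python
import re

def expand_cigar(sequence, cigar):
    """Insert '-' gaps into `sequence` following a cigar line of M/D tokens.

    Assumed format: optional count, then 'M' (residues) or 'D' (gap columns),
    e.g. '3M2D5M'. A missing count means 1.
    """
    out, pos = [], 0
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            out.append(sequence[pos:pos + n])
            pos += n
        else:
            out.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not consume the whole sequence")
    return "".join(out)

def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of columns where both have a residue."""
    both = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not both:
        return 0.0
    return 100.0 * sum(x == y for x, y in both) / len(both)

# Toy peptides (hypothetical), aligned through their cigar lines.
human = expand_cigar("MKTAYIAK", "3M2D5M")
mouse = expand_cigar("MKTLLAYIAR", "10M")
identity = percent_identity(human, mouse)
print(human, mouse, identity)
```

Storing the ungapped sequence once and a short cigar line per alignment keeps the alignment data compact while still allowing the full alignment to be rebuilt on demand.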
12Protein clustering into families
- Cluster proteins from different organisms that may share the same function
- Obtain some kind of description for novel genes/proteins
- Locate family members over the whole genome
- Identify possible orthologues and paralogues in other species
13Dataset used and comparisons
- Half a million proteins clustered
- All Ensembl proteins from all species in Ensembl
- 233,000 predicted proteins
- All metazoan (animal) proteins in SwissProt/SPTrEMBL
- 40,000 UniProt/Swiss-Prot
- 230,000 UniProt/SPTrEMBL
- Blastp all versus all, then clustering with MCL
14Clustering with MCL
- MCL stands for Markov CLustering algorithm, based on flow simulation in graphs (http://micans.org/mcl/)
- Keeps in the same graph/cluster only very well inter-connected nodes/proteins
- Allows rapid and accurate detection of protein families on a large scale.
- Automatic description and clustalw multiple alignment applied to each cluster
MCL
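The flow simulation behind MCL alternates two steps on a column-stochastic matrix: expansion (matrix squaring, which spreads flow) and inflation (element-wise powering followed by renormalisation, which favours strong flows). The sketch below is a bare-bones pure-Python version for tiny graphs, not the mcl program itself. The function name, the parameter defaults and the final step (taking connected components of the remaining flow, a simplification of MCL's own attractor-based interpretation) are assumptions made for the example.

```python
def mcl(nodes, edges, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster an undirected weighted graph with a minimal MCL loop."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    # Weighted adjacency matrix with self-loops on the diagonal.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b, weight in edges:
        m[index[a]][index[b]] = weight
        m[index[b]][index[a]] = weight

    def normalise(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: flow spreads along two-step paths (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: boost strong flows, damp weak ones, then renormalise.
        m = normalise([[value ** inflation for value in row] for row in m])

    # Nodes that still exchange flow end up in the same cluster.
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(nodes[i])
            for j in range(n):
                if j not in seen and (m[i][j] > eps or m[j][i] > eps):
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return sorted(clusters)

# Two tight triangles of proteins joined by one weak similarity edge.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
similarities = [("p1", "p2", 1.0), ("p1", "p3", 1.0), ("p2", "p3", 1.0),
                ("p4", "p5", 1.0), ("p4", "p6", 1.0), ("p5", "p6", 1.0),
                ("p3", "p4", 0.1)]
families = mcl(proteins, similarities)
print(families)
```

The weak p3/p4 link loses its flow within a few rounds of inflation, which is the behaviour the slide describes: only very well inter-connected proteins stay in the same cluster.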
15For each cluster
- We store
- Description and score
- Multiple alignment
- Future extensions
- Improving descriptions
- Multiple alignment assessment
- t-coffee
- Protein domain information consistency
- Build phylogeny on each cluster
- Using the multiple alignment
- Using dS values (mainly inside mammals)
- Identify intra/inter-species orthologue/paralogues
17Addition of protein domain information
- Introduction of protein domain
- Help for internal data QC by checking consistency between orthologues, protein clusters and domain information.
- Provide this kind of cross-check data to the user
18Aligning complete genomes
19Aligning genomes, why?
- Understand what evolution has done to the species compared, after their speciation
- Define syntenic regions, those long regions of DNA sequence where order and orientation are highly conserved
- Finding conserved non-coding regions
- Good guides to find and test putative regulatory regions
- What is missing in one species, present only in another?
- Differences between closely related species (human/chimpanzee, human/macaque) may help in understanding speciation
20Basic ideas
[Figure: an ancestor sequence splits at a speciation event; mutations and selection act on each lineage, and the two resulting sequences are aligned. Legend: mutation, regulatory region, exon.]
21Basic ideas
- Functional sequences (coding exons, regulatory regions) are generally highly conserved
- Conserved sequences can be functionally important
- Conservation Function
- Comparing DNA sequences from different species can help to find biological functions
22Using a local aligner
- Local alignment
- Find all highly similar regions over 2 sequences
- Find the orthologous as well as all the paralogous sequences
- Separated by segments without alignment
- Can handle rearranged sequences
- Needs post-filtering to limit excessively overlapping alignments
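As a reminder of what "local" means in practice, here is a minimal Smith-Waterman sketch. It is the textbook dynamic-programming local aligner with a linear gap penalty, used only to illustrate the concept; it is not the heuristic aligner used for whole genomes, and the scoring values and toy sequences are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# The shared core is found even though the flanks are unrelated.
sw_score, sw_a, sw_b = smith_waterman("TTTTACGTACGTCCCC", "GGGACGTACGTAAA")
print(sw_score, sw_a, sw_b)
```

This version reports only the single best region. A genome aligner has to report every region above a threshold, which is why many overlapping hits appear and post-filtering is needed.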
23Global vs. Local Alignments
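[Figure: local versus global alignment of two sequences (1 and 2); the local alignment also recovers an inversion, shown on the (-) strand, and a duplication]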
24Aligning large genomic sequences
- Independent from protein/gene predictions
- Issues
- Heavy process
- Computes run by only a few dedicated groups
- Scalability (more and more species available)
- Time constraint
- As the true alignment is not known, it is difficult to measure
alignment accuracy and to choose the right method
25Ensembl compute 0.25
26The rest of it
27Trying to avoid the all versus all comparison
- Phusion shotgun assembler / gapped BLASTN combination
- Jim Mullikin and Zemin Ning (Sanger Institute), The Phusion Assembler, Genome Res. (2003) 13: 81-90
28Phusion - gapped BLASTN
Human 60Kb fragments
Mouse 60Kb fragments
Phusion clustering
16000 clusters containing no more than 50
fragments
gapped BLASTN
29The compute
The clusters
The farm
320x Compaq DS10 1Gb memory 60Gb local disk
10Tb
768 RLX blades 1Gb memory 80Gb local disk
8x Compaq ES40 32 CPUs
6x Compaq ES45 24 CPUs
30Phusion - gapped BLASTN
- Fast but speed comes at a cost
- Only 22% of human genome coverage
- Good enough for generating orthologous links between the 2 species aligned, which can be used either
- in the web site for moving from one species to another
- to calculate synteny regions
- Not good enough for serious genome-wide post-analysis because not comprehensive enough
31all versus all approach using BLASTZ
(collaboration with UCSC)
- Can handle large sequences
- Used 2-weighted spaced seeding strategy
- Dynamic masking
- Makes distinction between repeat and non-repeat sequences (soft masking)
- Try aligning inside repeats
- One iterative step with lower threshold to expand alignments
32Blastz strategy
- 10Mb Human fragments (3000)
- 30Mb Mouse fragments (100)
- Lineage-specific repeats removed
- 48 hours on 1024 CPUs
- Generates 9Gb of output
- When filtered for best hit on human, reduced to 2.5Gb
33Blastz human genome coverage
- 40% of the human genome is covered by an alignment of mouse sequences
- By rescoring the alignments over a tight matrix that is very stringent and looks for high conservation (>70% identity), the coverage goes down to 6%
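Coverage figures of this kind come from merging the aligned intervals on the reference genome and dividing by the genome length. The sketch below shows that calculation on toy numbers. The tuple layout, the function names and the use of a plain identity cut-off are assumptions for illustration; in particular, filtering on identity is only a crude stand-in for rescoring with a tight matrix.

```python
def covered_bases(intervals):
    """Total length of the union of half-open (start, end) intervals."""
    total, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from everything seen so far: count it whole.
            total += end - start
            current_end = end
        elif end > current_end:
            # Overlapping: count only the part that extends further.
            total += end - current_end
            current_end = end
    return total

def coverage(alignments, genome_length, min_identity=0.0):
    """Percentage of the genome covered by alignments above `min_identity`.

    `alignments` holds (start, end, percent_identity) tuples (assumed format).
    """
    kept = [(start, end) for start, end, identity in alignments
            if identity > min_identity]
    return 100.0 * covered_bases(kept) / genome_length

# Toy alignment blocks on a 1000 bp "genome" (hypothetical numbers).
blocks = [(0, 300, 65.0), (200, 400, 82.0), (600, 700, 91.0), (650, 680, 55.0)]
all_cov = coverage(blocks, 1000)
high_cov = coverage(blocks, 1000, min_identity=70.0)
print(all_cov, high_cov)
```

Overlapping blocks are counted once, so the figure reflects bases covered rather than the total length of alignment output.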
34Genome alignment summary
- cons track
- blastz human/mouse, human/rat, human/chimpanzee, human/chicken, mouse/chicken, mouse/rat, rat/chicken, fugu/Tetraodon
- phusion-blastn elegans/briggsae
- high cons track
- Obtained by rescoring the raw alignments over a tight matrix
- trans BLAT track
- translated BLAT human/fugu, human/zebrafish, human/chicken, human/Tetraodon, fly/anopheles, fly/bee, elegans/briggsae, chicken/mouse, rat/zebrafish, mouse/Tetraodon, mouse/zebrafish, mouse/fly, mouse/fugu, rat/Tetraodon, fugu/zebrafish, Tetraodon/chicken, Tetraodon/zebrafish, Tetraodon/fugu, Anopheles/bee
36DNA/DNA matches web display
37DotterView
38Defining large syntenic regions
- Genome alignments are refined into large syntenic regions.
- Alignments are clustered together when the relative distance between them is less than 100kb and order and orientation are consistent.
- Any clusters less than 100kb are discarded.
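The clustering rule can be sketched as a single greedy pass over alignments sorted by reference position. This is an illustrative reading of the rule, not the production code: the tuple layout, the function name, the choice to compare each alignment only with the most recent block, and the measurement of block size on the reference genome are assumptions.

```python
MAX_GAP = 100_000   # alignments closer than this may be chained
MIN_SIZE = 100_000  # blocks shorter than this are discarded

def build_synteny_blocks(alignments, max_gap=MAX_GAP, min_size=MIN_SIZE):
    """Chain alignments on one chromosome pair into syntenic blocks.

    Each alignment is (ref_start, ref_end, other_start, other_end, strand),
    with strand +1 or -1 (assumed format).
    """
    blocks = []
    for ref_start, ref_end, oth_start, oth_end, strand in sorted(alignments):
        if blocks:
            last = blocks[-1]
            ref_gap = ref_start - last["ref_end"]
            # Order check: on the forward strand the other genome must
            # advance, on the reverse strand it must move backwards.
            if strand == 1:
                oth_gap = oth_start - last["oth_end"]
            else:
                oth_gap = last["oth_start"] - oth_end
            if (strand == last["strand"] and ref_gap < max_gap
                    and 0 <= oth_gap < max_gap):
                last["ref_end"] = max(last["ref_end"], ref_end)
                last["oth_start"] = min(last["oth_start"], oth_start)
                last["oth_end"] = max(last["oth_end"], oth_end)
                continue
        blocks.append({"ref_start": ref_start, "ref_end": ref_end,
                       "oth_start": oth_start, "oth_end": oth_end,
                       "strand": strand})
    return [b for b in blocks if b["ref_end"] - b["ref_start"] >= min_size]

# Toy alignments: a forward run, an isolated hit, and a reverse run.
toy_alignments = [
    (0, 40_000, 1_000_000, 1_040_000, 1),
    (60_000, 120_000, 1_050_000, 1_110_000, 1),
    (150_000, 200_000, 1_130_000, 1_180_000, 1),
    (500_000, 530_000, 3_000_000, 3_030_000, 1),
    (900_000, 960_000, 5_200_000, 5_260_000, -1),
    (980_000, 1_050_000, 5_100_000, 5_170_000, -1),
]
synteny = build_synteny_blocks(toy_alignments)
print([(b["ref_start"], b["ref_end"], b["strand"]) for b in synteny])
```

The isolated 30kb hit forms its own block and is dropped by the size filter, while the two runs survive as one forward and one reverse block.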
39Synteny web display
- 347 syntenic regions
- Coverage
- 87.5% human
- 92.4% mouse
- Size range
- human
- 104.4Kb - 57.3Mb
- mouse
- 100.2Kb - 51.4Mb
40MultiContigView
41Synteny blocks in ContigView/CytoView
42Integrated multigenome browser
direct
Orthologous/paralogous genes
via families
Mouse genome browser
Human genome browser
Whole genome alignment Syntenic regions
43Species used in genome alignments
- Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken
- Nematoda Compara: C. briggsae, C. elegans
44Species used in orthologues prediction
- Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken, Fugu, Tetraodon, Zebrafish
- Arthropoda Compara: Fruit fly, Mosquito, Honeybee
- Nematoda Compara: C. briggsae, C. elegans
45Species included in protein clustering
- Vertebrata Compara: Human, Mouse, Rat, Chimp, Chicken, Fugu, Tetraodon, Zebrafish
- Arthropoda Compara: Fruit fly, Mosquito, Honeybee
- Nematoda Compara: C. briggsae, C. elegans
46Outlook
- OrthoView
- Displaying alignments both from whole genome alignments and on orthologues
- Projected transcripts
47Acknowledgements
- Abel Ureta-Vidal
- Cara Woodwark
- Jessica Severin
- Javier Herrero
- Ensembl team
48AlignSlice concept
- Slice having alignment information attached to it.
- Being able to project a transcript from one species to another through the alignment data (pairwise or multiple)
- Give gene context information across species
- Needed as a significant number of genomes are going to be 2X/3X. No sensible gene building possible. Cow will be used as test run.
49AlignSlice concept
- Getting a human AlignSlice
    my $HumanAlignSlice =
        $AlignSliceAdaptor->fetch_by_Slice_method_link_species_set(
            $human_slice,
            $method_link_species_set);
- Getting mouse genes projected on the human slice coordinates as much as possible
    my $mouse_genes =
        $HumanAlignSlice->get_all_genes_by_species("Mus musculus");
- Changing the reference species
    my $MouseAlignSlice =
        $HumanAlignSlice->change_reference_species_to("Mus musculus");