Title: Outline
1Outline
- Annotation status: what we've done
- Preliminary analysis
- Data FTP
2Release 3a.50
- 16,587 clones, 2,779 Mb of sequence
- Within-BAC Assemblies
- Contig N50 33,145 bp
- Scaffold N50 54,953 bp
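For readers unfamiliar with the statistic, N50 can be sketched as follows: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the summed length. This is a minimal illustration of the standard definition, not the pipeline's own code; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in bp, for illustration only.
example_contigs = [80, 70, 50, 40, 30, 20]
print(n50(example_contigs))
```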
3Pre-annotation Mask Repeats
RepeatMasker with MIPS REcat Release 4.3 (deficient in Helitron-encoded genes)
Find protein-coding genes
4Evidence-based Gene Build Pipeline
Pipeline uses the Ensembl framework (www.ensembl.org)
5Evidence Used
- 36,430 Ceres fl_cDNA
- 14,963 Arizona fl_cDNA
- 1,462,607 537,726 ESTs and another 18,181 mRNAs from maize
- 1,217,859 ESTs and another 72,919 mRNAs from rice
- 2,448,641 ESTs and another 14,015 mRNAs from other monocots
- 359,942 Swiss-Prot proteins from all species
- 494,444 non-maize plant proteins in TrEMBL
- 94,734 GenBank proteins from plant species
- 52,177 rice proteins from rice gene annotations
- 36,338 proteins from sorghum gene annotations
6Captured 95-97% of Genes
7Working Gene Set
- Set of predicted genes subjected to 2° annotations
- Compara gene trees/orthologue calling
- Xref to SwissProt
- InterPro (Pfam domains, TMHMM, etc.)
- GO assignment
- 2° annotations will help refine gene set
8Working Set Source
- Gene-build on 16,587 BACs (2,778,853,373 bp)
- Evidence based (fl_cDNA, EST, mRNA, proteins)
- 90,828 loci
- Fgenesh (monocot) on repeat-masked DNA
- 22,843 loci not overlapping with gene-build
- Large and dubious set
- 113,671 genes, 148,277 transcripts
- Expect
- False positives
- TEs
- Fragmented genes
- Redundancies due to overlapping BAC clones
9Defining Confidence Levels (to genes, not necessarily annotation)
- Homology-based classification
- BLASTP vs GenBank NRAA
- Cull out TE-encoded genes
- Find genes showing homology
- Has orthologue/paralogue
- Results from Compara gene trees
- Evidence types
- fl_cDNA, mRNA, protein, EST, FgenesH
- Gene Fragment detection
- Ratio of maize gene length to its best hit in rice (TIGR5) and sorghum (Sbi1.3).
- Complete CDS
- Has valid start and stop codons and no internal stops
- Could reflect problem in gene or gene model
- TE Pfams
- Have not looked at this yet
10Confidence Metrics
11Homology-Based Classification
- BLASTP vs NRAA
- Top 20 significant hits
- Compared to list of manually curated TEs in GenBank
- Non-TEs with significant hit (E-value e-05): homology
12Ensembl Compara Database
- Stores genes and their relationships with one another.
- Orthologue and paralogue assignments based on reconciliation of gene trees with species tree.
- Alignments made using MUSCLE and trees constructed using TreeBest
- Stores multiple alignments and trees.
- Easy access to the database through the Application Programming Interface
- Vilella et al. 2008 http://genome.cshlp.org/content/early/2008/11/24/gr.073585.107
BMC Bioinformatics (2004) 5:113
http://treesoft.sourceforge.net/njtree.shtml
Gramene Project: Nucleic Acids Research, 34: D717-723 (2006)
Ensembl Compara: Genome Res. 2004 May; 14(5): 934-941
13Compara Overview
- Gene Trees for 13 Species (including dicots, grasses, and non-plants)
- Infers Orthologues and Paralogues by reconciling gene tree with species tree
- Vilella A.J., et al. (2008). Genome Res. Pre-print doi:10.1101/gr.073585.107
14Compara Output
- Blue nodes are speciation events giving rise to orthologues
- Red nodes are duplication events giving rise to paralogues
- Node 1: speciation event at base of monocots and dicots
- Node 2: duplication event at base of grasses
15Compara Results: Working-set Distribution in Families
- 51,857 genes in 10,727 non-maize-specific families (trees)
- Remaining 61,813 (gray) in 11,067 maize-specific families (trees) or no tree
- No tree likely represents false positive annotations
- Maize-specific trees with 2 genes: false positives on overlapping BACs
- Maize-specific trees with many genes: repetitive sequences
16Orthologue Stats
- Annotation sets
- Maize 3a.50 Working gene set
- Rice TIGR5
- Sorghum Sbi1.4
- Orthologue relationships
- maize-rice 48,761
- maize-sorghum 57,215
- rice-sorghum 30,442
- Gene counts
17Under-representation of Maize Orthologues?
Genes having orthologues in
18Orthologue clusters
Duplication prior to monocot/dicot split
Orthologue missing
Orthologue present
protein kinase APK1B, chloroplast precursor
19Truncated Genes
Query-hit-length-ratio (QHLR) = peptide length of maize gene / length of best significant hit
- Gene fragments enriched in non-syntenic genes
- Consistent with TE mediated movement and
truncation of parent gene.
Appears to be a real biological phenomenon as
opposed to an annotation artifact
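The ratio above can be sketched in a few lines. This is an illustration under stated assumptions rather than the pipeline's code: the function names are hypothetical, lengths are taken to be peptide lengths in amino acids, and the 0.5 cutoff is borrowed from the "QHLR" figure quoted on the next slide. Treating a ratio at or below the cutoff as a likely fragment is an assumption.

```python
def qhlr(maize_peptide_len, best_hit_len):
    """Query-hit-length-ratio: maize peptide length over best-hit length."""
    if best_hit_len <= 0:
        raise ValueError("best_hit_len must be positive")
    return maize_peptide_len / best_hit_len


def is_probable_fragment(maize_peptide_len, best_hit_len, cutoff=0.5):
    """Flag a model as a likely fragment when its QHLR does not exceed the cutoff.

    The cutoff and the direction of the comparison are assumptions for illustration.
    """
    return qhlr(maize_peptide_len, best_hit_len) <= cutoff


# Hypothetical lengths: a 120-aa maize model whose best hit is 400 aa long.
print(qhlr(120, 400), is_probable_fragment(120, 400))
```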
20QHLR in Working Set
- GeneBuilder shows higher quality than Fgenesh
- In all, 48,331 genes have a QHLR > 0.5
21Complete CDS
Valid start, stop, no internal stops
69,477 genes (excluding TE) have complete CDS
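A minimal sketch of the completeness test described above. It assumes the standard genetic code, an ATG start codon, and a coding sequence whose length is a multiple of three; the slides do not state these details, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def has_complete_cds(cds):
    """Return True if the CDS starts with ATG, ends with a stop codon,
    and contains no in-frame stop codon in between."""
    seq = cds.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    starts_ok = codons[0] == "ATG"
    ends_ok = codons[-1] in STOP_CODONS
    no_internal_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return starts_ok and ends_ok and no_internal_stop


# Toy sequences for illustration only.
print(has_complete_cds("ATGAAATAA"))     # start, one codon, stop
print(has_complete_cds("ATGTAAAAATAA"))  # in-frame internal stop
```

As the earlier slide notes, a failure here could reflect a problem in the gene itself or in the gene model.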
22Combining Scores
105,154 genes not yet classified as TE
39,050 genes have homology or orthology and pass
both CDS and QHLR
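The combination rule on this slide can be sketched as a single boolean filter: not classified as TE, homology or orthology present, complete CDS, and QHLR above a cutoff. The record layout, field names, and example genes below are hypothetical, and the 0.5 cutoff is assumed from the QHLR slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneFlags:
    """Per-gene confidence metrics (hypothetical field names)."""
    gene_id: str
    is_te: bool
    has_homology: bool
    has_orthologue: bool
    complete_cds: bool
    qhlr: Optional[float]  # None when there is no significant hit


def passes_filters(gene, qhlr_cutoff=0.5):
    """Not TE, has homology or orthology, complete CDS, and QHLR above cutoff."""
    return (
        not gene.is_te
        and (gene.has_homology or gene.has_orthologue)
        and gene.complete_cds
        and gene.qhlr is not None
        and gene.qhlr > qhlr_cutoff
    )


# Made-up genes for illustration only.
genes = [
    GeneFlags("g1", False, True, False, True, 0.92),
    GeneFlags("g2", True, True, True, True, 0.95),     # TE-encoded
    GeneFlags("g3", False, False, True, True, 0.31),   # likely fragment
    GeneFlags("g4", False, False, False, True, None),  # no homology
]
kept = [g.gene_id for g in genes if passes_filters(g)]
print(kept)
```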
23Additional Considerations
- The 39,050 genes remaining after filtering retain 39,362 sorghum/rice orthologues, down from 43,345 in the starting set of unique orthologues.
- Thus will want to add back about 4,000 genes using a metric of best model if we go forward with this approach
- More thorough screen of TEs, especially helitrons
- Account for duplicate gene calls due to overlapping BACs
24FTP Site
- http://brie4.cshl.edu/maize/bac_paper/
- Includes fasta files, gff files, orthologue
files, and compiled evidence file
25Chromosome Distribution
26Acknowledgements
- Chengzhi Liang: GeneBuild
- Will Spooner: Compara data
- Shiran Pasternak: Project leader and developer
- Other team members (Sharon Wei, Liya Ren)
- Doreen Ware
27Homology Associated w/ Good Models