Outline - PowerPoint PPT Presentation

About This Presentation
Title:

Outline

Description:

RepeatMasker with MIPS REcat Release 4.3 (Deficient in Helitron ... 39,050 genes have homology or orthology and pass both CDS and QHLR. 9/11/09. CSHL. 23 ...

Provided by: ste118

Transcript and Presenter's Notes

1
Outline
  • Annotation status: what we've done
  • Preliminary analysis
  • Data FTP

2
Release 3a.50
  • 16,587 clones, 2,779 Mb of sequence
  • Within-BAC assemblies
  • Contig N50: 33,145 bp
  • Scaffold N50: 54,953 bp
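
For readers unfamiliar with the statistic, N50 is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative and are not taken from Release 3a.50.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy contig lengths (hypothetical, not the maize assembly).
toy_contigs = [10, 20, 30, 40, 50]
print(n50(toy_contigs))  # prints 40: 50 + 40 = 90 >= 75 (half of 150)
```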

3
Pre-annotation: Mask Repeats
RepeatMasker with MIPS REcat Release 4.3 (deficient in Helitron-encoded genes)
Find protein-coding genes
4
Evidence-based Gene Build Pipeline
Pipeline uses the Ensembl framework (www.ensembl.org)
5
Evidence Used
  • 36,430 Ceres fl_cDNA
  • 14,963 Arizona fl_cDNA
  • 1,462,607 537,726 ESTs and another 18,181 mRNAs
    from maize
  • 1,217,859 ESTs and another 72,919 mRNAs from rice
  • 2,448,641 ESTs and another 14,015 mRNAs from
    other monocots
  • 359,942 Swiss-Prot proteins from all species
  • 494,444 non-maize plant proteins in Trembl
  • 94,734 GenBank proteins from plant species
  • 52,177 rice proteins from rice gene annotations
  • 36,338 proteins from sorghum gene annotations

6
Captured 95-97% of Genes
7
Working Gene Set
  • Set of predicted genes subjected to secondary (2°)
    annotations
      • Compara gene trees/orthologue calling
      • Xref to SwissProt
      • InterPro (Pfam domains, TMHMM, etc.)
      • GO assignment
  • Secondary annotations will help refine the gene set

8
Working Set Source
  • Gene-build on 16,587 BACs (2,778,853,373 bp)
      • Evidence based (fl_cDNA, EST, mRNA, proteins)
      • 90,828 loci
  • Fgenesh (monocot) on repeat-masked DNA
      • 22,843 loci not overlapping with gene-build
  • Large and dubious set
      • 113,671 genes, 148,277 transcripts
  • Expect
      • False positives
      • TEs
      • Fragmented genes
      • Redundancies due to overlapping BAC clones
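
The working set is the evidence-based loci plus only those Fgenesh loci that do not overlap them (90,828 + 22,843 = 113,671). The sketch below shows one way such a merge could be done. It is an illustration under assumptions, not the pipeline's actual code: loci are hypothetical (clone, start, end) tuples with inclusive coordinates, and strand is ignored.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (clone, start, end) loci share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def non_overlapping(candidates, reference):
    """Return the candidate loci that overlap no reference locus."""
    by_clone = defaultdict(list)
    for locus in reference:
        by_clone[locus[0]].append(locus)
    return [c for c in candidates
            if not any(overlaps(c, r) for r in by_clone[c[0]])]


# Toy loci (hypothetical clone names and coordinates).
gene_build = [("bacA", 100, 500), ("bacB", 1, 50)]
fgenesh = [("bacA", 450, 900), ("bacA", 600, 900), ("bacB", 60, 80)]

extra = non_overlapping(fgenesh, gene_build)
working_set = gene_build + extra
print(len(working_set))

# The counts on the slide add up the same way.
assert 90_828 + 22_843 == 113_671
```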

9
Defining Confidence Levels (for genes, not necessarily annotation)
  • Homology-based classification
      • BLASTP vs GenBank NRAA
      • Cull out TE-encoded genes
      • Find genes showing homology
  • Has orthologue/paralogue
      • Results from Compara gene trees
  • Evidence types
      • fl_cDNA, mRNA, protein, EST, FgenesH
  • Gene fragment detection
      • Ratio of maize gene length to its best hit in
        rice (TIGR5) and sorghum (Sbi1.3)
  • Complete CDS
      • Has valid start and stop codons and no internal
        stops
      • Could reflect problem in gene or gene model
  • TE Pfams
      • Have not looked at this yet

10
Confidence Metrics
11
Homology-Based Classification
  • BLASTP vs NRAA
  • Top 20 significant hits
  • Compared to list of manually curated TEs in
    GenBank
  • Non-TEs with a significant hit (E-value threshold
    e-05) are classified as having homology
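
The classification above can be sketched roughly as follows. This is an illustration only: the slide does not say how many TE hits are needed to call a gene TE-encoded, so the sketch assumes a single TE hit among the top significant hits is enough, and it treats the E-value threshold as inclusive. The function name, labels and accession strings are hypothetical.

```python
def classify_gene(hits, te_ids, max_evalue=1e-5, top_n=20):
    """Classify one gene from its BLASTP hits.

    hits   -- iterable of (subject_id, evalue) pairs
    te_ids -- set of subject ids from a curated TE list
    Returns "TE", "homology" or "no_homology".
    """
    significant = sorted(
        (h for h in hits if h[1] <= max_evalue),  # assumed inclusive cutoff
        key=lambda h: h[1],
    )[:top_n]
    if not significant:
        return "no_homology"
    # Assumption: any curated TE among the top hits marks the gene as TE.
    if any(subject in te_ids for subject, _ in significant):
        return "TE"
    return "homology"


# Toy data (hypothetical accessions).
curated_tes = {"TE001", "TE002"}
print(classify_gene([("P1", 1e-30), ("TE001", 1e-12)], curated_tes))
print(classify_gene([("P1", 1e-30), ("P2", 1e-8)], curated_tes))
print(classify_gene([("P3", 0.2)], curated_tes))
```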

12
Ensembl Compara Database
  • Stores genes and their relationships with one
    another.
  • Orthologue and paralogue assignments based on
    reconciliation of gene trees with species tree.
  • Alignments made using MUSCLE and trees
    constructed using TreeBest.
  • Stores multiple alignments and trees.
  • Easy access to the database through the
    Application Programming Interface.
  • Vilella et al. 2008: http://genome.cshlp.org/content/early/2008/11/24/gr.073585.107

BMC Bioinformatics (2004) 5:113

http://treesoft.sourceforge.net/njtree.shtml
Gramene Project: Nucleic Acids Research 34:D717-723 (2006)
Ensembl Compara: Genome Res. 2004 May;14(5):934-941
13
Compara Overview
  • Gene Trees for 13 Species (including dicots,
    grasses, and non-plants)
  • Infers orthologues and paralogues by reconciling
    gene tree with species tree
  • Vilella A.J., et al. (2008). Genome Res.
    Pre-print doi:10.1101/gr.073585.107

14
Compara Output
  • Blue nodes are speciation events giving rise to
    orthologues
  • Red nodes are duplication events giving rise to
    paralogues
  • Node 1: speciation event at base of monocots and
    dicots
  • Node 2: duplication event at base of grasses
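
To make the speciation/duplication labels concrete, here is a small sketch that labels the internal nodes of a toy gene tree. It uses the simple "species overlap" heuristic (a node is called a duplication when the same species appears on both sides), which is a stand-in for illustration and not the TreeBest reconciliation that Compara actually runs. The tree encoding (nested 2-tuples, leaves written as "species:gene") and the gene names are hypothetical.

```python
def label_nodes(tree, events=None):
    """Label internal nodes of a binary gene tree in post-order.

    A leaf is a string "species:gene"; an internal node is a 2-tuple.
    Returns (species_set, events), where events is a list of
    (event_type, species_set) tuples, one per internal node.
    """
    if events is None:
        events = []
    if isinstance(tree, str):
        return {tree.split(":")[0]}, events
    left, _ = label_nodes(tree[0], events)
    right, _ = label_nodes(tree[1], events)
    # Species overlap heuristic: shared species implies a duplication.
    kind = "duplication" if left & right else "speciation"
    events.append((kind, left | right))
    return left | right, events


# Toy tree: one duplication at the base of the grasses, then speciations.
toy_tree = (
    (("maize:a1", "sorghum:a1"), "rice:a1"),
    (("maize:b1", "sorghum:b1"), "rice:b1"),
)
species, events = label_nodes(toy_tree)
for kind, members in events:
    print(kind, sorted(members))
```

In this toy tree the root is labelled a duplication, so maize:a1 and maize:b1 are paralogues, while maize:a1 and rice:a1 are separated by a speciation node and are orthologues.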

15
Compara Results: Working-set Distribution in Families
  • 51,857 genes in 10,727 non-maize-specific
    families (trees)
  • Remaining 61,813 (gray) in 11,067 maize-specific
    families (trees) or no tree
  • No tree: likely represents false-positive
    annotations
  • Maize-specific trees with 2 genes: false positives
    on overlapping BACs
  • Maize-specific trees with many genes: repetitive
    sequences

16
Orthologue Stats
  • Annotation sets
      • Maize: 3a.50 working gene set
      • Rice: TIGR5
      • Sorghum: Sbi1.4
  • Orthologue relationships
      • maize-rice: 48,761
      • maize-sorghum: 57,215
      • rice-sorghum: 30,442
  • Gene counts

17
Under-representation of Maize Orthologues?
Genes having orthologues in
18
Orthologue clusters
Duplication prior to monocot/dicot split
Orthologue missing
Orthologue present
protein kinase APK1B, chloroplast precursor
19
Truncated Genes
Query-hit-length ratio (QHLR) = peptide length of maize gene / length of best significant hit
  • Gene fragments enriched in non-syntenic genes
  • Consistent with TE-mediated movement and
    truncation of parent gene

Appears to be a real biological phenomenon as
opposed to an annotation artifact
20
QHLR in Working Set
  • GeneBuilder shows higher quality than Fgenesh
  • In all, 48,331 genes have a QHLR > 0.5
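
As a worked illustration of the ratio, the sketch below computes QHLR for a query peptide and its best hit and applies the 0.5 cutoff quoted above. The function names and the example lengths are hypothetical.

```python
def qhlr(maize_peptide_len, best_hit_len):
    """Query-hit-length ratio: maize peptide length / best-hit length."""
    if best_hit_len <= 0:
        raise ValueError("best-hit length must be positive")
    return maize_peptide_len / best_hit_len


def passes_qhlr(maize_peptide_len, best_hit_len, cutoff=0.5):
    """True if the gene model covers more than `cutoff` of its best hit."""
    return qhlr(maize_peptide_len, best_hit_len) > cutoff


# Toy lengths in amino acids (hypothetical).
print(qhlr(120, 400), passes_qhlr(120, 400))  # likely fragment
print(qhlr(380, 400), passes_qhlr(380, 400))  # near full length
```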

21
Complete CDS
Valid start, stop, no internal stops
69,477 genes (excluding TE) have complete CDS
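
A minimal sketch of the three checks (valid start, valid stop, no internal stops). It assumes the standard genetic code, ATG as the only accepted start codon, and a CDS whose length is a multiple of three; the pipeline's actual rules may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def has_complete_cds(cds, start_codon="ATG"):
    """True if cds starts with a start codon, ends with a stop codon,
    and contains no in-frame stop codon in between."""
    seq = cds.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return (
        codons[0] == start_codon
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )


# Toy sequences (hypothetical).
print(has_complete_cds("ATGAAATAA"))     # start, one codon, stop
print(has_complete_cds("ATGTAAAAATAA"))  # internal stop codon
```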
22
Combining Scores
105,154 genes not yet classified as TE
39,050 genes have homology or orthology and pass
both CDS and QHLR
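
The combined filter can be written as a single predicate. The sketch below is an assumption-laden illustration: the record fields are hypothetical, and "pass QHLR" is taken to mean QHLR > 0.5, as on the earlier slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneCall:
    """Per-gene confidence flags (hypothetical field names)."""
    gene_id: str
    is_te: bool
    has_homology: bool
    has_orthologue: bool
    complete_cds: bool
    qhlr: Optional[float]  # None when there is no significant hit


def passes_filter(gene, qhlr_cutoff=0.5):
    """Not a TE, has homology or orthology, complete CDS, QHLR above cutoff."""
    return (
        not gene.is_te
        and (gene.has_homology or gene.has_orthologue)
        and gene.complete_cds
        and gene.qhlr is not None
        and gene.qhlr > qhlr_cutoff
    )


# Toy records (hypothetical ids and values).
genes = [
    GeneCall("g1", False, True, True, True, 0.9),
    GeneCall("g2", True, True, False, True, 0.8),    # TE-encoded
    GeneCall("g3", False, False, True, True, 0.3),   # likely fragment
    GeneCall("g4", False, False, False, True, None), # no homology
]
filtered = [g.gene_id for g in genes if passes_filter(g)]
print(filtered)
```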
23
Additional Considerations
  • The 39,050 genes remaining after filtering retain
    39,362 sorghum/rice orthologues, down from 43,345
    unique orthologues in the starting set.
  • Thus we will want to add back about 4,000 genes
    using a best-model metric if we go forward with
    this approach
  • More thorough screen of TEs, especially helitrons
  • Account for duplicate gene calls due to
    overlapping BACs

24
FTP Site
  • http://brie4.cshl.edu/maize/bac_paper/
  • Includes fasta files, gff files, orthologue
    files, and compiled evidence file

25
Chromosome Distribution
26
Acknowledgements
  • Chengzhi Liang: GeneBuild
  • Will Spooner: Compara data
  • Shiran Pasternak: Project leader and developer
  • Other team members (Sharon Wei, Liya Ren)
  • Doreen Ware

27
Homology Associated w/ Good Models