Has the Yoyo Stopped - PowerPoint PPT Presentation

About This Presentation
Title:

Has the Yoyo Stopped

Provided by: OGS

Transcript and Presenter's Notes

1
Has the Yo-yo Stopped? A Human Gene Number Update
Dr Christopher Southan, Proteome Discovery, Oxford Glycosciences
Presentation for the HGMP Resource Centre, Hinxton, April 2003
2
Presentation Outline
  • The importance of gene number
  • Gene definition and detection
  • Genome inflation arguments
  • Post-completion changes in model eukaryotes
  • Ensembl and NCBI gene pipeline numbers
  • Completed chromosome gene numbers
  • Man, mouse and fish
  • Post-genomic transcript and protein increases
  • Novel gene skimming
  • Proteomics for gene detection
  • Conclusions

3
So Who Cares About Gene Number?
  • Central to evolutionary questions of gene
    expansion vs. protein diversity from alternative
    splicing and post-translational modifications
  • Announcement of genome closure sets expectations
    for gene closure
  • Gene delineation essential for genetics and
    clinical genomics
  • Defines limits for the number of potential drug
    targets and therapeutic proteins
  • Sets the baseline for Human Proteome Organisation
    and other academic large-scale proteomics
    initiatives, e.g. Sanger Atlas of Gene Expression
  • Sets the baseline for commercial initiatives such
    as the OGS/Confirmant (www.confirmant.com) Protein
    Atlas of the Human Genome™, a database of
    mass-spec data on human proteins mapped onto
    genome data

4
Definitions
  • The Guidelines for Human Gene Nomenclature define
    a gene as "a DNA segment that contributes to
    phenotype/function. In the absence of
    demonstrated function a gene may be characterised
    by sequence, transcription or homology"
  • This presentation is concerned with the
    protein-coding gene number - defined as
    transcriptional units that translate to one or
    more proteins that share overlapping sequence
    identity and are products of the same unique
    genomic locus and strand orientation
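
The locus-and-strand part of this definition can be sketched as a counting rule. This is a minimal illustration only: it assumes that overlapping coordinates on the same chromosome and strand stand in for "the same unique genomic locus", and the `Transcript` type, `count_loci` function and example coordinates are hypothetical, not taken from any annotation pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int


def count_loci(transcripts):
    """Count loci: overlapping transcripts on one chromosome and strand merge into one."""
    spans_by_strand = defaultdict(list)
    for t in transcripts:
        spans_by_strand[(t.chrom, t.strand)].append((t.start, t.end))

    loci = 0
    for spans in spans_by_strand.values():
        spans.sort()
        current_end = None
        for start, end in spans:
            if current_end is None or start > current_end:
                loci += 1            # no overlap with the previous span: new locus
                current_end = end
            else:
                current_end = max(current_end, end)  # extend the current locus
    return loci


# Hypothetical example: two overlapping splice variants, one antisense
# transcript on the opposite strand, and one distant neighbour.
example = [
    Transcript("variant_a", "21", "+", 100, 900),
    Transcript("variant_b", "21", "+", 400, 1200),
    Transcript("antisense", "21", "-", 500, 800),
    Transcript("neighbour", "21", "+", 5000, 6000),
]
print(count_loci(example))
```

Under this rule splice variants do not inflate the count, while an antisense transcript at the same position is counted separately because the strand differs.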

5
Spread of Estimates in the Literature
6
Evidence for Identifying Genes
  • Bioinformatic
      • Detection of protein identity in genomic DNA
      • Gene prediction with protein similarity support
      • Matches with ESTs that include ORFs and/or splice
        sites
      • Cross-species comparisons for orthologous exon
        detection
      • Presence of gene anatomy features, e.g. CpG
        islands, promoters, transcription start sites,
        polyadenylation signals and the absence of repeat
        elements
  • Experimental
      • Cloning of predicted genes
      • Detection of active transcription by Northern
        blot, RT-PCR or microarray hybridisation
      • Loss-of-function approaches
      • High-throughput transcript sampling by EST or
        SAGE tagging
      • In-vitro expression
      • Direct verification of protein sequence by Edman
        sequencing, mass-mapping and/or MS/MS sequencing

7
Arguments for high numbers (I)
  • Model eukaryotes (yeast/worm/fly) will show a
    significant post-genomic rise in gene number
  • Human genome assembly is not complete
  • Gene prediction programs have a significant
    false-negative rate
  • The Ensembl pipeline is conservative
  • Mammalian protein and transcript coverage is
    incomplete
  • Selective skimming experiments have revealed new
    genes
  • Extensive human/mouse genomic sequence
    conservation

8
Arguments for high numbers (II)
  • There exists a substantial subset of
    cryptic proteins (5,000 to 10,000) that have
    the following characteristics:
  • Low specificity of detection by gene prediction
    (predominantly single exon)
  • Not sampled in any mammals by mRNAs or ESTs (rare
    or restricted transcripts)
  • Diverged in sequence from all other proteins in
    current databases (rapidly evolving and
    clade-specific)
  • Predominantly short proteins (smORFs)

9
Model Eukaryotes: No Significant Post-Completion
Gene Increases
  • S. cerevisiae: 1.5% increase since 1997
  • C. elegans: 2% increase since 1998
  • D. melanogaster: 2% decrease since 2001
  • Ciona intestinalis: protochordate with many
    vertebrate proteins but 4,000 fewer genes than
    C. elegans
  • S. pombe: only 4,824 genes
  • Massive functional genomics focus on yeast, worm
    and fly

10
Model Eukaryotes: Close to Gene Number Closure?
  • S. cerevisiae: remaining uncertainties in ORF
    totals
      • 6128 (Snyder & Gerstein 2003)
      • 6202 from EBI
      • 6356 from SGD-Stanford
      • 6449 from CYGD-MIPS
  • Mass-spectrometric identification of tryptic
    peptides
      • S. cerevisiae: 23% ORF confirmation and 60(?)
        novels; P. falciparum: 24% ORF confirmation and
        100 orphan peptides
  • Themes from latest Drosophila re-annotation
      • 45% of genes changed
      • 800 new genes balanced by reduced gene
        fragmentation
      • Increased transcript length and exon count

11
Mammals vs. Eukaryotes Average Protein Length
and Exon Number
  • No evidence of constitutive under-prediction
    in model eukaryotes
  • Suggestion of a skew towards smaller proteins
    in mammals

12
Post-genomic Coverage of Protein and Transcript
Data
  • The cornerstones of genome annotation, locating
    transcript identities and detecting protein
    homology, are directly related to coverage in
    non-genomic data
  • Since the first draft human genome in early 2001
    there has been a massive increase in the human
    mRNA and protein data
  • EST data feeds several international
    high-throughput mRNA projects
  • EST data is increasingly used as supporting
    evidence for predicted genes
  • Data from other mammals can be used for homology
    detection of human genes
  • These data increases would be expected to result
    in an increased gene number

13
Mammalian Transcript Coverage in UniGene, March
2003
  • 107,000 human mRNAs cluster to 28,000
    gene products
  • Average human mRNA coverage: 3.8-fold
  • 95% of unique human mRNAs are covered by at
    least 1 EST
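
The fold-coverage figure follows directly from the two UniGene counts on this slide. A quick check, assuming coverage is simply mRNAs divided by clusters (the function name is illustrative):

```python
def fold_coverage(sequences: int, clusters: int) -> float:
    """Average number of sequences per cluster."""
    return sequences / clusters


human_mrnas = 107_000      # human mRNAs in UniGene, March 2003 (from the slide)
unigene_clusters = 28_000  # clustered gene products (from the slide)

fold = fold_coverage(human_mrnas, unigene_clusters)
print(f"{fold:.1f}-fold")
```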

14
Human Transcripts: Post-genomic mRNA Growth in
UniGene
  • Rapid growth in redundant mRNA
  • But slow growth in clustered set: 9,000 over 2
    years
  • This will include some splice variants

15
Human Protein Number Changes in the International
Protein Index and SPTr
  • Growth in redundancy-reduced SPTr: only 5,000
    over 18 months
  • IPI inflated by NCBI XP proteins from
    unsupported predictions
  • IPI includes increasing numbers of alternative
    splice forms

16
Mammalian Post-Genomic Protein Growth: Grist to
the Genome Annotation Mill
  • Growth in SPTr despite 100% redundancy removal
  • Mouse: biggest increase, 7.4-fold
  • Predominantly re-sampling the same mammalian gene
    set?

17
Ensembl Gene Number Essentially Flat
  • Massive increase in human protein and transcript
    coverage over 2 years
  • But 24,847 genes, only 801 more than the first
    release
  • Knowns rose from 90% in Nov-01 to over 95% in Mar-03
  • Novel genes fell from 12,398 in Nov-01 to 5,421 (21%)
    in Mar-03
  • Exons per gene rose from 6.5 in Jan-02 to 10.0 in Mar-03
  • Alternative splicing rose from 3,669 in Nov-01 to
    12,500 in Mar-02
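
The "essentially flat" claim can be checked from the figures on this slide. The sketch below is arithmetic only; it assumes the bracketed 21 after the novel-gene count is a percentage of the current total, which the transcript does not state explicitly.

```python
current_genes = 24_847       # Ensembl gene total (from the slide)
gain_since_first = 801       # increase over the first release (from the slide)
novel_genes = 5_421          # novel genes, Mar-03 (from the slide)

first_release = current_genes - gain_since_first
growth_pct = 100 * gain_since_first / first_release
novel_pct = 100 * novel_genes / current_genes   # assumed meaning of "(21)"

print(f"first release: {first_release} genes")
print(f"growth since first release: {growth_pct:.1f}%")
print(f"novel share of current total: {novel_pct:.1f}%")
```

A rise of a few percent over two years is small next to the growth in transcript and protein data reported on the preceding slides.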

18
Ensembl and NCBI GP31 Comparison
  • Gene numbers (Ensembl 24,847 vs. NCBI 26,846)
    approximately congruent except higher NCBI totals
    on 14 and 22
  • Ensembl novels approximately 20% for each
    chromosome, with a maximum of 43% for Y and a
    minimum of 12% for 17

19
NCBI Gene Number Still Yo-yoing
  • NCBI genomic pipeline includes varying
    proportions of EST-only supported and
    unsupported ab initio predictions as RefSeq XP
    protein models
  • New category locp in GP32 statistics page
    infers 23,270 protein-coding genes

20
Humans, rodents and fish
  • Rodents lower numbers despite 500 more ODR
    genes
  • Lower exon counts from reduced transcript
    coverage?
  • Lower alternative splicing from reduced
    transcript coverage?
  • Teleost fish known to have lineage-specific
    duplications

21
Addressing the smORF Question: Protein Size
Distributions in Human SPTr
Pre Oct-01 6.3 gt 100aa
Post Oct-01 5.5 gt 100aa
Novel in title 3.4 gt 100aa
22
Addressing the smORF Question (II)
  • No database evidence for substantially increased
    smORF discovery in eukaryotes or mammals
  • The observation that only 1% of mouse genes
    have no detectable human homology contradicts the
    idea of large order-specific gene expansion in
    mammals
  • Although small proteins are less conserved i.e.
    evolve more rapidly, those much shorter than 100
    residues will fall below the threshold necessary
    to fold into the domain structures necessary for
    biological function

23
Pseudogenes: Potential False-positives in Genome
Annotation
  • Human upper estimate of 20,000 (Harrison et al.
    2002)
  • Mouse estimate of 14,000 with 12 in gene set
  • Completed chromosome teams annotate an average of
    14 pseudoreal genes
  • But this varies between 101 for chrom 7 and
    2.31 for 22
  • Conservative estimate for Ensembl would reduce
    human gene total to 22,500
  • NCBI LocusLink 2172 with 80 (3.5) expressed as
    mRNA
  • Difficult to prove null-translation for
    pseudogenes with minor disablements such as
    premature stops
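
The size of the pseudogene adjustment implied by the conservative estimate can be worked out as below. This is a sketch under one assumption: that the adjustment starts from the Ensembl total of 24,847 quoted on the Ensembl slide. The transcript does not state the starting figure on this slide.

```python
ensembl_total = 24_847   # assumed starting total (Ensembl gene number slide)
adjusted_total = 22_500  # conservative pseudogene-adjusted total (this slide)

implied_pseudogenes = ensembl_total - adjusted_total
implied_pct = 100 * implied_pseudogenes / ensembl_total

print(f"implied pseudogenes in gene set: {implied_pseudogenes}")
print(f"implied share of gene set: {implied_pct:.1f}%")
```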

24
Experimental Transcript Skimming as Evidence for
High Gene Numbers
  • Exon arrays (Dunham et al. 1999)
  • Gene arrays (Penn et al. 2000)
  • RT-PCR (Das et al. 2001)
  • SAGE-tags (Saha et al. 2002, Chen et al. 2002)
  • Oligo tiling (Kapranov et al. 2002)
  • No novel proteins were submitted to the primary
    databases
  • There is increasing evidence for significant
    amounts of antisense and other non-ORF
    transcription in human and mouse
  • It now becomes necessary to clone a full-length
    ORF with the necessary features of gene anatomy,
    and to submit it to the public databases, before
    the discovery of novel genes can be claimed

25
Human Proteome Sampling with MS/MS Peptide
Identification
  • 615 from the human heart mitochondria (Taylor et
    al. 2003)
  • 500 from breast cancer cell membranes (Adams et
    al. 2003)
  • 491 from microsomal fractions (Han et al. 2001)
  • 490 from blood serum (Adkins et al. 2003)
  • 311 from the spliceosome (Rappsilber et al. 2002)
  • Total approaches 10% of human genes
  • No reported data on protein prediction
    confirmation
  • Technical caveats on search space for novel gene
    detection by correlative algorithms
  • One novel gene was reported from a genome-only
    peptide match by Kuster et al. in 2001, but it
    then appeared from a high-throughput project later
    in the same year (Tr Q96DA0)
  • Proteomics will have a key impact on
    characterising the proteome but there is no
    evidence so far for significant novel gene
    discovery
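
The "approaches 10%" total can be reproduced by summing the five studies listed above. Two assumptions are made here: the denominator is the Ensembl total of 24,847 from the Ensembl slide, and the protein sets do not overlap. Overlap between studies is likely, so the result is an upper bound.

```python
# Protein identifications per study, as listed on the slide.
ms_ms_studies = {
    "heart mitochondria (Taylor et al. 2003)": 615,
    "breast cancer cell membranes (Adams et al. 2003)": 500,
    "microsomal fractions (Han et al. 2001)": 491,
    "blood serum (Adkins et al. 2003)": 490,
    "spliceosome (Rappsilber et al. 2002)": 311,
}
ensembl_total = 24_847  # assumed denominator (Ensembl gene number slide)

total_identified = sum(ms_ms_studies.values())
share_pct = 100 * total_identified / ensembl_total

print(f"total identifications: {total_identified}")
print(f"share of gene set (upper bound): {share_pct:.1f}%")
```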

26
Gene Numbers for Individual Completed Chromosomes
Five finished chromosomes now published and
annotated for gene content by large independent
teams
27
Vertebrate Genome Annotation (VEGA) for Human
14, 20 and 22
  • Novel CDSs: where an ORF can be determined
  • Novel transcripts: ORFs not frame-fixed by
    homology
  • Putative transcripts: where spliced ESTs define
    intron/exon boundaries but not an ORF

28
Gene Numbers for Individual Completed Chromosomes
  • Averaging the completed chromosomes exceeds
    Ensembl GP31 genes by 12%
  • This extrapolates to 28,000 genes
  • The five chromosomes still only cover 13% of the
    genome
  • The chromosome reports were made at different
    times using different assemblies and different
    grades of gene definition and evidence support
  • Difficult to explicitly cross-map VEGA vs.
    Ensembl chromosome gene numbers
  • Future status of partial genes unclear
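
The extrapolation to 28,000 can be reproduced as follows. The sketch assumes that the excess of 12 is a percentage, that it applies uniformly across the genome, and that the Ensembl GP31 total is the 24,847 quoted on the Ensembl slide.

```python
ensembl_gp31_genes = 24_847    # assumed GP31 total (Ensembl gene number slide)
curated_excess = 0.12          # curated chromosomes exceed Ensembl by 12% (assumed unit)

extrapolated = ensembl_gp31_genes * (1 + curated_excess)
rounded = int(round(extrapolated, -3))  # round to the nearest thousand

print(f"extrapolated genome-wide total: {extrapolated:.0f}")
print(f"rounded: {rounded}")
```

Because the five chromosomes cover only a small fraction of the genome, the uniform-excess assumption is the weakest part of this estimate.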

29
An Example of Disappearing Novelty
  • Cysteine and tyrosine-rich 1 (CYYR1), a novel
    unpredicted gene on human chromosome 21 (21q21.2),
    encodes a cysteine and tyrosine-rich protein and
    defines a new family of highly conserved
    vertebrate-specific genes (Vitale et al. Gene
    2002 May 15;290:141-51)
  • Nineteen additional unpredicted transcripts from
    human chromosome 21 (Reymond et al. Genomics 2002
    Jun;79(6):824-32)
  • Human accessions
      • BF676689, 834 bp EST, 21-DEC-2000
      • AY061853, 2320 bp, 154 aa, 11-JUN-2002, Reymond
        et al., Geneva
      • AF401639, 2686 bp, 154 aa, 13-JUN-2002, Vitale
        et al., Bologna
      • AAM56646, 154 aa, 20-JUN-2002, US Patent 6368794
      • AL833200, 3048 bp (no CDS), 12-JUL-2002, German
        Genome Project
      • AK054581, 1678 bp, 262 aa, 01-AUG-2002, NEDO
        human project
      • BC036761, 2000 bp, 154 aa, 26-AUG-2002, NIH-MGC
        Project
      • ENSG00000166265, 154 aa, Sept 2002
  • Mouse

30
Disappearing Novelty (II)
  • EMBL hum cds 2003: 1491
  • Plus novel: 159
  • Plus PubMed 2003: 120
  • Novel in title: 11
  • Previous cds: 8
  • Novel genes: 2

31
Current Numbers From Major Public Gene Sets
  • Lower numbers explicitly non-redundant and
    exclude pseudogenes
  • Higher numbers have increasing splice variant
    content

32
So What Would Constitute Gene Closure?
  • The human genome was closed on April 14th but
    yeast gene number still not closed after six
    years
  • Comparative genomics will contribute to resolving
    the mammalian gene sets e.g. three-way
    human/mouse/rat
  • Closure-by-clone from VEGA
  • Proteomic closure by confirming at least one
    protein splice form from all plausible genes
    (expression in vitro and detection in vivo?)
  • Grey areas are likely to remain, e.g.
    transcribed pseudogenes producing truncated
    proteins, and apparently intact genes that may
    have undetectable impairments rendering them
    functionally superfluous and translationally
    silent
  • Grey areas may not be numerically large

33
Conclusions
  • The model eukaryotes have not shown post-genomic
    rises in gene number
  • The Ensembl gene number has been essentially flat
  • The pseudogene-adjusted Ensembl gene total on a
    largely complete GP is 22,000
  • The five curated complete chromosomes extrapolate
    to 28,000 but leave many unclosed annotations
  • The massive increase in post-genomic transcript
    coverage is extending exons but predominantly
    re-sampling known genes
  • Database submissions of novel human genes have
    slowed to a trickle
  • Initial mouse and rat gene sets have lower gene
    numbers than human
  • No evidence for large numbers of cryptic smORFs
  • Widespread occurrence of non-protein transcripts
    could explain previous high gene estimates from
    transcript skimming
  • Gene number closure likely to be well below
    30,000

34
Acknowledgments
  • Paul Kersey for IPI figures
  • Lucas Wagner of the NCBI for the retrospective
    UniGene data
  • Arek Kasprzyk of the EBI for historical and
    preview Ensembl release statistics
  • Numerous other people at NCBI, EBI, and Sanger
    Centre who have graciously answered queries on
    their data collections
  • The OGS Proteome Discovery Team for useful
    discussions