Title: Has the Yoyo Stopped
1. Has the Yo-yo Stopped? A Human Gene Number Update
Dr Christopher Southan, Proteome Discovery, Oxford Glycosciences
Presentation for the HGMP Resource Centre, Hinxton, April 2003
2. Presentation Outline
- The importance of gene number
- Gene definition and detection
- Genome inflation arguments
- Post-completion changes in model eukaryotes
- Ensembl and NCBI gene pipeline numbers
- Completed chromosome gene numbers
- Man, mouse and fish
- Post-genomic transcript and protein increases
- Novel gene skimming
- Proteomics for gene detection
- Conclusions
3. So Who Cares About Gene Number?
- Central to evolutionary questions of gene expansion vs. protein diversity from alternative splicing and post-translational modifications
- Announcement of genome closure sets expectations for gene closure
- Gene delineation is essential for genetics and clinical genomics
- Defines limits for the number of potential drug targets and therapeutic proteins
- Sets the baseline for the Human Proteome Organisation and other academic large-scale proteomics initiatives, e.g. Sanger Atlas of Gene Expression
- Sets the baseline for commercial initiatives such as the OGS/Confirmant (www.confirmant.com) Protein Atlas of the Human Genome(TM): a database of mass-spec data on human proteins mapped onto genome data
4. Definitions
- The Guidelines for Human Gene Nomenclature define a gene as "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
- This presentation is concerned with the protein-coding gene number, defined as transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation
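The working definition above can be sketched as a counting rule. The following is a minimal illustration, not the method used by any annotation pipeline: it assumes each transcript is described by a chromosome, strand and coordinate range, and it uses genomic overlap as a stand-in for "overlapping sequence identity". The `Transcript` type, the `group_into_loci` helper and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # inclusive
    end: int     # inclusive


def group_into_loci(transcripts):
    """Group transcripts that overlap on the same chromosome and strand."""
    by_key = defaultdict(list)
    for t in transcripts:
        by_key[(t.chrom, t.strand)].append(t)

    loci = []
    for key in sorted(by_key):
        current, current_end = [], None
        for t in sorted(by_key[key], key=lambda t: (t.start, t.end)):
            if current and t.start <= current_end:
                # Overlaps the locus being built: same gene under this rule.
                current.append(t)
                current_end = max(current_end, t.end)
            else:
                if current:
                    loci.append(current)
                current, current_end = [t], t.end
        if current:
            loci.append(current)
    return loci


# Hypothetical coordinates, for illustration only.
example = [
    Transcript("tx1", "21", "+", 100, 500),
    Transcript("tx2", "21", "+", 400, 900),    # overlaps tx1: same locus
    Transcript("tx3", "21", "-", 450, 800),    # opposite strand: separate locus
    Transcript("tx4", "21", "+", 2000, 2500),  # no overlap: separate locus
]
loci = group_into_loci(example)
gene_count = len(loci)
print(gene_count)
```

Under this rule, splice variants sharing a locus and strand count once, which is why redundant transcript growth need not raise the gene number.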
5. Spread of Estimates in the Literature
6. Evidence for Identifying Genes
- Bioinformatic
  - Detection of protein identity in genomic DNA
  - Gene prediction with protein similarity support
  - Matches with ESTs that include ORFs and/or splice sites
  - Cross-species comparisons for orthologous exon detection
  - Presence of gene anatomy features, e.g. CpG islands, promoters, transcription start sites, polyadenylation signals and the absence of repeat elements
- Experimental
  - Cloning of predicted genes
  - Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
  - Loss-of-function approaches
  - High-throughput transcript sampling by EST or SAGE tagging
  - In-vitro expression
  - Direct verification of protein sequence by Edman sequencing, mass-mapping and/or MS/MS sequencing
7. Arguments for high numbers (I)
- Model eukaryotes (yeast/worm/fly) will show a significant post-genomic rise in gene number
- Human genome assembly is not complete
- Gene prediction programs have a significant false-negative rate
- The Ensembl pipeline is conservative
- Mammalian protein and transcript coverage is incomplete
- Selective skimming experiments have revealed new genes
- Extensive human/mouse genomic sequence conservation
8. Arguments for high numbers (II)
- There exists a substantial subset of cryptic proteins (5,000 to 10,000) with the following characteristics:
  - Low specificity of detection by gene prediction (predominantly single exon)
  - Not sampled in any mammals by mRNAs or ESTs (rare or restricted transcripts)
  - Diverged in sequence from all other proteins in current databases (rapidly evolving and clade-specific)
  - Predominantly short proteins (smORFs)
9. Model Eukaryotes: No Significant Post-Completion Gene Increases
- S. cerevisiae: 1.5% increase since 1997
- C. elegans: 2% increase since 1998
- D. melanogaster: 2% decrease since 2001
- Ciona intestinalis: protochordate with many vertebrate proteins but 4,000 fewer genes than C. elegans
- S. pombe: only 4,824 genes
- Massive functional genomics focus on yeast, worm and fly
10. Model Eukaryotes: Close to Gene Number Closure?
- S. cerevisiae: remaining uncertainties in ORF totals
  - 6128 (Snyder & Gerstein 2003)
  - 6202 from EBI
  - 6356 from SGD-Stanford
  - 6449 from CYGD-MIPS
- Mass-spectrometric identification of tryptic peptides
  - S. cerevisiae: 23% ORF confirmation and 60(?) novels
  - P. falciparum: 24% ORF confirmation and 100 orphan peptides
- Themes from latest Drosophila re-annotation
  - 45% of genes changed
  - 800 new genes balanced by reduced gene fragmentation
  - Increased transcript length and exon count
11. Mammals vs. Eukaryotes: Average Protein Length and Exon Number
- No evidence of constitutive under-prediction in model eukaryotes
- Suggestion of a skew towards smaller proteins in mammals
12. Post-genomic Coverage of Protein and Transcript Data
- The cornerstones of genome annotation, locating transcript identities and detecting protein homology, are directly related to coverage in non-genomic data
- Since the first draft human genome in early 2001 there has been a massive increase in human mRNA and protein data
- EST data feeds several international high-throughput mRNA projects
- EST data is increasingly used as supporting evidence for predicted genes
- Data from other mammals can be used for homology detection of human genes
- These data increases would be expected to result in an increased gene number
13. Mammalian Transcript Coverage in UniGene, March 2003
- 107,000 human mRNAs cluster to 28,000 gene products
- Average human mRNA coverage 3.8-fold
- 95% of unique human mRNAs are covered by at least 1 EST
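The 3.8-fold coverage figure follows directly from the two UniGene counts on this slide. A quick check, with variable names chosen here for illustration:

```python
# Counts quoted on the slide (UniGene, March 2003).
human_mrnas = 107_000
gene_clusters = 28_000

# Average number of mRNAs per clustered gene product.
fold_coverage = human_mrnas / gene_clusters
print(f"{fold_coverage:.1f}-fold")
```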
14. Human Transcripts: Post-genomic mRNA Growth in UniGene
- Rapid growth in redundant mRNA
- But slow growth in clustered set: 9,000 over 2 years
- This will include some splice variants
15. Human Protein Number Changes in the International Protein Index and SPTr
- Growth in redundancy-reduced SPTr only 5,000 over 18 months
- IPI inflated by NCBI XP proteins from unsupported predictions
- IPI includes increasing numbers of alternative splice forms
16. Mammalian Post-Genomic Protein Growth: Grist to the Genome Annotation Mill
- Growth in SPTr despite 100% redundancy removal
- Mouse biggest increase of 7.4-fold
- Predominantly re-sampling the same mammalian gene set?
17. Ensembl Gene Number Essentially Flat
- Massive increase in human protein and transcript coverage over 2 years
- But 24,847 genes, only 801 more than the first release
- Knowns up from 90% in Nov-01 to over 95% in Mar-03
- Novel genes down from 12,398 in Nov-01 to 5,421 (21%) in Mar-03
- Exons-per-gene up from 6.5 in Jan-02 to 10.0 in Mar-03
- Alternative splicing up from 3,669 in Nov-01 to 12,500 in Mar-02
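To put "essentially flat" in numbers, the first-release total and the relative growth can be derived from the two figures given on this slide. This is simple arithmetic on the quoted counts; the variable names are illustrative.

```python
# Figures quoted on the slide.
current_genes = 24_847   # Ensembl gene count, March 2003
increase = 801           # growth since the first Ensembl release

# Implied first-release total and relative growth.
first_release = current_genes - increase
growth_pct = 100 * increase / first_release
print(first_release, f"{growth_pct:.1f}%")
```

A rise of roughly 3% over two years is small next to the several-fold growth in supporting transcript and protein data over the same period.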
18. Ensembl and NCBI GP31 Comparison
- Gene numbers (Ensembl 24,847 vs. NCBI 26,846) approximately congruent except higher NCBI totals on 14 and 22
- Ensembl novels approximately 20% for each chromosome, with a maximum of 43% for Y and a minimum of 12% for 17
19. NCBI Gene Number Still Yo-yoing
- NCBI genomic pipeline includes varying proportions of EST-only supported and unsupported ab initio predictions as RefSeq XP protein models
- New category locp in GP32 statistics page infers 23,270 protein-coding genes
20. Humans, Rodents and Fish
- Rodents: lower numbers despite 500 more ODR genes
- Lower exon counts from reduced transcript coverage?
- Lower alternative splicing from reduced transcript coverage?
- Teleost fish known to have lineage-specific duplications
21. Addressing the smORF Question: Protein Size Distributions in Human SPTr
Pre Oct-01: 6.3% below 100 aa
Post Oct-01: 5.5% below 100 aa
Novel in title: 3.4% below 100 aa
22. Addressing the smORF Question (II)
- No database evidence for substantially increased smORF discovery in eukaryotes or mammals
- The observation that only 1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
- Although small proteins are less conserved, i.e. evolve more rapidly, those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function
23. Pseudogenes: Potential False-positives in Genome Annotation
- Human upper estimate of 20,000 (Harrison et al. 2002)
- Mouse estimate of 14,000 with 12 in gene set
- Completed chromosome teams annotate an average of 14 pseudoreal genes
- But this varies between 101 for chrom 7 and 2.31 for 22
- Conservative estimate for Ensembl would reduce human gene total to 22,500
- NCBI LocusLink: 2,172 with 80 (3.5%) expressed as mRNA
- Difficult to prove null-translation for pseudogenes with minor disablements such as premature stops
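The size of the pseudogene correction implied by the adjusted total can be worked out from figures quoted in this presentation. The sketch below assumes the 22,500 adjustment is applied to the Ensembl total of 24,847 genes; the slide does not state the correction factor itself, so the derived percentage is an inference.

```python
# Assumption: the adjustment is applied to the Ensembl total of 24,847 genes.
ensembl_genes = 24_847
pseudogene_adjusted = 22_500

# Implied number and share of gene models treated as pseudogenes.
implied_pseudogenes = ensembl_genes - pseudogene_adjusted
implied_pct = 100 * implied_pseudogenes / ensembl_genes
print(implied_pseudogenes, f"{implied_pct:.1f}%")
```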
24. Experimental Transcript Skimming as Evidence for High Gene Numbers
- Exon arrays (Dunham et al. 1999)
- Gene arrays (Penn et al. 2000)
- RT-PCR (Das et al. 2001)
- SAGE-tags (Saha et al. 2002, Chen et al. 2002)
- Oligo tiling (Kapranov et al. 2002)
- No novel proteins were submitted to the primary databases
- There is increasing evidence for significant amounts of antisense and other non-ORF transcription in human and mouse
- It now becomes necessary to clone a full-length ORF with the necessary features of gene anatomy, and submit it to the public databases, before the discovery of novel genes can be claimed
25. Human Proteome Sampling with MS/MS Peptide Identification
- 615 from the human heart mitochondria (Taylor et al. 2003)
- 500 from breast cancer cell membranes (Adams et al. 2003)
- 491 from microsomal fractions (Han et al. 2001)
- 490 from blood serum (Adkins et al. 2003)
- 311 from the spliceosome (Rappsilber et al. 2002)
- Total approaches 10% of human genes
- No reported data on protein prediction confirmation
- Technical caveats on search space for novel gene detection by correlative algorithms
- One novel gene reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
- Proteomics will have a key impact on characterising the proteome but there is no evidence so far for significant novel gene discovery
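The "approaches 10" total can be checked by summing the five study counts listed on this slide. The sketch assumes the percentage is taken against the Ensembl total of 24,847 genes and that the five protein sets do not overlap; overlap between them would lower the real figure.

```python
# Protein identifications per study, as listed on the slide.
identified = {
    "heart mitochondria": 615,
    "breast cancer cell membranes": 500,
    "microsomal fractions": 491,
    "blood serum": 490,
    "spliceosome": 311,
}

# Assumption: denominator is the Ensembl gene count quoted earlier.
ensembl_genes = 24_847

total_identified = sum(identified.values())
share_pct = 100 * total_identified / ensembl_genes
print(total_identified, f"{share_pct:.1f}%")
```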
26. Gene Numbers for Individual Completed Chromosomes
Five finished chromosomes now published and annotated for gene content by large independent teams
27. Vertebrate Genome Annotation (VEGA) for Human 14, 20 and 22
- Novel CDSs: where an ORF can be determined
- Novel transcripts: ORFs not frame-fixed by homology
- Putative transcripts: where spliced ESTs define intron/exon boundaries but not an ORF
28. Gene Numbers for Individual Completed Chromosomes
- Averaging the completed chromosomes exceeds Ensembl GP31 genes by 12%
- This extrapolates to 28,000 genes
- The five chromosomes still only cover 13% of the genome
- The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support
- Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
- Future status of partial genes unclear
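The extrapolation to 28,000 can be reproduced by scaling the whole-genome Ensembl total by the excess seen on the curated chromosomes. The sketch reads the quoted 12 as a percentage and assumes it applies uniformly across the genome, which is exactly the caveat raised by the five chromosomes covering only a small part of it.

```python
# Assumptions: 12% excess on curated chromosomes, applied uniformly
# to the Ensembl whole-genome total of 24,847 genes.
ensembl_genes = 24_847
curated_excess = 0.12

extrapolated = ensembl_genes * (1 + curated_excess)
print(round(extrapolated), round(extrapolated, -3))
```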
29. An Example of Disappearing Novelty
- "Cysteine and tyrosine-rich 1 (CYYR1), a novel unpredicted gene on human chromosome 21 (21q21.2), encodes a cysteine and tyrosine-rich protein and defines a new family of highly conserved vertebrate-specific genes" (Vitale et al. Gene 2002 May 15;290:141-51)
- "Nineteen additional unpredicted transcripts from human chromosome 21" (Reymond et al. Genomics 2002 Jun;79(6):824-32)
- Human accessions
  - BF676689, 834 bp, EST, 21-DEC-2000
  - AY061853, 2320 bp, 154 aa, 11-JUN-2002, Reymond et al., Geneva
  - AF401639, 2686 bp, 154 aa, 13-JUN-2002, Vitale et al., Bologna
  - AAM56646, 154 aa, 20-JUN-2002, US Patent 6368794
  - AL833200, 3048 bp (no CDS), 12-JUL-2002, German Genome Project
  - AK054581, 1678 bp, 262 aa, 01-AUG-2002, NEDO human project
  - BC036761, 2000 bp, 154 aa, 26-AUG-2002, NIH-MGC Project
  - ENSG00000166265, 154 aa, Sept 2002
- Mouse
30. Disappearing Novelty (II)
- EMBL hum cds 2003: 1491
- Plus novel: 159
- Plus PubMed 2003: 120
- Novel in title: 11
- Previous cds: 8
- Novel genes: 2
31. Current Numbers From Major Public Gene Sets
- Lower numbers are explicitly non-redundant and exclude pseudogenes
- Higher numbers have increasing splice variant content
32. So What Would Constitute Gene Closure?
- The human genome was closed on April 14th but the yeast gene number is still not closed after six years
- Comparative genomics will contribute to resolving the mammalian gene sets, e.g. three-way human/mouse/rat
- Closure-by-clone from VEGA
- Proteomic closure by confirming at least one protein splice form from all plausible genes (expression in vitro and detection in vivo?)
- Grey areas are likely to remain, e.g. transcribed pseudogenes producing truncated proteins, and apparently intact genes that may have undetectable impairments rendering them functionally superfluous and translationally silent
- Grey areas may not be numerically large
33. Conclusions
- The model eukaryotes have not shown post-genomic rises in gene number
- The Ensembl gene number has been essentially flat
- The pseudogene-adjusted Ensembl gene total on a largely complete GP is 22,000
- The five curated complete chromosomes extrapolate to 28,000 but leave many unclosed annotations
- The massive increase in post-genomic transcript coverage is extending exons but predominantly re-sampling known genes
- Database submissions of novel human genes have slowed to a trickle
- Initial mouse and rat gene sets have lower gene numbers than human
- No evidence for large numbers of cryptic smORFs
- Widespread occurrence of non-protein transcripts could explain previous high gene estimates from transcript skimming
- Gene number closure likely to be well below 30,000
34. Acknowledgments
- Paul Kersey for IPI figures
- Lucas Wagner of the NCBI for the retrospective UniGene data
- Arek Kasprzyk of the EBI for historical and preview Ensembl release statistics
- Numerous other people at NCBI, EBI, and Sanger Centre who have graciously answered queries on their data collections
- The OGS Proteome Discovery Team for useful discussions