Title: The Basal Human Protein Number and Database Choices: Neglected Parameters in MS-based Proteomics
1 The Basal Human Protein Number and Database Choices: Neglected Parameters in MS-based Proteomics?
- Christopher Southan
- Principal Scientist, Target Development, In vitro Biosciences, AstraZeneca R&D, Mölndal, Sweden
- Special Professor of Proteomics, School of Biosciences, University of Nottingham
2 Outline
- Human protein number
- Available databases
- The SwissProt Human Protein Initiative
- The EBI International Protein Index
- Splice search space
- nsSNP search space
- Non-canonical search space
- Summary and conclusions
3 The Human Protein-Coding Gene Number: Post-Genomic Total Consistently Below 25K
- Ensembl 26.35.1 (4-Nov): 22,221 proteins plus 1,974 pseudogenes
- This represents a decrease of 1,825 from the first release in 2001
- Novel genes: down from 12,398 to 2,645 entries
- Exons per gene: up from 6.5 to 10.1
- Alternative splicing: up from 3,669 to 11,718
- Southan, Proteomics, June 2004: reviewed estimate of 25K to 30K proteins
- HGSC, Nature, Oct 2004: finished genome count of 20K to 25K proteins
4 The Consequences of a Low Protein-coding Gene Number for MS-based Proteomics
- Shifts the identification risk from false negatives (missed novel proteins) towards false positives (inflated hit-lists)
- Encourages experimental progression beyond simple gene stamping towards exon corrections, detecting splice forms, PTMs, SAPs and quantitation
- 10% of proteins predicted from genomic data still have no experimentally confirmed mRNA
- These could be detected by MS-based proteomics so long as the sequences are included in the search space
- The basal (unspliced) protein number is likely to be similar for most mammals
- The goal of being able to detect the majority of proteins seems more feasible
5 From Minimal to Maximal Sequence Collections
Many databases to choose from
6 Release Statistics
7 Using as a Minimal Search Space
- Pros
- Gold Standard SwissProt annotation
- Explicit redundancy reduction
- Extensive cross referencing and SRS indexing
- Explicit treatment of some verified (RESID) and many potential PTMs
- Can generate splice forms and variants with VARSPLIC
- Probably contains all proteins of moderate abundance
- No false positives
- By submitting your own MS data you can promote any TrEMBL entry or predicted protein to SwissProt and have links to your publications
- Cons
- With only half the ORF-ome, presents a big risk of false negatives
- Slow updating
- Cannot yet do all the transformations to produce a complete tryptome
- TrEMBL is redundant w.r.t. SwissProt
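The "complete tryptome" mentioned above can be pictured as an index of every peptide that an in silico tryptic digest of the database would produce. The following is a minimal sketch under stated assumptions, not the method used by any particular search engine: it assumes the common rule of cleaving C-terminal to Lys or Arg unless the next residue is Pro, and the function names, the toy sequences and the `min_length` cut-off are all hypothetical.

```python
from collections import defaultdict


def tryptic_fragments(sequence):
    """Split a protein into fully cleaved tryptic fragments.

    Assumed rule: cleave after K or R, but not when the next residue is P.
    """
    fragments, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments


def digest(sequence, missed_cleavages=0, min_length=1):
    """Return the set of peptides allowing up to `missed_cleavages` skipped sites."""
    fragments = tryptic_fragments(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def build_tryptome(proteins, missed_cleavages=0, min_length=6):
    """Index every peptide by the ids of the proteins that can produce it."""
    tryptome = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in digest(sequence, missed_cleavages, min_length):
            tryptome[peptide].add(protein_id)
    return dict(tryptome)


# Toy sequences (hypothetical, for illustration only).
toy = {"protA": "AAKPGGRCCKDD", "protB": "AAKPGGRWWK"}
toy_tryptome = build_tryptome(toy, min_length=3)
```

Peptides that map to more than one protein id (here `AAKPGGR`) cannot discriminate between entries, which is one reason the redundancy of the chosen database matters.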
8 Release History: Tracking the International Protein Index Yo-Yo
9 Composition of the Nov 2004 Release: Reading Between the Numbers
(new or dodgy proteins)
(NCBI-only XP predictions)
(Ens-false negatives)
(Ens-only predictions)
(NCBI-false negatives)
(NCBI/Ens prediction consensus)
(3-way consensus)
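The categories above read like the regions of a three-set Venn diagram. As a rough illustration of how such a breakdown could be tallied, the sketch below groups entries from three sources by exact sequence identity and counts each combination of supporting sources. This is an assumption made for illustration: matching on exact sequence is a simplification, the actual clustering rules behind the release are not reproduced here, and the source names and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def source_membership(sources):
    """Map each distinct sequence to the set of source names that contain it."""
    membership = defaultdict(set)
    for name, entries in sources.items():
        for sequence in entries.values():
            membership[sequence].add(name)
    return membership


def composition(sources):
    """Count distinct sequences for every combination of supporting sources."""
    counts = Counter()
    for names in source_membership(sources).values():
        counts["+".join(sorted(names))] += 1
    return counts


# Toy input: {source name: {entry id: sequence}} (hypothetical data).
toy_sources = {
    "UniProt": {"U1": "MAAAK", "U2": "MCCCK"},
    "Ensembl": {"E1": "MAAAK", "E2": "MDDDK"},
    "NCBI": {"N1": "MAAAK", "N2": "MDDDK", "N3": "MEEEK"},
}
toy_counts = composition(toy_sources)
```

In this toy case one sequence has three-way support, one is shared by the two prediction sources only, and one is unique to each of UniProt and NCBI.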
10 Using the IPI as a Maximal Search Space
- Pros
- The three-way merge and redundancy reduction between evidence-supported genome predictions and experimental mRNA-derived ORFs is a good approximation to maximal search space
- Explicitly captures the majority of splicing in SP, TrEMBL and Ensembl
- Regular updates and detailed statistics
- Good parsing and linking to source annotations
- EBI team receptive to improving MS search requirements e.g. virtual tryptome
- Common schema and similar coverage for human, mouse and rat
- Cons
- Includes potential false positives from artifactual ORFs in UniProt and dubious XP gene predictions
- No explicit handling of nsSNPs or PTMs
- Automated reciprocal redundancy reduction not perfect
- Updates not coordinated with source databases
- Update frequency and churn rate may oblige you to re-assign old data sets
- No patent data
- Probably still a small number of false negatives and missing exons
11 Selecting a Maximum Search Space for AA-changing SNPs (nsSNPs or SAPs)
- dbSNP records 50K human polymorphic AA positions i.e. 2 per protein
- These will give rise to substituted tryptic peptides, loss or gain of cleavage sites from Arg/Lys <-> Xaa changes, and PTM shifts
- Biological interest for verifying nsSNPs in populations, tissues or disease states and establishing that the translated allele ratio may be different to the genotypic ratio
- Challenges
- Most nsSNPs will have low validation rate and population frequency
- Difficult to detect in pooled samples
- Need large numbers of individual samples for population sampling
- Also need high tryptic coverage to have any chance of detection
- Would need to identify both alleles by MS/MS to be convincing
- Quantifying ratio would be difficult
- Leu/Ile exchanges undetectable
- Need virtual tryptome approach
- Data processing effort may not translate into results payoff
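To make the effect of a single amino acid substitution on the tryptic peptide set concrete, here is a small sketch under stated assumptions. It digests the reference and variant sequences with the common "after K/R, not before P" rule (an assumption, with no missed cleavages) and reports the peptides specific to each allele, which is what would need to be observed to identify both alleles. The function names and the toy sequence are hypothetical.

```python
def tryptic_peptides(sequence):
    """Fully cleaved tryptic peptides (assumed rule: after K/R, not before P)."""
    peptides, start = set(), 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and sequence[i + 1:i + 2] != "P":
            peptides.add(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.add(sequence[start:])
    return peptides


def substitute(sequence, position, ref, alt):
    """Return the variant sequence; `position` is 1-based."""
    if sequence[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return sequence[:position - 1] + alt + sequence[position:]


def allele_specific_peptides(sequence, position, ref, alt):
    """Return (reference-only, variant-only) tryptic peptide sets."""
    ref_peptides = tryptic_peptides(sequence)
    alt_peptides = tryptic_peptides(substitute(sequence, position, ref, alt))
    return ref_peptides - alt_peptides, alt_peptides - ref_peptides


def mass_indistinguishable(ref, alt):
    """Leu and Ile have identical mass, so the exchange is invisible by mass alone."""
    return {ref, alt} == {"L", "I"}


toy_protein = "AAGKLLEDRTTVK"  # hypothetical sequence

# Simple substitution within one peptide (Asp8 -> Asn).
plain = allele_specific_peptides(toy_protein, 8, "D", "N")
# Loss of a cleavage site (Arg9 -> Gln) merges two peptides.
site_loss = allele_specific_peptides(toy_protein, 9, "R", "Q")
# Gain of a cleavage site (Glu7 -> Lys) splits one peptide.
site_gain = allele_specific_peptides(toy_protein, 7, "E", "K")
```

Only the allele-specific peptides are informative, so a search space for nsSNPs needs to hold just these extra peptides rather than full-length variant proteins.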
12 Identifying Alternative Splice Forms by the Detection of Splice-Specific Tryptic Peptides (SSTPs)
- Increasing interest in the tissue specificity and biological consequences of alternative splicing
- Estimated upper boundary of splicing increasing towards 60% of all proteins
- In vivo ratios of normal:spliced proteins may be very different to mRNA ratios
- Few other techniques can verify protein splice forms
- Challenges
- By convention SwissProt chooses the longest form as normal. Therefore alternative splicing is characterized predominantly by peptide loss.
- In the Sept 2003 HPI the proportion of SSTPs from annotated splice variants was only 5.4% of all tryptics with a loss:gain ratio of 2.41
- Detection of SSTPs needs high tryptic coverage, breadth and depth of tissue sampling
- Complex choices for splice space ranging from SwissProt, IPI, AltSplicedb, TIGR assemblies, raw EST data and/or sub-threshold gene predictions
- Only 70% of mRNA splice forms would potentially translate and not all would be folded or trafficked as functional proteins
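The idea of peptide loss and gain between a canonical form and a splice variant can be sketched as a set comparison of their tryptic peptides. This is an illustrative sketch only: it assumes full cleavage after K/R (not before P), ignores missed cleavages and whether a peptide also occurs in other proteins, and uses hypothetical function names and toy sequences.

```python
def tryptic_peptides(sequence):
    """Fully cleaved tryptic peptides (assumed rule: after K/R, not before P)."""
    peptides, start = set(), 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and sequence[i + 1:i + 2] != "P":
            peptides.add(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.add(sequence[start:])
    return peptides


def splice_specific_peptides(canonical, variant):
    """Peptides lost from, and gained relative to, the canonical form."""
    canonical_set = tryptic_peptides(canonical)
    variant_set = tryptic_peptides(variant)
    return {"lost": canonical_set - variant_set,
            "gained": variant_set - canonical_set}


# Toy example: the variant skips the internal segment "EDRTTV" (hypothetical).
canonical_form = "MAAGKLLEDRTTVKSSPWR"
variant_form = "MAAGKLLKSSPWR"

sstps = splice_specific_peptides(canonical_form, variant_form)
loss_gain_ratio = len(sstps["lost"]) / len(sstps["gained"])
```

In this toy case the skipped segment removes two peptides and creates one new junction peptide. Only the gained peptide provides positive evidence for the variant, whereas lost peptides are merely absent.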
13 Non-canonical Sequence Space: Pushing the Outer Limits
- A few (10 to 100?) small mammalian proteins may have neither been predicted by Ens/NCBI nor have cDNAs been translated in UniProt
- Rare splice forms not represented in any transcript data
- Inteins
- Annotated pseudogenes translated to truncated proteins, or AA > stop SNPs
- nsSNPs not in dbSNP
- Micro-indel AA changes in dbSNP
- Somatic rearrangements and mutations in cancers e.g. gi2738915 from HeLa cells
- Unidentified selenocysteine-containing ORFs
- Cellular protein garbage e.g. 3' run-through
- Mixed species assignments e.g. bacteria in human sputum
These would need de-novo sequencing, sub-threshold gene predictions, virtual tryptomes and other advanced approaches
14 Finding the Balance
Maximal Sequence Space
- Pros
- Maximum use of MS data
- High sensitivity
- Exploits the unique advantages of MS proteomics to correct predicted proteins, detect in vivo splice forms, nsSNPs, PTMs etc.
- Experimental verification may yield new biological insights
- Cons
- Maximal gene/splice/nsSNP hooking space needs virtual tryptome approach
- Need to calibrate false positive rate
- Thin to zero annotation on unknown proteins
- Limited parsability
- Unpublishable data sets
- Post-processing data merging and mining essential to filter potentially biologically interesting matches for cross-check and follow-up
Minimal Sequence Space
- Pros
- Robust and reproducible results
- Well annotated database entries
- Clear sequence names
- Good parsing and link-outs e.g. InterPro, GO etc.
- Submittable to public databases
- Cons
- Unknown false negative rate
- Low sensitivity
- Small hit-lists of usual suspects
- Majority of MS data unassigned
- Limited new biological insights
- Microarray/PCR/antibody array could produce more comprehensive and quantitative data for less resource
15 Conclusions
- The availability of largely complete basal ORF sets for humans and increasing numbers of other mammals will be a boon for proteomics
- For proteomics experiments it is important to choose and understand exactly what is, or is not, represented in the sequence database you use for assignments and how this determines not only your specificity calibrations but also what you can extract from your results
- Although there is no correct solution to the automated stringency compromise, it can be adjusted to what your customers or collaborators are prepared to exploit
- Collaborating with your local bioinformaticians can expedite database choice, identification strategies, formatting and data mining
- There is arguably more potential biological pay-off from using lower stringencies and a carefully selected but maximised sequence space to push the boundaries beyond the identification of basal ORFs
- Submitting your results to improve the quality of public protein data is a virtuous cycle
16 References and Acknowledgments
- Southan C. (2004) Has the yo-yo stopped? A human gene number update. Proteomics (6):1712-26 (review). PMID 15174140
- McGowan S. et al. (84 authors) (2004) Annotation of the Human Genome by High-Throughput Sequence Analysis of Naturally Occurring Proteins. Current Proteomics 1:41-48
- Furnham N., Ruffle S., Southan C. (2004) Splice variants: a homology modelling approach. Proteins: Structure, Function, Genetics 54(3):596-608. PMID 14748006
17 Finding the Balance
Minimising false positives
- Filter to high-quality spectra
- Multiple independent observations
- Match to only well curated cDNA reference protein sets
- Calibrate a high, comfort-zone scoring threshold
- Use convergence between multiple algorithms
- Accept double-hit proteins only
- Use orthogonal data
Minimising false negatives
- Use all data
- Include maximal predicted ORF space and patent sequences
- Include maximal EST splicing and nsSNP coverage
- Include non-canonical sequence space and PTM predictions
- Use database triage
- Use twilight-zone scoring thresholds
- Accept single peptide observations
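The contrast between accepting double-hit proteins only and accepting single peptide observations can be expressed as one filter with two settings. The sketch below is illustrative only: the `(protein_id, peptide, score)` tuple layout, the score scale, the thresholds and the protein ids are all hypothetical rather than taken from any search engine's output format.

```python
from collections import defaultdict


def proteins_passing(peptide_hits, min_peptides=2, min_score=0.0):
    """Return protein ids supported by enough distinct peptides.

    peptide_hits: iterable of (protein_id, peptide, score) tuples.
    Repeat observations of the same peptide count once.
    """
    supporting = defaultdict(set)
    for protein_id, peptide, score in peptide_hits:
        if score >= min_score:
            supporting[protein_id].add(peptide)
    return {pid for pid, peptides in supporting.items()
            if len(peptides) >= min_peptides}


# Toy hit list (hypothetical ids, peptides and scores).
toy_hits = [
    ("protA", "LLEDR", 45.0),
    ("protA", "TTVK", 38.0),
    ("protB", "SSPWR", 52.0),
    ("protB", "SSPWR", 49.0),  # repeat observation of the same peptide
    ("protC", "AAGK", 21.0),
    ("protC", "WWK", 19.0),
]

# Stringent setting: double hits above a high threshold.
stringent = proteins_passing(toy_hits, min_peptides=2, min_score=30.0)
# Permissive setting: single peptides above a lower threshold.
permissive = proteins_passing(toy_hits, min_peptides=1, min_score=15.0)
```

Here the stringent setting keeps one protein and the permissive setting keeps all three, which illustrates how the same data yield very different hit-lists depending on where the balance is set.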
18 Setting Stringencies for Automated Protein Identification by MS/MS
- Experimental replication to convergence
- Composite scoring schemes calibrated for specificity by control experiments, checks against randomised databases and selected manual confirmation
- Use of orthogonal filters e.g. protein Mw, protein pI, peptide pI or hydrophobicity
- Use of MS data libraries for seen-before spectral matching
- Application of pragmatic rules e.g. assignments only from double peptide hits
These are all dependent on the composition of the sequence databases used to hook out the MS matches
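One way to picture the check against randomised databases is sketched below. It shuffles each sequence to build a randomised database of the same size and composition, then estimates the false positive rate at a score threshold as the number of randomised-database matches divided by the number of real-database matches. This is a sketch under stated assumptions, not a prescribed procedure: per-protein shuffling is only one of several ways to randomise, the estimate assumes both databases are searched separately and are equal in size, and the function names, the `RAND_` prefix and the scores are hypothetical.

```python
import random


def randomised_database(proteins, seed=0):
    """Shuffle each sequence, preserving its length and amino acid composition."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    randomised = {}
    for protein_id, sequence in proteins.items():
        residues = list(sequence)
        rng.shuffle(residues)
        randomised["RAND_" + protein_id] = "".join(residues)
    return randomised


def estimated_false_positive_rate(real_scores, random_scores, threshold):
    """Matches to the randomised database approximate false matches to the real one."""
    real_hits = sum(score >= threshold for score in real_scores)
    random_hits = sum(score >= threshold for score in random_scores)
    return random_hits / real_hits if real_hits else 0.0


# Toy data (hypothetical sequences and match scores).
toy_db = {"protA": "MAAGKLLEDRTTVKSSPWR"}
toy_random_db = randomised_database(toy_db)

real_scores = [50, 42, 38, 30, 22, 15]
random_scores = [24, 18, 12, 9]

rate_at_20 = estimated_false_positive_rate(real_scores, random_scores, 20)
rate_at_35 = estimated_false_positive_rate(real_scores, random_scores, 35)
```

Raising the threshold from 20 to 35 removes the randomised match but also discards two real matches, which is the stringency compromise in miniature. Because the rate depends on what is in the real database, it needs re-estimating whenever the search space changes.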