The Basal Human Protein Number and Database Choices: Neglected Parameters in MS-based Proteomics


1
The Basal Human Protein Number and Database
Choices: Neglected Parameters in MS-based
Proteomics?

Christopher Southan
Principal Scientist, Target Development, In
vitro Biosciences, AstraZeneca R&D, Mölndal,
Sweden
Special Professor of Proteomics, School of
Biosciences, University of Nottingham

2
Outline

Human protein number
Available databases
The SwissProt Human Protein Initiative
The EBI International Protein Index
Splice search space
nsSNP search space
Non-canonical search space
Summary and conclusions

3
The Human Protein-Coding Gene Number:
Post-Genomic Total Consistently Below 25K

Ensembl 26.35.1 (Nov 2004): 22,221 proteins plus
1,974 pseudogenes
This represents a decrease of 1,825 from the first
release in 2001
Novel genes: 12,398 > 2,645 entries
Exons per gene: 6.5 < 10.1
Alternative splicing: 3,669 < 11,718
Southan, Proteomics, June 2004: reviewed estimate
of 25K to 30K proteins
HGSC, Nature, Oct 2004: finished genome count
of 20K to 25K proteins

4
The Consequences of a Low Protein-coding Gene
Number for MS-based Proteomics

Shifts the identification risk from false
negatives (missed novel proteins) towards false
positives (inflated hit-lists)
Encourages experimental progression beyond simple
gene stamping towards exon corrections, detecting
splice forms, PTMs, SAPs and quantitation
10% of proteins predicted from genomic data
still have no experimentally confirmed mRNA
These could be detected by MS-based proteomics so
long as the sequences are included in the search
space
The basal (unspliced) protein number is likely to
be similar for most mammals
The goal of being able to detect the majority of
proteins seems more feasible

5
From Minimal to Maximal Sequence Collections
Many databases to choose from
6
Release Statistics
7
Using SwissProt as a Minimal
Search Space

Pros
Gold Standard SwissProt annotation
Explicit redundancy reduction
Extensive cross referencing and SRS indexing
Explicit treatment of some verified (RESID) and
many potential PTMs
Can generate splice forms and variants with
VARSPLIC
Probably contains all proteins of moderate
abundance
No false positives
By submitting your own MS data you can promote
any TrEMBL entry or predicted protein to
SwissProt and have links to your publications
Cons
With only half the ORF-ome, there is a big risk of
false negatives
Slow updating
Cannot yet do all the transformations to produce
a complete tryptome
TrEMBL is redundant w.r.t. SwissProt
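
The "tryptome" mentioned above can be pictured as the set of peptides produced by digesting every database entry in silico. The following is a minimal sketch, not the procedure used by any particular database: the cleavage rule (after K or R unless followed by P) is a common simplification, and the protein identifiers and sequences are invented for illustration.

```python
from collections import defaultdict


def cleave(sequence):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1:i + 2]  # empty string at the C-terminus
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal peptide without K/R
    return fragments


def tryptic_peptides(sequence, missed_cleavages=0, min_length=1):
    """Return the peptide set, allowing up to `missed_cleavages` skipped sites."""
    fragments = cleave(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def build_tryptome(proteins, **digest_options):
    """Map each peptide to the set of protein identifiers that contain it."""
    tryptome = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, **digest_options):
            tryptome[peptide].add(protein_id)
    return dict(tryptome)


# Toy sequences, invented for illustration only
proteins = {"P1": "AAAKCCCRPDDDKEEE", "P2": "GGGRAAAK"}
tryptome = build_tryptome(proteins)
```

Peptides that map to more than one identifier (here `AAAK`) are the ones that cannot discriminate between entries, which is one reason redundancy in the search space matters.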

8
Release History Tracking the
International Protein Index Yo-Yo
9
Composition of the Nov 2004 Release
Reading Between the Numbers
(new or dodgy proteins)
(NCBI-only XP predictions)
(Ens-false negatives)
(Ens-only predictions)
(NCBI-false negatives)
(NCBI/Ens prediction consensus)
(3-way consensus)
10
Using the IPI as a Maximal Search Space

Pros
The three-way merge and redundancy reduction
between evidence-supported genome predictions and
experimental mRNA-derived ORFs is a good
approximation to maximal search space
Explicitly captures the majority of splicing in
SP, TrEMBL and Ensembl
Regular updates and detailed statistics
Good parsing and linking to source annotations
EBI team receptive to improving MS search
requirements e.g. virtual tryptome
Common schema and similar coverage for human,
mouse and rat
Cons
Includes potential false positives from
artifactual ORFs in UniProt and dubious XP gene
predictions
No explicit handling of nsSNPs or PTMs
Automated reciprocal redundancy reduction not
perfect
Updates not coordinated with source databases
Update frequency and churn rate may oblige you to
re-assign old data sets
No patent data
Probably still a small number of false-negatives
and missing exons
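
To make the idea of a three-way merge with redundancy reduction concrete, here is a minimal sketch that clusters entries by exact sequence identity and counts which sources support each cluster. This is an assumption for illustration only: the actual IPI build rules are not described in these slides and are certainly more elaborate than exact matching. All accessions and sequences below are invented.

```python
from collections import Counter, defaultdict


def merge_sources(sources):
    """Group entries from all sources by identical sequence."""
    clusters = defaultdict(list)
    for source_name, entries in sources.items():
        for accession, sequence in entries.items():
            clusters[sequence].append((source_name, accession))
    return dict(clusters)


def support_profile(clusters):
    """Count clusters by the combination of sources that support them."""
    profile = Counter()
    for members in clusters.values():
        supporting = sorted({source for source, _ in members})
        profile["+".join(supporting)] += 1
    return profile


# Hypothetical entries: one 3-way consensus, one two-way, two singletons
sources = {
    "UniProt": {"U1": "MAAKGGR", "U2": "MCCKDDR"},
    "Ensembl": {"E1": "MAAKGGR", "E2": "MTTKEER"},
    "NCBI": {"N1": "MAAKGGR", "N2": "MTTKEER", "N3": "MVVKLLR"},
}
clusters = merge_sources(sources)
profile = support_profile(clusters)
```

In this toy case seven source entries collapse to four non-redundant sequences, and the profile separates consensus clusters from single-source entries of the kind labelled "NCBI-only" on the composition slide.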

11
Selecting a Maximum Search Space for AA-changing
SNPs (nsSNPs or SAPs)

dbSNP records 50K human polymorphic AA positions
i.e. about 2 per protein
These will give rise to substituted tryptic
peptides, loss or gain of cleavage sites from
Arg/Lys <-> Xaa changes, and PTM shifts
Biological interest in verifying nsSNPs in
populations, tissues or disease states and
establishing whether the translated allele ratio
differs from the genotypic ratio
Challenges
Most nsSNPs will have low validation rate and
population frequency
Difficult to detect in pooled samples
Need large numbers of individual samples for
population sampling
Also need high tryptic coverage to have any
chance of detection
Would need to identify both alleles by MS/MS to
be convincing
Quantifying ratio would be difficult
Leu/Ile exchanges undetectable
Need virtual tryptome approach
Data processing effort may not translate into
results payoff
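
The effect of an nsSNP on the tryptic peptide set can be sketched as follows. This is a hedged illustration, not a validated tool: the cleavage rule (after K or R unless followed by P) is a simplification, the reference sequence is invented, and the helper names are hypothetical. It shows a gained cleavage site, a lost cleavage site, and why a Leu/Ile exchange yields an allele-specific peptide that mass alone cannot distinguish.

```python
def cleave(sequence):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR" and sequence[i + 1:i + 2] != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def substitute(sequence, position, new_residue):
    """Return the sequence with the residue at a 1-based position replaced."""
    return sequence[:position - 1] + new_residue + sequence[position:]


def allele_specific_peptides(reference, position, new_residue):
    """Return (reference-only peptides, variant-only peptides)."""
    variant = substitute(reference, position, new_residue)
    ref_peptides = set(cleave(reference))
    var_peptides = set(cleave(variant))
    return ref_peptides - var_peptides, var_peptides - ref_peptides


def differs_beyond_leu_ile(peptide_a, peptide_b):
    """False when two peptides differ only by isobaric Leu/Ile exchanges."""
    return peptide_a.replace("I", "L") != peptide_b.replace("I", "L")


reference = "AAAKCCLDDKEEER"  # invented sequence

# Leu7 -> Arg creates a new cleavage site: one peptide becomes two
gain_ref, gain_var = allele_specific_peptides(reference, 7, "R")

# Lys4 -> Thr removes a cleavage site: two peptides merge into one
loss_ref, loss_var = allele_specific_peptides(reference, 4, "T")

# Leu7 -> Ile changes the peptide sequence but not its mass
ile_ref, ile_var = allele_specific_peptides(reference, 7, "I")
```

Both alleles would have to be observed (a peptide from each side of the returned pair) for an identification to be convincing, which is why high tryptic coverage is listed as a prerequisite.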

12
Identifying Alternative Splice Forms by the
Detection of Splice-Specific Tryptic Peptides
(SSTPs)

Increasing interest in the tissue specificity and
biological consequences of alternative splicing
Estimated upper boundary of splicing increasing
towards 60% of all proteins
In vivo ratios of normal:spliced proteins may be
very different to mRNA ratios
Few other techniques can verify protein splice
forms
Challenges
By convention SwissProt chooses the longest form
as normal. Therefore alternative splicing is
characterized predominantly by peptide loss.
In the Sept 2003 HPI the proportion of SSTPs
from annotated splice variants was only 5.4% of
all tryptics with a loss:gain ratio of 2.4:1
Detection of SSTPs needs high tryptic coverage,
breadth and depth of tissue sampling
Complex choices for splice space ranging from
SwissProt, IPI, AltSplicedb, TIGR assemblies, raw
EST data and/or sub-threshold gene predictions
Only 70% of mRNA splice forms would potentially
translate and not all would be folded or
trafficked as functional proteins
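
Splice-specific tryptic peptides can be illustrated by digesting a "normal" (longest) form and an exon-skipping form and comparing the two peptide sets. The sketch below rests on assumptions: the exon sequences are invented, the cleavage rule is the simplified K/R-not-before-P rule, and the function names are hypothetical.

```python
def cleave(sequence):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR" and sequence[i + 1:i + 2] != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def splice_specific_peptides(normal, spliced):
    """Return (peptides lost in the spliced form, peptides gained by it)."""
    normal_peptides = set(cleave(normal))
    spliced_peptides = set(cleave(spliced))
    lost = normal_peptides - spliced_peptides
    gained = spliced_peptides - normal_peptides
    return lost, gained


# Invented exons; the spliced form skips exon 2
exon1, exon2, exon3 = "MAAKGG", "SLLKTT", "EERVVK"
normal = exon1 + exon2 + exon3
spliced = exon1 + exon3

lost, gained = splice_specific_peptides(normal, spliced)
```

In this toy case two peptides are lost and a single junction-spanning peptide is gained, which mirrors the point that when the longest form is taken as normal, splicing shows up mostly as peptide loss.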

13
Non-canonical Sequence Space: Pushing the Outer
Limits

A few (10 to 100?) small mammalian proteins may
neither have been predicted by Ens/NCBI nor have
had cDNAs translated in UniProt
Rare splice forms not represented in any
transcript data
Inteins
Annotated pseudogenes translated to truncated
proteins, or AA > stop SNPs
nsSNPs not in dbSNP
Micro-indel AA changes in dbSNP
Somatic rearrangements and mutations in cancers
e.g. gi2738915 from HeLa cells
Unidentified selenocysteine-containing ORFs
Cellular protein garbage e.g. 3' run-through
Mixed species assignments e.g. bacteria in human
sputum

These would need de-novo sequencing,
sub-threshold gene predictions, virtual tryptomes
and other advanced approaches
14
Finding the Balance

Maximal Sequence Space
Pros
Maximum use of MS data
High sensitivity
Exploits the unique advantages of MS proteomics
to correct predicted proteins, detect in vivo
splice forms, nsSNPs, PTMs etc.
Experimental verification may yield new
biological insights
Cons
Maximal gene/splice/nsSNP hooking space needs
virtual tryptome approach
Need to calibrate false positive rate
Thin to zero annotation on unknown proteins
Limited parsability
Unpublishable data sets
Post-processing data merging and mining essential
to filter potential biologically interesting
matches for cross-check and follow-up

Minimal Sequence Space
Pros
Robust and reproducible results
Well annotated database entries
Clear sequence names
Good parsing and link-outs e.g. InterPro, GO etc.
Submittable to public databases
Cons
Unknown false negative rate
Low sensitivity
Small hit-lists of usual suspects
Majority of MS data unassigned
Limited new biological insights
Microarray/PCR/antibody array could produce more
comprehensive and quantitative data for less
resource

15
Conclusions

The availability of largely complete basal ORF
sets for humans and increasing numbers of other
mammals will be a boon for proteomics
For proteomics experiments it is important to
choose and understand exactly what is, or is not,
represented in the sequence database you use for
assignments and how this determines not only your
specificity calibrations but also what you can
extract from your results
Although there is no correct solution to the
automated stringency compromise it can be
adjusted to what your customers or collaborators
are prepared to exploit
Collaborating with your local bioinformaticians
can expedite database choice, identification
strategies, formatting and data mining
There is arguably more potential biological
pay-off by using lower stringencies and carefully
selected but maximised sequence space to push the
boundaries beyond the identification of basal
ORFs
Submitting your results to improve the quality of
public protein data is a virtuous cycle

16
References and Acknowledgments

Southan C. (2004) Has the yo-yo stopped? A human
gene number update. Proteomics (6):1712-26
(review). PMID 15174140
McGowan S. et al. (84 authors) (2004) Annotation
of the Human Genome by High-Throughput Sequence
Analysis of Naturally Occurring Proteins.
Current Proteomics 1:41-48
Furnham N., Ruffle S., Southan C. (2004) Splice
variants: a homology modelling approach.
Proteins: Structure, Function, Genetics
54(3):596-608. PMID 14748006

17
Finding the Balance

Minimising false positives
Filter to high-quality spectra
Multiple independent observations
Match to only well curated cDNA reference protein
sets
Calibrate a high, comfort-zone scoring threshold
Use convergence between multiple algorithms
Accept double-hit proteins only
Use orthogonal data

Minimising false negatives
Use all data
Include maximal predicted ORF space and patent
sequences
Include maximal EST splicing and nsSNP coverage
Include non-canonical sequence space and PTM
predictions
Use database triage
Use twilight-zone scoring thresholds
Accept single peptide observations
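
The two ends of this balance can be expressed as two settings of the same acceptance rule. The sketch below is illustrative only: the score scale, the thresholds and the peptide-to-protein matches are invented, and `accepted_proteins` is a hypothetical helper rather than part of any search engine.

```python
from collections import defaultdict


def accepted_proteins(matches, score_threshold, min_peptides):
    """Accept proteins with enough distinct peptides scoring at or above threshold."""
    peptides_by_protein = defaultdict(set)
    for protein_id, peptide, score in matches:
        if score >= score_threshold:
            peptides_by_protein[protein_id].add(peptide)
    return {
        protein_id
        for protein_id, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    }


# Invented (protein, peptide, score) matches
matches = [
    ("P1", "AAAK", 62.0),
    ("P1", "CCCRPDDDK", 48.0),
    ("P1", "AAAK", 55.0),  # repeat observation of the same peptide
    ("P2", "GGGR", 71.0),
    ("P3", "TTEER", 31.0),
    ("P3", "GGSLLK", 28.0),
]

# Comfort-zone threshold, double-hit proteins only
stringent = accepted_proteins(matches, score_threshold=45.0, min_peptides=2)

# Twilight-zone threshold, single peptide observations accepted
permissive = accepted_proteins(matches, score_threshold=25.0, min_peptides=1)
```

Counting distinct peptides rather than spectra keeps a repeatedly observed single peptide from passing the double-hit rule.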

18
Setting Stringencies for Automated Protein
Identification by MS/MS

Experimental replication to convergence
Composite scoring schemes calibrated for
specificity by control experiments, checks
against randomised databases and selected manual
confirmation
Use of orthogonal filters e.g. protein Mw,
protein pI, peptide pI or hydrophobicity
Use of MS data libraries for seen-before spectral
matching
Application of pragmatic rules e.g. assignments
only from double peptide hits

These are all dependent on the composition of the
sequence databases used to hook out the MS matches
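
One way to picture the check against randomised databases is to search a shuffled copy of each sequence alongside the real one and use the decoy hits to estimate the false positive rate at a given score threshold. This is a minimal sketch under stated assumptions: the scores are invented, the simple decoys/targets ratio is only one of several estimators in use, and the function names are hypothetical.

```python
import random


def shuffled_decoy(sequence, seed=0):
    """Return a randomised sequence with the same amino acid composition."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def estimate_fdr(hits, threshold):
    """Estimate the false discovery rate as decoy hits / target hits."""
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_for_fdr(hits, max_fdr):
    """Return the lowest observed score whose estimated FDR is acceptable."""
    for threshold in sorted({score for score, _ in hits}):
        if estimate_fdr(hits, threshold) <= max_fdr:
            return threshold
    return None


# Invented (score, is_decoy) pairs from a combined target + decoy search
hits = [
    (50, False), (45, False), (40, True), (35, False),
    (30, False), (25, True), (20, False),
]

decoy = shuffled_decoy("AAAKCCCRPDDDKEEE")
threshold = lowest_threshold_for_fdr(hits, max_fdr=0.25)
```

Because the decoys are derived from whatever database is searched, the calibrated threshold changes whenever the search space changes, which is the dependence noted above.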
