The Basal Human Protein Number and Database Choices: Neglected Parameters in MS-based Proteomics


1
The Basal Human Protein Number and Database
Choices: Neglected Parameters in MS-based
Proteomics?
  • Christopher Southan
  • Principal Scientist, Target Development, In
    vitro Biosciences, AstraZeneca R&D, Mölndal,
    Sweden
  • Special Professor of Proteomics, School of
    Biosciences, University of Nottingham

2
Outline
  • Human protein number
  • Available databases
  • The SwissProt Human Protein Initiative
  • The EBI International Protein Index
  • Splice search space
  • nsSNP search space
  • Non-canonical search space
  • Summary and conclusions

3
The Human Protein-Coding Gene Number:
Post-Genomic Total Consistently Below 25K
  • Ensembl 26.35.1 (Nov 2004): 22,221 proteins plus
    1,974 pseudogenes
  • This represents a decrease of 1,825 from the first
    release in 2001
  • Novel genes: 12,398 > 2,645 entries
  • Exons per gene: 6.5 < 10.1
  • Alternative splicing: 3,669 < 11,718
  • Southan, Proteomics, June 2004: reviewed estimate
    of 25K to 30K proteins
  • HGSC, Nature, Oct 2004: finished genome count
    of 20K to 25K proteins

4
The Consequences of a Low Protein-coding Gene
Number for MS-based Proteomics
  • Shifts the identification risk from false
    negatives (missed novel proteins) towards false
    positives (inflated hit-lists)
  • Encourages experimental progression beyond simple
    gene stamping towards exon corrections, detecting
    splice forms, PTMs, SAPs and quantitation
  • 10% of proteins predicted from genomic data
    still have no experimentally confirmed mRNA
  • These could be detected by MS-based proteomics so
    long as the sequences are included in the search
    space
  • The basal (unspliced) protein number is likely to
    be similar for most mammals
  • The goal of being able to detect the majority of
    proteins seems more feasible

5
From Minimal to Maximal Sequence Collections:
Many Databases to Choose From

6
Release Statistics

7
Using SwissProt as a Minimal
Search Space
  • Pros
  • Gold Standard SwissProt annotation
  • Explicit redundancy reduction
  • Extensive cross referencing and SRS indexing
  • Explicit treatment of some verified (RESID) and
    many potential PTMs
  • Can generate splice forms and variants with
    VARSPLIC
  • Probably contains all proteins of moderate
    abundance
  • No false positives
  • By submitting your own MS data you can promote
    any TrEMBL entry or predicted protein to
    SwissProt and have links to your publications
  • Cons
  • Covering only half the ORF-ome, it presents a big
    risk of false negatives
  • Slow updating
  • Cannot yet do all the transformations to produce
    a complete tryptome
  • TrEMBL is redundant w.r.t. SwissProt

8
Release History: Tracking the
International Protein Index Yo-Yo

9
Composition of the Nov 2004 Release:
Reading Between the Numbers
(new or dodgy proteins)
(NCBI-only XP predictions)
(Ens-false negatives)
(Ens-only predictions)
(NCBI-false negatives)
(NCBI/Ens prediction consensus)
(3-way consensus)

10
Using the IPI as a Maximal Search Space
  • Pros
  • The three-way merge and redundancy reduction
    between evidence-supported genome predictions and
    experimental mRNA-derived ORFs is a good
    approximation to a maximal search space
  • Explicitly captures the majority of splicing in
    SP, TrEMBL and Ensembl
  • Regular updates and detailed statistics
  • Good parsing and linking to source annotations
  • EBI team receptive to improving MS search
    requirements e.g. virtual tryptome
  • Common schema and similar coverage for human,
    mouse and rat
  • Cons
  • Includes potential false positives from
    artifactual ORFs in UniProt and dubious XP gene
    predictions
  • No explicit handling of nsSNPs or PTMs
  • Automated reciprocal redundancy reduction is not
    perfect
  • Updates not coordinated with source databases
  • Update frequency and churn rate may oblige you to
    re-assign old data sets
  • No patent data
  • Probably still a small number of false negatives
    and missing exons
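
The idea of a three-way merge with redundancy reduction can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: entries are collapsed when their sequences are exactly identical, the source names and sequences are invented, and the function names are hypothetical. It is not the actual IPI procedure, which the slides do not describe in detail.

```python
from collections import defaultdict


def merge_sources(sources):
    """Collapse identical sequences across sources.

    sources: dict mapping source name -> dict of entry id -> sequence.
    Returns a dict mapping each distinct sequence to the sorted list of
    sources that contain it (exact-identity matching is an assumption).
    """
    seen = defaultdict(set)
    for source, entries in sources.items():
        for seq in entries.values():
            seen[seq].add(source)
    return {seq: sorted(names) for seq, names in seen.items()}


def composition(merged):
    """Count merged entries per combination of supporting sources."""
    counts = defaultdict(int)
    for names in merged.values():
        counts["+".join(names)] += 1
    return dict(counts)


# Invented toy data: three sources with partly overlapping entries.
sources = {
    "UniProt": {"u1": "MASK", "u2": "MTTR"},
    "Ensembl": {"e1": "MASK", "e2": "MLLK"},
    "NCBI": {"n1": "MASK", "n2": "MLLK", "n3": "MQQR"},
}

merged = merge_sources(sources)
print(len(merged), "non-redundant sequences from 7 entries")
print(composition(merged))
```

The per-combination counts correspond to the kind of categories listed on the composition slide (source-only entries, two-way consensus, 3-way consensus).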

11
Selecting a Maximum Search Space for AA-changing
SNPs (nsSNPs or SAPs)
  • dbSNP records 50K human polymorphic AA positions,
    i.e. about 2 per protein
  • These will give rise to substituted tryptic
    peptides, loss or gain of peptides from
    Arg/Lys ↔ Xaa changes, and PTM shifts
  • Biological interest in verifying nsSNPs in
    populations, tissues or disease states and
    establishing that the translated allele ratio may
    differ from the genotypic ratio
  • Challenges
  • Most nsSNPs will have low validation rate and
    population frequency
  • Difficult to detect in pooled samples
  • Need large numbers of individual samples for
    population sampling
  • Also need high tryptic coverage to have any
    chance of detection
  • Would need to identify both alleles by MS/MS to
    be convincing
  • Quantifying ratio would be difficult
  • Leu/Ile exchanges undetectable
  • Need virtual tryptome approach
  • Data processing effort may not translate into
    results payoff
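
As a rough illustration of the points above, the following Python sketch digests a reference sequence and an nsSNP variant in silico and lists the variant-specific tryptic peptides. The sequence is a made-up toy example, the cleavage rule (after Lys/Arg, but not before Pro) is a common simplification, and all function names are hypothetical. Leu and Ile are treated as indistinguishable because they have the same mass.

```python
def tryptic_digest(seq):
    """Split a protein sequence after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def apply_substitution(seq, position, new_aa):
    """Return seq with the residue at a 1-based position replaced."""
    return seq[:position - 1] + new_aa + seq[position:]


def variant_specific_peptides(ref, var):
    """Peptides of the variant that the reference cannot explain.

    Ile is mapped to Leu first, so Leu/Ile exchanges yield nothing.
    """
    def norm(peptide):
        return peptide.replace("I", "L")

    ref_set = {norm(p) for p in tryptic_digest(ref)}
    return [p for p in tryptic_digest(var) if norm(p) not in ref_set]


ref = "MASKLLEGRTDVK"  # invented toy sequence

# Arg -> Gln at position 9 removes a cleavage site (peptides merge).
print(variant_specific_peptides(ref, apply_substitution(ref, 9, "Q")))
# Glu -> Lys at position 7 creates a cleavage site (peptide splits).
print(variant_specific_peptides(ref, apply_substitution(ref, 7, "K")))
# Leu -> Ile at position 5 gives no detectable difference.
print(variant_specific_peptides(ref, apply_substitution(ref, 5, "I")))
```

Running the same comparison over every recorded substitution in a protein set is one way to picture what a virtual tryptome for nsSNPs would contain.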

12
Identifying Alternative Splice Forms by the
Detection of Splice-Specific Tryptic Peptides
(SSTPs)
  • Increasing interest in the tissue specificity and
    biological consequences of alternative splicing
  • Estimated upper boundary of splicing increasing
    towards 60% of all proteins
  • In vivo ratios of normal:spliced proteins may be
    very different to mRNA ratios
  • Few other techniques can verify protein splice
    forms
  • Challenges
  • By convention SwissProt chooses the longest form
    as normal. Therefore alternative splicing is
    characterized predominantly by peptide loss.
  • In the Sept 2003 HPI the proportion of SSTPs
    from annotated splice variants was only 5.4% of
    all tryptics with a loss:gain ratio of 2.4:1
  • Detection of SSTPs needs high tryptic coverage,
    breadth and depth of tissue sampling
  • Complex choices for splice space ranging from
    SwissProt, IPI, AltSplicedb, TIGR assemblies, raw
    EST data and/or sub-threshold gene predictions
  • Only 70% of mRNA splice forms would potentially
    translate and not all would be folded or
    trafficked as functional proteins
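
To make the loss versus gain of splice-specific tryptic peptides concrete, here is a minimal Python sketch that compares a canonical (longest) form with an exon-skipped form. The exon sequences are invented, the trypsin rule (cleave after Lys/Arg, not before Pro) is a simplification, and the function names are hypothetical.

```python
def tryptic_digest(seq):
    """Split a protein sequence after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def splice_specific_peptides(canonical, variant):
    """Return (lost, gained) tryptic peptides of a variant vs the canonical form."""
    canonical_peps = tryptic_digest(canonical)
    variant_peps = tryptic_digest(variant)
    lost = [p for p in canonical_peps if p not in set(variant_peps)]
    gained = [p for p in variant_peps if p not in set(canonical_peps)]
    return lost, gained


# Invented toy exons; the variant skips exon 2.
exon1, exon2, exon3 = "MASKLLEG", "RTDVKAAEFRGHW", "LPNKSSTR"
canonical = exon1 + exon2 + exon3
variant = exon1 + exon3

lost, gained = splice_specific_peptides(canonical, variant)
print("lost:", lost)
print("gained:", gained)
print("loss:gain = %d:%d" % (len(lost), len(gained)))
```

In this toy case the only gained peptide is the one spanning the new exon junction, while several peptides from the skipped exon are lost, which mirrors why choosing the longest form as normal makes splicing show up mainly as peptide loss.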

13
Non-canonical Sequence Space: Pushing the Outer
Limits
  • A few (10 < 100?) small mammalian proteins may
    have neither been predicted by Ens/NCBI nor had
    their cDNAs translated in UniProt
  • Rare splice forms not represented in any
    transcript data
  • Inteins
  • Annotated pseudogenes translated to truncated
    proteins, or AA > stop SNPs
  • nsSNPs not in dbSNP
  • Micro-indel AA changes in dbSNP
  • Somatic rearrangements and mutations in cancers
    e.g. gi2738915 from HeLa cells
  • Unidentified selenocysteine-containing ORFs
  • Cellular protein garbage e.g. 3' run-through
  • Mixed species assignments e.g. bacteria in human
    sputum

These would need de novo sequencing,
sub-threshold gene predictions, virtual tryptomes
and other advanced approaches.

14
Finding the Balance
Maximal Sequence Space
  • Pros
  • Maximum use of MS data
  • High sensitivity
  • Exploits the unique advantages of MS proteomics
    to correct predicted proteins, detect in vivo
    splice forms, nsSNPs, PTMs etc.
  • Experimental verification may yield new
    biological insights
  • Cons
  • Maximal gene/splice/nsSNP hooking space needs
    virtual tryptome approach
  • Need to calibrate false positive rate
  • Thin to zero annotation on unknown proteins
  • Limited parsability
  • Unpublishable data sets
  • Post-processing data merging and mining essential
    to filter potential biologically interesting
    matches for cross-check and follow-up
Minimal Sequence Space
  • Pros
  • Robust and reproducible results
  • Well annotated database entries
  • Clear sequence names
  • Good parsing and link-outs e.g. InterPro, GO etc.
  • Submittable to public databases
  • Cons
  • Unknown false negative rate
  • Low sensitivity
  • Small hit-lists of usual suspects
  • Majority of MS data unassigned
  • Limited new biological insights
  • Microarray/PCR/antibody array could produce more
    comprehensive and quantitative data for less
    resource

15
Conclusions
  • The availability of largely complete basal ORF
    sets for humans and increasing numbers of other
    mammals will be a boon for proteomics
  • For proteomics experiments it is important to
    choose and understand exactly what is, or is not,
    represented in the sequence database you use for
    assignments and how this determines not only your
    specificity calibrations but also what you can
    extract from your results
  • Although there is no correct solution to the
    automated stringency compromise it can be
    adjusted to what your customers or collaborators
    are prepared to exploit
  • Collaborating with your local bioinformaticians
    can expedite database choice, identification
    strategies, formatting and data mining
  • There is arguably more potential biological
    pay-off by using lower stringencies and carefully
    selected but maximised sequence space to push the
    boundaries beyond the identification of basal
    ORFs
  • Submitting your results to improve the quality of
    public protein data is a virtuous cycle

16
References and Acknowledgments
  • Southan C. (2004) Has the Yo-yo stopped? A human
    gene number update. Proteomics (6):1712-26
    (review). PMID: 15174140
  • McGowan S. et al. (84 authors) (2004) Annotation
    of the Human Genome by High-Throughput Sequence
    Analysis of Naturally Occurring Proteins.
    Current Proteomics, 1, 41-48
  • Furnham N., Ruffle S., Southan C. (2004) Splice
    variants: a homology modelling approach.
    Proteins: Structure, Function, Genetics,
    54(3):596-608. PMID: 14748006

17
Finding the Balance
Minimising false positives
  • Filter to high-quality spectra
  • Multiple independent observations
  • Match to only well curated cDNA reference protein
    sets
  • Calibrate a high, comfort-zone scoring threshold
  • Use convergence between multiple algorithms
  • Accept double-hit proteins only
  • Use orthogonal data
Minimising false negatives
  • Use all data
  • Include maximal predicted ORF space and patent
    sequences
  • Include maximal EST splicing and nsSNP coverage
  • Include non-canonical sequence space and PTM
    predictions
  • Use database triage
  • Use twilight-zone scoring thresholds
  • Accept single peptide observations
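
The two ends of this balance can be expressed as two settings of the same acceptance rule. The Python sketch below is a hypothetical illustration: the protein names, peptides, scores and both threshold values are invented, and real search engines use their own scoring scales.

```python
from collections import defaultdict


def accepted_proteins(hits, min_score, min_peptides):
    """Accept proteins supported by enough distinct peptides above a score.

    hits: iterable of (protein, peptide, score) tuples.
    """
    peptides_by_protein = defaultdict(set)
    for protein, peptide, score in hits:
        if score >= min_score:
            peptides_by_protein[protein].add(peptide)
    return sorted(
        protein
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    )


# Invented peptide-spectrum matches.
hits = [
    ("protA", "LLEGR", 62),
    ("protA", "TDVK", 51),
    ("protA", "LLEGR", 44),  # repeat observation of the same peptide
    ("protB", "AAEFR", 58),
    ("predC", "LLEGLPNK", 31),  # weak single hit to a predicted protein
]

# Comfort-zone threshold, double-hit proteins only.
strict = accepted_proteins(hits, min_score=50, min_peptides=2)
# Twilight-zone threshold, single peptide observations accepted.
permissive = accepted_proteins(hits, min_score=30, min_peptides=1)

print("strict:", strict)
print("permissive:", permissive)
```

The strict setting keeps only the well-supported protein, while the permissive setting also reports the single-hit and predicted entries that would then need follow-up.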

18
Setting Stringencies for Automated Protein
Identification by MS/MS
  • Experimental replication to convergence
  • Composite scoring schemes calibrated for
    specificity by control experiments, checks
    against randomised databases and selected manual
    confirmation
  • Use of orthogonal filters e.g. protein Mw,
    protein pI, peptide pI or hydrophobicity
  • Use of MS data libraries for seen-before spectral
    matching
  • Application of pragmatic rules e.g. assignments
    only from double peptide hits

These are all dependent on the composition of the
sequence databases used to hook out the MS matches.
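
One way to picture the check against randomised databases is a decoy search. The Python sketch below uses reversed sequences as a simple stand-in for randomisation and estimates the false positive proportion at a score threshold as decoy hits divided by target hits. The "DECOY_" prefix, the scores and the protein names are all assumptions made for illustration.

```python
DECOY_PREFIX = "DECOY_"  # naming convention assumed for this sketch


def make_decoys(proteins):
    """Build decoy entries by reversing each target sequence."""
    return {DECOY_PREFIX + name: seq[::-1] for name, seq in proteins.items()}


def estimated_fdr(matches, threshold):
    """Estimate the false discovery rate at or above a score threshold.

    matches: list of (score, protein_name) from a search of the combined
    target plus decoy database. Returns decoy hits / target hits.
    """
    passing = [name for score, name in matches if score >= threshold]
    decoys = sum(1 for name in passing if name.startswith(DECOY_PREFIX))
    targets = len(passing) - decoys
    return decoys / targets if targets else 0.0


# Invented search results against targets and their decoys.
matches = [
    (60, "P1"), (55, "P2"), (48, "P3"), (45, "DECOY_P7"),
    (40, "P4"), (35, "P5"), (33, "DECOY_P2"), (30, "P6"),
]

for threshold in (50, 40, 30):
    print(threshold, round(estimated_fdr(matches, threshold), 3))
```

Because the decoy set is derived from the target set, enlarging the search space (more predictions, splice forms or nsSNPs) also enlarges the decoy space, so the threshold needed for a given error rate depends on the database chosen.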