Title: Proteomic Characterization of Alternative Splicing and Coding Polymorphism
 1Proteomic Characterization of Alternative 
Splicing and Coding Polymorphism
- Nathan Edwards 
- Center for Bioinformatics and Computational 
 Biology
- University of Maryland, College Park
2Proteomics
- Proteins are the machines that drive much of 
 biology
- Genes are merely the recipe 
- The direct characterization of a sample's 
 proteins en masse.
- What proteins are present? 
- How much of each protein is present?
3Systems Biology
- Establish relationships by 
- Choosing related samples, 
- Global characterization, and 
- Comparison.
4Samples
- Healthy / Diseased 
- Cancerous / Benign 
- Drug resistant / Drug susceptible 
- Progression or Prognosis 
- Bound / Unbound 
- Tissue specific 
- Cellular location specific 
- Mitochondria, Membrane
52D Gel-Electrophoresis
- Protein separation 
- Molecular weight (MW) 
- Isoelectric point (pI) 
- Staining 
- Birds-eye view of protein abundance 
62D Gel-Electrophoresis
Bécamel et al., Biol. Proced. Online 2002;4:94-104.
 7Paradigm Shift
- Traditional protein chemistry assay methods 
 struggle to establish identity.
- Identity requires 
- Specificity of measurement (Precision) 
- A reference for comparison 
8Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules 
 simultaneously
- High bandwidth 
- Mass is an intrinsic property of all 
 (bio)molecules
- No prior knowledge required
9Mass Spectrometer
- Time-Of-Flight (TOF) 
- Quadrupole 
- Ion-Trap
- MALDI 
- Electrospray Ionization (ESI)
10High Bandwidth 
 11Mass is fundamental! 
 12Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously 
- ...but not too many, abundance bias 
- Mass is an intrinsic property of all 
 (bio)molecules
- ...but need a reference to compare to 
13Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn 
 of the 20th century...
- ...why is MS-based Proteomics so new? 
- Ionization methods 
- MALDI, Electrospray 
- Protein chemistry and automation 
- Chromatography, Gels, Computers 
- Protein sequence databases 
- A reference for comparison
14Sample Preparation for Peptide Identification 
 15Single Stage MS
[Figure: MS spectrum (x-axis: m/z)]
 16Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection (x-axis: m/z)]
 17Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection and collision-induced dissociation (CID), giving an MS/MS spectrum (x-axis: m/z)]
 18Peptide Identification
- For each (likely) peptide sequence 
- 1. Compute fragment masses 
- 2. Compare with spectrum 
- 3. Retain those that match well 
- Peptide sequences from protein sequence databases 
- Swiss-Prot, IPI, NCBI's nr, ... 
- Automated, high-throughput peptide identification 
 in complex mixtures
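The three steps above can be sketched as follows. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, a fixed mass tolerance, and a plain count of matched fragments as the score. The names `fragment_masses`, `count_matches`, and `identify` are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:             # N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):    # C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments within tol of an observed peak."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b_ions + y_ions)


def identify(candidates, peaks, min_matches=4, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)
```

In practice the candidates would be peptides drawn from the sequence database whose mass matches the selected precursor, which is why a peptide absent from the database can never be reported.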
19Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate 
 against novel peptides... but protein
 sequence databases do!
- Searching traditional protein sequence databases 
 biases the results towards well-understood
 protein isoforms!
20What goes missing?
- Known coding SNPs 
- Novel coding mutations 
- Alternative splicing isoforms 
- Alternative translation start-sites 
- Microexons 
- Alternative translation frames
21Why should we care?
- Alternative splicing is the norm! 
- Only 20-25K human genes 
- Each gene makes many proteins 
- Proteins have clinical implications 
- Biomarker discovery 
- Evidence for SNPs and alternative splicing stops 
 with transcription
- Genomic assays, ESTs, mRNA sequence. 
- Little hard evidence for translation start sites
22Novel Splice Isoform
- Human Jurkat leukemia cell-line 
- Lipid-raft extraction protocol, targeting T cells 
- von Haller, et al. MCP 2003. 
- LIME1 gene 
- LCK interacting transmembrane adaptor 1 
- LCK gene 
- Leukocyte-specific protein tyrosine kinase 
- Proto-oncogene 
- Chromosomal aberration involving LCK in 
 leukemias.
- Multiple significant peptide identifications
23Novel Splice Isoform 
 24Novel Splice Isoform 
 25Novel Mutation
- HUPO Plasma Proteome Project 
- Pooled samples from 10 male and 10 female healthy 
 Chinese subjects
- Plasma/EDTA sample protocol 
- Li, et al. Proteomics 2005. (Lab 29) 
- TTR gene 
- Transthyretin (pre-albumin) 
- Defects in TTR are a cause of amyloidosis. 
- Familial amyloidotic polyneuropathy 
- late-onset, dominant inheritance
26Novel Mutation
Ala2→Pro associated with familial amyloid 
polyneuropathy 
 27Novel Mutation 
 28Expressed Sequence Tags (ESTs)
- Cheap, fast, coding 
- Single sequencing reads of mRNA 
- Sequence from 5' or 3' end 
- No assembly 
http://www.ncbi.nlm.nih.gov/About/primer/est.html 
 29Searching ESTs
- Proposed long ago 
- Yates, Eng, and McCormack, Anal Chem, 1995. 
- Now 
- Protein sequences are sufficient for protein 
 identification
- Computationally expensive/infeasible 
- Difficult to interpret 
- Make EST searching feasible for routine searching 
 to discover novel peptides.
30Searching Expressed Sequence Tags (ESTs)
- Pros 
- No introns! 
- Primary splicing evidence for annotation 
 pipelines
- Evidence for dbSNP 
- Often derived from clinical cancer samples
- Cons 
- No frame 
- Large (8Gb) 
- Untrusted by annotation pipelines 
- Highly redundant 
- Nucleotide error rate ~ 1% 
31Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene 
- Six-frame translation 
- Eliminate ORFs < 30 amino-acids 
- Eliminate amino-acid 30-mers observed once 
- Compress to C2 FASTA database 
- Complete, Correct for amino-acid 30-mers 
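The first three steps of this pipeline can be sketched as below. The sketch rests on assumptions not stated on the slide: an "ORF" is taken to be any stop-free stretch of a translated frame, "observed once" is read as "seen in only one EST", and unknown codons translate to `X`. The function names are hypothetical, and the final compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free translated stretches of at least min_len residues
    from all six reading frames (three per strand)."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3]
                      for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(est_sequences, k=30):
    """Return the amino-acid k-mers seen in at least two ESTs."""
    counts = Counter()
    for est in est_sequences:
        seen = set()
        for orf in six_frame_orfs(est, min_len=k):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)  # each EST contributes a k-mer at most once
    return {kmer for kmer, n in counts.items() if n >= 2}
```

Requiring a second, independent observation is what discards most sequencing errors: at an error rate near 1% per nucleotide, the same erroneous 30-mer is unlikely to recur in another EST.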
33Compressed EST Database
- Gene centric compressed EST peptide sequence 
 database
- 20,774 sequence entries 
- 8 Gb vs. 223 Mb 
- 35-fold compression 
- 22 hours becomes 15 minutes 
- E-values improve by similar factor! 
- Makes routine EST searching feasible 
- Search ESTs instead of IPI?
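The figures above can be checked with a few lines of arithmetic. The sketch assumes 1 Gb = 1000 Mb, and it uses the common definition of an E-value as a per-candidate p-value multiplied by the number of candidates searched. The candidate counts in the last lines are hypothetical and only illustrate the proportionality.

```python
full_size_mb = 8000.0        # "8 Gb", assuming 1 Gb = 1000 Mb
compressed_size_mb = 223.0
compression = full_size_mb / compressed_size_mb      # about 35.9

full_minutes = 22 * 60       # 22 hours
compressed_minutes = 15
speedup = full_minutes / compressed_minutes          # 88.0


def e_value(p_value, n_candidates):
    """Expected number of chance matches at least this good."""
    return p_value * n_candidates


# Hypothetical candidate counts: for the same p-value, shrinking the
# search space shrinks the E-value by the same factor.
improvement = e_value(1e-6, 3_500_000) / e_value(1e-6, 100_000)
```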
34Back to the lab...
- Current LC/MS/MS workflows identify a few 
 peptides per protein
- ...not sufficient for protein isoforms 
- Need to raise the sequence coverage to (say) 80% 
- ...protein separation prior to LC/MS/MS analysis 
- Potential for database of splice sites of 
 (functional) proteins!
35Microorganism Identification by MALDI Mass 
Spectrometry
- Direct observation of microorganism biomarkers in 
 the field.
- Peaks represent masses of abundant proteins. 
- Statistical models assess identification 
 significance.
B. anthracis spores
MALDI Mass Spectrometry 
 36Key Principles
- Protein mass from protein sequence 
- No introns, few PTMs 
- Specificity of single mass is very weak 
- Statistical significance from many peaks 
- Not all proteins are equally likely to be 
 observed
- Ribosomal proteins, SASPs 
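Two of these principles lend themselves to a short sketch: computing an average protein mass from its sequence, and seeing why many matched peaks are far more convincing than one. The binomial model here is an illustrative assumption, not the statistical model used by RMIDb. `p_single`, the chance that one observed peak matches some database mass at random, is a hypothetical parameter.

```python
from math import comb

# Average (not monoisotopic) residue masses in Da.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
AVG_WATER = 18.0153


def protein_mass(sequence):
    """Average mass of an unmodified protein: residues plus one water."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + AVG_WATER


def random_match_pvalue(n_peaks, n_matched, p_single):
    """P(at least n_matched of n_peaks match by chance), assuming peaks
    match independently with probability p_single each."""
    return sum(
        comb(n_peaks, k) * p_single ** k * (1 - p_single) ** (n_peaks - k)
        for k in range(n_matched, n_peaks + 1)
    )
```

With `p_single = 0.2`, a single matched peak carries almost no information, while `random_match_pvalue(10, 8, 0.2)` is several orders of magnitude smaller.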
37Rapid Microorganism Identification Database 
(www.RMIDb.org)
- Protein Sequences 
- 8.1M (2.9M) 
- Species 
-  18K 
- Genbank, 
- Microbial, Virus, Plasmid 
- RefSeq 
- CMR, 
- Swiss-Prot 
- TrEMBL 
38Rapid Microorganism Identification Database 
(www.RMIDb.org) 
 39Informatics Issues
- Need good species / strain annotation 
- B. anthracis vs. B. thuringiensis 
- Need correct protein sequence 
- B. anthracis Sterne α/β SASP 
- RefSeq/Gb: MVMARN... (7442 Da) 
- CMR: MARN... (7211 Da) 
- Need chemistry based protein classification
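The two SASP masses above differ by about the mass of the extra N-terminal residues. A quick check, assuming average residue masses for Met and Val and treating the listed masses as rounded values:

```python
# Average residue masses in Da for the two extra N-terminal residues.
AVG_RESIDUE_MASS = {"M": 131.1926, "V": 99.1326}

refseq_mass = 7442   # RefSeq/Gb entry, sequence starts MVMARN...
cmr_mass = 7211      # CMR entry, sequence starts MARN...

observed_difference = refseq_mass - cmr_mass                 # 231
extra_residues = sum(AVG_RESIDUE_MASS[aa] for aa in "MV")    # about 230.3
```

The arithmetic shows only that the two annotations disagree about the start site. Which one matches the observed peak has to come from the spectrum.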
40Spectral Matching 
- Detection vs. identification 
- Increased sensitivity 
- No novel peptides 
- NIST GC/MS Spectral Library 
- Identifies small molecules, 
- 100,000s of (consensus) spectra 
- Bundled/Sold with many instruments 
- Dot-product spectral comparison 
- Current project: Peptide MS/MS 
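A dot-product spectral comparison can be sketched as a cosine similarity between binned spectra. This is a generic illustration, not the NIST library's scoring: the bin width, the absence of intensity or m/z weighting, and the function names are all assumptions.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.
    peaks is a list of (m/z, intensity) pairs."""
    binned = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def dot_product_score(spectrum_a, spectrum_b, bin_width=1.0):
    """Normalized dot product: 1.0 for spectra with the same peak
    pattern, 0.0 for spectra with no shared bins."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Because the score is normalized, a query spectrum that is a scaled copy of a library spectrum still scores 1.0, so overall signal strength does not affect the match.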
41Peptide DLATVYVDVLK 
 42Peptide DLATVYVDVLK 
 43Hidden Markov Models for Spectral Matching
- Capture statistical variation and consensus in 
 peak intensity
- Capture semantics of peaks 
- Extrapolate model to other peptides 
- Good specificity with superior sensitivity for 
 peptide detection
44Conclusions
- Molecular biology and bioinformatics provide a 
 reference for biotechnologies
- Foundation of systems biology 
- Peptides identify more than just proteins 
- Untapped source of disease biomarkers 
- Compressed peptide sequence databases make 
 routine EST searching feasible
45Future Research Directions
- Identification of protein isoforms 
- Optimize proteomics workflow for isoform 
 detection
- Identify splice variants in cancer cell-lines 
 (MCF-7) and clinical brain tumor samples
- dbPep for genomic annotation
46Future Research Directions
- Proteomics for Microorganism Identification 
- Specificity of tandem mass spectra 
- Revamp RMIDb prototype 
- Incorporate spectral matching 
47Acknowledgements
- Catherine Fenselau, Steve Swatkoski 
- UMCP Biochemistry 
- Chau-Wen Tseng, Xue Wu 
- UMCP Computer Science 
- Cheng Lee 
- Calibrant Biosystems 
- PeptideAtlas, HUPO PPP, X!Tandem 
- Funding: NIH/NCI, USDA/ARS