Title: Proteomic Characterization of Alternative Splicing and Coding Polymorphism
1Proteomic Characterization of Alternative
Splicing and Coding Polymorphism
- Nathan Edwards
- Center for Bioinformatics and Computational
Biology - University of Maryland, College Park
2Proteomics
- Proteins are the machines that drive much of biology
- Genes are merely the recipe
- The direct characterization of a sample's proteins en masse
- What proteins are present?
- How much of each protein is present?
3Systems Biology
- Establish relationships by
- Choosing related samples,
- Global characterization, and
- Comparison.
Gene / Transcript / Protein
Measurement      Predetermined     Unknown
Discrete (DNA)   Genotyping        Sequencing
Continuous       Gene Expression   Proteomics
4Samples
- Healthy / Diseased
- Cancerous / Benign
- Drug resistant / Drug susceptible
- Progression or Prognosis
- Bound / Unbound
- Tissue specific
- Cellular location specific
- Mitochondria, Membrane
52D Gel-Electrophoresis
- Protein separation
- Molecular weight (MW)
- Isoelectric point (pI)
- Staining
- Birds-eye view of protein abundance
62D Gel-Electrophoresis
Bécamel et al., Biol. Proced. Online 2002; 4: 94-104.
7Paradigm Shift
- Traditional protein chemistry assay methods struggle to establish identity.
- Identity requires
- Specificity of measurement (Precision)
- A reference for comparison
8Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
- High bandwidth
- Mass is an intrinsic property of all (bio)molecules
- No prior knowledge required
9Mass Spectrometer
- Time-Of-Flight (TOF)
- Quadrupole
- Ion-Trap
- MALDI
- Electrospray Ionization (ESI)
10High Bandwidth
11Mass is fundamental!
12Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
- ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
- ...but need a reference to compare to
13Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn of the 20th century...
- ...why is MS-based proteomics so new?
- Ionization methods
- MALDI, Electrospray
- Protein chemistry automation
- Chromatography, Gels, Computers
- Protein sequence databases
- A reference for comparison
14Sample Preparation for Peptide Identification
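15Single Stage MS
[Figure: single-stage MS spectrum; x-axis m/z]
16Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection from the MS spectrum; x-axis m/z]
17Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection followed by collision-induced dissociation (CID) yields the MS/MS spectrum; x-axis m/z]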
18Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from protein sequence databases
- Swiss-Prot, IPI, NCBI's nr, ...
- Automated, high-throughput peptide identification
in complex mixtures
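The three steps above can be sketched as follows. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, a fixed m/z tolerance, and a simple count of matched fragments as the score. The function names and the `tol` and `min_matches` parameters are hypothetical.

```python
# Monoisotopic amino-acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments within tol of an observed peak."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b_ions + y_ions)


def best_candidates(candidates, peaks, min_matches, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(pep, peaks, tol), pep) for pep in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)
```

Because the candidate list comes from a sequence database, a peptide that is absent from the database can never be returned, however good its spectrum.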
19Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate against novel peptides...
- ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
20What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
21Why should we care?
- Alternative splicing is the norm!
- Only 20-25K human genes
- Each gene makes many proteins
- Proteins have clinical implications
- Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
- Genomic assays, ESTs, mRNA sequence.
- Little hard evidence for translation start site
22Novel Splice Isoform
- Human Jurkat leukemia cell-line
- Lipid-raft extraction protocol, targeting T cells
- von Haller, et al. MCP 2003.
- LIME1 gene
- LCK interacting transmembrane adaptor 1
- LCK gene
- Leukocyte-specific protein tyrosine kinase
- Proto-oncogene
- Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
23Novel Splice Isoform
24Novel Splice Isoform
25Novel Mutation
- HUPO Plasma Proteome Project
- Pooled samples from 10 male and 10 female healthy Chinese subjects
- Plasma/EDTA sample protocol
- Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
- Transthyretin (pre-albumin)
- Defects in TTR are a cause of amyloidosis.
- Familial amyloidotic polyneuropathy
- late-onset, dominant inheritance
26Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
27Novel Mutation
28Expressed Sequence Tags (ESTs)
- Cheap, fast, coding
- Single sequencing reads of mRNA
- Sequence from 5' or 3' end
- No assembly
http://www.ncbi.nlm.nih.gov/About/primer/est.html
29Searching ESTs
- Proposed long ago
- Yates, Eng, and McCormack, Anal. Chem., 1995.
- Now
- Protein sequences are sufficient for protein identification
- Computationally expensive/infeasible
- Difficult to interpret
- Make EST searching feasible for routine searching
to discover novel peptides.
30Searching Expressed Sequence Tags (ESTs)
- Pros
- No introns!
- Primary splicing evidence for annotation
pipelines - Evidence for dbSNP
- Often derived from clinical cancer samples
- Cons
- No frame
- Large (8Gb)
- Untrusted by annotation pipelines
- Highly redundant
- Nucleotide error rate ~1%
31Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
- Six-frame translation
- Eliminate ORFs < 30 amino-acids
- Eliminate amino-acid 30-mers observed once
- Compress to C2 FASTA database
- Complete, Correct for amino-acid 30-mers
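The first three steps of this pipeline can be sketched as below. This is an illustrative sketch only: it assumes the standard genetic code, treats an ORF as any stop-free stretch of a translated frame, and interprets "observed once" as "supported by only one EST". The function names are hypothetical, and the final compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free translated segments of at least min_len residues."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons with ambiguous bases translate to "X".
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def supported_kmers(ests, k=30, min_support=2):
    """Return amino-acid k-mers seen in at least min_support different ESTs."""
    support = Counter()
    for est in ests:
        kmers = {orf[i:i + k]
                 for orf in six_frame_orfs(est, min_len=k)
                 for i in range(len(orf) - k + 1)}
        support.update(kmers)  # each EST counts at most once per k-mer
    return {kmer for kmer, n in support.items() if n >= min_support}
```

Requiring a second supporting EST is one plausible way to discard k-mers that arise from isolated sequencing errors; the threshold of two is an assumption here.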
32Compressed EST Peptide Sequence Database
33Compressed EST Database
- Gene-centric compressed EST peptide sequence database
- 20,774 sequence entries
- 8 Gb vs. 223 Mb
- 35-fold compression
- 22 hours becomes 15 minutes
- E-values improve by similar factor!
- Makes routine EST searching feasible
- Search ESTs instead of IPI?
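The figures on this slide can be checked with a little arithmetic. The sketch below assumes the common convention that an E-value is a per-comparison p-value multiplied by the number of candidates searched, so it shrinks in proportion to the database. The function names are hypothetical.

```python
def fold_change(before, after):
    """Ratio of a quantity before and after compression (same units)."""
    return before / after


def rescaled_e_value(e_value, size_factor):
    """E-value after the searched database shrinks by size_factor.

    Assumes E = p * N, where N is proportional to database size.
    """
    return e_value / size_factor


size_factor = fold_change(8000, 223)     # 8 Gb vs. 223 Mb, in Mb
speedup = fold_change(22 * 60, 15)       # 22 hours vs. 15 minutes, in minutes
```

Under these assumptions the size reduction is roughly 35 to 36 fold, and a hit with an E-value of 1.0 against the uncompressed ESTs would score near 0.03 against the compressed database.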
34Back to the lab...
- Current LC/MS/MS workflows identify a few peptides per protein
- ...not sufficient for protein isoforms
- Need to raise the sequence coverage to (say) 80%
- ...protein separation prior to LC/MS/MS analysis
- Potential for database of splice sites of
(functional) proteins!
35Microorganism Identification by MALDI Mass
Spectrometry
- Direct observation of microorganism biomarkers in the field.
- Peaks represent masses of abundant proteins.
- Statistical models assess identification
significance.
B. anthracis spores
MALDI Mass Spectrometry
36Key Principles
- Protein mass from protein sequence
- No introns, few PTMs
- Specificity of single mass is very weak
- Statistical significance from many peaks
- Not all proteins are equally likely to be observed
- Ribosomal proteins, SASPs
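One way to see why many peaks give significance where a single mass does not is a simple binomial model. The sketch below assumes each observed peak independently matches some database protein mass by chance with probability `p_match`. That independence assumption, and the function name, are illustrative and not necessarily the statistical model used in the talk.

```python
from math import comb


def chance_match_pvalue(n_peaks, n_matched, p_match):
    """Probability that at least n_matched of n_peaks match by chance alone."""
    return sum(
        comb(n_peaks, k) * p_match**k * (1 - p_match)**(n_peaks - k)
        for k in range(n_matched, n_peaks + 1)
    )
```

With a 20% chance of a random match per peak, one matched peak out of one is unremarkable (p = 0.2), whereas 8 matched peaks out of 10 gives a p-value of about 7.8e-5.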
37Rapid Microorganism Identification Database
(www.RMIDb.org)
- Protein Sequences
- 8.1M (2.9M)
- Species
- 18K
- GenBank
- Microbial, Virus, Plasmid
- RefSeq
- CMR
- Swiss-Prot
- TrEMBL
38Rapid Microorganism Identification Database
(www.RMIDb.org)
39Informatics Issues
- Need good species / strain annotation
- B. anthracis vs. B. thuringiensis
- Need correct protein sequence
- B. anthracis Sterne α/β SASP
- RefSeq/Gb: MVMARN... (7442 Da)
- CMR: MARN... (7211 Da)
- Need chemistry-based protein classification
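The SASP example shows how an annotation difference changes the predicted mass. The two database entries differ by 7442 - 7211 = 231 Da, which is close to the combined average residue mass of the extra N-terminal Met and Val. A minimal sketch of that check, using standard average residue masses; the function names are hypothetical:

```python
# Average amino-acid residue masses in daltons.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
AVERAGE_WATER = 18.0153


def residues_mass(sequence):
    """Summed average residue mass, e.g. for an N-terminal extension."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence)


def protein_mass(sequence):
    """Average mass of an unmodified protein: residues plus one water."""
    return residues_mass(sequence) + AVERAGE_WATER


extension = residues_mass("MV")   # residues present only in the longer entry
annotated_gap = 7442 - 7211       # masses quoted on the slide
```

Here `extension` is about 230.3 Da against an annotated gap of 231 Da, so the two entries are consistent with the same protein annotated with different start sites.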
40Spectral Matching
- Detection vs. identification
- Increased sensitivity
- No novel peptides
- NIST GC/MS Spectral Library
- Identifies small molecules
- 100,000s of (consensus) spectra
- Bundled/Sold with many instruments
- Dot-product spectral comparison
- Current project: peptide MS/MS
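A dot-product comparison between a query spectrum and a library spectrum can be sketched as a normalized (cosine) score over binned peaks. This is a generic illustration, not the NIST library's actual scoring. The 1.0 m/z bin width and the function names are assumptions.

```python
from collections import defaultdict
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(query, library, bin_width=1.0):
    """Cosine similarity of two spectra given as (m/z, intensity) pairs."""
    q = bin_spectrum(query, bin_width)
    lib = bin_spectrum(library, bin_width)
    dot = sum(v * lib.get(b, 0.0) for b, v in q.items())
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in lib.values()))
    return dot / norm if norm else 0.0
```

The score is 1.0 for spectra with identical relative intensities and 0.0 when no bins are shared. Because it compares against previously observed spectra, it can only detect peptides already in the library, which is why spectral matching finds no novel peptides.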
41Peptide DLATVYVDVLK
42Peptide DLATVYVDVLK
43Hidden Markov Models for Spectral Matching
- Capture statistical variation and consensus in peak intensity
- Capture semantics of peaks
- Extrapolate model to other peptides
- Good specificity with superior sensitivity for
peptide detection
44Conclusions
- Molecular biology and bioinformatics provide a reference for biotechnologies
- Foundation of systems biology
- Peptides identify more than just proteins
- Untapped source of disease biomarkers
- Compressed peptide sequence databases make
routine EST searching feasible
45Future Research Directions
- Identification of protein isoforms
- Optimize proteomics workflow for isoform detection
- Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples
- dbPep for genomic annotation
46Future Research Directions
- Proteomics for Microorganism Identification
- Specificity of tandem mass spectra
- Revamp RMIDb prototype
- Incorporate spectral matching
47Acknowledgements
- Catherine Fenselau, Steve Swatkoski
- UMCP Biochemistry
- Chau-Wen Tseng, Xue Wu
- UMCP Computer Science
- Cheng Lee
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: NIH/NCI, USDA/ARS