Title: Computational Methods and Bioinformatics in Proteomic Studies
1. Computational Methods and Bioinformatics in Proteomic Studies
Bioinformatics: Building Bridges, April 14, 2005
Tim Griffin, Dept. of Biochemistry, Molecular Biology and Biophysics
tgriffin_at_umn.edu
2. Interdisciplinary biology in the 21st century
3. Genome-era biology: system-wide studies
The yeast genome on a chip
DeRisi et al. (1997) Science 278, 680
4. The simple view: one gene, one protein
5. The reality: biological systems are complex
Protein interaction network in Drosophila, Science (2003) 302, p. 1727
6. Why analyze at the protein level? Control of eukaryotic gene expression
[Figure: control points in eukaryotic gene expression]
Nucleus: DNA -> (transcriptional control) -> primary RNA transcript -> (RNA processing control) -> mRNA
Nucleus to cytosol: mRNA -> (RNA transport control) -> mRNA
Cytosol: mRNA -> (translational control) -> protein, or inactive mRNA
Cytosol: protein -> (protein activity control) -> active protein or inactive protein
7. What is proteomics?
- "Proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and, ultimately, their function." (Stan Fields in Science, 2001)
- Alternatively: proteomics is "fast biochemistry"
8. Proteomics: a complement to genomics
What proteomic analysis has to offer:
- measurement of protein response, which is not always indicated by mRNA response
- post-translational modifications
- macromolecular interactions
- sub-cellular location
- high-resolution structural and molecular characterization
9. Genomics, Proteomics, and Systems Biology
10. Proteomics technologies and methods
- Two-dimensional gel electrophoresis
- mass spectrometry
- protein chips
- yeast 2-hybrid
- phage display
- antibody engineering
- high-throughput protein expression
- high-throughput X-ray crystallography
11. The 1990s revolution: mass spectrometry
Development of physical methods to mass analyze large biomolecules
separation by m/z
- quadrupole
- ion trap
- time-of-flight
- MALDI
- Electrospray
- liquid chromatography
- nanospray
- mass analysis of proteins, peptides, DNA
12. Electrospray ionization (ESI)
[Figure scale bar: 200 µm]
- protein and peptide analysis, multiply charged ions
- quadrupole and TOF detection
- tandem mass spectrometry
- solution phase ionization enables online coupling with liquid chromatography (LC)
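Because ESI produces multiply charged ions, the same molecule appears at several m/z values. A minimal sketch of that relationship, assuming positive-mode ions formed by adding protons; the function name `mz` and the 1500 Da example mass are illustrative and not taken from the slides:

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def mz(neutral_mass: float, charge: int) -> float:
    """Return m/z for a molecule carrying `charge` added protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# Hypothetical peptide mass, chosen only to show how m/z shrinks as charge grows.
hypothetical_mass = 1500.0
for z in (1, 2, 3):
    print(f"z={z}: m/z = {mz(hypothetical_mass, z):.3f}")
```

Higher charge states bring large molecules into the m/z range that quadrupole and TOF analyzers handle well.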
13. Separations of complex mixtures: crowd control
- Enables the processing of the many components in big protein mixtures
[Figure: turnstile analogy (1, 2, 3, ...)]
14. Identification of protein mixtures by tandem mass spectrometry
Sample preparation: protein mixture -> trypsin -> peptides -> µLC -> ESI
In the mass spectrometer: 1. MS survey scan -> 2. select specific peptide -> 3. CID (Ar collision gas) -> 4. detect fragments
15. Peptide sequence determination from MS/MS spectra
Collision-induced dissociation (CID) creates two prominent ion series
y-series: y14 y13 y12 y11 y10 y9 y8 y7 y6 y5 y4 y3 y2 y1
H2N-N--S--G--D--I--V--N--L--G--S--I--A--G--R-COOH
b-series: b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14
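The two ion series can be computed directly from the peptide sequence. The sketch below is a simplified illustration, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` is hypothetical, and the full-length fragments are omitted because they correspond to the intact peptide rather than a backbone cleavage.

```python
# Monoisotopic residue masses in Da (only the residues that occur in the example).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "R": 156.10111,
}
WATER = 18.01056   # added to y ions, which keep the C-terminal OH plus an extra H
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i - 1] covers the first i residues (N-terminal fragment);
    y_ions[i - 1] covers the last i residues (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("NSGDIVNLGSIAGR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i:<2} {b:9.3f}    y{i:<2} {y:9.3f}")
```

Reading the mass differences between neighbouring peaks in one series gives the residues in order, which is how a sequence is read off a spectrum. Note that L and I have the same mass and cannot be told apart this way.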
16. Identification of protein mixtures by mass spectrometry
- De novo (i.e. manually)
- Database searching
observed spectrum vs. theoretical (DNA or protein database) -> peptide identification -> protein identification
17. Peptide sequence identifies the protein
[Figure: MS/MS spectrum of H2N-NSGDIVNLGSIAGR-COOH, relative abundance vs. m/z (200 to 1200), peaks labeled with the fragment ladder below]
GDIVNLGSIAGR
DIVNLGSIAGR
IVNLGSIAGR
VNLGSIAGR
NLGSIAGR
LGSIAGR
GSIAGR
SIAGR
IAGR
AGR
GR
R
YMR134W, yeast protein involved in iron metabolism
18. High-throughput protein identification by LC-MS/MS and automated sequence database searching
Raw MS/MS spectrum -> protein sequence and/or DNA sequence database search -> peptide sequence match -> protein identification
Direct identification of 1000 proteins from complex mixtures
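The last step of the workflow above, going from a matched peptide sequence to a protein, can be illustrated with an in-silico tryptic digest. This is a toy sketch, not how any particular search engine works: real tools score spectra against predicted fragments, whereas this example only looks up an already-identified peptide. The database entries are invented sequences, and `tryptic_digest` and `build_peptide_index` are hypothetical helper names.

```python
def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def build_peptide_index(database):
    """Map each tryptic peptide to the names of the proteins that contain it."""
    index = {}
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            index.setdefault(peptide, []).append(name)
    return index


# Invented sequences for illustration only; not real database entries.
TOY_DATABASE = {
    "toy_protein_A": "MKNSGDIVNLGSIAGRLLK",
    "toy_protein_B": "MSTRPEAKGGR",
}

peptide_index = build_peptide_index(TOY_DATABASE)
print(peptide_index.get("NSGDIVNLGSIAGR", []))
```

A peptide that maps to a single database entry identifies that protein; a peptide shared by several entries does not, which is one reason database redundancy matters.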
19. Case Study: Proteomic Analysis of Oral Cancer Progression
- Mouth cancer, tongue cancer, throat cancer
- In the USA, 30,000 people are newly diagnosed with oral cancer each year; a person dies from oral cancer every hour of every day
- 350,000 to 400,000 new cases annually worldwide
- Less than half will be alive in 5 yrs; 20x higher risk of producing second, primary tumors
- However, 80 to 90% cure rate when found early. Unfortunately, at this time, the majority are found as late-stage cancers
20. Progression of oral cancer
normal -> insult or injury -> dysplasia -> carcinoma
Malignant transformation rate: 5-17%
Can we find molecular markers that predict this transition?
(adapted from Dr. Nelson Rhodus, U of M Dental School)
21. Saliva as a diagnostic fluid in oral cancer progression
- Readily available, non-invasive collection
- Heterogeneous human fluid with large dynamic range of protein abundances; requires fractionation
- Many post-translationally modified proteins
- Currently only 100-150 proteins have been identified in whole saliva (LC-MS/MS)
First step: obtain a comprehensive profile of the protein components from a normal individual saliva sample
22. Multidimensional separations followed by mass spectrometry
Whole saliva protein mixture -> FFE fractionation (70 fractions) -> RP-capLC -> ESI-MS/MS (500,000 spectra) -> protein sequence and/or DNA sequence database search -> protein identification
23. Raw data processing: automated database searching
Computational algorithms for searching MS/MS spectra against protein sequence databases, mRNA sequences, DNA sequences
- ProFound
- Mascot
- PepSea
- MS-Fit
- MOWSE
- Peptident
- Multident
- Sequest
- PepFrag
- MS-Tag
Protein identification
24. Choosing a sequence database
- National Center for Biotechnology Information (NCBI)
- Swiss-Prot/TrEMBL
- Protein Information Resource (PIR)
- European Bioinformatics Institute (EBI)
Considerations: organism-specificity, redundancy, annotation
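Redundancy, one of the considerations listed above, can be checked before searching. The following is a rough sketch, assuming the database is available as FASTA text and that "redundant" means exactly identical sequences under different headers; the entries and the names `parse_fasta` and `collapse_redundant` are invented for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def collapse_redundant(records):
    """Group headers by identical sequence so duplicates become visible."""
    by_sequence = {}
    for header, sequence in records:
        by_sequence.setdefault(sequence, []).append(header)
    return by_sequence


# Invented entries; entry1 and entry2 carry the same sequence.
TOY_FASTA = """>entry1 toy
MKNSGDIVNLGSIAGR
LLK
>entry2 toy duplicate
MKNSGDIVNLGSIAGRLLK
>entry3 toy
MSTRPEAKGGR
"""

groups = collapse_redundant(parse_fasta(TOY_FASTA))
for sequence, headers in groups.items():
    print(len(headers), "entries:", headers)
```

Near-duplicates (isoforms, fragments, sequencing variants) need sequence comparison rather than exact matching and are outside the scope of this sketch.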
25. Analysis of processed data: quality control of protein matches
Unfiltered: 10^5 matches (lots of noise and junk)
-> filtering ->
Filtered: thousands of true matches
26. Probability of sequence match via statistical modeling
Sequence matches are automatically assigned a P score between 0 and 1
Keller et al. (2002) Analytical Chemistry 74, 5383
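A P score between 0 and 1 makes filtering straightforward. The sketch below assumes the score can be read as the probability that a match is correct; under that assumption, the expected number of wrong matches among those accepted is the sum of (1 - P). The 0.9 cutoff, the peptide names, and `filter_matches` are illustrative choices, not values from the cited paper.

```python
def filter_matches(matches, min_probability=0.9):
    """Keep matches with P >= min_probability.

    `matches` is a list of (peptide, probability) tuples. Returns the accepted
    matches and the expected number of incorrect ones among them.
    """
    accepted = [(pep, p) for pep, p in matches if p >= min_probability]
    expected_wrong = sum(1.0 - p for _, p in accepted)
    return accepted, expected_wrong


# Invented scores for illustration.
toy_matches = [
    ("PEPTIDEA", 0.99),
    ("PEPTIDEB", 0.95),
    ("PEPTIDEC", 0.40),
    ("PEPTIDED", 0.05),
]

accepted, expected_wrong = filter_matches(toy_matches)
print(f"accepted {len(accepted)} of {len(toy_matches)} matches")
print(f"expected incorrect among accepted: {expected_wrong:.2f}")
```

Raising the cutoff lowers the expected error at the cost of discarding some true matches, so the threshold is a trade-off rather than a fixed rule.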
27. Collating and interpreting the data: Interact software tool
http://www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware
28. Result: processed and filtered data
Saliva example: 433 unique proteins identified
29. Interpreting the data: annotated protein databases
- National Center for Biotechnology Information (NCBI)
- ExPASy/Uniprot
- European Bioinformatics Institute (EBI)
Organism/biology specific:
- Saccharomyces Genome Database (SGD)
- Human Mitochondrial Protein Database
- Human Proteome Organization (HUPO)
30. Mining databases for data interpretation: Example 1
31. Mining databases for data interpretation: Example 1
32. Mining databases for data interpretation: Example 2
33. Mining databases for data interpretation: Example 2
34. Classification of interpreted data: subcellular localization
35. Classification of interpreted data: functional characterization
36. What about quantitative measurements?
normal -> insult or injury -> dysplasia -> carcinoma
Malignant transformation rate: 5-17%
Can we find molecular markers that predict this transition?
(adapted from Dr. Nelson Rhodus, U of M Dental School)
37. Stable-isotope labeling of proteins for quantitative profiling
20 vs. 37
- L (light) and H (heavy) labels are chemically identical, but isotopically different due to incorporation of stable isotopes (e.g. 2H, 15N, 13C)
- Chemically identical but isotopically different peptides ionize with the same efficiency and act as mutual internal standards
38. Quantitative analysis of mRNA data
DeRisi et al. (1997) Science 278, 680
39. Automated Quantitative Proteomics
mixture 1 (light) + mixture 2 (heavy) -> combine and proteolyze -> multi-dimensional separation -> mass analysis
Quantify: [Figure: MS spectrum with light and heavy peaks, m/z 550 to 580]
Identify (MS/MS): [Figure: MS/MS spectrum of NH2-EACDPLR-COOH, m/z 0 to 800]
40. Quantitative analysis
Relative intensity = relative protein abundance
Sample 1 vs. Sample 2
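The relation between relative intensity and relative abundance can be sketched numerically. This example assumes that each peptide yields one light and one heavy peak intensity, that the heavy label marks Sample 2, and that a protein-level value is summarized as the median log2 ratio of its peptides; the summary choice, the intensities, and the function names are assumptions for illustration.

```python
import math
from statistics import median


def abundance_ratio(light_intensity, heavy_intensity):
    """Heavy/light intensity ratio, read as Sample 2 relative to Sample 1."""
    if light_intensity <= 0 or heavy_intensity <= 0:
        raise ValueError("intensities must be positive")
    return heavy_intensity / light_intensity


def protein_log2_ratio(peptide_pairs):
    """Median log2(heavy/light) over all (light, heavy) peptide pairs of a protein."""
    return median(math.log2(abundance_ratio(l, h)) for l, h in peptide_pairs)


# Invented (light, heavy) peak intensities for three peptides of one protein.
toy_pairs = [(100.0, 200.0), (50.0, 100.0), (80.0, 180.0)]
print(f"protein log2 ratio: {protein_log2_ratio(toy_pairs):.2f}")
```

Working in log2 keeps up- and down-regulation symmetric around zero, and the median limits the influence of a single poorly measured peptide.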
41. Disease proteomics: androgen-induced effects in prostate cancer
42. Dealing with the data
Data acquisition
Raw data processing (Database searching)
Analysis of processed data (Statistical
filtering, quantitative analysis)
Data organization and interpretation
Archiving and databasing
Modeling (Computational Biology)
43. Need for better data archives and repositories
http://proteomics.jhu.edu/dl/pathidb.php
44. Archiving challenges: different data formats
http://sashimi.sourceforge.net/software_glossolalia.html
45. Computational Biology: integrating proteomics and genomics data
[Figure: control of eukaryotic gene expression]
Nucleus: DNA -> (transcriptional control) -> primary RNA transcript -> (RNA processing control) -> mRNA
Nucleus to cytosol: mRNA -> (RNA transport control) -> mRNA
Cytosol: mRNA -> (translational control) -> protein, or inactive mRNA
Cytosol: protein -> (protein activity control) -> active protein or inactive protein
46. Integrating proteomics and genomics data: elucidating gene expression regulatory networks
Griffin TJ et al. (2002) Mol Cell Proteomics 1, 323
47. Post-transcriptionally regulated proteins?
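One simple way to shortlist candidates for post-transcriptional regulation is to look for genes whose protein abundance changes while the mRNA level does not. The sketch below assumes paired log2 ratios per gene and a single cutoff of 1.0 (a two-fold change); the gene names, values, cutoff, and `flag_discordant` are all invented for illustration and do not reproduce the analysis in the cited study.

```python
def flag_discordant(ratios, log2_cutoff=1.0):
    """Return genes whose protein level changes while mRNA stays within the cutoff.

    `ratios` maps gene name -> (mrna_log2_ratio, protein_log2_ratio).
    """
    flagged = []
    for gene, (mrna_log2, protein_log2) in ratios.items():
        protein_changed = abs(protein_log2) >= log2_cutoff
        mrna_unchanged = abs(mrna_log2) < log2_cutoff
        if protein_changed and mrna_unchanged:
            flagged.append(gene)
    return sorted(flagged)


# Invented measurements: (mRNA log2 ratio, protein log2 ratio).
toy_ratios = {
    "geneA": (0.1, 1.8),    # protein up, mRNA flat
    "geneB": (2.0, 2.1),    # both up: consistent with transcriptional control
    "geneC": (-0.2, -1.5),  # protein down, mRNA flat
    "geneD": (0.0, 0.1),    # no change
}

print(flag_discordant(toy_ratios))
```

A discordant pair is only a candidate: measurement noise in either dataset can produce the same pattern, so flagged genes need replicate measurements or follow-up experiments.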
48. Computational biology: integrating information to assign function
Cytoscape: http://www.cytoscape.org/
49. Modeling cellular circuitry based on genomic and proteomic data
50. Is the virtual human on the horizon???
51. Acknowledgements
Griffin Laboratory: Mikel Roe, Sri Bandhakavi, Hongwei Xie, Clive Nyauncho
U of M Dental School: Dr. Nelson Rhodus
MSI: Patton Fast
University of Minnesota
Funding: Minnesota Medical Foundation, NIH