Title: NCI Proteome Informatics
1NCI Proteome Informatics
- David J. States, M.D., Ph.D.
- February 8, 2005
2Genomics vs. Proteomics
- Genome sequence
- One copy of each gene per genome
- Duplicated genes
- Diploid or higher ploidy
- No tissue variation
- Somatic cell recombination
- Modification not an issue
- Methylation
- Few sample handling issues
- DNA is DNA is DNA
- Very stable
- Proteome
- Wide range of expression levels
- 10-12 orders of magnitude
- Quantification is a goal
- Tissue specific variation
- Plasma proteome
- Post translational processing
- Many known modifications
- Potentially novel chemistry
- Sample processing matters
- Serum vs. plasma
- Many protocols
3Errors and Uncertainty: Lessons from Genomics
- GenBank before there were errors
- "My lab would never submit an erroneous sequence"
- Krawetz 1987 survey: 1 in 300 nt later revised or retracted
- Developing a framework
- Identifying error processes
- Computational representation
- Quantitative error analysis
- Setting standards
- Data use -> accuracy requirements
- Cost vs. accuracy analysis
- Validating lab performance
4Error Processes in Sequencing
- Genomic DNA -> Subclone -> Sanger sequencing -> Sequence assembly
- Polymorphism
- Degradation
- Shearing, cross linking, oxidation, depurination
- Polymerase fidelity
- cDNA
- Clonal instability
- Repeats, poison sequences, rearrangements
- Lane tracking
- pre-capillary
- Base calling
- Assembly errors
5Setting Standards
- Applications drive accuracy requirements
- Homolog identification
- Very error tolerant (evolution as an error process)
- Translation of nt to aa sequence
- Frameshift and nonsense errors rare in an ORF
- 1 kb reading frame -> 1/10 kb error rate
- Polymorphism identification
- Error rate substantially below the polymorphism rate
- Bermuda Quality Standard
- No more than one substitution per 10 kb
- No unclosed gaps
- No more than 1% of the sequence in gaps
- Arrived at by community consensus conference
- Does not meet all needs
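The link between reading-frame length and the required error rate can be illustrated with a short calculation. This is a minimal sketch, assuming independent per-base errors and counting every error as disruptive (the slide is concerned with frameshift and nonsense errors specifically, which are a subset); the function name is illustrative only.

```python
def prob_error_free(length_nt: int, error_rate: float) -> float:
    """Probability that a stretch of length_nt bases contains no errors.

    Assumes each base is wrong independently with probability error_rate.
    """
    return (1.0 - error_rate) ** length_nt


if __name__ == "__main__":
    # Compare the 1-in-300 rate from the Krawetz survey with 1 in 10 kb.
    for denominator in (300, 1_000, 10_000):
        p = prob_error_free(1_000, 1.0 / denominator)
        print(f"1 error per {denominator} nt: P(1 kb frame is error-free) = {p:.3f}")
```

Under these assumptions, a 1 kb reading frame is very likely to be intact at 1 error per 10 kb and very unlikely to be intact at 1 error per 300 nt.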
6Cost vs. Accuracy in DNA Sequencing
7Quality Assurance Exercises
- CRADA Funding Mechanism
- Contract with deliverables
- Supervision by NIH staff
- Blind resequencing of test samples
- Until you have done a megabase of sequence, you cannot really estimate error rates at 1/10 kb
- 8 Labs -> 3 Centers
- Major technology choices locked in
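The megabase remark can be checked with a back-of-the-envelope estimate. The sketch below assumes errors found by blind resequencing follow a Poisson distribution, so a count with mean m has a relative standard deviation of 1/sqrt(m); the function name and the chosen sample sizes are illustrative.

```python
import math


def relative_uncertainty(bases_resequenced: int, true_error_rate: float) -> float:
    """Relative standard deviation of an error-rate estimate.

    Assumes the number of observed errors is Poisson distributed with
    mean bases_resequenced * true_error_rate.
    """
    expected_errors = bases_resequenced * true_error_rate
    return 1.0 / math.sqrt(expected_errors)


if __name__ == "__main__":
    for bases in (10_000, 100_000, 1_000_000):
        rel = relative_uncertainty(bases, 1e-4)
        print(f"{bases:>9} bases resequenced: relative uncertainty = {rel:.0%}")
```

At a true rate of 1 error per 10 kb, 10 kb of resequencing yields about one expected error, which says almost nothing about the rate, whereas a megabase yields about 100 expected errors.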
8Proteomics Accuracy Requirements
- Identification
- Reproducibility
- 100% lab-to-lab reproducibility is not necessary
- Needs to be achievable with reasonable resources
- Accuracy
- Paralog and splice isoform issues
- Biomarkers
- Biological chemistry
- Characterization
- Post translational modifications
- Nature
- location
- Quantification
- Application specific, Issues familiar
9Abundance vs. Detection
10Significant But Irreproducible Data
- Identifications may be highly significant even if not reproducible
- Sampling issues in the instrumentation
- Biochemical handling
- Proposal: reproducibility within the original lab
- Poisson statistics
- N_obs >= 2 implies lambda >= 2
- Another lab reproducing your experiment has a high probability of confirming your observation if they are willing to put in comparable effort
- P > 97% at twice your effort
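The Poisson argument can be made concrete with a small calculation. This is a sketch under stated assumptions that are not spelled out on the slide: the original observation corresponds to an expected count of about 2, the expected count scales linearly with effort, and "confirming" means seeing the peptide at least once. The function names are hypothetical.

```python
import math


def prob_at_least(k: int, mean: float) -> float:
    """P(N >= k) for a Poisson-distributed count N with the given mean."""
    cdf_below_k = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k


def prob_confirmation(original_mean: float, relative_effort: float, min_count: int = 1) -> float:
    """Chance that a second lab sees at least min_count observations.

    Assumes the expected count grows in proportion to the effort invested.
    """
    return prob_at_least(min_count, original_mean * relative_effort)


if __name__ == "__main__":
    for effort in (1.0, 2.0):
        p = prob_confirmation(2.0, effort)
        print(f"effort x{effort:g}: P(at least one observation) = {p:.3f}")
```

With these assumptions, twice the original effort gives an expected count of 4 and a confirmation probability above 97%, consistent with the figure on the slide.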
11Identification: Five Levels
- Member of a gene family
- A peptide matching only in this gene family
- Often useful biological information
- Gene product (transcription unit)
- Multiple peptide matches
- Complete genome sequence available
- Peptide matches that are diagnostic between paralogs
- Post translational modification
- Modifications defined by number, type and location
- Varying levels of precision (e.g. residue vs. peptide level assignment)
- Transcriptional/splice variant
- Multiple peptide matches including matches defining the N and C termini
- Peptide matches that are diagnostic between paralogs
- Peptide matches or molecular weight data diagnostic for splice isoform
- Complete covalent structure
- Covering set of peptide matches
- Covering set of MS/MS data on all peptides
12Protein Integration Workflow
13Bacterial Proteins in Normal Blood
Number of peptides  RefSeq ID    Description
6                   NP_417798.1  Elongation factor EF-Tu E. coli...
5                   NP_415477.1  Outer membrane protein 3a E. coli K12
4                   NP_416010.1  Glutamate decarboxylase isozyme E. coli K12
2                   NP_751975.1  Chaperone protein dnaK E. coli CFT073
2                   NP_415333.1  DNA protection during starvation E. coli K12
2                   NP_756310.1  Lipid A-core, surface polymer ligase E. coli...
2                   NP_418165.1  Low affinity tryptophan permease E. coli K12
2                   NP_416192.1  Murein lipoprotein E. coli K12
2                   NP_418169.1  Hypothetical protein b3713 E. coli K12
2                   NP_288226.1  Putative AraC-type regulatory protein E. coli...
14Issues in Project Coordination
- Multiple formats permitted for submissions
- XML
- Pedro data definitions
- Only 2 labs used XML: local informatics resources
- Web based submission of Excel templates
- Multiple versions as the project evolved
- Email submission of Excel templates
- Initial suggestion
- Technical concerns, but worked OK
- Some revised submissions: Core had to clarify whether new data or revised data
15Informatics Goals and Mission
- Guide and Coordinate Activities of the Eastern Consortium
- Sample tracking between labs
- Data interchange between consortium labs
- Record of project deliverables
- Support beyond the active project period is an issue
- Support for project-wide, unanticipated and post hoc analysis
- Data interchange and dissemination
- Between the Eastern and Western Consortia
- With the scientific community
- Archival storage???
16Division of Responsibilities
- Laboratory
- Detailed laboratory records
- Assignment of internal lab identifiers
- Maintenance of association between lab and project wide identifiers
- Informatics Core
- Details appropriate for publication
- Assignment of identifiers for
- Samples generating deliverable data
- Samples exchanged between laboratories
17Informatics Overview
18Categories of Data
- Three levels of output data need to be provided to the Informatics Core for each assay
- Raw data files
- Valuable for reanalysis
- Likely to be instrument vendor specific
- Extracted data files
- Machine and vendor-independent format files
- Describing the data produced by an assay
- tandem mass spectroscopy -> peak lists in .dta or .mgf format and search output files
- gene expression analysis -> complete lists of gene expression levels
- Analysis results
- The high level conclusions derived from an assay
- mass spectroscopy -> protein identifications with the supporting peptide sequence lists and associated explicit search criteria and filters
- gene expression -> tables of genes that exhibit significant changes in expression with the fold change and confidence ranges
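One way to picture the three required levels is as a single submission record per assay that can be checked for completeness. The sketch below is illustrative only: the class name, field names and file names are hypothetical and are not taken from the consortium's data definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssaySubmission:
    """Hypothetical record of the files submitted for one assay."""

    assay_id: str
    raw_files: List[str] = field(default_factory=list)        # vendor-specific instrument output
    extracted_files: List[str] = field(default_factory=list)  # e.g. .dta or .mgf peak lists
    analysis_files: List[str] = field(default_factory=list)   # e.g. protein identification tables

    def missing_levels(self) -> List[str]:
        """Return the names of the data levels that have no files yet."""
        levels = {
            "raw": self.raw_files,
            "extracted": self.extracted_files,
            "analysis": self.analysis_files,
        }
        return [name for name, files in levels.items() if not files]


if __name__ == "__main__":
    submission = AssaySubmission(
        assay_id="assay-001",
        raw_files=["run1.raw"],
        extracted_files=["run1.mgf"],
    )
    print("Missing levels:", submission.missing_levels())
```

A check of this kind could be run when a submission arrives, so that a missing level is flagged before the data are archived.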
19Choice of caLIMS
- Data definitions based on the NCI caLIMS Laboratory Information System (LIMS)
- Many LIMS both commercial and academic
- caLIMS is designed for molecular biology
- caLIMS is an NCI supported project, implemented by SAIC and integrated with caBIG
- caLIMS Technical Overview http://calims.nci.nih.gov/developers/
- Data definitions provide a common vocabulary
- process vs. protocol vs. project
20Adapting caLIMS to the MMC
- caLIMS is generic
- No explicit support for proteomics or genetics
- Our focus is on the data definitions and entity relationships
- Choice of whether to implement a local LIMS and whether to use caLIMS forms and interfaces within the lab is entirely up to the lab
- MMC vs. PPP
- Similar in focus on analysis of blood proteins
- MMC will have many more samples and time dependent experimental series
21All Consortium Samples
- Unique identifier
- assigned by the Informatics Core
- Mouse model
- Sample type
- Parent
- Sample or animal from which this specimen was derived
- Pooled samples are a special case, details off line
- Protocol used to obtain the sample
- Date
- Optional
- see additional fields in the caLIMS SAMPLE table below
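The sample fields listed above, in particular the parent link, can be sketched as a small record type with a helper that walks back from a specimen to the animal it came from. All names here are hypothetical and do not reproduce the caLIMS SAMPLE table; pooled samples, which have several parents, are deliberately left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Sample:
    """Hypothetical consortium sample record (single parent only)."""

    sample_id: str            # assigned by the Informatics Core
    mouse_model: str
    sample_type: str
    protocol_id: str          # protocol used to obtain the sample
    collected: date
    parent_id: Optional[str] = None  # parent sample or animal identifier


def lineage(sample_id: str, registry: Dict[str, Sample]) -> List[str]:
    """Follow parent links from a sample back to its origin.

    An identifier that is not in the registry (for example an animal)
    ends the chain. A cycle in the parent links raises ValueError.
    """
    chain: List[str] = []
    current: Optional[str] = sample_id
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle in parent links at {current}")
        chain.append(current)
        record = registry.get(current)
        current = record.parent_id if record else None
    return chain


if __name__ == "__main__":
    registry = {
        "S1": Sample("S1", "model-A", "plasma", "P1", date(2005, 1, 10), parent_id="M1"),
        "S2": Sample("S2", "model-A", "depleted plasma", "P2", date(2005, 1, 11), parent_id="S1"),
    }
    print(" <- ".join(lineage("S2", registry)))
```

Keeping the parent as a plain identifier lets one table cover both sample-to-sample and sample-to-animal derivation.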
22All Consortium Protocols
- Unique identifier
- assigned by the Informatics Core when submission received
- Descriptive title
- Text description
- Optional
- see additional fields in the caLIMS PROTOCOL table below
23Animal Models
- Specific model (of several in each lab/each tumor type)
- Unique identifier
- Tracking and relationship of samples from the same animal
- Background strain
- Genetic manipulations
- Age
- Treatment protocols
- Xenograft information
- Species/cell line
- Site of implant
- Time since implant
- Drugs and exogenous agents
- Agent
- Method and site of administration
- Dosage
- Time since treatment
24Tissue and Cell Line Samples
- Identifier for the animal of origin
- Treat cell lines as animal lines for informatics purposes
- Anatomic origin
- Using GO and Jackson Labs nomenclature
- Protocol used to obtain the sample
- specific protocol identifier and an associated text description of the protocol
- Any pooling or aliquoting performed
25Antibodies
- Reagent identifier
- Species
- Monoclonal vs. Polyclonal
- Immunogen
- If the immunogen was derived from a project specimen, the specimen identifier
- Purification or fractionation
- Conjugation or derivatization
- Characterization information
- Specific ligand affinities if known
- Qualitative assessment of suitability for western blotting, immunohistochemistry, immunoprecipitation
- Cross reactivity if known
26Proteomics Sample Processing
- Depletion of abundant proteins
- Class
- Agilent column or spin cartridge
- Albumin and IgG only
- IgY antibodies
- Dye or other affinity agents
- Protocol identifier
- Protein fractionation
- Labeling (fluorescent dyes, isobaric tags)
- Class
- Electrophoresis vs. chromatography
- Charge vs. size vs. hydrophobicity
- Information about the fraction
- Approximate molecular weight/pI
- Protocol identifier
27Proteomics Sample Processing
- Cleavage reaction (digestion)
- Enzyme or reaction used
- Specific cleavage protocol identifier
- Pattern describing the cleavage site
- Protocol identifier
- Peptide fractionation
- Class
- Electrophoresis vs. chromatography
- Charge vs. size vs. hydrophobicity
- Information about the fraction
- Approximate molecular weight/pI
- Protocol identifier
- Derivatization
- Chemical reactions performed to prepare the sample for analysis
- Blocking groups
- Conversion of lysine to homoarginine
- Sulfhydryl modifications
- Other.
- Protocol identifier
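The "pattern describing the cleavage site" field lends itself to a regular expression. The sketch below assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) and performs a simple in-silico digestion with no missed cleavages; the constant and function names are illustrative, and the slides do not prescribe a pattern syntax.

```python
import re

# Zero-width pattern: a position preceded by K or R and not followed by P.
# This is the usual textbook rule for trypsin, used here as an assumption.
TRYPSIN_PATTERN = r"(?<=[KR])(?!P)"


def digest(sequence: str, cleavage_pattern: str = TRYPSIN_PATTERN) -> list:
    """Split a protein sequence at every site matching cleavage_pattern.

    Requires Python 3.7+, where re.split accepts patterns that match
    the empty string. Empty fragments are dropped.
    """
    return [fragment for fragment in re.split(cleavage_pattern, sequence) if fragment]


if __name__ == "__main__":
    # Made-up sequence: K at position 3 cleaves, R followed by P does not.
    print(digest("MAKGGRPLLKDE"))
```

Storing the pattern alongside the protocol identifier would let a search be rerun later with the same cleavage rule that was used in the lab.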
28Mass Spectroscopy
- Sample(s) being analyzed
- Type of experiment (MALDI/MS vs. ESI-MS/MS, etc.)
- Type of instrument used/specific model and any modifications
- Protocol identifier
- Data sets
- Raw data files
- Extracted data
- Peaklists
- Search output files (Mascot, Sequest, etc.) with explicit criteria/filters
- Analysis
- Protein identifications
- Database(s) used in the search
- Analysis protocol identifier
- Scores for the overall protein identification
- Peptides used to make the identification and individual peptide scores
- Quantification
- Analysis protocol identifier
- Relative or absolute units
- Probability estimates for correct and erroneous IDs
29Multi-Dimensional Fractionation
- PF2D, FFE, IPAS, etc.
- Sample(s) analyzed and labeling/derivatization
- Protocol identifier, including pooling of fractions
- Data sets
- Raw data file
- Extracted data
- Virtual 2-D gel
- 1-D chromatograms
- Analysis
- Identification
- See MS (above)
- Tracking information needed to relate protein identifications back to PF2D fractions
- Quantification
- Analysis protocol identifier
- Relative or absolute units
- Error estimates
30Protein Arrays
- Sample analyzed and labeling/derivatization
- Protocol identifier
- Data sets
- Raw data files
- Array images
- Extracted data
- Spot locations and peak intensities
- Spot assignments
- what was spotted where
- replicate information
- Analysis
- Quantification, including normalization
- Analysis protocol identifier
- Relative or absolute units
- Error estimates
31Assay Representation
- Project
- Consortium wide
- May involve one or several laboratories
- Protocol
- Description of a procedure
- May be lab specific or used in several labs
- May be used in one or several projects
- Sample
- Specific instance of an experimental sample
- Sample_type table describes classes of samples
- Consortium wide identification
- One sample may be used in one or several labs and projects
- Assay
- Specific instance of an experiment
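The relationships among project, protocol, sample and assay can be sketched as a small relational schema. This is an illustration only: the table and column names below are hypothetical and are not the actual caLIMS schema. It uses an in-memory SQLite database to show that one sample can be referenced by assays in several labs.

```python
import sqlite3

# Hypothetical schema: an assay is one experiment that ties together
# a project, a protocol, a sample and the lab that ran it.
SCHEMA = """
CREATE TABLE project     (project_id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE protocol    (protocol_id TEXT PRIMARY KEY, title TEXT NOT NULL, description TEXT);
CREATE TABLE sample_type (sample_type_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE sample (
    sample_id      TEXT PRIMARY KEY,
    sample_type_id TEXT NOT NULL REFERENCES sample_type(sample_type_id)
);
CREATE TABLE assay (
    assay_id    TEXT PRIMARY KEY,
    project_id  TEXT NOT NULL REFERENCES project(project_id),
    protocol_id TEXT NOT NULL REFERENCES protocol(protocol_id),
    sample_id   TEXT NOT NULL REFERENCES sample(sample_id),
    lab         TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO project VALUES ('PRJ1', 'Demo project')")
    conn.execute("INSERT INTO protocol VALUES ('PROT1', 'Demo digestion', NULL)")
    conn.execute("INSERT INTO sample_type VALUES ('plasma', 'Blood plasma')")
    conn.execute("INSERT INTO sample VALUES ('S1', 'plasma')")
    conn.executemany(
        "INSERT INTO assay VALUES (?, 'PRJ1', 'PROT1', 'S1', ?)",
        [("A1", "LabA"), ("A2", "LabB")],
    )
    conn.commit()
    return conn


def labs_using_sample(conn: sqlite3.Connection, sample_id: str) -> list:
    """List the labs that have run at least one assay on the sample."""
    rows = conn.execute(
        "SELECT DISTINCT lab FROM assay WHERE sample_id = ? ORDER BY lab",
        (sample_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(labs_using_sample(build_demo_db(), "S1"))
```

Putting the lab on the assay rather than on the sample is one way to reflect the point that a sample carries a consortium-wide identifier and may be used in several labs and projects.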
32Summary
- Proteomics is still in the early days
- Need to better define error processes and accuracy requirements
- Informatics support in the labs limited
- Dangers in too rigid an approach to QC
- Project coordination
- Division of labor between labs and the consortium data center
- Define deliverables early and inclusively
- Archive of data at multiple levels
- raw, processed, analyzed