NCI Proteome Informatics - PowerPoint PPT Presentation

1
NCI Proteome Informatics
  • David J. States, M.D., Ph.D.
  • February 8, 2005

2
Genomics vs. Proteomics
  • Genome sequence
  • One copy of each gene per genome
  • Duplicated genes
  • Diploid or higher ploidy
  • No tissue variation
  • Somatic cell recombination
  • Modification not an issue
  • Methylation
  • Few sample handling issues
  • DNA is DNA is DNA
  • Very stable
  • Proteome
  • Wide range of expression levels
  • 10-12 orders of magnitude
  • Quantification is a goal
  • Tissue specific variation
  • Plasma proteome
  • Post translational processing
  • Many known modifications
  • Potentially novel chemistry
  • Sample processing matters
  • Serum vs. plasma
  • Many protocols

3
Errors and Uncertainty: Lessons from Genomics
  • GenBank before there were errors
  • "My lab would never submit an erroneous
    sequence"
  • Krawetz 1987 survey: 1 in 300 nt later revised
    or retracted
  • Developing a framework
  • Identifying error processes
  • Computational representation
  • Quantitative error analysis
  • Setting standards
  • Data use -> accuracy requirements
  • Cost vs. accuracy analysis
  • Validating lab performance

4
Error Processes in Sequencing
  • Genomic DNA -> Subclone -> Sanger sequencing -> Sequence assembly
  • Polymorphism
  • Degradation
  • Shearing, cross linking, oxidation, depurination
  • Polymerase fidelity
  • cDNA
  • Clonal instability
  • Repeats, poison sequences, rearrangements
  • Lane tracking
  • pre-capillary
  • Base calling
  • Assembly errors

5
Setting Standards
  • Applications drive accuracy requirements
  • Homolog identification
  • Very error tolerant (evolution as an error
    process)
  • Translation of nt to aa sequence
  • Frameshift and nonsense errors rare in an ORF
  • 1 kb reading frame -> 1/10 kb error rate
  • Polymorphism identification
  • Error rate substantially below the polymorphism
    rate
  • Bermuda Quality
  • Standard
  • No more than one substitution per 10 kb
  • No unclosed gaps
  • No more than 1% of the sequence in gaps
  • Arrived at by community consensus conference
  • Does not meet all needs
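
The link between reading-frame length and error rate can be made concrete with a short calculation. The sketch below assumes errors are independent and occur at a constant per-base rate, which the slide does not state; the function name is illustrative.

```python
def prob_error_free(length_nt: int, error_rate_per_nt: float) -> float:
    """Probability that a stretch of length_nt bases contains no error.

    Assumes independent errors at a constant per-base rate (an assumption
    made for this sketch, not a claim from the slides).
    """
    return (1.0 - error_rate_per_nt) ** length_nt


# A 1 kb reading frame at the 1-in-10-kb error rate mentioned on the slide.
p = prob_error_free(1_000, 1 / 10_000)
print(f"P(1 kb reading frame is error free) = {p:.3f}")
```

Under these assumptions roughly nine in ten 1 kb reading frames would be free of errors at a 1/10 kb rate.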

6
Cost vs. Accuracy in DNA Sequencing
7
Quality Assurance Exercises
  • CRADA Funding Mechanism
  • Contract with deliverables
  • Supervision by NIH staff
  • Blind resequencing of test samples
  • Until you have done a megabase of sequence, you
    can not really estimate error rates at 1/10kb
  • 8 labs -> 3 centers
  • Major technology choices locked in
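
The remark about needing a megabase of sequence can be illustrated with counting statistics. This is a rough sketch that treats the number of detected errors as a Poisson count, so its relative standard error is 1/sqrt(expected count); that model and the function names are assumptions for illustration.

```python
import math


def expected_errors(bases_sequenced: int, error_rate_per_nt: float) -> float:
    """Expected number of errors in the resequenced bases."""
    return bases_sequenced * error_rate_per_nt


def relative_uncertainty(bases_sequenced: int, error_rate_per_nt: float) -> float:
    """Relative standard error of a Poisson count: 1 / sqrt(expected count)."""
    return 1.0 / math.sqrt(expected_errors(bases_sequenced, error_rate_per_nt))


# Compare 100 kb and 1 Mb of blind resequencing at a 1/10 kb error rate.
for bases in (100_000, 1_000_000):
    n = expected_errors(bases, 1 / 10_000)
    rel = relative_uncertainty(bases, 1 / 10_000)
    print(f"{bases:>9} nt: about {n:.0f} expected errors, +/- {rel:.0%}")
```

With 100 kb only about ten errors are expected, so the estimated rate is uncertain by roughly a third; a megabase brings that to about ten percent.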

8
Proteomics Accuracy Requirements
  • Identification
  • Reproducibility
  • 100% lab-to-lab reproducibility is not necessary
  • Needs to be achievable with reasonable resources
  • Accuracy
  • Paralog and splice isoform issues
  • Biomarkers
  • Biological chemistry
  • Characterization
  • Post translational modifications
  • Nature
  • location
  • Quantification
  • Application specific, Issues familiar

9
Abundance vs. Detection
10
Significant But Irreproducible Data
  • Identifications may be highly significant even if
    not reproducible
  • Sampling issues in the instrumentation
  • Biochemical handling
  • Proposal: reproducibility within the original lab
  • Poisson statistics
  • N_obs ≥ 2 implies λ ≥ 2 (expected number of observations)
  • Another lab reproducing your experiment has a
    high probability of confirming your observation
    if they are willing to put in comparable effort
  • P > 97% at twice your effort
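
The 97% figure is consistent with a simple Poisson calculation. The sketch below assumes that two observations in the original lab are taken as the expected count, and that the expected count scales linearly with effort; both are assumptions for illustration, and the function name is hypothetical.

```python
import math


def confirm_probability(expected_count: float, relative_effort: float) -> float:
    """P(at least one observation) for a Poisson process.

    expected_count: expected observations at the original lab's effort.
    relative_effort: effort of the second lab relative to the original.
    Assumes the Poisson mean scales linearly with effort.
    """
    return 1.0 - math.exp(-expected_count * relative_effort)


# Expected count of 2 in the original lab, second lab at twice the effort.
print(f"{confirm_probability(2.0, 2.0):.3f}")
```

At twice the effort the expected count is 4, and the probability of at least one confirming observation is 1 - exp(-4), which is just above 0.98.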

11
Identification Five Levels
  • Member of a gene family
  • A peptide matching only in this gene family
  • Often biologically useful information
  • Gene product (transcription unit)
  • Multiple peptide matches
  • Complete genome sequence available
  • Peptide matches that are diagnostic between
    paralogs
  • Post translational modification
  • Modifications defined by number, type and
    location
  • Varying levels of precision (e.g. residue vs.
    peptide level assignment)
  • Transcriptional/splice variant
  • Multiple peptide matches including matches
    defining the N and C termini
  • Peptide matches that are diagnostic between
    paralogs
  • Peptide matches or molecular weight data
    diagnostic for splice isoform
  • Complete covalent structure
  • Covering set of peptide matches
  • Covering set of MS/MS data on all peptides

12
Protein Integration Workflow
13
Bacterial Proteins in Normal Blood
Number of peptides  RefSeq ID    Description
6                   NP_417798.1  Elongation factor EF-Tu E. coli...
5                   NP_415477.1  Outer membrane protein 3a E. coli K12
4                   NP_416010.1  Glutamate decarboxylase isozyme E. coli K12
2                   NP_751975.1  Chaperone protein dnaK E. coli CFT073
2                   NP_415333.1  DNA protection during starvation E. coli K12
2                   NP_756310.1  Lipid A-core, surface polymer ligase E. coli...
2                   NP_418165.1  Low affinity tryptophan permease E. coli K12
2                   NP_416192.1  Murein lipoprotein E. coli K12
2                   NP_418169.1  Hypothetical protein b3713 E. coli K12
2                   NP_288226.1  Putative AraC-type regulatory protein E. coli...
14
Issues in Project Coordination
  • Multiple formats permitted for submissions
  • XML
  • Pedro data definitions
  • Only 2 labs used XML: local informatics
    resources
  • Web based submission of Excel templates
  • Multiple versions as the project evolved
  • Email submission of Excel templates
  • Initial suggestion
  • Technical concerns, but worked OK
  • Some revised submissions: Core had to clarify
    whether data were new or revised

15
Informatics Goals and Mission
  • Guide and Coordinate Activities of the Eastern
    Consortium
  • Sample tracking between labs
  • Data interchange between consortium labs
  • Record of project deliverables
  • Support beyond the active project period is an
    issue
  • Support for project-wide, unanticipated and post
    hoc analysis
  • Data interchange and dissemination
  • Between the Eastern and Western Consortia
  • With the scientific community
  • Archival storage???

16
Division of Responsibilities
  • Laboratory
  • Detailed laboratory records
  • Assignment of internal lab identifiers
  • Maintenance of association between lab and
    project wide identifiers
  • Informatics Core
  • Details appropriate for publication
  • Assignment of identifiers for
  • Samples generating deliverable data
  • Samples exchanged between laboratories

17
Informatics Overview
18
Categories of Data
  • Three levels of output data need to be provided
    to the Informatics Core for each assay
  • Raw data files
  • Valuable for reanalysis
  • Likely to be instrument vendor specific
  • Extracted data files
  • Machine and vendor-independent format files
  • Describing the data produced by an assay
  • tandem mass spectroscopy -> peak lists in .dta or
    .mgf format and search output files
  • gene expression analysis -> complete lists of
    gene expression levels
  • Analysis results
  • The high level conclusions derived from an assay
  • mass spectroscopy -> protein identifications with
    the supporting peptide sequence lists and
    associated explicit search criteria and filters
  • gene expression -> tables of genes that exhibit
    significant changes in expression with the fold
    change and confidence ranges
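
A submission check for the three data levels could look like the following. This is a minimal sketch; the class name, field names and file names are hypothetical and are not taken from caLIMS or any consortium specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssaySubmission:
    """Files submitted for one assay, grouped by the three data levels."""
    assay_id: str
    raw_files: List[str] = field(default_factory=list)
    extracted_files: List[str] = field(default_factory=list)
    analysis_files: List[str] = field(default_factory=list)

    def missing_levels(self) -> List[str]:
        """Names of the data levels for which no file was provided."""
        levels = {
            "raw": self.raw_files,
            "extracted": self.extracted_files,
            "analysis": self.analysis_files,
        }
        return [name for name, files in levels.items() if not files]


# Hypothetical submission that lacks analysis results.
submission = AssaySubmission("ASSAY-001", ["run1.raw"], ["run1.mgf"])
print(submission.missing_levels())
```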

19
Choice of caLIMS
  • Data definitions based on the NCI caLIMS
    Laboratory Information System (LIMS)
  • Many LIMS both commercial and academic
  • caLIMS is designed for molecular biology
  • caLIMS is an NCI supported project, implemented
    by SAIC and integrated with caBIG
  • caLIMS Technical Overview: http://calims.nci.nih.gov/developers/
  • Data definitions provide a common vocabulary
  • process vs. protocol vs. project

20
Adapting caLIMS to the MMC
  • caLIMS is generic
  • No explicit support for proteomics or genetics
  • Our focus is on the data definitions and entity
    relationships
  • Choice of whether to implement a local LIMS and
    whether to use caLIMS forms and interfaces within
    the lab is entirely up to the lab
  • MMC vs. PPP
  • Similar in focus on analysis of blood proteins
  • MMC will have many more samples and time
    dependent experimental series

21
All Consortium Samples
  • Unique identifier
  • assigned by the Informatics Core
  • Mouse model
  • Sample type
  • Parent
  • Sample or animal from which this specimen was
    derived
  • Pooled samples are a special case, details off
    line
  • Protocol used to obtain the sample
  • Date
  • Optional
  • see additional fields in the caLIMS SAMPLE table
    below
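
The sample fields above, including the parent link, can be sketched as a small record type. Field names and identifiers here are hypothetical and do not reproduce the caLIMS SAMPLE table; pooled samples, which have several parents, are deliberately not handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Sample:
    sample_id: str                     # assigned by the Informatics Core
    mouse_model: str
    sample_type: str
    protocol_id: str                   # protocol used to obtain the sample
    parent_id: Optional[str] = None    # sample or animal of origin
    collected_on: Optional[date] = None


def lineage(sample_id: str, registry: Dict[str, Sample]) -> List[str]:
    """Follow parent links from a sample back to its origin.

    A parent that is not in the registry (for example an animal
    identifier) ends the chain and is included as the last element.
    """
    chain = [sample_id]
    seen = {sample_id}
    current = registry.get(sample_id)
    while current is not None and current.parent_id is not None:
        if current.parent_id in seen:
            raise ValueError(f"cycle in parent links at {current.parent_id}")
        chain.append(current.parent_id)
        seen.add(current.parent_id)
        current = registry.get(current.parent_id)
    return chain
```

For example, an aliquot whose parent is a plasma sample drawn from an animal would yield a three-element chain ending in the animal identifier.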

22
All Consortium Protocols
  • Unique identifier
  • assigned by the Informatics Core when submission
    received
  • Descriptive title
  • Text description
  • Optional
  • see additional fields in the caLIMS PROTOCOL
    table below

23
Animal Models
  • Specific model (of several in each lab/each tumor
    type)
  • Unique identifier
  • Tracking and relationship of samples from the
    same animal
  • Background strain
  • Genetic manipulations
  • Age
  • Treatment protocols
  • Xenograft information
  • Species/cell line
  • Site of implant
  • Time since implant
  • Drugs and exogenous agents
  • Agent
  • Method and site of administration
  • Dosage
  • Time since treatment

24
Tissue and Cell Line Samples
  • Identifier for the animal of origin
  • Treat cell lines as animal lines for informatics
    purposes
  • Anatomic origin
  • Using GO and Jackson Labs nomenclature
  • Protocol used to obtain the sample
  • specific protocol identifier and an associated
    text description of the protocol
  • Any pooling or aliquoting performed

25
Antibodies
  • Reagent identifier
  • Species
  • Monoclonal vs. Polyclonal
  • Immunogen
  • If the immunogen was derived from a project
    specimen, the specimen identifier
  • Purification or fractionation
  • Conjugation or derivatization
  • Characterization information
  • Specific ligand affinities if known
  • Qualitative assessment of suitability for western
    blotting, immunohistochemistry,
    immunoprecipitation
  • Cross reactivity if known

26
Proteomics Sample Processing
  • Depletion of abundant proteins
  • Class
  • Agilent column or spin cartridge
  • Albumin and IgG only
  • IgY antibodies
  • Dye or other affinity agents
  • Protocol identifier
  • Protein fractionation
  • Labeling (fluorescent dyes, isobaric tags)
  • Class
  • Electrophoresis vs. chromatography
  • Charge vs. size vs. hydrophobicity
  • Information about the fraction
  • Approximate molecular weight/pI
  • Protocol identifier

27
Proteomics Sample Processing
  • Cleavage reaction (digestion)
  • Enzyme or reaction used
  • Specific cleavage protocol identifier
  • Pattern describing the cleavage site
  • Protocol identifier
  • Peptide fractionation
  • Class
  • Electrophoresis vs. chromatography
  • Charge vs. size vs. hydrophobicity
  • Information about the fraction
  • Approximate molecular weight/pI
  • Protocol identifier
  • Derivatization
  • Chemical reactions performed to prepare the
    sample for analysis
  • Blocking groups
  • Conversion of lysine to homoarginine
  • Sulfhydryl modifications
  • Other.
  • Protocol identifier
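
A "pattern describing the cleavage site" can be stored as a regular expression and applied directly. The sketch below uses a commonly cited trypsin-like rule (cleave after K or R unless the next residue is P) purely as an illustration; the slides do not name an enzyme, and the function name is hypothetical. Splitting on a zero-width pattern requires Python 3.7 or later.

```python
import re
from typing import List

# Illustrative rule: cleave after K or R, but not when followed by P.
TRYPSIN_LIKE = r"(?<=[KR])(?!P)"


def digest(sequence: str, cleavage_pattern: str = TRYPSIN_LIKE) -> List[str]:
    """Split a protein sequence at every zero-width match of the pattern.

    Empty fragments (for example after a terminal K or R) are dropped.
    Missed cleavages are not modeled.
    """
    return [pep for pep in re.split(cleavage_pattern, sequence) if pep]


print(digest("AAKRPGGRCC"))
```

Storing the pattern alongside the protocol identifier lets the same cleavage rule be reused when search results are reanalyzed.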

28
Mass Spectroscopy
  • Sample(s) being analyzed
  • Type of experiment (MALDI/MS vs. ESI-MS/MS, etc.)
  • Type of instrument used/specific model and any
    modifications
  • Protocol identifier
  • Data sets
  • Raw data files
  • Extracted data
  • Peaklists
  • Search output files (Mascot, Sequest, etc.) with
    explicit criteria/filters
  • Analysis
  • Protein identifications
  • Database(s) used in the search
  • Analysis protocol identifier
  • Scores for the overall protein identification
  • Peptides used to make the identification and
    individual peptide scores
  • Quantification
  • Analysis protocol identifier
  • Relative or absolute units
  • Probability estimates for correct and erroneous
    IDs
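
Peak lists in a vendor-independent text format can be read with a few lines of code. This is a minimal sketch of a reader for the common BEGIN IONS / END IONS layout of .mgf files; it ignores everything outside ion blocks, keeps parameter values as strings, and is not a complete implementation of the format.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict]:
    """Parse MGF text into a list of spectra.

    Each spectrum is a dict with 'params' (KEY=value lines as strings)
    and 'peaks' (list of (m/z, intensity) tuples).
    """
    spectra: List[Dict] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                key, value = line.split("=", 1)
                current["params"][key.strip()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra
```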

29
Multi-Dimensional Fractionation
  • PF2D, FFE, IPAS, etc.
  • Sample(s) analyzed and labeling/derivatization
  • Protocol identifier, including pooling of
    fractions
  • Data sets
  • Raw data file
  • Extracted data
  • Virtual 2-D gel
  • 1-D chromatograms
  • Analysis
  • Identification
  • See MS (above)
  • Tracking information needed to relate protein
    identifications back to PF2D fractions
  • Quantification
  • Analysis protocol identifier
  • Relative or absolute units
  • Error estimates

30
Protein Arrays
  • Sample analyzed and labeling/derivatization
  • Protocol identifier
  • Data sets
  • Raw data files
  • Array images
  • Extracted data
  • Spot locations and peak intensities
  • Spot assignments
  • what was spotted where
  • replicate information
  • Analysis
  • Quantification, including normalization
  • Analysis protocol identifier
  • Relative or absolute units
  • Error estimates

31
Assay Representation
  • Project
  • Consortium wide
  • May involve one or several laboratories
  • Protocol
  • Description of a procedure
  • May be lab specific or used in several labs
  • May be used in one or several projects
  • Sample
  • Specific instance of an experimental sample
  • Sample_type table describes classes of samples
  • Consortium wide identification
  • One sample may be used in one or several labs and
    projects
  • Assay
  • Specific instance of an experiment
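
The relationships among project, protocol, sample and assay can be sketched as plain records linked by identifiers. All class names, field names and identifiers below are hypothetical; they illustrate the entity relationships and do not reproduce the caLIMS schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Project:
    project_id: str
    title: str
    lab_ids: List[str] = field(default_factory=list)  # one or several labs


@dataclass
class Protocol:
    protocol_id: str
    title: str
    description: str = ""


@dataclass
class Assay:
    """A specific instance of an experiment."""
    assay_id: str
    project_id: str
    protocol_id: str
    lab_id: str
    sample_ids: List[str] = field(default_factory=list)


def assays_for_sample(sample_id: str, assays: List[Assay]) -> List[str]:
    """Identifiers of every assay that used the given sample."""
    return [a.assay_id for a in assays if sample_id in a.sample_ids]


assays = [
    Assay("A1", "PRJ-1", "PROT-MS", "LAB-1", ["S1", "S2"]),
    Assay("A2", "PRJ-1", "PROT-ARRAY", "LAB-2", ["S2"]),
]
print(assays_for_sample("S2", assays))
```

Keeping assays as the link between samples, protocols and projects allows one sample to appear in several labs and projects without duplicating its record.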

32
Summary
  • Proteomics is still in the early days
  • Need to better define error processes and
    accuracy requirements
  • Informatics support in the labs limited
  • Dangers in too rigid an approach to QC
  • Project coordination
  • Division of labor between labs and the consortium
    data center
  • Define deliverables early and inclusively
  • Archive of data at multiple levels
  • raw, processed, analyzed