Title: Bioinformatics and Proteomics
1Bioinformatics and Proteomics
- March 28, 2003
- NIH Proteomics Workshop
- Bethesda, MD
- Anastasia Nikolskaya, Ph.D.
- Research Assistant Professor
- Protein Information Resource
- Department of Biochemistry and Molecular Biology
- Georgetown University Medical Center
2Overview
- Role of Bioinformatics/Computational Biology in Proteomics Research
- Genomics
- Functional Annotation of Proteins
- Classification of Proteins
- Bioinformatics Databases and Analytical Tools (Dr. Yeh and Dr. Hu)
- Sequence function
3Functional Genomics/Proteomics
- Proteomics studies biological systems based on global knowledge of genomes, transcriptomes, proteomes, metabolomes.
- Functional genomics studies biological functions of proteins, complexes, pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences.
- Genome: All the Genetic Material in the Chromosomes
- Transcriptome: Entire Set of Gene Transcripts
- Proteome: Entire Set of Proteins
- Metabolome: Entire Set of Metabolites
(Figure: Genome, Transcriptome, Proteome, Metabolome)
4Proteomics
- Data: Gene Expression Profiling
- - Genome-Wide Analyses of Gene Expression
- Data: Structural Genomics
- - Determine 3D Structures of All Protein Families
- Data: Genome Projects (Sequencing)
- - Functional genomics
- - Knowing complete genome sequences of a number of organisms is the basis of proteomics research
5DNA Sequence -> Gene -> Protein Sequence -> Function
6Bioinformatics and Genomics/Proteomics
7Most new proteins come from genome sequencing
projects
- Mycoplasma genitalium - 484 proteins
- Escherichia coli - 4,288 proteins
- S. cerevisiae (yeast) - 5,932 proteins
- C. elegans (worm) - 19,000 proteins
- Homo sapiens - 40,000 proteins
... and have unknown functions
8Advantages of knowing the complete genome
sequence
- All encoded proteins can be predicted and identified
- The missing functions can be identified and analyzed
- Peculiarities and novelties in each organism can be studied
- Predictions can be made and verified
9The changing face of protein science
- 20th century
- Few well-studied proteins
- Mostly globular with enzymatic activity
- Biased protein set
- 21st century
- Many hypothetical proteins
- Various, often with no enzymatic activity
- Natural protein set
10Properties of the natural protein set
- Unexpected diversity of even common enzymes (analogous, paralogous, xenologous, etc. enzymes)
- Conservation of the reaction chemistry, but not the substrate specificity
- Functional diversity in closely related proteins
- Abundance of new structures
11Objectives of functional analysis for different
groups of proteins
- Experimentally characterized
- Best annotated protein database: SwissProt
- Knowns: Characterized by similarity (closely related to experimentally characterized)
- Make sure the assignment is plausible
- Function can be predicted
- Extract maximum possible information
- Avoid errors and overpredictions
- Fill the gaps in metabolic pathways
- Unknowns (conserved or unique)
- Rank by importance
12
                               E. coli   M. jannaschii   S. cerevisiae   H. sapiens
Characterized experimentally      2046              97            3307        10189
Characterized by similarity       1083            1025            1055        10901
Unknown, conserved                 285             211            1007         2723
Unknown, no similarity             874             411             966         7965
(from Koonin and Galperin, 2003, with modifications)
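The counts in the table are easier to compare as shares of each genome. The sketch below only re-expresses the numbers from the table as percentages; the function names are illustrative and not part of the original slides.

```python
# Counts copied from the table above (Koonin and Galperin, 2003, with modifications).
CATEGORIES = [
    "Characterized experimentally",
    "Characterized by similarity",
    "Unknown, conserved",
    "Unknown, no similarity",
]
COUNTS = {
    "E. coli": [2046, 1083, 285, 874],
    "M. jannaschii": [97, 1025, 211, 411],
    "S. cerevisiae": [3307, 1055, 1007, 966],
    "H. sapiens": [10189, 10901, 2723, 7965],
}


def category_shares(counts):
    """Return each category as a percentage of the genome's total protein count."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def unknown_share(counts):
    """Percentage of proteins in the two 'Unknown' rows (the last two categories)."""
    return 100.0 * (counts[2] + counts[3]) / sum(counts)


if __name__ == "__main__":
    for organism, counts in COUNTS.items():
        shares = ", ".join(f"{s:.1f}%" for s in category_shares(counts))
        print(f"{organism}: total {sum(counts)}; shares {shares}")
```

The E. coli column sums to 4,288, the same protein count quoted earlier for this genome.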
13Problems in functional assignments for knowns
- Previous low quality annotations
- misinterpreted experimental results (e.g. suppressors, cofactors)
- biologically senseless annotations
- - Deinococcus head morphogenesis protein
- - Arabidopsis separation anxiety protein-like
- - Helicobacter brute force protein
- - Methanococcus centromere-binding protein
- - Plasmodium frameshift
- propagated mistakes of sequence comparison
14Problems in functional assignments for knowns
- Multi-domain organization of proteins
(Figure: domain organization of a histidine kinase: periplasmic sensor domains, an uncharacterized domain and the His kinase domain)
15Problems in functional assignments for knowns
- Low sequence complexity (coiled-coil,
non-globular regions)
- Non-orthologous gene displacement
- Enzyme evolution (divergence in sequence and
function)
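As a rough illustration of what "low sequence complexity" means, the sketch below scores sliding windows by the Shannon entropy of their residue composition. This is a simplified stand-in, not the algorithm of any specific filtering program, and the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, max_entropy=2.2):
    """Return 0-based start positions of windows with entropy at or below max_entropy.

    window and max_entropy are illustrative defaults, not published cutoffs.
    """
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= max_entropy
    ]
```

A run of one repeated residue has entropy 0, while a window of twelve different residues reaches log2(12), about 3.58 bits, so only the first kind of window is reported.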
16Enzyme recruitment: Minor mutational changes convert a glycerol kinase into gluconate kinase
(Figure: differences between gluconate and glycerol/xylulose kinases)
Leads to non-orthologous gene displacement
17Objectives of functional analysis for different
groups of proteins
- Experimentally characterized
- Knowns: Characterized by similarity (closely related to experimentally characterized)
- Make sure the assignment is plausible
- Function can be predicted
- Extract maximum possible information
- Avoid errors and overpredictions
- Fill the gaps in metabolic pathways
- Unknowns (conserved or unique)
- Rank by importance
18Functional Prediction: Dealing with hypothetical proteins
- Computational analysis
- Sequence analysis of the new ORFs
- Mutational analysis
- Functional analysis
- Expression profiling
- Tracking of cellular localization
- Structural analysis
- Determination of the 3D structure
19Structural Genomics
- Protein Structure Initiative: Determine 3D Structures of All Proteins
- Family Classification
- Organize Protein Sequences into Families, collect families without known structures
- Target Selection
- Select Family Representatives as Targets
- Structure Determination
- X-Ray Crystallography or NMR Spectroscopy
- Homology Modeling
- Build Models for Other Proteins by Homology
- Functional prediction based on structure
20Structural Genomics: Structure-Based Functional Assignments
Methanococcus jannaschii MJ0577 (Hypothetical Protein)
Contains bound ATP => ATPase or ATP-Mediated Molecular Switch
Confirmed by biochemical experiments
21Crystal structure is not a function!
22Improving functional assignments for unknowns
(Functional Prediction)
- Detailed manual analysis of sequence similarities
- Cluster analysis of protein families (family databases)
- Use of sophisticated database searches (PSI-BLAST, HMM)
23Using comparative genomics for protein analysis
- Those amino acids that are conserved in divergent proteins (archaeal and bacterial, hyperthermophilic and mesophilic) are likely to be important for catalytic activity.
- Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise
- Prediction of the 3D fold and general function is much easier than prediction of exact biological (or biochemical) function.
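The first point can be sketched in code: given a multiple alignment of divergent homologs, list the columns where all sequences carry the same residue. The alignment in the test is a made-up toy, and requiring strict identity (rather than a substitution-matrix score) is a simplifying assumption.

```python
def conserved_columns(alignment, gap="-"):
    """Return 0-based columns where every aligned sequence has the same non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        # A column counts as conserved only if one residue is shared by all rows.
        if len(residues) == 1 and gap not in residues:
            conserved.append(col)
    return conserved
```

The more divergent the aligned sequences are, the fewer columns survive this filter, which is why the remaining ones are good candidates for functionally important residues.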
24Using comparative genomics for protein analysis
- For some reason, the reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
- Sequence database searches that use exotic or highly divergent query sequences often reveal more subtle relationships than those using queries from humans or standard model organisms (E. coli, yeast, worm, fly).
- Sequence analysis complements structural comparisons and can greatly benefit from them
25Poorly characterized protein families
- Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
- Helix-turn-helix motif proteins (predicted transcriptional regulators)
- Membrane transporters
26Improving functional assignments for unknowns
- Phylogenetic distribution
- Wide - most likely essential
- Narrow - probably clade-specific
- Patchy - most intriguing, niche-specific
- Domain association: Rosetta Stone for multidomain proteins
- Gene neighborhood (operon organization)
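The domain association idea can be sketched as follows: two domains are linked if some protein contains both (the "Rosetta Stone" fusion) while each also occurs in a protein lacking the other. The protein identifiers and domain assignments in the test are hypothetical toy data, and the function is an illustration rather than a published implementation.

```python
from itertools import combinations


def rosetta_stone_links(proteins):
    """proteins: dict protein_id -> tuple of domain names.

    Returns a dict mapping a domain pair to the fused proteins containing both
    domains, keeping only pairs where each domain also occurs without the other.
    """
    fused = {}
    for pid, domains in proteins.items():
        for a, b in combinations(sorted(set(domains)), 2):
            fused.setdefault((a, b), []).append(pid)

    def occurs_without(x, y):
        # True if some protein carries domain x but not domain y.
        return any(x in d and y not in d for d in proteins.values())

    return {
        pair: ids
        for pair, ids in fused.items()
        if occurs_without(pair[0], pair[1]) and occurs_without(pair[1], pair[0])
    }
```

A link found this way is only a hint that the two domains work together; it says nothing about the exact function of either one.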
27Using genome context for functional prediction
28Problems in functional assignments/predictions
- Identification of protein-coding regions
- Delineation of potential function(s) for distant paralogs
- Identification of domains in the absence of close homologs
- Analysis of proteins with low sequence complexity
29Unknown unknowns
- Phylogenetic distribution
- Wide - most likely essential
- Narrow - probably clade-specific
- Patchy - most intriguing, niche-specific
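A minimal sketch of how the three distribution types could be told apart from presence/absence data. The clade grouping, the 0.8 cutoff for "wide" and the rule that "narrow" means hits in exactly one clade are all assumptions made for illustration.

```python
def classify_distribution(presence_by_clade, wide_fraction=0.8):
    """presence_by_clade: dict clade -> list of booleans, one per genome in that clade.

    Returns 'wide', 'narrow', 'patchy' or 'absent'. wide_fraction is an
    illustrative cutoff, not an established standard.
    """
    total = sum(len(v) for v in presence_by_clade.values())
    present = sum(sum(v) for v in presence_by_clade.values())
    if present == 0:
        return "absent"
    if present / total >= wide_fraction:
        return "wide"
    clades_with_hits = [c for c, v in presence_by_clade.items() if any(v)]
    if len(clades_with_hits) == 1:
        return "narrow"
    return "patchy"
```

Families that come out as "patchy" are the ones to inspect by hand, since the pattern may reflect a niche-specific function.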
30To deal with the ocean of new sequences, need
natural protein classification
Discovery of New Knowledge by Using Information
Embedded within Families of Homologous Sequences
and Their Structures
- Protein families are real and reflect evolutionary relationships
- Protein classification systems can be used to
- Improve sensitivity of protein identification
- Provide new protein sequence annotation, simplifying the search for non-obvious relationships
- Detect and correct genome annotation errors systematically
- Drive other annotations (active site etc.)
- Provide basis for evolution, genomics and proteomics research
31The ideal system would be
- Comprehensive, with each sequence classified either as a member of a family or as an orphan sequence, a family of one
- Hierarchical, with families united into superfamilies on the basis of distant homology
- Allow for simultaneous use of the whole protein and domain information (domains mapped onto proteins)
- Allow for automatic classification/annotation of new sequences when these sequences are classifiable into the existing families
- Expertly curated (family name, function, evidence attribution (experimental vs predicted), background etc). This is the only way to avoid annotation errors and prevent error propagation
32The ideal system has yet to be created, but there
are several very useful systems
33Levels of Protein Classification
Class | α/β | Composition of structural elements | No relationships
Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly above and below
Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
COG | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA
LSE | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss
34Protein Evolution
- Tree of Life: Evolution of Protein Families (Dayhoff, 1978)
- Can build a tree representing evolution of a protein family, based on sequences
- Orthologous Gene Family: Organismal and Sequence Trees Match Well
35Protein Evolution
- Homolog
- Common Ancestors
- Common 3D Structure
- Common Active Sites or Binding Domains
- Ortholog
- Derived from Speciation
- Paralog
- Derived from Duplication
36Orthologs and Paralogs
(Figure: tree with leaves Myo (Hagfish), Hb (Hagfish), HbA (Frog), HbB (Frog), Myo (Frog), HbA (Cod), HbB (Cod), Myo (Cod), HbA (Rat), Myo (Rat), HbB (Rat); taxon labels Amphibia, Mammalia, Teleostomi, Myxinidae, Tetrapoda, Vertebrata, Craniata)
37Orthologs and Paralogs
(Figure: the eleven globin genes of hagfish, frog, cod and rat (Myo, Hb, HbA, HbB) grouped into COG myoglobins and COG hemoglobins)
38Orthologs and Paralogs
- Orthologs (COG Myo): Myo (Hagfish), Myo (Cod), Myo (Frog), Myo (Rat)
- Orthologs (COG Hb): Hb (Hagfish); SubCOG HbA (Cod), HbA (Frog), HbA (Rat); SubCOG HbB (Cod), HbB (Frog), HbB (Rat)
- Out-paralogs (globin family): members of COG Myo vs. members of COG Hb
- In-paralogs (LSE in Vertebrata): HbA vs. HbB
39Orthologs and Paralogs
(Figure: the eleven globin genes of hagfish, frog, cod and rat (Myo, Hb, HbA, HbB) with the groups COG myoglobins, COG hemoglobins and COG hemoglobins A marked)
40Orthologs and Paralogs
- Orthologs (COG Myo): Myo (Cod), Myo (Frog), Myo (Rat)
- Orthologs (COG HbA): HbA (Cod), HbA (Frog), HbA (Rat)
- Orthologs (COG HbB): HbB (Cod), HbB (Frog), HbB (Rat)
- Out-paralogs (globin family): genes from different COGs
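The grouping on this slide (cod, frog and rat only) can be written down as a small lookup: genes in the same COG from different species are orthologs, genes from different COGs are out-paralogs. This sketch encodes only the labels shown here; it is not a general orthology-inference method.

```python
# Gene label -> (species, COG), following the three COGs shown on the slide.
GENES = {
    f"{cog} ({species})": (species, cog)
    for cog in ("Myo", "HbA", "HbB")
    for species in ("Cod", "Frog", "Rat")
}


def relationship(gene_a, gene_b):
    """Classify two different globin genes using the COG labels from the slide."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    species_a, cog_a = GENES[gene_a]
    species_b, cog_b = GENES[gene_b]
    if cog_a == cog_b:
        # Same COG, different species: descended from one gene by speciation.
        return "orthologs"
    # Different COGs: the duplication predates the split of these species.
    return "out-paralogs"
```

For example, `relationship("HbA (Cod)", "HbA (Rat)")` gives "orthologs", while HbA and HbB of the same species come out as "out-paralogs" relative to this set of species.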
41Levels of Protein Classification
Class | α/β | Composition of structural elements | No relationships
Fold | TIM-Barrel | Topology of folded backbone | Possible monophyly above and below
Superfamily | Aldolase | Recognizable sequence similarity (motifs); basic biochemistry | Monophyletic origin
Family | Class I Aldolase | High sequence similarity (alignments); biochemical properties | Evolution by ancient duplications
COG | 2-keto-3-deoxy-6-phosphogluconate aldolase | Orthology for a given set of species; biochemical activity; biological function | Origin traceable to a single gene in LCA
LSE | PA3131 and PA3181 | Paralogy within a lineage | Evolution by recent duplication and loss
42Protein Family-Domain-Motif
- Domain: Evolutionary/Functional/Structural Unit
- Domain: structurally compact, independently folding unit that forms a stable three-dimensional structure and shows a certain level of evolutionary conservation. Usually corresponds to an evolutionary unit.
- A protein can consist of a single domain or multiple domains. Proteins have modular structure.
- Motif: Conserved Functional/Structural Site
43Protein Evolution: Sequence Change vs. Domain Shuffling
44Recent Domain Shuffling
(Figure: domain architectures of superfamilies SF006786, SF001501, SF001499, SF005547, SF001424 and SF001500, built from the domains CM (AroQ type), PDH, PDT and ACT)
45Protein classification: proteins and domains
- Option 1: classify domains
- - take individual domain sequences, consider them as independently evolving units and build a classification system
- makes it possible to go all the way to the deepest possible level, the last point of traceable homology and common origin (fold)
- domain databases (Pfam, SMART, CDD)
- allow mapping of domains onto a query sequence
46Protein classification: proteins and domains
- Option 2: classify full-length proteins
- In cases of multidomain proteins, does not allow going deep along the evolutionary tree
- All proteins in a family will often have a common biological function, which is very convenient for annotation
- Domains will be mapped onto protein families
47Practical Classification of Proteins: Setting Realistic Goals
We strive to reconstruct the natural
classification of proteins to the fullest
possible extent
48Classification: current status
- PIR Superfamilies
- Proteins in PIR-PSD: 283,289
- Proteins classified: 187,871
- 2/3 of the PIR proteins
- COGs
- 70% of each microbial genome
- 50% of each Eukaryotic genome in 3-clade COG
- 20% (?) of each Eukaryotic genome in LSEs
49PIR Web Site (http://pir.georgetown.edu)
50PIR Superfamily Concept
- Whole (Full-Length) Proteins
- Homeomorphic (Common Domain Architecture)
- Monophyletic (Common Evolutionary Origin)
- Hierarchical structure (Family and Superfamily)
- Non-Overlapping placement within each level
- PIR Superfamily vs. Other Concepts
- Evolution: Superfamily hierarchy reflects orthology and paralogy
- Structure: PIR superfamily generally corresponds to SCOP family
- Domain: Domains are mapped onto the Superfamily
- Motif: Motifs (functional/structural sites) are mapped onto the Superfamily
- Function: a Superfamily may contain divergent functions
51PIR Superfamilies
- Created by automated clustering by identity with coverage-by-length requirements. Creation of new Superfamilies is an ongoing process.
- Automated classification rules are refined by expert curation
- - Evolution rates are very different in different branches of the protein universe, so need very different score cutoffs
- Verify/add members
- Annotation (at level of orthology): Superfamily Name, Description, Bibliography
- In some cases, more than one orthologous group will be included into a single Superfamily; these Superfamilies will often be very large and diverse
- Depth of hierarchy will be different for single-domain and multidomain proteins
- This is work in progress and will become available through PIR (iProClass) and InterPro
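A minimal sketch of clustering by identity with a coverage-by-length requirement, assuming single-linkage grouping over pairwise hits. The thresholds (50% identity, 80% coverage of the longer sequence) and the input format are hypothetical; the actual PIR rules and cutoffs are not given in these slides.

```python
def cluster_by_identity(lengths, hits, min_identity=0.5, min_coverage=0.8):
    """Single-linkage clustering of sequences joined by qualifying pairwise hits.

    lengths: dict id -> sequence length.
    hits: iterable of (id_a, id_b, identity, aligned_length).
    Coverage is aligned_length divided by the length of the longer sequence,
    so a short protein matching one domain of a long protein is not merged.
    """
    parent = {pid: pid for pid in lengths}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned in hits:
        coverage = aligned / max(lengths[a], lengths[b])
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for pid in lengths:
        clusters.setdefault(find(pid), set()).add(pid)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

The coverage test is what keeps the clusters homeomorphic: a high-identity hit that spans only part of a longer protein does not pull the two into one group.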
52CM-Related Superfamilies
- Chorismate Mutase (CM), AroQ class
- SF001501 CM (Prokaryotic type) PF01817
- SF001499 tyrA bifunctional enzyme (Prok) PF01817-PF02153
- SF001500 pheA bifunctional enzyme (Prok) PF01817-PF00800
- SF017318 CM (Eukaryotic type) Regulatory Dom-PF01817
- Chorismate Mutase, AroH class
- SF005965 CM PF01817
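The contrast between a domain-centric and a whole-protein view can be sketched with the AroQ-class entries listed above: one Pfam domain maps to several superfamilies, while a full domain architecture picks out one. The lookup functions are illustrative only and do not reflect any actual PIR or iProClass interface.

```python
# Superfamily -> domain architecture, copied from the AroQ-class entries above.
ARCHITECTURES = {
    "SF001501": ("PF01817",),
    "SF001499": ("PF01817", "PF02153"),
    "SF001500": ("PF01817", "PF00800"),
    "SF017318": ("Regulatory Dom", "PF01817"),
}


def superfamilies_with_domain(domain):
    """Domain-centric view: every superfamily whose architecture contains the domain."""
    return sorted(sf for sf, arch in ARCHITECTURES.items() if domain in arch)


def superfamilies_with_architecture(domains):
    """Whole-protein view: superfamilies whose ordered architecture matches exactly."""
    return sorted(sf for sf, arch in ARCHITECTURES.items() if arch == tuple(domains))
```

Searching with PF01817 alone returns all four superfamilies, whereas the architecture PF01817-PF00800 returns only SF001500.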
53iProClass Superfamily Report (I)
54iProClass Superfamily Report (II)
55InterPro
- InterPro is an integrated resource for protein families, domains and sites.
- - InterPro combines a number of databases that use different methodologies. By uniting the member databases, InterPro capitalizes on their individual strengths, producing a powerful integrated diagnostic tool.
- Member databases: PROSITE, PRINTS, Pfam, SMART, ProDom, and TIGRFAMs
- PIR to be added soon
- SWISSPROT and TrEMBL matches used as examples
56InterPro Entry
- InterPro Entry Type: defines the entry as a Family, Domain, Repeat, or Post-translational modification site (other sites to be added: binding site, active site).
- Family: protein family. PIR SFs will generally belong to this type.
- "Contains" field: lists domains within this protein
- "Found in": for domain entries, lists families which contain this domain
57PIR Superfamilies are being integrated into
InterPro
InterPro Entry Type: Family
SF001500: Bifunctional chorismate mutase / prephenate dehydratase (P-protein)
58COGs: Clusters of Orthologous Groups
Comparative genomics - a branch of computational biology that uses complete genome sequences
- complete genomes
- reciprocal best hits
- no score cutoffs
59Construction of COGs
(Figure: Genome 1 and Genome 2)
60Construction of COGs
(Figure: Yeast YLR377c, E. coli fbp and Synechocystis slr0952 connected pairwise by bidirectional best hits)
Triangle - the simplest COG
61Construction of COGs
Merge triangles
62Construction of COGs: Add all homologs
(Figure: a new protein (?) next to Yeast YLR377c, E. coli fbp and Synechocystis slr0952)
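The construction steps on these slides can be sketched as three functions: find bidirectional best hits, collect triangles spanning three genomes, and merge triangles. The sketch assumes that two triangles are merged when they share a side, and that best hits have already been computed and are supplied as a dictionary; the fourth genome in the test is hypothetical. It is a simplified illustration, not the actual COG software.

```python
from itertools import combinations


def bidirectional_best_hits(best_hit):
    """best_hit maps (query_protein, target_genome) -> best-scoring protein there.

    Proteins are (genome, gene) tuples. Returns the set of protein pairs that
    are each other's best hit. No score cutoff is applied.
    """
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        if best_hit.get((hit, query[0])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def find_triangles(pairs):
    """Trios from three different genomes, all sides being bidirectional best hits."""
    proteins = sorted({p for pair in pairs for p in pair})
    tris = []
    for trio in combinations(proteins, 3):
        different_genomes = len({genome for genome, _ in trio}) == 3
        all_sides = all(frozenset(s) in pairs for s in combinations(trio, 2))
        if different_genomes and all_sides:
            tris.append(frozenset(trio))
    return tris


def merge_triangles(tris):
    """Merge triangles that share a side (two proteins) into clusters."""
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) >= 2:
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return list(clusters.values())


def mutual(best_hit, a, b):
    """Toy helper: record proteins a and b as each other's best hit."""
    best_hit[(a, b[0])] = b
    best_hit[(b, a[0])] = a


YEAST = ("yeast", "YLR377c")
ECOLI = ("E. coli", "fbp")
SYN = ("Synechocystis", "slr0952")
```

With the three proteins from the slide recorded as mutual best hits, the result is a single triangle, the simplest COG.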
66In COGs, the dilemma between the depth of
analysis and protein integrity is approached by
keeping proteins intact whenever possible, and
dividing into modules (single- or multidomain)
when necessary
67Case Study 1: Prediction verified: GGDEF domain
- Proteins containing this domain: Caulobacter crescentus PleD controls swarmer cell - stalk cell transition (Hecht and Newton, 1995). In Rhizobium leguminosarum, Acetobacter xylinum, required for cellulose biosynthesis (regulation)
- Predicted to be involved in signal transduction because it is found in fusions with other signaling domains (receiver, etc)
- In Acetobacter xylinum, cyclic di-GMP is a specific nucleotide regulator of cellulose synthase (signalling molecule). Multidomain protein with GGDEF domain was shown to have diguanylate cyclase activity (Tal et al., 1998)
- Detailed sequence analysis tentatively predicts GGDEF to be a diguanylate cyclase domain (Pei and Grishin, 2001)
- Complementation experiments prove diguanylate cyclase activity of GGDEF (Ausmees et al., 2001)
68Case study 2: Defining a novel domain family
Prokaryotic Response Regulators (RRs)
- CheY-like receiver
- Output: variable (DNA-binding, enzymatic)
What if the domain is not described yet?
- CheY receiver
- PSI-BLAST with C-terminal portion alone
69Two Groups of Unusual RRs: Receiver-X (SF006198, COG3279)
- 1. AlgR-related
- Pseudomonas aeruginosa (AlgR): alginate biosynthesis
- Klebsiella pneumoniae (MrkE): formation of adhesive fimbriae
- Clostridium perfringens (VirR): virulence factors
- 2. Regulators of autoinduced peptide-controlled regulons
- Staphylococcus aureus (AgrA): virulence factors
- Lactobacillus plantarum (PlnC, PlnD): bacteriocin production
- Streptococcus pneumoniae (ComE): competence
- Properties of the CheY-LytTR transcriptional regulators
- Regulate secreted and extracellular factors
- Often regulate their own expression
- Bind to imperfect direct repeat sites in -80 to -40 area (or in UAS)
- Can be phosphorylated by His kinases, but form operons with HisK-type sensor ATPases
- Contain a conserved LytTR-type DNA-binding domain
70LytTR - a new DNA-binding domain not similar to
HTH, winged helix, or ribbon-helix-helix
DNA-binding domains
71Domain organization of LytTR proteins other than CheY-LytTR
- Stand-alone LytTR: Streptococcus pneumoniae BlpS, Pseudomonas phage D3 Orf50
- 40aa - LytTR: Lactococcus lactis L121252, Listeria monocytogenes Lmo0984, Staphylococcus aureus SA2153, Streptococcus pneumoniae SP0161
- ABC - LytTR: Bacillus halodurans BH3894
- MHYT - LytTR: Oligotropha carboxidovorans CoxC, CoxH
- 3TM - LytTR: Xanthomonas campestris RpfD, Caulobacter crescentus CC1610, Mesorhizobium loti mll0891
- 3TM - LytTR: Caulobacter crescentus CC0295
- 4TM - LytTR: Caulobacter crescentus CC0330, CC3036
- 8TM - LytTR: Caulobacter crescentus CC0551
- PAS - LytTR: Burkholderia cepacia, Geobacter sulfurreducens
72Consensus binding site for the LytTR domains
73Predicted LytTR-regulated genes
- Expected
- Bacillus subtilis: natAB (Na-ATPase)
- Oligotropha carboxidovorans: comC, comH (CO growth)
- Staphylococcus aureus: lrgAB (autolysis)
- Streptococcus pneumoniae: hld (hemolysin delta)
- Unexpected
- Bacillus subtilis: alr, dinB, rapI, veg, ybaJ, ybbI, yceA, ydbS, ydjL, yebB, yfiV, ykuA
- Staphylococcus aureus: capO, coa, hsdR, SA0096, SA0257, SA0285, SA0302, SA0357, SA0358, SA0513
74Impact of genomics
- Single protein level
- Discovery of new enzymes and superfamilies
- Prediction of active sites and 3D structures
- Pathway level
- Identification of missing enzymes
- Prediction of alternative enzyme forms
- Identification of potential drug targets
- Cellular metabolism level
- Multisubunit protein systems
- Membrane energy transducers
- Cellular signaling systems
75Examples for analysis
- 1. Retrieve one of the following protein sequences: PIR C69086, D64376; GenBank GI:15679635. Using analysis tools available on the web, check if the functional annotation is correct, and provide correct annotation without looking at internal PIR or COG annotations (run BLAST with CD-search and SMART to start with). When you are done, look at the PIR curated SF annotation (still at internal site only)
- http://pir.georgetown.edu/test-cgi/sf/pirclassif.pl?id=SF006549
- http://pir.georgetown.edu/cgi-bin/ipcSF1?id=SF006549 (compare with original automatic SF annotation at the public site), and at COG annotations. What caused the wrong annotations? In BLAST outputs for these sequences, do you see other wrongly annotated proteins?
- Next, analyze the C-terminal domain of these proteins by PSI-BLAST (and alignment analysis) and suggest any speculations as to its function (homework).
76Examples for analysis
- 2.
- Retrieve the following sequence: GI:7019521
- Take a look at the associated publication (reference).
- Analyze the sequence to see if any additional information can be obtained (run PSI-BLAST, and (as a homework) construct multiple alignment).
- Take a look at taxonomy report: what does it tell you?
- Find experimental paper associated with one of the sequences found by PSI-BLAST. What annotation is appropriate for this sequence and for the entire family?
77Examples for analysis
- 3.
- Predict the function of the following proteins:
- GenBank GI:27716853
- E. coli YjeE protein
- Verify and/or correct the following functional annotations. Can you explain why the erroneous annotations were made?
- PIR H87387
- GenBank GI:15606003, GI:15807219
- PIR F70338
78Examples for analysis
- 4. Homework: an exercise in transitive relationships
Start with
>gi|20093648|ref|NP_613495.1| Uncharacterized membrane protein, conserved in Archaea [Methanopyrus kandleri AV19]
(this is a short membrane protein). Run PSI-BLAST, making sure you have filtering, complexity and CD-search off. There are no good hits but a bunch of sub-threshold ones. Collect "suspect" relations, use them as queries and expand the net. You will be able to come up with two proteins:
>gi|21227474|ref|NP_633396.1| hypothetical protein [Methanosarcina mazei Goe1]
and
>gi|14324537|dbj|BAB59464.1| hypothetical protein [Thermoplasma volcanium]
When used as a PSI-BLAST query, the first will tie the Methanopyrus protein into a group, while the second will tie this group to the Sec61 subunit of preprotein translocase. Then, of course, you can obtain the same result with CD-search in a single step.
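The "expand the net" procedure is a breadth-first search over search results: each suspect hit becomes a new query, and the chain of queries leading to every protein is recorded. The sketch below assumes the hits (including sub-threshold ones) have already been collected by hand into a dictionary; the function name and the labels in the test are hypothetical placeholders, not real search output.

```python
from collections import deque


def expand_net(start, hits, max_rounds=3):
    """Breadth-first expansion of transitive relationships.

    hits: dict query -> iterable of proteins reported for that query,
    including sub-threshold "suspects". Returns a dict mapping each reached
    protein to the chain of queries that links it back to start.
    """
    paths = {start: [start]}
    queue = deque([(start, 0)])
    while queue:
        query, depth = queue.popleft()
        if depth == max_rounds:
            continue  # do not search further from proteins found in the last round
        for hit in hits.get(query, ()):
            if hit not in paths:
                paths[hit] = paths[query] + [hit]
                queue.append((hit, depth + 1))
    return paths
```

Every link in a recorded chain is only as reliable as the weakest hit in it, so each step should still be checked by alignment before an annotation is transferred.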