Title: 26S Proteasome and Protein Stability
1 Algorithms and databases for sequence and
structural analysis
Biology through homology or analogy
2 In biomolecular sequences (DNA, RNA, or amino
acid sequences), high sequence similarity usually
implies significant functional or structural
similarity.
However, evolutionarily and functionally
related molecular strings can differ
significantly throughout much of the string and
yet preserve the same three-dimensional
structure(s), or the same two-dimensional
substructure(s) (motifs, domains), or the same
active sites, or the same or related dispersed
residues (DNA or amino acid).
Dan Gusfield. Algorithms on Strings, Trees, and Sequences.
1997. Cambridge University Press. p. 334
3 Objectives
- What is the function of this gene?
- Do other genes have this functional motif?
- Can I predict the higher order structure of this protein?
- Is this gene a member of a known gene family?
- Do other organisms have this gene?
4 Intuition
- Similar sequences should have (long) regions of
similar/identical residues.
- Why?
- Evolution: descent from a common ancestral sequence
- Functional/structural convergence
5 General Database Search Issues
- Search using amino acid sequence if possible
- Why? Protein evolution is slower than DNA sequence evolution
- Statistical theory is based on unrealistic
assumptions; consider results as predictions.
6 Sequence Alignment
- Sequence alignment is simply the optimal
assignment of substitution and indel events to a
pair of sequences.
- Global alignment: align entire sequences
- Local alignment: find best matching regions of sequences
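The difference between the two modes can be sketched with a small dynamic-programming routine. This is only an illustration under assumed linear scores (match +1, mismatch -1, gap -1); the function name `alignment_score` and its parameters are hypothetical and not part of the slides. It returns the optimal score only, not the alignment itself.

```python
def alignment_score(s, t, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score of s and t with linear gap costs.

    local=False: global alignment (every residue of both sequences is used).
    local=True:  local alignment (best-scoring pair of subsequences).
    """
    floor = 0 if local else float("-inf")  # local scores never drop below 0
    prev = [0 if local else j * gap for j in range(len(t) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0 if local else i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(floor, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best if local else prev[-1]


print(alignment_score("QNV", "AQNVA"))              # global: pays for 2 gaps
print(alignment_score("QNV", "AQNVA", local=True))  # local: scores QNV only
```

With these assumed scores the global call pays for the two unmatched flanking residues, while the local call reports only the shared region.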
7 Measuring Alignment Quality
- Good alignments should have
- many exact matches
- few mismatches
- many of the mismatches should be similar residues
- few gaps
8 Measuring Alignment Quality
Begin with...
Longest Exact Match
QTRPQNVLNPP
 STRQNVINPWAAQ
S = 3a
S = alignment score, a = match score
9 Measuring Alignment Quality
allow some mismatches
QTRPQNVLNPP
 STRQNVINPWAAQ
S = alignment score, a = match score, b = mismatch penalty
S = 5a - 1b
10 Measuring Alignment Quality
and finally, introduce some gaps
QTRPQNVLNPP
STR-QNVINPWAAQ
S = alignment score, a = match score, b = mismatch penalty, c = gap penalty
S = 7a - 1b - 1c
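The three scores on these slides can be reproduced by counting columns in the matched regions. The sketch below assumes the aligned rows are given as equal-length strings with "-" marking a gap; the function name `score_alignment` and the default weights a = b = c = 1 are illustrative choices only.

```python
def score_alignment(row1, row2, a=1, b=1, c=1):
    """Return (S, matches, mismatches, gaps) for two aligned rows.

    S = a*matches - b*mismatches - c*gaps, where a is the match score,
    b the mismatch penalty and c the gap penalty.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return a * matches - b * mismatches - c * gaps, matches, mismatches, gaps


print(score_alignment("QNV", "QNV"))              # (3, 3, 0, 0)  -> S = 3a
print(score_alignment("QNVLNP", "QNVINP"))        # (4, 5, 1, 0)  -> S = 5a - 1b
print(score_alignment("TRPQNVLNP", "TR-QNVINP"))  # (5, 7, 1, 1)  -> S = 7a - 1b - 1c
```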
11 Scoring Issues
- Relative costs of matches, mismatches, and gaps
should depend on their probabilities (rare events
receive higher penalties)
- In practice, the appropriate costs are rarely known.
- A variety of scoring matrices are available.
12 BLAST (www.ncbi.nlm.nih.gov/BLAST): Basic Local
Alignment Search Tool
BLAST is based on a systematic search of
conserved words. The query sequence is decomposed
into words of length W (W = 3 for amino acids, 11
for nucleotides). A list of these words and
similar words is compared with entries in the
database. Words scoring below
a threshold are deleted from the list.
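The word decomposition can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it finds exact word matches only and omits the "similar words" step and the threshold. The function names `query_words` and `seed_hits` are hypothetical.

```python
def query_words(query, w=3):
    """All overlapping words of length w with their 0-based start offsets."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def seed_hits(query, subject, w=3):
    """Exact word matches as (query_offset, subject_offset, word) tuples."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j, word)
            for i, word in query_words(query, w)
            for j in index.get(word, [])]


print(query_words("QTRPQ"))                          # [(0, 'QTR'), (1, 'TRP'), (2, 'RPQ')]
print(seed_hits("QTRPQNVLNPP", "STRQNVINPWAAQ"))     # [(4, 3, 'QNV')]
```

Each seed hit marks a position from which an alignment could then be extended in both directions.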
13 Scoring Matrices
- A scoring matrix specifies a score, s_ij, for
aligning residue i in sequence I with residue j in sequence II.
- Choice of matrix depends on the divergence level
of desired/expected hits.
- Examples: PAM, BLOSUM
- Both can be modified for different divergence
levels (e.g., BLOSUM40, BLOSUM62)
- Advice: try several matrices when possible.
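Matrices of the PAM and BLOSUM type are generally built as log-odds scores: s_ij compares how often residues i and j are seen aligned (q_ij) with how often they would pair by chance (p_i * p_j). The sketch below uses made-up frequencies purely for illustration; the function name `log_odds_score` and the half-bit scaling default are assumptions here, and real matrices round the result to integers.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """scale * log2(q_ij / (p_i * p_j)); scale=2.0 gives half-bit units."""
    return scale * math.log2(q_ij / (p_i * p_j))


# Toy background frequencies p_i = p_j = 0.25 (illustrative values only).
print(log_odds_score(0.125, 0.25, 0.25))    # 2.0  pair seen twice as often as chance
print(log_odds_score(0.0625, 0.25, 0.25))   # 0.0  pair seen exactly as often as chance
print(log_odds_score(0.03125, 0.25, 0.25))  # -2.0 pair seen half as often as chance
```

Pairs that are rarer than chance receive negative scores, which matches the idea that rare events receive higher penalties.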
16 PSI-BLAST: Position-Specific Iterated BLAST
1. BLAST with query
2. Keep hits with E-value below a threshold (adjustable constant)
3. Multiple alignment of HSPs from step (2)
4. Build profile
5. BLAST with profile
6. Iterate (2)-(5) until no new hits are found
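The control flow of these steps can be sketched as a loop that stops when a round adds nothing new. This is a schematic only: `search` and `build_profile` are caller-supplied stand-ins for the BLAST search and the alignment/profile steps, and the names and the default cutoff value are illustrative assumptions, not PSI-BLAST's actual interface.

```python
def iterate_profile_search(query, search, build_profile,
                           e_cutoff=0.005, max_rounds=10):
    """Repeat search -> filter by E-value -> rebuild profile until convergence.

    search(model) must return a dict {hit_id: e_value}.
    build_profile(query, hit_ids) must return the model for the next round.
    """
    model, accepted = query, set()
    for _ in range(max_rounds):
        hits = {h for h, e in search(model).items() if e < e_cutoff}
        new = hits - accepted
        if not new:          # no new hits: converged
            break
        accepted |= new
        model = build_profile(query, sorted(accepted))
    return accepted


# Toy stand-ins: each sequence "finds" its listed neighbours.
NEIGHBOURS = {"q": {"a": 1e-10}, "a": {"b": 1e-8, "x": 0.5}, "b": {}}

def toy_search(model):
    members = [model] if isinstance(model, str) else model
    found = {}
    for m in members:
        found.update(NEIGHBOURS.get(m, {}))
    return found

def toy_profile(query, hit_ids):
    return [query] + hit_ids

print(sorted(iterate_profile_search("q", toy_search, toy_profile)))  # ['a', 'b']
```

In the toy run, "b" is reachable only through "a", which shows how hits accumulate across rounds and why one wrongly accepted hit can pull in its own relatives.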
17 PSI-BLAST: Position-Specific Iterated BLAST
Use with great caution!!! Once an unrelated
sequence is mistakenly incorporated into the
profile, subsequent iterations will incorporate
homologues of the unrelated sequence
(catastrophic transitivity). Human intervention
is essential.
18 The COG database: new developments in
phylogenetic classification of proteins from
complete genomes. Tatusov RL, Natale DA,
Garkavtsev IV, Tatusova TA, Shankavaram UT,
Rao BS, Kiryutin B, Galperin MY, Fedorova ND,
Koonin EV.
All vs. all blastp of genome sequence (primarily
microbial) database. Each COG consists of
individual orthologous genes or orthologous
groups of paralogs from three or more
phylogenetic lineages. In other words, any two
proteins from different lineages that belong to
the same COG are orthologs. Each COG is assumed
to have evolved from an individual ancestral
gene through a series of speciation and
duplication events.
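One simple way to pull candidate orthologs out of all-vs-all search results is the reciprocal best hit between two genomes. The sketch below shows that common heuristic only; it is not the COG construction procedure itself, and the function names and the toy scores are hypothetical.

```python
def best_hit(hits):
    """ID with the highest score in {id: score}, or None if there are no hits."""
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in genome B and a is b's best hit in genome A."""
    pairs = []
    for a, hits in a_vs_b.items():
        b = best_hit(hits)
        if b is not None and best_hit(b_vs_a.get(b, {})) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Toy similarity scores (illustrative values only).
a_vs_b = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 120}}
b_vs_a = {"b1": {"a1": 210, "a2": 115}, "b2": {"a1": 45}}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

Here a2 is not paired because b1's own best hit is a1, so the relationship is not reciprocal.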
19 Domains and insight into protein function
- Proteins are modular, exhibiting discrete folding
units known as domains
- Switching and swapping domains is a mechanism for
functional diversity in proteins
- Domains can exhibit intrinsic function
20 Examples
- SH2 binds phosphorylated tyrosine residues in protein partners
- PDZ mediates protein-protein interactions between enzymes
- HTH binds DNA in a site-specific manner
- Once a domain acquires selectable functionality,
it can be distributed to other gene products,
providing a mechanism for evolution
21 Hidden Markov Models are sensitive tools for
domain detection
www.pfam.wustl.edu
www.tigr.org/TIGRFAMs/
www.smart.embl-heidelberg.de/
These tools use profiles generated from multiple sequence
alignments.
22 Rossmann fold - Profile HMM and PROSITE
- GLGFFGV
- GVGYFGV
- GLGFFGL
- GLGFFGL
- GQGVLGL
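The five aligned motif instances above can be summarized column by column, which is the starting point for both a PROSITE-style pattern and the match states of a profile. The sketch below only counts residues per column; a real profile HMM would also add pseudocounts plus insert and delete states. The function names are illustrative.

```python
from collections import Counter

MOTIF_ALIGNMENT = ["GLGFFGV", "GVGYFGV", "GLGFFGL", "GLGFFGL", "GQGVLGL"]


def column_counts(alignment):
    """One Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def prosite_like_pattern(alignment):
    """Join columns as X or [XYZ], listing every residue observed in a column."""
    parts = []
    for counts in column_counts(alignment):
        residues = "".join(sorted(counts))
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "-".join(parts)


print(prosite_like_pattern(MOTIF_ALIGNMENT))  # G-[LQV]-G-[FVY]-[FL]-G-[LV]
print(column_counts(MOTIF_ALIGNMENT)[1])      # second column: L seen 3 times, V and Q once each
```

The pattern treats every observed residue as equally acceptable, whereas the counts keep the information that L dominates the second column, which is what a profile exploits.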
23 Transition to structural classifications
- Several useful databases link sequence analysis
and protein structure information
- CATH and SCOP are two of these, each containing
950-1400 protein superfamilies
- Since structure is more highly conserved than
sequence during evolution, structural alignment
algorithms and classifications enable more
distant evolutionary relatives to be identified.
24 CATH
- Contains 200,000 sequence domains, assigned to
1200 CATH homologous superfamilies
- Classification scheme: Class, Architecture,
Topology and Homology
- Class: secondary structure composition and packing
- Architecture: orientation of secondary
structures in 3D, regardless of connectivity
- Topology: both orientation and connectivity of
secondary structure are accounted for
- Homologous superfamily: grouped based on whether
an evolutionary relationship exists (clustered at
different levels of sequence ID)
25 SCOP database
- Classification scheme: Class, Fold, Superfamily, and Family
- Class: type and organization of secondary structure
- Fold: share a common core structure, the same
secondary structure elements in the same
arrangement with the same topological connections
- Superfamily: share very similar structure and function
- Family: protein domains share a clear common
evolutionary origin as evidenced by sequence
identity or similar structure/function
26 HMMs also useful at SCOP
- For instance, SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
HMMs are derived from the PDB databank at www.rcsb.org
- Identify sequence signatures for specific domains
27 Structural Alignments
- Various algorithms allow structure vs. structure comparisons
- VAST, DALI
- CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)
also has SSAP and GRATH (one computationally
intensive, one not)
- Sequence similarity to structural families for
modeling is often extracted using PSI-BLAST (Gene3D)
28 Comparison of sequence and structure alignments
1 Taylor WR, Orengo CA, 1989, Protein structure
alignment. J Mol Biol 208:1-22
4 Mueller L, 2003, Protein structure alignment. Paper
presentations 27.51630h
29 Multiple structural alignments
- CORA from CATH (where?)
- MultiProt - http://bioinfo3d.cs.tau.ac.il/MultiProt/
- DMAPS (pre-calculated) http://dmaps.sdsc.edu/
- CE-MC - http://bioinformatics.albany.edu/cemc/
- Others?