Title: Sequence, Classification, Alignment and Structural Libraries
1. Sequence, Classification, Alignment and Structural Libraries
BIOINFORMATICS I
Protein and DNA Sequence Analysis
Jaime E. Ramirez-Vick, PhD
2. Sequence, Classification, Alignment and Structural Libraries
Purpose
- Understand data resources commonly used in sequence analysis
- Be able to identify the major sequence data libraries, and recognize their file formats
- Be able to identify the major classification libraries
- Understand the difference between sequence data libraries and classification libraries and how these resources are used
3. Sequence Libraries
- Libraries (databases) are compilations of known sequences
- Libraries are usually searched to find sequences homologous with a query sequence
- Libraries contain vast amounts of annotation about the sequence
4. Strategies to infer structure and function
[Figure: flowchart of analysis strategies. Box labels: Unannotated sequences; Annotated sequences; Self comparison; Predictions using individual sequence; Sequence Libraries; Find related features using database searches, pairwise alignments; Related sequences; Multiple sequence alignment; Determine common motifs; Alignment and Classification Libraries; Classify well known motifs and structures; Categorized family or function; Probe for distant relatives; Integrate conserved patterns with structure; Literature]
5. Sequence Libraries May Contain
>P1;PSRSAW
phospholipase A2 (EC 3.1.1.4) - western diamondback rattlesnake
SLVQFETLIMKIAGRSGLLWYSAYGCYCGWGGHGLPQDATDRCCFVHDCCYGKATDCN
PKTVSYTYSEENGEIICGGDDPCGTQICECDKAAAICFRDNIPSYDNKYWLFPPKDCREEPEPC
N;Alternate names: phosphatidylcholine 2-acylhydrolase
C;Species: Crotalus atrox (western diamondback rattlesnake)
C;Date: 17-Mar-1987 sequence_revision 17-Mar-1987 text_change 14-Feb-1997
C;Accession: A00764
R;Randolph, A.; Heinrikson, R.L.
J. Biol. Chem. 257, 2155-2161, 1982
A;Title: Crotalus atrox phospholipase A-2. Amino acid sequence and studies on the function of the NH-2-terminal region.
A;Reference number: A92362 MUID:82142299
A;Accession: A00764
A;Molecule type: protein
A;Residues: 1-122 <RAN>
R;Brunie, S.; Sigler, P.B.
submitted to the Brookhaven Protein Data Bank, March 1986
A;Reference number: A50318 PDB:1PP2
A;Contents: annotation X-ray crystallography, 2.5 angstroms residues 1-122
R;Brunie, S.; Bolin, J.; Gewirth, D.; Sigler, P.B.
J. Biol. Chem. 260, 9742-9749, 1985
A;Title: The refined crystal structure of dimeric phospholipase A-2 at 2.5 angstrom. Access to a shielded catalytic center.
A;Reference number: A92550 MUID:85261386
A;Contents: annotation X-ray crystallography, 2.5 angstroms active sites binding sites
R;Keith, C.; Feldman, D.S.; Deganello, S.; Glick, J.; Ward, K.B.; Jones, E.O.; Sigler, P.B.
J. Biol. Chem. 256, 8602-8607, 1981
A;Title: The 2.5-angstrom crystal structure of a dimeric phospholipase A-2 from the venom of Crotalus atrox.
A;Reference number: A92336 MUID:81264275
A;Contents: annotation X-ray crystallography, 2.5 angstroms
C;Complex: homodimer
C;Function:
A;Description: catalyzes hydrolysis of 1,2-diacyl-sn-glycero-3-phosphocholine to 1-acyl-sn-3-glycero-phosphocholine and fatty acid
A;Note: the reaction is strongly enhanced when the phospholipid is condensed into a micellar aggregate
C;Superfamily: phospholipase A2
C;Keywords: calcium; carboxylic ester hydrolase; homodimer; lipid degradation; metalloprotein; toxin; venom
F;4,64/Binding site: micellar substrate (Gln, Tyr) status experimental
F;26-115,28-44,43-95,49-122,50-88,57-81,75-86/Disulfide bonds: status experimental
F;27,29,31,48/Binding site: calcium (Tyr, Gly, Gly, Asp) status predicted
F;47,89/Active site: His, Asp status experimental
- Amino acids/bases
- Text identifying the sequence species
- Journal references and citations
- Sequence features (repeats, structures, functions, etc.)
- Cross references to other data sources
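To make the parts of a library entry concrete, here is a minimal sketch that splits an abbreviated NBRF-PIR style record into its identifier, description, residues, and tagged annotation lines. The function name `parse_pir` is hypothetical, and the tag layout (a one-letter code, a semicolon, then `Field: value`) is an assumption read off the example entry on this slide. Real PIR files may carry details this sketch ignores.

```python
from collections import defaultdict

# Abbreviated record based on the example entry; only a few annotation lines are kept.
RECORD = """\
>P1;PSRSAW
phospholipase A2 (EC 3.1.1.4) - western diamondback rattlesnake
SLVQFETL
IMKIAGRSGLLWYSAYGCYCGWGGH
GLPQDATDRCCFVHDCCYGKATDCN
PKTVSYTYSEENGEIICGGDDPCG
TQICECDKAAAICFRDNIPSYDNKY
WLFPPKDCREEPEPC
C;Species: Crotalus atrox (western diamondback rattlesnake)
C;Accession: A00764
C;Superfamily: phospholipase A2
F;47,89/Active site: His, Asp
"""


def parse_pir(text):
    """Split a PIR-style record into id, description, sequence, and annotations."""
    lines = [ln.rstrip() for ln in text.strip().splitlines()]
    # Header line: ">P1;ID", where the code before the semicolon is the sequence type.
    seq_type, entry_id = lines[0][1:].split(";", 1)
    entry = {
        "type": seq_type,
        "id": entry_id,
        "description": lines[1],
        "sequence": "",
        "annotations": defaultdict(list),
    }
    for ln in lines[2:]:
        if len(ln) > 1 and ln[1] == ";":
            # Annotation line: one-letter tag, semicolon, then "Field: value".
            tag, rest = ln[0], ln[2:]
            key, _, value = rest.partition(":")
            entry["annotations"][tag].append((key.strip(), value.strip()))
        else:
            # Anything else is treated as sequence; drop spaces and a trailing "*".
            entry["sequence"] += ln.replace(" ", "").rstrip("*")
    return entry


entry = parse_pir(RECORD)
print(entry["id"], len(entry["sequence"]))
print(entry["annotations"]["F"])
```

The feature line can be checked against the residues: positions 47 and 89 (counting from 1) are His and Asp in the parsed sequence, which agrees with the active-site annotation.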
6. Major (Primary) Sequence Data Libraries
- Nucleic Acids
  - Because of formal data sharing between the databases, only one library (GenBank) needs to be searched
  - GenBank http://www.ncbi.nlm.nih.gov/
  - EMBL http://www.ebi.ac.uk
- Protein
  - Only informal data sharing between the databases
  - NBRF-PIR http://nbrfa.georgetown.edu/
  - Swiss-Prot http://www.ebi.ac.uk/
  - NRL_3d - Sequence library derived from PDB (search it to find whether a 3D structure is known for a family/domain)
  - GenPept - Identified protein sequences contained in GenBank
7. Sequence and Structure Libraries
- Non-redundant protein collections (NRPC)
  - Useful because protein data libraries only share data informally; NRPCs attempt to merge libraries by removing redundant sequences
  - PIR-NREF - PIR plus non-redundant sequences taken from a variety of databases
  - PATCHX - a non-redundant collection of public domain sequence databases, excluding protein sequences in the PIR database
  - TREMBL - a non-redundant collection of identified protein sequences contained in EMBL, excluding sequences already in SWISS-PROT
  - OWL - a composite, non-redundant database assembled from a number of primary sources, including translations of nucleic acid sequences
- Non-redundant usually does not mean without duplicates
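The merging idea can be sketched as follows, assuming the simplest possible rule: two entries are redundant only if their sequences are identical. The function `merge_nonredundant`, the library contents, and the entry names are all hypothetical. The actual collections listed above each apply their own criteria, which this sketch does not reproduce.

```python
def merge_nonredundant(*libraries):
    """Merge libraries, keeping the first entry seen for each exact sequence."""
    first_name_for = {}
    for library in libraries:
        for name, seq in library.items():
            # Exact-match rule: identical residue strings count as redundant.
            first_name_for.setdefault(seq.upper(), name)
    return {name: seq for seq, name in first_name_for.items()}


# Toy libraries with made-up peptides (not real database entries).
library_a = {"A1": "MKTAYIAKQR", "A2": "GSHMLEDPVA"}
library_b = {
    "B1": "MKTAYIAKQR",  # exact copy of A1, so it is dropped
    "B2": "MKTAYIAKQK",  # differs from A1 by one residue, so it is kept
}

merged = merge_nonredundant(library_a, library_b)
print(sorted(merged))
```

Under this rule the near-identical entry `B2` survives the merge, which is one way a collection labelled non-redundant can still contain sequences that look like duplicates.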
8. Examples of Sequences
9. Classification and Alignment Libraries
- Classification libraries are typically built from a set of (related) aligned sequences; thus classification libraries can be thought of as collections of abstract representations of multiple sequence alignments
- A variety of abstract representations are used in classification libraries
- Some libraries contain detailed information describing the family, structure or function
- Classification libraries are a good resource to use to quickly identify an unknown sequence, or to find additional related sequences
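One simple abstract representation of an alignment is a per-column residue frequency table, from which a consensus can be read and against which a candidate segment can be scored. The sketch below assumes an ungapped block of equal-length segments. The segments are invented for illustration, and the scoring rule (sum of observed column frequencies) is a deliberately plain choice, not the method of any particular library.

```python
from collections import Counter


def column_counts(block):
    """Count residues in each column of an ungapped alignment block."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in the block must have the same length")
    return [Counter(column) for column in zip(*block)]


def consensus(block):
    """Most frequent residue in each column."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(block))


def frequency_score(block, segment):
    """Sum, over columns, of the fraction of block rows sharing the segment's residue."""
    counts = column_counts(block)
    depth = len(block)
    return sum(counts[i][residue] / depth for i, residue in enumerate(segment))


# Hypothetical aligned segments (not taken from a real family).
block = ["LKELDLSN", "LEELDLSG", "LRRLDLSN", "LKTLNLSN"]

print(consensus(block))
print(frequency_score(block, "LKELDLSN"))
print(frequency_score(block, "AAAAAAAA"))
```

A segment resembling the block scores high and an unrelated one scores zero, which is the basic behaviour a classification library relies on when it proposes a family for a query.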
10. Classification and Alignment Libraries
- Classification libraries are generally used to generate a hypothesis as to the structure/function or family of the query sequence
- The hypothesis is NOT always correct (be especially suspicious of short, highly ambiguous patterns)
[Figure: the query yetdfqrltklrmlqltdnqihti is submitted to a classification library, which returns the result "This sequence might contain a Leucine Rich Repeat"]
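A rough calculation shows why short, ambiguous patterns deserve suspicion. The sketch below estimates how many matches a pattern would produce purely by chance, under two stated simplifications: all 20 amino acids are equally likely, and positions are independent. The pattern and the function names are hypothetical.

```python
def chance_match_probability(pattern, alphabet_size=20):
    """Probability that one random window matches (uniform, independent residues)."""
    prob = 1.0
    for allowed in pattern:
        if allowed is not None:  # None stands for "any residue"
            prob *= len(allowed) / alphabet_size
    return prob


def expected_chance_hits(pattern, sequence_length, alphabet_size=20):
    """Expected number of chance matches in a random sequence of the given length."""
    windows = max(sequence_length - len(pattern) + 1, 0)
    return windows * chance_match_probability(pattern, alphabet_size)


# Hypothetical short pattern: Leu, any, any, Leu, any, Leu.
short_pattern = ["L", None, None, "L", None, "L"]

print(expected_chance_hits(short_pattern, 1_000))
print(expected_chance_hits(short_pattern, 1_000_000))
```

With only three constrained positions, this model predicts on the order of a hundred chance matches per million residues searched. Real proteins contain leucine more often than the uniform model assumes, so the true chance rate for such a pattern would be higher still.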
11. Classification and Alignment Libraries
- Strategy: use several libraries and weight the hypotheses generated accordingly
- Protein classification libraries
  - PROSITE - a dictionary of sites and patterns
    - http://www.expasy.ch/prosite.html
  - BLOCKS - multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
    - http://www.blocks.fhcrc.org/
  - PRINTS - conserved group of motifs used to characterize a protein family
    - http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
  - Pfam - a collection of alignments and hidden Markov models covering most common protein domains (complete domain)
    - http://pfam.wustl.edu/
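Pattern-based libraries such as PROSITE describe a site as a short string of allowed, forbidden, and wildcard positions. As a sketch, the function below translates the common elements of that notation into a Python regular expression: `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` or `>` for sequence ends. The helper `prosite_to_regex` is hypothetical and covers only these elements. `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern, while `L-x(2)-L-x-L` is an invented leucine-rich pattern used purely for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # "[..]" groups and single letters are already valid regex syntax.
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


query = "yetdfqrltklrmlqltdnqihti".upper()  # example query sequence from these slides

for pattern in ("N-{P}-[ST]-{P}", "L-x(2)-L-x-L"):
    regex = prosite_to_regex(pattern)
    hit = re.search(regex, query)
    print(pattern, regex, hit.group() if hit else None)
```

The invented three-leucine pattern finds a hit in the query while the N-glycosylation pattern does not. A hit from a pattern this short is only a hypothesis, which is the reason for consulting several libraries before drawing a conclusion.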
12. Classification and Alignment Libraries
- Nucleic acid classification libraries
  - Transcription Factor Database
    - http://transfac.gbf.de/TRANSFAC/
  - Restriction Enzyme Database
    - http://rebase.neb.com/
- Structural classification and alignment libraries
contain detailed 3d-structural information for
certain macromolecules organized in a variety of
categories including - Family
- Superfamily
- Fold
- Secondary Structure (e.g. Helix, Sheet, Turns)
- Structural classification libraries may be useful
to infer what the underlying structure might be
However structural predictions are only reliable
when there is high degree of similarity.
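One common way to put a number on "degree of similarity" is percent identity over a pairwise alignment. The sketch below uses one convention among several: columns containing a gap are ignored, and the denominator is the number of remaining columns. The aligned strings are invented, and no particular reliability cutoff is implied.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Hypothetical pairwise alignment of a query against a sequence of known structure.
query_row = "MKTAYIAK-QR"
known_row = "MKSAYLAKGQR"

print(percent_identity(query_row, known_row))
```

Other conventions divide by the full alignment length or by the shorter sequence length, so identities quoted by different tools are not always directly comparable.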
14. Classification and Alignment Libraries
- Structural classification and alignment
  - DSSP - Database of secondary structure assignments
    - http://www.sander.embl-heidelberg.de/dssp/
  - FSSP - Fold classification based on structure-structure alignment of proteins
    - http://www2.embl-ebi.ac.uk/dali/fssp
  - HSSP - Homology-derived secondary structure of proteins
    - http://www.sander.embl-heidelberg.de/hssp
  - 3d_ALI - a database of aligned protein structures and related sequences
    - http://www.embl-heidelberg.de/argos/ali/ali.html
  - SCOP - Structural Classification of Proteins
    - http://scop.mrc-lmb.cam.ac.uk/scop
15. Structure Libraries
- Structure libraries contain the actual three-dimensional coordinates of a macromolecule
- Cambridge Structural Database
  - Small molecules (100 atoms)
  - For more information see
    - http://www.psc.edu/general/software/packages/cambridge/cambridge.html
    - http://www.ccdc.cam.ac.uk/
- Protein Data Bank (PDB)
  - Large molecules (1000 atoms)
  - For more information see
    - http://www.psc.edu/general/software/packages/pdb/pdb.html
    - http://www.rcsb.org/pdb/
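In the legacy PDB text format, each atom's coordinates sit in fixed columns of an `ATOM` or `HETATM` line. The sketch below reads the main fields by column position. The `Atom` type and `parse_atom_line` are hypothetical helpers, and the example line is generated with made-up coordinates rather than copied from a real entry.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Read the fixed-width fields of one ATOM/HETATM record."""
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain=line[21],                 # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
    )


# Build a correctly aligned example line; the coordinate values are illustrative only.
example = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
    "ATOM", 1, " N", "", "SER", "A", 1, "", 11.104, 6.134, -6.504
)

print(repr(example))
print(parse_atom_line(example))
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in this format can run together without a separating space.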
16. Sequence and Structure Libraries
- PSdb - Protein Structure Database
  - Secondary and tertiary structures classified on a per-residue basis
  - Helices, sheets and turns are defined according to the literature definitions, not the authors' preferences
  - Tertiary structures defined by both Eisenberg classifications and a unique structure classification method
  - For more information see
    - http://www.psc.edu/biomed/pages/research/PSdb/PSdbPaper/PSdbPaper.html
    - http://www.psc.edu/deerfiel/PSdb/