Sequence, Classification, Alignment and Structural Libraries - PowerPoint PPT Presentation

Slides: 17
Provided by: Ramire

1
Sequence, Classification, Alignment and
Structural Libraries
BIOINFORMATICS I: Protein and DNA Sequence Analysis
Jaime E. Ramirez-Vick, PhD
2
Sequence, Classification, Alignment and
Structural Libraries
Purpose
  • Understand data resources commonly used in
    sequence analysis
  • Be able to identify the major sequence data
    libraries, and recognize their file formats
  • Be able to identify the major classification
    libraries
  • Understand the difference between sequence data
    libraries and classification libraries and how
    these resources are used

3
Sequence Libraries
  • Libraries (Databases) are compilations of known
    sequences
  • Libraries are usually searched to find sequences
    homologous with a query sequence
  • Libraries contain vast amounts of annotation
    about each sequence

4
Strategies to infer structure and function
[Flowchart; box labels as extracted: Unannotated sequences; Annotated sequences; Related sequences; Literature; Sequence Libraries; Alignment and Classification Libraries; Find related features using database searches; Pairwise alignments; Multiple sequence alignment; Self comparison; Determine common motifs; Classify well-known motifs and structures; Probe for distant relatives; Predictions using individual sequence; Integrate conserved patterns with structure; Categorized family or function]
5
Sequence Libraries May Contain
>P1;PSRSAW
phospholipase A2 (EC 3.1.1.4) - western diamondback rattlesnake
S L V Q F E T L I M K I A G R S G L L W Y S A Y G C Y C G W G G H
G L P Q D A T D R C C F V H D C C Y G K A T D C N
P K T V S Y T Y S E E N G E I I C G G D D P C G
T Q I C E C D K A A A I C F R D N I P S Y D N K Y
W L F P P K D C R E E P E P C *
N;Alternate names: phosphatidylcholine 2-acylhydrolase
C;Species: Crotalus atrox (western diamondback rattlesnake)
C;Date: 17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change 14-Feb-1997
C;Accession: A00764
R;Randolph, A.; Heinrikson, R.L.
J. Biol. Chem. 257, 2155-2161, 1982
A;Title: Crotalus atrox phospholipase A-2. Amino acid sequence and studies on the function of the NH-2-terminal region.
A;Reference number: A92362; MUID:82142299
A;Accession: A00764
A;Molecule type: protein
A;Residues: 1-122 <RAN>
R;Brunie, S.; Sigler, P.B.
submitted to the Brookhaven Protein Data Bank, March 1986
A;Reference number: A50318; PDB:1PP2
A;Contents: annotation; X-ray crystallography, 2.5 angstroms; residues 1-122
R;Brunie, S.; Bolin, J.; Gewirth, D.; Sigler, P.B.
J. Biol. Chem. 260, 9742-9749, 1985
A;Title: The refined crystal structure of dimeric phospholipase A-2 at 2.5 angstrom. Access to a shielded catalytic center.
A;Reference number: A92550; MUID:85261386
A;Contents: annotation; X-ray crystallography, 2.5 angstroms; active sites; binding sites
R;Keith, C.; Feldman, D.S.; Deganello, S.; Glick, J.; Ward, K.B.; Jones, E.O.; Sigler, P.B.
J. Biol. Chem. 256, 8602-8607, 1981
A;Title: The 2.5-angstrom crystal structure of a dimeric phospholipase A-2 from the venom of Crotalus atrox.
A;Reference number: A92336; MUID:81264275
A;Contents: annotation; X-ray crystallography, 2.5 angstroms
C;Complex: homodimer
C;Function:
A;Description: catalyzes hydrolysis of 1,2-diacyl-sn-glycero-3-phosphocholine to 1-acyl-sn-3-glycero-phosphocholine and fatty acid
A;Note: the reaction is strongly enhanced when the phospholipid is condensed into a micellar aggregate
C;Superfamily: phospholipase A2
C;Keywords: calcium; carboxylic ester hydrolase; homodimer; lipid degradation; metalloprotein; toxin; venom
F;4,64/Binding site: micellar substrate (Gln, Tyr) #status experimental
F;26-115,28-44,43-95,49-122,50-88,57-81,75-86/Disulfide bonds: #status experimental
F;27,29,31,48/Binding site: calcium (Tyr, Gly, Gly, Asp) #status predicted
F;47,89/Active site: His, Asp #status experimental
  • Amino acids/bases
  • Text identifying the sequence species
  • Journal references and citations
  • Sequence Features (Repeats, structures,
    functions, etc.)
  • Cross references to other data sources
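
A library record of this kind can be split into the parts listed above with a few lines of code. The sketch below is only an illustration: it assumes the NBRF-PIR convention of a ">P1;ID" header, a title line, a sequence ending in "*", and annotation lines of the form "X;Field: value". The function name parse_pir and the shortened sample record are hypothetical.

```python
from collections import defaultdict

# Shortened sample in NBRF-PIR style (sequence truncated for brevity).
SAMPLE = """\
>P1;PSRSAW
phospholipase A2 (EC 3.1.1.4) - western diamondback rattlesnake
S L V Q F E T L I M K I A G R S G L L W Y S A Y G C Y C G W G G H *
C;Species: Crotalus atrox (western diamondback rattlesnake)
C;Accession: A00764
C;Superfamily: phospholipase A2
F;47,89/Active site: His, Asp #status experimental
"""


def parse_pir(text):
    """Split one PIR-style record into id, title, sequence and fields."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].startswith(">"):
        raise ValueError("expected a '>' header line followed by a title")
    seq_type, _, entry_id = lines[0][1:].partition(";")
    record = {
        "type": seq_type,
        "id": entry_id,
        "title": lines[1].strip(),
        "sequence": "",
        "fields": defaultdict(list),
    }
    in_sequence = True
    for line in lines[2:]:
        if in_sequence:
            # Residues may be space separated; '*' marks the end.
            record["sequence"] += line.replace(" ", "").rstrip("*")
            if line.rstrip().endswith("*"):
                in_sequence = False
            continue
        code, sep, rest = line.partition(";")
        if sep and len(code) == 1:
            key, _, value = rest.partition(":")
            record["fields"][(code, key.strip())].append(value.strip())
    return record


record = parse_pir(SAMPLE)
print(record["id"], len(record["sequence"]))
print(record["fields"][("C", "Species")])
```

Reference lines that carry no one-letter code (such as the journal line after an "R;" line) are skipped by this sketch; a fuller parser would attach them to the preceding reference.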

6
Major (Primary) Sequence Data Libraries
  • Nucleic Acids
      • Because of formal data sharing between the
        databases, only one library (GenBank) needs to be
        searched
      • GenBank http://www.ncbi.nlm.nih.gov/
      • EMBL http://www.ebi.ac.uk
  • Protein
      • Only informal data sharing between the databases
      • NBRF-PIR http://nbrfa.georgetown.edu/
      • Swiss-Prot http://www.ebi.ac.uk/
      • NRL_3d - Sequence library derived from PDB
        (search it to find out whether a 3D structure is
        known for a family/domain)
      • GenPept - Identified protein sequences contained
        in GenBank
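
Recognizing a library's file format usually comes down to the first line of an entry. The following is a minimal sketch, assuming the common conventions that GenBank-style flat files open with "LOCUS", EMBL and Swiss-Prot entries open with an "ID" line, NBRF-PIR entries open with ">P1;" (or another two-character type code), and FASTA entries open with a bare ">". The function name guess_format is hypothetical.

```python
def guess_format(text):
    """Guess the sequence file format from the first non-blank line."""
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    if first.startswith("LOCUS"):
        return "GenBank/GenPept flat file"
    if first.startswith("ID   "):
        return "EMBL/Swiss-Prot flat file"
    # PIR headers look like '>P1;ENTRY': a ';' follows the 2-char type code.
    if first.startswith(">") and len(first) > 3 and first[3] == ";":
        return "NBRF-PIR"
    if first.startswith(">"):
        return "FASTA"
    return "unknown"


print(guess_format(">P1;PSRSAW\nphospholipase A2\nSLVQ*"))
print(guess_format(">seq1 some description\nACGT"))
```

This only inspects the header, so it cannot tell EMBL from Swiss-Prot; the rest of the ID line would be needed for that.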

7
Sequence and Structure Libraries
  • Non-redundant protein collections (NRPC)
      • Useful because protein data libraries share data
        only informally; NRPCs attempt to merge libraries
        by removing redundant sequences
      • PIR-NREF - PIR plus non-redundant sequences taken
        from a variety of databases
      • PATCHX - a non-redundant collection of public
        domain sequence databases, excluding protein
        sequences in the PIR database
      • TREMBL - a non-redundant collection of identified
        protein sequences contained in EMBL, excluding
        sequences already in SWISS-PROT
      • OWL - a composite, non-redundant database assembled
        from a number of primary sources, including
        translations of nucleic acid sequences
  • "Non-redundant" usually does not mean without duplicates
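
One way to see why a merged collection can still contain near-duplicates is to merge on exact sequence identity only. This is a sketch under that assumption, not a description of how any of the collections above are actually built; the function name and the entry names are hypothetical.

```python
def merge_nonredundant(*libraries):
    """Merge (name, sequence) libraries, collapsing exact duplicates.

    Returns a dict mapping each distinct sequence to the names under
    which it was found.
    """
    merged = {}
    for library in libraries:
        for name, seq in library:
            merged.setdefault(seq.upper(), []).append(name)
    return merged


lib_a = [("PIR:A", "MKTAYIAK"), ("PIR:B", "GSHMLEDP")]
lib_b = [("SP:X", "mktayiak"), ("SP:Y", "MKTAYIAR")]

for seq, names in merge_nonredundant(lib_a, lib_b).items():
    print(seq, names)
```

The entries PIR:A and SP:X collapse into one, but SP:Y, which differs by a single residue, survives as a separate entry. Variants, fragments, and sequencing errors stay in the merged set for the same reason.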

8
Examples of Sequences
9
Classification and Alignment Libraries
  • Classification libraries are typically built from
    a set of (related) aligned sequences; thus
    classification libraries can be thought of as
    collections of abstract representations of
    multiple sequence alignments
  • A variety of abstract representations are used in
    classification libraries.
  • Some libraries contain detailed information
    describing the family, structure or function.
  • Classification libraries are a good resource for
    quickly identifying an unknown sequence or
    finding additional related sequences.
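
As a toy illustration of an "abstract representation" of an alignment, the sketch below turns a set of ungapped, equal-length aligned segments into a pattern written in a PROSITE-like notation. It assumes no gaps and treats every column independently; real libraries use richer representations (blocks, fingerprints, hidden Markov models). The function name and example segments are hypothetical.

```python
def alignment_to_pattern(aligned, max_alternatives=3):
    """Summarize ungapped aligned segments as a PROSITE-like pattern.

    Conserved columns become a single residue, columns with a few
    alternatives become [ABC], and anything more variable becomes x.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all segments must have the same length")
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        elif len(residues) <= max_alternatives:
            parts.append("[" + "".join(residues) + "]")
        else:
            parts.append("x")
    return "-".join(parts)


segments = ["CAGH", "CSGH", "CAGY", "CTAH"]
print(alignment_to_pattern(segments))
print(alignment_to_pattern(segments, max_alternatives=2))
```

Lowering max_alternatives makes the pattern more permissive (more x positions), which trades specificity for sensitivity.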

10
Classification and Alignment Libraries
  • Classification libraries are generally used to
    generate a hypothesis about the structure,
    function, or family of the query sequence.
  • The hypothesis is NOT always correct. (Be especially
    suspicious of short, highly ambiguous patterns.)

[Figure: the query yetdfqrltklrmlqltdnqihti is submitted to a classification library; result: "This sequence might contain a Leucine Rich Repeat"]
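
A rough calculation shows why short, ambiguous patterns deserve suspicion. The sketch below estimates how many matches a pattern would produce by chance, assuming (unrealistically) that all 20 amino acids are equally frequent and positions are independent. The pattern used is a made-up example, and the function name is hypothetical.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def expected_random_hits(pattern_sets, n_positions, alphabet_size=20):
    """Expected chance matches of a pattern over n_positions start sites.

    pattern_sets holds one set of allowed residues per pattern position.
    Assumes uniform, independent residue frequencies.
    """
    probability = 1.0
    for allowed in pattern_sets:
        probability *= len(allowed) / alphabet_size
    return probability * n_positions


ANY = AMINO_ACIDS
# Hypothetical short pattern L-x-x-L-x-L: only three positions are fixed.
short_pattern = [set("L"), ANY, ANY, set("L"), ANY, set("L")]
print(expected_random_hits(short_pattern, 1_000_000))
```

With only three constrained positions, a library of about a million residues is expected to yield on the order of a hundred chance matches, so a single hit says very little on its own.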
11
Classification and Alignment Libraries
  • Strategy: use several libraries and weight the
    hypotheses generated accordingly.
  • Protein classification libraries
      • PROSITE - A dictionary of sites and patterns
          • http://www.expasy.ch/prosite.html
      • BLOCKS - multiply aligned ungapped segments
        corresponding to the most highly conserved
        regions of proteins.
          • http://www.blocks.fhcrc.org/
      • PRINTS - Conserved group of motifs used to
        characterize a protein family.
          • http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
      • Pfam - a collection of alignments and hidden
        Markov models covering most common protein
        domains (Complete domain)
          • http://pfam.wustl.edu/
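
A PROSITE-style pattern can be scanned against a sequence by translating it into a regular expression. The sketch below handles only a subset of the pattern syntax (single residues, x, [..] alternatives, {..} exclusions, (n) and (n,m) repeats, and the < and > anchors); the function names are hypothetical, and the test sequence is invented.

```python
import re

_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    out = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        out.append(prefix + core + suffix)
    return "".join(out)


def scan(pattern, sequence):
    """Return (1-based position, matched text) for all matches, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation site pattern: N, not P, S or T, not P.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("N-{P}-[ST]-{P}.", "MKNGSAANPTL"))
```

The lookahead wrapper is a design choice so that overlapping matches are all reported, which a plain finditer would miss.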

12
Classification and Alignment Libraries
  • Nucleic Acid classification libraries
      • Transcription Factor Database
          • http://transfac.gbf.de/TRANSFAC/
      • Restriction Enzyme Database
          • http://rebase.neb.com/
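
The kind of lookup a restriction enzyme library supports can be sketched with a small hard-coded table. The three recognition sequences below (EcoRI, BamHI, HindIII) are well known; the table, the function name, and the test sequence are otherwise illustrative and not taken from the database itself.

```python
# Recognition sequences for three common enzymes. All three sites are
# palindromic, so scanning one strand is sufficient here.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna, sites=SITES):
    """Return 1-based start positions of each recognition site in dna."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)
            start = dna.find(site, start + 1)
        hits[enzyme] = positions
    return hits


print(find_sites("ccGAATTCaaGGATCCttGAATTC"))
```

For enzymes with non-palindromic or degenerate sites, the reverse complement and ambiguity codes would also have to be handled.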

13
Classification and Alignment Libraries
  • Structural classification and alignment libraries
    contain detailed 3D structural information for
    certain macromolecules, organized in a variety of
    categories including
      • Family
      • Superfamily
      • Fold
      • Secondary Structure (e.g. Helix, Sheet, Turns)
  • Structural classification libraries may be useful
    to infer what the underlying structure might be.
    However, structural predictions are only reliable
    when there is a high degree of similarity.

14
Classification and Alignment Libraries
  • Structural classification and alignment libraries
      • DSSP - Database of secondary structure
        assignments
          • http://www.sander.embl-heidelberg.de/dssp/
      • FSSP - Fold classification based on
        structure-structure alignment of proteins
          • http://www2.embl-ebi.ac.uk/dali/fssp
      • HSSP - Homology derived secondary structure of
        proteins
          • http://www.sander.embl-heidelberg.de/hssp
      • 3d_ALI - a database of aligned protein structures
        and related sequences
          • http://www.embl-heidelberg.de/argos/ali/ali.html
      • SCOP - Structural Classification of Proteins
          • http://scop.mrc-lmb.cam.ac.uk/scop
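
Secondary structure assignments of the DSSP kind are commonly written as one letter per residue. The sketch below collapses the eight DSSP codes into three states (helix, strand, coil) using one widely used convention (H, G, I to helix; E, B to strand; everything else to coil). Other reductions exist, and the example string is invented.

```python
from collections import Counter

# One common 8-state to 3-state reduction; other mappings are in use.
THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(assignment):
    """Map a per-residue DSSP string to helix (H), strand (E), coil (C)."""
    return "".join(THREE_STATE.get(code, "C") for code in assignment)


def composition(three_state):
    """Fraction of residues in each of the three states."""
    counts = Counter(three_state)
    total = len(three_state)
    return {state: counts[state] / total for state in "HEC"} if total else {}


reduced = reduce_dssp("HHHHGGTTEEEESB-")
print(reduced)
print(composition(reduced))
```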

15
Structure Libraries
  • Structure libraries contain the actual
    three-dimensional coordinates of a macromolecule.
  • Cambridge Structural Database
      • Small Molecules (100 atoms)
      • For more information see
          • http://www.psc.edu/general/software/packages/cambridge/cambridge.html
          • http://www.ccdc.cam.ac.uk/
  • Protein Data Bank (PDB)
      • Large Molecules (1000 atoms)
      • For more information see
          • http://www.psc.edu/general/software/packages/pdb/pdb.html
          • http://www.rcsb.org/pdb/
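
Coordinates in a PDB-format file sit in fixed columns of the ATOM and HETATM records, so they can be read by slicing each line. This is a minimal sketch that assumes the standard column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-38, 39-46, 47-54). The two sample lines carry made-up coordinates, and the function names are hypothetical.

```python
SAMPLE_PDB = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "END",
])


def parse_atoms(pdb_text):
    """Extract atom records from PDB-format text using fixed columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "serial": int(line[6:11]),
                "name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (
                    float(line[30:38]),
                    float(line[38:46]),
                    float(line[46:54]),
                ),
            })
    return atoms


def centroid(atoms):
    """Arithmetic mean of the atom coordinates."""
    n = len(atoms)
    return tuple(sum(a["xyz"][i] for a in atoms) / n for i in range(3))


atoms = parse_atoms(SAMPLE_PDB)
print(atoms[0]["name"], atoms[0]["xyz"])
print(centroid(atoms))
```

Splitting on whitespace instead of slicing is tempting but breaks when adjacent fields run together, which is why the fixed columns are used.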

16
Sequence and Structure Libraries
  • PSdb - Protein Structure Database
      • Secondary and tertiary structures classified on a
        per-residue basis
      • Helices, sheets and turns are defined according to
        the literature definitions, not the authors'
        preferences.
      • Tertiary structures defined by both Eisenberg
        classifications and a unique structure
        classification method.
      • For more information see
          • http://www.psc.edu/biomed/pages/research/PSdb/PSdbPaper/PSdbPaper.html
          • http://www.psc.edu/deerfiel/PSdb/