Title: Module 2 Sequence DBs and Similarity Searches
1 Module 2: Sequence DBs and Similarity Searches
- Learning objectives
- Understand how information is stored in GenBank.
- Learn how to read a GenBank flat file.
- Learn how to search GenBank for information.
- Understand the difference between header, features, and sequence.
- Learn the difference between a primary database and a secondary database.
- Understand the principle of similarity searches using the BLAST program.
2 What is GenBank?
- Gene sequence database
- Annotated records that represent single contiguous stretches of DNA or RNA; a record may have more than one coding region (limit 350 kb)
- Generated from direct submissions to the DNA sequence databases by the authors.
- Part of the International Nucleotide Sequence Database Collaboration.
3 Exchange of information on a daily basis
International Nucleotide Sequence Database Collaboration:
- GenBank (NCBI)
- EMBL (EBI), United Kingdom
- DDBJ, Japan
4 History of GenBank
- Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965)
- In 1986 it began collaborating with EMBL, and in 1987 with DDBJ.
- It is a primary database (i.e., experimental data is placed into it).
- Examples of secondary databases derived from GenBank/EMBL/DDBJ: Swiss-Prot, PIR.
- The GenBank Flat File (GBFF) is a human-readable form of the records.
5 General Comments on GBFF
- Three sections
- 1) Header: information about the whole record
- 2) Features: description of annotations, each represented by a key.
- 3) Nucleotide sequence: the record ends with // on its last line.
- DNA-centered
- Translated sequence is only a feature
6 Feature Keys
- Purpose
- 1) Indicates biological nature of sequence
- 2) Supplies information about changes to sequences
- Feature Key: Description
- conflict: Separate determinations of the same sequence differ
- rep_origin: Origin of replication
- protein_bind: Protein binding site on DNA
- CDS: Protein coding sequence
7 Feature Keys-Terminology
- Feature Key   Location/Qualifiers
- CDS           23..400
-               /product="alcohol dehydro."
-               /gene="adhI"
- Interpretation: The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called alcohol dehydrogenase, and corresponds to the gene called adhI.
8 Feature Keys-Terminology (Cont.)
- Feature Key   Location/Qualifiers
- CDS           join(544..589,688..1032)
-               /product="T-cell recep. B-ch."
-               /partial
- Interpretation: The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
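The location strings on the two slides above can be read mechanically. The following is a minimal Python sketch, assuming only the simple forms shown in this module (a..b, join(...), complement(...), and the < or > partial markers); the function names parse_location and feature_length are hypothetical, and real records allow more location forms than this handles.

```python
import re

def parse_location(location):
    """Return (segments, on_complement) for a simple GenBank location string.

    segments is a list of (start, end) pairs, 1-based and inclusive.
    Only 'a..b', 'join(...)' and 'complement(...)' are handled here.
    """
    text = location.replace(" ", "")
    on_complement = text.startswith("complement(")
    # Optional '<' or '>' marks a partial end; the digits are the coordinates.
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", text)
    segments = [(int(start), int(end)) for start, end in pairs]
    return segments, on_complement

def feature_length(location):
    """Total number of bases covered by all segments of the location."""
    segments, _ = parse_location(location)
    return sum(end - start + 1 for start, end in segments)

print(parse_location("join(544..589,688..1032)"))
print(feature_length("23..400"))
```

Because coordinates are inclusive, the length of a..b is b - a + 1, so the CDS at 23..400 spans 378 bases.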
9 Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
Annotations:
- PLN: GenBank division (plant, fungal and algal)
- 21-JUN-1999: Modification date
- cds: Coding region
- ACCESSION: Unique identifier (never changes)
- GI: GeneInfo identifier (changes whenever there is a change)
- VERSION: Nucleotide sequence identifier, in accession.version form (changes when there is a change in the sequence)
- KEYWORDS: Word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.
- SOURCE: Common name for the organism
- ORGANISM: Formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database
10 Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA
Annotations:
- REFERENCE 1: Oldest reference first
- MEDLINE: Medline UID
- REFERENCE 3: Submitter of sequence (always the last reference)
11 Record from GenBank (cont.2)
There are three parts to the feature key: a keyword (indicates functional group), a location (instruction for finding the feature), and a qualifier (auxiliary information about a feature).
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
Annotations:
- Labels in the figure: Keys, Location, Qualifiers, Values
- <1..206: Partial sequence on the 5' end. The 3' end is complete.
- /codon_start: Start of open reading frame
- Descriptive free text must be in quotation marks
- /db_xref: Database cross-refs
- /protein_id: Protein sequence ID
- /translation: Note only a partial sequence
12 Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . .
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . .
Annotations:
- New location: marks each gene feature (687..3158 and complement(3300..4037))
- Cutoff: marks each /translation value, which is truncated on the slide
13 Record from GenBank (cont.4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//
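The sequence section is easy to read programmatically because it always sits between the ORIGIN line and the closing // line. Below is a minimal Python sketch, assuming the layout shown on the slide; read_origin_sequence is a hypothetical helper name, and the record string holds only the first two sequence lines from the slide, not the full 5028 bp entry.

```python
from collections import Counter

def read_origin_sequence(record_text):
    """Extract the nucleotide sequence between ORIGIN and // as one string."""
    sequence_parts = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.strip() == "//":
            break
        if in_origin:
            # Keep the 10-base blocks; skip the leading position number.
            sequence_parts.extend(tok for tok in line.split() if tok.isalpha())
    return "".join(sequence_parts)

# Truncated fragment of the record (first two sequence lines only).
record = """\
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
//
"""

sequence = read_origin_sequence(record)
base_count = Counter(sequence)  # per-base tally, comparable to BASE COUNT
print(len(sequence), dict(base_count))
```

On a complete record, the tallies in base_count should agree with the BASE COUNT line.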
14 Primary databases contain experimental biological information
- GenBank/EMBL/DDBJ
- Alu: Alu repeats in human DNA
- dbEST: expressed sequence tags, single-pass cDNA sequences (high error frequency)
- It is non-redundant
- HTGS: high-throughput genomic sequence database (errors!)
- PDB: three-dimensional structure coordinates of biological molecules
- PROSITE: database of protein domain/function relationships.
15 Types of secondary databases that contain biological information
- dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping)
- Genome databases (there are over 20 genome databases that can be searched)
- EPD: eukaryotic promoter database
- NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
- Vector: a subset of GenBank containing vector DNA
- ProDom
- PRINTS
- BLOCKS
16 Workshop 2A: Look up a GenBank record. Use the annotations to determine the first open reading frame.
17 Similarity Searching
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine, each with a CO2- and an NH3+ group]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
18 Purpose of finding differences and similarities of amino acids.
- Infer structural information
- Infer functional information
- Infer evolutionary relationships
19 Evolutionary Basis of Sequence Alignment
1. Similarity: a quantity that relates to how alike two sequences are.
2. Identity: a quantity that describes how alike two sequences are in the strictest terms.
3. Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.
20 Evolutionary Basis of Sequence Alignment (Cont. 1)
1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.
2. Underlined residues are identical. Asterisks and the diamond mark residues that participate in catalysis. Five gaps are placed to optimize the alignment.
21 [Figure: pairwise alignment of mouse trypsin and crayfish trypsin; not transcribed]
22 Evolutionary Basis of Sequence Alignment (Cont. 2)
Why are there regions of identity?
1) Conserved function: residues participate in the reaction.
2) Structural: residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage).
3) Historical: residues that are conserved solely due to a common ancestor gene.
23 Evolutionary Basis of Sequence Alignment (Cont. 3)
Note: It is possible for two proteins to share a high degree of similarity but have two different functions. For example, human gamma-crystallin is a lens protein that has no known enzymatic activity. It shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor, but their functions diverged.
This is analogous to a railroad car and a diner: similar structure, different function.
24 (No Transcript)
25 Modular nature of proteins
- The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity.
- Proteins can be thought of as assemblies of modular domains. It is thought that this may, in some cases, be due to a process known as exon shuffling.
26 Modular nature of proteins (cont. 1)
[Figure: exon shuffling. Duplication: Gene A (Exon 1a, Exon 2a) gains a second copy of Exon 2a. Exchange: Gene A (Exon 1a, Exon 2a) acquires an Exon 3 that is Exon 2b from Gene B, and Gene B (Exon 1b, Exon 2b) acquires an Exon 3 that is Exon 2a from Gene A.]
27 Dot Plots
Window = 1. Note that 25% of the table will be filled due to random chance: there is a 1 in 4 chance of a match at each position.
28 Dot Plots with window 2
[Figure: dot plot with the sequence A T G C C T A G on both axes]
Window = 2. The larger the window, the more noise can be filtered. What is the percent chance that you will get a match at random? 1/16 × 100 = 6.25%
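The effect of the window size can be tried directly. Here is a minimal Python sketch of a dot plot, assuming exact matching of whole windows and a plain-text rendering; dot_plot and render are hypothetical names, and the sequence is the one on the slide axes.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return the set of (i, j) where the windows starting at i and j match exactly."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots

def render(seq_x, seq_y, window=1):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    dots = dot_plot(seq_x, seq_y, window)
    lines = []
    for j in range(len(seq_y)):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(f"{seq_y[j]} {row}")
    return "\n".join(lines)

seq = "ATGCCTAG"
print(render(seq, seq, window=1))
print()
print(render(seq, seq, window=2))

# Chance (in percent) that a window of w nucleotides matches at random: (1/4)^w
chance = {w: 0.25 ** w * 100 for w in (1, 2)}
print(chance)
```

With window 1 the off-diagonal dots are chance matches of single letters; with window 2 only the diagonal survives for this sequence.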
29 Identity Matrix
A            1
C         1  0
I      1  0  0
L   1  0  0  0
    L  I  C  A
Simplest type of scoring matrix
30 Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine, each with a CO2- and an NH3+ group]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
31 Scoring Matrices
- Importance of scoring matrices
- Scoring matrices appear in all analyses involving sequence comparisons.
- The choice of matrix can strongly influence the outcome of the analysis.
- Scoring matrices implicitly represent a particular theory of sequence alignment.
- Understanding the theory underlying a given scoring matrix can aid in making the proper choice when performing sequence alignments.
32 Scoring Matrices
- When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, $M_{11}$ refers to the entry at the first row and the first column. In general, $M_{ij}$ refers to the entry at the $i$th row and the $j$th column. To use this for sequence alignment, we simply associate a numeric value with each letter in the alphabet of the sequence.
- For example, if the alphabet is {A, C, T, G}, then the A/A score is entry 1,1, the A/C score is entry 1,2, etc.
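The index convention can be made concrete with a small lookup. This is a minimal Python sketch, assuming the alphabet order A, C, T, G from the slide and an identity matrix as the scores; INDEX, MATRIX, score, and score_alignment are hypothetical names chosen for illustration.

```python
ALPHABET = "ACTG"

# Map each letter to a 1-based index, matching the M11 / Mij convention.
INDEX = {letter: position for position, letter in enumerate(ALPHABET, start=1)}

# Identity matrix: 1 on the diagonal, 0 elsewhere.
MATRIX = [[1 if row == col else 0 for col in range(len(ALPHABET))]
          for row in range(len(ALPHABET))]

def score(a, b):
    """Look up Mij for letters a and b (Python lists are 0-based, so subtract 1)."""
    return MATRIX[INDEX[a] - 1][INDEX[b] - 1]

def score_alignment(seq_a, seq_b):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    return sum(score(a, b) for a, b in zip(seq_a, seq_b))

print(INDEX)
print(score_alignment("ACTG", "ACGG"))
```

Swapping MATRIX for a PAM or BLOSUM table changes the scores without changing the lookup.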
33 Steps to building the first PAM (Point Accepted Mutation)
- 1. Dayhoff aligned sequences that were at least 85% identical.
- 2. Reconstructed phylogenetic trees and inferred ancestral sequences. 71 trees containing 1,572 aa exchanges were used.
- 3. Tallied aa replacements "accepted" by natural selection, in all pairwise comparisons.
34 Steps to building PAM (cont. 1)
- 4. Computed amino acid mutability, $m_j$ (the propensity of a given amino acid, $j$, to be replaced)
- 5. Combined data from steps 3 and 4 to produce a Mutation Probability Matrix (MPM) for one PAM of evolutionary distance, according to the following formula:
Replacements ($i \neq j$): [formula not transcribed]
$M_{jj} = 1 - m_j$ (the MPM entry of $aa_j$ for $aa_j$)
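Step 5 can be illustrated with a toy calculation. The sketch below is in Python and rests on stated assumptions: the off-diagonal formula is not in the transcript, so it uses the commonly cited Dayhoff form $M_{ij} = m_j A_{ij} / \sum_{k \neq j} A_{kj}$, with any scaling constant folded into $m_j$ so that it agrees with $M_{jj} = 1 - m_j$. The replacement counts and mutabilities are invented for illustration and are not Dayhoff's data.

```python
def mutation_probability_matrix(counts, mutability):
    """Build M where M[i][j] is the probability that residue j is replaced by residue i.

    counts[i][j]  : accepted replacements tallied between residues i and j (i != j)
    mutability[j] : m_j, the probability that residue j changes over one PAM
    """
    n = len(mutability)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(counts[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                matrix[i][j] = 1.0 - mutability[j]  # M_jj = 1 - m_j
            else:
                # Assumed form: share of j's replacements that go to i, times m_j.
                matrix[i][j] = mutability[j] * counts[i][j] / total
    return matrix

residues = ["I", "L", "V"]          # toy three-residue alphabet
counts = [[0, 30, 60],              # hypothetical replacement tallies
          [30, 0, 20],
          [60, 20, 0]]
mutability = [0.010, 0.004, 0.007]  # hypothetical m_j values

M = mutation_probability_matrix(counts, mutability)
for name, row in zip(residues, M):
    print(name, [round(value, 5) for value in row])
```

Each column of M sums to 1, since a residue either stays the same or is replaced by one of the others.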
35 Steps to building PAM (cont. 2)
6. Took the log odds ratio to obtain each score: $S_{ij} = \log(M_{ij}/f_i)$ (Note: this is what you see in the matrix), where $f_i$ is the normalized frequency of $aa_i$ in the sequences used.
7. Note: the values must be scaled by factors of 10 to avoid fractions.
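The log-odds step is a one-line calculation. Below is a minimal Python sketch, assuming a base-10 logarithm, a scale factor of 10, and rounding to the nearest integer; the slide does not specify these details, and the input values are made up.

```python
import math

def log_odds_score(m_ij, f_i, scale=10):
    """Return round(scale * log10(M_ij / f_i)) as an integer score."""
    return round(scale * math.log10(m_ij / f_i))

# A pair seen as often as chance predicts scores 0; more often scores positive.
print(log_odds_score(0.1, 0.1))
print(log_odds_score(0.5, 0.05))
print(log_odds_score(0.01, 0.1))
```

A positive score means the pair is aligned more often than chance predicts, and a negative score means less often.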
36 Assumptions in the PAM model
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.
37 The bottom line on PAM
The score reflects the ratio: frequencies of alignment / frequencies of occurrence
That is, the probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.
38 Sources of error in PAM model
1. Many sequences depart from average aa composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 aa pairs, out of approximately 400 aa pairs, no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM ($M^k$ is the matrix for $k$ PAM).
4. This process (Markov) is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.
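The extrapolation in point 3 is repeated matrix multiplication, which is also why small errors in the 1 PAM matrix compound. Here is a minimal Python sketch, assuming a toy 2 x 2 column-stochastic matrix in place of the real 20 x 20 table; mat_mul and pam_power are hypothetical helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_power(pam1, k):
    """Return the k-th power of the 1 PAM matrix (the k PAM matrix)."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, pam1)
    return result

# Toy 1 PAM matrix (invented values); each column sums to 1.
pam1 = [[0.99, 0.02],
        [0.01, 0.98]]

pam250 = pam_power(pam1, 250)
for row in pam250:
    print([round(value, 4) for value in row])
```

After 250 steps the diagonal entries are far from 1, showing how much change accumulates over that distance.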
39 (No Transcript)
40 BLOSUM Matrices
- BLOSUM is built from distantly related sequences, whereas PAM is built from closely related sequences.
- BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (remember, the BLOCKS database is a secondary database that depends on the PROSITE Family).
41 Gap Penalties
- Take into account insertions and deletions.
- Can't have too many, or the alignment may become meaningless.
- Typically, there is a fixed deduction for introducing a gap plus an additional deduction for the length of the gap.
Gap penalty = $G + Ln$, where $G$ = gap opening penalty, $L$ = gap extension penalty, and $n$ = gap length. $G$ = 2 to 12, $L$ = 2
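The gap penalty formula above translates directly into code. This is a minimal Python sketch; the default values (G = 10, L = 2) are picked from the range on the slide purely for illustration.

```python
def gap_penalty(n, gap_open=10, gap_extend=2):
    """Gap penalty G + L*n for a gap of length n (0 when there is no gap)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n

# One long gap costs less than several short gaps of the same total length.
print(gap_penalty(5))
print(5 * gap_penalty(1))
```

Because the opening cost is paid once per gap, this scheme favors a few long gaps over many scattered ones.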
42 Global Alignment vs. Local Alignment
- Global alignment is used when the overall gene sequence is similar to another sequence; it is often used in multiple sequence alignment.
- Clustal W algorithm (Needleman-Wunsch)
- Local alignment is used when only a small portion of one gene is similar to a small portion of another gene.
- BLAST
- FASTA
- Smith-Waterman algorithm
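The difference between the two approaches shows up in the dynamic programming recurrences. Below is a minimal Python sketch of score-only Needleman-Wunsch and Smith-Waterman, assuming a simple match/mismatch score and a linear gap cost (not the opening-plus-extension scheme of the previous slide). It does not reproduce the heuristics used by Clustal W, BLAST, or FASTA, and the test sequences are invented.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: best alignment over the full length of both sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best alignment between any pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

x = "AAAACGTAAAA"
y = "CCCACGTCCC"
print(global_score(x, y), local_score(x, y))
```

The two sequences share only the short region ACGT, so the local score is high while the global score is dragged down by the unrelated flanks.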
43 Two proteins that are similar in certain regions: Tissue plasminogen activator (PLAT) and Coagulation factor 12 (F12).
44 The Dotter Program
- Program consists of three components
- Sliding window
- A scoring matrix that gives a score for each amino acid
- A graph that converts the score to a dot of a certain pixel density
45 (No Transcript)
46 Region of similarity
47 BLAST
- Basic Local Alignment Search Tool
- Speed is achieved by
- Pre-indexing the database before the search
- Parallel processing
- Uses a hash table that contains neighborhood
words rather than just identical words.
48 Neighborhood words
- The program declares a hit if the word taken from the query sequence has a score > T when a substitution matrix is used.
- This allows the word size, W (similar to the ktup value), to be kept high (for speed) without sacrificing sensitivity.
- If T is increased by the user, the number of background hits is reduced and the program will run faster.
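The role of T can be seen by listing neighborhood words for a small case. The following is a minimal Python sketch under loud assumptions: a toy four-letter alphabet, a toy substitution score (+2 for a match, -1 otherwise) standing in for a real matrix such as BLOSUM62, and the strict "score > T" rule from the slide. The names neighborhood and build_index are hypothetical and do not reflect BLAST's internal code.

```python
from itertools import product

ALPHABET = "ACDE"  # toy four-letter alphabet (assumption, for brevity)

def pair_score(a, b):
    """Toy substitution score (assumption): +2 for a match, -1 otherwise."""
    return 2 if a == b else -1

def neighborhood(word, threshold):
    """Return all words of the same length whose score against `word` exceeds threshold."""
    neighbors = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            neighbors.append("".join(candidate))
    return neighbors

def build_index(query, word_size, threshold):
    """Map each neighborhood word to the query positions that generate it."""
    index = {}
    for pos in range(len(query) - word_size + 1):
        for neighbor in neighborhood(query[pos:pos + word_size], threshold):
            index.setdefault(neighbor, []).append(pos)
    return index

print(neighborhood("AC", 3))   # high T: only the identical word passes
print(neighborhood("AC", 0))   # low T: similar words are also accepted
print(build_index("ACD", 2, 3))
```

Raising T shrinks the table, which means fewer background hits to follow up, at the cost of missing weaker similarities.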
49 Workshop for module 2: Use the Dotter program to determine the optimal alignment between two sequences. Perform a BLAST search on a protein sequence.