Module 2 Sequence DBs and Similarity Searches - PowerPoint PPT Presentation

Provided by: jmomand

1
Module 2: Sequence DBs and Similarity Searches
  • Learning objectives
  • Understand how information is stored in GenBank.
  • Learn how to read a GenBank flat file.
  • Learn how to search GenBank for information.
  • Understand the difference between the header, features,
    and sequence sections of a record.
  • Learn the difference between a primary database
    and a secondary database.
  • Understand the principle of similarity searches using the
    BLAST program.

2
What is GenBank?
  • Gene sequence database
  • Annotated records that each represent a single
    contiguous stretch of DNA or RNA; a record may contain
    more than one coding region (limit 350 kb)
  • Generated from direct submissions by the authors to the
    DNA sequence databases.
  • Part of the International Nucleotide Sequence
    Database Collaboration.

3
International Nucleotide Sequence Database Collaboration
  • GenBank (NCBI)
  • EMBL (EBI), United Kingdom
  • DDBJ, Japan
  • The three databases exchange information on a daily basis.
4
History of GenBank
  • Began with the Atlas of Protein Sequence and
    Structure (Dayhoff et al., 1965)
  • In 1986 it began collaborating with EMBL, and in 1987
    with DDBJ.
  • It is a primary database (i.e., experimental data
    is placed into it).
  • Examples of secondary databases derived from
    GenBank/EMBL/DDBJ: Swiss-Prot, PIR.
  • The GenBank Flat File (GBFF) is a human-readable form of the
    records.

5
General Comments on GBFF
  • Three sections
  • 1) Header: information about the whole record
  • 2) Features: description of the annotations, each
    represented by a key
  • 3) Nucleotide sequence: ends with // on the last
    line of the record
  • DNA-centered
  • Translated sequence is only a feature

6
Feature Keys
  • Purpose
  • 1) Indicates the biological nature of a sequence
  • 2) Supplies information about changes to
    sequences
  • Feature Key    Description
  • conflict       Separate determinations of the same
    sequence differ
  • rep_origin     Origin of replication
  • protein_bind   Protein binding site on DNA
  • CDS            Protein coding sequence

7
Feature Keys-Terminology
  • Feature Key    Location/Qualifiers
  • CDS            23..400
                   /product="alcohol dehydrogenase"
                   /gene="adhI"
  • Interpretation: The feature CDS is a coding
    sequence beginning at base 23 and ending at base
    400, has a product called alcohol dehydrogenase,
    and corresponds to the gene called adhI.

8
Feature Keys-Terminology (Cont.)
  • Feature Key    Location/Qualifiers
  • CDS            join(544..589,688..1032)
                   /product="T-cell receptor beta-chain"
                   /partial
  • Interpretation: The feature CDS is a partial
    coding sequence formed by joining the indicated
    elements to form one contiguous sequence encoding
    a product called T-cell receptor beta-chain.
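The two location examples above can be read mechanically. The following is a minimal sketch, assuming locations contain only plain ranges and join(...); the function names parse_location and feature_length are hypothetical and cover only a small subset of the real location syntax.

```python
import re

def parse_location(location):
    """Return the (start, end) segments of a simple feature location.

    Positions are 1-based and inclusive, as in the flat file. Only plain
    ranges and join(...) are handled; the partial markers < and > are
    accepted and ignored.
    """
    return [(int(start), int(end))
            for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]

def feature_length(segments):
    """Total number of bases covered by all segments."""
    return sum(end - start + 1 for start, end in segments)

simple = parse_location("23..400")
joined = parse_location("join(544..589,688..1032)")
print(simple, feature_length(simple))   # [(23, 400)] 378
print(joined, feature_length(joined))   # [(544, 589), (688, 1032)] 391
```

The joined CDS covers 46 + 345 = 391 bases, which is not a multiple of 3; that is consistent with the /partial qualifier on the slide.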

9
Record from GenBank
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
            Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
Callouts
  • PLN: GenBank division (plant, fungal and algal)
  • 21-JUN-1999: Modification date
  • cds: Coding region
  • ACCESSION: Unique identifier (never changes)
  • VERSION: Nucleotide sequence identifier in accession.version
    form (changes when there is a change in the sequence)
  • GI: GeneInfo identifier (changes whenever there is a
    change)
  • KEYWORDS: Word or phrase describing the sequence (not based
    on a controlled vocabulary). Not used in newer records.
  • SOURCE: Common name for the organism
  • ORGANISM: Formal scientific name for the source organism
    and its lineage, based on the NCBI Taxonomy Database
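As a rough illustration of how the header fields map to values, here is a minimal sketch that pulls a few of them out of three header lines from this record. It assumes whitespace-separated tokens, whereas the real flat file is column-based, and parse_header is a hypothetical helper, not part of any NCBI tool.

```python
HEADER = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
"""

def parse_header(text):
    """Collect a few header fields, splitting each line on whitespace."""
    fields = {}
    for line in text.splitlines():
        keyword, _, value = line.partition(" ")
        tokens = value.split()
        if keyword == "LOCUS":
            fields["locus"] = tokens[0]
            fields["length_bp"] = int(tokens[1])
            fields["division"] = tokens[-2]   # e.g. PLN
            fields["date"] = tokens[-1]       # modification date
        elif keyword == "ACCESSION":
            fields["accession"] = tokens[0]
        elif keyword == "VERSION":
            # accession.version: the number after the dot changes
            # whenever the sequence itself changes.
            fields["version"] = int(tokens[0].split(".")[1])
    return fields

print(parse_header(HEADER))
```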
10
Record from GenBank (cont.1)
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required
            for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a
            novel plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology,
            Yale University, New Haven, CT, USA
Callouts
  • REFERENCE 1: Oldest reference first
  • MEDLINE: Medline UID
  • REFERENCE 3: Submitter of the sequence (always the last reference)
11
Record from GenBank (cont.2)
There are three parts to a feature: a key
(indicates the functional group), a location
(instructions for finding the feature), and
qualifiers (auxiliary information about the feature).
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
Callouts
  • source, CDS: Keys
  • 1..5028, <1..206: Locations
  • /organism, /codon_start, /product, etc.: Qualifiers, each with
    a value
  • <1..206: Partial sequence on the 5' end. The 3' end is
    complete.
  • /codon_start=3: Start of the open reading frame
  • /product: Descriptive free text must be in quotation marks
  • /db_xref: Database cross-references
  • /protein_id: Protein sequence ID
  • /translation: Note that this is only a partial sequence
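The codon_start qualifier can be checked against the sequence itself. The sketch below assumes the standard genetic code and uses the first 60 bases shown in the ORIGIN section of this record; translate is a hypothetical helper written for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, codon_start=1):
    """Translate dna, reading codons from the 1-based position codon_start."""
    dna = dna.upper()
    first = codon_start - 1
    # Stop before any incomplete codon at the end of the sequence.
    codons = (dna[i:i + 3] for i in range(first, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

first_60_bases = ("gatcctccat" "atacaacggt" "atctccacct"
                  "caggtttaga" "tctcaacaac" "ggaaccattg")
print(translate(first_60_bases, codon_start=3))  # SSIYNGISTSGLDLNNGTI
```

Reading from base 3 reproduces the start of the /translation qualifier, which is why the partial CDS is annotated with codon_start 3 rather than 1.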
12
Record from GenBank (cont.3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . .
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . .
Callouts
  • 687..3158 and complement(3300..4037): each gene starts a new location
  • ". . .": the translation is cut off on the slide
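A location written as complement(3300..4037) means the feature is read from the opposite strand, so the bases must be reverse-complemented after they are cut out. A minimal sketch, using a short made-up sequence and a hypothetical extract_feature helper:

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]

def extract_feature(sequence, start, end, on_complement=False):
    """Return bases start..end (1-based, inclusive) of a feature.

    With on_complement=True the feature lies on the opposite strand,
    as in complement(start..end).
    """
    segment = sequence[start - 1:end]
    return reverse_complement(segment) if on_complement else segment

toy_sequence = "aaacgtttt"  # made-up sequence, not from the record
print(extract_feature(toy_sequence, 4, 6))                      # cgt
print(extract_feature(toy_sequence, 4, 6, on_complement=True))  # acg
```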
13
Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . . .
//
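The sequence section is easy to read programmatically. A minimal sketch, assuming each sequence line starts with a position number followed by blocks of ten bases; read_origin is a hypothetical helper and only the two lines visible on the slide are used.

```python
from collections import Counter

origin_lines = [
    "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]

def read_origin(lines):
    """Join the sequence lines, dropping the position number and spaces."""
    return "".join("".join(line.split()[1:]) for line in lines)

sequence = read_origin(origin_lines)
print(len(sequence))      # 120
print(Counter(sequence))  # base counts for these 120 bases only

# The BASE COUNT line of the full record adds up to the LOCUS length.
base_count = {"a": 1510, "c": 1074, "g": 835, "t": 1609}
print(sum(base_count.values()))  # 5028
```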
14
Primary databases contain experimental biological
information
  • GenBank/EMBL/DDBJ
  • Alu: Alu repeats in human DNA
  • dbEST: expressed sequence tags, i.e. single-pass cDNA
    sequences (high error frequency)
  • It is non-redundant
  • HTGS: high-throughput genomic sequence database
    (errors!)
  • PDB: three-dimensional structure coordinates of
    biological molecules
  • PROSITE: database of protein domain/function
    relationships.

15
Types of secondary databases that contain
biological information
  • dbSTS: non-redundant database of sequence-tagged sites
    (useful for physical mapping)
  • Genome databases (there are over 20 genome
    databases that can be searched)
  • EPD: Eukaryotic Promoter Database
  • NR: non-redundant GenBank + EMBL + DDBJ + PDB. Entries
    with 100% sequence identity are merged as one.
  • Vector: a subset of GenBank containing vector DNA
  • ProDom
  • PRINTS
  • BLOCKS

16
Workshop 2A: Look up a GenBank record. Use the
annotations to determine the first
open reading frame.
17
Similarity Searching
It is easy to score whether an amino acid is identical
to another (the score is 1 if identical and 0 if
not). However, it is not easy to assign a score
to amino acids that are somewhat similar.
(Figure: structures of isoleucine and leucine, each
drawn with its CO2- and NH3+ groups)
Should they get a 0 (non-identical), a 1
(identical), or something in between?
18
Purpose of finding differences and similarities
of amino acids.
  • Infer structural information
  • Infer functional information
  • Infer evolutionary relationships

19
Evolutionary Basis of Sequence Alignment
1. Similarity: a quantity that relates to how
alike two sequences are.
2. Identity: a quantity that describes how alike
two sequences are in the strictest terms.
3. Homology: a conclusion drawn from data suggesting
that two genes share a common evolutionary history.
20
Evolutionary Basis of Sequence Alignment (Cont. 1)
1. Example: Shown on the next page is a pairwise
alignment of two proteins. One is mouse trypsin
and the other is crayfish trypsin. They are
homologous proteins. The sequences share 41%
identity.
2. Underlined residues are identical. Asterisks
and the diamond mark residues that
participate in catalysis. Five gaps are placed
to optimize the alignment.
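Percent identity can be computed directly from a gapped alignment. The sketch below is one possible convention, assuming identity is counted over columns where neither sequence has a gap; other conventions divide by the full alignment length or by the shorter sequence. The two aligned strings are made up for illustration and are not the trypsin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Made-up alignment: 5 gap-free columns, 4 of them identical.
print(percent_identity("ACDE-G", "ACDFHG"))  # 80.0
```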
21
(No Transcript)
22
Evolutionary Basis of Sequence Alignment (Cont. 2)
Why are there regions of identity?
1) Conserved function: residues participate in the reaction.
2) Structural: residues participate in maintaining the
structure of the protein (for example, conserved
cysteine residues that form a disulfide linkage).
3) Historical: residues that are conserved solely
due to a common ancestor gene.
23
Evolutionary Basis of Sequence Alignment (Cont. 3)
Note: It is possible for two proteins to share a
high degree of similarity but have different
functions. For example, human gamma-crystallin
is a lens protein that has no known enzymatic
activity. It shares a high percentage of
identity with E. coli quinone oxidoreductase.
These proteins likely had a common ancestor but
their functions diverged.
This is analogous to a railroad car that now functions as a diner.
24
(No Transcript)
25
Modular nature of proteins
  • The previous alignment was global. However, many
    proteins do not display global patterns of
    similarity. Instead, they possess local regions
    of similarity.
  • Proteins can be thought of as assemblies of
    modular domains. It is thought that this may, in
    some cases, be due to a process known as exon
    shuffling.

26
Modular nature of proteins (cont. 1)
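(Figure: exon shuffling. Gene A, with exons 1a and 2a,
undergoes duplication of exon 2a. An exchange between
Gene A and Gene B, which has exons 1b and 2b, then gives
each gene a third exon: exon 3 of Gene A is exon 2b from
Gene B, and exon 3 of Gene B is exon 2a from Gene A.)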
27
Dot Plots
Window = 1: Note that 25% of the table will
be filled due to random chance (a 1 in 4 chance at
each position).
28
Dot Plots with window 2
Sequence on both axes: A T G C C T A G
Window = 2: The larger the window, the more noise
can be filtered out. What is the percent chance
that a window will match at random?
(1/4) x (1/4) = 1/16, and 1/16 x 100 = 6.25%
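The effect of the window size can be seen with a small dot plot. This is a minimal sketch, assuming a dot is drawn only when two windows match exactly; dot_plot and chance_of_random_match are hypothetical names, and the sequence is the one on the slide's axes.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return text rows; '*' marks an exact match of two windows."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            same = seq_y[i:i + window] == seq_x[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows

def chance_of_random_match(window, alphabet_size=4):
    """Probability that two random windows are identical."""
    return (1 / alphabet_size) ** window

sequence = "ATGCCTAG"
for window in (1, 2):
    print(f"window = {window}")
    for row in dot_plot(sequence, sequence, window):
        print(row)

print(chance_of_random_match(1), chance_of_random_match(2))  # 0.25 0.0625
```

With a window of 1, the self-comparison fills 16 of the 64 cells (25%); with a window of 2, only the main diagonal remains, because every two-letter word in this sequence is unique.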













29
Identity Matrix
        A   C   I   L
    A   1
    C   0   1
    I   0   0   1
    L   0   0   0   1
Simplest type of scoring matrix
30
Similarity
It is easy to score whether an amino acid is identical
to another (the score is 1 if identical and 0 if
not). However, it is not easy to assign a score
to amino acids that are somewhat similar.
(Figure: structures of isoleucine and leucine, each
drawn with its CO2- and NH3+ groups)
Should they get a 0 (non-identical), a 1
(identical), or something in between?
31
Scoring Matrices
  • Importance of scoring matrices
  • Scoring matrices appear in all analyses involving
    sequence comparisons.
  • The choice of matrix can strongly influence the
    outcome of the analysis.
  • Scoring matrices implicitly represent a
    particular theory of sequence alignment.
  • Understanding theories underlying a given scoring
    matrix can aid in making the proper choice when
    performing sequence alignments.

32
Scoring Matrices
  • When we consider scoring matrices, we encounter
    the convention that matrices have numeric indices
    corresponding to the rows and columns of the
    matrix. For example, M11 refers to the entry at
    the first row and the first column. In general,
    Mij refers to the entry at the ith row and the
    jth column. To use this for sequence alignment,
    we simply associate a numeric value with each
    letter in the alphabet of the sequence. For
    example, if the alphabet is A, C, T, G, then the
    pair A/A maps to entry 1,1, the pair A/C to
    entry 1,2, etc.
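The letter-to-index convention can be sketched in a few lines. This example assumes the alphabet A, C, T, G and the identity matrix as the scoring matrix; note that Python indices start at 0, whereas the slide counts rows and columns from 1. The names INDEX, score and alignment_score are hypothetical.

```python
ALPHABET = "ACTG"
# Map each letter to its row/column index: A -> 0, C -> 1, T -> 2, G -> 3.
INDEX = {letter: position for position, letter in enumerate(ALPHABET)}

# Identity matrix: 1 on the diagonal, 0 elsewhere.
IDENTITY = [[1 if row == col else 0 for col in range(len(ALPHABET))]
            for row in range(len(ALPHABET))]

def score(matrix, a, b):
    """Look up the score for aligning letter a with letter b."""
    return matrix[INDEX[a]][INDEX[b]]

def alignment_score(matrix, seq_a, seq_b):
    """Sum the scores of an ungapped alignment, column by column."""
    return sum(score(matrix, a, b) for a, b in zip(seq_a, seq_b))

print(score(IDENTITY, "A", "A"))                 # 1
print(score(IDENTITY, "A", "C"))                 # 0
print(alignment_score(IDENTITY, "ACGT", "ACTT"))  # 3
```

Swapping IDENTITY for a PAM or BLOSUM table changes the scores but not the lookup mechanism.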

33
Steps to building the first PAM (Point Accepted
Mutation) matrix
  1. Dayhoff aligned sequences that were at least 85%
    identical.
  2. Reconstructed phylogenetic trees and inferred
    ancestral sequences. 71 trees containing 1,572 aa
    exchanges were used.
  3. Tallied aa replacements "accepted" by natural
    selection, in all pair-wise comparisons.

34
Steps to building PAM (cont. 1)
  • 4. Computed amino acid mutability, m_j (the
    propensity of a given amino acid, j, to be
    replaced).
  • 5. Combined the data from steps 3 and 4 to produce a
    Mutation Probability Matrix (MPM) for one PAM of
    evolutionary distance. The diagonal entries follow
    the formula

M_jj = 1 - m_j
(the MPM entry of amino acid j for amino acid j, i.e. no replacement)
35
Steps to building PAM (cont. 2)
6. Took the log-odds ratio to obtain each
score: S_ij = log(M_ij / f_i) (Note: this is what
you see in the matrix), where f_i is the normalized
frequency of amino acid i in the sequences used.
7. Note: the log-odds values are multiplied by a
factor of 10 and rounded to avoid fractions.
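The log-odds step can be sketched as follows. This assumes a base-10 logarithm and a scaling factor of 10 with rounding to the nearest integer; the probability and frequency values are made up for illustration and are not Dayhoff's data.

```python
import math

def log_odds_score(m_ij, f_i, scale=10):
    """Scaled, rounded log-odds score: round(scale * log10(M_ij / f_i))."""
    return round(scale * math.log10(m_ij / f_i))

# Hypothetical values, for illustration only.
print(log_odds_score(0.2, 0.02))    # aligned 10x more often than chance: 10
print(log_odds_score(0.05, 0.05))   # aligned exactly as often as chance: 0
print(log_odds_score(0.005, 0.05))  # aligned 10x less often than chance: -10
```

A positive score therefore means the pair is seen more often in related sequences than expected by chance, and a negative score means it is seen less often.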
36
Assumptions in the PAM model
1. Replacement at any site depends only on the
amino acid at that site and the probability given
by the table (Markov model).
2. Sequences that are being compared have average
amino acid composition.
37
The bottom line on PAM
(Frequencies of alignment) / (Frequencies of occurrence)
The probability that two amino acids, i and j,
are aligned by evolutionary descent divided by
the probability that they are aligned by chance.
Sources of error in PAM model
1. Many sequences depart from average amino acid
composition.
2. Rare replacements were observed too infrequently
to resolve relative probabilities accurately (for 36
amino acid pairs, out of approximately 400, no
replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation
to 250 PAM (the k PAM matrix is M^k, the 1 PAM matrix
multiplied by itself k times).
4. This (Markov) process is an imperfect
representation of evolution: distantly related
sequences usually have islands (blocks) of
conserved residues. This implies that replacement
is not equally probable over the entire sequence.
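The extrapolation from 1 PAM to k PAM is a matrix power. The sketch below uses a made-up two-letter alphabet with a 1% chance of change per PAM unit, purely to show the mechanics; a real PAM matrix is 20 x 20 and its entries come from the tallied replacements.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, k):
    """Multiply m by itself k times, starting from the identity matrix."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result

# Hypothetical "1 PAM" matrix for a two-letter alphabet: each letter
# stays the same with probability 0.99 and changes with probability 0.01.
M1 = [[0.99, 0.01],
      [0.01, 0.99]]
M250 = mat_power(M1, 250)
for row in M250:
    print([round(value, 4) for value in row])
```

After 250 steps the diagonal has dropped from 0.99 to roughly 0.5, so any small error in the 1 PAM entries has been compounded 250 times as well.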
39
(No Transcript)
40
BLOSUM Matrices
  • BLOSUM is built from distantly related sequences
    whereas PAM is built from closely related
    sequences
  • BLOSUM is built from conserved blocks of aligned
    protein segments found in the BLOCKS database
    (remember that the BLOCKS database is a secondary
    database that depends on the PROSITE families)

41
Gap Penalties
  • Take into account insertions and deletions.
  • There cannot be too many gaps, or the alignment
    becomes meaningless.
  • Typically, there is a fixed deduction for
    introducing a gap plus an additional deduction for
    the length of the gap.

Gap penalty = G + Ln, where G = gap opening
penalty, L = gap extension penalty, and n = gap
length. G = 2 to 12, L = 2.
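The gap penalty formula is simple enough to write out directly. A minimal sketch, with default values G = 10 and L = 2 chosen from the range on the slide; gap_penalty is a hypothetical name.

```python
def gap_penalty(n, gap_open=10, gap_extend=2):
    """Deduction for one gap of length n: G + L * n."""
    return gap_open + gap_extend * n

# One gap of length 3 costs less than three separate gaps of length 1.
print(gap_penalty(3))      # 16
print(3 * gap_penalty(1))  # 36
```

Because the opening penalty is paid once per gap, this scheme favors a few long gaps over many short ones.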
42
Global Alignment vs. Local Alignment
  • Global alignment is used when the overall gene
    sequence is similar to another sequence; it is often
    used in multiple sequence alignment.
  • Clustal W program (Needleman-Wunsch algorithm)
  • Local alignment is used when only a small portion
    of one gene is similar to a small portion of
    another gene.
  • BLAST
  • FASTA
  • Smith-Waterman algorithm
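The difference between global and local alignment shows up in a single dynamic programming table. The sketch below computes only the best score (no traceback) and assumes simple match/mismatch scores with a linear gap cost rather than a substitution matrix with separate opening and extension penalties; best_score and its parameter values are hypothetical.

```python
def best_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    local=False fills the table as in Needleman-Wunsch (global);
    local=True floors every cell at 0 as in Smith-Waterman (local).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps must be paid for.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,  # align two residues
                       table[i - 1][j] + gap,       # gap in seq_b
                       table[i][j - 1] + gap)       # gap in seq_a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]

print(best_score("ACGT", "ACGT"))                   # 4
print(best_score("ACGT", "AGT"))                    # 1
print(best_score("AAAACGT", "CGTCCCC", local=True))  # 3
```

For the last pair, only the shared "CGT" is similar, so the local score is positive while the global score over the full lengths is lower.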

43
Two proteins that are similar in certain regions:
tissue plasminogen activator (PLAT) and coagulation
factor 12 (F12).
44
The Dotter Program
  • Program consists of three components
  • Sliding window
  • A scoring matrix that gives a score for each
    amino acid
  • A graph that converts the score to a dot of
    certain pixel density

45
(No Transcript)
46
Region of similarity
47
BLAST
  • Basic Local Alignment Search Tool
  • Speed is achieved by
  • Pre-indexing the database before the search
  • Parallel processing
  • Uses a hash table that contains neighborhood
    words rather than just identical words.

48
Neighborhood words
  • The program declares a hit if the word taken from
    the query sequence has a score > T when a
    substitution matrix is used.
  • This allows the word size W (similar to the
    ktup value in FASTA) to be kept high (for speed) without
    sacrificing sensitivity.
  • If T is increased by the user, the number of
    background hits is reduced and the program will
    run faster.
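The idea of neighborhood words and the lookup table can be sketched with a toy scoring scheme. Everything here is an assumption made for illustration: the four-letter alphabet, the toy_score values (4 for identical residues, 2 for two residues from the group I/L/V, -1 otherwise) and the helper names. A real search would use a substitution matrix such as BLOSUM or PAM over all 20 amino acids.

```python
from itertools import product

SIMILAR = set("ILV")  # hypothetical group of similar residues

def toy_score(a, b):
    """Hypothetical substitution scores, not taken from a real matrix."""
    if a == b:
        return 4
    if a in SIMILAR and b in SIMILAR:
        return 2
    return -1

def word_score(word_a, word_b, score):
    return sum(score(a, b) for a, b in zip(word_a, word_b))

def neighborhood(query_word, alphabet, score, threshold):
    """All words of the same length scoring more than threshold."""
    return ["".join(candidate)
            for candidate in product(alphabet, repeat=len(query_word))
            if word_score(query_word, candidate, score) > threshold]

def build_lookup(query, word_size, alphabet, score, threshold):
    """Hash table mapping each neighborhood word to query offsets."""
    lookup = {}
    for offset in range(len(query) - word_size + 1):
        word = query[offset:offset + word_size]
        for neighbor in neighborhood(word, alphabet, score, threshold):
            lookup.setdefault(neighbor, []).append(offset)
    return lookup

print(sorted(neighborhood("LIA", "AILV", toy_score, threshold=8)))
print(sorted(neighborhood("LIA", "AILV", toy_score, threshold=10)))
print(build_lookup("LIAV", 3, "AILV", toy_score, threshold=8))
```

Raising the threshold from 8 to 10 shrinks the neighborhood of "LIA" from five words to just the word itself, which is the trade-off between sensitivity and speed described above.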

49
Workshop for module 2: Use the Dotter program to
determine the optimal alignment between two
sequences. Perform a BLAST search on a protein
sequence.