1
Introduction to Sequence Analysis
Protein Sequence Analysis, Part II
Osvaldo Graña (ograna_at_cnio.es), CNIO Bioinformatics Unit
22 Feb. 2012
2
Introduction to Sequence Analysis
Introduction
  • Determination of protein/peptide sequences is a
    basic requirement for biomedical research,
    including cancer research. It is absolutely
    essential for characterising and identifying
    proteins or peptides.
  • The UniProt Knowledgebase (UniProtKB) is a central
    database of protein sequence and function. It consists
    of two parts: a section of fully manually annotated,
    non-redundant records built from information extracted
    from the literature and curator-evaluated computational
    analyses (UniProtKB/Swiss-Prot), and a section of
    computationally analysed records awaiting full
    manual annotation (UniProtKB/TrEMBL).
  • See this web page for more information about
    UniProtKB: http://www.uniprot.org/help/uniprotkb

3
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
We are going to search a protein database using a
nucleotide query (sequence 2) with NCBI BLAST2
(http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/)
and look for similar peptide/protein sequences in
UniProtKB/Swiss-Prot. The protein encoded by this sequence
is a real entry in the database, so we expect to find a
perfect match to our test sequence. We also expect to find
similar peptide/protein sequences, perhaps from closely
related animals or from closely related proteins.
We are going to select the BLASTX option and the
UniProtKB/Swiss-Prot database, which lets us search a
protein database with a nucleotide query:
http://www.ebi.ac.uk/Tools/blastall/
4
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Database: Choose here the databases you wish to
run your protein sequence against.
5
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Database: Choose here the databases you wish to
run your protein sequence against.
6
Introduction to Sequence Analysis
Selecting Blast parameters to search
7
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Matrix: You may choose from a complete list of matrices
covering various evolutionary constraints. Substitutions
accumulate in sequences through genetic diversity during
evolution, and each matrix is tailored to a particular
evolutionary distance. The default matrix for BLAST is
blosum62 (Blocks Substitution Matrix, 62% identity), which
is generally regarded as among the best of the available
matrices for detecting weak protein similarities. PAM
(Point Accepted Mutation) matrices are also traditionally
used for amino acid sequences. Choosing a PAM matrix with a
larger value allows alignment of sequences separated by
larger evolutionary distances, whereas choosing a blosum
matrix with a larger value suits sequences with a higher
percentage identity.

Expected threshold: The expected threshold sets the
statistical significance threshold for reporting database
sequence matches. The default value is 10, meaning that 10
matches are expected to be found merely by chance. Lower
thresholds are more stringent, so fewer chance matches are
reported. Increasing the threshold reports less stringent
matches and is recommended for searches with short
sequences: a short query is more likely to occur by chance
in the database than a longer one, so even a perfect match
(no gaps) can have low statistical significance and may not
be reported. Increasing the expected threshold lets you look
farther down the hit list and see matches that would
normally be discarded because of low statistical
significance. Generally a value of up to 1000 is enough to
see results.
[Figure: E-value scale for inferring homology, with ticks at 0, 10^-5 and 10^-2 and regions labelled Very Reliable, Reliable and Risky.]
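Why a short query needs a higher expected threshold can be illustrated with the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is only an illustration: the database size and the figure of 2 bits per identical residue are made-up assumptions, and real BLAST also applies length corrections that are ignored here.

```python
def expected_hits(bit_score, query_len, db_len):
    """Karlin-Altschul estimate E = m * n * 2**(-S') for bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


DB_LEN = 2 ** 27        # hypothetical database size in residues
BITS_PER_RESIDUE = 2.0  # rough, assumed bit score per identical residue


def perfect_match_evalue(query_len):
    """E-value of an ungapped perfect match covering the whole query."""
    return expected_hits(BITS_PER_RESIDUE * query_len, query_len, DB_LEN)


if __name__ == "__main__":
    # Shorter queries give much larger E-values, even for perfect matches
    for length in (8, 16, 32):
        print(length, perfect_match_evalue(length))
```

Under these assumptions the perfect match of an 8-residue query has an E-value far above the default threshold of 10, so it would only be reported after raising the threshold.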
8
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Filter: The filter option, if set to true, masks segments of
the query sequence that are non-specific for sequence
similarity searches. Filtering can eliminate statistically
significant but biologically uninteresting reports from the
output, for example hits against common acidic-, basic- or
proline-rich regions, leaving the more biologically
interesting regions of the query sequence available for
specific matching against database sequences. Filtering is
only applied to the query sequence, not to database
sequences. The program used for nucleotide query sequences
is DUST, written by Tatusov, R.L., and Lipman, D.J. The SEG
program, written by Wootton, J.C., and Federhen, S., filters
low-complexity regions in amino acid query sequences. The
default is true.

Default filters (when Filter is set to true):
  BLASTp   SEG
  BLASTx   SEG
  BLASTn   DUST

N.B. If you have "UniProt Clusters 100 (SEG filtered)"
selected, you will not be able to set a filter, as a filter
is already applied.

Drop off: This is the amount a score must drop before
extension of word hits is halted.
9
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Open gap: The gap open penalty is the score taken away for
the initiation of a gap in sequence or in structure. To
make the match more significant you can try making the gap
penalty larger. This decreases the number of gaps, and if
you have a good alignment without many gaps, its Z-score
will be higher. The default is 11.

Extend gap: The gap extension penalty is added to the
standard gap penalty for each base or residue in the gap.
This is how long gaps are penalised. If you don't like long
gaps, just increase the gap extension penalty. Usually you
will expect a few long gaps rather than many short gaps, so
the gap extension penalty should be lower than the gap open
penalty. An exception is where one or both sequences are
single reads with possible sequencing errors, in which case
you would expect many single-base gaps. You can get this
result by setting the gap open penalty to zero (or very
low) and using the gap extension penalty to control gap
scoring. The default is 1.
10
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
Gap align: This is a true/false setting that tells the
program whether to perform optimised alignments within
regions involving gaps. If set to true, the program will
perform an alignment using gaps. If set to false, it will
report only individual HSPs where two sequences match each
other, and thus will not produce alignments with gaps. The
default is true.
(N.B. HSP means High-Scoring Segment Pair: a local
alignment with no gaps that achieves one of the top
alignment scores in a given search.)
11
Introduction to Sequence Analysis
Substitution matrices
Alignment of protein sequences can take account of the
differential rates at which amino acids substitute for each
other. These rates are captured by two types of matrices:
PAM and BLOSUM.

PAM (Point Accepted Mutation, also read as Percent Accepted
Mutation): on the basis of comparisons among many pairs of
very similar protein sequences (at least 85% identical,
i.e. homologous sequences), Margaret Dayhoff constructed a
mutation probability matrix giving the empirical
frequencies with which one amino acid is replaced by others
during evolution. Examples are PAM1, PAM10, PAM25, PAM50,
PAM100, PAM125 and PAM250.

PAM10 = PAM1^10
PAM250 = PAM1^250

The PAM1 matrix can be multiplied by itself N times to give
transition matrices for comparing sequences with lower and
lower levels of similarity due to separation over longer
periods of evolutionary history. Thus, the commonly used
PAM250 matrix represents a level of 250% of change expected
in 2500 million years. Although this amount of change seems
very large, sequences at this level of divergence still
have about 20% similarity
(Bioinformatics, D. W. Mount, page 96).
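The idea of multiplying PAM1 by itself can be sketched in a few lines of Python. The matrix below is a toy two-residue stand-in, not Dayhoff's 20 x 20 matrix, and its 1% change rate is an assumption chosen only to mimic the PAM1 definition.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Toy PAM1-like matrix: each of two residue types changes with probability 1%
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)

if __name__ == "__main__":
    # Probability of observing the original residue type after 250 steps
    print(toy_pam250[0][0])
```

Each row of the result still sums to 1, which is a quick sanity check that it remains a transition matrix.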
12
Introduction to Sequence Analysis
Substitution matrices
The empirical frequency with which amino acid type i is
replaced by type j (or vice versa) is written as Mi,j in
the matrix. The log-odds score for aligning two Ys in an
alignment YY/YY is 10 + 10 = 20, a very significant score,
whereas that of YY/TP is 0 - 5 = -5.
Introduction to Sequence Analysis
Substitution matrices
Recommendations: Which PAM matrix should I use? The
percentage similarity or difference between two sequences
cannot be known until an alignment is done, so a trial
alignment must be made first. Once an initial similarity
score has been obtained in this way, a more representative
score can be obtained by using another PAM matrix designed
specifically for sequences at that level of similarity.
14
Introduction to Sequence Analysis
Substitution matrices
BLOSUM (Blocks Substitution Matrix): the PAM matrices
introduced by Dayhoff are constructed from the amino acid
replacements inferred from alignments of protein sequences
that are at least 85% identical. Henikoff & Henikoff (1992)
instead considered blocks, or highly conserved regions, in
aligned protein sequences. The BLOSUM matrix scores for
amino acid pairs are based on the frequency of amino acid
substitutions in aligned sequence motifs (blocks) from a
related family of proteins, regardless of the overall
degree of similarity between the protein sequences.

The BLOSUM62 substitution matrix is widely used for scoring
protein sequence alignments. The matrix values are based on
the observed amino acid substitutions in a large set of
approximately 2000 conserved amino acid blocks representing
more than 500 families of related proteins.

BLOSUM62 -> based on blocks that are 62% identical
BLOSUM80 -> based on blocks that are 80% identical

BLOSUM62 example:
http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm
15
Introduction to Sequence Analysis
Substitution matrices
PAM vs BLOSUM: The PAM matrices are based on scoring all
amino acid positions in related sequences, whereas the
BLOSUM matrices are based on substitutions and conserved
positions in blocks, which represent the most alike common
regions in related sequences. The PAM model is thus
designed to track the evolutionary origins of proteins,
whereas the BLOSUM model is designed to find their
conserved domains. The choice of which matrix to use
depends on the goals of the investigator. Still, there are
some equivalences between PAM and BLOSUM matrices.
16
Introduction to Sequence Analysis
Substitution matrices
We also have to consider insertions and deletions. This
implies opening gaps in the alignment, so we have to
recalculate the scores, penalizing for:
a) Opening a gap in the alignment
b) Extending the gap in the alignment
Values vary depending on the program we are using, but a
general rule is that opening a new gap is penalized much
more than extending an existing one. It is more frequent to
find long gaps than bunches of 1-base gaps.

Example 1, bunch of gaps:
ATCG_ATCG_ATCG_ATCG
ATCGTATCGTATCGTATCG

Example 2, long gap:
ATCG___ATCG
ATCGTCGATCG
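The two examples can be compared numerically with a small Python sketch. It assumes the scheme described on the BLAST parameter slides, in which each gap pays the open penalty once plus the extension penalty for every gapped position, with the BLAST defaults of 11 and 1. Other programs count the first gapped position differently, so the absolute numbers are illustrative only.

```python
import re


def affine_gap_penalty(aligned, gap_open=11, gap_extend=1, gap_char="_"):
    """Total penalty: each run of gaps costs gap_open + gap_extend * run length."""
    runs = re.findall(re.escape(gap_char) + "+", aligned)
    return sum(gap_open + gap_extend * len(run) for run in runs)


bunch_of_gaps = "ATCG_ATCG_ATCG_ATCG"  # three separate 1-base gaps
long_gap = "ATCG___ATCG"               # one 3-base gap

if __name__ == "__main__":
    print(affine_gap_penalty(bunch_of_gaps))
    print(affine_gap_penalty(long_gap))
```

Both alignments contain three gapped positions, but the single long gap is penalized far less, which matches the rule that opening a gap costs more than extending one.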
17
Introduction to Sequence Analysis
Substitution matrices
Example of scoring a sequence alignment with a gap penalty
under BLOSUM62.
BLOSUM62 matrix:
http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm

Sequence 1    V    D    S    -    C    Y
Sequence 2    V    E    S    L    C    Y
Score         4    2    4  -11    9    7

Total score = (sum of amino acid pair scores) minus
(single gap penalty) = 26 - 11 = 15
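The same calculation can be written as a short Python sketch. The dictionary holds only the BLOSUM62 entries needed for this example rather than the full 20 x 20 matrix, and the function name is a hypothetical one chosen for illustration.

```python
# Only the BLOSUM62 entries needed for this example
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
    ("C", "C"): 9, ("Y", "Y"): 7,
}
GAP_PENALTY = 11  # single gap penalty used in the example


def score_alignment(seq1, seq2):
    """Sum the pair scores column by column, subtracting a penalty per gap."""
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total -= GAP_PENALTY
        else:
            # The matrix is symmetric, so look the pair up in either order
            pair = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
            total += BLOSUM62_SUBSET[pair]
    return total


if __name__ == "__main__":
    print(score_alignment("VDS-CY", "VESLCY"))
```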
18
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2
We are going to search a protein database using a
nucleotide query (sequence 2) with NCBI BLAST2
(http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/checkingForVectorContamination/)
and look for similar peptide/protein sequences in UniProt.
The protein encoded by this sequence is a real entry in the
database, so we expect to find a perfect match to our test
sequence. We also expect to find similar peptide/protein
sequences, perhaps from closely related animals or from
closely related proteins.
We are going to select the BLASTX option and the Swiss-Prot
database, which lets us search a protein database with a
nucleotide query:
http://www.ebi.ac.uk/Tools/blastall/
19
Introduction to Sequence Analysis
Searching against a protein sequence database
with NCBI-BLAST2 Results Summary
NOTE: by clicking 'show alignments' we find that all the
hits are obtained in frame 2 (see Show Alignments). This
tells us that at least the second frame is a coding frame.
20
Introduction to Sequence Analysis
Showing the alignments
NOTE: all the hits are obtained in frame 2 (see Show
Alignments). This tells us that at least the second frame
is a coding frame.
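What "frame 2" means can be shown with a minimal Python sketch that translates the three forward reading frames of a sequence using the standard genetic code. The input is a made-up toy sequence, not the fosB mRNA, and the reverse-strand frames that BLASTX also searches are left out for brevity.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate one forward reading frame (1, 2 or 3) of an uppercase DNA string."""
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    toy = "AATGGCCAAGTGA"  # made-up sequence
    for frame in (1, 2, 3):
        print(frame, translate(toy, frame))
```

In this toy case only frame 2 starts with a methionine and ends at a stop codon, which is the kind of evidence that points to a coding frame.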
21
Introduction to Sequence Analysis
Visual output (results)
Note that the part of the mouse fosB mRNA that aligns with
the FosB protein sequence is the one that belongs to the
CDS, from the first methionine (translation start site) to
the stop codon (translation stop site).
22
Introduction to Sequence Analysis
Functional predictions (results)
23
Introduction to Sequence Analysis
Description of Uniprot entry
24
Introduction to Sequence Analysis
Description of Uniprot entry
25
Introduction to Sequence Analysis
Pairwise local/global alignment differences
Global alignment: we try to align the whole sequence. It is
only useful for homologous proteins with a high percentage
of identity.
Local alignment: we try to align locally as much of the
sequence as we can. This is useful when dealing with
domains.
Are these proteins homologues? Globally no: they are very
different, and the score would be very low. Locally there
is a homologous domain, the grey one.
26
Introduction to Sequence Analysis
Pairwise local/global alignment Running an
EMBOSS-Align alignment
  • We are going to use the EMBOSS-Align tool
    (http://www.ebi.ac.uk/Tools/emboss/align/).
  • There are 2 jobs to execute, one with the EMBOSS
    global alignment program (needle) and one with the
    local alignment program (water).
  • As we are comparing 2 protein sequences, the
    molecule type was left on protein.
  • The default blosum62 matrix is used, together with
    the default gap open of "10" and gap extend of "0.5".
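The difference between the two jobs can be sketched with a simplified dynamic programming scorer in Python. This is not the needle or water implementation: it assumes a plain match/mismatch score and a linear gap penalty instead of blosum62 with gap open and gap extend, and it returns only the score, not the alignment.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    # First row: leading gaps cost nothing in local mode
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)    # a local alignment may restart anywhere
                best = max(best, cell)
            cur.append(cell)
        prev = cur
    # Global: score of aligning both sequences end to end
    return best if local else prev[-1]


if __name__ == "__main__":
    short, long_seq = "ACGT", "TTACGTTT"
    print(alignment_score(short, long_seq, local=False))
    print(alignment_score(short, long_seq, local=True))
```

The short sequence occurs exactly inside the long one, so the local score is high while the global score is dragged down by the gaps needed to cover the flanking residues.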

27
Introduction to Sequence Analysis
Pairwise local/global alignment differences
Let's align these two sequences
(see http://pfam.sanger.ac.uk/family?acc=PF00071):
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence10.txt
http://ubio.bioinfo.cnio.es/people/ograna/public_html/introductionToSequenceAnalysis/protEMBOSSalign/sequence11.txt
28
Introduction to Sequence Analysis
Pairwise local/global alignment needle GLOBAL
result
29
Introduction to Sequence Analysis
Pairwise local/global alignment water LOCAL
result
The Smith-Waterman
algorithm is more suitable for identifying
related proteins of limited sequence similarity
than FASTA and BLAST in a database search
(Bioinformatics, D. W. Mount, page 259).
30
Introduction to Sequence Analysis
Pairwise local/global alignment Results of
EMBOSS-Align alignments
Note that identical amino acids are connected with a "|"
symbol. Unrelated pairs of amino acids (mismatches) are
connected with a space. A gap is represented with a "-"
symbol. Similar pairs (e.g. leucine vs methionine) are
connected via a ":" symbol. Less similar ones are indicated
with ".".
The %id is the percentage of identical matches between the
two sequences over the reported aligned region. The
%similarity is the percentage of matches between the two
sequences over the reported aligned region where the
scoring matrix value is greater than or equal to 0.0. The
Overall %id and Overall %similarity are calculated in a
similar manner, for the number of matches over the length
of the longest of the two sequences.
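The two identity figures can be reproduced with a short Python sketch. The aligned strings are made up for illustration and the function names are hypothetical. Similarity is not computed here because it needs the full scoring matrix.

```python
def count_identical(aligned1, aligned2, gap="-"):
    """Number of columns where both sequences carry the same residue."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(aligned1, aligned2) if a == b and a != gap)


def percent_identity(aligned1, aligned2, gap="-"):
    """Identical matches over the reported aligned region, gaps included."""
    return 100.0 * count_identical(aligned1, aligned2, gap) / len(aligned1)


def overall_identity(aligned1, aligned2, gap="-"):
    """Identical matches over the length of the longer ungapped sequence."""
    longest = max(len(aligned1.replace(gap, "")), len(aligned2.replace(gap, "")))
    return 100.0 * count_identical(aligned1, aligned2, gap) / longest


if __name__ == "__main__":
    top, bottom = "ACD-FGHK", "ACEWFG-K"  # toy alignment
    print(percent_identity(top, bottom))
    print(overall_identity(top, bottom))
```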
31
Introduction to Sequence Analysis
ClustalW can build multiple sequence alignments
(MSA)
ClustalW (http://www.ebi.ac.uk/Tools/clustalw2/) is a
general purpose global multiple sequence alignment program
for DNA or proteins. It produces biologically meaningful
multiple sequence alignments of divergent sequences. It
calculates the best match for the selected sequences and
lines them up so that the identities, similarities and
differences can be seen. Evolutionary relationships can be
seen through cladograms or phylograms.

Multiple alignments of protein sequences are important
tools in studying sequences. The basic information they
provide is the identification of conserved sequence
regions. This is very useful in designing experiments to
test and modify the function of specific proteins, in
predicting the function and structure of proteins, and in
identifying new members of protein families.

Sequences can be aligned across their entire length (global
alignment) or only in certain regions (local alignment).
This is true for pairwise and multiple alignments. Global
alignments need to use gaps (representing
insertions/deletions), while local alignments can avoid
them, aligning regions between gaps.

ClustalW is a fully automatic program for global multiple
alignment of DNA and protein sequences. The alignment is
progressive and considers the sequence redundancy. Trees
can also be calculated from multiple alignments. The
program has some adjustable parameters with reasonable
defaults.
  • ClustalW (Higgins et al. 1996) is designed to provide
    an adequate alignment of a large number of more
    closely related sequences and a reliable indication of
    the domain structure of those sequences.
  • The steps used by ClustalW include:
  • Perform pair-wise alignments of all the sequences.
  • Use the alignment scores to produce a phylogenetic
    tree.
  • Progressive multiple sequence alignment: it reduces
    the construction of the MSA to a series of pair-wise
    alignments. Initially, a dynamic programming alignment
    is made between the two most alike sequences, and the
    resulting alignment is then extended to include other,
    less alike sequences.
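The first decision in the progressive strategy, choosing the two most alike sequences, can be sketched as follows. This is not ClustalW's actual scoring: it assumes toy sequences of equal length and uses the fraction of differing positions as the distance instead of pair-wise alignment scores.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs):
    """Names of the two most alike sequences, which would be aligned first."""
    return min(combinations(seqs, 2),
               key=lambda pair: distance(seqs[pair[0]], seqs[pair[1]]))


if __name__ == "__main__":
    toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTGTACCC"}  # made-up sequences
    print(closest_pair(toy))
```

A real progressive aligner would then merge that pair into a profile and repeat the step, following the order given by the guide tree.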

32
Introduction to Sequence Analysis
Building a MSA 1) get protein homologs with Blast
We select all the hits obtained from the previous
search results, and then click 'download fasta'.
33
Introduction to Sequence Analysis
Building a MSA 2) Copy all the downloaded
sequences
We then copy all the downloaded sequences to the
ClustalW2 tool
34
Introduction to Sequence Analysis
Building a MSA 3) ClustalW MSA results
35
Introduction to Sequence Analysis
Analyzing ClustalW results
We can now see that the first 3 sequences are very
similar, while from the fourth to the last the MSA
introduces differences.
The branch lengths on the phylogram are proportional to the
evolutionary distance between species. In the cladogram,
however, the branches are normalized and therefore do not
represent the distance between species.
36
Introduction to Sequence Analysis
Interpretation of ClustalW results
Consensus symbols: An alignment will display by default the
following symbols denoting the degree of conservation
observed in each column.
"*" means that the residues or nucleotides in that column
are identical in all sequences in the alignment.
":" means that conserved substitutions (similar) have been
observed, according to the COLOUR table below.
"." means that semi-conserved substitutions (less similar)
are observed.
Colour: This option only works when you have chosen ALN or
GCG as the output format. The colouring of residues takes
place according to the following physicochemical criteria:
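The "*" symbol is simple enough to reproduce in a small Python sketch. It marks only fully identical columns. The ":" and "." symbols depend on residue similarity groups that are not listed in this transcript, so they are deliberately left out. The function name and toy sequences are assumptions for illustration.

```python
def conservation_line(aligned_seqs):
    """Mark each column '*' if all residues are identical, otherwise ' '."""
    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        # A column made only of gaps is not counted as conserved
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    toy_msa = ["MKV", "MKI", "MRV"]  # made-up aligned sequences
    for row in toy_msa:
        print(row)
    print(conservation_line(toy_msa))
```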
37
Introduction to Sequence Analysis
Other examples of MSA programs
  • T-Coffee: combines information from global and local
    alignments to produce a global MSA
    (http://www.ebi.ac.uk/Tools/t-coffee/index.html)
  • Muscle: builds global MSA
    (http://www.ebi.ac.uk/Tools/muscle/)
  • Mafft: generates global MSA
    (http://www.ebi.ac.uk/Tools/mafft/index.html)
  • DiAlign: produces global and local MSA
    (http://bibiserv.techfak.uni-bielefeld.de/dialign/)
  • Hmmer: generates local MSA
    (http://hmmer.janelia.org/)
  • Meme: builds local MSA
    (http://meme.sdsc.edu/meme4_1/cgi-bin/meme.cgi)
38
Introduction to Sequence Analysis
Searching protein families with InterPro
  • What is InterPro?
    http://www.ebi.ac.uk/interpro/user_manual.html
  • InterPro is an integrated documentation resource
    for protein families, domains and sites. InterPro
    is a consortium of member databases (PROSITE,
    Pfam, Prints, ProDom, SMART and TIGRFAMs). Each
    member database devises methods that can be
    applied computationally to assign a score for a
    protein according to how well it matches a given
    signature. For some types of methods, the
    classification is binary (i.e. hit or miss), in
    other cases a numerical value is produced and a
    cut off point chosen to separate hits from
    misses. Different member databases create
    methods/signatures in different ways: some groups
    build them from alignments studied manually,
    others use automatic processes with some human
    input and correction, and ProDom uses an entirely
    automatic method.
  • Signatures describing the same protein family or
    domain are grouped into unique InterPro entries.
    Each combined InterPro entry has a unique
    accession number, an abstract describing the
    features of proteins associated with the entry
    and literature references and has links to the
    relevant member database(s). All UniProtKB
    protein sequences that have matches to a
    particular InterPro entry are listed in the Match
    Table associated with that entry. There are also
    links to the InterPro graphical views. The
    graphical views, which can be sorted by UniProtKB
    accession number, structure or taxonomy, show the
    position of the signatures on the protein;
    mousing over a signature brings up a pop-up box
    giving the accession, name and position.
  • InterPro graphically represents the location of a
    protein domain and information pertaining to the
    origin of that domain and the proteins that
    contain it. Families are also defined and may
    contain several InterPro domains which are often,
    but not always, in the same order. Through the
    InterPro Domain Architecture view, the
    composition and order of the different domains
    within a family are clearly displayed for easy
    comparison, as well as for simple navigation
    between the entries for individual domains.
  • InterPro entries are linked to one another
    through PARENT/CHILD and CONTAINS/FOUND IN
    relationships. PARENT/CHILD relationships
    indicate superfamily/family/subfamily
    relationships, as well as domain hierarchies,
    where sequences can be subdivided into more
    specific sub-sets. CONTAINS/FOUND IN
    relationships apply to domains, repeats and sites
    within families, and are used to describe the
    composition of protein sequences.

39
Introduction to Sequence Analysis
Searching protein families with InterPro
  • Going back to our sequence 2
  • http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-eswissprot-idFOSB_MOUSEswissprot-accFOSB_MOUSE-noSession
  • We move down through the page to the section
    Database cross-references until we find the
    following link
  • There are 3 InterPro entries in this case. Usually
    InterPro defines one entry for each member database
    that contains a definition for this domain. The first
    entry is a Pfam annotation with type Domain. The
    second InterPro entry is an annotation from PRINTS; in
    this case the type is called Family. The third entry
    comes from SMART and PROSITE, and the type is again
    Domain.
  • InterPro entries can have associations like
    parent/child (different levels defined by
    InterPro methods) or contain/found in.
  • This particular domain is named differently in
    the three entries. The reason is only the
    preferences of each one of the database members
    for the names.

40
Introduction to Sequence Analysis
Searching protein families with InterPro
41
Introduction to Sequence Analysis
Thanks for your attention !
I would also like to thank the 2Can initiative at the EBI
for their effort. Some of the slides shown in this tutorial
were selected from the 2Can Support Portal.