The biological meaning of pairwise alignments

1
The biological meaning of pairwise alignments

Arthur Gruber

Instituto de Ciências Biomédicas Universidade de
São Paulo
AG-ICB-USP
2
What is a pairwise alignment?

Comparison of two sequences: nucleotide or protein
sequences
We can compare a sequence to an entire database
of sequences, one pairwise alignment at a time
Different types of alignment: global and local
alignment
Different algorithms: Needleman-Wunsch,
Smith-Waterman, FastA, BLAST
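
The difference between global and local alignment can be sketched with a small dynamic-programming routine. This is only an illustration in the style of Needleman-Wunsch (global) and Smith-Waterman (local); the function name and the match, mismatch and gap values are assumptions chosen for the example, not values from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score with a linear gap penalty (illustrative values).

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


print(align_score("ACGT", "AGT"))                      # global score
print(align_score("TTACGTTT", "GGACGGG", local=True))  # local score
```

The only differences between the two modes are the zero floor and where the final score is read from: the last cell for a global alignment, the best cell anywhere for a local one.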

3
Pairwise alignment

Output: alignment of similar blocks or whole
sequences

gi|3323386|gb|U85705.1|IFU85705 Isospora felis
28S large subunit ribosomal RNA gene, complete
sequence
Length = 3227
Score = 218 bits (110), Expect = 2e-54
Identities = 146/158 (92%)
Strand = Plus / Minus

Query 3   cacttttaactctctttccaaagtccttttcatctttccttcacagtacttgttcactat 62
Sbjct 386 cacttttaactctctttccaaagaacttttcatctttccctcacggtacttgtttgctat 327

Query 63  cggtctcacgccaatatttagctttacgtgaaacttatcacacattttgcgctcaaatcc 122
Sbjct 326 cggtctcgcgccaatatttagctttatgtgaaacttatcacacattttgcgctcaaatcc 267

Query 123 caatgaacgcgactcaataaaagcgcaccgtacgtgga 160
Sbjct 266 cgatgaacgcgactctataaaggcgtaccgtacgtgga 229
4
Some applications of pairwise alignments

Annotation: description of the characteristics
of a sequence
Function ascription: similar sequences MAY share
similar functions
Identification of structural domains: similar
sequences MAY share similar structures
Identification of protein domains: defines
protein architecture
Phylogenetic inference: identification of
similar sequences that MAY have a common ancestry

5
Some applications of pairwise alignments

Identification of contaminant sequences in a
sequencing project: query sequence x databases
(bacterial, ribosomal, mitochondrial, etc.)
Identification of vector sequences in sequencing
reads: alignment and masking

6
Identity, similarity, homology

Identity: refers to nucleotide or amino acid
residues that are identical
Similarity: a measurable quantity, e.g. the percentage of
identities between two sequences, or the percentage of
similar amino acid residues (conserved during
evolution).
Homology: an evolutionary conclusion
that two sequences have a common
ancestral sequence. They are said to share the
same evolutionary history. Homology is not
quantitative: two sequences either are or are not
homologous.

7
Identity, similarity, homology

A high degree of similarity between two sequences
MAY suggest that they share a common
evolutionary history. Other analyses and
experimental work should be done to validate such
a hypothesis

8
Contaminant removal
Libraries can be contaminated by different sources
Genomic libraries

Other organisms and/or cells: co-purification
Bacterial DNA: E. coli used as the host cell
Human contamination during manipulation
Other genomes being manipulated in the lab:
cross-contamination

9
Contaminant removal
Libraries can be contaminated by different sources
EST libraries

All sources already mentioned
Ribosomal RNA: co-purification with the polyA
fraction
Organelle transcripts: mitochondrion, plastid

10
Vector masking
A typical read contains sequence stretches that
are not originally part of the insert
[Figure: read from a sequencing reaction, with the insert flanked by vector sequence on both sides]
11
Vector masking
Masking replaces the bases that are not part of the
insert with Xs
[Figure: a read before masking (vector sequence, insert, vector sequence) and after masking (xxxxxxxxx, insert, xxxxxxxxxxxxxxxx)]

X bases will not be taken into account by
assembly/clustering programs
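
As a minimal sketch of the masking step, assume the vector regions have already been located (for example by aligning the read against vector sequences) and are given as coordinate pairs. The function name and the 0-based, end-exclusive coordinate convention are assumptions made for this illustration.

```python
def mask_vector(read, vector_spans):
    """Return the read with every base inside a vector span replaced by 'X'.

    vector_spans: iterable of (start, end) pairs, 0-based, end exclusive
    (a hypothetical convention chosen for this example).
    """
    masked = list(read)
    for start, end in vector_spans:
        # Clamp each span to the read so out-of-range coordinates are harmless.
        for i in range(max(start, 0), min(end, len(read))):
            masked[i] = "X"
    return "".join(masked)


print(mask_vector("AAACCCGGGTTT", [(0, 3), (9, 12)]))
```

The read keeps its original length, so insert coordinates remain valid after masking.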

12
Aligning Two Sequences

Human Hemoglobin (HH)
VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM)
VLSEGEWQLVLHVWAKVEADVAGHG

13
Aligning Two Sequences

(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
Gap Weight: 12
Length Weight: 4
Gaps: 0
Percent Similarity: 40.000
Percent Identity: 36.000
Matrix: blosum62

14
Gap Insertion/Deletion

(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
      |||      |   | || |  ||  |
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
- = gap (insertion/deletion)
Gap Weight: 4
Length Weight: 1
Gaps: 2
Percent Similarity: 54.167
Percent Identity: 45.833
Matrix: BLOSUM62
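
The percent identity figures for the two hemoglobin/myoglobin alignments can be checked with a short sketch. It assumes that identity is counted over the columns where neither sequence has a gap; that denominator is an assumption of this example, although it does reproduce the 36.000 and 45.833 values quoted on the slides.

```python
def percent_identity(row_a, row_b):
    """Percent of identical residues over the columns that contain no gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(f"{ungapped:.3f} {gapped:.3f}")  # 36.000 45.833
```

Allowing two gaps raises the identity from 9 of 25 columns to 11 of 24, which is why the gap penalties matter so much for the final alignment.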

15
Scoring

(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
The score of the alignment is:
Matrix value at (V,V) + (L,L) + (S,S) + (P,E) + ...
- (penalty for gap insertion/deletion) x (number of gaps)
- (penalty for gap extension) x (total length of all gaps)
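
The scoring rule can be written as a small function. This is a sketch under stated assumptions: the substitution matrix is passed in as a function `sub_score`, and the toy scorer used below (1 for identical residues, 0 otherwise) is a stand-in for a real matrix such as BLOSUM62, so the printed score is not the score a real program would report.

```python
def alignment_score(row_a, row_b, sub_score, gap_open, gap_extend):
    """Sum of substitution scores minus gap penalties.

    Penalty = gap_open x (number of gaps) + gap_extend x (total gap length),
    where a gap is a run of consecutive '-' characters in one row.
    """
    total = 0
    n_gaps = 0
    gap_length = 0
    previous_gap_row = None  # row that carried a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            if gap_row != previous_gap_row:
                n_gaps += 1  # a new gap starts here
            gap_length += 1
            previous_gap_row = gap_row
        else:
            total += sub_score(x, y)
            previous_gap_row = None
    return total - gap_open * n_gaps - gap_extend * gap_length


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 1 if x == y else 0


print(alignment_score("VLSPADKTNVKAAWGKVGAH-AGYEG",
                      "VLSEGEWQLVLHVWAKVEADVAGH-G",
                      toy_score, gap_open=4, gap_extend=1))
```

Passing the matrix as a function keeps the sketch independent of any particular matrix; a dictionary lookup over PAM250 or BLOSUM62 values could be substituted for `toy_score`.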

16
Scoring System

Identity: an objective and quite well-defined
measure. Count the number of identical matches and
divide by the length of the aligned region
Similarity: a less well-defined measure

Category           Amino acids
Acids and amides   Asp (D), Glu (E), Asn (N), Gln (Q)
Basic              His (H), Lys (K), Arg (R)
Aromatic           Phe (F), Tyr (Y), Trp (W)
Hydrophilic        Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)
Hydrophobic        Ile (I), Leu (L), Met (M), Val (V)

17
Scoring system

Rates of amino acid substitution are not uniform
Some amino acids are more conserved than others
(e.g. C, H, W compared to A, L, I)
Some substitutions are more common than others
(e.g. A↔I, A↔L compared to D↔L)
Conclusion: there are evolutionary pressures that
probably reflect structural and functional
constraints
Scoring matrices: matrices that are used for
scoring amino acid substitutions in pairwise
alignments
They reflect substitution rates that
arise from evolutionary events

18
Amino acids - chemical relationships
[Figure: Venn diagram grouping the amino acids by chemical properties. Set labels: Tiny, Aliphatic, Hydrophobic, Polar, Hydrophilic, Aromatic, Charged, Positive, Negative; functional group labels: OH, NH2]
19
PAM

Stands for Point Accepted Mutation
Dayhoff Matrix, 1978
A series of matrices describing the extent to
which two amino acids have been interchanged in
evolution
Very similar sequences were aligned, phylogenetic
trees were built, and ancestral sequences were
reconstructed
From these alignments, the frequency of
substitution between each pair of amino acids was
calculated. Using this information, PAM matrices
were built (PAM1 corresponds to one accepted point
mutation per 100 amino acids).

20
PAM250 - amino acid substitution matrix
GAP_CREATE 12   GAP_EXTEND 4

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W
A  2  0 -2  0  0 -4  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6
B  0  2 -4  3  2 -5  0  1 -2  1 -3 -2  2 -1  1 -1  0  0 -2 -5
C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8
D  0  3 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7
E  0  2 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7
F -4 -5 -4 -6 -5  9 -5 -2  1 -5  2  0 -4 -5 -5 -4 -3 -3 -1  0
G  1  0 -3  1  0 -5  5 -2 -3 -2 -4 -3  0 -1 -1 -3  1  0 -1 -7
H -1  1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3
I -1 -2 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5
K -1  1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3
L -2 -3 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2
M -1 -2 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4
N  0  2 -4  2  1 -4  0  2 -2  1 -3 -2  2 -1  1  0  1  0 -2 -4
P  1 -1 -3 -1 -1 -5 -1  0 -2 -1 -3 -2 -1  6  0  0  1  0 -1 -6
Q  0  1 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5
R -2 -1 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2
S  1  0  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2
T  1  0 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5
V  0 -2 -2 -2 -2 -1 -1 -2  4 -2  2  2 -2 -1 -2 -2 -1  0  4 -6
W -6 -5 -8 -7 -7  0 -7 -3 -5 -3 -2 -4 -4 -6 -5  2 -2 -5 -6 17
21
BLOSUM

Stands for Blocks Substitution Matrices
Henikoff and Henikoff, 1992
A series of matrices describing the extent to
which two amino acids are interchangeable in
conserved structures
Built by extracting replacement information from
the alignments in the BLOCKS database.

22
BLOSUM

The number in the series (e.g. BLOSUM62) represents
the threshold percent identity between
sequences for considering them in the
calculation.
For example, BLOSUM62 is derived from
alignment blocks in which sequences sharing 62%
identity or more are clustered together;
BLOSUM45 uses a 45% identity threshold

23
BLOSUM62 - amino acid substitution matrix
Reference: Henikoff, S. and Henikoff, J. G.
(1992). Amino acid substitution matrices from
protein blocks. Proc. Natl. Acad. Sci. USA 89:
10915-10919.

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
24
Guidelines

Lower PAMs and higher BLOSUMs find short local
alignments of highly similar sequences
Higher PAMs and lower BLOSUMs find longer, weaker
local alignments
No single matrix answers all questions

25
BLAST: Basic Local Alignment Search Tool

Algorithm first described in 1990
Altschul, S.F., Gish, W., Miller, W., Myers, E.W.
& Lipman, D.J. (1990) "Basic local alignment
search tool." J. Mol. Biol. 215:403-410.
And improved in 1997
Altschul, S.F., Madden, T.L., Schäffer, A.A.,
Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J.
(1997) "Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs."
Nucleic Acids Res. 25:3389-3402.

26
BLAST search: four components

Search purpose/goal
Program
Query sequence
Database

27
BLAST search purpose/goal

What is the biological question? Examples:
Which proteins of the database are similar to my
protein sequence?
Which proteins of the database are similar to
the conceptual translation of my DNA sequence?
Which nucleotide sequences in the database are
similar to my nucleotide sequence?
Which proteins coded by the conceptual
translation of the database sequences are similar
to my protein sequence?
Which proteins coded by the conceptual
translation of the database sequences are similar
to the conceptual translation of my DNA sequence?

28
BLAST search purpose/goal

Which proteins of the database are similar to my
protein sequence?
I have sequenced a gene and derived the protein
sequence by conceptual translation.
Alternatively, I obtained the protein sequence
directly. I am now interested in finding out its
possible function.
Using a similarity search, I can find protein
sequences in databases that are similar to mine:
orthologs and paralogs.
BLASTP: protein query x protein database

29
BLAST - search purpose/goal

Which proteins of the database are similar to the
conceptual translation of my DNA sequence?
I have sequenced an EST (expressed sequence tag)
that contains a protein coding region.
I am interested in finding out which proteins of
the database are similar to the conceptual
translation of my nucleic acid sequence.
BLASTX: nucleotide (translated) query x protein
database

30
BLAST search purpose/goal

Which nucleotide sequences of the database are
similar to my DNA sequence?
I have sequenced a DNA fragment.
I am interested in finding out which DNA sequences
of the database are similar to my nucleic acid
sequence.
BLASTN: nucleotide query x nucleotide database

31
BLAST - search purpose/goal

Which proteins translated from a nucleic acid
database are similar to the conceptual
translation of my DNA sequence?
I have sequenced an EST (expressed sequence tag)
that contains a protein coding region.
I am interested in finding out which ESTs of other
organisms may be coding for homologous proteins.
TBLASTX: nucleotide (translated) query x
nucleotide (translated) database

32
BLAST search purpose/goal

Which proteins coded by the conceptual
translation of the database sequences are similar
to my protein sequence?
I have a protein sequence at hand and am
interested in finding out which genes of other
organisms may be coding for homologous proteins.
TBLASTN: protein query x nucleotide
(translated) database

33
BLAST - programs

BLASTP: protein query x protein database
BLASTN: nucleotide query x nucleotide database
BLASTX: nucleotide (translated) query x protein
database
TBLASTN: protein query x nucleotide (translated)
database
TBLASTX: nucleotide (translated) query x
nucleotide (translated) database
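
The choice of program follows directly from the query and database types, which the following sketch encodes as a lookup. The function name and the `translate_nucleotides` flag are hypothetical, introduced only for this illustration; they are not part of any BLAST interface.

```python
def choose_blast_program(query_type, database_type, translate_nucleotides=False):
    """Pick a BLAST program name from the query and database sequence types.

    query_type, database_type: "protein" or "nucleotide".
    translate_nucleotides: hypothetical flag; when both sides are nucleotide,
    True means compare their conceptual translations (TBLASTX).
    """
    table = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",
        ("protein", "nucleotide"): "TBLASTN",
    }
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if translate_nucleotides else "BLASTN"
    if key not in table:
        raise ValueError(f"unsupported combination: {key}")
    return table[key]


print(choose_blast_program("nucleotide", "protein"))
```

Only the nucleotide-versus-nucleotide case needs an extra decision, because the same inputs can be compared directly (BLASTN) or as translated proteins (TBLASTX).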

34
FastA format
BLAST query sequence

The first line begins with the symbol '>'
followed by the name of the sequence
The sequence is on the remaining lines.
The sequence must not contain blanks.
The sequence can be in upper or lower case.
Below is an example sequence in FASTA format:
>DNA sequence
GCCCCCGGCCCCGCCCCGGCCCCGCCCCCGGCCCCGCCCCGCAAGGGTC
ACAGGTCACGGGGCGGGGCCGAGGCGGAAGCGCCCGCAGCCCGGTACCG
GCTCCTCCTGGGCTCCCTCTAGCGCCTTCCCCCCGGCCCGACTCCGCTG
GTCAGCGCCAAGTGACTTACGCCCCCGACCTCTGAGCCCGGACCGCTAG
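
Reading this format takes only a few lines of code. The sketch below is a minimal reader written for illustration, assuming a header line that starts with '>' followed by sequence lines; the function name is an assumption, and real projects would more likely use an established parser.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>DNA sequence
GCCCCCGGCCCCGCCCCGGCCCCGCCCCCGGCCCCGCCCCGCAAGGGTC
ACAGGTCACGGGGCGGGGCCGAGGCGGAAGCGCCCGCAGCCCGGTACCG
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

Sequence lines are joined without separators, so the line breaks in the file do not affect the parsed sequence.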

35
BLAST database

Nucleotide databases: nr, refseq, est_human,
est_mouse, est_others, wgs, etc.
Protein databases: nr, Swiss-Prot, refseq, etc.

46
BLAST programs

PSI-BLAST: Position-Specific Iterated BLAST
program. Performs an iterative search in which
sequences found in one round of searching are
used to build a score model for the next round of
searching. In PSI-BLAST the algorithm is not tied
to a specific score matrix.
PHI-BLAST: Pattern-Hit Initiated BLAST. A
search program that combines matching of regular
expressions with local alignments surrounding the
match.
MEGABLAST: uses a greedy algorithm for
nucleotide sequence alignment search. It can be
up to 10 times faster than more common sequence
similarity programs and handles much longer DNA
sequences than the blastn program
