Title: The biological meaning of pairwise alignments
1The biological meaning of pairwise alignments
Instituto de Ciências Biomédicas Universidade de
São Paulo
AG-ICB-USP
2What is a pairwise alignment?
- Comparison of 2 sequences nucleotide or protein
sequences - We can compare a sequence to an entire database
of sequences one pairwise alignment at a time - Different types of alignments global and local
alignment - Different algorithms Needleman-Wunsch,
Smith-Waterman, FastA, BLAST
AG-ICB-USP
3Pairwise alignment
- Output alignment of similar blocks or whole
sequences
gi3323386gbU85705.1IFU85705 Isospora felis
28S large subunit ribosomal RNA gene, complete
sequence Length 3227 Score 218 bits (110),
Expect 2e-54 Identities 146/158 (92) Strand
Plus / Minus Query 3 cacttttaactctctttccaaa
gtccttttcatctttccttcacagtacttgttcactat 62
Sbjct 386 cacttttaactctctttccaaag
aacttttcatctttccctcacggtacttgtttgctat 327
Query 63 cggtctcacgccaatatttagctttacgtgaaacttatca
cacattttgcgctcaaatcc 122
Sbjct 326 cggtctcgcgccaatatttagctttatg
tgaaacttatcacacattttgcgctcaaatcc 267 Query 123
caatgaacgcgactcaataaaagcgcaccgtacgtgga 160
Sbjct 266 cgatgaacgcgactctataaaggcgtaccgtacgtgga
229
AG-ICB-USP
4Some applications of pairwise alignments
- Annotation description of the characteristics
of a sequence - Function ascribing similar sequences MAY share
similar functions - Identification of structural domains similar
sequences MAY share similar structures - Identification of protein domains defines
protein architecture - Phylogenetic inference identification of
similar sequences that MAY have a common ancestry
AG-ICB-USP
5Some applications of pairwise alignments
- Identification of contaminant sequences in a
sequencing project query sequence x databases
(bacterial, ribosomal, mitochondrial, etc.) - Identification of vector sequences in sequencing
reads alignment and masking
AG-ICB-USP
6Identity, similarity, homology
- Identity refers to nucleotide or amino acid
residues that are identical - Similarity - measurable quantity percentage of
identities between two sequences, percentage of
similar amino acid residues (conserved along the
evolution). - Homology based on a evolutionary conclusion
that implies that two sequences has a common
ancestral sequence. They are said to share the
same evolutionary history. Homology is not
quantitative. Two sequences can be or not to be
homologous.
AG-ICB-USP
7Identity, similarity, homology
- A high degree of similarity between two sequences
MAY suggest that they share a common
evolutionary history. Other analyses and
experimental work should be done to validate such
hypothesis
AG-ICB-USP
8Contaminant removal
Libraries can be contaminated by different sources
Genomic libraries
- Other organisms and/or cells co-purification
- Bacterial DNA - E. coli used as the host cell
- Human contamination during manipulation
- Other genomes being manipulated in the lab
cross-contamination
AG-ICB-USP
9Contaminant removal
Libraries can be contaminated by different sources
EST libraries
- All sources already mentioned
- Ribosomal RNA co-purification with the polyA
fraction - Organelle transcripts mitochondrion, plastid
AG-ICB-USP
10Vector masking
A typical read contains sequence stretches that
are not originally part of the insert
insert
Sequencing reaction
Vector sequence
Vector sequence
AG-ICB-USP
11Vector masking
Masking consists in a substitution of bases that
are not part of the insert by Xs
insert
Vector sequence
Vector sequence
insert
xxxxxxxxx
xxxxxxxxxxxxxxxx
Vector sequence
Vector sequence
- X bases will not be taken into account by
assembly/clustering programs
AG-ICB-USP
12Aligning Two Sequences
- Human Hemoglobin (HH)
- VLSPADKTNVKAAWGKVGAHAGYEG
- Sperm Whale Myoglobin (SWM)
- VLSEGEWQLVLHVWAKVEADVAGHG
AG-ICB-USP
13Aligning Two Sequences
- (HH) VLSPADKTNVKAAWGKVGAHAGYEG
-
- (SWM) VLSEGEWQLVLHVWAKVEADVAGHG
- Gap Weight 12
- Length Weight 4
- Gaps 0
- Percent Similarity 40.000
- Percent Identity 36.000
- Matrix blosum62
AG-ICB-USP
14Gap Insertion/Deletion
- (HH) VLSPADKTNVKAAWGKVGAH-AGYEG
- ??? ? ? ?? ? ?? ?
- (SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
- - gap insertion/deletion
- Gap Weight 4
- Length Weight 1
- Gaps 2
- Percent Similarity 54.167
- Percent Identity 45.833
- BLOSUM62
AG-ICB-USP
15Scoring
- (HH) VLSPADKTNVKAAWGKVGAH-AGYEG
-
- (SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
- The score of the alignment is
- Matrix value at (V,V) (L,L) (S,S) (P,E)
(penalty for gap insertion/deletion)gaps
(penalty for gap extension)(total
length of all gaps)
AG-ICB-USP
16Scoring System
- Identity An objective and quite well defined
measure Count the number of identical matches,
divide by length of aligned region - Similarity A less well defined measure
- Category Amino acid
- Acids and Amides Asp (D) Glu(E) Asn (N) Gln (Q)
- Basic His (H) Lys (K) Arg (R)
- Aromatic Phe (F) Tyr (Y) Trp (W)
- Hydrophilic Ala (A) Cys (C) Gly (G) Pro (P) Ser
(S) Thr (T) - Hydrophobic Ile (I) Leu (L) Met (M) Val (V)
AG-ICB-USP
17Scoring system
- Rates of amino acid substitution are not uniform
- Some amino acids are more conserved than others
(e.g. C, H, W compared to A, L, I) - Some substitutions are more common than others
- (e.g. A I, A L compared to D L)
- Conclusion there are evolutionary pressures that
probably reflect structural and functional
constraints - Scoring matrices matrices that are used for
scoring amino acid substitutions in pairwise
alignments - They reflect substitution rates that are
originated by evolutionary events
AG-ICB-USP
18Amino acids - chemical relationships
Tiny
Alphatic
A
G
P
Hydrophobic
OH
L
I
S
C
V
Polar
T
Y
F
M
Hydrophilic
W
K
D
N
H
NH2
R
E
K
Aromatic
Charged
Positive
Negative
AG-ICB-USP
19PAM
- Stands for Point Accepted Mutation
- Dayhoff Matrix, 1978
- A series of matrices describing the extent to
which two amino acids have been interchanged in
evolution - Very similar sequences were aligned, phylogenetic
trees were built, and ancestral sequences were
reconstructed - Out of these alignments, the frequency of
substitution between each pair of amino acids was
calculated. Using this information, PAM matrices
were built (PAM1 i.e. one accepted point mutation
per 100 amino acids).
AG-ICB-USP
20PAM250 - amino acid substitution matrix
GAP_CREATE 12 GAP_EXTEND 4
A B C D E F G H I K
L M N P Q R S T V
W A 2 0 -2 0 0 -4 1 -1
-1 -1 -2 -1 0 1 0 -2 1 1
0 -6 B 0 2 -4 3 2 -5 0
1 -2 1 -3 -2 2 -1 1 -1
0 0 -2 -5 C -2 -4 12 -5 -5
-4 -3 -3 -2 -5 -6 -5 -4 -3 -5
-4 0 -2 -2 -8 D 0 3 -5 4
3 -6 1 1 -2 0 -4 -3 2 -1
2 -1 0 0 -2 -7 E 0 2 -5
3 4 -5 0 1 -2 0 -3 -2
1 -1 2 -1 0 0 -2 -7 F -4
-5 -4 -6 -5 9 -5 -2 1 -5 2
0 -4 -5 -5 -4 -3 -3 -1 0 G
1 0 -3 1 0 -5 5 -2 -3 -2
-4 -3 0 -1 -1 -3 1 0 -1
-7 H -1 1 -3 1 1 -2 -2 6
-2 0 -2 -2 2 0 3 2 -1 -1
-2 -3 I -1 -2 -2 -2 -2 1 -3
-2 5 -2 2 2 -2 -2 -2 -2
-1 0 4 -5 K -1 1 -5 0 0
-5 -2 0 -2 5 -3 0 1 -1 1
3 0 0 -2 -3 L -2 -3 -6 -4
-3 2 -4 -2 2 -3 6 4 -3
-3 -2 -3 -3 -2 2 -2 M -1 -2
-5 -3 -2 0 -3 -2 2 0 4 6
-2 -2 -1 0 -2 -1 2 -4 N 0
2 -4 2 1 -4 0 2 -2 1
-3 -2 2 -1 1 0 1 0 -2
-4 P 1 -1 -3 -1 -1 -5 -1 0
-2 -1 -3 -2 -1 6 0 0 1 0
-1 -6 Q 0 1 -5 2 2 -5 -1
3 -2 1 -2 -1 1 0 4 1
-1 -1 -2 -5 R -2 -1 -4 -1 -1
-4 -3 2 -2 3 -3 0 0 0 1
6 0 -1 -2 2 S 1 0 0 0
0 -3 1 -1 -1 0 -3 -2 1 1
-1 0 2 1 -1 -2 T 1 0 -2
0 0 -3 0 -1 0 0 -2 -1
0 0 -1 -1 1 3 0 -5 V 0
-2 -2 -2 -2 -1 -1 -2 4 -2 2
2 -2 -1 -2 -2 -1 0 4 -6 W
-6 -5 -8 -7 -7 0 -7 -3 -5 -3
-2 -4 -4 -6 -5 2 -2 -5 -6
17
AG-ICB-USP
21BLOSUM
- Stands for Blocks Substitution Matrices
- Henikoff and Henikoff, 1992
- A series of matrices describing the extent to
which two amino acids are interchangeable in
conserved structures - Built by extracting replacement information from
the alignments in the BLOCKS database.
AG-ICB-USP
22BLOSUM
- The number in the series (BLOSUM62) represents
the threshold percent similarity between
sequences, for considering them in the
calculation. - For example, BLOSUM62 is derived from an
alignment of sequences that share 62 similarity,
BLOSUM45 is based on 45 sequence similarity in
aligned sequences
AG-ICB-USP
23BLOSUM62 - amino acid substitution matrix
Reference Henikoff, S. and Henikoff, J. G.
(1992). Amino acid substitution matrices from
protein blocks. Proc. Natl. Acad. Sci. USA 89
10915-10919. A R N D C Q E G H I L
K M F P S T W Y V B Z X A 4 -1 -2
-2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2
0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2
2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0
6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2
-3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3
-4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0
-3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1
-2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2
0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E
-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0
-1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2
6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2
-1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3
-3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3
-1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2
0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3
1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0
1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3
-2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
-3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3
-1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1
1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2
-2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1
-1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3
-3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2
11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3
2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2
0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1
0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1
0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1
-1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
-4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4
-4 -4 -4 -4 -4 -4 -4 -4 -4 1
AG-ICB-USP
24Guidelines
- Lower PAMs and higher Blosums find short local
alignment of highly similar sequences - Higher PAMs and lower Blosums find longer weaker
local alignment - No single matrix answers all questions
AG-ICB-USP
25BLAST Basic Local Alignment Search Tool
- Algorithm first described in 1990
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W.
Lipman, D.J. (1990) "Basic local alignment
search tool." J. Mol. Biol. 215403-410. - And improved in 1997
- Altschul, S.F., Madden, T.L., Schäffer, A.A.,
Zhang, J., Zhang, Z., Miller, W. Lipman, D.J.
(1997). Gapped BLAST and PSI-BLAST a new
generation of protein database search programs.
Nucleic Acids Res. 25 3389-3402.
AG-ICB-USP
26Blast search four components
- Search purpose/goal
- Program
- Query sequence
- Database
AG-ICB-USP
27BLAST search purpose/goal
- What is the biological question? Examples
- Which proteins of the database are similar to my
protein sequence? - Which proteins of the database are similar to
the conceptual translation of my DNA sequence? - Which nucleotide sequences in the database are
similar to my nucleotide sequence? - Which proteins coded by the conceptual
translation of the database sequences are similar
to my protein sequence? - Which proteins coded by the conceptual
translation of the database sequences are similar
to the conceptual translation of my DNA sequence?
AG-ICB-USP
28BLAST search purpose/goal
- Which proteins of the database are similar to my
protein sequence? - I have sequenced a gene and derived the protein
sequence by concetpual translation.
Alternatively, I obtained the protein sequence
directly. I am now interested to find out its
possible fnction. - Using a similarity search, I can find protein
sequences in databases that are similar to mine
orthologs and paralogs. - BLASTP protein query x protein database
AG-ICB-USP
29BLAST - search purpose/goal
- Which proteins of the database are similar to the
conceptual translation of my DNA sequence? - I have sequenced an EST (expressed sequence tag)
that contains a protein coding region. - I am interested to find out which proteins of
the database are similar to the conceptual
translation of my nucleic acid sequence. - BLASTX nucleotide (translated) query x protein
database
AG-ICB-USP
30BLAST search purpose/goal
- Which nucleotide sequences of the database are
similar to my DNA sequence? - I have sequenced a DNA fragment.
- I am interested to find out which DNA sequences
of the database are similar to my nucleic acid
sequence. - BLASTN nucleotide query x nucleotide database
AG-ICB-USP
31BLAST - search purpose/goal
- Which proteins translated from a nucleic acid
database are similar to the conceptual
translation of my DNA sequence? - I have sequenced an EST (expressed sequence tag)
that contains a protein coding region. - I am interested to find out which ESTs of other
organisms may be coding for homologous proteins. - TBLASTX nucleotide (translated) query x
nucleotide (translated) database
AG-ICB-USP
32BLAST search purpose/goal
- Which proteins coded by the conceptual
translation of the database sequences are similar
to my protein sequence? - I have a protein sequence on hands and am
interested to find out which genes of other
organisms may be coding for homologous proteins. - TBLASTN protein query x nucleotide
(translated) database
AG-ICB-USP
33BLAST - programs
- BLASTP protein query x protein database
- BLASTN nucleotide query x nucleotide database
- BLASTX nucleotide (translated) query x protein
database - TBLASTN protein query x nucleotide (translated)
database - TBLASTX nucleotide query (translated) x
nucleotide (translated) database
AG-ICB-USP
34FastA format
BLAST query sequence
- The first line begins with the symbol 'gt'
followed by the name of the sequence - The sequence is on the remaining lines.
- The sequence must not contain blanks.
- The sequence could be in upper or lower case.
- Below is an example sequence in FASTA format\
- gtDNA sequence
- GCCCCCGGCCCCGCCCCGGCCCCGCCCCCGGCCCCGCCCCGCAAGGGTC
- ACAGGTCACGGGGCGGGGCCGAGGCGGAAGCGCCCGCAGCCCGGTACCG
- GCTCCTCCTGGGCTCCCTCTAGCGCCTTCCCCCCGGCCCGACTCCGCTG
- GTCAGCGCCAAGTGACTTACGCCCCCGACCTCTGAGCCCGGACCGCTAG
AG-ICB-USP
35BLAST database
- Nucleotide databases
- nr, refseq, est_human, est_mouse, est_others,
wgs, etc. - Protein databases nr, Swiss-Prot, refseq, etc.
AG-ICB-USP
36AG-ICB-USP
37AG-ICB-USP
38AG-ICB-USP
39AG-ICB-USP
40AG-ICB-USP
41AG-ICB-USP
42AG-ICB-USP
43AG-ICB-USP
44AG-ICB-USP
45AG-ICB-USP
46Blast programs
- PSI-BLAST Position-Specific Iterated BLAST
program - performs an iterative search in which
sequences found in one round of searching are
used to build a score model for the next round of
searching. In PSI-BLAST the algorithm is not tied
to a specific score matrix. - PHI-BLAST Pattern-Hit Initiated BLAST - a
search program that combines matching of regular
expressions with local alignments surrounding the
match. - MEGABLAST uses the greedy algorithm for
nucleotide sequence alignment search - it can be
up to 10 times faster than more common sequence
similarity programs and handles much longer DNA
sequences than the blastn program
AG-ICB-USP