Title: Sequence%20Alignment
1Sequence Alignment
Motivation
- Storing, retrieving and comparing DNA sequences
in Databases. - Comparing two or more sequences for similarities.
- Searching databases for related sequences and
subsequences. - Exploring frequently occurring patterns of
nucleotides. - Finding informative elements in protein and DNA
sequences. - Various experimental applications (reconstruction
of DNA, etc.)
2Seq.Align. Protein Function
Gene1
Gene2
More than 25 sequence identity
?
Similar function
3Alignment - inexact matching
- Substitution - replacing a sequence base by
another. - Insertion - an insertion of a base (letter)
or several bases to the sequence. - Deletion - deleting a base (or more) from
the sequence. - (Insertion and deletion are the reverse of one
another)
4Seq. Align. Score
Commonly used matrices PAM250, BLOSUM64
5Global Alignment
Global Alignment INPUT Two sequences S and T of
roughly the same length. QUESTION What is the
maximum similarity between them?
Find one of the best alignments.
6The IDEA
s1n
t1m
To align s1...i with t1j we have three
choices align s1i-1 with t1j-1 and match
si with tj align s1i with t1j-1 and
match a space with tj align s1i-1 with
t1j and match si with a space
s1i-1 i
t1j-1 j
s1 i -
t1j-1 j
s1i-1 i
t1 j -
7Global Alignment
Needleman-Wunsch 1970
Alignment Score V(n,m)
8Local Alignment
Local Alignment INPUT Two sequences S and T
. QUESTION What is the maximum similarity
between a subsequence of S and a
subsequence of T ? Find most
similar subsequences.
9Local Alignment
Smith-Waterman 1981
Penalties should be negative
10match 2 mismatch -1
Local alignment
11Sequence Alignment
Complexity Time O(nm) Space O(nm) (exist
algorithm with O(min(n,m)) )
12Ends free alignment
Ends free alignment
INPUT Two equences
S and T (possibly of different length).
QUESTION Find one of the best alignments between
subsequences of S and T when at least one of
these subsequences is a prefix of the original
sequence and one (not necessarily the other) is a
suffix.
or
13Ends free alignment
m
n
14Gap Alignment
Definition A gap is the maximal contiguous run
of spaces in a single sequence within a given
alignment.The length of a gap is the number of
indel operations on it. A gap penalty function is
a function that measure the cost of a gap as a
(nonlinear) function of its length.
Gap penalty
INPUT Two sequences S and T (possibly of
different length). QUESTION Find one of the
best alignments between the two
sequences using the gap penalty function.
Affine Gap
Wtotal Wg qWs Wg weight to open the gap Ws
weight to extend the gap
15What Kind of Alignment to Use?
- The same protein from the different organisms.
- Two different proteins sharing the same function.
- Protein domain against a database of complete
proteins. - ...
16M. Dayhoff Scoring Matrices Point Accepted
Mutations or PAM matrices
Proteins with 85 identity were used -gt the
function is not significantly changed -gt the
mutations are accepted PAM units the
measure of the amount of evolutionary distance
between two amino acid sequences. One PAM unit
S1 has converted (mutated) to S2 with an average
of one accepted point-mutation event per 100
amino acids.
17PAM250
18- Probability matrix
- Scoring matrix
pa (Na/N) probability of occurrence of amino
acid a over a large, sufficiently
varied, data set. ?a pa 1 fab the number of
times the mutation a lt-gt b was observed to
occur. fa ?b?a fab - - - the total number
of mutations in which a was involved f ?a
fa - the total number of amino acid occurrences
involved in mutations.
19M - 20x20 probability matrix Mab - the
probability of amino acid a changing into
b ma fa / f 1/(100 pa ) - relative
mutability of amino acid a. It is the
probability that the given amino acid will
change in the evolutionary period of interest.
Assumptions (a) 1 in 100 amino acids on
average is changed. (b) mutations are position
independent. (c) mutations are
independent on its past.
20Maa 1- ma - the probablity of a to remain
unchanged. Mab Pr(a -gt b) Pr(a -gt b a
changed) Pr(a changed) (fab/fa )ma
Easy to see ?b Mab 1 Maa ? b?a (fab/fa
)ma 1- ma ma /fa ? b?a fab 1 ?a pa Maa
0.99 -gt in average 1 mutation every 100
positions.
21What is the probability that a mutates into b
in two PAM units of evolution? a-gtc-gtb or
a-gtd-gt ?c Mac Mcb M2ab -gt M2 , M3 , M4
M250 k-gt? Mk converges to a matrix with
identical rows. Mkac pc - no matter what
amino acid you start with, after a long period
of evolution the resulting amino acid will be c
with probability pc .
22PAM-k matrix
PAM-kab Mkab / pb - probability that a pair
ab is a mutation as opposed to being a
random occurrence (likelihood or odds
ratio). Mab / pb (fab/fa )ma / pb (fab/fa )
fa / (f 100 pa pb) fab/ (f 100 pa
pb) Mba / pa The total alignment score is
the product of Pam-kab . To avoid accuracy
problems Pam-kab 10 log Mkab / pb -gt The total
alignment score is the sum of Pam-kab .
23BioPerl
Bioperl is a collection of perl modules that
facilitate the development of perl scripts for
bio-informatics applications.
Bioperl is open source software that is still
under active development. www.bioperl.org Tutori
al Documentation
24BioPerl
- Accessing sequence data from local and remote
databases - Transforming formats of database/ file records
- Manipulating individual sequences
- Searching for "similar" sequences
- Creating and manipulating sequence alignments
- Searching for genes and other structures on
genomic DNA - Developing machine readable sequence annotations
25BioPerl library at TAU
BioPerl is NOT yet installed globally on CS
network. In each script you should add the
following line use lib "/home/silly6/mol/lib/bio
perl"
26Sequence Object
Seq stores sequence, identification labels (id,
accession number, molecule type DNA, RNA,
Protein, ), multiple annotations and associated
sequence features.
27Seq Objects
PrimarySeq light-weight Seq. Stores only
sequence and id. labels. LocatableSeq Seq
object with start and stop positions. Used by
SimpleAlign object. Created by pSW, Clustalw,
Tcoffee or bl2seq. LargeSeq special type of Seq
(more than 100 MB). LiveSeq handles changing
over time locations on a sequence. SeqI
interface for Seq objects.
28Sequence Object
seq BioSeq-gtnew('-seq'gt'actgtggcgtcaact',
'-desc'gt'Sample
BioSeq object',
'-display_id' gt 'something',
'-accession_number' gt
'accnum',
'-moltype' gt 'dna' )
Usually Seq is not created this way.
29Sequence Object
seqobj-gtdisplay_id() the human read-able id
of the sequence seqobj-gtseq() string of
sequence seqobj-gtsubseq(5,10) part of the
sequence as a string seqobj-gtaccession_number()
when there, the accession number
seqobj-gtmoltype() one of 'dna','rna','protein
' seqobj-gtprimary_id() a unique id for this
sequence irregardless of its display_id
or accession number
30Accessing Data Base
Databases genbank, genpept, swissprot and gdb.
gb new BioDBGenBank() seq1
gb-gtget_Seq_by_id('MUSIGHBA1') seq2
gb-gtget_Seq_by_acc('AF303112')) seqio
gb-gtget_Stream_by_batch( qw(J00522 AF303112
2981014)))
31Seq module
use BioDBGenBank gb new
BioDBGenBank() seq1
gb-gtget_Seq_by_acc('AF303112') seq2seq1-gttrun
c(1,90) print
seq2-gtseq(), "\n" seq3seq2-gttranslate print
seq3-gtseq(), \n
ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAA
GCTAGCATCTAAGGATGTGGGCGGAGGGTCAGATCCATAC MEPKQGYLL
VKLIEARKLASKDVGGGSDPY
32SeqIO object
seq gb-gtget_Seq_by_acc('AF303112')) out
BioSeqIO-gtnew('-file' gt "gtf.fasta",
'-format' gt
'Fasta') out-gtwrite_seq(seq)
SeqIO can read/write/transform data in the
following formats Fasta, EMBL. GenBank,
Swissprot, PIR, GCG, SCF, ACE, BSML
33Transforming Sequence Files
in BioSeqIO-gtnew('-file' gt f.fasta",
'-format' gt 'Fasta') out
BioSeqIO-gtnew('-file' gt "gtf.embl",
'-format' gt
EMBL') out-gtwrite_seq(in-gtnext_seq()) for
several sequences
while ( my seq in-gtnext_seq() )
out-gtwrite_seq(seq)
betterin BioSeqIO-gtnew('-file' gt
f.fasta",
'-format' gt 'Fasta') out BioSeqIO-gtnew('-fi
le' gt "gtf.embl", '-format' gt
EMBL') Fastalt-gtEMBL format converterprint
out _ while ltingt
34BioPerl Pairwise Sequence Alignment
Smith-Waterman Algorithm
use BioToolspSW factory new
BioToolspSW( '-matrix' gt 'blosum62.bla',
'-gap' gt 12,
'-ext' gt 2, ) factory-gtalign_an
d_show(seq1, seq2, STDOUT)
Currently works only on protein sequences.
35Alignment Objects
SimpleAlign handles multiple alignments of
sequences. pSW module aln factory-gtpairwise_a
lignment(seq1, seq2) foreach seq (
aln-gteachSeq() )
print seq-gtseq(), "\n"
alnout BioAlignIO-gtnew(-format gt 'fasta',
-fh gt
\STDOUT) alnout-gtwrite_aln(aln)
36Homework
- Write a cgi script (using Perl) that performs
pairwise Local/Global Alignment for DNA
sequences. All I/O is via HTML only. - Input
- Choice for Local/Global alignment.
- Two sequences text boxes or two accession
numbers. - Values for match, mismatch, ins/dels.
- Number of iterations for computing random scores.
- Output
- Alignment score.
- z-score value (z (score-average)/standard
deviation.)
Remarks 1) you are allowed to use only linear
space. 2)To compute z-score perform random
shuffling srand(time ) init,
-proc.id int(rand(i)) returns rand. number
between 0,i. 3)Shuffling is done in windows
(non-overlapping) of 10 bases length. Number of
shuffling for each window is random 0,10.
37BioPerl Multiple Sequence Alignment
_at_params ('ktuple' gt 2, 'matrix' gt
'BLOSUM') factory BioToolsRunAlignment
Clustalw-gtnew(_at_params) aln factory-gtalign(\_at_s
eq_array) foreach seq ( aln-gteachSeq() )
print
seq-gtseq(), "\n"