Sequence%20Alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence%20Alignment

Description:

Sequence Alignment – PowerPoint PPT presentation

Number of Views:70
Avg rating:3.0/5.0
Slides: 36
Provided by: maxsh
Category:

less

Transcript and Presenter's Notes

Title: Sequence%20Alignment


1
Sequence Alignment
Motivation
  • Storing, retrieving and comparing DNA sequences
    in Databases.
  • Comparing two or more sequences for similarities.
  • Searching databases for related sequences and
    subsequences.
  • Exploring frequently occurring patterns of
    nucleotides.
  • Finding informative elements in protein and DNA
    sequences.
  • Various experimental applications (reconstruction
    of DNA, etc.)

2
Seq.Align. Protein Function
Gene1
Gene2
More than 25 sequence identity
?
Similar function
3
Alignment - inexact matching
  • Substitution - replacing a sequence base by
    another.
  • Insertion - an insertion of a base (letter)
    or several bases to the sequence.
  • Deletion - deleting a base (or more) from
    the sequence.
  • (Insertion and deletion are the reverse of one
    another)

4
Seq. Align. Score
Commonly used matrices PAM250, BLOSUM64
5
Global Alignment
Global Alignment INPUT Two sequences S and T of
roughly the same length. QUESTION What is the
maximum similarity between them?
Find one of the best alignments.
6
The IDEA
s1n
t1m
To align s1...i with t1j we have three
choices align s1i-1 with t1j-1 and match
si with tj align s1i with t1j-1 and
match a space with tj align s1i-1 with
t1j and match si with a space
s1i-1 i
t1j-1 j
s1 i -
t1j-1 j
s1i-1 i
t1 j -
7
Global Alignment
Needleman-Wunsch 1970
Alignment Score V(n,m)
8
Local Alignment
Local Alignment INPUT Two sequences S and T
. QUESTION What is the maximum similarity
between a subsequence of S and a
subsequence of T ? Find most
similar subsequences.
9
Local Alignment
Smith-Waterman 1981
Penalties should be negative
10
match 2 mismatch -1
Local alignment
11
Sequence Alignment
Complexity Time O(nm) Space O(nm) (exist
algorithm with O(min(n,m)) )
12
Ends free alignment
Ends free alignment
INPUT Two equences
S and T (possibly of different length).
QUESTION Find one of the best alignments between
subsequences of S and T when at least one of
these subsequences is a prefix of the original
sequence and one (not necessarily the other) is a
suffix.
or
13
Ends free alignment
m
n
14
Gap Alignment
Definition A gap is the maximal contiguous run
of spaces in a single sequence within a given
alignment.The length of a gap is the number of
indel operations on it. A gap penalty function is
a function that measure the cost of a gap as a
(nonlinear) function of its length.
Gap penalty

INPUT Two sequences S and T (possibly of
different length). QUESTION Find one of the
best alignments between the two
sequences using the gap penalty function.
Affine Gap
Wtotal Wg qWs Wg weight to open the gap Ws
weight to extend the gap
15
What Kind of Alignment to Use?
  • The same protein from the different organisms.
  • Two different proteins sharing the same function.
  • Protein domain against a database of complete
    proteins.
  • ...

16
M. Dayhoff Scoring Matrices Point Accepted
Mutations or PAM matrices
Proteins with 85 identity were used -gt the
function is not significantly changed -gt the
mutations are accepted PAM units the
measure of the amount of evolutionary distance
between two amino acid sequences. One PAM unit
S1 has converted (mutated) to S2 with an average
of one accepted point-mutation event per 100
amino acids.
17
PAM250
18
  1. Probability matrix
  2. Scoring matrix

pa (Na/N) probability of occurrence of amino
acid a over a large, sufficiently
varied, data set. ?a pa 1 fab the number of
times the mutation a lt-gt b was observed to
occur. fa ?b?a fab - - - the total number
of mutations in which a was involved f ?a
fa - the total number of amino acid occurrences
involved in mutations.
19
M - 20x20 probability matrix Mab - the
probability of amino acid a changing into
b ma fa / f 1/(100 pa ) - relative
mutability of amino acid a. It is the
probability that the given amino acid will
change in the evolutionary period of interest.
Assumptions (a) 1 in 100 amino acids on
average is changed. (b) mutations are position
independent. (c) mutations are
independent on its past.
20
Maa 1- ma - the probablity of a to remain
unchanged. Mab Pr(a -gt b) Pr(a -gt b a
changed) Pr(a changed) (fab/fa )ma
Easy to see ?b Mab 1 Maa ? b?a (fab/fa
)ma 1- ma ma /fa ? b?a fab 1 ?a pa Maa
0.99 -gt in average 1 mutation every 100
positions.
21
What is the probability that a mutates into b
in two PAM units of evolution? a-gtc-gtb or
a-gtd-gt ?c Mac Mcb M2ab -gt M2 , M3 , M4
M250 k-gt? Mk converges to a matrix with
identical rows. Mkac pc - no matter what
amino acid you start with, after a long period
of evolution the resulting amino acid will be c
with probability pc .
22
PAM-k matrix
PAM-kab Mkab / pb - probability that a pair
ab is a mutation as opposed to being a
random occurrence (likelihood or odds
ratio). Mab / pb (fab/fa )ma / pb (fab/fa )
fa / (f 100 pa pb) fab/ (f 100 pa
pb) Mba / pa The total alignment score is
the product of Pam-kab . To avoid accuracy
problems Pam-kab 10 log Mkab / pb -gt The total
alignment score is the sum of Pam-kab .
23
BioPerl
Bioperl is a collection of perl modules that
facilitate the development of perl scripts for
bio-informatics applications.
Bioperl is open source software that is still
under active development. www.bioperl.org Tutori
al Documentation
24
BioPerl
  • Accessing sequence data from local and remote
    databases
  • Transforming formats of database/ file records
  • Manipulating individual sequences
  • Searching for "similar" sequences
  • Creating and manipulating sequence alignments
  • Searching for genes and other structures on
    genomic DNA
  • Developing machine readable sequence annotations

25
BioPerl library at TAU
BioPerl is NOT yet installed globally on CS
network. In each script you should add the
following line use lib "/home/silly6/mol/lib/bio
perl"
26
Sequence Object
Seq stores sequence, identification labels (id,
accession number, molecule type DNA, RNA,
Protein, ), multiple annotations and associated
sequence features.
27
Seq Objects
PrimarySeq light-weight Seq. Stores only
sequence and id. labels. LocatableSeq Seq
object with start and stop positions. Used by
SimpleAlign object. Created by pSW, Clustalw,
Tcoffee or bl2seq. LargeSeq special type of Seq
(more than 100 MB). LiveSeq handles changing
over time locations on a sequence. SeqI
interface for Seq objects.
28
Sequence Object
seq BioSeq-gtnew('-seq'gt'actgtggcgtcaact',
'-desc'gt'Sample
BioSeq object',
'-display_id' gt 'something',
'-accession_number' gt
'accnum',
'-moltype' gt 'dna' )
Usually Seq is not created this way.
29
Sequence Object
seqobj-gtdisplay_id() the human read-able id
of the sequence seqobj-gtseq() string of
sequence seqobj-gtsubseq(5,10) part of the
sequence as a string seqobj-gtaccession_number()
when there, the accession number
seqobj-gtmoltype() one of 'dna','rna','protein
' seqobj-gtprimary_id() a unique id for this
sequence irregardless of its display_id
or accession number
30
Accessing Data Base
Databases genbank, genpept, swissprot and gdb.
gb new BioDBGenBank() seq1
gb-gtget_Seq_by_id('MUSIGHBA1') seq2
gb-gtget_Seq_by_acc('AF303112')) seqio
gb-gtget_Stream_by_batch( qw(J00522 AF303112

2981014)))
31
Seq module
use BioDBGenBank gb new
BioDBGenBank() seq1
gb-gtget_Seq_by_acc('AF303112') seq2seq1-gttrun
c(1,90) print
seq2-gtseq(), "\n" seq3seq2-gttranslate print
seq3-gtseq(), \n
ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAA
GCTAGCATCTAAGGATGTGGGCGGAGGGTCAGATCCATAC MEPKQGYLL
VKLIEARKLASKDVGGGSDPY
32
SeqIO object
seq gb-gtget_Seq_by_acc('AF303112')) out
BioSeqIO-gtnew('-file' gt "gtf.fasta",
'-format' gt
'Fasta') out-gtwrite_seq(seq)
SeqIO can read/write/transform data in the
following formats Fasta, EMBL. GenBank,
Swissprot, PIR, GCG, SCF, ACE, BSML
33
Transforming Sequence Files
in BioSeqIO-gtnew('-file' gt f.fasta",

'-format' gt 'Fasta') out
BioSeqIO-gtnew('-file' gt "gtf.embl",
'-format' gt
EMBL') out-gtwrite_seq(in-gtnext_seq()) for
several sequences
while ( my seq in-gtnext_seq() )
out-gtwrite_seq(seq)

betterin BioSeqIO-gtnew('-file' gt
f.fasta",
'-format' gt 'Fasta') out BioSeqIO-gtnew('-fi
le' gt "gtf.embl", '-format' gt
EMBL') Fastalt-gtEMBL format converterprint
out _ while ltingt
34
BioPerl Pairwise Sequence Alignment
Smith-Waterman Algorithm
use BioToolspSW factory new
BioToolspSW( '-matrix' gt 'blosum62.bla',

'-gap' gt 12,
'-ext' gt 2, ) factory-gtalign_an
d_show(seq1, seq2, STDOUT)
Currently works only on protein sequences.
35
Alignment Objects
SimpleAlign handles multiple alignments of
sequences. pSW module aln factory-gtpairwise_a
lignment(seq1, seq2) foreach seq (
aln-gteachSeq() )
print seq-gtseq(), "\n"
alnout BioAlignIO-gtnew(-format gt 'fasta',
-fh gt
\STDOUT) alnout-gtwrite_aln(aln)
36
Homework
  • Write a cgi script (using Perl) that performs
    pairwise Local/Global Alignment for DNA
    sequences. All I/O is via HTML only.
  • Input
  • Choice for Local/Global alignment.
  • Two sequences text boxes or two accession
    numbers.
  • Values for match, mismatch, ins/dels.
  • Number of iterations for computing random scores.
  • Output
  • Alignment score.
  • z-score value (z (score-average)/standard
    deviation.)

Remarks 1) you are allowed to use only linear
space. 2)To compute z-score perform random
shuffling srand(time ) init,
-proc.id int(rand(i)) returns rand. number
between 0,i. 3)Shuffling is done in windows
(non-overlapping) of 10 bases length. Number of
shuffling for each window is random 0,10.
37
BioPerl Multiple Sequence Alignment
_at_params ('ktuple' gt 2, 'matrix' gt
'BLOSUM') factory BioToolsRunAlignment
Clustalw-gtnew(_at_params) aln factory-gtalign(\_at_s
eq_array) foreach seq ( aln-gteachSeq() )
print
seq-gtseq(), "\n"
Write a Comment
User Comments (0)
About PowerShow.com