Outline - PowerPoint PPT Presentation

Provided by: CeleraE7

Transcript and Presenter's Notes


1
(No Transcript)
2
Outline
  • The fundamental importance of Alignment and
    Statistics
  • The Basic Sequence Similarity Algorithm
  • Heuristics: BLAST, FASTA, SIM4
  • ESTomics

3
A story
The GOLD-BUG
By Edgar A. Poe
4
The Gold-Bug, By Edgar A. Poe
Mr. William Legrand left New Orleans and took up
residence on Sullivan's Island, near Charleston,
South Carolina.
His servant was Jupiter, an old negro. He calls
Mr Legrand Massa Will.
One day, Massa Will found a bug, a scarabaeus
which he believed to be totally new.
5

Jupiter describes the bug in his own words: "de
bug is a gole-bug, solid, ebery bit of him,
inside and all, sep him wing -- neber feel half
so hebby a bug in my life."
The design on the bug's back resembled a
death's-head. The story continues with the
search for a great treasure hidden by the
famous pirate Captain Kidd.
6
Captain Kidd's Code
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*—4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
7
E. A. Poe
Circumstances, and a certain bias of mind,
have led me to take interest in such riddles,
and it may well be doubted whether human
ingenuity can construct an enigma of the
kind which human ingenuity may not, by proper
application, resolve.
8
What Language ?
In the present case -- indeed in all cases of
secret writing -- the first question regards the
language of the cipher; for the principles of
solution, so far, especially, as the more simple
ciphers are concerned, depend upon, and are
varied by, the genius of a particular idiom.
9
E. A. Poe
In general, there is no alternative but
experiment (directed by probabilities) of every
tongue known to him who attempts the solution,
until the true one be attained. But for this
consideration, I should have begun my attempts
with the Spanish and French, as the tongues in
which a secret of this kind would most naturally
have been written by a pirate of the Spanish
main. As it was, I assumed the cryptograph to be
English.
10
Statistics
No division between words
Statistics of the characters:
8 there are 33.
; there are 26.
4 there are 19.
‡ ) there are 16.
* there are 13.
5 there are 12.
6 there are 11.
† 1 there are 8.
0 there are 6.
9 2 there are 5.
: 3 there are 4.
? there are 3.
¶ there are 2.
— . there is 1.
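The tally above can be produced mechanically. A minimal Python sketch, assuming the cryptogram is available as a string; the function name `symbol_counts` and the short sample text are illustrative assumptions, not part of the original slides:

```python
from collections import Counter

def symbol_counts(cipher_text):
    """Return (symbol, count) pairs, most frequent first, ignoring whitespace."""
    counts = Counter(ch for ch in cipher_text if not ch.isspace())
    # Sort by descending count, then by symbol so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample text, chosen only to show the shape of the output.
sample = ";48;48)8"
for symbol, count in symbol_counts(sample):
    print(symbol, count)
```

Running the same function on the full cryptogram gives the frequency table from which the first guess (the most frequent symbol stands for e) is made.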
11
We found our first letter!
In English the letter which most frequently
occurs is e.
Afterwards, the succession is: a o i d h n r s t
u y c f g l m w b k p q x z
12
Captain Kidd's Code: 8 is e
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*—4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
The doubled 88 matches the English ee in words like
speed, seen, been, agree
8 is e
13
Captain Kidd's Code: ; is t and 4 is h
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*—4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
;48 must be "the", the most frequent word
14

The Solution
A good glass in the bishop's hostel in the
devil's seat forty-one degrees and thirteen
minutes northeast and by north main branch
seventh limb east side shoot from the left eye
of the death's-head a bee-line from the tree
through the shot fifty feet out.
15
THE ALGORITHMS of GENOMICS
The Programming Language of Genomics is
BLASTALL
16
Sequence Comparison
  • Biomolecular sequences
  • DNA sequences (string over 4 letter alphabet A,
    C, G, T)
  • RNA sequences (string over 4 letter alphabet
    ACGU)
  • Protein sequences (string over 20 letter alphabet
    Amino Acids)
  • Sequence similarity helps in the discovery of
    genes, and the prediction of structure and
    function of proteins.

17
The Basic Similarity Analysis Algorithm
  • Global Similarity
  • Scoring Schemes
  • Edit Graphs
  • Alignment Path in the Edit Graph
  • The Principle of Optimality
  • The Dynamic Programming Algorithm
  • The Traceback

18
Sequence Alignment
  • Input: two sequences over the same alphabet
  • Output: an alignment of the two sequences
  • Example:
      GCGCATTTGAGCGA
      TGCGTTAGGGTGACCA
  • A possible alignment:
      -GCGCATTTGAGCGA---
      TGCG--TTAGGGTGACCA

Each column is a match, a mismatch, or an indel.
19
Consider two sequences X = x1 x2 ... xn and Y = y1 y2 ... ym
whose letters xi and yj belong to the same set,
the alphabet
20
Scoring Schemes
Unit-score
      A  C  G  T  -
  A   1  0  0  0  0
  C   0  1  0  0  0
  G   0  0  1  0  0
  T   0  0  0  1  0
  -   0  0  0  0  0
21
ALIGNMENT
Aligning ACG with AGG:
A C G
A G G
A is aligned with A
C is aligned with G
G is aligned with G
Unit-score
Score = s(A,A) + s(C,G) + s(G,G) = 1 + 0 + 1 = 2
22
GAPS
- is the gap symbol
Sequences: ACATGGAAT and ACAGGAAAT
ACATGG-AAT
ACA-GGAAAT
OPTIMAL ALIGNMENTS
Sequences: AAAGGG and GGGAAA
---AAAGGG
GGGAAA---
23
s(x,y): the score for aligning x with y
s(x,-): the score for aligning x with -
s(-,y): the score for aligning - with y
24
Alignment
A-CG-G
ATCGTG
Score = s(A,A) + s(-,T) + s(C,C) + s(G,G) + s(-,T) + s(G,G)
THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED
SYMBOLS
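The rule that an alignment's score is the sum of its column scores can be sketched in a few lines of Python. The function names `unit_score` and `alignment_score` are assumptions made for this illustration; the unit score gives 1 to a pair of identical letters and 0 to everything else, gaps included:

```python
def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else 0

def alignment_score(row_x, row_y, score=unit_score):
    """Sum the scores of the aligned symbol pairs, column by column."""
    if len(row_x) != len(row_y):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(score(a, b) for a, b in zip(row_x, row_y))

print(alignment_score("ACG", "AGG"))        # 2
print(alignment_score("A-CG-G", "ATCGTG"))  # 4
```

Passing a different `score` function (for example a lookup into a substitution matrix with a gap penalty) changes the scheme without changing the summation.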
25
Scoring Scheme
Dayhoff score (excerpt)
       -   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
 -        -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8
 A    -8   3  -3   0   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0
 R    -8       6
 N    -8           4
 D    ...
PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout
somatotropin proteins
26
Scoring Functions
Mutations: substitutions, insertions, deletions
Scoring function: a sum of terms, one for each
pair of aligned residues and one for each gap
The meaning: the log of the relative likelihood that
the sequences are related, compared to being
unrelated
Identities and conservative substitutions are
positive terms
Non-conservative substitutions are
negative terms
27
The Edit Graph
Suppose that we want to align
AGT with AT
We are going to construct a graph
where alignments between the two
sequences correspond to paths between the Begin
and End nodes of the graph.
This is the Edit Graph
28
AGT has length 3; AT has length 2
[Figure: grid with columns 0 to 3 labeled by the sequence AGT and rows 0 to 2 labeled by the sequence AT]
The Edit graph has (3+1)(2+1) = 12 nodes
29
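[Figure: the edit graph grid. Columns 0, 1, 2, 3 are separated by the letters A, G, T; rows 0, 1, 2 are separated by the letters A, T. Begin is at the top left and End at the bottom right.]
AGT indexes the columns, and AT indexes the rows
of this table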
30
Begin
End
The Graph is directed. The nodes (i,j) will hold
values.
31
[Figure: the edit graph grid for AGT (columns 0 to 3) and AT (rows 0 to 2), from Begin to End]
32
Directed edges get as labels pairs of aligned
letters.
Begin
End
33
Alignment Path in the Edit Graph
[Figure: a path from Begin to End, spelling the alignment]
AGT
A-T
Every path from Begin to End corresponds to an
alignment
Every alignment corresponds to a path between
Begin and End
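The correspondence between paths and alignments can be checked by brute force on the small example. The sketch below is an illustration only (the generator name `alignments` is an assumption): it walks every Begin-to-End path of the edit graph recursively, taking a diagonal edge for a pair of letters and a horizontal or vertical edge for a letter paired with a gap.

```python
def alignments(x, y):
    """Yield every alignment of x and y as a pair of gapped strings.

    Each alignment is one Begin-to-End path in the edit graph: a diagonal
    edge pairs two letters, the other two edge types pair a letter with a gap.
    """
    if not x and not y:
        yield "", ""
        return
    if x and y:  # diagonal edge (x[0], y[0])
        for rest_x, rest_y in alignments(x[1:], y[1:]):
            yield x[0] + rest_x, y[0] + rest_y
    if x:        # edge (x[0], -)
        for rest_x, rest_y in alignments(x[1:], y):
            yield x[0] + rest_x, "-" + rest_y
    if y:        # edge (-, y[0])
        for rest_x, rest_y in alignments(x, y[1:]):
            yield "-" + rest_x, y[0] + rest_y

paths = list(alignments("AGT", "AT"))
print(len(paths))               # 25
print(("AGT", "A-T") in paths)  # True
```

The number of paths grows exponentially with the sequence lengths, which is why the optimal path is found by dynamic programming rather than by enumeration.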
34
The Principle of Optimality
The optimal answer to a problem is expressed in
terms of the optimal answers to its subproblems
35
Dynamic Programming
Given: two sequences X and Y
Find: an optimal alignment of X with Y
Part 1: compute first the optimal alignment score
Part 2: construct an optimal alignment
We are looking for the optimal alignment: the
maximal-score path in the Edit Graph from the
Begin vertex to the End vertex
36
The DP Matrix S(i,j)
S(1,0)
S(2,1)
37
The DP Matrix
Matrix S = S(i,j)
S(i,j): the score of the maximal-score path from
the Begin vertex to the vertex (i,j)
Optimal Path to (i,j)
The optimal path to (i,j) must pass through
one of the vertices (i-1,j-1), (i-1,j), (i,j-1)
38
Optimal path
Optimal path to (i-1,j), followed by the edge (xi, -)
Score: S(i-1,j) + s(xi, -)
39
Optimal path
Optimal path to (i-1,j-1), followed by the edge (xi, yj)
Score: S(i-1,j-1) + s(xi, yj)
40
Optimal path
Optimal path to (i,j-1), followed by the edge (-, yj)
Score: S(i,j-1) + s(-, yj)
41
The Basic ALGORITHM
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) + s(x_i, -) \\ S(i,j-1) + s(-, y_j) \end{cases}$$
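A minimal Python sketch of Part 1 (computing the optimal score) follows. It assumes that S[i][j] holds the best score for aligning the first i letters of x with the first j letters of y, and that the scoring function is passed in as `score`; the names `global_score` and `unit_score` are chosen for this illustration.

```python
def global_score(x, y, score):
    """Fill the DP matrix S; S[len(x)][len(y)] is the optimal global score."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + score(x[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + score("-", y[j - 1])
    # Interior cells: the maximum over the three incoming edges.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                S[i - 1][j] + score(x[i - 1], "-"),
                S[i][j - 1] + score("-", y[j - 1]),
            )
    return S

def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != "-" else 0

S = global_score("AGT", "AT", unit_score)
print(S[3][2])  # 2
```

The table has (n+1)(m+1) cells and each cell takes constant work, so both time and space are proportional to the product of the two lengths.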
42
OPTIMAL ALIGNMENT and TRACEBACK
        A  G  T
     0  0  0  0
  A  0  1  1  1
  T  0  1  1  2
Optimal Alignment
AGT
A-T
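Part 2, the traceback, can be sketched as follows. This is one possible implementation under stated assumptions: the matrix is filled exactly as in the basic recurrence, ties are broken in favor of the diagonal edge, and scores are integers so that the equality tests are exact. The names `align` and `unit_score` are illustrative.

```python
def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != "-" else 0

def align(x, y, score=unit_score):
    """Return (optimal score, gapped x, gapped y) for a global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + score(x[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + score("-", y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                S[i - 1][j] + score(x[i - 1], "-"),
                S[i][j - 1] + score("-", y[j - 1]),
            )
    # Traceback: walk from End to Begin along edges that explain each value.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + score(x[i - 1], "-"):
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(align("AGT", "AT"))  # (2, 'AGT', 'A-T')
```

When several optimal alignments exist, a different tie-breaking order in the traceback returns a different one with the same score.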
43
The Basic ALGORITHM Local Similarity
For local similarity we add the term 0 to the maximum:
$$S(i,j) = \max \begin{cases} 0 \\ S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) + s(x_i, -) \\ S(i,j-1) + s(-, y_j) \end{cases}$$
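A sketch of the local variant in Python. The extra 0 lets an alignment start anywhere, and the best cell anywhere in the matrix marks where it ends. The scoring function `simple_score` (match +1, mismatch -1, gap -1) is an assumption made for this example, since a local alignment needs negative terms to be meaningful; the function name `local_score` is also illustrative.

```python
def simple_score(a, b):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a gap."""
    return 1 if a == b and a != "-" else -1

def local_score(x, y, score=simple_score):
    """Return (best local score, (i, j) of the cell where it is reached)."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                0,  # start a fresh local alignment here
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                S[i - 1][j] + score(x[i - 1], "-"),
                S[i][j - 1] + score("-", y[j - 1]),
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell

print(local_score("TTACGTT", "GGACGGG"))  # (3, (5, 5))
```

In the example the shared segment ACG scores 3, and the flanking mismatches are simply left out of the alignment.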
44
General Scoring Schemes
Assumptions
1. Independence of mutations at different sites:
additive scoring scheme
2. Gaps of any length are considered one mutation
All of the efficient alignment algorithms
employing the dynamic programming method
are based fundamentally on the fact
that the scoring function is additive.
45
Substitution Matrices
Consider an ungapped alignment of two equal-length
sequences whose letters belong to the same alphabet
Compute the probability that the two sequences
are related
Compute the probability that the two sequences
are not related
Compute the ratio of the two probabilities
46
Random Model R
Every letter z occurs independently with
probability $q_z$
47
Match Model M
Aligned pairs of residues a, b occur with joint
probability $p_{ab}$
48
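s(a,b): the substitution matrix
Log-odds ratio, for an ungapped alignment of $x = x_1 \dots x_n$ with $y = y_1 \dots y_n$:
$$\log \frac{P(x,y \mid M)}{P(x,y \mid R)} = \sum_i s(x_i, y_i), \quad \text{where} \quad s(a,b) = \log \frac{p_{ab}}{q_a q_b}$$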
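A small Python sketch of the log-odds construction, assuming the random model gives each letter a probability q and the match model gives each aligned pair (a, b) a joint probability p. The two-letter alphabet and all probabilities below are invented purely for illustration, and the function names are assumptions.

```python
import math

def log_odds_matrix(p, q):
    """Return s with s[(a, b)] = log(p_ab / (q_a * q_b)) for every pair in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}

def log_odds_score(x, y, s):
    """Log-odds ratio of an ungapped alignment: the sum of s over its columns."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))

# Hypothetical two-letter alphabet with made-up probabilities.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

s = log_odds_matrix(p, q)
print(s[("A", "A")] > 0, s[("A", "B")] < 0)  # True True
print(log_odds_score("AAB", "ABB", s))
```

Pairs that are more likely under the match model than by chance get positive entries, and pairs that are less likely get negative ones, which is the behavior expected of identities and non-conservative substitutions.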
49
HEURISTICS: BLAST, FASTA, and SIM4
50
BLAST (Basic Local Alignment Search Tool)
  • A suite of sequence comparison algorithms,
    optimized for speed, used to search sequence
    databases for optimal local alignments to a
    protein or nucleotide query

Altschul, Gish, Miller, Myers, Lipman, Basic
Local Alignment Search Tool, J. Mol. Biol.
215(3):403-10 (1990). Altschul, Madden, Schaffer,
Zhang, Zhang, Miller, Lipman, Gapped BLAST and
PSI-BLAST: a new generation of protein database
search programs, NAR 25(17):3389-402 (1997) (and
references therein)
51
The BLAST algorithm
  • Detect all word hits (exact, or nearly identical
    matches) of a given length between the two
    sequences
  • k = 10 for nucleotide sequences (exact word
    matches)
  • k = 3 for protein sequences (nearly identical word
    matches)
  • Extend the word hits in both directions to
    high-scoring gap-free segment pairs (HSPs)
  • retain only HSPs that score above a threshold
  • start from the center of the HSP (original BLAST,
    1990), or from the center of a pair of HSPs
    located close to each other on the same diagonal
    (gapped BLAST, 1997)
  • Extend the HSPs in both directions allowing for
    gaps
  • use dynamic programming, and stop when the
    alignment score falls more than a threshold X
    below the best score yet seen
  • Report all statistically significant local
    alignments
  • E-value (starting with BLAST 2.0) is used to
    measure the statistical significance
  • E-value: the number of alignments with score
    equal to or higher than s that one would expect to
    find by chance when searching the database
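The first two steps (word hits, then gap-free extension that stops on a score drop) can be illustrated with a simplified Python sketch. This is not BLAST's implementation: it handles exact words only, uses a hypothetical +1/-1 match/mismatch score, and leaves out the gapped extension and the statistics. The function names and default parameters are assumptions.

```python
def word_hits(query, subject, k):
    """Return (i, j) start positions where query and subject share an exact k-word."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]

def _extend(query, subject, i, j, step, x_drop, match, mismatch):
    """Gap-free extension in one direction; stop once the running score
    falls more than x_drop below the best score seen so far."""
    score = best = best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        i += step
        j += step
    return best, best_len

def extend_hit(query, subject, i, j, x_drop=2, match=1, mismatch=-1):
    """Extend a hit at (i, j) in both directions along its diagonal.

    Returns (score, query_start, subject_start, length) of the segment pair.
    """
    right, right_len = _extend(query, subject, i, j, 1, x_drop, match, mismatch)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1, x_drop, match, mismatch)
    return right + left, i - left_len, j - left_len, left_len + right_len

query, subject = "GGACGTACCC", "TTACGTACAA"
print(word_hits(query, subject, 4))      # [(2, 2), (3, 3), (4, 4)]
print(extend_hit(query, subject, 2, 2))  # (6, 2, 2, 6)
```

All three hits lie on the same diagonal and extend to the same segment pair, which is why real implementations avoid re-extending hits already covered by an earlier extension.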

52
FASTA
  • A program for rapid alignment of pairs of protein
    and DNA sequences, building a local alignment
    from matching sequence patterns, or words
  • Algorithm for comparing a query to a database of
    sequences. For each database sequence:
  • Identify the 10 diagonal regions having the
    largest number of perfect word matches of a given
    length
  • word size k = 1, 2 for protein, and k = 6-10 for
    nucleotide searches
  • Re-score these regions using a given scoring
    matrix (e.g., PAM250), and trim them to form
    (gap-free) maximal scoring initial regions
  • Join (non-overlapping) initial regions from
    adjacent diagonals to generate longer regions,
    allowing for gaps
  • Re-score these based on the initial regions'
    scores, assessing a penalty for each joining
  • Align the query sequence to each of the sequences
    in the search set having the highest overall
    scores
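The first FASTA step, finding the diagonals richest in word matches, can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the FASTA program: it counts exact k-word matches per diagonal (the difference i - j of the two start positions) and returns the best ones. The function name `top_diagonals` and its parameters are hypothetical.

```python
from collections import Counter

def top_diagonals(query, subject, k, how_many=10):
    """Count exact k-word matches per diagonal (i - j) and return the
    how_many diagonals with the most matches as (diagonal, count) pairs."""
    positions = {}
    for j in range(len(subject) - k + 1):
        positions.setdefault(subject[j:j + k], []).append(j)
    hits = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            hits[i - j] += 1
    return hits.most_common(how_many)

print(top_diagonals("GGACGTAC", "ACGTAC", 2)[0])  # (2, 5)
```

In the example the subject matches the query shifted by two positions, so diagonal 2 collects five of the seven word matches; the later re-scoring and joining steps would start from such diagonals.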

Pearson and Lipman, Improved tools for
biological sequence comparison, Proc. Natl.
Acad. Sci. USA 85:2444-2448 (1988).
53
Sim4
  • Aligns an expressed DNA (EST, cDNA, mRNA)
    sequence with a genomic sequence for that gene,
    allowing for introns and sequencing errors

Florea, Hartzell, Zhang, Rubin, Miller, A
computer program for aligning expressed DNA and
genomic sequences, Genome Res 8(9):967-74 (1998)
54
Stages and algorithmic techniques
  • Detect basic homology blocks
  • Determine gap-free matches (HSPs) using a
    blast-like homology search
  • Detect all exact word matches of length k (e.g.,
    k = 12)
  • Extend the word hits in both directions, by
    substitutions, to gap-free high-scoring segment
    pairs (HSPs)
  • Retain only HSPs scoring above a threshold
  • Connect the HSPs to form larger blocks (exon
    cores) using sparse dynamic programming
  • Extend or trim the exon cores to eliminate gaps
    or overlaps in the cDNA sequence
  • Extend the similarity blocks using fast greedy
    sequence comparison algorithms
  • Detect new exon cores with the blast-like
    homology search tuned for higher sensitivity
  • Refine the introns
  • Predict the locations of splice junctions using a
    combined measure of the accuracy of alignment and
    the intensity of splice signals at the ends of
    each intron
  • Generate the spliced alignment
  • Align the sequences within individual exons using
    greedy alignment algorithms
  • Connect the chain of exon alignments by gaps
    (introns)
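The step that connects HSPs into exon cores can be illustrated with a plain quadratic chaining routine in Python. This is a stand-in chosen for brevity, not sim4's sparse dynamic programming, and it ignores intron and overlap handling. Each HSP is assumed to be a non-empty tuple (cdna_start, cdna_end, genome_start, genome_end, score) with exclusive end coordinates; all names are hypothetical.

```python
def best_chain(hsps):
    """Return (total score, chain) for the highest-scoring chain of HSPs.

    Consecutive HSPs in a chain must appear in the same order in both
    sequences and must not overlap in either of them.
    """
    if not hsps:
        return 0, []
    order = sorted(hsps)                 # by cDNA start position
    best = [h[4] for h in order]         # best chain score ending at each HSP
    prev = [None] * len(order)           # predecessor in that best chain
    for b in range(len(order)):
        for a in range(b):
            compatible = order[a][1] <= order[b][0] and order[a][3] <= order[b][2]
            if compatible and best[a] + order[b][4] > best[b]:
                best[b], prev[b] = best[a] + order[b][4], a
    end = max(range(len(order)), key=best.__getitem__)
    total, chain = best[end], []
    while end is not None:
        chain.append(order[end])
        end = prev[end]
    return total, chain[::-1]

# Hypothetical HSPs; the second one overlaps the first in the cDNA.
hsps = [(0, 10, 100, 110, 10), (12, 20, 500, 508, 8),
        (5, 15, 300, 310, 9), (22, 30, 900, 908, 8)]
total, chain = best_chain(hsps)
print(total)       # 26
print(len(chain))  # 3
```

The large jumps in genome coordinates between consecutive chained HSPs are where introns would be placed in the spliced alignment.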