Classical Statistics and Scoring of Alignments - PowerPoint PPT Presentation

1 / 44

About This Presentation

Title:

Classical Statistics and Scoring of Alignments

Description:

Classical Statistics and Scoring of Alignments. Consider a probe of length l and ... In-class exercise I ... In-class exercise. W: AB. M : QLABCD ... – PowerPoint PPT presentation

Number of Views:78

Avg rating:3.0/5.0

Slides: 45

Provided by: susan594

Category:

more less

Transcript and Presenter's Notes

Title: Classical Statistics and Scoring of Alignments

1
Classical Statistics and Scoring of Alignments
2
Consider a probe of length l and a database of
total length m. How many subsequences of length n
are there in l? l-n1 How many subsequences of
length n are there in m? m-n1 SO THERE ARE
EXACTLY (l-m1) (m-n1) potential matches of
length m between the probe and the database.
3
Probability Coin Toss

There are only two outcomes possible for a coin
toss heads and tails.
For a fair coin, the probability of getting heads
on any one toss is 0.5 p 0.5
The probability of getting tails on any one toss
is 0.5 q (1-p) 0.5

If you do lots of tosses, you expect heads and
tails approximately half the time the
probability for one toss has meaning over many
tosses, too.
But if you do a finite number of tosses, you have
a small chance of getting exactly half heads and
and half tails most of the time youll get
almost half and half

If you do many sets of tosses, you can find out
empirically what your chances are of getting say
4 heads out of 10 tosses P(k4 heads out of N10
tosses)
Or you can figure it out mathematically by
thinking about how many ways you could toss a
coin 10 times and get 4 heads combinations
formula

6
Ways to get heads

ways to get dif s heads
1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1

trials 1 2 3 4 5 6
7
Binomial theorem

The combinations in the triangle come from the
binomial theorem since we have only two
outcomes, you can write the probability of all
outcomes as (p q)n, where n the number of
trials
(pq)n pn n! p(n-1) q n! p(n-2) q2..qn
i!j! i!j!
where i the number of heads and j the number
of tails
E.g., whats the probability of 4 out of 6
heads? n 6, i 4, j 2 6!/4!2!(0.5)4(0.5)2

You can make a plot of all the possible
probability scores against the number of heads
you get

Notice that as n increases, the binomial
distribution approaches a bell-shaped curve the
gaussian distribution
The gaussian distribution has the familiar mean
x and standard deviation s for common
statistical analyses t-test, etc.

10
Gaussian Distribution
11
What about statistics for alignments?

What we want is a way to find out the
significance of the score we assign to alignments
either the similarity search or the alignments
that we will be talking about next week
We need the probability of finding a particular
score

12
Probability of finding a score simplified
example

Lets consider sequences with an alphabet of just
2 letters, A and B
Then p probability of a match to A, and q
probability of match to B (or no match to A)
this is just like tossing a coin and p q
For random sequences, we expect a total score of
½ N (N length of sequences)

13
In-class exercise I

Assume identity scoring, a gapless alignment and
a sequence alphabet of A and B. What is the
probability of two random sequences of length 4
being identical?
What is the probability of 3 out of 4 characters
identical?

14
Probability of finding a score almost realistic

So what if we let there be 4 characters (as in
the nucleic acid case)? Well keep identity
scoring and no gaps
Now p(match) 0.25, and q(no match) 0.75
Since theres still only 2 outcomes we can use
the binomial theorem to calculate the
distribution of scores just as before (this is
like a loaded coin)

15
In-class exercise II

Assume an alphabet of 4 letters A,G,C,T a
gapless alignment, and identity scoring. What is
the probability of two random sequences of length
4 being identical?
What is the probability of 3 out of 4 characters
identical?

16
Distribution for 4-character, identity scoring

You can convince yourself that the probability
distributions for scores of alignments using a
4-character alphabet are not gaussian
This means you cant use t-tests, etc., to test
significance of alignments

17
Alignment statistics

No theory for global alignments
Can use randomized sets
Statistics for ungapped local alignments
well-understood

18
FASTA statistics

FASTA generates a random set of proteins
sequences and reports the frequency of init1s
(modified hits) against the randomized set
The tail of an actual search distribution usually
contains more hits than are expected from the
randomized set some of those extra hits will
represent homologous proteins

Some of the other hits in the areas of higher
probability of random hits will also represent
homologous proteins

20
Karlin-Altschul statistics

BLAST initially generates ungapped alignments
Problem How many hits of score S are generated
using a probe of length m vs. a database of total
sequence length n?
Answer for high scores the number of hits is
determined by the extreme value distribution
EKmn exp(-lS)

So as the probe length and the data base size go
up, the expected number of hits go up linearly.
As the cutoff score is raised, the number
decreases exponentially
K and l are parameters which depend on factors
such as the scoring matrix used

22
Statistics of alignment

Consider a probe of length l and a database of
total length n. How many subsequences of length m
are there in l?
l-m1
How many subsequences of length m are there in n?
n-m1
SO THERE ARE EXACTLY (l-m1) (n-m1) potential
matches of length m between the probe and the
database.

The number of matches expected is just the
product of the number of potential matches and
the probability of a pair matching
Expectation value (l-m1) (n-m1) pm
where p is the probability of a match at each
position if each outcome (letter in the
alphabet) is of equal probability, p is the
inverse of the number of outcomes (letters).

Using gapless alignments and identity scoring,
the score of a particular comparison between a
probe and a database is just the number of
matches in the region considered.
Suppose we extend any hits in both directions
until we reach a mismatch the score obtained
will be just the number of consecutive hits m.
We seek the number of random hits expected of
score at least SM. The expectation value is just
(l-M1) (n-M1) pM
for DNA (four bases) and large l and n, this is
n l 4-S

25
Statistics of alignment

Now suppose we modify our T value and extension
rules so that to get score S we need to get only
i out of m matches, with j misses allowed. The
probability of i hits out of m is just
m! pi qj
i!j!

We can find the expectation value by multiplying
by the number of potential m base comparisons.
for DNA (four bases) and large l and n, this is
n l m! pi qj
i!j!

The expectation value for the number of hits of
score S in a database of length n with a probe of
length l is
n l m! pi qj
i!j!
if it takes i hits out of ij elements to reach
S.
Using identity scoring, Si and for qgtgtp and
small j
this is close to
n l m! pi
i!j!

The expression
n l m! pi
i!j!
or n l m! (1/p)-S
i!j!
is a reasonable approximation for DNA sequences
only for j0 or 1, but for amino acid sequences q
is large enough to tolerate several mismatches.

More generally,
n l m! (1/p)-S
i!j!
is equivalent to K n l (1/p )-S , where K
contains both the binomial coefficients and the
terms in q.
For non-identity scoring, we have
K n l exp (-lS)
which is the general form of the extreme value
distribution l is a scale factor for the score
system.

30
Algorithms

An algorithm is an exact set of instructions for
carrying out a task using a limited menu of
commands
Algorithms are most often associated with
computer programs, but they can also be used by
humans to solve problems
Small number of operations defined in the
programming language compared to, say, words or
phrases in English

31
Algorithms in minimal C

Instructions
main () means program is beginning
Statements have operators
, , -, , (increment), (modulus), ()
(separating terms)
Statements are not algebra!
x x 1

Statements end in
Statements are about variables
Variables start with characters (i, x, b5)
Variables need to be declared so the program can
use them
float x (float means any real number) int I (int
means an integer) char b (char means character)
Arrays are a list of data that can be associated
with a variable arrays run from 0 to n-1 where n
is the number in brackets
float x int a50 char b500 float x
54

Variables need to be initialized (given a
starting value) if they are going to be
incremented
Loops and conditions
For (conditions) statements means do
statement operations while conditions are met
If (conditions) statements else statements
means do first statements while conditions are
met, and when conditions are not met do second
statements this is a one-time thing
While (conditions) statements means keep
doing statements while conditions are met, and as
soon as they are not met, stop this is
continuous while conditions are met
continue go to next iteration of current loop
skipping intervening statements (curly brackets)
break means get out of the current loop

Loops have curly brackets around them
Loops can be nested
Loops are operations that continue until
something changes
Arithmetic sqrt (x), exp (x), log (x), etc.
Logical operators (equals), ! (not equal),
gt, lt, lt, gt, (boolean AND), (boolean OR)
Real programs need an input and an output
function input reads data into the program, and
output records the results of the operations in
the program. You will not need to worry about
the input and output parts of the algorithms, we
will provide them to you.

Simple algorithm to find the average of N terms
x(n)
Main() starts program
Float s, av, x declares variables
int i, N declares variables
Input (x, N) gets data
s 0 initializes s (value 0)
for (i 0 iltN i) loop and statement
s s xi
av s/N arithmetic statement
Output (av) average reported
end program

How a simple loop works
for (i 0 iltN i)
s s xi
Initially, we had set S0, so the first time
through the loop i0 and S is set equal to X(0).
Then we increment i by one, so I 1, and s
x(0) x(1). This process continues until i N,
at which point s x(0) x(1) x(N-1) in
other words, s is the sum of all the elements of
the array x, which is the input.

37
In-class exercise

Use the following data and the algorithm for the
average of N terms to find the average 3,5,8,9
Main()
Float s, av, x
int i, N
Input (x, N)
s 0
for (i 0 iltN i)
s s xi
av s/N
Output (av)

38
Nested Loops

Often we need to cover a space defined by
multiple variables. Suppose we have a matrix with
elements X(i,j)
X(1,1) X(1,2) X(1,3) X (1,n)
X(2,1) X(2,2) X(2,3) .X (2,n)
.
.
X(m,1)

To write the elements of X(i,j), we can use two
nested loops, one indexed to i and the other to
j
for (i0, iltm , i)
for (j0, jltn, j)
output x(i,j)

Outside loop
Inside loop
40
Another algorithm simple sequence similarity
search

Probe sequence w of length J elements w(j), j
1,2, J
Library sequence m of length I elements m(i), i
1,2, I
Problem find all exact matches of for the whole
length of w in m

41
Search algorithm

Main ()
int i, j, I, J char w , m
Input (w, m, I, J)
i 0, j 0
for (i0 iltI i )
for (j0 jltJ j )
if (w j ! m i j ) break
if (jJ) output i

Outer loop
Inner loop
42
Search algorithm

How it works outside loop in i controls the
start of the sequence comparison in m i is the
of the sequence element in m compared to the
first element in w.
inside loop in j increments the index of both w
and m, allowing the comparison of successive
elements for each starting point fixed by i.

43
Sliding Box Model

i controls the region of m selected for
comparison

W1 W2... M1 M2 -----------------------------------
---------MI
Above box represents the position of w on m for
i1, so that W1 gets compared to M1, W2 to M2,
etc. as j increases
W1 W2...
M1 M2 -----------------------MI
Below box represents the position of w on m for
i2, the second time through the outer loop. As j
increases W1 is compared to M2, W2 to M3, etc.
44
In-class exercise