CS 262 Discussion Section 1

Slides: 52
Provided by: cs62

Transcript and Presenter's Notes

1
CS 262 Discussion Section 1
2
Purpose of discussion sections
  • To clarify difficulties/ambiguities in the
    problem set questions and lecture material.
  • To supplement class material by going somewhat
    into the biological concepts and motivations
    underlying this field.
  • To discuss more algorithms from a topic, wherever
    needed.

3
Antiparallel vs Parallel strands
4
The DNA strand has a chemical polarity
5
The members of each base pair can fit together
within the double helix only if the two strands
of the helix are antiparallel
6
Prokaryotes do not have a nucleus; eukaryotes do
7
Eukaryotic DNA is packaged into chromosomes
  • A chromosome is a single, enormously long, linear
    DNA molecule associated with proteins that fold
    and pack the fine thread of DNA into a more
    compact structure.
  • Human genome: about 3.2 × 10^9 base pairs
    (haploid); a diploid human cell carries 46
    chromosomes.

14
A display of the full set of 46 chromosomes
15
Sequence similarity
16
Biological motivation
  • Sequence similarity is useful in hypothesizing
    the function of a new sequence, assuming that
    sequence similarity implies structural and
    functional similarity.

[Diagram: a new sequence is sent as a query to a sequence database; the response is a list of similar matches.]
17
Case Study: Multiple Sclerosis
  • Multiple sclerosis is an autoimmune dysfunction
    in which the T-cells of the immune system start
    attacking the body's own nerve cells.
  • The T-cells recognize the myelin sheath protein
    of neurons as foreign.
  • Show movie

18
Why does this happen?
  • A hypothesis
  • Possibly, the myelin sheath proteins identified
    by the T-cells were similar to bacterial/viral
    sheath proteins from an earlier infection.
  • How to test this hypothesis?
  • Use sequence alignment.

Identification of cause of immune dysfunction
[Diagram: lab tests; myelin sheath proteins are sent as a query to a sequence database, and the response is a list of similar bacterial/viral sequences.]
19
Dynamic Programming
  • It is a way of solving problems (involving
    recurrence relations) by storing partial results.
  • Consider the Fibonacci series:
  • F(n) = F(n-1) + F(n-2)
  • F(0) = 0, F(1) = 1
  • A naive recursive algorithm takes exponential time
    to find F(n).
  • A dynamic programming solution takes only n steps
    (linear time).
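
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration; the function names `fib_recursive` and `fib_dp` are made up for this example and do not come from the slides.

```python
def fib_recursive(n: int) -> int:
    """Naive recursion: the same subproblems are recomputed many times,
    so the running time grows exponentially with n."""
    if n < 2:
        return n  # F(0) = 0, F(1) = 1
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_dp(n: int) -> int:
    """Bottom-up dynamic programming: each F(k) is computed once from the
    two stored partial results, so about n steps are needed."""
    if n < 2:
        return n
    prev, curr = 0, 1  # F(0), F(1)
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr
    return curr
```

Both functions return the same values; only the amount of repeated work differs.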

20
Needleman-Wunsch algorithm
  • F(i,j) = maximum of
  • F(i-1, j-1) + s(xi, yj)
  • F(i-1, j) - d
  • F(i, j-1) - d

[Diagram: cell F(i,j) is reached from F(i-1,j-1) with score s(xi,yj), and from F(i-1,j) or F(i,j-1) with gap penalty -d.]
Assume that match = 1, mismatch = 0, indel = 0 (so d = 0).
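
As a rough sketch of how this recurrence fills the matrix, the Python function below builds F row by row. The name `nw_matrix` and its parameters are assumptions made for illustration, and the substitution score s is reduced to a simple match/mismatch rule with the default values assumed on this slide.

```python
def nw_matrix(x, y, match=1, mismatch=0, d=0):
    """Fill the Needleman-Wunsch matrix F, where F[i][j] is the best score
    for aligning x[:i] with y[:j]. Rows follow x, columns follow y."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary cells: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align x_i with y_j
                F[i - 1][j] - d,      # x_i against a gap
                F[i][j - 1] - d,      # y_j against a gap
            )
    return F
```

With match = 1 and mismatch = indel = 0, the value in the last cell equals the length of the longest common subsequence of the two strings.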
21
Needleman-Wunsch example
       G  A  A  T  T  C  A  G  T  T  A
    0  0  0  0  0  0  0  0  0  0  0  0
 G  0
 G  0
 A  0
 T  0
 C  0
 G  0
 A  0
22
Needleman-Wunsch example
       G  A  A  T  T  C  A  G  T  T  A
    0  0  0  0  0  0  0  0  0  0  0  0
 G  0  1
 G  0
 A  0
 T  0
 C  0
 G  0
 A  0
23
Needleman-Wunsch example
       G  A  A  T  T  C  A  G  T  T  A
    0  0  0  0  0  0  0  0  0  0  0  0
 G  0  1  1  1  1  1  1  1  1  1  1  1
 G  0
 A  0
 T  0
 C  0
 G  0
 A  0
24
Needleman-Wunsch example
       G  A  A  T  T  C  A  G  T  T  A
    0  0  0  0  0  0  0  0  0  0  0  0
 G  0  1  1  1  1  1  1  1  1  1  1  1
 G  0  1  1  1  1  1  1  1  2  2  2  2
 A  0  1  2  2  2  2  2  2  2  2  2  3
 T  0  1  2  2  3  3  3  3  3  3  3  3
 C  0  1  2  2  3  3  4  4  4  4  4  4
 G  0  1  2  2  3  3  4  4  5  5  5  5
 A  0  1  2  3  3  3  4  5  5  5  5  6
25
Traceback
       G  A  A  T  T  C  A  G  T  T  A
    0  0  0  0  0  0  0  0  0  0  0  0
 G  0  1  1  1  1  1  1  1  1  1  1  1
 G  0  1  1  1  1  1  1  1  2  2  2  2
 A  0  1  2  2  2  2  2  2  2  2  2  3
 T  0  1  2  2  3  3  3  3  3  3  3  3
 C  0  1  2  2  3  3  4  4  4  4  4  4
 G  0  1  2  2  3  3  4  4  5  5  5  5
 A  0  1  2  3  3  3  4  5  5  5  5  6
38
The solution
  • An optimal alignment has a score of 6.

G _ A A T T C A G T T A
G G A _ T _ C _ G _ _ A
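
The fill and traceback steps can be put together in a short Python sketch. The function name `needleman_wunsch`, the `_` gap character, and the tie-breaking order (diagonal first, then vertical, then horizontal) are assumptions of this example. Several alignments can share the optimal score, so the alignment returned may differ from the one shown on the slide while still scoring 6.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, d=0):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback: walk from F(n, m) back to F(0, 0), each time moving to a
    # predecessor cell that explains the current value.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])  # x_i against a gap
            ay.append("_")
            i -= 1
        else:
            ax.append("_")       # y_j against a gap
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `needleman_wunsch("GAATTCAGTTA", "GGATCGA")` returns the score together with two gapped strings of equal length.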
39
Linear Space Alignment
  • Serafim talked about the Myers-Miller algorithm
    in class.
  • Another variant of the Hirschberg algorithm is
    given in Durbin (p. 35).

40
  • Suppose we know that characters Xi and Yj are
    aligned to each other in the optimal alignment of
    X1..n and Y1..m.
  • How can we compute the alignment using this
    information?
  • We can partition the alignment into two parts:
    align X1..i-1 with Y1..j-1 and Xi+1..n with
    Yj+1..m separately.
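
A small Python sketch of this partition is given below. It checks only scores, not the alignment itself; the helper names `nw_score` and `score_through_pair` are hypothetical, and the match/mismatch/indel defaults follow the values assumed earlier in this section.

```python
def nw_score(x, y, match=1, mismatch=0, d=0):
    """Global alignment score of x and y, keeping only two rows of the matrix."""
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-d * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] - d, curr[j - 1] - d))
        prev = curr
    return prev[len(y)]


def score_through_pair(x, y, i, j, match=1, mismatch=0, d=0):
    """Best score among alignments that align x_i with y_j (1-based i and j):
    the two sides of the aligned pair are solved independently."""
    s = match if x[i - 1] == y[j - 1] else mismatch
    left = nw_score(x[: i - 1], y[: j - 1], match, mismatch, d)   # X1..i-1 vs Y1..j-1
    right = nw_score(x[i:], y[j:], match, mismatch, d)            # Xi+1..n vs Yj+1..m
    return left + s + right
```

If the pair (i, j) really lies on an optimal alignment, `score_through_pair` gives the same value as `nw_score` on the full strings. For the example sequences, the C of GAATTCAGTTA (position 6) and the C of GGATCGA (position 5) form such a pair.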

41
Middle column
[Figure: the middle column of the dynamic programming matrix.]
43
Middle column
[Figure: the middle column of the matrix, with a cell labeled F(i,j).]
47
Middle column
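[Figure: the middle column of the matrix, with a cell labeled F(i,j) and the cell where its traceback leaves the middle column.]
This is the cell in the middle column from which
the traceback leaves the column.
Maintain the coordinates of that cell along with the
value of F(i,j). Call it c(i,j).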
48
  • For every cell (i,j) in the right half of the matrix:
  • Maintain the F(i,j) value.
  • Maintain the coordinates of the cell in the
    middle column from where its traceback path
    leaves the middle column. Call it c(i,j).
  • Maintain the direction of that jump as given by
    the pointer (either horizontal or diagonal). Call
    it P(i,j).

49
  • If (i',j') is the cell preceding (i,j), from
    which F(i,j) is derived, then
  • c(i,j) = c(i',j') and P(i,j) = P(i',j')
  • We need only linear space to compute the F, c and
    P values as we proceed across the matrix.
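
One way this bookkeeping might look in Python is sketched below. It is an illustration under several assumptions, not the exact procedure from the slides or from Durbin: rows follow x and columns follow y, the middle column is u = m // 2, ties are broken diagonal first, and the names `middle_crossing`, `"diag"`, `"horiz"` and `"vert"` are made up for this example. Only the previous column of F, c and P is kept, so space is linear in the length of x.

```python
def middle_crossing(x, y, match=1, mismatch=0, d=0):
    """Return (score, c, P) for cell (n, m) using one stored column.

    c is the (i, j) cell of the middle column u = m // 2 from which the
    optimal path steps into column u + 1; P is the direction of that step.
    """
    n, m = len(x), len(y)
    if m == 0:
        raise ValueError("y must be non-empty so that a column follows the middle one")
    u = m // 2
    F_prev = [-d * i for i in range(n + 1)]   # column j = 0
    c_prev = [None] * (n + 1)                 # undefined up to the middle column
    P_prev = [None] * (n + 1)
    for j in range(1, m + 1):
        # Cell (0, j) can only be reached horizontally from (0, j - 1).
        F_curr = [F_prev[0] - d]
        c_curr = [(0, u) if j == u + 1 else c_prev[0]]
        P_curr = ["horiz" if j == u + 1 else P_prev[0]]
        for i in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F_prev[i - 1] + s, "diag"),
                (F_prev[i] - d, "horiz"),
                (F_curr[i - 1] - d, "vert"),
            ]
            best, move = max(candidates, key=lambda t: t[0])
            F_curr.append(best)
            if move == "vert":
                # Predecessor (i - 1, j) is in this column: copy its c and P.
                c_curr.append(c_curr[i - 1])
                P_curr.append(P_curr[i - 1])
            else:
                pi = i - 1 if move == "diag" else i  # row of the predecessor
                if j == u + 1:
                    # This step leaves the middle column: record where and how.
                    c_curr.append((pi, u))
                    P_curr.append(move)
                else:
                    c_curr.append(c_prev[pi])
                    P_curr.append(P_prev[pi])
        F_prev, c_prev, P_prev = F_curr, c_curr, P_curr
    return F_prev[n], c_prev[n], P_prev[n]
```

The returned c and P identify one point of an optimal path, so the prefix of x and y before that point and the suffix after it can be aligned as two smaller, independent problems.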

50
Middle column
[Figure: two cells to the right of the middle column, labeled F(i',j'), c(i',j') and F(i,j), c(i,j).]
We know the traceback from (i',j') leaves the
middle column at the cell c(i',j').
Hence, the traceback from (i,j) will also
have the same value: c(i,j) = c(i',j').
We are interested in the value of c(n,m).
51
  • We use the c(n,m) and P(n,m) values to split the
    dynamic programming matrix into two parts.
  • How?
  • Because we know one aligned pair of letters in
    the optimal alignment now.