Computational Genomics - PowerPoint PPT Presentation

About This Presentation

Title:

Computational Genomics

Slides: 34

Provided by: serafimb

Transcript and Presenter's Notes

Title: Computational Genomics

1
Computational Genomics
Lecture 1, Tuesday April 1, 2003
2
Biology in One Slide
3
High Throughput Biology

DNA Sequencing

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4
High Throughput Biology

Sequencing of expressed genes
(EST sequencing)

protein sequence
mRNA sequence
5
High Throughput Biology

3. Gene Expression Microarrays

6
High Throughput Biology

Gene Regulation
ChIP

7
The goals of genomics

Study organisms at the DNA level
Identify parts (genes, etc)
Figure out connections between parts
Study evolution at the DNA level
Compare organisms
Uncover evolutionary history

8
The role of CS in Biology

Essential
DNA sequencing and assembly
Microarray analysis
Protein 3D reconstruction
Complementary
Gene finding, genome annotation
Protein fold prediction
Phylogeny, comparative genomics

9
Syllabus

Tools
Alignment algorithms
Hidden Markov models
Statistical algorithms
Applications
DNA sequencing and assembly
Sequence analysis (comparison, annotation)
Microarray analysis
Evolutionary analysis

10
Course responsibilities

Homeworks: 80%
4 challenging problem sets, 4-5 problems/pset
Collaboration allowed
5 late days total
Televised students required to do 75%
Final: 20%
Takehome, 1 day
Collaboration not allowed
Easy!
Scribing
Mandatory
Grade replaces lowest 2 problems
Due one week after the lecture

11
Reading material

Books
Biological Sequence Analysis by Durbin, Eddy, Krogh, Mitchison
Chapters 1-4, 6, (7-8), (9-10)
Algorithms on Strings, Trees, and Sequences by Gusfield
Chapters (5-7), 11-12, (13), 14, (17)
Papers
Lecture notes

12
Topic 1. Sequence Alignment
13
Complete genomes
14
Evolution
15
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGCAGTCCACCA
SEQUENCE EDITS
REARRANGEMENTS
16
Evolutionary Rates

[Figure: changes passed to the next generation, labeled OK, OK, OK, X, X, Still OK?]

17
Sequence conservation implies function

Interleukin region in human and mouse

18
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$ and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
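The definition can be checked mechanically. The sketch below is only an illustration: the helper name `is_alignment_of` and the use of `-` as the gap character are assumptions, not part of the lecture. It tests that two gapped rows have equal length, never pair a gap with a gap, and reduce to the original strings once the gaps are removed.

```python
def is_alignment_of(row_x, row_y, x, y, gap="-"):
    """Return True if the gapped rows (row_x, row_y) form an alignment of x and y."""
    # Both rows must have the same number of columns.
    if len(row_x) != len(row_y):
        return False
    # A column that pairs a gap with a gap lines up nothing.
    if any(a == gap and b == gap for a, b in zip(row_x, row_y)):
        return False
    # Removing the gaps must give back the original strings.
    return row_x.replace(gap, "") == x and row_y.replace(gap, "") == y


x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(is_alignment_of(row_x, row_y, x, y))
```

The strings are the ones shown on this slide; the check says nothing about whether the alignment is a good one, only that it is a valid one.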
19
What is a good alignment?

Alignment
The best way to match the letters of one
sequence with those of the other
How do we define best?
Alignment
A hypothesis that the two sequences come from a
common ancestor through sequence edits
Parsimonious explanation
Find the minimum number of edits that transform
one sequence into the other

20
Scoring Function

Sequence edits
AGGCCTC
Mutations
AGGACTC
Insertions
AGGGCCTC
Deletions
AGG.CTC
Scoring Function
Match: $m$
Mismatch: $-s$
Gap: $-d$
Score: $F = (\#\text{matches}) \times m - (\#\text{mismatches}) \times s - (\#\text{gaps}) \times d$
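As a minimal sketch of this scoring function, the code below scores two already-aligned rows column by column. The function name `alignment_score`, the `-` gap character, and the default values m = s = d = 1 are assumptions chosen for illustration.

```python
def alignment_score(row_x, row_y, m=1, s=1, d=1, gap="-"):
    """Score two gapped rows: +m per match, -s per mismatch, -d per gap column."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_x, row_y):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair a gap with a gap")
        if a == gap or b == gap:
            score -= d          # gap column
        elif a == b:
            score += m          # match
        else:
            score -= s          # mismatch
    return score


print(alignment_score("AGGCCTC", "AGGACTC"))  # 6 matches, 1 mismatch
print(alignment_score("AGGCCTC", "AGG-CTC"))  # 6 matches, 1 gap
```

With the assumed unit costs, both example edits from the slide score 6 - 1 = 5.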

21
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: $O(2^{M+N})$
22
Alignment is additive

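Observation
The score of aligning $x_1 \ldots x_M$ with $y_1 \ldots y_N$ is additive.
Say that $x_1 \ldots x_i \;\; x_{i+1} \ldots x_M$
aligns to $y_1 \ldots y_j \;\; y_{j+1} \ldots y_N$
The two scores add up:
$F(x_{1..M}, y_{1..N}) = F(x_{1..i}, y_{1..j}) + F(x_{i+1..M}, y_{j+1..N})$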
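Additivity is easy to confirm on a small case. The sketch below assumes unit costs (m = s = d = 1) and a hypothetical helper `score`; it cuts one aligned pair of rows at every column and checks that the scores of the two pieces sum to the score of the whole.

```python
def score(row_x, row_y, m=1, s=1, d=1, gap="-"):
    """Column-by-column score of two gapped rows of equal length."""
    total = 0
    for a, b in zip(row_x, row_y):
        if gap in (a, b):
            total -= d
        elif a == b:
            total += m
        else:
            total -= s
    return total


row_x, row_y = "AGTA", "A-TA"
whole = score(row_x, row_y)
for k in range(len(row_x) + 1):
    # Cut both rows after column k and score the two pieces separately.
    left = score(row_x[:k], row_y[:k])
    right = score(row_x[k:], row_y[k:])
    assert left + right == whole
print(whole)
```

The property holds because every column contributes to the total independently of its neighbors, which is what makes a dynamic programming solution possible.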

23
Dynamic Programming

We will now describe a dynamic programming
algorithm
Suppose we wish to align
$x_1 \ldots x_M$
$y_1 \ldots y_N$
Let
$F(i,j)$ = optimal score of aligning
$x_1 \ldots x_i$
$y_1 \ldots y_j$

24
Dynamic Programming (contd)

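Notice three possible cases:
1. $x_i$ aligns to $y_j$
$x_1 \ldots x_{i-1} \;\; x_i$
$y_1 \ldots y_{j-1} \;\; y_j$
$F(i,j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i,j) = F(i-1, j-1) - s$ if not
2. $x_i$ aligns to a gap
$x_1 \ldots x_{i-1} \;\; x_i$
$y_1 \ldots y_j \;\; -$
$F(i,j) = F(i-1, j) - d$
3. $y_j$ aligns to a gap
$x_1 \ldots x_i \;\; -$
$y_1 \ldots y_{j-1} \;\; y_j$
$F(i,j) = F(i, j-1) - d$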
25
Dynamic Programming (contd)

How do we know which case is correct?
Inductive assumption:
$F(i, j-1)$, $F(i-1, j)$, $F(i-1, j-1)$ are optimal
Then,
$F(i, j) = \max \{\, F(i-1, j-1) + s(x_i, y_j),\;\; F(i-1, j) - d,\;\; F(i, j-1) - d \,\}$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not
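The recurrence can be turned into a table-filling routine along these lines. This is a sketch under stated assumptions: the function name `nw_matrix` is made up for illustration, the costs default to m = s = d = 1, and the boundary values F(i,0) = -i·d and F(0,j) = -j·d are an assumption (leading letters aligned to gaps), since the slides up to this point do not state the initialization.

```python
def nw_matrix(x, y, m=1, s=1, d=1):
    """Fill F so that F[i][j] is the best score of aligning x[:i] with y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned against nothing costs one gap per letter.
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s   # s(x_i, y_j)
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # case 1: x_i aligns to y_j
                F[i - 1][j] - d,        # case 2: x_i aligns to a gap
                F[i][j - 1] - d,        # case 3: y_j aligns to a gap
            )
    return F


F = nw_matrix("AGTA", "ATA")
# Print one row per j, with i running across, as a table.
for j in range(len("ATA") + 1):
    print([F[i][j] for i in range(len("AGTA") + 1)])
```

Each cell is computed once from three neighbors, so filling the table takes time and space proportional to M × N.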

26
Example

x = AGTA    m = 1
y = ATA     s = 1 (a mismatch scores -1)
            d = 1 (a gap scores -1)

F(i,j)      i = 0    1    2    3    4
                     A    G    T    A
j = 0           0   -1   -2   -3   -4
j = 1   A      -1    1    0   -1   -2
j = 2   T      -2    0    0    1    0
j = 3   A      -3   -1   -1    0    2

Optimal alignment: F(4,3) = 2
AGTA
A-TA
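One way to recover the alignment itself, and not only its score, is to walk back from F(M,N) and at each cell pick a case that explains its value. The sketch below is an assumption-laden illustration: the function name `needleman_wunsch`, the tie-breaking order (diagonal first, then a gap in y, then a gap in x), and the boundary values F(i,0) = -i·d, F(0,j) = -j·d are choices made here, with mismatch and gap each scoring -1 as the table values imply.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (best score, gapped x row, gapped y row) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback: from (M, N) to (0, 0), collecting columns in reverse.
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = (m if x[i - 1] == y[j - 1] else -s) if i > 0 and j > 0 else None
        if sub is not None and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


score, row_x, row_y = needleman_wunsch("AGTA", "ATA")
print(score)
print(row_x)
print(row_y)
```

For x = AGTA and y = ATA this yields score 2 with rows AGTA and A-TA. When several alignments tie for the best score, a different tie-breaking order would return a different but equally good alignment.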
27
The Needleman-Wunsch Matrix
$x_1 \ldots x_M$ (across)
$y_1 \ldots y_N$ (down)
Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences
Can think of it as a divide-and-conquer algorithm
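The correspondence between paths and alignments can be made concrete. In the sketch below, a path is a list of (i, j) cells in which every step increases i, j, or both by one; the function name `path_to_alignment` and this encoding of a path are assumptions made for illustration.

```python
def path_to_alignment(x, y, path):
    """Convert a path of (i, j) cells from (0, 0) to (len(x), len(y)) into two gapped rows."""
    if not path or path[0] != (0, 0) or path[-1] != (len(x), len(y)):
        raise ValueError("path must run from (0, 0) to (M, N)")
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):        # diagonal: x letter aligned to y letter
            top.append(x[i0]); bottom.append(y[j0])
        elif step == (1, 0):      # advance in x only: x letter aligned to a gap
            top.append(x[i0]); bottom.append("-")
        elif step == (0, 1):      # advance in y only: y letter aligned to a gap
            top.append("-"); bottom.append(y[j0])
        else:
            raise ValueError("each step must increase i, j, or both by exactly one")
    return "".join(top), "".join(bottom)


path = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]
print(path_to_alignment("AGTA", "ATA", path))
```

Each step of the path contributes exactly one column of the alignment, so the path above produces the rows AGTA and A-TA.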
28
The Needleman-Wunsch Algorithm