Title: Introduction to Bioinformatics 20120
1Introduction to Bioinformatics20120
- Gianluca Pollastri
- office CS A1.07
- email gianluca.pollastri_at_ucd.ie
2Credits
- Richard Lathrop's and Pierre Baldi's Bioinformatics
courses at the University of California, Irvine.
3Course overview
- Context: DNA, RNA, proteins.
- Resources: GenBank, PDB, etc.
- Algorithms for sequence comparison.
- Phylogenetics.
- Structural bioinformatics: protein structure
prediction.
4Lecture notes
- http://gruyere.ucd.ie/2007_courses/20120/
- confidential..
5Recommended/useful readings
- No book is actually required
- Introduction to Bioinformatics
- Lesk
- Introduction to Computational Molecular Biology
- Setubal, Meidanis
- Bioinformatics: The Machine Learning Approach
- Baldi, Brunak
6- CS 20120, Introduction to Bioinformatics
- Assignment 1, 29 January 2007
- 10% of the overall mark
- To hand in by midnight of February 12
- 1. Identify your favourite pet.
- 2. Get the protein sequence for one of its genes on
- a. http://www.ncbi.nlm.nih.gov/entrez/
- 3. BLAST your sequence against UniProt at
- a. http://www.ebi.ac.uk/blast2/index.html?UniProt
- 4. If you get fewer than 6 results from 6
different organisms, go back to 2 and choose
another protein.
- 5. Select 6 sequences returned by BLAST, from 6
different organisms (ticking the appropriate
boxes and downloading them in FASTA format will
give you the right input format for the next
step).
- 6. Run ClustalW on them using the page (be
patient, it might take time)
- a. http://www.ebi.ac.uk/clustalw/index.html
- 7. Draw a phylogenetic tree for your guide tree
(.dnd) using an online viewer, e.g.
- a. http://bioweb.pasteur.fr/seqanal/interfaces/drawtree.html
- 8. Email me (gianluca.pollastri_at_ucd.ie)
- a. your protein sequence UniProt record
7Algorithms for sequence comparison
- Generating all possible alignments and picking
the best one: impossibly slow.
- Dynamic programming (here "programming" has
nothing to do with computers): solving a problem
by splitting it dynamically into subparts.
- We build up a solution based on similarity
between prefixes of the two sequences.
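To get a feeling for "impossibly slow", here is a small sketch that counts how many global alignments exist for two sequences of given lengths. It assumes that every column of an alignment is either two paired characters, a gap in the first sequence, or a gap in the second one, and that every distinct sequence of columns counts as a different alignment. The function name `count_alignments` is made up for this illustration.

```python
def count_alignments(n: int, m: int) -> int:
    """Count the global alignments of two sequences of lengths n and m."""
    # counts[i][j] = number of alignments of a length-i prefix with a
    # length-j prefix; if either prefix is empty only gaps are possible,
    # so there is exactly one alignment.
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j]        # last column: gap in the second sequence
                + counts[i][j - 1]      # last column: gap in the first sequence
                + counts[i - 1][j - 1]  # last column: two characters paired
            )
    return counts[n][m]


for length in (2, 5, 10, 20, 50):
    print(length, count_alignments(length, length))
```

The count grows exponentially with the sequence length, which is why enumerating alignments is hopeless for real proteins. Note that the counting itself is already a small example of dynamic programming over prefixes.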
8Aligning prefixes
- Specifically, we solve the alignment problem of
two sequences by splitting it iteratively (or
recursively) into the alignment of their prefixes.
9the algorithm
- We can fill an (n+1)x(m+1) matrix with this
stuff
10Hope it's right
11Computing the matrix
- m = |s|
- n = |t|
- for i = 0..m: a[i,0] = i*g   // m+1
- for j = 0..n: a[0,j] = j*g   // n+1
- for i = 1..m
-   for j = 1..n
-     a[i,j] = max(a[i-1,j] + g,
-                  a[i,j-1] + g,
-                  a[i-1,j-1] + p(s[i], t[j]))
- // (n+1)(m+1) max of 3 sums, etc.
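The pseudocode above can be sketched in Python as follows. The scoring values are an assumption taken from the scoring slide later in these notes (match +1, mismatch -1, gap -2); the function names `p` and `similarity_matrix` are chosen for this illustration. Python strings are 0-based, so `s[i - 1]` plays the role of the i-th character of s.

```python
def p(x: str, y: str) -> int:
    """Similarity of two characters: +1 for a match, -1 for a mismatch (assumed)."""
    return 1 if x == y else -1


def similarity_matrix(s: str, t: str, g: int = -2) -> list[list[int]]:
    """Return a, where a[i][j] is the max similarity of s[:i] and t[:j]."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty string costs one gap per character.
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    # Each remaining cell is the best of the three ways of ending the alignment.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(
                a[i - 1][j] + g,                          # s[i] against a gap
                a[i][j - 1] + g,                          # t[j] against a gap
                a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),  # s[i] against t[j]
            )
    return a


a = similarity_matrix("ACCAGGCTACGA", "ACCTGGGCCACGT")
print(a[-1][-1])  # max similarity between the two full sequences
```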
12Computing the alignment
- What we computed here is the max similarity
matrix between all prefixes of s and t.
- Using this matrix we can compute the optimal
alignments between s and t (there could be more
than one).
- a[m,n] is the max similarity between s and t. We
find the alignment by tracing the choices that
led us there.
13Computing the alignment (2)
- // We have filled in matrix a[i,j] before
- n = |s|   // note: n and m swap roles relative to the matrix-filling slide
- m = |t|
- al_s      // store here aligned s
- al_t      // store here aligned t
- gap = 2   // gap penalty
- i = n+m   // index for the alignment: don't know how long, but at most n+m
- align()
-   while (n>0 || m>0)
-     if (n>0 && a[n,m] == a[n-1,m] - gap)
-       al_s[i] = s[n]
-       al_t[i] = '-'
-       n = n-1
-     else if (n>0 && m>0 && a[n,m] == a[n-1,m-1] + p(s[n],t[m]))
-       al_s[i] = s[n]
-       al_t[i] = t[m]
-       n = n-1
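14align()
- while (n>0 || m>0)
-   if (n>0 && a[n,m] == a[n-1,m] - gap)
-     al_s[i] = s[n]
-     al_t[i] = '-'
-     n = n-1
-   else if (n>0 && m>0 && a[n,m] == a[n-1,m-1] + p(s[n],t[m]))
-     al_s[i] = s[n]
-     al_t[i] = t[m]
-     n = n-1
-     m = m-1
-   else if (m>0 && a[n,m] == a[n,m-1] - gap)
-     al_s[i] = '-'
-     al_t[i] = t[m]
-     m = m-1
-   i = i-1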
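As a rough, runnable counterpart of align(), the sketch below fills the matrix and then traces back from the bottom-right corner, trying the three moves in the same order as the pseudocode. It assumes match +1, mismatch -1 and a gap penalty of 2; instead of writing into fixed-size arrays from index n+m downwards, it appends columns to lists and reverses them at the end, which is only a convenience of this illustration.

```python
def align(s: str, t: str, gap: int = 2) -> tuple[str, str]:
    """Globally align s and t, returning the two gapped strings."""

    def p(x: str, y: str) -> int:
        # Assumed similarity: +1 for a match, -1 for a mismatch.
        return 1 if x == y else -1

    n, m = len(s), len(t)
    # Fill the similarity matrix: a[i][j] scores s[:i] against t[:j].
    a = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        a[i][0] = -i * gap
    for j in range(m + 1):
        a[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a[i][j] = max(
                a[i - 1][j] - gap,
                a[i][j - 1] - gap,
                a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),
            )

    # Trace back the choices that led to a[n][m]; columns come out reversed.
    al_s, al_t = [], []
    while n > 0 or m > 0:
        if n > 0 and a[n][m] == a[n - 1][m] - gap:
            al_s.append(s[n - 1])          # character of s against a gap
            al_t.append("-")
            n -= 1
        elif n > 0 and m > 0 and a[n][m] == a[n - 1][m - 1] + p(s[n - 1], t[m - 1]):
            al_s.append(s[n - 1])          # character of s against character of t
            al_t.append(t[m - 1])
            n -= 1
            m -= 1
        else:
            al_s.append("-")               # gap against a character of t
            al_t.append(t[m - 1])
            m -= 1
    return "".join(reversed(al_s)), "".join(reversed(al_t))


aligned_s, aligned_t = align("ACCAGGCTACGA", "ACCTGGGCCACGT")
print(aligned_s)
print(aligned_t)
```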
16Alignment
- ACC-AGGCTACGA
- ACCTGGGCCACGT
- only one gap, no big deal here..
17Order matters
- There might be multiple paths with the same
score. We used an upmost order here:
- 1 vertical
- 2 diagonal
- 3 horizontal
18Order matters (2)
- To follow a downmost order, reverse the if
statements in the code.
- 1 horizontal
- 2 diagonal
- 3 vertical
20Upmost and downmost alignment
- upmost
- ACC-AGGCTACGA
- ACCTGGGCCACGT
- downmost
- ACCAGG-CTACGA
- ACCTGGGCCACGT
- Both alignments have the same score:
- 9 matches (1 x 9),
- 3 mismatches (-1 x 3),
- 1 gap (-2)
- total: 9 - 3 - 2 = 4
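The arithmetic above is easy to check by machine. The sketch below scores an already-built alignment column by column, using the values from this slide (match +1, mismatch -1, gap -2); the function name `alignment_score` is made up for this illustration.

```python
def alignment_score(al_s: str, al_t: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of two aligned (gapped) strings."""
    if len(al_s) != len(al_t):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for x, y in zip(al_s, al_t):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


upmost = alignment_score("ACC-AGGCTACGA", "ACCTGGGCCACGT")
downmost = alignment_score("ACCAGG-CTACGA", "ACCTGGGCCACGT")
print(upmost, downmost)  # prints "4 4"
```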
21NW algorithm issues
- Always looks for a global alignment.
- If I try to align the following
- first_alignment_try_ok
- second_alignment_try_ehm
- This is what I get
- -first_alignment_try_-ok
- second_alignment_try_ehm
22NW algorithm issues
- Always looks for a global alignment.
- If I try to align the following
- alignment_alignment_try_ok
- alignment_try_ehm
- This is what I might get
- alignment_alignment_try_-ok
- alig----------nment_try_ehm
23Local alignment
- We may want a variation of the previous algorithm
that throws away stuff that clearly does not
match, while keeping the good bits together.
- More formally: find the highest scoring alignment
between substrings of s and t.
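One standard way to get this behaviour, and the idea behind the Smith-Waterman algorithm, is to reuse the same recurrence but never let a cell drop below zero, so that a bad stretch is simply thrown away and the alignment can restart. The sketch below computes only the best local score, not the alignment itself; it assumes match +1, mismatch -1 and a gap penalty of 2, and the function name `local_alignment_score` is made up for this illustration.

```python
def local_alignment_score(s: str, t: str, gap: int = 2) -> int:
    """Best similarity between any substring of s and any substring of t."""

    def p(x: str, y: str) -> int:
        # Assumed similarity: +1 for a match, -1 for a mismatch.
        return 1 if x == y else -1

    best = 0
    prev = [0] * (len(t) + 1)  # row i-1 of the matrix; row 0 is all zeros
    for i in range(1, len(s) + 1):
        curr = [0]             # column 0 is all zeros as well
        for j in range(1, len(t) + 1):
            cell = max(
                0,                                     # restart: drop what came before
                prev[j] - gap,                         # character of s against a gap
                curr[j - 1] - gap,                     # gap against a character of t
                prev[j - 1] + p(s[i - 1], t[j - 1]),   # two characters paired
            )
            curr.append(cell)
            best = max(best, cell)  # the best local alignment may end anywhere
        prev = curr
    return best


print(local_alignment_score("alignment_alignment_try_ok", "alignment_try_ehm"))
```

On the example from the previous slide, the best local alignment is the shared stretch "alignment_try_", with no gaps forced in to account for the parts that do not match.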
24Smith-Waterman algorithm
Mike Waterman at a conference