Title: A Course on Bioinformatics
 1A Course on Bioinformatics
Ming Li Bioinformatics Lab Computer Science 
Dept., UCSB 
 2 Chapter 0. Why Study Bioinformatics?
- The trend of genetic data growth 
- Many experts predict 21st century will be a 
 century of biotechnology.
- Genomes Human, Rice, Mouse, Yeast, 50 species 
 bacteria, biggest digital gold mine ever.
30 billion in year 2005 
 3Chapter 1. Biology
Phenotype 
 4Animal CELL
Mitochondrion
Nucleolus (rRNA synthesized)
Cytoplasm
Nucleus
Plasma membrance Cell coat
Chromatin
Lots of other stuff/organelles/ribosome 
 5Two kinds of Cells
- Prokaryotes  no nucleus (bacteria) 
- Their genomes are circular 
- Eukaryotes  have nucleus (animal,plants) 
- Linear genomes with multiple chromosomes in 
 pairs. When pairing up, they look like
Middle centromere Top p-arm Bottom q-arm 
 6Mitosis and Meiosis
- Mitosis homologous chromosomes pair up, 
 duplicate, one set to each cell.
- Meiosis chromosomes split  to haploid 
 (reproductive) cells.
- Haploid Number of Chromosomes 
- Human 23 
- Rice 12 
- Fruit Fly 4 
- Corn 10 
- Chimpanzee 24 
- House mouse 20
7The DNA Sequences
- All are made of H (hydrogen), C (carbon), N 
 (nitrogen), O (oxygen), S (sulfur), P
 (phosphorus).
- A (single chain) DNA sequence looks like 
-  
5
o
o
o
Sugar --- base
CH
A,C,G,T,U
2
c
c
Phosphate (PO )
c
c
4
Sugar --- base
A nucleotide
o
o
3
phosphate
Ribose RNA Deleting circled O, ? Deoxyribose 
--DNA
Sugar --- base
etc 
 8 The 4 bases
Thymine
Adenine
H
H
H
C
o
A-T
N
H
H
H
N
C
C
C
c
C
C
N
N
H
H
N
C
N
C
C
N
o
H
H
H
Note this is flat!
o
H
N
H
N
C
C
C
Uracil replaces T in RNA
c
C
C
N
N
H
H
N
C
N
C
G-C
C
N
o
N
H
Cytosine
Guanine
H 
 9Human 3 billion bases, 30k genes. E. coli 5 
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an 
amino acid. 64 codons 20 amino acids, some w/more 
codes
A
T
C
G
T
A
A
T 
 10Genes and Proteins
- One gene encodes one protein. 
- Like a program, it starts with start codon (e.g. 
 ATG), then each three code one amino acid. Then a
 stop codon (e.g. TGA) signifies end of the gene.
- Sometimes, in the middle of a (eukaryotic) gene, 
 there are introns that are spliced out (as junk)
 during transcription. Good parts are called
 exons. This is the task of gene finding.
11A.A. Coding Table
- Arginine (ARG) CG 
- Asparagine (ASN) AAT, AAC 
- Glutamine (GLN) CAA, CAG 
- Cysteine (CYS) TGT, TGC 
- Methionine (MET) ATG 
- Phenylalanine (PHE) TTT,TTC 
- Tyrosine (TYR) TAT, TAC 
- Tryptophan (TRP) TGG 
- Histidine (HIS) CAT, CAC 
- Proline (PRO) CC 
- Stop TGA, TAA, TAG 
- Glycine (GLY) GG 
- Alanine(ALA) GC 
- Valine (VAL) GT 
- Leucine (LEU) CT 
- Isoleucine (ILE) AT(-G) 
- Serine (SER) AGT, AGC 
- Threonine (THR) AC 
- Aspartic Acid (ASP) GAT,GAC 
- Glutamic Acid(GLU) GAA,GAG 
- Lysine (LYS) AAA, AAG 
- Start ATG, CTG, GTG
12Bad Genes --gt Genetic Diseases
- Hemophilia on X chromosome. 
- Cystic firosis on chr. 7, CFTR gene. 4 
 Caucasians carry the defective gene. (recessive)
- Sickle-Cell Anemia single nucleotide mutation in 
 the first exon of beta-globin gene (removes a
 cutting site). 1 in 12 African Americans are
 carriers. (sick for homozygotes)
- BRCA1 gene (chr. 17q)  responsible for ½ 
 inherited breast cancer (10 of breast cancer)
- Fragile X syndrome (mentally retard)  1 in 1250 
 males, 2500 females (dominate, but females have
 partially expressed good gene). FMR-1 gene
 tri-nucleotide repeats gt200 causes disease.
- P53 gene chr. 17p, responsible for ½ of all 
 cancers
- X-rated XX, XY, XO(f), XXY(m), XYY(m)
13Where to get data?
- Larry Miller GenBank 
- http//www.ncbi.nlm.nih.gov 
- Dr. Huandong Sun 
- SWISS-PROT http//www.expasy.ch/sprot 
- PDB http//www.pdb.bnl.gov/ 
- Go to our Bioinformatics Lab website 
 www.cs.ucsb.edu/mli/Bioinf/resources/index.html
-  for all the information. 
14Chapter 2. Research and Industry
- Bioinformatics Labs and education programs 
 established in almost all major universities,
 scattered in Biology, Physics, Computer Science
- Research topics 
- Gene-finding 
- Mapping 
- Sequencing/sequence assembly. 
- Sequence Comparison alignment, database search 
- Mining DNA/proteins, finding motifs 
- Genome arrangement 
- DNA arrays 
- Computational proteomics, protein folding 
- Phylogeny reconstruction and analysis
15The Commercial Market
- Current bioinformatics market is worth 300 
 million / year. Half data, half software.
- 2 billion / year bioinformatics market in 5 
 years, predicted by Oscar Gruss  Son.
- Wide range of different demands 
- Bioinformatics software require sophisticated 
 algorithm support, ranging from probabilistic/
 approximation algorithms, data mining, learning
 algorithms, databases, to GUI design
16BUSINESS Landscape
Pharmaceutical Companies
Universities  Research Labs
Hospitals, Biotech firms
Sell Data  Service
Sell large Systems
Sell Web Service 
 17Bioinformatics Companies
- Genomatrix Software, Genaissance Pharmaceuticals, 
 Lynx, Lexicon Genetics, DeCode Genetics, CuraGen,
 AlphaGene, Bionavigation, Pangene, InforMax,
 TimeLogic, GeneCodes, LabOnWeb.com, Darwin,
 Celera, Incyte, BioResearch Online, BioTools,
 Oxford Molecular, Genomica, NetGenics, Rosetta,
 Lion BioScience, DoubleTwist, eBioinformatics,
 Prospect Genomics, Neomorphic, Molecular Mining,
 GeneLogic, GeneFormatics, Molecular Simulations,
- Total 50.
18 Challenges to Computer Science
- Enormous size of genomics data suitable for 
 asymptotic analysis
- Many NP-hard problems defy simple 
 minded/non-professional approaches multiple
 alignment, distant homology, motif finding,
 protein folding, phylogeny, gene relationship in
 expression data, mining and learning,
- Approaching these problems from computer science 
 perspective will be fruitful.
19Algorithmic Techniques in CB
- Dynamic programming (alignment, etc) 
- Divide and conquer (Protein folding) 
- Approximation algorithms (sequence assembly, 
 phylogeny)
- Greedy algorithms (sequence assembly) 
- Heuristics 
- Linear programming and relaxation
20Some CS jargons
- NP-Complete easy to guess, hard to find. 
- Approximation algorithm If the minimal solution 
 is f, your solution is ggtf, then the
 approximation ratio is g/f.
- PTAS (Polynomial time approximation scheme) this 
 is best kind of approximation algorithms for any
 e, we can achieve (1e)optimal in polynomial time
 (exponent might depend on e).
21Chapter 3. DNA Sequencing
- Two ways to copy DNA 
- 1. Polymerase Chain Reaction (1986) make many 
 copies of a fragment (500). Needs primers (end
 segments). Cleave into 2. Anneal primers (5- 3,
 and 3- 5). Make two double chains. Repeat
 1,2,4,8,16,
3  
5
5
3
5
3
5
3 
 22- 2. Cloning. Insert a large piece of DNA into a 
 cloning vector (virus, bacteria, yeast -YAC).
 Then the vector is inserted into a host cell to
 duplicate naturally.
- DNA Sequencing 
- Make many copies (single strand) 
- Cut them into fragments of lengths 500. 
- For each fragment of length L, use some process 
 like PCR, generating all lengths 1  L with some
 fluorescence dye. In old scheme, you generate all
 fragments end with A, then with C, then G, then
 T, run them in 4 lanes of electrophoresis gel. In
 the new scheme, you have 4 colors (of the dye)
 all fragments in 1 lane.
- Then assemble all fragments into the shortest 
 common superstring by GREEDY repeatedly merge
 the pair with max overlap until finish.
23Shortest Common Superstring 
- In FOCS 1990, we started formalize and analyze 
 the following learning problem Infer orginal DNA
 sequence from fragments. Or given n strings,
 find the shortest common superstring. (1994, J.
 ACM)
- The problem was proved to be NPC, 1980. 
- Open for 10 years does GREEDY work? (I.e. does 
 it give linear approximation?)
- We solved this, proving 3, STOC91. 
- Improvements by many people to 2.89, 2.81, 2.79, 
 2.75, 2.66,
- Formal Statement Given Ss1,  sn, find a 
 shortest s such that for all i, si is a substring
 of s.
- E.g. alf ate half lethal alpha alfalfa ? 
 lethalphalfalfate
24Theorem. GREEDY achieves 4.
- Proof. Given Ss1,  ,sm, construct G 
- Nodes are s1,  ,sm 
- Edges if then 
 add edge
- where pref is the pref length. I.e. 
 siprefoverlap length with sj
- SCS(S)  length shortest Hamilton cycle in G 
- (Modified) Greedy restated find all cycles with 
 minimum weights in G, then open cycle,
 concatenate to obtain the final superstring.
sj
pref
si
pref
si
sj 
 25This minimum cycle exists
- Assuming initial Hamilton cycle has w(C)  n 
- Then merging si with sj is equivalent to breaking 
 into two cycles. We have
-  w(C1) w(C2) lt n 
- Proof We merged (si, sj) because they have max 
 overlap. Picture shows
-  
-  
-  d(si,sj)d(s,s)ltd(si,s)d(s,sj) 
- Continue this process end with self-cycles C1, 
 C2, C3, C4,
-  Sw(Ci) lt n.
C
si
sj
s
s
S
sj
C1
si
S
si
sj
s
s
C2 
 26- Then we open cycles 
- Let Wiw(Ci) 
- Li  longest string in Ci  
- open Cilt Wi  Li 
- n gt SWi 
- Lemma. S1 and S2 overlap lt w1w2 
- S(Li-2Wi) lt n, by lemma, since Lis must be in 
 final SCS.
- SltS(LiWi) 
-  S(Li-2Wi)S3Wi 
-  lt n  3n 
-  4n. 
- QED
w1
w1
w1
w1
s1
w2
w2
w2
s2 
 27Chapter 4. Sequence Comparison
- Sequence Alignment 
- s(i,j)max 
-  s(i-1,j)  d(vi,-) 
-  s(i,j-1)d(-,wj) 
-  s(i-1,j-1)d(vi,wj) 
-  No gap penalties, 
- where d, for proteins, is either PAM or BLOSUM. 
- d(-,x)d(x,-) -a, d(x,y)-u. When a0, 
 uinfinity, it is LCS.
- Longest Common Subsequence (LCS). 
- Vv1v2  vn 
- Ww1w2  wm 
- s(i,j)  length of LCS 
-  of V1..i and W1..j 
- Dynamic Programming 
-  s(i-1,j) 
- s(i,j)max s(i,j-1) 
-  s(i-1,j-1)1,viwj
28Misc. Concerns
- Local sequence alignment, add si,j0. 
- Gap penalties. For good reasons, we charge first 
 gap cost a, and then subsequent continuous
 insertions blta.
- Space efficient sequence alignment. Hirschberg 
 alg. in n2 time, O(n) space.
- Multiple alignment of k sequences nk
29BLAST / Psi-BLAST / Gap-BLAST
- Popular software, using heuristics. By Altschul, 
 Gish, Miller, Myers, Lipman, 1990.
- E(epected)-value e dmne -lS, here S is score, m 
 is database length and n is query length.
- Meaning e is number of hits one can expect to 
 see just by chance when searching a database of
 size m.
- Basic Strategy For all 3 a.a. (and closely 
 related) triples, remember their locations in
 database. Given a query sequence S. For all
 triples in S, find their location in database,
 then extend as long as e-value significant.
 Similar strategy for DNA (7-11 nucleotides).
30Here is an example of a BLAST match (E-value 0) 
between gene 0189 in C. pneumoniae and gene 131 
in C. trachomatis. Query CPn0189 
 Score 
(bits) E-value Aligned with CT131 hypothetical 
protein 1240 
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60 
MKR W KI G L  L L LP SES KYL 
 SKEGL EL SW Sbjct 1 
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119 
 FG QTAKI G  EFAEK  GSL 
RLLY PK  TLGWSLIE S Sbjct 61 
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL 
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of 
powerpoint character alignment, I inserted some 
white space in the alignment that are not in the 
BLAST output  I will explain in class. These 
white spaces were inserts so that things line up 
right (the k-th letter goes with the k-th letter). 
 31Chap. 4. Tools From CS
- Suffix Array. Give sequence 
-  she_sells_seashell_on_the_sea_shore 
- Suffix array is an array of pointers pointing to 
 suffices of the sequence  sorted.
- E.g., all suffices of above are e, re, ore, 
 hore, shore,
- _shore, a_shore, ea_shore, sea_shore, and so on. 
- Then these are sorted and their pointers 
 (pointing to a suffix location in sentence) are
 stored in an array (the suffix array).
32Why Suffix Array?
- It can be built very fast. 
- It can answer queries very fast 
- How many times ATG appears (their pointers are 
 all jammed together).
- What is G-C contents. 
- Disadvantages 
- Cant do approximate matching 
- Hard to insert new stuff (need to rebuild the 
 array) dynamically.
- Pointers can cost too much space. 3G pointers?
33Fast Pattern Matching
- Given T (text) and P (pattern), is P in T? 
- Slow algorithm for each position in T, check if 
 P is in T. This costs O(PT).
- Fast algorithm no back tracking. P is short, we 
 can calculate a table for P first if we have
 matched PP1  Pk with T halfway fail to match at
 Pi with Tj, since P1  Pi-1 already matched the
 text, we should know where to start in P and
 continue to match at Tj1  The table will
 indicate for each position Pi if we fail at i,
 where to start next. O(P2  T).
- Knuth-Morris-Pratt, Boyer-Moore O(PT).
34Chapter 5 Multiple alignment
- To do phylogenetic analysis 
- Same protein from different species 
- Optimal multiple alignment probably implies 
 history
- Discover irregularities, such as Systic Fibrosis 
 gene
- To find conserved regions 
- Local multiple alignment reveals conserved 
 regions
- Conserved regions usually are key functional 
 regions
- These regions are prime targets for drug 
 developments
35Definitions 
- Given sequences s1,  sn, multiple alignment M 
 puts them in n rows, one sequence per row, with
 spaces inserted to get supersequences S1, , Sn,
 SiL.
- SP-Alignment Minimize sum of Hamming(Si,Sj) over 
 all pairs i, j.
- Star-Alignment find center sequence S, minimize 
 sum of Hamming(S,Si) over all i.
- We will concentrate on SP-alignment.
36CLUSTAL-W
- Standard popular software 
- It does multiple alignment as follows 
- Align 2 
- Repeat keep on adding a new sequence to the 
 alignment until no more, or do tree-like
 heuristics.
- Problem It is simply a heuristics. 
- Alternative dynamic programming nk for k 
 sequences. This is simply too slow.
- We need to understand the problem and solve it 
 right.
37Making the Problem Simpler!
- Multiple alignment is very hard 
- For k sequences, nk time, by dynamic programming 
- NP hard in general, not clear how to approximate 
- Popular practice -- alignment within a band the 
 p-th letter in one sequence is not more than c
 places away from the p-th letter in another
 sequence in the final alignment  the alignment
 is along a diagonal bandwidth 2c.
- Used in final stage of FASTA program.
38Literature (for details, see our STOC00 paper)
- NP hardness under various models Wang-Jiang 
 (JCB), Li-Ma-Wang (STOC99), W. Just
- Approximation results Gusfield (2- 1/L), Bafna, 
 Lawler, Pevzner (CPM94, 2-k/L), star alignment.
- Sankoff, Kruskal discussed within a band 
- Pearson showed alignment within a band gives very 
 good results for a lot protein superfamilies.
- Altschul and Lipman, Chao-Pearson-Miller, 
 Fickett, Ukkonen, Spouge (survey) all have
 studied alignment within a band.
39 The following were proved
- SP-Alignment 
- NP hard 
- PTAS for constant band 
- PTAS for constant number of insertion/deletion 
 gaps per sequence on average (for coding regions,
 this assumption makes a lot of sense)
- Star-Alignment 
- PTAS in constant band 
- PTAS for constant number of insertion/deletion 
 gaps per sequence on average
40We will do only SP-alignment
- Notation in an alignment, a block of inserted 
 --- is called a gap. If a multiple alignment
 has c gaps on the average for each sequence, we
 call it average c-gap alignment.
- We first design a PTAS for the average c-gap SP 
 alignment.
- Then using the PTAS for the average c-gap SP 
 alignment, we design a PTAS for SP-alignment
 within a band.
41Average c-gap SP Alignment
- Key Idea choose r representative sequences, we 
 find their correct alignment in the optimal
 alignment, by exhaustive search. Then we use this
 alignment as reference.
- Then we align every other sequence against this 
 alignment.
- Then choose the best. 
- All we have to show is that there are r sequences 
 whose letter frequencies in each column of their
 alignment approximates the complete alignment.
42Some over-simplified reasoning
- If M is optimal average c-gap SP alignment 
- In this alignment, many sequences have less than 
 cl gaps.
- So if we take r of these sequences, and try every 
 possibly way, one way coincides with M.
- Then hopefully, its letter frequencies in each 
 column more or less approximates that of Ms
- Then we can simply optimally align all the rest 
 of the sequences one by one according to this
 frequency matrix.
43Sampling r sequences
Complete Alignment
Alignment with r sequences
j
j
We also expect this column has  k percent as
If this column has k percent as 
 44AverageSPAlign
- for Lm to nm  
-  for any r sequences  
-  for all possible alignment M of length L 
-  and with no more than cl gaps  
-  align all other sequences to M //one 
 alignment
-   
- Output the best alignment.
45SP Alignment within c-Band
- Basic Idea 
- Dynamically cut seq-uences into segments 
- Each segment satisfies the average c-gap 
 condition. Hence use previous algorithm
- Then assemble the segments together 
- Divide-Conquer.
Cutting these sequences into 6 segments, each 
segment has c-gaps per sequence on average in 
optimal alignment. 
 46The final Algorithm diagonalAlign
- while (not finished)  
-  find a maximum prefix for each sequence (same 
 length) such that AverageSPAlign returns
 lowcost. Keep the multiple alignment for this
 segment
-  
- Concatenate the multiple alignments for all 
 segments together to as final alignment.
47Discussion
- Current PTAS is extremely slow. 
- However, the design of PTAS might provide useful 
 hints to design fast programs.
- It is an interesting project to implement this 
 PTAS (combined with some heuristics) and test
 this idea.
48Chapter 6. Motif Finding
- Finding motifs/conserved regions in proteins is 
 important in drug design and proteomics.
- The problem is also called local alignment. 
- Many programs exist -- all based on heuristics 
- We proved in STOC99 it is NP-hard. 
- We provided a polynomial time approximation 
 scheme with guaranteed performance in polynomial
 time in STOC99.
- Based on this theoretical result, we have 
 implemented the COPIA system.
49Given k protein sequences, find a conserved 
region
L
K sequences
Red regions are conserved regions, or, 
motifs. The dont have to be exactly same, they 
match with higher scores than other regions. 
 50The PTAS (Li, Ma, Wang, STOC99)
- Input S1,  , Sm, integer L. 
- Output t1,  ,tm, tiL (motifs) 
- For every r length L substring, compute the 
 consensuse t of them.
- In all strings, find closest substring to t of 
 length L. From these, find the new consensus s.
- Choose the best. 
- Theorem This algorithm outputs a solution no 
 worse than 11/sqrt(r) optimal. Time complexity
 is (nm)O(r).
51COPIA (COnsensus Pattern Identification  
Analysis, Liang-Li-Ma) (http//dna.cs.ucsb.edu/
copia/copia_submit.html) and Others
- Straightforward implementation of PTAS is 
 extremely slow.
- But we can do a few things to speed up 
- E.g. instead of choosing all r, choose randomly 
- Many programs exist. Some use HMM. Most popular 
 perhaps is the Gibbs sampling method by Lawrence
 et al (see page 149 textbook). It is an iterative
 procedure as follows Input, S1,  , Sm
- Randomly choose a word wi from each Si 
- Randomly choose r 
- Create a frequency matrix (qij) from the m-1 
 words not from Sr.
- For each position in Sr, compute the probability 
 a word fits the frequency matrix (qij), and use
 that word with the highest probability, repeat.
52Linear ProgrammingRelaxation(Reference book 
Randomized algorithms, Motwani, Raghavan)
- We introduce here a new technique for algorithm 
 design (approximation).
- For simplicity, lets consider a simplified 
 problem. Given a set Ss1,  sn sequences, each
 of length n, find string t not in S that is far
 away from all si in S find max d such that
 H(t,si) gt d.
- Formulating an LP problem. If xx1  xn and 
 sis1,  sn, then H(x,si)x1  x2   xn,
 where xj  xj if sj0, xj 1-xj if sj1, now we
 have an integer program
-  max D 
-  S aijxjgtCiD for each si, aij0,1,-1  Ci are 
 from Sxi
-  xi  0, 1
53LP Relaxation
- Integer Program (as we have just described) is 
 NP-hard. So it is hard to solve it directly.
- However, Linear Program is polynomial time 
 solvable. So we relax our requirement that
 solution x1,  xn have to be 0,1. Instead we may
 use 0?xi?1.
- Then using a technique called randomized rounding 
 we convert our fraction solution to integer
 solution by set xi1 with probability xi, xi0
 with probability 1-xi.
- What does this give us? Well, for each aij xj 
 Ci, after randomized rounding, we lose about
 logn, thus if D is O(n), we are within D-logD
 with high probability, so we have a very good
 approximation.
- We can then derandomize the process getting a PTAS
54Chapter 7 Protein Folding by Threading
- Target ketosteroid isomerase KSI 
- Template 1ounA C_alpha-RMSD 1.97A and seq. 
 identity 9 with KSI
- 2.73A RMSD for all backbone atoms 
- 200 residues (M. Summers, in Science  one of 
 HIV proteins)
Blue predicted model by PROSPECT Red NMR 
structure 
 55Protein Folding Prediction by ThreadingThere are 
many interesting work by Sippl-Weitchus, 
Lathrop-Smith, Jones-Thornton, Godzik et al, 
Bryant-Altschul, and many others. But we will 
concentrate on one program PROSPECT by Ying Xu et 
al.
- Template library FSSP. It has currently about 
 2000 known protein structures and specifies, for
 each protein
- Core (?-helix, ?-sheet) regions. 
- For each a.a. position in sequence, secondary 
 structure (3 options), Sol(vent) exposed,
 buried, or intermediate. (s.s.xsol total 9
 possibilities).
- Distance interaction when lt8 angstrom. If in the 
 same core, if gt4angstrom, also specify
 interaction.
- Input Query sequence.
core
core
core
loop
loop
loop
loop
Protein 
 56Goal Find Best Template
- To minimize, over all templates T in FSSP 
-  ETw1Emutatew2Esingletonw3Epairw4Egap 
-  where 
- Emutate is alignment cost (as sequences, using 
 PAM250)
- Esingleton is a residues local preference of the 
 ss and solvent environment. (Stored in table)
- Epair specifies some (non-neighbor) pairs should 
 be close in space (Energy level also in a table).
- Egap is gap penalty. Not allowed in cores in 
 PROSPECT.
57Threading Algorithm
- The problem in general is NPC. 
- The problematic part is Epair. Other parts can be 
 optimized in polynomial time.
- In order to compute Epair, draw an edge between 
 each pair when there is pairwise interaction in
 FSSP.
- PROSPECT uses a divide and conquer plus dynamic 
 programming each step find best place to cut the
 problem into two parts (with smallest cut)
- DP1,n,1,mmini,kDP1,i,1,k,x1,  xc 
-  DPi1,n,k1,mx1,,
 xc
- Can precompute optimal cut  fix i above.
58Time Complexity
- O(ncnm), c3,4 on average. c8 max. for all 
 proteins in FSSP.
- User provided information will also help 
 threading, these include
- Secondary structure information 
- Inter-residue distances (including NMR data). 
- Gap length requirements 
- Problems remain 
- Coefficients for all (wis). It is better to be 
 specific.
- Can this be done random polynomial time? Or 
 efficient approximation? Currently too slow.
- Fast computation is important for distant 
 homology.
59Side Topic Structure Alignment
- Problem given two proteins with structures, 
 align them together.
- A simple algorithm repeat for each combination 
 of three aa from one protein and three from
 another, align the six, then align the rest (by
 dynamic programming). This algorithm is about n8
 by Takutsu.
60Chapter 8. Repeating Patterns
- Over 45 human genome are (approximate) repeats. 
 More in plants.
- Some claim that genomics is a science of 
 finding/studying genomic repeats. Genomes evolves
 by duplicating (then evolve) parts.
- How do we find approximate genome level repeats? 
- We wrote the such a program  PattenHunter (Ma, 
 Tromp) and visualization tool (Miller)
- Larry Miller/John Tromp demo.
61Repeats Arabidopsis Chr 2 vs Chr 1p  
 62Chapter 9. Coding Region Detection
- Prokaryotic genome and enkaryotic genomes have 
 different properties
- Regulatory regions 
- Starting and ending codons 
- Gene density 
- Introns 
- Such programs usually use HMM to detect ORFs and 
 intron/exon regions, and use EST databases, BLAST
 search, and species-specific statistics GenScan,
 Genie, GenMark..
- Other programs use Neural Networks GRAIL
63How it works in nature
- How does it work in nature 
-  Prokaryotes do not have introns 
-  Single cell eukaryotes have less introns 
-  Other eukaryotes may have up to 90 introns.
64- Spliceosome cuts off the introns which often 
-  start with GU, ends with AG 
-  In the middle, branch site ends with AC (match 
 GU)
Note Yc/u, Nany 
 65Hidden Markov Model
- HMM is lt?,Q,A,Egt where 
- ? is symbol alphabet 
- Q is set of states (that emit symbols) 
- A is QxQ matrix of state transition 
 probabilities
- E is Qx? matrix of emission probabilities. 
- A path ??1  ?n is a sequence of states. The 
 probability a sequence is generated by ? is
-  P(x?) ? P(xi?i)P(?i ?i1) 
- Decoding Problem Given x, find ? maximizing 
 P(x?).
- Fake/fair Coin Example Coinlt?,Q,A,Egt 
- ?0,1 (tail/head) 
- QF,B (fair, biased) 
- aFFaBB0.9, aFBaBF0.1 
- eF(0)1/2, eF(1)1/2, eB(0)1/4, eB(1)3/4 
- Decoding Problem solved by Viterbi in 1967, using 
 Dynamic Programming.
- Parameter estimation 
66A HMM Scheme for Gene Finder 
 67Chapter 10. Phylogeny (Good reference Biological 
sequence analysis, probabilistic models of 
proteins  Durbin, Eddy, Krogh, Mitchison, 
Cambridge Univ Press.)
- Consider a gene seq. from each of k species, it 
 is possible to infer evolutionary relationship of
 these k species from these k genes.
- Many heuristics for building evolutionary trees 
 exist. The problem is known to be NP-hard.
- We will discuss various algorithms 
- Max likelihood method  fastNDAml/PROTml 
- Neighbor Joining 
- Pasimony 
- UPGMA 
- Quartet cleaning (SODA00) 
- PTAS  finding the most consistent tree (FOCS98).
68Tree of Evolution
G ? T mutation
AAATGTACC
A ? G mutation
AAATGTACC
AAATGTGCC
A ? T mutation
AAATGTGCC
TAATGTGCC 
 69UPGMA (Sokal, Michener, 1958)
0.1
0.1
0.1
0.4
0.4
- Initialize 
- Ci  si, for all i. 
- Repeat until one cluster left 
- Find two clusters Ci, Cj with minij 
 dij(?dpq)/CiCj, p?Ci, q?Cj
- Define node k with i,j as children, edge weight 
 dij
- Form cluster k, remove i,j clusters.
Problem of UPGMA 
 70Neighbor Joining (Saitou-Nei, 1987)
- Initialize 
- Tsequences, LT 
- Choose i,j?L such that dij-ri-rj minimized. Rest 
 similar to UPGMA with similar modification on
 edge weights to k.
- Here, ri, rj are the average distances from i,j 
 to other nodes in L  to compensate long edges.
71Parsimony
- Finds the tree with minimal number of 
 substitutions.
- Extremely slow. Use branch and bound. 
- Even this problem is not easy given the tree 
 with leaves as sequences, find the best way to
 assign sequences to internal nodes so that the
 mutations are minimized. (Jiang-Lawler-Wang, 1994
 STOC)
72Maximum Likelihood
- Jukes-Cantor (1969) model equal substitution 
 probability (1 parameter ?)
- Kimura (1980) (2 parameter) model purine (A,G) 
 to purine and pyrimidine (C,T,U) to pyrimidine
 substitution at different rates (? and ?).
- Max Likelihood method wants a tree with max 
 probability under these models. You have to
 exhaustively search through all trees, extremely
 slow process.
73Quartet Methods
- For every four sequences, construct a tree of 4 
 nodes (quartet), using max likelihood say.
- Then build a large tree consistent with (most of) 
 these small trees (of size 4). But this step is
 (NP) hard.
- New appoach data correction (see our papers in 
 FOCS98, SODA00 on our bioinformatics lab
 website)
74Quartets and Correction
c
a
Original tree
d
b
c
a
b
e
c
d
a
d
a
b
e
e
b
d
b
d
a
e
c
correction
c
a
e
c
error
e
d 
 75HyperCleaning Software
- For less than 30 taxa, HyperCleaning is 
 comparable to fastDNAml (using maximum likehood
 score), and better than NJ.
- For more than 30 taxa, true ML and MP methods 
 take days and produces poor results.
 HyperCleaning do well, with better ML score.
76(No Transcript) 
 77Chapter 11. DNA Compression
- Life represents order, not chaos. It should be 
 compressible. Biological evidences
- In eukaryotes long tandem repeats 
- multiple copies of essential genes (rRNA) 
- only 1000 protein folding patterns 
- genes duplicate themselves for evolutionary or 
 selfishness purposes
78GenCompress (Chen, Kwong, Li, GIW99,RECOMB00) 
- Lempel-Ziv style, one pass. 
- Encodes approx. matches (edit distance). 
- Encodes approximate reverse complements. 
- Carefully designed gain function and encoding. 
 Arithmetic encoding when needed.
- Works for conditional compression. 
- Best compression ratio for benchmark data.
79Comparison with Biocompress-2
Biocompress-2 is by Grumbach-Tahi 
 80Comparison with Cfact
Cfact is by Rivals et al 
 81Chapter 12. Whole-Genome Phylogeny
- Completely Sequenced 
- C. elegans, H. sapiens, D. melanogaster, Rice 
- 50 species of archaea bacteria and eubacteria 
- What can we do with them? 
- Snel, Bork, Huynen compare gene contents 
- Boore, Brown gene order 
- Sankoff, Pevzner, Kececioglu reversal/translocati
 on/synteny
82New Thinking
- Lets first give up traditional a priori 
 conception of important similarities between 2
 sequences.
- Can we simply define a distance d(x,y) that is 
 universal if two seqs are close in any sense,
 then they are close under d(x,y)?
83Defining a Distance
- Shared Information 
- K(xy) is Kolmogorov complexity 
-  of x condition on y, defined as 
-  the length of shortest program 
-  that outputs x on input y. 
-  K(x)  K(xy) is mutual 
-  information 
 K(x) - K(xy) d(x,y)  1 - 
---------------------- 
K(xy) 
 84Universality
- It can be proved, for any other computable 
 normalized measure D(x,y) that satisfies some
 reasonable neighborhood density property, then we
 have there is a constant c such that for all
 x,y, we have
-  d(x,y) lt cD(x,y) 
- Informally speaking any similarity detected by D 
 is also detected by d !
85Approximate K(xy)
- But K(xy) is not computable!! 
- We approximate K(xy) by Compress(xy). 
- DNAs are over alphabet A,C,G,T. Trivial 
 algorithm gives 2 bits per base.
- All commercial software like compress, 
 compact, pkzip, arj give gt 2 bits/base
- We will use GenCompress, since this is the best 
 available compression alg for DNA sequences.
86 First Experiment Genome Tree
- On a single gene (or several), current methods 
- Max. likelihood assumes statistical evolutionary 
 models, compute the most likely tree. Based on
 multiple alignment.
- Max. parsimony needs multiple alignment, then 
 finds the best tree, minimizing cost.
-  Distance-based methods NJ Quartet methods, 
 Fitch-Margoliash method.
- Trouble different gene trees, manual alignment, 
 horizontally transferred genes, do not handle
 genome level events.
87Whole Genome Phylogeny
- We provide an alternative approach. It 
- uses all the information in a genome 
- completely automated 
- needs no alignment robust, fast, accurate 
- no model -- simply universal 
- generalizes alignment distance, reversal 
 distance, translocation distance
- one genome, one evolution, one tree.
88Eutherian Orders
- It has been a disputed issue which two groups of 
 placental mammals are closer Primates,
 Ferungulates, Rodents.
- Hasegawas group concatenated 12 proteins from 
 rat, house mouse, grey seal, harbor seal, cat,
 white rhino, horse, finback whale, blue whale,
 cow, gibbon, gorilla, human, chimpanzee, pygmy
 chimpanzee, orangutan, sumatran orangutan, with
 opossum, wallaroo, platypus as out group, 1998.
- They used Max Likelihood method.
89Eutherian Orders ...
- We use exactly the same species. 
- And complete mtDNA genome 
- We computed d(x,y) for each pair of species, and 
 used Neighbor Joining in Molphy package (and our
 own hypercleaning).
- We constructed exactly the same tree. Confirming 
-  ((Primates, Ferungulates), Rodents)
90Evolutionary Tree of Mammals 
 912nd Exp Chaining Chain Letters
- Charles Bennett collected 33 chain letters 
 1980--1997. Using our new method, we
 reconstructed their history. Answered open
 questions of chain letter experts. Will appear in
 Scientific American.
- Like a gene, they are about 2000 characters like 
 a virus, they have infected billions of people,
 they mutate just like a genome, even has
 horizontal transfer. Traditional phylogeny
 methods should also work on them. But they dont
 alignment fail--translocated sentences no model
 of evolution.
- We used our own method to calculate shared 
 information between each pair of chain letters.
92A sample letter 
 93Another typical chain letter
with love all things are possible this paper has 
been sent to you for good luck. the original is 
in new england. it has been around the world 
nine times. the luck has been sent to you. you 
will receive good luck within four days of 
receiving this letter. provided, in turn, you 
send it on. this is no joke. you will receive 
good luck in the mail. send no money. send 
copies to people you think need good luck. do 
not send money as faith has no price. do not keep 
this letter. It must leave your hands within 96 
hours. an r.a.f. (royal air force) 
officer received 470,000. joe elliot received 
40,000 and lost them because he broke the 
chain. while in the philippines, george welch 
lost his wife 51 days after he received the 
letter. however before her death he received 
 7,755,000. please, send twenty copies and see 
what happens in four days. the chain comes from 
venezuela and was written by saul anthony de 
grou, a missionary from south america. since 
this letter must tour the world, you must make 
twenty copies and send them to friends and 
associates. after a few days you will get a 
surprise. this is true even if you are not 
 superstitious. do note the following 
constantine dias received the chain in 1953. he 
asked his secretary to make twenty copies and 
send them. a few days later, he won a lottery of 
two million dollars. carlo daddit, an office 
 employee, received the letter and forgot it had 
to leave his hands within 96 hours. he lost his 
job. later, after finding the letter again, he 
mailed twenty copies a few days later he got a 
better job. dalan fairchild received the letter, 
and not believing, threw the letter away, nine 
days later he died. in 1987, the letter was 
received by a young woman in california, it was 
very faded and barely readable. she promised 
herself she would retype the letter and send it 
on, but she put it aside to do it later. she was 
plagued with various problems including 
expensive car repairs, the letter did not leave 
 her hands in 96 hours. she finally typed the 
letter as promised and got a new car. remember, 
send no money. do not ignore this. it works. st. 
jude 
 94Phylogeny of 33 Chain Letters
Confirmed by VanArsdales study, answers an open 
question 
 95Chapter 13. Expression Arrays
- Expression arrays can hold thousands of genes 
 (DNA) such that cDNA (complementary DNA) can
 hybridize on them.
- The process is explained next page (J. Buhler). 
- This has opened great new opportunities for 
 proteomics and biological pathway studies. For
 example, there are expression arrays containing
 complete set of genes for Yeast (Pat Brown Lab,
 Stanford).
- Now that the human genome is completed with about 
 30k genes. It is conceivable for a chip to
 contain the complete set of human genes.
961. Two kind cells
2. Reverse transcribe mRNA to cDNA
3. Label cDNA with fluorescein
4. Hybridize on the DNA array 5. Read 
array 6. Analyze it. 
 97Analysis Techniques
- Finding genes expressed in one kind of cells and 
 not in the other.
- Cluster analysis  finding what genes expressed 
 together under some condition, to deduce
 pathways.
- Data mining, machine learning techniques gene 
 expression dependency. E.g. the expression of
 gene A implies the expression of gene B and no
 expression of gene C.
- Endless possibilities, enormous data. 
- http//www.cs.ucsb.edu/mli/expression.html
98An example of clustering 
 99Chapter 14. Bioinformatics Platform
- GCG (Wisconsin package) 
- Biotools 
- MacVector 
- Doubletwist webservice 
- NCBI webservice 
- Omega 
- Specialized packages like PAUP (phylogeny), 
 GenePix (expression array)
- Many more.
100DryLab Open, distributed, platform independent 
 101Chapter 15 Projects/Missing Topics
- I wish I could cover all, but the time is short. 
 Part of the missing topics will be covered by
 student projects/presentations.
- Each student will work on one project. 
- And present the project in class. 
- Most projects involve serious programming in 
 either C or Java.
- Please start early and discuss with me first. 
 Check our web page for projects.
- When a project is large, I will allow two 
 students to work on a project together.
102Major missing topics
- Genome rearrangement distances 
- Physical mapping 
- DNA arrays / Sequence by hybridization 
- Drug target design (from computational biology to 
 computational chemistry).
- RNA structure / folding. 
- Statistical studies / properties.
103Acknowledgements
- I would like to thank many coauthors, colleagues, 
 students whose work I used here J. Badger, C.
 Bennett, B. Brejova, X. Chen, T. Jiang, P.
 Kearney, S. Kwong, C. Liang, G.H. Lin, B. Ma, L.
 Miller, H.D. Sun, J. Tromp, T. Vinar, L. Wang, Z.
 Wang, D. Xu, Y. Xu, H. Zhang.
- I have copied some materials/figures from the 
 internet  I thank these authors as well.
- Most importantly, thanks to the enthusiastic 290I 
 students of Spring 2001 at UCSB who made this
 course fun to teach.