Title: Iosif Vaisman
1Introduction to Bioinformatics
Email ivaisman_at_gmu.edu
2NIH working definition of bioinformatics and
computational biology (July 2000)
The NIH Biomedical Information Science and
Technology Initiative Consortium agreed on the
following definitions of bioinformatics and
computational biology recognizing that no
definition could completely eliminate overlap
with other activities or preclude variations in
interpretation by different individuals and
organizations. Bioinformatics Research,
development, or application of computational
tools and approaches for expanding the use of
biological, medical, behavioral or health data,
including those to acquire, store, organize,
archive, analyze, or visualize such
data. Computational Biology The development and
application of data-analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social systems.
3Bioinformatics bibliography(papers with the word
bioinformatics in title or abstract)
4Dynamics of Database Growth
5Comparative Sequence Sizes
- Yeast chromosome 3
350,000 - Escherichia coli (bacterium) genome
4,600,000 - Largest yeast chromosome now mapped
5,800,000 - Entire yeast genome
15,000,000 - Smallest human chromosome (Y)
50,000,000 - Largest human chromosome (1)
250,000,000 - Entire human genome
3,000,000,000
6The String Alignment Problem
string - a sequence of characters from some
alphabet
given two strings acbcdb and cadbd
one of possible alignments
a c - - b c d b - c a d b - d -
score 3 . (2) 5 . (-1) 1
scoring function exact match 2 mismatch
-1 insertion -1
7The String Alignment Problem
given two strings CTCATG and TACTTG
C T C A T G T A C T T G
score 3 . (2) 3 . (-1) 3
C T C A - T - G . T - A C T T G
score 4 . (2) 4 . (-1) 4
8Entropy and Redundancy of Language
CUR F W D DIS AND P A
SED IEND ROUGHT EATH EASE AIN
BLES FR B BR AND AG
9Entropy and Redundancy of Language
CUR FW D DISAND
P
BLESFRBBRAND
AG
The sequences are 65 identical
A CURSED FIEND WROUGHT DEATH DISEASE AND
PAIN
A BLESSED FRIEND BROUGHT BREATH AND EASE
AGAIN
10Substitution Matrices
- Dayhoff (or MDM, or PAM) - Derived from global
alignments of closely related sequences PAM100 -
number referes to evolutionary distance
(Percentage of Acceptable point Mutations per 108
years)
300 million years
200 million years
100 million years
11Substitution Matrices
- BLOSUM (BLOcks SUbstitution Matrix) -Derived
from local, ungapped alignments of distantly
related sequences BLOSUM62 - number refers to
the minimum percent identity
Reference Henikoff Henikoff Proteins 1749,
1993
12Selecting a Matrix
- Compared sequences are related 200 PAM or 250
PAM - Database scanning 120 PAM
- Local alignment search 40 PAM, 120 PAM, 250 PAM
- Detection of related sequences using BLAST
BLOSUM 62
Low PAM short segments, high similarity High
PAM long segments, low similarity
THERE IS NO ONE SIZE FITS ALL MATRIX !
13Matrix Example
A B C D E F G H I
K .. 1.5 0.2 0.3 0.3 0.3 -0.5 0.7 -0.1
0.0 0.0 .. A 1.1 -0.4 1.1 0.7 -0.7
0.6 0.4 -0.2 0.4 .. B 1.5 -0.5
-0.6 -0.1 0.2 -0.1 0.2 -0.6 .. C
1.5 1.0 -1.0 0.7 0.4 -0.2 0.3 .. D
1.5 -0.7 0.5 0.4 -0.2 0.3
.. E 1.5 -0.6 -0.1
0.7 -0.7 .. F
1.5 -0.2 -0.3 -0.1 .. G
1.5 -0.3 0.1 .. H
1.5 -0.2 .. I
1.5 .. K
14Dayhoffs Acceptable Point Mutations
Ala A Arg R 30 Asn N 109 17 Asp D 154 0
532 Cys C 33 10 0 0 Gln Q 93 120 50 76
0 Glu E 266 0 94 831 0 422 Gly G 579 10 156
162 10 30 112 His H 21 103 226 43 10 243 23
10 Ile I 66 30 36 13 17 8 35 0 3 Leu
L 95 17 37 0 0 75 15 17 40 253 Lys K
57 477 322 85 0 147 104 60 23 43 39 Met M
29 17 0 0 0 20 7 7 0 57 207
90 Phe F 20 7 7 0 0 0 0 17 20 90
167 0 17 Pro P 345 67 27 10 10 93 40 49
50 7 43 43 4 7 Ser S 772 137 432 98 117
47 86 450 26 20 32 168 20 40 269 Thr T 590
20 169 57 10 37 31 50 14 129 52 200 28
10 73 696 Trp W 0 27 3 0 0 0 0 0
3 0 13 0 0 10 0 17 0 Tyr Y 20 3
36 0 30 0 10 0 40 13 23 10 0 260
0 22 23 6 Val V 365 20 13 17 33 27 37
97 30 661 303 17 77 10 50 43 186 0 17
A R N D C Q E G H I L K
M F P S T W Y Ala Arg Asn Asp
Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser
Thr Trp Tyr
15Search and alignment entropy
- Information content per position pam10 -
3.43 bits pam120 - 0.98 bits
pam160 - 0.70 bits pam250 - 0.38
bits blosum62 - 0.70 bits - Information requirements for search -
30 bits for alignment - 16 bit
16Search and alignment entropy
Recommended matrices for different query length
- Query length Substitution matrix Gap
costs -
- lt35 PAM-30 ( 9,1)
- 35-50 PAM-70 (10,1)
- 50-85 BLOSUM-80 (10,1)
- gt85 BLOSUM-62 (11,1)
17FASTA Algorithm
1
First run (identities)
18FASTA Algorithm
The score of the highest scoring initial region
is saved as the init1 score.
19FASTA Algorithm
3
Joining threshold - eliminates disjointed segments
Non-overlapping regions are joined. The score
equals sum of the scores of the regions minus a
gap penalty. The score of the highest scoring
region, at the end of this step, is saved as the
initn score.
20FASTA Algorithm
4
Alignment optimization using dynamic programming
The score for this alignment is the opt score.
21FASTA Algorithm
FastA uses a simple linear regression against the
natural log of the search set sequence length to
calculate a normalized z-score for the sequence
pair. Using the distribution of the z-score, the
program can estimate the number of sequences that
would be expected to produce, purely by chance, a
z-score greater than or equal to the z-score
obtained in the search. This is reported as the
E() score.
22FASTA Results
- When init1init0opt
100 homology over the matched stretch. - When initn gt init1
more than 1 matching region in the database
with poorly matching separating regions. - When opt gt initn
the matching regions are greatly
improved by adding gaps in one or both of the
sequences.
23BLAST - Basic Local Alignment Search Tool
- Blast programs use a heuristic search algorithm.
The programs use the statistical methods of
Karlin and Altschul (1990,1993). - Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity
to distant related sequences.
24BLAST Algorithm
1
Query sequence of length L
Maximium of L-w1 words (typically w 3 for
proteins)
For each word from the query sequence find the
list of words with high score using a
substitution matrix (PAM or BLOSUM)
Word list
25BLAST Algorithm
2
Database sequences
Word list
Exact matches of words from the word list to the
database sequences
26BLAST Algorithm
3
Maximal Segment Pairs (MSPs)
For each exact word match, alignment is extended
in both directions to find high score segments
27Gapped BLAST
- The Gapped Blast algorithm allows gaps to be
introduces into the alignments. That means that
similar regions are not broken into several
segments. - This method reflects biological relationships
much better.
28BLAST family of programs
- blastp - amino acid query sequence against a
protein sequence database - blastn - nucleotide query sequence against a
nucleotide sequence database - blastx - nucleotide query sequence translated
in all reading frames against a protein
database - tblastn - protein query sequence against a
nucleotide sequence database dynamically
translated in all reading frames - tblastx - six-frame translations of a
nucleotide query sequence against the
six-frame translations of a nucleotide sequence
database.
29Database Searches
- Run Blast first, then depending on your results
run a finer tool (Fasta, Smith-Waterman, etc.) - Where possible use translated sequence.
- E() lt 0.05 is statistically significant, usually
biologically interesting. Check also 0.05 lt E()
lt10 because you might find interesting hits. - Pay attention to abnormal composition of the
query sequence, it usually causes biased scoring. - Split large query sequence ( if gt1000 for DNA,
gt200 for protein). - If the query has repeated segments, remove them
and repeat the search.
30Documenting the Search
- Algorithm(s)
- Substitution matrix
- Gap penalty (FASTA)
- Name of database
- Version of database
- Computer used
31MULTIPLE SEQUENCE ALIGNMENT
32Computational complexity
Alignment of protein sequences with 200 amino
acid residues
33Multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWY
QQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--
YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDF
YPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLT
CLVKGFYPSD--IAVEWESNG--
Column cost the sum of costs for all possible
pairs
34Multiple alignment
A correct multiple alignment corresponds to an
evolutionary history no correct way to
determine practical way - to find an alignment
with the maximum score
35Multiple sequence alignment
Given k (k gt 2) sequences, s1,, sk, each
sequence consisting of characters from an
alphabet A multiple alignment is a a rectangular
array, consisting of characters from the
alphabet A (A "-"), that satisfies the
following 3 conditions 1. There are exactly k
rows. 2. Ignoring the gap character, row number
i is exactly the sequence si. 3. Each column
contains at least one character different from
"-".
36Consensus
Plurality - minimum number of votes for a
consensus Threshold - scoring matrix value below
which a symbol may not vote
for a coalition. Sensitivity - minimum score to
select consensus Profiles - blocks of
prealigned sequences
37Multiple alignment algorithm
1. Pairwise alignments (progressive pairwise
alignments) 2. Distance matrix calculation 3.
Guide tree creation (hierarchical clustering) 4.
New sequence addition
38Scoring system (distances)
Sreal(ij) - observed similarity score for two
aligned sequences i and j Siden(ij) - average
of the two scores for each sequence aligned
with itself Srand(ij) - average score determined
from 100 global randomizations of the two
sequences
The distances D(ij) are used to generate the
distance matrix from which the approximate guide
tree is generated.
39Multiple alignment
40Multiple alignment
Segment - line joining two vertices Each unit
m-dimensional cube in the lattice contains 2m
-1 segments
41Multiple alignment
Alignment Path for 3 Sequences
(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1),
(4,3,2)
42Multiple alignment
V S N - S - S N A - - - - A S
Pairwise Projections of the Alignment
43Alignment statistics
Rablpb Humcetp
Rabcetp Bovbpi Humlbpa
Ratlbp Maccetp Humbpi 1
2 3 4 5 6 7 8
478 67 65 19 19
18 42 43 1 0 82 80
39 39 36 64 65 0
1 0 5 5 12 2 2
327 483 58 16 16 16
39 41 2 400 0 75 38
38 35 62 63 5 0
0 5 5 12 1 1
318 284 482 18 18 17 40
43 3 390 367 0 38 38
35 64 64 4 1 0
5 5 12 1 1 96
84 95 494 95 74 20 21 4
198 192 194 0 98 84
40 41 30 29 28 0
0 7 6 5
44Alignment score
Rablpb Humcetp
Rabcetp Bovbpi Humlbpa
Ratlbp Maccetp Humbpi
1 2 3 4 5
6 7 8 1 4077 2 5358
4129 3 5323 5650 4096 4 8103
8229 8112 4210 5 8109 8243
8118 4332 4219 6 8535 8672
8575 5511 5519 4261 7 6474
6531 6500 8103 8119 8572
4103 8 6392 6434 6378 8033
8035 8520 5508 4083 1
2 3 4 5 6 7
8
45Alignment visualization
Identity
Summary view
46Alignment visualization
Physico-chemical properties
Differences mode
47Alignment visualization (tree)
48Sequence Logos a quantitative graphical display
for binding sites and proteins
Reference Schneider, T.D. Meth. Enzym 274445,
1996
49Sequence Logos
50Sequence Logos
51Multiple Alignment Programs
- Pileup (GCG) Needleman and Wunsch algorithm for
pairwise alignment and UPGMA method for tree
construction - CLUSTAL Wilbur and Lipman algorithm for pairwise
alignment (CABIOS 8189, 1992) - PIMA pattern-matching based algorithm (PNAS
87118, 1990) - TreeAlign phylogenetic algorithm (Meth. Enzymol.
18626, 1990)
52Patterns in protein sequences
53Regular Expressions
Patterns described in a standard way are known as
regular expressions
54Regular Expressions
AC-x-V-x(4)-ED. Ala or Cys-any-Val-any-any-
any-any-any but Glu or Asp ...LKHVAYVFQALIYWI
K... ...AVEMAGVKYLQVQHGS... ...LYTGAIVTNNDGPYMA..
. ...KEYKCKVEKELTDICN...
55PROSITE Database
Current version contains 1079 documentation
entries that describe 1459 different patterns,
rules and profiles/matrices ST-x(2)-DE
Casein kinase II phosphorylation site
AG-x(4)-G-K-ST ATP/GTP-binding site
motif A (P-loop) Y-x-NQH-K-DE-IVA-F-LM-R
-ED Heat shock hsp90 proteins family
signature http//www.expasy.ch/prosite
56Blocks Database
Blocks are multiply aligned ungapped segments
corresponding to the most highly conserved
regions of proteins
N-6 Adenine-specific DNA methylases
proteins width9 seqs78
DMA_VIBCHQ08318 (85) SCTQWWPPF 77
HEMK_MYCLEP45832 (181) DLFVAQPTL 100
MT57_ECOLIP25240 (111) DGALGNPPF 13
MTC1_CHVN1Q01511 (172) NFVFLDPPY 8
MTC1_COREQP42828 (71) QLSFSCPPF 49
MTH2_HAEHAP00473 (32) KIAFFDPQY 52
MTH3_HAEINP43871 (23) HAIISDIPY 73
MTM1_MICAMP50190 (306) AAVLTNPPF 14
MTM2_MORBOP23192 (25) QLAVIDPPY 10
MTMU_MYCSPP43641 (37) QVIYADPPW 13
MTR1_RHOSHP14751 (60) QLIICDPPY
8 ....................................
http//www.blocks.fhcrc.org/
57Pfam Database
Pfam is a large collection of multiple sequence
alignments and hidden Markov models covering
many common protein domains
Zinc finger, C2H2 type
TYY1_HUMAN/383-407 YVCPF.DGCN...KKFAQSTNLKSHILT..
.H ZG52_XENLA/61-83 YTCT...QCN...KQFSHSAQLRAHI
ST...H KRUP_DROME/306-328 YTCE...ICD...GKFSDSNQL
KSHMLV...H YKQ8_CAEEL/78-102
YKCT...VCR...KDISSSESLRTHMFKQ.HH
DEFI_CHICK/268-292 YECP...NCK...KRFSHSGSYSSHISSK
.KC ZFH1_DROME/389-413 FGCD...NCG...KRFSHSGSFSSH
MTSK.KC YL57_CAEEL/42-65 YLCY...YCG...KTLSDRLE
YQQHMLK..VH ZFA_MOUSE/542-564
FKCD...ICL...LTFSDTKEVQQHALV...H
BASO_HUMAN/719-742 FQCD...ICK...KTFKNACSVKIHHKN.
.MH HUNB_DROME/297-319 FQCD...KCS...YTCVNKSMLNSH
RKS...H SFP1_YEAST/598-623 FKCPV.IGCE...KTYKNQNG
LKYHRLH..GH ZG29_XENLA/62-84
FVCT...VCG...KTYKYKHGLNTHLHS...H
http//pfam.wustl.edu/
58Other Motif Databases
PRINTS a compendium of protein fingerprints. A
fingerprint is a group of conserved motifs used
to characterise a protein family http//bioinf.ma
n.ac.uk/dbbrowser/PRINTS/ DOMO a protein
domain database http//www.infobiogen.fr/gracy/do
mo/home.htm ProDom a protein domain database
http//protein.toulouse.inra.fr/prodom.html
59InterPro Database
InterPro integrated resource for the commonly
used signature databases - Pfam, PRINTS,
PROSITE, ProDom and SWISS-PROT
TrEMBL. Current release of InterPro (3.2)
contains 3939 entries, representing 1009
domains, 2850 families, 65 repeats and 15
post-translational modification
sites. http//www.ebi.ac.uk/interpro
60InterPro Database
61From genes to proteins
DNA
PROMOTER ELEMENTS
TRANSCRIPTION
RNA
SPLICE SITES
SPLICING
mRNA
START CODON
STOP CODON
TRANSLATION
PROTEIN
62From genes to proteins
63(No Transcript)
64Chromosome 19 gene map
65Computational Gene Prediction
- Where the genes are unlikely to be located?
- How do transcription factors know where to bind a
region of DNA? - Where are the transcription, splicing, and
translation start and stop signals? - What does coding region do (and non-coding
regions do not) ? - Can we learn from examples?
- Does this sequence look familiar?
66Measures of Prediction Accuracy
Nucleotide Level
67Measures of Prediction Accuracy
Exon Level
WRONGEXON
CORRECTEXON
MISSING EXON
REALITY
PREDICTION
68Spliced Alignment (Procrustes)
- New genomic sequence
- Selection of candidate exons AUG --- GU initial
exons AG --- GU internal exons AG --- UAA or
UAG or UGA terminal exons - Filtration (based on the codon usge statistics)
- Construction of all possible chains of candidate
exons - Finding a chain with the maximum global
similarity to the target protein
69Spliced Alignment (Procrustes)
70Predicted Exon Assembly(Procrustes)
71PCR Primers Prediction (GenePrimer)
Exon 1085..1182 (98) hit using first 2 primers
Exon 1628..1676 (49) missed Exon 1900..2001
(102) hit using first 8 primers Exon 2110..2184
(75) missed Exon 2516..2722 (207) hit using
first 4 primers Exon 3385..3472 (88) missed
Exon 3546..3746 (201) hit using first primer ...
72GRAIL gene identification program
73Suboptimal Solutions for the Human Growth Hormone
Gene (GeneParser)
74GeneMark Accuracy Evaluation
75Bibliography http//linkage.rockefeller.edu/wli/ge
ne/list.html and http//www-hto.usc.edu/software/p
rocrustes/fans_ref/
Gene Discovery Exercise http//metalab.unc.edu/pha
rmacy/Bioinfo/Gene