Iosif Vaisman

Slides: 76

Provided by: MML76

Learn more at: https://mason.gmu.edu

Transcript and Presenter's Notes

Title: Iosif Vaisman

1
Introduction to Bioinformatics

Iosif Vaisman

Email ivaisman_at_gmu.edu
2
NIH working definition of bioinformatics and
computational biology (July 2000)
The NIH Biomedical Information Science and
Technology Initiative Consortium agreed on the
following definitions of bioinformatics and
computational biology, recognizing that no
definition could completely eliminate overlap
with other activities or preclude variations in
interpretation by different individuals and
organizations.
Bioinformatics: Research, development, or
application of computational tools and approaches
for expanding the use of biological, medical,
behavioral or health data, including those to
acquire, store, organize, archive, analyze, or
visualize such data.
Computational Biology: The development and
application of data-analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social systems.
3
Bioinformatics bibliography (papers with the word
bioinformatics in title or abstract)
4
Dynamics of Database Growth
5
Comparative Sequence Sizes

Sequence                                Size (bases)
Yeast chromosome 3                           350,000
Escherichia coli (bacterium) genome        4,600,000
Largest yeast chromosome now mapped        5,800,000
Entire yeast genome                       15,000,000
Smallest human chromosome (Y)             50,000,000
Largest human chromosome (1)             250,000,000
Entire human genome                    3,000,000,000

6
The String Alignment Problem
string - a sequence of characters from some
alphabet
given two strings: acbcdb and cadbd
one of the possible alignments:
a c - - b c d b
- c a d b - d -
score = 3 · (2) + 5 · (-1) = 1
scoring function: exact match = 2, mismatch = -1,
insertion = -1
7
The String Alignment Problem
given two strings: CTCATG and TACTTG
C T C A T G
T A C T T G
score = 3 · (2) + 3 · (-1) = 3
C T C A - T - G
- T - A C T T G
score = 4 · (2) + 4 · (-1) = 4
8
Entropy and Redundancy of Language
CUR F W D DIS AND P A
SED IEND ROUGHT EATH EASE AIN
BLES FR B BR AND AG
9
Entropy and Redundancy of Language
CUR FW D DISAND
P
BLESFRBBRAND
AG
The sequences are 65% identical
A CURSED FIEND WROUGHT DEATH DISEASE AND
PAIN
A BLESSED FRIEND BROUGHT BREATH AND EASE
AGAIN
10
Substitution Matrices

Dayhoff (or MDM, or PAM) - Derived from global
alignments of closely related sequences
PAM100 - number refers to evolutionary distance
(Percentage of Acceptable point Mutations per 10^8
years)

300 million years
200 million years
100 million years
11
Substitution Matrices

BLOSUM (BLOcks SUbstitution Matrix) - Derived
from local, ungapped alignments of distantly
related sequences
BLOSUM62 - number refers to the minimum percent
identity

Reference: Henikoff and Henikoff, Proteins 17:49,
1993
12
Selecting a Matrix

Compared sequences are related: 200 PAM or 250 PAM
Database scanning: 120 PAM
Local alignment search: 40 PAM, 120 PAM, 250 PAM
Detection of related sequences using BLAST: BLOSUM 62

Low PAM: short segments, high similarity
High PAM: long segments, low similarity
THERE IS NO ONE-SIZE-FITS-ALL MATRIX!
13
Matrix Example
      A     B     C     D     E     F     G     H     I     K
A   1.5   0.2   0.3   0.3   0.3  -0.5   0.7  -0.1   0.0   0.0
B         1.1  -0.4   1.1   0.7  -0.7   0.6   0.4  -0.2   0.4
C               1.5  -0.5  -0.6  -0.1   0.2  -0.1   0.2  -0.6
D                     1.5   1.0  -1.0   0.7   0.4  -0.2   0.3
E                           1.5  -0.7   0.5   0.4  -0.2   0.3
F                                 1.5  -0.6  -0.1   0.7  -0.7
G                                       1.5  -0.2  -0.3  -0.1
H                                             1.5  -0.3   0.1
I                                                   1.5  -0.2
K                                                         1.5
14
Dayhoff's Acceptable Point Mutations
Ala A
Arg R  30
Asn N 109  17
Asp D 154   0 532
Cys C  33  10   0   0
Gln Q  93 120  50  76   0
Glu E 266   0  94 831   0 422
Gly G 579  10 156 162  10  30 112
His H  21 103 226  43  10 243  23  10
Ile I  66  30  36  13  17   8  35   0   3
Leu L  95  17  37   0   0  75  15  17  40 253
Lys K  57 477 322  85   0 147 104  60  23  43  39
Met M  29  17   0   0   0  20   7   7   0  57 207  90
Phe F  20   7   7   0   0   0   0  17  20  90 167   0  17
Pro P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
Ser S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
Thr T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
Trp W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Tyr Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
Val V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr
15
Search and alignment entropy

Information content per position:
pam10 - 3.43 bits
pam120 - 0.98 bits
pam160 - 0.70 bits
pam250 - 0.38 bits
blosum62 - 0.70 bits
Information requirements:
for search - 30 bits
for alignment - 16 bits
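
One way to read these figures is as a rough minimum alignment length: the required bits divided by the average bits per position. The sketch below works under that assumption (information adds up evenly along the alignment) and uses only the values listed on the slide. The function name is illustrative.

```python
import math

# Average information per aligned position, in bits (values from the slide)
BITS_PER_POSITION = {
    "pam10": 3.43,
    "pam120": 0.98,
    "pam160": 0.70,
    "pam250": 0.38,
    "blosum62": 0.70,
}


def min_aligned_length(matrix, required_bits):
    """Smallest number of aligned positions that can supply required_bits."""
    return math.ceil(required_bits / BITS_PER_POSITION[matrix])


for name in BITS_PER_POSITION:
    print(name, min_aligned_length(name, 30), min_aligned_length(name, 16))
```

For example, a search that needs 30 bits would take about 43 aligned positions with blosum62, but only about 9 with pam10, provided the sequences really are that similar.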

16
Search and alignment entropy
Recommended matrices for different query length

Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (11,1)

17
FASTA Algorithm
1
First run (identities)
18
FASTA Algorithm
The score of the highest scoring initial region
is saved as the init1 score.
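
The first pass can be pictured as a lookup of identical short words (k-tuples) grouped by diagonal. The following is only a minimal sketch of that idea, assuming a word size `ktup` of 2. It counts identities per diagonal and does not reproduce the rescoring with a substitution matrix from which the FASTA program derives init1.

```python
from collections import defaultdict


def ktup_diagonal_hits(query, subject, ktup=2):
    """Count identical k-tuples shared by query and subject, per diagonal."""
    # Index every k-tuple of the query by its start position
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # A hit at query position i and subject position j lies on diagonal i - j
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return dict(diagonals)


print(ktup_diagonal_hits("ABCDE", "XXABCDE"))
```

Diagonals that collect many hits mark candidate initial regions.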
19
FASTA Algorithm
3
Joining threshold - eliminates disjointed segments
Non-overlapping regions are joined. The score
equals sum of the scores of the regions minus a
gap penalty. The score of the highest scoring
region, at the end of this step, is saved as the
initn score.
20
FASTA Algorithm
4
Alignment optimization using dynamic programming
The score for this alignment is the opt score.
21
FASTA Algorithm
FastA uses a simple linear regression against the
natural log of the search set sequence length to
calculate a normalized z-score for the sequence
pair. Using the distribution of the z-score, the
program can estimate the number of sequences that
would be expected to produce, purely by chance, a
z-score greater than or equal to the z-score
obtained in the search. This is reported as the
E() score.
22
FASTA Results

When init1 = initn = opt:
100% homology over the matched stretch.
When initn > init1:
more than one matching region in the database
sequence, with poorly matching separating regions.
When opt > initn:
the matching regions are greatly
improved by adding gaps in one or both of the
sequences.

23
BLAST - Basic Local Alignment Search Tool

Blast programs use a heuristic search algorithm.
The programs use the statistical methods of
Karlin and Altschul (1990,1993).
Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity
to distantly related sequences.

24
BLAST Algorithm
1
Query sequence of length L
Maximum of L-w+1 words (typically w = 3 for
proteins)
For each word from the query sequence find the
list of words with high score using a
substitution matrix (PAM or BLOSUM)
Word list
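
A minimal sketch of the word-list step follows. The scoring function `toy_score` (5 for identity, -1 otherwise) and the threshold values are placeholders chosen for illustration. A real search would look scores up in a PAM or BLOSUM matrix.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def query_words(query, w=3):
    """All overlapping words of length w: at most L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix lookup
    return 5 if a == b else -1


def neighborhood(word, threshold, score=toy_score):
    """Words of the same length that score at least threshold against word."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


words = query_words("MKVLAT")
print(words)
print(len(neighborhood("MKV", threshold=9)))
```

With w = 3 the enumeration covers only 8,000 candidate words per query word, so brute force is adequate for a sketch.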
25
BLAST Algorithm
2
Database sequences
Word list
Exact matches of words from the word list to the
database sequences
26
BLAST Algorithm
3
Maximal Segment Pairs (MSPs)
For each exact word match, alignment is extended
in both directions to find high score segments
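
The extension step can be sketched as follows. The stopping rule is an assumption not stated on the slide: extension ends once the running score falls more than `x_drop` below the best score seen. The simple match/mismatch scoring is likewise a placeholder for a substitution matrix.

```python
def simple_score(a, b):
    # Placeholder scoring; a real search would use PAM or BLOSUM values
    return 2 if a == b else -1


def extend_hit(query, subject, q_start, s_start, w, score=simple_score, x_drop=5):
    """Extend a word hit without gaps; return (score, q_begin, q_end)."""

    def extend(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            q_pos += step
            s_pos += step
        return best, best_len

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, right_len = extend(q_start + w, s_start + w, 1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return seed + left + right, q_start - left_len, q_start + w + right_len


total, begin, end = extend_hit("GGMKVLATCC", "TTMKVLATGG", 2, 2, 3)
print(total, "GGMKVLATCC"[begin:end])
```

Here the seed word MKV grows to the segment MKVLAT, and the mismatching flanks are left out because they only lower the score.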
27
Gapped BLAST

The Gapped Blast algorithm allows gaps to be
introduced into the alignments. That means that
similar regions are not broken into several
segments.
This method reflects biological relationships
much better.

28
BLAST family of programs

blastp - amino acid query sequence against a
protein sequence database
blastn - nucleotide query sequence against a
nucleotide sequence database
blastx - nucleotide query sequence translated
in all reading frames against a protein
database
tblastn - protein query sequence against a
nucleotide sequence database dynamically
translated in all reading frames
tblastx - six-frame translations of a
nucleotide query sequence against the
six-frame translations of a nucleotide sequence
database.

29
Database Searches

Run Blast first, then depending on your results
run a finer tool (Fasta, Smith-Waterman, etc.)
Where possible use translated sequence.
E() < 0.05 is statistically significant, usually
biologically interesting. Check also 0.05 < E()
< 10 because you might find interesting hits.
Pay attention to abnormal composition of the
query sequence; it usually causes biased scoring.
Split large query sequences (if > 1000 for DNA,
> 200 for protein).
If the query has repeated segments, remove them
and repeat the search.

30
Documenting the Search

Algorithm(s)
Substitution matrix
Gap penalty (FASTA)
Name of database
Version of database
Computer used

31
MULTIPLE SEQUENCE ALIGNMENT
32
Computational complexity
Alignment of protein sequences with 200 amino
acid residues
33
Multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Column cost: the sum of costs for all possible
pairs
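
A sum-of-pairs calculation for the eight rows above might look like this. The pair values are assumptions borrowed from the string alignment slides (match 2, mismatch -1, gap -1), with a gap paired against a gap counted as 0. The sketch therefore computes a score, where higher is better, rather than a cost.

```python
from itertools import combinations

ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWESNG--",
]


def pair_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(column):
    """Sum over all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(rows):
    """Total alignment score: the sum of the column scores."""
    return sum(column_score(col) for col in zip(*rows))


columns = list(zip(*ROWS))
print(column_score(columns[4]))  # the fully conserved cysteine column
print(sum_of_pairs(ROWS))
```

Each column of k rows contributes k(k-1)/2 pairs, so the conserved cysteine column with 8 rows contributes 28 matching pairs.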
34
Multiple alignment
A correct multiple alignment corresponds to an
evolutionary history: there is no correct way to
determine it.
Practical way: find an alignment with the maximum
score.
35
Multiple sequence alignment
Given k (k > 2) sequences, s1, ..., sk, each
sequence consisting of characters from an
alphabet A, a multiple alignment is a rectangular
array, consisting of characters from the
alphabet A extended with the gap character "-",
that satisfies the following 3 conditions:
1. There are exactly k rows.
2. Ignoring the gap character, row number i is
exactly the sequence si.
3. Each column contains at least one character
different from "-".
36
Consensus
Plurality - minimum number of votes for a
consensus
Threshold - scoring matrix value below which a
symbol may not vote for a coalition
Sensitivity - minimum score to select consensus
Profiles - blocks of prealigned sequences
37
Multiple alignment algorithm
1. Pairwise alignments (progressive pairwise
alignments)
2. Distance matrix calculation
3. Guide tree creation (hierarchical clustering)
4. New sequence addition
38
Scoring system (distances)
Sreal(ij) - observed similarity score for two
aligned sequences i and j
Siden(ij) - average of the two scores for each
sequence aligned with itself
Srand(ij) - average score determined from 100
global randomizations of the two sequences
The distances D(ij) are used to generate the
distance matrix from which the approximate guide
tree is generated.
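
The formula that combines these three quantities is not in the transcript. One commonly cited form that uses exactly these terms is the Feng-Doolittle normalization, D = -ln(Seff) × 100 with Seff = (Sreal - Srand) / (Siden - Srand). The sketch below assumes that form. Whether the slide used the same expression is not confirmed by the source.

```python
import math


def effective_similarity(s_real, s_iden, s_rand):
    """Rescale the observed score between the random and identity baselines."""
    return (s_real - s_rand) / (s_iden - s_rand)


def distance(s_real, s_iden, s_rand):
    """Assumed Feng-Doolittle style distance: -ln(S_eff) * 100."""
    s_eff = effective_similarity(s_real, s_iden, s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is not above the random baseline")
    return -math.log(s_eff) * 100


print(distance(s_real=60, s_iden=100, s_rand=20))
```

Identical sequences give Seff = 1 and a distance of 0. Scores close to the random baseline give large distances.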
39
Multiple alignment
40
Multiple alignment
Segment - line joining two vertices
Each unit m-dimensional cube in the lattice
contains 2^m - 1 segments
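
The count 2^m - 1 can be illustrated by listing the possible steps out of one lattice vertex: every sequence either advances by one character or takes a gap, and the all-gap step is excluded. The sketch also counts lattice vertices, assuming one vertex for every combination of prefix lengths. The function names are illustrative.

```python
from itertools import product


def lattice_steps(m):
    """All nonzero 0/1 step vectors for m sequences (2**m - 1 of them)."""
    return [step for step in product((0, 1), repeat=m) if any(step)]


def lattice_vertices(lengths):
    """Number of vertices in the alignment lattice for the given lengths."""
    count = 1
    for length in lengths:
        count *= length + 1
    return count


print(len(lattice_steps(3)))              # 7 steps for three sequences
print(lattice_vertices([200, 200, 200]))  # three sequences of 200 residues
```

Both numbers grow exponentially with the number of sequences, which is why exact dynamic programming is practical only for a handful of sequences.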
41
Multiple alignment
Alignment Path for 3 Sequences
(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1),
(4,3,2)
42
Multiple alignment
V S N - S
- S N A -
- - - A S
Pairwise Projections of the Alignment
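
A lattice path maps directly to alignment rows: whenever a coordinate advances, the corresponding sequence contributes its next character, otherwise a gap. The sketch below applies this to the path from slide 41, assuming the three sequences are VSNS, SNA and AS as read from the alignment on slide 42. The function names are illustrative.

```python
def path_to_alignment(seqs, path):
    """Turn a lattice path (list of coordinate tuples) into alignment rows."""
    rows = ["" for _ in seqs]
    for prev, curr in zip(path, path[1:]):
        for k, seq in enumerate(seqs):
            step = curr[k] - prev[k]
            if step not in (0, 1):
                raise ValueError("each coordinate may advance by 0 or 1 only")
            rows[k] += seq[prev[k]] if step == 1 else "-"
    return rows


def project(rows, i, j):
    """Pairwise projection: keep rows i and j, drop columns with two gaps."""
    pairs = [(a, b) for a, b in zip(rows[i], rows[j]) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in pairs), "".join(b for _, b in pairs)


PATH = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 0), (3, 3, 1), (4, 3, 2)]
rows = path_to_alignment(["VSNS", "SNA", "AS"], PATH)
print(rows)
print(project(rows, 1, 2))
```

The projection of the second and third rows loses the first column, because both rows have a gap there.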
43
Alignment statistics
Rablpb Humcetp
Rabcetp Bovbpi Humlbpa
Ratlbp Maccetp Humbpi 1
2 3 4 5 6 7 8
478 67 65 19 19
18 42 43 1 0 82 80
39 39 36 64 65 0
1 0 5 5 12 2 2
327 483 58 16 16 16
39 41 2 400 0 75 38
38 35 62 63 5 0
0 5 5 12 1 1
318 284 482 18 18 17 40
43 3 390 367 0 38 38
35 64 64 4 1 0
5 5 12 1 1 96
84 95 494 95 74 20 21 4
198 192 194 0 98 84
40 41 30 29 28 0
0 7 6 5
44
Alignment score
Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi
       1     2     3     4     5     6     7     8
1   4077
2   5358  4129
3   5323  5650  4096
4   8103  8229  8112  4210
5   8109  8243  8118  4332  4219
6   8535  8672  8575  5511  5519  4261
7   6474  6531  6500  8103  8119  8572  4103
8   6392  6434  6378  8033  8035  8520  5508  4083
45
Alignment visualization
Identity
Summary view
46
Alignment visualization
Physico-chemical properties
Differences mode
47
Alignment visualization (tree)
48
Sequence Logos: a quantitative graphical display
for binding sites and proteins
Reference: Schneider, T.D. Meth. Enzymol. 274:445,
1996
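
The height of a logo column reflects its information content, which can be sketched as the maximum entropy minus the observed entropy of the column. This simplified version omits the small-sample correction used in the published method, and ignoring gap characters is an assumption made here for convenience.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one alignment column, in bits."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


print(column_information("AAAA", alphabet_size=4))  # fully conserved DNA column
print(column_information("ACGT", alphabet_size=4))  # no conservation
print(column_information("CCCCCCCC"))               # conserved protein column
```

A fully conserved DNA column carries 2 bits, and a fully conserved protein column carries log2(20), about 4.32 bits.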
49
Sequence Logos
50
Sequence Logos
51
Multiple Alignment Programs

Pileup (GCG): Needleman and Wunsch algorithm for
pairwise alignment and UPGMA method for tree
construction
CLUSTAL: Wilbur and Lipman algorithm for pairwise
alignment (CABIOS 8:189, 1992)
PIMA: pattern-matching based algorithm (PNAS
87:118, 1990)
TreeAlign: phylogenetic algorithm (Meth. Enzymol.
18626, 1990)

52
Patterns in protein sequences
53
Regular Expressions
Patterns described in a standard way are known as
regular expressions
54
Regular Expressions
[AC]-x-V-x(4)-{ED}.
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
...LKHVAYVFQALIYWIK...
...AVEMAGVKYLQVQHGS...
...LYTGAIVTNNDGPYMA...
...KEYKCKVEKELTDICN...
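
A pattern of this kind translates almost mechanically into a regular expression. The sketch below assumes the pattern is written in PROSITE syntax as [AC]-x-V-x(4)-{ED}, with square brackets for allowed residues and braces for excluded ones. It handles only the elements used on this slide, and the function name is illustrative.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (4) or (2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        # [..] sets and single letters are already valid regex fragments
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
for seq in ("LKHVAYVFQALIYWIK", "AVEMAGVKYLQVQHGS",
            "LYTGAIVTNNDGPYMA", "KEYKCKVEKELTDICN"):
    print(seq, re.search(regex, seq).group())
```

All four example fragments from the slide contain a match that starts at their fifth residue.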
55
PROSITE Database
Current version contains 1079 documentation
entries that describe 1459 different patterns,
rules and profiles/matrices
[ST]-x(2)-[DE]   Casein kinase II phosphorylation site
[AG]-x(4)-G-K-[ST]   ATP/GTP-binding site motif A (P-loop)
Y-x-[NQH]-K-[DE]-[IVA]-F-[LM]-R-[ED]   Heat shock hsp90 proteins family signature
http://www.expasy.ch/prosite
56
Blocks Database
Blocks are multiply aligned ungapped segments
corresponding to the most highly conserved
regions of proteins
N-6 Adenine-specific DNA methylases
proteins; width=9; seqs=78
DMA_VIBCH|Q08318  ( 85) SCTQWWPPF   77
HEMK_MYCLE|P45832 (181) DLFVAQPTL  100
MT57_ECOLI|P25240 (111) DGALGNPPF   13
MTC1_CHVN1|Q01511 (172) NFVFLDPPY    8
MTC1_COREQ|P42828 ( 71) QLSFSCPPF   49
MTH2_HAEHA|P00473 ( 32) KIAFFDPQY   52
MTH3_HAEIN|P43871 ( 23) HAIISDIPY   73
MTM1_MICAM|P50190 (306) AAVLTNPPF   14
MTM2_MORBO|P23192 ( 25) QLAVIDPPY   10
MTMU_MYCSP|P43641 ( 37) QVIYADPPW   13
MTR1_RHOSH|P14751 ( 60) QLIICDPPY    8
....................................
http://www.blocks.fhcrc.org/
57
Pfam Database
Pfam is a large collection of multiple sequence
alignments and hidden Markov models covering
many common protein domains
Zinc finger, C2H2 type
TYY1_HUMAN/383-407   YVCPF.DGCN...KKFAQSTNLKSHILT...H
ZG52_XENLA/61-83     YTCT...QCN...KQFSHSAQLRAHIST...H
KRUP_DROME/306-328   YTCE...ICD...GKFSDSNQLKSHMLV...H
YKQ8_CAEEL/78-102    YKCT...VCR...KDISSSESLRTHMFKQ.HH
DEFI_CHICK/268-292   YECP...NCK...KRFSHSGSYSSHISSK.KC
ZFH1_DROME/389-413   FGCD...NCG...KRFSHSGSFSSHMTSK.KC
YL57_CAEEL/42-65     YLCY...YCG...KTLSDRLEYQQHMLK..VH
ZFA_MOUSE/542-564    FKCD...ICL...LTFSDTKEVQQHALV...H
BASO_HUMAN/719-742   FQCD...ICK...KTFKNACSVKIHHKN..MH
HUNB_DROME/297-319   FQCD...KCS...YTCVNKSMLNSHRKS...H
SFP1_YEAST/598-623   FKCPV.IGCE...KTYKNQNGLKYHRLH..GH
ZG29_XENLA/62-84     FVCT...VCG...KTYKYKHGLNTHLHS...H
http://pfam.wustl.edu/
58
Other Motif Databases
PRINTS: a compendium of protein fingerprints. A
fingerprint is a group of conserved motifs used
to characterise a protein family
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
DOMO: a protein domain database
http://www.infobiogen.fr/gracy/domo/home.htm
ProDom: a protein domain database
http://protein.toulouse.inra.fr/prodom.html
59
InterPro Database
InterPro: integrated resource for the commonly
used signature databases - Pfam, PRINTS,
PROSITE, ProDom and SWISS-PROT + TrEMBL.
Current release of InterPro (3.2) contains 3939
entries, representing 1009 domains, 2850
families, 65 repeats and 15 post-translational
modification sites.
http://www.ebi.ac.uk/interpro
60
InterPro Database
61
From genes to proteins
DNA
PROMOTER ELEMENTS
TRANSCRIPTION
RNA
SPLICE SITES
SPLICING
mRNA
START CODON
STOP CODON
TRANSLATION
PROTEIN
62
From genes to proteins
64
Chromosome 19 gene map
65
Computational Gene Prediction

Where are the genes unlikely to be located?
How do transcription factors know where to bind a
region of DNA?
Where are the transcription, splicing, and
translation start and stop signals?
What do coding regions do (and non-coding
regions do not)?
Can we learn from examples?
Does this sequence look familiar?

66
Measures of Prediction Accuracy
Nucleotide Level
67
Measures of Prediction Accuracy
Exon Level
WRONG EXON
CORRECT EXON
MISSING EXON
REALITY
PREDICTION
68
Spliced Alignment (Procrustes)

New genomic sequence
Selection of candidate exons:
AUG --- GU initial exons
AG --- GU internal exons
AG --- UAA or UAG or UGA terminal exons
Filtration (based on the codon usage statistics)
Construction of all possible chains of candidate
exons
Finding a chain with the maximum global
similarity to the target protein
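
The candidate selection step for internal exons can be sketched as a search for every span that is preceded by AG and followed by GU. This is only an illustration of the signal-based enumeration, working on an RNA string with half-open coordinates. It does not reproduce the filtration or chaining steps of Procrustes, and the function names and test sequence are made up for the example.

```python
def find_all(seq, motif):
    """Start positions of every (possibly overlapping) occurrence of motif."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)]


def candidate_internal_exons(rna, min_len=1):
    """Half-open (start, end) spans preceded by AG and followed by GU."""
    starts = [i + 2 for i in find_all(rna, "AG")]  # exon begins after AG
    ends = find_all(rna, "GU")                     # exon ends before GU
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


rna = "CCAGAAACCCGUAA"
for start, end in candidate_internal_exons(rna):
    print(start, end, rna[start:end])
```

Initial and terminal candidates could be enumerated the same way by swapping in AUG as the left signal or the three stop codons as the right signal. The number of candidates grows quickly with sequence length, which is why the filtration step matters.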

69
Spliced Alignment (Procrustes)
70
Predicted Exon Assembly(Procrustes)
71
PCR Primers Prediction (GenePrimer)
Exon 1085..1182 (98) hit using first 2 primers
Exon 1628..1676 (49) missed
Exon 1900..2001 (102) hit using first 8 primers
Exon 2110..2184 (75) missed
Exon 2516..2722 (207) hit using first 4 primers
Exon 3385..3472 (88) missed
Exon 3546..3746 (201) hit using first primer
...
72
GRAIL gene identification program
73
Suboptimal Solutions for the Human Growth Hormone
Gene (GeneParser)
74
GeneMark Accuracy Evaluation
75
Bibliography: http://linkage.rockefeller.edu/wli/gene/list.html
and http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise: http://metalab.unc.edu/pharmacy/Bioinfo/Gene
