Title: BIOINFORMATICS%20Sequences%202
1BIOINFORMATICSSequences 2
- Mark Gerstein, Yale Universitygersteinlab.org/cou
rses/452 - (last edit in fall '06, includes in-class changes)
2Sequence Topics (Contents)
- Basic Alignment via Dynamic Programming
- Suboptimal Alignment
- Gap Penalties
- Similarity (PAM) Matrices
- Multiple Alignment
- Profiles, Motifs, HMMs
- Local Alignment
- Probabilistic Scoring Schemes
- Rapid Similarity Search Fasta
- Rapid Similarity Search Blast
- Practical Suggestions on Sequence Searching
- Transmembrane helix predictions
- Secondary Structure Prediction Basic GOR
- Secondary Structure Prediction Other Methods
- Assessing Secondary Structure Prediction
- Features of Genomic DNA sequences
3Multiple Sequence Alignments
- - One of the most essential tools in molecular
biology - It is widely used in
- - Phylogenetic analysis
- - Prediction of protein secondary/tertiary
structure - - Finding diagnostic patterns to characterize
protein families - - Detecting new homologies between new genes
and   established sequence families
Core
- Practically useful methods only since 1987 -
Before 1987 they were constructed by hand - The
basic problem no dynamic programming approach
can be used - First useful approach by D.
Sankoff (1987) based on phylogenetics
(LEFT, adapted from Sonhammer et al. (1997).
Pfam, Proteins 28405-20. ABOVE, G Barton AMAS
web page)
4Progressive Multiple Alignments
- Most multiple alignments based on this approach
- Initial guess for a phylogenetic tree based on
pairwise alignments - Built progressively
starting with most closely related sequences -
Follows branching order in phylogenetic tree -
Sufficiently fast - Sensitive - Algorithmically
heuristic, no mathematical property associated
with the alignment - Biologically sound, it is
common to derive alignments which are impossible
to improve by eye
(adapted from Sonhammer et al. (1997). Pfam,
Proteins 28405-20)
5Clustering approaches for multiple sequence
alignment
6C1Q - Example
Ca28_Human ELSAHATPAFTAVLTSPLPASGMPVKFDRTLYNGHSGY
NPATGIFTCPVGGVYYFAYHVH VKGTNVWVALYKNNVPATYTYDEYKK
GYLDQASGGAVLQLRPNDQVWVQIPSDQANGLYS
TEYIHSSFSGFLLCPT C1qb_Human DYKATQKIAFSATRTINVP
LRRDQTIRFDHVITNMNNNYEPRSGKFTCKVPGLYYFTYHA
SSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVF
LQATDKNSLLG MEGANSIFSGFLLFPD Cerb_Human
VRSGSAKVAFSAIRSTNHEPSEMSNRTMIIYFDQVLVNIGNNFDSERST
FIAPRKGIYSF NFHVVKVYNRQTIQVSLMLNGWPVISAFAGDQDVTRE
AASNGVLIQMEKGDRAYLKLERG NLMGGWKYSTFSGFLVFPL
COLE_LEPMA.264 RGPKGPPGESVEQIRSAFSVGLFPSRSFPPPSL
PVKFDKVFYNGEGHWDPTLNKFNVTYP GVYLFSYHITVRNRPVRAALV
VNGVRKLRTRDSLYGQDIDQASNLALLHLTDGDQVWLET
LRDWNGXYSSSEDDSTFSGFLLYPDTKKPTAM HP27_TAMAS.72
GPPGPPGMTVNCHSKGTSAFAVKANELPPAPSQPVIFKEALHDAQGHFD
LATGVFTCPVP GLYQFGFHIEAVQRAVKVSLMRNGTQVMEREAEAQDG
YEHISGTAILQLGMEDRVWLENK LSQTDLERGTVQAVFSGFLIHEN
HSUPST2_1.95 GIQGRKGEPGEGAYVYRSAFSVGLETYVTIPNMPI
RFTKIFYNQQNHYDGSTGKFHCNIP GLYYFAYHITVYMKDVKVSLFKK
DKAMLFTYDQYQENNVDQASGSVLLHLEVGDQVWLQV
YGEGERNGLYADNDNDSTFTGFLLYHDTN 2.HS27109_1
ENALAPDFSKGSYRYAPMVAFFASHTYGMTIPGPILFNNLDVNYGASYT
PRTGKFRIPYL GVYVFKYTIESFSAHISGFLVVDGIDKLAFESENINS
EIHCDRVLTGDALLELNYGQEVW LRLAKGTIPAKFPPVTTFSGYLLYR
T 4.YQCC_BACSU VVHGWTPWQKISGFAHANIGTTGVQYLKKIDHT
KIAFNRVIKDSHNAFDTKNNRFIAPND GMYLIGASIYTLNYTSYINFH
LKVYLNGKAYKTLHHVRGDFQEKDNGMNLGLNGNATVPM
NKGDYVEIWCYCNYGGDETLKRAVDDKNGVFNFFD
5.BSPBSXSE_25 ADSGWTAWQKISGFAHANIGTTGRQALIKGENNK
IKYNRIIKDSHKLFDTKNNRFVASHA GMHLVSASLYIENTERYSNFEL
YVYVNGTKYKLMNQFRMPTPSNNSDNEFNATVTGSVTV
PLDAGDYVEIYVYVGYSGDVTRYVTDSNGALNYFD
Extra
7Clustal Alignment
MMCOL10A1_1.483Â Â Â Â Â SGMPLVSANHGVTG-------MPVSAFTV
ILS--KAYPA---VGCPHPIYEILYNRQQHY
Ca1x_Chick          ----------ALTG-------MPVSAFT
VILS--KAYPG---ATVPIKFDKILYNRQQHY
S15435Â Â Â Â Â Â Â Â Â Â Â Â Â Â ----------GGPA-------YEMPAFT
AELT--APFPP---VGGPVKFNKLLYNGRQNY
CA18_MOUSE.597Â Â Â Â Â Â HAYAGKKGKHGGPA-------YEMPAFT
AELT--VPFPP---VGAPVKFDKLLYNGRQNY
Ca28_Human          ----------ELSA-------HATPAFT
AVLT--SPLPA---SGMPVKFDRTLYNGHSGY
MM37222_1.98Â Â Â Â Â Â Â Â ----GTPGRKGEPGE---AAYMYRSAFS
VGLETRVTVP-----NVPIRFTKIFYNQQNHY
COLE_LEPMA.264Â Â Â Â Â Â ------RGPKGPPGE---SVEQIRSAFS
VGLFPSRSFPP---PSLPVKFDKVFYNGEGHW
HP27_TAMAS.72Â Â Â Â Â Â Â -------GPPGPPGMTVNCHSKGTSAFA
VKAN--ELPPA---PSQPVIFKEALHDAQGHF
S19018Â Â Â Â Â Â Â Â Â Â Â Â Â Â ----------NIRD-------QPRPAFS
AIRQ---NPMT---LGNVVIFDKVLTNQESPY
C1qb_Mouse          --------------D---YRATQKVAFS
ALRTINSPLR----PNQVIRFEKVITNANENY
C1qb_Human          --------------D---YKATQKIAFS
ATRTINVPLR----RDQTIRFDHVITNMNNNY
Cerb_Human          --------------V---RSGSAKVAFS
AIRSTNHEPSEMSNRTMIIYFDQVLVNIGNNF
2.HS27109_1Â Â Â Â Â Â Â Â Â ---ENALAPDFSKGS---YRYAPMVAFF
ASHTYGMTIP------GPILFNNLDVNYGASY
                                             .
.                      MMCOL10A1_1.483    Â
DPRSGIFTCKIPGIYYFSYHVHVKGT--HVWVGLYKNGTP-TMYTY---D
EYSKGYLDTA Ca1x_Chick         Â
DPRTGIFTCRIPGLYYFSYHVHAKGT--NVWVALYKNGSP-VMYTY---D
EYQKGYLDQA S15435Â Â Â Â Â Â Â Â Â Â Â Â Â Â
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-VMYTY---D
EYKKGFLDQA CA18_MOUSE.597Â Â Â Â Â Â
NPQTGIFTCEVPGVYYFAYHVHCKGG--NVWVALFKNNEP-MMYTY---D
EYKKGFLDQA Ca28_Human         Â
NPATGIFTCPVGGVYYFAYHVHVKGT--NVWVALYKNNVP-ATYTY---D
EYKKGYLDQA MM37222_1.98Â Â Â Â Â Â Â Â
DGSTGKFYCNIPGLYYFSYHITVYMK--DVKVSLFKKDKA-VLFTY---D
QYQEKNVDQA COLE_LEPMA.264Â Â Â Â Â Â
DPTLNKFNVTYPGVYLFSYHITVRNR--PVRAALVVNGVR-KLRTR---D
SLYGQDIDQA HP27_TAMAS.72Â Â Â Â Â Â Â
DLATGVFTCPVPGLYQFGFHIEAVQR--AVKVSLMRNGTQ-VMERE---A
EAQDG-YEHI S19018Â Â Â Â Â Â Â Â Â Â Â Â Â Â
QNHTGRFICAVPGFYYFNFQVISKWD--LCLFIKSSSGGQ-PRDSLSFSN
TNNKGLFQVL C1qb_Mouse         Â
EPRNGKFTCKVPGLYYFTYHASSRGN---LCVNLVRGRDRDSMQKVVTFC
DYAQNTFQVT C1qb_Human         Â
EPRSGKFTCKVPGLYYFTYHASSRGN---LCVNLMRGRER--AQKVVTFC
DYAYNTFQVT Cerb_Human         Â
DSERSTFIAPRKGIYSFNFHVVKVYNRQTIQVSLMLNGWP----VISAFA
GDQDVTREAA 2.HS27109_1Â Â Â Â Â Â Â Â Â
TPRTGKFRIPYLGVYVFKYTIESFSA--HISGFLVVDGIDKLAFESEN-I
NSEIHCDRVL Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â . Â Â Â Â
MMCOL10A1_1.483Â Â Â Â Â SGSAIMELTENDQVWLQLPNA-ES
NGLYSSEYVHSSFSGFLVAPM------- Ca1x_Chick         Â
SGSAVIDLMENDQVWLQLPNS-ESNGLYSSEYVHSSFSGFLFAQI----
--- S15435Â Â Â Â Â Â Â Â Â Â Â Â Â Â SGSAVLLLRPGDRVFLQMPSE-QA
AGLYAGQYVHSSFSGYLLYPM------- CA18_MOUSE.597Â Â Â Â Â Â
SGSAVLLLRPGDQVFLQNPFE-QAAGLYAGQYVHSSFSGYLLYPM----
--- Ca28_Human          SGGAVLQLRPNDQVWVQIPSD-QA
NGLYSTEYIHSSFSGFLLCPT------- MM37222_1.98Â Â Â Â Â Â Â Â
SGSVLLHLEVGDQVWLQVYGDGDHNGLYADNVNDSTFTGFLLYHDTN--
--- COLE_LEPMA.264Â Â Â Â Â Â SNLALLHLTDGDQVWLETLR--DW
NGXYSSSEDDSTFSGFLLYPDTKKPTAM HP27_TAMAS.72Â Â Â Â Â Â Â
SGTAILQLGMEDRVWLENKL--SQTDLERG-TVQAVFSGFLIHEN----
--- S19018Â Â Â Â Â Â Â Â Â Â Â Â Â Â AGGTVLQLRRGDEVWIEKDP--AK
GRIYQGTEADSIFSGFLIFPS------- C1qb_Mouse         Â
TGGVVLKLEQEEVVHLQATD---KNSLLGIEGANSIFTGFLLFPD----
--- C1qb_Human          TGGMVLKLEQGENVFLQATD---K
NSLLGMEGANSIFSGFLLFPD------- Cerb_Human         Â
SNGVLIQMEKGDRAYLKLER---GN-LMGG-WKYSTFSGFLVFPL----
--- 2.HS27109_1Â Â Â Â Â Â Â Â Â TGDALLELNYGQEVWLRLAK----
GTIPAKFPPVTTFSGYLLYRT------- Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
 .   .                    . Â
Extra
8Problems with Progressive Alignments
- - Local Minimum Problem - Parameter Choice
Problem - 1. Local Minimum Problem
- - It stems from greedy nature of alignment
(mistakes made early in alignment cannot be
corrected later) - - A better tree gives a better alignment
(UPGMAÂ neighbour-joining tree method) - 2. Parameter Choice Problem
- - It stems from using just one set of parameters
(and hoping that they will do for all)
9Domain Problem in Mult. Alignment
10Profiles MotifsHMMs
- Fuse multiple alignment into
- - Motif a short signature pattern identified in
the conserved region of the multiple alignment - - Profile frequency of each amino acid at each
position is estimated - HMM Hidden Markov Model, a generalized profile
in rigorous mathematical terms
Core
Can get more sensitive searches with these
multiple alignment representations (Run the
profile against the DB.)
11Motifs
- several proteins are grouped together by
similarity searches - they share a conserved
motif - motif is stringent enough to retrieve
the family members from the complete protein
database - PROSITE a collection of motifs (1135
different motifs) Â
Core
Â
12Prosite Pattern -- EGF like pattern
A sequence of about thirty to forty amino-acidÂ
residues long found in the sequence ofÂ
epidermal growth factor (EGF) has beenÂ
shown 1 to 6 to be present, in a more or less
conserved form, in a large number of other,
mostly animal proteins. The proteins currently
known to contain one or more copies of an
EGF-like pattern are listed below. - Bone
morphogenic protein 1 (BMP-1), a protein which
induces cartilage and bone formation. -
Caenorhabditis elegans developmental proteins
lin-12 (13 copies) and glp-1 (10 copies). -
Calcium-dependent serine proteinase (CASP) which
degrades the extracellular matrix proteins type
. - Cell surface antigen 114/A10 (3 copies). -
Cell surface glycoprotein complex transmembrane
subunit . - Coagulation associated proteins C, Z
(2 copies) and S (4 copies). - Coagulation
factors VII, IX, X and XII (2 copies). -
Complement C1r/C1s components (1 copy).  -
Complement-activating component of Ra-reactive
factor (RARF) (1 copy). - Complement components
C6, C7, C8 alpha and beta chains, and C9 (1
copy). - Epidermal growth factor precursor (7-9
copies). Â Â Â Â Â Â Â Â Â Â Â Â Â Â -------------------Â Â Â Â Â
  -------------------------              Â
                                            Â
    x(4)-C-x(0,48)-C-x(3,12)-C-x(1,70)-C-x(1,6
)-C-x(2)-G-a-x(0,21)-G-x(2)-C-x    Â
                         Â
    -------------------
'C' conserved cysteine involved in a disulfide
bond.'G' often conserved glycine'a' often
conserved aromatic amino acid'' position of
both patterns.'x' any residue-Consensus
pattern C-x-C-x(5)-G-x(2)-CÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
The 3 C's are involved in disulfide
bonds http//www.expasy.ch/sprot/prosite.html
Extra
13Motifs
Each element in a pattern is separated from its
neighbor by a -. The symbol x is used for a
position where any amino acid is accepted.
Ambiguities are indicated by listing the
acceptable amino acids for a given position,
between brackets . Ambiguities are also
indicated by listing between a pair of braces
the amino acids that are not accepted at a
given position. Repetition of an element of the
pattern is indicated by with a numerical value or
a numerical range between parentheses following
that element.
Â
14Profiles
Profile a position-specific scoring matrix
composed of 21 columns and N rows (Nlength of
sequences in multiple alignment)
Core
5
What happens with gaps?
15EGF Profile Generated for SEARCHWISE
Extra
Cons A   C   D   E   F   G   H   I  Â
KÂ Â Â LÂ Â Â MÂ Â Â NÂ Â Â PÂ Â Â QÂ Â Â RÂ Â Â SÂ Â Â TÂ Â Â VÂ Â Â
W   Y Gap   V  -1  -2  -9  -5 -13 -18 Â
-2Â Â -5Â Â -2Â Â -7Â Â -4Â Â -3Â Â -5Â Â -1Â Â -3Â Â Â
0Â Â Â 0Â Â -1Â -24Â -10Â 100 Â DÂ Â Â 0Â -14Â Â -1Â Â
-1Â -16Â -10Â Â Â 0Â -12Â Â Â 0Â -13Â Â -8Â Â Â 1Â Â
-3Â Â Â 0Â Â -2Â Â Â 0Â Â Â 0Â Â -8Â -26Â Â -9Â 100 Â VÂ Â Â
0Â -13Â Â -9Â Â -7Â -15Â -10Â Â -6Â Â -5Â Â -5Â Â -7Â Â
-5Â Â -6Â Â -4Â Â -4Â Â -6Â Â -1Â Â Â 0Â Â -1Â -27Â -14Â
100 Â DÂ Â Â 0Â -20Â Â 18Â Â 11Â -34Â Â Â 0Â Â Â 4Â
-26Â Â Â 7Â -27Â -20Â Â 15Â Â Â 0Â Â Â 7Â Â Â 4Â Â Â 6Â Â Â 2Â
-19Â -38Â -21Â 100 Â PÂ Â Â 3Â -18Â Â Â 1Â Â Â 3Â -26Â Â
-9Â Â -5Â -14Â Â -1Â -14Â -12Â Â -1Â Â 12Â Â Â 1Â Â
-4Â Â Â 2Â Â Â 0Â Â -9Â -37Â -22Â 100 Â CÂ Â Â 5Â 115Â
-32Â -30Â Â -8Â -20Â -13Â -11Â -28Â -15Â Â -9Â -18Â
-31Â -24Â -22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â 100
 A   2  -7  -2  -2 -21  -5  -4 -12  -2Â
-13Â Â -9Â Â Â 0Â Â -1Â Â Â 0Â Â -3Â Â Â 2Â Â Â 1Â Â -7Â -30Â
-17 100  s   2 -12   3   2 -25   0   0Â
-18Â Â Â 0Â -18Â -13Â Â Â 4Â Â Â 3Â Â Â 1Â Â -1Â Â Â 7Â Â Â 4Â
-12 -30 -16  25  n  -1 -15   4   4 -19 Â
-7Â Â Â 3Â -16Â Â Â 2Â -16Â -10Â Â Â 7Â Â -6Â Â Â 3Â Â Â
0   2   0 -11 -23 -10  25  p   0 -18 Â
-7Â Â -6Â -17Â -11Â Â Â 0Â -17Â Â -5Â -15Â -14Â Â -5Â Â
28  -2  -5   0  -1 -13 -26  -9  25  c  Â
5Â 115Â -32Â -30Â Â -8Â -20Â -13Â -11Â -28Â -15Â Â
-9Â -18Â -31Â -24Â -22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â Â
25 Â LÂ Â -5Â -14Â -17Â Â -9Â Â Â 0Â -25Â Â -5Â Â Â 4Â Â
-5Â Â Â 8Â Â Â 8Â -12Â -14Â Â -1Â Â -5Â Â -7Â Â -5Â Â Â 2Â
-15Â Â -5Â 100 Â NÂ Â -4Â -16Â Â 12Â Â Â 5Â -20Â Â Â 0Â Â
24Â -24Â Â Â 5Â -25Â -18Â Â 25Â -10Â Â Â 6Â Â Â 2Â Â Â
4   1 -19 -26  -2 100  g   1 -16   7  Â
1Â -35Â Â 29Â Â Â 0Â -31Â Â -1Â -31Â -23Â Â 12Â -10Â Â Â
0Â Â -1Â Â Â 4Â Â -3Â -23Â -32Â -23Â Â 50 Â GÂ Â Â 6Â
-17Â Â Â 0Â Â -7Â -49Â Â 59Â -13Â -41Â -10Â -41Â
-32Â Â Â 3Â -14Â Â -9Â Â -9Â Â Â 5Â Â -9Â -29Â -39Â -38Â
100 Â TÂ Â Â 3Â -10Â Â Â 0Â Â Â 2Â -21Â -12Â Â -3Â Â
-5Â Â Â 1Â -11Â Â -5Â Â Â 1Â Â -4Â Â Â 1Â Â -1Â Â Â 6Â Â
11Â Â Â 0Â -33Â -18Â 100 Â CÂ Â Â 5Â 115Â -32Â -30Â Â
-8Â -20Â -13Â -11Â -28Â -15Â Â -9Â -18Â -31Â -24Â
-22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â 100 Â IÂ Â -6Â -13Â
-19Â -11Â Â Â 0Â -28Â Â -5Â Â Â 8Â Â -4Â Â Â 6Â Â Â 8Â -12Â
-17  -4  -5  -9  -4   6 -12  -1 100  d Â
-4Â -19Â Â Â 8Â Â Â 6Â -15Â -13Â Â Â 5Â -17Â Â Â 0Â -16Â
-12Â Â Â 5Â Â -9Â Â Â 2Â Â -2Â Â -1Â Â -1Â -13Â -24Â Â
-5  31  i   0  -6  -8  -6  -4 -11  -5  Â
3Â Â -5Â Â Â 1Â Â Â 2Â Â -5Â Â -8Â Â -4Â Â -6Â Â -2Â Â Â 0Â Â Â
4 -14  -6  31  g   1 -13   0   0 -20 Â
-3Â Â -3Â -12Â Â -3Â -13Â Â -8Â Â Â 0Â Â -7Â Â Â 0Â Â
-5Â Â Â 2Â Â Â 0Â Â -7Â -29Â -16Â Â 31 Â LÂ Â -5Â -11Â
-20Â -14Â Â Â 0Â -23Â Â -9Â Â Â 9Â -11Â Â Â 8Â Â Â 7Â -14Â
-17Â Â -9Â -14Â Â -8Â Â -4Â Â Â 7Â -17Â Â -5Â 100
 E   0 -20  14  10 -33   5   0 -25   2Â
-26Â -19Â Â 11Â Â -9Â Â Â 4Â Â Â 0Â Â Â 3Â Â Â 0Â -19Â -34Â
-22Â 100 Â SÂ Â Â 3Â -13Â Â Â 4Â Â Â 3Â -28Â Â Â 3Â Â Â 0Â
-18Â Â Â 2Â -20Â -13Â Â Â 6Â Â -6Â Â Â 3Â Â Â 1Â Â Â 6Â Â Â 3Â
-12Â -32Â -20Â 100 Â YÂ -14Â Â -9Â -25Â -22Â Â 31Â
-34Â Â 10Â Â -5Â -17Â Â Â 0Â Â -1Â -14Â -13Â -13Â -15Â
-14Â -13Â Â -7Â Â 17Â Â 44Â 100 Â TÂ Â Â 0Â -10Â Â -6Â Â
-1Â -11Â -16Â Â -2Â Â -7Â Â -1Â Â -9Â Â -5Â Â -3Â Â
-9Â Â Â 0Â Â -1Â Â Â 1Â Â Â 3Â Â -4Â -16Â Â -8Â 100 Â CÂ Â Â
5Â 115Â -32Â -30Â Â -8Â -20Â -13Â -11Â -28Â -15Â Â
-9Â -18Â -31Â -24Â -22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â
100 Â RÂ Â Â 0Â -13Â Â Â 0Â Â Â 2Â -19Â -11Â Â Â 1Â
-12Â Â Â 4Â -13Â Â -8Â Â Â 3Â Â -8Â Â Â 4Â Â Â 5Â Â Â 1Â Â Â
1Â Â -8Â -23Â -13Â 100 Â CÂ Â Â 5Â 115Â -32Â -30Â Â
-8Â -20Â -13Â -11Â -28Â -15Â Â -9Â -18Â -31Â -24Â
-22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â 100 Â PÂ Â Â 0Â -14Â Â
-8Â Â -4Â -15Â -17Â Â Â 0Â Â -7Â Â -1Â Â -7Â Â -5Â Â
-4Â Â Â 6Â Â Â 0Â Â -2Â Â Â 0Â Â Â 1Â Â -3Â -26Â -10Â 100
 P   1 -18  -3   0 -24 -13  -3 -12   1Â
-13Â -10Â Â -2Â Â 15Â Â Â 2Â Â Â 0Â Â Â 2Â Â Â 1Â Â -8Â -33Â
-19Â 100 Â GÂ Â Â 4Â -19Â Â Â 3Â Â -4Â -48Â Â 53Â -11Â
-40Â Â -7Â -40Â -31Â Â Â 5Â -13Â Â -7Â Â -7Â Â Â 4Â Â -7Â
-29 -39 -36 100  y -22  -6 -35 -31  55Â
-43Â Â 11Â Â -1Â -25Â Â Â 6Â Â Â 4Â -21Â -34Â -20Â -21Â
-22Â -20Â Â -7Â Â 43Â Â 63Â Â 50 Â SÂ Â Â 1Â Â -9Â Â -3Â Â
-1Â -14Â Â -7Â Â Â 0Â -10Â Â -2Â -12Â Â -7Â Â Â 0Â Â
-7Â Â Â 0Â Â -4Â Â Â 4Â Â Â 4Â Â -5Â -24Â Â -9Â 100 Â GÂ Â Â
5Â -20Â Â Â 1Â Â -8Â -52Â Â 66Â -14Â -45Â -11Â -44Â
-35Â Â Â 4Â -16Â -10Â -10Â Â Â 4Â -11Â -33Â -40Â -40Â
100 Â EÂ Â Â 2Â -20Â Â 10Â Â 12Â -31Â Â -7Â Â Â 0Â
-19Â Â Â 6Â -20Â -15Â Â Â 5Â Â Â 4Â Â Â 7Â Â Â 2Â Â Â 4Â Â Â 2Â
-13Â -38Â -22Â 100 Â RÂ Â -5Â -17Â Â Â 0Â Â Â 1Â -16Â
-13Â Â Â 8Â -16Â Â Â 9Â -16Â -11Â Â Â 5Â -11Â Â Â 7Â Â
15Â Â -1Â Â -1Â -13Â -18Â Â -6Â 100 Â CÂ Â Â 5Â 115Â
-32Â -30Â Â -8Â -20Â -13Â -11Â -28Â -15Â Â -9Â -18Â
-31Â -24Â -22Â Â Â 1Â Â -5Â Â Â 0Â -10Â Â -5Â 100
 E   0 -26  20  25 -34  -5   6 -25  10Â
-25Â -17Â Â Â 9Â Â -4Â Â 16Â Â Â 5Â Â Â 3Â Â Â 0Â -18Â -38Â
-23Â 100 Â TÂ Â -4Â -11Â -13Â Â -8Â Â -1Â -21Â Â Â
2Â Â Â 0Â Â -4Â Â -1Â Â Â 0Â Â -6Â -14Â Â -3Â Â -5Â Â -4Â Â Â
0Â Â Â 0Â -15Â Â Â 0Â 100 Â DÂ Â Â 0Â -18Â Â Â 5Â Â Â 4Â
-24Â -11Â Â -1Â -11Â Â Â 2Â -14Â Â -9Â Â Â 1Â Â -6Â Â Â
2Â Â Â 0Â Â Â 0Â Â Â 0Â Â -6Â -34Â -18Â 100 Â IÂ Â Â 0Â
-10Â Â -2Â Â -1Â -17Â -14Â Â -3Â Â -4Â Â -1Â Â -9Â Â
-4Â Â Â 0Â -11Â Â Â 0Â Â -4Â Â Â 0Â Â Â 2Â Â -1Â -29Â -14Â
100 Â DÂ Â -4Â -15Â Â -1Â Â -2Â -13Â -16Â Â -3Â Â -8Â Â
-5Â Â -6Â Â -4Â Â -1Â Â -7Â Â -2Â Â -7Â Â -3Â Â -2Â Â -6Â
-27Â -12Â 100
Cons. Cys
16Profiles formula for position M(p,a)
M(p,a) chance of finding amino acid a at
position p Msimp(p,a) number of times a occurs
at p divided by number of sequences However, what
if dont have many sequences in alignment?
Msimp(p,a) might be baised. Zeros for rare amino
acids. Thus Mcplx(p,a) ?b1 to 20 Msimp(p,b) x
Y(b,a) Y(b,a) Dayhoff matrix for a and b amino
acids S(p,a) ?a1 to 20 Msimp(p,a) ln
Msimp(p,a)
Core
17Profiles formula for entropy H(p,a)
H(p,a) - ?a1 to 20 f(p,a) log2 f(p,a), where
f(p,a) frequency of amino acid a occurs at
position p ( Msimp(p,a) ) Say column only has one
aa (AAAAA) H(p,a) 1 log2 1 0 log2 0 0
log2 0 0 0 0
0 Say column is random with all aa equiprobable
(ACD..ACD..ACD..)Hrand(p,a) .05 log2 .05
.05 log2 .05 -.22 -.22 -4.3 Say
column is random with aa occurring according to
probability found inthe sequence databases
(ACAAAADAADDDDAAAA.)Hdb(a) - ?a1 to 20 F(a)
log2 F(a), where F(a) is freq. of occurrence of
a in DB Hcorrected(p,a) H(p,a) Hdb(a)
Core
18End of class M3 2006,11.01Start of class M4
2006,11.06
19HMMs
Hidden Markov Model - a composition of finite
number of states, - each corresponding to a
column in a multiple alignment - each state
emits symbols, according to symbol-emission
probabilities Starting from an initial state, a
sequence of symbols is generated by moving from
state to state until an end state is reached.
Core
(Figures from Eddy, Curr. Opin. Struct. Biol.)
20Relating Different Hidden Match States to the
Observed Sequence
21The Hidden Part of HMMs
We see
We don't see
22Comparison of HMMs to Profiles
23Sequence profile elements
Extra
24Sequence profile elements
Extra
25Result HMM sequence profile
Extra
26Algorithms
Probability of a path through the model Viterbi
maximizes for seq Forward sums of all possible
paths
Forward Algorithm finds probability P that a
model l emits a given sequence O by summing over
all paths that emit the sequence the probability
of that path Viterbi Algorithm finds the most
probable path through the model for a given
sequence (both usually just boil down to simple
applications of dynamic programming)
27HMM algorithms similar to those in sequence
alignment
Core
28Building the Model
EM - expectation maximization "roll your own"
model -- dialing in probabilities
29Applications of HMMs (Gene Finding)
Matching prot fams (pfam) Predicting sec. struc.
(TM, alpha) Modelling binding sites for TF
(speech recognition)
30Modules
- Another example of the helix-loop-helix motif is
seen within several DNA binding domains including
the homeobox proteins which are the master
regulators of development
HMMs, Profiles, Motifs, and Multiple Alignments
used to define modules
(Figures from Branden Tooze)
- Several motifs (b-sheet, beta-alpha-beta,
helix-loop-helix) combine to form a compact
globular structure termed a domain or tertiary
structure - A domain is defined as a polypeptide chain or
part of a chain that can independently fold into
a stable tertiary structure - Domains are also units of function (DNA binding
domain, antigen binding domain, ATPase domain,
etc.)
31The Score
Core
- S Total Score
- S(i,j) similarity matrix score for aligning i
and j - Sum is carried out over all aligned i and j
- n number of gaps (assuming no gap ext. penalty)
- G gap penalty
Simplest score (for identity matrix) is S
matches What does a Score of 10 mean? What is
the Right Cutoff?
32Score in Context of Other Scores
- How does Score Rank Relative to all the Other
Possible Scores - P-value
- Percentile Test Score Rank
- All-vs-All comparison of the Database (100K x
100K) - Graph Distribution of Scores
- 1010 scores much smaller number of true
positives - N dependence
Core
33P-value in Sequence Matching
- P(s gt S) .01
- P-value of .01 occurs at score threshold S (392
below) where score s from random comparison is
greater than this threshold 1 of the time - Likewise for P.001 and so on.
Core
34P-values
- Significance Statistics
- For sequences, originally used in Blast
(Karlin-Altschul). Then in FASTA, c. - Extrapolated Percentile Rank How does a Score
Rank Relative to all Other Scores? - Our Strategy Fit to Observed Distribution
- 1) All-vs-All comparison
- 2) Graph Distribution of Scores in 2D (N
dependence) 1K x 1K families -gt 1M scores 2K
included TPs - 3) Fit a function r(S) to TN distribution (TNs
from scop) Integrating r gives P(sgtS), the CDF,
chance of getting a score better than threshold S
randomly - 4) Use same formalism for sequence structure
1
Core
2
e.g. P(score sgt392) 1 chance
3
35EVD Fits
- Reasonable as Dyn. Prog. maximizes over
pseudo-random variables - EVD is Max(indep. random variables)
- Normal is Sum(indep. random variables)
Observed
r(z) exp(-z2) ln r(z) -z2
Extreme Value Distribution (EVD, long-tailed)
fits the observed distributions best. The
corresponding formula for the P-value
Core
36Extreme Value vs. Gaussian
- X set of random numbers Each set indexed by j
- j1 1,4,9,1,3,1
- j2 2,7,3,11,22,1,22
- Gaussian S(j) ?j Xi central limit
- EVD S(j) max(Xi)
Freq.
S(j)
37Objective is to Find Distant Homologues
- Score (Significance) Threshold
- Maximize Coverage with an Acceptable Error Rate
- TP, TN, FP, FN
- TP and TN are good!
- We get P and N from our program
- We get T and F from a gold-standard
- Max(TP,TN) vs (FP,FN)
(graphic adapted from M Levitt)
38Coverage v Error Rate (ROC Graph)
Core
Coverage (roughly, fraction of sequences that one
confidently says something about)
Thresh10
Thresh20
sensitivitytp/ptp/(tpfn)
Thresh30
Different score thresholds
Error rate (fraction of the statements that are
false positives)
Two methods (red is more effective)
Specificity tn/n tn/(tnfp) error rate
1-specificity fp/n
39Significance Dependson Database Size
- The Significance of Similarity Scores Decreases
with Database Growth - The score between any pair of sequence pair is
constant - The number of database entries grows
exponentially - The number of nonhomologous entries gtgt homologous
entries - Greater sensitivity is required to detect
homologiesGreater s - Score of 100 might rank as best in database of
1000 but only in top-100 of database of 1000000
DB-1
DB-2
40Low-Complexity Regions
- Low Complexity Regions
- Different Statistics for matching
AAATTTAAATTTAAATTTAAATTTAAATTTthanACSQRPLRVSHRS
ENCVASNKPQLVKLMTHVKDFCV - Automatic Programs Screen These Out (SEG)
- Identify through computation of sequence entropy
in a window of a given size H ? f(a) log2 f(a)
- Also, Compositional Bias
- Matching A-rich query to A-rich DB vs. A-poor DB
Core
LLLLLLLLLLLLL
41End of class M4 2006,11.06Start of class M5
2006,11.08
42Computational Complexity
Core
- Basic NW Algorithm is O(n2) (in speed)
- M x N squares to fill
- At each square need to look back (MN) black
squares to find max in block - M x N x (MN) -gt O(n3)
- However, max values in block can be cached, so
algorithm is really only O(n2) - O(n2) in memory too!
- Improvements can (effectively) reduce sequence
comparison to O(n) in both
43FASTA
Core
- Hash table of short words in the query sequence
- Go through DB and look for matches in the query
hash (linear in size of DB) - perl whereACT 1,45,67,23....
- K-tuple determines word size (k-tup 1 is single
aa) - by Bill Pearson
VLICT _
VLICTAVLMVLICTAAAVLICTMSDFFD
44Join together query lookups into diagonals and
then a full alignment
(Adapted from D Brutlag)
45Basic Blast
- Altschul, S., Gish, W., Miller, W., Myers, E. W.
Lipman, D. J. (1990). Basic local alignment
search tool. J. Mol. Biol. 215, 403-410 - Indexes query
- BLAT - indexes DB
- Starts with all overlapping words from query
- Calculates neighborhood of each word using PAM
matrix and probability threshold matrix and
probability threshold - Looks up all words and neighbors from query in
database index - Extends High Scoring Pairs (HSPs) left and right
to maximal length - Finds Maximal Segment Pairs (MSPs) between query
and database - Blast 1 does not permit gaps in alignments
Core
46Blast Extension of Hash Hits
Core
Query
- Extend hash hits into High Scoring Segment Pairs
(HSPs) - Stop extension when total score doesnt increase
- Extension is O(N). This takes most of the time in
Blast
DB
47Blasting against the DB
- In simplest Blast algorithm, find best scoring
segment in each DB sequence - Statistics of these scores determine significance
Number of hash hits is proportional to O(NMD),
where N is the query size, M is the average DB
seq. size, and D is the size of the DB
48Blast2 Gapped Blast
Core
49Blast2 Gapped Blast
- Gapped Extension on Diagonals with two Hash Hits
- Statistics of Gapped Alignments follows EVD
empirically
Core
50Y-Blast
- Automatically builds profile and then searches
with this - Also PHI-blast
Parameters overall threshold, inclusion
threshold, interations
51PSI-Blast
Core
Semi-supervised learning
Blast FASTA Smith-Waterman PSI-BlastProfiles HMMs
Iteration Scheme
Sensitivity
Speed
Convergence vs explosion (polluted profiles)
52Practical Issues on DNA Searching
- Examine results with exp. between 0.05 and 10
- Reevaluate results of borderline significance
using limited query - Beware of hits on long sequences
- Limit query length to 1,000 bases
- Segment query if more than 1,000 bases
- Search both strands
- Protein search is more sensitive, Translate ORFs
- BLAST for infinite gap penalty
- Smith-Waterman for cDNA/genome comparisons
- cDNA gtZero gap-Transition matrices Consider
transition matrices - Ensure that expected value of score is negative
(graphic and some text adapted from D Brutlag)
53General Protein Search Principles
- If no significant results, use BLOSUM30 and lower
gap penalties - FASTA cutoff of .01
- Blast cutoff of .0001
- Examine results between exp. 0.05 and 10 for
biological significance - Ensure expected score is negative
- Beware of hits on long sequences or hits with
unusual aa composition - Reevaluate results of borderline significance
using limited query region - Segment long queries ³ 300 amino acids
- Segment around known motifs
- Choose between local or global search algorithms
- Use most sensitive search algorithm available
- Original BLAST for no gaps
- Smith-Waterman for most sensitivity
- FASTA with k-tuple 1 is a good compromise
- Gapped BLAST for well delimited regions
- PSI-BLAST for families(differential performance
on large and small families) - Initially BLOSUM62 and default gap penalties
(some text adapted from D Brutlag)