Title: Bootstrap and jackknife calculation resampling
1Bootstrap and jack-knife calculation (resampling)
- General principle is to stress the data repeatedly and recompute the tree each time, looking for robust features.
- Jack-knife: drop each of N sequences from the alignment and recompute the resulting N trees, testing whether they are compatible with the original.
- Bootstrap: recompute pairwise distances from a random sample of alignment columns (with replacement). Recompute the tree for each new distance set and see how often a particular tree branch is positioned the same.
(Given in the distance method context; the same methods work for any tree construction method.)
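The data-perturbation step of both procedures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy alignment, the function names, and the choice of a simple p-distance (fraction of differing columns, skipping gapped columns) are all hypothetical, since the slides do not name a distance measure. Tree building from each matrix is left out.

```python
import random

def p_distance(seq_a, seq_b):
    """Fraction of compared columns at which two aligned sequences differ.
    Columns with a gap in either sequence are skipped (an assumed convention)."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Pairwise distances for every unordered pair of sequence names."""
    names = sorted(alignment)
    return {(a, b): p_distance(alignment[a], alignment[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

def bootstrap_replicate(alignment, rng):
    """Sample alignment columns with replacement, keeping the original length."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

def jackknife_replicates(alignment):
    """Yield N alignments, each with one of the N sequences dropped."""
    for dropped in alignment:
        yield {name: seq for name, seq in alignment.items() if name != dropped}

# Hypothetical toy alignment, not taken from the slides.
toy_alignment = {
    "seqA": "ACDEFGHIKL",
    "seqB": "ACDEYGHIKL",
    "seqC": "ACNEYGH-KL",
    "seqD": "TCNQYGHLRL",
}

rng = random.Random(0)  # fixed seed so the run is repeatable
bootstrap_matrices = [distance_matrix(bootstrap_replicate(toy_alignment, rng))
                      for _ in range(100)]
jackknife_matrices = [distance_matrix(a)
                      for a in jackknife_replicates(toy_alignment)]
```

Each matrix in `bootstrap_matrices` or `jackknife_matrices` would then be handed to whatever tree-building method is in use, giving one tree per replicate.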
2Bootstrap (cont.)
(figure: real alignment)
Similar resampling and distance derivation are done for each sequence pair, then a new tree is calculated. Repeat ad nauseam. Compare bootstrap trees with each other.
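One way to compare the bootstrap trees is to count how often each grouping in a reference tree reappears among the replicates. The sketch below assumes trees are written as nested tuples of leaf names and, as a simplification, compares rooted clades rather than unrooted splits. The representation, the function names, and the example trees are all hypothetical.

```python
from collections import Counter

def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a leaf is just its name
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                 # record every internal node's leaf set
        return members

    leaves(tree)
    return found

def clade_support(reference, replicates):
    """Fraction of replicate trees that contain each clade of the reference tree."""
    counts = Counter()
    for tree in replicates:
        counts.update(clades(tree))
    return {clade: counts[clade] / len(replicates) for clade in clades(reference)}

# Hypothetical reference tree and four hypothetical bootstrap trees.
reference = ((("A", "B"), "C"), "D")
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
]
support = clade_support(reference, replicates)
```

Here the grouping of A with B is recovered in three of the four replicates, so its support is 0.75. Branches with low support are the ones that are not robust to resampling.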
3The molecular clock concept
- divergence (distance) and time are NOT the same!
- the molecular clock is a common default hypothesis for translating distance into time (assume that divergence is proportional to time).
- known to be violated whenever sufficiently broad groups are considered.
- however, frequently approximately valid for a particular sequence family over relatively short times (e.g. most proteins in primates).
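Under a strict clock, translating distance into time is a single division. The sketch below assumes the usual convention that a pairwise distance d accumulates along both lineages since the split, so d = 2 * r * t for a per-lineage rate r. The function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def clock_rate(distance, divergence_time):
    """Per-lineage rate implied by a pair with a known divergence time.
    The factor of 2 accounts for change along both lineages."""
    return distance / (2.0 * divergence_time)

def clock_time(distance, rate):
    """Divergence time implied by a distance, assuming a constant rate."""
    return distance / (2.0 * rate)

# Hypothetical calibration: a pair at distance 0.10 known to have split 50 My ago.
rate = clock_rate(0.10, 50.0)          # substitutions per site per My per lineage

# Hypothetical query pair from the same sequence family.
estimated_time = clock_time(0.04, rate)  # in My
```

With these made-up numbers the calibrated rate is about 0.001 per site per My and the query pair is dated to about 20 My. The estimate is only as good as the assumption that the same rate applies on every branch.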
4Things that tend to invalidate the molecular clock
- differential changes in generation time on some branches.
- differential changes in selective constraints on some branches (extreme example would be positive selection).
- depth of divergence, though correctable unless distances are too long.
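The last point can be illustrated with a standard multiple-hit correction. The slides do not say which correction is meant; the sketch below uses the Poisson correction for protein distances, d = -ln(1 - p), purely as one common example, with a hypothetical function name.

```python
import math

def poisson_corrected_distance(p):
    """Estimate substitutions per site from the observed fraction of
    differing sites p, using the Poisson correction d = -ln(1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1); the correction is undefined at p = 1")
    return -math.log(1.0 - p)

# Observed fractions of differing sites and their corrected distances.
for p in (0.1, 0.5, 0.9):
    d = poisson_corrected_distance(p)
    print(f"p = {p:.1f}  corrected d = {d:.3f}")
```

At small p the correction barely changes the value (0.1 becomes roughly 0.105), but near saturation it grows steeply (0.9 becomes roughly 2.3), so small sampling errors in p turn into large errors in d. That is the sense in which long distances stop being correctable.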
5Protein structure and alignment
- When you see a protein sequence alignment, notice the blocks with higher and lower similarity (they are almost always there).
- Most of the time, these are not simply stochastic variation; they represent regions under more or less strong purifying selection.
- These blocks can vary from rather small segments to rather long domains (or both, depending on your window).
- Longer blocks usually correspond to different protein domains (which can vary as a unit in selective pressure).
- Shorter blocks usually correspond to intra-domain structural features.
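The dependence on window size can be made concrete with a sliding-window conservation profile. The sketch below scores each column by the fraction of sequences sharing its most common residue and then averages over a window. The scoring rule, the function names, and the toy alignment are assumptions for illustration, not something taken from the slides.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column score: fraction of sequences carrying the most common
    residue in that column (gaps never count as the common residue)."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def windowed(scores, window):
    """Mean score over each run of `window` consecutive columns."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical toy alignment: a conserved start, a variable stretch, a conserved end.
toy = ["MKVLAAGG",
       "MKVLSTGG",
       "MKVLPQGG"]

per_column = column_conservation(toy)
profile = windowed(per_column, 2)
```

A short window picks out the small variable stretch in the middle of the toy alignment; a window as long as the whole alignment would average it away, which is why the same protein can show short blocks, long blocks, or both.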
6Kinases form a complex, diverse family
Example from a particular enzyme
(many subtypes)
(many subtypes)
7CaM Kinase I and CaM Kinase II (CaMKII) (CaM
stands for calcium-calmodulin)
- Very similar in the kinase and calmodulin regulatory domains.
- CaMKI is monomeric, whereas CaMKII is a 10-12 subunit multimer.
- CaMKII most likely arose after the CaM Kinase domain by fusing a multimer formation domain to the C-terminus.
8CaM Kinase II structure
(domain diagram, N to C: serine-threonine protein kinase; calmodulin regulation; multimer formation)
12 subunits with the catalytic domains facing out
9
unc-43   --------------------MQLQQINSGAFSVVRRCVHKTTGLEFAAKIINTKKLSARD
rCaMKII  -------MATITCTRFTEEYQLFEELGKGAFSVVRRCVKVLAGQEYPAKIINTKKLSARD
hCaMKI   MLGAVEGPRWKQAEDIRDIYDFRDVLGTGAFSEVILAEDKRTQKLVAIKCIAKEALEGKE
rCaMKI   MPGAVEGPRWKQAEDIRDIYDFRDVLGTGAFSEVILAEDKRTQKLVAIKCIAKKALEGKE

unc-43   FQKLEREARICRKLQHPNIVRLHDSIQEESFHYLVFDLVTGGELFEDIVAREFYSEADAS
rCaMKII  HQKLEREARICRLLKHPNIVRLHDSISEEGHHYLIFDLVTGGELFEDIVAREYYSEADAS
hCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS
rCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS

unc-43   HCIQQILESIAYCHSNGIVHRDLKPENLLLASKAKGAAVKLADFGLAIEVN-DSEAWHGF
rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
hCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA

unc-43   AGTPGYLSPEVLKKDPYSKPVDIWACGVILYILLVGYPPFWDEDQHRLYAQIKAGAYDYP
rCaMKII  AGTPGYLSPEVLRKDPYGKPVDLWACGVILYILLVGYPPFWDEDQHRLYQQIKARAYDFP
hCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD
rCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD

unc-43   SPEWDTVTPEAKSLIDSMLTVNPKKRITADQALKVPWICNRERVASAIHRQDTVDCLKKF
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
hCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN

unc-43   NARRKLKGAILTTMIATRNLSSKRSYRLTLGAEKLVISMKNIEYWQVLLNKIFATYKIKM
rCaMKII  NARRKLKGAILTTMLATRNFSGG-----------------------------------KS
hCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------
rCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------
continued
10continued (overlapped)
unc-43   SPEWDTVTPEAKSLIDSMLTVNPKKRITADQALKVPWICNRERVASAIHRQDTVDCLKKF
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
hCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN

unc-43   NARRKLKGAILTTMIATRNLSSKRSYRLTLGAEKLVISMKNIEYWQVLLNKIFATYKIKM
rCaMKII  NARRKLKGAILTTMLATRNFSGG-----------------------------------KS
hCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------
rCaMKI   FAKSKWKQAFNATAVVRHMR----------------------------------------

unc-43   KQCRNLLNKKEQGPPSTIKESSESS-QTIDDNDSEKGGGQLKHENTVVRADGATGIVSSS
rCaMKII  G--G---NKKNDG----VKESSESTNTTIEDED---------------------------

unc-43   NSSTASKSSSTNLSAQKQDIVRVTQTLLDAISCKDFETYTRLCDTSMTCFEPEALGNLIE
rCaMKII  ------------TKVRKQEIIKVTEQLIEAISNGDFESYTKMCDPGMTAFEPEALGNLVE

unc-43   GIEFHRFYFD--GNRKNQ-VHTTMLNPNVHIIGEDAACVAYVKLTQFLDRNGEAHTRQSQ
rCaMKII  GLDFHRFYFENLWSRNSKPVHTTILNPHIHLMGDESACIAYIRITQYLDAGGIPRTAQSE

unc-43   ESRVWSKKQGRWVCVHVHRSTQPSTNTTVSEF
rCaMKII  ETRVWHRRDGKWQIVHFHRSGAPSVLPH----
(note both inter- and intra-domain differences in
conservation)
11Protein structure basics
- proteins consist mostly of α-helices, β-sheets, and turns.
- the α-helices and β-sheets typically form the framework of the protein.
- the turns and other atypical structures often play important binding and catalytic roles.
- the core of the protein is hydrophobic, whereas the surface is usually polar or charged.
- most sharp turns (kinks) have glycine or proline.
12alpha helix
13three-stranded antiparallel β-sheet
14three-stranded antiparallel β-sheet, space filled
15substrate binding cleft
rCaMKII  SPEWDTVTPEAKDLINKMLTINPSKRITAAEALKHPWISHRSTVASCMHRQETVDCLKKF
rCaMKI   SPYWDDISDSAKDFIRHLMEKDPEKRFTCEQALQHPWIAGDTALDKNIH-QSVSEQIKKN  297

rCaMKII  NARRKLKGAILTTMLATRN
rCaMKI   FAKSKWKQAFNATAVVRHM  316
16sliced half-way through the protein
red: charged; blue: polar; green: hydrophobic
17(No Transcript)
18
rCaMKII  HQKLEREARICRLLKHPNIVRLHDSISEEGHHYLIFDLVTGGELFEDIVAREYYSEADAS
rCaMKI   GS-MENEIAVLHKIKHPNIVALDDIYESGGHLYLIMQLVSGGELFDRIVEKGFYTERDAS  119

rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA  178
19
rCaMKII  HCIQQILEAVLHCHQMGVVHRDLKPENLLLASKLKGAAVKLADFGLAIEVEGEQQRWFGF
rCaMKI   RLIFQVLDAVKYLHDLGIVHRDLKPENLLYYSLDEDSKIMISDFGLSKMED-PGSVLSTA  178

rCaMKII  AGTPGYLSPEVLRKDPYGKPVDLWACGVILYILLVGYPPFWDEDQHRLYQQIKARAYDFP
rCaMKI   CGTPGYVAPEVLAQKPYSKAVDCWSIGVIAYILLCGYPPFYDENDAKLFEQILKAEYEFD  238
20(No Transcript)
21Measuring structural similarity
- Structural similarity can persist after sequence similarity has reached noise levels.
- More generally, how do you measure the degree of similarity between two structures?
- A commonly used approach is the root mean square deviation (RMSD) between the positions of matched backbone atoms.
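The RMSD itself is a short calculation once the matched atoms are paired up. The sketch below assumes the two coordinate lists are already in one-to-one correspondence and already superimposed; in practice an optimal superposition step is normally applied first, which is not shown here. The function name and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between matched atom positions.
    Each argument is a list of (x, y, z) tuples in the same order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical coordinates in Angstroms: the second set is the first
# shifted by 1 Angstrom along z, so every atom deviates by exactly 1.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

deviation = rmsd(structure_a, structure_b)  # 1.0
```

Because the deviations are squared before averaging, a few badly matched atoms raise the RMSD more than many slightly displaced ones, which is one reason it is usually reported only for the shared, well-matched regions.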
22No statistically significant sequence similarity
RMSD for shared regions: 3.5 Angstroms
23Illustration of three points on a structure of
poorly known function
- gaps in alignments tend to be on surface loops
- areas of highest conservation tend to be at key sites (e.g. active sites of enzymes) and in core structural elements
- BUT when positive selection acts, binding faces may tend to be the parts that vary.
24MATH domain-containing genes: a mystery family in C. elegans
25(No Transcript)
26(No Transcript)
27No assignment for Thursday. Final assignment
will be posted by this evening.