Title: Welcome to Introduction to Bioinformatics Monday, 11 October
1 Welcome to Introduction to Bioinformatics
Monday, 11 October
- Characteristics of PSSMs
- How to make a PSSM
- Uncertainty and information
- How to score a sequence
Problem sets (Blast, Modeling)
2 Scenario 1: Prediction of regulatory site
3 N2 fixation in cyanobacteria
N2
CO2
O2
Matveyev and Elhai (unpublished)
4 Differentiation in cyanobacteria: What does NtcA bind to?
Herrero et al. (2001) J Bacteriol 183:411-425
5 Differentiation in cyanobacteria
Sequence upstream from hetQ
ttctatgagaatataaaattttccttaagtttct aaaaccgaccattct
gatgaataagtccggtttt tgctttttcgctttatttatctatatttcc
aagt ggggtgacaactatcttgccaatattgtcgttat gaaaaaatct
GTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctct
tgggtggg
GTA(8)TAC: NtcA binding site
(20-24)TAnnnT: Promoter
6 Differentiation in cyanobacteria: Integration of signals through HetR
HetQ
-N
NtcA
???
Genes needed for differentiation
Position in cell cycle
HetR
Level of PatS
Level of HetN
Master regulator
Stockholm
7 Scenario 1: The aftermath
YES
YES
- Did killing the site prevent heterocysts?
NO
Stockholm
8 Scenario 1: The aftermath
YES
YES
- Did killing the site prevent heterocysts?
NO
NO
YES
9 Scenario 1: The aftermath
If hetQ isn't the golden link, then what is?
-N
NtcA
???
Genes needed for differentiation
HetR
- Gene preceded by NtcA-binding site
- Blocking NtcA-binding affects gene expression
- Gene product required for hetR expression
10 Scenario 1: The aftermath
If hetQ isn't the golden link, then what is?
-N
NtcA
???
Genes needed for differentiation
HetR
- Gene preceded by NtcA-binding site
How to find?
- Search for GTA(N8)TAC(N20-24)TAT?
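A fixed-pattern search like the one proposed above can be sketched with a regular expression. This is a minimal illustration in Python, not the method used in the lecture. The function name find_sites is hypothetical, the promoter element is written as TAnnnT (the form shown on the hetQ slide), which is an assumption, and only the given strand is searched.

```python
import re

# GTA, 8 of any base, TAC, 20-24 of any base, then TAnnnT.
# Writing the promoter element as TAnnnT is an assumption taken from the hetQ slide.
SITE = re.compile(r"GTA[ACGT]{8}TAC[ACGT]{20,24}TA[ACGT]{3}T")

def find_sites(seq):
    """Return (start, matched_text) for each non-overlapping match on the given strand."""
    seq = seq.upper().replace(" ", "")  # ignore case and layout spaces
    return [(m.start(), m.group()) for m in SITE.finditer(seq)]

# Fragment of the upstream sequence shown on the hetQ slide.
fragment = "GTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctct"
hits = find_sites(fragment)
for start, text in hits:
    print(start, text)
```

A search of this kind is all-or-nothing: a site with a single mismatch in GTA or TAC is missed entirely, and every exact match counts the same regardless of its surrounding bases.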
11 Position-specific scoring matrices: A better way
13 Position-specific scoring matrices: A better way
Anabaena genome
17 Position-specific scoring matrices: A better way
Score .60 .20 1.0
18 Position-specific scoring matrices: Introduction of pseudocounts
A?
q_G,6 = 5 real counts; p_G = ? pseudocounts
19 Position-specific scoring matrices: Introduction of pseudocounts
Score(position, nucleotide) = (q + p) / (N + B)
p = pseudocounts = B × (overall frequency of nucleotide): A = 0.32, T = 0.32, C = 0.18, G = 0.18
B = total number of pseudocounts = square root (N)? or 0.1?
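The pseudocount formula can be tried out with a few lines of Python. This is a sketch under stated assumptions: N is read as the total number of real counts in the column (the slide does not define it), B defaults to the square root of N (one of the two options the slide raises), and the function name column_score and the example column are hypothetical.

```python
import math

# Overall nucleotide frequencies given on the slide.
BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

def column_score(q, n, base, b=None):
    """Score = (q + p) / (N + B), with p = B * overall frequency of the base.

    q: real count of `base` in the column
    n: total real counts in the column (assumed meaning of N)
    b: total pseudocounts; defaults to sqrt(n), an assumption
    """
    if b is None:
        b = math.sqrt(n)
    p = b * BACKGROUND[base]
    return (q + p) / (n + b)

# Hypothetical column: 5 aligned sites, all with A at this position.
all_a = column_score(5, 5, "A")     # about 0.79
# A base never seen in that column still gets a small nonzero score.
unseen_g = column_score(0, 5, "G")
print(round(all_a, 2), round(unseen_g, 3))
```

With 5 real counts and B = sqrt(5), a fully conserved A column scores about 0.79, which appears to match the value used on the Normalization slide. Passing b=0.1 instead shows how a small B keeps the scores close to the raw frequencies.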
21 Position-specific scoring matrices: Normalization
How to account for similarity due to similar base
composition?
Compare: Score(PSSM) / Score(background frequency) = 0.79 / 0.32 ≈ 2.5
22 Position-specific scoring matrices: Log odds form
Log odds = -log(score)
Score score score log log
log
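The point of the log form is that a product of per-position scores becomes a sum of logs. The sketch below illustrates this in Python under several assumptions: the toy training set is invented, B = sqrt(N) is used for pseudocounts, and the log-odds value is taken as log2(score / background), so that higher means a better match. The sign convention on the slide may differ. All function names are hypothetical.

```python
import math

BACKGROUND = {"A": 0.32, "T": 0.32, "C": 0.18, "G": 0.18}

def build_pssm(sites):
    """One dict per column holding (q + p) / (N + B) for each base, B = sqrt(N)."""
    n = len(sites)
    b = math.sqrt(n)
    return [{base: (column.count(base) + b * freq) / (n + b)
             for base, freq in BACKGROUND.items()}
            for column in zip(*sites)]

def product_score(pssm, seq):
    """Normalized score: product over positions of score / background frequency."""
    result = 1.0
    for col, base in zip(pssm, seq):
        result *= col[base] / BACKGROUND[base]
    return result

def log_odds_score(pssm, seq):
    """The same quantity in log form: a sum of log2(score / background)."""
    return sum(math.log2(col[base] / BACKGROUND[base])
               for col, base in zip(pssm, seq))

def best_window(pssm, sequence):
    """Slide the matrix along a sequence; return (best log-odds score, start)."""
    width = len(pssm)
    return max((log_odds_score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical toy training set, not taken from the lecture.
toy_sites = ["GTAC", "GTAC", "GTAT", "GAAC", "GTAC"]
toy_pssm = build_pssm(toy_sites)
print(log_odds_score(toy_pssm, "GTAC"), best_window(toy_pssm, "AAAGTACTTT"))
```

Summing logs avoids multiplying many small numbers and makes the contribution of each position easy to read off. The same window scan could in principle be run over a whole genome.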
23 Position-specific scoring matrices: Expand training set through orthologs
Table 3. Training set including sequences from two Nostocs
71-devB  CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA
Np-devB  CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA
71-glnA  AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT
Np-glnA  AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT
71-hetC  GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC
71-nirA  TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG
Np-nirA  CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG
71-ntcB  ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT
Np-ntcB  TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT
71-urt   ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA
Np-urt   TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA
24 Position-specific scoring matrices: Decrease complexity through info analysis
Uncertainty (Hc) = -Sum [p_ic log2(p_ic)] (summed over the four nucleotides i in column c)
H1 = -[4/11 log2(4/11) + 3/11 log2(3/11) + 1/11 log2(1/11) + 3/11 log2(3/11)] = 1.87
H31 = -[1/11 log2(1/11) + 1/11 log2(1/11) + 1/11 log2(1/11) + 8/11 log2(8/11)] = 1.28
Information content = Sum (Hmax - Hc) (summed over all columns)
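The two worked values can be checked directly against the 11 training sequences of Table 3. The Python sketch below computes the uncertainty of every column from raw frequencies, with no pseudocounts and no small-sample correction (both are simplifying assumptions), and takes Hmax = log2(4) = 2 bits. The names TRAINING, uncertainty and information are hypothetical.

```python
import math
from collections import Counter

# The 11 aligned sequences from Table 3 (each split in two for readability).
TRAINING = {
    "71-devB": "CATTACTCCTTCAATCCCTCGCCCCTCAT"
               "TTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA",
    "Np-devB": "CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTA"
               "CGTTTTCGCGTCACAGATAAATGTAGAATTCA",
    "71-glnA": "AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAAC"
               "TGTCTAATGTTTAGAATCTACGATAT",
    "Np-glnA": "AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAAT"
               "CCGCTAATGTCTACTATTTAAGATAT",
    "71-hetC": "GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAAT"
               "AGCATTTATATTTGCTTTAGTATCTC",
    "71-nirA": "TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTT"
               "TACCTGAGATCCCGACATAACCTTAG",
    "Np-nirA": "CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTT"
               "TAACAGAAATCTCGTCTTAAGTTATG",
    "71-ntcB": "ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAAT"
               "TGGGGAGCAAAATCAGCTAACTTAAT",
    "Np-ntcB": "TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAAT"
               "TGCGGAGAATAAACCGTTAACTTAGT",
    "71-urt":  "ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTC"
               "AATGGTTAAATATCAAACTAATATCA",
    "Np-urt":  "TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTT"
               "ACTAGTAAACTATCGCACTATCATCA",
}

def uncertainty(column):
    """Hc = -sum of p * log2(p), with p the observed frequency of each base."""
    n = len(column)
    return -sum((k / n) * math.log2(k / n) for k in Counter(column).values())

columns = list(zip(*TRAINING.values()))
h = [uncertainty(col) for col in columns]

H_MAX = 2.0  # log2(4), the maximum uncertainty for four nucleotides
information = [H_MAX - hc for hc in h]

h1, h31 = h[0], h[30]  # the slide numbers columns from 1
print(round(h1, 2), round(h31, 2))  # 1.87 1.28
print(round(sum(information), 2))   # total information over all columns
```

Columns with high information (low uncertainty), such as the conserved GTA and TAC positions, are the natural ones to keep if the matrix is to be reduced in complexity.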