Title: Bioinformatics: The Prediction of Life
1Bioinformatics: The Prediction of Life
- Tony C Smith
- Department of Computer Science
- University of Waikato
- tcs_at_cs.waikato.ac.nz
2Bioinformatics
- Computation with biological data
- Data: genes, proteins, microarrays, mass spectra,
written documents, populations of organisms
- Goal: knowledge discovery
3The essence is prediction
- My dog is very littl_ ?
- We know that letters do not occur in English at
random: not all letters are equally common (e.g.
e is more common than x)
- We know that context changes the probability of a
letter (e.g. what's the most likely letter after
the sequence "I eat Weet-Bi_"?)
- Prediction is important in many applications
(e.g. encryption, compression, communication,
graphics, simulation and bioinformatics!)
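The idea that context changes the probability of the next letter can be sketched with a simple character n-gram counter. This is a minimal illustration rather than anything shown in the talk: the function names `train` and `predict_next`, the context length `order`, and the tiny training text are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(text, order=2):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts

def predict_next(counts, context, order=2):
    """Return the most frequent follower of the last `order` characters, or None."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training text (an assumption; a real model would need far more data).
corpus = "my dog is very little. my cat is very little. i eat weet-bix."
model = train(corpus, order=3)

print(predict_next(model, "my dog is very littl", order=3))  # uses context "ttl"
print(predict_next(model, "i eat weet-bi", order=3))         # uses context "-bi"
```

A longer context makes the guess more specific but needs more training text before each context has been seen often enough to be trusted.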
4Prediction in bioinformatics
- Predicting the location of genes in DNA
- Predicting the function of proteins
- Predicting diseases from molecular samples
- Predicting population dynamics
- Anything that involves making a judgment,
typically expressible as a yes/no decision about
some sample datum
5Representation
- W e e t B i x
- 0101011101100101011001010111010000101101
- to the computer, everything is binary!
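As a small sketch of what "everything is binary" means here: assuming plain 8-bit ASCII, the 40 bits shown above decode to the five characters `Weet-`. The helper names `to_bits` and `from_bits` are made up for this illustration.

```python
def to_bits(text):
    """Encode text as a string of 8-bit ASCII codes."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_bits(bits):
    """Decode a string of 0s and 1s back into ASCII text, 8 bits per character."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")

print(to_bits("Weet-"))
print(from_bits("0101011101100101011001010111010000101101"))
```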
6- 0101011101100101011001010111010000101101
- 0101101100100111111011010011010000101101
- A A C G T C A T T C G A T G A T T C G A
- Just as we can teach a computer to predict
things about a sequence of letters in English
prose, we can also teach it to predict things
about other sequences, like a genetic sequence
7A genetic prediction problem
- ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
cgaactcgcatcagc
8A genetic prediction problem
- ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
cgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
atttattatattcccggcgcggctacgttcatcccagcagcagcgatttt
aaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgac
gcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttg
cgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcg
ctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagc
agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
ttattcacgctaatggacgacatcttttactacgacggcgcctacgcatc
gcagcatacgacgcccagcatagtattttagaggcgaggacatcatcata
tcgcagctacagcgcatcagacgcatacgacgacgactacgacgacacta
acgacgatgttgcgcacccacaccagttatatagagacgaactcgcatca
gtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctac
gttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcg
cgttcgtcgcctttattcacgctaatggacgacatcttttactacgacgg
cgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgag
gacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgact
acgacgacactaacgacgatgttgcgcacccacaccagttatatagagac
gaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcc
cggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatc
agactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctt
ttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtat
tttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
ttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaat
ttattatattcccggcgcggctacgttcatcccagcagcagcgattttaa
aattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaat
ggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgc
ccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
catcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagt
gttgcgcacccacaccagttatatagagacgaactc
9A genetic prediction problem
- ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
cgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
atttattatattcccggcgcggctacgttcatcccagcagcagcgatttt
aaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgac
gcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttg
cgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcg
ctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagc
agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
ttattcacgctaatggacgacatcttttactacgacggcgcctacgcatc
gcagcatacgacgcccagcatagtattttagaggcgaggacatcatcata
tcgcagctacagcgcatcagacgcatacgacgacgactacgacgacacta
acgacgatgttgcgcacccacaccagttatatagagacgaactcgcatca
gtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctac
gttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcg
cgttcgtcgcctttattcacgctaatggacgacatcttttactacgacgg
cgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgag
gacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgact
acgacgacactaacgacgatgttgcgcacccacaccagttatatagagac
gaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcc
cggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatc
agactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctt
ttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtat
tttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
ttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaat
ttattatattcccggcgcggctacgttcatcccagcagcagcgattttaa
aattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaat
ggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgc
ccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
catcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
cacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagt
gttgcgcacccacaccagttatatagagacgaactcttgcaatcggcgct
acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagc
tgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacg
ttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgc
gttcgtcgcctttattcacgctaatggacgacatcttttactacgacggc
gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctacagcgcatcagacgcatacgacgacgacta
cgacgacactaacgacgatgttgcgcacccacaccagttatatagagacg
aactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattccc
ggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatca
gactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttt
tactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtatt
ttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcata
cgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagt
tatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatt
tattatattcccggcgcggctacgttcatcccagcagcagcgattttaaa
attaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatg
gacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcc
cagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgc
atcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgc
acccacaccagttatatagagacgaactcgcatcagtgcaatcggcgcta
cgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagc
agcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttta
ttcacgctaatggacgacatcttttactacgacggcgcctacgcatcgca
gcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcg
cagctacagcgcatcagacgcatacgacgacgactacgacgacactaacg
acgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtg
caatcggcgctacgcttcaaaatttattatattcccggcgcggctacgtt
catcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgt
tcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgc
ctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggac
atcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacg
acgacactaacgacgatgttgcgcacccacaccagttatatagagacgaa
ctcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccgg
cgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcaga
ctctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttta
ctacgacggcgcctacgcatcgcagcatacgacgcccagcatagtatttt
agaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacg
acgacgactacgacgacactaacgacgatgttgcgcacccacaccagtta
tatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaattta
ttatattcccggcgcggctacgttcatcccagcagcagcgattttaaaat
taacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
cgacatcttttactacgacggcgcctacgcatcgcagcatacgacgccca
gcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcat
cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcac
ccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacac
cagttatatagagacgaactc
10A genetic prediction problem
- A gene encodes a protein
- It is a blueprint that provides biochemical
instructions on how to construct a sequence of
amino acids so as to make a working protein that
will perform some function in the organism
11A genetic prediction problem
encoding region
untranslated region
12A genetic prediction problem
untranslated region
13A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
untranslated region
14A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
What transcription factors bind to this
gene? Where is the transcription factor binding
site?
15A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
A binding site is often a short general
pattern, e.g. CCGATNATCGG (N stands for any nucleotide)
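Searching for such a pattern can be sketched with a regular expression in which N matches any one nucleotide. This is only an illustration: the function names and the test sequence with two planted sites are invented for the example and are not data from the slides.

```python
import re

def motif_to_regex(motif):
    """Turn a DNA pattern into a regex source string; N matches any single nucleotide."""
    return "".join("[ACGT]" if base == "N" else base for base in motif.upper())

def find_sites(sequence, motif):
    """Return the 0-based start positions of every (possibly overlapping) match."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported too.
    lookahead = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]

# Made-up sequence containing two planted sites.
example = "ttacCCGATAATCGGgcatgcaCCGATGATCGGtt"
print(find_sites(example, "CCGATNATCGG"))
```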
16A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
The patterns are often reverse complements, e.g.
CCGATNATCGG pairs with GGCTANTAGCC on the opposite
strand, which read in the opposite direction is
again CCGATNATCGG
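The reverse-complement property is easy to check in code. The sketch below (helper names are assumptions) complements each base and then reverses the string; for the example pattern, the complement is GGCTANTAGCC and the reverse complement is the pattern itself.

```python
# Complement of each nucleotide; N (any base) pairs with N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def complement(seq):
    """Return the base-by-base complement of a DNA string."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

site = "CCGATNATCGG"
print(complement(site))
print(reverse_complement(site))
```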
17A genetic prediction problem
ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
Where there is one binding site, often there is
another nearby.
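The "another nearby" clue can be turned into a simple filter over candidate positions. A minimal sketch, assuming the candidate start positions have already been found; the function name `nearby_pairs`, the example positions, and the `max_gap` threshold are arbitrary choices for illustration.

```python
def nearby_pairs(positions, max_gap):
    """Return pairs of site positions that lie within max_gap of each other."""
    ordered = sorted(positions)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second - first > max_gap:
                break  # later positions are even further away
            pairs.append((first, second))
    return pairs

# Hypothetical candidate positions; only close neighbours are kept.
print(nearby_pairs([4, 22, 300, 310, 900], max_gap=50))
```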
18A genetic prediction problem
All of these properties are the kinds of things
for which computer science has developed
algorithms and data structures to identify
quickly and efficiently, and therefore it is
exactly the kind of problem computer scientists
should be able to solve.
19proteomics
Three consecutive nucleotides in the coding
region form a codon, i.e. encode an amino
acid. A string of amino acids makes a protein.
3 nucleotides, 4 possibilities for each, so
4^3 = 64 possible codons. But there are only
20 amino acids!
20proteomics
There is quite a bit of redundancy in codons.
Glycine: GGA, GGC, GGG, GGT
Tyrosine: TAT, TAC
Methionine: ATG
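The codon arithmetic and the redundancy can both be checked with a short sketch. It assumes the standard genetic code, written in the usual compact form (one letter per codon, bases in TCAG order, `*` for a stop codon); the names `codons_for` and `translate` are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def codons_for(amino_acid):
    """List every codon that encodes the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)

def translate(dna):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(len(CODONS))       # number of possible codons
print(codons_for("G"))   # glycine
print(codons_for("Y"))   # tyrosine
print(codons_for("M"))   # methionine
```

Several codons map to the same letter, which is why 64 codons are enough for 20 amino acids plus the stop signals.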
21Amino Acid
R group
Amino group
Carboxyl group
22Amino Acid
tyrosine
glycine
23Primary structure MSALVSTTPSLLAGVRNVDB ..
24Tertiary Structure
25Secondary Structure
26Signal peptide
- A relatively short sequence of amino acid residues at
the N-terminus of the nascent protein
- Typically 15-50 residues
- MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALTGL
- Cleaved off as the protein passes through the membrane
(operates like a pass key)
- Knowing the signal peptide helps determine protein
function in the organism
27How do we do it?
- See any patterns?
ttgcaatcggcgctacgcttcaaaatttat
tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaatt
aacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggac
gacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccag
catagtattttagaggcgaggacatcatcatatcgcagctacagcgcatc
agacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacc
cacaccagttatatagagacgaactcgcatcagctgcaatcggcgctacg
cttcaaaatttattatattcccggcgcggctacgttcatcccagcagcag
cgattttaaaatttcgcctttattcacgctaatggacgacatcttttact
acgacggcgcctacgcatcgcagcatacgacgcccacgcccagcatagta
ttttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgca
tacgacgacgactacgacgacactaacgacgatgttgcgcacccacacca
gttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaa
tttattatagcatagtattttagaggcgaggacatcatcatatcgcagct
acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgat
gttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatc
ggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcc
cagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtc
gcctttattcacgctaatggacgacatcttttactacgacggcgcctacg
catcgcagcatacgacgcccagcatagtattttagaggcgaggacatcat
catatcgcagctacagcgcatcagacgcatacgacgacgactacgacgac
actaacgacgatgttgcgcacccacaccagttatatagagacgaactcgc
atcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcgg
ctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctc
gtaacgcatcagactctcgtcgcgttcgcgcgttcgtcgcctttattcac
gctaatggacgacatcttttactacgacggcgcctacgcatcgcagcata
cgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagct
acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgat
gttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatc
ggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcc
cagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtc
gcctttattcacgctaatggacgacatcttttactacgacggcgcctacg
catcgcagcatacgacgcccagcatagtattttagaggcgaggacatcac
tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
cgaactcgcatcagtgctacgcttcaaaatttattatattcccggcggca
atcggcgctacgcttcaaaatttattatattcccggcgcggctacgttca
tcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttc
gtcgcctttattcacgctaatggacgacatcttttactacgacggcgcct
acgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacat
catcatatcgcagctacagcgcatcagacgcatacgacgacgactacgac
gacactaacgacgatgttgcgcacccacaccagttatatagagacgaact
cgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcg
cggctacgttcatcccagcagcagcgattttaaaattaacgcatcagact
ctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttact
acgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttag
aggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgac
gacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
tagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttatt
atattcccggcgcggctacgttcatcccagcagcagcgattttaaaatta
acgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacg
acatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagc
atagtattttagaggcgaggacatcatcatatcgcagctacagcgcatca
gacgcatacgacgacgactacgacgacactaacgacgatgttgcgcaccc
acaccagttatatagagacgaactcgcatcagtgttgcgcacccacacca
gttatatagagacgaactcttgcaatcggcgctacgcttcaaaatttatt
atattcccggcgcggctacgttcatcccagcagcagcgattttaaaatta
acgcatcagactctcgtcgcgttcgtcgcctttatttattatattcccgg
cgcggctacgttcatcccagcattcacgctaatggacgacatcttttact
acgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttag
aggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgac
gacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
tagagacgaactcgcatcagctgcaatcggcgctacgcttcaaaatttat
tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaatt
aacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggac
gacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccag
catagtattttagaggcgaggacatcatcatatcgcagctacagcgcatc
agacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacc
cacaccagttatatagagacgaactcgcatcaggacatcttttactacga
cggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggc
gaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacg
actacgacgacactaacgacgatgttgcgcacccacaccagttatataga
tgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacg
ttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgc
gttcgtcgcctttattcacgctaatggacgacatcttttactacgacggc
gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgagg
acatcatcatatcgcagctactcatatcgcagctacagcgcatcagacgc
atacgacgacgaagcgcatcagacgcatacgacgacgactacgacgacac
taacgacgatgttgcgcacccacaccagttatatagagacgaactcgcat
cagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggct
acgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgt
cgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgac
ggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcg
aggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacga
ctacgacgacactaacgacgatgttgcgcacccacaccagttatatagag
acgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatatt
cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
ttattcacgctaatggacgacgaactcgcatcagtgcaatcggccggcta
cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
gcgcctacgcatcgcagcatacgattcccggcgcggctacgttcatccca
gcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgc
ctttattcacgctaatggacgacatcttttactacgacggcgcctacgca
tcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatca
tatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacac
taacgacgatgttgcgcacccacaccagttatatagagacgaactcgcat
cagtgttgcgcacccacaccagttatatagagacgaactcttagaggcga
ggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatc
atcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatat
cgc
28Local biases in residues around the cleavage site
Sequence regularities can be exploited by
statistical and pattern-based models
29Proteomic prediction
Language:
- letters combine to form words
- words combine to form phrases
- phrases combine to form sentences
- sentences combine to form paragraphs (and ultimately
Harry Potter books)
Proteins:
- amino acids combine to form peptides
- peptides combine to form secondary motifs
(e.g. α-helices and β-sheets)
- motifs combine to make proteins
- proteins combine to make toenails (and ultimately
people)
30Approach
- Problem is stated as two-class
- An amino acid is either the first residue of the
mature protein or it is not
- Each residue is described by a single document,
which includes as many electrochemical,
structural or contextual facts as are available
(desirable)
31Properties of amino acids
32Residue as a document
- E.g. Cysteine (Cys, C)
- aliphatic: yes, aromatic: no, hydrophobic: yes,
charge: -, polarized: yes, small: no,
number of nitrogen atoms: 1, contains sulphur: yes,
has a carbon ring: no, ionized: yes,
valence: 2, cbeta: no, covalent: yes, h-bond: yes
- etc. (whatever else the experimenter wants to
include)
33Sample document
- PRNUM1. AANUM21.
- AMINO-8L. ALIPH-8-. AROMA-8-.
CBETA-8-. CHARG-8-. COVAL-8-.
HBOND-8-. HPHOB-8. IONIZ-8-.
NITRO-81. POLAR-8-. POSNG-80.
SMALL-8-. SULPH-8-. TEENY-8-.
CRING-8-. VALEN-82. AMINO-7L.
ALIPH-7-. AROMA-7-. CBETA-7-.
CHARG-7-. COVAL-7-. HBOND-7-.
HPHOB-7. IONIZ-7-. NITRO-71.
POLAR-7-. POSNG-70. SMALL-7-.
SULPH-7-. TEENY-7-. CRING-7-.
VALEN-72. AMINO-6F. ALIPH-6.
AROMA-6. CBETA-6-. CHARG-6-.
COVAL-6-. HBOND-6-. HPHOB-6.
IONIZ-6-. NITRO-61. POLAR-6-.
POSNG-60. SMALL-6-. SULPH-6-.
TEENY-6-. CRING-6. VALEN-62.
AMINO-5A. ALIPH-5-. AROMA-5-.
CBETA-5-. CHARG-5-. COVAL-5-.
HBOND-5-. HPHOB-5-. IONIZ-5-.
NITRO-51. POLAR-5-. POSNG-50.
SMALL-5. SULPH-5-. TEENY-5.
CRING-5-. VALEN-52. AMINO-4T.
ALIPH-4. AROMA-4-. CBETA-4.
CHARG-4-. COVAL-4-. HBOND-4.
HPHOB-4-. IONIZ-4-. NITRO-41.
POLAR-4. POSNG-40. SMALL-4.
SULPH-4-. TEENY-4-. CRING-4-.
VALEN-42. AMINO-3C. ALIPH-3.
AROMA-3-. CBETA-3-. CHARG-3-.
COVAL-3. HBOND-3. HPHOB-3.
IONIZ-3. NITRO-31. POLAR-3.
POSNG-3-. SMALL-3-. SULPH-3.
TEENY-3-. CRING-3-. VALEN-32.
AMINO-2I. ALIPH-2-. AROMA-2-.
CBETA-2. CHARG-2-. COVAL-2-.
HBOND-2-. HPHOB-2. IONIZ-2-.
NITRO-21. POLAR-2-. POSNG-20.
SMALL-2-. SULPH-2-. TEENY-2-.
CRING-2-. VALEN-22. AMINO-1A.
ALIPH-1-. AROMA-1-. CBETA-1-.
CHARG-1-. COVAL-1-. HBOND-1-.
HPHOB-1-. IONIZ-1-. NITRO-11.
POLAR-1-. POSNG-10. SMALL-1.
SULPH-1-. TEENY-1. CRING-1-.
VALEN-12. AMINO0R. ALIPH0. AROMA0-.
CBETA0-. CHARG0. COVAL0-. HBOND0.
HPHOB0-. IONIZ0. NITRO04. POLAR0.
POSNG0. SMALL0-. SULPH0-. TEENY0-.
CRING0-. VALEN03. AMINO1H. ALIPH1.
AROMA1. CBETA1-. CHARG1. COVAL1-.
HBOND1. HPHOB1-. IONIZ1. NITRO13.
POLAR1. POSNG1. SMALL1-. SULPH1-.
TEENY1-. CRING1. VALEN13. AMINO2Q.
ALIPH2. AROMA2-. CBETA2-. CHARG2-.
COVAL2-. HBOND2. HPHOB2-. IONIZ2-.
NITRO22. POLAR2. POSNG20. SMALL2-.
SULPH2-. TEENY2-. CRING2-. VALEN22.
AMINO3Q. ALIPH3. AROMA3-. CBETA3-.
CHARG3-. COVAL3-. HBOND3. HPHOB3-.
IONIZ3-. NITRO32. POLAR3. POSNG30.
SMALL3-. SULPH3-. TEENY3-. CRING3-.
VALEN32. AMINO4R. ALIPH4. AROMA4-.
CBETA4-. CHARG4. COVAL4-. HBOND4.
HPHOB4-. IONIZ4. NITRO44. POLAR4.
POSNG4. SMALL4-. SULPH4-. TEENY4-.
CRING4-. VALEN43. AMINO5Q. ALIPH5.
AROMA5-. CBETA5-. CHARG5-. COVAL5-.
HBOND5. HPHOB5-. IONIZ5-. NITRO52.
POLAR5. POSNG50. SMALL5-. SULPH5-.
TEENY5-. CRING5-. VALEN52. AMINO6Q.
ALIPH6. AROMA6-. CBETA6-. CHARG6-.
COVAL6-. HBOND6. HPHOB6-. IONIZ6-.
NITRO62. POLAR6. POSNG60. SMALL6-.
SULPH6-. TEENY6-. CRING6-. VALEN62.
AMINO7Q. ALIPH7. AROMA7-. CBETA7-.
CHARG7-. COVAL7-. HBOND7. HPHOB7-.
IONIZ7-. NITRO72. POLAR7. POSNG70.
SMALL7-. SULPH7-. TEENY7-. CRING7-.
VALEN72. AMINO8Q. ALIPH8. AROMA8-.
CBETA8-. CHARG8-. COVAL8-. HBOND8.
HPHOB8-. IONIZ8-. NITRO82. POLAR8.
POSNG80. SMALL8-. SULPH8-. TEENY8-.
CRING8-. VALEN82. MULT37. MULT54.
MULT73. MULT92. 2GRAMIA. GRAM2HQ. 3GRAMCIA.
GRAM3HQQ.
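The positional part of such a document can be generated mechanically. The sketch below is an interpretation of the sample rather than the original tool: it assumes each `AMINO` token is the feature name, the position relative to the residue being classified, and the residue letter, and that the `GRAM` tokens hold the two and three residues just before and just after the centre. The property tokens (ALIPH, AROMA and so on) are left out, and the function name `window_tokens` is made up.

```python
def window_tokens(sequence, index, radius=8):
    """Describe the residue at `index` by the residues around it.

    Emits tokens such as AMINO-8L. (residue L at relative position -8),
    plus the 2- and 3-residue strings just before and just after the centre.
    """
    tokens = []
    for offset in range(-radius, radius + 1):
        pos = index + offset
        if 0 <= pos < len(sequence):
            tokens.append(f"AMINO{offset}{sequence[pos]}.")
    for n in (2, 3):
        before = sequence[max(0, index - n):index]
        after = sequence[index + 1:index + 1 + n]
        if len(before) == n:
            tokens.append(f"{n}GRAM{before}.")
        if len(after) == n:
            tokens.append(f"GRAM{n}{after}.")
    return tokens

# Residues read off the AMINO tokens of the sample document (positions -8 to 8).
window = "LLFATCIARHQQRQQQQ"
print(" ".join(window_tokens(window, index=8)))
```

Run on the seventeen residues of the sample, this reproduces its AMINO tokens and the four GRAM tokens at the end.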
34Artificial Intelligence
- Computers do things only human brains can
otherwise do
expert
expert
35Artificial Intelligence
- Computers do things only human brains can
otherwise do
expert system
expert
36Artificial Intelligence
- Computers do things only human brains can
otherwise do
learning system
expert system
37Machine learning
What is machine learning?
- creating computer programs that get better with
experience
- learn how to make expert judgments
- discover previously hidden, potentially useful
information (data mining)
How does it work?
- user provides learning system with examples of
concept to be learned
- induction algorithm infers a characteristic model
of the examples
- model is used to predict whether or not future
novel instances are also examples, and it does
this very consistently, and very, very quickly!
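The three steps (examples in, model inferred, predictions out) can be sketched with a very small learner. The talk does not say which induction algorithm is used, so naive Bayes is swapped in here purely as an illustration; the class name, the token documents and their yes/no labels are all invented for the example.

```python
import math
from collections import Counter

class NaiveBayes:
    """A tiny naive Bayes learner over token documents (illustrative only)."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(documents, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {t for c in self.classes for t in self.token_counts[c]}
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus log likelihood of each token, with add-one smoothing.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.token_counts[c].values()) + len(self.vocabulary)
            for t in tokens:
                score += math.log((self.token_counts[c][t] + 1) / denom)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Made-up training examples: token documents labelled yes/no.
examples = [
    (["AMINO-1A.", "AMINO0R.", "AMINO1H."], "yes"),
    (["AMINO-1A.", "AMINO0R.", "AMINO1Q."], "yes"),
    (["AMINO-1L.", "AMINO0L.", "AMINO1F."], "no"),
    (["AMINO-1F.", "AMINO0A.", "AMINO1T."], "no"),
]
model = NaiveBayes().fit([d for d, _ in examples], [y for _, y in examples])
print(model.predict(["AMINO-1A.", "AMINO0R.", "AMINO1Q."]))
print(model.predict(["AMINO-1L.", "AMINO0L.", "AMINO1T."]))
```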
38Bioinformatics
- Biologists know proteins, computer scientists
know machine learning
- Together, they can find hidden and potentially
useful information about genes and proteins
- Biotechnology is a multi-billion dollar industry
- Biotechnology is one of the best funded areas of
scientific research
- Shortage of people educated in bioinformatics
39The University of Waikato
- Waikato University is ranked first in the country
in computer science and in molecular, cellular,
and whole-organism biology
- Centre of the universe for machine learning
40The University of Waikato
- If you're interested in getting involved in
bioinformatics, or indeed any other area along
the leading edge of computer science and/or
biology, then
- Waikato wants You!