Bioinformatics The Prediction of Life - PowerPoint PPT Presentation

About This Presentation
Title:

Bioinformatics The Prediction of Life

Description:

Bioinformatics The Prediction of Life Tony C Smith Department of Computer Science University of Waikato tcs_at_cs.waikato.ac.nz – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Bioinformatics The Prediction of Life


1
Bioinformatics: The Prediction of Life
  • Tony C Smith
  • Department of Computer Science
  • University of Waikato
  • tcs_at_cs.waikato.ac.nz

2
Bioinformatics
  • Computation with biological data
  • Data: genes, proteins, microarrays, mass spectra,
    written documents, populations of organisms
  • Goal: knowledge discovery

3
The essence is prediction
  • My dog is very littl_ ?
  • We know that letters do not occur in English at
    random; not all letters are equally common (e.g.
    "e" is more common than "x")
  • We know that context changes the probability of a
    letter (e.g. what's the most likely letter after
    the sequence "I eat Weet-Bi_"?)
  • Prediction is important in many applications
    (e.g. encryption, compression, communication,
    graphics, simulation and bioinformatics!)
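
The idea that context changes the probability of the next letter can be sketched with a tiny context-counting model. This is only an illustration, not the method used in the talk: the function names, the toy corpus and the context length of four letters are all assumptions.

```python
from collections import Counter, defaultdict

def train_context_model(text, order=3):
    """Count which letter follows each `order`-letter context in `text`."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def predict_next(counts, context, order=3):
    """Return the most frequent follower of the last `order` letters, or None."""
    followers = counts.get(context[-order:])
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = "my dog is very little. the little dog is very old. a little cat."
model = train_context_model(corpus, order=4)
print(predict_next(model, "my dog is very littl", order=4))
```

A longer context usually gives sharper predictions but needs more training text before each context has been seen often enough.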

4
Prediction in bioinformatics
  • Predicting the location of genes in DNA
  • Predicting the function of proteins
  • Predicting diseases from molecular samples
  • Predicting population dynamics
  • Anything that involves making a judgment,
    typically expressible as a yes/no decision about
    some sample datum

5
Representation
  • W e e t B i x
  • 0101011101100101011001010111010000101101
  • to the computer, everything is binary!
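
Assuming plain 8-bit ASCII, the bit string above decodes to "Weet-", the first five characters of "Weet-Bix". A minimal sketch of the conversion in both directions (the names `to_bits` and `from_bits` are illustrative, not from the talk):

```python
def to_bits(text):
    """Encode text as ASCII and return the bits as a string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

def from_bits(bits):
    """Decode a string of 0s and 1s (8 bits per character) back into text."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")

print(to_bits("Weet-"))
print(from_bits("0101011101100101011001010111010000101101"))
```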

6
  • 0101011101100101011001010111010000101101
  • 0101101100100111111011010011010000101101
  • A A C G T C A T T C G A T G A T T C G A
  • Just as we can teach a computer to predict
    things about a sequence of letters in English
    prose, we can also teach it to predict things
    about other sequences, like a genetic sequence
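
Because DNA has only four letters, each nucleotide fits in two bits, so 20 nucleotides take 40 bits. The sketch below uses an assumed mapping (A=00, C=01, G=10, T=11); the slide does not state which mapping it used, so the output is not expected to reproduce the bit string shown above.

```python
# Assumed 2-bit code for illustration only; the talk does not specify one.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}

def dna_to_bits(seq):
    """Pack a nucleotide string into a string of bits, two bits per base."""
    return "".join(CODE[base] for base in seq.upper())

def bits_to_dna(bits):
    """Unpack a bit string (two bits per base) back into nucleotides."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

encoded = dna_to_bits("AACGTCATTCGATGATTCGA")
print(encoded, len(encoded))
print(bits_to_dna(encoded))
```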

7
A genetic prediction problem
  • ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
    cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
    gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
    gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
    ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
    tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
    cgaactcgcatcagc

8
A genetic prediction problem
  • ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
    cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
    gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
    gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
    ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
    tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
    cgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatatt
    cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
    tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
    ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
    attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
    atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
    agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
    atttattatattcccggcgcggctacgttcatcccagcagcagcgatttt
    aaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
    atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgac
    gcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
    cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttg
    cgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcg
    ctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagc
    agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
    ttattcacgctaatggacgacatcttttactacgacggcgcctacgcatc
    gcagcatacgacgcccagcatagtattttagaggcgaggacatcatcata
    tcgcagctacagcgcatcagacgcatacgacgacgactacgacgacacta
    acgacgatgttgcgcacccacaccagttatatagagacgaactcgcatca
    gtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctac
    gttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcg
    cgttcgtcgcctttattcacgctaatggacgacatcttttactacgacgg
    cgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgag
    gacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgact
    acgacgacactaacgacgatgttgcgcacccacaccagttatatagagac
    gaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcc
    cggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatc
    agactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctt
    ttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtat
    tttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
    acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
    ttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaat
    ttattatattcccggcgcggctacgttcatcccagcagcagcgattttaa
    aattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaat
    ggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgc
    ccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
    catcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
    cacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
    acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
    cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
    attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
    agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
    gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
    gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagt
    gttgcgcacccacaccagttatatagagacgaactc

9
A genetic prediction problem
  • ttgcaatcggcgctacgcttcaaaatttattatattcccggcgcggcta
    cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
    gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
    gcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcga
    ggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgac
    tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
    cgaactcgcatcagctgcaatcggcgctacgcttcaaaatttattatatt
    cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
    tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
    ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
    attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
    atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
    agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
    atttattatattcccggcgcggctacgttcatcccagcagcagcgatttt
    aaaattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgcta
    atggacgacatcttttactacgacggcgcctacgcatcgcagcatacgac
    gcccagcatagtattttagaggcgaggacatcatcatatcgcagctacag
    cgcatcagacgcatacgacgacgactacgacgacactaacgacgatgttg
    cgcacccacaccagttatatagagacgaactcgcatcagtgcaatcggcg
    ctacgcttcaaaatttattatattcccggcgcggctacgttcatcccagc
    agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
    ttattcacgctaatggacgacatcttttactacgacggcgcctacgcatc
    gcagcatacgacgcccagcatagtattttagaggcgaggacatcatcata
    tcgcagctacagcgcatcagacgcatacgacgacgactacgacgacacta
    acgacgatgttgcgcacccacaccagttatatagagacgaactcgcatca
    gtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctac
    gttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcg
    cgttcgtcgcctttattcacgctaatggacgacatcttttactacgacgg
    cgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgag
    gacatcatcatatcgcagctacagcgcatcagacgcatacgacgacgact
    acgacgacactaacgacgatgttgcgcacccacaccagttatatagagac
    gaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcc
    cggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatc
    agactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctt
    ttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtat
    tttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcat
    acgacgacgactacgacgacactaacgacgatgttgcgcacccacaccag
    ttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaat
    ttattatattcccggcgcggctacgttcatcccagcagcagcgattttaa
    aattaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaat
    ggacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgc
    ccagcatagtattttagaggcgaggacatcatcatatcgcagctacagcg
    catcagacgcatacgacgacgactacgacgacactaacgacgatgttgcg
    cacccacaccagttatatagagacgaactcgcatcagtgcaatcggcgct
    acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
    cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
    attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
    agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
    gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
    gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagt
    gttgcgcacccacaccagttatatagagacgaactcttgcaatcggcgct
    acgcttcaaaatttattatattcccggcgcggctacgttcatcccagcag
    cagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttt
    attcacgctaatggacgacatcttttactacgacggcgcctacgcatcgc
    agcatacgacgcccagcatagtattttagaggcgaggacatcatcatatc
    gcagctacagcgcatcagacgcatacgacgacgactacgacgacactaac
    gacgatgttgcgcacccacaccagttatatagagacgaactcgcatcagc
    tgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacg
    ttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgc
    gttcgtcgcctttattcacgctaatggacgacatcttttactacgacggc
    gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgagg
    acatcatcatatcgcagctacagcgcatcagacgcatacgacgacgacta
    cgacgacactaacgacgatgttgcgcacccacaccagttatatagagacg
    aactcgcatcagtgcaatcggcgctacgcttcaaaatttattatattccc
    ggcgcggctacgttcatcccagcagcagcgattttaaaattaacgcatca
    gactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttt
    tactacgacggcgcctacgcatcgcagcatacgacgcccagcatagtatt
    ttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgcata
    cgacgacgactacgacgacactaacgacgatgttgcgcacccacaccagt
    tatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatt
    tattatattcccggcgcggctacgttcatcccagcagcagcgattttaaa
    attaacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatg
    gacgacatcttttactacgacggcgcctacgcatcgcagcatacgacgcc
    cagcatagtattttagaggcgaggacatcatcatatcgcagctacagcgc
    atcagacgcatacgacgacgactacgacgacactaacgacgatgttgcgc
    acccacaccagttatatagagacgaactcgcatcagtgcaatcggcgcta
    cgcttcaaaatttattatattcccggcgcggctacgttcatcccagcagc
    agcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgccttta
    ttcacgctaatggacgacatcttttactacgacggcgcctacgcatcgca
    gcatacgacgcccagcatagtattttagaggcgaggacatcatcatatcg
    cagctacagcgcatcagacgcatacgacgacgactacgacgacactaacg
    acgatgttgcgcacccacaccagttatatagagacgaactcgcatcagtg
    caatcggcgctacgcttcaaaatttattatattcccggcgcggctacgtt
    catcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgt
    tcgtcgcctttattcacgctaatggacgacatcttttactacgacggcgc
    ctacgcatcgcagcatacgacgcccagcatagtattttagaggcgaggac
    atcatcatatcgcagctacagcgcatcagacgcatacgacgacgactacg
    acgacactaacgacgatgttgcgcacccacaccagttatatagagacgaa
    ctcgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccgg
    cgcggctacgttcatcccagcagcagcgattttaaaattaacgcatcaga
    ctctcgtcgcgttcgtcgcctttattcacgctaatggacgacatctttta
    ctacgacggcgcctacgcatcgcagcatacgacgcccagcatagtatttt
    agaggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacg
    acgacgactacgacgacactaacgacgatgttgcgcacccacaccagtta
    tatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaattta
    ttatattcccggcgcggctacgttcatcccagcagcagcgattttaaaat
    taacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatgga
    cgacatcttttactacgacggcgcctacgcatcgcagcatacgacgccca
    gcatagtattttagaggcgaggacatcatcatatcgcagctacagcgcat
    cagacgcatacgacgacgactacgacgacactaacgacgatgttgcgcac
    ccacaccagttatatagagacgaactcgcatcagtgttgcgcacccacac
    cagttatatagagacgaactc

10
A genetic prediction problem
  • A gene encodes a protein
  • It is a blueprint that provides biochemical
    instructions on how to construct a sequence of
    amino acids so as to make a working protein that
    will perform some function in the organism

11
A genetic prediction problem

encoding region
untranslated region
12
A genetic prediction problem

untranslated region
13
A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
untranslated region
14
A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
What transcription factors bind to this gene?
Where is the transcription factor binding site?
15
A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
A binding site is often a short, general
pattern, e.g. CCGATNATCGG (N stands for any nucleotide)
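
Searching for such a pattern is a standard string-matching task. A minimal sketch using regular expressions, where N is treated as "any nucleotide"; the function names and the toy sequence are hypothetical and chosen so that the pattern occurs twice:

```python
import re

def pattern_to_regex(pattern):
    """Translate a motif into a regex, treating N as 'any nucleotide'."""
    return "".join("[ACGT]" if ch == "N" else ch for ch in pattern.upper())

def find_sites(sequence, pattern):
    """Return the 0-based start positions of every (possibly overlapping) match."""
    regex = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

# Toy sequence (hypothetical), not taken from the slides.
toy = "ttaccgatcatcggaaccgatgatcggtt"
print(find_sites(toy, "CCGATNATCGG"))
```

The lookahead `(?=...)` is used so that overlapping occurrences are all reported rather than only non-overlapping ones.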
16
A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
The patterns are often reverse complements, e.g.
CCGATNATCGG
GGCTANTAGCC
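
The relationship between the two strings can be checked mechanically: GGCTANTAGCC is the base-by-base complement of CCGATNATCGG, and reading that complement backwards gives the original pattern again. A short sketch (function names are illustrative):

```python
# A pairs with T, C pairs with G; N (any base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def complement(seq):
    """Return the base-by-base complement of a nucleotide string."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

print(complement("CCGATNATCGG"))
print(reverse_complement("CCGATNATCGG"))
```

A pattern that equals its own reverse complement reads the same on both strands, which is why checking a candidate site against `reverse_complement` is a cheap filter.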
17
A genetic prediction problem

ttgcaatcggcgctacgcttcaaaatttattatattcccggc
Clues
Where there is one binding site, often there is
another nearby.
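
One way to use this clue is to keep only candidate sites that have a neighbour within some distance. The sketch below is an assumption about how that could be done; the `max_gap` threshold and the example positions are hypothetical.

```python
def nearby_pairs(positions, max_gap=50):
    """Return pairs of site positions that lie within `max_gap` bases of each other."""
    ordered = sorted(positions)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second - first > max_gap:
                break  # positions are sorted, so later ones are even further away
            pairs.append((first, second))
    return pairs

# Hypothetical candidate site positions along a sequence.
print(nearby_pairs([3, 16, 400, 430, 900], max_gap=50))
```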
18
A genetic prediction problem

All of these properties are the kinds of things
for which computer science has developed
algorithms and data structures to identify
quickly and efficiently, and therefore it is
exactly the kind of problem computer scientists
should be able to solve.
19
proteomics
Three consecutive nucleotides in the coding
region form a codon, i.e. encode an amino
acid. A string of amino acids makes a protein.
3 nucleotides, 4 possibilities for each, so
4^3 = 64 possible codons. But there are only
20 amino acids!
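
The count of 64 can be confirmed by enumerating every three-letter combination of the four nucleotides. A minimal sketch:

```python
from itertools import product

# Every ordered choice of 3 letters from A, C, G, T: 4 * 4 * 4 combinations.
codons = ["".join(triple) for triple in product("ACGT", repeat=3)]

print(len(codons))
print(codons[:4])
```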
20
proteomics
There is quite a bit of redundancy in codons.
  • Glycine: GGA, GGC, GGG, GGT
  • Tyrosine: TAT, TAC
  • Methionine: ATG
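
The redundancy can be expressed as a lookup table in which several codons map to the same amino acid. The sketch below contains only the three amino acids listed on the slide, so it is deliberately incomplete; the function name and the example sequence are illustrative.

```python
# Partial codon table: only the entries named on the slide.
CODON_TABLE = {
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
    "TAT": "Tyr", "TAC": "Tyr",
    "ATG": "Met",
}

def translate(dna):
    """Translate a coding sequence codon by codon; unknown codons become '???'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE.get(codon, "???") for codon in codons]

print(translate("ATGGGATATGGCTAC"))
```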
21
Amino Acid
R group
Amino group
Carboxyl group
22
Amino Acid
tyrosine
glycine
23
Primary structure: MSALVSTTPSLLAGVRNVDB ..
24
Tertiary Structure
25
Secondary Structure
26
Signal peptide
  • A relatively short sequence of amino acid residues at
    the N-terminus of the nascent protein
  • typically 15-50 residues
  • MAGPRPSPWARLLLAALISVSLSGTLARCKKAPVSKKCETCVGQAALT
    GL
  • Cleaved off as protein passes through membrane
    (operates like a pass key)
  • Knowing the signal peptide helps determine protein
    function in the organism

27
How do we do it?
  • see any patterns? ttgcaatcggcgctacgcttcaaaatttat
    tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaatt
    aacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggac
    gacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccag
    catagtattttagaggcgaggacatcatcatatcgcagctacagcgcatc
    agacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacc
    cacaccagttatatagagacgaactcgcatcagctgcaatcggcgctacg
    cttcaaaatttattatattcccggcgcggctacgttcatcccagcagcag
    cgattttaaaatttcgcctttattcacgctaatggacgacatcttttact
    acgacggcgcctacgcatcgcagcatacgacgcccacgcccagcatagta
    ttttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgca
    tacgacgacgactacgacgacactaacgacgatgttgcgcacccacacca
    gttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaa
    tttattatagcatagtattttagaggcgaggacatcatcatatcgcagct
    acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgat
    gttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatc
    ggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcc
    cagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtc
    gcctttattcacgctaatggacgacatcttttactacgacggcgcctacg
    catcgcagcatacgacgcccagcatagtattttagaggcgaggacatcat
    catatcgcagctacagcgcatcagacgcatacgacgacgactacgacgac
    actaacgacgatgttgcgcacccacaccagttatatagagacgaactcgc
    atcagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcgg
    ctacgttcatcccagcagcagcgattttaaaattaacgcatcagactctc
    gtaacgcatcagactctcgtcgcgttcgcgcgttcgtcgcctttattcac
    gctaatggacgacatcttttactacgacggcgcctacgcatcgcagcata
    cgacgcccagcatagtattttagaggcgaggacatcatcatatcgcagct
    acagcgcatcagacgcatacgacgacgactacgacgacactaacgacgat
    gttgcgcacccacaccagttatatagagacgaactcgcatcagtgcaatc
    ggcgctacgcttcaaaatttattatattcccggcgcggctacgttcatcc
    cagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtc
    gcctttattcacgctaatggacgacatcttttactacgacggcgcctacg
    catcgcagcatacgacgcccagcatagtattttagaggcgaggacatcac
    tacgacgacactaacgacgatgttgcgcacccacaccagttatatagaga
    cgaactcgcatcagtgctacgcttcaaaatttattatattcccggcggca
    atcggcgctacgcttcaaaatttattatattcccggcgcggctacgttca
    tcccagcagcagcgattttaaaattaacgcatcagactctcgtcgcgttc
    gtcgcctttattcacgctaatggacgacatcttttactacgacggcgcct
    acgcatcgcagcatacgacgcccagcatagtattttagaggcgaggacat
    catcatatcgcagctacagcgcatcagacgcatacgacgacgactacgac
    gacactaacgacgatgttgcgcacccacaccagttatatagagacgaact
    cgcatcagtgcaatcggcgctacgcttcaaaatttattatattcccggcg
    cggctacgttcatcccagcagcagcgattttaaaattaacgcatcagact
    ctcgtcgcgttcgtcgcctttattcacgctaatggacgacatcttttact
    acgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttag
    aggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgac
    gacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
    tagagacgaactcgcatcagtgcaatcggcgctacgcttcaaaatttatt
    atattcccggcgcggctacgttcatcccagcagcagcgattttaaaatta
    acgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggacg
    acatcttttactacgacggcgcctacgcatcgcagcatacgacgcccagc
    atagtattttagaggcgaggacatcatcatatcgcagctacagcgcatca
    gacgcatacgacgacgactacgacgacactaacgacgatgttgcgcaccc
    acaccagttatatagagacgaactcgcatcagtgttgcgcacccacacca
    gttatatagagacgaactcttgcaatcggcgctacgcttcaaaatttatt
    atattcccggcgcggctacgttcatcccagcagcagcgattttaaaatta
    acgcatcagactctcgtcgcgttcgtcgcctttatttattatattcccgg
    cgcggctacgttcatcccagcattcacgctaatggacgacatcttttact
    acgacggcgcctacgcatcgcagcatacgacgcccagcatagtattttag
    aggcgaggacatcatcatatcgcagctacagcgcatcagacgcatacgac
    gacgactacgacgacactaacgacgatgttgcgcacccacaccagttata
    tagagacgaactcgcatcagctgcaatcggcgctacgcttcaaaatttat
    tatattcccggcgcggctacgttcatcccagcagcagcgattttaaaatt
    aacgcatcagactctcgtcgcgttcgtcgcctttattcacgctaatggac
    gacatcttttactacgacggcgcctacgcatcgcagcatacgacgcccag
    catagtattttagaggcgaggacatcatcatatcgcagctacagcgcatc
    agacgcatacgacgacgactacgacgacactaacgacgatgttgcgcacc
    cacaccagttatatagagacgaactcgcatcaggacatcttttactacga
    cggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggc
    gaggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacg
    actacgacgacactaacgacgatgttgcgcacccacaccagttatataga
    tgcaatcggcgctacgcttcaaaatttattatattcccggcgcggctacg
    ttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtcgc
    gttcgtcgcctttattcacgctaatggacgacatcttttactacgacggc
    gcctacgcatcgcagcatacgacgcccagcatagtattttagaggcgagg
    acatcatcatatcgcagctactcatatcgcagctacagcgcatcagacgc
    atacgacgacgaagcgcatcagacgcatacgacgacgactacgacgacac
    taacgacgatgttgcgcacccacaccagttatatagagacgaactcgcat
    cagtgcaatcggcgctacgcttcaaaatttattatattcccggcgcggct
    acgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgt
    cgcgttcgtcgcctttattcacgctaatggacgacatcttttactacgac
    ggcgcctacgcatcgcagcatacgacgcccagcatagtattttagaggcg
    aggacatcatcatatcgcagctacagcgcatcagacgcatacgacgacga
    ctacgacgacactaacgacgatgttgcgcacccacaccagttatatagag
    acgaactcgcatcagtgcaatcggcgctacgcttcaaaatttattatatt
    cccggcgcggctacgttcatcccagcagcagcgattttaaaattaacgca
    tcagactctcgtcgcgttcgtcgcctttattcacgctaatggacgacatc
    ttttactacgacggcgcctacgcatcgcagcatacgacgcccagcatagt
    attttagaggcgaggacatcatcatatcgcagctacagcgcatcagacgc
    atacgacgacgactacgacgacactaacgacgatgttgcgcacccacacc
    agttatatagagacgaactcgcatcagtgcaatcggcgctacgcttcaaa
    agcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgcct
    ttattcacgctaatggacgacgaactcgcatcagtgcaatcggccggcta
    cgttcatcccagcagcagcgattttaaaattaacgcatcagactctcgtc
    gcgttcgtcgcctttattcacgctaatggacgacatcttttactacgacg
    gcgcctacgcatcgcagcatacgattcccggcgcggctacgttcatccca
    gcagcagcgattttaaaattaacgcatcagactctcgtcgcgttcgtcgc
    ctttattcacgctaatggacgacatcttttactacgacggcgcctacgca
    tcgcagcatacgacgcccagcatagtattttagaggcgaggacatcatca
    tatcgcagctacagcgcatcagacgcatacgacgacgactacgacgacac
    taacgacgatgttgcgcacccacaccagttatatagagacgaactcgcat
    cagtgttgcgcacccacaccagttatatagagacgaactcttagaggcga
    ggacatcatcatatcgcagctacagcgcatcagttagaggcgaggacatc
    atcatatcgcagctacagcgcatcagttagaggcgaggacatcatcatat
    cgc

28
Local biases in residues around the cleavage site
Sequence regularities can be exploited by
statistical and pattern-based models
29
Proteomic prediction
Language
  • letters combine to form words
  • words combine to form phrases
  • phrases combine to form sentences
  • sentences combine to form paragraphs (and
    ultimately Harry Potter books)
Proteins
  • amino acids combine to form peptides
  • peptides combine to form secondary motifs
    (e.g. α-helices and β-sheets)
  • motifs combine to make proteins
  • proteins combine to make toenails (and
    ultimately people)
30
Approach
  • Problem is stated as a two-class problem:
  • an amino acid is either the first residue of the
    mature protein or it is not
  • Each residue is described by a single document,
    which includes as many electrochemical,
    structural or contextual facts as are available
    (or desirable)
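
The two-class framing can be sketched as turning one protein sequence into one labelled example per residue. The sequence and the cleavage position below are hypothetical, and the function name is illustrative rather than taken from the talk.

```python
def label_residues(sequence, cleavage_index):
    """Return (position, residue, label) for each residue.

    `cleavage_index` is the 0-based index of the first residue of the
    mature protein; that residue gets label 1 and all others get 0.
    """
    return [(i, aa, 1 if i == cleavage_index else 0)
            for i, aa in enumerate(sequence)]

# Hypothetical sequence and cleavage position, for illustration only.
examples = label_residues("MKLLFATCIARHQQ", cleavage_index=10)
for position, residue, label in examples:
    print(position, residue, label)
```

Framed this way, almost all examples are negative, which is worth keeping in mind when judging a classifier's accuracy.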

31
Properties of amino acids
32
Residue as a document
  • E.g. Cysteine (Cys, C)
  • aliphatic: yes, aromatic: no, hydrophobic:
    yes, charge: -, polarized: yes, small: no,
    number of nitrogen atoms: 1, contains sulphur:
    yes, has a carbon ring: no, ionized: yes,
    valence: 2, cbeta: no, covalent: yes, h-bond:
    yes,
  • etc. (whatever else the experimenter wants to
    include)
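
A residue's facts can be held as a simple mapping and flattened into a text document. The property values below are copied from the slide; the key names, the `name=value` output format and the function name are assumptions made for this sketch.

```python
# Property values copied from the slide; key names are shortened for illustration.
CYSTEINE = {
    "aliphatic": "yes",
    "aromatic": "no",
    "hydrophobic": "yes",
    "charge": "-",
    "polarized": "yes",
    "small": "no",
    "nitrogen_atoms": 1,
    "sulphur": "yes",
    "carbon_ring": "no",
    "ionized": "yes",
    "valence": 2,
    "cbeta": "no",
    "covalent": "yes",
    "h_bond": "yes",
}

def to_document(residue, properties):
    """Flatten a residue's properties into one space-separated text document."""
    facts = [f"{name}={value}" for name, value in properties.items()]
    return " ".join([f"residue={residue}"] + facts)

doc = to_document("C", CYSTEINE)
print(doc)
```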

33
Sample document
  • PRNUM1. AANUM21.
  • AMINO-8L. ALIPH-8-. AROMA-8-.
    CBETA-8-. CHARG-8-. COVAL-8-.
    HBOND-8-. HPHOB-8. IONIZ-8-.
    NITRO-81. POLAR-8-. POSNG-80.
    SMALL-8-. SULPH-8-. TEENY-8-.
    CRING-8-. VALEN-82. AMINO-7L.
    ALIPH-7-. AROMA-7-. CBETA-7-.
    CHARG-7-. COVAL-7-. HBOND-7-.
    HPHOB-7. IONIZ-7-. NITRO-71.
    POLAR-7-. POSNG-70. SMALL-7-.
    SULPH-7-. TEENY-7-. CRING-7-.
    VALEN-72. AMINO-6F. ALIPH-6.
    AROMA-6. CBETA-6-. CHARG-6-.
    COVAL-6-. HBOND-6-. HPHOB-6.
    IONIZ-6-. NITRO-61. POLAR-6-.
    POSNG-60. SMALL-6-. SULPH-6-.
    TEENY-6-. CRING-6. VALEN-62.
    AMINO-5A. ALIPH-5-. AROMA-5-.
    CBETA-5-. CHARG-5-. COVAL-5-.
    HBOND-5-. HPHOB-5-. IONIZ-5-.
    NITRO-51. POLAR-5-. POSNG-50.
    SMALL-5. SULPH-5-. TEENY-5.
    CRING-5-. VALEN-52. AMINO-4T.
    ALIPH-4. AROMA-4-. CBETA-4.
    CHARG-4-. COVAL-4-. HBOND-4.
    HPHOB-4-. IONIZ-4-. NITRO-41.
    POLAR-4. POSNG-40. SMALL-4.
    SULPH-4-. TEENY-4-. CRING-4-.
    VALEN-42. AMINO-3C. ALIPH-3.
    AROMA-3-. CBETA-3-. CHARG-3-.
    COVAL-3. HBOND-3. HPHOB-3.
    IONIZ-3. NITRO-31. POLAR-3.
    POSNG-3-. SMALL-3-. SULPH-3.
    TEENY-3-. CRING-3-. VALEN-32.
    AMINO-2I. ALIPH-2-. AROMA-2-.
    CBETA-2. CHARG-2-. COVAL-2-.
    HBOND-2-. HPHOB-2. IONIZ-2-.
    NITRO-21. POLAR-2-. POSNG-20.
    SMALL-2-. SULPH-2-. TEENY-2-.
    CRING-2-. VALEN-22. AMINO-1A.
    ALIPH-1-. AROMA-1-. CBETA-1-.
    CHARG-1-. COVAL-1-. HBOND-1-.
    HPHOB-1-. IONIZ-1-. NITRO-11.
    POLAR-1-. POSNG-10. SMALL-1.
    SULPH-1-. TEENY-1. CRING-1-.
    VALEN-12. AMINO0R. ALIPH0. AROMA0-.
    CBETA0-. CHARG0. COVAL0-. HBOND0.
    HPHOB0-. IONIZ0. NITRO04. POLAR0.
    POSNG0. SMALL0-. SULPH0-. TEENY0-.
    CRING0-. VALEN03. AMINO1H. ALIPH1.
    AROMA1. CBETA1-. CHARG1. COVAL1-.
    HBOND1. HPHOB1-. IONIZ1. NITRO13.
    POLAR1. POSNG1. SMALL1-. SULPH1-.
    TEENY1-. CRING1. VALEN13. AMINO2Q.
    ALIPH2. AROMA2-. CBETA2-. CHARG2-.
    COVAL2-. HBOND2. HPHOB2-. IONIZ2-.
    NITRO22. POLAR2. POSNG20. SMALL2-.
    SULPH2-. TEENY2-. CRING2-. VALEN22.
    AMINO3Q. ALIPH3. AROMA3-. CBETA3-.
    CHARG3-. COVAL3-. HBOND3. HPHOB3-.
    IONIZ3-. NITRO32. POLAR3. POSNG30.
    SMALL3-. SULPH3-. TEENY3-. CRING3-.
    VALEN32. AMINO4R. ALIPH4. AROMA4-.
    CBETA4-. CHARG4. COVAL4-. HBOND4.
    HPHOB4-. IONIZ4. NITRO44. POLAR4.
    POSNG4. SMALL4-. SULPH4-. TEENY4-.
    CRING4-. VALEN43. AMINO5Q. ALIPH5.
    AROMA5-. CBETA5-. CHARG5-. COVAL5-.
    HBOND5. HPHOB5-. IONIZ5-. NITRO52.
    POLAR5. POSNG50. SMALL5-. SULPH5-.
    TEENY5-. CRING5-. VALEN52. AMINO6Q.
    ALIPH6. AROMA6-. CBETA6-. CHARG6-.
    COVAL6-. HBOND6. HPHOB6-. IONIZ6-.
    NITRO62. POLAR6. POSNG60. SMALL6-.
    SULPH6-. TEENY6-. CRING6-. VALEN62.
    AMINO7Q. ALIPH7. AROMA7-. CBETA7-.
    CHARG7-. COVAL7-. HBOND7. HPHOB7-.
    IONIZ7-. NITRO72. POLAR7. POSNG70.
    SMALL7-. SULPH7-. TEENY7-. CRING7-.
    VALEN72. AMINO8Q. ALIPH8. AROMA8-.
    CBETA8-. CHARG8-. COVAL8-. HBOND8.
    HPHOB8-. IONIZ8-. NITRO82. POLAR8.
    POSNG80. SMALL8-. SULPH8-. TEENY8-.
    CRING8-. VALEN82. MULT37. MULT54.
    MULT73. MULT92. 2GRAMIA. GRAM2HQ. 3GRAMCIA.
    GRAM3HQQ.
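
The tokens above appear to follow the shape feature name, then offset from the residue of interest, then value, although the talk does not spell the format out. Under that assumption, the AMINO and GRAM tokens can be regenerated from the 17-residue window read off the AMINO entries. The function names are hypothetical.

```python
def window_tokens(sequence, center, radius=8):
    """Emit one AMINO token per residue within `radius` of `center`."""
    tokens = []
    for offset in range(-radius, radius + 1):
        i = center + offset
        if 0 <= i < len(sequence):
            tokens.append(f"AMINO{offset}{sequence[i]}.")
    return tokens

def gram_tokens(sequence, center, n):
    """Emit the n residues before `center` and the n residues after it."""
    before = sequence[max(0, center - n):center]
    after = sequence[center + 1:center + 1 + n]
    return [f"{n}GRAM{before}.", f"GRAM{n}{after}."]

# Residues at offsets -8..8, read off the AMINO tokens in the sample document.
window = "LLFATCIARHQQRQQQQ"
tokens = window_tokens(window, center=8)
tokens += gram_tokens(window, 8, 2) + gram_tokens(window, 8, 3)
print(" ".join(tokens))
```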

34
Artificial Intelligence
  • Computers do things only human brains can
    otherwise do

expert
expert
35
Artificial Intelligence
  • Computers do things only human brains can
    otherwise do

expert system
expert
36
Artificial Intelligence
  • Computers do things only human brains can
    otherwise do

learning system
expert system
37
Machine learning
What is machine learning?
  • creating computer programs that get better with
    experience
  • learn how to make expert judgments
  • discover previously hidden, potentially useful
    information (data mining)

How does it work?
  • user provides learning system with examples of
    concept to be learned
  • induction algorithm infers a characteristic model
    of the examples
  • model is used to predict whether or not future
    novel instances are also examples, and it does
    this very consistently, and very, very quickly!
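
The three steps (examples in, model inferred, novel instances classified) can be sketched with a small naive Bayes classifier over token documents. Naive Bayes is used here only as a stand-in; the talk does not say which induction algorithm was used, and the training examples are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Infer a model from (tokens, label) examples: per-class token counts."""
    token_counts, class_counts, vocab = {}, Counter(), set()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return token_counts, class_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log-probability (add-one smoothing)."""
    token_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training examples: is the residue at offset 0 a cleavage site?
examples = [
    (["AMINO-1A", "AMINO0R"], "yes"),
    (["AMINO-1A", "AMINO0Q"], "yes"),
    (["AMINO-1L", "AMINO0L"], "no"),
    (["AMINO-1F", "AMINO0L"], "no"),
]
model = train(examples)
print(predict(model, ["AMINO-1A", "AMINO0R"]))
print(predict(model, ["AMINO-1L", "AMINO0L"]))
```

Once trained, the model applies the same rule to every new instance, which is where the consistency and speed come from.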

38
Bioinformatics
  • Biologists know proteins, computer scientists
    know machine learning
  • Together, they can find hidden and potentially
    useful information about genes and proteins
  • Biotechnology is a multi-billion dollar industry
  • Biotechnology is one of the best funded areas of
    scientific research
  • Shortage of people educated in bioinformatics

39
The University of Waikato
  • Waikato University is ranked first in the country
    in computer science and in molecular, cellular,
    and whole-organism biology
  • centre of the universe for machine learning

40
The University of Waikato
  • If you're interested in getting involved in
    bioinformatics, or indeed any other area along
    the leading edge of computer science and/or
    biology, then
  • Waikato wants You!