Building a dictionary for genomes

Provided by: tal8
Learn more at: https://www.sns.ias.edu
Transcript and Presenter's Notes

1
Building a dictionary for genomes
  • By Harmen J. Bussemaker, Hao Li, and Eric D.
    Siggia

Tal Frank
2
Topics that will be discussed
  • Biological background
  • Presenting the biological problem
  • An algorithm that addresses this problem
  • Statistical mechanics methods
  • Trying the algorithm on two well-known problems

3
What we did so far
  • Human Genome Project (2001)
  • This article was published in 2000: sequence is not
    everything, let's do some theory
  • Control over gene expression: when, how much
  • Control element = regulator = sequence motif
  • Genes work together: co-regulated genes

4
The goals of this work
  • Identify the control elements
  • Where are they located?
  • Identify co-regulated genes

5
Multiple control elements
  • Example: where are the control elements located?
  • Concepts: directionality, upstream, in the junk
  • TACGAXTTCGA
  • Example: co-regulated genes
  • Naïve approach: TACGAXTTTAAYATGGCA
  • Experimentally: TACGAXTTCGAYATGGCA
  • To activate a set of genes, multiple sequences
    are needed

6
New terminology
  • DNA = string of letters
  • Control element = word
  • Multiple control elements = sentence
  • Genes and junk = background noise
  • Example: S = GAGCXTGGYGCTT
  • Words: GA, TG
  • Sentence: GA.TG
  • Background: genes and junk.

7
MobyDick algorithm
  • Decipher a text consisting of a long string
    of letters written in an unknown language.
  • Find the words in the text
  • Find the right spacing
  • Example: D = {A, T, AT}, S = ATT
  • P1 = A.T.T
  • P2 = AT.T
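
The two spacings P1 and P2 in the example above can be found mechanically. The following is a minimal sketch, not the presenters' code; the function name `parsings` is a hypothetical helper introduced only for illustration.

```python
def parsings(text, dictionary):
    """Return every way to split `text` into words taken from `dictionary`."""
    if not text:
        return [[]]  # the empty string has exactly one spacing: no words
    results = []
    for word in dictionary:
        if text.startswith(word):
            # keep this word, then space the remainder recursively
            for rest in parsings(text[len(word):], dictionary):
                results.append([word] + rest)
    return results


D = ["A", "T", "AT"]
for spacing in parsings("ATT", D):
    print(".".join(spacing))
# prints A.T.T and then AT.T
```

Plain enumeration like this grows quickly with the length of the text, so it is only meant to make the notion of a spacing concrete on toy strings.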

8
How would you do it?
  • 1. Look for repeated substrings in the string
  • went, to, he → D (dictionary)
  • 2. Space the text. Oops: spacing is not that
    simple.
  • e.g. D = {A, T, AT}, S = ATT
  • P1 = A.T.T → p1
  • P2 = AT.T → p2

Tal went to Weizmann this morning. When he
arrived he didn't go to his office, he went to
drink a cup of coffee.
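
Step 1 can be tried directly on the example sentence. The sketch below is an illustration under assumptions: the helper `repeated_substrings` and its length and count thresholds are hypothetical choices, not part of the presentation.

```python
from collections import Counter


def repeated_substrings(text, min_len=2, max_len=4, min_count=2):
    """Count every substring of length min_len..max_len and keep the repeated ones."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


sentence = ("Tal went to Weizmann this morning. When he arrived he didn't "
            "go to his office, he went to drink a cup of coffee.")
# Remove spaces and punctuation so that only a string of letters remains.
letters = "".join(ch.lower() for ch in sentence if ch.isalpha())
candidates = repeated_substrings(letters)
print("went" in candidates, "to" in candidates, "he" in candidates)
print("ntto" in candidates)  # a repeated fragment that is not a word
```

Real words such as "went" are recovered, but so are fragments such as "ntto" that straddle two words, which is one way to see why counting repeats alone does not settle the spacing.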
9
MobyDick blueprint
  • Start with one-letter words: S = TAGATAT, D = {T, A, G}
  • Find pw = {pA, pT, ...}
  • Add two-letter words: S = TAGATAT, D = {A, TA, ...}
  • Find pw = {pA, pTA, ...}
  • No more optional words → stop!
  • Find the spacing: S = TA.G.A.TA.T
10
Statistical mechanics: in order to do what?
  • 1. How does MobyDick decide pw?
  • 2. When does MobyDick add a new word?
  • 3. How to space (parse) the text.

11
The likelihood function
  • k: a possible spacing
  • Nw: number of times the word w appears
  • Example: D = {T, AT, A}, S = TATA
  • k1 = T.A.T.A
  • k2 = T.AT.A

12
Likelihood function - intuition
  • Z(D, pw): the partition function (<E>, <N>, <T>, ...)
  • Z(D, pw): the probability of obtaining a
    sequence S.
  • Example: D = {T, AT, A}, pw = {pT, pA, pAT}
  • Question: what is the probability of obtaining S = TATA?
  • 1st possibility: T.A.T.A → pA pA pT pT
  • 2nd possibility: T.AT.A → pT pAT pA
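
Adding the two possibilities gives the probability of the whole sequence, which suggests reading Z as a sum over all spacings k of the product of the pw of the words used in k. The sketch below computes Z that way by brute force. It is an illustration, not the presenters' implementation: `parsings` and `partition_function` are hypothetical names, and the numerical values of pw are made up.

```python
from math import prod


def parsings(text, dictionary):
    """Return every way to split `text` into words taken from `dictionary`."""
    if not text:
        return [[]]
    return [
        [word] + rest
        for word in dictionary
        if text.startswith(word)
        for rest in parsings(text[len(word):], dictionary)
    ]


def partition_function(text, p):
    """Z: sum over all spacings of the product of the word probabilities."""
    return sum(prod(p[w] for w in k) for k in parsings(text, list(p)))


# Illustrative probabilities (an assumption, not values from the slides).
p = {"T": 0.4, "A": 0.4, "AT": 0.2}
Z = partition_function("TATA", p)
# T.A.T.A contributes pT*pA*pT*pA, T.AT.A contributes pT*pAT*pA.
print(Z)
```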

13
Finding pw
Given D and S:
Maximize Z(D, pw) with respect to pw.
This pw gives the highest probability of obtaining the given S.
14
Let's find the pw!
  • Definition: <Nw> is the average number of occurrences of
    the word w over the different spacings.
  • It can be proven that maximizing Z reduces to solving
    an equation for pw.
  • The equation is solved by iteration: pw → <Nw> → pw
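
The iteration can be sketched as follows. The slide's equation did not survive in the transcript, so the update used here, pw = <Nw> divided by the sum of <Nw'> over all words, is an assumption (the usual re-estimation step for this kind of model), and every function name is hypothetical. The averages are computed by brute-force enumeration, which is only practical for toy strings that can be spaced with the given dictionary.

```python
from collections import Counter
from math import prod


def parsings(text, dictionary):
    """Return every way to split `text` into words taken from `dictionary`."""
    if not text:
        return [[]]
    return [
        [word] + rest
        for word in dictionary
        if text.startswith(word)
        for rest in parsings(text[len(word):], dictionary)
    ]


def expected_counts(text, p):
    """Return Z and <Nw>: word counts averaged over all spacings,
    each spacing weighted by its probability."""
    weighted = Counter()
    z = 0.0
    for k in parsings(text, list(p)):
        weight = prod(p[w] for w in k)
        z += weight
        for w, n in Counter(k).items():
            weighted[w] += weight * n
    return z, {w: weighted[w] / z for w in p}


def fit(text, p, max_iter=100, tol=1e-12):
    """Iterate pw -> <Nw> -> pw until Z stops increasing."""
    z, avg = expected_counts(text, p)
    for _ in range(max_iter):
        total = sum(avg.values())
        new_p = {w: avg[w] / total for w in p}  # assumed update rule
        new_z, new_avg = expected_counts(text, new_p)
        if new_z - z <= tol:  # the new pw does not give a higher Z
            break
        p, z, avg = new_p, new_z, new_avg
    return p, z


start = {"T": 1 / 3, "A": 1 / 3, "AT": 1 / 3}
p_fit, z_fit = fit("TATA", start)
print(p_fit, z_fit)
```

Starting from equal probabilities, Z for S = TATA is 4/81, and each accepted step can only raise it.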
15
Enough is enough!!!
  • When is pw good enough?
  • When the new pw no longer gives a higher Z
  • We say this method converges!
  • Other methods don't converge.

16
Why find pw this way?
  • Monte Carlo methods don't converge.
  • Slow method → can be transformed into a fast method
  • Order of complexity: O(L·D·l)
  • L: the length of the string
  • D: the size of the dictionary
  • l: the length of the longest word in D
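
One standard way to reach a cost of this order is a forward recursion over prefixes of the string instead of listing every spacing. The sketch below is an assumption about how such a fast method could look, not a transcription of the presenters' code; `partition_function_dp` is a hypothetical name and the pw values are made up.

```python
def partition_function_dp(text, p):
    """Compute Z with a forward recursion.

    z[i] is the total probability of all spacings of text[:i].
    """
    z = [0.0] * (len(text) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) spacing
    for i in range(1, len(text) + 1):      # L positions
        for word, pw in p.items():         # D dictionary words
            n = len(word)
            # comparing a word with the text costs at most l steps
            if n <= i and text[i - n:i] == word:
                z[i] += pw * z[i - n]
    return z[len(text)]


p = {"T": 0.4, "A": 0.4, "AT": 0.2}  # illustrative values only
print(partition_function_dp("TATA", p))
```

For S = TATA this gives the same value as summing the two spacings T.A.T.A and T.AT.A by hand, but the work grows linearly with the length of the string.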

17
Adding new words
  • Look at the dictionary: D = {T, A, C, G}, S = TATTGA
  • Compose a new word ww from dictionary words: ww = TA
  • Check its occurrence in S
  • Yes: add it to the dictionary, D = {T, A, C, G, TA}
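
The word-adding step can be sketched as below. The slide only says "check occurrence", so the acceptance rule used here (a minimum raw count, `min_count`) is a hypothetical stand-in for whatever test the algorithm really applies, and the function names are invented for illustration.

```python
from itertools import product


def count_occurrences(text, word):
    """Count occurrences of `word` in `text`, overlapping ones included."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))


def propose_words(text, dictionary, min_count=1):
    """Juxtapose every ordered pair of dictionary words and keep the
    candidates that occur in the text at least `min_count` times."""
    candidates = {w1 + w2 for w1, w2 in product(dictionary, repeat=2)}
    return sorted(
        c for c in candidates
        if c not in dictionary and count_occurrences(text, c) >= min_count
    )


D = ["T", "A", "C", "G"]
S = "TATTGA"
new_words = propose_words(S, D)
print(new_words)  # TA is among the accepted candidates
```

With a threshold this low every pair that appears in S is accepted; a realistic criterion would compare the observed count with what the current pw predict.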
18
A problem and a bad solution
  • The algorithm finds only the words which are
    composed from words already in the dictionary.
  • Example: S = AATATAAA
  • 1st step: S = AATATAAA, D = {A}
  • 2nd step: S = AATATAAA
  • AT is not a composition of dictionary words
  • Solution: look for long repeated strings,
    keeping this problem in mind

19
Spacing
  • Define Nw(k): the number of times the word w occurs
    in a given spacing k.
  • Quality factor
  • The required condition

20
Checking the algorithm
  • Applying it to the English novel Moby Dick
  • Applying it to control elements in the yeast genome
  • Not always possible: Voynich manuscript (1450)

21
Preparing the book MobyDick
  • Original text: Call me Ishmael. Some years ago- never mind how
    long precisely- having little or no money in my purse, and
    nothing particular to ... thought I would sail
  • Spaces and punctuation removed:
    CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavingli
    ttleornomoneyinmypurseandnothingparticulartothoughtIwouldsail..
  • Random background letters inserted:
    CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlon
    Eciselyhavinglittlermsdrornomoneyinmypurseandnothingparticu
    artothoughtIwouldsail
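
The preparation shown above can be imitated in a few lines. This is a sketch under assumptions: the helper names, the choice of lowercase noise letters and the maximum noise length `max_noise` are all hypothetical, and the presenters' actual procedure may differ.

```python
import random
import string


def strip_to_letters(text):
    """Drop spaces and punctuation, keeping only the letters."""
    return "".join(ch for ch in text if ch.isalpha())


def add_background(text, rng, max_noise=6):
    """Concatenate the words, inserting a random run of letters after each one."""
    pieces = []
    for word in text.split():
        letters = strip_to_letters(word)
        if not letters:
            continue  # token was pure punctuation
        noise = "".join(
            rng.choice(string.ascii_lowercase)
            for _ in range(rng.randint(0, max_noise))
        )
        pieces.append(letters + noise)
    return "".join(pieces)


sample = "Call me Ishmael. Some years ago- never mind how long precisely"
print(strip_to_letters(sample))
print(add_background(sample, random.Random(0)))  # seeded for repeatability
```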

22
Results: MobyDick
  • First 10 chapters
  • D = {a, b, c, ...}
  • Text: 4,214 unique words
  • 2,630 occurred only once
  • Background increases L by a factor of 3.
  • 2,450 words found, 700 in English, 40 composite
    words.

23
Results: yeast
  • D = {T, A, C, G}
  • Text: 443 experimentally determined sites
  • Background: genes and junk
  • 500 words found
  • 114 match the experimentally determined sites
  • Not that good, but it is a beginning!

24
  • The end