Title: Building a dictionary for genomes
1 Building a dictionary for genomes
- By Harmen J. Bussemaker, Hao Li, and Eric D. Siggia
Tal Frank
2 Topics that will be discussed
- Biological background
- Present the biological problem
- Show an algorithm that treats this problem
- Statistical mechanics methods
- Try our algorithm on two well-known problems
3 What we did so far
- Human Genome Project (2001)
- This article published (2000)
- Sequence is not everything
- Let's do some theory
- Control over gene expression: when, how much
- Control element, regulator, sequence motif
- Genes work together: co-regulated genes
4 The goals of this work
- Identify the control elements
- Where are they located?
- Identify co-regulated genes
5 Multiple control elements
- Example: where are the control elements located?
- Concepts: directionality, upstream, in the junk
- TACGAXTTCGA
- Example: co-regulated genes
- Naïve approach: TACGAXTTTAAYATGGCA
- Experimentally: TACGAXTTCGAYATGGCA
- To activate a set of genes, multiple sequences are needed
6 New terminology
- DNA: a string of letters
- Control element: a word
- Multiple control elements: sentences
- Genes and junk: background noise
- Example: S = GAGCXTGGYGCTT
- Words: GA, TG
- Sentence: GA.TG
- Background: genes and junk
7 MobyDick algorithm
- Decipher a text consisting of a long string of letters written in an unknown language
- Find the words in the text
- Find the right spacing
- Example: D = {A, T, AT}, S = ATT
- P1 = A.T.T
- P2 = AT.T
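The example above (two ways to space S = ATT with D = {A, T, AT}) can be reproduced with a short recursive sketch. The function name `spacings` is hypothetical and this brute-force enumeration is only an illustration, not the method used by the algorithm itself.

```python
def spacings(s, dictionary):
    """Return every way to split s into a sequence of dictionary words."""
    if not s:
        return [[]]  # the empty string has exactly one spacing: no words
    results = []
    for word in dictionary:
        if s.startswith(word):
            # Keep this word and space the remainder recursively.
            for rest in spacings(s[len(word):], dictionary):
                results.append([word] + rest)
    return results


D = ["A", "T", "AT"]
for parsing in spacings("ATT", D):
    print(".".join(parsing))
```

The number of spacings can grow exponentially with the length of S, so this is useful only for toy inputs.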
8 How would you do it?
- 1. Look for repeated substrings in the string
- -> went, to, he -> D (dictionary)
- 2. Space the text. Oops: spacing is not that simple.
- E.g. D = {A, T, AT}, S = ATT
- P1 = A.T.T -> p1
- P2 = AT.T -> p2
Tal went to Weizmann this morning. When he arrived he didn't go to his office, he went to drink a cup of coffee.
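Step 1 (looking for repeated substrings) can be sketched on the sample sentence once spaces and punctuation are removed. The function name `repeated_substrings` and the length and count limits are assumptions chosen for illustration.

```python
from collections import Counter


def repeated_substrings(text, min_len=2, max_len=8, min_count=2):
    """Count every substring of length min_len..max_len; keep the repeated ones."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {sub: c for sub, c in counts.items() if c >= min_count}


text = ("Tal went to Weizmann this morning. When he arrived he didn't go to "
        "his office, he went to drink a cup of coffee.")
# Drop spaces and punctuation, as if the word boundaries were unknown.
letters = "".join(ch for ch in text.lower() if ch.isalpha())
found = repeated_substrings(letters)
print(sorted(found, key=len, reverse=True)[:5])
```

Real words such as "went" show up, but so do fragments that straddle word boundaries, which is one way to see why spacing is not that simple.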
9 MobyDick blueprints
S = TAGATAT
D = {T, A, G}
1-letter words
p_w: p_A, p_T, ...
Find p_w
S = TAGATAT
D = {A, TA, ...}
2-letter words
p_w: p_A, p_TA, ...
Find p_w
No more candidate words -> stop!
Find spacing
S = TA.G.A.TA.T
10 Statistical mechanics, in order to ...
- 1. How does MobyDick decide p_w?
- 2. When does MobyDick add a new word?
- 3. Space (parse) the text.
11 The likelihood function
- k: a possible spacing
- N_w: number of times the word w appears
- Example: D = {T, AT, A}, S = TATA
- k1 = T.A.T.A
- k2 = T.AT.A
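A plausible form of the likelihood function, written from the definitions of k and N_w above. This is a reconstruction under the assumption that each spacing contributes the product of its word probabilities; here N_w(k) denotes the count of w in spacing k.

```latex
Z(D, p_w) \;=\; \sum_{k} \; \prod_{w \in D} p_w^{\,N_w(k)}

% For D = \{T, AT, A\} and S = TATA there are two spacings:
%   k_1 = T.A.T.A  with  N_T = 2,\; N_A = 2,\; N_{AT} = 0
%   k_2 = T.AT.A   with  N_T = 1,\; N_A = 1,\; N_{AT} = 1
Z \;=\; p_T^{2}\, p_A^{2} \;+\; p_T\, p_{AT}\, p_A
```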
12 Likelihood function - intuition
- Z(D, p_w): partition function: <E>, <N>, <T>, ...
- Z(D, p_w): the probability to obtain a sequence S
- Example: D = {T, AT, A} with p_T, p_A, p_AT
- Question: what is the probability to obtain S = TATA?
- 1st possibility: T.A.T.A -> p_A p_A p_T p_T
- 2nd possibility: T.AT.A -> p_T p_AT p_A
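The two possibilities above can be summed numerically. The sketch below enumerates every spacing by brute force and adds up the products of word probabilities; the probability values and the names `spacings` and `likelihood` are hypothetical, chosen only to make the example concrete.

```python
from math import prod


def spacings(s, dictionary):
    """Return every way to split s into a sequence of dictionary words."""
    if not s:
        return [[]]
    return [
        [word] + rest
        for word in dictionary
        if s.startswith(word)
        for rest in spacings(s[len(word):], dictionary)
    ]


def likelihood(s, p):
    """Z: sum over all spacings of the product of the word probabilities."""
    return sum(prod(p[w] for w in k) for k in spacings(s, list(p)))


# Hypothetical probabilities for D = {T, AT, A}; they sum to 1.
p = {"T": 0.4, "A": 0.4, "AT": 0.2}
Z = likelihood("TATA", p)
print(Z)  # p_T^2 * p_A^2 + p_T * p_AT * p_A for these values
```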
13 Finding p_w
Given D, S
Maximize Z(p_w, D) with respect to p_w
This p_w gives the highest probability of obtaining the given S
14 Let's find the p_w!
- Definition: <N_w> is the average number of occurrences of the word w over the different spacings
- Can prove: <N_w> = p_w ∂(ln Z)/∂p_w
- Maximize Z: solve p_w = <N_w> / (sum over w' of <N_w'>)
- Solving is done by iteration: p_w -> <N_w> -> p_w
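The iteration p_w -> <N_w> -> p_w can be sketched as follows. This is a minimal illustration under two assumptions: that <N_w> is computed with a forward/backward pass over S, and that the update is p_w = <N_w> divided by the sum of all <N_w'>. The function names `expected_counts` and `fit` are hypothetical.

```python
def expected_counts(s, p):
    """Return ({w: <N_w>}, Z) for sequence s and word probabilities p."""
    n = len(s)
    # fwd[i]: total weight of all spacings of the prefix s[:i]
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        for w, pw in p.items():
            if i >= len(w) and s[i - len(w):i] == w:
                fwd[i] += fwd[i - len(w)] * pw
    # bwd[i]: total weight of all spacings of the suffix s[i:]
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        for w, pw in p.items():
            if s.startswith(w, i):
                bwd[i] += pw * bwd[i + len(w)]
    z = fwd[n]  # assumes s can be spaced with the dictionary, so z > 0
    counts = {w: 0.0 for w in p}
    for w, pw in p.items():
        for i in range(n - len(w) + 1):
            if s.startswith(w, i):
                # Weight of all spacings that use w at position i, over Z.
                counts[w] += fwd[i] * pw * bwd[i + len(w)] / z
    return counts, z


def fit(s, words, iterations=50):
    """Iterate p_w -> <N_w> -> p_w, starting from uniform probabilities."""
    p = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        counts, _ = expected_counts(s, p)
        total = sum(counts.values())
        p = {w: c / total for w, c in counts.items()}
    return p


p_fit = fit("TATA", ["T", "A", "AT"])
print(p_fit)
```

For long sequences the products underflow in floating point, so a practical version would rescale `fwd` and `bwd` or work with logarithms.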
15 Enough is enough!!!
- When is p_w good enough?
- When the new p_w doesn't give a higher Z
- We say this method converges!
- Other methods don't converge.
16 Why find p_w this way?
- Monte Carlo methods don't converge.
- Slow method -> can be transformed into a fast method
- Order of complexity: O(L D l)
- L: the length of the string
- D: the size of the dictionary
- l: the length of the longest word in D
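One way to get the stated O(L D l) cost is dynamic programming over prefixes of the string, instead of listing every spacing. The sketch below is an assumption about how the fast method could look; `likelihood_dp` is a hypothetical name.

```python
def likelihood_dp(s, p):
    """Compute Z in O(L * D * l) time without enumerating spacings."""
    # z[i] = total weight of all spacings of the prefix s[:i]
    z = [0.0] * (len(s) + 1)
    z[0] = 1.0
    for i in range(1, len(s) + 1):          # L positions
        for w, pw in p.items():             # D dictionary words
            # Comparing the slice costs at most l character checks.
            if i >= len(w) and s[i - len(w):i] == w:
                z[i] += z[i - len(w)] * pw
    return z[len(s)]


# With all weights set to 1, Z simply counts the spacings.
print(likelihood_dp("A" * 40, {"A": 1.0, "AA": 1.0}))
```

The count printed for a string of 40 letters is already in the hundreds of millions, which is why explicit enumeration is the slow method.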
17 Add new words?
Look at dictionary
D = {T, A, C, G}, S = TATTGA
Compose new word ww
D = {T, A, C, G}, S = TATTGA, ww = TA
Check occurrence
D = {T, A, C, G}, S = TATTGA, ww = TA
Yes -> add to dictionary
D = {T, A, C, G, TA}, S = TATTGA
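The compose-and-check step can be sketched as below. The slide only says "check occurrence", so the plain count threshold `min_count` is a stand-in assumption for whatever acceptance test is actually used; `count_occurrences` and `propose_words` are hypothetical names.

```python
from itertools import product


def count_occurrences(s, w):
    """Count occurrences of w in s, overlapping matches included."""
    return sum(1 for i in range(len(s) - len(w) + 1) if s.startswith(w, i))


def propose_words(s, dictionary, min_count=1):
    """Compose ww from pairs of dictionary words; keep those found in s."""
    candidates = []
    for w1, w2 in product(dictionary, repeat=2):
        ww = w1 + w2
        if ww in dictionary or ww in candidates:
            continue
        if count_occurrences(s, ww) >= min_count:
            candidates.append(ww)
    return candidates


D = ["T", "A", "C", "G"]
S = "TATTGA"
new_words = propose_words(S, D)
print(new_words)
```

On such a short string every adjacent pair of letters passes a threshold of 1, so a real run would need a stricter criterion and a much longer sequence.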
18 A problem and a bad solution
- The algorithm finds only the words which are composed from words already in the dictionary.
- Example: S = AATATAAA
- 1st step: S = AATATAAA
- D = {A}
- 2nd step: S = AATATAAA
- AT is not a composition of words
- Solution: look for repeated long strings
- by consideration the problem
19 Spacing
- Define: the number of times the word w occurs in a given spacing
- Quality factor
- The required condition
20 Checking the algorithm
- Applying it to the English novel Moby Dick
- Applying it to control elements in the yeast genome
- Not always possible: Voynich manuscript (1450)
21 Preparing the book Moby Dick
- Call me Ishmael. Some years ago- never mind how long precisely- having little or no money in my purse, and nothing particular tothought I would sail
- CallmeIshmaelSomeyearsagonevermindhowlongpreciselyhavinglittleornomoneyinmypurseandnothingparticulartothoughtIwouldsail..
- CallabajameIshmaelbjklmbbSomeyearsagonevermindhowlonEciselyhavinglittlermsdrornomoneyinmypurseandnothingparticuartothoughtIwouldsail
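The two preparation steps shown above (removing spaces and punctuation, then inserting random letters between words as background) can be sketched like this. The function names, the lowercase noise alphabet, and the maximum noise length are assumptions made for illustration.

```python
import random
import string


def strip_text(text):
    """Remove spaces and punctuation, keeping only the letters."""
    return "".join(ch for ch in text if ch.isalpha())


def add_background(text, rng, max_noise=8):
    """Concatenate the words, inserting random letters after each one."""
    pieces = []
    for token in text.split():
        word = strip_text(token)
        if not word:
            continue  # the token was pure punctuation
        pieces.append(word)
        # Hypothetical noise model: 0..max_noise random lowercase letters.
        noise_len = rng.randint(0, max_noise)
        pieces.append("".join(rng.choice(string.ascii_lowercase)
                              for _ in range(noise_len)))
    return "".join(pieces)


sample = "Call me Ishmael. Some years ago- never mind how long precisely"
print(strip_text(sample))
print(add_background(sample, random.Random(0)))
```

Passing a seeded `random.Random` keeps the background reproducible, which helps when comparing the recovered words against the original text.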
22 Results: MobyDick
- First 10 chapters
- D = {a, b, c, ...}
- Text: 4,214 unique words
- 2,630 occurred only once
- Background increases L by a factor of 3
- 2,450 words found, 700 in English, 40 composite words
23 Results: yeast
- D = {T, A, C, G}
- Text: 443 experimentally determined sites
- Background: genes and junk
- 500 words found
- 114 match the experimentally determined sites
- Not that good, but it is a beginning!