Title: Finding sequence motifs in PBM data Workshop Project
1. Finding sequence motifs in PBM data: Workshop Project
Yaron Orenstein October 2010
2. Outline
- 1. Some background again
- 2. The project
3. Part 1: Background
- Slides with Ron Shamir and Chaim Linhart
4. Gene: from DNA to protein
[Figure: DNA -> (transcription) -> pre-mRNA -> (splicing) -> mature mRNA -> (translation) -> protein]
5. DNA
- DNA: a string over the alphabet of 4 bases (nucleotides): A, C, G, T
- Resides in chromosomes
- Complementary strands: A-T, C-G
- Forward/sense strand: AACTTGCG
- Reverse-complement/anti-sense strand: TTGAACGC (written 3' to 5', aligned under the forward strand; read 5' to 3' it is CGCAAGTT)
- Directional: from 5' to 3'
- 5' end (upstream) AACTTGCGATACTCCTA (downstream) 3' end
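Since a binding site may sit on either strand, the reverse complement is needed throughout the project. A minimal sketch follows; the class and method names are illustrative only and are not part of any supplied package.

```java
final class ReverseComplement {
    /** Returns the reverse complement of a DNA string over A, C, G, T, read 5' to 3'. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        // Walk the forward strand backwards and complement each base.
        for (int i = dna.length() - 1; i >= 0; i--) {
            sb.append(complement(dna.charAt(i)));
        }
        return sb.toString();
    }

    /** Watson-Crick complement of a single base. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }
}
```

For the forward strand on this slide, `ReverseComplement.reverseComplement("AACTTGCG")` returns `CGCAAGTT`.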
6. Gene structure (eukaryotes)
[Figure: DNA (coding strand) with promoter and transcription start site (TSS); transcription (RNA polymerase) yields pre-mRNA (exon, intron, exon); splicing (spliceosome) yields mature mRNA (5' UTR, start codon, coding region, stop codon, 3' UTR); translation (ribosome) yields protein]
7. Translation
- Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); a many-to-1 relation
- Stop codons: signal termination of the protein synthesis process
http://ntri.tamuk.edu/cell/ribosomes.html
8. Genome sequences
- Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
- Human:
  - 23 pairs of chromosomes
  - 3 Gbps (bps = base pairs), only 3% are genes
  - 25,000 genes
- Yeast:
  - 16 chromosomes
  - 20 Mbps
  - 6,500 genes
9. Regulation of Expression
- Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks
- Most genes are highly regulated: their expression is limited to specific tissues, developmental stages, and physiological conditions
- Main regulatory mechanism: transcriptional regulation
10. Transcriptional regulation
- Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
- TFBSs are located mainly (not always!) in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS)
- BSs of a particular TF share a common pattern, or motif
- Some TFs operate together: TF modules
11. TFBS motif models
- Consensus (degenerate) string: [AC]ACT[CG]T
[Figure: promoters of genes 1-10; the binding sites shown are AACTGT, CACTGT, CACTCT, CACTGT, AACTGT]
- Statistical models
- Motif logo representation
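A simple statistical model is a position frequency matrix built from aligned sites. The sketch below uses the five sites shown on this slide; the class name and the A, C, G, T row order are assumptions made for illustration.

```java
final class SiteFrequencies {
    static final String BASES = "ACGT"; // assumed row order of the matrix

    /** Returns a 4 x L matrix of base frequencies from aligned, equal-length sites. */
    static double[][] fromSites(String[] sites) {
        int len = sites[0].length();
        double[][] freq = new double[4][len];
        for (String site : sites) {
            if (site.length() != len) {
                throw new IllegalArgumentException("Sites must have equal length: " + site);
            }
            for (int j = 0; j < len; j++) {
                int row = BASES.indexOf(site.charAt(j));
                if (row < 0) {
                    throw new IllegalArgumentException("Not a DNA base in: " + site);
                }
                freq[row][j] += 1.0; // count first, normalise below
            }
        }
        // Turn counts into per-position frequencies.
        for (double[] row : freq) {
            for (int j = 0; j < len; j++) {
                row[j] /= sites.length;
            }
        }
        return freq;
    }
}
```

With the sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT, the first column comes out as A = 0.4, C = 0.6, which matches the degenerate first position of the consensus string.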
12. Human G2M cell-cycle genes: The CHR NF-Y module
- CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT  -18
- CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA(25bp)TGACTGTGGAGTTTGAATTGG  23
- CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT  0
- CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA(25bp)CGACGGCCATTGGCTGCTGC  -110
- CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG(30bp)AGCAGTGCGGGGTTTAAATCT  45
- CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT(15bp)GTGTTGGCCAATGAGAAC(15bp)GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA  10
- BSs are short, non-specific, hiding in both strands and at various locations along the promoters
TFs: NF-Y, CHR
13. Protein Binding Microarrays (Berger et al., Nat. Biotech 2006)
- Generate an array of double-stranded DNA with all possible k-mers
- Detect TF binding to specific k-mers
14. PBM (2)
15. PBM - implementation
- Use 60-mers (Agilent): 25nt constant primer + 35nt variable region
- De Bruijn sequence of all 10-mers (4^10 long), split into 35nt-long fragments with 9nt overlap
- 40K probes
- For each 8-mer, combine the signals from all probes that contain it (or differ from it in 1nt) to obtain its binding score
16. The computational challenge
- Input: PBM data (sequences and binding scores) of one TF.
- Goal: find a motif (PWM) that is the binding site of that TF.
- Intuition: sequences that match the motif (on one of the two possible strands!) are expected to have high binding scores.
17. Part 2: The project
18. General goals
- Research
  - Learn about known solutions
  - Trial and error with training data
- Develop software from A-Z
- Design
- Implementation (Optimization)
- Execution and analysis of test data
- A taste of bioinformatics
- Have fun
- Get credit
19. The computational task
- Given a set of PBM data of different TFs, find the binding site motif of each TF in PWM format.
- Main challenges:
  - Performance (time, memory)
  - Accuracy
20. Input
- File with 41,923 lines, each containing a probe sequence of length 35 and a binding intensity.
- <sequence 35bp> \t <intensity> \n
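Reading this format could look like the sketch below. It assumes exactly two tab-separated fields per line and skips blank lines; `PbmParser` and `Probe` are hypothetical names, not classes from the supplied packages.

```java
import java.util.ArrayList;
import java.util.List;

final class PbmParser {
    /** One probe: its sequence and measured binding intensity. */
    static final class Probe {
        final String sequence;
        final double intensity;

        Probe(String sequence, double intensity) {
            this.sequence = sequence;
            this.intensity = intensity;
        }
    }

    /** Parses lines of the form "<sequence>\t<intensity>"; blank lines are skipped. */
    static List<Probe> parse(List<String> lines) {
        List<Probe> probes = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
            }
            probes.add(new Probe(fields[0].trim(), Double.parseDouble(fields[1].trim())));
        }
        return probes;
    }
}
```

The lines themselves can be obtained with `java.nio.file.Files.readAllLines(java.nio.file.Paths.get(filename))`; for a file of this size, reading everything into memory should be acceptable.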
21. Input (II)
- For the training data, an additional PWM file will be supplied for each PBM data set.
- A <freq1> <freq2> ... <freq10>
- C <freq1> ... <freq10>
- G ...
- T ...
- Separated by \t and \n.
- All lines must contain the same number of frequencies (10 is just an example).
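A hedged sketch of reading and writing such a PWM file follows. It assumes each row starts with the base letter (an optional trailing colon is tolerated) followed by tab-separated frequencies; the exact label syntax of the supplied files should be checked, and `PwmFormat` is a hypothetical name.

```java
import java.util.List;

final class PwmFormat {
    static final String BASES = "ACGT"; // assumed row order of the returned matrix

    /** Parses four rows "<base>\t<freq1>\t...\t<freqL>" into a 4 x L matrix. */
    static double[][] parse(List<String> lines) {
        double[][] pwm = new double[4][];
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.trim().split("\t");
            String label = fields[0].replace(":", "").trim();
            int row = label.length() == 1 ? BASES.indexOf(label.charAt(0)) : -1;
            if (row < 0) {
                throw new IllegalArgumentException("Unknown row label: " + fields[0]);
            }
            pwm[row] = new double[fields.length - 1];
            for (int j = 1; j < fields.length; j++) {
                pwm[row][j - 1] = Double.parseDouble(fields[j]);
            }
        }
        // All four rows must be present and have the same number of frequencies.
        for (int r = 0; r < 4; r++) {
            if (pwm[r] == null || pwm[r].length != pwm[0].length) {
                throw new IllegalArgumentException("Missing or ragged row for base " + BASES.charAt(r));
            }
        }
        return pwm;
    }

    /** Writes the matrix back in the same tab-separated layout, one base per line. */
    static String format(double[][] pwm) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < 4; r++) {
            sb.append(BASES.charAt(r));
            for (double f : pwm[r]) {
                sb.append('\t').append(f);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```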
22. Input (III)
- You will be given:
  - 10 training sets (PBM data + PWM)
  - 4 test sets (PBM data only); you have to provide the PWM.
- In the final project presentation, you will be given an online test set (PBM data), and your software will be applied to it.
23. Output
- A PWM file describing the binding site found in the given PBM file.
- The PWM in motif logo format (i.e. displayed on the screen).
- The file logo.zip contains a Java package with code that will easily display your motif.
[Figure: motif logo; y-axis: bits = 2 - entropy]
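The height of each logo column is its information content, bits = 2 - entropy, where the entropy is taken in base 2 over the four base frequencies. A minimal sketch, with hypothetical names and independent of the code in logo.zip:

```java
final class InformationContent {
    /** Information content of one PWM column in bits: 2 - H, with H = -sum p * log2(p). */
    static double columnBits(double[] column) {
        double entropy = 0.0;
        for (double p : column) {
            if (p > 0) { // 0 * log(0) is taken as 0
                entropy -= p * Math.log(p) / Math.log(2);
            }
        }
        return 2.0 - entropy;
    }

    /** Information content per position of a 4 x L matrix (rows are the four bases). */
    static double[] bitsPerPosition(double[][] pwm) {
        int len = pwm[0].length;
        double[] bits = new double[len];
        for (int j = 0; j < len; j++) {
            bits[j] = columnBits(new double[] {pwm[0][j], pwm[1][j], pwm[2][j], pwm[3][j]});
        }
        return bits;
    }
}
```

A fully conserved position scores 2 bits and a uniform position scores 0 bits.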
24. Output (II)
- Show graphically how well your motif predicts the binding intensity.
- One example (note: it's not a PWM)
25. Ranking 8-mers
- One possible way to start: rank the 8-mers in some way. Example scores:
  - 1. Signal average.
  - 2. Signal median.
- You can think of other scores that incorporate more information, e.g. position in the probe sequence.
- This is just an example. You can think of other ways to start.
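One way the median-based ranking might be sketched is shown below. For brevity it looks at the forward strand only and counts each probe once per k-mer; merging a k-mer with its reverse complement is left out. All names are hypothetical.

```java
import java.util.*;

final class KmerRanking {
    /** Maps each k-mer to the intensities of all probes that contain it (once per probe). */
    static Map<String, List<Double>> collect(String[] seqs, double[] intensities, int k) {
        Map<String, List<Double>> signals = new HashMap<>();
        for (int i = 0; i < seqs.length; i++) {
            Set<String> seen = new HashSet<>();
            for (int p = 0; p + k <= seqs[i].length(); p++) {
                String kmer = seqs[i].substring(p, p + k);
                if (seen.add(kmer)) {
                    signals.computeIfAbsent(kmer, x -> new ArrayList<>()).add(intensities[i]);
                }
            }
        }
        return signals;
    }

    /** Median of a non-empty list of values. */
    static double median(List<Double> values) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2) : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /** Returns the k-mers sorted by decreasing median intensity. */
    static List<String> rankByMedian(Map<String, List<Double>> signals) {
        List<String> kmers = new ArrayList<>(signals.keySet());
        kmers.sort((a, b) -> Double.compare(median(signals.get(b)), median(signals.get(a))));
        return kmers;
    }
}
```

Recomputing the median inside the comparator is simple but wasteful; for all 8-mers over roughly 40K probes it would be worth caching the medians first.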
26. Alignment procedure
- Then, you can align the significant 8-mers.
- You may take into account the relative score.
- Don't forget about the reverse complement!
- Example: Cebpb TF
27. Enrichment scores
- To test how good your motif is, you can use an enrichment score.
- An enrichment score tests how well the motif distinguishes between high-ranking probes and the rest of the probes.
28. Hypergeometric probability
        drawn   not drawn       total
white   k       m - k           m
black   n - k   N + k - n - m   N - m
total   n       N - n           N
29. Hypergeometric enrichment score
- Let B and T (T ⊆ B) denote the BG (background) and target sets, respectively, and let b and t denote the subsets of probes from the BG and target sets, respectively, that contain at least one occurrence of the motif.
30. Hypergeometric score (2)
- The HG enrichment score computes the probability of observing at least t target sequences with a motif occurrence, under the null hypothesis that the probes in the target set were drawn randomly, independently, and without replacement from the BG set.
- Code is provided in math.zip
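To make the definition concrete, here is a minimal sketch of the hypergeometric tail probability. It is not the code from math.zip; the class name and the parameter names (set sizes passed as plain integers) are assumptions for illustration.

```java
final class HypergeometricScore {
    /**
     * P(X >= targetHits), where X is the number of motif-containing probes among
     * targetSize probes drawn without replacement from bgSize probes, bgHits of
     * which contain the motif.
     */
    static double tail(int bgSize, int bgHits, int targetSize, int targetHits) {
        // Log-factorials keep the binomial coefficients within double range.
        double[] logFact = new double[bgSize + 1];
        for (int i = 1; i <= bgSize; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        double p = 0.0;
        int max = Math.min(bgHits, targetSize);
        for (int i = Math.max(targetHits, 0); i <= max; i++) {
            if (targetSize - i > bgSize - bgHits) {
                continue; // not enough motif-free probes to fill the rest of the draw
            }
            double logTerm = logChoose(logFact, bgHits, i)
                    + logChoose(logFact, bgSize - bgHits, targetSize - i)
                    - logChoose(logFact, bgSize, targetSize);
            p += Math.exp(logTerm);
        }
        return Math.min(1.0, p);
    }

    /** log of "n choose k" from precomputed log-factorials. */
    private static double logChoose(double[] logFact, int n, int k) {
        return logFact[n] - logFact[k] - logFact[n - k];
    }
}
```

A smaller value means the motif is more strongly enriched in the target set. For example, with 10 background probes of which 5 contain the motif, drawing 5 probes that all contain it has probability 1/252.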
31. Wilcoxon-Mann-Whitney (WMW) enrichment score
- Foreground probes are all those containing a match; background probes are all the others.
- B and F are the sizes of the background and foreground, respectively.
- ρB and ρF are the sums of the background and foreground ranks.
- Read more in the supplementary info (Berger06).
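The formula itself did not survive on this slide. The sketch below assumes the scaled rank-sum difference (ρB/B - ρF/F) / (B + F), with rank 1 given to the brightest probe, which is the form usually attributed to Berger06. Treat the exact form as an assumption and check it against the supplementary info. The class and parameter names are hypothetical.

```java
final class WmwEnrichment {
    /**
     * foregroundRanks: 1-based ranks (1 = highest intensity) of the probes that
     * contain a motif match; totalProbes: number of probes overall (B + F).
     * Returns a value between -0.5 and 0.5 under the assumed formula.
     */
    static double score(int[] foregroundRanks, int totalProbes) {
        int f = foregroundRanks.length;
        int b = totalProbes - f;
        if (f == 0 || b == 0) {
            throw new IllegalArgumentException("Foreground and background must both be non-empty");
        }
        double rhoF = 0.0;
        for (int r : foregroundRanks) {
            rhoF += r;
        }
        // All ranks 1..N sum to N(N+1)/2, so the background sum follows from the foreground sum.
        double rhoB = totalProbes * (totalProbes + 1.0) / 2.0 - rhoF;
        return (rhoB / b - rhoF / f) / (b + f);
    }
}
```

Under this form, a motif whose matches occupy exactly the top-ranked probes scores 0.5, and one whose matches occupy exactly the bottom-ranked probes scores -0.5.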
32. Deciding the length of the motif
- Another challenge is to decide the length of the motif.
- Most binding sites are 6-12 bp long.
- You should consider the information each position contains and decide on the length accordingly.
33. Scoring your PWM
- One way to score your motif is by ranking the probe sequences according to a match score.
- You may use the given code for the match score.
- Compare the ranking of the probes you got to the ranking according to binding intensities. There are different correlation scores for that.
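As an illustration only (the supplied match-score code may work differently), a probe can be scored by its best PWM window on either strand, and the scores can then be correlated with the intensities. The sketch assumes rows in A, C, G, T order and sequences over A, C, G, T only; all names are hypothetical.

```java
final class PwmMatch {
    static final String BASES = "ACGT"; // assumed row order of the PWM

    /** Highest product of PWM frequencies over all windows of the sequence, on either strand. */
    static double bestMatch(double[][] pwm, String seq) {
        int len = pwm[0].length;
        double best = 0.0;
        for (int start = 0; start + len <= seq.length(); start++) {
            double fwd = 1.0;
            double rev = 1.0;
            for (int j = 0; j < len; j++) {
                int f = BASES.indexOf(seq.charAt(start + j));
                // Reverse strand: read the window backwards and complement.
                // With A, C, G, T at rows 0..3, the complement row is 3 - row.
                int r = 3 - BASES.indexOf(seq.charAt(start + len - 1 - j));
                fwd *= pwm[f][j];
                rev *= pwm[r][j];
            }
            best = Math.max(best, Math.max(fwd, rev));
        }
        return best;
    }

    /** Pearson correlation coefficient of two equal-length arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

Applying `pearson` to the ranks of the match scores and the ranks of the intensities, rather than to the raw values, gives a rank correlation (Spearman), which is closer to the comparison of rankings described above.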
34. Match Score between PWMs
- Already implemented for you:
- Euclidean Distance
- Pearson Correlation Coefficient
- KL Divergence
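To give a feel for what such measures compute, here is a hedged sketch of two of them for PWMs of equal length that are already aligned. The supplied implementation may differ, for example in how it handles shifts, the reverse strand, or zero frequencies. The pseudocount parameter and all names are assumptions.

```java
final class PwmDistance {
    /** Euclidean distance between two aligned 4 x L matrices, over all entries. */
    static double euclidean(double[][] p, double[][] q) {
        double sum = 0.0;
        for (int r = 0; r < 4; r++) {
            for (int j = 0; j < p[r].length; j++) {
                double d = p[r][j] - q[r][j];
                sum += d * d;
            }
        }
        return Math.sqrt(sum);
    }

    /**
     * Sum over columns of KL(p || q) in bits. A positive pseudocount is added to
     * every frequency (and columns are renormalised) so that zeros do not give log(0).
     */
    static double klDivergence(double[][] p, double[][] q, double pseudo) {
        if (pseudo <= 0) {
            throw new IllegalArgumentException("pseudo must be positive");
        }
        double kl = 0.0;
        for (int j = 0; j < p[0].length; j++) {
            for (int r = 0; r < 4; r++) {
                double pj = (p[r][j] + pseudo) / (1.0 + 4.0 * pseudo);
                double qj = (q[r][j] + pseudo) / (1.0 + 4.0 * pseudo);
                kl += pj * Math.log(pj / qj) / Math.log(2);
            }
        }
        return kl;
    }
}
```

Both measures are 0 for identical PWMs and grow as the matrices diverge. Note that KL divergence is not symmetric in its two arguments.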
35. Implementation
- Java (Eclipse), Linux (other languages are possible, but will not participate in the bonus).
- Input: one single argument, the PBM filename
- Output: PWM file, the PWM presented as a logo, and a graphical presentation of the PWM matching distribution among probes.
- Packages for motif logo and statistical scores will be supplied
- Time performance will be measured
- Reasonable documentation
- Separate packages for data structures, scores, GUI, I/O, etc.
36. Submission
- Printed design document.
- Printed code for comments and remarks.
- Printed results document: for each test set, the PWM logo and how good your result is in terms of correlation to the probes' ranks.
- 4 PWM files, e.g. Test_1.pwm (submitted by email).
- Executable for the online test.
37. Grade
- 20% for the design
- 30% for the implementation (20% for modularity, clarity, documentation; 10% for efficiency)
- 30% for the performance and experimental results (20% for the accuracy on the 4 test queries and 10% for the accuracy on the online test query)
- 20% for the final report and presentation
- 10% bonus to the group with the most accurate results
- 10% bonus for the group with the fastest implementation
38. Bonus grading
- Accuracy will be determined using the provided code that compares two PWMs.
- We will take the average of runs on several different PBM data sets.
- Running time will be measured on the Java implementation, and the average will be taken.
39. Schedule
- First progress report: 23/11
- Design document: 21/12
- Final presentation: 16/2
- We shall meet with each group on each of these dates: mark your calendars!
- Meetings can be scheduled earlier if you are ready.
- You are always welcome to meet us. Contact us by email.
40. Design document
- Due in week 12 (21/12).
- 3-5 pages (Word), Hebrew/English
- Briefly describe the main goal, input and output of the program
- Describe the main data structures, algorithms, and scores.
- Meet with me before submission.
41. Reference
- Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW III, Bulyk ML. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotechnology. 2006;24(11):1429-1435.
- Very important! Read: the_brain.bwh.harvard.edu/UPBMseqn/suppl_methods.doc
- Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007 Jul 1;23(13):i72-79.
42. Fin