Finding sequence motifs in PBM data Workshop Project PowerPoint PPT Presentation

presentation player overlay
1 / 42
About This Presentation
Transcript and Presenter's Notes

Title: Finding sequence motifs in PBM data Workshop Project


1
Finding sequence motifs in PBM data Workshop
Project
Yaron Orenstein October 2010
2
Outline
  • 1. Some background again
  • 2. The project

3
1. Background
  • Slides with Ron Shamir and Chaim Linhart

4
Gene from DNA to protein
Pre-mRNA
Mature mRNA
DNA
protein
transcription
translation
splicing
5
DNA
  • DNA a string over the alphabet of 4 bases
    (nucleotides) A, C, G, T
  • Resides in chromosomes
  • Complementary strands A-T C-G
  • Forward/sense strand AACTTGCG
  • Reverse-complement/anti-sense strand
    TTGAACGC
  • Directional from 5 to 3
  • (upstream) AACTTGCGATACTCCTA
    (downstream)

5 end
3 end
6
Gene structure (eukaryotes)
Promoter
DNA
Coding strand
Transcription start site (TSS)
Transcription (RNA polymerase)
Pre-mRNA
Exon
Intron
Exon
Splicing (spliceosome)
5 UTR
3 UTR
Mature mRNA
Stop codon
Start codon
Coding region
Translation (ribosome)
Protein
7
Translation
  • Codon - a triplet of bases, codes a specific
    amino acid (except the stop codons) many-to-1
    relation
  • Stop codons - signal termination of the protein
    synthesis process

http//ntri.tamuk.edu/cell/ribosomes.html
8
Genome sequences
  • Many genomes have been sequences, including those
    of viruses, microbes, plants and animals.
  • Human
  • 23 pairs of chromosomes
  • 3 Gbps (bps base pairs) , only 3 are genes
  • 25,000 genes
  • Yeast
  • 16 chromosomes
  • 20 Mbps
  • 6,500 genes

9
Regulation of Expression
  • Each cell contains an identical copy of the whole
    genome - but utilizes only a subset of the genes
    to perform diverse, unique tasks
  • Most genes are highly regulated
  • their expression is limited to specific tissues,
    developmental stages, physiological condition
  • Main regulatory mechanism transcriptional
    regulation

10
Transcriptional regulation
  • Transcription is regulated primarily by
    transcription factors (TFs) proteins that bind
    to DNA subsequences, called binding sites (BSs)
  • TFBSs are located mainly (not always!) in the
    genes promoter the DNA sequence upstream the
    genes transcription start site (TSS)
  • BSs of a particular TF share a common pattern, or
    motif
  • Some TFs operate together TF modules

TSS
11
TFBS motif models
  • Consensus (degenerate) string

AC
CG
ACT
T
gene 1
gene 2
AACTGT
gene 3
CACTGT
gene 4
CACTCT
gene 5
CACTGT
gene 6
gene 7
gene 8
gene 9
AACTGT
gene 10
  • Statistical models
  • Motif logo representation

12
Human G2M cell-cycle genesThe CHR NF-Y module
CDCA3 (trigger of mitotic entry
1) CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAAC
T -18 CDCA8 (cell division cycle associated
8) TTGTGATTGGATGTTGTGGGA25bpTGACTGTGGAGTTTGAAT
TGG 23 CDC2 (cell division control protein 2
homolog) CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTG
GTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT
0 CDC42EP4 (cdc42 effector protein
4) GCTTTCAGTTTGAACCGAGGA25bpCGACGGCCATTGGCTGCT
GC -110 CCNB1 (G2/mitotic-specific cyclin
B1) AGCCGCCAATGGGAAGGGAG30bpAGCAGTGCGGGGTTTAAA
TCT 45 CCNB2 (G2/mitotic-specific cyclin
B2) TTCAGCCAATGAGAGT15bpGTGTTGGCCAATGAGAAC15
bpGGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA
10 BSs are short, non-specific, hiding in
both strands and at various locations along the
promoters
TFs NF-Y , CHR
13
Protein Binding MicroarraysBerger et al, Nat.
Biotech 2006
  • Generate an array of double-stranded DNA with all
    possible k-mers
  • Detect TF binding to specific k-mers

13
14
PBM (2)
14
15
PBM - implementation
  • Use 60-mers (Agilent) 25nt constant primer
    35nt variable region
  • De Bruijn seq of all 10-mers (410 long) split
    into 35nt long fragments with 9nt overlap
  • 40K probes
  • For each 8-mer, combine signals from all probes
    that contain it (or differ in 1nt) to obtain its
    binding score

15
16
The computational challenge
  • Input PBM data (sequences and binding scores) of
    one TF.
  • Goal Find a motif (PWM) that is the binding site
    of that TF.
  • Intuition sequences that match the motif (on one
    of the two possible strands!) are expected to
    have high binding scores.

17
2. The project
18
General goals
  • Research
  • - Learn about known solutions
  • - Trial and error with training data
  • Develop software from A-Z
  • Design
  • Implementation (Optimization)
  • Execution analysis of test data
  • A taste of bioinformatics
  • Have fun
  • Get credit

19
The computational task
  • Given a set of PBM data of different TFs.
  • Find the binding site motif in PWM format of each
    TF.
  • Main challenges
  • Performance (time, memory)
  • Accuracy

20
Input
  • File with 41,923 lines, each containing a probe
    sequence of length 35 and binding intensity.
  • ltsequence 35bpgt \t ltintensitygt \n

21
Input (II)
  • For the training data, an additional PWM file
    will be supplied for each PBM data set.
  • A ltfreq1gt ltfreq2gt ltfreq10gt
  • C ltfreq1gt ltfreq10gt
  • G
  • T
  • Separated by \t and \n.
  • All lines must contain same number of frequencies
    (10 is just an example).

22
Input (III)
  • You will be given
  • 10 training sets (PBM data PWM)
  • 4 test sets (PBM data). You have to provide the
    PWM.
  • In the final project presentation, you will be
    given an online test set (PBM data) and your
    software will be applied to it.

23
Output
  • A PWM file describing the binding site found in
    the given PBM file.
  • The PWM in motif logo format (i.e. displayed on
    the screen).
  • The file logo.zip contains a java package with
    the code that will easily display your motif.

bits 2 - entropy
24
Output (II)
  • Show graphically how well your motif predicts
    the binding intensity.
  • One example (note its not PWM)

25
Ranking 8-mers
  • One possible way to start rank the 8-mers in
    some way. Scores for example
  • 1. Signal average.
  • 2. Signal median.
  • You can think of other scores that incorporate
    more information, e.g. position in probe
    sequence.
  • This is just an example. You can think of other
    ways to start.

26
Alignment procedure
  • Then, you can align the significant 8-mers.
  • You may take into account the relative score.
  • Dont forget about the reverse complement!
  • Example Cebpb TF

27
Enrichment scores
  • To test how good your motif is, you can use an
    enrichment score.
  • An enrichment score tests how good the motif
    distinguishes between high-ranking probes and the
    rest of the probes.

28
Hypergeometric probability
total not drawn drawn
m m - k k white
N - m N k - n - m n - k black
N N - n n total
29
Hypergeometric enrichment score
  • Let B and T (T B) denote the BG and target
    sets, respectively, and let b and t denote the
    subset of probes from the BG and target set,
    respectively, that contain at least one
    occurrence of the motif.

30
Hypergeometric score (2)
  • The HG enrichment score computes the probability
    of observing at least t target sequences with a
    motif occurrence, under the null hypothesis that
    the probes in the target set were drawn randomly,
    independently, and without replacement from the
    BG set.
  • Code is provided in math.zip

31
Wilcoxon-Mann-Whitney (WMW) enrichment score
  • Foreground probes are all those containing a
    match, background are all the others.
  • B and F are the sizes of background and
    foreground, respectively.
  • ?B and ?F are the sums of the background and
    foreground ranks.
  • Read more in supplementary info (Berger06).

32
Deciding the length of the motif
  • Another challenge is to decide the length of the
    motif.
  • Most binding site are 6-12 bp long.
  • You should consider the information each position
    contains and decide on the length accordingly.

33
Scoring your PWM
  • One way to score your motif is by ranking the
    probe sequences according to a match score.
  • You may use the given code for match score.
  • Compare the ranking of the probes you got to the
    ranking according to binding intensities. There
    are different correlation score for that.

34
Match Score between PWMs
  • Already implemented for you
  • Euclidian Distance
  • Pearson Correlation Coefficient
  • KL Divergence

35
Implementation
  • Java (Eclipse) Linux (Other languages are
    possible, but will not participate in bonus).
  • Input one single argument PBM filename
  • Output PWM file, PWM presented in logo and
    graphical presentation of PWM matching
    distribution among probes.
  • Packages for motif logo and statistical scores
    will be supplied
  • Time performance will be measured
  • Reasonable documentation
  • Separate packages for data-structures, scores,
    GUI, I/O, etc.

36
Submission
  • Printed design document.
  • Printed code for comments and remarks.
  • Printed results document for each test set PWM
    logo how good your result in terms of
    correlation to the probes ranks.
  • 4 PWM files, e.g. Test_1.pwm (submitted by
    email).
  • Executable for the online test.

37
Grade
  • 20 for the design
  • 30 for the implementation (20 for modularity,
    clarity, documentation, 10 for efficiency)
  • 30 for the performance and experimental results
    (20 for the accuracy on the 4 test queries and
    10 for the accuracy on the online test query)
  • 20 for the final report and presentation
  • 10 bonus to the group with the most accurate
    results
  • 10 bonus for the group with the fastest
    implementation

38
Bonus grading
  • Accuracy will be determined using the provided
    code that compares two PWMs.
  • We will take the average of runs on several
    different PBM data sets.
  • Running time will be measured in java
    implementation, and the average will be taken.

39
Schedule
  • First progress report 23/11
  • Design document 21/12
  • Final presentation 16/2
  • We shall meet with each group on each of these
    dates mark your calendars!
  • Schedule can be made earlier if you are ready.
  • You are always welcome to meet us. Contact us by
    email.

40
Design document
  • Due in week 12 (21/12).
  • 3-5 pages (Word), Hebrew/English
  • Briefly describe main goal, input and output of
    program
  • Describe main data structures, algorithms, and
    scores.
  • Meet with me before submission.

41
Reference
  • Berger MF, Philippakis AA, Quershi AM, He FS,
    EstepIII PW, Bulyk ML. Compact, universal DNA
    microarrays to comprehensively determine
    transcription-factor binding site specificities.
    Nature biotechnology. 20063381429-1435.
  • Very important! Read the_brain.bwh.harvard.edu/UP
    BMseqn/suppl_methods.doc
  • Chen X, Hughes TR, Morris Q. RankMotif a
    motif-search algorithm that accounts for relative
    ranks of K-mers in binding transcription factors.
    Bioinformatics. 2007 Jul 123(13)i72-79.

42
Fin
Write a Comment
User Comments (0)
About PowerShow.com