Title: Statistical Analysis for Word Counting in Drosophila Core Promoters
Yogita Mantri
April 27, 2005
Bioinformatics Capstone presentation
2 Outline
- Introduction and Motivation
- Dataset used
- Part I: Unbiased word counting
- Part II: TCAGT-centric word counting
- Conclusions and Future work
3 Introduction
- Regulatory elements are short DNA sequences that control gene expression.
- They are often found around the Transcription Start Site (TSS), sometimes further upstream.
- Identification of promoters and regulatory elements is a major challenge in bioinformatics
- Regulatory elements are not well-conserved
- Computational discovery of the TSS is not straightforward
- Promoter sequences do not have distinguishable statistical properties
- Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined from the rest of the genome's DNA sequence
4 Drosophila Core Promoters
Computational analysis of core promoters in the Drosophila Genome, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1-0087.12
Image edited from http://163.238.8.180/davis/Bio_327/lectures/Transcription/TranscriptionOver.html
5 Motivation for project
- A database of core promoters with experimentally determined TSSs is a huge advantage over other approaches that use only gene upstream regions.
- Word-counting method to determine significant patterns, inspired by Dr. Peter Cherbas's earlier work.
- "The arthropod initiator: the capsite consensus plays an important role in transcription", Cherbas L, Cherbas P., Insect Biochem Mol Biol. 1993 Jan;23(1):81-90
7 The Database of Drosophila Core Promoters
- Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
- Ohler, Rubin et al.
  - 1941 promoters
  - Stringent criteria for identifying TSSs, requiring 5' ends of multiple cDNAs to lie in close proximity.
- Kadonaga et al.
  - 205 promoters
  - Changed TSS to coincide with A of Inr consensus TCAGT even if experimental results reported TSS in the vicinity.
  - The discrepancy was fixed by taking the experimentally reported TSS.
- Eukaryotic Promoter Database
  - 1926 promoters
  - Assigned TSS based on experimental data with a precision of +/- 5 bp or better.
- 3458 sequences after removing redundant entries in the dataset.
9 Word Analysis Part I: Unbiased search
- Used various statistical measures, such as the Z-score, on all possible n-mers in the entire dataset and in specific windows.
- The goal was to see whether known patterns of interest were significantly more enriched in promoter sequences than other patterns.
10 Basic Statistics of the dataset
- 3458 promoter sequences in the database.
- First step was a word-frequency analysis (pentamers used for initial analysis)
- Performed analysis on the following sets
  - Entire dataset (DS-1)
  - Subset of above dataset, with only the -20 to +20 region (DS-2)
- 2 types of analyses, differing in the random sequences used
  - 1st-order Markov chains based on base and transition probabilities of the respective dataset
  - Non-coding regions
11 Random set
- Generated 100 sets of 1st-order Markov chains
- Each set contained the same number of sequences as the original dataset (3458), each of the same length (350)
- Computed occurrence of each pentamer in actual and random sequences
- For random sequences, calculated average and S.D. over all sets
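
The random-set procedure above could be sketched in Python roughly as follows. This is a minimal sketch, not the code used for the project: the function names are hypothetical, sequences are assumed to be lowercase strings over a, c, g, t, and pentamers are assumed to be counted at every overlapping position.

```python
import random
from collections import Counter

BASES = "acgt"

def fit_first_order(seqs):
    """Estimate base frequencies and first-order transition probabilities."""
    base_counts = Counter()
    trans_counts = {b: Counter() for b in BASES}
    for s in seqs:
        base_counts.update(s)
        for prev, nxt in zip(s, s[1:]):
            # Skip pairs containing symbols outside the alphabet (e.g. 'n').
            if prev in trans_counts and nxt in BASES:
                trans_counts[prev][nxt] += 1
    total = sum(base_counts[b] for b in BASES)
    base_p = [base_counts[b] / total for b in BASES]
    trans_p = {}
    for b in BASES:
        row_total = sum(trans_counts[b][n] for n in BASES)
        # Fall back to base frequencies if a base never has a successor.
        trans_p[b] = ([trans_counts[b][n] / row_total for n in BASES]
                      if row_total else base_p)
    return base_p, trans_p

def sample_sequence(base_p, trans_p, length, rng):
    """Draw one random sequence (length >= 1) from the fitted chain."""
    seq = rng.choices(BASES, weights=base_p)
    while len(seq) < length:
        seq += rng.choices(BASES, weights=trans_p[seq[-1]])
    return "".join(seq)

def count_words(seqs, k=5):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def random_set_counts(seqs, n_sets=100, length=350, k=5, seed=0):
    """Pentamer counts for n_sets random sets, each the size of the real dataset."""
    rng = random.Random(seed)
    base_p, trans_p = fit_first_order(seqs)
    return [
        count_words([sample_sequence(base_p, trans_p, length, rng) for _ in seqs], k)
        for _ in range(n_sets)
    ]
```

Fitting the chain on the same dataset that is being scored mirrors the "respective dataset" wording of the previous slide; the fixed seed is only there to make the sketch reproducible.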
12 Z-score
- A test of significance
- Mean and S.D. calculated over 100 sets
- Calculated Z-scores for all pentamers
- Looking for pentamers with very high or very low Z-scores
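
For a given pentamer, the Z-score compares its observed count with the counts in the random sets: Z = (observed count - mean of random counts) / S.D. of random counts. A minimal Python sketch follows, with hypothetical function names; the slides do not say whether the sample or population S.D. was used, so the sample S.D. here is an assumption.

```python
from statistics import mean, stdev

def z_scores(observed, random_counts):
    """observed: word -> count in the real data.
    random_counts: list of word -> count mappings, one per random set."""
    scores = {}
    for word, obs in observed.items():
        sample = [rc.get(word, 0) for rc in random_counts]
        mu = mean(sample)
        sd = stdev(sample)  # sample S.D. (assumption)
        if sd > 0:          # skip words with no variation in the random sets
            scores[word] = (obs - mu) / sd
    return scores

def rank_words(scores):
    """Words ordered from highest to lowest Z-score."""
    return sorted(scores, key=scores.get, reverse=True)
```

Ranking by Z-score in this way is what produces tables such as the one on the next slide, where rank 1 is the most over-represented word.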
13 Rank of TCAGT and variants in entire dataset
Rank  Pattern  Z-score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213
14 Summary of known pentamers in different windows

-20 to +20
Pattern  Z-score  Rank
tcagt    58.929   2
tcatt    3.6      418
gcagt    25.545   34
acagt    12.923   179
tataa    -25      1022

Non-overlapping windows
Pattern  Z-score    Rank
tcagt    4.277429   356
tcatt    -2.00671   590
gcagt    7.714143   246
acagt    2.080429   435
tataa    -9.064     898

Sliding windows
Pattern  Z-score    Rank
tcagt    7.559871   254
tcatt    -1.402484  576
gcagt    9.0644839  200
acagt    2.7177419  409
tataa    -8.962065  880
15 Z-score plots of tcagt and variants using sliding windows of 10 bp
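
A minimal Python sketch of how windowed counts for a word such as tcagt might be gathered before scoring each window against the random sets. The function name, the step of 1 for sliding windows, and the choice to assign a hit to a window by its start position are assumptions. Offsets are 0-based sequence indices; converting them to TSS-relative positions depends on where the TSS sits in each sequence, which the slides do not state.

```python
def window_counts(seqs, word, width=10, step=1):
    """Count occurrences of `word` that start inside each window,
    summed over all sequences. Returns a list of (window_start, count)."""
    length = min(len(s) for s in seqs)
    counts = []
    for w in range(0, length - width + 1, step):
        total = 0
        for s in seqs:
            # A hit belongs to the window if it starts in [w, w + width).
            i = s.find(word, w)
            while i != -1 and i < w + width:
                total += 1
                i = s.find(word, i + 1)
        counts.append((w, total))
    return counts
```

Setting `step=width` gives non-overlapping windows, while `step=1` gives sliding windows.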
16 Lesson
- Cannot ignore the position preference of regulatory motifs!
18 Word Analysis Part II: Guided search, starting with known INR element TCAGT
- Identification of INR-enriched regions
- Identification of synonyms
- Correlation analysis of INR synonyms
- Guided search
19 TCAGT-centric word analysis
Window   Z-score
(-3,3)   130.58
(-4,2)   116.27
(-2,4)   105.67
(-5,1)   98.96
(-6,-1)  95.71
(-7,-2)  85.83
(-1,5)   59.23
(1,6)    47.68
(2,7)    43.30
(3,8)    28.79
20 INR Synonyms
Group 1: CTCAG--- ATCAG--- TTCAG--- GTCAG--- -TCAGT-- ---AGTTG ---AGTCG --CAGTT- --CAGTC-
Group 2: TTAGT
Group 3: ACACT--- -CACTCTG
Group 4: -TCACA- GTCAC-- --CACAC
Group 5: TCACTCT
Group 6: -CATTC TCATT-
Computational analysis of core promoters in the Drosophila Genome, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1-0087.12
21 Binary Tree Representation of Dataset
TOTAL 3412
  INR 1611
    TATA 410
      DPE 79
      DPE- 331
    TATA- 1201
      DPE 369
      DPE- 832
  INR- 1801
    TATA 397
      DPE 76
      DPE- 321
    TATA- 1404
      DPE 232
      DPE- 1172
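
The eight leaf counts of the tree can be collapsed into pairwise 2x2 tables for any two motifs. The Python sketch below is illustrative only: the `leaves` layout and the function name are hypothetical, and the assignment of each count to a leaf is the one that is consistent with the tree's internal node totals and with the contingency matrices on slide 23.

```python
from itertools import product

MOTIFS = ("INR", "TATA", "DPE")

# (INR present, TATA present, DPE present) -> number of promoters
leaves = {
    (True, True, True): 79,    (True, True, False): 331,
    (True, False, True): 369,  (True, False, False): 832,
    (False, True, True): 76,   (False, True, False): 321,
    (False, False, True): 232, (False, False, False): 1172,
}

def contingency(leaves, a, b):
    """2x2 table for motifs a and b: (a present, b present) -> count."""
    i, j = MOTIFS.index(a), MOTIFS.index(b)
    table = {pair: 0 for pair in product((True, False), repeat=2)}
    for key, n in leaves.items():
        table[(key[i], key[j])] += n
    return table
```

For example, `contingency(leaves, "INR", "DPE")[(True, True)]` adds the two INR-positive, DPE-positive leaves (79 and 369) to give 448.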
22 Three clusters in INR-positive set
23 Contingency Matrices for INR, TATA, DPE

        DPE    DPE-   Total
INR     448    1163   1611
INR-    308    1493   1801
Total   756    2656   3412
INR, DPE Log Likelihood: 0.227

        TATA   TATA-  Total
INR     410    1201   1611
INR-    397    1404   1801
Total   807    2605   3412
INR, TATA Log Likelihood: 0.073

        DPE    DPE-   Total
TATA    155    652    807
TATA-   601    2004   2605
Total   756    2656   3412
TATA, DPE Log Likelihood: -0.143
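
The slides do not define the log-likelihood measure. The three reported values are reproduced by the natural log of the observed co-occurrence count divided by the count expected if the two motifs were independent, so the Python sketch below uses that reading; the function name is hypothetical.

```python
import math

def log_likelihood(both, row_total, col_total, grand_total):
    """ln(observed / expected) for the cell where both motifs are present.
    expected = row_total * col_total / grand_total (independence)."""
    expected = row_total * col_total / grand_total
    return math.log(both / expected)

# Counts taken from the three contingency matrices above.
inr_dpe = log_likelihood(448, 1611, 756, 3412)
inr_tata = log_likelihood(410, 1611, 807, 3412)
tata_dpe = log_likelihood(155, 807, 756, 3412)
```

Rounded to three decimals these give 0.227, 0.073 and -0.143. Under this reading a positive value means the two motifs co-occur more often than expected by chance, and a negative value means less often.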
24 Possible Alternative TATA and INR Synonyms?
[Figure: plot with a vertical axis from 0.0 to 90.0 and two regions labelled "TATA 2 ?" and "INR 2 ?". Words marked on the plot: tctttcttt, ggtcacac, ctatcgat, ctcgaggg, gtcacact, ttctttccg, cggtcacac]
25 Enrichment further upstream: New Binding Sites?
actatcgat
ctatcgat
tatcgata
aactatcgat
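26 Next Level of Binary Tree analysis
[Figure: binary tree extending the tree on slide 21. Node labels: TOTAL 3412; INR 1611, INR- 1801; TATA 410, TATA- 1201; INR_2, INR_2-; TATA_2, TATA_2-; DPE, DPE-. The counts 397 and 1404 and two "?" placeholders also appear in the tree.]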
27 Conclusions and Future steps
- The main goal of this project was to try to identify significant words based only on statistical over-representation.
- The first part of the analysis, using an unbiased search method, was successful only in a very narrow range of positions around the TSS.
- However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
- An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these patterns could be synonyms.
- Thus the word-counting strategy has the potential to reveal
  - Regulatory motifs and interrelationships that other motif discovery programs cannot
  - Synonyms for regulatory motifs
  - Dependencies among regulatory motifs
28 Acknowledgements
- Dr. Haixu Tang
- Dr. Sun Kim
- Dr. Peter Cherbas
- Sumit Middha
- Bioinformatics Research Group