Statistical Analysis for Word Counting in Drosophila Core Promoters, Yogita Mantri, April 27 2005, Bioinformatics Capstone presentation

1
Statistical Analysis for Word Counting in Drosophila Core Promoters
Yogita Mantri
April 27, 2005
Bioinformatics Capstone presentation
2

Introduction and Motivation
Dataset used
Part I Unbiased word counting
Part II TCAGT-centric word counting
Conclusions and Future work

3
Introduction

Regulatory elements are short DNA sequences that control gene expression.
They are often found around the Transcription Start Site (TSS), sometimes further upstream.
Identification of promoters and regulatory elements is a major challenge in bioinformatics.
Regulatory elements are not well conserved.
Computational discovery of the TSS is not straightforward.
Promoter sequences do not have distinguishable statistical properties.
Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined by the rest of the genome's DNA sequence.

4
Drosophila Core Promoters
Ohler, Rubin et al., "Computational analysis of core promoters in the Drosophila genome", Genome Biology 2002, 3(12):research0087.1-0087.12
Image edited from http://163.238.8.180/davis/Bio_327/lectures/Transcription/TranscriptionOver.html
5
Motivation for project

A database of core promoters with experimentally determined TSSs is a huge advantage over approaches that use only gene upstream regions.
The word-counting method for finding significant patterns was inspired by Dr. Peter Cherbas's earlier work:
Cherbas L, Cherbas P. "The arthropod initiator: the capsite consensus plays an important role in transcription." Insect Biochem Mol Biol. 1993 Jan;23(1):81-90.

7
The Database of Drosophila Core Promoters

Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
Ohler, Rubin et al.: 1941 promoters
  Stringent criteria for identifying TSSs, requiring the 5' ends of multiple cDNAs to lie in close proximity.
Kadonaga et al.: 205 promoters
  The TSS had been changed to coincide with the A of the Inr consensus TCAGT, even when experimental results reported the TSS only in the vicinity.
  The discrepancy was fixed by taking the experimentally reported TSS.
Eukaryotic Promoter Database: 1926 promoters
  TSS assigned based on experimental data with a precision of ±5 bp or better.
3458 sequences remained after removing redundant entries from the dataset.

9
Word Analysis Part I: Unbiased search

Used statistical measures such as the Z-score on all possible n-mers, both in the entire dataset and in specific windows.
The goal was to see whether known patterns of interest were significantly more enriched in promoter sequences than other patterns.

10
Basic Statistics of the dataset

3458 promoter sequences in the database.
The first step was a word-frequency analysis (pentamers were used for the initial analysis).
The analysis was performed on the following sets:
Entire dataset (DS-1)
Subset of the above dataset containing only the -20 to +20 region (DS-2)
Two types of analysis, differing in the random sequences used:
1st-order Markov chains based on the base and transition probabilities of the respective dataset
Non-coding regions

11
Random set

Generated 100 sets of 1st-order Markov chains.
Each set contained the same number of sequences as the original dataset (3458), each of the same length (350).
Computed the occurrence of each pentamer in the actual and random sequences.
For the random sequences, calculated the average and S.D. over all sets.
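
The random-set procedure above can be illustrated with a minimal sketch. It assumes lowercase a/c/g/t sequences and estimates the base and transition probabilities directly from the input; the function names (fit_first_order, sample_sequence, count_words) and the toy sequences are hypothetical and are not taken from the presentation.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "acgt"

def fit_first_order(seqs):
    """Count single bases and base-to-base transitions in the sequences."""
    base = Counter()
    trans = defaultdict(Counter)
    for s in seqs:
        base.update(s)
        for x, y in zip(s, s[1:]):
            trans[x][y] += 1
    return base, trans

def sample_sequence(base, trans, length, rng):
    """Draw one sequence from the fitted first-order Markov chain."""
    seq = rng.choices(ALPHABET, weights=[base[b] for b in ALPHABET])
    while len(seq) < length:
        weights = [trans[seq[-1]][b] for b in ALPHABET]
        if sum(weights) == 0:
            # The previous base was never followed by anything in the data:
            # fall back to the plain base frequencies.
            weights = [base[b] for b in ALPHABET]
        seq += rng.choices(ALPHABET, weights=weights)
    return "".join(seq)

def count_words(seqs, k=5):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy = ["tcagtcagttataaaagc", "ggtcagtttttgcagcaa"]  # toy stand-ins for promoters
base, trans = fit_first_order(toy)
random_set = [sample_sequence(base, trans, len(toy[0]), rng) for _ in toy]
real_counts = count_words(toy)
rand_counts = count_words(random_set)
```

For the real analysis one would repeat the sampling 100 times with 3458 sequences of length 350 per set, keeping one count table per set.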

12
Z-score

A test of significance.
Mean and S.D. calculated over the 100 random sets.
Calculated Z-scores for all pentamers.
Looking for pentamers with very high or very low Z-scores.
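
As an illustration, the Z-score of a word can be taken as (observed count - mean random count) / S.D. of the random counts. The sketch below assumes the sample standard deviation; the slides do not say whether the sample or population form was used, and the function name z_scores and the toy counts are hypothetical.

```python
from statistics import mean, stdev

def z_scores(observed, random_counts):
    """observed: {word: count} for the real data.
    random_counts: list of {word: count}, one dict per random set."""
    scores = {}
    for word, obs in observed.items():
        samples = [rc.get(word, 0) for rc in random_counts]
        mu, sd = mean(samples), stdev(samples)  # sample S.D. (an assumption)
        if sd > 0:  # skip words whose random counts never vary
            scores[word] = (obs - mu) / sd
    return scores

# Toy counts, not data from the presentation.
observed = {"tcagt": 30, "tataa": 5}
random_counts = [
    {"tcagt": 10, "tataa": 9},
    {"tcagt": 12, "tataa": 11},
    {"tcagt": 14, "tataa": 13},
]
scores = z_scores(observed, random_counts)
ranked = sorted(scores, key=scores.get, reverse=True)  # highest Z-score first
```

Sorting by score in this way gives a ranking of the kind shown on the following slides.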

13
Rank of TCAGT and variants in entire dataset
Rank  Pattern  Z-score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213
14
Summary of known pentamers in different windows

-20 to +20 region
Pattern  Z-score    Rank
tcagt    58.929     2
tcatt    3.6        418
gcagt    25.545     34
acagt    12.923     179
tataa    -25        1022

Non-overlapping windows
Pattern  Z-score    Rank
tcagt    4.277429   356
tcatt    -2.00671   590
gcagt    7.714143   246
acagt    2.080429   435
tataa    -9.064     898

Sliding windows
Pattern  Z-score    Rank
tcagt    7.559871   254
tcatt    -1.402484  576
gcagt    9.0644839  200
acagt    2.7177419  409
tataa    -8.962065  880
15
Z-score Plots of tcagt and variants using sliding
windows of 10 bp
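
A position-resolved count of the kind behind such a plot might be sketched as follows. The sketch assumes all sequences are aligned so that the TSS sits at the same index (tss_index), and that an occurrence belongs to a window if the word starts inside it; both are assumptions, and window_counts is a hypothetical helper. Setting step equal to width gives non-overlapping windows, and step=1 gives sliding windows.

```python
def window_counts(seqs, word, tss_index, width=10, step=1):
    """Count occurrences of `word` that start inside each window.

    Returns {window start relative to the TSS: count over all sequences}.
    """
    seq_len = min(len(s) for s in seqs)
    counts = {}
    for start in range(0, seq_len - width + 1, step):
        total = 0
        for s in seqs:
            for i in range(start, start + width):
                if s.startswith(word, i):
                    total += 1
        counts[start - tss_index] = total
    return counts

# Toy sequences of length 20 with tcagt starting at index 5 and index 4.
toy = ["a" * 5 + "tcagt" + "a" * 10, "c" * 4 + "tcagt" + "c" * 11]
non_overlapping = window_counts(toy, "tcagt", tss_index=7, width=5, step=5)
sliding = window_counts(toy, "tcagt", tss_index=7, width=10, step=1)
```

Each per-window count could then be converted to a Z-score against the same window in the random sets, which is what makes position preference visible.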
16
Lesson

Cannot ignore position preference of regulatory
motifs!

18
Word Analysis Part II: Guided search, starting with the known INR element TCAGT

Identification of INR enriched regions
Identification of synonyms
Correlation analysis of INR synonyms
Guided search

19
TCAGT-centric word analysis
Window    Z-score
(-3,3)    130.58
(-4,2)    116.27
(-2,4)    105.67
(-5,1)    98.96
(-6,-1)   95.71
(-7,-2)   85.83
(-1,5)    59.23
(1,6)     47.68
(2,7)     43.30
(3,8)     28.79
20
INR Synonyms
Group 1: CTCAG--- ATCAG--- TTCAG--- GTCAG--- -TCAGT-- ---AGTTG ---AGTCG --CAGTT- --CAGTC-
Group 2: TTAGT
Group 3: ACACT--- -CACTCTG
Group 4: -TCACA- GTCAC-- --CACAC
Group 5: TCACTCT
Group 6: -CATTC TCATT-
Ohler, Rubin et al., "Computational analysis of core promoters in the Drosophila genome", Genome Biology 2002, 3(12):research0087.1-0087.12
21
Binary Tree Representation of Dataset
TOTAL 3412
  INR+ 1611
    TATA+ 410
      DPE+ 79
      DPE- 331
    TATA- 1201
      DPE+ 369
      DPE- 832
  INR- 1801
    TATA+ 397
      DPE+ 76
      DPE- 321
    TATA- 1404
      DPE+ 232
      DPE- 1172
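
The tree can be checked for internal consistency with a short sketch. The assignment of the eight leaf counts to INR/TATA/DPE combinations below is an inference: it is the assignment under which every parent count in the tree and every margin in the contingency matrices of slide 23 is reproduced. The helper name subtotal is hypothetical.

```python
# Leaf counts keyed by (INR, TATA, DPE) presence; True means the element is present.
# This assignment is inferred from the parent counts, not stated on the slide.
leaves = {
    (True, True, True): 79,    (True, True, False): 331,
    (True, False, True): 369,  (True, False, False): 832,
    (False, True, True): 76,   (False, True, False): 321,
    (False, False, True): 232, (False, False, False): 1172,
}

def subtotal(leaves, inr=None, tata=None, dpe=None):
    """Sum the leaves that match the given flags; None means 'either'."""
    want = (inr, tata, dpe)
    return sum(
        n for key, n in leaves.items()
        if all(w is None or w == k for w, k in zip(want, key))
    )

total = subtotal(leaves)
inr_pos = subtotal(leaves, inr=True)
inr_neg = subtotal(leaves, inr=False)
inr_pos_tata_pos = subtotal(leaves, inr=True, tata=True)
inr_pos_dpe_pos = subtotal(leaves, inr=True, dpe=True)
tata_pos_dpe_pos = subtotal(leaves, tata=True, dpe=True)
```

The same helper yields any pairwise 2x2 table by fixing two flags and leaving the third as None.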
22
3 Clusters in INR-positive set
23
Contingency Matrices for INR, TATA, DPE

        DPE+   DPE-   Total
INR+    448    1163   1611
INR-    308    1493   1801
Total   756    2656
INR, DPE log likelihood: 0.227

        TATA+  TATA-  Total
INR+    410    1201   1611
INR-    397    1404   1801
Total   807    2605
INR, TATA log likelihood: 0.073

        DPE+   DPE-   Total
TATA+   155    652    807
TATA-   601    2004   2605
Total   756    2656
TATA, DPE log likelihood: -0.143
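
The slides do not define the "log likelihood" measure. The three reported values are, however, consistent with the natural log of the observed co-occurrence count over the count expected under independence (row total x column total / N). The sketch below works under that assumption, with N = 3412 taken from the tree total; the function name log_ratio is hypothetical.

```python
import math

def log_ratio(both, row_total, col_total, n):
    """Natural log of observed over expected co-occurrence under independence."""
    expected = row_total * col_total / n
    return math.log(both / expected)

n = 3412  # total number of promoters in the tree
inr_dpe = log_ratio(448, 1611, 756, n)
inr_tata = log_ratio(410, 1611, 807, n)
tata_dpe = log_ratio(155, 807, 756, n)
print(round(inr_dpe, 3), round(inr_tata, 3), round(tata_dpe, 3))
# → 0.227 0.073 -0.143
```

Under this reading, a positive value means the two elements co-occur more often than expected by chance, and a negative value means they co-occur less often.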
24
Possible Alternative TATA and INR Synonyms?
[Plot: axis from 0.0 to 90.0; candidate groups labeled "TATA 2 ?" and "INR 2 ?"]
Words shown: tctttcttt, ggtcacac, ctatcgat, ctcgaggg, gtcacact, ttctttccg, cggtcacac
25
Enrichment further upstream: New Binding Sites?
actatcgat
ctatcgat
tatcgata
aactatcgat
26
Next Level of Binary Tree analysis
[Tree diagram. Recoverable labels: TOTAL 3412; INR and INR- (1611, 1801); TATA and TATA-; INR_2 and INR_2-; TATA_2 and TATA_2-; DPE and DPE- leaves; counts 410, 1201, 397, 1404; two nodes marked "?".]
27
Conclusions and Future steps

The main goal of this project was to identify significant words based only on statistical over-representation.
The first part of the analysis, using an unbiased search, was successful only in a very narrow range of positions around the TSS.
However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
An analysis of the Inr-negative set showed over-representation of patterns in the same positions where the Inr, TATA and DPE would be expected; these could be possible synonyms.
Thus the word-counting strategy has the potential to reveal:
Regulatory motifs and interrelationships that other motif discovery programs cannot
Synonyms for regulatory motifs
Dependencies among regulatory motifs