Statistical Analysis for Word counting in Drosophila Core Promoters Yogita Mantri April 27 2005 Bioinformatics Capstone presentation

1
Statistical Analysis for Word Counting in Drosophila Core Promoters
Yogita Mantri
April 27, 2005
Bioinformatics Capstone presentation
2
  • Introduction and Motivation
  • Dataset used
  • Part I Unbiased word counting
  • Part II TCAGT-centric word counting
  • Conclusions and Future work

3
Introduction
  • Regulatory elements are short DNA sequences that
    control gene expression.
  • They are often found around the Transcription
    Start Site (TSS), sometimes further upstream.
  • Identification of promoters and regulatory
    elements is a major challenge in bioinformatics
  • Regulatory elements are not well-conserved
  • Computational discovery of the TSS is not
    straightforward
  • Promoter sequences do not have distinguishable
    statistical properties
  • Transcription is a highly cooperative process,
    involving competitive or cooperative binding
    that is not completely determined by the rest
    of the genome's DNA sequence

4
Drosophila Core Promoters
Computational analysis of core promoters in the
Drosophila genome, Ohler, Rubin et al., Genome
Biology 2002, 3(12):research0087.1-0087.12
Above image edited from
http://163.238.8.180/davis/Bio_327/lectures/Transcription/TranscriptionOver.html
5
Motivation for project
  • A database of core promoters with experimentally
    determined TSSs is a huge advantage over
    approaches that use only gene upstream regions.
  • Word-counting method to determine significant
    patterns, inspired by Dr. Peter Cherbas's earlier
    work.
  • "The arthropod initiator: the capsite consensus
    plays an important role in transcription",
    Cherbas L, Cherbas P., Insect Biochem Mol Biol.
    1993 Jan;23(1):81-90

6
  • Introduction and Motivation
  • Dataset used
  • Part I Unbiased word counting
  • Part II TCAGT-centric word counting
  • Conclusions and Future work

7
The Database of Drosophila Core Promoters
  • Compiled by Sumit Middha. It consists of
    Drosophila core promoters from three experimental
    sources.
  • Ohler, Rubin et al.
      • 1941 promoters
      • Stringent criteria for identifying TSSs,
        requiring 5' ends of multiple cDNAs to lie in
        close proximity.
  • Kadonaga et al.
      • 205 promoters
      • Changed the TSS to coincide with the A of the Inr
        consensus TCAGT even if experimental results
        reported the TSS in the vicinity.
      • The discrepancy was fixed by taking the
        experimentally reported TSS.
  • Eukaryotic Promoter Database
      • 1926 promoters
      • Assigned TSS based on experimental data with a
        precision of +/- 5 bp or better.
  • 3458 sequences after removing redundant entries
    in the dataset.

8
  • Introduction and Motivation
  • Dataset used
  • Part I Unbiased word counting
  • Part II TCAGT-centric word counting
  • Conclusions and Future work

9
Word Analysis Part I: Unbiased search
  • Used various statistical measures, such as the
    Z-score, on all possible n-mers in the entire
    dataset and in specific windows.
  • The goal was to see whether known patterns of
    interest were more significantly enriched in
    promoter sequences than other patterns.

10
Basic Statistics of the dataset
  • 3458 promoter sequences in the database.
  • First step was a word-frequency analysis
    (pentamers used for initial analysis)
  • Performed analysis on the following sets:
      • Entire dataset (DS-1)
      • Subset of the above dataset, with only the
        -20 to +20 region (DS-2)
  • 2 types of analyses, differing in the random
    sequences used:
      • 1st-order Markov chains based on the base and
        transition probabilities of the respective dataset
      • Non-coding regions

11
Random set
  • Generated 100 sets of 1st order Markov chains
  • Each set contained the same number of sequences as
    the original dataset (3458), each of the same length
    (350)
  • Computed occurrence of each pentamer in actual
    and random sequences
  • For random sequences, calculated the average and
    S.D. over all sets
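
The random-set construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the project: the function names (`fit_first_order`, `sample_sequence`, `count_words`) are hypothetical, sequences are assumed to be lower-case strings over a, c, g, t, and pentamers are counted with overlaps, which the slides do not specify.

```python
import random
from collections import Counter

ALPHABET = "acgt"

def fit_first_order(seqs):
    """Estimate base counts and first-order transition counts from sequences."""
    base = Counter()
    trans = {a: Counter() for a in ALPHABET}
    for s in seqs:
        base.update(ch for ch in s if ch in trans)
        for x, y in zip(s, s[1:]):
            if x in trans and y in trans:  # skip ambiguous symbols such as 'n'
                trans[x][y] += 1
    return base, trans

def sample_sequence(base, trans, length, rng):
    """Draw one sequence: first base from base counts, the rest from transitions."""
    seq = rng.choices(ALPHABET, weights=[base[a] for a in ALPHABET])
    while len(seq) < length:
        weights = [trans[seq[-1]][b] for b in ALPHABET]
        if sum(weights) == 0:  # base never seen with a successor: fall back
            weights = [base[a] for a in ALPHABET]
        seq += rng.choices(ALPHABET, weights=weights)
    return "".join(seq)

def count_words(seqs, n=5):
    """Count overlapping occurrences of every n-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

# Toy stand-in for the real promoter set (assumption: real data not shown here).
rng = random.Random(0)
real = ["tcagtcagtt", "aaaaatcagt"]
base, trans = fit_first_order(real)
random_sets = [[sample_sequence(base, trans, len(s), rng) for s in real]
               for _ in range(100)]
random_counts = [count_words(rs) for rs in random_sets]
```

With the real data, `real` would hold the 3458 promoter sequences of length 350, so each of the 100 random sets would match the original in number and length of sequences.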

12
Z-score
  • A test of significance
  • Mean and S.D. calculated over the 100 random sets
  • Calculated Z-scores for all pentamers
  • Looking for pentamers with very high or very low
    Z-scores
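
As a sketch of the calculation, the Z-score of a word is (observed count - mean of random counts) / S.D. of random counts. The helper names below are hypothetical, and the use of the sample standard deviation is an assumption; the slides do not say which estimator was used.

```python
from statistics import mean, stdev

def z_scores(observed, random_counts):
    """Z-score per word.

    observed: dict mapping word -> count in the real sequences.
    random_counts: list of dicts, one per random set (at least two sets).
    """
    scores = {}
    for word, obs in observed.items():
        sample = [rc.get(word, 0) for rc in random_counts]
        mu, sd = mean(sample), stdev(sample)  # sample S.D. (assumption)
        if sd > 0:  # words with no variation in the random sets are skipped
            scores[word] = (obs - mu) / sd
    return scores

def ranked(scores):
    """Words sorted from highest to lowest Z-score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy counts, for illustration only.
observed = {"tcagt": 30, "tataa": 5}
random_counts = [{"tcagt": 10, "tataa": 10},
                 {"tcagt": 12, "tataa": 14},
                 {"tcagt": 14, "tataa": 12}]
scores = z_scores(observed, random_counts)
```

Very high scores point to over-represented words and very low scores to under-represented ones, which is why both tails are of interest.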

13
Rank of TCAGT and variants in entire dataset
Rank  Pattern  Z-Score
1     aaaaa    113.037
2     ttttt    111.647
3     ttttg    88.1
4     gaaaa    83.156
5     aaaac    82.69
6     atttt    82.152
7     gtttt    82.067
8     ttttc    79.485
9     aaaat    78.348
10    gcagc    77.091
101   gcagt    29.269
115   tcagt    27.156
307   acagt    10.286
485   tcatt    1.375
965   tataa    -25.213
14
Summary of known pentamers in different windows
-20 to +20
Pattern  Z-score    Rank
tcagt    58.929     2
tcatt    3.6        418
gcagt    25.545     34
acagt    12.923     179
tataa    -25        1022
Non-overlapping windows
Pattern  Z-score    Rank
tcagt    4.277429   356
tcatt    -2.00671   590
gcagt    7.714143   246
acagt    2.080429   435
tataa    -9.064     898
Sliding windows
Pattern  Z-score    Rank
tcagt    7.559871   254
tcatt    -1.402484  576
gcagt    9.0644839  200
acagt    2.7177419  409
tataa    -8.962065  880
15
Z-score Plots of tcagt and variants using sliding
windows of 10 bp
16
Lesson
  • Cannot ignore position preference of regulatory
    motifs!
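
One way to make position preference visible is to count a word window by window instead of over whole sequences. The sketch below is an illustration only: `window_counts` is a hypothetical helper, and the window width and step are left as parameters because the transcript gives only "sliding windows of 10 bp" without saying which of the two that refers to.

```python
def window_counts(seqs, word, width, step):
    """Count occurrences of `word` lying fully inside each window.

    Windows are [start, start + width) with start advancing by `step`;
    positions are 0-based indices into the sequences, not TSS-relative.
    Returns a list of (start, count) pairs.
    """
    length = min(len(s) for s in seqs)
    n = len(word)
    result = []
    for start in range(0, length - width + 1, step):
        count = 0
        for s in seqs:
            window = s[start:start + width]
            count += sum(1 for i in range(len(window) - n + 1)
                         if window[i:i + n] == word)
        result.append((start, count))
    return result

# Toy sequence with tcagt at index 5 (illustration only).
profile = window_counts(["aaaaatcagtaaaaaaaaaa"], "tcagt", width=10, step=5)
```

Setting `step` equal to `width` gives non-overlapping windows; a smaller step gives sliding windows. Converting window starts to TSS-relative coordinates needs the TSS offset within each sequence, which the transcript does not state.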

17
  • Introduction and Motivation
  • Dataset used
  • Part I Unbiased word counting
  • Part II TCAGT-centric word counting
  • Conclusions and Future work

18
Word Analysis Part II: Guided search, starting
with known INR element TCAGT
  • Identification of INR enriched regions
  • Identification of synonyms
  • Correlation analysis of INR synonyms
  • Guided search

19
TCAGT-centric word analysis
Window    Z-score
(-3,3)    130.58
(-4,2)    116.27
(-2,4)    105.67
(-5,1)    98.96
(-6,-1)   95.71
(-7,-2)   85.83
(-1,5)    59.23
(1,6)     47.68
(2,7)     43.30
(3,8)     28.79
20
INR Synonyms
Group 1: CTCAG--- ATCAG--- TTCAG--- GTCAG---
         -TCAGT-- ---AGTTG ---AGTCG --CAGTT- --CAGTC-
Group 2: TTAGT
Group 3: ACACT--- -CACTCTG
Group 4: -TCACA- GTCAC-- --CACAC
Group 5: TCACTCT
Group 6: -CATTC TCATT-
Computational analysis of core promoters in the
Drosophila genome, Ohler, Rubin et al., Genome
Biology 2002, 3(12):research0087.1-0087.12
21
Binary Tree Representation of Dataset
TOTAL 3412
  INR+ 1611
    TATA+ 410
      DPE+ 79
      DPE- 331
    TATA- 1201
      DPE+ 369
      DPE- 832
  INR- 1801
    TATA+ 397
      DPE+ 76
      DPE- 321
    TATA- 1404
      DPE+ 232
      DPE- 1172
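
A tree like this can be built by counting promoters at each node of an INR, then TATA, then DPE split. The sketch below is a hedged illustration: `tree_counts` and `label` are hypothetical names, and the per-promoter presence flags are taken as given, since the transcript does not say how motif presence was decided.

```python
from collections import Counter

MOTIFS = ("INR", "TATA", "DPE")

def tree_counts(flags):
    """Count promoters at every node of the INR -> TATA -> DPE binary tree.

    flags: iterable of (has_inr, has_tata, has_dpe) booleans, one per promoter.
    Returns a Counter keyed by the tuple of decisions leading to each node;
    the empty tuple is the root.
    """
    nodes = Counter()
    for f in flags:
        for depth in range(len(f) + 1):
            nodes[tuple(f[:depth])] += 1
    return nodes

def label(path):
    """Readable node name, e.g. (True, False) -> 'INR+ TATA-'."""
    parts = [m + ("+" if present else "-") for m, present in zip(MOTIFS, path)]
    return " ".join(parts) or "TOTAL"

# Toy flags for four promoters (illustration only).
flags = [(True, True, False), (True, False, True),
         (False, False, False), (True, True, True)]
nodes = tree_counts(flags)
for path in sorted(nodes, key=lambda p: (len(p), p), reverse=False):
    print(label(path), nodes[path])
```

Each node count equals the sum of its two children, which gives a quick consistency check on any tree produced this way.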
22
3 Clusters in INR-positive set
23
Contingency Matrices for INR, TATA, DPE
        DPE+   DPE-   Total
INR+     448   1163    1611
INR-     308   1493    1801
Total    756   2656    3412
INR, DPE Log Likelihood 0.227

        TATA+  TATA-  Total
INR+     410   1201    1611
INR-     397   1404    1801
Total    807   2605    3412
INR, TATA Log Likelihood 0.073

        DPE+   DPE-   Total
TATA+    155    652     807
TATA-    601   2004    2605
Total    756   2656    3412
TATA, DPE Log Likelihood -0.143
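
The slides do not define the "Log Likelihood" statistic. One reading that reproduces all three reported values is the natural log of the observed joint count divided by the count expected if the two motifs were independent; this is an inference from the numbers, not something stated in the presentation. A minimal sketch, with `log_ratio` as a hypothetical name:

```python
import math

def log_ratio(both, a_total, b_total, n):
    """ln(observed joint count / count expected under independence).

    both: promoters with motif A and motif B.
    a_total, b_total: promoters with A, promoters with B.
    n: total number of promoters.
    """
    expected = a_total * b_total / n
    return math.log(both / expected)

N = 3412
inr_dpe = log_ratio(448, 1611, 756, N)    # INR and DPE
inr_tata = log_ratio(410, 1611, 807, N)   # INR and TATA
tata_dpe = log_ratio(155, 807, 756, N)    # TATA and DPE
print(round(inr_dpe, 3), round(inr_tata, 3), round(tata_dpe, 3))
```

Under this reading, a positive value means the pair co-occurs more often than independence would predict (INR with DPE, and weakly INR with TATA), and a negative value means it co-occurs less often (TATA with DPE).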
24
Possible Alternative TATA and INR Synonyms ??
[Plot not reproduced in transcript. Vertical axis runs from 0.0 to 90.0. Annotations: "TATA 2 ?" and "INR 2 ?". Words labelled on the plot: tctttcttt, ggtcacac, ctatcgat, ctcgaggg, gtcacact, ttctttccg, cggtcacac.]
25
Enrichment further upstream: New Binding Sites?
actatcgat
ctatcgat
tatcgata
aactatcgat
26
Next Level of Binary Tree analysis
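TOTAL 3412
  INR+ 1611
    TATA+ 410 (split into DPE+ / DPE-)
    TATA- 1201 (split into DPE+ / DPE-)
  INR- 1801
    split on the candidate words INR_2 and TATA_2
    (each + / -), then on DPE+ / DPE-; counts are
    shown as ? on the slide
[The counts 397 and 1404 also appear on this slide; their placement in the extended tree is not recoverable from the transcript.]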
27
Conclusions and Future steps
  • The main goal of this project was to try to
    identify significant words based only on
    statistical over-representation.
  • The first part of the analysis, using an unbiased
    search, was successful only in a very
    narrow range of positions around the TSS.
  • However, the biased search starting with the Inr
    consensus revealed the 3 known regulatory
    elements in that region.
  • An analysis of the Inr-negative set showed
    over-representation of patterns at the positions
    where the Inr, TATA and DPE would be expected;
    these patterns could be synonyms.
  • Thus the word-counting strategy has the potential
    to reveal:
      • Regulatory motifs and interrelationships that
        other motif discovery programs cannot
      • Synonyms for regulatory motifs
      • Dependencies among regulatory motifs

28
Acknowledgements
  • Dr. Haixu Tang
  • Dr. Sun Kim
  • Dr. Peter Cherbas
  • Sumit Middha
  • Bioinformatics Research Group