Title: Promoter Analysis
1Promoter Analysis
- Goals, Problems & Solutions
Chaim Linhart, Rani Elkon, Dec. 03
2Outline
- Background
- Questions
- Some answers
- PRIMA
3Regulation of Expression
- Each cell contains a copy of the whole genome
- BUT utilizes only a subset of the genes
- Most genes are highly regulated
- Their expression is limited to specific tissues,
developmental stages, physiological conditions
How is the expression of genes regulated?
One way is through transcriptional regulation
4Regulation of Transcription
- The conditions in which a gene is transcribed are
mainly encoded in the DNA, in a region called the
promoter
- Each promoter contains several short DNA
subsequences, called binding sites (BSs), that
are bound by specific proteins called
transcription factors (TFs)
5Regulation of Transcription (II)
6Regulation of Transcription (III)
- By binding to a gene's promoter, TFs can either
promote or repress the recruitment of the
transcription machinery
- The conditions in which a gene is transcribed are
determined by the specific combination of BSs in
its promoter
7Regulation of Transcription (IV)
- Assumption:
- Co-expression → Transcriptional co-regulation → Common BSs
8DNA chips
- DNA chips → Data analysis (normalization,
clustering) → Co-expression
9WH-questions
- So we know why we're looking for common BSs
- What exactly are we trying to find?
- Where should we look for it?
- How can we find it?
10Promoter Region (Where?)
- What is the promoter region?
- Upstream of the Transcription Start Site (TSS)
- Too short → miss many real BSs (false negatives)
- Too long → lots of wrong hits (false positives)
- Length is species dependent (e.g., yeast 600bp,
thousands in human)
- Common practice: 500-2000bp
- Mask out repetitive sequences?
- Common practice: Yes
- Consider both strands?
- Common practice: Yes
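The choices above (window length, masking, both strands) can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the 0-based TSS index, and the use of lowercase letters to mark repeats (soft-masking) are all assumptions made for the example.

```python
# Hypothetical helpers; names and coordinate conventions are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5'->3' (case is preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]


def hard_mask(seq: str) -> str:
    """Replace soft-masked (lowercase) repeat bases with N."""
    return "".join(base if base.isupper() else "N" for base in seq)


def upstream_region(chromosome: str, tss: int, length: int, strand: str = "+") -> str:
    """Up to `length` bases immediately upstream of a 0-based TSS index,
    read 5'->3' on the strand of the gene."""
    if strand == "+":
        return chromosome[max(0, tss - length):tss]
    # On the minus strand, "upstream" lies at higher coordinates.
    end = min(len(chromosome), tss + 1 + length)
    return reverse_complement(chromosome[tss + 1:end])


def both_strands(promoter: str) -> tuple[str, str]:
    """The promoter and its reverse complement, so a scan covers both strands."""
    return promoter, reverse_complement(promoter)
```

With `length` set to something in the 500-2000 range, `both_strands(hard_mask(upstream_region(...)))` gives the two sequences that a BS search would scan.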
11Promoter Region II
- Additional problems
- Where exactly is the TSS?
- What about 1st exon, intron?
- Multiple transcripts
- Answers actually depend on the TF
12The What? question
- Computational tasks
- New BSs of known TFs
- New motifs (BSs of unknown TFs)
- Modules: combinations of TFs
13BSs Models
- Exact string(s)
- Example:
- BS: TACACC, TACGGC
- CAATGCAGGATACACCGATCGGTA
- GGAGTACGGCAAGTCCCCATGTGA
- AGGCTGGACCAGACTCTACACCTA
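As a small sketch of the exact-string model, the example sequences above can be scanned for the two site strings. The function name `find_exact` is an assumption made for illustration.

```python
def find_exact(seq: str, sites: list[str]) -> list[tuple[int, str]]:
    """Return (0-based position, site) for every exact occurrence of any site."""
    hits = []
    for site in sites:
        start = seq.find(site)
        while start != -1:
            hits.append((start, site))
            # Continue one base further so overlapping occurrences are found.
            start = seq.find(site, start + 1)
    return sorted(hits)


SITES = ["TACACC", "TACGGC"]
for promoter in (
    "CAATGCAGGATACACCGATCGGTA",
    "GGAGTACGGCAAGTCCCCATGTGA",
    "AGGCTGGACCAGACTCTACACCTA",
):
    print(promoter, find_exact(promoter, SITES))
```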
14BSs Models (II)
- String with mismatches
- Example:
- BS: TACACC + 1 mismatch
- CAATGCAGGATTCACCGATCGGTA
- GGAGTACAGCAAGTCCCCATGTGA
- AGGCTGGACCAGACTCTACACCTA
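The mismatch model amounts to a sliding window with a Hamming-distance cutoff. A minimal sketch follows; the function names are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(seq: str, site: str, max_mismatches: int = 1):
    """Return (0-based position, window) for every window of len(site)
    that differs from `site` in at most `max_mismatches` positions."""
    k = len(site)
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if hamming(seq[i:i + k], site) <= max_mismatches
    ]


for promoter in (
    "CAATGCAGGATTCACCGATCGGTA",
    "GGAGTACAGCAAGTCCCCATGTGA",
    "AGGCTGGACCAGACTCTACACCTA",
):
    print(promoter, find_with_mismatches(promoter, "TACACC", 1))
```

Allowing more mismatches quickly increases the number of chance hits, which is why the cutoff is usually kept small relative to the site length.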
15BSs Models (III)
- Degenerate string
- Example:
- BS: TASDAC (S = C or G; D = A, G or T)
- CAATGCAGGATACAACGATCGGTA
- GGAGTAGTACAAGTCCCCATGTGA
- AGGCTGGACCAGACTCTACGACTA
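One simple way to search for a degenerate string is to translate it into a regular expression using the standard IUPAC nucleotide codes. The sketch below is illustrative only; the function names are assumptions.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_to_regex(motif: str) -> re.Pattern:
    """Compile a degenerate motif into a regex; the lookahead lets
    overlapping occurrences be reported as well."""
    body = "".join("[" + IUPAC[code] + "]" for code in motif)
    return re.compile("(?=(" + body + "))")


def find_degenerate(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (0-based position, matched string) for every occurrence."""
    pattern = degenerate_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


for promoter in (
    "CAATGCAGGATACAACGATCGGTA",
    "GGAGTAGTACAAGTCCCCATGTGA",
    "AGGCTGGACCAGACTCTACGACTA",
):
    print(promoter, find_degenerate(promoter, "TASDAC"))
```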
16BSs Models (IV)
- Position Weight Matrix (PWM)
- Example BS:
- Need to set a score threshold
- ATGCAGGATACACCGATCGGTA: score 0.0605
- GGAGTAGAGCAAGTCCCGTGA: score 0.0605
- AAGACTCTACAATTATGGCGT: score 0.0151
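A PWM scan can be sketched as follows. The matrix from the slide is not reproduced in this text, so the 3-column matrix below is purely hypothetical, and scoring a window by the product of its per-position probabilities is an assumption (it is one common convention, consistent with the small fractional scores shown above).

```python
# Hypothetical 3-column PWM: one dict of base probabilities per position.
# These numbers are illustrative and are NOT the matrix from the talk.
PWM = [
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
    {"A": 0.80, "C": 0.10, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.60, "G": 0.20, "T": 0.10},
]


def pwm_score(window: str, pwm: list[dict[str, float]]) -> float:
    """Probability of the window under the PWM (product over positions)."""
    score = 1.0
    for base, column in zip(window, pwm):
        score *= column[base]
    return score


def scan(seq: str, pwm: list[dict[str, float]], threshold: float):
    """Return (position, window, score) for every window scoring >= threshold."""
    k = len(pwm)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        score = pwm_score(window, pwm)
        if score >= threshold:
            hits.append((i, window, score))
    return hits


print(scan("GGTACG", PWM, threshold=0.1))
```

The threshold controls the trade-off between missed sites and chance hits: lowering it in the call above would start reporting weak windows such as `ACG`.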
17BSs Models (V)
- More complex models
- PWM with spacers (e.g., for p53)
- Markov model (dependency between adjacent columns
of PWM)
- Hybrid models, e.g., mixture of two PWMs
And we also need to model the non-BS sequences
in the promoters
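One common way to account for non-BS sequence is to score a window by its log-odds against a background model. The sketch below assumes the simplest background, independent base frequencies estimated from the promoters themselves; this is a stand-in chosen for brevity, not necessarily the background model used in the talk, and all names are illustrative.

```python
import math
from collections import Counter


def background_frequencies(promoters, pseudocount=1.0):
    """Estimate independent (order-0) base frequencies from promoter sequences."""
    counts = Counter()
    for seq in promoters:
        counts.update(base for base in seq if base in "ACGT")
    total = sum(counts[base] for base in "ACGT") + 4 * pseudocount
    return {base: (counts[base] + pseudocount) / total for base in "ACGT"}


def log_odds(window, pwm, background):
    """Sum over positions of log2(P(base | PWM column) / P(base | background)).
    Positive values mean the window looks more like a BS than like background.
    Assumes every PWM entry is non-zero (e.g., after adding pseudocounts)."""
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(window, pwm)
    )
```

A higher-order Markov background would replace `background[base]` with a probability conditioned on the preceding bases.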
18How to find novel motifs
- Degenerate string
- YMF (Sinha & Tompa 02)
- String with mismatches
- WINNOWER (Pevzner & Sze 00)
- Random Projections (Buhler & Tompa 02)
- MULTIPROFILER (Keich & Pevzner 02)
- PWM
- MEME (Bailey & Elkan 95)
- AlignACE (Hughes et al. 98)
- CONSENSUS (Hertz & Stormo 99)
19How to find TF modules
- BioProspector (Liu et al. 01)
- Co-Bind (GuhaThakurta & Stormo 01)
- MITRA (Eskin & Pevzner 02)
- CREME (Sharan et al. 03)
- MCAST (Bailey & Noble 03)
20PRIMA: PRomoter Integration in Microarray Analysis
- Goal: identify TFs whose BSs are abundant
(statistically over-represented) in promoters of
co-expressed genes
- Limited to known TFs
- Uses PWM to model BSs
- Allows multiple BSs per promoter
- Integrated into Expander
21PRIMA input-output
- Input:
- Promoter sequences (typically 1200bp) of:
- Background (BG) set, typically all genes
- Target set, i.e., co-expressed genes
- PWMs of known TFs, e.g., from TRANSFAC
- Output:
- p-values of over-represented TFs
22PRIMA algorithm
- For each PWM:
- Compute a threshold score for declaring hits of
the PWM (hit = subsequence that is similar to the
PWM, i.e., a hypothetical BS)
- Scan BG and target-set promoters for hits
- Apply a statistical test to decide whether the
number of hits in the target set is significantly
higher than expected by chance, given the
distribution of hits in the BG
- (Find co-occurring pairs of TFs)
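The statistical-test step can be illustrated with a deliberately simplified stand-in. The slides do not spell out the test, so the sketch below assumes a plain hypergeometric tail over promoters that contain at least one hit; it ignores multiple hits per promoter, which PRIMA allows. Function and parameter names are assumptions.

```python
from math import comb


def enrichment_p_value(bg_size: int, bg_with_hit: int,
                       target_size: int, target_with_hit: int) -> float:
    """Hypergeometric upper tail: the probability of seeing at least
    `target_with_hit` promoters with a hit when `target_size` promoters are
    drawn at random, without replacement, from a background of `bg_size`
    promoters of which `bg_with_hit` contain a hit.
    The target set is assumed to be a subset of the background."""
    total = comb(bg_size, target_size)
    upper = min(bg_with_hit, target_size)
    tail = sum(
        comb(bg_with_hit, i) * comb(bg_size - bg_with_hit, target_size - i)
        for i in range(target_with_hit, upper + 1)
    )
    return tail / total


# Toy numbers, for illustration only (not results from the talk).
print(enrichment_p_value(bg_size=10, bg_with_hit=5, target_size=5, target_with_hit=5))
```

Since one such test is run per PWM, the resulting p-values would normally be corrected for multiple testing before TFs are reported.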
23PRIMA results on HCC
- We ran PRIMA on 568 genes that are
periodically expressed in the human cell cycle
(data from Whitfield et al. 02)
24PRIMA results on HCC (II)
25PRIMA results on HCC (III)
- Co-occurring pairs of TFs
26PRIMA future directions
- More information to utilize:
- Distribution of hit locations
- Modules: co-occurrence of TFs, possibly with
distance and/or strand bias
- Homology: BSs are more conserved than the rest of
the promoter
27PRIMA in EXPANDER
28PRIMA in EXPANDER (II)
29Acknowledgements
- PRIMA
- Rani Elkon
- Roded Sharan
- Ron Shamir
- Yossi Shiloh
- Expander
- Adi Maron-Katz
- Amos Tanay
- Israel Steinfeld
- Naama Arbili
- Roded Sharan
30Questions?