Title: Transcription Factor Motif Finding And Operon Prediction
1Transcription Factor Motif Finding And Operon Prediction
- Xiaole Shirley Liu
- Biostatistics
- 9/15/04
2Imagine a Chef
3Each Cell Is Like a Chef
4Motivation
5TF Finding Helps Understanding Transcription
Regulation
Upstream Regions Co-expressed Genes
GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT
7TF Finding Helps Understanding Transcription
Regulation
Upstream Regions Co-expressed Genes
GATGGCTGCACCACGTTTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT
8BioProspector
- Finds a sequence motif enriched in a group of sequences (upstream of co-expressed genes)
- Adopts a Gibbs sampling strategy
- Represents a TF motif with a probability matrix
- http://BioProspector.stanford.edu/
Sites ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT
ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
Motif Matrix
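The motif matrix for the ten sites listed above can be sketched as a per-position table of base frequencies. This is a minimal illustration, not BioProspector's own code; the function name `probability_matrix` and the optional `pseudocount` parameter are assumptions made for the example.

```python
from collections import Counter

# The ten example sites listed on the slide (all of width 8).
SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def probability_matrix(sites, pseudocount=0.0, alphabet="ACGT"):
    """Return one {base: probability} dict per motif position.

    `pseudocount` is a hypothetical smoothing parameter (0 = raw frequencies).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return matrix

matrix = probability_matrix(SITES)
for pos, column in enumerate(matrix, start=1):
    print(pos, " ".join(f"{b}={column[b]:.1f}" for b in "ACGT"))
```

Each column sums to 1; a strongly conserved position shows one base with probability near 1.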
9Gibbs Sampling
- Randomly initialize a probability matrix
10Gibbs Sampling
- Take out one sequence with its sites from current
motif
[Figure: sites a1', a2', a3', a4', ..., ak']
11Gibbs Sampling
- Score each possible segment of this sequence
Sequence 1
Segment (1-6) 1.5
[Figure: sites a2', a3', a4', ..., ak']
12Gibbs Sampling
- Score each possible segment of this sequence
Sequence 1
Segment (2-7) 3
[Figure: sites a2', a3', a4', ..., ak']
13Gibbs Sampling
- Sample a new segment to put the sequence back
[Figure: sites a2', a3', a4', ..., ak']
14Gibbs Sampling
- Repeat the process until motif converges
[Figure: sites a2', a3', a4', ..., ak']
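The Gibbs sampling steps on the preceding slides (random start, take one sequence out, score every segment, sample a new segment, repeat) can be sketched as follows. This is a simplified illustration, not BioProspector itself: it assumes exactly one site per sequence, a uniform background of 0.25 per base, and a fixed number of sweeps instead of a convergence test. All names and parameters are hypothetical.

```python
import random

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}

def gibbs_motif_search(seqs, width, iterations=200, pseudocount=1.0, seed=0):
    """Return one sampled site start per sequence (simplified site sampler)."""
    rng = random.Random(seed)
    # Random initialization: one arbitrary site per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    def build_matrix(skip):
        # Probability matrix from the current sites of all sequences except `skip`.
        rows = [[pseudocount] * 4 for _ in range(width)]
        for i, (seq, start) in enumerate(zip(seqs, starts)):
            if i == skip:
                continue
            for j in range(width):
                rows[j][INDEX[seq[start + j]]] += 1
        return [[c / sum(row) for c in row] for row in rows]

    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            matrix = build_matrix(skip=i)          # take sequence i out
            weights = []
            for start in range(len(seq) - width + 1):
                score = 1.0                         # motif / background likelihood ratio
                for j in range(width):
                    score *= matrix[j][INDEX[seq[start + j]]] / 0.25
                weights.append(score)
            # Sample (not maximize) a new segment in proportion to its score.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy = ["TTTTCACGTGTTTT", "GGCACGTGGGGGGG", "AAAAAACACGTGAA"]  # made-up sequences
print(gibbs_motif_search(toy, width=6, iterations=50, seed=1))
```

Sampling in proportion to the score, instead of always taking the best segment, is what lets the sampler escape a poor random start.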
15Gibbs Sampler Intuition
- Beginning
- Randomly initialized motif
- No preference towards any segment
16Gibbs Sampler Intuition
- Motif appears
- Motif should have enriched signal (more sites)
- By chance some correct sites come to alignment
- Sites bias motif to attract other similar sites
17Gibbs Sampler Intuition
- Motif converges
- All sites come to alignment
- Motif totally biased to sample sites every time
18ChIP-chip Experiments
- Chromatin immunoprecipitation microarray (ChIP-on-chip)
- Detects in vivo protein-DNA interaction
19MDscan
- Insights
- High ChIP-chip enrichment => true targets
- Highest ChIP-chip => contain more sites
- Basic strategy
- Search TF motif from highest ranking targets first (high signal / background ratio)
- Refine candidate motifs with all targets
- http://MDscan.stanford.edu/
20MDscan Seeds
21Many Seed Motifs
ATTGCAAAT TTTGCGAAT TTTGCAAAT
TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC
GCAAATCCA GCAAATTCG GCAAATCCA GGAAATCCA GGAAATCCT
TGCAAATCC TGCAAATTC
GCCACCGT ACCACCGT ACCACGGT GCCACGGC
CAAATCCAA CAAATCCAA GAAATCCAC
22MDscan Motif Refinement
Add/remove 9-mer from all ChIP-chip sequences
to optimize the scoring function
TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC
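The slides do not spell out how seed motifs are grouped. One way to picture it, sketched below under that caveat, is to take each w-mer as a seed and collect every other w-mer that agrees with it at a minimum number of positions. The function names and the `min_matches` threshold are assumptions for illustration, not MDscan's documented interface.

```python
def matching_positions(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))

def seed_groups(top_seqs, width, min_matches):
    """Map each distinct w-mer (seed) to all w-mers similar enough to it."""
    wmers = [seq[i:i + width]
             for seq in top_seqs
             for i in range(len(seq) - width + 1)]
    groups = {}
    for seed in dict.fromkeys(wmers):          # distinct seeds, original order
        groups[seed] = [w for w in wmers
                        if matching_positions(seed, w) >= min_matches]
    return groups

# The four 9-mers from the refinement slide, treated here as four short sequences.
words = ["TTGCAAATC", "TTGCGAATA", "TTGCAAATT", "TTGCCCATC"]
for seed, members in seed_groups(words, width=9, min_matches=7).items():
    print(seed, members)
```

Refinement would then add or remove w-mers from a group to improve a scoring function, which this sketch does not implement.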
23Motif Regressor Approach
- Look at one expression experiment
MDscan
Expression log ratio
Genes
24Motif Regressor Rationale
- For each TF
Gene    Upstream Seq Mtf Match    Downstream Gene Exp
Gene1   3.2                       1.8
Gene2   2.8                       0.3
Gene3
- Upstream sequence X motif matching score measures
- Number of sites
- Strength of matching
25Motif Regressor Strategy
- Rank genes by log2 (expression fold change)
- Try MDscan (width 5-15) on induced and repressed genes separately
- Find 50 candidate motifs from top 100 genes
- Refine candidate motifs with top 500 genes
- Report < 30 distinct motifs
- Score each upstream sequence with each motif
- Linear regression to eliminate insignificant
motifs
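The last two steps of the strategy (score each upstream sequence with a motif, then regress expression on the score) can be sketched as below. The scoring formula, a log2 of summed motif-to-background likelihood ratios over all windows, is an assumption chosen so that both the number of sites and the strength of matching raise the score; the function names and the uniform background are also hypothetical. The sketch reports R-squared only and does not run a significance test.

```python
import math

def motif_match_score(seq, matrix, background=0.25):
    """log2 of the summed motif/background likelihood ratio over all windows."""
    width = len(matrix)
    total = 0.0
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j in range(width):
            ratio *= matrix[j][seq[start + j]] / background
        total += ratio
    return math.log2(total) if total > 0 else float("-inf")

def simple_regression(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)

# Made-up two-position motif that only accepts "AA".
toy_matrix = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
print(motif_match_score("AAC", toy_matrix))
print(simple_regression([1, 2, 3], [2, 4, 6]))
```

In use, `x` would hold one motif's scores across genes and `y` the log2 expression ratios of the same genes.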
26Linear Regression Example
Person  IQ   Age  Education  Height  Eye color  Spend/week  # of CD
A       120  30   High       171     blue       4000        30
B       250  41   PhD        155     brown      1500        18
C       150  8    Grade10    115     black      100         90
D       180  16   Grade12    140     gray       200         15
E       90   4    Preschool  88      green      500         26
F       130  17   High       178     black      80          500
G       110  21   College    182     blue       800         220
- Gene Express Mtf1 Mtf2 Mtf3 Mtf4 Mtf5 Mtf6
- Single regression: X X X -- -- --
http://www.techtransfer.harvard.edu/Software/MotifRegressor/
27Motif Regressor on Expression Profile
V. cholerae Rugose (virulent)
V. cholerae Smooth (non-virulent)
vpsR-
hapR-
- Many mutation-induced phase-transition microarray experiments
- Found one strong novel motif
- Binding site found in front of a transcription factor gene X
- X mutant also causes smooth to rugose transition
29Stepwise Regression Example
Person  IQ   Age  Education  Height  Eye color  Spend/week  # of CD
A       120  30   High       171     blue       4000        30
B       250  41   PhD        155     brown      1500        18
C       150  8    Grade10    115     black      100         90
D       180  16   Grade12    140     gray       200         15
E       90   4    Preschool  88      green      500         26
F       130  17   High       178     black      80          500
G       110  21   College    182     blue       800         220
- Gene Express Mtf1 Mtf2 Mtf3 Mtf4 Mtf5 Mtf6
- Single regression: X X X -- -- --
- Stepwise regression: 2 1 --
http://www.techtransfer.harvard.edu/Software/MotifRegressor/
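Stepwise regression keeps only predictors that still help once the others are in the model. A minimal forward-selection sketch follows. It is an illustration rather than Motif Regressor's procedure: it uses the fractional drop in residual sum of squares as the entry criterion (a stand-in for the usual F-test or p-value), and all names, the `min_gain` threshold, and the toy data are assumptions.

```python
def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(columns, y):
    """Residual sum of squares of y ~ intercept + columns (least squares)."""
    n = len(y)
    design = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * x for b, x in zip(beta, design[i]))) ** 2
               for i in range(n))

def forward_stepwise(features, y, min_gain=0.01):
    """Greedily add the feature that most reduces RSS until the gain is small."""
    chosen, total = [], rss([], y)
    current = total
    while len(chosen) < len(features):
        name, best = min(((f, rss([features[c] for c in chosen + [f]], y))
                          for f in features if f not in chosen),
                         key=lambda item: item[1])
        if (current - best) / total < min_gain:
            break
        chosen.append(name)
        current = best
    return chosen

# Made-up data: expression depends on motifs m1 and m2 only.
features = {"m1": [0, 1, 2, 3, 4, 5], "m2": [1, 0, 1, 0, 1, 0],
            "m3": [2, 1, 3, 1, 2, 1]}
expression = [3 * a + 2 * b for a, b in zip(features["m1"], features["m2"])]
print(forward_stepwise(features, expression))
```

The solver assumes the selected columns are not collinear; a singular design would raise a division error.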
30Amino Acid Starvation
- Slow cell growth
- M3A
- M3B
- RAP1
- Deal with stress
- STRE
- URS1
- Nutrient biosynthesis
- MET4 sulfur
- PHO4 phosphate
- GCN4 amino acids
- Multiple regression R-sq = 0.198
31CompareProspector Strategy
- Functional regulatory elements are more conserved
- Gibbs sampling is biased toward conserved sequences
- http://compareprospector.stanford.edu/
21-bp window
Human Seq
Mouse Seq
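The 21-bp window suggests a sliding-window conservation measure between the human and mouse sequences. A minimal sketch, assuming the two sequences are already aligned to equal length with "-" for gaps, and using percent identity per window (the actual conservation measure is not stated on the slide):

```python
def window_identity(human, mouse, window=21):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(human) != len(mouse):
        raise ValueError("sequences must be aligned to the same length")
    matches = [h == m and h != "-" for h, m in zip(human, mouse)]
    return [sum(matches[i:i + window]) / window
            for i in range(len(matches) - window + 1)]

# Made-up aligned pair: identical except for the last column.
human = "ACGTACGTACGTACGTACGTA"
mouse = "ACGTACGTACGTACGTACGTC"
print(window_identity(human, mouse))
```

Windows with high identity could then be given more weight when the sampler chooses candidate sites.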
32Conclusion
- BioProspector
- Expression profile clusters
- Consider each sequence equally
- MDscan
- ChIP-array targets
- Separate the great from the good
- Motif Regressor
- Single microarray expression or ChIP-chip experiments
- Weigh sequence by array value
- Find motifs acting together
- CompareProspector
- For higher eukaryotes
- Bias toward conservation
33Bacteria operon prediction
- Operons
- Adjacent genes on the same strand
- One mRNA, multiple proteins
- Functionally related
- Subunits of a protein complex.
- Enzymes in a common metabolic pathway
- Motif finding and regulatory network analysis
start with operon prediction.
34Directons vs. transcriptional units vs. operons
- Directon: a run of gene(s) on the same strand
- Transcriptional unit (TU): gene(s) transcribed as a single mRNA
- TU genes: genes belonging to the same TU
- Operon: transcriptional unit coding for multiple proteins
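The directon definition above translates directly into code: walk the genes in chromosome order and start a new group whenever the strand changes. The gene names and the `(name, strand)` tuple layout below are made up for illustration.

```python
from itertools import groupby

def directons(genes):
    """Split genes, given in chromosome order as (name, strand), into directons."""
    return [[name for name, _ in run]
            for _, run in groupby(genes, key=lambda gene: gene[1])]

genes = [("geneA", "+"), ("geneB", "+"), ("geneC", "-"), ("geneD", "+")]
print(directons(genes))  # [['geneA', 'geneB'], ['geneC'], ['geneD']]
```

A transcriptional unit can never span two directons, so operon prediction only needs to decide where to cut inside each directon.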
35Homology predictions (Ermolaeva et al., 2001)
- Rationale
- Gene neighborhood conservation: result of natural selection
- Genes with complementary functions organized in an operon benefit from co-regulation
36Conserved directon pairs predict transcriptional units
[Figure: conserved gene pairs labeled CC, SS, DD, SS]
37Homology high-confidence predictions in V. cholerae
p = SS / (SS + DD + CC + SD + SC)
- TU gene criteria
- p > 0.95
- conserved pairs in at least 3 genomes
- 654 TU-gene neighbors in V. cholerae
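The confidence criterion can be sketched as a small function. The slide does not define SS, DD, CC, SD and SC; the sketch treats them as counts of the corresponding gene-pair configurations, which is an assumption, as are the function and parameter names.

```python
def tu_pair_probability(ss, dd, cc, sd, sc):
    """p = SS / (SS + DD + CC + SD + SC), with the five terms taken as counts."""
    total = ss + dd + cc + sd + sc
    if total == 0:
        raise ValueError("no observations for this gene pair")
    return ss / total

def is_tu_gene_pair(p, genomes_with_conserved_pair):
    """Slide criteria: p > 0.95 and the pair conserved in at least 3 genomes."""
    return p > 0.95 and genomes_with_conserved_pair >= 3

p = tu_pair_probability(ss=97, dd=1, cc=1, sd=1, sc=0)  # made-up counts
print(p, is_tu_gene_pair(p, genomes_with_conserved_pair=4))
```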
38TU-gene neighbors are close
39Intergenic distances predict operons
[Figure: intergenic distance (nt); series: homology-predicted TU gene pairs, all same-stranded neighboring pairs]
40cDNA Array Data
- Single mRNA: genes in the same operon should have highly correlated expression
- 40 cDNA microarray experiments
- Pearson correlation between genes in close proximity
- We picked two populations of neighboring genes to estimate +/- sample correlation
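The Pearson correlation between neighboring genes across array experiments can be computed as follows. The gene names and expression values are made up, and the helper `neighbor_correlations` is a hypothetical name for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def neighbor_correlations(profiles, gene_order):
    """Correlation for each pair of adjacent genes in chromosome order."""
    return [(a, b, pearson(profiles[a], profiles[b]))
            for a, b in zip(gene_order, gene_order[1:])]

profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
for a, b, r in neighbor_correlations(profiles, ["g1", "g2", "g3"]):
    print(a, b, round(r, 3))
```

With 40 experiments per gene, each profile would be a list of 40 values instead of the four shown here.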
41Correlation densities
Close, -10~20 same stranded neighbors
Convergent opposite stranded neighbors
42Multiple correlations are informative
43Evaluation of predictions
- EcoCyc has the gold standard for E. coli operon prediction
- Our E. coli prediction has 80% accuracy
- Similar to other prediction methods
- But we did not use the E. coli gold standard to tune our parameters
- Should work better on other organisms
- Performance depends on the array data
44Pilus biosynthesis proteins (V. cholerae)
Gene    Annotation                        Values
VC0827  pilus biosynthesis protein H      - 15  0.98  0.91
VC0828  toxin co-regulated pilin (tcpA)   543   0.92  0.92
VC0829  pilus biosynthesis protein B      80    0.99  0.93
VC0830  pilus biosynthesis protein Q      0     0.68  0.72
VC0831  pilus biosynthesis outer          3     0.92  0.92
VC0832  pilus biosynthesis protein R      - 7   0.63  0.88
VC0833  pilus biosynthesis protein D      - 12  0.97  0.60
VC0834  pilus biosynthesis protein S      16    0.99  0.98
VC0835  pilus biosynthesis protein T      - 10  0.99  0.56
VC0836  pilus biosynthesis protein E      - 25  0.65  0.99
VC0837  pilus biosynthesis protein F      10    0.47  0.96
VC0838  pilus virulence tcpN/toxT         207   0.81  0.84
Labels: Distance, Expression, BioCyc, Homology
45ABC transporter proteins (V. cholerae)
Gene    Annotation                     Values
VC1091  oligopeptide ABC transporter   453  0.01  -0.03
VC1092  oligopeptide ABC transporter   149  0.89  0.17
VC1093  oligopeptide ABC transporter   14   0.96  0.89
VC1094  oligopeptide ABC transporter   25   0.97  0.97
VC1095  oligopeptide ABC transporter   -21  0.97  0.97
Labels: Distance, Expression, BioCyc, Homology
46Future Work
- A stochastic operon model?
- No strict operon boundaries
- Have strong and weak operons