1
Transcription Factor Motif Finding And Operon Prediction
  • Xiaole Shirley Liu
  • Biostatistics
  • 9/15/04

2
Imagine a Chef
3
Each Cell Is Like a Chef
4
Motivation
5
TF Finding Helps Understanding Transcription
Regulation
Upstream Regions Co-expressed Genes
GATGGCTGCACCACGTGTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTGGATCTTGT...AGAATGGCCTAT
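
The upstream fragments on this slide can be checked for a shared word with a few lines of code. This is only a minimal sketch: it looks for exact k-mers present in every sequence, which is far simpler than real motif finding (real sites vary between sequences). The function name `shared_kmers` and the choice k = 6 are assumptions for illustration, and only the 22-nt text before each "..." is used.

```python
def shared_kmers(seqs, k):
    """Return the set of k-mers that occur in every sequence."""
    common = None
    for s in seqs:
        kmers = {s[i:i + k] for i in range(len(s) - k + 1)}
        common = kmers if common is None else common & kmers
    return common or set()

# 22-nt fragments copied from the slide (text before the "..." only)
upstream = [
    "GATGGCTGCACCACGTGTATGC",
    "CACATCGCATCACGTGACCAGT",
    "GCCTCGCACGTGGTGGTACAGT",
    "TCTCGTTAGGACCATCACGTGA",
    "CGCTAGCCCACGTGGATCTTGT",
]
print(shared_kmers(upstream, 6))  # → {'CACGTG'}
```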
7
TF Finding Helps Understanding Transcription
Regulation
Upstream Regions Co-expressed Genes
GATGGCTGCACCACGTTTATGC...ACGATGTCTCGC
CACATCGCATCACGTGACCAGT...GACATGGACGGC
GCCTCGCACGTGGTGGTACAGT...AACATGACTAAA
TCTCGTTAGGACCATCACGTGA...ACAATGAGAGCG
CGCTAGCCCACGTTGATCTTGT...AGAATGGCCTAT
8
BioProspector
  • Finds sequence motif enriched in a group of
    sequences (upstream of co-expressed genes)
  • Adopts a Gibbs sampling strategy
  • Represents a TF motif with a probability matrix
  • http://BioProspector.stanford.edu/

Sites: ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT
ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
[Figure: Motif Matrix]
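
One way to turn the ten sites above into a probability matrix is to count the bases in each column and divide by the number of sites. The sketch below does exactly that; the function name `motif_matrix` is an assumption, and no pseudocounts are added, which a practical tool would likely need.

```python
from collections import Counter

def motif_matrix(sites, alphabet="ACGT"):
    """Per-position base frequencies for equal-length aligned sites."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must share one width"
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix

# The ten sites listed on the slide
sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]
for pos, column in enumerate(motif_matrix(sites), start=1):
    print(pos, column)
```

Position 1 comes out as 90% A and 10% T, since nine of the ten sites start with A.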
9
Gibbs Sampling
  • Randomly initialize a probability matrix

10
Gibbs Sampling
  • Take out one sequence with its sites from current
    motif

[Figure labels: a1', a2', a3', a4', ..., ak']
11
Gibbs Sampling
  • Score each possible segment of this sequence

[Figure: Sequence 1, segment (1-6), score 1.5; labels a2', a3', a4', ..., ak']
12
Gibbs Sampling
  • Score each possible segment of this sequence

[Figure: Sequence 1, segment (2-7), score 3; labels a2', a3', a4', ..., ak']
13
Gibbs Sampling
  • Sample a new segment to put the sequence back

[Figure labels: a2', a3', a4', ..., ak']
14
Gibbs Sampling
  • Repeat the process until motif converges

[Figure labels: a2', a3', a4', ..., ak']
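
The Gibbs sampling steps on slides 9-14 can be sketched as below. This is a simplified site sampler, not BioProspector itself: it assumes exactly one site per sequence, a uniform background (0.25 per base), sequences containing only A/C/G/T, and a fixed number of sweeps in place of a convergence test. The name `gibbs_motif_search` and all parameters are assumptions.

```python
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0, pseudo=1.0):
    """Site-sampler sketch: one site per sequence, uniform background."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Random initial site start in every sequence
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    def build_matrix(skip):
        # Base frequencies from all current sites except the held-out one
        cols = [{b: pseudo for b in bases} for _ in range(width)]
        for i, (s, a) in enumerate(zip(seqs, starts)):
            if i == skip:
                continue
            for j in range(width):
                cols[j][s[a + j]] += 1
        for col in cols:
            total = sum(col.values())
            for b in bases:
                col[b] /= total
        return cols

    for _ in range(iterations):
        for i, s in enumerate(seqs):
            matrix = build_matrix(skip=i)        # take one sequence out
            weights = []
            for a in range(len(s) - width + 1):  # score every segment
                w = 1.0
                for j in range(width):
                    w *= matrix[j][s[a + j]] / 0.25
                weights.append(w)
            # Sample a new site in proportion to its score, put it back
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_matrix(skip=None)
```

Sampling in proportion to the score, rather than always taking the best segment, is what lets the sampler escape a poor random start.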
15
Gibbs Sampler Intuition
  • Beginning
  • Randomly initialized motif
  • No preference towards any segment

16
Gibbs Sampler Intuition
  • Motif appears
  • Motif should have enriched signal (more sites)
  • By chance some correct sites come to alignment
  • Sites bias motif to attract other similar sites

17
Gibbs Sampler Intuition
  • Motif converges
  • All sites come to alignment
  • Motif totally biased to sample sites every time

18
ChIP-chip Experiments
  • Chromatin immunoprecipitation microarray
    (ChIP-on-chip)
  • Detects in vivo protein-DNA interaction

19
MDscan
  • Insights
      • High ChIP-chip enrichment implies true targets
      • Highest ChIP-chip enrichment implies more sites
  • Basic strategy
      • Search TF motif from highest ranking targets
        first (high signal / background ratio)
      • Refine candidate motifs with all targets
  • http://MDscan.stanford.edu/

20
MDscan Seeds
21
Many Seed Motifs
ATTGCAAAT TTTGCGAAT TTTGCAAAT
TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC
GCAAATCCA GCAAATTCG GCAAATCCA GGAAATCCA GGAAATCCT
TGCAAATCC TGCAAATTC
GCCACCGT ACCACCGT ACCACGGT GCCACGGC
CAAATCCAA CAAATCCAA GAAATCCAC
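
The transcript does not say how seeds are built. A common reading is that each w-mer in the top-ranked sequences is grouped with the other w-mers that agree with it in at least m positions. The sketch below follows that assumption; `matches`, `seed_groups` and the threshold of 7 are illustrative names and values, not MDscan's actual code or settings.

```python
def matches(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))

def seed_groups(top_seqs, width, min_match):
    """Group every w-mer in the top sequences with its similar w-mers."""
    words = [s[i:i + width] for s in top_seqs
             for i in range(len(s) - width + 1)]
    groups = {}
    for seed in set(words):
        groups[seed] = [w for w in words if matches(seed, w) >= min_match]
    return groups

# Four 9-mers from the slide; min_match=7 is a hypothetical threshold
words = ["TTGCAAATC", "TTGCGAATA", "TTGCAAATT", "TTGCCCATC"]
print([matches(words[0], w) for w in words])  # → [9, 7, 8, 7]
```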

22
MDscan Motif Refinement
Add/remove 9-mer from all ChIP-chip sequences
to optimize the scoring function
TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC
23
Motif Regressor Approach
  • Look at one expression experiment

MDscan
Expression log ratio
Genes
24
Motif Regressor Rationale
  • For each TF:

      Gene    Upstream seq motif match    Downstream gene expression
      Gene1   3.2                         1.8
      Gene2   2.8                         0.3
      Gene3

  • Upstream sequence X motif matching score
    measures
      • Number of sites
      • Strength of matching
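
The transcript does not give the scoring formula. One plausible form that reflects both the number of sites and the strength of matching is the log of the summed motif-to-background likelihood ratios over all segments of the upstream sequence. The sketch below assumes that form, a uniform background by default, and a hypothetical 3-column matrix.

```python
import math

def match_score(seq, matrix, background=None):
    """log2 of summed motif/background likelihood ratios over all segments."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(matrix)
    total = 0.0
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j, base in enumerate(seq[start:start + width]):
            ratio *= matrix[j][base] / background[base]
        total += ratio
    return math.log2(total) if total > 0 else float("-inf")

# Hypothetical motif that prefers A, then C, then G
toy_matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
for seq in ("ACGACG", "ACGTTT", "TTTTTT"):
    print(seq, round(match_score(seq, toy_matrix), 2))
```

Two good sites score higher than one, and one good site scores higher than none, which is the behaviour the slide describes.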

25
Motif Regressor Strategy
  • Rank genes by log2 (expression fold change)
  • Try MDscan (width 5-15) on induced and repressed
    genes separately
  • Find 50 candidate motifs from top 100 genes
  • Refine candidate motifs with top 500 genes
  • Report < 30 distinct motifs
  • Score each upstream sequence with each motif
  • Linear regression to eliminate insignificant
    motifs
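
The regression step can be illustrated with a single-motif fit of expression (log2 fold change) on motif matching score. This is a minimal sketch with hypothetical numbers; it reports only slope, intercept and R-squared, and the significance test used to eliminate motifs is not shown because the transcript does not specify it.

```python
def simple_regression(x, y):
    """Least-squares slope, intercept and R^2 for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx            # raises ZeroDivisionError if x is constant
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2

# Hypothetical data: motif score per gene and its log2 expression change
motif_score = [3.2, 2.8, 0.5, 1.9, 0.1]
log2_expr = [1.8, 0.3, -0.2, 0.9, -0.4]
print(simple_regression(motif_score, log2_expr))
```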

26
Linear Regression Example
  Person  IQ   Age  Education  Height  Eye color  Spend/week  # of CD
  A       120  30   High       171     blue       4000        30
  B       250  41   PhD        155     brown      1500        18
  C       150  8    Grade10    115     black      100         90
  D       180  16   Grade12    140     gray       200         15
  E       90   4    Preschool  88      green      500         26
  F       130  17   High       178     black      80          500
  G       110  21   College    182     blue       800         220

  Gene               Express  Mtf1  Mtf2  Mtf3  Mtf4  Mtf5  Mtf6
  Single regression           X     X     X     --    --    --

http://www.techtransfer.harvard.edu/Software/MotifRegressor/
27
Motif Regressor on Expression Profile
V. cholerae Rugose (virulent)
V. cholerae Smooth (non-virulent)
vpsR-
hapR-
28
Motif Regressor on Expression Profile
V. cholerae Rugose (virulent)
V. cholerae Smooth (non-virulent)
vpsR-
hapR-
  • Many mutation-induced phase-transition microarray
    experiments
  • Found one strong novel motif
  • Binding site found in front of a transcription
    factor gene X
  • X mutant also causes smooth to rugose transition

29
Stepwise Regression Example
  Person  IQ   Age  Education  Height  Eye color  Spend/week  # of CD
  A       120  30   High       171     blue       4000        30
  B       250  41   PhD        155     brown      1500        18
  C       150  8    Grade10    115     black      100         90
  D       180  16   Grade12    140     gray       200         15
  E       90   4    Preschool  88      green      500         26
  F       130  17   High       178     black      80          500
  G       110  21   College    182     blue       800         220

  Gene                 Express  Mtf1  Mtf2  Mtf3  Mtf4  Mtf5  Mtf6
  Single regression             X     X     X     --    --    --
  Stepwise regression           2     1     --

http://www.techtransfer.harvard.edu/Software/MotifRegressor/
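
Stepwise regression can be sketched as forward selection: add, one at a time, the motif that most reduces the residual sum of squares, and stop when the gain is small. The version below is a simplified stand-in, not Motif Regressor's procedure: it uses a relative-gain cut-off (`min_gain`, a hypothetical parameter) instead of an F-test or AIC, never removes a variable, and will fail on collinear features.

```python
def solve(a, b):
    """Solve a small linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def forward_stepwise(features, y, min_gain=0.05):
    """Add the feature that most reduces RSS until the relative gain is small."""
    chosen, current = [], rss([], y)
    while len(chosen) < len(features):
        best = min((n for n in features if n not in chosen),
                   key=lambda n: rss([features[c] for c in chosen + [n]], y))
        new = rss([features[c] for c in chosen + [best]], y)
        if current == 0 or (current - new) / current < min_gain:
            break
        chosen.append(best)
        current = new
    return chosen
```

`features` maps a motif name to its per-gene scores and `y` holds the expression values; the returned list gives the motifs in the order they were added.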
30
Amino Acid Starvation
  • Slow cell growth
      • M3A
      • M3B
      • RAP1
  • Deal with stress
      • STRE
      • URS1
  • Nutrient biosynthesis
      • MET4: sulfur
      • PHO4: phosphate
      • GCN4: amino acids
  • Multiple regression R-sq = 0.198

31
CompareProspector Strategy
  • Functional regulatory elements are more conserved
  • Gibbs sampling is biased toward conserved sequences
  • http://compareprospector.stanford.edu/

21-bp window
Human Seq
Mouse Seq
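
The 21-bp window on this slide suggests measuring conservation as the fraction of identical bases in a sliding window over aligned human and mouse sequences. The sketch below assumes the two sequences are already aligned to equal length with "-" for gaps; how CompareProspector actually computes or uses conservation is not stated in the transcript.

```python
def window_identity(human, mouse, window=21):
    """Fraction of identical positions in each sliding window of an alignment."""
    assert len(human) == len(mouse), "sequences must be aligned to equal length"
    # A column counts as conserved only if both bases match and are not gaps
    same = [h == m and h != "-" for h, m in zip(human, mouse)]
    return [sum(same[i:i + window]) / window
            for i in range(len(same) - window + 1)]
```

Windows with high identity could then be given more weight when sampling sites.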
32
Conclusion
  • BioProspector
      • Expression profile clusters
      • Consider each sequence equally
  • MDscan
      • ChIP-array targets
      • Separate the great from the good
  • Motif Regressor
      • Single microarray expression or ChIP-chip
        experiments
      • Weigh sequence by array value
      • Find motifs acting together
  • CompareProspector
      • For higher eukaryotes
      • Bias toward conservation

33
Bacterial operon prediction
  • Operons
      • Adjacent genes on the same strand
      • One mRNA, multiple proteins
      • Functionally related
          • Subunits of a protein complex
          • Enzymes in a common metabolic pathway
  • Motif finding and regulatory network analysis
    start with operon prediction.

34
Directons vs. transcriptional units vs. operons
  • Directon: a run of gene(s) on the same strand
  • Transcriptional unit (TU): gene(s) transcribed as
    a single mRNA
  • TU genes: genes belonging to the same TU
  • Operon: a transcriptional unit coding for multiple
    proteins
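
The directon definition translates directly into code: walk the genes in chromosome order and start a new group whenever the strand changes. This is a minimal sketch with hypothetical gene names; it treats the chromosome as linear and ignores wrap-around on a circular genome.

```python
from itertools import groupby

def directons(genes):
    """Split genes (in chromosome order) into runs on the same strand."""
    return [[name for name, _ in run]
            for _, run in groupby(genes, key=lambda g: g[1])]

# Hypothetical genes as (name, strand) pairs
genes = [("g1", "+"), ("g2", "+"), ("g3", "-"),
         ("g4", "-"), ("g5", "-"), ("g6", "+")]
print(directons(genes))
# → [['g1', 'g2'], ['g3', 'g4', 'g5'], ['g6']]
```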

35
Homology predictions (Ermolaeva et al., 2001)
  • Rationale
      • Gene neighborhood conservation is a result of natural
        selection
      • Genes with complementary functions organized in
        an operon benefit from co-regulation

36
Conserved directon pairs predict transcriptional
units
CC
SS
DD
SS
37
Homology: high-confidence predictions in V. cholerae
p = SS / (SS + DD + CC + SD + SC)
  • TU gene criteria
      • p > 0.95
      • conserved pairs in at least 3 genomes
  • 654 TU-gene neighbors in V. cholerae
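
Taking p as the share of SS among all five pair counts (SS, DD, CC, SD, SC), which is an assumed reading of the formula above, the two TU gene criteria can be applied as follows. The function names and the counts are hypothetical.

```python
def same_tu_probability(counts):
    """Share of SS among the five pair categories."""
    total = sum(counts[k] for k in ("SS", "DD", "CC", "SD", "SC"))
    return counts["SS"] / total if total else 0.0

def is_high_confidence(counts, n_genomes, p_min=0.95, min_genomes=3):
    """Apply both criteria: p above p_min and enough genomes with the pair."""
    return same_tu_probability(counts) > p_min and n_genomes >= min_genomes

# Hypothetical counts for one gene pair
counts = {"SS": 97, "DD": 1, "CC": 1, "SD": 1, "SC": 0}
print(same_tu_probability(counts), is_high_confidence(counts, n_genomes=3))
```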

38
TU-gene neighbors are close
39
Intergenic distances predict operons
[Figure: intergenic distance (nt) for homology-predicted TU gene pairs and for all same-stranded neighboring pairs]
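
Intergenic distance for same-stranded neighbors can be computed from gene coordinates. The sketch below assumes genes sorted by start position with 1-based inclusive coordinates, so overlapping genes give a negative distance; the tuple layout and names are hypothetical.

```python
def intergenic_distances(genes):
    """Distance (nt) between consecutive same-strand genes; negative = overlap."""
    out = []
    for (n1, s1, e1, st1), (n2, s2, e2, st2) in zip(genes, genes[1:]):
        if st1 == st2:
            out.append((n1, n2, s2 - e1 - 1))
    return out

# Hypothetical genes as (name, start, end, strand)
genes = [("a", 1, 100, "+"), ("b", 111, 200, "+"),
         ("c", 190, 300, "+"), ("d", 400, 500, "-")]
print(intergenic_distances(genes))
# → [('a', 'b', 10), ('b', 'c', -11)]
```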
40
cDNA Array Data
  • Single mRNA, so genes in the same operon should have
    highly correlated expression
  • 40 cDNA microarray experiments
  • Pearson correlation between genes in close
    proximity
  • We picked two populations of neighboring genes to
    estimate +/- sample correlation
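
The Pearson correlation between two genes' expression profiles across the array experiments can be computed as below. This is the standard formula; the expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log ratios for two neighboring genes over five arrays
gene_a = [0.5, 1.2, -0.3, 2.0, 0.1]
gene_b = [0.4, 1.0, -0.5, 1.8, 0.3]
print(round(pearson(gene_a, gene_b), 3))
```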

41
Correlation densities
Close, -10 to 20, same-stranded neighbors
Convergent opposite-stranded neighbors
42
Multiple correlations are informative
43
Evaluation of predictions
  • EcoCyc has the gold standard for E. coli operon
    prediction
  • Our E. coli prediction has 80% accuracy
  • Similar to other prediction methods
  • But we did not use the E. coli gold standard to
    tune our parameters
  • Should work better on other organisms
  • Performance depends on the array data

44
Pilus biosynthesis proteins (V. cholerae)
  • VC0827  pilus biosynthesis protein H       -15  0.98  0.91
  • VC0828  toxin co-regulated pilin (tcpA)    543  0.92  0.92
  • VC0829  pilus biosynthesis protein B        80  0.99  0.93
  • VC0830  pilus biosynthesis protein Q         0  0.68  0.72
  • VC0831  pilus biosynthesis outer             3  0.92  0.92
  • VC0832  pilus biosynthesis protein R        -7  0.63  0.88
  • VC0833  pilus biosynthesis protein D       -12  0.97  0.60
  • VC0834  pilus biosynthesis protein S        16  0.99  0.98
  • VC0835  pilus biosynthesis protein T       -10  0.99  0.56
  • VC0836  pilus biosynthesis protein E       -25  0.65  0.99
  • VC0837  pilus biosynthesis protein F        10  0.47  0.96
  • VC0838  pilus virulence tcpN/toxT          207  0.81  0.84

Distance Expression
BioCyc
Homology
45
ABC transporter proteins (V. cholerae)
  • VC1091  oligopeptide ABC transporter  453  0.01  -0.03
  • VC1092  oligopeptide ABC transporter  149  0.89   0.17
  • VC1093  oligopeptide ABC transporter   14  0.96   0.89
  • VC1094  oligopeptide ABC transporter   25  0.97   0.97
  • VC1095  oligopeptide ABC transporter  -21  0.97   0.97

Distance Expression
BioCyc
Homology
46
Future Work
  • A stochastic operon model?
  • No strict operon boundaries
  • Have strong and weak operons