Comparative genomics of 12 Drosophila genomes

About This Presentation
Title:

Comparative genomics of 12 Drosophila genomes

Description:

Comparative genomics of 12 Drosophila genomes. Bill Gelbart (in the role of Manolis Kellis), Broad Institute of MIT and Harvard, MIT Computer Science and Artificial ...

Slides: 68
Provided by: manolis
Learn more at: https://compbio.mit.edu

Transcript and Presenter's Notes

Title: Comparative genomics of 12 Drosophila genomes


1
Comparative genomics of 12 Drosophila genomes
  • Bill Gelbart
  • (in the role of Manolis Kellis)

Broad Institute of MIT and Harvard
MIT Computer Science and Artificial Intelligence
Laboratory
2
Fly comparative genomics
Complexity of a bilateral animal
  • Developmental principles, signaling pathways, kernels
  • Enhancers, splicing, microRNAs
Genetics of a small eukaryote
  • 100 years of classical genetics
  • Systematic experiments, RNAi, in-situ visualization
Comparative genomics power
  • Richer than for any other eukaryote
  • 2.1 substitutions per site
  • Multiple sets of close relatives
3
Fly comparative genomics
Matt Rasmussen Poster 267
  • Extensive conservation of gene order
  • Genome-wide alignments spanning entire genus

4
Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
5
  • Part 1. Gene Identification
  • Evolutionary signatures of genes
  • Revisiting the fly genome
  • Unusual gene structures

6
Distinguishing genes from non-coding regions
Splice
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

  • Protein-coding genes have specific evolutionary
    constraints
  • Gaps are multiples of three (preserve amino acid
    translation)
  • Mutations are largely 3-periodic (silent codon
    substitutions)
  • Specific triplets exchanged more frequently
    (conservative substs.)
  • Conservation boundaries are sharp (pinpoint
    individual splicing signals)
  • → Encode as evolutionary signatures
  • Computational test for each of them
  • Combine and score systematically
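One of the constraints above, that gaps in coding regions tend to be multiples of three, can be turned into a simple computational test. The following is a minimal sketch only: the function names and the two toy sequences are illustrative assumptions, not part of the talk, and the real reading frame conservation test is not specified on these slides.

```python
import re

def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seq):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    lengths = gap_lengths(aligned_seq)
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Toy rows (hypothetical): one with codon-sized gaps, one with frame-breaking gaps.
coding_like = "ATGGCC---AAATTT------GGA"
noncoding_like = "ATGG-CCAAAT--TTGGA-----C"

print(gap_lengths(coding_like), frame_preserving_fraction(coding_like))
print(gap_lengths(noncoding_like), frame_preserving_fraction(noncoding_like))
```

A row whose gaps all preserve the frame scores 1.0; a row with 1-, 2-, and 5-base gaps scores 0.0.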

7
Gene identification
Study known genes
Derive conservation rules
Discover new genes
8
Signature 1: Reading frame conservation
9
Signature 2: Codon substitution patterns
Genes
Codon observed in species 1
Codon observed in species 2
  • Codon substitution patterns specific to genes
  • Genetic code dictates substitution patterns
  • Amino acid properties dictate substitution
    patterns
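As a rough illustration of how codon substitution patterns could be tabulated and scored, the sketch below counts aligned codon pairs and computes a smoothed log-odds score for one pair. Everything here is an assumption made for illustration: the function names, the add-one smoothing, and the toy sequences. The slides do not state how the actual Codon Substitution Matrix is estimated.

```python
import math
from collections import Counter

def codon_pairs(seq1, seq2):
    """Yield aligned codon pairs from two in-frame aligned sequences, skipping gapped codons."""
    n = min(len(seq1), len(seq2)) // 3 * 3
    for i in range(0, n, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if "-" not in c1 and "-" not in c2:
            yield c1, c2

def substitution_counts(alignments):
    """Count (codon in species 1, codon in species 2) pairs over many pairwise alignments."""
    counts = Counter()
    for s1, s2 in alignments:
        counts.update(codon_pairs(s1, s2))
    return counts

def log_odds(coding_counts, noncoding_counts, pair, pseudocount=1.0):
    """log P(pair | coding) / P(pair | non-coding), smoothed over all 64 x 64 pairs."""
    n_pairs = 64 * 64
    p_coding = (coding_counts[pair] + pseudocount) / (
        sum(coding_counts.values()) + pseudocount * n_pairs)
    p_noncoding = (noncoding_counts[pair] + pseudocount) / (
        sum(noncoding_counts.values()) + pseudocount * n_pairs)
    return math.log(p_coding / p_noncoding)

# Toy training data (hypothetical): one coding alignment, no non-coding data.
coding = substitution_counts([("ATGGCTAAA", "ATGGCCAAG")])
noncoding = substitution_counts([])
score = log_odds(coding, noncoding, ("GCT", "GCC"))
```

A pair seen more often in coding alignments than in non-coding ones gets a positive score, which is the sense in which a substitution is "protein-like".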

10
Codon Substitution Matrix (CSM)
11
Signature 3: Spectral analysis of DNA sequence
Genes
Intergenic
  • Reveal frequency spectra due to non-uniform codon
    usage
  • Extend to multiple genomes
  • Initial results (single genome, single spectrum)
  • Fly: Sensitivity 85%, Specificity 87%, Accuracy 93%
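The frequency spectrum mentioned above is commonly dominated, in coding sequence, by a component of period three. A minimal sketch of measuring that component for a single sequence follows. It assumes the standard indicator-sequence Fourier approach; the function name and normalization are illustrative choices, and the slides do not say which spectral method was actually used.

```python
import cmath

def period3_power(seq):
    """Power of the period-3 Fourier component, summed over the four bases."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base at frequency 1/3.
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n

strongly_periodic = period3_power("ACG" * 10)   # perfectly 3-periodic toy sequence
not_periodic = period3_power("ACGT" * 9)        # period 4, no period-3 signal
```

In practice the score would be computed in a sliding window and compared between annotated genes and intergenic regions.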

12
Signatures 4, 5, 6, 7, etc
real exon
ISEs
ISEs
donor site
acceptor site
ESEs
  • Mutation patterns of splicing signals
  • Real splice acceptor/donor evolve in specific
    ways
  • Evolution of other motifs associated with
    splicing
  • Exonic/Intronic Splicing Enhancers/Silencers
    (ESE,ESI)
  • Density of motif clouds surrounding real exons
  • Sharp conservation boundaries
  • Relative conservation exon vs. surrounding
    regions
  • Length of longest open reading frame
  • Frequency of stop codons in each frame / each
    species
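The last signal in this list, the frequency of stop codons in each frame, is simple to compute per species. The sketch below is illustrative only: the function name is hypothetical, and gaps are simply removed before counting, which is one of several possible conventions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_counts_by_frame(seq):
    """Count stop codons in each of the three forward reading frames."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps before reading codons
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(1 for codon in codons if codon in STOP_CODONS))
    return counts

print(stop_counts_by_frame("ATGAAATAGCC"))
```

A real exon is expected to show one frame that stays free of stops in every species, while the other two frames accumulate them.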

13
Putting it all together: CONGO gene finder
  • CONGO gene finder based on Conditional Random
    Fields (CRFs)
  • Hidden Markov Models (HMMs)
  • Generative model, learn emission, transition
    probabilities
  • Easy to train, hard to integrate long-range
    signals
  • Conditional Random Fields (CRFs)
  • Discriminative dual of HMMs, learn weights on
    features
  • Easy to integrate diverse signals, gradient
    ascent for training
  • Features of the model
  • Train each evolutionary signature as a feature
  • Train single-species signals
  • Apply it systematically to revisit Drosophila
    genome
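To make the idea of "weights on features" concrete, here is a minimal sketch of decoding with a linear-chain model: each position carries a per-label score (which would be a weighted sum of signature features), each label change carries a transition score, and the best labeling is found by dynamic programming. This is not the CONGO implementation; the labels, scores, and function name are hypothetical, and training by gradient ascent is not shown.

```python
def viterbi(position_scores, transition_scores, labels):
    """Return the highest-scoring label sequence for a linear-chain model.

    position_scores: list of dicts, label -> score at that position.
    transition_scores: dict, (previous label, current label) -> score.
    """
    best = {lab: position_scores[0][lab] for lab in labels}
    backpointers = []
    for scores in position_scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition_scores[(p, cur)])
            new_best[cur] = best[prev] + transition_scores[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(labels, key=lambda lab: best[lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical scores: the two middle positions look coding, the ends do not.
labels = ["exon", "intergenic"]
position_scores = [
    {"exon": -1.0, "intergenic": 1.0},
    {"exon": 2.0, "intergenic": 0.0},
    {"exon": 2.0, "intergenic": 0.0},
    {"exon": -1.0, "intergenic": 1.0},
]
transition_scores = {(a, b): (0.0 if a == b else -1.5) for a in labels for b in labels}
path = viterbi(position_scores, transition_scores, labels)
```

Because the position scores are arbitrary functions of the input, signals such as reading frame conservation or codon substitution scores can be added as extra terms without changing the decoder.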

14
  • Part 1. Gene Identification
  • Evolutionary signatures of genes
  • Revisiting the fly genome
  • Unusual gene structures

15
Revisiting Drosophila annotation
D. melanog.
D. simulans
D. erecta
D. persimilis
579 fully rejected
1,454 exons (800 genes)
668 exons in 443 genes
10,845 fully confirmed
2,499 not aligned
  • Fully rejected genes: weak/no evidence
  • New exons: existing and novel experimental evidence
  • Large-scale functional annotation for novel genes

16
Example 1: Known genes stand out
Sharp conservation boundaries. Known exons
stand out. High sensitivity and specificity.
conserved
substitution
insertion
frameshift
gap
17
Example 2: Novel multi-exon gene
  • 1,454 novel exons
  • outside known genes
  • Many cluster in new multi-exon genes
  • Others are isolated high-confidence exons

18
Novel genes and exons
  • 1,454 novel exons outside existing genes
  • 60% cluster in 300 multi-exon genes
  • 40% are isolated exons
  • 668 novel exons inside existing genes
  • Alternative splicing: many with cDNA support
  • Nested genes: few known examples
  • Human curation
  • Collaboration with FlyBase
  • Hundreds of changes in release 5.1, more in 5.2
  • Systematic experimentation
  • Sue Celniker and Berkeley Genome Project
  • Thousands of new genes in the pipeline

19
Example 3: Dubious single-exon gene
  • Only evidence was an open reading frame
  • Comparative information much stronger

20
579 Dubious Genes
  • Classification approach: yes/no answer
  • Closely related species: both genes and
    intergenic regions are aligned
  • Show very different patterns of mutation
  • Comparative analysis provides negative evidence
  • Alignment is unambiguous, orthologous, spans
    entire gene
  • Sequence shows mutations and indels in every
    species
  • Weak or missing evidence in D. melanogaster
  • 100 of these independently rejected by FlyBase
  • These are missing from systematic clone
    collections
  • Only 34 (6%) have assigned names (vs. 36% of all
    fly genes)

21
Example 4: Start codon adjustment
  • Codon substitution patterns suggest new start in
    200 genes
  • Score each substitution using Codon Substitution
    Matrix (CSM)

Poor CSM score: atypical substitution
High CSM score: protein-like substitution
annotated start codon
conserved start codon
22
Example 5: Gene annotated in the wrong reading frame
  • cDNA evidence supports overlapping reading
    frames, both open
  • Annotation traditionally selects longer one
  • Conservation enables distinguishing the two

Shorter ORF is the correct one
mRNA supports both ORFs
Annotated ORF (345nt)
Real ORF (315nt)
Conservation only supports shorter ORF
CG7738-RA is incorrect
23
Example 6: Incorrect splice causes wrong frame
  • Second exon annotated in the wrong frame
  • Due to splice site boundary error
  • Correction is supported by cDNA evidence

First exon correct frame
2nd exon incorrect frame
Fix exon boundary
24
Fly Gene Reannotation
novel genes
novel exons
dubious genes
  • BDGP
  • iPCR
  • RT-PCR/RACE
  • FlyBase
  • manual curation
  • official annotations

25
  • Part 1. Genome interpretation
  • Evolutionary signatures of genes
  • Revisiting the fly genome
  • Unusual gene structures

26
Distinguishing protein-coding regions, in
absence of traditional signals
Traditional stop codons Gene-like conservation
stops sharp
droMel TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGG-----------------TCCAGGACTGGA-----GCAGAGAA
droSim TGCGATCGCTGCCGAAAGCCAATGGCACGTAAGCAGG-----------------CCCAGGACTAGG-----GCAGAGAA
droYak TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGGGT-----A-----AAAGACCAGGACCAGA-----GCAGCGGA
droEre TGCGATCGCTGCCGAAGGCCAATGGCACGTAAGCAGGGT-----G-----AAAGGCCAGGACCAGA-----GCAGCGGC
droAna TGCGGTCCCTGCCGAAGGCCAATGGCACGTAAAGTGTGTTGCCGA-----AGCGTCCGGAATCGGAA-TCTGAATTTGA
droPse TGCGGTCGCTGCCCAAGGCCAATGGAACGTAA-----ATTGCC-----------ACAAGGATGAATA-TCATCAAAGG-
droVir TGCGGTCGCTGCCAAAGGCCAACGGCACGTAAATGGGCCACGCCACACA-----------ACCCAAACTCATCATCTAT
droMoj TGCGACCGCTGCCAAAGGCCAATGGCACGTAAAGAGGGAAACAAACAAACAAACAGAAAAACAAAAACTCAAC----AT
droGri TGCGACCGCTGCCCAAGGCCAATGGCACGTAAA-TGGCATCCCACACAAAAAACAACAAAACAAAAAGAAAAATGGGTT


27
Unusual genes 1: Stop codon read-through
  • Method 1 (single exons)
  • 112 events, 95 extending known genes → manual
    curation: 82
  • Enriched in neuronal function
  • Method 2 (after splicing)
  • 256 events, looser cutoff, large overlap, needs
    manual curation
  • Enriched in transcription factors

Protein-coding conservation
Continued protein-coding conservation
No more conservation
Stop codon readthrough
2nd stop codon
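The region of interest in a read-through candidate is the stretch between the annotated stop codon and the next in-frame stop. A minimal sketch of locating it in a single sequence is shown below; the function names and the toy sequence are hypothetical, and scoring that region for protein-coding conservation across species (the step that actually identifies a candidate) is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def inframe_stops(seq):
    """Positions of stop codons in frame 0 of an ungapped DNA sequence."""
    seq = seq.upper()
    return [i for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]

def readthrough_region(seq):
    """Sequence between the first and second in-frame stop codons, or None."""
    stops = inframe_stops(seq)
    if len(stops) < 2:
        return None
    first, second = stops[0], stops[1]
    return seq.upper()[first + 3:second]

# Toy sequence: ATG GCC TGA GGA AAA TAA CCC
print(inframe_stops("ATGGCCTGAGGAAAATAACCC"))
print(readthrough_region("ATGGCCTGAGGAAAATAACCC"))
```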
28
Mechanisms for stop-codon read-through
  • Sequence-dependent inhibition of eRF binding

Interference from RNA secondary structure
Steric interactions from P-site tRNA
GAGGUGU
GAG GUG AGU UGA CACGAUGGAGAUC
1,2,3 nucleotides
29
micr
midlife-crisis, stem cell maintenance
166 AA
dm     AAAATTGACCCAGTTCCCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droSim AAAATTGACCCAGTTCCCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droYak AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droEre AAAATTGACCCAGTTCACCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGAACCAGTGTG---CTTAC
droAna AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACTCTGCCGCGCACCAGTGTGGCCCTCAC
droPse AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCCTGCCGCGCACCAGCGTA---CTGAC
droVir AAAATTGACCCAGTTCGCCGACGCTGCTAAAGATGTTCGAGACCACGCTGACTCTGCCGCGCACCAGCGTG---CTAAC
droMoj AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACCTTGCCGCGCACCAGCGTG---CTAAC
droGri AAAATTGACCCAGTTCGCCGACGCTGCTGAAGATGTTCGAGACCACGCTGACAATGCCGCGGAGCAGCGTG---CTAAC
frame  12012012012012012012012012012012012012012012012012012012012012012012012   01201


30
DopR
dopamine-receptor neurotransmitter
38 AA
24 AA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATATCGCGTGGTCA  AGACA---GCATTTTGGTCAAATCGCA
TGGTAG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA  AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA  AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CGGTGGCCGCCGTGACCGAATCTCATGGAACATACCGCGTGGTCA  AGACA---GCCTTTTGGTCAAATCGCA
TGGTGG---CCGTGGGTGCCGTGACCGAATCTCATGGAACGTACCGCGTGGTCA  AGGCATCGGCCTTTTGGTCAAATTGCA
TGGTGA---CCGTAGGCGCCGTGACCGAATCTCATGGAACATAGCGCGTGGTCA  ---TAGCCGCCTTTTGGTCAAATCGCA
TTGTTGTTGTTGGGGCCGCCGTGACCGAATCTCATGGAACGTATCTTGTGGTCA  ---CAACAACGTTTTGGTCAAATTGCG
TGGTCGTTGTGG---CCGCCGCGACCGAATCTCATGGAACGTACCTTGTGGTCA  ---CCGGTGCGTTTTGGTCAAATCGCA
TGGTTGTTGTGG---CCGCCGCGACCGAATCTCATGGAACGTAACGGTTGGTCA  ---CAGCAGCGTTTTGGTCAAATTGCA


31
Sequence context of read-through events
  • Sequence motifs

95 top candidates: 61 UGA, 15 UAG, 14 UAA, 5 mixed
32
Unusual genes 2: Polycistronic messages / uORFs
  • Method
  • High-scoring ORFs with cDNA evidence
  • Disjoint from the annotated ORF
  • Results
  • 217 cases

Protein-coding conservation in the 5′ UTR
33
Unusual genes 3: Frame-shift in the middle of exons
  • Method
  • Exons changing high-scoring frame
  • Far from splice junctions
  • Results
  • 68 cases in 44 genes

Frame 1 is high-scoring
Frame 2 is high-scoring
34
Part 1 summary: Gene identification
  • Signatures specific to protein-coding genes
  • Reading frame conservation
  • Codon substitution biases
  • Splicing-associated motifs
  • Combine in a discriminative framework
  • Support Vector Machine, Conditional Random Fields
  • Very high accuracy in yeast, fly, human
  • Signatures more informative than individual
    genome
  • Can doubt primary sequence of any one species
  • Identify and correct sequencing errors
  • New biological mechanisms: stop codon skipping

35
Scaling gene identification to 12 species
  • 12 species pose new challenges
  • Varying coverage, sequencing errors,
    misassemblies
  • Duplication, loss, divergence, misalignments
  • Current methods difficult to scale
  • Michael Brent writes: "No method for improving
    gene prediction accuracy by using multi-genome
    alignments has yet been found despite several
    serious efforts" (Genome Research 15:1777-1786;
    see poster 40)
  • Does performance continue to improve with more
    species?

36
Discriminative framework enables continued
increase in power
  • Reading frame conservation (RFC) score
  • Codon substitution matrix (CSM) score

[Figure: RFC and CSM score panels for 2, 3, 5, and 12 species. Extracted axis or label values: 70, 80, 30, 20, 90, 95, 10, 5.]
37
Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
38
Overview
  • Part 2. Motif identification
  • Evolutionary signatures for motif discovery
  • Functional roles of novel motifs
  • Scaling motif discovery to 12 species

39
Known motifs are preferentially conserved
Experimentally validated region, where
dmel AATGATTTGCCAGCTAGCCAACTCTCTAATTAGCGACTAAGTCCAAGTCAC . . .

dmel AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dsim AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAC
dyak AATGATTTGC----------------CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTCC-----------AAGTCAG
dere AATGGTTTGC----------------CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTCC-----------AAGTCAG
dana AATGATTTCCATTTCTCCCCACCCCCCACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAGACTCTAAGTCGG
dpse AAT--------TTTC-----------------------AGCCGTCTAATTAGTGGTGTTCTC------GGTTCTCAAT---


engrailed: TAATTA (Ades & Sauer, 1994)
  • Enough to discover motifs? Not really
  • Conservation not limited to exact binding site →
    Additional bases would be found
  • Weakly constrained positions can diverge → Real
    motifs will be missed
  • Experimental validation typically not available →
    How do we discover motifs de novo?
  • → Use basic property of regulatory motifs:
    multiple functional (selected) instances

40
Known motifs are frequently conserved
D. mel.
  • Across the fly genome, the engrailed motif
  • appears 8599 times
  • is conserved 1534 times
  • Statistical significance
  • 5 flies: conservation rate of random control
    motifs 2.8%
  • Engrailed enrichment: 6.8-fold (binomial
    P-value: 35 stdev)
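One common way to express "conserved far more often than random control motifs" as a number of standard deviations is a binomial z-score, sketched below. This is an assumption about the general form only. The slides do not give the exact null model behind the Motif Conservation Score, so this sketch should not be expected to reproduce the figures quoted here.

```python
import math

def binomial_z(total, conserved, background_rate):
    """Standard deviations by which `conserved` exceeds the binomial expectation."""
    expected = total * background_rate
    sd = math.sqrt(total * background_rate * (1.0 - background_rate))
    return (conserved - expected) / sd

# Engrailed counts from the slide, with the quoted random-motif rate as background.
z_engrailed = binomial_z(8599, 1534, 0.028)
```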

Motif Conservation Score (MCS)
41
Systematically evaluate candidate patterns
gap
G
T
C
R
Y
S
A
G
T
R
W
  • Enumerate
  • Length between 6 and 15 nt, allow central gap
  • 11 letter alphabet (A C G T, 2-fold codes, N)
  • Score
  • Compute binomial score (conserved vs. total)
  • Select MCS > 6.0 → specificity 97%
  • Collapsing
  • Sequence similarity
  • Overlapping occurrences
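Matching a candidate pattern written in the 11-letter alphabet against a genome can be sketched with a regular expression built from the standard IUPAC two-fold codes. The function names are hypothetical, and enumeration, central gaps, and conservation scoring are not shown. Note that three-fold codes such as D and H, which appear in the results table later in the talk, are outside this 11-letter set and would need to be added.

```python
import re

# A, C, G, T, the six two-fold IUPAC codes, and N: 11 letters in total.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]",
    "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a degenerate motif; the lookahead lets overlapping sites be found."""
    body = "".join(IUPAC[ch] for ch in motif.upper())
    return re.compile("(?=" + body + ")")

def find_sites(motif, seq):
    """Start positions of every (possibly overlapping) match of motif in seq."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq.upper())]

print(find_sites("TAATTA", "CTAATTAGCTAATTA"))
print(find_sites("WATTRATTK", "GGTATTGATTGCC"))
```

The number of sites found this way is the "total" count; the "conserved" count is the subset whose aligned positions in the other species also match.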

All potential motifs
Evaluate MCS
Collapse motif variants
42
Overview
  • Part 2. Motif identification
  • Evolutionary signatures for motif discovery
  • Functional roles of novel motifs
  • Scaling motif discovery to 12 species

43
Evidence for promoter motifs
Consensus MCS
1 CTAATTAAA 65.6
2 TTKCAATTAA 57.3
3 WATTRATTK 54.9
4 AAATTTATGCK 54.4
5 GCAATAAA 51
6 DTAATTTRYNR 46.7
7 TGATTAAT 45.7
8 YMATTAAAA 43.1
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5
12 AACASCTG 38.8
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8
24 TTTYMATTA 33.6
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9
30 TTATTTAYR 32.9
44
Evidence for promoter motifs
Consensus MCS Matches to known TFs
1 CTAATTAAA 65.6 engrailed (en)
2 TTKCAATTAA 57.3 reversed-polarity (repo)
3 WATTRATTK 54.9 araucan (ara)
4 AAATTTATGCK 54.4 paired (prd)
5 GCAATAAA 51 ventral veins lacking (vvl)
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx)
7 TGATTAAT 45.7 apterous (ap)
8 YMATTAAAA 43.1 abdominal A (abd-A)
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5 fushi tarazu (ftz)
12 AACASCTG 38.8 broad-Z3 (br-Z3)
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5 Antennapedia (Antp)
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8 Abdominal B (Abd-B)
24 TTTYMATTA 33.6 extradenticle (exd)
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n)
30 TTATTTAYR 32.9 Deformed (Dfd)
45
Novel motifs show expression enrichment
  • Discovered independently of any expression
    clusters
  • Validate against known expression datasets

46
Evidence for promoter motifs
Consensus MCS Matches to known Expression enrichment
1 CTAATTAAA 65.6 engrailed (en)
2 TTKCAATTAA 57.3 reversed-polarity (repo)
3 WATTRATTK 54.9 araucan (ara)
4 AAATTTATGCK 54.4 paired (prd)
5 GCAATAAA 51 ventral veins lacking (vvl)
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx)
7 TGATTAAT 45.7 apterous (ap)
8 YMATTAAAA 43.1 abdominal A (abd-A)
9 AAACNNGTT 41.2
10 RATTKAATT 40
11 GCACGTGT 39.5 fushi tarazu (ftz)
12 AACASCTG 38.8 broad-Z3 (br-Z3)
13 AATTRMATTA 38.2
14 TATGCWAAT 37.8
15 TAATTATG 37.5 Antennapedia (Antp)
16 CATNAATCA 36.9
17 TTACATAA 36.9
18 RTAAATCAA 36.3
19 AATKNMATTT 36
20 ATGTCAAHT 35.6
21 ATAAAYAAA 35.5
22 YYAATCAAA 33.9
23 WTTTTATG 33.8 Abdominal B (Abd-B)
24 TTTYMATTA 33.6 extradenticle (exd)
25 TGTMAATA 33.2
26 TAAYGAG 33.1
27 AAAKTGA 32.9
28 AAANNAAA 32.9
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n)
30 TTATTTAYR 32.9 Deformed (Dfd)
47
Evidence for promoter motifs
Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
48
Overview
  • Part 2. Motif identification
  • Evolutionary signatures for motif discovery
  • Functional roles of novel motifs
  • Scaling motif discovery to 12 species

49
The slide for the skeptics
  • 6 species

dmel CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dsim CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dsec CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dyak CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
dere CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTC-CAAGTC
dana CACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAG
  • 12 species

footprint
dmel CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dsim CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dsec CAGCT--AGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dyak CAGC--TAGCC-AACTCTCTAATTA-------------------------------------------------GCGACTA---AGTC-CAAGTC
dere CAGCGGTCGCCAAACTCTCTAATTA-------------------------------------------------GCGACCA---AGTC-CAAGTC
dana CACTAGTTCCTAGGCACTCTAATTA-------------------------------------------------GCAAGTT---AGTCTCTAGAG
dpse ------------AGCCGTCTAATTA-------------------------------------------------GTG--GT---GTTCTCGGTTC
dper ------------AGCCGTCTAATTA-------------------------------------------------GTG--GT---GTTCTCGGTTC
dwil AAAAT------ATGCTCTCTAATTAATGTGATGGGATGGGA---------------TTCGAGACATTCGTGTTAGAGACTTTAGAGTCGCAAGTG
dmoj -------AGCTCGACT-TTTAATTAATATGACGGGCGAC----TTCACAGTGAGAATGGGAGAGAGAG----AGACAGTGAGAGAGAGTGA----
dvir -------AGCCCGACT--TTAATTAATATGATGAGCGCCAAAGTTCAGAGGGCCAGTTCAAAGCAGAC----AGACA------------------
dgri ------GAGGCAGAC--TTTAATTAATATGATGAGCG--TTTGGTCGGAGTGCCAAAGTTGAACAGGCCTGAAAACAGCTAAAGCTGCTCG----

This is not typical. Motifs not always aligned.
50
New challenges in using 12 genomes
engrailed (TAATTA)
engrailed
engrailed
  • Sequencing / assembly / alignment artifacts
  • Contig breaks, misassemblies, misalignments
  • Evolutionary variation
  • Individual binding sites can move / mutate
  • Some motifs conserved only in subset of species

51
Discovery power increases with more species
             engrailed motif                        Random motifs
             Total   Conserved   Conservation       conservation     Enrichment
5/5 flies    8599    1534        17.8%              2.6%             6.8x
8/8 flies    8529    770         9%                 0.7%             12.9x
Change       Real: ½             Random: ¼          Enrichment: ×2
  • From 5 to 8 flies: conservation enrichment
    increases
  • Real sites: ½ as many conserved
  • Random motifs: ¼ as many conserved
  • Enrichment doubles
  • → Undertake motif discovery with 12 species
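The enrichment figures on this slide follow directly from the counts: the conservation rate of engrailed sites divided by the conservation rate of random control motifs. A short sketch of that arithmetic, with hypothetical function names and the slide's own numbers as input:

```python
def conservation_rate(conserved, total):
    """Fraction of motif instances that are conserved."""
    return conserved / total

def enrichment(conserved, total, random_rate):
    """Fold enrichment of a motif's conservation rate over random control motifs."""
    return conservation_rate(conserved, total) / random_rate

five_flies = enrichment(1534, 8599, 0.026)   # about 6.9, shown as 6.8x on the slide
eight_flies = enrichment(770, 8529, 0.007)   # about 12.9

print(round(five_flies, 2), round(eight_flies, 2))
```

Going from 5 to 8 species roughly halves the number of conserved real sites but cuts the random rate to about a quarter, which is why the ratio roughly doubles.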

52
Signal increases linearly with branch length
  • MCS of top motifs increases
  • Several combinations of species allowed

53
Fly comparative genomics
Gene identification
Regulatory motif discovery
microRNA regulation
54
Motif discovery in 3′ UTRs
TSS
3′ UTR
Stop
ATG
  • 168 promoter motifs
  • match known TF motifs
  • show expression enrichment
  • enriched in known enhancers
  • 196 motifs in 3′ UTRs
  • Strand specific
  • Abundance of 8-mers
  • Role in microRNA regulation

55
Directionality of 3′ UTR motifs
3′ UTR motifs
Promoter motifs
Stop
motif
motif
ATG
3′ UTR motifs likely to act post-transcriptionally
56
3′ UTR motif properties
(2) Length distribution
  • Enriched in motifs of length 8

Features reminiscent of microRNA genes
57
Properties of microRNA genes (miRNAs)
Have we discovered target motifs for miRNAs ?
58
Top 3′ UTR motifs match known miRNAs
Specifically match the 5′ positions of known
microRNA genes
Complementarity can allow mismatches, in specific
positions
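The match between a 3′ UTR motif and a miRNA is a reverse-complement relationship: the UTR site pairs with the 5′ end of the mature miRNA. The sketch below derives the 8-mer site expected for a given miRNA and scans a UTR for exact occurrences. The function names, the `start` offset parameter, and the toy UTR are assumptions for illustration; the miRNA sequence is the dme-miR-33 sequence quoted later in this transcript, and mismatch tolerance is not modeled.

```python
# Complement table for DNA, also accepting U from RNA-style input.
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")

def reverse_complement(seq):
    """Reverse complement of a DNA (or RNA) sequence, returned as DNA."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def target_8mer(mirna, start=0):
    """The 8-mer expected in a target UTR for the miRNA 8-mer beginning at `start`."""
    return reverse_complement(mirna[start:start + 8])

def find_target_sites(utr, mirna, start=0):
    """Positions in the UTR that exactly match the expected 8-mer site."""
    site = target_8mer(mirna, start)
    utr = utr.upper()
    return [i for i in range(len(utr) - len(site) + 1) if utr[i:i + len(site)] == site]

mir33 = "GTGCATTGTAGTCGCATTG"
print(target_8mer(mir33))
print(find_target_sites("TTTCAATGCACGG", mir33))
```

Varying `start` by one position in either direction gives the neighbouring 8-mers whose conservation is compared when locating the miRNA start.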
59
Central 8-mer positions dictate miRNA binding
Evolution
Mismatches excluded from central positions
Experiment
Mutations in central positions affect binding
more strongly
Central 8-mer positions important for miRNA
binding
60
8-mer conservation rate pinpoints miRNA start
MCS of 8-mer starting at that position
  • 8-mers starting at positions 0, -1, 1 all have
    strong MCS
  • True across all miRNA genes (animation)
  • Can use as signature for miRNA starting position

61
Use 8-mer conservation profile to adjust
microRNA annotation
dme-miR-33 (old)  AGGTGCATTGTAGTCGCATTG
hsa-miR-33          GTGCATTGTAGTTGCATTG
dme-miR-33 (new)    GTGCATTGTAGTCGCATTG
  • Conservation profile matches human microRNA
    ortholog

62
Using 8-mers to discover novel microRNAs
  • Methodology
  • Find all matching conserved 8-mers genome-wide
  • Restrict to those that are flanked by a conserved
    20-mer
  • Fold surrounding genomic regions in all 9
    genomes
  • Restrict to those that have miRNA-like hairpin
    structure
  • Results
  • 55 known miRNAs: 43 8-mers, 41 hairpins (38
    known miRNAs)
  • Top 3′ UTR motifs (MCS > 30): 3234 8-mers, 185
    hairpins (50 known miRNAs)
63
Discovery of 50 novel miRNA genes
64
Ability to determine functional miRNA targets
Real motif instances (conserved above expected)
Motif instances within noise (expected for random
motifs)
65
Regulatory motif discovery in the fly
Systematic discovery of regulatory motifs in the
fly
  • Frequently occurring, strongly conserved short
    regulatory signals

TSS
3′ UTR
Stop
ATG
  • promoter motifs
  • match known TF motifs
  • show expression enrichment
  • enriched in enhancers, promoters
  • motifs in 3′ UTRs
  • Strand specific
  • Abundance of 8-mers
  • microRNA-associated

microRNA regulation
ATATGCAA
conserved 8-mers
50 known, 50 candidate new miRNA genes
66
Summary of results
  • 1. Gene identification
  • Evolutionary signatures for gene identification
  • 1400 novel, 600 refined, 579 dubious genes
  • Unusual gene structures
  • 2. Regulatory motif discovery
  • Genome-wide motif discovery
  • Novel motifs: tissues, promoters, enhancers
  • Discovery power scales with branch length
  • 3. microRNA regulation
  • Motif-centric discovery, precise start positions
  • New microRNA genes, refined existing genes
  • 12 genomes identify individual motif targets

67
Students and Collaborators
Alex Stark: Regulatory motifs
Bill Gelbart: Harvard MCB
Mike Lin: Gene identification
FlyBase curators
Matt Rasmussen: Whole-genome phylogeny
Sue Celniker: Berkeley Drosophila Genome Center
Bing Ren: UC San Diego
Martha Bulyk: Harvard Medical School