Title: Whole-genome comparative genomics
1Whole-genome comparative genomics
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
- Analyzing the human genome
Lecture 21
Dec 6, 2005
2Challenges in Computational Biology
Genome Assembly
Gene Finding
Regulatory motif discovery
DNA
Sequence alignment
Comparative Genomics
TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT
Database lookup
Evolutionary Theory
RNA folding
Gene expression analysis
RNA transcript
Cluster discovery
Gibbs sampling
Protein network analysis
Regulatory network inference
Emerging network properties
3TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG
AAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATAT
CTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGC
CTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCAT
TGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCC
GACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTG
AAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTAC
AATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAAT
GCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGAT
GATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACT
TTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATG
TCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGT
CAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGT
ACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAA
AGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCG
GATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT
TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGC
TTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATA
TCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTG
TGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAG
CAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTA
CGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAA
ATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACT
GTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGA
AGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATG
CTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGT
CTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAA
CTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAA
TAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATA
AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAA
ACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTT
TTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGT
GGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGC
AAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCT
TCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATT
TCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTT
TCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATT
TTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATT
TGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA
AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAA
TGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCAT
CTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAA
AAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTA
TTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGG
ATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGG
GTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAA
TATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATT
GGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCT
TTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCC
TATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTA
TTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATT
GCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTT
ACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCAT
TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTT
ATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTT
TTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAA
AATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACA
TGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACT
ACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATT
ATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGAT
AATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCT
TCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATT
TCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTG
TATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATA
CATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAA
GAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAA
TGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAG
TTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCA
ATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTT
AATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCT
TATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
5Comparing genomes reveals functional elements
6Extensive sequencing of mammalian tree
Black: complete (8x). Red: 2x sequencing.
elephant, armadillo, rabbit, bat, tenrec, shrew, cat, hedgehog
Average extra branch length: 0.2 subs/site
7Hidden Markov Models for gene finding
8Modeling biological sequences
Intergenic
CpG island
Promoter
First exon
Intron
Other exon
Intron
GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAG
GATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTA
ACGGTACTCACCGGGTTACAGGATTGTTACAGG
- Ability to emit DNA sequences of a certain type
- Not exact alignment to previously known gene
- Preserving properties of type, not identical sequence
- Ability to recognize DNA sequences of a certain type (state)
- What (hidden) state is most likely to have generated observations
- Find set of states and transitions that generated a long sequence
- Ability to learn distinguishing characteristics of each state
- Training our generative models on large datasets
- Learn to classify unlabelled data
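The "recognize" step above (which hidden state most likely generated each observation) is what Viterbi decoding computes. The following is a minimal sketch on a toy two-state model. The state names and every probability in it are hypothetical values chosen for illustration, not parameters of any gene finder named in these slides.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observed sequence."""
    # Work in log space to avoid underflow on long sequences.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position t.
            prev, best = max(
                ((p, score[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            score[t][s] = best + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical model: AT-rich "intergenic" state vs. GC-rich "island" state.
states = ("intergenic", "island")
start_p = {"intergenic": 0.5, "island": 0.5}
trans_p = {
    "intergenic": {"intergenic": 0.9, "island": 0.1},
    "island": {"intergenic": 0.1, "island": 0.9},
}
emit_p = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "island": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
path = viterbi("ATATATGCGCGCGC", states, start_p, trans_p, emit_p)
print(path)
```

With these toy numbers the decoded path labels the AT-rich prefix as intergenic and the GC-rich suffix as island, switching state once.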
9HMM-based Gene Finding
- GENSCAN (Burge 1997)
- FGENESH (Solovyev 1997)
- HMMgene (Krogh 1997)
- GENIE (Kulp 1996)
- GENMARK (Borodovsky & McIninch 1993)
- VEIL (Henderson, Salzberg, Fasman 1997)
- TWINSCAN (Brent 2001)
- N-SCAN (Brent 2005)
10VEIL: Viterbi Exon-Intron Locator
- Contains 9 hidden states or features
- Each state is a complex internal Markovian model of the feature
- Features
- Exons, introns, intergenic regions, splice sites, etc.
- Enter: start codon or intron (3' splice site)
- Exit: 5' splice site or three stop codons (taa, tag, tga)
VEIL Architecture
11Genie
- Uses a generalized HMM (GHMM)
- Edges in model are complete HMMs
- States can be any arbitrary program
- States are actually neural networks specially designed for signal finding
- J5: 5' UTR
- EI: Initial Exon
- E: Exon, Internal Exon
- I: Intron
- EF: Final Exon
- ES: Single Exon
- J3: 3' UTR
12Genscan Overview
- Developed by Chris Burge (Burge 1997)
- Characteristics
- Designed to predict complete gene structures
- Introns and exons, promoter sites, polyadenylation signals
- Incorporates
- Descriptions of transcriptional, translational and splicing signals
- Length distributions (explicit state duration HMMs)
- Compositional features of exons, introns, intergenic, CG regions
- Larger predictive scope
- Deals with partial and complete genes
- Multiple genes separated by intergenic DNA in a sequence
- Consistent sets of genes on either/both DNA strands
- Based on a general probabilistic model of genomic sequence composition and gene structure
13Genscan Architecture
- It is based on a generalized HMM (GHMM)
- Models both strands at once
- Other models: predict on one strand first, then on the other strand
- Avoids prediction of overlapping genes on the two strands (rare)
- Each state may output a string of symbols (according to some probability distribution)
- Explicit intron/exon length modeling
- Special sensors for cap site and TATA box
- Advanced splice site sensors
Fig. 3, Burge and Karlin 1997
14GenScan States
- N: intergenic region
- P: promoter
- F: 5' untranslated region
- Esngl: single exon (intronless) (translation start -> stop codon)
- Einit: initial exon (translation start -> donor splice site)
- Ek: phase k internal exon (acceptor splice site -> donor splice site)
- Eterm: terminal exon (acceptor splice site -> stop codon)
- Ik: phase k intron (0 = between codons, 1 = after the first base of a codon, 2 = after the second base of a codon)
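The intron phase defined above follows directly from how many coding bases precede the intron. A small sketch, assuming the exon lengths count coding bases only; the function name and the example gene are hypothetical.

```python
def intron_phases(exon_coding_lengths):
    """Phase of the intron that follows each exon except the last.

    Phase 0: the intron falls between codons; 1: after the first base
    of a codon; 2: after the second base of a codon.
    """
    phases = []
    coding_so_far = 0
    for length in exon_coding_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases

# Hypothetical gene with coding exon lengths of 100, 50, and 60 nt.
print(intron_phases([100, 50, 60]))  # [1, 0]
```

This is why the model needs separate Ek and Ik states per phase: the phase carried across an intron determines where the next exon resumes within a codon.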
15Classification-based Gene finding
16Gene identification
TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTA
AATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
M T K S H S E E V I V P E F K
- Intuition
- Genes are translated in units of 3 nucleotides (codons)
- Every DNA strand can be translated in 3 reading frames
- Insertions and deletions may cause frame-shifts
- Selective pressure on the amino-acid translation
- Silent substitutions tolerated
- Codons for similar amino-acids frequently exchanged
- Method
- Observe patterns of nucleotide change in genes / intergenic regions
- Develop signatures / tests to discriminate between the two
- Validate tests with known genes / intergenic regions
- Use them to revisit the yeast and human genomes
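To make the reading-frame intuition concrete, here is a short sketch that translates one strand in its three frames using the standard genetic code. The helper names are illustrative; the input is the coding portion of the sequence shown on this slide, starting at its ATG.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate a DNA string starting at offset `frame` (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

orf = "ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTC"
for frame in range(3):
    print(frame, translate(orf, frame))
```

Frame 0 yields MTKSHSEEVIVPEF, the peptide shown under the sequence. The other two frames give unrelated translations, which is why a single-base insertion or deletion is so disruptive.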
17Gene identification
Study known genes
Derive conservation rules
Discover new genes
18Overall conservation vs. signatures of divergence
- Not a gene
- Region of perfect/near-perfect non-coding conservation
- Scores very well with HMM approaches, ExoniPhy, N-Scan, which measure general levels of local nucleotide conservation
human TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
rat   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
dog   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
- Real gene
- Mutations do occur, consistent with constraints under which genes evolve
- Insertions preserve reading frame. Mutations preserve amino-acid function
- -> Quantify and capture these constraints computationally
human TGC---CCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCACGTGACGTGGCTG---TGGCAGCGGCAGCTAAAAAAGAGCTTAAGTAT
rat   TGCCAGCCACGCGACGTGGCCG---TGGCAGCAGCCGCTAAAAAGGAACTTAAGTAC
dog   TGCCAGCCACGCGAGGTGGCGG---------CTGCGGCCAAGAAAGAGCTCAAGTAC
19Signature 1 Reading frame conservation
20Signature 2 Distinct patterns of codon
substitution
[Figure: codon substitution matrix for genes; axes: codon observed in species 1 vs. codon observed in species 2]
- Codon substitution patterns specific to genes
- Genetic code dictates substitution patterns
- Amino acid properties dictate substitution
patterns
21Evaluating reading frame conservation (RFC)
Scer     CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCAGAGAAACAGCTCTATGAGAAATCAGCTGATG
Scer_f1  123123123123123123-12312312312312312312312-----3123123123123123123123123123123123
Spar     TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACAGAGAAACAGCTTC-TGAGAAGTCAGCCGGTG
Spar_f1  12312312-312312312312312312312312312312312312312312312312312312-31231231231231231  -> RFC 43%
Spar_f2  23123123-123123123123123123123123123123123123123123123123123123-12312312312312312  -> RFC 34%
Spar_f3  31231231-231231231231231231231231231231231231231231231231231231-23123123123123123  -> RFC 23%
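The frame labelling above can be reproduced in a few lines. This sketch assumes the RFC of a frame is the fraction of columns, among those where both species have a base, in which the two frame labels agree; that definition is inferred from the numbers on the slide, and the function names are illustrative.

```python
def frame_labels(aligned, offset=0):
    """Codon position (0, 1, 2) of each base; None at gap columns."""
    labels, count = [], offset
    for ch in aligned:
        if ch == "-":
            labels.append(None)
        else:
            labels.append(count % 3)
            count += 1
    return labels

def rfc_counts(ref, other):
    """In-frame column counts for the three frame offsets of `other`."""
    ref_frames = frame_labels(ref)
    counts, total = [], 0
    for offset in range(3):
        pairs = [
            (a, b)
            for a, b in zip(ref_frames, frame_labels(other, offset))
            if a is not None and b is not None
        ]
        total = len(pairs)
        counts.append(sum(a == b for a, b in pairs))
    return counts, total

scer = ("CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCA"
        "GAGAAACAGCTCTATGAGAAATCAGCTGATG")
spar = ("TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACA"
        "GAGAAACAGCTTC-TGAGAAGTCAGCCGGTG")
counts, total = rfc_counts(scer, spar)
for name, c in zip(("Spar_f1", "Spar_f2", "Spar_f3"), counts):
    print(f"{name}: {c}/{total} = {c / total:.1%}")
```

On this alignment the counts are 31, 25, and 17 of 73 ungapped columns, close to the 43 / 34 / 23 split shown on the slide. No frame dominates, which is the signature of a region that is not constrained to keep its reading frame.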
22Evaluating the codon substitution score (CSM)
Codon substitution matrix (x 10^-5); rows: human codon, columns: mouse codon

        AAA/K  AAG/K  AAC/N  AAT/N  AGA/R  AGG/R  ...  TAA/X
AAA/K    1552    608     12      8     74     26  ...      0
AAG/K     423   2531     11      9     23     73  ...      0
AAC/N       8     13   1368    331      1      1  ...      0
AAT/N       8     12    444   1007      2      1  ...      0
AGA/R      44     22      1      1    664    178  ...      0
AGG/R      15     72      1      1    148    594  ...      0

pX/Y = P(human codon X aligns to mouse codon Y in genes)
qX/Y = P(human codon X aligns to mouse codon Y outside genes)
- Scoring an aligned region
human CTGTTTTTCCCCTTTTGTAGGAAGTCAC
mouse CTGTTTTTCCTCTTTTGTAGTAAGTCAC
Coding score = (pCCC/CTC x pAGG/AGT) / (qCCC/CTC x qAGG/AGT)
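As a sketch of how such a score could be evaluated, the snippet below sums log-odds over the codon pairs that differ in the aligned region. Taking the logarithm, reading the region in frame 0, and the p and q values themselves are all assumptions made for illustration; the slide gives only the ratio of products and no actual probabilities for these two substitutions.

```python
import math

def codons(seq):
    """Split a sequence into complete codons, reading in frame 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def coding_score(human_codons, mouse_codons, p, q):
    """Sum of log(p/q) over aligned codon pairs that differ."""
    score = 0.0
    for x, y in zip(human_codons, mouse_codons):
        if x != y:  # only substitutions contribute in this sketch
            score += math.log(p[(x, y)] / q[(x, y)])
    return score

human = "CTGTTTTTCCCCTTTTGTAGGAAGTCAC"
mouse = "CTGTTTTTCCTCTTTTGTAGTAAGTCAC"

# Hypothetical probabilities, for illustration only.
p = {("CCC", "CTC"): 2e-4, ("AGG", "AGT"): 1e-5}
q = {("CCC", "CTC"): 1e-4, ("AGG", "AGT"): 1e-4}

score = coding_score(codons(human), codons(mouse), p, q)
print(f"log-odds coding score: {score:.3f}")
```

A positive score would mean the observed substitutions are more typical of genes than of non-coding sequence; with these made-up numbers the score is negative.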
23Multiple levels of selection
[Figure: codon substitution matrix for genes; axes: codon observed in species 1 vs. codon observed in species 2]
- Multi-level information
- All positions -> overall conservation
- Exclude conserved triplets -> amino-acid sequence
- Exclude conserved amino-acids -> amino-acid properties
24Effect of using only off-diagonal CSM positions
CSM coding score for human/mouse (x-axis) and human/dog (y-axis) in CFTR region
Using full CSM matrix: false positives. Is it conserved like a coding gene?
Using only off-diagonal positions: no false positives. Has it diverged like a coding gene?
25Putting it all together ExoClass gene finder
- Train Support Vector Machine (SVM) classifier
- Reading Frame Conservation (RFC) score
- Codon Substitution Matrix (CSM) coding score
- Splice signal conservation, ESEs, ESIs
- Exon length, conservation boundaries
- Apply it systematically to all candidate intervals
- Use full gene model constraints for post-processing
26Results in yeast
                          Accept   Reject
4000 named genes          99.9%    0.1%
300 intergenic regions    1%       99%
2000 hypothetical ORFs    1500     500
High sensitivity and specificity
27Results in human ENCODE regions (Human/Mouse)
            Nucl Sn  Nucl Sp  Exon Sn  Exon Sp  Missed  Wrong  Wrong w/ evidence
GENSCAN          85       62       67       49      17     39                 17
TWINSCAN         77       88       66       79      26     11                 25
SGP2             84       84       72       69      18     20                 24
Exoniphy         73       88       57       67      26     10                 53
ExoClass         86       87       73       75      17     14                 37
- High nucleotide sensitivity and specificity
- Increases with additional species (with some caveats)
- Missed exons due to
- Sequencing / assembly / alignment problems
- Rapidly evolving genes: immunity and olfactory families
- Wrong exons due to
- Novel exons, Novel exons, Novel exons
- Existing evidence: human / non-human spliced mRNAs
- New evidence: validated using specific RT-PCR (with MGC)
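For readers unfamiliar with the columns in the table above, here is a sketch of nucleotide-level sensitivity and specificity. It assumes the usual gene-finding convention, Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the slides do not spell out the definitions, and the intervals below are hypothetical.

```python
def nucleotide_sn_sp(annotated, predicted):
    """Nucleotide-level sensitivity and specificity for exon intervals.

    Intervals are half-open (start, end) pairs on a single sequence.
    """
    true_nt = {i for s, e in annotated for i in range(s, e)}
    pred_nt = {i for s, e in predicted for i in range(s, e)}
    tp = len(true_nt & pred_nt)
    sensitivity = tp / len(true_nt)  # TP / (TP + FN)
    specificity = tp / len(pred_nt)  # TP / (TP + FP)
    return sensitivity, specificity

# Hypothetical annotation and prediction.
annotated = [(100, 200), (300, 400)]
predicted = [(120, 200), (300, 380), (500, 540)]
sn, sp = nucleotide_sn_sp(annotated, predicted)
print(f"Sn = {sn:.0%}, Sp = {sp:.0%}")
```

In this toy case 160 of the 200 annotated bases are predicted, and 160 of the 200 predicted bases are annotated, so both measures come out at 80%.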
28Examples in the human
- Example 1 New gene
- Example 2 Deleted gene
29Example 3 Changed exons
30Initial results for the whole human genome
Human
Dog
Mouse
Rat
1,065 fully rejected
454 novel (2,591 exons)
7,717 refined
9,862 fully confirmed
1,919 not aligned
- Fully rejected genes typically have only weak evidence
- New exons often supported by existing experimental evidence
- RT-PCR validation of 90 fully novel genes: 50 confirmed
31Experimental validation
- Select novel predictions with highest specificity
- Unique in the genome
- No pseudogenes
- Absolutely no previous experimental evidence
- Results
- June 2005: 454 genes -> 90 entirely novel
- RT-PCR validation for specific exon splicing
- 50 fully validated using pooled tissues
- New validation set
- Top of the list: 354 genes, 1162 exons
- and many more (gene families, lower scores)
32Gene Identification Summary
- Exon-centric approach
- Identify discriminating variables
- Observed distinct patterns of nucleotide change
- Systematically identify all exons in the genome
- Use gene structure constraints to link them
- Application
- High sensitivity and specificity (90)
- More powerful than experimental methods
- Largest reannotation of the yeast genome
- Reannotation of the human gene set
33Regulatory Motif Discovery
34Regulatory Motif Discovery
[Figure: GAL1 promoter upstream of ATGACTAAATCTCATTCAGAAGAAGTGA, with two Gal4 sites (CGG...CCG) and a Mig1 site (CCCCW)]
- Gene regulation
- Genes are turned on / off in response to changing environments
- No direct addressing: subroutines (genes) contain sequence tags (motifs)
- Specialized proteins (transcription factors) recognize these tags
- What makes motif discovery hard?
- Motifs are short (6-8 bp), sometimes degenerate
- Can contain any set of nucleotides (no ATG or other rules)
- Act at variable distances upstream (or downstream) of target gene
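Degenerate motifs such as CCCCW are written with IUPAC codes (W = A or T, R = A or G, and so on). A minimal sketch of scanning a sequence for such a motif follows; the function name and the test sequence are made up for illustration.

```python
import re

# IUPAC codes used in these slides: the four bases, six 2-fold codes, and N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def motif_positions(motif, sequence):
    """0-based start positions of all (possibly overlapping) motif matches."""
    pattern = "".join(IUPAC[ch] for ch in motif.upper())
    # A lookahead lets the regex engine report overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]

print(motif_positions("CCCCW", "TTCCCCAGGACCCCTTT"))  # [2, 10]
```

Because a 5-mer like this matches by chance every few hundred bases, finding an occurrence says little on its own. That is the difficulty the bullets above describe.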
35Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
36Known motifs are preferentially conserved
human  CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog    CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse  GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat    GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human  CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog    CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse  --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat    --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human  TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog    CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse  TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat    -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG

Gabpa
Is this enough to discover motifs? No.
37Known motifs are frequently conserved
Human
- Across the human promoter regions, the Erra motif
- appears 434 times
- is conserved 162 times
- Compare to random control motifs
- Conservation rate of control motifs: 6.8%
- Erra enrichment: 5.4-fold
- Erra p-value < 10^-50 (25 standard deviations under binomial)
Motif Conservation Score (MCS)
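The Erra figures above can be checked with a few lines of arithmetic. This sketch assumes the score is the number of standard deviations by which the conserved count exceeds its binomial expectation, using the normal approximation to the binomial.

```python
import math

n, k = 434, 162   # total and conserved occurrences of the Erra motif
r = 0.068         # conservation rate of random control motifs

enrichment = (k / n) / r
z = (k - n * r) / math.sqrt(n * r * (1 - r))
print(f"enrichment = {enrichment:.2f}-fold, z = {z:.1f}")
```

This gives an enrichment of roughly 5.5-fold (the slide quotes 5.4) and a z-score of about 25, in line with the "25 standard deviations" stated above.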
38MCS distribution of all 6-mers shows excess conservation
[Figure: motif density vs. Motif Conservation Score (MCS)]
- High scoring patterns include known motifs
- Excess specific to promoters and 3'-UTRs (not introns)
- For MCS > 6, estimate 97% specificity
Use MCS to discover new motifs
39Hill-climbing in sequence space
- Seed selection
- Three mini-motif conservation criteria (CC1, CC2, CC3)
- Motif extension
- Non-random conservation of neighbors
- Motif collapsing
- Merge neighbors using hierarchical clustering, avg-max-linkage
- Re-scoring complex motifs
- Motif conservation score for full motifs (MCS)
40Test 1 Intergenic conservation
Conserved count
Total count
41Test 1 Selecting mini-motifs
- Estimate basal rate of conservation
- Expected conservation rate at the evolutionary distances observed
- Average conservation rate of non-outlier mini-motifs
- Score conservation of mini-motif
- k: conserved motif occurrences
- n: total motif occurrences
- r: basal conservation rate
- Evaluate binomial probability of observing k successes out of n trials
- Assign z-score to each mini-motif
- Bulk of distribution is symmetric
- Estimate specificity as (R-L)/R
- Select cutoff: 5.0 sigma
- 1190 mini-motifs, 97.5% non-random
Specificity
Cutoff
Right tail
Left tail
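The specificity estimate (R-L)/R treats the left tail of the z-score distribution as a measure of how many scores reach the cutoff by chance. A sketch under that reading, with hypothetical z-scores:

```python
def tail_specificity(z_scores, cutoff):
    """Estimate specificity as (R - L) / R at a symmetric z-score cutoff."""
    right = sum(1 for z in z_scores if z >= cutoff)   # R: right tail
    left = sum(1 for z in z_scores if z <= -cutoff)   # L: left tail, taken as noise
    if right == 0:
        raise ValueError("no scores above the cutoff")
    return (right - left) / right

# Hypothetical z-scores, for illustration only.
scores = [-5.2, -1.0, 0.3, 0.8, 5.1, 5.5, 6.0, 7.2, 9.4]
print(tail_specificity(scores, 5.0))  # 0.8
```

The idea is that if the bulk of the distribution is symmetric, about as many chance events land beyond +5 sigma as beyond -5 sigma, so the excess on the right is attributed to real signal.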
42Test 2 Intergenic vs. Coding
Intergenic Conservation
Coding Conservation
43Test 3 Upstream vs. Downstream
Upstream Conservation
Downstream Conservation
44Constructing full motifs
Test 1
Test 2
Test 3
2,000 mini-motifs
C T A C G A R R
45Extending mini-motifs
- Separate conserved and non-conserved instances
6  C T A C G A  Causal set
6  C T x x G A  Random set
46Collapsing similar motifs
- Motif similarity: sequence and genomic positions
- Motifs share similar sequences, count bits in common
- Motifs appear conserved in similar sets of regions
Regions with motif 1
Regions with motif 2
Regions containing both motifs
47Systematically test candidate patterns
[Figure: example pattern with a gap, letters G T C R Y S A G T R W]
- Enumerate
- Length between 6 and 15 nt, allow central gap
- 11 letter alphabet (A C G T, 2-fold codes, N)
- Score
- Compute binomial score (conserved vs. total)
- Select MCS > 6.0 -> specificity 97%
- Cluster
- Sequence similarity
- Overlapping occurrences
All potential motifs
Evaluate MCS
Cluster similar motifs
Are these real ?
48Functions of discovered motifs
49Evidence of motif function
[Figure: gene diagram (promoter, ATG, Stop, 3'-UTR). Promoter: 174 motifs. 3'-UTR: 106 motifs.]
- Promoter motifs
- Comparison to known motifs
- Distance from TSS
- Expression enrichment
50Promoter motifs match known TF binding sites
- Compare discovered motifs to TRANSFAC database of 125 known motifs
55% of TRANSFAC motifs match discovered motifs
51(2) Promoter motifs show preferred distance to TSS
Motif instances in human
Conserved motif sites in all four species
Each of 174 discovered motifs
Motif 4: -81
Motif 8: -63
Distance from TSS
Discovered motifs occur preferentially within 200 bp of the transcription start site
Individual motifs show strong peaks, regardless of conservation
32% of discovered motifs show strong positional bias
52(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
53Summary for promoter motifs
Rank Discovered Motif Known TF motif Tissue Enrichment Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1() Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY Yes Yes
22 TGCGCAnK Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
- 174 promoter motifs
- 70 match known TF motifs
- 115 expression enrichment
- 60 show positional bias
- -> 75% have evidence
- Control sequences
- < 2 match known TF motifs
- < 5 expression enrichment
- < 3 show positional bias
- -> < 7 false positives
Most discovered motifs are likely to be functional
54What about 3'-UTR motifs?
TSS
3'-UTR
Stop
ATG
- Sequence properties of 3'-UTR motifs
- Regulatory roles of 3'-UTR motifs
55Directionality of 3'-UTR motifs
3'-UTR motifs
Promoter motifs
Stop
motif
motif
ATG
3'-UTR motifs likely to act post-transcriptionally
56What are microRNAs (miRNAs)?
- Endogenous small non-coding RNA
- 22 nt in length
- Located in genomic loci that can produce fold-back structures
- Often conserved (but conservation may not be required)
57miRNA and siRNA
miRNA gene/miRNA host gene
Double stranded RNA formation
5
3
RISC Complex
P
OH
58miRNA and siRNA as Negative Regulators of Gene Expression
mRNA
Near perfect match: degradation of target
miRNA / siRNA
Partial match: inhibition of translation, degradation of target
Chromosomal silencing
Off-target effect
lin-14 mRNA
lin-4 RNA, 22 nt
59Properties of microRNA genes (miRNAs)
Properties similar to the motifs we have
discovered
60 3'-UTR motif properties
(2) Length distribution
- Enriched in motifs of length 8
Have we in fact discovered targets of microRNA
genes?
61Compare 8-mer sequence to known miRNAs
- Compare 8-mer motifs against all 207 known miRNAs
- 72 discovered 8-mers match 44% of known miRNA genes
- (72 control sequences only match 5%)
- 8-mer motifs are likely miRNA targets
62Novel miRNA genes show deep evolutionary
conservation
- Using 8-mers to discover novel miRNA genes
- Conserved much further than mammalian lineage
63Can we use 8-mers to discover miRNA genes ?
8-mer motif
miRNA complement
TTGCATAT
ATATGCAA
258 stem loops discovered
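The pairing between the 8-mer motif and the miRNA complement shown above is a reverse complement, written here in the DNA alphabet as on the slide. A minimal sketch:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("TTGCATAT"))  # ATATGCAA
```

Searching the genome for conserved stem-loops that contain the reverse complement of a discovered 8-mer is one way to read the question in the slide title; the stem-loop search itself is not shown here.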
64Properties of discovered miRNA genes
ATATGCAA
8-mer motif
Discovered miRNA gene
- 258 candidate miRNA genes discovered
- 114 correspond to known miRNA genes (of 222)
- 144 novel candidate miRNA genes
- Experimentally tested 12 representative novel miRNAs
- Specifically tested for expression of inferred 22-mer using RT-PCR
- Pooled small RNAs from 10 adult human tissues
- 6 of 12 found to be expressed with predicted structure in adults
- (developmental tissues may contain additional miRNA genes)
Many of the discovered miRNA genes are likely to
be real
65Two classes of miRNA genes
222 known miRNA genes
114 re-discovered
108 missed
- Rapidly evolving
- (5-fold higher mutation rate)
Many targets -> evolutionary constraint
Co-evolution of miRNA genes and their targets?
66How many targets do miRNA genes regulate ?
ATATGCAA
8-mer motif
Inferred 3'-UTR targets
miRNA gene
- What fraction of conserved 8-mers are true miRNA targets?
- 40% of genes contain at least one discovered 8-mer
- (vs. 25% for appropriate control 8-mers)
Extraordinary importance of miRNA regulation
67 3'-UTR motifs and post-transcriptional regulation
8-mer associated
46 motifs are 8-mer associated -> targets of microRNAs
Other 3'-UTR motifs
60 motifs left -> targets of RNA-binding proteins
Motif length
- Several noteworthy examples
- AATAAA: Poly-A signal
- 6 AT-rich elements: mRNA stability and degradation
- 24 TGTA-rich elements: mRNA localization (PUF-family)
- 29 other, potential targets of RNA-binding proteins
May help systematic study of post-transcriptional regulation
68Summary Regulatory motif discovery
Systematic discovery of regulatory motifs in the human
- Frequently occurring, strongly conserved short regulatory signals
TSS
3'-UTR
Stop
ATG
- 174 promoter motifs
- 70 match known TF motifs
- 115 expression enrichment
- 60 show positional bias
- 106 motifs in 3'-UTR
- Strand specific
- 8-mers are miRNA-associated
- mRNA localization and stability
miRNA regulation
ATATGCAA
Target 20% of human 3'-UTRs
discovered 8-mers
114 known, 144 new miRNA genes
69Towards human regulatory networks
Global motif co-occurrence map -> reveal co-operating regulators
Initial network of master regulators -> reveal hubs, cascades, network motifs
From sequence-based discovery to dynamic models
70Motifs outside promoters and 3'-UTRs
71Extract conserved regions in the human genome
Procedure for generating conserved regions
- Extract top 5% most conserved regions in the human genome based on PhyloHMM score (142M bp).
- Remove protein-coding regions.
- Extract regions with conservation rate above 80% in sliding windows of 20 bp in human/mouse/rat/dog alignment.
- Remove alignments not in syntenic blocks.
- Remove alignments not in one-to-one mapping.
- Mask repeat sequences.
- > 70M bp sequences (2.5% of the human genome)
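The sliding-window step in the procedure above can be sketched as follows. The definition of a conserved column (identical in all species, no gaps) is an assumption made for this example, as are the function name and the toy alignment.

```python
def conserved_windows(columns, window=20, min_rate=0.8):
    """Start positions of windows whose conservation rate is >= min_rate.

    `columns` is a list of alignment columns, each a string holding one
    character per species (for example human, mouse, rat, dog).
    """
    # Assumed definition: a column is conserved if all species agree and none is a gap.
    conserved = [len(set(col)) == 1 and "-" not in col for col in columns]
    hits = []
    count = sum(conserved[:window])
    for start in range(len(columns) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new one, drop the old one.
            count += conserved[start + window - 1] - conserved[start - 1]
        if count / window >= min_rate:
            hits.append(start)
    return hits

# Toy alignment: 20 perfectly conserved columns followed by 10 with a mismatch.
columns = ["AAAA"] * 20 + ["ACAA"] * 10
print(conserved_windows(columns))  # [0, 1, 2, 3, 4]
```

Overlapping windows that pass the threshold would then be merged into regions before the synteny, one-to-one mapping, and repeat filters are applied.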
72Random chance of occurrence of K-mers with different size in conserved regions
Mean number of occurrences in 70M bp region by chance
73An example K-mer
TTCAGCACCATGGACAGC (18-mer): appears 199 times in the conserved regions --> 1300-fold enrichment.
- Enrichment in the conserved regions
- Moreover, in the whole human genome
- The 18-mer occurred 446 times
- (45% of the sites in conserved regions)
- --> an enrichment of 18-fold, compared with 2.5%.
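The 18-fold figure above follows from comparing the share of sites that fall in conserved regions with the share of the genome those regions cover. A quick check of that arithmetic:

```python
in_conserved, genome_wide = 199, 446     # occurrences of the 18-mer
conserved_fraction_of_genome = 0.025     # conserved regions cover 2.5% of the genome

share = in_conserved / genome_wide
enrichment = share / conserved_fraction_of_genome
print(f"{share:.0%} of sites, {enrichment:.0f}-fold enrichment")
```

This reproduces the 45% and 18-fold values; the 1300-fold figure on the same slide comes from a different comparison that is not spelled out here.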
74Model motifs by consensus with mismatch
Given a k-mer word w, we consider the ball B(w, r) of radius r around w, where r bounds the distance between w and any other word in the ball.
Example: k = 20, w = GGCGCTGTCCGTGGTGCTGA, r = 2
GGCGCTGTCCGTGGTGCTGA
TGCGCTGTCCGTGGTGCTGA
GGAGCTGTCCGTGGTACTGA
GGCACTGGCCGTGGTGCTGA
...
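A sketch of ball membership, assuming the distance is the Hamming distance (number of mismatching positions), which fits the "consensus with mismatch" framing; the slide does not name the metric explicitly.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))

def in_ball(word, center, radius):
    """True if `word` lies in the ball B(center, radius)."""
    return hamming(word, center) <= radius

w = "GGCGCTGTCCGTGGTGCTGA"
examples = ("TGCGCTGTCCGTGGTGCTGA", "GGAGCTGTCCGTGGTACTGA", "GGCACTGGCCGTGGTGCTGA")
for candidate in examples:
    print(candidate, hamming(candidate, w), in_ball(candidate, w, 2))
```

The three example words from the slide are at distances 1, 2, and 2 from w, so all of them fall inside B(w, 2).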
75Algorithms for searching overrepresented sequences
Ver1: Build suffix tree first, and then enumerate motifs with mismatches. (Doesn't allow indels, but motif search is exhaustive; slow.)
Ver2: Hash k-mers first, and extend shared k-mer sites to screen out sites that are similar to each other. (Allows indels, but with lower sensitivity; fast.)
- Alignment based method (for long sequences > 30 bp)
- Blastz human vs human sequences.
- Extract sequences with multiple hits.
- Generate consensus sequence for each multiple alignment.
- Smith-Waterman alignment on the whole genome to identify all hits for each consensus.
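The first step of Ver2 (hashing k-mers to find sites that share a word) could look like the sketch below. Only the hashing step is shown; the extension and screening steps are not, and the function names and toy sequence are illustrative.

```python
from collections import defaultdict

def kmer_index(sequence, k):
    """Map each k-mer to the list of 0-based positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def repeated_kmers(sequence, k, min_count=2):
    """k-mers that occur at least `min_count` times, with their positions."""
    return {
        kmer: positions
        for kmer, positions in kmer_index(sequence, k).items()
        if len(positions) >= min_count
    }

print(repeated_kmers("ACGTACGTTACGT", 4))
```

On the toy sequence this reports ACGT at positions 0, 4, and 9 and TACG at positions 3 and 8. Sites sharing a k-mer are the seeds that an extension step would then compare.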
76Discovered sequences
- 67 consensus sequences with average size 80 bp, enrichment rate > 0.6, and number > 20.
- 30 20-mers, enrichment rate > 20, and number > 20.
- 46 18-mers, enrichment rate > 30, number > 30.
79A few examples
Sequence              Enrichment  Total  in_gene  in_promot  UTR
TGGAAATGCTGACACAACCT       0.789     21        7          2    0
TTCATTTACACTTAACTCAT       0.739     90       28          5    0
AAAGGCCCTTTTCAGAGCCA       0.729     46       46          0   43
AAATGCTGACAGACCCTTAA       0.700     25       13          4    0
GTCTGTCAGCATTTCCATTA       0.698     35       14          1    0
GGTTCCCATGGCAACAGCCT       0.686     22       10          3    0
AACTCCCATTAATGCTAATG       0.680     21        7          0    0
CAGCATCTGGCTCCTTGGCA       0.667     21        7          0    0
GTTGCCATGGCAACAGCAGC       0.640     32       14          5    2
TTTTATGGCTGAGTTATAAA       0.640     23       11          1    1
CTGTTGCCATGGCAACCAGG       0.630     39       22         11    1
GGTCTCCATGGCAACCAGCC       0.621     15        7          3    0
AGTGGCCTGAAAGAGTTAAT       0.615     22       12          1    0
TTATAATGGAAATGCTGACA       0.604     52       23          2    0
GTCTGTTAGCATTTCCATTA       0.595     23       10          2    0
AATAGGGGTTTATAATGGAA       0.594     27       11          2    1
TCCCATTAATGTTAATGGGA       0.591     23       10          2    0
GCTTTGGTTTCCATGGAAAC       0.583     25        7          2    0
80Context of K-mers: conservation island
Conservation island
81Context of K-mers: extended conservation
TGCTGTTCCATGGCAAC
Palindromic sequence
82Context of K-mers: connected conservation
Histone 3' UTR motif
83Context of K-mers: connected conservation
84Context of K-mers: connected conservation
85Identify long sequences based on alignment
86Interesting RNA structure of the sequence
GGAAGAAGGGAAGAAATGGCTCACTTTTCAGAGGTGCATTTACTCTTTGA
CCCACTAGGGTACTATTTAGTGTTCTAGAAGAGGTAATTTAGTAAATTGT
ACCCCAGTGGCCTGAAAAAGTTAATGCAACTCTGAAAAGTGAGCCATTCA
ATCGATTTTCCCTATTGCTTTTAAAAAAT
.(((((.(((((((((((((
((((((((((((((.((((((.(((.(((.(.(((((.((((((.(((((
.((.(((.....))).....)).))))).)))))).)))))..).))).)
)).)))))).))))))))))))))))))).......))))))))....))
)))....... (-74.51)
87Conserved instance in the intron of ADCY5
TGCTGTTCCATGGCAAC
88Conclusion
- Goldmines of conservation in the human genome
- Short motifs, very frequently occurring
- Longer motifs, many occurrences
- Extremely long elements, near-perfect
conservation
- Regulatory role?
- microRNA genes / other non-coding RNAs
- Early development, body-plan formation
- Repeat elements hijacked for regulatory roles?
- Contain strong enhancer regions, scattered across
genome
- A lot of untranslated transcription
89Regulatory motif evolution
Erez Lieberman
90Evidence of motif movement by neutral evolution
Motif disappears, and reappears about 100 bp
downstream in S. mikatae
CGTNNNNNRYGAY
Scer GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA
Spar AACCCTATTAATTCGTAAGCAGTGATATAA-AT-AGAATAACCTAACTTATACAACTGTA
Smik AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA
Sbay AGCCCTATACATTCGTACCAAGTGATAAAT-ATTATTAAGACCTAACATTTAAAACAGTT
CGTNNNNNRYGAY
Scer AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT
Spar TCCCTTTTAAACCCCCTAATATTACC-ATCTAAGACCTAACTAATATCAA----GGGAAA
Smik A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA
Sbay ACCCT-----ACCCTAAAATGGGAAC-ATAAAACACAAACCCTATATAAACGTAGAGAAA
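The claim on this slide can be checked by scanning the aligned rows for the consensus CGTNNNNNRYGAY. The sketch below translates the IUPAC consensus into a regular expression and searches the forward strand only, after removing alignment gaps. The "upstream" and "downstream" labels are illustrative names for the two alignment blocks, and the row strings assume the wrapped alignment lines are joined as 60-column rows.

```python
import re

# IUPAC codes used by the consensus, mapped to regex character classes.
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T",
               "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def find_motif(motif: str, aligned_seq: str) -> list:
    """Start positions (in the ungapped sequence) of forward-strand matches."""
    seq = aligned_seq.replace("-", "")
    pattern = "".join(IUPAC_REGEX[c] for c in motif)
    # Lookahead so that overlapping matches are also reported.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]


MOTIF = "CGTNNNNNRYGAY"
upstream = {
    "Scer": "GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA",
    "Smik": "AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA",
}
downstream = {
    "Scer": "AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT",
    "Smik": "A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA",
}
for label, block in (("upstream", upstream), ("downstream", downstream)):
    for species, row in block.items():
        print(label, species, find_motif(MOTIF, row))
```

In these rows the motif matches S. cerevisiae only in the first block and S. mikatae only in the second, consistent with the slide's statement.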
91Evidence of strand crossing for near-palindromic
motifs
[Figure: position of ABF1 motif instances relative to YHL012W in S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus]
- ABF1 crosses the strand in YHL012W
- CGTNNNNNRYGAY (forward strand)
- RTCRYNNNNNACG (reverse complement)
Scer ---TAAAATAGCATATCGTTAAAAACGACAAACGCGT
Spar ---TAATATAACATCTCGTTAAAAACGACAAACGCGT
Smik TAATGAAATAA-ATCTCGTAAAAAACGACAAACGCGT
Sbay ---TGATCTGCCCTTCCGTATATAATGACAAACGCGT
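The two consensus strings on this slide are reverse complements of each other, which a few lines of code can confirm. The sketch below also reports on which strand(s) the consensus matches two of the aligned fragments. It is only a pattern check on the text shown here, not the analysis behind the slide, and all names are illustrative.

```python
import re

# Complement table, including the IUPAC ambiguity codes used here.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "R": "Y", "Y": "R", "N": "N"}
IUPAC_REGEX = {"A": "A", "C": "C", "G": "G", "T": "T",
               "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def reverse_complement(motif: str) -> str:
    """Reverse complement of a DNA or IUPAC consensus string."""
    return "".join(COMPLEMENT[c] for c in reversed(motif))


def strands_matched(motif: str, aligned_seq: str) -> set:
    """Return {'+'} and/or {'-'} depending on whether the motif or its
    reverse complement matches the ungapped sequence as written."""
    seq = aligned_seq.replace("-", "")
    strands = set()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        regex = "".join(IUPAC_REGEX[c] for c in pattern)
        if re.search(regex, seq):
            strands.add(strand)
    return strands


ABF1 = "CGTNNNNNRYGAY"
SCER = "---TAAAATAGCATATCGTTAAAAACGACAAACGCGT"
SBAY = "---TGATCTGCCCTTCCGTATATAATGACAAACGCGT"
print(reverse_complement(ABF1))
print("Scer:", sorted(strands_matched(ABF1, SCER)))
print("Sbay:", sorted(strands_matched(ABF1, SBAY)))
```

On these fragments the S. cerevisiae site satisfies the consensus on both strands, while the S. bayanus site satisfies it on one strand only, which is what a near-palindromic motif allows.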
92The birth-death process of regulatory motifs
Motif birth
Abf1
Abf1
Motif movement
Hap4
Hap4
Motif death
Msn2
93Motif birth governed by random process ?
f = footprint, i = information
More bits: slower movement
Wider footprint: faster movement
AANNCG GTNNTG
AC CT
2X 1X
4X 1X
f4
i2
GNNNT
GT
f2
i4
94Motif birth governed by random process !
Observed motif birth rate
Motif information content
Motif birth can be modeled as a largely random
process
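One way to see why information content would control a random birth rate: under a uniform, independent base model (an assumption made here for illustration, not the model from the lecture), the chance that a random site matches a consensus equals 2 to the power of minus its information content in bits.

```python
from math import log2

# Number of bases allowed by each IUPAC code used in the consensus.
ALLOWED = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def information_bits(motif: str) -> float:
    """Information content in bits, assuming uniform base frequencies."""
    return sum(log2(4 / ALLOWED[c]) for c in motif)


def chance_match_probability(motif: str) -> float:
    """Probability that a random site (uniform, independent bases)
    matches the consensus at a given position."""
    p = 1.0
    for c in motif:
        p *= ALLOWED[c] / 4
    return p


motif = "CGTNNNNNRYGAY"
bits = information_bits(motif)
print(f"{motif}: {bits:.0f} bits, chance match = {chance_match_probability(motif):.2e}")
```

Each extra bit halves the number of random sequences that already match, so higher-information motifs should arise by chance less often.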
95Motif aging
Age 0
Number of instances
Information content
Red: all regions. Green: bound regions.
What is responsible for shift in distribution ?
96 3. Death rates governed by selective landscape
Green: death rate in bound regions. Red: death
rate in unbound regions.
Motif death rates drastically different in
functional / non-functional regions
97Intensity of selection determines motif death rate
Rate of motif death
Bound Cooperative
Bound
Not bound Cooperative
Neither
Each level of selective pressure shows distinct
death rate
98Birth and death events for chromosome arm (16R)
Yap 1
[Figure: events along the chromosomal position on chromosome 16 (right arm), shown against strength of selective pressure]
Green: motif birth. Red: motif death. Blue: motif aging.
Birth-death process governed by selection
landscape
99Motif evolution governed by three processes
- Motif birth
  - Short motifs can appear by neutral evolution
  - Rate of motif birth depends on information content
- Motif aging
  - Motif abundance shifts towards bound regions
  - Distribution changes gradually over time
- Motif death
  - Governed by functional selection landscape
  - Predicted by partner motifs, factor binding
Modeling motif evolution can lead to better
discovery
100Network evolution by duplication
Aviva Presser
101Networks are dynamic in time and in evolution
Global motif co-occurrence map --> Reveal
co-operating regulators
Initial network of master regulators --> Reveal
hubs, cascades, network motifs
How do networks change in the face of gene
duplication?
102Evidence of Whole Genome Duplication
103Whole Genome Duplications in diverse lineages
Yeast duplication: Kellis et al., Nature, Apr 8,
2004
Vertebrate duplication in fish: Jaillon et al.,
Nature, Oct 21, 2004
Two rounds of WGD in human! Dehal et al., PLoS
Biology, Oct 2005
104The return to haploidy
[Figure: number of genes (5,000 marked) over time, with the time axis marked at 100 Myrs and Today]
Advantage of WGD may lie in 500 gained genes
105Functions of duplicated genes
- As a group
  - Biased towards environment adaptation
  - Sugar metabolism, fermentation, regulation
- Individual pairs
  - Are new gene functions gained by WGD?
  - How are new gene functions emerging?
Rate 1
WGD
S. cerevisiae copy 1
Rate 2
S. cerevisiae copy 2
K. waltii
Evidence of accelerated protein divergence ?
106Scenarios for rapid gene evolution
One copy faster
Scer - copy1
Scer - copy2
Kwal
Ohno, 1970
Both copies faster
Scer - copy1
Scer - copy2
Force, 1999
Kwal
20% of duplicated genes show acceleration
95% of cases: only one copy faster
107Emerging gene functions after duplication
- Origin of replication --> silencing
4-fold acceleration
Scer - Sir3 (silencing)
Scer - Orc1 (origin of replication)
Kwal - Orc1
- Translation initiation --> anti-viral defense
3-fold acceleration
Scer - Ski7 (anti-viral defense)
Scer - Hbs1 (translation initiation)
Kwal - Hbs1
Asymmetric divergence --> recognize ancestral /
derived
108Asymmetric divergence --> distinct functional
properties
               Ancestral function  Derived function
Gene deletion  Lethal (20%)        Never lethal
Gain new function and lose ancestral function
109Asymmetric divergence --> distinct functional
properties
               Ancestral function  Derived function
Gene deletion  Lethal (20%)        Never lethal
Expression     Abundant            Specific (stress, starvation)
Localization   General             Specific (mitochondrion, spores)
Gain new function and lose ancestral function
110Asymmetry also found in network connectivity
Duplicated gene
Interaction partners
Asymmetric Divergence
Duplication
Interaction loss more likely than gain. One
protein maintains ancestral function?
Study network in context of duplication
111Network evolution by duplication
Modern Network
Pre-WGD
Lost Duplicate
Network motif
Duplication
Loss
Scenario 1
Ancestral network motifs
Modern network motif
Duplication
Gain
Scenario 2
112Mechanisms of network motif emergence
Lost Interactions
Kept Interactions
Gained Interactions
- Pre-duplication probabilities
  - p: probability of interaction
  - q: probability of self-interaction
- Post-duplication probabilities
  - Pplus: probability of adding an interaction
  - Pminus: probability of eliminating an interaction
113Emergence of post-duplication network motifs
All have either 4 or 0 edges across the
pairs (4-across or 0-across)
114Modeling network evolution
- Parameters
  - Fraction duplicated vs. spontaneous generation
  - Fraction of edges deleted
  - Number of edges for spontaneous genes
- 90% of timesteps: duplication
  - Pick a gene at random
  - Duplicate with all its connections
  - Delete on average 35% of new connections
- 10% of timesteps: creation
  - Create a new gene
  - Randomly connect it to the existing network with
    0-20 connections
Study emergence of network motifs
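The growth procedure above can be sketched as a short simulation. This is an illustrative reading of the slide, not the lecture's implementation: it assumes the figures are percentages (90%, 35%, 10%), that each inherited edge is deleted independently with probability 0.35, and that a newly created gene gets a uniformly random number of partners between 0 and 20. The function and parameter names are made up for the example.

```python
import random


def grow_network(seed_edges, n_steps, p_duplicate=0.9, p_delete=0.35,
                 max_new_edges=20, rng=None):
    """Grow an undirected interaction network by duplication and creation.

    seed_edges: iterable of (gene, gene) pairs with integer gene ids.
    Returns a dict mapping each gene to the set of its partners.
    """
    rng = rng or random.Random(0)
    adj = {}
    for a, b in seed_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for _ in range(n_steps):
        genes = sorted(adj)
        new_gene = max(genes) + 1
        if rng.random() < p_duplicate:
            # Duplication: copy a random gene's connections, then drop
            # each copied edge independently with probability p_delete.
            parent = rng.choice(genes)
            partners = {g for g in sorted(adj[parent])
                        if rng.random() >= p_delete}
        else:
            # Creation: a new gene with 0..max_new_edges random partners.
            k = rng.randint(0, min(max_new_edges, len(genes)))
            partners = set(rng.sample(genes, k))
        adj[new_gene] = set()
        for g in partners:
            adj[new_gene].add(g)
            adj[g].add(new_gene)
    return adj


network = grow_network([(0, 1), (1, 2)], n_steps=200, rng=random.Random(1))
degrees = sorted((len(partners) for partners in network.values()), reverse=True)
print("genes:", len(network), "largest degrees:", degrees[:5])
```

Counting network motifs or plotting the degree distribution of the returned dictionary would be the next step for studying motif emergence.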
115Abundance of network motifs predicted by
duplication
116Lessons Learned
1. Asymmetry in network connectivity
Duplication
Divergence
Interaction loss more likely than gain. One
protein maintains ancestral network function?
2. High frequency of ohnolog pair interaction
- Abundance of ancestral self-interactions
- Gain of ohnolog interaction by proximity due to
common interactions
- Selection for ohnologs with interaction: both
kept since neither can mutate. A faulty A would
disrupt polymerization of A-A-A-A, reducing
fitness.
Duplication
- Ancestral self-interaction or
- gain of ohnolog interaction
117 3. Abundance of global properties and network hubs
Duplication asymmetric divergence model
Traditional preferential attachment model
Model matches local and global network properties
118Network evolution: Conclusions
- Asymmetric evolution of network connectivity
  - One pair preserves connections
  - One pair keeps subset (rarely gains)
- WGD preserves network connectivity
  - Duplicates highly interconnected
- Simple model of network evolution
  - Estimate rates of interaction gain and loss
  - Very good fit to simulated and actual yeast
    network
- Infer connectivity patterns of ancestral network
  - Ancestral network shows increased number of
    self-interactions
  - Self-interacting proteins favored in duplicated
    network?
119Comparative genomics and regulatory networks
- Regulatory motif discovery
  - Genome-wide conservation score
  - Validated using expression, positional bias,
    multiplicity
  - Pre- and post-transcriptional regulation
- microRNA regulation
  - Motif-centric discovery of new microRNA genes
  - Many new microRNAs, experimentally validated
  - Role of microRNA regulation: 20% of the genome
- Regulatory motif evolution
  - Underlying birth-death process, random birth
    process
  - Aging shifts distribution, death governed by
    selection
  - Ability to model motifs for discovery in many
    species
- Protein network evolution
  - Simple duplication-based model
  - Motif abundance, degree distribution can be
    predicted
  - Asymmetric divergence, cross-interactions
120Acknowledgements
- Human motifs
  - Xiaohui Xie
  - Eric Lander
  - Vamsi Mootha
  - Kerstin Lindblad-Toh
  - Jun Lu
  - E.J. Kulbokas
  - Todd R. Golub
- Fungal comparisons
  - Bruce Birren
  - Christina Cuomo
  - James Galagan
  - Li-Jun Ma
  - Joshua Grochow
- Gene identification
  - Mike Lin
  - Michael Brent
- Network evolution
  - Aviva Presser
  - Michael Elovitz
  - Roy Kishony
- Motif evolution
  - Erez Lieberman
  - Martin Nowak
- Genome-wide phylogeny
  - Matt Rasmussen
  - Marcia Lara
121Who's actually doing the work
Mike Lin: Gene identification
Erez Lieberman: Motif evolution
Xiaohui Xie: Motif finding
Aviva Presser: Network evolution
Matt Rasmussen: Whole-genome phylogeny
Josh Grochow: Protein motifs
Pouya Kheradpour: Human motifs
Alex Stark: Fly regulatory networks