Whole-genome comparative genomics

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution. Whole-genome comparative genomics: Analyzing the human genome. Lecture 21, Dec 6, 2005.

Slides: 122
Provided by: Mano74
1
Whole-genome comparative genomics
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
  • Analyzing the human genome

Lecture 21
Dec 6, 2005
2
Challenges in Computational Biology
  • Genome assembly
  • Gene finding
  • Regulatory motif discovery
  • Sequence alignment
  • Comparative genomics
  • Database lookup
  • Evolutionary theory
  • RNA folding
  • Gene expression analysis
  • Cluster discovery
  • Gibbs sampling
  • Protein network analysis
  • Regulatory network inference
  • Emerging network properties
Figure labels: DNA (TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT), RNA transcript
3
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG
AAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATAT
CTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGC
CTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCAT
TGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCC
GACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTG
AAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTAC
AATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAAT
GCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGAT
GATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACT
TTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATG
TCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGT
CAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGT
ACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAA
AGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCG
GATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT
TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGC
TTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATA
TCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTG
TGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAG
CAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTA
CGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAA
ATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACT
GTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGA
AGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATG
CTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGT
CTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAA
CTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAA
TAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATA
AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAA
ACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTT
TTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGT
GGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGC
AAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCT
TCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATT
TCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTT
TCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATT
TTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATT
TGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA
AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAA
TGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCAT
CTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAA
AAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTA
TTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGG
ATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGG
GTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAA
TATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATT
GGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCT
TTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCC
TATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTA
TTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATT
GCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTT
ACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCAT
TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTT
ATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTT
TTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAA
AATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACA
TGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACT
ACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATT
ATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGAT
AATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCT
TCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATT
TCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTG
TATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATA
CATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAA
GAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAA
TGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAG
TTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCA
ATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTT
AATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCT
TATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
5
Comparing genomes reveals functional elements
  • Protein-coding genes
  • Ultra-conserved elements
  • Short regulatory motifs

6
Extensive sequencing of mammalian tree
  • Black: complete (8x); red: 2x sequencing
  • Species shown: elephant, armadillo, rabbit, bat, tenrec, shrew, cat, hedgehog
  • Average extra branch length: 0.2 subs/site
7
Hidden Markov Models for gene finding
8
Modeling biological sequences
Sequence regions: Intergenic, CpG island, Promoter, First exon, Intron, Other exon, Intron
GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAG
GATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTA
ACGGTACTCACCGGGTTACAGGATTGTTACAGG
  • Ability to emit DNA sequences of a certain type
  • Not exact alignment to previously known gene
  • Preserving properties of type, not identical
    sequence
  • Ability to recognize DNA sequences of a certain
    type (state)
  • What (hidden) state is most likely to have
    generated observations
  • Find set of states and transitions that generated
    a long sequence
  • Ability to learn distinguishing characteristics
    of each state
  • Training our generative models on large datasets
  • Learn to classify unlabelled data
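The recognition question on this slide (which hidden state most likely generated each observation) is commonly answered with the Viterbi algorithm. The following is a minimal sketch in Python under stated assumptions: a toy two-state model whose state names (`AT_rich`, `GC_rich`) and probabilities are illustrative only and do not come from the lecture.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for obs (log-space Viterbi)."""
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            prev, score = max(
                ((r, best[i - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy parameters (assumptions for illustration only)
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

path = viterbi("AAAATTTTGGGGCCCC", STATES, START, TRANS, EMIT)
```

Real gene finders use many more states (exons, introns, splice sites) and trained parameters; the dynamic-programming recurrence is the same.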

9
HMM-based Gene Finding
  • GENSCAN (Burge 1997)
  • FGENESH (Solovyev 1997)
  • HMMgene (Krogh 1997)
  • GENIE (Kulp 1996)
  • GENMARK (Borodovsky & McIninch 1993)
  • VEIL (Henderson, Salzberg, Fasman 1997)
  • TWINSCAN (Brent 2001)
  • NSCAN (Brent 2005)

10
VEIL: Viterbi Exon-Intron Locator
  • Contains 9 hidden states or features
  • Each state is a complex internal Markovian model
    of the feature
  • Features
  • Exons, introns, intergenic regions, splice sites,
    etc.
  • Enter: start codon or intron (3' splice site)
  • Exit: 5' splice site or three stop codons (TAA,
    TAG, TGA)

VEIL Architecture
11
Genie
  • Uses a generalized HMM (GHMM)
  • Edges in model are complete HMMs
  • States can be any arbitrary program
  • States are actually neural networks specially
    designed for signal finding
  • J5': 5' UTR
  • EI: Initial Exon
  • E: Exon, Internal Exon
  • I: Intron
  • EF: Final Exon
  • ES: Single Exon
  • J3': 3' UTR

12
Genscan Overview
  • Developed by Chris Burge (Burge 1997) 
  • Characteristics
  • Designed to predict complete gene structures
  • Introns and exons, Promoter sites,
    Polyadenylation signals
  • Incorporates
  • Descriptions of transcriptional, translational
    and splicing signal
  • Length distributions (Explicit State Duration
    HMMs)
  • Compositional features of exons, introns,
    intergenic, CG regions
  • Larger predictive scope
  • Deal w/ partial and complete genes
  • Multiple genes separated by intergenic DNA in a
    seq
  • Consistent sets of genes on either/both DNA
    strands
  • Based on a general probabilistic model of genomic
    sequences composition and gene structure

13
Genscan Architecture
  • It is based on Generalized HMM (GHMM)
  • Model both strands at once
  • Other models: predict on one strand first, then
    on the other strand
  • Avoids prediction of overlapping genes on the two
    strands (rare)
  • Each state may output a string of symbols
    (according to some probability distribution).
  • Explicit intron/exon length modeling
  • Special sensors for Cap-site and TATA-box
  • Advanced splice site sensors

Fig. 3, Burge and Karlin 1997
14
GenScan States
  • N - intergenic region
  • P - promoter
  • F - 5' untranslated region
  • Esngl - single exon (intronless) (translation
    start → stop codon)
  • Einit - initial exon (translation start → donor
    splice site)
  • Ek - phase k internal exon (acceptor splice site
    → donor splice site)
  • Eterm - terminal exon (acceptor splice site →
    stop codon)
  • Ik - phase k intron: 0 - between codons; 1 -
    after the first base of a codon; 2 - after the
    second base of a codon
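The intron phase definition above can be made concrete with a short sketch. Assuming we know the coding length of each exon in transcript order (the function name and input format are hypothetical), the phase of each intron is the number of coding bases before it, modulo 3:

```python
def intron_phases(exon_coding_lengths):
    """Phase of the intron after each exon except the last.

    0: intron falls between codons
    1: intron falls after the first base of a codon
    2: intron falls after the second base of a codon
    """
    phases = []
    coding_bases = 0
    for length in exon_coding_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

# Hypothetical gene with three coding exons of 100, 50 and 33 bases
example_phases = intron_phases([100, 50, 33])
```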

15
Classification-based Gene finding
16
Gene identification
TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTA
AATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
M T K S H S E E V I V P E F K
  • Intuition
  • Genes are translated in units of 3 nucleotides
    (codons)
  • Every DNA strand can be translated in 3 reading
    frames
  • Insertions and deletions may cause frame-shifts
  • Selective pressure on the amino-acid translation
  • Silent substitutions tolerated
  • Codons for similar amino-acids frequently
    exchanged
  • Method
  • Observe patterns of nucleotide change in genes /
    intergenic regions
  • Develop signatures / tests to discriminate
    between the two
  • Validate tests with known genes / intergenic
    regions
  • Use them to revisit the yeast and human genomes
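The intuition that every strand can be read in three frames can be sketched as follows. This is a minimal Python illustration using the standard genetic code; the function names are assumptions, and the example sequence is the portion of the slide's sequence starting at its ATG.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate dna in the given frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))

def three_frames(dna):
    """Translations of one strand in all three reading frames."""
    return [translate(dna, frame) for frame in range(3)]

orf = "ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA"
protein = translate(orf)
```

A single-base insertion or deletion shifts every downstream codon into another frame, which is why frame-preserving gaps are a useful signature of coding regions.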

17
Gene identification
Study known genes
Derive conservation rules
Discover new genes
18
Overall conservation vs. signatures of divergence
  • Not a gene
  • Region of perfect/near-perfect non-coding
    conservation
  • Scores very well with HMM approaches, ExoniPhy,
    N-Scan, which measure general levels of local
    nucleotide conservation

human TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
rat   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
dog   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
  • Real gene
  • Mutations do occur, consistent with constraints
    under which genes evolve
  • Insertions preserve reading frame. Mutations
    preserve amino-acid function
  • → Quantify and capture these constraints
    computationally

human TGC---CCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCACGTGACGTGGCTG---TGGCAGCGGCAGCTAAAAAAGAGCTTAAGTAT
rat   TGCCAGCCACGCGACGTGGCCG---TGGCAGCAGCCGCTAAAAAGGAACTTAAGTAC
dog   TGCCAGCCACGCGAGGTGGCGG---------CTGCGGCCAAGAAAGAGCTCAAGTAC

19
Signature 1: Reading frame conservation
20
Signature 2: Distinct patterns of codon substitution
Genes
Codon observed in species 1
Codon observed in species 2
  • Codon substitution patterns specific to genes
  • Genetic code dictates substitution patterns
  • Amino acid properties dictate substitution
    patterns

21
Evaluating reading frame conservation (RFC)

Scer     CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCAGAGAAACAGCTCTATGAGAAATCAGCTGATG
Scer_f1  123123123123123123-12312312312312312312312-----3123123123123123123123123123123123

Spar     TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACAGAGAAACAGCTTC-TGAGAAGTCAGCCGGTG
Spar_f1  12312312-312312312312312312312312312312312312312312312312312312-31231231231231231  RFC 43%
Spar_f2  23123123-123123123123123123123123123123123123123123123123123123-12312312312312312  RFC 34%
Spar_f3  31231231-231231231231231231231231231231231231231231231231231231-23123123123123123  RFC 23%
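The frame labelling above can be sketched in a few lines of Python. This is a simplified illustration, not necessarily the exact scoring used in the lecture: each ungapped base gets a codon position (its running count modulo 3), and for every column where both species have a base we record which of the three frame offsets makes the two codon positions agree. The function names are assumptions; the sequences are the Scer/Spar alignment from the slide.

```python
def frame_agreement(ref, other, gap="-"):
    """Count aligned columns whose codon positions agree, per frame offset 0, 1, 2."""
    counts = [0, 0, 0]
    ref_n = other_n = 0  # ungapped bases seen so far in each sequence
    for a, b in zip(ref, other):
        if a != gap and b != gap:
            # Offset that puts `other` in the same codon position as `ref` here
            counts[(ref_n - other_n) % 3] += 1
        if a != gap:
            ref_n += 1
        if b != gap:
            other_n += 1
    return counts

def rfc_percentages(ref, other):
    """Percentage of aligned columns in frame, for each of the three offsets."""
    counts = frame_agreement(ref, other)
    total = sum(counts)
    return [100.0 * c / total for c in counts]

SCER = ("CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCA"
        "GAGAAACAGCTCTATGAGAAATCAGCTGATG")
SPAR = ("TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACA"
        "GAGAAACAGCTTC-TGAGAAGTCAGCCGGTG")

counts = frame_agreement(SCER, SPAR)
percentages = rfc_percentages(SCER, SPAR)
```

On this alignment the three offsets account for 31, 25 and 17 of the 73 aligned columns, which is in line with the 43 / 34 / 23 split shown on the slide. The best offset is the one reported as the RFC score.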
22
Evaluating the codon substitution score (CSM)
  • Filling in the CSM

Human codon (rows) vs. mouse codon (columns), values ×10^-5

         AAA/K  AAG/K  AAC/N  AAT/N  AGA/R  AGG/R  ...  TAA/X
AAA/K     1552    608     12      8     74     26           0
AAG/K      423   2531     11      9     23     73           0
AAC/N        8     13   1368    331      1      1           0
AAT/N        8     12    444   1007      2      1           0
AGA/R       44     22      1      1    664    178           0
AGG/R       15     72      1      1    148    594           0

pX/Y = P(human codon X aligns to mouse codon Y in genes)
qX/Y = P(human codon X aligns to mouse codon Y outside genes)
  • Scoring an aligned region

human CTGTTTTTCCCCTTTTGTAGGAAGTCAC
mouse CTGTTTTTCCTCTTTTGTAGTAAGTCAC

$\text{Coding Score} = \dfrac{p_{CCC/CTC} \cdot p_{AGG/AGT}}{q_{CCC/CTC} \cdot q_{AGG/AGT}}$
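The scoring step can be sketched as a likelihood ratio over aligned codon pairs. The Python below is a minimal illustration under stated assumptions: it sums log ratios (a common way to compute such a product without underflow, not confirmed by the slide), it scores only codon pairs that differ, and the `p` and `q` values are made-up placeholders rather than entries from the real matrices.

```python
import math

def coding_score(human, mouse, p, q, frame=0):
    """Sum of log(p/q) over aligned codon pairs that differ between species."""
    score = 0.0
    for i in range(frame, min(len(human), len(mouse)) - 2, 3):
        pair = (human[i:i + 3], mouse[i:i + 3])
        if pair[0] == pair[1]:
            continue  # conserved codon: skipped in this off-diagonal sketch
        score += math.log(p[pair] / q[pair])
    return score

human = "CTGTTTTTCCCCTTTTGTAGGAAGTCAC"
mouse = "CTGTTTTTCCTCTTTTGTAGTAAGTCAC"

# Hypothetical probabilities (assumptions for illustration only)
p = {("CCC", "CTC"): 4e-4, ("AGG", "AGT"): 2e-4}  # inside genes
q = {("CCC", "CTC"): 1e-4, ("AGG", "AGT"): 1e-4}  # outside genes

score = coding_score(human, mouse, p, q)
```

A positive score means the observed substitutions are more typical of coding than of non-coding sequence in the chosen frame; in practice all three frames would be scored.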
23
Multiple levels of selection
Genes
Codon observed in species 1
Codon observed in species 2
  • Multi-level information
  • All positions → overall conservation
  • Exclude conserved triplets → amino-acid sequence
  • Exclude conserved amino-acids → amino-acid
    properties

24
Effect of using only off-diagonal CSM positions
CSM coding score for human/mouse (x-axis) and
human/dog (y-axis) in CFTR region
Using full CSM matrix: false positives
Using only off-diagonal positions: no false positives
Is it conserved like a coding gene?
Has it diverged like a coding gene?
25
Putting it all together: ExoClass gene finder
  • Train Support Vector Machine (SVM) classifier
  • Reading Frame Conservation (RFC) score
  • Codon Substitution Matrix (CSM) coding score
  • Splice signal conservation, ESEs, ESIs
  • Exon length, conservation boundaries
  • Apply it systematically to all candidate
    intervals
  • Use full gene model constraints for
    post-processing

26
Results in yeast
                          Accept   Reject
4000 named genes          99.9%    0.1%
300 intergenic regions    1%       99%
2000 hypothetical ORFs    1500     500
High sensitivity and specificity
27
Results in human ENCODE regions (Human/Mouse)
           Nucl Sn  Nucl Sp  Exon Sn  Exon Sp  Missed  Wrong  Wrong w/evidence
GENSCAN       85       62       67       49      17      39         17
TWINSCAN      77       88       66       79      26      11         25
SGP2          84       84       72       69      18      20         24
Exoniphy      73       88       57       67      26      10         53
ExoClass      86       87       73       75      17      14         37
  • High nucleotide sensitivity and specificity
  • Increases with additional species (with some
    caveats)
  • Missed exons due to
  • Sequencing / assembly / alignment problems
  • Rapidly evolving genes: immunity and olfactory
    families
  • Wrong exons due to
  • Novel exons
  • Existing evidence: human / non-human spliced
    mRNAs
  • New evidence: validated using specific RT-PCR
    (with MGC)

28
Examples in the human
  • Example 1 New gene
  • Example 2 Deleted gene

29
Example 3 Changed exons
30
Initial results for the whole human genome
Human
Dog
Mouse
Rat
1065 fully rejected
454 novel (2591 exons)
7,717 refined
9862 fully confirmed
1,919 not aligned
  • Fully rejected genes typically have only weak
    evidence
  • New exons often supported by existing
    experimental evidence
  • RT-PCR validation of 90 fully novel genes: 50
    confirmed

31
Experimental validation
  • Select novel predictions with highest specificity
  • Unique in the genome
  • No pseudogenes
  • Absolutely no previous experimental evidence
  • Results
  • June 2005: 454 genes → 90 entirely novel
  • RT-PCR validation for specific exon splicing
  • 50 fully validated using pooled tissues
  • New validation set
  • Top of the list: 354 genes, 1162 exons
  • and many more (gene families, lower scores)

32
Gene Identification Summary
  • Exon-centric approach
  • Identify discriminating variables
  • Observed distinct patterns of nucleotide change
  • Systematically identify all exons in the genome
  • Use gene structure constraints to link them
  • Application
  • High sensitivity and specificity (90)
  • More powerful than experimental methods
  • Largest reannotation of the yeast genome
  • Reannotation of the human gene set

33
Regulatory Motif Discovery
34
Regulatory Motif Discovery
Figure labels: GAL1; Gal4 (CGG, CCG); Gal4 (CGG, CCG); Mig1 (CCCCW); ATGACTAAATCTCATTCAGAAGAAGTGA
  • Gene regulation
  • Genes are turned on / off in response to changing
    environments
  • No direct addressing: subroutines (genes)
    contain sequence tags (motifs)
  • Specialized proteins (transcription factors)
    recognize these tags
  • What makes motif discovery hard?
  • Motifs are short (6-8 bp), sometimes degenerate
  • Can contain any set of nucleotides (no ATG or
    other rules)
  • Act at variable distances upstream (or
    downstream) of target gene

35
Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
36
Known motifs are preferentially conserved
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG

Gabpa
Is this enough to discover motifs? No.
37
Known motifs are frequently conserved
Human
  • Across the human promoter regions, the Erra
    motif
  • appears 434 times
  • is conserved 162 times
  • Compare to random control motifs
  • Conservation rate of control motifs: 6.8%
  • Erra enrichment: 5.4-fold
  • Erra p-value < 10^-50 (25 standard deviations
    under binomial)
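The numbers on this slide can be checked with a short calculation. The sketch below assumes the "standard deviations under binomial" figure is the usual normal-approximation z-score for k successes in n trials at basal rate r; the function name is hypothetical.

```python
import math

def conservation_stats(k, n, r):
    """Conservation rate, fold enrichment over the basal rate, and binomial z-score."""
    rate = k / n
    enrichment = rate / r
    # Under the null, the conserved count has mean n*r and variance n*r*(1-r)
    z = (k - n * r) / math.sqrt(n * r * (1 - r))
    return rate, enrichment, z

# Erra motif: 162 conserved out of 434 occurrences; control rate 6.8%
rate, enrichment, z = conservation_stats(162, 434, 0.068)
```

With these inputs the conservation rate is about 37%, the enrichment about 5.5-fold, and the z-score about 25, matching the slide's figures up to rounding.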

Motif Conservation Score (MCS)
38
MCS distribution of all 6-mers shows excess conservation
Axes: motif density vs. Motif Conservation Score (MCS)
  • High scoring patterns include known motifs
  • Excess specific to promoters and 3'-UTRs (not
    introns)
  • For MCS > 6, estimate 97% specificity

Use MCS to discover new motifs
39
Hill-climbing in sequence space
  • Seed selection
  • Three mini-motif conservation criteria (CC1, CC2,
    CC3)
  • Motif extension
  • Non-random conservation of neighbors
  • Motif collapsing
  • Merge neighbors using hierarchical clustering,
    avg-max-linkage
  • Re-scoring complex motifs
  • Motif conservation score for full motifs (MCS)

40
Test 1 Intergenic conservation
Conserved count
Total count
41
Test 1 Selecting mini-motifs
  • Estimate basal rate of conservation
  • Expected conservation rate at the evolutionary
    distances observed
  • Average conservation rate of non-outlier
    mini-motifs
  • Score conservation of mini-motif
  • k: conserved motif occurrences
  • n: total motif occurrences
  • r: basal conservation rate
  • Evaluate binomial probability of observing k
    successes out of n trials
  • Assign z-score to each mini-motif
  • Bulk of distribution is symmetric
  • Estimate specificity as (R-L)/R, where R and L
    are the right- and left-tail counts beyond the
    cutoff
  • Select cutoff: 5.0 sigma
  • 1190 mini-motifs, 97.5% non-random
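The specificity estimate rests on the symmetry argument: if the null distribution of z-scores is symmetric, the left tail beyond the cutoff estimates how many right-tail hits arise by chance. A minimal Python sketch, with a hypothetical function name and synthetic z-scores:

```python
def tail_specificity(z_scores, cutoff):
    """Estimate the fraction of right-tail hits that are non-random, (R - L) / R."""
    right = sum(1 for z in z_scores if z >= cutoff)   # R: candidates kept
    left = sum(1 for z in z_scores if z <= -cutoff)   # L: chance-level estimate
    if right == 0:
        return None  # nothing passes the cutoff
    return (right - left) / right

# Synthetic example (not lecture data): 40 strong positives, 1 strong negative
synthetic = [6.0] * 40 + [-6.0] + [0.0] * 100
specificity = tail_specificity(synthetic, cutoff=5.0)
```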

Specificity
Cutoff
Right tail
Left tail
42
Test 2 Intergenic vs. Coding
Intergenic Conservation
Coding Conservation
43
Test 3 Upstream vs. Downstream
Upstream Conservation
Downstream Conservation
44
Constructing full motifs
Test 1
Test 2
Test 3
2,000 Mini-motifs
C T A C G A R R
45
Extending mini-motifs
  • Separate conserved and non-conserved instances

6  C T A C G A  Causal set
6  C T x x G A  Random set
46
Collapsing similar motifs
  • Motif similarity sequence and genomic positions
  • Motifs share similar sequences, count bits in
    common
  • Motifs appear conserved in similar sets of regions

Regions with motif 1
Regions with motif 2
Regions containing both motifs
47
Systematically test candidate patterns
G T C R Y S (gap) A G T R W
  • Enumerate
  • Length between 6 and 15 nt, allow central gap
  • 11 letter alphabet (A C G T, 2-fold codes, N)
  • Score
  • Compute binomial score (conserved vs. total)
  • Select MCS > 6.0 → specificity 97%
  • Cluster
  • Sequence similarity
  • Overlapping occurrences
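The 11-letter alphabet above (A, C, G, T, the six two-fold IUPAC codes, and N) maps naturally onto regular-expression character classes. The following Python sketch, with hypothetical function names, finds all occurrences of a degenerate consensus on one strand, including overlapping ones:

```python
import re

# A C G T, six two-fold degenerate codes, and N: 11 letters in total
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "N": "[ACGT]",
}

def motif_regex(consensus):
    """Compile a consensus into a regex; the lookahead permits overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=(" + body + "))")

def find_sites(consensus, sequence):
    """Start positions (0-based) of every match of consensus in sequence."""
    return [m.start() for m in motif_regex(consensus).finditer(sequence.upper())]

sites = find_sites("TGACCTY", "AATGACCTTGGTGACCTC")
```

Counting such sites in aligned promoters, and how many of them are conserved across species, yields the conserved and total counts that feed the binomial score.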

All potential motifs
Evaluate MCS
Cluster similar motifs
Are these real ?
48
Functions of discovered motifs
49
Evidence of motif function
Promoter (upstream of ATG): 174 motifs
3'-UTR (downstream of Stop): 106 motifs
  • Promoter motifs
  • Comparison to known motifs
  • Distance from TSS
  • Expression enrichment

50
Promoter motifs match known TF binding sites
  • Compare discovered motifs to TRANSFAC database of
    125 known motifs

55 of TRANSFAC motifs match discovered motifs
51
(2) Promoter motifs show preferred distance to TSS
Figure: distance from TSS for each of 174 discovered motifs (motif instances in human; conserved motif sites in all four species). Motif 4: -81; Motif 8: -63.
Discovered motifs occur preferentially within 200 bp of the transcription start site
Individual motifs show strong peaks, regardless of conservation
32 of discovered motifs show strong positional bias
52
(3) Promoter motifs enriched in specific tissues
70 of motifs show significant enrichment in at
least one tissue
53
Summary for promoter motifs
Rank Discovered Motif Known TF motif Tissue Enrichment Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1() Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY Yes Yes
22 TGCGCAnK Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
  • 174 promoter motifs
  • 70 match known TF motifs
  • 115 expression enrichment
  • 60 show positional bias
  • → 75% have evidence
  • Control sequences
  • < 2 match known TF motifs
  • < 5 expression enrichment
  • < 3 show positional bias
  • → < 7 false positives

Most discovered motifs are likely to be
functional
54
What about 3'-UTR motifs?
TSS
3'-UTR
Stop
ATG
  • Sequence properties of 3'-UTR motifs
  • Regulatory roles of 3'-UTR motifs

55
Directionality of 3'-UTR motifs
3'-UTR motifs
Promoter motifs
Stop
motif
motif
ATG
3'-UTR motifs likely to act post-transcriptionally
56
What are microRNAs (miRNAs)?
  • Endogenous small non-coding RNA
  • 22 nt in length
  • Located in genomic loci that can produce
    fold-back structures
  • Often conserved (but conservation may not be
    required)

57
miRNA and siRNA
miRNA gene / miRNA host gene
Double-stranded RNA formation (5' and 3' ends; P and OH groups)
RISC complex
58
miRNA and siRNA as Negative Regulators of Gene Expression
mRNA
Near-perfect match: degradation of target
miRNA, siRNA
Partial match: inhibition of translation, degradation of target
Chromosomal silencing
Off-target effect
lin-14 mRNA
lin-4 RNA, 22 nt
59
Properties of microRNA genes (miRNAs)
Properties similar to the motifs we have
discovered
60
3'-UTR motif properties
(2) Length distribution
  • Enriched in motifs of length 8

Have we in fact discovered targets of microRNA
genes?
61
Compare 8-mer sequence to known miRNAs
  • Compare 8-mer motifs against all 207 known miRNAs
  • 72 discovered 8-mers match 44 of known miRNA
    genes
  • (72 control sequences only match 5)
  • 8-mer motifs are likely miRNA targets

62
Novel miRNA genes show deep evolutionary
conservation
  • Using 8-mers to discover novel miRNA genes
  • Conserved much further than mammalian lineage

63
Can we use 8-mers to discover miRNA genes ?
8-mer motif
miRNA complement
TTGCATAT
ATATGCAA
258 stem loops discovered
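The 8-mer motif and its miRNA complement on this slide are reverse complements of each other. A minimal Python sketch of that relationship (the function names are assumptions, and sequences are written in the DNA alphabet as on the slide):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def pairs_with(motif, candidate):
    """True if candidate contains the reverse complement of motif."""
    return reverse_complement(motif).upper() in candidate.upper()

complement_8mer = reverse_complement("TTGCATAT")
```

Scanning conserved stem-loop candidates for the reverse complement of each discovered 8-mer is one way to link motifs to putative miRNA genes, as the slide suggests.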
64
Properties of discovered miRNA genes
ATATGCAA
8-mer motif
Discovered miRNA gene
  • 258 candidate miRNA genes discovered
  • 114 correspond to known miRNA genes (of 222)
  • 144 novel candidate miRNA genes
  • Experimentally tested 12 representative novel
    miRNAs
  • Specifically tested for expression of inferred
    22mer using RT-PCR
  • Pooled small RNAs from 10 adult human tissues
  • 6 of 12 found to be expressed with predicted
    structure in adults
  • (developmental tissues may contain additional
    miRNA genes)

Many of the discovered miRNA genes are likely to
be real
65
Two classes of miRNA genes
222 known miRNA genes
114 re-discovered
108 missed
  • No 8-mers
  • Few targets
  • Rapidly evolving
  • (5-fold higher mutation rate)

Many targets → evolutionary constraint
Co-evolution of miRNA genes and their targets?
66
How many targets do miRNA genes regulate ?
ATATGCAA
8-mer motif
Inferred 3-UTR targets
miRNA gene
  • What fraction of conserved 8-mers are true miRNA
    targets ?
  • 40 of genes contain at least one discovered
    8-mer
  • (vs. 25 for appropriate control 8-mers)

Extraordinary importance of miRNA regulation
67
3'-UTR motifs and post-transcriptional regulation
8-mer associated
46 motifs are 8-mer associated → targets of microRNAs
Other 3'-UTR motifs
60 motifs left → targets of RNA-binding proteins
Motif length
  • Several noteworthy examples
  • AATAAA: poly-A signal
  • 6 AT-rich elements: mRNA stability and
    degradation
  • 24 TGTA-rich elements: mRNA localization
    (PUF family)
  • 29 other: potential targets of RNA-binding
    proteins

May help systematic study of post-transcriptional
regulation
68
Summary Regulatory motif discovery
Systematic discovery of regulatory motifs in the
human
  • Frequently occurring, strongly conserved short
    regulatory signals

TSS
3'-UTR
Stop
ATG
  • 174 promoter motifs
  • 70 match known TF motifs
  • 115 expression enrichment
  • 60 show positional bias
  • 106 motifs in 3'-UTR
  • Strand specific
  • 8-mers are miRNA-associated
  • mRNA localization and stability

miRNA regulation
ATATGCAA
Target 20 of human 3'-UTRs
discovered 8-mers
114 known + 144 new miRNA genes
69
Towards human regulatory networks
Global motif co-occurrence map → reveal co-operating regulators
Initial network of master regulators → reveal hubs, cascades, network motifs
From sequence-based discovery to dynamic models
70
Motifs outside promoters and 3'-UTRs
71
Extract conserved regions in the human genome
Procedure for generating conserved regions
  • Extract top 5% most conserved regions in the
    human genome based on PhyloHMM score (142M bp).
  • Remove protein-coding regions.
  • Extract regions with conservation rate above 80%
    in sliding windows of 20 bp in human/mouse/rat/dog
    alignment.
  • Remove alignments not in syntenic blocks.
  • Remove alignments not in one-to-one mapping.
  • Mask repeat sequences.
  • > 70M bp of sequence (2.5% of the human genome)
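The sliding-window step in this procedure can be sketched as follows. This Python example makes two assumptions that the slide does not specify: a column counts as conserved only when all aligned species share the same base, and the conservation rate of a window is the fraction of such columns. The function name is hypothetical.

```python
def conserved_windows(alignment, window=20, min_rate=0.8):
    """Start indices of windows whose conservation rate is above min_rate.

    alignment: list of equal-length aligned sequences (gaps as '-').
    """
    # 1 if every species has the same base in this column, else 0
    conserved = [
        1 if len(set(col)) == 1 and col[0] != "-" else 0
        for col in zip(*alignment)
    ]
    starts = []
    running = sum(conserved[:window])
    for start in range(len(conserved) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new one, drop the old one
            running += conserved[start + window - 1] - conserved[start - 1]
        if running / window > min_rate:
            starts.append(start)
    return starts

# Toy four-species alignment: the last row differs in its first five columns
toy = ["A" * 25, "A" * 25, "A" * 25, "C" * 5 + "A" * 20]
toy_starts = conserved_windows(toy)
```

Overlapping qualifying windows would then be merged into regions before the synteny and repeat filters are applied.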

72
Random chance of occurrence of K-mers with
different size in conserved regions
Mean number of occurrence in 70M bp region by
chance
73
An example K-mer
  • Number of occurrences

TTCAGCACCATGGACAGC (18-mer) appears 199 times in
the conserved regions → 1300-fold enrichment.
  • Enrichment in the conserved regions
  • Moreover, in the whole human genome
  • The 18-mer occurred 446 times
  • (45% of the sites in conserved regions)
  • → an enrichment of 18-fold, compared
    with 2.5%.
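The 18-fold figure follows from comparing the share of the k-mer's sites that fall in conserved regions with the share of the genome that those regions cover. A small Python check, with a hypothetical function name:

```python
def fold_enrichment(hits_in_conserved, hits_in_genome, conserved_fraction):
    """Share of sites inside conserved regions, and fold enrichment over the baseline."""
    share = hits_in_conserved / hits_in_genome
    return share, share / conserved_fraction

# 199 of 446 genome-wide sites lie in conserved regions covering 2.5% of the genome
share, fold = fold_enrichment(199, 446, 0.025)
```

This gives a share of roughly 45% and an enrichment of roughly 18-fold, as stated on the slide.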

74
Model motifs by consensus with mismatch
Given a k-mer word w, we consider the ball
B(w, r) of radius r around w, where r is a distance
between two words (in the example below, the
number of mismatches).
Example: k = 20, w = GGCGCTGTCCGTGGTGCTGA, r = 2
GGCGCTGTCCGTGGTGCTGA
TGCGCTGTCCGTGGTGCTGA
GGAGCTGTCCGTGGTACTGA
GGCACTGGCCGTGGTGCTGA
...
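
A small sketch of the ball B(w, r), assuming the distance is the Hamming distance (number of mismatches), which is consistent with the example words on this slide. The function names are hypothetical.

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def in_ball(word, center, r):
    """True if word lies in B(center, r) under Hamming distance."""
    return len(word) == len(center) and hamming(word, center) <= r


def ball_size(k, r, alphabet=4):
    """Number of k-mers within r mismatches of any fixed k-mer."""
    return sum(comb(k, i) * (alphabet - 1) ** i for i in range(r + 1))


W = "GGCGCTGTCCGTGGTGCTGA"
EXAMPLES = [
    "GGCGCTGTCCGTGGTGCTGA",  # 0 mismatches
    "TGCGCTGTCCGTGGTGCTGA",  # 1 mismatch
    "GGAGCTGTCCGTGGTACTGA",  # 2 mismatches
    "GGCACTGGCCGTGGTGCTGA",  # 2 mismatches
]
```

For k = 20 and r = 2 the ball contains 1 + 60 + 1710 = 1771 words, which is why exhaustive enumeration with mismatches becomes expensive as r grows.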
75
Algorithms for searching overrepresented sequences
  • Word-search based method

Ver1: Build suffix tree first, and then enumerate
motifs with mismatches (doesn't allow indels, but
motif search is exhaustive; slow).
Ver2: Hash k-mers first, and extend shared k-mer
sites to screen out sites that are similar to each
other (allows indels, but with lower sensitivity;
fast).
  • Alignment based method (for long sequences > 30
    bp)
  1. Blastz human vs human sequences.
  2. Extract sequences with multiple hits.
  3. Generate consensus sequence for each multiple
    alignment.
  4. Smith-Waterman alignment on the whole genome to
    identify all hits for each consensus.
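
The first step of the hashing variant (Ver2) can be illustrated with a plain dictionary index of k-mer positions. This sketch covers only the "hash k-mers first" part; the extension of shared sites and the handling of indels are not shown, and the function names are hypothetical.

```python
from collections import defaultdict


def index_kmers(sequence, k):
    """Map every k-mer in the sequence to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


def repeated_kmers(sequence, k, min_count=2):
    """Keep only k-mers that occur at least min_count times (candidate seeds)."""
    return {
        kmer: positions
        for kmer, positions in index_kmers(sequence, k).items()
        if len(positions) >= min_count
    }
```

Sites sharing a seed k-mer would then be extended on both sides and compared, which is where tolerance to indels can come in.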

76
Discovered sequences
  • 67 consensus sequences with average size 80 bp,
    enrichment rate > 0.6, and number > 20.
  • 30 20-mers, enrichment rate > 20, and number >
    20.
  • 46 18-mers, enrichment rate > 30, number > 30.

79
A few examples
Sequence              Enrichment  Total  in_gene  in_promot  UTR
TGGAAATGCTGACACAACCT  0.789       21     7        2          0
TTCATTTACACTTAACTCAT  0.739       90     28       5          0
AAAGGCCCTTTTCAGAGCCA  0.729       46     46       0          43
AAATGCTGACAGACCCTTAA  0.700       25     13       4          0
GTCTGTCAGCATTTCCATTA  0.698       35     14       1          0
GGTTCCCATGGCAACAGCCT  0.686       22     10       3          0
AACTCCCATTAATGCTAATG  0.680       21     7        0          0
CAGCATCTGGCTCCTTGGCA  0.667       21     7        0          0
GTTGCCATGGCAACAGCAGC  0.640       32     14       5          2
TTTTATGGCTGAGTTATAAA  0.640       23     11       1          1
CTGTTGCCATGGCAACCAGG  0.630       39     22       11         1
GGTCTCCATGGCAACCAGCC  0.621       15     7        3          0
AGTGGCCTGAAAGAGTTAAT  0.615       22     12       1          0
TTATAATGGAAATGCTGACA  0.604       52     23       2          0
GTCTGTTAGCATTTCCATTA  0.595       23     10       2          0
AATAGGGGTTTATAATGGAA  0.594       27     11       2          1
TCCCATTAATGTTAATGGGA  0.591       23     10       2          0
GCTTTGGTTTCCATGGAAAC  0.583       25     7        2          0

80
Context of K-mers: conservation island
Conservation island
81
Context of K-mers: extended conservation
TGCTGTTCCATGGCAAC
Palindromic sequence
82
Context of K-mers: connected conservation
Histone 3'-UTR motif
83
Context of K-mers: connected conservation
84
Context of K-mers: connected conservation
85
Identify long sequences based on alignment
86
Interesting RNA structure of the sequence
GGAAGAAGGGAAGAAATGGCTCACTTTTCAGAGGTGCATTTACTCTTTGA
CCCACTAGGGTACTATTTAGTGTTCTAGAAGAGGTAATTTAGTAAATTGT
ACCCCAGTGGCCTGAAAAAGTTAATGCAACTCTGAAAAGTGAGCCATTCA
ATCGATTTTCCCTATTGCTTTTAAAAAAT .(((((.(((((((((((((
((((((((((((((.((((((.(((.(((.(.(((((.((((((.(((((
.((.(((.....))).....)).))))).)))))).)))))..).))).)
)).)))))).))))))))))))))))))).......))))))))....))
)))....... (-74.51)
87
Conserved instance in the intron of ADCY5
TGCTGTTCCATGGCAAC
88
Conclusion
  • Goldmines of conservation in the human genome
  • Short motifs, very frequently occurring
  • Longer motifs, many occurrences
  • Extremely long elements, near-perfect
    conservation
  • Regulatory role?
  • microRNA genes / other non-coding RNAs
  • Early development, body-plan formation
  • Repeat elements hijacked for regulatory roles?
  • Contain strong enhancer regions, scattered across
    genome
  • A lot of un-translated transcription

89
Regulatory motif evolution
  • Genes
  • Regulation

Erez Lieberman
  • Evolution

90
Evidence of motif movement by neutral evolution
Motif disappears, and reappears about 100 bp
downstream in S. mikatae
CGTNNNNNRYGAY
Scer GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA
Spar AACCCTATTAATTCGTAAGCAGTGATATAA-AT-AGAATAACCTAACTTATACAACTGTA
Smik AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA
Sbay AGCCCTATACATTCGTACCAAGTGATAAAT-ATTATTAAGACCTAACATTTAAAACAGTT

CGTNNNNNRYGAY
Scer AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT
Spar TCCCTTTTAAACCCCCTAATATTACC-ATCTAAGACCTAACTAATATCAA----GGGAAA
Smik A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA
Sbay ACCCT-----ACCCTAAAATGGGAAC-ATAAAACACAAACCCTATATAAACGTAGAGAAA
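
The disappearance and reappearance of the motif in S. mikatae can be checked with a short IUPAC scan of the alignment rows. This is a minimal sketch that searches the forward strand only; the helper names and the labels `UPSTREAM` and `DOWNSTREAM` for the two alignment blocks are assumptions made for illustration.

```python
import re

# IUPAC codes used by the motif on this slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def motif_regex(motif):
    """Compile an IUPAC motif into a regex; the lookahead finds overlapping hits."""
    return re.compile("(?=" + "".join(IUPAC[base] for base in motif) + ")")


def find_motif(aligned_row, motif):
    """Start positions of the motif in the ungapped sequence (forward strand only)."""
    sequence = aligned_row.replace("-", "")
    return [match.start() for match in motif_regex(motif).finditer(sequence)]


MOTIF = "CGTNNNNNRYGAY"
UPSTREAM = {
    "Scer": "GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA",
    "Smik": "AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA",
}
DOWNSTREAM = {
    "Scer": "AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT",
    "Smik": "A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA",
}

if __name__ == "__main__":
    for label, block in (("upstream", UPSTREAM), ("downstream", DOWNSTREAM)):
        for species, row in block.items():
            print(label, species, find_motif(row, MOTIF))
```

In the first block the motif is found in S. cerevisiae but not in S. mikatae; in the second block the situation is reversed.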


91
Evidence of strand crossing for near-palindromic
motifs
[Figure: positions of ABF1 sites relative to YHL012W in S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus]
  • ABF1 Crosses the Strand in YHL012W
  • CGTNNNNNRYGAY
  • RTCRYNNNNNACG
  • Scer ---TAAAATAGCATATCGTTAAAAACGACAAACGCGT
  • Spar ---TAATATAACATCTCGTTAAAAACGACAAACGCGT
  • Smik TAATGAAATAA-ATCTCGTAAAAAACGACAAACGCGT
  • Sbay ---TGATCTGCCCTTCCGTATATAATGACAAACGCGT
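
The two motif strings on this slide are reverse complements of each other, which is what "crossing the strand" refers to. A short sketch with IUPAC-aware complementation (R and Y swap, N stays N); the function name is illustrative.

```python
# Complement table for the IUPAC codes used here: R (A/G) pairs with Y (C/T).
COMPLEMENT = str.maketrans("ACGTRYN", "TGCAYRN")


def reverse_complement(motif):
    """Reverse complement of a DNA motif written with A, C, G, T, R, Y, N."""
    return motif.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("CGTNNNNNRYGAY"))  # RTCRYNNNNNACG
```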

92
The birth-death process of regulatory motifs
Motif birth
Abf1
Abf1
Motif movement
Hap4
Hap4
Motif death
Msn2
93
Motif birth governed by random process ?
f = footprint, i = information
More bits: slower movement
Wider: faster
AANNCG GTNNTG
AC CT
2X 1X
4X 1X
f4
i2
GNNNT
GT
f2
i4
94
Motif birth governed by random process !
Observed motif birth rate
Motif information content
Motif birth can be modeled as a largely random
process
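
One way to see why birth rate should track information content: under an assumed background of uniform, independent bases, the chance that a random site matches a motif is 2 to the power of minus its information content in bits. The sketch below is an illustration under that assumption, not the model fitted in the lecture; the names are hypothetical.

```python
from math import log2

# Number of bases each IUPAC code allows.
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "R": 2, "Y": 2, "N": 4}


def information_bits(motif):
    """Information content in bits: 2 bits per fixed base, 1 per R/Y, 0 per N."""
    return sum(2 - log2(DEGENERACY[base]) for base in motif)


def match_probability(motif):
    """Chance that a random site matches, assuming uniform independent bases."""
    return 2.0 ** -information_bits(motif)
```

For example, `GNNNT` and `GT` both carry 4 bits but differ in footprint (5 versus 2 positions), while `CGTNNNNNRYGAY` carries 13 bits and matches a random site with probability 1/8192.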
95
Motif aging
Age 0
Number of instances
Information content
Red: all regions. Green: bound regions.
What is responsible for the shift in distribution?
96
3. Death rates governed by selective landscape
Green: death rate in bound regions. Red: death
rate in unbound regions.
Motif death rates drastically different in
functional / non-functional regions
97
Intensity of selection determines motif death rate
Rate of motif death
Bound Cooperative
Bound
Not bound Cooperative
Neither
Each level of selective pressure shows distinct
death rate
98
Birth and death events for chromosome arm (16R)
Yap 1
Strength of selective pressure
Green: motif birth. Red: motif death. Blue:
motif aging.
Yap 1
Chromosomal position on chromosome 16 (right arm)
Birth-death process governed by selection
landscape
Chromosomal position on chromosome 16 (right arm)
99
Motif evolution governed by three processes
  • Motif birth
  • Short motifs can appear by neutral evolution
  • Rate of motif birth depends on information content
  • Motif aging
  • Motif abundance shifts towards bound regions
  • Distribution changes gradually over time
  • Motif death
  • Governed by functional selection landscape
  • Predicted by partner motifs and factor binding

Modeling motif evolution can lead to better
discovery
100
Network evolution by duplication
  • Motif discovery
  • Motif evolution

Aviva Presser
  • Network evolution

101
Networks are dynamic in time and in evolution
Global motif co-occurrence map --> Reveal
co-operating regulators
Initial network of master regulators --> Reveal
hubs, cascades, network motifs
How do networks change in the face of gene
duplication?
102
Evidence of Whole Genome Duplication
103
Whole Genome Duplications in diverse lineages
Yeast duplication: Kellis et al., Nature, Apr 8,
2004
Vertebrate duplication in fish: Jaillon et al.,
Nature, Oct 21, 2004
Two rounds of WGD in human! Dehal et al., PLoS
Biology, Oct 2005
104
The return to haploidy
Number of genes
5,000
time
Today
100Myrs
Advantage of WGD may lie in 500 gained genes
105
Functions of duplicated genes
  • As a group
  • Biased towards environment adaptation
  • Sugar metabolism, fermentation, regulation
  • Individual pairs
  • Are new gene functions gained by WGD ?
  • How are new gene functions emerging ?

Rate 1
WGD
S. cerevisiae copy 1
Rate 2
S. cerevisiae copy 2
K. waltii
Evidence of accelerated protein divergence ?
106
Scenarios for rapid gene evolution
One copy faster
Scer - copy1
Scer - copy2
Kwal
Ohno, 1970
Both copies faster
Scer - copy1
Scer - copy2
Force, 1999
Kwal
20% of duplicated genes show acceleration
In 95% of cases, only one copy is faster
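
A crude way to ask whether one copy evolved faster is to count, against the K. waltii outgroup, the aligned sites where exactly one S. cerevisiae copy differs from the other two sequences. This parsimony-style count is a stand-in for illustration only; the lecture does not say which rate test was used, and the function name is hypothetical.

```python
def lineage_specific_changes(copy1, copy2, outgroup):
    """Count sites changed only in copy1 and sites changed only in copy2.

    All three sequences must be aligned to the same length. Columns with
    a gap in any sequence are skipped. A site is attributed to a copy when
    that copy differs while the other copy still matches the outgroup.
    """
    if not len(copy1) == len(copy2) == len(outgroup):
        raise ValueError("sequences must be aligned to equal length")
    only1 = only2 = 0
    for a, b, o in zip(copy1, copy2, outgroup):
        if "-" in (a, b, o):
            continue
        if a != b and b == o:
            only1 += 1
        elif a != b and a == o:
            only2 += 1
    return only1, only2
```

A strongly unbalanced pair of counts would correspond to the "one copy faster" scenario, and similar elevated counts to "both copies faster"; a statistical test on the counts would be needed before calling an acceleration.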
107
Emerging gene functions after duplication
  • Origin of replication --> silencing

4-fold acceleration
Scer - Sir3 (silencing)
Scer - Orc1 (origin of replication)
Kwal - Orc1
  • Translation initiation --> anti-viral defense

3-fold acceleration
Scer - Ski7 (anti-viral defense)
Scer - Hbs1 (translation initiation)
Kwal - Hbs1
Asymmetric divergence --> recognize ancestral /
derived
108
Asymmetric divergence --> distinct functional
properties
               Ancestral function   Derived function
Gene deletion  Lethal (20%)         Never lethal
Gain new function and lose ancestral function
109
Asymmetric divergence --> distinct functional
properties
               Ancestral function   Derived function
Gene deletion  Lethal (20%)         Never lethal
Expression     Abundant             Specific (stress, starvation)
Localization   General              Specific (mitochondrion, spores)
Gain new function and lose ancestral function
110
Asymmetry also found in network connectivity
Duplicated gene
Interaction partners
Asymmetric Divergence
Duplication
Interaction loss more likely than gain. One
protein maintains ancestral function?
Study network in context of duplication
111
Network evolution by duplication
[Figure: two scenarios relating pre-WGD (ancestral) network motifs, lost duplicates, and modern network motifs]
Scenario 1: duplication followed by loss
Scenario 2: duplication followed by gain
112
Mechanisms of network motif emergence
Lost Interactions
Kept Interactions
Gained Interactions
  • Pre-Duplication Probabilities
  • p: probability of interaction
  • q: probability of self-interaction
  • Post-Duplication Probabilities
  • Pplus: probability of adding an interaction
  • Pminus: probability of eliminating an interaction

113
Emergence of post-duplication network motifs
All have either 4 or 0 edges across the
pairs (4-across or 0-across)
114
Modeling network evolution
  • Parameters
  • Fraction duplicated vs spontaneous generation
  • Fraction of edges deleted
  • Number of edges for spontaneous genes
  • 90% of timesteps: duplication
  • Pick a gene at random
  • Duplicate with all its connections
  • Delete on average 35% of new connections
  • 10% of timesteps: creation
  • Create a new gene
  • Randomly connect it to the existing network with
    0-20 connections
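
The growth model listed above can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the lecture's implementation: the seed network of two connected genes, the independent deletion of each copied edge with probability 0.35, and the uniform draw of 0 to 20 connections for a new gene are all choices made here for illustration, and `simulate_network` is a hypothetical name.

```python
import random


def simulate_network(steps, p_duplicate=0.9, p_delete=0.35,
                     max_new_edges=20, seed=0):
    """Grow an undirected network by gene duplication and gene creation.

    Returns a dict mapping each gene id to the set of its neighbours.
    """
    rng = random.Random(seed)
    neighbors = {0: {1}, 1: {0}}  # assumed seed network: two connected genes
    for _ in range(steps):
        new = len(neighbors)
        if rng.random() < p_duplicate:
            # Duplication: copy a random gene's connections, then delete
            # each copied connection with probability p_delete.
            parent = rng.randrange(new)
            targets = [n for n in sorted(neighbors[parent])
                       if rng.random() >= p_delete]
        else:
            # Creation: connect a brand-new gene to 0..max_new_edges
            # randomly chosen existing genes.
            count = rng.randint(0, min(max_new_edges, new))
            targets = rng.sample(range(new), count)
        neighbors[new] = set(targets)
        for target in targets:
            neighbors[target].add(new)
    return neighbors
```

Counting small subgraphs and the degree distribution in the returned network, and comparing them with the observed yeast network, is the kind of analysis the following slides summarize.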


Study emergence of network motifs
115
Abundance of network motifs predicted by
duplication
116
Lessons Learned
1. Asymmetry in network connectivity
Duplication
Divergence
Interaction loss more likely than gain. One
protein maintains ancestral network function?
2. High frequency of ohnolog pair interaction
  1. Abundance of ancestral self-interactions
  2. Gain of ohnolog interaction by proximity due to
    common interactions
  3. Selection for ohnologs with interaction: both
    kept since neither can mutate. A faulty A would
    disrupt polymerization of A-A-A-A, reducing
    fitness.

Duplication
  • Ancestral self-interaction or
  • gain of ohnolog interaction

117
3. Abundance of global properties and network hubs
Duplication asymmetric divergence model
Traditional preferential attachment model
Model matches local and global network properties
118
Network evolution: Conclusions
  • Asymmetric evolution of network connectivity
  • One copy preserves connections
  • Other copy keeps a subset (rarely gains)
  • WGD preserves network connectivity
  • Duplicates highly interconnected
  • Simple model of network evolution
  • Estimate rates of interaction gain and loss
  • Very good fit to simulated and actual yeast
    network
  • Infer connectivity patterns of ancestral network
  • Ancestral network shows increased number of
    self-interactions
  • Self-interacting proteins favored in duplicated
    network?

119
Comparative genomics and regulatory networks
  • Regulatory motif discovery
  • Genome-wide conservation score
  • Validated using expression, positional bias,
    multiplicity
  • Pre- and post-transcriptional regulation
  • microRNA regulation
  • Motif-centric discovery of new microRNA genes
  • Many new microRNAs, experimentally validated
  • Role of microRNA regulation: 20% of the genome
  • Regulatory motif evolution
  • Underlying birth-death process, random birth
    process
  • Aging shifts distribution, death governed by
    selection
  • Ability to model motifs for discovery in many
    species
  • Protein network evolution
  • Simple duplication-based model
  • Motif abundance, degree distribution can be
    predicted
  • Asymmetric divergence, cross-interactions

120
Acknowledgements
  • Human motifs
  • Xiaohui Xie
  • Eric Lander
  • Vamsi Mootha
  • Kerstin Lindblad-Toh
  • Jun Lu
  • E.J. Kulbokas
  • Todd R. Golub
  • Fungal comparisons
  • Bruce Birren
  • Christina Cuomo
  • James Galagan
  • Li-Jun Ma
  • Joshua Grochow
  • Gene identification
  • Mike Lin
  • Michael Brent
  • Network evolution
  • Aviva Presser
  • Michael Elovitz
  • Roy Kishony
  • Motif Evolution
  • Erez Lieberman
  • Martin Nowak
  • Genome-wide phylogeny
  • Matt Rasmussen
  • Marcia Lara

121
Who's actually doing the work
Mike Lin: Gene identification
Erez Lieberman: Motif evolution
Xiaohui Xie: Motif finding
Aviva Presser: Network evolution
Matt Rasmussen: Whole-genome phylogeny
Josh Grochow: Protein motifs
Pouya Kheradpour: Human motifs
Alex Stark: Fly regulatory networks