Title: Comparative genomics of 24 mammals
1Comparative genomics of 24 mammals
Broad Institute of MIT and Harvard
MIT Computer Science and Artificial Intelligence Laboratory
2Sequencing the mammalian phylogeny
#    Species      Center  Coverage
H1   Human        Done    Full
H2   Chimp        Done    Full
H3   Rhesus       Done    Full
H4   Mouse        Done    Full
H5   Rat          Done    Full
H6   Dog          Done    Full
H7   Cow          Done    Full
1    Elephant     Broad   1.94x
2    Armadillo    Broad   1.98x
3    Tenrec       Broad   1.90x
4    Rabbit       Broad   1.95x
5    Guinea Pig   Broad   1.92x
6    Hedgehog     Broad   1.86x
7    Shrew        Broad   1.92x
8    Microbat     Broad   1.84x
9    Tree Shrew   Broad   1.89x
10   Squirrel     Broad   1.90x
11   Bushbaby     Broad   1.87x
Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma
Broad, Baylor, WashU, Arachne, UCSC
3Comparative genomics of mammalian species
- Goal 1: Discover regions of increased selection
- Detect functional elements by their increased conservation
- More genomes detect smaller elements, subtle selection
- Goal 2: Discover different classes of functional elements
- Patterns of change distinguish different types of functional elements
- Specific function → selective pressures → patterns of mutation/insertion/deletion
- Develop evolutionary signatures characteristic of each function
4Protein-coding genes
5Evolutionary signatures for protein-coding genes
- Same conservation levels, distinct patterns of divergence
- Gaps are multiples of three (preserve amino acid translation)
- Mutations are largely 3-periodic (silent codon substitutions)
- Specific triplets exchanged more frequently (conservative substitutions)
- Conservation boundaries are sharp (pinpoint individual splicing signals)
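Two of the signatures above (gap lengths that are multiples of three, and substitutions concentrated at one codon position) can be tallied directly from a pairwise alignment. The sketch below is only an illustration under stated assumptions: the function name `frame_signature` is hypothetical, the reference sequence is assumed to start at codon position 1, and this is not the CSF or RFC scoring used for the genome-wide results.

```python
import re

def frame_signature(ref, other):
    """Tally frame-related signatures in a pairwise alignment.

    ref, other: aligned strings of equal length, '-' marks a gap.
    Assumes (hypothetically) that ref starts at the first base of a codon.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")

    # Lengths of every run of gap characters in either sequence.
    gaps = [len(m.group()) for s in (ref, other) for m in re.finditer(r"-+", s)]
    in_frame = sum(g % 3 == 0 for g in gaps)

    # Count substitutions by codon position (0, 1, 2) of the reference base.
    subs_by_codon_pos = [0, 0, 0]
    ref_index = 0
    for a, b in zip(ref, other):
        if a == "-":
            continue  # insertion relative to ref: no reference coordinate
        if b != "-" and a != b:
            subs_by_codon_pos[ref_index % 3] += 1
        ref_index += 1

    return {
        "gaps": len(gaps),
        "in_frame_gaps": in_frame,
        "subs_by_codon_pos": subs_by_codon_pos,
    }

# Toy alignment: one 3-base gap and two third-position substitutions.
example = frame_signature("ATGGCTAAAGGT", "ATGGCC---GGA")
```

In a protein-coding region one would expect most gaps to be in frame and most substitutions to fall at the third codon position; non-coding conservation should show neither bias.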
6Protein-coding evolution vs nucleotide
conservation
Protein-coding exons
Highly conserved non-coding elements
- Evolutionary signatures specific to each function
- Distinguish protein-coding from non-coding conservation
- Genome-wide run (CSF only): 81% sensitivity, 91% precision
- Incorporating additional signatures: RFC, single-species
7Many new genes confirmed by chromatin domains
Missed exon
Example MM14qC3
- Several hundred new exons, many in clusters
8Genome-wide curation / experimental follow-up
PI: Tim Hubbard, Sanger Center. HAVANA curators, experimental validation.
- Novel candidate genes and exons
- Experimental cDNA sequencing and validation
- Curation of gene structures integrating evidence
- Revising existing annotations
- Identify dubious genes with non-protein-like evolution
- Refine boundaries and exon sets of existing genes
- Curation: evaluate evidence supporting that annotation
- Unusual gene structures
- Evolutionary evidence in absence of primary signals
- Reveal new and unusual biological mechanisms
9Unusual protein-coding events
10When primary sequence signals are ignored
- Typical gene (MEF2A). Evolutionary signal stops at the stop codon.
- Unusual gene (GPX2). Protein-coding signal continues past the stop.
- GPX2 is a known selenoprotein! Additional candidates found.
11Translational read-through in neuronal proteins
Novel candidate OPRL1 neurotransmitter
Continued protein-coding conservation
Protein-coding conservation
No more conservation
Stop codon read through
- New mechanism of post-transcriptional control.
- Conserved in both mammals (5 candidates) and flies (150 candidates)
- Strongly enriched for neurotransmitters and brain-expressed proteins
- Read-through stop codon (and surrounding sequence) shows increased conservation
- Many questions remain
- Role of editing? Cryptic splice sites? RNA secondary structure?
Lin et al, Genome Research 2007
12Measuring excess constraint within protein-coding
exons
Typical protein-coding exon (numerous mutations at each column)
Excess-conservation exon: conserved above and beyond the call of duty → likely to have additional functions, overlapping selective pressures
13Searching for excess-constraint coding sequence
- (1) Build a model for expected substitution counts
- Syn. subst. correlate with degeneracy, CpG
- Distribution for each ancestral codon
- (2) Score windows for depletion in syn. subst.
- Z-score: P(obs. subst. | expected for each codon)
- (3) Top candidate exons with excess constraint
- PCPB2: derived from ancestral transposon
- HoxB5: 52 AA from gene start before first syn. subst.
- C6orf111: predicted ORF on chr. 6
- EIF4G2: overlaps spliced EvoFold prediction
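Step (2) above can be sketched as a sliding-window Z-score. This is a minimal illustration, assuming each codon contributes an independent Bernoulli event with a known expected probability of a synonymous substitution; the per-codon probabilities, the function name `window_z_scores`, and the independence assumption are all hypothetical and stand in for the model built in step (1).

```python
import math

def window_z_scores(observed, expected_p, window):
    """Z-score of observed vs expected synonymous substitutions per window.

    observed:   list of 0/1 flags, one per codon (1 = synonymous substitution seen)
    expected_p: list of expected substitution probabilities, one per codon
                (hypothetical values; in practice taken from the fitted model)
    window:     number of codons per window
    Strongly negative scores indicate depletion, i.e. candidate excess constraint.
    """
    if len(observed) != len(expected_p):
        raise ValueError("observed and expected_p must have equal length")
    scores = []
    for start in range(len(observed) - window + 1):
        obs = sum(observed[start:start + window])
        probs = expected_p[start:start + window]
        mean = sum(probs)
        var = sum(p * (1.0 - p) for p in probs)  # independent Bernoulli codons
        scores.append((obs - mean) / math.sqrt(var) if var > 0 else 0.0)
    return scores

# Toy input: substitutions only in the first two of six codons.
scores = window_z_scores([1, 1, 0, 0, 0, 0], [0.5] * 6, window=4)
```

Windows would then be ranked by score, with the most negative ones reported as top candidates.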
14Examples Top candidate exons showing increased
selection
- HoxB5: 52 amino acids before the first synonymous substitution
- Overlaps highly conserved RNA secondary structure
- C6orf11: Predicted ORF, protein-coding, extremely conserved
- EIF4G2: Several consecutive exons, conserved RNA structure
15microRNA genes
- Alex Stark
- Pouya Kheradpour
16Evolutionary signatures for microRNA genes
Combine with 10 other features → 4,500-fold enrichment
17Novel miRNAs validated by sequencing reads
Stark et al, Genome Research (GR) 2007. Ruby et al, GR 2007
- In fly genome: 101 hairpins above 0.95 cutoff
- 60 of 74 (81%) known Rfam miRNAs rediscovered
- 24 novel, expression-validated by 454/Solexa (Bartel/Hannon)
- 17 additional candidates show diverse evidence of function
- In mammals: combine experimental and evolutionary info
- Rely on reads for discovery, use evolutionary signal to study function
18Surprise 1: microRNA and microRNA* function
Drosophila Hox
- Both hairpin arms of a microRNA can be functional
- High scores, abundant processing, conserved targets
- Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
Stark et al, Genome Research 2007
19Surprise 2 microRNA-anti-sense function
anti-sense
sense
Stark et al, Genes & Development 2007
- A single miRNA locus transcribed from both strands
- The two transcripts show distinct expression domains (mutually exclusive)
- Both processed to mature miRNAs: miR-iab-4, miR-iab-4AS (anti-sense)
20miR-iab-4AS leads to homeotic transformations
[Figure: haltere → wing transformations (wing with sensory bristles); panels: WT, sense, antisense. Note: C, D, E same magnification]
- Mis-expression of miR-iab-4S and AS: halteres → wings homeotic transformation
- Stronger phenotype for AS miRNA
- Sense/anti-sense pairs as general building blocks for miRNA regulation
- 10 sense/anti-sense miRNAs in mouse
Stark et al, Genes & Development 2007
21Function of miRNA arms and anti-sense miRNAs
- Denser Hox miRNA targeting network
22Measuring selection
- Michele Clamp
- Manuel Garber
- Xiaohui Xie
23Detecting Purifying Selection (ω)
Neutral sequence
Constrained sequence
- Estimating intensity of constraint (ω)
- Probabilistic evolutionary model
- Maximum Likelihood (ML) estimation of ω
- sitewise (evaluate every k-long window)
- windows-based (increased power)
- Reports ω, and its log odds score (LODS).
- Theoretical p-value (LODS distributes as χ² with df = 1)
Manuel Garber, Michele Clamp, Xiaohui Xie
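The step from an ML estimate of the constraint parameter to a theoretical p-value can be sketched with a toy model. Everything model-specific here is an assumption: a Poisson count of substitutions replaces the probabilistic evolutionary model on the slide, the function name `constraint_lrt` is hypothetical, and the statistic follows the usual likelihood-ratio convention (twice the log-likelihood ratio compared against χ² with 1 degree of freedom).

```python
import math

def constraint_lrt(observed_subs, expected_neutral_subs):
    """Toy Poisson stand-in for the ML fit of the rate-scaling parameter.

    observed_subs:         substitutions counted in a window (hypothetical input)
    expected_neutral_subs: substitutions expected under the neutral model (> 0)
    Returns (omega_hat, statistic, p_value).
    """
    k, e = observed_subs, expected_neutral_subs
    if e <= 0:
        raise ValueError("expected_neutral_subs must be positive")
    omega_hat = k / e  # ML estimate of the scaling under a Poisson model

    # Twice the log-likelihood ratio of omega = omega_hat versus omega = 1.
    log_term = k * math.log(omega_hat) if k > 0 else 0.0
    statistic = max(2.0 * (log_term - (k - e)), 0.0)

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return omega_hat, statistic, p_value

# A window with 2 substitutions where 10 are expected under neutrality.
omega_hat, statistic, p_value = constraint_lrt(2, 10.0)
```

An estimate well below 1 with a small p-value would mark the window as constrained; a window evolving at the neutral rate gives a statistic of 0 and a p-value of 1.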
24Detecting other constraint signatures (π)
[Figure: per-column π scores: 0, 0, 0.8, 0.5, 0.6, 3.2, 0, 0]
- Repeated C→G transversion
- Has happened at least 4 times.
- Very unlikely given neutral model.
- Goal: Identify sites with unlikely substitution pattern.
- Approach: Probabilistic method to detect a stationary distribution that is different from background.
- Solution: Implement ML estimator (π) of this vector
- Provides a Position Weight Matrix for any given k-mer in the genome.
- Scores every base in the genome (LODS).
Manuel Garber, Michele Clamp, Xiaohui Xie
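The idea of scoring a site by how far its base distribution departs from background can be illustrated with a deliberately simplified log-odds calculation. This sketch ignores the phylogeny entirely and treats the aligned bases in a column as independent draws, which the actual estimator does not; the uniform `BACKGROUND` and the function name `column_log_odds` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical uniform background; a real run would use genome-derived frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def column_log_odds(column, background=BACKGROUND):
    """Log-odds (natural log) of the ML base distribution vs background.

    column: string of aligned bases at one site, one character per species.
    Gaps and ambiguous characters are skipped.
    Simplification: species are treated as independent (no tree).
    """
    counts = Counter(b for b in column.upper() if b in background)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    # ML estimate of the distribution is counts / n; sum the per-base log ratios.
    return sum(c * math.log((c / n) / background[b]) for b, c in counts.items())

def kmer_log_odds(columns, background=BACKGROUND):
    """Per-column scores for a k-mer, given one alignment column per position."""
    return [column_log_odds(col, background) for col in columns]

scores = kmer_log_odds(["GGGG", "ACGT", "CC-C"])
```

A column matching the background scores 0, while a column dominated by one base scores high; the list of per-position distributions plays the role of a position weight matrix for the k-mer.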
25Estimation of genome-wide constraint
Pilot ENCODE regions (1%)
9.4% conserved, 5.7% above FDR cutoff
10.5% conserved, 6% above FDR cutoff
Genome-wide
Across entire genome: 5% under selection. Same as for Human-Mouse. What's different?
Manuel Garber, Michele Clamp, Xiaohui Xie
26More mammals: we can actually tell which 5% it is!
Constraint calculated over a 50-mer
21 mammals
4 mammals
5% FDR
>40% FDR
Constraint calculated over a 12-mer
21 mammals
4 mammals
5% FDR
>40% FDR
Michele Clamp
27Individual conserved elements match known TF sites
Example TNNC1 (Troponin C)
[Figure: constraint score across the promoter alignment, marking known TF binding sites: TATA, SP-1, CEF-2, CEF1]
Binding site resolution, even without known motif
model
Michele Clamp
28Binding sites for known regulators
- Pouya Kheradpour
- Alex Stark
29Computing Branch Length Score (BLS)
- Allows for:
- Mutations permitted by motif degeneracy
- Misalignment/movement of motifs within window (up to hundreds of nucleotides)
- Missing motif in dense species tree
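The slide does not spell out the formula, so the sketch below assumes the common definition of a branch length score: the total branch length of the smallest subtree connecting the species in which the motif is found, divided by the total branch length of the tree. The tree, its branch lengths, and the function name `branch_length_score` are hypothetical illustration values.

```python
def branch_length_score(parent, length, species_with_motif):
    """Fraction of total branch length spanned by species carrying the motif.

    parent: dict mapping each node to its parent (root maps to None)
    length: dict mapping each non-root node to the length of the branch above it
    species_with_motif: leaves in which a motif instance was found
    """
    marked = set(species_with_motif)
    if len(marked) < 2:
        return 0.0  # a single species spans no branches

    # Count, for every node, how many marked leaves sit at or below it.
    below = {}
    for leaf in marked:
        node = leaf
        while node is not None:
            below[node] = below.get(node, 0) + 1
            node = parent.get(node)

    # A branch belongs to the connecting subtree if marked leaves exist on
    # both sides of it: some below, but not all of them.
    spanned = sum(
        length[node]
        for node, count in below.items()
        if node in length and count < len(marked)
    )
    return spanned / sum(length.values())

# Hypothetical four-species tree with made-up branch lengths.
PARENT = {"human": "primates", "chimp": "primates", "mouse": "rodents",
          "rat": "rodents", "primates": "root", "rodents": "root", "root": None}
LENGTH = {"human": 0.1, "chimp": 0.1, "mouse": 0.3, "rat": 0.3,
          "primates": 0.2, "rodents": 0.2}

bls_close = branch_length_score(PARENT, LENGTH, ["human", "chimp"])
bls_far = branch_length_score(PARENT, LENGTH, ["human", "mouse"])
```

Because the score depends only on which species carry an instance, a motif missing from one species in a dense tree lowers the score slightly instead of discarding the instance.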
30Branch Length Score → Confidence
- Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
- Compute Confidence Score as fraction of instances over noise at a given BLS (1 − false discovery rate)
- Many species are needed to confidently predict instances
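The confidence calculation described above can be sketched as follows. This is an illustration under assumptions: instances are counted at or above a BLS cutoff, control counts are averaged over the number of shuffled controls, and negative values are clipped to zero; the function name `confidence_at_bls` and its parameters are hypothetical.

```python
def confidence_at_bls(motif_bls, control_bls, cutoff, n_controls=1):
    """Confidence = 1 - (expected control instances / motif instances).

    motif_bls:   BLS values of all instances of the real motif
    control_bls: BLS values of instances pooled over shuffled control motifs
    cutoff:      instances with BLS >= cutoff are counted
    n_controls:  number of shuffled controls pooled in control_bls
    """
    n_motif = sum(b >= cutoff for b in motif_bls)
    if n_motif == 0:
        return 0.0
    expected_noise = sum(b >= cutoff for b in control_bls) / n_controls
    # Clip at zero when controls are conserved as often as the motif itself.
    return max(0.0, 1.0 - expected_noise / n_motif)

# Toy values: 3 motif instances pass the cutoff, 1 control instance does.
conf = confidence_at_bls([0.9, 0.8, 0.8, 0.2], [0.9, 0.1, 0.1, 0.1], cutoff=0.5)
```

With few species the achievable BLS values are coarse, so control and real motifs separate poorly; this is one way to read the last bullet above.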
31Performance on vertebrate Transfac motifs
Median number of instances (at fixed confidence)
- Most motifs have instances reaching 90% confidence with 18 mammals
- Substantial increase in the number of instances compared to only human, mouse, rat and dog.
32Intersection with CTCF ChIP-Seq regions
ChIP data from Barski, et al., Cell (2007)
- ChIP-Seq and ChIP-Chip technologies allow for identifying binding sites of a motif experimentally
- Conserved CTCF motif instances highly enriched in ChIP-Seq sites
- High enrichment does not require low sensitivity
- Many motif instances are verified
33Enrichment also found for other factors
Barski, et al., Cell (2007)
34Enrichment increases in conserved bound regions
- ChIP bound regions may not be conserved
- For CTCF we also have binding data in mouse
- Enrichment in intersection is dramatically higher
Human: Barski, et al., Cell (2007). Mouse: Bernstein, unpublished
35Enrichment increases in conserved bound regions
- ChIP bound regions may not be conserved
- For CTCF we also have binding data in mouse
- Enrichment in intersection is dramatically higher
- Trend persists for other factors where we have
multi-species ChIP data
36Motif discovery
- Pouya Kheradpour
- Alex Stark
37Using confidence for motif discovery
- Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
- Compute Confidence Score as fraction of instances over noise at a given BLS (1 − false discovery rate)
38Motif discovery pipeline
- Enumerate motif seeds
- Six non-degenerate characters with variable size gap in the middle
- Score seed motifs
- Use a conservation ratio corrected for composition and small counts to rank seed motifs
- Expand seed motifs
- Use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing
- Cluster to remove redundancy
- Using sequence similarity
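The first pipeline step can be sketched by collecting the gapped seeds that actually occur in a sequence. Assumptions are flagged here and in the code: the six characters are split 3 + 3 around the gap, gap sizes run from 0 to a hypothetical `max_gap`, and seeds are counted from observed sequence instead of enumerating every possible seed. The scoring, expansion, and clustering steps are not shown.

```python
from collections import Counter

def gapped_seeds(seq, max_gap=10, half=3):
    """Count seeds of the form XXX-gap-XXX occurring in seq.

    seq:     DNA sequence (any case)
    max_gap: largest gap size tried (hypothetical default)
    half:    non-degenerate characters on each side of the gap
             (assumes the six characters split evenly, 3 + 3)
    Returns a Counter keyed by (left, gap, right).
    """
    seq = seq.upper()
    counts = Counter()
    for gap in range(max_gap + 1):
        span = 2 * half + gap
        for i in range(len(seq) - span + 1):
            left = seq[i:i + half]
            right = seq[i + half + gap:i + span]
            # Skip windows whose fixed positions contain N or other symbols.
            if set(left + right) <= set("ACGT"):
                counts[(left, gap, right)] += 1
    return counts

seeds = gapped_seeds("ACGTTTGCA", max_gap=3)
```

Each key, for example `("ACG", 3, "GCA")`, corresponds to a seed motif such as ACGNNNGCA whose unspecified middle bases would be filled in later during expansion.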
39Motif discovery in enhancer regions
Heintzman et al, Bing Ren's lab
- Collaboration with Ren, White, Posakony labs
- Predict novel enhancer / promoter / insulator elements
- Identify motifs associated with these regions
- Validate predicted regions for in vivo function
- Initial results in human genome
- Motif combinations predictive of enhancer regions (5X)
40Motif discovery in 3' UTRs
- Perform motif discovery by ranking 7-mers in 3' UTRs by the highest confidence they reach with 100 instances.
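The ranking rule above can be sketched as: for each k-mer, scan BLS cutoffs from loose to strict, keep only cutoffs that still retain the minimum number of instances, and record the best confidence seen. This is an assumption-laden illustration: the phrase "with 100 instances" is read as "at least 100 instances", control counts are assumed to be already scaled to a single control, and the names `best_confidence` and `rank_kmers` are hypothetical.

```python
def best_confidence(motif_bls, control_bls, min_instances=100):
    """Highest confidence over BLS cutoffs retaining >= min_instances instances.

    motif_bls:   BLS values for instances of the k-mer
    control_bls: BLS values for control instances (assumed scaled to one control)
    """
    best = 0.0
    for cutoff in sorted(set(motif_bls)):
        n_motif = sum(b >= cutoff for b in motif_bls)
        if n_motif < min_instances:
            break  # stricter cutoffs can only retain fewer instances
        n_control = sum(b >= cutoff for b in control_bls)
        best = max(best, 1.0 - n_control / n_motif)
    return best

def rank_kmers(bls_by_kmer, min_instances=100):
    """Rank k-mers by best confidence; bls_by_kmer maps kmer -> (motif, control)."""
    scored = {
        kmer: best_confidence(motif, control, min_instances)
        for kmer, (motif, control) in bls_by_kmer.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Toy data with a small threshold so the example stays readable.
toy = {
    "AAAAAAA": ([0.9] * 3 + [0.1] * 2, [0.9] + [0.1] * 2),
    "CCCCCCC": ([0.5] * 4, [0.5] * 4),
}
ranking = rank_kmers(toy, min_instances=3)
```

Requiring a minimum instance count keeps a k-mer from ranking highly on the strength of a handful of very deeply conserved sites.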
41Summary
- Measuring increased selection
- Scaling of branch lengths ω
- Non-random stationary distribution π
- Increased resolution: individual binding sites
- Protein-coding genes
- Distinct evolutionary signatures
- Novel genes, revised genes
- Unusual structures: read-through, increased selection
- microRNAs
- Function of miRNA/miRNA* and sense/anti-sense pairs
- Dense miRNA targeting network for Hox cluster
- Regulatory motifs
- Measure increased selection, derive confidence score
- High sensitivity / high specificity for known motifs
- Use enumeration/confidence metric for motif discovery
42Acknowledgements
MIT Computer Science and AI Lab
Broad Institute of MIT and Harvard
Pouya Kheradpour
Kerstin Lindblad-Toh
Michele Clamp
Manuel Garber
Mike Lin
Xiaohui Xie
Alex Stark
Matt Rasmussen
Sante Gnerre, David Jaffe, Issao Fujiwara, Federica Di Palma
Arachne Assembly Team, Broad Sequencing Platform, Eric Lander
Sequencing: Baylor, WashU, Agencourt
Funding: NHGRI
miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS: Natascha Bushati, Steve Cohen, Julius, Greg Hannon