Title: Comparative Genome Maps
1Comparative Genome Maps
- CSCI 7000-005 Computational Genomics
- Debra Goldberg
- debg_at_hms.harvard.edu
2What is a comparative map?
3Why construct comparative maps?
- Identify and isolate genes
- Crops: drought resistance, yield, nutrition, ...
- Human: disease genes, drug response, ...
- Infer ancestral relationships
- Discover principles of evolution
- Chromosome
- Gene family
- key to understanding the human genome
4Why automate?
- Time consuming, laborious
- Needs to be redone frequently
- Codify a common set of principles
- Nadeau and Sankoff warn of the arbitrary nature of comparative map construction
5Definitions
- Marker: identifiable chromosomal locus
- Homology: genes with a common ancestor
- Homeology: chromosomal regions derived from a common ancestral linkage group
- Synteny: loci on the same chromosome
- Colinearity: syntenic regions with conserved gene order
6Input/Output
- Input
- genetic maps of 2 species
- marker/gene correspondences (homologs)
- Output
- a comparative map
- homeologies identified
7Map construction
Go from this
to this
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999.
8Chromosome labeling
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999.
9A natural model?
Maize 1 (target), Rice (base). Wilson et al., Genetics 1999.
10Scoring
11Assumptions
- Accept published marker order
- All linkage groups of base are unique
- Simplistic homeology criteria
- At least one homeologous region
12A natural model?
13A natural model?
14A natural model?
15A natural model?
16Dynamic programming
- $l_i$: location of the homolog to marker $i$
- $S_{i,a}$: penalty (score) for an optimal labeling of the submap from marker $i$ to the end, when labeling begins with label $a$
(Figure: markers $1, \dots, i, \dots, n$, with marker $i$ labeled $a$.)
17Recurrence relation
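- $S_{n,a} = m\,\delta(a, l_n)$
- $S_{i,a} = m\,\delta(a, l_i) + \min_{b \in L} \left( S_{i+1,b} + s\,\delta(a,b) \right)$
- $\delta(x,y)$ is 0 if $x = y$ and 1 otherwise; $m$ is the mismatched-marker penalty, $s$ the segment-boundary penalty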
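The linear-model recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `label_linear`, the list-of-strings input, and the default penalties `m=1`, `s=2` are all assumptions made for the example, and `delta` stands for the 0/1 mismatch indicator used in the recurrence. The input is assumed to be non-empty.

```python
def label_linear(homolog_loc, labels, m=1, s=2):
    """Linear-model labeling (sketch).

    homolog_loc[i] is the base-species location of the homolog of marker i.
    labels is the set L of candidate labels. m and s are hypothetical
    penalty values for a mismatched marker and a segment boundary.
    Returns (optimal score, one optimal labeling).
    """
    def delta(x, y):
        return 0 if x == y else 1

    n = len(homolog_loc)
    labels = list(labels)
    S = [{} for _ in range(n)]    # S[i][a]: best score for markers i..n-1, marker i labeled a
    nxt = [{} for _ in range(n)]  # nxt[i][a]: label chosen for marker i+1 (for traceback)

    # Base case: the last marker only pays a possible mismatch penalty.
    for a in labels:
        S[n - 1][a] = m * delta(a, homolog_loc[n - 1])

    # Fill the table from the end of the map toward the start.
    for i in range(n - 2, -1, -1):
        for a in labels:
            b = min(labels, key=lambda x: S[i + 1][x] + s * delta(a, x))
            S[i][a] = m * delta(a, homolog_loc[i]) + S[i + 1][b] + s * delta(a, b)
            nxt[i][a] = b

    # Trace back one optimal labeling from the best starting label.
    a = min(labels, key=lambda x: S[0][x])
    score, path = S[0][a], [a]
    for i in range(n - 1):
        a = nxt[i][a]
        path.append(a)
    return score, path
```

With these assumed penalties, a lone marker whose homolog lies elsewhere is cheaper to absorb as a mismatch (cost `m`) than to give its own segment (two boundaries, cost `2s`), which is the trade-off the scoring is meant to capture.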
18Problem with linear model
19The stack model
(Figure: a stack of segment labels, a through f.)
- Segment at top of the stack can be:
- pushed (remembered), later popped
- replaced
- Push and replace cost s; pop is free.
20Scoring
21Dynamic programming
- $S_{i,j,a}$: score for an optimal labeling of the submap from marker $i$ to marker $j$ when labeling begins with label $a$, i.e., marker $i$ is labeled $a$
(Figure: markers $1, \dots, i, \dots, j, \dots, n$, with marker $i$ labeled $a$.)
22Recurrence relation
- $S_{i,i,a} = m\,\delta(a, l_i)$
- $S_{i,j,a} = \min$ of:
- $m\,\delta(a, l_i) + \min_{b \in L} \left( S_{i+1,j,b} + s\,\delta(a,b) \right)$
- $\min_{i<k<j} \left( S_{i,k,a} + S_{k+1,j,a} \right)$
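The stack-model recurrence can be sketched as a memoized interval dynamic program. This is an illustrative sketch under assumptions, not the speaker's code: the name `stack_score`, the 0-based indexing, and the default penalties `m=1`, `s=2` are chosen for the example, and only the optimal score is returned (no traceback). The input is assumed to be non-empty.

```python
from functools import lru_cache


def stack_score(homolog_loc, labels, m=1, s=2):
    """Optimal stack-model score for a whole map (sketch).

    homolog_loc[i] is the base-species location of the homolog of marker i.
    m and s are hypothetical mismatched-marker and segment-boundary penalties.
    """
    labels = tuple(labels)
    n = len(homolog_loc)

    def delta(x, y):
        return 0 if x == y else 1

    @lru_cache(maxsize=None)
    def S(i, j, a):
        # Score of the best labeling of markers i..j when marker i is labeled a.
        mismatch = m * delta(a, homolog_loc[i])
        if i == j:
            return mismatch
        # Case 1: label marker i, then continue with label b (boundary cost if b != a).
        best = mismatch + min(S(i + 1, j, b) + s * delta(a, b) for b in labels)
        # Case 2: split at k; both parts begin with label a, so returning
        # to a at marker k+1 (a pop) adds no boundary cost.
        for k in range(i + 1, j):
            best = min(best, S(i, k, a) + S(k + 1, j, a))
        return best

    return min(S(0, n - 1, a) for a in labels)
```

For a map whose homolog locations read r1, r1, r3, r3, r3, r1, r1, the split case lets the labeling return to r1 for free after the r3 segment, so only one boundary is paid instead of two.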
23Results: infers evolutionary events
Maize 1 (target), Rice (base). Wilson et al.
24Problem: Incomplete input
- Gene order not always fully resolved.
- Co-located genes can be ordered to give most
parsimonious labeling.
25The reordering algorithm
- Uses a compression scheme
- Within a megalocus, group genes by location of related gene.
- Order these groups
- First, last groups interact with nearby genes
- Any ordering of internal groups is equally
parsimonious
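The grouping step can be illustrated with a short Python sketch. It is a deliberate simplification: the talk's algorithm chooses the first and last groups as part of the optimal labeling, whereas this hypothetical helper `order_megalocus` simply takes the labels of the neighboring regions as given arguments (`left_label`, `right_label`, both assumptions of this example) and places the matching groups at the ends. Internal groups are left in sorted order, since any internal order is stated to be equally parsimonious.

```python
from collections import defaultdict


def order_megalocus(genes, left_label=None, right_label=None):
    """Group co-located genes by homolog location and order the groups (sketch).

    genes: list of (gene_name, homolog_location) pairs with unresolved order.
    left_label / right_label: assumed labels of the regions on either side.
    Returns a list of (homolog_location, [gene_names]) in the chosen order.
    """
    groups = defaultdict(list)
    for name, loc in genes:
        groups[loc].append(name)

    # Deterministic but otherwise arbitrary order for the internal groups.
    locs = sorted(groups)

    # The group matching the left neighbor goes first.
    if left_label in groups:
        locs.remove(left_label)
        locs.insert(0, left_label)
    # The group matching the right neighbor goes last
    # (if it equals the left label, that group stays where it is).
    if right_label in groups and right_label != left_label:
        locs.remove(right_label)
        locs.append(right_label)

    return [(loc, groups[loc]) for loc in locs]
```

Placing matching groups next to their neighbors avoids paying a segment boundary at each edge of the megalocus, which is why only the first and last groups matter for the score.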
26The reordering algorithm
27The reordering algorithm
28Definitions
- $\delta$ extended to the distance from a label $a$ to a set $A$ of labels:
- $\delta(a, A) = 0$ if $a \in A$,
- $\delta(a, A) = 1$ otherwise
- $S$: the set of indices of supernode start elements
- For simplicity, refer to a supernode by its start index $i \in S$
29Definitions
- For $i \in S$:
- $n_i$: number of markers in $i$
- $n_i(a)$: number of markers in $i$ with a homolog on $a$
- $l_i$: set of labels matching markers in $i$
- $l_i = \{ a \in L : n_i(a) \ge 1 \}$
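These per-supernode quantities are simple counts, as the following Python sketch shows. The function name `supernode_stats` and the list-of-locations input are assumptions made for illustration, and the label set is taken to contain every label with at least one matching marker in the supernode.

```python
from collections import Counter


def supernode_stats(homolog_locs):
    """Count statistics for one supernode (sketch).

    homolog_locs: homolog locations of the markers compressed into the supernode.
    Returns (n_i, n_i_of, l_i):
      n_i     total number of markers in the supernode
      n_i_of  Counter mapping label a to the number of markers with a homolog on a
      l_i     set of labels matched by at least one marker
    """
    n_i = len(homolog_locs)
    n_i_of = Counter(homolog_locs)
    l_i = {a for a, count in n_i_of.items() if count >= 1}
    return n_i, n_i_of, l_i
```

A `Counter` returns 0 for labels it has never seen, which matches the convention that a label with no homolog in the supernode has a count of zero.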
30Definitions
- $p_i(c)$ gives the mismatched-marker and segment-boundary penalties for label $c$
31Definitions
- $p(i,a,b)$ gives the total mismatched-marker and segment-boundary penalties attributed to hidden markers:
- $p(i,a,b) = \sum_{c \ne a,b} p_i(c) + m\,\phi_i(a,b)$ for $i \in S$, $a \ne b$
- $p(i,a,b) = \sum_{c \ne a} m\,n_i(c) + m\,\phi_i(a,b)$ for $i \in S$, $a = b$
- $p(i,a,b) = 0$ otherwise
32Definitions
- For $i \in S$:
- $\Delta_i(a,b)$: number of labels in $\{a,b\}$ without a matching marker in $i$
- $\Delta_i(a,b) = \delta(a, l_i) + \delta(b, l_i)$
- $\Delta_i(a,b) \in \{0, 1, 2\}$
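The count on this slide follows directly from the 0/1 distance between a label and a label set. A minimal Python sketch, with the helper names `delta_set` and `unmatched_labels` chosen here for illustration only:

```python
def delta_set(a, A):
    """0/1 distance from label a to a set A of labels: 0 if a is in A, else 1."""
    return 0 if a in A else 1


def unmatched_labels(a, b, l_i):
    """How many of the labels a and b have no matching marker in supernode i.

    l_i is the set of labels matched by markers in the supernode.
    The result is always 0, 1, or 2; when a == b and a is unmatched,
    the label is counted twice.
    """
    return delta_set(a, l_i) + delta_set(b, l_i)
```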
33Definitions
- $\phi_i(a,b)$ corrects for mismatched-marker penalties assigned twice for the same marker, once in the recurrence and once in $p(i,a,b)$
- For example:
- $\phi_i(a,b) = 0$ if $\Delta_i(a,b) = 0$ (if $a$, $b$ are both represented in the supernode)
- $\phi_i(a,a) = -2$ if $\Delta_i(a,a) > 0$ (if $a$ is not represented in the supernode)
34Recurrence relation
- $S_{i,j,a} = \min$ of:
- $m\,\delta(a, l_i) + \min_{b \in L} \left( S_{i+1,j,b} + s\,\delta(a,b) + p(i,a,b) \right)$
- $\min_{i<k<j,\ k \notin S} \left( S_{i,k,a} + S_{k+1,j,a} \right)$
35Results: Fewer mismatches
Mouse 5 (target), Human (base)
36Results: Mismatches placed between segments
Mouse 8 (target), Human (base)
37Results: Detects new segments
Mouse 13 (target), Human (base)
38Summary
- Finds optimal comparative map
- Arranges markers in most parsimonious way
- First algorithm to use megalocus data
- Fast, objective, simple to use
- Biologically meaningful results
39Summary
- Global view
- Biologically meaningful results
- Provides testable hypotheses
- Robust
- not species-specific
- high/low resolution, genetic/physical maps
- stable to errors in marker order
40Future Directions
- Algorithmic extensions
- 3rd species
- polyploidy
- search for ancient duplications
- Deduce history of evolutionary events
- makes genome rearrangement measures tractable and robust
- infer common ancestor
41Future Directions
- Block-segmental sequence comparisons
- non-local sequence alignment
- protein domains
- 2D block-segmental comparisons
- comparison of regulatory networks
- image processing
42Acknowledgments
- NSF
- AAUW
- David and Lucile Packard Foundation
- USDA
- Cooperative State Research, Education, and Extension Service
- ONR
- Jon Kleinberg
- Susan McCouch
- Chris Pelkie
- Sandra Harrington
- Sam Cartinhour
- Dave Schneider