Identifying conserved spatial patterns in genomes

1
Identifying conserved spatial patterns in genomes

Rose Hoberman

David Sankoff, Dept. of Math and Statistics, University of Ottawa
Dannie Durand, Depts. of Biological Sciences and Computer Science, CMU
Student Seminar Series, Jan 20, 2006
2
The Genome
The complete genetic material of an organism or species
3
Key genomic component: genes
A gene is a DNA subsequence
ACCCTTAGCTAGACCTTTAGGAGG...

Genes encode proteins,
the building blocks of the cell

4
Comparing Genomes
75 Million years

Species      Chromosomes   Genes
Human        23            20-25k
Mouse        20            20-25k
Fly          4             13.6k
Rice         12            40k
E. coli      1             3200
Chlamydia    1             936
5
Accidental duplication of chromosome 21 causes
Down Syndrome
Human Chromosome 21 is broken into at least three
pieces in mouse
6
Outline

Evolution of genome organization
Why identify related genomic regions?
How do we find them?
Identification: Formal cluster definition
Validation: Testing cluster significance

7
A simple model of a chromosome

an ordered list of genes

8
What are the processes of genomic change?
9
A single species
10
Speciation

1. Initially the two populations have identical genomes

2. The populations evolve independently

3. Eventually, there will be two new species with similar but distinct genomes
11
Types of Genomic Rearrangements
Inversions
[Figure: numbered gene order of Species 2]
Duplications/Insertions
Loss
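
As a minimal sketch of these operations, the ordered-list chromosome model can be written in a few lines of Python. The function names and the example gene order are illustrative assumptions, not taken from the talk, and gene orientation is ignored.

```python
def invert(chromosome, start, end):
    """Reverse the order of the genes in chromosome[start:end] (an inversion)."""
    return chromosome[:start] + chromosome[start:end][::-1] + chromosome[end:]


def insert(chromosome, index, gene):
    """Insert a gene (for example a duplicate copy) before position index."""
    return chromosome[:index] + [gene] + chromosome[index:]


def lose(chromosome, index):
    """Delete the gene at position index (gene loss)."""
    return chromosome[:index] + chromosome[index + 1:]


ancestor = [1, 2, 3, 4, 5, 6, 7]       # hypothetical ancestral gene order
inverted = invert(ancestor, 2, 5)      # genes 3, 4, 5 are reversed
duplicated = insert(ancestor, 4, 3)    # a second copy of gene 3 appears
reduced = lose(ancestor, 0)            # gene 1 is lost
print(inverted, duplicated, reduced)
```

Each function returns a new list, so the ancestral order stays available for comparison with the rearranged one.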
12
Types of Genomic Rearrangements
Chromosomal fissions and fusions
[Figure: numbered gene orders on the chromosomes of Species 2]
13
Genome Comparison
[Figure: numbered gene orders of Species 1 and Species 2]
Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor
14
Outline

Evolution of genome organization
Why identify related genomic regions?
How do we find them?
Identification: Formal cluster definition
Validation: Testing cluster significance

15
Genome Annotation Problem
Given the set of genes in the genome, label each
with its function
Gene: ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA
Protein
Cellular Pathway: Glucose Metabolism
16
There are many aspects of gene function

Gene: trpA
Biochemical Function: cleaves a double bond
Cellular Process: amino-acid biosynthesis
Protein-protein interactions: binds trpB

17
There are many aspects of gene function

Gene: a typical gene
Biochemical Function: ?
Biological Process: ?
Protein-protein interactions: ?

40-60% of genes in most genomes have unknown function
Comparisons of spatial organization within
genomes can yield gene function predictions
18
In bacteria, genes in the same pathway often
occur together in the genome
[Figure: Tryptophan synthesis pathway and the trp genes of two bacteria]
Pathway compounds: Chorismate, Anthranilate, N-5-P-ribosyl-anthranilate, 1-2 Carboxy-phenylaminodeoxy-ribulose-5P, 3-Indole Glycerol-P, Tryptophan
E. coli: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis: trpD, trpC, trpB, trpA, trpE, trpF
19
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G shown in different orders in different genomes]
A: Glucose metabolism
B: Glucose metabolism
C: ?
D: Tryptophan synthesis
E: ?
F: ?
G: Tryptophan synthesis
20
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
[Figure: genes A-G shown in different orders in different genomes]
A: Glucose metabolism
B: Glucose metabolism
C: Prediction: Glucose metabolism
D: Tryptophan synthesis
E: ?
F: Prediction: Tryptophan synthesis
G: Tryptophan synthesis
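
The prediction step on this slide can be sketched as a simple "guilt by association" vote. This is only an illustration: the two clusters below are hypothetical groupings assumed for the example (the transcript does not record which genes the figure groups together), and `predict_functions` is a made-up helper name.

```python
from collections import Counter


def predict_functions(clusters, known):
    """Give each unannotated gene the most common known function in its cluster."""
    predictions = {}
    for cluster in clusters:
        votes = Counter(known[gene] for gene in cluster if gene in known)
        if not votes:
            continue  # no annotated neighbor, so nothing to transfer
        label, _ = votes.most_common(1)[0]
        for gene in cluster:
            if gene not in known:
                predictions[gene] = label
    return predictions


# Known annotations from the slide; C, E, and F are unannotated.
known = {
    "A": "Glucose metabolism",
    "B": "Glucose metabolism",
    "D": "Tryptophan synthesis",
    "G": "Tryptophan synthesis",
}
# Hypothetical conserved clusters (an assumption for this example).
clusters = [{"A", "B", "C"}, {"D", "F", "G"}]
predictions = predict_functions(clusters, known)
print(predictions)
```

Gene E belongs to no conserved cluster in this assumed input, so it receives no prediction, which matches the "?" it keeps on the slide.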
21
Outline

Evolution of genome organization
Why identify related genomic regions?
How do we find them?
Identification: Formal cluster definition
Validation: Testing cluster significance

22
Closely related genomes
[Figure: numbered gene orders of Species 1 and Species 2]
Related regions, regions that descended from the
same region in the genome of the common ancestor,
are easy to identify
23
A hundred million years...
24
More Diverged Genomes
[Figure: numbered gene orders of two more diverged genomes]

Related regions are harder to detect, but there
is still spatial evidence of common ancestry
Similar gene content
Neither gene content nor order is perfectly
preserved

25
The signature of diverged regions
[Figure: numbered gene orders of two diverged genomes]

Gene clusters
Similar gene content
Neither gene content nor order is perfectly preserved

26
A Framework for Identifying Gene Clusters

1. Find corresponding genes (given as input)
2. Formally define a gene cluster (review the most common definition)
3. Devise an algorithm to identify clusters
4. Statistically verify clusters (my work)
27
Clusters are signatures of distantly related
regions.

Without functional constraints...
After sufficient time has passed, gene order will
become randomized
Uniform random data tends to be clumpy
some genes will end up proximal in both genomes
simply by chance

Not all clusters have biological significance.
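
A quick simulation illustrates how chance alone produces proximity in both genomes. This is a sketch under assumptions not found in the talk: both genomes hold the same 200 genes, each order is a uniformly random shuffle, and "proximal" means at most g = 5 intervening genes; `close_pairs` is an illustrative helper name.

```python
import random
from itertools import combinations


def close_pairs(genome1, genome2, g):
    """Return gene pairs separated by at most g intervening genes in both genomes."""
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    shared = sorted(set(pos1) & set(pos2))
    return [
        (a, b)
        for a, b in combinations(shared, 2)
        if abs(pos1[a] - pos1[b]) - 1 <= g and abs(pos2[a] - pos2[b]) - 1 <= g
    ]


rng = random.Random(0)            # fixed seed so the run is repeatable
genes = list(range(200))
random_genome1 = genes[:]
random_genome2 = genes[:]
rng.shuffle(random_genome1)
rng.shuffle(random_genome2)
chance_pairs = close_pairs(random_genome1, random_genome2, g=5)
print(len(chance_pairs), "gene pairs are close in both random genomes")
```

Any pairs reported here arise with no shared ancestry at all, which is why clusters need a significance test before they are interpreted.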
28
Cluster Validation via Hypothesis Testing

Null hypothesis: random gene order
Reject gene clusters that could have arisen under
the null model
Clusters that cannot be rejected are likely to be
functionally constrained

29
Outline

Evolution of genome organization
Why find related genomic regions?
How do we find them?
Identification: max-gap cluster definition
Validation: Testing cluster significance

30
A max-gap chain
[Figure: example gene chain, annotated with g (2) and a gap (3)]

The distance or gap between genes is equal to
the number of intervening genes
A set of genes in a genome forms a max-gap chain if
the gap between adjacent genes is never greater
than g (a user-specified parameter)
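
The chain condition translates directly into code. The sketch below assumes a genome is a Python list of gene names in which each gene occurs once; the function name and the example genome are illustrative, not from the talk.

```python
def is_max_gap_chain(genes, genome, g):
    """Return True if the given genes form a max-gap chain in genome.

    The gap between two genes is the number of genes lying between them.
    """
    position = {gene: i for i, gene in enumerate(genome)}
    spots = sorted(position[gene] for gene in genes)
    return all(right - left - 1 <= g for left, right in zip(spots, spots[1:]))


genome = ["a", "x", "b", "c", "y", "z", "w", "d"]   # hypothetical gene order
# a-b has gap 1 and b-c has gap 0, so {a, b, c} is a chain for g = 2.
chain_abc = is_max_gap_chain({"a", "b", "c"}, genome, g=2)
# c-d has gap 3, which exceeds g = 2, so adding d breaks the chain.
chain_abcd = is_max_gap_chain({"a", "b", "c", "d"}, genome, g=2)
print(chain_abc, chain_abcd)
```

Only the gaps between neighboring chain members matter, so the check sorts the positions once and compares consecutive entries.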

31
Max-Gap cluster definition
[Figure: example cluster in two genomes, annotated with g (2) and a gap (3)]

A set of genes forms a max-gap cluster of two
genomes if
the genes form a max-gap chain in each genome
the cluster is maximal (i.e. not contained within
a larger cluster)
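
Both conditions can be checked by brute force on toy inputs. The sketch below is an assumption-laden illustration, not an algorithm from the talk: genes with the same name in the two lists are taken to be corresponding genes (a one-to-one mapping), and maximality is tested by trying every larger set, which is exponential and only suitable for tiny examples.

```python
from itertools import combinations


def is_chain(genes, genome, g):
    """True if no two neighboring chain members have more than g genes between them."""
    position = {gene: i for i, gene in enumerate(genome)}
    spots = sorted(position[gene] for gene in genes)
    return all(right - left - 1 <= g for left, right in zip(spots, spots[1:]))


def is_max_gap_cluster(genes, genome1, genome2, g):
    """True if genes form a chain in both genomes and no larger set also does."""
    genes = set(genes)
    if not (is_chain(genes, genome1, g) and is_chain(genes, genome2, g)):
        return False
    others = (set(genome1) & set(genome2)) - genes
    for k in range(1, len(others) + 1):
        for extra in combinations(sorted(others), k):
            bigger = genes | set(extra)
            if is_chain(bigger, genome1, g) and is_chain(bigger, genome2, g):
                return False  # contained in a larger cluster, so not maximal
    return True


# Hypothetical genomes; u, v, w, p, q, r have no counterpart in the other genome.
genome1 = ["a", "b", "u", "c", "v", "w", "d"]
genome2 = ["c", "a", "p", "b", "q", "r", "d"]
print(is_max_gap_cluster({"a", "b", "c"}, genome1, genome2, g=1))
print(is_max_gap_cluster({"a", "b"}, genome1, genome2, g=1))
```

Testing all larger sets, rather than only one-gene extensions, matters because a set can fail to grow by any single gene and still sit inside a larger cluster.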

32
Max-Gap cluster definition
[Figure: example cluster in two genomes, annotated with a gap (3) and with g (2 and 3)]

A set of genes forms a max-gap cluster of two
genomes if
the genes form a max-gap chain in each genome
the cluster is maximal (i.e. not contained within
a larger cluster)

33
The max-gap definition is the most widely used
cluster definition in genomic analyses

Allows extensive rearrangement of gene order
Allows limited gene insertion and losses

There is no formal statistical model for max-gap
clusters
34
Outline

Evolution of genome organization
Why find related genomic regions?
How do we find them?
Identification: max-gap cluster definition
Validation: Testing cluster significance

35
The Questions
Suppose two whole genomes were compared, and this
max-gap cluster was identified

Is this cluster biologically meaningful?
Could it have occurred in a comparison of random
genomes?

36
The Inputs
h = 4
n: number of genes in each genome
m: number of matching gene pairs
g: the maximum gap allowed in a cluster
h: number of matching genes in the cluster
37
The Problem
h = 4

What is the probability of observing a max-gap cluster containing exactly h matching gene pairs, assuming the genomes are randomly ordered?

38
Probability of a cluster of size h
[Figure: m genes in total; h genes in the cluster and m-h genes outside it]
Basic approach: Enumerate all ways to

Place m-h remaining genes so they do not extend
the cluster

Create chains of h genes in both genomes

Normalize to get a probability

39
Probability of observing a cluster of size h
Probability =
  (number of ways to place h genes so they form a chain in both genomes)
  x (number of ways to place m-h remaining genes so they do not extend the cluster)
  / (all configurations of m gene pairs in two genomes of size n)
40
Total number of configurations of m gene pairs
in two genomes of size n
m genes
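
The slide's formula did not survive in the transcript. One natural way to count the configurations, assumed here rather than taken from the talk, is to choose m positions in each genome and then a one-to-one matching between the chosen positions, giving C(n, m)^2 * m!. The sketch checks that count against explicit enumeration for a tiny case; both function names are illustrative.

```python
from itertools import combinations, permutations
from math import comb, factorial


def total_configurations(n, m):
    """Count placements of m matched gene pairs in two genomes of n positions.

    Assumed convention: pick m positions per genome, then pair them up.
    """
    return comb(n, m) ** 2 * factorial(m)


def total_configurations_brute(n, m):
    """Enumerate the same configurations one by one (tiny n and m only)."""
    count = 0
    for _spots1 in combinations(range(n), m):
        for _spots2 in combinations(range(n), m):
            for _matching in permutations(range(m)):
                count += 1
    return count


print(total_configurations(4, 2), total_configurations_brute(4, 2))
```

If genes are instead treated as individually labeled, every count is multiplied by another factor of m!, which cancels as long as the numerator of the probability uses the same convention.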
41
Probability of observing a cluster of size h
Probability =
  (number of ways to place h genes so they form a chain in both genomes)
  x (number of ways to place m-h remaining genes so they do not extend the cluster)
  / (all configurations of m gene pairs in two genomes of size n)
42
Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Select h spots in each genome, so they form a
max-gap chain
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
43
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 places left
1 2 3 4 5 ... n-L+1 ... n
The maximum length of the chain is L = (h-1)g + h
44
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
45
The number of ways to create a chain of h genes
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Choices for the size of each gap (from 0 to g)
There are h-1 gaps in a chain of h genes
1 2 3 4 5 ... n-L+1 ... n
46
Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
= (starting positions) x (ways to place remaining h-1 genes)
  + (chains with starting positions near the end)
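
The count can be checked numerically. The sketch below is an illustration under assumptions: for starting positions that leave room for the longest chain it uses the product (n-L+1)(g+1)^(h-1) described on the preceding slides, and for starts near the end it simply enumerates the gap choices that still fit, since the talk's own expression for that term is not in the transcript. A brute-force count over all position subsets serves as a cross-check.

```python
from itertools import combinations, product


def count_chains(n, h, g):
    """Count sets of h positions (out of n) that form a max-gap chain."""
    L = (h - 1) * g + h                      # maximum length of a chain
    interior = max(n - L + 1, 0) * (g + 1) ** (h - 1)
    near_end = 0
    for start in range(max(n - L + 2, 1), n + 1):
        room = n - start                     # positions to the right of start
        for gaps in product(range(g + 1), repeat=h - 1):
            if sum(gaps) + (h - 1) <= room:  # the whole chain still fits
                near_end += 1
    return interior + near_end


def count_chains_brute(n, h, g):
    """Same count by testing every subset of h positions (small n only)."""
    total = 0
    for spots in combinations(range(1, n + 1), h):
        if all(b - a - 1 <= g for a, b in zip(spots, spots[1:])):
            total += 1
    return total


print(count_chains(10, 3, 1), count_chains_brute(10, 3, 1))
```

For n = 10, h = 3, g = 1 the maximum length is L = 5, so six starting positions contribute 6 x 4 = 24 chains and the starts near the end contribute the remaining 4.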

47
Probability of a cluster of size h
m-h genes
h genes
Basic approach: Enumerate all ways to

Place m-h remaining genes so they do not extend
the cluster

Create chains of h genes in both genomes

48
Probability of observing a cluster of size h
Probability =
  (number of ways to place h genes so they form a chain in both genomes)
  x (number of ways to place m-h remaining genes so they do not extend the cluster)
  / (all configurations of m gene pairs in two genomes of size n)
49
Counting the number of ways to place m-h genes
outside the cluster
g = 1, h = 3

Approach
design a rule specifying where the genes can be
placed so that the cluster is not extended
count the positions

50
Counting the number of ways to place m-h genes
outside the cluster
[Figure annotations: gaps = 1, g = 1]

Rule 1: A gene can go anywhere except in the
cluster (the white box).

Too lenient
51
Counting the number of ways to place m-h genes
outside the cluster
[Figure annotations: g = 1]

Rule 2: Every gene must be at least g+1 positions
from the cluster (outside the grey box).

Too strict
52
Counting the number of ways to place m-h genes
outside the cluster
[Figure annotations: g = 1, h = 3, gap > 1]

Rule 2: Every gene must be at least g+1 positions
from the cluster (outside the grey box).

Too strict
53
Counting the number of ways to place m-h genes
outside the cluster
[Figure annotations: g = 1, gap > 1]

Rule 3: At most one member of each gene pair can
be in the grey box.

Too lenient
54
Counting the number of ways to place m-h genes
outside the cluster
[Figure annotations: gaps = 1, g = 1]

Rule 3: At most one member of each gene pair can
be in the grey box.

Too lenient
55
Counting the number of ways to place m-h genes
outside the cluster
g = 1

Acceptable positions for a gene depend on the
positions of the remaining genes
Use strict and lenient rules to calculate upper
and lower bounds on G

56
Estimating G

Upper bound
Erroneously enumerates this configuration

Lower bound
Fails to enumerate this configuration

57
Probability of observing a cluster of size h
(number of ways to place h genes so they form a chain in both genomes)
x (number of ways to place m-h remaining genes so they do not extend the cluster)
Hoberman, Sankoff, Durand. Journal of Computational Biology, 2005
58
What can we learn from this statistical result?

Are we less likely to observe a large cluster
(containing more gene pairs) than a small
cluster?
How large does a cluster have to be before we are
surprised to observe it?
How do we choose the maximum allowed gap value?
Larger values will
yield more clusters
more of these will be false positives

59
Whole-genome comparison cluster statistics
n = 1000, m = 250, g = 20
With a significance threshold of 10^-4, any
cluster containing 8 or more genes is significant.
[Plot x-axis: h (cluster size)]
60
Conclusion

Statistical analysis of max-gap gene clusters
Provides a principled approach for choosing a gap
size that will yield significant clusters
Allows statistically significant max-gap clusters
to be identified
Provides insight on criteria for cluster
definitions

61
Odd properties of max-gap clusters

A larger cluster may be less significant

Moving a gene further away may make a cluster
more likely

62
Acknowledgements

Barbara Lazarus Women@IT Fellowship
The Sloan Foundation
The Durand Lab

63
Thanks
64
Questions?
68
Cluster Significance Related Work

Randomization tests
  Requires complete genome (confusing!)
  Not useful for choosing parameter values
Very simple models
  Excessively strict simplifying assumptions
  Overly conservative cluster definitions
A few more general statistical approaches
  Not applicable to max-gap clusters

69
Groups find very different clusters when
analyzing the same data
70
Generative Models of Genome Rearrangement

Construct a probabilistic model specifying rates
for each type of genomic rearrangement
Reject regions that are unlikely to have evolved
via the model
Challenges
Relative rates of rearrangement processes are not
known
requires identification of clusters
Rates may differ significantly
within regions of the genome
between species
over time (e.g. depending on population sizes)

71
Advantages of an analytical approach

Analyzing incomplete datasets
Principled parameter selection
Efficiency?
Accuracy?
Understanding statistical trends
Insight into tradeoffs between definitions

plot graph with fixed cluster size and varying
maximum gap sizes
is it monotonic?
is a function of density and size monotonic?

not capturing
difference in density between max-gap clusters
partially conserved order

74
Identifying gene clusters

Formally define a gene cluster
Devise an algorithm to identify clusters
Verify that clusters indicate common ancestry

...modeling
...algorithms
...statistics
76

These are criteria: size and density.
Hard to capture.
The one I've chosen is widely used, but as we will see at the end of
the talk, it has some problems.

77
Genome: the complete set of genetic material of an organism or species
Chromosome: a double-stranded molecule of DNA
CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTA
GGAGGATTAAAAATGAGGGAGAGGGGC
GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAAT
CCTCCTAATTTTTACTCCCTCTCCCCG
Gene: a protein coding sequence
78
Genome: the complete set of genetic material of an organism or species

TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Genes: protein coding sequences
Large stretches of DNA with unknown function.
as an ordered list of genes
Regions where proteins bind to turn genes on and
off
79
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Example: h = 4 and g = 1
1 2 3 4 5 6 ... n-L+1 ... n
The maximum length of a chain: L = (h-1)g + h
80
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L

Gaps are constrained
And sum of gaps is constrained

A known solution
81
[Figure: a chain of total length l with gaps g1, g2, g3, ..., gm-1]
A known solution
82
Counting chains at the end of the genome

Gaps are constrained
And sum of gaps is constrained

l w-1
l h
83
Ways to place the leftmost gene in the chain, so
there are at least L-1 slots left
Chains near the end of the genome
Ways to place the remaining h-1 genes, so no gap
exceeds g
1 2 3 4 5 ... n-L+1 ... n
1 ... L-h
84
Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
= (starting positions) x (ways to place remaining h-1 genes)
  + (chains with starting positions near the end)

85
Whole-genome comparison cluster statistics
n = 1000, m = 250
[Plot: cluster probability versus h (cluster size), with curves for g = 10 and g = 20]
87
Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
chain in both genomes
number of ways to position h genes so they form a
chain in a single genome
88
Constructive Approach
Number of configurations that contain a cluster
of exactly size h
number of ways to place m-h remaining genes so
they do not extend the cluster
number of ways to position h genes so they form a
cluster in both genomes
89
Building Phylogenetic Trees
Genes may be laterally transferred between
distantly related species
AAACATTTT E. coli
GTCGGTTGG E. coli
AAACATTTA Salmonella
AAACGTTTC Chlamydia
GTCGGTTGC Thermococcus
GTCAGTTGC Methanococcus

Trees are often constructed based on a single
gene
species with the fewest differences between their
gene sequences are grouped together in the tree
The history of a gene may not indicate the
history of the species
Construct trees based on evidence
from the whole genome

90
An Essential Task for Spatial Comparative Genomics
Identify gene clusters, groups of genes that are
derived from the same chromosomal region in an
ancestral genome
[Figure: numbered gene orders of two genomes]
91
Phylogenetic Trees
[Figure: phylogenetic tree of Human, Chimp, Mouse, Rat, Dog, and Possum; time axis runs from 100 to 0 million years ago]

Describe evolutionary relationships between
species
each internal node represents the most recent
common ancestor of the descendants
edge lengths correspond to time estimates.

92
Building Phylogenetic Trees
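[Figure: phylogenetic tree annotated with gene sequences and physiological features]
Human: AAACATTTTA
Chimp: AAATATTTA
Mouse: AACATTTTG
Rat: AACATTTCG
Dog: ATCAGTTGC
Opossum: TGCACTTGT
Features labeled on the tree: opposable thumbs, single pair of incisors, flesh shearing teeth, no placenta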

Trees can be built from
Physiological features
Gene sequences
Spatial genome organization

Species with the fewest differences between their
gene sequences are grouped together
93
Whole-genome phylogenies based on spatial
organization

Find gene clusters
Determine the minimum number of rearrangements
between genome pairs
Use rearrangement distances to build phylogenies

Guillaume Bourque et al. Genome Res. 2004, 14:507-516
94
Conserved spatial organization between distantly
related species suggests functional associations
between the genes
Snel, Bork, Huynen. PNAS 2002
[Figure: gene orders A-E in several genomes, including unannotated genes marked ?]
96
Statistical Testing Provides Additional Evidence
for Common Ancestry

How can we verify that a gene cluster indicates
common ancestry?
True histories are rarely known
Experimental verification is often not possible
Rates and patterns of large-scale rearrangement
processes are not well understood

97
Constructive Approach: Enumerating configurations
that contain a cluster of exactly h gene pairs
h genes
m genes
m-h genes

Select h spots in each genome, so that they form
a max-gap chain
Choose h genes to compose the cluster
Assign each gene to a selected spot in each
genome
Choose the location of the remaining m-h genes so
they don't extend the cluster

98
Where are the gene clusters?

Intuitive notions of what clusters look like
Similar gene content
Neither gene content nor order is perfectly
preserved
Need more rigorous criteria

99
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
l < L
A known solution
but not closed form
100
Ways to place the remaining h-1 genes when the
gaps and length are constrained
1 2 3 4 5 ... n-L+1 ... n
1 ... L-h
101
Future Work

Evaluate
Developed statistical tests for max-gap clusters
identified by whole-genome comparison using a
combinatoric approach
Results raise concerns about current methods used
in comparative genomics studies

102
What characteristics should we use to evaluate a
cluster?

Extent of gene loss/insertion
Density? (constrained by def to 1/g)
Number of insertions/deletions between matches
(constrained to g)
Size of fragment
Number of matching genes (unconstrained)
Degree of rearrangement
Number of order violations (unconstrained)

103
Assumptions

A single, linear chromosome
The mapping between genes is one-to-one

104
Evaluate clusters based on size
[Figure: example cluster annotated with a gap (3) and size = 4]

The size of a cluster is the number of matching
gene pairs it contains

106
Existing Algorithms Impose Order Constraints
g = 2

Typical approaches to finding max-gap clusters
use a greedy, agglomerative algorithm
initialize a cluster as a single matching gene
pair
search for a gene in proximity in both genomes
either extend the cluster and repeat, or
terminate and choose a new seed
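
The greedy procedure described above can be sketched as follows. This is one possible reading of the steps, not any published implementation: a gene counts as "in proximity" when it lies within gap g of some current cluster member in both genomes, genes with the same name are assumed to correspond, and the example gene orders are made up.

```python
def greedy_clusters(genome1, genome2, g):
    """Grow clusters from seeds by adding one nearby gene at a time."""
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    shared = set(pos1) & set(pos2)
    unused = set(shared)
    clusters = []
    for seed in sorted(shared, key=pos1.get):
        if seed not in unused:
            continue                          # already absorbed by a cluster
        cluster = {seed}
        unused.discard(seed)
        extended = True
        while extended:
            extended = False
            for gene in sorted(unused, key=pos1.get):
                near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
                near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
                if near1 and near2:
                    cluster.add(gene)         # extend the cluster and repeat
                    unused.discard(gene)
                    extended = True
                    break
        clusters.append(cluster)              # terminate; move to a new seed
    return clusters


ordered = greedy_clusters(list("abcd"), list("abcd"), g=0)
scrambled = greedy_clusters(list("abcd"), list("bdac"), g=0)
print(ordered, scrambled)
```

In the scrambled example the four genes are contiguous in both genomes, yet no single gene is adjacent to any seed in both, so this sketch never grows past one gene. That is the kind of order constraint the slide title refers to.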

107
Algorithms and Definition Mismatch
g = 2
A max-gap cluster of size four

Agglomerative algorithms will not find highly
disordered max-gap clusters
A divide-and-conquer algorithm has been developed
(Bergeron et al, 2002)
this work is not known by the biological community
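
One way to find disordered clusters is to start from all shared genes and repeatedly split. The sketch below is written in the spirit of a divide-and-conquer approach but is not claimed to be the algorithm of Bergeron et al.; it assumes same-named genes correspond one-to-one, and the helper names are illustrative. A set is split wherever a gap exceeds g in either genome, and a piece that cannot be split in either genome is reported as a cluster.

```python
def split_into_chains(genes, position, g):
    """Split genes into maximal runs whose neighboring gaps are all at most g."""
    ordered = sorted(genes, key=position.__getitem__)
    runs, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if position[gene] - position[prev] - 1 > g:
            runs.append(current)              # gap too large: start a new run
            current = []
        current.append(gene)
    runs.append(current)
    return [set(run) for run in runs]


def max_gap_clusters(genome1, genome2, g):
    """Return all maximal gene sets that form a max-gap chain in both genomes."""
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    shared = set(pos1) & set(pos2)
    clusters, todo = [], ([shared] if shared else [])
    while todo:
        genes = todo.pop()
        parts = split_into_chains(genes, pos1, g)
        if len(parts) == 1:
            parts = split_into_chains(genes, pos2, g)
        if len(parts) == 1:
            clusters.append(genes)            # a chain in both genomes
        else:
            todo.extend(parts)                # refine each piece further
    return clusters


found = max_gap_clusters(list("abcd"), list("bdac"), g=0)
print(found)
```

Splitting never separates genes that belong to a common cluster, because removing other genes from a set cannot shrink the gaps between the genes that remain. The pieces left at the end are therefore exactly the maximal clusters, including highly disordered ones.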

108
Future Work

Generalize the model
Remove the assumption that gene correspondences
are one-to-one
Evaluate clusters based on
density, e.g. size and total gaps
the degree to which order is conserved
Take phylogenetic distance into account
for more closely related species, random gene
order is not a reasonable null hypothesis

109
In bacteria, genes in the same pathway often
occur together in the genome
[Figure: Tryptophan synthesis pathway and the trp genes of two bacteria]
Pathway compounds: Chorismate, Anthranilate, N-5-Phosphoribosyl-anthranilate, Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate, Indole-3-glycerol phosphate, L-Tryptophan
Enzyme labels on the pathway: trpD trpE, trpCF, trpA trpB
E. coli: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis: trpC, trpB, trpA, trpE, trpD, trpF
110
Speciation
An ancestral species: a uniform population
111
Speciation

1. Initially the two populations have identical genomes

2. The populations evolve independently

3. Eventually, there will be two species with similar but distinct genomes

112
Time passes,
more rearrangements accumulate
113
Common blocks are now harder to detect, but there
is still evidence of common ancestry
[Figure: numbered gene orders of two diverged genomes]

Gene clusters
Similar gene content
Neither gene content nor order is perfectly
preserved

114
Gene Clusters
[Figure: numbered gene orders of two diverged genomes]

Intuitive notions of what clusters look like
Similar gene content
Neither gene content nor order is perfectly
preserved
Need more rigorous criteria

115
Genome: the genetic material of an organism or species. Specifies the complete blueprint for the organism
Chromosome: a long double-stranded molecule of DNA
TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAG
AGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTC
TCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG
Gene: a DNA sequence that encodes a protein. Proteins are the building blocks of cells
116

Benoit's outline
example and a little motivation
here are the issues; in order to solve this we need to cluster
ways to cluster exist but we don't know how good they are
want to have a statistical way of measuring it
cluster definition

117
What are the processes of genomic change?

Small-scale point mutations
Change gene sequences
Large-scale genomic rearrangements
Change gene content and order

118
In bacteria, genes in the same pathway often
occur together in the genome
[Figure: Tryptophan synthesis pathway and the trp genes of two bacteria]
Pathway compounds: Chorismate, Anthranilate, N-5-Phosphoribosyl-anthranilate, Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate, Indole-3-glycerol phosphate, Tryptophan
Enzyme labels on the pathway: trpD trpE, trpCF, trpA trpB
E. coli: trpCF, trpD, trpB, trpA, trpE
Bacillus subtilis: trpD, trpC, trpB, trpA, trpE, trpF
119
Human genome
Human Chromosome 21 is broken into at least three
pieces in mouse
Accidental duplication of chromosome 21 causes
Down Syndrome
Human genome
X is scrambled but conserved
Mouse genome as scrambled human genome
Guillaume Bourque et al. Genome Res. 2004, 14:507-516
120
Other applications

build evolutionary trees based on rearrangements
detect ancient whole genome duplications
identify operons
estimate rearrangement frequencies
...

121
Common Blocks
regions that descended from the same region in
the genome of the common ancestor
[Figure: numbered gene orders of Species 1 and Species 2]
122
Common Blocks
are harder to detect between more distantly
related organisms, but there is still evidence of
common ancestry
[Figure: numbered gene orders of Species 1 and Species 2]

Similar gene content
Neither gene content nor order is perfectly
preserved

123
[Figure: numbered gene orders of two diverged genomes]

Gene clusters
Similar gene content
Neither gene content nor order is perfectly
preserved

124
Inputs

Two genomes (i.e., ordered lists of genes)
A mapping of corresponding genes

125
Hypothesis Testing

Null hypothesis: random gene order
Alternate hypothesis: shared ancestry
Reject clusters that could have arisen under the
null model

126
Number of ways to position h genes in a genome
of n genes so they form a max-gap chain
Probability that h randomly placed genes will
form a chain in a genome of n genes
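
The two quantities on this slide are related by a simple ratio: the number of chain-forming position sets divided by the number of all position sets, C(n, h). The sketch below assumes the h genes occupy a uniformly random set of positions; it counts chains by tallying how many gap choices give each total span, which is an illustrative method rather than the talk's own formula.

```python
from math import comb


def count_chains(n, h, g):
    """Count sets of h positions (out of n) with at most g positions between neighbors."""
    # ways[s] = number of ways to choose the h-1 gaps (each 0..g) so they sum to s
    ways = [1]
    for _ in range(h - 1):
        new = [0] * (len(ways) + g)
        for s, w in enumerate(ways):
            for gap in range(g + 1):
                new[s + gap] += w
        ways = new
    # A chain whose gaps sum to s spans s + h positions,
    # so it can start at n - (s + h) + 1 positions (none if it does not fit).
    return sum(w * max(n - (s + h) + 1, 0) for s, w in enumerate(ways))


def chain_probability(n, h, g):
    """Probability that h randomly placed genes form a max-gap chain."""
    return count_chains(n, h, g) / comb(n, h)


for h in (2, 4, 6, 8):
    print(h, chain_probability(1000, h, g=20))
```

The loop uses n = 1000 as on the following slide; the gap value g = 20 is borrowed from the whole-genome example earlier in the talk and is only an assumption here.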

127
Probability of h randomly placed genes forming a
chain
n = 1000 (total genes in genome)
h (size of the chain)
128
Number of ways to place h genes in two genomes so
they form a cluster
h genes
m genes
m-h genes
Choose h genes to compose the cluster
Assign each gene to a selected spot in each genome
Select h spots in a genome, so they form a
max-gap chain
129
Calculating the Numerator: Enumerate the
configurations that contain a cluster of exactly
h gene pairs
h genes
m genes
m-h genes
Assign each gene to a selected spot in each genome

Choose the location of the remaining m-h genes so
they don't extend the cluster

Choose h genes to compose the cluster
Select h spots in a genome, so they form a
max-gap chain
133
Comparing Genomes
75 Million years

Species      Chromosomes   Millions of nucleotides   Genes
Human        23            2900                      20-25k
Mouse        20            2500                      20-25k
Fly          4             180                       13.6k
Rice         12            430                       40k
E. coli      1             4.7                       3200
Chlamydia    1             1                         936
134
Comparing Genomes
75 Million years

Species      Chromosomes   Genes
Human        23            20-25k
Mouse        20            20-25k
Fly          4             13.6k
Rice         12            40k
E. coli      1             3200
Chlamydia    1             936
