Title: Whole Genome Mammalian Clone Sets for
High-Resolution BAC Arrays
Krzywinski M1, Bosdet I1, Smailus D1, Chiu R1,
Mathewson C1, Wye N1, Asano J1, Barber S1,
Brown-John M1, Chan S1, Chand S1, Chittaranjan S1,
Cloutier A1, Fjell C1, Girn N1, Gray C1, Kutsche
R1, Lee D1, Lee SS1, Masson A1, Mayo M1, McLeavy
C1, Olson T1, Pandoh P1, Anna-Liisa Prabhu1, Shin
H1, Spence L1, Stott J1, Taylor S1, Tsai M1, Yang
G1, Albertson D2, Lam W1, Erik Shoenmakers3, Choy
C4, Osoegawa K4, Zhao S5, de Jong P4, Schein J1,
Jones S1, Marra M1
4. Coverage and Redundancy
1. Introduction
3. Clone Set Characteristics
Requirements
The clone sets were designed to:
1. fully represent the underlying genome, as
   determined by representation of the fingerprint
   map and genomic assembly
2. contain about 30,000 clones
3. provide about 2X coverage of the genome
4. be sampled from readily available libraries
5. contain clones whose fingerprints fall within
   3σ of the population distributions of size and
   number of fragments
6. have the identity of each clone validated by
   obtaining high-resolution fingerprints and
   comparing them to those stored in the
   fingerprint map

Libraries
Human: RPCI-11 (91%), RPCI-13 (2%), Caltech-D (7%)
Mouse: RPCI-23 (69%), RPCI-24 (31%)
Rat: CHORI-230 (99%), RPCI-31 (<1%), RPCI-32 (<1%)

Annotation
Human: selection: fpmap Nov 2001, assembly hg11
  July 2002; current analysis: fpmap Nov 2003,
  assembly hg16 July 2003
Mouse: selection: fpmap Jun 2003, assembly mm3
  Feb 2003; current analysis: fpmap Jan 2004,
  assembly mm4 Oct 2003
Rat: selection and analysis: fpmap Jan 2004,
  assembly rn3 Jun 2003
Clone coverage and redundancy are shown in Figure
4 below. The difference in resolution and depth
of coverage between the sets is due to
differences in the sizes of clones in the libraries
from which the clones were sampled. The human
BACs are on average 25% smaller than those from
rat. The human set contains more gaps than the
other two sets, although its gaps are smaller
on average, primarily because portions of the
human assembly are uniquely represented by clones
from a variety of exotic libraries.
                               human                        mouse             rat
rearray size
  number of clones             32,855 (a)                   28,103            27,312
  with paired-end BES          12,598 (38%)                 16,972 (60%)      19,175 (62%)
  sequenced                    7,345 (22%)                  15,561 (55%)      15,989 (59%)
  with sequence coordinates    31,686 (96%)                 27,668 (98%)      26,547 (97%)
clone properties
  clone libraries              RPCI-11, RPCI-13, Caltech-D  RPCI-23, RPCI-24  CHORI-230, RPCI-31, RPCI-32
  avg clone size               147 kb                       172 kb            204 kb
  avg depth of coverage        1.9 x                        2.0 x             2.2 x
  avg clone overlap            73 kb                        91 kb             118 kb
coverage and resolution
  coverage of sequence assembly  99.5%                      99.7%             98.7% (99.1% (b))
  coverage of fingerprint map  98%                          98.9%             99.3%
  average resolution           76 kb                        80 kb             88 kb
rearray status
  rearrayed                    yes                          yes               in progress
  validated                    yes                          in progress
  available                    yes
The ability to detect and localize chromosomal
rearrangements with a high degree of sensitivity
and specificity across an entire genome plays a
major role in the study and classification of
genetic diseases and developmental abnormalities.
Genomic alterations have been implicated in the
growth and progression of cancer, in mental
retardation and in other congenital defects.
Effective study of chromosomal anatomy using
technologies such as FISH and array CGH requires
access to a set of clones representing the genome
with sufficient granularity. Identification and
construction of a clone set that fulfills these
requirements is the first step. To this end, we
have undertaken the selection of BAC clone sets
representing the human, mouse and rat genomes,
with the objective of achieving sub-100kb
resolution. The BAC clone sets are selected using
existing BAC library, fingerprint map and
sequence resources, with the end goal being a
clone set providing comprehensive coverage and
high-resolution sampling. Each full-genome clone
set contains approximately 30,000 BACs. The
clones provide an average of 2X redundant
coverage. Clone identities are verified using
fingerprints. The human BAC clone set has been
selected and arrayed and is available to the
public, providing 76kb sampling resolution and
coverage of 99.5% of the sequenced portion of the
genome. The mouse set, which has coverage
statistics equivalent to the human set, has been
arrayed and will soon undergo clone verification
assessment. We are currently finalizing the clone
selections for the rearray of the rat genome.
The density of these clone sets is an order of
magnitude greater than that of currently
available whole genome CGH arrays, offering the
prospect of detecting smaller chromosomal
rearrangements. Clone lists and annotations for
the sets, as well as this poster, are available
at http://mkweb.bcgsc.ca/bacarray.
We have attempted to use as much information as
possible to determine the precise location of
every clone in the set on the genome assembly.
Some clones, about 3-4% in each set, remain
unlocalized. We expect that this is due to
sequence assembly gaps and regions in the
fingerprint map that are not represented by the
assembly. The coverage statistics of the clone
sets shown in Table 1 are calculated using
our sequence position annotations and fingerprint
data. Because the positions calculated using in
silico anchoring and assembly coordinates tend to
underestimate the full extent of the clone on the
assembly, we expect that the value of coverage in
Table 1 represents a lower limit. Moreover,
because fingerprints measure overlap with lower
sensitivity than sequence information does, the
fingerprint map coverage is also a lower limit.
The coverage and resolution of the three sets
across the genomes are summarized in Figure 3.
Table 1. Summary of the three mammalian BAC
arrays. (a) 32,432 clones in the human set have
been validated by fingerprinting; the remaining
clones were selected during a QA/QC replacement
round and will be validated in the near future.
(b) Excluding chrUn.
2. Methodology of Clone Selection
The aim of creating the BAC rearrays was to
generate a laboratory resource designed for
high-resolution BAC array CGH studies and other
whole-genome investigative approaches to relating
chromosomal changes with phenotypes. These
applications are particularly important to us as
we seek to develop genomic reagents of utility in
cancer research. The sets (a) contain on the
order of 30,000 clones, a number which can be
practically printed onto array slides, (b)
faithfully represent both the fingerprint map and
genomic assembly and (c) incorporate redundancy
in coverage by controlling the amount of overlap
between genome-adjacent rearray selections.
Figure 4. Depth of coverage, resolution and gaps
in the clone sets. (A) Redundancy in the clone
sets is measured by the fraction of the genome
represented by a given number of BACs. The human
clone set has a 1X:2X depth ratio of 1:1, with
approximately 35% of the genome represented by
single BACs and the remaining 65% by two or more
BACs. Clone libraries used to construct the mouse
and rat sets consist of larger clones and,
given that the genomes are roughly the same size
and that the clone sets contain roughly the same
number of clones, both mouse and rat have a
larger average depth of coverage than human. (B)
The effective clone set resolution is measured by
a weighted average of clone covers, where the
weights are the cover sizes. All three sets have
a resolution of approximately 80kb. This means that
if one randomly selects a point on the genome,
50% of the time it will be represented by a clone
cover of 75kb or smaller. (C, D) Gaps in the clone
sets were estimated by locating regions of the
sequence assembly that were not represented by
BACs in the sets. BAC assembly coordinates were
derived from BAC end, Golden Path and in silico
mapping methods. Neither the Golden Path nor the
in silico mapping method necessarily reflects
the full size of the insert of the clone. The
uncertainty in the coordinates obtained by in
silico mapping is approximately 10kb at each end.
The difference between the actual location of the
clone and the Golden Path coordinates is a
function of the fraction of the clone's insert
that has been sequenced. The GenBank sequence
records for sequenced BACs do not always contain
the information required to derive the
coordinates of the full insert.
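The resolution measure described for panel B can be illustrated with a small calculation. The sketch below is not the code used for the poster. It assumes the cover sizes are already known (the values are made up) and uses hypothetical function names. It computes the size-weighted mean cover size and the size-weighted median, i.e. the cover size at or below which a randomly chosen base falls 50% of the time.

```python
def weighted_mean_cover(sizes):
    """Mean cover size where each cover is weighted by its own size."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("need at least one cover of non-zero size")
    return sum(s * s for s in sizes) / total


def weighted_median_cover(sizes):
    """Smallest size m such that covers of size <= m hold at least
    half of all covered bases."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("need at least one cover of non-zero size")
    running = 0
    for s in sorted(sizes):
        running += s
        if 2 * running >= total:
            return s


# Illustrative cover sizes in kb (assumed values, not poster data).
cover_sizes_kb = [20, 40, 60, 80, 100]
mean_kb = weighted_mean_cover(cover_sizes_kb)
median_kb = weighted_median_cover(cover_sizes_kb)
```

Weighting by size reflects the chance that a random genomic position lands in a given cover, which is why large covers dominate the measure.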
5. Data and Clone Set Access
Resolution Depth of assembled sequence coverage
by clones in the sets was calculated using BAC
sequence coordinates obtained from BES
alignments, in silico fingerprint anchoring and
Golden Path assembly information.
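A depth-of-coverage calculation of this kind can be sketched as a sweep over clone start and end coordinates. The example below is an illustration, not the analysis code behind the poster. It assumes half-open (start, end) clone coordinates on a single sequence, and the function name and sample coordinates are hypothetical. It returns the number of bases at each depth and the uncovered intervals (gaps).

```python
from collections import defaultdict


def depth_profile(clones, region_start, region_end):
    """Return ({depth: bases}, [gap intervals]) for a region.

    clones: iterable of half-open (start, end) coordinates.
    """
    events = defaultdict(int)
    for start, end in clones:
        s = max(start, region_start)   # clip clone to the region
        e = min(end, region_end)
        if s < e:
            events[s] += 1             # clone begins: depth goes up
            events[e] -= 1             # clone ends: depth goes down
    bases_at_depth = defaultdict(int)
    gaps = []
    depth = 0
    prev = region_start
    for pos in sorted(events):
        if pos > prev:
            bases_at_depth[depth] += pos - prev
            if depth == 0:
                gaps.append((prev, pos))
        depth += events[pos]
        prev = pos
    if prev < region_end:              # uncovered tail of the region
        bases_at_depth[0] += region_end - prev
        gaps.append((prev, region_end))
    return dict(bases_at_depth), gaps


# Assumed example coordinates (not poster data).
example = [(0, 100), (60, 200), (150, 300), (250, 300), (400, 500)]
profile, gaps = depth_profile(example, 0, 500)
avg_depth = sum(d * n for d, n in profile.items()) / 500
```

Dividing each entry of the profile by the region length gives the fraction of the genome represented by a given number of BACs.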
Figure 1. Rat fingerprint map contig 1012 (top).
Clones selected for the rat rearray are
highlighted in green. Statistics relating
similarity of the fingerprints of adjacent
selections are shown to the right of the selected
clones. UCSC track for the region is shown below.
For each of the sets, the selection was driven by
the fingerprint map (Fig 1) and ancillary clone
annotations in the form of BAC end sequence (BES)
records, BES-based coordinates on the assembly,
and GenBank accession status of the clone. Clones
were selected that (a) had fingerprints that were
typical of the observed population (thus
unusually small or large clones were avoided),
(b) met overlap criteria between adjacent
selections, and (c) were derived from selected
clone libraries (Table 1). The libraries from
which clones were selected are readily available
to investigators and are already found in many
labs. In order to precisely position the rearray
clone selections on their cognate genome, we
localized the clones using a combination of BAC
end coordinates, in silico fingerprint mapping
and sequence assembly coordinates, where
available. During the clone selection process, we
prioritized the selection of clones that were
sequenced and that had BES-based coordinates.
This was done to provide a dense coordinate
scaffold which could be used to localize the
remaining clones. Over 96% of clones in each set
are positioned on the genome (Table 1) and we
expect this value to increase with new versions
of the genome assemblies, in particular for mouse
and rat. To ensure that the correct
clone was selected and correctly rearrayed, the
identity of each rearrayed clone was validated by
fingerprinting. The validation fingerprints were
compared to those stored in the fingerprint map.
Validation fingerprinting has been completed for
the human set and will commence shortly for
the mouse set (Table 1).
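Criterion (a), fingerprints typical of the observed population, could be applied as a simple filter on clone size and fragment count. The sketch below is an assumption about how such a filter might look, not the selection code used for the rearrays. The 3-sigma threshold comes from the stated requirements. The function name, the input layout and the use of the sample itself as the reference population are hypothetical.

```python
from statistics import mean, pstdev


def typical_clones(clones, n_sigma=3.0):
    """Keep clones whose size and fragment count both lie within
    n_sigma standard deviations of the population mean.

    clones: dict mapping clone name -> (size_kb, n_fragments).
    """
    sizes = [size for size, _ in clones.values()]
    frags = [n for _, n in clones.values()]
    mu_s, sd_s = mean(sizes), pstdev(sizes)
    mu_f, sd_f = mean(frags), pstdev(frags)
    keep = []
    for name, (size, n) in clones.items():
        size_ok = abs(size - mu_s) <= n_sigma * sd_s
        frags_ok = abs(n - mu_f) <= n_sigma * sd_f
        if size_ok and frags_ok:
            keep.append(name)
    return keep


# Assumed toy population: 19 ordinary clones and one oversized clone.
population = {f"clone{i:02d}": (150, 30) for i in range(19)}
population["oversized"] = (600, 30)
selected = typical_clones(population)
```

In practice the mean and standard deviation would more plausibly be estimated per library, since the libraries differ in average insert size.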
The resolution of the set was determined by using
the concept of clone covers. The set of covers is
found by intersecting the cover of every clone
with those of all its neighbours. Any base pair
location will be covered by a group of clones.
The cover is the largest contiguous sequence
region covered by the same group of clones.
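The cover definition above can be made concrete with a short sketch. It assumes half-open clone coordinates on one sequence, and the function name and example coordinates are hypothetical. Clone boundaries split the sequence into segments, each segment is labelled with the group of clones spanning it, and adjacent segments with the same group are merged.

```python
def clone_covers(clones):
    """Return covers as (start, end, frozenset_of_clone_names).

    clones: dict mapping clone name -> half-open (start, end).
    Regions covered by no clone are skipped.
    """
    boundaries = sorted({p for s, e in clones.values() for p in (s, e)})
    covers = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Clones that span the whole segment [left, right).
        group = frozenset(
            name for name, (s, e) in clones.items()
            if s <= left and right <= e
        )
        if not group:
            continue
        if covers and covers[-1][2] == group and covers[-1][1] == left:
            # Same group as the previous segment: extend that cover.
            covers[-1] = (covers[-1][0], right, group)
        else:
            covers.append((left, right, group))
    return covers


# Assumed coordinates in kb for three overlapping clones.
covers = clone_covers({"A": (0, 100), "B": (60, 200), "C": (150, 300)})
```

This version rescans every clone for every segment, which is simple but quadratic. A sweep that tracks the active clone set would scale better for roughly 30,000 clones.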
Figure 5. Rearray data portal
(http://mkweb.bcgsc.ca/bacarray).
Clone set data and annotations based on the
latest releases of the assemblies and fingerprint
maps are publicly available for download.
Visualization of clone layout is provided using
tracks in the UCSC Genome Browser (Fig 1). The
human rearray has been available for about one
year, both in the whole-genome and
chromosome-specific sets, from BACPAC Resources.
We anticipate that the mouse set will be
available shortly.
[Figure: four overlapping BACs (A, B, C, D)
resolved into 6 clone covers, numbered 1 to 6]
Consider the example above with four BACs
(A,B,C,D) overlapping in the manner shown. There
are 6 intersections of clones. Thus, the sequence
region can be resolved into 6 regions. For
example, if BACs B and C show positive
hybridization in an experiment, the probe can be
localized to the fourth cover. The smaller the
average size of the cover, the higher the
effective resolution of a clone set.
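The localization step in this example amounts to looking up the set of positive BACs in a table of covers. The sketch below hard-codes one possible layout of the four BACs that yields six covers with B and C sharing the fourth. The coordinates and the clone groups of the other covers are assumptions, since the figure itself is not reproduced here, and the function name is hypothetical.

```python
# (cover number, start_kb, end_kb, clones spanning the cover)
# Assumed layout: A=0-100, B=60-200, C=150-300, D=250-300.
COVERS = [
    (1, 0, 60, frozenset({"A"})),
    (2, 60, 100, frozenset({"A", "B"})),
    (3, 100, 150, frozenset({"B"})),
    (4, 150, 200, frozenset({"B", "C"})),
    (5, 200, 250, frozenset({"C"})),
    (6, 250, 300, frozenset({"C", "D"})),
]


def localize(positive_bacs):
    """Return the covers whose clone group equals the positive BACs.

    An empty list means the pattern matches no single cover, for
    example because the signal spans several covers or is spurious.
    """
    wanted = frozenset(positive_bacs)
    return [(n, s, e) for n, s, e, group in COVERS if group == wanted]


hit = localize({"B", "C"})
```

With this layout, a probe positive for B and C is placed in a 50 kb interval even though each BAC is at least 100 kb long, which is the gain in effective resolution that overlapping clones provide.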
Figure 3. Coverage (top) and resolution (bottom)
of the clone sets, evaluated in 700kb windows.
Acknowledgements
Orthologous Relationships
We have related
orthologous members of all three rearray sets
using whole-genome alignments (Fig 2, Table 2).
In this process, we have identified the
orthologous locations, where possible, of each
BAC from each set on the other two genomes. In
Figure 2, for example, clone A from the human
rearray is aligned to three regions of the mouse
genome. Conversely, clones a, b, c from the mouse
genome are aligned to a region of the human
genome. Alignments were grouped if the distance
difference between adjacent alignment positions
on the source and target genomes was less than
10kb or if both distances were less than
20kb. Table 2 shows the number of BACs from each
array that align to the other genomes and the
total coverage provided by the alignments.
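One reading of the grouping rule above is sketched below. It is an interpretation, not the poster's implementation. Each alignment is reduced to a single (source_pos, target_pos) pair on one source and one target chromosome, alignments are visited in source order, and an alignment joins the current group if the source and target distances to the previous alignment differ by less than 10kb or are both less than 20kb. The function name and data layout are hypothetical.

```python
def group_alignments(alignments, max_diff=10_000, max_dist=20_000):
    """Group (source_pos, target_pos) pairs that are adjacent on
    both genomes.

    Returns a list of groups, each a list of alignment pairs.
    """
    groups = []
    for aln in sorted(alignments):
        if groups:
            prev_src, prev_tgt = groups[-1][-1]
            d_src = aln[0] - prev_src          # distance on source
            d_tgt = abs(aln[1] - prev_tgt)     # distance on target
            similar = abs(d_src - d_tgt) < max_diff
            both_close = d_src < max_dist and d_tgt < max_dist
            if similar or both_close:
                groups[-1].append(aln)
                continue
        groups.append([aln])
    return groups


# Assumed example positions in bp (not poster data).
example = [
    (0, 1_000_000),
    (15_000, 1_012_000),    # distances 15kb vs 12kb: similar
    (100_000, 1_095_000),   # distances 85kb vs 83kb: similar
    (150_000, 5_000_000),   # far away on the target: new group
    (165_000, 5_001_000),   # both distances under 20kb
]
grouped = group_alignments(example)
```

Comparing each alignment only with its immediate predecessor keeps the sketch short. Rearranged or inverted regions would need strand and chromosome information that is omitted here.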
                                         source rearray
target genome  measure              human         mouse         rat
human          source BACs (a)      31,686 (c)    21,498 (78%)  18,808 (71%)
               target coverage (b)  2.79 Gb       1.61 Gb (58%) 1.52 Gb (54%)
mouse          source BACs (a)      23,887 (75%)  27,668        22,068 (83%)
               target coverage (b)  1.37 Gb (55%) 2.49 Gb       2.08 Gb (84%)
rat            source BACs (a)      27,716 (87%)  23,891 (86%)  26,547
               target coverage (b)  1.19 Gb (45%) 2.04 Gb (77%) 2.66 Gb
Author Affiliations
1. Genome Sciences Centre, British Columbia Cancer
   Research Centre, 600 W 10th Avenue, Vancouver BC
   V5Z 4E6, Canada, www.bcgsc.ca
2. Cancer Research Institute, Box 0808, University
   of California at San Francisco, San Francisco CA
   94143-0808, USA, cc.ucsf.edu
3. Human Genetics, 417, University Medical Center
   Nijmegen, P.O. Box 9101, 6500 HB Nijmegen, The
   Netherlands
4. BACPAC Resources, Children's Hospital Oakland
   Research Institute, 747 52nd St., Oakland CA
   94609, USA, bacpac.chori.org
5. The Institute for Genomic Research, 9712 Medical
   Center Drive, Rockville MD 20850, USA,
   www.tigr.org

Resources
Clone libraries: RPCI-11/13 www.chori.org/bacpac,
  CalTechD www.tree.caltech.edu
BAC physical maps: Washington University Genome
  Sequencing Centre www.genome.wustl.edu, Genome
  Sciences Centre www.bcgsc.ca/lab/mapping
Sequence assemblies: NCBI www.ncbi.nlm.nih.gov,
  Baylor Human Genome Sequencing Centre
  www.hgsc.bcm.tmc.edu, UCSC Genome Project
  genome.ucsc.edu
BAC end database: TIGR www.tigr.org

Funding
NHGRI, Genome Canada, Genome BC
Table 2. Number of BACs and coverage in
orthologous relationships. In the example of
mouse BACs aligned to the human genome, 21,498
BACs from the mouse rearray (78% of the 27,668
mouse BACs localized to the mouse genome) have
alignments to the human genome and the alignments
provide 1.61 Gb of coverage (58% of the human
genome). (a) Relative to the number of BACs in the
cognate source rearray. (b) Relative to the size of
the target genome. (c) Diagonal cells contain the
number of BACs in the source rearray that have
localizations to the source genome, along with the
total detected coverage by these BACs.
Figure 2. Orthologous relationships are
constructed when a clone set from one genome is
projected onto another genome using whole-genome
alignments.