Title: Population Genetics Bovine genome analysis
1Bioinformatics and Computational Biology: Advancing biocomputing to meet human needs
Genome Variation: Bovine genome analysis
Rafael Villa-Angulo
September 2007
2Bovine HapMap
- Basic concepts
- Bovine HapMap data
- Haplotype analysis
- Allele sharing
- Population structure
31. Basic concepts
In diploid organisms (such as humans and chimps)
there are two (not completely identical) "copies"
of each chromosome, and hence of each region of
interest. A description of the data from a single
copy is called a haplotype, while a description of
the conflated (mixed) data on the two copies is
called a genotype (Gusfield 2002). The specific
physical appearance and constitution, or the
specific manifestation of a trait, is called the
phenotype.
41. Basic concepts (cont)
Fig 2 Haplotype analysis to identify DNA
variations associated with a disease
51. Haplotyping process
Let $G = \{g_1, \ldots, g_n\}$ be a set of $n$
genotypes, where each $g_i$ consists of the
combined allele information of $m$ SNPs, $s_1, \ldots, s_m$.
Each $g_i \in G$ is a vector of size $m$ whose
$j$th element $g_{ij}$ ($i = 1, \ldots, n$ and
$j = 1, \ldots, m$) is defined as

$$
g_{ij} =
\begin{cases}
0 & \text{when the two alleles of SNP } s_j \text{ are major homozygous,} \\
1 & \text{when the two alleles of SNP } s_j \text{ are minor homozygous,} \\
2 & \text{when the two alleles of SNP } s_j \text{ are heterozygous.}
\end{cases}
$$
61. Haplotyping process
- Let $H = \{h_1, h'_1, \ldots, h_n, h'_n\}$ be the set of
all (unknown) haplotypes, where each pair
$(h_i, h'_i)$ corresponds to the two haplotypes
conflated in genotype $g_i$.
Each $h_i, h'_i \in H$ is a vector of size $m$ whose
$j$th elements $h_{ij}$ and $h'_{ij}$ ($i = 1, \ldots, n$ and
$j = 1, \ldots, m$) each take the value 0 when the
corresponding allele of SNP $s_j$ is major, and 1
when the corresponding allele of SNP $s_j$ is minor.
71. Haplotyping process
The haplotype phasing problem can be defined as
follows.
Problem: Haplotype Phasing (generic)
Input: A set $G$ of genotypes.
Output: A set $H$ of haplotypes such that for each
$g_i \in G$ there exist $h_i, h'_i \in H$ such that the
conflation of $h_i$ with $h'_i$ is $g_i$.
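The conflation step in the problem statement can be made concrete with a minimal sketch. It assumes the 0/1 haplotype coding and the 0/1/2 genotype coding defined on the earlier slides; the function names `conflate` and `resolves` are illustrative and not taken from any published tool.

```python
def conflate(h1, h2):
    """Combine two haplotypes (0 = major allele, 1 = minor allele) into a genotype.

    Genotype coding assumed from the slides:
    0 = major homozygous, 1 = minor homozygous, 2 = heterozygous.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles give a homozygous call (0 or 1); different alleles give 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def resolves(h1, h2, genotype):
    """Return True if the pair (h1, h2) is a valid phasing of the genotype."""
    return conflate(h1, h2) == list(genotype)


example_genotype = conflate([0, 1, 0, 1], [0, 1, 1, 0])
```

Note that two different haplotype pairs can conflate to the same genotype, which is the source of the ambiguity discussed on the next slide.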
81. Haplotyping process
- The solution to the haplotype phasing problem is
not straightforward due to resolution ambiguity
- Computational and statistical algorithms for
addressing ambiguity in haplotype phasing
- 1) parsimony
- 2) phylogeny
- 3) maximum-likelihood
- 4) Bayesian inference
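To illustrate the resolution ambiguity, the sketch below enumerates every unordered haplotype pair compatible with one genotype. A genotype with k heterozygous sites (k >= 1) has 2^(k-1) such pairs. The function name `compatible_pairs` is hypothetical, and the 0/1/2 coding is the one assumed throughout these slides.

```python
from itertools import product


def compatible_pairs(genotype):
    """List every unordered haplotype pair that conflates to the genotype."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    # Homozygous sites are copied as-is; heterozygous sites get a placeholder 0.
    base = [g if g != 2 else 0 for g in genotype]
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Fix the first heterozygous site to 0 on h1 so each unordered pair appears once.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, (0,) + bits):
            h1[j], h2[j] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs
```

With three heterozygous sites there are already four candidate resolutions, so the genotype data alone cannot identify the true pair; the four families of methods listed above differ in how they choose among the candidates.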
91. Haplotyping process
- Parsimony-based methods (e.g., Clark)
- Assume target population shares a relatively
small number of common haplotypes due to linkage
disequilibrium. Then resolve an ambiguous
genotype using one of already identified
haplotypes.
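The parsimony idea can be sketched as a simplified version of Clark's inference rule. This is a teaching sketch under stated assumptions (0/1/2 genotype coding, first compatible known haplotype wins), not the published implementation; `clark_phase` and `complement` are illustrative names.

```python
def clark_phase(genotypes):
    """Resolve genotypes greedily using already identified haplotypes."""

    def complement(g, h):
        # Haplotype h2 such that conflating h with h2 gives g,
        # or None when h is incompatible with g.
        h2 = []
        for gj, hj in zip(g, h):
            if gj == 2:
                h2.append(1 - hj)
            elif gj == hj:
                h2.append(hj)
            else:
                return None
        return tuple(h2)

    known = []      # identified haplotypes, in order of discovery
    resolved = {}   # genotype index -> (h1, h2)

    # Step 1: genotypes with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if list(g).count(2) <= 1:
            h1 = tuple(0 if gj == 2 else gj for gj in g)
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)

    # Step 2: resolve ambiguous genotypes with a known haplotype, adding the
    # newly inferred complement to the known set, until nothing changes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                h2 = complement(g, h)
                if h2 is not None:
                    resolved[i] = (h, h2)
                    if h2 not in known:
                        known.append(h2)
                    progress = True
                    break
    return resolved
```

Genotypes that match no known haplotype stay unresolved (they are simply absent from the returned dictionary), and the result can depend on the order in which genotypes and haplotypes are visited.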
101. Haplotyping process
- Phylogeny-based Methods (e.g., MERLIN)
Assume haplotypes in a population evolved along
the coalescent, a rooted tree describing the
evolutionary history of a set of DNA sequences.
Thus, aim to find haplotypes that resolve target
genotype data and follow the coalescent model as
well.
111. Haplotyping process
- Maximum-Likelihood-based methods (e.g.,
fastPHASE)
- Treat genotypes as observed but incomplete data
with unknown haplotypes. Estimate haplotype
frequencies $\theta$ based on their likelihood, $L(\theta)$,
given the genotype data $D$.
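The "observed but incomplete data" view can be illustrated with the classic expectation-maximization (EM) estimate of haplotype frequencies under random mating. This is a minimal sketch of that textbook EM scheme, not the cluster-based model that fastPHASE actually uses; the function names and the fixed iteration count are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """Yield every ordered haplotype pair compatible with a 0/1/2 genotype."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    for bits in product([0, 1], repeat=len(het)):
        h1 = [0 if g == 2 else g for g in genotype]
        h2 = list(h1)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b
        yield tuple(h1), tuple(h2)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies (theta) by EM, assuming random mating."""
    expansions = [list(phasings(g)) for g in genotypes]
    haps = {h for pairs in expansions for pair in pairs for h in pair}
    theta = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: weight each phasing by its probability under the current theta.
        for pairs in expansions:
            weights = [theta[a] * theta[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return theta
```

The enumeration grows exponentially with the number of heterozygous sites, so a sketch like this is only practical for short windows of SNPs.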
121. Haplotyping process
- Bayesian Inference-based Methods (e.g., PHASE)
- Regard the unknown haplotypes as unobserved
random quantities and aim to evaluate their
conditional distribution in light of the genotype
data
1. Construct a Markov chain Monte Carlo algorithm
to estimate $H$ from the observed genotype data $G$,
taking account of the rate of decay of LD with
distance and the underlying recombination rates
$\rho$.
2. Use a Markov chain Monte Carlo algorithm to
infer haplotypes. Sample from the conditional
distribution of the haplotypes and recombination
parameters given the genotype data, $\Pr(H, \rho \mid G)$.
132 Bovine HapMap data
- The bovine HapMap will be a catalog of common
genetic variants that occur in cattle.
- It will describe what these variants are, where
they occur in chromosome regions, and how
they are distributed among individuals within
populations and among populations in different
breeds.
142. Bovine HapMap data (cont)
Bovine
- Bos taurus, beef breeds: Charolais, Limousin, Piedmontese, Romagnola, Hereford, Angus, Red Angus
- Bos taurus, dairy breeds: Brown Swiss, Guernsey, Holstein, Jersey, Norwegian Red
- Bos indicus: Gir, Nelore, Brahman
- African: Sheko, N'Dama
- Composite: Beefmaster, Santa Gertrudis
152. Bovine HapMap data (cont)
- A total of 501 animals were genotyped,
representing 21 cattle breeds.
- The breeds included a combination of Bos taurus,
Bos indicus and composites from several
continents.
162. Bovine HapMap data (cont)
Table 1. Initial number of animals per breed in
the HapMap data
Breed          No. of individuals    Breed              No. of individuals
Charolais      24                    Jersey             28
Limousin       42                    Norwegian Red      25
Piedmontese    24                    Gir                24
Romagnola      24                    Nelore             24
Hereford       27                    Brahman            25
Angus          27                    Beefmaster         24
Red Angus      12                    Santa Gertrudis    24
Brown Swiss    24                    Sheko              20
Guernsey       21                    N'Dama             25
Holstein       53                    Buffalo and Anoa   2 each
172. Bovine HapMap data (cont)
Table 2. Initial number of markers per chromosome
in the HapMap data
Chromosome Markers Chromosome Markers Chromosome Markers
1 1537 11 1242 21 669
2 1512 12 879 22 698
3 1316 13 999 23 588
4 1320 14 2794 24 694
5 1246 15 851 25 1208
6 2485 16 885 26 602
7 1101 17 828 27 498
8 1224 18 660 28 520
9 1018 19 690 29 479
10 1139 20 882 X 573
182. Bovine HapMap data (cont)
- Cleaning the dataset
- Missing data
- - Animals with genotype completeness < 89% were
removed (normalized to the most complete
individual).
- - Markers removed due to greater than 10%
missing data, > 50% of Taurus and > 50%
Indicus.
- HWE test and genotyping error
- - Markers removed due to estimated genotyping
error rate > 5% and at least one breed out of
HWE.
192. Bovine HapMap data (cont)
- Minor Allele Frequency
- - Markers removed for being monomorphic in all
breeds.
- - Markers removed due to MAF < 0.05 in all breeds.
- Discordance
- - Markers were removed due to having > 2
discordant trios.
- Unassigned chromosomes
- - Markers assigned to unknown chromosomes were
removed.
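The minor allele frequency filter can be sketched as follows. The sketch assumes the 0/1/2 genotype coding from the earlier slides, uses `None` for a missing call, and takes one list of calls per breed; the function names and the dictionary layout are illustrative assumptions, not the pipeline actually used for the HapMap data.

```python
def minor_allele_frequency(calls):
    """MAF at one marker within one breed; None entries are missing calls.

    Coding assumed: 0 = major homozygous, 1 = minor homozygous, 2 = heterozygous.
    """
    observed = [g for g in calls if g is not None]
    if not observed:
        return None
    # Copies of the allele coded "minor": two per code 1, one per code 2.
    copies = sum(2 if g == 1 else 1 if g == 2 else 0 for g in observed)
    p = copies / (2 * len(observed))
    # Within a single breed either allele may be the rarer one.
    return min(p, 1 - p)


def keep_marker(calls_by_breed, maf_threshold=0.05):
    """Drop a marker only when MAF < threshold in every breed.

    A marker that is monomorphic in a breed has MAF 0 there, so markers
    monomorphic in all breeds are dropped by the same rule.
    """
    mafs = [minor_allele_frequency(c) for c in calls_by_breed.values()]
    mafs = [m for m in mafs if m is not None]
    return any(m >= maf_threshold for m in mafs)
```

Under this rule a marker fixed for different alleles in different breeds is also dropped, because its MAF is 0 within each breed.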
202. Bovine HapMap data (cont)
- Summary
- The final dataset contains 29,394 markers from
487 animals.
213. Haplotype analysis
- Haplotype inference
- A pair of haplotypes was estimated for each
animal in each breed using fastPHASE, a
Maximum-Likelihood-based method (Scheet and
Stephens 2006)
223. Haplotype analysis (cont)
- Results
- - 570 files containing inferred haplotypes
-
- 487 individuals
- 30 chromosomes per individual
- 2 haplotypes per chromosome
233. Haplotype analysis (cont)
- Genetic variation in genomes is organized in
haplotype blocks (Guryev et al., 2006).
- Haplotype maps characterize the common patterns
of linkage disequilibrium in populations.
- Haplotype blocks provide substantial statistical
power in association studies of common genetic
variation across regions (Gabriel et al., 2002).
243. Haplotype analysis (cont)
- Block definition
- Blocks based on pairwise and grouped r² values.
- (i) Begin a block by selecting the pair of
adjacent SNPs with the highest r² value (no
less than 0.4)
- (ii) Repeatedly extend the block if the
average r² value between an adjacent marker and
the current block members is above 0.3 and
all the individual r² values are above 0.1.
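Steps (i) and (ii) can be sketched as a greedy procedure over a precomputed pairwise r² matrix. The slide describes how a single block is grown; repeating the procedure on the markers that remain unassigned, and trying both the left and the right neighbor at each extension, are assumptions made here for illustration. The name `build_blocks` and its parameters are hypothetical.

```python
def build_blocks(r2, seed_min=0.4, avg_min=0.3, each_min=0.1):
    """Greedy haplotype blocks from a symmetric matrix of pairwise r2 values.

    r2[a][b] is the r2 between markers a and b, with markers in map order.
    Returns a sorted list of (first, last) marker indices, inclusive.
    """
    m = len(r2)
    assigned = [False] * m
    blocks = []
    while True:
        # (i) Seed: the unassigned adjacent pair with the highest r2 (>= seed_min).
        seeds = [(r2[j][j + 1], j) for j in range(m - 1)
                 if not assigned[j] and not assigned[j + 1]
                 and r2[j][j + 1] >= seed_min]
        if not seeds:
            break
        _, left = max(seeds)
        right = left + 1
        # (ii) Extend while a neighboring marker passes both thresholds.
        extended = True
        while extended:
            extended = False
            for cand in (left - 1, right + 1):
                if 0 <= cand < m and not assigned[cand]:
                    values = [r2[cand][k] for k in range(left, right + 1)]
                    if sum(values) / len(values) > avg_min and min(values) > each_min:
                        left, right = min(left, cand), max(right, cand)
                        extended = True
        for k in range(left, right + 1):
            assigned[k] = True
        blocks.append((left, right))
    return sorted(blocks)
```

Markers that never join a seed pair are left outside any block.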
253. Haplotype analysis (cont)
Table 3 Average values of block statistics
263. Haplotype analysis (cont)
- Figure 23 LD assessment for all SNPs inside
blocks
273. Haplotype analysis (cont)
- Figure 25 LD assessment for all SNPs inside
blocks
283. Haplotype analysis (cont)
- Consistency of block boundaries across breeds
- 1. Adjacent pairs of SNPs with
intermarker distances up to 10 kb were
examined.
- 2. If the SNP pair is assigned to a single
block, count it as concordant (no evidence of
historical recombination).
- 3. If the SNP pair is not assigned to a single
block, count it as discordant (evidence of
recombination).
- (Gabriel et al., 2002)
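The pair classification above can be sketched in two steps: label each close adjacent SNP pair within one breed, then compare the labels between two breeds. The between-breed summary (the fraction of examined pairs whose labels differ) is an assumption about how the comparison might be scored, and the function names and input layout are hypothetical.

```python
def pair_status(positions, blocks, max_dist=10_000):
    """Classify adjacent SNP pairs at most max_dist bp apart, for one breed.

    positions: sorted base-pair coordinates of the markers.
    blocks: list of (first, last) marker indices, inclusive.
    Returns {j: True if markers j and j + 1 fall in the same block}.
    """
    block_of = {}
    for b, (first, last) in enumerate(blocks):
        for k in range(first, last + 1):
            block_of[k] = b
    status = {}
    for j in range(len(positions) - 1):
        if positions[j + 1] - positions[j] <= max_dist:
            a, c = block_of.get(j), block_of.get(j + 1)
            # True = concordant (single block), False = discordant.
            status[j] = a is not None and a == c
    return status


def boundary_discordance(status_a, status_b):
    """Fraction of examined pairs labeled differently in two breeds."""
    shared = status_a.keys() & status_b.keys()
    if not shared:
        return None
    return sum(status_a[j] != status_b[j] for j in shared) / len(shared)
```

A matrix of such pairwise values between breeds is one possible input for the kind of clustering shown in the dendrograms that follow.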
293. Haplotype analysis (cont)
(a) (b) (c)
Figure 26 Concordance and discordance assignments
for SNP pairs within distance < 10 kb for Angus
vs Holstein breeds.
303. Haplotype analysis (cont)
Figure 28 Dendrogram based on haplotype boundary
discordances in chromosome 14
313. Haplotype analysis (cont)
Figure 29 Dendrogram based on haplotype boundary
discordances in chromosome 25
323. Haplotype analysis (cont)
- Principal Component Analysis from haplotype blocks
among all breeds
Figure 29 PCA1 against PCA2 in chromosome 6 based
on haplotype blocks
333. Haplotype analysis (cont)
Figure 30 PCA1 against PCA2 in chromosome 14
based on haplotype blocks
Figure 31 PCA1 against PCA2 in chromosome 25
based on haplotype blocks
343. Haplotype analysis (cont)
Figure 32 Breeds sorted by PC1 derived from
haplotype block vectors on chromosomes 6, 14,
and 27
354. Population structure
- 4.1 MAF and nucleotide diversity
- 4.2 Linkage Disequilibrium
- 4.3 Genetic differentiation
364. Population structure (cont)
- MAF (all breeds polymorphic proportion graphs)
374. Population structure (cont)
- Principal Component Analysis (PCA) from MAF
distribution among all breeds
Figure 3 PCA1 against PCA2 in chromosome 6 using
minor allele frequencies
384. Population structure (cont)
Figure 4 PCA1 against PCA2 in chromosome 14 using
minor allele frequencies
Figure 5 PCA1 against PCA2 in chromosome 25 using
minor allele frequencies
394. Population structure (cont)
Figure 6 Breeds sorted by PC1 derived from minor
allele frequencies on chromosomes 6, 14, and 25
404. Population structure (cont)
- Linkage Disequilibrium Analysis
Figure 8 r² plot for chromosome 6 in Angus breed
using all markers passing a χ² test.
414. Population structure (cont)
Figure 9 r² plot for chromosome 14 in Angus breed
using all markers passing a χ² test.
Figure 10 r² plot for chromosome 25 in Angus
breed using all markers passing a χ² test.
424. Population structure (cont)
Figure 11 Genetic differentiation (FST) by marker
in chromosome 6 between Beef and Dairy breeds
434. Population structure (cont)
Figure 12 Genetic differentiation (FST) by
marker in chromosome 14 between Beef and Dairy
breeds
Figure 13 Genetic differentiation (FST) by marker
in chromosome 25 between Beef and Dairy breeds
444. Population structure (cont)
Figure 14 Genetic differentiation (FST) by marker
in chromosome 6 between Bos Taurus and Bos
Indicus clusters
454. Population structure (cont)
Figure 15 Genetic differentiation (FST) by marker
in chromosome 14 between Bos Taurus and Bos
Indicus clusters
Figure 16 Genetic differentiation (FST) by marker
in chromosome 25 between Bos Taurus and Bos
Indicus clusters
465. Allele sharing analysis
- Multi-marker allele sharing on chromosomes with
dense markers (6, 14, 25)
- Allele defined as the haplotype observed within
a sliding window containing w = 10 adjacent
markers spanning no more than 200 kb
- Each window containing 10 markers and spanning no
more than 200 kb defines a single locus
- Loci may overlap
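The locus definition can be sketched as a sliding window over marker positions. The sketch assumes positions in base pairs, sorted along the chromosome, and one phased haplotype given as a list of 0/1 alleles; the function names are illustrative.

```python
def sliding_loci(positions, w=10, max_span=200_000):
    """Start indices of windows of w adjacent markers spanning <= max_span bp."""
    return [s for s in range(len(positions) - w + 1)
            if positions[s + w - 1] - positions[s] <= max_span]


def window_alleles(haplotype, starts, w=10):
    """Multi-marker allele (a tuple of w SNP alleles) at each locus."""
    return [tuple(haplotype[s:s + w]) for s in starts]
```

Because the window advances one marker at a time, consecutive loci share w - 1 markers, which is why loci may overlap.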
475. Allele sharing analysis (cont)
- Proportion of shared alleles between two
populations P1 and P2 at locus k
Where i and j range over the individuals in
populations P1 and P2 , respectively. Sa(i, j, k)
is the number of shared alleles between
individuals i and j at locus k, and n1 and n2 are
the number of samples in P1 and P2.
48Allele sharing analysis (cont)
- Normalized proportion of shared alleles
- S(P1, P2, k) = 1.0 when the proportion of
shared alleles between P1 and P2 equals the
average of the proportions of shared alleles
within P1 and P2.
- S(P1, P2, k) << 1.0 when the proportion of
shared alleles between the two populations is
much less than the average within the two
populations.
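The normalization described above can be sketched for a single locus. Several details are assumptions, since the slides do not spell them out: each individual is represented by its two multi-marker alleles, the proportion of shared alleles is the mean over pairs of individuals of (shared alleles / 2), and within-population pairs exclude an individual paired with itself. All function names are hypothetical.

```python
from collections import Counter


def shared_alleles(ind_i, ind_j):
    """Number of alleles (0, 1 or 2) two diploid individuals share at a locus."""
    return sum((Counter(ind_i) & Counter(ind_j)).values())


def proportion_shared(pop1, pop2, within=False):
    """Mean of shared_alleles / 2 over pairs of individuals from pop1 and pop2.

    With within=True (pop1 and pop2 are the same population), an individual
    is not compared with itself; the population then needs >= 2 individuals.
    """
    total, n_pairs = 0, 0
    for a, x in enumerate(pop1):
        for b, y in enumerate(pop2):
            if within and a == b:
                continue
            total += shared_alleles(x, y)
            n_pairs += 1
    return total / (2 * n_pairs)


def normalized_shared(pop1, pop2):
    """Between-population sharing divided by the average within-population sharing."""
    avg_within = (proportion_shared(pop1, pop1, within=True)
                  + proportion_shared(pop2, pop2, within=True)) / 2
    # Undefined (division by zero) when no alleles are shared within either population.
    return proportion_shared(pop1, pop2) / avg_within
```

Values near 1.0 then indicate that the two populations share alleles about as often as individuals within each population do, and values well below 1.0 indicate differentiation at that locus.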
49Allele sharing analysis (cont)
Figure 17 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 6
50Allele sharing analysis (cont)
Figure 18 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 14
Figure 19 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 25
51Allele sharing analysis (cont)
- Clustering Breeds Based on shared alleles
- The proportion of shared alleles can be used as
a distance measure for clustering breeds.
- Normalized distance between P1 and P2
where u is the number of loci with shared alleles
- D(P1, P2) = 0 if breeds P1 and P2 share the same
proportion of alleles as are shared by
individuals within each individual breed.
52Allele sharing analysis (cont)
Figure 20 Dendrogram based on shared alleles on
chromosome 6.
53Allele sharing analysis (cont)
Figure 21 Dendrogram based on shared alleles on
chromosome 14.
54Allele sharing analysis (cont)
Figure 22 Dendrogram based on shared alleles on
chromosome 25.
55Preliminary conclusions
- Regions of high differentiation between breeds and
breed clusters were identified from FST analysis
- PCA on MAF and block boundary
discordances allows breeds to be clustered into
geographical and ancestry groups
- The proportion of shared alleles between breeds
exhibits considerable variation that is
significantly auto-correlated, indicating
possible effects of selection
56Future Directions
- Improved haplotype algorithms for bovine data
sets
- Comparison of haplotype inference algorithms
- Further analysis of bovine genome variation
57Collaborators
Dr. John Grefenstette, Dept. of Bioinformatics and Computational Biology, GMU
Dr. Lakshmi Kumar, Dept. of Bioinformatics and Computational Biology, GMU
USDA
Dr. Clare Gill and Jungwoo Choi (PhD student), Dept. of Animal Science, Texas A&M University
58Bibliography
- Hartl, D. L. (2000) A Primer of Population
Genetics. Third edition. Sinauer Associates, Inc.
- Gusfield, D. (2002) An Overview of
Combinatorial Methods for Haplotype Inference.
Computational Methods for SNPs and Haplotype
Inference. Springer. LNBI 2983.
- Scheet, P. and Stephens, M. (2006) A fast and
flexible statistical model for large-scale
population genotype data: applications to
inferring missing genotypes and haplotypic phase.
Am J Human Genetics 78(4), 629-644.
- Gibson, G. and Muse, S. V. (2004) A Primer of
Genome Science. Second Edition. Sinauer
Associates, Inc.
- Wright, S. (1951) The genetical structure of
populations. Annals of Eugenics 15, 323-354.
- Wright, S. (1965) The interpretation of population
structure by F-statistics with special regard to
systems of mating. Evolution 19, 395-420.
59Bibliography (cont)
- Wright, S. (1978) Evolution and the Genetics of
Populations. Vol. 4. Variability Within and Among
Natural Populations. Univ. of Chicago Press,
Chicago.
- Michalatos-Beloin, S., Tishkoff, S. A., Bentley,
K. L., Kidd, K. K. and Ruano, G. (1996) Molecular
haplotyping of genetic markers 10 kb apart by
allele-specific long-range PCR. Nucleic Acids
Res. 24, 4841-4843.
- Douglas, J. A., Boehnke, M., Gillanders, E.,
Trent, J. M. and Gruber, S. B. (2001) Experimentally
derived haplotypes substantially increase the
efficiency of linkage disequilibrium studies.
Nat. Genet. 28, 361-364.
- Bowcock, A. M., Ruiz-Linares, A., Tomfohrde, J.,
Minch, E., Kidd, J. R. and Cavalli-Sforza, L. L.
(1994) High resolution of human evolutionary
trees with polymorphic microsatellites. Nature
368, 455-457.
- Felsenstein, J. (1989) PHYLIP - Phylogeny
Inference Package (Version 3.2). Cladistics 5,
164-166.
- Mountain, J. L. and Cavalli-Sforza, L. L. (1997)
Multilocus genotypes, a tree of individuals, and
human evolutionary history. American Journal of
Human Genetics 61, 705-718.
- Witherspoon, D. J., Wooding, S., Rogers, A. R.,
Marchani, E. E., Watkins, W. S., Batzer, M. A. and
Jorde, L. B. (2007) Genetic similarities within
and between human populations. Genetics 176,
351-359.