Population Genetics Bovine genome analysis

Provided by: johngref
Learn more at: http://www.binf.gmu.edu

1
Bioinformatics and Computational
Biology Advancing biocomputing to meet human needs
Genome Variation: Bovine genome analysis
Rafael Villa-Angulo
September 2007
2
Bovine HapMap
  1. Basic concepts
  2. Bovine HapMap data
  3. Haplotype analysis
  4. Population structure
  5. Allele sharing

3
1. Basic concepts
In diploid organisms (such as humans and chimps)
there are two (not completely identical) "copies"
of each chromosome, and hence of each region of
interest. A description of the data from a single
copy is called a haplotype, while a description of
the conflated (mixed) data on the two copies is
called a genotype (Gusfield 2002). The specific
physical appearance and constitution, or the
specific manifestation of a trait, is called the
phenotype.
4
1. Basic concepts (cont)
Fig 2 Haplotype analysis to identify DNA
variations associated with a disease
5
1. Haplotyping process
  • Mathematical formulation

Let $G = \{g_1, \ldots, g_n\}$ be a set of $n$
genotypes, where each $g_i$ consists of the
combined allele information of $m$ SNPs, $s_1, \ldots, s_m$.
Each $g_i \in G$ is a vector of size $m$ whose $j$th element
$g_{ij}$ ($i = 1, \ldots, n$ and $j = 1, \ldots, m$) is defined as

$$
g_{ij} =
\begin{cases}
0 & \text{when the two alleles of SNP } s_j \text{ are major homozygous,} \\
1 & \text{when the two alleles of SNP } s_j \text{ are minor homozygous,} \\
2 & \text{when the two alleles of SNP } s_j \text{ are heterozygous.}
\end{cases}
$$
6
1. Haplotyping process
  • Let $H = \{h_1, h_1', \ldots, h_n, h_n'\}$ be the set of
    all (unknown) haplotypes, where each pair
    $h_i, h_i'$ corresponds to the two haplotypes
    conflated in genotype $g_i$.

Each $h_i$ and $h_i' \in H$ is a vector of size $m$ whose
$j$th elements $h_{ij}$ and $h_{ij}'$ ($i = 1, \ldots, n$ and
$j = 1, \ldots, m$) are defined as 0 when the
allele of SNP $s_j$ is major and as 1
when the allele of SNP $s_j$ is minor.
7
1. Haplotyping process
The haplotype phasing problem can be defined as
follows.
Problem: Haplotype Phasing (generic)
Input: A set $G$ of genotypes.
Output: A set $H$ of haplotypes such that for each
$g_i \in G$ there exist $h_i, h_i' \in H$ such that the
conflation of $h_i$ with $h_i'$ is $g_i$.
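The conflation step in this problem statement can be illustrated with a minimal Python sketch. It assumes the coding defined on the earlier slides (haplotype alleles 0 = major, 1 = minor; genotype entries 0, 1, 2); the function names `conflate` and `is_valid_phasing` are hypothetical and not taken from the presentation.

```python
def conflate(h1, h2):
    """Combine two haplotypes (0 = major, 1 = minor) into one genotype vector.

    Genotype coding as on the slides: 0 = major homozygous,
    1 = minor homozygous, 2 = heterozygous.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def is_valid_phasing(genotypes, haplotype_pairs):
    """Check that the i-th haplotype pair conflates to the i-th genotype."""
    return len(genotypes) == len(haplotype_pairs) and all(
        conflate(h1, h2) == g for g, (h1, h2) in zip(genotypes, haplotype_pairs)
    )


print(conflate([0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 2, 2]
```

A candidate output H solves the phasing problem for G exactly when `is_valid_phasing` returns True.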
8
1. Haplotyping process
  • The solution to the haplotype Phasing problem is
    not straightforward due to resolution ambiguity
  • Computational and statistical algorithms for
    addressing ambiguity in Haplotype Phasing
  • 1) parsimony
  • 2) phylogeny
  • 3) maximum-likelihood
  • 4) Bayesian inference
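
The ambiguity can be made concrete: a genotype with k heterozygous sites is consistent with 2^(k-1) unordered haplotype pairs when k is at least 1. The following Python sketch enumerates them, assuming the 0/1/2 genotype coding defined earlier; `compatible_pairs` is a hypothetical helper name.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs that conflate to `genotype`.

    Coding: 0 = major homozygous, 1 = minor homozygous, 2 = heterozygous.
    """
    het_sites = [j for j, g in enumerate(genotype) if g == 2]
    if not het_sites:
        h = list(genotype)
        return [(h, h[:])]
    pairs = []
    # Fix allele 0 on h1 at the first heterozygous site so that each
    # unordered pair is generated exactly once.
    for rest in product([0, 1], repeat=len(het_sites) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, allele in zip(het_sites, (0,) + rest):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((h1, h2))
    return pairs


for pair in compatible_pairs([0, 2, 2]):
    print(pair)
```

With two heterozygous sites the genotype has two possible resolutions, so genotype data alone cannot decide between them.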

9
1. Haplotyping process
  • Parsimony-based methods (e.g., Clark)
  • Assume target population shares a relatively
    small number of common haplotypes due to linkage
    disequilibrium. Then resolve an ambiguous
    genotype using one of already identified
    haplotypes.
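
The parsimony idea can be sketched as a greedy procedure in the spirit of Clark's method. This is a simplified illustration under the 0/1/2 genotype coding used earlier, not the published implementation; `clark_phase` and its helpers are hypothetical names, and the result depends on the order in which genotypes and known haplotypes are tried.

```python
def clark_phase(genotypes):
    """Greedy Clark-style phasing. Returns (resolved, unresolved_indices)."""

    def is_compatible(h, g):
        # A haplotype fits a genotype if it matches every homozygous site.
        return all(gj == 2 or gj == hj for hj, gj in zip(h, g))

    def complement(h, g):
        # The partner haplotype carries the other allele at heterozygous sites.
        return tuple(1 - hj if gj == 2 else hj for hj, gj in zip(h, g))

    known = set()
    resolved = {}
    # Step 1: genotypes with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(gj == 2 for gj in g) <= 1:
            h1 = tuple(0 if gj == 2 else gj for gj in g)
            h2 = complement(h1, g)
            resolved[i] = (h1, h2)
            known.update((h1, h2))
    # Step 2: repeatedly explain ambiguous genotypes with a known haplotype.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in sorted(known):
                if is_compatible(h, g):
                    h2 = complement(h, g)
                    resolved[i] = (h, h2)
                    known.add(h2)
                    progress = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved


resolved, unresolved = clark_phase([(0, 0, 1), (0, 2, 2)])
print(resolved[1])  # ((0, 0, 1), (0, 1, 0))
```

Genotypes that no known haplotype can explain stay in `unresolved`, which is a known limitation of this family of methods.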

10
1. Haplotyping process
  • Phylogeny-based methods (e.g., MERLIN)

Assume haplotypes in a population evolved along
the coalescent, a rooted tree describing the
evolutionary history of a set of DNA sequences.
Thus, aim to find haplotypes that resolve target
genotype data and follow the coalescent model as
well.
11
1. Haplotyping process
  • Maximum-likelihood-based methods (e.g.,
    fastPHASE)
  • Treat genotypes as observed but incomplete data
    with unknown haplotypes. Estimate the haplotype
    frequencies θ based on their likelihood, L(θ),
    given the genotype data D.
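
As an illustration of the incomplete-data view, the sketch below runs the classic expectation-maximization (EM) iteration for haplotype frequencies under an assumed random-mating (Hardy-Weinberg) model. It is not fastPHASE's model, which is more elaborate; the function names and the fixed iteration count are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def expand(genotype):
    """Unordered haplotype pairs consistent with a genotype (2 = heterozygous)."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    pairs = set()
    for alleles in product([0, 1], repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, a in zip(het, alleles):
            h1[site], h2[site] = a, 1 - a
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    expansions = [expand(g) for g in genotypes]
    haplotypes = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(iterations):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: posterior weight of each compatible pair.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: every individual contributes two haplotypes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq


freq = em_haplotype_frequencies([(0, 0), (0, 0), (1, 1), (2, 2)])
print(max(freq, key=freq.get))  # (0, 0)
```

In this toy data set the double heterozygote is resolved toward the haplotypes already common among the homozygous animals.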

12
1. Haplotyping process
  • Bayesian inference-based methods (e.g., PHASE)
  • Regard the unknown haplotypes as unobserved
    random quantities and aim to evaluate their
    conditional distribution in light of the genotype
    data

1. Construct a Markov chain Monte Carlo (MCMC) algorithm
to estimate H from the observed genotype data G,
taking into account how the rate of decay of LD with
distance depends on the underlying recombination rates ρ.
2. Use the MCMC algorithm to
infer haplotypes. Sample from the conditional
distribution of the haplotypes and recombination
parameters given the genotype data, Pr(H, ρ | G).
13
2. Bovine HapMap data
  • The bovine HapMap will be a catalog of common
    genetic variants that occur in cattle.
  • It will describe what these variants are, where
    they occur in chromosome regions, and how they
    are distributed among individuals within
    populations and among populations in different
    breeds.

14
2. Bovine HapMap data (cont)
  • Grouping bovine breeds

Bovine
  • Bos taurus
  • - Beef breeds: Charolais, Limousin, Piedmontese,
    Romagnola, Hereford, Angus, Red Angus
  • - Dairy breeds: Brown Swiss, Guernsey, Holstein,
    Jersey, Norwegian Red
  • Bos indicus: Gir, Nelore, Brahman
  • African: Sheko, N'Dama
  • Composite: Beefmaster, Santa Gertrudis
15
2. Bovine HapMap data (cont)
  • A total of 501 animals were genotyped
    representing 21 cattle breeds.
  • The breeds included a combination of Bos taurus,
    Bos indicus and composites from several
    continents.

16
2. Bovine HapMap data (cont)
Table 1. Initial number of animals per breed in
the HapMap data
Breed          No. of individuals   Breed              No. of individuals
Charolais      24                   Jersey             28
Limousin       42                   Norwegian Red      25
Piedmontese    24                   Gir                24
Romagnola      24                   Nelore             24
Hereford       27                   Brahman            25
Angus          27                   Beefmaster         24
Red Angus      12                   Santa Gertrudis    24
Brown Swiss    24                   Sheko              20
Guernsey       21                   N'Dama             25
Holstein       53                   Buffalo and Anoa   2 each
17
2. Bovine HapMap data (cont)
Table 2. Initial number of markers per chromosome
in the HapMap data
Chromosome  Markers   Chromosome  Markers   Chromosome  Markers
1           1537      11          1242      21          669
2           1512      12          879       22          698
3           1316      13          999       23          588
4           1320      14          2794      24          694
5           1246      15          851       25          1208
6           2485      16          885       26          602
7           1101      17          828       27          498
8           1224      18          660       28          520
9           1018      19          690       29          479
10          1139      20          882       X           573
18
2. Bovine HapMap data (cont)
  • Cleaning the dataset
  • Missing data
  • - Animals with genotype completeness < 89% were
    removed (normalized to the most complete
    individual).
  • - Markers removed due to greater than 10%
    missing data: > 50% of Taurus and > 50% of
    Indicus.
  • HWE test and genotyping error
  • - Markers removed due to an estimated genotyping
    error rate > 5% and at least one breed out of
    HWE.
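
One common way to test a marker for Hardy-Weinberg equilibrium (HWE) within a breed is a chi-square goodness-of-fit test on genotype counts. The slides do not say which test was used, so the Python sketch below is an assumption for illustration; `hwe_chi_square` is a hypothetical name and the genotype coding is the 0/1/2 scheme defined earlier.

```python
def hwe_chi_square(genotypes):
    """Chi-square statistic (1 degree of freedom) for HWE at one marker.

    Coding: 0 = major homozygous, 1 = minor homozygous, 2 = heterozygous.
    Any other value is treated as a missing call.
    """
    n_major = sum(g == 0 for g in genotypes)
    n_minor = sum(g == 1 for g in genotypes)
    n_het = sum(g == 2 for g in genotypes)
    n = n_major + n_minor + n_het
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_major + n_het) / (2 * n)  # major-allele frequency
    q = 1 - p
    table = [(n_major, n * p * p), (n_het, 2 * n * p * q), (n_minor, n * q * q)]
    return sum((obs - exp) ** 2 / exp for obs, exp in table if exp > 0)


in_hwe = [0] * 25 + [2] * 50 + [1] * 25
all_het = [2] * 100
print(hwe_chi_square(in_hwe), hwe_chi_square(all_het))  # 0.0 100.0
```

At one degree of freedom a statistic above about 3.84 corresponds to p < 0.05; whether the study used that cutoff is not stated in the slides.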

19
2. Bovine HapMap data (cont)
  • Minor Allele Frequency
  • - Markers removed for being monomorphic in all
    breeds.
  • - Markers removed due to MAF < 0.05 in all breeds.
  • Discordance
  • - Markers were removed due to having > 2
    discordant trios.
  • Unassigned chromosomes
  • - Markers assigned to unknown chromosomes were
    removed
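
The missing-data and minor allele frequency filters can be sketched in Python as follows. This is a simplified illustration: it applies one overall missingness threshold rather than the Taurus/Indicus rule on the slide, the `MISSING` code of -1 is an assumption, and `marker_stats` and `keep_marker` are hypothetical names.

```python
MISSING = -1  # assumed code for a missing genotype call


def marker_stats(column):
    """Return (missing_rate, maf) for one marker across a group of animals.

    Coding: 0 = major homozygous, 1 = minor homozygous, 2 = heterozygous.
    """
    called = [g for g in column if g != MISSING]
    missing_rate = 1 - len(called) / len(column)
    if not called:
        return missing_rate, 0.0
    # Minor homozygotes carry two minor alleles, heterozygotes carry one.
    minor = sum(2 if g == 1 else 1 if g == 2 else 0 for g in called)
    freq = minor / (2 * len(called))
    return missing_rate, min(freq, 1 - freq)


def keep_marker(columns_by_breed, max_missing=0.10, min_maf=0.05):
    """Drop a marker with too much missing data or with low MAF in every breed."""
    all_calls = [g for col in columns_by_breed.values() for g in col]
    missing_rate, _ = marker_stats(all_calls)
    if missing_rate > max_missing:
        return False
    return any(marker_stats(col)[1] >= min_maf for col in columns_by_breed.values())


print(keep_marker({"Angus": [0, 0, 2, 1], "Holstein": [0, 0, 0, 0]}))  # True
```

A marker that is monomorphic in all breeds has MAF 0 everywhere, so the same check removes it.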

20
2. Bovine HapMap data (cont)
  • Summary
  • The final dataset contains 29,394 markers from
    487 animals.

21
3. Haplotype analysis
  • Haplotypes inference
  • A pair of haplotypes was estimated for each
    animal in each breed using fastPHASE, a
    maximum-likelihood-based method (Scheet and
    Stephens 2006)

22
3. Haplotype analysis (cont)
  • Results
  • - 570 files containing inferred haplotypes
  • 487 individuals
  • 30 chromosomes per individual
  • 2 haplotypes per chromosome

23
3. Haplotype analysis (cont)
  • Genetic variation in genomes is organized in
    haplotype blocks (Guryev et al., 2006).
  • Haplotype maps characterize the common patterns
    of linkage disequilibrium in populations.
  • Haplotype blocks provide substantial statistical
    power in association studies of common genetic
    variation across regions (Gabriel et al., 2002).

24
3. Haplotype analysis (cont)
  • Block definition
  • Blocks based on pairwise and grouped r² values.
  • (i) Begin a block by selecting the pair of
    adjacent SNPs with the highest r² value (no
    less than a threshold of 0.4).
  • (ii) Repeatedly extend the block if the
    average r² value between an adjacent marker and
    the current block members is above a second
    threshold (0.3) and all the individual r² values
    are above a third threshold (0.1).
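
The two steps above can be sketched in Python for a single block. The sketch assumes r² is computed from phased 0/1 haplotypes and that extension is tried on both sides of the block; the slides do not specify either detail, and `r_squared` and `grow_block` are hypothetical names.

```python
def r_squared(haplotypes, a, b):
    """Pairwise r^2 between SNP columns a and b of 0/1 haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[a] for h in haplotypes) / n
    p_b = sum(h[b] for h in haplotypes) / n
    p_ab = sum(h[a] * h[b] for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else (p_ab - p_a * p_b) ** 2 / denom


def grow_block(haplotypes, seed_min=0.4, mean_min=0.3, each_min=0.1):
    """Return (start, end) marker indices of one block, or None if no seed qualifies."""
    m = len(haplotypes[0])
    if m < 2:
        return None
    # (i) Seed with the adjacent pair that has the highest r^2.
    best = max(range(m - 1), key=lambda j: r_squared(haplotypes, j, j + 1))
    if r_squared(haplotypes, best, best + 1) < seed_min:
        return None
    start, end = best, best + 1

    def can_add(candidate):
        values = [r_squared(haplotypes, candidate, j) for j in range(start, end + 1)]
        return sum(values) / len(values) > mean_min and min(values) > each_min

    # (ii) Extend while an adjacent marker meets both thresholds.
    extended = True
    while extended:
        extended = False
        if end + 1 < m and can_add(end + 1):
            end += 1
            extended = True
        if start > 0 and can_add(start - 1):
            start -= 1
            extended = True
    return start, end


haps = [(0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1)]
print(grow_block(haps))  # (0, 2)
```

In the example the first three SNPs are perfectly correlated and form one block, while the fourth is independent of them and is left out.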

25
3. Haplotype analysis (cont)
Table 3 Average values of block statistics
26
3. Haplotype analysis (cont)
  • Figure 23 LD assessment for all SNPs inside
    blocks

27
3. Haplotype analysis (cont)
  • Figure 25 LD assessment for all SNPs inside
    blocks

28
3. Haplotype analysis (cont)
  • Consistency of block boundaries across breeds
  • 1. Adjacent pairs of SNPs with
    intermarker distances up to 10 kb were
    examined.
  • 2. If the SNP pair is assigned to a single
    block, count it as concordant (no evidence of
    historical recombination).
  • 3. If the SNP pair is not assigned to a single
    block, count it as discordant (evidence of
    recombination).
  • (Gabriel et al., 2002)
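
A minimal Python sketch of this pair classification follows. It assumes blocks are given as (start, end) base-pair intervals and, as a further assumption, compares two breeds by the fraction of examined pairs that receive different labels; all function names are hypothetical.

```python
def block_of(position, blocks):
    """Index of the (start_bp, end_bp) block containing `position`, else None."""
    for idx, (start, end) in enumerate(blocks):
        if start <= position <= end:
            return idx
    return None


def classify_pairs(positions, blocks, max_gap=10_000):
    """Label adjacent SNP pairs within `max_gap` bp as concordant or discordant."""
    labels = []
    for left, right in zip(positions, positions[1:]):
        if right - left > max_gap:
            continue  # pair not examined
        b_left, b_right = block_of(left, blocks), block_of(right, blocks)
        same_block = b_left is not None and b_left == b_right
        labels.append("concordant" if same_block else "discordant")
    return labels


def discordance_rate(positions, blocks_a, blocks_b, max_gap=10_000):
    """Fraction of examined SNP pairs labelled differently in two breeds."""
    a = classify_pairs(positions, blocks_a, max_gap)
    b = classify_pairs(positions, blocks_b, max_gap)
    return sum(x != y for x, y in zip(a, b)) / len(a) if a else 0.0


positions = [1000, 5000, 9000, 30000, 35000]
print(classify_pairs(positions, [(1000, 9000)]))
# ['concordant', 'concordant', 'discordant']
```

Because the same marker positions are used for both breeds, the two label lists line up pair by pair.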

29
3. Haplotype analysis (cont)
Figure 26 Concordance and discordance assignments
for SNP pairs within distance < 10 kb for Angus
vs Holstein breeds (panels a, b, c).
30
3. Haplotype analysis (cont)
Figure 28 Dendrogram based on haplotype boundary
discordances in chromosome 14
31
3. Haplotype analysis (cont)
Figure 29 Dendrogram based on haplotype boundary
discordances in chromosome 25
32
3. Haplotype analysis (cont)
  • Principal Component Analysis from haplotype blocks
    among all breeds

Figure 29 PCA1 against PCA2 in chromosome 6 based
on haplotype blocks
33
3. Haplotype analysis (cont)
Figure 30 PCA1 against PCA2 in chromosome 14
based on haplotype blocks
Figure 31 PCA1 against PCA2 in chromosome 25
based on haplotype blocks
34
3. Haplotype analysis (cont)
Figure 32 Breeds sorted by PC1 derived from
haplotype block vectors on chromosomes 6, 14,
and 27
35
4. Population structure
  • 4.1 MAF and nucleotide diversity
  • 4.2 Linkage Disequilibrium
  • 4.3 Genetic differentiation

36
4. Population structure (cont)
  • MAF (all breeds polymorphic proportion graphs)

37
4. Population structure (cont)
  • Principal Component Analysis (PCA) from MAF
    distribution among all breeds

Figure 3 PCA1 against PCA2 in chromosome 6 using
minor allele frequencies
38
4. Population structure (cont)
Figure 4 PCA1 against PCA2 in chromosome 14 using
minor allele frequencies
Figure 5 PCA1 against PCA2 in chromosome 25 using
minor allele frequencies
39
4. Population structure (cont)
Figure 6 Breeds sorted by PC1 derived from minor
allele frequencies on chromosomes 6, 14, and 25
40
4. Population structure (cont)
  • Linkage Disequilibrium Analysis

Figure 8 r² plot for chromosome 6 in Angus breed
using all markers passing a χ² test.
41
4. Population structure (cont)
Figure 9 r² plot for chromosome 14 in Angus breed
using all markers passing a χ² test.
Figure 10 r² plot for chromosome 25 in Angus
breed using all markers passing a χ² test.
42
4. Population structure (cont)
  • Genetic differentiation

Figure 11 Genetic differentiation (FST) by marker
in chromosome 6 between Beef and Dairy breeds
43
4. Population structure (cont)
Figure 12 Genetic differentiation (FST) by
marker in chromosome 14 between Beef and Dairy
breeds
Figure 13 Genetic differentiation (FST) by marker
in chromosome 25 between Beef and Dairy breeds
44
4. Population structure (cont)
Figure 14 Genetic differentiation (FST) by marker
in chromosome 6 between Bos Taurus and Bos
Indicus clusters
45
4. Population structure (cont)
Figure 15 Genetic differentiation (FST) by marker
in chromosome 14 between Bos Taurus and Bos
Indicus clusters
Figure 16 Genetic differentiation (FST) by marker
in chromosome 25 between Bos Taurus and Bos
Indicus clusters
46
5. Allele sharing analysis
  • Multi-marker allele sharing on chromosomes with
    dense markers (6, 14, 25)
  • Allele defined as the haplotypes observed within
    a sliding window containing w = 10 adjacent
    markers spanning no more than 200 kb
  • Each window containing 10 markers and spanning no
    more than 200 kb defines a single locus
  • Loci may overlap
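
The window definition can be sketched in Python as follows, assuming marker positions are sorted base-pair coordinates and haplotypes are 0/1 sequences; `sliding_loci` and `window_allele` are hypothetical names.

```python
def sliding_loci(positions, w=10, max_span=200_000):
    """Return (start_index, end_index) for every window of `w` adjacent
    markers whose first and last markers are at most `max_span` bp apart."""
    loci = []
    for start in range(len(positions) - w + 1):
        end = start + w - 1
        if positions[end] - positions[start] <= max_span:
            loci.append((start, end))
    return loci


def window_allele(haplotype, locus):
    """The multi-marker allele: one haplotype restricted to one window."""
    start, end = locus
    return tuple(haplotype[start:end + 1])


positions = list(range(0, 120_000, 10_000))  # 12 markers, 10 kb apart
print(sliding_loci(positions))  # [(0, 9), (1, 10), (2, 11)]
```

Consecutive windows share nine of their ten markers, which is why neighbouring loci overlap.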

47
5. Allele sharing analysis (cont)
  • Proportion of shared alleles between two
    populations P1 and P2 at locus k

Where i and j range over the individuals in
populations P1 and P2 , respectively. Sa(i, j, k)
is the number of shared alleles between
individuals i and j at locus k, and n1 and n2 are
the number of samples in P1 and P2.
48
Allele sharing analysis (cont)
  • Normalized proportion of shared alleles
  • S(P1, P2, k) = 1.0 when the proportion of
    shared alleles between P1 and P2 equals the
    average of the proportions of shared alleles
    within P1 and within P2.
  • S(P1, P2, k) << 1.0 when the proportion of
    shared alleles between the two populations is
    much less than the average within the two
    populations.
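
The equations for these quantities were not preserved in the transcript, so the Python sketch below is one plausible reading rather than the authors' formula. It assumes each individual carries two alleles at a locus, that the proportion is the sum of Sa(i, j, k) divided by twice the number of pairs, and that within-population averages skip self-comparisons; the function names are hypothetical.

```python
from collections import Counter


def shared_alleles(ind_i, ind_j):
    """Number of alleles (0, 1 or 2) two diploid individuals share at one locus."""
    return sum((Counter(ind_i) & Counter(ind_j)).values())


def proportion_shared(pop1, pop2):
    """Mean fraction of shared alleles over pairs of individuals at one locus."""
    same = pop1 is pop2
    total = pairs = 0
    for i, a in enumerate(pop1):
        for j, b in enumerate(pop2):
            if same and i == j:
                continue  # assumed: do not compare an individual with itself
            total += shared_alleles(a, b)
            pairs += 1
    return total / (2 * pairs)


def normalized_sharing(pop1, pop2):
    """Between-population sharing divided by the mean within-population sharing."""
    within = (proportion_shared(pop1, pop1) + proportion_shared(pop2, pop2)) / 2
    return proportion_shared(pop1, pop2) / within


p1 = [("A", "A"), ("A", "A")]
p2 = [("B", "B"), ("B", "B")]
print(normalized_sharing(p1, p2))  # 0.0
```

Alleles can be any hashable value, so the window haplotypes used as multi-marker alleles can be passed as tuples. The sketch raises a division error if a population has fewer than two individuals or no within-population sharing.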

49
Allele sharing analysis (cont)
Figure 17 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 6
50
Allele sharing analysis (cont)
Figure 18 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 14
Figure 19 Normalized proportion of shared
multi-marker alleles between Angus and Holstein
on a region of chromosome 25
51
Allele sharing analysis (cont)
  • Clustering Breeds Based on shared alleles
  • The proportion of shared alleles can be used as
    a distance measure for clustering breeds.
  • Normalized distance between P1 and P2

where u is the number of loci with shared alleles
  • D(P1, P2) = 0 if breeds P1 and P2 share the same
    proportion of alleles as are shared by
    individuals within each individual breed.

52
Allele sharing analysis (cont)
Figure 20 Dendrogram based on shared alleles on
chromosome 6.
53
Allele sharing analysis (cont)
Figure 21 Dendrogram based on shared alleles on
chromosome 14.
54
Allele sharing analysis (cont)
Figure 22 Dendrogram based on shared alleles on
chromosome 25.
55
Preliminary conclusions
  • Regions of high differentiation between breeds and
    breed clusters were identified from FST analysis
  • PCA on MAF and block boundary
    discordances permits us to cluster breeds into
    geographical and ancestry groups
  • The proportion of shared alleles between breeds
    exhibits considerable variation that is
    significantly auto-correlated, indicating
    possible effects of selection

56
Future Directions
  • Improved haplotype algorithms for Bovine data
    sets
  • Comparison of Haplotype Inference Algorithms
  • Further analysis of Bovine Genome variation

57
Collaborators
Dr. John Grefenstette, Dept. of Bioinformatics and
Computational Biology, GMU
Dr. Lakshmi Kumar, Dept. of Bioinformatics and
Computational Biology, GMU, USDA
Dr. Clare Gill and Jungwoo Choi (PhD student),
Dept. of Animal Science, Texas A&M University
58
Bibliography
  • Hartl, D. L. (2000) A Primer of Population
    Genetics. Third edition. Sinauer Associates, Inc.
  • Gusfield, D. (2002) An Overview of
    Combinatorial Methods for Haplotype Inference.
    Computational Methods for SNPs and Haplotype
    Inference. Springer. LNBI 2983.
  • Scheet, P., Stephens, M. (2006) A fast and
    flexible statistical model for large-scale
    population genotype data: applications to
    inferring missing genotypes and haplotypic phase.
    Am. J. Hum. Genet., 78(4), 629-644.
  • Gibson, G., Muse, S. V. (2004) A Primer of
    Genome Science. Second edition. Sinauer
    Associates, Inc.
  • Wright, S. (1951) The genetical structure of
    populations. Annals of Eugenics, 15, 323-354.
  • Wright, S. (1965) The interpretation of population
    structure by F-statistics with special regard to
    systems of mating. Evolution, 19, 395-420.

59
Bibliography (cont)
  • Wright, S. (1978) Evolution and Genetics of
    Populations. Vol. 4. Variability Within and Among
    Natural Populations. Univ. of Chicago Press,
    Chicago.
  • Michalatos-Beloin, S., Tishkoff, S. A., Bentley,
    K. L., Kidd, K. K. and Ruano, G. (1996) Molecular
    haplotyping of genetic markers 10 kb apart by
    allele-specific long-range PCR. Nucleic Acids
    Res., 24, 4841-4843.
  • Douglas, J. A., Boehnke, M., Gillanders, E.,
    Trent, J. M. and Gruber, S. B. (2001) Experimentally
    derived haplotypes substantially increase the
    efficiency of linkage disequilibrium studies.
    Nat. Genet., 28, 361-364.
  • Bowcock, A. M., Ruiz-Linares, A., Tomfohrde, J.,
    Minch, E., Kidd, J. R. and Cavalli-Sforza, L. L.
    (1994) High resolution of human evolutionary
    trees with polymorphic microsatellites. Nature,
    368, 455-457.
  • Felsenstein, J. (1989) PHYLIP - Phylogeny
    Inference Package (Version 3.2). Cladistics, 5,
    164-166.
  • Mountain, J. L. and Cavalli-Sforza, L. L. (1997)
    Multilocus genotypes, a tree of individuals, and
    human evolutionary history. American Journal of
    Human Genetics, 61, 705-718.
  • Witherspoon, D. J., Wooding, S., Rogers, A. R.,
    Marchani, E. E., Watkins, W. S., Batzer, M. A.
    and Jorde, L. B. (2007) Genetic similarities within
    and between human populations. Genetics, 176,
    351-359.