Title: Phasing of 2-SNP Genotypes
1- Phasing of 2-SNP Genotypes
- Based on Non-Random Mating Model
- Dumitru Brinza
- joint work with Alexander Zelikovsky
- Department of Computer Science
- Georgia State University
- Atlanta, USA
2Outline
- Molecular biology terms
- Motivation
- Problem formulation
- Previous work
- Our contribution
- Phasing of 2-SNP genotypes
- Phasing of multi-SNP genotypes
- Results
3Molecular biology terms
- Human genome: all the genetic material in the chromosomes; length about 3 x 10^9 base pairs
- Differences between any two people occur in about 0.1% of the genome
- SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population
- Genotype: the entire genetic identity of an individual, including alleles, SNPs, or gene forms (e.g., AC CT TG AA AC TG)
- Haplotype: a single set of chromosomes (half of the full set of genetic material) (e.g., A C T A A T)
- A genotype is a mixture of two haplotypes
4From ACTG to 0,1,2 notations
- Haplotype
  - Wild-type SNPs are referred to as 0
  - Mutated SNPs are referred to as 1
- Genotypes
  - Homozygous SNPs are referred to as 0 (mixture of 0 and 0) or 1 (mixture of 1 and 1)
  - Heterozygous SNPs are referred to as 2 (mixture of 0 and 1, or 1 and 0)
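The 0/1/2 encoding above can be sketched in a few lines of Python. The function name `genotype_from_haplotypes` is an illustrative choice, not something defined in the talk; it only assumes haplotypes are given as equal-length lists of 0s and 1s.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into one 0/1/2 genotype.

    A site where both haplotypes agree keeps that allele (homozygous);
    a site where they differ is encoded as 2 (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


if __name__ == "__main__":
    print(genotype_from_haplotypes([0, 1, 1, 0], [0, 1, 0, 1]))
```

Note that the mapping is lossy: once two sites are both encoded as 2, the genotype no longer says which alleles travel together, which is exactly the phasing problem.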
5Motivation
- Haplotypes may contain a large number of genetic markers that are responsible for human disease.
- Haplotypes may increase the power of association between marker loci and phenotypic traits.
- An evolutionary tree can be reconstructed based on haplotypes.
- Physical phasing (haplotype inference) is too expensive. There is a great need for computational methods that extract haplotype information from the given genotype information.
- Existing methods are either extremely slow or less accurate for genome-wide studies.
6Phasing problem (Haplotype inference)
- Haplotype inference, or genotype phasing, is the resolution of a genotype into two haplotypes
- Given n genotype vectors (entries 0, 1 or 2),
- Find n pairs of haplotype vectors, one pair of haplotypes per genotype, explaining the genotypes
- For an individual genotype with h heterozygous sites there are 2^(h-1) possible haplotype pairs explaining this genotype (h is about 20k genome-wide). In addition, about 10% of the data is missing.
- This is hopeless without a genetic model
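To make the 2^(h-1) count concrete, here is a minimal brute-force sketch that lists every unordered haplotype pair explaining a genotype. The name `explaining_pairs` is hypothetical, and the enumeration is only feasible for small h; it is meant to illustrate the size of the search space, not to serve as a phasing method.

```python
from itertools import product


def explaining_pairs(genotype):
    """Enumerate all unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # each unordered pair is produced exactly once: 2^(h-1) pairs in total.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site] = b
            h2[site] = 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    for pair in explaining_pairs([2, 1, 2]):
        print(pair)
```

With h around 20k the number of candidate pairs is astronomically large, which is why a genetic model is needed to rank them.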
7Previous work
- PHASE: Bayesian statistical method (Stephens et al., 2001, 2003)
- HAPLOTYPER: Monte Carlo approach (Niu et al., 2002)
- Phamily: phases trio families based on PHASE (Acherman et al., 2003)
- GERBIL: statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
- SNPHAP: uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)
8Contribution
- We explore phasing of genotypes with 2 SNPs, which are ambiguous only when both sites are heterozygous. There are then two possible phasings, and the phasing problem is reduced to inferring their frequencies.
- Given the phasing solution for 2-SNP genotypes, we propose an algorithm for inferring the complete haplotypes for a given genotype, based on the maximum spanning tree of a complete graph with vertices corresponding to heterozygous sites and edge weights given by the inferred 2-SNP frequencies.
- Extensive experimental validation of the proposed methods and comparison with previously known methods
9Phasing of 2-SNP genotypes
- At least one SNP is homozygous: phasing is well defined
  - Example: 21 = 01 + 11
- Both SNPs are heterozygous: ambiguity
  - Cis- phasing: 22 = 00 + 11
  - Trans- phasing: 22 = 01 + 10
10Odds of cis- or trans- phasing
- Odds ratio of being phased cis- / trans-
- The additive odds ratio is better (also noticed in PHASE)
- LD (linkage disequilibrium) between SNPs i and j
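For orientation, the textbook forms of these two quantities can be written in terms of the 2-SNP haplotype frequencies F00, F01, F10, F11 at SNPs i and j. These are the standard definitions and are given here as an assumption; the exact expressions used in the talk, including the additive variant of the odds ratio, are not reproduced in this text. The symbols \lambda_{ij} and D_{ij} are introduced only for this illustration.

```latex
\[
\lambda_{ij} = \frac{F_{00}\,F_{11}}{F_{01}\,F_{10}},
\qquad
D_{ij} = F_{00}\,F_{11} - F_{01}\,F_{10}
\]
```

Under these definitions, \lambda_{ij} > 1 (equivalently D_{ij} > 0) favors the cis- phasing 00 + 11, and \lambda_{ij} < 1 favors the trans- phasing 01 + 10.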
11Confidence in cis- or trans- phasing
- Closer pairs of SNPs are more tightly linked (fewer crossovers)
- The confidence cij in phasing 2 SNPs i and j is inversely proportional to the squared distance between them
- The logarithm provides the sign, which indicates the cis-/trans- preference; the magnitude |cij| is the certainty
- Cis- phasing at sites i, j: 22 = 00 + 11
- Trans- phasing at sites i, j: 22 = 01 + 10
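A minimal sketch of such a confidence score follows. It assumes the score is the log of the multiplicative cis/trans odds ratio divided by the squared distance, with positive values meaning cis- (the sign convention used on the genotype graph slide). The function name, the `pseudo` smoothing term, and the exact functional form are assumptions for illustration; the precise formula from the talk is not available in this text.

```python
import math


def phasing_confidence(f00, f01, f10, f11, distance, pseudo=1e-6):
    """Signed confidence in phasing two heterozygous SNPs.

    Positive values favor cis- (00 + 11), negative values favor trans-
    (01 + 10). The magnitude shrinks with the squared distance between
    the SNPs. `pseudo` (an assumption) avoids log(0) and division by zero.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    log_odds = math.log((f00 * f11 + pseudo) / (f01 * f10 + pseudo))
    return log_odds / distance ** 2


if __name__ == "__main__":
    print(phasing_confidence(0.4, 0.1, 0.1, 0.4, distance=1.0))
    print(phasing_confidence(0.4, 0.1, 0.1, 0.4, distance=10.0))
```

Doubling the distance divides the score by four, so distant SNP pairs contribute only weak evidence.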
12Certainty of cis- or trans- phasing
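- n: number of genotypes
- F00, F01, F10, F11: true haplotype frequencies (observed counts plus the true phasing of 22 genotypes)
Genotypes (i and j are the 3rd and 6th SNPs; ? is a missing value) and the 2-SNP haplotypes each contributes at (i, j):
? 1 0 2 1 1 0 1 0 1 -> 01: 2
1 1 0 0 1 0 0 2 0 1 -> 00: 2
0 1 2 0 1 2 0 1 0 1 -> (00: 1, 11: 1) or (01: 1, 10: 1)
2 1 1 0 1 1 0 ? 0 1 -> 11: 2
0 1 1 0 1 2 0 0 2 1 -> 10: 1, 11: 1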
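The counting illustrated by this example can be sketched as follows. The function name `two_snp_counts` and the use of `None` for a missing value are illustrative assumptions; genotypes that are 22 at the chosen sites are tallied separately because their phasing is ambiguous.

```python
from collections import Counter


def two_snp_counts(genotypes, i, j):
    """Count 2-SNP haplotypes at sites i, j (0-based) over all genotypes.

    Returns (counts, ambiguous): `counts` maps (allele_i, allele_j) to the
    number of haplotypes observed unambiguously, and `ambiguous` is the
    number of genotypes that are heterozygous at both sites.
    """
    counts = Counter()
    ambiguous = 0
    for g in genotypes:
        a, b = g[i], g[j]
        if a not in (0, 1, 2) or b not in (0, 1, 2):
            continue  # missing value at one of the two sites
        if a == 2 and b == 2:
            ambiguous += 1
            continue
        # Alleles carried by the two haplotypes at each site.
        ai = (0, 1) if a == 2 else (a, a)
        bj = (0, 1) if b == 2 else (b, b)
        counts[(ai[0], bj[0])] += 1
        counts[(ai[1], bj[1])] += 1
    return counts, ambiguous


if __name__ == "__main__":
    data = [
        [None, 1, 0, 2, 1, 1, 0, 1, 0, 1],
        [1, 1, 0, 0, 1, 0, 0, 2, 0, 1],
        [0, 1, 2, 0, 1, 2, 0, 1, 0, 1],
        [2, 1, 1, 0, 1, 1, 0, None, 0, 1],
        [0, 1, 1, 0, 1, 2, 0, 0, 2, 1],
    ]
    print(two_snp_counts(data, 2, 5))
```

When at most one of the two sites is heterozygous, pairing the alleles in order is safe because either pairing yields the same two haplotypes.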
13Haplotype frequencies in 22
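- Random mating model implies Hardy-Weinberg Equilibrium (HWE)
\[
(F_{00}+F_{01}+F_{10}+F_{11})^2 =
\underbrace{F_{00}^2}_{G_{00}} + \underbrace{F_{01}^2}_{G_{01}} + \underbrace{F_{10}^2}_{G_{10}} + \underbrace{F_{11}^2}_{G_{11}}
+ \underbrace{2F_{00}F_{01}}_{G_{02}} + \underbrace{2F_{00}F_{10}}_{G_{20}}
+ \underbrace{2F_{00}F_{11} + 2F_{01}F_{10}}_{G_{22}}
+ \underbrace{2F_{01}F_{11}}_{G_{21}} + \underbrace{2F_{10}F_{11}}_{G_{12}}
\]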
- Even single-SNP haplotype frequencies may deviate
from HWE
- Accordingly we adjust expectation of 2-SNP
haplotype frequencies
- Compute expected haplotype frequencies in 22 as
best fitting to observed deviation in
single-site haplotype frequencies
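As a rough illustration of the plain HWE expansion, the sketch below computes the expected 2-SNP genotype frequencies from the four haplotype frequencies, and the share of the 22 genotype that is cis- under HWE. The function names are hypothetical, and the adjustment for deviation from HWE described above is not reproduced here because its details are not given in this text.

```python
def hwe_genotype_frequencies(f00, f01, f10, f11):
    """Expected 2-SNP genotype frequencies under random mating (HWE)."""
    return {
        "00": f00 ** 2,
        "01": f01 ** 2,
        "10": f10 ** 2,
        "11": f11 ** 2,
        "02": 2 * f00 * f01,
        "20": 2 * f00 * f10,
        "21": 2 * f01 * f11,
        "12": 2 * f10 * f11,
        # 22 collects both the cis- (00 + 11) and trans- (01 + 10) terms.
        "22": 2 * f00 * f11 + 2 * f01 * f10,
    }


def cis_share_of_22(f00, f01, f10, f11):
    """Fraction of 22 genotypes expected to be cis- (00 + 11) under HWE."""
    cis, trans = f00 * f11, f01 * f10
    total = cis + trans
    return cis / total if total > 0 else 0.5


if __name__ == "__main__":
    print(hwe_genotype_frequencies(0.4, 0.1, 0.1, 0.4))
    print(cis_share_of_22(0.4, 0.1, 0.1, 0.4))
```

The nine genotype frequencies sum to one whenever the four haplotype frequencies do, which is a convenient sanity check.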
14Phasing of multi-SNP genotypes
- Genotype graph for genotype g is a weighted complete graph G(g) where
  - Vertices = 2s, i.e., heterozygous SNPs in g
  - Weight w(i,j) = cij, the confidence in phasing 2 SNPs i and j
- Phasing of 2 heterozygous SNPs
  - cij > 0: cis-edge (22 = 00 + 11)
  - cij < 0: trans-edge (22 = 01 + 10)
- Phasing = genotype graph coloring
  - Color all vertices in two colors such that
  - any 2 vertices connected with a cis-edge have the same color, and
  - any 2 vertices connected with a trans-edge have opposite colors
Example (heterozygous sites a, b, c, d are SNPs 1, 3, 6 and 8):
Genotype:    2 1 2 0 1 2 0 2 0 1
Haplotype 1: 1 1 0 0 1 0 0 1 0 1
Haplotype 2: 0 1 1 0 1 1 0 0 0 1
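The link between a two-coloring and a phased haplotype pair can be sketched as below. The function name and the representation of colors as a dict from 0-based site index to 0 or 1 are assumptions made for this illustration.

```python
def haplotypes_from_coloring(genotype, colors):
    """Turn a two-coloring of the heterozygous sites into two haplotypes.

    `colors` maps each heterozygous site index to 0 or 1. The first
    haplotype takes the color as its allele and the second takes the
    complement; homozygous sites are copied unchanged.
    """
    h1, h2 = list(genotype), list(genotype)
    for site, g in enumerate(genotype):
        if g == 2:
            h1[site] = colors[site]
            h2[site] = 1 - colors[site]
    return h1, h2


if __name__ == "__main__":
    g = [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]
    # Sites a and d share one color, sites b and c share the other.
    print(haplotypes_from_coloring(g, {0: 1, 2: 0, 5: 0, 7: 1}))
```

Swapping the two colors everywhere yields the same pair of haplotypes in the opposite order, so only the partition of the vertices matters.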
15Genotype graph coloring
Frequent conflicts when coloring genotype graph G, since it has cycles
Genotype Graph Coloring Problem: find a coloring with the total weight (number) of conflicting edges minimized
Exact solution: ILP, slow and not accurate
Heuristic solution: find the maximum spanning tree (MST) of G and color the MST instead of G
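A minimal sketch of the heuristic, under two assumptions that are not spelled out above: the spanning tree is maximized over the absolute values |cij|, and a positive weight is a cis-edge (same color) while a non-positive weight is a trans-edge (opposite color). The function name `color_by_spanning_tree` is hypothetical, and the simple Prim-style loop is cubic in the number of heterozygous sites; it is meant for clarity, not speed.

```python
def color_by_spanning_tree(sites, weight):
    """Two-color heterozygous sites along a maximum spanning tree.

    `sites` is a list of site ids; `weight(u, v)` returns the signed
    confidence for the pair. The tree is grown Prim-style, always adding
    the edge with the largest |weight| that reaches a new site, and the
    new site is colored according to the sign of that edge.
    """
    if not sites:
        return {}
    colors = {sites[0]: 0}
    while len(colors) < len(sites):
        best = None
        for u in colors:
            for v in sites:
                if v in colors:
                    continue
                w = weight(u, v)
                if best is None or abs(w) > abs(best[2]):
                    best = (u, v, w)
        u, v, w = best
        colors[v] = colors[u] if w > 0 else 1 - colors[u]
    return colors


if __name__ == "__main__":
    table = {(0, 2): -3.0, (0, 5): -1.0, (0, 7): 2.5,
             (2, 5): 4.0, (2, 7): -0.5, (5, 7): 0.2}
    lookup = lambda u, v: table[(min(u, v), max(u, v))]
    print(color_by_spanning_tree([0, 2, 5, 7], lookup))
```

In this toy input the weak cis-edge between sites 5 and 7 ends up conflicting with the coloring, which is the kind of low-confidence conflict the tree heuristic accepts in exchange for speed.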
162SNP algorithm
- For each pair of SNPs do
  - Collect statistics on haplotype/genotype frequencies
  - Compute weights reflecting the likelihood of trans-/cis- phasing
- For each genotype g do
  - Find the MST of the complete graph G(g) whose vertices are the heterozygous sites
  - Color the vertices of G(g) and phase based on the coloring
- For each haplotype h with ?s (missing SNP values) do
  - Find the haplotype h' closest to h (with the minimum number of mismatches)
  - Replace the ?s in h with the known SNP values in h'
- Runtime (two bottlenecks)
  - O(nm): computing haplotype frequencies for 20m pairs of SNPs in each genotype, where n is the number of genotypes and m the number of SNPs
  - O(n^2 m): missing data recovery, finding the number of mismatches for any two haplotypes
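The missing-data step can be sketched as a nearest-neighbor fill. The function name `fill_missing` and the use of `None` for a missing SNP value are assumptions; mismatches are counted only at sites where both haplotypes have a known value, which is one reasonable reading of "minimum number of mismatches".

```python
def fill_missing(haplotypes):
    """Fill missing values (None) from the closest other haplotype.

    Comparing every haplotype with every other one over m sites gives the
    O(n^2 m) cost quoted for missing data recovery.
    """
    filled = []
    for idx, h in enumerate(haplotypes):
        if None not in h:
            filled.append(list(h))
            continue
        best, best_mismatches = None, None
        for jdx, other in enumerate(haplotypes):
            if jdx == idx:
                continue
            mismatches = sum(
                1 for a, b in zip(h, other)
                if a is not None and b is not None and a != b
            )
            if best is None or mismatches < best_mismatches:
                best, best_mismatches = other, mismatches
        if best is None:
            filled.append(list(h))  # nothing to copy from
        else:
            # A value stays None if the closest haplotype is also missing it.
            filled.append([b if a is None else a for a, b in zip(h, best)])
    return filled


if __name__ == "__main__":
    print(fill_missing([[0, 1, None, 1], [0, 1, 1, 1], [1, 0, 0, 0]]))
```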
17Datasets
- Chromosome 5q31: 129 genotypes with 103 SNPs derived from the 616 KB region of human Chromosome 5q31 (Daly et al., 2001).
- Yoruba population (D): 30 genotypes with SNPs from 51 various genomic regions, with the number of SNPs per region ranging from 13 to 114 (Gabriel et al., 2002).
- Random matching 5q31: 128 genotypes, each with 89 SNPs from the 5q31 cytokine gene, generated by random matching from 64 haplotypes of 32 West Africans (Hull et al., 2004).
- HapMap datasets: 30 genotypes of Utah residents and Yoruba residents available on HapMap by Dec 2005. The number of SNPs varies from 52 to 1381 across 40 regions, including ENm010, ENm013, ENr112, ENr113 and ENr123, spanning 500 KB regions of chromosome bands 7p15.2, 7q21.13, 2p16.3, 4q26 and 12q12 respectively, and two regions spanning the genes STEAP and TRPM8 plus 10 KB upstream and downstream.
18Unrelated individuals phasing validation
- Phasing methods can be validated on simulated data (haplotypes are known)
- Validation on real data is usually performed on trio data
  - Offspring haplotypes are mostly known (inferred from the parents' haplotypes)
- Error types
  - Single-site error
    - Number of SNPs in the phased offspring haplotypes that differ from the SNPs inferred from trio data, divided by (total number of SNPs) x (total number of haplotypes)
  - Individual error
    - Number of incorrectly phased offspring genotypes (at least one single-site error), divided by the total number of genotypes
  - Switching error
    - Minimum number of switches that must be made in the pair of haplotypes of a phased offspring genotype so that both haplotypes coincide with the haplotypes inferred from trio data, divided by the total number of heterozygous positions in the offspring genotypes
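The single-site and switching counts for one genotype can be sketched as follows; dividing by the totals described above is left to the caller. The function names are hypothetical, and the sketch assumes the inferred and true haplotype pairs both explain the same genotype with no missing values.

```python
def single_site_errors(inferred, truth):
    """Number of differing SNPs between inferred and true haplotype pairs.

    The pair is unordered, so both ways of matching inferred haplotypes
    to true haplotypes are tried and the smaller count is returned.
    """
    def diff(a, b):
        return sum(x != y for x, y in zip(a, b))

    straight = diff(inferred[0], truth[0]) + diff(inferred[1], truth[1])
    swapped = diff(inferred[0], truth[1]) + diff(inferred[1], truth[0])
    return min(straight, swapped)


def switch_count(inferred, truth, genotype):
    """Minimum number of switches needed to turn `inferred` into `truth`.

    Returns (switches, heterozygous_sites). A switch is counted whenever
    the inferred phase relative to the truth flips between two
    consecutive heterozygous sites.
    """
    het = [k for k, g in enumerate(genotype) if g == 2]
    agree = [inferred[0][k] == truth[0][k] for k in het]
    switches = sum(1 for x, y in zip(agree, agree[1:]) if x != y)
    return switches, len(het)


if __name__ == "__main__":
    g = [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]
    true_pair = ([1, 1, 0, 0, 1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0, 0, 0, 1])
    guess = ([1, 1, 0, 0, 1, 1, 0, 0, 0, 1], [0, 1, 1, 0, 1, 0, 0, 1, 0, 1])
    print(single_site_errors(guess, true_pair), switch_count(guess, true_pair, g))
```

A single switch can produce many single-site errors, which is why the two measures are reported separately.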
19Results
20Chromosome-Wide Phasing
Entire chromosomes for 30 trios from HapMap
Average errors: single-site 3.3%, switching 8.8%
Number of SNPs    Runtime
1.5K              2 sec
2.5K              8 sec
5.0K              25 sec
10.0K             55 sec
20.0K             220 sec
40.0K             17 min
60.0K             35 min
80.0K             70 min
21Conclusion
2SNP method
- Several orders of magnitude faster
- Scalable for genome-wide study
- Phases 10000 SNPs in less than one hour
- Same accuracy as PHASE and GERBIL