Phasing of 2-SNP Genotypes - PowerPoint PPT Presentation

About This Presentation
Title:

Phasing of 2-SNP Genotypes

Description:

GERBIL statistical method using maximum likelihood (ML), MST and expectation ... Phase 10000 SNPs in less than one hour. Same accuracy as PHASE and Gerbil ... – PowerPoint PPT presentation

Provided by: Dum96
Learn more at: http://www.cs.gsu.edu

Transcript and Presenter's Notes

Title: Phasing of 2-SNP Genotypes


1
  • Phasing of 2-SNP Genotypes
  • Based on Non-Random Mating Model
  • Dumitru Brinza
  • joint work with Alexander Zelikovsky
  • Department of Computer Science
  • Georgia State University
  • Atlanta, USA

2
Outline
  • Molecular biology terms
  • Motivation
  • Problem formulation
  • Previous work
  • Our contribution
  • Phasing of 2-SNP genotypes
  • Phasing of multi-SNP genotypes
  • Results

3
Molecular biology terms
  • Human genome: all the genetic material in the
    chromosomes; length 3 × 10^9 base pairs
  • Differences between any two people occur in 0.1%
    of the genome
  • SNP: single nucleotide polymorphism, a site where
    two or more different nucleotides occur in a
    large percentage of the population
  • Genotype: the entire genetic identity of an
    individual, including alleles, SNPs, or gene
    forms (e.g., AC CT TG AA AC TG)
  • Haplotype: a single set of chromosomes (half of
    the full set of genetic material) (e.g., A C T
    A A T)
  • A genotype is a mixture of two haplotypes.

4
From ACTG to 0, 1, 2 notation
  • Haplotypes:
  • Wild-type SNPs are referred to as 0
  • Mutated SNPs are referred to as 1
  • Genotypes:
  • Homozygous SNPs are referred to as 0 (mixture of
    0 and 0) or 1 (mixture of 1 and 1)
  • Heterozygous SNPs are referred to as 2 (mixture of
    0 and 1, or 1 and 0)
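
The 0/1/2 encoding above can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` is hypothetical and not from the presentation; the two haplotypes are the ones shown later on the genotype graph slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles give a homozygous site (0 or 1); different alleles give 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


hap1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
hap2 = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
print(genotype_from_haplotypes(hap1, hap2))  # [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]
```

Phasing is the inverse of this mapping: going from the genotype back to the pair of haplotypes.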

5
Motivation
  • Haplotypes may contain a large number of genetic
    markers that are responsible for human disease.
  • Haplotypes may increase the power of association
    between marker loci and phenotypic traits.
  • Evolutionary trees can be reconstructed based on
    haplotypes.
  • Physical phasing (haplotype inference) is too
    expensive. There is a great need for computational
    methods that extract haplotype information from the
    given genotype information.
  • Existing methods are either extremely slow or
    less accurate for genome-wide studies.

6
Phasing problem (Haplotype inference)
  • Haplotype inference, or genotype phasing, is the
    resolution of a genotype into two haplotypes
  • Given: n genotype vectors (entries 0, 1 or 2)
  • Find: n pairs of haplotype vectors, one pair of
    haplotypes per genotype, explaining the genotypes
  • For an individual genotype with h heterozygous sites
    there are 2^(h-1) possible haplotype pairs
    explaining this genotype (h is on the order of 20k
    genome-wide). In addition, around 10% of the data
    is missing.
  • This is hopeless without a genetic model
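
To make the 2^(h-1) count concrete, here is a small brute-force sketch that lists every haplotype pair explaining a genotype. The function name `haplotype_pairs` is hypothetical and the enumeration is for illustration only; it is not the method proposed in the presentation and is infeasible for large h.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Yield every unordered haplotype pair that explains a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        yield tuple(genotype), tuple(genotype)
        return
    # Fix the first heterozygous site to 0 in h1 so each unordered pair
    # is produced exactly once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site] = bit
            h2[site] = 1 - bit
        yield tuple(h1), tuple(h2)


pairs = list(haplotype_pairs([2, 1, 2, 0, 2]))
print(len(pairs))  # 3 heterozygous sites give 2**(3-1) = 4 pairs
```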

7
Previous work
  • PHASE: Bayesian statistical method (Stephens et
    al., 2001, 2003)
  • HAPLOTYPER: Monte Carlo approach (Niu
    et al., 2002)
  • Phamily: phases trio families based on PHASE
    (Acherman et al., 2003)
  • GERBIL: statistical method using maximum
    likelihood (ML), MST and expectation-maximization
    (EM) (Kimmel and Shamir, 2005)
  • SNPHAP: uses ML/EM assuming Hardy-Weinberg
    equilibrium (Clayton et al., 2004)

8
Contribution
  • We explore phasing of genotypes with 2 SNPs,
    which are ambiguous only when both sites are
    heterozygous. There are then two possible phasings,
    and the phasing problem reduces to inferring their
    frequencies.
  • Given the phasing solution for 2-SNP
    genotypes, we propose an algorithm for inferring
    the complete haplotypes of a given genotype,
    based on the maximum spanning tree of a complete
    graph whose vertices correspond to heterozygous
    sites and whose edge weights are given by the
    inferred 2-SNP frequencies.
  • Extensive experimental validation of the proposed
    methods and comparison with previously known
    methods

9
Phasing of 2-SNP genotypes
  • At least one SNP is homozygous: phasing is well
    defined
  • Both SNPs are heterozygous: ambiguity
  • Cis-phasing
  • Trans-phasing

Example
  • Genotype 21 resolves uniquely into haplotypes 01 and 11
  • Genotype 22 resolves either into 00 and 11 (cis-)
    or into 01 and 10 (trans-)
10
Odds of cis- or trans- phasing
  • Odds ratio of being phased cis- / trans-

Additive odds ratio is better (also noticed in
PHASE)
LD (linkage disequilibrium) between SNPs i and j
11
Confidence in cis- or trans- phasing
  • Closer pairs of SNPs are more strongly linked (fewer
    crossovers)
  • The confidence cij in phasing 2 SNPs i and j
    is inversely proportional to the squared distance
    between them

The logarithm makes the sign indicate the cis-/trans-
preference:
  • cij > 0 means cis- with certainty |cij|: genotype 22
    at sites i, j resolves into 00 and 11
  • cij < 0 means trans- with certainty |cij|: genotype 22
    at sites i, j resolves into 01 and 10
12
Certainty of cis- or trans- phasing
  • n: number of genotypes
  • F00, F01, F10, F11: true haplotype frequencies
    (observed true in 22)

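Genotypes (sites i and j are the third and sixth columns; ? is a
missing value) and the 2-SNP haplotypes each one contributes:
? 1 0 2 1 1 0 1 0 1  →  01 ×2
1 1 0 0 1 0 0 2 0 1  →  00 ×2
0 1 2 0 1 2 0 1 0 1  →  (00 ×1, 11 ×1) or (01 ×1, 10 ×1)
2 1 1 0 1 1 0 ? 0 1  →  11 ×2
0 1 1 0 1 2 0 0 2 1  →  10 ×1, 11 ×1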
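
The counting shown on this slide can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `two_snp_counts` is hypothetical, missing values are written as `None`, and sites i and j are assumed to be the third and sixth columns, which is what the listed counts imply.

```python
from collections import Counter


def two_snp_counts(genotypes, i, j):
    """Count unambiguous 2-SNP haplotypes at sites i, j and the 22 genotypes."""
    counts = Counter()
    ambiguous = 0
    for g in genotypes:
        gi, gj = g[i], g[j]
        if gi is None or gj is None:   # missing value: skip this genotype
            continue
        if gi == 2 and gj == 2:        # cis (00 + 11) or trans (01 + 10)
            ambiguous += 1
            continue
        # A heterozygous site gives allele 0 to one haplotype and 1 to the other.
        ai = (0, 1) if gi == 2 else (gi, gi)
        aj = (0, 1) if gj == 2 else (gj, gj)
        for a, b in zip(ai, aj):
            counts[f"{a}{b}"] += 1
    return counts, ambiguous


genotypes = [
    [None, 1, 0, 2, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 1, 0, 1],
    [2, 1, 1, 0, 1, 1, 0, None, 0, 1],
    [0, 1, 1, 0, 1, 2, 0, 0, 2, 1],
]
counts, ambiguous = two_snp_counts(genotypes, 2, 5)  # 0-based columns 3 and 6
print(dict(counts), ambiguous)
```

Only the third genotype is 22 at these sites, so it is the only one whose contribution depends on the cis-/trans- decision.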
13
Haplotype frequencies in 22
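  • Random mating model ⇒ Hardy-Weinberg Equilibrium
    (HWE)

$$(F_{00} + F_{01} + F_{10} + F_{11})^2 = F_{00}^2 + F_{01}^2 + F_{10}^2 + F_{11}^2 + 2F_{00}F_{01} + 2F_{00}F_{10} + 2F_{00}F_{11} + 2F_{01}F_{10} + 2F_{01}F_{11} + 2F_{10}F_{11}$$

Each term is the expected frequency of one 2-SNP genotype:

$$G_{00} = F_{00}^2,\quad G_{01} = F_{01}^2,\quad G_{10} = F_{10}^2,\quad G_{11} = F_{11}^2,\quad G_{02} = 2F_{00}F_{01},\quad G_{20} = 2F_{00}F_{10},$$
$$G_{22} = 2F_{00}F_{11} + 2F_{01}F_{10},\quad G_{21} = 2F_{01}F_{11},\quad G_{12} = 2F_{10}F_{11}$$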
  • Even single-SNP haplotype frequencies may deviate
    from HWE
  • Accordingly we adjust expectation of 2-SNP
    haplotype frequencies
  • Compute expected haplotype frequencies in 22 as
    best fitting to observed deviation in
    single-site haplotype frequencies
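
As a plain illustration of the random mating expansion on this slide (not of the adjusted, non-random mating estimate, whose formula is not given in the transcript), the following sketch computes the expected 2-SNP genotype frequencies from the four haplotype frequencies. The function names and the example frequencies are made up.

```python
from itertools import product


def genotype_of(h1, h2):
    """2-SNP genotype string (digits 0/1/2) produced by two 2-SNP haplotypes."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def hwe_genotype_freqs(F):
    """Expected 2-SNP genotype frequencies under random mating (HWE)."""
    G = {}
    # Ordered pairs of haplotypes, so each cross term is counted twice.
    for h1, h2 in product(F, repeat=2):
        g = genotype_of(h1, h2)
        G[g] = G.get(g, 0.0) + F[h1] * F[h2]
    return G


F = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}  # made-up frequencies
G = hwe_genotype_freqs(F)
for g in sorted(G):
    print(g, round(G[g], 4))
```

Note that G22 collects both the cis- term 2·F00·F11 and the trans- term 2·F01·F10, which is exactly why the 22 genotype is ambiguous.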

14
Phasing of multi-SNP genotypes
  • The genotype graph for genotype g is a weighted
    complete graph G(g) where
  • Vertices: the 2s, i.e., heterozygous SNPs in g
  • Weight: w(i,j) is given by cij, the confidence in
    phasing the 2 SNPs i and j
  • Phasing of 2 heterozygous SNPs:
  • cij > 0 → cis-edge: 22 resolves into 00 and 11
  • cij < 0 → trans-edge: 22 resolves into 01 and 10
  • Phasing = genotype graph coloring:
  • Color all vertices in two colors such that
  • any 2 vertices connected by a cis-edge have
    the same color, and
  • any 2 vertices connected by a trans-edge have
    opposite colors

Example (heterozygous sites a, b, c, d are positions 1, 3, 6 and 8):
Genotype     2 1 2 0 1 2 0 2 0 1
Haplotype 1  1 1 0 0 1 0 0 1 0 1
Haplotype 2  0 1 1 0 1 1 0 0 0 1
15
Genotype graph coloring
  • Frequent conflicts when coloring the genotype graph G,
    since it has cycles
  • Genotype Graph Coloring Problem: find a coloring with
    the total weight (number) of conflicting edges minimized
  • Exact solution: ILP, slow and not accurate
  • Heuristic solution: find a maximum spanning
    tree (MST) of G and color the MST instead of G
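
The MST heuristic can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `phase_by_mst` is hypothetical, the tree is grown with Prim's algorithm using |cij| as the edge weight (an assumption; the slides only say the weight is the confidence), the sign of cij decides cis- or trans-, and the confidence values are made up.

```python
def phase_by_mst(genotype, c):
    """Phase one 0/1/2 genotype; c[(i, j)] with i < j is the signed confidence."""
    het = [s for s, g in enumerate(genotype) if g == 2]
    h1, h2 = list(genotype), list(genotype)
    if not het:
        return h1, h2
    color = {het[0]: 0}  # the root of the tree gets color 0
    while len(color) < len(het):
        # Prim's step: most confident edge leaving the already colored set.
        u, v = max(
            ((a, b) for a in color for b in het if b not in color),
            key=lambda e: abs(c[min(e), max(e)]),
        )
        cis = c[min(u, v), max(u, v)] > 0
        # Cis-edge: same color. Trans-edge: opposite color.
        color[v] = color[u] if cis else 1 - color[u]
    for s in het:
        h1[s], h2[s] = color[s], 1 - color[s]
    return h1, h2


genotype = [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]
# Made-up confidences for the heterozygous sites 0, 2, 5, 7 (0-based).
c = {(0, 2): -3.0, (0, 5): -1.0, (0, 7): 2.5,
     (2, 5): 4.0, (2, 7): -0.5, (5, 7): 1.0}
h1, h2 = phase_by_mst(genotype, c)
print(h1)
print(h2)
```

With these values the edge (5, 7) conflicts with the rest of the graph; because it is not in the spanning tree, it is simply ignored, which is the point of coloring the tree rather than the full graph.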
16
2SNP algorithm
  • For each pair of SNPs do:
  • Collect statistics on haplotype/genotype
    frequencies
  • Compute weights reflecting the likelihood of
    trans-/cis- phasing
  • For each genotype g do:
  • Find the MST of the complete graph G(g) whose
    vertices are the heterozygous sites
  • Color the vertices of G(g) and phase based on the coloring
  • For each haplotype h with ?s (missing SNP
    values) do:
  • Find a haplotype h' closest to h (with the minimum
    number of mismatches)
  • Replace the ?s in h with the known SNP values in h'
  • Runtime (two bottlenecks):
  • O(nm): computing haplotype frequencies for 20m
    pairs of SNPs in each genotype, where n is the number
    of genotypes and m the number of SNPs
  • O(n^2 m): missing data recovery, finding the number of
    mismatches for any two haplotypes
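
The missing data recovery step can be sketched as below. The function name `fill_missing` is hypothetical and missing values are written as `None`. One detail is an assumption added here: when the closest haplotype is itself missing a needed value, the sketch falls back to the next closest one, which the slides do not specify.

```python
def fill_missing(haplotypes):
    """Fill None entries from the closest other haplotype (fewest mismatches)."""
    def mismatches(h, other):
        # Compare only the sites where both values are known.
        return sum(1 for a, b in zip(h, other)
                   if a is not None and b is not None and a != b)

    filled = []
    for k, h in enumerate(haplotypes):
        if None not in h:
            filled.append(list(h))
            continue
        others = [o for idx, o in enumerate(haplotypes) if idx != k]
        result = list(h)
        # Visit donors from closest to farthest until every gap is filled.
        for donor in sorted(others, key=lambda o: mismatches(h, o)):
            result = [d if r is None else r for r, d in zip(result, donor)]
            if None not in result:
                break
        filled.append(result)
    return filled


haps = [[0, 1, None, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
print(fill_missing(haps))  # [[0, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0]]
```

Comparing every haplotype against every other one over m sites is what gives the O(n^2 m) bottleneck mentioned above.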

17
Datasets
  • Chromosome 5q31: 129 genotypes with 103 SNPs
    derived from the 616 KB region of human
    Chromosome 5q31 (Daly et al., 2001).
  • Yoruba population (D): 30 genotypes with SNPs
    from 51 various genomic regions, with the number of
    SNPs per region ranging from 13 to 114 (Gabriel
    et al., 2002).
  • Random matching 5q31: 128 genotypes, each with 89
    SNPs from the 5q31 cytokine gene, generated by random
    matching from 64 haplotypes of 32 West Africans
    (Hull et al., 2004).
  • HapMap datasets: 30 genotypes of Utah residents
    and Yoruba residents available on HapMap by Dec
    2005. The number of SNPs varies from 52 to 1381
    across 40 regions, including ENm010, ENm013,
    ENr112, ENr113 and ENr123, spanning 500 KB regions
    of chromosome bands 7p15.2, 7q21.13, 2p16.3, 4q26
    and 12q12 respectively, and two regions spanning
    the genes STEAP and TRPM8 plus 10 KB upstream and
    downstream.

18
Unrelated individuals phasing validation
  • Phasing methods can be validated on simulated
    data (haplotypes are known)
  • Validation on real data is usually performed
    on trio data
  • Offspring haplotypes are mostly known (inferred
    from the parents' haplotypes)
  • Error types:
  • Single-site error:
  • Number of SNPs in the phased offspring haplotypes
    that differ from the SNPs inferred from trio data,
    divided by (total number of SNPs) x (total number
    of haplotypes)
  • Individual error:
  • Number of correctly phased offspring genotypes
    (no single-site errors), divided by the total number of
    genotypes
  • Switching error:
  • Minimum number of switches that must be made
    in the pair of haplotypes of a phased offspring
    genotype so that both haplotypes coincide
    with the haplotypes inferred from trio data, divided
    by the total number of heterozygous positions in
    the offspring genotypes.
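
A minimal sketch of the single-site and switching counts for one genotype is given below. The function names are hypothetical, and the sketch assumes that the predicted and reference pairs explain the same genotype and contain no missing values; the rates follow the denominators given on the slide.

```python
def diff(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def single_site_errors(pred, true):
    """Mismatched SNPs between a predicted and a reference haplotype pair."""
    (p1, p2), (t1, t2) = pred, true
    # The pair is unordered, so try both ways of matching predicted to reference.
    return min(diff(p1, t1) + diff(p2, t2), diff(p1, t2) + diff(p2, t1))


def switches(pred, true):
    """Minimum number of switches turning the predicted pair into the reference."""
    (p1, p2), (t1, _) = pred, true
    het = [s for s in range(len(p1)) if p1[s] != p2[s]]
    # At each heterozygous site, record whether p1 agrees with t1.
    agree = [p1[s] == t1[s] for s in het]
    # Every change of agreement between neighbouring heterozygous sites
    # requires one switch.
    return sum(a != b for a, b in zip(agree, agree[1:]))


true = ([1, 1, 0, 0, 1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0, 0, 0, 1])
pred = ([1, 1, 0, 0, 1, 0, 0, 0, 0, 1], [0, 1, 1, 0, 1, 1, 0, 1, 0, 1])

n_snps, n_haps, n_het = 10, 2, 4
print(single_site_errors(pred, true) / (n_snps * n_haps))  # single-site error
print(switches(pred, true) / n_het)                        # switching error
```

In this example the last heterozygous site is phased the wrong way round, which costs two single-site mismatches but only one switch.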

19
Results
20
Chromosome-Wide Phasing
Entire chromosomes for 30 trios from HapMap
Average errors: single-site 3.3%, switching 8.8%
SNPs     Runtime
1.5K     2 sec
2.5K     8 sec
5.0K     25 sec
10.0K    55 sec
20.0K    220 sec
40.0K    17 min
60.0K    35 min
80.0K    70 min
21
Conclusion
  • 2SNP method:
  • Several orders of magnitude faster
  • Scalable for genome-wide study
  • Phase 10000 SNPs in less than one hour
  • Same accuracy as PHASE and Gerbil