Highly Scalable Genotype Phasing by Entropy Minimization - PowerPoint PPT Presentation

About This Presentation
Title:

Highly Scalable Genotype Phasing by Entropy Minimization

Description:

Title: PowerPoint Presentation Author: Hydra Group CS&E Last modified by * Created Date: 8/29/2002 10:09:00 PM Document presentation format: Letter Paper (8.5x11 in) – PowerPoint PPT presentation

Slides: 17
Provided by: HydraG8

Transcript and Presenter's Notes

1
Highly Scalable Genotype Phasing by Entropy Minimization
Bogdan Pasaniuc and Ion Mandoiu
Computer Science & Engineering Department,
University of Connecticut
2
SNPs
Sequence → haplotype
  • ataggtccCtatttcgcgcCgtatacacgggActata → CCA
  • ataggtccGtatttcgcgcCgtatacacgggTctata → GCT
  • ataggtccCtatttcgcgcCgtatacacgggTctata → CCT

  • Genome variation: 0.1% of the DNA differs from
    one individual to another
  • 80% of the variation is represented by Single
    Nucleotide Polymorphisms (SNPs)
  • 2 possible nucleotides (alleles) for each SNP
  • Haplotype: description of SNP alleles on one
    chromosome
  • 0/1 vector

3
Notations
  • Diploid organisms: two copies of each chromosome
  • One from the mother and one from the father
  • Genotype: description of alleles on both
    chromosomes
  • 0/1/2 vector
  • 0 (1): both chromosomes contain the major
    (resp. minor) allele
  • 2: the chromosomes contain different alleles

[Figure: two haplotypes per individual and the resulting genotype for the individual]
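
The 0/1/2 encoding above can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` and the example vectors are hypothetical, chosen only for illustration.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Same allele on both chromosomes -> keep that allele (0 or 1);
    # different alleles -> 2 (ambiguous site).
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# Hypothetical example: one chromosome copy from each parent.
maternal_copy = (0, 1, 1, 0)
paternal_copy = (0, 0, 1, 1)
print(genotype_from_haplotypes(maternal_copy, paternal_copy))  # (0, 2, 1, 2)
```

Going from haplotypes to a genotype is deterministic; the reverse direction is the ambiguous step that phasing has to resolve.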
4
Genotype Population Phasing
  • For a genotype with k 2s there are 2^(k-1) possible
    pairs of haplotypes explaining it
  • Physical phasing is too expensive
  • Computational phasing is much cheaper
  • Statistical methods: PHASE, Phamily, PL, GERBIL
  • Combinatorial methods: Parsimony, HAP, 2SNP, ENT
  • Current genotyping platforms: > 500k SNPs in one
    experiment
  • Need for fast and accurate methods
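
The 2^(k-1) count can be checked with a small enumeration. This is an illustrative sketch, not code from ENT; the function name `explanations` is an assumption.

```python
from itertools import product


def explanations(genotype):
    """Yield every unordered haplotype pair (h, h') explaining a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No ambiguous site: both haplotypes equal the genotype itself.
        h = tuple(genotype)
        yield h, h
        return
    # Fix allele 0 at the first ambiguous site of h so that (h, h') and
    # (h', h) are not listed twice; the other k-1 sites are free.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site] = bit
            h2[site] = 1 - bit
        yield tuple(h1), tuple(h2)


pairs = list(explanations((0, 2, 1, 2, 2)))
print(len(pairs))  # 4, i.e. 2 ** (3 - 1) for k = 3 ambiguous sites
```

The exponential growth in k is why exhaustive enumeration is only practical over short stretches of SNPs.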

5
Minimum Entropy Population Phasing
  • Phasing: function f that assigns to each
    genotype g a pair of haplotypes (h, h') that
    explains g
  • Coverage of h in f: number of times h appears in
    the image of f
  • Entropy of a phasing

Minimum Entropy Population Phasing [Halperin & Karp
04]: Given a set of genotypes, find a phasing
with minimum entropy
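
The entropy formula itself is not preserved in this transcript. A minimal sketch follows, assuming the usual Shannon entropy over haplotype frequencies, where the frequency of h is its coverage divided by 2n for n genotypes. The base-2 logarithm and the function names are assumptions; the choice of base does not change which phasing is minimal.

```python
from collections import Counter
from math import log2


def coverage(phasing):
    """Count how often each haplotype appears in the image of the phasing."""
    counts = Counter()
    for h1, h2 in phasing:
        counts[h1] += 1
        counts[h2] += 1
    return counts


def entropy(phasing):
    """Shannon entropy of haplotype frequencies (assumed definition)."""
    counts = coverage(phasing)
    total = 2 * len(phasing)  # two haplotypes per genotype
    return -sum(c / total * log2(c / total) for c in counts.values())


# Two individuals with genotype (2, 2), phased the same way or differently.
shared = [((0, 0), (1, 1)), ((0, 0), (1, 1))]
distinct = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
print(entropy(shared))    # 1.0
print(entropy(distinct))  # 2.0
```

The phasing that reuses the same haplotypes has lower entropy, which is the behaviour the minimum-entropy objective rewards.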
6
Basic ENT Algorithm
  • Initialization
  • Random phasing
  • Iterative improvement
  • While there exists a genotype whose re-phasing
    decreases the entropy of f, find the genotype
    that yields the highest decrease in entropy and
    re-phase it
  • Entropy is informative only for a small number of
    SNPs
  • Large number of SNPs → no common haplotypes
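
The local-improvement loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ENT implementation: it recomputes the entropy from scratch for each candidate instead of computing the gain incrementally, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explanations(genotype):
    """All unordered haplotype pairs explaining a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(phasing):
    """Shannon entropy (base 2, an assumption) of haplotype frequencies."""
    counts = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in counts.values())


def basic_ent(genotypes, seed=0):
    """Random initial phasing, then repeated best single re-phasing."""
    rng = random.Random(seed)
    options = [explanations(g) for g in genotypes]
    phasing = [rng.choice(opts) for opts in options]
    current = entropy(phasing)
    while True:
        best = None  # (entropy after the move, genotype index, new pair)
        for i, opts in enumerate(options):
            old = phasing[i]
            for pair in opts:
                if pair == old:
                    continue
                phasing[i] = pair
                candidate = entropy(phasing)
                if candidate < current - 1e-12 and (best is None or candidate < best[0]):
                    best = (candidate, i, pair)
            phasing[i] = old  # undo the trial move
        if best is None:
            return phasing, current  # no re-phasing decreases the entropy
        current, i, pair = best
        phasing[i] = pair


phasing, value = basic_ent([(2, 2), (0, 0), (1, 1), (2, 2)])
print(phasing, value)
```

On this toy input the homozygous genotypes make haplotypes 00 and 11 common, so both ambiguous genotypes end up phased as (00, 11).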

7
Overlapping Window approach
  • Entropy is computed over short windows of size
    l+f
  • l locked SNPs: previously phased
  • f free SNPs: currently being phased
  • only phasings consistent with the l locked SNPs
    are considered
  • l and f:
  • user-specified parameters
  • auto-computed inside the algorithm, based on the
    number of ambiguous SNPs (2s) present
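
One way to picture the window layout is the sketch below, which assumes fixed user-specified l and f and advances by f SNPs per step. The auto-computed mode is not modelled, and the function name `windows` is hypothetical.

```python
def windows(num_snps, locked, free):
    """Yield (locked_sites, free_sites) index ranges from left to right."""
    if free <= 0 or locked < 0:
        raise ValueError("free must be positive and locked non-negative")
    start = 0
    while start < num_snps:
        end = min(start + free, num_snps)
        # The locked sites are the last `locked` SNPs already phased.
        lock_start = max(start - locked, 0)
        yield range(lock_start, start), range(start, end)
        start = end


for locked_sites, free_sites in windows(10, locked=2, free=4):
    print(list(locked_sites), list(free_sites))
# [] [0, 1, 2, 3]
# [2, 3] [4, 5, 6, 7]
# [6, 7] [8, 9]
```

Within each window, a candidate phasing of the free sites would be kept only if it extends the haplotypes already fixed on the locked sites.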

8
ENT Time Complexity
  • n unrelated genotypes over k SNPs
  • k/f windows
  • n·2^f candidate haplotype pairs per window are
    evaluated (pessimistic estimate)
  • Computing the entropy gain takes O(1) time per
    candidate pair
  • Empirically, the number of iterations is linear in n
  • Total runtime: O(n^2 · 2^f · k/f)
  • Number of iterations reduced to a constant by
    re-explaining multiple genotypes in one step
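
The total follows from multiplying the factors listed above. The second line is an inference from the last bullet (a constant number of iterations), not a bound stated on the slide.

```latex
\underbrace{\frac{k}{f}}_{\text{windows}}
\times \underbrace{O(n)}_{\text{iterations}}
\times \underbrace{n \cdot 2^{f}}_{\text{candidate pairs}}
\times \underbrace{O(1)}_{\text{entropy gain}}
= O\!\left(\frac{n^{2}\, 2^{f}\, k}{f}\right)

\frac{k}{f} \times O(1) \times n \cdot 2^{f} \times O(1)
= O\!\left(\frac{n\, 2^{f}\, k}{f}\right)
\quad \text{(assuming a constant number of iterations)}
```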

9
Extension to general pedigrees
  • Genotypes coming from related individuals
  • At each step re-explain an entire family
  • No Recombination Assumption: a parent transmits
    one of its chromosomes to the child
  • A trio family (mother, father, child) is phased
    using 4 haplotypes
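
Under the no-recombination assumption, the child's two haplotypes are one of the mother's two and one of the father's two, so four haplotypes cover the whole trio. A minimal sketch, with hypothetical function names and made-up example data:

```python
from itertools import product


def genotype_of(h1, h2):
    """0/1/2 genotype of a haplotype pair."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def transmitted_pairs(mother_pair, father_pair, child_genotype):
    """(maternal, paternal) haplotype choices that explain the child."""
    return [
        (m, f)
        for m, f in product(mother_pair, father_pair)
        if genotype_of(m, f) == tuple(child_genotype)
    ]


mother = ((0, 1, 0), (1, 1, 0))
father = ((0, 0, 1), (1, 0, 1))
child = (0, 2, 2)
print(transmitted_pairs(mother, father, child))
# [((0, 1, 0), (0, 0, 1))]
```

An empty result would mean that this choice of parental haplotypes is inconsistent with the child's genotype.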

10
Experimental Setting
  • Dataset I: 129 family trios over 103 SNPs [Daly
    et al. 2001]
  • From trios, using the no-recombination assumption,
    we recovered partial haplotypes for children
  • ENT run on the children treated as unrelated
    genotypes
  • Partial haplotypes used for testing accuracy of
    our method
  • Switching error rate
  • Given the true haplotypes (t, t') and the
    inferred ones (h, h'), the switching error rate is
    the ratio (given in percent) between the number
    of times we have to switch from reading h to h'
    to obtain t and the number of ambiguous SNPs.
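
The definition above can be sketched for a single individual as follows. The sketch assumes complete haplotypes (no missing or partial sites) and that both pairs explain the same genotype; the function name is hypothetical.

```python
def switching_error_rate(true_pair, inferred_pair):
    """Percent of switches between h and h' needed to read off t."""
    t, t_prime = true_pair
    h, _ = inferred_pair
    # Ambiguous SNPs are the sites where the two true haplotypes differ.
    ambiguous = [i for i, (a, b) in enumerate(zip(t, t_prime)) if a != b]
    if not ambiguous:
        return 0.0
    # At each ambiguous site, record which inferred haplotype agrees
    # with t: 0 for h, 1 for h'.
    source = [0 if h[i] == t[i] else 1 for i in ambiguous]
    switches = sum(1 for a, b in zip(source, source[1:]) if a != b)
    # Denominator follows the slide: the number of ambiguous SNPs.
    return 100.0 * switches / len(ambiguous)


true_pair = ((0, 1, 0, 1), (1, 0, 1, 0))
inferred_pair = ((0, 1, 1, 0), (1, 0, 0, 1))
print(switching_error_rate(true_pair, inferred_pair))  # 25.0
```

Here t is read from h for the first two sites and from h' for the last two, so one switch over four ambiguous SNPs gives 25%.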

11
Daly dataset/different window sizes
ENT auto/auto used in following experiments
12
Comparison with other methods: Daly Dataset
Method          Switching Error (%)   Haplotype Accuracy (%)   SNP Accuracy (%)
PHASE 2.1       3.09                  55.81                    98.29
GERBIL          3.18                  44.96                    97.89
2SNP            3.18                  51.16                    98.58
ENT auto/auto   4.21                  42.64                    98.96
13
Dataset II
  • Hapmap.org Phase I release 16 datasets
  • The International HapMap Project
  • Two 30-trio populations
  • CEU: Utah residents with ancestry from northern
    and western Europe
  • YRI: Yoruba people of Ibadan, Nigeria
  • Haplotypes obtained by PHASE
  • 2SNP [Brinza & Zelikovsky 05]
  • phasing based on genotype statistics collected
    only for pairs of SNPs
  • Pure Parsimony Trio Phasing (PPTP) [Brinza et al.
    05]
  • minimizes the number of distinct haplotypes used
    for phasing
  • Integer Linear Programming based method

14
Hapmap phase I chromosome 22
Switching Error (%)
Pop   SNPs    ENT    2SNP   PPTP
CEU   15548   9.65   4.98   26.68
YRI   16386   6.96   8.97   22.47
Runtime (CPU sec)
Pop   SNPs    ENT   2SNP   PPTP
CEU   15548   129   1784   1320
YRI   16386   65    2057   1400
  • All chrs, ENT: 3h 20m for 1,653,765 SNPs
  • All chrs, PHASE: over a month on two clusters
    with a combined total of 238 nodes.

15
Missing data recovery
  • We randomly deleted 1-10% of genotype SNPs
  • Used genotypes with missing data as input
  • Measured the percent of correctly recovered
    alleles

Percent of correctly recovered SNP alleles
Deleted SNP alleles (%)   1       2       5       10
CEU (chr 22)              97.97   97.95   97.71   97.46
YRI (chr 22)              96.70   96.67   96.33   95.98
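
The evaluation protocol above (delete at random, re-infer, compare) can be sketched as below. This is a simplified illustration: it scores whole genotype entries rather than individual alleles, the recovery step itself is not shown, and `MISSING`, `delete_entries` and `percent_recovered` are hypothetical names.

```python
import random

MISSING = None  # hypothetical marker for a deleted genotype entry


def delete_entries(genotypes, fraction, seed=0):
    """Return a masked copy of the genotypes and the deleted positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, g in enumerate(genotypes) for j in range(len(g))]
    deleted = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(g) for g in genotypes]
    for i, j in deleted:
        masked[i][j] = MISSING
    return masked, deleted


def percent_recovered(truth, recovered, deleted):
    """Percent of deleted entries whose recovered value matches the truth."""
    if not deleted:
        return 100.0
    correct = sum(truth[i][j] == recovered[i][j] for i, j in deleted)
    return 100.0 * correct / len(deleted)


genotypes = [[0, 1, 2, 0, 1], [2, 2, 0, 1, 0]]
masked, deleted = delete_entries(genotypes, fraction=0.2)
print(masked, deleted)
```

A phasing method that fills in the `MISSING` entries would then be scored by passing its output to `percent_recovered`.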
16
Conclusions
  • ENT is several orders of magnitude faster than
    current methods
  • Phasing accuracy close to the best methods
  • Current version handles any type of pedigree data
  • Code for download / Web server:
    http://dna.engr.uconn.edu/software/ent/
  • Thanks: Alexander Gusev and NSF Grants 0546457
    and 0543365