Linear Reduction for Haplotype Inference

Learn more at: https://www.cs.gsu.edu
1
Linear Reduction for Haplotype Inference
  • Alex Zelikovsky
  • joint work with Jingwu He

WABI 2004
2
Outline
  • SNP, haplotypes and genotypes
  • Haplotype Inference
  • Linear reduction method
  • Improvements
  • Experimental results
  • Conclusions and future work

3
Human Genome and SNP
  • Length of the human genome ≈ 3 × 10^9 base pairs
  • Difference between any two people ≈ 0.1% of the
    genome
  • ≈ 3 × 10^6 base pairs
  • Total number of single nucleotide polymorphisms
    (SNPs)
  • ≈ 1 × 10^7 base pairs
  • SNPs are mostly bi-allelic, e.g.,
  • two variants (alleles) out of 4 possible
    (A, C, T, G): A/C
  • having a nucleotide in a certain position or
    missing it: A/-
  • Major allele: the more frequent allele (wild type),
    as opposed to the SNP
  • Minor allele (snip) frequency should be
    biologically considerable, e.g., over 1%
  • There are more SNPs that are less frequent

4
Haplotype and Disease Association
  • Deafness inheritance → moral problems
  • SNPs contribute to risk factors of complex
    diseases
  • having a certain SNP increases the chance of
    having diabetes 10 times
  • but the association is too weak for doctors: 3 ×
    10^-6 → 30 × 10^-6
  • combinations of SNPs (haplotypes) are
    responsible for diseases
  • International HapMap project: http://www.hapmap.org
  • SNP maps are constructed across the human genome
    with a density of about one SNP per thousand
    nucleotides.
  • HapMap tries to identify 1 million tag SNPs
    providing almost as much mapping information as
    the entire 10 million SNPs
  • Unfortunately, much less is known about SNP
    combinations

5
Haplotypes and Genotypes
  • Diploid organisms: two different copies of
    each chromosome, which are recombined copies of the
    parents' chromosomes
  • Too expensive to examine the two versions of a
    chromosome separately
  • Much cheaper to obtain genotype (mixed) data
    rather than haplotype (separated) data
  • Haplotype: description of a single copy (0 = wild
    type, 1 = minor allele)
  • Genotype: description of the two copies mixed
    (0 = 00, 1 = 11, 2 = 01)
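
The encoding above can be illustrated with a minimal sketch. The function name `genotype` is a hypothetical helper chosen for this illustration, not something from the presentation; it only restates the rule 0 = 00, 1 = 11, 2 = 01.

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    Encoding from the slide: 0 = both copies 0, 1 = both copies 1,
    2 = the two copies differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


print(genotype([0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 2, 2]
```

Note that the genotype no longer records which copy carried which allele at the two heterozygous sites; recovering that is the inference problem.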

6
Haplotype Inference Problem
  • Haplotype Inference (HI) Problem
  • Given n genotype vectors (entries 0, 1 or 2),
  • Find n pairs of haplotype vectors, one pair of
    haplotypes per genotype, that explain the genotypes
  • For an individual genotype with h heterozygous sites
    there are 2^(h-1) possible haplotype pairs
    explaining this genotype
  • This is hopeless without a genetic model
  • Parsimonious models: minimize the number of
    haplotypes
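
The 2^(h-1) count can be checked by brute force. The sketch below is only an illustration under the 0/1/2 encoding of the previous slide; `explaining_pairs` is a hypothetical name, and real inference tools do not enumerate pairs this way.

```python
from itertools import product


def explaining_pairs(g):
    """Yield every unordered haplotype pair that explains genotype g (0/1/2)."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        h = list(g)
        yield h, h
        return
    # Fix the first heterozygous site to 0 in h1 so that each unordered
    # pair is produced exactly once.
    for bits in product([0, 1], repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield h1, h2


pairs = list(explaining_pairs([2, 0, 2, 2, 1]))
print(len(pairs))  # 4, i.e. 2^(3-1) for h = 3 heterozygous sites
```

The count doubles with every additional heterozygous site, which is why a genetic model is needed to choose among the candidates.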

7
Computational Haplotype Inference Problem
  • Assumptions
  • small number of repeated mutations
  • small number of recombinations
  • If data allow, then explain them only with
    mutations (perfect phylogeny)
  • This is possible when there are no 4-gamete rule
    violations
  • for any pair of SNPs, at most 3 of the 4
    combinations (00/01/10/11) are present
  • Fastest implemented algorithm: DPPH
  • Known programs for general data (with possible
    4-gamete rule violations)
  • PHASE, HAPLOTYPER, HAP, Set-cover based, etc.
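
A minimal sketch of the 4-gamete test, assuming the input is a list of 0/1 haplotypes (checking genotype data directly is more involved and is not shown). The name `four_gamete_violations` is hypothetical.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return the pairs of sites (i, j) at which all four gametes occur."""
    if not haplotypes:
        return []
    m = len(haplotypes[0])
    bad = []
    for i, j in combinations(range(m), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if len(seen) == 4:  # 00, 01, 10 and 11 are all present
            bad.append((i, j))
    return bad


H = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
print(four_gamete_violations(H))  # [(0, 1)]
```

An empty result means the haplotypes are consistent with a perfect phylogeny; each reported pair cannot be explained by mutations alone under that model.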

8
Reducing the Set of SNPs
  • Often many columns corresponding to SNP sites are
    analogous: one column can be obtained from
    another by swapping 0s and 1s
  • One of such columns can be dropped, same as for
    two equal columns
  • What would be the generalization?
  • If one site is dependent on (or can be
    reconstructed from) k other sites, then drop
    this dependent site: it does not carry any
    useful additional information
  • General reduction method
  • Encoding: reduce the number of sites by removing
    dependent sites
  • Infer site-reduced haplotypes for the
    site-reduced genotypes using a known haplotype
    inference method
  • Decoding: reconstruct dependent SNPs from the sites
    of the reduced haplotypes
  • Main requirement: the reduction method should be
    fast
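
The simplest case, dropping equal and analogous columns, can be sketched as follows. This is an illustration only: `drop_analogous_columns` is a hypothetical name, and the assumption that a heterozygous entry 2 stays 2 when 0s and 1s are swapped is mine, not stated on the slide.

```python
def drop_analogous_columns(G):
    """Keep one representative of each group of equal or 0/1-swapped columns.

    G is a list of genotype rows over {0, 1, 2}. Returns (kept, dropped),
    where dropped maps a removed column index to (representative, swapped).
    """
    swap = {0: 1, 1: 0, 2: 2}  # assumption: heterozygous sites are unchanged
    seen = {}                  # column contents -> index of the kept column
    kept, dropped = [], {}
    for j, col in enumerate(zip(*G)):
        flipped = tuple(swap[x] for x in col)
        if col in seen:
            dropped[j] = (seen[col], False)
        elif flipped in seen:
            dropped[j] = (seen[flipped], True)
        else:
            seen[col] = j
            kept.append(j)
    return kept, dropped


G = [[0, 1, 0, 2], [1, 0, 1, 2], [2, 2, 2, 0]]
print(drop_analogous_columns(G))  # ([0, 3], {1: (0, True), 2: (0, False)})
```

The `dropped` map is what a decoding step would use to restore the removed sites from the kept ones.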

9
Linear Dependence of SNPs
  • Consider linear dependence
  • To make analogous sites linearly dependent,
    change notation: 0/1 → -1/1
  • Also for genotypes: 0/1/2 → -1/1/0, so that a genotype is
    the half-sum of (and hence linearly dependent on) its
    explaining haplotypes
  • Keep only linearly independent SNPs (tag SNPs);
    all other SNPs can be reconstructed using linear
    combinations
  • Equivalent factorization problem: find a
    representation
  • G = I_X · H
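
A small sketch of the change of notation, assuming the usual reading that the genotype value 2 (heterozygous) becomes 0. The helper names are hypothetical.

```python
def to_signed_haplotype(h):
    """Map a 0/1 haplotype to -1/1 notation."""
    return [2 * x - 1 for x in h]


def to_signed_genotype(g):
    """Map a 0/1/2 genotype to -1/1/0 notation (2, heterozygous, becomes 0)."""
    return [{0: -1, 1: 1, 2: 0}[x] for x in g]


def half_sum(h1, h2):
    """Genotype of two -1/1 haplotypes: the site-wise half-sum."""
    return [(a + b) // 2 for a, b in zip(h1, h2)]


h1 = to_signed_haplotype([0, 1, 0, 1])
h2 = to_signed_haplotype([0, 1, 1, 0])
print(half_sum(h1, h2))                  # [-1, 1, 0, 0]
print(to_signed_genotype([0, 1, 2, 2]))  # [-1, 1, 0, 0]
```

In this notation, swapping 0s and 1s in a column is multiplication by -1, so analogous columns become linearly dependent.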

10
Factorization Problem
  • Factorization problem
  • Given a 0/1/-1 genotype matrix G
  • Find a representation
  • G = I_X · H
  • where I_X is a graph incidence matrix
    (exactly two 1s in each row)
  • and H is a -1/1 haplotype matrix
  • Solution
  • Factorize G = T · (E | C)
  • where T (tags) is a basis of the columns of G
  • - solve the factorization for T: T = I_X · H'
  • - finally, G = (I_X · H') · (E | C) = I_X ·
    (H' · (E | C)) = I_X · H
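
The first step, splitting the columns of G into tag columns and a coefficient matrix, can be sketched with exact rational Gaussian elimination. This is an assumption-laden illustration, not the authors' encoding algorithm (that slide's content is not in the transcript): `tag_columns` is a hypothetical name, columns are scanned left to right, and the returned coefficients play the role of the matrix that multiplies T.

```python
from fractions import Fraction


def tag_columns(G):
    """Split the columns of G into a basis (tags) and dependent columns.

    Returns (tags, coeffs): tags lists the indices of the basis columns;
    coeffs[j] holds the coefficients that write column j as a linear
    combination of the tag columns, in the order given by tags.
    """
    cols = [[Fraction(x) for x in col] for col in zip(*G)]
    basis = []  # (pivot row, reduced vector, its expression in the tags)
    tags, raw = [], []
    for j, col in enumerate(cols):
        v = col[:]
        expr = [Fraction(0)] * len(tags)  # invariant: col = v + sum expr[t] * tag t
        for pivot, b, b_expr in basis:
            if v[pivot] != 0:
                f = v[pivot] / b[pivot]
                v = [x - f * y for x, y in zip(v, b)]
                for t, c in enumerate(b_expr):
                    expr[t] += f * c
        if any(v):
            # Independent column: it becomes a new tag.
            pivot = next(i for i, x in enumerate(v) if x != 0)
            basis.append((pivot, v, [-c for c in expr] + [Fraction(1)]))
            tags.append(j)
            expr = [Fraction(0)] * (len(tags) - 1) + [Fraction(1)]
        raw.append(expr)
    r = len(tags)
    coeffs = [e + [Fraction(0)] * (r - len(e)) for e in raw]
    return tags, coeffs


tags, coeffs = tag_columns([[1, 0, 1], [0, -1, -1]])
print(tags)                         # [0, 1]
print([int(c) for c in coeffs[2]])  # [1, 1]
```

In the example, the third site equals the sum of the first two, so only two sites need to be passed to the haplotype inference tool; the same coefficients are then applied to the reduced haplotypes to decode the third site.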

11
Linear Encoding Algorithm
12
Linear Decoding Algorithm
13
Graph-Based Decoding
  • Extend the haplotype graph Xr obtained from the HI
    algorithm to Xm for all m sites
  • Very often the graphs Xr and Xm are isomorphic,
    but not always
  • Consider an example
  • g1 = (1, 0, 1) and g2 = (0, -1, -1)
  • reduced set: (1, 0) and (0, -1)
  • The corresponding reduced haplotype graph has 3
    vertices, while Xm has 4 vertices
  • The simple remedy is to split a vertex when we
    find an error
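
The example can be checked directly. The sketch below assumes the -1/1/0 notation (0 = heterozygous) and that the reduced set keeps the first two sites, with the third site equal to their sum; `resolve` is a hypothetical helper that only handles genotypes with at most one heterozygous site, which is all this example needs.

```python
def resolve(g):
    """Resolve a -1/1/0 genotype with at most one heterozygous (0) site."""
    het = [i for i, x in enumerate(g) if x == 0]
    if len(het) > 1:
        raise ValueError("resolution is not unique with several 0 sites")
    h1, h2 = list(g), list(g)
    for i in het:
        h1[i], h2[i] = 1, -1
    return tuple(h1), tuple(h2)


full = [(1, 0, 1), (0, -1, -1)]   # g1, g2 from the slide
reduced = [g[:2] for g in full]   # keep sites 1 and 2 only

vertices_m = {h for g in full for h in resolve(g)}
vertices_r = {h for g in reduced for h in resolve(g)}
print(len(vertices_r), len(vertices_m))  # 3 4

# Site 3 equals site 1 + site 2 on both genotypes. Decoding the shared
# reduced haplotype (1, -1) the same way gives 0, which is not a valid
# -1/1 allele: this is the kind of error that calls for a vertex split.
shared = set(resolve(reduced[0])) & set(resolve(reduced[1]))
print([(h, h[0] + h[1]) for h in shared])  # [((1, -1), 0)]
```

The reduced haplotype (1, -1) stands for two different full haplotypes, (1, -1, 1) in g1 and (1, -1, -1) in g2, which is why the reduced graph has one vertex fewer.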

14
Handling Imperfect Phylogeny
  • The genotype data may show signs of
    inconsistency with the perfect phylogeny model, i.e.,
    4-gamete rule violations
  • We could choose h independent columns without
    such violations
  • The algorithm works in a greedy manner

15
Experimental Results
  • In Table 1, our results show that the runtime
    advantage of Linearly Reduced (LR) DPPH grows fast
    with test case size and reaches a factor of 60 for
    the largest instances.
  • In all test cases, if DPPH finds a unique solution,
    so does LR DPPH, and the solution is
    identical.
  • In Tables 2 and 3, we can see that the running time is
    drastically reduced compared to the original
    PHASE, while the measured quality is not worse.
  • In Tables 4 and 5, we can see the same advantage from
    using Linearly Reduced HAPLOTYPER instead of the
    original HAPLOTYPER.
  • The last two data sets are real data: Drosophila
    haplotypes and a human chromosome.

16
Experimental Results
17
Experimental Results
18
Conclusions and Future work
  • Our method significantly speeds up popular
    haplotype inference tools such as DPPH,
    HAPLOTYPER and PHASE in all cases without
    compromising the quality.
  • We reach speedups of up to 50 times over DPPH.
  • Future work includes implementing the algorithm for
    handling imperfect phylogeny.
  • We are going to investigate an application of the
    suggested linear reduction to finding a small
    number of representative sites sufficient to
    distinguish all haplotypes
