A Data Compression Problem - PowerPoint PPT Presentation

About This Presentation
Title:

A Data Compression Problem

Description:

A Data Compression Problem The Minimum Informative Subset – PowerPoint PPT presentation

Slides: 55
Provided by: Apple71
Learn more at: https://cs.brown.edu

Transcript and Presenter's Notes


1
A Data Compression Problem
  • The Minimum Informative Subset

2
Informativeness-based Tagging SNPs Algorithm

3
Outline
  • Brief background to SNP selection
  • A block-free tag SNP selection algorithm that
    maximizes informativeness
  • Halldorsson et al. 2004

4
What does it mean to tag SNPs?
  • SNP: Single Nucleotide Polymorphism
  • Caused by a mutation at a single position in
    the human genome, passed along through heredity
  • Characterizes much of the genetic difference
    between humans
  • Most SNPs are bi-allelic
  • An estimated several million common SNPs (minor
    allele frequency > 10%)
  • To tag: select a subset of SNPs to work with

5
Why do we tag SNPs?
  • Disease Association Studies
  • Goal: find genetic factors correlated with
    disease
  • Look for discrepancies in haplotype structure
  • Statistical power: determined by sample size
  • Cost: determined by overall number of SNPs typed
  • This means, to keep cost down, reduce the number
    of SNPs typed
  • Choose a subset of SNPs (tag SNPs) that can
    predict other SNPs in the region with small
    probability of error
  • Remove redundant information

6
What do we know?
  • SNPs physically close to one another tend to be
    inherited together
  • This means that long stretches of the genome
    (sans mutational events) should be perfectly
    correlated, if not for recombination
  • Recombination breaks apart haplotypes and slowly
    erodes correlation between neighboring alleles
  • Tends to blur the boundaries of LD blocks
  • Since SNPs are bi-allelic, each SNP defines a
    partition on the population sample.
  • If you are able to reconstruct this partition by
    using other SNPs, there would be no need to type
    this SNP
  • For any single SNP, this reconstruction is not
    difficult

7
Complications
  • But finding the global minimum number of tag
    SNPs necessary is NP-hard
  • The predictions made will not be perfect
  • Correlation between neighboring tag SNPs not as
    strong as correlation between neighboring (not
    necessarily tagged) SNPs
  • Haplotype information is usually not available
    for technical reasons
  • Need for Phasing

8
  • Tagging SNPs can be partitioned into the
    following three steps:
  • Determining neighborhoods of LD: which SNPs can
    infer each other
  • Tagging quality assessment: defining a quality
    measure that specifies how well a set of tag SNPs
    captures the variance observed
  • Optimization: minimizing the number of tag SNPs

9
Haplotype-based tagging SNPs (htSNPs)
  • Block-Based
  • Define blocks as a set of SNPs that are in
    strong LD with each other, but not with
    neighboring blocks
  • Requires inference on exact location of haplotype
    blocks
  • Recombination between the blocks but not within
    the blocks
  • Within each block, choose a subset of SNPs
    sufficiently rich to be able to reconstruct
    diversity of the block
  • Many algorithms exist for creating blocks; few
    select the same boundaries!

10
How do we create Haplotype Blocks?
  • Recombination-based block building algorithm
  • Infinite sites assumption: each site mutates at
    most once
  • Assume no recombination within a block
  • Implies each block should follow the four-gamete
    condition for any pair of sites (see Hudson and
    Kaplan)
  • Diversity-based test: a region is a block if at
    least 80% of the sequences occur in more than one
    chromosome.
  • Test does not scale well to large sample sizes
    (see Patil et al. (2001)).
  • To generalize this notion, one could look for
    sequences within a region accounting for 80% of
    the sampled population that each occur in at
    least 10% of the sample.
  • LD-based test
  • D′ value of every pair of SNPs within the block
    shows significant LD given the individual SNP
    frequencies, with a P-value of 0.001
  • Two SNPs are considered to have a useful level of
    correlation if they occur in the same haplotype
    block i.e. they are physically close with little
    evidence of recombination. The set of SNPs that
    can be used to predict SNP s can be found by
    taking the union of all putative haplotype blocks
    that contain SNP s.
  • It is possible that many overlapping block
    decompositions will meet the rules defined by a
    rule-based algorithm for finding haplotype blocks

11
Methods for inferring haplotype blocks
12
Hypothesis: Haplotype Blocks?
  • The genome consists largely of blocks of common
    SNPs with relatively little recombination
    shuffling in the blocks
  • Patil et al., Science, 2001; Jeffreys et al.,
    Nature Genetics; Daly et al., Nature Genetics,
    2001
  • Compare block detection methods.
  • How well can we detect haplotype blocks?
  • Are the detection methods consistent?

13
Block detection methods
  • Four gamete test, Hudson and Kaplan, Genetics,
    1985, 111, 147-164.
  • A segment of SNPs is a block if between every
    pair (aA and bB) of SNPs at most 3 of the 4
    gametes (ab, aB, Ab, AB) are observed.
  • P-Value test
  • A segment of SNPs is a block if for 95% of the
    pairs of SNPs we can reject the hypothesis (with
    P-value 0.05 or 0.001) that they are in linkage
    equilibrium.
  • LD-based, Gabriel et al., Science, 2002, 296:2225-9
  • Next slide

14
Gabriel et al. method
  • For every pair of SNPs we calculate an upper and
    lower confidence bound on D′ (call these D′u,
    D′l)
  • We then split the pairs of SNPs into 3 classes
  • Class I: two SNPs are in strong LD if D′u > 0.98
    and D′l > 0.7.
  • Class II: two SNPs show strong evidence for
    recombination if D′u < 0.9.

15
Gabriel et al. method
  • Class III: the remaining SNP pairs; these are
    uninformative.
  • A contiguous set of SNPs is a block if
  • (Class II)/(Class I + Class II) < 5%.
  • Special rules to determine if 2, 3 or 4 SNPs are
    a block.
  • Furthermore there are distance requirements on
    the chromosome to determine if the SNPs are a
    block.
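
A minimal sketch of the three-class rule, assuming the confidence bounds on the LD measure have already been computed elsewhere and are passed in as (upper, lower) pairs. The thresholds 0.98, 0.7 and 0.9 come from the slides; reading the final cutoff as 5% is an assumption, and the special rules for 2, 3 or 4 SNPs and the distance requirements are deliberately left out. All function names are hypothetical.

```python
def classify_pair(d_upper, d_lower):
    """Classify one SNP pair from its confidence bounds on the LD measure."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "strong_ld"        # Class I
    if d_upper < 0.9:
        return "recombination"    # Class II
    return "uninformative"        # Class III

def is_gabriel_block(bounds, max_recomb_fraction=0.05):
    """Apply the ratio rule (Class II) / (Class I + Class II) < 5%.

    `bounds` holds one (upper, lower) tuple per SNP pair in the candidate
    region. Regions with no informative pair are rejected (an assumption).
    """
    classes = [classify_pair(upper, lower) for upper, lower in bounds]
    strong = classes.count("strong_ld")
    recomb = classes.count("recombination")
    informative = strong + recomb
    if informative == 0:
        return False
    return recomb / informative < max_recomb_fraction

# Hypothetical bounds: 20 strong-LD pairs and one recombinant pair.
pairs = [(0.99, 0.8)] * 20 + [(0.85, 0.3)]
print(is_gabriel_block(pairs))  # 1/21 is below the 5% cutoff
```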

16
One definition of block
  • Based on the Four Gamete test.
  • Intuition: when all four gametes are observed
    between two SNPs, there is a recombination point
    somewhere in between the two sites

17
Four Gamete Block Test
  • Hudson and Kaplan 1985
  • A segment of SNPs is a block if between
    every pair of SNPs at most 3 out of the 4 gametes
    (00, 01,10,11) are observed.
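
The four-gamete block test is simple enough to write down directly. The sketch below assumes haplotypes are equal-length strings over the alleles 0 and 1; the function names and the four example haplotypes are hypothetical.

```python
from itertools import combinations

def gametes(haplotypes, i, j):
    """Set of allele pairs (gametes) observed at sites i and j."""
    return {(h[i], h[j]) for h in haplotypes}

def passes_four_gamete_test(haplotypes, i, j):
    """True if at most 3 of the 4 possible gametes occur at sites i and j."""
    return len(gametes(haplotypes, i, j)) <= 3

def is_block(haplotypes, start, end):
    """True if every pair of sites in [start, end) passes the test."""
    return all(passes_four_gamete_test(haplotypes, i, j)
               for i, j in combinations(range(start, end), 2))

# Hypothetical sample of four haplotypes over three SNPs.
haps = ["000", "010", "011", "101"]
print(is_block(haps, 0, 2))  # sites 0 and 1 show only 00, 01, 10
print(is_block(haps, 0, 3))  # sites 1 and 2 show all four gametes
```

Checking every pair costs time quadratic in the number of sites per candidate segment, which is acceptable for a sketch but not tuned for genome-scale data.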

[Figure: two example segments of 0/1 SNP columns, one labeled BLOCK and one labeled VIOLATES THE BLOCK DEFINITION]
18
Finding Recombination Hotspots: Many Possible
Partitions into Blocks
19
Find the left-most right endpoint of any
constraint and mark the site before it as a
recombination site.
Eliminate any constraints crossing that site.
Repeat until all constraints are gone.
The final result is a minimum-size set of sites
crossing all constraints.
Example haplotypes:
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
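
The greedy procedure on this slide can be sketched as follows, under some assumptions about representation: each constraint is a pair (left, right) of site indices between which a recombination must have occurred (for example, a pair of sites showing all four gametes), and a breakpoint b stands for the gap just before site b, so it crosses a constraint when left < b <= right. The function name and the example constraints are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedy selection of breakpoints that cross every constraint.

    constraints: iterable of (left, right) site indices with left < right.
    Returns breakpoints b, each meaning "between site b - 1 and site b".
    """
    remaining = list(constraints)
    breakpoints = []
    while remaining:
        # Left-most right endpoint among the remaining constraints.
        b = min(right for _, right in remaining)
        breakpoints.append(b)
        # Drop every constraint that this breakpoint crosses.
        remaining = [(l, r) for (l, r) in remaining if not (l < b <= r)]
    return breakpoints

# Hypothetical constraints over sites 0..6.
example = [(0, 3), (1, 2), (2, 5), (4, 6)]
print(min_recombination_sites(example))  # two breakpoints suffice here
```

The constraint that supplies the left-most right endpoint is always crossed by the chosen breakpoint, so each pass removes at least one constraint and the loop terminates.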
20
Tagging SNPs
Only 4 SNPs are needed to tag all the different
haplotypes
Tag SNPs          Full haplotype
A------A---TG--   ACGATCGATCATGAT
G------G---CG--   GGTGATTGCATCGAT
A------G---TC--   ACGATCGGGCTTCCG
A------G---CC--   ACGATCGGCATCCCG
G------A---TG--   GGTGATTATCATGAT
An example of a real data set and its haplotype
block structure. Colors refer to the founding
population, one color for each founding
haplotype
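
As a quick check of the tagging example, the sketch below reads the four tag positions off the dashed patterns (0-based columns 0, 7, 11 and 12) and confirms that the tag alleles alone identify each of the five haplotypes. The variable and function names are illustrative only.

```python
# The five haplotypes from the slide.
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
# 0-based columns of the four tag SNPs, read off the dashed patterns.
tag_sites = [0, 7, 11, 12]

def tag_pattern(haplotype, sites):
    """Alleles of one haplotype at the tag sites, as a short string."""
    return "".join(haplotype[i] for i in sites)

patterns = [tag_pattern(h, tag_sites) for h in haplotypes]
# Map each tag pattern back to the full haplotype it identifies.
lookup = dict(zip(patterns, haplotypes))

print(patterns)
print(len(set(patterns)) == len(haplotypes))  # every pattern is unique
```

Because the patterns are unique in this sample, typing only the four tag sites is enough to recover the remaining eleven columns by lookup.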
21
Optimal Haplotype Block-Free Selection of Tagging
SNPs for Genome-Wide Association Studies
  • Halldorsson, Bafna, Lippert, Schwartz, Clark,
    Istrail (2004)

22
  • Tagging SNPs can be partitioned into the
    following three steps:
  • Determining neighborhoods of LD: which SNPs can
    infer each other
  • Tagging quality assessment: defining a quality
    measure that specifies how well a set of tag SNPs
    captures the variance observed
  • Optimization: minimizing the number of tag SNPs

23
Finding Neighborhoods
  • Goal is to select SNPs in the sample that
    characterize regions of common recent ancestry
    that will contain conserved haplotypes
  • Recent common ancestry means that there has been
    little time for recombination to break apart
    haplotypes
  • Constructing fixed size neighborhoods in which to
    look for SNPs is not desirable because of the
    variability of recombination rates and historical
    LD across the genome
  • In fact, the size of informative neighborhoods is
    highly variable precisely because of variable
    recombination rates and SNP density
  • Authors avoid block-building by recursively
    creating neighborhood with help of
    informativeness measure

24
Defining Informativeness
  • A measure of tagging quality assessment
  • Assume all SNPs are bi-allelic
  • Notation
  • I(s,t): informativeness of a SNP s with respect
    to a SNP t
  • i, j are two haplotypes drawn at random from the
    uniform distribution on the set of distinct
    haplotype pairs.
  • Note: I(s,t) = 1 implies complete predictability;
    I(s,t) = 0 when t is monomorphic in the population.
  • I(s,t) easily estimated through the use of
    the bipartite clique that defines each SNP
  • We can write I(s,t) in terms of an edge set
  • Definition of I easily extended to a set of SNPs
    S by taking the union of edge sets
  • Assumes the availability of haplotype phases
  • New measure avoids some of the difficulties
    traditional LD measures have experienced when
    applied to tagging SNP selection
  • The concept of pairwise LD fails to reliably
    capture the higher-order dependencies implied by
    haplotype structure

25
Bounded-Width Algorithm: k Most Informative SNPs
(k-MIS)
  • Input: a set of n SNPs S
  • Output: a subset of SNPs S′ such that I(S′,S) is
    maximal
  • In its most general form, k-MIS is NP-hard by
    reduction of the set cover problem to MIS
  • Algorithm optimizes informativeness, although
    easily adapted for other measures
  • Define distance between two SNPs as the number of
    SNPs in between them
  • k-MIS can be solved as long as distance between
    adjacent tag SNPs not too large

26
  • Define
  • Assignment Asi
  • S(As)
  • Recursion function Iw(s, l, S(A)): score of the
    most informative subset of l SNPs chosen from
    SNPs 1 through s such that As describes the
    assignment for SNP s.
  • Pseudocode
  • Complexity: O(n k 2^w) in time and O(k 2^w) in
    space, assuming maximal window w

27
Evaluation
  • Algorithm evaluated by Leave-One-Out
    Cross-Validation
  • accumulated accuracy over all haplotypes gives a
    global measure of the accuracy for the given data
    set.
  • SNPs not typed were predicted by a majority vote
    among all haplotypes in the training set that
    were identical to the one being inferred
  • If no such haplotypes existed, the majority vote
    is taken among all training haplotypes that have
    the same allele call on all but one of the typed
    SNPs
  • etc.
  • When compared to block-based method of Zhang
  • Presumably, the advantage is due to the cost
    imposed by artificially restricting the range of
    influence of the few SNPs chosen by block
    boundaries
  • Informativeness was shown to be a good
    measure
  • aligned well with the leave-one-out cross
    validation results
  • extremely close to the results of optimizing for
    haplotype r^2
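
The leave-one-out evaluation with majority-vote imputation can be sketched as follows. The code assumes phased haplotypes as equal-length strings and a list of typed (tag) SNP indices; it looks for training haplotypes that agree with the held-out haplotype on all typed SNPs, then on all but one, and so on. Ties in the vote are broken by first occurrence, which is an arbitrary choice, and all names are hypothetical.

```python
from collections import Counter

def predict_allele(training, target, typed, site):
    """Majority vote at `site` among the best-matching training haplotypes."""
    for mismatches in range(len(typed) + 1):
        votes = Counter(
            h[site] for h in training
            if sum(h[i] != target[i] for i in typed) == mismatches
        )
        if votes:
            return votes.most_common(1)[0][0]
    return None  # only reached when the training set is empty

def leave_one_out_accuracy(haplotypes, typed):
    """Fraction of untyped alleles predicted correctly over all haplotypes."""
    untyped = [i for i in range(len(haplotypes[0])) if i not in typed]
    correct = total = 0
    for idx, target in enumerate(haplotypes):
        training = haplotypes[:idx] + haplotypes[idx + 1:]
        for site in untyped:
            total += 1
            correct += predict_allele(training, target, typed, site) == target[site]
    return correct / total if total else 1.0

# Hypothetical sample in which SNP 0 predicts the other three SNPs.
sample = ["0011", "0011", "1100", "1100", "0011"]
print(leave_one_out_accuracy(sample, typed=[0]))
```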

28
A Data Compression Problem
  • Select SNPs to use in an association study
  • Would like to associate single nucleotide
    polymorphisms (SNPs) with disease.
  • Very large number of candidate SNPs
  • Chromosome wide studies, whole genome-scans
  • For cost effectiveness, select only a subset.
  • Closely spaced SNPs are highly correlated
  • It is less likely that there has been a
    recombination between two SNPs if they are close
    to each other.

29
Association studies
30
Association studies
  • Evaluate whether nucleotide polymorphisms
    associate with phenotype

31
Association studies
32
SNP-Selection Axiom: Hypothesis-free associations
  • Due to the many unknowns regarding the nature of
    common or complex disease, we should aim at SNP
    selection that confers maximal resolution power,
    i.e., genome-wide SNP scans with the hope of
    performing hypothesis-free disease association
    studies, as opposed to hypothesis-driven
    candidate gene or region studies.

33
A New Measure
  • Informativeness

34
SNP-Selection Axiom: Multi-allelic measure
  • The tagging quality of the selected SNPs should
    be described by a multi-allelic measure: sets of
    SNPs have combined information about predicting
    other SNPs

35
SNP-Selection Axioms: LD consistency and
Block-freeness
  • The highly concordant results of the block
    detection methods make the interior of LD blocks
    adequate for sparse SNP selection. However, block
    boundaries defined by these methods are not
    sharp, with no single true block partition. SNP
    selection should avoid dependence on particular
    definitions of haplotype block.

36
A New SNP Selection Measure: Informativeness
  • It satisfies the
    following six Axioms
  • Multi-allelic measure
  • LD consistency: compares well with measures of
    LD
  • Block-freeness: independence from any particular
    block definition
  • Hypothesis-free associations: optimization
    achieves maximum haplotype resolution
  • Algorithmically sound: practical for genome-wide
    computations
  • Statistically sound: passes overfitting and
    imputation tests

37


Informativeness
[Figure: SNP s and haplotypes h1, h2]
38
Informativeness
[Figure: SNPs s1, s2, s3, s4, s5]
I(s1,s2) = 2/4 = 1/2
39
Informativeness
[Figure: SNPs s1, s2, s3, s4, s5]
I({s1,s2}, s4) = 3/4
40
Informativeness
[Figure: SNPs s1, s2, s3, s4, s5]
I({s3,s4}, {s1,s2,s5}) = 3

S = {s3,s4} is a Minimal Informative
Subset
41
Informativeness
Graph theory insight
Minimum Set Cover = Minimum Informative Subset
[Figure: bipartite graph between SNPs s1-s5 and edges e1-e6]
42
Informativeness
Graph theory insight
Minimum Set Cover {s3, s4} = Minimum
Informative Subset
[Figure: bipartite graph between SNPs s1-s5 and edges e1-e6]
43
Connecting Informativeness with Measures of LD
44
The Minimum Informative SNPs in a Block of
Complete LD
45
(k,w)-MIS Problem
46
(k,w)-MIS: O(n k 2^w) solution
[Figure: window of 0/1 assignments over the SNPs, with labels Opt, As0, As1 and As]
47
Validation: Tests on Publicly-Accessible Data
  • We performed tests using two publicly available
    datasets
  • LPL dataset of Nickerson et al. (2000)
  • 142 chromosomes typed at 88 SNPs
  • Chromosome 21 dataset of Patil et al. (2001)
  • 20 chromosomes typed at 24,047 SNPs
  • We also performed tests on an AB dataset
  • Most of Chromosome 22
  • 45 chromosomes typed at 4102 SNPs

48
A region of Chr. 22: 45 Caucasian samples
[Figure: two different runs of the Gabriel et al. block
detection method with the Zhang et al. SNP selection
algorithm, shown next to our block-free algorithm]
49
Block-free tagging: Minimum informative SNPs
[Figure: informativeness vs. number of SNPs, block-free method and block method]
  • Perlegen Data Set Chromosome 21
  • 20 individuals, 24047 SNPs

50
Block-free tagging: Minimum informative SNPs

Lipoprotein Lipase Gene, 71 individuals, 88 SNPs
51
Correct imputation: block vs. block free
[Figure: correct imputations vs. SNPs typed, Block Free and Zhang et al., Perlegen dataset]
52
Correlations of informativeness with imputation
in leave-one-out studies
[Figure: leave-one-out accuracy and informativeness vs. number of SNPs, block-free method, Perlegen dataset]
53
  • Conclusions

54
Conclusions
  • Existing LD based measures are not adequate for
    SNP subset selection, and do not extend easily to
    multiple SNPs
  • The Informativeness measure for SNPs is
    Block-free, and extends easily to multiple SNPs.
  • Practically feasible algorithms for genome-wide
    studies to compute minimum informative SNP
    subsets
  • We are able to show that by typing only 20-30% of
    the SNPs, we are able to retain 90% of the
    informativeness.