Title: Tagging SNPs

1
Tagging SNPs

Presentation by Eric Ruggieri
December 20, 2007

2
Outline

Brief background to SNP selection
A block-free tag SNP selection algorithm that
maximizes prediction accuracy
Halperin et al 2005
A block-free tag SNP selection algorithm that
maximizes informativeness
Halldorsson et al 2004

3
What does it mean to tag SNPs?

SNP: Single Nucleotide Polymorphism
Caused by a mutation at a single position in the
human genome, passed along through heredity
Characterizes much of the genetic difference
between humans
Most SNPs are bi-allelic
An estimated several million common SNPs (minor
allele frequency > 5%)
To tag: select a subset of SNPs to work with

4
Why do we tag SNPs?

Disease Association Studies
Goal: find genetic factors correlated with
disease
Look for discrepancies in haplotype structure
Statistical power: determined by sample size
Cost: determined by overall number of SNPs typed
This means, to keep cost down, reduce the number
of SNPs typed
Choose a subset of SNPs (tag SNPs) that can
predict other SNPs in the region with small
probability of error
Remove redundant information

5
What do we know?

SNPs physically close to one another tend to be
inherited together
This means that long stretches of the genome
(sans mutational events) should be perfectly
correlated if not for recombination
Recombination breaks apart haplotypes and slowly
erodes correlation between neighboring alleles
It tends to blur the boundaries of LD blocks
Since SNPs are bi-allelic, each SNP defines a
partition on the population sample.
If you are able to reconstruct this partition by
using other SNPs, there would be no need to type
this SNP
For any single SNP, this reconstruction is not
difficult

6
Complications

But finding the globally minimum set of tag SNPs
is NP-hard
The predictions made will not be perfect
Correlation between neighboring tag SNPs is not as
strong as correlation between neighboring (not
necessarily tagged) SNPs
Haplotype information is usually not available
for technical reasons
Hence the need for phasing

Tagging SNPs can be partitioned into the
following three steps:
Determining neighborhoods of LD: which SNPs can
infer each other
Tagging quality assessment: defining a quality
measure that specifies how well a set of tag SNPs
captures the variance observed
Optimization: minimizing the number of tag SNPs

8
Two classes of tag SNP algorithms, based on
how they determine neighborhoods of LD

Block-Based
Define blocks that are in strong LD with each
other, but not with neighboring blocks
Requires inference on exact location of haplotype
blocks
Recombination between the blocks but not within
the blocks
Within each block, choose a subset of SNPs
sufficiently rich to be able to reconstruct
diversity of the block
Many algorithms exist for creating blocks; few
select the same boundaries!
Most prominent algorithm due to Zhang et al
(several papers)

9
How do we create Haplotype Blocks?

Recombination-based block building algorithm
Infinite sites assumption: each site mutates at
most once
Assume no recombination within a block
Implies each block should follow the four-gamete
condition for any pair of sites (see Hudson and
Kaplan)
Diversity-based test: a region is a block if at
least 80% of the sequences occur in more than one
chromosome.
Test does not scale well to large sample sizes
(see Patil et al (2001)).
To generalize this notion, one could look for
sequences within a region accounting for 80% of
the sampled population that each occur in at
least 10% of the sample.
LD-based test: the D value of every pair of SNPs
within the block shows significant LD given the
individual SNP frequencies, with a P-value of 0.001
Two SNPs are considered to have a useful level of
correlation if they occur in the same haplotype
block i.e. they are physically close with little
evidence of recombination. The set of SNPs that
can be used to predict SNP s can be found by
taking the union of all putative haplotype blocks
that contain SNP s.
It is possible that many overlapping block
decompositions will meet the rules defined by a
rule-based algorithm for finding haplotype blocks
Metric LD maps, as described by Maniatis et al.
(2002)
Only those SNPs that are within a distance of < 1
LD unit are considered to be significantly
correlated to each other.
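The four-gamete condition from the recombination-based test can be checked directly on phased haplotypes. A minimal sketch, assuming 0/1 allele coding; the function names and the toy data are illustrative only.

```python
from itertools import combinations


def four_gamete_ok(haplotypes, a, b):
    """Sites a and b pass if fewer than four distinct allele pairs are observed."""
    gametes = {(h[a], h[b]) for h in haplotypes}
    return len(gametes) < 4


def is_block(haplotypes, start, end):
    """True if every pair of sites in [start, end) passes the four-gamete test."""
    return all(
        four_gamete_ok(haplotypes, a, b)
        for a, b in combinations(range(start, end), 2)
    )


# Hypothetical sample consistent with a single tree (no recombination).
haps = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]
print(is_block(haps, 0, 3))

# Adding a recombinant haplotype creates all four gametes at sites 0 and 1.
print(is_block(haps + [(1, 0, 1)], 0, 3))
```

Under the infinite sites assumption, seeing all four gametes at a pair of sites is evidence of at least one recombination between them, which is why the test is used to delimit blocks.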

Entropy-based or block-free
Avoids construction of blocks
Entropy is a measure of randomness
Seek to capture the most information across a
region without rigid boundaries of a block
Both papers presented today use this method
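To make "entropy is a measure of randomness" concrete, the sketch below computes the Shannon entropy of the haplotype distribution seen at a chosen set of SNPs. This is a generic illustration, not the quality measure used by either paper; the function name and data are assumptions.

```python
from collections import Counter
from math import log2


def haplotype_entropy(haplotypes, snps):
    """Shannon entropy (in bits) of the haplotypes restricted to `snps`."""
    counts = Counter(tuple(h[s] for s in snps) for h in haplotypes)
    n = len(haplotypes)
    return -sum(c / n * log2(c / n) for c in counts.values())


# Hypothetical sample in which SNP 1 is a copy of SNP 0.
haps = [(0, 0), (0, 0), (1, 1), (1, 1)]
print(haplotype_entropy(haps, [0, 1]))  # entropy of the full region
print(haplotype_entropy(haps, [0]))     # entropy captured by SNP 0 alone
```

When a subset of SNPs has the same entropy as the full region, typing the remaining SNPs adds no information in this sample.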

11
Tag SNP Selection in Genotype Data for Maximizing
SNP Prediction Accuracy (Halperin et al 2005)

12
Problem Formulation

Notation: side board
Definition of Prediction Algorithm, f, and
restriction function, Z
Goal is to find a minimum size set of tag SNPs
and a prediction algorithm such that the
prediction error is minimized
Statistical note about 0-1 loss functions and
Maximum Likelihood Estimates
But, frequencies of genotypes in population
unknown, so taking expected value difficult
Instead, use training dataset to estimate the
distribution of the genotypes (Bootstrap Method,
non-parametric)
Minimize probability expression for a randomly
chosen genotype in training set
Alternatively, we can seek to minimize the actual
number of prediction errors: the un-normalized form
of the probability expression

13
The Prediction Algorithm

Of critical importance in the search for tag SNPs
is the definition of an adequate measure of the
prediction quality
Different measures will lead to different
optimal tag SNPs
Many current tag SNP selection tools need to
first partition the region of interest into LD
blocks before making predictions
The current prediction algorithm is based upon
the following assumption:
Correlation between SNPs tends to decay as
physical distance between them increases

This translates to:
given the genotype values of two SNPs, the
probabilities of the values at any intermediate
SNP do not change by knowing the values of
additional distal ones
Prediction function makes its prediction based
only upon the two nearest SNPs
Assumption does not hold for all data sets or for
all SNPs, but is a good approximation

15
The Prediction Algorithm, cont.

Predict: predicts the value of SNP i given the
values of the tag SNPs
Aims to maximize the expected accuracy of
predicting untyped SNPs, given the unphased
(genotype) information of the tag SNPs
Uses a majority vote in order to make a
prediction (Maximum Likelihood prediction)
In order to use the phased information available
from the training set, two majority votes are
actually calculated, although they coincide if
the genotype takes the value 0 or 1
Two votes are necessary only if the genotype is
heterozygous at a tag SNP
All of the tag SNPs except for the closest two
are ignored
If there is not a tag SNP on one side of SNP i,
the two closest tag SNPs on the other side are
selected, whether they be the first two tag SNPs
or the last two tag SNPs.
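A simplified sketch of the prediction step follows. It votes over unphased training genotypes that agree with the query at the two nearest tag SNPs. It deliberately omits the two-vote handling of heterozygous tag SNPs that uses phased training data, and the fallback when no training genotype matches is an added assumption. The function names are illustrative.

```python
from collections import Counter


def nearest_two_tags(i, tags):
    """Return the two tag SNPs used for SNP i; `tags` is sorted, len >= 2."""
    left = [t for t in tags if t < i]
    right = [t for t in tags if t > i]
    if left and right:
        return left[-1], right[0]      # one flanking tag on each side
    if not left:
        return right[0], right[1]      # no tag to the left: first two tags
    return left[-2], left[-1]          # no tag to the right: last two tags


def predict(i, genotype, tags, training):
    """Majority vote for SNP i among training genotypes matching at the two tags."""
    a, b = nearest_two_tags(i, tags)
    votes = Counter(
        g[i] for g in training if g[a] == genotype[a] and g[b] == genotype[b]
    )
    if not votes:
        # Assumed fallback: vote over the whole training set.
        votes = Counter(g[i] for g in training)
    return votes.most_common(1)[0][0]


# Hypothetical training genotypes over three SNPs; SNPs 0 and 2 are tags.
training = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 0, 1), (1, 1, 1)]
print(predict(1, (1, None, 1), [0, 2], training))
print(predict(1, (0, None, 0), [0, 2], training))
```

Because only two tag SNPs enter each vote, the error count for a SNP depends only on its nearest tags, which is what makes the dynamic program on the next slides possible.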

16
An Exact Algorithm for Tag SNP Selection

STAMPA (Selection of tag SNPs for Maximizing
Prediction Accuracy)
Dynamic Programming
Recall, we are trying to minimize X_T
Define indicator function
Three auxiliary score functions: score(m1,m2),
score1(m1,m2), score2(m1,m2)
Score: gives the total number of prediction
errors in SNPs m1, ..., m2-1, given that m1 and m2 are
tag SNPs and that there are no tag SNPs in
between
Score1 and score2 work similarly
Since Predict uses only nearest two tag SNPs to
make prediction, all variables are local and sums
can be readily computed

17
Building the Recursion

For l < t, define f(m,l) to be the minimum number
of prediction errors in SNPs 1, 2, ..., m given that
the l-th tag SNP is in position m
For l = t, f(m,t) represents the minimum number of
prediction errors in all SNPs given that the
final tag SNP is in position m
Recurrence relation
The minimum value of X_T over all possible tag SNP
sets of size t is simply the minimum of f(m,t)
over all possible values of m
Use back pointers to get the entire set of tag SNPs
Complexity: time O(m^3 n)
However, by placing a cap c on the distance between
adjacent tag SNPs: O(mc(cn + t))
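The recurrence itself is not in the transcript, so the following is a reconstruction under stated assumptions rather than the authors' exact formulation: score1(m1,m2) is taken to count errors before the first tag, score2(m1,m2) errors after the last tag, and score(m1,m2) errors between adjacent tags. Positions are 0-based, and the three score functions are passed in as hypothetical callables.

```python
def select_tags(m, t, score, score1, score2):
    """Choose t >= 2 tag positions among 0..m-1 minimizing total errors.

    Returns (min_errors, sorted list of tag positions).
    """
    INF = float("inf")
    # f[l][p]: fewest errors up to position p when the l-th tag (1-based) is at p.
    f = [[INF] * m for _ in range(t + 1)]
    back = [[None] * m for _ in range(t + 1)]
    for l in range(2, t + 1):
        for p in range(l - 1, m):
            for q in range(l - 2, p):          # q: position of the previous tag
                if l == 2:
                    cost = score1(q, p) + score(q, p)
                else:
                    cost = f[l - 1][q] + score(q, p)
                if l == t:
                    cost += score2(q, p)       # errors after the final tag
                if cost < f[l][p]:
                    f[l][p], back[l][p] = cost, q
    last = min(range(t - 1, m), key=lambda p: f[t][p])
    # Follow back pointers from the final tag to the first one.
    tags, p = [last], last
    for l in range(t, 1, -1):
        p = back[l][p]
        tags.append(p)
    return f[t][last], tags[::-1]


# Toy score functions (assumptions): errors grow with the untagged gap.
m = 10
gap = lambda a, b: (b - a - 1) ** 2
head = lambda a, b: a
tail = lambda a, b: m - 1 - b
errors, tags = select_tags(m, 3, gap, head, tail)
print(errors, tags)
```

With real data the callables would be filled from a precomputed table of prediction errors, and restricting q to positions within a fixed distance of p gives the capped variant mentioned on the slide.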

18
An Alternate Method: Random Sampling

Gives up predictive power for speed and
efficiency
Randomly generate 100 sets of tag SNPs by using
the uniform distribution on the set of all
available SNPs
Select any t of the m SNPs available
Compute X_{T_i} for all SNP sets, then choose the
SNP set that minimizes X_{T_i}
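The random sampling alternative is short enough to sketch in full. The error-counting function is passed in as a hypothetical callable `count_errors`; the fixed seed is an assumption added for reproducibility.

```python
import random


def random_tag_search(m, t, count_errors, trials=100, seed=0):
    """Draw `trials` uniform random tag sets of size t and keep the best one."""
    rng = random.Random(seed)
    best_tags, best_errors = None, float("inf")
    for _ in range(trials):
        tags = sorted(rng.sample(range(m), t))   # any t of the m SNPs
        errors = count_errors(tags)
        if errors < best_errors:
            best_tags, best_errors = tags, errors
    return best_tags, best_errors


# Toy objective (assumption): stands in for the number of prediction errors.
toy_errors = lambda tags: sum(tags)
best_tags, best_errors = random_tag_search(20, 3, toy_errors)
print(best_tags, best_errors)
```

The search evaluates only 100 candidate sets, so it is fast but carries no optimality guarantee.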

19
Advantages to the Method

Uses genotype information and so does not require
phasing
In practice, only genotype data are available
Does not rely on a specific block partition
Side note: the algorithm has the feel of a
k-nearest neighbor classifier

20
Optimal Haplotype Block-Free Selection of Tagging
SNPs for Genome-Wide Association Studies

Halldorsson et al (2004)
including Prof. Istrail

Tagging SNPs can be partitioned into the
following three steps:
Determining neighborhoods of LD: which SNPs can
infer each other
Tagging quality assessment: defining a quality
measure that specifies how well a set of tag SNPs
captures the variance observed
Optimization: minimizing the number of tag SNPs

22
Finding Neighborhoods

Goal is to select SNPs in the sample that
characterize regions of common recent ancestry
that will contain conserved haplotypes
Recent common ancestry means that there has been
little time for recombination to break apart
haplotypes
Constructing fixed size neighborhoods in which to
look for SNPs is not desirable because of the
variability of recombination rates and historical
LD across the genome
In fact, the size of informative neighborhoods is
highly variable precisely because of variable
recombination rates and SNP density
Authors avoid block-building by recursively
creating neighborhood with help of
informativeness measure

23
Defining Informativeness

A measure of tagging quality assessment
Assume all SNPs are bi-allelic
Notation
I(s,t): informativeness of a SNP s with respect
to a SNP t
i, j are two haplotypes drawn at random from the
uniform distribution on the set of distinct
haplotype pairs.
Note: I(s,t) = 1 implies complete predictability;
I(s,t) = 0 when t is monomorphic in the population.
I(s,t) is easily estimated through the use of the
bipartite clique that defines each SNP
We can write I(s,t) in terms of an edge set
Definition of I easily extended to a set of SNPs
S by taking the union of edge sets
Assumes the availability of haplotype phases
New measure avoids some of the difficulties
traditional LD measures have experienced when
applied to tagging SNP selection
The concept of pairwise LD fails to reliably
capture the higher-order dependencies implied by
haplotype structure
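The formula for I(s,t) is not in the transcript. The sketch below assumes the reading suggested by the notes above: each SNP defines an edge set of haplotype pairs that carry different alleles, and informativeness is the fraction of the target SNP's edges covered by the edges of the tag SNPs. Function names and data are illustrative.

```python
from itertools import combinations


def edge_set(haplotypes, s):
    """Pairs of haplotypes (i, j) that differ at SNP s (its bipartite clique)."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i][s] != haplotypes[j][s]
    }


def informativeness(haplotypes, tag_snps, t):
    """Fraction of pairs separated by t that some SNP in tag_snps also separates."""
    target = edge_set(haplotypes, t)
    if not target:
        return 0.0  # t is monomorphic in the sample
    covered = set().union(*(edge_set(haplotypes, s) for s in tag_snps))
    return len(covered & target) / len(target)


# Hypothetical sample: SNP 1 copies SNP 0, SNP 2 is independent of both.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(informativeness(haps, [0], 1))
print(informativeness(haps, [0], 2))
print(informativeness(haps, [0, 2], 1))
```

Taking the union of edge sets is what extends the measure from a single SNP to a set of SNPs.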

24
Bounded-Width Algorithm: k Most Informative SNPs
(k-MIS)

Input: a set S of n SNPs
Output: a subset S' of k SNPs such that I(S',S) is
maximal
In its most general form, k-MIS is NP-hard by
reduction of the set cover problem to MIS
Algorithm optimizes informativeness, although
easily adapted for other measures
Define distance between two SNPs as the number of
SNPs in between them
k-MIS can be solved efficiently as long as the
distance between adjacent tag SNPs is not too large

Define
Assignment A_s[i]
S(A_s)
Recursion function I_w(s, l, S(A)): score of the
most informative subset of l SNPs chosen from
SNPs 1 through s such that A_s describes the
assignment for SNP s.
Pseudocode
Complexity: O(n k 2^w) in time and O(k 2^w) in space,
assuming maximal window w

26
Evaluation

Algorithm evaluated by Leave-One-Out
Cross-Validation
accumulated accuracy over all haplotypes gives a
global measure of the accuracy for the given data
set.
SNPs not typed were predicted by a majority vote
among all haplotypes in the training set that
were identical to the one being inferred
If no such haplotypes existed, the majority vote
is taken among all training haplotypes that have
the same allele call on all but one of the typed
SNPs
etc.
When compared to block-based method of Zhang
Presumably, the advantage is due to the cost
imposed by artificially restricting the range of
influence of the few SNPs chosen by block
boundaries
Informativeness was shown to be a good
measure
aligned well with the leave-one-out
cross-validation results
extremely close to the results of optimizing for
haplotype r^2
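The evaluation procedure can be sketched as below. It assumes phased 0/1 haplotypes and reads the "etc." as relaxing the allowed number of mismatches on typed SNPs one at a time until some training haplotype qualifies. Function names and data are illustrative.

```python
from collections import Counter


def predict_haplotype(target, training, tags):
    """Fill in untyped SNPs by majority vote among the closest training haplotypes."""
    for d in range(len(tags) + 1):
        # Training haplotypes differing from the target on at most d typed SNPs.
        matches = [
            h for h in training
            if sum(h[s] != target[s] for s in tags) <= d
        ]
        if matches:
            break
    predicted = list(target)
    for s in range(len(target)):
        if s not in tags:
            predicted[s] = Counter(h[s] for h in matches).most_common(1)[0][0]
    return tuple(predicted)


def loocv_accuracy(haplotypes, tags):
    """Leave-one-out accuracy over all untyped SNPs; needs >= 2 haplotypes."""
    correct = total = 0
    for k, target in enumerate(haplotypes):
        training = haplotypes[:k] + haplotypes[k + 1:]
        guess = predict_haplotype(target, training, tags)
        for s in range(len(target)):
            if s not in tags:
                correct += guess[s] == target[s]
                total += 1
    return correct / total


# Hypothetical sample with two haplotype classes, typing only SNP 0.
haps = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]
print(loocv_accuracy(haps, [0]))
```

Summing correct calls over every held-out haplotype gives the single global accuracy figure described above.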