Combinatorial Search (CS) for Disease Association

Provided by: tae55

Transcript and Presenter's Notes

[Figure: SNP and Disease. The MSC x x 1 x x 2 x x x (x marks an unconstrained SNP) matches the sick genotype 0 1 1 0 1 2 1 0 2.]

Our contributions
A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For a dataset for an autoimmune disorder (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination significantly associated with the severity of the disease was found within a group of genes showing a high degree of linkage disequilibrium.
Model-fitting disease susceptibility prediction methods based on the developed search methods were proposed.

SNP - single nucleotide polymorphism: a site where two or more different nucleotides occur in a large percentage of the population
0 - wild type/major (frequency) allele
1 - mutation/minor (frequency) allele
2 - heterozygous allele
Searching for genetic risk factors for diseases
Monogenic diseases: a mutated gene is entirely responsible for the disease
Complex diseases: affected by the interaction of multiple genes
The significance of a risk factor is usually measured by Risk Rate or Odds Ratio
We measure significance by the p-value of the set of genotypes defined by the risk factor

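[Figure: example genotypes with disease status. An MSC selects 4 sick and 1 healthy genotype, and the significance of that split is then checked.]
0 1 1 1 0 2 0 0 1 sick
0 0 1 0 0 0 0 2 1 sick
0 1 1 1 1 2 0 0 1 sick
0 0 1 0 1 2 1 0 2 sick
0 1 0 0 1 1 0 0 2 healthy
0 1 1 0 1 2 0 0 2 healthy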
Statistical significance

A multi-SNP combination (MSC) defines a set of case and control individuals
An MSC is considered statistically significant if the distribution of cases and controls in that set has p-value < 0.05
Many reported findings are not reproducible on different populations. It is believed that this happens because the p-values are not adjusted for multiple testing
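
The selection and significance check can be sketched in Python. This is only an illustration, not the presenters' implementation: the dictionary encoding of an MSC, the function names, and the choice of a one-sided binomial tail with the overall case fraction as success probability are all assumptions. The data are the seven genotype rows shown on the slide, and the MSC is the slide's pattern x x 1 x x 2 x x x.

```python
from math import comb

# Seven genotypes from the slide example (nine SNPs each); True = sick.
GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
STATUS = [True, True, True, True, True, False, False]

# Hypothetical encoding: an MSC maps a 0-based SNP index to its required value.
MSC = {2: 1, 5: 2}  # the pattern x x 1 x x 2 x x x


def matches(genotype, msc):
    """True if the genotype has the required value at every SNP of the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def count_cases_controls(genotypes, status, msc):
    """Count the sick and healthy genotypes selected by the MSC."""
    cases = controls = 0
    for genotype, sick in zip(genotypes, status):
        if matches(genotype, msc):
            if sick:
                cases += 1
            else:
                controls += 1
    return cases, controls


def unadjusted_p_value(genotypes, status, msc):
    """Assumed test: probability of seeing at least this many cases among the
    selected genotypes if each were sick with the overall case frequency."""
    cases, controls = count_cases_controls(genotypes, status, msc)
    n = cases + controls
    q = sum(status) / len(status)
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(cases, n + 1))


print(count_cases_controls(GENOTYPES, STATUS, MSC))
print(unadjusted_p_value(GENOTYPES, STATUS, MSC))
```

On this toy table the MSC selects 4 sick and 1 healthy genotype, as on the slide, and under the assumed test the resulting p-value is far above 0.05, which is expected for so few genotypes.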

Disease-Associated Multi-SNP Combinations Search
Disease association analysis

Given a population of n genotypes (or haplotypes), each containing values of m SNPs from {0, 1, 2}, and disease status (case or control)
Find all multi-SNP combinations whose multiple-testing-adjusted p-value of the frequency distribution is below 0.05

Analysis of variation in suspected genes in case and control individuals is aimed at identifying SNPs with considerably higher frequencies among the case individuals than among the control individuals
Most searches are done on a SNP-by-SNP basis
Recently, two-SNP analysis has shown promising results (Marchini et al., 2005)
Multi-SNP analyses are expected to find even stronger disease associations
Common diseases can be caused by combinations of variations in several unlinked genes (SNPs)
We address the computational challenge of searching for such multi-gene causal combinations
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs)
How can associated multi-SNP combinations be found without checking all of them?
Disease association analysis searches for SNPs or multi-SNP combinations with frequency among cases considerably higher than among controls
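
To give a feel for how quickly an exhaustive check grows, the sketch below enumerates every MSC that fixes exactly k SNPs. The encoding (a dictionary from 0-based SNP index to a value in 0, 1, 2) and the function names are assumptions made for illustration; the slides do not specify a representation.

```python
from itertools import combinations, product
from math import comb


def enumerate_mscs(num_snps, size):
    """Yield every MSC that constrains exactly `size` of `num_snps` SNPs."""
    for snps in combinations(range(num_snps), size):
        for values in product((0, 1, 2), repeat=size):
            yield dict(zip(snps, values))


def count_mscs(num_snps, size):
    """Closed-form count: choose the SNPs, then one of three values for each."""
    return comb(num_snps, size) * 3**size


# Candidates an exhaustive search would have to test for 100 SNPs.
for size in (1, 2, 3, 4):
    print(size, count_mscs(100, size))
```

Even at three SNPs the count for 100 SNPs is already in the millions, which is why a search that avoids checking every combination is needed.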

If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni)
Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6)
We adjust the resulting p-values via randomization
Unadjusted p-value: the probability of the case/control distribution in the set defined by an MSC, computed from the binomial distribution
Multiple-testing-adjusted p-value: randomization
Randomly permute the disease status of the population to generate 10000 instances
Apply the searching methods to each instance to get MSCs
Compute the fraction of instances whose MSCs are at least as significant as the observed one, that is, whose unadjusted p-value is no higher than the observed p-value
In our search we report only MSCs with adjusted p-value < 0.05
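
The randomization procedure can be sketched as follows. This is a hedged illustration rather than the presenters' code: the search step is replaced by a deliberately simple stand-in that scans single-SNP MSCs only, the binomial test is the same assumed one-sided tail as in any unadjusted check, and all function names are hypothetical.

```python
import random
from math import comb

# Toy data: the seven genotypes from the slide example; True = sick.
GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
STATUS = [True, True, True, True, True, False, False]


def unadjusted_p_value(genotypes, status, msc):
    """Assumed one-sided binomial tail for the cases selected by the MSC."""
    selected = [sick for g, sick in zip(genotypes, status)
                if all(g[snp] == value for snp, value in msc.items())]
    n, cases = len(selected), sum(selected)
    q = sum(status) / len(status)
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(cases, n + 1))


def best_single_snp_p_value(genotypes, status):
    """Stand-in search method: smallest p-value over all single-SNP MSCs."""
    num_snps = len(genotypes[0])
    return min(unadjusted_p_value(genotypes, status, {snp: value})
               for snp in range(num_snps) for value in (0, 1, 2))


def adjusted_p_value(genotypes, status, search=best_single_snp_p_value,
                     n_permutations=10000, seed=0):
    """Fraction of label-permuted instances in which the search finds an MSC
    at least as significant as the one found on the real labels."""
    observed = search(genotypes, status)
    rng = random.Random(seed)
    labels = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # permute disease status, keep genotypes fixed
        if search(genotypes, labels) <= observed:
            hits += 1
    return hits / n_permutations


print(adjusted_p_value(GENOTYPES, STATUS, n_permutations=1000))
```

Because the same search is rerun on every permuted instance, the adjustment accounts for however many MSCs that search implicitly tests, without needing an explicit Bonferroni factor.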

Combinatorial Search (CS) for Disease-Association checks all one-SNP, two-SNP, ..., m-SNP case-closed MSCs
The case-closure of an MSC C is an MSC C' with the maximum number of SNPs that contains the same set of cases and the minimum number of controls
Case-closure allows statistically significant MSCs to be found at an earlier stage of the search
Trivial MSCs and MSCs that coincide after case-closure are avoided, which significantly speeds up the search
Faster than exhaustive search
Finds more significant associations at an early stage of the search
Still slow for genome-wide studies
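
One way to compute a case-closure, under the reading that the closure fixes every SNP on which all selected cases agree, is sketched below. The function names and the dictionary encoding of an MSC are assumptions; the data are the seven genotypes from the slide example.

```python
# Toy data: the seven genotypes from the slide example; True = sick.
GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
STATUS = [True, True, True, True, True, False, False]


def matches(genotype, msc):
    """True if the genotype has the required value at every SNP of the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def case_closure(genotypes, status, msc):
    """Return the MSC that fixes every SNP shared by all cases selected by `msc`.

    Adding those SNPs cannot drop a selected case, and it can only drop
    controls, so the case set is unchanged while the control set shrinks.
    """
    cases = [g for g, sick in zip(genotypes, status) if sick and matches(g, msc)]
    if not cases:
        return dict(msc)  # nothing to close over
    first = cases[0]
    return {snp: first[snp] for snp in range(len(first))
            if all(g[snp] == first[snp] for g in cases)}


print(case_closure(GENOTYPES, STATUS, {2: 1, 5: 2}))
```

On the toy table the two-SNP pattern closes to a four-SNP MSC that selects the same four sick genotypes. Different starting MSCs that close to the same result need to be checked only once, which is the source of the speedup described above.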

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction
For the given training dataset and tested genotype, consider two cases:
the tested genotype is added to the training dataset as sick
the tested genotype is added to the training dataset as healthy
For both cases, obtain a clustering by applying CGS to find
the most disease-associated MSC (defines a set of sick genotypes)
the most disease-resistant MSC (defines a set of healthy genotypes)
Remove from the dataset the larger of these two sets
Repeat this procedure until all genotypes are removed
Predict the susceptibility of the tested genotype according to the case with the lower clustering entropy

Results for Disease Susceptibility Prediction
Maximum Case(Control)-Free Cluster Problem

Quality measure

Find a maximum-size cluster C containing only cases or only controls
Complimentary Greedy Search (CGS)
1. Find the SNP and allele value whose selection removes a set of genotypes with the highest ratio of controls to cases.
2. Add the SNP to the resulting MSC.
3. Repeat 1-2 until all controls are removed. The resulting MSC defines a subset of sick genotypes.
4. Adjust the p-value of the resulting MSC for multiple testing.
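
Steps 1-3 can be sketched as a short greedy loop. This is an illustrative reading of the slide, not the presenters' implementation: the function name, the `+ 1` in the ratio (added here to avoid dividing by zero when no cases are removed), and the rule that a step must keep at least one case are all assumptions. Step 4 is omitted.

```python
# Toy data: the seven genotypes from the slide example; True = sick.
GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
STATUS = [True, True, True, True, True, False, False]


def greedy_search(genotypes, status):
    """Grow an MSC greedily until it selects no controls.

    Returns the MSC (0-based SNP index -> value) and the indices of the
    genotypes it still selects.
    """
    num_snps = len(genotypes[0])
    msc = {}
    remaining = list(range(len(genotypes)))
    while any(not status[i] for i in remaining):
        best = None  # (ratio, snp, value, kept indices)
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in (0, 1, 2):
                kept = [i for i in remaining if genotypes[i][snp] == value]
                removed = [i for i in remaining if genotypes[i][snp] != value]
                removed_controls = sum(1 for i in removed if not status[i])
                removed_cases = len(removed) - removed_controls
                # Skip steps that remove no control or leave no case behind.
                if removed_controls == 0 or not any(status[i] for i in kept):
                    continue
                ratio = removed_controls / (removed_cases + 1)  # assumed smoothing
                if best is None or ratio > best[0]:
                    best = (ratio, snp, value, kept)
        if best is None:
            break  # the remaining controls cannot be separated from the cases
        _, snp, value, remaining = best
        msc[snp] = value
    return msc, remaining


msc, cluster = greedy_search(GENOTYPES, STATUS)
print(msc, cluster)
```

Swapping the roles of cases and controls (for example by negating the status list) gives the disease-resistant variant used for the control-free cluster.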

Leave-one-out cross validation results

Data Sets

[3] Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
[10] Autoimmune disorder: 1024 genotypes with 108 SNPs covering the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
[4] Tick-borne encephalitis: 75 genotypes with 41 SNPs covering the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Disease Susceptibility Prediction Problem

Given a sample population S (a training set) and one more individual t ∉ S with known SNPs but unknown disease status (the testing individual), find (predict) the unknown disease status
Disease Clustering Problem
Given a population sample S, find a partition P of S into clusters S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P)

Comparison of 5 prediction methods on 4 data sets using all SNPs. The area under the CSP's ROC curve is 0.87 vs. 0.52 under the SVM's curve

Results/comparison of searching methods

Comparison of three methods for searching for the disease-associated and disease-resistant multi-SNP combinations with the largest positive predictive value (PPV).

Combinatorial search is able to find statistically significant multi-gene interactions in data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction
The new susceptibility prediction is 8% higher than the best previously known
MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility

The entropy is minimized for a given bound on the number of individuals who are assigned an incorrect status in clusters of the partition P: error(P) < α·|P|.
