Combinatorial Search (CS) for Disease Association

Provided by: tae55
Our contributions
  • A novel combinatorial method for finding
    disease-associated multi-SNP combinations was
    developed.
  • Multi-SNP combinations significantly associated
    with diseases were found.
  • For Crohn's disease data (Daly et al., 2001), a
    few associated multi-SNP combinations with
    multiple-testing-adjusted p < 0.05 were found,
    while no single SNP or pair of SNPs showed
    significant association.
  • For a dataset for an autoimmune disorder (Ueda
    et al., 2003), a few previously unknown
    associated multi-SNP combinations were found.
  • For tick-borne encephalitis virus-induced
    disease, a multi-SNP combination within a group
    of genes showing a high degree of linkage
    disequilibrium was found to be significantly
    associated with the severity of the disease.
  • Model-fitting disease susceptibility prediction
    methods based on the developed search methods
    were proposed.

SNP and Disease
  • SNP: single nucleotide polymorphism, a site where
    two or more different nucleotides occur in a large
    percentage of the population
  • 0: wild type/major (frequency) allele
  • 1: mutation/minor (frequency) allele
  • 2: heterozygous allele
  • Searching for genetic risk factors for diseases
  • Monogenic diseases: a mutated gene is entirely
    responsible for the disease
  • Complex diseases: affected by the interaction of
    multiple genes
  • Significance of a risk factor is usually measured
    by risk rate or odds ratio
  • We measure significance by the p-value of the
    set of genotypes defined by the risk factor

MSC
x x 1 x x 2 x x x (x matches any value)
0 1 1 0 1 2 1 0 2 sick
0 1 1 1 0 2 0 0 1 sick
0 0 1 0 0 0 0 2 1 sick
0 1 1 1 1 2 0 0 1 sick
0 0 1 0 1 2 1 0 2 sick
0 1 0 0 1 1 0 0 2 healthy
0 1 1 0 1 2 0 0 2 healthy
Genotypes matching the MSC: 4 sick, 1 healthy (check significance)
Statistical significance
  • A multi-SNP combination (MSC) defines a set of
    case and control individuals
  • An MSC is considered statistically significant if
    the distribution of cases and controls in that set
    has a p-value < 0.05
  • Many reported findings are not reproducible in
    different populations. This is believed to happen
    because the p-values are not adjusted for
    multiple testing
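
As a minimal sketch of how an MSC defines a set of case and control individuals, the snippet below matches the example MSC from this presentation against its seven example genotypes. The names GENOTYPES, IS_CASE, matches and count_matches are illustrative assumptions, not part of the original work; None plays the role of the wildcard x.

```python
# Example genotypes from the presentation (0, 1 or 2 at each SNP).
GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
# True = sick (case), False = healthy (control), in the same order.
IS_CASE = [True, True, True, True, True, False, False]

# The MSC "x x 1 x x 2 x x x": None means "any value".
MSC = (None, None, 1, None, None, 2, None, None, None)


def matches(msc, genotype):
    """Return True if the genotype agrees with every fixed SNP of the MSC."""
    return all(want is None or want == got for want, got in zip(msc, genotype))


def count_matches(msc, genotypes, is_case):
    """Count (cases, controls) among the genotypes selected by the MSC."""
    cases = controls = 0
    for genotype, case in zip(genotypes, is_case):
        if matches(msc, genotype):
            if case:
                cases += 1
            else:
                controls += 1
    return cases, controls


print(count_matches(MSC, GENOTYPES, IS_CASE))  # -> (4, 1)
```

The result reproduces the "4 sick, 1 healthy" count shown for the example; that count is the input to the significance check.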

Disease-Associated Multi-SNP Combinations Search
  • Given a population of n genotypes (or
    haplotypes), each containing values of m SNPs from
    {0, 1, 2}, and disease status (case or control)
  • Find all multi-SNP combinations with a
    multiple-testing-adjusted p-value of the frequency
    distribution below 0.05

Disease association analysis
  • Analysis of variation in suspected genes in case
    and control individuals is aimed at identifying
    SNPs with considerably higher frequencies among
    the case individuals than among the control
    individuals
  • Most searches are done on a SNP-by-SNP basis
  • Recently, two-SNP analysis has shown promising
    results (Marchini et al., 2005)
  • Multi-SNP analyses are expected to find even
    stronger disease associations
  • Common diseases can be caused by combinations of
    several unlinked gene (SNP) variations
  • We address the computational challenge of
    searching for such multi-gene causal combinations
  • The number of multi-SNP combinations is
    infeasibly high (3^100 for 100 SNPs)
  • How can associated multi-SNP combinations be
    found without checking all of them?
  • Disease association analysis searches for SNPs
    or multi-SNP combinations with frequency among
    cases considerably higher than among controls
  • If the reported SNP is found among 100 SNPs, then
    the probability that the SNP is associated with a
    disease by mere chance becomes 100 times larger
    (Bonferroni)
  • Bonferroni is too crude (e.g., for 3-SNP
    combinations among 100 SNPs, p < 0.05 × 10^-6)
  • We adjust the resulting p-values via randomization
  • Unadjusted p-value: probability of the
    case/control distribution in the set defined by
    the MSC, computed from the binomial distribution
  • Multiple-testing-adjusted p-value: randomization
  • Randomly permute the disease status of the
    population to generate 10000 instances
  • Apply the search methods to each instance to get
    MSCs
  • Compute the fraction of instances whose MSCs have
    an unadjusted p-value at least as significant as
    the observed p-value
  • In our search we report only MSCs with an adjusted
    p-value < 0.05
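
The randomization procedure above can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the unadjusted p-value is taken to be the upper binomial tail of the number of cases in the set selected by an MSC, with the overall case fraction as the success probability, and the search is replaced by a hypothetical stand-in (best_single_snp_p) that only scans one-SNP MSCs. All function names are illustrative, and n_perm is kept small here where the presentation uses 10000 instances.

```python
import random
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def unadjusted_p(msc, genotypes, is_case):
    """Assumed unadjusted p-value: chance of seeing at least this many
    cases in the MSC-selected set if cases were spread at the overall rate."""
    selected = [case for g, case in zip(genotypes, is_case)
                if all(w is None or w == v for w, v in zip(msc, g))]
    if not selected:
        return 1.0
    return binomial_tail(sum(selected), len(selected), sum(is_case) / len(is_case))


def best_single_snp_p(genotypes, is_case):
    """Stand-in search: smallest unadjusted p-value over all one-SNP MSCs."""
    m = len(genotypes[0])
    best = 1.0
    for j in range(m):
        for value in (0, 1, 2):
            msc = [None] * m
            msc[j] = value
            best = min(best, unadjusted_p(msc, genotypes, is_case))
    return best


def adjusted_p(genotypes, is_case, search=best_single_snp_p, n_perm=200, seed=0):
    """Fraction of label permutations whose best p-value is at least as
    small as the best p-value found on the real labels."""
    rng = random.Random(seed)
    observed = search(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute the disease status only
        if search(genotypes, labels) <= observed:
            hits += 1
    return hits / n_perm


# Toy data: the seven example genotypes from the presentation.
genotypes = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2), (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1), (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2), (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
is_case = [True, True, True, True, True, False, False]
p_adj = adjusted_p(genotypes, is_case)
print(p_adj)
```

Because the whole search is rerun on every permuted instance, the adjusted p-value accounts for however many MSCs the search examined, without needing a Bonferroni-style count.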

Combinatorial Search (CS) for Disease-Association
  • Checks all one-SNP, two-SNP, ..., m-SNP
    case-closed MSCs
  • The case-closure of an MSC C is an MSC C' with
    the maximum number of SNPs that contains the
    same set of cases and the minimum number of
    controls
  • Case-closure allows statistically significant
    MSCs to be found at an earlier stage of the
    search
  • Trivial MSCs and MSCs that coincide after
    case-closure are skipped, which significantly
    speeds up the search
  • Faster than exhaustive search
  • Finds more significant associations at an early
    stage of the search
  • Still slow for genome-wide studies
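
One way to read the case-closure definition above is sketched below, as an assumption rather than the authors' code: take the cases selected by C and fix every SNP on which all of those cases agree. This keeps the same cases, adds as many SNPs as possible, and can only drop controls. The names matches and case_closure are illustrative, and None stands for an unconstrained SNP.

```python
def matches(msc, genotype):
    """True if the genotype agrees with every fixed SNP of the MSC."""
    return all(want is None or want == got for want, got in zip(msc, genotype))


def case_closure(msc, genotypes, is_case):
    """Return the MSC with the most fixed SNPs that selects the same cases."""
    cases = [g for g, case in zip(genotypes, is_case) if case and matches(msc, g)]
    if not cases:
        return list(msc)  # nothing to close over
    closed = []
    for j in range(len(msc)):
        values = {g[j] for g in cases}
        # Fix SNP j only if every selected case carries the same value.
        closed.append(values.pop() if len(values) == 1 else None)
    return closed


# The seven example genotypes from the presentation (last two are healthy).
genotypes = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2), (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1), (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2), (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
is_case = [True, True, True, True, True, False, False]

msc = [None, None, 1, None, None, 2, None, None, None]
closed = case_closure(msc, genotypes, is_case)
print(closed)  # -> [0, None, 1, None, None, 2, None, 0, None]
```

Several different MSCs can share one closure, which is why checking only case-closed MSCs avoids repeated work.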

Clustering-based Model-Fitting Algorithm for Disease Susceptibility Prediction
  • For the given training dataset and tested
    genotype, consider two cases:
  • the tested genotype is added to the training
    dataset as sick
  • the tested genotype is added to the training
    dataset as healthy
  • For both cases, obtain a clustering by applying
    CGS to find
  • the most disease-associated MSC (defines a set of
    sick genotypes)
  • the most disease-resistant MSC (defines a set of
    healthy genotypes)
  • Remove from the dataset whichever of the two
    sets is larger
  • Repeat this procedure until all genotypes are
    removed
  • Predict the susceptibility of the tested genotype
    according to the case whose clustering has the
    lower entropy

Results for Disease Susceptibility Prediction
  • Quality measure
  • Leave-one-out cross validation results

Maximum Case(Control)-Free Cluster Problem
  • Find a maximum size cluster C containing only
    cases or only controls
  • Complimentary Greedy Search (CGS)
  • 1. Find the SNP and allele value whose selection
    removes a set of genotypes with the highest ratio
    of controls to cases.
  • 2. Add the SNP to the resulting MSC.
  • 3. Repeat 1-2 until all controls are removed.
    The resulting MSC defines a subset of sick
    genotypes.
  • 4. Adjust the p-value of the resulting MSC for
    multiple testing.
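
Steps 1-3 of the Complimentary Greedy Search can be sketched as below. This is a minimal illustration under assumptions that the presentation does not state: ties are broken in favor of removing more controls, a removal that touches no cases counts as an infinite ratio, and a choice that would remove every remaining case is skipped. The function name cgs and the data layout are hypothetical, and step 4 (multiple-testing adjustment) is left out.

```python
def cgs(genotypes, is_case):
    """Greedily build an MSC that selects cases only.

    Returns (msc, remaining): msc has None for unconstrained SNPs, and
    remaining lists the indices of the genotypes still selected.
    """
    m = len(genotypes[0])
    msc = [None] * m
    remaining = list(range(len(genotypes)))
    while any(not is_case[i] for i in remaining):
        n_cases = sum(1 for i in remaining if is_case[i])
        best = None  # (score, snp index, allele value)
        for j in range(m):
            if msc[j] is not None:
                continue
            for value in (0, 1, 2):
                removed = [i for i in remaining if genotypes[i][j] != value]
                controls = sum(1 for i in removed if not is_case[i])
                cases = len(removed) - controls
                # Skip choices that remove no control or every remaining case.
                if controls == 0 or cases == n_cases:
                    continue
                ratio = float("inf") if cases == 0 else controls / cases
                score = (ratio, controls)
                if best is None or score > best[0]:
                    best = (score, j, value)
        if best is None:
            break  # no admissible SNP left; some controls stay selected
        _, j, value = best
        msc[j] = value
        remaining = [i for i in remaining if genotypes[i][j] == value]
    return msc, remaining


# The seven example genotypes from the presentation (last two are healthy).
genotypes = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2), (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1), (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 2, 1, 0, 2), (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
is_case = [True, True, True, True, True, False, False]

msc, cluster = cgs(genotypes, is_case)
print(msc)      # -> [None, None, 1, None, None, None, None, None, 1]
print(cluster)  # -> [1, 2, 3]
```

Swapping the roles of cases and controls in is_case gives the disease-resistant counterpart, a cluster containing only controls.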

Data Sets
  • [3] Crohn's disease: 387 genotypes with 103 SNPs
    derived from the 616 kb region of human
    Chromosome 5q31, 144 disease genotypes and 243
    nondisease genotypes (Daly et al., 2001).
  • [10] Autoimmune disorder: 1024 genotypes with
    108 SNPs containing genes CD28, CTLA4 and ICOS,
    378 disease genotypes and 646 nondisease
    genotypes (Ueda et al., 2003).
  • [4] Tick-borne encephalitis: 75 genotypes with
    41 SNPs containing genes TLR3, PKR, OAS1, OAS2,
    and OAS3, 21 disease genotypes and 54 nondisease
    genotypes (Barkash et al., 2006).

Disease Susceptibility Prediction Problem
  • Given a sample population S (a training set) and
    one more individual t ∉ S with known SNPs but
    unknown disease status (the testing individual),
    find (predict) the unknown disease status
  • Disease Clustering Problem
  • Given a population sample S, find a partition P
    of S into clusters S = S1 ∪ ... ∪ Sk, with disease
    status 0 or 1 assigned to each cluster Si,
    minimizing entropy(P)
  • Comparison of 5 prediction methods on 4 data sets
    using all SNPs. The area under the CSP's ROC curve
    is 0.87 vs. 0.52 under the SVM's curve

Results/comparison of searching methods
  • Comparison of three methods for searching for the
    disease-associated and disease-resistant
    multi-SNP combinations with the largest PPV.
  • Combinatorial search is able to find
    statistically significant multi-gene
    interactions in data where no significant
    association was detected before
  • Complimentary greedy search can be used in
    susceptibility prediction
  • Optimization approach to prediction
  • The new susceptibility prediction is 8% higher
    than the best previously known
  • MLR-tagging efficiently reduces the datasets,
    making it possible to find associated multi-SNP
    combinations and predict susceptibility
Disease Clustering Problem (continued): entropy(P) is
minimized for a given bound on the number of
individuals who are assigned an incorrect status in
clusters of the partition P, error(P) < ?P.