Discrete Algorithms for Disease Association Search and Susceptibility Prediction

1
Discrete Algorithms for Disease Association
Search and Susceptibility Prediction
UCSD, November 29, 2006
  • Dumitru Brinza
  • Department of Computer Science
  • Georgia State University

2
Outline
  • SNPs, Haplotypes and Genotypes
  • Disease Association Analysis
  • Multiple-testing adjustment
  • MLR indexing for data compression
  • Optimum data clustering
  • Predicting susceptibility to complex diseases
  • Conclusions

3
SNP, Haplotypes, Genotypes
  • Human genome: all the genetic material in the
    chromosomes; length 3 × 10^9 base pairs
  • Differences between any two people occur in
    about 0.1% of the genome
  • SNP (single nucleotide polymorphism): a site
    where two or more different nucleotides occur in
    a large percentage of the population
  • Diploid: two different copies of each chromosome
  • Haplotype: description of a single copy (expensive)
    example: 00110101 (0 is for the major, 1 is
    for the minor allele)
  • Genotype: description of the mixed two copies
    example: 01122110 (0 = 00, 1 = 11, 2 = 01)
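The genotype encoding above can be illustrated with a minimal Python sketch. It assumes the reading 0 = 00, 1 = 11, 2 = 01 of the slide's legend; the function name and the second haplotype are hypothetical and chosen only for illustration.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Mix two haplotypes (strings over '0'/'1') into one genotype string.

    Assumed encoding: 0 = both copies major, 1 = both copies minor,
    2 = the two copies differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    sites = []
    for a, b in zip(h1, h2):
        sites.append(a if a == b else "2")
    return "".join(sites)


# First haplotype is the slide's example; the second one is made up.
print(genotype_from_haplotypes("00110101", "01100100"))  # 02120102
```

Note that the mapping is not invertible: a genotype with k heterozygous sites is consistent with 2^(k-1) haplotype pairs, which is why haplotype data is the more expensive description.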
4
Types of Diseases
  • Monogenic disease
  • A mutated gene is entirely responsible for the
    disease
  • The mutation breaks the pathway and there is no
    other compensatory pathway
  • Typically rare in the population: < 0.1%
  • Complex disease
  • Interaction of multiple genes
  • One mutation does not cause the disease
  • Breakage of all compensatory pathways causes
    the disease
  • Hard to analyze: 2-gene interaction analysis for
    a genome-wide scan with 1 million SNPs has about
    10^12 pairwise tests
  • Multiple independent causes
  • There are different causes, and each of these
    causes can be the result of an interaction of
    several genes
  • Each cause explains a certain percentage of cases
  • Common diseases are complex: > 0.1%
  • In New York City, 12% of the population has type 2
    diabetes

5
Case/Control study
Given a population of n genotypes, each
containing values of m SNPs and a disease status.

SNPs:
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120

Disease status: -1 -1 -1 -1 1 1 1 1
(The figure groups the rows into case genotypes and
control genotypes.)

Disease association analysis searches for a
risk (resistance) factor whose frequency among
case (control) individuals is considerably higher
than among control (case) individuals.
6
Risk/Resistance factors
  • Risk/resistance factor: one SNP with a fixed
    allele value

0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
Present in 5 cases and 1 control.
The third SNP with fixed allele value 1 is a risk
factor: its frequency among case individuals is
higher than among control individuals.
  • We generalize the risk/resistance factor to a
    multi-SNP combination

7
Multi-SNP extension
  • Multi-SNP combination (MSC)
  • A subset of the SNP columns of S (the set of SNPs)
  • With a fixed value (0, 1, or 2) for each of these
    SNPs

0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
x x 1 x x 2 x x x MSC
Present in 4 cases and 1 control; check significance.
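Counting how many cases and controls carry an MSC can be sketched as follows. The sample is the seven-genotype table from the slide; representing an MSC as a dictionary from zero-based SNP position to allele value, and the function names, are assumptions made for this illustration.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001012102", "case"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def matches(genotype, msc):
    """True if the genotype has the fixed value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())


def count_presence(sample, msc):
    """Return (number of cases, number of controls) in which the MSC is present."""
    cases = sum(1 for g, s in sample if s == "case" and matches(g, msc))
    controls = sum(1 for g, s in sample if s == "control" and matches(g, msc))
    return cases, controls


single_snp = {2: "1"}          # third SNP fixed to 1
two_snps = {2: "1", 5: "2"}    # the pattern x x 1 x x 2 x x x

print(count_presence(SAMPLE, single_snp))  # (5, 1)
print(count_presence(SAMPLE, two_snps))    # (4, 1)
```

A single-SNP risk factor is simply an MSC with one entry, so the same two functions cover both slides.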
8
Significance of Risk/Resistance Factors
  • Measured by p-value
  • The probability that the case/control distribution
    among individuals exposed to the risk factor
    happened by chance
  • Computed from the binomial distribution
  • Searching for risk factors among many SNPs
    requires multiple-testing adjustment of the
    p-value
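One way to compute such an unadjusted p-value is sketched below. The slide only says that the binomial distribution is used; the specific null model here (each exposed individual is a case with probability equal to the overall fraction of cases, one-sided upper tail) is an assumption, and the function name is hypothetical.

```python
from math import comb


def binomial_p_value(exposed_cases, exposed_controls, total_cases, total_controls):
    """P(X >= exposed_cases) for X ~ Binomial(n, q).

    n is the number of exposed individuals and q the overall case fraction.
    Assumed null model: exposure is unrelated to disease status.
    """
    n = exposed_cases + exposed_controls
    q = total_cases / (total_cases + total_controls)
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        for k in range(exposed_cases, n + 1)
    )


# MSC from the toy table: present in 4 cases and 1 control,
# in a sample of 5 cases and 2 controls.
print(round(binomial_p_value(4, 1, 5, 2), 4))  # 0.5578
```

On a seven-genotype toy sample nothing comes close to significance, which is expected; the point is only the shape of the computation.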

9
Multiple-testing adjustment
  • Bonferroni
  • Easy to compute
  • Overly conservative
  • If the reported SNP is found among 100 SNPs, then
    the probability that the SNP is associated with
    the disease by mere chance becomes 100 times larger
  • Randomization
  • Randomly permute the disease status of the
    population to generate 10,000 samples
  • Apply the search method to each sample and get
    its MSCs
  • Count the MSCs that have a smaller unadjusted
    p-value than the observed p-value
  • If this count is < 500 (5% of 10,000), then the
    observed MSC is significant
  • Computationally expensive
  • More accurate
  • In our search we report only MSCs with adjusted
    p-value < 0.05
10
Disease Association Search
Problem formulation
Given: case/control study data consisting of n
genotypes, each containing values of m SNPs and a
disease status.
Find: all risk/resistance factors (MSCs) with
multiple-testing-adjusted p-value below 0.05.
11
Searching Approaches
  • Exhaustive search (ES)
  • computationally infeasible
  • Searching for 3-SNP MSCs on a sample with n
    genotypes and m SNPs requires O(n·m^3) time
  • The case-closure of an MSC C is an MSC C′ with the
    maximum number of SNPs with fixed values that is
    present in the same set of cases and in the
    minimum number of controls
  • Efficient way to find the case-closure: extend
    the MSC with those SNPs that have a common value
    in all of its cases

0 1 1 0 1 2 1 0 2 case
2 0 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 0 1 2 0 0 2 control
0 2 1 0 1 2 0 1 2 control
x x 1 x x 2 x x x MSC: present in 2 cases and 2 controls
x x 1 x x 2 x 0 x Case-closure: present in 2 cases and 1 control
Cluster C: a subset of genotypes that share the
same MSC
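The case-closure step can be sketched as follows, using the five distinct genotypes from the slide's example. The dictionary representation of an MSC (zero-based SNP position to allele value) and the function names are assumptions made for this illustration.

```python
SAMPLE = [
    ("011012102", "case"),
    ("201102001", "case"),
    ("001000021", "case"),
    ("011012002", "control"),
    ("021012012", "control"),
]


def matches(genotype, msc):
    return all(genotype[pos] == value for pos, value in msc.items())


def count_presence(sample, msc):
    cases = sum(1 for g, s in sample if s == "case" and matches(g, msc))
    controls = sum(1 for g, s in sample if s == "control" and matches(g, msc))
    return cases, controls


def case_closure(sample, msc):
    """Extend the MSC with every SNP whose value is shared by all of its cases."""
    cases = [g for g, s in sample if s == "case" and matches(g, msc)]
    if not cases:
        return dict(msc)
    closed = {}
    for pos in range(len(cases[0])):
        values = {g[pos] for g in cases}
        if len(values) == 1:          # same allele value in every case
            closed[pos] = values.pop()
    return closed


msc = {2: "1", 5: "2"}                 # x x 1 x x 2 x x x
closed = case_closure(SAMPLE, msc)     # adds SNP 8 (index 7) with value 0
print(closed, count_presence(SAMPLE, msc), count_presence(SAMPLE, closed))
```

Closing an MSC never loses a case, because only values common to all of its cases are added, while it can only drop controls. That is why restricting the search to case-closed MSCs is safe.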
12
Combinatorial Search
  • Combinatorial Search Method (CS)
  • Searches only among case-closed MSCs
  • Avoids checking of clusters with small number of
    cases
  • Finds significant MSCs faster than ES
  • Still too slow for large data
  • Further speedup by reducing number of SNPs
  • Indexing: compress S by extracting the most
    informative SNPs
  • Uses a multiple linear regression method

13
Indexing
Step 1: Find the index SNPs (SNP positions) in the sample
        and their values (0, 1, 2)
Step 2: Reconstruct the complete genotype
Computation methods
  • Problem formulation
  • Given the full pattern of all SNPs in a sample
  • Find the minimum number of index SNPs that will
    allow the reconstruction of the complete genotype
    for each individual
  • Index SNPs Selection Algorithm
  • SNP Prediction Algorithm

14
MLR Indexing
  • SNP Prediction Algorithm
  • Based on Multiple Linear Regression (MLR)
  • Index SNPs Selection Algorithm
  • Choose as the first index SNP the SNP that best
    predicts all other SNPs
  • Choose next the SNP that, together with the first,
    best predicts all other SNPs, and so on
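The greedy selection described above can be sketched in pure Python. Several details are assumptions not stated on the slide: SNP values 0/1/2 are used directly as regression inputs, predictions are rounded to the nearest of 0, 1, 2, quality is measured as the number of mispredicted entries, and a tiny ridge term keeps the normal equations solvable when columns are collinear. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mlr_predict(sample, index_snps, target):
    """Predict SNP column `target` from the index columns by least squares."""
    rows = [[1.0] + [float(g[j]) for j in index_snps] for g in sample]
    y = [float(g[target]) for g in sample]
    k = len(rows[0])
    # Normal equations (A^T A) w = A^T y, with a tiny ridge on the diagonal.
    ata = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    w = solve(ata, aty)
    fitted = (sum(wi * ri for wi, ri in zip(w, r)) for r in rows)
    return [min(2, max(0, round(v))) for v in fitted]


def prediction_errors(sample, index_snps):
    """Total number of mispredicted entries over all non-index SNPs."""
    errors = 0
    for t in range(len(sample[0])):
        if t not in index_snps:
            predicted = mlr_predict(sample, index_snps, t)
            errors += sum(p != g[t] for p, g in zip(predicted, sample))
    return errors


def select_index_snps(sample, k):
    """Greedily add the SNP that, with those already chosen, predicts best."""
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(len(sample[0])) if j not in chosen]
        chosen.append(min(candidates,
                          key=lambda j: prediction_errors(sample, chosen + [j])))
    return chosen


# Made-up sample: SNP 1 copies SNP 0, SNP 2 equals 2 - SNP 0, SNP 3 is unrelated.
sample = [[0, 0, 2, 1], [1, 1, 1, 0], [2, 2, 0, 1],
          [0, 0, 2, 0], [1, 1, 1, 1], [2, 2, 0, 0]]
print(select_index_snps(sample, 2))  # [0, 3]
```

In the toy sample the first three SNPs carry the same information, so one of them plus the unrelated fourth SNP reconstructs every genotype exactly.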

15
5 Data Sets
  • Crohn's disease (Daly et al.): inflammatory bowel
    disease (IBD)
  • Location: 5q31
  • Number of SNPs: 103
  • Population size: 387
  • Cases: 144, controls: 243
  • Autoimmune disorders (Ueda et al.)
  • Location: region containing genes CD28, CTLA4 and
    ICOS
  • Number of SNPs: 108
  • Population size: 1024
  • Cases: 378, controls: 646
  • Tick-borne encephalitis (Barkash et al.)
  • Location: regions containing genes TLR3, PKR, OAS1,
    OAS2, and OAS3
  • Number of SNPs: 41
  • Population size: 75
  • Cases: 21, controls: 54
  • Lung cancer (Dragani et al.)
  • Number of SNPs: 141
  • Population size: 500
  • Cases: 260, controls: 240

16
Results of Disease association search
  • Indexed versus original
  • The number of statistically significant MSCs
    found on indexed data is greater than on the
    non-indexed data
  • CS versus ES
  • Over all datasets, CS finds no fewer MSCs than ES
  • For some datasets, ES could not find any
    significant MSC in a reasonable amount of time,
    while CS could
  • Conclusion
  • We conclude that the proposed indexing approach
    and the CS method are very promising techniques
  • CS is still slow
  • Alternatively, we can search not for all MSCs but
    for the best MSC

17
The most associated MSC
  • Optimum Association Search Problem
  • Given: case/control study data
  • Find: the MSC that is most associated with the
    disease
  • That is, the MSC that is present in a control-free
    cluster of maximum size
  • Complexity
  • Generalization of maximum independent set
  • NP-complete and cannot be well approximated
  • Hope
  • Sample S is not arbitrary
  • Biological structure

Cluster C: a subset of genotypes that share the
same MSC
18
Complimentary Greedy Search (CGS)
  • Intuition: the greedy algorithm for finding a
    maximum independent set by removing highest-degree
    vertices
  • Algorithm
    1. Start with the empty MSC, which is present in
       all genotypes
    2. Find the SNP with an allele value that removes
       the set of genotypes with the highest ratio of
       controls to cases (max(controls/cases))
    3. Add the SNP to the resulting MSC
    4. Repeat steps 2-3 until all controls are removed
    5. Output the resulting MSC
    6. Adjust the p-value of the resulting MSC for
       multiple testing
  • Extremely fast but inaccurate
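The greedy loop can be sketched as below on the seven-genotype toy table used earlier in the talk. Tie-breaking (first best candidate wins), skipping candidates that would remove every remaining case, and stopping when no candidate removes a control are assumptions of this sketch, as are the names; the multiple-testing adjustment of the final p-value is left out.

```python
def complimentary_greedy_search(genotypes, status):
    """Return (msc, cluster): an MSC as {SNP position: value} and the
    indices of the genotypes that still carry it."""
    alive = list(range(len(genotypes)))
    msc = {}
    while any(status[i] == "control" for i in alive):
        alive_cases = sum(1 for i in alive if status[i] == "case")
        best = None  # (ratio, position, value)
        for pos in range(len(genotypes[0])):
            if pos in msc:
                continue
            for value in "012":
                removed = [i for i in alive if genotypes[i][pos] != value]
                controls = sum(1 for i in removed if status[i] == "control")
                cases = len(removed) - controls
                if controls == 0 or cases == alive_cases:
                    continue  # no progress, or no case would be left
                ratio = controls / cases if cases else float("inf")
                if best is None or ratio > best[0]:
                    best = (ratio, pos, value)
        if best is None:
            break  # remaining controls cannot be separated from the cases
        _, pos, value = best
        msc[pos] = value
        alive = [i for i in alive if genotypes[i][pos] == value]
    return msc, alive


genotypes = ["011012102", "011102001", "001000021", "011112001",
             "001012102", "010011002", "011012002"]
status = ["case"] * 5 + ["control"] * 2

msc, cluster = complimentary_greedy_search(genotypes, status)
print(msc, cluster)  # {2: '1', 8: '1'} [1, 2, 3]
```

On this sample the first step removes one control at no cost in cases, and the second step gives up two cases to remove the last control, leaving a control-free cluster of three cases. Each step is a single scan over the SNPs, which is what makes the method fast, and the greedy choice is what makes it inexact.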

19
CGS Results
  • CGS finds MSCs with non-trivially high
    association on real data
  • CGS finds more significant MSCs on full dataset
    than CS on indexed in reasonable amount of time

20
Future Work
  • CGS is fast, so it can be used as a basic
    operation in case/control data analysis
  • Cover the data with clusters corresponding to MSCs
    found by CGS and analyze SNPs that belong to
    many MSCs
  • Build a classifier (predictor) based on MSCs found
    by CGS
  • We plan to randomize CGS using simulated
    annealing to find more significant MSCs with
    a smaller number of SNPs

21
Genetic Susceptibility Prediction
Problem formulation
  • Given: case/control study data S and the
    genotype of a testing individual t
  • Predict: the disease status of the testing
    individual

SNPs:
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120

Disease status: -1 -1 -1 -1 1 1 1 1
(The figure groups the rows into case genotypes and
control genotypes.)

Testing genotype gt: 0110211101211201
Disease status: ?
22
Cross-validation
  • Leave-one-out test
  • The disease status of each genotype in the data
    set is predicted while the rest of the data is
    regarded as the training set

Genotype           Real disease status   Predicted disease status
0101201020102210   -1                    -1
0220110210120021   -1                    -1
0200120012221110   -1                     1
0020011002212101    1                     1
0020011002212101    1                     1
Accuracy: 80%

  • Leave-many-out test
  • Repeatedly pick a random 2/3 of the population as
    the training set and predict the other 1/3
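A leave-one-out loop can be sketched as follows. The classifier plugged in here is a deliberately simple nearest-neighbor rule by Hamming distance, used only as a hypothetical stand-in for the prediction methods discussed in the talk; the five-genotype sample is also made up.

```python
def hamming(a, b):
    """Number of SNP positions at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train, genotype):
    """Toy stand-in classifier: status of the closest training genotype."""
    return min(train, key=lambda pair: hamming(pair[0], genotype))[1]


def leave_one_out_accuracy(sample, predict):
    """Predict each genotype from all the others; return the hit rate."""
    correct = 0
    for i, (genotype, real_status) in enumerate(sample):
        train = sample[:i] + sample[i + 1:]
        if predict(train, genotype) == real_status:
            correct += 1
    return correct / len(sample)


# Hypothetical data: status -1 and 1 as in the table above.
sample = [("0000", -1), ("0001", -1), ("1111", 1), ("1110", 1), ("0011", 1)]
print(leave_one_out_accuracy(sample, nearest_neighbor_predict))  # 0.8
```

The leave-many-out variant replaces the loop over single genotypes with repeated random 2/3 versus 1/3 splits and averages the results.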

23
Quality Measures of Prediction
(confusion table)
  • Sensitivity: the ability to correctly detect
    cases
  • sensitivity = TP / (TP + FN)
  • Specificity: the ability to avoid calling a control
    a case
  • specificity = TN / (FP + TN)
  • Accuracy = (TP + TN) / (TP + FP + FN + TN)
  • Risk rate: measurement for risk factors
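These measures can be computed from paired real and predicted statuses as sketched below. Treating status 1 as "case" and -1 as "control" is an assumption, the function names are hypothetical, and the sketch does not guard against empty classes (zero denominators).

```python
def confusion_counts(real, predicted, case=1):
    """Return (TP, FP, FN, TN), treating `case` as the positive class."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == case and p == case)
    fn = sum(1 for r, p in pairs if r == case and p != case)
    fp = sum(1 for r, p in pairs if r != case and p == case)
    tn = sum(1 for r, p in pairs if r != case and p != case)
    return tp, fp, fn, tn


def quality_measures(real, predicted, case=1):
    tp, fp, fn, tn = confusion_counts(real, predicted, case)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Statuses from the five-row leave-one-out example in the talk.
real = [-1, -1, -1, 1, 1]
predicted = [-1, -1, 1, 1, 1]
print(confusion_counts(real, predicted))  # (2, 1, 0, 2)
print(quality_measures(real, predicted))
```

For that example both cases are detected (sensitivity 1.0), one of three controls is called a case (specificity 2/3), and four of five calls are right (accuracy 0.8).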

24
Prediction Methods
  • Support vector machine
  • Random forest
  • LP-based prediction
  • Drawback of the prediction problem formulation
  • Need for cross-validation means there is no
    optimization

25
Optimum Clustering Problem
  • Given: case/control study data represented by a
    population sample S
  • Find: a partition P of S into clusters
    S = S1 ∪ … ∪ Sk, with disease status 0 or 1
    assigned to each cluster Si, minimizing
    entropy(P) assuming 0 errors

Clustering P: a partition into clusters defined by
MSCs
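The slide does not spell out entropy(P). The sketch below assumes the usual Shannon entropy of the cluster-size distribution, entropy(P) = -Σ (|Si|/|S|) ln(|Si|/|S|), and counts an error whenever an individual's status differs from the status assigned to its cluster. Names and data are hypothetical.

```python
from math import log


def partition_entropy(clusters):
    """Shannon entropy of the cluster sizes (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return -sum((len(c) / total) * log(len(c) / total) for c in clusters if c)


def assignment_errors(clusters, assigned, status):
    """Individuals whose status differs from their cluster's assigned status."""
    return sum(1 for c, a in zip(clusters, assigned) for i in c if status[i] != a)


status = [1, 1, 1, 0, 0, 0]              # made-up sample of six individuals
coarse = [[0, 1, 2], [3, 4, 5]]          # two error-free clusters
fine = [[0], [1, 2], [3, 4, 5]]          # three error-free clusters

print(partition_entropy(coarse), assignment_errors(coarse, [1, 0], status))
print(partition_entropy(fine), assignment_errors(fine, [1, 1, 0], status))
```

Under this definition, fewer and larger error-free clusters give lower entropy, so the optimum clustering is the most compact error-free description of the sample.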
26
From Clustering to Prediction
  • Intuition
  • If tested genotype is predicted correctly then
    optimum clustering will have smaller entropy
  • Model-fitting prediction Algorithm
  • Set status of testing genotype to diseased
  • Add it to training dataset
  • Find optimum clustering of the dataset
  • Set status of testing genotype to non-diseased
  • Add it to training dataset
  • Find optimum clustering of the dataset
  • Predict status according to the case with smaller
    entropy
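The model-fitting idea can be sketched with the clustering step passed in as a function. The clustering used here is a deliberately trivial stand-in (one cluster per combination of status and first SNP value), not the optimum clustering from the talk; the entropy definition, the tie rule (predict non-diseased), the status coding 1/-1 and all names are assumptions.

```python
from collections import Counter
from math import log


def toy_clustering_entropy(sample):
    """Stand-in for optimum clustering: one cluster per
    (status, value of the first SNP); returns the partition entropy."""
    sizes = Counter((status, genotype[0]) for genotype, status in sample).values()
    total = len(sample)
    return -sum((n / total) * log(n / total) for n in sizes)


def model_fitting_predict(train, genotype, clustering_entropy):
    """Try both statuses for the tested genotype and keep the one whose
    clustering of the extended data set has the smaller entropy."""
    as_diseased = clustering_entropy(train + [(genotype, 1)])
    as_healthy = clustering_entropy(train + [(genotype, -1)])
    return 1 if as_diseased < as_healthy else -1


# Hypothetical training data: 1 = diseased, -1 = non-diseased.
train = [("10", 1), ("11", 1), ("12", 1), ("00", -1), ("01", -1), ("02", -1)]
print(model_fitting_predict(train, "11", toy_clustering_entropy))  # 1
print(model_fitting_predict(train, "01", toy_clustering_entropy))  # -1
```

With the right status, the tested genotype joins an existing cluster. With the wrong one, it forces a new singleton cluster and the entropy goes up. This is the intuition stated at the top of the slide.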

27
Results of Prediction Methods
Leave-One-Out Cross Validation
  • Leave-one-out cross-validation for combinatorial
    search-based prediction (CSP) and complimentary
    greedy search-based prediction (CGSP) are given
    when 20, 30, or all SNPs are chosen as
    informative SNPs.

28
ROC curve
  • Comparison of 5 prediction methods on the (Barkash
    et al., 2006) data, using all SNPs
  • The area under CSP's curve is 0.81 vs. 0.52 under
    SVM's curve

29
Conclusions
  • Combinatorial search is able to find
    statistically significant multi-gene
    interactions, for data where no significant
    association was detected before
  • Complimentary greedy search can be used in
    susceptibility prediction
  • Optimization approach to prediction
  • The new susceptibility prediction is 15% higher
    than the best previously known
  • MLR-tagging efficiently reduces the datasets,
    making it possible to find associated multi-SNP
    combinations and predict susceptibility
30
Thank You!
  • Poster 14
  • Case(Control)-Free Multi-SNP Combinations in
    Case-Control Studies Algorithmic Biology 2006
  • Paper
  • Combinatorial Methods for Disease Association
    Search and Susceptibility Prediction
  • WABI 2006