Title: Discrete Algorithms for Disease Association Search and Susceptibility Prediction
1. Discrete Algorithms for Disease Association Search and Susceptibility Prediction
UCSD, November 29, 2006
- Dumitru Brinza
- Department of Computer Science
- Georgia State University
2. Outline
- SNPs, Haplotypes and Genotypes
- Disease Association Analysis
- Multiple-testing adjustment
- MLR indexing for data compression
- Optimum data clustering
- Predicting susceptibility to complex diseases
- Conclusions
3. SNPs, Haplotypes, Genotypes
- Human genome: all the genetic material in the chromosomes; length 3×10^9 base pairs
- Differences between any two people occur in 0.1% of the genome
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population
- Diploid: two different copies of each chromosome
- Haplotype: description of a single copy (expensive to obtain); example: 00110101 (0 is the major allele, 1 is the minor allele)
- Genotype: description of the two copies mixed together; example: 01122110 (0 = 00, 1 = 11, 2 = 01)
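The genotype coding can be illustrated with a short sketch. It assumes the legend reads 0 = 00 (both copies carry the major allele), 1 = 11 (both carry the minor allele), and 2 = 01 (the copies differ). The function name and the two haplotypes are made up for illustration.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotype strings (0 = major, 1 = minor) into a genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Two hypothetical haplotypes that mix into the genotype used on the slide.
print(genotype_from_haplotypes("01101110", "01110110"))  # 01122110
```

Going the other way (genotype to haplotypes) is ambiguous wherever there are two or more 2s, which is why haplotypes are the more expensive description.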
4. Types of Diseases
- Monogenic disease
  - A mutated gene is entirely responsible for the disease
  - The mutation breaks a pathway, and there is no other compensatory pathway
  - Typically rare in the population: < 0.1%
- Complex disease
  - Interaction of multiple genes
    - One mutation does not cause the disease
    - Breakage of all compensatory pathways causes the disease
    - Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs has about 10^12 pairwise tests
  - Multiple independent causes
    - There are different causes, and each of these causes can be the result of an interaction of several genes
    - Each cause explains a certain percentage of cases
  - Common diseases are complex: > 0.1%
    - In New York City, 12% of the population has Type 2 diabetes
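The size of the pairwise search space quoted above can be checked with a one-line count. This is only a sketch: it counts unordered SNP pairs (the helper name is hypothetical), which gives roughly 5×10^11, and the figure grows further if every combination of allele values per pair is tested separately.

```python
from math import comb


def pairwise_tests(num_snps: int) -> int:
    """Number of unordered SNP pairs in a genome-wide scan."""
    return comb(num_snps, 2)


print(pairwise_tests(1_000_000))  # 499999500000
```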
5. Case/Control study
Given a population of n genotypes, each containing values of m SNPs and a disease status.

SNPs                Disease status
0101201020102210    -1
0220110210120021    -1
0200120012221110    -1
0020011002212101    -1
1101202020100110     1
0120120010100011     1
0210220002021112     1
0021011000212120     1

The rows are split into case genotypes and control genotypes.
Disease association analysis searches for a risk (resistance) factor
whose frequency among case (control) individuals is considerably higher
than among control (case) individuals.
6. Risk/Resistance factors
- Risk/resistance factor: one SNP with a fixed allele value
0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
Present in 5 cases, 1 control.
The third SNP with fixed allele value 1 is a risk factor: its frequency
among case individuals is higher than among control individuals.
- We generalize the risk/resistance factor to a multi-SNP combination
7. Multi-SNP extension
- Multi-SNP combination (MSC)
  - a subset of the SNP columns of S (the set of SNPs)
  - with fixed values of these SNPs: 0, 1, or 2
0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
x x 1 x x 2 x x x   MSC
Present in 4 cases, 1 control: check significance.
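Counting how many cases and controls carry an MSC can be sketched as follows. The sample is the seven-genotype table above; the pattern-string representation ("x" for a free SNP) and the function names are assumptions made for this illustration.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001012102", "case"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def matches(genotype: str, msc: str) -> bool:
    """True if the genotype has the fixed value at every non-'x' position."""
    return all(m == "x" or m == g for g, m in zip(genotype, msc))


def count_msc(sample, msc):
    """Return (cases, controls) in which the MSC is present."""
    cases = sum(1 for g, s in sample if s == "case" and matches(g, msc))
    controls = sum(1 for g, s in sample if s == "control" and matches(g, msc))
    return cases, controls


print(count_msc(SAMPLE, "xx1xxxxxx"))  # (5, 1): single-SNP factor
print(count_msc(SAMPLE, "xx1xx2xxx"))  # (4, 1): two-SNP MSC
```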
8. Significance of Risk/Resistance Factors
- Measured by p-value
  - the probability that the case/control distribution among individuals exposed to the risk factor happened by chance
  - computed from the binomial distribution
- Searching for risk factors among many SNPs requires multiple-testing adjustment of the p-value
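One way to read "compute by binomial distribution" is sketched below. It is an assumption rather than the exact test used in the talk: under the null hypothesis each exposed individual is a case with probability equal to the overall case fraction, and the p-value is the chance of seeing at least the observed number of exposed cases. The function name and arguments are hypothetical.

```python
from math import comb


def binomial_p_value(exposed_cases, exposed_controls, total_cases, total_controls):
    """Upper-tail binomial probability of at least `exposed_cases` cases among the exposed."""
    n = exposed_cases + exposed_controls
    p_case = total_cases / (total_cases + total_controls)
    return sum(
        comb(n, k) * p_case**k * (1 - p_case) ** (n - k)
        for k in range(exposed_cases, n + 1)
    )


# MSC present in 4 cases and 1 control, in a sample of 5 cases and 2 controls.
p = binomial_p_value(4, 1, 5, 2)
print(f"{p:.4f}")  # 0.5578
```

With only seven genotypes the factor is far from significant, which is expected for a toy sample.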
9. Multiple-testing adjustment
- Bonferroni
  - easy to compute
  - overly conservative
  - If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with the disease by mere chance becomes 100 times larger
- Randomization
  - Randomly permute the disease status of the population to generate 10,000 samples
  - Apply the search method to each sample and get MSCs
  - Count the MSCs that have a smaller unadjusted p-value than the observed p-value
  - If this count is < 500, then the observed MSC is significant
  - computationally expensive
  - more accurate
- In our search we report only MSCs with adjusted p-value < 0.05
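Both adjustments can be sketched in a few lines. Everything here is illustrative: the "search method" is replaced by a hypothetical stand-in that scans single SNPs only, the unadjusted p-value is an upper-tail binomial probability (an assumption), and ties are counted with `<=`, which is slightly more conservative than "smaller than".

```python
import random
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_unadjusted_p(genotypes, status):
    """Stand-in search: smallest p-value over all (SNP, value) risk factors."""
    p_case = status.count("case") / len(status)
    best = 1.0
    for j in range(len(genotypes[0])):
        for value in "012":
            exposed = [s for g, s in zip(genotypes, status) if g[j] == value]
            if exposed:
                best = min(best, binom_tail(exposed.count("case"), len(exposed), p_case))
    return best


def bonferroni(p, num_tests):
    """Multiply by the number of tests, capped at 1."""
    return min(1.0, p * num_tests)


def randomization_adjusted_p(genotypes, status, num_samples=10_000, seed=0):
    """Fraction of status permutations whose best p-value is no larger than the observed one."""
    rng = random.Random(seed)
    observed = best_unadjusted_p(genotypes, status)
    permuted = list(status)
    hits = 0
    for _ in range(num_samples):
        rng.shuffle(permuted)
        if best_unadjusted_p(genotypes, permuted) <= observed:
            hits += 1
    return hits / num_samples


GENOTYPES = ["011012102", "011102001", "001000021", "011112001",
             "001012102", "010011002", "011012002"]
STATUS = ["case"] * 5 + ["control"] * 2
adjusted = randomization_adjusted_p(GENOTYPES, STATUS, num_samples=1000)
print(adjusted, bonferroni(best_unadjusted_p(GENOTYPES, STATUS), 27))
```

With 10,000 permutations, an adjusted value below 0.05 corresponds to the "count < 500" rule.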
10. Disease Association Search
Problem formulation
- Given: case/control study data consisting of n genotypes, each containing values of m SNPs and a disease status
- Find: all risk/resistance factors (MSCs) with multiple-testing-adjusted p-value below 0.05
11. Searching Approaches
- Exhaustive search (ES)
  - computationally infeasible
  - searching for a 3-SNP MSC on a sample with n genotypes and m SNPs requires O(n m^3) time
- Case-closure of an MSC C is an MSC C' with the maximum number of SNPs with fixed values that is present in the same set of cases and the minimum number of controls
- Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all of its cases

0 1 1 0 1 2 1 0 2   case
2 0 1 1 0 2 0 0 1   case
0 0 1 0 0 0 0 2 1   case
0 1 1 0 1 2 0 0 2   control
0 2 1 0 1 2 0 1 2   control

x x 1 x x 2 x x x   MSC: present in 2 cases, 2 controls
x x 1 x x 2 x 0 x   case-closure of the MSC: present in 2 cases, 1 control

Cluster C: the subset of genotypes that share the same MSC
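The case-closure rule (extend the MSC with every SNP on which all of its cases agree) can be sketched directly. The five-genotype sample is the one from this slide; the pattern-string representation and the function names are assumptions for illustration.

```python
SAMPLE = [
    ("011012102", "case"),
    ("201102001", "case"),
    ("001000021", "case"),
    ("011012002", "control"),
    ("021012012", "control"),
]


def matches(genotype: str, msc: str) -> bool:
    """True if the genotype has the fixed value at every non-'x' position."""
    return all(m == "x" or m == g for g, m in zip(genotype, msc))


def case_closure(sample, msc: str) -> str:
    """Fix every SNP whose value is shared by all cases containing the MSC."""
    cases = [g for g, s in sample if s == "case" and matches(g, msc)]
    if not cases:
        return msc
    closed = []
    for j in range(len(msc)):
        values = {g[j] for g in cases}
        closed.append(values.pop() if len(values) == 1 else "x")
    return "".join(closed)


def count_controls(sample, msc: str) -> int:
    return sum(1 for g, s in sample if s == "control" and matches(g, msc))


closure = case_closure(SAMPLE, "xx1xx2xxx")
print(closure)                                   # xx1xx2x0x
print(count_controls(SAMPLE, "xx1xx2xxx"), count_controls(SAMPLE, closure))  # 2 1
```

The closure keeps the same two cases but drops one control, so it can only be at least as strongly associated as the original MSC.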
12. Combinatorial Search
- Combinatorial Search Method (CS)
  - Searches only among case-closed MSCs
  - Avoids checking clusters with a small number of cases
  - Finds significant MSCs faster than ES
  - Still too slow for large data
- Further speedup by reducing the number of SNPs
  - Indexing: compress S by extracting the most informative SNPs
  - Use a multiple regression method
13. Indexing
Step 1: Find index (SNP position) in sample
Find index (0, 1, 2)
Step 2: Reconstruct complete genotype
Computation Methods
- Problem formulation
  - Given the full pattern of all SNPs in a sample
  - Find the minimum number of index SNPs that will allow the reconstruction of the complete genotype for each individual
- Index SNPs Selection Algorithm
- SNP Prediction Algorithm
14. MLR Indexing
- SNP Prediction Algorithm
  - Based on Multiple Linear Regression (MLR)
- Index SNPs Selection Algorithm
  - Choose as an index SNP the SNP that best predicts all other SNPs
  - Choose as the next one the SNP that, together with the first, best predicts all other SNPs, and so on
15. 5 Data Sets
- Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
  - Location: 5q31
  - Number of SNPs: 103
  - Population size: 387
  - Cases: 144, controls: 243
- Autoimmune disorders (Ueda et al.)
  - Location containing genes CD28, CTLA4, and ICOS
  - Number of SNPs: 108
  - Population size: 1024
  - Cases: 378, controls: 646
- Tick-borne encephalitis (Barkash et al.)
  - Location containing genes TLR3, PKR, OAS1, OAS2, and OAS3
  - Number of SNPs: 41
  - Population size: 75
  - Cases: 21, controls: 54
- Lung cancer (Dragani et al.)
  - Number of SNPs: 141
  - Population size: 500
  - Cases: 260, controls: 240
16. Results of Disease association search
- Indexed versus original
  - More statistically significant MSCs are found on the indexed data than on the non-indexed data
- CS versus ES
  - Over all datasets, CS finds no fewer MSCs than ES
  - For some datasets, ES could not find any significant MSC in a reasonable amount of time, while CS did
- Conclusion
  - The proposed indexing approach and the CS method are very promising techniques
  - CS is still slow
  - Alternatively, we can search not for all MSCs but for the best MSC
17. The most associated MSC
- Optimum Association Search Problem
  - Given case/control study data
  - Find the MSC that is most associated with the disease
    - the MSC that is present in a control-free cluster of maximum size
- Complexity
  - Generalization of maximum independent set
  - NP-complete and cannot be well approximated
- Hope
  - Sample S is not arbitrary
  - Biological structure
Cluster C: the subset of genotypes that share the same MSC
18. Complimentary Greedy Search (CGS)
- Intuition: the greedy algorithm for finding a maximum independent set by removing highest-degree vertices
- Algorithm
  1. Start with the empty MSC, which is present in all genotypes
  2. Find the SNP with an allele value whose selection removes the set of genotypes with the highest ratio of controls to cases (max(controls/cases))
  3. Add the SNP to the resulting MSC
  4. Repeat 2-3 until all controls are removed
  5. Output the resulting MSC
  6. Adjust the p-value of the resulting MSC for multiple testing
- Extremely fast but inaccurate
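The greedy loop can be sketched as below. This is a reading of the steps on this slide, not the authors' implementation: the tie-breaking rule (prefer removing more controls), the requirement that at least one case survives each step, and all names are assumptions. The sample is the seven-genotype table from the risk-factor slides.

```python
def complimentary_greedy_search(genotypes, status):
    """Return (msc, surviving_indices); msc maps SNP position -> fixed value."""
    alive = list(range(len(genotypes)))
    msc = {}
    while any(status[i] == "control" for i in alive):
        best = None  # ((ratio, removed_controls), snp, value)
        for j in range(len(genotypes[0])):
            if j in msc:
                continue
            for value in "012":
                removed = [i for i in alive if genotypes[i][j] != value]
                controls = sum(1 for i in removed if status[i] == "control")
                cases = len(removed) - controls
                kept_cases = sum(1 for i in alive
                                 if genotypes[i][j] == value and status[i] == "case")
                if controls == 0 or kept_cases == 0:
                    continue  # useless step, or it would discard every case
                ratio = controls / cases if cases else float("inf")
                key = (ratio, controls)
                if best is None or key > best[0]:
                    best = (key, j, value)
        if best is None:
            break  # remaining controls cannot be separated from the cases
        _, j, value = best
        msc[j] = value
        alive = [i for i in alive if genotypes[i][j] == value]
    return msc, alive


GENOTYPES = ["011012102", "011102001", "001000021", "011112001",
             "001012102", "010011002", "011012002"]
STATUS = ["case"] * 5 + ["control"] * 2
msc, cluster = complimentary_greedy_search(GENOTYPES, STATUS)
print(msc, cluster)  # {2: '1', 8: '1'} [1, 2, 3]
```

On this toy sample the first step fixes the third SNP to 1 (removing one control and no cases), and the second fixes the ninth SNP to 1, leaving a control-free cluster of three cases.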
19. CGS Results
- CGS finds MSCs with non-trivially high association on real data
- CGS finds more significant MSCs on the full dataset than CS finds on the indexed dataset in a reasonable amount of time
20. Future Work
- CGS is fast, so it can be used as a basic operation in case/control data analysis
- Cover the data with clusters corresponding to MSCs found by CGS and analyze the SNPs that belong to many MSCs
- Build a classifier (prediction) based on MSCs found by CGS
- We plan to randomize CGS using simulated annealing to find more significant MSCs with a smaller number of SNPs
21. Genetic Susceptibility Prediction
Problem formulation
- Given: case/control study data S and the genotype g_t of a testing individual
- Predict: the disease status of the testing individual

SNPs                Disease status
0101201020102210    -1
0220110210120021    -1
0200120012221110    -1
0020011002212101    -1
1101202020100110     1
0120120010100011     1
0210220002021112     1
0021011000212120     1
0110211101211201     ?    (testing genotype g_t)

The rows of S are split into case genotypes and control genotypes.
22. Cross-validation
- Leave-one-out test
  - The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set

Genotype            Real disease status   Predicted disease status
0101201020102210    -1                    -1
0220110210120021    -1                    -1
0200120012221110    -1                     1
0020011002212101     1                     1
0020011002212101     1                     1
Accuracy: 80%

- Leave-many-out test
  - Repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3
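The leave-one-out test is independent of the prediction method, which the sketch below makes explicit by taking the predictor as a parameter. The nearest-neighbor predictor and the tiny data set are hypothetical stand-ins, not one of the methods compared in this talk.

```python
def hamming(a: str, b: str) -> int:
    """Number of SNP positions where two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train, genotype):
    """Stand-in predictor: status of the closest training genotype."""
    return min(train, key=lambda pair: hamming(pair[0], genotype))[1]


def leave_one_out(data, predict):
    """Predict each genotype's status from all the other genotypes."""
    predictions = []
    for i, (genotype, _) in enumerate(data):
        train = data[:i] + data[i + 1:]
        predictions.append(predict(train, genotype))
    return predictions


DATA = [("0101", -1), ("0100", -1), ("2212", 1), ("2210", 1), ("2211", 1)]
predicted = leave_one_out(DATA, nearest_neighbor_predict)
accuracy = sum(p == s for p, (_, s) in zip(predicted, DATA)) / len(DATA)
print(predicted, accuracy)  # [-1, -1, 1, 1, 1] 1.0
```

A leave-many-out test follows the same pattern with a random 2/3 versus 1/3 split repeated several times.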
23. Quality Measures of Prediction
(confusion table)
- Sensitivity: the ability to correctly detect cases
  - sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling a control a case
  - specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk rate: measurements for risk factors
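These measures follow directly from the four confusion-table counts. The sketch below assumes that status 1 marks a case (the positive class) and -1 a control, and it uses the five real/predicted pairs from the cross-validation example; the function name is hypothetical.

```python
def quality_measures(real, predicted, positive=1):
    """Sensitivity, specificity and accuracy from real and predicted statuses."""
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    tn = sum(1 for r, p in pairs if r != positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


real = [-1, -1, -1, 1, 1]
predicted = [-1, -1, 1, 1, 1]
print(quality_measures(real, predicted))
# sensitivity 1.0, specificity 2/3, accuracy 0.8
```

The sketch assumes both classes occur in `real`; otherwise one of the denominators is zero.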
24. Prediction Methods
- Support vector machine
- Random forest
- LP-based prediction
- Drawback of the prediction problem formulation
  - need for cross-validation -> no optimization
25. Optimum Clustering Problem
- Given: case/control study data represented by a population sample S
- Find: a partition P of S into clusters, S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P), assuming 0 errors
Clustering P: a partition into clusters defined by MSCs
26. From Clustering to Prediction
- Intuition
  - If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy
- Model-fitting prediction algorithm
  - Set the status of the testing genotype to diseased
    - Add it to the training dataset
    - Find the optimum clustering of the dataset
  - Set the status of the testing genotype to non-diseased
    - Add it to the training dataset
    - Find the optimum clustering of the dataset
  - Predict the status according to the case with the smaller entropy
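The model-fitting idea can be sketched with the clustering step left pluggable. The slides do not define entropy(P) or how the optimum clustering is found, so the scoring function here is a loudly hypothetical stand-in: it splits the sample on the single SNP whose value groups have the lowest size-weighted status entropy. Status 1 means diseased and 0 non-diseased.

```python
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of statuses."""
    ent = 0.0
    for value in set(labels):
        p = labels.count(value) / len(labels)
        ent -= p * log2(p)
    return ent


def best_single_snp_entropy(data):
    """Stand-in for 'optimum clustering': best split on one SNP's values."""
    best = float("inf")
    for j in range(len(data[0][0])):
        groups = {}
        for genotype, status in data:
            groups.setdefault(genotype[j], []).append(status)
        ent = sum(len(g) / len(data) * label_entropy(g) for g in groups.values())
        best = min(best, ent)
    return best


def model_fitting_predict(train, genotype, clustering_entropy=best_single_snp_entropy):
    """Try both statuses for the testing genotype and keep the better-fitting one."""
    as_diseased = clustering_entropy(train + [(genotype, 1)])
    as_healthy = clustering_entropy(train + [(genotype, 0)])
    return 1 if as_diseased < as_healthy else 0  # ties go to non-diseased


TRAIN = [("01", 1), ("01", 1), ("10", 0), ("10", 0)]
print(model_fitting_predict(TRAIN, "01"), model_fitting_predict(TRAIN, "10"))  # 1 0
```

Any clustering score with the same call shape can be passed as `clustering_entropy`, so a search over MSC-defined clusters could replace the stand-in.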
27. Results of Prediction Methods
Leave-One-Out Cross Validation
- Leave-one-out cross-validation results for combinatorial
search-based prediction (CSP) and complimentary
greedy search-based prediction (CGSP) are given
when 20, 30, or all SNPs are chosen as
informative SNPs.
28. ROC curve
- Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs
- The area under the CSP curve is 0.81 vs. 0.52 under the SVM curve
29. Conclusions
- Combinatorial search is able to find statistically significant multi-gene interactions for data where no significant association was detected before
- Complimentary greedy search can be used in susceptibility prediction
- Optimization approach to prediction
- The new susceptibility prediction is 15% higher than the best previously known
- MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility
30. Thank You!
- Poster 14
  - Case(Control)-Free Multi-SNP Combinations in Case-Control Studies, Algorithmic Biology 2006
- Paper
  - Combinatorial Methods for Disease Association Search and Susceptibility Prediction, WABI 2006