Title: Disease Association Search and Susceptibility Prediction Algorithms
1Disease Association Search and Susceptibility
Prediction Algorithms
2Outline
- Introduction
- SNPs, Haplotypes, Genotypes
- Genetic Epidemiology
- Case/Control Study
- Risk/Resistance factors
- Significance of Risk/Resistance Factors
- Multiple-Testing Adjustment
- Disease Association Search
- Disease Susceptibility Prediction
3SNPs, Haplotypes, Genotypes
- Human genome: all genetic material in the chromosomes (3 × 10^9 base pairs).
- Differences between any two people occur in 0.1% of the genome.
- SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population (≈ 3 × 10^6 SNPs)
  - mostly biallelic
- Diploid: two different copies of each chromosome
- Haplotype: description of a single copy (expensive)
  - (notation: 0 is for the major allele, 1 is for the minor allele)
- Genotype: entire genetic identity of an individual
  - mixture of two haplotypes
  - (notation: 0, 1 is for a homozygote, 2 is for a heterozygote)
4Genetic Epidemiology
- Genetic epidemiology searches for genetic risk factors of diseases.
- Monogenic disease
  - A mutated gene is entirely responsible for the disease.
  - Typically rare in the population: < 0.1%.
  - Practically all cases are already reported
- Complex disease
  - Interaction of multiple non-linked genes
  - 2-SNP analysis vs. one-by-one SNP analysis
  - Multiple independent causes
  - Each cause can be the result of an interaction of several genes
  - Each cause explains < 10-20% of cases
  - Common diseases are mostly complex diseases: > 0.1%.
5Case/Control study
Given a population of n genotypes, each containing the values of m SNPs and a disease status.
SNPs:
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120
Disease status:
0 0 0 0 1 1 1 1
[Figure labels: Case genotypes, Control genotypes]
Disease association analysis searches for risk (resistance) factors of a disease.
6Risk/Resistance factors
- One SNP with a fixed allele value
0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 1 1 0 2 control
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
The third SNP with fixed allele value 1 is a risk factor: present in 4 cases, 2 controls.
- Multi-SNP combination (MSC): a subset of SNPs with fixed values
0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 1 1 0 2 control
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
x x 1 x x 2 x x x MSC
The MSC is present in 3 cases, 1 control.
- Cluster(C): the subset of genotypes that share the same MSC C; d(C) is the number of cases in cluster(C), h(C) is the number of controls in cluster(C).
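A minimal Python sketch of the two examples on this slide. It assumes genotypes are stored as tuples of 0/1/2 values and an MSC as a dict from 0-based SNP index to fixed value; both representations and all function names are illustrative choices, not part of the slides.

```python
# Genotype rows copied from the slide (4 cases, 3 controls, 9 SNPs each).
CASES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
]
CONTROLS = [
    (0, 0, 1, 0, 1, 1, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]

def matches(genotype, msc):
    """True if the genotype carries the fixed value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())

def cluster_counts(msc, cases, controls):
    """Return (d(C), h(C)): matching cases and matching controls."""
    d = sum(1 for g in cases if matches(g, msc))
    h = sum(1 for g in controls if matches(g, msc))
    return d, h

# Third SNP (index 2) fixed to 1, then the MSC "x x 1 x x 2 x x x".
print(cluster_counts({2: 1}, CASES, CONTROLS))        # (4, 2)
print(cluster_counts({2: 1, 5: 2}, CASES, CONTROLS))  # (3, 1)
```

The printed counts agree with the "4 cases, 2 controls" and "3 cases, 1 control" figures on the slide.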
7Significance of Risk/Resistance Factors
- Measured by
- Relative risk (RR): the ratio of the probability of the event occurring in the cases versus the controls
- Odds ratio (OR): compares whether the probability of a certain event is the same for two groups
- P-value: the probability of obtaining a case/control distribution at least as extreme as the observed one among individuals exposed to the risk factor, assuming the null hypothesis (the distribution happened by chance)
- Unadjusted p-value (computed from the binomial distribution)
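As a hedged illustration of a binomial p-value computation: the sketch below assumes the p-value is the upper tail of a Binomial(d + h, p) distribution, where d and h are the cases and controls exposed to the factor and p is the fraction of cases in the whole sample. This reading, and the function name, are assumptions rather than the formula from the slides.

```python
from math import comb

def binomial_tail_p_value(d, h, total_cases, total_controls):
    """Probability of seeing at least d cases among d + h exposed individuals
    if each exposed individual were a case with probability p (assumed null)."""
    n = d + h
    p = total_cases / (total_cases + total_controls)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(d, n + 1))

# Factor present in 3 cases and 1 control, in a sample of 4 cases and 3 controls.
print(binomial_tail_p_value(3, 1, 4, 3))
```

With a sample this small the tail probability is large (about 0.43), so such a factor would be far from significant.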
8Multiple-testing adjustment
- Bonferroni
  - easy to compute
  - overly conservative
  - reported SNP is among 100 tests => p-value × 100
- Randomization
  - Randomly permute the disease status => 10000 samples
  - Apply the searching methods to each sample and get MSCs
  - Count the MSCs whose unadjusted p-value < the observed p-value
  - If this count < 500 => the MSC is significant
  - computationally expensive
  - more accurate
- Statistically significant: adjusted p-value < 0.05
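Both adjustments can be sketched in Python. The search step here is a stand-in that scans single SNPs only, and the binomial-tail p-value is an assumed form; neither is the search method or formula from the slides. The comparison uses <= (a common conservative convention) where the slide counts strictly smaller p-values.

```python
import random
from math import comb

def tail_p(d, h, p_case):
    """Assumed unadjusted p-value: upper tail of Binomial(d + h, p_case)."""
    n = d + h
    return sum(comb(n, k) * p_case**k * (1 - p_case) ** (n - k) for k in range(d, n + 1))

def best_single_snp_p(genotypes, status):
    """Stand-in search: smallest p-value over all (SNP, value) pairs."""
    p_case = sum(status) / len(status)
    best = 1.0
    for s in range(len(genotypes[0])):
        for v in (0, 1, 2):
            d = sum(1 for g, st in zip(genotypes, status) if g[s] == v and st == 1)
            h = sum(1 for g, st in zip(genotypes, status) if g[s] == v and st == 0)
            if d + h > 0:
                best = min(best, tail_p(d, h, p_case))
    return best

def bonferroni(p_value, n_tests):
    """Multiply by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def permutation_adjusted_p(genotypes, status, n_perm=1000, seed=0):
    """Fraction of status permutations whose best p-value is <= the observed one."""
    rng = random.Random(seed)
    observed = best_single_snp_p(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_single_snp_p(genotypes, shuffled) <= observed:
            hits += 1
    return observed, hits / n_perm

GENOTYPES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2), (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1), (0, 1, 1, 1, 1, 2, 0, 0, 1),
    (0, 0, 1, 0, 1, 1, 1, 0, 2), (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]
STATUS = [1, 1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control

observed, adjusted = permutation_adjusted_p(GENOTYPES, STATUS, n_perm=200)
print(observed, adjusted, bonferroni(observed, 27))  # 27 = 9 SNPs x 3 values
```

The slides use 10000 permutations; 200 is used here only to keep the example fast.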
9Outline
- Introduction
- Disease Association Search
- Disease Association Problem
- Exhaustive and Combinatorial Searches
- Maximum Control-Free Cluster
- Complimentary Greedy Algorithm
- Complimentary Greedy Search
- CGS Results
- Future
- Disease Susceptibility Prediction
10Disease Association Search
Problem
- Given: case/control study data consisting of n genotypes (haplotypes), each containing the values of m SNPs and a disease status (case or control)
- Find: (all) risk/resistance factors (MSCs) with a multiple-testing-adjusted p-value below 0.05
11Exhaustive Combinatorial Searches
- Exhaustive search (ES)
  - complete (infeasible)
  - a sample with n genotypes and m SNPs requires O(n·3^m)
- Combinatorial search (CS)
  - The case-closure of an MSC C is an MSC C' (with the maximum number of SNPs with fixed values) whose cluster consists of the same set of cases as C and the minimum number of control individuals from C.
  - Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all cases.
  - Searches only among closed clusters
  - Closure of cluster(C) = cluster(C')
  - d(C') = d(C) and h(C') is minimized
  - Avoids checking trivial MSCs
  - faster than ES, but still too slow for large data
- Tagging (indexing): multiple regression method
  - ES and CS find more statistically significant MSCs on indexed data.
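The case-closure step can be sketched as follows, assuming an MSC is a dict from 0-based SNP index to fixed value and genotypes are tuples (illustrative representations and names, not from the slides). The MSC is extended with every SNP that takes one common value across all cases it matches.

```python
CASES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
]

def matches(genotype, msc):
    """True if the genotype carries the fixed value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())

def case_closure(msc, cases):
    """Fix every SNP whose value is shared by all cases matching the MSC."""
    matching = [g for g in cases if matches(g, msc)]
    closed = dict(msc)
    if not matching:
        return closed  # no cases in the cluster: nothing to extend
    for pos in range(len(matching[0])):
        values = {g[pos] for g in matching}
        if len(values) == 1:
            closed[pos] = values.pop()
    return closed

print(case_closure({2: 1}, CASES))        # {2: 1, 0: 0}
print(case_closure({2: 1, 5: 2}, CASES))  # adds SNPs 0, 1 and 7
```

Adding SNPs that all matching cases share cannot lose a case, so d stays the same while h can only shrink.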
12Maximum Control-Free Cluster
- Maximum Control-Free Cluster Problem
  - Given: case/control study
  - Find: cluster(C) that does not contain controls and has the maximum number of cases. It is the maximum control-free cluster.
- Maximal control-free cluster => risk factor
- Maximal disease-free cluster => resistance factor
- Complexity
  - Includes the maximum independent set problem
  - NP-complete
- However,
  - Sample S is not arbitrary
13Complimentary Greedy Algorithm
- Algorithm
  - Start with C <- S
  - Repeat while h(C) > 0 (until C is control-free)
  - For each SNP s with value i find (C ∩ s: the genotypes of C in which SNP s has value i)
  - Δh = h(C) - h(C ∩ s)
  - Δd = d(C) - d(C ∩ s)
  - Find the SNP (s, i) minimizing Δd/Δh
  - Add s to the MSC
- Min vertex cover: picking and removing vertices of maximum degree until no edges are left
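A minimal Python sketch of the greedy loop above, run on the seven genotypes from the risk-factor slide. The data layout, the function name, and the tie-breaking rule (prefer the choice that removes more controls) are assumptions for illustration.

```python
CASES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2),
    (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1),
    (0, 1, 1, 1, 1, 2, 0, 0, 1),
]
CONTROLS = [
    (0, 0, 1, 0, 1, 1, 1, 0, 2),
    (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]

def greedy_control_free_cluster(cases, controls):
    """Fix one (SNP, value) at a time, each time minimizing
    (cases removed) / (controls removed), until no controls remain."""
    cases, controls = list(cases), list(controls)
    m = len((cases or controls)[0])
    msc = {}
    while controls:
        best = None
        for s in range(m):
            if s in msc:
                continue
            for i in (0, 1, 2):
                kept_cases = [g for g in cases if g[s] == i]
                kept_controls = [g for g in controls if g[s] == i]
                delta_d = len(cases) - len(kept_cases)
                delta_h = len(controls) - len(kept_controls)
                if delta_h == 0:
                    continue  # removes no control, so it makes no progress
                key = (delta_d / delta_h, -delta_h)  # assumed tie-break
                if best is None or key < best[0]:
                    best = (key, s, i, kept_cases, kept_controls)
        if best is None:
            break  # remaining controls cannot be separated
        _, s, i, cases, controls = best
        msc[s] = i
    return msc, cases, controls

msc, cluster_cases, cluster_controls = greedy_control_free_cluster(CASES, CONTROLS)
print(msc, len(cluster_cases), len(cluster_controls))  # {2: 1, 8: 1} 3 0
```

On this toy sample the loop first fixes the third SNP to 1 (no cases lost, one control removed) and then the last SNP to 1, ending with a control-free cluster of 3 cases.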
14Complimentary Greedy Search (CGS)
- Algorithm (covering)
  - Start with the empty MSC, which is present in all genotypes
  - Find the SNP with an allele value that defines a set of genotypes with the highest ratio of controls over cases (max(controls/cases))
  - Remove this set of genotypes
  - Add the SNP to the resulting MSC
  - Repeat the find, remove, and add steps until all controls are removed
  - Output the resulting MSC
- Adjust the p-value of the resulting MSC for multiple testing
[Figure labels: Cases, Controls]
15CGS Results
- CGS finds MSCs with non-trivially high association on real data
- CGS finds more significant MSCs on the full dataset than CS finds on indexed data, in a reasonable amount of time
16Future
- Clustering algorithm
  - Instead of removing the cluster found for the maximum control-free cluster problem, redefine the controls in the cluster as cases (the redefinition is visible only for controls in the sample).
[Figure labels: Cases, Controls]
17Outline
- Introduction
- Disease Association Search
- Disease Susceptibility Prediction
- Disease Susceptibility Prediction Problem
- Cross-Validation Tests
- Quality Measures of Prediction
- Prediction Methods
- Optimum Disease Clustering Problem
- From Clustering to Prediction
- Leave-One-Out Results
- CDC Algorithm
- Experiments
- Plans
18Disease Susceptibility Prediction Problem
- Given: a case/control study and the genotype of a testing individual t
- Find: the disease status of the testing individual
SNPs:
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120
Disease status:
0 0 0 0 1 1 1 1
[Figure labels: Case genotypes, Control genotypes]
Testing genotype:
0110211101211201 -> disease status ?
19Cross-validation tests
- Leave-one-out test
  - The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set

Real Disease Status   Predicted Disease Status   Genotype
0                     0                          0101201020102210
0                     0                          0220110210120021
0                     1                          0200120012221110
1                     1                          0020011002212101
1                     1                          0020011002212101
Accuracy: 80%

- Leave-many-out test
  - Repeatedly pick 2/3 of the population at random as the training set and predict the other 1/3
20Quality Measures of Prediction
- Sensitivity: the ability to correctly detect cases.
  - Sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling a control a case.
  - Specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk rate: measurements for risk factors
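As a worked example, the three measures can be computed in Python for the five real/predicted statuses shown on the leave-one-out slide (1 = case, 0 = control). The function name is illustrative.

```python
def quality_measures(real, predicted):
    """Return (sensitivity, specificity, accuracy) for 0/1 status lists."""
    tp = sum(1 for r, p in zip(real, predicted) if r == 1 and p == 1)
    tn = sum(1 for r, p in zip(real, predicted) if r == 0 and p == 0)
    fp = sum(1 for r, p in zip(real, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(real, predicted) if r == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy

REAL = [0, 0, 0, 1, 1]
PREDICTED = [0, 0, 1, 1, 1]
sensitivity, specificity, accuracy = quality_measures(REAL, PREDICTED)
print(sensitivity, specificity, accuracy)
```

Here TP = 2, FN = 0, FP = 1, TN = 2, so sensitivity is 1, specificity is 2/3, and accuracy is 0.8, matching the 80% on the slide.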
21Prediction Methods
- Support Vector Machine
- Random Forest
- LP-based prediction
- Drawback of the prediction problem formulation
  - need for cross-validation => no optimization
22Optimum Disease Clustering Problem
- Given: case/control study S
- Find: a partition P of S into clusters S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P) for a given bound on the number of individuals who are assigned an incorrect status in the clusters of partition P
23From Clustering to Prediction
- Intuition
  - If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy
- Model-Fitting Prediction Algorithm
  - Set the status of testing genotype t to case
  - Find the optimum clustering P0 of the dataset S ∪ {t}
  - Set the status of testing genotype t to control
  - Find the optimum clustering P1 of the dataset S ∪ {t}
  - Find the clustering that better fits the model (has smaller entropy), and predict the status accordingly
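The try-both-labels structure of the model-fitting algorithm can be sketched in Python. The scoring function below is a hypothetical stand-in, not the optimum clustering or the entropy(P) of the slides: it splits the samples by the value of a single SNP and takes the size-weighted label entropy of the best such split.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def stand_in_score(samples):
    """Hypothetical objective: best single-SNP split, weighted label entropy."""
    n = len(samples)
    best = float("inf")
    for s in range(len(samples[0][0])):
        groups = {}
        for genotype, status in samples:
            groups.setdefault(genotype[s], []).append(status)
        best = min(best, sum(len(g) / n * label_entropy(g) for g in groups.values()))
    return best

def model_fitting_predict(train, t, score=stand_in_score):
    """Label t as case (1) and as control (0); keep the better-fitting label.
    Ties go to control, an arbitrary choice made for this sketch."""
    as_case = score(train + [(t, 1)])
    as_control = score(train + [(t, 0)])
    return 1 if as_case < as_control else 0

# Toy training set in which the status equals the value of the first SNP.
TRAIN = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
print(model_fitting_predict(TRAIN, (1, 0)))  # 1
print(model_fitting_predict(TRAIN, (0, 1)))  # 0
```

Any objective with the same signature can be passed as `score`, which is where a real clustering procedure would plug in.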
24Leave-One-Out Results
- Leave-one-out cross validation results of four
prediction methods for three real data sets.
Results of combinatorial search-based prediction
(CSP) and complimentary greedy search-based
prediction (CGSP) are given when 20, 30, or all
SNPs are chosen as informative SNPs.
25CDC Algorithm (B.N. Goertzel, Combinations of SNPs in neuroendocrine effector and receptor genes predict chronic fatigue syndrome, Pharmacogenomics (2006), 7(3))
- Find the best pattern strength classifier for training sample S
  - All subsets of SNPs with cardinality less than k are potential rules.
  - For each potential rule
    - evaluate each genotype g from S
      - for each SNP in the potential rule
        - if this SNP has value 0 or 1 in g => add 2 to a sum
        - if the value is 2 => add 1
    - set the threshold
      - (number of cases with sum < threshold) + (number of controls with sum > threshold) -> min
    - compute the accuracy
  - Pattern strength classifier: the potential rule with the maximum accuracy.
- Predict the status of the tested individual
  - compute the sum for the tested individual
    - for each SNP in the pattern strength classifier
      - if this SNP has value 0 or 1 => add 2 to a sum
      - if the value is 2 => add 1
  - if the sum is less than the threshold => control, otherwise => case.
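The steps above can be sketched in Python as an exhaustive search over SNP subsets. Function names and the data layout are illustrative, and one detail is an assumption: a control whose sum equals the threshold is counted as an error, so that training agrees with the prediction rule (sum below the threshold means control, otherwise case).

```python
from itertools import combinations

CASES = [
    (0, 1, 1, 0, 1, 2, 1, 0, 2), (0, 1, 1, 1, 0, 2, 0, 0, 1),
    (0, 0, 1, 0, 0, 0, 0, 2, 1), (0, 1, 1, 1, 1, 2, 0, 0, 1),
]
CONTROLS = [
    (0, 0, 1, 0, 1, 1, 1, 0, 2), (0, 1, 0, 0, 1, 1, 0, 0, 2),
    (0, 1, 1, 0, 1, 2, 0, 0, 2),
]

def pattern_sum(genotype, rule):
    """Add 2 for each rule SNP with value 0 or 1, and 1 for value 2."""
    return sum(2 if genotype[s] in (0, 1) else 1 for s in rule)

def best_threshold(case_sums, control_sums):
    """Threshold with the fewest training errors; returns (errors, threshold)."""
    candidates = sorted(set(case_sums) | set(control_sums))
    candidates.append(candidates[-1] + 1)  # "everything is a control"
    scored = [
        (sum(x < t for x in case_sums) + sum(x >= t for x in control_sums), t)
        for t in candidates
    ]
    return min(scored)

def train_cdc(cases, controls, k):
    """Try every SNP subset with cardinality less than k; keep the most accurate."""
    total = len(cases) + len(controls)
    best = None  # (accuracy, rule, threshold)
    for size in range(1, k):
        for rule in combinations(range(len(cases[0])), size):
            errors, t = best_threshold(
                [pattern_sum(g, rule) for g in cases],
                [pattern_sum(g, rule) for g in controls],
            )
            accuracy = 1 - errors / total
            if best is None or accuracy > best[0]:
                best = (accuracy, rule, t)
    return best

def predict_cdc(genotype, rule, threshold):
    """Sum below the threshold => control (0), otherwise => case (1)."""
    return 0 if pattern_sum(genotype, rule) < threshold else 1

accuracy, rule, threshold = train_cdc(CASES, CONTROLS, k=2)
print(accuracy, rule, threshold)
```

On the seven toy genotypes reused from the risk-factor slide, with k = 2 the search selects the last SNP with threshold 2 and misclassifies one case. The nested loop over all subsets is what makes the method slow as k grows.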
26 Experiments
- Different masks for the sum in evaluation.
- Experiments with swapping SNP values if a SNP is more associated with controls.
27Plans
- Finish leave-one-out for all sum masks.
- The CDC method is slow
  - The CDC method exhaustively searches for the best pattern strength classifier
  - Is there a way to do it greedily?