Disease Association Search and Susceptibility Prediction Algorithms

About This Presentation

Title:

Disease Association Search and Susceptibility Prediction Algorithms

Description:

Find the best pattern strength classifier for training sample S: ... pattern strength classifier = potential rule with the maximum accuracy. ...

Slides: 28

Provided by: Dum96

Transcript and Presenter's Notes

Title: Disease Association Search and Susceptibility Prediction Algorithms

1
Disease Association Search and Susceptibility
Prediction Algorithms

Irina Astrovskaya

2
Outline

Introduction
SNPs, Haplotypes, Genotypes
Genetic Epidemiology
Case/Control Study
Risk/Resistance factors
Significance of Risk/Resistance Factors
Multiple-Testing Adjustment
Disease Association Search
Disease Susceptibility Prediction

3
SNPs, Haplotypes, Genotypes

Human genome: all genetic material in the chromosomes (3 × 10^9 base pairs).
Differences between any two people occur in 0.1% of the genome.
SNP: single nucleotide polymorphism, a site where two or more different nucleotides occur in a large percentage of the population (≈ 3 × 10^6 SNPs), mostly biallelic.
Diploid: two different copies of each chromosome.
Haplotype: description of a single copy (expensive to obtain)
(notation: 0 is for the major allele, 1 is for the minor allele).
Genotype: entire genetic identity of an individual, a mixture of two haplotypes
(notation: 0 and 1 are for homozygotes, 2 is for heterozygotes).
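
The 0/1/2 genotype notation can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` and the string encoding are assumptions made for this example, not part of the presentation.

```python
def genotype_from_haplotypes(hap_a, hap_b):
    """Combine two haplotypes (strings of 0/1) into one genotype string.

    Per SNP: both copies 0 -> 0, both copies 1 -> 1 (homozygotes),
    copies differ -> 2 (heterozygote).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


print(genotype_from_haplotypes("0110", "0101"))  # prints 0122
```

The mapping is not invertible: a genotype with several 2s is consistent with more than one haplotype pair, which is one reason haplotypes are described as expensive to obtain.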

4
Genetic Epidemiology

Genetic epidemiology searches for genetic risk factors of diseases.
Monogenic disease
A mutated gene is entirely responsible for the disease.
Typically rare in the population: < 0.1%.
Practically all cases are already reported.
Complex disease
Interaction of multiple non-linked genes
2-SNP analysis vs. one-by-one SNP analysis
Multiple independent causes
Each cause can be the result of an interaction of several genes
Each cause explains < 10-20% of cases
Common diseases are mostly complex diseases: > 0.1%.

5
Case/Control study
Given: a population of n genotypes, each containing values of m SNPs and a disease status.

SNPs (one genotype per row):
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120
Disease status: 0 0 0 0 1 1 1 1
(Figure labels: case genotypes, control genotypes.)

Disease association analysis searches for risk (resistance) factors of a disease.
6
Risk/Resistance factors

Single SNP with a fixed allele value

SNP values          Status
0 1 1 0 1 2 1 0 2   case
0 1 1 1 0 2 0 0 1   case
0 0 1 0 0 0 0 2 1   case
0 1 1 1 1 2 0 0 1   case
0 0 1 0 1 1 1 0 2   control
0 1 0 0 1 1 0 0 2   control
0 1 1 0 1 2 0 0 2   control

The third SNP with fixed allele value 1 is a risk factor: present in 4 cases and 2 controls.

Multi-SNP combination (MSC): a subset of SNPs with fixed values

MSC: x x 1 x x 2 x x x (x = value not fixed)
In the same seven genotypes, this MSC is present in 3 cases and 1 control.

Cluster(C): the subset of genotypes that share the same MSC C;
d(C) = number of cases in cluster(C), h(C) = number of controls in cluster(C).
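
The counts d(C) and h(C) for this slide's example can be checked with a short Python sketch. The representation of an MSC as a dict from 0-based SNP index to fixed value, and the names `SAMPLE`, `matches`, and `cluster_counts`, are assumptions made for illustration.

```python
# The seven genotypes from the slide, written without spaces.
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def matches(genotype, msc):
    """True if the genotype carries every fixed value of the MSC.

    `msc` maps a 0-based SNP index to its fixed value, e.g. {2: "1"}.
    """
    return all(genotype[snp] == value for snp, value in msc.items())


def cluster_counts(sample, msc):
    """Return (d, h): the numbers of cases and controls in the MSC's cluster."""
    d = sum(1 for g, status in sample if status == "case" and matches(g, msc))
    h = sum(1 for g, status in sample if status == "control" and matches(g, msc))
    return d, h


print(cluster_counts(SAMPLE, {2: "1"}))          # third SNP = 1  -> (4, 2)
print(cluster_counts(SAMPLE, {2: "1", 5: "2"}))  # x x 1 x x 2 x x x -> (3, 1)
```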
7
Significance of Risk/Resistance Factors

Measured by
Relative risk (RR): the ratio of the probability of an event occurring in the cases versus the controls
Odds ratio (OR): compares whether the probability of a certain event is the same for two groups
P-value: the probability of obtaining a case/control distribution at least as extreme as the observed one among those exposed to the risk factor, assuming the null hypothesis (the distribution happened by chance)
Unadjusted p-value: computed from the binomial distribution
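
One way to compute a binomial p-value for a cluster is sketched below. This is an assumption about the intended formula, not the presentation's own: under the null hypothesis each of the d + h genotypes in the cluster is taken to be a case with probability equal to the fraction of cases in the whole sample, and the p-value is the upper tail at d. The function and parameter names are hypothetical.

```python
from math import comb


def unadjusted_p_value(d, h, total_cases, total_controls):
    """Upper binomial tail: P(at least d cases among d + h cluster members).

    Assumed null model: each cluster member is a case with probability
    q = total_cases / (total_cases + total_controls).
    """
    q = total_cases / (total_cases + total_controls)
    k = d + h
    return sum(comb(k, j) * q**j * (1 - q) ** (k - j) for j in range(d, k + 1))


# Cluster with 3 cases and 1 control in a sample of 4 cases and 3 controls.
p = unadjusted_p_value(d=3, h=1, total_cases=4, total_controls=3)
print(p)
```

With a sample this small the p-value is far from significant; the sketch is only meant to make the tail sum concrete.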
8
Multiple-testing adjustment

Bonferroni
easy to compute
overly conservative
reported SNP is among 100 tests => p-value × 100
Randomization
Randomly permute the disease status => 10000 samples
Apply the searching methods to each sample and get MSCs
Count the # of MSCs whose unadjusted p-value < the observed p-value
If this # < 500 => the MSC is significant
computationally expensive
more accurate
Statistically significant: adjusted p-value < 0.05
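
Both adjustments can be sketched in a few lines of Python. The callback `best_p_value`, which stands for whatever search method returns the best unadjusted p-value on a sample, is a hypothetical placeholder; the names and the fixed seed are assumptions for this example.

```python
import random


def bonferroni(p_value, num_tests):
    """Multiply the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * num_tests)


def randomization_adjusted(observed_p, genotypes, statuses, best_p_value,
                           num_permutations=10000, seed=0):
    """Fraction of status permutations whose best p-value beats the observed one.

    `best_p_value(genotypes, statuses)` is a hypothetical search routine
    supplied by the caller; it is rerun on every permuted sample.
    """
    rng = random.Random(seed)
    shuffled = list(statuses)
    hits = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # permute the disease status only
        if best_p_value(genotypes, shuffled) < observed_p:
            hits += 1
    return hits / num_permutations
```

With 10000 permutations, fewer than 500 hits corresponds to an adjusted p-value below 0.05, which matches the threshold on the slide. The cost is one full search per permutation, which is why the slide calls the method computationally expensive.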

9
Outline

Introduction
Disease Association Search
Disease Association Problem
Exhaustive and Combinatorial Searches
Maximum Control-Free Cluster
Complimentary Greedy Algorithm
Complimentary Greedy Search
CGS Results
Future
Disease Susceptibility Prediction

10
Disease Association Search
Problem
Given: case/control study data consisting of n genotypes (haplotypes), each containing values of m SNPs and a disease status (case or control)
Find: (all) risk/resistance factors (MSCs) with a multiple-testing-adjusted p-value below 0.05
11
Exhaustive and Combinatorial Searches

Exhaustive search (ES)
complete (infeasible)
a sample with n genotypes and m SNPs requires O(n 3^m)
Combinatorial search (CS)
Case-closure of an MSC C is an MSC C' (with the maximum number of SNPs with fixed values) that contains the same set of cases as C and the minimum number of control individuals from C.
Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all cases.
Searches only among closed clusters
Closure of cluster(C) = cluster(C'): d(C') = d(C) and h(C') is minimized
Avoids checking trivial MSCs
faster than ES, but still too slow for large data
Tagging (indexing): multiple regression method
ES and CS find more statistically significant MSCs on indexed data.
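
The case-closure step (extend the MSC with every SNP whose value is common to all of its cases) can be sketched as follows. The dict representation of an MSC and the name `case_closure` are assumptions; the data are the seven genotypes from the Risk/Resistance factors slide.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def case_closure(sample, msc):
    """Extend an MSC with every SNP whose value is shared by all its cases.

    `msc` maps a 0-based SNP index to a fixed value. The set of cases is
    unchanged by construction; controls can only be dropped, never added.
    """
    cases = [g for g, status in sample
             if status == "case" and all(g[s] == v for s, v in msc.items())]
    closed = dict(msc)
    if not cases:
        return closed
    for snp in range(len(cases[0])):
        values = {g[snp] for g in cases}
        if len(values) == 1:  # all cases agree on this SNP
            closed[snp] = values.pop()
    return closed


print(case_closure(SAMPLE, {2: "1", 5: "2"}))
```

For the MSC x x 1 x x 2 x x x, the three matching cases also agree on SNPs 1, 2, and 8 (1-based), so the closure fixes five SNPs while keeping the same three cases.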

12
Maximum Control-Free Cluster

Maximum Control-Free Cluster Problem
Given: a case/control study
Find: a cluster(C) that does not contain controls and has the maximum number of cases. This is the maximum control-free cluster.
Maximal control-free cluster: risk factor
Maximal disease-free cluster: resistance factor
Complexity
Includes the maximum independent set problem
NP-complete
However, sample S is not arbitrary

13
Complimentary Greedy Algorithm

Algorithm
Start with C <- S
Repeat until h(C) = 0 (control-free)
For each SNP s with value i, find
Δh = h(C) - h(C ∩ s)
Δd = d(C) - d(C ∩ s)
Find the SNP (s, i) minimizing Δd/Δh
Add s to the MSC
Analogy: greedy min vertex cover, picking and removing vertices of maximum degree until no edges are left
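
A minimal Python sketch of this greedy loop is given below. It is one reading of the slide, under stated assumptions: Δh and Δd are taken as the controls and cases removed from the current cluster when the SNP value (s, i) is fixed, candidates that remove no controls or all remaining cases are skipped, and ties go to the first candidate found. The data are the seven genotypes from the Risk/Resistance factors slide.

```python
SAMPLE = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001011102", "control"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def complementary_greedy(sample):
    """Greedily fix SNP values until the cluster holds no controls.

    Returns (msc, cluster), where msc maps 0-based SNP index -> value.
    """
    cluster = list(sample)
    msc = {}
    num_snps = len(sample[0][0])
    while any(status == "control" for _, status in cluster):
        cases_left = sum(1 for _, s in cluster if s == "case")
        best = None  # (ratio, snp, value)
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in sorted({g[snp] for g, _ in cluster}):
                removed = [s for g, s in cluster if g[snp] != value]
                dh = removed.count("control")  # controls removed
                dd = removed.count("case")     # cases removed
                if dh == 0 or dd == cases_left:
                    continue  # no progress, or no cases would remain
                if best is None or dd / dh < best[0]:
                    best = (dd / dh, snp, value)
        if best is None:
            break  # remaining controls cannot be separated from the cases
        _, snp, value = best
        msc[snp] = value
        cluster = [(g, s) for g, s in cluster if g[snp] == value]
    return msc, cluster


msc, cluster = complementary_greedy(SAMPLE)
print(msc, len(cluster))
```

On this sample the sketch first fixes the third SNP to 1 (removing one control and no cases), then the ninth SNP to 1, leaving a control-free cluster of three cases.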

14
Complimentary Greedy Search (CGS)

Algorithm (covering)
Start with the empty MSC, which is present in all genotypes
Find the SNP with an allele value that defines the set of genotypes with the highest ratio of controls over cases (max(controls/cases))
Remove this set
Add the SNP to the resulting MSC
Repeat the find, remove, and add steps until all controls are removed
Output the resulting MSC
Adjust the p-value of the resulting MSC for multiple testing

(Figure labels: cases, controls.)
15
CGS Results

CGS finds MSCs with non-trivially high association on real data
CGS finds more significant MSCs on the full dataset than CS does on indexed data, in a reasonable amount of time

16
Future

Clustering algorithm
Instead of removing the cluster found for the maximum control-free cluster problem, redefine the controls in the cluster as cases (the redefinition is visible only for controls in the sample).

(Figure labels: cases, controls.)
17
Outline

Introduction
Disease Association Search
Disease Susceptibility Prediction
Disease Susceptibility Prediction Problem
Cross-Validation Tests
Quality Measures of Prediction
Prediction Methods
Optimum Disease Clustering Problem
From Clustering to Prediction
Leave-One-Out Results
CDC Algorithm
Experiments
Plans

18
Disease Susceptibility Prediction Problem

Given: a case/control study and the genotype t of a testing individual
Find: the disease status of the testing individual

SNPs (one genotype per row):
0101201020102210
0220110210120021
0200120012221110
0020011002212101
1101202020100110
0120120010100011
0210220002021112
0021011000212120
Disease status: 0 0 0 0 1 1 1 1
(Figure labels: case genotypes, control genotypes.)
Testing genotype t: 0110211101211201, disease status: ?
19
Cross-validation tests

Leave-one-out test
The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.

Genotype           Real disease status   Predicted disease status
0101201020102210   0                     0
0220110210120021   0                     0
0200120012221110   0                     1
0020011002212101   1                     1
0020011002212101   1                     1
Accuracy: 80%

Leave-many-out test
Repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3.
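
The leave-one-out test can be sketched as a loop around any predictor. The nearest-neighbor predictor below is only a toy stand-in so the example runs; it is not one of the methods in the presentation, and the four genotypes are made-up data.

```python
def leave_one_out_accuracy(genotypes, statuses, predict):
    """Predict each genotype's status from all the others; return accuracy."""
    correct = 0
    for i in range(len(genotypes)):
        train_g = genotypes[:i] + genotypes[i + 1:]
        train_s = statuses[:i] + statuses[i + 1:]
        if predict(train_g, train_s, genotypes[i]) == statuses[i]:
            correct += 1
    return correct / len(genotypes)


def nearest_neighbor(train_g, train_s, test_g):
    """Toy stand-in predictor: status of the closest training genotype."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    best = min(range(len(train_g)), key=lambda j: distance(train_g[j], test_g))
    return train_s[best]


genotypes = ["0011", "0012", "2200", "2201"]  # made-up data
statuses = [0, 0, 1, 1]
print(leave_one_out_accuracy(genotypes, statuses, nearest_neighbor))  # 1.0
```

A leave-many-out test would follow the same pattern, replacing the single held-out index with a random third of the indices and repeating the split several times.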

20
Quality Measures of Prediction

Sensitivity: the ability to correctly detect cases.
Sensitivity = TP / (TP + FN)
Specificity: the ability to avoid calling a control a case.
Specificity = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Risk rate: measurements for risk factors
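
These measures can be computed directly from real and predicted statuses. The sketch below assumes status 1 denotes a case (the presentation does not state the coding explicitly) and uses the five real/predicted pairs from the leave-one-out example; the function name is hypothetical.

```python
def prediction_quality(real, predicted, case=1):
    """Sensitivity, specificity, and accuracy from paired status lists.

    Assumes at least one case and one control in `real`; otherwise a
    denominator is zero.
    """
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r == case and p == case)
    tn = sum(1 for r, p in pairs if r != case and p != case)
    fp = sum(1 for r, p in pairs if r != case and p == case)
    fn = sum(1 for r, p in pairs if r == case and p != case)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


quality = prediction_quality(real=[0, 0, 0, 1, 1], predicted=[0, 0, 1, 1, 1])
print(quality)
```

Here TP = 2, TN = 2, FP = 1, and FN = 0, so the accuracy is 4/5, in agreement with the 80% reported for the leave-one-out example.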

21
Prediction Methods

Support Vector Machine
Random Forest
LP-based prediction

Drawback of the prediction problem formulation:
need for cross-validation => no optimization

22
Optimum Disease Clustering Problem

Given: case/control study S
Find: a partition P of S into clusters, S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P) for a given bound on the number of individuals who are assigned an incorrect status in the clusters of partition P

23
From Clustering to Prediction

Intuition
If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy.
Model-Fitting Prediction Algorithm
Set the status of the testing genotype t to case
Find the optimum clustering P0 of the dataset S ∪ {t}
Set the status of the testing genotype t to control
Find the optimum clustering P1 of the dataset S ∪ {t}
Find the clustering that better fits the model (has smaller entropy), and predict the status accordingly

24
Leave-One-Out Results

Leave-one-out cross validation results of four
prediction methods for three real data sets.
Results of combinatorial search-based prediction
(CSP) and complimentary greedy search-based
prediction (CGSP) are given when 20, 30, or all
SNPs are chosen as informative SNPs.

25
CDC Algorithm (B.N. Goertzel, "Combinations of SNPs in neuroendocrine effector and receptor genes predict chronic fatigue syndrome," Pharmacogenomics (2006), 7(3))