1
Discrete Algorithms for Disease Association Search and Susceptibility Prediction
UCSD, November 29, 2006

Dumitru Brinza
Department of Computer Science
Georgia State University

2
Outline

SNPs, Haplotypes and Genotypes
Disease Association Analysis
Multiple-testing adjustment
MLR indexing for data compression
Optimum data clustering
Predicting susceptibility to complex diseases
Conclusions

3
SNP, Haplotypes, Genotypes
Human genome: all the genetic material in the chromosomes; length 3 × 10^9 base pairs
Differences between any two people occur in 0.1% of the genome
SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population
Diploid: two different copies of each chromosome
Haplotype: description of a single copy (expensive)
Example: 00110101 (0 is for the major allele, 1 is for the minor allele)
Genotype: description of the mixed two copies
Example: 01122110 (0+0=0, 1+1=1, 0+1=2)
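The genotype encoding on this slide can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` and the second haplotype are invented for illustration; only the first haplotype and the 0/1/2 encoding come from the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotypes (strings of 0/1) into one genotype string.

    Encoding from the slide: 0+0=0, 1+1=1, 0+1=2 (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value; differing alleles become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# First haplotype is the slide's example; the second one is made up.
print(genotype_from_haplotypes("00110101", "01010110"))  # 02210122
```

Note that the mapping is not invertible: a genotype with several 2s is consistent with more than one pair of haplotypes.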
4
Types of Diseases

Monogenic disease
A mutated gene is entirely responsible for the disease
It breaks the pathway, and there is no other compensatory pathway
Typically rare in the population: < 0.1%
Complex disease
Interaction of multiple genes
One mutation does not cause the disease
Breakage of all compensatory pathways causes the disease
Hard to analyze: 2-gene interaction analysis for a genome-wide scan with 1 million SNPs has 10^12 pairwise tests
Multiple independent causes
There are different causes, and each of these causes can be the result of an interaction of several genes
Each cause explains a certain percentage of cases
Common diseases are complex: > 0.1%
In New York City, 12% of the population has Type 2 diabetes
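As a quick check on the scale of a 2-gene scan, the number of unordered SNP pairs can be computed directly. This sketch counts pairs only; it does not model how many allele-value combinations are tested per pair, which is not stated on the slide.

```python
import math

snps = 1_000_000
# Number of unordered pairs of distinct SNPs.
pairs = math.comb(snps, 2)
print(pairs)  # 499999500000, i.e. on the order of 10**12
```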

5
Case/Control study
Given a population of n genotypes, each containing values of m SNPs and a disease status.

SNPs              Disease status
0101201020102210  -1
0220110210120021  -1
0200120012221110  -1
0020011002212101  -1
1101202020100110   1
0120120010100011   1
0210220002021112   1
0021011000212120   1

The rows are divided into case genotypes and control genotypes.

Disease association analysis searches for a risk (resistance) factor whose frequency among case (control) individuals is considerably higher than among control (case) individuals.
6
Risk/Resistance factors

Risk/resistance factor: one SNP with a fixed allele value

0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
Present in 5 cases and 1 control
The third SNP with fixed allele value 1 is a risk factor: its frequency among case individuals is higher than among control individuals.

We generalize the risk/resistance factor to a multi-SNP combination

7
Multi-SNP extension

Multi-SNP combination (MSC): a subset of SNP columns of S (a set of SNPs) with fixed values of these SNPs (0, 1, or 2)

0 1 1 0 1 2 1 0 2 case
0 1 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 1 1 2 0 0 1 case
0 0 1 0 1 2 1 0 2 case
0 1 0 0 1 1 0 0 2 control
0 1 1 0 1 2 0 0 2 control
x x 1 x x 2 x x x MSC
Present in 4 cases and 1 control: check significance
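The counts on this slide and the previous one can be reproduced with a small sketch. An MSC is represented here as a dictionary from 0-based SNP position to required value; that representation and the function names are assumptions made for illustration.

```python
# Genotypes and labels copied from the slide (spaces removed).
DATA = [
    ("011012102", "case"),
    ("011102001", "case"),
    ("001000021", "case"),
    ("011112001", "case"),
    ("001012102", "case"),
    ("010011002", "control"),
    ("011012002", "control"),
]


def matches(genotype, msc):
    """True if the genotype has the required value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())


def count_presence(data, msc):
    """Return (number of cases, number of controls) in which the MSC is present."""
    cases = sum(1 for g, s in data if s == "case" and matches(g, msc))
    controls = sum(1 for g, s in data if s == "control" and matches(g, msc))
    return cases, controls


print(count_presence(DATA, {2: "1"}))          # (5, 1): third SNP fixed to 1
print(count_presence(DATA, {2: "1", 5: "2"}))  # (4, 1): the MSC x x 1 x x 2 x x x
```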
8
Significance of Risk/Resistance Factors

Measured by p-value: the probability that the case/control distribution among those exposed to the risk factor happened by chance
Computed by the binomial distribution
Searching for risk factors among many SNPs requires multiple-testing adjustment of the p-value
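One way to read "compute by binomial distribution" is sketched below. The slide does not give the exact formula, so this is an assumption: each exposed individual is treated as a case with probability equal to the overall case fraction, and the p-value is the upper tail at the observed number of exposed cases. Function names are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def unadjusted_p_value(cases_exposed, controls_exposed, total_cases, total_controls):
    """Chance of seeing at least this many cases among the exposed individuals.

    Assumption (not from the slides): under the null hypothesis an exposed
    individual is a case with probability total_cases / population size.
    """
    exposed = cases_exposed + controls_exposed
    case_fraction = total_cases / (total_cases + total_controls)
    return binomial_tail(cases_exposed, exposed, case_fraction)


# MSC from the previous slide: present in 4 cases and 1 control,
# in a sample of 5 cases and 2 controls.
print(unadjusted_p_value(4, 1, 5, 2))
```

With a sample this small the p-value is far from significant; the slide's toy table is only meant to show how the counts are obtained.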

9
Multiple-testing adjustment

Bonferroni
Easy to compute
Overly conservative
If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger
Randomization
Randomly permute the disease status of the population to generate 10000 samples
Apply the searching methods to each sample and get MSCs
Count the MSCs that have a smaller unadjusted p-value than the observed p-value
If this count is < 500, then the observed MSC is significant
Computationally expensive
More accurate
In our search we report only MSCs with adjusted p-value < 0.05
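Both adjustments can be sketched in a few lines. The callable `best_p_value` is hypothetical: it stands for "run the search on the permuted statuses and return the smallest unadjusted p-value found", which is one reading of the randomization steps above. The fixed seed is only there to make the sketch reproducible.

```python
import random


def bonferroni(p_value, num_tests):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * num_tests)


def permutation_adjusted_p_value(statuses, best_p_value, observed_p,
                                 num_samples=10000, seed=0):
    """Fraction of permuted samples whose best p-value beats the observed one.

    best_p_value(statuses) is a hypothetical search routine supplied by the caller.
    """
    rng = random.Random(seed)
    shuffled = list(statuses)
    smaller = 0
    for _ in range(num_samples):
        rng.shuffle(shuffled)  # randomly permute the disease status
        if best_p_value(shuffled) < observed_p:
            smaller += 1
    return smaller / num_samples


print(bonferroni(0.001, 100))
```

With 10000 samples, the slide's threshold of fewer than 500 smaller p-values corresponds to an adjusted p-value below 0.05.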

10
Disease Association Search
Problem formulation
Given: case/control study data consisting of n genotypes, each containing values of m SNPs and a disease status
Find: all risk/resistance factors (MSCs) with multiple-testing-adjusted p-value below 0.05
11
Searching Approaches

Exhaustive search (ES)
computationally infeasible
searching for 3-SNP MSC on the sample with n
genotypes and m SNPs requires O(n3m)
Case-closure of an MSC C is an MSC C', with the maximum number of SNPs with fixed values, which is present in the same set of cases and the minimum number of controls.
Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all cases

0 1 1 0 1 2 1 0 2 case
2 0 1 1 0 2 0 0 1 case
0 0 1 0 0 0 0 2 1 case
0 1 1 0 1 2 0 0 2 control
0 2 1 0 1 2 0 1 2 control

x x 1 x x 2 x x x MSC: present in 2 cases, 2 controls
x x 1 x x 2 x 0 x Case-closure: present in 2 cases, 1 control

Cluster C: the subset of genotypes which share the same MSC
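The case-closure rule (extend the MSC with every SNP whose value is common to all cases containing it) can be sketched as follows. The dictionary representation of an MSC (0-based SNP position to value) and the function names are assumptions; the genotypes are the rows shown on the slide.

```python
def matches(genotype, msc):
    """True if the genotype has the required value at every SNP of the MSC."""
    return all(genotype[pos] == value for pos, value in msc.items())


def case_closure(msc, cases):
    """Extend an MSC with all SNPs that share one value across its cases."""
    matching = [g for g in cases if matches(g, msc)]
    closure = dict(msc)
    if not matching:
        return closure
    for pos in range(len(matching[0])):
        values = {g[pos] for g in matching}
        if len(values) == 1:  # every matching case agrees at this SNP
            closure[pos] = values.pop()
    return closure


CASES = ["011012102", "201102001", "001000021"]
CONTROLS = ["011012002", "021012012"]

msc = {2: "1", 5: "2"}  # x x 1 x x 2 x x x
closed = case_closure(msc, CASES)
print(closed)  # {2: '1', 5: '2', 7: '0'}, i.e. x x 1 x x 2 x 0 x
print(sum(matches(g, msc) for g in CONTROLS),
      sum(matches(g, closed) for g in CONTROLS))  # 2 1
```

The closure keeps the same two cases but drops one control, which is why searching only among case-closed MSCs loses nothing.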
12
Combinatorial Search

Combinatorial Search Method (CS)
Searches only among case-closed MSCs
Avoids checking of clusters with small number of
cases
Finds significant MSCs faster than ES
Still too slow for large data
Further speedup by reducing the number of SNPs
Indexing: compress S by extracting the most informative SNPs
Use the multiple regression method

13
Indexing
Step 1: Find index (SNP position) in sample
Find index (0, 1, 2)
Step 2: Reconstruct complete genotype
Computation methods

Problem formulation
Given: the full pattern of all SNPs in a sample
Find: the minimum number of index SNPs that will allow the reconstruction of the complete genotype for each individual
Index SNPs selection algorithm
SNP prediction algorithm

14
MLR Indexing

SNP Prediction Algorithm
Based on Multiple Linear Regression (MLR)
Index SNPs Selection Algorithm
Choose as the first index SNP the SNP which best predicts all other SNPs
Choose as the next one the SNP which, together with the first, best predicts all other SNPs, and so on.
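The greedy selection described above can be sketched in plain Python. Everything below is an illustration under stated assumptions rather than the implementation from the talk: prediction is ordinary least squares on the chosen index SNPs with an intercept, predictions are rounded to the nearest value in {0, 1, 2}, "best predicts" means fewest mispredicted entries, and a tiny ridge term (1e-9) is added so the normal equations stay solvable when SNP columns are identical. All function names and the toy data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def mlr_errors(rows, index_cols, target):
    """Fit target ~ index SNPs by least squares; count rounded mispredictions."""
    xs = [[1.0] + [float(r[c]) for c in index_cols] for r in rows]
    ys = [float(r[target]) for r in rows]
    k = len(xs[0])
    # Normal equations X^T X w = X^T y; the ridge term is an assumption.
    xtx = [[sum(x[i] * x[j] for x in xs) + (1e-9 if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    w = solve_linear(xtx, xty)
    errors = 0
    for x, y in zip(xs, ys):
        pred = min(2, max(0, round(sum(wi * xi for wi, xi in zip(w, x)))))
        errors += pred != y
    return errors


def select_index_snps(rows, k):
    """Greedily pick k index SNPs that together best predict all other SNPs."""
    num_snps = len(rows[0])
    chosen = []
    for _ in range(k):
        best = None  # (total errors, candidate SNP)
        for cand in range(num_snps):
            if cand in chosen:
                continue
            cols = chosen + [cand]
            total = sum(mlr_errors(rows, cols, t)
                        for t in range(num_snps) if t not in cols)
            if best is None or total < best[0]:
                best = (total, cand)
        chosen.append(best[1])
    return chosen


# Toy data: SNPs 0-2 are identical, SNP 3 is unrelated to them.
ROWS = [(0, 0, 0, 1), (1, 1, 1, 0), (2, 2, 2, 0),
        (0, 0, 0, 0), (1, 1, 1, 0), (2, 2, 2, 0)]
print(select_index_snps(ROWS, 2))  # [0, 3]
```

On the toy data one of the three identical SNPs is picked first because it predicts the other two exactly, and the unrelated SNP is picked second.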

15
5 Data Sets

Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
Location: 5q31
Number of SNPs: 103
Population size: 387
Case: 144, control: 243
Autoimmune disorders (Ueda et al.)
Location: containing genes CD28, CTLA4 and ICOS
Number of SNPs: 108
Population size: 1024
Case: 378, control: 646
Tick-borne encephalitis (Barkash et al.)
Location: containing genes TLR3, PKR, OAS1, OAS2, and OAS3
Number of SNPs: 41
Population size: 75
Case: 21, control: 54
Lung cancer (Dragani et al.)
Number of SNPs: 141
Population size: 500
Case: 260, control: 240

16
Results of Disease association search

Indexed versus original
The number of statistically significant MSCs found on indexed data is higher than on non-indexed data
CS versus ES
Over all datasets, CS finds no fewer MSCs than ES
For some datasets ES could not find any significant MSC in a reasonable amount of time, while CS did
Conclusion
We conclude that the proposed indexing approach and the CS method are very promising techniques
CS is still slow
Alternatively, we can search not for all MSCs but for the best MSC

17
The most associated MSC

Optimum Association Search Problem
Given: case/control study data
Find: the MSC that is the most associated with the disease
The MSC which is present in a control-free cluster of maximum size
Complexity
Generalization of maximum independent set
NP-complete and cannot be well approximated
Hope
Sample S is not arbitrary
Biological structure

Cluster C: the subset of genotypes which share the same MSC
18
Complimentary Greedy Search (CGS)

Intuition: greedy algorithm for finding a maximum independent set by removing highest-degree vertices
Algorithm
1. Start with the empty MSC, which is present in all genotypes
2. Find the SNP with an allele value that removes a set of genotypes with the highest ratio of controls over cases (max(controls/cases))
3. Add the SNP to the resulting MSC
4. Repeat 2-3 until all controls are removed
5. Output the resulting MSC
6. Adjust the p-value of the resulting MSC for multiple testing
Extremely fast but inaccurate
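The greedy loop can be sketched as follows. Several details are assumptions that the slide leaves open: the ratio is treated as infinite when a choice removes controls but no cases, ties are broken in favour of removing more controls, and a choice that would remove every remaining case is skipped. The genotypes are the seven rows from the risk/resistance factor slides; the function name is hypothetical.

```python
def complementary_greedy_search(cases, controls):
    """Greedily fix SNP values until no control genotype matches the MSC."""
    cases, controls = list(cases), list(controls)
    msc = {}  # 0-based SNP position -> fixed value
    num_snps = len(cases[0])
    while controls:
        best = None  # ((ratio, removed controls), snp, value)
        for snp in range(num_snps):
            if snp in msc:
                continue
            for value in "012":
                removed_controls = sum(g[snp] != value for g in controls)
                removed_cases = sum(g[snp] != value for g in cases)
                # Skip useless choices and choices that leave no cases.
                if removed_controls == 0 or removed_cases == len(cases):
                    continue
                ratio = (removed_controls / removed_cases
                         if removed_cases else float("inf"))
                key = (ratio, removed_controls)
                if best is None or key > best[0]:
                    best = (key, snp, value)
        if best is None:
            break  # no admissible SNP value is left
        _, snp, value = best
        msc[snp] = value
        cases = [g for g in cases if g[snp] == value]
        controls = [g for g in controls if g[snp] == value]
    return msc, cases, controls


CASES = ["011012102", "011102001", "001000021", "011112001", "001012102"]
CONTROLS = ["010011002", "011012002"]

msc, remaining_cases, remaining_controls = complementary_greedy_search(CASES, CONTROLS)
print(msc)                                            # {2: '1', 8: '1'}
print(len(remaining_cases), len(remaining_controls))  # 3 0
```

On this toy sample the result is a control-free cluster of three cases. The p-value of the returned MSC would still need the multiple-testing adjustment from step 6.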

Cases
Controls
19
CGS Results

CGS finds MSCs with non-trivially high
association on real data
CGS finds more significant MSCs on full dataset
than CS on indexed in reasonable amount of time

20
Future Work

CGS is fast, so it can be used as a basic operation in case/control data analysis
Cover the data with clusters corresponding to MSCs found by CGS and analyze SNPs which belong to many MSCs
Build a classifier (prediction) based on MSCs found by CGS
We plan to randomize CGS using simulated annealing to find more significant MSCs with a smaller number of SNPs

21
Genetic Susceptibility Prediction
Problem formulation

Given: case/control study data S and the genotype gt of a testing individual t
Predict: the disease status of the testing individual

SNPs              Disease status
0101201020102210  -1
0220110210120021  -1
0200120012221110  -1
0020011002212101  -1
1101202020100110   1
0120120010100011   1
0210220002021112   1
0021011000212120   1

The rows are divided into case genotypes and control genotypes.

Testing genotype gt
0110211101211201  ?
22
Cross-validation

Leave-one-out test
The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set

Real disease status  Predicted disease status  Genotype
-1                   -1                        0101201020102210
-1                   -1                        0220110210120021
-1                    1                        0200120012221110
 1                    1                        0020011002212101
 1                    1                        0020011002212101

Accuracy: 80%

Leave-many-out test
Repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3
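A leave-one-out test is independent of the prediction method, which the sketch below makes explicit by taking the predictor as a parameter. The nearest-neighbour predictor is only a placeholder so the example runs; it is not one of the methods compared in the talk, and the toy genotypes are invented.

```python
def hamming(a, b):
    """Number of SNP positions where two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train_genotypes, train_statuses, genotype):
    """Placeholder predictor: status of the closest training genotype."""
    best = min(range(len(train_genotypes)),
               key=lambda i: hamming(train_genotypes[i], genotype))
    return train_statuses[best]


def leave_one_out_accuracy(genotypes, statuses, predict):
    """Predict each genotype from all the others and report the hit rate."""
    correct = 0
    for i in range(len(genotypes)):
        train_g = genotypes[:i] + genotypes[i + 1:]
        train_s = statuses[:i] + statuses[i + 1:]
        if predict(train_g, train_s, genotypes[i]) == statuses[i]:
            correct += 1
    return correct / len(genotypes)


GENOTYPES = ["000", "001", "222", "221"]
STATUSES = [-1, -1, 1, 1]
print(leave_one_out_accuracy(GENOTYPES, STATUSES, nearest_neighbor_predict))  # 1.0
```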

23
Quality Measures of Prediction
(confusion table)

Sensitivity: the ability to correctly detect cases
sensitivity = TP/(TP+FN)
Specificity: the ability to avoid calling a control a case
specificity = TN/(FP+TN)
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Risk rate: measurements for risk factors.
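These measures can be computed from the real and predicted statuses. The sketch below uses the five rows of the leave-one-out table from the cross-validation slide and assumes that status 1 marks a case; that labeling is an assumption, since the slides do not say which value denotes disease.

```python
def confusion_counts(real, predicted, case_label=1):
    """Return (TP, FP, FN, TN), treating case_label as the positive class."""
    pairs = list(zip(real, predicted))
    tp = sum(r == case_label and p == case_label for r, p in pairs)
    fp = sum(r != case_label and p == case_label for r, p in pairs)
    fn = sum(r == case_label and p != case_label for r, p in pairs)
    tn = sum(r != case_label and p != case_label for r, p in pairs)
    return tp, fp, fn, tn


REAL = [-1, -1, -1, 1, 1]
PREDICTED = [-1, -1, 1, 1, 1]

tp, fp, fn, tn = confusion_counts(REAL, PREDICTED)
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(sensitivity, specificity, accuracy)
```

The accuracy comes out at 0.8, matching the 80% shown for that table. The divisions assume both classes occur in `REAL`.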

24
Prediction Methods

Support vector machine
Random forest
LP-based prediction

Drawback of the prediction problem formulation: need for cross-validation → no optimization

25
Optimum Clustering Problem

Given: case/control study data represented by a population sample S
Find: a partition P of S into clusters S = S1 ∪ ... ∪ Sk, with disease status 0 or 1 assigned to each cluster Si, minimizing entropy(P) assuming 0 errors

Clustering: P is a partition into clusters defined by MSCs
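The slide does not define entropy(P). A minimal sketch, assuming it is the Shannon entropy of the cluster-size fractions |Si|/|S| (an assumption, not taken from the slides):

```python
from math import log


def partition_entropy(cluster_sizes):
    """Shannon entropy (natural log) of a partition given its cluster sizes."""
    total = sum(cluster_sizes)
    return -sum((s / total) * log(s / total) for s in cluster_sizes if s > 0)


print(partition_entropy([4]))           # one cluster: zero entropy
print(partition_entropy([2, 2]))        # two equal clusters: ln 2
print(partition_entropy([1, 1, 1, 1]))  # four singletons: ln 4
```

Under this reading, minimizing entropy favours a few large clusters over many small ones.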
26
From Clustering to Prediction

Intuition
If the tested genotype is predicted correctly, then the optimum clustering will have smaller entropy
Model-fitting prediction algorithm
Set the status of the testing genotype to diseased
Add it to the training dataset
Find the optimum clustering of the dataset
Set the status of the testing genotype to non-diseased
Add it to the training dataset
Find the optimum clustering of the dataset
Predict the status according to the case with smaller entropy
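The model-fitting steps can be sketched with the clustering routine passed in as a function. The `toy_clustering_entropy` stand-in below is NOT the optimum clustering from the talk: it simply groups identical (genotype, status) pairs and returns the Shannon entropy of the group sizes, so that the example is runnable. The labels 1 and -1, the tie rule (a tie predicts non-diseased), and all names are assumptions.

```python
from collections import Counter
from math import log


def toy_clustering_entropy(genotypes, statuses):
    """Stand-in for the optimum clustering step (hypothetical, for illustration)."""
    sizes = Counter(zip(genotypes, statuses)).values()
    total = len(genotypes)
    return -sum((s / total) * log(s / total) for s in sizes)


def model_fitting_predict(genotypes, statuses, test_genotype,
                          clustering_entropy, diseased=1, healthy=-1):
    """Try both labels for the test genotype; keep the one with lower entropy."""
    extended = genotypes + [test_genotype]
    as_diseased = clustering_entropy(extended, statuses + [diseased])
    as_healthy = clustering_entropy(extended, statuses + [healthy])
    return diseased if as_diseased < as_healthy else healthy


TRAIN_G = ["000", "000", "111"]
TRAIN_S = [1, 1, -1]
print(model_fitting_predict(TRAIN_G, TRAIN_S, "000", toy_clustering_entropy))  # 1
print(model_fitting_predict(TRAIN_G, TRAIN_S, "111", toy_clustering_entropy))  # -1
```

Because the clustering is run twice per tested genotype, the cost of prediction is dominated by the clustering routine.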

27
Results of Prediction Methods
Leave-One-Out Cross Validation

Leave-one-out cross-validation results for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.

28
ROC curve

Comparison of 5 prediction methods on the (Barkash et al., 2006) data on all SNPs.
The area under the CSP curve is 0.81 vs. 0.52 under the SVM curve.

29
Conclusions

Combinatorial search is able to find statistically significant multi-gene interactions for data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction
The new susceptibility prediction is 15% higher than the best previously known
MLR-tagging efficiently reduces the datasets, making it possible to find associated multi-SNP combinations and predict susceptibility

30
Thank You!

Poster 14: Case(Control)-Free Multi-SNP Combinations in Case-Control Studies, Algorithmic Biology 2006
Paper: Combinatorial Methods for Disease Association Search and Susceptibility Prediction, WABI 2006
