Evaluating Classifiers for Disease Gene Discovery

1
Evaluating Classifiers for Disease Gene Discovery

Lon Turnbull and Kino Coursey
lt0013_at_unt.edu, kino_at_daxtrom.com
University of North Texas

2
Biocomputing Fall 2005

CSCD 4930.004/CSCE 5933.007
Biol 4930.773/Biol 5905.773
Instructors: Armin Mikler and Kaja Abbas

3
Outline

An interesting hypothesis
What is a disease gene?
Can disease genes be classified using machine
learning tools?
If so, can we do better?
Classifiers
Data
Analysis
Conclusions

4
Hypothesis

It has been suggested that genes related to
hereditary disease might share common variations
in their DNA sequence structure.

5
What is a disease gene?

Any gene that has mutated in such a way that the
proteins created from it are dysfunctional.

6
What is a disease gene?

However, a mutation can occur in any gene, so can
one actually search for physical characteristics
of a disease gene?

7
Reviewed Paper

A research group used the alternating decision
tree algorithm from Weka to test the hypothesis.
On average, 70% of the genes marked as disease
phenotype were correctly identified by their
automatic classifier, which they called PROSPECTR.
They found that about 40 of their chosen
features showed statistically significant
differences.

8
PROSPECTR results

Feature                        Ratio
Gene encodes signal peptide    2.06
Gene length                    1.42
5' CpG islands                 1.33
Protein length                 1.29
Exon number                    1.25
cDNA length                    1.15
Distance to neighboring gene   1.13
3' UTR length                  1.09

9
Question

Can we do better with other methods of
classification?

10
Classification Methods

ADTree: alternating decision tree, optimized for
two-class problems.
J48: a variant of the C4.5 decision tree
classifier.
Logistic: linear logistic regression.
SMO: Sequential Minimal Optimization algorithm
for training a support vector classifier.
Naïve Bayes: standard probabilistic Naïve Bayes
classifier.
Ibk-K: K-nearest neighbor classifier (k = 5).
PART: obtains rules from partial decision trees
built using C4.5 heuristics.
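
To make one of the listed methods concrete, here is a minimal Python sketch of a K-nearest neighbor vote with k = 5. It is not Weka's implementation; the function name knn_predict, the two features, and all values are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs. Distance is plain
    Euclidean distance, an assumption made for this sketch.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (gene length in kb, exon count) -> class label.
train = [
    ((10.0, 3), "non-disease"),
    ((12.0, 4), "non-disease"),
    ((11.0, 2), "non-disease"),
    ((40.0, 12), "disease"),
    ((42.0, 15), "disease"),
    ((38.0, 11), "disease"),
    ((45.0, 14), "disease"),
]

print(knn_predict(train, (41.0, 13)))
print(knn_predict(train, (11.0, 3)))
```

With two classes and an odd k, the vote cannot tie, which is one practical reason to pick an odd value such as 5.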

11
Test Data

A training set that consisted of 1,084 genes
known to be associated with a disease and 1,084
genes not known to be associated with disease.
A set with 675 disease genes listed in the Human
Gene Mutation Database (HGMD) and 675 genes not
known to be involved in disease.
A set based on oligogenic disorders. It
contained 54 genes known to be associated with an
oligogenic disorder and 54 genes not known to be
associated with disease.

12
Classifier interpretation
There are four possible results from a
classification analysis. A selected gene either:

Matches a disease gene (true positive).
Matches a non-disease gene (true negative).
Is selected to match a disease gene but does not
do so (false positive).
Is selected to match a non-disease gene but does
not do so (false negative).
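
The four outcomes above are usually called true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The Python sketch below counts them and derives the percent correct; the function names and the toy labels are made up for illustration and are not taken from the study.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN, where True means 'disease gene'."""
    counts = Counter()
    for is_disease, called_disease in zip(actual, predicted):
        if called_disease and is_disease:
            counts["TP"] += 1      # matches a disease gene
        elif not called_disease and not is_disease:
            counts["TN"] += 1      # matches a non-disease gene
        elif called_disease and not is_disease:
            counts["FP"] += 1      # called disease, but is not
        else:
            counts["FN"] += 1      # called non-disease, but is disease
    return counts

def percent_correct(counts):
    """Percent of all genes that were classified correctly."""
    total = sum(counts.values())
    return 100.0 * (counts["TP"] + counts["TN"]) / total

# Hypothetical labels for six genes.
actual    = [True, True, True, False, False, False]
predicted = [True, True, False, False, False, True]

counts = confusion_counts(actual, predicted)
print(dict(counts), round(percent_correct(counts), 1))
```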

14
Validity

The analysis of an independent data set ought to
produce results similar to those from the
training set. If not, the analysis is suspect.
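
One way to express this check is sketched below in Python: compute the fraction of correct calls on each set and flag the analysis when the two differ by more than a tolerance. The function names and the 5-point tolerance are assumptions for illustration, not values from the study.

```python
def fraction_correct(actual, predicted):
    """Fraction of genes whose predicted class equals the known class."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

def looks_consistent(train_score, independent_score, tolerance=0.05):
    """True when the two scores differ by no more than the tolerance.

    The default tolerance of 0.05 is an arbitrary, hypothetical choice.
    """
    return abs(train_score - independent_score) <= tolerance

# Hypothetical labels: 1 = disease gene, 0 = non-disease gene.
train_score = fraction_correct([1, 1, 0, 0], [1, 1, 0, 1])
independent_score = fraction_correct([1, 0, 1, 0], [1, 0, 0, 0])

print(train_score, independent_score,
      looks_consistent(train_score, independent_score))
```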

16
Validity

If the analysis is valid, we would expect that
classification using only the successful subset
of features found by the PROSPECTR application
would improve the results.
The removal of non-relevant features ought to
decrease the number of mismatches.

18
All data analyzed
19
The Best Classifier Results

Classifier    Percent correct (total)   Difference with best features
J48           88.7                      -15.1
PART          80.7                      -10.6
ADTree        75.5                      -3.1
Ibk-K         75.4                      -0.32
Naïve Bayes   73.0                      -12.0
SMO           72.3                      -6.0
Logistic      70.0                      -4.9

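
The table can be read with a little arithmetic, sketched below in Python. It assumes the difference column is a change in percentage points relative to the total column, so adding the two gives the score with only the best features. The slide does not state this, so treat that reading as an assumption.

```python
# (percent correct, difference with best features), copied from the table.
results = {
    "J48": (88.7, -15.1),
    "PART": (80.7, -10.6),
    "ADTree": (75.5, -3.1),
    "Ibk-K": (75.4, -0.32),
    "Naïve Bayes": (73.0, -12.0),
    "SMO": (72.3, -6.0),
    "Logistic": (70.0, -4.9),
}

# Assumed interpretation: score with best features = total + difference.
with_best = {name: round(total + diff, 2)
             for name, (total, diff) in results.items()}

best_overall = max(results, key=lambda name: results[name][0])
every_score_dropped = all(diff < 0 for _, diff in results.values())

for name, score in with_best.items():
    print(f"{name:12s} {results[name][0]:5.1f} -> {score:5.2f}")
print(best_overall, every_score_dropped)
```

Under this reading every classifier scores lower on the reduced feature set, which is the opposite of what the validity argument predicts.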
20
Conclusions

We have shown that classifier 2 performs better
than classifier 1, the one chosen by the
PROSPECTR method.
The features that showed the largest differences
in the PROSPECTR study were most likely a
statistical anomaly.
It seems that using these machine learning
methods to classify disease genes is not very
productive. At best, it needs to be combined with
some other independent method.

21
References

Adie E, et al., Speeding disease gene discovery
by sequence based candidate prioritization. BMC
Bioinformatics 2005, 6:55.
Hammond MP, Birney E, Genome information
resources - developments at Ensembl. Trends in
Genetics 2004, 20:268-272.
http://www.biomedcentral.com/1471-2105/6/55
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd
http://www.genetics.med.ed.ac.uk/prospectr/

22
Questions
23
What causes disease?

Causes of disease are a continuum of genetic
activity interacting with nongenetic factors.

The Metabolic and Molecular Bases of Inherited
Disease. Vol. 1, Chapter 1. 8th ed.
RC 627.8.M47.2001

24
Weka

A collection of machine learning algorithms for
data mining tasks. The algorithms can either be
applied directly to a data set or called from
your own Java code.
Contains tools for data pre-processing,
classification, regression, clustering,
association rules, and visualization.
Well-suited for developing new machine learning
schemes.
Is open source software issued under the GNU
General Public License.