Evaluating Classifiers for Disease Gene Discovery

1
Evaluating Classifiers for Disease Gene Discovery
  • Lon Turnbull and Kino Coursey
  • lt0013_at_unt.edu, kino_at_daxtrom.com
  • University of North Texas

2
Biocomputing Fall 2005
  • CSCD 4930.004/CSCE 5933.007
  • Biol 4930.773/Biol 5905.773
  • Instructors: Armin Mikler and Kaja Abbas

3
Outline
  • An interesting hypothesis
  • What is a disease gene?
  • Can disease genes be classified using machine
    learning tools?
  • If so, can we do better?
  • Classifiers and data
  • Analysis and conclusions

4
Hypothesis
  • It has been suggested that the genes which have
    some relationship to hereditary disease might
    have common variations in their DNA sequence
    structure.

5
What is a disease gene?
  • Any gene that has mutated in such a way that the
    proteins created from it are dysfunctional.

6
What is a disease gene?
  • Any gene that has mutated in such a way that the
    proteins created from it are dysfunctional.
  • However, mutation can happen to any gene, so can
    one actually search for physical characteristics
    of a disease gene?

7
Reviewed Paper
  • A research group has used the alternating
    decision tree algorithm from Weka to test the
    hypothesis.
  • On average, 70% of the genes marked as disease
    phenotype were correctly identified by their
    automatic classifier, which they called PROSPECTR.
  • They found that about 40% of their chosen
    features showed statistically significant
    differences.

8
PROSPECTR results
Feature                         Ratio
Gene encodes signal peptide     2.06
Gene length                     1.42
5' CpG islands                  1.33
Protein length                  1.29
Exon number                     1.25
cDNA length                     1.15
Distance to neighboring gene    1.13
3' UTR length                   1.09

9
Question
  • Can we do better with other methods of
    classification?

10
Classification Methods
  1. ADTree: alternating decision tree, optimized for
    two-class problems.
  2. J48: Weka's implementation of the C4.5 decision
    tree learner.
  3. Logistic: linear logistic regression.
  4. SMO: Sequential Minimal Optimization algorithm
    for training a support vector classifier.
  5. Naïve Bayes: standard probabilistic Naïve Bayes.
  6. Ibk-K: K-nearest neighbor classifier (k=5).
  7. PART: obtains rules from partial decision trees
    built using C4.5 heuristics.
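
As an illustration of one method in the list above, the following is a minimal sketch of a k-nearest neighbor classifier with k=5. It is plain Python, not Weka's implementation, and the function name `knn_predict`, the two-number feature vectors, and the toy training points are all made up for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two invented numeric features per gene.
train = [
    ((1.0, 1.0), "non-disease"),
    ((1.2, 0.8), "non-disease"),
    ((0.9, 1.1), "non-disease"),
    ((3.0, 3.2), "disease"),
    ((3.1, 2.9), "disease"),
    ((2.8, 3.0), "disease"),
    ((3.3, 3.1), "disease"),
]

print(knn_predict(train, (3.0, 3.0)))
print(knn_predict(train, (1.0, 1.0)))
```

With k=5 the vote always includes some points from the other class in a data set this small, so the prediction depends on which class holds the majority among the five nearest points.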

11
Test Data
  • A training set that consisted of 1,084 genes
    known to be associated with a disease and 1,084
    genes not known to be associated with disease.
  • A set with 675 disease genes listed in the Human
    Gene Mutation Database (HGMD) and 675 genes not
    known to be involved in disease.
  • A set based on oligogenic disorders. It
    contained 54 genes known to be associated with an
    oligogenic disorder and 54 genes not known to be
    associated with disease.

12
Classifier interpretation
A classification analysis has four possible
results. A selected gene is either
  1. Classified as a disease gene and is one
    (true positive).
  2. Classified as a non-disease gene and is one
    (true negative).
  3. Classified as a disease gene but is not one
    (false positive).
  4. Classified as a non-disease gene but is not one
    (false negative).
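
The four outcomes listed above correspond to the cells of a confusion matrix: true positive, true negative, false positive, and false negative, in that order. A minimal sketch of how they could be counted is shown here. The function names and the six example labels are assumptions for illustration only and do not come from the presentation's data.

```python
def confusion_counts(actual, predicted, positive="disease"):
    """Count the four classification outcomes for a two-class problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted disease: correct (TP) or a false alarm (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted non-disease: a miss (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts

def percent_correct(counts):
    """Share of all genes that were classified correctly, in percent."""
    total = sum(counts.values())
    return 100.0 * (counts["TP"] + counts["TN"]) / total

# Hypothetical labels, invented for this example.
actual = ["disease", "disease", "disease", "non", "non", "non"]
predicted = ["disease", "disease", "non", "non", "non", "disease"]

counts = confusion_counts(actual, predicted)
print(counts)
print(percent_correct(counts))
```

A "percent correct" figure only summarizes the two correct cells, so two classifiers with the same score can still differ in how they split their errors between false positives and false negatives.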

13
(No Transcript)
14
Validity
  • The analysis of an independent data set ought to
    produce results similar to those for the training
    set. If not, the analysis is suspect.

15
(No Transcript)
16
Validity
  • If the analysis is valid, we would expect that
    classification using only the successful
    subset of features found by the PROSPECTR
    application would give improved results.
  • The removal of non-relevant features ought to
    decrease the number of mismatches.

17
(No Transcript)
18
All data analyzed
19
The Best Classifier Results
Classifier     Percent correct    Difference with best features
J48            88.7               -15.1
PART           80.7               -10.6
ADTree         75.5               -3.1
Ibk-K          75.4               -0.32
Naïve Bayes    73.0               -12.0
SMO            72.3               -6.0
Logistic       70.0               -4.9

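
One way to read the table above is sketched below. It assumes, without confirmation from the slides, that the "difference" column is the change in percent correct when a classifier is restricted to the best features, so that adding the two columns gives the score with best features only. The dictionary and function names are made up for this example.

```python
# Values copied from the table: (percent correct, difference with best features).
results = {
    "J48": (88.7, -15.1),
    "PART": (80.7, -10.6),
    "ADTree": (75.5, -3.1),
    "Ibk-K": (75.4, -0.32),
    "Naïve Bayes": (73.0, -12.0),
    "SMO": (72.3, -6.0),
    "Logistic": (70.0, -4.9),
}

def with_best_features(percent_all, difference):
    """Assumed reading: score with best features = score with all features + difference."""
    return round(percent_all + difference, 2)

best_only = {name: with_best_features(*row) for name, row in results.items()}

# Every difference is negative, so under this reading no classifier improved.
improved = [name for name, (_, diff) in results.items() if diff > 0]

for name, score in sorted(best_only.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {score}")
print("Improved with best features:", improved)
```

If the column means something else, for example a difference against another data set, the derived scores do not apply, but the observation that every difference is negative still holds.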
20
Conclusions
  • We have shown that classifier 2 (J48) performs
    better than classifier 1 (ADTree), the one chosen
    by the PROSPECTR method.
  • The features that showed the largest differences
    in the PROSPECTR study were most likely a
    statistical anomaly.
  • It seems that using these machine learning
    methods to classify disease genes is not very
    productive. At best it needs to be combined with
    some other independent method.

21
References
  • Euan Adie et al., Speeding disease gene
    discovery by sequence based candidate
    prioritization. BMC Bioinformatics 2005, 6:55.
  • Hammond MP, Birney E, Genome information
    resources - developments at Ensembl. Trends in
    Genetics 2004, 20:268-272.
  • http://www.biomedcentral.com/1471-2105/6/55
  • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd
  • http://www.genetics.med.ed.ac.uk/prospectr/

22
Questions
23
What causes disease?
  • Causes of disease are a continuum of genetic
    activity interacting with nongenetic factors.
  • The Metabolic and Molecular Bases of Inherited
    Disease. Vol. 1, Chapter 1. 8th ed.
  • RC 627.8.M47.2001

24
Weka
  • A collection of machine learning algorithms for
    data mining tasks. The algorithms can either be
    applied directly to a data set or called from
    your own Java code.
  • Contains tools for data pre-processing,
    classification, regression, clustering,
    association rules, and visualization.
  • Well-suited for developing new machine learning
    schemes.
  • Is open source software issued under the GNU
    General Public License.