1
ELVIRA II
  • San Sebastián Meeting
  • May, 2004
  • Andrés Masegosa

2
Naive-Bayes classifier for gene expression data
  • Classifier with Continuous Variables
  • Feature Selection
  • Wrapper Search Method

3
Index
  • 1. The Naive-Bayes Classifier
  • 1.1 Hypotheses for the creation of the NB classifier.
  • 1.2 Description of a NB classifier.
  • 2. Previous Works
  • 2.1 The lymphoma data set
  • 2.2 Wright et al. paper
  • 3. Selective Gaussian Naive-Bayes
  • 3.1 Anova Phase
  • 3.2 Search Phase
  • 3.3 Stop Condition

  • 4. Implementation
  • 4.1 Implemented Classes
  • 4.2 Methods
  • 4.3 Results
  • 5. Conclusions
  • 6. Future Works
4
Hypotheses for the creation of the Naive-Bayes classifier
  • Hypothesis 1: The attribute variables are independent given the class variable.
  • Hypothesis 2: The attribute variables are distributed, given the class variable, as:
  • Normal distributions (Andrés)
  • Linear exponential mixtures (Javi)

5
Naive-Bayes Classifier
  • There are three basic steps in the construction of the Naive-Bayes classifier (a code sketch follows the list):
  • Step 1: Structural Learning → The classifier structure is learned. The naive-Bayes model has only arcs from the class variable to the predictive variables; it is assumed that the predictive variables are independent given the class.
  • Step 2: Parametric Learning → It consists in estimating the distribution of each predictive variable.
  • Step 3: Propagation Method → It carries out the prediction of the class variable given the predictive variables. In our case, the known predictive variables are observed in the Bayesian network and then a propagation method (Variable Elimination) is run to obtain an a posteriori distribution of the class variable. The class with the greatest probability is the predicted value.
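
A minimal sketch of these three steps for the Gaussian case; the class and method names here are illustrative, not the Elvira API, and the parameters are assumed to come from a previous parametric-learning pass.

public class GaussianNB {
    // Structural learning is implicit: one arc from the class to each
    // predictive variable, so the joint factorizes as P(c) * prod_i f(x_i | c).
    double[] prior;      // P(C = c), estimated from class frequencies
    double[][] mean, sd; // parametric learning: per class c and variable i

    GaussianNB(double[] prior, double[][] mean, double[][] sd) {
        this.prior = prior; this.mean = mean; this.sd = sd;
    }

    static double normal(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    // Propagation: with every predictor observed, variable elimination
    // reduces to multiplying the prior by each conditional density and
    // picking the class with the greatest a posteriori probability.
    int predict(double[] x) {
        int best = -1;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < prior.length; c++) {
            double logP = Math.log(prior[c]);
            for (int i = 0; i < x.length; i++)
                logP += Math.log(normal(x[i], mean[c][i], sd[c][i]));
            if (logP > bestLog) { bestLog = logP; best = c; }
        }
        return best;
    }
}

Because all predictors are observed, the propagation step collapses to a product of densities, which is why no general-purpose inference engine is needed in this sketch.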

6
Naive-Bayes with MTE
  • An MTE is learned for each predictive variable given the class.
  • An example:

NB classifier learned from the Iris data base
Estimating a Normal distribution with an MTE
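
As a rough illustration of what such a model evaluates, here is a sketch of an MTE density in its standard piecewise form a0 + Σ ai·exp(bi·x); the coefficients are placeholders, not values Elvira would learn from the Iris data.

class MTEPiece {
    double lo, hi;  // the interval [lo, hi) this piece covers
    double a0;      // constant term
    double[] a, b;  // coefficients of the exponential terms

    double eval(double x) {
        double f = a0;
        for (int i = 0; i < a.length; i++) f += a[i] * Math.exp(b[i] * x);
        return f;
    }
}

class MTEDensity {
    MTEPiece[] pieces;

    double eval(double x) {
        for (MTEPiece p : pieces)
            if (x >= p.lo && x < p.hi) return p.eval(x);
        return 0.0; // outside the support
    }
}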
7
Lymphoma data set I. Alizadeh et al. (2000), http://llmpp.nih.gov
  • Hierarchical clustering of gene expression data. Depicted are the ~1.8 million measurements of gene expression from 128 microarray analyses of 96 samples of normal and malignant lymphocytes.
  • The dendrogram at the left lists the samples
    studied and provides a measure of the relatedness
    of gene expression in each sample. The dendrogram
    is colour coded according to the category of mRNA
    sample studied (see upper right key).
  • Each row represents a separate cDNA clone on the
    microarray and each column a separate mRNA
    sample.
  • The results presented represent the ratio of hybridization of fluorescent cDNA probes prepared from each experimental mRNA sample to a reference mRNA sample. These ratios are a measure of relative gene expression in each experimental sample and were depicted according to the colour scale shown at the bottom.
  • As indicated, the scale extends from fluorescence ratios of 0.25 to 4 (-2 to 2 in log base 2 units). Grey indicates missing or excluded data.

8
Lymphoma data set II
http://llmpp.nih.gov
  • Alizadeh et al. (2000) proposed a partition of the diffuse large B-cell lymphoma cases into two clusters based on gene expression profiling:
  • Germinal Centre B-like: high survival rate.
  • Activated B-like: low survival rate.
  • Rosenwald et al. (2002) proposed a new partition of the diffuse large B-cell lymphoma cases into three clusters (274 patients):
  • Germinal Centre B-like (GCB): high survival rate (134 patients).
  • Activated B-cell (ABC): low survival rate (83 patients).
  • Type 3 (TypeIII): medium survival rate (57 patients).
  • Wright et al. (2003) proposed a Bayesian predictor that estimates the probability of membership in one of two cancer subgroups (GCB or ABC), using the data set of Rosenwald et al.

9
Wright et al. (2003)
  • Gene Expression Data: http://llmpp.nih.gov/DLBCLpredictor
  • 8503 genes.
  • 134 cases of GCB, 83 cases of ABC and 57 cases of Type III.
  • DLBCL subgroup predictor
  • Linear Predictor Score: LPS(X) = Σj aj·Xj, where X = (X1, X2, ..., Xn) and aj is the t statistic of gene j (see the sketch after this list).
  • Only the K genes with the most significant t statistics were used to form the LPS; the optimal K was determined by a leave-one-out method. A model including 27 genes had the lowest average error rate.
  • P(GCB | X) = N(LPS(X); μ1, σ1) / [N(LPS(X); μ1, σ1) + N(LPS(X); μ2, σ2)], where N(x; μ, σ) represents a Normal density function with mean μ and standard deviation σ.
  • Training set: 67 GCB + 42 ABC. Validation set: 67 GCB + 41 ABC + 57 Type III.
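
The following sketch mirrors the predictor as described above: the weights a[j] are the t statistics of the selected genes, and the two normal densities are assumed to be fitted to the LPS values of the GCB and ABC training samples. It follows the paper's description, not the authors' code.

class LPSPredictor {
    // LPS(X) = sum_j a_j * X_j over the selected genes
    static double lps(double[] x, double[] a) {
        double s = 0.0;
        for (int j = 0; j < x.length; j++) s += a[j] * x[j];
        return s;
    }

    static double normal(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2.0 * Math.PI));
    }

    // P(GCB | X): ratio of the density fitted to GCB scores (mu1, s1)
    // over the sum with the density fitted to ABC scores (mu2, s2).
    static double pGCB(double lpsValue, double mu1, double s1,
                       double mu2, double s2) {
        double n1 = normal(lpsValue, mu1, s1);
        double n2 = normal(lpsValue, mu2, s2);
        return n1 / (n1 + n2);
    }
}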

10
Wright et al. (2003)
  • This predictor chooses a cutoff of 90% certainty. The samples for which there was <90% probability of being in either subgroup are termed unclassified.
  • Results

11
Index
  • 1. The Naive-Bayes Classifier
  • 1.1 Hypotheses for the creation of the NB classifier.
  • 1.2 Description of a NB classifier.
  • 2. Previous Works
  • 2.1 The lymphoma data set
  • 2.2 Wright et al. paper
  • 3. Selective Gaussian Naive-Bayes
  • 3.1 Anova Phase
  • 3.2 Search Phase

  • 3.3 Stop Condition
  • 3.4 The M best Explanations
  • 4. Implementation
  • 4.1 Implemented Classes
  • 4.2 Methods
  • 4.3 Results
  • 5. Conclusions
  • 6. Future Works
12
Selective Gaussian Naive Bayes
  • It is a modified wrapper method to construct an optimal Naive-Bayes classifier with a minimum number of predictive genes.
  • The main steps of the algorithm are:
  • Step 1: First feature selection procedure. Selection of the most significant, least correlated variables (Anova phase).
  • Step 2: Application of a wrapper search method to select the final subset of variables that minimizes the training error rate (Search phase).
  • Step 3: Learning of the distribution of each selected gene (Parametric phase).

13
Anova Phase: Analysis of Variance
  • A dispersion measure is established for each gene: Anova(X) ∈ [0, ∞).
  • The gene set is preordered from higher to lower Anova measure.
  • The gene set is partitioned into K gene subsets, where each gene pair within a subset has a correlation coefficient greater than a given bound U.
  • For each gene subset, the variable with the greatest Anova coefficient is selected. So, we keep only K variables of the initial training set, as sketched below.
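
A sketch of this phase, under one assumption the slide leaves open: the correlation-based partition is built greedily, walking the genes in decreasing Anova order and keeping a gene only when it is not correlated above U with any already kept gene; each kept gene is then the highest-Anova representative of its group.

import java.util.*;

class AnovaPhase {
    // anova[g] = dispersion measure of gene g; corr[g][h] = correlation
    // coefficient between genes g and h (both assumed precomputed).
    static List<Integer> select(double[] anova, double[][] corr, double U) {
        Integer[] order = new Integer[anova.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Preorder the genes from higher to lower Anova measure.
        Arrays.sort(order, (x, y) -> Double.compare(anova[y], anova[x]));

        List<Integer> selected = new ArrayList<>();
        for (int g : order) {
            boolean redundant = false;
            for (int s : selected)
                if (Math.abs(corr[g][s]) > U) { redundant = true; break; }
            if (!redundant) selected.add(g); // highest-Anova gene of its group
        }
        return selected; // the K representatives, one per correlated group
    }
}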

14
Search Phase: A wrapper method
  • Preliminaries
  • Let A(m,n) be the original training set, with m distinct cases and n features for each case.
  • Let B(m,k) be the projection of A over the k genes selected in the previous phase (the set K): B(m,k) = π(A(m,n), K).
  • Let KFC(B(m,k)) be the error rate obtained with a simple Naive-Bayes classifier evaluated over B using a T-fold-cross-validation procedure.
  • Algorithm (a code sketch follows)
  • Let P = ∅ and Q = {X1, ..., Xk}.
  • While (not stopCondition(|P|, r, Δr)):
  • Let i = argminj KFC(π(B(m,k), P ∪ {Xj})), Xj ∈ Q
  • r = KFC(B(m,k), P)
  • P = P ∪ {Xi}, Q = Q \ {Xi}
  • Δr = KFC(B(m,k), P) − r
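
A sketch of this greedy loop; kfc stands for the T-fold-cross-validated error rate of a simple NB built on the given gene subset, and both it and the real stop condition are assumed to be supplied elsewhere.

import java.util.*;
import java.util.function.Function;

class WrapperSearch {
    static Set<Integer> search(Set<Integer> candidates,
                               Function<Set<Integer>, Double> kfc) {
        Set<Integer> P = new LinkedHashSet<>();           // selected genes
        Set<Integer> Q = new LinkedHashSet<>(candidates); // remaining genes
        double r = 1.0, dr = -1.0;
        while (!stop(P.size(), r, dr) && !Q.isEmpty()) {
            int best = -1;
            double bestErr = Double.POSITIVE_INFINITY;
            for (int x : Q) {                  // i = argmin KFC over P u {Xj}
                P.add(x);
                double e = kfc.apply(P);
                P.remove(x);
                if (e < bestErr) { bestErr = e; best = x; }
            }
            r = kfc.apply(P);                  // error of P, before adding Xi
            P.add(best);                       // P = P u {Xi}
            Q.remove(best);                    // Q = Q \ {Xi}
            dr = bestErr - r;                  // error increment after adding
        }
        return P;
    }

    // Placeholder; the implemented condition is discussed on a later slide.
    static boolean stop(int n, double r, double dr) {
        return n > 0 && dr > 0;
    }
}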

15
Search Phase: Stop Condition
  • The parameters are |P| (the number of elements of P), r (the current error rate) and Δr (the error rate increment).
  • General stop condition: Δr ≥ 0 or Δr > 0.
  • Problems:
  • Early stopping (only 3-5 genes are selected) with Δr ≥ 0.
  • Overfitting with Δr > 0.
  • Implemented stop condition: see the implemented formula in the Methods slide.

16
The M best explanations: Abductive Inference
  • Due to the high dimensionality of gene expression data sets, cross-validation methods are usually used to estimate the training error rate of a classification model.
  • If a T-fold-cross method is used, T final gene subsets are obtained by the wrapper search method. The question is: how do we select a unique gene subset to apply to the validation data set?
  • Method (a code sketch of the database construction follows this list)
  • Let Ci, i ∈ {1, ..., T}, be the subset returned by the wrapper method in fold i of the cross-validation procedure.
  • Let C = ∪i Ci and N = |C|.
  • Let Z be a database of T cases, where case j is the tuple (a1, ..., aN), with ai = 1 if Xi ∈ Cj and ai = 0 otherwise.
  • Let B be a BN learned by a K2 learning method. An abductive inference method returns the M most probable explanations of the BN, which is equivalent to getting the M most probable gene subsets.
  • The final subset is the one with minimum leave-one-out training error rate.
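
A sketch of building the database Z described above: the union C of the T fold subsets gives the N columns, and case j holds a 1 for every gene that subset Cj contains. All names are illustrative.

import java.util.*;

class MBestInput {
    static int[][] buildZ(List<Set<Integer>> subsets) {
        SortedSet<Integer> union = new TreeSet<>();
        for (Set<Integer> c : subsets) union.addAll(c); // C = union of the Ci
        List<Integer> genes = new ArrayList<>(union);   // N = |C| columns

        int[][] z = new int[subsets.size()][genes.size()];
        for (int j = 0; j < subsets.size(); j++)
            for (int i = 0; i < genes.size(); i++)
                z[j][i] = subsets.get(j).contains(genes.get(i)) ? 1 : 0;
        return z; // one row (case) per cross-validation fold
    }
}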

17
Implementation I
  • Included in the learning.classification.supervised.mixed package.
  • This package contains the following classes:
  • MixedClassifier class → An abstract class, designed to be the parent of all the mixed classifiers to implement. It inherits from the DiscreteClassifier class.
  • Mixed_Naive_Bayes class → A public class to learn a mixed Naive-Bayes classification model. It inherits from the MixedClassifier class. This class implements the structural learning and the selective structural learning methods. The latter contains the implementation of our wrapper search method and requires the following methods (to be implemented later):
  • double evaluationKFC(DataBaseCases cases, int numclass)
  • double evaluationLOO(DataBaseCases cases, int numclass)
  • Bnet getNewClassifier(DataBaseCases cases)

18
Implementation II
  • Gaussian_Naive_Bayes class → A public class that implements the parametric learning of a mixed NB classifier. It is assumed that the predictive variables are distributed as a normal distribution given the class. It inherits from the Mixed_Naive_Bayes class.
  • Selective_GNB class → A public class that implements a Gaussian naive Bayes with feature selection. So, this class:
  • Implements the Selective_Classifier interface.
  • Overrides the following methods:
  • structuralLearning, which now calls the selectiveStructuralLearning method.
  • And the evaluationKFC, evaluationLOO and getNewClassifier methods.
  • Selective_Classifier interface → A public interface to define variable selection in a classifier method.

19
Methods
  • 10 training and validation sets were randomly generated.
  • The three phases were applied to each training set.
  • The parameters were:
  • 10-fold cross-validation.
  • M was fixed to 20.
  • U was fixed to 0.15.
  • This predictor chose a cutoff of 80% certainty. The samples for which there was <80% probability of being in either subgroup were termed unclassified.
  • The stop condition was implemented as (a code sketch follows):
  • Avoid overfitting: Δr > n·u2/n2, with u2 = 0.03 and n2 = 20.
  • Avoid early stopping: incRate < (n1 − n)·u1/n1, with u1 = 0.1 and n1 = 10.
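
A loose sketch of how these two guards might combine into a single stop test. The exact combination is not recoverable from the transcript, so the logic below is an assumption: tolerate small error increases while few genes are selected, and always stop once the overfitting bound is exceeded.

class StopCondition {
    // n = number of selected genes, r = current error rate,
    // incRate = error rate increment (Δr). Thresholds from this slide.
    static boolean stop(int n, double r, double incRate) {
        final double u1 = 0.1;  final int n1 = 10; // early-stopping guard
        final double u2 = 0.03; final int n2 = 20; // overfitting guard
        if (incRate > n * u2 / (double) n2) return true;         // assumed: clear overfitting
        if (incRate < (n1 - n) * u1 / (double) n1) return false; // assumed: tolerated early on
        return incRate > 0; // otherwise stop as soon as the error rises
    }
}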

20
Results I
  • Phase Anova (confidence intervals at 95%):
  • Size (gene number): [74.3, 83.1]
  • Train accuracy rate (%): [96.8, 98.6]
  • Test accuracy rate (%): [92.8, 95.4]
  • Test -log likelihood: [41.6, 72.3]
  • TypeIII test accuracy rate (%): [17.75, 18.66]
  • TypeIII test -log likelihood: Infinity

21
Results II
  • Phase Anova + Phase Search (confidence intervals at 95%):
  • Size (gene number): [6.17, 7.82]
  • Train accuracy rate (%): [95.2, 98.0]
  • Test accuracy rate (%): [88.83, 91.9]
  • Test -log likelihood: [26.72, 40.46]
  • TypeIII test accuracy rate (%): [20.0, 26.2]
  • TypeIII test -log likelihood: [214.0, 264.6]

22
Conclusions
  • It is a simple classification method that provides good results.
  • Its main problem is that, due to the search process, the train error rate goes down quickly and the mean number of selected genes is too low (around eight genes).
  • This trend is corrected, though, by the Anova phase, the k-fold cross-validation and the flexible stop condition.
  • Getting the M best explanations is a very good technique to fuse the several groups of genes extracted by a feature selection method.

23
Future works
  • Develop more sophisticated models:
  • Include replacement variables to manage missing data.
  • Consider multidimensional Gaussian distributions.
  • Improve the MTE Gaussian Naive-Bayes model.
  • Apply this model to other data sets such as breast cancer, colon cancer, ...
  • Compare with other models with discrete variables.