WEKA Tutorial
  • Sugato Basu and Prem Melville

What is WEKA?
  • Collection of ML algorithms, open-source, written in Java
  • http://www.cs.waikato.ac.nz/ml/weka/
  • Schemes for classification include:
  • decision trees, rule learners, naive Bayes,
    decision tables, locally weighted regression,
    SVMs, instance-based learners, logistic
    regression, voted perceptrons, multi-layer
    perceptron
  • Schemes for numeric prediction include:
  • linear regression, model tree generators, locally
    weighted regression, instance-based learners,
    decision tables, multi-layer perceptron
  • Meta-schemes include:
  • bagging, boosting, stacking, regression via
    classification, classification via regression,
    cost-sensitive classification
  • Schemes for clustering include:
  • EM and Cobweb

Getting Started
  • Set environment variable WEKAHOME
  • setenv WEKAHOME /u/ml/software/weka
  • Add $WEKAHOME/weka.jar to your CLASSPATH
  • setenv CLASSPATH /u/ml/software/weka/weka.jar
  • Test
  • java weka.classifiers.j48.J48 -t <training file>

ARFF File Format
  • Requires declarations of @RELATION, @ATTRIBUTE and
    @DATA
  • @RELATION declaration associates a name with the
    dataset
  • @RELATION <relation-name>
  • @RELATION iris
  • @ATTRIBUTE declaration specifies the name and
    type of an attribute
  • @ATTRIBUTE <attribute-name> <datatype>
  • Datatype can be numeric, nominal, string or date
  • @ATTRIBUTE sepallength NUMERIC
  • @ATTRIBUTE petalwidth NUMERIC
  • @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
  • @DATA declaration is a single line denoting the
    start of the data segment
  • Missing values are represented by ?
  • @DATA
  • 5.1, 3.5, 1.4, 0.2, Iris-setosa
  • 4.9, ?, 1.4, ?, Iris-versicolor
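
The layout described above (header declarations, then a data segment, with ? marking missing values) can be illustrated with a small reader. This is only a sketch in plain Java, not WEKA's own parser: the class name `ArffSketch` and its methods are hypothetical, and it ignores quoted values, sparse rows, and type checking.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal ARFF reader sketch (hypothetical class, not part of WEKA).
class ArffSketch {
    String relation;
    final List<String> attributeNames = new ArrayList<>();
    final List<String[]> rows = new ArrayList<>();

    static ArffSketch parse(String text) {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // blank line or comment
            String lower = line.toLowerCase();
            if (inData) {
                // Every line after @DATA is one comma-separated instance.
                String[] cells = line.split(",");
                for (int i = 0; i < cells.length; i++) cells[i] = cells[i].trim();
                arff.rows.add(cells);
            } else if (lower.startsWith("@relation")) {
                arff.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                // The second whitespace-separated token is the attribute name.
                arff.attributeNames.add(line.split("\\s+")[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
        }
        return arff;
    }

    // A "?" cell denotes a missing value.
    static boolean isMissing(String cell) {
        return cell.equals("?");
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "@RELATION iris",
            "@ATTRIBUTE sepallength NUMERIC",
            "@ATTRIBUTE class {Iris-setosa,Iris-versicolor}",
            "@DATA",
            "5.1, Iris-setosa",
            "?, Iris-versicolor");
        ArffSketch arff = parse(text);
        System.out.println(arff.relation + " " + arff.attributeNames);
        System.out.println("first cell of row 2 missing: " + isMissing(arff.rows.get(1)[0]));
    }
}
```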

Sparse ARFF Files
  • Similar to ARFF files except that data values of 0
    are not represented
  • Non-zero attributes are specified by attribute
    number and value
  • For examples of ARFF files see $WEKAHOME/data

Standard format:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

Sparse format:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

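
The relationship between the two data listings above can be sketched as a conversion from a dense row to sparse notation. The class and method names are hypothetical, attribute numbers are assumed to start at 0 (as the example data suggests), and only the literal cell "0" is treated as a zero value.

```java
import java.util.StringJoiner;

// Dense-to-sparse row conversion sketch (hypothetical helper, not WEKA code).
class SparseRowSketch {
    // Emits "{index value, ...}" and skips every cell whose text is exactly "0".
    static String toSparse(String[] dense) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (int i = 0; i < dense.length; i++) {
            if (!dense[i].equals("0")) {
                joiner.add(i + " " + dense[i]);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparse(new String[] {"0", "X", "0", "Y", "\"class A\""}));
        System.out.println(toSparse(new String[] {"0", "0", "W", "0", "\"class B\""}));
    }
}
```
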
Running Learning Schemes
  • java <learner class> [options]
  • Example learner classes
  • C4.5: weka.classifiers.j48.J48
  • Naïve Bayes: weka.classifiers.NaiveBayes
  • KNN: weka.classifiers.IBk
  • Important generic options
  • -t <training file>: Specify training file
  • -T <test file>: Specify test file; if none, CV is
    performed on training data
  • -x <number of folds>: Number of folds for CV
  • -s <random number seed>: Random seed for CV
  • -l <input file>: Use saved model
  • -d <output file>: Output model to file
  • Invoking a learner without any options will list
    all the scheme-specific options
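
To make the roles of the fold count (-x) and the random seed (-s) concrete, here is a rough sketch of how a seeded shuffle can partition instance indices into folds. It is an assumption-laden illustration, not WEKA's procedure: the class name is hypothetical and no stratification is attempted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Seeded k-fold partition sketch (hypothetical, not WEKA's implementation).
class CrossValidationSketch {
    // Shuffles indices 0..n-1 with the given seed, then deals them round-robin into folds.
    static List<List<Integer>> folds(int n, int numFolds, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) result.add(new ArrayList<>());
        for (int pos = 0; pos < n; pos++) {
            result.get(pos % numFolds).add(indices.get(pos));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each fold serves once as the test set; the remaining folds form the training set.
        List<List<Integer>> folds = folds(10, 3, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + " test indices: " + folds.get(f));
        }
    }
}
```

The same seed always yields the same partition, which is what makes a cross-validation run repeatable.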

  • Summary of model if possible
  • Statistics on training data
  • Cross-validation statistics
  • Example
  • Output for numeric prediction is different
  • Correlation coefficient instead of accuracy
  • No confusion matrices
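
The statistics named above (accuracy and a confusion matrix for classification, a correlation coefficient for numeric prediction) can be computed as in the following sketch. The class and method names are hypothetical and the output format of WEKA itself is not reproduced; the correlation shown is the standard Pearson coefficient, which is assumed here to be what the slide refers to.

```java
// Evaluation statistics sketch (hypothetical helper, not WEKA's Evaluation class).
class EvaluationSketch {
    // Rows are actual classes, columns are predicted classes.
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    // Accuracy is the diagonal mass divided by the total count.
    static double accuracy(int[][] matrix) {
        int correct = 0, total = 0;
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                total += matrix[r][c];
                if (r == c) correct += matrix[r][c];
            }
        }
        return (double) correct / total;
    }

    // Pearson correlation between actual and predicted numeric values.
    static double correlation(double[] actual, double[] predicted) {
        int n = actual.length;
        double sumA = 0, sumP = 0;
        for (int i = 0; i < n; i++) { sumA += actual[i]; sumP += predicted[i]; }
        double meanA = sumA / n, meanP = sumP / n;
        double cov = 0, varA = 0, varP = 0;
        for (int i = 0; i < n; i++) {
            double da = actual[i] - meanA, dp = predicted[i] - meanP;
            cov += da * dp;
            varA += da * da;
            varP += dp * dp;
        }
        return cov / Math.sqrt(varA * varP);
    }

    public static void main(String[] args) {
        int[][] m = confusionMatrix(new int[] {0, 0, 1, 1}, new int[] {0, 1, 1, 1}, 2);
        System.out.println("accuracy = " + accuracy(m));
        System.out.println("correlation = " + correlation(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
    }
}
```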

Using Meta-Learners
  • java <meta-scheme class> -W <base-learner>
    meta-options -- base-options
  • The double minus sign (--) separates the two
    lists of options, e.g.
  • java weka.classifiers.Bagging -I 8 -W
    weka.classifiers.j48.J48 -t iris.arff -- -U
  • MultiClassClassifier allows you to use a binary
    classifier for multiclass data
  • java weka.classifiers.MultiClassClassifier -W
    weka.classifiers.SMO -t weather.arff
  • CVParameterSelection finds the best value for a
    specified parameter using CV
  • Use the -P option to specify the parameter and the
    space to search
  • -P "<param name> <starting value> <last value>
    <# of steps>", e.g.
  • java CVParameterSelection -W OneR -P "B 1 10 10"
    -t iris.arff (package prefixes omitted)
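
The role of the double minus sign can be sketched as a split of the argument array: everything before the first -- belongs to the meta-scheme, everything after it to the base learner. The class below is a hypothetical stand-alone illustration, not the option handling WEKA actually uses.

```java
import java.util.Arrays;

// Option-splitting sketch (hypothetical helper, not WEKA's option parser).
class OptionSplitSketch {
    // Returns {metaOptions, baseOptions}; with no "--", all options go to the meta-scheme.
    static String[][] split(String[] args) {
        int separator = Arrays.asList(args).indexOf("--");
        if (separator < 0) {
            return new String[][] { args, new String[0] };
        }
        return new String[][] {
            Arrays.copyOfRange(args, 0, separator),
            Arrays.copyOfRange(args, separator + 1, args.length)
        };
    }

    public static void main(String[] args) {
        String[] commandLine = {
            "-I", "8", "-W", "weka.classifiers.j48.J48", "-t", "iris.arff", "--", "-U"
        };
        String[][] parts = split(commandLine);
        System.out.println("meta options: " + Arrays.toString(parts[0]));
        System.out.println("base options: " + Arrays.toString(parts[1]));
    }
}
```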

Using Filters
  • Filters can be used to change data files, e.g.
  • delete first and second attributes
  • java weka.filters.AttributeFilter -R 1,2 -i
    iris.arff -o iris.new.arff
  • AttributeSelectionFilter lets you select a set of
    attributes using classes in the
    weka.attributeSelection package
  • java weka.filters.AttributeSelectionFilter -E
    weka.attributeSelection.InfoGainAttributeEval -i
    <input file>
  • Other filters
  • DiscretizeFilter: Discretizes a range of numeric
    attributes in the dataset into nominal
    attributes
  • NominalToBinaryFilter: Converts nominal attributes
    into binary ones, replacing each attribute
    with k values with k-1 new binary attributes
  • NumericTransformFilter: Transforms numeric
    attributes using a given method
  • (java weka.filters.NumericTransformFilter -C
    java.lang.Math -M sqrt)
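
The NumericTransformFilter example names a class (java.lang.Math) and a method (sqrt) on the command line. One plausible way to apply a method chosen by name is Java reflection, sketched below. This is an assumption about the general mechanism, not a copy of the filter's source, and the class name `NumericTransformSketch` is hypothetical.

```java
import java.lang.reflect.Method;

// Reflection-based numeric transform sketch (hypothetical, not WEKA's filter code).
class NumericTransformSketch {
    // Looks up a static method double -> double by name and applies it to each value.
    static double[] transform(String className, String methodName, double[] values) {
        try {
            Method method = Class.forName(className).getMethod(methodName, double.class);
            double[] result = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                // A null target is allowed because the method is static.
                result[i] = (Double) method.invoke(null, values[i]);
            }
            return result;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot apply " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        double[] out = transform("java.lang.Math", "sqrt", new double[] {4.0, 9.0, 16.0});
        for (double v : out) System.out.println(v);
    }
}
```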

The Instance Class
  • All attribute values are stored as doubles
  • Value of a nominal attribute is the index of the
    nominal value in the attribute definition
  • Some important methods
  • classAttribute(): Returns the class attribute
  • classValue(): Returns an instance's class value
  • value(int): Returns the specified attribute's value
    in internal format
  • enumerateAttributes(): Returns an enumeration of
    all the attributes
  • weight(): Returns the instance's weight
  • Instances is a collection of Instance objects
  • numInstances(): Returns the number of instances
    in the dataset
  • instance(int): Returns the instance at the given
    position
  • enumerateInstances(): Returns an enumeration of
    all instances in the dataset
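
The "everything is a double" convention can be illustrated without WEKA on the classpath. In this sketch a nominal value is encoded as its index in the declared value list, so a whole instance fits in one double array. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Internal-format sketch: nominal values become index doubles (hypothetical helper).
class InstanceEncodingSketch {
    // Returns the position of the value in the attribute's declared value list.
    static double encodeNominal(List<String> declaredValues, String value) {
        int index = declaredValues.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Undeclared nominal value: " + value);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> classValues =
            Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica");
        // Four numeric attributes followed by the encoded class attribute.
        double[] instance = {5.1, 3.5, 1.4, 0.2, encodeNominal(classValues, "Iris-versicolor")};
        System.out.println(Arrays.toString(instance));
        // Decoding maps the stored double back to the declared label.
        System.out.println(classValues.get((int) instance[4]));
    }
}
```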

Writing Classifiers
  • Import the following packages
  • import weka.classifiers.*;
  • import weka.core.*;
  • import java.util.*;
  • Extend Classifier
  • If predicting class probabilities then extend
    DistributionClassifier
  • Essential methods
  • buildClassifier(Instances): Generates a
    classifier
  • classifyInstance(Instance): Classifies a given
    instance
  • distributionForInstance(Instance): Predicts the
    class memberships
  • (for DistributionClassifier)
  • Interfaces that can be implemented
  • UpdateableClassifier: For incremental classifiers
  • WeightedInstancesHandler: If classifier can make
    use of instance weights

Example: ZeroR (Majority Class)
  • public class ZeroR extends DistributionClassifier
    implements WeightedInstancesHandler {
  • private double m_ClassValue; // The class value 0R predicts
  • private double[] m_Counts; // The number of instances in each class
  • public void buildClassifier(Instances instances)
    throws Exception {
  • m_Counts = new double[instances.numClasses()];
  • for (int i = 0; i < m_Counts.length; i++) // Initialize counts
  • m_Counts[i] = 1;
  • Enumeration enu = instances.enumerateInstances();
  • while (enu.hasMoreElements()) { // Add up class counts
  • Instance instance = (Instance) enu.nextElement();
  • m_Counts[(int) instance.classValue()] += instance.weight();
  • }
  • m_ClassValue = Utils.maxIndex(m_Counts); // Find majority class
  • Utils.normalize(m_Counts); // Normalize counts
  • }

Example: ZeroR - II
  • // Return index of the predicted class
  • public double classifyInstance(Instance instance) {
  • return m_ClassValue;
  • }
  • // Return predicted class probability distribution
  • public double[] distributionForInstance(Instance instance)
    throws Exception {
  • return (double[]) m_Counts.clone();
  • }
  • } // end of class ZeroR
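
The ZeroR slides depend on weka.jar, so the same majority-class logic is repeated here on plain arrays as a runnable sketch. The class `MajoritySketch` and its methods are hypothetical stand-ins for the WEKA classes and for Utils.maxIndex and Utils.normalize; the starting count of 1 per class follows the slide.

```java
import java.util.Arrays;

// Plain-array version of the ZeroR logic (hypothetical, for illustration only).
class MajoritySketch {
    // classValues[i] is the class index of instance i, weights[i] its weight.
    static double[] distribution(int[] classValues, double[] weights, int numClasses) {
        double[] counts = new double[numClasses];
        Arrays.fill(counts, 1.0);                 // initialize every class count to 1
        for (int i = 0; i < classValues.length; i++) {
            counts[classValues[i]] += weights[i]; // add up weighted class counts
        }
        double sum = 0;
        for (double c : counts) sum += c;
        for (int i = 0; i < counts.length; i++) counts[i] /= sum; // normalize to sum to 1
        return counts;
    }

    // Index of the largest entry; the first one wins on ties.
    static int maxIndex(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] dist = distribution(new int[] {0, 1, 1}, new double[] {1, 1, 1}, 2);
        System.out.println("distribution = " + Arrays.toString(dist));
        System.out.println("predicted class = " + maxIndex(dist));
    }
}
```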

WekaUT: Extensions to WEKA
  • Clusterers package
  • SemiSupClusterer: Interface for semi-supervised
    clustering
  • SeededEM, SeededKMeans: Implement
    SemiSupClusterer, have seeding
  • HAC, MatrixHAC: Implement bottom-up agglomerative
    clustering
  • ConsensusClusterer: Abstract class for consensus
    clustering
  • ConsensusPairwiseClusterer: Takes output of many
    clusterings, uses cluster collocation statistics
    as similarity values, applies clustering algo
  • CoTrainableClusterer: Performs co-trainable
    clustering, similar to Nigam's Co-EM
  • CVEvaluation: 10-fold cross-validation with
    learning curves, in transductive framework
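
One plausible reading of "cluster collocation statistics as similarity values" is a co-association matrix: the similarity of two items is the fraction of input clusterings that place them in the same cluster. The sketch below computes that matrix under this assumption. It is not the WekaUT implementation, and the class name is hypothetical.

```java
// Co-association matrix sketch (an assumed interpretation, not WekaUT code).
class CoAssociationSketch {
    // clusterings[c][i] is the cluster label of item i in clustering c.
    static double[][] similarity(int[][] clusterings) {
        int n = clusterings[0].length;
        double[][] together = new double[n][n];
        for (int[] labels : clusterings) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (labels[i] == labels[j]) together[i][j] += 1;
                }
            }
        }
        // Convert co-occurrence counts to fractions of the number of clusterings.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                together[i][j] /= clusterings.length;
            }
        }
        return together;
    }

    public static void main(String[] args) {
        double[][] s = similarity(new int[][] { {0, 0, 1}, {0, 1, 1} });
        for (double[] row : s) System.out.println(java.util.Arrays.toString(row));
    }
}
```

The resulting matrix could then be handed to any clustering algorithm that accepts pairwise similarities.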

WekaUT (contd.)
  • Metrics
  • Metric: Abstract class for metric
  • LearnableMetric: Abstract class for learnable
    distance metric
  • WeightedDotP: Learnable
  • WeightedL1Norm: Learnable
  • WeightedEuclid: Learnable
  • Mahalanobis metric: Uses Jama for matrix operations
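
As a rough illustration of what a weighted metric computes, the sketch below gives the usual textbook forms of a per-attribute weighted Euclidean distance and a weighted L1 distance. The weights are what a learnable metric would adjust. The exact formulas used by WekaUT are not stated on the slide, so these are assumptions, and the class name is hypothetical.

```java
// Weighted distance sketch (textbook forms, assumed; not WekaUT code).
class WeightedMetricSketch {
    // sqrt( sum_i w_i * (x_i - y_i)^2 )
    static double weightedEuclid(double[] w, double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    // sum_i w_i * |x_i - y_i|
    static double weightedL1(double[] w, double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += w[i] * Math.abs(x[i] - y[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] w = {1, 4};
        double[] x = {0, 0};
        double[] y = {3, 2};
        System.out.println("weighted Euclid = " + weightedEuclid(w, x, y));
        System.out.println("weighted L1 = " + weightedL1(w, x, y));
    }
}
```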

Making Weka Text-friendly
  • Preprocess text by making wrapper calls to:
  • Mooney's IR package: Tokenize, Porter stemming
  • McCallum's BOW package: Tokenize, Stem, TFIDF,
    Information-theoretic pruning, N-gram tokens,
    different smoothing algorithms
  • Fan's MC toolkit: Tokenize, TFIDF, pruning, CCS
  • No inverted index in Weka: OK if not doing IR,
    but KNN is inefficient
  • May want to integrate VSR package of IR with Weka
  • Probability underflow: currently have to do
    calculations with logs
  • NaiveBayes, KNN, etc.: Can have 2 versions of each
    (sparse, dense)
  • Sparse vector format
  • Weka's SparseInstance
  • IR's hashMapVector
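
The underflow point above can be demonstrated directly. Multiplying many small token probabilities collapses to 0.0 in double precision, while summing their logarithms stays finite. The sketch also includes the standard log-sum-exp step for combining log scores. The class name is hypothetical and nothing here is taken from WEKA's NaiveBayes.

```java
import java.util.Arrays;

// Log-space arithmetic sketch (hypothetical helper, not WEKA code).
class LogSpaceSketch {
    // Direct product; underflows to 0.0 for long documents.
    static double product(double[] probabilities) {
        double result = 1.0;
        for (double p : probabilities) result *= p;
        return result;
    }

    // Same quantity in log space: log(p1 * p2 * ...) = log p1 + log p2 + ...
    static double logProduct(double[] probabilities) {
        double sum = 0.0;
        for (double p : probabilities) sum += Math.log(p);
        return sum;
    }

    // log(exp(a1) + exp(a2) + ...) computed stably by factoring out the maximum.
    static double logSumExp(double[] logs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logs) max = Math.max(max, v);
        double sum = 0.0;
        for (double v : logs) sum += Math.exp(v - max);
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        double[] tokenProbabilities = new double[400];
        Arrays.fill(tokenProbabilities, 1e-3);
        System.out.println("direct product: " + product(tokenProbabilities));
        System.out.println("log product: " + logProduct(tokenProbabilities));
    }
}
```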

Weka's SparseInstance format
  • Non-zero attributes explicitly stated, 0 values
    not stated
  • @data
  • {1 the, 3 small, 6 boy, 9 ate, 13 the,
    17 small, 21 pie}
  • Strings are mapped to integer indices:
  • the 0
  • small 1
  • boy 2
  • ate 3
  • the 4
  • small 5
  • pie 6
  • Use StringToWordVectorFilter to convert text
    SparseInstance to word vector (in Weka 3-2-2)

Comparison of sparse vector formats
  • hashMapVector
  • Compact hashMap representation
  • Amortized constant-time access
  • Does not store position information, which may be
    necessary for future apps
  • Will need a lot of modification to Weka
  • SparseInstance
  • Efficient storage, in terms of indices of
    string values and position
  • Contains position information
  • Will not require any modification to Weka
  • Uses binary search to insert a new element
  • Would need filters for TF, IDF, token counts
  • Will require a hack to bypass soft-bug during
    multiple read-writes
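
The access-cost difference in this comparison can be sketched with two tiny containers: a sorted index array searched by binary search (the style the slide attributes to SparseInstance) and a hash map keyed by attribute index (assumed here to resemble hashMapVector). Both are hypothetical illustrations, not the actual Weka or IR package classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sorted-array sparse vector sketch with binary-search lookup and insert (hypothetical).
class SparseVectorSketch {
    int[] indices = new int[0];      // kept sorted
    double[] values = new double[0]; // values[k] belongs to indices[k]

    // O(log n) lookup; absent indices read as 0.
    double get(int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // Binary search finds the slot; insertion copies the arrays, so it costs O(n).
    void set(int index, double value) {
        int pos = Arrays.binarySearch(indices, index);
        if (pos >= 0) {
            values[pos] = value;
            return;
        }
        int ins = -pos - 1; // insertion point that keeps indices sorted
        int[] newIndices = new int[indices.length + 1];
        double[] newValues = new double[values.length + 1];
        System.arraycopy(indices, 0, newIndices, 0, ins);
        System.arraycopy(values, 0, newValues, 0, ins);
        newIndices[ins] = index;
        newValues[ins] = value;
        System.arraycopy(indices, ins, newIndices, ins + 1, indices.length - ins);
        System.arraycopy(values, ins, newValues, ins + 1, values.length - ins);
        indices = newIndices;
        values = newValues;
    }

    public static void main(String[] args) {
        SparseVectorSketch sorted = new SparseVectorSketch();
        sorted.set(5, 2.0);
        sorted.set(1, 3.0);
        System.out.println(Arrays.toString(sorted.indices) + " " + sorted.get(5));

        // Hash-map alternative: amortized constant-time access, no ordering kept.
        Map<Integer, Double> hashed = new HashMap<>();
        hashed.put(5, 2.0);
        hashed.put(1, 3.0);
        System.out.println(hashed.getOrDefault(3, 0.0));
    }
}
```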

Future Work
  • Write wrappers for existing C/C++ packages
  • mc, spkmeans, rainbow, svmlight, cluto
  • Data format converters, e.g. CCStoARFF
  • 10-fold CV evaluation with learning curves
  • inductive (modify Weka's)
  • transductive (use clusterer CV code)
  • Statistical tests, e.g. t-tests for classification
  • Cluster evaluation metrics
  • we have KL, MI, Pairwise
  • Making changes to handle text documents

Weka Problems
  • Internal variables are private
  • Should have protected or package-level access
  • SparseInstance for Strings requires a dummy at
    index 0
  • Problem
  • Strings are mapped to internal indices
  • String at position 0 is mapped to value 0
  • When written out as SparseInstance, it will not
    be written (0 value)
  • If read back in, the first String is missing
  • Solution
  • Put a dummy string in position 0 when writing a
    SparseInstance with strings
  • Dummy will be ignored while writing, actual
    instance will be written properly
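
The problem and the workaround can be simulated in a few lines. In this sketch a string value is stored as its index in a value table, and a sparse writer drops any attribute whose stored value is 0. That is the behaviour the slide describes, modelled with hypothetical names rather than taken from WEKA's source.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of the index-0 string problem (hypothetical model, not WEKA code).
class DummyStringSketch {
    // Registers the string in the table if needed, then writes attribute 0 in sparse form.
    // A stored index of 0 looks like a zero value, so the sparse writer omits it.
    static String writeSparse(List<String> valueTable, String text) {
        int index = valueTable.indexOf(text);
        if (index < 0) {
            valueTable.add(text);
            index = valueTable.size() - 1;
        }
        return index == 0 ? "{}" : "{0 " + text + "}";
    }

    public static void main(String[] args) {
        // Without a dummy, the first string gets index 0 and disappears on output.
        List<String> plainTable = new ArrayList<>();
        System.out.println(writeSparse(plainTable, "hello"));

        // With a dummy occupying index 0, the real string gets index 1 and is written.
        List<String> tableWithDummy = new ArrayList<>();
        tableWithDummy.add("dummy");
        System.out.println(writeSparse(tableWithDummy, "hello"));
    }
}
```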
