Transcript and Presenter's Notes

Title: G54DMT


1
G54DMT Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
  • Dr. Jaume Bacardit
  • jqb@cs.nott.ac.uk
  • Topic 1: Preliminaries

2
Outline of the topic
  • General data mining definitions and concepts
  • Handling datasets
  • Repositories of datasets
  • Experimental evaluation of data mining methods
  • Information theory
  • Playing a bit with Weka

3
Dataset structure
  • In most cases we will treat a dataset as a table
    with rows and columns

(Figure: a dataset represented as a table, with attributes as columns and instances as rows)
4
Instances and attributes
  • Instances are the atomic elements of information
    from a dataset
  • Also known as records or prototypes
  • Each instance is composed of a certain number of
    attributes
  • Also known as features or variables
  • In a dataset attributes can be of different types
  • Continuous (or Integer)
  • Discrete (i.e. being able to take a value from a
    finite set)

5
Supervised learning
  • Often a dataset has a special attribute, the
    class or output
  • If it does, the task of the data mining process
    consists of generating a model that can predict
    the class/output of a new instance based on the
    values of the rest of the attributes
  • In order to generate this model, we will use a
    corpus of data for which we already know the
    answer, the training set

6
Example of dataset
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
7
Process of supervised learning
8
Types of supervised learning
  • If the special attribute is discrete
  • We call it class
  • The dataset is a classification problem
  • If the special attribute is continuous
  • We call it output
  • The dataset is a regression problem
  • Also called modelling or function approximation

9
Rule Learning
  • CN2, RISE, GAssist, BioHEL

(Figure: rules partitioning the unit square of attributes X and Y, both ranging
from 0 to 1; each rule predicts the class marked in the plot)
If (X < 0.25 and Y > 0.75) or (X > 0.75 and Y < 0.25) then ?
If (X > 0.75 and Y > 0.75) then ?
If (X < 0.25 and Y < 0.25) then ?
Everything else ?
10
A decision tree (ID3/C4.5)
11
Linear Classification (Logistic Regression)
12
Support Vector Machines (SVM)
13
Unsupervised learning
  • When we do not have (or do not take into account)
    the class/output attribute
  • If the goal is to identify aggregations of
    instances
  • Clustering problem
  • If the goal is to detect strong patterns in the
    data
  • Association rules/Itemset mining

14
Clustering (K-Means, EM)
Partitional clustering
Hierarchical clustering
http://www.mathworks.com/matlabcentral/fx_files/19344/1/k_means.jpg
http://www.scsb.utmb.edu/faculty/luxon.htm
15
Association rules mining (Apriori, FP-growth)
Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
16
Reinforcement learning
  • When the system is given a reward or a
    punishment depending on whether its prediction
    was correct or not
  • The DM system is not told exactly what is
    right or wrong, just the reward
  • RL methods work prediction by prediction
  • This is why they are often called online
    systems

17
Dataset handling and format
  • In DM most datasets are represented as simple
    plain text files (or sometimes Excel sheets)
  • More sophisticated (and efficient) methods exist
  • We need to decide how to specify the dataset
    structure (i.e. the metadata) and the content
    (the instances)

18
ARFF format
  • ARFF: Attribute-Relation File Format
  • File format from the WEKA DM package
  • http://www.cs.waikato.ac.nz/ml/weka/arff.html
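As an illustration (not part of the original slides), a minimal ARFF file for the weather data shown on slide 6 might look like this:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes

The @attribute lines are the metadata (nominal attributes list their allowed values), and each line after @data is one instance.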

19
HDF5
  • Much more complex file format designed for
    scientific data handling
  • It can store heterogeneous and hierarchically
    organized data
  • It has been designed for efficiency
  • Presentation slides

20
Repositories of datasets
  • UCI repository
  • http://archive.ics.uci.edu/ml/
  • Probably the most famous collection of datasets
    in ML!
  • Currently has 235 datasets
  • Kaggle
  • It is not a static repository of datasets, but a
    site that manages Data Mining competitions
  • Example of the modern concept of crowdsourcing

21
Repositories of datasets
  • KDNuggets
  • http://www.kdnuggets.com/datasets/
  • PSPbenchmarks
  • http://www.infobiotic.net/PSPbenchmarks/
  • Datasets derived from Protein Structure
    Prediction problems
  • Interesting benchmarks because they can be
    parametrised in a very large variety of ways
  • Pascal Large Scale Learning Challenge
  • http://largescale.ml.tu-berlin.de/about/

22
Experimental Evaluation
  • We have Data Mining methods and datasets
  • How can we know that the patterns they extract
    are meaningful?
  • How can we know how well they perform and
    which method is the best?
  • We need to follow a principled protocol to make
    sure that the results and conclusions we extract
    are sound

23
Verifying that the model is sound
  • The dataset, before the data mining process, is
    partitioned into two non-overlapping parts
  • Training set. Will be used to generate the model
  • Test set. Will be used to validate the model
  • We can compute the performance metric (more on
    the next slides) on both sets. If the metric for
    the training set is much higher than on the test
    set, we have a case of overlearning (overfitting)
  • That is, the DM method has not modelled the
    problem, only the training set

24
Performance metrics: classification problems
  • Accuracy = C/D
  • C: number of correctly classified examples; D:
    size of the instance set
  • Simplest and most widespread metric
  • But what if some classes have more examples than
    others? The majority class would get more benefit
    out of this metric

25
Performance metrics: classification problems
  • Cohen's Kappa
  • Computes the agreement between two distributions
    of categorical variables (i.e. real and predicted
    classes)
  • Takes into account agreement by chance
  • Hence, it may be more suitable for multi-class
    problems
  • There are more metrics (e.g. ROC curves)

(Legend of the Kappa formula shown on the slide: C = number of classes,
x_i = examples from class i, x_ii = examples correctly classified from class i)
(García et al., 2009)
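A minimal R sketch (added for illustration, with made-up counts) of accuracy and Cohen's kappa computed from a 2x2 confusion matrix:

conf <- matrix(c(30, 10,
                  5, 55), nrow = 2, byrow = TRUE,
               dimnames = list(real = c("yes", "no"), predicted = c("yes", "no")))
N     <- sum(conf)
p_o   <- sum(diag(conf)) / N                        # observed agreement = accuracy
p_e   <- sum(rowSums(conf) * colSums(conf)) / N^2   # agreement expected by chance
kappa <- (p_o - p_e) / (1 - p_e)                    # Cohen's kappa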
26
Performance metrics
  • Regression problems
  • RMSD (root mean squared deviation)
  • Metrics for clustering problems
  • Association rule mining
  • Support (percentage of instances covered by the
    pattern)
  • Confidence (how often the consequent of the rule
    holds when its antecedent holds)
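For regression, the RMSD can be written in one line of R; pred and real below are hypothetical vectors of predicted and real outputs:

rmsd <- function(pred, real) sqrt(mean((pred - real)^2))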

27
Performance estimation methodologies
  • We have to partition the dataset into training
    and test, but how is this partitioning done?
  • If this partitioning is wrong we will not obtain
    a good estimation of the method's performance

28
Performance estimation methodologies: Holdout
  • Simply dividing the dataset into two
    non-overlapping sets (e.g. 2/3 of the dataset for
    training, 1/3 of the set for test)
  • The simplest method, and computationally cheap
  • The performance is computed on the test set
  • Its reliability greatly depends on how the sets
    are partitioned
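A small R sketch of a 2/3-1/3 holdout split (illustrative; 'data' is assumed to be a data frame with one instance per row):

n         <- nrow(data)
train_idx <- sample(n, size = round(2/3 * n))   # random 2/3 of the instances
training  <- data[train_idx, ]
test      <- data[-train_idx, ]                 # the remaining 1/3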

29
Performance estimation methodologies: K-fold
cross-validation
  • Divide the dataset into K strata
  • Iterate K times
  • Use sets 1.. K-1 for training and the set K for
    test
  • Use sets 1..K-2,K for training and the set K-1
    for test
  • etc
  • Compute the performance estimation as the
    average of the metric for each of the K test sets
  • More robust estimator, but computationally more
    costly
  • If each of the K strata has the same class
    distribution as the whole dataset, this method is
    called stratified K-fold cross-validation
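As an illustration (not from the original slides), a plain k-fold cross-validation loop in R; 'learner' and 'evaluate' are hypothetical placeholders for any training function and performance metric. A stratified version would additionally keep the class proportions of the whole dataset inside every fold.

kfold_cv <- function(data, k = 10, learner, evaluate) {
  # randomly assign every instance to one of the k folds
  folds <- split(sample(nrow(data)), rep(1:k, length.out = nrow(data)))
  scores <- sapply(folds, function(test_idx) {
    model <- learner(data[-test_idx, ])   # train on the other k-1 folds
    evaluate(model, data[test_idx, ])     # test on the held-out fold
  })
  mean(scores)                            # average over the k test folds
}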

30
Performance estimation methodologies
  • Leave-one-out cross-validation
  • Extreme case of cross-validation, where K = D,
    the size of the dataset
  • Training sets will have size D-1 and test sets
    will have only one instance
  • Mostly used in datasets of small size
  • Bootstrap
  • Generate a sample with replacement from the
    dataset with size D. Use it as training set
  • Use the instances never selected as test set
  • Repeat this process a high number of times

31
Performance estimation methodologies
  • But which method is best?
  • Two criteria to take into account
  • Bias: difference between the estimation and the
    true error
  • Variance of the sample of estimations
  • Formal and experimental analysis of some of these
    methods
  • Experimental study for microarray data

32
Who is the best?
  • At the end of an experimental study we will have
    tested N methods on D datasets, so we will have
    an N×D table of performance metrics
  • How can we identify the best method?
  • Highest average performance across the D
    datasets?
  • Highest average rank across the D datasets?
  • Also, is the best method significantly better
    than the others?
  • For this we need statistical tests (next slides)
  • Best study nowadays on statistical tests
  • Most of the tests described in the next slides
    can be found in many statistical packages (e.g.
    R)
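A possible R sketch of the average-rank computation (illustrative; 'perf' is assumed to be an N x M matrix with one row per dataset, one column per method, and higher values meaning better performance):

ranks    <- t(apply(perf, 1, function(row) rank(-row)))   # rank the methods within each dataset (1 = best)
avg_rank <- colMeans(ranks)                               # average rank of each method across datasets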

33
Statistical tests
  • Procedure for making decisions about data
  • Each test defines a hypothesis (H0) about the
    observed data
  • "The accuracy differences I observe between A and
    B are just due to chance, and the methods perform
    equally"
  • Or, in more statistical terms: "the two sets of
    observations A and B belong to the same
    distribution"
  • Then, the probability of observing differences at
    least this large if H0 were true is calculated
  • If the probability (p-value) is smaller than a
    certain threshold, H0 is rejected and the
    observed differences are statistically significant

34
Traditional approach: Student's t-test
  • Tests whether the difference between two
    distributions is significant or not
  • Generates a p-value, in this case the probability
    of observing a difference this large if the two
    distributions were actually the same
  • P-values of 0.05 or 0.01 are typical thresholds
  • Can only be applied if the distributions have the
    same variance and are normal. These conditions
    are hardly ever met

35
Wilcoxon test
  • A non-parametric test, which is not affected by
    the normality of the distribution
  • It ranks the absolute differences in performance,
    dataset by dataset
  • Next, it sums the ranks separately for the
    positive and negative differences
  • A p-value will be generated depending on which
    sum of ranks is smaller and the number of datasets

36
Multiple pair-wise comparisons?
  • Both the t-test and the Wilcoxon test are used to
    compare two methods
  • What if we have more than two methods in our
    comparison?
  • You can hold individually that A is better than B
    and better than C with 95% confidence, but can you
    hold that it is better than both B and C at the
    same time?
  • The p-values do not hold anymore
  • We need to apply a correction for multiple tests
    (e.g. Bonferroni, Holm, etc.)
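For illustration, base R's p.adjust applies such corrections to a vector of pairwise p-values (the values below are made up):

p_raw <- c(A_vs_B = 0.021, A_vs_C = 0.040, B_vs_C = 0.380)
p.adjust(p_raw, method = "holm")
p.adjust(p_raw, method = "bonferroni")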

37
The Friedman test
  • Designed explicitly to compare multiple methods
  • Based on the average rank of each method across
    the datasets
  • This test just says whether the performance of
    the methods included is similar or not
  • Can be used when there are more than 10 datasets
    and more than 5 methods
  • Once the test has determined that there are
    significant performance differences, a post-hoc
    test is used to spot them
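Base R also provides the test directly as friedman.test; a sketch assuming 'results' is a data frame with columns metric, method and dataset (one row per method/dataset pair):

friedman.test(metric ~ method | dataset, data = results)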

38
Two types of post-hoc test
  • Comparing every method to each other (e.g. the
    Nemenyi test)
  • Comparing a control method against the others
    (e.g. the Holm test)
  • E.g. the best method against the others
  • This latter kind of test gives less information
    but is also more powerful (i.e. less
    conservative)
  • Both are based on ranks

39
Using R to compute statistical tests
  • R is an open source statistical computing package
  • It is extremely powerful, but not the simplest
    tool in the world to use
  • It has built-in functions for most of the tests
    described in the previous slides
  • Example from the command line.

40
Data for the test
  • A plain text file with rows consisting of
  • <method> <dataset> <performance metric>
  • Example file
  • CV has already been computed
  • Let's load the data in R

R
> data <- read.table("rchA.dat")
> data
    V1        V2       V3
1 C4_5 rchA_1_q2 0.664884
2 C4_5 rchA_2_q2 0.690657
3 C4_5 rchA_3_q2 0.731942
4 C4_5 rchA_4_q2 0.756022
5 C4_5 rchA_5_q2 0.733984
41
T-test
> pairwise.t.test(data$V3, data$V1, pool.sd=F, paired=T)

        Pairwise comparisons using t tests with non-pooled SD

data:  data$V3 and data$V1

    C4_5    HEL     LCS
HEL 0.00056 -       -
LCS 0.00038 0.53291 -
NB  0.53291 0.52834 0.53291

P value adjustment method: holm
42
Wilcoxon test
> pairwise.wilcox.test(data$V3, data$V1, paired=T)

        Pairwise comparisons using Wilcoxon rank sum test

data:  data$V3 and data$V1

    C4_5    HEL     LCS
HEL 0.00038 -       -
LCS 0.00032 0.72586 -
NB  0.72586 0.72586 0.72586

P value adjustment method: holm

> pairwise.wilcox.test(data$V3, data$V1, paired=T)$p.value
            C4_5       HEL       LCS
HEL 0.0003814697        NA        NA
LCS 0.0003204346 0.7258606        NA
NB  0.7258605957 0.7258606 0.7258606

> write.table(pairwise.wilcox.test(data$V3, data$V1, paired=T)$p.value, "wilcox.dat")
Saving the results to a file
43
Friedman and post hoc tests
  • Code for the Holm and Nemenyi post hoc tests
  • Code for accuracy (between 0 and 1)

gt source("friedholm.r) gt fried.holm("rchA.dat")
Friedman rank sum test data dataV3, dataV1
and dataV2 Friedman chi-squared 22.4667, df
3, p-value 5.216e-05 1 "Average ranks"
C4_5 HEL LCS NB 3.666667
1.722222 2.166667 2.444444 1 "Control is
HEL" 1 "Confidence level of 0.950000" 1
"HEL is significantly better than C4_5" 1
"Confidence level of 0.990000" 1 "HEL is
significantly better than C4_5"
44
Evaluation pipeline (summary)
  • N datasets, M methods
  • Method can mean anything (mining, preprocessing,
    etc.)
  • For each of the N datasets
  • Partition it into training and test sets
  • For each of the K pairs of (Training_kn, Test_kn)
    sets
  • For each of the M methods
  • model = train method M using set Training_kn
  • Metric_NMK = test model using set Test_kn
  • Metric_NM = average across the K pairs
  • Compute the average rank of each method across
    datasets
  • Run statistical tests using the Metric matrix
  • Are there overall statistically significant
    performance differences?
  • Individual tests: one-vs-rest or all-vs-all

45
Basic Information Theory
  • Shannon's Entropy
  • Measure of the information content in discrete
    data
  • Metric inspired by statistical information
    transmission
  • How many bits per symbol (on average) would it
    take to transmit a message given the frequencies
    of each symbol?

46
Shannon's Entropy
  • If we are tossing a coin and want to send a
    sequence of tosses
  • If heads and tails have the same probability it
    would take one bit per toss
  • If heads have a very small probability compared
    to tails we could just send a list of the heads
    and say that everything else is a tail
  • Each head message would be long, but there would
    be very few of them, so overall it would take
    less than one bit per symbol to transmit the
    sequence (on average)
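A small R sketch of Shannon's entropy (in bits) that reproduces the coin intuition above (added for illustration):

entropy <- function(x) {
  p <- table(x) / length(x)      # symbol frequencies
  -sum(p * log2(p))              # bits per symbol
}
entropy(c("H", "T", "H", "T"))   # fair coin: 1 bit per toss
entropy(c(rep("T", 15), "H"))    # very biased coin: well below 1 bit per toss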

47
Other information theory metrics
  • Conditional Entropy
  • Entropy of a certain variable given that we know
    all about a related variable
  • Information gain
  • How many bits would I save from transmitting Y if
    I know about X?
  • IG(Y|X) = H(Y) - H(Y|X)
  • Mutual information
  • Assesses the interrelationship between two
    variables

48
Why is it useful in data mining?
  • We have a continuous variable where each data
    point is associated with a class
  • 10 red dots, 10 blue dots
  • H(X) = -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1
  • What if we split this variable in two?
  • Where to split?
  • The point where we obtain the maximum IG, using
    the splitting point as the conditioning variable
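A sketch in R (illustrative) of the information gain of a candidate cut point, where 'x' is the continuous variable and 'y' the vector of class labels:

info_gain <- function(x, y, cut) {
  H     <- function(labels) { p <- table(labels) / length(labels); -sum(p * log2(p)) }
  left  <- y[x < cut]
  right <- y[x >= cut]
  H(y) - (length(left)  / length(y) * H(left) +
          length(right) / length(y) * H(right))
}
# the best split is the candidate cut with maximum info_gain(x, y, cut)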

49
Different cut points
  • H(X_left) = 0.469, H(X_right) = 0.469, IG = 0.531
  • H(X_left) = 0, H(X_right) = 0.863, IG = 0.384
  • H(X_left) = 0.863, H(X_right) = 0, IG = 0.384

50
Let's play a bit with Weka
  • Download the WEKA GUI from here
  • Dataset for this demo
  • What I am going to show are just some of the
    steps from this tutorial

51
Weka from the command line.
java weka.classifiers.trees.J48 -t dataset.arff

J48 pruned tree
------------------
att16 <= 1.28221
|   att1 <= 1.33404
|   |   att3 <= 1.37681: 3 (6.0)
|   |   att3 > 1.37681: 2 (4.0/1.0)
|   att1 > 1.33404
|   |   att17 <= 1.52505
|   |   |   att5 <= 1.34135: 3 (3.0)
|   |   |   att5 > 1.34135
|   |   |   |   att8 <= 1.3391: 3 (4.0/1.0)
|   |   |   |   att8 > 1.3391: 2 (19.0/1.0)
|   |   att17 > 1.52505: 2 (18.0)
att16 > 1.28221: 3 (938.0/456.0)

Number of Leaves  : 7
Size of the tree  : 13

Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.05 seconds

=== Error on training data ===

Correctly Classified Instances         533               53.7298 %
Incorrectly Classified Instances       459               46.2702 %
Kappa statistic                          0.0746
Mean absolute error                      0.4774
Root mean squared error                  0.4885
Relative absolute error                 95.4706 %
Root relative squared error             97.7091 %
Total Number of Instances              992

=== Confusion Matrix ===

   a   b   <-- classified as
  39 457 |   a = 2
   2 494 |   b = 3

=== Stratified cross-validation ===

Correctly Classified Instances         491               49.496  %
Incorrectly Classified Instances       501               50.504  %
Kappa statistic                         -0.0101
Mean absolute error                      0.4992
Root mean squared error                  0.5111
Relative absolute error                 99.8304 %
Root relative squared error            102.214  %
Total Number of Instances              992

=== Confusion Matrix ===

   a   b   <-- classified as
 141 355 |   a = 2
 146 350 |   b = 3
52
KEEL
  • Knowledge Extraction based on Evolutionary Learning
  • Another data mining platform with integrated
    graphical design of experiments
  • Download prototype
  • Manual
  • Instructions about integrating new methods into
    it (slides, paper)

53
Questions?