Title: Stream Data Classification Lecture - 1


1
Stream Data Classification Lecture - 1
  • M. Mehedy Masud

2
Presentation Overview
  • Classification
  • Stream Data
  • Data Selection
  • Ensemble
  • Our Approach
  • Results

3
Classification
4
An Example
Classification
  • (from Pattern Classification by Duda, Hart, and
    Stork, Second Edition, 2001)
  • A fish-packing plant wants to automate the
    process of sorting incoming fish according to
    species
  • As a pilot project, it is decided to try to
    separate sea bass from salmon using optical
    sensing

5
An Example (continued)
Classification
  • Features (to distinguish)
  • Length
  • Lightness
  • Width
  • Position of mouth

6
An Example (continued)
Classification
  • Preprocessing: images of different fish are
    isolated from one another and from the background
  • Feature extraction: the information for a single
    fish is then sent to a feature extractor, which
    measures certain features or properties
  • Classification: the values of these features are
    passed to a classifier that evaluates the
    evidence presented and builds a model to
    discriminate between the two species

7
An Example (continued)
Classification
  • Domain knowledge
  • A sea bass is generally longer than a salmon
  • Related feature (or attribute):
  • Length
  • Training the classifier
  • Some examples are provided to the classifier in
    this form: <fish_length, fish_name>
  • These examples are called training examples
  • The classifier learns from the training examples
    how to distinguish salmon from sea bass based on
    fish_length

8
An Example (continued)
Classification
  • Classification model (hypothesis)
  • The classifier generates a model from the
    training data to classify future examples (test
    examples)
  • An example of the model is a rule like this:
  • If length > l then sea bass, otherwise salmon
  • Here the value of l is determined by the
    classifier
  • Testing the model
  • Once we get a model out of the classifier, we may
    use it to classify future examples
  • The test data is provided in the form
    <fish_length>
  • The classifier outputs <fish_type> by checking
    fish_length against the model (a minimal sketch
    of this loop follows below)
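
The whole loop above is small enough to sketch directly. A minimal
Python sketch, assuming a brute-force search over the observed lengths
to pick the threshold l (the slides do not say how l is chosen):

  # Learn a length threshold l from <fish_length, fish_name> examples,
  # then classify unseen lengths against the resulting rule.
  def train(examples):
      best_l, best_acc = None, -1.0
      for l, _ in examples:  # candidate thresholds: the observed lengths
          acc = sum((length > l) == (name == "sea bass")
                    for length, name in examples) / len(examples)
          if acc > best_acc:
              best_l, best_acc = l, acc
      return best_l

  def classify(l, fish_length):
      return "sea bass" if fish_length > l else "salmon"

  # Training data from the worked example on slide 10:
  model = train([(12, "salmon"), (15, "sea bass"),
                 (8, "salmon"), (5, "sea bass")])
  print(classify(model, 15))  # -> sea bass

On this data the search settles on l = 12, matching the rule
"if len > 12, then sea bass else salmon" on slide 10.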

9
An Example (continued)
Classification
  • So the overall classification process goes like
    this:

[Diagram: training data passes through preprocessing and feature
extraction to produce feature vectors, which are used for training and
yield a model; test/unlabeled data passes through the same
preprocessing and feature extraction, and its feature vectors are
tested against the model to produce a prediction/evaluation.]
10
An Example (continued)
Classification
[Diagram: worked example.
Training: labeled data (12, salmon; 15, sea bass; 8, salmon;
5, sea bass) goes through pre-processing and feature extraction to
produce feature-vector training data, which trains the model:
"If len > 12, then sea bass else salmon".
Testing: test data (15, salmon; 10, salmon; 18, ?; 8, ?) goes through
the same pre-processing and feature extraction; classifying the
unlabeled feature vectors against the model yields the
predictions/evaluations: sea bass (error!), salmon (correct),
sea bass, salmon.]
11
An Example (continued)
Classification
  • Why error?
  • Insufficient training data
  • Too few features
  • Too many/irrelevant features
  • Overfitting / specialization

12
An Example (continued)
Classification
13
An Example (continued)
Classification
  • New Feature
  • Average lightness of the fish scales

14
An Example (continued)
Classification
15
An Example (continued)
Classification
[Diagram: worked example with two features.
Training: labeled data (12, 4, salmon; 15, 8, sea bass; 8, 2, salmon;
5, 10, sea bass) goes through pre-processing and feature extraction to
produce feature-vector training data, which trains the model:
"If ltns > 6 or len*5 + ltns*2 > 100 then sea bass else salmon".
Testing: test data (15, 2, salmon; 10, 7, salmon; 18, 7, ?; 8, 5, ?)
goes through the same pre-processing and feature extraction;
classifying the feature vectors against the model yields the
predictions/evaluations: salmon (correct), salmon (correct), sea bass,
salmon.]
16
Terms
Classification
  • Accuracy
  • % of test data correctly classified (computed in
    the sketch below)
  • In our first example, accuracy was 3 out of 4 =
    75%
  • In our second example, accuracy was 4 out of 4 =
    100%
  • False positive
  • Negative class incorrectly classified as positive
  • Usually, the larger class is the negative class
  • Suppose
  • salmon is the negative class
  • sea bass is the positive class
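
These terms are straightforward to compute. A small Python sketch,
using the slide's convention that salmon is the negative class and
sea bass the positive class:

  # Accuracy plus false positive / false negative counts over
  # (predicted, actual) pairs; sea bass = positive, salmon = negative.
  def metrics(pairs):
      correct = sum(pred == actual for pred, actual in pairs)
      fp = sum(pred == "sea bass" and actual == "salmon"
               for pred, actual in pairs)
      fn = sum(pred == "salmon" and actual == "sea bass"
               for pred, actual in pairs)
      return correct / len(pairs), fp, fn

  acc, fp, fn = metrics([("sea bass", "salmon"),    # a false positive
                         ("salmon", "salmon"),
                         ("sea bass", "sea bass"),
                         ("salmon", "sea bass")])   # a false negative
  # acc = 0.5, fp = 1, fn = 1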

17
Terms
Classification
[Figure: feature space with a decision boundary, marking the false
positive and false negative regions]
18
Terms
Classification
  • Cross validation (3 fold): the data is split
    into three folds; each fold serves once as the
    test set while the other two are used for
    training (see the sketch below)

Fold 1: Testing  | Training | Training
Fold 2: Training | Testing  | Training
Fold 3: Training | Training | Testing
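
In code, the scheme is a short loop. A Python sketch, where train_fn
and accuracy_fn stand in for whatever learner and scoring function are
used:

  # k-fold cross validation: each fold is held out once as the test set
  # while the remaining folds form the training set; scores are averaged.
  def cross_validate(data, train_fn, accuracy_fn, k=3):
      folds = [data[i::k] for i in range(k)]  # k disjoint partitions
      scores = []
      for i in range(k):
          test = folds[i]
          train = [ex for j, fold in enumerate(folds)
                   if j != i for ex in fold]
          model = train_fn(train)
          scores.append(accuracy_fn(model, test))
      return sum(scores) / k  # average accuracy across the k folds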
19
Stream data
20
Problem Description
Stream Data
  • Suppose we have a continuous flow of data
  • For example, a network server is always receiving
    some data
  • We would like to detect intrusions / attacks in
    the data
  • Classification problem
  • Is the incoming data to the server an attack or
    normal?
  • How do we solve this classification problem?

21
Problem Formulation
Stream Data
  • Distinguish normal traffic from attack traffic
  • Identify important features from domain knowledge
  • Extract features from the data
  • Prepare training data
  • Train a classifier
  • Classify future data

22
Problem Formulation (cont)
Stream Data
  • Problem I
  • How much data should be used for training?
  • Train with the first t hours of data only?
  • What if no attack appears during the first t
    hours?
  • What if the first t hours of data were only
    attacks?

[Diagram: timeline from 0 to t to now; data in [0, t] is used for
training, and data arriving after t is tested against the model]
23
An example
Stream Data
  • Trojan.Peacomm attack

24
Problem Formulation (cont)
Stream Data
  • Possible solution
  • Use all data up to now for training
  • Problem II
  • Can't store unlimited data
  • Can't train a classifier with a large volume of
    data
  • Possible solution
  • Choose only a subset of the data for training

25
Problem Formulation (cont)
Stream Data
  • Problem II
  • Can't store unlimited data
  • Can't train a classifier with a large volume of
    data
  • Possible solution
  • Divide the data stream into chunks (e.g., 1 hour
    of data)
  • Selectively add new data chunks to the training
    set (how? see the sketch below)

[Diagram: the stream divided into chunk1, chunk2, chunk3, ...]
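
A Python sketch of this chunking idea; the chunk size (standing in for
"1 hour of data") and the admission policy (the slide's open "how?")
are assumptions:

  # Split an unbounded stream into fixed-size chunks and keep a bounded
  # training set by selectively admitting chunks and evicting the oldest.
  def chunks(stream, size=1000):  # size stands in for one hour of data
      buf = []
      for record in stream:
          buf.append(record)
          if len(buf) == size:
              yield buf
              buf = []

  def maintain_training_set(stream, should_add, max_chunks=10):
      training_set = []
      for chunk in chunks(stream):
          if should_add(chunk):          # selection policy left open here
              training_set.append(chunk)
              if len(training_set) > max_chunks:
                  training_set.pop(0)    # discard the oldest chunk
      return training_set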


26
Problem Formulation (cont)
Stream Data
  • Problem III: concept drift
  • The concept (i.e., the characteristics of the
    classes) may change over time
  • For example, the characteristics (length,
    lightness) of salmon and sea bass may change over
    thousands or millions of years
  • Thus, old training data becomes outdated and
    should be discarded
  • Solution: selectively discard old training data
    (how?)

27
Systematic data selection
  • Source: W. Fan, "Systematic data selection to
    mine concept-drifting data streams," in Proc.
    KDD '04.

28
Data Selection Problem
Systematic Data Selection
  • In the presence of concept drift, which data
    should be used to train the classifier?
  • Use all data? Discard the oldest? Select at
    random?

29
Data Selection Problem
Systematic Data Selection
  • Concept drift
  • Si is the data received at time stamp i
  • FOi(x) is its optimal model
  • Let FOi-1(x) be the optimal model at time stamp
    i-1
  • We say that there is concept drift from time
    stamp i-1 to time stamp i if there exists some x
    such that
  • FOi(x) ≠ FOi-1(x) (see the sketch below)
  • Data sufficiency
  • Training data is sufficient if adding more data
    to the training set does not improve
    classification accuracy
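
In code, the drift condition is just a disagreement test between the
two optimal models. A sketch, with the models written as plain Python
functions:

  # Concept drift from time stamp i-1 to i: the two optimal models
  # disagree on at least one example x (the definition on this slide).
  def concept_drift(fo_prev, fo_curr, xs):
      return any(fo_curr(x) != fo_prev(x) for x in xs)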

30
Will Old Data Help?
Systematic Data Selection
  • Underlying model does not change (no concept
    drift)
  • Old data will help if the recent data is
    insufficient
  • Overfitting does not occur
  • Underlying model does change
  • Let SP = S1 ∪ ... ∪ Si-1
  • The data in SP can fall into any of three
    categories
  • 1. FOi(x) ≠ FOi-1(x) (disagree)
  • 2. FOi(x) = FOi-1(x) = y (agree and correct)
  • 3. FOi(x) = FOi-1(x) ≠ y (agree but wrong)

31
Will Old Data Help? (cont)
Systematic Data Selection
  • 1. FOi(x) ≠ FOi-1(x) (disagree)
  • 2. FOi(x) = FOi-1(x) = y (agree and correct)
  • 3. FOi(x) = FOi-1(x) ≠ y (agree but wrong)

[Figure: the old data partitioned into regions labeled 1, 2, and 3,
corresponding to the three categories above]
32
Scenario-I
Systematic Data Selection
  • New data is sufficient by itself and there is no
    concept drift
  • Optimal model: the one trained with the new data
    only
  • The optimal model may also be the old model if
    the old data was sufficient
  • Problem: we may never know whether the data is
    sufficient or whether there is concept drift
  • What if we
  • Train a new model from the new data
  • Train another model from the combined new and old
    data
  • Compare both with the original old model

33
Scenario-II
Systematic Data Selection
  • New data is sufficient by itself and there is
    concept drift
  • Optimal model: the one trained with the new data
    only
  • Problem: we may never know whether the data is
    sufficient or whether there is concept drift

34
Scenario-III
Systematic Data Selection
  • New data is insufficient by itself and there is
    no concept drift
  • Optimal model: if the previous data is
    sufficient, then the existing model
  • Optimal model: if the previous data is not
    sufficient, then
  • Train a new model from the new data plus the
    existing data
  • Choose the one with the higher accuracy

35
Scenario-IV
Systematic Data Selection
  • New data is insufficient by itself and there is
    concept drift
  • Optimal model: not obtainable from the new data
    alone
  • Choose only those examples from previous data
    chunks that
  • Have a concept consistent with the new data chunk
  • And combine those examples with the new data

36
Computing Optimal Model
Systematic Data Selection
  • The optimal model is different under different
    situations
  • The choice depends on whether the data is
    sufficient and whether there is concept drift
  • Solution
  • Compare a few plausible models statistically
  • Choose the one with the highest accuracy
  • Notation
  • FN(x): a new model trained from recent data
  • FO(x): the optimal model finally chosen

37
Computing Optimal Model (cont)
Systematic Data Selection
  • 1. Train a model FNi(x) from the new data chunk.
  • 2. Let Di-1 be the dataset that trained the most
    recent optimal model FOi-1(x)
  • Di-1 may not be the most recent data chunk Si-1
  • How Di-1 is obtained will be discussed shortly
  • Select the examples from Di-1 on which both
  • the model FNi(x) and
  • the model FOi-1(x) make correct predictions
  • Call these examples si-1
  • That is, si-1 = {(x, y) ∈ Di-1 such that
    FNi(x) = FOi-1(x) = y}

38
Computing Optimal Model (cont)
Systematic Data Selection
  • 3. Train a model F'Ni(x) from the new data chunk
    plus the data selected in the last step, i.e.,
    from Si ∪ si-1
  • 4. Update the most recent model FOi-1(x) with Si
    and call this model F'Oi-1(x), i.e., F'Oi-1(x)
    is trained from Di-1 ∪ Si
  • 5. Compare the accuracies of all four models:
    FOi-1(x), F'Oi-1(x), FNi(x), F'Ni(x)
  • Using cross-validation, and
  • Choose the one that is the most accurate
  • Call it FOi(x)

39
Computing Optimal Model (cont)
Systematic Data Selection
  • 6. Di is the training set that produced FOi(x).
    It is one of the following (a combined sketch of
    steps 1-6 follows below):
  • Si
  • Di-1
  • Si ∪ si-1
  • Si ∪ Di-1
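
Steps 1-6 combine into the following Python sketch; train_fn and
cv_accuracy are assumed helpers (the paper uses decision trees as the
learner, and the cross-validation details are abstracted away):

  # Compute the optimal model FO_i and its training set D_i
  # (steps 1-6 from slides 37-39).
  def select_optimal(S_i, D_prev, FO_prev, train_fn, cv_accuracy):
      FN = train_fn(S_i)                       # 1. model from the new chunk
      s_prev = [(x, y) for (x, y) in D_prev    # 2. examples that both FN and
                if FN(x) == FO_prev(x) == y]   #    FO_{i-1} predict correctly
      FN_plus = train_fn(S_i + s_prev)         # 3. F'N from S_i U s_{i-1}
      FO_plus = train_fn(D_prev + S_i)         # 4. F'O from D_{i-1} U S_i
      candidates = [(FO_prev, D_prev),         # 5. cross-validated accuracy
                    (FO_plus, D_prev + S_i),   #    decides among the four
                    (FN, S_i),
                    (FN_plus, S_i + s_prev)]
      FO_i, D_i = max(candidates, key=lambda c: cv_accuracy(c[0], c[1]))
      return FO_i, D_i                         # 6. D_i is FO_i's training set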

40
Scenarios, Revisited
Systematic Data Selection
  • 1. New data is sufficient by itself and there is
    no concept change.
  • Conceptually, FNi(x) should be the optimal model.
  • However, F'Ni(x), FOi-1(x), and F'Oi-1(x) could
    be close matches, since there is no concept
    change.
  • 2. New data is sufficient by itself and there is
    concept change.
  • Obviously, FNi(x) should be the optimal model.
  • However, F'Ni(x) could be very similar in
    performance to FNi(x)

41
Scenarios, Revisited (continued)
Systematic Data Selection
  • 3. New data is insufficient by itself and there
    is no concept change
  • The optimal model should be either FOi-1(x) or
    F'Oi-1(x).
  • 4. New data is insufficient by itself and there
    is concept change.
  • The optimal model should be either FNi(x) or
    F'Ni(x).

42
Data Set
Systematic Data Selection
  • Synthetic data
  • Each data point is a d-dimensional vector
    (x1, ..., xd) where each xi ∈ [0, 1]
  • Concept drift is achieved by a moving hyperplane
  • Equation of the hyperplane:
    a1x1 + a2x2 + ... + adxd = a0
  • The weights ai are changed at a certain rate

43
Data Set (continued)
Systematic Data Selection
  • Synthetic data (continued)
  • Parameters (see the generator sketch below)
  • d: dimension = 10
  • t: rate of change of the weights
  • Each weight is changed with the formula
    ai ← ai + si·t/N
  • N = 1000
  • k: how many dimensions to change (varied from
    20% to 50%)
  • s: direction of change (randomly reversed)
  • p: noise, set to 5%
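
A Python sketch of this generator; the labeling rule (positive when
the weighted sum exceeds half the weight total, keeping the classes
balanced), the value of t, and the occasional direction reversal are
assumptions layered on the slide's parameters:

  import random

  d, N, t, p = 10, 1000, 0.1, 0.05  # slide values; t = 0.1 is an assumption
  a = [random.random() for _ in range(d)]         # hyperplane weights
  s = [random.choice((-1, 1)) for _ in range(d)]  # directions of change

  def labeled_point():
      x = [random.random() for _ in range(d)]     # each x_i drawn from [0, 1]
      y = int(sum(ai * xi for ai, xi in zip(a, x)) >= sum(a) / 2)
      if random.random() < p:                     # inject p = 5% class noise
          y = 1 - y
      return x, y

  def drift(k=3):                                 # change k of the d dimensions
      for i in random.sample(range(d), k):
          a[i] += s[i] * t / N                    # a_i <- a_i + s_i * t / N
          if random.random() < 0.1:               # randomly reverse direction
              s[i] = -s[i]                        #   (an assumption)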

44
Data Set (continued)
Systematic Data Selection
  • Credit card fraud data (real)
  • Sampled from credit card transaction records
    within a one-year period
  • Contains a total of 5 million transactions
  • Features
  • Time
  • Merchant type, location
  • Past payments
  • Summary of transaction history, etc.

45
Experiments
Systematic Data Selection
  • Comparison with other methods
  • G1: decision tree trained from the new data chunk
    only
  • GA: decision tree trained from all data
  • Gi: single decision tree trained from the most
    recent i data chunks
  • Ei: decision tree ensemble trained from the most
    recent i data chunks, one tree per chunk

46
Results
Systematic Data Selection
47
Criticism
Systematic Data Selection
  • Quote:
  • "will the training data Di become unnecessarily
    large? The answer is no. Di only grows in size
    (or includes older data) if and only if the
    additional data helps improve accuracy."
  • Although it is claimed
  • that the training data will not grow large,
  • there is no guarantee that it will not exceed
    memory/system limitations
  • Can we do better?
  • Store models rather than data

48
Conclusion
Systematic Data Selection
  • Concept drift is a major problem in stream data
    mining
  • Systematic selection of data works better than
    random data selection
  • However, there is no guarantee that the data will
    not grow beyond an acceptable limit

49
Ensemble methods for stream data classification
  • Up Next