Traffic Classification based on Machine Learning using Flow-level Information

1
Traffic Classification based on Machine Learning
using Flow-level Information
  • Jong Gun Lee (jglee_at_an.kaist.ac.kr)
  • Advanced Networking Lab.

2
Table of Contents
  • Motivation of this work
  • Background about machine learning
  • Our approach using machine learning
  • Experiment (dataset and result)
  • Conclusion

3
Motivation
  • We cannot effectively classify the traffic of some
    newly emerging applications, such as online games and
    streaming applications,
  • because they expose no application information, such
    as a well-known port number or a common byte sequence
    in the payload

We propose a methodology to classify Internet
traffic with supervised and unsupervised learning

4
Basic Terminologies of Machine Learning
  • Classifier: a mapping from unlabeled instances to classes
  • Instance: a single object of the world
  • Attribute: a single quantity describing an instance
  • Feature: the specification of an attribute and its value
  • Feature vector: a list of features describing an instance

5
Unsupervised and Supervised Learning
  • Supervised learning (with answer/teacher)
  • Given a labeled training set, a classifier learns the
    characteristics of each class. When a new instance
    arrives, the classifier predicts the class of that
    instance.
  • Unsupervised learning (without answer/teacher)
  • Given only a set of data (feature vectors), the
    learner groups the instances into a set of clusters.

6
K-Means
  • One of the unsupervised learning methods
  • K is the number of clusters; it is given as an
    initial parameter
  • Procedure
  • First, the algorithm randomly chooses K points
    as the centers of K subspaces
  • Second, it divides the overall vector space into
    K subspaces by assigning each point to its nearest center
  • Third, it computes a new center for each of the
    K subspaces
  • Then it repeats the 2nd and 3rd steps until no
    center changes, or until every center moves by less
    than a threshold value
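
The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the presentation: the function names, the Euclidean distance, the `threshold` and `max_iter` defaults, and the rule of keeping an empty cluster's old center are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Step 1: randomly choose K points as the initial centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its members
        # (assumption: an empty cluster keeps its previous center).
        new_centers = [
            mean_point(members) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        # Stop once no center moved farther than the threshold.
        moved = max(sq_dist(a, b) ** 0.5 for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    return centers, clusters

# Eight 2-D instances and K = 2, two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centers, clusters = kmeans(points, k=2)
```

Because the initial centers are random, K-Means can in general converge to different clusterings on different runs; fixing `seed` only makes this sketch repeatable.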

7
Example of K-Means
  • Number of instances: 8, K = 2

8
Overall Process of Our Method
(Diagram labels: Unsupervised Learning, Feature Extraction, K Clusters, Classification Method, Supervised Learning)

9
Flow-level Feature Information
  • Protocol number: 6 (TCP) or 17 (UDP)
  • Duration (seconds)
  • Number of packets per second (PPS)
  • Mean size of all packets
  • Mean size of non-ACK packets
  • Rate of ACK packets
  • Interaction information
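
As a rough illustration of how the scalar features in this list could be computed from the packets of one flow, here is a small Python sketch. The `Packet` record, the `is_ack` flag, the reading of "rate of ACK packets" as the fraction of ACK packets, and the handling of zero-duration flows are assumptions for the example; the slides do not define them.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float  # capture time in seconds (hypothetical field)
    size: int         # packet size in bytes (hypothetical field)
    is_ack: bool      # assumed: True for a pure ACK packet

def flow_features(protocol, packets):
    """Return the scalar flow-level features for a non-empty packet list."""
    sizes = [p.size for p in packets]
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        # Assumption: report 0.0 PPS for a zero-duration flow.
        "pps": len(packets) / duration if duration > 0 else 0.0,
        "mean_size": sum(sizes) / len(sizes),
        # Assumption: 0.0 when the flow has no non-ACK packets.
        "mean_non_ack_size": (
            sum(non_ack_sizes) / len(non_ack_sizes) if non_ack_sizes else 0.0
        ),
        "ack_rate": sum(p.is_ack for p in packets) / len(packets),
    }

example_flow = [
    Packet(0.0, 40, True),
    Packet(0.5, 1500, False),
    Packet(1.0, 40, True),
    Packet(2.0, 1500, False),
]
features = flow_features(6, example_flow)
```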

10
Feature Extraction (Interaction Information)
  • Interaction Information
  • H: 2-dimensional histogram, 16x16
  • p1, p2, p3, ..., pn
  • a sequence of packet sizes of a flow and its
    partner flow,
  • ordered by timestamp
  • For i = 1 to n-1
  • increment H[pi/100][pi+1/100]

A sequence of packet sizes: 40, 80, 1500, ..., 40, 1500
Pair-wise representation: (40, 80), (80, 1500), ..., (40, 1500)
Histogram indices: (40/100, 80/100), (80/100, 1500/100), ..., (40/100, 1500/100)
  = (0, 0), (0, 15), ..., (0, 15)
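
The histogram construction on this slide can be sketched as follows. The integer division by 100 and the 16x16 shape come from the slide; clamping sizes of 1600 bytes or more into the last bin, and the function and parameter names, are assumptions added so the sketch cannot index out of range.

```python
def interaction_histogram(sizes, bins=16, bin_width=100):
    """Count consecutive packet-size pairs in a bins x bins histogram."""
    hist = [[0] * bins for _ in range(bins)]
    # Walk over consecutive pairs (p_i, p_{i+1}) in timestamp order.
    for current, following in zip(sizes, sizes[1:]):
        # Assumption: sizes beyond the last bin are clamped into it.
        row = min(current // bin_width, bins - 1)
        col = min(following // bin_width, bins - 1)
        hist[row][col] += 1
    return hist

# Packet sizes of a flow and its partner flow, already merged by timestamp.
hist = interaction_histogram([40, 80, 1500, 40, 1500])
```

For this input the pairs are (40, 80), (80, 1500), (1500, 40) and (40, 1500), so the cells (0, 0), (0, 15), (15, 0) and (0, 15) are incremented.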

11
Guideline
Unknown Traffic

12
Dataset
File              All instances   Selected instances
bittorrent.arff            6412                 1500
clubbox.arff               4913                 1500
edonkey.arff             101355                 1500
fileguri.arff             21060                 1500
ftp.arff                    635                    0
http.arff                200274                 1500
https.arff                 3611                 1500
melon.arff                   22                    0
msnp.arff                  4986                 1500
nateon.arff                1565                 1500
nntp.arff                   169                    0
pop3.arff                    63                    0
sayclub.arff                224                    0
smtp.arff                 40556                 1500
ssh.arff                     67                    0
total                    385912                13500
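
The counts above are consistent with keeping 1500 instances from every application that has at least 1500, and none from the others. A sketch of that selection rule is shown below; the slides do not say how the 1500 instances were picked, so the uniform random sampling, the function name and the `seed` parameter are assumptions.

```python
import random

def balanced_subset(instances_by_class, per_class=1500, seed=0):
    """Keep `per_class` instances of each class that has enough of them."""
    rng = random.Random(seed)
    subset = {}
    for name, instances in instances_by_class.items():
        if len(instances) >= per_class:
            # Assumption: the kept instances are drawn uniformly at random.
            subset[name] = rng.sample(instances, per_class)
        else:
            # Classes with too few instances are left out entirely.
            subset[name] = []
    return subset

# Instance counts per file, taken from the dataset slide.
counts = {
    "bittorrent": 6412, "clubbox": 4913, "edonkey": 101355,
    "fileguri": 21060, "ftp": 635, "http": 200274, "https": 3611,
    "melon": 22, "msnp": 4986, "nateon": 1565, "nntp": 169,
    "pop3": 63, "sayclub": 224, "smtp": 40556, "ssh": 67,
}
# Stand-in data: each instance is represented only by its index.
subset = balanced_subset({name: range(n) for name, n in counts.items()})
selected_total = sum(len(rows) for rows in subset.values())
```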

15
Sum of Squared Error (SSE)
  • How to get SSE
  • bins 88
  • clusters 120
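
The slide does not preserve the formula itself. Under the usual K-Means definition, which is assumed here, the SSE is the sum over all instances of the squared distance between the instance and the center of the cluster it is assigned to. A minimal sketch with hypothetical names:

```python
def sum_of_squared_error(points, centers, assignment):
    """SSE of a clustering; assignment[i] is the cluster index of points[i]."""
    total = 0.0
    for point, cluster in zip(points, assignment):
        center = centers[cluster]
        # Squared Euclidean distance from the point to its own center.
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centers = [(1, 0), (11, 0)]
error = sum_of_squared_error(points, centers, [0, 0, 1, 1])
```

Each of the four points lies at distance 1 from its center, so the SSE of this toy clustering is 4.0.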

16
Fitting of SSE
  • Y = 1.446e+004 * X^(-1.194) + 755.8
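
Reading the fitted model as Y = 1.446e4 * X^(-1.194) + 755.8, it can be evaluated directly. The sketch below assumes that X is the number of clusters and Y the SSE, which the slide title suggests but does not state; the relative-decrease helper is likewise an illustrative assumption, not the presentation's definition of the decrease rate.

```python
def fitted_sse(x):
    """Fitted power-law curve: 1.446e4 * x^(-1.194) + 755.8."""
    return 1.446e4 * x ** -1.194 + 755.8

def relative_decrease(x):
    """Fractional drop of the fitted value when going from x to x + 1."""
    return (fitted_sse(x) - fitted_sse(x + 1)) / fitted_sse(x)

# Tabulate the fitted curve and its step-to-step decrease.
for x in range(1, 11):
    print(x, round(fitted_sse(x), 1), round(relative_decrease(x), 3))
```

The curve decreases monotonically toward the constant term 755.8, so the relative decrease shrinks as X grows; a threshold on that decrease is one possible way to pick a value of X.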

17
Estimation of SSE

18
Decrease Rate of SSE
0.1 decrease

19
To do list
  • Direction: Rx and Tx, Rx only, or Tx only
  • Dynamic bin size
  • Initial N packets or all the packets
  • Different (un)supervised learning methods
  • Different feature extraction methods