Traffic Classification based on Machine Learning using Flow-level Information


Slides: 20

Provided by: jgl9


1
Traffic Classification based on Machine Learning
using Flow-level Information

Jong Gun Lee (jglee_at_an.kaist.ac.kr)
Advanced Networking Lab.

2
Table of Contents

Motivation of this work
Background about machine learning
Our approach using machine learning
Experiment (dataset and result)
Conclusion

3
Motivation

We cannot effectively classify the traffic of
newly emerging applications, such as online games
and streaming applications, because they expose no
application information such as a well-known port
number or a common byte sequence in the payload.

We propose a methodology to classify Internet
traffic with supervised and unsupervised learning.

4
Basic Terminologies of Machine Learning

Classifier
is a mapping from unlabeled instances to classes
Instance
is a single object of the world
Attribute
is a quantity describing an instance
Feature
is the specification of an attribute and its
value
Feature vector
is a list of features describing an instance

5
Unsupervised and Supervised Learning

Supervised learning (with answer/teacher)
Given a labeled training set, a classifier learns the
characteristics of each class. When a new instance
arrives, the classifier predicts the class
of that instance.
Unsupervised learning (without answer/teacher)
Given only a set of unlabeled data (feature vectors),
the learner groups the data into a set of clusters.
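
The supervised case can be illustrated with a very small classifier. The sketch below is not the classifier used in this work; it is a nearest-centroid classifier chosen only because it is short. The function names, the two features, and the toy values are all made up for illustration.

```python
from collections import defaultdict
from math import dist


def train_centroids(vectors, labels):
    """Learn one mean vector (centroid) per class from labeled feature vectors."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vector):
    """Predict the class whose centroid is closest to the new instance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Toy training set: (packets per second, mean packet size). Values are invented.
train_x = [(5.0, 90.0), (6.0, 110.0), (200.0, 1400.0), (180.0, 1350.0)]
train_y = ["chat", "chat", "p2p", "p2p"]

model = train_centroids(train_x, train_y)
print(predict(model, (190.0, 1380.0)))
```

Training needs the labels; prediction only needs the feature vector of the new instance.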

6
K-Means

One of the unsupervised learning methods
K is the number of clusters; it is given as an
initial parameter
Procedure
First, the algorithm randomly chooses K points
as the centers of K subspaces
Second, it divides the overall vector space into
K subspaces by assigning each point to its nearest center
Third, it computes a new center for each subspace
(the mean of the points assigned to it)
Then it repeats the second and third steps until
no center changes, or every center moves by less than
the threshold value
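
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used for the experiments; the names `k_means`, `threshold`, `max_iter`, and `seed` are assumptions, as is the choice to keep the old center when a cluster ends up empty.

```python
import random
from math import dist


def k_means(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick K of the points at random as the initial centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its cluster.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Stop once no center moved by more than the threshold.
        moved = max(dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if moved <= threshold:
            break
    return centers, clusters


# 8 instances, K = 2: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = k_means(data, k=2)
print(sorted(centers))
```

If the loop stops because `max_iter` is reached rather than by convergence, the returned clusters reflect the assignment made before the last center update.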

7
Example of K-Means

Number of instances: 8, K = 2

8
Overall Process of Our Method
Unsupervised Learning
Feature Extraction
K Clusters
Classification Method
Supervised Learning
9
Flow-level Feature Information

Protocol number: 6 (TCP) or 17 (UDP)
Duration (seconds)
Number of packets per second (PPS)
Mean size of all packets
Mean size of non-ACK packets
Rate of ACK packets
Interaction information
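
As a rough sketch, the scalar features in this list could be computed from the packets of one flow as shown below. The `Packet` record, its `is_ack` flag, the dictionary keys, and the fallback for zero-duration flows are all assumptions made for illustration; the slides do not say how ACK packets are identified. The interaction information is a histogram and is left out here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Packet:
    timestamp: float  # seconds
    size: int         # bytes
    is_ack: bool      # hypothetical flag: True for an ACK-only packet


def flow_features(protocol, packets):
    """Compute per-flow features; packets must be sorted by timestamp."""
    duration = packets[-1].timestamp - packets[0].timestamp
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        # Assumption: a zero-duration flow reports its packet count as PPS.
        "pps": len(packets) / duration if duration > 0 else float(len(packets)),
        "mean_size": mean(p.size for p in packets),
        "mean_non_ack_size": mean(non_ack_sizes) if non_ack_sizes else 0.0,
        "ack_rate": sum(p.is_ack for p in packets) / len(packets),
    }


flow = [
    Packet(0.0, 40, True),
    Packet(1.0, 1500, False),
    Packet(2.0, 40, True),
    Packet(4.0, 1500, False),
]
features = flow_features(6, flow)
print(features)
```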

10
Feature Extraction (Interaction Information)

Interaction Information
H: 2-dimensional histogram, 16x16
p1, p2, p3, ..., pn:
a sequence of packet sizes of a flow and its
partner flow, ordered by timestamp
For i = 1 to n-1:
H[pi/100][p(i+1)/100]++

Sequence of packet sizes: 40, 80, 1500, ..., 40, 1500
Pair-wise representation: (40, 80), (80, 1500), ..., (40, 1500)
Histogram indices: (40/100, 80/100), (80/100, 1500/100), ..., (40/100, 1500/100)
= (0, 0), (0, 15), ..., (0, 15)
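
The histogram construction can be sketched as follows, assuming integer division by 100 and one increment per consecutive pair of packet sizes. The names `BINS`, `BIN_WIDTH`, and `interaction_histogram` are hypothetical, and clamping sizes of 1600 bytes or more into the last bin is an assumption that the slide does not state.

```python
BINS = 16        # histogram is BINS x BINS
BIN_WIDTH = 100  # bytes per bin


def interaction_histogram(sizes):
    """Count consecutive packet-size pairs in a 16x16 histogram."""
    hist = [[0] * BINS for _ in range(BINS)]
    for current, following in zip(sizes, sizes[1:]):
        # Assumption: oversized packets are clamped into the last bin.
        row = min(current // BIN_WIDTH, BINS - 1)
        col = min(following // BIN_WIDTH, BINS - 1)
        hist[row][col] += 1
    return hist


# Packet sizes of a flow and its partner flow, ordered by timestamp.
sizes = [40, 80, 1500, 40, 1500]
hist = interaction_histogram(sizes)
print(hist[0][0], hist[0][15], hist[15][0])
```

Flattening the 16x16 matrix row by row gives 256 values that can be appended to the other flow-level features.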
11
Guideline
Unknown Traffic
12
Dataset

1500 bittorrent.arff
1500 clubbox.arff
1500 edonkey.arff
1500 fileguri.arff
0 ftp.arff
1500 http.arff
1500 https.arff
0 melon.arff
1500 msnp.arff
1500 nateon.arff
0 nntp.arff
0 pop3.arff
0 sayclub.arff
1500 smtp.arff
0 ssh.arff
13500 total

6412 bittorrent.arff
4913 clubbox.arff
101355 edonkey.arff
21060 fileguri.arff
635 ftp.arff
200274 http.arff
3611 https.arff
22 melon.arff
4986 msnp.arff
1565 nateon.arff
169 nntp.arff
63 pop3.arff
224 sayclub.arff
40556 smtp.arff
67 ssh.arff
385912 total

15
Sum of Squared Error (SSE)

How to get SSE
bins 88
clusters 120
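
The slide transcript does not include the formula itself. The sketch below uses the usual definition of SSE for a clustering: the sum, over all points, of the squared Euclidean distance between the point and the center of its cluster. The function name and the toy data are assumptions.

```python
from math import dist


def sum_of_squared_error(clusters, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(
        dist(point, center) ** 2
        for members, center in zip(clusters, centers)
        for point in members
    )


# Two toy clusters with their centers.
clusters = [[(0, 0), (2, 0)], [(5, 5)]]
centers = [(1, 0), (5, 5)]
print(sum_of_squared_error(clusters, centers))
```

A lower SSE means the points lie closer to their centers, so SSE normally falls as the number of clusters grows.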

16
Fitting of SSE

Y = 1.446e+004 * X^(-1.194) + 755.8
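
Reading the fitted expression on this slide as a power law with an offset, Y = a * X^b + c with a = 1.446e+004, b = -1.194 and c = 755.8, the curve can be evaluated as below. Treating X as the number of clusters and Y as the SSE is an assumption based on the slide title; the names `A`, `B`, `C`, and `fitted_sse` are hypothetical.

```python
# Coefficients taken from the slide; their interpretation is an assumption.
A = 1.446e4   # scale
B = -1.194    # exponent
C = 755.8     # offset that the curve approaches for large X


def fitted_sse(x):
    """Evaluate the fitted curve Y = A * X^B + C."""
    return A * x ** B + C


for x in (1, 2, 5, 10, 20):
    print(x, round(fitted_sse(x), 1))
```

Because the exponent is negative, the curve decreases monotonically and flattens out toward the offset `C`.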

17
Estimation of SSE
18
Decrease Rate of SSE
0.1 decrease
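
One possible reading of this slide is that the number of clusters is chosen where adding another cluster no longer reduces the SSE by much. The sketch below assumes that the decrease rate is the relative drop in SSE between consecutive values of K, and that 0.1 is a threshold on that fraction. Both are assumptions, as are the function names and the toy SSE values.

```python
def decrease_rates(sse_values):
    """Relative SSE drop between consecutive K; sse_values[i] is for K = i + 1."""
    return [
        (previous - current) / previous
        for previous, current in zip(sse_values, sse_values[1:])
    ]


def first_k_below(sse_values, threshold=0.1):
    """Smallest K where going to K + 1 lowers SSE by less than threshold."""
    for k, rate in enumerate(decrease_rates(sse_values), start=1):
        if rate < threshold:
            return k
    return None  # the drop never fell below the threshold


# Invented SSE values for K = 1, 2, 3, 4.
sse_by_k = [100.0, 50.0, 40.0, 38.0]
print(decrease_rates(sse_by_k))
print(first_k_below(sse_by_k))
```
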
19
To do list

Direction
Rx and Tx, Rx only, and Tx only
Dynamic bin size
Initial N packets or all the packets
Different (un)supervised learning method
Different feature extraction method
