Title: Traffic Classification based on Machine Learning using Flow-level Information
1Traffic Classification based on Machine Learning
using Flow-level Information
- Jong Gun Lee (jglee_at_an.kaist.ac.kr)
- Advanced Networking Lab.
2Table of Contents
- Motivation of this work
- Background about machine learning
- Our approach using machine learning
- Experiment (dataset and result)
- Conclusion
3Motivation
- We cannot effectively classify the traffic of some newly emerging applications,
- such as online games and streaming applications,
- because they carry no application information, such as a port number or a common byte sequence in the payload
We propose a methodology to classify Internet
traffic with supervised and unsupervised learning
4Basic Terminologies of Machine Learning
- Classifier
- is a mapping from unlabeled instances to classes
- Instance
- is a single object of the world
- Attribute
- is a quantity describing an instance
- Feature
- is the specification of an attribute and its value
- Feature vector
- is a list of features describing an instance
5Unsupervised and Supervised Learning
- Supervised learning (with answer/teacher)
- With a training set, a classifier learns the characteristics of each class. When a new instance arrives, the classifier predicts the class of that instance.
- Unsupervised learning (without answer/teacher)
- With only a set of data (feature vectors), a classifier groups the data into a set of clusters.
6K-Means
- One of the unsupervised learning methods
- The K value is the number of clusters; it is given as an initial parameter
- Procedure
- First, the classifier randomly chooses K points as the centers of K subspaces
- Second, it divides the overall vector space into K subspaces according to the centers
- Third, it picks a new center for each of the K subspaces
- Then it repeats the second and third steps until none of the centers changes, or until every center moves less than the threshold value
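The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: the function names, the Euclidean distance, the `threshold` and `max_iter` defaults, and the choice to keep the old center of an empty cluster are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, threshold=1e-6, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 1: randomly choose K of the points as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: divide the space by assigning each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: pick a new center (the mean) for each subspace.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Stop once every center moved no more than the threshold.
        moved = max(
            squared_distance(old, new) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if moved <= threshold:
            break
    assignments = [
        min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points
    ]
    return centers, assignments
```

For example, `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10).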
7Example of K-Means
8Overall Process of Our Method
Unsupervised Learning
Feature Extraction
K Clusters
Classification Method
Supervised Learning
9Flow-level Feature Information
- Protocol number: 6 (TCP) or 17 (UDP)
- Duration (seconds)
- Number of packets per second (PPS)
- Mean size of all packets
- Mean size of non-ACK packets
- Rate of ACK packets
- Interaction Information
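As a rough illustration of how the scalar features in this list could be computed from one flow, the sketch below uses a hypothetical `Packet` record. The field names, the reading of "rate of ACK packets" as the fraction of packets that are ACK packets, and the handling of zero-duration flows are assumptions for the example; the interaction information is left out here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    """Hypothetical per-packet record; not a format defined by the slides."""
    timestamp: float  # capture time in seconds
    size: int         # packet size in bytes
    is_ack: bool      # True for an ACK packet (assumed to be known per packet)


def flow_features(protocol: int, packets: List[Packet]) -> Dict[str, float]:
    """Compute the scalar flow-level features for one non-empty flow."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    times = [p.timestamp for p in packets]
    duration = max(times) - min(times)
    sizes = [p.size for p in packets]
    non_ack_sizes = [p.size for p in packets if not p.is_ack]
    return {
        "protocol": protocol,  # 6 (TCP) or 17 (UDP)
        "duration": duration,
        # Assumption: a zero-duration flow reports 0 packets per second.
        "pps": len(packets) / duration if duration > 0 else 0.0,
        "mean_size": sum(sizes) / len(sizes),
        # Assumption: 0 when the flow has no non-ACK packet at all.
        "mean_non_ack_size": (
            sum(non_ack_sizes) / len(non_ack_sizes) if non_ack_sizes else 0.0
        ),
        # Assumption: "rate" means the fraction of packets that are ACKs.
        "ack_rate": (len(packets) - len(non_ack_sizes)) / len(packets),
    }
```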
10Feature Extraction (Interaction Information)
- Interaction Information
- H: 2-dimensional histogram, 16x16
- $p_1, p_2, p_3, \ldots, p_n$: the sequence of packet sizes of a flow and its partner flow, ordered by timestamp
- For $i = 1, \ldots, n-1$
- increment $H[\lfloor p_i/100 \rfloor][\lfloor p_{i+1}/100 \rfloor]$
- Example
- Sequence of packet sizes: 40, 80, 1500, ..., 40, 1500
- Pair-wise representation: (40, 80), (80, 1500), ..., (40, 1500)
- Histogram bins: (40/100, 80/100), (80/100, 1500/100), ..., (40/100, 1500/100) = (0, 0), (0, 15), ..., (0, 15)
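The histogram construction on this slide can be sketched as follows. The function name and parameters are illustrative, and clamping sizes of 1600 bytes or more into the last bin is an assumption; the slide does not say how oversized packets are handled.

```python
def interaction_histogram(sizes, bin_width=100, bins=16):
    """Build a bins x bins histogram of consecutive packet-size pairs.

    `sizes` is the timestamp-ordered sequence of packet sizes of a flow
    and its partner flow.
    """
    hist = [[0] * bins for _ in range(bins)]
    # Walk over consecutive pairs (p_i, p_{i+1}) for i = 1 .. n-1.
    for current, following in zip(sizes, sizes[1:]):
        # Integer division by the bin width gives the bin index.
        # Assumption: anything beyond the last bin is clamped into it.
        row = min(current // bin_width, bins - 1)
        col = min(following // bin_width, bins - 1)
        hist[row][col] += 1
    return hist


hist = interaction_histogram([40, 80, 1500, 40, 1500])
# Pairs: (40, 80), (80, 1500), (1500, 40), (40, 1500)
# Bins:  (0, 0),   (0, 15),    (15, 0),    (0, 15)
```

Flattening the 16x16 matrix row by row gives 256 values that can be appended to the other flow-level features.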
11Guideline
Unknown Traffic
12Dataset
- 1500 bittorrent.arff
- 1500 clubbox.arff
- 1500 edonkey.arff
- 1500 fileguri.arff
- 0 ftp.arff
- 1500 http.arff
- 1500 https.arff
- 0 melon.arff
- 1500 msnp.arff
- 1500 nateon.arff
- 0 nntp.arff
- 0 pop3.arff
- 0 sayclub.arff
- 1500 smtp.arff
- 0 ssh.arff
- 13500 total
- 6412 bittorrent.arff
- 4913 clubbox.arff
- 101355 edonkey.arff
- 21060 fileguri.arff
- 635 ftp.arff
- 200274 http.arff
- 3611 https.arff
- 22 melon.arff
- 4986 msnp.arff
- 1565 nateon.arff
- 169 nntp.arff
- 63 pop3.arff
- 224 sayclub.arff
- 40556 smtp.arff
- 67 ssh.arff
- 385912 total
15Sum of Squared Error (SSE)
- How to get SSE
- bins 88
- clusters 120
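The slide body did not survive beyond its parameters, so the sketch below uses the usual definition of SSE for a clustering: the sum, over all points, of the squared distance to the center of the cluster the point is assigned to. Whether this work computed SSE in exactly this way is an assumption, and the function name and argument layout are illustrative.

```python
def sum_of_squared_error(points, centers, assignments):
    """SSE of a clustering.

    points:      list of feature vectors
    centers:     list of cluster centers (same dimension as the points)
    assignments: assignments[i] is the index of the center for points[i]
    """
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centers[cluster]
        # Add the squared Euclidean distance from the point to its center.
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total
```

With a fixed dataset, SSE generally shrinks as the number of clusters grows, which is why it is typically plotted against K.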
16Fitting of SSE
- $Y = 1.446 \times 10^{4} \cdot X^{-1.194} + 755.8$
17Estimation of SSE
18Decrease Rate of SSE
0.1 decrease
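The fitting, estimation and decrease-rate slides can be tied together in a small sketch. It reads the fitted curve as Y = 1.446e+004 · X^(-1.194) + 755.8 and assumes that X is the number of clusters and Y the SSE. The helper names are hypothetical, and treating "0.1 decrease" as a threshold on the relative drop between consecutive X values is only one possible reading of the slide.

```python
def fitted_sse(x, a=1.446e4, b=-1.194, c=755.8):
    """Power-law model a * x**b + c with the coefficients from the fitting slide."""
    return a * x ** b + c


def relative_decrease(x):
    """Fraction by which the estimated SSE drops when going from x - 1 to x."""
    previous, current = fitted_sse(x - 1), fitted_sse(x)
    return (previous - current) / previous


def first_x_below(threshold, x_max=1000):
    """Smallest x >= 2 whose relative decrease is below `threshold`, else None.

    Assumption: this is how a decrease-rate cutoff might be used to pick
    the number of clusters; the slides do not spell out the rule.
    """
    for x in range(2, x_max + 1):
        if relative_decrease(x) < threshold:
            return x
    return None
```

Because the exponent is negative, the estimated SSE keeps falling as X grows, but each extra cluster buys a smaller relative improvement.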
19To do list
- Direction
- Rx and Tx, Rx only, and Tx only
- Dynamic bin size
- Initial N packets or all the packets
- Different (un)supervised learning method
- Different feature extraction method