A Machine Learning Approach Toward Effective Linkspeed Traffic Identification

1
A Machine Learning Approach Toward Effective
Link-speed Traffic Identification

Wei Li
PhD student, Queen Mary, University of London
Visiting student, University of Cambridge
weili_at_dcs.qmul.ac.uk

2
Internet Applications and Traffic Flow

Internet traffic is a huge mixture:
thousands of different applications -> lots of
different traffic
Examples: A stream of packets is started when you
Visit a website
Send or receive an email
Start downloading a file
Say hello to your friend on MSN
Make a Skype call
Watch IPTV on the Internet

Next: Traffic Classes
3
Traffic Classes

Web-browsing
Email
File transfer and sharing
Chatting
Voice over IP
IPTV
Gaming
Attacks, viruses, worms, spam, spyware, etc.

Next: Motivation
4
Motivation

Different traffic deserves different handling.
(Now they're treated in the same way on the
Internet. Why?)

Next: Past work
5
Past Work: The Path to a Better World

Port-based rules (current standard since the 70s)
Signature-based rules (for IDS since the 90s)
Behaviour-based fingerprinting (not yet
standardised)
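
As a rough illustration of the first item, a port-based rule is essentially a lookup from a well-known port number to an application class. The sketch below is not from the talk: the table holds a handful of IANA-registered ports, and the function name and class labels are chosen here purely for illustration.

```python
# Hypothetical port-to-class table (labels are illustrative;
# the port numbers are standard IANA assignments).
PORT_CLASSES = {
    80: "web-browsing",
    443: "web-browsing",
    25: "email",
    110: "email",
    143: "email",
    20: "file-transfer",
    21: "file-transfer",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the class of whichever endpoint uses a known port."""
    for port in (dst_port, src_port):
        if port in PORT_CLASSES:
            return PORT_CLASSES[port]
    return "unknown"


print(classify_by_port(51234, 80))    # web-browsing
print(classify_by_port(51234, 6881))  # unknown
```

Any application that runs on an unregistered port falls through to "unknown" under such a rule.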

Next: Objectives and challenges
6
Objectives and Challenges

Answer the questions.
What features to use?
What algorithm to use?
Build a real system.
Monitor links of up to 10 Gbit/s.
Real-time.

Next: Countermeasure
7
Countermeasure

Trade-off between accuracy and cost
In theory, the less correlated information we
collect, the lower the accuracy we achieve.
Start-of-flow (SoF) sampling and observation
window
We need to show that the behaviour information in
the first K packets is sufficient for most of
the application classes.
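
The observation-window idea can be sketched as follows: keep only the first K packets of a flow and compute features from that prefix alone. This is a minimal sketch under stated assumptions, not the system from the talk; the `Packet` record, the `window_features` function and the particular features are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    timestamp: float    # seconds since the start of the capture
    size: int           # bytes
    from_client: bool   # True if sent by the flow initiator


def window_features(packets: List[Packet], k: int) -> Dict[str, float]:
    """Compute simple features from the first k packets of one flow."""
    window = sorted(packets, key=lambda p: p.timestamp)[:k]
    sizes = [p.size for p in window]
    gaps = [b.timestamp - a.timestamp for a, b in zip(window, window[1:])]
    return {
        "packets_seen": len(window),
        "total_bytes": sum(sizes),
        "mean_size": sum(sizes) / len(sizes) if sizes else 0.0,
        "client_fraction": (
            sum(p.from_client for p in window) / len(window) if window else 0.0
        ),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Made-up flow of seven packets; only the first five are observed.
flow = [
    Packet(0.00, 60, True),
    Packet(0.01, 60, False),
    Packet(0.02, 52, True),
    Packet(0.05, 500, True),
    Packet(0.09, 1500, False),
    Packet(0.10, 1500, False),
    Packet(0.11, 1500, False),
]
features = window_features(flow, k=5)
print(features["total_bytes"])  # 2172
```

Because everything after packet K is ignored, the per-flow cost is bounded regardless of how long the flow lasts.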

Next: Learning Process
8
Semi-Automatic Learning Process

The object of classification is a traffic flow, which is
assigned to one of a number of traffic classes.
An amount of data / a wide feature set ->
Feature subset / observation window size ->
Final classifier
Feature Selection
Correlation-based filtering on 10 half-hour
traces.
The final subset includes the features that appeared in at
least 1/3 of the CF sets.
Classification
Empirical comparison between classification
algorithms: AdaBoost+C4.5 and C4.5 are the best among
all the algorithms, while Joint Boosting, NBTree,
BNN, LogitBoost and Ripper also achieved good
results.
We focus on C4.5 for its very low computing
overhead and good accuracy.
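
The "at least 1/3 of the CF sets" rule under Feature Selection above can be read as a vote across the per-trace feature subsets. The sketch below assumes one selected subset per trace; the function name and the feature names are hypothetical, and the correlation-based filter that produces each subset is not shown.

```python
from collections import Counter
from fractions import Fraction
from typing import Iterable, List, Set


def vote_features(selected_sets: Iterable[Set[str]],
                  min_fraction: Fraction = Fraction(1, 3)) -> List[str]:
    """Keep features chosen in at least min_fraction of the subsets."""
    sets = [set(s) for s in selected_sets]
    counts = Counter(feature for s in sets for feature in s)
    threshold = min_fraction * len(sets)   # exact arithmetic, no rounding
    return sorted(f for f, c in counts.items() if c >= threshold)


# Six made-up subsets, so a feature needs at least 2 votes.
subsets = [
    {"server_port", "size_pkt1", "push_flags"},
    {"server_port", "size_pkt1"},
    {"server_port", "mean_gap"},
    {"server_port", "size_pkt1", "ttl"},
    {"server_port"},
    {"server_port", "push_flags"},
]
print(vote_features(subsets))  # ['push_flags', 'server_port', 'size_pkt1']
```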

Next: Training Data
9
Training Data

Gigabit Internet link of a research campus with
1000 users
10 half-hour traces on two non-consecutive days,
8 months apart
250 features collected from the traffic
375,000 flows, 31 GBytes in Day1
175,000 flows, 28 GBytes in Day2

Next: Features
10
Features of A Traffic Flow

Uniquely defined by a 4-tuple: 2x port numbers, 2x
IPs
Different packet direction sequence
Different packet sizes
Different packet sending and arrival patterns
symmetry between server and client
interactivity between server and client
data transmission behaviour, e.g. bursty / flat
Transport layer features, e.g. different flags
in Transmission Control Protocol (TCP) headers
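
To make the 4-tuple definition concrete, the sketch below groups packets into bidirectional flows and records the direction sequence and packet sizes of each flow. It is an illustration only: the tuple layout, the function names, and the assumption that the first sender seen is the client are choices made here, not details from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Endpoint = Tuple[str, int]            # (IP address, port)
FlowKey = Tuple[Endpoint, Endpoint]   # the 4-tuple, endpoints ordered


def flow_key(src_ip: str, src_port: int,
             dst_ip: str, dst_port: int) -> FlowKey:
    """Order the endpoints so both directions map to the same key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)


def group_flows(packets: Iterable[Tuple[str, int, str, int, int]]) -> Dict:
    """packets: (src_ip, src_port, dst_ip, dst_port, size_in_bytes)."""
    flows: Dict = defaultdict(
        lambda: {"initiator": None, "directions": [], "sizes": []})
    for src_ip, src_port, dst_ip, dst_port, size in packets:
        flow = flows[flow_key(src_ip, src_port, dst_ip, dst_port)]
        if flow["initiator"] is None:
            # Assumption: the first sender seen is the client.
            flow["initiator"] = (src_ip, src_port)
        sender_is_client = (src_ip, src_port) == flow["initiator"]
        flow["directions"].append(1 if sender_is_client else -1)
        flow["sizes"].append(size)
    return dict(flows)


# Made-up packets from two flows.
packets = [
    ("10.0.0.1", 51234, "192.0.2.7", 80, 60),
    ("192.0.2.7", 80, "10.0.0.1", 51234, 60),
    ("10.0.0.1", 51234, "192.0.2.7", 80, 420),
    ("10.0.0.2", 40000, "192.0.2.9", 25, 60),
]
flows = group_flows(packets)
web = flows[flow_key("10.0.0.1", 51234, "192.0.2.7", 80)]
print(web["directions"], web["sizes"])  # [1, -1, 1] [60, 60, 420]
```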

Next: Scatter plot
11
Scatter Plots
Next: OW vs Accuracy
12
Algorithm Comparison (before feature selection)
Table 1: Comparison of accuracy of algorithms
given the complete set of features to choose from. We
adopt the default implementations of
C4.5 and AdaBoost in Weka, used
out of the box. The accuracy values
are measured by two-fold cross validation on the
whole datasets unless otherwise noted. (a) Due to
the time cost of training, Joint Boosting used a
subset of 7340 flows (4404 flows for training and
2936 flows for testing) selected from Day1.
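
The two-fold cross validation mentioned in the caption can be sketched as below: split the flows into two halves, train on one and test on the other, swap, and pool the results. This is a generic sketch, not the Weka procedure used in the talk. A 1-nearest-neighbour learner stands in for C4.5, the split is a simple alternating one, and all names and data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (feature vector, class label)
Predictor = Callable[[Sequence[float]], str]


def nearest_neighbour(train: List[Sample]) -> Predictor:
    """Stand-in learner (1-NN); the talk uses C4.5 and AdaBoost in Weka."""
    def predict(x: Sequence[float]) -> str:
        def squared_distance(sample: Sample) -> float:
            return sum((a - b) ** 2 for a, b in zip(sample[0], x))
        return min(train, key=squared_distance)[1]
    return predict


def two_fold_accuracy(data: List[Sample],
                      learner: Callable[[List[Sample]], Predictor]) -> float:
    """Train on one half, test on the other, swap, pool the results."""
    # Alternating split; real data should be shuffled beforehand.
    folds = [data[0::2], data[1::2]]
    correct = 0
    for test_index in (0, 1):
        predict = learner(folds[1 - test_index])
        correct += sum(predict(x) == y for x, y in folds[test_index])
    return correct / len(data)


# Made-up, well-separated data: one feature, two classes.
data = ([((float(i),), "a") for i in range(10)]
        + [((100.0 + i,), "b") for i in range(10)])
print(two_fold_accuracy(data, nearest_neighbour))  # 1.0
```

Every sample is used for testing exactly once, so the pooled figure covers the whole dataset while never testing on training data.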
Next: After feature selection
13
Observation Window vs Accuracy (C4.5)
Next: Algorithms
14
With feature selection and a 5-packet OW
Table 2: C4.5 and AdaBoost+C4.5 using OW = 5
packets, two-fold cross validation over Day1.
Time values are for all 375,000 flow objects.
Next: Future Work
15
Future Work