A Machine Learning Approach Toward Effective Link-speed Traffic Identification

1
A Machine Learning Approach Toward Effective Link-speed Traffic Identification
  • Wei Li
  • PhD student, Queen Mary, University of London
  • Visiting student, University of Cambridge
  • weili@dcs.qmul.ac.uk

2
Internet Applications and Traffic Flow
  • Internet traffic is a huge mixture
  • Thousands of different applications -> lots of
    different kinds of traffic
  • Examples: a stream of packets is started when you
  • Visit a website
  • Send or receive an email
  • Start downloading a file
  • Say hello to your friend on MSN
  • Make a Skype call
  • Watch IPTV on the Internet

Next: Traffic Classes
3
Traffic Classes
  • Web-browsing
  • Email
  • File transfer / sharing
  • Chatting
  • Voice over IP
  • IPTV
  • Gaming
  • Attacks, viruses, worms, spam, spyware, etc.

Next: Motivation
4
Motivation
  • Different traffic deserves different handling.
  • (Today they are all treated the same way on the
    Internet. Why?)

Next: Past Work
5
Past Work: The Path to a Better World
  • Port-based rules (the current standard since the 1970s)
  • Signature-based rules (for IDS since the 1990s)
    (both rule styles are sketched below)
  • Behaviour-based fingerprinting (not yet
    standardised)
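As a quick illustration of the first two rule styles, here is a minimal Python sketch. The port table and payload signatures are invented for the example and are not the rule sets discussed in the talk; real deployments use far larger tables and regular-expression signatures.

```python
# Port-based vs. signature-based classification, illustrative only.
PORT_RULES = {80: "web-browsing", 25: "email", 21: "file-transfer"}

SIGNATURE_RULES = [
    (b"GET ", "web-browsing"),                # HTTP request line
    (b"BitTorrent protocol", "file-sharing"),
]

def classify_by_port(dst_port: int) -> str:
    """Port-based rule: map the well-known server port to a traffic class."""
    return PORT_RULES.get(dst_port, "unknown")

def classify_by_signature(payload: bytes) -> str:
    """Signature-based rule: look for a known byte pattern in the payload."""
    for pattern, traffic_class in SIGNATURE_RULES:
        if pattern in payload:
            return traffic_class
    return "unknown"
```

Port-based rules fail for applications on non-standard ports, and signature-based rules fail on encrypted payloads, which motivates the behaviour-based approach taken here.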

Next: Objectives and Challenges
6
Objectives and Challenges
  • Answer the questions:
  • What features to use?
  • What algorithm to use?
  • Build a real system.
  • Monitor links of up to 10 Gbit/s.
  • Real-time.

Next: Countermeasure
7
Countermeasure
  • Trade-off between accuracy and cost
  • In theory, the less correlated information we
    collect, the lower the accuracy we can achieve.
  • Start-of-flow (SoF) sampling and an observation
    window (sketched below)
  • We need to show that the behavioural information
    in the first K packets is sufficient for most of
    the application classes.
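A minimal sketch of what SoF sampling with an observation window could look like, assuming per-flow state keyed by a flow identifier. The helper names are illustrative; K = 5 matches the observation window used later (slide 14), but any small K works the same way.

```python
from collections import defaultdict

K = 5  # observation window: keep only the first K packets of each flow

# flow_key -> the packets retained so far for that flow
observed = defaultdict(list)

def sof_sample(flow_key, packet) -> bool:
    """Start-of-flow sampling: retain a packet only while the flow's
    observation window is not yet full; later packets are ignored."""
    window = observed[flow_key]
    if len(window) < K:
        window.append(packet)
        return True   # kept for feature extraction
    return False      # beyond the observation window, discarded
```

Because only the first K packets per flow are stored and inspected, both memory and per-packet processing stay bounded on a fast link, which is the cost side of the trade-off above.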

Next: Learning Process
8
Semi-Automatic Learning Process
  • The classification object is a traffic flow, which is
    assigned to one of a number of traffic classes.
  • A large amount of data / a wide feature set ->
  • Feature subset / observation window size ->
  • Final classifier
  • Feature Selection
  • Correlation-based filtering on 10 half-hour traces.
  • The final subset includes features that appear in at
    least 1/3 of the correlation-based filtering (CF) sets.
  • Classification
  • Empirical comparison between classification
    algorithms: AdaBoost+C4.5 and C4.5 are the best among
    all the algorithms, while Joint Boosting, NBTree,
    BNN, LogitBoost and RIPPER also achieve good
    results.
  • We focus on C4.5 for its very low computing
    overhead and good accuracy. (Both stages are
    sketched below.)
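A rough Python sketch of these two stages. The talk used Weka; here scikit-learn's CART decision tree stands in for C4.5, AdaBoostClassifier for AdaBoost+C4.5, and a plain correlation threshold stands in for the correlation-based filter, so this is an outline of the idea rather than the actual experimental setup. X_traces / y_traces are assumed to be per-trace feature matrices and integer-encoded class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def select_features(X_traces, y_traces, min_votes):
    """Vote-based feature selection: keep a feature if it survives the
    per-trace filter in at least `min_votes` traces (the slide keeps
    features appearing in at least 1/3 of the CF sets). A simple
    |correlation with the label| > 0.1 threshold stands in for the
    correlation-based filter here."""
    n_features = X_traces[0].shape[1]
    votes = np.zeros(n_features, dtype=int)
    for X, y in zip(X_traces, y_traces):
        corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                         for j in range(n_features)])
        votes += (np.nan_to_num(corr) > 0.1).astype(int)
    return np.where(votes >= min_votes)[0]

def train_classifier(X, y, feature_idx):
    """Boosted decision tree (CART stand-in for C4.5) on the chosen features."""
    boosted = AdaBoostClassifier(DecisionTreeClassifier())
    return boosted.fit(X[:, feature_idx], y)
```

With the 10 half-hour traces described on the next slide, `min_votes=4` approximates the "at least 1/3 of the CF sets" rule.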

Next: Training Data
9
Training Data
  • Gigabit Internet link of a research campus with
    1000 users
  • 10 half-hour traces from two non-consecutive days,
    eight months apart
  • 250 features collected from the traffic
  • 375,000 flows, 31 GBytes in Day1
  • 175,000 flows, 28 GBytes in Day2

Next: Features
10
Features of a Traffic Flow
  • A flow is uniquely defined by a 4-tuple: 2x port
    numbers, 2x IP addresses (a feature-extraction
    sketch follows this list)
  • Different packet direction sequences
  • Different packet sizes
  • Different packet sending and arrival patterns
  • Symmetry between server and client
  • Interactivity between server and client
  • Data transmission behaviour, e.g. bursty / flat
  • Transport-layer features, e.g. different flags
    in Transmission Control Protocol (TCP) headers
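A sketch of per-flow feature extraction of the kind listed above, assuming packets arrive as dictionaries carrying the 4-tuple fields plus a size, a timestamp and a direction flag. The field names and the three example features are illustrative; the study collected around 250 features per flow.

```python
from collections import defaultdict

# Per-flow state, keyed by the 4-tuple (direction canonicalisation omitted).
flows = defaultdict(lambda: {"sizes": [], "times": [], "dirs": []})

def flow_key(pkt):
    """The 4-tuple named on the slide: the two IPs and the two ports."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])

def add_packet(pkt):
    f = flows[flow_key(pkt)]
    f["sizes"].append(pkt["size"])
    f["times"].append(pkt["timestamp"])
    f["dirs"].append(pkt["direction"])   # +1 client->server, -1 server->client

def flow_features(key):
    """Three example per-flow features of the kind listed above."""
    f = flows[key]
    gaps = [b - a for a, b in zip(f["times"], f["times"][1:])]
    return {
        "mean_packet_size": sum(f["sizes"]) / len(f["sizes"]),
        "mean_inter_arrival": sum(gaps) / len(gaps) if gaps else 0.0,
        "client_to_server_ratio": f["dirs"].count(1) / len(f["dirs"]),
    }
```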

Next: Scatter Plots
11
Scatter Plots
Next: OW vs Accuracy
12
Algorithm Comparison (before feature selection)
Table 1: Comparison of the accuracy of the algorithms
when given the complete set of features to choose
from. We adopt the default algorithm implementations
for C4.5 and AdaBoost in Weka, under an
out-of-the-box condition. The accuracy values are
measured by two-fold cross-validation on the whole
datasets unless otherwise noted. (a) Due to the time
cost of training, Joint Boosting used a subset of
7,340 flows (4,404 flows for training and 2,936
flows for testing) selected from Day1. (An analogous
cross-validation comparison is sketched below.)
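For reference, an analogous two-fold cross-validation comparison written with scikit-learn (the study itself used Weka's default implementations, so this only mirrors the evaluation protocol, not the exact algorithms).

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def compare_algorithms(X, y):
    """Two-fold cross-validation over the whole dataset, as in Table 1.
    X: per-flow feature matrix, y: class labels."""
    candidates = {
        "C4.5 (CART stand-in)": DecisionTreeClassifier(),
        "AdaBoost+C4.5 (stand-in)": AdaBoostClassifier(DecisionTreeClassifier()),
    }
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=2)   # two-fold CV
        print(f"{name}: mean accuracy {scores.mean():.3f}")
```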
Next: After Feature Selection
13
Observation Window vs Accuracy (C4.5)
Next: Algorithms
14
With Feature Selection and a 5-Packet OW
Table 2: C4.5 and AdaBoost+C4.5, using an OW of 5
packets, two-fold cross-validation over Day1.
Time values are for all 375,000 flow objects.
Next: Future Work
15
Future Work
  • Tackle temporal and spatial instability
  • Feedback channel
  • Incremental and reinforcement learning
  • (see an architecture on the next slide)
  • Tailor the classifier for our problem
  • Decision trees, Boosting, SVM or Bayesian Neural
    Network?
  • Class imbalance and different misclassification
    costs (a class-weighting sketch follows this list)
  • Another problem: 0-day detection of unknown
    traffic
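One common way to address class imbalance and unequal misclassification costs is class weighting. A minimal sketch with scikit-learn's CART tree standing in for C4.5; the optional cost dictionary is a placeholder, not a set of values from the study.

```python
from sklearn.tree import DecisionTreeClassifier

def train_cost_sensitive_tree(X, y, costs=None):
    """CART stand-in for C4.5 with cost-sensitive class weights.
    `costs` is an optional {class_label: weight} dict whose keys must match
    labels present in y; without it, 'balanced' reweights classes inversely
    to their frequency to counter class imbalance."""
    clf = DecisionTreeClassifier(class_weight=costs or "balanced")
    return clf.fit(X, y)
```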

Next: An Architecture
16
An Architecture
17
Thanks!
  • Thanks
  • Questions and Comments?