Title: A Machine Learning Approach Toward Effective Link-speed Traffic Identification
1 A Machine Learning Approach Toward Effective
Link-speed Traffic Identification
- Wei Li
- PhD student, Queen Mary, University of London
- Visiting student, University of Cambridge
- weili_at_dcs.qmul.ac.uk
2 Internet Applications and Traffic Flow
- Internet traffic is a huge mixture
- Thousands of different applications -> lots of different traffic
- Examples: a stream of packets is started when you
- Visit a website
- Send or receive an email
- Start downloading a file
- Say hello to your friend on MSN
- Make a Skype call
- Watch IPTV on the Internet
3 Traffic Classes
- Web-browsing
- Email
- File transfer and sharing
- Chatting
- Voice over IP
- IPTV
- Gaming
- Attacks, viruses, worms, spam, spyware, etc.
4 Motivation
- Different traffic deserves different handling.
- (Now they're all treated in the same way on the
Internet. Why?)
5 Past Work: The Path to a Better World
- Port-based rules (the current standard since the 1970s)
- Signature-based rules (for IDS since the 1990s)
- Behaviour-based fingerprinting (not yet
standardised)
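
As a rough illustration of the first approach above, a port-based rule is just a lookup from a well-known port number to a traffic class. The sketch below is an assumption made for illustration, not the rule set used in this work: the table `PORT_RULES` and the function `classify_by_port` are hypothetical names, and only a handful of IANA well-known ports are listed.

```python
# Hypothetical port-to-class table built from a few IANA well-known ports.
PORT_RULES = {
    80: "web-browsing",   # HTTP
    443: "web-browsing",  # HTTPS
    25: "email",          # SMTP
    110: "email",         # POP3
    21: "file-transfer",  # FTP control
}


def classify_by_port(src_port, dst_port):
    """Return the class of the first port that matches a rule, else 'unknown'."""
    # Check the destination port first, then the source port.
    for port in (dst_port, src_port):
        if port in PORT_RULES:
            return PORT_RULES[port]
    return "unknown"
```

A flow whose ports are both outside the table comes back as "unknown".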
6 Objectives and Challenges
- Answer the questions.
- What features to use?
- What algorithm to use?
- Build a real system.
- Monitor a link of up to 10 Gigabit/s.
- Real-time.
7 Countermeasure
- Trade-off between accuracy and cost
- In theory, the less correlative information we
collect, the lower the accuracy we achieve.
- Start-of-flow (SoF) sampling and observation
window
- We need to prove that the behaviour information in
the first K packets is sufficient for most of
the application classes.
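
The observation-window idea above can be sketched as follows: only the first K packets of a flow are examined, and features are computed over that prefix. This is a minimal sketch under stated assumptions, not the system described in the slides. The packet representation (a `(direction, size_bytes)` pair with +1 for client-to-server and -1 for server-to-client), the function names, and the four features are all hypothetical.

```python
def observation_window(packets, k):
    """Return the first k packets of a flow (fewer if the flow is shorter)."""
    return packets[:k]


def window_features(packets, k):
    """Compute a few illustrative features over the first k packets only.

    Each packet is assumed to be a (direction, size_bytes) pair, where
    direction is +1 for client-to-server and -1 for server-to-client.
    """
    window = observation_window(packets, k)
    sizes = [size for _, size in window]
    return {
        "packets_seen": len(window),
        "total_bytes": sum(sizes),
        "max_size": max(sizes) if sizes else 0,
        "client_packets": sum(1 for direction, _ in window if direction > 0),
    }
```

Because everything after packet K is ignored, the per-flow cost is bounded by K regardless of how long the flow lasts.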
8 Semi-Automatic Learning Process
- The classification object is a traffic flow, assigned to
one of a number of traffic classes.
- An amount of data / a wide feature set ->
- Feature subset / observation window size ->
- Final classifier
- Feature Selection
- Correlation-based filtering on 10 half-hour
traces.
- The final subset includes the features that appeared in at
least 1/3 of the CF sets.
- Classification
- Empirical comparison between classification
algorithms: AdaBoost+C4.5 and C4.5 are the best among
all the algorithms, while JointBoosting, NBTree,
BNN, LogitBoost and Ripper also achieved good
results.
- We focus on C4.5 for its very low computing
overhead and good accuracy.
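
The voting step in the feature selection above can be sketched as follows. The sketch assumes that a "CF set" is the feature subset that correlation-based filtering selects on one trace; the filtering itself is not shown. The function name `vote_features`, its parameters, and the feature names in the usage note are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def vote_features(selected_sets, min_fraction=Fraction(1, 3)):
    """Keep the features that appear in at least min_fraction of the sets.

    selected_sets: one collection of feature names per trace (assumed to be
    the output of a per-trace feature-selection run).
    """
    # Count each feature at most once per set.
    counts = Counter(name for s in selected_sets for name in set(s))
    threshold = min_fraction * len(selected_sets)
    return sorted(name for name, c in counts.items() if c >= threshold)
```

For example, with six per-trace sets, `vote_features([["a", "b"], ["a"], ["c", "a"], ["b"], ["d"], ["a"]])` keeps only the features present in at least two of them.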
9 Training Data
- Gigabit Internet link of a research campus with
1000 users
- 10 half-hour traces for two non-consecutive days
respectively, with an 8-month interval between them
- 250 features collected from the traffic
- 375,000 flows, 31 GBytes in Day1
- 175,000 flows, 28 GBytes in Day2
10 Features of a Traffic Flow
- Uniquely defined by a 4-tuple: 2x port numbers, 2x
IPs
- Different packet direction sequence
- Different packet sizes
- Different packet sending and arrival patterns
- symmetry between server and client
- interactivity between server and client
- data transmission behaviour e.g. bursty / flat
- Transport-layer features, e.g. different flags
in Transmission Control Protocol (TCP) headers
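
To make the 4-tuple definition above concrete, the sketch below groups packets into flows and records the direction and size sequence of each flow. It is an illustration under assumptions, not the capture code behind the slides. The packet tuple layout, the names `flow_key` and `group_into_flows`, and the convention that the sender of the first packet seen is the initiator (+1) are all hypothetical.

```python
def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Direction-independent key: both directions of a flow share one key."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)


def group_into_flows(packets):
    """Group packets into flows keyed by the 4-tuple.

    packets: iterable of (src_ip, src_port, dst_ip, dst_port, size) tuples.
    Returns {key: [(direction, size), ...]}, where direction is +1 for
    packets sent by the endpoint that sent the first packet seen, else -1.
    """
    initiators = {}
    flows = {}
    for src_ip, src_port, dst_ip, dst_port, size in packets:
        key = flow_key(src_ip, src_port, dst_ip, dst_port)
        sender = (src_ip, src_port)
        # Remember who sent the first packet of each flow.
        initiator = initiators.setdefault(key, sender)
        direction = 1 if sender == initiator else -1
        flows.setdefault(key, []).append((direction, size))
    return flows
```

The resulting per-flow sequences are the raw material for direction, size, and symmetry features like those listed above.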
11 Scatter Plots
12 Algorithm Comparison (before feature selection)
Table 1: Comparison of the accuracy of algorithms
given the complete set of features to choose from. We
adopt the default implementations of
C4.5 and AdaBoost in Weka, in their
out-of-the-box configuration. The accuracy values
are measured by two-fold cross-validation on the
whole datasets unless otherwise noted. (a) Due to
the time cost of training, Joint Boosting used a
subset of 7340 flows (4404 flows for training and
2936 flows for testing) selected from Day1.
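
The two-fold cross-validation mentioned in the caption can be sketched as follows: the data is split into two halves, each half is used once for testing while the other is used for training, and the correct predictions are pooled. This is a generic sketch, not Weka's implementation. The function name, the `train` and `predict` callables, and the fixed shuffle seed are assumptions made for illustration.

```python
import random


def two_fold_accuracy(samples, labels, train, predict, seed=0):
    """Estimate accuracy by two-fold cross-validation.

    train(samples, labels) must return a model;
    predict(model, sample) must return a predicted label.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatable splits
    half = len(indices) // 2
    folds = [indices[:half], indices[half:]]
    correct = 0
    # Each fold is the test set once; the other fold is the training set.
    for test_idx, train_idx in ((folds[0], folds[1]), (folds[1], folds[0])):
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(1 for i in test_idx
                       if predict(model, samples[i]) == labels[i])
    return correct / len(samples)
```

Any classifier can be plugged in through `train` and `predict`, since every sample is tested exactly once by a model that did not see it during training.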
13 Observation Window vs Accuracy (C4.5)
14 With feature selection and a 5-packet OW
Table 2: C4.5 and AdaBoost+C4.5 using OW = 5
packets, two-fold cross-validation over Day1.
Time values are for all 375,000 flow objects.
15 Future Work
- Tackle temporal and spatial instability
- Feedback channel
- Incremental and reinforcement learning
- (see next slide: an architecture)
- Tailor the classifier for our problem
- Decision trees, Boosting, SVM or Bayesian Neural
Network?
- Class imbalance and different misclassification
costs
- Another problem: 0-day detection of unknown
traffic
16 An Architecture
17 Thanks
- Questions and Comments?