Title: Structural Analysis of Network Traffic Flows
1Structural Analysis of Network Traffic Flows
- A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot,
E. D. Kolaczyk, and N. Taft
- Presented by Guanghui He
2Outline of the paper
- Motivation and objective
- Principal Component Analysis
- Empirical studies
- Conclusion
3Motivation
- Traditional Traffic Analysis
- Focus on
- Short stationary timescales
- Traffic on a single link in isolation
- Principal results
- Scaling properties
- Packet delays and losses
- What ISPs Care About
- Focus on
- Long, nonstationary timescales
- Traffic on all links simultaneously
- Principal goals
- Traffic engineering
- Anomaly detection
- Capacity planning
4For Whole-network Traffic Analysis
- Traffic Engineering: Y = AX
- How to tune A? How does traffic move throughout
the network?
- Attack/Anomaly Detection
- On which links is there unusual traffic?
- Capacity planning
- How much and where in network to upgrade?
5Complicated Job
- Measuring and modeling traffic on all links
simultaneously is challenging.
- Hundreds to thousands of links in a large IP
backbone network
- Even single link modeling is difficult
- High-dimensional timeseries
- Significant correlation structure
- Is there a more fundamental representation?
6One way out: OD flows
- Link traffic arises from the superposition of
Origin-Destination (OD) flows
- Modeling OD flows instead of link traffic removes
a significant source of correlation
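The superposition relation can be illustrated with a toy routing matrix. This is a minimal sketch, assuming a hypothetical three-node line topology (a, b, c) and made-up flow volumes; the names `A`, `links`, `od_flows`, and `link_loads` are illustrative and not taken from the paper.

```python
# Hypothetical 3-node line topology a - b - c with two directed links.
# Rows of A are links, columns are OD flows:
# A[l][f] = 1 if OD flow f is routed over link l, else 0.
links = ["a->b", "b->c"]
od_flows = ["a->b", "a->c", "b->c"]
A = [
    [1, 1, 0],  # link a->b carries OD flows a->b and a->c
    [0, 1, 1],  # link b->c carries OD flows a->c and b->c
]


def link_loads(A, x):
    """Return y = A x: traffic on each link given OD-flow volumes x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


x = [10, 5, 2]  # made-up OD-flow volumes for one time interval
y = link_loads(A, x)
print(dict(zip(links, y)))
```

Both link loads include the a->c flow, so the two link time series are correlated even when the OD flows themselves are independent. That shared term is the correlation that disappears when OD flows are modeled directly.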
7Still too complicated
- Each OD flow serves a different customer
population
- No two OD flows carry the same traffic
- Are they still correlated?
- Even more OD flows than links
- This makes Y = AX an ill-posed problem
- How to extract meaning from this high-dimensional
structure in a systematic fashion?
8Principal Component Analysis
- Look for a low-dimensional representation
preserving the most important features of the data
- Often, a high-dimensional structure can be
explained in terms of a small number of
independent variables
- Commonly used tool: Principal Component Analysis
(PCA)
9Specific Questions
- Are there low-dimensional representations for a
set of OD flows?
- Do OD flows share common features?
- What do the features look like?
- Can we get a high-level understanding of a set of
OD flows in terms of these features?
10PCA (1)
- For any given dataset, PCA finds a new coordinate
system that maps maximum variability in the data
to a minimum number of coordinates
- New axes are called Principal Axes or Components
11Properties of Principal Components
- Let p be the number of OD flows and t denote the
number of successive time intervals of interest.
Then X is a t × p matrix representing the
timeseries of all OD flows in a network
- Each principal component (PC) points in the
direction of maximum (remaining) energy in the data
12PCA on OD flows
- The set of flows mapped onto a single PC is called an
eigenflow.
- V, the matrix whose columns are the PCs, is a new
basis for X
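One way to compute the PCs and eigenflows is through the singular value decomposition. The sketch below is only illustrative: the data is synthetic (a shared sinusoidal trend plus noise), the sizes are arbitrary, and normalizing each eigenflow by its singular value is an assumption about the convention rather than something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
t, p = 200, 8  # time intervals and OD flows (made-up sizes)

# Synthetic OD flows: one shared periodic trend, scaled per flow, plus noise.
time = np.arange(t)
trend = np.sin(2 * np.pi * time / 50)
X = np.outer(trend, rng.uniform(1, 10, p)) + 0.1 * rng.standard_normal((t, p))
X = X - X.mean(axis=0)  # center each OD flow (column)

# SVD: X = U diag(s) Vt. The columns of V are the principal components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Mapping all flows onto the first PC gives the first eigenflow,
# here normalized to unit length by its singular value.
eigenflow_0 = X @ V[:, 0] / s[0]
```

With this synthetic input the first eigenflow closely follows the shared trend, which is the sense in which an eigenflow captures a temporal pattern common to many OD flows.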
13PCA on OD flows (2)
14An example of Eigenflow and PC
15Empirical studies
- Find intrinsic dimensionality of OD flows
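A common way to estimate intrinsic dimensionality is to count how many PCs are needed to capture most of the variance. The sketch below assumes a 95% threshold and synthetic data built from three orthogonal patterns; the helper names `energy_fractions`, `intrinsic_dim`, and `rank_k_approx` are hypothetical, and the paper's own criterion may differ.

```python
import numpy as np


def energy_fractions(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / np.sum(s**2)


def intrinsic_dim(X, threshold=0.95):
    """Smallest k whose leading PCs capture at least `threshold` of the variance."""
    cum = np.cumsum(energy_fractions(X))
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(cum))


def rank_k_approx(X, k):
    """Reconstruct X from its first k principal components."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean + (U[:, :k] * s[:k]) @ Vt[:k]


# Synthetic demo: 30 flows driven by 3 orthogonal periodic patterns plus noise.
rng = np.random.default_rng(1)
t, p = 200, 30
time = np.arange(t)
patterns = np.column_stack(
    [np.sin(2 * np.pi * time / period) for period in (50, 25, 10)]
)
loadings = np.zeros((3, p))
for i in range(3):
    loadings[i, 10 * i : 10 * (i + 1)] = 1.0  # each pattern drives 10 flows
X = patterns @ loadings + 0.01 * rng.standard_normal((t, p))

k = intrinsic_dim(X)
rel_err = np.linalg.norm(X - rank_k_approx(X, k)) / np.linalg.norm(X)
print(k, rel_err)
```

Here the data is constructed to have three underlying patterns, so three PCs recover it almost exactly. On real OD flows the threshold is a judgment call, and the scree plot of `energy_fractions` is usually inspected alongside it.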
18Major types of eigenflows
23Contribution of eigenflow type
24Contribution to each OD flow
25Summary: Specific questions
- Are there low-dimensional representations for a
set of OD flows?
- 5 or 6 eigenflows are sufficient for a good
approximation of a set of 100 OD flows
- Do OD flows share common features?
- The common features across OD flows are
eigenflows
- What do the features look like?
- Three types: D (deterministic), S (spike), or N (noise)
- Can we get a high-level understanding of a set of
OD flows in terms of these features?
- High-volume flows tend to be dominated by D type
- Low-volume flows tend to be dominated by N type
- S type contributes across all OD flows
26Possible applications
- Traffic Matrix Estimation
- Anomaly Detection
- Traffic Forecasting
- Traffic Engineering
27Data Streaming Algorithms for Efficient and
Accurate Estimation of Flow Size Distribution
- A. Kumar, M. Sung, J. Xu, J. Wang
28Problem statement
- Compute the distribution of flow sizes. Let the
flow sizes range from 1 to z. The total number of
flows is n, and $n_i$ is the number of flows with
i packets. We need to find $n_i$ for
$i = 1, \ldots, z$ (equivalently, the fractions
$\phi_i = n_i / n$).
29The approach
- Data streaming using a lossy data structure.
30Solution Architecture
- Measurement proceeds in epochs (e.g., 100s)
- Maintain an array of counters in fast memory
(SRAM)
- For each packet, a counter is chosen via hashing
and incremented.
- No attempt is made to detect or resolve collisions.
- Data collection is lossy (erroneous), but very
fast.
- At the end of the epoch, the counter array is
paged to disk
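The online part of this architecture can be sketched in a few lines. This is a software illustration only, assuming the hash is taken over a flow identifier and using SHA-256 as a stand-in for whatever hardware hash a real router would use; `counter_index` and `run_epoch` are hypothetical names.

```python
import hashlib


def counter_index(flow_id: str, m: int) -> int:
    """Map a flow identifier to one of m counters (stand-in hash function)."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:8], "little") % m


def run_epoch(packets, m):
    """Process one epoch: one counter increment per packet.

    Collisions are neither detected nor resolved, so a counter may hold
    the combined packet count of several flows.
    """
    counters = [0] * m
    for flow_id in packets:
        counters[counter_index(flow_id, m)] += 1
    return counters


# Each list element is the flow identifier of one packet (made-up trace).
packets = ["f1", "f2", "f1", "f3", "f1"]
counters = run_epoch(packets, m=16)
print(counters)
```

The per-packet work is a single hash and a single increment, which is why the scheme is fast. The price is that the resulting array only approximates the per-flow counts and needs the offline estimation step.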
31Offline estimation mechanisms
- Ideally, no collisions occur and the flow size
distribution can be read directly from the
counters. With real-world hash functions,
collisions do occur, so an estimation step is needed.
32Estimation module
- The counter array is processed to obtain the
Counter Value Distribution. $m_0$ is the number of
counters with value 0, and $m_i$ is the number of
counters with value i, i = 1, 2, ..., z.
- Use Bayesian statistics to derive the following
quantities
- The total number of flows n
- The total number of flows with exactly 1 packet,
$n_1$
- The flow size distribution
33Estimation of n and $n_1$
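- Let the total number of counters be m.
- The number of flows hashing to any counter c is
modeled by a Poisson random variable with
parameter $\lambda = n/m$
- There is a simple estimator for the total number
of flows: $\hat{n} = m \ln(m / m_0)$, where $m_0$ is
the number of counters with value 0
- The result can be extended to derive an estimator
of the number of flows of size 1:
$\hat{n}_1 = m_1 e^{\hat{n}/m}$, where $m_1$ is the
number of counters with value 1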
34Why?
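- Assume n flows have been inserted into m counters. The
number of flows hashed to any counter is Poisson($n/m$),
so the expected number of counters not hit is
$m_0 \approx m e^{-n/m}$, which gives $\hat{n} = m \ln(m/m_0)$.
- Among all counters, the expected number of counters
with exactly 1 packet is $m_1 \approx n_1 e^{-n/m}$, so we
have $\hat{n}_1 = m_1 e^{\hat{n}/m}$.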
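The two estimators can be checked on simulated data. In this sketch `m0` and `m1` denote the number of counters with value 0 and value 1, random hashing is simulated with Python's `random` module, and the flow sizes are invented; none of these choices come from the paper.

```python
import math
import random


def estimate_n(m, m0):
    """Estimate the total number of flows from the count of empty counters."""
    return m * math.log(m / m0)


def estimate_n1(m, m0, m1):
    """Estimate the number of size-1 flows from the count of value-1 counters."""
    return m1 * math.exp(estimate_n(m, m0) / m)


def simulate(m, flow_sizes, seed=0):
    """Hash each flow to a uniformly random counter and add its packet count."""
    rng = random.Random(seed)
    counters = [0] * m
    for size in flow_sizes:
        counters[rng.randrange(m)] += size
    return counters


m = 10_000
# Made-up workload: 3000 single-packet flows and 2000 flows of 2 to 20 packets.
flow_sizes = [1] * 3000 + [2 + (i % 19) for i in range(2000)]
counters = simulate(m, flow_sizes)
m0 = counters.count(0)
m1 = counters.count(1)
n_hat = estimate_n(m, m0)
n1_hat = estimate_n1(m, m0, m1)
print(round(n_hat), round(n1_hat))
```

Since exp(ln(m/m0)) = m/m0, the second estimator reduces to m1 · m / m0, which is a convenient form for a quick sanity check.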
35Estimating the entire flow distribution
- Begin with a guess of the flow distribution,
$\phi^{ini}$.
- Based on this, compute the various possible ways
of splitting a particular counter value and the
respective probabilities of such events.
- Then a refined estimate of the flow distribution,
$\phi^{new}$, can be computed.
- Use $\phi^{new}$ in the next iteration.
- Repeat until the estimate converges
(Expectation-Maximization, EM).
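The iteration can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes the Poisson collision model with rate n·φ_s/m for flows of size s, enumerates splits of a counter value into at most `max_parts` flows, and uses a tiny made-up counter histogram. The names `partitions` and `em_step` are hypothetical.

```python
import math
from collections import Counter


def partitions(v, max_parts, largest=None):
    """Yield the ways to split v into at most max_parts flow sizes (non-increasing)."""
    if v == 0:
        yield ()
        return
    if max_parts == 0:
        return
    largest = v if largest is None else min(largest, v)
    for first in range(largest, 0, -1):
        for rest in partitions(v - first, max_parts - 1, first):
            yield (first,) + rest


def em_step(counter_hist, phi, n, m, max_parts=4):
    """One EM iteration; counter_hist[v] is the number of counters with value v."""
    expected = Counter()  # expected number of flows of each size
    for v, count in counter_hist.items():
        if v == 0:
            continue
        # Unnormalized probability of each split under the Poisson model.
        weights = {}
        for part in partitions(v, max_parts):
            w = 1.0
            for size, f in Counter(part).items():
                lam = n * phi.get(size, 0.0) / m
                w *= lam**f / math.factorial(f)
            weights[part] = w
        total = sum(weights.values())
        if total == 0:
            continue
        # Credit each split's flows in proportion to its posterior probability.
        for part, w in weights.items():
            for size in part:
                expected[size] += count * w / total
    n_new = sum(expected.values())
    return {s: c / n_new for s, c in expected.items()}, n_new


# Made-up histogram: 90 empty counters, 8 with value 1, 2 with value 2.
hist = {0: 90, 1: 8, 2: 2}
m = 100
phi, n = {1: 0.8, 2: 0.2}, 10.0  # initial guess
for _ in range(20):
    phi, n = em_step(hist, phi, n, m)
print(phi, n)
```

Every split of a counter value v sums to v, so the expected total packet count is preserved at each iteration. That makes a simple invariant to check when experimenting with the sketch.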
36The algorithm
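37Calculate $p(\beta \mid \phi, n, v)$
- Let $\beta$ be the event that $f_1$ flows of size
$s_1$, ..., $f_q$ flows of size $s_q$ collide into a
slot, and let $\lambda_i = n \phi_i / m$. Then
$p(\beta \mid \phi, n) = e^{-n/m} \prod_{i=1}^{q} \lambda_{s_i}^{f_i} / f_i!$
- and, for a counter with value v,
$p(\beta \mid \phi, n, v) = p(\beta \mid \phi, n) / \sum_{\alpha \in \Omega_v} p(\alpha \mid \phi, n)$,
where $\Omega_v$ is the set of all collision events
whose flow sizes sum to v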
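A small worked example may help make the calculation concrete. It assumes the Poisson collision model, with rate $\lambda_i = n\phi_i/m$ for flows of size i, and uses illustrative numbers (m = 100, n = 10, $\phi_1 = 0.8$, $\phi_2 = 0.2$) that are not from the paper. A counter with value v = 2 can arise from one flow of size 2 or from two flows of size 1; the common factor $e^{-n/m}$ cancels in the ratio.

```latex
% Assumed, illustrative values: m = 100, n = 10, \phi_1 = 0.8, \phi_2 = 0.2
\begin{align*}
\lambda_1 &= \frac{n\phi_1}{m} = 0.08, \qquad
\lambda_2 = \frac{n\phi_2}{m} = 0.02 \\
p(\{2\}) &\propto \lambda_2 = 0.02 \\
p(\{1,1\}) &\propto \frac{\lambda_1^{2}}{2!} = 0.0032 \\
p(\{2\} \mid v = 2) &= \frac{0.02}{0.02 + 0.0032} \approx 0.862 \\
p(\{1,1\} \mid v = 2) &= \frac{0.0032}{0.02 + 0.0032} \approx 0.138
\end{align*}
```

Under these numbers a counter value of 2 is most likely a single two-packet flow, but roughly one time in seven it hides a collision of two single-packet flows.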
38Computational complexity
- For counters with value larger than 300, ignore
the cases involving the collision of 4 or more
flows.
- For counters with value between 50 and 300,
ignore the cases involving the collision of 5 or
more flows.
- For other counter values, ignore the cases involving
the collision of 7 or more flows.
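The effect of these cutoffs can be gauged by counting how many splits remain. In this sketch the boundary values 50 and 300 are treated as belonging to the middle range, which is an assumption since the slide does not say, and both function names are hypothetical.

```python
def max_colliding_flows(v):
    """Largest number of colliding flows considered for a counter of value v."""
    if v > 300:
        return 3  # ignore collisions of 4 or more flows
    if v >= 50:
        return 4  # ignore collisions of 5 or more flows
    return 6      # ignore collisions of 7 or more flows


def count_splits(v, max_parts):
    """Number of ways to split v into at most max_parts positive flow sizes.

    Uses the identity that partitions into at most k parts are equinumerous
    with partitions into parts of size at most k.
    """
    ways = [1] + [0] * v
    for part in range(1, max_parts + 1):
        for total in range(part, v + 1):
            ways[total] += ways[total - part]
    return ways[v]


for v in (10, 100, 400):
    k = max_colliding_flows(v)
    print(v, k, count_splits(v, k), count_splits(v, v))
```

Comparing the last two printed columns shows how sharply the truncation reduces the number of splits that need to be evaluated for large counter values.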
39Evaluation
40Evaluation: small flow size
41Multi-resolution array of counters
- The multi-resolution array of counters allows the
scheme to operate for any value of n, with
graceful degradation in accuracy for large numbers
of flows.
42Evaluation of MRAC
43Conclusion
- Data-streaming-based solution for estimating
the flow size distribution
- A lossy data structure combined with Bayesian
statistics yields accurate estimates from streaming data.
- Estimation using the EM algorithm