Title: Hailuoto Workshop
1Hailuoto Workshop
- A Statistician's Adventures in Internetland
- J. S. Marron
- Department of Statistics and Operations Research
- University of North Carolina
- December 3, 2015
2Co-Authors in this work
- Felix Hernández Campos, Fred Godtliebsen
- George Michailidis, Cheolwoo Park
- Juhyun Park, Vladas Pipiras
- Vitaliana Rondonotti, David Rolls
- Haipeng Shen, F. D. Smith
- Richard Smith, Stilian Stoev
- Murad Taqqu, Zhengyuan Zhu
3Internet Traffic Background
- Gigantic (worldwide) Communications Networks
- Telephone Network
- Internet
- Both based on connections between 2 points
4Fundamental Difference, I
- Manner of Equipment Usage
- Telephone: each connection has sole use (of 2 wires)
- Congestion: no connection (busy signal)
- Internet: all connections share resources
- (transmissions split into small packets)
- Congestion: packet loss and delays
5Fundamental Difference, II
- Distribution of duration (time) of connections
- Telephone: (roughly) exponential distribution
- (must speak; how long can you hold the phone?)
- Internet: heavy tailed distributions
- (very long and very short connections!)
6Fundamental Difference, III
- Mathematical Models
- Telephone: queueing theory
- Poisson arrivals, exponential durations
- Internet: heavy tail connection durations lead to Long Range Dependence
7Internet Modes of Study
- Internet Structure
- Connectivity Graphs
- Internet Tomography
- Use measurements at edges
- To infer general structure and behavior
- Traffic measurements at one point
- Time series of packet information at one location
8Internet Measurement and Modeling Goals
- Models for simulation???
- For protocol fixing
- QoS fixing gross inefficiencies
- Testbed for business applications
- web developers
- Goodness of Fit of models???
- How do we know this works like real traffic?
- More realistically: how bad is it?
9Our Data Collection Point
- Tap on Main Link at UNC (U. North Carolina)
- Heavy traffic both directions
- 35,000 web browsers
- Sunsite (mirror site for large data bases)
- Some indication of scale
- 1998 peak traffic: 3 minutes for 1 million packets
- 2001 peak traffic: 1 minute for 1 million packets
10Data Source
- Sequence of Packet Header Info, such as
- Arrival Time
- Source and destination addresses
- Packet type (request, data, acknowledgment, ...)
- Packet size (40 to 1500 bytes)
- Sequence number
- Data extraction
- Heavy database filtering, by UNC Comp. Sci. folks:
- Jeffay, Smith, Ott, Hernandez Campos, Long
11Toy Example View of Packets and Connections
- Connections: made up of packets
- Observations: starting time, duration (time), size (bytes), packet counts, byte counts, ...
12Mice and Elephants Graphic
- Mice and Elephants plot
- Visual display of HTTP Responses
- Show only times of HTTP Response starts and ends
- As horizontal line segments (in time)
- Condense string of packets to only a line segment
- Visually separate by adding random height
- Tukey's jitter plot idea
- Only show random sample of 5000 out of 100,000
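The construction described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the original graphic: the function name `mice_and_elephants_segments`, its parameters, and the toy data are all hypothetical.

```python
import random

def mice_and_elephants_segments(starts, ends, n_show=5000, seed=0):
    """Return (start, end, height) triples for a random subsample of flows.

    Each triple describes one horizontal line segment. The height is random
    jitter, used only to separate overlapping segments visually.
    """
    if len(starts) != len(ends):
        raise ValueError("starts and ends must have the same length")
    rng = random.Random(seed)
    k = min(n_show, len(starts))
    # Sample flow indices without replacement, then attach a random height.
    chosen = rng.sample(range(len(starts)), k)
    return [(starts[i], ends[i], rng.random()) for i in chosen]

# Hypothetical toy data: 10 flows, flow i starts at time i and lasts i/10.
toy_starts = [float(i) for i in range(10)]
toy_ends = [s + s / 10.0 for s in toy_starts]
segments = mice_and_elephants_segments(toy_starts, toy_ends, n_show=4)
```

Each triple could then be drawn as a horizontal line, for example with matplotlib's `hlines(height, start, end)`.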
13Mice and Elephants Graphic
- Many very small mice
- Few very large elephants
14Mice and Elephants Simulated Durations
- Same start times
- Simulated exponential durations
- No elephants
- No mice
- Only in-betweens
- Exponential model is very poor
15Some Important Time Series
- Binned counts (aggregated over time)
- Count: packets, bytes (often similar)
- This talk: 1 or 10 ms bins
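A minimal sketch of how binned packet and byte counts could be formed from packet header records. The function name `binned_counts`, its signature, and the toy packets are assumptions made for illustration; times are taken to be in seconds, so a 10 ms bin is `bin_width=0.010`.

```python
def binned_counts(arrival_times, sizes, bin_width, t_start, n_bins):
    """Aggregate packets into equal-width time bins.

    Returns two lists of length n_bins: packets per bin and bytes per bin.
    Packets outside [t_start, t_start + n_bins * bin_width) are ignored.
    """
    packet_counts = [0] * n_bins
    byte_counts = [0] * n_bins
    for t, size in zip(arrival_times, sizes):
        if t < t_start:
            continue
        k = int((t - t_start) / bin_width)  # index of the bin containing t
        if k < n_bins:
            packet_counts[k] += 1
            byte_counts[k] += size
    return packet_counts, byte_counts

# Hypothetical toy input: four packets, 10 ms bins, three bins.
times = [0.001, 0.004, 0.012, 0.025]
sizes = [40, 1500, 40, 576]
packets, bytes_ = binned_counts(times, sizes, 0.010, 0.0, 3)
# packets == [2, 1, 1], bytes_ == [1540, 40, 576]
```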
16A Menu of Interesting Issues
- Bin Count Time Series
- Long Range dependence?
- Point Process of Flow Start Times
- Duration Distributions (heavy tails)
- Heavy tail durations lead to LRD
- Relationship between Size and Duration?
- Time series of packets within flows?
17Flow Duration Distributions
- Study sizes (bytes) of HTTP responses, as surrogate for
- Time between first and last packet
- In mice and elephant plot
- Study 4 hour time block
- Heavy Traffic Time
- Thursday morning, 8:00 AM to 12:00 noon
- In April 2001
- From UNC Main Link
18Flow Duration Distributions
- Log-Log Complementary Cumulative Distribution Function plot
- log(1 - F(x)) as a function of log(x)
- Would be linear for Pareto
- (slope = minus the shape parameter)
- Q-Q plot against exponential
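As a rough illustration of the log-log CCDF plot, the sketch below computes the empirical points and a least-squares slope. The names `loglog_ccdf` and `fit_slope` are hypothetical, and the empirical CCDF is taken as the fraction of observations at or above each sorted value (exact when there are no ties), which keeps the logarithm finite at the largest observation. For a Pareto sample the slope should be close to minus the shape parameter.

```python
import math

def loglog_ccdf(values):
    """Return points (log10 x, log10 CCDF(x)) for the sorted positive sample."""
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    # (n - i) / n is the fraction of observations >= xs[i] when there are no ties.
    return [(math.log10(x), math.log10((n - i) / n)) for i, x in enumerate(xs)]

def fit_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Toy check on exact Pareto quantiles with an assumed shape parameter of 1.5.
alpha = 1.5
n = 1000
sample = [((n - i) / n) ** (-1.0 / alpha) for i in range(n)]
slope = fit_slope(loglog_ccdf(sample))  # close to -alpha
```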
19Flow Duration Distns log log CCDF
- Log-log scale stretches quantiles
- Allows clear view of tail
20Flow Duration Distns log log CCDF
- Wiggles about possible linear fit
- Wiggles really there? Or natural variation?
21Flow Duration Distns log log CCDF
- Does Pareto Model fit?
- Looks OK ???
- Careful, have 5.6 million data points
- What is sampling variation?
- How can we assess it?
22Flow Duration Distns log log CCDF
- Downey (2001) Controversy
- Log Normal fits as well?
- Big problem, log normal is not heavy tailed?
- E.g. all moments finite (exponential, not polynomial, tail!)
- Implications for "heavy tails lead to LRD" theory?
- Modified theory
- Hanning, Marron, Samorodnitzky and Smith (2002)
- Idea: log normal (with changing parameters) can still lead to LRD
23Flow Duration Distns log log CCDF
- Log normal fit
- Looks OK???
- What is natural variation?
24Flow Duration Distns log log CCDF
- Interesting Viewpoint
- Gong, Liu, Misra and Towsley (2001)
- Key idea: distributional fragility
- Several very different distributions can all fit in tails
- Conclusions
- Careful: tail fits not very interpretable
- "Heavy tails" may be slippery concept
25Flow Duration Distns log log CCDF
- Now address sampling variation issue
- Add overlay of 100 data sets
- Same sample size
- from given distribution
- Gives good intuitive idea of sampling variation
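The overlay idea can be sketched as a simulation envelope: draw many data sets of the same size from the candidate distribution, compute each CCDF on a common grid, and record the pointwise range. This is a sketch under assumptions, not the code used for the slides; `ccdf_on_grid`, `simulated_envelope`, the Pareto shape 1.2, and the sample size 2000 are all hypothetical choices.

```python
import bisect
import random

def ccdf_on_grid(sample, grid):
    """Fraction of the sample strictly greater than each grid value."""
    xs = sorted(sample)
    n = len(xs)
    return [(n - bisect.bisect_right(xs, g)) / n for g in grid]

def simulated_envelope(sampler, n, grid, n_sims=100, seed=0):
    """Pointwise min and max of the CCDF over n_sims simulated data sets.

    sampler(rng) must return one draw from the candidate distribution.
    """
    rng = random.Random(seed)
    curves = [
        ccdf_on_grid([sampler(rng) for _ in range(n)], grid)
        for _ in range(n_sims)
    ]
    lower = [min(c[j] for c in curves) for j in range(len(grid))]
    upper = [max(c[j] for c in curves) for j in range(len(grid))]
    return lower, upper

def pareto_sampler(rng):
    # Assumed candidate model: Pareto with shape 1.2 and minimum value 1.
    return rng.paretovariate(1.2)

grid = [10 ** (k / 4.0) for k in range(13)]  # 1 to 1000, log-spaced
lower, upper = simulated_envelope(pareto_sampler, n=2000, grid=grid)
```

A data CCDF that leaves the band between `lower` and `upper` shows more structure than sampling variation from the candidate model would explain.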
26Flow Duration Distns log log CCDF
- Pareto Model clearly does not fit this data
27Flow Duration Distns log log CCDF
- Overlay from Pareto Distribution
- Gives good intuitive idea of sampling variation
- Shows wiggles are really there
- not sampling artifact
- Suggests distribution not heavy tailed?
- Definition is only asymptotic
- Implications for above theory?
28Flow Duration Distns log log CCDF
- New concept motivated by above analysis
- Several tail regions
- near tail: very complete info
- rich data, no envelope
- far tail: sketchy info
- sparse data, wide envelope
- extreme tail: no info
- beyond range of data
- Important point: have these regardless of size of data set
29Flow Duration Distns log log CCDF
- Different viewpoint: look across time blocks
- 7 days of the week (Sun. - Sat.) x 3 time blocks
- Morning, 8:00 AM to 12:00
- Afternoon, 12:00 to 4:00 PM
- Evening, 7:30 PM to 11:30
- 21 log log CCDF plots
- All overlaid in blue, each highlighted in red
- Structure amazingly similar!
- Wiggles go the same way!!!
- Suggests these are important population structure!
30Flow Duration Distns log log CCDF
- A deeper distributional look: interesting candidate
- Double Pareto Log Normal distribution
- Reed, W. J. (2001) http://www.math.uvic.ca/faculty/reed/
- 4 parameter family
- Pareto type tails, center and spread like log-normal
- Interpretable as stopped geometric Brownian Motion
- Makes physical sense (via sequence of file updates)
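For intuition about the 4 parameter family, here is a hedged sampling sketch. It assumes the representation usually attributed to Reed and co-authors, in which log X is a normal variable plus one scaled exponential minus another; the function name `sample_dpln`, the parameter names, and the numerical values are illustrative assumptions, not taken from the slides.

```python
import math
import random

def sample_dpln(alpha, beta, nu, tau, n, seed=0):
    """Draw n values from a double Pareto log normal distribution.

    Assumed representation: log X = nu + tau * Z + E1 / alpha - E2 / beta,
    with Z standard normal and E1, E2 independent standard exponentials.
    alpha governs the upper (Pareto type) tail, beta the lower tail, and
    nu, tau the log-normal center and spread.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        log_x = (nu + tau * rng.gauss(0.0, 1.0)
                 + rng.expovariate(1.0) / alpha
                 - rng.expovariate(1.0) / beta)
        draws.append(math.exp(log_x))
    return draws

# Hypothetical parameter values, chosen only to exercise the sampler.
sizes = sample_dpln(alpha=1.5, beta=2.0, nu=8.0, tau=1.0, n=20000)
```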
31Flow Duration Distns log log CCDF
- Seems a better fit?
- But doesn't model wiggles
32Flow Duration Distns log log CCDF
- Simulated envelope shows DPLN actually doesn't fit the data
33Flow Duration Distns log log CCDF
- Richer family
- Mixture distributions
- Mixture of 3 DPLN distributions
- Parameters fit visually
- E.g. maximum likelihood looks slippery
34Flow Duration Distns log log CCDF
- Amazingly good fit
- For 5.6 million data points!
35Flow Duration Distns log log CCDF
- Interpretation of Mixture parameters
- 55%, sizes around 10^2 bytes, maybe
- tiny layout images
- HTML error status pages
- navigation bars in multi-frame pages
- 45%, sizes around 10^4 bytes, maybe
- most standard HTML text pages and images
- 0.1%, sizes around 10^6 bytes, maybe
- software
- multimedia content (such as movies)
- PDF document
- Makes physical sense!
36Flow Duration Distns log log CCDF
- Similar lessons for log-normal
- Distributional fragility!
- Want 4th component?
- Only 5 data points.
37Flow Duration Distns log log CCDF
- Serious implications for current theory
- Wobbly tail is not a heavy tail in classical sense
- Classical definition requires convergence at a particular rate
- Not wobbling between rates (as modelled above)
- Question from Downey (2001): so where does LRD come from?
38Variable Tail Index
- Idea: extended model, which allows wiggly tails
- Approach: consider
- location dependent tail index
- defined as minus the slope of log log CCDF
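One simple way to make a location dependent tail index concrete is a finite-difference slope of the empirical log log CCDF between neighboring grid points. This is only a sketch of the idea, not the estimator from the cited work; `local_tail_index`, the grid, and the toy sample are assumptions.

```python
import bisect
import math

def local_tail_index(values, grid):
    """Estimate the tail index between consecutive grid points.

    Returns (x_mid, index) pairs, where index is minus the slope of the
    empirical log log CCDF on [grid[j], grid[j + 1]] and x_mid is the
    geometric midpoint of that interval.
    """
    xs = sorted(values)
    n = len(xs)
    ccdf = [(n - bisect.bisect_right(xs, g)) / n for g in grid]
    out = []
    for j in range(len(grid) - 1):
        if ccdf[j] <= 0 or ccdf[j + 1] <= 0:
            break  # beyond the range of the data: no information
        slope = ((math.log(ccdf[j + 1]) - math.log(ccdf[j]))
                 / (math.log(grid[j + 1]) - math.log(grid[j])))
        out.append((math.sqrt(grid[j] * grid[j + 1]), -slope))
    return out

# Toy check: exact Pareto quantiles with assumed shape 1.5 give a flat index.
alpha = 1.5
n = 100000
sample = [((n - i) / n) ** (-1.0 / alpha) for i in range(n)]
indices = local_tail_index(sample, [2.0, 4.0, 8.0, 16.0])
```

On data with a wobbly tail, the returned index would move up and down with location instead of staying constant.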
39Flow Duration Distns log log CCDF
- Tail index wobbles mostly between 1 and 2, and outside, too
40Variable Tail Index
- Enhanced Theory
- Hernandez Campos, Marron, Samorodnitzky and Smith (2002) Variable Heavy Tailed Durations in Internet Traffic.
- Sound bite version
- For variable tail index
- often between 1 and 2
- still get LRD