Hailuoto Workshop - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Hailuoto Workshop
  • A Statistician's Adventures in Internetland
  • J. S. Marron
  • Department of Statistics and Operations Research
  • University of North Carolina
  • December 3, 2015

2
Co-Authors in this work
  • Felix Hernández Campos, Fred Godtliebsen
  • George Michailidis, Cheolwoo Park
  • Juhyun Park, Vladas Pipiras
  • Vitaliana Rondonotti, David Rolls
  • Haipeng Shen, F. D. Smith
  • Richard Smith, Stilian Stoev
  • Murad Taqqu, Zhengyuan Zhu

3
Internet Traffic Background
  • Gigantic (worldwide) Communications Networks
  • Telephone Network
  • Internet
  • Both based on connections between 2 points

4
Fundamental Difference, I
  • Manner of Equipment Usage
  • Telephone: each connection has sole use (of 2 wires)
  • Congestion: no connection (busy signal)
  • Internet: all connections share resources
  • (transmissions split into small packets)
  • Congestion: packet loss and delays

5
Fundamental Difference, II
  • Distribution of duration (time) of connections
  • Telephone: (roughly) exponential distribution
  • (must speak; how long can you hold a phone?)
  • Internet: heavy tailed distributions
  • (very long and very short connections!)

6
Fundamental Difference, III
  • Mathematical Models
  • Telephone: queueing theory
  • Poisson arrivals, exponential durations
  • Internet: heavy tail connection durations, long range dependence

7
Internet Modes of Study
  • Internet Structure
  • Connectivity Graphs
  • Internet Tomography
  • Use measurements at edges
  • To infer general structure and behavior
  • Traffic measurements at one point
  • Time series of packet information at one location

8
Internet Measurement Modeling Goals
  • Models for simulation???
  • For protocol fixing
  • QoS fixing gross inefficiencies
  • Testbed for business applications
  • web developers
  • Goodness of Fit of models???
  • How do we know this works like real traffic?
  • More realistically: how bad is it?

9
Our Data Collection Point
  • Tap on Main Link at UNC (U. North Carolina)
  • Heavy traffic both directions
  • 35,000 web browsers
  • Sunsite (mirror site for large data bases)
  • Some indication of scale
  • 1998 peak traffic: 3 minutes for 1 million packets
  • 2001 peak traffic: 1 minute for 1 million packets

10
Data Source
  • Sequence of Packet Header Info, such as
  • Arrival Time
  • Source and destination addresses
  • Packet type (request, data, acknowledgment, ...)
  • Packet size (40 to 1500 bytes)
  • Sequence number
  • Data extraction
  • Heavy database filtering, by UNC Comp. Sci. folks:
  • Jeffay, Smith, Ott, Hernandez Campos, Long

11
Toy Example: View of Packets and Connections
  • Connections are made up of packets
  • Observations: starting time, duration (time), size (bytes), packet counts, byte counts, ...

12
Mice and Elephants Graphic
  • Mice and Elephants plot
  • Visual display of HTTP Responses
  • Show only times of HTTP response starts and ends
  • As horizontal line segments (in time)
  • Condense string of packets to only a line segment
  • Visually separate by adding random height
  • Tukey's jitter plot idea
  • Only show random sample of 5000 out of 100,000
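
The construction described in this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mice_and_elephants_segments` and the toy flow data are hypothetical, and the heavy-tailed toy durations are made up rather than taken from the UNC trace. Each flow becomes a (start, end, height) triple, where the height is the random jitter used to separate segments visually.

```python
import random

def mice_and_elephants_segments(flows, n_show=5000, seed=0):
    """Return (start, end, height) triples for a random subsample of flows.

    flows: list of (start_time, end_time) pairs, one per HTTP response.
    Each flow is drawn as a horizontal segment at a random height (jitter),
    so that overlapping flows can be told apart visually.
    """
    rng = random.Random(seed)
    shown = flows if len(flows) <= n_show else rng.sample(flows, n_show)
    return [(start, end, rng.random()) for start, end in shown]

# Hypothetical toy data: 100,000 flows with made-up starts and durations.
toy_rng = random.Random(1)
toy_flows = []
for _ in range(100_000):
    start = toy_rng.uniform(0.0, 3600.0)           # seconds within one hour
    duration = 0.05 * toy_rng.paretovariate(1.2)   # assumed heavy-tailed durations
    toy_flows.append((start, start + duration))

segments = mice_and_elephants_segments(toy_flows)
```

The triples can be passed to any plotting tool that draws horizontal line segments.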

13
Mice and Elephants Graphic
  • Many very small mice
  • Few very large elephants

14
Mice and Elephants: Simulated Durations
  • Same start times
  • Simulated exponential durations
  • No elephants, no mice, only in-betweens
  • Exponential model is very poor
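
One way to put a number on "no elephants" is to ask what share of the total duration is carried by the longest 1% of flows. The sketch below is an assumption-laden illustration, not the analysis behind the slide: the Pareto shape 1.2 is an arbitrary stand-in for a heavy-tailed model, and `top_share` is a hypothetical helper. Both samples have the same theoretical mean.

```python
import random

def top_share(durations, fraction=0.01):
    """Share of the total duration carried by the longest `fraction` of flows."""
    ordered = sorted(durations, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(0)
n = 100_000
mean_duration = 6.0   # assumed common mean, arbitrary time units

exponential = [rng.expovariate(1.0 / mean_duration) for _ in range(n)]
# Pareto with shape 1.2 and minimum 1 has mean 1.2 / 0.2 = 6, matching the above.
heavy_tailed = [rng.paretovariate(1.2) for _ in range(n)]

share_exp = top_share(exponential)      # small: no elephants
share_heavy = top_share(heavy_tailed)   # large: a few elephants dominate
```

Under the exponential model the top 1% of flows carry only a few percent of the total, while under the heavy-tailed stand-in they carry a large part of it.
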
15
Some Important Time Series
  • Binned Counts (aggregated over time)
  • Count packets or bytes (often similar)
  • This talk: 1 or 10 ms bins
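
A minimal sketch of forming the binned count series from packet header records. The function name `binned_counts` and the five toy packets are hypothetical; arrival times are assumed to be non-negative and measured in milliseconds from the start of the trace.

```python
def binned_counts(arrival_ms, sizes_bytes, bin_ms=10.0):
    """Aggregate packets into fixed-width time bins.

    Returns two lists of equal length: packets per bin and bytes per bin.
    """
    if not arrival_ms:
        return [], []
    n_bins = int(max(arrival_ms) // bin_ms) + 1
    packets = [0] * n_bins
    byte_totals = [0] * n_bins
    for t, size in zip(arrival_ms, sizes_bytes):
        k = int(t // bin_ms)      # index of the bin containing time t
        packets[k] += 1
        byte_totals[k] += size
    return packets, byte_totals

# Hypothetical toy trace: five packets, 10 ms bins.
packet_series, byte_series = binned_counts(
    [0.5, 3.2, 12.0, 19.9, 35.0],
    [40, 1500, 40, 576, 1500],
)
```
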
16
A Menu of Interesting Issues
  • Bin Count Time Series
  • Long Range dependence?
  • Point Process of Flow Start Times
  • Duration Distributions (heavy tails)
  • Heavy tail durations → LRD
  • Relationship between Size and Duration?
  • Time series of packets within flows?

17
Flow Duration Distributions
  • Study sizes (bytes) of HTTP responses, as
    surrogate for
  • Time between first and last packet
  • In mice and elephant plot
  • Study 4 hour time block
  • Heavy Traffic Time
  • Thursday Morning, 8:00 AM to 12:00 Noon
  • In April 2001
  • From UNC Main Link

18
Flow Duration Distributions
  • Log-Log Complementary Cumulative Distribution Function Plot
  • log(1 - F(x)) as a function of log(x)
  • Would be linear for Pareto
  • (slope = minus the shape parameter)
  • Q-Q plot against exponential
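
The log-log CCDF can be computed directly from a sorted sample. The sketch below uses hypothetical helper names (`loglog_ccdf`, `fitted_slope`) and a simulated Pareto sample with an assumed shape parameter of 1.5, so it illustrates only the "linear for Pareto" remark, not the UNC data.

```python
import math
import random

def loglog_ccdf(sample):
    """Return (log10 x, log10(1 - F_n(x))) points of the empirical CCDF.

    Assumes all observations are positive. The largest observation is
    skipped because the empirical CCDF is zero there.
    """
    xs = sorted(sample)
    n = len(xs)
    return [(math.log10(x), math.log10((n - i - 1) / n))
            for i, x in enumerate(xs[:-1])]

def fitted_slope(points):
    """Ordinary least squares slope through the given (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

rng = random.Random(0)
pareto_sample = [rng.paretovariate(1.5) for _ in range(20_000)]  # assumed shape 1.5
slope = fitted_slope(loglog_ccdf(pareto_sample))   # close to -1.5 for this model
```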

19
Flow Duration Distns log log CCDF
  • Log-log scale stretches quantiles
  • Allows clear view of tail

20
Flow Duration Distns log log CCDF
  • Wiggles about possible linear fit
  • Wiggles really there? Or natural
    variation?

21
Flow Duration Distns log log CCDF
  • Does Pareto Model fit?
  • Looks OK ???
  • Careful, have 5.6 million data points
  • What is sampling variation?
  • How can we assess it?

22
Flow Duration Distns log log CCDF
  • Downey (2001) Controversy
  • Log Normal fits as well?
  • Big problem: log normal is not heavy tailed?
  • E.g. has all moments (exponential, not polynomial, tail!)
  • Implications for heavy tails → LRD theory?
  • Modified theory
  • Hannig, Marron, Samorodnitsky and Smith (2002)
  • Idea: log normal (with changing parameters) → LRD

23
Flow Duration Distns log log CCDF
  • Log normal fit
  • Looks OK???
  • What is natural variation?

24
Flow Duration Distns log log CCDF
  • Interesting Viewpoint
  • Gong, Liu, Misra and Towsley (2001)
  • Key idea: distributional fragility
  • Several very different distributions can all fit in tails
  • Conclusions
  • Careful: tail fits not very interpretable
  • Heavy tails may be a slippery concept

25
Flow Duration Distns log log CCDF
  • Now address sampling variation issue
  • Add overlay of 100 simulated data sets
  • Same sample size
  • from given distribution
  • Gives good intuitive idea of sampling variation
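
The overlay idea can be sketched as a pointwise envelope: simulate many data sets of the same size from the candidate distribution and record the lowest and highest CCDF value seen at each grid point. The names `ccdf_at` and `simulation_envelope`, the Pareto shape 1.5, and the sample size 2000 are all assumptions chosen to keep the example fast; the talk's data set is far larger.

```python
import bisect
import random

def ccdf_at(sorted_sample, x):
    """Empirical CCDF: fraction of observations strictly greater than x."""
    n = len(sorted_sample)
    return (n - bisect.bisect_right(sorted_sample, x)) / n

def simulation_envelope(draw, n, grid, n_sets=100, seed=0):
    """Pointwise min and max of the CCDF over n_sets simulated data sets.

    draw: function taking a random.Random and returning one observation.
    """
    rng = random.Random(seed)
    lows = [1.0] * len(grid)
    highs = [0.0] * len(grid)
    for _ in range(n_sets):
        simulated = sorted(draw(rng) for _ in range(n))
        for j, x in enumerate(grid):
            c = ccdf_at(simulated, x)
            lows[j] = min(lows[j], c)
            highs[j] = max(highs[j], c)
    return lows, highs

# Assumed candidate model: Pareto with shape 1.5 and minimum 1.
grid = [10 ** 0.5, 10.0, 10 ** 1.5, 100.0]
lows, highs = simulation_envelope(lambda r: r.paretovariate(1.5), 2000, grid)
```

If the CCDF of the real data leaves the band between `lows` and `highs`, the departure is larger than what sampling variation alone produced in these 100 simulated sets. The band is narrow where data are rich and wide, in relative terms, far out in the tail.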

26
Flow Duration Distns log log CCDF
  • Pareto Model clearly does not fit this data

27
Flow Duration Distns log log CCDF
  • Overlay from Pareto Distribution
  • Gives good intuitive idea of sampling variation
  • Shows wiggles are really there
  • not sampling artifact
  • Suggests distribution not heavy tailed?
  • Definition is only asymptotic
  • Implications for above theory?

28
Flow Duration Distns log log CCDF
  • New concept motivated by above analysis
  • Several tail regions
  • near tail: very complete info
  • rich data, no envelope
  • far tail: sketchy info
  • sparse data, wide envelope
  • extreme tail: no info, beyond range of data
  • Important point: have these regardless of size of data set

29
Flow Duration Distns log log CCDF
  • Different viewpoint: look across time blocks
  • 7 days of the week (Sun. - Sat.) × 3 time blocks
  • Morning: 8:00 AM to 12:00
  • Afternoon: 12:00 to 4:00 PM
  • Evening: 7:30 PM to 11:30
  • 21 log log CCDF plots
  • All overlaid in blue, each highlighted in red
  • Structure amazingly similar!
  • Wiggles go the same way!!!
  • Suggests these are important population structure!

30
Flow Duration Distns log log CCDF
  • A deeper distributional look: interesting candidate
  • Double Pareto Log Normal distribution
  • Reed, W. J. (2001) http://www.math.uvic.ca/faculty/reed/
  • 4 parameter family
  • Pareto type tails, center and spread like log-normal
  • Interpretable as stopped geometric Brownian Motion
  • Makes physical sense (via sequence of file updates)
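
The "stopped geometric Brownian motion" reading can be sketched as follows: start from a log-normal initial size, let the size evolve as geometric Brownian motion, and observe it after an exponentially distributed time. The function name `stopped_gbm` and every parameter value below are illustrative assumptions, not fitted values from the talk.

```python
import math
import random

def stopped_gbm(rng, nu, tau, mu, sigma, lam):
    """One draw of geometric Brownian motion observed at a random time.

    The log of the initial state is Normal(nu, tau^2); the process has
    drift mu and volatility sigma; the observation time is exponential
    with rate lam. Works on the log scale, where the motion is Gaussian.
    """
    log_x0 = rng.gauss(nu, tau)
    t = rng.expovariate(lam)
    log_xt = (log_x0
              + (mu - 0.5 * sigma ** 2) * t
              + sigma * math.sqrt(t) * rng.gauss(0.0, 1.0))
    return math.exp(log_xt)

rng = random.Random(0)
# Assumed parameters: centered near 10^4 bytes, zero drift on the log scale.
sizes = [stopped_gbm(rng, nu=math.log(1e4), tau=0.5, mu=0.5, sigma=1.0, lam=1.125)
         for _ in range(20_000)]
```

With zero drift on the log scale, as assumed here, the log sizes are symmetric about `nu`, so the sample median sits near 10^4.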

31
Flow Duration Distns log log CCDF
  • Seems a better fit?
  • But doesn't model wiggles

32
Flow Duration Distns log log CCDF
  • Simulated envelope shows DPLN actually doesn't fit the data

33
Flow Duration Distns log log CCDF
  • Richer family
  • Mixture distributions
  • Mixture of 3 DPLN distributions
  • Parameters fit visually
  • E.g. maximum likelihood looks slippery

34
Flow Duration Distns log log CCDF
  • Amazingly good fit
  • For 5.6 million data points!

35
Flow Duration Distns log log CCDF
  • Interpretation of Mixture parameters
  • 55%, sizes ~ 10^2 bytes, maybe
  • tiny layout images
  • HTML error status pages
  • navigation bars in multi-frame pages
  • 45%, sizes ~ 10^4 bytes, maybe
  • most standard HTML text pages and images
  • 0.1%, sizes ~ 10^6 bytes, maybe
  • software
  • multimedia content (such as movies)
  • PDF documents
  • Makes physical sense!
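
A three-component mixture of this kind can be simulated as sketched below. The weights and centers are read from this slide as roughly 55%, 45% and 0.1% around 10^2, 10^4 and 10^6 bytes; that reading is an assumption, as are the spread and tail parameters `TAU`, `ALPHA`, `BETA` and the helper names. Each component is drawn with the standard representation of a double Pareto log-normal variable on the log scale: a normal term plus one exponential term minus another.

```python
import math
import random

# (weight, center on the natural log scale); weights need not sum exactly to 1.
COMPONENTS = [(0.55, math.log(1e2)), (0.45, math.log(1e4)), (0.001, math.log(1e6))]
TAU, ALPHA, BETA = 0.5, 2.0, 2.0   # assumed spread and tail parameters

def draw_dpln(rng, nu, tau, alpha, beta):
    """One double Pareto log-normal draw: log X = Normal + Exp/alpha - Exp/beta."""
    log_x = rng.gauss(nu, tau) + rng.expovariate(alpha) - rng.expovariate(beta)
    return math.exp(log_x)

def draw_mixture(rng, n):
    """Draw n sizes: pick a component by weight, then draw from its DPLN."""
    weights = [w for w, _ in COMPONENTS]
    picks = rng.choices(COMPONENTS, weights=weights, k=n)
    return [draw_dpln(rng, nu, TAU, ALPHA, BETA) for _, nu in picks]

rng = random.Random(0)
mixture_sizes = draw_mixture(rng, 50_000)
share_small = sum(s < 1e3 for s in mixture_sizes) / len(mixture_sizes)
```

With these assumed parameters a little over half of the simulated responses fall below 10^3 bytes, in line with the weight of the first component.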

36
Flow Duration Distns log log CCDF
  • Similar lessons for log-normal
  • Distributional fragility!
  • Want 4th component?
  • Only 5 data points.

37
Flow Duration Distns log log CCDF
  • Serious implications for current theory
  • Wobbly tail is not a heavy tail in the classical sense
  • Classical definition requires convergence at a particular rate
  • Not wobbling between rates (as modelled above)
  • Question from Downey (2001): so where does LRD come from?

38
Variable Tail Index
  • Idea: extended model, which allows wiggly tails
  • Approach: consider
  • location dependent tail index
  • as the negative slope of the log log CCDF
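
A location dependent tail index can be estimated as a finite-difference slope of the empirical log-log CCDF. The sketch below is one simple way to do this and is not taken from the paper: the function name `local_tail_index`, the window `factor`, and the simulated Pareto sample with shape 1.5 are all assumptions.

```python
import bisect
import math
import random

def local_tail_index(sorted_sample, x, factor=2.0):
    """Negative slope of the log-log empirical CCDF around location x.

    Uses the CCDF at x / sqrt(factor) and x * sqrt(factor), so the two
    evaluation points are a ratio of `factor` apart on the x axis.
    """
    n = len(sorted_sample)

    def ccdf(v):
        return (n - bisect.bisect_right(sorted_sample, v)) / n

    lower = ccdf(x / math.sqrt(factor))
    upper = ccdf(x * math.sqrt(factor))
    if lower <= 0.0 or upper <= 0.0:
        raise ValueError("x is beyond the range of the data")
    return -(math.log(upper) - math.log(lower)) / math.log(factor)

rng = random.Random(0)
data = sorted(rng.paretovariate(1.5) for _ in range(100_000))  # assumed shape 1.5
index_at_4 = local_tail_index(data, 4.0)   # close to 1.5 for a pure Pareto sample
```

For a pure Pareto sample the estimate is roughly constant in x; a tail index that changes with location shows up as an estimate that varies as x moves.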

39
Flow Duration Distns log log CCDF
  • Tail index wobbles mostly between 1 and 2, and outside, too

40
Variable Tail Index
  • Enhanced Theory
  • Hernandez Campos, Marron, Samorodnitsky and Smith (2002) Variable Heavy Tailed Durations in Internet Traffic.
  • Sound bite version
  • For variable tail index
  • often between 1 and 2
  • still get LRD