1
Things about Trace Analysis

Wei-jen Hsu
In-class presentation for CIS6930
wjhsu_at_ufl.edu
(Advisor: Ahmed Helmy)

2
Objective

Provide more background knowledge related to trace-based studies
Details about the trace format: an intro for one of the assignments
Share experience in trace analysis

3
Why trace analysis?

Traces provide realism: they show how the system actually works
Verification of an established system
Diagnosis of system operation (identify faults)
Identifying design flaws
Large-scale properties (e.g. self-similar
traffic)
Understand how a new system works
Provide domain knowledge for analysis work
Verifying an idea

4
Typical Work Flow for Trace Analysis

Build the system
Identify point(s) of trace collection and the
methodology used
Obtain the data
Clean-up and sanity check
Analyze and post-process the data
Explain the results
Apply the results to further study or modify the
existing system

5
WLAN Traces Study

It started around 2000
WLAN was new; researchers wanted to understand how people used it (usage study)
Surveys vs. traces
Work by Tang and Baker ('00) and Kotz and Essien ('02) are pioneering examples
Statistics of usage (number of users, amount of traffic, etc.)

6
WLAN Traces Study

Mobility-related
MIT work (home location, prevalence, and
persistence)
UCSD (PDA users)
WLAN mobility models (INFOCOM05, T-model)
Other user properties
Handoff
Pause time distribution

7
Trace Format

For association traces
Usual format: (Node_id, start_time, location, end_time)
But there are various ways to get there:
Syslog: event-based
SNMP: polling
USC raw traces
Wireless association log (time, start/stop, switch port, MAC)
DHCP log (time, IP, MAC)
Traffic log
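As a rough illustration of getting from event-based logs to the (Node_id, start_time, location, end_time) format, here is a minimal Python sketch. It assumes the events are already parsed into tuples and sorted by time; the function name, the tuple layout, and the numeric times are hypothetical choices for illustration, not part of the USC tools.

```python
def build_sessions(events):
    """Pair Start/Stop events into (node_id, start_time, location, end_time).

    `events` is an iterable of (time, kind, location, node_id) tuples sorted
    by time, where kind is "Start" or "Stop". This layout is an assumption.
    """
    open_sessions = {}  # (node_id, location) -> start time of the open session
    sessions = []
    for time, kind, location, node_id in events:
        key = (node_id, location)
        if kind == "Start":
            # Keep the earliest Start if the log repeats it.
            open_sessions.setdefault(key, time)
        elif kind == "Stop" and key in open_sessions:
            sessions.append((node_id, open_sessions.pop(key), location, time))
        # A Stop without a matching Start is dropped here; in practice it
        # should be counted and inspected during the sanity check.
    return sessions


events = [
    (0, "Start", "ap1", "a"),
    (5, "Start", "ap1", "b"),   # never closed: no session is emitted
    (10, "Stop", "ap1", "a"),
    (12, "Stop", "ap2", "c"),   # Stop without Start: ignored
]
sessions = build_sessions(events)
```

Unmatched events are exactly the kind of exception in the data that the clean-up step has to decide how to handle.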

8
Trace Format Example

USC wireless association trace
(Time, Start/Stop, Switch_IP, Switch_port, MAC_of_node)
Mon Oct 10 011652 Start 172.16.8.245 31005 03065f9c0ae
Mon Oct 10 011700 Stop 172.16.8.245 21044 0e359964d1
Mon Oct 10 011702 Start 172.16.8.245 31015 01124dfc03a
USC DHCP trace
(Time, IP_of_node, MAC_of_node)
Jan 27 002119 207.151.229.50 018f310ea4c
Jan 27 002120 207.151.232.184 018de33792
Jan 27 002120 207.151.229.50 018f310ea4c
USC traffic trace
(Start_time, End_time, Destination_IP, Destination_port, Source_IP, Source_port, Protocol (TCP=6, UDP=17), ?, Packet_number, Data_size)
0127.235942.925 0127.235944.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
0127.235942.925 0127.235952.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
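A minimal Python sketch of reading one traffic-trace record, assuming each record is ten whitespace-separated fields in the order listed above. The class and field names are made up for illustration, the timestamps are kept as opaque strings because their exact format is not spelled out here, and the field marked "?" is stored without interpretation.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    start: str       # start timestamp, kept as the raw string
    end: str         # end timestamp, kept as the raw string
    dst_ip: str
    dst_port: int
    src_ip: str
    src_port: int
    protocol: int    # IP protocol number
    unknown: str     # the field marked "?" in the format description
    packets: int
    data_size: int


PROTOCOL_NAMES = {6: "TCP", 17: "UDP"}


def parse_flow(line: str) -> FlowRecord:
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return FlowRecord(
        fields[0], fields[1],
        fields[2], int(fields[3]),
        fields[4], int(fields[5]),
        int(fields[6]), fields[7],
        int(fields[8]), int(fields[9]),
    )


rec = parse_flow(
    "0127.235942.925 0127.235944.905 128.125.253.143 53 "
    "207.151.239.208 1795 17 0 3 1368"
)
```

Raising on a wrong field count is a deliberate choice: malformed lines should surface during the sanity check rather than be silently skipped.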

9
Work with the Trace

An exercise
Does the Encounter-Relationship graph change over time?
From WLAN traces, we find encounters to measure inter-node relationships

Note: is this a good assumption?
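One possible way to extract encounters, sketched in Python. It assumes an encounter means two different nodes associated with the same location during overlapping time intervals, and that sessions are already in (node_id, start_time, location, end_time) form; both the definition and the function name are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_encounters(sessions):
    """Return (node1, node2, location, begin, finish) for overlapping sessions."""
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))

    encounters = []
    for location, items in by_location.items():
        # Compare every pair of sessions seen at the same location.
        for (n1, s1, e1), (n2, s2, e2) in combinations(items, 2):
            begin, finish = max(s1, s2), min(e1, e2)
            if n1 != n2 and begin < finish:
                encounters.append((n1, n2, location, begin, finish))
    return encounters


sessions = [
    ("a", 0, "ap1", 10),
    ("b", 5, "ap1", 20),
    ("c", 12, "ap1", 15),
    ("d", 0, "ap2", 30),
]
encounters = find_encounters(sessions)
```

The pairwise comparison is quadratic per location; for a large trace, sorting sessions by start time and sweeping would be the usual refinement.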
10
Encounter distribution

How many other nodes does a node encounter?

Figure: Prob. (unique encounter fraction > x)
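A small Python sketch of how such a curve could be computed. It assumes the unique encounter fraction of a node is the number of distinct nodes it ever encounters divided by the number of other nodes in the population; that reading and the helper names are assumptions.

```python
def unique_encounter_fractions(encounter_pairs, population):
    """Map each node to the fraction of the other nodes it has encountered."""
    met = {node: set() for node in population}
    for a, b in encounter_pairs:
        met[a].add(b)
        met[b].add(a)
    others = len(met) - 1
    return {node: len(peers) / others for node, peers in met.items()}


def ccdf(values, x):
    """Empirical Prob(value > x)."""
    values = list(values)
    return sum(v > x for v in values) / len(values)


population = ["a", "b", "c", "d"]
pairs = [("a", "b"), ("b", "c")]
fractions = unique_encounter_fractions(pairs, population)
```

Evaluating `ccdf(fractions.values(), x)` over a range of x gives the points of the curve.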
11
Encounter-Relationship graph

Imagine a link connecting each pair of nodes that ever encounter each other
What does the graph look like?

But is the ER graph a connected graph? What are its
properties?
12
Encounter-Relationship graph

To our surprise, ER graphs are connected!!

Figure: disconnected ratio (%)
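A hedged Python sketch of checking connectivity on an ER graph. The exact definition of the disconnected ratio is not given here, so the code assumes it is the fraction of nodes outside the largest connected component; that definition and the function names are assumptions.

```python
from collections import deque


def build_graph(pairs, nodes=()):
    """Undirected adjacency sets from encounter pairs (plus isolated nodes)."""
    adj = {n: set() for n in nodes}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def largest_component_size(adj):
    seen = set()
    best = 0
    for source in adj:
        if source in seen:
            continue
        # Breadth-first search over one connected component.
        seen.add(source)
        queue = deque([source])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def disconnected_ratio(adj):
    """Assumed definition: share of nodes not in the largest component."""
    if not adj:
        return 0.0
    return 1 - largest_component_size(adj) / len(adj)


graph = build_graph([("a", "b"), ("b", "c"), ("d", "e")])
ratio = disconnected_ratio(graph)
```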
13
Encounter-Relationship graph

What are the graph properties of the relationship
graphs?

High clustering, as in a regular graph
Low path length, as in a random graph
14
Encounter-Relationship graph

Relationship graphs are small-world graphs
High clustering coefficient, low avg. path length

Figure: normalized clustering coefficient (CC) and path length (PL)
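For reference, a minimal Python sketch of the two metrics on an undirected graph stored as adjacency sets. It uses the standard local clustering coefficient averaged over nodes and the mean shortest-path length over reachable pairs; the normalization against regular and random graphs shown in the figure is not reproduced, and the function names are illustrative.

```python
from collections import deque


def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links)."""
    total = 0.0
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        nb = list(neighbors)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nb[j] in adj[nb[i]]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean hop count over all reachable ordered pairs (BFS from each node)."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# A triangle (a, b, c) with one extra node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc = clustering_coefficient(adj)
pl = average_path_length(adj)
```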
15
Work with the Trace

An exercise
Does the Encounter-Relationship graph change over time?
Chop the trace into multiple segments
Analyze the average clustering coefficient and
average path length of the resulting graphs
How to deal with changing population?
Does the encounter duration matter?
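A possible way to chop the trace, sketched in Python. It assumes sessions in (node_id, start_time, location, end_time) form with numeric times and fixed-length segments, and it clips sessions that cross a segment boundary; the function name and the clipping policy are assumptions.

```python
def split_sessions(sessions, trace_start, segment_length):
    """Group sessions by segment index, clipping them at segment boundaries."""
    segments = {}
    for node, start, location, end in sessions:
        first = int((start - trace_start) // segment_length)
        last = int((end - trace_start) // segment_length)
        for index in range(first, last + 1):
            seg_start = trace_start + index * segment_length
            seg_end = seg_start + segment_length
            clipped_start = max(start, seg_start)
            clipped_end = min(end, seg_end)
            if clipped_start < clipped_end:  # skip zero-length pieces
                segments.setdefault(index, []).append(
                    (node, clipped_start, location, clipped_end)
                )
    return segments


sessions = [("a", 5, "ap1", 25), ("b", 12, "ap1", 18)]
segments = split_sessions(sessions, trace_start=0, segment_length=10)
```

Each segment can then be turned into its own encounter graph. Nodes that appear in only some segments are one concrete form of the changing-population question above.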

16
Work with the Trace

Ask questions! What should you look for in the trace?
Its importance
Its implication
Its potential usage
Its alternative solutions
Apply new techniques to look into the data
Find/Create interesting data sets

17
Lessons Learned

You need a lot of patience and care
Exceptions in the data
Flaws in your assumption
You need a lot of hard-drive space too!
You need good questions
For each question there are multiple ways to come
up with an answer
New questions require new data sets and tools
You need to read a lot of papers

18
More Potential Directions

Mobility modeling/prediction
Data mining and clustering
Behavior-aware service/advertisements
Behavior-aware routing
Caveat: over-generalization from WLANs to
futuristic networks (such as DTNs)?
Re-examine assumptions in earlier work

19
Related Skills

General programming (C/C++)
Perl/shell script/awk
Matrix manipulation (MATLAB)
Statistics software (R)
http://www.r-project.org/
Clustering/Machine learning
Principal component analysis / Singular value
decomposition
http://www.cs.cmu.edu/~elaw/papers/pca.pdf
Data mining? Database analysis?

20
Good Online Resources

MobiLib
http://nile.cise.ufl.edu/MobiLib
Links to various traces; USC traces and some
processing tools for download
CRAWDAD
http://crawdad.cs.dartmouth.edu/
Various traces download, related papers

21
References

[Stanford] D. Tang and M. Baker, "Analysis of a Local-area Wireless Network"
[Stanford2] D. Tang and M. Baker, "Analysis of a Metropolitan-area Wireless Network"
[Dartmouth] D. Kotz and K. Essien, "Analysis of a Campus-wide Wireless Network"
[Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, "The Changing Usage of a Mature Campus-wide Wireless Network"
[MIT/IBM] M. Balazinska and P. Castro, "Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network"

22
References

[UCSD] M. McNett and G. Voelker, "Access and Mobility of Wireless PDA Users"
[UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, "Characterizing Flows in Large Wireless Data Networks"
[USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, "Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus"
[USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, "Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks"

23
References

[Dartmouth] M. Kim and D. Kotz, "Methodology for Classifying Mobile Users and Access Points"
[Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, "Evaluating location predictors with extensive Wi-Fi mobility data"
[SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, "Characterizing User Behavior and Network Performance in a Public Wireless LAN"
[INFOCOM05] C. Tuduce and T. Gross, "A Mobility Model Based on WLAN Traces and its Validation"
[T-model] D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, "Model T++: an empirical joint space-time registration model"
[T-model] R. Jain, D. Lelescu, and M. Balakrishnan, "Model T: an empirical model for user registration patterns in a campus wireless LAN"

24
More on Mobility Modeling
25
Mobility Observations from WLANs

Skewed location visiting preferences
Nodes spend 95% of their time at their top 5 preferred
locations.
Heavily visited preferred spots

Periodic reappearance
Nodes show up repeatedly at the same location
after integer multiples of days.
Periodic daily/weekly schedules

26
Mobility Observations from WLANs

Problems of simple random models (random walk,
random waypoint, random direction)
No preferred locations in the spatial domain (uniform
nodal distribution across space)
No structure in the time domain (homogeneous behavior
across time)
Nodes behave statistically identically to one
another
Benefit: mathematical tractability
Can we improve realism without sacrificing mathematical
tractability?

27
Time-variant Community Model

Skewed location visiting preferences
Create communities to be the preferred
destinations
Each node can have its own community
Periodic reappearance
Create structure in time: periods
Nodes move with different parameters in different periods
Repetitive structure

28
Time-variant Community Model

Major trends of mobility characteristics
preserved (extensions later)
In addition, mathematical tractability is retained

29
More on Matrix-based Analysis
30
Introduction

Widespread WLAN deployments create large-scale
infrastructures.
Large numbers of users lead to large-scale
management and design issues.
We need methods to quantify, summarize, and
compare long-run trends (on the order of months)
of individual user associations
Usage model / association model
Personalized services
Behavior aware ads / monetization
Behavior-aware routing protocols

31
Questions

Q1. How to quantify user association consistency?
(Challenge) What is a proper representation of
user association, and how do we measure
consistency?
Q2. How do we summarize long-run user association
patterns?
(Challenge) How to utilize existing data
reduction techniques?
Q3. How to group users with similar association
patterns?
(Challenge) How to quantify the similarity of
user association patterns?
How to reduce computational complexity?
Contribution: generic methods to address these
questions, empirically validated using USC and
Dartmouth WLAN traces.

32
Representation of User Association Patterns

We represent the summary of a user's
association in each day by a single vector.
For a given day d, the association vector of user i is
an n-element vector $a = (a_1, \dots, a_n)$, where $a_j$ is the
fraction of online time user i spends at
$AP_j$ on day d.
The elements of a vector sum to 1.
Use the zero vector for off-line users.
The elements of the vector quantify the relative
importance (or attraction) of each AP to the user.

Example: association vector over (library, office, class) =
(0.2, 0.4, 0.4)
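A minimal Python sketch of building one daily association vector. It assumes the user's sessions for the day are given as (ap, start, end) tuples with numeric times and that the list of APs is known in advance; the function name and input layout are hypothetical.

```python
def association_vector(day_sessions, aps):
    """Fraction of the user's online time spent at each AP on one day."""
    online = {ap: 0.0 for ap in aps}
    for ap, start, end in day_sessions:
        online[ap] += end - start
    total = sum(online.values())
    if total == 0:
        return [0.0] * len(aps)  # zero vector for a user who was off-line
    return [online[ap] / total for ap in aps]


aps = ["library", "office", "class"]
day_sessions = [("library", 0, 1), ("office", 1, 3), ("class", 3, 5)]
vector = association_vector(day_sessions, aps)  # roughly (0.2, 0.4, 0.4)
```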
33
Q1. User Association Consistency

User i is consistent if its daily association
vectors can be grouped into a few clusters (e.g.,
fewer than 10% of the number of days).
Evaluation: use hierarchical clustering with the
Manhattan (L1) distance measure
The distance between two vectors is at most 2.
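The bound of 2 follows from the vectors being non-negative with elements summing to 1. As a short worked step, writing two association vectors as a and b (symbols chosen here for illustration):

```latex
d(a, b) = \sum_{j=1}^{n} |a_j - b_j|
        \le \sum_{j=1}^{n} a_j + \sum_{j=1}^{n} b_j
        = 1 + 1 = 2
```

Equality holds when the two vectors share no AP, that is, when $a_j b_j = 0$ for every j. The distance between an online vector and the zero vector of an off-line day is 1.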

34
Q1. User Association Consistency

Hierarchical Clustering
Start: each vector is a single-member cluster.
Recursion: the two closest clusters are merged.
End: stop when the remaining clusters have distances
larger than a threshold.
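The procedure above can be sketched in a few lines of Python. This version assumes complete-link distance between clusters (the distance between their furthest members) with the Manhattan distance between vectors; it is a straightforward illustration with hypothetical function names, not an optimized implementation.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_link(c1, c2):
    """Cluster distance = distance between the furthest pair of members."""
    return max(manhattan(a, b) for a in c1 for b in c2)


def hierarchical_clusters(vectors, threshold):
    clusters = [[v] for v in vectors]  # start: one cluster per vector
    while len(clusters) > 1:
        # Find the two closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # end: all remaining clusters are further apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


days = [(1, 0, 0), (0.9, 0.1, 0), (0, 0, 1), (0, 0.1, 0.9)]
modes = hierarchical_clusters(days, threshold=0.9)
```

In this toy input the four daily vectors collapse into two behavior modes. The number of clusters relative to the number of days is the consistency measure described earlier.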

35
Q1. User Association Consistency
Figure: distribution of the number of clusters under
cut-off threshold 0.9
80% of users show at most 9 clusters of behavior
modes during the 94-day trace
Complete link: distance between clusters =
distance between the furthest components in the
considered clusters
Observation: many users are multimodal, but with
far fewer association modes than the total number
of days in the trace period.
36
Q2. Summarizing user associations

Association matrix: concatenate the user's association
vectors for all days into a matrix.
To summarize, perform SVD and store the top-k
singular values/vectors.
What value of k do we need for a good
representation of the matrix?
Captured matrix power
How much is the reconstruction error?
Matrix norms: $\|X - X_k\|_p / \|X\|_p$, where $X_k$ is the rank-k reconstruction of $X$
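To make the two quantities concrete, here is a short worked form in terms of the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$ of the association matrix X. It assumes "captured power" means the share of the squared Frobenius norm retained by the rank-k reconstruction; that reading is an assumption, while the norm identities are standard properties of the SVD.

```latex
X = U \Sigma V^{T}, \qquad
X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T}

\text{captured power}(k) =
  \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sum_{i=1}^{r} \sigma_i^{2}}

\frac{\|X - X_k\|_2}{\|X\|_2} = \frac{\sigma_{k+1}}{\sigma_1},
\qquad
\frac{\|X - X_k\|_F}{\|X\|_F} =
  \sqrt{\frac{\sum_{i=k+1}^{r} \sigma_i^{2}}{\sum_{i=1}^{r} \sigma_i^{2}}}
```

Under this reading, the squared Frobenius-norm error equals one minus the captured power, so the two questions on the slide are two views of the same singular-value spectrum.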

37
Q2. Summarizing user associations
Only the top 6 singular vectors are needed to capture
at least 90% of the power for more than 95% of
association matrices
Reconstruction error of the low-rank
approximation is low (5 singular vectors give
error < 0.05)
Observation: although users are multimodal, a
few major modes dominate their behavior
40
Q3. Similarity Metrics between Users

Naive method to compare similarity between users i
and j
Intuition: for every daily association vector of
i, if there is a similar association vector for
j, then (i, j) have similar behavior.
For user i, pick the association vector $a_i^d$ of user
i on day d.
Find the association vector of user j, denoted by
$a_j^{d'}$, that is nearest to $a_i^d$
Find the average of $|a_j^{d'} - a_i^d|$ over all days d.
Drawback: expensive
$O(n d^2)$ for each pair
Lots of file reads for large datasets, since it reads raw
data
Need a faster method that reads summaries
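A minimal Python sketch of the naive comparison, assuming each user is given as a list of daily association vectors and that vector distance is the Manhattan distance used earlier; the function names are illustrative.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def naive_distance(days_i, days_j):
    """Average, over i's days, of the distance to j's nearest daily vector."""
    total = 0.0
    for a in days_i:
        total += min(manhattan(a, b) for b in days_j)
    return total / len(days_i)


user_i = [(1, 0), (0, 1)]
user_j = [(1, 0)]
d_ij = naive_distance(user_i, user_j)  # i's second day has no match in j
d_ji = naive_distance(user_j, user_i)  # every day of j has a match in i
```

The nested loop over days is where the quadratic cost in the number of days comes from. Note also that the measure is not symmetric as written, so a pairwise comparison may need to evaluate both directions.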

41
Q3. Similarity Metrics between Users

Compare the similarity of the eigenvectors
obtained from SVD.
Similarity between users is determined by weighted
inner products of their eigenvectors.
$w_i$: proportion of power captured by singular vector i
$D(U,V) = 1 - Sim(U,V)$
Are the two metrics similar?
0.911 correlation coefficient for the studied users.
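The similarity formula itself did not survive in this transcript. One reading consistent with "weighted inner products" is to sum, over all pairs of singular vectors from the two users, the absolute inner product scaled by both weights. The Python sketch below follows that assumed form; it should not be taken as the exact definition used in the study.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity(vectors_u, weights_u, vectors_v, weights_v):
    """Assumed form: sum_i sum_j w_ui * w_vj * |u_i . v_j|."""
    return sum(
        wu * wv * abs(dot(u, v))
        for u, wu in zip(vectors_u, weights_u)
        for v, wv in zip(vectors_v, weights_v)
    )


def distance(vectors_u, weights_u, vectors_v, weights_v):
    return 1 - similarity(vectors_u, weights_u, vectors_v, weights_v)


# Unit-length vectors with weights that sum to 1 for each user.
same = distance([(1, 0)], [1.0], [(1, 0)], [1.0])
orthogonal = distance([(1, 0)], [1.0], [(0, 1)], [1.0])
mixed = similarity([(1, 0), (0, 1)], [0.8, 0.2], [(1, 0), (0, 1)], [0.8, 0.2])
```

Only the stored top-k vectors and weights are read, which is what makes this cheaper than comparing raw daily vectors. Under this assumed form a user with several comparable modes does not get similarity 1 with itself, as the `mixed` case shows.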

42
Q3. Similarity Metrics between Users

Are we able to get clusters of similar users?
Compare the PDF/CDF for inter-cluster and intra-cluster
users (example: 200 clusters).

43
Q3. Similarity Metrics between Users

Take users in the same cluster, concatenate
their association matrices, perform SVD, and find the
power captured by the top-k eigenvectors.
Also take random users, concatenate the
eigenvectors, and do the same.
There is a clear distinction between the two
clustering methods.

Straightforward: similarity decided based
on pair-wise comparison of association vectors
Feature-based: similarity decided based
on singular vectors
44
Q3. Similarity Metrics between Users

For all clusters, use a scatter plot to show the
power captured by top-4 eigenvectors.
(distance-based cluster vs random cluster)
