Title: Modeling Distances in Large-Scale Networks by Matrix Factorization


1
Modeling Distances in Large-Scale Networks by
Matrix Factorization
  • Yun Mao
  • CIS dept, University of Pennsylvania
  • October, 2004
  • Joint work with Prof. Lawrence Saul

2
Motivation
  • Network distances are useful
      • Content distribution networks
      • Peer-to-peer networks
      • Overlay routing, multicast
      • Network games
  • Measurement comes with a cost
      • time, bandwidth, ...
  • Is there a map or something similar to estimate
    distances without measurement?

3
Sorry, no global routing maps (yet)
4
Our goal
Node1 location1
Node2 location2
Node3 location3
5
Outline
  • What has been done?
  • Matrix factorization model
  • Internet Distance Estimation Service
  • Evaluation
  • Conclusions

6
What is network distance?
  • Round-trip latency
      • Symmetric
      • Relatively stable
      • Triangle inequality violation
  • Bandwidth, loss rate
      • Not really distance, but useful
      • Asymmetric
  • This talk: RTT (unless specified)

7
State of the art
  • Euclidean embedding
      • Each host has a coordinate
      • Network distances are estimated as Euclidean
        distances
  • Known systems
      • GNP (IMW01, INFOCOM02): simplex downhill
      • ICS, Virtual Landmark (IMC03): Lipschitz + PCA
      • Vivaldi (HotNets03, SIGCOMM04): spring energy
        simulation, height extension
      • Many others...

8
Shared limitations
  • Triangle inequality violations in RTT metrics
  • Symmetric constraint
  • Not suitable for complicated metrics
  • Even if the triangle inequality and symmetry
    properties hold, increasing dimensionality
    doesn't improve accuracy in many cases

9
Simple topologies that don't have exact embeddings
[Figure: four hosts H1-H4 connected by links of length 1, shown next to
one possible 2-D embedding that places the hosts at (0.5, 0.5),
(-0.5, 0.5), (0.5, -0.5), and (-0.5, -0.5) around the origin (0, 0).
A further panel shows another tree topology example.]
The estimated distance between H1 and H4 is 1.414, while the real
distance is 2.0. Extra dimensions don't help.
Is there a better model?
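The mismatch on this slide can be checked numerically. The sketch below assumes the four hosts form a ring of unit-length links, so that H1 and H4 are two hops apart, and it assigns the slide's four coordinates to hosts in one plausible order. Both the ring reading and the host-to-coordinate assignment are assumptions made for illustration.

```python
import numpy as np

# Assumed (hypothetical) assignment of the slide's coordinates to hosts.
coords = {
    "H1": np.array([-0.5, 0.5]),
    "H2": np.array([0.5, 0.5]),
    "H3": np.array([-0.5, -0.5]),
    "H4": np.array([0.5, -0.5]),
}

def euclidean_estimate(a, b):
    """Distance predicted by the 2-D embedding for hosts a and b."""
    return float(np.linalg.norm(coords[a] - coords[b]))

adjacent = euclidean_estimate("H1", "H2")  # neighbours on the assumed ring
diagonal = euclidean_estimate("H1", "H4")  # opposite corners, real distance 2.0

print(f"H1-H2 estimate: {adjacent:.3f}")
print(f"H1-H4 estimate: {diagonal:.3f} (real distance 2.0)")
```

The neighbouring pair is reproduced exactly, while the opposite pair comes out at the square's diagonal rather than the two-hop distance.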
10
An algebraic perspective
Distance matrix: D_ij is the distance from host i to host j.
Each host i has two vectors, X_i and Y_i. The distance function is
the dot product: D_ij is estimated as X_i · Y_j.
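A minimal sketch of this model, assuming randomly generated vectors purely for illustration (the host count, dimension, and values are hypothetical): each host i carries an outgoing vector X[i] and an incoming vector Y[i], and the estimate from i to j is their dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hosts, d = 5, 3  # illustrative sizes, not taken from the talk

X = rng.random((n_hosts, d))  # row i: outgoing vector of host i
Y = rng.random((n_hosts, d))  # row j: incoming vector of host j

def estimate(i, j):
    """Estimated distance from host i to host j: X_i . Y_j."""
    return float(X[i] @ Y[j])

# All pairwise estimates at once: D_hat = X Y^T.
D_hat = X @ Y.T

print(estimate(0, 1), estimate(1, 0))
```

Because the two directions use different vectors, estimate(0, 1) and estimate(1, 0) generally differ, so the model is not forced to be symmetric.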
11
Comparison
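  • Euclidean embedding: each host has a coordinate;
    the distance is the Euclidean distance
  • Matrix factorization: each host has an outgoing vector
    and an incoming vector; the distance is the dot product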
12
Questions
  • How to factorize a matrix (D) into two (much)
    smaller matrices (X,Y)?
  • Accuracy? Dimensionality d?
  • How to build a system to assign outgoing/incoming
    vectors to Internet-scale networks?

13
Algorithm 1: Singular Value Decomposition (SVD)
Simple, and can find a global minimum of the squared approximation
error between D and its rank-d factorization
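One way to obtain outgoing and incoming vectors from an SVD is sketched below. It keeps the top d singular values and splits their square roots between the two factors; that split is a common convention and is an assumption here, as is the test matrix, which encodes a four-host ring with unit links where opposite hosts are 2.0 apart.

```python
import numpy as np

def svd_factorize(D, d):
    """Return X, Y (one row per host, d columns) with X @ Y.T close to D."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    root = np.sqrt(s[:d])          # split each singular value evenly
    X = U[:, :d] * root            # outgoing vectors
    Y = Vt[:d, :].T * root         # incoming vectors
    return X, Y

# Assumed distances for a four-host ring (H1..H4), unit-length links.
D = np.array([
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
])

X, Y = svd_factorize(D, d=3)
max_err = float(np.abs(D - X @ Y.T).max())
print(f"largest absolute error with d=3: {max_err:.2e}")
```

This particular matrix has rank 3 (rows 1 and 4 sum to the same vector as rows 2 and 3), so three dimensions reproduce every entry up to floating-point error, including the 2.0 between opposite hosts.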
14
Algorithm 2: Nonnegative Matrix Factorization (NMF)
  • Can handle distance matrices with missing elements
    using our simple extension.
  • Non-negative constraint
  • Iterative method converges to local minima
    quickly
  • See paper for details.
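As a rough illustration of how missing entries can be handled, the sketch below uses generic multiplicative updates weighted by a 0/1 mask M, so that unmeasured entries contribute nothing to the squared error. This is a standard masked-NMF variant written under that assumption; it is not necessarily the exact extension described in the paper, and the function name, iteration count, and test matrix are hypothetical.

```python
import numpy as np

def masked_nmf(D, M, d, iters=200, eps=1e-9, seed=0):
    """Factor D ~ X @ Y.T with X, Y >= 0, ignoring entries where M == 0."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    X = rng.random((n, d)) + 0.1   # strictly positive start
    Y = rng.random((m, d)) + 0.1
    MD = M * D
    errors = []
    for _ in range(iters):
        R = M * (X @ Y.T)                       # masked reconstruction
        X *= (MD @ Y) / (R @ Y + eps)           # update outgoing vectors
        R = M * (X @ Y.T)
        Y *= (MD.T @ X) / (R.T @ X + eps)       # update incoming vectors
        errors.append(float(np.sum((M * (D - X @ Y.T)) ** 2)))
    return X, Y, errors

# Assumed four-host ring distances; pretend the H1-H4 pair was never measured.
D = np.array([
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
])
M = np.ones_like(D)
M[0, 3] = M[3, 0] = 0.0

X, Y, errors = masked_nmf(D, M, d=3)
print(f"masked squared error: first {errors[0]:.4f}, last {errors[-1]:.4f}")
```

The updates only rescale positive entries, so the factors stay non-negative, and the result depends on the random start because the method converges to a local minimum.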

15
Do they work?
  • Data sets
      • NLANR AMP: 100x100. Hosts were mostly in the US;
        distance is the minimum RTT ping time.
      • P2PSim: 1000x1000. Hosts were obtained from an
        Internet-scale Gnutella trace. Distances were
        collected by the King method (IMW02).
  • Comparison
      • SVD and NMF, based on the matrix factorization model
      • Lipschitz + PCA, based on the Euclidean embedding
        model (used in ICS and VL)
  • Error function

16
Accuracy comparison
17
So far
  • Matrix factorization model seems promising
  • Accurate, flexible
  • How to build a scalable system on this model?

18
IDES: Internet Distance Estimation Service
[Figure: two new hosts with unknown vectors, (Xnew, Ynew) and (Xnew2, Ynew2).]
The distances are not measured, but can be predicted
19
Vectors of ordinary hosts
  • We hope the new host's vectors reproduce its measured
    distances to and from each landmark
  • So, minimize the least-squares error

(for all landmarks i)
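The least-squares step can be sketched as two small linear problems, assuming the landmarks already have vectors and the new host has measured its distances to and from each of them. The names fit_new_host, d_out, and d_in are hypothetical, and the data is synthetic so that the recovered vectors can be checked.

```python
import numpy as np

def fit_new_host(X_land, Y_land, d_out, d_in):
    """Least-squares vectors for a new host.

    d_out[i]: measured distance from the new host to landmark i
    d_in[i]:  measured distance from landmark i to the new host
    """
    # Want x_new . Y_i ~ d_out[i] for all landmarks i.
    x_new, *_ = np.linalg.lstsq(Y_land, d_out, rcond=None)
    # Want X_i . y_new ~ d_in[i] for all landmarks i.
    y_new, *_ = np.linalg.lstsq(X_land, d_in, rcond=None)
    return x_new, y_new

# Synthetic check: 6 landmarks, 3 dimensions, noise-free measurements.
rng = np.random.default_rng(1)
m, d = 6, 3
X_land = rng.random((m, d))
Y_land = rng.random((m, d))
x_true, y_true = rng.random(d), rng.random(d)
d_out = Y_land @ x_true
d_in = X_land @ y_true

x_new, y_new = fit_new_host(X_land, Y_land, d_out, d_in)
print(np.round(x_new - x_true, 6), np.round(y_new - y_true, 6))
```

Once two ordinary hosts a and b have vectors, their distance can be predicted as x_a · y_b without measuring it directly.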
20
Practical concerns
  • What if some landmarks fail?
  • How to reduce the load on the landmarks?
  • How many dimensions?
  • How scalable and robust is IDES?
  • Answered in our paper

21
Evaluation: Efficiency and accuracy
  • Datasets
      • Landmarks were selected randomly
  • Comparison
      • IDES/SVD and IDES/NMF
      • Euclidean embedding: GNP, ICS

22
Efficiency
Environment: Pentium 4 3.2 GHz CPU, 2 GB memory.
GNP is obtained from the original release written in C.
IDES and ICS are implemented in Matlab.
23
Accuracy
P2PSim dataset (d = 8)
AMP dataset (d = 8)
24
Conclusions
  • A new model based on matrix factorization
  • Simple to implement
  • No constraints on network distances
  • Two algorithms: SVD, NMF
  • Internet Distance Estimation Service
  • Efficient, accurate, robust.

25
Thank you! Questions?