Title: Toward Validation and Control of Network Models
1. Toward Validation and Control of Network Models
- Michael Mitzenmacher
- Harvard University
2. Internet Mathematics
- Articles related to this talk:
  - The Future of Power Law Research
  - A Brief History of Generative Models for Power Law and Lognormal Distributions
3. Motivation: General
- Network Science and Engineering is emerging as its own (sub)field.
  - NSF cross-cutting area starting this year.
- Courses: Cornell (Easley/Kleinberg), Kearns (U Penn), many others.
  - For undergrads, not just grads!
- In popular culture: books like Linked by Barabasi or Six Degrees by Watts.
- Other sciences: economics, biology, physics, ecology, linguistics, etc.
- What has been, and what should be, the research agenda?
4. My (Biased) View
- The 5 stages of networking research:
  - Observe: gather data to demonstrate a behavior in a system. (Example: power law behavior.)
  - Interpret: explain the importance of this observation in the system context.
  - Model: propose an underlying model for the observed behavior of the system.
  - Validate: find data to validate (and if necessary specialize or modify) the model.
  - Control: design ways to control and modify the underlying behavior of the system based on the model.
5. My (Biased) View
- In networks, we have spent a lot of time observing and interpreting behaviors.
- We are currently very active in modeling.
  - Many, many possible models.
  - Perhaps the easiest kind of paper to write.
- We now need to put much more focus on validation and control.
  - We have been moving in this direction.
  - And these are specific areas where computer science has much to contribute!
6. Models
- After observation, the natural step is to explain/model the behavior.
- Outcome: lots of modeling papers.
  - And many models rediscovered.
- Example: power laws.
  - Lots of history.
7. History
- In the 1990s, the abundance of observed power laws in networks surprised the community.
- Perhaps it shouldn't have: power laws appear frequently throughout the sciences.
  - Pareto: income distribution, 1897.
  - Zipf-Auerbach: city sizes, 1913/1940s.
  - Zipf-Estoup: word frequency, 1916/1940s.
  - Lotka: bibliometrics, 1926.
  - Yule: species and genera, 1924.
  - Mandelbrot: economics/information theory, 1950s.
- Observation/interpretation were, and are, key to initial understanding.
- My claim: but by now, the mere existence of power laws should not be surprising, or necessarily even noteworthy.
- My (biased) opinion: the bar should now be very high for observation/interpretation.
8. So Many Models
- Preferential Attachment
- Optimization (HOT)
- Monkeys typing randomly (scaling)
- Multiplicative processes
- Kronecker graphs
- Forest fire model (densification)
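Of these, preferential attachment is the easiest to sketch in code. Below is a minimal illustrative simulation, not any particular paper's exact variant (the function name and parameters are ours): each new node attaches m edges to existing nodes chosen with probability proportional to current degree, which produces a heavy-tailed degree distribution.

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow a graph where each new node attaches m edges, choosing
    endpoints with probability proportional to current degree."""
    rng = random.Random(seed)
    # Start from a small triangle so early nodes have nonzero degree.
    edges = [(0, 1), (0, 2), (1, 2)]
    # Each node appears once per incident edge; sampling uniformly
    # from this list is exactly degree-proportional sampling.
    targets = [u for e in edges for u in e]
    for new in range(3, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.extend([new, t])
    return edges

edges = preferential_attachment(1000)
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1
# Heavy tail: a few nodes accumulate far more links than the minimum m.
print(max(deg.values()))
```

Plotting the degree counts of such a run on log-log axes gives the familiar straight-line signature of a power law.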
9. What Makes a Good Model?
- New variations come up all of the time.
- Question: what makes a new network model sufficiently interesting to merit attention and/or publication?
  - A strong connection to an observed process.
    - Many models claim this, but few demonstrate it convincingly.
  - From a theory perspective: significant new mathematical insight or sophistication.
    - A matter of taste?
- My (biased) opinion: the bar should start being raised on model papers.
10. Validation: The Current Stage
- We now have so many models.
- Knowing the right model is important for extrapolating and controlling future behavior.
- Given a proposed underlying model, we need tools to help us validate it.
- We appear to be entering the validation stage of research. BUT the first steps have focused on invalidation rather than validation.
11. Examples: Invalidation
- Lakhina, Byers, Crovella, Xie:
  - Show that the observed power law of Internet topology might be due to biases in traceroute sampling.
- Pedarsani, Figueiredo, Grossglauser:
  - Show that densification may also arise from sampling approaches, and is not necessarily intrinsic to the network.
- Chen, Chang, Govindan, Jamin, Shenker, Willinger:
  - Show that Internet topology has characteristics that do not match preferential-attachment graphs.
  - Suggest an alternative mechanism.
  - But does this alternative match all characteristics, or are we still missing some?
12. My (Biased) View
- Invalidation is an important part of the process! BUT it is inherently different from validating a model.
- Validating seems much harder.
  - Indeed, it is arguable what even constitutes a validation.
- Question: what should it mean to say "this model is consistent with observed data"?
13. An Alternative View
- There is no right model.
- A model is the best until some other model comes along and proves better.
  - Greedy refinement via invalidation in model space.
- Statistical techniques: compare likelihood ratios for various models.
- My (biased) opinion: this is one useful approach, but not the end of the question.
  - We need methods other than comparison for confirming the validity of a model.
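As an illustration of the likelihood-ratio idea, the sketch below fits two candidate models to the same synthetic sample using standard maximum-likelihood formulas (the sample and function names are ours, purely for illustration) and compares their log-likelihoods.

```python
import math
import random

def loglik_power_law(xs, xmin=1.0):
    """MLE fit of a continuous power law
    p(x) = ((alpha-1)/xmin) * (x/xmin)^-alpha for x >= xmin.
    Returns (alpha_hat, log-likelihood)."""
    n = len(xs)
    s = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / s
    return alpha, n * math.log(alpha - 1) - n * math.log(xmin) - alpha * s

def loglik_exponential(xs, xmin=1.0):
    """MLE fit of a shifted exponential p(x) = lam * exp(-lam*(x - xmin))."""
    n = len(xs)
    lam = n / sum(x - xmin for x in xs)
    return lam, n * math.log(lam) - n

# Synthetic sample from a true power law (alpha = 2.5), via inverse-CDF sampling.
rng = random.Random(42)
data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(2000)]

alpha, ll_pl = loglik_power_law(data)
lam, ll_exp = loglik_exponential(data)
# A positive log-likelihood ratio favors the power-law model on this sample.
print(alpha, ll_pl - ll_exp)
```

On data actually drawn from a power law, the ratio strongly favors the power-law fit; on real traces the comparison is rarely this clean, which is exactly the point of the slide.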
14. Time-Series/Trace Analysis
- Many models posit some sort of actions:
  - New pages linking to pages on the Web.
  - New routers joining the network.
  - New files appearing in a file system.
- A validation approach: gather traces and see if the traces suitably match the model.
  - Trace gathering can be a challenging systems problem.
  - Checking the model match requires appropriate statistical techniques and tests.
- May lead to new, improved, better-justified models.
15. Sampling and Trace Analysis
- Often, we cannot record all actions.
  - The Internet is too big!
- Sampling:
  - Global: snapshots of the entire system at various times.
  - Local: record the actions of a sample of agents in the system.
- Examples:
  - Snapshots of file systems: full systems vs. actions of individual users.
  - Router topology: Internet maps vs. changes at a subset of routers.
- Question: how much, and what kind of, sampling is sufficient to validate a model appropriately?
  - Does this differ among models?
16. To Control
- In many systems, intervention can impact the outcome.
  - Maybe not for earthquakes, but certainly for computer networks!
- Typical setting: individual agents acting in their own selfish interest. Agents can be given incentives to change behavior.
- General problem: given a good model, determine how to change system behavior to optimize a global performance function.
  - Distributed algorithmic mechanism design.
  - A mix of economics/game theory and computer science.
17. Possible Control Approaches
- Adding constraints, local or global:
  - Example: total space in a file system.
  - Example: preferential attachment, but with links limited by an underlying metric.
- Adding incentives or costs:
  - Example: charges for exceeding soft disk quotas.
  - Example: payments for certain AS-level connections.
- Limiting information:
  - Impact decisions by not letting everyone have a true view of the system.
18. My Related Work: Hash Algorithms
- On the Internet, we need a measurement and monitoring infrastructure for validation and control.
  - Approximate is fine; speed is key.
  - Must be general and multi-purpose.
  - Must allow data aggregation.
- Solution: a hash-based architecture.
- Eventual goal: every router has a programmable hash engine.
19. Vision
- A three-pronged research agenda:
  - Low: efficient hardware implementations of relevant algorithms and data structures.
  - Medium: new and improved data structures and algorithms for old and new applications.
  - High: distributed infrastructure supporting monitoring and measurement schemes.
20. The High-Level Pitch
- Lots of hash-based schemes are being designed for approximate measurement/monitoring tasks.
  - But they are not built into the system to begin with.
- We want a flexible router architecture that allows:
  - New methods to be easily added.
  - Distributed cooperation using such schemes.
21. What We Need
[Diagram of the required components:]
- Memory: on-chip memory, off-chip memory, CAM(s).
- Computation: hashing computation unit, unit for other computation, programming language.
- Control and communication: control system, communication architecture, communication control.
22. Lots of Design Questions
- How much space for the various memory levels? How should memory be divided dynamically among competing applications?
- Which hash functions should be included? How open should the design be to new hash functions?
- What programming language and functionality?
- What communication infrastructure?
- Security?
- And so on...
23. Which Hash Functions?
- Theorists:
  - Want analyzable hash functions.
  - Dislike the standard assumption of perfectly random hash functions.
  - Find it hard to prove things about actual performance.
- Practitioners:
  - Want easy implementation, speed, and small space.
  - Want simple, back-of-the-envelope analysis.
  - Will accept simulated results under the right settings.
24. Why Do Weak Hash Functions Work So Well?
- In practice, assuming perfectly random hash functions seems to be the right thing to do.
  - Easier to analyze.
  - Real systems almost always behave that way, even with weak hash functions!
- Can theory explain the strong performance of weak hash functions?
25. Recent Work
- A new explanation (joint work with Salil Vadhan):
  - Choosing a hash function from a pairwise independent family is enough if the data has sufficient entropy.
  - The randomness of the hash function and of the data combine.
  - The behavior then matches a truly random hash function with high probability.
- Techniques based on the theory of randomness extraction.
  - Extensions of the Leftover Hash Lemma.
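A standard pairwise independent family, shown for illustration (a textbook construction, not necessarily the one analyzed in the paper), is the affine family h(x) = ((ax + b) mod p) mod m with p prime and a, b drawn at random. The sketch below checks the defining property empirically: over random draws of the hash function, a fixed pair of distinct keys collides with probability roughly 1/m.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, larger than any key used below

def make_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from the classic
    pairwise independent affine family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

# Pairwise independence in action: any fixed pair of distinct keys
# collides with probability close to 1/m over the choice of function.
rng = random.Random(1)
m = 64
collisions = 0
for _ in range(2000):
    h = make_hash(m, rng)
    if h(12345) == h(67890):
        collisions += 1
print(collisions / 2000)  # should be near 1/64
```

The Mitzenmacher-Vadhan result says that when the keys themselves carry enough entropy, even a single function drawn from such a weak family behaves, with high probability, like a truly random one.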
26. What Functionality?
- Hash tables should be a basic primitive.
- Among the best hash tables: cuckoo hashing.
  - Worst-case constant lookup time.
  - Simple to design and build.
- How can we make them even better?
  - Move cuckoo hashing from theory to practice!
27. Cuckoo Hashing [Pagh, Rodler]
- Basic scheme: each element gets two possible locations.
- To insert x, check both locations for x. If one is empty, insert.
- If both are full, x kicks out an old element y. Then y moves to its other location.
- If that location is full, y kicks out z, and so on, until an empty slot is found.
28-33. Cuckoo Hashing: Examples
[Diagrams: a sequence of insertions of elements A through G, showing how each inserted element kicks occupants out to their alternate locations until every element finds an empty slot.]
34. Cuckoo Hashing: Failures
- Bad case 1: the inserted element runs into a cycle.
- Bad case 2: the inserted element has a very long path before insertion completes.
  - It could be on a long cycle.
- Bad cases occur with small probability when the load is sufficiently low, but not low enough.
- Theoretical solution: re-hash everything if a failure occurs.
- For 2 choices and load less than 50%, n elements give a failure rate of Θ(1/n) and maximum insert time O(log n).
- More choices, or more elements per bucket, give better space utilization and failure rates.
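The whole scheme, including the rehash-everything fallback, fits in a few dozen lines. A minimal illustrative sketch (class and parameter names are ours, and Python's built-in hash stands in for real hash functions):

```python
import random

class CuckooHashTable:
    """Minimal cuckoo hashing sketch: two tables, two hash functions;
    every key lives in one of its two candidate slots, so a lookup
    probes at most two locations."""
    MAX_KICKS = 32  # give up and rehash after this many displacements

    def __init__(self, size=64):
        self.size = size
        self._rng = random.Random(0)
        self._reseed()

    def _reseed(self):
        """Pick fresh hash functions and empty both tables."""
        self.seeds = [self._rng.randrange(1 << 30) for _ in range(2)]
        self.tables = [[None] * self.size, [None] * self.size]

    def _slot(self, i, key):
        return hash((self.seeds[i], key)) % self.size

    def lookup(self, key):
        return any(self.tables[i][self._slot(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return
        i = 0
        for _ in range(self.MAX_KICKS):
            slot = self._slot(i, key)
            if self.tables[i][slot] is None:
                self.tables[i][slot] = key
                return
            # Slot occupied: kick out the resident and try to place it
            # in its other table, cuckoo style.
            key, self.tables[i][slot] = self.tables[i][slot], key
            i = 1 - i
        # Insertion path too long (likely a cycle): rehash everything.
        items = [k for t in self.tables for k in t if k is not None]
        self._reseed()
        for k in items + [key]:
            self.insert(k)
```

At loads well under 50% the kick chains stay short and the rehash branch is rarely taken, matching the Θ(1/n) failure rate quoted above.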
35. Recent Work: A CAM-Stash
- Use a CAM (Content-Addressable Memory) to stash away elements that would cause a failure.
  - Joint work with Kirsch and Wieder.
- Intuition: if failures were independent, the probability that s elements cause failures would go to Θ(1/n^s).
  - Failures are not independent, but nearly so.
- A stash holding a constant number of elements greatly reduces the failure probability.
- Implemented as a CAM in hardware, or a cache line in hardware/software.
  - Lookups must also check the stash.
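A software analogue of the stash idea can be sketched as follows (illustrative only, not the implementation from the paper; names and constants are ours): insertions that exhaust their kick budget are parked in a small side list instead of forcing a full rehash, and every lookup scans that list as well.

```python
import random

class CuckooWithStash:
    """Cuckoo hashing plus a small constant-size stash (the software
    stand-in for a CAM): failed insertions go to the stash instead of
    triggering a rehash."""
    MAX_KICKS = 16
    STASH_SIZE = 4  # a small constant sharply reduces failures

    def __init__(self, size=64, seed=7):
        rng = random.Random(seed)
        self.size = size
        self.seeds = [rng.randrange(1 << 30) for _ in range(2)]
        self.tables = [[None] * size, [None] * size]
        self.stash = []

    def _slot(self, i, key):
        return hash((self.seeds[i], key)) % self.size

    def lookup(self, key):
        # Every lookup also scans the (tiny) stash.
        return key in self.stash or any(
            self.tables[i][self._slot(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return True
        i = 0
        for _ in range(self.MAX_KICKS):
            slot = self._slot(i, key)
            if self.tables[i][slot] is None:
                self.tables[i][slot] = key
                return True
            key, self.tables[i][slot] = self.tables[i][slot], key
            i = 1 - i
        if len(self.stash) < self.STASH_SIZE:
            self.stash.append(key)  # absorb the rare failure
            return True
        return False  # stash full: only now would a rehash be needed
```

Because insertion failures are nearly independent, the chance that more than a handful of elements ever land in the stash is vanishingly small, which is why a constant-size stash suffices.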
36. Modeling Economic Principles
- Joint work with Corbo, Jain, Parkes.
- An exploration: what models make sense for AS connectivity?
  - Extends the approach of Chang, Jamin, Mao, Willinger.
- Entering nodes link according to a business model and utility function.
- Nodes revise their links based on new entrants.
  - Like the forest fire model.
- Future consideration: how to validate such models.
37. Conclusion: My (Biased) View
- There are 5 stages of networking research:
  - Observe: gather data to demonstrate power law behavior in a system.
  - Interpret: explain the import of this observation in the system context.
  - Model: propose an underlying model for the observed behavior of the system.
  - Validate: find data to validate (and if necessary specialize or modify) the model.
  - Control: design ways to control and modify the underlying behavior of the system based on the model.
- We need to focus on validation and control.
- Lots of open research problems.
38. A Chance for Collaboration
- The observe/interpret stages of research are dominated by systems; modeling is dominated by theory.
  - Both need new insights from statistics, control theory, and economics!
- Validation and control require a strong theoretical foundation.
  - We need universal ideas and methods that span different types of systems.
  - We need understanding of the underlying mathematical models.
- But also a large systems buy-in.
  - Getting, analyzing, and understanding data.
  - Finding avenues for real impact.
- A good area for future systems/theory/other collaboration and interaction.
39. More About Me
- Website: www.eecs.harvard.edu/michaelm
  - Links to papers
  - Link to book
  - Link to blog, mybiasedcoin: mybiasedcoin.blogspot.com