Title: Herald: Achieving a Global Event Notification Service
1. Herald: Achieving a Global Event Notification Service
- Marvin Theimer
- Joint work with Michael B. Jones, Helen Wang, Alec Wolman
- Microsoft Research, Redmond, Washington
2. What this Talk Will Be About
- Event notification: more than just simple multicast.
- Problems you run into at large scale.
- The Herald event notification service: design criteria and initial solution strategies.
- Some results on overlay networks.
3. Event Notification: It's More than Just Simple Multicast
- Additional types of functionality
- Reliability/Persistence
- QoS
- Security
- Richer publish/subscribe semantics
- Herald will be exploring a subset of the potential design space.
4. Multicast Reliability/Persistence
Unreliable multicast
Reliable, un-ordered multicast
Ordered multicast
Persistent multicast
Persistent, ordered multicast
Multicast with history
Persistent, ordered multicast with history
5. Multicast QoS
Best-effort multicast
Bounded delivery time multicast
Simultaneous delivery time multicast
6. Multicast Security
Unsecured multicast
Sender issues: confidentiality of messages, authorized receivers
Receiver issues: integrity of messages, authorized senders
Forwarder issues: federated forwarders, revocation
7. Richer Publish/Subscribe Semantics
Unfiltered, single event topics
Broaden subscription
Narrow subscription
Filtered subscriptions, e.g. event.type = exception, or max delivery of 1 msg/sec (see the sketch after this slide)
Subscription patterns, e.g. machines/*/cpu-load
Topic composition, e.g. a (b c)
Standing DB queries, e.g. if ServerFarm.VirusAlarm then Select machine where CombinedLoad(machine.CpuLoad, machine.NetLoad) > 0.9 and Contains(machine.Processes, WebServer)
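As a rough illustration of the filtered-subscription idea above, here is a minimal sketch of what such a subscription might look like on the client side. The Event and FilteredSubscription classes, the field names, and the rate-limit parameter are assumptions made for this example; Herald's actual subscription interface is not specified in this talk.

```python
# Illustrative sketch only: Herald's subscription API is not given in the
# talk, so the classes and field names below are assumptions.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Event:
    topic: str
    fields: Dict[str, Any] = field(default_factory=dict)

@dataclass
class FilteredSubscription:
    topic: str
    predicate: Callable[[Event], bool]   # content filter applied to each event
    max_rate_per_sec: float              # delivery-rate filter ("1 msg/sec")

# The slide's example: only exception events, delivered at most once per second.
sub = FilteredSubscription(
    topic="machines/web42/errors",
    predicate=lambda e: e.fields.get("type") == "exception",
    max_rate_per_sec=1.0,
)
print(sub.predicate(Event("machines/web42/errors", {"type": "exception"})))  # True
```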
8. Global Event Notification Services
- Local-node and local-area event notification have well-established user communities.
- Communication via event notification is also well-suited
- for Internet-scale distributed applications (e.g. instant messaging and multi-player games)
- for loosely-coupled eCommerce applications
- General event notification systems currently
- scale to tens of thousands of clients
- do not have global reach
9. Internet Scalability is Hard
- Internet scale implies that individual systems can never have complete information
- Information will be unavailable, stale, or even inaccurate
- Some participants will always be unreachable
- Partitions, disconnection, and node downtime will always be occurring somewhere
- There will (probably) not be a single organization that owns the entire event notification infrastructure; hence a federated design is required.
10. Scaling to Extreme Event Notification
- 10^11 computers and embedded systems in the world (soon)
- > 10^11 event topics
- > 10^11 publishers and subscribers in aggregate
- Global audiences
- 10^10 subscribers for some event topics
- Trust no one
- Potentially 10^10 federated security domains
11. Some Event Notification Semantics Don't Scale Well
- Fully general subscription semantics
- Simultaneous delivery times
- Highly available, reliable, ordered events from multiple publishers
12. Global Event Notification is Hard to Validate
- What kinds of workloads?
- Instant messaging?
- Billions of inter-connected gizmos (BIG)?
- eCommerce?
- How do you explore and validate a design that should scale to tens of billions?
- Rule of thumb: every order of magnitude results in qualitatively different system behavior
13. No Good Means of Exploration/Validation
- Build a real system
- Most accurate approach; lets you discover the unexpected.
- Current test beds only go up to a few thousand nodes, but can virtualize to get another factor of 5-10.
- Simulation
- Detailed (non-distributed) simulators don't scale beyond a few thousand nodes, plus they only model the effects you're already aware of.
- Crude (non-distributed) simulators scale to about a million nodes but ignore many important factors.
- Distributed simulators: are accurate ones easier to build or more scalable than a real system?
14. Herald: Scalable Event Notification as an Infrastructure Service
- The design space of interesting choices is huge.
- Focus on the scalability of basic message delivery and distributed state management capabilities first (i.e. a bottom-up approach)
- Employ a very simple message-oriented design.
- Try to layer richer event notification semantics on top.
- Explore the trade-off between scalability and various forms of semantic richness.
15. Herald Event Notification Model
1. Create Event Topic
2. Subscribe
3. Publish
4. Notify
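To make the four steps above concrete, here is a minimal, single-process sketch of the model; the HeraldClient class and its method names are illustrative assumptions, not Herald's actual interface.

```python
# Illustrative sketch of the four-step model above. HeraldClient and its
# method names are assumptions for this example, not Herald's real API,
# and everything runs in one process rather than over a federated network.
from typing import Callable, Dict, List

class HeraldClient:
    def __init__(self) -> None:
        # topic name -> list of subscriber callbacks
        self._topics: Dict[str, List[Callable[[bytes], None]]] = {}

    def create_topic(self, name: str) -> None:                             # step 1
        self._topics.setdefault(name, [])

    def subscribe(self, name: str, cb: Callable[[bytes], None]) -> None:  # step 2
        self._topics[name].append(cb)

    def publish(self, name: str, event: bytes) -> None:                   # step 3
        for notify in self._topics[name]:                                  # step 4
            notify(event)

client = HeraldClient()
client.create_topic("stock.MSFT")
client.subscribe("stock.MSFT", lambda e: print("notified:", e))
client.publish("stock.MSFT", b"price=65.21")
```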
16. Design Criteria
- The usual criteria
- Scalability
- Resilience
- Self-administration
- Timeliness
- Additional criteria
- Heterogeneous federation
- Security
- Support for disconnected and partitioned operation
17. Heterogeneous Federation
- Federation of machines within cooperating but mutually suspicious domains of trust
- Federated parties may include both small and large domains
- Can we support fully peer-to-peer models?
- How will "lumpy" systems, containing both small and large domains, behave?
18. Security
- What's the right threat model?
- Trust all nodes within an administrative domain?
- Trust no one?
- Dealing with large numbers
- Managing large numbers of access control groups.
- Revocation in the face of high change volumes.
- How should anonymity and privacy be treated and supported?
19. Support for Disconnected and Partitioned Operation
- Capabilities wanted by some applications
- Eventual delivery to disconnected subscribers.
- Event histories to allow a posteriori examination of bounded regions of the past.
- Continued operation on both sides of a network partition.
- Eventual (out-of-order) delivery after partition healing.
- What's the cost of supporting disconnected and partitioned operation, and who should pay it?
- Who maintains event histories?
- How do dissemination trees get reconfigured when partitions occur and heal?
20. Non-Goals
- Developing the best way to do
- Naming
- Filtering
- Complex subscription queries
- Eventually want to support filtering and complex subscription queries.
- Ideally, Herald should be agnostic to the choices made for these topics.
21. Initial Solution Strategies
- Keep it simple
- Small set of core mechanisms.
- Everything else driven with policy modules.
- Peer-to-peer network of managed servers
- Servers are spread around the edge of the Internet.
- Leverage smart clients when possible.
22. Keep it Simple
- We believe we only need these mechanisms:
- State replication and transfer
- Overlay distribution networks
- Time contracts (to age and discard state; see the sketch after this list)
- Event histories
- (Non-scalable) administrative event topics
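The time-contract mechanism listed above is not elaborated in the talk; the following is a minimal sketch under the assumption that it means replicated state carrying an expiration time that is silently discarded unless renewed. All names here are illustrative.

```python
# Minimal sketch of the time-contract idea as assumed above: state entries
# expire unless renewed, so stale state ages out without explicit deletes.
import time
from typing import Any, Dict, Tuple

class TimeContractStore:
    """Holds state that is discarded once its contract (TTL) expires."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (state, expiry)

    def put(self, key: str, state: Any, ttl_seconds: float) -> None:
        # Storing (or re-storing) state renews its contract by ttl_seconds.
        self._entries[key] = (state, time.time() + ttl_seconds)

    def get(self, key: str) -> Any:
        self.expire()
        state, _ = self._entries[key]   # KeyError if the contract has lapsed
        return state

    def expire(self) -> None:
        # Aged-out entries are simply dropped.
        now = time.time()
        self._entries = {k: v for k, v in self._entries.items() if v[1] > now}

store = TimeContractStore()
store.put("topic:stock.MSFT:subscribers", ["nodeA", "nodeB"], ttl_seconds=30.0)
print(store.get("topic:stock.MSFT:subscribers"))
```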
23. Distributed Systems Alternatives
- Centralized mega-services don't meet our design criteria.
- Peer-to-peer Internet clients make poor service providers
- Limited bandwidth links.
- Intermittently available.
- Trust issues.
- Peer-to-peer network of managed servers
- Avoid the pitfalls of using clients as servers.
- Can reflect administrative boundaries.
- But still want to be able to take advantage of client capabilities/resources.
24. Project Status
- Have focused mostly on peer-to-peer overlay networks so far
- Scalable topic name service.
- Event distribution algorithms
- Broadcasting over per-topic overlays versus event forwarding trees.
- Pastry versus CAN overlay networks.
- Prototyped using the MSR Cambridge network simulator.
25. Next Steps
- Build a full Herald implementation
- Run implementation on test bed of real machines
- Use Farsite cluster of 300 machines
- Eventually look for larger, more distributed test beds.
- Tackle federation and security issues
- Understand behavior under a variety of environmental and workload scenarios
26. An Evaluation of Scalable Application-Level Multicast Built Using Peer-To-Peer Overlay Networks
- Miguel Castro, Michael B. Jones, Anne-Marie Kermarrec, Antony Rowstron, Marvin Theimer, Helen Wang, Alec Wolman
- Microsoft Research
- Cambridge, England and Redmond, Washington
27. Scalable Application-Level Multicast
- Peer-to-peer overlay networks such as CAN, Chord, Pastry, and Tapestry can be used to implement scalable application-level multicast.
- Two approaches
- Build multicast tree per group (see the sketch after this slide)
- Deliver messages by routing over tree
- Build separate overlay network per group
- Deliver messages by intelligently flooding to entire group
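A rough sketch of the first approach (a per-group forwarding tree embedded in the overlay, in the style of Scribe): each member routes a join message toward the overlay node responsible for the group ID, and every node on the path records which neighbor the join came from; multicasts are then forwarded down those recorded edges. The route_path helper and data structures are assumptions for illustration, not the paper's implementation.

```python
# Toy sketch of building a per-group forwarding tree over an overlay.
# route_path(member, group_id) is assumed to return the overlay nodes that a
# join message visits on its way to the node responsible for group_id.
from collections import defaultdict
from typing import Callable, Dict, List, Set

children: Dict[str, Set[str]] = defaultdict(set)   # node -> its tree children

def join(member: str, group_id: str,
         route_path: Callable[[str, str], List[str]]) -> None:
    prev = member
    for node in route_path(member, group_id):
        children[node].add(prev)    # remember which neighbor the join came from
        prev = node

def multicast(node: str, message: str) -> None:
    # Forward the message down the recorded tree edges.
    for child in children[node]:
        print(f"{node} -> {child}: {message}")
        multicast(child, message)
```

The second approach instead builds a small, separate overlay containing only the group members and delivers each message by flooding within that overlay.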
28. Classes of Overlay Networks
- Chord, Pastry, and Tapestry all use a form of generalized hypercube routing with longest-prefix address matching (see the sketch after this slide).
- O(log N) routing state; O(log N) routing hops.
- CAN uses a numerical distance metric to route through a Cartesian hyper-space.
- O(D) routing state; O(N^(1/D)) routing hops.
29. Observation
- Approach to multicast is independent of overlay network choice.
- Possible to perform a head-to-head comparison of flooding versus tree-based multicast on both styles of overlays.
30. What Should One Use?
- Evaluate
- Forwarding tree versus flooding multicast approaches
- On both CAN and Pastry
- On the same simulation platform
- Running the same experiments
31. Simulation Platform
- Packet-level discrete event simulator.
- Counts the number of packets sent over each physical link and assigns a constant delay to each link.
- Does not model queuing delays or packet losses.
- Georgia Tech transit/stub network topology with 5050 core router nodes.
- Overlay nodes attached to the routers via LAN links.
32. Experiments
- Two sets of experiments
- Single multicast group with all overlay nodes as members (80,000 total).
- 1,500 multicast groups with a range of membership sizes and locality characteristics
- Zipf-like distribution governing membership size (see the sketch after this slide).
- Both uniform and Zipf-like distributions governing locality of members.
- Each experiment had 2 phases
- Members subscribe to groups.
- One message multicast to each group.
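For intuition about the membership-size distribution mentioned above, here is a small sketch that draws group sizes from a Zipf-like law; the exponent, seed, and size cap are arbitrary assumptions for illustration and are not the parameters used in these experiments.

```python
# Illustrative sketch: draw per-group membership sizes from a Zipf-like
# distribution. Exponent, seed, and cap are assumptions, not the paper's values.
import random
from typing import List

def zipf_group_sizes(num_groups: int, max_size: int,
                     exponent: float = 1.0, seed: int = 42) -> List[int]:
    rng = random.Random(seed)
    # Weight each possible group size s by 1 / s^exponent.
    weights = [1.0 / (s ** exponent) for s in range(1, max_size + 1)]
    sizes = rng.choices(range(1, max_size + 1), weights=weights, k=num_groups)
    return sorted(sizes, reverse=True)

sizes = zipf_group_sizes(num_groups=1500, max_size=80_000)
print(sizes[:3], "...", sizes[-3:])   # a few large groups, many small ones
```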
33. Evaluation Criteria
- Relative delay penalty (see the sketch after this slide)
- RMD = max delay(app-mcast) / max delay(ip-mcast)
- RAD = avg delay(app-mcast) / avg delay(ip-mcast)
- Link stress
- Node stress
- Number of routing table entries
- Number of forwarding tree table entries
- Duplicates
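A small worked sketch of the relative-delay-penalty metrics defined above, computed over per-receiver delays for one multicast message; the sample delay values are invented purely for illustration.

```python
# Sketch of the relative delay penalty metrics defined on this slide.
from typing import List

def rmd(app_delays: List[float], ip_delays: List[float]) -> float:
    """RMD: max application-level delay over max IP-multicast delay."""
    return max(app_delays) / max(ip_delays)

def rad(app_delays: List[float], ip_delays: List[float]) -> float:
    """RAD: average application-level delay over average IP-multicast delay."""
    return (sum(app_delays) / len(app_delays)) / (sum(ip_delays) / len(ip_delays))

# Invented per-receiver delays (ms) for one message to four group members.
app = [42.0, 55.0, 61.0, 70.0]
ip = [30.0, 35.0, 38.0, 40.0]
print(rmd(app, ip), rad(app, ip))   # values above 1 mean app-level multicast is slower
```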
34. RDP for Flooding on CAN
35. RDP for Trees on CAN
36. Link Stress with CAN

Flooding:
State size     29        38        59        111
Join phase
  Max          134378    143454    398174    480926
  Average      182       217       276       424
Flood phase
  Max          1897      1343      958       652
  Average      4.13      3.35      3.06      2.94

Tree-based:
State size     29        38        59        111
Mcast phase
  Max          219       178       188       202
  Average      1.52      1.42      1.35      1.31
37. Tree-Per-Group vs. Overlay-Per-Group
- Tree-based multicast approach consistently outperforms flooding approach.
- Biggest disadvantage of flooding is the cost of constructing a new overlay for each multicast group.
- Results consistent for both Pastry and CAN.
- But the flooding approach doesn't require trusting third-party intermediary nodes.
38. CAN vs. Pastry for Tree-Based Multicast
- Equivalent per-node routing table state, with representative tuned settings
- RDP values are 20% to 50% better with Pastry than with CAN
- Comparable average link stress values, but max. link stress was twice as high for Pastry as for CAN
- Max. number of forwarding tree entries was about three times as high for Pastry as for CAN.
- Pastry can be de-tuned to provide comparable RDP values to CAN in exchange for comparable costs.
- Conclusion: Pastry can optionally provide higher performance than CAN, but at a higher cost.
39. Routing Choice Observations
- Pastry employs a single mechanism to obtain good routing choices: greedy selection of routing table alternatives based on network proximity
- Easier to implement
- Easier to tune
- CAN employs a multitude of different mechanisms
- Multiple dimensions
- Multiple realities
- Greedy routing based on network proximity
- Overloaded zones
- Uniformly distributed zones
- Topologically aware assignment of zone addresses (most useful)
- CAN is more difficult to implement and adjust for alternate uses (such as multicast)
- CAN is harder to tune
40. Programming Model Observations
- Pastry model: simple
- Unicast delivery to the node with the nearest ID in ID space
- Can send to a specific node with a known ID
- CAN model: more complex
- Anycast delivery to sets of nodes inhabiting the same regions of the hypercube or hypercubes
- Requires application-level coordination
41. Topologically Aware Address Assignment
- Topological assignment
- Choosing node IDs or regions based on topology
- Two methods
- Landmark-based (CAN only)
- Transit/stub aware clustering
- Hard to get right for CAN
- Helps CAN a lot, both for unicast and multicast
- Helps Pastry-based flooding; hurts Pastry-based forwarding trees
42. Summary and Conclusions
- Embedded forwarding trees are preferable to flooding of mini-overlay networks.
- Pastry is capable of lower delays than CAN, but may incur higher maximal costs; Pastry can be tuned to offer equivalent delays to CAN at equivalent costs.
- Topologically aware assignment
- Most important optimization for CAN.
- Mixed results for Pastry.
- CAN
- More complex programming model.
- Difficult to tune.
43. Future Work
- Behavior with respect to fault tolerance.
- Better understanding of topologically aware address assignment.
44. Related Work
- Non-global event notification systems
- Gryphon, Ready, Siena, ...
- Netnews
- P2P systems
- Gnutella, Farsite, ...
- Overlay multicast networks
- CAN, Chord, Pastry, Tapestry, Scribe, ...
- Content Distribution Networks (CDNs)
- Akamai, ...
- OceanStore
45. Some Useful Pointers
- Talk with me
- Marvin Theimer - theimer_at_microsoft.com
- http://research.microsoft.com/theimer/
- Herald: Scalable Event Notification Project
- http://research.microsoft.com/sn/Herald/
- Peer-to-peer measurements
- A Measurement Study of Peer-to-Peer File Sharing Systems (UW)
- Overlay network with dominant control traffic
- Resilient Overlay Networks (MIT)
- Scalability limits of ISIS/Horus
- Bimodal Multicast (Cornell), August 1999 TOCS
46. Some Useful Pointers (cont.)
- CAN - ICIR (formerly ACIRI)
- A Scalable Content-Addressable Network
- Chord - MIT
- Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- Pastry - MSR Cambridge and Rice
- Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
- Tapestry - Berkeley
- Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing
- End-System Multicast - CMU
- A Case For End System Multicast