1
Herald: Achieving a Global Event Notification
Service
  • Marvin Theimer
  • Joint work with
  • Michael B. Jones, Helen Wang, Alec Wolman
  • Microsoft Research
  • Redmond, Washington

2
What this Talk will be About
  • Event notification: more than just simple
    multicast.
  • Problems you run into at large scale.
  • Herald event notification service: Design
    criteria and initial solution strategies.
  • Some results on overlay networks.

3
Event Notification: It's more than just Simple
Multicast
  • Additional types of functionality
  • Reliability/Persistence
  • QoS
  • Security
  • Richer publish/subscribe semantics
  • Herald will be exploring a subset of the
    potential design space.

4
Multicast Reliability/Persistence
Unreliable multicast
Reliable, un-ordered multicast
Ordered multicast
Persistent multicast
Persistent, ordered multicast
Multicast with history
Persistent, ordered multicast with history
5
Multicast QoS
Best-effort multicast
Bounded delivery time multicast
Simultaneous delivery time multicast
6
Multicast Security
Unsecured multicast
Sender issues
Receiver issues
Forwarder issues
Confidentiality of messages, Authorized receivers
Integrity of messages, Authorized senders
Federated forwarders
Revocation
7
Richer Publish/Subscribe Semantics
Unfiltered, single-event topics
Broaden subscription
Narrow subscription
Filtered subscriptions, e.g. event.type = exception
or max delivery = 1 msg/sec (sketched below)
Subscription patterns, e.g. machines/*/cpu-load
Topic composition, e.g. a AND (b OR c)
Standing DB queries, e.g. if ServerFarm.VirusAlarm
then Select machine where
CombinedLoad(machine.CpuLoad, machine.NetLoad) > 0.9
and Contains(machine.Processes, WebServer)
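To make the filtered-subscription examples above concrete, here is a minimal sketch of a filter predicate combined with a rate limit, in Python. The Subscription class and its methods are illustrative assumptions, not part of Herald.

```python
import time

class Subscription:
    """Hypothetical filtered subscription: a content predicate plus an optional rate limit."""
    def __init__(self, predicate, max_per_sec=None):
        self.predicate = predicate
        self.max_per_sec = max_per_sec
        self._last_delivery = 0.0

    def matches(self, event):
        # Content filter, e.g. event["type"] == "exception".
        if not self.predicate(event):
            return False
        # "Max delivery = 1 msg/sec" style throttling, if configured.
        if self.max_per_sec is not None:
            now = time.time()
            if now - self._last_delivery < 1.0 / self.max_per_sec:
                return False
            self._last_delivery = now
        return True

# Deliver only exception events, at most one per second.
sub = Subscription(lambda e: e.get("type") == "exception", max_per_sec=1)
print(sub.matches({"type": "exception"}))   # True
print(sub.matches({"type": "heartbeat"}))   # False
```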
8
Global Event Notification Services
  • Local-node and local-area event notification have
    well-established user communities.
  • Communication via event notification is also
    well-suited
  • for Internet-scale distributed applications (e.g.
    instant messaging and multi-player games)
  • for loosely-coupled eCommerce applications
  • General event notification systems currently
  • scale to tens of thousands of clients
  • do not have global reach

9
Internet Scalability is Hard
  • Internet scale implies that individual systems
    can never have complete information
  • Information will be unavailable, stale or even
    inaccurate
  • Some participants will always be unreachable
  • Partitions, disconnection, and node downtime will
    always be occurring somewhere
  • There will (probably) not be a single
    organization that owns the entire event
    notification infrastructure. Hence a federated
    design is required.

10
Scaling to Extreme Event Notification
  • 10^11 computers and embedded systems in the world
    (soon)
  • > 10^11 event topics
  • > 10^11 publishers and subscribers in aggregate
  • Global audiences
  • 10^10 subscribers for some event topics
  • Trust no one
  • Potentially 10^10 federated security domains

11
Some Event Notification Semantics Don't Scale Well
  • Fully general subscription semantics
  • Simultaneous delivery times
  • Highly available, reliable, ordered events from
    multiple publishers

12
Global Event Notification is Hard to Validate
  • What kinds of workloads?
  • Instant messaging?
  • Billions of inter-connected gizmos (BIG)?
  • eCommerce?
  • How do you explore and validate a design that
    should scale to tens of billions?
  • Rule of thumb: every order of magnitude results
    in qualitatively different system behavior.

13
No Good Means of Exploration/Validation
  • Build a real system
  • Most accurate approach; lets you discover the
    unexpected.
  • Current test beds only go up to a few thousand
    nodes, but can virtualize to get another factor
    of 5-10.
  • Simulation
  • Detailed (non-distributed) simulators don't scale
    beyond a few thousand nodes, plus they only model
    the effects you're already aware of.
  • Crude (non-distributed) simulators scale to about
    a million nodes, but they ignore many important
    factors.
  • Distributed simulators: are accurate ones any
    easier to build, or more scalable, than a real
    system?

14
Herald: Scalable Event Notification as an
Infrastructure Service
  • The design space of interesting choices is huge.
  • Focus on the scalability of basic message
    delivery and distributed state management
    capabilities first (i.e. bottom-up approach)
  • Employ a very simple message-oriented design.
  • Try to layer richer event notification semantics
    on top.
  • Explore the trade-off between scalability and
    various forms of semantic richness.

15
Herald Event Notification Model
1. Create Event Topic
2. Subscribe
3. Publish
4. Notify
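A toy, in-process sketch of the four-step model above; the EventTopic class and method names are illustrative assumptions, not Herald's actual interface.

```python
class EventTopic:
    """Toy rendezvous point implementing the create/subscribe/publish/notify steps."""
    def __init__(self, name):          # step 1: create event topic
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):     # step 2: register interest
        self.subscribers.append(callback)

    def publish(self, event):          # step 3: publish an event ...
        for notify in self.subscribers:
            notify(self.name, event)   # ... step 4: notify each subscriber

topic = EventTopic("machines/web01/cpu-load")
topic.subscribe(lambda t, e: print(f"{t}: {e}"))
topic.publish({"cpu-load": 0.87})
```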
16
Design Criteria
  • The usual criteria
  • Scalability
  • Resilience
  • Self-administration
  • Timeliness
  • Additional criteria
  • Heterogeneous federation
  • Security
  • Support for disconnected and partitioned operation

17
Heterogeneous Federation
  • Federation of machines within cooperating but
    mutually suspicious domains of trust
  • Federated parties may include both small and
    large domains
  • Can we support fully peer-to-peer models?
  • How will "lumpy" systems, containing both small
    and large domains, behave?

18
Security
  • What's the right threat model?
  • Trust all nodes within an administrative domain?
  • Trust no one?
  • Dealing with large numbers
  • Managing large numbers of access control groups.
  • Revocation in the face of high change volumes.
  • How should anonymity and privacy be treated and
    supported?

19
Support for Disconnected and Partitioned Operation
  • Capabilities wanted by some applications
  • Eventual delivery to disconnected subscribers.
  • Event histories to allow a posteriori examination
    of bounded regions of the past (sketched below).
  • Continued operation on both sides of a network
    partition.
  • Eventual (out-of-order) delivery after partition
    healing.
  • What's the cost of supporting disconnected and
    partitioned operation, and who should pay it?
  • Who maintains event histories?
  • How do dissemination trees get reconfigured when
    partitions and partition healings occur?
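One way to picture the event-history bullet above is a bounded per-topic log that a reconnecting subscriber replays from its last-seen sequence number. This is a sketch under assumed semantics (bounded window, per-topic sequence numbers), not Herald's design.

```python
from collections import deque

class TopicHistory:
    """Bounded per-topic event log; a reconnecting subscriber replays by sequence number."""
    def __init__(self, capacity=1000):
        self.log = deque(maxlen=capacity)   # only a bounded region of the past is kept
        self.next_seq = 0

    def append(self, event):
        self.log.append((self.next_seq, event))
        self.next_seq += 1

    def replay_since(self, last_seen_seq):
        # Events that fell out of the retained window are simply unavailable.
        return [(seq, ev) for seq, ev in self.log if seq > last_seen_seq]

history = TopicHistory(capacity=3)
for i in range(5):
    history.append({"msg": i})
print(history.replay_since(1))   # only events 2, 3, 4 survive the bounded window
```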

20
Non-Goals
  • Developing the best way to do
  • Naming
  • Filtering
  • Complex subscription queries
  • Eventually want to support filtering and complex
    subscription queries.
  • Ideally, Herald should be agnostic to the choices
    made for these topics.

21
Initial Solution Strategies
  • Keep it simple
  • Small set of core mechanisms.
  • Everything else driven with policy modules.
  • Peer-to-peer network of managed servers
  • Servers are spread around the edge of the
    Internet.
  • Leverage smart clients when possible.

22
Keep it Simple
  • We believe we only need these mechanisms
  • State replication and transfer
  • Overlay distribution networks
  • Time contracts (to age/discard state; sketched below)
  • Event histories
  • (Non-scalable) administrative event topics
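A minimal sketch of the time-contract mechanism listed above: state entries carry a lease-style expiry and are aged out when the contract lapses. The store interface (put/renew/get/purge) is an assumption for illustration, not Herald's API.

```python
import time

class TimeContractStore:
    """State is held only while its time contract (a lease) is unexpired."""
    def __init__(self):
        self.entries = {}                      # key -> (value, expiry time)

    def put(self, key, value, lifetime_secs):
        self.entries[key] = (value, time.time() + lifetime_secs)

    def renew(self, key, lifetime_secs):
        value, _ = self.entries[key]
        self.entries[key] = (value, time.time() + lifetime_secs)

    def get(self, key):
        value, expiry = self.entries.get(key, (None, 0.0))
        return value if time.time() < expiry else None

    def purge_expired(self):
        now = time.time()
        self.entries = {k: v for k, v in self.entries.items() if v[1] > now}

store = TimeContractStore()
store.put("subscription:alice", {"topic": "cpu-load"}, lifetime_secs=30)
print(store.get("subscription:alice"))   # present until the contract expires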

23
Distributed Systems Alternatives
  • Centralized mega-services don't meet our design
    criteria.
  • Peer-to-peer Internet clients make poor service
    providers
  • Limited bandwidth links.
  • Intermittently available.
  • Trust issues.
  • Peer-to-peer network of managed servers
  • Avoid the pitfalls of using clients as servers.
  • Can reflect administrative boundaries.
  • But still want to be able to take advantage of
    client capabilities/resources.

24
Project Status
  • Have focused mostly on peer-to-peer overlay
    networks so far
  • Scalable topic name service.
  • Event distribution algorithms
  • Broadcasting over per-topic overlays versus event
    forwarding trees.
  • Pastry versus CAN overlay networks.
  • Prototyped using MSR Cambridge network simulator.

25
Next Steps
  • Build a full Herald implementation
  • Run implementation on test bed of real machines
  • Use Farsite cluster of 300 machines
  • Eventually look for larger, more distributed test
    beds.
  • Tackle federation security issues
  • Understand behavior under a variety of
    environmental and workload scenarios

26
An Evaluation of Scalable Application-Level
Multicast Built Using Peer-To-Peer Overlay
Networks
  • Miguel Castro, Michael B. Jones, Anne-Marie
    Kermarrec, Antony Rowstron, Marvin Theimer, Helen
    Wang, Alec Wolman
  • Microsoft Research
  • Cambridge, England and
  • Redmond, Washington

27
Scalable Application-Level Multicast
  • Peer-to-peer overlay networks such as CAN, Chord,
    Pastry, and Tapestry can be used to implement
    scalable application-level multicast.
  • Two approaches
  • Build a multicast tree per group
  • Deliver messages by routing over the tree (see
    the sketch below)
  • Build a separate overlay network per group
  • Deliver messages by intelligently flooding to the
    entire group
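The tree-per-group approach (as in Scribe) derives the multicast tree from the overlay's unicast routes: each member routes a join message toward the group's root, and every overlay node along the way records the previous hop as a child. A simplified sketch, with the overlay route supplied as a plain list; the class and names are illustrative assumptions.

```python
class GroupTree:
    """Tree-per-group multicast: children tables built from overlay join paths."""
    def __init__(self, root):
        self.root = root
        self.children = {}                 # overlay node -> set of child nodes

    def join(self, member, overlay_path):
        """overlay_path: the overlay route from the member to the root (member first)."""
        for child, parent in zip(overlay_path, overlay_path[1:]):
            self.children.setdefault(parent, set()).add(child)

    def multicast(self, message):
        """Deliver by forwarding down the tree from the root."""
        delivered, frontier = [], [self.root]
        while frontier:
            node = frontier.pop()
            delivered.append(node)
            frontier.extend(self.children.get(node, ()))
        return delivered

tree = GroupTree(root="R")
tree.join("A", ["A", "X", "R"])    # A's overlay route to the root passes through X
tree.join("B", ["B", "X", "R"])    # B shares forwarder X, so X fans out to both
print(tree.multicast("hello"))     # e.g. ['R', 'X', 'B', 'A']
```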

28
Classes of Overlay Networks
  • Chord, Pastry and Tapestry all use a form of
    generalized hypercube routing with longest-prefix
    address matching (illustrated below).
  • O(log N) routing state; O(log N) routing hops.
  • CAN uses a numerical distance metric to route
    through a Cartesian hyper-space.
  • O(D) routing state; O(N^(1/D)) routing hops.
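A didactic sketch of the longest-prefix routing idea: each hop forwards to a node that matches at least one more digit of the destination ID, so routes take a logarithmic number of hops. The node set and greedy policy here are illustrative, not any system's actual routing tables.

```python
def shared_prefix_len(a, b):
    """Number of leading digits the two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(source, dest, nodes):
    """Prefix routing: every hop matches at least one more digit of the destination."""
    path, current = [source], source
    while current != dest:
        have = shared_prefix_len(current, dest)
        # A real routing table holds one entry per (prefix length, next digit),
        # so a node improving the match is always available.
        candidates = [n for n in nodes if shared_prefix_len(n, dest) > have]
        current = min(candidates, key=lambda n: shared_prefix_len(n, dest))
        path.append(current)
    return path

nodes = ["0123", "0200", "0210", "0213", "3301"]
print(route("3301", "0213", nodes))   # resolves roughly one digit per hop
```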

29
Observation
  • Approach to multicast is independent of overlay
    network choice.
  • Possible to perform a head-to-head comparison of
    flooding versus tree-based multicast on both
    styles of overlays.

30
What Should One Use?
  • Evaluate
  • Forwarding tree versus flooding multicast
    approaches
  • On both CAN and Pastry
  • On the same simulation platform
  • Running the same experiments

31
Simulation Platform
  • Packet-level discrete event simulator.
  • Counts the number of packets sent over each
    physical link and assigns a constant delay to
    each link.
  • Does not model queuing delays or packet losses.
  • Georgia Tech transit/stub network topology with
    5050 core router nodes.
  • Overlay nodes attached to the routers via LAN
    links.

32
Experiments
  • Two sets of experiments
  • Single multicast group with all overlay nodes as
    members (80,000 total).
  • 1500 multicast groups with a range of membership
    sizes and locality characteristics
  • Zipf-like distribution governing membership size
    (sketched below).
  • Both uniform and Zipf-like distributions
    governing locality of members.
  • Each experiment had 2 phases
  • Members subscribe to groups.
  • One message multicast to each group.
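A sketch of how such a workload could be generated: group sizes drawn from a Zipf-like distribution, with members then chosen uniformly at random from the 80,000 overlay nodes. The exponent and the exact sampling scheme are illustrative assumptions, not the paper's parameters.

```python
import random

def zipf_group_sizes(num_groups, max_size, min_size=1, exponent=1.0):
    """Group of rank r gets size proportional to 1 / r**exponent (Zipf-like)."""
    return [max(int(max_size / rank ** exponent), min_size)
            for rank in range(1, num_groups + 1)]

random.seed(0)
nodes = list(range(80_000))                              # overlay nodes, as in the experiments
sizes = zipf_group_sizes(num_groups=1_500, max_size=80_000)
groups = [random.sample(nodes, k) for k in sizes[:5]]    # uniform member placement
print(sizes[:5], [len(g) for g in groups])
```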

33
Evaluation Criteria
  • Relative delay penalty (see the sketch after this
    list)
  • RMD = max delay (app-level mcast) / max delay (IP mcast)
  • RAD = avg delay (app-level mcast) / avg delay (IP mcast)
  • Link stress
  • Node stress
  • Number of routing table entries
  • Number of forwarding tree table entries
  • Duplicates
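A small sketch of how the two delay-penalty ratios above can be computed from per-member delivery delays; the delay numbers are made up for illustration.

```python
def delay_penalties(app_delays, ip_delays):
    """RMD and RAD: ratios of application-level to IP multicast delivery delay.
    Each argument maps a group member to its delivery delay for one multicast."""
    rmd = max(app_delays.values()) / max(ip_delays.values())
    rad = (sum(app_delays.values()) / len(app_delays)) / \
          (sum(ip_delays.values()) / len(ip_delays))
    return rmd, rad

app = {"n1": 80, "n2": 120, "n3": 150}   # ms via the overlay (made-up numbers)
ip  = {"n1": 40, "n2": 60,  "n3": 50}    # ms via native IP multicast
print(delay_penalties(app, ip))          # (2.5, 2.33...)
```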

34
RDP for Flooding on CAN
35
RDP for Trees on CAN
36
Link Stress with CAN
Flooding:
  State size          29       38       59       111
  Join phase, max     134378   143454   398174   480926
  Join phase, avg     182      217      276      424
  Flood phase, max    1897     1343     958      652
  Flood phase, avg    4.13     3.35     3.06     2.94

Tree-based:
  State size          29       38       59       111
  Mcast phase, max    219      178      188      202
  Mcast phase, avg    1.52     1.42     1.35     1.31
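Link stress, as reported in the table above, counts the number of packets sent over each physical link during a phase. A sketch of tallying the maximum and average from per-packet link traversals; the routes are illustrative.

```python
from collections import Counter

def link_stress(packet_routes):
    """packet_routes: for each packet, the list of physical links it crossed.
    Returns (max, average) packets per physical link that carried any traffic."""
    per_link = Counter()
    for route in packet_routes:
        per_link.update(route)
    stresses = per_link.values()
    return max(stresses), sum(stresses) / len(stresses)

# Three multicast packets crossing an illustrative physical topology.
routes = [
    [("r1", "r2"), ("r2", "r3")],
    [("r1", "r2"), ("r2", "r4")],
    [("r1", "r2"), ("r2", "r3")],
]
print(link_stress(routes))   # (3, 2.0): link r1-r2 carries all three packets
```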
37
Tree-Per-Group vs. Overlay-Per-Group
  • Tree-based multicast approach consistently
    outperforms flooding approach.
  • Biggest disadvantage of flooding is cost of
    constructing a new overlay for each multicast
    group.
  • Results consistent for both Pastry and CAN.
  • But the flooding approach doesn't require
    trusting third-party intermediary nodes.

38
CAN vs. Pastry for Tree-Based Multicast
  • Equivalent per-node routing table state, with
    representative tuned settings
  • RDP values are 20% to 50% better with Pastry than
    with CAN
  • Comparable average link stress values but max.
    link stress was twice as high for Pastry as for
    CAN
  • Max. number of forwarding tree entries was about
    three times as high for Pastry as for CAN.
  • Pastry can be de-tuned to provide comparable
    RDP values to CAN in exchange for comparable
    costs.
  • Conclusion: Pastry can optionally provide higher
    performance than CAN, but at a higher cost.

39
Routing Choice Observations
  • Pastry employs a single mechanism to obtain good
    routing choices: greedy selection of routing
    table alternatives based on network proximity
  • Easier to implement
  • Easier to tune
  • CAN employs a multitude of different mechanisms
  • Multiple dimensions
  • Multiple realities
  • Greedy routing based on network proximity
  • Overloaded zones
  • Uniformly distributed zones
  • Topologically aware assignment of zone addresses
    (most useful)
  • CAN is more difficult to implement and adjust for
    alternate uses (such as multicast)
  • CAN is harder to tune

40
Programming Model Observations
  • Pastry model: simple
  • Unicast delivery to the node with the nearest ID
    in ID space (illustrated below)
  • Can send to a specific node with a known ID
  • CAN model: more complex
  • Anycast delivery to sets of nodes inhabiting the
    same regions of the hypercube (or hypercubes)
  • Requires application-level coordination
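A tiny illustration of the deliver-to-nearest-ID model above; the integer node IDs and distance metric are illustrative simplifications (real Pastry uses 128-bit IDs and prefix routing).

```python
def deliver(key, node_ids):
    """Pastry-style delivery model: the message for `key` ends up at the live node
    whose ID is numerically closest to the key."""
    return min(node_ids, key=lambda n: abs(n - key))

nodes = [10, 42, 97, 130, 201]
print(deliver(100, nodes))   # 97: the node with the ID nearest the key
print(deliver(42, nodes))    # 42: sending to a specific node with a known ID
```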

41
Topologically Aware Address Assignment
  • Topological assignment
  • Choosing node IDs or regions based on topology
  • Two methods
  • Landmark-based (CAN only)
  • Transit/stub aware clustering
  • Hard to get right for CAN
  • Helps CAN a lot, both for unicast and multicast
  • Helps Pastry-based flooding; hurts Pastry-based
    forwarding trees

42
Summary and Conclusions
  • Embedded forwarding trees are preferable to
    flooding of mini-overlay networks.
  • Pastry is capable of lower delays than CAN, but
    may incur higher maximal costs; Pastry can be
    tuned to offer equivalent delays to CAN at
    equivalent costs.
  • Topologically aware assignment
  • Most important optimization for CAN.
  • Mixed results for Pastry.
  • CAN
  • More complex programming model.
  • Difficult to tune.

43
Future Work
  • Behavior with respect to fault tolerance.
  • Better understanding of topologically aware
    address assignment.

44
Related Work
  • Non-global event notification systems
  • Gryphon, Ready, Siena, ...
  • Netnews
  • P2P systems
  • Gnutella, Farsite, ...
  • Overlay multicast networks
  • CAN, Chord, Pastry, Tapestry, Scribe, ...
  • Content Distribution Networks (CDNs)
  • Akamai, ...
  • OceanStore

45
Some Useful Pointers
  • Talk with me
  • Marvin Theimer - theimer@microsoft.com
  • http://research.microsoft.com/theimer/
  • Herald: Scalable Event Notification Project
  • http://research.microsoft.com/sn/Herald/
  • Peer-to-peer measurements
  • A Measurement Study of Peer-to-Peer File Sharing
    Systems (UW)
  • Overlay network with dominant control traffic
  • Resilient Overlay Networks (MIT)
  • Scalability limits of ISIS/Horus
  • Bimodal Multicast (Cornell) August 1999 TOCS

46
Some Useful Pointers (cont)
  • CAN - ICIR (formerly ACIRI)
  • A Scalable Content-Addressable Network
  • Chord - MIT
  • Chord: A Scalable Peer-to-peer Lookup Service for
    Internet Applications
  • Pastry - MSR Cambridge and Rice
  • Storage management and caching in PAST, a
    large-scale, persistent peer-to-peer storage
    utility
  • Tapestry - Berkeley
  • Tapestry: An Infrastructure for Fault-tolerant
    Wide-area Location and Routing
  • End-System Multicast - CMU
  • A Case For End System Multicast