Self Healing Wide Area Network Services

Provided by: Bhav5
Learn more at: https://cseweb.ucsd.edu

1
Self Healing Wide Area Network Services
  • Bhavjit S Walha
  • Ganesh Venkatesh

2
Layout
  • Introduction
  • Previous Work
  • Issues
  • Solution
  • Preliminary results
  • Problems and Future Extensions
  • Conclusion

3
Motivation
  • Companies may have servers distributed over a
    wide area network
  • Akamai Content Distribution Network.
  • Distributed web-servers
  • Manual monitoring may not be feasible
  • Centralized control may lead to problems in
    case of a network partition
  • Typical server applications
  • May crash due to software bugs
  • Little state is retained
  • Simple restart is thus sufficient

4
Motivation
  • What if peers monitored each other's health?
  • In case a crash is detected - try and restart.
  • No central monitoring station involved.
  • Loosely based on a worm
  • Resilient to sporadic failures
  • Spreads to uninfected nodes
  • But
  • No backdoor involved
  • May not always shift to new nodes

6
Medusa
  • All nodes are part of a multicast group
  • Each node is thus in touch with all other nodes
    through Heartbeat messages.
  • Nodes send regular updates to the multicast tree
  • All communication through reliable multicast
  • In case a node goes down
  • Other nodes try to restart it
  • Request for service sent to multicast group

7
Medusa Problems
  • Scalability
  • Assumptions of reliable packet delivery
  • State information shared with all nodes.
  • Reliable Multicast
  • Assumes reliable delivery of packets to all nodes
  • No explicit ACKs
  • Kill operations fail in case of a temporary
    break in the multicast tree.
  • Security
  • No way of authenticating packets

9
Proposed solution
  • Nodes form peering relationships with only a
    subset of other nodes.
  • Exchange Hello packets
  • Scalable as the degree is fixed
  • No central control
  • No dependence on reliable multicast
  • Distributed communication protocol
  • Explicit ACKs for packets
  • Some super-nodes must be up when a node boots
  • Power of randomly connected graphs

10
Design
  • Each node continually sends Hello Packets to its
    peer nodes.
  • Indicates everything is up and working
  • A timeout indicates something is wrong
  • Application crash
  • Network Partition
  • Aim at application crashes
  • Application should be stateless
  • No code transfer
  • Remotely restartable
  • SSH needed: a login account and distributed keys.
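
The Hello/timeout rule above can be sketched as a small bookkeeping class. This is a minimal illustration, not the authors' implementation: the class name `PeerMonitor`, its methods, and the injectable `clock` are assumptions made for the example; only the idea of declaring a peer suspect after a Hello timeout comes from the slides.

```python
import time


class PeerMonitor:
    """Hypothetical helper: remembers when each peer last sent a Hello."""

    def __init__(self, hello_timeout=22.0, clock=time.monotonic):
        self.hello_timeout = hello_timeout  # seconds without a Hello before suspicion
        self.clock = clock                  # injectable so the logic can be tested
        self.last_seen = {}

    def add_peer(self, peer):
        # Start the timer when the peering relationship is created.
        self.last_seen[peer] = self.clock()

    def on_hello(self, peer):
        # A Hello means "everything is up and working": reset the timer.
        self.last_seen[peer] = self.clock()

    def timed_out_peers(self):
        # Peers whose last Hello is older than the timeout.
        now = self.clock()
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.hello_timeout)
```

A timeout only says that something is wrong. As the slide notes, it cannot by itself distinguish an application crash from a network partition.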

11
Initialization
  • 3-5 super-nodes form a fully connected
    graph.
  • Are expected to be up all the time
  • All nodes know the super-nodes' IPs
  • May be under manual supervision
  • May have information about the topology
  • Responsible for forwarding join requests to other
    nodes

12
Remote start
  • SSH to a remote node to restart
  • Remote (re)start attempted after Hello timeout.
  • Current implementation requires keys to be
    distributed beforehand
  • Starts a small watchdog program which immediately
    returns
  • Checks if there is another copy already running
  • Current implementation uses ps
  • In case the application start fails, do nothing;
    wait for the next restart attempt
  • Possible extension: allow the service to spread
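
A rough sketch of the watchdog and the remote invocation described above. The function names, the `user`, `host`, and `watchdog_path` values, and the returned status strings are all hypothetical; the slides only state that SSH with pre-distributed keys is used and that the watchdog checks for a running copy with ps.

```python
import subprocess


def is_running(ps_output, app_name):
    """True if app_name appears as a command name in `ps -e -o comm=` output."""
    return any(line.strip() == app_name for line in ps_output.splitlines())


def watchdog(app_name, start_cmd, run=subprocess.run, popen=subprocess.Popen):
    """Start the application unless a copy is already running, then return."""
    ps = run(["ps", "-e", "-o", "comm="], capture_output=True, text=True)
    if is_running(ps.stdout, app_name):
        return "already-running"
    try:
        # Detach the child so the watchdog (and the SSH session) can return at once.
        popen(start_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
              start_new_session=True)
    except OSError:
        # Do nothing on failure; the peer will retry after its next timeout.
        return "start-failed"
    return "started"


def remote_restart_command(user, host, watchdog_path):
    """Command a peer could run after a Hello timeout (keys distributed beforehand)."""
    # BatchMode makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", watchdog_path]
```

For example, `subprocess.run(remote_restart_command("svc", "node1.example.org", "/opt/watchdog"))` would attempt one restart; the account name, host, and path here are placeholders.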

13
New node comes up
  • Waits for others to contact it
  • After timeout
  • Send JoinRequest to a super-node with the number
    of peers needed.
  • Supernode forwards this request to other nodes
  • AddRequest
  • Some node may ask new node to become its peer
  • Add to neighbourList and send AddACK
  • Hello
  • Can add to neighbourList if unsolicited Hello
    received
  • Beneficial in case of short temporary failures
  • After Request-timeout
  • Contact another super-node with another
    JoinRequest.
  • Timeout can be dynamically specified in
    JoinRequestACK.
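
The join behaviour of a new node might look roughly like the sketch below. The class `JoiningNode`, the tuple-shaped messages, and the round-robin choice of super-node are assumptions for illustration. The Die reply for surplus AddRequests follows the random-walk slide; the rest follows the bullets above.

```python
class JoiningNode:
    """Hypothetical model of a freshly started node looking for peers."""

    def __init__(self, super_nodes, peers_needed=3):
        self.super_nodes = list(super_nodes)
        self.peers_needed = peers_needed
        self.neighbours = []
        self._next_super = 0

    def missing_peers(self):
        return max(0, self.peers_needed - len(self.neighbours))

    def on_timeout(self):
        """Called after the initial wait and after each Request-timeout."""
        if self.missing_peers() == 0:
            return None  # enough peers, nothing to request
        # Assumption: try the super-nodes in turn, one per timeout.
        target = self.super_nodes[self._next_super % len(self.super_nodes)]
        self._next_super += 1
        return ("JoinRequest", target, self.missing_peers())

    def on_add_request(self, sender):
        if self.missing_peers() == 0 and sender not in self.neighbours:
            return ("Die", sender)  # already found enough neighbours
        if sender not in self.neighbours:
            self.neighbours.append(sender)
        return ("AddACK", sender)

    def on_hello(self, sender):
        # An unsolicited Hello is accepted as a new peering relationship.
        if sender not in self.neighbours:
            self.neighbours.append(sender)
```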

14
New node comes up: Random Walk
  • Request forwarded by super-node to 3 random nodes
    on behalf of new node
  • Each node forwards it to others
  • Decrease hop count by 1 each time
  • If hop count is 0, check if it can support more
    nodes
  • YES!
  • Send AddRequest to new node
  • Add to neighbourList on receiving AddACK.
  • NO!
  • Ignore the request
  • New node may already have found neighbours
  • Due to a duplicate JoinRequest or repair of a
    network partition
  • New node thus replies to AddRequest with Die
    packet.
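
One possible reading of the random walk, sketched with plain dictionaries. The function names, the message tuples, and the exact point at which the hop count is decremented are assumptions; the slides specify only the fan-out of 3, the per-hop decrement, and the capacity check at hop count 0.

```python
import random


def super_node_forward(known_nodes, new_node, hops=2, fanout=3, rng=random):
    """Super-node picks up to `fanout` random nodes to start the walk."""
    candidates = [n for n in known_nodes if n != new_node]
    chosen = rng.sample(candidates, min(fanout, len(candidates)))
    return [(target, "JoinRequest", new_node, hops) for target in chosen]


def handle_join_request(node, new_node, hops, rng=random):
    """node is a dict with 'neighbours' (list) and 'max_degree' (int)."""
    hops -= 1  # assumption: decrement on receipt, then decide
    if hops > 0 and node["neighbours"]:
        # Walk continues: hand the request to one random neighbour.
        return ("forward", rng.choice(node["neighbours"]), new_node, hops)
    # Walk ends here (or there is nobody to forward to): check capacity.
    if (len(node["neighbours"]) < node["max_degree"]
            and new_node not in node["neighbours"]):
        return ("AddRequest", new_node)  # add to neighbourList only on AddACK
    return ("ignore",)
```

Because several walks run in parallel, the new node can receive more AddRequests than it needs, which is why it answers the surplus ones with a Die packet.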

15
Shutdown
  • Critical to ensure that all nodes go down
  • 3-way protocol
  • Send kill to target node
  • Target node replies with die
  • Send dieACK to target node.
  • kill
  • Used when multiple copies are detected
  • Possibly to balance load
  • die
  • Reply to unsolicited Hello
  • No perfect solution in case of a network partition
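
The three-way exchange can be sketched as a tiny state machine. The `Node` class and its fields are hypothetical, and the sketch assumes that the sender of the kill drops the target from its neighbour list when it sees the die packet (so that it does not try to restart it); the slides do not spell that detail out.

```python
class Node:
    """Hypothetical node handling kill / die / dieACK packets."""

    def __init__(self, name, neighbours=()):
        self.name = name
        self.neighbours = set(neighbours)
        self.dying = False
        self.alive = True

    def handle(self, msg, sender):
        """Process one packet and return the reply packet, or None."""
        if msg == "kill":
            self.dying = True  # step 1 received: announce intent to exit
            return "die"
        if msg == "die":
            # Step 2 received: stop monitoring the sender, then confirm.
            self.neighbours.discard(sender)
            return "dieACK"
        if msg == "dieACK" and self.dying:
            self.alive = False  # step 3 received: safe to exit
        return None
```

If the dieACK is lost, for example during a network partition, the target in this sketch stays in the dying state, which matches the slide's remark that there is no perfect solution in that case.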

16
Global Shutdown
  • Secret killAll packet
  • Sent by an external program for complete system
    shutdown
  • Forwarded to all neighbours
  • Node does not die until it receives a killACK
    from everyone
  • Stops sending hellos immediately
  • No further restart attempts
  • Reply only to die, kill and killAll
  • May send unnecessary traffic
  • Eventually time out on seeing zero neighbours.
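
A lossless, single-process simulation of the killAll flood, to illustrate why every node ends up with a killACK from each neighbour. The function name, the graph representation, and the message tuples are assumptions; packet loss and the final timeout are deliberately left out.

```python
from collections import deque


def simulate_kill_all(graph, start):
    """graph maps node -> set of neighbours (undirected).

    Returns the nodes that exit because every neighbour sent a killACK.
    """
    shutting_down = set()
    acks = {n: set() for n in graph}
    queue = deque([("killAll", None, start)])  # None = the external program
    while queue:
        kind, src, dst = queue.popleft()
        if kind == "killAll":
            if src is not None:
                queue.append(("killACK", dst, src))  # always acknowledge
            if dst not in shutting_down:
                # First killAll seen: stop Hellos and forward to all neighbours.
                shutting_down.add(dst)
                for nb in graph[dst]:
                    queue.append(("killAll", dst, nb))
        else:  # killACK
            acks[dst].add(src)
    return {n for n in shutting_down if acks[n] >= graph[n]}
```

Each node forwards killAll to all of its neighbours, including the one it heard it from, which is one source of the unnecessary traffic mentioned above.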

17
Performance
  • Tested on 6 nodes in GradLab
  • Hello interval 5s
  • Hello timeout 22s
  • Wait before joinRequest 10s
  • joinRequest timeout 20s
  • Hop count 2
  • Initial degree request 3
  • Super-nodes 3
  • Preliminary tests on PlanetLab
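
The test parameters above, collected in one place as a sketch. The class name `RunConfig` and the field names are made up for the example; the values are the ones listed on the slide. The helper shows how many Hello intervals fit inside one timeout window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the GradLab test settings."""
    hello_interval_s: float = 5.0
    hello_timeout_s: float = 22.0
    join_wait_s: float = 10.0
    join_request_timeout_s: float = 20.0
    hop_count: int = 2
    initial_degree: int = 3
    super_nodes: int = 3

    def hellos_per_timeout_window(self):
        # Whole Hello intervals that fit before the timeout fires.
        return int(self.hello_timeout_s // self.hello_interval_s)
```

With a 5s interval and a 22s timeout, four consecutive Hellos have to go missing before a peer is declared down.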

18
Results
  • LAN
  • No timeouts or packet losses observed
  • No duplicate copies
  • killAll works perfectly
  • Re-start latency 22s
  • Decreases after a number of restarts
  • Join latency 15s
  • PlanetLab
  • Re-start latency 27s
  • Join latency 21s

20
Limitations
  • Security
  • The packets are not authenticated
  • Stray copies
  • After a killAll there may be stray copies
  • Harmless as they do not try to spread
  • But they prevent another copy from running
  • No new nodes
  • Node discovery
  • Why should they be idle in the first place?
  • What to do when the original nodes come back up?
  • Solution
  • Send regular updates to super-nodes
  • Extra servers can be killed easily

21
Parameter tweaking
  • Hop count for Random Walk
  • Connectivity
  • Min-degree to ensure connectivity
  • Max-degree to spread the failure probability
  • Timeouts
  • Request timeout
  • Depends on hop-count
  • Hello timeout
  • Different for WAN and LAN
  • Global timeout
  • In case of network partition
  • Loss of killACK packets

22
Conclusion
  • Maintaining High Availability does not always
    require central control
  • Achieving a global shutdown is problematic
  • Need to explore connectivity requirements to
    ensure a connected graph at all times.

23
Thank You!