MyOps - PowerPoint PPT Presentation

About This Presentation
Title:

MyOps

Description:

Direct automation of common operations. Indirect through remote contacts and incentives ... Automation. Automatic correction of common bootstrap problems ...

Slides: 24
Provided by: Goog519

Transcript and Presenter's Notes


1
MyOps
  • An Operational Framework for PlanetLab Deployments

2
Outline
  • Objective of MyOps
  • Current status
  • Future ideas
  • Questions at any time

3
Example of Feedback
4
Objective: Close Operational Cycle
  • System - Provides service (slice)
  • Monitoring - Feedback from running system
  • Operator - Interpret feedback into tasks
  • Management - Control running system

5
Challenges: Break-down
  • System may not deliver service
  • Monitoring may not observe useful metrics
  • Operator may not know
      • how to interpret observations
      • how to control the system
      • what the service goals are
  • Management may not control system

6
Requirements for Operational Systems
  • Satisfy Minimal Conditions
      • Physical Integrity
      • Interconnectivity
      • Controllable
      • Provide a Service
  • Two requirements
      • Reliably reach the final condition
      • When failures occur, repair or report automatically
  • Two approaches in MyOps
      • Precise bootstrap stages (not discussed)
      • Operational monitoring & management in platform

7
System: PlanetLab Slices
8
Monitoring Types
  • Open-loop monitoring
      • Identify the unknown
      • More information, fine-grained
  • Operational monitoring (closed-loop)
      • Correctness
      • Less information, coarse-grained
      • Actionable

9
Management Types
  • Open-loop management
      • Bootstrap/Deploy from the ground up
      • Inefficient, coarse-grained
      • No feedback
  • Operational management (closed-loop)
      • Tweak the system to correct behavior
      • More efficient, fine-grained

10
Example
  • Observe: Node is off-line
  • Control: Attempt to power on
  • Observe: Node is on-line but failed to boot
  • Observe: Failed-to-boot error
  • Control: Create ticket & send email to local contact
  • Time passes
  • Control: Disable slice creation
  • Observe: Local contact responds
  • Observe: Node is powered on and running
  • Control: Re-enable slice creation
  • Control: Close ticket
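
The observe/control sequence above can be sketched as a small rule function. This is only an illustration under stated assumptions: the `Node` class, the observation strings, the action names, and the `waited_too_long` flag are all hypothetical and are not taken from the MyOps code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical per-node state kept by the operator loop."""
    name: str
    ticket_open: bool = False
    slice_creation_enabled: bool = True
    log: List[str] = field(default_factory=list)


def control(node: Node, observation: str, waited_too_long: bool = False) -> List[str]:
    """Translate one observation into control actions (names are illustrative)."""
    actions: List[str] = []
    if observation == "offline":
        actions.append("power_on")
    elif observation == "failed_to_boot":
        if not node.ticket_open:
            # First sighting of the failure: involve the local contact.
            node.ticket_open = True
            actions += ["create_ticket", "email_local_contact"]
        elif waited_too_long and node.slice_creation_enabled:
            # No response after some time: apply the incentive.
            node.slice_creation_enabled = False
            actions.append("disable_slice_creation")
    elif observation == "running":
        # Node recovered: undo the incentive and close the loop.
        if not node.slice_creation_enabled:
            node.slice_creation_enabled = True
            actions.append("enable_slice_creation")
        if node.ticket_open:
            node.ticket_open = False
            actions.append("close_ticket")
    node.log.extend(actions)
    return actions
```

Feeding the observations from the slide in order (`"offline"`, `"failed_to_boot"`, `"failed_to_boot"` with `waited_too_long=True`, then `"running"`) walks through the same control steps, ending with the ticket closed and slice creation re-enabled.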

11
History of PlanetLab Operations
  • Open-loop Monitoring with Open-loop Management
      • Collect fine-grained statistics using CoMon
      • Act with coarse-grained operations (e.g. Reinstall)
      • Manual bridge between the two
  • Moving towards Closed-loop Operations
      • Collect targeted metrics
      • Take directed, problem-specific actions
      • Automate actions based on policy

12
PlanetLab Operations
  • Close the monitor/management cycle
  • Direct automation of common operations
  • Indirect through remote contacts and incentives

13
MyOps Architecture
  • Collection from Node
  • Translated by policy to Automated action

14
MyOps Architecture
  • Collection from Node
  • Send notice to Local contact to take action

15
MyOps Architecture
  • When there is no response
  • Indirect influence with incentives

16
Collection
  • Operational monitoring: specific targets, such as
      • Boot status, Filesystem status
      • DNS - internal and external
      • RPMs
      • System services, etc.
  • Periodic collection
      • Coarse-grained collection at a human-timescale
      • Time-series of events and status
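
One way to picture the collection step is a list of periodic, coarse-grained samples that is reduced to a time-series of status-change events. The sketch below assumes a hypothetical `StatusRecord` layout and check names such as `"boot"` and `"dns"`; none of these come from MyOps itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StatusRecord:
    """One periodic sample for one node (field names are assumptions)."""
    timestamp: int            # seconds since some epoch
    node: str
    checks: Dict[str, str]    # e.g. {"boot": "ok", "dns": "bad"}


def status_changes(history: List[StatusRecord]) -> List[Tuple[int, str, str, str]]:
    """Keep only the samples where a check's value differs from the previous one."""
    events = []
    previous: Dict[Tuple[str, str], str] = {}
    for rec in sorted(history, key=lambda r: r.timestamp):
        for check, value in rec.checks.items():
            key = (rec.node, check)
            if previous.get(key) != value:
                events.append((rec.timestamp, rec.node, check, value))
                previous[key] = value
    return events
```

With hourly samples, a node whose boot check goes from `"ok"` to `"failed"` produces one event for the initial state of each check and one more at the time of the change, which is the kind of compact series a policy can reason over.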

17
Policy
  • Constraints over a time-series of events
  • To satisfy a constraint
      • Automated action
      • Send notice
      • Apply incentive
  • Policy defines
      • Preferred status of system
      • Frequency of actions
      • Magnitude of incentives
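
A policy of this kind can be sketched as a preferred status, a minimum interval between actions, and a table of responses, evaluated over the event series. The `Policy` fields, the status strings, and the fallback to `"send_notice"` are assumptions made for illustration, not the MyOps policy format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, status)


@dataclass
class Policy:
    preferred_status: str          # the status the system should be in
    action_interval: int           # minimum seconds between two actions
    responses: Dict[str, str]      # status -> action name (hypothetical)


def violation_start(events: List[Event], preferred: str) -> Optional[int]:
    """Timestamp at which the current run of non-preferred status began."""
    start = None
    for ts, status in events:
        if status == preferred:
            start = None
        elif start is None:
            start = ts
    return start


def next_action(policy: Policy, events: List[Event], now: int,
                last_action_at: Optional[int] = None) -> Optional[str]:
    """Return an action if the constraint is violated and the rate limit allows it."""
    if violation_start(events, policy.preferred_status) is None:
        return None  # constraint satisfied (or no data yet)
    if last_action_at is not None and now - last_action_at < policy.action_interval:
        return None  # acted too recently
    current_status = events[-1][1]
    return policy.responses.get(current_status, "send_notice")
```

Keeping the frequency limit inside the policy means the same event series can be re-evaluated on every collection pass without repeating an action more often than the policy allows.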

18
Automation
  • Automatic correction of common bootstrap problems
      • Communication errors with MyPLC
      • Corrupt filesystem repair
      • Retry when state is unknown
      • PCU Reboot
      • Reinstall
  • Automation Notices
      • Bad disk
      • Minimal hardware
      • Bad DNS
      • Bad node configuration
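
The split between problems that are corrected automatically and problems that only produce a notice can be sketched as a routing table. The problem labels and action names below are hypothetical, and only the pairings the slide states directly (filesystem repair, retry on unknown state) are mapped to actions.

```python
from typing import Tuple

# Problems paired with an automated correction on the slide (labels are assumptions).
AUTOMATED_ACTIONS = {
    "corrupt_filesystem": "repair_filesystem",
    "unknown_state": "retry",
}

# Problems that need a person on site, so they produce a notice instead.
NOTICE_PROBLEMS = {
    "bad_disk",
    "minimal_hardware",
    "bad_dns",
    "bad_node_configuration",
}


def route(problem: str) -> Tuple[str, str]:
    """Decide whether a diagnosed problem is automated, notified, or unhandled."""
    if problem in AUTOMATED_ACTIONS:
        return ("automate", AUTOMATED_ACTIONS[problem])
    if problem in NOTICE_PROBLEMS:
        return ("notify", problem)
    return ("unhandled", problem)
```

For example, `route("bad_dns")` yields a notice because a DNS misconfiguration usually has to be fixed by the site, while `route("unknown_state")` yields an automated retry.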

19
Notices & Incentives
  • Notices are indirect paths to node management
      • Node down / online / specific problem (e.g. DNS, disk)
      • Site down / online
      • Privilege reduced / restored
      • PCU errors
  • The incentives on MyPLC
      • Sites 10 slices
      • Disable slice creation
      • Disable running slices
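
The incentives can be read as an escalation ladder that is applied while a site stays unresponsive and lifted once it recovers, with a "privilege reduced / restored" notice at each change. The sketch below assumes time-based escalation with made-up thresholds of 7 and 14 days; the actual thresholds and mechanism are not given in the slides.

```python
from typing import List

# (days down, incentive) pairs; the thresholds are hypothetical.
INCENTIVE_LADDER = [
    (7, "disable_slice_creation"),
    (14, "disable_running_slices"),
]


def incentives_for(days_down: int) -> List[str]:
    """Incentives in force after a node has been down for `days_down` days."""
    return [name for threshold, name in INCENTIVE_LADDER if days_down >= threshold]


def transition_notices(before: List[str], after: List[str]) -> List[str]:
    """Notices to send when the set of active incentives changes."""
    notices = [f"privilege reduced: {name}" for name in after if name not in before]
    notices += [f"privilege restored: {name}" for name in before if name not in after]
    return notices
```

Comparing `incentives_for` on consecutive evaluations and passing both results to `transition_notices` gives the site a notice for every privilege that is taken away or given back.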

20
Validation of Notices & Incentives
[Figure; markers: A, B, C, D, E; annotations: Notice Bug, Fix, Kernel Bug, Fix, Fix2]
21
Time to Restore Down Node (all issues)
22
Future Ideas
  • Generalize Configuration
  • Collect from multiple sources
  • Expose policy
  • Act on multiple targets
  • Self-monitoring
  • Positive Incentives
  • Special access to services
  • Additional resources (Slices, Bandwidth, CPU, etc)

23
Time to Reply (when there is a reply)