Management Pack University

1
Getting Manageability Right
  • Management Pack University
  • Nishtha Soni

2
Agenda
  • A manageable application is the first step to
    creating a useful management pack
  • Failure Mode Analysis is designed to increase
    manageability

3
Who are our customers?
  • Anxious IT Managers Don't Sleep Well
  • InformationWeek - March 12, 2007
  • Two out of three IT managers say that they are
    kept awake at night worrying about work
  • 75 percent admit ongoing anxiety about
    application performance concerns
  • 25 percent of the respondents reported suffering
    physical symptoms, including nausea, headaches,
    migraines, panic attacks, heart arrhythmia, and
    muscle twitches. And nightmares.
  • Terry Beehr, a Central Michigan University
    professor of psychology:
  • "If IT goes down, a lot of other departments
    can't do their work."
  • "IT is 24-by-7, plus that's combined with heavy
    workloads and work that needs to be done
    quickly."

4
Operations Roles
  • Tier 1
  • Where do they work?
  • Centralized help desk workers
  • Most problem reports go here first
  • IT skill level: trained
  • What do they do?
  • Spot trends
  • Look at all-up IT health
  • First response to problem alerts
  • Decide when to escalate
  • Fix the common stuff
  • Low privileges, typically
  • May specialize
  • DBAs, network, apps
  • Managers are responsible for all-up IT health
    reporting
  • Tier 2
  • Where do they work?
  • Cubicles and offices
  • Dedicated application teams
  • ERP, LOB
  • Dedicated technology teams
  • E.g. security, network, AD
  • IT skill level: specialist
  • What do they do?
  • Provision servers, get applications ready to run
  • Power/cooling typically separate
  • Get things working when Tier 1 cannot
  • Diagnose new problems
  • Systematize new remedies
  • Manage monitoring
  • Make sure outages can be detected and explained
  • Long-term trends and capacity planning

5
Problems our customers have
  • After setup, product health is their top concern
  • Instrumentation is often inadequate
  • Tier 3-focused instrumentation is prominent
  • Noisy monitoring is as bad as no monitoring
  • Alerts raised when there is nothing to fix
    increase cost of ownership
  • Ideal case is to not have to staff monitoring
    roles
  • Practical reality with MSFT products is that this
    isn't feasible
  • Differentiating maintenance from break/fix
  • If maintenance is not performed (admin),
    break/fix occurs
  • We invest in administration interfaces, so why
    not operations interfaces?
  • Ops Manager is the operator's interface

6
Getting Instrumentation Right
  • Instrumentation is the largest gating factor to
    achieving proactive management.

7
Getting Manageability right
  • The Manageability Maturity Model can help us
    determine the right investment levels for the
    current state of the product.
  • The model focuses on product instrumentation as
    a factor in gauging maturity
  • Six levels
  • Level 0: Most instrumentation (events/counters)
    is undocumented.
  • Level 1: Manual diagnostic info is published for
    all events (MMD health model).
  • Level 2: Instrumentation is symptomatic, and the
    management pack is rudimentary as a result. Many
    false alarms and elevated customer costs result.
  • Level 3: Instrumentation approach changes to
    proactive and cause-based. Knowledge articles
    focus on how to fix outages, not on diagnosis.
  • Level 4: Root-cause issue detection is fully
    supported by instrumentation. Tasks are added to
    the MP to help streamline restore/repair.
  • Level 5: Instrumentation supports predictive
    management, capacity planning and efficient data
    collection with low privilege levels. Customer
    costs are minimal; TCO is best in class.
  • You can't get level 5 results with level 1
    instrumentation

8
Thinking about monitoring
  • Most instrumentation seen in the wild today is
  • Added by the developer for debugging or code-path
    tracing of problems
  • Doesn't necessarily tell me if a service or
    application is working well
  • Reports a symptom, and is rarely sufficient on
    its own to make a diagnosis or break/fix decision
  • Most monitoring today is
  • Implemented by the operations people who need to
    manage the IT asset
  • Rarely a part of the up-front system or
    application design effort
  • A best guess on the part of the person or team
    who designed monitoring rules based on what
    instrumentation is visible after setting up an
    application in a test environment.
  • If an event manifest for the product is
    available, it is helpful, but without deep
    knowledge of what each event signifies it is not
    always useful in a proactive way.
  • Failure Mode Analysis helps drive improvements in
    both areas

9
What to measure
Deployment
  • Questions: Did it deploy? Ready to run?
  • Observations: File counts, smoke test OK
  • Diagnosing: What went wrong? How is it configured?
Verification
  • Questions: Is everything in the right place? Right
    version?
  • Observations: In compliance? Patch level good?
  • Diagnosing: What is missing? End-to-end trace?
Running
  • Questions: SLA being met? Resources OK? Performing
    well?
  • Observations: Can it be used to do work? Is it
    responsive?
  • Diagnosing: What is the internal state right now?

10
Life cycle states and roles
Deployment
  • Questions: Tier 2
  • Observations: Tier 2
  • Diagnosing: Tier 2 smoke test, Tier 3 deep dive
Verification
  • Questions: Tier 2, Tier 3
  • Observations: Tier 2, app owner, admin
  • Diagnosing: Tier 2
Running
  • Questions: Tier 2, Tier 3
  • Observations: Tier 1
  • Diagnosing: Tier 1

11
Three important questions
  • Is my application healthy?
  • Use health measures to show there are no customer
    impacting issues
  • Look at redundant measures that detect elements
    that have failed
  • Look at the balance of work across the system
  • Are critical dependencies able to perform in
    concert without major disruption to users?
  • Are the users of my application happy?
  • How fast do your pages load from request to
    responsiveness?
  • Look at page abandonment rates relative to
    overall traffic
  • Can an end-to-end interaction happen without
    interruption?
  • Consider artificial transactions as a weak proxy
    for these
  • How well do the parts of my application work
    together?
  • Look at subsystem measures that signal imbalances
  • Instrument for detecting problems where they
    occur
  • Be able to follow a call from end to end if
    necessary
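
One way to read "look at the balance of work across the system" is as a ratio between the busiest node and the average node. The sketch below is only an illustration of that idea: the node names, the request counts, and the 1.5 threshold are all assumptions, not values from the presentation.

```python
from statistics import mean


def work_imbalance(requests_per_node):
    """Return busiest-node load divided by average load (1.0 means balanced)."""
    loads = list(requests_per_node.values())
    average = mean(loads)
    return max(loads) / average if average > 0 else 1.0


def is_imbalanced(requests_per_node, threshold=1.5):
    """Flag an imbalance when one node carries far more than its share.

    The threshold is a hypothetical tuning value.
    """
    return work_imbalance(requests_per_node) > threshold


# Hypothetical request counts for a three-node web farm.
balanced = {"web1": 100, "web2": 110, "web3": 90}
skewed = {"web1": 280, "web2": 10, "web3": 10}

balanced_ratio = work_imbalance(balanced)
skewed_ratio = work_imbalance(skewed)
```

A measure like this signals that something is wrong with the distribution of work, but it is still a symptom; it does not say why one node is carrying the load.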

12
Failure Mode Analysis
  • Moving up the scale on the manageability maturity
    model

13
Failure Modes: what are they?
  • Failure modes drive support incidents
  • Planning helps lower support costs for your
    product
  • Planning for failure lets you optimize what you
    instrument and monitor to detect
  • No service is free of failure modes
  • Planning for failures makes products more
    resilient
  • Examples
  • Hardware monitoring
  • Fan can fail
  • Disk can fail in a RAID 5 array, or be full
  • Configuration monitoring
  • Configuration file is not in correct location
  • Access Control List to remote host can be
    changed, causing a failure
  • Critical bug fixes don't get applied
  • Capacity
  • Database can become full, causing a failure
  • Too much data in system making queries slow
  • Too much traffic overusing resources
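
As a small illustration of the capacity examples above, the following Python sketch checks how full a disk is and maps the result to a health state. The 80 and 95 percent thresholds and the state names are assumptions chosen for the example, not values from the presentation.

```python
import shutil


def disk_fill_ratio(path="."):
    """Return the used fraction (0.0 to 1.0) of the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def capacity_state(fill_ratio, warn_at=0.80, critical_at=0.95):
    """Map a fill ratio to a health state; thresholds are hypothetical."""
    if fill_ratio >= critical_at:
        return "critical"
    if fill_ratio >= warn_at:
        return "warning"
    return "healthy"


current_ratio = disk_fill_ratio(".")
current_state = capacity_state(current_ratio)
```

Raising the warning before the disk is actually full is what turns this from a break/fix signal into a maintenance signal.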

14
Failure-Mode analysis
  • Definition
  • An up-front design effort for a monitoring plan
    that is similar to threat modeling
  • Produces
  • Monitoring plan
  • Used to drive management pack technical design
  • Coverage matrix: failure mode coverage
  • Capacity plan: what to collect to enable
    trending and capacity planning
  • Instrumentation plan
  • Design artifacts used to write code that helps
    detect failures
  • Derives from coverage matrix and capacity plan
    (union)
  • Typically shows up in dev specs and QA tests that
    use the coverage matrix plan
  • Health Façade
  • Describes health at the end-user and subsystem
    level
  • Used to understand impact of specific types of
    failures on each subsystem
  • Guides mitigation and recovery documentation
  • Helps drive escalations from Tier 1 to Tier 2
  • Driven by
  • Monitoring champion: ensures that monitoring is
    part of the design process
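
The relationship between these artifacts can be sketched as plain data. In the Python example below, the coverage matrix maps each failure mode to the signals that detect it, and the instrumentation plan is derived as the union of those signals with the capacity plan. The failure modes, signal names, and the `instrumentation_plan` helper are hypothetical.

```python
# Coverage matrix: failure mode -> signals that detect it (hypothetical names).
coverage_matrix = {
    "config file missing": {"event: config_missing"},
    "database offline": {"event: db_unavailable", "probe: db_ping"},
    "database full": {"counter: db_free_space"},
}

# Capacity plan: what to collect for trending (hypothetical names).
capacity_plan = {"counter: db_free_space", "counter: requests_per_sec"}


def instrumentation_plan(coverage, capacity):
    """Union of every detection signal and every capacity measure."""
    plan = set(capacity)
    for detections in coverage.values():
        plan |= detections
    return sorted(plan)


plan = instrumentation_plan(coverage_matrix, capacity_plan)
for signal in plan:
    print(signal)
```

A signal that serves both detection and capacity planning, such as the free-space counter here, appears only once in the plan.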

15
Failure-Mode Analysis Steps
  • Process
  • Step 1: List what can go wrong and cause harm to
    the service
  • Identify all failure modes: list predictable ways
    to fail
  • Understand if an item is a way to fail or an
    effect of a failure
  • Prioritize according to impact on service health,
    probability, cost
  • Include physical, software, and network
    components
  • Step 2: Identify a detection strategy for each
    failure mode
  • Each high-impact item needs at least two
    detection methods
  • Detection can be a measure or event, or can
    require watchdog (code)
  • Step 3: Add these detection elements to your code
    effort
  • Some are probes, some are monitors (automate as
    much as possible)
  • Step 4: Plan your management pack content
  • Result is the basis of instrumentation and
    monitoring plans
  • Failure modes are root causes
  • Detecting root causes directly is optimal
  • Inferring root causes via symptoms requires
    correlation
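
Steps 1 and 2 can be sketched as a small worksheet in code. The example below ranks failure modes and flags high-impact ones that have fewer than two detection methods. The 1 to 5 scales, the impact-times-probability score, and the sample entries are assumptions made for illustration; cost is left out of the score for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    name: str
    impact: int       # 1 (minor) to 5 (service down); hypothetical scale
    probability: int  # 1 (rare) to 5 (frequent); hypothetical scale
    detections: list = field(default_factory=list)

    @property
    def priority(self):
        # Simple illustrative score; a real plan would also weigh cost.
        return self.impact * self.probability


def prioritize(modes):
    """Step 1: order failure modes from highest to lowest priority."""
    return sorted(modes, key=lambda mode: mode.priority, reverse=True)


def under_covered(modes, high_impact=4, required=2):
    """Step 2: high-impact modes with fewer than two detection methods."""
    return [
        mode.name
        for mode in modes
        if mode.impact >= high_impact and len(mode.detections) < required
    ]


modes = [
    FailureMode("database offline", 5, 2, ["event", "probe"]),
    FailureMode("config file missing", 5, 1, ["event"]),
    FailureMode("slow queries", 3, 4, ["counter"]),
]

ranked = [mode.name for mode in prioritize(modes)]
gaps = under_covered(modes)
```

Here `gaps` names the items that still need a second detection method before the instrumentation plan is complete.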

16
Getting it wrong
  • Instrumentation that is useful for tracing is not
    always right for finding and pinpointing issues

"The [name hidden] Server management pack increased
my costs by 40%. We had to hire more operators
just to close all of the non-actionable alerts."
(Attribution censored)

17
The code-path problem
  • Example: Data access layer
  • N-tier application with front end, middle tier
    and DAL
  • What does error 100 mean?

Public iData getBusinessData( parameters )
{
  try {
    mConfig.open(mConfigPath);
    connectToDB(mConfig.ConnectString);
    data = conn.getDatafromDb(Sproc, parameters);
    return data;
  }
  catch (exception e) {
    WriteEventLogEvent(100, E_ExceptionInDal);
    throw;
  }
}

18
Code-path problem explodes
Front End
  try { call_middle_Tier(params); }
  catch (exception e) {
    WriteEventLogEvent(101, E_ExceptionWeb);
    throw;
  }
Middle Tier
  try { call_DAL(params); }
  catch (exception e) {
    WriteEventLogEvent(102, E_);
    throw;
  }
[Slide graphic: boxes for Front End, Middle Tier, and DAL, with two "BAM" callouts]
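
To make the explosion concrete, here is a minimal Python sketch of the same catch, log, and rethrow pattern (the slides use C#-style pseudocode). The `write_event` helper is a hypothetical stand-in for WriteEventLogEvent, the message for event 102 is a placeholder because the original is truncated, and the root cause is simulated.

```python
# Each tier catches, logs its own generic event, and rethrows.
events = []


def write_event(event_id, message):
    """Hypothetical stand-in for WriteEventLogEvent: record the event."""
    events.append((event_id, message))


def dal():
    try:
        # Simulated single root cause: the config file is missing.
        raise FileNotFoundError("config file missing")
    except Exception:
        write_event(100, "E_ExceptionInDal")
        raise


def middle_tier():
    try:
        dal()
    except Exception:
        write_event(102, "exception in middle tier")  # placeholder text
        raise


def front_end():
    try:
        middle_tier()
    except Exception:
        write_event(101, "E_ExceptionWeb")
        raise


try:
    front_end()
except FileNotFoundError:
    pass

event_ids = [event_id for event_id, _ in events]
print(event_ids)  # [100, 102, 101]
```

One root cause produces three events, and none of them says that the config file is missing. An operator sees three alerts and still has to diagnose the problem by hand.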

19
Failure Modes
  • Failure Modes are Predictable Causes
  • Configuration
  • Config file missing
  • Config file permissions
  • Config file corrupt, no defaults
  • Connect string incorrect
  • Database
  • DB availability: database is offline
  • DB permissions: log-in denied
  • DB permissions: execute permission on sproc
    denied
  • DB data error
  • Environment
  • Network: DNS lookup fails
  • Network: ACL issues (looks a lot like DB
    availability)
  • Instrument these
  • Unique event per predicted root-cause problem
  • Diagnostic event log entry for anything else
  • Context is key. Know which source has the
    context to diagnose the problem
  • These go into the management pack as trouble
    signals

Public iData getBusinessData( parameters )
{
  try {
    mConfig.open(mConfigPath);
    connectToDB(mConfig.ConnectString);
    data = conn.getDatafromDb(Sproc, parameters);
    return data;
  }
  catch (exception e) {
    WriteEventLogEvent(100, E_ExceptionInDal);
    throw;
  }
}
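
A hedged Python sketch of what "unique event per root cause" could look like for the configuration failure modes listed on this slide. The event IDs, the JSON config format, the `connect_string` key, and the `load_connect_string` helper are all assumptions made for illustration; only event 100 as the generic catch-all comes from the slide code.

```python
import json
import os
import tempfile

# Hypothetical event IDs, one per predicted root cause.
EVT_CONFIG_MISSING = 110
EVT_CONFIG_PERMISSIONS = 111
EVT_CONFIG_CORRUPT = 112
EVT_CONNECT_STRING_MISSING = 113
EVT_UNCLASSIFIED = 100  # diagnostic catch-all for anything not predicted


def load_connect_string(config_path, write_event):
    """Read the connect string, logging a distinct event per root cause."""
    try:
        with open(config_path, encoding="utf-8") as handle:
            config = json.load(handle)
        return config["connect_string"]
    except FileNotFoundError:
        write_event(EVT_CONFIG_MISSING, f"config file missing: {config_path}")
        raise
    except PermissionError:
        write_event(EVT_CONFIG_PERMISSIONS, f"cannot read: {config_path}")
        raise
    except json.JSONDecodeError as exc:
        write_event(EVT_CONFIG_CORRUPT, f"config file corrupt: {exc}")
        raise
    except KeyError:
        write_event(EVT_CONNECT_STRING_MISSING, "connect string not set")
        raise
    except Exception as exc:
        write_event(EVT_UNCLASSIFIED, f"unexpected DAL error: {exc!r}")
        raise


logged = []


def record(event_id, message):
    logged.append(event_id)


# Simulate three failure modes with temporary files.
with tempfile.TemporaryDirectory() as folder:
    cases = {"missing.json": None, "corrupt.json": "{not json", "empty.json": "{}"}
    for name, content in cases.items():
        path = os.path.join(folder, name)
        if content is not None:
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(content)
        try:
            load_connect_string(path, record)
        except Exception:
            pass  # the caller still sees the original exception

print(logged)  # [110, 112, 113]
```

Each failure now arrives with its own event ID, so a management pack rule can map it straight to a fix instead of asking an operator to interpret event 100.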

20
Failure-mode analysis: key message
  • Debug instrumentation is high cost to customers
  • Better than NO instrumentation
  • MMD health model is critical to help debug at low
    maturity levels
  • Contextual failure alerting is predictive (level
    4 and up)
  • Management pack alerts should be actionable
  • "It might mean A or B" is not a starting point
  • There are only 5 essential operator actions
    (locked-down environment)
  • Reboot Host
  • Stop/Start service
  • Run maintenance task
  • Add more or reduce overall capacity (e.g. shut it
    off)
  • Change configuration
  • Instrumentation should identify cause and map to
    action
  • Manual diagnosis drives additional expense
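
The "identify cause and map to action" idea can be sketched as a lookup from a detected cause to one of the five operator actions. The specific cause-to-action pairings below are illustrative assumptions, not recommendations from the presentation.

```python
from enum import Enum


class OperatorAction(Enum):
    REBOOT_HOST = "Reboot host"
    RESTART_SERVICE = "Stop/start service"
    RUN_MAINTENANCE_TASK = "Run maintenance task"
    CHANGE_CAPACITY = "Add more or reduce overall capacity"
    CHANGE_CONFIGURATION = "Change configuration"


# Hypothetical mapping from detected root cause to operator action.
CAUSE_TO_ACTION = {
    "config file missing": OperatorAction.CHANGE_CONFIGURATION,
    "connect string incorrect": OperatorAction.CHANGE_CONFIGURATION,
    "database full": OperatorAction.RUN_MAINTENANCE_TASK,
    "too much traffic": OperatorAction.CHANGE_CAPACITY,
}


def recommended_action(cause):
    """Return the mapped action, or None when manual diagnosis is needed."""
    return CAUSE_TO_ACTION.get(cause)


traffic_action = recommended_action("too much traffic")
unknown_action = recommended_action("exception in DAL")
```

A cause that maps to `None` is exactly the case the slide warns about: the alert is not actionable and someone has to diagnose it manually.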

21
Learning more
  • Templates
  • Failure Mode Analysis Template
  • White papers
  • Introduction to Operations
  • Thinking Operationally: why we measure