Design for Operations: Health Model and Instrumentation

1
Design for Operations: Health Model and Instrumentation
Session Code: ARC332
  • Alexander (Sasha) Nosov sashan_at_microsoft.com
  • Brian Reistad brianrei_at_microsoft.com
  • Microsoft Corporation

2
DSI Architecture (ARC230): Design for Operations
[Diagram: DSI architecture. Labels: Local Node Mgmt, System Level Management, Remote Node Mgmt, Management Tools, Dynamic System Services, Dev Tools, Managed System, Your System Definition, SDM Service, SDM Store, Windows, Managed Node, Settings (ARC333), Your Application, Health (ARC332), Tasks (ARC334), Hardware, Dynamic Data Center]

3
Agenda
  • Problem Domain
  • Health Model
  • Instrumentation Technologies
  • Automating the Health Model

4
Application Availability Problem
  • Service application stops for unclear reason
  • User receives no warning or information on how to
    correct it
  • Telephone rings in the Help Center

5
What Your Customers Expect
  • Business today depends upon the computing and
    network infrastructure
  • Customers expect that their services or
    applications are secure, available and never lose
    data
  • An actionable warning is received before failure
  • The root cause of problems can be quickly
    determined
  • Failure conditions don't impact their users
  • Their environment can be managed with fewer people

6
Why We Are Not There Today
  • Applications are not designed with operations in
    mind
  • Poor quality of instrumentation
  • Limited structure and discovery
  • No clear correlation between instrumentation,
    root cause and corrective actions
  • Low signal/noise ratio
  • Limited infrastructure
  • Barrier to entry is high for developers
  • Limited OS support to automate problem detection
    and resolution
  • Limited feedback loop from support services

7
Health Model
8
What is the Health Model?
  • Holistic view of the application's different
    potential problems
  • How your service may fail from the end user's
    perspective
  • State diagram that captures transitions to
    different levels of degradation
      • Stopped
      • Healthy
      • Service Totally Unavailable
      • Service Partially Unavailable (there may be
        multiple of these)
  • Instrumentation is driven by states and
    transitions
  • User guidance on what to do in failure cases
  • The model benefits
      • Help desk personnel
      • Admins and IT pros
      • Product developers
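
The state diagram described above can be sketched as a small transition table. This is only an illustration, not the presenters' implementation: the four state names come from the slide, while the trigger names (service_started, dependency_lost, and so on) are hypothetical placeholders for whatever events or checks a real application would define.

```python
from enum import Enum


class Health(Enum):
    STOPPED = "Stopped"
    HEALTHY = "Healthy"
    PARTIALLY_UNAVAILABLE = "Service Partially Unavailable"
    TOTALLY_UNAVAILABLE = "Service Totally Unavailable"


# Allowed transitions: (current state, trigger) -> next state.
# Trigger names are hypothetical; a real model would list its own.
TRANSITIONS = {
    (Health.STOPPED, "service_started"): Health.HEALTHY,
    (Health.HEALTHY, "service_stopped"): Health.STOPPED,
    (Health.HEALTHY, "dependency_lost"): Health.PARTIALLY_UNAVAILABLE,
    (Health.PARTIALLY_UNAVAILABLE, "dependency_restored"): Health.HEALTHY,
    (Health.PARTIALLY_UNAVAILABLE, "fatal_error"): Health.TOTALLY_UNAVAILABLE,
    (Health.TOTALLY_UNAVAILABLE, "service_restarted"): Health.HEALTHY,
}


def next_state(current, trigger):
    """Return the state after `trigger`; unknown triggers leave the state unchanged."""
    return TRANSITIONS.get((current, trigger), current)


state = Health.HEALTHY
for trigger in ["dependency_lost", "unrelated_noise", "dependency_restored"]:
    state = next_state(state, trigger)
print(state.value)  # prints "Healthy"
```

Keeping the transitions in one table makes it easy to see which triggers need instrumentation: every key in the table is something the application must be able to report.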

9
What is a Health State?
  • Definition
      • Description of the state (what's working, what's
        not)
      • Severity from the app perspective
  • Detection
      • What are the different entry points into the
        state (e.g. events, thresholds, state changes,
        external checks)
      • What are the dependencies that are relevant for
        this state transition
  • Diagnosis
      • How to determine the root cause of why we're in
        this state
  • Recovery
      • What actions should be taken to return to
        operational state
  • Verification
      • How to verify that the application is still in the
        bad state
      • How to verify that the application has recovered
        from the unhealthy state (after correction)
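
One way to keep these five sections together is a simple record per health state. The sketch below is an assumption about how such a record might look, not a format defined by the presentation; the class name HealthState, its field names, and the "Database Unreachable" example are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HealthState:
    name: str
    description: str   # Definition: what is working, what is not
    severity: str      # Severity from the application's perspective
    detection: list = field(default_factory=list)     # entry points into the state
    diagnosis: list = field(default_factory=list)     # root-cause checks
    recovery: list = field(default_factory=list)      # actions to return to operation
    verification: list = field(default_factory=list)  # still bad? recovered?

    def missing_sections(self):
        """List the sections that are still empty, so gaps in the model are visible."""
        sections = ("detection", "diagnosis", "recovery", "verification")
        return [s for s in sections if not getattr(self, s)]


# Hypothetical, partially documented state.
db_down = HealthState(
    name="Database Unreachable",
    description="Reads and writes fail; cached pages still render.",
    severity="Error",
    detection=["connection-failure event", "query latency threshold"],
    diagnosis=["check network path", "check database service status"],
)
print(db_down.missing_sections())  # prints ['recovery', 'verification']
```

A check like missing_sections() gives reviewers a quick way to spot states that can be detected but have no documented recovery or verification step.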

10
Terminal Server example
Problem: the clients cannot connect to a pre-existing session
11
Terminal Server Example (cont.)
  • Definition: Terminal Server X failed to join the
    Session Directory. The clients cannot connect to
    pre-existing sessions in the Session Directory.
    Instead they are connected to new sessions.
  • Severity: Error
  • Detection: 12 different Error events
      • EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL
      • EVENT_SESSIONDIRECTORY_NAME_INVALID
      • EVENT_SESSIONDIRECTORY_UNAVAILABLE
      • EVENT_FAIL_RPCBINDINGSETAUTHINFOEX
      • . . .
  • Verification: Inspect Session Directory Server
    configuration (list of Terminal Servers)
  • Diagnosis: Different dependencies identified in
    different entry points (i.e. events). Check RPC,
    SD server running, correct configuration for SD
    Server, network connectivity, DNS resolution
  • Recovery: Refresh SD settings on Terminal Server
    to force rejoin
  • Verification: Information event reported on
    operation success,
    EVENT_JOIN_SESSIONDIRECTORY_SUCCESS
State: Healthy
State: Can't Talk to SD Server
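
The detection and verification entries above can be read as an event-driven rule. The following is a minimal sketch under stated assumptions: it uses only the four error events the slide names (the slide says there are 12; the rest are elided and are not guessed here), and the function track_state and the extra event name in the usage line are hypothetical.

```python
# Error events named on the slide; the remaining entry events are elided there.
ENTRY_EVENTS = {
    "EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL",
    "EVENT_SESSIONDIRECTORY_NAME_INVALID",
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "EVENT_FAIL_RPCBINDINGSETAUTHINFOEX",
}
RECOVERY_EVENT = "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS"

HEALTHY = "Healthy"
CANT_TALK = "Can't Talk to SD Server"


def track_state(events, state=HEALTHY):
    """Replay an event stream and return the sequence of distinct states visited."""
    history = [state]
    for event in events:
        if event in ENTRY_EVENTS:
            new_state = CANT_TALK
        elif event == RECOVERY_EVENT:
            new_state = HEALTHY
        else:
            new_state = state  # unrelated events do not change health
        if new_state != state:
            state = new_state
            history.append(state)
    return history


stream = [
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "EVENT_SOMETHING_UNRELATED",  # hypothetical noise event
    "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS",
]
print(track_state(stream))
# prints ['Healthy', "Can't Talk to SD Server", 'Healthy']
```
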
12
Implementing Health Model
13
Instrumentation Technologies
14
Instrumentation Technologies
  • Events (Event Log)
      • Report occurrences of exceptional conditions,
        record changes
  • Traces (ETW)
      • Trace execution of key operations
  • Probes (WMI)
      • Expose complex internal state of applications
      • Expose methods to correct unhealthy states
  • Perf Counters (Perflib)
      • Expose simple numeric values for performance
        monitoring and thresholding
  • Watson messages (Corporate Error Reporting)
      • Centrally collect records of failures to provide
        feedback to product teams
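
As an illustration of thresholding on a simple numeric counter, the sketch below raises an alert only after several consecutive samples exceed a limit. The consecutive-sample rule is an assumption chosen here to keep the signal/noise ratio up; it is not something the slides prescribe, and the function name breaches and the sample values are hypothetical.

```python
def breaches(samples, limit, consecutive=3):
    """Return the indexes at which `consecutive` samples in a row exceeded `limit`."""
    alerts, run = [], 0
    for i, value in enumerate(samples):
        run = run + 1 if value > limit else 0
        if run == consecutive:
            alerts.append(i)
    return alerts


cpu_percent = [40, 95, 50, 91, 96, 99, 97, 30]
print(breaches(cpu_percent, limit=90))  # prints [5]
```

The single spike at index 1 is ignored, and the sustained run is reported once rather than on every sample.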

15
Consider Privacy
  • Any instrumentation can pose a security or
    privacy risk
  • Exposure of at-risk items must comply with your
    corporate privacy guidelines
  • At-risk items
      • Passwords, before or after encryption or hashing
      • User or account names, or SIDs
      • Security keys or access tokens
      • User data (network, file system, etc.)
      • Configuration information not immediately
        relevant to code execution (enterprise policies
        applied, other software patch level, etc.)
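
A common way to act on this list is to scrub event payloads before they are written anywhere. The sketch below is one possible approach, not a complete privacy control: the field names in SENSITIVE_KEYS and the function scrub are hypothetical, and the SID pattern only covers the usual "S-1-..." textual form.

```python
import re

# Hypothetical field names treated as at-risk; adjust to your own guidelines.
SENSITIVE_KEYS = {"password", "user", "account", "sid", "token", "key"}
SID_PATTERN = re.compile(r"S-1-\d+(?:-\d+)+")


def scrub(event):
    """Return a copy of `event` with at-risk fields and embedded SIDs masked."""
    clean = {}
    for name, value in event.items():
        if name.lower() in SENSITIVE_KEYS:
            clean[name] = "[REDACTED]"
        elif isinstance(value, str):
            clean[name] = SID_PATTERN.sub("[SID]", value)
        else:
            clean[name] = value
    return clean


raw = {
    "password": "hunter2",
    "message": "Logon failed for S-1-5-21-1004336348-1177238915-682003330-512",
    "code": 5,
}
print(scrub(raw))
# prints {'password': '[REDACTED]', 'message': 'Logon failed for [SID]', 'code': 5}
```
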

16
Events and Traces
17
Event Log Enhancements
  • Structured and Schematized events
  • Common Viewing, Configuration and Querying of
    Event logs and Trace logs
  • Scales to support application logs
  • No need for proprietary logs
  • Filtering and real time notifications
  • Forwarding and collection of events across
    multiple machines
  • Firewall friendly, using SOAP protocol
  • The event viewer leverages the new features

18
WMI Enhancements
  • Definition: probes provide access to internal state
  • Exposes existing properties and methods
  • Needed for monitoring rules
  • Manual access from command shell available
  • Easily exposed using attribution scheme
  • Leverages .NET reflection
  • Schematized instrumentation catalog
  • Identified by URI
  • Existing WMI providers automatically published
    as probes
  • Remote SOAP access to probes
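
The attribute-plus-reflection idea can be illustrated with a Python analogy. This is not the WMI or .NET API: the probe decorator, the discover function, the OrderService class, and the urn:example URIs are all hypothetical, and Python's inspect module stands in for .NET reflection.

```python
import inspect


def probe(uri):
    """Mark a method as a probe identified by `uri` (hypothetical URI scheme)."""
    def mark(func):
        func.probe_uri = uri
        return func
    return mark


class OrderService:
    def __init__(self):
        self._queue = ["a", "b", "c"]

    @probe("urn:example:orders/queue-length")
    def queue_length(self):
        # Exposes internal state for monitoring rules.
        return len(self._queue)

    @probe("urn:example:orders/flush")
    def flush(self):
        # Exposes a corrective action.
        self._queue.clear()
        return True


def discover(obj):
    """Build a catalog {uri: bound method} by reflecting over the object's methods."""
    catalog = {}
    for _, method in inspect.getmembers(obj, inspect.ismethod):
        uri = getattr(method, "probe_uri", None)
        if uri:
            catalog[uri] = method
    return catalog


service = OrderService()
catalog = discover(service)
print(sorted(catalog))
print(catalog["urn:example:orders/queue-length"]())  # prints 3
```

The application code only adds a marker to existing properties and methods; the catalog is derived from those markers rather than maintained by hand.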

19
Implementing Health Model
20
Automating Health Model
21
Monitoring and Autorecovery
  • Workflow
  • Detect problems before users call
  • Speed diagnosis of root cause
  • Automatic corrective actions where possible
  • Components
  • Knowledge captured in Health Model
  • Problem detection, diagnostics and resolution
    data
  • Instrumented application
  • Validated by the Health Model
  • Monitoring infrastructure
  • MOM agent
  • Windows Monitoring Service
  • Result: enterprise-ready application
  • Higher service availability
  • Higher admin efficiency/low cost
  • Higher user trust in your product

22
Monitoring with Microsoft Operations Manager (MOM)
  • MOM is Microsoft's enterprise management solution
    today
  • Framework for implementing health model
  • Enables health monitoring of distributed
    applications from one console
  • Key features
  • Scalable architecture / network efficient
  • Automatic discovery / deployment to servers
  • Natively consumes many data types: events,
    performance data, custom application logs
  • Centralized view of a distributed system
  • Reporting
  • Enables higher IT service quality at a reduced
    operational cost

23
Delivering Knowledge with Management Packs
  • Implementation of health model
  • Built by product owners and experts
  • Creates MOM Alerts
  • Indication of a detected condition that requires
    administrator investigation / action
  • Contain embedded knowledge to aid diagnosis
  • Appear in MOM console, email or pager
    notifications
  • Basic Alerts from state transitions
  • Advanced Alerts from scripts, e.g.
  • Synthetic transactions
  • Security and configuration verification
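
A synthetic transaction, as mentioned above, exercises the service the way a user would and turns a failure or slow response into an alert. The sketch below is generic Python, not MOM scripting: the function synthetic_transaction, the alert dictionary shape, and the time limit are assumptions made for illustration.

```python
import time


def synthetic_transaction(action, max_seconds=2.0):
    """Run `action` as a test user would; return an alert dict, or None if healthy."""
    started = time.monotonic()
    try:
        action()
    except Exception as exc:
        return {"severity": "Error", "reason": f"transaction failed: {exc}"}
    elapsed = time.monotonic() - started
    if elapsed > max_seconds:
        return {"severity": "Warning", "reason": f"transaction took {elapsed:.1f}s"}
    return None


def broken_login():
    # Hypothetical stand-in for a real end-to-end request.
    raise ConnectionError("refused")


print(synthetic_transaction(broken_login))
# prints {'severity': 'Error', 'reason': 'transaction failed: refused'}
print(synthetic_transaction(lambda: None))  # prints None
```
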

24
Monitoring with Longhorn
  • Monitoring capabilities built into the OS
  • Event filtering and correlation
  • Forwarding events and alerts
  • Correlation of events and data
  • Automated actions and notification
  • Rich set of rule types and libraries
  • Common service enables monitoring of
  • Health, security, performance and configuration
  • No extra deployment
  • Monitoring is part of the application's setup
  • Application manifest includes monitoring rules
  • Admin can customize default rules, including
    actions
  • Your investments in MOM management packs will
    carry forward

25
Monitor Application Health
  • Build monitoring rules to correct the problems
    automatically
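
A rule that corrects a problem automatically follows the detect, recover, verify pattern from the health model. The sketch below shows that control flow only; it is not the Longhorn or MOM rule format, and run_rule, its return values, and the in-memory service dictionary are hypothetical.

```python
def run_rule(detect, recover, verify, max_attempts=2):
    """Apply one monitoring rule: detect a bad state, attempt recovery, verify the result."""
    if not detect():
        return "healthy"
    for _ in range(max_attempts):
        recover()
        if verify():
            return "recovered"
    return "escalate"  # automatic recovery failed; alert an administrator


# Hypothetical service state used to exercise the rule.
service = {"joined": False}


def detect():
    return not service["joined"]


def recover():
    service["joined"] = True


def verify():
    return service["joined"]


print(run_rule(detect, recover, verify))  # prints "recovered"
print(run_rule(detect, recover, verify))  # prints "healthy"
```

Capping the number of attempts keeps a rule from looping on a problem it cannot fix; the "escalate" outcome is where a human alert would be raised.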

26
Summary: What Gets Better
  • Lower manual cost of problem detection, root
    cause analysis and resolution
  • Higher service availability using health
    monitoring and automatic recovery
  • How?
  • Health Model drives the quality and quantity of
    information
  • Instrumentation consistent across components
  • The instrumentation is discoverable before
    runtime
  • Admin controls the levels of diagnostics
    dynamically
  • Feedback to improve your product's next release
  • Enhanced management infrastructure in the OS

27
Call to Action
  • Visit booth 19 in the Microsoft Pavilion
  • Great opportunity to drill into technical
    details with the developers and program managers
  • Exercise hands-on Labs 401, 406, 407, 408
  • See next slide for more info
  • Ask The Experts
  • Tuesday 7 pm to 9 pm in Hall G, H
  • Design for operations
  • Build the model for your application
  • Have your technical support use and test it
  • Write and deploy Management Packs
  • Get ready for Longhorn - install PDC build and
    create your own manageable application

28
Resources
  • Longhorn documentation and whitepapers
  • www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx
  • Windows Management Instrumentation Preview
  • Windows Event Log Preview
  • Task Scheduler Service Preview
  • Event Forwarding Service Preview
  • Monitoring Service Preview
  • HOL-401 Health Modeling and Instrumentation
  • MOM training
  • HOL-406 Building MOM Management Packs to Manage
    .NET Applications
  • HOL-408 Monitoring SQL Server with the SQL Server
    management pack
  • HOL-407 Extending MOM using the Microsoft
    Connector Framework and SDK
  • Web Sites
  • http://pdcbloggers.net
  • http://msdn.microsoft.com/pdc/
  • Management Community Forum
  • http://www.microsoft.com/windowsserver2003/community/centers/management/default.mspx

29
Questions?
  • Don't forget to submit your feedback