Design for Operations: Health Model and Instrumentation


1
Design for Operations: Health Model and Instrumentation
Session Code: ARC332

Alexander (Sasha) Nosov sashan_at_microsoft.com
Brian Reistad brianrei_at_microsoft.com
Microsoft Corporation

2
DSI Architecture (ARC230): Design for Operations
[Architecture diagram. Labels: Local Node Mgmt; System Level Management; Remote Node Mgmt; Management Tools; Dynamic System Services; Dev Tools; Managed System; Your System Definition; SDM Service; SDM Store; Windows; Managed Node; Settings (ARC333); Your Application; Health (ARC332); Tasks (ARC334); Hardware; Dynamic Data Center]
3
Agenda

Problem Domain
Health Model
Instrumentation Technologies
Automating the Health Model

4
Application Availability Problem

Service application stops for unclear reason
User receives no warning and no information on how to
correct it
Telephone rings in the Help Center

5
What Your Customers Expect

Business today depends upon the computing and
network infrastructure
Customers expect that their services or
applications are secure, available and never lose
data
An actionable warning is received before failure
The root cause of problems can be quickly
determined
Failure conditions don't impact their users
Their environment can be managed with fewer people

6
Why We Are Not There Today

Applications are not designed with operations in
mind
Poor quality of instrumentation
Limited structure and discovery
No clear correlation between instrumentation,
root cause and corrective actions
Low signal/noise ratio
Limited infrastructure
Barrier to entry is high for developers
Limited OS support to automate problem detection
and resolution
Limited feedback loop from support services

7
Health Model
8
What is the Health Model?

Holistic view of the application's different potential problems
How your service may fail from the end user's perspective
State diagram that captures transitions to different levels of degradation
Stopped
Healthy
Service Totally Unavailable
Service Partially Unavailable (multiple of these)
Instrumentation is driven by states and transitions
User guidance on what to do in failure cases
The model benefits:
Help desk personnel
Admins and IT pros
Product developers
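
The state diagram described on this slide could be sketched roughly as follows. This is a minimal illustration, not part of the session material: the four states come from the slide, while the signal names, the transition table and the `HealthModel` class are hypothetical.

```python
from enum import Enum


class Health(Enum):
    STOPPED = "Stopped"
    HEALTHY = "Healthy"
    PARTIALLY_UNAVAILABLE = "Service Partially Unavailable"
    TOTALLY_UNAVAILABLE = "Service Totally Unavailable"


# Hypothetical transitions: (current state, observed signal) -> next state.
TRANSITIONS = {
    (Health.STOPPED, "service_started"): Health.HEALTHY,
    (Health.HEALTHY, "dependency_lost"): Health.PARTIALLY_UNAVAILABLE,
    (Health.HEALTHY, "service_crashed"): Health.TOTALLY_UNAVAILABLE,
    (Health.PARTIALLY_UNAVAILABLE, "dependency_restored"): Health.HEALTHY,
    (Health.PARTIALLY_UNAVAILABLE, "service_crashed"): Health.TOTALLY_UNAVAILABLE,
    (Health.TOTALLY_UNAVAILABLE, "service_started"): Health.HEALTHY,
}


class HealthModel:
    """Tracks the current health state and records every transition taken."""

    def __init__(self):
        self.state = Health.STOPPED
        self.history = []

    def observe(self, signal):
        """Apply a signal; signals with no matching transition leave the state unchanged."""
        next_state = TRANSITIONS.get((self.state, signal), self.state)
        if next_state is not self.state:
            self.history.append((self.state, signal, next_state))
            self.state = next_state
        return self.state
```

Because every transition is recorded, the history doubles as a list of the points where instrumentation would need to fire.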

9
What is a Health State?

Definition
Description of the state (what's working, what's not)
Severity from the app perspective
Detection
What are the different entry points into the state (e.g. events, thresholds, state changes, external checks)
What are the dependencies that are relevant for this state transition
Diagnosis
How to determine the root cause of why we're in this state
Recovery
What actions should be taken to return to operational state
Verification
How to verify that the application is still in the bad state
How to verify that the application has recovered from the unhealthy state (after correction)
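
One way to keep these sections together is a small record per health state. The sketch below is an assumption for illustration only: the `HealthState` class and its field names mirror the headings on this slide but are not an API from the session.

```python
from dataclasses import dataclass, field


@dataclass
class HealthState:
    """One health state, with a slot for each section named on the slide."""

    name: str
    description: str = ""
    severity: str = ""
    detection: list = field(default_factory=list)
    diagnosis: list = field(default_factory=list)
    recovery: list = field(default_factory=list)
    verification: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of sections that have not been filled in yet."""
        sections = ("description", "severity", "detection",
                    "diagnosis", "recovery", "verification")
        return [s for s in sections if not getattr(self, s)]
```

A check such as `missing_sections()` gives a quick review aid: a state with no recovery or verification entry is a gap in the model.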

10
Terminal Server example
Problem: the clients cannot connect to a pre-existing session
11
Terminal Server Example (cont.)
Definition: Terminal Server X failed to join the Session Directory. The clients cannot connect to pre-existing sessions in the Session Directory. Instead they are connected to new sessions.
Severity: Error
Detection: 12 different Error events
EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL
EVENT_SESSIONDIRECTORY_NAME_INVALID
EVENT_SESSIONDIRECTORY_UNAVAILABLE
EVENT_FAIL_RPCBINDINGSETAUTHINFOEX
. . .
Verification: Inspect Session Directory Server configuration (list of Terminal Servers)
Diagnosis: Different dependencies identified in different entry points (i.e. events). Check RPC, SD server running, correct configuration for SD Server, network connectivity, DNS resolution
Recovery: Refresh SD settings on Terminal Server to force rejoin
Verification: Information event reported on operation success: EVENT_JOIN_SESSIONDIRECTORY_SUCCESS
State: Healthy
State: Can't Talk to SD Server
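
The detection and verification entries for this example can be read as a tiny event-driven state tracker. The sketch below assumes events arrive as plain name strings; only the four error events shown on the slide are included (the slide mentions 12), and the function names are hypothetical.

```python
HEALTHY = "Healthy"
CANT_TALK_TO_SD = "Can't Talk to SD Server"

# Entry points into the unhealthy state. Only the names shown on the slide
# are listed; the remaining error events are elided there and stay elided here.
ERROR_EVENTS = {
    "EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL",
    "EVENT_SESSIONDIRECTORY_NAME_INVALID",
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "EVENT_FAIL_RPCBINDINGSETAUTHINFOEX",
}

# Information event that verifies recovery.
RECOVERY_EVENT = "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS"


def next_state(current, event_name):
    """Return the health state after seeing one event."""
    if event_name in ERROR_EVENTS:
        return CANT_TALK_TO_SD
    if event_name == RECOVERY_EVENT:
        return HEALTHY
    return current  # unrelated events do not change the state


def replay(event_names, start=HEALTHY):
    """Fold a sequence of event names into the resulting health state."""
    state = start
    for name in event_names:
        state = next_state(state, name)
    return state
```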
12
Implementing Health Model
13
Instrumentation Technologies
14
Instrumentation Technologies

Events (Event Log)
Report occurrences of exceptional conditions,
record changes
Traces (ETW)
Trace execution of key operations
Probes (WMI)
Expose complex internal state of applications
Expose methods to correct unhealthy states
Perf Counters (Perflib)
Expose simple numeric values for performance
monitoring and thresholding
Watson messages (Corporate Error Reporting)
Centrally collect records of failures to provide
feedback into product teams
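
As an illustration of thresholding on a performance counter, the sketch below flags a breach only after several consecutive samples exceed the limit, which is one common way to cut noise. It works on a plain list of numbers and does not use Perflib; the function name and the consecutive-sample rule are assumptions.

```python
def first_breach(samples, threshold, consecutive):
    """Return the index of the sample that completes the first run of
    `consecutive` values above `threshold`, or None if there is no such run."""
    run = 0
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == consecutive:
            return index
    return None
```

With a requirement of three consecutive samples, a single spike is ignored while a sustained rise is reported.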

15
Consider Privacy

Any instrumentation can pose a security or
privacy risk
Exposure of at-risk items must comply with your
corporate privacy guidelines
At-risk items:
Passwords, before or after encryption or hashing.
User or account names, or SIDs.
Security keys or access tokens.
User data (network, file system, etc.)
Configuration information not immediately
relevant to code execution (enterprise policies
applied, other software patch level, etc.)
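
A simple guard is to scrub at-risk fields before an event payload leaves the process. The sketch below is a deliberately coarse example, assuming payloads are nested dictionaries; the marker list and the `redact` function are hypothetical and would need to follow your own privacy guidelines.

```python
# Hypothetical markers for at-risk field names (matched as substrings).
AT_RISK_MARKERS = ("password", "user", "account", "sid", "key", "token")


def redact(payload):
    """Return a copy of `payload` with at-risk values replaced by a placeholder."""
    clean = {}
    for name, value in payload.items():
        if any(marker in name.lower() for marker in AT_RISK_MARKERS):
            clean[name] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[name] = redact(value)  # scrub nested payloads too
        else:
            clean[name] = value
    return clean
```

Substring matching errs on the side of hiding too much (a field called `inside` would also match `sid`), which is usually the safer failure mode for instrumentation.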

16
Events and Traces
17
Event Log Enhancements

Structured and Schematized events
Common Viewing, Configuration and Querying of
Event logs and Trace logs
Scales to support application logs
No need for proprietary logs
Filtering and real time notifications
Forwarding and collection of events across
multiple machines
Firewall friendly, using SOAP protocol
The event viewer leverages the new features
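
To make "structured events" and "filtering" concrete, here is a small sketch. It does not use the Windows Event Log API; the `Event` fields, the level names and the `select` function are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumed severity ordering, lowest to highest.
LEVELS = {"Information": 0, "Warning": 1, "Error": 2}


@dataclass
class Event:
    """A structured event: fixed, typed fields instead of a free-form message."""

    source: str
    event_id: int
    level: str
    data: dict = field(default_factory=dict)


def select(events, source=None, min_level="Information"):
    """Return events from `source` (any source if None) at or above `min_level`."""
    threshold = LEVELS[min_level]
    return [
        e for e in events
        if (source is None or e.source == source) and LEVELS[e.level] >= threshold
    ]
```

Because each field has a fixed name, a query such as "errors from one source" needs no text parsing.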

18
WMI Enhancements

Definition: probes provide access to internal state
Exposes existing properties and methods
Needed for monitoring rules
Manual access from command shell available
Easily exposed using attribution scheme
Leverages .NET reflection
Schematized instrumentation catalog
Identified by URI
Existing WMI providers automatically published
as probes
Remote SOAP access to probes
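
The attribution idea (mark a member, then discover it by reflection) can be shown with a Python analogue. This is not the WMI or .NET API: the `probe` decorator, the URI format and the example class are all hypothetical.

```python
import inspect


def probe(uri):
    """Mark a method as a probe and attach the URI that identifies it."""
    def mark(func):
        func._probe_uri = uri
        return func
    return mark


class SessionDirectoryClient:
    """Example application object with one exposed probe (hypothetical)."""

    def __init__(self):
        self._joined = False

    @probe("probe://example/session-directory/joined")
    def is_joined(self):
        return self._joined

    def internal_helper(self):
        return None  # not marked, so never published


def discover_probes(obj):
    """Reflect over `obj` and return {uri: bound method} for every marked method."""
    found = {}
    for _name, member in inspect.getmembers(obj, predicate=inspect.ismethod):
        uri = getattr(member, "_probe_uri", None)
        if uri is not None:
            found[uri] = member
    return found
```

The application author only adds the marker; the catalog of probes is built by inspection rather than by hand-written registration code.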

19
Implementing Health Model
20
Automating Health Model
21
Monitoring and Autorecovery

Workflow
Detect problems before users call
Speed diagnosis of root cause
Automatic corrective actions where possible
Components
Knowledge captured in Health Model
Problem detection, diagnostics and resolution
data
Instrumented application
Validated by the Health Model
Monitoring infrastructure
MOM agent
Windows Monitoring Service
Result: enterprise-ready application
Higher service availability
Higher admin efficiency / lower cost
Higher user trust in your product
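
The detect, diagnose, recover and verify workflow might be wired together as below. This is a sketch under stated assumptions: `monitor_once` and `FakeService` are hypothetical names, and a real rule would call into the instrumented application rather than a stand-in object.

```python
def monitor_once(detect, diagnose, recover, verify):
    """Run one detect -> diagnose -> recover -> verify pass and report the outcome."""
    if not detect():
        return "healthy"
    cause = diagnose()
    recover(cause)
    # Escalate to an administrator only when automatic recovery did not work.
    return "recovered" if verify() else "alert"


class FakeService:
    """Stand-in for an instrumented application (hypothetical)."""

    def __init__(self, joined, fixable=True):
        self.joined = joined
        self.fixable = fixable

    def has_problem(self):
        return not self.joined

    def find_cause(self):
        return "stale settings"

    def fix(self, cause):
        # Recovery succeeds only for faults this rule knows how to correct.
        if self.fixable:
            self.joined = True

    def is_healthy(self):
        return self.joined
```

Passing the four steps in as callables keeps the loop independent of any one application, so the same pass can serve every state in the health model.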

22
Monitoring with Microsoft Operations Manager (MOM)

MOM is Microsoft's enterprise management solution
today
Framework for implementing health model
Enables health monitoring of distributed
applications from one console
Key features
Scalable architecture / network efficient
Automatic discovery / deployment to servers
Natively consumes many data types: events,
performance data, custom application logs
Centralized view of a distributed system
Reporting
Enables higher IT service quality at a reduced
operational cost

23
Delivering Knowledge with Management Packs

Implementation of health model
Built by product owners and experts
Creates MOM Alerts
Indication of a detected condition that requires
administrator investigation / action
Contain embedded knowledge to aid diagnosis
Appear in MOM console, email or pager
notifications
Basic Alerts from state transitions
Advanced Alerts from scripts, e.g.
Synthetic transactions
Security and configuration verification
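
A synthetic transaction is a scripted request that exercises the application the way a user would and raises an alert when it fails or runs too slowly. The sketch below is generic Python, not MOM scripting; the function name and the shape of the returned alert are assumptions.

```python
import time


def run_synthetic_transaction(name, transaction, max_seconds):
    """Run `transaction`; return an alert dict on failure or slowness, else None."""
    started = time.monotonic()
    try:
        transaction()
    except Exception as exc:
        return {"alert": name, "reason": f"failed: {exc}"}
    elapsed = time.monotonic() - started
    if elapsed > max_seconds:
        return {"alert": name, "reason": f"slow: {elapsed:.3f}s"}
    return None
```

In practice `transaction` would be something like "log on and reconnect to an existing session", run on a schedule from the monitoring agent.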

24
Monitoring with Longhorn

Monitoring capabilities built into the OS
Event filtering and correlation
Forwarding events and alerts
Correlation of events and data
Automated actions and notification
Rich set of rule types and libraries
Common service enables monitoring of health, security, performance and configuration
No extra deployment
Monitoring is part of the application's setup
Application manifest includes monitoring rules
Admin can customize default rules, including
actions
Your investments in MOM management packs will
carry forward
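
The idea of default rules shipped with the application and customized by the admin can be sketched as a merge of two rule sets. The rule format here is invented for illustration and is not the Longhorn manifest schema; `effective_rules` is a hypothetical helper.

```python
def effective_rules(defaults, overrides):
    """Combine shipped default rules with admin overrides, rule by rule.

    Both arguments map a rule id to a dict of settings. Overrides replace
    individual settings; rules the admin did not touch keep their defaults.
    """
    merged = {}
    for rule_id, rule in defaults.items():
        merged[rule_id] = {**rule, **overrides.get(rule_id, {})}
    return merged
```

Merging per setting, instead of replacing whole rules, lets an admin change an action without restating the rest of the rule.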

25
Monitor Application Health

Build monitoring rules to correct the problems
automatically

26
Summary: What Gets Better

Lower manual cost of problem detection, root
cause analysis and resolution
Higher service availability using health
monitoring and automatic recovery
How?
Health Model drives the quality and quantity of
information
Instrumentation consistent across components
The instrumentation is discoverable before
runtime
Admin controls the levels of diagnostics
dynamically
Feedback to improve your product's next release
Enhanced management infrastructure in the OS

27
Call to Action

Visit booth 19 in the Microsoft Pavilion
Great opportunity to drill into technical
details with the developers and program managers
Try hands-on labs 401, 406, 407, 408
See next slide for more info
Ask The Experts
Tuesday, 7 pm to 9 pm, in Hall G, H
Design for operations
Build the model for your application
Have your technical support use and test it
Write and deploy Management Packs
Get ready for Longhorn - install PDC build and
create your own manageable application

28
Resources

Longhorn documentation and whitepapers
www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx
Windows Management Instrumentation Preview
Windows Event Log Preview
Task Scheduler Service Preview
Event Forwarding Service Preview
Monitoring Service Preview
HOL-401 Health Modeling and Instrumentation
MOM training
HOL-406 Building MOM Management Packs to Manage
.NET Applications
HOL-408 Monitoring SQL Server with the SQL Server
management pack
HOL-407 Extending MOM using the Microsoft
Connector Framework and SDK
Web Sites
http://pdcbloggers.net
http://msdn.microsoft.com/pdc/
Management Community Forum
http://www.microsoft.com/windowsserver2003/community/centers/management/default.mspx