Title: Design for Operations: Health Model and Instrumentation
1Design for Operations: Health Model and Instrumentation
Session Code ARC332
- Alexander (Sasha) Nosov sashan_at_microsoft.com
- Brian Reistad brianrei_at_microsoft.com
- Microsoft Corporation
2DSI Architecture (ARC230): Design for Operations
[Architecture diagram. Labels: Local Node Mgmt; System Level Management; Remote Node Mgmt; Management Tools; Dynamic System Services; Dev Tools; Managed System; Your System Definition; SDM Service; SDM Store; Windows; Managed Node; Settings (ARC333); Your Application; Health (ARC332); Tasks (ARC334); Hardware; Dynamic Data Center]
3Agenda
- Problem Domain
- Health Model
- Instrumentation Technologies
- Automating the Health Model
4Application Availability Problem
- Service application stops for an unclear reason
- User receives no warning or information on how to correct it
- Telephone rings in the Help Center
5What Your Customers Expect
- Business today depends upon the computing and network infrastructure
- Customers expect that their services or applications are secure, available and never lose data
- An actionable warning is received before failure
- The root cause of problems can be quickly determined
- Failure conditions don't impact their users
- Their environment can be managed with fewer people
6Why We Are Not There Today
- Applications are not designed with operations in mind
- Poor quality of instrumentation
- Limited structure and discovery
- No clear correlation between instrumentation, root cause and corrective actions
- Low signal/noise ratio
- Limited infrastructure
- Barrier to entry is high for developers
- Limited OS support to automate problem detection and resolution
- Limited feedback loop from support services
7Health Model
8What is the Health Model?
- Holistic view of the application's different potential problems
- How your service may fail from the end user's perspective
- State diagram that captures transitions to different levels of degradation
  - Stopped
  - Healthy
  - Service Totally Unavailable
  - Service Partially Unavailable (multiple of these)
- Instrumentation is driven by states and transitions
- User guidance on what to do in failure cases
- The model benefits
  - Help desk personnel
  - Admins and IT pros
  - Product developers
9What is a Health State?
- Definition
  - Description of the state (what's working, what's not)
  - Severity from the app perspective
- Detection
  - What are the different entry points into the state (e.g. events, thresholds, state changes, external checks)
  - What are the dependencies that are relevant for this state transition
- Diagnosis
  - How to determine the root cause of why we're in this state
- Recovery
  - What actions should be taken to return to operational state
- Verification
  - How to verify that the application is still in the bad state
  - How to verify that the application has recovered from the unhealthy state (after correction)
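The five parts of a health state listed above could be captured in a small data structure, with a state machine that moves between states as signals arrive. The following is only a minimal sketch: the class names, the field names and the `WRITE_FAILED` / `WRITE_SUCCEEDED` signals are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    INFORMATION = 0
    WARNING = 1
    ERROR = 2


@dataclass
class HealthState:
    # Definition
    name: str
    description: str
    severity: Severity
    # Detection: signals (events, thresholds, checks) that enter this state
    entry_signals: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    # Diagnosis and Recovery: guidance for operators
    diagnosis_steps: List[str] = field(default_factory=list)
    recovery_actions: List[str] = field(default_factory=list)
    # Verification: signals that confirm the state has been left
    recovery_signals: List[str] = field(default_factory=list)


class HealthModel:
    """Tracks the current state; transitions are driven by observed signals."""

    def __init__(self, healthy: HealthState) -> None:
        self.healthy = healthy
        self.current = healthy
        self.states: Dict[str, HealthState] = {healthy.name: healthy}

    def add_state(self, state: HealthState) -> None:
        self.states[state.name] = state

    def observe(self, signal: str) -> HealthState:
        if signal in self.current.recovery_signals:
            # Verified recovery: return to the healthy state.
            self.current = self.healthy
        else:
            # Enter the first state that lists this signal as an entry point.
            for state in self.states.values():
                if signal in state.entry_signals:
                    self.current = state
                    break
        return self.current


# Hypothetical example application with one degraded state.
healthy = HealthState("Healthy", "All functions available", Severity.INFORMATION)
degraded = HealthState(
    name="Service Partially Unavailable",
    description="Reads work, writes fail",
    severity=Severity.ERROR,
    entry_signals=["WRITE_FAILED"],
    dependencies=["database"],
    diagnosis_steps=["Check database connectivity"],
    recovery_actions=["Restart the write worker"],
    recovery_signals=["WRITE_SUCCEEDED"],
)
model = HealthModel(healthy)
model.add_state(degraded)
```

Keeping diagnosis and recovery guidance on the state itself means a monitoring tool can show it alongside the alert that reported the transition.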
10Terminal Server Example
Problem: the clients cannot connect to a pre-existing session
11Terminal Server Example (cont.)
Definition: The Terminal Server X failed to join the Session Directory. The clients cannot connect to pre-existing sessions in the Session Directory. Instead they are connected to new sessions.
Severity: Error
Detection: 12 different Error Events
- EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL
- EVENT_SESSIONDIRECTORY_NAME_INVALID
- EVENT_SESSIONDIRECTORY_UNAVAILABLE
- EVENT_FAIL_RPCBINDINGSETAUTHINFOEX
- . . .
Verification: Inspect Session Directory Server configuration (list of Terminal Servers)
Diagnosis: Different dependencies identified in different entry points (i.e. events). Check RPC, SD server running, correct configuration for SD Server, network connectivity, DNS resolution
Recovery: Refresh SD settings on the Terminal Server to force a rejoin
Verification: Information event reported on operation success: EVENT_JOIN_SESSIONDIRECTORY_SUCCESS
State: Healthy
State: Can't Talk to SD Server
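As a rough illustration, the two states and the events of the Terminal Server example can be written down as a transition function plus a diagnosis checklist. The event names are those shown on the slide (only four of the 12 error events are listed there); the function names and the check results are hypothetical stand-ins, not Terminal Server APIs.

```python
from typing import Callable, Dict, List

HEALTHY = "Healthy"
CANT_TALK_TO_SD = "Can't Talk to SD Server"

# Four of the 12 error events; the slide elides the remaining names.
ENTRY_EVENTS = frozenset({
    "EVENT_CALL_TSSDRPCSEVEROFFLINE_FAIL",
    "EVENT_SESSIONDIRECTORY_NAME_INVALID",
    "EVENT_SESSIONDIRECTORY_UNAVAILABLE",
    "EVENT_FAIL_RPCBINDINGSETAUTHINFOEX",
})
RECOVERY_EVENT = "EVENT_JOIN_SESSIONDIRECTORY_SUCCESS"


def next_state(current: str, event: str) -> str:
    """Return the health state after observing one event."""
    if event in ENTRY_EVENTS:
        return CANT_TALK_TO_SD
    if event == RECOVERY_EVENT:
        return HEALTHY
    return current  # unrelated events do not change the state


def diagnose(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each dependency check and return the names of those that fail."""
    return [name for name, check in checks.items() if not check()]


# Hypothetical check results; real checks would probe RPC, DNS and so on.
checks = {
    "RPC": lambda: True,
    "SD server running": lambda: True,
    "SD server configuration": lambda: False,
    "Network connectivity": lambda: True,
    "DNS resolution": lambda: True,
}

state = HEALTHY
for event in ["EVENT_SESSIONDIRECTORY_UNAVAILABLE", RECOVERY_EVENT]:
    state = next_state(state, event)
    if state == CANT_TALK_TO_SD:
        print("Failed dependencies:", diagnose(checks))
```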
12Implementing Health Model
13Instrumentation Technologies
14Instrumentation Technologies
- Events (Event Log)
  - Report occurrences of exceptional conditions, record changes
- Traces (ETW)
  - Trace execution of key operations
- Probes (WMI)
  - Expose complex internal state of applications
  - Expose methods to correct unhealthy states
- Perf Counters (Perflib)
  - Expose simple numeric values for performance monitoring and thresholding
- Watson messages (Corporate Error Reporting)
  - Centrally collect records of failures to provide feedback to product teams
15Consider Privacy
- Any instrumentation can pose a security or privacy risk
- Exposure of at-risk items must comply with your corporate privacy guidelines
At-risk items:
- Passwords, before or after encryption or hashing
- User or account names, or SIDs
- Security keys or access tokens
- User data (network, file system, etc.)
- Configuration information not immediately relevant to code execution (enterprise policies applied, other software patch level, etc.)
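One way to keep at-risk items out of instrumentation output is to scrub event payloads before they are written. This is a minimal sketch under assumptions: the key names in `AT_RISK_KEYS` and the `scrub` function are hypothetical, and a real list would come from your corporate privacy guidelines.

```python
from typing import Any, Dict

# Hypothetical field names for the at-risk categories listed above.
AT_RISK_KEYS = frozenset({
    "password", "password_hash", "user_name", "account_name",
    "sid", "security_key", "access_token",
})
REDACTED = "[REDACTED]"


def scrub(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload with at-risk values replaced."""
    clean: Dict[str, Any] = {}
    for key, value in payload.items():
        if key.lower() in AT_RISK_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = scrub(value)  # nested payloads are scrubbed too
        else:
            clean[key] = value
    return clean


event = {
    "operation": "logon",
    "user_name": "alice",
    "detail": {"access_token": "abc123", "attempt": 2},
}
safe = scrub(event)
```

Matching on key names only catches data that is labeled as sensitive; free-text fields that may embed user data need a separate review.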
16Events and Traces
17Event Log Enhancements
- Structured and schematized events
- Common viewing, configuration and querying of event logs and trace logs
- Scales to support application logs
- No need for proprietary logs
- Filtering and real-time notifications
- Forwarding and collection of events across multiple machines
- Firewall friendly, using SOAP protocol
- The event viewer leverages the new features
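To show what a structured, schematized event buys you compared with a free-text log line, here is a platform-neutral sketch. It is not the Windows Event Log API: the `Event` fields, the level names and the `query` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Iterable, List

# Assumed ordering of severity levels.
LEVELS = {"Information": 0, "Warning": 1, "Error": 2}


@dataclass
class Event:
    """An event with a fixed schema instead of a free-text message."""
    provider: str
    event_id: int
    level: str
    data: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # A stable serialization that can be forwarded to another machine.
        return json.dumps(asdict(self), sort_keys=True)


def query(events: Iterable[Event], provider: str, min_level: str) -> List[Event]:
    """Filter events by provider and minimum severity."""
    threshold = LEVELS[min_level]
    return [
        e for e in events
        if e.provider == provider and LEVELS[e.level] >= threshold
    ]


log = [
    Event("MyApp", 1000, "Information", {"operation": "start"}),
    Event("MyApp", 1001, "Error", {"server": "TS01"}),
    Event("OtherApp", 2000, "Error"),
]
errors = query(log, provider="MyApp", min_level="Warning")
```

Because every event carries the same named fields, filtering and forwarding work on field values and do not depend on parsing message text.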
18WMI Enhancements
- Definition: probes provide access to internal state
- Exposes existing properties and methods
- Needed for monitoring rules
- Manual access from command shell available
- Easily exposed using attribution scheme
- Leverages .NET reflection
- Schematized instrumentation catalog
- Identified by URI
- Existing WMI providers automatically published as probes
- Remote SOAP access to probes
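The idea of exposing probes through attribution and reflection can be imitated in Python with a decorator and the `inspect` module. This is an analogy only, not the WMI or .NET mechanism: the `probe` decorator, the `probe://` URI scheme and the `OrderService` class are all hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def probe(uri: str) -> Callable:
    """Mark a method as a probe published under the given URI."""
    def mark(func: Callable) -> Callable:
        func._probe_uri = uri  # the marker plays the role of an attribute
        return func
    return mark


class OrderService:
    def __init__(self) -> None:
        self._queue = ["order-1", "order-2"]

    @probe("probe://orderservice/queue_length")
    def queue_length(self) -> int:
        """Expose internal state."""
        return len(self._queue)

    @probe("probe://orderservice/flush_queue")
    def flush_queue(self) -> int:
        """Expose a corrective method; returns the number of items flushed."""
        flushed = len(self._queue)
        self._queue.clear()
        return flushed

    def internal_helper(self) -> None:
        """Not marked, so it is not published."""


def build_catalog(instance: Any) -> Dict[str, Callable[[], Any]]:
    """Discover marked methods by reflection and index them by URI."""
    catalog: Dict[str, Callable[[], Any]] = {}
    for _, method in inspect.getmembers(instance, predicate=inspect.ismethod):
        uri = getattr(method, "_probe_uri", None)
        if uri is not None:
            catalog[uri] = method
    return catalog


service = OrderService()
catalog = build_catalog(service)
```

A monitoring rule could then read `catalog["probe://orderservice/queue_length"]()` without knowing anything else about the class.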
19Implementing Health Model
20Automating Health Model
21Monitoring and Autorecovery
- Workflow
  - Detect problems before users call
  - Speed diagnosis of root cause
  - Automatic corrective actions where possible
- Components
  - Knowledge captured in Health Model
    - Problem detection, diagnostics and resolution data
  - Instrumented application
    - Validated by the Health Model
  - Monitoring infrastructure
    - MOM agent
    - Windows Monitoring Service
- Result: enterprise-ready application
  - Higher service availability
  - Higher admin efficiency / low cost
  - Higher user trust in your product
22Monitoring with Microsoft Operations Manager (MOM)
- MOM is Microsoft's enterprise management solution today
- Framework for implementing health model
- Enables health monitoring of distributed applications from one console
- Key features
  - Scalable architecture / network efficient
  - Automatic discovery / deployment to servers
  - Natively consumes many data types: events, performance data, custom application logs
  - Centralized view of a distributed system
  - Reporting
- Enables higher IT service quality at a reduced operational cost
23Delivering Knowledge with Management Packs
- Implementation of health model
- Built by product owners and experts
- Creates MOM Alerts
  - Indication of a detected condition that requires administrator investigation / action
  - Contain embedded knowledge to aid diagnosis
  - Appear in MOM console, email or pager notifications
- Basic Alerts from state transitions
- Advanced Alerts from scripts, e.g.
  - Synthetic transactions
  - Security and configuration verification
24Monitoring with Longhorn
- Monitoring capabilities built into the OS
  - Event filtering and correlation
  - Forwarding events and alerts
  - Correlation of events and data
  - Automated actions and notification
  - Rich set of rule types and libraries
- Common service enables monitoring of
  - Health, security, performance and configuration
- No extra deployment
  - Monitoring is part of the application's setup
  - Application manifest includes monitoring rules
  - Admin can customize default rules, including actions
- Your investments in MOM management packs will carry forward
25Monitor Application Health
- Build monitoring rules to correct the problems automatically
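A monitoring rule that corrects a problem automatically pairs a condition with an action. The sketch below is generic and does not use MOM or Longhorn monitoring APIs: the `Rule` class, the `evaluate` function, the `queue_length` counter and the threshold of 100 are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # inspects a counter sample
    action: Callable[[], None]                     # corrective action


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Run the action of every rule whose condition holds; return their names."""
    fired = []
    for rule in rules:
        if rule.condition(sample):
            rule.action()
            fired.append(rule.name)
    return fired


# Records corrective actions here in place of really restarting anything.
restarts: List[str] = []

rules = [
    Rule(
        name="queue backlog",
        condition=lambda s: s.get("queue_length", 0) > 100,
        action=lambda: restarts.append("worker"),
    ),
]

first = evaluate(rules, {"queue_length": 20})    # below threshold
second = evaluate(rules, {"queue_length": 250})  # above threshold
```

In practice a rule should also verify recovery after acting, as the health state's Verification step describes, and raise an alert if the action did not help.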
26Summary: What gets better
- Lower manual cost of problem detection, root cause analysis and resolution
- Higher service availability using health monitoring and automatic recovery
- How?
  - Health Model drives the quality and quantity of information
  - Instrumentation consistent across components
  - The instrumentation is discoverable before runtime
  - Admin controls the levels of diagnostics dynamically
  - Feedback to improve your product's next release
  - Enhanced management infrastructure in the OS
27Call to Action
- Visit booth 19 in the Microsoft Pavilion
  - Great opportunity to drill into technical details with the developers and program managers
- Exercise hands-on Labs 401, 406, 407, 408
  - See next slide for more info
- Ask The Experts
  - Tuesday, 7 pm to 9 pm in Hall G,H
- Design for operations
  - Build the model for your application
  - Have your technical support use and test it
  - Write and deploy Management Packs
- Get ready for Longhorn: install the PDC build and create your own manageable application
28Resources
- Longhorn documentation and whitepapers
  - www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx
  - Windows Management Instrumentation Preview
  - Windows Event Log Preview
  - Task Scheduler Service Preview
  - Event Forwarding Service Preview
  - Monitoring Service Preview
- HOL 401 Health Modeling and Instrumentation
- MOM training
  - HOL-406 Building MOM Management Packs to Manage .NET Applications
  - HOL-408 Monitoring SQL Server with the SQL Server management pack
  - HOL-407 Extending MOM using the Microsoft Connector Framework and SDK
- Web Sites
  - http://pdcbloggers.net
  - http://msdn.microsoft.com/pdc/
- Management Community Forum
  - http://www.microsoft.com/windowsserver2003/community/centers/management/default.mspx
29Questions?
- Don't forget to submit your feedback