Title: MOM Essentials 5 Advanced Configuration and Administration
1MOM Essentials 5 Advanced Configuration and
Administration
- Gordon McKenna MOM MVP
- Inframon Ltd.
2Management Pack Tuning
3Agenda
- Management Pack tuning
- - Management Pack Architectural Overview
- - Management Pack Overview Demo
4Management Pack Architectural Overview: Key Terms
- Data sources
- - Events: Windows, application, WMI, service change, SNMP traps, timed events, missing events, UNIX syslogs
- - Performance data: used for graphs, reports, and to set thresholds
- Alerts
- - MOM's indication of a particular issue; what operators see first
- - Based on events, performance thresholds or script output
- Response
- - Reaction to an alert (auto-resolve, send e-mail, page, run script)
- Management Pack (MP)
- - Set of Processing Rules to monitor applications
- - Supporting views and reports
5MOM Rule: Unit of Instruction/Policy
- Event Rules
- - Collection rules
- - Filtering rules
- - Missing event rules
- - Consolidation rules
- - Duplicate Alert Suppression
- Performance Rules
- - Measuring
- - Threshold
- Alert Rules
6Management Packs
- Management Pack imported via MOM Server
- Discovery finds computers in need of a given Management Pack
- MOM deploys appropriate Management Packs
- - No need to touch managed nodes to install Management Packs
- Rules: implement all MOM monitoring behavior
- - Watch for indicators of problems
- - Verify key elements of functionality
- Management Packs provide a definition of server health
7Management Pack Overview
8Basic rule of thumb when first deploying MPs
- DO NOT deploy all your required Management Packs at the same time
- - Install one, tune properly, then do the next
- Be brutal, switch off rules that you do not need
- - Use Alert Overrides wherever possible (rather than just disabling)
- Document your changes
- - XML, Excel, third party
- Put some form of change control in place to prevent random changes
- - Make sure that only the people who need MOM Admin or author rights get them.
9Other Dos and Don'ts
- Do read each MP guide before you deploy
- Don't just disable a rule because it is giving you an error
- Do use the community to search for solutions to issues
- Don't let Alert Creep set in; be proactive with your tuning
- Do use reports to help you stay on top of alerts
- Do have regular reviews of the environment
- DON'T PANIC!
10Common Alerts After MP Deployment
- ADMP
- - Replication taking too long: need to modify the script for the worst case scenario
- - Script Failures (access denied): usually permission related; make sure Local System has enough rights; if unsure, create an AAA according to the MP guide.
- - Script Failures (failed to bind LDAP): if AD is behind a firewall then the FQDN will be missing from the computer table.
- - MOMLatencyMonitor not created: make sure your first AD Agent is on the Infrastructure Master
11Common Alerts After MP Deployment
- Exchange
- - Mail flow scripts fail: permission related; check out "Top 3 issues affecting Exchange MP" in the product documentation on the MS MOM site
- - Disk Write Latency perf ctr: some re-configuration required (see Exchange MP guide)
- - General perf ctrs: there will be a lot of baselining to do due to the nature of the application
- - OWA\OMA\AS: disable the rule group if this does not apply
12Key Processes for alert tuning strategy
- What to do with Alerts
- - Who owns the MOM alert
- - Who gets notified
- - Who takes action
- - How will it be recorded/closed
- What to do with Reports
- - How to publish them
- - Who reviews them
13Using Reports To Evaluate Alert Flow
14Tips for Getting Application Owners Onboard
- Sell each MP to the technology owner, e.g. SQL MP to the SQL Team
- Demonstrate how MOM can help them
- Scope them their own console
- Make sure they explain fully how their environment works
- Make them part of the deployment process
- Make them the MP owner
- Get them to own MP change documentation
- Put a process in place so they can be notified when changes take place to MPs
- Give them a sandpit environment.
15Methods for Maintaining MPs
- Create an Excel spreadsheet of all rules, documenting changes. Can be time consuming
- Use MOM resource kit tools to convert the .AKM file to .XML, then use an XML editor or MPDiff (res kit). Fiddly
- Develop your own solution (web based\.NET)
- Use a third-party tool like Silect software
- http://www.silect.com
16Application Engineering Standards
- Get App developers to think Manageability
- Design for Operations
- Write a good set of standards
- Introduce new technologies if need be, like AVICode for .NET
- www.avicode.com
- Provide value to them with performance and availability reporting
- Make them understand how hard your job is :-)
17Ideas for Standards
- Registry Keys: info on app, override criteria, thresholds
- Event Logs
- Performance Counters
- WMI: Process and Thread IDs, app data
- Status Monitoring: Health Modelling
- Scripting
- .NET (AVICode)
- Other Methods: MOM API, C
18Maintaining MPs
19Advanced Administration Techniques
20MOM Operations: Guiding principles
- Database: manage alert, event and perf data volumes
- Management Servers: monitor the health of the management server queues
- Agent administration: watch the Pending Actions computer group closely and watch for agents not heartbeating (this principle is covered in the appendix)
22Database Volume: View alert and event counts daily
- select count(*) from eventall -- all events in the db
- select count(*) from alertview -- all alerts
- -- Date/Time the System Center DTS job last ran
- select timedtslastran from reportingsettings
- -- see appendix for perf data volume query
Tip: The DTS job for the System Center database logs a date/time when it completes successfully. The grooming job (MOMXGroomByDays) will not groom out any data newer than this date
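The tip implies two conditions before a row can be groomed: it must be older than the configured retention, and it must not be newer than the last successful DTS run. A minimal Python sketch of that rule follows. The names `is_groomable` and `retention_days` are hypothetical, and the sketch assumes both conditions are plain timestamp comparisons; the real MOMXGroomByDays job may apply further checks.

```python
from datetime import datetime, timedelta

def is_groomable(row_time, now, retention_days, dts_last_ran):
    """Return True if a row may be groomed under the two conditions in the tip.

    Hypothetical helper: it models the slide's tip, not the actual grooming job.
    """
    older_than_retention = row_time < now - timedelta(days=retention_days)
    # Grooming never removes data newer than the last successful DTS run.
    already_transferred = row_time <= dts_last_ran
    return older_than_retention and already_transferred

now = datetime(2006, 2, 21, 10, 35)
dts_last_ran = datetime(2006, 2, 10, 1, 0)  # DTS has not succeeded for ~11 days
old_row = datetime(2006, 2, 1)              # past retention and already transferred
stuck_row = datetime(2006, 2, 12)           # past retention but newer than the DTS run

print(is_groomable(old_row, now, 4, dts_last_ran))
print(is_groomable(stuck_row, now, 4, dts_last_ran))
```

The second row illustrates why a failing DTS job makes the database grow even though the retention period has passed.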
23Database Grooming: Alert resolution criteria
Tip: Perfmon, event and alert data is held in the OnePoint database for the number of days set under "Groom data older than the following number of days". The clock does not start ticking for alerts until the alert is resolved
24Database Grooming: Auto alert resolution
- Relevant code from stored procedure AlertUpdateNewToResolved.
- SET ResolutionState = 255, TimeResolved = @LastModified, LastModified = @LastModified, LastModifiedBy = 'AutoResolved', ResolvedBy = 'AutoResolved'
- WHERE ResolutionState = 0 AND
- TimeOfLastEvent < @dGroomDate AND (ProblemState <> 3 OR ProblemState IS NULL)
Tip: Only alerts that are in a New resolution state AND do NOT have an active problem state AND whose last event is older than the number of days you specify (Global Settings/Database Grooming) get auto-resolved
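The three conditions in the tip can be checked against sample alerts before changing the grooming setting. The sketch below mirrors the WHERE clause quoted on this slide in Python. The function name is hypothetical, and reading ProblemState 3 as "active" is inferred from the tip rather than from any schema documentation.

```python
from datetime import datetime

RESOLUTION_NEW = 0   # ResolutionState compared against in the WHERE clause
PROBLEM_ACTIVE = 3   # ProblemState excluded by the WHERE clause (assumed "active")

def would_auto_resolve(resolution_state, time_of_last_event, problem_state, groom_date):
    """Mirror the WHERE clause: New state, stale last event, no active problem state."""
    return (
        resolution_state == RESOLUTION_NEW
        and time_of_last_event < groom_date
        and (problem_state is None or problem_state != PROBLEM_ACTIVE)
    )

groom_date = datetime(2006, 2, 17)
stale = datetime(2006, 2, 1)
print(would_auto_resolve(0, stale, None, groom_date))   # New, stale, no problem state
print(would_auto_resolve(0, stale, 3, groom_date))      # active problem state blocks it
print(would_auto_resolve(85, stale, None, groom_date))  # 85 is an arbitrary non-New state
```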
25Management Servers: Incoming and outgoing queues
[Diagram: MOM agents, management server queues, OnePoint database]
- Outgoing queue: blocked event 22061, RTN event 22062
- Incoming queue: blocked event 21268, RTN event 21269
No disruption of service is caused when the outgoing queue fills up. If the incoming queue is full, agents begin caching data locally until they can find a management server to write to
26Management Servers: Incoming and outgoing queues
- During extreme alert, event, or perf data storms, the incoming queue may fill up. The management servers communicate this to their agents. Under this condition, agents will try hopping to their failover management server. They log event 21249 when they do so and 21250 when they come back
- During this time period, failed heartbeat alerts (21284, 21209) are inaccurate and you get a lot of "agent stopped sending required config requests" events (22085)
- This condition also defeats your agent load balancing strategies for management servers
- A soon-to-be-public QFE is available to throttle how soon agents try their failover management server
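When reviewing an exported list of event numbers, it can help to have the queue-related IDs from these two slides in one lookup. The following Python sketch is illustrative only: the descriptions paraphrase the slides, the function names are hypothetical, and it assumes the "RTN" event means the blocked condition has cleared.

```python
# Event IDs and meanings as listed on the queue slides above.
QUEUE_EVENTS = {
    22061: "management server outgoing queue blocked",
    22062: "management server outgoing queue RTN",
    21268: "management server incoming queue blocked",
    21269: "management server incoming queue RTN",
    21249: "agent hopped to its failover management server",
    21250: "agent came back from its failover management server",
    21284: "failed heartbeat alert (inaccurate while the incoming queue is full)",
    21209: "failed heartbeat alert (inaccurate while the incoming queue is full)",
    22085: "agent stopped sending required config requests",
}

def summarize(event_ids):
    """Count how often each queue-related event ID occurs; other IDs are ignored."""
    counts = {}
    for event_id in event_ids:
        if event_id in QUEUE_EVENTS:
            counts[event_id] = counts.get(event_id, 0) + 1
    return counts

def incoming_queue_still_blocked(event_ids):
    """Walk events in time order; True if the last 21268 has no later 21269."""
    blocked = False
    for event_id in event_ids:
        if event_id == 21268:
            blocked = True
        elif event_id == 21269:
            blocked = False
    return blocked

stream = [21268, 21249, 22085, 22085, 21269, 21250]
for event_id, count in sorted(summarize(stream).items()):
    print(event_id, count, QUEUE_EVENTS[event_id])
print(incoming_queue_still_blocked(stream))
```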
27Server Queue Full Pointers
Perf counters to watch under the MOM Server Perfmon object:
- Db Perf Insert Simple Count
- Db Alert Insert Simple Count
- Db Event Insert Simple Count
- Queue Space Percent Used
28Management Servers: Incoming and outgoing queues
- The incoming rate of perf data from measuring or collection rules is fairly constant. Rules that run scripts that submit perf data can cause large spikes. Scripts to watch out for if you have tens of thousands of mailboxes on your Exchange servers:
- - Exchange 2003 - Collect Mailbox Statistics
- - Exchange 2003 - Collect number of mailboxes per server
- - Exchange 2003 - Collect Message Tracking Log Statistics
- These scripts have parameters which you can use to tune how many mailboxes they report on. Some customers use them to report on only one mailbox, as they still deliver very useful global counters for the information stores
29Management Servers: Incoming and outgoing queues
- Tip: If the management server incoming queue is filling up there are three options
- - Increase the size of the temporary storage on the management server: Global Settings/Management Servers/Temporary Storage (easiest)
- - Reduce the volume of perfmon, event, and alert data by tuning rules (best)
- - Increase throughput of the SQL server by increasing disk I/O, memory, etc. (usually least practical)
30Appendix: Agent Admin: Pending actions computer group
- The only legitimate reasons for computers to be pending action are
- - Computer rules have discovered a new computer or more computers have been added to manualmc.txt
- - Computer rules have changed to be less inclusive or computers have been removed from manualmc.txt
Tip: When agents that match computer discovery rules or manualmc.txt are pending uninstallation, watch out for domain name changes and failed computer discoveries
31Appendix: Agent Admin: Pending actions - domain changes
- If the agent computer changes domains
- - You will get failed heartbeat and failed discovery alerts on the old computer account
- - The agent will be in a new domain, still trying to contact its management server. It will log this:
MOM Alert Name: The MOM server received invalid or corrupt network data which may indicate a security problem.
Description follows: The MOM Server rejected configuration or data package from computer domain\computername. Package failed to pass server security verification.
Event Number: 21289
Source: Microsoft Operations Manager
32Agent Admin: Pending actions - failed discoveries
- Give MOM agents a good healthy uninstall delay: Global Settings/Management Servers/Properties/Automatic Management/Uninstall Delay. This will ensure that discovery must fail more than once for a server to be marked for uninstall
- Disjoint AD namespace: is the primary DNS suffix of your servers not the same as the DNS name of their AD domain? MOM discovery may have a few minor issues in this scenario. Mutual authentication with agents may not work either. We have almost 1800 agents in 12 domains. MOM has trouble discovering about 16 servers. Most of those are in two domains, but many other servers in these two domains are discovered successfully. We don't know what causes this, but the problem can be remedied.
33Agent Admin: Pending actions - failed discoveries
- The manualmc.txt file can contain computer names in several different formats
- - Fully Qualified Domain Name (FQDN)
- - NetBIOS name
- - Domain\ComputerName
- Usually either NetBIOS name or Domain\ComputerName works, but I have found that you get fewer discovery errors using NetBIOS name. Where discovery fails using NetBIOS name you should try the Domain\ComputerName format. Sometimes it works where NetBIOS fails IF you add the FQDN to the hosts file (yes, the hosts file)!
- Computer discovery rules work somewhat better but are no guarantee of successful discovery
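To audit which of the three name formats an existing manualmc.txt uses, and to produce the fallback order suggested above (NetBIOS first, then Domain\ComputerName), a small helper can be used. This is a sketch under simple assumptions: a backslash marks Domain\ComputerName, a dot marks an FQDN, and anything else is treated as a NetBIOS name. Both function names are hypothetical.

```python
def name_format(entry):
    """Classify a computer name as 'domain\\computer', 'fqdn' or 'netbios'."""
    entry = entry.strip()
    if "\\" in entry:
        return "domain\\computer"
    if "." in entry:
        return "fqdn"
    return "netbios"

def discovery_candidates(netbios_name, domain):
    """Return the formats to try, in the order recommended on the slide."""
    return [netbios_name, domain + "\\" + netbios_name]

for entry in ["SQL01", "W2K3LAB\\SQL01", "momsrv.w2k3lab.com"]:
    print(entry, "->", name_format(entry))
print(discovery_candidates("SQL01", "W2K3LAB"))
```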
34Agent Admin: Agent management hints
- Install all the nodes of a cluster on the same management server or you will not get accurate heartbeat alerts
- Occasionally turn on the rule Microsoft Operations Manager\Operations Manager 2005\Agents on all MOM roles\Agent communication failure troubleshooting events. These events can be viewed from a public view called "Agent communication failure troubleshooting events" under Microsoft Operations Manager/Operations Manager 2005/Agent Configuration and Connectivity. You might be surprised at some of the warnings and errors your agents are logging
- If you turn agent proxying on and you submit an alert on behalf of a computer that does not have a MOM agent in the same management group as the box that submitted the alert, then the computer you submitted the alert for will show up in the Unmanaged Computers group
35Agent Admin: Agent management hints
- If a push installation (automated from the management server) fails on an agent, try installing the agent manually using momagent.msi. When the agent starts, check the NT event log for diagnostic messages. Some of them are pretty good
- Script a process to check your server database each day for populating the manualmc.txt file. Manualmc.txt allows you to choose what servers to manage rather than getting all the computers discovered by computer rules and adding exclude rules for those you don't want. Also, removing a computer from manualmc.txt can be automated whereas removing computer rules cannot
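The daily manualmc.txt process could be sketched as a comparison between the current file and a server inventory. The Python below is only an outline: it assumes manualmc.txt holds one computer name per line, that names compare case-insensitively, and that the inventory is already available as a list of names. `plan_manualmc_update` is a hypothetical name, and reading the inventory from your server database is left out.

```python
def plan_manualmc_update(current_lines, inventory):
    """Compare manualmc.txt entries with a server inventory (case-insensitive).

    Returns the names to add, the names to remove, and the new file content.
    """
    current = {line.strip().upper() for line in current_lines if line.strip()}
    wanted = {name.strip().upper() for name in inventory if name.strip()}
    return {
        "add": sorted(wanted - current),
        "remove": sorted(current - wanted),
        "new_file": "".join(name + "\n" for name in sorted(wanted)),
    }

plan = plan_manualmc_update(["SQL01\n", "old01\n"], ["sql01", "EXCH01"])
print("add:", plan["add"])
print("remove:", plan["remove"])
print(plan["new_file"], end="")
```

Reviewing the add and remove lists before writing the file gives a natural point to apply change control.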
36Agent Admin: Source and Logging computers
MOM agents can generate alerts on behalf of other computers if agent proxying is turned on for that agent (Agent-managed computers/computername/Properties/Security/uncheck "Prevent agent proxying"). Management servers do this all the time for failed heartbeats, discoveries, etc. This is what the source and logging computer in the Advanced criteria tab are all about
SELECT C.Name as 'Source', D.name as 'Logging', es.number, dateadd(hh,-6,e.timegenerated), e.message
FROM EventAll E
INNER JOIN Computer C ON (E.idGeneratedBy = C.idComputer)
INNER JOIN Computer D ON (E.idloggedOn = D.idComputer)
INNER JOIN EventSource ES ON (E.idEventSource = ES.idEventSource)
where e.idloggedon <> e.idgeneratedby
order by 2 desc
37Agent Admin: Useful views in the MOM MP
- The MOM management pack has some extremely useful alert and event views under Public Views/Microsoft Operations Manager/Operations Manager 2005. Here are my favorites:
- Everything under Computer Discovery. Great place to look for what discoveries are failing and why. To this folder I would add an event view for event number 21185, which shows a summary of failed discoveries
- Agent Configuration and Connectivity
- - Agent communication failure troubleshooting events
- - Agent communication failures
- Agent Deployment
- - Agent Installation Failures All
- Agent Performance
- - MOM Host - Processor Time
- - MOM Service - Processor Time
38Account Admin: Changing account passwords
- Resetting an Action Account (domain user)
- - SetActionAccount.exe <management group> options
- Options
- - -query //returns the current Action Account settings for the specified management group.
- - -set <domain> <username> //sets the Action Account for the specified management group. Note: the tool will prompt you for the new password.
- Note: the management group must be specified, even if the agent is not multihomed.
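If the account reset is scripted across several agents, the command line can be assembled in one place so the management group is never forgotten. The sketch below only builds the argument list shown on this slide and does not run the tool. The function name and the sample group, domain and user names are hypothetical, and the exact option syntax should be checked against the tool's own help output.

```python
def set_action_account_args(management_group, domain=None, username=None):
    """Build the SetActionAccount.exe argument list as described on the slide.

    With no domain/username the -query form is produced; with both, the -set form.
    The password is not passed: per the slide, the tool prompts for it.
    """
    if not management_group:
        raise ValueError("the management group must always be specified")
    args = ["SetActionAccount.exe", management_group]
    if domain is None and username is None:
        return args + ["-query"]
    if not domain or not username:
        raise ValueError("-set needs both a domain and a username")
    return args + ["-set", domain, username]

print(set_action_account_args("MyGroup"))
print(set_action_account_args("MyGroup", "W2K3LAB", "momaction"))
```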
39Account Admin: Changing account passwords
- Changing the DAS Account password
- - Change the account settings on the Identity tab of the Microsoft Operations Manager Active Operations Data Access Service COM+ application on the Management Server.
- - If you are using a different account, you must also add that account as a SQL Server Security Login with Permit server access.
- - If you are using a different account, give the account db_owner access to the OnePoint database on the MOM database server. MOM setup grants the DAS account this access by default.
- - If you also have the MOM-to-MOM Product Connector installed, add the account to the MOM Service security group on the Management Server.
- Note
- - You must restart the COM+ application to commit the changes.
40Configuring the Web Console
- Configuring the Web Console As Read-Only
- You can configure the Web console to be read-only, so that operations data can be seen, but tasks cannot be run and changes cannot be made. This setting does not affect the Operator console read/write access.
- To enable or disable Read-Only access for the Web Console
- - On the server hosting the Microsoft Operations Manager 2005 Web console application, open the INSTALLDRIVE\Program Files\Microsoft Operations Manager 2005\WebConsole\web.config file in a text editor.
- - In the <appSettings> node, change the node <!--add key="Readonly" value="true"/--> to <add key="Readonly" value="true"/>.
- - Restart the Microsoft Operations Manager 2005 Web console application in the Internet Information Services (IIS) snap-in.
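The web.config edit amounts to uncommenting one appSettings entry, so it can be expressed as a plain text substitution. The sketch below assumes the file contains the commented entry exactly as `<!--add key="Readonly" value="true"/-->`, with no extra whitespace; the function name and the sample configuration are hypothetical. Back up the real file before changing it, and restart the Web console application afterwards as described above.

```python
COMMENTED = '<!--add key="Readonly" value="true"/-->'
ENABLED = '<add key="Readonly" value="true"/>'

def set_web_console_readonly(config_text, readonly=True):
    """Toggle the Readonly appSettings entry by uncommenting or re-commenting it."""
    if readonly:
        return config_text.replace(COMMENTED, ENABLED)
    return config_text.replace(ENABLED, COMMENTED)

sample = "<configuration><appSettings>" + COMMENTED + "</appSettings></configuration>"
enabled = set_web_console_readonly(sample)
print(enabled)
print(set_web_console_readonly(enabled, readonly=False) == sample)
```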
41For Further Information..
- Read the MOM 2005 Operations guide
- Use the microsoft.com\mom website
- Attend the Windows Management Webcasts on MOM
- ATTEND ALL OF THE TECHNET SESSIONS ON MOM
42Troubleshooting MOM 2005
MOM 2005 Logs
43Written by Developers for Developers
- Original concept by Mission Critical Software
- - To assist developers with debugging their own code!
- - Needs familiarity with the code itself
- Augmented by NetIQ
- Further augmented by Microsoft
- - With a move away from mc8 to log files for easier troubleshooting by PSS
44Logging Levels
- HKLM\Software\Mission Critical Software\Tracelevel
- - Default 0x1
- - Service restart not required for Trace Level changes
- Verbosity levels
- - 0xFFFFFFFF: Off
- - >= 0: Errors (Err)
- - >= 3: Errors and Warnings (Wrn)
- - >= 6: Errors, Warnings and Info (Inf)
- - >= 9: Errors, Warnings, Info and Debug (Dbg)
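The verbosity table can be read as a simple threshold mapping. The Python sketch below assumes the thresholds are inclusive, which matches the 0-2, 3-5 and 6-8 ranges used in the later tracing slide titles, and treats 0xFFFFFFFF as the only "off" value. The function name is hypothetical.

```python
TRACE_OFF = 0xFFFFFFFF  # value listed as "Off" on the slide

def enabled_categories(trace_level):
    """Return the message categories written to the log at a given Tracelevel."""
    if trace_level == TRACE_OFF:
        return []
    categories = ["Err"]
    if trace_level >= 3:
        categories.append("Wrn")
    if trace_level >= 6:
        categories.append("Inf")
    if trace_level >= 9:
        categories.append("Dbg")
    return categories

for level in (0x1, 3, 6, 9, TRACE_OFF):
    print(hex(level), enabled_categories(level))
```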
45Log Locations
- Service logs (.mc8, .log)
- - %Systemroot%\Temp\Microsoft Operations Manager
- DAS log (DllHost.mc8)
- - Documents and Settings\<DAS account>\Local Settings\Temp\Microsoft Operations Manager
- MOM MMC Snap-in log
- - Documents and Settings\All Users\Local Settings\Temp\Microsoft Operations Manager
46Service Logs
- MOMService(Init).mc8
- - Logged to for the first TraceInit seconds
- - Circular line logging commences after TraceInit seconds (default 60)
- MOMService(A-B).mc8
- - Circular line log files
- - Logs roll over after TraceCircularLines
- - TraceCircularLines default 50,000
47Reading service logs
- Notepad
- - Lacks translation of date/time
- - Useful for quick examination and searching
- MOMLogViewer
- - MOM ResKit utility
- - Displays pertinent information such as Date/Time
- - Realtime update
- - Doesn't always read ALL lines!
- Trace32.exe
- - From SMS support tools
- - Useful for real-time logging and highlighting
- - Lacks column translation
48Which Trace Level? Which Trace Level was enabled for log file analysis
- MOMLogViewer
- - Add Column: Trace Level
- Notepad / Trace32
- - Fourth distinct column, or Err, Wrn, Inf, Dbg in the text column
49Why Trace?
- On PSS / PG Advice
- - PSS or the Product Group ask for more data
- - Used by CPR or developers to isolate code causing a potential problem
- - Do not enable Trace Level >= 6 unless advised (it is resource intensive)
- Obtaining more information
- - Default Trace Level 1: Errors
- - Mc8 logs will contain information on errors (although not always significant)
- - Review logs after a crash or significant failure.
50Err (0-2) Tracing (Errors)
- Expected errors
- - Err: Directory Service NT Event log not registered on this machine. Not processing Directory Service log.
- - i.e. this is not a domain controller
- Unexpected errors
- - Err: Logged event -1073715815(Error) args: "momsrv.w2k3lab.com" "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." "1270"
- - i.e. communication failure
51Wrn (3-5) Tracing (Warning)
- Often produced prior to or during code failure
- - Agent/Server queue nearing full
- - e.g. Wrn: High water wait timed out. Waiting indefinitely for 957 bytes (exact space)
- - Wrn: Connection failed with status 3 (i.e. TCP connection failed unexpectedly)
- Sometimes expected
- - e.g. Wrn: Cannot start eventlog reader for target, log name: Directory Service
52Inf (6-8) Tracing (Information)
- More general output
- - e.g. starting threads, host processes, registering callbacks, rule updates, computer grouping etc.
- Logging Events to the EventLog
- - Useful for log to eventlog tallying (troubleshooting by time of occurrence)
- In-depth troubleshooting requires access to source code
- Often needs an increase to TraceCircularLines
- - Logs wrap too fast to catch!
53Dbg (>= 9) Tracing (Debug)
- Not recommended in production
- - Very resource intensive
- - Logs roll over very quickly
- - Needle in a haystack scenario!
- Can be of use to PSS for
- - Troubleshooting channel errors
- - Troubleshooting service failures
- - Troubleshooting provider errors
- TraceCircularLines needs to be increased
- - The level of logging will wrap logs quickly
54Tallying EventLog Events
Inf: Information, Trace Level >= 6
Inf: Logged event 9009(Information) args: "D:\mom2005\MOMService.exe"
Logged Time: 02/21/2006 10:35:00
Event ID: 9009
Source: Microsoft Operations Manager
Created: 02/21/2006 10:35:00
Description
55 9015 Error Tracing
Typical Event Log Entry
The Microsoft Operations Manager service (MOMService.exe) received an unexpected exception.
Thread Id: %1
Thread Name: %6
Exception code: %2
Exception description: %3
Exception address: %4
Exception flags: %5
- Search the MOMService(*).mc8 file for 9015 event generation.
- Examine preceding log entries for the cause of the 9015 and compare with the event log entry.
- Supply any additional log file information to PSS if raising a support call.
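The search-then-look-back procedure above can be sketched in a few lines of Python. The sketch assumes the log can be read as text lines and that a 9015 shows up as "Logged event 9015", by analogy with the "Logged event 9009" example on the previous slide. The function name and the sample lines are hypothetical, not real log output.

```python
import re
from collections import deque

def context_before_event(lines, event_id=9015, context=5):
    """Yield (matching_line, preceding_lines) for each 'Logged event <id>' line."""
    pattern = re.compile(r"Logged event\s+%d\b" % event_id)
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    for line in lines:
        line = line.rstrip("\n")
        if pattern.search(line):
            yield line, list(recent)
        recent.append(line)

# Hypothetical sample lines for illustration only.
log = [
    "Inf: starting thread",
    "Wrn: queue nearing full",
    "Err: Logged event 9015(Error)",
    "Inf: Logged event 9009(Information)",
]
for hit, before in context_before_event(log, context=2):
    print(hit)
    for line in before:
        print("    preceded by:", line)
```

To use it on a real file, pass an open file object instead of the list, for example `context_before_event(open(path, errors="replace"))`, where `path` is your own log file.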
56MOM Agent Cluster Discovery
momclustermonitor.cpp
Inf: Found cluster group: SQL
Inf: Found cluster group resource: Disk1
Wrn: GetClusterResourceNetworkName return FALSE, resource: Disk1, GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: SQL IP Address1(SQL01)
Wrn: GetClusterResourceNetworkName return FALSE, resource: SQL IP Address1(SQL01), GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: SQL Network Name(SQL01)
Wrn: GetClusterResourceNetworkName return FALSE, resource: SQL Network Name(SQL01), GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: SQL Server (MOM1)
Inf: Found virtual server: SQL01, for cluster group resource: SQL Server (MOM1)
Inf: Found cluster group resource: SQL Server Agent (MOM1)
Inf: Found cluster group resource: SQL Server Fulltext (MOM1)
Inf: Found cluster group resource: spool
Inf: Found cluster group node: W2K3NODE2
Inf: Found cluster group node: W2K3NODE1
Inf: Found cluster group: Group 0
Inf: Found cluster group resource: Cluster IP Address
Wrn: GetClusterResourceNetworkName return FALSE, resource: Cluster IP Address, GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: Cluster Name
Wrn: GetClusterResourceNetworkName return FALSE, resource: Cluster Name, GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: Disk Q
Wrn: GetClusterResourceNetworkName return FALSE, resource: Disk Q, GetLastError: 5002: The cluster resource dependency cannot be found.
Inf: Found cluster group resource: MSDTC
Inf: Found virtual server: W2K3CLUSTERCE, for cluster group resource: MSDTC
Inf: Found cluster group node: W2K3NODE1
Inf: Found cluster group node: W2K3NODE2
57MOM Cluster Discovery (cont)
momclustermanager.cpp
Inf: Discovered virtual server: SQL01
Inf: Full virtual server name: W2K3LAB\SQL01
Inf: Discovered virtual server: W2K3CLUSTERCE
Inf: Full virtual server name: W2K3LAB\W2K3CLUSTERCE
[Screenshot: MMC, Microsoft Cluster Servers showing SQL01 and W2K3CLUSTERCE]
58MOM Cluster Discovery (cont)
- MSCS APIs only
- MOM Cluster discovery registers for MSCS callbacks
- - e.g. new cluster group added
- - Cluster resource name change
- No need to stop and restart the MOM agent
59Host Log files
- MOMAgentScriptHost-<CG>(A-B).MC8
- MOMAgentPerformanceHost-<CG>(A-B).MC8
- MOMAgentCPRHost-<CG>(A-B).MC8
- MOMAgentBatchHost-<CG>(A-B).MC8
- Similarly MOMServer.mc8 for server responses
60DAS Logging (overview)
- Controlled by
- - Trace Level
- - HKLM\Software\Mission Critical Software\DasServer\LoggingFlags (REG_DWORD)
- - See KB329451 for details
- - MOM 2005 specific article pending
- Note on dllhost(A-B).mc8
- - May be in one or more locations
- - %systemroot%\Temp\Microsoft Operations Manager
- - Documents and Settings\<dasAccount>\Local Settings\Temp\Microsoft Operations Manager
61MOM Resources
- Microsoft Operations Manager: http://www.microsoft.com/MOM
- Getting Started Resources: http://www.microsoft.com/MOM/Beginners
- - Technical Walkthrough
- - Key Documentation
- - MOM Evaluation Download
- Partner Product Catalog: http://www.microsoft.com/MOM/ManagementPacks
- MOM Community: http://www.microsoft.com/MOM/community/
- Solution Accelerators: http://www.microsoft.com/mom/evaluation/solutions/default.mspx
63Thank you for attending this TechNet Event
- http://www.microsoft.com/uk/technet