Implementing Service Monitoring in a Business Context

Provided by: jasonf9

1
Implementing Service Monitoring in a Business
Context
  • Tivoli User Group

2
Agenda
  • Session 1
  • Phase 0 Assessment Planning
  • Phase 1 Foundation
  • Phase 2 Discovery
  • Session 2
  • Phase 3 Service Visibility

3
Phase 0 Assessment
4
The cost of Service Outage
  • Revenue lost for transactions that fail
  • Directly impacts profitability, every minute of
    every outage (Downtime cost for typical retail
    outlet is 7,800 per minute (IDC))
  • Dissatisfied customers go to competitors, and
    potentially never return
  • Tests at Amazon revealed every 100 ms increase in
    load time of Amazon.com decreased sales by 1%
  • Google found that moving from a 10-result page
    loading in 0.4 seconds to a 30-result page
    loading in 0.9 seconds decreased traffic and ad
    revenues by 20% (Linden 2006).

5
Cost of Service Outage contd.
  • Business reputation is damaged
  • Even customers not directly impacted by outage
    may hear about the poor service
  • The Business vilifies IT when it happens
  • But still change is pushed through with little
    consideration of operational needs
  • We need to break this cycle, and Business Service
    Management can be a catalyst

6
How to Start?
  • First, identify stakeholders
  • Critical to ultimate success
  • Starting in the right way is key to a good
    outcome
  • Set realistic goals and expectations
  • Don't be drawn into trying to boil the ocean
  • Work by service, step by step
  • See the IT service in its Business Context
  • View our services based on how they support the
    overall organisation and its end customers
  • Essential in converting Business Managers to see
    IT as a key enabler, not a cost burden

7
Identify the Service
  • What constitutes the service?
  • Even relatively simple Services cross traditional
    IT silo boundaries
  • IT Infrastructure
  • Servers and Clients
  • Middleware (Web, Database, Messaging)
  • Applications
  • Storage
  • Security
  • Processes (Business and IT)
  • People
  • i.e. everything supporting transactions
  • BSM tools help span the boundaries

8
Example Service
  • For this presentation we will use a standard
    e-business application called PlantsByWebSphere

9
PlantsByWebSphere
  • PlantsByWebSphere is a standard e-business
    application. It has:
  • 1 Load Balancer (Linux server)
  • 2 Web Servers (Linux Apache and Windows IIS)
  • 2 WebSphere Servers (1 Unix and 1 Windows acting
    as a pair)
  • 1 DB2 Database (Linux)

10
First Steps
  • Now we have our service we can use the following
    structured approach
  • Review (or establish) SLAs and supporting OLAs
  • Pull out the Important Business KPIs
  • Agree relative priority of these
  • Articulate the KPIs as a measurable IT function
  • Establish the components that support the
    end-to-end Service
  • Monitor, report, improve

11
Service Level Agreements
  • Not an essential pre-requisite to BSM
  • BSM solutions still deliver value without formal
    SLAs
  • Existence will increase the value of the tool
  • Everyone understands what we are striving to
    deliver, and why
  • Knowing SLA (and thus Business) impact of
    incidents helps with recovery prioritisation
  • Targets (and cost of failure) are clear to all
  • Facilitates identification of KPIs

12
Service Level Agreements
  • Useful framework for dialogue with Business
  • A common understanding of which IT Services are
    Business Critical
  • Establish reasonable, achievable, and measurable
    availability and performance targets
  • Evaluate cost of meeting targets, and business
    appetite to invest in suitable infrastructure
  • Vehicle to justify suitable Test and DR
    environments
  • High availability is expensive and not always
    justifiable

13
Objective Measurement
  • An SLA target that cannot be objectively
    qualified is pointless
  • Directly relate KPIs to IT infrastructure, then
    measure and report in Business context
  • Get the foundations right and build up
  • Cover all elements of a specific service
  • Complete coverage of one IT component has little
    value to Service Monitoring if others are missing

14
PlantsByWebSphere SLA
  • Luckily for this exercise we already have an SLA!
  • The PlantsByWebSphere SLA service is allowed 15
    minutes of downtime in a calendar month
  • After this, each further minute of downtime will
    cost 100

15
Key Performance Indicators
  • Secondly we need our KPIs
  • KPIs should reflect the Service SLAs in place
  • They need to be Business focused, objective, and
    achievable
  • It must be possible to monitor and measure IT
    component(s) that equate to the KPI
  • Relative importance should be agreed
  • A baseline of current performance is advantageous

16
PBW Availability KPIs
17
PBW Performance KPIs
18
Identify the End to End Service Infrastructure
  • BSM solutions are most effective when all
    components of a service are monitored
  • How do we identify them?
  • Asset Registry
  • Discovery Tools
  • CMDB
  • Bob (who's worked here for years)
  • The truth is probably a combination of the above

19
End to End Service Infrastructure
  • Having identified the Service components
  • Verify which are monitored
  • Close any gaps in cover
  • Ensure KPI Monitors are in place
  • Integrate with the BSM layer
  • Display in appropriate views (IT and Business)
  • Don't forget the supporting processes
  • Incident, Problem, and Change Management
  • Capacity Planning

20
Monitor, Report, Improve
  • Even if everything goes perfectly, you will miss
    something (whatever Bob says)
  • BSM is an iterative process, as there is always
    change
  • Ensure your processes consider your monitoring
    solution (Change, Release)
  • BSM should be scoped in all new projects
  • Revisit SLAs and KPIs regularly to keep them
    relevant

21
Assessment Deliverables
  • Validate stakeholders and business drivers
  • Vital to an effective implementation project
  • Establish service components and KPIs
  • What to monitor and why
  • Produce high level solution design
  • How to monitor and visualise results
  • Common understanding of costs
  • Hardware, Software, Services, Operation

22
Assessment Deliverables
  • Investment justification (Business Case)
  • Benefits and milestones (ROI)
  • Agree priorities and phasing
  • Structure project to deliver early Business Value
  • Process and Organisational change
  • The crucial elements that wrap around the
    technology implementation
  • Improved communication and understanding
  • IT and Business working to common goals

23
Phase 1 Foundation
  • Simon Barnes

24
Foundation
  • The Foundation phase uses the deliverables from
    the Assessment and establishes core capability to
    manage a service end to end
  • This phase focuses on the capability to provide
    base monitoring of all components supporting a
    service including
  • Server
  • Network
  • Applications

25
What is included with the IBM BSM Foundation
Solution?
[Diagram: Service Infrastructure domains: Other, Storage, Business, Systems, Applications, Voice, Mainframe, Network, Security, Wireless]
26
What is the aim?
  • The aim of this phase is to reduce MTTR
  • That is, the time from when a problem occurs until
    it is fixed
  • 80% of MTTR is spent in the Diagnose/Isolate
    phase (IDC)
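
MTTR as defined here can be computed directly from incident records. The following is a minimal sketch, assuming each incident is stored as an (occurred, fixed) pair of timestamps; the function name and the sample incidents are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (fixed - occurred) over all incidents."""
    if not incidents:
        raise ValueError("no incidents to average")
    # Start the sum from a zero timedelta so the durations add up cleanly.
    total = sum((fixed - occurred for occurred, fixed in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: one took 30 minutes to fix, the other 90 minutes.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),
]
mttr = mean_time_to_repair(incidents)
print(mttr)
```

Tracking this figure per service before and after the Foundation phase gives a simple way to show whether the monitoring work is paying off.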

27
Why is this so hard?
  • How do customers manage their IT infrastructure?

[Diagram: monitoring tools (HP NNM, NetIQ, Microsoft MOM, Tivoli Monitoring, NetView/z) and service components (SOA, Intranet, Oracle, Mainframe, Billing, Web server, Network infrastructure, Security), with the symptom "Customer unable to place new online order"]
28
Key Performance Indicators
  • At this stage we need to know the KPIs
  • It must be possible to monitor and measure IT
    components that equate to the KPI
  • Our KPIs are
  • Performance
  • End to End Transaction Time (Load Balancer ->
    WebServer -> WebSphere -> Database)
  • Component Transaction Time (WebSphere, WebServer,
    DB2)
  • Client Transaction Time (Speed for client to open
    application)
  • Availability
  • Client Web Site Availability (Failed open
    requests)
  • Component unavailability due to critical failure
    (component failure)
  • Number of Severity One Service IDs

29
Server KPIs
  • Our Server KPIs are
  • Availability
  • Component unavailability due to critical failure
    (component failure)
  • To this end we will monitor
  • Server Availability
  • Critical Server Components (Disk, Memory,
    Processor)
  • Critical Processes
  • However DO NOT monitor too much
  • As a rule only create alerts for things that have
    a corrective action

30
PlantsByWebSphere
  • The first aspect of our foundation is to add
    Server Monitoring to capture component problems
  • Server Monitoring is provided by adding a base OS
    IBM Tivoli Monitoring 6.2.1 agent to all key
    components

31
Network KPIs
  • Our small example does not have any Network
    specific KPIs
  • However in a normal environment you will need to
    know all the network components that affect a
    service
  • Such as
  • Load Balancers
  • Switches
  • Firewalls
  • Network Monitoring is provided by IBM Tivoli
    Network Manager

32
Application Monitoring
  • Application Monitoring can take many forms
  • Specialist Agents such as DB2, WebSphere, Web
    Server
  • Logfile and Process Monitoring
  • SNMP Monitoring
  • Custom Agent
  • End to End Transaction Monitoring

33
Why End to End?
End-user experience
High-performing resources don't always translate
into high-performing applications
34
Key Performance Indicators
  • Application monitoring is the key phase to any
    service management project as almost all KPIs are
    measured in this way
  • To this end our KPIs are
  • Performance
  • End to End Transaction Time (Load Balancer ->
    WebServer -> WebSphere -> Database)
  • Component Transaction Time
  • Client Transaction Time
  • Availability
  • Client Web Site Availability
  • Component unavailability due to critical failure
    (component failure)

[Callouts: KPIs measured with ITCAM for RT, ITCAM for Web Resources and ITM for Databases]
35
PBW Application Monitoring?
[Diagram labels: Web Response]
36
ITCAM for RT
Any GUI client or Web Transaction can be recorded
and uploaded
37
Client Response Agent
  • Responds to requests for data from a client
  • Lots of out of the box integrations
  • You can also record your own, listening for the
    start and end API calls of any application

38
Web Response Agent
  • The Web Response Time agent collects user
    response time for HTTP and HTTPS Web
    transactions.
  • HTTP traffic - agent listens locally to TCP/IP
    stack and measures the response time of the
    transaction.
  • HTTPS traffic - as it needs to access an
    unencrypted HTTP data stream, the agent runs on
    the Web server machine and makes use of the Web
    server exits to get access to the data stream.
  • Appliance mode - allows the agent to collect HTTP
    traffic from other machines in the same network
    segment by enabling collection of network packets
    in promiscuous mode.

39
Correlation
  • Correlation should be included at the beginning
    so that root cause identification can be
    accelerated from the outset
  • This can then be added to as a process of
    continuous improvement
  • In the case of the PlantsByWebSphere application
    it is provided by OMNIbus
  • OMNIbus automatically provides capabilities such
    as event deduplication
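
To illustrate what deduplication means in practice, here is a minimal sketch in Python. It is not the OMNIbus implementation: it only assumes that repeated events sharing an identifier are collapsed into one row whose count and last-seen time are updated. The class, field names and sample events are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Event:
    identifier: str       # key that marks two events as the same problem
    summary: str
    first_occurrence: int
    last_occurrence: int
    tally: int = 1        # how many times the event has been seen


def deduplicate(raw_events: list[tuple[str, str, int]]) -> dict[str, Event]:
    """Collapse repeated events into one row per identifier."""
    table: dict[str, Event] = {}
    for identifier, summary, timestamp in raw_events:
        existing = table.get(identifier)
        if existing is None:
            table[identifier] = Event(identifier, summary, timestamp, timestamp)
        else:
            # A repeat updates the existing row instead of adding a new one.
            existing.tally += 1
            existing.last_occurrence = timestamp
            existing.summary = summary
    return table


# Invented sample feed: three repeats of one problem and one unrelated event.
raw = [
    ("db2host:disk_full", "Disk 91% full", 100),
    ("db2host:disk_full", "Disk 93% full", 160),
    ("web1:httpd_down", "Apache process missing", 170),
    ("db2host:disk_full", "Disk 95% full", 220),
]
events = deduplicate(raw)
for event in events.values():
    print(event.identifier, event.tally, event.summary)
```

Four raw events become two rows, and the operator still sees how often and how recently each problem recurred.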

40
Tivoli Business Bottom-Line Effect
Netcool Advanced Data Processing delivers business assurance and increased operations efficiency through massive event reduction and prioritization
[Chart: event volume (>10M, >1k, >100, >10) against degree of Netcool advanced data processing implemented]
  • OMNIbus event collection/consolidation: maximum
    event generation from probes and monitors
  • OMNIbus probe and monitor level event filtering
    and suppression
  • OMNIbus auto clear events, event de-duplication,
    state-based correlation, automated resolution,
    device-based RCA
  • ITNM integration: topology-based RCA
41
ITNM Root Cause
  • Automated discovery and graphic topology
  • Devices
  • Device relationships
  • Real time status and event management
  • Events and their impacts
  • Root-cause analysis (RCA)

42
Event Reduction
  • Because of root cause analysis the number of
    events is reduced greatly

Event Reduction 14:1
Event Reduction 52:1
The failed device becomes the root cause for all
connectivity events
The failed device becomes the root cause for all
connectivity events on isolated devices
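
The idea behind topology-based root cause analysis can be sketched in a few lines. This is a simplified illustration, not ITNM's algorithm: it assumes a known topology, a single polling station, and a set of devices that stopped responding. An unresponsive device that borders the still-reachable part of the network is treated as a root cause, and the devices isolated behind it are treated as symptoms. All names are hypothetical.

```python
from collections import deque


def classify_events(topology, poller, unreachable):
    """Split unreachable devices into root causes and symptoms."""
    # Breadth-first search from the poller through devices that still respond.
    reachable = {poller}
    queue = deque([poller])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, ()):
            if neighbour not in unreachable and neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)
    # A failed device next to the reachable region is a root-cause candidate;
    # devices behind it are only isolated by that failure.
    root_causes = {
        device
        for device in unreachable
        if any(n in reachable for n in topology.get(device, ()))
    }
    symptoms = set(unreachable) - root_causes
    return root_causes, symptoms


# Hypothetical topology: poller - core - switch1 - {web1, web2, db1}
topology = {
    "poller": ["core"],
    "core": ["poller", "switch1"],
    "switch1": ["core", "web1", "web2", "db1"],
    "web1": ["switch1"],
    "web2": ["switch1"],
    "db1": ["switch1"],
}
unreachable = {"switch1", "web1", "web2", "db1"}
roots, symptoms = classify_events(topology, "poller", unreachable)
print(sorted(roots), sorted(symptoms))
```

In this example four connectivity events reduce to one actionable event on switch1, which is the kind of reduction the slide describes.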
43
Phase 2 Discovery
  • Simon Barnes

44
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
  • Understand what you have
  • Application Mapping with Dependencies
  • Agent-less and Credential-free
  • Discover interdependencies between applications,
    middleware, servers and network components

[Diagram labels: Computer System, Switch, Infrastructure Application, Business Application]
45
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
  • Learn how your CIs are configured (and changing
    over time)
  • Configuration Auditing
  • Tracks changes in applications
  • Depicts that information on the map
  • Depicts that information through reports

Automatically tracks changes to all CI attribute
values over time
[Diagram label: Application]
46
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
  • Determine if it is compliant

Comparing two instances of an Apache Web Server
to the reference master
  • Compliance
  • Compare configuration to reference master
  • Compare to your standard policy

Values in red and blue are policy violations
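
Comparing an instance against a reference master amounts to a keyed difference of two configurations. The sketch below shows the idea with plain dictionaries. It does not use TADDM's data model or API, and the Apache settings and values are invented for illustration.

```python
def policy_violations(reference: dict[str, str], instance: dict[str, str]) -> dict[str, tuple]:
    """Return {setting: (expected, actual)} for every setting that deviates."""
    violations = {}
    # Check the union of keys so missing and extra settings are both reported.
    for key in sorted(reference.keys() | instance.keys()):
        expected = reference.get(key)
        actual = instance.get(key)
        if expected != actual:
            violations[key] = (expected, actual)
    return violations


# Hypothetical settings for a reference master and two web server instances.
reference_master = {"KeepAlive": "On", "MaxClients": "150", "Timeout": "300"}
web_server_a = {"KeepAlive": "On", "MaxClients": "150", "Timeout": "300"}
web_server_b = {"KeepAlive": "Off", "MaxClients": "150", "ServerTokens": "Full"}

for name, config in [("web_server_a", web_server_a), ("web_server_b", web_server_b)]:
    print(name, policy_violations(reference_master, config))
```

An empty result means the instance matches the reference master; each remaining entry is a candidate policy violation to review.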
47
And what does this mean?
  • Reduce MTTR
  • Accurate and comprehensive cross-tier service
    visibility
  • Deep configuration details and interdependencies
  • Change history data to identify and isolate
    application changes.
  • Improve Operational Efficiency
  • Make decisions based on accurate operational data
  • Enhance business availability
  • Align IT infrastructure with the business through
    discovery automation

48
ITM and TADDM Integration
  • Simon Barnes

49
Monitoring Coverage
  • You can now use TADDM to display ITM 6 monitoring
    coverage

50
Launch In Context to search portal
  • With 7.1.2, launch in context (LIC) to the search
    portal handles LTPA (Lightweight Third Party
    Authentication) tokens, enabling single sign-on
    (SSO).

51
ITNM and TADDM Integration
  • Simon Barnes

52
Integration Overview
[Diagram: Network Resources, ITNM Discovery, TADDM Discovery, TADDM ITNM GUI, Resource Relationship Data, IDML Book, Bulk Loader, DLA (Discovery Library Adapter)]
53
End of Part 1
54
Phase 3 Service Visibility
  • Simon Barnes

55
Service Visibility
  • In this second part we will learn how to offer
    visibility of the availability and performance of
    specific business services to different business
    units using Tivoli Business Service Manager
    (TBSM)
  • Also we will show how we can monitor and measure
    these against business oriented metrics e.g.
    volume of transactions, revenue flow.

56
IBM Tivoli Business Service Manager
TBSM enables a service-centric approach to
management
  • Capabilities
  • Custom business views and dashboards
  • View key performance indicators (KPIs)
  • Model any service
  • Service status/health from external sources
  • Track real-time Service Level Agreements
  • Advanced numeric rules for calculations
  • Service definition from CMDB/inventory
  • Tight BSM product integration
  • ITCAM for ISM and ITM
  • TADDM, TSLA
  • OMNIbus and TEC
  • Can add value to non-Tivoli monitored
    environments!

57
Event Sources
Status
58
Discovery and Dependencies
Structure
59
Business Data and Processes
Status and Structure
60
How does it work?
  • TBSM uses service models
  • A service model describes dependencies of units
    of operation within an organization.
  • Model elements can describe hardware, software,
    business processes, transactions and others.
  • Models can be built manually in TBSM or built
    automatically from event and/or external data.

61
Templates and Services
  • Templates define how Service Instances will
    behave.
  • Services are instantiations of templates.
  • Web Server Template -> WebServer1
  • There are multiple ways that instances can be
    created in TBSM
  • Manual configuration via graphical user interface
    (GUI).
  • Auto-population based on events
  • RAD API.
  • Data Fetcher or External Service Dependency
    Adapter (ESDA) for auto population from an
    external source.

62
PlantsByWebSphere
63
Service Status from Events
  • Services can derive status from incoming OMNIbus
    events.
  • Hundreds of event feeds through OMNIbus!
  • Propagation of status in a service tree is
    defined in template dependency rules.
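
Status propagation through a service tree can be pictured with a small recursive function. The sketch below assumes the simplest possible dependency rule, where a service takes the worst status of itself and its children. Real TBSM dependency rules are configured per template and can be more selective. The tree is based on the PlantsByWebSphere components, but the node names and event-derived statuses are invented.

```python
SEVERITY = {"Good": 0, "Marginal": 1, "Bad": 2}


def propagate(tree: dict[str, list[str]], own_status: dict[str, str], service: str) -> str:
    """Status of a service: the worst of its own status and its children's."""
    worst = own_status.get(service, "Good")
    for child in tree.get(service, []):
        child_status = propagate(tree, own_status, child)
        if SEVERITY[child_status] > SEVERITY[worst]:
            worst = child_status
    return worst


# Parent -> children; leaf services have no entry.
tree = {
    "PlantsByWebSphere": ["LoadBalancer", "WebFarm", "WebSphereCluster", "DB2"],
    "WebFarm": ["Apache", "IIS"],
    "WebSphereCluster": ["WAS-Unix", "WAS-Windows"],
}
# Status set on individual services by incoming events (invented here).
own_status = {"IIS": "Marginal", "DB2": "Bad"}

for service in ["WebFarm", "WebSphereCluster", "PlantsByWebSphere"]:
    print(service, propagate(tree, own_status, service))
```

A worst-child rule is easy to reason about, but it treats a redundant pair the same as a single point of failure, which is why per-template rules matter.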

64
Service Status from Business Sources: Data
Fetchers
  • A Data Fetcher is a database poller used to
  • Retrieve key performance indicator values
  • Drive status of service instances (similar to
    events)
  • Retrieved data can be used with numerical health
    calculations
  • Useful feed of KPIs for scorecard/dashboard views
  • Data can be used in SLA calculations
  • Extends auto-population feature

[Diagram: a Data Fetcher returns rows for WebFarm3: WebServer6 TroubleTkts 2, WebServer13 TroubleTkts 7, WebServer21 TroubleTkts 0, WebServer15 TroubleTkts 4]
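
The polling pattern behind a Data Fetcher can be sketched with Python's built-in sqlite3 module standing in for the external business database. This is an illustration of the pattern only, not TBSM configuration: the table layout, column names and status thresholds are assumptions, and the row values are taken from the diagram above.

```python
import sqlite3


def fetch_ticket_counts(conn: sqlite3.Connection, farm: str) -> dict[str, int]:
    """One poll: read the trouble-ticket count for each server in a farm."""
    rows = conn.execute(
        "SELECT server, trouble_tkts FROM tickets WHERE farm = ? ORDER BY server",
        (farm,),
    ).fetchall()
    return {server: count for server, count in rows}


def status_for(count: int, marginal_at: int = 3, bad_at: int = 6) -> str:
    """Map a fetched value to a status; the thresholds are assumptions."""
    if count >= bad_at:
        return "Bad"
    if count >= marginal_at:
        return "Marginal"
    return "Good"


# In-memory stand-in for the external database the fetcher would poll.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (farm TEXT, server TEXT, trouble_tkts INTEGER)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        ("WebFarm3", "WebServer6", 2),
        ("WebFarm3", "WebServer13", 7),
        ("WebFarm3", "WebServer21", 0),
        ("WebFarm3", "WebServer15", 4),
    ],
)

counts = fetch_ticket_counts(conn, "WebFarm3")
statuses = {server: status_for(n) for server, n in counts.items()}
print(statuses)
```

Each returned row both supplies a KPI value and names a server, which is how the same feed can drive status and auto-populate service instances.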
65
Numeric Modeling
  • Metrics can be associated with a service and
    optionally be used to determine status.
  • e.g. Web server response time
  • Dependencies can be based on numerics
    (aggregations).
  • e.g. Web farm response time as the calculated
    average response time for each web server in the
    web farm
  • Predefined aggregation calculations
  • Average
  • Maximum, Minimum
  • Percentile
  • Sum
  • Use Netcool/Impact policies to create customized
    calculations.
  • Configure a status threshold that will result in
    the service designated as Good/Bad/Marginal based
    on the value of the numeric rule.
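
The web farm example can be worked through as a short calculation. This is a sketch under stated assumptions: the server names, response times and threshold values are invented, and the functions only imitate the Average and Maximum aggregations followed by a status threshold.

```python
from statistics import mean


def aggregate(server_times_ms: dict[str, float], how: str = "average") -> float:
    """Numeric aggregation over the children of a service."""
    values = list(server_times_ms.values())
    if how == "average":
        return mean(values)
    if how == "maximum":
        return max(values)
    raise ValueError(f"unsupported aggregation: {how}")


def status_from_threshold(value: float, marginal_at: float, bad_at: float) -> str:
    """Turn the numeric rule's value into Good, Marginal or Bad."""
    if value >= bad_at:
        return "Bad"
    if value >= marginal_at:
        return "Marginal"
    return "Good"


# Hypothetical response times in milliseconds for a three-server web farm.
times = {"WebServer1": 200.0, "WebServer2": 400.0, "WebServer3": 900.0}

avg = aggregate(times, "average")
worst = aggregate(times, "maximum")
print(avg, status_from_threshold(avg, marginal_at=400.0, bad_at=800.0))
print(worst, status_from_threshold(worst, marginal_at=400.0, bad_at=800.0))
```

The same data gives a Marginal farm on the average and a Bad farm on the maximum, so the choice of aggregation should follow what the KPI is meant to protect.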

66
Product Integration - Discovery
[Diagram: TADDM supplies Application Maps (IDML) and application detailed configuration and change history data to TBSM; TBSM issues configuration and change history queries; Business Systems View]
  • TBSM / TADDM Value Proposition
  • Accurate and comprehensive application visibility
  • Cross-tier application topology
  • Deep configuration details and interdependencies
  • Automatically create and maintain
    application/service groupings
  • Identify and isolate application changes to
    dramatically reduce MTTR

67
Service Level Agreements
  • Service Level Agreements
  • Can be defined for
  • Services
  • Applications
  • Devices
  • 3 Types of SLAs
  • Instance
  • Cumulative
  • Violation Count
  • SLA Metrics
  • Availability
  • Downtime (MTTR)
  • Penalties (cost)

68
PlantsByWebSphere SLA
  • Service Level Agreements
  • Can be defined for
  • Services
  • Applications
  • Devices
  • 3 Types of SLAs
  • Instance
  • Cumulative
  • Violation Count
  • The PlantsByWebSphere SLA is set on the number of
    minutes down (cumulative duration) in a calendar
    month (1st to 1st).
  • The service is allowed to be down for 15 minutes.
  • We would like a warning after 10 minutes.
  • We also want a penalty of 100 per minute down.
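
The cumulative SLA described above can be checked with a short calculation. The sketch below assumes outages are recorded as (down, up) timestamp pairs, that the warning applies from 10 minutes inclusive, and that the penalty is charged only on minutes beyond the 15-minute allowance. The function names and sample outages are illustrative and are not TBSM or TSLA settings.

```python
from datetime import datetime


def minutes_down_in_month(outages, year: int, month: int) -> float:
    """Cumulative downtime in minutes within one calendar month (1st to 1st)."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    total = 0.0
    for down, up in outages:
        # Clip each outage to the month so only the part inside it is counted.
        overlap = (min(up, end) - max(down, start)).total_seconds()
        if overlap > 0:
            total += overlap / 60
    return total


def sla_state(minutes_down: float, warn_at: float = 10, allowed: float = 15, rate: float = 100):
    """Return (state, penalty) for the month."""
    if minutes_down > allowed:
        return "Violated", (minutes_down - allowed) * rate
    if minutes_down >= warn_at:
        return "Warning", 0.0
    return "OK", 0.0


# Hypothetical outages in one month: 8 minutes and 12 minutes.
outages = [
    (datetime(2024, 3, 3, 10, 0), datetime(2024, 3, 3, 10, 8)),
    (datetime(2024, 3, 20, 22, 0), datetime(2024, 3, 20, 22, 12)),
]
down = minutes_down_in_month(outages, 2024, 3)
print(down, sla_state(down))
```

With 20 minutes of cumulative downtime the allowance is exceeded by 5 minutes, giving a penalty of 500 in whatever currency the SLA specifies.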

  • SLA Metrics
  • Availability
  • Downtime (MTTR)
  • Penalties (cost)

69
End and Thank You