Framework Functionality, Tuning and Troubleshooting - PowerPoint PPT Presentation

Provided by: DaveTh95
Transcript and Presenter's Notes

Title: Framework Functionality, Tuning and Troubleshooting


1
Framework Functionality, Tuning and Troubleshooting
2
Purpose
  • Tuning is a delicate and complex process. The variables are vast, and
    tuning must begin at the operating system and network levels.
  • Before applications can be streamlined, it is crucial to ensure that
    the operating system can service requests and that the network is
    stable.
  • An understanding of the tuning components is imperative; only careful
    analysis and implementation will ensure that the environment is
    streamlined and scalable.
  • All of these variables are related to each other, and balancing them
    is key.
  • Every environment is unique with regard to its applications and the
    functions they serve. This document explains the components that must
    be tuned.
  • As in most scenarios, a baseline configuration is implemented before
    reaching a final working configuration.

3
Agenda
  • Operating System Tuning
  • File Descriptors
  • Process Per User
  • Network Tuning
  • Memory Buffer
  • Socket Buffer
  • TCP
  • UDP
  • Tivoli Tuning
  • Oserv
  • Gateway
  • Dump/Cores
  • Endpoint Manager
  • MDist/MDist2
  • Endpoints
  • Endpoint Communication Reliability
  • 3.7.1 4.1
  • 4.1.1 4.3.1

4
Functionality, Tuning and Troubleshooting
  • Gateway Daemon
  • Jobq threads
  • TMF_Dispatch Thread
  • Threads Categorized
  • Gateway Functionality - logstatus
  • epact.bdb Endpoint Activity Database
  • Gateway Commands
  • Gateway Tuning Considerations
  • Gateway tuning script
  • Examples of errors

5
Operating System Tuning
  • File Descriptors
  • Handle created by a Process when opened.
  • Retired when a file is closed or terminated.
  • Commands to view descriptors
  • Commands to change descriptors
  • Make the change permanent.
  • Sockets
  • Formulas
  • Processes Per User
  • Predefined Limit
  • Maximum of concurrent processes per user-id.
  • Formulas

6
Operating System Tuning contd
Network Tuning
  • Network Options
  • Memory Buffer
  • Formula
  • Socket Buffer Maximum
  • Formula
  • TCP
  • sendspace
  • recvspace
  • Formula
  • UDP
  • Sendspace
  • recvspace
  • Formula

7
Tivoli Tuning
  • ITM/DM
  • request_manager.threads
  • taskengine.max_threads
  • Endpoints
  • Ifs_ignore
  • Login thread usage
  • Endpoint command usage
  • continue_eplogin_onerror
  • Oserv
  • rpc_max_threads
  • iom_by_name
  • set_force_bind
  • Gateway
  • rpc_maxthreads
  • max_concurrent_jobs
  • max_concurrent_logins
  • eplogin_timeout
  • continue_eplogin_onerror
  • tcp_backlog
  • Endpoint Manager
  • epmgr_rpc_max_threads
  • login_limit
  • timeout_interval
  • MDist/MDist2
  • max_sessions
  • mem_max
  • net_load
  • max_conn
  • packet size
  • rpt_dir
  • disk_dir

8
Endpoint Communication Reliability
Login Process for TMF 3.7.x or 4.1
  • LCFD: Sends login request to the gateway. Receives login results from
    the gateway.
  • GATEWAY: Receives endpoint login request and puts it in the JOBQ. The
    JOBQ handler starts a login thread.
  • Decision: Is the endpoint login a normal login?
  • Yes: Process normal login. Sends result back to endpoint.
  • No: For initial, migratory or isolated logins, the endpoint login
    thread contacts the endpoint manager. Sends new login_info to the
    endpoint.
  • EPMGR: Processes initial, migratory or isolated logins. Sends
    login_info to the gateway.
9
Endpoint Communication Reliability continued
Login Process for TMF 4.1.1
  • LCFD: Sends login request to the gateway. Receives login timeout and
    sets login interval. Receives login results from the gateway, or a -1
    timeout if the endpoint manager is busy, and resets the timeout in the
    login cycle.
  • GATEWAY: Receives endpoint login request and puts it in the endpoint
    login JOBQ. The JOBQ handler starts a login thread.
  • Decision: Is the endpoint login a normal login?
  • Yes: Process normal login. Sends result back to endpoint.
  • No: Initial, migratory or isolated logins are forwarded to the
    endpoint manager. Sends new login_info, or a -1 timeout if the
    endpoint manager is busy, to the endpoint.
  • EPMGR: Processes initial, migratory or isolated logins. Sends
    login_info to the gateway. If the endpoint manager is busy, throws an
    exception.
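
The TMF 4.1.1 login decision described on this slide can be sketched in a few lines of Python. This is only an illustration of the routing logic, not Tivoli code: the class and function names (EndpointManager, EndpointManagerBusy, handle_login) and the returned strings are hypothetical. Only the normal versus initial/migratory/isolated split and the -1 timeout come from the slide.

```python
from dataclasses import dataclass

# Value the slide says is sent to the endpoint when the endpoint manager is busy.
BUSY_TIMEOUT = -1


class EndpointManagerBusy(Exception):
    """Raised by the hypothetical endpoint manager stub when it is busy."""


@dataclass
class EndpointManager:
    """Stand-in for the EPMGR; `busy` simulates thread exhaustion."""
    busy: bool = False

    def process_login(self, login_type: str) -> str:
        if self.busy:
            raise EndpointManagerBusy(login_type)
        return f"login_info for {login_type} login"


def handle_login(login_type: str, epmgr: EndpointManager):
    """Route one login the way the gateway login thread is described."""
    if login_type == "normal":
        # Normal logins are processed by the gateway itself.
        return "login result"
    try:
        # Initial, migratory and isolated logins go to the endpoint manager.
        return epmgr.process_login(login_type)
    except EndpointManagerBusy:
        # TMF 4.1.1 behaviour: tell the endpoint to back off and retry.
        return BUSY_TIMEOUT
```

In this sketch a busy endpoint manager never blocks the gateway login thread; the endpoint simply receives -1 and retries on its next login cycle.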
10
Gateway Log Messages
  • Gateway Boot Process
  • Read
  • Initialize
  • Opens
  • Contacts Endpoint Manager
  • Threads started
  • Reconnect Thread
  • Listens for incoming request from the endpoints
  • Login Thread
  • Gateway process listens to the broadcast login
    requests from endpoints
  • Reader Thread
  • Processes all gateway requests assigned to the
    reader queue
  • Logstatus Thread
  • At every logstatus_interval logs status of JOBQs
    and endpoints

11
Endpoint Login Thread
  • Endpoint Data Functions
  • Data Stored
  • Downcall Process
  • Checks credentials of the user
  • Resolves method dependencies
  • Opens Connection
  • Endpoint Performs Downcall
  • Returns Results
  • Upcall Process
  • Reconnect Thread gets Upcall
  • Reader Thread creates Job
  • JOBQ Monitoring Threads Starts
  • Updates
  • Upcall Method resolves
  • Gateway Returns Results.
  • Endpoint Login Thread
  • Login request
  • Runs login_filter
  • Compares endpoint data
  • Login_info
  • Upgrade
  • Login Policy Boot Method
  • Multi-Cast
  • Notifies Repeater
  • Login_status
  • Endpoint Logout Process
  • Logout Packet

12
Gateways Purposes and Impact
  • Gateway Process
  • Designed to handle a large percentage of
    management functions.
  • Gateway Stability
  • Number of management units affected
  • Ability to manage endpoints
  • Crucial to 24x7 operations
  • Products Impacted by
  • Gateway Stability
  • Tivoli Software Distribution
  • Tivoli Inventory
  • System Availability

13
Gateway Daemon
  • Gateway Thread Usage
  • Fixed Remote Procedure Call (RPC)
  • Gateway method spawned from an oserv thread
  • gateway_method_in()
  • Two types of gateway threads execute gateway
    methods
  • Jobq threads
  • Tmf_dispatch thread

14
Gateway Daemon (contd)
  • Jobq Threads
  • new_job()
  • Processed by pthreads
  • max_concurrent_jobs
  • Pthreads best practices
  • Pthreads = max_concurrent_jobs = number of hosting
    gateways, > 200 and < 1000 (or to the nearest 100)
  • Monitoring memory usage of gateway process.
  • Monitoring the endpoint manager and management
    regions methods
  • TMF_Dispatch Threads
  • tmf_dispatch()

15
Gateway Thread Execution
16
Gateway Function - logstatus
  • wgateway <gw_label> logstatus
  • Jobq_threads in queue
  • Jobq_threads running
  • Reader_threads in queue
  • GW methods running
  • Examples of logstatus responses
  • bash$ wgateway gbl_gw logstatus
  • Status Data ---------
  • Jobq_threads in queue: 0
  • Jobq_threads running: 0
  • Reader_threads in queue: 0
  • GW methods running: 1

17
Gateway Function logstatus (contd)
  • Logstatus_interval
  • Sets interval for gatelog update
  • Set smaller than the default for more useful data
  • Entries are written to the gatelog regardless of
    debug level
  • Example of logstatus entry in gatelog
  • 2002/09/18 23:03:48 05 010275A0 STATUS DATA
    jobqq 0 jobqr 0 reqdq 0 gwmethods 0
  • Setting the logstatus_interval
  • wgateway <gw_label> logstatus_interval 600

18
Logstatus of a Gateway
The reconnect_thread listens to TCP communication
from an endpoint
Gateway Method Calls
Gateway_method_in()
The read queue processes only endpoint logins,
endpoint logouts, upcall, upcall proxy and
endpoint control packets
tmf_dispatch()
Job Queue
Running Jobs
Exit
19
Gateway Commands
  • set_session_timeout seconds
  • Can override with task, software package times,
    etc
  • set_rpc_maxthreads count
  • Currently must be at least (max_concurrent_jobs + 50)
  • set_max_concurrent_jobs count
  • Currently max 2000
  • set_max_concurrent_logins count
  • logstatus_interval seconds
  • Interval at which statistics are written to the
    gatelog
  • set_method_trace_time seconds
  • Disable, or set interval to refresh epact.bdb
    with endpoint method date and time information
  • set_debug_level
  • Keep the value of 1 when not troubleshooting to
    reduce the load on the system resources.

20
Gateway Tuning Considerations
  • Operating System File Descriptors
  • Oserv RPC threads
  • Gateway operations RPC threads and jobq
  • Endpoint Manager RPC threads

21
RPC Threads
  • Diagram: OS file descriptors limit the oserv RPC threads, which in
    turn limit the EPMgr RPC threads and the gateway RPC threads.
    Consumers shown: Admin Console, SP Editor, Web Gateway, Agent.
  • The number of OS File Descriptors limits the number of all other RPC
    Threads in Tivoli.
22
Operating System File Descriptors
  • Each thread created consumes OS file descriptors
  • NT/W2K (all versions): testing reveals that the
    maximum is 2038.
  • AIX: Not affected if the file descriptors
    (nofiles(descriptors)) are set to unlimited; this
    applies only to the soft limit.
  • Solaris: Does have limits on how large the
    file descriptors (nofiles) can be set, but the
    best practice is to ensure that the value is at
    least twice the value of the oserv RPC threads.

23
Procedures to change File Descriptors
  • NT/W2K (all versions)
  • Default desktop memory heap is 512
  • Check value
  • regedt32
  • HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems
  • The default data for the Windows value under this key will
    look something like the following
  • %SystemRoot%\system32\csrss.exe
    ObjectDirectory=\Windows
  • SharedSection=1024,3072,512
  • Windows=On SubSystemType=Windows
  • ServerDll=basesrv,1
  • ServerDll=winsrv:UserServerDllInitialization,3
  • ServerDll=winsrv:ConServerDllInitialization,2
  • ProfileControl=Off MaxRequestThreads=16
  • Double click Windows and change the SharedSection
    to 1024,3072,1024
  • This changes the third value from 512 to 1024
  • Close regedt32
  • Reboot the system
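
The SharedSection change above is a small string edit, sketched here in Python. The helper name set_desktop_heap is hypothetical, and the sketch assumes the value uses the usual SharedSection=a,b,c form. It only computes the new string; it does not touch the registry.

```python
import re


def set_desktop_heap(value: str, new_heap_kb: int) -> str:
    """Return `value` with the third SharedSection field set to new_heap_kb."""

    def replace_third(match: re.Match) -> str:
        fields = match.group(1).split(",")
        if len(fields) < 3:
            raise ValueError("SharedSection has fewer than three fields")
        fields[2] = str(new_heap_kb)
        return "SharedSection=" + ",".join(fields)

    new_value, count = re.subn(r"SharedSection=([\d,]+)", replace_third, value)
    if count != 1:
        raise ValueError("expected exactly one SharedSection entry")
    return new_value


if __name__ == "__main__":
    before = "ObjectDirectory=\\Windows SharedSection=1024,3072,512 Windows=On"
    print(set_desktop_heap(before, 1024))
```

Computing the string separately makes it easy to review the exact change before pasting it into regedt32.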

24
Procedures to change File Descriptors (contd)
  • AIX
  • ulimit -a will list the current values for the
    current userid.
  • The data will look something like the following
  • file(blocks) 100
  • data(kbytes) 523256
  • stack(kbytes) 512
  • coredump(blocks) 200
  • nofiles(descriptors) 64
  • memory(kbytes) unlimited
  • ulimit -n will set the current userid descriptors
    dynamically
  • (e.g. ulimit -n unlimited)
  • The oserv needs to be re-cycled (NOT reexec)
    before changes will take effect
  • ulimit -Sn will be the command to put in the
    system startup files to allow the value to be set
    to accommodate Tivoli RPC threads at boot time
    statically.
  • The operating system will need to be rebooted
    before changes will take effect.

25
Procedures to change File Descriptors (contd)
  • Solaris
  • ulimit -a will list the current values for the
    current userid.
  • The data will look something like the following
  • time(seconds) unlimited
  • file(blocks) 2097151
  • data(kbytes) 131072
  • stack(kbytes) 32768
  • memory(kbytes) 32768
  • coredump(blocks) 2097151
  • nofiles(descriptors) 2000
  • ulimit -n will set the current userid descriptors
    dynamically
  • (e.g. ulimit -n unlimited)
  • The oserv needs to be re-cycled (NOT reexec)
    before changes will take effect
  • ulimit -Sn will be the command to put in the
    system startup files to allow the value to be set
    to accommodate Tivoli RPC threads at boot time
    statically.
  • The operating system will need to be rebooted
    before changes will take effect.
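
As a rough check of the descriptor guidance above (at least twice the oserv RPC threads), the following Python sketch compares a soft nofiles limit with rpc_max_threads. The function names are hypothetical. The standard-library resource module used to read the current limit exists only on Unix-like systems, so it is imported lazily.

```python
def min_nofiles(rpc_max_threads: int) -> int:
    """Smallest descriptor limit that satisfies the 'twice the RPC threads' rule."""
    return 2 * rpc_max_threads


def nofiles_ok(soft_limit, rpc_max_threads: int) -> bool:
    """True if the soft limit is sufficient; None stands for 'unlimited'."""
    return soft_limit is None or soft_limit >= min_nofiles(rpc_max_threads)


def current_soft_limit():
    """Read this process's soft descriptor limit (Unix only)."""
    import resource  # not available on Windows

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    # 250 is the default rpc_max_threads quoted in this presentation.
    print(nofiles_ok(current_soft_limit(), 250))
```

With the sample values from these slides, a limit of 2000 descriptors passes for 250 RPC threads, while the AIX sample value of 64 does not.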

26
Oserv RPC Threads
  • Set rpc_max_threads as large as the OS will
    allow
  • (default is 250)
  • Viewing this parameter
  • odadmin get_rpc_max_threads
  • Setting this parameter
  • odadmin set_rpc_max_threads

27
RPC Threads and Jobq
  • Every gateway method is handled by a Gateway RPC
    thread, initially spawned via an oserv thread.
  • The Gateway RPC thread determines what type
    of thread will execute the method
  • Queued
  • Note: the valid range for max_concurrent_jobs on a
    gateway is 200-2500, and it cannot be greater than
    (rpc_maxthreads - 50)
  • To set the value
  • wgateway <gw_label> set_max_concurrent_jobs
    <value>
  • Non-Queued
  • Note: the valid range for rpc_maxthreads is 250-2000, and
    it cannot be less than (max_concurrent_jobs + 50)
  • Formula: rpc_maxthreads > max_concurrent_jobs +
    max_concurrent_logins + 20
  • To set the value
  • wgateway <gw_label> set_rpc_maxthreads <value>
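
The ranges and formula on this slide can be checked mechanically before running the wgateway commands. The Python sketch below is an illustration only. The function name is hypothetical, and the "+" and "-" operators are reconstructed from the slide wording ("cannot be greater than", "cannot be less than") because the transcript dropped them.

```python
def check_gateway_threads(rpc_maxthreads: int,
                          max_concurrent_jobs: int,
                          max_concurrent_logins: int) -> list:
    """Return a list of rule violations (empty means the settings are consistent)."""
    problems = []
    if not 200 <= max_concurrent_jobs <= 2500:
        problems.append("max_concurrent_jobs outside 200-2500")
    if not 250 <= rpc_maxthreads <= 2000:
        problems.append("rpc_maxthreads outside 250-2000")
    if max_concurrent_jobs > rpc_maxthreads - 50:
        problems.append("max_concurrent_jobs exceeds rpc_maxthreads - 50")
    if rpc_maxthreads <= max_concurrent_jobs + max_concurrent_logins + 20:
        problems.append(
            "rpc_maxthreads not above max_concurrent_jobs + max_concurrent_logins + 20"
        )
    return problems


if __name__ == "__main__":
    for problem in check_gateway_threads(250, 250, 50):
        print(problem)
```

An empty list means all four rules hold; each returned string names one rule that the proposed values break.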

28
Endpoint Manager Threads
  • Endpoint Manager Policy Threads
  • max_install: maximum number of concurrent
    allow_install_policy scripts
  • max_sgp: maximum number of concurrent
    select_gateway_policy scripts
  • max_after: maximum number of concurrent
    after_install_policy scripts
  • Viewing the Endpoint Manager RPC threads
  • wepmgr get max_epmgr_rpc_threads
  • Yields something like the following
  • max_install 10
  • max_sgp 10
  • max_after 10
  • login_interval 300
  • stanza_interval 720
  • max_iom_records 500
  • epmgr_flags 1
  • max_epmgr_rpc_threads 300
  • automigrate off
  • migrate_max 0
  • chk_cntl_chars 0
  • labelspace

29
Gw_tune.pl script
Usage: gw_tune.pl -h -r -s
Calculate the recommended values of RPC threads for the oserv, epmgr and
gateway. Additionally, calculate the max_concurrent_jobs setting for the
gateways. (Read the script header for methodology, or the IBM/Tivoli
Gateway Tuning Field Guide.) This script only works on the TMR it is being
run on. It should be run on each TMR in an environment.
  • No options: This message
  • -r: Calculate and show the recommendations ONLY.
  • -s: Calculate the recommendations AND actually set them in the TMR.
  • -h: This usage statement.
NOTE: This script can ask you to change a value that requires operating
system changes.
30
Error and Examples
  • Gateway Thread Usage errors in the gatelog
  • 2003/09/20 14:04:22 01 RPC Request rejected
    outstanding threads 250
  • Summary of Queued Methods
  • Downcalls to Endpoints
  • Excessive isolated logins
  • Delayed distributions
  • Summary of Non-Queued Methods
  • Note: the following methods (except for
    login_policy) are bound by an RPC thread limit of
    250 on the gateway.
  • The foremost method is the login_policy method
  • Deleting and adding endpoints uses RPC threads
  • Mass migrations of endpoints use up RPC threads
    on the old gateway.
  • Debug levels
  • Endpoints
  • Debug level 3 is typically the best level to
    troubleshoot effectively.
  • Debug level 4 will provide the necessary data to
    identify a network issue in addition to Tivoli
    issues.
  • Gateways
  • Debug level 6 is typically the best level to
    troubleshoot effectively.
  • Debug level 9 provides more granular errors for
    troubleshooting and diagnosis.
  • NOTE There is a policy script mechanism that I
    engineered to help with endpoint login and
    streamlining of endpoint maintenance.

31
The Tivoli Factors
  • Endpoint Instability
  • Gateway Instability/Unresponsiveness
  • Endpoint Manager Instability/Unresponsiveness
  • Oserv instability

32
The Tivoli Factors - Oserv Instability
  • Resource exhaustion of RPC threads
    (rpc_max_threads)
  • Limits
  • Windows (hard coded in TMF: 2038)
  • Unix (flavour specific, limited by the number of
    OS file descriptors)
  • Framework commands that contact all managed nodes
    (e.g. wchkdb, wgateway) will exhaust a poorly
    configured parameter
  • ITM's Request Manager uses an oserv thread for
    each endpoint downcall it manages.
  • request_manager.threads (the default is 10, which
    is usually too low)
  • Be cautious when changing.
  • The tec_gateway uses RPC threads to receive
    upcalls from endpoints.
  • Parameter GWThreadCount (tec_gateway max RPC
    threads): range 250-10000 (default 250)
  • Be very cautious when changing.

33
The Tivoli Factors - Oserv Instability
  • Disk/File System Space
  • DBDIR
  • Corrupt/inconsistent odb.bdb
  • Memory on AIX
  • Default maximum process size of 256MBytes
    (function of LDR_CNTRL and hard limit)
  • Memory leaks can cause core dumps when limit is
    reached
  • Transaction log size
  • odb.log

34
The Tivoli Factors - Endpoint Manager Instability
  • Resource Exhaustion of RPC threads
    (max_epmgr_rpc_threads)
  • Too many endpoint calls can exhaust this resource
    quickly
  • Logins (Isolation, Initial, Orphaned)
  • Migrations
  • Long running policy scripts and w-commands
  • The real story about max_sgp, max_allow and
    max_after
  • Hung/Hanging gateway methods
  • Restricted/Impacted by the availability of oserv
    RPC threads
  • So exercise some caution when changing

35
The Tivoli Factors - Gateway Instability
  • Resource Exhaustion of RPC threads
    (rpc_maxthreads)
  • Each gateway method uses an RPC thread until its
    type is determined.
  • Each MDist2 session uses an RPC thread (rpt <->
    app, rpt <-> rpt)
  • Restricted/Impacted by the availability of oserv
    RPC threads.
  • JOBQ threads (pthreads): max_concurrent_jobs,
    max_concurrent_logins
  • Too many upcalls can result in this resource
    being over extended
  • Results in requests being queued
  • Can result in isolation logins

36
The Tivoli Factors - Gateway Instability
  • Disk Space
  • Depot directory (rpt_dir)
  • depot and states directories share same file
    system
  • Corrupt/inconsistent MDist2.bdb in states
    directory
  • reconnect_thread overloaded
  • Single threaded
  • TCP Backlog full
  • TCP requests get a slow response
  • No failures from a framework perspective
  • Applications fail because of delay

37
The Tivoli Factors - Gateway workflow
38
The Tivoli Factors - Endpoint Instability
  • Gateway unavailable
  • Down
  • Results in an Isolation login to alternate
    gateway
  • Unmanageable until it completes login
  • Busy
  • Results in queued requests
  • Delays
  • Retries
  • Port saturation
  • Upcall and downcall clash.
  • ITM heartbeat is a downcall that had a large
    dependency list
  • Fixed in 5.1.1-ITM-FP05
  • Upcalls not possible during downcall.

39
Software Distribution
  • Slow reporting
  • Distributions and reporting use MDist2 sessions
  • Consider increasing sessions to ease contention
  • report_thread_limit controls the number of
    mdist2_result methods that run from SH on TMR
    server
  • When report_thread_limit is reached, SWDIST
    throws an exception and messages in queue go into
    INTERRUPTED state
  • Parameter conn_retry_interval is used to trigger
    reprocessing of messages
  • Default of 900 seconds too high. Decrease to 120
    seconds
  • no endpoints on gateway
  • customers using 20 seconds with good results

40
Distributed Monitoring
  • Too many upcalls, caused by too many monitors,
    saturate the reader_thread on the gateway
  • Symptoms
  • JOBQ_THREAD_EXCEPTION errors because listening
    port no longer available to reciprocating
    downcall.
  • Not many jobs in JOBQ
  • Upcall request delays seen in lcfd.log (no
    errors)

41
IBM Tivoli Monitoring
  • ITM Heartbeat downcall takes too long (fixed;
    just an example)
  • Upcalls on endpoints can't succeed because of
    downcalls
  • When an application upcall fails, it forces the
    endpoint to become isolated
  • Cascading effect of many isolation logins causes
    gateways to become unresponsive because they are
    busy
  • Incorrectly configured Web Health Console and
    Heartbeat
  • Too many downcalls generated

42
Inventory
  • Collection Table of Contents (CTOC) processing
  • Uses pthreads which are configured by
    max_input_thread and max_output_threads
  • Maximum of 100 input and output threads
    respectively
  • Each pthread opens an IOM channel to pickup
    datapack
  • same IOM channel used if datapack source is the
    same as previous connection
  • new IOM channel opened if datapack source is
    different from previous connection
  • Maximum of 250 concurrent IOM channels
  • RPC threads used to maintain IOM channels for
    upcall (CTOC) and downcall (datapack)
  • RIM connections
  • RDBMS deadlocks caused by too many connections

43
Best Practices - Stability
  • Monitor Disk/File System Space and Transaction
    Log Size
  • To avoid odb.bdb corruption and oserv instability
  • To avoid gateway distribution failures because of
    MDist2.bdb corruption
  • Don't stop oserv if odb.log is too large
  • INVESTIGATE!
  • Inventory
  • 1 RIM and 5 connections as a starting point
  • Increase connections and RIMs as performance
    requires
  • Software Distribution
  • 50,000 endpoints per distribution
  • 7,000-10,000 endpoints per activity plan
  • 5-10 activities per plan

44
ifs_ignore
  • Gateway will listen on one or all interfaces
  • Gateway listens to the same interfaces as the
    oserv based on the odadmin set_force_bind
    setting
  • Use one or all, no other solution
  • More flexibility and control
  • Can now select which interfaces you want the
    gateway to use
  • Was available in 3.7.1 and 4.1 as an idlattr
  • wgateway <gwlabel> set ifs_ignore <IP address>
  • wgateway <gwlabel> run ifs_ignore_remove <IP
    address>

45
mcache_bwcontrol
  • Method/dependency downloads were not bandwidth
    controlled
  • Method/dependency downloads flooded slow links
  • Preloading/prestaging of cache directory
    workarounds
  • Method/dependency downloads can now use mdist2
    bandwidth
  • wgateway <gwlabel> mcache_bwcontrol TRUE
  • Default is FALSE.

46
run sync_login_interval
  • Change to EPMGR login_interval
  • Gateway would use old value
  • Gateway loads login_interval at boot time
  • Hence a restart of gateway needed to effect
    changes
  • wgateway <gwlabel> run sync_login_interval
  • Gateway's login_interval is synchronized with
    EPMGR
  • No restart of gateway needed to effect change.

47
New Fixes/Features
  • In TMF 4.3.1 the RDBMS interface module (RIM)
  • uses ODBC to connect to a Microsoft SQL Server
    database rather than using a Microsoft SQL Server
    client. Therefore, you must create an ODBC data
    source to enable RIM to connect to a Microsoft
    SQL Server database. For more information about
    creating an ODBC data source, see the chapter
    about using RIM objects in the Tivoli Enterprise
    Installation Guide.

48
New Fixes/Features
  • Update JRE to 1.4.2 in TMF 4.3.1
  • As JRE 1.3.1 is going out of service and new
    platforms require JRE 1.4, TMF 4.3.1 will
    distribute the new JRE 1.4.2. TCM Java components
    such as Software Package Editor, APM, CCM, GUIs and
    ISMP will be rebuilt using the new JDK 1.4 and
    will work at runtime with JRE 1.4.2.

49
New Fixes/Features
  • The following new platforms are supported in
    4.3.1
  • AIX 6.1 (Server/Gateway)
  • Windows 2008 (Server/Gateway, ISMP)
  • Windows Vista EP (added Desktop formerly not
    supported due to JRE)
  • SLES 10 and RHEL 5 (EP, Server/Gateway, ISMP)
  • Linux PPC (Server/Gateway)
  • Solaris x86 (Server/Gateway)

50
New Fixes/Features
  • Manage 64 bit Windows OS in TMF 4.3.1
  • TCM/TMF are 32 bit applications, but may need to
    run 64 bit binaries or .vbs scripts and directly
    access the 64 bit registry and file system on
    Windows endpoints. Wrappers were produced that take
    as input the name of a script or binary and the
    required parameters, disable registry and file
    system redirection, run the binary or script, then
    restore the redirection.
  • (wrun-AMD64.exe and wrun-IA64.exe) under
    directory lcf_bundle.43100\bin\w32-ix86\tools

51
Oserv Tuning
  • Protect the oserv
  • Child process RPC threads must not be greater
    than 60% of oserv RPC threads

52
Gateway Tuning
  • Configure gateway to handle workload
  • Preliminary/TBD - max_concurrent_logins = number
    of endpoints that do regular logins (workstation
    machines, up to a maximum of 500)
  • Preliminary/TBD - max_concurrent_jobs = max
    (number of endpoints that are servers or have
    DM/ITM/TEC installed, Mdist2 sessions total)
  • Preliminary/TBD - rpc_maxthreads = max
    (max_concurrent_jobs, max_concurrent_logins) * 1.3
  • Child RPC threads < 60% of oserv RPC threads
  • Use logstatus to monitor workload (set
    logstatus_interval 300)
  • 750-1000 endpoints per gateway (Windows)
  • 1500-2000 endpoints per gateway (Unix)
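
The preliminary sizing rules on this slide reduce to a few lines of arithmetic, sketched below in Python. The function and argument names are hypothetical. The "=" and "* 1.3" operators are reconstructed from the slide because the transcript dropped them, and the result is rounded up to a whole thread using integer arithmetic.

```python
def preliminary_gateway_settings(workstation_endpoints: int,
                                 server_endpoints: int,
                                 mdist2_sessions: int) -> dict:
    """Apply the slide's Preliminary/TBD rules to one gateway."""
    # Endpoints doing regular logins, capped at 500.
    max_concurrent_logins = min(workstation_endpoints, 500)
    # Larger of: server/DM/ITM/TEC endpoints, or total MDist2 sessions.
    max_concurrent_jobs = max(server_endpoints, mdist2_sessions)
    # 1.3 times the larger of the two, rounded up without floating point.
    largest = max(max_concurrent_jobs, max_concurrent_logins)
    rpc_maxthreads = (largest * 13 + 9) // 10
    return {
        "max_concurrent_logins": max_concurrent_logins,
        "max_concurrent_jobs": max_concurrent_jobs,
        "rpc_maxthreads": rpc_maxthreads,
    }


if __name__ == "__main__":
    print(preliminary_gateway_settings(800, 120, 300))
```

These are starting values only. The slide marks them as preliminary, and results still need to be checked against the valid ranges for each gateway setting and monitored with logstatus.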

53
Gateway Tuning (cont.)
  • Memory
  • 10MB for gateway process
  • 3KB per endpoint
  • 48KB per thread (pthread and RPC)
  • 48KB per job in JobQ
  • Minimum requirement of 256MB for 1,000 endpoints
  • 350MB gateway process for 1,000 endpoints (stress
    test)
  • Function of Mdist2 mem_max
  • Recommendation 512MB per 1,000 endpoints
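
The per-item figures above (10MB base, 3KB per endpoint, 48KB per thread, 48KB per queued job) can be combined into a rough lower-bound estimate. In the Python sketch below the function name is hypothetical and 1MB is taken as 1024KB. The estimate leaves out the MDist2 mem_max contribution, which is why the slide's stress-test figure and recommendation are much larger.

```python
def estimate_gateway_memory_kb(endpoints: int, threads: int, queued_jobs: int) -> int:
    """Base gateway memory in KB from the per-item figures on the slide."""
    base_kb = 10 * 1024          # 10MB for the gateway process
    per_endpoint_kb = 3          # 3KB per endpoint
    per_thread_kb = 48           # 48KB per thread (pthread and RPC)
    per_job_kb = 48              # 48KB per job in the JobQ
    return (base_kb
            + per_endpoint_kb * endpoints
            + per_thread_kb * threads
            + per_job_kb * queued_jobs)


if __name__ == "__main__":
    kb = estimate_gateway_memory_kb(endpoints=1000, threads=650, queued_jobs=0)
    print(f"{kb} KB (about {kb / 1024:.1f} MB)")
```

For 1,000 endpoints and 650 threads the base figure is about 43MB, well below the 512MB per 1,000 endpoints recommended above, so the recommendation leaves ample headroom for MDist2 buffers and growth.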

54
Endpoint Manager Tuning
  • Beware of Endpoint Manager RPC thread usage
  • Increase max_epmgr_rpc_threads
  • Must increase the oserv's rpc_max_threads >
    (max_epmgr_rpc_threads/0.6)
  • Must increase OS file descriptor limit >
    rpc_max_threads
  • EPMGR
  • CLI: wepmgr login_limit <NUM>
  • Default is 80% of max_epmgr_rpc_threads
  • TBD/preliminary: recommended value is 60% of
    max_epmgr_rpc_threads
  • 0 disallows logins
  • Thread usage by login type (NOTE Not all of the
    threads are run simultaneously)
  • MIGRATION
  • Avoid migrating endpoints to specific gateways
  • Each migration command uses a total of 12 epmgr
    threads
  • Use gateway clouds/farms where possible

55
Endpoint Manager Tuning (cont.)
  • ISOLATION
  • 7 epmgr threads
  • Impatient endpoints exacerbate the problem
  • Make udp_interval/login_interval reasonable
    (default)
  • ORPHANED
  • 7 epmgr threads
  • Do something about these logins
  • INITIAL
  • 8 epmgr threads
  • Don't be too aggressive with roll-out of new
    endpoints
  • Thread usage of other operations
  • CLI: wep status uses 6 epmgr threads for
    security
  • CLI: wdelep uses 7 threads, plus one thread for
    each PM
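
The per-login thread counts quoted on the two Endpoint Manager Tuning slides make it possible to estimate worst-case epmgr thread demand. The Python sketch below is illustrative only: the function names are hypothetical, and because the slides note that not all threads run simultaneously, the totals are upper bounds.

```python
# epmgr threads used per operation, as quoted on the slides.
EPMGR_THREADS = {"migration": 12, "isolation": 7, "orphaned": 7, "initial": 8}


def epmgr_thread_demand(concurrent: dict) -> int:
    """Upper bound on epmgr threads for the given concurrent operations."""
    return sum(EPMGR_THREADS[kind] * count for kind, count in concurrent.items())


def recommended_login_limit(max_epmgr_rpc_threads: int) -> int:
    """60% of max_epmgr_rpc_threads (the slide's preliminary recommendation)."""
    return max_epmgr_rpc_threads * 60 // 100


def min_oserv_rpc_max_threads(max_epmgr_rpc_threads: int) -> int:
    """Smallest integer strictly greater than max_epmgr_rpc_threads / 0.6."""
    return max_epmgr_rpc_threads * 5 // 3 + 1


if __name__ == "__main__":
    demand = epmgr_thread_demand({"initial": 10, "isolation": 5})
    print(demand, recommended_login_limit(300), min_oserv_rpc_max_threads(300))
```

With max_epmgr_rpc_threads at 300, ten initial logins plus five isolation logins need at most 115 threads, inside a login limit of 180, and the oserv needs more than 500 RPC threads.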

56
Gateway Internals
  • Spawning new gateway methods
  • Oserv RPC thread
  • Gateway RPC thread
  • What type of method is it?
  • tmf_dispatch method
  • jobq method
  • tmf_dispatch methods
  • Use a gateway RPC thread
  • Use an oserv RPC thread
  • Methods that run against TME objects, e.g. epmgr

57
Gateway Internals
  • jobq methods
  • Jobs are placed in Job Queue waiting for
    available pthreads
  • Max_concurrent_jobs pthreads available for job
    processing
  • Methods that run on endpoints, e.g. upcalls

58
Workload/Workflow
  • Workload how much work is the gateway doing?
  • CLI
  • bash$ wgateway cayman_gw4110 logstatus
  • Status Data ------
  • Jobq_threads in queue: 0
  • Jobq_threads running: 0
  • Reader_threads in queue: 0
  • GW methods running: 1
  • Number of Endpoints: 2
  • Endpoints connected: 2

59
Workload/Workflow
  • Updated STATUS DATA in gatelog
  • STATUS DATA: jobqq=0 jobqr=3 redyq=0 gwmethods=0
    loginq=0 loginr=0 ephq=0 ephr=0 sendq=0 recvq=1
    high=0 med=0 low=0
  • THREAD USAGE
  • jobqq: JOBQ threads in queue
  • jobqr: JOBQ threads running
  • redyq: Reader_threads in queue
  • gwmethods: GW methods/RPC threads running
  • LOGIN QUEUE
  • loginq: login requests in login queue
  • loginr: logins running

60
Workload/Workflow
  • HEALTH CHECK
  • ephq: endpoint health checks in queue
  • ephr: endpoint health checks running (maximum of
    20)
  • MDIST2
  • sendq: jobs in send queue on gateway
  • recvq: jobs in receive queue on gateway
  • high: number of high-priority sessions in use on gateway
  • med: number of medium-priority sessions in use on gateway
  • low: number of low-priority sessions in use on gateway
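
For trend analysis, the STATUS DATA entries in the gatelog can be turned into numbers with a small parser. The Python sketch below is a hedged illustration: the function name is hypothetical, and because the exact separator between counter name and value is not certain from this transcript, the pattern accepts "name=value", "name value" and "namevalue".

```python
import re

# Counter meanings as listed on the Workload/Workflow slides.
STATUS_FIELDS = {
    "jobqq": "JOBQ threads in queue",
    "jobqr": "JOBQ threads running",
    "redyq": "Reader_threads in queue",
    "gwmethods": "GW methods/RPC threads running",
    "loginq": "login requests in login queue",
    "loginr": "logins running",
    "ephq": "endpoint health checks in queue",
    "ephr": "endpoint health checks running",
    "sendq": "jobs in send queue",
    "recvq": "jobs in receive queue",
}


def parse_status_data(line: str) -> dict:
    """Extract the counters that follow 'STATUS DATA' in one gatelog line."""
    _, marker, tail = line.partition("STATUS DATA")
    if not marker:
        raise ValueError("not a STATUS DATA line")
    pairs = re.findall(r"([a-z]+)\s*[=:]?\s*(\d+)", tail)
    return {name: int(value) for name, value in pairs}


if __name__ == "__main__":
    sample = "STATUS DATA: jobqq=0 jobqr=3 redyq=0 gwmethods=0 loginq=0 loginr=0"
    for name, value in parse_status_data(sample).items():
        print(f"{STATUS_FIELDS.get(name, name)}: {value}")
```

Logging these counters at each logstatus_interval gives a simple time series; sustained non-zero jobqq or loginq values suggest the gateway is queueing work.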

61
Impediments
  • TCP Backlog
  • listen(), handshaking, accept()
  • Contains the completed TCP connections
  • Contains the incomplete TCP connections (on
    Windows)
  • Windows has a backlog size of 128 (maximum)
  • Unix uses a syncache for incomplete connections
    (configurable)
  • TCP connection request flood
  • Backlog gets full
  • SYN,RST seen as connections are refused
  • Connection attempts never seen by gateway

62
Impediments
  • reconnect_thread
  • There is only one
  • High volume of TCP connections can overwhelm
    thread
  • Needs to accept() connections before TCP Backlog
    is full
  • Applications may trigger fail-over mechanisms due
    to delays
  • TMA does not record an error because upcall
    eventually works
  • Low number of jobs running and waiting in Job
    Queue

63
Tuning and Recommendations
  • Memory usage on AIX
  • AIX allocates 16 segments to each process
  • Each segment is 256MB
  • Process Private Data (heap, stack and user area)
    is one segment
  • LDR_CNTRL environment variable can be used to
    assign multiple data segments to a process
  • LDR_CNTRL=MAXDATA=0x80000000 (maximum of 8 data
    segments per process: 2GB)
  • Used in odadmin environ
  • http://nscp.upenn.edu/aix4.3html/aixprggd/genprogc/lrg_prg_support.htm
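
The MAXDATA value is simply the number of 256MB data segments expressed in bytes, which the Python sketch below works out. The function name is hypothetical. The 256MB segment size and the limit of 8 segments come from this slide.

```python
SEGMENT_BYTES = 256 * 1024 * 1024  # one AIX segment, per the slide


def ldr_cntrl_maxdata(segments: int) -> str:
    """Build the LDR_CNTRL setting for the requested number of data segments."""
    if not 1 <= segments <= 8:
        raise ValueError("the slide gives a maximum of 8 data segments per process")
    return f"LDR_CNTRL=MAXDATA=0x{segments * SEGMENT_BYTES:08X}"


if __name__ == "__main__":
    print(ldr_cntrl_maxdata(8))
```

Eight segments give 8 x 256MB = 2GB, which matches the 0x80000000 value quoted above.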

64
Tuning and Recommendation
  • TCP Backlog - revisited
  • AIX
  • /usr/sbin/no -o clean_partial_conns=1
  • This setting will instruct the kernel to randomly
    remove half-open sockets from the q0 queue to
    make room for new sockets.
  • /usr/sbin/no -a or no -o somaxconn
  • /usr/sbin/no -o somaxconn=NewValue (default
    1024)
  • Change takes effect immediately. Change is
    effective until next boot. Permanent change is
    made by adding the no command to /etc/rc.net.

65
Tuning and Recommendations
  • Solaris
  • /usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q
    1024
  • The q queue holds sockets awaiting an accept()
    call from the application.
  • /usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0
    2048
  • The q0 queue contains half-open sockets.

66
Tuning and Recommendations
  • Tru64 UNIX
  • /sbin/sysconfig -r socket sominconn=65535
  • The value of sominconn determines how many
    simultaneous incoming SYN packets can be handled
    by the system.
  • /sbin/sysconfig -r socket somaxconn=65535
  • The value of somaxconn sets the maximum number of
    pending TCP connections.
  • HP-UX
  • /usr/sbin/ndd -set tcp_syn_rcvd_max 1024
  • /usr/sbin/ndd -set tcp_conn_request_max 200

67
Tuning and Recommendations
  • Linux kernel 2.2
  • /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=1280
  • Increases the size of the socket queue
    (effectively, q0).
  • /sbin/sysctl -w net.ipv4.tcp_syncookies=1
  • Enables support for TCP SYN cookies, which
    mitigates the effectiveness of SYN floods.
  • Possible side effects - see RFC1323 and RFC2018
  • FreeBSD
  • sysctl -w kern.ipc.somaxconn=1024

68
Tuning and Recommendation
  • IRIX
  • The listen() queue is hardcoded to 32. However,
    the system actually enforces the limit of pending
    connections as ((3 * backlog) / 2) + 1. This
    yields a maximum backlog of 49 connections
  • Windows
  • NT fixed at 100
  • W2K fixed at 128
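
The IRIX figure of 49 pending connections follows from applying the slide's formula, ((3 x backlog) / 2) + 1, to the hardcoded backlog of 32. A minimal Python check, with a hypothetical function name and integer division assumed:

```python
def effective_backlog(backlog: int) -> int:
    """Pending-connection limit enforced for a given listen() backlog."""
    return (3 * backlog) // 2 + 1


if __name__ == "__main__":
    print(effective_backlog(32))
```

The same arithmetic shows how small these queues are compared with the Unix values of 1024 or more set on the earlier slides.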

69
Login Types
  • Login Message Types
  • LOGIN_EP LOGIN_FAIL
  • LOGIN_EP sub-types
  • LOGIN_INITIAL
  • LOGIN_NORMAL
  • LOGIN_ISOLATED
  • LOGIN_INFO
  • LOGIN_MIGRATING
  • LOGIN_TMR_REDIRECT

70
INITIAL login
  • INITIAL login
  • Endpoint sends a LOGIN_INITIAL to gateway
  • Gateway forwards packet to EPMGR
  • EPMGR processes login
  • allow_install_policy
  • should this endpoint be allowed access?
  • select_gateway_policy
  • where should it log in?
  • after_install_policy
  • now that it exists, now what?
  • Gateway passes back a LOGIN_INFO
  • Endpoint sends a LOGIN_NORMAL to assigned gateway
  • login_policy
  • now that you are available, what should you do?

71
ISOLATION login
  • ISOLATION login
  • Endpoint sends a LOGIN_ISOLATED when the
    LOGIN_NORMAL is not answered.
  • Gateways response is dependent on whether it
    owns the endpoint
  • Ownership uses key to decrypt packet
  • treated like LOGIN_NORMAL
  • No ownership (alternate gateway) packet sent to
    EPMGR
  • EPMGR uses key to decrypt packet
  • select_gateway_policy
  • Assigns endpoint to new gateway
  • Alternate (intercepting) gateway returns
    LOGIN_INFO to endpoint
  • Endpoint sends LOGIN_NORMAL to new gateway
  • login_policy
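
The ownership decision for a LOGIN_ISOLATED packet can be modelled as a single branch. A minimal sketch, assuming ownership is a simple endpoint-to-gateway mapping; the function, the mapping, and the select_gateway callable are hypothetical stand-ins, not Tivoli APIs:

```python
from typing import Callable, Dict, Tuple


def handle_login_isolated(
    endpoint: str,
    receiving_gateway: str,
    owners: Dict[str, str],
    select_gateway: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (who decrypts the packet, gateway the endpoint ends up on)."""
    if owners.get(endpoint) == receiving_gateway:
        # Owning gateway: decrypts with its key, treats it like LOGIN_NORMAL.
        return ("gateway", receiving_gateway)
    # Alternate gateway: EPMGR decrypts, runs select_gateway_policy and
    # assigns the endpoint to the new gateway (returned in LOGIN_INFO).
    new_gateway = select_gateway(endpoint)
    owners[endpoint] = new_gateway
    return ("EPMGR", new_gateway)


owners = {"ep1": "gw-a"}
print(handle_login_isolated("ep1", "gw-a", owners, lambda ep: "gw-b"))
print(handle_login_isolated("ep1", "gw-c", owners, lambda ep: "gw-b"))
```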

72
MIGRATORY login
  • MIGRATORY login
  • Endpoint sends LOGIN_NORMAL to gateway
  • No ownership, gateway forwards LOGIN_MIGRATING to
    EPMGR
  • CLI: wep migrate used
  • CLI: wep set gateway -e/-g was not used
  • EPMGR tells gateway where endpoint should be
    logged in
  • Gateway sends LOGIN_INFO to endpoint
  • Endpoint sends LOGIN_NORMAL to the migration
    gateway
  • login_policy

73
ORPHANED login
  • ORPHANED login
  • Endpoint sends LOGIN_NORMAL
  • Gateway forwards to EPMGR (not in epcache)
  • EPMGR can't find key (resource not found)
  • Gateway returns LOGIN_FAIL
  • Endpoint sends LOGIN_ISOLATED
  • EPMGR uses master key to decrypt packet
  • Checks region number for validity
  • EPMGR creates new account
  • Policy scripts executed as per INITIAL login
  • Gateway sends LOGIN_INFO to endpoint
  • Endpoint sends LOGIN_NORMAL to gateway
  • login_policy
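
Read together with the MIGRATORY slide, a LOGIN_NORMAL can take one of three paths depending on what the gateway and EPMGR know about the endpoint. A rough sketch of that routing, assuming the gateway's epcache and the EPMGR key store can each be treated as a set of endpoint labels; both sets and the function name are hypothetical:

```python
from typing import Set


def route_login_normal(endpoint: str, epcache: Set[str], epmgr_keys: Set[str]) -> str:
    """Classify a LOGIN_NORMAL as seen by the receiving gateway."""
    if endpoint in epcache:
        return "NORMAL"      # gateway owns the endpoint and handles the login
    if endpoint in epmgr_keys:
        return "MIGRATORY"   # forwarded to EPMGR, which names the right gateway
    return "ORPHANED"        # EPMGR cannot find the key: LOGIN_FAIL is returned


print(route_login_normal("ep1", {"ep1"}, {"ep1", "ep2"}))  # NORMAL
print(route_login_normal("ep2", {"ep1"}, {"ep1", "ep2"}))  # MIGRATORY
print(route_login_normal("ep9", {"ep1"}, {"ep1", "ep2"}))  # ORPHANED
```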

74
TMR Redirection
  • TMR Redirection
  • Endpoint sends LOGIN_INITIAL to gateway
  • Gateway forwards login to EPMGR
  • EPMGR processes login
  • allow_install_policy
  • select_gateway_policy
  • Passes back a gateway in another TMR region
  • EPMGR understands that this is a redirection
  • Gateway passes new gateway in LOGIN_INFO to
    endpoint
  • Endpoint does a LOGIN_TMR_REDIRECT to new gateway

75
Health Check
  • A health check, rather than a heart beat.
  • Framework based.
  • No products needed
  • Configurable
  • wgateway <gwlabel> epcheck_interval
  • Executes every epcheck_interval seconds (default
    3600)
  • wgateway <gwlabel> epcheck_atboot TRUE|FALSE
  • Check the endpoints at boot time (default FALSE)
  • CLI: wepstatus <eplabel>
  • Reads the results from the gateway's health check
    cache
  • Does not contact the endpoint
  • CLI: wep <eplabel> status 1
  • Initiates intrinsic health check method on single
    endpoint
  • Gateway's health check cache updated with new
    result
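
The difference between the two CLI calls is that one only reads the cache and the other runs a check and refreshes it. A minimal model of that behaviour, assuming the cache is a per-endpoint dictionary; the class, its method names, and the probe callable are illustrative and do not reflect the gateway's internals:

```python
from typing import Callable, Dict


class HealthCheckCache:
    def __init__(self, probe: Callable[[str], str]) -> None:
        self._probe = probe              # stands in for the intrinsic health check method
        self._results: Dict[str, str] = {}

    def read_cached(self, endpoint: str) -> str:
        """Like wepstatus: cache lookup only, the endpoint is not contacted."""
        return self._results.get(endpoint, "unknown")

    def check_now(self, endpoint: str) -> str:
        """Like wep <eplabel> status 1: run the check, then update the cache."""
        result = self._probe(endpoint)
        self._results[endpoint] = result
        return result


contacted = []
cache = HealthCheckCache(probe=lambda ep: contacted.append(ep) or "connected")
print(cache.read_cached("ep1"))  # unknown, and no probe was issued
print(cache.check_now("ep1"))    # connected
print(cache.read_cached("ep1"))  # connected, served from the cache
```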

76
What does it do?
  • Purpose: to establish that downcalls and upcalls
    will work
  • Runs as intrinsic method on LCFD
  • Creates a temporary file in the temp directory,
    writes to it, then deletes it.
  • Checks the amount of temp space; it must be more
    than diag_temp_space.
  • Creates a temporary file in the cache root
    directory (LCF_DATDIR/cache), writes to it, then
    deletes it.
  • Checks free space on cache file system/volume.
  • Generates tmersrvd token and spawns test process.
  • Generates BuiltinNTAdministrator token and spawns
    test process
  • Not performed on Netware, OS/2 and OS/400
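
The file and disk-space portions of the check are easy to reproduce by hand when troubleshooting. A rough Python equivalent, assuming the threshold is given in bytes (the unit of diag_temp_space is not stated here); the function name and messages are made up for illustration, and the token and process-spawn tests are not modelled:

```python
import shutil
import tempfile
from typing import List


def check_directory(path: str, min_free_bytes: int) -> List[str]:
    """Write and delete a scratch file in path, then check free space."""
    problems = []
    try:
        # The file is removed automatically when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=path) as scratch:
            scratch.write(b"health check")
            scratch.flush()
    except OSError:
        problems.append("cannot create temporary file")
    try:
        if shutil.disk_usage(path).free < min_free_bytes:
            problems.append("insufficient disk space")
    except OSError:
        problems.append("cannot read free space")
    return problems


print(check_directory(tempfile.gettempdir(), min_free_bytes=1024))
```

The same function can be pointed at the cache root directory to mirror the second pair of tests.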

77
Results
  • Connected
  • The endpoint has passed all the health check
    tests. Downcalls, method and task executions, and
    upcalls are successful.
  • Disconnected
  • The endpoint sends a logout message to the
    gateway when it logs out. The gateway will not
    attempt any downcalls to the endpoint until its
    state changes.
  • Unavailable
  • Gateway cannot communicate with endpoint. The
    important point here is that the lcfd process
    accepted the connection from the gateway, but the
    method execution failed.

78
Results
  • Unreachable
  • Gateway cannot communicate with endpoint. In this
    case the lcfd process did not accept the
    connection from the gateway. This could mean that
    the lcfd process is not running, that the network
    has failed, or that the process is in an
    unrecoverable state.
  • Unknown
  • Gateway that is hosting endpoint is down, or the
    gateway has no health check status of the
    endpoint yet.
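
The five result states can be read as a decision chain. The sketch below is one plausible ordering of the checks, not the gateway's actual logic; the boolean inputs are hypothetical names for the conditions described in the result slides:

```python
def endpoint_status(
    gateway_up: bool,
    has_result: bool,
    logged_out: bool,
    connection_accepted: bool,
    method_ok: bool,
) -> str:
    """Map the observed conditions onto a health check result."""
    if not gateway_up or not has_result:
        return "Unknown"        # gateway down, or no status recorded yet
    if logged_out:
        return "Disconnected"   # endpoint sent a logout message
    if not connection_accepted:
        return "Unreachable"    # lcfd did not accept the connection
    if not method_ok:
        return "Unavailable"    # connection accepted, method execution failed
    return "Connected"


print(endpoint_status(True, True, False, True, False))  # Unavailable
```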

79
Results
  • Endpoint Health Status Error Codes
  • 0 - File permission error - cannot create
    temporary file.
  • 1 - Insufficient disk space in temporary
    directory.
  • 2 - File permission error - cannot create/update
    method cache.
  • 3 - Insufficient disk space in LCF_CACHEDIR.
  • 4 - Cannot generate token for tmersrvd account.
  • 5 - Cannot generate token for builtin
    administrator account.
  • 6 - Cannot spawn a process.
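
When scripting around health check output, the codes can be kept in a small lookup table. A sketch only: the dictionary and helper names are illustrative, and the fallback text for unlisted codes is an assumption.

```python
HEALTH_STATUS_ERRORS = {
    0: "File permission error - cannot create temporary file",
    1: "Insufficient disk space in temporary directory",
    2: "File permission error - cannot create/update method cache",
    3: "Insufficient disk space in LCF_CACHEDIR",
    4: "Cannot generate token for tmersrvd account",
    5: "Cannot generate token for builtin administrator account",
    6: "Cannot spawn a process",
}


def describe_health_error(code: int) -> str:
    """Return the description for a code, or a placeholder if it is not listed."""
    return HEALTH_STATUS_ERRORS.get(code, f"Unlisted error code {code}")


print(describe_health_error(3))
```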

80
How long does a login take?
  • A monitoring thread manages login queue
  • Number of endpoint login jobs in run state
  • Number of endpoint login jobs in the queue
  • Rolling average time taken by a typical initial
    or normal login
  • Initial login (in this context)
  • Any login that is sent to the EPMGR
  • INITIAL, ISOLATED, ORPHANED
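
The bookkeeping done by the monitoring thread can be pictured as two rolling windows of login durations. A minimal sketch, assuming a fixed-size window; the window size is not stated in the slides, and the class and attribute names are hypothetical:

```python
from collections import deque

# Logins that are sent to the EPMGR count as "initial" in this context.
EPMGR_LOGINS = {"INITIAL", "ISOLATED", "ORPHANED"}


class LoginMonitor:
    def __init__(self, window: int = 50) -> None:
        self.running = 0   # endpoint login jobs in run state
        self.queued = 0    # endpoint login jobs waiting in the queue
        self._samples = {
            "initial": deque(maxlen=window),
            "normal": deque(maxlen=window),
        }

    def record(self, login_type: str, seconds: float) -> None:
        kind = "initial" if login_type in EPMGR_LOGINS else "normal"
        self._samples[kind].append(seconds)  # oldest sample drops out when full

    def rolling_average(self, kind: str) -> float:
        samples = self._samples[kind]
        return sum(samples) / len(samples) if samples else 0.0


monitor = LoginMonitor(window=2)
for login_type, seconds in [("INITIAL", 10.0), ("ORPHANED", 2.0), ("ISOLATED", 4.0)]:
    monitor.record(login_type, seconds)
monitor.record("NORMAL", 1.0)
print(monitor.rolling_average("initial"))  # 3.0, only the last two samples count
```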

81
Questions?